## **Implementing WaveNet: A Generative Model for Raw Audio (Adapted for Word Generation)**

* WaveNet is originally a **generative model for raw audio**.
  In this notebook, I will adapt its underlying **architecture** for our **word generation model**, since the core design principles are similar.

* The goal is to **improve our baseline model’s architecture** by aligning it more closely with the **WaveNet-inspired approach** described in the WaveNet paper (https://arxiv.org/pdf/1609.03499).

---

### **Baseline Model (Previous Architecture)**

```python
model_2 = Sequential([
    Embeddings(vocab_size, n_embeddings),
    Flatten(),
    Linear(in_features, out_features),  BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, vocab_size)
])
```

---

### **Upgrading Towards a WaveNet-like Model**

In this notebook, I will move beyond the **basic Neural Network + Batch Normalization setup** and introduce a **WaveNet-inspired architecture**.

Unlike the baseline model, where the whole context is flattened and squashed through the first layer in a single step, the WaveNet approach lets features be **progressively compressed and transformed** through **causal, dilated convolutions**, capturing **hierarchical patterns** in a more structured way.
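
To make "causal and dilated" concrete, here is a minimal sketch of one such convolution built on `torch.nn.functional.conv1d`. The helper name `causal_dilated_conv1d`, the tensor sizes, and the kernel size of 2 are assumptions chosen for illustration; this is not the layer used later in the notebook.

```python
import torch
import torch.nn.functional as F

def causal_dilated_conv1d(x, weight, dilation):
    """x: (batch, channels_in, time); weight: (channels_out, channels_in, kernel_size)."""
    kernel_size = weight.shape[-1]
    # Pad only on the left so that output[t] never sees inputs after time t
    pad = (kernel_size - 1) * dilation
    x = F.pad(x, (pad, 0))
    return F.conv1d(x, weight, dilation=dilation)

torch.manual_seed(0)
x = torch.randn(1, 4, 8)   # 8 time steps, 4 input channels (illustrative sizes)
w = torch.randn(6, 4, 2)   # 6 output channels, kernel size 2 (assumed)
y = causal_dilated_conv1d(x, w, dilation=2)
print(y.shape)             # torch.Size([1, 6, 8])

# Causality check: perturbing the last time step leaves all earlier outputs unchanged
x_changed = x.clone()
x_changed[..., -1] += 1.0
y_changed = causal_dilated_conv1d(x_changed, w, dilation=2)
print(torch.allclose(y[..., :-1], y_changed[..., :-1]))   # True
```

With `dilation=2` and a kernel of size 2, each output step combines the inputs at `t` and `t - 2`, which is how stacked layers can skip over positions while keeping the output the same length as the input.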

---

### **Advantages of WaveNet over Plain Neural Networks**

1. **Better Long-Term Dependency Modeling**

   * WaveNet’s **dilated convolutions** can capture patterns over much longer contexts without requiring extremely deep layers.

2. **Smoother Feature Extraction**

   * Instead of forcing representations to collapse quickly, WaveNet progressively refines them, leading to more stable and expressive outputs.

3. **Improved Generative Quality**

   * The autoregressive setup enables WaveNet to generate highly realistic sequences (in audio or text), compared to the sometimes rigid outputs of standard feedforward networks.

4. **Scalability**

   * Training is easier to parallelize than with recurrent models such as RNNs or LSTMs, because the convolutions process all time steps at once while still capturing temporal dependencies.
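
The first point can be checked with a little arithmetic. The sketch below computes the receptive field of a stack of causal convolutions; the function name `receptive_field` and the dilation schedules are assumptions used only for illustration.

```python
def receptive_field(kernel_size, dilations):
    """Number of input steps that one output step of the stack can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Three layers with doubling dilation already cover a context of 8 steps
print(receptive_field(2, [1, 2, 4]))   # 8

# Three undilated layers of the same kernel size only cover 4 steps
print(receptive_field(2, [1, 1, 1]))   # 4

# Without dilation, covering 8 steps takes 7 layers
print(receptive_field(2, [1] * 7))     # 8
```

The context grows exponentially with depth when the dilation doubles per layer, but only linearly without dilation.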

In [1]:
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F

# Loading dataset

In [2]:
# Load all the words from the '.txt' file
words = open('names.txt', mode = 'r', encoding='utf-8').read().splitlines()
words[:10]

# Encoder and Decoder
chars = sorted(list(set(''.join(words))))
stoi = {c:i+1 for i, c in enumerate(chars)}
stoi['.'] = 0
itos = {i:c for c, i in stoi.items()}

# Generate train, test and validation Dataset
def generate_dataset(words, block_size):
    x, y = [], []
    for w in words:
        # print(w)
        context = [0] * block_size
        for ch in w + '.':
            idx = stoi[ch]
            x.append(context)
            y.append(idx)
            # print(f"{''.join([itos[i] for i in context])} --> {itos[idx]}")
            context = context[1:] + [idx]
    x, y = torch.tensor(x), torch.tensor(y)
    return x, y

def get_split(data, train_split: float, test_split: float, val_split: float, block_size: int):
    import math
    import random
    random.seed(42)

    # Compare with a tolerance: a sum of float fractions can miss 1 by a rounding error
    if not math.isclose(train_split + test_split + val_split, 1.0):
        raise ValueError("All splits must sum to 100% of the data")

    data = list(data)  # shuffle a copy so the caller's list is not reordered in place
    random.shuffle(data)
    n1 = int(train_split * len(data))
    n2 = int((train_split + val_split) * len(data))
    x_train, y_train = generate_dataset(data[:n1], block_size)
    x_val, y_val = generate_dataset(data[n1:n2], block_size)
    x_test, y_test = generate_dataset(data[n2:], block_size)

    return x_train, y_train, x_val, y_val, x_test, y_test

x_train, y_train, x_val, y_val, x_test, y_test = get_split(data = words, train_split = 0.8, test_split = 0.1, val_split = 0.1, block_size = 8)
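
To see what the sliding context window produces, the sketch below replays the same loop on a single word. The helper `context_windows` and the three-character vocabulary are hypothetical and exist only for this illustration; a `block_size` of 3 is used to keep the output short.

```python
def context_windows(word, block_size, stoi):
    """Return (context, target) pairs for one word, padding the start with index 0."""
    context = [0] * block_size
    pairs = []
    for ch in word + '.':
        idx = stoi[ch]
        pairs.append((list(context), idx))
        context = context[1:] + [idx]   # drop the oldest index, append the newest
    return pairs

toy_stoi = {'.': 0, 'a': 1, 'n': 2}
toy_itos = {i: c for c, i in toy_stoi.items()}

pairs = context_windows('ana', block_size=3, stoi=toy_stoi)
for context, target in pairs:
    print(''.join(toy_itos[i] for i in context), '-->', toy_itos[target])
# ... --> a
# ..a --> n
# .an --> a
# ana --> .
```

Each word of length `n` contributes `n + 1` training examples, the last one teaching the model to emit the end-of-word token.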

## Defining the Framework

In [3]:
class Embeddings:
    def __init__(self, embd_dim, n_classes):
        self.embd_dim = embd_dim
        self.n_classes = n_classes
        self.embd_matrix = torch.nn.Parameter(torch.randn(self.n_classes, self.embd_dim, requires_grad=True))
    
    def forward(self, x):
        return self.embd_matrix[x]
    
    def parameters(self):
        return [self.embd_matrix]
    
    def __call__(self, x):
        return self.forward(x)
    
    def to(self, device):
        self.embd_matrix = torch.nn.Parameter(self.embd_matrix.to(device))
        return self


class Linear:
    def __init__(self, *, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features
        self.weight = torch.nn.Parameter(torch.randn(in_features, out_features))
        self.bias = torch.nn.Parameter(torch.randn(out_features,))
    
    def forward(self, x):
        return x @ self.weight + self.bias
    
    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return [self.weight] + [self.bias]
    
    def to(self, device):
        self.weight = torch.nn.Parameter(self.weight.to(device))
        self.bias = torch.nn.Parameter(self.bias.to(device))
        return self


class Flatten:
    def forward(self, x):
        return x.view(x.shape[0], -1)
    
    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return []
    
    def to(self, device):
        return self


class Sequential:
    def __init__(self, layers):
        self.layers = layers
        # Model Configuration
        self.embd_dim = None
        self.n_classes = None
        self.in_features = None
        self.out_features = None
        for layer in self.layers:
            if isinstance(layer, Embeddings):
                self.embd_dim = layer.embd_dim
                self.n_classes = layer.n_classes
                break
        for layer in self.layers:
            if isinstance(layer, Linear):
                self.in_features = layer.in_features
                self.out_features = layer.out_features
                break
    
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    
    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
    
    def to(self, device):
        for layer in self.layers:
            layer.to(device)
        return self


class Tanh:
    def forward(self, x):
        return torch.tanh(x)
    
    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return []
    
    def to(self, device):
        return self

class ReLU:
    def forward(self, x):
        return torch.relu(x)
    
    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return []
    
    def to(self, device):
        return self


class Softmax:
    def forward(self, logits):
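        # Subtract the row-wise max before exp(): the result is unchanged, but large logits no longer overflow
        shifted = logits - logits.max(dim = 1, keepdim = True).values
        exp_logits = torch.exp(shifted)
        probs = exp_logits / exp_logits.sum(dim = 1, keepdim = True)
        return probs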
    
    def __call__(self, logits):
        return self.forward(logits)
    
    def parameters(self):
        return []
    
    def to(self, device):
        return self


class CrossEntropyLoss:
    def forward(self, logits, y_true):
        softmax = Softmax()
        probs = softmax(logits)
        loss = -(probs[torch.arange(0, len(probs)), y_true].log().mean())
        return loss
    
    def __call__(self, logits, y_true):
        return self.forward(logits, y_true)
    
    def parameters(self):
        return []

class LayerNorm1d:
    def __init__(self, in_features, eps = 1e-8):
        self.in_features = in_features
        self.gamma = torch.nn.Parameter(torch.ones(1, self.in_features))
        self.beta = torch.nn.Parameter(torch.zeros(1, self.in_features))
        self.eps = eps
    
    def forward(self, x):
        mean = x.mean(dim = -1, keepdim  = True)
        std = x.std(dim = -1, keepdim  = True)
        x_norm = ((x - mean) / (std + self.eps))
        return x_norm * self.gamma + self.beta # `*` is element-wise (Hadamard) multiplication
    
    def parameters(self):
        return [self.gamma] + [self.beta]
    
    def to(self, device):
        self.gamma = torch.nn.Parameter(self.gamma.to(device))
        self.beta = torch.nn.Parameter(self.beta.to(device))
        return self
    
    def __call__(self, x):
        return self.forward(x)
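
A hand-written softmax and cross-entropy pair is easy to sanity-check against PyTorch's built-in `F.cross_entropy`. The sketch below does that and also shows why subtracting the row-wise maximum before exponentiating matters; `stable_cross_entropy` is a hypothetical helper written for this check, not part of the framework above.

```python
import torch
import torch.nn.functional as F

def stable_cross_entropy(logits, y_true):
    # Shifting by the row-wise max leaves the softmax unchanged but keeps exp() finite
    shifted = logits - logits.max(dim=1, keepdim=True).values
    log_probs = shifted - shifted.exp().sum(dim=1, keepdim=True).log()
    return -log_probs[torch.arange(len(logits)), y_true].mean()

torch.manual_seed(0)
logits = torch.randn(4, 27)
y_true = torch.tensor([0, 5, 13, 26])

ours = stable_cross_entropy(logits, y_true)
reference = F.cross_entropy(logits, y_true)
print(torch.allclose(ours, reference, atol=1e-6))   # True

# With very large logits, a naive exp() overflows to inf and inf / inf gives nan
big = logits + 1000.0
naive_probs = torch.exp(big) / torch.exp(big).sum(dim=1, keepdim=True)
print(torch.isnan(naive_probs).any().item())                      # True
print(torch.isfinite(stable_cross_entropy(big, y_true)).item())   # True
```

Adding a constant to every logit in a row does not change the softmax, so the shifted version returns the same loss for `big` as for `logits`.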

## Class for registering the Model

In [4]:
class RegisterModel:
    def __init__(self, model_name, model_version, model, device):
        self.model_name = model_name
        self.model_version = model_version
        self.model = model
        self.layers = self.model.layers
        self.embd_dim = self.model.embd_dim
        self.n_classes = self.model.n_classes
        self.in_features = self.model.in_features
        self.out_features = self.model.out_features
        self.loss_fn = CrossEntropyLoss()
        self.device = device
        self.to(device)
        # Collect parameters after the move: each layer's .to() re-wraps its tensors in new Parameters
        self.parameters = self.model.parameters()
        self.n_parameters = sum([p.nelement() for p in self.parameters])
        print(f"Model registered with {self.n_parameters} Parameters")
    
    def train(self, x, y, epochs, lr):
        batch_size = 512
        self.summary(epochs, lr)
        print("-" * 100)

        initial_lr = lr
        
        for epoch in range(epochs):
            # Learning rate decay
            lr = initial_lr * (0.95 ** (epoch // 200))

            total_loss = 0
            
            for i in range(0, len(x), batch_size):
                xb = x[i:i + batch_size]
                yb = y[i:i + batch_size]

                logits = self.model(xb)
                loss = self.loss_fn(logits=logits, y_true=yb)
                total_loss += loss.item()

                # Reset gradients
                for p in self.model.parameters():
                    p.grad = None

                loss.backward()

                # Manual SGD update
                for p in self.model.parameters():
                    p.data -= lr * p.grad

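            if epoch % 50 == 0:
                # Ceil division: the last, possibly smaller, batch also contributes to total_loss
                n_batches = (len(x) + batch_size - 1) // batch_size
                avg_loss = total_loss / n_batches
                print(f"Epoch {epoch}/{epochs} | Loss: {avg_loss:.4f} | lr={lr:.5f}")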

                
    def to(self, device):
        self.model.to(device)
        return self
    
    def __call__(self, x, y, epochs, lr):
        return self.train(x, y, epochs, lr)
    
    def summary(self, epochs, lr):
        print(f"Training {self.model_name} | {self.model_version} | Epochs = {epochs} | lr = {lr} | device = {self.device}")

        print("\n" + "="*90)
        print("Training Configuration Summary")
        print("="*90)
        print(f"Model Name        : {self.model_name}")
        print(f"Model Version     : {self.model_version}")
        print(f"Device            : {self.device}")
        print(f"Epochs            : {epochs}")
        print(f"Learning Rate     : {lr}")
        print("-"*90)
        print("Model Hyperparameters:")
        print(f"  ├─ Embedding Dimension : {self.embd_dim}")
        print(f"  ├─ Number of Classes   : {self.n_classes}")
        print(f"  ├─ Input Dimension     : {self.in_features}")
        print(f"  └─ Hidden Dimension    : {self.out_features}")
        print("-"*90)

        print("Model Architecture:")
        print("-" * 60)
        for layer in self.layers:
            layer_name = layer.__class__.__name__
            print(f"  └── {layer_name}()")
        print("-" * 60)

        print(f"Total Trainable Parameters : {self.n_parameters:,}")
        print("="*90 + "\n")
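
The learning-rate decay inside `train` is a step schedule: the rate is multiplied by 0.95 once every 200 epochs. The sketch below isolates that formula; the function name `step_decay` and its keyword arguments are assumptions introduced for the example.

```python
def step_decay(initial_lr, epoch, factor=0.95, every=200):
    """Same formula as in train(): lr = initial_lr * factor ** (epoch // every)."""
    return initial_lr * (factor ** (epoch // every))

schedule = {epoch: round(step_decay(0.01, epoch), 6) for epoch in (0, 99, 200, 400)}
for epoch, lr in schedule.items():
    print(epoch, lr)
# 0 0.01
# 99 0.01
# 200 0.0095
# 400 0.009025
```

Because `epoch // 200` stays at 0 for the first 200 epochs, a run with `epochs = 100` trains at the initial learning rate throughout.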

In [5]:
# Define the architecture of the neural network
n_classes = 27
embd_dim = 2
block_size = 8
in_features = block_size * embd_dim
out_features = 8
softmax = Softmax()
loss_fn = CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

MLP_v1 = Sequential([
    Embeddings(n_classes = n_classes, embd_dim = embd_dim),
    Flatten(),
    Linear(in_features = in_features, out_features = out_features),
    Tanh(),
    Linear(in_features = out_features, out_features = out_features),
    Tanh(),
    Linear(in_features = out_features, out_features = n_classes)
])

MLP_v2 = Sequential([
    Embeddings(n_classes = n_classes, embd_dim = embd_dim),
    Flatten(),
    Linear(in_features = in_features, out_features = out_features),
    LayerNorm1d(in_features = out_features),
    Tanh(),
    Linear(in_features = out_features, out_features = out_features),
    LayerNorm1d(in_features = out_features),
    Tanh(),
    Linear(in_features = out_features, out_features = n_classes)
])

MLP_v1.to(device)
MLP_v2.to(device)

x_train = x_train.to(device)
y_train = y_train.to(device)

# Debug
# embeddings = Embeddings(n_classes = n_classes, embd_dim = embd_dim).to(device)
# print(f"x_train.shape = {x_train.shape}")
# x_enc = embeddings(x_train)
# print(f"x_enc.shape = {x_enc.shape}")
# flatten_layer = Flatten().to(device)
# x_enc_flatten = flatten_layer(x_enc)
# print(f"x_enc_flatten.shape = {x_enc_flatten.shape}")

In [6]:
MultiLayerPerceptronModel_v1 = RegisterModel(model = MLP_v1, model_name = "MultiLayerPerceptronModel_v1", model_version = "version-1", device = device)

Model registered with 505 Parameters


In [7]:
MultiLayerPerceptronModel_v1(x = x_train, y = y_train, epochs = 100, lr = 0.01)

Training MultiLayerPerceptronModel_v1 | version-1 | Epochs = 100 | lr = 0.01 | device = cuda

Training Configuration Summary
Model Name        : MultiLayerPerceptronModel_v1
Model Version     : version-1
Device            : cuda
Epochs            : 100
Learning Rate     : 0.01
------------------------------------------------------------------------------------------
Model Hyperparameters:
  ├─ Embedding Dimension : 2
  ├─ Number of Classes   : 27
  ├─ Input Dimension     : 16
  └─ Hidden Dimension    : 8
------------------------------------------------------------------------------------------
Model Architecture:
------------------------------------------------------------
  └── Embeddings()
  └── Flatten()
  └── Linear()
  └── Tanh()
  └── Linear()
  └── Tanh()
  └── Linear()
------------------------------------------------------------
Total Trainable Parameters : 505

----------------------------------------------------------------------------------------------------
Epoch 0/100 | Lo

In [8]:
MultiLayerPerceptronModel_v2 = RegisterModel(model = MLP_v2, model_name = "MultiLayerPerceptronModel_v2", model_version = "version-2", device = device)

Model registered with 537 Parameters
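
Both reported parameter counts can be reproduced by hand from the hyperparameters defined earlier. The sketch below is plain arithmetic; `linear_params` is a hypothetical helper, and `hidden` stands for the `out_features` value of 8.

```python
n_classes, embd_dim, block_size, hidden = 27, 2, 8, 8

def linear_params(n_in, n_out):
    """Weights plus biases of one Linear layer."""
    return n_in * n_out + n_out

embedding = n_classes * embd_dim                          # 27 * 2 = 54
mlp_v1 = (embedding
          + linear_params(block_size * embd_dim, hidden)  # 16 * 8 + 8 = 136
          + linear_params(hidden, hidden)                 # 8 * 8 + 8 = 72
          + linear_params(hidden, n_classes))             # 8 * 27 + 27 = 243

layer_norm = 2 * hidden                                   # gamma and beta: 16 per layer
mlp_v2 = mlp_v1 + 2 * layer_norm                          # two LayerNorm1d layers

print(mlp_v1, mlp_v2)   # 505 537
```

The two `LayerNorm1d` layers account for the whole difference of 32 parameters between the models.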


In [9]:
MultiLayerPerceptronModel_v2(x = x_train, y = y_train, epochs = 100, lr = 0.01)

Training MultiLayerPerceptronModel_v2 | version-2 | Epochs = 100 | lr = 0.01 | device = cuda

Training Configuration Summary
Model Name        : MultiLayerPerceptronModel_v2
Model Version     : version-2
Device            : cuda
Epochs            : 100
Learning Rate     : 0.01
------------------------------------------------------------------------------------------
Model Hyperparameters:
  ├─ Embedding Dimension : 2
  ├─ Number of Classes   : 27
  ├─ Input Dimension     : 16
  └─ Hidden Dimension    : 8
------------------------------------------------------------------------------------------
Model Architecture:
------------------------------------------------------------
  └── Embeddings()
  └── Flatten()
  └── Linear()
  └── LayerNorm1d()
  └── Tanh()
  └── Linear()
  └── LayerNorm1d()
  └── Tanh()
  └── Linear()
------------------------------------------------------------
Total Trainable Parameters : 537

-----------------------------------------------------------------------------

## Sample from the Model

In [10]:
def generate_names(count: int, model, block_size: int):
    names = []

    for _ in range(count):
        out = []
        context = [0] * block_size
        
        while True:
            x = torch.tensor([context]).to(device)
            logits = model(x)

            # The model returns one row of logits for the current context; convert it to probabilities
            probs = softmax(logits)

            idx = torch.multinomial(probs, num_samples=1).item()
            context = context[1:] + [idx]

            if idx == 0:  # end-of-word token
                break

            out.append(idx)

        # Decode indices → characters
        name = ''.join(itos[i] for i in out).capitalize()
        names.append(name)

    return names
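
Since sampling is random, it can help to pass an explicit `torch.Generator` so that runs are repeatable. The sketch below shows the same autoregressive loop with a seeded generator; `sample_indices`, the `max_len` safety cap, and the uniform stand-in model are assumptions made for this example and are not part of the notebook's code.

```python
import torch

def sample_indices(model, block_size, generator, max_len=20):
    """Feed the last block_size indices to the model, sample the next index, repeat."""
    context, out = [0] * block_size, []
    while len(out) < max_len:           # assumed cap so an untrained model cannot loop forever
        logits = model(torch.tensor([context]))
        probs = torch.softmax(logits, dim=1)
        idx = torch.multinomial(probs, num_samples=1, generator=generator).item()
        if idx == 0:                    # end-of-word token
            break
        out.append(idx)
        context = context[1:] + [idx]   # slide the window forward by one position
    return out

# Hypothetical stand-in for a trained model: equal logits for all 27 classes
def uniform_model(x):
    return torch.zeros(x.shape[0], 27)

first = sample_indices(uniform_model, 8, torch.Generator().manual_seed(42))
second = sample_indices(uniform_model, 8, torch.Generator().manual_seed(42))
print(first == second)   # True: the same seed reproduces the same sample
```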

In [16]:
names_v1 = generate_names(count = 5, model = MLP_v1, block_size = 8)
names_v2 = generate_names(count = 5, model = MLP_v2, block_size = 8)
print(f"Names Generated from Model-1: {names_v1}")
print(f"Names Generated from Model-2: {names_v2}")

Names Generated from Model-1: ['Eomae', 'Erditen', 'Csortsney', 'Aaenzo', 'Slrre']
Names Generated from Model-2: ['Yrcevsa', 'Broka', 'Zelarav', 'Ali', 'Arl']


In [17]:
embeddings = Embeddings(n_classes = n_classes, embd_dim = embd_dim).to(device)
flatten = Flatten()
layer_1 = Linear(in_features = in_features, out_features = out_features).to(device)
layer_2 = Linear(in_features = out_features, out_features = out_features).to(device)
layer_3 = Linear(in_features = out_features, out_features = out_features).to(device)
layer_4 = Linear(in_features = out_features, out_features = n_classes).to(device)
tanh = Tanh()  # stateless, so one instance is reused after every hidden layer
softmax = Softmax()
loss_fn = CrossEntropyLoss()
eps = 1e-6

In [63]:
def forward_pass(x):
    x = layer_1(x)
    x = tanh(x)
    x = layer_2(x)
    x = tanh(x)
    x = layer_3(x)
    x = tanh(x)
    x = layer_4(x)
    return x

In [58]:
x_example = torch.randint(1, 27, size = [1, 8])
x_example

tensor([[10, 17, 24,  7, 10, 11,  7, 10]])

In [59]:
embeddings(x_example).shape

torch.Size([1, 8, 2])

In [61]:
flatten_example = flatten(embeddings(x_example))
flatten_example.shape

torch.Size([1, 16])

In [78]:
(forward_pass(flatten_example)).shape

torch.Size([1, 27])

In [83]:
(torch.randn(1, 2, 3, 16) @ torch.randn(16, 27) + torch.randn(1, 27)).shape

torch.Size([1, 2, 3, 27])
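
The shape check above shows that a matrix multiplication only acts on the last dimension and treats every leading dimension as a batch. That property is what makes hierarchical fusion possible: neighbouring positions can be concatenated in pairs and passed through a linear map, layer by layer. The sketch below is one possible way to do this; `flatten_consecutive` and the weight shapes are hypothetical and not the notebook's final architecture.

```python
import torch

def flatten_consecutive(x, n):
    """(B, T, C) -> (B, T // n, C * n): concatenate every n neighbouring positions."""
    B, T, C = x.shape
    return x.view(B, T // n, C * n)

torch.manual_seed(0)
x = torch.randn(1, 8, 2)   # (batch, block_size, embd_dim)
w1 = torch.randn(4, 8)     # fuses pairs of 2-dim embeddings into 8 features
w2 = torch.randn(16, 8)    # fuses pairs of 8-dim features
w3 = torch.randn(16, 8)

h1 = torch.tanh(flatten_consecutive(x, 2) @ w1)    # 8 positions -> 4
h2 = torch.tanh(flatten_consecutive(h1, 2) @ w2)   # 4 positions -> 2
h3 = torch.tanh(flatten_consecutive(h2, 2) @ w3)   # 2 positions -> 1
print(h1.shape, h2.shape, h3.shape)
# torch.Size([1, 4, 8]) torch.Size([1, 2, 8]) torch.Size([1, 1, 8])
```

Instead of squashing all 8 characters in one layer, the context is halved at each level, which mirrors the tree-like structure of a stack of dilated convolutions.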

## Loss Analysis

I evaluated the impact of increasing the `block_size` from **3 → 8** on both a Plain MLP and a BatchNorm-augmented MLP.

| Model             | Block Size | Train Loss | Test Loss | Val Loss |
| ----------------- | ---------- | ---------- | --------- | -------- |
| **Plain MLP**     | 3          | 2.0001     | 2.0939    | 2.1003   |
| **BatchNorm MLP** | 3          | 2.0006     | 2.1040    | 2.1081   |
| **Plain MLP**     | 8          | 1.750      | 2.051     | 2.0525   |
| **BatchNorm MLP** | 8          | 1.8535     | 2.0409    | 2.0408   |

---

**Observation & Interpretation:**

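* Both models reach a lower loss on every split when the block size increases from 3 to 8.
* **Plain MLP** shows the largest drop in training loss (2.0001 → 1.750), but its validation loss improves only modestly (2.1003 → 2.0525). The gap between training and validation loss widens, which points to more overfitting, not less.
* **BatchNorm MLP** shows a smaller drop in training loss (2.0006 → 1.8535) but the larger drop in validation loss (2.1081 → 2.0408), and it reaches the best validation loss overall.
* The results indicate that a larger block size lets the model **capture longer-range dependencies** in the data. The widening train/validation gap suggests the extra context is mostly being memorized, which motivates a more structured architecture.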
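
The comparison can be made explicit by computing the differences directly from the table. The sketch below uses only the train and validation numbers reported above; the dictionary layout and the `summary` name are illustrative choices.

```python
losses = {
    ("Plain MLP", 3): {"train": 2.0001, "val": 2.1003},
    ("Plain MLP", 8): {"train": 1.750, "val": 2.0525},
    ("BatchNorm MLP", 3): {"train": 2.0006, "val": 2.1081},
    ("BatchNorm MLP", 8): {"train": 1.8535, "val": 2.0408},
}

summary = {}
for model in ("Plain MLP", "BatchNorm MLP"):
    small, large = losses[(model, 3)], losses[(model, 8)]
    summary[model] = {
        "train_drop": round(small["train"] - large["train"], 4),
        "val_drop": round(small["val"] - large["val"], 4),
        "gap_at_8": round(large["val"] - large["train"], 4),   # validation minus training loss
    }
    print(model, summary[model])
# Plain MLP {'train_drop': 0.2501, 'val_drop': 0.0478, 'gap_at_8': 0.3025}
# BatchNorm MLP {'train_drop': 0.1471, 'val_drop': 0.0673, 'gap_at_8': 0.1873}
```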

## Summary

* To be added after implementing the WaveNet architecture.