## **Implementing WaveNet: A Generative Model for Raw Audio (Adapted for Word Generation)**

* WaveNet is originally a **generative model for raw audio**.
  In this notebook, I will adapt its underlying **architecture** for our **word generation model**, since the core design principles are similar.

* The goal is to **improve our baseline model’s architecture** by aligning it more closely with the **WaveNet-inspired approach** described in the WaveNet paper (https://arxiv.org/pdf/1609.03499).

---

### **Baseline Model (Previous Architecture)**

```python
model_2 = Sequential([
    Embeddings(vocab_size, n_embeddings),
    Flatten(),
    Linear(in_features, out_features),  BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, vocab_size)
])
```

---

### **Upgrading Towards a WaveNet-like Model**

In this notebook, I will move beyond the **basic Neural Network + Batch Normalization setup** and introduce a **WaveNet-inspired architecture**.

Unlike the baseline model, where `Flatten` concatenates all context embeddings and the first `Linear` layer squashes them **in a single step**, the WaveNet approach allows features to be **progressively compressed and transformed** through **causal and dilated convolutions**, capturing **hierarchical patterns** in a smoother, more structured manner.
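
The contrast between the two fusion strategies can be sketched with plain reshapes. This is only an illustration under assumed sizes (4 examples, 8 context characters, 10-dimensional embeddings); a real WaveNet-style model would place a learned layer after each merge, which is omitted here.

```python
import torch

# Assumed sizes for illustration: 4 examples, 8 context characters,
# 10-dimensional embeddings.
batch, block_size, n_emb = 4, 8, 10
x = torch.randn(batch, block_size, n_emb)

# Baseline: Flatten merges all 8 characters in a single step.
flat = x.view(batch, -1)
print(flat.shape)  # torch.Size([4, 80])

# Hierarchical alternative: merge neighbouring pairs, one level at a time.
h = x
shapes = []
while h.shape[1] > 1:
    b, t, c = h.shape
    h = h.view(b, t // 2, 2 * c)  # concatenate each pair of adjacent positions
    shapes.append(tuple(h.shape))
print(shapes)  # [(4, 4, 20), (4, 2, 40), (4, 1, 80)]
```

With 8 characters, three pairwise merges are enough to combine the whole context, so information from distant characters meets only in the deeper levels.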

---

### **Advantages of WaveNet over Plain Neural Networks**

1. **Better Long-Term Dependency Modeling**

   * WaveNet’s **dilated convolutions** can capture patterns over much longer contexts without requiring extremely deep layers.

2. **Smoother Feature Extraction**

   * Instead of forcing representations to collapse quickly, WaveNet progressively refines them, leading to more stable and expressive outputs.

3. **Improved Generative Quality**

   * The autoregressive setup enables WaveNet to generate highly realistic sequences (in audio or text), compared to the sometimes rigid outputs of standard feedforward networks.

4. **Scalability**

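   * Convolutions process every time step of a training sequence in parallel, which makes training easier to parallelize than recurrent models such as RNNs or LSTMs while still capturing temporal dependencies. Generation remains sequential because the model is autoregressive.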
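
The claim about long contexts in point 1 can be made concrete with a little arithmetic. For a stack of causal convolutions with kernel size `k` and dilations `d_1, ..., d_n`, one output position sees `1 + sum((k - 1) * d_i)` input positions. The helper below is a hypothetical illustration, not part of this notebook's framework.

```python
from typing import Sequence

def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input positions seen by one output of a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Three layers with doubling dilation cover a context of 8 characters.
print(receptive_field(2, [1, 2, 4]))  # 8

# The same three layers without dilation cover only 4.
print(receptive_field(2, [1, 1, 1]))  # 4

# Ten layers with dilations 1, 2, ..., 512 cover 1024 positions.
print(receptive_field(2, [2**i for i in range(10)]))  # 1024
```

The receptive field grows exponentially with depth when dilations double, but only linearly when they stay at 1.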

In [1]:
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F

# Loading dataset

In [2]:
# Load all the words from the '.txt' file (the context manager closes the file afterwards)
with open('names.txt', mode = 'r', encoding='utf-8') as f:
    words = f.read().splitlines()
words[:10]

# Encoder and Decoder
chars = sorted(list(set(''.join(words))))
stoi = {c:i+1 for i, c in enumerate(chars)}
stoi['.'] = 0
itos = {i:c for c, i in stoi.items()}

# Generate train, test and validation Dataset
def generate_dataset(words, block_size):
    x, y = [], []
    for w in words:
        # print(w)
        context = [0] * block_size
        for ch in w + '.':
            idx = stoi[ch]
            x.append(context)
            y.append(idx)
            # print(f"{''.join([itos[i] for i in context])} --> {itos[idx]}")
            context = context[1:] + [idx]
    x, y = torch.tensor(x), torch.tensor(y)
    return x, y

def get_split(data, train_split: float, test_split: float, val_split: float, block_size: int):
    import math
    import random

    # Compare with a tolerance: floating-point sums are not always exactly 1
    if not math.isclose(train_split + test_split + val_split, 1.0):
        raise ValueError("All splits must sum to 100% of the data")

    # Shuffle a copy with a seeded generator so the caller's list is left untouched
    data = list(data)
    random.Random(42).shuffle(data)

    n1 = int(train_split * len(data))
    n2 = int((train_split + val_split) * len(data))
    x_train, y_train = generate_dataset(data[:n1], block_size)
    x_val, y_val = generate_dataset(data[n1:n2], block_size)
    x_test, y_test = generate_dataset(data[n2:], block_size)

    return x_train, y_train, x_val, y_val, x_test, y_test

x_train, y_train, x_val, y_val, x_test, y_test = get_split(data = words, train_split = 0.8, test_split = 0.1, val_split = 0.1, block_size = 8)
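
To make the sliding-window logic of `generate_dataset` easier to follow, here is a small string-based sketch of the same idea. The function name `context_windows` and the example word are assumptions for illustration; the notebook itself works with integer indices rather than characters.

```python
def context_windows(word: str, block_size: int) -> list:
    """Return (context, target) pairs for one word, padding the start with '.'."""
    pairs = []
    context = "." * block_size
    for ch in word + ".":
        pairs.append((context, ch))
        context = context[1:] + ch  # drop the oldest character, append the new one
    return pairs

for ctx, target in context_windows("emma", block_size=3):
    print(f"{ctx} --> {target}")
# ... --> e
# ..e --> m
# .em --> m
# emm --> a
# mma --> .
```

Each word of length `n` yields `n + 1` training examples, the last of which teaches the model when to stop.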

## Defining the Framework

In [40]:
class Linear:
    def __init__(self, in_features, out_features, bias: bool = True):
        self.weight = torch.nn.Parameter(torch.empty(in_features, out_features))
        torch.nn.init.xavier_uniform_(self.weight)
        self.bias = torch.nn.Parameter(torch.zeros(out_features)) if bias else None
        self.output = None
    
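    def forward(self, x):
        # Apply the affine map; skip the bias term when the layer was built with bias=False
        self.output = x @ self.weight
        if self.bias is not None:
            self.output = self.output + self.bias
        return self.output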
    
    def __call__(self, x): 
        return self.forward(x)
    
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class BatchNorm1d:
    def __init__(self, in_features, training: bool = True, momentum = 0.1, eps = 1e-05):
        self.in_features = in_features
        self.training = training
        self.gamma = torch.nn.Parameter(torch.ones(1, self.in_features), requires_grad = True)
        self.beta = torch.nn.Parameter(torch.zeros(1, self.in_features), requires_grad = True)
        self.running_mean = torch.zeros(1, self.in_features)
        self.running_var = torch.ones(1, self.in_features)
        self.momentum = momentum
        self.eps = eps
        self.output = None
    
    def forward(self, x):
        if self.training:
            batch_mean = x.mean(dim = 0, keepdim = True)
            batch_var = x.var(dim = 0, keepdim = True, unbiased = False)

        else: 
            batch_mean = self.running_mean 
            batch_var = self.running_var
        
        x_hat = (x - batch_mean) / torch.sqrt(batch_var + self.eps)
        self.output = x_hat * self.gamma + self.beta

        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * batch_mean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * batch_var

        return self.output
    
    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return [self.gamma, self.beta]


class Tanh:
    def __call__(self, input):
        self.output = torch.tanh(input)
        return self.output
    
    def parameters(self):
        return []

class Embeddings:
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features
        self.weight = torch.nn.Parameter(torch.randn(in_features, out_features))
        torch.nn.init.xavier_uniform_(self.weight)
        self.output = None
    
    def forward(self, x):
        self.output = self.weight[x]
        return self.output

    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return [self.weight]

class Flatten:
    def forward(self, x):
        self.output = x.view(x.shape[0], -1)
        return self.output

    def __call__(self, x):
        return self.forward(x)
    
    def parameters(self):
        return []

class Sequential:
    def __init__(self, layers):
        self.layers = layers
    
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def __call__(self, x):
        return self.forward(x)

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
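
The running statistics in `BatchNorm1d` are exponential moving averages, so they need a number of training steps before they approximate the true batch statistics. The scalar sketch below uses the same update rule and default momentum; the constant batch mean of 5.0 is a made-up value for illustration.

```python
momentum = 0.1       # same default as the BatchNorm1d class
running_mean = 0.0   # initial value, as in the class
batch_mean = 5.0     # hypothetical constant batch statistic

history = []
for step in range(50):
    # Same update rule as BatchNorm1d.forward in training mode
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    history.append(running_mean)

print(round(history[0], 4))   # 0.5
print(round(history[-1], 4))  # 4.9742
```

After one step the estimate has moved only 10% of the way to the batch value, which is why evaluating in eval mode after very few training steps can give misleading results.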

## Class for registering the Model

In [41]:
class MLPModel:
    def __init__(self, model):
        self.model = model
        self.layers = self.model.layers
        self.parameters = model.parameters()
        self.n_parameters = sum([p.nelement() for p in self.parameters])
        self.model_type = self.check_model()
        print(f"{self.model_type} registered with Learnable Parameters: {self.n_parameters}")
    
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    
    def __call__(self, x):
        return self.forward(x)
    
    def generate_names(self, num_names: int = 5, block_size: int = 8):
        print("-" * 40)
        print(f"Generating names from {self.model_type}")
        g = torch.Generator().manual_seed(42)

        # Sampling must use the running statistics, so switch BatchNorm to eval mode once
        for layer in self.layers:
            if isinstance(layer, BatchNorm1d):
                layer.training = False

        for _ in range(num_names):
            out = []
            context = [0] * block_size
            while True:
                x = torch.tensor([context])
                with torch.no_grad():  # no gradients are needed while sampling
                    for layer in self.layers:
                        x = layer(x)

                probs = F.softmax(x, dim = 1)
                idx = torch.multinomial(probs, num_samples=1, generator=g).item()
                context = context[1:] + [idx]  # slide the context window
                out.append(idx)
                if idx == 0:  # index 0 is the end-of-word token '.'
                    break

            print(''.join(itos[i] for i in out))
    
    def train_model(self, lr: float = 0.01, epochs: int = 200000):
        print("-" * 40)
        print(f"Training {self.model_type} | Epochs: {epochs} | lr: {lr}")
        print("-" * 40)
        
        for i in range(epochs):
            # mini-batch processing
            rand_idx = torch.randint(0, x_train.shape[0], (32,))
            
            x = x_train[rand_idx]
            for layer in self.layers:
                x = layer(x)

            loss = F.cross_entropy(x, y_train[rand_idx])

            # Backward pass
            for p in self.parameters:
                p.grad = None

            loss.backward()

            for p in self.parameters:
                p.data -= lr * p.grad
            
            if i % 10000 == 0:
                print(f"{i} / {epochs} Loss: {loss}")
            
            # break
    
    def check_model(self):
        self.model_type = "Plain_MLP_Model"
        for layer in self.layers:
            if isinstance(layer, BatchNorm1d):
                self.model_type = "BatchNorm_MLP_Model"
                break
        return self.model_type
    
    # Evaluate the loss on the requested split ('train', 'test', or 'val')
    def eval_loss(self, split):
        if split == "train":
            x_data, y_data = x_train, y_train
        elif split == "test":
            x_data, y_data = x_test, y_test
        elif split == "val":
            x_data, y_data = x_val, y_val
        else:
            raise ValueError("split must be 'train', 'test', or 'val'")

        x = x_data
        # Switch every BatchNorm layer to eval mode (no early exit, so none is skipped)
        for layer in self.layers:
            if isinstance(layer, BatchNorm1d):
                layer.training = False

        for layer in self.layers:
            x = layer(x)

        loss_val = F.cross_entropy(x, y_data)
        print(f"Loss on {split} split = {loss_val}")
        return loss_val

In [5]:
g = torch.Generator().manual_seed(42)
n_embedings = 10
vocab_size = len(chars) + 1
block_size = 8
in_features = n_embedings * block_size
out_features = 200

model_1 = Sequential([
    Embeddings(vocab_size, n_embedings),
    Flatten(),
    Linear(in_features, out_features), Tanh(),
    Linear(out_features, out_features), Tanh(),
    Linear(out_features, vocab_size)
])

model_2 = Sequential([
    Embeddings(vocab_size, n_embedings),
    Flatten(),
    Linear(in_features, out_features),  BatchNorm1d(out_features), Tanh(),
    Linear(out_features, out_features), BatchNorm1d(out_features), Tanh(),
    Linear(out_features, vocab_size)
])


# Register the Model for further tracking
model_1 = MLPModel(model = model_1)
model_2 = MLPModel(model = model_2)

Plain_MLP_Model registered with Learnable Parameters: 62097
BatchNorm_MLP_Model registered with Learnable Parameters: 62897
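
The two parameter counts printed above can be checked by hand. The sketch below assumes `vocab_size = 27`, which matches the final output shape `[4, 27]` reported later in this notebook; `linear_params` is a helper introduced only for this check.

```python
vocab_size, n_emb, block_size, hidden = 27, 10, 8, 200

def linear_params(fan_in: int, fan_out: int) -> int:
    return fan_in * fan_out + fan_out  # weight matrix plus bias vector

plain = (
    vocab_size * n_emb                            # embedding table
    + linear_params(n_emb * block_size, hidden)   # first hidden layer
    + linear_params(hidden, hidden)               # second hidden layer
    + linear_params(hidden, vocab_size)           # output layer
)
# Each of the two BatchNorm1d layers adds gamma and beta, i.e. 2 * hidden parameters
batchnorm = plain + 2 * (2 * hidden)

print(plain, batchnorm)  # 62097 62897
```

The BatchNorm model is larger by exactly 800 parameters, all of them scale and shift terms.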


In [6]:
# Train the Models
model_1.train_model()
model_2.train_model(lr = 0.05)

----------------------------------------
Training Plain_MLP_Model | Epochs: 200000 | lr: 0.01
----------------------------------------
0 / 200000 Loss: 3.3664844036102295
10000 / 200000 Loss: 1.8311465978622437
20000 / 200000 Loss: 2.044177532196045
30000 / 200000 Loss: 2.006683588027954
40000 / 200000 Loss: 2.6584677696228027
50000 / 200000 Loss: 2.1751649379730225
60000 / 200000 Loss: 2.027599334716797
70000 / 200000 Loss: 1.8963265419006348
80000 / 200000 Loss: 2.0012118816375732
90000 / 200000 Loss: 2.311673879623413
100000 / 200000 Loss: 1.9995720386505127
110000 / 200000 Loss: 1.9229329824447632
120000 / 200000 Loss: 2.4603238105773926
130000 / 200000 Loss: 1.8924949169158936
140000 / 200000 Loss: 2.2731924057006836
150000 / 200000 Loss: 2.0405640602111816
160000 / 200000 Loss: 2.2669055461883545
170000 / 200000 Loss: 2.355252742767334
180000 / 200000 Loss: 1.8750427961349487
190000 / 200000 Loss: 1.8019914627075195
----------------------------------------
Training BatchNorm_MLP_
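
A quick sanity check on the first logged loss: a model that assigns equal probability to every character has a cross-entropy of `ln(vocab_size)`. The sketch assumes `vocab_size = 27`, as implied by the output shapes in this notebook.

```python
import math

vocab_size = 27
uniform_loss = -math.log(1 / vocab_size)
print(round(uniform_loss, 4))  # 3.2958
```

The loss of about 3.37 logged at step 0 is close to this value, which suggests the initial predictions are nearly uniform and the initialization is reasonable.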

In [7]:
# generate some words from different Models
model_1.generate_names(num_names = 8, block_size = block_size)
model_2.generate_names(num_names = 8, block_size = block_size)

----------------------------------------
Generating names from Plain_MLP_Model
yessy.
havilin.
dlagkin.
zainaya.
tryvie.
chen.
emberly.
milah.
----------------------------------------
Generating names from BatchNorm_MLP_Model
yeosyah.
marie.
daxen.
sadee.
jyanna.
amerie.
caena.
dayson.


In [8]:
# Evaluate the train, test, and validation loss of both trained models
print("Plain_MLP_Model")
model_1.eval_loss('train')
model_1.eval_loss('test')
model_1.eval_loss('val')
print("-" * 40)
print("BatchNorm_MLP_Model")
model_2.eval_loss('train')
model_2.eval_loss('test')
model_2.eval_loss('val')

Plain_MLP_Model
Loss on train split = 1.9853750467300415
Loss on test split = 2.0512566566467285
Loss on val split = 2.052518129348755
----------------------------------------
BatchNorm_MLP_Model
Loss on train split = 1.8535659313201904
Loss on test split = 2.0409789085388184
Loss on val split = 2.0408637523651123


tensor(2.0409, grad_fn=<NllLossBackward0>)

In [None]:
def get_class_name(obj: object) -> str:
    return str(type(obj).__name__)

def get_shape_description(model: MLPModel) -> None:
    # Sample a small batch of 4 contexts; the lower bound is 0 so index 0 can be drawn too
    rand_idx = torch.randint(0, x_train.shape[0], size = (4, ))
    x = x_train[rand_idx]

    # Pass the batch through the model layer by layer and report each output shape
    print(f"{get_class_name(model.layers[0])}_input.shape = {x.shape}")
    for layer in model.layers:
        x = layer(x)
        print(f"output.shape after passing from {get_class_name(layer)}_layer : {x.shape}")

In [95]:
print("Model = Plain_MLP_Model")
print("-" * 80)
get_shape_description(model = model_1)
print("-" * 80)
print("Model = BatchNorm_MLP_Model")
get_shape_description(model = model_2)
print("-" * 80)

Model = Plain_MLP_Model
--------------------------------------------------------------------------------
Embeddings_input.shape = torch.Size([4, 8])
output.shape after passing from Embeddings_layer : torch.Size([4, 8, 10])
output.shape after passing from Flatten_layer : torch.Size([4, 80])
output.shape after passing from Linear_layer : torch.Size([4, 200])
output.shape after passing from Tanh_layer : torch.Size([4, 200])
output.shape after passing from Linear_layer : torch.Size([4, 200])
output.shape after passing from Tanh_layer : torch.Size([4, 200])
output.shape after passing from Linear_layer : torch.Size([4, 27])
--------------------------------------------------------------------------------
Model = BatchNorm_MLP_Model
Embeddings_input.shape = torch.Size([4, 8])
output.shape after passing from Embeddings_layer : torch.Size([4, 8, 10])
output.shape after passing from Flatten_layer : torch.Size([4, 80])
output.shape after passing from Linear_layer : torch.Size([4, 200])
output.shap

## Loss Analysis

I evaluated the impact of increasing the `block_size` from **3 → 8** on both a Plain MLP and a BatchNorm-augmented MLP.

| Model             | Block Size | Train Loss | Test Loss | Val Loss |
| ----------------- | ---------- | ---------- | --------- | -------- |
| **Plain MLP**     | 3          | 2.0001     | 2.0939    | 2.1003   |
| **BatchNorm MLP** | 3          | 2.0006     | 2.1040    | 2.1081   |
| **Plain MLP**     | 8          | 1.9853     | 2.0512    | 2.0525   |
| **BatchNorm MLP** | 8          | 1.8535     | 2.0409    | 2.0408   |

---

**Observation & Interpretation:**

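* Both models reach lower test and validation loss when the block size is increased from 3 to 8.
* **BatchNorm MLP** shows the larger validation improvement (2.1081 → 2.0408, about 0.067), compared with about 0.048 for the **Plain MLP** (2.1003 → 2.0525).
* At block size 8 the two models are close on held-out data (validation loss 2.0408 vs 2.0525), so BatchNorm gives only a small edge at this depth. The two models were also trained with different learning rates (0.05 vs 0.01), so the comparison is not fully controlled.
* The results suggest that a larger block size lets the model **capture longer-range dependencies** in the data. They do not show reduced overfitting: the train/validation gap of the BatchNorm MLP widens from about 0.11 to about 0.19.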
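
The validation figures in the table can be compared directly with a few lines of arithmetic. The numbers below are copied from the table; nothing is re-measured.

```python
# Validation loss at block_size 3 and block_size 8, copied from the table
val = {
    "Plain MLP": (2.1003, 2.0525),
    "BatchNorm MLP": (2.1081, 2.0408),
}
improvement = {name: bs3 - bs8 for name, (bs3, bs8) in val.items()}
for name, delta in improvement.items():
    print(f"{name}: validation loss improves by {delta:.4f}")
# Plain MLP: validation loss improves by 0.0478
# BatchNorm MLP: validation loss improves by 0.0673

# Train/validation gap of the BatchNorm MLP at each block size
gap_bs3 = 2.1081 - 2.0006
gap_bs8 = 2.0408 - 1.8535
print(f"{gap_bs3:.4f} {gap_bs8:.4f}")  # 0.1075 0.1873
```

These are single training runs, so differences of a few hundredths should be read as indicative rather than conclusive.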

## Summary

* To be added after implementing the WaveNet architecture.