# Lecture 5: Convolutional Neural Networks

In this lecture, we introduce Convolutional Neural Networks (CNNs).

CNNs are widely used in image recognition tasks, but they also apply to other domains such as Natural Language Processing (NLP) and speech recognition. Here we focus on the NLP application and build a simplified version of WaveNet.

CNN papers:
- LeNet: [LeCun et al. 1998](http://vision.stanford.edu/cs598_spring07/papers/Lecun98.pdf)
- AlexNet: [Krizhevsky et al. 2012](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)
- WaveNet: [van den Oord et al. 2016](https://arxiv.org/pdf/1609.03499)

## Importing libraries

In [1]:
import os
import math
import itertools
from dataclasses import dataclass
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn import functional as F
import wandb
from src.utils import load_text, set_seed, configure_device

## Configuration

In [2]:
@dataclass
class CNNConfig:
    root_dir: str = os.getcwd() + "/../../"
    dataset_path: str = "data/names.txt"
    device: torch.device = torch.device('cpu')  # Placeholder; overwritten by configure_device() in the Device section

    # Tokenizer
    vocab_size: int = 0  # Set later

    # Model
    context_size: int = 16  # Increase the context size to 16
    d_embed: int = 8
    d_hidden: int = 64

    # Training
    val_size: float = 0.1
    batch_size: int = 32
    max_steps: int = 10000
    lr: float = 0.01
    val_interval: int = 100
    log_interval: int = 100

    seed: int = 101

## Weights & Biases

In [3]:
wandb.login(key=os.environ.get("WANDB_API_KEY"))
wandb.init(
    project="lecture-05",
    dir=CNNConfig.root_dir
)

[34m[1mwandb[0m: Currently logged in as: [33mpathfinderkr[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


## Reproducibility

In [4]:
set_seed(CNNConfig.seed)

Random seed set to 101


## Device

In [5]:
CNNConfig.device = configure_device()

Running on mps


## Dataset

In [6]:
# Load text and split by lines
names = load_text(CNNConfig.root_dir + CNNConfig.dataset_path).splitlines()

Loaded text data from /Users/pathfinder/Documents/GitHub/LLM101/notebooks/Lectures/../../data/names.txt (length: 228145 characters).


## Tokenizer

In [7]:
chars = [chr(i) for i in range(97, 123)]  # all alphabet characters
chars.insert(0, ".")  # Add special token
CNNConfig.vocab_size = len(chars)
str2idx = {char: idx for idx, char in enumerate(chars)}
idx2str = {idx: char for char, idx in str2idx.items()}
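
As a quick check of the character-level tokenizer, the sketch below round-trips a name through `str2idx` and `idx2str`. The `encode` and `decode` helpers are hypothetical names introduced only for this illustration; they are not part of the lecture code.

```python
# Rebuild the vocabulary exactly as in the tokenizer cell above
chars = [chr(i) for i in range(97, 123)]  # 'a' .. 'z'
chars.insert(0, ".")  # special token at index 0
str2idx = {char: idx for idx, char in enumerate(chars)}
idx2str = {idx: char for char, idx in str2idx.items()}


def encode(text):
    """Hypothetical helper: map each character to its token index."""
    return [str2idx[char] for char in text]


def decode(ids):
    """Hypothetical helper: map token indices back to a string."""
    return "".join(idx2str[idx] for idx in ids)


ids = encode("titus.")
print(ids)
print(decode(ids))
```

Because `.` sits at index 0, a context padded with zeros decodes to a run of dots, which is what the dataset visualization prints for the start of each name.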

## Preprocessing

In [8]:
# Train-Val Split
train_names, val_names = train_test_split(names, test_size=CNNConfig.val_size, random_state=CNNConfig.seed)

In [9]:
# Dataset and DataLoader
class NamesDataset(Dataset):
    def __init__(self, _names, context_size):
        self.inputs, self.targets = [], []

        for name in _names:
            context = [0] * context_size

            for char in name + ".":
                idx = str2idx[char]
                self.inputs.append(context)
                self.targets.append(idx)
                context = context[1:] + [idx]

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        input_ids = torch.tensor(self.inputs[idx])
        target_id = torch.tensor(self.targets[idx])
        return input_ids, target_id

train_dataset = NamesDataset(train_names, context_size=CNNConfig.context_size)
val_dataset = NamesDataset(val_names, context_size=CNNConfig.context_size)
train_loader = DataLoader(train_dataset, batch_size=CNNConfig.batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=CNNConfig.batch_size, shuffle=False)

In [10]:
# Visualize the dataset
for i in range(20):
    context, target = train_dataset[i]
    context_str = ''.join([idx2str[int(token)] for token in context])
    target_char = idx2str[int(target)]
    print(f"{context_str} --> {target_char}")

........ --> k
.......k --> e
......ke --> y
.....key --> l
....keyl --> e
...keyle --> r
..keyler --> .
........ --> t
.......t --> i
......ti --> t
.....tit --> u
....titu --> s
...titus --> .
........ --> r
.......r --> y
......ry --> l
.....ryl --> i
....ryli --> .
........ --> j
.......j --> a


## Model

### Multi-Layer Perceptron (MLP)

Let's discuss the architecture of a Multi-Layer Perceptron (MLP).

![MLP](../../assets/mlp.png)

Q1: How do the embedding tokens communicate with each other? What operation is performed to do so?

Q2: Imagine having a context size of 3 when using ChatGPT... Let's increase the context size to 128, 1024, etc. What would be the challenges in this MLP architecture?



In [11]:
# Model
################################################################################
# TODO:                                                                        #
# Implement the MLP model.                                                     #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
class MLP(nn.Module):
    def __init__(self, vocab_size, context_size, d_embed, d_hidden):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_embed)
        self.linear1 = nn.Linear(context_size * d_embed, d_hidden, bias=True)
        self.linear2 = nn.Linear(d_hidden, vocab_size, bias=True)

    def forward(self, x):  # x: (batch_size, context_size)
        x_embed = self.embedding(x)  # (batch_size, context_size, d_embed)
        x_embed = x_embed.view(x_embed.size(0), -1)  # (batch_size, context_size * d_embed)
        x = F.relu(self.linear1(x_embed))  # (batch_size, d_hidden)
        x = self.linear2(x)  # (batch_size, vocab_size)
        return x
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

In [12]:
# Initialize the model
mlp = MLP(CNNConfig.vocab_size, CNNConfig.context_size, d_embed=CNNConfig.d_embed, d_hidden=CNNConfig.d_hidden)
mlp.to(CNNConfig.device)
print(mlp)
print("Number of parameters:", sum(p.numel() for p in mlp.parameters()))

MLP(
  (embedding): Embedding(27, 16)
  (linear1): Linear(in_features=128, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=27, bias=True)
)
Number of parameters: 10443


In [13]:
# Training
def train(
        model: nn.Module,
        train_loader: DataLoader,
        val_loader: DataLoader,
        max_steps: int,
        lr: float,
        val_interval: int,
        log_interval: int,
        device: torch.device
):
    """
    Train the model for a fixed number of steps.

    Args:
        model (nn.Module): The model to train.
        train_loader (DataLoader): DataLoader for the training data.
        val_loader (DataLoader): DataLoader for the validation data.
        max_steps (int): Maximum number of steps to train.
        lr (float): Learning rate.
        val_interval (int): Interval for validation.
        log_interval (int): Interval for logging.
        device (torch.device): Device to run the model on.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    wandb.watch(model, log="all", log_freq=log_interval)
    running_loss = 0.0
    train_iter = itertools.cycle(train_loader)  # Infinite dataloader
    progress_bar = tqdm(total=max_steps, desc="Training", leave=True)

    for step in range(1, max_steps + 1):
        model.train()
        train_inputs, train_targets = next(train_iter)
        train_inputs, train_targets = train_inputs.to(device), train_targets.to(device)
        optimizer.zero_grad()
        logits = model(train_inputs)
        loss = F.cross_entropy(logits, train_targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        progress_bar.set_postfix(loss=f"{running_loss / step:.4f}")
        progress_bar.update(1)

        if step % val_interval == 0:
            model.eval()
            val_loss = 0.0
            total_samples = 0
            with torch.no_grad():
                for val_inputs, val_targets in val_loader:
                    val_inputs, val_targets = val_inputs.to(device), val_targets.to(device)
                    val_logits = model(val_inputs)
                    batch_loss = F.cross_entropy(val_logits, val_targets)
                    val_loss += batch_loss.item() * val_inputs.size(0)
                    total_samples += val_inputs.size(0)
            wandb.log({"Val Loss": val_loss / total_samples}, step=step)

        if step % log_interval == 0:
            wandb.log({"Train Loss": running_loss / step}, step=step)

    progress_bar.close()
    wandb.finish()

In [None]:
# Training
train(
    model=mlp,
    train_loader=train_loader,
    val_loader=val_loader,
    max_steps=CNNConfig.max_steps,
    lr=CNNConfig.lr,
    val_interval=CNNConfig.val_interval,
    log_interval=CNNConfig.log_interval,
    device=CNNConfig.device
)

Training:  83%|████████▎ | 8299/10000 [02:14<00:18, 94.07it/s, loss=2.4206] 

In [15]:
################################################################################
# TODO:                                                                        #
# Write your answer to the questions above.                                    #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# A1:
# A2:
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

### Convolutional Neural Network (CNN)

![WaveNet](../../assets/wavenet.png)

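Instead of connecting every token to every other token in a single layer, a CNN uses convolutional layers that connect only tokens within a limited range (the kernel size). Stacking such layers widens the range of tokens that each output depends on.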
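
To make the limited connection range concrete, the sketch below tracks which input tokens each position covers. It assumes every layer merges non-overlapping pairs of adjacent positions (kernel size 2, stride 2), matching the 16, 8, 4, 2, 1 scheme of this lecture's simplified WaveNet. The `merge_pairs` helper is a hypothetical name, and no weights are involved; the code only traces connectivity.

```python
def merge_pairs(groups):
    """Merge each pair of adjacent groups, mimicking a kernel-size-2, stride-2 convolution."""
    return [groups[i] + groups[i + 1] for i in range(0, len(groups), 2)]


# Each position starts out covering exactly one input token
groups = [[token] for token in range(16)]
layer = 0
print(f"layer {layer}: {len(groups)} positions, each covering {len(groups[0])} token(s)")

while len(groups) > 1:
    groups = merge_pairs(groups)
    layer += 1
    print(f"layer {layer}: {len(groups)} positions, each covering {len(groups[0])} token(s)")
```

Under this assumption, the number of positions halves at every layer while the span of input tokens behind each position doubles, so a context of 16 tokens needs log2(16) = 4 layers before a single position covers the whole context.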


In [None]:
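# Simple WaveNet
# Example: WaveNet with 4 convolutional layers
# Input: 16 tokens (embedded)
# -> Conv -> 8 tokens
# -> Conv -> 4 tokens
# -> Conv -> 2 tokens
# -> Conv -> 1 token: Logits
# NOTE: The blanks are filled with one possible completion (an assumption):
# each layer uses kernel_size=2 and stride=2 to halve the number of tokens.

class Conv1d(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # Merge each pair of adjacent token vectors into one vector; this is
        # equivalent to concatenating the two vectors and applying a linear layer
        self.conv = nn.Conv1d(d_in, d_out, kernel_size=2, stride=2)

    def forward(self, x):  # x: (batch_size, d_in, n_tokens)
        return self.conv(x)  # (batch_size, d_out, n_tokens // 2)


class CNN(nn.Module):
    def __init__(self, vocab_size, context_size, d_embed, d_hidden):
        super().__init__()
        assert context_size & (context_size - 1) == 0, "Context size should be a power of 2"
        self.n_layers = int(math.log2(context_size))

        # Embedding
        self.embedding = nn.Embedding(vocab_size, d_embed)

        # Convolutional layers: the first maps d_embed -> d_hidden, the rest keep d_hidden
        self.layers = nn.ModuleList([
            Conv1d(d_embed if i == 0 else d_hidden, d_hidden)
            for i in range(self.n_layers)
        ])

        # Output layer
        self.linear = nn.Linear(d_hidden, vocab_size)

    def forward(self, x):
        batch_size = x.size(0)

        x = self.embedding(x)  # (batch_size, context_size, d_embed)
        x = x.transpose(1, 2)  # (batch_size, d_embed, context_size), the layout nn.Conv1d expects

        for layer in self.layers:
            x = layer(x)  # Halves the number of tokens
            x = F.relu(x)

        x = x.view(batch_size, -1)  # (batch_size, d_hidden), since only 1 token remains
        x = self.linear(x)  # (batch_size, vocab_size)
        return x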

In [None]:
# Initialize the model
cnn = CNN(CNNConfig.vocab_size, context_size=CNNConfig.context_size, d_embed=CNNConfig.d_embed, d_hidden=CNNConfig.d_hidden)
cnn.to(CNNConfig.device)
print(cnn)
print("Number of parameters:", sum(p.numel() for p in cnn.parameters()))

In [None]:
train(
    model=cnn,
    train_loader=train_loader,
    val_loader=val_loader,
    max_steps=CNNConfig.max_steps,
    lr=CNNConfig.lr,
    val_interval=CNNConfig.val_interval,
    log_interval=CNNConfig.log_interval,
    device=CNNConfig.device
)

- Rule-based Language Model:
    - Params: 26 * (27 + 27^2 + 27^3 + ... + 27^15) = 27^16 - 27 ≈ 8 x 10^22
- Bigram Language Model:
    - Params: 27 * 27 = 729
- MLP Language Model:
    - Params: 27 * 16 + 16 * 64 + 64 * 27 = 3,184
- CNN Language Model:
    - Params: 27 * 16 + 16 * 16 + 16 * 16 + 16 * 16 + 16 * 27 = 1,632

A neural network can be viewed as a compression algorithm: it learns the patterns in the data and stores them in its weights.
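
The parameter-count expressions in the list above can be evaluated directly. This is a minimal sketch that takes each expression exactly as written (weights only, no bias terms); the variable names are chosen for illustration.

```python
vocab_size = 27

# Rule-based model: 26 * (27 + 27^2 + ... + 27^15)
rule_based = 26 * sum(vocab_size ** k for k in range(1, 16))

# Bigram model: one entry per (previous token, next token) pair
bigram = vocab_size * vocab_size

# MLP: embedding + hidden layer + output layer, as written in the list
mlp_params = 27 * 16 + 16 * 64 + 64 * 27

# CNN: embedding + three 16x16 layers + output layer, as written in the list
cnn_params = 27 * 16 + 3 * (16 * 16) + 16 * 27

print(f"Rule-based: {rule_based:.2e}")
print(f"Bigram:     {bigram:,}")
print(f"MLP:        {mlp_params:,}")
print(f"CNN:        {cnn_params:,}")
```

These are rough counts for comparison only. The models instantiated in the notebook also include bias terms, so `sum(p.numel() for p in model.parameters())` reports larger numbers.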


## Information theory
[A mathematical theory of communication](https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf)

- *The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. (Claude Shannon, 1948)*
- Received signal = Original signal + Noise
    - Goal: **Remove noise**
- Entropy: Measure of uncertainty
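
As a small illustration of entropy as a measure of uncertainty, the sketch below computes Shannon entropy in bits, H = -sum(p * log2(p)). The helper names `entropy_bits` and `char_entropy` are hypothetical and introduced only for this example.

```python
import math
from collections import Counter


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def char_entropy(text):
    """Entropy of the empirical character distribution of a string."""
    counts = Counter(text)
    total = sum(counts.values())
    return entropy_bits(count / total for count in counts.values())


print(entropy_bits([0.5, 0.5]))     # fair coin
print(entropy_bits([1 / 27] * 27))  # uniform guess over the 27-token vocabulary
print(char_entropy("abab"))         # two equally frequent characters
```

A uniform guess over the 27 tokens has log2(27) ≈ 4.75 bits of entropy, the maximum for this vocabulary; a distribution that concentrates probability on fewer characters has lower entropy.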