# Lecture 3: Multi-Layer Perceptrons

In this lecture, we will introduce Multi-Layer Perceptrons (MLPs).

We will reproduce the model from the following paper: [A Neural Probabilistic Language Model](https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) (Bengio et al., 2003).

## Curse of Dimensionality

For a character-level language model, the input is a one-hot vector of size 27 (26 letters + 1 special token).

For a word-level language model, the input size equals the vocabulary size, which is usually very large (e.g., 20,000).

This leads to the curse of dimensionality: the number of possible word sequences grows exponentially with the context length, so most sequences seen at test time never appear in the training data.

**Solution**: Use a lower-dimensional representation of the input vector.

The hypothesis is that similar words will have similar representations (e.g., dog and cat). Let's find a way to embed words into a lower-dimensional space.
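
To make the size difference concrete, here is a minimal sketch comparing a one-hot input with a dense embedding. The embedding dimension of 30 and the word index 4242 are arbitrary values chosen for illustration, not values from the paper or this lecture.

```python
import torch
import torch.nn.functional as F

vocab_size = 20_000  # word-level vocabulary size from the example above
d_embed = 30         # hypothetical embedding dimension (assumption)

word_idx = torch.tensor([4242])  # hypothetical index of some word

# One-hot representation: a single 1 among 20,000 entries
one_hot = F.one_hot(word_idx, num_classes=vocab_size).float()

# Dense representation: one row of a (vocab_size, d_embed) lookup table
C = torch.randn(vocab_size, d_embed)
dense = C[word_idx]

print(f"One-hot shape: {one_hot.shape}")
print(f"Embedding shape: {dense.shape}")
```

Every layer after the embedding only sees the small dense vector, so its weight matrices no longer scale with the vocabulary size.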

**Example**
- The cat is walking in the bedroom. (Train data)
- A dog was running in a room. (Train data)
- The cat was running in a room. (Train data)
- A dog is walking in a bedroom. (Train data)
- A cat was running in a <?> (Test data)

The model should be able to predict the word "room" (or a similar word) for the test sentence.

## MLP

In the previous lecture, we implemented a bigram language model.
In this lecture, we will implement a Multi-Layer Perceptron (MLP) language model.

For practical reasons, let's use a character-level language model.

![MLP](https://github.com/EunjinWoo/LLM101n/blob/master/assets/mlp.png?raw=1)

### Importing Libraries

In [1]:
import os
import matplotlib.pyplot as plt
from dataclasses import dataclass
import torch
from torch.nn import functional as F
from utils import load_text, set_seed
%matplotlib inline


### Configuration

In [None]:
@dataclass
class MLPConfig:
    root_dir: str = os.getcwd() + "/../../"
    dataset_path: str = "data/names.txt"

    # Tokenizer
    vocab_size: int = 0  # Set later

    # Model
    context_size: int = 3
    d_embed: int = 2
    d_hidden: int = 32

    # Training
    batch_size: int = 32
    lr: float = 2e-4
    max_steps: int = 10000

    seed: int = 101

config = MLPConfig()

### Reproducibility

In [None]:
set_seed(config.seed)
generator = torch.Generator().manual_seed(config.seed)

### Dataset

In [None]:
# Load text and split by lines
names = load_text(config.root_dir + config.dataset_path).splitlines()

### Tokenizer

In [None]:
chars = [chr(i) for i in range(97, 123)]  # all alphabet characters
chars.insert(0, ".")  # Add special token
config.vocab_size = len(chars)
str2idx = {char: idx for idx, char in enumerate(chars)}
idx2str = {idx: char for char, idx in str2idx.items()}

### Preprocessing

We need to create a dataset of (Input, Target) pairs.
- Input: the previous 3 characters (`context_size`)
- Target: the next character
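
As a small illustration of the sliding context window, the sketch below builds the pairs for the single name "emma". The helper `make_pairs` is a hypothetical name used only here, and the vocabulary is rebuilt locally so the snippet runs on its own.

```python
context_size = 3
chars = ["."] + [chr(i) for i in range(97, 123)]  # "." is the special token
str2idx = {ch: i for i, ch in enumerate(chars)}


def make_pairs(name):
    """Return (context, target) index pairs for one name (illustrative helper)."""
    context = [0] * context_size  # start with "..." as padding
    pairs = []
    for ch in name + ".":
        idx = str2idx[ch]
        pairs.append((list(context), idx))
        context = context[1:] + [idx]  # slide the window by one character
    return pairs


pairs = make_pairs("emma")
for ctx, tgt in pairs:
    print("".join(chars[i] for i in ctx), "--->", chars[tgt])
```

A name with n characters yields n + 1 pairs, because the final "." is also a target.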

In [None]:
def prepare_dataset(_names):
    _inputs, _targets = [], []

    for name in _names:
        #print(name)
        context = [0] * config.context_size  # How many characters do we take to predict the next character

        for char in name + ".":
            idx = str2idx[char]
            _inputs.append(context)
            _targets.append(idx)
            #print(''.join(idx2str[i] for i in context), '--->', idx2str[idx])
            context = context[1:] + [idx]  # Shift the context by 1 character

        #print("="*10)

    _inputs = torch.tensor(_inputs)
    _targets = torch.tensor(_targets)

    return _inputs, _targets

In [None]:
inputs, targets = prepare_dataset(names)

print(f"Number of Input, Target pairs: {len(inputs)}")
print(f"Input shape: {inputs.shape}, Output shape: {targets.shape}")
print(f"First (Input, Target): {inputs[0]}, {targets[0]}")
print(f"Second (Input, Target): {inputs[1]}, {targets[1]}")

### Model

The model consists of the following components:
- Embedding
- Hidden Layer
- Output Layer


#### Embedding

Embedding is a lookup table that maps an input character to a lower-dimensional representation.

Example
- Input: 'a'
- Output: [0.1, 0.2]

[0.1, 0.2] represents the character 'a' in a lower-dimensional space.
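
One way to see why a lookup table is enough: indexing a row of the table gives the same result as multiplying a one-hot vector by the table. The sketch below checks this on a randomly initialized table; the seed is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = torch.randn(27, 2)  # (vocab_size, d_embed) lookup table

idx = torch.tensor(1)  # index of 'a'

# Option 1: direct row lookup
lookup = C[idx]

# Option 2: one-hot vector times the table, which selects the same row
one_hot = F.one_hot(idx, num_classes=27).float()
via_matmul = one_hot @ C

print(torch.allclose(lookup, via_matmul))
```

Indexing is preferred in practice because it avoids building the one-hot vector at all.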

In [None]:
# Embedding example
C = torch.randn(27, 2)
print(f"Embedding shape: {C.shape}")

In [None]:
# Embedding example
# a: [1, :]
a_embed = C[1, :]
print(f"Embedding of 'a': {a_embed}")

How the **forward pass** works in an MLP:
1. Embed the input characters.
2. Concatenate the embeddings.
3. Pass the concatenated embeddings through a hidden layer.
4. Pass the hidden layer output through an output layer.
5. Get the output of shape (vocab_size=27).


Don't know how to concatenate? PyTorch provides concatenation functionality. [PyTorch Documentation](https://pytorch.org/docs/stable/generated/torch.cat.html)
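
As a hedged sketch of one possible approach (not necessarily the intended solution to the exercise), the snippet below concatenates three embeddings with `torch.cat` and shows that indexing with a tensor followed by `view` gives the same vector. The table is randomly initialized with an arbitrary seed.

```python
import torch

torch.manual_seed(0)
C = torch.randn(27, 2)  # (vocab_size, d_embed)
ctx = [0, 5, 13]        # indices of ".", "e", "m"

# Option 1: look up each embedding and concatenate along dimension 0
x_cat = torch.cat([C[i] for i in ctx], dim=0)  # shape (6,)

# Option 2: index with a tensor, then flatten (3, 2) -> (6,)
x_view = C[torch.tensor(ctx)].view(-1)

print(x_cat.shape, torch.equal(x_cat, x_view))
```

The second form extends naturally to batches, since indexing with a (batch_size, context_size) tensor returns all embeddings at once.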

In [None]:
# Example forward pass
# Input: ".em"
dot_embed = C[0]  # .: (embedding_size=2)
e_embed = C[5]    # e: (embedding_size=2)
m_embed = C[13]   # m: (embedding_size=2)

################################################################################
# TODO:                                                                        #
# Concatenate the embeddings                                                   #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

print(f"Concatenated shape: {x.shape}")

#### Hidden Layer

The hidden layer is a linear transformation followed by a non-linear activation function.
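
A minimal sketch of such a layer, assuming a bias term `b1` and a random stand-in for the concatenated input (both are assumptions made for illustration):

```python
import torch

torch.manual_seed(0)
x = torch.randn(6)  # stand-in for 3 concatenated embeddings of size 2

d_hidden = 4
W1 = torch.randn(6, d_hidden)  # weights: (input_size, d_hidden)
b1 = torch.randn(d_hidden)     # bias: (d_hidden,)

# Linear transformation followed by the tanh non-linearity
h = torch.tanh(x @ W1 + b1)

print(f"Hidden shape: {h.shape}")
```

Because of tanh, every entry of `h` lies between -1 and 1.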

In [None]:
# Hidden layer
################################################################################
# TODO:                                                                        #
# Implement the hidden layer                                                   #
# Use tanh as your activation function.                                        #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
W1 = torch.randn(6, 4)  # (d_embed * context_size, d_hidden)
h = torch.tanh(x @ W1)  # torch.tanh replaces the deprecated F.tanh

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

print(f"Hidden shape: {h.shape}")

#### Output Layer

The output layer is another linear transformation. It maps the hidden layer output to the logits, one per character in the vocabulary.

Example
- Input: the hidden vector (size d_hidden)
- Output: a vector of size 27 (vocab_size) of logits. Applying softmax turns the logits into a probability for each character.
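
The sketch below separates the two steps: the linear layer produces logits, and softmax converts them into probabilities. The bias `b2` and the random hidden vector are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
h = torch.randn(4)  # stand-in for the hidden layer output

W2 = torch.randn(4, 27)  # (d_hidden, vocab_size)
b2 = torch.randn(27)     # (vocab_size,)

logits = h @ W2 + b2               # unnormalized scores, any real value
probs = F.softmax(logits, dim=-1)  # non-negative values that sum to 1

print(f"Logits shape: {logits.shape}")
print(f"Sum of probabilities: {probs.sum().item():.4f}")
```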

In [None]:
# Output layer
################################################################################
# TODO:                                                                        #
# Implement the output layer                                                   #
# Activation function must be ???                                              #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
W2 = torch.randn(4, 27)  # (d_hidden, vocab_size)
y = F.softmax(h @ W2, dim=-1)  # normalize over the vocabulary dimension
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

print(f"Output shape: {y.shape}")
print(f"Output: {y}")
print(f"Sum of probabilities: {y.sum()}")

In [None]:
# Example of prediction
print(f"Input characters: {idx2str[0]}, {idx2str[5]}, {idx2str[13]}")
print(f"Output character (probability): {idx2str[y.argmax().item()]}, {y.max()}")

Let's refactor the code so that it processes a whole batch at once.

- Input:
    - Shape: (batch_size, context_size)
    - Example: [[5, 13, 13], [13, 13, 5]]  # "emm", "mme"

- Parameters:
    - Embedding:
        - Shape: (vocab_size, d_embed)
    - W1:
        - Shape: (d_embed * context_size, d_hidden)
    - W2:
        - Shape: (d_hidden, vocab_size)

- Output:
    - Shape: (batch_size, vocab_size)
    - Example: [[0.04, 0.03, ..., 0.02], [0.02, 0.03, ..., 0.04]]
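
The shapes listed above can be checked with a short batched forward pass. This is a sketch under assumptions: the parameters are randomly initialized, the biases start at zero, and `forward` is a hypothetical helper name. It returns logits; applying softmax to them would give probabilities like the example output above.

```python
import torch

torch.manual_seed(0)
vocab_size, context_size, d_embed, d_hidden = 27, 3, 2, 32

C = torch.randn(vocab_size, d_embed, requires_grad=True)
W1 = torch.randn(d_embed * context_size, d_hidden, requires_grad=True)
b1 = torch.zeros(d_hidden, requires_grad=True)
W2 = torch.randn(d_hidden, vocab_size, requires_grad=True)
b2 = torch.zeros(vocab_size, requires_grad=True)
n_params = sum(p.numel() for p in [C, W1, b1, W2, b2])


def forward(batch_input):
    """Map (batch_size, context_size) indices to (batch_size, vocab_size) logits."""
    emb = C[batch_input]              # (batch_size, context_size, d_embed)
    x = emb.view(emb.shape[0], -1)    # (batch_size, context_size * d_embed)
    h = torch.tanh(x @ W1 + b1)       # (batch_size, d_hidden)
    return h @ W2 + b2                # (batch_size, vocab_size)


batch = torch.tensor([[5, 13, 13], [13, 13, 5]])  # "emm", "mme"
logits = forward(batch)
print(f"Logits shape: {logits.shape}")
print(f"Number of parameters: {n_params}")
```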

What is a **mini-batch**?
- It is a subset of the dataset.
- In practice, the dataset is often too large to process in a single step (or even to fit into memory). Therefore, we divide it into mini-batches and feed the model one batch at a time.
- batch_size: the number of samples in a mini-batch
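
A minimal sketch of drawing one random mini-batch. The dataset here is a random stand-in with 1,000 samples (an assumption for illustration, not the names dataset), and the seed is arbitrary.

```python
import torch

generator = torch.Generator().manual_seed(101)

# Stand-in dataset: 1,000 contexts of 3 token indices and their targets
inputs = torch.randint(0, 27, (1000, 3), generator=generator)
targets = torch.randint(0, 27, (1000,), generator=generator)

batch_size = 32

# Sample batch_size random row indices, then gather those rows
idx = torch.randint(0, len(inputs), (batch_size,), generator=generator)
batch_input = inputs[idx]    # (batch_size, context_size)
batch_target = targets[idx]  # (batch_size,)

print(batch_input.shape, batch_target.shape)
```

Sampling with `torch.randint` draws indices with replacement, so a sample can appear twice in the same batch. This is usually harmless for a simple training loop.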

In [None]:
################################################################################
# TODO:                                                                        #
# Initialize the parameters                                                    #
# C: (vocab_size, d_embed)                                                     #
# W1: (d_embed * context_size, d_hidden)                                       #
# b1: (d_hidden)                                                               #
# W2: (d_hidden, vocab_size)                                                   #
# b2: (vocab_size)                                                             #
# Set requires_grad to True                                                    #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

parameters = [C, W1, b1, W2, b2]
print(f"Number of parameters: {sum(p.numel() for p in parameters)}")

### Training

PyTorch provides a CrossEntropyLoss function. [PyTorch Documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)

**Note**: Softmax is already included in the CrossEntropyLoss function.
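
To illustrate the note above, this sketch compares `F.cross_entropy` on raw logits with a manual computation: log-softmax followed by the mean negative log-likelihood of the target classes. The logits and targets are random stand-ins.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 27)            # (batch_size, vocab_size), raw scores
target = torch.tensor([1, 5, 13, 0])   # (batch_size,), correct class indices

# Built-in: takes logits directly and averages over the batch by default
loss_builtin = F.cross_entropy(logits, target)

# Manual: log-softmax, pick the log-probability of each target, negate, average
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), target].mean()

print(torch.allclose(loss_builtin, loss_manual))
```

Passing probabilities instead of logits would apply softmax twice and give a wrong loss, which is why the forward pass in the training loop should stop at the logits.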

Change the learning rate so that the loss curve looks like the following:

![Loss](https://github.com/EunjinWoo/LLM101n/blob/master/assets/train_loss.png?raw=1)

In [None]:
lr = config.lr  # Change learning rate
steps = []
losses = []

for i in range(config.max_steps):
    # Mini-batch
    idx = torch.randint(0, len(inputs), (config.batch_size,))
    batch_input = inputs[idx]  # (batch_size, context_size)
    batch_target = targets[idx]  # (batch_size)
    if i == 0:
        print(f"Input batch shape: {batch_input.shape}")
        print(f"Target batch shape: {batch_target.shape}")

    # Forward pass
    ################################################################################
    # TODO:                                                                        #
    # Implement the forward pass                                                   #
    # 1. Embed the input characters.                                               #
    # 2. Concatenate the embeddings.                                               #
    # 3. Pass the concatenated embeddings through a hidden layer.                  #
    # 4. Pass the hidden layer output through an output layer.                     #
    # DO NOT INCLUDE SOFTMAX                                                       #
    ################################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    # Embedding

    # Concatenate

    # Hidden layer

    # Output layer

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    loss = F.cross_entropy(logits, batch_target)  # scalar: mean over the batch

    # Backward pass
    for param in parameters:
        param.grad = None
    loss.backward()

    # Update parameters
    for param in parameters:
        param.data += -lr * param.grad

    # Track loss
    steps.append(i)
    losses.append(loss.log10().item())

# Plot loss
plt.plot(steps, losses)
plt.xlabel("Steps")
plt.ylabel("log10(Loss)")
plt.show()
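
For reference, here is a hedged, self-contained sketch of the same loop structure on synthetic data. It is not the solution for the names dataset: the inputs and targets are random, the whole set of 64 samples is used at every step, and the learning rate of 0.1 and the 0.1 scaling of the initial weights are arbitrary choices. The update is written under `torch.no_grad()`, an alternative to modifying `param.data` directly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, context_size, d_embed, d_hidden = 27, 3, 2, 32

# Synthetic stand-in data (random contexts and targets), NOT the names dataset
inputs = torch.randint(0, vocab_size, (64, context_size))
targets = torch.randint(0, vocab_size, (64,))

# Small initial weights keep the first loss close to ln(27)
C = (0.1 * torch.randn(vocab_size, d_embed)).requires_grad_()
W1 = (0.1 * torch.randn(d_embed * context_size, d_hidden)).requires_grad_()
b1 = torch.zeros(d_hidden, requires_grad=True)
W2 = (0.1 * torch.randn(d_hidden, vocab_size)).requires_grad_()
b2 = torch.zeros(vocab_size, requires_grad=True)
parameters = [C, W1, b1, W2, b2]

lr = 0.1
losses = []
for step in range(200):
    # Forward pass: embed, flatten, hidden layer, output layer (no softmax)
    emb = C[inputs]
    h = torch.tanh(emb.view(len(inputs), -1) @ W1 + b1)
    logits = h @ W2 + b2
    loss = F.cross_entropy(logits, targets)

    # Backward pass: reset gradients, then backpropagate
    for p in parameters:
        p.grad = None
    loss.backward()

    # Gradient descent update, excluded from the autograd graph
    with torch.no_grad():
        for p in parameters:
            p -= lr * p.grad

    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```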

In [None]:
# Visualization of the embedding matrix
plt.figure(figsize=(10, 10))
plt.scatter(C[:,0].data, C[:,1].data, s=200)  # dimensions of 0 and 1
for i in range(C.shape[0]):
    plt.text(C[i,0].item(), C[i,1].item(), idx2str[i], ha="center", va="center", color='white')
plt.grid(True)
plt.show()

### Inference

In [None]:
def generate_name():
    new_name = []
    context = [0] * config.context_size

    while True:
        ################################################################################
        # TODO:                                                                        #
        # 1. Forward pass                                                              #
        # 2. Sample the next token                                                     #
        # 3. Decode the token                                                          #
        # 4. Update the context                                                        #
        # 5. Break if the next character is "."                                        #
        # Hint: Use F.softmax to get the probabilities                                 #
        ################################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # Forward pass

        # Sample


        # Update context

        # Break if "."

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    return "".join(new_name)

for _ in range(5):
    print(generate_name())
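
The sampling steps can be sketched as follows. This is one possible approach under assumptions, not the reference solution: the parameters are random and untrained, so the output is gibberish. The helper name `sample_name` and the `max_len` cap (which guards against very long outputs) are additions made for illustration.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(101)
vocab_size, context_size, d_embed, d_hidden = 27, 3, 2, 32
chars = ["."] + [chr(i) for i in range(97, 123)]

# Untrained stand-in parameters (a trained model would reuse its own)
C = torch.randn(vocab_size, d_embed, generator=g)
W1 = torch.randn(d_embed * context_size, d_hidden, generator=g)
b1 = torch.zeros(d_hidden)
W2 = torch.randn(d_hidden, vocab_size, generator=g)
b2 = torch.zeros(vocab_size)


def sample_name(max_len=20):
    """Sample characters one at a time until "." is drawn or max_len is reached."""
    out = []
    context = [0] * context_size
    for _ in range(max_len):
        # Forward pass for a batch containing one context
        emb = C[torch.tensor([context])]           # (1, context_size, d_embed)
        h = torch.tanh(emb.view(1, -1) @ W1 + b1)  # (1, d_hidden)
        logits = h @ W2 + b2                       # (1, vocab_size)
        probs = F.softmax(logits, dim=1)

        # Sample the next token index from the predicted distribution
        idx = torch.multinomial(probs, num_samples=1, generator=g).item()
        if idx == 0:  # "." marks the end of a name
            break
        out.append(chars[idx])
        context = context[1:] + [idx]  # slide the context window
    return "".join(out)


name = sample_name()
print(name)
```

Sampling with `torch.multinomial` rather than taking the argmax keeps the generated names varied from call to call.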

In [None]:
# TODO: Change the learning rate to get better results