# Lecture 3: Multi-Layer Perceptrons

In this lecture, we introduce Multi-Layer Perceptrons (MLPs).

We will reproduce the paper [A Neural Probabilistic Language Model](https://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf).

## Curse of Dimensionality

For a character-level language model, each input is a one-hot vector of size 27 (26 letters + 1 special token).

For a word-level language model, the input vector has the size of the vocabulary, which is usually very large (e.g., 20,000).

This leads to the curse of dimensionality: the number of possible input sequences grows exponentially with the context length, so most of them never appear in the training data.
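
To make the growth concrete, the sketch below counts how many distinct contexts exist for a given vocabulary size. The context length of 3 is an assumption chosen to match the `block_size` used later in this lecture, and `num_contexts` is a hypothetical helper name.

```python
def num_contexts(vocab_size: int, context_length: int) -> int:
    """Number of distinct input sequences of the given length."""
    return vocab_size ** context_length

# Character-level vocabulary (27 tokens) vs. a word-level one (20,000 words)
for vocab_size in (27, 20_000):
    count = num_contexts(vocab_size, 3)
    print(f"vocab_size={vocab_size}: {count:,} possible contexts")
```

With 27 characters there are 19,683 possible 3-character contexts; with 20,000 words there are 8 trillion possible 3-word contexts, far more than any training set can cover.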

**Solution**: Use a lower-dimensional representation of the input vector.

The hypothesis is that similar words will have similar representations (e.g., dog and cat). Let's find a way to embed words into a lower-dimensional space.
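
One common way to quantify "similar representations" is the cosine similarity between embedding vectors. The sketch below uses hand-picked 2-dimensional vectors purely for illustration; they are not learned values, and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (assumption: made up for this example, not trained)
embeddings = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "room": [0.1, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_room = cosine_similarity(embeddings["cat"], embeddings["room"])
print(f"cat vs dog:  {sim_cat_dog:.3f}")
print(f"cat vs room: {sim_cat_room:.3f}")
```

If training places "cat" and "dog" close together like this, whatever the model learns about one transfers to the other.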

**Example**
- The cat is walking in the bedroom. (Train data)
- A dog was running in a room. (Train data)
- The cat was running in a room. (Train data)
- A dog is walking in a bedroom. (Train data)
- A cat was running in a <?> (Test data)

The model should be able to predict the word "room" (or a similar word) for the test sentence.

## MLP

In the previous lecture, we implemented a bigram language model.
In this lecture, we will implement a Multi-Layer Perceptron (MLP) language model.

For practical reasons, let's use a character-level language model.

![MLP](../images/MLP.png)

### Importing Libraries

In [1]:
import os
import matplotlib.pyplot as plt
import torch
from torch.nn import functional as F
from dataclasses import dataclass
from src.utils import load_text, set_seed
%matplotlib inline

### Configuration

In [2]:
@dataclass
class MLPConfig:
    root_dir: str = os.getcwd() + "/../../"
    dataset_path: str = "data/raw/names.txt"

    # Tokenizer
    vocab_size: int = 0  # Set later
    
    # Model
    block_size: int = 3
    hidden_size: int = 64

    seed: int = 101

### Reproducibility

In [3]:
set_seed(MLPConfig.seed)

Random seed set to 101


### Dataset

In [4]:
# Load text and split by lines
names = load_text(MLPConfig.root_dir + MLPConfig.dataset_path).splitlines()

Loaded text data from /mnt/c/Users/cheir/GitHub/LLM101/notebooks/Lectures/../../data/raw/names.txt (length: 228145 characters).


### Tokenizer

In [5]:
chars = [chr(i) for i in range(97, 123)]  # Lowercase letters 'a' to 'z'
chars.insert(0, ".")  # Special token at index 0 (marks the start and end of a name)
MLPConfig.vocab_size = len(chars)  # 27
str2idx = {char: idx for idx, char in enumerate(chars)}  # Character -> index
idx2str = {idx: char for char, idx in str2idx.items()}  # Index -> character
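
As a quick sanity check, the two mappings can be wrapped in a pair of helper functions. `encode` and `decode` are hypothetical names that do not appear in the notebook; the vocabulary is rebuilt here so the snippet runs on its own.

```python
import string

# Same vocabulary as above: "." at index 0, then a-z at indices 1..26
chars = ["."] + list(string.ascii_lowercase)
str2idx = {char: idx for idx, char in enumerate(chars)}
idx2str = {idx: char for char, idx in str2idx.items()}

def encode(text):
    """Map each character to its integer index."""
    return [str2idx[char] for char in text]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(idx2str[idx] for idx in indices)

tokens = encode("emma.")
print(tokens)          # [5, 13, 13, 1, 0]
print(decode(tokens))  # emma.
```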

### Preprocessing

We need to create a dataset of (input, target) pairs.
- Input: the previous 3 characters (the context)
- Target: the next character
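
For a single name, the sliding context window can be sketched as follows. This version works on characters rather than indices to keep the output readable; `build_examples` is a hypothetical helper, and padding with `.` is assumed to match the special token at index 0 of the tokenizer.

```python
def build_examples(name, block_size=3):
    """Return (context, target) string pairs for one name."""
    context = "." * block_size  # Pad the start with the special token
    examples = []
    for char in name + ".":  # The trailing "." marks the end of the name
        examples.append((context, char))
        context = context[1:] + char  # Slide the window by one character
    return examples

for context, target in build_examples("emma"):
    print(f"{context} ---> {target}")
# ... ---> e
# ..e ---> m
# .em ---> m
# emm ---> a
# mma ---> .
```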

In [27]:
def get_dataloader(names):
    """Build (context, target) index tensors from a list of names."""
    block_size = MLPConfig.block_size  # Number of characters used to predict the next one
    xs, ys = [], []

    for name in names:
        # Start every name with a context of special tokens (index 0 is ".")
        context = [0] * block_size

        for char in name + ".":  # The trailing "." marks the end of the name
            idx = str2idx[char]
            xs.append(context)
            ys.append(idx)
            # Uncomment to inspect each pair:
            # print(''.join(idx2str[i] for i in context), '--->', idx2str[idx])
            context = context[1:] + [idx]  # Shift the context by 1 character

    xs = torch.tensor(xs)  # (num_examples, block_size)
    ys = torch.tensor(ys)  # (num_examples,)

    return xs, ys

In [28]:
xs, ys = get_dataloader(names[:3])

In [30]:
xs, ys = get_dataloader(names)

print(xs.dtype, ys.dtype)
print(f"Input shape: {xs.shape}, Output shape: {ys.shape}")

torch.int64 torch.int64
Input shape: torch.Size([228146, 3]), Output shape: torch.Size([228146])
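
The number of examples can be checked by hand: each name contributes one example per character, plus one for the end token. If the file has no trailing newline (an assumption, though it is consistent with the counts above), the total equals the file length plus one, which matches 228,145 + 1 = 228,146. A small sketch with a toy stand-in for the file:

```python
# Toy stand-in for the contents of names.txt (assumption: no trailing newline)
text = "emma\nolivia\nava"
names = text.splitlines()

# One example per character, plus one per name for the end token "."
num_examples = sum(len(name) + 1 for name in names)

print(len(text), num_examples)  # 15 16
```

Each newline in the file is effectively replaced by an end token, and the last name adds one more, hence the "+ 1".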


### Model

In [34]:
C = torch.randn(MLPConfig.vocab_size, 2)  # Embedding table: one 2-dimensional vector per character

In [35]:
C.shape

torch.Size([27, 2])

In [40]:
# Embedding example
# 'a' has index 1, so its embedding is row 1 of C
a_embed = C[1, :]
print(f"Embedding of 'a': {a_embed}")

Embedding of 'a': tensor([-1.4056, -0.1122])
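
Indexing a row of `C` gives the same result as multiplying a one-hot vector by `C`, which is why an embedding table can be viewed as a linear layer applied to the one-hot input. The sketch below checks this with plain Python lists so that it does not depend on any library; the random values are stand-ins for `C`, and `one_hot` and `vec_mat` are hypothetical helpers.

```python
import random

random.seed(0)
vocab_size, embedding_size = 27, 2

# Stand-in for the embedding table C: vocab_size rows, embedding_size columns
C = [[random.gauss(0, 1) for _ in range(embedding_size)] for _ in range(vocab_size)]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(v, M):
    """Row vector times matrix: (len(v),) @ (len(v), cols) -> (cols,)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

x = one_hot(1, vocab_size)    # One-hot encoding of 'a' (index 1)
print(vec_mat(x, C) == C[1])  # True: the product selects row 1 of C
```

The lookup is preferred in practice because it avoids building the one-hot vector and multiplying by mostly zeros.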


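y = tanh(Cx)
- x: (batch_size, block_size), the integer indices of the context characters
- C: (vocab_size, embedding_size), the embedding table

Here `Cx` denotes looking up the rows of `C` selected by the indices in `x`.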
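
A minimal sketch of how these shapes fit together, assuming the looked-up embeddings are concatenated and passed through a single `tanh` hidden layer. The weight `W1` and bias `b1` are assumptions introduced here for illustration and do not appear above; `hidden_size = 64` is taken from `MLPConfig`.

```python
import torch

torch.manual_seed(0)
vocab_size, embedding_size, block_size, hidden_size = 27, 2, 3, 64

C = torch.randn(vocab_size, embedding_size)  # Embedding table
x = torch.tensor([[0, 0, 0],
                  [0, 0, 5],
                  [0, 5, 13]])               # (batch_size, block_size)

emb = C[x]                                   # (batch_size, block_size, embedding_size)
flat = emb.view(x.shape[0], -1)              # Concatenate the context embeddings

# Hypothetical hidden-layer parameters (not defined in the notebook)
W1 = torch.randn(block_size * embedding_size, hidden_size)
b1 = torch.randn(hidden_size)

h = torch.tanh(flat @ W1 + b1)               # (batch_size, hidden_size)
print(emb.shape, flat.shape, h.shape)
```

Indexing `C` with a 2-D integer tensor returns one embedding per index, so `emb[1, 2]` is the same vector as `C[5]`.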