Implementation of [**A Neural Probabilistic Language Model**](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) by Yoshua Bengio et al. (2003).
<br>
Dataset used for training: [**Tiny Shakespeare**](https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt)

#### Architecture (paper):
```
Embedding lookup → concatenate → hidden (tanh) → logits = Wx + Uh + b → softmax → cross-entropy loss.
```
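
For reference, the pipeline above can be written as equations. This is a sketch in the paper's symbols, with two notebook conventions assumed: the context length is $k$ (the paper writes $n-1$), and the context is ordered oldest character first. Here $d$ is the hidden-layer bias and $C(w)$ is the embedding of character $w$.

$$
x = \big[\,C(w_{t-k});\ \dots;\ C(w_{t-1})\,\big], \qquad
y = b + Wx + U\tanh(d + Hx), \qquad
P(w_t \mid w_{t-k}, \dots, w_{t-1}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}
$$

The cross-entropy loss is the mean of $-\log P(w_t \mid \cdot)$ over the training examples.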
Defaults for Tiny Shakespeare (good starting point):
- `context_len (k) = 10`
- `vocab_size (V) ≈ 65` (compute from text)
- `embed_dim (m) = 32`
- `hidden_size (H) = 128`
- `batch_size = 128`
- `optimizer = Adam(lr=1e-3)`
- `epochs = 10–30`
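
To get a feel for the model size at these defaults, the parameters can be tallied by hand. The sketch below assumes the layer layout used in this notebook: a bias on the hidden layer, no bias on the `W` and `U` projections, and a single output bias `b`. `count_parameters` is a hypothetical helper, not part of the notebook.

```python
def count_parameters(V, k, m, H):
    """Tally parameters per layer (assumed layout: hidden bias + one output bias)."""
    embedding = V * m              # C: one m-dimensional vector per character
    hidden = (k * m) * H + H       # hidden layer weights plus its bias
    shortcut = (k * m) * V         # W: direct input-to-logits path, no bias
    output = H * V                 # U: hidden-to-logits path, no bias
    out_bias = V                   # b
    return embedding + hidden + shortcut + output + out_bias

print(count_parameters(V=65, k=10, m=32, H=128))  # 72353
```

Most of the parameters sit in the two layers that read the flattened context (`k * m = 320` inputs), so increasing `k` or `m` grows the model faster than increasing `V`.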

### Data Pipeline 

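Download the dataset and save a local copy as `input.txt`:
```py
import requests

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url, timeout=30).text

# Save a local copy; the next cell reads input.txt
with open("input.txt", "w", encoding="utf-8") as f:
    f.write(text)
```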

In [1]:
# 1. Load dataset
text = open("input.txt", 'r').read()

In [2]:
print(len(text))

1115394


In [3]:
print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [4]:
lines = text.split("\n")
print("Total lines:", len(lines))

vocab = sorted(list(set(text)))
print("Vocab size:", len(vocab))
print(vocab)


Total lines: 40001
Vocab size: 65
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [5]:
# 2. Build vocabulary
stoi = {s: i for i, s in enumerate(vocab)}
itos = {i: s for i, s in enumerate(vocab)}
V = len(vocab)
itos[37]

'Y'

In [6]:
# 3. Encode whole text to integers
data = [stoi[ch] for ch in text]
len(data)

1115394
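
A quick way to check the two lookup tables is an encode/decode round trip. The sketch below is self-contained and builds a toy vocabulary from a short sample string, so its indices differ from the notebook's; the names `toy_vocab`, `toy_stoi`, `toy_itos`, `encode`, and `decode` are hypothetical.

```python
sample = "First Citizen:"

# Toy vocabulary built from the sample only (not the full 65-character vocab)
toy_vocab = sorted(set(sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_vocab)}
toy_itos = {i: ch for i, ch in enumerate(toy_vocab)}

def encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[ch] for ch in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

ids = encode(sample)
print(ids)
print(decode(ids) == sample)  # True: encoding is lossless
```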

In [7]:
# 4. Train/Validation split
split = int(0.9 * len(data))
train_data = data[:split]
test_data = data[split:]
print(f"training: {len(train_data)}")
print(f"testing: {len(test_data)}")

training: 1003854
testing: 111540


In [8]:
k = 10

In [9]:
# 5. build sliding window dataset

def build_dataset(data, k):
    X, y = [], []
    for i in range(k, len(data)):
        X.append(data[i-k:i])   # context
        y.append(data[i])       # target
    return X, y

X_train, y_train = build_dataset(train_data, k)
X_val, y_val     = build_dataset(test_data, k)

In [10]:
X_train[:5]

[[18, 47, 56, 57, 58, 1, 15, 47, 58, 47],
 [47, 56, 57, 58, 1, 15, 47, 58, 47, 64],
 [56, 57, 58, 1, 15, 47, 58, 47, 64, 43],
 [57, 58, 1, 15, 47, 58, 47, 64, 43, 52],
 [58, 1, 15, 47, 58, 47, 64, 43, 52, 10]]
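
The rows above are easier to read as characters. The sketch below applies the same windowing logic to the first 14 characters of the corpus, working on the raw string instead of integer ids; `sliding_windows` is a hypothetical copy of `build_dataset` so that it does not shadow the notebook's function.

```python
def sliding_windows(seq, k):
    """Pair every length-k context with the element that follows it."""
    X, y = [], []
    for i in range(k, len(seq)):
        X.append(seq[i - k:i])  # context: the k previous elements
        y.append(seq[i])        # target: the next element
    return X, y

X_toy, y_toy = sliding_windows("First Citizen:", k=10)
for ctx, tgt in zip(X_toy, y_toy):
    print(repr(ctx), "->", repr(tgt))
# 'First Citi' -> 'z'
# 'irst Citiz' -> 'e'
# 'rst Citize' -> 'n'
# 'st Citizen' -> ':'
```

A sequence of length `N` yields `N - k` examples, and consecutive contexts overlap in all but one position.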

In [11]:
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn

In [12]:
# 6. batching, using PyTorch

def to_loader(X, y, shuffle=True, batch_size=128):
    X = torch.tensor(X, dtype=torch.long)
    y = torch.tensor(y, dtype=torch.long)
    
    dataset = TensorDataset(X, y)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

    return loader

In [13]:
train_loader = to_loader(X_train, y_train, shuffle=True)
val_loader   = to_loader(X_val, y_val, shuffle=False)

In [14]:
for xb, yb in train_loader:
    print(xb.shape, yb.shape)
    print(xb)
    print(yb)
    break


torch.Size([128, 10]) torch.Size([128])
tensor([[ 0, 35, 46,  ..., 39, 50, 53],
        [ 1, 44, 39,  ..., 57,  1, 52],
        [ 1, 25, 39,  ...,  1, 20, 39],
        ...,
        [58,  1, 53,  ..., 63,  1, 40],
        [ 1, 59, 57,  ...,  1, 47, 52],
        [58,  0, 41,  ..., 43, 56, 44]])
tensor([59, 53, 41, 50, 43,  1,  0, 39, 57, 52, 46, 52, 50, 59, 58, 50, 53, 53,
        17, 39,  6,  1, 56, 43, 59,  1,  1,  1, 54, 45, 47, 58, 58, 50, 50,  6,
        63, 46, 60, 53, 51,  0, 52, 57, 39, 43, 42, 47, 39, 59, 33, 50,  1,  1,
         5, 57, 42, 47, 24, 47,  1, 53, 43,  1, 47, 44, 59, 57,  1, 60, 53,  8,
        59,  1, 46, 43, 44, 17, 15,  8, 39, 56, 56, 43, 46,  1, 58, 32, 57, 44,
        39, 46, 52, 10,  6, 53, 26,  5,  1, 46, 41,  0, 47, 50,  0, 57, 57, 43,
        57, 13,  1, 10,  8,  1, 47,  1,  1, 57, 20,  6, 58, 39, 53, 50, 57, 56,
        58, 43])
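
The number of batches per epoch follows from the split sizes printed earlier. The sketch below assumes the `DataLoader` default `drop_last=False`, so the final batch may be smaller than `batch_size`; `num_batches` is a hypothetical helper.

```python
import math

def num_batches(n_tokens, k, batch_size):
    """Return (batches per epoch, size of the leftover final batch)."""
    n_examples = n_tokens - k  # one example per position i in range(k, n_tokens)
    # A leftover of 0 would mean the final batch is full
    return math.ceil(n_examples / batch_size), n_examples % batch_size

print(num_batches(1003854, k=10, batch_size=128))  # (7843, 68)
print(num_batches(111540, k=10, batch_size=128))   # (872, 42)
```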


### Forward Pass
```
indices → embeddings → flatten → hidden (tanh) → logits = Wx + Uh + b
```

Parameters (names & shapes):
- `C` (embedding): `nn.Embedding(V, m)` maps indices `[B, k]` to embeddings `[B, k, m]`, where `B` is the batch size.
- Flatten: `x = emb.reshape(B, k*m)` → `[B, k*m]`.
- Hidden layer: `nn.Linear(k*m, H)`, with `H` the hidden size, computes the pre-activation `Hx + b_h` → `[B, H]`.
- Activation: `h = torch.tanh(Hx + b_h)` → `[B, H]`.
- `W` (linear from input to logits): `nn.Linear(k*m, V, bias=False)` gives `Wx` → `[B, V]`.
- `U` (linear from hidden to logits): `nn.Linear(H, V, bias=False)` gives `Uh` → `[B, V]`.
- `b` (output bias): `nn.Parameter(torch.zeros(V))`, or folded into one of the output layers via `nn.Linear(..., V, bias=True)`.

In [15]:
# init params
k = 10 # context length
V = len(vocab) # vocab size
m = 32 # embedding dimension
h = 128 # hidden units

In [16]:
class BengioLM(nn.Module):
    def __init__(self, vocab_size, context_len, embed_dim, hidden_dim):
        super().__init__()

        # hyperparameters 
        self.V = vocab_size
        self.k = context_len
        self.m = embed_dim
        self.h = hidden_dim

        # ---- Layers ----
        # Matrix C (embedding matrix)
        self.embedding = nn.Embedding(self.V, self.m)

        # Hidden layer: Hx + b
        self.hidden_linear = nn.Linear(self.k * self.m, self.h)

        # Linear shortcut Wx (no bias)
        self.input_to_vocab = nn.Linear(self.k * self.m, self.V, bias=False)

        # Nonlinear path Uh (no bias)
        self.hidden_to_vocab = nn.Linear(self.h, self.V, bias=False)

        # Output bias b
        self.bias = nn.Parameter(torch.zeros(self.V))

    def forward(self, x):
        # x shape: [B, k]

        # 1. Embedding lookup → [B, k, m]
        emb = self.embedding(x)

        # 2. Flatten context → [B, k*m]
        B = emb.shape[0]
        x_flat = emb.view(B, -1)

        # 3. Hidden tanh layer
        hidden = torch.tanh(self.hidden_linear(x_flat))

        # 4. Bengio output: Wx + Uh + b
        logits = (
            self.input_to_vocab(x_flat)
            + self.hidden_to_vocab(hidden)
            + self.bias
        )

        return logits

In [17]:
model = BengioLM(vocab_size=V, context_len=k, embed_dim=m, hidden_dim=h)

xb = torch.randint(0, V, (32, k))  # fake batch of 32 random contexts
logits = model(xb)

print(logits.shape)


torch.Size([32, 65])


Sanity check on a single batch: verify that the shapes are correct, that the initial `loss` is close to `log(V)` (about 4.17 for `V = 65`), and that gradients exist.

In [18]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [19]:
for xb, yb in train_loader:
    print(xb.shape, xb.dtype, yb.shape, yb.dtype)
    
    # forward pass
    logits = model(xb)
    print(f'logits shape: {logits.shape}')

    loss = criterion(logits, yb)
    print(loss)
    
    break

torch.Size([128, 10]) torch.int64 torch.Size([128]) torch.int64
logits shape: torch.Size([128, 65])
tensor(4.4325, grad_fn=<NllLossBackward0>)
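
The `log(V)` reference value is the cross-entropy of a model that assigns every character the same probability `1/V`. A minimal check, using only the standard library (`vocab_size` is a hypothetical name for the notebook's `V`):

```python
import math

vocab_size = 65
# Cross-entropy when every character gets probability 1/vocab_size
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))  # 4.1744
```

The observed 4.4325 is somewhat above this baseline, which is plausible for an untrained model: randomly initialized weights produce logits that are not exactly uniform, so some probability mass lands on wrong characters by chance.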


**One-batch overfit test**

In [20]:
xb, yb = next(iter(train_loader))

for step in range(500):
    optimizer.zero_grad()  # reset gradients of all parameters managed by the Adam optimizer
    logits = model(xb)     # forward pass
    loss = criterion(logits, yb)
    if step % 25 == 0: print(f"{step}: {loss}")  
    loss.backward()    # backward pass (compute gradients)
    optimizer.step()   # update parameters

0: 4.306519508361816
25: 0.2483793944120407
50: 0.034569647163152695
75: 0.016709206625819206
100: 0.011265597306191921
125: 0.008346375077962875
150: 0.0064804451540112495
175: 0.005197469145059586
200: 0.004273149184882641
225: 0.0035832326393574476
250: 0.003053407184779644
275: 0.002636855700984597
300: 0.002302915323525667
325: 0.0020306732039898634
350: 0.001805576030164957
375: 0.0016171414172276855
400: 0.0014576433459296823
425: 0.0013214421924203634
450: 0.001204045140184462
475: 0.0011021110694855452


In [21]:
import math

In [22]:
def train_model(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cpu"):
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(1, epochs + 1):
        # TRAINING
        model.train()
        total_train_loss = 0

        for xb, yb in train_loader:
            xb = xb.to(device)
            yb = yb.to(device)

            optimizer.zero_grad()

            logits = model(xb)
            loss = criterion(logits, yb)

            loss.backward()
            optimizer.step()

            total_train_loss += loss.item()

        avg_train_loss = total_train_loss / len(train_loader)
        train_ppl = math.exp(avg_train_loss)

        # VALIDATION
        model.eval()
        total_val_loss = 0

        with torch.no_grad():
            for xb, yb in val_loader:
                xb = xb.to(device)
                yb = yb.to(device)

                logits = model(xb)
                loss = criterion(logits, yb)

                total_val_loss += loss.item()

        avg_val_loss = total_val_loss / len(val_loader)
        val_ppl = math.exp(avg_val_loss)

        # LOGGING
        print(f"Epoch {epoch:02d} | "
              f"Train Loss: {avg_train_loss:.3f} | Train PPL: {train_ppl:.2f} | "
              f"Val Loss: {avg_val_loss:.3f} | Val PPL: {val_ppl:.2f}")

In [23]:
train_model(
    model,
    train_loader,
    val_loader,
    epochs=15,
    lr=1e-3,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

Epoch 01 | Train Loss: 1.981 | Train PPL: 7.25 | Val Loss: 1.908 | Val PPL: 6.74
Epoch 02 | Train Loss: 1.715 | Train PPL: 5.56 | Val Loss: 1.851 | Val PPL: 6.37
Epoch 03 | Train Loss: 1.652 | Train PPL: 5.22 | Val Loss: 1.813 | Val PPL: 6.13
Epoch 04 | Train Loss: 1.620 | Train PPL: 5.05 | Val Loss: 1.803 | Val PPL: 6.07
Epoch 05 | Train Loss: 1.599 | Train PPL: 4.95 | Val Loss: 1.790 | Val PPL: 5.99
Epoch 06 | Train Loss: 1.584 | Train PPL: 4.87 | Val Loss: 1.783 | Val PPL: 5.95
Epoch 07 | Train Loss: 1.572 | Train PPL: 4.82 | Val Loss: 1.783 | Val PPL: 5.95
Epoch 08 | Train Loss: 1.563 | Train PPL: 4.77 | Val Loss: 1.774 | Val PPL: 5.89
Epoch 09 | Train Loss: 1.555 | Train PPL: 4.73 | Val Loss: 1.766 | Val PPL: 5.84
Epoch 10 | Train Loss: 1.549 | Train PPL: 4.71 | Val Loss: 1.765 | Val PPL: 5.84
Epoch 11 | Train Loss: 1.543 | Train PPL: 4.68 | Val Loss: 1.762 | Val PPL: 5.82
Epoch 12 | Train Loss: 1.538 | Train PPL: 4.66 | Val Loss: 1.758 | Val PPL: 5.80
Epoch 13 | Train Loss: 1.534
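
The perplexity columns are the exponential of the mean cross-entropy (in nats), which can be read as the effective number of equally likely characters the model is choosing between. A small sketch reproducing one logged value; `perplexity` is a hypothetical helper mirroring the `math.exp` call in `train_model`.

```python
import math

def perplexity(mean_nll):
    """Perplexity from a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll)

print(f"{perplexity(1.758):.2f}")         # 5.80, the epoch 12 validation value
print(f"{perplexity(math.log(65)):.2f}")  # 65.00, the uniform-guess baseline for V = 65
```

By this measure the trained model narrows the choice from 65 equally likely characters to roughly 6.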

In [24]:
import torch.nn.functional as F

In [25]:
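def generate_text(model, seed, stoi, itos, max_new_chars=300, temperature=1.0, device=None):
    model.eval()
    # Default to the model's own device so inputs and weights always match
    # (train_model may have moved the model to CUDA)
    if device is None:
        device = next(model.parameters()).device
    context_len = model.k  # Bengio context size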

    # Convert seed to indices
    context = [stoi[c] for c in seed]

    # Left-pad with index 0 (the newline character in this vocab) if the seed is shorter than the context
    if len(context) < context_len:
        context = [0] * (context_len - len(context)) + context

    context = context[-context_len:]  # keep last k

    generated = seed

    for _ in range(max_new_chars):
        x = torch.tensor(context, dtype=torch.long, device=device).unsqueeze(0)  # [1, k]

        with torch.no_grad():
            logits = model(x)  # [1, V]
            logits = logits[0] / temperature  # temperature scaling

            probs = F.softmax(logits, dim=-1)

            # Sample next char
            next_idx = torch.multinomial(probs, num_samples=1).item()

        # Convert back to char
        next_char = itos[next_idx]
        generated += next_char

        # Slide context window
        context = context[1:] + [next_idx]

    return generated
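
The effect of `temperature` can be seen on a toy logit vector. The sketch below re-implements the scaling step in plain Python for illustration only; `softmax_with_temperature` and `toy_logits` are hypothetical names, and the notebook itself uses `F.softmax` on the scaled logits.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the highest-scoring character (more conservative text), while temperatures above 1 flatten the distribution (more varied, more error-prone text).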

In [26]:
print(generate_text(
    model,
    seed="KING:\n",
    stoi=stoi,
    itos=itos,
    max_new_chars=500,
    temperature=0.8,
))

KING:
Clife, Paris, to his bear that a
Will'd great Ay, and
have thou detion. True and the way.

ISABELLA:
The should witching.

SICINIUS:
He'll diese,
Were had arms:
When that but servise?

DERBY:
Why, soal me, on more.

ROMEO:
So the other?

BENVOLIO:
Howsself of his brother's not but that did the like me what should not a foint of all this natullys he old and stain in a pointly the goldins,
And here to slege curded holy, my lord.

KING RICHARD III:
Now now, hon that you. Tranks,
More than water'd i
