<a href="https://colab.research.google.com/github/Erickpython/kodeCamp_5X-MachineLearning/blob/main/Word_Prediction_with_LSTM_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Imports**

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import Counter
import numpy as np

## **Training Data**

In [None]:
sentences = [
    "This movie was fantastic I loved it",
    "Absolutely terrible film waste of time",
    "Great acting and amazing story",
    "Worst movie ever"
]

## **Build Vocabulary**

We reserve two indices for special tokens:
```
0 → <PAD>
1 → <UNK>
```

The `tokenize` function lowercases the provided text and splits it on whitespace. Each distinct word is then assigned the next free index in `word2idx`, in order of first appearance.

In [None]:
def tokenize(text):
    return text.lower().split()

counter = Counter()

for s in sentences:
    counter.update(tokenize(s))

word2idx = {"<PAD>":0, "<UNK>":1}

for word in counter:
    word2idx[word] = len(word2idx)

idx2word = {i:w for w,i in word2idx.items()}

vocab_size = len(word2idx)

print("Vocab size:", vocab_size)

Vocab size: 22


In [None]:
word2idx, idx2word

({'<PAD>': 0,
  '<UNK>': 1,
  'this': 2,
  'movie': 3,
  'was': 4,
  'fantastic': 5,
  'i': 6,
  'loved': 7,
  'it': 8,
  'absolutely': 9,
  'terrible': 10,
  'film': 11,
  'waste': 12,
  'of': 13,
  'time': 14,
  'great': 15,
  'acting': 16,
  'and': 17,
  'amazing': 18,
  'story': 19,
  'worst': 20,
  'ever': 21},
 {0: '<PAD>',
  1: '<UNK>',
  2: 'this',
  3: 'movie',
  4: 'was',
  5: 'fantastic',
  6: 'i',
  7: 'loved',
  8: 'it',
  9: 'absolutely',
  10: 'terrible',
  11: 'film',
  12: 'waste',
  13: 'of',
  14: 'time',
  15: 'great',
  16: 'acting',
  17: 'and',
  18: 'amazing',
  19: 'story',
  20: 'worst',
  21: 'ever'})

In [None]:
counter

Counter({'this': 1,
         'movie': 2,
         'was': 1,
         'fantastic': 1,
         'i': 1,
         'loved': 1,
         'it': 1,
         'absolutely': 1,
         'terrible': 1,
         'film': 1,
         'waste': 1,
         'of': 1,
         'time': 1,
         'great': 1,
         'acting': 1,
         'and': 1,
         'amazing': 1,
         'story': 1,
         'worst': 1,
         'ever': 1})
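To connect the vocabulary to model inputs, the sketch below encodes a sentence as a list of indices and sends out-of-vocabulary words to `<UNK>`. The helper name `encode` and the test sentence are illustrative additions, not part of the notebook.

```python
from collections import Counter

sentences = [
    "This movie was fantastic I loved it",
    "Absolutely terrible film waste of time",
    "Great acting and amazing story",
    "Worst movie ever",
]

def tokenize(text):
    return text.lower().split()

# Rebuild the vocabulary as above: special tokens first,
# then words in order of first appearance.
counter = Counter()
for s in sentences:
    counter.update(tokenize(s))

word2idx = {"<PAD>": 0, "<UNK>": 1}
for word in counter:
    word2idx[word] = len(word2idx)

UNK_IDX = word2idx["<UNK>"]

def encode(text):
    """Hypothetical helper: map each token to its index, or to <UNK> if unseen."""
    return [word2idx.get(tok, UNK_IDX) for tok in tokenize(text)]

# "rocks" is not in the vocabulary, so it maps to index 1.
print(encode("This movie rocks"))  # [2, 3, 1]
```

Looking up `<UNK>` by name rather than hard-coding `1` keeps the encoding correct if the special tokens are ever reordered.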

## **Create Embedding Matrix**

The notebook uses random vectors here. Pre-trained embeddings, such as Word2Vec or GloVe, can be used instead.

### **Word2Vec**
```python
import numpy as np
from gensim.models import Word2Vec

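# Tokenize the training sentences (uses `tokenize` and `sentences` from this notebook).
tokenized_sentences = [tokenize(s) for s in sentences]

# Train Word2Vec embedding model.
embed_model = Word2Vec(
    sentences=tokenized_sentences,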
    vector_size=100,
    window=10,     # context window
    min_count=1,
    workers=4,
    sg=0,          # 1 = skip-gram, 0 = CBOW
)

# Create word index.
word_index = {word: i+1 for i, word in enumerate(embed_model.wv.index_to_key)}

# Create embedding matrix.
embedding_dim = embed_model.vector_size
vocab_size = len(word_index) + 1  # Reserve position 0 for padding.

embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, i in word_index.items():
    embedding_matrix[i] = embed_model.wv[word]
```

In [None]:
# Create embedding matrix
embedding_dim = 50

embedding_matrix = np.random.normal(
    scale=0.6,
    size=(vocab_size, embedding_dim)
)

embedding_matrix[0] = np.zeros(embedding_dim)  # PAD vector
embedding_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.12338714,  0.11808445, -0.22969002, ...,  0.19854284,
        -0.06849557,  1.07413389],
       [ 0.21504217, -0.29340822, -0.23870052, ...,  0.11755211,
        -1.24237687,  0.17780739],
       ...,
       [-0.3254148 , -0.05843513,  0.6612063 , ...,  0.71847021,
        -0.05008986,  0.06231694],
       [-0.44447929, -0.26648494,  0.81670412, ..., -0.57570844,
         0.55597729, -1.12032034],
       [ 0.4258125 ,  0.84553726, -1.20692592, ..., -0.44971787,
         0.56596023, -0.68565776]])
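As a rough check of what the zeroed PAD row buys us, the sketch below loads a random matrix of the same shape into `nn.Embedding.from_pretrained` with `padding_idx=0` and inspects the gradients after a dummy backward pass. The seed, the toy batch, and the dummy loss are assumptions made for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

vocab_size, embedding_dim = 22, 50  # sizes used in this notebook

rng = np.random.default_rng(0)  # fixed seed, for illustration only
embedding_matrix = rng.normal(scale=0.6, size=(vocab_size, embedding_dim))
embedding_matrix[0] = 0.0  # PAD vector

embedding = nn.Embedding.from_pretrained(
    torch.tensor(embedding_matrix).float(),
    padding_idx=0,
    freeze=False,  # allow the vectors to be fine-tuned
)

# One sequence: two real tokens followed by two PAD tokens.
batch = torch.tensor([[2, 3, 0, 0]])
vectors = embedding(batch)
print(vectors.shape)  # (batch, seq_len, embedding_dim)

# Backpropagate a dummy loss and compare the gradient of a PAD row and a word row.
vectors.sum().backward()
pad_grad = embedding.weight.grad[0]
word_grad = embedding.weight.grad[2]
print(pad_grad.abs().sum().item(), word_grad.abs().sum().item())
```

The PAD row receives no gradient, so it stays at zero during training even though `freeze=False` lets the other rows update.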

## **Build Next-Word Dataset**

We convert:
```
"This movie was fantastic"
```

into training pairs, where each context is the list of tokens seen so far:
```
["this"] → movie
["this", "movie"] → was
["this", "movie", "was"] → fantastic
```
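The expansion above can be sketched in plain Python before any tensors are involved. The helper name `make_pairs` is hypothetical; the dataset class in this notebook does the same thing on token indices.

```python
def tokenize(text):
    return text.lower().split()

def make_pairs(sentence):
    """Hypothetical helper: return (context_tokens, next_token) for every prefix."""
    tokens = tokenize(sentence)
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_pairs("This movie was fantastic")
for context, target in pairs:
    print(context, "->", target)
```

A sentence of `n` tokens yields `n - 1` pairs, because the first token has no preceding context to predict it from.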

## **Dataset Class**

In [None]:
class NextWordDataset(Dataset):
    def __init__(self, sentences, word2idx):
        self.samples = []

        for sentence in sentences:
            tokens = tokenize(sentence)
            ids = [word2idx.get(t, 1) for t in tokens]

            for i in range(1, len(ids)):
                context = torch.tensor(ids[:i])
                target = ids[i]
                self.samples.append((context, target))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


In [None]:
dataset = NextWordDataset(sentences, word2idx)
dataset.samples

[(tensor([2]), 3),
 (tensor([2, 3]), 4),
 (tensor([2, 3, 4]), 5),
 (tensor([2, 3, 4, 5]), 6),
 (tensor([2, 3, 4, 5, 6]), 7),
 (tensor([2, 3, 4, 5, 6, 7]), 8),
 (tensor([9]), 10),
 (tensor([ 9, 10]), 11),
 (tensor([ 9, 10, 11]), 12),
 (tensor([ 9, 10, 11, 12]), 13),
 (tensor([ 9, 10, 11, 12, 13]), 14),
 (tensor([15]), 16),
 (tensor([15, 16]), 17),
 (tensor([15, 16, 17]), 18),
 (tensor([15, 16, 17, 18]), 19),
 (tensor([20]), 3),
 (tensor([20,  3]), 21)]

## **Collate Function (Dynamic Padding)**

A collate function is the hook in PyTorch's data loading pipeline that turns a list of samples into a batch. We use it to pad the variable-length contexts to a common length with PyTorch's `pad_sequence` function.

In [None]:
PAD_IDX = 0

def collate_fn(batch):
    contexts = [item[0] for item in batch]
    targets = torch.tensor([item[1] for item in batch])

    padded_contexts = pad_sequence(
        contexts,
        batch_first=True,
        padding_value=PAD_IDX
    )

    return padded_contexts, targets


In [None]:
collate_fn(dataset[10:15])

(tensor([[ 9, 10, 11, 12, 13],
         [15,  0,  0,  0,  0],
         [15, 16,  0,  0,  0],
         [15, 16, 17,  0,  0],
         [15, 16, 17, 18,  0]]),
 tensor([14, 16, 17, 18, 19]))
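Once a batch is padded, the original lengths are no longer visible in the tensor shape. A possible variant, sketched below under the hypothetical name `collate_with_lengths`, returns the unpadded lengths alongside the padded contexts so that later code can tell real tokens from padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0

def collate_with_lengths(batch):
    """Hypothetical variant of collate_fn that also returns unpadded lengths."""
    contexts = [item[0] for item in batch]
    targets = torch.tensor([item[1] for item in batch])
    lengths = torch.tensor([len(c) for c in contexts])
    padded = pad_sequence(contexts, batch_first=True, padding_value=PAD_IDX)
    return padded, lengths, targets

# Three samples taken from the dataset listing above.
batch = [
    (torch.tensor([9, 10, 11, 12, 13]), 14),
    (torch.tensor([15]), 16),
    (torch.tensor([15, 16]), 17),
]
padded, lengths, targets = collate_with_lengths(batch)
print(padded)
print(lengths)
```

The lengths can be used to index the last real token of each row or to pack the batch with `torch.nn.utils.rnn.pack_padded_sequence`.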

## **DataLoader**

Create a `DataLoader` from an instance of the dataset. Passing `collate_fn` lets the loader pad each batch on the fly, so every batch is only as wide as its longest context. A batch size of 4 is used here; with more data, a larger batch size is practical.

In [None]:
loader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    collate_fn=collate_fn
)
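To see what dynamic padding means in practice, the sketch below iterates a loader over 17 toy samples (the same count as this notebook's dataset, but with made-up contents) and records each batch shape. `shuffle=False` is used here only to make the shapes reproducible.

```python
import math
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Toy stand-in for dataset.samples: 17 (context, target) pairs with lengths 1..17.
samples = [(torch.arange(2, 2 + n), 1) for n in range(1, 18)]

def collate_fn(batch):
    contexts = [c for c, _ in batch]
    targets = torch.tensor([t for _, t in batch])
    return pad_sequence(contexts, batch_first=True, padding_value=0), targets

loader = DataLoader(samples, batch_size=4, shuffle=False, collate_fn=collate_fn)

# Each batch is padded only to the length of its own longest context.
shapes = [tuple(contexts.shape) for contexts, _ in loader]
print(len(loader), math.ceil(len(samples) / 4))
print(shapes)
```

The last batch holds the single leftover sample; passing `drop_last=True` to `DataLoader` would discard it instead.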

## **Next Word Predictor Model**

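The model feeds token indices through an embedding layer initialized from `embedding_matrix`, then an LSTM, and finally a dense layer that maps the LSTM output at the last timestep to one score per vocabulary word. No softmax is needed because `nn.CrossEntropyLoss` applies log-softmax internally.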
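The claim about softmax can be checked numerically. The sketch below compares `nn.CrossEntropyLoss` on raw logits with an explicit log-softmax followed by negative log-likelihood; the random logits and target indices are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 22)            # (batch, vocab_size) raw scores
targets = torch.tensor([3, 4, 5, 21])  # arbitrary class indices

# Cross entropy applied directly to the logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual))  # True
```

Adding a softmax layer to the model before this loss would therefore normalize the scores twice.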

In [None]:
class NextWordPredictor(nn.Module):
    def __init__(self, embedding_matrix):
        super().__init__()

        vocab_size, embedding_dim = embedding_matrix.shape

        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix).float(),
            padding_idx=0,
            freeze=False
        )

        self.lstm = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=128,
            batch_first=True
        )

        self.fc = nn.Linear(128, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        output, (hidden, cell) = self.lstm(x)

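        # Last timestep. In a right-padded batch this is a PAD position
        # for every sequence shorter than the longest one in the batch.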
        logits = self.fc(output[:, -1, :])

        return logits
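Because batches are right-padded, `output[:, -1, :]` reads the LSTM state after the padding for every context shorter than the longest one in the batch. One possible remedy, sketched below, is to pick the output at the last real token of each row. The class name `LengthAwarePredictor` is hypothetical, and for brevity it uses a freshly initialized `nn.Embedding` rather than `embedding_matrix`; it assumes right padding with index 0 and at least one real token per row.

```python
import torch
import torch.nn as nn

class LengthAwarePredictor(nn.Module):
    """Hypothetical variant of NextWordPredictor that ignores trailing padding."""

    def __init__(self, vocab_size, embedding_dim=50, hidden_size=128, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        # Number of real (non-PAD) tokens in each row.
        lengths = (x != self.pad_idx).sum(dim=1)
        output, _ = self.lstm(self.embedding(x))
        # Select the LSTM output at the last real token of each row.
        rows = torch.arange(x.size(0), device=x.device)
        last = output[rows, lengths - 1]
        return self.fc(last)

torch.manual_seed(0)
model = LengthAwarePredictor(vocab_size=22).eval()

with torch.no_grad():
    unpadded = model(torch.tensor([[15, 16]]))
    padded = model(torch.tensor([[15, 16, 0, 0, 0]]))

# The same context gives the same logits with or without trailing padding.
print(torch.allclose(unpadded, padded, atol=1e-5))
```

With this indexing, the model sees a context the same way during batched training and during unpadded inference.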


## **Training Loop**

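The training loop uses a GPU if one is available and otherwise falls back to the CPU. Note that the value printed as `Loss` is `total_loss`, the sum of the batch losses in that epoch; the per-batch mean is stored in `losses`.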

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = NextWordPredictor(embedding_matrix).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)

EPOCHS = 200

losses = []

for epoch in range(EPOCHS):

    total_loss = 0
    mean_loss = 0

    for contexts, targets in loader:

        contexts = contexts.to(device)
        targets = targets.to(device)

        logits = model(contexts)

        loss = criterion(logits, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    mean_loss = total_loss / len(loader)
    losses.append(mean_loss)

    if epoch % 20 == 0:
        print(f"Epoch {epoch} | Loss: {total_loss:.4f}")


Epoch 0 | Loss: 15.5523
Epoch 20 | Loss: 1.5252
Epoch 40 | Loss: 0.4645
Epoch 60 | Loss: 0.0417
Epoch 80 | Loss: 0.0544
Epoch 100 | Loss: 0.0222
Epoch 120 | Loss: 0.0135
Epoch 140 | Loss: 0.0084
Epoch 160 | Loss: 0.0061
Epoch 180 | Loss: 0.0057
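To put the printed numbers in context, the sketch below converts the epoch 0 value into a per-batch mean and compares it with the loss of a uniform guess over the vocabulary. The sample count of 17 is taken from the dataset listing earlier in this notebook; everything else is arithmetic on the log above.

```python
import math

vocab_size = 22
num_samples = 17   # training pairs produced by NextWordDataset
batch_size = 4
num_batches = math.ceil(num_samples / batch_size)

printed_total = 15.5523                   # value reported at epoch 0
mean_loss = printed_total / num_batches   # average cross entropy per batch
uniform_baseline = math.log(vocab_size)   # loss of a uniform guess over 22 words

print(f"batches per epoch: {num_batches}")
print(f"mean loss:         {mean_loss:.4f}")
print(f"uniform baseline:  {uniform_baseline:.4f}")
print(f"perplexity:        {math.exp(mean_loss):.2f}")
```

The mean loss at epoch 0 is close to the uniform baseline, which is what one would expect from an untrained model.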


## **Test the Model (Inference)**

In [None]:
def predict_next(model, text, word2idx, idx2word):

    model.eval()

    tokens = tokenize(text)
    ids = [word2idx.get(t,1) for t in tokens]

    x = torch.tensor(ids).unsqueeze(0).to(device)

    with torch.no_grad():
        logits = model(x)
        pred_id = torch.argmax(logits, dim=1).item()

    return idx2word[pred_id]


print(predict_next(model, "this movie was", word2idx, idx2word))


fantastic


In [None]:
predict_next(model, "fantastic movie", word2idx, idx2word)

'ever'
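`predict_next` returns only the single most likely word. When inspecting a model it can help to see the runners-up and how confident the model is. The helper below, hypothetically named `top_k_words`, is a sketch of that idea; it is demonstrated on made-up logits over a four-word vocabulary, not on output from the trained model.

```python
import torch

def top_k_words(logits, idx2word, k=3):
    """Hypothetical helper: return the k most probable words with their probabilities."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, k)
    return [
        (idx2word[i], round(p, 3))
        for i, p in zip(top_ids.tolist(), top_probs.tolist())
    ]

# Illustrative logits over a made-up four-word vocabulary (not model output).
toy_idx2word = {0: "<PAD>", 1: "<UNK>", 2: "movie", 3: "ever"}
toy_logits = torch.tensor([0.0, 0.0, 1.0, 3.0])

result = top_k_words(toy_logits, toy_idx2word, k=2)
print(result)
```

With the model from this notebook, the call would take the first row of the logits, for example `top_k_words(model(x)[0], idx2word)`, since the model returns a tensor of shape `(1, vocab_size)` for a single prompt.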

## **Insight**

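This pipeline uses teacher forcing implicitly: the model always receives the true previous tokens, not its own predictions.

GPT-style models are trained on the same principle, although they predict the next token at every position of a sequence in a single pass rather than building one sample per prefix.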
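The link between the prefix samples built in this notebook and the shifted-sequence targets used when a model predicts every position at once can be sketched with plain lists. The token ids are those of "this movie was fantastic" under the vocabulary above.

```python
ids = [2, 3, 4, 5]  # "this movie was fantastic" encoded with word2idx

# Prefix-style pairs, as built by NextWordDataset: one sample per prefix.
prefix_pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]

# Shifted-sequence form: one sample per sentence, one target per position.
inputs = ids[:-1]
targets = ids[1:]

print(prefix_pairs)     # [([2], 3), ([2, 3], 4), ([2, 3, 4], 5)]
print(inputs, targets)  # [2, 3, 4] [3, 4, 5]

# Both forms carry the same supervision: position t is trained to predict token t + 1.
assert [target for _, target in prefix_pairs] == targets
```

In both forms the inputs come from the training text itself, which is what makes this teacher forcing.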

A simplistic text generator can be obtained from this model by chaining its predictions: append the predicted word to the prompt, feed the result back into the model, and repeat.

In [None]:
# Simplistic Generative AI.

prompt = "fantastic movie"
for i in range(20):
    next_word = predict_next(model, prompt, word2idx, idx2word)
    prompt = prompt + " " + next_word
    print(prompt)
    if next_word == "<UNK>":
        break

prompt

fantastic movie ever
fantastic movie ever loved
fantastic movie ever loved it
fantastic movie ever loved it it
fantastic movie ever loved it it it
fantastic movie ever loved it it it it
fantastic movie ever loved it it it it i
fantastic movie ever loved it it it it i loved
fantastic movie ever loved it it it it i loved it
fantastic movie ever loved it it it it i loved it it
fantastic movie ever loved it it it it i loved it it movie
fantastic movie ever loved it it it it i loved it it movie ever
fantastic movie ever loved it it it it i loved it it movie ever ever
fantastic movie ever loved it it it it i loved it it movie ever ever ever
fantastic movie ever loved it it it it i loved it it movie ever ever ever ever
fantastic movie ever loved it it it it i loved it it movie ever ever ever ever ever
fantastic movie ever loved it it it it i loved it it movie ever ever ever ever ever ever
fantastic movie ever loved it it it it i loved it it movie ever ever ever ever ever ever ever
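fantastic movie ever loved it it it it i loved it it movie ever ever ever ever ever ever ever ever
fantastic movie ever loved it it it it i loved it it movie ever ever ever ever ever ever ever ever ever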

'fantastic movie ever loved it it it it i loved it it movie ever ever ever ever ever ever ever ever ever'

Unfortunately, our model is too simplistic and was trained on too little data to generate any meaningful sentences.

With only 4 sentences:

👉 The model is memorizing the training sentences, not learning language.

For a more realistic demo, train it on a larger corpus such as:

- IMDb
- WikiText-2
- Tiny Shakespeare

With that much more text, the generated sentences should become noticeably more coherent.
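Independently of data size, the repeated words in the generated output are partly an effect of always taking the `argmax`, which is deterministic and can get stuck in a loop. A common alternative is to sample from the softmax distribution, optionally scaled by a temperature. The sketch below uses a hypothetical helper `sample_next_id` and made-up logits over a four-word vocabulary, not output from the trained model.

```python
import torch

def sample_next_id(logits, temperature=1.0, generator=None):
    """Hypothetical helper: sample a token id from temperature-scaled logits."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()

# Illustrative logits over a made-up four-word vocabulary (not model output).
toy_logits = torch.tensor([0.0, 0.0, 1.0, 3.0])

gen = torch.Generator().manual_seed(0)  # fixed seed, for illustration only

# At temperature 1.0 the draws follow the softmax probabilities.
draws = [sample_next_id(toy_logits, temperature=1.0, generator=gen) for _ in range(200)]
print(sorted(set(draws)))  # ids that were sampled at least once

# A very low temperature puts almost all probability on the argmax (id 3).
cold = [sample_next_id(toy_logits, temperature=0.01, generator=gen) for _ in range(20)]
print(set(cold))
```

In the generation loop, this would replace the `torch.argmax` call inside `predict_next`. Sampling adds variety but does not make up for a model trained on too little text.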