In [1]:
import re
import numpy as np
import torch

**Dataset Cleaning, Tokenization, and Preprocessing**

Run the setup cell below, which opens the ebook of [_Pride and Prejudice_](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) and reads it into the variable `raw_text`.

**Note**: Due to hardware constraints, we'll use only the text of **Chapter 1**, which we've sliced out of `raw_text` and saved to the variable `raw_text_ch1`.

We've cleaned the text and tokenized it into word-based tokens, saved in the following variables:
- `lowered_text` : contains the raw text of Chapter 1 with every character lowercased
- `preprocessed_text` : contains the lowercased text with punctuation marks and special characters removed
- `tokenized_text` : contains the preprocessed text as a list of word-based tokens

We've also created the vocabularies and computed the vocabulary size, saved as the following variables:
- `w2ix` : vocabulary mapping each token to its assigned token ID
- `vocab_size` : the number of entries in `w2ix`
- `ix2w` : inverse vocabulary mapping each token ID back to its word-based token

Using the vocabulary, we created the variable:
- `tokenized_id_text` : the tokenized text of Chapter 1 mapped to token IDs

Lastly, we built bigrams from the token IDs and split them into features and labels, saved as the following variables:
- `bigrams_ch1` : contains the bigram pairs as token IDs for the tokenized text of Chapter 1
- `features` : contains the context token for each bigram
- `labels` : contains the target token for each bigram
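
To make the bigram step concrete, here is a minimal sketch that runs the same pipeline on a made-up sentence. The names `toy_text`, `toy_tokens`, `toy_w2ix`, `toy_ids`, `toy_bigrams`, `toy_features`, and `toy_labels` are illustrative and are not part of the setup cell.

```python
import re

# Toy sentence standing in for the chapter text (illustrative only).
toy_text = "It is a truth, it is a fact."

# Lowercase, strip punctuation, and split on whitespace.
toy_tokens = re.sub(r"[,.]", "", toy_text.lower()).split()

# Sorted unique tokens give a deterministic token-to-ID mapping.
toy_w2ix = {word: ix for ix, word in enumerate(sorted(set(toy_tokens)))}
toy_ids = [toy_w2ix[word] for word in toy_tokens]

# Each bigram pairs a token ID (context) with the ID that follows it (target).
toy_bigrams = list(zip(toy_ids[:-1], toy_ids[1:]))
toy_features = [context for context, _ in toy_bigrams]
toy_labels = [target for _, target in toy_bigrams]

print(toy_tokens)
print(toy_w2ix)
print(toy_bigrams)
```

Each context ID in `toy_features` lines up with the ID that follows it in `toy_labels`, which is why there is always one fewer bigram than there are tokens.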

In [2]:
with open('datasets/book.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# index Chapter 1
raw_text_ch1 = raw_text[1985:6468]

# cleaning and tokenization
lowered_text = raw_text_ch1.lower()
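# remove punctuation and special characters (the character class already covers '-', so '--' needs no separate branch)
preprocessed_text = re.sub(r'[,.:;?_&$!"()\-*\']', '', lowered_text)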
tokenized_text = preprocessed_text.split()

# create vocabularies
unique_tokens = sorted(list(set(tokenized_text)))
w2ix = {word:ix for ix, word in enumerate(unique_tokens)}
vocab_size = len(w2ix)
ix2w = {ix:word for word,ix in w2ix.items()}

# token ID mapping of the text
tokenized_id_text = [w2ix[word] for word in tokenized_text]

# Create Bigrams
bigrams_ch1 = np.array([[tokenized_id_text[i], 
                          tokenized_id_text[i + 1]] 
                          for i in range(len(tokenized_id_text) - 1)])

# Create Features and Labels Tensors
features = torch.tensor(bigrams_ch1[:,0],dtype=torch.long)
labels = torch.tensor(bigrams_ch1[:,1],dtype=torch.long)

**Construct Bigram Model Architecture**

Run the setup cell below to re-create the `NextWordBigram` model class we built earlier and instantiate it as the variable `bigram_model`.

**Note**: We've slightly modified the embedding layer to create embeddings with a dimension size of 6.

In [3]:
from torch import nn
torch.manual_seed(1) # set random seed --do not change!

class NextWordBigram(nn.Module):
    def __init__(self):
        super(NextWordBigram, self).__init__()
        self.embedding = nn.Embedding(vocab_size, 6)
        self.layer1 = nn.Linear(6, 18)
        self.layer2 = nn.Linear(18, vocab_size)
    
    def forward(self,x):
        x = self.embedding(x)
        x = self.layer1(x)
        x = self.layer2(x)
        return x

# Instantiate the bigram model
bigram_model = NextWordBigram()


First, let's initialize the loss function and the optimizer for training.

1. Create an instance of the cross-entropy loss function for multiclass classification and save it to the variable `loss`.

2. Create an instance of the Adam optimizer with a learning rate of `0.01` and save it to the variable `optimizer`.

**Note**: Be sure to run the three **Setup** cells above (importing libraries, preprocessing the dataset, and constructing the bigram architecture) before completing the checkpoints!
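
As a rough illustration of what the loss measures, the sketch below computes the cross-entropy for a single example by hand: the negative log of the softmax probability assigned to the correct token. `nn.CrossEntropyLoss` takes raw scores (logits), applies log-softmax internally, and by default averages this quantity over the batch. The helper `cross_entropy` and the toy scores are hypothetical and only serve to show the arithmetic.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    shifted = [z - max_logit for z in logits]
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[target]

# Hypothetical scores over a 3-token vocabulary; token 0 is the true next token.
toy_logits = [2.0, 1.0, 0.1]
confident_loss = cross_entropy(toy_logits, target=0)

# Equal scores amount to a uniform guess, so the loss equals log(vocabulary size).
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)

print(round(confident_loss, 4), round(uniform_loss, 4))
```

An untrained model tends to start near the uniform case, so a starting loss close to `log(vocab_size)` is a reasonable sanity check.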



In [4]:
import torch.optim as optim
from sklearn.metrics import accuracy_score
torch.manual_seed(1) # set random seed --do not change!

## YOUR SOLUTION HERE ##
loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(bigram_model.parameters(), lr=0.01)

In [5]:
optimizer

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.01
    maximize: False
    weight_decay: 0
)


Let's now create the training loop, which trains the network for `700` epochs and performs the following steps at each iteration:

1. Reset the gradients
2. Apply the forward pass
3. Compute the loss
4. Compute the gradients
5. Update the weights and biases 

Keep track of the loss during training by printing out the loss every 100 epochs.
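
To show what the five steps do under the hood, here is a minimal sketch that fits a single weight in plain Python. It deliberately swaps in mean squared error and plain gradient descent for the cross-entropy loss and Adam optimizer used in this exercise, and the data, learning rate, and epoch count are made up.

```python
# Toy data generated by y = 3 * x, so the ideal weight is 3.0.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w = 0.0     # single trainable weight
lr = 0.05   # learning rate (illustrative value)

for epoch in range(200):
    grad = 0.0                                # 1. reset the gradient
    loss = 0.0
    for x, y in zip(xs, ys):
        pred = w * x                          # 2. forward pass
        loss += (pred - y) ** 2 / len(xs)     # 3. accumulate the mean squared error
        grad += 2 * (pred - y) * x / len(xs)  # 4. accumulate d(loss)/dw
    w -= lr * grad                            # 5. update the weight

print(round(w, 4), round(loss, 6))
```

Because the gradient is accumulated with `+=`, skipping the reset in step 1 would mix in gradients from earlier epochs. PyTorch accumulates gradients the same way, which is why the loop calls `optimizer.zero_grad()` first.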


In [6]:
# initialize model and model components -- do not change!
torch.manual_seed(1)
bigram_model = NextWordBigram()
loss = nn.CrossEntropyLoss()
optimizer = optim.Adam(bigram_model.parameters(), lr=0.01)

## YOUR SOLUTION HERE ##
num_epochs = 700
for epoch in range(num_epochs):
    optimizer.zero_grad()  # reset the gradients
    predictions = bigram_model(features)  # forward pass
    CEloss = loss(predictions, labels)  # compute the cross-entropy loss
    CEloss.backward()  # compute the gradients
    optimizer.step()  # update the weights and biases

    ## DO NOT MODIFY ##
    # keep track of the loss during training every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], CELoss: {CEloss.item():.4f}')

Epoch [100/700], CELoss: 1.9768
Epoch [200/700], CELoss: 1.4131
Epoch [300/700], CELoss: 1.3493
Epoch [400/700], CELoss: 1.3399
Epoch [500/700], CELoss: 1.3375
Epoch [600/700], CELoss: 1.3377
Epoch [700/700], CELoss: 1.3361
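
One plausible reading of the loss levelling off near 1.34 instead of approaching zero: a bigram model sees only the previous token, so whenever the same context token is followed by different tokens in the text, some loss is unavoidable. The lowest average cross-entropy any such model can reach on its training bigrams is the empirical conditional entropy of the next token given the previous one. The sketch below computes that floor for a hypothetical list of token IDs; `bigram_loss_floor` is an illustrative helper, not part of the exercise.

```python
import math
from collections import Counter

def bigram_loss_floor(token_ids):
    """Empirical conditional entropy H(next | previous), in nats."""
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    pair_counts = Counter(pairs)
    context_counts = Counter(context for context, _ in pairs)
    total = len(pairs)
    return -sum(
        (count / total) * math.log(count / context_counts[context])
        for (context, _), count in pair_counts.items()
    )

# Hypothetical IDs: 0 is followed by 1 half the time and by 2 half the time,
# while 1 and 2 are always followed by 0.
toy_ids = [0, 1, 0, 2, 0, 1, 0, 2, 0]
floor = bigram_loss_floor(toy_ids)
print(round(floor, 4))
```

Assuming the setup cells have been run, calling `bigram_loss_floor(tokenized_id_text)` would give the corresponding floor for Chapter 1, which can then be compared with the final training loss.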



Let's now generate text from the trained bigram model!

We've already set the model to evaluation mode and provided the same starting prompt as before: `['it', 'is', 'a', 'truth']` to see if our model can re-create the first sentence in our text: 

```md
it is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife
```

Create a loop that generates the next `20` tokens.

In [7]:
# set the model to evaluation mode
bigram_model.eval()

starting_prompt = ['it', 'is', 'a', 'truth']
num_generated_tokens = 20 

for _ in range(num_generated_tokens):
    ## YOUR SOLUTION HERE ##
    # use only the last token of the prompt as context
    context_token = torch.tensor(w2ix[starting_prompt[-1]], dtype=torch.long)
    token_scores = bigram_model(context_token)
    # greedy decoding: pick the highest-scoring token
    predicted_token_id = torch.argmax(token_scores, dim=0).item()
    predicted_token = ix2w[predicted_token_id]
    starting_prompt.append(predicted_token)

generated_text = ' '.join(starting_prompt)

print(generated_text)

it is a truth universally acknowledged that he is not you must know mrs long says that he is not you must know mrs


Nice, it looks like the trained bigram model generates slightly more coherent text than the untrained bigram model. 

Specifically, it correctly predicted the next three tokens: `'universally'`, `'acknowledged'`, and `'that'`. 

However, we also see the limitations of the bigram model: it quickly falls into repetition, looping over the sequence `'that he is not you must know mrs'`. Since a bigram model uses only a single context token (the previous token) to predict the next token, it can't learn long-range contextual relationships in the text beyond those immediate bigram pairs.
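
The repetition follows directly from greedy decoding. If every token always maps to the same most likely successor, generation becomes a fixed chain that must loop as soon as any token reappears. The sketch below replays this with a lookup table read off the generated output above. The table `greedy_next` and both helper functions are illustrative stand-ins for the trained model, not part of the exercise.

```python
# Most-likely-successor table read off the generated text (a stand-in for the model).
greedy_next = {
    "truth": "universally", "universally": "acknowledged", "acknowledged": "that",
    "that": "he", "he": "is", "is": "not", "not": "you", "you": "must",
    "must": "know", "know": "mrs", "mrs": "long", "long": "says", "says": "that",
}

def greedy_generate(start, steps):
    """Follow the single most likely successor at every step."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(greedy_next[tokens[-1]])
    return tokens

def find_cycle(tokens):
    """Return the repeating segment once any token reappears, else None."""
    seen = {}
    for position, token in enumerate(tokens):
        if token in seen:
            return tokens[seen[token]:position]
        seen[token] = position
    return None

generated = greedy_generate("truth", 20)
cycle = find_cycle(generated)
print(" ".join(generated))
print(cycle)
```

Sampling from the predicted distribution instead of taking the argmax would break the strict loop, but each choice would still depend on only one previous token.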


**Calculate Next Word Probabilities**

In [8]:
import pandas as pd

context_token = 'truth'
index = torch.tensor(w2ix[context_token], dtype=torch.long)
# softmax converts the model's raw scores into probabilities
predict = torch.softmax(bigram_model(index), dim=0)

# pair every vocabulary token with its predicted probability
data = [[ix2w[ix], round(item.item(), 5)] for ix, item in enumerate(predict)]

df = pd.DataFrame(data, columns=['Next Token', 'Probability'])
df.sort_values("Probability", ascending=False).head()

| Token ID | Next Token | Probability |
|---|---|---|
| 289 | universally | 0.50003 |
| 130 | is | 0.49978 |
| 305 | when | 0.00016 |
| 271 | these | 0.00002 |
| 219 | preference | 0.00000 |
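
The near-even split between `'universally'` and `'is'` is what we would expect if `'truth'` appears twice in the training text, each time followed by a different token. The sketch below computes the count-based next-token distribution for two short token runs. Treating these runs as the only occurrences of `'truth'` in Chapter 1 is an assumption, and `next_token_distribution` is an illustrative helper.

```python
from collections import Counter

# Two short token runs containing "truth"; treating them as the only
# occurrences in Chapter 1 is an assumption of this sketch.
runs = [
    ["a", "truth", "universally", "acknowledged"],
    ["this", "truth", "is", "so", "well", "fixed"],
]

def next_token_distribution(runs, context):
    """Relative frequency of each token that directly follows `context`."""
    followers = Counter(
        run[i + 1]
        for run in runs
        for i in range(len(run) - 1)
        if run[i] == context
    )
    total = sum(followers.values())
    return {token: count / total for token, count in followers.items()}

truth_distribution = next_token_distribution(runs, "truth")
print(truth_distribution)
```

When the trained model's softmax output sits close to these relative frequencies, it has essentially learned the bigram counts of the text.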
