In [1]:
import re
import numpy as np
import torch

**Dataset Cleaning, Tokenization, and Preprocessing**

Run the setup cell below, which opens the ebook of [_Pride and Prejudice_](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) and reads it into the variable `raw_text`.

**Note**: Due to hardware constraints, we'll use only the text of **Chapter 1**, which we've sliced out of `raw_text` and saved to the variable `raw_text_ch1`.

We've cleaned the Chapter 1 text and split it into word-based tokens, saved as the following variables:
- `lowered_text` : contains the raw text of Chapter 1 with every character lowercased
- `preprocessed_text` : contains the lowercased text with punctuation marks and special characters removed
- `tokenized_text` : contains the preprocessed text as a list of word-based tokens

We've also created the vocabularies and the vocabulary size, saved as the following variables:
- `w2ix` : vocabulary mapping each token to its assigned token ID
- `vocab_size` : the number of entries in `w2ix`
- `ix2w` : inverse vocabulary mapping each token ID back to its word-based token

Using the vocabulary, we created the variables:
- `tokenized_id_text` : the tokenized text of Chapter 1 mapped to token IDs

Lastly, we built the bigrams and split them into features and labels, saved as the following variables:
- `bigrams_ch1` : contains the bigram pairs as token IDs for the tokenized text of Chapter 1
- `features` : contains the context token for each bigram
- `labels` : contains the target token for each bigram

In [2]:
with open('datasets/book.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# index Chapter 1
raw_text_ch1 = raw_text[1985:6468]

# cleaning and tokenization
lowered_text = raw_text_ch1.lower()
preprocessed_text = re.sub(r'[,.:;?_&$!"()\-*\']', '', lowered_text)
tokenized_text = preprocessed_text.split()

# create vocabularies
unique_tokens = sorted(list(set(tokenized_text)))
w2ix = {word:ix for ix, word in enumerate(unique_tokens)}
vocab_size = len(w2ix)
ix2w = {ix:word for word,ix in w2ix.items()}

# token ID mapping of the text
tokenized_id_text = [w2ix[word] for word in tokenized_text]

# Create Bigrams
bigrams_ch1 = np.array([[tokenized_id_text[i], 
                          tokenized_id_text[i + 1]] 
                          for i in range(len(tokenized_id_text) - 1)])

# Create Features and Labels Tensors
features = torch.tensor(bigrams_ch1[:,0],dtype=torch.long)
labels = torch.tensor(bigrams_ch1[:,1],dtype=torch.long)
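
To make the bigram construction concrete, here is a minimal sketch on a toy sentence. The sentence and the `toy_*` names are illustrative assumptions, not part of the notebook's dataset; plain Python lists stand in for the NumPy array and tensors used above.

```python
# Toy sentence (hypothetical, not from the book) to illustrate the steps above.
toy_tokens = "the cat sat on the mat".split()

# Vocabulary built the same way as w2ix: sorted unique tokens mapped to IDs.
toy_w2ix = {word: ix for ix, word in enumerate(sorted(set(toy_tokens)))}
toy_ids = [toy_w2ix[word] for word in toy_tokens]

# Each bigram pairs a token ID with the ID of the token that follows it.
toy_bigrams = list(zip(toy_ids, toy_ids[1:]))

# The first element of each pair is the context (feature),
# the second is the target (label).
toy_features = [context for context, _ in toy_bigrams]
toy_labels = [target for _, target in toy_bigrams]

print(toy_w2ix)
print(toy_bigrams)
```

Six tokens yield five bigrams, so `features` and `labels` are always one element shorter than the tokenized text.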

We've started constructing the `NextWordBigram` class for our bigram model from the narrative.

Complete the `__init__()` and `forward()` methods with the following architecture:

1. In the `__init__()` method, initialize the following layers:
    - `self.embedding` for the embedding layer, where the number of embeddings equals the **vocabulary size** and each embedding has **2** dimensions
    - `self.linear1` for the first linear layer with an input size of **2** and an output size of **18**
    - `self.linear2` for the second linear layer with an input size of **18** and an output size equal to the **vocabulary size**
    

2. In the `forward()` method, create the forward operations starting with the input `x` in the following order: 
    1. `self.embedding`
    2. `self.linear1`
    3. `self.linear2`
    4. return the output from `self.linear2`
    


In [3]:
from torch import nn
torch.manual_seed(1) # set random seed --do not change!

class NextWordBigram(nn.Module):
    def __init__(self):
        super(NextWordBigram, self).__init__()
        ## YOUR SOLUTION HERE ##
        self.embedding = nn.Embedding(num_embeddings=vocab_size, 
                                      embedding_dim=2)
        self.linear1 = nn.Linear(2,18)
        self.linear2 = nn.Linear(18, vocab_size)                
        
    def forward(self, x):
        ## YOUR SOLUTION HERE ##
        x = self.embedding(x)
        x = self.linear1(x)
        x = self.linear2(x)
        
        return x
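
As a quick check on the architecture, the sketch below traces tensor shapes through the same three layers. It uses a small stand-in vocabulary size (`toy_vocab_size`, a hypothetical value) and a hypothetical batch of three token IDs, so it runs without the setup cell.

```python
import torch
from torch import nn

toy_vocab_size = 5  # hypothetical stand-in for vocab_size

# The same layer sizes as NextWordBigram, built standalone for inspection.
embedding = nn.Embedding(num_embeddings=toy_vocab_size, embedding_dim=2)
linear1 = nn.Linear(2, 18)
linear2 = nn.Linear(18, toy_vocab_size)

# A hypothetical batch of three context token IDs.
context_ids = torch.tensor([4, 0, 3], dtype=torch.long)

with torch.no_grad():
    embedded = embedding(context_ids)  # one 2-dimensional vector per token ID
    hidden = linear1(embedded)         # projected to 18 features
    scores = linear2(hidden)           # one score per vocabulary entry

print(embedded.shape, hidden.shape, scores.shape)
```

The last dimension of the output always equals the vocabulary size, which is what lets `torch.argmax` pick a token ID from the scores.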

Next, let's create an instance of the `NextWordBigram` class and save it to the variable `bigram_model`.

Be sure to set the model to **evaluation mode**.

In [6]:
bigram_model = NextWordBigram()
bigram_model.eval()

NextWordBigram(
  (embedding): Embedding(321, 2)
  (linear1): Linear(in_features=2, out_features=18, bias=True)
  (linear2): Linear(in_features=18, out_features=321, bias=True)
)


Lastly, let's generate some text from the untrained bigram model!

Let's see if the model can re-create the first sentence in Chapter 1: 

```md
it is a truth universally acknowledged that a single man in possession of a good fortune, must be in want of a wife
```

We've provided the first four tokens `['it', 'is', 'a', 'truth']` as the starting prompt. The bigram model will generate (predict) the next `10` tokens starting with the last token `'truth'` as the first context token.

Create a loop that generates the next 10 tokens:
1. Select the last token as the context token
2. Pass the context token through the model's forward pass to generate token scores
3. Use `torch.argmax` to select the token ID with the highest score (the predicted token)
4. Convert the predicted token ID to the actual token using the reverse vocabulary `ix2w`
5. Append the generated token to the starting prompt

Join the tokens in the starting prompt using `' '.join()` and save it to the variable `generated_text`.

Print `generated_text`.


In [7]:
starting_prompt = ['it', 'is', 'a', 'truth']
num_generated_tokens = 10
with torch.no_grad():
    for _ in range(num_generated_tokens):
        context_token = torch.tensor(w2ix[starting_prompt[-1]], dtype=torch.long)
        token_scores = bigram_model(context_token)
        predicted_token_id = torch.argmax(token_scores, dim=0).item()
        predicted_token = ix2w[predicted_token_id]
        starting_prompt.append(predicted_token)
        
generated_text = ' '.join(starting_prompt)

# show output - generated text
print(generated_text)

it is a truth bit high so bit high so bit high so bit


**Explaining the text generation output**

The text generated by the untrained model doesn't make much sense: it repeats the tokens `'bit'`, `'high'`, and `'so'` in a loop.

Here is an explanation of how the bigram model generates a new token at each iteration:
- the first context token `'truth'` is predicted to have the next token `'bit'`
- the previously predicted token `'bit'` becomes the next context token, which is used to predict the next token `'high'`
- the previously predicted token `'high'` becomes the next context token, which is used to predict the next token `'so'`
- the previously predicted token `'so'` becomes the next context token, which is used to predict the next token `'bit'`
    - because `torch.argmax` always selects the same next token for a given context token, this causes the repeated predictions `'bit' ==> 'high' ==> 'so' ==> 'bit'`
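
The loop can be reproduced without the model. The sketch below treats greedy decoding as a fixed lookup table, using the transitions read off the printed output (`truth`, `bit`, `high`, `so`, then back to `bit`); the `greedy_next` dictionary is an illustrative stand-in for `torch.argmax` over the model's scores, not part of the notebook.

```python
# Transitions read off the printed output above. With argmax decoding, each
# context token always maps to the same next token, so a dict can stand in
# for the model here.
greedy_next = {"truth": "bit", "bit": "high", "high": "so", "so": "bit"}

prompt = ["it", "is", "a", "truth"]
for _ in range(10):
    # The last token is the only context a bigram model sees.
    prompt.append(greedy_next[prompt[-1]])

generated = " ".join(prompt)
print(generated)
```

Because the mapping is deterministic and depends only on the last token, the sequence is locked into a cycle as soon as any token repeats.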