In [1]:
import re
import numpy as np
import torch

**Dataset Cleaning and Tokenization**

Run the setup cell below, which opens the ebook of [_Pride and Prejudice_](https://www.gutenberg.org/files/1342/1342-h/1342-h.htm) and reads it into the variable `raw_text`.

**Note**: Due to hardware constraints, we'll only use the text of **Chapter 1**, which we've sliced out of `raw_text` and saved to the variable `raw_text_ch1`.

We've cleaned the text and tokenized it into individual word-based tokens, stored in the following variables:
- `lowered_text` : the raw text of Chapter 1 with every character lowercased
- `preprocessed_text` : the lowercased text with punctuation marks and special characters removed
- `tokenized_text` : the preprocessed text split into a list of word-based tokens

We've also created the vocabularies and saved the vocabulary size in the following variables:
- `w2ix` : vocabulary mapping each token to its assigned token ID
- `vocab_size` : the number of entries in `w2ix`
- `ix2w` : inverse vocabulary mapping each token ID back to its word-based token

Using the vocabulary, we created the variable:
- `tokenized_id_text` : the tokenized text of Chapter 1 mapped to token IDs

In [2]:
with open('datasets/book.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

# index Chapter 1
raw_text_ch1 = raw_text[1985:6468]

# cleaning and tokenization
lowered_text = raw_text_ch1.lower()
preprocessed_text = re.sub(r'[,.:;?_&$!"()\-*\']', '', lowered_text)
tokenized_text = preprocessed_text.split()

# create vocabularies
unique_tokens = sorted(set(tokenized_text))
w2ix = {word:ix for ix, word in enumerate(unique_tokens)}
vocab_size = len(w2ix)
ix2w = {ix:word for word,ix in w2ix.items()}

# token ID mapping of the text
tokenized_id_text = [w2ix[word] for word in tokenized_text]

print("Vocabulary Size (Chapter 1):", vocab_size)

Vocabulary Size (Chapter 1): 321



The vocabulary size of Chapter 1 is **321**, significantly smaller than the vocabulary size of the full text (7338). Again, we're only using the text from Chapter 1 because of hardware constraints.
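
As a small illustration of the cleaning and vocabulary steps in the setup cell, here is a minimal sketch that runs the same pipeline on a toy sentence. The sentence and all `toy_*` names are hypothetical and are not part of the notebook's data; the point is only to show that encoding with the forward vocabulary and decoding with the inverse vocabulary recovers the original tokens.

```python
import re

# Hypothetical toy sentence (not read from the book file).
toy_text = "It is a truth, universally acknowledged; it is a TRUTH!"

# Same kind of cleaning as the setup cell: lowercase, strip punctuation, split on whitespace.
toy_lowered = toy_text.lower()
toy_clean = re.sub(r'[,.:;?_&$!"()\-*\']', '', toy_lowered)
toy_tokens = toy_clean.split()

# Build the forward and inverse vocabularies from the sorted unique tokens.
toy_w2ix = {word: ix for ix, word in enumerate(sorted(set(toy_tokens)))}
toy_ix2w = {ix: word for word, ix in toy_w2ix.items()}

# Encode to token IDs, then decode back to words.
toy_ids = [toy_w2ix[word] for word in toy_tokens]
toy_decoded = [toy_ix2w[ix] for ix in toy_ids]

print(toy_tokens)
print(toy_ids)
print(toy_decoded == toy_tokens)
```

Because the unique tokens are sorted before IDs are assigned, the IDs follow alphabetical order and are reproducible from run to run.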

Let's now create bigrams for the tokenized text of Chapter 1.

Using the token IDs in `tokenized_id_text`, create a NumPy array of bigrams, where in each bigram pair:
- the first token is the token ID of the context token
- the second token is the token ID of the target token

Save the bigrams to the variable `bigrams_ch1`. 

Print out the number of bigrams created, as well as the first 10 bigrams.


In [3]:
bigrams_ch1 = np.array([[tokenized_id_text[i], tokenized_id_text[i+1]] for i in range(len(tokenized_id_text)-1)])

# show output - number of bigrams and first 10 bigrams
print("Number of bigrams in Chapter 1:", len(bigrams_ch1))
print("First 10 bigrams: \n", bigrams_ch1[:10])

Number of bigrams in Chapter 1: 854
First 10 bigrams: 
 [[131 130]
 [130   0]
 [  0 284]
 [284 289]
 [289   4]
 [  4 264]
 [264   0]
 [  0 244]
 [244 156]
 [156 124]]
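
A sequence of n tokens yields n - 1 bigrams, since every token except the last is paired with its successor. The sketch below illustrates this on a hypothetical six-token sequence (the `toy_*` names and IDs are made up for illustration) and also shows a vectorized alternative to the list comprehension, built with `np.column_stack`. It is one possible way to write the same step, not a required one.

```python
import numpy as np

# Hypothetical token IDs for a short sequence (not the real Chapter 1 IDs).
toy_ids = [3, 2, 0, 4, 5, 1]

# List-comprehension form: pair each token with the token that follows it.
toy_bigrams_loop = np.array(
    [[toy_ids[i], toy_ids[i + 1]] for i in range(len(toy_ids) - 1)]
)

# Vectorized form: place the sequence without its last token next to
# the sequence without its first token.
ids_arr = np.asarray(toy_ids)
toy_bigrams_vec = np.column_stack((ids_arr[:-1], ids_arr[1:]))

print(toy_bigrams_vec.shape)
print(np.array_equal(toy_bigrams_loop, toy_bigrams_vec))
```

Both forms produce the same array; the vectorized one avoids a Python-level loop, which matters more on longer texts than on a single chapter.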


Next, let's split the Chapter 1 bigrams into the following PyTorch tensors:
- `features`: contains the context tokens
- `labels`: contains the target tokens

Be sure to specify the `torch.long` datatype for both tensors.

Print out the first 10 values in `features` and `labels`. These should match the bigram pairs.

In [4]:
features = torch.tensor(bigrams_ch1[:,0], dtype=torch.long)
labels = torch.tensor(bigrams_ch1[:,1], dtype=torch.long)

# print the first 10 features and labels
print("First 10 features:", features[:10])
print("First 10 labels:", labels[:10])

First 10 features: tensor([131, 130,   0, 284, 289,   4, 264,   0, 244, 156])
First 10 labels: tensor([130,   0, 284, 289,   4, 264,   0, 244, 156, 124])
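
Since each target token becomes the context token of the next bigram, `labels` is `features` shifted by one position. The following sketch checks that property, along with the `torch.long` dtype, on a hypothetical bigram array (the `toy_*` names and values are assumptions made for illustration, not the Chapter 1 data).

```python
import numpy as np
import torch

# Hypothetical bigram array (not the real Chapter 1 data).
toy_bigrams = np.array([[3, 2], [2, 0], [0, 4], [4, 5]])

# Column 0 holds the context tokens, column 1 holds the target tokens.
toy_features = torch.tensor(toy_bigrams[:, 0], dtype=torch.long)
toy_labels = torch.tensor(toy_bigrams[:, 1], dtype=torch.long)

# Every target is the next bigram's context, so the tensors are shifted copies.
shifted_ok = torch.equal(toy_features[1:], toy_labels[:-1])

print(toy_features.dtype, toy_labels.dtype)
print(shifted_ok)
```

Passing `dtype=torch.long` explicitly keeps the result independent of the NumPy array's integer type, which can differ across platforms.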
