**Working with Text**

Getting text ready to be processed by language models

### 1.1 Tokenize the text

##### Get data

We will use *A Tale of Two Cities* by Charles Dickens from Project Gutenberg:

In [6]:
import requests

url = "http://www.gutenberg.org/files/98/98-0.txt"

response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed
text = response.text

In [9]:
print(text[:200])

The Project Gutenberg eBook of A Tale of Two Cities, by Charles Dickens

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with alm


In [26]:
print(len(text))

793331


##### Split text on punctuation and whitespace

In [23]:
import re
# First attempt: split on whitespace only (raw string avoids an invalid-escape warning)
words = re.split(r'\s', text)

In [25]:
print(words[:20])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'A', 'Tale', 'of', 'Two', 'Cities,', 'by', 'Charles', 'Dickens', '', '', '', 'This', 'eBook', 'is', 'for']


- The first token begins with `\ufeff`, the Unicode byte order mark (BOM). It often appears at the start of a text file to indicate the file's encoding.

In [27]:
# Remove the BOM with a regular expression anchored at the start of the text
text = re.sub(r'^\ufeff', '', text)
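An alternative to stripping the BOM after the fact is to decode the raw bytes with Python's `utf-8-sig` codec, which drops a leading BOM if one is present. The sketch below uses a hand-made byte string (`raw`, a stand-in for downloaded bytes) instead of the real download, so it runs offline.

```python
# Hypothetical stand-in for the raw bytes of a file that starts with a BOM.
raw = "\ufeffThe Project Gutenberg eBook".encode("utf-8")

plain = raw.decode("utf-8")      # the BOM survives as the first character
clean = raw.decode("utf-8-sig")  # the BOM is stripped during decoding

print(repr(plain[:4]))
print(repr(clean[:4]))
```

Either approach yields the same text here; the regular expression is convenient when the string has already been decoded.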

- Split on whitespace and on punctuation such as commas, periods, and apostrophes.

In [111]:
sample_text = r"""Hello, "this" is-- a sample text. There's a second line!"""
sample_words = re.split(r'([,.:;?_!"()\'-])|\s', sample_text)
# Filter out empty strings
sample_words = [word for word in sample_words if word]
print(sample_words[:30])

['Hello', ',', '"', 'this', '"', 'is', '-', '-', 'a', 'sample', 'text', '.', 'There', "'", 's', 'a', 'second', 'line', '!']
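The pattern works because `re.split` returns the text matched by a capturing group alongside the split pieces. Punctuation sits inside the group, so it is kept as a token; whitespace sits outside it, so it is dropped and the group contributes `None`. A small sketch on a made-up string shows the raw result before filtering:

```python
import re

pattern = r'([,.:;?_!"()\'-])|\s'

# Raw result: kept punctuation, an empty string between ',' and ' ', and None for the space.
raw_parts = re.split(pattern, "Hi, you")
print(raw_parts)   # ['Hi', ',', '', None, 'you']

# Filtering on truthiness removes both the empty strings and the None entries.
tokens = [part for part in raw_parts if part]
print(tokens)      # ['Hi', ',', 'you']
```

This is why the `if word` filter is needed after every split.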


In [113]:
words = re.split(r'([,.:;?_!"()\'-])|\s', text)

# Filter out empty strings
words = [word for word in words if word]

print(words[:30])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'A', 'Tale', 'of', 'Two', 'Cities', ',', 'by', 'Charles', 'Dickens', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other']


In [115]:
print(len(words))

169846


### 1.2 Convert tokens to token IDs

##### Generate vocabulary

The set of unique tokens that an NLP system operates on is called a vocabulary.

- Create a vocabulary using the tokens

In [175]:
unique_words = sorted(set(words))  # sorted() accepts a set directly
print(len(unique_words))
unique_words[100:105]

11658


['An', 'And', 'Angel', 'Angels', 'Angel’s']
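The entry `Angel’s` survives as a single token because the typographic apostrophe `’` is not in the pattern's character class, which lists only the ASCII `'` and `"`. If splitting on curly quotes is desired, one option is to extend the class. The name `pattern_curly` and the sample sentence below are illustrative and not part of the original notebook.

```python
import re

# Assumption: also treat the typographic quotes ’ ‘ “ ” as separate tokens.
pattern_curly = r'([,.:;?_!"()\'’‘“”-])|\s'

sample = "“Angel’s wings,” he said."
tokens = [t for t in re.split(pattern_curly, sample) if t]
print(tokens)
# ['“', 'Angel', '’', 's', 'wings', ',', '”', 'he', 'said', '.']
```

Applying such a pattern to the full text would change the token count, the vocabulary size, and every token ID shown in this section, so the remaining cells keep the original pattern.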

In [176]:
vocab = {word: index for index, word in enumerate(unique_words)}
print(len(vocab))
vocab['Angel']

11658


102
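The same two steps can be tried offline on a toy token list (the tokens below are made up for illustration). Sorting the unique tokens first makes the ID assignment deterministic, and inverting the dictionary gives the index -> word direction that decoding needs.

```python
# Made-up tokens standing in for the output of the splitting step.
toy_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

toy_unique = sorted(set(toy_tokens))
toy_vocab = {word: index for index, word in enumerate(toy_unique)}
toy_reverse = {index: word for word, index in toy_vocab.items()}

ids = [toy_vocab[token] for token in toy_tokens]
print(ids)                            # [5, 1, 4, 3, 5, 2, 0]
print([toy_reverse[i] for i in ids])  # the original tokens again
```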

##### Define Tokenizer Class

Define a **Simple Tokenizer** class that encodes and decodes input strings using our vocabulary.
- `vocab` maps each word to its index (word -> index).
- For decoding, we also need `reverse_lookup`, the inverse mapping (index -> word).

In [155]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.reverse_lookup = {i: w for w, i in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\'-])|\s', text)
        # Filter out empty strings
        tokens = [token for token in tokens if token]
        return [self.vocab[token] for token in tokens]

    def decode(self, tokens):
        return ' '.join([self.reverse_lookup[token] for token in tokens])

In [156]:
sample_text = r'A Tale of Cities - by Charles Dickens'
tokenizer = SimpleTokenizer(vocab)

In [157]:
print('Encoded:', tokenizer.encode(sample_text))
print('Decoded:', tokenizer.decode(tokenizer.encode(sample_text)))

Encoded: [52, 1221, 7122, 254, 12, 2512, 240, 360]
Decoded: A Tale of Cities - by Charles Dickens


### 1.3 Adding Special tokens

Adding special tokens to the vocabulary gives the LLM extra context. Common examples:
- [SOS] Start of sequence: marks the beginning of a sequence.
- [EOS] End of sequence: marks the end of a sequence, or the boundary between concatenated texts such as multiple articles or books.
- [PAD] Padding: fills an input up to the length the LLM expects when the actual input is shorter.
- [UNK] Unknown: stands in for out-of-vocabulary tokens, that is, tokens that do not occur in the vocabulary.
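To make the role of the padding token concrete, here is a minimal sketch that right-pads a batch of token-ID lists to a common length. The helper name `pad_batch`, the example IDs, and the choice of 0 as the padding ID are all assumptions made for illustration; the vocabulary built in this section does not define a [PAD] token.

```python
def pad_batch(batch, pad_id):
    """Right-pad every ID list in `batch` to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

# Hypothetical token IDs; 0 is assumed to be the [PAD] ID here.
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, pad_id=0)
print(padded)   # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```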

In [159]:
sample_text = r'Maybe these words do not exist in the vocab'
tokenizer = SimpleTokenizer(vocab)
tokenizer.encode(sample_text)

KeyError: 'Maybe'

- The cell above raises a `KeyError` because the word 'Maybe' is not in our vocabulary. We can handle this by mapping every out-of-vocabulary token to an [UNK] token.
- We can also add an [EOS] token.

In [177]:
special_tokens = ["<|EOS|>", "<|UNK|>"]
unique_words.extend(special_tokens)
vocab = {word: index for index, word in enumerate(unique_words)}

In [178]:
unique_words[-5:]

['“‘You', '“‘the', '”', '<|EOS|>', '<|UNK|>']

In [180]:
# Use token_id rather than id to avoid shadowing the built-in id()
for token, token_id in list(vocab.items())[-5:]:
    print(token, token_id)

“‘You 11655
“‘the 11656
” 11657
<|EOS|> 11658
<|UNK|> 11659


- Modify the Tokenizer class to include special tokens

In [184]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.reverse_lookup = {i: w for w, i in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\'-])|\s', text)
        # Filter out empty strings
        tokens = [token for token in tokens if token]
        # Map out-of-vocabulary tokens to the <|UNK|> ID
        unk_id = self.vocab['<|UNK|>']
        return [self.vocab.get(token, unk_id) for token in tokens]

    def decode(self, tokens): 
        return ' '.join([self.reverse_lookup[token] for token in tokens])

In [187]:
sample_text1 = r'Maybe these words do not exist in the vocabulary.'
sample_text2 = r'A Tale of Cities - by Charles Dickens.'
# Surround <|EOS|> with spaces so it is split off even if the first text lacks final punctuation
sample_text = sample_text1 + ' <|EOS|> ' + sample_text2

tokenizer = SimpleTokenizer(vocab)
print('Encoded:', tokenizer.encode(sample_text))
print('Decoded:', tokenizer.decode(tokenizer.encode(sample_text)))

Encoded: [11659, 9960, 10962, 3925, 7026, 4438, 5770, 9938, 11659, 13, 11658, 52, 1221, 7122, 254, 12, 2512, 240, 360, 13]
Decoded: <|UNK|> these words do not exist in the <|UNK|> . <|EOS|> A Tale of Cities - by Charles Dickens .
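Because `decode` joins tokens with single spaces, punctuation ends up detached (`Dickens .`). One common cleanup, sketched here as a standalone helper named `tidy_spacing` (a hypothetical name, not part of the class above), removes whitespace before closing punctuation. It is a heuristic and does not restore the original spacing in every case, for example around quotes or hyphens.

```python
import re

def tidy_spacing(text):
    # Remove whitespace that precedes common closing punctuation marks.
    return re.sub(r'\s+([,.:;?!])', r'\1', text)

decoded = ('<|UNK|> these words do not exist in the <|UNK|> . '
           '<|EOS|> A Tale of Cities - by Charles Dickens .')
tidied = tidy_spacing(decoded)
print(tidied)
# <|UNK|> these words do not exist in the <|UNK|>. <|EOS|> A Tale of Cities - by Charles Dickens.
```

The same substitution could be applied to the joined string inside `decode` if tidier output is preferred.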
