**Working with Text**

Getting text ready to be processed by language models

### 1.1 Tokenize the text

##### Get text data

We will use *A Tale of Two Cities* by Charles Dickens from Project Gutenberg:

In [6]:
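import requests

# Plain-text edition of the book on Project Gutenberg
url = "http://www.gutenberg.org/files/98/98-0.txt"

response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
text = response.text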

In [9]:
print(text[:200])

The Project Gutenberg eBook of A Tale of Two Cities, by Charles Dickens

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with alm


In [26]:
print(len(text))

793331


##### Split text on punctuation and whitespace

In [23]:
import re
words = re.split(r'\s', text)

In [25]:
print(words[:20])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'A', 'Tale', 'of', 'Two', 'Cities,', 'by', 'Charles', 'Dickens', '', '', '', 'This', 'eBook', 'is', 'for']


- The leading `\ufeff` is the Unicode Byte Order Mark (BOM). It sometimes appears at the start of a text file to signal its encoding, and here it has ended up attached to the first token.

In [27]:
# Remove the leading BOM with a regular expression
text = re.sub(r'^\ufeff', '', text)
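
As an alternative to the regular expression, the BOM can be dropped at decode time. This is a minimal sketch, assuming the raw bytes are UTF-8; the `raw` value is a made-up stand-in for the downloaded bytes (for example `response.content`). Python's built-in `utf-8-sig` codec skips a leading BOM if one is present.

```python
# Hypothetical stand-in for the raw bytes of the download (e.g. response.content).
raw = "\ufeffThe Project Gutenberg eBook".encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as the first character.
with_bom = raw.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM, if present, while decoding.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[:4]))     # '\ufeffThe'
print(repr(without_bom[:4]))  # 'The '
```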

- Split on whitespace and on punctuation such as commas, periods, and apostrophes, keeping each punctuation mark as its own token.

In [111]:
sample_text = r"""Hello, "this" is-- a sample text. There's a second line!"""
sample_words = re.split(r'([,.:;?_!"()\'-])|\s', sample_text)
# Filter out empty strings and None entries
sample_words = [word for word in sample_words if word]
print(sample_words[:30])

['Hello', ',', '"', 'this', '"', 'is', '-', '-', 'a', 'sample', 'text', '.', 'There', "'", 's', 'a', 'second', 'line', '!']
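
The filter step matters because of how `re.split` treats the capturing group. The sketch below shows the unfiltered result on a short made-up string: when the whitespace branch of the pattern matches, the group did not take part in the match, so `re.split` emits `None`, and adjacent separators also produce empty strings.

```python
import re

pattern = r'([,.:;?_!"()\'-])|\s'

# Unfiltered split of a short illustrative string.
raw_parts = re.split(pattern, "Hi, you")
print(raw_parts)  # ['Hi', ',', '', None, 'you']

# `if part` drops both the empty strings and the None entries.
parts = [part for part in raw_parts if part]
print(parts)  # ['Hi', ',', 'you']
```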


In [113]:
words = re.split(r'([,.:;?_!"()\'-])|\s', text)

# Filter out empty strings and None entries
words = [word for word in words if word]

print(words[:30])

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'A', 'Tale', 'of', 'Two', 'Cities', ',', 'by', 'Charles', 'Dickens', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other']


In [115]:
print(len(words))

169846


### 1.2 Convert tokens to token IDs

- Build a vocabulary from the unique tokens

In [121]:
unique_words = list(set(words))
print(len(unique_words))
unique_words[:5]

11658


['accordance', 'industrious', 'engaged', '“here', 'brigand']

- Tokens such as `“here` appear because the curly quotation marks (“ ”) used in the book are not in the punctuation character class of the split pattern, so they stay attached to the neighboring word.

In [135]:
vocab = {word: index for index, word in enumerate(unique_words)}
print(len(vocab))
vocab['engaged']

11658


2
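
Because the iteration order of a `set` of strings can differ between Python sessions, the IDs above are not guaranteed to be the same on every run. One possible variation, shown here on a toy token list rather than the full book, sorts the unique tokens first so that the mapping is reproducible.

```python
# Toy token list standing in for `words`.
toy_words = ["the", "cat", ",", "the", "dog"]

# Sorting the unique tokens gives the same order, and hence the same IDs, on every run.
toy_vocab = {word: index for index, word in enumerate(sorted(set(toy_words)))}
print(toy_vocab)  # {',': 0, 'cat': 1, 'dog': 2, 'the': 3}
```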

Define a **Simple Tokenizer** class that encodes and decodes strings using our vocabulary:
- `vocab` maps word -> index.
- Decoding needs the inverse mapping, `reverse_lookup`, from index -> word.

In [142]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.reverse_lookup = {i: w for w, i in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\'-])|\s', text)
        # Filter out empty strings and None entries
        tokens = [token for token in tokens if token]
        return [self.vocab[token] for token in tokens]

    def decode(self, tokens):
        return ' '.join([self.reverse_lookup[token] for token in tokens])
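
Note that `decode` joins every token with a single space, so punctuation comes back with a space in front of it (for example `Hello , world`). A possible post-processing step is sketched below as a standalone helper; `tidy_punctuation` is a hypothetical name, and the helper only handles a few common marks. It does not try to restore quotes, hyphens, or apostrophes.

```python
import re

def tidy_punctuation(decoded):
    # Remove whitespace that precedes common punctuation marks.
    return re.sub(r'\s+([,.:;?!])', r'\1', decoded)

print(tidy_punctuation("Hello , world !"))  # Hello, world!
```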

In [145]:
sample_text = r'A Tale of Cities - by Charles Dickens'
tokenizer = SimpleTokenizer(vocab)

In [146]:
print('Encoded:', tokenizer.encode(sample_text))
print('Decoded:', tokenizer.decode(tokenizer.encode(sample_text)))

Encoded: [3154, 1347, 8772, 11105, 5182, 6340, 9808, 1877]
Decoded: A Tale of Cities - by Charles Dickens
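
The `encode` method above raises a `KeyError` for any token that does not occur in the book. A common workaround is to reserve a placeholder ID for unknown tokens. The sketch below is one way to do that, not part of the original tokenizer: the `<unk>` token and the `TokenizerWithUnknown` class are illustrative names, the code assumes the vocabulary IDs run from 0 to n-1, and it is exercised on a toy vocabulary.

```python
import re

class TokenizerWithUnknown:
    """Variant of SimpleTokenizer that maps out-of-vocabulary tokens to <unk>."""

    UNK = "<unk>"  # hypothetical placeholder token

    def __init__(self, vocab):
        # Copy the vocabulary and append the placeholder if it is missing.
        # Assumes existing IDs are 0..n-1, so len(vocab) is a free ID.
        self.vocab = dict(vocab)
        if self.UNK not in self.vocab:
            self.vocab[self.UNK] = len(self.vocab)
        self.reverse_lookup = {i: w for w, i in self.vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\'-])|\s', text)
        # Filter out empty strings and None entries
        tokens = [token for token in tokens if token]
        unk_id = self.vocab[self.UNK]
        # Fall back to the placeholder ID for tokens missing from the vocabulary.
        return [self.vocab.get(token, unk_id) for token in tokens]

    def decode(self, token_ids):
        return ' '.join(self.reverse_lookup[i] for i in token_ids)

toy_vocab = {"A": 0, "Tale": 1, "of": 2, "Cities": 3}
toy_tokenizer = TokenizerWithUnknown(toy_vocab)

ids = toy_tokenizer.encode("A Tale of Three Cities")
print(ids)                        # [0, 1, 2, 4, 3]
print(toy_tokenizer.decode(ids))  # A Tale of <unk> Cities
```

The trade-off is that the round trip is no longer lossless: every unknown word decodes to the same placeholder.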
