## Chapter 2: Build your vocabulary




### 2.2. Building your vocabulary

We can tokenize a sentence reasonably well with `split()`, a built-in method of Python's `str`, as a quick-and-dirty tokenizer:

In [1]:
sentence = "Thomas Jefferson began building Monticello at the age of 26."
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [2]:
str.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [3]:
token_sequence = str.split(sentence)
vocab =  sorted(set(token_sequence)) #! keeps unique words only
", ".join(vocab)

'26., Jefferson, Monticello, Thomas, age, at, began, building, of, the'

In [4]:
import numpy as np
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens,vocab_size),int)

for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1

" ".join(vocab)

'26. Jefferson Monticello Thomas age at began building of the'

In [5]:
onehot_vectors

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [6]:
import pandas as pd
# Use a DataFrame to display the one-hot vectors
df = pd.DataFrame(onehot_vectors, columns=vocab)

df

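| | 26. | Jefferson | Monticello | Thomas | age | at | began | building | of | the |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |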


In [7]:
# replace 0 with '' (an empty string) so that the 1s stand out
df[df == 0] = ''
df

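| | 26. | Jefferson | Monticello | Thomas | age | at | began | building | of | the |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | 1.0 | | | | | | |
| 1 | | 1.0 | | | | | | | | |
| 2 | | | | | | | 1.0 | | | |
| 3 | | | | | | | | 1.0 | | |
| 4 | | | 1.0 | | | | | | | |
| 5 | | | | | | 1.0 | | | | |
| 6 | | | | | | | | | | 1.0 |
| 7 | | | | | 1.0 | | | | | |
| 8 | | | | | | | | | 1.0 | |
| 9 | 1.0 | | | | | | | | | |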


However, one-hot encoding turns every word into a long vector of 0s and 1s, with one column per vocabulary word, so it is not practical for a long document that contains many sentences.
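
To get a feel for how quickly this grows, here is a rough back-of-the-envelope sketch. The function name `onehot_table_bytes` is my own, the "large" document sizes are made-up numbers for illustration only, and I assume one byte per cell in a dense table.

```python
def onehot_table_bytes(num_tokens: int, vocab_size: int, bytes_per_cell: int = 1) -> int:
    """Size of a dense one-hot table: one row per token, one column per vocabulary word."""
    return num_tokens * vocab_size * bytes_per_cell

# The sentence used above: 10 tokens and 10 distinct words.
small = onehot_table_bytes(10, 10)

# Hypothetical document (made-up sizes): 100,000 tokens, 20,000-word vocabulary.
large = onehot_table_bytes(100_000, 20_000)

print(small)        # 100
print(large / 1e9)  # 2.0, i.e. about 2 GB at one byte per cell
```

Almost all of those cells are zeros, which is why a more compact representation is worth looking for.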

Another solution is to represent each sentence as a bag of words: a vector that records only whether each word is present or absent in that sentence. Note that a bag of words does not keep the order of the words in the sentence. In the cell below the items also come out in a different order than in the sentence, because `sorted()` orders them by character code (in ASCII, digits come first, then uppercase letters, then lowercase letters).

In [8]:
sentence_bow = {}

for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())

[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]

In [9]:
df = pd.DataFrame(pd.Series(dict([(token,1) for token in sentence.split()])), columns=['sent']).T
df

| | Thomas | Jefferson | began | building | Monticello | at | the | age | of | 26. |
|---|---|---|---|---|---|---|---|---|---|---|
| sent | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |


In [10]:
sentences = "Thomas Jefferson began building Monticello at the age of 26.\n"
sentences += "Construction was done mostly by local masons and carpenters.\n"
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += "Turing Monticello into a neoclassical masterpiece was Jefferson's obsession."

corpus = {}

for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok,1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df


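| | Thomas | Jefferson | began | building | Monticello | at | the | age | of | 26. | ... | South | Pavilion | in | 1770. | Turing | a | neoclassical | masterpiece | Jefferson's | obsession. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |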


### 2.2.1. Dot product

In [11]:
import pandas as pd
import numpy as np

In [12]:
v1 = np.array([1,2,3])
v2 = np.array([2,3,4])

print(v1.dot(v2))


20


Three other ways to compute the dot product of two vectors:

In [16]:
np.sum(v1*v2)

20

In [14]:
v1@v2

20

In [15]:
np.matmul(v1,v2)

20
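
All of the NumPy expressions above do the same arithmetic: multiply the matching elements and add up the products. Here is a minimal plain-Python sketch of that, just to make the computation visible (the helper name `dot` is my own, not a NumPy function).

```python
def dot(a, b):
    """Multiply matching elements of two equal-length vectors and sum the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

result = dot([1, 2, 3], [2, 3, 4])
print(result)  # 1*2 + 2*3 + 3*4 = 20
```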

### 2.2.2 Measuring bag-of-words overlap

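This measure shows how similar two sentences are in terms of the words they use: the dot product of two binary bag-of-words vectors counts the words that appear in both sentences.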

In [22]:
df = df.T

similar_0_1 = df.sent0.dot(df.sent1)  # will equal 0, means no word was used in both sentences
similar_0_2 = df.sent0.dot(df.sent2)  # will equal 1, means 1 word was used in both sentences
similar_0_3 = df.sent0.dot(df.sent3)  # will equal 1, means 1 word was used in both sentences

print(similar_0_1, similar_0_2, similar_0_3, end="\n")


0 1 1


To find the word(s) shared by two sentences, we apply a bitwise AND (`&`) to their binary vectors, as in the cell below:

In [24]:
[(k,v) for k,v in (df.sent0 & df.sent3).items() if v]

[('Monticello', 1)]
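
The same overlap can be cross-checked without pandas. This is only a sketch using plain Python sets built with `str.split()`, so the tokens still carry their punctuation; the variable names are my own.

```python
sent0 = "Thomas Jefferson began building Monticello at the age of 26."
sent3 = "Turing Monticello into a neoclassical masterpiece was Jefferson's obsession."

# Set intersection plays the role of the bitwise AND on the binary vectors.
shared = set(sent0.split()) & set(sent3.split())

print(sorted(shared))  # ['Monticello']
print(len(shared))     # 1, the same number the dot product gave
```

Note that "Jefferson" and "Jefferson's" do not count as the same token here, which is one more reason to want a better tokenizer.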

### 2.2.3 A token improvement

The code above splits on the whitespace between words to get tokens. That method keeps pesky punctuation attached to the words (e.g. the period in "26."). So we need a better tokenizer that splits on punctuation as well as whitespace <font style="background:rgba(255, 248, 112,0.7);color:black">(in some cases you want to treat punctuation marks as tokens, in other cases you ignore them)</font>.

In [3]:
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

tokens = re.split(r'[-\\\s.,;!?]+',sentence)
tokens


['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '']
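
The empty string at the end of the list appears because the sentence ends with a delimiter (the period), so `re.split()` returns the empty text that follows it. A small sketch of one way to drop such empty tokens, assuming we never want them:

```python
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

tokens = re.split(r'[-\s.,;!?]+', sentence)
# Keep only non-empty strings; '' is falsy in Python.
tokens = [tok for tok in tokens if tok]

print(tokens[-1])   # 26
print(len(tokens))  # 10
```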

<code style="background:rgba(255, 248, 112,0.7);color:black">NOTE: When to compile your regex patterns?</code><br><br>
So far I have read about this both in the book and on [stackoverflow](https://stackoverflow.com/questions/452104/is-it-worth-using-pythons-re-compile#:~:text=For%20the%20simple%20case%20of,cache%20of%20previously%20compiled%20expressions.).

My conclusion is:
- if you use the same regex many times, or the regex is complex, it is better to use `re.compile()`
- if you want to call the regex methods on a named pattern object (a matter of programming style) rather than on `re` directly, use `re.compile()`

As you can see below, the result is still the same!

In [6]:
regex = re.compile(r"[-\s.,;!?]+")
regex.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '']

<code style="background:rgba(255, 248, 112,0.7);color:black">NOTE: When to use the new regex module in Python?</code>
<br>
<br>
There is a newer third-party module named `regex`. It is designed to be backward-compatible with the `re` package.
<br>
According to the book, its useful new features include support for:
- Overlapping match sets
- Multithreading
- Feature-complete support for Unicode
- Approximate regular expression matches
- Larger default MAXCACHE

Install `regex`: `pip install regex`

The book says `regex` will eventually replace the `re` package (I am still looking for a source for that 🙂)
<br>
***All the information above is from the book. It confused me a lot, so I will take time to look into it later.***

---

Several libraries implement tokenizers, and each has its own advantages. Take a look at the screenshot below:

![](2022-07-23-20-18-06.png)

In NLP, NLTK and CoreNLP are the most widely used. However, CoreNLP runs on a Java 8 backend, so it must be installed and configured separately.

In [5]:
from nltk.tokenize import RegexpTokenizer

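# \$ matches a literal dollar sign; a bare $ is the end-of-string anchor, so that alternative could never match
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')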
tokenizer.tokenize(sentence)


['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

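What happens if we pass the RegexpTokenizer pattern to `split()` from `re` instead? `split()` treats every match as a delimiter, so it returns what lies between the tokens rather than the tokens themselves:

In [7]:
# \$ matches a literal dollar sign; a bare $ is the end-of-string anchor
regex = re.compile(r'\w+|\$[0-9.]+|\S+')
regex.split(sentence)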


['', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '', '']
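
To get the tokens themselves with plain `re`, the matches have to be collected rather than used as split points. As far as I understand, that is roughly what `RegexpTokenizer` does in its default mode, but treat that as my assumption. The sketch below uses `findall()`, and the dollar sign is escaped so that it matches a literal `$`.

```python
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

pattern = re.compile(r'\w+|\$[0-9.]+|\S+')
# findall() returns every non-overlapping match, i.e. the tokens.
tokens = pattern.findall(sentence)

print(tokens[-2:])  # ['26', '.']
```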

An even better tokenizer is the Treebank Word Tokenizer from the NLTK package.
<br>
This tokenizer separates phrase-terminating punctuation from adjacent tokens and keeps a decimal number containing a period as a single token. It also splits "don't" into ["do", "n't"], which helps subsequent steps in the pipeline such as token normalization.

In [8]:
from nltk.tokenize import TreebankWordTokenizer

sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."

tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)


['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '.']

---