In [1]:
import pandas as pd
import numpy as np
import collections
import re
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import TreebankWordTokenizer

In [2]:
sentence = """Thomas Jeferson began building Monticello at the age at 26."""

sentence.split()

['Thomas',
 'Jeferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'at',
 '26.']

# Note !
split is a built-in method of string objects. As the output shows, the last token is "26.", with the period still attached, which could be mistaken for the float 26.0. In cases like this we may want to strip the punctuation so that a single token "26" stands for every variant such as "26?", "26." and "26!".

For now, let's leave these cases and handle them later.
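
A minimal sketch of the punctuation stripping described above, assuming it is enough to trim ASCII punctuation (the standard library's string.punctuation set) from both ends of each token. Punctuation inside a token, such as an apostrophe, is left alone:

```python
import string

sentence = "Thomas Jeferson began building Monticello at the age at 26."

# Trim punctuation from both ends of every whitespace-separated token.
tokens = [token.strip(string.punctuation) for token in sentence.split()]

# Drop tokens that consisted only of punctuation and are now empty.
tokens = [token for token in tokens if token]

print(tokens[-1])  # 26
```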

We can now get a vector from these words.

# One-hot-Encoding

A sequence of one-hot vectors preserves the original order of the tokens in the text, so it retains all the information in the original text. This is why it can be used as input to some sequence-to-sequence deep learning models.

- Split the sentence into tokens
- Get the unique words and sort them lexicographically
- Build a 2D matrix in which each row is a token position in the original document and each column is a word of the vocabulary


In [3]:
# Split first
tokens = sentence.split()
print(tokens)
print("="*50)

# Build the vocab from unique words
vocab = sorted(set(tokens))
print(vocab)
print("="*50)

one_hot = np.zeros((len(tokens), len(vocab)), int)
print(one_hot.shape)

['Thomas', 'Jeferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'at', '26.']
['26.', 'Jeferson', 'Monticello', 'Thomas', 'age', 'at', 'began', 'building', 'the']
(10, 9)


In [4]:
for i, word in enumerate(tokens):
    one_hot[i, vocab.index(word)] = 1 

    
# Now the sentence 
print(sentence)
print("="* 50)
one_hot

Thomas Jeferson began building Monticello at the age at 26.


array([[0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0]])

# Note !!

We can see that the first column has a 1 in the last row because "26." is the last word, and the second column has a 1 in the second row because "Jeferson" is the second word in the original document. The one-hot encoding therefore captures the position of every word in the document. But what about a document of 100 words, or a vocabulary covering all the words of a language?
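
To get a feel for that question, the sketch below only does arithmetic on matrix sizes. The 3,000-token document and the 1,000,000-word vocabulary are hypothetical round numbers chosen for illustration, and the 8 bytes per cell assumes 64-bit integers:

```python
def one_hot_cells(num_tokens, vocab_size):
    """Number of cells in a one-hot matrix of shape (num_tokens, vocab_size)."""
    return num_tokens * vocab_size

# The sentence above: 10 tokens and 9 unique words.
small = one_hot_cells(10, 9)

# Hypothetical larger case (illustrative numbers, not measurements).
hypothetical_tokens = 3_000
hypothetical_vocab = 1_000_000
large = one_hot_cells(hypothetical_tokens, hypothetical_vocab)

# Assumption: each cell is stored as a 64-bit integer (8 bytes).
large_gigabytes = large * 8 / 1e9

print(small)            # 90
print(large)            # 3000000000
print(large_gigabytes)  # 24.0
```

Only one cell per row holds a 1, so in the hypothetical case just 3,000 of the three billion cells would be non-zero.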

For now, let's label each column with the word it represents and index the rows by position, so the table is easier to read than a bare array of 0s and 1s.

In [5]:
pd.DataFrame(one_hot, columns=vocab)

Unnamed: 0,26.,Jeferson,Monticello,Thomas,age,at,began,building,the
0,0,0,0,1,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,1,0
4,0,0,1,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0
6,0,0,0,0,0,0,0,0,1
7,0,0,0,0,1,0,0,0,0
8,0,0,0,0,0,1,0,0,0
9,1,0,0,0,0,0,0,0,0


**A 1 in the first column marks, as mentioned above, the position of that word in the original document, and likewise for the other columns. Each word's vector is the row at the index where the word appears in the original document.**
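
One way to check that the one-hot matrix really keeps the word order is to decode it back into the sentence. The sketch below rebuilds the matrix so that it runs on its own; the name decoded is introduced here for illustration:

```python
import numpy as np

sentence = "Thomas Jeferson began building Monticello at the age at 26."
tokens = sentence.split()
vocab = sorted(set(tokens))

# Same construction as above: one row per token, one column per vocab word.
one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, vocab.index(word)] = 1

# Each row holds a single 1; argmax gives the column (vocab index) of that 1.
decoded = [vocab[col] for col in one_hot.argmax(axis=1)]

print(" ".join(decoded) == sentence)  # True
```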


# Frequency of words not their order

Machine learning can pick up patterns in the data even if you ignore the order in which the words appear in the document. Look at the sentence after you have **sorted** its words: you can still guess what it is about, much like rebuilding a sentence from shuffled words when you are learning a new language. What we need here is to compress the whole matrix into a single vector with one entry per unique word in the corpus; you can even keep just the 10,000 most frequent words.

We can also split the document into short sentences, each containing only a small number of words. A word that is repeated several times in a document is much less likely to be repeated within a single sentence or phrase, so each sentence vector is zero everywhere except at the few words that occur in that sentence.

So let's build a vector instead of the 2D one-hot matrix.

In [6]:
sentence = """Thomas Jeferson began building Monticello at the age of 26."""

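# Binary bag of words: every word that occurs is marked with 1,
# no matter how many times it occurs
vector = collections.Counter()

# Sorted vocabulary of the sentence (not needed to fill the Counter)
unique_words = sorted(set(sentence.split()))

for word in sentence.split():
    vector[word] = 1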
    
vector

Counter({'Thomas': 1,
         'Jeferson': 1,
         'began': 1,
         'building': 1,
         'Monticello': 1,
         'at': 1,
         'the': 1,
         'age': 1,
         'of': 1,
         '26.': 1})
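
The cell above marks each word with 1 whether it appears once or many times. If actual frequencies are wanted, Counter can count the tokens directly. The sketch below uses the earlier version of the sentence, in which "at" appears twice, and also checks that reordering the words does not change the result:

```python
from collections import Counter

sentence = "Thomas Jeferson began building Monticello at the age at 26."
tokens = sentence.split()

# Counter counts how many times each token occurs.
counts = Counter(tokens)

# Word order is ignored: the reversed token list gives the same counts.
same_when_reordered = Counter(reversed(tokens)) == counts

print(counts.most_common(1))  # [('at', 2)]
print(same_when_reordered)    # True
```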

# More !

That was just one document, with no repeated words. What can we do with more than one document? We can first build the **lexicon**, the vocabulary of the whole corpus, and sort it. Then, for each document, we build a dictionary that marks which words of the sorted lexicon appear in that document.

In [7]:
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sorted(sentence.split())])), columns=['sent' ]).T
df

Unnamed: 0,26.,Jeferson,Monticello,Thomas,age,at,began,building,of,the
sent,1,1,1,1,1,1,1,1,1,1


In [8]:
sentences = """Thomas Jeferson began building Monticello at the age at 26.\n"""

sentences += """Construction was done mostly by local masons and carpenters.\n"""

sentences += """He moved into the South Pavilion in 1770.\n"""

sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

sentences

"Thomas Jeferson began building Monticello at the age at 26.\nConstruction was done mostly by local masons and carpenters.\nHe moved into the South Pavilion in 1770.\nTurning Monticello into a neoclassical masterpiece was Jefferson's obsession."

In [9]:
corps = {}
sentences_split = sentences.split('\n')
sentences_split

['Thomas Jeferson began building Monticello at the age at 26.',
 'Construction was done mostly by local masons and carpenters.',
 'He moved into the South Pavilion in 1770.',
 "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."]

In [10]:
for i, sent in enumerate(sentences_split):
    corps['sent_' + str(i+1)] = dict([(token, 1) for token in sent.split()])


# The DataFrame builds its columns from the unique words across all documents

df = pd.DataFrame.from_records(corps).fillna(0).astype(int).T
df

Unnamed: 0,Thomas,Jeferson,began,building,Monticello,at,the,age,26.,Construction,...,South,Pavilion,in,1770.,Turning,a,neoclassical,masterpiece,Jefferson's,obsession.
sent_1,1,1,1,1,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
sent_2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
sent_3,0,0,0,0,0,0,1,0,0,0,...,1,1,1,1,0,0,0,0,0,0
sent_4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,1,1,1,1,1


# Dot Product

With this table we can see that some words occur in more than one document. From this we can measure the similarity between two documents as the number of words they have in common, which helps when searching for similar documents, among other applications. All of this can be done with the dot product.

The dot product is related to matrix multiplication: it takes two vectors of the same length, multiplies each number by the corresponding number in the other vector, and adds up the products, giving a scalar (a single number).

Example:

- 1 * 2 + 2 * 3 + 3 * 4 = 20

In [11]:
v1 = np.array([1,2,3])
v2 = np.array([2,3,4])
v1.dot(v2)

20

In [12]:
df

Unnamed: 0,Thomas,Jeferson,began,building,Monticello,at,the,age,26.,Construction,...,South,Pavilion,in,1770.,Turning,a,neoclassical,masterpiece,Jefferson's,obsession.
sent_1,1,1,1,1,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
sent_2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
sent_3,0,0,0,0,0,0,1,0,0,0,...,1,1,1,1,0,0,0,0,0,0
sent_4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,1,1,1,1,1


In [13]:
sent_1 = df.iloc[0]
sent_2 = df.iloc[1]
sent_3 = df.iloc[2]
sent_4 = df.iloc[3]
print(sent_1.shape)
print(sent_2.shape)
print(sent_3.shape)
print(sent_4.shape)

(31,)
(31,)
(31,)
(31,)


In [14]:
# Dot product of sent_1 with each other sentence = number of shared words
print(sent_1.T.dot(sent_2))
print(sent_1.T.dot(sent_3))
print(sent_1.T.dot(sent_4))

0
1
1
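
The dot product of two binary vectors equals the number of words the two sentences share. As a cross-check, this sketch (plain Python, with helper names chosen here for illustration) finds the shared words with a set intersection and compares the count with the dot product:

```python
sentences = [
    "Thomas Jeferson began building Monticello at the age at 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]

token_sets = [set(sentence.split()) for sentence in sentences]
vocab = sorted(set().union(*token_sets))

def binary_vector(token_set):
    """1 where the vocab word occurs in the sentence, 0 elsewhere."""
    return [1 if word in token_set else 0 for word in vocab]

def dot(a, b):
    """Multiply matching entries and add the products."""
    return sum(x * y for x, y in zip(a, b))

vectors = [binary_vector(token_set) for token_set in token_sets]

shared_words = []
overlaps = []
for other in range(1, len(sentences)):
    shared_words.append(sorted(token_sets[0] & token_sets[other]))
    overlaps.append(dot(vectors[0], vectors[other]))

print(shared_words)  # [[], ['the'], ['Monticello']]
print(overlaps)      # [0, 1, 1]
```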


# VSM

We have moved from raw text to one-hot encoding, which needed a huge, sparse 2D matrix to represent just one document. Now we have our first **Vector Space Model** (VSM) instead: each document is a single vector, so we can measure similarity, add two vectors together, and apply other math operations.

This binary vector representation is powerful: it has been used for document retrieval and search for many years.
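
As a rough illustration of the retrieval use mentioned above, the sketch below ranks the four sentences against a query by the dot product of their binary vectors. The function names and the query string are made up for this example, and it assumes whitespace tokenization as in the cells above:

```python
documents = [
    "Thomas Jeferson began building Monticello at the age at 26.",
    "Construction was done mostly by local masons and carpenters.",
    "He moved into the South Pavilion in 1770.",
    "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.",
]

vocab = sorted({token for doc in documents for token in doc.split()})

def to_binary_vector(text):
    """Binary bag-of-words vector over the corpus vocabulary."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocab]

doc_vectors = [to_binary_vector(doc) for doc in documents]

def search(query):
    """Return (document index, score) pairs, best match first, zero scores dropped."""
    query_vector = to_binary_vector(query)
    scores = [sum(q * d for q, d in zip(query_vector, vec)) for vec in doc_vectors]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked if scores[i] > 0]

results = search("neoclassical Monticello")
print(results)  # [(3, 2), (0, 1)]
```

Query words that are not in the vocabulary simply contribute nothing to the score.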


# Token Improvement

An NLP pipeline has many stages, and each stage affects the ones after it. For example, tokenizing with the simple string method **split** produced the token **26.** with the period attached. In a later stage such as **stemming**, this undermines the idea of grouping similar words together, because "26." is treated as different from "26", "26?" or "26#". The tokenization process therefore needs to go deeper than splitting on spaces, because words can also be separated by punctuation.


In [15]:
sentence = """Thomas Jeferson began building Monticello at the age at 26."""

tokens = re.split(r'[-\s.,;!?]+', sentence)
tokens

['Thomas',
 'Jeferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'at',
 '26',
 '']

# Hints


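- [] matches any single character listed inside the brackets
- \s matches any whitespace character
- -.,;!? are the punctuation characters to split on, in addition to the whitespace above
- **+** means that one or more of these characters in a row are treated as a single separator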
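
The output above ends with an empty string because the sentence ends with a separator. A small sketch, assuming empty tokens are never wanted, filters them out. It also shows a side effect of this pattern, which splits a decimal number at its period (the second input string is made up for illustration):

```python
import re

pattern = r'[-\s.,;!?]+'
sentence = "Thomas Jeferson began building Monticello at the age at 26."

# Keep only non-empty tokens.
tokens = [token for token in re.split(pattern, sentence) if token]
print(tokens[-1])  # 26

# Side effect: the period inside a decimal number is also a separator.
decimal_tokens = re.split(pattern, "pi is 3.14")
print(decimal_tokens)  # ['pi', 'is', '3', '14']
```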


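Another option is a regular expression tokenizer from NLTK.
Compare \S (any non-whitespace character) with \s (any whitespace character) in the last alternative of the two patterns below: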

In [16]:
sentence = """Thomas Jeferson began building Monticello    at the age at 26."""
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')
tokenizer.tokenize(sentence) 

['Thomas',
 'Jeferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'at',
 '26',
 '.']

In [17]:
sentence = """Thomas Jeferson began building Monticello    at the age at 26."""
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\s+')
tokenizer.tokenize(sentence) 

['Thomas',
 ' ',
 'Jeferson',
 ' ',
 'began',
 ' ',
 'building',
 ' ',
 'Monticello',
 '    ',
 'at',
 ' ',
 'the',
 ' ',
 'age',
 ' ',
 'at',
 ' ',
 '26']
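
In a regular expression an unescaped `$` is an end-of-string anchor, so an alternative meant to match dollar amounts has to be written with `\$`. The sketch below uses re.findall from the standard library as a stand-in for RegexpTokenizer, on the assumption that both apply the pattern in the same way; the example sentence is made up:

```python
import re

text = "It cost $3.88, then."

# Escaped: \$ matches a literal dollar sign, so the price is its own token.
escaped = re.findall(r'\w+|\$[0-9.]+|\S+', text)

# Unescaped: $ is an anchor, so the middle alternative never matches here
# and the price is picked up by \S+ together with the trailing comma.
unescaped = re.findall(r'\w+|$[0-9.]+|\S+', text)

print(escaped)    # ['It', 'cost', '$3.88', ',', 'then', '.']
print(unescaped)  # ['It', 'cost', '$3.88,', 'then', '.']
```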

# TreebankWordTokenizer

TreebankWordTokenizer is a better word tokenizer from NLTK. It combines a variety of common rules for English word tokenization: for example, it separates phrase-terminating punctuation (?!) from adjacent tokens and keeps decimal numbers containing a period as a single token. It also contains rules for English contractions, so "wasn't" is split into "was" and "n't".

In [18]:
sentence = """Monticello wasn't designed as UNESCO World Heritage Site until 1987."""
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)

['Monticello',
 'was',
 "n't",
 'designed',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '.']