# Extracting the tokens from a corpus

This Jupyter notebook was used in DIT's Computational Linguistics course; session of 28th February.

We are going to set up a toy corpus and compute some _similarities_.

The libraries to be used today are:

* [numpy](https://numpy.org/)
* [pandas](https://pandas.pydata.org/)
* [nltk](https://www.nltk.org/)

## 0. Prerequisites

Since we are going to use non-standard libraries, we need to install them first if we are working in an ephemeral environment (e.g., Colab).


In [2]:
!pip3 install nltk
!pip install wheel
!pip install pandas



## 1. Importing the necessary libraries

In [3]:
import nltk
import numpy as np
import pandas as pd

from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer

## 2. Defining the preprocessing _pipeline_:

1. Tokenisation
2. Stemming
3. Stopword removal
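The three steps can be chained in a single function. What follows is a minimal sketch, not the notebook's own code: `preprocess`, `toy_tokenize`, `toy_stem` and `toy_stop_words` are hypothetical names, and the toy stand-ins only exist so that the example runs without NLTK. Note that the stopword filter is applied to the raw tokens, before stemming, which is also what the bag-of-words cell later in this notebook does.

```python
def preprocess(text, tokenize, stem, stop_words):
    """Lower-case, tokenise, drop stopwords, then stem the remaining tokens."""
    tokens = tokenize(text.lower())
    return [stem(tok) for tok in tokens if tok not in stop_words]


# Hypothetical stand-ins, much cruder than the NLTK tokenizer and stemmer
def toy_tokenize(text):
    # Split on whitespace, treating the full stop as its own token
    return text.replace(".", " .").split()


def toy_stem(token):
    # Strip a final "s" from tokens longer than three characters
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


toy_stop_words = {"the", "of", "a"}

result = preprocess("The rovers of Mars.", toy_tokenize, toy_stem, toy_stop_words)
print(result)
```

With the NLTK objects created in the next cells, the same function would be called as `preprocess(text, tokenizer.tokenize, stemmer.stem, stop_words)`.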

In [4]:
# invoking the necessary objects
tokenizer = TreebankWordTokenizer()
stemmer = PorterStemmer()

In [5]:
# a tiny test
tokenizer.tokenize("The input text.")


['The', 'input', 'text', '.']

In [6]:
stemmer.stem("documents")

'document'

In [7]:
# both tokenisation and stemming
text = """Perseverance (nicknamed Percy) is a car-sized Mars 
rover designed to explore the crater Jezero on Mars as part 
of NASA's Mars 2020 mission."""

print([stemmer.stem(w) for w in tokenizer.tokenize(text)])

['persever', '(', 'nicknam', 'perci', ')', 'is', 'a', 'car-siz', 'mar', 'rover', 'design', 'to', 'explor', 'the', 'crater', 'jezero', 'on', 'mar', 'as', 'part', 'of', 'nasa', "'s", 'mar', '2020', 'mission', '.']


In [8]:
# The first time you use the stopwords, you have to download them!
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paolo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [9]:
stop_words = stopwords.words("english")
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

### What is a stopword?

According to [Wikipedia](https://en.wikipedia.org/wiki/Stop_word), in computing, stop words are words which are **filtered out** before or after processing of natural language data \[...\] the most common words in a language.

\[...\]

For some search engines, these are **some of the most common, short function words,** such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". 

**Q: Can I create a list of stopwords on the fly?**
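One possible answer, sketched under assumptions: since stop words are described above as the most common words in a language, the most frequent tokens of a corpus can serve as candidates. The function name `frequent_tokens`, the `top_n` cutoff and the toy documents are all hypothetical, and a real list would still need manual inspection.

```python
from collections import Counter


def frequent_tokens(tokenised_docs, top_n=3):
    """Return the top_n most frequent tokens across all documents."""
    counts = Counter(tok for doc in tokenised_docs for tok in doc)
    return [tok for tok, _ in counts.most_common(top_n)]


# Toy documents, already lower-cased and split on whitespace
toy_docs = [
    "the rover landed on the crater".split(),
    "the mission explores the crater".split(),
    "a rover is on the surface".split(),
]

candidates = frequent_tokens(toy_docs, top_n=1)
print(candidates)
```

On a corpus this small only the very top of the ranking is meaningful; content words such as "rover" and "crater" follow immediately after.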

## 3. Setting up the _corpus_

In [10]:
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

sentence = sentences.lower()
sentence      # what is the difference with respect to print()?

"thomas jefferson began building monticello at the age of 26.\nconstruction was done mostly by local masons and carpenters.\nhe moved into the south pavilion in 1770.\nturning monticello into a neoclassical masterpiece was jefferson's obsession."

## 4. Bag of words representation

Let us compute the bag-of-words (BoW) representation for our toy corpus.

In [11]:
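# Loading the corpus into a dictionary: one binary bag of words per sentence
corpus = {}
for i, sent in enumerate(sentences.split('\n')):
    sentence = sent.lower()                 # Case folding
    tokens = tokenizer.tokenize(sentence)   # Tokenisation
    # Stopword removal (on the raw tokens), then stemming
    stems = [stemmer.stem(token) for token in tokens if token not in stop_words]

    # Each stem occurring in the sentence gets the value 1 (presence, not a count)
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in stems)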

print(corpus)


{'sent0': {'thoma': 1, 'jefferson': 1, 'began': 1, 'build': 1, 'monticello': 1, 'age': 1, '26': 1, '.': 1}, 'sent1': {'construct': 1, 'done': 1, 'mostli': 1, 'local': 1, 'mason': 1, 'carpent': 1, '.': 1}, 'sent2': {'move': 1, 'south': 1, 'pavilion': 1, '1770': 1, '.': 1}, 'sent3': {'turn': 1, 'monticello': 1, 'neoclass': 1, 'masterpiec': 1, 'jefferson': 1, "'s": 1, 'obsess': 1, '.': 1}}
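The dictionaries above record only whether a stem occurs in a sentence, not how often. As a hedged aside, a count-based variant could rely on `collections.Counter`; the list `stems` below is a made-up example and the names `bow_binary` and `bow_counts` are hypothetical.

```python
from collections import Counter

# Made-up list of stems in which "mar" occurs three times
stems = ["mar", "rover", "explor", "crater", "mar", "mission", "mar"]

# Binary bag of words, built as in the cell above: presence only
bow_binary = dict((tok, 1) for tok in stems)

# Count-based bag of words: how many times each stem occurs
bow_counts = Counter(stems)

print(bow_binary["mar"], bow_counts["mar"])
```

In the toy corpus no stem is repeated within a sentence, so both variants would produce the same vectors there.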


In [12]:
# Loading the data into a pandas dataframe
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

df[df.columns[:10]]
#print(df)

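| | thoma | jefferson | began | build | monticello | age | 26 | . | construct | done |
|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| sent2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| sent3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |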


## 5. Computing the dot product

The dot product is "the sum of the products of the corresponding entries of two sequences of numbers".

Let us go and have a look at [Wikipedia](https://en.wikipedia.org/wiki/Dot_product).
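Written as a formula, for two vectors $\mathbf{a}$ and $\mathbf{b}$ of the same length $n$ (the symbol names are chosen here for illustration), the definition reads:

$$
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n
$$

For instance, with $\mathbf{a} = (1, 2, 3)$ and $\mathbf{b} = (2, 4, 6)$:

$$
\mathbf{a} \cdot \mathbf{b} = 1 \cdot 2 + 2 \cdot 4 + 3 \cdot 6 = 2 + 8 + 18 = 28
$$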


In [13]:
v1 = np.array([1, 2, 3])
v2 = np.array([2, 4, 6])


# The long way
sum_dot = 0

for i in range(len(v1)):
    sum_dot += v1[i] * v2[i]
    print("result at iteration {}: {}".format(i, sum_dot))
print("Result:", sum_dot)


result at iteration 0: 2
result at iteration 1: 10
result at iteration 2: 28
Result: 28


In [14]:
# The smart way (we are "vectorising")
dot = (v1 * v2).sum()
print(dot)

28


In [15]:
# The numpy way
v1.dot(v2)

28

The dot product can be used to measure the overlap between two documents.
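To see why, consider binary vectors: a position contributes 1 to the sum only when both vectors hold a 1 there, so the dot product equals the number of shared terms. Here is a minimal sketch on two made-up token sets (not the notebook's corpus); all names are hypothetical.

```python
# Two made-up documents, each represented as a set of terms
doc_a = {"mars", "rover", "crater"}
doc_b = {"rover", "crater", "mission"}

# Shared vocabulary, in a fixed (alphabetical) order
vocab = sorted(doc_a | doc_b)

# Binary vectors: 1 if the term occurs in the document, 0 otherwise
vec_a = [int(term in doc_a) for term in vocab]
vec_b = [int(term in doc_b) for term in vocab]

# Dot product: sum of the element-wise products
dot = sum(a * b for a, b in zip(vec_a, vec_b))

# Size of the set intersection, for comparison
shared = doc_a & doc_b

print(dot, len(shared))
```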

In [16]:
# We first transpose the matrix so that
# each sentence becomes a column vector

df = df.T

In [17]:
# How can I print it?
print(df)

            sent0  sent1  sent2  sent3
thoma           1      0      0      0
jefferson       1      0      0      1
began           1      0      0      0
build           1      0      0      0
monticello      1      0      0      1
age             1      0      0      0
26              1      0      0      0
.               1      1      1      1
construct       0      1      0      0
done            0      1      0      0
mostli          0      1      0      0
local           0      1      0      0
mason           0      1      0      0
carpent         0      1      0      0
move            0      0      1      0
south           0      0      1      0
pavilion        0      0      1      0
1770            0      0      1      0
turn            0      0      0      1
neoclass        0      0      0      1
masterpiec      0      0      0      1
's              0      0      0      1
obsess          0      0      0      1


In [18]:
df.sent0.dot(df.sent1)


1

In [19]:
df.sent0.dot(df.sent2)

1

In [20]:
df.sent0.dot(df.sent3)


3

In [21]:
# Where do these numbers come from?
print(sentences)
[(k, v) for (k, v) in (df.sent0 & df.sent3).items() if v]

Thomas Jefferson began building Monticello at the age of 26.
Construction was done mostly by local masons and carpenters.
He moved into the South Pavilion in 1770.
Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.


[('jefferson', 1), ('monticello', 1), ('.', 1)]
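All pairwise dot products can also be obtained in one step by multiplying the transposed term-document matrix by the matrix itself. A minimal sketch, assuming the stems printed for the corpus earlier and plain numpy rather than the pandas dataframe; the names `docs`, `vocab`, `X` and `overlap` are chosen here for illustration.

```python
import numpy as np

# Stemmed tokens per sentence, copied from the corpus printed earlier
docs = {
    "sent0": ["thoma", "jefferson", "began", "build", "monticello", "age", "26", "."],
    "sent1": ["construct", "done", "mostli", "local", "mason", "carpent", "."],
    "sent2": ["move", "south", "pavilion", "1770", "."],
    "sent3": ["turn", "monticello", "neoclass", "masterpiec", "jefferson", "'s", "obsess", "."],
}

# Vocabulary in order of first appearance
vocab = []
for tokens in docs.values():
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# Term-document matrix: one row per term, one column per sentence
X = np.array([[int(tok in tokens) for tokens in docs.values()] for tok in vocab])

# Entry (i, j) is the dot product between sentence i and sentence j
overlap = X.T @ X
print(overlap)
```

The diagonal holds the number of distinct stems in each sentence, and the entry for `sent0` and `sent3` matches the value computed above.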

## This is your first **vector space model**!


