### Coding Challenge #2: Natural Language Processing

A common task in NLP is to determine the similarity between documents or words. To make such comparisons possible, you will build on Coding Challenge #1 and represent text as vectors. Once you have a document-term matrix, each document is a row of numbers, so two documents can be compared by measuring the distance or angle between their rows.
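
To make the idea concrete, here is a minimal sketch that uses only the standard library. The three toy sentences and the naive whitespace tokenizer are assumptions made for illustration and are not part of the challenge. It builds a small document-term matrix and measures the Euclidean distance between rows.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy documents, used only for illustration
docs = ["the cat sat", "the cat ran", "dogs bark loudly"]

# Naive whitespace tokenization (an assumption; the challenge itself uses NLTK)
tokenized = [d.split() for d in docs]

# Vocabulary: one column per distinct term, in sorted order
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix: one row per document, one count per vocabulary term
matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

def euclidean(u, v):
    """Straight-line distance between two equal-length count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(vocab)
print(euclidean(matrix[0], matrix[1]))  # rows that share two of three words
print(euclidean(matrix[0], matrix[2]))  # rows that share no words
```

The first pair of documents ends up closer together than the second, which is the intuition behind every similarity measure used later in this challenge.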

In this Coding Challenge, you will use **Gensim**, a free Python library, to determine document similarity.

**"Gensim" Reference**: https://radimrehurek.com/project/gensim/




**Install Gensim**:

In [1]:
# https://radimrehurek.com/gensim/install.html
!pip install --upgrade gensim

Requirement already up-to-date: gensim in /usr/local/lib/python3.6/dist-packages (3.4.0)
Requirement not upgraded as not directly required: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.14.3)
Requirement not upgraded as not directly required: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.11.0)
Requirement not upgraded as not directly required: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (0.19.1)
Requirement not upgraded as not directly required: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.5.7)
Requirement not upgraded as not directly required: bz2file in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (0.98)
Requirement not upgraded as not directly required: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.7.36)
Requirement not upgraded as not directly required: requests in /usr/local/lib/python3.6/dist-packages (from s

In [0]:
import gensim

**Import NLTK and download its data:**

In [3]:
# Import the NLTK package
import nltk

# Get all the data associated with NLTK – could take a while to download all the data
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /content/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /content/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /content/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to /content/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to /content/nltk_data...
[nltk_data]    |   Package cess_esp is already up-to

True

**Import the requisite NLTK packages:**

In [0]:
#Import word tokenizer
from nltk.tokenize import word_tokenize

**Dataset:**

In [0]:
#For the purposes of this challenge, each line represents a document. In all, there are 8 documents

raw_documents = ['The dog ran up the steps and entered the owner\'s room to check if the owner was in the room.',
                 'My name is Thomson Comer, commander of the Machine Learning program at Lambda school.',
                 'I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.',
                 'Machine Learning is one of my favorite subjects.',
                 'I am excited about taking the Machine Learning class at the Lambda school starting in April.',
                 'When does the Machine Learning program kick-off at Lambda school?',
                 'The batter hit the ball out off AT&T park into the pacific ocean.',
                 'The pitcher threw the ball into the dug-out.']

**Step #1: Tokenize each document into a list of tokens**

In [0]:
tokens = [word_tokenize(doc) for doc in raw_documents]

**Step #2: Use the tokenized documents to create a dictionary, which maps every unique token to an integer id**

In [0]:
dct = gensim.corpora.Dictionary(tokens)
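
Conceptually, the dictionary is just a mapping from each distinct token to an integer. The sketch below shows that idea with a plain Python `dict` and two made-up token lists (both are assumptions for illustration). Gensim may assign ids in a different order than this loop does.

```python
# Hypothetical pre-tokenized documents, for illustration only
toy_tokens = [["machine", "learning", "is", "fun"],
              ["learning", "is", "hard"]]

token2id = {}
for doc in toy_tokens:
    for token in doc:
        # Assign the next free integer the first time a token is seen
        if token not in token2id:
            token2id[token] = len(token2id)

print(token2id)
```

With the Gensim dictionary built in this step, the corresponding mapping can be inspected through `dct.token2id`.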

**Step #3: Convert each list of tokens (created in Step #1) into a bag of words. The bag of words records term frequency: each element is a pair made up of a token's id in the dictionary and the number of times that token occurs in the document**



In [8]:
corpus = [dct.doc2bow(doc) for doc in tokens]
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 2), (12, 1), (13, 4), (14, 1), (15, 1), (16, 1)], [(1, 1), (13, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)], [(1, 1), (3, 1), (13, 3), (20, 2), (21, 2), (29, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(1, 1), (20, 1), (21, 1), (26, 1), (28, 1), (40, 1), (41, 1), (42, 1), (43, 1)], [(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1), (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)], [(13, 1), (19, 1), (20, 1), (21, 1), (24, 1), (29, 1), (30, 1), (50, 1), (51, 1), (52, 1), (53, 1)], [(1, 1), (2, 1), (13, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)], [(1, 1), (2, 1), (13, 2), (57, 1), (60, 1), (66, 1), (67, 1), (68, 1)]]
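
Each pair in the output reads as (token id, count); for example, `(13, 4)` in the first document means the token with id 13 occurs four times there. A minimal sketch of the same conversion, using a small hypothetical mapping in place of the Gensim dictionary:

```python
from collections import Counter

# Hypothetical token-to-id mapping, standing in for the Gensim dictionary
token2id = {"ball": 0, "the": 1, "threw": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(to_bow(["the", "pitcher", "threw", "the", "ball"], token2id))
```

The word "pitcher" is missing from the toy mapping, so it is dropped. By default, `doc2bow` likewise ignores tokens that are not in the dictionary.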


**Step #4: Use the "*Gensim*" library to create a TF-IDF model for the bag of words**

In [0]:
model = gensim.models.TfidfModel(corpus)  
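
For reference, the weights this model produces on the corpus above are consistent with the following scheme, which is assumed here to be Gensim's default and is worth checking against the installed version: raw term frequency times a base-2 inverse document frequency, with each document vector then scaled to unit length. With $N$ documents, $\mathrm{tf}_{t,d}$ the count of term $t$ in document $d$, and $\mathrm{df}_t$ the number of documents containing $t$:

$$
w_{t,d} = \frac{\mathrm{tf}_{t,d}\,\log_2\!\left(N/\mathrm{df}_t\right)}{\sqrt{\sum_{t' \in d}\left(\mathrm{tf}_{t',d}\,\log_2\!\left(N/\mathrm{df}_{t'}\right)\right)^2}}
$$

A term that appears in every document gets $\log_2(N/N) = 0$, so it contributes nothing to a comparison.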

**Step #5: a) Output the fifth document, b) Output the bag of words for the fifth document, i.e. its term frequencies, c) Review the TF-IDF weight of each term in the bag of words for the fifth document**

In [10]:
print('Document:', raw_documents[4])
print('Term Frequency:', corpus[4])
print('Tfidf:', model[corpus[4]])

Document: I am excited about taking the Machine Learning class at the Lambda school starting in April.
Term Frequency: [(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1), (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)]
Tfidf: [(1, 0.02253010613488428), (8, 0.2339027435896511), (13, 0.04506021226976856), (19, 0.16549057668178024), (20, 0.07930143947845378), (21, 0.07930143947845378), (24, 0.16549057668178024), (30, 0.16549057668178024), (31, 0.2339027435896511), (32, 0.2339027435896511), (44, 0.35085411538447664), (45, 0.35085411538447664), (46, 0.35085411538447664), (47, 0.35085411538447664), (48, 0.35085411538447664), (49, 0.35085411538447664)]
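
The printed weights can be checked by hand. The sketch below assumes the weighting is term frequency times `log2(N / df)` followed by scaling to unit length; the document frequencies in `df` were counted manually from the Step #3 output and are not produced by Gensim here.

```python
from math import log2, sqrt

N = 8  # number of documents in raw_documents

# Term frequencies of the fifth document, copied from the output above
bow = [(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1),
       (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)]

# Document frequencies, counted by hand from the printed corpus:
# the number of documents whose bag of words contains each id
df = {1: 7, 8: 2, 13: 7, 19: 3, 20: 5, 21: 5, 24: 3, 30: 3,
      31: 2, 32: 2, 44: 1, 45: 1, 46: 1, 47: 1, 48: 1, 49: 1}

# Unnormalized weight: term frequency times log2(N / document frequency)
raw = {tid: tf * log2(N / df[tid]) for tid, tf in bow}

# Scale so the document vector has unit Euclidean length
norm = sqrt(sum(w * w for w in raw.values()))
tfidf = {tid: w / norm for tid, w in raw.items()}

for tid in (1, 13, 20, 44):
    print(tid, round(tfidf[tid], 4))
```

Ids 44 to 49 occur in only this document and receive the largest weight, while ids 1 and 13 occur in seven of the eight documents and are weighted close to zero.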


**Step #6: Determine document similarity** -  Identify the most similar document and the least similar document to the body of text below.

*Good Reference for review*: https://radimrehurek.com/gensim/similarities/docsim.html

In [0]:
# Step 6

# Document to  compare: "Machine Learning at Lambda school is awesome"

In [0]:
from gensim.similarities import MatrixSimilarity

In [14]:
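test_doc = "Machine Learning at Lambda school is awesome"

# Tokenize the query, map it to a bag of words, then weight it with TF-IDF
query = model[dct.doc2bow(word_tokenize(test_doc))]
# Build a cosine-similarity index over the TF-IDF-weighted corpus
index = MatrixSimilarity(model[corpus], num_features=len(dct))

# One similarity score per document in the corpus
sims = index[query]

# argmax/argmin return the first index when several scores tie
most = sims.argmax()
least = sims.argmin()

print('Most similar document:\n{}\nLeast similar document:\n{}'.format(
    raw_documents[most], raw_documents[least]))

Most similar document:
My name is Thomson Comer, commander of the Machine Learning program at Lambda school.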
Least similar document:
The dog ran up the steps and entered the owner's room to check if the owner was in the room.
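
Looking only at the two extremes hides how the other documents score. The following standard-library sketch ranks all eight documents. It reuses the bag-of-words corpus printed in Step #3 and assumes the same `log2(N / df)` weighting with unit-length scaling and cosine similarity. The query ids are inferred from the printed corpus rather than taken from `dct`, so treat them as an assumption; "awesome" is not in the dictionary and is dropped.

```python
from math import log2, sqrt

# Bag-of-words corpus copied from the Step #3 output
corpus = [
    [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 2), (12, 1), (13, 4), (14, 1), (15, 1), (16, 1)],
    [(1, 1), (13, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)],
    [(1, 1), (3, 1), (13, 3), (20, 2), (21, 2), (29, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)],
    [(1, 1), (20, 1), (21, 1), (26, 1), (28, 1), (40, 1), (41, 1), (42, 1), (43, 1)],
    [(1, 1), (8, 1), (13, 2), (19, 1), (20, 1), (21, 1), (24, 1), (30, 1), (31, 1), (32, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)],
    [(13, 1), (19, 1), (20, 1), (21, 1), (24, 1), (29, 1), (30, 1), (50, 1), (51, 1), (52, 1), (53, 1)],
    [(1, 1), (2, 1), (13, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)],
    [(1, 1), (2, 1), (13, 2), (57, 1), (60, 1), (66, 1), (67, 1), (68, 1)],
]

# Inferred (id, count) pairs for "Machine Learning at Lambda school is"
query_bow = [(19, 1), (20, 1), (21, 1), (24, 1), (26, 1), (30, 1)]

n_docs = len(corpus)

# Document frequency: in how many documents each id appears
df = {}
for doc in corpus:
    for tid, _ in doc:
        df[tid] = df.get(tid, 0) + 1

def tfidf(bow):
    """tf * log2(N / df), scaled to unit length (assumed weighting)."""
    raw = {tid: tf * log2(n_docs / df[tid]) for tid, tf in bow if tid in df}
    norm = sqrt(sum(w * w for w in raw.values()))
    return {tid: w / norm for tid, w in raw.items()} if norm else {}

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v[tid] for tid, w in u.items() if tid in v)

query = tfidf(query_bow)
sims = [cosine(query, tfidf(doc)) for doc in corpus]

# Rank every document, most similar first
for i in sorted(range(n_docs), key=lambda i: sims[i], reverse=True):
    print(i, round(sims[i], 3))
```

Under these assumptions, the documents at indices 0, 6, and 7 share no tokens with the query and all score 0, so the "least similar" answer is a three-way tie that `argmin` resolves by returning the first index.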
