### Coding Challenge #2: Natural Language Processing

A common task in NLP is to determine the similarity between documents or words. To make that comparison possible, you will build on Coding Challenge #1 and represent text as vectors. Once the documents are encoded as a document-term matrix, each document is a row of numbers, and similarity can be measured by comparing those rows.
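
As a minimal sketch of that idea, the snippet below turns two toy sentences into count vectors over a shared vocabulary and measures the Euclidean distance between them. The sentences and the helper names (`to_vector`, `euclidean`) are illustrative assumptions and are not part of the challenge dataset.

```python
import math
from collections import Counter

# Two toy documents (hypothetical, not from the challenge dataset)
docs = ["the dog chased the ball", "the cat chased the dog"]

# Vocabulary: every distinct word, in a fixed (sorted) order
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each row is one document; each column is one vocabulary word
vectors = [to_vector(doc) for doc in docs]
distance = euclidean(vectors[0], vectors[1])

print(vocab)
print(vectors)
print(distance)
```

A smaller distance means the two rows, and therefore the two documents, use more similar word counts.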

In this Coding Challenge, you will use "**Gensim**", a free Python library, to determine document similarity.

**"Gensim" Reference**: https://radimrehurek.com/project/gensim/




**Install Gensim**:

In [0]:
# https://radimrehurek.com/gensim/install.html
!pip install --upgrade gensim
!pip install regex

**Install NLTK:**

In [0]:
# Import the NLTK package
import nltk

# Get all the data associated with NLTK – could take a while to download all the data
nltk.download('all')

**Import the requisite packages:**

In [0]:
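# NumPy for printing arrays; regex for punctuation removal
import numpy as np
import regex as re
# Gensim dictionary and TF-IDF model
from gensim import corpora
from gensim.models import TfidfModel
# NLTK word tokenizer
from nltk.tokenize import word_tokenize
# scikit-learn TF-IDF vectorizer, used for cosine similarity in Step #6
from sklearn.feature_extraction.text import TfidfVectorizer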

**Dataset:**

In [0]:
# For this challenge, each string in the list is one document; there are 8 documents in all

raw_documents = ['The dog ran up the steps and entered the owner\'s room to check if the owner was in the room.',
                 'My name is Thomson Comer, commander of the Machine Learning program at Lambda school.',
                 'I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.',
                 'Machine Learning is one of my favorite subjects.',
                 'I am excited about taking the Machine Learning class at the Lambda school starting in April.',
                 'When does the Machine Learning program kick-off at Lambda school?',
                 'The batter hit the ball out off AT&T park into the pacific ocean.',
                 'The pitcher threw the ball into the dug-out.']

**Step #1**: **Create a document that contains a list of tokens**

In [0]:
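# Strip punctuation and lowercase each document
clean = [re.sub(r'[^\w\s]', '', doc).lower() for doc in raw_documents]
print('no punctuation: \n', np.matrix(clean), '\n')

# English stopwords as a set for fast membership tests
en_stopwords = set(nltk.corpus.stopwords.words('english'))

# Split each cleaned document into word tokens
tokens = [word_tokenize(doc) for doc in clean]
# dtype=object keeps the ragged token lists printable on recent NumPy versions
print('tokens: \n', np.matrix(tokens, dtype=object), '\n')

# Despite its name, `nontokens` holds the tokens that are NOT stopwords
nontokens = [[tok for tok in doc if tok not in en_stopwords] for doc in tokens]
print('tokens minus stopwords: \n', np.matrix(nontokens, dtype=object), '\n')

# Re-join the filtered tokens into strings for the similarity step
docs2 = [' '.join(doc) for doc in nontokens]
print('docs minus stopwords: \n', np.matrix(docs2), '\n')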

no punctuation: 
 [['the dog ran up the steps and entered the owners room to check if the owner was in the room'
  'my name is thomson comer commander of the machine learning program at lambda school'
  'i am creating the curriculum for the machine learning program and will be teaching the fulltime machine learning program'
  'machine learning is one of my favorite subjects'
  'i am excited about taking the machine learning class at the lambda school starting in april'
  'when does the machine learning program kickoff at lambda school'
  'the batter hit the ball out off att park into the pacific ocean'
  'the pitcher threw the ball into the dugout']] 

tokens: 
 [[list(['the', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'the', 'owners', 'room', 'to', 'check', 'if', 'the', 'owner', 'was', 'in', 'the', 'room'])
  list(['my', 'name', 'is', 'thomson', 'comer', 'commander', 'of', 'the', 'machine', 'learning', 'program', 'at', 'lambda', 'school'])
  list(['i', 'am', 'creating', 'th
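
The cleaning step above can also be sketched with the standard library alone. The snippet assumes a tiny hand-picked stopword set (`STOPWORDS`, hypothetical; NLTK's English list is far longer) and splits on whitespace instead of calling `word_tokenize`, which is enough once punctuation has been removed.

```python
import re

# Small illustrative stopword set (an assumption, not NLTK's full list)
STOPWORDS = {"the", "and", "to", "if", "was", "in", "up"}

def preprocess(text, stopwords=STOPWORDS):
    """Remove punctuation, lowercase, split on whitespace, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", "", text).lower()
    return [tok for tok in cleaned.split() if tok not in stopwords]

doc = ("The dog ran up the steps and entered the owner's room "
       "to check if the owner was in the room.")
filtered = preprocess(doc)
print(filtered)
```

Note that removing the apostrophe turns "owner's" into "owners", so it is kept as a separate word from "owner".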

**Step #2: Use the tokenized documents to create a dictionary. A dictionary maps every word to a unique integer id**

In [0]:
dictionary = corpora.Dictionary(tokens)
dictionary2 = corpora.Dictionary(nontokens)
print('tokens dictionary: ', dictionary.token2id)
print(len(dictionary), ' token words')
print(len(dictionary2), ' token words minus stopwords')
# Words present before filtering but absent afterwards are the removed stopwords
all_words = {tok for doc in tokens for tok in doc}
kept_words = {tok for doc in nontokens for tok in doc}
print('stopwords:', list(all_words - kept_words))

tokens dictionary:  {'and': 0, 'check': 1, 'dog': 2, 'entered': 3, 'if': 4, 'in': 5, 'owner': 6, 'owners': 7, 'ran': 8, 'room': 9, 'steps': 10, 'the': 11, 'to': 12, 'up': 13, 'was': 14, 'at': 15, 'comer': 16, 'commander': 17, 'is': 18, 'lambda': 19, 'learning': 20, 'machine': 21, 'my': 22, 'name': 23, 'of': 24, 'program': 25, 'school': 26, 'thomson': 27, 'am': 28, 'be': 29, 'creating': 30, 'curriculum': 31, 'for': 32, 'fulltime': 33, 'i': 34, 'teaching': 35, 'will': 36, 'favorite': 37, 'one': 38, 'subjects': 39, 'about': 40, 'april': 41, 'class': 42, 'excited': 43, 'starting': 44, 'taking': 45, 'does': 46, 'kickoff': 47, 'when': 48, 'att': 49, 'ball': 50, 'batter': 51, 'hit': 52, 'into': 53, 'ocean': 54, 'off': 55, 'out': 56, 'pacific': 57, 'park': 58, 'dugout': 59, 'pitcher': 60, 'threw': 61}
62  token words
40  token words minus stopwords
stopwords: ['in', 'at', 'off', 'am', 'if', 'will', 'is', 'does', 'into', 'be', 'my', 'i', 'to', 'and', 'for', 'was', 'the', 'when', 'about', 'of', 
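
To make the word-to-id mapping concrete, here is a plain-Python sketch. It assumes that new words receive the next free id in alphabetical order within each document, which is consistent with the id order visible in the output above; `build_token2id` is a hypothetical helper, not a Gensim function.

```python
def build_token2id(tokenized_docs):
    """Assign consecutive integer ids to words, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # Assumption: unseen words are numbered alphabetically per document
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Two toy tokenized documents (hypothetical)
toy_docs = [
    "the dog ran up the steps".split(),
    "the dog saw the pitcher".split(),
]
token2id = build_token2id(toy_docs)
print(token2id)
```

Words already seen in an earlier document ("the", "dog") keep their original id, so every word has exactly one id across the whole collection.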

**Step #3: Convert the list of tokens from each document (created above in Step 1) into a bag of words. The bag of words captures term frequency: each element is a pair holding the index of a word in the dictionary and the number of times that word occurs in the document**



In [0]:
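# Bag of words per document: a list of (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in tokens]
# corpus2 reuses the ids from `dictionary`, so ids stay comparable with `corpus`
corpus2 = [dictionary.doc2bow(text) for text in nontokens]
# dtype=object keeps the ragged lists printable on recent NumPy versions
print('corpus: \n', np.matrix(corpus, dtype=object))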

corpus: 
 [[list([(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 5), (12, 1), (13, 1), (14, 1)])
  list([(11, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1)])
  list([(0, 1), (11, 3), (20, 2), (21, 2), (25, 2), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)])
  list([(18, 1), (20, 1), (21, 1), (22, 1), (24, 1), (37, 1), (38, 1), (39, 1)])
  list([(5, 1), (11, 2), (15, 1), (19, 1), (20, 1), (21, 1), (26, 1), (28, 1), (34, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1)])
  list([(11, 1), (15, 1), (19, 1), (20, 1), (21, 1), (25, 1), (26, 1), (46, 1), (47, 1), (48, 1)])
  list([(11, 3), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1)])
  list([(11, 3), (50, 1), (53, 1), (59, 1), (60, 1), (61, 1)])]]
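
The bag-of-words conversion can be sketched in a few lines of plain Python. `doc2bow_sketch` is a hypothetical stand-in that assumes words missing from the dictionary are simply skipped; the toy dictionary and sentence are illustrative only.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token id, count) pairs, skipping unknown words."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Toy dictionary and document (hypothetical)
token2id = {"dog": 0, "ran": 1, "room": 2, "the": 3}
tokens = "the dog ran to the room the room".split()

bow = doc2bow_sketch(tokens, token2id)
print(bow)
```

Word order is discarded: only which ids occur and how often survive, which is why the representation is called a "bag".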


**Step #4: Use the "*Gensim*" library to create a TF-IDF model for the bag of words**

In [0]:
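# Fit one TF-IDF model per corpus (with and without stopwords)
tfidf = TfidfModel(corpus)
tfidf2 = TfidfModel(corpus2)
# dtype=object keeps the ragged lists printable on recent NumPy versions
print(np.matrix([tfidf[doc] for doc in corpus], dtype=object))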

[[list([(0, 0.16670846296747688), (1, 0.2500626944512153), (2, 0.2500626944512153), (3, 0.2500626944512153), (4, 0.2500626944512153), (5, 0.16670846296747688), (6, 0.2500626944512153), (7, 0.2500626944512153), (8, 0.2500626944512153), (9, 0.5001253889024306), (10, 0.2500626944512153), (11, 0.08028891210506647), (12, 0.2500626944512153), (13, 0.2500626944512153), (14, 0.2500626944512153)])
  list([(11, 0.025524077576628612), (15, 0.18748222010754678), (16, 0.39747827220782794), (17, 0.39747827220782794), (18, 0.26498551480521865), (19, 0.18748222010754678), (20, 0.08983961642561385), (21, 0.08983961642561385), (22, 0.26498551480521865), (23, 0.39747827220782794), (24, 0.26498551480521865), (25, 0.18748222010754678), (26, 0.18748222010754678), (27, 0.39747827220782794)])
  list([(0, 0.21439591157214022), (11, 0.06195347564301893), (20, 0.14537584420808175), (21, 0.14537584420808175), (25, 0.3033782545666494), (28, 0.21439591157214022), (29, 0.32159386735821033), (30, 0.32159386735821033)
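
The weights above can be reproduced by hand, which helps show what the model is doing. The sketch below assumes the weighting is raw term count times log2(N / document frequency), followed by L2 normalization of each document vector; under that assumption it matches the printed weights for the first document. The document strings are copied from the "no punctuation" output of Step #1, and `tfidf_sketch` is a hypothetical helper.

```python
import math
from collections import Counter

# The eight documents after punctuation removal and lowercasing (Step #1 output)
docs = [
    "the dog ran up the steps and entered the owners room to check if the owner was in the room",
    "my name is thomson comer commander of the machine learning program at lambda school",
    "i am creating the curriculum for the machine learning program and will be teaching the fulltime machine learning program",
    "machine learning is one of my favorite subjects",
    "i am excited about taking the machine learning class at the lambda school starting in april",
    "when does the machine learning program kickoff at lambda school",
    "the batter hit the ball out off att park into the pacific ocean",
    "the pitcher threw the ball into the dugout",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_sketch(doc_tokens):
    """Return {term: weight} using tf * log2(N / df), then L2-normalize."""
    tf = Counter(doc_tokens)
    raw = {term: count * math.log2(n_docs / df[term]) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_sketch(tokenized[0])
for term in ("and", "check", "room", "the"):
    print(term, round(weights[term], 4))
```

"the" occurs five times in the first document yet gets the smallest weight, because it appears in seven of the eight documents and its IDF is therefore close to zero.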

**Step #5: a) Output the 5th document, b) Output the bag of words for the fifth document, i.e. its term frequencies, c) Review the Inverse Document Frequency (IDF) for each term in the bag of words for the 5th document**

In [0]:
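# Zero-based index of the document to inspect: n = 0 selects the first document
# (the output below); set n = 4 to inspect the 5th document named in the step
n = 0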

print(raw_documents[n])
print(tokens[n])
print(corpus[n])
print(tfidf[corpus[n]])

The dog ran up the steps and entered the owner's room to check if the owner was in the room.
['the', 'dog', 'ran', 'up', 'the', 'steps', 'and', 'entered', 'the', 'owners', 'room', 'to', 'check', 'if', 'the', 'owner', 'was', 'in', 'the', 'room']
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 2), (10, 1), (11, 5), (12, 1), (13, 1), (14, 1)]
[(0, 0.16670846296747688), (1, 0.2500626944512153), (2, 0.2500626944512153), (3, 0.2500626944512153), (4, 0.2500626944512153), (5, 0.16670846296747688), (6, 0.2500626944512153), (7, 0.2500626944512153), (8, 0.2500626944512153), (9, 0.5001253889024306), (10, 0.2500626944512153), (11, 0.08028891210506647), (12, 0.2500626944512153), (13, 0.2500626944512153), (14, 0.2500626944512153)]
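
The `(id, weight)` pairs are easier to review once the ids are mapped back to words. The sketch below copies ids 0 to 14 from the dictionary output in Step #2 and the weights printed above (rounded to four places here), then ranks the terms by weight. With Gensim itself, `dictionary[token_id]` should give the same reverse lookup.

```python
# Ids 0-14 copied from the dictionary output in Step #2
token2id = {'and': 0, 'check': 1, 'dog': 2, 'entered': 3, 'if': 4, 'in': 5,
            'owner': 6, 'owners': 7, 'ran': 8, 'room': 9, 'steps': 10,
            'the': 11, 'to': 12, 'up': 13, 'was': 14}

# TF-IDF weights for the first document, copied from the output above (rounded)
weights = [(0, 0.1667), (1, 0.2501), (2, 0.2501), (3, 0.2501), (4, 0.2501),
           (5, 0.1667), (6, 0.2501), (7, 0.2501), (8, 0.2501), (9, 0.5001),
           (10, 0.2501), (11, 0.0803), (12, 0.2501), (13, 0.2501), (14, 0.2501)]

# Invert the mapping so ids can be translated back to words
id2token = {token_id: token for token, token_id in token2id.items()}

# Rank terms from most to least distinctive for this document
ranked = sorted(((id2token[i], w) for i, w in weights),
                key=lambda pair: pair[1], reverse=True)
for term, weight in ranked:
    print(f"{term:10s} {weight:.4f}")
```

"room" ranks first because it occurs twice and only in this document, while "the" ranks last because it occurs in almost every document.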


**Step #6: Determine document similarity** -  Identify the most similar document and the least similar document to the body of text below.

*Good Reference for review*: https://radimrehurek.com/gensim/similarities/docsim.html

In [0]:
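vectorizer = TfidfVectorizer()

# NOTE: the ANSI bold codes (\033[1m ... \033[0m) are part of the string that is
# vectorized, so "Machine" is tokenized as "1mmachine" and never matches a document
t = "\n\033[1mMachine Learning at Lambda school is awesome\033[0m"
print(t, '\n')

def cosine_sim(text1, text2):
    # Fit TF-IDF on just this pair; rows are L2-normalized, so the dot product is the cosine
    v = vectorizer.fit_transform([text1, text2])
    return (v @ v.T).toarray()[0, 1]

print('\033[4mSimilarities with stopwords\033[0m:')
for doc in raw_documents:
    print(round(cosine_sim(doc, t), 3), ': ', doc)

t2 = "\n\033[1mMachine Learning Lambda school awesome\033[0m"
print(t2, '\n')

print('\033[4mSimilarities without stopwords\033[0m:')
for doc in docs2:
    print(round(cosine_sim(doc, t2), 3), ': ', doc)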


Machine Learning at Lambda school is awesome

Similarities with stopwords:
0.0 :  The dog ran up the steps and entered the owner's room to check if the owner was in the room.
0.317 :  My name is Thomson Comer, commander of the Machine Learning program at Lambda school.
0.069 :  I am creating the curriculum for the Machine Learning program and will be teaching the full-time Machine Learning program.
0.144 :  Machine Learning is one of my favorite subjects.
0.213 :  I am excited about taking the Machine Learning class at the Lambda school starting in April.
0.275 :  When does the Machine Learning program kick-off at Lambda school?
0.043 :  The batter hit the ball out off AT&T park into the pacific ocean.
0.0 :  The pitcher threw the ball into the dug-out.

Machine Learning Lambda school awesome

Similarities without stopwords:
0.0 :  dog ran steps entered owners room check owner room
0.261 :  name thomson comer commander machine learning program lambda 
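
To see where these scores come from, the sketch below re-implements the pairwise computation with the standard library. It assumes scikit-learn's default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, raw counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, L2 normalization) fitted on only the two texts being compared. `pair_cosine` is a hypothetical helper. The example also compares the query as written in the cell above, with its ANSI escape codes, against a plain version of the same sentence.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def pair_cosine(text1, text2):
    """Cosine similarity of TF-IDF vectors fitted on just these two texts."""
    counts = [Counter(tokenize(text1)), Counter(tokenize(text2))]
    vocab = set(counts[0]) | set(counts[1])
    # Smoothed IDF over a two-document corpus: ln((1 + 2) / (1 + df)) + 1
    idf = {t: math.log(3 / (1 + sum(t in c for c in counts))) + 1 for t in vocab}
    vecs = [{t: c[t] * idf[t] for t in c} for c in counts]
    dot = sum(w * vecs[1].get(t, 0.0) for t, w in vecs[0].items())
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    return dot / (norms[0] * norms[1])

doc = "Machine Learning is one of my favorite subjects."
query_ansi = "\n\033[1mMachine Learning at Lambda school is awesome\033[0m"
query_plain = "Machine Learning at Lambda school is awesome"

print(tokenize(query_ansi))
print(round(pair_cosine(doc, query_ansi), 3))
print(round(pair_cosine(doc, query_plain), 3))
```

With the escape codes included, the first query token becomes "1mmachine" and a stray "0m" token appears, so "machine" no longer matches and the score for this document is 0.144, as in the output above. The plain query scores higher because "machine" is then a shared term.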