# Hands on NLP using Python

## 4. NLP Core

## S6.V28. Variables and operations in Python

In [476]:
import nltk

In [477]:
#nltk.download()

## S6.V29. Tokenizing words and sentences

In [478]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

In [479]:
# Separate the entire string into a list of sentences
sentences = nltk.sent_tokenize(paragraph)
sentences[:2], len(sentences)

(['Thank you all so very much.', 'Thank you to the Academy.'], 21)

In [480]:
# Separate the entire string into a list of words
words = nltk.word_tokenize(paragraph)
words[:5], len(words)

(['Thank', 'you', 'all', 'so', 'very'], 347)

## S6.V31. Stemming and Lemmatization

∆ Stemming and lemmatization become important when we extract features from a corpus (a collection) of sentences.

**Stemming:** <br/>
After tokenizing a corpus of sentences into words, we find that different sentences contain closely related words, for example intelligent, intelligently | work, working. 

Although intelligent and intelligently come from the same root, they would be treated as two unrelated tokens. So we reduce every word to its stem (base or root form).<br/> **∆ That's what stemming does.**

**Problem:** A stem is not guaranteed to be a real word, so readability can be lost. For example, a stemmer may reduce `intelligent, intelligently` to something like `intelligen`, and `final, finally` to `fina` (the exact stems depend on the algorithm). **Lemmatization** prevents this meaning loss.

**Lemmatization:** <br/>
After lemmatization, `intelligent, intelligently` will get converted into `intelligent`. 

∆ So wherever readability or keeping the meaning of words intact matters, we should use lemmatization.

∆ For ML models, keeping the meaning of a word intact does NOT provide any extra benefit.
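
To make the contrast concrete, here is a toy sketch in plain Python. It is not NLTK's Porter algorithm or the WordNet lemmatizer: `SUFFIXES`, `naive_stem`, `LEMMAS` and `toy_lemmatize` are hypothetical names invented for illustration, and the rule set is deliberately tiny.

```python
# Toy illustration only: real stemmers (e.g. Porter) apply many ordered rules,
# and real lemmatizers consult a dictionary such as WordNet.
SUFFIXES = ("ly", "ing", "ed", "s")  # hypothetical, deliberately tiny rule set

def naive_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table mapping inflected forms to dictionary forms.
LEMMAS = {"working": "work", "worked": "work", "studies": "study"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["working", "studies"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
# working -> work | work
# studies -> studie | study
```

The stem `studie` is not a word, while the lemma `study` is. That is the readability trade-off described above.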

## S6.V32. Stemming using NLTK

In [481]:
from nltk.stem import PorterStemmer

# Step 1: Tokenize
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
print('-'*115)

# Step 2: create an object of PorterStemmer class
stemmer = PorterStemmer()

# Step 3: Stemming
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newwords = [stemmer.stem(word) for word in words]
    sentences[i] = ' '.join(newwords)
print(sentences)

['Thank you all so very much.', 'Thank you to the Academy.', 'Thank you to all of you in this room.', 'I have to congratulate \n               the other incredible nominees this year.', 'The Revenant was \n               the product of the tireless efforts of an unbelievable cast\n               and crew.', 'First off, to my brother in this endeavor, Mr. Tom \n               Hardy.', 'Tom, your talent on screen can only be surpassed by \n               your friendship off screen … thank you for creating a t\n               ranscendent cinematic experience.', 'Thank you to everybody at \n               Fox and New Regency … my entire team.', 'I have to thank \n               everyone from the very onset of my career … To my parents; \n               none of this would be possible without you.', 'And to my \n               friends, I love you dearly; you know who you are.', "And lastly,\n               I just want to say this: Making The Revenant was about\n               man's relations

## S6.V33. Lemmatization using NLTK

In [482]:
from nltk.stem import WordNetLemmatizer

# Step 1: Tokenize
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
print('-'*115)

# Step 2: create an object of WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

# Step 3: Lemmatizing
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newwords = [lemmatizer.lemmatize(word) for word in words]
    sentences[i] = ' '.join(newwords)
print(sentences)

['Thank you all so very much.', 'Thank you to the Academy.', 'Thank you to all of you in this room.', 'I have to congratulate \n               the other incredible nominees this year.', 'The Revenant was \n               the product of the tireless efforts of an unbelievable cast\n               and crew.', 'First off, to my brother in this endeavor, Mr. Tom \n               Hardy.', 'Tom, your talent on screen can only be surpassed by \n               your friendship off screen … thank you for creating a t\n               ranscendent cinematic experience.', 'Thank you to everybody at \n               Fox and New Regency … my entire team.', 'I have to thank \n               everyone from the very onset of my career … To my parents; \n               none of this would be possible without you.', 'And to my \n               friends, I love you dearly; you know who you are.', "And lastly,\n               I just want to say this: Making The Revenant was about\n               man's relations

**Lemmatization gives better readability than stemming.** Lemmatization keeps each word's meaning intact.

∆ For text classification or spam detection, **stemming** works very well.

∆ For question-answering chatbots, **lemmatization** would be a better option.

## S6.V34. Stop word removal using NLTK

In [483]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

Stop words are very common words that give no insight into the context, for example in sentiment analysis, where they appear in positive and negative sentences alike. For our purpose they carry no useful signal, which is why we remove them.
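
The idea can be sketched without NLTK. The stop list below is a hypothetical, hand-picked subset (NLTK's English list is much longer), and `remove_stop_words` is an illustrative helper, not an NLTK function.

```python
# Hypothetical, hand-picked stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "it", "to", "a", "and", "this", "was"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

positive = "The movie was great and it is a joy to watch".split()
negative = "The movie was terrible and it is a pain to watch".split()

print(remove_stop_words(positive))  # ['movie', 'great', 'joy', 'watch']
print(remove_stop_words(negative))  # ['movie', 'terrible', 'pain', 'watch']
```

The removed words are identical in both sentences, so they say nothing about sentiment. What remains is what separates the two.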

In [484]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sayantan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [485]:
from nltk.corpus import stopwords

# Step 1: Tokenize
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
print('-'*115)

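# Step 2: Load the English stop words once, as a set for fast lookups
stop_words = set(stopwords.words('english'))

# Step 3: Removing stopwords (the list is lowercase, so capitalized tokens such as 'The' are kept)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    newwords = [word for word in words if word not in stop_words]
    sentences[i] = ' '.join(newwords)
print(sentences)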

['Thank you all so very much.', 'Thank you to the Academy.', 'Thank you to all of you in this room.', 'I have to congratulate \n               the other incredible nominees this year.', 'The Revenant was \n               the product of the tireless efforts of an unbelievable cast\n               and crew.', 'First off, to my brother in this endeavor, Mr. Tom \n               Hardy.', 'Tom, your talent on screen can only be surpassed by \n               your friendship off screen … thank you for creating a t\n               ranscendent cinematic experience.', 'Thank you to everybody at \n               Fox and New Regency … my entire team.', 'I have to thank \n               everyone from the very onset of my career … To my parents; \n               none of this would be possible without you.', 'And to my \n               friends, I love you dearly; you know who you are.', "And lastly,\n               I just want to say this: Making The Revenant was about\n               man's relations

## S6.V35. Part of Speech (POS) tagging

In [486]:
# tokenize paragraph into words as we are trying to decide POS of each word 
words = nltk.word_tokenize(paragraph)
print(len(words))
tagged_words = nltk.pos_tag(words)
print(len(tagged_words))
print(tagged_words[:5])

347
347
[('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('so', 'RB'), ('very', 'RB')]


In [487]:
# words appended with their POS
word_tags = []
for tw in tagged_words:
    word_tags.append(tw[0] + '_' + tw[1])
    
word_tags[:10]

['Thank_NNP',
 'you_PRP',
 'all_DT',
 'so_RB',
 'very_RB',
 'much_JJ',
 '._.',
 'Thank_VB',
 'you_PRP',
 'to_TO']

In [488]:
# paragraph with words and their POS
tagged_paragraph = ' '.join(word_tags)
tagged_paragraph

"Thank_NNP you_PRP all_DT so_RB very_RB much_JJ ._. Thank_VB you_PRP to_TO the_DT Academy_NNP ._. Thank_NNP you_PRP to_TO all_DT of_IN you_PRP in_IN this_DT room_NN ._. I_PRP have_VBP to_TO congratulate_VB the_DT other_JJ incredible_JJ nominees_NNS this_DT year_NN ._. The_DT Revenant_NNP was_VBD the_DT product_NN of_IN the_DT tireless_NN efforts_NNS of_IN an_DT unbelievable_JJ cast_NN and_CC crew_NN ._. First_NNP off_RB ,_, to_TO my_PRP$ brother_NN in_IN this_DT endeavor_NN ,_, Mr._NNP Tom_NNP Hardy_NNP ._. Tom_NNP ,_, your_PRP$ talent_NN on_IN screen_NN can_MD only_RB be_VB surpassed_VBN by_IN your_PRP$ friendship_NN off_IN screen_JJ …_NNP thank_NN you_PRP for_IN creating_VBG a_DT t_JJ ranscendent_NN cinematic_JJ experience_NN ._. Thank_NNP you_PRP to_TO everybody_VB at_IN Fox_NNP and_CC New_NNP Regency_NNP …_NNP my_PRP$ entire_JJ team_NN ._. I_PRP have_VBP to_TO thank_VB everyone_NN from_IN the_DT very_RB onset_NN of_IN my_PRP$ career_NN …_NN To_TO my_PRP$ parents_NNS ;_: none_NN of_

## S6.V37. Named Entity Recognition

In [489]:
paragraph = "The Taj Mahal was built by Emperor Shah Jahan"

words = nltk.word_tokenize(paragraph)

# Function that performs named entity recognition requires POS tagged words list
tagged_words = nltk.pos_tag(words)
namedEnt = nltk.ne_chunk(tagged_words)

namedEnt

LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

Tree('S', [('The', 'DT'), Tree('ORGANIZATION', [('Taj', 'NNP'), ('Mahal', 'NNP')]), ('was', 'VBD'), ('built', 'VBN'), ('by', 'IN'), Tree('PERSON', [('Emperor', 'NNP'), ('Shah', 'NNP'), ('Jahan', 'NNP')])])

**The nested `Tree` output is hard to read, so we can draw the tree instead.**

In [490]:
#namedEnt.draw()

## S6.V38. Text Modelling using Bag of Words (BOW) Model

* Step 1. Tokenize the text into sentences and preprocess each one: lowercase all words and keep only word characters (similar to an `.isalpha()` selection).


* Step 2. Count how many times each word appears. Build a dictionary where each word is a key and its frequency is the value. If we are dealing with a corpus of documents, combine the counts of the same word across all of them.


* Step 3. Sort the dictionary from highest to lowest frequency.


* Step 4. We cannot consider all words: with 50k documents there will be several thousand distinct words. We keep only the 100 (or a similar number) most frequent words. **heapq** is used for this purpose.


* Step 5. Build the **Bag of Words**, which is a matrix. The top frequent words are the columns. Each row is a sentence if we are dealing with a single document (or a document if we are dealing with a corpus of documents). Each cell holds the number of occurrences of that word in that sentence (or document).

## S6.V39. Building Bag of Words (BOW) Model

In [491]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

In [492]:
# step 1
import re
sentences = nltk.sent_tokenize(paragraph)
sentences[:4]

['Thank you all so very much.',
 'Thank you to the Academy.',
 'Thank you to all of you in this room.',
 'I have to congratulate \n               the other incredible nominees this year.']

In [493]:
# step 1 # Preprocessing 
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower() # lowercase
    sentences[i] = re.sub(r'\W', ' ', sentences[i]) # Substitute non-word character with space 
    # this is similar to .isalpha() selection
    sentences[i] = re.sub(r'\s+', ' ', sentences[i]) # remove extra spaces
sentences[:4]    

['thank you all so very much ',
 'thank you to the academy ',
 'thank you to all of you in this room ',
 'i have to congratulate the other incredible nominees this year ']

In [494]:
# S6.V40 # step2: creating the histogram
word2count = {}
for sentence in sentences:
    words = nltk.word_tokenize(sentence) # tokenize sentences to words
    for word in words:
        if word not in word2count.keys(): # word not encountered in dictionary yet, include it
            word2count[word] = 1
        else: # word already encountered in dictionary, add count by 1
            word2count[word] += 1

list(word2count.items())[:5]

[('thank', 8), ('you', 12), ('all', 4), ('so', 2), ('very', 3)]

* This can be also done by using `nltk.FreqDist(nltk.word_tokenize(paragraph))`

In [498]:
# Alternate idea
# step 1 # Preprocessing 
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower() # lowercase
    sentences[i] = re.sub(r'\W', ' ', sentences[i]) # Substitute non-word character with space 
    # this is similar to .isalpha() selection
    sentences[i] = re.sub(r'\s+', ' ', sentences[i]) # remove extra spaces
sentences[:4]   


['thank you all so very much ',
 'thank you to the academy ',
 'thank you to all of you in this room ',
 'i have to congratulate the other incredible nominees this year ']

In [506]:
# This call fails: nltk.sent_tokenize expects a single string, but `sentences` is already a list
#nltk.FreqDist(nltk.word_tokenize(nltk.sent_tokenize(sentences)))

TypeError: expected string or bytes-like object

However, the result contains a lot of junk tokens. To remove them, the strategy should be as follows:

* Tokenize the paragraph, then clean up the tokens (lowercase, remove punctuation). Use `' '.join(tokens)` to rebuild a cleaned-up paragraph and pass it to the code above. Then subtract the stop words to keep the most meaningful words. NLTK can also be used to get the 100 most common words.
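
A minimal sketch of that strategy, using only the standard library. Assumptions: `collections.Counter` stands in for `nltk.FreqDist` (which is a `Counter` subclass), a whitespace split stands in for `nltk.word_tokenize`, and `STOP_WORDS` is a hypothetical mini stop list.

```python
import re
from collections import Counter

text = "Thank you all so very much. Thank you to the Academy. Thank you so very much."
STOP_WORDS = {"you", "all", "so", "very", "to", "the"}  # hypothetical mini stop list

# Lowercase, replace runs of non-word characters with one space, split on whitespace.
cleaned = re.sub(r"\W+", " ", text.lower())
tokens = cleaned.split()

# Count everything, then drop the stop words.
counts = Counter(tokens)
meaningful = Counter({w: c for w, c in counts.items() if w not in STOP_WORDS})

print(meaningful.most_common(2))  # [('thank', 3), ('much', 2)]
```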

In [430]:
# S6.V41 # step 3&4: Easier way to find frequent words
import heapq # helps to find out n most frequent words
freq_words = heapq.nlargest(100,word2count, key = word2count.get)
freq_words[:10]

['the', 'to', 'you', 'of', 'for', 'this', 'thank', 'and', 'i', 'my']

∆ For a big corpus of data, we would keep the 2000-3000 most frequent words (not 100).
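
As a small illustration of what `heapq.nlargest` does with a dictionary, here is a sketch on a hypothetical toy histogram (`toy_counts` is made up, shaped like `word2count`).

```python
import heapq

# Hypothetical toy histogram in the same shape as word2count.
toy_counts = {"the": 5, "planet": 2, "thank": 4, "snow": 1}

# Iterating a dict yields its keys; key=toy_counts.get ranks them by count.
top_two = heapq.nlargest(2, toy_counts, key=toy_counts.get)
print(top_two)  # ['the', 'thank']

# Same result, but this version sorts the whole dictionary first.
also_top_two = sorted(toy_counts, key=toy_counts.get, reverse=True)[:2]
print(also_top_two)  # ['the', 'thank']
```

Because `nlargest` avoids a full sort, steps 3 and 4 collapse into a single call.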

In [431]:
## S6.V42. # Step 5: Building Bag of Words (BOW) Model
X = [] 

for sentence in sentences:
    tokens = set(nltk.word_tokenize(sentence)) # tokenize each sentence only once
    vector = []
    for word in freq_words:
        if word in tokens:
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)
    
print(X[:1])

[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


Now each sentence is a vector (i.e. an observation or a row). The 100 most frequent words form 100 columns. If a word appears in the sentence, the corresponding column holds 1; otherwise it holds 0.

**Code explanation**: take one sentence out of all the sentences in the paragraph. Then take one word out of the most frequent words and check whether that word is present in the sentence (by tokenizing the sentence). Append 1 if it is present, 0 if not.

In the first sentence (1st list), `X[:1]`, there are 100 numbers, one per frequent word. A 1 denotes that the corresponding frequent word is present in the sentence.
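
The matrix above is binary (present or absent). Step 5 also describes storing the number of occurrences, and a count-based variant could look like the sketch below. `toy_sentences`, `vocab` and `count_vector` are hypothetical names, and a whitespace split stands in for `nltk.word_tokenize`.

```python
from collections import Counter

# Hypothetical toy inputs: already-cleaned sentences and a fixed vocabulary.
toy_sentences = ["thank you thank you", "you are in this room"]
vocab = ["you", "thank", "room"]

def count_vector(sentence, vocab):
    """Return how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]  # a Counter returns 0 for missing words

bow_counts = [count_vector(s, vocab) for s in toy_sentences]
print(bow_counts)  # [[2, 2, 0], [1, 0, 1]]
```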

In [432]:
# create a 2-D array from X list
import numpy as np
X = np.asarray(X)
X

array([[0, 0, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [433]:
X.shape

(21, 100)

Perfect. The 21 rows are the 21 sentences in the paragraph, and the 100 columns represent the 100 most frequent words we selected.

However, BOW comes with disadvantages that we will work on next.

## S6.V43. TF-IDF Model

One disadvantage of the BOW model is that it only records how many times a specific word appears in a document (where each document is a row), so every word is given the same importance. An ML model will treat all words as equally important in the document.

#### Term Frequency-Inverse Document Frequency
The `TF-IDF` model overcomes this disadvantage by giving more importance to uncommon words. As part of preprocessing, we lowercase every letter and then tokenize into words. 

`Term Frequency` means the **number of occurrences of a word in a document divided by the number of words in the document.**

`Inverse Document Frequency` means the logarithm of the **number of documents divided by the number of documents containing the word**. 

`TFIDF = TF * IDF`. Multiplying the two lets us extract more meaningful words.
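
The three definitions can be checked on a tiny example. This is a sketch under stated assumptions: `docs` is a hypothetical toy corpus of pre-tokenized documents, the function names are illustrative, and the plain `log(N / df)` form is used, so every queried term must occur in at least one document.

```python
import math

# Hypothetical toy corpus: each "document" is a list of tokens.
docs = [
    ["thank", "you", "all"],
    ["thank", "you", "academy"],
    ["climate", "change", "is", "real"],
]

def term_frequency(term, doc):
    """Occurrences of term in doc divided by the number of tokens in doc."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log(number of documents / number of documents containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

common = tf_idf("thank", docs[1], docs)   # 'thank' is in 2 of the 3 documents
rare = tf_idf("academy", docs[1], docs)   # 'academy' is in 1 of the 3 documents
print(round(common, 3), round(rare, 3))   # 0.135 0.366
```

Both words occur once in the second document, so their TF is identical. The rarer word gets the higher score purely through IDF.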

In [434]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

In [435]:
# step 1 # similar to BOW
import re
sentences = nltk.sent_tokenize(paragraph)
sentences[:4]

['Thank you all so very much.',
 'Thank you to the Academy.',
 'Thank you to all of you in this room.',
 'I have to congratulate \n               the other incredible nominees this year.']

In [436]:
# step 1 # Preprocessing # similar to BOW
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower() # lowercase
    sentences[i] = re.sub(r'\W', ' ', sentences[i]) # Substitute non-word character with space 
    # this is similar to .isalpha() selection
    sentences[i] = re.sub(r'\s+', ' ', sentences[i]) # remove extra spaces
sentences[:4]    

['thank you all so very much ',
 'thank you to the academy ',
 'thank you to all of you in this room ',
 'i have to congratulate the other incredible nominees this year ']

In [437]:
# step2: creating the histogram # similar to BOW
word2count = {}
for sentence in sentences:
    words = nltk.word_tokenize(sentence) # tokenize sentences to words
    for word in words:
        if word not in word2count.keys(): # word not encountered in dictionary yet, include it
            word2count[word] = 1
        else: # word already encountered in dictionary, add count by 1
            word2count[word] += 1

list(word2count.items())[:5]

[('thank', 8), ('you', 12), ('all', 4), ('so', 2), ('very', 3)]

In [438]:
# step 3&4: Easier way to find frequent words # similar to BOW
import heapq # helps to find out n most frequent words
freq_words = heapq.nlargest(100,word2count, key = word2count.get)
freq_words[:10]

['the', 'to', 'you', 'of', 'for', 'this', 'thank', 'and', 'i', 'my']

In [439]:
# IDF Matrix
import numpy as np
word_idfs = {}

for word in freq_words:
    doc_count = 0
    for sentence in sentences:
        if word in nltk.word_tokenize(sentence):
            doc_count += 1
    
    # 1 is added inside the log as a smoothing variant, so the IDF stays positive
    # even for a word that appears in every sentence
    word_idfs[word] = np.log((len(sentences)/doc_count)+1)
    
print(list(word_idfs.items())[:5])

[('the', 1.1314021114911006), ('to', 1.067840630001356), ('you', 1.2039728043259361), ('of', 1.5040773967762742), ('for', 1.5040773967762742)]


In [440]:
# TF Matrix

tf_matrix = {}

for word in freq_words:
    doc_tf = []
    for sentence in sentences:
        frequency = 0
        for w in nltk.word_tokenize(sentence):
            if w==word:
                frequency += 1
        tf_word = frequency / len(nltk.word_tokenize(sentence))
        doc_tf.append(tf_word)
    tf_matrix[word] = doc_tf   
print(list(tf_matrix.items())[:2])

[('the', [0.0, 0.2, 0.0, 0.1, 0.2, 0.0, 0.0, 0.0, 0.043478260869565216, 0.0, 0.1, 0.06666666666666667, 0.05263157894736842, 0.0, 0.05, 0.10638297872340426, 0.045454545454545456, 0.0, 0.0, 0.0, 0.0]), ('to', [0.0, 0.2, 0.1111111111111111, 0.1, 0.0, 0.09090909090909091, 0.0, 0.08333333333333333, 0.08695652173913043, 0.07692307692307693, 0.1, 0.0, 0.21052631578947367, 0.0, 0.05, 0.02127659574468085, 0.0, 0.0, 0.0, 0.0, 0.0])]


In [441]:
# TF-IDF calculation
tfidf_matrix = []

for word in tf_matrix.keys():
    tfidf = [] # tfidf score for specific words
    for value in tf_matrix[word]:
        score = value * word_idfs[word] # value is the term freq of a specific word in a specific sentence
        # idf value of a particular word is constant across the corpus of sentences
        # multiply both to get tfidf value
        tfidf.append(score)
    
    tfidf_matrix.append(tfidf)
    
tfidf_matrix[:1] # length 21 because corpus is of 21 sentences

[[0.0,
  0.22628042229822012,
  0.0,
  0.11314021114911006,
  0.22628042229822012,
  0.0,
  0.0,
  0.0,
  0.049191396151786984,
  0.0,
  0.11314021114911006,
  0.07542680743274004,
  0.059547479552163184,
  0.0,
  0.05657010557455503,
  0.1203619267543724,
  0.051427368704140934,
  0.0,
  0.0,
  0.0,
  0.0]]

In [442]:
X = np.asarray(tfidf_matrix)
print(X.shape) # (100, 21): one row per frequent word, one column per sentence
# We want one row per sentence instead, so transpose
X = np.transpose(X)
X.shape

(100, 21)


(21, 100)

In [443]:
X[:1]

array([[0.        , 0.        , 0.20066213, 0.        , 0.        ,
        0.        , 0.21464238, 0.        , 0.        , 0.        ,
        0.30543024, 0.        , 0.        , 0.        , 0.        ,
        0.34657359, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.40705784,
        0.40705784, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

## TF-IDF can be more presentable through functions

Split the entire paragraph into sentences and clean up each sentence. Then save each sentence as a value under its own key in a dictionary.

This dictionary will eventually be fed into the TF-IDF model.

In [444]:
sent_dict = {}
sentences = nltk.sent_tokenize(paragraph)

for i in range(len(sentences)):
    sentences[i] = sentences[i].lower() # lowercase
    sentences[i] = re.sub(r'\W', ' ', sentences[i]) # Substitute non-word character with space 
    # this is similar to .isalpha() selection
    sentences[i] = re.sub(r'\s+', ' ', sentences[i]) # remove extra spaces

for i in range(len(sentences)):
    if sentences[i] not in sent_dict.values():
        sent_dict[i] = sentences[i]

list(sent_dict.items())[3:5]

[(3, 'i have to congratulate the other incredible nominees this year '),
 (4,
  'the revenant was the product of the tireless efforts of an unbelievable cast and crew ')]

In [508]:
## Complete code
import math

# Calculate term frequencies
def tf(dataset, file_name): # dataset will be sent_dict and file_name will be 0,1,2 
    
    text = dataset[file_name] 
    # select the specific text file
    tokens = nltk.word_tokenize(text) 
    # tokenize the text file
    fd = nltk.FreqDist(tokens) 
    # make freq distribution of the tokens
    # i.e. count of how many times we saw each word
    return fd

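# Calculate inverse document frequency
def idf(dataset, term):
    # NOTE: `term in <string>` is a substring test, so 'so' also matches 'southern';
    # tokenize each document first if exact word matches are needed
    count = [term in dataset[file_name] for file_name in dataset]
    inv_df = math.log(len(count)/sum(count))
    return inv_df

def tfidf(dataset, file_name, n):
    term_scores = {}
    file_fd = tf(dataset,file_name) # raw counts, not divided by document length
    for term in file_fd:
        if term.isalpha():
            idf_val = idf(dataset,term)
            tf_val = file_fd[term] # reuse the distribution computed above
            score = tf_val*idf_val # local name does not shadow the tfidf function
            term_scores[term] = round(score,2)
    return sorted(term_scores.items(), key=lambda x:-x[1])[:n]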

In [513]:
file_name = 20 # key of the last sentence, 'thank you so very much '
print(tf(sent_dict, file_name))
tf(sent_dict, file_name)

<FreqDist with 5 samples and 5 outcomes>


FreqDist({'thank': 1, 'you': 1, 'so': 1, 'very': 1, 'much': 1})

In [507]:
print(tfidf(sent_dict, 3, 2))
print(tfidf(sent_dict, 4, 3))

[('congratulate', 3.04), ('incredible', 3.04)]
[('tireless', 3.04), ('efforts', 3.04), ('unbelievable', 3.04)]


In [505]:
for file_name in sent_dict:
    print(file_name)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20


In [447]:
for file_name in sent_dict:
    print("{0}: {1} \n".format(file_name, tfidf(sent_dict,file_name,3)))

0: [('much', 2.35), ('so', 1.95), ('all', 1.66)] 

1: [('academy', 3.04), ('thank', 0.97), ('you', 0.85)] 

2: [('room', 3.04), ('you', 1.69), ('all', 1.66)] 

3: [('congratulate', 3.04), ('incredible', 3.04), ('nominees', 3.04)] 

4: [('tireless', 3.04), ('efforts', 3.04), ('unbelievable', 3.04)] 

5: [('first', 3.04), ('brother', 3.04), ('endeavor', 3.04)] 

6: [('your', 6.09), ('screen', 6.09), ('talent', 3.04)] 

7: [('everybody', 3.04), ('fox', 3.04), ('new', 3.04)] 

8: [('everyone', 3.04), ('from', 3.04), ('onset', 3.04)] 

9: [('love', 3.04), ('dearly', 3.04), ('know', 3.04)] 

10: [('lastly', 3.04), ('want', 3.04), ('say', 3.04)] 

11: [('that', 3.04), ('felt', 3.04), ('hottest', 3.04)] 

12: [('production', 3.04), ('needed', 3.04), ('move', 3.04)] 

13: [('climate', 3.04), ('change', 3.04), ('real', 3.04)] 

14: [('urgent', 3.04), ('threat', 3.04), ('facing', 3.04)] 

15: [('speak', 6.09), ('billions', 6.09), ('who', 5.84)] 

16: [('children', 6.09), ('those', 3.04), ('whose'
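
The repeated scores 3.04 and 6.09 can be checked by hand. A minimal sketch, assuming the corpus holds 21 sentences (keys 0 to 20) and using the natural log as `idf` does; the helper name `tfidf_score` is illustrative and not part of the notebook:

```python
import math

N_DOCS = 21  # assumption: sent_dict has keys 0..20

def tfidf_score(tf_val, docs_with_term, n_docs=N_DOCS):
    """Raw count times ln(N / df), rounded to 2 places, as in tfidf()."""
    return round(tf_val * math.log(n_docs / docs_with_term), 2)

once_unique = tfidf_score(1, 1)   # term seen once, in a single sentence
twice_unique = tfidf_score(2, 1)  # term seen twice, in a single sentence
print(once_unique, twice_unique)  # prints: 3.04 6.09
```

So any word that occurs once and only in one sentence scores ln(21) ≈ 3.04, which is why so many terms tie.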

## S6.V48. N-Gram Model

### Introduction to Markov Chains, bigrams, trigrams

**Markov Chains**: <br/>
Suppose there are 2 states i.e. `A` and `B`

If probability for:<br/>
`A` to go to `B` is 50% <br/>
`A` to go to `A` is 50% <br/>
`B` to go to `B` is 50% <br/>
`B` to go to `A` is 50% 

So if the initial state is `A`, we can move to `B` (probability 50%) or stay in `A`, and from there again to `A` or `B`, because every transition has probability 50%. Repeating this forms a chain like `ABAABBABAB`. Such a sequence of states, where the next state depends only on the current state, is called a Markov chain.
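
The two-state chain above can be simulated in a few lines. This is only a sketch: the names `transitions` and `sample_chain` are made up here, and the seed is fixed just so that reruns give the same chain.

```python
import random

# Transition table for the two-state example: from each state,
# the next state is A or B with probability 0.5 each.
transitions = {
    "A": {"A": 0.5, "B": 0.5},
    "B": {"A": 0.5, "B": 0.5},
}

def sample_chain(start, length, rng):
    """Generate a state sequence; each step depends only on the current state."""
    state = start
    chain = [state]
    for _ in range(length - 1):
        next_states = list(transitions[state].keys())
        weights = list(transitions[state].values())
        state = rng.choices(next_states, weights=weights, k=1)[0]
        chain.append(state)
    return "".join(chain)

rng = random.Random(0)  # fixed seed only for reproducibility
chain = sample_chain("A", 10, rng)
print(chain)
```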

**N-Gram** <br/>
An N-gram is a contiguous sequence of `n items` from a sample of text. These items are the `states` in the Markov chain. The items can be characters, words, sentences or any larger unit.<br/>
With N=2 they are called **bigrams**; with N=3 they are called **trigrams.**


example- 'the bird'<br/>
* **bigrams**- 'th', 'he', 'e ', ' b', 'bi', 'ir', 'rd'.<br/>
* **trigrams**- 'the', 'he ', 'e b', ' bi', 'bir', 'ird'
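
The two lists above can be produced with one slice per position. A small sketch (the function name `char_ngrams` is chosen here for illustration):

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("the bird", 2)
trigrams = char_ngrams("the bird", 3)
print(bigrams)   # ['th', 'he', 'e ', ' b', 'bi', 'ir', 'rd']
print(trigrams)  # ['the', 'he ', 'e b', ' bi', 'bir', 'ird']
```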

**Question**: What is the benefit of N-grams? What do we do with them?
For each trigram we record the character that follows it. For example, `the` is followed by a space, `he<space>` is followed by `b`, the next trigram by `i`, then `r`, and so on.

**Word-Gram**- In this case we do not consider characters; we consider words instead.
'The bird is flying on the blue sky'
* **word-grams[trigram]**- 'The bird is', 'bird is flying', 'is flying on', 'flying on the', 'on the blue', 'the blue sky'.
Here we follow the next word (rather than the next character) of each word trigram.

**But how can we use it? Applications?**
If we consider a huge corpus of sentences, the same word trigram can be followed by different words. In this example, 'The bird is' is followed by `flying`. Other sentences in the corpus can contain the same trigram 'The bird is' followed by a completely different word such as `eating` or `sleeping`. So, across the corpus, every trigram ends up with a list of follow-up words, and we can build a dictionary of everything that can come after a specific word trigram.

Similarly, for a character trigram such as 'the', the dictionary might list, say, 5 different characters that can follow it.

For word grams, the dictionary tells us that 'The bird is' can be followed by `flying`, `eating` or `sleeping` (collected from every occurrence of the trigram 'The bird is' in the corpus). We can pick one from the list at random, such as `eating` (strategies other than random picking are possible). The text then becomes 'The bird is eating'. Next we add another word from the list of possible words for the new last trigram, and we keep adding. Suppose, however, that after reaching 'The bird is eating on the' the next word were 'orange', giving 'The bird is eating on the orange'. Our corpus has no word trigram 'on the orange', so there is no follow-up list for it and generation has to stop there.
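
The dictionary of follow-up words can be sketched on a toy corpus. The three sentences below are made up for illustration, tokens come from a plain whitespace split, and the names `corpus`, `order` and `followers` are assumptions of this sketch:

```python
from collections import defaultdict

# Toy corpus (made up): three sentences sharing the trigram 'the bird is'
corpus = [
    "the bird is flying on the blue sky",
    "the bird is eating on the branch",
    "the bird is sleeping",
]

order = 3
followers = defaultdict(list)
for line in corpus:
    toks = line.split()
    for i in range(len(toks) - order):
        gram = " ".join(toks[i:i + order])
        followers[gram].append(toks[i + order])  # word seen right after the gram

print(followers["the bird is"])      # ['flying', 'eating', 'sleeping']
print("on the orange" in followers)  # False: generation would stop here
```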

**N-Gram** models are used in autocomplete. With a character N-Gram model, when we type a few characters it predicts the next characters to complete the word. With a word N-Gram model, it predicts the next word after the words typed so far.

In [448]:
# N-Gram modeling- Character Grams
import random

# Sample data
text = """Global warming or climate change has become a worldwide concern. It is gradually developing into an unprecedented environmental crisis evident in melting glaciers, changing weather patterns, rising sea levels, floods, cyclones and droughts. Global warming implies an increase in the average temperature of the Earth due to entrapment of greenhouse gases in the earth’s atmosphere."""

# Order of the grams
n = 6

# Our N-Grams
ngrams = {}

# Creating the model
for i in range(len(text)-n):
    gram = text[i:i+n] # slice of n characters; for trigrams (n = 3) this is 0-3, 1-4, 2-5 and so on
    if gram not in ngrams.keys():
        ngrams[gram] = []
    ngrams[gram].append(text[i+n])

`len(text)-n`: `n` is the order of the grams (6 in the cell above; the explanation below uses trigrams, `n = 3`). We subtract `n` because every iteration needs `n` characters for the gram plus the character `text[i+n]` that follows it. Without the subtraction, the last iterations would run past the end of the text: the grams would be shorter than `n` characters and `text[i+n]` would raise an `IndexError`. 

`if gram not in ngrams.keys():` <br/>
`    ngrams[gram] = []`

If the trigram is not yet present in the `ngrams` dictionary, we add it as a key with an empty list as its value.

`ngrams[gram].append(text[i+n])` appends the character that follows the trigram to the list stored under that trigram's key.

With `n = 3`, the first trigram is `Glo`; it becomes the first key, and the first value stored under it is `b`. If `Glo` occurs again later in the document, the character following it is appended to the same list as well (even if it is `b` again).
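
To see a key collect more than one follower, here is the same loop on a short made-up string with `n = 3`. The names `text_small`, `n_small` and `model` are used only in this sketch so that the notebook's own variables stay untouched:

```python
text_small = "the theme"  # made-up string in which the trigram 'the' occurs twice
n_small = 3
model = {}
for i in range(len(text_small) - n_small):
    gram = text_small[i:i + n_small]
    # setdefault creates the empty list the first time a gram is seen
    model.setdefault(gram, []).append(text_small[i + n_small])

print(model["the"])  # [' ', 'm']
```

The first `the` is followed by a space and the second by `m`, so both end up in the same list.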

In [449]:
list(ngrams.items())[:5]

[('Global', [' ', ' ']),
 ('lobal ', ['w', 'w']),
 ('obal w', ['a', 'a']),
 ('bal wa', ['r', 'r']),
 ('al war', ['m', 'm'])]

In [450]:
# Testing our N-Gram Model    
currentGram = text[0:n] # first n-gram to start with
result = currentGram
for i in range(100): # add up to 100 characters with the N-Gram model
    if currentGram not in ngrams.keys():
        break
    possibilities = ngrams[currentGram] # list of characters that follow the current n-gram
    nextItem = possibilities[random.randrange(len(possibilities))]
    result += nextItem
    currentGram = result[len(result)-n:len(result)] # last n characters of the result string

`nextItem = possibilities[random.randrange(len(possibilities))]` - `random.randrange(len(possibilities))` returns a random integer from 0 up to, but not including, the length of `possibilities` (the list of characters seen after the current N-Gram). We use that number as an index to select one possibility.

Then we append `nextItem` to `result`.

`currentGram = result[len(result)-n:len(result)]` takes the last `n` characters of `result` as the new `currentGram`.
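
The same generation loop can be written a little more compactly with `random.choice` and a negative slice. This is a sketch of an equivalent formulation, not the notebook's code; `build_model` and `generate` are names chosen here:

```python
import random

def build_model(text, n):
    """Map each n-character gram to the list of characters that follow it."""
    model = {}
    for i in range(len(text) - n):
        model.setdefault(text[i:i + n], []).append(text[i + n])
    return model

def generate(model, seed_gram, n, max_new_chars, rng=random):
    """Extend seed_gram one character at a time until no follower is known."""
    result = seed_gram
    for _ in range(max_new_chars):
        followers = model.get(result[-n:])  # last n characters of the output so far
        if not followers:
            break
        result += rng.choice(followers)  # same effect as indexing with randrange
    return result

sample = "abcdef"  # every trigram here has exactly one follower
generated = generate(build_model(sample, 3), "abc", 3, 100)
print(generated)  # abcdef
```

Because each gram in `sample` has a single follower, the output is deterministic and simply reproduces the sample; on real text the random choice is what creates variation.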

In [451]:
print(result) # with bigrams it does NOT make sense

Global warming or climate change has become a worldwide concern. It is gradually developing into an unprec


In [452]:
print(result) # with 5-Grams it does make sense

Global warming or climate change has become a worldwide concern. It is gradually developing into an unprec


In [453]:
print(result) # with 6-Grams it does make sense

Global warming or climate change has become a worldwide concern. It is gradually developing into an unprec


In [454]:
# N-Gram modeling- Word n-Grams
import random
import nltk

# Sample data
text = """Global warming or climate change has become a worldwide concern. It is gradually developing into an unprecedented environmental crisis evident in melting glaciers, changing weather patterns, rising sea levels, floods, cyclones and droughts. Global warming implies an increase in the average temperature of the Earth due to entrapment of greenhouse gases in the earth’s atmosphere."""

# Order of the grams
n = 3

# Our N-Grams
ngrams = {}

# Creating the model
words = nltk.word_tokenize(text)

for i in range(len(words)-n):
    gram = ' '.join(words[i:i+n]) # use join to combine the n words with a space in between
    if gram not in ngrams.keys():
        ngrams[gram] = []
    ngrams[gram].append(words[i+n])

`words[i:i+n]` gives us a list of `n` (here 3) consecutive words; `' '.join` joins them with spaces into a single string, the word n-gram.

In [455]:
list(ngrams.items())[:5]

[('Global warming or', ['climate']),
 ('warming or climate', ['change']),
 ('or climate change', ['has']),
 ('climate change has', ['become']),
 ('change has become', ['a'])]

In [456]:
# Testing our N-Gram Word Model    

currentGram = ' '.join(words[0:n]) # FIRST word trigram to start with, 
# ' '.join used to tie 3 words from list to make a word trigram 

result = currentGram
for i in range(30): # to generate string of length 30 words
    if currentGram not in ngrams.keys():
        break
    possibilities = ngrams[currentGram] # list of possibilities of the words that follows the current word trigram
    nextItem = possibilities[random.randrange(len(possibilities))]
    result += ' ' + nextItem # ' ' to maintain space between word trigram and the next word
    rwords = nltk.word_tokenize(result) # tokenize result string to get last 3 words 
    currentGram = ' '.join(rwords[len(rwords)-n:len(rwords)]) 

In [457]:
print(result)

Global warming or climate change has become a worldwide concern . It is gradually developing into an unprecedented environmental crisis evident in melting glaciers , changing weather patterns , rising sea levels ,


Here both bigrams (n = 2) and trigrams (n = 3) lead to a meaningful output sentence. <br/>
**For a large article, n = 3 or 4 might not give coherent output and we may have to increase n to 5 or 6. So we need to experiment with n.**
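
The word-level generator above re-tokenizes the whole `result` string on every step. One possible alternative, sketched here under the assumption that a plain whitespace split is an acceptable stand-in for `nltk.word_tokenize`, keeps the output as a list of tokens and uses tuples of words as dictionary keys. The names `build_word_model` and `generate_words` are illustrative:

```python
import random

def build_word_model(tokens, n):
    """Map each tuple of n consecutive tokens to the tokens that follow it."""
    model = {}
    for i in range(len(tokens) - n):
        model.setdefault(tuple(tokens[i:i + n]), []).append(tokens[i + n])
    return model

def generate_words(model, start_tokens, n, max_new_words, rng=random):
    """Grow a token list; the current gram is always the last n tokens."""
    out = list(start_tokens)
    for _ in range(max_new_words):
        followers = model.get(tuple(out[-n:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

tokens = "global warming or climate change has become a worldwide concern".split()
generated_sentence = generate_words(build_word_model(tokens, 3), tokens[:3], 3, 30)
print(generated_sentence)
```

With a single short sentence every trigram has one follower, so the sketch just reproduces the sentence and stops when the last trigram has no entry.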

## S6.V51. Latent Semantic Analysis: Finding topics/concepts of each document

**∆** Suppose we have 10 different documents. LSA helps to figure out which document belongs to which concept/topic (e.g. music, food, news, tech etc.)

**∆** Some documents can be part of multiple concepts/topics. For example, document 2 can be part of 2 concepts/topics, i.e. music and tech. <br/>

**∆** Next comes probability. Document 2 is 85% part of the music concept/topic and 15% part of the tech concept/topic. So document 2 belongs mainly to the music concept, but it has some content that makes it partly tech. 

**∆** When we developed the BOW model, each row was a document and each column/feature corresponded to a particular word. The column values gave the frequency of that word in the document. Once the BOW (or TF-IDF) model has been built, we can derive different concepts or topics from it. This can be done through **Singular value decomposition.**

**Singular Value Decomposition (SVD)** According to SVD, any matrix (A here) with m rows and n columns can be decomposed into 3 matrices in the following way-<br/>
`A[mxn] = U[mxr] X S[rxr] X (V[nxr])^T`<br/>
`A` is the **Input Data Matrix** <br/>
m = number of rows/documents | n = number of words/features

The 3 matrices are 
1. `U[mxr]` is the **Left Singular Matrix** where m = number of rows/documents | r = number of concepts (r columns)
2. `S[rxr]` is the **Rank Matrix** where r = rank of A (r rows and r columns). It is a diagonal matrix whose diagonal entries are the singular values, i.e. the strength of each concept.
3. `V[nxr]` is the **Right Singular Matrix** where n = number of words/features | r = number of concepts (r columns). After transposing (`T`) it has r rows and n columns.
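
To make the shapes concrete, the sketch below multiplies three hand-picked factors back together in plain Python. The numbers are invented for illustration (they are not computed from any document collection); `U` and `V` have orthonormal columns and `S` is diagonal, so the product is a valid SVD of the resulting `A`. The helpers `matmul` and `transpose` are written here only to keep the sketch dependency-free.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]

# Toy factors: m = 3 documents, n = 2 words, r = 2 concepts
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]   # m x r: document-to-concept
S = [[3.0, 0.0],
     [0.0, 2.0]]   # r x r: diagonal, strength of each concept
V = [[0.6, -0.8],
     [0.8, 0.6]]   # n x r: word-to-concept

A = matmul(matmul(U, S), transpose(V))  # m x n document-word matrix
print(len(A), "x", len(A[0]))           # 3 x 2
for row in A:
    print([round(x, 2) for x in row])
```

Document 1 loads only on concept 1 and document 2 only on concept 2, which is visible in the rows of `U`; document 3 belongs to neither, so its row in `A` is all zeros.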

Breaking down the Input Data Matrix into 3 matrices helps us to find the concepts or topics. Following are the applications -<br/>
A. A website posting articles on different topics/concepts/subjects: the different types of article can be put in different buckets.<br/>
B. Finding the relation between articles and words<br/>
C. Page indexing in search engines like Google. Suppose we search for NLP. NLP is a topic/concept, and through LSA it has many keywords associated with it. For the search (NLP here), the engine looks for articles in which most of the keywords associated with NLP appear, i.e. articles that most likely belong to that concept. <br/>


In [458]:
# Latent Semantic Analysis (LSA)

# Importing the Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Sample Data # 7 different sentences belonging to different domains
# Aim: Build different concepts:
dataset = ["The amount of polution is increasing day by day",
           "The concert was just great",
           "I love to see Gordon Ramsay cook",
           "Google is introducing a new technology",
           "AI Robots are examples of great technology present today",
           "All of us were singing in the concert",
           "We have launch campaigns to stop pollution and global warming"]

# Preprocessing (punctuation and stop words not here, so just take care of capitalized words)
dataset = [line.lower() for line in dataset]

In [459]:
# Now we have to create a BOW model. It can be a binary one or TF-IDF model
# Creating Tfidf Model
vectorizer = TfidfVectorizer() # TfidfVectorizer() can create Tfidf Model out of the list of strings
X = vectorizer.fit_transform(dataset)

# Visualizing the Tfidf Model
print(X[0]) 

  (0, 34)	0.22786438777524437
  (0, 2)	0.3211483974289088
  (0, 24)	0.22786438777524437
  (0, 26)	0.3211483974289088
  (0, 19)	0.2665807498646048
  (0, 17)	0.3211483974289088
  (0, 9)	0.6422967948578177
  (0, 5)	0.3211483974289088


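In each `(0, 34)` pair, `0` is the row index, i.e. document 1, and `34` is the column index of a word. The first word in document 1 is 'The', and the word 'the' is at position 34 in the TF-IDF model (out of all the words in the 7 documents). The number on the right is the TF-IDF weight of that word in the document.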
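
The column numbers can be reproduced without scikit-learn. A minimal sketch, assuming that `TfidfVectorizer` with default settings keeps tokens of two or more word characters and numbers the vocabulary in alphabetical order; `docs`, `token_pattern`, `vocab` and `column` are names chosen here:

```python
import re

docs = ["The amount of polution is increasing day by day",
        "The concert was just great",
        "I love to see Gordon Ramsay cook",
        "Google is introducing a new technology",
        "AI Robots are examples of great technology present today",
        "All of us were singing in the concert",
        "We have launch campaigns to stop pollution and global warming"]

# Assumption: tokens are runs of 2+ word characters, so 'i' and 'a' are dropped
token_pattern = re.compile(r"\b\w\w+\b")
vocab = sorted({tok for line in docs for tok in token_pattern.findall(line.lower())})
column = {term: idx for idx, term in enumerate(vocab)}

print(len(vocab))                                         # 42
print(column["the"], column["amount"], column["day"])     # 34 2 9
```

The misspelled `polution` and the correct `pollution` count as two different words, which is part of why the vocabulary has 42 entries.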

**∆** Now we are going to decompose the TF-IDF matrix (matrix A in the formula, which is matrix `X` here) into those 3 matrices. This decomposition can be done with **TruncatedSVD**

In [460]:
# Creating the SVD
lsa = TruncatedSVD(n_components = 4, n_iter = 100)
lsa.fit(X)

TruncatedSVD(algorithm='randomized', n_components=4, n_iter=100,
       random_state=None, tol=0.0)

`n_components`: number of concepts/topic we are trying to find. https://chrisalbon.com/machine_learning/feature_engineering/select_best_number_of_components_in_tsvd/?fbclid=IwAR0O_3idoKfkOeP65-0qNowD0HNV9CfiQP8ED0WPUCgfhlnjF7Lb3yxnmTU


`n_iter`: Number of iterations of the randomized SVD solver. Here it runs 100 iterations to decompose the `X` matrix. More iterations generally give a more accurate decomposition, at the cost of more computation.

In [461]:
# First Column of V
row1 = lsa.components_[0]
print(len(row1))
row1

42


array([ 1.24191973e-01,  1.78240252e-01,  1.14460798e-01, -5.51504304e-17,
        1.24191973e-01,  1.14460798e-01, -5.51504304e-17,  3.44988739e-01,
       -7.67469591e-17,  2.28921595e-01,  1.24191973e-01, -5.51504304e-17,
        9.72770950e-02, -7.67469591e-17,  3.00124026e-01, -5.51504304e-17,
        1.78240252e-01,  1.14460798e-01,  9.72770950e-02,  1.75760635e-01,
        2.37365829e-01, -5.51504304e-17, -7.67469591e-17,  9.72770950e-02,
        2.95798061e-01, -5.51504304e-17,  1.14460798e-01,  1.24191973e-01,
       -7.67469591e-17,  1.24191973e-01, -7.67469591e-17,  1.78240252e-01,
       -5.51504304e-17,  1.83838346e-01,  3.76098295e-01, -1.09486161e-16,
        1.24191973e-01,  1.78240252e-01, -5.51504304e-17,  2.37365829e-01,
       -5.51504304e-17,  1.78240252e-01])

In [462]:
lsa.components_.shape

(4, 42)

Using `TruncatedSVD`, we can get the last factor, i.e. `(V[nxr])^T`, the transpose of `V`. The `V` matrix has n rows and r columns; after transposing it has r rows and n columns. Here we asked for 4 concepts, which means 4 rows (and 42 columns, as the TF-IDF model contains 42 unique words from the sentences).

`lsa.components_[0]` returns the first row of the `V` transpose matrix. The first row is the first concept, and the output holds its values for the 42 words. Words that belong to this concept have a high value and words that do not belong to it have a value close to zero.

**∆** Now the idea is to see which words are most important for each concept:

In [463]:
# Visualizing the concepts
terms = vectorizer.get_feature_names() # gives all the words in the TF-IDF model; in scikit-learn 1.2+ use get_feature_names_out()
len(terms) # 42 different words

42

In [514]:
terms[:10]

['ai',
 'all',
 'amount',
 'and',
 'are',
 'by',
 'campaigns',
 'concert',
 'cook',
 'day']

In [465]:
for i,comp in enumerate(lsa.components_):
    componentTerms = zip(terms,comp)
    sortedTerms = sorted(componentTerms,key=lambda x:x[1],reverse=True) # sorting concept values
    sortedTerms = sortedTerms[:10] # top 10 concept values
    print("\nConcept",i,":")
    for term in sortedTerms:
        print(term)


Concept 0 :
('the', 0.3760982952926378)
('concert', 0.3449887392330663)
('great', 0.30012402589487364)
('of', 0.29579806095266686)
('just', 0.23736582929791233)
('was', 0.23736582929791233)
('day', 0.2289215954150453)
('technology', 0.183838345674134)
('all', 0.17824025175628994)
('in', 0.17824025175628994)

Concept 1 :
('to', 0.4157884439670069)
('cook', 0.2835916579351072)
('gordon', 0.2835916579351072)
('love', 0.2835916579351072)
('ramsay', 0.2835916579351072)
('see', 0.2835916579351072)
('and', 0.21730644711292482)
('campaigns', 0.21730644711292482)
('global', 0.21730644711292482)
('have', 0.21730644711292482)

Concept 2 :
('technology', 0.3779180676714394)
('is', 0.3419614380631992)
('google', 0.3413969441909745)
('introducing', 0.3413969441909745)
('new', 0.3413969441909745)
('day', 0.141124326809949)
('ai', 0.11387892195372969)
('are', 0.11387892195372964)
('examples', 0.11387892195372964)
('present', 0.11387892195372964)

Concept 3 :
('day', 0.4654267679041099)
('amount', 0.2

In [466]:
# code explanation
for i,comp in enumerate(lsa.components_):
    print(i,comp)

0 [ 1.24191973e-01  1.78240252e-01  1.14460798e-01 -5.51504304e-17
  1.24191973e-01  1.14460798e-01 -5.51504304e-17  3.44988739e-01
 -7.67469591e-17  2.28921595e-01  1.24191973e-01 -5.51504304e-17
  9.72770950e-02 -7.67469591e-17  3.00124026e-01 -5.51504304e-17
  1.78240252e-01  1.14460798e-01  9.72770950e-02  1.75760635e-01
  2.37365829e-01 -5.51504304e-17 -7.67469591e-17  9.72770950e-02
  2.95798061e-01 -5.51504304e-17  1.14460798e-01  1.24191973e-01
 -7.67469591e-17  1.24191973e-01 -7.67469591e-17  1.78240252e-01
 -5.51504304e-17  1.83838346e-01  3.76098295e-01 -1.09486161e-16
  1.24191973e-01  1.78240252e-01 -5.51504304e-17  2.37365829e-01
 -5.51504304e-17  1.78240252e-01]
1 [ 2.14388091e-16  4.15518209e-16 -6.42517481e-17  2.17306447e-01
  1.44997462e-16 -4.34350654e-17  2.17306447e-01  1.26354939e-16
  2.83591658e-01 -8.68701308e-17  1.44997460e-16  2.17306447e-01
  1.26976111e-16  2.83591658e-01 -8.70055480e-17  2.17306447e-01
  4.36334141e-16 -4.34350654e-17  1.26976111e-16  6.

`lsa.components_` holds all the rows (4 concepts) of the `V-transpose` matrix. Iterating through it gives each row/concept and the corresponding **concept values** of the 42 words.

`componentTerms = zip(terms,comp)` pairs each word (from `terms`) with its concept value (from `comp`). `componentTerms` yields tuples of the word and the value.

`sorted(componentTerms,key=lambda x:x[1],reverse=True)` sorts those tuples by concept value (`x[1]`), highest first.
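
The pairing and sorting step can be tried on a handful of words. The weights below are Concept 0 values from the output above, rounded to three places; `terms_demo`, `comp_demo` and `top3` are names used only in this sketch:

```python
terms_demo = ["concert", "cook", "day", "great", "technology"]  # illustrative subset
comp_demo = [0.345, 0.0, 0.229, 0.300, 0.184]                   # rounded Concept 0 weights

pairs = zip(terms_demo, comp_demo)                        # yields (term, weight) tuples
ranked = sorted(pairs, key=lambda x: x[1], reverse=True)  # highest weight first
top3 = ranked[:3]
print(top3)  # [('concert', 0.345), ('great', 0.3), ('day', 0.229)]
```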

**Now we will figure out which concept each sentence falls into.**

In [467]:
# Word Concept Dictionary Creation # Here we store sorted term of each concept
concept_words = {} # This is to know what are the keywords specific to each concept

# Visualizing the concepts
terms = vectorizer.get_feature_names() # in scikit-learn 1.2+ use get_feature_names_out()
for i,comp in enumerate(lsa.components_):
    componentTerms = zip(terms,comp)
    sortedTerms = sorted(componentTerms,key=lambda x:x[1],reverse=True)
    sortedTerms = sortedTerms[:10]
    concept_words["Concept "+str(i)] = sortedTerms 

`i` takes the values 0 to 3, one per concept. The string `"Concept "+str(i)` is the key in the `concept_words` dictionary. 

The top 10 terms/words and their concept values are the value stored under that key.

In [468]:
list(concept_words.items())[:2]

[('Concept 0',
  [('the', 0.3760982952926378),
   ('concert', 0.3449887392330663),
   ('great', 0.30012402589487364),
   ('of', 0.29579806095266686),
   ('just', 0.23736582929791233),
   ('was', 0.23736582929791233),
   ('day', 0.2289215954150453),
   ('technology', 0.183838345674134),
   ('all', 0.17824025175628994),
   ('in', 0.17824025175628994)]),
 ('Concept 1',
  [('to', 0.4157884439670069),
   ('cook', 0.2835916579351072),
   ('gordon', 0.2835916579351072),
   ('love', 0.2835916579351072),
   ('ramsay', 0.2835916579351072),
   ('see', 0.2835916579351072),
   ('and', 0.21730644711292482),
   ('campaigns', 0.21730644711292482),
   ('global', 0.21730644711292482),
   ('have', 0.21730644711292482)])]

In [469]:
# Sentence Concepts
for key in concept_words.keys(): # Iterate through each concept
    sentence_scores = []
    for sentence in dataset:
        words = nltk.word_tokenize(sentence) # tokenize each document (sentence here)
        score = 0
        for word in words:
            for word_with_score in concept_words[key]:
                if word == word_with_score[0]:
                    score += word_with_score[1]
        sentence_scores.append(score) 
    print("\n"+key+":")
    for sentence_score in sentence_scores:
        print(sentence_score)


Concept 0:
1.1297395470753953
1.4959427190164025
0
0.183838345674134
0.7797604325216746
1.3733655989909508
0

Concept 1:
0
0
1.833746733642543
0
0
0
1.2850142324187064

Concept 2:
0.6242100916830972
0
0
1.744070338307562
0.8334337554863579
0
0

Concept 3:
2.201593755447883
0.1272421318069438
0
0.21264455202449928
0
0.2965820743887414
0


Here we go through each concept with `for key in concept_words.keys():`. For each concept we tokenize every sentence into a list of words and start with a score of 0. 

Then we loop through each word in the sentence, and for each word we loop through the keywords of the concept with their concept scores (`word_with_score` in `concept_words[key]`) to see whether the tokenized word is one of them. If it is, we add that keyword's concept score to the sentence score. The more words of a sentence match the keywords of a concept, the higher the score.


`Concept 0:
1.1297395470753941
1.4959427190164032
0
0.18383834567413404
0.7797604325216747
1.3733655989909501
0`

This means that for the first concept (concept 0), when the 1st sentence was tokenized into words and matched against the top keywords of concept 0, we got a score of about 1.13. So some words of sentence 1 match the keywords of concept 0.

However, no word of sentence 3 matches the keywords of concept 0, so its score is 0. 

The maximum score belongs to the 2nd sentence, so the words of sentence 2 match the keywords of concept 0 best.
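
These scores can be recomputed from the Concept 0 keyword list alone. The sketch below replaces the inner loop over keywords with a dictionary lookup and uses a whitespace split instead of `nltk.word_tokenize` (an assumption that is harmless for these punctuation-free sentences); `concept0` and `concept_score` are names chosen here:

```python
# Top keywords of Concept 0 with their weights, copied from the output above
concept0 = {
    "the": 0.3760982952926378, "concert": 0.3449887392330663,
    "great": 0.30012402589487364, "of": 0.29579806095266686,
    "just": 0.23736582929791233, "was": 0.23736582929791233,
    "day": 0.2289215954150453, "technology": 0.183838345674134,
    "all": 0.17824025175628994, "in": 0.17824025175628994,
}

def concept_score(sentence, keyword_weights):
    """Sum the weight of every token that is a keyword of the concept."""
    return sum(keyword_weights.get(tok, 0) for tok in sentence.split())

score_s1 = concept_score("the amount of polution is increasing day by day", concept0)
score_s2 = concept_score("the concert was just great", concept0)
score_s3 = concept_score("i love to see gordon ramsay cook", concept0)
print(round(score_s1, 4), round(score_s2, 4), score_s3)  # 1.1297 1.4959 0
```

In sentence 1 the word `day` occurs twice, so its weight is counted twice, exactly as in the nested loops.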

## S6.V54. Word's Synonym and Antonym using NLTK

In [470]:
# Finding synonym and antonym of words

Here we will use `WORDNET` to find synonyms and antonyms. For each input word, WordNet returns sets of synonyms (called synsets) that are grouped by part of speech: noun, verb, adjective, adverb.

Goal: write a Python program that gives a flat set of synonyms (only the words, not tagged with noun, adjective etc.). We can find antonyms in a similar way.

Corresponding to each word there is a list of synsets. Here we look at the synsets for the word `good`. Each synset contains a group of words, all of which are synonyms of the original word in one of its senses.

In [471]:
from nltk.corpus import wordnet
wordnet.synsets("good")

[Synset('good.n.01'),
 Synset('good.n.02'),
 Synset('good.n.03'),
 Synset('commodity.n.01'),
 Synset('good.a.01'),
 Synset('full.s.06'),
 Synset('good.a.03'),
 Synset('estimable.s.02'),
 Synset('beneficial.s.01'),
 Synset('good.s.06'),
 Synset('good.s.07'),
 Synset('adept.s.01'),
 Synset('good.s.09'),
 Synset('dear.s.02'),
 Synset('dependable.s.04'),
 Synset('good.s.12'),
 Synset('good.s.13'),
 Synset('effective.s.04'),
 Synset('good.s.15'),
 Synset('good.s.16'),
 Synset('good.s.17'),
 Synset('good.s.18'),
 Synset('good.s.19'),
 Synset('good.s.20'),
 Synset('good.s.21'),
 Synset('well.r.01'),
 Synset('thoroughly.r.02')]

`wordnet.synsets("good")` gives a list of synsets. We loop through this list to get each synset, then loop through the lemmas of that synset to get all the synonym words within it.

In [472]:
# Finding synonyms and antonyms of words

# Importing libraries
from nltk.corpus import wordnet

# Initializing the lists of synonyms and antonyms
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"): # loop through all synsets specific to the word 'good'
    for s in syn.lemmas(): # syn.lemmas(): gives all words in that particular synset
        synonyms.append(s.name()) # add each word from syn.lemmas() to the synonyms list
        for a in s.antonyms():
            antonyms.append(a.name())
            
            
# Displaying the synonyms and antonyms
print(set(synonyms),'\n') # set is used to remove multiple occurrences of the same word
print(set(antonyms))

{'dear', 'effective', 'secure', 'full', 'skillful', 'commodity', 'practiced', 'dependable', 'just', 'thoroughly', 'right', 'safe', 'skilful', 'goodness', 'beneficial', 'undecomposed', 'unspoilt', 'soundly', 'adept', 'honorable', 'proficient', 'sound', 'honest', 'in_effect', 'good', 'estimable', 'near', 'well', 'salutary', 'respectable', 'upright', 'expert', 'unspoiled', 'trade_good', 'serious', 'in_force', 'ripe'} 

{'evilness', 'ill', 'badness', 'evil', 'bad'}


`s.name()` extracts only the word from a lemma. For example, for a lemma in the synset `dependable.s.04` we want only the word `dependable`, and `s.name()` returns exactly that.

`s.antonyms()` gives the antonym lemmas of that specific lemma, and `a.name()` extracts only the word from each of them.

## S6.V55. Word Negation Tracking

"I was not happy with the team's performance" sentence with the negation word `not` has a negative meaning. The moment `not` word is taken out we get a positive meaning.

In all the models we have built so far, such as BOW, we have treated each word as a separate entity. TF-IDF reweights words by how rare they are across documents, but it still scores each word on its own. Here, however, we cannot treat `not` and `happy` separately, because together these two words mean unhappy. So we need a strategy that takes the 2 words into account together. 

The strategy is to track the negation word (`not`) and join it with the word (`happy`) that it negates. There are 2 ways to do that 
1. Add `_` after the negation word (`not`) to make it `not_happy`
2. Replace the combination of the negation word (`not`) and the negated word (`happy`) with an antonym of the negated word (`unhappy`).

In [519]:
# Word Negation Tracking - Strategy 1
import nltk
sentence = "I was not happy with the team's performance"
words = nltk.word_tokenize(sentence)

new_words = []

temp_word = ''
for word in words:
    if word == 'not':
        temp_word = 'not_'
    elif temp_word == 'not_':
        word = temp_word + word
        temp_word = ''
    if (word != 'not'):
        new_words.append(word)

sentence = ' '.join(new_words)
sentence

"I was not_happy with the team 's performance"

If the word is NOT the negation word, it is appended to the `new_words` list as it is.
If the word is the negation word `not`, we set `temp_word` to `not_` and do not append it. On the next iteration `temp_word` is prepended to the following word, which gives `not_happy`.
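
Strategy 1 can also be wrapped in a small function. The sketch below is a variation, not the notebook's code: it accepts a set of negation words (`never` and `no` are added here as an assumption) and works on a whitespace-split token list. The names `NEGATIONS` and `mark_negation` are illustrative.

```python
NEGATIONS = {"not", "never", "no"}  # assumption: the notebook handles only 'not'

def mark_negation(tokens):
    """Join each negation word to the token that follows it, e.g. not + happy -> not_happy."""
    out = []
    pending = ""  # holds e.g. 'not_' until the next token arrives
    for tok in tokens:
        if tok in NEGATIONS:
            pending = tok + "_"
        else:
            out.append(pending + tok)
            pending = ""
    return out

marked = " ".join(mark_negation("I was not happy with the performance".split()))
print(marked)  # I was not_happy with the performance
```

As in the notebook's loop, a negation word at the very end of the sentence has nothing to attach to and is dropped.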

In [522]:
# strategy to make an antonym of happy (building onto the previous code)

import nltk
from nltk.corpus import wordnet

sentence = "I was not happy with the team's performance"

words = nltk.word_tokenize(sentence)

new_words = []

temp_word = ''
for word in words:
    antonyms = []
    if word == 'not':
        temp_word = 'not_'
    elif temp_word == 'not_':
        for syn in wordnet.synsets(word):
            for s in syn.lemmas(): # syn.lemmas(): gives all words in that particular synset
                for a in s.antonyms():
                    antonyms.append(a.name())
        if len(antonyms) >= 1:
            word = antonyms[0]
        else:
            word = temp_word + word
        temp_word = ''
    if word != 'not':
        new_words.append(word)

sentence = ' '.join(new_words)

sentence

"I was unhappy with the team 's performance"

In strategy 1, the `elif` branch was:

    elif temp_word == 'not_':
        word = temp_word + word
        temp_word = ''

**It has been replaced by the code that finds antonyms:**

    for syn in wordnet.synsets(word):
        for s in syn.lemmas():
            for a in s.antonyms():
                antonyms.append(a.name())

followed by the choice between the antonym and the `not_` form:

    if len(antonyms) >= 1:
        word = antonyms[0]
    else:
        word = temp_word + word
    temp_word = ''

         
If there is at least one antonym we select the first one. However, if there is no antonym we prefix the word with `not_`, as in `not_happy`.

In some cases `not_happy` will work better with ML models; in other cases the antonym works better.
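
The choice between the two outcomes can be seen side by side in a self-contained sketch. Here a tiny hand-written table stands in for the WordNet lookup; `TOY_ANTONYMS` and `resolve_negation` are hypothetical names and the table entries are illustrative only.

```python
# Toy antonym table standing in for the WordNet lookup (hypothetical, for illustration only)
TOY_ANTONYMS = {"happy": ["unhappy"], "good": ["bad", "evil"]}

def resolve_negation(tokens, antonym_table):
    """Replace 'not X' by an antonym of X when one is known, else by 'not_X'."""
    out = []
    negate_next = False
    for tok in tokens:
        if tok == "not":
            negate_next = True
            continue
        if negate_next:
            candidates = antonym_table.get(tok, [])
            tok = candidates[0] if candidates else "not_" + tok
            negate_next = False
        out.append(tok)
    return out

with_antonym = " ".join(resolve_negation("I was not happy".split(), TOY_ANTONYMS))
with_fallback = " ".join(resolve_negation("it is not purple".split(), TOY_ANTONYMS))
print(with_antonym)   # I was unhappy
print(with_fallback)  # it is not_purple
```

Taking the first antonym is a simple rule; with a real lexical resource the first entry is not guaranteed to be the best fit for the sentence, so this choice is worth checking on your own data.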