# TF-IDF Model

In this model, some semantic information is preserved because uncommon words are given more importance than common words. <br>
Example: "She is beautiful". Here "beautiful" will have more importance than "She" or "is".
<br><br>
TF = term frequency of a particular word in a document. It is calculated per document, so the same word can have a different TF value in each document. <br>
IDF = inverse document frequency of a word across the whole corpus of documents, so each word has a single IDF value. <br>
TF-IDF = TF * IDF 
<br><br>

TF = (number of occurrences of a word in a document)/(number of all words in that document)
<br><br>

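IDF = log<sub>e</sub>((number of documents)/(number of documents containing the specific word)) <br>
The code in this notebook adds 1 inside the logarithm, i.e. log<sub>e</sub>((number of documents)/(number of documents containing the word) + 1), so that a word occurring in every document still gets a positive IDF.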

<br><br>
TFIDF(Document, Word) = TF(Document, Word) * IDF(Word)
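
As a quick check of the formulas above, here is a minimal sketch on a three-sentence toy corpus. The corpus and the helper names `tf`, `idf` and `tfidf` are made up for this illustration, and it uses the plain log<sub>e</sub> form of IDF without any smoothing.

```python
import math

# Hypothetical toy corpus: each document is a list of lower-cased tokens.
docs = [
    "she is beautiful".split(),
    "she is here".split(),
    "he is here".split(),
]

def tf(word, doc):
    """Occurrences of `word` in `doc` divided by the number of tokens in `doc`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Natural log of (number of documents / number of documents containing `word`).

    Assumes `word` occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    """TF of `word` in `doc` multiplied by its corpus-wide IDF."""
    return tf(word, doc) * idf(word, docs)

# Score every word of the first document.
for word in docs[0]:
    print(word, round(tfidf(word, docs[0], docs), 4))
```

In this toy corpus "is" appears in every document, so its unsmoothed IDF is log(1) = 0 and its score vanishes, while "beautiful" appears in only one document and gets the highest score.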

In [1]:
import nltk
import re
import heapq
import numpy as np

In [2]:
paragraph = """Thank you all so very much. Thank you to the Academy. 
               Thank you to all of you in this room. I have to congratulate 
               the other incredible nominees this year. The Revenant was 
               the product of the tireless efforts of an unbelievable cast
               and crew. First off, to my brother in this endeavor, Mr. Tom 
               Hardy. Tom, your talent on screen can only be surpassed by 
               your friendship off screen … thank you for creating a t
               ranscendent cinematic experience. Thank you to everybody at 
               Fox and New Regency … my entire team. I have to thank 
               everyone from the very onset of my career … To my parents; 
               none of this would be possible without you. And to my 
               friends, I love you dearly; you know who you are. And lastly,
               I just want to say this: Making The Revenant was about
               man's relationship to the natural world. A world that we
               collectively felt in 2015 as the hottest year in recorded
               history. Our production needed to move to the southern
               tip of this planet just to be able to find snow. Climate
               change is real, it is happening right now. It is the most
               urgent threat facing our entire species, and we need to work
               collectively together and stop procrastinating. We need to
               support leaders around the world who do not speak for the 
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and 
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and 
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this 
               amazing award tonight. Let us not take this planet for 
               granted. I do not take tonight for granted. Thank you so very much."""

## Preprocess the Data

In [3]:
dataset = nltk.sent_tokenize(paragraph)
print(dataset)

['Thank you all so very much.', 'Thank you to the Academy.', 'Thank you to all of you in this room.', 'I have to congratulate \n               the other incredible nominees this year.', 'The Revenant was \n               the product of the tireless efforts of an unbelievable cast\n               and crew.', 'First off, to my brother in this endeavor, Mr. Tom \n               Hardy.', 'Tom, your talent on screen can only be surpassed by \n               your friendship off screen … thank you for creating a t\n               ranscendent cinematic experience.', 'Thank you to everybody at \n               Fox and New Regency … my entire team.', 'I have to thank \n               everyone from the very onset of my career … To my parents; \n               none of this would be possible without you.', 'And to my \n               friends, I love you dearly; you know who you are.', "And lastly,\n               I just want to say this: Making The Revenant was about\n               man's relations

In [4]:
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()
    dataset[i] = re.sub(r'\W',' ', dataset[i])
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])
    dataset[i] = dataset[i].strip()
print(dataset)

['thank you all so very much', 'thank you to the academy', 'thank you to all of you in this room', 'i have to congratulate the other incredible nominees this year', 'the revenant was the product of the tireless efforts of an unbelievable cast and crew', 'first off to my brother in this endeavor mr tom hardy', 'tom your talent on screen can only be surpassed by your friendship off screen thank you for creating a t ranscendent cinematic experience', 'thank you to everybody at fox and new regency my entire team', 'i have to thank everyone from the very onset of my career to my parents none of this would be possible without you', 'and to my friends i love you dearly you know who you are', 'and lastly i just want to say this making the revenant was about man s relationship to the natural world', 'a world that we collectively felt in 2015 as the hottest year in recorded history', 'our production needed to move to the southern tip of this planet just to be able to find snow', 'climate change 
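
To see what each cleaning step does in isolation, the same four operations can be wrapped in a small helper and applied to a single sentence. The name `clean_sentence` and the sample sentence are assumptions made for this sketch; they are not part of the notebook.

```python
import re

def clean_sentence(text):
    """Lower-case `text` and reduce it to space-separated word tokens."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)   # every non-word character becomes a space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace into one space
    return text.strip()               # drop leading and trailing spaces

print(clean_sentence("Climate change is real, it's happening!"))
```

Because the apostrophe is a non-word character, "it's" is split into "it" and "s". The same effect turns "man's" into "man s" in the output above, which is why a stray "s" token shows up in the vocabulary later on.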

## Create Histogram

First, we map every distinct word to its number of occurrences across all sentences, using a dictionary.

In [5]:
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for w in words:
        # First occurrence starts the count at 1; later occurrences increment it
        if w not in word2count:
            word2count[w] = 1
        else:
            word2count[w] += 1
print(word2count)
print('*** Length of word2count:', len(word2count), sep=" ")

{'thank': 8, 'you': 12, 'all': 4, 'so': 2, 'very': 3, 'much': 2, 'to': 16, 'the': 17, 'academy': 1, 'of': 10, 'in': 4, 'this': 9, 'room': 1, 'i': 6, 'have': 3, 'congratulate': 1, 'other': 1, 'incredible': 1, 'nominees': 1, 'year': 2, 'revenant': 2, 'was': 2, 'product': 1, 'tireless': 1, 'efforts': 1, 'an': 1, 'unbelievable': 1, 'cast': 1, 'and': 8, 'crew': 1, 'first': 1, 'off': 2, 'my': 5, 'brother': 1, 'endeavor': 1, 'mr': 1, 'tom': 2, 'hardy': 1, 'your': 2, 'talent': 1, 'on': 1, 'screen': 2, 'can': 1, 'only': 1, 'be': 4, 'surpassed': 1, 'by': 3, 'friendship': 1, 'for': 10, 'creating': 1, 'a': 2, 't': 1, 'ranscendent': 1, 'cinematic': 1, 'experience': 1, 'everybody': 1, 'at': 1, 'fox': 1, 'new': 1, 'regency': 1, 'entire': 2, 'team': 1, 'everyone': 1, 'from': 1, 'onset': 1, 'career': 1, 'parents': 1, 'none': 1, 'would': 2, 'possible': 1, 'without': 1, 'friends': 1, 'love': 1, 'dearly': 1, 'know': 1, 'who': 4, 'are': 1, 'lastly': 1, 'just': 2, 'want': 1, 'say': 1, 'making': 1, 'about': 
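
The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below uses two hard-coded sentences in place of `dataset` and `str.split()` in place of `nltk.word_tokenize`, on the assumption that whitespace splitting is close enough for text that is already lower-cased and free of punctuation.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned `dataset` list.
sentences = ["thank you all so very much", "thank you to the academy"]

counts = Counter()
for sentence in sentences:
    # update() adds one to the count of every token in the list
    counts.update(sentence.split())

print(counts)
print('*** Length of counts:', len(counts))
```

A `Counter` is a `dict` subclass, so it should work as a drop-in replacement for `word2count` in the cells that follow.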

## Filter the words
Keep only the n most frequent words, based on word count (here n = 100).

In [6]:
freq_words = heapq.nlargest(100, word2count, key=word2count.get)
print(freq_words)

['the', 'to', 'you', 'of', 'for', 'this', 'thank', 'and', 'i', 'my', 'all', 'in', 'be', 'who', 'world', 'very', 'have', 'by', 'we', 'our', 'is', 'not', 'people', 'out', 'so', 'much', 'year', 'revenant', 'was', 'off', 'tom', 'your', 'screen', 'a', 'entire', 'would', 'just', 's', 'collectively', 'planet', 'it', 'most', 'need', 'do', 'speak', 'billions', 'there', 'children', 'tonight', 'take', 'granted', 'academy', 'room', 'congratulate', 'other', 'incredible', 'nominees', 'product', 'tireless', 'efforts', 'an', 'unbelievable', 'cast', 'crew', 'first', 'brother', 'endeavor', 'mr', 'hardy', 'talent', 'on', 'can', 'only', 'surpassed', 'friendship', 'creating', 't', 'ranscendent', 'cinematic', 'experience', 'everybody', 'at', 'fox', 'new', 'regency', 'team', 'everyone', 'from', 'onset', 'career', 'parents', 'none', 'possible', 'without', 'friends', 'love', 'dearly', 'know', 'are', 'lastly']
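
`heapq.nlargest` with `key=word2count.get` iterates over the dictionary keys and ranks them by their counts. A minimal sketch with made-up counts, compared against `Counter.most_common`, which returns the same ranking together with the counts:

```python
import heapq
from collections import Counter

# Hypothetical counts, chosen without ties so the ranking is unambiguous.
word_counts = Counter({"the": 17, "to": 16, "you": 12, "academy": 1})

# Iterating a dict yields its keys; `key=` looks up the count of each key.
top_heap = heapq.nlargest(2, word_counts, key=word_counts.get)

# most_common() returns (word, count) pairs; keep only the words.
top_counter = [word for word, _ in word_counts.most_common(2)]

print(top_heap)
print(top_counter)
```

When many words share the same count, as the words that occur only once do in the output above, the cut-off at n falls somewhere inside that group, so which of them end up in the vocabulary depends on the order in which they were first seen rather than on their importance.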


## Create IDF Matrix

In [11]:
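word_idfs = {}
for w in freq_words:
    # Document frequency: number of sentences that contain the word at least once
    doc_count = 0
    for data in dataset:
        if w in nltk.word_tokenize(data):
            doc_count += 1
    # Adding 1 inside the log is a smoothing term: it keeps the IDF strictly positive,
    # even for a word that occurs in every sentence
    word_idfs[w] = np.log((len(dataset) / doc_count) + 1)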

In [12]:
word_idfs

{'the': 1.1314021114911006,
 'to': 1.067840630001356,
 'you': 1.2039728043259361,
 'of': 1.5040773967762742,
 'for': 1.5040773967762742,
 'this': 1.2039728043259361,
 'thank': 1.2878542883066382,
 'and': 1.3862943611198906,
 'i': 1.5040773967762742,
 'my': 1.8325814637483102,
 'all': 1.8325814637483102,
 'in': 2.0794415416798357,
 'be': 1.8325814637483102,
 'who': 2.4423470353692043,
 'world': 2.0794415416798357,
 'very': 2.0794415416798357,
 'have': 2.0794415416798357,
 'by': 2.0794415416798357,
 'we': 2.0794415416798357,
 'our': 2.0794415416798357,
 'is': 2.4423470353692043,
 'not': 2.0794415416798357,
 'people': 2.4423470353692043,
 'out': 2.4423470353692043,
 'so': 2.4423470353692043,
 'much': 2.4423470353692043,
 'year': 2.4423470353692043,
 'revenant': 2.4423470353692043,
 'was': 2.4423470353692043,
 'off': 2.4423470353692043,
 'tom': 2.4423470353692043,
 'your': 3.091042453358316,
 'screen': 3.091042453358316,
 'a': 2.4423470353692043,
 'entire': 2.4423470353692043,
 'would': 2.
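
The IDF cell tokenizes every sentence again for every word. One possible alternative, sketched here on a hypothetical three-document corpus, converts each document to a set once and then counts memberships. The names `tokenized`, `vocab`, `doc_sets`, `doc_freq` and `idfs` are invented for this sketch; the formula is the same smoothed log(N / df + 1) used above.

```python
import math

# Hypothetical tokenized documents and vocabulary.
tokenized = [
    ["thank", "you", "all"],
    ["thank", "you", "to", "the", "academy"],
    ["the", "revenant", "was", "the", "product"],
]
vocab = ["thank", "the", "academy"]

# Build each set once: membership tests are fast, and a word repeated
# inside one document is still counted as a single document.
doc_sets = [set(tokens) for tokens in tokenized]
n_docs = len(doc_sets)

# Document frequency: in how many documents does each word occur?
doc_freq = {w: sum(w in s for s in doc_sets) for w in vocab}

# Smoothed IDF, log(N / df + 1); assumes every vocabulary word occurs somewhere.
idfs = {w: math.log(n_docs / doc_freq[w] + 1) for w in vocab}

print(doc_freq)
print(idfs)
```

"the" occurs twice in the third document but contributes only 1 to its document frequency, which is the behavior the `if w in ...` test in the cell above has as well.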

## Create TF Matrix

In [13]:
tf_matrix = {}
for w in freq_words:
    doc_tf = []
    for data in dataset:
        tokens = nltk.word_tokenize(data)
        # TF = occurrences of w in this sentence / total number of tokens in the sentence
        frequency = tokens.count(w)
        tf_value = frequency / len(tokens)
        doc_tf.append(tf_value)
    tf_matrix[w] = doc_tf

In [14]:
tf_matrix

{'the': [0.0,
  0.2,
  0.0,
  0.1,
  0.2,
  0.0,
  0.0,
  0.0,
  0.043478260869565216,
  0.0,
  0.1,
  0.06666666666666667,
  0.05263157894736842,
  0.0,
  0.05,
  0.10638297872340426,
  0.045454545454545456,
  0.0,
  0.0,
  0.0,
  0.0],
 'to': [0.0,
  0.2,
  0.1111111111111111,
  0.1,
  0.0,
  0.09090909090909091,
  0.0,
  0.08333333333333333,
  0.08695652173913043,
  0.07692307692307693,
  0.1,
  0.0,
  0.21052631578947367,
  0.0,
  0.05,
  0.02127659574468085,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 'you': [0.16666666666666666,
  0.2,
  0.2222222222222222,
  0.0,
  0.0,
  0.0,
  0.043478260869565216,
  0.08333333333333333,
  0.043478260869565216,
  0.23076923076923078,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.1111111111111111,
  0.0,
  0.0,
  0.2],
 'of': [0.0,
  0.0,
  0.1111111111111111,
  0.0,
  0.13333333333333333,
  0.0,
  0.0,
  0.0,
  0.08695652173913043,
  0.0,
  0.0,
  0.0,
  0.05263157894736842,
  0.0,
  0.0,
  0.06382978723404255,
  0.045454545454545456,
  0.0,
 

## TF * IDF

In [15]:
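tfidf_matrix = []

# One row per word (in the same order as freq_words), one column per sentence
for word in tf_matrix:
    tfidf = []
    for value in tf_matrix[word]:
        # TF of the word in this sentence times the word's corpus-wide IDF
        score = value * word_idfs[word]
        tfidf.append(score)
    tfidf_matrix.append(tfidf)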

In [17]:
np.asarray(tfidf_matrix)

array([[0.        , 0.22628042, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.21356813, 0.11864896, ..., 0.        , 0.        ,
        0.        ],
       [0.20066213, 0.24079456, 0.26754951, ..., 0.        , 0.        ,
        0.24079456],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [19]:
len(tfidf_matrix[0])

21

In [20]:
len(tfidf_matrix)

100

In the BOW model we had documents as rows and words as columns. The TF-IDF matrix built above has words as rows and documents as columns (100 words by 21 sentences), so we need to transpose it to get the same layout.

In [21]:
X = np.asarray(tfidf_matrix)
X

array([[0.        , 0.22628042, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.21356813, 0.11864896, ..., 0.        , 0.        ,
        0.        ],
       [0.20066213, 0.24079456, 0.26754951, ..., 0.        , 0.        ,
        0.24079456],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [23]:
X.shape

(100, 21)

In [24]:
X = np.transpose(X)
X

array([[0.        , 0.        , 0.20066213, ..., 0.        , 0.        ,
        0.        ],
       [0.22628042, 0.21356813, 0.24079456, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.11864896, 0.26754951, ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.24079456, ..., 0.        , 0.        ,
        0.        ]])

In [25]:
X.shape

(21, 100)
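
The transpose can be avoided by building the matrix with documents as rows from the start. The following is a minimal NumPy sketch on a hypothetical two-document corpus; `tokenized`, `vocab`, `counts` and `X_sketch` are names invented here, and the IDF uses the same log(N / df + 1) smoothing as the notebook.

```python
import numpy as np

# Hypothetical tokenized documents and vocabulary.
tokenized = [
    ["thank", "you", "thank", "you"],
    ["thank", "the", "academy"],
]
vocab = ["thank", "you", "the", "academy"]
index = {w: j for j, w in enumerate(vocab)}

# Raw counts: rows are documents, columns are vocabulary words.
counts = np.zeros((len(tokenized), len(vocab)))
for i, tokens in enumerate(tokenized):
    for t in tokens:
        if t in index:  # tokens outside the vocabulary are skipped
            counts[i, index[t]] += 1

# TF: divide each row by the total number of tokens in that document.
doc_lengths = np.array([len(tokens) for tokens in tokenized])
tf = counts / doc_lengths[:, None]

# IDF: assumes every vocabulary word occurs in at least one document.
doc_freq = (counts > 0).sum(axis=0)
idf = np.log(len(tokenized) / doc_freq + 1)

# Broadcasting multiplies every row of tf by the per-word idf vector.
X_sketch = tf * idf
print(X_sketch.shape)
```

The result already has one row per document and one column per word, so no call to `np.transpose` is needed.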