<font face='georgia'>
    
   <h4><strong>What does tf-idf mean?</strong></h4>

   <p>    
Tf-idf stands for <em>term frequency-inverse document frequency</em>, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
</p>
    
   <p>
One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.
</p>
    
   <p>
Tf-idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
</p>
    
</font>
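
The simple ranking function described above, which sums the tf-idf weight of each query term, can be sketched as follows. This is a minimal illustration rather than a production ranker: the toy documents, the query, and the helper names `tf`, `idf` and `score` are assumptions made for the example, and the plain (unsmoothed) IDF is used.

```python
import math
from collections import Counter

# Toy corpus, chosen only for illustration
docs = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # Normalized term frequency: count of the term / document length
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    # Plain IDF: log(N / df); 0.0 for terms that occur in no document
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df) if df else 0.0

def score(query, tokens):
    # Ranking score: sum of tf-idf weights over the query terms
    return sum(tf(t, tokens) * idf(t) for t in query.split())

scores = [score("second document", tokens) for tokens in tokenized]
best = max(range(n_docs), key=lambda i: scores[i])
print(scores, best)
```

The second document ranks highest because it is the only one containing the rare term "second" and it also repeats "document".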

<font face='georgia'>
    <h4><strong>How to Compute:</strong></h4>

Typically, the tf-idf weight is composed of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

 <ul>
    <li>
<strong>TF:</strong> Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term may appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e. the total number of terms in the document) as a form of normalization: <br>

$TF(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$
</li>
<li>
<strong>IDF:</strong> Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones, by computing the following: <br>

$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}}.$
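<br>
For numerical stability, we modify this formula slightly by adding 1 to the denominator, which avoids division by zero for a term that appears in no document: <br>
$IDF(t) = \log_{e}\frac{\text{Total  number of documents}} {\text{Number of documents with term t in it}+1}.$
<br>
Note that scikit-learn's <code>TfidfVectorizer</code> with default settings, and the custom implementation that follows, use a related smoothed form: <br>
$IDF(t) = 1 + \log_{e}\frac{1 + \text{Total  number of documents}} {1 + \text{Number of documents with term t in it}}.$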
</li>
</ul>
</font>
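
To make the two formulas concrete, here is a small worked sketch that evaluates TF and both IDF variants by hand on the four example sentences used in this notebook. The helper names (`term_frequency`, `document_frequency`, `idf_plain`, `idf_plus_one`) are made up for this illustration.

```python
import math
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
docs = [doc.split() for doc in corpus]

def term_frequency(term, tokens):
    # Number of times the term appears / total number of terms in the document
    return Counter(tokens)[term] / len(tokens)

def document_frequency(term):
    # Number of documents that contain the term at least once
    return sum(1 for tokens in docs if term in tokens)

def idf_plain(term):
    # log(N / df); raises ZeroDivisionError if the term is in no document
    return math.log(len(docs) / document_frequency(term))

def idf_plus_one(term):
    # log(N / (df + 1)); the +1 keeps the denominator non-zero
    return math.log(len(docs) / (document_frequency(term) + 1))

tf_doc = term_frequency('document', docs[1])   # 2 occurrences out of 6 terms
idf_second = idf_plain('second')               # log(4 / 1)
idf_second_adj = idf_plus_one('second')        # log(4 / 2)
idf_unseen_adj = idf_plus_one('unseen')        # log(4 / 1), no division by zero
print(tf_doc, idf_second, idf_second_adj, idf_unseen_adj)
```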

### Corpus

In [None]:
# Collection of string documents

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

### SkLearn Implementation

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)

In [None]:
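# sklearn feature names; they are sorted in alphabetical order by default.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.

print(vectorizer.get_feature_names_out().tolist())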

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [None]:
# Print the idf values learned by the sklearn tf-idf vectorizer during fit.
# After fitting on the corpus, the vocabulary has 9 words, each with its own idf value.

print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
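
The values printed above can be reproduced by hand. With its default settings (`smooth_idf=True`), scikit-learn computes `idf(t) = ln((1 + N) / (1 + df(t))) + 1`, where `N` is the number of documents and `df(t)` the number of documents containing `t`. The sketch below applies that formula to the same corpus using only the standard library; the variable names are chosen for this example.

```python
import math

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
docs = [set(doc.split()) for doc in corpus]       # unique words per document
vocab = sorted(set().union(*docs))                # alphabetical, like sklearn
n_docs = len(docs)

# Document frequency: in how many documents each word occurs
df = {word: sum(1 for d in docs if word in d) for word in vocab}

# Smoothed IDF, as used by scikit-learn when smooth_idf=True
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

for word in vocab:
    print(f"{word:10s} df={df[word]}  idf={idf[word]:.8f}")
```

Words that appear in all four documents ("is", "the", "this") get an IDF of exactly 1, while words that appear in a single document get the largest value.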


In [None]:
# shape of sklearn tfidf vectorizer output after applying transform method.
skl_output.shape

(4, 9)

In [None]:
print(skl_output)

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 5)	0.5386476208856763
  (1, 3)	0.281088674033753
  (1, 1)	0.6876235979836938
  (2, 8)	0.267103787642168
  (2, 7)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 4)	0.511848512707169
  (2, 3)	0.267103787642168
  (2, 0)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045


In [None]:
# sklearn tfidf values for first line of the above corpus.
# Here the output is a sparse matrix
print(skl_output[0])

  (0, 8)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045


In [None]:
# sklearn tfidf values for first line of the above corpus.
# To understand the output better, here we are converting the sparse output matrix to dense matrix and printing it.
print(skl_output[0].toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
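
The dense row above can also be checked by hand. With default settings, scikit-learn multiplies the raw term count by the IDF and then scales each row to unit L2 norm. The sketch below redoes that arithmetic for the first document, taking the IDF values from the `vectorizer.idf_` output printed earlier; the variable names are chosen for this example.

```python
import math

# IDF values of the five words in 'this is the first document',
# copied from the vectorizer.idf_ output above
idf = {'document': 1.22314355, 'first': 1.51082562,
       'is': 1.0, 'the': 1.0, 'this': 1.0}

# Every word occurs exactly once in the first document
counts = {word: 1 for word in idf}

# Unnormalized weight: term count * idf
raw = {word: counts[word] * idf[word] for word in idf}

# L2 normalization: divide each weight by the Euclidean norm of the row
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf = {word: w / norm for word, w in raw.items()}

for word, w in tfidf.items():
    print(f"{word:10s} {w:.8f}")
```

Because of the L2 normalization, dividing the counts by the document length beforehand does not change the result: the constant factor cancels out within each row.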


### Custom Implementation

In [None]:
from collections import Counter
from tqdm import tqdm
from scipy.sparse import csr_matrix
import math
import operator
from sklearn.preprocessing import normalize
import numpy as np

# Custom equivalent of the .fit() method of scikit-learn's TfidfVectorizer.
# It tokenizes each document and builds the vocabulary of unique words.


def f_c(corpus_n):
    """Custom fit: return the sorted vocabulary and the tokenized documents."""
    # Split each document into a list of words
    w = [doc.split() for doc in corpus_n]
    # Join all documents into a single string and split it into individual words
    w1 = " ".join(corpus_n).split()
    # np.unique returns the unique words in sorted order; this is the vocabulary
    u = np.unique(w1)
    return u, w

def transfo_c(v, w2, corpus_t):
    """Custom transform: return the L2-normalized TF-IDF sparse matrix and the IDF dictionary.

    v is the vocabulary, w2 the tokenized documents, corpus_t the raw corpus.
    """
    n_docs = len(corpus_t)
    i_f = []
    for word in v:
        # Document frequency: number of documents that contain the word
        i_count = sum(1 for doc in w2 if word in doc)
        # Smoothed IDF, the same form scikit-learn uses by default
        i_f.append(1 + math.log((1 + n_docs) / (1 + i_count)))
    # Map each vocabulary word to its IDF value
    idf_dict = dict(zip(v, i_f))

    y = []
    for n in w2:
        # Frequency of each word in the current document
        t_dict = Counter(n)
        # TF (count / document length) times IDF for every vocabulary word;
        # words absent from the document get a count of 0
        k = [(t_dict.get(a, 0) / len(n)) * idf_dict[a] for a in v]
        y.append(k)

    # Convert the TF-IDF values to a sparse matrix and L2-normalize each row
    r = normalize(csr_matrix(np.array(y)), norm='l2', axis=1)
    return r, idf_dict

r1, r2= f_c(corpus)                                                             #Fitting the corpus data 

fin_res_r, fin_res_id= transfo_c(r1,r2,corpus)                                  #Transforming the fitted data and printing the output
print(fin_res_r)

  (0, 1)	0.4697913855799205
  (0, 2)	0.580285823684436
  (0, 3)	0.3840852409148149
  (0, 6)	0.3840852409148149
  (0, 8)	0.3840852409148149
  (1, 1)	0.6876235979836937
  (1, 3)	0.2810886740337529
  (1, 5)	0.5386476208856762
  (1, 6)	0.2810886740337529
  (1, 8)	0.2810886740337529
  (2, 0)	0.511848512707169
  (2, 3)	0.267103787642168
  (2, 4)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 7)	0.511848512707169
  (2, 8)	0.267103787642168
  (3, 1)	0.4697913855799205
  (3, 2)	0.580285823684436
  (3, 3)	0.3840852409148149
  (3, 6)	0.3840852409148149
  (3, 8)	0.3840852409148149


<font face='georgia'>
    <h4><strong>2. Implementation of max features functionality:</strong></h4>

<ul>
    <li>As part of this task, we modify the fit and transform functions so that the vocabulary contains only the 50 terms with the highest idf scores.</li>
    <br>
    <li>A pickle file named <strong>cleaned_strings</strong> is provided. We load the corpus from it and use it as input to the tf-idf vectorizer.</li>
    <br>
</ul>
</font>

In [None]:
# Below is the code to load the cleaned_strings pickle file provided
# Here corpus is of list type

import pickle
with open('cleaned_strings', 'rb') as f:
    corpus_2 = pickle.load(f)
    
# printing the length of the corpus loaded
print("Number of documents in corpus = ",len(corpus_2))

Number of documents in corpus =  746


In [None]:
print(corpus_2[:10])

['slow moving aimless movie distressed drifting young man', 'not sure lost flat characters audience nearly half walked', 'attempting artiness black white clever camera angles movie disappointed became even ridiculous acting poor plot lines almost non existent', 'little music anything speak', 'best scene movie gerardo trying find song keeps running head', 'rest movie lacks art charm meaning emptiness works guess empty', 'wasted two hours', 'saw movie today thought good effort good messages kids', 'bit predictable', 'loved casting jimmy buffet science teacher']


In [None]:
# Write your code here.
# Try not to hardcode any values.
# Make sure it's well documented and readable, with appropriate comments.

Fitting the new corpus with the custom fit function from Task 1 to obtain the vocabulary

In [None]:
q2_r1, q2_r2= f_c(corpus_2)                                                     #Fitting the new corpus data
print(q2_r1)                                                                    #Printing a sample of the fitted vocabulary
print(len(q2_r1))                                                               #Getting the size of the vocabulary, i.e. the number of unique words in the corpus

['aailiyah' 'abandoned' 'ability' ... 'zillion' 'zombie' 'zombiez']
2897


Computing the IDF values of all the vocabulary words with the transform function and keeping only the IDF values

In [None]:
vals2, idf_val= transfo_c(q2_r1, q2_r2,corpus_2)                         #Transforming the fitted values and getting the IDF values
print(type(idf_val))                                                     #Printing the data type of the returned IDF values

<class 'dict'>


Getting the length of the IDF values dictionary

In [None]:
len(idf_val)

2897

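Sorting the vocabulary words by IDF value in descending order, then picking the top 50 IDF values and their corresponding words to create a new vocabulary named "new". Because the sort is stable and the vocabulary is already in alphabetical order, words with equal IDF values stay in alphabetical order.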

In [None]:
sor_idf= sorted(idf_val.items(), key=lambda x: x[1], reverse=True)              #Sorting the (word, IDF) pairs by IDF value in descending order
new= sor_idf[:50]                                                               #Keeping the top 50 pairs as the new vocab (a list of tuples)

In [None]:
u50= [word for word, _ in new]                                                  #Collecting the words of the new vocab
iv50= [idf for _, idf in new]                                                   #Collecting the IDF value of each word in the new vocab

print(u50)
print(iv50) 

['aailiyah', 'abandoned', 'abroad', 'abstruse', 'academy', 'accents', 'accessible', 'acclaimed', 'accolades', 'accurate', 'accurately', 'achille', 'ackerman', 'actions', 'adams', 'add', 'added', 'admins', 'admiration', 'admitted', 'adrift', 'adventure', 'aesthetically', 'affected', 'affleck', 'afternoon', 'aged', 'ages', 'agree', 'agreed', 'aimless', 'aired', 'akasha', 'akin', 'alert', 'alike', 'allison', 'allow', 'allowing', 'alongside', 'amateurish', 'amaze', 'amazed', 'amazingly', 'amusing', 'amust', 'anatomist', 'angel', 'angela', 'angelina']
[6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.922918004572872, 6.9229180
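
The top 50 words above all start with "a" because many words share the same maximal IDF value and Python's `sorted` is stable: with `reverse=True`, items with equal keys keep their original order, which here is the alphabetical order of the vocabulary. The toy sketch below illustrates this behaviour. The dictionary `toy_idf` and its values are invented for the example and are not taken from the corpus.

```python
# Hypothetical IDF values; the insertion order is deliberately not alphabetical
toy_idf = {'aimless': 3.0, 'movie': 1.5, 'zombie': 3.0, 'good': 2.0, 'akin': 3.0}
k = 2

# Stable sort by IDF, descending: ties keep their insertion order
top_k_stable = sorted(toy_idf.items(), key=lambda item: item[1], reverse=True)[:k]

# Explicit tie-break: IDF descending, then word ascending
top_k_alpha = sorted(toy_idf.items(), key=lambda item: (-item[1], item[0]))[:k]

print(top_k_stable)
print(top_k_alpha)
```

When the input is already sorted alphabetically, as the vocabulary returned by `np.unique` is, both approaches select the same words. Note also that scikit-learn's own `max_features` option ranks terms by their frequency across the corpus rather than by IDF, so its selection would generally differ from the one made here.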

Using the above created transform function to get the TFIDF value sparse matrix and printing the sparse matrix

In [None]:
res2, res2_id= transfo_c(u50,q2_r2,corpus_2)                                    #Calling the Transform function and applying it to the new corpus with the new vocab
print(res2)

  (0, 30)	1.0
  (68, 24)	1.0
  (72, 29)	1.0
  (74, 31)	1.0
  (119, 33)	1.0
  (135, 3)	0.37796447300922725
  (135, 10)	0.37796447300922725
  (135, 18)	0.37796447300922725
  (135, 20)	0.37796447300922725
  (135, 36)	0.37796447300922725
  (135, 40)	0.37796447300922725
  (135, 41)	0.37796447300922725
  (176, 49)	1.0
  (181, 13)	1.0
  (192, 21)	1.0
  (193, 23)	1.0
  (216, 2)	1.0
  (222, 47)	1.0
  (225, 19)	1.0
  (227, 17)	1.0
  (241, 44)	1.0
  (270, 1)	1.0
  (290, 25)	1.0
  (333, 26)	1.0
  (334, 15)	1.0
  (341, 43)	1.0
  (344, 42)	1.0
  (348, 8)	1.0
  (377, 37)	1.0
  (409, 5)	1.0
  (430, 39)	1.0
  (457, 45)	1.0
  (461, 4)	1.0
  (465, 38)	1.0
  (475, 35)	1.0
  (493, 6)	1.0
  (500, 48)	1.0
  (548, 0)	0.7071067811865475
  (548, 32)	0.7071067811865475
  (608, 14)	1.0
  (612, 11)	1.0
  (620, 46)	1.0
  (632, 7)	1.0
  (644, 12)	0.7071067811865475
  (644, 27)	0.7071067811865475
  (664, 28)	1.0
  (667, 22)	1.0
  (691, 34)	1.0
  (697, 9)	1.0
  (722, 16)	1.0


Converting the sparse matrix into a dense matrix and checking its shape. The resulting dense matrix has 50 columns, one per vocabulary word, and one row per document in the corpus.

In [None]:
F=res2.todense()
F.shape

(746, 50)