# TF-IDF Computation Example
This is a very simple example of TF-IDF computation. It is divided into two sections. The first discusses TF-IDF as presented during the last edition of the course (2020/2021) and should be used to prepare for the exam. The second shows how scikit-learn computes TF-IDF and discusses the differences between the basic version discussed during the course and the actual implementation available in scikit-learn. This second part won't be part of any written exam; it is included for completeness.

### References
- https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting
- https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76

First, we import the libraries we will need.

In [1]:
import pandas as pd
import numpy as np
import math
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pierlucalanzi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Next we define a small corpus of three documents. This is the same corpus used in the problem from the June 28, 2019 exam, which has been discussed at length in the forum.

In [2]:
corpus = [
    'A time to plant and a time to reap',
    'Time for you and time for me',
    'Fly Time'
]

From the corpus we have to extract the "bag of words" representation of each document, which ignores the order of the words and represents the document simply as a collection of words in which repetitions are kept. Note that to extract the representation we also eliminate all the stopwords (words that are not really interesting since they don't provide much information).

In [3]:
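def compute_bag_of_words(corpus):
    # build the stopword set once instead of reloading the list for every token
    english_stopwords = set(stopwords.words('english'))
    bag_of_words = []
    for text in corpus:
        bag_of_words.append([x.lower() for x in text.split(' ') if x.lower() not in english_stopwords])
    return bag_of_words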

In [4]:
bag_of_words = compute_bag_of_words(corpus)

In [5]:
bag_of_words

[['time', 'plant', 'time', 'reap'], ['time', 'time'], ['fly', 'time']]
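Because a bag of words keeps repetitions but ignores order, Python's `collections.Counter` is a natural fit for it. The following is a minimal sketch, not the code used in the course: the `STOPWORDS` set is a hand-picked stand-in for the NLTK English list (an assumption made so the example runs without any download), and `bag_as_counter` is a hypothetical helper name.

```python
from collections import Counter

# Assumption: a tiny hand-picked stopword set standing in for NLTK's English list.
STOPWORDS = {"a", "to", "and", "for", "you", "me"}

def bag_as_counter(text, stopwords=STOPWORDS):
    """Return a multiset (word -> count) of the non-stopword tokens in text."""
    tokens = (token.lower() for token in text.split())
    return Counter(token for token in tokens if token not in stopwords)

corpus = [
    'A time to plant and a time to reap',
    'Time for you and time for me',
    'Fly Time',
]

# One Counter per document; word order is discarded, repetitions are counted.
bags = [bag_as_counter(text) for text in corpus]
```

With this representation the raw term counts come for free: `bags[0]['time']` is the number of times `time` occurs in the first document, and a word that is absent simply counts as zero.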

Given a corpus, we also need to compute the dictionary that will be used to create the table that represents the documents. The dictionary is simply the set of all the words contained in the corpus (excluding the stopwords, which have already been eliminated).

In [6]:
def compute_dictionary(bag_of_words):
    # the dictionary is the union of the words of all the documents
    dictionary = set()
    for bow in bag_of_words:
        dictionary = dictionary.union(bow)
    return dictionary

In [7]:
dictionary = compute_dictionary(bag_of_words)

In [8]:
dictionary

{'fly', 'plant', 'reap', 'time'}

As the very next step we can compute the "term frequency", that is, how frequently every word in the dictionary appears in each document.

In [9]:
def compute_tf_count(bag_of_words,dictionary):
    
    word_count = []
    
    for document in bag_of_words:
        document_word_count = dict.fromkeys(dictionary, 0)

        for w in document:
            document_word_count[w] = document_word_count[w]+1

        word_count.append(document_word_count)
        
    return word_count

In [10]:
word_count = compute_tf_count(bag_of_words,dictionary)

In [11]:
word_count

[{'time': 2, 'plant': 1, 'reap': 1, 'fly': 0},
 {'time': 2, 'plant': 0, 'reap': 0, 'fly': 0},
 {'time': 1, 'plant': 0, 'reap': 0, 'fly': 1}]

We can now normalize these values and compute the actual frequency of each word, that is, its count divided by the total number of (non-stopword) words in the document. Note that we did not do this in the example done in class since we did not have the whole dictionary.

In [12]:
def compute_tf(document_word_count):
    
    document_tf = {}
    
    number_of_words = sum(document_word_count.values())
    
    for word, count in document_word_count.items():        
        document_tf[word] = count / float(number_of_words)
        
    return document_tf

In [13]:
tf = []
for document_word_count in word_count:
    tf.append(compute_tf(document_word_count))

In [14]:
tf

[{'time': 0.5, 'plant': 0.25, 'reap': 0.25, 'fly': 0.0},
 {'time': 1.0, 'plant': 0.0, 'reap': 0.0, 'fly': 0.0},
 {'time': 0.5, 'plant': 0.0, 'reap': 0.0, 'fly': 0.5}]

Let's now create an actual table for the counts and frequencies.

In [15]:
def create_df(tf,dictionary):
    data = {}
    for word in dictionary:

        word_tf = []

        for document_tf in tf:
            word_tf.append(document_tf[word])

        data[word] = word_tf

    df = pd.DataFrame(data)
    
    return df

In [16]:
df_count = create_df(word_count,dictionary)
df_tf = create_df(tf,dictionary)

In [17]:
df_count

| | time | plant | reap | fly |
|---|---|---|---|---|
| 0 | 2 | 1 | 1 | 0 |
| 1 | 2 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |


In [18]:
df_tf

| | time | plant | reap | fly |
|---|---|---|---|---|
| 0 | 0.5 | 0.25 | 0.25 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.5 | 0.0 | 0.0 | 0.5 |


We can now compute the IDF for every word in the dictionary. The formula presented in class for IDF is, 

$
\text{idf}(w) = \log\left(\frac{M}{k}\right)
$

where w is a word in the corpus dictionary, M is the number of documents in the corpus, and k is the number of documents in which word w appears. As usual, we use the base-2 logarithm for the computations. Note that previous editions of the course used a slightly different formula. Note also that any logarithm base could be used; we choose base 2 so that results can be compared without worrying about which base was used.

In [19]:
def compute_idf(df):
    idf = {}

    # number of documents
    M = len(df)
    
    for word in df.columns:
        
        # number of documents in which the word appears
        
        k = sum(df[word]>0.0)
        
        idf[word] = math.log(M/k,2)
        
    return idf

In [20]:
idf = compute_idf(df_tf)

In [21]:
idf

{'time': 0.0,
 'plant': 1.5849625007211563,
 'reap': 1.5849625007211563,
 'fly': 1.5849625007211563}
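As a quick sanity check, these values can be reproduced by hand from the formula. The sketch below assumes the document frequencies k read off the count table (`time` appears in all 3 documents, every other word in exactly 1); the variable names are illustrative only.

```python
import math

M = 3  # number of documents in the corpus

# k for each word: in how many documents the word appears at least once
document_frequency = {"time": 3, "plant": 1, "reap": 1, "fly": 1}

# idf(w) = log2(M / k); k is never zero because the dictionary
# only contains words that occur somewhere in the corpus
idf_by_hand = {word: math.log2(M / k) for word, k in document_frequency.items()}
```

A word found in every document gets log2(3/3) = 0, while a word found in a single document gets log2(3/1), roughly 1.585. Switching to another base only rescales every value by the same constant, since log2(x) = ln(x) / ln(2).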

Note that since the word "time" appears in all the documents, its IDF value is zero. We can finally compute the TF-IDF representation of the corpus as a table that we can use, for instance, to cluster the documents.

In [22]:
tf_idf = df_tf.copy()
for word in tf_idf.columns:
    tf_idf[word] = tf_idf[word]*idf[word]

In [23]:
tf_idf

| | time | plant | reap | fly |
|---|---|---|---|---|
| 0 | 0.0 | 0.396241 | 0.396241 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.792481 |
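To see where a single entry comes from: `plant` in document 0 has TF 1/4 and IDF log2(3/1), so its TF-IDF is 0.25 × 1.585 ≈ 0.396. A self-contained sketch of the whole computation in plain Python (no pandas), starting from the word counts shown earlier, might look like the following; `course_tf_idf` is a hypothetical name, and the code assumes that no document is empty and that every word occurs in at least one document.

```python
import math

# Word counts per document, as computed earlier in the notebook.
counts = [
    {"time": 2, "plant": 1, "reap": 1, "fly": 0},
    {"time": 2, "plant": 0, "reap": 0, "fly": 0},
    {"time": 1, "plant": 0, "reap": 0, "fly": 1},
]

def course_tf_idf(counts):
    """TF as relative frequency times IDF as log2(M / k)."""
    M = len(counts)
    words = list(counts[0].keys())
    # k: number of documents in which each word appears at least once
    k = {w: sum(1 for doc in counts if doc[w] > 0) for w in words}
    idf = {w: math.log2(M / k[w]) for w in words}
    table = []
    for doc in counts:
        total = sum(doc.values())  # number of (non-stopword) words in the document
        table.append({w: doc[w] / total * idf[w] for w in words})
    return table

table = course_tf_idf(counts)
```

The second document contains only the word `time`, whose IDF is zero, so its whole row is zero: under this formula the document carries no distinguishing information.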


The results are similar to those available in the solution of the June 28, 2019 exam. The values differ because the formula used in 2018/2019 contained a smoothing factor.

## TF-IDF using Scikit-Learn
Scikit-learn has its own set of functions to preprocess a corpus and generate its TF-IDF representation, so that we can avoid doing everything by hand. The following few lines replicate the entire process we just performed.

In [25]:
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
vectors = vectorizer.fit_transform(corpus)
sklearn_idf_values = vectorizer.idf_
sklearn_feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = dense.tolist()

sklearn_tfidf = pd.DataFrame(denselist, columns=sklearn_feature_names)

If we print the table produced by scikit-learn we will note that the values are quite different. This is because scikit-learn normalizes the term frequencies using the Euclidean (L2) norm of the word counts of the document,

$
    \text{tf}(w) = \frac{n_w}{\sqrt{\sum_i n_i^2}}
$

and applies a smoothed version of IDF, that is,

$
\text{idf}(w) = \log\left(\frac{M + 1}{k + 1}\right) + 1
$

where w is a word in the corpus dictionary, n_w is the number of times w occurs in the document, M is the number of documents in the corpus, k is the number of documents in which word w appears, and log is the natural logarithm. So, for instance, if we check the TF-IDF table computed using TfidfVectorizer,

In [26]:
sklearn_tfidf

| | fly | plant | reap | time |
|---|---|---|---|---|
| 0 | 0.0 | 0.542701 | 0.542701 | 0.641055 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.861037 | 0.0 | 0.0 | 0.508542 |


So to check the computation performed by scikit-learn we first repeat the same process without the IDF computation and produce the plain TF representation.

In [27]:
vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'),use_idf=False)
vectors = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
dense = vectors.todense()
denselist = dense.tolist()

sklearn_tf = pd.DataFrame(denselist, columns=feature_names)
sklearn_tf

| | fly | plant | reap | time |
|---|---|---|---|---|
| 0 | 0.0 | 0.408248 | 0.408248 | 0.816497 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.707107 | 0.0 | 0.0 | 0.707107 |


We note that we can compute the same values from our original code by applying the Euclidean norm to normalize the word counts:

In [28]:
def compute_tf_euclidean_norm(document_word_count):
    
    document_tf = {}
    
    denominator = math.sqrt(sum(np.power(list(document_word_count.values()),2)))
        
    for word, count in document_word_count.items():        
        document_tf[word] = count / float(denominator)
        
    return document_tf

In [29]:
tf_euclidean_norm = []
for document_word_count in word_count:
    tf_euclidean_norm.append(compute_tf_euclidean_norm(document_word_count))
df_euclidean_norm = create_df(tf_euclidean_norm,dictionary)

# use the same column order
df_euclidean_norm[['fly','plant','reap','time']]

| | fly | plant | reap | time |
|---|---|---|---|---|
| 0 | 0.0 | 0.408248 | 0.408248 | 0.816497 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.707107 | 0.0 | 0.0 | 0.707107 |
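For a single document, the normalization is just a division by the length of its count vector. A minimal sketch in plain Python, using the counts of document 0 (the helper name `l2_normalize` is hypothetical):

```python
import math

def l2_normalize(values):
    """Divide each value by the Euclidean (L2) norm of the whole vector."""
    norm = math.sqrt(sum(v * v for v in values))
    # leave an all-zero vector unchanged to avoid dividing by zero
    return [v / norm for v in values] if norm > 0 else list(values)

doc0_counts = [0, 1, 1, 2]  # fly, plant, reap, time
doc0_tf = l2_normalize(doc0_counts)
```

Here the norm is the square root of 0 + 1 + 1 + 4 = 6, about 2.449, so `plant` and `reap` become 1/2.449 ≈ 0.408 and `time` becomes 2/2.449 ≈ 0.816, matching the first row of the table.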


As noted, applying the Euclidean norm to our word counts produces the same term frequency values generated by scikit-learn. We can now apply the smoothed IDF used by TfidfVectorizer to generate the TF-IDF representation.

In [30]:
def compute_smoothed_idf(df):
    idf = {}

    # number of documents
    M = len(df)
    
    for word in df.columns:
        
        # number of documents in which the word appears
        
        k = sum(df[word]>0.0)
        
        idf[word] = math.log((M+1)/(k+1))+1
        
    return idf

In [31]:
smoothed_idf = compute_smoothed_idf(df_tf)
smoothed_idf

{'time': 1.0,
 'plant': 1.6931471805599454,
 'reap': 1.6931471805599454,
 'fly': 1.6931471805599454}

This produces the same IDF values computed by scikit-learn.

In [32]:
for i,word in enumerate(sklearn_feature_names):
    print(word+"\t"+str(sklearn_idf_values[i]))

fly	1.6931471805599454
plant	1.6931471805599454
reap	1.6931471805599454
time	1.0
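The smoothing can be read as adding one extra document that contains every word, so that no division by zero can occur; the final `+ 1` keeps words that appear in every document (such as `time`) from being zeroed out. A small sketch reproducing the two distinct values by hand, using the natural logarithm (the function and variable names are illustrative):

```python
import math

def smoothed_idf(M, k):
    """Smoothed IDF: ln((M + 1) / (k + 1)) + 1."""
    return math.log((M + 1) / (k + 1)) + 1

idf_rare = smoothed_idf(M=3, k=1)    # plant, reap, fly: ln(4/2) + 1
idf_common = smoothed_idf(M=3, k=3)  # time: ln(4/4) + 1
```

Compared with the course formula, `time` now has weight 1 instead of 0, which is why it no longer disappears from the TF-IDF table.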


We can now combine everything and generate the same TF-IDF representation produced by scikit-learn by
1. multiplying the normalized TF by the smoothed IDF
2. reapplying the L2 normalization to the TF-IDF values computed

So, as a first step, we multiply the L2-normalized TF by the IDF values.

In [33]:
smoothed_tf_idf = df_euclidean_norm.copy()
for word in smoothed_tf_idf.columns:
    smoothed_tf_idf[word] = smoothed_tf_idf[word]*smoothed_idf[word]
smoothed_tf_idf[['fly','plant','reap','time']]

| | fly | plant | reap | time |
|---|---|---|---|---|
| 0 | 0.0 | 0.691224 | 0.691224 | 0.816497 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1.197236 | 0.0 | 0.0 | 0.707107 |


Now we apply L2 normalization to each row, obtaining the same table produced by scikit-learn.

In [34]:
l2_norm_smoothed_tf_idf = smoothed_tf_idf.copy()
for row in range(len(l2_norm_smoothed_tf_idf)):
    denominator = math.sqrt(sum(np.power(l2_norm_smoothed_tf_idf.iloc[row].values,2)))
    l2_norm_smoothed_tf_idf.iloc[row] = l2_norm_smoothed_tf_idf.iloc[row]/denominator
    
l2_norm_smoothed_tf_idf[['fly','plant','reap','time']]

| | fly | plant | reap | time |
|---|---|---|---|---|
| 0 | 0.0 | 0.542701 | 0.542701 | 0.641055 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.861037 | 0.0 | 0.0 | 0.508542 |
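Since scaling a vector by a positive constant does not change its L2-normalized version, the first normalization of the counts cancels out in the final result: the same table can be obtained by multiplying the raw counts by the smoothed IDF and normalizing once. The following plain-Python sketch illustrates this observation; it is not scikit-learn's actual code, `sklearn_like_tfidf` is a hypothetical name, and it assumes that no document is empty.

```python
import math

words = ["fly", "plant", "reap", "time"]
counts = [
    [0, 1, 1, 2],  # 'A time to plant and a time to reap'
    [0, 0, 0, 2],  # 'Time for you and time for me'
    [1, 0, 0, 1],  # 'Fly Time'
]

def sklearn_like_tfidf(counts):
    """Raw counts times smoothed IDF, followed by one L2 normalization per row."""
    M = len(counts)
    n_words = len(counts[0])
    # k[j]: number of documents in which word j appears
    k = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    idf = [math.log((M + 1) / (k_j + 1)) + 1 for k_j in k]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))  # non-zero for non-empty documents
        result.append([v / norm for v in weighted])
    return result

tfidf = sklearn_like_tfidf(counts)
```

Each row of `tfidf` has unit length, which is convenient when documents are later compared with the dot product.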


## Conclusions
Overall, what we have discussed during the lecture is a simplification that uses simpler formulas, but the process is the same as the one used in scikit-learn. If you want to prepare for the exam, try to reproduce the computations shown in the first part of this notebook using TF as a plain count (as in the lecture slides) or as a normalized frequency. The second part, on the scikit-learn implementation, is just for your curiosity :)