# Understanding TF-IDF

In this section we work with texts and derive weighted metrics based on word (or term) frequencies within those texts. More precisely, we look at the _TF-IDF_ metric, which stands for _Term Frequency-Inverse Document Frequency_, to measure how important certain words are in the documents of our IMDb corpus. The "texts" or "documents" we look at are the 'Plot' descriptions in the IMDb dataset.

In [2]:
import pandas as pd
import nltk
import string
from nltk import word_tokenize
from nltk.corpus import stopwords


## Loading the IMDb dataset

Load the IMDb dataset and look closely at the 'Plot' column

In [3]:
df = pd.read_csv(r"C:\Users\Gebruiker\Documents\CODE\Master\Personalisation\INFOMPPM_local\Week 05\data\imdb.csv")

In [3]:
df.shape

(5380, 27)

In [4]:
df.head(1)

Unnamed: 0,Title,Year,Rated,Released,Runtime,Genre,Director,Writer,Actors,Plot,...,imdbRating,imdbVotes,imdbID,Type,DVD,BoxOffice,Production,Website,Response,tomatoURL
0,The Player,2015,TV-14,24 Sep 2015,60 min,"Action, Crime, Drama",,"John Fox, John Rogers","Philip Winchester, Damon Gupton, Wesley Snipes...","A former intelligence and FBI officer, who now...",...,7.1,6331,tt4474310,series,,,,,True,


In [4]:
stop_words = stopwords.words('english')

def preprocess(df, column):
    df["preprocessed"] = column.str.lower()  # Lowercase
    df["preprocessed"] = df["preprocessed"].str.translate(str.maketrans("", "", string.digits))  # Remove digits
    df["preprocessed"] = df["preprocessed"].str.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    df["preprocessed"] = df["preprocessed"].str.strip()  # Remove leading and trailing whitespace
    df["preprocessed"] = df["preprocessed"].apply(lambda x: " ".join(word for word in str(x).split() if word not in stop_words))  # Remove stop words
    return df

preprocess(df,df["Plot"])


pd.options.display.max_colwidth = 1000 
df[["preprocessed","Plot"]].head()

Unnamed: 0,preprocessed,Plot
0,former intelligence fbi officer works security expert las vegas recruited mysterious pit boss mr johnson based tips try prevent crimes rich place bets chances success,"A former intelligence and FBI officer, who now works as a security expert in Las Vegas, is recruited by mysterious pit boss Mr. Johnson to, based on his tips, try to prevent crimes, while the rich place bets on his chances of success."
1,bus driver sewer worker friend struggle strike rich wives look weary patience,A bus driver and his sewer worker friend struggle to strike it rich while their wives look on with weary patience.
2,misadventures misfit pt boat crew world war ii,The misadventures of a misfit PT Boat crew during World War II.
3,witch married ordinary man cannot resist using magic powers solve problems family faces,A witch married to an ordinary man cannot resist using her magic powers to solve the problems her family faces.
4,staff army hospital korean war find laughter best way deal situation,The staff of an army hospital in the Korean war find that laughter is the best way to deal with their situation.


## Create a data structure

We need a custom data structure to carry out our TF-IDF calculations. Create a Python dictionary whose keys are the indices of the dataframe above and whose values are themselves dictionaries, each with a 'plot' entry for the corresponding row.

In [5]:
plot_dict = {}
for row, plot in enumerate(df["preprocessed"]):
    plot_dict.update({row : plot})
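
The cell above stores each plot string directly as the value. A minimal sketch of the nested layout that the description asks for might look like the following; `toy_plots` and `nested_plot_dict` are illustrative names standing in for `df["preprocessed"]` and the custom dictionary, and the whitespace split is only a placeholder for a real tokenizer.

```python
# Toy stand-in for df["preprocessed"] (assumption: illustrative data only)
toy_plots = [
    "misadventures misfit pt boat crew world war ii",
    "bus driver sewer worker friend struggle strike rich",
]

# One inner dictionary per row, so tokens, counts and weights can be added later
nested_plot_dict = {row: {"plot": plot} for row, plot in enumerate(toy_plots)}

# Augmenting every entry, here with whitespace-split tokens as a placeholder
for entry in nested_plot_dict.values():
    entry["tokens"] = entry["plot"].split()

print(nested_plot_dict[0]["tokens"][:3])  # ['misadventures', 'misfit', 'pt']
```

With this layout, later steps can add further keys to each inner dictionary without overwriting the original plot text.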

## Tokenize and filter

Now that we have the plot of each IMDb entry in our dictionary, it is time to tokenize each plot's text and clean it up. Do we need punctuation as part of our tokens? Are there "stop words" we could get rid of? Please complete the following tokenizer function utilising the spacy library (which you used in Data Mining; remember to uncomment the first line if you are using spacy for the first time). Note that the cells below use NLTK's `word_tokenize` rather than spacy. When this is done, augment your custom dictionary with the plot's tokens for each entry.

In [24]:
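import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
lemmatizer = nltk.stem.WordNetLemmatizer()  # available for optional lemmatisation; not used below

def split_and_stop(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Keep tokens that contain a letter or digit (drops pure punctuation)
    # and that are not in the stop word list defined earlier
    return [token for token in tokens
            if any(ch.isalnum() for ch in token) and token.lower() not in stop_words]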

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gebruiker\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
t = "Hello, This is a #test"

word_tokenize(t) 


['Hello', ',', 'This', 'is', 'a', '#', 'test']

In [27]:
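for row, plot in plot_dict.items():
    # Tokenize only entries that are still strings, so that re-running the cell
    # does not pass an already tokenized list to word_tokenize
    if isinstance(plot, str):
        plot_dict[row] = word_tokenize(plot)

plot_dict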

{0: ['former',
  'intelligence',
  'fbi',
  'officer',
  'works',
  'security',
  'expert',
  'las',
  'vegas',
  'recruited',
  'mysterious',
  'pit',
  'boss',
  'mr',
  'johnson',
  'based',
  'tips',
  'try',
  'prevent',
  'crimes',
  'rich',
  'place',
  'bets',
  'chances',
  'success'],
 1: ['bus',
  'driver',
  'sewer',
  'worker',
  'friend',
  'struggle',
  'strike',
  'rich',
  'wives',
  'look',
  'weary',
  'patience'],
 2: ['misadventures', 'misfit', 'pt', 'boat', 'crew', 'world', 'war', 'ii'],
 3: ['witch',
  'married',
  'ordinary',
  'man',
  'can',
  'not',
  'resist',
  'using',
  'magic',
  'powers',
  'solve',
  'problems',
  'family',
  'faces'],
 4: ['staff',
  'army',
  'hospital',
  'korean',
  'war',
  'find',
  'laughter',
  'best',
  'way',
  'deal',
  'situation'],
 5: ['veteran',
  'cop',
  'years',
  'experience',
  'teamed',
  'young',
  'inspector',
  'solve',
  'crimes',
  'san',
  'francisco'],
 6: ['four',
  'vietnam',
  'vets',
  'framed',
  'crime

## Understanding Term Frequency (TF)

$$
tf(t, d) = \frac{n_{t}} {\sum_{k} n_{k}}
$$

_Term Frequency_ is a normalised metric that measures how frequent a certain term $t$ is in a given document $d$. In the formula above, $n_{t}$ stands for the number of times the term $t$ occurs in document $d$, while $\sum_{k} n_{k}$ is the total number of term occurrences in the document (in other words, its length in tokens). Because a longer document gives every term more opportunities to occur, the raw count is normalised by the document length. Below is a function definition `calculate_tf` which takes as input the `tokens` of a certain document $d$, counts the number of occurrences of each term in the document, and calculates their normalised frequency.

In [7]:
def calculate_tf(tokens):
    unique_tokens = set(tokens)
    term_count = dict.fromkeys(unique_tokens, 0)
    term_frequency = dict.fromkeys(unique_tokens, 0)
    N = float(len(tokens))
    for term in tokens:
        term_count[term] += 1
        term_frequency[term] += 1 / N
    return term_count, term_frequency        

Considering the function `calculate_tf` above, augment your custom dictionary with both the `term_count` and normalised `term_frequency` given the respective plot's `tokens` you previously computed.

In [9]:
test = ['former',
  'fbi',  
  'fbi',
  'officer',
  'works',
  'security',
  'expert',
  'las',
  'vegas',
  'recruited',
  'mysterious',
  'pit',
  'boss',
  'mr',
  'johnson',
  'based',
  'tips',
  'try',
  'prevent',
  'crimes',
  'rich',
  'place',
  'bets',
  'chances',
  'success']

calculate_tf(test)

({'tips': 1,
  'based': 1,
  'works': 1,
  'bets': 1,
  'place': 1,
  'fbi': 2,
  'mysterious': 1,
  'johnson': 1,
  'recruited': 1,
  'try': 1,
  'former': 1,
  'success': 1,
  'las': 1,
  'pit': 1,
  'security': 1,
  'officer': 1,
  'boss': 1,
  'prevent': 1,
  'chances': 1,
  'mr': 1,
  'expert': 1,
  'crimes': 1,
  'vegas': 1,
  'rich': 1},
 {'tips': 0.04,
  'based': 0.04,
  'works': 0.04,
  'bets': 0.04,
  'place': 0.04,
  'fbi': 0.08,
  'mysterious': 0.04,
  'johnson': 0.04,
  'recruited': 0.04,
  'try': 0.04,
  'former': 0.04,
  'success': 0.04,
  'las': 0.04,
  'pit': 0.04,
  'security': 0.04,
  'officer': 0.04,
  'boss': 0.04,
  'prevent': 0.04,
  'chances': 0.04,
  'mr': 0.04,
  'expert': 0.04,
  'crimes': 0.04,
  'vegas': 0.04,
  'rich': 0.04})
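
To augment the dictionary as asked above, one possible sketch is shown below. It assumes a nested layout in which each entry holds a `"tokens"` list; `toy_dict` is an illustrative stand-in for the custom dictionary, and `calculate_tf` is repeated from the cell above only so that the block runs on its own.

```python
# calculate_tf is repeated from the cell above so that this sketch runs on its own
def calculate_tf(tokens):
    unique_tokens = set(tokens)
    term_count = dict.fromkeys(unique_tokens, 0)
    term_frequency = dict.fromkeys(unique_tokens, 0)
    N = float(len(tokens))
    for term in tokens:
        term_count[term] += 1
        term_frequency[term] += 1 / N
    return term_count, term_frequency

# Toy dictionary (assumption: nested layout with a "tokens" list per document)
toy_dict = {
    0: {"tokens": ["war", "crew", "war", "boat"]},
    1: {"tokens": ["witch", "magic"]},
}

# Store both results next to the tokens of each document
for entry in toy_dict.values():
    entry["term_count"], entry["term_frequency"] = calculate_tf(entry["tokens"])

print(toy_dict[0]["term_count"]["war"], toy_dict[0]["term_frequency"]["war"])  # 2 0.5
```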

## Understanding Inverse Document Frequency (IDF)

$$
idf(t, D) = \log\frac{|D|}{|\{d_{i} \in D : t \in d_{i}\}|}
$$

_Inverse Document Frequency_ is a metric that measures how important a term $t$ is in a given corpus (or collection) $D$ of documents $d_{i}$. While _Term Frequency_ measures the frequency of a term $t$ in a single document $d$, _IDF_ considers how widespread a term $t$ is over the whole corpus $D$, so as to derive a weight for the statistical significance of term $t$ overall. The idea is that common words which occur in many documents ("man" or a stop word like "it" for example) hold little importance overall, as they do little to distinguish one document from another. What _IDF_ does is give more weight to words that are uncommon overall yet possibly significant for certain documents. This is why the metric takes the $\log$ of the fraction $\frac{|D|}{|\{d_{i} \in D : t \in d_{i}\}|}$, where $|D|$ is the number of documents in corpus $D$ and $|\{d_{i} \in D : t \in d_{i}\}|$ is the number of documents in the corpus in which the term $t$ appears.
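
As a rough illustration, assume the natural logarithm and a corpus of $|D| = 5380$ documents (the number of rows reported by `df.shape`). The document counts 1, 538 and 5380 below are hypothetical, chosen only to show the range of the weight for a term found in a single plot, in a tenth of the plots, and in every plot:

$$
\log\frac{5380}{1} \approx 8.59, \qquad \log\frac{5380}{538} = \log 10 \approx 2.30, \qquad \log\frac{5380}{5380} = 0
$$

A term that appears in every document therefore gets a weight of zero, while the rarest terms get the largest weights.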

The first thing we need to do to calculate _IDF_ is to establish the overall vocabulary of the entire corpus. What are all the unique words (or terms) in all of our plots? How many unique words do we have? Consider the following `bag_of_words` python set and fill it with all the unique terms present in our plots. 

In [None]:
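# Vocabulary -> bag of words

bag_of_words = set()

# Assumes plot_dict maps each row index to its list of tokens, as built above
for tokens in plot_dict.values():
    bag_of_words.update(tokens)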

Now, remember we calculated a `term_count` for each term in each document when we calculated the _TF_ with `calculate_tf` above? We need to use this pre-calculated information here to derive $|\{d_{i} \in D : t \in d_{i}\}|$, which is the number of documents in the corpus in which a term $t$ appears. Make a list of each `term_count` you recorded in your custom dictionary, so that you can use it to compute the _IDF_ below.

In [None]:
list_all_documents_count = # code goes here
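
One way this list could be built is sketched below, assuming a nested layout in which each entry of the custom dictionary carries a `"term_count"` dictionary; `toy_dict` and `document_frequency` are illustrative names. The example also shows that the document frequency of a term differs from its total number of occurrences.

```python
# Toy dictionary (assumption: nested layout with a "term_count" entry per document)
toy_dict = {
    0: {"term_count": {"war": 2, "crew": 1}},
    1: {"term_count": {"war": 1, "witch": 1}},
}

# One term_count dictionary per document, in dictionary order
list_all_documents_count = [entry["term_count"] for entry in toy_dict.values()]

# Document frequency of "war": the number of documents whose count for it is positive
document_frequency = sum(
    1 for counts in list_all_documents_count if counts.get("war", 0) > 0
)
print(document_frequency)  # 2
```

Here "war" occurs three times in total but in only two documents, and it is the latter number that enters the _IDF_ formula.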

Here is a function definition `calculate_idf` that computes the _IDF_ of all the terms in our corpus. It takes a list of `term_count` dictionaries as `documents_count_list` and the overall vocabulary as `bag_of_words`. Can you make sense of the function in light of the $idf(t, D)$ formula above?

In [None]:
import math

def calculate_idf(documents_count_list, bag_of_words):
    
    idf = dict.fromkeys(bag_of_words, 0)
    D = len(documents_count_list)
    
    for d in documents_count_list:
        for term, count in d.items():
            if count > 0:
                idf[term] += 1
                
    for term, document_count in idf.items():
        idf[term] = math.log(D / float(document_count))
        
    return idf

Let's calculate the _IDF_ using the function above. Which terms have the highest weights? Which have the lowest?

In [None]:
# code goes here
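
A possible way to answer these questions is to sort the terms by weight. The sketch below runs on a three-document toy corpus; `toy_counts`, `toy_vocabulary` and `ranked` are illustrative names, and `calculate_idf` is repeated from the cell above only so that the block runs on its own.

```python
import math

# calculate_idf is repeated from the cell above so that this sketch runs on its own
def calculate_idf(documents_count_list, bag_of_words):
    idf = dict.fromkeys(bag_of_words, 0)
    D = len(documents_count_list)
    for d in documents_count_list:
        for term, count in d.items():
            if count > 0:
                idf[term] += 1
    for term, document_count in idf.items():
        idf[term] = math.log(D / float(document_count))
    return idf

# Toy term counts for three documents (assumption: illustrative data only)
toy_counts = [
    {"war": 2, "crew": 1},
    {"war": 1, "witch": 1},
    {"war": 1, "crew": 1, "magic": 1},
]
toy_vocabulary = set().union(*toy_counts)  # all terms that occur in any document
toy_idf = calculate_idf(toy_counts, toy_vocabulary)

# Sort from highest to lowest weight
ranked = sorted(toy_idf.items(), key=lambda item: item[1], reverse=True)
print(ranked[0])   # one of the terms found in a single document
print(ranked[-1])  # ('war', 0.0)
```

The term found in every document ends up with weight 0, while terms found in a single document share the highest weight, $\log 3$.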

## Putting it together: TF-IDF

$$
\text{tf-idf}(t, d, D) = tf(t, d) \cdot idf(t, D)
$$

Putting _TF_ and _IDF_ together is quite simple. Since _IDF_ is a weight for each term in the corpus, simply multiply each term's _TF_ in a document by that term's _IDF_ weight. Here is a function `calculate_tf_idf` that does just that!

In [None]:
def calculate_tf_idf(tf, idf):
    tf_idf = dict.fromkeys(tf.keys(), 0)
    for term, frequency in tf.items():
        tf_idf[term] = frequency * idf[term]
    return tf_idf    

With the function above, calculate the _TF-IDF_ of all plots in your custom dictionary and record the results in the dictionary itself. 

In [None]:
# code goes here
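
A minimal sketch of this step is given below, assuming a nested layout in which each entry carries a `"term_frequency"` dictionary. `toy_dict` is an illustrative name, the values in `toy_idf` are hypothetical weights chosen for easy arithmetic, and `calculate_tf_idf` is repeated from the cell above only so that the block runs on its own.

```python
# calculate_tf_idf is repeated from the cell above so that this sketch runs on its own
def calculate_tf_idf(tf, idf):
    tf_idf = dict.fromkeys(tf.keys(), 0)
    for term, frequency in tf.items():
        tf_idf[term] = frequency * idf[term]
    return tf_idf

# Hypothetical IDF weights, not computed from real data
toy_idf = {"war": 0.0, "crew": 0.5, "witch": 1.0}

# Toy dictionary (assumption: nested layout with a "term_frequency" entry per document)
toy_dict = {
    0: {"term_frequency": {"war": 0.5, "crew": 0.5}},
    1: {"term_frequency": {"war": 0.5, "witch": 0.5}},
}

# Record the TF-IDF weights next to the term frequencies of each document
for entry in toy_dict.values():
    entry["tf_idf"] = calculate_tf_idf(entry["term_frequency"], toy_idf)

print(toy_dict[1]["tf_idf"])  # {'war': 0.0, 'witch': 0.5}
```

Although "war" and "witch" are equally frequent in document 1, only "witch" keeps a non-zero weight, because "war" carries an _IDF_ of zero in this toy setup.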

What is the difference between _TF_ and _IDF_ for a given plot?

In [None]:
# code goes here

## Save the data

Save the custom dictionary you constructed above to a JSON file.

In [None]:
import json

# code goes here
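
The saving step could be sketched as follows. `toy_dict` and the file name `plots_tf_idf.json` are hypothetical, and a temporary directory is used here only to keep the example free of side effects.

```python
import json
import os
import tempfile

# Toy dictionary (assumption: illustrative data only)
toy_dict = {0: {"tokens": ["war", "crew"], "tf_idf": {"war": 0.0, "crew": 0.25}}}

# Hypothetical file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "plots_tf_idf.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(toy_dict, f, indent=2)

# Read the file back to check what was stored
with open(path, encoding="utf-8") as f:
    reloaded = json.load(f)

# JSON object keys are always strings, so the integer key 0 comes back as "0"
print(list(reloaded.keys()))  # ['0']
```

Because JSON turns the integer row indices into strings, they need to be converted back with `int` after loading if integer keys are required. Python sets are not JSON serialisable, so a vocabulary set would have to be converted to a list before being stored.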