# Practical 8 TF-IDF (Update 1: 22/8/2023)

## Why TF-IDF

TF-IDF scores how important a word is to a document, based on how often the word appears in that document and on how many documents in the corpus contain it.

- If the word appears frequently in a document - assign a high score to that word (term frequency - TF)

- If the word appears in a lot of documents - assign a low score to that word (inverse document frequency - IDF)

<img src="http://www.sefidian.com/wp-content/ql-cache/quicklatex.com-f33b8ece29548d38cf0d06e877c46dd5_l3.svg" alt="1cd646-b0646eda5097486f9cdbf3080798669d-mv2" border="0">

<img src="http://www.sefidian.com/wp-content/ql-cache/quicklatex.com-e06efbe37aaf1d31c1bf0ee44d774b9a_l3.svg" alt="1cd646-b0646eda5097486f9cdbf3080798669d-mv2" border="0">

The TF-IDF of a term is calculated by multiplying TF and IDF scores.

<img src="http://www.sefidian.com/wp-content/ql-cache/quicklatex.com-4c290738055303eedfd97f880db7f7d8_l3.svg" alt="1cd646-b0646eda5097486f9cdbf3080798669d-mv2" border="0">

Translated into plain English, the importance of a term is high when it occurs a lot in a given document and rarely in others. In short, commonality within a document measured by TF is balanced by rarity between documents measured by IDF.

The resulting TF-IDF score reflects the importance of a term for a document in the corpus.
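
For reference, the definitions that the code in this practical implements can be written out as below. This is a sketch that matches the code further down (relative frequency for TF, base-10 logarithm for IDF); the formula images above may use a different logarithm base or notation. The symbols $f_{t,d}$, $N$ and $\mathrm{df}(t)$ are introduced here only for illustration.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$.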

<img src="https://img.picturequotes.com/2/2/1874/less-is-only-more-where-more-is-no-good-quote-1.jpg" alt="1cd646-b0646eda5097486f9cdbf3080798669d-mv2" border="0">

Let's start to implement TF-IDF!

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords

First, let’s construct a small corpus.

In [None]:
corpus = ['data science is one of the most important fields of science', # Doc 1
          'this is one of the best data science courses',                # Doc 2
          'RDS produces data scientist who can analyse data' ]           # Doc 3

In [None]:
# Debug It!: We can enhance this TF-IDF by REMOVING stopwords from the corpus
# However, this section of code comes with some bugs.
# Let's try to fix it! (Tips: There are 3 bugs here...😁)

tokenized_corpus = [doc.split() for doc in corpus]
stop_words = set(stopwords.words('ENG'))
# Remove stopwords from each document
filtered_corpus = []
for doc_tokens in tokenized_corpus:
    filtered_tokens = [token for token in doc_tokens if token.lower() in stop_words]
    filtered_corpus.appends(' '.join(filtered_tokens))
    
print(filtered_corpus)

Next, we’ll create a word set for the corpus:

In [None]:
words_set = set()
 
for doc in corpus:
    words = doc.split(' ')
    words_set = words_set.union(set(words))
     
# The number of unique words is the number of 'dimensions' available for this set of docs
print('Number of words in the corpus:', len(words_set))
print('The words in the corpus: \n', words_set)

# Computing Term Frequency

Now we can create a DataFrame with one row per document and one column per word in the word set, and fill it with the term frequency (TF) of each word in each document:

In [None]:
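n_docs = len(corpus)         # Number of documents in the corpus
n_words_set = len(words_set) # Number of unique words in the corpus
 
# One row per document, one column per word.
# list(...) is used because newer pandas versions do not accept a set as columns.
df_tf = pd.DataFrame(np.zeros((n_docs, n_words_set)), columns=list(words_set))
 
# Compute Term Frequency (TF)
for i in range(n_docs):
    words = corpus[i].split(' ') # Words in the document
    for w in words:
        # .loc avoids chained assignment (df_tf[w][i] = ...), which may not update the DataFrame
        df_tf.loc[i, w] += 1 / len(words)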
         
df_tf
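
As a quick sanity check on the table above, the term frequencies of a single document can be recomputed by hand. The sketch below uses only the standard library and Doc 1 from the corpus; the names `doc_1`, `tokens`, `counts` and `tf_doc_1` are illustrative and not part of the practical's code.

```python
from collections import Counter

doc_1 = 'data science is one of the most important fields of science'

tokens = doc_1.split(' ')
counts = Counter(tokens)  # raw count of each word in Doc 1

# Term frequency = occurrences of the word / total number of words in the document
tf_doc_1 = {word: count / len(tokens) for word, count in counts.items()}

print('Words in Doc 1:', len(tokens))
print('TF of "science":', tf_doc_1['science'])  # appears twice
print('TF of "data":', tf_doc_1['data'])        # appears once
```

Because "science" occurs twice among the 11 words of Doc 1, its TF is 2/11, which should match the first row of the `science` column in `df_tf`.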

# Computing Inverse Document Frequency

Now, we’ll compute the inverse document frequency (IDF):

In [None]:
print("IDF of: ")
 
idf = {}
 
for w in words_set:
    k = 0    # number of documents in the corpus that contain this word
     
    for i in range(n_docs):
        if w in corpus[i].split():
            k += 1
             
    idf[w] =  np.log10(n_docs / k)
     
    print(f'{w:>15}: {idf[w]:>10}' )
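
To see where the printed values come from, the IDF of a few words can be worked out one at a time. This is a minimal standard-library sketch using the same base-10 formula as above; `doc_freq` is a hypothetical helper introduced only for this check.

```python
import math

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'RDS produces data scientist who can analyse data']

def doc_freq(word, docs):
    """Number of documents that contain the word at least once."""
    return sum(word in doc.split() for doc in docs)

n_docs = len(corpus)
for word in ['data', 'science', 'RDS']:
    k = doc_freq(word, corpus)
    print(f'{word:>10}: in {k} of {n_docs} docs, IDF = {math.log10(n_docs / k):.4f}')
```

"data" is in all three documents, so its IDF is log10(3/3) = 0. "science" is in two documents (Doc 3 contains "scientist", which is a different token), and "RDS" is in only one, so it gets the largest IDF of the three.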

# Putting it Together: Computing TF-IDF

Since we have TF and IDF now, we can compute TF-IDF:

In [None]:
df_tf_idf = df_tf.copy()
 
for w in words_set:
    for i in range(n_docs):
        # .loc writes directly into the DataFrame instead of relying on chained assignment
        df_tf_idf.loc[i, w] = df_tf.loc[i, w] * idf[w]
         
df_tf_idf
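
The three steps above (TF, IDF, then their product) can also be condensed into one small function. The following is only a sketch using the standard library, not a replacement for the DataFrame version; `tf_idf` is a hypothetical helper name, and it returns one dictionary of scores per document rather than a table.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document (illustrative helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # TF (relative frequency) multiplied by IDF (base-10 log)
        scores.append({
            word: (count / len(tokens)) * math.log10(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

corpus = ['data science is one of the most important fields of science',
          'this is one of the best data science courses',
          'RDS produces data scientist who can analyse data']

scores = tf_idf(corpus)
# Print the words of Doc 3, highest score first
for word, score in sorted(scores[2].items(), key=lambda kv: kv[1], reverse=True):
    print(f'{word:>15}: {score:.4f}')
```

The scores should agree with the corresponding cells of `df_tf_idf`; words that do not occur in a document are simply absent from that document's dictionary instead of being stored as 0.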

Notice that “data” has an IDF of 0 because it appears in every document. As a result, its TF-IDF score is 0 in every document, so “data” is not considered to be an important term in this corpus. This will change slightly in the following sklearn implementation, where the score for “data” will be non-zero.
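
One reason for the difference is that, with its default settings, scikit-learn uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, which never reaches zero. The sketch below compares the two formulas using only the standard library; `plain_idf` and `smooth_idf` are hypothetical helper names, not sklearn functions. By default sklearn also lowercases and tokenizes the text differently and L2-normalizes each document vector, so its final scores will differ from this practical's values in other ways too.

```python
import math

def plain_idf(n_docs, doc_freq):
    """IDF as computed in this practical: log10(N / df)."""
    return math.log10(n_docs / doc_freq)

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF (sklearn's default formula): ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# "data" appears in all 3 documents of the corpus
print('plain IDF of "data": ', plain_idf(3, 3))
print('smooth IDF of "data":', smooth_idf(3, 3))
```

With the plain formula a word found in every document scores 0, while the smoothed formula gives it the minimum value of 1, so it still contributes through its term frequency.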