# 01-TF-IDF

We will here compute the TF-IDF on a corpus of newspaper headlines.

Begin by importing needed libraries:

In [36]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd
from nltk import PorterStemmer

Import the data into the file *headlines.csv*

In [37]:
# TODO: Load the dataset
headlines_data = pd.read_csv('headlines.csv')
print(headlines_data)

      publish_date                                      headline_text
0         20170721  algorithms can make decisions on behalf of fed...
1         20170721  andrew forrests fmg to appeal pilbara native t...
2         20170721                           a rural mural in thallan
3         20170721  australia church risks becoming haven for abusers
4         20170721  australian company usgfx embroiled in shanghai...
...            ...                                                ...
1994      20170624  constitution avenue wins top prize at act arch...
1995      20170624                         dark mofo numbers crunched
1996      20170624  david petraeus says australia must be firm on ...
1997      20170624  driverless cars in australia face challenge of...
1998      20170624           drug company criticised over price hikes

[1999 rows x 2 columns]


As usual, check the dataset basic information.

In [38]:
# TODO: Have a look at the data
headlines_data.head()

Unnamed: 0,publish_date,headline_text
0,20170721,algorithms can make decisions on behalf of fed...
1,20170721,andrew forrests fmg to appeal pilbara native t...
2,20170721,a rural mural in thallan
3,20170721,australia church risks becoming haven for abusers
4,20170721,australian company usgfx embroiled in shanghai...


We will now perform preprocessing on this text data: tokenization, punctuation and stop words removal and stemming.

Hint: to do so, use NLTK, *pandas*'s method *apply*, lambda functions and list comprehension

In [39]:
# Import necessary libraries
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Initialize Porter stemmer for stemming
stemmer = PorterStemmer()

# Define set of English stop words
stop_words = set(stopwords.words('english'))

def preprocess_document(doc):
    # Tokenize the document
    tokens = word_tokenize(doc)
    
    # Remove punctuation and convert to lower case
    tokens = [word.lower() for word in tokens if word.isalnum()]
    
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    
    # Stem the tokens
    tokens = [stemmer.stem(word) for word in tokens]
    
    return tokens

# Apply preprocessing to the 'headline_text' column
headlines_data['processed_text'] = headlines_data['headline_text'].apply(lambda x: preprocess_document(x))
processed_headline = headlines_data['processed_text']
# Print processed texts in the new dataframe
print(processed_headline)

0         [algorithm, make, decis, behalf, feder, minist]
1       [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                                 [rural, mural, thallan]
3                  [australia, church, risk, becom, abus]
4       [australian, compani, usgfx, embroil, shanghai...
                              ...                        
1994    [constitut, avenu, win, top, prize, act, archi...
1995                         [dark, mofo, number, crunch]
1996    [david, petraeu, say, australia, must, firm, s...
1997    [driverless, car, australia, face, challeng, r...
1998               [drug, compani, criticis, price, hike]
Name: processed_text, Length: 1999, dtype: object


#Compute now the Bag of Words for our data, using scikit-learn.

Warning: since we used our own preprocessing, you have to bypass analyzer with identity function.

In [46]:
# TODO: Compute the BOW of the preprocessed data
from sklearn.feature_extraction.text import CountVectorizer


text_documents = [' '.join(doc) for doc in headlines_data['processed_text']]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(text_documents)
print(bow_matrix)

  (0, 171)	1
  (0, 2307)	1
  (0, 1075)	1
  (0, 436)	1
  (0, 1439)	1
  (0, 2443)	1
  (1, 211)	1
  (1, 1528)	1
  (1, 1501)	1
  (1, 240)	1
  (1, 2816)	1
  (1, 2547)	1
  (1, 3842)	1
  (1, 3263)	1
  (2, 3268)	1
  (2, 2518)	1
  (2, 3784)	1
  (3, 330)	1
  (3, 788)	1
  (3, 3209)	1
  (3, 427)	1
  (3, 92)	1
  (4, 332)	1
  (4, 872)	1
  (4, 4013)	1
  :	:
  (1995, 1044)	1
  (1995, 2628)	1
  (1995, 2465)	1
  (1995, 1000)	1
  (1996, 330)	1
  (1996, 775)	1
  (1996, 3310)	1
  (1996, 3524)	1
  (1996, 3337)	1
  (1996, 1476)	1
  (1996, 2528)	1
  (1996, 1057)	1
  (1996, 2793)	1
  (1997, 330)	1
  (1997, 2939)	1
  (1997, 1398)	1
  (1997, 743)	1
  (1997, 690)	1
  (1997, 1223)	1
  (1997, 3246)	1
  (1998, 872)	1
  (1998, 1229)	1
  (1998, 2921)	1
  (1998, 1813)	1
  (1998, 987)	1


You can check the shape of the BOW, the expected value is `(1999, 4165)`.

Now compute the Term Frequency and then the Inverse Document Frequency, and check the values are not only zeros.

In [48]:
# TODO: Compute the TF using the BOW

# Compute the TF matrix
tf_matrix = np.zeros_like(bow_matrix, dtype=float)

# Iterate over each document
for i in range(bow_matrix.shape[0]):
    # Total number of terms in the current document
    total_terms = np.sum(bow_matrix[i])
    
    # Compute TF for each term in the document
    for j in range(bow_matrix.shape[1]):
        tf_matrix[i, j] = bow_matrix[i, j] / total_terms if total_terms != 0 else 0

# Convert TF matrix to DataFrame for better visualization
tf_df = pd.DataFrame(tf_matrix, columns=vectorizer.get_feature_names_out())

# Print the TF DataFrame
print("Term Frequency (TF) DataFrame:")
print(tf_df)

IndexError: too many indices for array: array is 0-dimensional, but 2 were indexed

In [49]:
# TODO: Compute the IDF
from sklearn.feature_extraction.text import TfidfTransformer

# Initialize TfidfTransformer
tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True)

# Compute IDF from BoW matrix
idf_values = tfidf_transformer.fit_transform(bow_matrix.values)

# Convert IDF matrix to DataFrame for better visualization (optional)
idf_dataframe = pd.DataFrame(idf_values.toarray(), columns=bow_matrix.columns)

# Print IDF DataFrame
print(idf_dataframe)

AttributeError: values not found

Compute finally the TF-IDF.

In [51]:
# TODO: compute the TF-IDF
# Compute TF-IDF
tfidf_matrix = bow_matrix.values * idf_dataframe.values

# Convert TF-IDF matrix to DataFrame for better visualization (optional)
tfidf_dataframe = pd.DataFrame(tfidf_matrix, columns=bow_matrix.columns)

# Print TF-IDF DataFrame
print(tfidf_dataframe)

AttributeError: values not found

What are the 10 words with the highest and lowest TF-IDF on average?

In [50]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
# Compute average TF-IDF across all documents
avg_tfidf = np.mean(tfidf_dataframe, axis=0)

# Create a DataFrame for better visualization
avg_tfidf_df = pd.DataFrame({'word': tfidf_dataframe.columns, 'avg_tfidf': avg_tfidf})

# Sort by average TF-IDF in descending order to get top 10 highest
top_10_highest_tfidf = avg_tfidf_df.sort_values(by='avg_tfidf', ascending=False).head(10)

# Sort by average TF-IDF in ascending order to get top 10 lowest
top_10_lowest_tfidf = avg_tfidf_df.sort_values(by='avg_tfidf', ascending=True).head(10)

# Print the results
print("Top 10 words with highest average TF-IDF:")
print(top_10_highest_tfidf)

print("\nTop 10 words with lowest average TF-IDF:")
print(top_10_lowest_tfidf)

Top 10 words with highest average TF-IDF:
  word  avg_tfidf
         2.228225
a    a   1.085073
e    e   0.979820
r    r   0.840612
i    i   0.748304
t    t   0.596771
n    n   0.592459
o    o   0.565829
s    s   0.457424
l    l   0.452727

Top 10 words with lowest average TF-IDF:
  word  avg_tfidf
5    5   0.003974
6    6   0.003982
8    8   0.004155
4    4   0.004432
3    3   0.004779
9    9   0.005889
7    7   0.006320
2    2   0.010632
1    1   0.012802
z    z   0.016472


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [10]:
# TODO: Compute the TF-IDF using scikit learn
# Import the module

# Instantiate the TF-IDF vectorizer

# Compute the TF-IDF


Compare the 10 highest and lowest TF-IDF words on average to the ones you had by yourself.

In [11]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


lowest words: coll     0.305258
gw       0.305258
nmfc     0.305258
adel     0.305258
melb     0.305258
syd      0.305258
haw      0.305258
geel     0.305258
gcfc     0.305258
fabio    0.322574
dtype: float64
highest words: mosul        0.779137
rig          0.786813
travel       0.788050
aquapon      0.794899
date         0.794899
employ       0.795060
financ       0.803629
mongolian    0.831769
pump         1.000000
peacemak     1.000000
dtype: float64


Do you have the same words? How do you explain it?