# 01-TF-IDF

We will compute the TF-IDF on a corpus of newspaper headlines.

Begin by importing needed libraries:

In [1]:
import string

import nltk  # needed for the nltk.download calls below
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

Import the data from the file *headlines.csv*.

In [36]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv')
df.head()

Unnamed: 0,publish_date,headline_text
0,20170721,algorithms can make decisions on behalf of fed...
1,20170721,andrew forrests fmg to appeal pilbara native t...
2,20170721,a rural mural in thallan
3,20170721,australia church risks becoming haven for abusers
4,20170721,australian company usgfx embroiled in shanghai...


As usual, check the dataset's basic information.

In [37]:
# TODO: Have a look at the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


We will now preprocess this text data: tokenization, punctuation and stop-word removal, and stemming.

Hint: to do so, use NLTK, the *pandas* method *apply*, lambda functions, and list comprehensions.

In [38]:
# TODO: Perform preprocessing

# Build these once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())
    # Remove punctuation tokens
    tokens = [token for token in tokens if token not in string.punctuation]
    # Remove stop words
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Stem each remaining token
    tokens = [STEMMER.stem(token) for token in tokens]
    return tokens

In [39]:
# Note: from here on, df is a Series of token lists (one list per headline)
df = df['headline_text'].apply(lambda x: preprocess_text(x))
df.head()

0      [algorithm, make, decis, behalf, feder, minist]
1    [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                              [rural, mural, thallan]
3               [australia, church, risk, becom, abus]
4    [australian, compani, usgfx, embroil, shanghai...
Name: headline_text, dtype: object

Compute now the Bag of Words for our data, using scikit-learn.

Warning: since we did our own preprocessing, you have to bypass the vectorizer's analyzer by passing an identity function.

In [40]:
# TODO: Compute the BOW of the preprocessed data
from sklearn.feature_extraction.text import CountVectorizer

identity_function = lambda x: x
vectorizer = CountVectorizer(analyzer=identity_function)
bow_matrix = vectorizer.fit_transform(df)

You can check the shape of the BOW; the expected value is `(1999, 4165)`.

Now compute the Term Frequency and then the Inverse Document Frequency, and check that the values are not all zeros.

In [41]:
# TODO: Compute the TF using the BOW
from sklearn.feature_extraction.text import TfidfTransformer

# Work on the whole BOW matrix: each headline is one document
# (df[1] is only the token list of the second headline).
# With use_idf=False and norm='l1', each row holds counts divided by the document length.
tf_transformer = TfidfTransformer(use_idf=False, norm='l1')
tf_matrix = tf_transformer.fit_transform(bow_matrix)

print("Shape of TF matrix:", tf_matrix.shape)
print("Number of non-zero TF values:", tf_matrix.nnz)


In [42]:
# TODO: Compute the IDF
from sklearn.feature_extraction.text import TfidfTransformer

# Fit on the whole BOW matrix so that document frequencies are counted across all headlines
idf_transformer = TfidfTransformer(use_idf=True)
idf_transformer.fit(bow_matrix)

print("Inverse Document Frequency (IDF):")
print(idf_transformer.idf_)


Finally, compute the TF-IDF.
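The following sketch shows how the pieces combine, on hypothetical counts. It assumes scikit-learn's default settings (raw counts as TF, smoothed IDF, L2 normalization of each row) and checks a hand-written version against `TfidfTransformer`; it is an illustration of those defaults, not the only way to define TF-IDF.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical bag-of-words counts: 3 documents x 4 terms
toy_bow = np.array([
    [2, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 3],
], dtype=float)

# Smoothed IDF, as in scikit-learn's default configuration
n_docs = toy_bow.shape[0]
doc_freq = (toy_bow > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm
weighted = toy_bow * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Reference result from scikit-learn with default parameters
sk_tfidf = TfidfTransformer().fit_transform(toy_bow).toarray()

print(np.allclose(manual_tfidf, sk_tfidf))
```

If the two results differ on your own data, the normalization and the IDF smoothing are the first things to compare.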

In [48]:
# TODO: compute the TF-IDF
# The documents are already token lists, so bypass the analyzer as for the BOW
tfidf_vectorizer = TfidfVectorizer(use_idf=True, analyzer=identity_function)

tfidf_matrix = tfidf_vectorizer.fit_transform(df)

print("Shape of TF-IDF matrix:", tfidf_matrix.shape)
print("TF-IDF matrix:")
print(tfidf_matrix)

What are the 10 words with the highest and lowest TF-IDF on average?
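One detail worth knowing before ranking: the mean of a SciPy sparse matrix along an axis comes back as a 2-D object, so it has to be flattened to a 1-D array before it can be used with `np.argsort` as a plain index. A minimal sketch on hypothetical values (the `toy_*` names are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical TF-IDF values: 3 documents x 3 terms
toy_tfidf = csr_matrix(np.array([
    [0.0, 0.8, 0.6],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]))
toy_terms = np.array(["church", "mural", "rural"])

# Column means, flattened to a 1-D array
avg = np.asarray(toy_tfidf.mean(axis=0)).ravel()

# Indices of the terms, from lowest to highest average
order = np.argsort(avg)

print("lowest:", toy_terms[order[:2]])
print("highest:", toy_terms[order[::-1][:2]])
```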

In [44]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
feature_names = tfidf_vectorizer.get_feature_names_out()

# The mean of a sparse matrix is 2-D, so flatten it to a 1-D array
avg_tfidf = np.asarray(tfidf_matrix.mean(axis=0)).ravel()

# Indices of the words, sorted from lowest to highest average TF-IDF
sorted_indices = np.argsort(avg_tfidf)

print("10 words with the lowest TF-IDF on average:")
for i in sorted_indices[:10]:
    print(feature_names[i])

print("10 words with the highest TF-IDF on average:")
for i in sorted_indices[::-1][:10]:
    print(feature_names[i])

Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [10]:
# TODO: Compute the TF-IDF using scikit learn
# Import the module

# Instantiate the TF-IDF vectorizer

# Compute the TF-IDF


Compare the 10 words with the highest and lowest average TF-IDF to the ones you computed yourself.

In [11]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


lowest words: coll     0.305258
gw       0.305258
nmfc     0.305258
adel     0.305258
melb     0.305258
syd      0.305258
haw      0.305258
geel     0.305258
gcfc     0.305258
fabio    0.322574
dtype: float64
highest words: mosul        0.779137
rig          0.786813
travel       0.788050
aquapon      0.794899
date         0.794899
employ       0.795060
financ       0.803629
mongolian    0.831769
pump         1.000000
peacemak     1.000000
dtype: float64


Do you have the same words? How do you explain it?
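One possible source of differences is the averaging convention. Averages as high as 1.0 are consistent with taking the mean only over the documents in which a word occurs, rather than over all documents; this is an assumption about how such values could arise, not a statement about how the output above was produced. A sketch on hypothetical values shows that the two conventions can rank words differently:

```python
import pandas as pd

# Hypothetical TF-IDF values: 4 documents x 2 terms
toy = pd.DataFrame(
    [[0.6, 0.0],
     [0.6, 0.0],
     [0.6, 0.0],
     [0.0, 1.0]],
    columns=["rural", "pump"],
)

# Convention 1: mean over all documents (zeros included)
mean_all = toy.mean()

# Convention 2: mean over the documents containing the word
# (zeros are masked as NaN, which mean() skips)
mean_nonzero = toy.where(toy > 0).mean()

print(mean_all.sort_values())
print(mean_nonzero.sort_values())
```

With the first convention a frequent word ranks highest; with the second, a rare word that dominates a single short document does.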