# 01-TF-IDF

Here we will compute TF-IDF on a corpus of newspaper headlines.

Begin by importing the libraries we need:

In [1]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file *headlines.csv*.

In [16]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv')
df.head(10)

,publish_date,headline_text
0,20170721,algorithms can make decisions on behalf of fed...
1,20170721,andrew forrests fmg to appeal pilbara native t...
2,20170721,a rural mural in thallan
3,20170721,australia church risks becoming haven for abusers
4,20170721,australian company usgfx embroiled in shanghai...
5,20170721,australia suffers shock loss in womens world c...
6,20170721,big rigs
7,20170721,boy charged in connection with supermarket syr...
8,20170721,breaking bad creator vince gilligan on success...
9,20170721,breaking bad creator vince gilligan on walter ...


As usual, check the dataset's basic information.

In [17]:
# TODO: Have a look at the data
print(df.describe())
# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

       publish_date
count  1.999000e+03
mean   2.017069e+07
std    3.926710e+01
min    2.017062e+07
25%    2.017063e+07
50%    2.017071e+07
75%    2.017072e+07
max    2.017072e+07
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


We will now preprocess this text data: tokenization, removal of punctuation and stop words, and stemming.

Hint: use NLTK, the *pandas* method *apply*, lambda functions and list comprehensions.

In [20]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def process_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Remove punctuation (keeps only purely alphanumeric tokens, so tokens such as "n't" are dropped too)
    tokens = [word for word in tokens if word.isalnum()]
    # Remove stop words
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # Stem words
    tokens = [stemmer.stem(word) for word in tokens]
    return ' '.join(tokens)

df['processed_text'] = df['headline_text'].apply(process_text)
df['processed_text']

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Avry\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0                algorithm make decis behalf feder minist
1       andrew forrest fmg appeal pilbara nativ titl rule
2                                     rural mural thallan
3                        australia church risk becom abus
4       australian compani usgfx embroil shanghai staf...
                              ...                        
1994    constitut avenu win top prize act architectu a...
1995                              dark mofo number crunch
1996    david petraeu say australia must firm south ch...
1997    driverless car australia face challeng roo pro...
1998                     drug compani criticis price hike
Name: processed_text, Length: 1999, dtype: object

Now compute the Bag of Words (BOW) for our data, using scikit-learn.

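Warning: if your preprocessing returns lists of tokens, you have to bypass the vectorizer's analyzer with an identity function. Here `process_text` joins the tokens back into a single string, so the default analyzer of `CountVectorizer` can split it again.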

In [21]:
# TODO: Compute the BOW of the preprocessed data
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df['processed_text'])
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

print(bow_df)

      10  100  1000km  100th  10m  10th  110kph  12  120  122  ...  youtub  \
0      0    0       0      0    0     0       0   0    0    0  ...       0   
1      0    0       0      0    0     0       0   0    0    0  ...       0   
2      0    0       0      0    0     0       0   0    0    0  ...       0   
3      0    0       0      0    0     0       0   0    0    0  ...       0   
4      0    0       0      0    0     0       0   0    0    0  ...       0   
...   ..  ...     ...    ...  ...   ...     ...  ..  ...  ...  ...     ...   
1994   0    0       0      0    0     0       0   0    0    0  ...       0   
1995   0    0       0      0    0     0       0   0    0    0  ...       0   
1996   0    0       0      0    0     0       0   0    0    0  ...       0   
1997   0    0       0      0    0     0       0   0    0    0  ...       0   
1998   0    0       0      0    0     0       0   0    0    0  ...       0   

      zambian  zealand  zedd  zinc  zion  zombi  zone  zonta  z

You can check the shape of the BOW; the expected value is `(1999, 4165)`.

Now compute the Term Frequency (TF) and then the Inverse Document Frequency (IDF), and check that the values are not all zeros.

In [28]:
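# TODO: Compute the TF using the BOW
# Divide each row by its token count so that the term frequencies of a headline sum to 1
# (a headline with no remaining tokens would produce NaN here)
tf_df = bow_df.div(bow_df.sum(axis=1), axis=0)

tf = tf_df.to_numpy()
tf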

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
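
The printed array shows only the corners of a large, sparse matrix, so seeing zeros there does not mean every entry is zero. A small sketch of how one might check this, using a made-up count matrix in place of `bow_df` (the name `toy_counts` and its values are assumptions for illustration):

```python
import numpy as np

# Hypothetical bag-of-words counts: 3 headlines x 4 terms
toy_counts = np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
])

# Term frequency: each count divided by the number of tokens in its headline
toy_tf = toy_counts / toy_counts.sum(axis=1, keepdims=True)

nonzero_entries = np.count_nonzero(toy_tf)  # number of entries that are not zero
row_sums = toy_tf.sum(axis=1)               # each row should sum to 1

print(nonzero_entries)
print(row_sums)
```

On the real data, `np.count_nonzero(tf)` and `tf.sum(axis=1)` would play the same role.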

In [31]:
# TODO: Compute the IDF
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(use_idf=True)

# Fit the transformer on the BoW matrix to compute IDF
tfidf_transformer.fit(X)

# Get the IDF values
idf_values = tfidf_transformer.idf_
idf_values

array([6.99146455, 7.2146081 , 7.90775528, ..., 7.90775528, 7.90775528,
       7.90775528])
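
These values are consistent with scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. As a quick sanity check, here is a sketch using only the standard library (the helper name `smoothed_idf` is hypothetical):

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed IDF, as used by TfidfTransformer with its default smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 1999  # number of headlines in the corpus

# A term found in a single headline gets the largest possible IDF
idf_rare = smoothed_idf(n_documents, 1)
# A term found in four headlines
idf_df4 = smoothed_idf(n_documents, 4)

print(round(idf_rare, 8), round(idf_df4, 8))
```

With n = 1999, a document frequency of 1 gives about 7.9078 and a document frequency of 4 gives about 6.9915, which matches the last and first entries of the array printed above.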

Finally, compute the TF-IDF.

In [33]:
# TODO: compute the TF-IDF
tf_idf = tf * idf_values
tf_idf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

What are the 10 words with the highest and lowest TF-IDF on average?

In [38]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
mean_tf_idf_scores = np.mean(tf_idf, axis=0)
terms_with_scores = list(zip(vectorizer.get_feature_names_out(), mean_tf_idf_scores))
sorted_terms = sorted(terms_with_scores, key=lambda x: x[1], reverse=True)

# 10 words with the highest mean TF-IDF (zero scores skipped)
top_words = [(term, score) for term, score in sorted_terms if score != 0][:10]
# 10 words with the lowest non-zero mean TF-IDF
bottom_words = [(term, score) for term, score in reversed(sorted_terms) if score != 0][:10]

print("Top 10 highest:")
print(f'{top_words}\n')
print('Top 10 lowest:')
print(bottom_words)

Top 10 highest:
[('australia', 0.025560931184250436), ('australian', 0.025431119138871205), ('new', 0.021839611973342384), ('polic', 0.018525533232928312), ('market', 0.018459756901048795), ('say', 0.01822373349221039), ('trump', 0.016865140274321757), ('wa', 0.01598724728325081), ('man', 0.015695709222202744), ('sydney', 0.014814531550443662)]

Top 10 lowest:
[('syd', 0.00032965463060622547), ('nmfc', 0.00032965463060622547), ('melb', 0.00032965463060622547), ('haw', 0.00032965463060622547), ('gw', 0.00032965463060622547), ('geel', 0.00032965463060622547), ('gcfc', 0.00032965463060622547), ('coll', 0.00032965463060622547), ('adel', 0.00032965463060622547), ('nemtsov', 0.0003596232333886096)]


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the processed text to compute TF-IDF
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])

# Convert TF-IDF matrix to a dense array for better readability
tfidf_array = tfidf_matrix.toarray()

# Get feature names (terms)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame with TF-IDF scores
df_tfidf = pd.DataFrame(tfidf_array, columns=feature_names)

In [50]:
df_tfidf

,10,100,1000km,100th,10m,10th,110kph,12,120,122,...,youtub,zambian,zealand,zedd,zinc,zion,zombi,zone,zonta,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Compare the 10 words with the highest and lowest average TF-IDF to the ones you computed yourself.

In [52]:
# Top 10 words by their maximum TF-IDF score in any single headline
# (note: this is the per-word maximum, not the average asked for above)
top_words = df_tfidf.max().sort_values(ascending=False).head(10)

# Bottom 10 words by their minimum TF-IDF score, keeping only words that are non-zero somewhere.
# A word that is absent from at least one headline has a minimum of 0, hence the zeros in the output.
bottom_words = df_tfidf.loc[:, (df_tfidf != 0.0).any()].min().sort_values().astype(float).head(10)

print("Top 10 words with highest TF-IDF scores:")
print(top_words)
print("\nBottom 10 words with lowest TF-IDF scores (excluding score 0):")
print(bottom_words)

Top 10 words with highest TF-IDF scores:
peacemak     1.000000
pump         1.000000
mongolian    0.831769
financ       0.803629
employ       0.795060
date         0.794899
aquapon      0.794899
travel       0.788050
rig          0.786813
mosul        0.779137
dtype: float64

Bottom 10 words with lowest TF-IDF scores (excluding score 0):
10          0.0
place       0.0
placemak    0.0
plagu       0.0
plain       0.0
plan        0.0
plane       0.0
plant       0.0
plastic     0.0
plate       0.0
dtype: float64
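
The cell above ranks words by their maximum and minimum score, whereas the manual computation ranked them by their mean. A minimal sketch of the difference, on a made-up TF-IDF table standing in for `df_tfidf` (`toy_tfidf` and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical TF-IDF scores: 3 headlines x 3 terms
toy_tfidf = pd.DataFrame({
    "polic": [0.6, 0.0, 0.5],
    "mural": [0.0, 1.0, 0.0],
    "zoo":   [0.0, 0.0, 0.2],
})

# Ranking by the best single-headline score
col_max = toy_tfidf.max().sort_values(ascending=False)
# Ranking by the average score over all headlines
col_mean = toy_tfidf.mean().sort_values(ascending=False)
# Per-word minimum: 0 as soon as a word is missing from one headline
col_min = toy_tfidf.min()

print(col_max.index[0], col_mean.index[0])
print(col_min)
```

Here the two rankings already disagree on the top word, and every minimum is 0. On the real data, `df_tfidf.mean().sort_values(ascending=False).head(10)` would give the ranking by average.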


Do you get the same words? How do you explain the difference?
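
One likely source of differences is the weighting itself: by default, `TfidfVectorizer` multiplies raw counts (not length-normalized frequencies) by the smoothed IDF and then scales each row to unit L2 norm. The sketch below, on a three-headline toy corpus invented for illustration, reproduces that recipe by hand and contrasts it with the length-normalized TF used earlier:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical preprocessed headlines
docs = [
    "polic say man charg",
    "australia market new",
    "polic polic market say",
]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Smoothed IDF (scikit-learn's default)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# scikit-learn style: raw counts * IDF, then L2-normalize each row
weighted = counts * idf
sklearn_style = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Notebook style: counts divided by headline length, times IDF, no row normalization
notebook_style = counts / counts.sum(axis=1, keepdims=True) * idf

reference = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(sklearn_style, reference))
print(np.allclose(notebook_style, reference))
```

Because of the row normalization, a headline reduced to a single distinct token gets a score of exactly 1, which would be consistent with words such as `peacemak` and `pump` reaching 1.0 in the output above.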