# 01-TF-IDF

We will compute TF-IDF on a corpus of newspaper headlines.

Begin by importing the needed libraries:

In [1]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd

Load the data from the file *headlines.csv*.

In [2]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv')
print(df.head())

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...


As usual, check the dataset's basic information.

In [4]:
# TODO: Have a look at the data
print(df[['publish_date', 'headline_text']].head())
print(df.info())

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB
None


We will now preprocess this text data: tokenization, punctuation and stop word removal, and stemming.

Hint: use NLTK, the *pandas* method *apply*, lambda functions and list comprehensions.

In [6]:
# TODO: Perform preprocessing
# import needed modules
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
nltk.download('punkt')
nltk.download('stopwords')
# Tokenize
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens
# Remove punctuation
def remove_punctuation(tokens):
    tokens_without_punct = [token for token in tokens if token not in string.punctuation]
    return tokens_without_punct
# Remove stop words (the set is built once instead of on every call)
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokens):
    tokens_without_stopwords = [token for token in tokens if token.lower() not in stop_words]
    return tokens_without_stopwords
# Initialize PorterStemmer
stemmer = PorterStemmer()
# Stem
def stem_tokens(tokens):
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Perform preprocessing using pandas apply and lambda functions
df['stemmed'] = df['headline_text'].apply(lambda x: tokenize_text(x.lower())) \
                                   .apply(remove_punctuation) \
                                   .apply(remove_stopwords) \
                                   .apply(stem_tokens)

# Display the resulting DataFrame with the preprocessed text
print(df['stemmed'])


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\saksh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0         [algorithm, make, decis, behalf, feder, minist]
1       [andrew, forrest, fmg, appeal, pilbara, nativ,...
2                                 [rural, mural, thallan]
3                  [australia, church, risk, becom, abus]
4       [australian, compani, usgfx, embroil, shanghai...
                              ...                        
1994    [constitut, avenu, win, top, prize, act, archi...
1995                         [dark, mofo, number, crunch]
1996    [david, petraeu, say, australia, must, firm, s...
1997    [driverless, car, australia, face, challeng, r...
1998               [drug, compani, criticis, price, hike]
Name: stemmed, Length: 1999, dtype: object


Now compute the Bag of Words (BOW) for our data, using scikit-learn.

Warning: since we used our own preprocessing, you have to bypass the vectorizer's analyzer by passing an identity function.

In [31]:
# TODO: Compute the BOW of the preprocessed data

from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with analyzer as identity function and max_features=4165
count_vectorizer = CountVectorizer(analyzer=lambda x: x, max_features=4165)

# Fit and transform the preprocessed text
bow = count_vectorizer.fit_transform(df['stemmed'])

# Print the shape of the BOW matrix
print(bow.shape)


(1999, 4165)


You can check the shape of the BOW; the expected value is `(1999, 4165)`.

Now compute the Term Frequency (TF) and then the Inverse Document Frequency (IDF), and check that the values are not all zeros.
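
To make the two quantities concrete, here is a small sketch on a made-up count matrix. It assumes the definitions TF = count of the term in the document / number of tokens in the document, and IDF = ln(N / number of documents containing the term); other variants exist. All `toy_` names are illustrative.

```python
import numpy as np

# Hypothetical bag-of-words counts: 3 documents (rows) x 3 terms (columns)
toy_bow = np.array([[1, 1, 0],
                    [2, 0, 0],
                    [1, 0, 1]])

# TF: each count divided by the number of tokens in its document
toy_tf = toy_bow / toy_bow.sum(axis=1, keepdims=True)

# IDF: log of (number of documents / number of documents containing the term)
toy_doc_freq = (toy_bow > 0).sum(axis=0)
toy_idf = np.log(toy_bow.shape[0] / toy_doc_freq)

print(toy_tf)
print(toy_idf)
```

Under this definition, a term that appears in every document (the first column here) gets an IDF of exactly 0, so it contributes nothing to the TF-IDF.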

In [49]:
# TODO: Compute the TF using the BOW
# Convert BOW matrix to array for easier manipulation
bow_array = bow.toarray()

# Compute the total number of tokens in each document (row sums, kept as a column for broadcasting)
word_counts_per_doc = np.sum(bow_array, axis=1, keepdims=True)

# Compute the TF matrix by dividing each element of the BOW matrix by the sum of occurrences in its corresponding document
tf_matrix = bow_array / word_counts_per_doc

# Print the TF matrix
print(tf_matrix)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [50]:
# TODO: Compute the IDF
# Compute the document frequency of each term
# (named doc_freq so that the DataFrame df is not overwritten)
doc_freq = np.sum(bow_array > 0, axis=0)

# Compute the total number of documents
N = bow_array.shape[0]

# Compute the IDF for each term
idf = np.log(N / doc_freq)

# Print the IDF for each term
print(idf)

[7.60040233 7.60040233 5.80864287 ... 7.60040233 7.60040233 7.60040233]


Finally, compute the TF-IDF.

In [51]:
# TODO: compute the TF-IDF
tf_idf_matrix = tf_matrix * idf

# Print the TF-IDF matrix
print(tf_idf_matrix)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
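
The printed corners show only zeros, which is expected for short headlines spread over a large vocabulary but does not confirm that the computation worked. A quick sanity check, sketched here on a small stand-in matrix with made-up values (in the notebook, `toy_tf_idf` would be replaced by `tf_idf_matrix`), is to count the non-zero entries:

```python
import numpy as np

# Stand-in for tf_idf_matrix (hypothetical values)
toy_tf_idf = np.array([[0.0, 0.55, 0.0],
                       [0.0, 0.0, 0.0],
                       [0.0, 0.0, 0.55]])

# Count the entries that are not zero and report the largest value
non_zero = np.count_nonzero(toy_tf_idf)
print(f"{non_zero} non-zero entries out of {toy_tf_idf.size}")
print("largest value:", toy_tf_idf.max())
```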


What are the 10 words with the highest and lowest TF-IDF on average?

In [52]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average
# Compute the average TF-IDF score for each word across all documents
average_tf_idf = np.mean(tf_idf_matrix, axis=0)

# Get the indices of the words with the highest and lowest average TF-IDF scores
top_10_indices = np.argsort(average_tf_idf)[-10:][::-1]  # Indices of the top 10 words
bottom_10_indices = np.argsort(average_tf_idf)[:10]       # Indices of the bottom 10 words

# Get the words corresponding to the indices
feature_names = count_vectorizer.get_feature_names_out()

# Print the top 10 words with the highest average TF-IDF scores
print("Top 10 words with highest average TF-IDF:")
for idx in top_10_indices:
    print(feature_names[idx])

# Print the bottom 10 words with the lowest average TF-IDF scores
print("\nBottom 10 words with lowest average TF-IDF:")
for idx in bottom_10_indices:
    print(feature_names[idx])


Top 10 words with highest average TF-IDF:
australia
australian
new
polic
market
say
trump
wa
man
charg

Bottom 10 words with lowest average TF-IDF:
v
pacquaio
nmfc
bori
gw
geel
gcfc
melb
adel
coll


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [68]:
# TODO: Compute the TF-IDF using scikit learn




IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Compare the 10 words with the highest and lowest average TF-IDF to the ones you obtained with your own computation.

In [70]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average





NotFittedError: Vocabulary not fitted or provided

Do you have the same words? How do you explain it?