# 01-TF-IDF

Here we will compute the TF-IDF on a corpus of newspaper headlines.

Begin by importing the needed libraries:

In [8]:
# import needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Download the NLTK resources (only needs to run once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\natha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\natha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\natha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\natha\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Import the data from the file *headlines.csv*:

In [3]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv')
df.head()

   publish_date                                      headline_text
0      20170721  algorithms can make decisions on behalf of fed...
1      20170721  andrew forrests fmg to appeal pilbara native t...
2      20170721                           a rural mural in thallan
3      20170721  australia church risks becoming haven for abusers
4      20170721  australian company usgfx embroiled in shanghai...


As usual, check the dataset's basic information.

In [4]:
# TODO: Have a look at the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


We will now preprocess this text data: tokenization, punctuation and stop-word removal, and stemming.

Hint: to do so, use NLTK, the *pandas* method *apply*, lambda functions and list comprehensions.
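
As a minimal sketch of the pattern named in the hint, the snippet below chains `apply` with lambda functions and list comprehensions on a toy DataFrame. The tiny `STOP_WORDS` set and the whitespace tokenization are stand-ins (assumptions made so the example runs without downloaded NLTK resources). In the exercise itself you would rather use `nltk.word_tokenize` and `nltk.corpus.stopwords.words('english')`.

```python
import string

import pandas as pd
from nltk.stem import PorterStemmer

# Toy data; the column name mirrors the exercise's DataFrame.
toy = pd.DataFrame({"headline_text": [
    "algorithms can make decisions on behalf of federal ministers",
    "a rural mural in thallan",
]})

# Hypothetical, deliberately tiny stop-word set (an assumption for this sketch).
STOP_WORDS = {"a", "can", "in", "of", "on"}
stemmer = PorterStemmer()
punct_table = str.maketrans("", "", string.punctuation)

toy["headline_tokens"] = (
    toy["headline_text"]
    # Strip punctuation, then split on whitespace (stand-in tokenizer).
    .apply(lambda s: s.translate(punct_table).split())
    # Drop stop words.
    .apply(lambda toks: [t for t in toks if t not in STOP_WORDS])
    # Stem each remaining token.
    .apply(lambda toks: [stemmer.stem(t) for t in toks])
)
print(toy["headline_tokens"].tolist())
```

Building the stemmer, stop-word set and translation table once, outside the lambdas, avoids recreating them for every headline.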

In [5]:
for x in df['headline_text']:
    print(x)

algorithms can make decisions on behalf of federal ministers
andrew forrests fmg to appeal pilbara native title ruling
a rural mural in thallan
australia church risks becoming haven for abusers
australian company usgfx embroiled in shanghai staff standoff
australia suffers shock loss in womens world cup semi
big rigs
boy charged in connection with supermarket syringe incident
breaking bad creator vince gilligan on success and
breaking bad creator vince gilligan on walter white and tv
brumbies hoping for gigantic finals upset against hurricanes
can the gold coast keep up with its crazy construction boom
can victorian prisons learn from the us female justice system
chamseddine sentence for sexually assaulting deaf girl reduced
chinas designated driver business is booming
clive palmer court clive mensink queensland nickel
daijia services in china see businesses driving drinkers home
dalis dna samples taken in a bid to solve paternity suit
day scott and leishman lead promising aussie charg

In [6]:
# TODO: Perform preprocessing
# import needed modules
import string

# Build these once instead of on every iteration
stop_words = set(nltk.corpus.stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()
punct_table = str.maketrans('', '', string.punctuation)

tokens = []
for s in df['headline_text']:
    # Remove punctuation, then tokenize
    token = nltk.word_tokenize(s.translate(punct_table))
    # Remove stop words
    token = [x for x in token if x not in stop_words]
    # Lemmatize (note: the instructions ask for stemming, e.g. nltk.stem.PorterStemmer)
    token = [lemmatizer.lemmatize(x) for x in token]
    tokens.append(token)
df["headline_tokens"] = tokens
df["headline_tokens"]

0       [algorithm, make, decision, behalf, federal, m...
1       [andrew, forrests, fmg, appeal, pilbara, nativ...
2                                 [rural, mural, thallan]
3             [australia, church, risk, becoming, abuser]
4       [australian, company, usgfx, embroiled, shangh...
                              ...                        
1994    [constitution, avenue, win, top, prize, act, a...
1995                       [dark, mofo, number, crunched]
1996    [david, petraeus, say, australia, must, firm, ...
1997    [driverless, car, australia, face, challenge, ...
1998             [drug, company, criticised, price, hike]
Name: headline_tokens, Length: 1999, dtype: object

Now compute the Bag of Words for our data, using scikit-learn.

Warning: since we did our own preprocessing, you have to bypass the vectorizer's analyzer by passing an identity function.

In [14]:
# TODO: Compute the BOW of the preprocessed data
# Identity analyzer: the tokens are already preprocessed, so use them as-is
vectorizer = CountVectorizer(analyzer=lambda x: x)
vec = vectorizer.fit_transform(df["headline_tokens"])
vec.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

You can check the shape of the BOW; the expected value is `(1999, 4165)`.

Now compute the Term Frequency and then the Inverse Document Frequency, and check that the values are not all zeros.
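
Several TF and IDF definitions exist. The sketch below assumes one common choice: TF as the term count divided by the number of tokens in the document, and IDF as the natural log of the number of documents over the document frequency. The printed values in this notebook look consistent with that choice (for instance, the largest IDF, 7.60040233, equals ln(1999)), but treat it as an assumption. The toy `bow` matrix is made up for illustration.

```python
import numpy as np

# Toy BOW: 3 documents (rows) x 4 terms (columns).
bow = np.array([
    [1, 1, 0, 0],
    [0, 2, 1, 1],
    [0, 1, 0, 0],
])

# TF: count of the term divided by the number of tokens in the document.
tf = bow / bow.sum(axis=1, keepdims=True)

# IDF: natural log of (number of documents / documents containing the term).
n_docs = bow.shape[0]
doc_freq = (bow > 0).sum(axis=0)
idf = np.log(n_docs / doc_freq)

print(np.unique(tf))  # distinct TF values, a quick check that they are not all zeros
print(idf)
```

On real data, a headline left with no tokens after preprocessing has a row sum of zero, so the TF division would produce NaN values; such rows may need to be filtered out or handled separately.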

In [6]:
# TODO: Compute the TF using the BOW


array([0.        , 0.08333333, 0.09090909, 0.1       , 0.11111111,
       0.125     , 0.14285714, 0.16666667, 0.18181818, 0.2       ,
       0.22222222, 0.25      , 0.28571429, 0.33333333, 0.4       ,
       0.5       , 1.        ])

In [7]:
# TODO: Compute the IDF


array([3.28291422, 3.36629583, 3.44151925, 3.53995932, 3.57505064,
       3.70858204, 3.79373984, 3.83920222, 3.91152288, 3.96281617,
       4.04505427, 4.10389477, 4.13466643, 4.16641513, 4.19920495,
       4.2331065 , 4.26819782, 4.30456547, 4.3423058 , 4.38152651,
       4.4223485 , 4.46490812, 4.50935988, 4.5558799 , 4.60467006,
       4.65596336, 4.71003058, 4.76718899, 4.82781361, 4.89235213,
       4.961345  , 5.03545298, 5.11549568, 5.20250706, 5.29781724,
       5.40317776, 5.52096079, 5.65449219, 5.80864287, 5.99096442,
       6.21410797, 6.50179005, 6.90725515, 7.60040233])

Finally, compute the TF-IDF.

In [8]:
# TODO: compute the TF-IDF


What are the 10 words with the highest and lowest TF-IDF on average?
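
A possible way to rank words is sketched below on toy data (the `vocab` and `bow` values are invented). It multiplies TF by IDF and then averages each word's score over the documents that contain it. That averaging convention is an assumption: it seems to match the printed results in this notebook, where the top score 7.600402 equals ln(1999), a value a mean over all 1999 documents could not reach.

```python
import numpy as np
import pandas as pd

vocab = ["big", "mural", "rig", "rural"]
bow = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

tf = bow / bow.sum(axis=1, keepdims=True)
idf = np.log(bow.shape[0] / (bow > 0).sum(axis=0))
tfidf = tf * idf  # broadcasting: each column is scaled by its term's IDF

# Average each word's TF-IDF over the documents that contain it.
docs_with_word = (bow > 0).sum(axis=0)
mean_tfidf = pd.Series(tfidf.sum(axis=0) / docs_with_word, index=vocab).sort_values()

n = 2  # use 10 on the real data
print("lowest words:", mean_tfidf.head(n), sep="\n")
print("highest words:", mean_tfidf.tail(n), sep="\n")
```

Sorting a labelled `pandas.Series` keeps each score attached to its word, so `head` and `tail` directly give the lowest and highest entries.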

In [9]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


lowest words: gcfc    0.633367
geel    0.633367
gw      0.633367
haw     0.633367
melb    0.633367
coll    0.633367
adel    0.633367
syd     0.633367
nmfc    0.633367
cold    0.690456
dtype: float64
highest words: date         3.800201
mongolian    3.800201
puffbal      3.800201
mous         3.800201
rig          3.800201
superannu    3.800201
aquapon      3.800201
loophol      3.800201
pump         6.907255
peacemak     7.600402
dtype: float64


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [10]:
# TODO: Compute the TF-IDF using scikit-learn
# Import the module

# Instantiate the TF-IDF vectorizer

# Compute the TF-IDF


Compare the 10 highest and lowest TF-IDF words on average to the ones you computed yourself.

In [11]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


lowest words: coll     0.305258
gw       0.305258
nmfc     0.305258
adel     0.305258
melb     0.305258
syd      0.305258
haw      0.305258
geel     0.305258
gcfc     0.305258
fabio    0.322574
dtype: float64
highest words: mosul        0.779137
rig          0.786813
travel       0.788050
aquapon      0.794899
date         0.794899
employ       0.795060
financ       0.803629
mongolian    0.831769
pump         1.000000
peacemak     1.000000
dtype: float64


Do you have the same words? How do you explain it?
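
One likely source of differences is that scikit-learn's default settings do not use the plain formulas. By default, `TfidfVectorizer` takes raw counts as TF, uses a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and then rescales every document row to unit L2 norm. The sketch below, on made-up token lists, rebuilds those defaults by hand and compares the result with the library output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [["rural", "mural", "thallan"], ["big", "rig"], ["rural", "rig", "rig"]]


def identity(tokens):
    """Return the pre-computed tokens unchanged."""
    return tokens


bow = CountVectorizer(analyzer=identity).fit_transform(docs).toarray()
n_docs = bow.shape[0]
doc_freq = (bow > 0).sum(axis=0)

# scikit-learn defaults: raw counts as TF, smoothed IDF with a "+ 1" offset,
# then each document row is scaled to unit Euclidean (L2) norm.
idf_smooth = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = bow * idf_smooth
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sk = TfidfVectorizer(analyzer=identity).fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```

The row normalization explains why single-word headlines score exactly 1.0 in the scikit-learn ranking, while the unnormalized version can reach values as high as ln(1999).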