# 01-TF-IDF

![](https://images.unsplash.com/photo-1532153975070-2e9ab71f1b14?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Annie Spratt](https://unsplash.com/photos/5cFwQ-WMcJU)


We will here compute the TF-IDF on a corpus of newspaper headlines.

Begin by importing needed libraries:

In [40]:
# TODO: import needed libraries
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Import the data into the file *headlines.csv*

In [41]:
# TODO: Load the dataset
df = pd.read_csv('headlines.csv',on_bad_lines='skip')

As usual, check the dataset basic information.

In [42]:
# TODO: Have a look at the data
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   1999 non-null   int64 
 1   headline_text  1999 non-null   object
dtypes: int64(1), object(1)
memory usage: 31.4+ KB


In [43]:
df.columns

Index(['publish_date', 'headline_text'], dtype='object')

We will now perform preprocessing on this text data: tokenization, punctuation and stop words removal and stemming.

Hint: to do so, use NLTK, *pandas*'s method *apply*, lambda functions and list comprehension

In [44]:
# TODO: Perform preprocessing
def preprocessing(document):
    # 1- tokenization
    tokens = word_tokenize(document) ## ok avec skip dans read_csv
    # 2- punctuation removal
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # 3- remove stopwords
    stop_words = stopwords.words('english')
    tokens = [t for t in tokens if not t in stop_words]
    # 4- stemming
    stemmer = PorterStemmer() #build root by removing some known suffix and prefix
    tokens_stem = [stemmer.stem(w) for w in tokens]
    return tokens_stem


Compute now the Bag of Words for our data, using scikit-learn.

Warning: since we used our own preprocessing, you have to bypass analyzer with identity function.

In [45]:
# TODO: Compute the BOW of the preprocessed data
df['headline_text'] = df['headline_text'].apply(preprocessing)

vectorizer = CountVectorizer(analyzer=lambda x: x)
#BOW = vectorizer.fit_transform([preprocessing(x) for x in df.headline_text]).toarray()
BOW = vectorizer.fit_transform(df.headline_text).toarray()
strings = vectorizer.get_feature_names_out()

df_BOW = pd.DataFrame(data=BOW, columns=strings)  ## être sûr du phasage entre les valeurs et les colonnes
df_BOW #1999 x 4165: OK

## l'error avec bytes sthg est corrigée avec l'option skip dans read_csv :)

Unnamed: 0,aardman,aaron,ab,aback,abbott,abc,abel,abil,ablett,aborigin,...,youtub,zambian,zealand,zedd,zinc,zion,zombi,zone,zonta,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


You can check the shape of the BOW, the expected value is `(1999, 4165)`.

Now compute the Term Frequency and then the Inverse Document Frequency, and check the values are not only zeros.

In [48]:
# TODO: Compute the TF using the BOW
# TF = F(BOW)
## j'ai besoin que Vincent dépose le notebook de ce matin
TF = df_BOW.divide(df_BOW.sum(axis=1), axis=0)
TF.max() # ok

aardman    0.142857
aaron      0.200000
ab         0.250000
aback      0.166667
abbott     0.200000
             ...   
zion       0.125000
zombi      0.142857
zone       0.125000
zonta      0.142857
zoo        0.166667
Length: 4165, dtype: float64

In [49]:
# TODO: Compute the IDF
IDF = np.log(1 + df_BOW.shape[0])/(1 + df_BOW.sum(axis=0))
IDF

aardman    3.800451
aaron      1.900226
ab         3.800451
aback      2.533634
abbott     0.542922
             ...   
zion       3.800451
zombi      3.800451
zone       3.800451
zonta      3.800451
zoo        3.800451
Length: 4165, dtype: float64

Compute finally the TF-IDF.

In [None]:
# TODO: compute the TF-IDF



What are the 10 words with the highest and lowest TF-IDF on average?

In [None]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


Now let's compute the TF-IDF using scikit-learn on our preprocessed data (the one you used to compute the BOW).

In [59]:
# TODO: Compute the TF-IDF using scikit learn
vectorizer = TfidfVectorizer(analyzer=lambda x: x)
tf_idf = vectorizer.fit_transform(df.headline_text).toarray()
tf_idf

df_tfidf = pd.DataFrame(data=tf_idf, columns=vectorizer.get_feature_names_out())
print(f'10 highest TF-IDF: {df_tfidf.max().sort_values(ascending=False)[:10]}')
"""
peacemak     1.000000
pump         1.000000
mongolian    0.831769
financ       0.803629
employ       0.795060
aquapon      0.794899
date         0.794899
travel       0.788050
rig          0.786813
mosul        0.779137
"""
print(f'10 lowest TF-IDF: {df_tfidf.min().sort_values(ascending=False)[:10]}')
## comme c'est le log, on peut avoir des valeurs de TF-IDF à 0
"""
10 lowest TF-IDF: aardman       0.0
poet          0.0
point         0.0
poison        0.0
polic         0.0
policeman     0.0
polici        0.0
polit         0.0
politician    0.0
politicis     0.0
"""
df_tfidf



10 highest TF-IDF: peacemak     1.000000
pump         1.000000
mongolian    0.831769
financ       0.803629
employ       0.795060
aquapon      0.794899
date         0.794899
travel       0.788050
rig          0.786813
mosul        0.779137
dtype: float64
10 lowest TF-IDF: aardman       0.0
poet          0.0
point         0.0
poison        0.0
polic         0.0
policeman     0.0
polici        0.0
polit         0.0
politician    0.0
politicis     0.0
dtype: float64


Unnamed: 0,aardman,aaron,ab,aback,abbott,abc,abel,abil,ablett,aborigin,...,youtub,zambian,zealand,zedd,zinc,zion,zombi,zone,zonta,zoo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Compare the 10 highest and lowest TF-IDF words on average to the ones you had by yourself.

In [None]:
# TODO: Print the 10 words with the highest and lowest TF-IDF on average


Do you have the same words? How do you explain it?