# TF-IDF for dummies

We'll use a dataset from a [Kaggle InClass competition](https://www.kaggle.com/competitions/defi-ia-insa-toulouse/overview) held a few years ago. The goal is to predict a person's job based on their resume. The competition's purpose was to build a classifier that was not biased with respect to gender. In this notebook, we'll just focus on the TF-IDF part.

In [1]:
import pathlib
import zipfile
import pandas as pd

# Path to the zipped competition data
data_dir = pathlib.Path('../../data/bias-in-bios.zip')

with zipfile.ZipFile(data_dir, 'r') as z:
    # Descriptions and genders, indexed by Id
    with z.open('train.json') as f:
        train = pd.read_json(f).set_index('Id')
    # Mapping from category code to job name
    with z.open('categories_string.csv') as f:
        names = pd.read_csv(f)['0'].to_dict()
    # Category code of each Id, translated to a job name
    with z.open('train_label.csv') as f:
        jobs = pd.read_csv(f, index_col='Id')['Category']
        jobs = jobs.map(names)
        jobs = jobs.rename('job')
        train['job'] = jobs

train.head()


| Id | description | gender | job |
|----|-------------|--------|-----|
| 0 | She is also a Ronald D. Asmus Policy Entrepre... | F | professor |
| 1 | He is a member of the AICPA and WICPA. Brent ... | M | accountant |
| 2 | Dr. Aster has held teaching and research posi... | M | professor |
| 3 | He runs a boutique design studio attending cl... | M | architect |
| 4 | He focuses on cloud security, identity and ac... | M | architect |


In [2]:
f"{len(train):,d}"


'217,197'

In [3]:
train['job'].value_counts()


professor            70016
attorney             18820
photographer         14646
nurse                12622
journalist           12295
physician            11607
psychologist         10391
teacher               9145
surgeon               6616
architect             5841
dentist               5450
painter               4621
poet                  4292
filmmaker             4124
model                 4115
software_engineer     4060
composer              3395
accountant            3121
dietitian             2288
comedian              1639
pastor                1497
chiropractor          1406
paralegal              967
yoga_teacher           944
interior_designer      858
dj                     831
personal_trainer       807
rapper                 783
Name: job, dtype: int64

Machine learning models don't operate on raw text directly. The text first has to be transformed into numbers. Transforming text into a vector of numbers is called **vectorization**. There are many ways to vectorize text, and one of the most common is **TF-IDF**. Before we go into that, let's first look at a simpler method called **Bag of Words**.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()


A vectorizer does two things before counting anything. First, it normalizes the text (by default, it lowercases it):

In [5]:
clean = vectorizer.build_preprocessor()(train['description'][0])
clean


' she is also a ronald d. asmus policy entrepreneur fellow with the german marshall fund and is a visiting fellow at the centre for international studies (cis) at the university of oxford. this commentary first appeared at sada, an online journal published by the carnegie endowment for international peace.'

Next, it splits the text into tokens:

In [6]:
tokens = vectorizer.build_tokenizer()(clean)
tokens[:10]


['she',
 'is',
 'also',
 'ronald',
 'asmus',
 'policy',
 'entrepreneur',
 'fellow',
 'with',
 'the']
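
Notice that `a` and `D.` are missing from the tokens above. As a rough sketch of what happens under the hood, assuming the vectorizer's default settings (lowercasing, then the default `token_pattern` regular expression, which only keeps runs of two or more word characters), the two steps can be mimicked with the standard library. The sample sentence is shortened from the first description.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "She is also a Ronald D. Asmus Policy Entrepreneur Fellow"

# Step 1: normalize (lowercase). Step 2: tokenize with the regex.
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)
# ['she', 'is', 'also', 'ronald', 'asmus', 'policy', 'entrepreneur', 'fellow']
```

Single-character tokens such as `a` and `d` don't match the pattern, which is why they never make it into the vocabulary.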

The idea is then to build a matrix where each row corresponds to a document and each column corresponds to a token. The value of each cell is the number of times the token appears in the document. This is called a **Bag of Words** representation because we lose the order of the words in the text. We only keep track of the number of times each word appears in the text.
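
To make the construction concrete, here is a minimal sketch that builds such a matrix by hand. The two-document corpus is made up for illustration, and tokens are obtained with a plain `split()` rather than the vectorizer's tokenizer.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct token, sorted so each one gets a fixed column.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per token, cell = number of occurrences.
rows = []
for doc in docs:
    token_counts = Counter(doc.split())
    rows.append([token_counts[token] for token in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in rows:
    print(row)
# [1, 0, 1, 1, 1, 2]
# [0, 1, 0, 0, 1, 1]
```

Shuffling the words of a document would give exactly the same row, which is what "losing the order" means in practice.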

In [7]:
counts = vectorizer.fit_transform(raw_documents=train['description'])
counts


<217197x230368 sparse matrix of type '<class 'numpy.int64'>'
	with 9851657 stored elements in Compressed Sparse Row format>

The result is a sparse matrix, which is the natural data structure here: most documents contain only a small subset of the tokens, so storing all the zeros would waste memory. Sparse matrices are very common in text processing, and some machine learning algorithms are optimized to work with them.
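
To put numbers on it, the matrix above stores 9,851,657 values out of 217,197 × 230,368 cells, which is roughly 0.02%. The sketch below uses a made-up toy matrix to show what the Compressed Sparse Row (CSR) format actually keeps in memory.

```python
import numpy as np
from scipy import sparse

dense = np.array([
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 2],
])
csr = sparse.csr_matrix(dense)

print(csr.data)     # [3 1 2]    the non-zero values, row by row
print(csr.indices)  # [2 0 3]    the column of each stored value
print(csr.indptr)   # [0 1 2 3]  where each row starts and ends in `data`

# Fraction of cells that are actually stored.
density = csr.nnz / (csr.shape[0] * csr.shape[1])
print(f"{density:.2%} of the cells are stored")  # 25.00% of the cells are stored
```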

It's important to keep the sparse structure in mind when preprocessing. For instance, regular standard scaling should be avoided: subtracting the mean from a sparse matrix turns most zeros into non-zero values, which produces a dense matrix that takes a lot of memory. Instead, use a scaler that preserves sparsity, such as `MaxAbsScaler` or `StandardScaler(with_mean=False)`. Dividing each value by the maximum absolute value of its column, or by the standard deviation of its column, leaves zeros untouched, so the data stays sparse. Note that `MinMaxScaler` subtracts the minimum and does not accept sparse input.
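
Here is a small sketch of the difference on a made-up sparse matrix. It assumes scikit-learn's usual behaviour of rejecting sparse input when centering is requested.

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

X = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]))

# Dividing each column by its largest absolute value maps zeros to zeros,
# so the result is still sparse with the same number of stored values.
scaled = preprocessing.MaxAbsScaler().fit_transform(X)
print(sparse.issparse(scaled), scaled.nnz)

# Centering would turn almost every zero into a non-zero value,
# so StandardScaler refuses sparse input unless with_mean=False.
try:
    preprocessing.StandardScaler().fit_transform(X)
    centering_refused = False
except ValueError as error:
    centering_refused = True
    print(error)
```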

In [8]:
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

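model = pipeline.make_pipeline(
    # Text -> sparse matrix of token counts
    CountVectorizer(),
    # Scale each column to unit variance without centering, to stay sparse
    preprocessing.StandardScaler(with_mean=False),
    # Rescale each row to unit L2 norm
    preprocessing.Normalizer(),
    # Logistic regression trained with stochastic gradient descent
    linear_model.SGDClassifier(loss='log_loss', max_iter=100, tol=1e-3)
)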
model.fit(train['description'], train['job'])


The Bag of Words representation is very simple, but it has a few drawbacks. First, it doesn't take into account the order of the words. Second, it treats every word as equally important, even though some words are far more common than others. For instance, the word "the" is very common, but it doesn't carry much information. TF-IDF is a way to fix this second drawback.

TF-IDF stands for **Term Frequency - Inverse Document Frequency**. It's a way to reweight the Bag of Words representation. We start from the number of times the token appears in the document, usually normalized so that long documents don't dominate. This is called the **Term Frequency**. We then multiply it by a weight that shrinks as the number of documents containing the token grows. This is called the **Inverse Document Frequency**. The intuition is that if a token appears in many documents, it's not very informative. On the other hand, if it appears in only a few documents, it's more informative.

There are actually many ways to compute TF-IDF. One common variant scales the term frequency by the largest count in the document:

$$
\text{TF-IDF}(d, t) = \frac{\text{TF}(d, t)}{\max_{t' \in d} \text{TF}(d, t')} \times \log\left(\frac{N}{\text{DF}(t)}\right)
$$

where $d$ is a document, $t$ is a token, $t'$ ranges over the tokens of the document $d$, $N$ is the number of documents, $\text{TF}(d, t)$ is the number of times the token $t$ appears in the document $d$, and $\text{DF}(t)$ is the number of documents in which the token $t$ appears.

The first factor is the **Term Frequency**. The second factor is the **Inverse Document Frequency**. The $\log$ dampens the second factor, so that very rare tokens don't get disproportionately large weights. Dividing by the $\max$, which is computed over all the tokens in the document $d$, bounds the first factor between 0 and 1, so that long documents don't get larger scores simply because they contain more words. Note that scikit-learn's `TfidfVectorizer` uses a slightly different variant by default: raw counts for the term frequency, a smoothed inverse document frequency $\ln\frac{1 + N}{1 + \text{DF}(t)} + 1$, and rows rescaled to unit L2 norm.
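
As a worked example, here is a minimal sketch of the formula above in plain Python. The three-document corpus is made up for illustration, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

docs = [
    ["the", "cat", "sat", "the"],
    ["the", "dog"],
    ["a", "cat"],
]
N = len(docs)

# DF(t): number of documents in which token t appears at least once.
df = Counter(token for doc in docs for token in set(doc))

def tf_idf(doc):
    tf = Counter(doc)          # TF(d, t): raw count of t in d
    max_tf = max(tf.values())  # largest count in the document
    return {t: (count / max_tf) * math.log(N / df[t]) for t, count in tf.items()}

scores = tf_idf(docs[0])
for token, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{token:<5} {score:.3f}")
# sat   0.549
# the   0.405
# cat   0.203
```

Here `sat` appears only once but ranks above `the`, which appears twice, because `the` shows up in two of the three documents.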

**TLDR: TF-IDF is a way to normalize the Bag of Words representation. It's a way to give more importance to rare words.**

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(raw_documents=train['description'])
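
With its default settings, `TfidfVectorizer` does not apply the max-scaled formula given earlier. As a sketch, assuming the defaults `smooth_idf=True`, `sublinear_tf=False` and `norm='l2'`, its output can be reproduced from raw counts on a made-up toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]

# Raw token counts, densified because the toy corpus is tiny.
toy_counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = toy_counts.shape[0]

# Document frequency and smoothed inverse document frequency.
doc_freq = (toy_counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then rescale each row to unit L2 norm.
weighted = toy_counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, reference)
print(matches)
```

The row normalization explains why the scores of a single document always lie between 0 and 1.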


In [15]:
tfidf_matrix[0]


<1x230368 sparse matrix of type '<class 'numpy.float64'>'
	with 37 stored elements in Compressed Sparse Row format>

In [33]:
tfidf_matrix[0].indices


array([160507,  75843,  47025,  44714, 170236, 113882, 154903,  24702,
       181876,  26907,  83165,  55279, 207078, 157328, 153755, 214671,
        52562, 199296, 108467,  84635,  48901,  29976, 219339,  24992,
        86829, 133833,  89664, 206175, 224453,  81756,  76468, 165639,
        29401, 179670,  23555, 109703, 189090], dtype=int32)

In [32]:
tfidf_matrix[0].data


array([0.17587834, 0.209706  , 0.18732235, 0.08337532, 0.09836813,
       0.11498371, 0.13353337, 0.064915  , 0.31659522, 0.12865291,
       0.10931213, 0.21066895, 0.09937576, 0.16555669, 0.03304113,
       0.05916193, 0.26172833, 0.10305093, 0.2036051 , 0.09570113,
       0.14782546, 0.15956879, 0.15093547, 0.02909765, 0.18124474,
       0.19921095, 0.17256723, 0.13280117, 0.05378592, 0.24756103,
       0.2139983 , 0.12751501, 0.32776078, 0.22907902, 0.07296681,
       0.09074152, 0.05282196])

In [55]:
feature_names = vectorizer.get_feature_names_out()

indices = tfidf_matrix[0].indices
scores = tfidf_matrix[0].data

for i in scores.argsort()[::-1]:
    print(f"{feature_names[indices[i]]:<20} {scores[i]:.3f}")


asmus                0.328
sada                 0.317
cis                  0.262
fellow               0.248
ronald               0.229
entrepreneur         0.214
commentary           0.211
endowment            0.210
international        0.204
marshall             0.199
carnegie             0.187
fund                 0.181
peace                0.176
german               0.173
oxford               0.166
at                   0.160
visiting             0.151
centre               0.148
online               0.134
the                  0.133
appeared             0.129
policy               0.128
journal              0.115
first                0.109
studies              0.103
this                 0.099
published            0.098
for                  0.096
is                   0.091
by                   0.083
also                 0.073
an                   0.065
university           0.059
with                 0.054
she                  0.053
of                   0.033
and                  0.029


Compare this to a Bag of Words representation:

In [57]:
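# Both vectorizers use the same default tokenization, so they share the
# same vocabulary and `feature_names` from the TF-IDF vectorizer still applies.
indices = counts[0].indices
scores = counts[0].data

for i in scores.argsort()[::-1]:
    print(f"{feature_names[indices[i]]:<20} {scores[i]:.3f}")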


the                  4.000
at                   3.000
international        2.000
is                   2.000
fellow               2.000
for                  2.000
german               1.000
visiting             1.000
and                  1.000
fund                 1.000
marshall             1.000
with                 1.000
entrepreneur         1.000
policy               1.000
asmus                1.000
ronald               1.000
also                 1.000
centre               1.000
peace                1.000
endowment            1.000
studies              1.000
carnegie             1.000
by                   1.000
published            1.000
journal              1.000
online               1.000
an                   1.000
sada                 1.000
appeared             1.000
first                1.000
commentary           1.000
this                 1.000
oxford               1.000
of                   1.000
university           1.000
cis                  1.000
she                  1.000


One last thing to mention is that the tokenization can be customized. For instance, in search engines, it's common to use n-grams in addition to single tokens. An n-gram is a sequence of n consecutive tokens. In particular, trigrams are quite common. With `ngram_range=(1, 3)`, the vectorizer keeps unigrams, bigrams, and trigrams.
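
As a minimal sketch of what an n-gram extractor does, here is a plain Python version. The helper `word_ngrams` is hypothetical and only illustrates the idea; it is not how scikit-learn implements it.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return every run of n_min to n_max consecutive tokens, joined by spaces."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(n_min, n_max + 1)
        for start in range(len(tokens) - n + 1)
    ]

ngrams = word_ngrams(["she", "is", "also", "ronald"])
print(ngrams)
# ['she', 'is', 'also', 'ronald',
#  'she is', 'is also', 'also ronald',
#  'she is also', 'is also ronald']
```

The vocabulary grows quickly with `n_max`, since each document contributes almost as many bigrams and trigrams as it has tokens.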

In [74]:
trigrammer = CountVectorizer(ngram_range=(1, 3))
trigrams = trigrammer.fit_transform(raw_documents=train['description'][:100])


In [72]:
feature_names = trigrammer.get_feature_names_out()

indices = trigrams[0].indices
scores = trigrams[0].data

for i in scores.argsort()[::-1][:20]:
    print(f"{feature_names[indices[i]]:<20} {scores[i]:.3f}")


the                  4.000
at                   3.000
is                   2.000
international        2.000
for                  2.000
for international    2.000
fellow               2.000
at the               2.000
asmus policy         1.000
fellow with          1.000
entrepreneur fellow  1.000
policy entrepreneur  1.000
for international peace 1.000
ronald asmus         1.000
also ronald          1.000
the german           1.000
is also              1.000
she is               1.000
peace                1.000
endowment            1.000


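This can also be done at the character level. With `analyzer='char_wb'`, the n-grams are built from the characters inside each word, and each word is padded with a space on either side.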
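
To see what character n-grams look like before fitting anything, the analyzer can be called directly on a string. This is a small sketch on a made-up input, restricted to trigrams only so the output stays short.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that turns a raw string into features.
analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)).build_analyzer()

char_trigrams = analyzer("CIS studies")
print(char_trigrams)
# [' ci', 'cis', 'is ', ' st', 'stu', 'tud', 'udi', 'die', 'ies', 'es ']
```

The input is lowercased first, and no trigram spans two words, because each word is handled separately.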

In [85]:
trigrammer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char_wb')
trigrams = trigrammer.fit_transform(raw_documents=train['description'][:10_000])


In [88]:
feature_names = trigrammer.get_feature_names_out()

indices = trigrams[0].indices
scores = trigrams[0].data

for i in scores.argsort()[::-1][:30]:
    print(f"{feature_names[indices[i]]:<20} {scores[i]:.3f}")


                     0.725
e                    0.215
a                    0.185
n                    0.170
t                    0.155
i                    0.148
r                    0.133
o                    0.118
s                    0.118
l                    0.104
 a                   0.074
d                    0.074
h                    0.067
na                   0.063
f                    0.060
e                    0.059
rna                  0.056
smu                  0.056
(ci                  0.054
nal                  0.054
is)                  0.053
fel                  0.052
rn                   0.052
u                    0.052
wme                  0.050
nt                   0.049
 f                   0.049
owm                  0.049
al                   0.048
pea                  0.048
