# Term Frequency - Inverse Document Frequency (TF-IDF)

In [2]:
import re
import pandas as pd
import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

The idea behind TF-IDF is to quantify how distinctive a word is for a specific document, given all other documents in the corpus. Common words like 'the' and 'and' occur frequently in any text, so they contribute little to document classification. Words that occur often in one document but rarely in the others, on the other hand, are likely to be important markers of its content.

The TF-IDF matrix encodes these values by having each row represent a document and each column represent a specific word or term. It is built from the matrix of raw term counts, for example:

| doc | We | like | and | dogs | cats | cars | planes |
| --- | -- | ---- | --- | ---- | ---- | ---- | ------ |
| 0   | 1  | 1    | 1   | 1    | 1    | 0    | 0      |
| 1   | 1  | 1    | 1   | 0    | 0    | 1    | 1      |

Here we have two documents, one about animals and one about vehicles. The words they share tell us nothing about what distinguishes them, but weighting each term with the following inverse document frequency (IDF) function does:

$$
\text{idf}(t, D) = \log \frac{N}{|\{d\in D: t\in d\}|} + 1
$$

where $N$ is the total number of documents in the corpus $D$ and $\Delta = |\{d\in D: t\in d\}|$ is the number of documents in which the term $t$ appears. If a word appears in every document, then $N/\Delta = 1$ and the function evaluates to $0+1=1$; the added $1$ ensures that the weight never drops below 1, so such terms are not zeroed out entirely. A logarithm is used because a change in document frequency matters more for rare terms than for common ones.

Then we multiply this by the term frequency $f_{t,d}$, the number of times term $t$ occurs in document $d$:

$$
\text{tfidf}(t, d, D) = f_{t,d} \cdot \text{idf}(t, D)
$$

For the matrix shown above this would result in

| doc | We | like | and | dogs   | cats   | cars   | planes |
| --- | -- | ---- | --- | ------ | ------ | ------ | ------ |
| 0   | 1  | 1    | 1   | 1.6931 | 1.6931 | 0      | 0      |
| 1   | 1  | 1    | 1   | 0      | 0      | 1.6931 | 1.6931 |

where we can see that terms that are not in both documents get a higher weight, and as such will be seen as more 'important'.
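
The weights in the table can be reproduced by hand. The following is a minimal sketch using only NumPy; the variable names are illustrative, and it assumes the natural logarithm, which is what the value 1.6931 ($\ln 2 + 1$) implies.

```python
import numpy as np

# Term counts for the two toy documents; columns follow the table above.
terms = ["We", "like", "and", "dogs", "cats", "cars", "planes"]
counts = np.array([
    [1, 1, 1, 1, 1, 0, 0],   # doc 0: "We like dogs and cats"
    [1, 1, 1, 0, 0, 1, 1],   # doc 1: "We like cars and planes"
])

n_docs = counts.shape[0]
doc_freq = np.count_nonzero(counts, axis=0)   # number of documents containing each term
idf = np.log(n_docs / doc_freq) + 1           # natural log plus one
tfidf_by_hand = counts * idf                  # broadcast the per-term weight over all rows

print(dict(zip(terms, idf.round(4))))
print(tfidf_by_hand.round(4))
```

Terms shared by both documents keep a weight of 1, while terms unique to one document are scaled up by $\ln 2 + 1$.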

## Performing TF-IDF

In [3]:
D = ["We like dogs and cats", "We like cars and planes"]
cv = CountVectorizer()
tf_mat = cv.fit_transform(D)
tf = pd.DataFrame(tf_mat.toarray(), columns = cv.get_feature_names_out())
tf

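| doc | and | cars | cats | dogs | like | planes | we |
| --- | --- | ---- | ---- | ---- | ---- | ------ | -- |
| 0   | 1   | 0    | 1    | 1    | 1    | 0      | 1  |
| 1   | 1   | 1    | 0    | 0    | 1    | 1      | 1  |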


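`CountVectorizer` lower-cases the tokens and sorts the columns alphabetically, so the columns appear in a different order, but the counts are identical to the example matrix at the top.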

We can now apply a TF-IDF transformer to this to get the TF-IDF weight matrix:

In [7]:
tfidf_trans = TfidfTransformer(smooth_idf=False)
tfidf_mat = tfidf_trans.fit_transform(tf)
tfidf = pd.DataFrame(tfidf_mat.toarray(), columns = tfidf_trans.get_feature_names_out())
tfidf

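| doc | and      | cars     | cats     | dogs     | like     | planes   | we       |
| --- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 0   | 0.338381 | 0.0      | 0.572929 | 0.572929 | 0.338381 | 0.0      | 0.338381 |
| 1   | 0.338381 | 0.572929 | 0.0      | 0.0      | 0.338381 | 0.572929 | 0.338381 |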


These values have been normalised by sklearn so that each document vector has unit length, $\vec{d} \cdot \vec{d} = 1$. To get the exact same matrix as before, we can apply the fitted IDF weights (`idf_`) to the original count matrix:

In [5]:
pd.DataFrame(tfidf_trans.idf_ * tf.to_numpy(), columns = tfidf_trans.get_feature_names_out())

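| doc | and | cars     | cats     | dogs     | like | planes   | we  |
| --- | --- | -------- | -------- | -------- | ---- | -------- | --- |
| 0   | 1.0 | 0.0      | 1.693147 | 1.693147 | 1.0  | 0.0      | 1.0 |
| 1   | 1.0 | 1.693147 | 0.0      | 0.0      | 1.0  | 1.693147 | 1.0 |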
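
The link between the two matrices can be checked by hand. The sketch below assumes the transformer's default Euclidean (L2) normalisation and uses values copied from the unnormalised output; dividing each row by its length should recover the normalised weights shown earlier.

```python
import numpy as np

# Unnormalised TF-IDF rows for the toy corpus
# (columns: and, cars, cats, dogs, like, planes, we).
raw = np.array([
    [1.0, 0.0, 1.693147, 1.693147, 1.0, 0.0, 1.0],
    [1.0, 1.693147, 0.0, 0.0, 1.0, 1.693147, 1.0],
])

row_norms = np.linalg.norm(raw, axis=1, keepdims=True)  # Euclidean length of each document vector
normalised = raw / row_norms

print(normalised.round(6))
print((normalised ** 2).sum(axis=1))  # squared length of each row after normalisation
```

If the unnormalised weights are wanted directly, `TfidfTransformer` also accepts `norm=None`, which skips this step.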


## TF-IDF matrix from data

We'll use a dataset from [Kaggle](https://www.kaggle.com/datasets/vivmankar/physics-vs-chemistry-vs-biology), containing Reddit comments.

In [None]:
df = pd.read_csv('./data/tfidf.csv')["Comment"]
df.head()

0    Personally I have no idea what my IQ is. I’ve ...
1    I'm skeptical. A heavier lid would be needed t...
2    I think I have 100 cm of books on the subject....
3    Is chemistry hard in uni. Ive read somewhere t...
4    In addition to the other comment, you can crit...
Name: Comment, dtype: object

For preprocessing, we want to split the input text into tokens and count them. The `CountVectorizer` gives a matrix in which each entry is the count of a word (column) in a document (row).

We first need to remove the numbers from the text and convert it to lower case. For this we can use a custom function passed to the `CountVectorizer` instance.

In [14]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text

cv = CountVectorizer(max_features = 500, preprocessor = preprocess_text)
tf = cv.fit_transform(df)
pd.DataFrame(tf.toarray(), columns = cv.get_feature_names_out())

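| doc | able | about | above | acid | acids | actually | add | after | again | ago | ... | wouldn | wrong | www | yeah | year | years | yes | you | your | yourself |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1581 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 6 | 0 |
| 1582 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1583 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| 1584 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |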
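
To see what the preprocessor does to a single string, it can be called on its own. The sample sentence below is made up for illustration and is not taken from the dataset.

```python
import re

def preprocess_text(text):
    # Same two steps as the preprocessor above: lower-case, then drop runs of digits.
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text

sample = "Water boils at 100 C and H2O has 3 atoms."  # hypothetical input
cleaned = preprocess_text(sample)
print(cleaned)  # water boils at  c and ho has  atoms.
```

Two things are worth noticing. Digits inside a token are removed as well, so `H2O` becomes `ho`. And because a custom `preprocessor` replaces `CountVectorizer`'s built-in preprocessing stage, which is where lower-casing normally happens, the function has to call `lower()` itself.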


Then we use the `TfidfTransformer`, this time with its default settings (including `smooth_idf=True`), to get the TF-IDF values for each token per document:

In [17]:
tfidf_trans = TfidfTransformer()
tfidf_mat = tfidf_trans.fit_transform(tf)  # the sparse count matrix can be passed directly
tfidf = pd.DataFrame(tfidf_mat.toarray(), columns = cv.get_feature_names_out())
tfidf

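| doc | able | about | above | acid | acids | actually | add | after | again | ago | ... | wouldn | wrong | www | yeah | year | years | yes | you | your | yourself |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 1 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 2 | 0.11354 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.044232 | 0.000000 | 0.0 |
| 3 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.188718 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 4 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.079460 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1581 | 0.00000 | 0.214699 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.331533 | 0.308927 | 0.0 |
| 1582 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.096378 | 0.149678 | 0.0 |
| 1583 | 0.00000 | 0.121809 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.225714 | 0.000000 | 0.0 |
| 1584 | 0.00000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |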
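
Because this transformer is created without `smooth_idf=False`, scikit-learn's smoothed IDF applies, $\ln\frac{1+N}{1+\Delta} + 1$, which differs slightly from the formula given at the start. The sketch below checks this on the two-sentence toy corpus rather than the Reddit data; it assumes the documented default behaviour holds in the installed scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["We like dogs and cats", "We like cars and planes"]
counts = CountVectorizer().fit_transform(docs)

smooth = TfidfTransformer().fit(counts)   # smooth_idf=True is the default

n_docs = counts.shape[0]
doc_freq = np.count_nonzero(counts.toarray(), axis=0)   # documents containing each term
expected = np.log((1 + n_docs) / (1 + doc_freq)) + 1     # smoothed IDF computed by hand

print(smooth.idf_.round(6))
print(np.allclose(smooth.idf_, expected))
```

Smoothing acts as if one extra document containing every term had been added, which avoids division by zero for terms with zero document frequency.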


Most entries in this matrix are 0, because most words do not occur in any given document.

In [26]:
f"Only {100 * np.count_nonzero(tfidf) / tfidf.size:.2f}% of the values of the TF-IDF matrix are nonzero"

'Only 6.48% of the values of the TF-IDF matrix are nonzero'

So it would be better to encode this in a sparse format that stores only the nonzero entries, using just three columns: (document, word, TF-IDF value).

In [32]:
f"This would reduce the size from {tfidf.size} entries to {3 * np.count_nonzero(tfidf)} entries, for a reduction of {100 - 100*(3 * np.count_nonzero(tfidf)) / (tfidf.size):.2f}%"

'This would reduce the size from 793000 entries to 154257 entries, for a reduction of 80.55%'

Pandas makes this really easy for us:

In [None]:
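# Stack the columns into a (document, word) MultiIndex and keep only the nonzero entries
sparse_tfidf = tfidf.stack()
sparse_tfidf = sparse_tfidf[sparse_tfidf != 0]
sparse_tfidf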

0     an       0.137536
      and      0.154283
      be       0.109092
      been     0.398127
      by       0.153027
                 ...   
1585  to       0.208749
      up       0.163900
      video    0.559576
      want     0.189222
      you      0.181626
Length: 51419, dtype: float64
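
The same (document, word, value) layout is what SciPy's coordinate (COO) sparse format stores internally. As a minimal sketch, using a small made-up matrix instead of the real TF-IDF data:

```python
import numpy as np
from scipy import sparse

# Small hypothetical weight matrix: 3 documents x 4 terms, mostly zeros.
dense = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
    [0.3, 0.0, 0.0, 0.0],
])

coo = sparse.coo_matrix(dense)   # keeps only the nonzero entries

# Each stored entry is a (row index, column index, value) triple.
for doc, term, value in zip(coo.row, coo.col, coo.data):
    print(doc, term, value)

print(f"{coo.nnz} stored values instead of {dense.size}")
```

In the notebook itself, `tfidf_mat` is already a SciPy sparse matrix (which is why `.toarray()` is needed to build the DataFrame), so the dense DataFrame is mainly useful for inspection.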