## What is the TF-IDF model?

Previously, we talked about the bag-of-words model, which has many limitations. Today we go one step further and try to fix **one of them: every word has the same importance**.

> 💡 The crux of the problem: **how do we define word importance?**

One idea is: The more frequently a word appears **within a single document**, the more important it is **for that document**. *For instance, in an article discussing dogs, the word "dog" is likely to appear frequently, reflecting the document's main topic*.

But what if a word appears very frequently **in all documents**? *For example, the word "of" may appear quite often in every document. Can we say "of" is important?* Clearly not. So we have a clue here: if a word has a high frequency in **every document**, it is probably not significant and does not convey much information.

Therefore, a reasonable solution should consider a word's frequency within **a single document** and also take into account its frequency across **multiple documents**. TF-IDF balances these two aspects.

In summary, the **intuition** behind TF-IDF is - **Similar documents *may* use similar words, while the importance of different words should vary**.


## TF-IDF model in detail

TF-IDF combines two quantities: term frequency (TF) and inverse document frequency (IDF). Let's look at them one at a time.

### TF

The TF can be seen as a function of document $d$ and word $w$, and the equation is:

$$\text{TF}(w, d)=\frac{\text{frequency of}\ w\ \text{in}\ d}{\text{word counts of } d}$$

That is, we just need to calculate the frequency of word $w$ in the document $d$, and then divide it by the total number of words in $d$.
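
As a minimal sketch of this definition (the helper name `term_frequency` and the sample tokens are made up for illustration, they do not come from any library):

```python
def term_frequency(word: str, tokens: list[str]) -> float:
    """Relative frequency of `word` in a tokenized document."""
    if not tokens:  # guard against an empty document
        return 0.0
    return tokens.count(word) / len(tokens)


# Illustrative document: "dog" occurs 2 times out of 6 tokens.
tokens = ["the", "dog", "chased", "the", "other", "dog"]
tf_dog = term_frequency("dog", tokens)
tf_cat = term_frequency("cat", tokens)
print(tf_dog, tf_cat)

# Summed over the distinct words of the document, the TF values add up to one.
tf_total = sum(term_frequency(w, tokens) for w in set(tokens))
print(tf_total)
```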

> 🐛 In Scikit-Learn, the computation of TF is a bit different. It doesn't involve dividing by the total number of words in the document. The purpose of dividing is to normalize the $\text{TF}(w, d)$ values within the document $d$, making them add up to one. In Scikit-Learn, this normalization process is performed after the TF-IDF calculation. We will demonstrate this with an example later.

### IDF

The goal of IDF is to reduce the importance of some common words that appear in each document. Therefore, the IDF is a function involving the word $w$ and the $corpus$.

$$
\text{IDF}(w, corpus)=\log \frac{\text{document count of }corpus}{1+\text{count of documents containing }w}
$$

We add one in the denominator to avoid division by 0.

> 🤔️ The $corpus$ is generally fixed, so it can be treated as a constant. In that case, IDF depends only on the word $w$.

> 💡 Note the $\log$ here. Are we using $\log_2$, $\log_{10}$ or $\ln$? Different frameworks might have variations. *Scikit-Learn uses $\ln$.*

> 🐛 In Scikit-Learn, the calculation of IDF differs from the equations mentioned above. By default, Scikit-Learn uses the following formula[^1]:

$$
\text{IDF}(w, corpus)=\log \frac{1 + \text{document count of }corpus}{1+\text{count of documents containing }w} + 1
$$

🤔️ In my opinion, **the modification made by Scikit-Learn ensures that $\text{IDF}(w)$ cannot be less than 1**. In the original equation, if a word $w$ appears in every document of the corpus, $\text{IDF}(w)$ is negative. Therefore, Scikit-Learn's modification seems more practical. It provides a more intuitive comparison of the IDF values for different words.
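
To make the difference concrete, here is a small sketch that evaluates both formulas for a hypothetical corpus of 4 documents (the function names `idf_textbook` and `idf_smooth` are mine, not Scikit-Learn's):

```python
import math


def idf_textbook(n_docs: int, doc_freq: int) -> float:
    """IDF as first defined in this section: ln(N / (1 + df))."""
    return math.log(n_docs / (1 + doc_freq))


def idf_smooth(n_docs: int, doc_freq: int) -> float:
    """Smoothed variant quoted above: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 4  # hypothetical corpus size
for doc_freq in range(1, n_docs + 1):
    print(
        doc_freq,
        round(idf_textbook(n_docs, doc_freq), 5),
        round(idf_smooth(n_docs, doc_freq), 5),
    )
```

For a word that occurs in all 4 documents, the first formula goes below zero while the smoothed one bottoms out at exactly 1.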

### TF-IDF

$$
\text{TF-IDF}(w, d, corpus)=\text{TF}(w, d) \times \text{IDF}(w, corpus)
$$


## The TF-IDF in Scikit-Learn

Implementing TF-IDF by hand is straightforward, but in practice you will probably use the well-established APIs provided by Scikit-Learn. Here we look at how Scikit-Learn calculates TF-IDF, continuing with the official example from Scikit-Learn:

In [1]:
toy_corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tokenized_toy_corpus = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'is', 'the', 'second', 'second', 'document'],
    ['and', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document']
]

Let's retrieve the TF-IDF matrix using the APIs

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
# set norm=None for comparison
vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(toy_corpus)

In [3]:
print(vectorizer.get_feature_names_out())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [4]:
print(X.toarray())

[[0.         1.22314355 1.51082562 1.22314355 0.         0.
  1.         0.         1.22314355]
 [0.         1.22314355 0.         1.22314355 0.         3.83258146
  1.         0.         1.22314355]
 [1.91629073 0.         0.         0.         1.91629073 0.
  1.         1.91629073 0.        ]
 [0.         1.22314355 1.51082562 1.22314355 0.         0.
  1.         0.         1.22314355]]


We can access the TF-IDF matrix using `X.toarray()` (Note that I set `norm=None` in the code snippet)

|           | and     | document | first   | is      | one     | second  | the | third   | this    |
| --------- | ------- | -------- | ------- | ------- | ------- | ------- | --- | ------- | ------- |
| document1 | 0.0     | 1.22314  | 1.51082 | 1.22314 | 0.0     | 0.0     | 1.0 | 0.0     | 1.22314 |
| document2 | 0.0     | 1.22314  | 0.0     | 1.22314 | 0.0     | 3.83258 | 1.0 | 0.0     | 1.22314 |
| document3 | 1.91629 | 0.0      | 0.0     | 0.0     | 1.91629 | 0.0     | 1.0 | 1.91629 | 0.0     |
| document4 | 0.0     | 1.22314  | 1.51082 | 1.22314 | 0.0     | 0.0     | 1.0 | 0.0     | 1.22314 |

Let me also put the bag-of-words matrix here:

|           | and | document | first | is  | one | second | the | third | this |
| --------- | --- | -------- | ----- | --- | --- | ------ | --- | ----- | ---- |
| document1 | 0   | 1        | 1     | 1   | 0   | 0      | 1   | 0     | 1    |
| document2 | 0   | 1        | 0     | 1   | 0   | 2      | 1   | 0     | 1    |
| document3 | 1   | 0        | 0     | 0   | 1   | 0      | 1   | 1     | 0    |
| document4 | 0   | 1        | 1     | 1   | 0   | 0      | 1   | 0     | 1    |

🤔️ *Comparing these two matrices, we can see that the relative importance of `document` and `first` in document1 has changed. The TF-IDF value for `document` is `1.22314`, while the value for `first` is `1.51082`, because `first` appears in fewer documents of the corpus. The bag-of-words model cannot capture this and gives both words an importance of `1`.*

We can retrieve the IDF value of each word by accessing the `idf_` attribute


In [5]:
print(vectorizer.idf_)

[1.91629073 1.22314355 1.51082562 1.22314355 1.91629073 1.91629073
 1.         1.91629073 1.22314355]


🤔️ If we multiply the bag-of-words matrix by this IDF vector (the IDF vector is broadcast across the rows), we obtain the TF-IDF matrix calculated by Scikit-Learn. This confirms what we mentioned earlier:
- Scikit-Learn directly uses the output of the bag-of-words model as TF.
- Scikit-Learn's IDF calculation differs from the standard approach.
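
The multiplication can be checked without Scikit-Learn. The sketch below hardcodes the bag-of-words counts from the table above and recomputes the smoothed IDF in plain Python (with NumPy the last step would simply be `bow * idf`):

```python
import math

# Bag-of-words counts copied from the table above
# (columns: and, document, first, is, one, second, the, third, this).
bow = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]
n_docs = len(bow)

# Document frequency: in how many documents each column is non-zero.
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(bow[0]))]

# Smoothed IDF, one value per column.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Scale every row by the IDF vector, column by column.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in bow]

for row in tfidf:
    print([round(v, 5) for v in row])
```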


## Implement TF-IDF manually

We assume that each document in the corpus is already tokenized, and we use Scikit-Learn's definition of TF-IDF.

> 🐛 The code below is not optimized, just for demonstration :)

In [6]:
import math


def TF(word: str, tokenized_document: list[str]) -> int:
    # Scikit-Learn style TF: the raw count, not divided by the document length
    return tokenized_document.count(word)


def IDF(word: str, tokenized_corpus: list[list[str]]) -> float:
    # Number of documents that contain the word at least once
    doc_count_contains_word = sum(1 for doc in tokenized_corpus if word in doc)

    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + len(tokenized_corpus)) / (1 + doc_count_contains_word)) + 1


def TF_IDF(
    word: str, tokenized_document: list[str], tokenized_corpus: list[list[str]]
) -> float:
    return TF(word, tokenized_document) * IDF(word, tokenized_corpus)
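
The functions above reproduce the `norm=None` output. By default, `TfidfVectorizer` uses `norm='l2'`, which divides each row by its Euclidean length after the TF-IDF scores are computed. A minimal sketch of that last step (the helper name `l2_normalize` is hypothetical; the sample row holds the non-zero document1 scores from the matrix printed earlier):

```python
import math


def l2_normalize(row: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length; all-zero rows stay unchanged."""
    length = math.sqrt(sum(v * v for v in row))
    if length == 0.0:
        return list(row)
    return [v / length for v in row]


# Non-zero TF-IDF scores of document1 (document, first, is, the, this).
row = [1.22314355, 1.51082562, 1.22314355, 1.0, 1.22314355]
normalized = l2_normalize(row)
squared_sum = sum(v * v for v in normalized)

print([round(v, 5) for v in normalized])
print(squared_sum)  # close to 1.0, up to floating-point error
```

Normalizing keeps the ranking of words inside a document but makes documents of different lengths comparable.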

## TF-IDF for CodeSearchNet

Let's use the TF-IDF model to generate the feature vector for each code snippet

In [7]:
from utils import get_code, get_token_stream
from tqdm.auto import tqdm
from gensim import corpora


In [8]:
corpus = get_code("test", "python")

In [9]:
py2_cnt, py3_cnt = 0, 0
new_corpus = []
codes = []
for code in tqdm(corpus):
    try:
        codes.append(get_token_stream(code))
        new_corpus.append(code)
        py3_cnt += 1
    except SyntaxError:
        py2_cnt += 1
print(f"Python2: {py2_cnt}, Python3: {py3_cnt}")

100%|██████████| 22176/22176 [00:13<00:00, 1626.14it/s]

Python2: 228, Python3: 21948





In [10]:
corpus = new_corpus

In [11]:
dictionary = corpora.Dictionary(codes)

once_ids = [
    token_id
    for token_id, doc_freq in dictionary.dfs.items()
    if doc_freq == 1
]

dictionary.filter_tokens(once_ids)
dictionary.compactify()

print(dictionary)

Dictionary<31933 unique tokens: ['', '(', ')', ',', ':']...>
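
The filtering step above removes every token that occurs in only one document, presumably to keep the vocabulary small. A rough pure-Python sketch of the same idea, independent of gensim (the function name and the sample documents are made up for illustration):

```python
from collections import Counter


def drop_singletons(docs: list[list[str]]) -> list[list[str]]:
    """Remove tokens whose document frequency is exactly 1."""
    # Count each token once per document, not once per occurrence.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [[t for t in doc if doc_freq[t] > 1] for doc in docs]


docs = [
    ["def", "foo", "(", "x", ")"],
    ["def", "bar", "(", "x", "x", ")"],
]
filtered = drop_singletons(docs)
print(filtered)
```

Note that `x` survives in the second document even though it is repeated there: what matters is the number of documents containing a token, not the number of occurrences.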


First, let's build the bag-of-words matrix:

In [12]:
BoW_matrix_for_code = [dictionary.doc2bow(d) for d in codes]

In [13]:
from gensim.models import TfidfModel

In [14]:
tf_idf_model = TfidfModel(BoW_matrix_for_code, dictionary=dictionary)

Now that we have built the TF-IDF model, we can use it to get the TF-IDF vector of any code snippet.
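
Note that gensim's `TfidfModel` does not use the Scikit-Learn formula. As far as I know, with default settings it multiplies each raw count by $\log_2(N / df)$, drops weights that end up at zero, and scales the vector to unit length. The sketch below mimics that behavior in plain Python; treat it as an assumption to check against the gensim documentation (the function name and the toy statistics are hypothetical):

```python
import math


def gensim_like_tfidf(
    bow: dict[str, int], doc_freq: dict[str, int], n_docs: int
) -> dict[str, float]:
    """Assumed gensim default: count * log2(N / df), then L2-normalize."""
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in bow.items()}
    # A token present in every document gets weight 0 and is dropped.
    weights = {t: w for t, w in weights.items() if w > 0.0}
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0.0:
        return {}
    return {t: w / length for t, w in weights.items()}


# Hypothetical statistics for a corpus of 8 documents.
doc_freq = {"x": 2, "return": 8, "find": 1}
vector = gensim_like_tfidf({"x": 3, "return": 2, "find": 1}, doc_freq, n_docs=8)
print(vector)
```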

In [15]:
from gensim.similarities import Similarity

indexer = Similarity(
    output_prefix=None,
    corpus=tf_idf_model[BoW_matrix_for_code],
    num_features=len(dictionary),
    num_best=3,                  # let's see Top-3 result
)
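
`Similarity` ranks the indexed documents by their cosine similarity to the query vector. A minimal sketch of that measure on sparse `(token_id, weight)` lists, the format gensim uses for vectors (the helper name and the sample vectors are illustrative):

```python
import math


def cosine_similarity(
    a: list[tuple[int, float]], b: list[tuple[int, float]]
) -> float:
    """Cosine similarity between two sparse (token_id, weight) vectors."""
    weights_b = dict(b)
    # Only token ids present in both vectors contribute to the dot product.
    dot = sum(w * weights_b.get(token_id, 0.0) for token_id, w in a)
    norm_a = math.sqrt(sum(w * w for _, w in a))
    norm_b = math.sqrt(sum(w * w for _, w in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query_vec = [(0, 0.6), (3, 0.8)]  # toy query vector
doc_vec = [(0, 1.0)]              # toy document sharing only token 0
similarity = cosine_similarity(query_vec, doc_vec)
print(similarity)
```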

We use the same `query` as before:

In [16]:
query = """def foo(x):
    if x > 5:
        if x > 10:
            return x + 1
        else:
            return x - 1
    else:
        if x < 0:
            return x + 1
        else:
            return x - 1
"""

In [17]:
indexer[tf_idf_model[dictionary.doc2bow(get_token_stream(query))]]

[(5958, 0.8276320695877075),
 (19805, 0.7624242305755615),
 (19669, 0.7616549730300903)]

Compare this to the previous output (generated by the bag-of-words model):
```
[(19669, 0.7191814184188843),
 (19805, 0.705620288848877),
 (5958, 0.6945071220397949)]
```

The TF-IDF model gives `corpus[5958]` the highest similarity score. Let's inspect this function:

In [18]:
print(corpus[5958])

def json_to_dict(x):
    '''OAuthResponse class can't parse the JSON data with content-type
-    text/html and because of a rubbish api, we can't just tell flask-oauthlib to treat it as json.'''
    if x.find(b'callback') > -1:
        # the rubbish api (https://graph.qq.com/oauth2.0/authorize) is handled here as special case
        pos_lb = x.find(b'{')
        pos_rb = x.find(b'}')
        x = x[pos_lb:pos_rb + 1]

    try:
        if type(x) != str:  # Py3k
            x = x.decode('utf-8')
        return json.loads(x, encoding='utf-8')
    except:
        return x


🤔️ Interestingly, this code contains a lot of `x`. I guess that's why the TF-IDF model prefers this function. Let's check the three most important tokens of the query and of two of the candidates:

In [19]:
sorted([
    (dictionary.id2token[k], v)
    for k, v in
    tf_idf_model[dictionary.doc2bow(get_token_stream(query))]
], key=lambda t: t[1], reverse=True)[:3]

[('x', 0.8880760906244697),
 ('1', 0.21030873215563448),
 ('>', 0.20264008358137425)]

In [20]:
sorted([
    (dictionary.id2token[k], v)
    for k, v in
    tf_idf_model[dictionary.doc2bow(get_token_stream(corpus[5958]))]
], key=lambda t: t[1], reverse=True)[:3]

[('x', 0.8798645408690431),
 ('find', 0.369319615724186),
 ('encoding', 0.12512606274449986)]

In [21]:
sorted([
    (dictionary.id2token[k], v)
    for k, v in
    tf_idf_model[dictionary.doc2bow(get_token_stream(corpus[19669]))]
], key=lambda t: t[1], reverse=True)[:3]

[('x', 0.8085718239321962),
 ('SUPPRESS_ERRORS', 0.4438771127891553),
 ('256', 0.2014353828377701)]

That's true: `x` has a higher weight in `corpus[5958]` (about `0.88`) than in `corpus[19669]` (about `0.81`), so it matches the `x`-heavy query more strongly.
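
To put a rough number on this: assuming the vectors returned by the model are unit-length and the score is a cosine similarity (both are my understanding of gensim's defaults), the score is a plain dot product, so the part contributed by `x` is the product of its two weights. The values below are copied from the outputs above:

```python
# Weights of token `x`, copied from the outputs above.
x_in_query = 0.8880760906244697
x_in_5958 = 0.8798645408690431
x_in_19669 = 0.8085718239321962

# Similarity scores reported by the indexer for the same two documents.
score_5958 = 0.8276320695877075
score_19669 = 0.7616549730300903

# With unit-length vectors, each shared token adds weight_a * weight_b to the score.
contrib_5958 = x_in_query * x_in_5958
contrib_19669 = x_in_query * x_in_19669

print(f"5958:  x contributes {contrib_5958:.3f} of {score_5958:.3f}")
print(f"19669: x contributes {contrib_19669:.3f} of {score_19669:.3f}")
```

Under those assumptions, `x` alone accounts for most of both scores, and its slightly larger weight in `corpus[5958]` explains nearly all of the gap between the two candidates.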