# TF-IDF Model

## Term Frequency

- Measures how frequently a term occurs in a document.
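- Because documents differ in length, a term may occur more often in a long document than in a short one.
- The raw term frequency $tf(t,d)$ of term <i>t</i> in document <i>d</i> is the number of times that <i>t</i> occurs in <i>d</i>. To compare documents of different lengths, this count is often divided by the number of terms in <i>d</i>.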
- Greater when a term is frequent in the document.

Consider the following documents:

- Doc1 = "The sky is blue."
- Doc2 = "The sun is bright today."
- Doc3 = "The sun in the sky is bright."
- Doc4 = "We can see the shining sun, the bright sun."

The term frequency matrix below gives the number of times each term appears in each document. Stop words (here "the", "is", "in", and "we") are excluded from the matrix.

|Terms|Doc1|Doc2|Doc3|Doc4|
| --- | --- | --- |--- | --- |
|blue    |1| 0| 0| 0|
|bright  |0 |1 |1 |1|
|can     |0 |0 |0 |1|
|see     |0 |0 |0 |1|
|shining |0 |0 |0 |1|
|sky    | 1 |0 |1 |0|
|sun   |  0 |1| 1| 2|
|today|   0 |1 |0 |0|

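Normalizing by document length: Doc2 contains three non-stop-word terms ("sun", "bright", "today") and "sun" occurs once, so tf('sun', Doc2) = 1/3 ≈ 0.33.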
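
The length-normalized term frequency can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the stop-word set is an assumption inferred from the terms missing in the matrix above, and the helper names `tokenize` and `term_frequency` are hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list, inferred from the terms absent from the matrix above.
STOP_WORDS = {"the", "is", "in", "we"}

docs = {
    "Doc1": "The sky is blue.",
    "Doc2": "The sun is bright today.",
    "Doc3": "The sun in the sky is bright.",
    "Doc4": "We can see the shining sun, the bright sun.",
}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop the assumed stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def term_frequency(term, text):
    """Occurrences of `term` divided by the number of remaining tokens."""
    tokens = tokenize(text)
    return Counter(tokens)[term] / len(tokens)

print(round(term_frequency("sun", docs["Doc2"]), 2))  # 0.33
```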

## Inverse Document Frequency

- A word is not very informative if it occurs in all documents.
- IDF estimates the rarity of a term across the whole document collection.
- If a term occurs in all the documents of the collection, its IDF is zero.
- Greater when the term is <b>rare</b> in the collection.

<center>$idf(t) = \log\left(\frac{D}{df_t}\right)$</center>

where:
- $D$ = number of documents in the collection, i.e., the size of the document space,
- $df_t$ = number of documents in which term <i>t</i> appears, i.e., the document frequency.


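The term "sun" occurs in three of the four documents (Doc2, Doc3, and Doc4), so $df_{sun} = 3$. Using the natural logarithm:

$idf('sun') = \log\frac{4}{3} \approx 0.288$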
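
As a rough sketch, document frequency and IDF can be computed directly from the term frequency matrix. The counts below are copied from that matrix. The natural logarithm is an assumption, since the formula does not fix a base, and the names `term_counts`, `document_frequency`, and `idf` are illustrative.

```python
import math

# Per-document counts copied from the term frequency matrix (Doc1..Doc4).
term_counts = {
    "blue":    [1, 0, 0, 0],
    "bright":  [0, 1, 1, 1],
    "can":     [0, 0, 0, 1],
    "see":     [0, 0, 0, 1],
    "shining": [0, 0, 0, 1],
    "sky":     [1, 0, 1, 0],
    "sun":     [0, 1, 1, 2],
    "today":   [0, 1, 0, 0],
}
N_DOCS = 4

def document_frequency(term):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for count in term_counts[term] if count > 0)

def idf(term):
    """log(D / df_t); the natural log is an assumption made for this sketch."""
    return math.log(N_DOCS / document_frequency(term))

for term in term_counts:
    print(f"{term:8s} df={document_frequency(term)} idf={idf(term):.3f}")
```

Rare terms such as "blue" (one document) receive the largest IDF, while "sun" and "bright" (three documents each) receive the smallest.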

## Tf-idf

The tf-idf weight of a term in a document is the product of its tf weight and its idf weight:

<center>$w(t,d) = tf(t,d) \cdot \log\left(\frac{D}{df_t}\right)$</center>

Since "sun" occurs in 3 of the 4 documents, tf-idf('sun', Doc2) = 0.33 × log(4/3) ≈ 0.33 × 0.288 ≈ 0.096 (natural logarithm).
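
The product of the two weights can be sketched for the whole collection as follows. This assumes length-normalized tf, the natural logarithm for idf, and documents that have already been tokenized with the stop words "the", "is", "in", and "we" removed. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

# Tokenized documents, with the assumed stop words already removed.
docs = [
    ["sky", "blue"],                                     # Doc1
    ["sun", "bright", "today"],                          # Doc2
    ["sun", "sky", "bright"],                            # Doc3
    ["can", "see", "shining", "sun", "bright", "sun"],   # Doc4
]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
print(round(weights[1]["sun"], 3))  # 0.096
```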

## Using TfidfVectorizer to compute tf-idf scores

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The sky is blue.',
    'The sun is bright today.',
    'The sun in the sky is bright.',
    'We can see the shining sun, the bright sun.',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out().tolist())
print(X.shape)
print(X.toarray())
```

```
['blue', 'bright', 'can', 'in', 'is', 'see', 'shining', 'sky', 'sun', 'the', 'today', 'we']
(4, 12)
[[0.65919112 0.         0.         0.         0.42075315 0.
  0.         0.51971385 0.         0.34399327 0.         0.        ]
 [0.         0.40412895 0.         0.         0.40412895 0.
  0.         0.         0.40412895 0.33040189 0.63314609 0.        ]
 [0.         0.3218464  0.         0.50423458 0.3218464  0.
  0.         0.39754433 0.3218464  0.52626104 0.         0.        ]
 [0.         0.23910199 0.37459947 0.         0.         0.37459947
  0.37459947 0.         0.47820398 0.39096309 0.         0.37459947]]
```
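
These scores differ from the hand calculation for three reasons: with its default settings, `TfidfVectorizer` keeps stop words, uses the smoothed formula $idf(t) = \ln\frac{1 + D}{1 + df_t} + 1$, and scales each document row to unit Euclidean length. The sketch below reproduces those defaults in plain Python so the numbers can be checked by hand. The variable names are illustrative, and the regular expression is a simplified stand-in for the vectorizer's default tokenizer (tokens of two or more word characters).

```python
import math
import re
from collections import Counter

corpus = [
    'The sky is blue.',
    'The sun is bright today.',
    'The sun in the sky is bright.',
    'We can see the shining sun, the bright sun.',
]

# Lowercase and keep tokens of at least two word characters; no stop-word removal.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(corpus)

# Smoothed idf: ln((1 + D) / (1 + df_t)) + 1
df = Counter(term for doc in tokenized for term in set(doc))
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

rows = []
for doc in tokenized:
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocab]  # raw count times idf
    norm = math.sqrt(sum(v * v for v in raw))           # L2 norm of the row
    rows.append([v / norm for v in raw])

for row in rows:
    print([round(v, 4) for v in row])
```

Because of the `+ 1` term, a word such as "the" that occurs in every document still receives a non-zero weight, unlike in the unsmoothed formula.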
