**TF-IDF** can be seen as a function that ranks words by their importance to a document within a collection of documents. A word's weight grows with how often it appears in that document and shrinks with how many documents contain it, following the idea that if a word is way too common, it shouldn't be that important.

Mathematically, it is the _term frequency_ (TF) multiplied by the _inverse document frequency_ (IDF). Given a term _t_ and a document _d_, our `TF` is:

$$TF(t,d) = \frac{count(t,d)}{count(*,d)}$$

Where `count(t,d)` is the number of times `t` appears in `d`, and `count(*,d)` is the total number of terms in `d`. In other words, if a word appears a lot in a document, it's probably somewhat relevant to that document.
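
To make the TF formula concrete, here is a minimal sketch in plain Python. It assumes naive lowercase, whitespace-based tokenization, and the helper name `term_frequency` and the sample sentence are made up for illustration.

```python
from collections import Counter


def term_frequency(term, document):
    """TF(t, d) = count(t, d) / count(*, d), using naive whitespace tokens."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # assumption: an empty document gives every term a TF of 0
    return Counter(tokens)[term] / len(tokens)


doc = "the cat sat on the mat"
print(term_frequency("the", doc))  # 2 of 6 tokens, about 0.333
print(term_frequency("cat", doc))  # 1 of 6 tokens, about 0.167
print(term_frequency("dog", doc))  # absent term, 0.0
```

Note that on its own TF rewards "the" more than "cat", which is exactly the problem IDF addresses.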

But what about words that are common yet don't bring any value for us? That's where IDF comes in: it weights down such words by looking at how many of our documents they appear in. If a word shows up in most documents, it probably isn't that relevant; it's just a common word. Here's what it looks like in maths language:

$$IDF(t,D) = \log_e\left(\frac{D}{amount(t,D)}\right)$$

Where `D` is the total number of documents, and `amount(t,D)` is the number of documents containing `t`.
The $\log_e$ dampens the ratio: without it, the `IDF` of a rare term would explode once we have a large number of documents.
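
As a rough illustration of the IDF formula, the sketch below computes it for a tiny made-up corpus. The function name `inverse_document_frequency` and the three sample documents are hypothetical, and tokenization is again plain whitespace splitting.

```python
import math


def inverse_document_frequency(term, documents):
    """IDF(t, D) = ln(D / amount(t, D)) with naive whitespace tokens."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    containing = sum(1 for tokens in tokenized if term in tokens)
    if containing == 0:
        # The plain formula is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(documents) / containing)


docs = ["the cat sat", "the dog ran", "a bird flew"]
print(inverse_document_frequency("the", docs))   # ln(3/2), about 0.405
print(inverse_document_frequency("bird", docs))  # ln(3/1), about 1.099
```

The rarer term ("bird") gets the larger weight, while a term present in every document would get ln(1) = 0.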

Finally, we have:

$$TF\text{-}IDF = TF(t,d)*IDF(t,D)$$
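
Putting the two pieces together, a from-scratch version of the product could look like the sketch below. It follows the plain formulas in this section (not scikit-learn's variant); the name `tf_idf`, the sample documents, and the choice to score a term that appears nowhere as 0 are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(term, document, documents):
    """TF-IDF = TF(t, d) * IDF(t, D), using naive whitespace tokens."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    tf = Counter(tokens)[term] / len(tokens)
    doc_freq = sum(1 for d in documents if term in d.lower().split())
    # Assumption: a term that appears in no document gets a score of 0.
    idf = math.log(len(documents) / doc_freq) if doc_freq else 0.0
    return tf * idf


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
print(tf_idf("the", docs[0], docs))  # frequent in d, but found in 2 of 3 documents
print(tf_idf("cat", docs[0], docs))  # rarer in d, but found in only 1 document
```

Even though "the" occurs twice in the first document and "cat" only once, "cat" ends up with the higher score because it is specific to that document.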

Here we'll learn how to implement it and make proper use of it with scikit-learn.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

In [3]:
# Create the vectorizer, fit it on the corpus, then transform the corpus into TF-IDF vectors
v = TfidfVectorizer()
v.fit(corpus)
transform_output = v.transform(corpus)

In [4]:
# Print the vocabulary (term -> column index) and its size

print(v.vocabulary_)
print(len(v.vocabulary_))

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}
28


In [5]:
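# Print the IDF of each word. With default settings, scikit-learn uses a smoothed
# variant, ln((1 + n) / (1 + df)) + 1, so the values differ from the plain formula above.

all_feature_names = v.get_feature_names_out()

for word in all_feature_names:
    # Look up the word's column index in the vocabulary
    indx = v.vocabulary_.get(word)

    # The IDF weights are stored in the same column order
    idf_score = v.idf_[indx]

    print(f"{word} : {idf_score}")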

already : 2.386294361119891
am : 2.386294361119891
amazon : 2.386294361119891
and : 2.386294361119891
announcing : 1.2876820724517808
apple : 2.386294361119891
are : 2.386294361119891
ate : 2.386294361119891
biryani : 2.386294361119891
dot : 2.386294361119891
eating : 1.9808292530117262
eco : 2.386294361119891
google : 2.386294361119891
grapes : 2.386294361119891
iphone : 2.386294361119891
ironman : 2.386294361119891
is : 1.1335313926245225
loki : 2.386294361119891
microsoft : 2.386294361119891
model : 2.386294361119891
new : 1.2876820724517808
pixel : 2.386294361119891
pizza : 2.386294361119891
surface : 2.386294361119891
tesla : 2.386294361119891
thor : 2.386294361119891
tomorrow : 1.2876820724517808
you : 2.386294361119891
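
These values do not match the plain `IDF` formula from the introduction, because `TfidfVectorizer` applies smoothing by default: it computes ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below checks that formula against the printed values using only the standard library; the helper name `smooth_idf` is hypothetical, and the document frequencies are counted by hand from the corpus above.

```python
import math


def smooth_idf(n_documents, document_frequency):
    """Smoothed IDF as used by scikit-learn's default settings."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


n_documents = 7  # size of the corpus above

# Document frequencies counted by hand from the corpus
document_frequencies = {
    "pizza": 1,       # only in the first document
    "eating": 2,      # first and last documents
    "announcing": 5,  # the five product announcements
    "is": 6,          # every document except the last
}

for word, df in document_frequencies.items():
    print(f"{word} : {smooth_idf(n_documents, df):.6f}")
```

The `+ 1` at the end means a term that appears in every document still keeps a weight of 1 instead of being zeroed out.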


In [6]:
print(corpus[:2])
# Convert the sparse TF-IDF matrix to a dense array once, then take the first three rows
dense_output = transform_output.toarray()
v1 = dense_output[0]
v2 = dense_output[1]
v3 = dense_output[2]

print(v1)

['Thor eating pizza, Loki is eating pizza, Ironman ate pizza already', 'Apple is announcing new iphone tomorrow']
[0.24266547 0.         0.         0.         0.         0.
 0.         0.24266547 0.         0.         0.40286636 0.
 0.         0.         0.         0.24266547 0.11527033 0.24266547
 0.         0.         0.         0.         0.72799642 0.
 0.         0.24266547 0.         0.        ]
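
To see where these numbers come from, the sketch below rebuilds the first row by hand. It assumes the vectorizer's default behaviour: lowercase tokens of two or more word characters, raw counts as the term frequency, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `tokenize` and `tfidf_row` are made up for illustration, and only the standard library is used.

```python
import math
import re
from collections import Counter

corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes",
]

# Tokens of two or more word characters, mirroring the default token pattern
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


tokenized = [tokenize(doc) for doc in corpus]
n_docs = len(corpus)
# Number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_row(tokens):
    """Raw count * smoothed IDF for each term, then L2-normalize the row."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row = tfidf_row(tokenized[0])
for term in sorted(row):
    print(f"{term} : {row[term]:.8f}")
```

The largest entry belongs to "pizza", which occurs three times in the first document and nowhere else, while "is" gets the smallest non-zero weight because almost every document contains it.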


With this, we now have a vector representation of every document. These representations can be processed numerically, for example by measuring how far apart two documents are.

In [8]:
from scipy.spatial.distance import euclidean, cosine

In [9]:
# Calculating distances (scipy's `cosine` returns the cosine distance, i.e. 1 - cosine similarity)
distance_v1_v2 = euclidean(v1, v2)
distance_v3_v2 = euclidean(v3, v2)

cosine_v1_v2 = cosine(v1, v2)
cosine_v3_v2 = cosine(v3, v2)

# Output results
print("Euclidean distance between v1 and v2:", distance_v1_v2)
print("Euclidean distance between v3 and v2:", distance_v3_v2)
print("Cosine similarity (1 - cosine) between v1 and v2:", 1 - cosine_v1_v2)
print("Cosine similarity (1 - cosine) between v3 and v2:", 1 - cosine_v3_v2)

Euclidean distance between v1 and v2: 1.392046685146447
Euclidean distance between v3 and v2: 1.1360708006098064
Cosine similarity (1 - cosine) between v1 and v2: 0.0311030131863943
Cosine similarity (1 - cosine) between v3 and v2: 0.35467156800089694
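
Both measures tell the same story here: the Tesla announcement (v3) is much closer to the Apple announcement (v2) than the Thor sentence (v1) is. That agreement is expected, because the TF-IDF rows are L2-normalized by default, and for unit vectors the Euclidean distance is fully determined by the cosine similarity: d = sqrt(2 - 2 * similarity). The sketch below checks this relation against the printed numbers; the function name is hypothetical.

```python
import math


def euclidean_from_cosine_similarity(similarity):
    """Euclidean distance between two unit-length vectors with the given cosine similarity."""
    return math.sqrt(2.0 - 2.0 * similarity)


# Cosine similarities printed above
similarity_v1_v2 = 0.0311030131863943
similarity_v3_v2 = 0.35467156800089694

print(euclidean_from_cosine_similarity(similarity_v1_v2))  # about 1.392
print(euclidean_from_cosine_similarity(similarity_v3_v2))  # about 1.136
```

So with normalized TF-IDF vectors, ranking documents by Euclidean distance or by cosine similarity gives the same order.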
