# **TF-IDF**
TF-IDF is one of the most popular term-weighting schemes in use today. Instead of the raw counts stored in the vectors so far, it uses a weight that combines how often a term occurs in a given text with how common the term is across the whole collection.

<div align="center" style="margin-top: 40px;">
    <img src="./images/image.png" alt="Alt text" width="400"/>
</div>

### **Example**

In [1]:
C = ['The who is the band!', 'who is the band?', 'The band who plays the who.']

print('C has %d texts:' % len(C))
for i, text in enumerate(C, start=1):
    print(f"t{i} = {text}")

C has 3 texts:
t1 = The who is the band!
t2 = who is the band?
t3 = The band who plays the who.


In [3]:
import re

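def pre_process_corpus(corpus):
    # Lowercase every text so that "The" and "the" count as the same term.
    new_corpus = [doc.lower() for doc in corpus]
    # Match punctuation only when neither neighbouring character is a digit,
    # so numbers such as 3.14 keep their separators.
    regex = r"(?<!\d)[\!\?.,;:-](?!\d)"
    return [re.sub(regex, "", doc) for doc in new_corpus]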
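To make the behaviour of the pattern above easier to see, here is a minimal sketch that applies it to a few strings. The helper name `strip_punct` and the last two sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as in pre_process_corpus: a punctuation mark is removed
# only when neither the character before nor the one after is a digit.
PUNCT = r"(?<!\d)[\!\?.,;:-](?!\d)"

def strip_punct(text):
    # Hypothetical helper: lowercase, then drop the matched punctuation.
    return re.sub(PUNCT, "", text.lower())

print(strip_punct("The who is the band!"))  # the who is the band
print(strip_punct("Version 2.0 is out!"))   # version 2.0 is out
print(strip_punct("well-known band"))       # wellknown band
```

The last line shows a side effect worth keeping in mind: hyphens between letters are deleted rather than replaced by a space, so hyphenated words are merged into a single token.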

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Lowercase the texts and strip punctuation before vectorizing.
corpus = pre_process_corpus(C)
print(corpus)

['the who is the band', 'who is the band', 'the band who plays the who']


In [5]:
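vectorizer = CountVectorizer()
# Sparse document-term matrix: one row per text, one column per term.
doc_term_matriz = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# toarray() densifies the sparse matrix; the .A shorthand is not available in recent SciPy releases.
print(pd.DataFrame(doc_term_matriz.toarray(), columns=terms).to_string())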

   band  is  plays  the  who
0     1   1      0    2    1
1     1   1      0    1    1
2     1   0      1    2    2


In [6]:
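from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight the raw counts with TF-IDF (default settings: smoothed IDF, L2-normalized rows).
transformer = TfidfTransformer()
tf_idf_matrix = transformer.fit_transform(doc_term_matriz)

# toarray() densifies the sparse matrix; the .A shorthand is not available in recent SciPy releases.
print(pd.DataFrame(tf_idf_matrix.toarray(), columns=terms).to_string())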

       band        is     plays       the       who
0  0.361359  0.465315  0.000000  0.722718  0.361359
1  0.463334  0.596627  0.000000  0.463334  0.463334
2  0.290291  0.000000  0.491506  0.580583  0.580583


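Note that a term's weight grows the more often it occurs in a text and the fewer texts in the collection contain it. For example, in the first text "is" and "band" each occur once, but "is" receives the higher weight (0.465 versus 0.361) because it appears in only two of the three texts, while "band" appears in all three.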
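The weights in the table above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming the `TfidfTransformer` defaults (`smooth_idf=True`, `norm="l2"`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is divided by its Euclidean length. The function name `tf_idf_rows` is an illustrative assumption, and the counts are copied from the `CountVectorizer` output.

```python
import math

# Term counts copied from the CountVectorizer output
# (columns: band, is, plays, the, who).
counts = [
    [1, 1, 0, 2, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 2, 2],
]

def tf_idf_rows(counts):
    """Hypothetical helper: smoothed TF-IDF with L2-normalized rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of texts that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # Euclidean length of the row (assumes the text has at least one term).
        norm = math.sqrt(sum(x * x for x in weighted))
        rows.append([x / norm for x in weighted])
    return rows

weights = tf_idf_rows(counts)
for row in weights:
    print(["%.6f" % x for x in row])
```

Terms that occur in every text ("band", "the", "who") get an IDF of exactly 1, so their weights differ only through the raw counts and the row normalization; "is" and "plays" are boosted because fewer texts contain them.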