In [1]:
## TF-IDF

Term Frequency (TF):

Measures how frequently a term appears in a document.

Calculated as the ratio of the number of times a term occurs in a document to the total number of terms in that document.

Higher TF values indicate that a term appears more frequently in a particular document.
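As a minimal sketch of the ratio described above, the snippet below computes term frequencies for a single sentence. The helper name `term_frequency` and the lowercase, whitespace-based tokenizer are assumptions made for illustration; they are not how scikit-learn tokenizes text.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return each term's count divided by the total number of terms."""
    # Naive tokenizer (an assumption): lowercase, then split on whitespace.
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("It was the best of times")
# Each of the six words occurs once, so every term has TF = 1/6.
print(tf["best"])
```

Because every count is divided by the same document length, the TF values of one document always sum to 1.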

Inverse Document Frequency (IDF):

Measures how rare, and therefore how informative, a term is across the entire collection of documents.

In its basic form, calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term.

Higher IDF values indicate that a term is less common across the document collection.
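The basic (unsmoothed) definition above can be sketched in a few lines. The function name `inverse_document_frequency`, the sample sentences, and the whitespace tokenizer are assumptions chosen for illustration, and the function assumes the term occurs in at least one document (otherwise the division fails).

```python
import math

documents = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]


def inverse_document_frequency(term: str, docs: list[str]) -> float:
    """Basic IDF: log(total documents / documents containing the term)."""
    n_docs = len(docs)
    # Count documents whose (whitespace-split) tokens include the term.
    n_containing = sum(1 for doc in docs if term in doc.lower().split())
    return math.log(n_docs / n_containing)


idf_it = inverse_document_frequency("it", documents)      # in all 4 documents
idf_age = inverse_document_frequency("age", documents)    # in 2 documents
idf_best = inverse_document_frequency("best", documents)  # in 1 document
print(idf_it, idf_age, idf_best)
```

Under this basic definition a term that appears in every document gets an IDF of exactly 0, and the value grows as the term becomes rarer.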

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["It was the best of times", "it was the worst of times", "it was the age of wisdom", "it was the age of foolishness"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(sorted(vectorizer.vocabulary_))
# encode document
vector = vectorizer.transform([text[0]])


['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']


In [5]:

print(vectorizer.idf_)

[1.51082562 1.91629073 1.91629073 1.         1.         1.
 1.51082562 1.         1.91629073 1.91629073]


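A vocabulary of 10 words is learned from the documents, and each word is assigned a unique integer index in the output vector. An inverse document frequency is then calculated for each word in the vocabulary. The words that appear in all four documents ("it", "of", "the", "was") receive the lowest score. That score is 1.0 rather than 0 because scikit-learn's default IDF is the smoothed variant ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term.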
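The printed IDF values can be checked by hand. The sketch below assumes scikit-learn's default smoothing (`smooth_idf=True`), under which IDF is ln((1 + n) / (1 + df)) + 1; the `doc_freq` dictionary is filled in manually from the four sentences rather than computed by the library.

```python
import math

n_docs = 4
# Number of documents containing each term, counted by hand (assumption:
# same four sentences as in the cell above).
doc_freq = {"age": 2, "best": 1, "it": 4}

smoothed_idf = {
    term: math.log((1 + n_docs) / (1 + df)) + 1
    for term, df in doc_freq.items()
}

for term, value in smoothed_idf.items():
    print(term, round(value, 8))
```

The three results should agree with the corresponding entries of `vectorizer.idf_` printed above: about 1.5108 for "age", about 1.9163 for "best", and 1.0 for "it".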

In [6]:
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 10)
[[0.         0.60735961 0.         0.31694544 0.31694544 0.31694544
  0.4788493  0.31694544 0.         0.        ]]


By default, each document vector is scaled to unit Euclidean (L2) length, so every score falls between 0 and 1. The encoded document vectors can then be used directly with most machine learning algorithms.
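To see where the encoded values come from, the sketch below reproduces the normalization by hand for the first sentence. It assumes L2 normalization (scikit-learn's default `norm='l2'`) and copies the IDF values from the earlier output; since each word occurs once in that sentence, its raw TF-IDF weight equals its IDF.

```python
import math

# Raw TF-IDF weights for "It was the best of times": count (1) times IDF.
raw = {
    "best": 1.91629073,
    "it": 1.0,
    "of": 1.0,
    "the": 1.0,
    "times": 1.51082562,
    "was": 1.0,
}

# Divide every weight by the vector's Euclidean length.
length = math.sqrt(sum(weight ** 2 for weight in raw.values()))
normalized = {term: weight / length for term, weight in raw.items()}

for term, weight in normalized.items():
    print(term, round(weight, 6))
```

The results should agree with the nonzero entries of the encoded vector above (about 0.6074 for "best", 0.4788 for "times", and 0.3169 for the four common words), and the squared values sum to 1.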