**Term Frequency (TF):** Measures how often a term appears in a document.

**Inverse Document Frequency (IDF):** Measures how rare a term is across the entire corpus. Terms that appear in fewer documents receive a higher weight.

**TfidfVectorizer** is a tool provided by the **scikit-learn library** in Python that transforms text data into feature vectors.

These vectors are based on the *Term Frequency-Inverse Document Frequency (TF-IDF) metric*, which reflects how important a word is to a document in a collection of documents (corpus).
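To make the two definitions concrete, here is a minimal sketch that computes TF-IDF by hand using only the standard library. It assumes scikit-learn's documented defaults: raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names (`tokenize`, `tfidf`, `doc_freq`) are illustrative, and the regular expression only approximates the default tokenizer.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # approximating TfidfVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {
    term: math.log((1 + n_docs) / (1 + count)) + 1
    for term, count in doc_freq.items()
}

def tfidf(tokens):
    # TF is the raw count of each term in the document.
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = [tfidf(tokens) for tokens in tokenized]

# Print the non-zero weights of the first document.
for term, weight in sorted(weights[0].items()):
    print(f"{term:>4}  {weight:.6f}")
```

Under these assumptions, "the" appears in every document and gets the lowest IDF, yet it still carries the largest weight in the first document because it occurs twice there.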

#Used in natural language processing and text analysis tasks:
#1. Information Retrieval and Search Engines
#2. Text Classification
#3. Clustering and Topic Modeling
#4. Recommendation Systems
#5. Document Similarity
#6. Keyword Extraction
#7. Summarization

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Sample documents
documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog."
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
X = vectorizer.fit_transform(documents)

# Convert to DataFrame
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Display the DataFrame
print(df)


        cat    chased       dog       log       mat        on       sat  \
0  0.374207  0.000000  0.000000  0.000000  0.492038  0.374207  0.374207   
1  0.000000  0.000000  0.374207  0.492038  0.000000  0.374207  0.374207   
2  0.403525  0.530587  0.403525  0.000000  0.000000  0.000000  0.000000   

        the  
0  0.581211  
1  0.581211  
2  0.626747  


In [4]:
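# Search engine: rank documents by similarity to a query
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog."
]

# Learn the vocabulary and IDF weights from the corpus
vect = TfidfVectorizer()
x = vect.fit_transform(documents)

# Project the query into the same TF-IDF space
query = "The cat chased the dog."
query_vec = vect.transform([query])

# TF-IDF rows are L2-normalized by default, so the dot product
# computed by linear_kernel equals cosine similarity
scores = linear_kernel(query_vec, x).flatten()

# Sort document indices from most to least similar
ranked_docs = scores.argsort()[::-1]
for idx in ranked_docs:
    print(documents[idx], scores[idx])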



The cat chased the dog. 1.0
The dog sat on the log. 0.5152740657843794
The cat sat on the mat. 0.5152740657843794
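The search cell uses `linear_kernel` (a plain dot product) to obtain similarity scores. A small check, using the rounded values from the TF-IDF table printed earlier, suggests why this matches cosine similarity: each row already has length close to 1, so dividing by the norms changes almost nothing. The helper functions `dot` and `norm` are written here for illustration only.

```python
import math

# Rows 0 and 2 of the TF-IDF table, rounded to six decimals.
# Column order: cat, chased, dog, log, mat, on, sat, the
doc0 = [0.374207, 0.0, 0.0, 0.0, 0.492038, 0.374207, 0.374207, 0.581211]
doc2 = [0.403525, 0.530587, 0.403525, 0.0, 0.0, 0.0, 0.0, 0.626747]

def dot(u, v):
    # Sum of element-wise products.
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    # Euclidean (L2) length of a vector.
    return math.sqrt(dot(u, u))

plain_dot = dot(doc0, doc2)
cosine = plain_dot / (norm(doc0) * norm(doc2))

print("norms: ", round(norm(doc0), 4), round(norm(doc2), 4))
print("dot:   ", round(plain_dot, 4))
print("cosine:", round(cosine, 4))
```

Both values agree with the 0.5153 score reported for "The cat sat on the mat." up to rounding in the table.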


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "The cat chased the dog."
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

similarity_matrix = cosine_similarity(X)

print(similarity_matrix)


[[1.         0.61786795 0.51527407]
 [0.61786795 1.         0.51527407]
 [0.51527407 0.51527407 1.        ]]
