
## Introduction to TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It helps in identifying terms that are more relevant to a particular document while reducing the influence of commonly occurring terms across all documents.

1. **Term Frequency (TF)**: Measures how often a term appears in a document. It is often normalized by the total number of terms in the document so that long and short documents are comparable.
2. **Inverse Document Frequency (IDF)**: Measures how rare a term is across the corpus, typically as the logarithm of the total number of documents divided by the number of documents containing the term. Terms that appear in fewer documents have higher IDF scores.
3. **TF-IDF Score**: The product of TF and IDF. It is high for terms that occur often in a document but in few other documents of the corpus.
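
The three definitions above can be sketched in plain Python. This is a minimal sketch of the textbook variant, assuming whitespace tokenization, TF as count divided by document length, and IDF as ln(N / df). The helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) and the toy corpus are made up for illustration. Libraries such as scikit-learn use a smoothed IDF and normalize each row, so their numbers will differ.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: count of each term divided by the number of terms in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """IDF: ln(N / df), where df is the number of documents containing the term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(tokenized_docs):
    """TF-IDF: product of TF and IDF, one dict of scores per document."""
    idf = inverse_document_frequency(tokenized_docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in tokenized_docs
    ]


# Hypothetical toy corpus, used only for this illustration
toy_corpus = ["the cat sat", "the dog sat", "the bird flew away"]
tokenized = [doc.lower().split() for doc in toy_corpus]
scores = tf_idf(tokenized)

for doc, row in zip(toy_corpus, scores):
    print(doc, {term: round(score, 3) for term, score in row.items()})
```

In this toy corpus, `the` occurs in every document, so its IDF is ln(3/3) = 0 and its score vanishes, while `cat` outscores `sat` because it appears in only one document.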

In [1]:
## Example: Using Scikit-Learn for TF-IDF
# Convert a collection of raw documents to a matrix of TF-IDF features.
# Equivalent to CountVectorizer followed by TfidfTransformer.
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Sample documents
documents = [
    "I love programming in Python",
    "Python programming is fun",
    "Machine learning is fascinating"
]

In [3]:
# Create the TfidfVectorizer object
vectorizer = TfidfVectorizer()

In [4]:
# Fit and transform the documents
X = vectorizer.fit_transform(documents)

In [5]:
# Convert the sparse matrix to a dense array and get the feature names
X_array = X.toarray()
feature_names = vectorizer.get_feature_names_out()

In [6]:
import pandas as pd

# Display the TF-IDF representation as a DataFrame:
# one row per document, one column per vocabulary term
df = pd.DataFrame(X_array, columns=feature_names)
print(df)

   fascinating       fun        in        is  learning      love   machine  \
0     0.000000  0.000000  0.562829  0.000000  0.000000  0.562829  0.000000   
1     0.000000  0.604652  0.000000  0.459854  0.000000  0.000000  0.000000   
2     0.528635  0.000000  0.000000  0.402040  0.528635  0.000000  0.528635   

   programming    python  
0     0.428046  0.428046  
1     0.459854  0.459854  
2     0.000000  0.000000  
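
The values above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses raw term counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. It also lowercases the text and drops single-character tokens, which is why `I` has no column. The following sketch recomputes the first row in plain Python under those defaults; the variable names and the `tfidf_row` helper are illustrative, not part of scikit-learn.

```python
import math
import re
from collections import Counter

documents = [
    "I love programming in Python",
    "Python programming is fun",
    "Machine learning is fascinating",
]

# Lowercase and keep only tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]

# Document frequency: number of documents that contain each term
n_docs = len(tokenized)
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(tokens):
    """Raw count times IDF, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


first_row = tfidf_row(tokenized[0])
for term in sorted(first_row):
    print(f"{term:12s} {first_row[term]:.6f}")
# in           0.562829
# love         0.562829
# programming  0.428046
# python       0.428046
```

`love` and `in` score higher than `programming` and `python` because they occur in only one document, whereas the other two occur in two of the three.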


In [7]:
documents = [
    "Data science is an interdisciplinary field",
    "It uses scientific methods, processes, algorithms",
    "Data science is used for data analysis"
]



# TODO: Create the TfidfVectorizer object


# TODO: Fit and transform the documents


# TODO: Convert to array and get feature names


# TODO: Display the TF-IDF representation
