TF-IDF stands for Term Frequency-Inverse Document Frequency.

#### Definition:

TF-IDF is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus). It refines the Bag of Words model by weighing not only how often a term appears in a document (TF) but also how rare the term is across all documents (IDF), so that words appearing in nearly every document receive less weight.

![1.png](attachment:1.png)
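
For reference, one common textbook formulation of the definition is written out below. It may differ in detail from the figure above, since several variants of TF-IDF are in use. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$; these symbols are chosen for this note only.

$$
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
$$

scikit-learn's TfidfVectorizer uses a slightly different default: the raw count $f_{t,d}$ as TF, a smoothed IDF, and L2 normalization of each document vector.

$$
\mathrm{idf}_{\text{sklearn}}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
$$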

#### Use Cases:
Text Classification: Enhancing features for algorithms like SVM, Naive Bayes, etc.

Information Retrieval: Improving search results by ranking documents based on term relevance.

Keyword Extraction: Identifying significant words or phrases in documents.
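
To make the information retrieval use case concrete, here is a minimal sketch that ranks documents against a query. It relies on scikit-learn's TfidfVectorizer (introduced in the implementation section of this notebook); the three-sentence corpus and the query string are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus and query, chosen only for illustration
corpus = [
    "I love NLP",
    "NLP is great",
    "Cooking pasta is great fun",
]
query = "great NLP"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)  # one L2-normalized row per document
query_vec = vectorizer.transform([query])      # reuse the fitted vocabulary and IDF weights

# Rows are unit length by default, so the dot product equals cosine similarity
scores = (doc_matrix @ query_vec.T).toarray().ravel()

# Document indices ordered from most to least similar to the query
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
for i in ranking:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```

The document that shares both query terms should rank first. A real search system would typically add steps such as stop-word handling and a much larger corpus.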

#### Implementation in Python:
We'll use TfidfVectorizer from scikit-learn (imported as sklearn) to compute TF-IDF.

#### Installation:
Ensure you have scikit-learn installed (for example, with `pip install scikit-learn`) before running the code below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["I love NLP", "NLP is great"]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit the model and transform the documents into a TF-IDF matrix
X = vectorizer.fit_transform(documents)

# Print the feature names (vocabulary)
print("Feature Names:", vectorizer.get_feature_names_out())

# Print the TF-IDF vectors
print("TF-IDF Vectors:\n", X.toarray())
```



```
Feature Names: ['great' 'is' 'love' 'nlp']
TF-IDF Vectors:
 [[0.         0.         0.81480247 0.57973867]
 [0.6316672  0.6316672  0.         0.44943642]]
```
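
The matrix is easier to read when each weight is paired with its term. The sketch below does that with the printed values copied in by hand (so it runs without scikit-learn) and sorts each document's terms by weight, which is the basic idea behind keyword extraction. The variable names are illustrative only.

```python
# Values copied from the output above
feature_names = ["great", "is", "love", "nlp"]
tfidf_rows = [
    [0.0, 0.0, 0.81480247, 0.57973867],
    [0.6316672, 0.6316672, 0.0, 0.44943642],
]
documents = ["I love NLP", "NLP is great"]

ranked_terms = []
for doc, row in zip(documents, tfidf_rows):
    # Keep only terms that occur in this document (non-zero weight)
    weights = [(term, w) for term, w in zip(feature_names, row) if w > 0]
    # Highest weight first; sorted() is stable, so ties keep vocabulary order
    ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)
    ranked_terms.append(ranked)
    print(f"{doc!r}: {ranked}")
```

In both documents "nlp" gets the lowest non-zero weight, because it is the only term that appears in every document.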


#### Explanation:
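TfidfVectorizer: Converts a collection of raw documents to a matrix of TF-IDF features. By default it lowercases the text and ignores single-character tokens, which is why "I" does not appear in the vocabulary. It also normalizes each document vector to unit (L2) length.

fit_transform: Learns the vocabulary and IDF weights from the documents, then transforms the documents into a TF-IDF matrix with one row per document and one column per term.

get_feature_names_out(): Returns the feature names (words in the vocabulary), in the same order as the matrix columns.

toarray(): Converts the sparse matrix to a dense array. This is convenient for printing small examples but can use a lot of memory on a large corpus.

Reading the output: "nlp" occurs in both documents, so its IDF is lower and it receives a smaller weight (0.58 and 0.45) than terms that occur in only one document, such as "love" (0.81).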
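
As a rough check on what the vectorizer is doing, the sketch below recomputes the printed matrix by hand using only the standard library. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, L2 row normalization) and stands in for the real tokenizer with a whitespace split that keeps tokens of two or more characters, a simplification that happens to be enough for these two sentences. The names tokenized, idf, and rows are illustrative.

```python
import math

documents = ["I love NLP", "NLP is great"]

# Simplified stand-in for the default tokenizer: lowercase, drop 1-character tokens
tokenized = [[w for w in doc.lower().split() if len(w) >= 2] for doc in documents]
vocabulary = sorted({w for doc in tokenized for w in doc})

n_docs = len(documents)
# Document frequency: number of documents containing each term
df = {t: sum(t in doc for doc in tokenized) for t in vocabulary}
# Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

rows = []
for doc in tokenized:
    raw = [doc.count(t) * idf[t] for t in vocabulary]  # term count times IDF
    norm = math.sqrt(sum(v * v for v in raw))          # L2 norm of the row
    rows.append([v / norm for v in raw])

print(vocabulary)
for row in rows:
    print([round(v, 8) for v in row])
```

The values should agree with the TF-IDF vectors printed by the notebook cell, up to rounding.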

#### Conclusion:
TF-IDF weights each word by how important it is to a document relative to the rest of the corpus. It improves on the plain Bag of Words model by down-weighting terms that appear in most documents, which gives a more informative representation for tasks such as text classification, information retrieval, and keyword extraction. It is a common preprocessing step when preparing text data for machine learning models.