#### TF-IDF 

TF-IDF (term frequency-inverse document frequency) is a numerical statistic used in information retrieval and text mining to evaluate how important a word is to a document relative to a collection of documents (the corpus). The TF-IDF value increases with the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word. This highlights terms that are distinctive of a particular document rather than common across the entire corpus.

$\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)$

Where:

- $\text{TF}(t,d)$ is the term frequency, representing how often term $t$ appears in document $d$.
- $\text{IDF}(t,D)$ is the inverse document frequency, calculated as $\log(N/n_t)$, where
  - $N$ is the total number of documents in the corpus, and
  - $n_t$ is the number of documents containing term $t$.
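
The formula can be checked by hand on a tiny corpus. The following is a minimal sketch, assuming raw counts for TF and the natural logarithm for IDF (one common variant among several); the helper `tf_idf` and the pre-tokenized `docs` are illustrative names, not part of any library.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus_tokens):
    """Textbook TF-IDF: raw count of `term` in the document times log(N / n_t)."""
    tf = Counter(doc_tokens)[term]
    # n_t: number of documents that contain the term at least once
    n_t = sum(1 for tokens in corpus_tokens if term in tokens)
    if n_t == 0:
        return 0.0  # assumption: a term absent from the corpus gets zero weight
    idf = math.log(len(corpus_tokens) / n_t)
    return tf * idf

# Hypothetical pre-tokenized corpus (lowercased, punctuation removed)
docs = [
    "this is the first document".split(),
    "this document is the second document".split(),
]

print(tf_idf("first", docs[0], docs))     # appears in one of two documents: 1 * log(2/1)
print(tf_idf("document", docs[1], docs))  # appears in every document: 2 * log(2/2) = 0
```

A term that occurs in every document gets an IDF of zero under this plain formula, no matter how often it appears, which is the "offset" described above.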

#### TF-IDF Vectorizer

In scikit-learn, `TfidfVectorizer` is a transformer that converts a collection of raw documents to a matrix of TF-IDF features. It is equivalent to `CountVectorizer` followed by `TfidfTransformer`. The key parameters are:

1. `sublinear_tf` (default=False): If True, apply sublinear scaling to the term frequency, i.e., replace TF with $1 + \log(\text{TF})$.
2. `smooth_idf` (default=True): Add 1 to document frequencies, as if one extra document contained every term exactly once, to prevent zero divisions.
3. `use_idf` (default=True): Enable the IDF (inverse document frequency) reweighting.
4. `ngram_range` (default=(1, 1)): The range of n-grams to extract, e.g., `(1, 1)` for unigrams only, `(1, 2)` for unigrams and bigrams.
5. `max_df` (default=1.0): Ignore terms with a document frequency strictly higher than the given threshold. A float is read as a proportion of documents, an integer as an absolute count.
6. `min_df` (default=1): Ignore terms with a document frequency strictly lower than the given threshold. A float is read as a proportion of documents, an integer as an absolute count.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
corpus = ["This is the first document.", "This document is the second document."]

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the data
tfidf_matrix = vectorizer.fit_transform(corpus)

# The resulting matrix is a sparse matrix of TF-IDF features
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Get the feature names
feature_names = vectorizer.get_feature_names_out()
print("\nFeature Names(tokens):", feature_names)


TF-IDF Matrix:
[[0.4090901  0.57496187 0.4090901  0.         0.4090901  0.4090901 ]
 [0.66758217 0.         0.33379109 0.46913173 0.33379109 0.33379109]]

Feature Names(tokens): ['document' 'first' 'is' 'second' 'the' 'this']
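
These values differ from what the plain $\log(N/n_t)$ formula would give. With its default settings, `TfidfVectorizer` uses a smoothed IDF, $\ln\frac{1+N}{1+n_t} + 1$, and then scales each row to unit Euclidean (L2) length. The sketch below reproduces the matrix in plain Python under those assumptions; the documents are tokenized by hand here, and `tfidf_row` is an illustrative helper, not a scikit-learn function.

```python
import math
from collections import Counter

# Hand-tokenized version of the corpus (lowercased, punctuation removed)
docs = [
    ["this", "is", "the", "first", "document"],
    ["this", "document", "is", "the", "second", "document"],
]

vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency and smoothed IDF: ln((1 + N) / (1 + n_t)) + 1
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print([round(w, 8) for w in row])
```

Because of the `+ 1` term, words that occur in every document (such as "the") keep a nonzero weight instead of being zeroed out, and the L2 normalization explains why each row's values depend on how many terms the document contains.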
