In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a corpus (collection of documents)
corpus = [
    "The research in Physics is interesting and difficult.",
    "The research in Math is interesting and not difficult.",
    "The Research in AI needs a deep understanding of Math",
    "is the research really difficult?",
]



In [2]:
corpus

['The research in Physics is interesting and difficult.',
 'The research in Math is interesting and not difficult.',
 'The Research in AI needs a deep understanding of Math',
 'is the research really difficult?']

In [3]:
# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()



In [4]:
# Fit and transform the corpus to obtain TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)



In [5]:
tfidf_matrix

<4x16 sparse matrix of type '<class 'numpy.float64'>'
	with 31 stored elements in Compressed Sparse Row format>

In [6]:
# Convert the TF-IDF matrix to a dense array for better readability (optional)
tfidf_matrix_dense = tfidf_matrix.toarray()



In [7]:
tfidf_matrix_dense

array([[0.        , 0.39371128, 0.        , 0.31874322, 0.31874322,
        0.39371128, 0.31874322, 0.        , 0.        , 0.        ,
        0.        , 0.49937284, 0.        , 0.26059346, 0.26059346,
        0.        ],
       [0.        , 0.36634077, 0.        , 0.29658443, 0.29658443,
        0.36634077, 0.29658443, 0.36634077, 0.        , 0.46465682,
        0.        , 0.        , 0.        , 0.2424772 , 0.2424772 ,
        0.        ],
       [0.39002912, 0.        , 0.39002912, 0.        , 0.24895054,
        0.        , 0.        , 0.30750344, 0.39002912, 0.        ,
        0.39002912, 0.        , 0.        , 0.20353338, 0.20353338,
        0.39002912],
       [0.        , 0.        , 0.        , 0.41553722, 0.        ,
        0.        , 0.41553722, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.65101935, 0.3397289 , 0.3397289 ,
        0.        ]])

In [8]:
# Get the feature names (words/terms) corresponding to the columns of the TF-IDF matrix
feature_names = tfidf_vectorizer.get_feature_names_out()
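
To see where the 16 columns come from, the vocabulary can be rebuilt by hand. This is a minimal sketch that assumes the vectorizer's default settings: lowercasing, and the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. The helper name `tokenize` is made up for this illustration and is not part of scikit-learn.

```python
import re

corpus = [
    "The research in Physics is interesting and difficult.",
    "The research in Math is interesting and not difficult.",
    "The Research in AI needs a deep understanding of Math",
    "is the research really difficult?",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep tokens of 2+ word characters
    # (assumed to mirror TfidfVectorizer's default token_pattern).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Collect the unique tokens of all documents and sort them alphabetically.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

print(len(vocabulary))
print(vocabulary)
```

Under these assumptions the single-character token "a" is dropped, which would explain why it has no column in the matrix.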



According to scikit-learn's online documentation, the inverse document frequency (idf) of a term t is computed as follows, where n is the total number of documents in the corpus and df(t) is the number of documents that contain t:

idf(t) = ln[ (1 + n) / (1 + df(t)) ] + 1 (default, i.e. smooth_idf=True)

and

idf(t) = ln[ n / df(t) ] + 1 (when smooth_idf=False)
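
Both formulas can be checked by hand on this corpus. The sketch below uses only the standard library; `tokenize`, `smooth_idf`, and `plain_idf` are names invented for this example, and the tokenization is an assumption about the vectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import math
import re

corpus = [
    "The research in Physics is interesting and difficult.",
    "The research in Math is interesting and not difficult.",
    "The Research in AI needs a deep understanding of Math",
    "is the research really difficult?",
]

def tokenize(text):
    # Hypothetical helper approximating the assumed default tokenization.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

n = len(corpus)  # total number of documents

# df[t] = number of documents that contain term t at least once
df = {}
for doc in corpus:
    for term in set(tokenize(doc)):
        df[term] = df.get(term, 0) + 1

def smooth_idf(term):
    # idf with smoothing (smooth_idf=True)
    return math.log((1 + n) / (1 + df[term])) + 1

def plain_idf(term):
    # idf without smoothing (smooth_idf=False)
    return math.log(n / df[term]) + 1

for term in ["the", "difficult", "math", "physics"]:
    print(f"{term:10s} df={df[term]}  smooth={smooth_idf(term):.6f}  plain={plain_idf(term):.6f}")
```

A term found in all four documents ("the") gets an idf of exactly 1 under both variants, while a term found in a single document ("physics") gets the largest value. If the assumptions hold, the smoothed values should agree with the fitted vectorizer's `idf_` attribute.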



In [9]:
# Label the rows with the documents and the columns with the terms
tfidf_df = pd.DataFrame(tfidf_matrix_dense, index=corpus, columns=feature_names)
tfidf_df

document,ai,and,deep,difficult,in,interesting,is,math,needs,not,of,physics,really,research,the,understanding
The research in Physics is interesting and difficult.,0.0,0.393711,0.0,0.318743,0.318743,0.393711,0.318743,0.0,0.0,0.0,0.0,0.499373,0.0,0.260593,0.260593,0.0
The research in Math is interesting and not difficult.,0.0,0.366341,0.0,0.296584,0.296584,0.366341,0.296584,0.366341,0.0,0.464657,0.0,0.0,0.0,0.242477,0.242477,0.0
The Research in AI needs a deep understanding of Math,0.390029,0.0,0.390029,0.0,0.248951,0.0,0.0,0.307503,0.390029,0.0,0.390029,0.0,0.0,0.203533,0.203533,0.390029
is the research really difficult?,0.0,0.0,0.0,0.415537,0.0,0.0,0.415537,0.0,0.0,0.0,0.0,0.0,0.651019,0.339729,0.339729,0.0
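
The first row of this table can be reproduced without scikit-learn. This is a sketch under stated assumptions, not the library's actual implementation: term frequency is taken as the raw count, idf uses the smoothed formula, and each row is divided by its L2 norm (assumed defaults `smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The helpers `tokenize` and `tfidf_row` are hypothetical names.

```python
import math
import re
from collections import Counter

corpus = [
    "The research in Physics is interesting and difficult.",
    "The research in Math is interesting and not difficult.",
    "The Research in AI needs a deep understanding of Math",
    "is the research really difficult?",
]

def tokenize(text):
    # Hypothetical helper approximating the assumed default tokenization.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

docs = [tokenize(doc) for doc in corpus]
n = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in docs for term in set(tokens))

def tfidf_row(tokens):
    # Raw weight = term count * smoothed idf.
    counts = Counter(tokens)
    raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
    # Scale the row to unit Euclidean (L2) length.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row = tfidf_row(docs[0])
for term in sorted(row):
    print(f"{term:12s} {row[term]:.6f}")
```

With these assumptions, "physics" comes out at about 0.499373 and "research" at about 0.260593, matching the first row above.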


## Observations

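1. A TF-IDF value is computed for every (document, term) pair; a term that does not occur in a document gets 0.
2. Terms that appear in every document (e.g., "research", "the") receive the lowest idf and therefore low TF-IDF scores, because they do not help distinguish one document from another.
3. Terms that occur in a specific document but are rare in the rest of the corpus receive higher TF-IDF scores (e.g., "physics" scores 0.499 in the first document and "really" scores 0.651 in the last).
4. Each value is a real number between 0 and 1 (by default, `TfidfVectorizer` scales every row to unit L2 length) and represents the relative importance of a term within its document.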
5. You can use these TF-IDF vectors for various NLP tasks like text classification, information retrieval, or clustering.
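
As a small illustration of the retrieval use case, the sketch below treats the last document as a query and ranks the other three by cosine similarity. The numbers are copied from the table above (rounded to six decimals). It assumes the rows are L2-normalized, in which case the dot product of two rows equals their cosine similarity; `dot` is a helper defined only for this example.

```python
# Rows of the TF-IDF table above, in column order:
# ai, and, deep, difficult, in, interesting, is, math,
# needs, not, of, physics, really, research, the, understanding
tfidf_rows = [
    [0.0, 0.393711, 0.0, 0.318743, 0.318743, 0.393711, 0.318743, 0.0,
     0.0, 0.0, 0.0, 0.499373, 0.0, 0.260593, 0.260593, 0.0],
    [0.0, 0.366341, 0.0, 0.296584, 0.296584, 0.366341, 0.296584, 0.366341,
     0.0, 0.464657, 0.0, 0.0, 0.0, 0.242477, 0.242477, 0.0],
    [0.390029, 0.0, 0.390029, 0.0, 0.248951, 0.0, 0.0, 0.307503,
     0.390029, 0.0, 0.390029, 0.0, 0.0, 0.203533, 0.203533, 0.390029],
    [0.0, 0.0, 0.0, 0.415537, 0.0, 0.0, 0.415537, 0.0,
     0.0, 0.0, 0.0, 0.0, 0.651019, 0.339729, 0.339729, 0.0],
]

def dot(u, v):
    # Dot product; equals cosine similarity when both vectors have unit length.
    return sum(a * b for a, b in zip(u, v))

query = tfidf_rows[3]  # "is the research really difficult?"

# Score the first three documents against the query, best match first.
ranked = sorted(
    ((dot(query, row), index) for index, row in enumerate(tfidf_rows[:3])),
    reverse=True,
)

for score, index in ranked:
    print(f"document {index}: similarity {score:.4f}")
```

The first document ranks highest because it shares "difficult" and "is" with the query, while the third document overlaps only on "research" and "the", which carry little weight.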