**Introduction**

Term Frequency - Inverse Document Frequency (tf-idf) is an important tool in Natural Language Processing. Its weights are commonly combined with similarity measures such as cosine similarity to find documents that are similar to a given search query.

1. Term Frequency

Tf is the number of times a term appears in a particular document. It's specific to a document. 

tf(t) : The number of times 't' appears in a document
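
As a rough illustration of the definition above, the sketch below counts raw term occurrences in a single document. The helper name `term_frequency` and the plain whitespace tokenization are assumptions made for this example; real vectorizers tokenize more carefully.

```python
from collections import Counter

# Toy document; lowercasing and whitespace splitting are simplifying assumptions
document = "petrol cars are cheaper than diesel cars"

def term_frequency(text):
    """Return the raw count of each whitespace-separated, lowercased token."""
    return Counter(text.lower().split())

tf = term_frequency(document)
print(tf["cars"])    # 2
print(tf["diesel"])  # 1
```

Because the counts come from one document only, the same term can have a different tf in every document of the corpus.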

2. Inverse Document Frequency 

idf measures how common or rare a term is across the entire corpus of documents. Unlike tf, it is the same for every document. If a term appears in many documents, its (normalized) idf value approaches 0; if it is rare, the value approaches 1.


tf-idf value of a term in a document is the product of its tf and idf. The higher the value, the more relevant the term is in that document. 
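
A minimal sketch of the product described above, assuming the unsmoothed textbook form idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The third document in the corpus and the helper names `idf` and `tf_idf` are hypothetical, added only for illustration. Library implementations may use a different idf variant, so the numbers here are not expected to match theirs.

```python
import math
from collections import Counter

# Toy corpus; the third document is hypothetical, added so that no term
# appears in every document
corpus = [
    "petrol cars are cheaper than diesel cars",
    "diesel is cheaper than petrol",
    "electric cars are quiet",
]
tokenized = [doc.lower().split() for doc in corpus]
n_docs = len(tokenized)

def idf(term):
    """Unsmoothed idf, ln(N / df). Assumes the term occurs in the corpus."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / df)

def tf_idf(term, tokens):
    """Raw count of the term in one document times its corpus-wide idf."""
    return Counter(tokens)[term] * idf(term)

print(tf_idf("cars", tokenized[0]))      # 2 * ln(3/2), about 0.811
print(tf_idf("electric", tokenized[2]))  # 1 * ln(3/1), about 1.099
```

Although 'cars' occurs twice in the first document, the rarer term 'electric' ends up with the higher score in its own document, because its idf is larger.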

Note 1 on Normalization: 

Tf-idf assigns a weight to each feature (term) based on Term Frequency (TF) and Inverse Document Frequency (IDF). The TF reflects how often the term appears in the document, and the IDF reflects how rare or common the term is across the entire corpus.

- The TF ensures that terms that are more frequent in a document receive a higher weight.

- The IDF ensures that terms that appear in many documents (common words) receive a lower weight.

These two components ensure that important terms (those that are relevant to the document's context) are still emphasized by their TF-IDF scores, even after normalization. 

Normalization (making each document vector unit length) affects the magnitude of the vector but not the relative importance of individual features within that vector. The relative relationship between the tf-idf values of different features in a single document is preserved. 

Before normalization: Larger documents (with more terms) tend to have higher overall magnitudes (the square root of the sum of squared tf-idf values).

After normalization: The magnitudes are adjusted to 1, but the relative importance of the features remains the same. So, the term 'cars' might have a higher weight compared to 'diesel' in a document, even after normalization.
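
The effect of L2 normalization can be checked with a short sketch. The raw weights below are hypothetical values chosen for illustration, and `l2_normalize` is a helper written for this example, not a library function.

```python
import math

# Hypothetical raw tf-idf weights for one document
raw = {"cars": 2.8, "cheaper": 1.0, "diesel": 1.0, "petrol": 1.0}

def l2_normalize(weights):
    """Scale a weight vector so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

unit = l2_normalize(raw)

# The normalized vector has (approximately) unit length ...
print(math.sqrt(sum(w * w for w in unit.values())))

# ... and the ratio between any two features is unchanged
print(raw["cars"] / raw["diesel"], unit["cars"] / unit["diesel"])
```

Every weight is divided by the same norm, which is why 'cars' keeps its lead over 'diesel' after scaling.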

Note 2: 

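When the IDF of a term is zero (under the unsmoothed definition idf = log(N / df), where N is the number of documents and df is the number of documents containing the term), it means that the term appears in every document in the corpus. This implies that the term is not informative or discriminative for distinguishing between different documents in the corpus.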
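
The sketch below contrasts the unsmoothed definition with the smoothed variant ln((1 + N) / (1 + df)) + 1, which is the formula scikit-learn documents for `TfidfVectorizer` when `smooth_idf=True` (its default). The function names are hypothetical helpers written for this example. Under the smoothed form, a term that appears in every document gets the minimum idf of 1 instead of 0, so it is down-weighted but not removed.

```python
import math

def textbook_idf(n_docs, df):
    """Unsmoothed idf, ln(N / df); equals 0 when the term is in every document."""
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    """Smoothed idf, ln((1 + N) / (1 + df)) + 1; never drops below 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 2
print(textbook_idf(n_docs, df=2))  # 0.0
print(smoothed_idf(n_docs, df=2))  # 1.0
print(smoothed_idf(n_docs, df=1))  # ln(3/2) + 1, about 1.405
```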

In [21]:
#Importing the library

from sklearn.feature_extraction.text import TfidfVectorizer

#Setting up the document corpus

d1 = "petrol cars are cheaper than diesel cars"

d2 = "diesel is cheaper than petrol"

doc_corpus = [d1, d2]

print(doc_corpus)

['petrol cars are cheaper than diesel cars', 'diesel is cheaper than petrol']


In [22]:
#Initializing TfidfVectorizer and printing the feature names

#stop_words='english' makes the vectorizer exclude common English stop words,
#which are typically not very informative for tasks like text classification or clustering
vec = TfidfVectorizer(norm = 'l2', stop_words = 'english')

matrix = vec.fit_transform(doc_corpus)

print("Feature Names:", vec.get_feature_names_out())

Feature Names: ['cars' 'cheaper' 'diesel' 'petrol']


In [23]:
#Print the shape of the sparse tf-idf matrix and its values as a dense array

print('Sparse Matrix shape:', matrix.shape)

print(matrix.toarray())

Sparse Matrix shape: (2, 4)
[[0.85135433 0.30287281 0.30287281 0.30287281]
 [0.         0.57735027 0.57735027 0.57735027]]
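
The introduction mentions using cosine similarity to match documents against a search query. A minimal sketch of that step, reusing the two-document corpus, might look like the following. The query string is a hypothetical example, and `cosine_similarity` is taken from `sklearn.metrics.pairwise`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_corpus = [
    "petrol cars are cheaper than diesel cars",
    "diesel is cheaper than petrol",
]
vec = TfidfVectorizer(norm="l2", stop_words="english")
matrix = vec.fit_transform(doc_corpus)

# Hypothetical search query, projected into the vocabulary learned from the corpus
query = "diesel cars"
query_vec = vec.transform([query])

# One similarity score per document; a higher score means closer to the query
scores = cosine_similarity(query_vec, matrix)[0]
for doc, score in zip(doc_corpus, scores):
    print(f"{score:.3f}  {doc}")
```

The query must go through `transform`, not `fit_transform`, so that it is weighted with the idf values learned from the corpus. Here the first document should score higher, since it contains both query terms and 'cars' carries the larger weight.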


In [30]:
#Example 2

from sklearn.feature_extraction.text import TfidfVectorizer

d1 = "I love programming in Python"

d2 = "Python is great for data analysis"

d3 = "Machine learning is a subfield of artificial intelligence"

d4 = "Data Science involves machine learning and statistics"

d5 = "Artificial intelligence and machine learning are growing fields"

documents = [d1,d2,d3,d4,d5]

vec = TfidfVectorizer(stop_words='english')

matrix = vec.fit_transform(documents)

print(f'The feature names are: {vec.get_feature_names_out()}')

print('Shape of Sparse Matrix:', matrix.shape, 'array:', matrix.toarray())

The feature names are: ['analysis' 'artificial' 'data' 'fields' 'great' 'growing' 'intelligence'
 'involves' 'learning' 'love' 'machine' 'programming' 'python' 'science'
 'statistics' 'subfield']
Shape of Sparse Matrix: (5, 16) array: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.61418897 0.         0.61418897
  0.49552379 0.         0.         0.        ]
 [0.55032913 0.         0.44400208 0.         0.55032913 0.
  0.         0.         0.         0.         0.         0.
  0.44400208 0.         0.         0.        ]
 [0.         0.45109178 0.         0.         0.         0.
  0.45109178 0.         0.37444693 0.         0.37444693 0.
  0.         0.         0.         0.55911663]
 [0.         0.         0.37831623 0.         0.         0.
  0.         0.46891321 0.31403664 0.         0.31403664 0.
  0.         0.46891321 0.46891321 0.        ]
 [0.         0.39372848 0.         0.4880163  0.         0.4880163
  0.39372848 0.       