In [1]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Learn the vocabulary and IDF weights, then build the sparse TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)  # (number of documents, vocabulary size)

(4, 9)


In [2]:
print("\nVectorizer configuration:")
print(vectorizer)


Vectorizer configuration:
TfidfVectorizer()


In [3]:
vocab = vectorizer.get_feature_names_out()

print("\nVocabulary (features):")
for i, word in enumerate(vocab):
    print(f"{i}: {word}")

print("\nVocabulary size:", len(vocab))



Vocabulary (features):
0: and
1: document
2: first
3: is
4: one
5: second
6: the
7: third
8: this

Vocabulary size: 9


In [4]:
X_dense = X.toarray()

print("\nTF-IDF matrix (rows=documents, columns=words):")
print(np.round(X_dense, 3))



TF-IDF matrix (rows=documents, columns=words):
[[0.    0.47  0.58  0.384 0.    0.    0.384 0.    0.384]
 [0.    0.688 0.    0.281 0.    0.539 0.281 0.    0.281]
 [0.512 0.    0.    0.267 0.512 0.    0.267 0.512 0.267]
 [0.    0.47  0.58  0.384 0.    0.    0.384 0.    0.384]]


## Understanding a TF-IDF Vector (Intuition)

Consider the **first row of the TF-IDF matrix**, which corresponds to the first document:

> **"This is the first document"**

### Vocabulary and Vector Representation
From the learned vocabulary, each word is assigned a **column index**.  
In our example, the TF-IDF matrix has **9 columns (0–8)**, one for each unique word.

- A value of **0** means the word from the vocabulary **does not appear** in the document
- A **non-zero value** means the word **appears in the document**, with a weight indicating its importance

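### Why do some words have higher weights?
For example, in the first row the word **"first"** (0.58) has a higher TF-IDF value than **"this"**, **"is"**, or **"the"** (0.384 each) because:
- It appears in **fewer documents overall** (2 of the 4, whereas the other three words appear in all 4)
- Therefore, it has a **higher Inverse Document Frequency (IDF)**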

---

### TF-IDF Components

**Term Frequency (TF)**  
Measures how frequently a term appears in a document:

$$
TF(t) = \frac{\text{number of times term } t \text{ appears in a document}}
{\text{total number of terms in the document}}
$$

**Inverse Document Frequency (IDF)**  
Measures how informative a term is across the corpus:

$$
IDF(t) = \log\left(\frac{\text{number of documents}}
{\text{number of documents containing term } t}\right)
$$
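
The two quantities defined so far can be evaluated directly for the first document. This is a minimal pure-Python sketch of the textbook formulas; the `tokenize`, `tf`, and `idf` helpers are hypothetical names, and the tokenizer (lowercase, words of two or more characters) is an assumption chosen to resemble the vocabulary shown earlier.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Assumed tokenizer: lowercase, keep words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in corpus]

def tf(term, doc_tokens):
    # Occurrences of the term divided by the total number of terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    # log(number of documents / number of documents containing the term);
    # the term must occur in at least one document, otherwise this divides by zero
    containing = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / containing)

first_doc = docs[0]
for term in sorted(set(first_doc)):
    print(f"{term:<10} TF={tf(term, first_doc):.3f}  IDF={idf(term, docs):.3f}")
```

Under these unsmoothed formulas, a word that occurs in every document (such as "this") has an IDF of exactly 0, while "first", which occurs in half of the documents, has an IDF of log(2).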

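**TF-IDF**  
The final weight assigned to a term in a document is the product of the two:

$$
TF\text{-}IDF(t) = TF(t) \times IDF(t)
$$

These are the textbook definitions. With default settings, `TfidfVectorizer` uses raw term counts for TF, a smoothed IDF, and L2 normalization of each row, so the values in the matrix above do not equal a direct evaluation of these formulas.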
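
As a sketch of how the matrix printed earlier comes about, the block below rebuilds it by hand. It assumes scikit-learn's default `TfidfVectorizer` settings: raw counts as TF, the smoothed IDF $\ln\left(\frac{1 + n}{1 + df(t)}\right) + 1$, and L2 normalization of each row. The names `counts`, `manual`, and `reference` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts per document (same default tokenization as TfidfVectorizer)
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, as used when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library's own result
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
print(np.round(manual[0], 3))
```

The smoothing adds 1 to the IDF, which is why words that occur in every document still receive a non-zero weight in the matrix.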

---

### Key Intuition
- Common words across many documents receive **lower weights**
- Rare but meaningful words receive **higher weights**

