In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer


d0 = "The cat sat on the mat."
d1 = "The dog played in the park."
d2 = "Cats and dogs are great pets."
string = [d0, d1, d2]

In [2]:
string

['The cat sat on the mat.',
 'The dog played in the park.',
 'Cats and dogs are great pets.']

Unsmoothed IDF, the form scikit-learn uses when `smooth_idf=False`:

$idf(t, D) = \log\left(\frac{N}{df(t)}\right) + 1$
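
To see how much smoothing changes the numbers on a corpus this small, here is a minimal sketch comparing this unsmoothed form with the smoothed variant that scikit-learn applies by default (`smooth_idf=True`, which adds one to both numerator and denominator). The helper names `idf_unsmoothed` and `idf_smoothed` are hypothetical and exist only for this illustration.

```python
import math


def idf_unsmoothed(n_docs, df):
    # Hypothetical helper: log(N / df) + 1, natural logarithm.
    return math.log(n_docs / df) + 1


def idf_smoothed(n_docs, df):
    # Hypothetical helper: log((N + 1) / (df + 1)) + 1, natural logarithm.
    return math.log((n_docs + 1) / (df + 1)) + 1


# Compare both variants for every possible document frequency when N = 3.
for df in (1, 2, 3):
    print(df, round(idf_unsmoothed(3, df), 4), round(idf_smoothed(3, df), 4))
```

Both variants give exactly 1 for a term that occurs in every document, and smoothing pulls the IDF of rarer terms down slightly.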

### Step-by-Step TF-IDF Calculation for 'cat' in Document 0

Let's analyze the first document: `d0 = "The cat sat on the mat."`

#### 1. Calculate Term Frequency (TF)
Term Frequency ($tf(t, d)$) is the raw count of how many times a term $t$ appears in document $d$.

*   For the word "cat" in `d0`: `"cat"` appears **1** time.
*   So, $tf(\text{'cat'}, d0) = 1$.
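
The count can be reproduced without scikit-learn. The sketch below assumes the vectorizer's documented defaults (lowercasing, and a token pattern that keeps runs of two or more word characters); it is an approximation of that preprocessing, not the library's own code path.

```python
import re
from collections import Counter

d0 = "The cat sat on the mat."

# Lowercase, then keep tokens of two or more word characters.
# The regex is the documented default `token_pattern` of the vectorizers.
tokens = re.findall(r"(?u)\b\w\w+\b", d0.lower())
tf_d0 = Counter(tokens)

print(tokens)
print(tf_d0["cat"])  # 1
print(tf_d0["the"])  # 2
```

Lowercasing is why "The" and "the" are counted together, giving 'the' a term frequency of 2 in `d0`.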

#### 2. Calculate Inverse Document Frequency (IDF)
Inverse Document Frequency ($idf(t, D)$) is calculated with the smoothed formula below, because `smooth_idf=True` is the `TfidfVectorizer` default. Here $\log$ is the natural logarithm:

$idf(t, D) = \log\left(\frac{N + 1}{df(t) + 1}\right) + 1$

Where:
*   $N$ is the total number of documents (which is 3).
*   $df(t)$ is the number of documents containing the term $t$.

Let's find $df(\text{'cat'})$:
*   `d0`: "The **cat** sat on the mat." (contains 'cat')
*   `d1`: "The dog played in the park." (does not contain 'cat')
*   `d2`: "**Cats** and dogs are great pets." (contains 'cats', which the vectorizer keeps as a separate vocabulary entry from 'cat' because it does no stemming)

*   So, $df(\text{'cat'}) = 1$.

Now, calculate $idf(\text{'cat'})$:
$idf(\text{'cat'}) = \log\left(\frac{3 + 1}{1 + 1}\right) + 1 = \log\left(\frac{4}{2}\right) + 1 = \log(2) + 1 \approx 0.69314718 + 1 \approx 1.69314718$

This matches the value for 'cat' in `tfidf.idf_`, printed in cell `In [5]` further down (`cat : 1.6931471805599454`).
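
The same document-frequency count and smoothed IDF can be sketched in plain Python. The helpers `tokenize` and `smooth_idf` are hypothetical names, and the tokenization only mimics the vectorizer's documented defaults.

```python
import math
import re

docs = [
    "The cat sat on the mat.",
    "The dog played in the park.",
    "Cats and dogs are great pets.",
]


def tokenize(text):
    # Hypothetical helper: lowercase, keep tokens of 2+ word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def smooth_idf(term, documents):
    # Hypothetical helper: log((N + 1) / (df + 1)) + 1, natural logarithm.
    n = len(documents)
    df = sum(term in set(tokenize(doc)) for doc in documents)
    return math.log((n + 1) / (df + 1)) + 1


idf_cat = smooth_idf("cat", docs)  # df = 1, about 1.6931
idf_the = smooth_idf("the", docs)  # df = 2, about 1.2877
print(idf_cat, idf_the)
```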

#### 3. Calculate Unnormalized TF-IDF
Unnormalized TF-IDF is simply the product of TF and IDF:

$tfidf(t, d, D)_{\text{unnormalized}} = tf(t, d) \times idf(t, D)$

*   For "cat" in `d0`: $tfidf(\text{'cat'}, d0)_{\text{unnormalized}} = 1 \times 1.69314718 \approx 1.69314718$
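
One way to check the unnormalized product against scikit-learn is to switch normalization off with `norm=None`. This is a minimal sketch; the variable names `raw_vectorizer` and `raw_tfidf` are chosen for illustration and do not appear in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog played in the park.",
    "Cats and dogs are great pets.",
]

# norm=None skips the per-document normalization, leaving tf * idf.
raw_vectorizer = TfidfVectorizer(norm=None)
raw_tfidf = raw_vectorizer.fit_transform(docs).toarray()
vocab = raw_vectorizer.vocabulary_

print(raw_tfidf[0, vocab["cat"]])  # about 1.6931 (1 * idf)
print(raw_tfidf[0, vocab["the"]])  # about 2.5754 (2 * idf)
```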

#### 4. Apply L2 Normalization
`TfidfVectorizer` applies L2 normalization (its default, `norm='l2'`) to each document's TF-IDF vector: every term's unnormalized TF-IDF is divided by the Euclidean norm of that document's vector.

Let's list the unnormalized TF-IDF values for all terms present in `d0`:
*   'cat': $tf=1$, $idf=1.69314718 \implies \text{unnormalized } tfidf = 1.69314718$
*   'sat': $tf=1$, $idf=1.69314718 \implies \text{unnormalized } tfidf = 1.69314718$
*   'on': $tf=1$, $idf=1.69314718 \implies \text{unnormalized } tfidf = 1.69314718$
*   'mat': $tf=1$, $idf=1.69314718 \implies \text{unnormalized } tfidf = 1.69314718$
*   'the': $tf=2$, $idf=\log\left(\frac{4}{3}\right) + 1 \approx 1.28768207$ (it appears in `d0` and `d1`, so $df=2$) $\implies \text{unnormalized } tfidf = 2 \times 1.28768207 \approx 2.57536414$

Now, calculate the L2 norm of this vector for `d0`:

$\text{Norm}(d0) = \sqrt{(1.69314718)^2 + (1.69314718)^2 + (1.69314718)^2 + (1.69314718)^2 + (2.57536414)^2}$
$\text{Norm}(d0) \approx \sqrt{2.866747 + 2.866747 + 2.866747 + 2.866747 + 6.632500}$
$\text{Norm}(d0) \approx \sqrt{18.09949} \approx 4.2543495$

Finally, the normalized TF-IDF for 'cat' in `d0`:

$tfidf(\text{'cat'}, d0)_{\text{normalized}} = \frac{\text{unnormalized } tfidf(\text{'cat'}, d0)}{\text{Norm}(d0)} = \frac{1.69314718}{4.2543495} \approx 0.39798027$

This matches the entry for 'cat' (column index 2) in the first row (`d0`) of the `tf-idf values in matrix form` output:

```
tf-idf values in matrix form:
[[0.         0.         0.39798027 ...]
 ...
]
```
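
The normalization step can also be checked numerically. The sketch below rebuilds the five non-zero entries of `d0` from the smoothed IDF formula and divides by their Euclidean norm; the variable names are illustrative only.

```python
import math

import numpy as np

idf_rare = math.log(4 / 2) + 1  # terms found in one document (df = 1)
idf_the = math.log(4 / 3) + 1   # 'the' is found in two documents (df = 2)

# Unnormalized TF-IDF for d0, ordered as cat, sat, on, mat, the.
d0_raw = np.array([idf_rare, idf_rare, idf_rare, idf_rare, 2 * idf_the])

norm_d0 = np.linalg.norm(d0_raw)  # Euclidean (L2) norm, about 4.254
d0_normalized = d0_raw / norm_d0

print(norm_d0)
print(d0_normalized)  # about [0.398 0.398 0.398 0.398 0.605]
```

After this division the vector has unit length, which is why cosine similarity between rows of the TF-IDF matrix reduces to a plain dot product.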

In [3]:
tfidf = TfidfVectorizer()
result = tfidf.fit_transform(string)

In [4]:
result

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 16 stored elements and shape (3, 15)>

In [5]:
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)


idf values:
and : 1.6931471805599454
are : 1.6931471805599454
cat : 1.6931471805599454
cats : 1.6931471805599454
dog : 1.6931471805599454
dogs : 1.6931471805599454
great : 1.6931471805599454
in : 1.6931471805599454
mat : 1.6931471805599454
on : 1.6931471805599454
park : 1.6931471805599454
pets : 1.6931471805599454
played : 1.6931471805599454
sat : 1.6931471805599454
the : 1.2876820724517808


### Calculating Term Frequency (TF) using CountVectorizer

Term Frequency (TF) is the number of times a term appears in a document, and it is the first factor in the TF-IDF product. `TfidfVectorizer` does the counting and the IDF weighting in one step; `CountVectorizer` returns only the raw term counts.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
count_vectorizer = CountVectorizer()

# Fit and transform the string data to get term counts
term_counts = count_vectorizer.fit_transform(string)

print('Term Frequency (raw counts) matrix:')
print(term_counts)
print('\nTerm Frequency (raw counts) in matrix form:')
print(term_counts.toarray())

print('\nWord indexes (Vocabulary):')
print(count_vectorizer.vocabulary_)

Term Frequency (raw counts) matrix:
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 16 stored elements and shape (3, 15)>
  Coords	Values
  (0, 14)	2
  (0, 2)	1
  (0, 13)	1
  (0, 9)	1
  (0, 8)	1
  (1, 14)	2
  (1, 4)	1
  (1, 12)	1
  (1, 7)	1
  (1, 10)	1
  (2, 3)	1
  (2, 0)	1
  (2, 5)	1
  (2, 1)	1
  (2, 6)	1
  (2, 11)	1

Term Frequency (raw counts) in matrix form:
[[0 0 1 0 0 0 0 0 1 1 0 0 0 1 2]
 [0 0 0 0 1 0 0 1 0 0 1 0 1 0 2]
 [1 1 0 1 0 1 1 0 0 0 0 1 0 0 0]]

Word indexes (Vocabulary):
{'the': 14, 'cat': 2, 'sat': 13, 'on': 9, 'mat': 8, 'dog': 4, 'played': 12, 'in': 7, 'park': 10, 'cats': 3, 'and': 0, 'dogs': 5, 'are': 1, 'great': 6, 'pets': 11}
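
These raw counts are the starting point for the TF-IDF weights: scikit-learn documents `TfidfVectorizer` as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks that equivalence on the same three documents, with default settings assumed for all three classes and variable names chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "The cat sat on the mat.",
    "The dog played in the park.",
    "Cats and dogs are great pets.",
]

# Two-step route: raw counts first, then IDF weighting and L2 normalization.
counts = CountVectorizer().fit_transform(docs)
from_counts = TfidfTransformer().fit_transform(counts)

# One-step route.
direct = TfidfVectorizer().fit_transform(docs)

same = np.allclose(from_counts.toarray(), direct.toarray())
print(same)  # True
```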


In [6]:
print('\nWord indexes:')
print(tfidf.vocabulary_)
print('\ntf-idf value:')
print(result)
print('\ntf-idf values in matrix form:')
print(result.toarray())


Word indexes:
{'the': 14, 'cat': 2, 'sat': 13, 'on': 9, 'mat': 8, 'dog': 4, 'played': 12, 'in': 7, 'park': 10, 'cats': 3, 'and': 0, 'dogs': 5, 'are': 1, 'great': 6, 'pets': 11}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 16 stored elements and shape (3, 15)>
  Coords	Values
  (0, 14)	0.6053485081062917
  (0, 2)	0.3979802707840827
  (0, 13)	0.3979802707840827
  (0, 9)	0.3979802707840827
  (0, 8)	0.3979802707840827
  (1, 14)	0.6053485081062917
  (1, 4)	0.3979802707840827
  (1, 12)	0.3979802707840827
  (1, 7)	0.3979802707840827
  (1, 10)	0.3979802707840827
  (2, 3)	0.4082482904638631
  (2, 0)	0.4082482904638631
  (2, 5)	0.4082482904638631
  (2, 1)	0.4082482904638631
  (2, 6)	0.4082482904638631
  (2, 11)	0.4082482904638631

tf-idf values in matrix form:
[[0.         0.         0.39798027 0.         0.         0.
  0.         0.         0.39798027 0.39798027 0.         0.
  0.         0.39798027 0.60534851]
 [0.         0.         0.         0.         0.39