# 11. Introduction to TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical method for measuring how important a word is to a particular document within a collection of documents, or corpus. A word scores high when it occurs often in that document but rarely in the rest of the corpus.
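
To make the definition concrete, here is the weighting written out as a formula. This is a sketch that assumes the default settings of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); other sources, including the Wikipedia article listed below, describe several variants. The symbols are introduced here for illustration: $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each document vector is then scaled to unit Euclidean length:

$$
w_{t,d} = \frac{\text{tf-idf}(t, d)}{\sqrt{\sum_{t'} \text{tf-idf}(t', d)^2}}
$$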

References:
- https://en.wikipedia.org/wiki/Tf–idf
- https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

## Dataset

In [1]:
corpus = ['the house had a tiny mouse',
         'the cat saw the mouse',
         'the mouse ran away from the house',
         'the cat finaly ate the mouse',
         'the end of the mouse story']
corpus


['the house had a tiny mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finaly ate the mouse',
 'the end of the mouse story']

## TF-IDF Weights with `TfidfVectorizer`

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english') 
response = vectorizer.fit_transform(corpus)
print(response)

  (0, 6)	0.3477147117091919
  (0, 10)	0.7297183669435993
  (0, 5)	0.5887321837696324
  (1, 8)	0.7297183669435993
  (1, 2)	0.5887321837696324
  (1, 6)	0.3477147117091919
  (2, 1)	0.5894630806320427
  (2, 7)	0.5894630806320427
  (2, 6)	0.2808823162882302
  (2, 5)	0.47557510189256375
  (3, 0)	0.5894630806320427
  (3, 4)	0.5894630806320427
  (3, 2)	0.47557510189256375
  (3, 6)	0.2808823162882302
  (4, 9)	0.6700917930430479
  (4, 3)	0.6700917930430479
  (4, 6)	0.3193023297639811


Notes:
- The first number in each pair of parentheses is the index of the document in our corpus (index 0 is the first sentence of the corpus, and so on).
- The second number in each pair of parentheses is the index of the term in the feature names produced by the vectorizer (listed below).
- The number outside the parentheses is the TF-IDF weight computed by `TfidfVectorizer`.
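
To see where these weights come from, the following minimal sketch recomputes them in plain Python. It assumes the formula scikit-learn documents for its default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count, followed by L2 normalization of each document vector. The `STOP_WORDS` set and the `tfidf_row` helper are hypothetical and written only for this illustration. The stop list is hand-picked to cover this corpus, whereas scikit-learn's built-in `'english'` list is much longer.

```python
import math

corpus = ['the house had a tiny mouse',
          'the cat saw the mouse',
          'the mouse ran away from the house',
          'the cat finaly ate the mouse',
          'the end of the mouse story']

# Assumption: a tiny hand-written stop list that is sufficient for this corpus.
STOP_WORDS = {'the', 'had', 'a', 'from', 'of'}

# Tokenize on whitespace and drop stop words.
docs = [[w for w in text.split() if w not in STOP_WORDS] for text in corpus]

# Sorted vocabulary, so column i corresponds to the i-th term alphabetically.
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + sum(t in doc for doc in docs))) + 1
       for t in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

weights = [tfidf_row(doc) for doc in docs]

# Print the non-zero weights of the first document.
for term, w in zip(vocab, weights[0]):
    if w > 0:
        print(f'{term}: {w:.6f}')
```

For the first document this should reproduce the three weights printed above for row 0: about 0.588732 for `house`, 0.347715 for `mouse`, and 0.729718 for `tiny`. `mouse` gets the lowest weight because it appears in every document, so its IDF is the minimum value of 1.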

In [8]:
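# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vectorizer.get_feature_names_out().tolist()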

['ate',
 'away',
 'cat',
 'end',
 'finaly',
 'house',
 'mouse',
 'ran',
 'saw',
 'story',
 'tiny']
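
The position of each term in this list is the column index used in the sparse output. As a small sketch, the fitted vectorizer also exposes that mapping directly through its `vocabulary_` attribute, and the learned IDF value of each column through `idf_`. The example refits a fresh vectorizer so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the house had a tiny mouse',
          'the cat saw the mouse',
          'the mouse ran away from the house',
          'the cat finaly ate the mouse',
          'the end of the mouse story']

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(corpus)

# vocabulary_ maps each term to its column index;
# idf_ holds one inverse-document-frequency value per column.
for term, col in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, term, round(float(vectorizer.idf_[col]), 4))
```

Terms that occur in many documents, such as `mouse`, should show the smallest IDF, while terms that occur in a single document show the largest.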

In [9]:
response.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.58873218, 0.34771471, 0.        , 0.        , 0.        ,
         0.72971837],
        [0.        , 0.        , 0.58873218, 0.        , 0.        ,
         0.        , 0.34771471, 0.        , 0.72971837, 0.        ,
         0.        ],
        [0.        , 0.58946308, 0.        , 0.        , 0.        ,
         0.4755751 , 0.28088232, 0.58946308, 0.        , 0.        ,
         0.        ],
        [0.58946308, 0.        , 0.4755751 , 0.        , 0.58946308,
         0.        , 0.28088232, 0.        , 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , 0.67009179, 0.        ,
         0.        , 0.31930233, 0.        , 0.        , 0.67009179,
         0.        ]])

In [7]:
import pandas as pd

# Transpose so that rows are terms and columns are documents D1..D5
df = pd.DataFrame(response.toarray().T,
                 index=vectorizer.get_feature_names_out(),
                 columns=[f'D{i+1}' for i in range(len(corpus))])
df

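| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| ate | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 |
| away | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 |
| cat | 0.0 | 0.588732 | 0.0 | 0.475575 | 0.0 |
| end | 0.0 | 0.0 | 0.0 | 0.0 | 0.670092 |
| finaly | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 |
| house | 0.588732 | 0.0 | 0.475575 | 0.0 | 0.0 |
| mouse | 0.347715 | 0.347715 | 0.280882 | 0.280882 | 0.319302 |
| ran | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 |
| saw | 0.0 | 0.729718 | 0.0 | 0.0 | 0.0 |
| story | 0.0 | 0.0 | 0.0 | 0.0 | 0.670092 |
| tiny | 0.729718 | 0.0 | 0.0 | 0.0 | 0.0 |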


source: https://www.youtube.com/watch?v=f0a1XXmaQp8 Trs_m