# Bag of Words & Stop Word Filtering

Bag of Words simplifies the representation of a text to a collection (bag) of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.
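
The idea can be sketched by hand with the standard library. This is only an illustration, not how the library is implemented: the helper name `bag_of_words` and the two sample sentences are made up for this example, and the regular expression mirrors the default token pattern of `CountVectorizer` (tokens of two or more word characters).

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for a list of strings."""
    # Lowercase each document and keep tokens of 2+ word characters;
    # punctuation acts as a separator and is dropped.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # One count vector per document, aligned with the vocabulary order.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical sample sentences, chosen only for this sketch.
docs = ["The cat sat.", "The cat saw the other cat!"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Word order and punctuation play no role here: each document is reduced to how often each vocabulary term occurs in it.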

# Dataset

In [1]:
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distribution include the Linux Kernel.',
    'Linux is one of the most prominent open-source software.'
]
corpus

['Linux has been around since the mid-1990s.',
 'Linux distribution include the Linux Kernel.',
 'Linux is one of the most prominent open-source software.']

# Bag of Words model with CountVectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_x = vectorizer.fit_transform(corpus).todense()  # convert the sparse matrix returned by fit_transform into a dense two-dimensional matrix
vectorized_x

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [3]:
vectorizer.get_feature_names_out().tolist()


['1990s',
 'around',
 'been',
 'distribution',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']

# Euclidean Distance to measure the closeness/similarity between documents

In [4]:

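import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

n_docs = vectorized_x.shape[0]
for i in range(n_docs):
    for j in range(i + 1, n_docs):  # visit each pair of distinct documents once
        # np.asarray turns each np.matrix row into a plain (1, n_features) array
        distance = euclidean_distances(np.asarray(vectorized_x[i]), np.asarray(vectorized_x[j]))
        print(f'Distance between document {i+1} and {j+1}: {distance}')

Distance between document 1 and 2: [[3.16227766]]
Distance between document 1 and 3: [[3.74165739]]
Distance between document 2 and 3: [[3.46410162]]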
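
The first distance can be checked by hand: the Euclidean distance is the square root of the sum of squared differences between two vectors. The sketch below uses only the standard library; the two vectors are copied from the first two rows of the `CountVectorizer` output above, and `euclidean_distance` is a helper name introduced here for illustration.

```python
import math

# Count vectors of documents 1 and 2, copied from the matrix shown earlier.
doc1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


def euclidean_distance(x, y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


distance = euclidean_distance(doc1, doc2)
print(round(distance, 8))
```

The two vectors differ by 1 in exactly ten positions, so the distance is the square root of 10, which agrees with the value reported for documents 1 and 2.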


# Stop Word Filtering on text

Stop word filtering simplifies the representation of a text by ignoring certain common words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).
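
As a rough sketch of the mechanism, the filter is just a membership test against a word list. The list below is deliberately tiny and hypothetical (it contains only the example words from the paragraph above); real stop word lists are much longer, and `remove_stop_words` is a name invented for this example.

```python
import re

# Hypothetical, minimal stop word list built from the examples above.
STOP_WORDS = {"the", "a", "an", "do", "be", "will", "on", "in", "at"}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, drop stop words."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical sample sentence.
filtered = remove_stop_words("The cat will be at the door in an hour.")
print(filtered)
```

Only the content-bearing words remain, which shrinks the vocabulary and therefore the number of feature columns.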

# Dataset

In [5]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distribution include the Linux Kernel.',
 'Linux is one of the most prominent open-source software.']

# Stop Word Filtering with CountVectorizer

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')  # apply stop word filtering for English
vectorized_x = vectorizer.fit_transform(corpus).todense()  # convert the sparse matrix returned by fit_transform into a dense two-dimensional matrix
vectorized_x

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [7]:
vectorizer.get_feature_names_out().tolist()

['1990s',
 'distribution',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']

# Introduction to TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is a statistical measure of how important a word is to a particular document within a collection of documents, or corpus.
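
For reference, one way to write the weight is given below. Several variants of TF-IDF exist; this is the smoothed, L2-normalized form that scikit-learn's `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`). The symbols are notation introduced here: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$.

```latex
$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}
$$
```

A term that occurs in every document gets $\mathrm{idf}(t) = 1$, the lowest possible value, so it contributes least to each document's weight vector.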

# Dataset

In [8]:
corpus = ['the house had a tiny mouse',
         'the cat saw the mouse',
         'the mouse ran away from the house',
         'the cat finaly ate the mouse',
         'the end of the mouse story']
corpus

['the house had a tiny mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finaly ate the mouse',
 'the end of the mouse story']

# TF-IDF Weights with TfidfVectorizer

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english') 
response = vectorizer.fit_transform(corpus)
print(response)

  (0, 6)	0.3477147117091919
  (0, 10)	0.7297183669435993
  (0, 5)	0.5887321837696324
  (1, 8)	0.7297183669435993
  (1, 2)	0.5887321837696324
  (1, 6)	0.3477147117091919
  (2, 1)	0.5894630806320427
  (2, 7)	0.5894630806320427
  (2, 6)	0.2808823162882302
  (2, 5)	0.47557510189256375
  (3, 0)	0.5894630806320427
  (3, 4)	0.5894630806320427
  (3, 2)	0.47557510189256375
  (3, 6)	0.2808823162882302
  (4, 9)	0.6700917930430479
  (4, 3)	0.6700917930430479
  (4, 6)	0.3193023297639811


In [10]:
vectorizer.get_feature_names_out().tolist()

['ate',
 'away',
 'cat',
 'end',
 'finaly',
 'house',
 'mouse',
 'ran',
 'saw',
 'story',
 'tiny']

In [11]:

response.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.58873218, 0.34771471, 0.        , 0.        , 0.        ,
         0.72971837],
        [0.        , 0.        , 0.58873218, 0.        , 0.        ,
         0.        , 0.34771471, 0.        , 0.72971837, 0.        ,
         0.        ],
        [0.        , 0.58946308, 0.        , 0.        , 0.        ,
         0.4755751 , 0.28088232, 0.58946308, 0.        , 0.        ,
         0.        ],
        [0.58946308, 0.        , 0.4755751 , 0.        , 0.58946308,
         0.        , 0.28088232, 0.        , 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , 0.67009179, 0.        ,
         0.        , 0.31930233, 0.        , 0.        , 0.67009179,
         0.        ]])
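
The first row of this matrix can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document). The token lists are written out manually to match the non-zero entries of each row after stop word removal, and `tfidf_row` is a helper name introduced only for this example.

```python
import math

# Tokens left in each document after English stop word removal,
# written out by hand to match the non-zero columns of each matrix row.
docs = [
    ["house", "tiny", "mouse"],
    ["cat", "saw", "mouse"],
    ["mouse", "ran", "away", "house"],
    ["cat", "finaly", "ate", "mouse"],
    ["end", "mouse", "story"],
]

n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents that contain the term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency (assumed default behaviour).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_row(tokens):
    """Return L2-normalized tf * idf weights for one tokenized document."""
    weights = {term: tokens.count(term) * idf[term] for term in set(tokens)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row = tfidf_row(docs[0])
for term in sorted(row):
    print(term, round(row[term], 6))
```

Because `mouse` appears in all five documents, its IDF is the minimum value of 1, which is why it receives the smallest weight in every row, while a term unique to one document such as `tiny` receives the largest.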

In [12]:
import pandas as pd

df = pd.DataFrame(response.todense().T,
                 index=vectorizer.get_feature_names_out(),
                 columns=[f'D{i+1}' for i in range (len(corpus))])
df

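| | D1 | D2 | D3 | D4 | D5 |
|---|---|---|---|---|---|
| ate | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 |
| away | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 |
| cat | 0.0 | 0.588732 | 0.0 | 0.475575 | 0.0 |
| end | 0.0 | 0.0 | 0.0 | 0.0 | 0.670092 |
| finaly | 0.0 | 0.0 | 0.0 | 0.589463 | 0.0 |
| house | 0.588732 | 0.0 | 0.475575 | 0.0 | 0.0 |
| mouse | 0.347715 | 0.347715 | 0.280882 | 0.280882 | 0.319302 |
| ran | 0.0 | 0.0 | 0.589463 | 0.0 | 0.0 |
| saw | 0.0 | 0.729718 | 0.0 | 0.0 | 0.0 |
| story | 0.0 | 0.0 | 0.0 | 0.0 | 0.670092 |
| tiny | 0.729718 | 0.0 | 0.0 | 0.0 | 0.0 |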
