# 10. Introduction to Text Processing: Bag of Words & Stop Word Filtering

## The Bag of Words model as a text representation

Bag of Words simplifies the representation of text by treating it as a collection of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.

Reference: https://en.wikipedia.org/wiki/Bag-of-words_model
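As a rough illustration of the idea, the sketch below builds a bag of words for one sentence using only the Python standard library. The helper name `bag_of_words` is hypothetical, and the token pattern (words of two or more characters) is chosen to mirror `CountVectorizer`'s default tokenization; it is not the library's implementation.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, drop punctuation, and count each remaining word."""
    # \w\w+ keeps only tokens of two or more word characters,
    # so punctuation and hyphens act as separators.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


counts = bag_of_words('Linux distribution include the Linux Kernel.')
print(dict(counts))
# {'linux': 2, 'distribution': 1, 'include': 1, 'the': 1, 'kernel': 1}
```

Word order and grammar are gone from the result; only how often each word occurs is kept.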

## Dataset

In [1]:
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distribution include the Linux Kernel.',
    'Linux is one of the most prominent open-source software.',
]
corpus

['Linux has been around since the mid-1990s.',
 'Linux distribution include the Linux Kernel.',
 'Linux is one of the most prominent open-source software.']

## Bag of Words model with `CountVectorizer`

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# fit_transform returns a sparse matrix; todense() converts it to a dense two-dimensional matrix
vectorized_x = vectorizer.fit_transform(corpus).todense()
vectorized_x

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [4]:
vectorizer.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() on newer versions

['1990s',
 'around',
 'been',
 'distribution',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']

Note:
- each column of the matrix above corresponds, in order, to one word in this alphabetically sorted list
- a value of 1 means the word appears once in the document, 2 means it appears twice, and 0 means it does not appear at all
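To make the column-to-word correspondence concrete, the sketch below pairs the feature names with the second row of the matrix, both copied from the outputs above, and keeps only the words that actually occur. The variable names are chosen for illustration.

```python
# Feature names and the second matrix row, copied from the outputs above.
feature_names = [
    '1990s', 'around', 'been', 'distribution', 'has', 'include', 'is',
    'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
    'since', 'software', 'source', 'the',
]
row_doc2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Column k of the row holds the count for feature_names[k].
word_counts = {
    word: count for word, count in zip(feature_names, row_doc2) if count > 0
}
print(word_counts)
# {'distribution': 1, 'include': 1, 'kernel': 1, 'linux': 2, 'the': 1}
```

The result matches the second document, 'Linux distribution include the Linux Kernel.', in which "linux" appears twice.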

## Euclidean Distance for measuring the closeness/similarity between documents (vectors)

In [6]:
from sklearn.metrics.pairwise import euclidean_distances

for i in range(len(vectorized_x)):
    # start at i + 1 so each pair is compared once and a document is never compared with itself
    for j in range(i + 1, len(vectorized_x)):
        distance = euclidean_distances(vectorized_x[i], vectorized_x[j])
        print(f'Distance between documents {i+1} and {j+1}: {distance}')

Distance between documents 1 and 2: [[3.16227766]]
Distance between documents 1 and 3: [[3.74165739]]
Distance between documents 2 and 3: [[3.46410162]]

Note: the smaller the distance, the more similar the documents are.
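The Euclidean distance between two count vectors is the square root of the sum of their squared element-wise differences. The sketch below recomputes the three distances by hand from the rows of the matrix shown earlier; the helper `euclidean` is a hypothetical name, not a scikit-learn function.

```python
import math

# Rows of the count matrix, copied from the CountVectorizer output above.
doc1 = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
doc2 = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
doc3 = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]


def euclidean(u, v):
    """Square root of the summed squared differences between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(f'{euclidean(doc1, doc2):.8f}')  # 3.16227766
print(f'{euclidean(doc1, doc3):.8f}')  # 3.74165739
print(f'{euclidean(doc2, doc3):.8f}')  # 3.46410162
```

Documents 1 and 2 differ in ten positions by one count each, which gives the square root of 10, the smallest of the three distances.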

## Stop Word Filtering on text

Stop Word Filtering simplifies the text representation by ignoring certain words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).

Reference: https://en.wikipedia.org/wiki/Stop_word
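A minimal sketch of the filtering step, using only the standard library, might look like the following. The stop list here is an assumption made for illustration: it contains the example words from the paragraph above plus "is" and "of", and it is not the list any particular library uses. The helper name `remove_stop_words` is likewise hypothetical.

```python
import re

# Illustrative stop list only: the example words from the text plus "is" and "of".
STOP_WORDS = {"the", "a", "an", "do", "be", "will", "on", "in", "at", "is", "of"}


def remove_stop_words(text):
    """Tokenize the lowercased text and drop every token in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


kept = remove_stop_words('Linux is one of the most prominent open-source software.')
print(kept)
# ['linux', 'one', 'most', 'prominent', 'open', 'source', 'software']
```

A fuller stop list would typically remove more words than this small set does.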

## Dataset

In [7]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distribution include the Linux Kernel.',
 'Linux is one of the most prominent open-source software.']

## Stop Word Filtering with `CountVectorizer`

Stop Word Filtering can also be applied using `CountVectorizer`.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# stop_words='english' applies the built-in English stop word list
vectorizer = CountVectorizer(stop_words='english')
# fit_transform returns a sparse matrix; todense() converts it to a dense two-dimensional matrix
vectorized_x = vectorizer.fit_transform(corpus).todense()
vectorized_x

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [11]:
vectorizer.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() on newer versions

['1990s',
 'distribution',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']
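To see exactly which words the English stop word list removed from this corpus, the two feature lists shown in this section can be compared directly. The sketch below copies both lists from the outputs above and takes their set difference; the variable names are chosen for illustration.

```python
# Feature names without stop word filtering (first CountVectorizer output).
all_features = [
    '1990s', 'around', 'been', 'distribution', 'has', 'include', 'is',
    'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
    'since', 'software', 'source', 'the',
]
# Feature names with stop_words='english' (second CountVectorizer output).
filtered_features = [
    '1990s', 'distribution', 'include', 'kernel', 'linux', 'mid', 'open',
    'prominent', 'software', 'source',
]

# Words present before filtering but absent afterwards.
removed = sorted(set(all_features) - set(filtered_features))
print(removed)
# ['around', 'been', 'has', 'is', 'most', 'of', 'one', 'since', 'the']
```

Nine of the original 19 features are dropped, which is why the filtered matrix has only 10 columns.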

source: https://www.youtube.com/watch?v=U30sF4m0bd0 Trs_m