# Bag-of-words model as a text representation
The bag-of-words model simplifies a text into a collection (bag) of its words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is discarded.
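
As a rough illustration of the idea, the sketch below builds a bag of words by hand with only the standard library. The helper name `bag_of_words` and the regular expression are assumptions made for this example and are not part of scikit-learn; the pattern keeps tokens of two or more word characters and treats punctuation as a separator.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Hypothetical helper: lowercase the text, drop punctuation, count tokens."""
    # Keep runs of two or more word characters; punctuation acts as a separator.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

bow = bag_of_words("Linux distributions include the linux kernel.")
print(bow)
```

Because the text is lowercased first, "Linux" and "linux" are counted as the same token, and word order plays no role in the result.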

In [1]:
corpus = [' Linux has been around since the mid-1998s.',
         'Linux distributions include the linux kernel.',
         'Linux is one of the most prominent open-source software']

corpus

[' Linux has been around since the mid-1998s.',
 'Linux distributions include the linux kernel.',
 'Linux is one of the most prominent open-source software']

## Bag-of-words model with CountVectorizer
A bag-of-words model can be built with scikit-learn's CountVectorizer.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [4]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()



['1998s',
 'around',
 'been',
 'distributions',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']

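Each value is the number of times a given token (word) occurs in the corresponding sentence. For example, the 2 in the second row belongs to the `linux` column, because "linux" appears twice in the second sentence.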
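
To read a single column without counting positions by hand, one option is the fitted `vocabulary_` attribute, which maps each token to its column index. This is a minimal sketch that refits the same corpus; the names `X`, `col`, and `linux_counts` are chosen only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [' Linux has been around since the mid-1998s.',
          'Linux distributions include the linux kernel.',
          'Linux is one of the most prominent open-source software']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each token to its column index in the count matrix
col = vectorizer.vocabulary_['linux']

# Occurrences of "linux" in each of the three documents
linux_counts = X[:, col].tolist()
print(col, linux_counts)
```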

## Euclidean distance for measuring the closeness between documents (vectors)

In [15]:
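import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Recent scikit-learn releases reject np.matrix input, so convert to an ndarray
X = np.asarray(vectorized_X)

# Compare every pair of documents once (j always starts after i)
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        # Slicing keeps each row 2D, as euclidean_distances expects
        jarak = euclidean_distances(X[i:i + 1], X[j:j + 1])
        print(f'Distance between document {i+1} and {j+1}: {jarak}')

Distance between document 1 and 2: [[3.16227766]]
Distance between document 1 and 3: [[3.74165739]]
Distance between document 2 and 3: [[3.46410162]]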




Documents 1 and 2 have the smallest distance (about 3.16), so they are the most similar pair among the three documents.
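
For reference, the Euclidean distance between two count vectors is the square root of the sum of their squared element-wise differences. The sketch below checks that by hand with NumPy for documents 1 and 2 and then computes all pairwise distances in one call; it refits the same corpus so it can run on its own, and the names `manual_d12` and `D` are introduced only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [' Linux has been around since the mid-1998s.',
          'Linux distributions include the linux kernel.',
          'Linux is one of the most prominent open-source software']

X = CountVectorizer().fit_transform(corpus).toarray()

# Euclidean distance by hand: square root of the summed squared differences
manual_d12 = np.sqrt(np.sum((X[0] - X[1]) ** 2))

# Passing the whole matrix returns every pairwise distance at once
D = euclidean_distances(X)

print(manual_d12)
print(D.round(3))
```

The result `D` is a symmetric 3x3 matrix with zeros on the diagonal, so `D[0, 1]` holds the same distance as the manual calculation for documents 1 and 2.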

## Stop word filtering on text
Stop word filtering simplifies the text representation by removing very common words such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).
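
To get a feel for which words are dropped, the sketch below filters a hand-written token list against `ENGLISH_STOP_WORDS`, the built-in English list that scikit-learn uses for `stop_words='english'`. The token list and the names `kept` and `removed` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Tokens of the first corpus sentence, written out by hand for illustration
tokens = ['linux', 'has', 'been', 'around', 'since', 'the', 'mid', '1998s']

# Split the tokens into those that survive filtering and those that are dropped
kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
removed = [t for t in tokens if t in ENGLISH_STOP_WORDS]

print(kept)
print(removed)
```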

In [16]:
corpus = [' Linux has been around since the mid-1998s.',
         'Linux distributions include the linux kernel.',
         'Linux is one of the most prominent open-source software']

corpus

[' Linux has been around since the mid-1998s.',
 'Linux distributions include the linux kernel.',
 'Linux is one of the most prominent open-source software']

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).todense()
vectorized_X

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [18]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()



['1998s',
 'distributions',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']