## Introduction to Text Processing:
### Bag of Words & Stop Word Filtering

Bag of Words simplifies the representation of text by treating it as a collection of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.
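To make the idea concrete, here is a minimal sketch in plain Python. The function name `bag_of_words` is hypothetical, and the token rule (runs of two or more word characters) is an assumption chosen to resemble the default behaviour of scikit-learn's `CountVectorizer`; it is not a full reimplementation.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, drop punctuation, and count the remaining words."""
    # Assumption: a token is a run of two or more word characters,
    # so punctuation and single-character tokens are discarded.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

counts = bag_of_words('Linux distributions include the Linux kernel.')
print(counts)
```

Word order is lost in the result: only the words and how often each one occurs are kept.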

### Creating the Dataset

In [1]:
corpus = [
    'Linux has been around since the mid-1998s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

corpus

['Linux has been around since the mid-1998s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

### Bag of Words model with CountVectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_x = vectorizer.fit_transform(corpus).todense()
vectorized_x

matrix([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]],
       dtype=int64)

In [3]:
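# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
vectorizer.get_feature_names()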

['1998s',
 'around',
 'been',
 'distributions',
 'has',
 'include',
 'is',
 'kernel',
 'linux',
 'mid',
 'most',
 'of',
 'one',
 'open',
 'prominent',
 'since',
 'software',
 'source',
 'the']
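Each column of the count matrix corresponds to one entry of this feature list, in the same order. As a small illustration, the sketch below pairs the feature names with the second row of the matrix shown earlier; both lists are copied by hand from the outputs above, and the variable names are only illustrative.

```python
# Feature names and the second matrix row, copied from the outputs above
feature_names = ['1998s', 'around', 'been', 'distributions', 'has', 'include',
                 'is', 'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open',
                 'prominent', 'since', 'software', 'source', 'the']
doc2_counts = [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Keep only the words that actually occur in document 2
doc2_words = {name: n for name, n in zip(feature_names, doc2_counts) if n > 0}
print(doc2_words)
```

The non-zero entries recover the words of 'Linux distributions include the Linux kernel.', with `linux` counted twice.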

### Euclidean Distance to measure the closeness/distance between documents (vectors)

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

# Compare every unordered pair of documents exactly once
for i in range(len(vectorized_x)):
    for j in range(i + 1, len(vectorized_x)):
        distance = euclidean_distances(vectorized_x[i], vectorized_x[j])
        print(f'Distance between documents {i+1} and {j+1}: {distance}')
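Since the cell above has no recorded output, the following sketch shows what the computation amounts to, using only the standard library. It assumes the count vectors from the first `CountVectorizer` output (copied by hand); the helper name `euclidean` is hypothetical and is not the scikit-learn function.

```python
import math

# Count vectors for documents 1-3, copied from the first CountVectorizer output
doc_vectors = [
    [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1],
]

def euclidean(u, v):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

for i in range(len(doc_vectors)):
    for j in range(i + 1, len(doc_vectors)):
        d = euclidean(doc_vectors[i], doc_vectors[j])
        print(f'Distance between documents {i+1} and {j+1}: {d:.4f}')
```

A smaller distance means the two documents have more similar word counts.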

### Stop Word Filtering on text

Stop Word Filtering simplifies the representation of text by ignoring certain common words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, in, at).
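As a rough sketch of the idea, the snippet below filters tokens against a tiny stop list. The `STOP_WORDS` set is hypothetical and contains only the example words mentioned above; the built-in English list in scikit-learn is considerably larger. The function name and token rule are likewise assumptions.

```python
import re

# Hypothetical mini stop list built from the examples in the text
STOP_WORDS = {'the', 'a', 'an', 'do', 'be', 'will', 'on', 'in', 'at'}

def remove_stop_words(text):
    """Tokenize in lowercase, then drop every token found in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words('Linux distributions include the Linux kernel.')
print(filtered)
```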

### Building the Dataset

In [4]:
corpus

['Linux has been around since the mid-1998s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

### Using CountVectorizer

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_x = vectorizer.fit_transform(corpus).todense()
vectorized_x

matrix([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
        [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)

In [7]:
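# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
vectorizer.get_feature_names()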

['1998s',
 'distributions',
 'include',
 'kernel',
 'linux',
 'mid',
 'open',
 'prominent',
 'software',
 'source']
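Comparing the two vocabularies shows which words the English stop-word list removed. The sketch below copies both feature lists by hand from the outputs in this notebook; the variable names are only illustrative.

```python
# Vocabulary without stop word filtering (19 terms)
all_terms = ['1998s', 'around', 'been', 'distributions', 'has', 'include',
             'is', 'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open',
             'prominent', 'since', 'software', 'source', 'the']

# Vocabulary with stop_words='english' (10 terms)
kept_terms = ['1998s', 'distributions', 'include', 'kernel', 'linux',
              'mid', 'open', 'prominent', 'software', 'source']

# Terms present before filtering but absent afterwards
removed_terms = [t for t in all_terms if t not in kept_terms]
print(removed_terms)
```

The filtered matrix has 10 columns instead of 19, so each document vector is shorter and no longer carries counts for these common words.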