<a href="https://colab.research.google.com/github/LatiefDataVisionary/scikit-learn-with-indonesia-belajar/blob/main/SKLearn_10_Bag_of_Words_%26_Stop_Word_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SKLearn 10: Bag of Words & Stop Word Filtering | Text Processing | Learning Basic Machine Learning**


**The Bag of Words model as a text representation**

Bag of Words simplifies the representation of text into a collection of words, ignoring grammar and the position of each word in the sentence. The text is converted to lowercase and punctuation is ignored.

Reference: https://en.wikipedia.org/wiki/Bag-of-words_model
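
To make the idea concrete, here is a minimal pure-Python sketch of what a bag of words does to a single sentence. The helper name `bag_of_words` is hypothetical and not part of scikit-learn; the regular expression keeps tokens of two or more word characters, which is the same pattern `CountVectorizer` uses by default.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for `text`, ignoring case, punctuation, and word order."""
    # Lowercase first, then keep runs of two or more word characters as tokens.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

bow = bag_of_words('Linux distributions include the Linux kernel.')
print(bow)

# Word order does not matter: a shuffled sentence gives the same bag.
shuffled = bag_of_words('kernel Linux the include distributions Linux')
print(bow == shuffled)
```

Because only the counts survive, two sentences with the same words in a different order become indistinguishable under this representation.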

# **Dataset**

In [3]:
corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

## **Bag of Words with `CountVectorizer`**

The Bag of Words model can be applied using `CountVectorizer`.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorized_X = vectorizer.fit_transform(corpus).toarray()
vectorized_X

array([[1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1]])

In [15]:
vectorizer.get_feature_names_out()

array(['1990s', 'around', 'been', 'distributions', 'has', 'include', 'is',
       'kernel', 'linux', 'mid', 'most', 'of', 'one', 'open', 'prominent',
       'since', 'software', 'source', 'the'], dtype=object)

## **Euclidean distance for measuring the closeness/distance between documents (vectors)**

In [18]:
from sklearn.metrics.pairwise import euclidean_distances

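# Compare every unordered pair of documents exactly once (j always greater than i).
for i in range(len(vectorized_X)):
  for j in range(i + 1, len(vectorized_X)):
    # euclidean_distances expects 2D input, so each row is reshaped to (1, n_features).
    jarak = euclidean_distances(vectorized_X[i].reshape(1, -1), vectorized_X[j].reshape(1, -1))
    print(f'Distance between documents {i+1} and {j+1}: {jarak}')

Distance between documents 1 and 2: [[3.16227766]]
Distance between documents 1 and 3: [[3.74165739]]
Distance between documents 2 and 3: [[3.46410162]]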
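
The Euclidean distance between two count vectors is the square root of the sum of their squared element-wise differences. As a sanity check, the first result can be reproduced with plain NumPy; the two vectors below are copied from rows 1 and 2 of the count matrix printed earlier, and the names `doc1`, `doc2`, and `distance` are illustrative.

```python
import numpy as np

# Rows 1 and 2 of the unfiltered count matrix.
doc1 = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
doc2 = np.array([0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Square the per-word differences, sum them, and take the square root.
distance = np.sqrt(np.sum((doc1 - doc2) ** 2))
print(distance)
```

The two documents differ in ten squared units in total, so the distance is the square root of 10, matching the value reported for documents 1 and 2.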


## **Stop Word Filtering on text**

Stop Word Filtering simplifies the representation of text by ignoring certain words, such as determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (in, on, at).

Reference: https://en.wikipedia.org/wiki/Stop_word
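
As a rough illustration of the filtering step on its own, the sketch below removes stop words from one sentence by hand, using the `ENGLISH_STOP_WORDS` set that scikit-learn ships in `sklearn.feature_extraction.text`. The tokenization here is deliberately naive and is an assumption made for this example, not what `CountVectorizer` does internally.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = 'Linux is one of the most prominent open-source software.'

# Naive tokenization for illustration only: lowercase, drop the final period,
# and split on spaces and hyphens.
tokens = sentence.lower().rstrip('.').replace('-', ' ').split()

# Keep only the tokens that are not in the built-in English stop word set.
kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(kept)
```

The words that remain are the ones that carry the topic of the sentence, while frequent function words such as "is", "of", and "the" are dropped.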

## **Dataset**

In [19]:
corpus

['Linux has been around since the mid-1990s.',
 'Linux distributions include the Linux kernel.',
 'Linux is one of the most prominent open-source software.']

## **Stop Word Filtering with `CountVectorizer`**

Stop Word Filtering can also be applied using `CountVectorizer`.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
vectorized_X = vectorizer.fit_transform(corpus).toarray()
vectorized_X

array([[1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 1, 1, 2, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]])

In [22]:
vectorizer.get_feature_names_out()

array(['1990s', 'distributions', 'include', 'kernel', 'linux', 'mid',
       'open', 'prominent', 'software', 'source'], dtype=object)
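
One possible follow-up is to recompute the document distances on the filtered vectors, since removing shared function words changes how far apart the documents appear. The sketch below passes the whole matrix to `euclidean_distances`, which returns all pairwise distances at once; the names `X_filtered` and `distances` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus = [
    'Linux has been around since the mid-1990s.',
    'Linux distributions include the Linux kernel.',
    'Linux is one of the most prominent open-source software.'
]

# Fit with the built-in English stop word list; the result is a sparse matrix.
X_filtered = CountVectorizer(stop_words='english').fit_transform(corpus)

# With a single argument, euclidean_distances compares every row with every
# other row and returns a square matrix (3 x 3 here) with zeros on the diagonal.
distances = euclidean_distances(X_filtered)
print(distances.round(3))
```

Entry `[i, j]` of the result is the distance between documents `i+1` and `j+1`, so the explicit double loop is not needed when all pairs are wanted.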