## Bag Of Words
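+ A model that ignores the context and order of words and builds feature values from their <strong>frequency</strong>
+ 1. Collect every word that appears in the sentences, <strong>remove duplicates</strong>, and lay the words out as columns.
+ 2. Assign each word a <strong>unique index</strong>.
+ 3. For each sentence, record how often each word occurs (its frequency) in that word's column.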
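
The three steps above can be sketched in plain Python. This is a minimal illustration, not how `CountVectorizer` is implemented: the helper names `tokenize` and `bag_of_words` and the two example sentences are hypothetical, and tokenization is assumed to be a simple lowercase-and-split.

```python
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return sentence.lower().split()


def bag_of_words(sentences):
    tokenized = [tokenize(s) for s in sentences]

    # Steps 1-2: remove duplicate words and give each one a unique index.
    # Sorting keeps the index assignment deterministic.
    unique_words = sorted({word for tokens in tokenized for word in tokens})
    vocab = {word: idx for idx, word in enumerate(unique_words)}

    # Step 3: record how often each word occurs in each sentence.
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[vocab[word]] = count
        matrix.append(row)
    return vocab, matrix


sentences = ["the cat sat on the mat", "the dog sat"]
vocab, matrix = bag_of_words(sentences)
print(vocab)
print(matrix)
```

Each row of `matrix` is one sentence and each column is one vocabulary word, which is the same layout that the `CountVectorizer` output later in this section uses.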

#### Advantages of BOW
- Converts the words and tokens in a document into numeric data
- Although it relies only on word frequency, it still captures characteristics of sentences and texts

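#### Disadvantages of BOW
- Because word order is not considered, <strong>the flow of context cannot be reflected.</strong>
- Sparse matrix problem
  - Sparse matrix = a matrix in which most values are 0
  - Applying feature vectorization with BOW produces data in the form of a sparse matrix.
  - Because most entries are 0, the matrix wastes memory and computation and can degrade the performance of ML models.
  - Each text is usually made up of sentences built from different words, so it is far more common for a word to be absent from a given document than present in it.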
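
Both disadvantages can be seen on a toy corpus. The sketch below is only illustrative: the helpers `count_vector` and `sparsity` and the three example documents are hypothetical, and whitespace tokenization is assumed.

```python
def count_vector(sentence, vocab):
    # Count how often each vocabulary word occurs in the sentence.
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocab]


def sparsity(matrix):
    # Fraction of matrix entries that are zero.
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


docs = ["dog bites man", "man bites dog", "stocks fell sharply today"]
vocab = sorted({word for doc in docs for word in doc.lower().split()})
matrix = [count_vector(doc, vocab) for doc in docs]

# Same words in a different order give identical vectors,
# so the difference in meaning is lost.
print(matrix[0] == matrix[1])

# Share of zero entries in the document-term matrix.
print(sparsity(matrix))
```

Even with three short documents, more than half of the entries are zero. As the vocabulary grows, the share of zeros usually grows with it.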

In [1]:
# Baseline NLTK setup: download the resources used for text preprocessing
import nltk
nltk.download('stopwords')  # stop-word lists
nltk.download('punkt')      # tokenizer models
nltk.download('wordnet')    # WordNet data for lemmatization

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [3]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

In [5]:
vectorizer = CountVectorizer(min_df = 1, ngram_range=(1, 1))

corpus = [
          'The proposed rising was a dismal failure, but the Habeas Corpus Act was suspended and Thistlewood and Watson were seized, although upon being tried they were acquitted.', 
          'Before the prorogation, however, he saw the invaluable Act of Habeas Corpus, which he had carried through parliament, receive the royal assent.', 
          'These Personal Liberty Laws forbade justices and judges to take cognizance of claims, extended the habeas corpus act and the privilege of jury trial to fugitives, and punished false testimony severely.', 
          'The procession of the Host on Corpus Christi day became, as it were, a public demonstration of Catholic orthodoxy against Protestantism and later against religious Liberalism.'
          ]

features = vectorizer.fit_transform(corpus)
print(features)


  (0, 61)	2
  (0, 46)	1
  (0, 53)	1
  (0, 70)	2
  (0, 19)	1
  (0, 21)	1
  (0, 10)	1
  (0, 25)	1
  (0, 16)	1
  (0, 1)	1
  (0, 58)	1
  (0, 4)	2
  (0, 64)	1
  (0, 71)	1
  (0, 72)	2
  (0, 56)	1
  (0, 3)	1
  (0, 69)	1
  (0, 9)	1
  (0, 68)	1
  (0, 63)	1
  (0, 0)	1
  (1, 61)	3
  (1, 25)	1
  (1, 16)	1
  :	:
  (2, 22)	1
  (2, 60)	1
  (2, 57)	1
  (3, 61)	2
  (3, 16)	1
  (3, 4)	1
  (3, 72)	1
  (3, 39)	2
  (3, 45)	1
  (3, 28)	1
  (3, 40)	1
  (3, 13)	1
  (3, 17)	1
  (3, 7)	1
  (3, 5)	1
  (3, 31)	1
  (3, 49)	1
  (3, 18)	1
  (3, 12)	1
  (3, 41)	1
  (3, 2)	2
  (3, 48)	1
  (3, 35)	1
  (3, 52)	1
  (3, 37)	1


In [6]:
features.shape

(4, 74)

In [7]:
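# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
# .tolist() converts the returned NumPy array to a plain Python list.
vocab = vectorizer.get_feature_names_out().tolist()
print(len(vocab))
vocab[:10]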

74


['acquitted',
 'act',
 'against',
 'although',
 'and',
 'as',
 'assent',
 'became',
 'before',
 'being']

In [8]:
pd.DataFrame(features.toarray(), columns = vocab).head()


Unnamed: 0,acquitted,act,against,although,and,as,assent,became,before,being,but,carried,catholic,christi,claims,cognizance,corpus,day,demonstration,dismal,extended,failure,false,forbade,fugitives,habeas,had,he,host,however,invaluable,it,judges,jury,justices,later,laws,liberalism,liberty,of,on,orthodoxy,parliament,personal,privilege,procession,proposed,prorogation,protestantism,public,punished,receive,religious,rising,royal,saw,seized,severely,suspended,take,testimony,the,these,they,thistlewood,through,to,trial,tried,upon,was,watson,were,which
0,1,1,0,1,2,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,0,2,0,1,1,0,0,0,1,1,2,1,2,0
1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,2,0,1,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,0,0,3,0,0,0,1,0,0,0,0,0,0,0,1
2,0,1,0,0,3,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,1,0,1,1,1,1,0,0,0,0,0,0,1,1,1,0,1,0,1,2,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,1,2,1,0,0,0,2,1,0,0,0,0,0,0
3,0,0,2,0,1,1,0,1,0,0,0,0,1,1,0,0,1,1,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,2,1,1,0,0,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0
