## Bag of Words
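* A model that extracts feature values by assigning a frequency to every word in a document, ignoring context and word order.
* BOW: the name comes from the image of putting all the words of a document into a bag and shaking them up.
    + Pros: easy to understand and quick to build
    + Cons: hard to capture contextual meaning; the resulting sparse matrix slows down training and prediction for ML algorithms
* BOW models are usually built in one of two ways: count-based (raw frequency) or weight-based (tf-idf).
* Note that sklearn's CountVectorizer and TfidfVectorizer classes return a SciPy sparse matrix rather than a dense array, so only the nonzero entries are stored.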
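
To make the sparse-versus-dense point concrete, here is a minimal sketch in plain Python of storing only the nonzero cells of a small matrix as (row, column, value) triples. The matrix values are made up for illustration, and this is only a conceptual picture of coordinate-style storage, not how SciPy implements its sparse formats.

```python
# A small, mostly-zero matrix (illustrative values, not taken from the dataset)
dense = [
    [0, 0, 3, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0],
]

# Keep only the nonzero cells as (row, column, value) triples
triples = [
    (i, j, v)
    for i, row in enumerate(dense)
    for j, v in enumerate(row)
    if v != 0
]

n_cells = len(dense) * len(dense[0])
density = len(triples) / n_cells  # fraction of cells that are nonzero

print(triples)
print(f"{len(triples)} stored values instead of {n_cells} cells")
```

The fewer nonzero cells a document-term matrix has, the more this kind of storage saves compared with a dense array.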

# Building a BOW model

In [1]:
# Build a BOW model for the following sentences:
# s1: my wife likes to watch baseball games and
#     my daughter likes to watch baseball games too.
# s2: my wife likes to play baseball.

# List the words used across all sentences, de-duplicated and sorted
# and:0, baseball, daughter, games, likes,
# my, play, to, too, watch, wife:10  (indices 0-10, 11 words in total)

# Record how often each word appears in each sentence, as a matrix
#    and, baseball, daughter, games, ..., wife
# s1  1      2         1       2           1
# s2  0      1         0       0           1
# This is the document-term matrix.
# Advantage of the BOW model: easy to understand and quick to build
# Drawback: the sparse matrix slows down ML training and prediction
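
The hand-built matrix above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `tokenize` helper is a hypothetical simplification (lowercase, drop periods, split on whitespace) and is not the tokenizer that sklearn's CountVectorizer uses.

```python
from collections import Counter

docs = [
    "my wife likes to watch baseball games and "
    "my daughter likes to watch baseball games too.",
    "my wife likes to play baseball.",
]

def tokenize(text):
    # Hypothetical, simplified tokenizer: lowercase, drop periods, split on spaces
    return text.lower().replace(".", "").split()

tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct word, sorted alphabetically
vocab = sorted({word for doc in tokenized for word in doc})

# Document-term matrix: one row per sentence, one column per vocabulary word
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row lists the counts in vocabulary order, so the columns for and, baseball, daughter, games, and wife line up with the table in the comments.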


# Text classification example
+ Classifying Usenet posts by topic

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [5]:
# Load and inspect the data
from sklearn.datasets import fetch_20newsgroups

news_data = fetch_20newsgroups(subset='all')


In [6]:
print(news_data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [7]:
# news_data is a dict-like object; check its keys
news_data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [8]:
# The first Usenet post
print( news_data.data[0] )

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [9]:
# Check the target labels
news_data.target_names  # names of the 20 newsgroups

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [10]:
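# Split the data
# news_data2 holds the posts with headers, footers, and quotes removed (body text only)
news_data2 = fetch_20newsgroups(subset='all',
            remove=('headers', 'footers', 'quotes'),
            random_state=2208311515)

# Note: the split below uses the unstripped news_data, so headers remain in the text.
# To train on body text only, use news_data2.data together with news_data2.target.
data = news_data.data
target = news_data.target

Xtrain, Xtest, ytrain, ytest = train_test_split( data, target, 
    test_size=0.3, stratify=target, random_state=2208311515)  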
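
As a rough illustration of what `stratify=target` is for, the sketch below splits indices class by class so that each class contributes about the same fraction to the test set. The `stratified_split` helper and the toy labels are hypothetical, and this is not sklearn's implementation of `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with similar class ratios in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, imbalanced labels (illustrative only)
labels = [0] * 100 + [1] * 50 + [2] * 10
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a 30% test size, each class gives up about 30% of its samples, so the class proportions in the test set mirror those of the full data.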

In [11]:
print(data[0].strip())

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


In [13]:
# Class counts in the training split
pd.Series(ytrain).value_counts()[:5]

In [14]:
pd.Series(ytest).value_counts()[:5]

10    300
8     299
15    299
9     298
7     297
dtype: int64

# Building a count-based BOW model


In [15]:
cntv = CountVectorizer()
cvXtrain = cntv.fit_transform(Xtrain)
cvXtest = cntv.transform(Xtest)



In [16]:
cvXtrain.shape  # (number of documents, number of words)

(13192, 142039)
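
The shape above gives a feel for why a sparse format matters here. The back-of-the-envelope sketch below estimates how much memory a dense array of that shape would need, assuming 8 bytes per cell (int64, which is the default dtype of CountVectorizer as far as I know). It measures nothing on the real matrix; it is arithmetic only.

```python
n_docs, n_words = 13192, 142039  # shape reported for cvXtrain
bytes_per_cell = 8               # assumption: 64-bit integer counts

cells = n_docs * n_words
dense_bytes = cells * bytes_per_cell

print(f"cells: {cells:,}")
print(f"dense size: {dense_bytes / 1024**3:.2f} GiB")
```

A sparse matrix avoids this cost because each post uses only a small part of the vocabulary, so most cells are zero and are never stored.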

In [None]:
# Train
lrclf = LogisticRegression(max_iter=1000)

lrclf.fit(cvXtrain, ytrain)
print( lrclf.score(cvXtrain, ytrain) )  # training accuracy

pred = lrclf.predict(cvXtest)
accuracy_score(ytest, pred)  # test accuracy

# Building a tf-idf (importance-weighted) BOW model

In [None]:
tfv = TfidfVectorizer()
tfXtrain = tfv.fit_transform(Xtrain)
tfXtest = tfv.transform(Xtest)
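
To show what the tf-idf weighting does to raw counts, here is a minimal sketch in plain Python. It assumes the smoothed formula that, to my understanding, TfidfVectorizer applies by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row. The counts are the baseball, play, and watch columns of the two toy sentences from the start of the notebook.

```python
import math

# Counts for the terms ["baseball", "play", "watch"] in the two toy sentences
counts = [
    [2, 0, 2],  # s1
    [1, 1, 0],  # s2
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed default formula)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight the counts by idf, then scale each row to unit L2 length
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 4) for v in row])
```

In s1, "baseball" and "watch" have the same raw count, but "watch" ends up with the larger weight because it occurs in only one of the two documents.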

In [None]:
tfXtrain.shape  # (number of documents, number of words)

In [None]:
# Train
lrclf = LogisticRegression(max_iter=1000)

lrclf.fit(tfXtrain, ytrain)
print( lrclf.score(tfXtrain, ytrain) )  # training accuracy

pred = lrclf.predict(tfXtest)
accuracy_score(ytest, pred)  # test accuracy