## [Text Mining IMDB Movie Review Data with Machine Learning (Sentiment Analysis)]

# 1. Load data

In [37]:
#Load the libraries
import numpy as np
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from textblob import TextBlob
from textblob import Word
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [7]:
#Load the data
imdb_data=pd.read_csv('C:/Users/고유경/Desktop/대외활동/투빅스/과제/5주차/IMDB Dataset.csv/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


| |review|sentiment|
|---|---|---|
|0|One of the other reviewers has mentioned that ...|positive|
|1|A wonderful little production. <br /><br />The...|positive|
|2|I thought this was a wonderful way to spend ti...|positive|
|3|Basically there's a family where a little boy ...|negative|
|4|Petter Mattei's "Love in the Time of Money" is...|positive|
|5|Probably my all-time favorite movie, a story o...|positive|
|6|I sure would like to see a resurrection of a u...|positive|
|7|This show was an amazing, fresh & innovative i...|negative|
|8|Encouraged by the positive comments about this...|negative|
|9|If you like original gut wrenching laughter yo...|positive|


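'review' : the text of the movie review

'sentiment' : the target variable, labeled with two classes (positive and negative)


Because every review is already labeled as positive or negative, the data is well suited to supervised machine learning (classification).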

# 2. EDA & Pre-processing

In [8]:
#Summary of the dataset
imdb_data.describe()

| |review|sentiment|
|---|---|---|
|count|50000|50000|
|unique|49582|2|
|top|Loved today's show!!! It was a variety and not...|negative|
|freq|5|25000|


`top` is the most frequent value in each column, and `freq` is the number of times that value occurs.

In [9]:
#Count of each sentiment (target variable) class
imdb_data['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

The two classes are perfectly balanced (25,000 reviews each).

### Text preprocessing

- corpus: a collection of text data


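- (Table) **Commonly used regular expression symbols**

|Symbol|Meaning|
|---|-----|
| * | The preceding character or subexpression repeats zero or more times.|
| + | The preceding character or subexpression repeats one or more times.|
| [ ] | Matches any one of the characters inside the brackets.|
| ( ) | Groups the enclosed expression into a subexpression. Subexpressions are evaluated first when the regular expression is evaluated.|
| . | Matches any single character (by default, anything except a newline).|
| ^ | The character or subexpression that follows must appear at the start of the string.|
| $ | The preceding character or subexpression must appear at the end of the string.|
| {m} | The preceding character or subexpression repeats exactly m times.|
| {m,n} | The preceding character or subexpression repeats at least m and at most n times.|
| [^] | Matches any one character that is not listed inside the brackets.|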
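
The symbols in the table can be tried out with Python's built-in `re` module. The snippet below is a small sketch; the sample string is made up for illustration and does not come from the dataset.

```python
import re

# Hypothetical sample string, not taken from the IMDB data.
text = "aaa bb 2023 <br />"

a_runs = re.findall(r"a+", text)                 # '+': one or more 'a'
years = re.findall(r"[0-9]{4}", text)            # '[ ]' and '{m}': exactly four digits
others = re.findall(r"[^a-z\s]", text)           # '[^ ]': not a lowercase letter or whitespace
starts_with_aaa = bool(re.search(r"^aaa", text)) # '^': match at the start of the string
ends_with_tag = bool(re.search(r"/>$", text))    # '$': match at the end of the string

print(a_runs, years, others, starts_with_aaa, ends_with_tag)
```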

**1) Handling HTML markup**

In [10]:
imdb_data['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

The HTML markup in the reviews (for example the `<br />` line-break tags) needs to be removed.

In [11]:
from bs4 import BeautifulSoup

# Parse each review with html.parser and call get_text() to keep only the text inside the HTML
imdb_data['review'] = imdb_data['review'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())
imdb_data.review.head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

In [12]:
imdb_data['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

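The tags are gone! Note that the text on either side of each tag was joined without a space, so some sentences now run together (e.g. `me.The`).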
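
If the run-together sentences are a concern, the text fragments can be joined with a space instead. BeautifulSoup's `get_text()` accepts a `separator` argument for this purpose; the sketch below shows the same idea using only the standard library's `html.parser`, so it is a swapped-in technique rather than the notebook's own method. `TextExtractor` and `strip_html` are hypothetical helper names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

def strip_html(raw, separator=" "):
    """Return the text of `raw` with tags removed and fragments joined by `separator`."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Join the fragments, then collapse any repeated whitespace.
    return " ".join(separator.join(parser.parts).split())

sample = "happened with me.<br /><br />The first thing"
cleaned = strip_html(sample)
print(cleaned)
```

With a space as the separator, `me.` and `The` stay separate tokens, which matters later when the text is split on whitespace.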

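**2) Removing stopwords**
- Stopwords: words that occur frequently but carry little meaning on their own (e.g. 'a', 'the', 'an')
- The Python text-mining package **NLTK** provides stopword lists for several languages (English, French, German, Italian, and others; Korean is not included).
- For Korean, you therefore either write a stopword list yourself or reuse a list that has already been compiled.
- Here we use the **stopwords** corpus of the **NLTK** package.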
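
When no ready-made list exists for a language, a hand-written list works the same way. The sketch below assumes a tiny made-up stopword set and a hypothetical helper `remove_custom_stop_words`; storing the words in a `set` keeps each membership test fast on a large corpus.

```python
# Hypothetical hand-written stopword list; swap in a curated list for the target language.
custom_stop_words = {"a", "an", "the", "of", "is"}

def remove_custom_stop_words(corpus, stop_words):
    """Drop every whitespace-separated token that appears in `stop_words`."""
    return [
        " ".join(word for word in doc.split() if word not in stop_words)
        for doc in corpus
    ]

docs = ["the plot of the movie is a mess", "an amazing show"]
filtered = remove_custom_stop_words(docs, custom_stop_words)
print(filtered)
```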

In [14]:
from nltk.corpus import stopwords

#Stopword list
english_stop_words = stopwords.words('english')
english_stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [13]:
#The stopword list is entirely lowercase, so convert the text to lowercase first.
new_imdb = imdb_data['review'].apply(lambda x: x.lower())
new_imdb[0]

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.i would say the main appeal of the show is due to the fact that it goes where other shows wo

In [15]:
#Function that removes stopwords
def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in english_stop_words])
        )
    return removed_stop_words

new_imdb = remove_stop_words(new_imdb)

In [16]:
new_imdb[0]

"one reviewers mentioned watching 1 oz episode hooked. right, exactly happened me.the first thing struck oz brutality unflinching scenes violence, set right word go. trust me, show faint hearted timid. show pulls punches regards drugs, sex violence. hardcore, classic use word.it called oz nickname given oswald maximum security state penitentary. focuses mainly emerald city, experimental section prison cells glass fronts face inwards, privacy high agenda. em city home many..aryans, muslims, gangstas, latinos, christians, italians, irish more....so scuffles, death stares, dodgy dealings shady agreements never far away.i would say main appeal show due fact goes shows dare. forget pretty pictures painted mainstream audiences, forget charm, forget romance...oz mess around. first episode ever saw struck nasty surreal, say ready it, watched more, developed taste oz, got accustomed high levels graphic violence. violence, injustice (crooked guards who'll sold nickel, inmates who'll kill order g

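Stopwords such as 'the', 'of', and 'has' have been removed, and the text is noticeably shorter. Contractions such as "who'll" and tokens with attached punctuation (e.g. "me.the") do not match any stopword as written, so they survive this pass.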

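**3) Removing punctuation and special characters**
- Pass the regular expression to the **compile** function of the **re** library, then call the **sub** method of the compiled pattern with the replacement string as the first argument and the target string as the second.
- Every part of the target string that matches the regular expression is replaced with the replacement string.
- To delete the matching parts, pass "" (an empty string) as the first argument.
- The module-level `re.sub(pattern, repl, string)` used in the cell that follows takes the pattern as an additional first argument.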
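
As a minimal sketch of the compile-then-sub pattern described in the list, the snippet below replaces and deletes non-alphanumeric characters in a made-up sample string. The names `non_alnum` and `sample` are chosen only for this illustration.

```python
import re

# Compile the pattern once so it can be reused for every document.
non_alnum = re.compile(r"[^a-zA-Z0-9\s]")

sample = "sex, drugs (and more...)"  # hypothetical sample string

replaced = non_alnum.sub(" ", sample)  # each match becomes a space
deleted = non_alnum.sub("", sample)    # each match is removed

print(repr(replaced))
print(repr(deleted))
```

Replacing with a space keeps words apart when punctuation is the only thing separating them, at the cost of leaving runs of spaces that have to be collapsed afterwards.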

In [17]:
import re

#Replace everything except English letters, digits, and whitespace with a space
new_imdb = [re.sub(r'[^a-zA-Z0-9\s]', ' ', line.lower()) for line in new_imdb]

In [18]:
new_imdb[0]

'one reviewers mentioned watching 1 oz episode hooked  right  exactly happened me the first thing struck oz brutality unflinching scenes violence  set right word go  trust me  show faint hearted timid  show pulls punches regards drugs  sex violence  hardcore  classic use word it called oz nickname given oswald maximum security state penitentary  focuses mainly emerald city  experimental section prison cells glass fronts face inwards  privacy high agenda  em city home many  aryans  muslims  gangstas  latinos  christians  italians  irish more    so scuffles  death stares  dodgy dealings shady agreements never far away i would say main appeal show due fact goes shows dare  forget pretty pictures painted mainstream audiences  forget charm  forget romance   oz mess around  first episode ever saw struck nasty surreal  say ready it  watched more  developed taste oz  got accustomed high levels graphic violence  violence  injustice  crooked guards who ll sold nickel  inmates who ll kill order g

Parentheses, periods, and other punctuation marks are gone.

In [19]:
#With the punctuation gone, remove stopwords once more
new_imdb = remove_stop_words(new_imdb)

In [20]:
new_imdb[0]

'one reviewers mentioned watching 1 oz episode hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows dare forget pretty pictures painted mainstream audiences forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street ski

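Because the stopword-removal function splits on whitespace and rejoins the tokens with single spaces, the irregular spacing has also been normalized.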

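#### 4) Text Normalization
- When different word forms can be generalized to a single word, do so in order to reduce the number of distinct words in the documents.
- Mainly used in natural language processing tasks that rely on a BoW (Bag of Words) representation, which works from word frequencies.

**4-1. Stemming**
- Some words look different but carry the same meaning.
    - e.g. Korean: endings and particles / English: verb forms that change with the person of the subject or the tense
- Without preprocessing, such slightly different forms are each treated as a separate word.
- Stemming therefore maps words with the same meaning to one common form.
- Uses the **PorterStemmer** class of the **NLTK** package (LancasterStemmer and RegexpStemmer are also available).

**4-2. Lemmatization**
- Traces different word forms back to their root word (lemma) to see whether the number of distinct words can be reduced.
    - e.g. am, are, and is are spelled differently, but their lemma is be.
- Unlike stemming, lemmatization tends to keep a proper word form.
- When it is performed with context taken into account, the result also preserves the part of speech of the word (the POS tag).
- Uses the **WordNetLemmatizer** class of the **NLTK** package.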
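
The contrast between the two approaches can be sketched without any library. The code below is a toy illustration only: `toy_stem` is a crude suffix stripper, not the Porter algorithm, and `toy_lemmatize` is a lookup in a tiny made-up dictionary, not WordNet. Both names and both tables are assumptions introduced for this example.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical suffix list
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be", "having": "have"}  # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known lemmas; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["am", "having", "mentioned", "scenes"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based stemmer needs no dictionary but can produce fragments such as `hav`, while the dictionary-based lemmatizer only returns real words and leaves anything it does not know unchanged.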

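To illustrate the difference between stemming and lemmatization briefly (the exact output depends on the stemmer and lemmatizer used):

|**Stemming**|**Lemmatization**|
|-------------|-----------------|
|am → am|am → be|
|the going → the go|the going → the going|
|having → hav|having → have|

"Stemming sometimes produces words that do not exist, whereas lemmatization produces words that actually exist."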

In [21]:
# 1) stemming
def get_stemmed_text(corpus):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

stemmed_imdb = get_stemmed_text(new_imdb)

In [22]:
stemmed_imdb[0]

'one review mention watch 1 oz episod hook right exactli happen first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use word call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home mani aryan muslim gangsta latino christian italian irish scuffl death stare dodgi deal shadi agreement never far away would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanc oz mess around first episod ever saw struck nasti surreal say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard sold nickel inmat kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom comfort uncomfort view that get touch darker side'

You can see that the stems have been extracted, e.g. mentioned -> mention / unflinching -> unflinch.

In [23]:
# 2) Lemmatization
def get_lemmatized_text(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

lemmatized_imdb = get_lemmatized_text(new_imdb)

In [24]:
lemmatized_imdb[0]

'one reviewer mentioned watching 1 oz episode hooked right exactly happened first thing struck oz brutality unflinching scene violence set right word go trust show faint hearted timid show pull punch regard drug sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focus mainly emerald city experimental section prison cell glass front face inwards privacy high agenda em city home many aryan muslim gangsta latino christian italian irish scuffle death stare dodgy dealing shady agreement never far away would say main appeal show due fact go show dare forget pretty picture painted mainstream audience forget charm forget romance oz mess around first episode ever saw struck nasty surreal say ready watched developed taste oz got accustomed high level graphic violence violence injustice crooked guard sold nickel inmate kill order get away well mannered middle class inmate turned prison bitch due lack street skill prison experience watching oz

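Unlike stemming, the word forms are preserved almost intact. Because `lemmatize()` treats every word as a noun unless a part-of-speech tag is passed, plurals are reduced (reviewers -> reviewer) while verb forms such as 'mentioned' and 'watching' are left unchanged.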

#### 5) Vectorization
- general process of turning a collection of text documents into numerical feature vectors
- specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation

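**Bag of words**

- The most basic way to convert documents into numeric vectors
- Build a fixed vocabulary $\{t_1, t_2, \dots, t_m\}$ from the whole document set $\{d_1, d_2, \dots, d_n\}$, then record for each document $d_i$ whether (or how often) each vocabulary word occurs in it.
- Uses classes such as **CountVectorizer** and **TfidfVectorizer** from the **sklearn** package.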

**CountVectorizer**
1. Converts each document into a list of tokens.
2. Counts how often each token occurs in each document.
3. Converts each document into a BoW-encoded vector.
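
The three steps can be sketched in plain Python with `collections.Counter`. This is an illustration under simplifying assumptions: tokens are split on whitespace (CountVectorizer uses its own tokenizer), and the two documents are made up.

```python
from collections import Counter

docs = ["great movie great cast", "boring movie"]  # hypothetical documents

# 1. Convert each document into a list of tokens (whitespace split for simplicity).
tokenized = [doc.split() for doc in docs]

# Fixed vocabulary built from all documents, in sorted order.
vocab = sorted({token for tokens in tokenized for token in tokens})

# 2. Count how often each token occurs in each document.
counts = [Counter(tokens) for tokens in tokenized]

# 3. Encode each document as a vector over the vocabulary.
bow = [[count[term] for term in vocab] for count in counts]
# Presence/absence encoding, the idea behind binary=True.
binary_bow = [[int(count[term] > 0) for term in vocab] for count in counts]

print(vocab)
print(bow)
print(binary_bow)
```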

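**N-gram**
- A sequence of n consecutive words (two consecutive words form a bigram, three a trigram)
- A preprocessing step that is generally applied to English text
    - Expressions such as 'Republic of Korea' or 'United Kingdom' are recognized as a single unit only when n-grams are used.
- Applied indiscriminately, it produces many meaningless word chunks and can turn into unnecessary work.
- Depending on the nature of the text being analyzed, n-grams of order two or higher should therefore be mixed with unigrams (1-grams) in a suitable proportion.
- Uses the **ngrams** function of the **NLTK** package.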
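
A minimal sketch of n-gram extraction in plain Python is shown below. The helper name `word_ngrams` is hypothetical; it joins each window of `n` consecutive tokens with a space, which is also how a range such as `ngram_range=(1,3)` ends up mixing unigrams, bigrams, and trigrams in one vocabulary.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "republic of korea".split()

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

# Mixing orders 1 through 3.
mixed = unigrams + bigrams + trigrams

print(bigrams)
print(trigrams)
print(len(mixed))
```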

In [25]:
#Manually split the stemmed data into train and test sets at a 4:1 ratio
X_train = stemmed_imdb[:40000]
y_train = imdb_data.sentiment[:40000]
X_test = stemmed_imdb[40000:]
y_test = imdb_data.sentiment[40000:]

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

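#CountVectorizer
#Note: an integer max_df=1 drops every term that appears in more than one document;
#pass the float max_df=1.0 to keep all terms.
cv=CountVectorizer(min_df=0,max_df=1,binary=True, ngram_range=(1,3))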
#fit & transform X_train
cv_train_reviews=cv.fit_transform(X_train)
#transform X_test
cv_test_reviews=cv.transform(X_test)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

BOW_cv_train: (40000, 5820522)
BOW_cv_test: (10000, 5820522)
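
The cell above passes `max_df=1` as an integer. In scikit-learn an integer `max_df` is an absolute document count and a float is a proportion of documents, so the two spellings behave very differently. The sketch below mimics that documented rule in plain Python on made-up data; `filter_by_max_df` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def filter_by_max_df(tokenized_docs, max_df):
    """Keep terms whose document frequency does not exceed max_df.

    A float is read as a proportion of documents, an int as an absolute count.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(term for term, freq in doc_freq.items() if freq <= limit)

docs = [d.split() for d in ["good movie", "bad movie", "good cast"]]  # hypothetical

kept_int = filter_by_max_df(docs, 1)      # only terms found in a single document
kept_float = filter_by_max_df(docs, 1.0)  # every term

print(kept_int)
print(kept_float)
```

With the integer threshold, the words shared across documents (`good`, `movie`) are discarded, and those are usually the ones a classifier can generalize from.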


# 3. Modeling & Evaluation

## 1) Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [99]:
#Version without the n-gram option
cv = CountVectorizer(binary=True)
cv.fit(X_train)
cv_train_reviews = cv.transform(X_train)
cv_test_reviews = cv.transform(X_test)

In [101]:
#Run logistic regression while varying C
#C is the inverse of regularization strength
#Like in support vector machines, smaller values specify stronger regularization
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(cv_train_reviews, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_test, lr.predict(cv_test_reviews))))

Accuracy for C=0.01: 0.8805
Accuracy for C=0.05: 0.8836


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy for C=0.25: 0.8816




Accuracy for C=0.5: 0.8798
Accuracy for C=1: 0.8756




The best accuracy is 0.8836, at C=0.05.

In [31]:
#Version with n-grams added
cv = CountVectorizer(binary=True, ngram_range=(1,3))
cv.fit(X_train)
cv_train_reviews = cv.transform(X_train)
cv_test_reviews = cv.transform(X_test)

In [33]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(cv_train_reviews, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_test, lr.predict(cv_test_reviews))))

Accuracy for C=0.01: 0.8858
Accuracy for C=0.05: 0.8924
Accuracy for C=0.25: 0.8937


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy for C=0.5: 0.8955
Accuracy for C=1: 0.8956


The highest accuracy is 0.8956, at C=1.

Adding the n-gram option clearly improves performance (best accuracy 0.8836 -> 0.8956).

In [34]:
#Classification report
#The string labels are sorted alphabetically, so 'negative' comes before 'positive'
lr_bow_report=classification_report(y_test,lr.predict(cv_test_reviews),target_names=['Negative','Positive'])
print(lr_bow_report)

              precision    recall  f1-score   support

    Negative       0.90      0.89      0.89      4993
    Positive       0.89      0.90      0.90      5007

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
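
To make the columns of the report concrete, the sketch below computes precision, recall, F1, and support per class from scratch on a handful of made-up labels. `per_class_report` is a hypothetical helper; like scikit-learn, it lists the classes in sorted label order, so 'negative' comes before 'positive'.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} with labels in sorted order."""
    rows = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
        fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows[label] = (precision, recall, f1, tp + fn)
    return rows

# Hypothetical labels, not taken from the experiment.
y_true = ["negative", "negative", "positive", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive", "negative"]

report = per_class_report(y_true, y_pred)
for label, row in report.items():
    print(label, row)
```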



## 2) SVM(Support Vector Machine)

In [45]:
svm=SGDClassifier(loss='hinge',random_state=42)

#fitting
svm_bow=svm.fit(cv_train_reviews,y_train)

#Evaluate performance
svm_bow_score=accuracy_score(y_test,svm.predict(cv_test_reviews))
print("svm_bow_score :",svm_bow_score)

svm_bow_score : 0.8935


The accuracy is about as high as that of logistic regression.
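
`SGDClassifier` with `loss='hinge'` fits a linear SVM by stochastic gradient descent. As a small sketch of what that loss measures, the function below evaluates the hinge loss for a single example; the labels are assumed to be encoded as -1/+1 and the scores are made up.

```python
def hinge_loss(y, score):
    """Hinge loss for one example: y is -1 or +1, score is the linear output w.x + b."""
    return max(0.0, 1.0 - y * score)

# Hypothetical scores for illustration.
confident_correct = hinge_loss(+1, 2.5)  # outside the margin: no loss
inside_margin = hinge_loss(+1, 0.3)      # correct side, but within the margin
wrong_side = hinge_loss(-1, 0.3)         # misclassified: loss grows with the score

print(confident_correct, inside_margin, wrong_side)
```

Examples that are classified correctly with a margin of at least 1 contribute nothing, so only points near or across the decision boundary drive the updates.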

In [46]:
#Classification report
#The string labels are sorted alphabetically, so 'negative' comes before 'positive'
svm_bow_report=classification_report(y_test,svm.predict(cv_test_reviews),target_names=['Negative','Positive'])
print(svm_bow_report)

              precision    recall  f1-score   support

    Negative       0.89      0.89      0.89      4993
    Positive       0.89      0.89      0.89      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



## 3) (Multinomial) Naive Bayes
-  Multinomial Naive Bayes
    - Used when the features are discrete values that represent how often some event occurs.
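
The idea can be sketched in a few lines of plain Python: estimate a log prior per class and a smoothed log likelihood per word from token counts, then pick the class with the highest total score. This is a minimal illustration with Laplace smoothing (`alpha=1.0`) on made-up data; `train_mnb` and `predict_mnb` are hypothetical helpers, not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from token counts."""
    vocab = sorted({token for doc in docs for token in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label in class_counts:
        # Total token count for the class plus the smoothing mass.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_lik = {t: math.log((token_counts[label][t] + alpha) / total) for t in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_mnb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are ignored."""
    scores = {
        label: log_prior + sum(log_lik[t] for t in doc if t in log_lik)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical training data.
train_docs = [d.split() for d in
              ["great great movie", "wonderful cast", "boring movie", "awful boring plot"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_mnb(train_docs, train_labels)
pred_pos = predict_mnb(model, "great cast".split())
pred_neg = predict_mnb(model, "boring plot".split())
print(pred_pos, pred_neg)
```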

In [47]:
mnb=MultinomialNB()

#fitting 
mnb_bow=mnb.fit(cv_train_reviews,y_train)

#Evaluate performance
mnb_bow_score=accuracy_score(y_test,mnb.predict(cv_test_reviews))
print("mnb_bow_score :",mnb_bow_score)

mnb_bow_score : 0.8888


Slightly lower accuracy than logistic regression and SVM.

In [48]:
#Classification report
#The string labels are sorted alphabetically, so 'negative' comes before 'positive'
mnb_bow_report=classification_report(y_test,mnb.predict(cv_test_reviews),target_names=['Negative','Positive'])
print(mnb_bow_report)

              precision    recall  f1-score   support

    Negative       0.88      0.90      0.89      4993
    Positive       0.89      0.88      0.89      5007

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000

