### IMDB Movie Review Sentiment Analysis
- TfidfVectorizer + LogisticRegression

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('./data/labeledTrainData.tsv', sep='\t', quoting=3)    # 3: QUOTE_NONE
df.head(3)
# Replace the <br /> (line break) tag with a space so adjacent words do not merge
df.review = df.review.str.replace('<br />', ' ', regex=False)
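# Remove punctuation and digits: replace every non-letter character with a space
# The str methods of a pandas Series support regular expressions;
# regex=True is passed explicitly because pandas 2.0+ defaults to literal matching
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True).str.strip()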
df.review[0][:1000]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review, df.sentiment, stratify=df.sentiment, random_state=2022
)
y_train.value_counts()

0    9375
1    9375
Name: sentiment, dtype: int64
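
The cleaning steps in the cell above can be checked on a couple of toy strings. This is a minimal sketch, assuming the `<br />` tag is replaced with a space; the `reviews` Series is made-up data standing in for `df.review`.

```python
import pandas as pd

# Toy reviews standing in for df.review (hypothetical data)
reviews = pd.Series([
    "Great movie!<br /><br />10/10 would watch again.",
    "Boring...<br />Don't bother.",
])

# Replace the line-break tag with a space (literal match, no regex needed)
cleaned = reviews.str.replace('<br />', ' ', regex=False)
# Replace every non-letter character with a space, then trim both ends
cleaned = cleaned.str.replace('[^A-Za-z]', ' ', regex=True).str.strip()

# Runs of spaces are left in place; the vectorizer's tokenizer ignores them
for text in cleaned:
    print(text.split())
```

Note that apostrophes are also removed, so a contraction such as "Don't" is split into two tokens.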

- Pipeline: TfidfVectorizer + LogisticRegression

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [4]:
tvect = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
lrc = LogisticRegression(random_state=2022)
pipeline = Pipeline([('TVECT', tvect), ("LR", lrc)])

In [6]:
# Train
%time pipeline.fit(X_train, y_train)

CPU times: total: 2min 1s
Wall time: 49.2 s


Pipeline(steps=[('TVECT',
                 TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('LR', LogisticRegression(random_state=2022))])

In [7]:
pipeline.score(X_test, y_test)

0.87456

- Finding the best parameters

In [8]:
from sklearn.model_selection import GridSearchCV
params = {
    'TVECT__max_df': [100,500],
    'LR__C': [1,10]
}

In [9]:
grid_pipe = GridSearchCV(pipeline, params, scoring='accuracy', cv=3, n_jobs=-1)
grid_pipe.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('TVECT',
                                        TfidfVectorizer(ngram_range=(1, 2),
                                                        stop_words='english')),
                                       ('LR',
                                        LogisticRegression(random_state=2022))]),
             n_jobs=-1,
             param_grid={'LR__C': [1, 10], 'TVECT__max_df': [100, 500]},
             scoring='accuracy')

In [10]:
grid_pipe.best_params_

{'LR__C': 10, 'TVECT__max_df': 500}
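
An integer `max_df` is an absolute document count, not a proportion: terms that appear in more than that many documents are dropped from the vocabulary. With 18,750 training reviews, the values 100 and 500 are therefore fairly strict cut-offs. The sketch below illustrates the behavior on a made-up four-document corpus; the corpus and the threshold of 2 are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "good movie good plot",
    "good movie bad plot",
    "good acting",
    "bad acting bad movie",
]

# Integer max_df: drop terms that occur in more than 2 documents.
# "good" and "movie" each occur in 3 documents, so they are removed.
vect = TfidfVectorizer(max_df=2)
vect.fit(docs)

print(sorted(vect.vocabulary_))
```

Because the best value found here (500) sits at the upper edge of the grid, widening the grid, or trying float values such as 0.5, may be worth exploring.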

In [11]:
grid_pipe.best_estimator_.score(X_test, y_test)

0.87488

- Saving and loading the model

In [14]:
import joblib
joblib.dump(grid_pipe.best_estimator_,'./data/imdb_tvect_lr.pkl')

['./data/imdb_tvect_lr.pkl']

In [15]:
best_pipe = joblib.load('./data/imdb_tvect_lr.pkl')
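
A quick way to gain confidence in the saved file is to confirm that the reloaded pipeline predicts the same labels as the original. The following is a minimal sketch on a toy pipeline and a temporary directory; the texts, labels, and file name are hypothetical, and the notebook's own objects would be `grid_pipe.best_estimator_` and `best_pipe`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical)
texts = ["great fantastic movie", "brilliant acting great fun",
         "terrible boring movie", "awful plot boring acting"]
labels = [1, 1, 0, 0]

pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LR', LogisticRegression())])
pipe.fit(texts, labels)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipe.pkl')
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same = np.array_equal(pipe.predict(texts), restored.predict(texts))
print(same)
```

A pickled pipeline should be loaded with the same scikit-learn version that wrote it, since compatibility across versions is not guaranteed.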

- Applying the model to real data

In [17]:
review = '''This is a movie made purely to satisfy the fans and there should be no doubt about that.
No Way Home, in my opinion, is even better than Homecoming and Far From Home, and pretty much one of the best MCU movies of all time.
It's a simple story, but the execution is fantastic. Even the smallest of surprises have a huge impact,
and I could feel that in the theatre as I joined several other Spider-Man fans cheer out for both heroes and villains.
The action sequences were brilliant;
seeing them in 3D is totally worth the price of admission. Every actor delivered a believable, realistic performance,
and especially our lead actor Tom Holland.
The visual effects too were top notch and the editing was stupendous.
Two and a half hours flew by real quick while watching this popcorn action entertainer.
It won't be fair to reveal anything, so here I conclude my review,
and recommend you to check out this new world of Spidey-ness on the big screen and in 3D.
And once you've seen it, please don't spoil it for others, just like you won't want it spoiled for yourself.
'''

In [18]:
# Text preprocessing (same cleaning as applied to the training data)
import re
review = re.sub('[^A-Za-z]',' ',review).strip()

In [19]:
best_pipe.predict([review])

array([1], dtype=int64)
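
New text should go through the same cleaning as the training data before prediction, and `predict_proba` gives a confidence estimate alongside the label. A minimal sketch follows, assuming a hypothetical helper `clean_review` and a tiny stand-in pipeline `toy_pipe`; in the notebook, `best_pipe` would take its place.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def clean_review(text):
    """Hypothetical helper: mirror the cleaning applied to the training data."""
    text = text.replace('<br />', ' ')
    return re.sub('[^A-Za-z]', ' ', text).strip()


# Tiny stand-in for the saved pipeline (toy data, for illustration only)
toy_texts = ["great fantastic movie", "brilliant acting great fun",
             "terrible boring movie", "awful plot boring acting"]
toy_labels = [1, 1, 0, 0]
toy_pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LR', LogisticRegression())])
toy_pipe.fit(toy_texts, toy_labels)

raw = "Great movie!<br />Fantastic, brilliant fun."
# predict_proba returns one row per input, with one column per class
proba = toy_pipe.predict_proba([clean_review(raw)])[0]
print(dict(zip(toy_pipe.classes_, proba)))
```

Keeping the cleaning in a single function reduces the risk of training and inference drifting apart.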