IMDB movie review sentiment analysis
* Pipeline
* TF-IDF vectorizer + logistic regression

In [1]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('labeledTrainData.tsv',sep='\t')
df.head(2)

   id      sentiment  review
0  5814_8  1          With all this stuff going down at the moment w...
1  2381_9  1          \The Classic War of the Worlds\" by Timothy Hi...


In [5]:
# Replace <br /> tags with a space (literal match, not a regex)
df.review = df.review.str.replace('<br />', ' ', regex=False)

In [6]:
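# Remove punctuation and digits: replace every non-letter character with a space.
# regex=True is required because pandas >= 2.0 treats the pattern as a literal string by default.
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True).str.strip()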
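The two cleaning steps above can also be collected into a single helper, so that exactly the same preprocessing is reused at prediction time. The following is a minimal sketch; the name `clean_review` is hypothetical and does not appear in the notebook.

```python
import re

def clean_review(text: str) -> str:
    """Apply the notebook's two cleaning steps to a single review string."""
    # Step 1: replace literal <br /> tags with a space
    text = text.replace('<br />', ' ')
    # Step 2: replace every character that is not an English letter with a space,
    # then trim leading and trailing whitespace
    return re.sub(r'[^A-Za-z]', ' ', text).strip()

sample = 'Great<br />movie!! 10/10'
cleaned = clean_review(sample)
print(cleaned)
```

On the DataFrame, the same helper could be applied with `df.review.apply(clean_review)`. Note that inner runs of spaces are kept, as in the notebook; the vectorizer's tokenizer ignores them.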

  


In [7]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(
    df.review,df.sentiment,stratify=df.sentiment,random_state=2022
)
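
The effect of `stratify` can be checked on toy data. The sketch below uses made-up labels (80 negative, 20 positive), not the IMDB data, and shows that both splits keep roughly the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption for illustration): 80 negative, 20 positive
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2022)

# test_size defaults to 0.25 when neither test_size nor train_size is given
print(len(y_tr), len(y_te))
# With stratify, the share of positives stays close to the overall 0.2 in both splits
print(y_tr.mean(), y_te.mean())
```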

* Pipeline: TF-IDF vectorizer + logistic regression

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


In [9]:
tvect = TfidfVectorizer(ngram_range=(1,2),stop_words='english')
lrc = LogisticRegression(random_state=2022)
pipeline = Pipeline([('TVECT',tvect),('LR',lrc)])

In [10]:
%time pipeline.fit(X_train,y_train)

CPU times: user 35.5 s, sys: 13.1 s, total: 48.6 s
Wall time: 38.4 s


Pipeline(steps=[('TVECT',
                 TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('LR', LogisticRegression(random_state=2022))])

In [11]:
pipeline.score(X_test,y_test)

0.87472

* Finding the best parameters

In [16]:
from sklearn.model_selection import GridSearchCV
params = {
    'TVECT__max_df':[100,500],
    'LR__C':[1,10]
}
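
The keys in `params` follow the `<step name>__<parameter name>` convention of `Pipeline`. A minimal sketch, using an unfitted pipeline with the same step names, shows where those names come from and how they are applied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LR', LogisticRegression())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'
param_names = sorted(pipe.get_params().keys())
print([p for p in param_names if p in ('TVECT__max_df', 'LR__C')])

# set_params accepts the same names; a grid search sets them for each candidate
pipe.set_params(TVECT__max_df=500, LR__C=10)
print(pipe.named_steps['TVECT'].max_df, pipe.named_steps['LR'].C)
```

An integer `max_df` is an absolute document count (terms appearing in more than that many documents are ignored), while a float between 0.0 and 1.0 is a proportion of documents.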

In [17]:
grid_pipe=GridSearchCV(pipeline,params,scoring='accuracy',cv=3,n_jobs=-1)
%time grid_pipe.fit(X_train,y_train)

CPU times: user 46 s, sys: 20.7 s, total: 1min 6s
Wall time: 3min 29s


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('TVECT',
                                        TfidfVectorizer(ngram_range=(1, 2),
                                                        stop_words='english')),
                                       ('LR',
                                        LogisticRegression(random_state=2022))]),
             n_jobs=-1,
             param_grid={'LR__C': [1, 10], 'TVECT__max_df': [100, 500]},
             scoring='accuracy')

In [18]:
grid_pipe.best_params_

{'LR__C': 10, 'TVECT__max_df': 500}
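
Besides `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`. The sketch below runs a tiny grid search on a made-up corpus (an assumption for illustration only) and prints the mean cross-validated accuracy per parameter combination.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (assumption for illustration only)
texts = ['great film', 'wonderful acting', 'loved it', 'brilliant story',
         'terrible film', 'awful acting', 'hated it', 'boring story']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LR', LogisticRegression())])
grid = GridSearchCV(pipe, {'LR__C': [1, 10]}, scoring='accuracy', cv=2)
grid.fit(texts, labels)

# One entry per parameter combination, with its mean cross-validated accuracy
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params, round(float(score), 3))
```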

In [19]:
grid_pipe.best_estimator_.score(X_test,y_test)

0.87552

* Saving and loading the model

In [20]:
import joblib
joblib.dump(grid_pipe.best_estimator_,'imdb_tvect_lr.pkl')

['imdb_tvect_lr.pkl']

In [22]:
best_pipe = joblib.load('imdb_tvect_lr.pkl')
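
A quick way to confirm that a dump/load round trip preserves the fitted pipeline is to compare predictions before and after. The sketch below uses a tiny toy pipeline and a temporary file; both are assumptions standing in for `best_estimator_` and `imdb_tvect_lr.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny toy pipeline (assumption for illustration), standing in for best_estimator_
texts = ['great film', 'loved it', 'terrible film', 'hated it']
labels = [1, 1, 0, 0]
pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LR', LogisticRegression())])
pipe.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_pipe.pkl')  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries the fitted vocabulary and coefficients
same = (pipe.predict(texts) == restored.predict(texts)).all()
print(same)
```

Because the file is a pickle, it should be loaded with a scikit-learn version compatible with the one that saved it, and only from a trusted source.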

In [21]:
review = '''This is a movie made purely to satisfy the fans and there should be no doubt about that. 
No Way Home, in my opinion, is even better than Homecoming and Far From Home, 
and pretty much one of the best MCU movies of all time. 
It's a simple story, but the execution is fantastic. Even the smallest of surprises have a huge impact, 
and I could feel that in the theatre as I joined several other Spider-Man fans cheer out for both heroes and 
villains. The action sequences were brilliant; seeing them in 3D is totally worth the price of admission. 
Every actor delivered a believable, realistic performance, and especially our lead actor Tom Holland. 
The visual effects too were top notch and the editing was stupendous. 
Two and a half hours flew by real quick while watching this popcorn action entertainer. 
It won't be fair to reveal anything, so here I conclude my review, and recommend you to check out 
this new world of Spidey-ness on the big screen and in 3D. 
And once you've seen it, please don't spoil it for others, just like you won't want it spoiled for yourself.'''

In [23]:
# Text preprocessing: replace every non-letter character with a space
import re
review = re.sub('[^A-Za-z]', ' ', review).strip()

In [24]:
best_pipe.predict([review])

array([1])
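
Since the final step is a logistic regression, the pipeline can also report a probability instead of only the class label. The following sketch wraps cleaning and prediction in one function; `predict_sentiment` and the toy pipeline are hypothetical stand-ins, and the toy pipeline would be replaced by the loaded `best_pipe` in practice.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def predict_sentiment(pipe, text):
    """Clean one raw review and return (label, probability of the positive class)."""
    cleaned = re.sub(r'[^A-Za-z]', ' ', text.replace('<br />', ' ')).strip()
    label = int(pipe.predict([cleaned])[0])
    # classes_ gives the column order of predict_proba
    pos_col = list(pipe.classes_).index(1)
    proba = float(pipe.predict_proba([cleaned])[0, pos_col])
    return label, proba

# Toy stand-in (assumption) for the loaded best_pipe
toy_pipe = Pipeline([('TVECT', TfidfVectorizer()), ('LR', LogisticRegression())])
toy_pipe.fit(['great film', 'loved it', 'terrible film', 'hated it'], [1, 1, 0, 0])

label, proba = predict_sentiment(toy_pipe, 'Loved it!<br />Great film.')
print(label, proba)
```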