# IMDB Movie Review Sentiment Analysis
- Pipeline
- TF-IDF / Logistic Regression

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data/labeledTrainData.tsv', sep='\t', quoting=3)  # quoting=3 is csv.QUOTE_NONE: quote characters get no special treatment
df.head(3)

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."


- Text preprocessing

In [3]:
# Replace <br /> tags with a space
df.review = df.review.str.replace('<br />', ' ')

In [4]:
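# Remove punctuation and digits: every non-letter character becomes a space
# regex=True is passed explicitly because the pattern is a regular expression
df.review = df.review.str.replace('[^A-Za-z]', ' ', regex=True).str.strip()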
# df.review[0][:1000]
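
The two cleaning steps can also be tried on a single string, without pandas. The sketch below is only an illustration: `clean_text` is a hypothetical helper name and the sample sentence is made up.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper that mirrors the two notebook steps on one string."""
    text = text.replace('<br />', ' ')       # literal tag -> space
    text = re.sub(r'[^A-Za-z]', ' ', text)   # every non-letter -> space
    return text.strip()

sample = 'Great film!<br />10/10, would watch again.'
cleaned = clean_text(sample)
print(cleaned.split())
```

Runs of spaces are left in place. That is harmless here, because the vectorizer used later tokenizes on word boundaries.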



- Train / test set split

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review, df.sentiment, stratify=df.sentiment, random_state=2022
)
# X_train.shape, X_test.shape, y_train.shape, y_test.shape,
# y_train.value_counts()
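
What `stratify` does is easier to see on an imbalanced toy label vector. The data below is an assumption for illustration only and is not the IMDB set. With the default `test_size` of 0.25, both splits should keep the 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (made up for illustration): 80 negatives, 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2022)

print(len(y_tr), len(y_te))       # sizes of the two splits
print(y_tr.mean(), y_te.mean())   # share of positives in each split
```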

# Pipeline: TfidfVectorizer + Logistic Regression

- Pipeline

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [8]:
pipeline = Pipeline([
    ('tvect', TfidfVectorizer(stop_words='english', ngram_range=(1,2))),
    ('lr', LogisticRegression(random_state=2022))
])

In [9]:
# Equivalent to the cell above.
# tvect = TfidfVectorizer(stop_words ='english', ngram_range= (1,2))
# lr = LogisticRegression(random_state=2022)
# pipeline = Pipeline([('tvect', tvect), ('lr', lr)])
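
As a rough check of what the pipeline saves, the sketch below fits the same two steps once by hand and once through `Pipeline`, then compares the predictions. The tiny corpus is made up for illustration, and default vectorizer settings are used instead of the notebook's options.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
train_docs = ['good movie', 'great acting', 'bad plot',
              'terrible movie', 'good plot', 'bad acting']
train_y = [1, 1, 0, 0, 1, 0]
test_docs = ['great movie', 'terrible plot']

# Manual version: vectorize train and test explicitly.
vect = TfidfVectorizer()
lr = LogisticRegression(random_state=2022)
lr.fit(vect.fit_transform(train_docs), train_y)
manual_pred = lr.predict(vect.transform(test_docs))

# Pipeline version: the same two steps behind one fit/predict interface.
pipe = Pipeline([('tvect', TfidfVectorizer()),
                 ('lr', LogisticRegression(random_state=2022))])
pipe.fit(train_docs, train_y)
pipe_pred = pipe.predict(test_docs)

print(np.array_equal(manual_pred, pipe_pred))
```

The main practical benefit is that the vectorizer is fitted only on whatever data `fit` receives, which matters once the pipeline is placed inside cross-validation.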

In [10]:
# Train
%time pipeline.fit(X_train, y_train)

Wall time: 24.1 s


Pipeline(steps=[('tvect',
                 TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
                ('lr', LogisticRegression(random_state=2022))])

In [11]:
# Evaluate: the pipeline applies the fitted vectorizer to X_test itself, so no separate transform call is needed.
pipeline.score(X_test, y_test)

0.87472

- Finding the best hyperparameters

In [12]:
from sklearn.model_selection import GridSearchCV
params = {
    'tvect__max_df' : [100, 500],
    'lr__C' : [1, 10]
}
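
The keys in `params` follow the `<step name>__<parameter name>` convention, with the step names taken from the pipeline definition. A minimal sketch with default estimators shows where these names come from. Integer values of `max_df` are absolute document counts, while floats between 0 and 1 are proportions of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([('tvect', TfidfVectorizer()), ('lr', LogisticRegression())])

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
param_names = pipe.get_params().keys()
print('tvect__max_df' in param_names, 'lr__C' in param_names)

# set_params accepts the same names; a grid search sets each candidate this way.
pipe.set_params(tvect__max_df=500, lr__C=10)
print(pipe.named_steps['tvect'].max_df, pipe.named_steps['lr'].C)
```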

In [13]:
grid_pipe = GridSearchCV(
    pipeline, param_grid=params, scoring='accuracy', cv=3, n_jobs=-1
)  # n_jobs=-1 runs the search in parallel on all available CPU cores
%time grid_pipe.fit(X_train, y_train)

Wall time: 1min 54s


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('tvect',
                                        TfidfVectorizer(ngram_range=(1, 2),
                                                        stop_words='english')),
                                       ('lr',
                                        LogisticRegression(random_state=2022))]),
             n_jobs=-1,
             param_grid={'lr__C': [1, 10], 'tvect__max_df': [100, 500]},
             scoring='accuracy')


In [15]:
grid_pipe.best_params_

{'lr__C': 10, 'tvect__max_df': 500}
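
`best_params_` reports only the winning combination. To see how close the other candidates were, the fitted search also exposes `cv_results_`. The sketch below uses a made-up toy corpus and a smaller grid over `lr__C` only, so its scores say nothing about the IMDB data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 6 positive and 6 negative documents.
docs = ([f'good great film{i}' for i in range(6)]
        + [f'bad awful film{i}' for i in range(6, 12)])
labels = [1] * 6 + [0] * 6

pipe = Pipeline([('tvect', TfidfVectorizer()),
                 ('lr', LogisticRegression(random_state=2022))])
search = GridSearchCV(pipe, param_grid={'lr__C': [0.1, 1, 10]},
                      scoring='accuracy', cv=3)
search.fit(docs, labels)

# One row per candidate, with mean and spread of the fold scores.
columns = ['param_lr__C', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values('rank_test_score'))
```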

In [16]:
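from sklearn.model_selection import GridSearchCV
params = {
    'tvect__max_df' : [500, 1000],
    'lr__C' : [10, 20]
}
# Note: grid_pipe was built with the earlier params dict and is not re-created here,
# so this fit searches the original grid again (see param_grid in the output).
%time grid_pipe.fit(X_train, y_train)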

Wall time: 1min 58s


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('tvect',
                                        TfidfVectorizer(ngram_range=(1, 2),
                                                        stop_words='english')),
                                       ('lr',
                                        LogisticRegression(random_state=2022))]),
             n_jobs=-1,
             param_grid={'lr__C': [1, 10], 'tvect__max_df': [100, 500]},
             scoring='accuracy')

In [17]:
grid_pipe.best_params_
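# Same values as before. The output above shows an unchanged param_grid, so this run searched the original grid again.
# Other candidates (e.g. max_df of 300 or 700, C of 7 or 13) could still be tried.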

{'lr__C': 10, 'tvect__max_df': 500}
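
The second run printed the same `param_grid` as the first because assigning a new dict to `params` does not change the grid stored in an existing `GridSearchCV` object. The sketch below re-creates the search so that the new grid is actually used. The toy corpus and the `max_df` values are made up to fit that corpus and are not the notebook's values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus; the unique 'filmN' tokens survive any max_df pruning.
docs = ([f'good great film{i}' for i in range(6)]
        + [f'bad awful film{i}' for i in range(6, 12)])
labels = [1] * 6 + [0] * 6

pipe = Pipeline([('tvect', TfidfVectorizer()),
                 ('lr', LogisticRegression(random_state=2022))])

params = {'tvect__max_df': [3, 5], 'lr__C': [1, 10]}
search = GridSearchCV(pipe, param_grid=params, scoring='accuracy', cv=3)

# Rebinding the name does not touch the grid held by the existing search object.
params = {'tvect__max_df': [5, 8], 'lr__C': [10, 20]}
old_grid = search.param_grid
print(old_grid)

# Re-create the search so that the new grid is the one being explored.
search = GridSearchCV(pipe, param_grid=params, scoring='accuracy', cv=3)
search.fit(docs, labels)
print(search.param_grid)
print(search.best_params_)
```

Calling `search.set_params(param_grid=params)` before fitting would be an alternative to building a new object.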

In [18]:
grid_pipe.best_estimator_.score(X_test, y_test)
# Slightly higher than before: 0.87472 -> 0.87552

0.87552

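# Recommended study material
 -  딥러닝을 위한 자연어 처리 입문 (an introductory book on natural language processing for deep learning)
 -  Regular expressions, among other topics.
 -  This material is hard to learn on your own. Last year's NLP course spent one week going from the basics up to vector similarity, and it was not easy.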

- Saving the model

In [19]:
import joblib
joblib.dump(grid_pipe.best_estimator_, 'model/imdb_pipe.pkl')

['model/imdb_pipe.pkl']

In [20]:
best_pipe = joblib.load('model/imdb_pipe.pkl')
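
A quick way to gain confidence in the saved file is a round-trip check: dump, load, and compare predictions. The sketch below does this with a toy pipeline in a temporary directory. The corpus and the file name `toy_pipe.pkl` are made up for illustration. Note that `joblib.dump` does not create missing directories, so a target folder such as `model/` has to exist beforehand (for example via `os.makedirs('model', exist_ok=True)`).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy corpus, for illustration only.
docs = ['good movie', 'great acting', 'bad plot', 'terrible movie']
labels = [1, 1, 0, 0]

pipe = Pipeline([('tvect', TfidfVectorizer()),
                 ('lr', LogisticRegression(random_state=2022))])
pipe.fit(docs, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipe.pkl')   # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries the fitted vocabulary and coefficients.
same = np.array_equal(pipe.predict(docs), restored.predict(docs))
print(same)
```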

- Real data

In [21]:
review = '''
Before all of the hype I wanted to see Spider-Man in the theater, and the hype just made me want to see it more. I didn't know what to expect, but I certainly expected better than what I saw. Besides being a snooze fest, it was so sappy and saccharin I scheduled a dentist visit during the movie.

The movie started going wayward the moment Peter Parker (Tom Holland) went to Doctor Strange (Benedict Cumberbatch) for a spell to make everyone forget him. It wasn't a bad idea for Peter to go to Doctor Strange to fix his little revealed-identity problem, it was just oddly dumb of Doctor Strange to actually accommodate Peter. Doctor Strange truly attempted to make everyone in the world forget who Peter Parker was. It was a massive spell that seemed very excessive just to fix one man's problems. The movie went into silly mode as Parker kept interrupting Doctor Strange as he was doing the spell. The result was a host of old nemeses from different universes converging upon this universe to find Parker. If it was explained, it was explained poorly.

At this point the movie was on shaky ground, but it hadn't given way. "No Way Home" would suffer another setback when Parker fought Strange to prevent him from sending all of these villains back to their respective universes. You see, if they were to go back then they'd meet death at the hands of, or because of, that universe's Spider-Man, and this universe's Spider-Man was way too moralistic to allow that to happen. No, Parker would fix it so that they could all go back home and they could all live happily ever after.

Everything after this part of the movie was immaterial to me in the grand scheme. For me, when a plot is based upon a faulty or simply bad premise, everything that follows is equally faulty and meaningless. Not that the things that follow can't be cool aesthetically; they're just empty.

Parker was operating upon the righteous advice of his Aunt May (Marisa Tomei). Per her wisdom, Parker had an obligation to help these poor misguided men, no matter what the exponentially more knowledgeable Doctor Strange said. Doctor Strange wanted to do the most logical thing which was send each villain back to his own universe and let fate take over from there. Parker, with his oversized heart and undersized common sense, wanted to keep them in his own universe as pet projects until he could fix them, then send them back. Nevermind that he was going to be actively tampering with the course of events of another universe thereby massively altering their timelines with untold consequences, he simply wasn't old enough, wise enough, or experienced enough to know what was the best thing to do. Furthermore, it was a slap in the face of the other Spider-Men. As if to say, "You guys screwed up, I'm going to find a better, more wholesome solution because I'm a better Spider-Man"

I was so thoroughly perturbed by this new mission of Spider-Man's to save these villains that everything after that began to annoy me. The attempts at comedy were awkward and grossly unfunny. The emotional moments were too often and too long. The final nail in the coffin of this was the runtime. Boy did they drag this out. I could put up with a didactic and flawed plot if you get us through it in a quick and exciting way, but in their attempts to have this grand send off for Spider-Man we had to suffer through an ocean of tears, a chasm of emotions, and an abundance of silence with slow music. I left the theater drained and upset as though I'd just been to the DMV. Everything was set up for them to reboot Spider-Man if needed, but at this point I could use a break and no more Spider-Men about "home."

'''
review

'\nBefore all of the hype I wanted to see Spider-Man in the theater, and the hype just made me want to see it more. I didn\'t know what to expect, but I certainly expected better than what I saw. Besides being a snooze fest, it was so sappy and saccharin I scheduled a dentist visit during the movie.\n\nThe movie started going wayward the moment Peter Parker (Tom Holland) went to Doctor Strange (Benedict Cumberbatch) for a spell to make everyone forget him. It wasn\'t a bad idea for Peter to go to Doctor Strange to fix his little revealed-identity problem, it was just oddly dumb of Doctor Strange to actually accommodate Peter. Doctor Strange truly attempted to make everyone in the world forget who Peter Parker was. It was a massive spell that seemed very excessive just to fix one man\'s problems. The movie went into silly mode as Parker kept interrupting Doctor Strange as he was doing the spell. The result was a host of old nemeses from different universes converging upon this universe 

In [22]:
# Text preprocessing (same non-letter removal as for the training data)
import re
clean_review = re.sub('[^A-Za-z]', ' ', review).strip()

In [23]:
best_pipe.predict([clean_review])
# The prediction is 0, i.e. negative.

array([0], dtype=int64)
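
`predict` returns only the class label. When the confidence of the prediction matters, `predict_proba` can be used on the same pipeline. The sketch below uses a toy pipeline trained on a made-up corpus, so the probabilities are illustrative only. The `names` mapping is an assumption that follows the notebook's convention of 0 = negative and 1 = positive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy corpus, for illustration only.
docs = ['good movie', 'great acting', 'bad plot',
        'terrible movie', 'good plot', 'bad acting']
labels = [1, 1, 0, 0, 1, 0]

pipe = Pipeline([('tvect', TfidfVectorizer()),
                 ('lr', LogisticRegression(random_state=2022))])
pipe.fit(docs, labels)

names = {0: 'negative', 1: 'positive'}        # assumed label meaning
new_review = ['great movie']

proba = pipe.predict_proba(new_review)[0]     # one probability per class
label = pipe.classes_[proba.argmax()]         # class with the highest probability
print(names[label], proba)
```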

In [24]:
# clean_review