### IMDB movie review sentiment analysis (binary classification)
- CountVectorizer + LogisticRegression

##### 1. Data exploration

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data/labeledTrainData.tsv',sep='\t')
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [3]:
df = pd.read_csv('data/labeledTrainData.tsv',sep='\t',quoting=3)            # quoting=3 is csv.QUOTE_NONE: quote characters are kept as literal text
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
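
The effect of `quoting=3` can be reproduced with the standard `csv` module, whose `QUOTE_NONE` constant has the value 3. The following is a minimal sketch on a made-up sample (the `raw` string is hypothetical and not taken from `labeledTrainData.tsv`):

```python
import csv
import io

# Hypothetical tab-separated sample containing quoted fields
raw = 'id\treview\n"1_1"\t"A ""fine"" film"\n'

# Default quoting: surrounding quotes are stripped and "" collapses to "
default_rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))

# QUOTE_NONE: quote characters receive no special treatment
literal_rows = list(
    csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE)
)

print(csv.QUOTE_NONE)
print(default_rows[1])
print(literal_rows[1])
```

With `QUOTE_NONE` the quotes survive as ordinary characters, which is why the `id` and `review` values above keep their surrounding `"`.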


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [8]:
print(df.review[0][:100])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching 


In [9]:
df.isna().sum().sum()

0

##### 2. Text preprocessing

In [10]:
# Replace <br /> tags with a space (literal match, not a regex)
df.review = df.review.str.replace('<br />',' ',regex=False)

In [11]:
df.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.  Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.  The actual feature film bit when it finally starts is only on f

In [17]:
# Remove punctuation and digits: replace every character that is not an English letter with a space
df.review = df.review.str.replace('[^A-Za-z]',' ',regex=True)

In [18]:
df.review[0][:200]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want '
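
The two cleaning steps can be tried on a single string with the standard `re` module. This is a small sketch; the `sample` text is hypothetical:

```python
import re

# Hypothetical review snippet, used only for illustration
sample = "Great movie!<br />10/10, i've seen it twice."

# Step 1: replace the HTML line-break tag with a space
no_tags = sample.replace('<br />', ' ')

# Step 2: replace every character that is not an English letter with a space
letters_only = re.sub('[^A-Za-z]', ' ', no_tags)

print(letters_only.split())
```

Each removed character becomes one space, so the string length is unchanged and contractions such as `i've` split into `i` and `ve`.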

##### 3. Train/test split

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.review.values,df.sentiment.values,stratify=df.sentiment.values,
    test_size=0.2,random_state=2023
)
np.unique(y_train,return_counts=True)

(array([0, 1], dtype=int64), array([10000, 10000], dtype=int64))
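
`stratify` keeps the class ratio identical in both splits. The IMDB labels are already balanced, so the effect is easier to see on an imbalanced case. A sketch with hypothetical labels (80 negatives, 20 positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives, 20 positives
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=2023
)

# Class counts per split; the 4:1 ratio is preserved in both
print(np.bincount(y_tr), np.bincount(y_te))
```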

##### 4. Text Encoding

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')

In [21]:
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((20000, 66602), (5000, 66602))

In [22]:
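# Do NOT do the following: fit_transform on the test set refits the vocabulary, so the train and test columns no longer match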
# cvect.fit_transform(X_train)
# cvect.fit_transform(X_test)
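
Why refitting on the test set breaks things can be seen on a toy corpus. A minimal sketch with hypothetical documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora
train_docs = ["good movie", "bad movie"]
test_docs = ["good film"]

# Correct: learn the vocabulary on the training data only
cv = CountVectorizer()
cv.fit(train_docs)
train_shape = cv.transform(train_docs).shape
test_shape = cv.transform(test_docs).shape

# Incorrect: a vectorizer refit on the test data learns a different vocabulary
refit_shape = CountVectorizer().fit_transform(test_docs).shape

print(train_shape, test_shape, refit_shape)
```

The refit matrix has a different number of columns, and even equal-sized vocabularies would map columns to different words, so a model trained on the first matrix cannot use it.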

##### 5. Training and evaluation

In [25]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2023, max_iter=500)

In [26]:
# For long-running operations, use the %time magic command to measure the elapsed time
%time lrc.fit(X_train_cv,y_train)

CPU times: total: 5.16 s
Wall time: 4.9 s


In [27]:
lrc.score(X_test_cv,y_test)

0.8786
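
For a classifier, `score` returns the mean accuracy, which is the same value as `accuracy_score` applied to the predictions. A sketch on hypothetical one-dimensional toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical 1-D toy data
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(random_state=2023, max_iter=500).fit(X, y)

# score() is shorthand for computing accuracy on the predictions
acc_from_score = clf.score(X, y)
acc_manual = accuracy_score(y, clf.predict(X))

print(acc_from_score, acc_manual)
```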

##### 6. N-gram

In [28]:
cvect2 = CountVectorizer(stop_words='english',ngram_range=(1,2))
cvect2.fit(X_train)
X_train_cv2 = cvect2.transform(X_train)
X_test_cv2 = cvect2.transform(X_test)
X_train_cv2.shape, X_test_cv2.shape

((20000, 1455899), (5000, 1455899))
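
The jump in feature count comes from adding every pair of adjacent tokens as its own feature. A small sketch on one hypothetical sentence (stop-word removal is left out here for clarity):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["not a good movie"]

# Unigrams only versus unigrams plus bigrams
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

The default token pattern drops single-character tokens such as `a` before n-grams are formed, so `not good` appears as a bigram.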

In [29]:
lrc2 = LogisticRegression(random_state=2023, max_iter=500)
%time lrc2.fit(X_train_cv2,y_train)

CPU times: total: 53.6 s
Wall time: 45.9 s


In [30]:
lrc2.score(X_test_cv2,y_test)

0.8896

##### 7. Model save/load

In [31]:
import joblib

In [32]:
# Save the vectorizer and the model
joblib.dump(cvect2,'model/imdb_cvect_2.pkl')
joblib.dump(lrc2,'model/imdb_lrc2.pkl')

['model/imdb_lrc2.pkl']

In [33]:
# Load the vectorizer and the model
new_cvect = joblib.load('model/imdb_cvect_2.pkl')
new_lrc = joblib.load('model/imdb_lrc2.pkl')
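
One alternative, not used in this notebook, is to bundle the vectorizer and the classifier into a single scikit-learn `Pipeline` so that only one file has to be saved. A minimal sketch on hypothetical toy data, written to a temporary directory (the file name `toy_pipeline.pkl` is made up):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
docs = ["great wonderful film", "awful terrible film",
        "wonderful great acting", "terrible awful acting"]
labels = [1, 0, 1, 0]

# Vectorizer and classifier chained into one estimator
pipe = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(random_state=2023, max_iter=500),
)
pipe.fit(docs, labels)

# Round-trip through a single file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same_predictions = (restored.predict(docs) == pipe.predict(docs)).all()
print(same_predictions)
```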

##### 8. Validation on real data

In [42]:
review = "This isn't just a beautifully crafted gangster film. Or an outstanding family portrait, for that matter. An amazing period piece. A character study. A lesson in filmmaking and an inspiration to generations of actors, directors, screenwriters and producers. For me, this is more: this is the definitive film. 10 stars out of 10."
'''I follow recommendations on this site highly. I rented this movie and wanted my money back. Ever been to one of those parties with distant relatives where you don't know anyone there and just sit in the corner waiting for it to end? If so, you've seen 90% of this movie. Throw in a few good scenes that happen so far apart, you forget the last one by the time you see the next one. Might be worth watching once just to say you have, but you'll probably never watch it again. Definitely not "best movie ever material."'''

'I follow recommendations on this site highly. I rented this movie and wanted my money back. Ever been to one of those parties with distant relatives where you don\'t know anyone there and just sit in the corner waiting for it to end? If so, you\'ve seen 90% of this movie. Throw in a few good scenes that happen so far apart, you forget the last one by the time you see the next one. Might be worth watching once just to say you have, but you\'ll probably never watch it again. Definitely not "best movie ever material."'

In [45]:
# Text preprocessing: apply the regex to the whole string
# (mapping re.sub over a str would process it one character at a time)
import re
review = re.sub('[^A-Za-z]',' ',review)

In [46]:
# Feature transformation: transform expects an iterable of documents, so wrap the single review in a list
review_cv = new_cvect.transform([review])
review_cv.shape

(1, 1455899)
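
The number of rows returned by `transform` equals the number of items in the iterable it receives, and iterating over a Python string yields one item per character. A short sketch with a hypothetical string shows the difference between mapping over a string and cleaning it as a whole:

```python
import re

text = "10 stars!"  # hypothetical short review

# map() over a str visits each character separately
per_char = list(map(lambda ch: re.sub('[^A-Za-z]', ' ', ch), text))

# re.sub on the whole string returns a single cleaned document
whole = re.sub('[^A-Za-z]', ' ', text)

print(len(per_char))   # one item per character
print(len([whole]))    # one document
```

Passing `[whole]` to a vectorizer therefore produces one row, while passing the per-character sequence produces one row per character.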

In [50]:
# Prediction
'positive' if new_lrc.predict(review_cv)[0] == 1 else 'negative'

'positive'

In [40]:
review2 ='''It seems like a lot of years have passed since then, because Chicago looks completely rebuilt, only to be destroyed once again. But Bay has cast a whole new group of characters. He threw out Shia LaBeouf, Megan Fox and the others and replaced them with Mark Wahlberg, a couple of no-names, Stanley Tucci, and Kelsey Grammar. So it's a brand new day in a world where 'Transformers' are being hunted by a secret military operation uncanonized to the President or anyone else for that matter for the cliché'd reason of money. Without a single shot lasting more than seven seconds (even the slow motion ones), we center on Wahlberg who plays Cade Yeager, a single father, raising a teenage daughter Tessa (Nicola Peltz), on their Texas farm, as he tries to come up with the next big electrical invention.

His barn is a makeshift lab where he has invented dozens of different robots who have trouble performing the most simple tasks. You would thing that this plot point would come into play later, as he is good at working with metal and robots, but believe me, it doesn't pay off nor come into play at all. Wahlberg's right-hand man is Lucas (T.J. Miller), who is the comic relief here, but I guess Bay received so much flack for his lack of of comedic dialogue in the previous films, that he blows up the comedic relief early on the film, leaving the rest of the film at a much darker tone than the three previous movies. When Cade is not telling his 17-year old daughter that she can't date anyone or have fun, he is purchasing some equipment and comes across a rusted out old 18 wheeler, which turns out to be a beat up Optimus Prime. Cade and Optimus become friends, but the CIA and their new ally Lockdown, a mercenary Transformer who is up for the task for taking out all Autobots in exchange for a seed, or bomb that can destroy a planet in order to create life for more Transformers, is one of our bad guys here.

So for the next two hours, Cade, Tessa, and Tessa's secret 20-year old race car driving boyfriend Shane (Jack Reynor), are on the run from the CIA and Lockdown with Optimus Prime and a few other remaining auto-bots. We go from Texas to Chicago to Beijing, all of which are mostly destroyed by the ensuing fight scenes which are redundant and the same thing you've seen in the previous three movies, with the exception of the Dino-Bots making an appearance in the last few minutes of the movie. If you thought Megatron was dead, think again. CIA head operator Harold Attinger (Grammer) and billionaire inventor Joshua Joyce (Tucci) are in cahoots with each other to take out the auto-bots by taking the remnants of Megatron and learning how to build their own Transformers from scratch. But little do they know that Megatron is still alive and is now controlling the 50 new Transformers that Joyce built.

So it seems like Cade and his teenage daughter have a lot on their plates to deal with at the moment. Cade turns into an alien-gun wielding action hero while his daughter acts like a horrible person for most the movie by yelling at her dad and trying to make out with her older boyfriend in front of him, but gets to jump off a truck and kick a small goofy transformer with googly eyes once. Bay just seems to hate women as he has never had a decent female character in any of his movies, but just likes to show them wearing next to nothing through the entire film, which is how Cade's daughter dresses throughout.

The script is utter garbage with cheesy one-liner after cheesy-one liner, spewing from each actor throughout the 165 minutes. I've seen better dialogue on day-time soap operas, but I guess that's what you get when you hire writer Ehren Kruger ('Scream 3'). Is there anything good about this movie? Not really, but seeing it in IMAX, was pretty good, and the 3D didn't make me want to gouge my eyes out. I'd say the best part of the film was Stanley Tucci. His character is the only one that has a solid story arc and is fun to watch on screen. His frantic dialogue and expressions are very funny, but it is all short lived and happens too often.

Whalberg is always likable and it was good to see him here, but there wasn't really anything his character had other than clichés and bad lines. And Tessa and Reynor could have been played by anyone at anytime, as their performances were forgettable and lazy. At least Bay cast John Goodman as one of the auto-bots and we got to hear Bumblebee say a John Goodman line from 'The Big Lebowski', but other than that, every thing else was sub-par, even John DiMaggio, yes Bender from 'Futurama' is an auto-bot in this movie. I'm sure 'Age of Extinction' will make tons of money this summer, but it's a shame, because it definitely doesn't deserve it.

This is filmmaking at its worse, with terrible camera work, awful dialogue, bad characters, a bad musical score, and enough blatant product placement to make your throw up. Sure, the editors found a way to take Michael Bay's ridiculous style of filmmaking and turn the action scenes into something tolerable, but it barely works, and with it happening constantly for three hours, it becomes silly and annoying. The IMAX image and sound is amazing, but past that, 'Transformers: Age of Extinction' is just a horrendous mess of a film from top to bottom.'''

In [41]:
review2 = re.sub('[^A-Za-z]',' ',review2)
review_cv2 = new_cvect.transform([review2])
'positive' if new_lrc.predict(review_cv2)[0] == 1 else 'negative'

'negative'
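
The clean, transform, and predict steps can be wrapped in one function so that every new review goes through the same path. This is a sketch only: `predict_sentiment` is a hypothetical helper, and `toy_cvect` and `toy_lrc` are small stand-ins trained on made-up data in place of `new_cvect` and `new_lrc`.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def predict_sentiment(text, vectorizer, classifier):
    """Hypothetical helper: clean one raw review and return its label."""
    cleaned = re.sub('[^A-Za-z]', ' ', text.replace('<br />', ' '))
    features = vectorizer.transform([cleaned])  # one document -> one row
    return 'positive' if classifier.predict(features)[0] == 1 else 'negative'


# Toy stand-ins for the fitted vectorizer and classifier (made-up data)
docs = ["great wonderful film", "awful terrible film",
        "wonderful great acting", "terrible awful acting"]
labels = [1, 0, 1, 0]
toy_cvect = CountVectorizer().fit(docs)
toy_lrc = LogisticRegression(random_state=2023, max_iter=500)
toy_lrc.fit(toy_cvect.transform(docs), labels)

result_pos = predict_sentiment("Great, wonderful!<br />", toy_cvect, toy_lrc)
result_neg = predict_sentiment("Awful... terrible.", toy_cvect, toy_lrc)
print(result_pos, result_neg)
```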