<a href="https://colab.research.google.com/github/rtajeong/M3/blob/main/lab53_naver_movie_rev1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Naver Movie Ratings
==
- Sentiment analysis
- Naver movie rating data (Naver sentiment movie corpus v1.0): https://github.com/e9t/nsmc
- The corpus contains 200,000 movie reviews. Each review is labeled 0 (negative) or 1 (positive).

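### Korean natural language processing
- KoNLPy (pronounced "ko-en-el-pie") is a Python package for Korean natural language processing.
- This notebook uses the Twitter analyzer provided by the konlpy package (renamed Okt since KoNLPy v0.4.5). Despite its name, it handles general Korean text, not only tweets.
- Running the notebook on Colab is recommended.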

# Sentiment analysis with logistic regression

In [1]:
!pip install konlpy

Collecting konlpy
  Downloading https://files.pythonhosted.org/packages/85/0e/f385566fec837c0b83f216b2da65db9997b35dd675e107752005b7d392b1/konlpy-0.5.2-py2.py3-none-any.whl (19.4MB)
Collecting beautifulsoup4==4.6.0
  Downloading https://files.pythonhosted.org/packages/9e/d4/10f46e5cfac773e22707237bfcd51bbffeaf0a576b0a847ec7ab15bd7ace/beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
Collecting JPype1>=0.7.0
  Downloading https://files.pythonhosted.org/packages/98/88/f817ef1af6f794e8f11313dcd1549de833f4599abcec82746ab5ed086686/JPype1-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (448kB)
Collecting colorama
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Installing coll

In [2]:
# Curl:
# curl is a tool to transfer data from or to a server, using one of the supported protocols (HTTP, HTTPS, FTP,
# FTPS, SCP, SFTP, TFTP, DICT, TELNET, LDAP or FILE). The command is designed to work without user interaction.
# 
# curl -L : (HTTP/HTTPS) If the server reports that the requested page has moved to a different location 
# (indicated with a Location: header and a 3XX response code), this option will make curl redo the request 
# on the new place.

In [3]:
# Download the Naver movie rating data
!curl -L https://bit.ly/2X9Owwr -o ratings_train.txt
!curl -L https://bit.ly/2WuLd5I -o ratings_test.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   152  100   152    0     0   4750      0 --:--:-- --:--:-- --:--:--  4750
100   148    0   148    0     0    550      0 --:--:-- --:--:-- --:--:--   550
100   318  100   318    0     0    554      0 --:--:-- --:--:-- --:--:--   554
100 14.0M  100 14.0M    0     0  12.4M      0  0:00:01  0:00:01 --:--:-- 41.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   151  100   151    0     0   6863      0 --:--:-- --:--:-- --:--:--  6863
100   147    0   147    0     0    622      0 --:--:-- --:--:-- --:--:--   622
100   318  100   318    0     0    583      0 --:--:-- --:--:-- --:--:--     0
100 4827k  100 4827k    0     0  5467k      0 --:--:-- --:--:-- --:--:-- 5467k


In [4]:
import konlpy
import pandas as pd
from konlpy.tag import Twitter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import make_pipeline
# import pickle
# import os.path

# Load the data
# keep_default_na: Whether or not to include the default NaN values when parsing the data
# -> False: no strings will be parsed as NaN.

df_train = pd.read_csv('ratings_train.txt', delimiter='\t', keep_default_na=False)
df_test = pd.read_csv('ratings_test.txt', delimiter='\t', keep_default_na=False)

df_train.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
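
Each ratings file is tab-separated with the columns `id`, `document`, and `label`. As a minimal sketch of that layout, the snippet below parses two of the rows shown above with Python's `csv` module. The in-memory sample and the helper name `read_ratings` are assumptions made for illustration; the notebook itself loads the files with pandas.

```python
import csv
import io

# Two rows in the corpus layout (tab-separated: id, document, label),
# copied from the first rows of ratings_train.txt shown in this notebook.
SAMPLE = (
    "id\tdocument\tlabel\n"
    "9976970\t아 더빙.. 진짜 짜증나네요 목소리\t0\n"
    "3819312\t흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나\t1\n"
)

def read_ratings(stream):
    """Return (documents, labels) from a tab-separated ratings stream."""
    # QUOTE_NONE: treat quote characters inside reviews as ordinary text
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    documents, labels = [], []
    for row in reader:
        documents.append(row["document"])
        labels.append(int(row["label"]))
    return documents, labels

documents, labels = read_ratings(io.StringIO(SAMPLE))
print(labels)
```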


In [5]:
text_train, y_train = df_train['document'].values, df_train['label'].values
text_test, y_test = df_test['document'].values, df_test['label'].values

In [6]:
text_train.shape, text_test.shape   # too big

((150000,), (50000,))

In [7]:
text_train, y_train = text_train[:2000], y_train[:2000]
text_test, y_test = text_test[:1000], y_test[:1000]

In [8]:
text_train.shape, text_test.shape

((2000,), (1000,))

In [9]:
from konlpy.tag import Twitter
def twitter_tokenizer(text):
    return Twitter().morphs(text)

cv = TfidfVectorizer(tokenizer=twitter_tokenizer, max_features = 5000, min_df=5)

In [10]:
lr = LogisticRegression()
x_train = cv.fit_transform(text_train)
x_test = cv.transform(text_test)
result = lr.fit(x_train,y_train)

  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')
  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')
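
`LogisticRegression` scores a review by passing a weighted sum of its TF-IDF features through the sigmoid function. The following is a minimal sketch of that prediction step only. The helper `predict_proba` and the weights are hypothetical and are not the coefficients of the fitted model.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sparse feature dict."""
    z = bias + sum(weights.get(tok, 0.0) * val for tok, val in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: positive values push toward label 1 (positive review)
weights = {"재미": 1.5, "후회": -2.0}
review = {"재미": 0.8, "영화": 0.3}   # TF-IDF-like values for one review

p = predict_proba(review, weights, bias=0.0)
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```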


In [11]:
len(cv.vocabulary_)

794
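
The vocabulary holds only 794 tokens even though `max_features=5000`. A likely reason is `min_df=5`, which drops every token that occurs in fewer than five documents before the `max_features` cap applies. The sketch below illustrates that filtering step in plain Python. The whitespace tokenizer and the helper `build_vocabulary` are stand-ins for illustration, not scikit-learn's implementation.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_features=None):
    """Keep tokens found in at least `min_df` documents, then cap the size."""
    doc_freq = Counter()   # number of documents containing each token
    term_freq = Counter()  # total occurrences of each token
    for doc in docs:
        tokens = doc.split()          # stand-in for a morphological tokenizer
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each token once per document
    kept = [tok for tok, df in doc_freq.items() if df >= min_df]
    if max_features is not None:
        # keep the most frequent tokens across the corpus
        kept = sorted(kept, key=lambda tok: (-term_freq[tok], tok))[:max_features]
    return sorted(kept)

docs = ["good movie", "good acting", "bad movie", "good plot twist"]
print(build_vocabulary(docs, min_df=2))
```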

In [12]:
print("Training score : ", result.score(x_train, y_train))
print("Test score : ", result.score(x_test, y_test))

Training score :  0.8555
Test score :  0.746


In [13]:
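# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (scikit-learn 1.0+) returns an array, so convert it to a list.
feature_names = list(cv.get_feature_names_out())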
print(feature_names[-10:])

['화', '화면', '화보', '화이팅', '확실히', '후', '후회', '흠', '희망', '히']


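# Stop-word handling
- Korean stop words can be identified with KoNLPy, the morphological analysis library.
- Example: extracting the particles (josa) among the Korean parts of speech
- pos (part-of-speech): word class (noun, verb, ...)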

In [14]:
Twitter().morphs("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제")

  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')


['텍스트',
 '데이터',
 '를',
 '이용',
 '해서',
 '불',
 '용어',
 '사전',
 '을',
 '구축',
 '하기',
 '위',
 '한',
 '간단',
 '예제']

In [15]:
Twitter().pos("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제")

  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')


[('텍스트', 'Noun'),
 ('데이터', 'Noun'),
 ('를', 'Josa'),
 ('이용', 'Noun'),
 ('해서', 'Verb'),
 ('불', 'Noun'),
 ('용어', 'Noun'),
 ('사전', 'Noun'),
 ('을', 'Josa'),
 ('구축', 'Noun'),
 ('하기', 'Verb'),
 ('위', 'Noun'),
 ('한', 'Josa'),
 ('간단', 'Noun'),
 ('예제', 'Noun')]

In [16]:
Twitter().pos("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제", norm=True)

  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')


[('텍스트', 'Noun'),
 ('데이터', 'Noun'),
 ('를', 'Josa'),
 ('이용', 'Noun'),
 ('해서', 'Verb'),
 ('불', 'Noun'),
 ('용어', 'Noun'),
 ('사전', 'Noun'),
 ('을', 'Josa'),
 ('구축', 'Noun'),
 ('하기', 'Verb'),
 ('위', 'Noun'),
 ('한', 'Josa'),
 ('간단', 'Noun'),
 ('예제', 'Noun')]

In [17]:
Twitter().nouns("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제")

  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')


['텍스트', '데이터', '이용', '불', '용어', '사전', '구축', '위', '간단', '예제']

- norm: normalize the text (e.g., fix typos); stem: reduce each word to its stem

In [18]:
from konlpy.tag import Twitter
twitter = Twitter()
word_tags = twitter.pos("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제",
                       norm=True, stem=True)
print(word_tags)
# Collect the particles (tag "Josa") as stop words
stop_words = [word for word, tag in word_tags if tag == "Josa"]
print(stop_words)

[('텍스트', 'Noun'), ('데이터', 'Noun'), ('를', 'Josa'), ('이용', 'Noun'), ('하다', 'Verb'), ('불', 'Noun'), ('용어', 'Noun'), ('사전', 'Noun'), ('을', 'Josa'), ('구축', 'Noun'), ('하다', 'Verb'), ('위', 'Noun'), ('한', 'Josa'), ('간단', 'Noun'), ('예제', 'Noun')]
['를', '을', '한']


  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')
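
One way to use the tags is to filter the particles out of the token list before vectorizing. The sketch below works on the `(word, tag)` pairs printed above, hard-coded here so that it runs without KoNLPy. `remove_stop_words` is a hypothetical helper, not part of KoNLPy.

```python
# (word, tag) pairs as printed by twitter.pos(..., norm=True, stem=True) above
word_tags = [
    ("텍스트", "Noun"), ("데이터", "Noun"), ("를", "Josa"), ("이용", "Noun"),
    ("하다", "Verb"), ("불", "Noun"), ("용어", "Noun"), ("사전", "Noun"),
    ("을", "Josa"), ("구축", "Noun"), ("하다", "Verb"), ("위", "Noun"),
    ("한", "Josa"), ("간단", "Noun"), ("예제", "Noun"),
]

def remove_stop_words(tagged, stop_tags=("Josa",)):
    """Drop tokens whose part-of-speech tag is in `stop_tags`."""
    return [word for word, tag in tagged if tag not in stop_tags]

tokens = remove_stop_words(word_tags)
print(tokens)
```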


# Up to here --- Text ---

# Analysis with LSTM

In [19]:
print(df_train.shape, df_test.shape)
df_train.columns, df_test.columns

(150000, 3) (50000, 3)


(Index(['id', 'document', 'label'], dtype='object'),
 Index(['id', 'document', 'label'], dtype='object'))

In [20]:
df_data= pd.concat([df_train, df_test])
df_data.shape, df_data.columns

((200000, 3), Index(['id', 'document', 'label'], dtype='object'))

In [21]:
text_data, y_data = df_data['document'].values, df_data['label'].values

In [22]:
from konlpy.tag import Twitter
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
from sklearn import model_selection, metrics
import numpy as np
import pickle
import os.path
import tensorflow.keras.backend as K

# Tokenizer
def twitter_tokenizer(text):
    return Twitter().morphs(text)

In [23]:
cv = TfidfVectorizer(tokenizer=twitter_tokenizer)

### Pickling
- “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”.

### Comparison with json
There are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):

- JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
- JSON is human-readable, while pickle is not;
- JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
- JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
- Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
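
The differences listed above can be seen in a short round trip. This sketch serializes a small dictionary in memory with both modules. The sample record is made up for illustration.

```python
import json
import pickle

record = {"id": 9976970, "label": 0, "scores": (0.25, 0.75)}

# pickle: binary bytes, restores Python-specific types such as tuple
blob = pickle.dumps(record)
restored = pickle.loads(blob)   # only unpickle data you trust

# JSON: readable text, but the tuple comes back as a list
text = json.dumps(record)
decoded = json.loads(text)

print(type(blob).__name__, type(restored["scores"]).__name__)
print(type(text).__name__, type(decoded["scores"]).__name__)
```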

In [None]:
# Build and save the TF-IDF matrix, but only when the file does not exist yet
if not os.path.isfile("X_data.pickle"):
    print('file does not exist')
    X_data = cv.fit_transform(text_data)
    with open("X_data.pickle", "wb") as f:
        pickle.dump(X_data, f)

file does not exist


  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')


In [None]:
# Load the saved TF-IDF matrix
with open('X_data.pickle', 'rb') as f:
    X_data = pickle.load(f)

In [None]:
!ls -al X*

In [None]:
# one-hot encoding
Y_data = np_utils.to_categorical(y_data, 2)

In [None]:
X_train = X_data[:100000]
X_test = X_data[100000:]

Y_train = Y_data[:100000]
Y_test = Y_data[100000:]

In [None]:
max_words = 61070 
nb_classes = 2
batch_size = 1024
nb_epoch = 5

In [None]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
Y_train.shape[1]

In [None]:
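# Reshape the data for LSTM training: (samples, time steps = 1, features).
# toarray() builds a dense array, which needs a lot of memory for a large vocabulary.
X_train_rnn = X_train.toarray().reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test_rnn = X_test.toarray().reshape((X_test.shape[0], 1, X_test.shape[1]))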

print(X_train_rnn.shape)
print(X_test_rnn.shape)
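
Converting the sparse TF-IDF matrix to a dense array can be very costly. The rough estimate below assumes the matrix has `max_words = 61070` columns (the notebook does not print the actual shape) and the 8-byte floats that `TfidfVectorizer` produces by default. `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to hold an n_rows x n_cols dense array."""
    return n_rows * n_cols * itemsize

# Assumed shape: 100000 training rows x 61070 features (max_words)
needed = dense_bytes(100_000, 61_070)
print(f"{needed / 2**30:.1f} GiB")
```

If the estimate exceeds the available memory, a smaller vocabulary (for example via the `max_features` argument used in the logistic regression part) or fewer rows keeps the dense array manageable.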

In [None]:
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, LSTM
from keras.utils import np_utils

def build_LSTM_model():
    model = Sequential()
    model.add(LSTM(128, input_shape=(X_train_rnn.shape[1], X_train_rnn.shape[2]), return_sequences=True))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(Y_train.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
import tensorflow as tf

# A single-GPU runtime such as Colab exposes its GPU as '/GPU:0'
with tf.device('/GPU:0'):
    model_lstm = KerasClassifier(
        build_fn=build_LSTM_model,
        epochs=nb_epoch,
        batch_size=batch_size)

    model_lstm.fit(X_train_rnn, Y_train)

In [None]:
y = model_lstm.predict(X_train_rnn)
y_train = y_data[:100000]
ac_score = metrics.accuracy_score(y_train, y)
print("Training set accuracy =", ac_score)

In [None]:
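# predict() returns class labels (0 or 1), so the one-hot encoded
# Y_train / Y_test used during training cannot be compared with it directly.
# Each row of Y_train / Y_test is [1, 0] or [0, 1]: the column of the true class is 1, the other is 0.
# The accuracy is therefore measured against the original labels stored in y_data.

print("y : ", y)
print("Y_train[0] : ", Y_train[0])
print("y_data[0] : ", y_data[0])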
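
The relationship between the one-hot rows and the 0/1 labels can be sketched in plain Python. `to_one_hot`, `to_labels`, and `accuracy` are hypothetical helpers with made-up data, shown only to illustrate why the original labels are used for scoring.

```python
def to_one_hot(labels, num_classes=2):
    """Encode integer labels as one-hot rows, e.g. 1 -> [0, 1]."""
    return [[1 if k == y else 0 for k in range(num_classes)] for y in labels]

def to_labels(one_hot_rows):
    """Recover integer labels by taking the index of the largest entry."""
    return [row.index(max(row)) for row in one_hot_rows]

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 1, 1, 0]
Y = to_one_hot(y_true)       # one row per label
y_pred = [0, 1, 0, 0]        # made-up predictions
print(to_labels(Y) == y_true, accuracy(y_true, y_pred))
```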

In [None]:
y = model_lstm.predict(X_test_rnn)
y_test = y_data[100000:]
ac_score = metrics.accuracy_score(y_test, y)
print("Test set accuracy =", ac_score)

In [None]:
## Practice

In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4)); df

In [None]:
df.to_numpy()   # Convert the frame to its NumPy-array representation (as_matrix() was removed in pandas 1.0).

In [None]:
df.values       # also works; pandas recommends to_numpy()

In [None]:
df[0].to_numpy(), df[0].values