Lab03
=====

## Context
### Sequence Modeling Exercise
+ Large Movie Review Dataset


## Learning the relationship between written movie reviews and their ratings (positive or negative) with a recurrent neural network


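Using a movie recommendation database, we will train a recurrent neural network on the relationship between the text in which a person describes their impression of a movie and that same person's verdict, given as star ratings or similar, on whether the movie is good or bad.

Once training is complete, the goal is to predict the verdict for a new, unseen review.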

In [1]:
# Path-related libraries
import os
from os.path import join, isdir

# Scientific math libraries
import numpy as np
from sklearn.model_selection import train_test_split

# Visualization library
import matplotlib.pyplot as plt

import tensorflow as tf
import keras
from keras import layers, models, optimizers, Input
from keras.utils import to_categorical
from keras.preprocessing import sequence

Using TensorFlow backend.


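## 1. Exploring the data

#### 1.1 Loading the data
We use IMDB (Internet Movie DataBase), a public dataset provided with Keras.

The IMDB data contains 50,000 movie reviews in total, 25,000 recommended (pos = 1) and 25,000 not recommended (neg = 0), together with the binarized rating for each review.

The rating label is positive for reviews with many stars and negative otherwise.
We will train a model that reads a review and predicts this rating label.

The function below loads the data.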

In [2]:
def load_data(path, num_words=None, skip_top=0, seed=113):
    # With NumPy 1.16.3 or later, allow_pickle must be set to True.
    with np.load(path, allow_pickle=True) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    np.random.seed(seed)
    
    indices = np.arange(len(x_train))
    np.random.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]
    
    indices = np.arange(len(x_test))
    np.random.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]
    
    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])
    
    if not num_words:
        num_words = max([max(x) for x in xs])

    xs = [[w for w in x if skip_top <= w < num_words] for x in xs]
    
    idx = len(x_train)
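    # Reviews have different lengths, so the arrays are built with dtype=object;
    # recent NumPy versions refuse to create ragged arrays implicitly.
    x_train, y_train = np.array(xs[:idx], dtype=object), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:], dtype=object), np.array(labels[idx:])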
    
    return (x_train, y_train), (x_test, y_test)

In [3]:
(train_data, train_labels), (test_data, test_labels) = load_data('./data/imdb.npz', num_words=10000)
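
The effect of `num_words` and `skip_top` in `load_data` can be seen on a small example. The following is a minimal sketch that isolates the filtering line of the function; `filter_tokens` and `toy_reviews` are hypothetical names used only for illustration.

```python
# Toy illustration of the filtering step inside load_data (hypothetical data).
def filter_tokens(reviews, num_words, skip_top=0):
    """Keep only indices w with skip_top <= w < num_words; all others are dropped."""
    return [[w for w in review if skip_top <= w < num_words] for review in reviews]

toy_reviews = [[1, 14, 9999, 10000, 25000], [3, 12000, 7]]
filtered = filter_tokens(toy_reviews, num_words=10000)
print(filtered)  # [[1, 14, 9999], [3, 7]]
```

Note that out-of-range words are removed rather than replaced by a placeholder, so a filtered review can be shorter than the original.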

\* Only 10,000 samples each are drawn from the train set and the test set.

In [4]:
#Data partition
sampling_indices = np.random.choice(len(train_data), round(len(train_data) * 0.4), replace=False)

In [5]:
train_data = train_data[sampling_indices][:]
train_labels = train_labels[sampling_indices][:]
test_data = test_data[sampling_indices][:]
test_labels = test_labels[sampling_indices][:]
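
The subsampling above relies on indexing the data and label arrays with the same index array, which keeps each review paired with its label. A minimal sketch on made-up arrays follows; the array contents and the seed value are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for this sketch

# Hypothetical stand-ins for the data and label arrays.
data = np.arange(100) * 10
labels = np.arange(100) % 2

# Draw 40% of the positions without replacement, so no sample is picked twice.
n_keep = round(len(data) * 0.4)
idx = rng.choice(len(data), n_keep, replace=False)

# Indexing both arrays with the same idx keeps data and labels aligned.
sub_data, sub_labels = data[idx], labels[idx]
print(sub_data.shape, sub_labels.shape)  # (40,) (40,)
```

In the notebook the same `sampling_indices` are reused for the test set, which works because both sets contain 25,000 reviews.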

In [6]:
print('train_data shape:', train_data.shape)
print('train_labels shape:', train_labels.shape)
print('a train_data sample:', train_data[1])
print('a train_label sample:', train_labels[1])

train_data shape: (10000,)
train_labels shape: (10000,)
a train_data sample: [487, 121, 48, 10, 381, 100, 871, 107, 764, 741, 7, 7, 441, 764, 741, 13, 3, 125, 71, 870, 705, 9, 211, 122, 5, 3, 547, 377, 2, 1361, 3858, 18, 31, 1, 55, 9, 13, 117, 42, 3, 7, 7, 42, 251, 5, 615, 6768, 48, 163, 11, 705, 14, 3014, 14, 10, 255, 9, 546, 12, 328, 273, 1, 164, 119, 3, 289, 4, 3658, 22, 80, 1, 203, 4, 48, 59, 897, 25, 74, 3, 518, 4, 1, 4206, 111, 10, 244, 8193, 5, 856, 10, 13, 146, 3, 17, 12, 557, 3, 173, 7, 7, 82, 6226, 10, 101, 23, 1, 4, 1437, 2, 75, 229, 2, 12, 117, 55, 22, 892, 5, 63, 2195, 5, 1, 1437, 3, 706, 43, 4, 155, 50, 37, 3, 690, 454, 18, 195, 181, 49]
a train_label sample: 1


Each entry of train_data is a list of integers.
Each integer stands for one word of the movie review; that is, the review has been converted into a list of integers.<br>
(Every word in a review is indexed by a unique integer.)

train_labels takes the value 0 or 1, where 0 means not recommended and 1 means recommended.
<br><br>

Example)
<img src = "./Images/reviewsample.png">

<img src = "./Images/reviewtointlist.png">
(For the exact indexing scheme, see the Reference section below.)
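
To make the review-to-integer conversion concrete, here is a minimal sketch on a three-sentence toy corpus. It assumes a frequency-rank scheme (the most frequent word gets index 1), which matches the IMDB word index where `the` maps to 1; the corpus and variable names are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the IMDB data.
corpus = ["the movie was good", "the movie was bad", "the plot was thin"]

# Count every word, then rank words by frequency (ties keep first-seen order).
counts = Counter(word for text in corpus for word in text.split())
word_to_index = {word: rank
                 for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Encode one review as a list of integers.
encoded = [word_to_index[w] for w in "the movie was good".split()]
print(encoded)  # [1, 3, 2, 4]
```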

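#### 1.2 Restoring words from the integer lists

The get_word_index function below loads a dictionary whose keys are words and whose values are the corresponding word indices. We use this dictionary to restore the words.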

In [7]:
import json
def get_word_index(path):
    with open(path) as f:
        return json.load(f)

In [8]:
# Load the dictionary
word_to_integer = get_word_index('./data/imdb_word_index.json')

# Print 10 keys of the dictionary.
print(list(word_to_integer.keys())[0:10])

# Build a new dictionary with the keys and values of the original swapped.
# key : index
# value : word
reverse_word_to_integer = dict([(value, key) for (key, value) in word_to_integer.items()])

# Print the words for the given indices.
print(reverse_word_to_integer[487])
print(reverse_word_to_integer[121])

# keras.datasets.imdb.load_data shifts every index by 3 (0 is 'padding', 1 is 'start of sequence', 2 is 'unknown'),
# so decoding its output requires subtracting 3. The load_data defined above applies no such shift,
# so the indices are looked up directly here.
# dict.get(x) returns the value stored under key x.
# get(x, default) returns the default when the key is missing from the dictionary.
decoded_review = ' '.join([reverse_word_to_integer.get(i, 'UNK') for i in train_data[1]])
print(decoded_review)

['fawn', 'tsukino', 'nunnery', 'sonja', 'vani', 'woods', 'spiders', 'hanging', 'woody', 'trawling']
you'll
know
you'll know what i mean after you've seen red eye br br overall red eye was a better than expected thriller it gets off to a slow start and slowly builds but by the time it was over it's a br br it's hard to exactly define what makes this thriller as thrilling as i found it except that simply put the director did a job of pulling you into the action of what would otherwise have been a run of the mill plot i rather tended to forget i was watching a movie that says a lot br br other factors i think are the of victim and bad guy and that over time you begin to really relate to the victim a 8 out of 10 more like a 7 5 but that's pretty good


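## 2. Simple preprocessing

#### 2.1 Matching the model input size

The sentences in the dataset have different lengths, so they must be brought to a common length before an RNN (LSTM) can process them.

The pad_sequences() function from the Keras sequence module truncates reviews longer than maxlen and fills reviews shorter than maxlen with 0. With its default settings (padding='pre', truncating='pre'), both operations act on the front of the sequence: the last maxlen tokens are kept, and zeros are added before the first token.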

In [9]:
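# Parameter settings
# Vocabulary size of the embedding layer. load_data above kept only indices below 10,000,
# so this value is larger than strictly needed.
max_features = 20000
# every review is padded or truncated to this many tokens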
maxlen = 80
batch_size = 512

In [10]:
X_train = sequence.pad_sequences(train_data, maxlen=maxlen)
X_test = sequence.pad_sequences(test_data, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

X_train shape: (10000, 80)
X_test shape: (10000, 80)
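
The default padding and truncation behaviour can be reproduced in a few lines of NumPy. The function below is a sketch that mimics `pad_sequences` with `padding='pre'` and `truncating='pre'`; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Mimic pad_sequences defaults: keep the last maxlen tokens, pad on the left."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]            # drop tokens from the front if too long
        out[row, -len(kept):] = kept    # right-align, leaving padding on the left
    return out

print(pad_pre([[1, 2, 3, 4, 5, 6], [7, 8]], maxlen=4))
# [[3 4 5 6]
#  [0 0 7 8]]
```

With maxlen = 80, this means the model sees only the final 80 tokens of each long review.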


## 3. Keras로 LSTM 모델 만들어보기

In [11]:
# Input tensor
# input_shape: the shape of one sample, excluding the batch size.
# Here it is (80,): the sequence length after padding.
input_shape = (X_train[0].shape)
input_tensor = layers.Input(input_shape)

In [12]:
print('Build model...')
embed_layer = layers.Embedding(max_features, 128)(input_tensor) 
lstm_layer = layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2)(embed_layer)
output_tensor = layers.Dense(1, activation='sigmoid')(lstm_layer)

Build model...


In [13]:
model = models.Model(input_tensor, output_tensor)
model.compile(
              optimizer='Adam',
              # binary classification, so use binary_crossentropy
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [14]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 80)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 80, 128)           2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
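
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types: an embedding stores one vector per vocabulary entry, an LSTM has four gates that each hold an input kernel, a recurrent kernel and a bias, and a dense layer holds a weight matrix plus a bias. The helper function names are made up for illustration.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

emb = embedding_params(20000, 128)   # 2560000
lstm = lstm_params(128, 128)         # 131584
dense = dense_params(128, 1)         # 129
print(emb, lstm, dense, emb + lstm + dense)  # 2560000 131584 129 2691713
```

Almost all of the parameters sit in the embedding layer, whose size is set by max_features.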


In [15]:
print('Train...')
model.fit(X_train, train_labels,
          batch_size=batch_size,
          epochs=15,
          verbose=2,
          validation_data=(X_test, test_labels))
score, acc = model.evaluate(X_test, test_labels,
                            batch_size=batch_size,
                           verbose=2)

print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 10000 samples, validate on 10000 samples
Epoch 1/15
 - 11s - loss: 0.6791 - acc: 0.5993 - val_loss: 0.5732 - val_acc: 0.6942
Epoch 2/15
 - 10s - loss: 0.5062 - acc: 0.7846 - val_loss: 0.4276 - val_acc: 0.8069
Epoch 3/15
 - 10s - loss: 0.3524 - acc: 0.8565 - val_loss: 0.3896 - val_acc: 0.8257
Epoch 4/15
 - 10s - loss: 0.2473 - acc: 0.9046 - val_loss: 0.4183 - val_acc: 0.8192
Epoch 5/15
 - 10s - loss: 0.1987 - acc: 0.9322 - val_loss: 0.4262 - val_acc: 0.8129
Epoch 6/15
 - 10s - loss: 0.1572 - acc: 0.9475 - val_loss: 0.5343 - val_acc: 0.8089
Epoch 7/15
 - 10s - loss: 0.1297 - acc: 0.9569 - val_loss: 0.5440 - val_acc: 0.8031
Epoch 8/15
 - 10s - loss: 0.1068 - acc: 0.9648 - val_loss: 0.6376 - val_acc: 0.7981
Epoch 9/15
 - 10s - loss: 0.0872 - acc: 0.9735 - val_loss: 0.5879 - val_acc: 0.7982
Epoch 10/15
 - 10s - loss: 0.0783 - acc: 0.9760 - val_loss: 0.7223 - val_acc: 0.7946
Epoch 11/15
 - 10s - loss: 0.0700 - acc: 0.9803 - val_loss: 0.6502 - val_acc: 0.7905
Epoch 12/15
 - 
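
In the log above, the training loss keeps falling while the validation loss stops improving early, which suggests overfitting. The sketch below finds the epoch with the lowest validation loss; the numbers are copied from the eleven complete epochs printed above, and the variable names are chosen for illustration.

```python
# Values copied from the training log above (epochs 1 to 11).
train_loss = [0.6791, 0.5062, 0.3524, 0.2473, 0.1987, 0.1572,
              0.1297, 0.1068, 0.0872, 0.0783, 0.0700]
val_loss = [0.5732, 0.4276, 0.3896, 0.4183, 0.4262, 0.5343,
            0.5440, 0.6376, 0.5879, 0.7223, 0.6502]

# Epochs are 1-based in the log, list positions are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1])  # 3 0.3896
```

One common response is to stop training near that epoch, for example with the Keras `EarlyStopping` callback monitoring `val_loss`.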

## Comparison model

In [16]:
X_train2 = np.reshape(X_train, (-1, len(X_train[0]), 1))
X_test2 = np.reshape(X_test, (-1, len(X_test[0]), 1))

In [17]:
print('X_train2 shape:', X_train2.shape)
print('X_test2 shape:', X_test2.shape)

X_train2 shape: (10000, 80, 1)
X_test2 shape: (10000, 80, 1)
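
The reshape adds a trailing feature axis of size 1, so that each time step carries a single number: the raw word index. A minimal sketch on a made-up 2 x 4 array, with names chosen for illustration:

```python
import numpy as np

# Hypothetical padded batch: 2 reviews, 4 time steps each.
X = np.array([[0, 0, 7, 8],
              [3, 4, 5, 6]])

# (samples, timesteps) -> (samples, timesteps, features=1)
X3 = np.reshape(X, (-1, X.shape[1], 1))
print(X3.shape)         # (2, 4, 1)
print(X3[0, :, 0])      # [0 0 7 8]
```

Without an embedding layer the LSTM receives the word index itself as a magnitude, although the index is only an identifier. This may be part of the reason the comparison model stays close to chance level in the log below.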


In [18]:
print('Build model2...')

input_shape2 = (X_train2[0].shape)
input_tensor2 = layers.Input(input_shape2)

# embed_layer = layers.Embedding(max_features, 128)(input_tensor) 
lstm_layer1 = layers.LSTM(200, dropout=0.2, return_sequences=True, recurrent_dropout=0.2)(input_tensor2)

lstm_layer2 = layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2)(lstm_layer1)

output_tensor2 = layers.Dense(1, activation='sigmoid')(lstm_layer2)

Build model2...


In [19]:
model2 = models.Model(input_tensor2, output_tensor2)
model2.compile(
              optimizer='Adam',
              # binary classification, so use binary_crossentropy
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [20]:
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 80, 1)             0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 80, 200)           161600    
_________________________________________________________________
lstm_3 (LSTM)                (None, 128)               168448    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 330,177
Trainable params: 330,177
Non-trainable params: 0
_________________________________________________________________


In [21]:
print('Train...')
model2.fit(X_train2, train_labels,
          batch_size=batch_size,
          epochs=15,
          verbose=2,
          validation_data=(X_test2, test_labels))

score2, acc2 = model2.evaluate(X_test2, test_labels,
                            batch_size=batch_size,
                            verbose = 2)

print('Test score2:', score2)
print('Test accuracy2:', acc2)

Train...
Train on 10000 samples, validate on 10000 samples
Epoch 1/15
 - 24s - loss: 0.6979 - acc: 0.5047 - val_loss: 0.6923 - val_acc: 0.5155
Epoch 2/15
 - 23s - loss: 0.6935 - acc: 0.5142 - val_loss: 0.6922 - val_acc: 0.5188
Epoch 3/15
 - 23s - loss: 0.6925 - acc: 0.5160 - val_loss: 0.6888 - val_acc: 0.5398
Epoch 4/15
 - 23s - loss: 0.6902 - acc: 0.5270 - val_loss: 0.6942 - val_acc: 0.5311
Epoch 5/15
 - 23s - loss: 0.6913 - acc: 0.5285 - val_loss: 0.6864 - val_acc: 0.5464
Epoch 6/15
 - 23s - loss: 0.6904 - acc: 0.5250 - val_loss: 0.6854 - val_acc: 0.5492
Epoch 7/15
 - 23s - loss: 0.6899 - acc: 0.5301 - val_loss: 0.6900 - val_acc: 0.5487
Epoch 8/15
 - 23s - loss: 0.6904 - acc: 0.5308 - val_loss: 0.6890 - val_acc: 0.5288
Epoch 9/15
 - 23s - loss: 0.6883 - acc: 0.5339 - val_loss: 0.6857 - val_acc: 0.5501
Epoch 10/15
 - 23s - loss: 0.6890 - acc: 0.5320 - val_loss: 0.7005 - val_acc: 0.5282
Epoch 11/15
 - 23s - loss: 0.6909 - acc: 0.5316 - val_loss: 0.6880 - val_acc: 0.5359
Epoch 12/15
 - 

### Reference
+ IMDB : https://www.samyzaf.com/ML/imdb/imdb.html