# Sentiment Classification

### Task
* Using 50,000 movie reviews from the IMDB movie site, classify each review as positive or negative.
* The goal is to implement the deep learning training process yourself, apart from data loading.

### Dataset
* [IMDB datasets](https://www.imdb.com/interfaces/)

### Base code
* Dataset: split into train, val, and test
* Input data shape: (`batch_size`, `max_sequence_length`)
* Output data shape: (`batch_size`, 1)
* Architecture:
  * Guide to a simple classification model using an RNN
  * `Embedding` - `SimpleRNN` - `Dense` (with sigmoid, for binary classification)
  * Use [`tf.keras.layers`](https://www.tensorflow.org/api_docs/python/tf/keras/layers)
* Training
  * Use `model.fit`
* Evaluation
  * Use `model.evaluate` on the test dataset
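
To make the input and output shapes listed above concrete, here is a minimal NumPy sketch of the forward pass of an `Embedding` - `SimpleRNN` - `Dense (sigmoid)` stack. It is an illustration only, not the Keras implementation: the toy sizes, the random weights, and the helper name `forward` are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only to keep the example small.
batch_size, max_sequence_length = 4, 6
vocab_size, embedding_size, hidden_size = 20, 8, 5

# Randomly initialised weights stand in for trained parameters.
embedding = rng.normal(size=(vocab_size, embedding_size))
w_x = rng.normal(size=(embedding_size, hidden_size))
w_h = rng.normal(size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
w_out = rng.normal(size=(hidden_size, 1))
b_out = np.zeros(1)

def forward(token_ids):
    """Map (batch_size, max_sequence_length) token ids to (batch_size, 1) probabilities."""
    x = embedding[token_ids]                 # (batch, time, embedding_size)
    h = np.zeros((token_ids.shape[0], hidden_size))
    for t in range(token_ids.shape[1]):      # simple RNN recurrence over time
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h + b_h)
    logits = h @ w_out + b_out               # last hidden state -> one logit
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid for binary classification

token_ids = rng.integers(0, vocab_size, size=(batch_size, max_sequence_length))
probs = forward(token_ids)
print(token_ids.shape, "->", probs.shape)    # (4, 6) -> (4, 1)
```

Only the last hidden state feeds the output layer, which is why a whole sequence collapses to a single probability per review.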

### Try some techniques
* Adjust the number of training epochs
* Change model architectures (Custom model)
  * Use other cells (LSTM, GRU, etc.)
  * Use dropout layers
* Adjust the embedding size
  * Or train with one-hot vectors
* Vary the number of words in the vocabulary
* Vary the `pad` option
* Data augmentation (if possible)
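
On the choice between an embedding and one-hot vectors: an embedding lookup gives the same result as multiplying a one-hot vector by a weight matrix, just without building the large sparse vector. The NumPy sketch below checks this on toy sizes; the matrix and token ids are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 10, 4          # toy sizes, not the notebook's values
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))

token_ids = np.array([1, 7, 3])

# Route 1: one-hot encode each token, then multiply by the weight matrix.
one_hot = np.eye(vocab_size)[token_ids]     # (3, vocab_size)
via_one_hot = one_hot @ embedding_matrix    # (3, embedding_size)

# Route 2: look the rows up directly, as an embedding layer does.
via_lookup = embedding_matrix[token_ids]

print(np.allclose(via_one_hot, via_lookup))  # True
```

Feeding raw one-hot vectors straight into the recurrent layer, with no projection, is a different experiment: the input width then equals the vocabulary size.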

## Natural language processing workflow

The flowchart of the algorithm is roughly:

<img src="https://user-images.githubusercontent.com/11681225/46912373-d2a3a800-cfae-11e8-8201-ef17b65834f5.png" alt="natural_language_flowchart" style="width: 300px;"/>

## Import modules

In [1]:
use_colab = True
assert use_colab in [True, False]

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import time
import shutil
import tarfile

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

import tensorflow as tf

from tensorflow.keras import layers

## Load Data

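* We use a total of 50,000 movie reviews downloaded from IMDB (25,000 for training and 25,000 for testing).
* `tf.keras.datasets` already ships a well-prepared version of this dataset, so it is easy to download and use.
* The reviews are originally text, but `tf.keras.datasets.imdb` is already tokenized into integer indices.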
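
To show what "already tokenized" means, the pure-Python sketch below encodes text as integers by frequency rank, with reserved indices for padding, start, and unknown tokens. It is a simplified illustration under assumed conventions (start token 1, unknown token 2, ranks shifted by 3), not the code Keras used to build the dataset; `build_word_rank`, `encode`, and the toy corpus are hypothetical.

```python
from collections import Counter

# Reserved indices, following the convention used later in this notebook.
PAD, START, UNK = 0, 1, 2
INDEX_FROM = 3  # real word ranks are shifted by this offset

def build_word_rank(texts):
    """Rank words by frequency: the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(text, word_rank, num_words):
    """Turn a review into integer ids; rare or unseen words become UNK."""
    ids = [START]
    for word in text.lower().split():
        rank = word_rank.get(word)
        index = rank + INDEX_FROM if rank is not None else None
        ids.append(index if index is not None and index < num_words else UNK)
    return ids

# Hypothetical toy corpus, not taken from IMDB.
corpus = ["the film was great", "the film was dull", "the plot was great"]
word_rank = build_word_rank(corpus)
print(encode("the film was superb", word_rank, num_words=7))  # [1, 4, 6, 5, 2]
```

Lowering `num_words` pushes more words into the unknown bucket, which is the effect of the `num_words=10000` argument in the loading cell.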

In [4]:
# Load training and eval data from tf.keras
imdb = tf.keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
train_labels = train_labels.astype(np.float64)
test_labels = test_labels.astype(np.float64)

In [5]:
print("Train-set size: ", len(train_data))
print("Test-set size:  ", len(test_data))

Train-set size:  25000
Test-set size:   25000


### Print the data
* Check what the data looks like right after loading the dataset.

In [6]:
print(train_data[1])

[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]


In [7]:
print("sequence length: {}".format(len(train_data[1])))

sequence length: 189


* Check the label information
  * 0.0 for a negative sentiment (negative review)
  * 1.0 for a positive sentiment (positive review)

In [8]:
# negative sample
index = 1
print("text: {}\n".format(train_data[index]))
print("label: {}".format(train_labels[index]))

text: [1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012, 134, 26, 4, 715, 8, 118, 1634, 14, 394, 20, 13, 119, 954, 189, 102, 5, 207, 110, 3103, 21, 14, 69, 188, 8, 30, 23, 7, 4, 249, 126, 93, 4, 114, 9, 2300, 1523, 5, 647, 4, 116, 9, 35, 8163, 4, 229, 9, 340, 1322, 4, 118, 9, 4, 130, 4901, 19, 4, 1002, 5, 89, 29, 952, 46, 37, 4, 455, 9, 45, 43, 38, 1543, 1905, 398, 4, 1649, 26, 6853, 5, 163, 11, 3215, 2, 4, 1153, 9, 194, 775, 7, 8255, 2, 349, 2637, 148, 605, 2, 8003, 15, 123, 125, 68, 2, 6853, 15, 349, 165, 4362, 98, 5, 4, 228, 9, 43, 2, 1157, 15, 299, 120, 5, 120, 174, 11, 220, 175, 136, 50, 9, 4373, 228, 8255, 5, 2, 656, 245, 2350, 5, 4, 9837, 131, 152, 491, 18, 2, 32, 7464, 1212, 14, 9, 6, 371, 78, 22, 625, 64, 1382, 9, 8, 168, 145, 23, 4, 1690, 15, 16, 4, 1355, 5, 28, 6, 52, 154, 462, 33, 89, 78, 285, 16, 145, 95]

label: 0.0


In [9]:
# positive sample
index = 200
print("text: {}\n".format(train_data[index]))
print("label: {}".format(train_labels[index]))

text: [1, 14, 9, 6, 227, 196, 241, 634, 891, 234, 21, 12, 69, 6, 6, 176, 7, 4, 804, 4658, 2999, 667, 11, 12, 11, 85, 715, 6, 176, 7, 1565, 8, 1108, 10, 10, 12, 16, 1844, 2, 33, 211, 21, 69, 49, 2009, 905, 388, 99, 2, 125, 34, 6, 2, 1274, 33, 4, 130, 7, 4, 22, 15, 16, 6424, 8, 650, 1069, 14, 22, 9, 44, 4609, 153, 154, 4, 318, 302, 1051, 23, 14, 22, 122, 6, 2093, 292, 10, 10, 723, 8721, 5, 2, 9728, 71, 1344, 1576, 156, 11, 68, 251, 5, 36, 92, 4363, 133, 199, 743, 976, 354, 4, 64, 439, 9, 3059, 17, 32, 4, 2, 26, 256, 34, 2, 5, 49, 7, 98, 40, 2345, 9844, 43, 92, 168, 147, 474, 40, 8, 67, 6, 796, 97, 7, 14, 20, 19, 32, 2188, 156, 24, 18, 6090, 1007, 21, 8, 331, 97, 4, 65, 168, 5, 481, 53, 3084]

label: 1.0


## Prepare dataset

### Convert the integers back to words

* Check that the data we are working with really is review text.

In [10]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0 # Zero Padding
word_index["<START>"] = 1 # Start
word_index["<UNK>"] = 2  # unknown (out-of-vocabulary or low-frequency words)
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
  return ' '.join([reverse_word_index.get(i, '?') for i in text])


#### Print text data

In [11]:
print(train_data[5])

[1, 778, 128, 74, 12, 630, 163, 15, 4, 1766, 7982, 1051, 2, 32, 85, 156, 45, 40, 148, 139, 121, 664, 665, 10, 10, 1361, 173, 4, 749, 2, 16, 3804, 8, 4, 226, 65, 12, 43, 127, 24, 2, 10, 10]


In [12]:
decode_review(train_data[5])

"<START> begins better than it ends funny that the russian submarine crew <UNK> all other actors it's like those scenes where documentary shots br br spoiler part the message <UNK> was contrary to the whole story it just does not <UNK> br br"

In [13]:
print(train_labels[5])

0.0


### Padding and truncating data using pad sequences
* The reviews all have different lengths, so bring them to a common length.

In [14]:
from tensorflow.keras.utils import pad_sequences

In [15]:
# Long reviews get truncated and short ones padded; collect the sequence lengths first
num_seq_length = np.array([len(tokens) for tokens in list(train_data) + list(test_data)])
train_seq_length = np.array([len(tokens) for tokens in train_data], dtype=np.int32)
test_seq_length = np.array([len(tokens) for tokens in test_data], dtype=np.int32)


In [16]:
max_seq_length = 330  # maximum sequence length

* Fraction of reviews shorter than the max length

In [17]:
print(np.sum(num_seq_length < max_seq_length) / len(num_seq_length))

0.8052


* Setting `max_seq_length` to 330 covers about 80% of the whole dataset.
* The remaining 20% or so of the reviews contain 330 or more tokens.
* Sequences longer than the chosen `max_seq_length` are usually *truncated*.
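
Picking `max_seq_length` trades coverage against padding. As a rough sketch, the helpers below compute the coverage of a given length and, going the other way, the smallest length that reaches a target coverage. The function names and the toy lengths are assumptions for illustration; in the notebook you would pass `num_seq_length` instead.

```python
import math

def coverage(lengths, max_len):
    """Fraction of sequences strictly shorter than max_len (same test as above)."""
    return sum(length < max_len for length in lengths) / len(lengths)

def smallest_max_len(lengths, target):
    """Smallest max_len whose coverage reaches the target fraction."""
    ordered = sorted(lengths)
    needed = max(1, math.ceil(target * len(ordered)))  # how many sequences must fit
    return ordered[needed - 1] + 1                     # just longer than the needed-th shortest

toy_lengths = [12, 40, 55, 80, 120, 150, 200, 260, 400, 900]
print(coverage(toy_lengths, 256))          # 0.7
print(smallest_max_len(toy_lengths, 0.8))  # 261
```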

In [18]:
# There are two padding options.
pad = 'pre'
# pad = 'post'

In [19]:
train_data_pad = pad_sequences(train_data,
                               maxlen=max_seq_length,
                               padding=pad,
                               value=word_index["<PAD>"])
test_data_pad = pad_sequences(test_data,
                              maxlen=max_seq_length,
                              padding=pad,
                              value=word_index["<PAD>"])

In [20]:
# Shape is (number of samples, sequence length)
print(train_data_pad.shape)
print(test_data_pad.shape)

(25000, 330)
(25000, 330)
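
What the padding step does to a single review can be sketched in plain Python. This is not the Keras implementation; `pad_or_truncate` is a hypothetical helper that mirrors the `padding` and `truncating` options, both of which default to `'pre'` in `pad_sequences`.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pure-Python sketch of bringing one sequence to exactly `maxlen` items."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    filler = [value] * (maxlen - len(seq))
    # 'pre' puts the filler before the tokens, 'post' after them.
    return filler + list(seq) if padding == "pre" else list(seq) + filler

short, long = [1, 14, 22], [1, 5, 6, 7, 8, 9, 10]
print(pad_or_truncate(short, 5, padding="pre"))   # [0, 0, 1, 14, 22]
print(pad_or_truncate(short, 5, padding="post"))  # [1, 14, 22, 0, 0]
print(pad_or_truncate(long, 5))                   # [6, 7, 8, 9, 10]
```

With `'pre'` padding the real tokens sit at the end of the sequence, closest to the final RNN step. Note that `'pre'` truncation also cuts the start token off long reviews.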


#### Print padded data

In [21]:
index = 0
print("text: {}\n".format(decode_review(train_data[index])))
print("token: {}\n".format(train_data[index]))
print("pad: {}".format(train_data_pad[index]))

text: <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised f

### Create a validation set

* `num_val_data = 5000`: hold out the first 5,000 training samples (20% of the 25,000 training reviews) for validation.
* Reason: to split the data into a training set and a validation set.


```
num_val_data = int(len(train_data) * 0.2)
```
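
The notebook takes the first samples as the validation set. If the data were not already shuffled, a shuffled split would be safer. The sketch below is one possible variant, not the notebook's method; `split_train_val`, its `seed` argument, and the toy arrays are assumptions for illustration.

```python
import numpy as np

def split_train_val(data, labels, val_fraction=0.2, seed=0):
    """Shuffle indices once, then carve off a validation slice."""
    num_val = int(len(data) * val_fraction)
    order = np.random.default_rng(seed).permutation(len(data))
    val_idx, train_idx = order[:num_val], order[num_val:]
    return (data[train_idx], labels[train_idx]), (data[val_idx], labels[val_idx])

# Toy stand-ins for train_data_pad and train_labels.
toy_data = np.arange(50).reshape(10, 5)
toy_labels = np.array([0, 1] * 5, dtype=np.float64)
(train_x, train_y), (val_x, val_y) = split_train_val(toy_data, toy_labels)
print(train_x.shape, val_x.shape)  # (8, 5) (2, 5)
```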



In [22]:
num_val_data = 5000
val_data_pad = train_data_pad[:num_val_data]
train_data_pad_partial = train_data_pad[num_val_data:]

val_labels = train_labels[:num_val_data]
train_labels_partial = train_labels[num_val_data:]

### Build the datasets

In [23]:
batch_size = 1024

# for train
train_dataset = tf.data.Dataset.from_tensor_slices((train_data_pad_partial, train_labels_partial))
train_dataset = train_dataset.shuffle(10000).repeat().batch(batch_size=batch_size)
print(train_dataset)

# for test
test_dataset = tf.data.Dataset.from_tensor_slices((test_data_pad, test_labels))
test_dataset = test_dataset.batch(batch_size=batch_size)
print(test_dataset)

# for valid
valid_dataset = tf.data.Dataset.from_tensor_slices((val_data_pad, val_labels))
valid_dataset = valid_dataset.batch(batch_size=batch_size)
print(valid_dataset)

<_BatchDataset element_spec=(TensorSpec(shape=(None, 330), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.float64, name=None))>
<_BatchDataset element_spec=(TensorSpec(shape=(None, 330), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.float64, name=None))>
<_BatchDataset element_spec=(TensorSpec(shape=(None, 330), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.float64, name=None))>


## Setup hyper-parameters
- Representing words well may require a high-dimensional embedding.
- The embedding layer maps each word index to a vector.

In [24]:
# Set the hyperparameter set
max_epochs = 10
embedding_size = 256 # dimension of the embedding vector that represents each token
vocab_size = 10000 # total number of unique tokens the model can handle

# the save point
if use_colab:
    checkpoint_dir ='./drive/My Drive/train_ckpt/sentimental/exp1'
    if not os.path.isdir(checkpoint_dir):
        os.makedirs(checkpoint_dir)
else:
    checkpoint_dir = 'sentimental/exp1'

## Build the model
### Embedding layer

* The embedding layer converts each index, drawn from the full vocabulary (`num_words` entries), into a *dense vector* of size `embedding_size`.

In [27]:
# LSTM
model = tf.keras.Sequential()
model.add(layers.Embedding(vocab_size, embedding_size, mask_zero=True,)) # 10000, 256

# Build the model freely with model.add.

# GRU model
# return_sequences=True: many-to-many model (one output per time step,
# which also makes text generation possible)
# model.add(layers.GRU(1, return_sequences=True)) # GRU

# LSTM model
# model.add(layers.LSTM(1, return_sequences=True)) # LSTM

# Bidirectional model
# To build a bidirectional RNN, just wrap the layer in layers.Bidirectional.
# The output size doubles with the default merge_mode='concat'; the two directions can also be summed.
model.add(layers.Bidirectional(layers.LSTM(1, return_sequences=True)))
model.add(layers.Bidirectional(layers.LSTM(1, return_sequences=True)))
model.add(layers.Bidirectional(layers.LSTM(1)))

# TODO
model.add(layers.Dense(1, activation='sigmoid')) # binary classification: 0 or 1
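
Why a bidirectional wrapper doubles the output size can be seen in a small NumPy sketch. To stay short it uses a plain tanh RNN rather than an LSTM, with random weights and toy sizes, so it illustrates the merge step only and is not what `layers.Bidirectional` runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time, features, units = 2, 5, 3, 4   # toy sizes

def simple_rnn(x, w_x, w_h):
    """Return the hidden state at every time step: (batch, time, units)."""
    h = np.zeros((x.shape[0], w_h.shape[0]))
    outputs = []
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ w_x + h @ w_h)
        outputs.append(h)
    return np.stack(outputs, axis=1)

x = rng.normal(size=(batch, time, features))

# Forward and backward passes have separate weights.
fwd = simple_rnn(x, rng.normal(size=(features, units)), rng.normal(size=(units, units)))
# The backward pass reads the sequence reversed, then its outputs are flipped back.
bwd = simple_rnn(x[:, ::-1, :], rng.normal(size=(features, units)),
                 rng.normal(size=(units, units)))[:, ::-1, :]

concat = np.concatenate([fwd, bwd], axis=-1)  # feature size doubles
summed = fwd + bwd                            # feature size unchanged
print(concat.shape, summed.shape)             # (2, 5, 8) (2, 5, 4)
```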

In [28]:
# GRU
# model = tf.keras.Sequential()
# model.add(layers.Embedding(vocab_size, embedding_size)) # 10000, 256

# Build the model freely with model.add.
# model.add(layers.Bidirectional(layers.GRU(32, return_sequences=True)))
# model.add(layers.Bidirectional(layers.GRU(32)))
# model.add(layers.Dense(16, activation='relu'))
# model.add(layers.Dense(8, activation='relu'))
# TODO
# model.add(layers.Dense(1, activation='sigmoid'))

In [29]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 256)         2560000   
                                                                 
 bidirectional_5 (Bidirecti  (None, None, 2)           2064      
 onal)                                                           
                                                                 
 bidirectional_6 (Bidirecti  (None, None, 2)           32        
 onal)                                                           
                                                                 
 bidirectional_7 (Bidirecti  (None, 2)                 32        
 onal)                                                           
                                                                 
 dense_4 (Dense)             (None, 1)                 3         
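
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are made up for this example.

```python
def embedding_params(vocab_size, embedding_size):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_size

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # One forward and one backward LSTM with separate weights.
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

print(embedding_params(10000, 256))        # 2560000
print(bidirectional_lstm_params(256, 1))   # 2064
print(bidirectional_lstm_params(2, 1))     # 32
print(dense_params(2, 1))                  # 3
```

Almost all parameters sit in the embedding. With a single unit per direction, the recurrent layers carry very little capacity.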
                                                      

### Compile the model

In [30]:
learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
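
For reference, the `binary_crossentropy` loss can be written out in a few lines of plain Python. This is a sketch of the formula, not the Keras implementation; the clipping constant `eps` and the sample probabilities are assumptions.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of (label, probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1.0, 0.0, 1.0, 0.0]
print(round(binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5]), 4))  # 0.6931
print(round(binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.3]), 4))  # 0.1976
```

A model that always outputs 0.5 scores ln 2, about 0.693, so a loss near that value means the model has learned little so far.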

## Train the model

In [31]:
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_dir,
                                                 save_weights_only=True,
                                                 monitor='val_loss',
                                                 mode='auto',
                                                 save_best_only=True,
                                                 verbose=1)

In [32]:
history = model.fit(train_dataset,
                    epochs=max_epochs,
                    validation_data=valid_dataset,
                    steps_per_epoch=len(train_data_pad_partial) // batch_size,
                    validation_steps=len(val_data_pad) // batch_size,
                    callbacks=[cp_callback])


Epoch 1/10
Epoch 1: val_loss improved from inf to 0.68849, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 2/10
Epoch 2: val_loss improved from 0.68849 to 0.67603, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 3/10
Epoch 3: val_loss improved from 0.67603 to 0.65363, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 4/10
Epoch 4: val_loss improved from 0.65363 to 0.61891, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 5/10
Epoch 5: val_loss improved from 0.61891 to 0.58586, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 6/10
Epoch 6: val_loss improved from 0.58586 to 0.56845, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 7/10
Epoch 7: val_loss improved from 0.56845 to 0.54989, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 8/10
Epoch 8: val_loss improved from 0.54989 to 0.53286, saving model to ./drive/My Drive/train_ckpt/sentimental/exp1
Epoch 9/10
E
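
The `steps_per_epoch` and `validation_steps` arguments in the `model.fit` call use floor division. A quick check of the arithmetic, assuming the 20,000 / 5,000 split and batch size of 1024 used in this notebook:

```python
import math

num_train, num_val, batch_size = 20000, 5000, 1024  # sizes used in this notebook

steps_per_epoch = num_train // batch_size           # full batches per epoch
val_steps_floor = num_val // batch_size             # what the fit call uses
val_steps_ceil = math.ceil(num_val / batch_size)    # covers the last partial batch too

skipped = num_val - val_steps_floor * batch_size
print(steps_per_epoch, val_steps_floor, val_steps_ceil, skipped)  # 19 4 5 904
```

With floor division, the last partial validation batch (904 samples here) is left out of each validation pass. Rounding up is one way to include it.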

## Test the model
* Evaluate the model on the test dataset.

In [33]:
model.load_weights(checkpoint_dir)

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7dfb6f868e80>

In [34]:
results = model.evaluate(test_dataset)
# loss
print("loss value: {:.3f}".format(results[0]))
# accuracy
print("accuracy value: {:.3f}".format(results[1]))

loss value: 0.514
accuracy value: 0.821
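
The accuracy reported here comes from thresholding the sigmoid output at 0.5. Below is a plain-Python sketch of that metric; `binary_accuracy` as written and the sample probabilities are illustrative assumptions, not the Keras code.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions whose thresholded probability matches the label."""
    hits = sum((p > threshold) == (y == 1.0) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1.0, 0.0, 1.0, 0.0, 1.0]
probs = [0.91, 0.20, 0.45, 0.62, 0.77]   # hypothetical sigmoid outputs
print(binary_accuracy(labels, probs))     # 0.6
```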
