# Sentiment Classification

### Task
* Classify 50,000 movie reviews from the IMDB movie site as positive or negative.
* The goal is to implement the deep learning training process yourself, everything except data loading.

### Dataset
* [IMDB datasets](https://www.imdb.com/interfaces/)

### Base code
* Dataset: split into train, val, and test
* Input data shape: (`batch_size`, `max_sequence_length`)
* Output data shape: (`batch_size`, 1)
* Architecture:
  * A guide to a simple classification model using an RNN
  * `Embedding` - `SimpleRNN` - `Dense (with Sigmoid)`
  * Use [`tf.keras.layers`](https://www.tensorflow.org/api_docs/python/tf/keras/layers)
* Training
  * Use `model.fit`
* Evaluation
  * Use `model.evaluate` on the test dataset

### Try some techniques
* Adjust the number of training epochs
* Change model architectures (Custom model)
  * Use other cells (LSTM, GRU, etc.)
  * Use dropout layers
* Adjust the embedding size
  * Or train on one-hot vectors
* Vary the number of words in the vocabulary
* Vary the `pad` option
* Data augmentation (if possible)

## Natural language processing workflow

The flowchart of the algorithm is roughly:

<img src="https://user-images.githubusercontent.com/11681225/46912373-d2a3a800-cfae-11e8-8201-ef17b65834f5.png" alt="natural_language_flowchart" style="width: 300px;"/>

## Import modules

### Import base modules

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import time
import shutil
import tarfile

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

import tensorflow as tf

# Import from the public tf.keras API rather than the private tensorflow.python package
from tensorflow.keras import layers

os.environ["CUDA_VISIBLE_DEVICES"]="0"

## Load Data

* We use a total of 50,000 movie reviews downloaded from IMDB.
* `tf.keras.datasets` already ships a preprocessed version of this dataset, so it is easy to download and use.
* The original data is raw text, but `tf.keras.datasets.imdb` is already tokenized.

In [None]:
# Load training and eval data from tf.keras
imdb = tf.keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
train_labels = train_labels.astype(np.float64)
test_labels = test_labels.astype(np.float64)

In [None]:
# # save np.load
# np_load_old = np.load

# # modify the default parameters of np.load
# np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# # call load_data with allow_pickle implicitly set to true
# (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# # restore np.load for future normal usage
# np.load = np_load_old


In [None]:
print("Train-set size: ", len(train_data))
print("Test-set size:  ", len(test_data))

### Print the data

In [None]:
print(train_data[0])

In [None]:
print("sequence length: {}".format(len(train_data[0])))

* Label information
  * 0.0 for a negative sentiment
  * 1.0 for a positive sentiment

In [None]:
# positive sample
index = 1
print("text: {}\n".format(train_data[index]))
print("label: {}".format(train_labels[index]))

In [None]:
# negative sample
index = 200
print("text: {}\n".format(train_data[index]))
print("label: {}".format(train_labels[index]))

## Prepare dataset

### Convert the integers back to words

In [None]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
  return ' '.join([reverse_word_index.get(i, '?') for i in text])
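
The index shift above is easier to follow on a toy example. The sketch below uses a hypothetical three-word vocabulary (`toy_word_index` and `toy_decode` are illustrative names, not part of the notebook) and applies the same `+3` offset and reserved tokens:

```python
# Hypothetical toy vocabulary; the real one comes from imdb.get_word_index().
toy_word_index = {"movie": 1, "great": 2, "boring": 3}

# Shift every index by 3 to make room for the reserved tokens.
toy_word_index = {word: idx + 3 for word, idx in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

# Invert the mapping so integers can be turned back into words.
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

def toy_decode(tokens):
    # Indices that are missing from the vocabulary are rendered as '?'.
    return " ".join(toy_reverse_index.get(i, "?") for i in tokens)

encoded = [1, 5, 4, 2, 0, 0]
decoded = toy_decode(encoded)
print(decoded)
```

Here `5` maps to "great" because its original index `2` was shifted by 3, and the trailing zeros decode to `<PAD>`.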


#### Print the text data

In [None]:
print(train_data[0])

In [None]:
decode_review(train_data[0])

### Padding and truncating data using pad sequences

In [None]:
# Import from the public tf.keras API rather than the private tensorflow.python package
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
num_seq_length = np.array([len(tokens) for tokens in list(train_data) + list(test_data)])
train_seq_length = np.array([len(tokens) for tokens in train_data], dtype=np.int32)
test_seq_length = np.array([len(tokens) for tokens in test_data], dtype=np.int32)


In [None]:
max_seq_length = 256

In [None]:
# Fraction of reviews that fit within max_seq_length without truncation
print(np.sum(num_seq_length <= max_seq_length) / len(num_seq_length))

* Setting `max_seq_length` to 256 covers about 70% of the whole dataset.
* Roughly 30% of the reviews are longer than 256 tokens.
* Reviews longer than the chosen `max_seq_length` are usually *truncated*.
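
To make padding and truncation concrete, here is a minimal NumPy sketch of what happens to a single sequence. `pad_or_truncate` is a hypothetical helper written for illustration, not the Keras implementation. Its defaults assume Keras's `pad_sequences` default of `truncating='pre'`, which applies in this notebook because `truncating` is never passed.

```python
import numpy as np

def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Toy version of what pad_sequences does to one sequence (maxlen > 0)."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    filler = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front, 'post' appends it.
    padded = filler + seq if padding == "pre" else seq + filler
    return np.array(padded, dtype=np.int32)

short = [11, 12, 13]
long = [21, 22, 23, 24, 25, 26]

short_post = pad_or_truncate(short, maxlen=5, padding="post")
short_pre = pad_or_truncate(short, maxlen=5, padding="pre")
long_cut = pad_or_truncate(long, maxlen=5)

print(short_post)
print(short_pre)
print(long_cut)
```

Note that with `truncating='pre'` a long review keeps its last `maxlen` tokens, so the beginning of the review is what gets discarded.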

In [None]:
# There are two padding options: 'pre' and 'post'.
#pad = 'pre'
pad = 'post'

In [None]:
train_data_pad = pad_sequences(train_data,
                               maxlen=max_seq_length,
                               padding=pad,
                               value=word_index["<PAD>"])
test_data_pad = pad_sequences(test_data,
                              maxlen=max_seq_length,
                              padding=pad,
                              value=word_index["<PAD>"])

In [None]:
print(train_data_pad.shape)
print(test_data_pad.shape)

#### Print the padded data

In [None]:
index = 0
print("text: {}\n".format(decode_review(train_data[index])))
print("token: {}\n".format(train_data[index]))
print("pad: {}".format(train_data_pad[index]))

### Create a validation set

In [None]:
num_val_data = 5000
val_data_pad = train_data_pad[:num_val_data]
train_data_pad_partial = train_data_pad[num_val_data:]

val_labels = train_labels[:num_val_data]
train_labels_partial = train_labels[num_val_data:]

## Setup hyper-parameters

In [None]:
# Set the hyperparameters
batch_size = 64
learning_rate = 1e-3
max_epochs = 3
embedding_size = 32
vocab_size = 10000

## Build the model

In [None]:
model = tf.keras.Sequential()

### Embedding layer

* The embedding layer maps each integer index (one of the `num_words` entries in the vocabulary) to a *dense vector* of size `embedding_size`.

In [None]:
# TODO
model.add(layers.Embedding())
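
Conceptually, an embedding layer is a table lookup. The NumPy sketch below is illustrative only: the table is random rather than trained, and the token batch is made up. It shows how a batch of integer indices becomes a batch of dense vectors.

```python
import numpy as np

vocab_size = 10000      # same value as in the hyper-parameter cell
embedding_size = 32
rng = np.random.default_rng(0)

# In a real model this table is a trainable weight; here it is just random.
embedding_table = rng.normal(size=(vocab_size, embedding_size)).astype(np.float32)

# A made-up batch: 2 reviews, 4 tokens each (0 is <PAD>).
token_batch = np.array([[1, 14, 22, 0],
                        [1, 530, 973, 1622]])

# Fancy indexing looks up one row of the table per token.
embedded = embedding_table[token_batch]
print(embedded.shape)
```

The result has shape (`batch_size`, `sequence_length`, `embedding_size`), which is the input shape a recurrent layer expects.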

### Main rnn model

In [None]:
# Feel free to build your own model with model.add.
# TODO
model.add()
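
Before picking a cell, it can help to see what a simple recurrent layer computes. The following NumPy sketch is an assumption-laden illustration of the standard recurrence `h_t = tanh(x_t W + h_{t-1} U + b)` with random weights. `simple_rnn_forward` is a hypothetical helper, not the Keras code.

```python
import numpy as np

def simple_rnn_forward(inputs, kernel, recurrent_kernel, bias):
    """Return the last hidden state for a batch of sequences.

    inputs has shape (batch_size, seq_len, input_dim).
    """
    batch_size = inputs.shape[0]
    units = kernel.shape[1]
    h = np.zeros((batch_size, units), dtype=inputs.dtype)
    for t in range(inputs.shape[1]):
        # h_t = tanh(x_t W + h_{t-1} U + b)
        h = np.tanh(inputs[:, t, :] @ kernel + h @ recurrent_kernel + bias)
    return h

rng = np.random.default_rng(0)
batch_size, seq_len, input_dim, units = 2, 5, 32, 16
x = rng.normal(size=(batch_size, seq_len, input_dim))
W = rng.normal(scale=0.1, size=(input_dim, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

last_state = simple_rnn_forward(x, W, U, b)
print(last_state.shape)
```

Only the final hidden state is returned, one vector per review, which is what the classification layer consumes.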

### Classification layer

In [None]:
# TODO
model.add()

In [None]:
model.summary()

### Compile the model

In [None]:
# tf.train.AdamOptimizer only exists in TF 1.x; the Keras optimizer works in TF 1.14+ and 2.x
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

In [None]:
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=['accuracy'])
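
As a rough illustration of what the `'binary_crossentropy'` loss and `'accuracy'` metric measure, here is a NumPy sketch on made-up logits and labels. The helper names are hypothetical, and the 0.5 decision threshold is an assumption that matches the usual convention for a sigmoid output.

```python
import numpy as np

def sigmoid(z):
    # Squash a real-valued score into a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample loss.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(per_sample))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    # A probability above the threshold counts as a positive prediction.
    return float(np.mean((y_prob > threshold) == (y_true == 1.0)))

logits = np.array([2.0, -1.0, 0.5, -3.0])   # made-up model scores
labels = np.array([1.0, 0.0, 0.0, 0.0])     # 1.0 = positive, 0.0 = negative

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)
acc = binary_accuracy(labels, probs)
print("loss: {:.3f}, accuracy: {:.2f}".format(loss, acc))
```

The third sample is the only misclassification: its score is positive although its label is negative.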

## Train the model

In [None]:
%%time
# TODO
history = model.fit()
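
One possible way to fill in the TODO cells is sketched below on a tiny synthetic dataset, so that it runs in a few seconds. Everything prefixed with `toy_` and all sizes are hypothetical stand-ins. For the real task you would pass `train_data_pad_partial`, `train_labels_partial`, `val_data_pad`, and `val_labels` instead, along with the hyper-parameters defined earlier.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Tiny synthetic stand-in for the padded IMDB arrays (hypothetical sizes).
toy_vocab_size, toy_seq_length = 100, 16
rng = np.random.RandomState(0)
x_train = rng.randint(1, toy_vocab_size, size=(64, toy_seq_length))
y_train = rng.randint(0, 2, size=(64, 1)).astype(np.float32)
x_val = rng.randint(1, toy_vocab_size, size=(16, toy_seq_length))
y_val = rng.randint(0, 2, size=(16, 1)).astype(np.float32)

# Embedding - SimpleRNN - Dense (with sigmoid), as in the base architecture.
toy_model = tf.keras.Sequential([
    layers.Embedding(input_dim=toy_vocab_size, output_dim=8),
    layers.SimpleRNN(8),
    layers.Dense(1, activation="sigmoid"),
])
toy_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

# validation_data is evaluated at the end of every epoch.
toy_history = toy_model.fit(x_train, y_train,
                            batch_size=16,
                            epochs=2,
                            validation_data=(x_val, y_val),
                            verbose=0)
print(sorted(toy_history.history.keys()))
```

Because the labels here are random, the accuracy stays near chance. The point is only the shape of the calls, not the numbers.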

## Performance on Test-Set

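Now that the model has been trained, we can calculate its classification accuracy on the test set.

In [None]:
# Evaluate on the padded test set; returns [loss, accuracy]
results = model.evaluate(test_data_pad, test_labels, batch_size=batch_size)
# loss
print("loss value: {:.3f}".format(results[0]))
# accuracy
print("accuracy value: {:.3f}".format(results[1]))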