# <font color=1978DD> HWQ: Using an RNN to analyze movie reviews</font> 


In [83]:
%env KERAS_BACKEND=tensorflow
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print('Training set size:', len(x_train))
print('Test set size:', len(x_test))

env: KERAS_BACKEND=tensorflow
Training set size: 25000
Test set size: 25000


In [84]:
from keras.preprocessing import sequence

## <font color=FC8600>  Pad or truncate every review to maxlen=300</font> 

In [108]:
x_train = sequence.pad_sequences(x_train, maxlen=300)
x_test = sequence.pad_sequences(x_test, maxlen=300)
x_train.shape

(25000, 300)
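
As a rough illustration of what `maxlen` does: with its default settings, `pad_sequences` left-pads short sequences with 0 and drops tokens from the front of long ones. The helper `pad_pre` below is a hypothetical pure-Python stand-in written for this note, not the Keras implementation, and it uses a small `maxlen` so the effect is easy to see.

```python
def pad_pre(seqs, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences with 'pre' padding and truncating."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                        # keep only the last maxlen tokens
        out.append([value] * (maxlen - len(s)) + s)  # left-pad short sequences
    return out

# One sequence shorter than maxlen, one longer (assumes maxlen >= 1).
padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)  # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `maxlen=300`, every review therefore becomes a row of exactly 300 word indices, which is why `x_train.shape` is `(25000, 300)`.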

## <font color=FC8600>  Embedding dimension N and number of LSTM units K</font> 

In [109]:
N = 300  # embedding dimension of each word vector
K = 40   # number of LSTM units

In [110]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Dropout
from keras.layers import LSTM
from keras.optimizers import Adam

## <font color=FC8600> Add Dropout layers (to reduce overfitting) </font> 

In [111]:
model = Sequential()
model.add(Embedding(10000, N))
model.add(Dropout(0.35))
model.add(LSTM(K))
model.add(Dropout(0.35))
model.add(Dense(1, activation='sigmoid'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_26 (Embedding)     (None, None, 300)         3000000   
_________________________________________________________________
dropout_37 (Dropout)         (None, None, 300)         0         
_________________________________________________________________
lstm_26 (LSTM)               (None, 40)                54560     
_________________________________________________________________
dropout_38 (Dropout)         (None, 40)                0         
_________________________________________________________________
dense_26 (Dense)             (None, 1)                 41        
Total params: 3,054,601
Trainable params: 3,054,601
Non-trainable params: 0
_________________________________________________________________
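
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The function names are hypothetical helpers for this note, not part of Keras.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per word index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

N, K = 300, 40
counts = [embedding_params(10000, N), lstm_params(N, K), dense_params(K, 1)]
print(counts, sum(counts))  # [3000000, 54560, 41] 3054601
```

Almost all of the 3,054,601 parameters sit in the embedding layer, so changing `N` affects model size far more than changing `K`.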


In [112]:
model.compile(loss='binary_crossentropy',
             optimizer=Adam(),
             metrics=['accuracy'])

## <font color=FC8600> Train with batch_size=100 and add validation_data</font> 

In [113]:
model_his = model.fit(x_train, y_train,
            batch_size=100,
            epochs=2,
            validation_data = (x_test,y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


In [114]:
score = model.evaluate(x_test, y_test)
print(f'testing data loss={score[0]}')
print(f'testing data accuracy:{score[1]}')

testing data loss=0.3094479087257385
testing data accuracy:0.87008


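## <font color=FC8600> Conclusion </font> 
### <font color=(0,0,0)> Tuning the hyperparameters this time was quite tricky: the data drawn is highly random,<br /> and overfitting is severe, so once the accuracy reaches a certain level it is hard to push it any higher.</font> 
### <font color=(0,0,0)> The final accuracy is <font color=FC0505>87.008%</font>, which barely meets the target.</font> 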

## <font color=FC8600> Common ways to handle "overfitting", summarized after looking up references </font> 
https://ithelp.ithome.com.tw/articles/10203371

### <font color=0349F7> 1. Dropout </font> 
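### <font color=020713>  $\;\;\;\;$ Reducing the number of layers or neurons limits how closely a network can fit the training data. Dropout randomly switches off some neurons during training<br />  $\;\;\;\;$ to reduce overfitting.</font>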
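
A minimal sketch of the idea, assuming the common "inverted dropout" formulation in which the surviving activations are scaled by `1 / (1 - rate)` so their expected value stays the same. The `dropout` function is a hypothetical illustration, not the Keras layer itself.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the ones that survive."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)   # fixed seed so the example is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.35, rng=rng)
print(dropped)
```

Each entry of `dropped` is either `0.0` (the neuron was switched off) or `1 / 0.65` (it was kept and rescaled). At prediction time no neurons are dropped.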
### <font color=0349F7>  2. Early Stopping </font>
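### <font color=020713>  $\;\;\;\;$ At the end of each epoch, compute the accuracy on the validation data, and stop training once it no longer improves.<br />  $\;\;\;\;$ This is a very common method: it removes the need to set the number of epochs by hand (saving training time) and also helps prevent overfitting.    </font>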
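
The stopping rule can be sketched in plain Python. The `patience` argument (how many epochs without improvement to tolerate) and the accuracy values are assumptions made up for illustration; they are not results from this notebook.

```python
def early_stopping_epoch(val_accuracies, patience=1):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    waited = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, waited = acc, 0      # improvement: reset the counter
        else:
            waited += 1                # no improvement this epoch
            if waited >= patience:
                return epoch
    return len(val_accuracies)         # never triggered: ran all epochs

history = [0.84, 0.87, 0.868, 0.861, 0.855]  # made-up validation accuracies
stop_epoch = early_stopping_epoch(history, patience=2)
print(stop_epoch)  # 4
```

In Keras the same behavior is available through the `keras.callbacks.EarlyStopping` callback (with its `monitor` and `patience` arguments), passed to `model.fit` via `callbacks=[...]`.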
### <font color=0349F7> 3. Weight Decay </font> 
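### <font color=020713>  $\;\;\;\;$ The idea is to add a penalty term to the end of the cost function (placing a constraint on some of the parameters). If a weight becomes too large,<br />  $\;\;\;\;$ the cost becomes too high, so after backpropagation that weight is penalized and kept at a small value.    </font>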
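
As a small sketch, assume the penalty is the common L2 form `lam * sum(w**2)`. The coefficient `lam`, the learning rate `lr`, and the weights below are arbitrary illustrative values. The data gradient is set to zero so that only the effect of the penalty is visible.

```python
def l2_penalty(weights, lam):
    # penalty term added to the cost function
    return lam * sum(w * w for w in weights)

def sgd_step_with_decay(weights, grads, lr, lam):
    # the gradient of lam * w**2 is 2 * lam * w, so large weights shrink more
    return [w - lr * (g + 2 * lam * w) for w, g in zip(weights, grads)]

weights = [3.0, -0.5]
grads = [0.0, 0.0]   # pretend the data loss has zero gradient
penalty = l2_penalty(weights, lam=0.5)
new_weights = sgd_step_with_decay(weights, grads, lr=0.1, lam=0.5)
print(penalty, new_weights)
```

Both weights move toward zero, and the larger one moves further in absolute terms, which is the "penalize large weights" behavior described above.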