# Assignment 5

#### Christopher W. Hong

The Road Ahead

* [Step 1:](#1) Sentiment classfication
  * [Step 1a:](#1a) Load the IMDB data
  * [Step 1b:](#1b) Split the IMDB data
  * [Step 1c:](#1c) Build model with a trainable embedding layer two fully connected layers
  * [Step 1d:](#1d) Train and evaluate accuracy
  * [Step 1e:](#1e) Downloading the GLOVE word embeddings and applied
  * [Step 1f:](#1f) Train and evaluate accuracy with embedding layer weights fixed
  * [Step 1g:](#1g) Conclusions about the trainable and GLOVE embedding
  * [Step 1h:](#1h) Build a model with an LSTM layer and a trainable embedding layer
  * [Step 1i:](#1i) Train and evaluate
  
* [Step 2:](#2) Topic classfication
  * [Step 2a:](#2a) Load the Reuters newswire data
  * [Step 2b:](#2b) Split the Reuters data
  * [Step 2c:](#2c) Build model with a trainable embedding layer and two fully connected layers
  * [Step 2d:](#2d) Train and evaluate accuracy with both trainable and loaded embedding layer
  * [Step 2e:](#2e) Compare results and draw conclusion
  * [Step 2f:](#2f) Build a model with an LSTM layer and a trainable embedding layer; train, evaluate, compare and draw conclusions.
  * [Step 2g:](#2g) Add an addtional LSTM layer to the previous model; train and evaluate.
  * [Step 2h:](#2h) Conclusion

You might wanna **skip the fist code block of 1e** that it will download the GLOVE to the current local directory. If so, please set the correct path that refers to **a folder `glove_dir` contains `glove.6B.100d.txt`** in the second code block of 1e.

In [0]:
from keras.datasets import imdb, reuters
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten, LSTM
from keras.callbacks.callbacks import EarlyStopping
from keras.utils.np_utils import to_categorical

!pip -q install wget
import os
import numpy as np

import warnings
warnings.filterwarnings('ignore')

<a id='1'></a>
## 1: Sentiment classfication

<a id='1a'></a>
### (a) Load the IMDB data

In [164]:
max_features = 10000 # number of features
maxlen = 20 # input sequence length

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences
x_train = sequence.pad_sequences(x_train, maxlen)
x_test = sequence.pad_sequences(x_test, maxlen)

Loading data...


<a id='1b'></a>
### (b) Split the IMDB data

In [165]:
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test,
                                                    test_size=0.5,
                                                    random_state=0,
                                                    stratify=y_test)

print('x_train shape:', x_train.shape)
print('x_valid shape:', x_valid.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 20)
x_valid shape: (12500, 20)
x_test shape: (12500, 20)


<a id='1c'></a>
### (c) Build model with a trainable embedding layer two fully connected layers

In [166]:
embedding_size = 8 
batch_size = 32
epochs = 100

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

Model: "sequential_47"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_47 (Embedding)     (None, 20, 8)             80000     
_________________________________________________________________
flatten_34 (Flatten)         (None, 160)               0         
_________________________________________________________________
dense_81 (Dense)             (None, 32)                5152      
_________________________________________________________________
dense_82 (Dense)             (None, 1)                 33        
Total params: 85,185
Trainable params: 85,185
Non-trainable params: 0
_________________________________________________________________


<a id='1d'></a>
### (d) Train and evaluate accuracy

In [167]:
print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Training...
Test accuracy: 0.7159199714660645


<a id='1e'></a>
### (e) Downloading the GLOVE word embeddings and applied

In [139]:
# Download the GLOVE word embeddings
if not os.path.isfile('./glove.6B.zip'):
  print('Downloading GLOVE...')
  !wget http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
  !unzip glove.6B.zip
else:
  print('glove.6B.zip downloaded')


glove.6B.zip downloaded


In [169]:
# Parse the GLOVE word-embeddings file
# Feel free to modify the dir refered to glove.6B.100d.txt
# Adapted fro F. Chollet
glove_dir = './' 

embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
  values = line.split()
  word = values[0]
  coefs = np.asarray(values[1:], dtype='float32')
  embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))


# Prepare the GLOVE word-embeddings matrix
# Adapted fro F. Chollet
embedding_dim = 100
word_index = imdb.get_word_index()

embedding_matrix = np.zeros((max_features, embedding_dim))
for word, i in word_index.items():
  if i < max_features:
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector


# Initialize the embedding layer using the embedding matrix and freeze the weights
model = Sequential()
model.add(Embedding(max_features, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.summary()

Found 400000 word vectors.
Model: "sequential_48"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_48 (Embedding)     (None, 20, 100)           1000000   
_________________________________________________________________
flatten_35 (Flatten)         (None, 2000)              0         
_________________________________________________________________
dense_83 (Dense)             (None, 32)                64032     
_________________________________________________________________
dense_84 (Dense)             (None, 1)                 33        
Total params: 2,064,065
Trainable params: 1,064,065
Non-trainable params: 1,000,000
_________________________________________________________________


<a id='1f'></a>
### (f) Train and evaluate accuracy with embedding layer weights fixed

In [170]:
print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Training...
Test accuracy: 0.7074400186538696


<a id='1g'></a>
### (g) Conclusions about the trainable and GLOVE embedding


The test accuracy of the model with GLOVE embedding layer is close to but no better than that of the one with trainable embedding layer.

<a id='1h'></a>
### (h) Build a model with an LSTM layer and a trainable embedding layer

In [172]:
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

Model: "sequential_49"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_49 (Embedding)     (None, 20, 8)             80000     
_________________________________________________________________
lstm_15 (LSTM)               (None, 32)                5248      
_________________________________________________________________
dense_85 (Dense)             (None, 1)                 33        
Total params: 85,281
Trainable params: 85,281
Non-trainable params: 0
_________________________________________________________________


<a id='1i'></a>
### (i) Train and evaluate

In [173]:
print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Training...
Test accuracy: 0.7252799868583679


The test accuracy of the model with trainable embedding layer and LSTM layer is slightly higher (~2.%) than that of the one with fully connected layer.

<a id='2'></a>
## 2: Topic classfication

<a id='2a'></a>
### (a) Load the Reuters newswire data

In [0]:
max_features = 10000
maxlen = 500

(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=max_features,
                                                         skip_top=0,
                                                         maxlen=maxlen,
                                                         test_split=0.3,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

# Pad sequences
x_train = sequence.pad_sequences(x_train, maxlen)
x_test = sequence.pad_sequences(x_test, maxlen)

<a id='2b'></a>
### (b) Split the Reuters data

In [145]:
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test,
                                                    test_size=0.5,
                                                    random_state=0,
                                                    stratify=y_test)

y_train = to_categorical(y_train)
y_valid = to_categorical(y_valid)
y_test = to_categorical(y_test)

print('x_train shape:', x_train.shape)
print('x_valid shape:', x_valid.shape)
print('x_test shape:', x_test.shape)
print('y_train shape:', y_train.shape)
print('y_valid shape:', y_valid.shape)
print('y_test shape:', y_test.shape)

x_train shape: (7543, 500)
x_valid shape: (1617, 500)
x_test shape: (1617, 500)
y_train shape: (7543, 46)
y_valid shape: (1617, 46)
y_test shape: (1617, 46)


<a id='2c'></a>
### (c) Build model with a trainable embedding layer and two fully connected layers

In [161]:
embedding_size = 32 
batch_size = 32
epochs = 100

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

Model: "sequential_46"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_46 (Embedding)     (None, 500, 32)           320000    
_________________________________________________________________
flatten_33 (Flatten)         (None, 16000)             0         
_________________________________________________________________
dense_79 (Dense)             (None, 64)                1024064   
_________________________________________________________________
dense_80 (Dense)             (None, 46)                2990      
Total params: 1,347,054
Trainable params: 1,347,054
Non-trainable params: 0
_________________________________________________________________


<a id='2d'></a>
### (d) Train and evaluate accuracy with both trainable and loaded embedding layer

In [162]:
# With trainable embedding layer
print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Training...
Test accuracy: 0.716759443283081


In [159]:
# with loaded embedding layer
# Prepare the GLOVE word-embeddings matrix
embedding_dim = 100
word_index = reuters.get_word_index(path="reuters_word_index.json")

embedding_matrix = np.zeros((max_features, embedding_dim))
for word, i in word_index.items():
  if i < max_features:
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
      embedding_matrix[i] = embedding_vector


# Initialize the embedding layer using the embedding matrix and freeze the weights
model = Sequential()
model.add(Embedding(max_features, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.summary()

print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Model: "sequential_45"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_45 (Embedding)     (None, 500, 100)          1000000   
_________________________________________________________________
flatten_32 (Flatten)         (None, 50000)             0         
_________________________________________________________________
dense_77 (Dense)             (None, 64)                3200064   
_________________________________________________________________
dense_78 (Dense)             (None, 46)                2990      
Total params: 5,203,054
Trainable params: 4,203,054
Non-trainable params: 1,000,000
_________________________________________________________________
Training...
Test accuracy: 0.6035869121551514


<a id='2e'></a>
### (e) Compare results and draw conclusion



The test accuracy of the model with trainable embedding layer and 2 fully connected layers is much higher (~13%) than that of the one with loaded embedding layer.

<a id='2f'></a>
### (f) Build a model with an LSTM layer and a trainable embedding layer; train, evaluate, compare and draw conclusions.

In [156]:
# Build a model with an LSTM layer and a trainable embedding layer
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(LSTM(64))
model.add(Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

Model: "sequential_44"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_44 (Embedding)     (None, 500, 32)           320000    
_________________________________________________________________
lstm_14 (LSTM)               (None, 64)                24832     
_________________________________________________________________
dense_76 (Dense)             (None, 46)                2990      
Total params: 347,822
Trainable params: 347,822
Non-trainable params: 0
_________________________________________________________________


In [157]:
# Train and evaluate
print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Training...
Test accuracy: 0.6561533808708191


<a id='2g'></a>

### (g) Add an addtional LSTM layer to the previous model; train and evaluate.

In [151]:
# Build a model with two LSTM layers and a trainable embedding layer
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(LSTM(64, return_sequences=True))
model.add(LSTM(64))
model.add(Dense(46, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

model.summary()

# Train and evaluate
print('Training...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_valid, y_valid),
          callbacks=[EarlyStopping(patience=5)],
          verbose=0)

_, acc = model.evaluate(x_test, y_test, batch_size=batch_size)

print('Test accuracy:', acc)

Model: "sequential_43"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_43 (Embedding)     (None, 500, 32)           320000    
_________________________________________________________________
lstm_12 (LSTM)               (None, 500, 64)           24832     
_________________________________________________________________
lstm_13 (LSTM)               (None, 64)                33024     
_________________________________________________________________
dense_75 (Dense)             (None, 46)                2990      
Total params: 380,846
Trainable params: 380,846
Non-trainable params: 0
_________________________________________________________________
Training...
Test accuracy: 0.6171923279762268


<a id='2h'></a>

### (h) Conclusion

The model with trainable embedding layer and 2 fully connected layers had the highest accuracy over others. In topic classification, the number of rare words that belong to a certain topic can distinguish it well from other topics. Thus, the order of words in the topic is not as important as those in the sentiment analysis.

## References

Chollet, F. (2018). Deep learning with python (1st ed.). Shelter Island, NY: Manning Publications Co.