# MSIS 579 HW3 RNN to Classify Reuters Topics

In this homework, we will train a recurrent neural network to classify Reuters newswires into 46 topics.

The dataset contains 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).
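
To make the encoding concrete, here is a minimal sketch of how a sequence of word indexes maps back to words. It assumes the Keras defaults for this dataset (ids 0, 1 and 2 reserved for padding, start-of-sequence and out-of-vocabulary, so every word rank is shifted by `index_from=3`). The function name `decode_newswire` and the tiny vocabulary are made up for illustration; with the real data you would pass the dictionary returned by `reuters.get_word_index()`.

```python
def decode_newswire(sequence, word_index, index_from=3):
    """Map a list of integer ids back to a space-separated string.

    A word with rank r in word_index is stored as r + index_from,
    because ids 0, 1 and 2 are reserved. Unknown ids become "?".
    """
    reverse_index = {rank: word for word, rank in word_index.items()}
    return " ".join(reverse_index.get(i - index_from, "?") for i in sequence)


toy_word_index = {"the": 1, "oil": 2, "prices": 3}  # hypothetical vocabulary
toy_sequence = [1, 5, 6, 2]  # start marker, "oil", "prices", out-of-vocabulary

print(decode_newswire(toy_sequence, toy_word_index))  # ? oil prices ?
```

The reserved ids decode to `?` because they do not correspond to any word.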

In [21]:
%tensorflow_version 1.14
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)

from __future__ import print_function

import numpy as np
import keras
keras.__version__

!pip install numpy==1.16.1

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1.14`. This will be interpreted as: `1.x`.


TensorFlow is already loaded. Please restart the runtime to change versions.


## Load Reuters Dataset

First let's load the Reuters dataset. Please refer to [this API page](https://keras.io/datasets/#reuters-newswire-topics-classification) for details on how to load the data.

In [44]:
from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer

max_words = 10000

print('Loading data...')
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_words, test_split=0.2)
word_index = reuters.get_word_index(path="reuters_word_index.json")

Loading data...


## Task 1: Fully Connected Neural Networks

In this task, we will learn a word embedding layer as well as fully connected layers to classify Reuters newswires. Please refer to the lab code from lesson 4. Watch out for overfitting.
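
As a rough picture of what the embedding and flatten steps do to the data, the NumPy sketch below uses a random matrix as a stand-in for the trainable embedding weights. The sizes (10,000 words, 8 dimensions, 20 tokens per newswire) mirror the model in this task; the batch of word ids is random and only illustrates the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, maxlen = 10000, 8, 20

# Stand-in for the weight matrix that Embedding(10000, 8) would learn.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# A batch of 2 padded newswires, each a row of 20 word ids (random here).
batch = rng.integers(0, vocab_size, size=(2, maxlen))

# Embedding is a row lookup: every word id is replaced by its vector.
embedded = embedding_matrix[batch]            # shape (2, 20, 8)

# Flatten concatenates the 20 vectors of each newswire into one long vector.
flattened = embedded.reshape(len(batch), -1)  # shape (2, 160)

print(embedded.shape, flattened.shape)
```

Because the flattened vector has a fixed slot for every position, the dense layers see word order only through position, which is one reason to try recurrent layers later.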

In [45]:
# TODO
from keras.layers import Embedding
embedding_layer = Embedding(10000, 64)

In [56]:
from keras.datasets import reuters
from keras import preprocessing
from tensorflow.keras import utils as np_utils


max_features = 10000

maxlen = 20
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=max_features)

x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
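
The preprocessing above can be mimicked in plain NumPy to see what it does. This is a sketch of the default behaviour as we understand it (`pad_sequences` truncates and pads at the front; `to_categorical` one-hot encodes the labels); the helper names `pad_pre` and `one_hot` are our own, not Keras functions.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Keep the last `maxlen` ids of each sequence and left-pad with `value`."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                 # truncating='pre' drops the beginning
        if trimmed:
            out[row, -len(trimmed):] = trimmed  # padding='pre' puts zeros on the left
    return out


def one_hot(labels, num_classes):
    """Row i has a single 1 in column labels[i]."""
    out = np.zeros((len(labels), num_classes), dtype="float32")
    out[np.arange(len(labels)), labels] = 1.0
    return out


padded = pad_pre([[1, 2, 3, 4, 5, 6], [7, 8]], maxlen=4)
encoded = one_hot([3, 45], num_classes=46)
print(padded)
print(encoded.shape)
```

With `maxlen = 20`, only the last 20 tokens of each newswire reach the model, which may limit how much any architecture can learn from the text.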

In [58]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding, Dropout
from keras.layers.convolutional import Convolution1D, MaxPooling1D
model = Sequential()
model.add(Embedding(10000, 8, input_length= maxlen))
model.add(Flatten())

model.add(Dense(64, activation='relu'))

model.add(Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
history = model.fit(
    x_train, 
    y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)


Model: "sequential_24"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_27 (Embedding)     (None, 20, 8)             80000     
_________________________________________________________________
flatten_18 (Flatten)         (None, 160)               0         
_________________________________________________________________
dense_44 (Dense)             (None, 64)                10304     
_________________________________________________________________
dense_45 (Dense)             (None, 46)                2990      
Total params: 93,294
Trainable params: 93,294
Non-trainable params: 0
_________________________________________________________________
Train on 7185 samples, validate on 1797 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [61]:
results = model.evaluate(x_test, y_test)
print("Test Accuracy:",results[1])

Test Accuracy: 0.5494211912155151
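
As a sanity check on the model summary above, the parameter counts can be reproduced by hand. The helper names are ours; the formulas are the standard ones (an embedding stores one vector per word, a dense layer stores one weight per input-unit pair plus one bias per unit).

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def dense_params(inputs, units):
    """A weight for every input-unit pair, plus one bias per unit."""
    return inputs * units + units


maxlen, embed_dim = 20, 8
counts = [
    embedding_params(10000, embed_dim),    # 80000
    dense_params(maxlen * embed_dim, 64),  # 10304 (Flatten yields 160 inputs)
    dense_params(64, 46),                  # 2990
]
print(counts, sum(counts))  # total 93294, matching the summary
```

Most of the parameters sit in the embedding layer, so a small training set can be memorized quickly.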


## Task 2: RNN/LSTM
Now we have a fully connected neural network trained to predict topics in the Reuters data. In this task, we will swap out the fully connected layers and replace them with more powerful RNN layers (LSTM, GRU). Try experimenting with different RNN layers and see if they can help improve the model performance.
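
For intuition about what a recurrent layer computes, here is a minimal NumPy sketch of a simple (Elman-style) RNN forward pass, the recurrence that a layer such as `SimpleRNN` applies. The weights are random, the sizes (20 time steps, 20-dimensional embeddings, 32 units) are chosen to match the models in this task, and the function name is our own.

```python
import numpy as np


def simple_rnn_last_state(inputs, kernel, recurrent_kernel, bias):
    """Run h_t = tanh(x_t W + h_{t-1} U + b) over a sequence.

    inputs has shape (timesteps, input_dim); the final hidden state
    of shape (units,) is returned.
    """
    h = np.zeros(recurrent_kernel.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ kernel + h @ recurrent_kernel + bias)
    return h


rng = np.random.default_rng(0)
timesteps, input_dim, units = 20, 20, 32
x = rng.normal(size=(timesteps, input_dim))         # one embedded newswire
W = rng.normal(scale=0.1, size=(input_dim, units))  # input weights
U = rng.normal(scale=0.1, size=(units, units))      # recurrent weights
b = np.zeros(units)

h_last = simple_rnn_last_state(x, W, U, b)
print(h_last.shape)  # (32,)
```

Unlike the flattened dense model, the same weights are reused at every time step, and only the final state is handed to the softmax classifier. LSTM and GRU layers follow the same pattern but add gates that control what the state keeps or forgets.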

In [67]:
# TODO

from keras.layers import LSTM
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model2 = Sequential()
model2.add(Embedding(max_features, 20))
model2.add(LSTM(32))
model2.add(Dense(46, activation='softmax'))

model2.summary()
model2.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
history2 = model2.fit(
    x_train, 
    y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

results = model2.evaluate(x_test, y_test)
print("Test Accuracy for model2:",results[1])

Model: "sequential_30"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_33 (Embedding)     (None, None, 20)          200000    
_________________________________________________________________
lstm_5 (LSTM)                (None, 32)                6784      
_________________________________________________________________
dense_48 (Dense)             (None, 46)                1518      
Total params: 208,302
Trainable params: 208,302
Non-trainable params: 0
_________________________________________________________________
Train on 7185 samples, validate on 1797 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy for model2: 0.544968843460083


In [68]:
model3 = Sequential()
model3.add(Embedding(max_features, 20))
model3.add(SimpleRNN(32))
model3.add(Dense(46, activation='softmax'))

model3.summary()
model3.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
history3 = model3.fit(
    x_train, 
    y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

results = model3.evaluate(x_test, y_test)
print("Test Accuracy for model3:",results[1])

Model: "sequential_31"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_34 (Embedding)     (None, None, 20)          200000    
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 32)                1696      
_________________________________________________________________
dense_49 (Dense)             (None, 46)                1518      
Total params: 203,214
Trainable params: 203,214
Non-trainable params: 0
_________________________________________________________________
Train on 7185 samples, validate on 1797 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy for model3: 0.5133571028709412


In [70]:
from keras.layers import GRU

model4 = Sequential()
model4.add(Embedding(max_features, 20))
model4.add(GRU(32))
model4.add(Dense(46, activation='softmax'))

model4.summary()
model4.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
history4 = model4.fit(
    x_train, 
    y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

results = model4.evaluate(x_test, y_test)
print("Test Accuracy for model4:",results[1])

Model: "sequential_33"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_36 (Embedding)     (None, None, 20)          200000    
_________________________________________________________________
gru_2 (GRU)                  (None, 32)                5088      
_________________________________________________________________
dense_51 (Dense)             (None, 46)                1518      
Total params: 206,606
Trainable params: 206,606
Non-trainable params: 0
_________________________________________________________________
Train on 7185 samples, validate on 1797 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy for model4: 0.5089046955108643
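
The parameter counts reported for the three recurrent layers follow one formula: each gate has an input kernel, a recurrent kernel and a bias. The sketch below reproduces the numbers in the summaries above; the helper name is ours. Newer TensorFlow versions may report a slightly larger GRU count, since their default GRU variant keeps an extra set of biases.

```python
def recurrent_params(input_dim, units, gates):
    """Parameters of a recurrent layer with `gates` weight sets."""
    per_gate = input_dim * units + units * units + units
    return gates * per_gate


input_dim, units = 20, 32  # embedding size and number of recurrent units

simple_rnn = recurrent_params(input_dim, units, gates=1)
gru = recurrent_params(input_dim, units, gates=3)
lstm = recurrent_params(input_dim, units, gates=4)

print("SimpleRNN:", simple_rnn)  # 1696
print("GRU:", gru)               # 5088
print("LSTM:", lstm)             # 6784
```

In all three models the recurrent layer is tiny compared with the 200,000-parameter embedding, so the choice of cell changes capacity only slightly.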


## Task 3: Use a Pre-trained Word Embedding

In this task, instead of learning the word embedding layer from scratch, we apply a pre-trained word embedding layer and train only the classifier on top of it for the Reuters data. Please refer to the [API](https://keras.io/examples/pretrained_word_embeddings/) for different pre-trained word embeddings.

Does the pre-trained word embedding help improve the model prediction?


In [None]:
# TODO


In [93]:
def loadGloveModel(gloveFile):
    """
    Loads GloVe Model
    
    Arguments:
    gloveFile -- path to the glove file

    Returns:
    model -- a word_to_vec_map, where keys are words, and values are vectors (represented by arrays)
    """
    
    print("Loading Glove Model")
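    model = {}
    # The with-block closes the file even if parsing fails; GloVe files are UTF-8 text.
    with open(gloveFile, 'r', encoding='utf-8') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding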
    print("Done.",len(model)," words loaded!")
    return model

def pretrained_embedding_layer(word_to_vec_map, word_to_wordidx):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe vectors (100-dimensional when glove.6B.100d.txt is used).
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_wordidx -- dictionary mapping from words to their indices in the vocabulary (here, the Reuters word index)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_wordidx) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # dimensionality of the GloVe word vectors (100 for glove.6B.100d.txt)
    
    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros(((vocab_len, emb_dim)))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_wordidx.items():
        if word in word_to_vec_map:
            emb_matrix[index, :] = word_to_vec_map[word]
        else:
            emb_matrix[index, :] = word_to_vec_map["random"]  # placeholder vector when the word is not in word_to_vec_map

    # Define the Keras embedding layer with the correct input/output sizes. Use Embedding(...) and set trainable=False so the pre-trained vectors stay frozen.
    embedding_layer = Embedding(input_dim=vocab_len, output_dim=emb_dim, trainable=False)
    ### END CODE HERE ###

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer
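
One detail worth checking when the word index comes from `reuters.get_word_index()`: by default `reuters.load_data` shifts every word id by 3 (`index_from=3`), while the word index itself is unshifted, so a matrix filled at row `index` may not line up with the ids in `x_train`. The sketch below shows one way to build an offset-aware matrix. It is an illustration under that assumption, not the function used in this notebook; `build_embedding_matrix` and the toy data are made up, and missing words are left as zero vectors here rather than mapped to the vector for "random".

```python
import numpy as np


def build_embedding_matrix(word_index, word_to_vec, emb_dim, num_words, index_from=3):
    """Row (rank + index_from) holds the vector of the word with that rank.

    Rows for reserved ids and for words without a pre-trained vector stay zero.
    """
    matrix = np.zeros((num_words, emb_dim))
    for word, rank in word_index.items():
        row = rank + index_from
        if row < num_words and word in word_to_vec:
            matrix[row] = word_to_vec[word]
    return matrix


toy_word_index = {"the": 1, "oil": 2}         # hypothetical ranks
toy_vectors = {"oil": np.array([0.5, -0.5])}  # hypothetical 2-d "GloVe" vectors

matrix = build_embedding_matrix(toy_word_index, toy_vectors, emb_dim=2, num_words=8)
print(matrix[5])  # vector for "oil": rank 2 + offset 3
```

A matrix built this way has `num_words` rows, so the `Embedding` layer's `input_dim` would need to match it.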

In [96]:
!wget --no-check-certificate \
    http://nlp.stanford.edu/data/glove.6B.zip \
    -O /tmp/glove.6B.zip

--2021-05-27 04:31:18--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-05-27 04:31:18--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-05-27 04:31:19--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘/tmp/glove.6B.zip’



In [100]:
word_to_vec_map = loadGloveModel('glove.6B.100d.txt')


Loading Glove Model
Done. 400000  words loaded!
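
For reference, each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory file to show the resulting dictionary; `parse_glove` is a hypothetical helper, and the numbers are not real GloVe values.

```python
import io

import numpy as np


def parse_glove(lines):
    """Turn 'word v1 v2 ...' lines into a dict of word -> float32 vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


toy_file = io.StringIO("oil 0.1 -0.2 0.3\nprice 0.4 0.5 -0.6\n")  # made-up values
vectors = parse_glove(toy_file)

print(len(vectors), vectors["oil"].shape)  # 2 (3,)
```

With `glove.6B.100d.txt`, every vector has 100 components, which is where the embedding dimension in the next model comes from.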


In [112]:
from keras.models import Sequential, Model
from keras.layers import Input, Flatten, Dense  # Input is needed below

sentence_indices = Input(shape=(maxlen,), dtype='int32')
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_index)
embeddings = embedding_layer(sentence_indices)

X = Flatten()(embeddings)

X = Dense(46, activation="softmax")(X)
model = Model(inputs = sentence_indices, outputs = X)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_11 (InputLayer)        (None, 20)                0         
_________________________________________________________________
embedding_46 (Embedding)     (None, 20, 100)           3098000   
_________________________________________________________________
flatten_20 (Flatten)         (None, 2000)              0         
_________________________________________________________________
dense_61 (Dense)             (None, 46)                92046     
Total params: 3,190,046
Trainable params: 92,046
Non-trainable params: 3,098,000
_________________________________________________________________


In [113]:

history = model.fit(
    x_train, 
    y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

results = model.evaluate(x_test, y_test)
print("Test Accuracy for model:",results[1])

Train on 7185 samples, validate on 1797 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy for model: 0.47818344831466675


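The pre-trained embedding did not improve accuracy (0.478 on the test set, versus 0.549 for the model in Task 1). Part of the reason could be that this model feeds the embeddings straight into a single softmax layer instead of an RNN or CNN. Another possible factor is index alignment: `reuters.load_data` shifts word ids by 3 by default (`index_from=3`), while `reuters.get_word_index` returns unshifted indices, so the rows of the embedding matrix may not match the word ids in `x_train`.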