### Question 1
Visualize the categories of your target variable and describe the dataset generally (the data consists of news articles from BBC News). A simple description is fine.

In [3]:
# importing necessary packages and data
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, train_test_split
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, SimpleRNN, LSTM, Conv1D, GlobalMaxPooling1D, Bidirectional
from keras.utils import to_categorical
df = pd.read_csv("https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv")

In [5]:
# Visualize the target variable. Note that the interactive Plotly graph does not render on the static GitHub page.
# To view the histogram properly, open the notebook locally in Jupyter.
fig = px.histogram(df, x="category", 
                   title='Univariate Exploratory Analysis',
                   labels={'category':'BBC News Article Category'},
                   color_discrete_sequence=['#330C73'])
fig.show()

In [12]:
word_counts = list()

for text in df.text:
    word_counts.append(len(text.split(" ")))

df["# of Words Per Article"] = word_counts
summary = df["# of Words Per Article"].describe()

print(df.head())
print("\n\nSummary of dataframe:\n\n",summary)

        category                                               text  \
0           tech  tv future in the hands of viewers with home th...   
1       business  worldcom boss  left books alone  former worldc...   
2          sport  tigers wary of farrell  gamble  leicester say ...   
3          sport  yeading face newcastle in fa cup premiership s...   
4  entertainment  ocean s twelve raids box office ocean s twelve...   

   # of Words Per Article  
0                     806  
1                     332  
2                     270  
3                     390  
4                     287  


Summary of dataframe:

 count    2225.000000
mean      419.757303
std       260.055935
min        94.000000
25%       268.000000
50%       361.000000
75%       514.000000
max      4759.000000
Name: # of Words Per Article, dtype: float64


From the histogram above, the dataset contains more articles in the business and sport categories than in the others. Surprisingly, entertainment has the lowest article count. For the exact counts, hover over the graph, as it is interactive. The shortest article in the dataframe has 94 words, whereas the longest has 4,759 words. The mean is about 420 words per article and the median is 361, so a few very long articles skew the distribution to the right.
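
One caveat on the word counts: the cell above uses `text.split(" ")`, and the article previews contain double spaces (for example `worldcom boss  left books alone`). A minimal sketch of the difference, using that preview fragment as an illustrative string rather than a full article:

```python
# Illustrative fragment with a double space, copied from the preview above.
sample = "worldcom boss  left books alone"

tokens_single_space = sample.split(" ")  # splits on every single space
tokens_whitespace = sample.split()       # splits on runs of whitespace

print(tokens_single_space)       # contains one empty string from the double space
print(len(tokens_single_space))  # 6
print(len(tokens_whitespace))    # 5
```

So the reported counts are probably slightly higher than the true number of words whenever an article contains repeated spaces.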

### Question 2
Preprocess your data such that each document in the data is represented as a sequence of equal length.

In [29]:
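# Every sequence is fixed at 94 tokens, the minimum article length found above, so that all inputs have equal length.
# Note: pad_sequences truncates from the front by default (truncating='pre'), so the LAST 94 tokens of each article are kept.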
maxlen = 94
# We will only consider the top 10,000 words in the dataset
max_words = 10000 

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df.text)
sequences = tokenizer.texts_to_sequences(df.text) # converts words in each text to each word's numeric index in tokenizer dictionary.

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)
labels = preprocessing.LabelEncoder()
labels = labels.fit_transform(df.category)
labels = np.asarray(labels)
labels = to_categorical(labels)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Found 29726 unique tokens.
Shape of data tensor: (2225, 94)
Shape of label tensor: (2225, 5)
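
To make the truncation behavior concrete, here is a small pure-Python sketch of what `pad_sequences` does to a single sequence under its defaults (`padding='pre'`, `truncating='pre'`). The helper names are hypothetical and exist only for this illustration; they are not part of Keras.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation for one sequence."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]                    # drop tokens from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front

def truncate_post(seq, maxlen):
    """Alternative: keep the first maxlen tokens (like truncating='post')."""
    return seq[:maxlen]

long_article = [1, 2, 3, 4, 5, 6]
short_article = [7, 8]

print(pad_or_truncate_pre(long_article, 4))   # [3, 4, 5, 6]
print(pad_or_truncate_pre(short_article, 4))  # [0, 0, 7, 8]
print(truncate_post(long_article, 4))         # [1, 2, 3, 4]
```

If the opening of each article is more informative than its ending, passing `truncating='post'` to `pad_sequences` would keep the first 94 tokens instead.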


In [30]:
# Split the data into a training set and a test set (80/20), in accordance with the homework instructions
X_train,X_test,y_train,y_test=train_test_split(data,labels,test_size=0.20,random_state=42)
print('Shape of training set:', X_train.shape)
print('Shape of testing set:', X_test.shape)

Shape of training set: (1780, 94)
Shape of testing set: (445, 94)
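
The sample counts that appear in the training logs below follow from two successive 80/20 splits. A quick arithmetic sketch (integer math only, no Keras needed):

```python
n_total = 2225

# train_test_split(test_size=0.20) rounds the test size up.
n_test = -(-n_total * 20 // 100)   # ceiling division -> 445
n_train = n_total - n_test         # 1780

# validation_split=0.2 in model.fit holds out the last 20% of the training arrays.
n_fit = n_train * 80 // 100        # 1424 samples actually used for fitting
n_val = n_train - n_fit            # 356 samples used for validation

print(n_train, n_test)  # 1780 445
print(n_fit, n_val)     # 1424 356
```

Note that Keras takes the validation samples from the end of `X_train` before any shuffling, so each model is effectively fitted on 1,424 of the 2,225 articles.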


### Question 3
Use the data to fit separate models to each of the following architectures

#### A: A model with an embedding layer and dense layers (but w/ no layers meant for sequential data)

In [14]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Flatten())
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 94, 32)            320000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 3008)              0         
_________________________________________________________________
dense_7 (Dense)              (None, 5)                 15045     
Total params: 335,045
Trainable params: 335,045
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.6078389563587274
Test accuracy: 0.8089887655183171
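
As a sanity check, the parameter counts in the summary above can be reproduced by hand. This is a small sketch using the layer sizes from the model definition:

```python
vocab_size, embed_dim, seq_len, n_classes = 10000, 32, 94, 5

embedding_params = vocab_size * embed_dim             # one 32-d vector per word
flatten_units = seq_len * embed_dim                   # 94 positions x 32 features
dense_params = flatten_units * n_classes + n_classes  # weights + biases

print(embedding_params)                 # 320000
print(flatten_units)                    # 3008
print(dense_params)                     # 15045
print(embedding_params + dense_params)  # 335045
```

Almost all of the parameters sit in the embedding layer, which is worth keeping in mind given that only 1,424 articles are used for fitting.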


#### B. A model using an Embedding layer with Conv1d Layers

In [28]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Conv1D(filters= 32, kernel_size=5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 94, 32)            320000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 90, 32)            5152      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_12 (Dense)             (None, 5)                 165       
Total params: 325,317
Trainable params: 325,317
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.5710740325156223
Test accuracy: 0.8247191001859944
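
The Conv1D output shape and parameter count in the summary above can be checked the same way. The sketch assumes the Keras defaults used in the cell (`padding='valid'`, stride 1):

```python
seq_len, embed_dim = 94, 32
filters, kernel_size, n_classes = 32, 5, 5

# With 'valid' padding and stride 1, each filter slides over seq_len - kernel_size + 1 positions.
conv_output_len = seq_len - kernel_size + 1
# Each filter has kernel_size x embed_dim weights plus one bias.
conv_params = kernel_size * embed_dim * filters + filters
# GlobalMaxPooling1D keeps one value per filter, so the dense layer sees 32 inputs.
dense_params = filters * n_classes + n_classes

print(conv_output_len)  # 90
print(conv_params)      # 5152
print(dense_params)     # 165
```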


#### C. A model using an Embedding layer with one sequential layer (LSTM or GRU)

In [31]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)





_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 94, 32)            320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 165       
Total params: 328,485
Trainable params: 328,485
Non-trainable params: 0
_________________________________________________________________
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Train on 1424 samples, validate on 356 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.657381586479337
Test accuracy: 0.8157303373465378
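
The 8,320 parameters reported for the LSTM layer come from its four gates, each with input weights, recurrent weights, and a bias. A minimal sketch of that formula (the function name is hypothetical, introduced only for this check):

```python
def lstm_params(units, input_dim):
    """Parameter count of a standard LSTM layer: 4 gates, each with
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

single_lstm = lstm_params(units=32, input_dim=32)
dense = 32 * 5 + 5

print(single_lstm)                   # 8320
print(320000 + single_lstm + dense)  # 328485
```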


#### D. A model using an Embedding layer with stacked sequential layers (LSTM or GRU)

In [16]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(32))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 94, 32)            320000    
_________________________________________________________________
lstm_7 (LSTM)                (None, 94, 32)            8320      
_________________________________________________________________
lstm_8 (LSTM)                (None, 94, 32)            8320      
_________________________________________________________________
lstm_9 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_9 (Dense)              (None, 5)                 165       
Total params: 345,125
Trainable params: 345,125
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
E

#### E. A model using an Embedding layer with bidirectional sequential layers

In [18]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 94, 32)            320000    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 64)                16640     
_________________________________________________________________
dense_11 (Dense)             (None, 5)                 325       
Total params: 336,965
Trainable params: 336,965
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.5789619089512342
Test accuracy: 0.8471910102983539
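
The stacked and bidirectional summaries above follow from the same per-layer LSTM formula. The sketch below (with a hypothetical helper name) reproduces both totals:

```python
def lstm_params(units, input_dim):
    """Parameter count of a standard LSTM layer (4 gates)."""
    return 4 * (units * (input_dim + units) + units)

embedding = 10000 * 32

# Model D: three stacked LSTM(32) layers, each receiving 32 features per step.
stacked_total = embedding + 3 * lstm_params(32, 32) + (32 * 5 + 5)

# Model E: a forward and a backward LSTM(32); their outputs are concatenated (64 features).
bidirectional = 2 * lstm_params(32, 32)
bidirectional_total = embedding + bidirectional + (64 * 5 + 5)

print(stacked_total)        # 345125
print(bidirectional)        # 16640
print(bidirectional_total)  # 336965
```

The bidirectional model therefore has fewer parameters than the stacked one, even though it scored higher here.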


#### F. Now retrain your best model from C, D, and E using dropout (you may need to increase epochs!).

In [19]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 94, 32)            320000    
_________________________________________________________________
lstm_12 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_12 (Dense)             (None, 5)                 165       
Total params: 328,485
Trainable params: 328,485
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 2

In [20]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 94, 32)            320000    
_________________________________________________________________
lstm_13 (LSTM)               (None, 94, 32)            8320      
_________________________________________________________________
lstm_14 (LSTM)               (None, 94, 32)            8320      
_________________________________________________________________
lstm_15 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_13 (Dense)             (None, 5)                 165       
Total params: 345,125
Trainable params: 345,125
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
E

In [21]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_14 (Embedding)     (None, 94, 32)            320000    
_________________________________________________________________
bidirectional_4 (Bidirection (None, 64)                16640     
_________________________________________________________________
dense_14 (Dense)             (None, 5)                 325       
Total params: 336,965
Trainable params: 336,965
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Test score: 0.517785180000107
Test accuracy: 0.869662922821687

### Question 4
Discuss 1) which model(s) performed best and speculate about 2) how you might try to further improve the predictive power of your model (e.g. Glove embeddings? More layers? Combining Conv1D with LSTM layers? More LSTM hidden nodes?)

Model A: embedding layer and dense layer <br>
Test Score: 0.608 <br>
Test Accuracy: 0.809 <br>

Model B: embedding layer with conv1d layers <br>
Test Score: 0.571 <br>
Test Accuracy: 0.825 <br>

Model C: embedding layer with one sequential layer <br>
Test Score: 0.657 <br>
Test Accuracy: 0.816 <br>

Model D: embedding layer with stacked sequential layers<br>
Test Score: 0.845 <br>
Test Accuracy: 0.768 <br>

Model E: embedding layer with bidirectional sequential layers <br>
Test Score: 0.579 <br>
Test Accuracy: 0.847 <br>

Part F) <br>
Model C: embedding layer with one sequential layer with dropout <br>
Test Score: 0.581 <br>
Test Accuracy: 0.820 <br>

Model D: embedding layer with stacked sequential layers with dropout <br>
Test Score: 1.063 <br>
Test Accuracy: 0.773 <br>

Model E: embedding layer with bidirectional sequential layers with dropout <br>
Test Score: 0.518 <br>
Test Accuracy: 0.870 <br>

From the results above, the last model (embedding layer with a bidirectional sequential layer and dropout) is the best, with a test accuracy of 0.870 and the lowest test loss (0.518), which is a good starting point. To improve the predictive power of the model, I first experimented with the parameters of the embedding layer, specifically the number of features the embedding layer learns for each word. Then, I experimented with the number of units in the bidirectional sequential layer. Next, I tried a combination of Conv1D and LSTM layers. Finally, I used a matrix of pretrained GloVe embeddings and loaded it as the weights of the embedding layer of the previous best model. These results are shown in the next section, followed by a brief discussion at the end.
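
For convenience, the comparison can also be done programmatically. The sketch below simply copies the test losses and accuracies listed above into a dictionary (the short model labels are my own) and ranks them:

```python
# (test loss, test accuracy) copied from the results listed above.
results = {
    "A: dense": (0.608, 0.809),
    "B: Conv1D": (0.571, 0.825),
    "C: LSTM": (0.657, 0.816),
    "D: stacked LSTM": (0.845, 0.768),
    "E: bidirectional LSTM": (0.579, 0.847),
    "C + dropout": (0.581, 0.820),
    "D + dropout": (1.063, 0.773),
    "E + dropout": (0.518, 0.870),
}

# Rank by accuracy, highest first.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for name, (loss, acc) in ranked:
    print(f"{name:<24} loss={loss:.3f} acc={acc:.3f}")

best_by_acc = ranked[0][0]
best_by_loss = min(results, key=lambda name: results[name][0])
print(best_by_acc, best_by_loss)  # E + dropout E + dropout
```

Since each model was trained once with a single split, small differences between neighbors in this ranking may not be meaningful.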

#### Hyperparameter Tuning

In [9]:
model = Sequential()
model.add(Embedding(10000, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 94, 128)           1280000   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 64)                41216     
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 325       
Total params: 1,321,541
Trainable params: 1,321,541
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Test score: 0.4845955259344551
Test accuracy: 0.8651685383

In [10]:
model = Sequential()
model.add(Embedding(10000, 512, input_length=maxlen))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 94, 512)           5120000   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 64)                139520    
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 325       
Total params: 5,259,845
Trainable params: 5,259,845
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Test score: 0.3810089057750916
Test accuracy: 0.8876404497

In [25]:
model = Sequential()
model.add(Embedding(10000, 1024, input_length=maxlen))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 94, 1024)          10240000  
_________________________________________________________________
bidirectional_9 (Bidirection (None, 64)                270592    
_________________________________________________________________
dense_9 (Dense)              (None, 5)                 325       
Total params: 10,510,917
Trainable params: 10,510,917
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Test score: 0.4227927507979147
Test accuracy: 0.87865168

In [26]:
model = Sequential()
model.add(Embedding(10000, 512, input_length=maxlen))
model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=15,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 94, 512)           5120000   
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 256)               656384    
_________________________________________________________________
dense_10 (Dense)             (None, 5)                 1285      
Total params: 5,777,669
Trainable params: 5,777,669
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 0.3995294874638654
Test accuracy: 0.8876404497060882


In [39]:
model = Sequential()
model.add(Embedding(10000, 512, input_length=maxlen))
model.add(Conv1D(filters= 128, kernel_size=5, activation='relu'))
model.add(Bidirectional(LSTM(128, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)

score, acc = model.evaluate(X_test, y_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_25 (Embedding)     (None, 94, 512)           5120000   
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 90, 128)           327808    
_________________________________________________________________
bidirectional_17 (Bidirectio (None, 256)               263168    
_________________________________________________________________
dense_15 (Dense)             (None, 5)                 1285      
Total params: 5,712,261
Trainable params: 5,712,261
Non-trainable params: 0
_________________________________________________________________
Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/2

#### GloVe Embeddings

In [18]:
# Extract embedding data for 100 feature embedding matrix
glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds a word followed by its 100 embedding coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [19]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
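
Before training on the matrix, it can be useful to check how many of the top `max_words` tokens actually have a GloVe vector, since the rest stay all-zero. A minimal sketch with toy stand-ins for `word_index` and `embeddings_index` (the words and vectors below are made up for illustration):

```python
# Toy stand-ins; in the notebook these come from the Tokenizer and the GloVe file.
max_words = 5
word_index = {"the": 1, "market": 2, "goal": 3, "xyzzy": 4, "film": 5, "rare": 6}
embeddings_index = {
    "the": [0.1, 0.2],
    "market": [0.3, 0.4],
    "goal": [0.5, 0.6],
    "film": [0.7, 0.8],
}

# Only indices below max_words get a row in the embedding matrix.
eligible = [word for word, i in word_index.items() if i < max_words]
missing = [word for word in eligible if word not in embeddings_index]
coverage = 1 - len(missing) / len(eligible)

print(eligible)  # ['the', 'market', 'goal', 'xyzzy']
print(missing)   # ['xyzzy']
print(coverage)  # 0.75
```

Applied to the real `word_index` and `embeddings_index`, the same three lines would report the share of the vocabulary that benefits from the pretrained vectors.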

In [27]:
# Set up model architecture from previous best model
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 94, 100)           1000000   
_________________________________________________________________
bidirectional_6 (Bidirection (None, 64)                34048     
_________________________________________________________________
dense_16 (Dense)             (None, 5)                 325       
Total params: 1,034,373
Trainable params: 1,034,373
Non-trainable params: 0
_________________________________________________________________


In [28]:
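# Load the GloVe weights in the same manner as transfer learning, then turn off the trainable option to freeze them before fitting.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# Caution: for a 5-class softmax output, categorical_crossentropy is the matching loss.
# With binary_crossentropy, Keras 2.x resolves 'acc' to element-wise binary accuracy,
# so the accuracy from this model is not comparable with the earlier models.
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])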
history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
model.save_weights('pre_trained_glove_model.h5')


Train on 1424 samples, validate on 356 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
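# Load the GloVe weights in the same manner as transfer learning, then turn off the trainable option to freeze them before fitting.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# Caution: binary_crossentropy does not match a 5-class softmax output, and with this loss
# Keras 2.x reports element-wise binary accuracy for 'acc' (not comparable with the earlier models).
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])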
history = model.fit(X_train, y_train,
                    epochs=25,
                    batch_size=32,
                    validation_split=0.2)
model.save_weights('pre_trained_glove_model_V2.h5')


Train on 1424 samples, validate on 356 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [22]:
model.load_weights('pre_trained_glove_model_V2.h5')
model.evaluate(X_test, y_test)



[0.08698486754398667, 0.9712359526184168]

### Question 4 Continued

Tuning the Number of Embedding Features of the Best Model<br>

Number of Embedding Features: 128 <br> 
Test Score: 0.485 <br>
Test Accuracy: 0.865 <br>

Number of Embedding Features: 512 <br> 
Test Score: 0.381 <br>
Test Accuracy: 0.888 <br>

Number of Embedding Features: 1024 <br>
Test Score: 0.423 <br>
Test Accuracy: 0.879 <br>

Tuning the Number of LSTM Units of the Best Model<br>

Number of Units: 32 <br> 
Test Score: 0.381 <br>
Test Accuracy: 0.888 <br>

Number of Units: 128 <br> 
Test Score: 0.399 <br>
Test Accuracy: 0.888 <br>

Implementing Conv1D with LSTM layers<br>
Test Score: 0.690 <br>
Test Accuracy: 0.840 <br>

It is generally better to use a greater number of embedding features because it allows the model to extract more meaning from every word, as Professor Parrott stated. However, as seen when the number of features increased from 512 to 1024, there is a point where simply adding more features no longer improves model performance. Instead, it may hinder the efficiency of the model by introducing unnecessary complexity. Next, increasing the number of units in the LSTM layer from 32 to 128 left the test accuracy unchanged at 0.888 (note that the 128-unit model was trained for 15 epochs rather than 25). Finally, adding a Conv1D layer in front of the LSTM actually hurt performance, lowering the test accuracy to 0.840.

With GloVe Embeddings:<br>
Test Score: 0.087 <br>
Test Accuracy: 0.971 <br>

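Using the pretrained GloVe embedding weights (sourced from Stanford), I was able to improve the performance of my model to a substantially lower loss and higher accuracy. This makes sense, as the GloVe embeddings were trained on a large corpus. One caveat: this model was compiled with binary_crossentropy rather than categorical_crossentropy, and with that loss Keras 2.x reports element-wise binary accuracy for 'acc', so the 0.087 loss and 0.971 accuracy are not directly comparable with the earlier models and likely overstate the improvement. With that caveat in mind, the combination of GloVe embeddings and a Bidirectional LSTM layer has the best reported scores of all the models.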
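
The GloVe model above was compiled with `loss='binary_crossentropy'`, whereas the earlier models used `categorical_crossentropy`. In Keras 2.x this also changes what `'acc'` measures: each of the five outputs is thresholded at 0.5 and scored separately. The pure-Python sketch below (helper names are hypothetical, and the predictions are made-up examples) shows how that can report a high accuracy even when the predicted class is wrong:

```python
def categorical_accuracy(y_true, y_pred):
    """Share of samples whose highest-probability class matches the true class."""
    hits = [t.index(max(t)) == p.index(max(p)) for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of individual outputs that match after thresholding each one."""
    matches = [
        int(p_i > threshold) == t_i
        for t, p in zip(y_true, y_pred)
        for t_i, p_i in zip(t, p)
    ]
    return sum(matches) / len(matches)

# Two made-up one-hot labels and softmax outputs: the first prediction is wrong.
y_true = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
y_pred = [[0.1, 0.6, 0.1, 0.1, 0.1], [0.2, 0.7, 0.05, 0.03, 0.02]]

print(categorical_accuracy(y_true, y_pred))  # 0.5
print(binary_accuracy(y_true, y_pred))       # 0.8
```

A fully misclassified article still gets three of its five outputs "right" under binary accuracy, so re-evaluating the GloVe model with `categorical_crossentropy` and categorical accuracy would give a fairer comparison with the other models.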

Link to Github: https://github.com/JaeWHam/Advanced_ML_GR5074/blob/master/HW3/GR5074_Assignment3_JaeHam.ipynb