# Assignment 8

In this assignment, you will build a multi-class recurrent neural network classifier for text classification. You will first need to import the libraries. Then you will need to pre-process your data by removing stop words and stemming. After cleaning the data, you will download a pretrained word embedding and use the embedding to give each word a vector. The vectors will be the features of your classifier. You will split your data into training (80%) and validation (20%). Then, you will train your neural network and find the best model using random search, and test it and your testing data.

## Import Libraries

In [1]:
'''implement your code'''
import keras
import sklearn
import csv

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


## Load Data

In [2]:
'''implement your code'''
tid = []
tweets = []
labels = []
with open('training.csv', encoding='latin1', newline='') as csvfile:
    for rec in csv.reader(csvfile, delimiter=','):
        tid.append(int(rec[0][1:-1]))
        tweets.append(rec[1])
        if(rec[2] == 'caution and advice'):
            labels.append(0)
        elif(rec[2] == 'infromation source'):
            labels.append(1)
        elif(rec[2] == 'casualities and damage'):
            labels.append(-1)
        else:
            labels.append(2)

## Clean Data

 Pre-process your data by removing stop words and perform stemming.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
import numpy as np

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)
sequences = tokenizer.texts_to_sequences(tweets)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

Found 4427 unique tokens.
Shape of data tensor: (1059, 45)
Shape of label tensor: (1059, 3)


## Word Embedding

Download the word embedding from this link http://nlp.stanford.edu/data/glove.twitter.27B.zip and create the embedding matrix to be used in the embedding layer. You have to use the embedding file of dimension 50.

In [4]:
embeddings_index = {}
f = open('glove.twitter.27B.50d.txt', encoding = 'utf8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
embedding_matrix = np.zeros((len(word_index) + 1, 50))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

Found 1193514 word vectors.


## Split Data

Split your data into training (80%) and validation (20%).

In [5]:
nb_validation_samples = int(0.8 * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]

## Exercise 1

You will train a recurrent neural network for 100 epochs with a batch size of 32 without doing any hyperameters tuning.

The architecture should be as follow:
- One embedding layer ( You don't need to retrain the embeddings. You have to use the pretrained embeddings)
- One LSTM layer with 200 units
- One Dense Layer with 100 units
- One output layer
- The activation of the Dense layer is a Relu
- The activation of the output layer is a Softmax
- The loss function is a categorical cross-entropy funtion
- The optimizer of this model is RMSProp

### Create Model

Create the above neural network architecture.

In [6]:
'''implement your code'''
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.embeddings import Embedding

model = Sequential()
model.add(Embedding(len(word_index) + 1, 
                    50, 
                    weights = [embedding_matrix], 
                    input_length=45, 
                    trainable = False))
model.add(LSTM(200))
model.add(Dense(100, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop', 
              metrics=['accuracy'])

### Training

Train your model on the training dataset.

In [7]:
'''implement your code'''
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, batch_size=32)

Train on 212 samples, validate on 847 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
E

### Validation

Test your model on the validation and compute the F-measure and accuracy.

In [8]:
'''implement your code'''
pred = model.predict(x_val)
loss, accuracy = model.evaluate(x_val, y_val, verbose=0)
print("Accuracy = ", accuracy)

Accuracy =  0.7012987015801873


## Exercise 2

### Random Search

Write the random search function. You will use the random search method in exercise 3 to find the best hyperparameters.

In [9]:
'''implement your code'''
from sklearn.model_selection import RandomizedSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.embeddings import Embedding

def createmodel():
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        50,
                        weights = [embedding_matrix],
                        input_length=45,
                        trainable = False))
    model.add(LSTM(200))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model

def randomsearch(param_grid):
    clf = KerasClassifier(build_fn=createmodel, epochs=100, batch_size=32, verbose=0)
    rs = RandomizedSearchCV(estimator = clf, param_distributions = param_grid, n_jobs=-1)
    return rs

## Exercise 3

### Hyperparameters Tuning

You will tune the hyperparameters of the above architecture using random search by validating on the validation dataset.

Plot the learning curve of the best model (loss versus number
of epochs). You should show both the training loss and the validation loss.

You should also report the values of the hyperparameters of your best model and the validation accuracy and F-measure.  

The hyperparameters that need to be tuned are:
- Learning rates
- Dropout
- Number of hidden units
- Mini-batch size
- Learning rate decay
- Number of layers
- Type of layers

In [None]:
'''implement your code'''
from sklearn import preprocessing

learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
dropout_rate = [0.0, 0.1]
param_grid = dict(learn_rate=learn_rate, dropout_rate=dropout_rate)

normalized_X = preprocessing.normalize(x_train)
normalized_y = preprocessing.normalize(y_train)

rs = randomsearch(param_grid)
rs.fit(normalized_X, normalized_y)

In [None]:
from sklearn import metrics
pred = rs.best_estimator_.predict(x_val)
accuracy = sklearn.metrics.accuracy_score(y_val, pred)
fscore = sklearn.metrics.f1_score(y_val, pred, average='macro')
print("Accuracy = ", accuracy)
print("F score = ", fscore)

### Testing

Test your best model on the testing data, and report the F-measure and accuracy.

In [None]:
'''implement your code'''
from sklearn import metrics
tid = []
tweet = []
label = []
with open('test.csv', encoding='latin1', newline='') as csvfile:
    for rec in csv.reader(csvfile, delimiter=','):
        tid.append(int(rec[0][1:-1]))
        tweet.append(rec[1])
        if(rec[2] == 'caution and advice'):
            label.append(0)
        elif(rec[2] == 'infromation source'):
            label.append(1)
        elif(rec[2] == 'casualities and damage'):
            label.append(-1)
        else:
            label.append(2)
            
tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweet)
sequences = tokenizer.texts_to_sequences(tweet)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences)

label = to_categorical(np.asarray(label))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', label.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
label = label[indices]
        
pred = rs.best_estimator_.predict(data)
accuracy = sklearn.metrics.accuracy_score(label, pred)
fscore = sklearn.metrics.f1_score(label, pred, average='macro')
print("Accuracy = ", accuracy)
print("F score = ", fscore)

Rename the jupyter notebook to Assignment8_*netid*.ipynb (Assignment8_xyz01.ipynb) and upload it on Moodle no later than Wednesday, Nov 28 11:55 pm.