# Lab 9: RNNs

Authors: Omar, Rupal, Van

### Context

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam. The data was originally used to detect SMS spam.

### Content

The files contain one message per line. Each line has two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

### Acknowledgements

The original dataset can be found here. The creators ask that, if you find the dataset useful, you reference their previous paper and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

### Classification problem

Can we use this dataset to build a prediction model that will accurately classify which texts are spam?


## Data preprocessing

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('spam.csv', encoding='latin1')
# drop blank columns
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])

In [2]:
# v1 is the target, v2 is the text
# map v1 from ham/spam to binary 0/1 (ham -> 0, anything else -> 1)
# a vectorized assignment avoids pandas chained-assignment problems
df['v1'] = (df['v1'] != 'ham').astype(int)

In [3]:
X = df['v2']
y = df['v1']
# still need to encode y for Keras
print(X.shape, y.shape)

(5572,) (5572,)


In [4]:
# longest message, measured in characters
max_text = max(X, key=len)
max_text_length = len(max_text)
print(max_text_length, max_text)

910 For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later..


_____

# RNNs with Pre-processing in Keras

## Tokenizing the SMS messages
Next we convert the messages into the integer-sequence format that Keras expects. Unlike the tf-idf and bag-of-words models we created a while back, this representation keeps the sequence of words, which the RNN uses for classification.

In [5]:
%%time
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

NUM_TOP_WORDS = None
MAX_ART_LEN = max_text_length # longest message in characters, so an upper bound on its number of words

tokenizer = Tokenizer(num_words=NUM_TOP_WORDS)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)

word_index = tokenizer.word_index
NUM_TOP_WORDS = len(word_index) if NUM_TOP_WORDS is None else NUM_TOP_WORDS
top_words = min((len(word_index),NUM_TOP_WORDS))
print('Found %s unique tokens. Distilled to %d top words.' % (len(word_index),top_words))

X = pad_sequences(sequences, maxlen=MAX_ART_LEN)

y_ohe = keras.utils.to_categorical(y)
print('Shape of data tensor:', X.shape)
print('Shape of label tensor:', y_ohe.shape)
print(np.max(X))

Using TensorFlow backend.


Found 8920 unique tokens. Distilled to 8920 top words.
Shape of data tensor: (5572, 910)
Shape of label tensor: (5572, 2)
8920
CPU times: user 2.53 s, sys: 338 ms, total: 2.87 s
Wall time: 11.7 s


In [6]:
print(type(X))
print(X.shape)
print(type(y_ohe))
print(y_ohe.shape)
print(type(y))
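# as_matrix was never called here (and is removed in newer pandas); to_numpy() returns the array
y_matrix = y.to_numpy()
print(type(y_matrix))

<class 'numpy.ndarray'>
(5572, 910)
<class 'numpy.ndarray'>
(5572, 2)
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>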


So that's it! The representation is now:
- each word is converted to an integer 
- each message is a series of integers that represent the correct ordering of words
- the target is one hot encoded
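
To make the representation concrete, here is a minimal pure-Python sketch of the same idea on three made-up messages. It only approximates the Keras `Tokenizer` (which also strips punctuation) and `pad_sequences` (which pads and truncates at the front by default); the helper names `fit_word_index` and `to_padded_sequences` are hypothetical and not part of Keras.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding, so real words start at 1.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, maxlen):
    """Turn texts into equal-length integer lists, left-padded with zeros."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[-maxlen:]  # keep only the last maxlen tokens
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

# Made-up messages, for illustration only.
toy_texts = ["free prize now", "call me now", "now now free"]
toy_index = fit_word_index(toy_texts)
toy_padded = to_padded_sequences(toy_texts, toy_index, maxlen=4)
print(toy_index)
print(toy_padded)
```

Every row has the same length, word order is preserved, and the zeros at the front are padding rather than words.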

___

## Splitting the Data

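Stratified 10-fold cross-validation keeps the ham/spam ratio the same in every fold. It is a good choice when the models train relatively fast, because it means training one model per fold for every architecture. The cell for it is left commented out below; the models in this lab are evaluated on a single stratified 80/20 train/test split instead.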

In [7]:
# from sklearn.model_selection import StratifiedKFold
# skf = StratifiedKFold(n_splits=10, shuffle=True)
# skf.get_n_splits(X, y)
# print(skf)  
# # for train_index, test_index in skf.split(X, y):
# #     print("TRAIN:", train_index, "TEST:", test_index)
# #     X_train, X_test = X[train_index], X[test_index]
# #     y_train, y_test = y[train_index], y[test_index]
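
What stratification does can be sketched without scikit-learn. The function below is a simplified stand-in for `StratifiedKFold` (no shuffling, hypothetical name `stratified_folds`): it deals the indices of each class round-robin across the folds so that every fold keeps roughly the same class ratio.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so class ratios stay roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels: 20 ham (0) and 5 spam (1), made up for illustration.
toy_labels = [0] * 20 + [1] * 5
toy_folds = stratified_folds(toy_labels, n_splits=5)
for fold in toy_folds:
    spam_count = sum(toy_labels[i] for i in fold)
    print(len(fold), spam_count)
```

With an imbalanced target like ours, a plain random split could leave a fold with very few spam messages; stratifying avoids that.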

In [8]:
from sklearn.model_selection import train_test_split
# Split it into train / test subsets
X_train, X_test, y_train_ohe, y_test_ohe = train_test_split(X, y_ohe, test_size=0.2,
                                                            stratify=y, 
                                                            random_state=42)
NUM_CLASSES = 2
print(X_train.shape,y_train_ohe.shape)
print(np.sum(y_train_ohe,axis=0))

(4457, 910) (4457, 2)
[ 3859.   598.]


## Loading the embedding
Training the network is going to be a more involved process. We may be able to speed it up by loading a pre-trained embedding of the words.

Let's use the GloVe word embedding in keras. We will follow the example at:
- https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

You can download the nearly 1GB pretrained embeddings here:
- http://nlp.stanford.edu/data/glove.6B.zip

Let's read the file and build the embedding matrix from it:

In [9]:
%%time
EMBED_SIZE = 100
# the embed size should match the file you load glove from
embeddings_index = {}
# save key/array pairs of the embeddings
#  the key of the dictionary is the word, the array is the embedding
# the with-block closes the file even if parsing fails
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# now fill in the matrix, using the ordering from the
#  keras word tokenizer from before
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_SIZE))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)

Found 400000 word vectors.
(8921, 100)
CPU times: user 14.7 s, sys: 353 ms, total: 15 s
Wall time: 15.4 s


## Metrics

We decided to use a custom mix of the precision score and the recall score.

Classifying a spam message as not spam is mildly irritating to the user. Classifying a legitimate message as spam is a serious problem: not many users check their spam folder, even when they are expecting an important e-mail. We considered the F1 score, but it weights precision and recall equally, and for this problem precision matters more than recall. Our custom scorer is the following:

$$ 0.9 \cdot \mathrm{Precision} + 0.1 \cdot \mathrm{Recall} $$

In other words, precision is weighted nine times as heavily as recall.

A confusion matrix also gives a complete picture of a binary classifier's predictions, so we include it in the evaluation of every model.
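
A minimal sketch of the weighted score, assuming spam (label 1) is the positive class and that the score is computed on plain label lists after prediction. The function name `weighted_precision_recall` and the toy labels are hypothetical; `precision_score` and `recall_score` from `sklearn.metrics` could supply the two ingredients instead of the manual counts.

```python
def weighted_precision_recall(y_true, y_pred, precision_weight=0.9):
    """Return precision_weight * precision + (1 - precision_weight) * recall for label 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision_weight * precision + (1 - precision_weight) * recall

# Made-up labels: 4 spam and 4 ham messages.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0]
toy_score = weighted_precision_recall(toy_true, toy_pred)
print(round(toy_score, 3))
```

In the toy data, one ham message flagged as spam pulls precision down to 2/3, and that term dominates the score, while the two missed spam messages only affect the lightly weighted recall term.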

In [10]:
import itertools
import matplotlib.pyplot as plt
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # normalize before plotting so the image and the printed matrix agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [11]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBED_SIZE,
                            weights=[embedding_matrix],
                            input_length=MAX_ART_LEN,
                            trainable=False)

### LSTM RNN

In [20]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM


rnn_lstm = Sequential()
rnn_lstm.add(embedding_layer)
rnn_lstm.add(LSTM(100,dropout=0.2, recurrent_dropout=0.2))
rnn_lstm.add(Dense(NUM_CLASSES, activation='sigmoid'))
# the custom `score` function expects arrays, not tensors, so it cannot be
# passed as a Keras metric; compute it on the predictions after training
rnn_lstm.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
print(rnn_lstm.summary())

In [15]:
rnn_lstm.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

Train on 4457 samples, validate on 1115 samples
Epoch 1/3
 832/4457 [====>.........................] - ETA: 211s - loss: 0.3861 - acc: 0.8377

KeyboardInterrupt: 

In [16]:
y_pred = rnn_lstm.predict(X_test)
y_pred

array([[ 0.98808986,  0.01594687],
       [ 0.97222602,  0.02793487],
       [ 0.9792794 ,  0.01926178],
       ..., 
       [ 0.98612761,  0.01514243],
       [ 0.97455126,  0.02374485],
       [ 0.96638721,  0.03925234]], dtype=float32)

### GRU RNN

In [None]:
from keras.layers import GRU

rnn_gru = Sequential()
rnn_gru.add(embedding_layer)
rnn_gru.add(GRU(100,dropout=0.2, recurrent_dropout=0.2))
rnn_gru.add(Dense(NUM_CLASSES, activation='sigmoid'))
rnn_gru.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
print(rnn_gru.summary())

In [None]:
rnn_gru.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

In [None]:
y_pred = rnn_gru.predict(X_test)
y_pred

### Stacked RNN with two LSTMs

In [None]:
rnn_stacked = Sequential()
rnn_stacked.add(embedding_layer)
rnn_stacked.add(LSTM(100, return_sequences=True))
rnn_stacked.add(LSTM(50))
rnn_stacked.add(Dense(NUM_CLASSES, activation='sigmoid'))
rnn_stacked.compile(loss='binary_crossentropy', 
              optimizer='rmsprop', 
              metrics=['accuracy'])
print(rnn_stacked.summary())

In [None]:
rnn_stacked.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

In [None]:
y_pred = rnn_stacked.predict(X_test)

In [None]:
# column 0 is the ham output: label 0 (ham) when it is at least 0.5, else 1 (spam)
y_predict = (y_pred[:, 0] < 0.5).astype(int)
y_test = (y_test_ohe[:, 0] < 0.5).astype(int)

In [None]:
print(len(y_predict))
print(len(y_test))

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_predict)