## Lab Assignment Two: Exploring Text Data 

### Justin Ledford, Luke Wood, Traian Pop 
___

## Business Understanding

### Data Background
SMS messages play a huge role in a person's life, and the confidentiality and integrity of those messages are a top priority for mobile carriers around the world. Because of this, many unlawful individuals and groups try to take advantage of the average consumer by flooding their inbox with spam. While the majority of people successfully avoid it, some people are harmed by falling for fraudulent messages.  

The data we selected is a compilation of 5574 SMS messages acquired from a variety of sources, broken down as follows: 425 of the messages came from the Grumbletext Web Site, 3375 of the messages were taken from the NUS SMS Corpus (a database of legitimate messages collected at the National University of Singapore), 450 messages were collected from Caroline Tagg's PhD thesis, and the last 1324 messages came from the SMS Spam Corpus v.0.1 Big. 

Overall there were 4827 "ham" messages and 747 "spam" messages, and about 92,000 words.
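
The class counts above already tell us how imbalanced the data is. As a quick sanity check, the short sketch below works out the spam fraction and the accuracy a trivial "always ham" classifier would reach, using only the counts quoted in this section. The variable names are ours and are only for illustration.

```python
# Class counts as reported in the text above.
n_ham, n_spam = 4827, 747
n_total = n_ham + n_spam

spam_fraction = n_spam / n_total
# Accuracy of a trivial classifier that labels every message "ham".
majority_baseline_accuracy = n_ham / n_total

print(f"total messages: {n_total}")
print(f"spam fraction: {spam_fraction:.3f}")
print(f"always-ham accuracy: {majority_baseline_accuracy:.3f}")
```

Because roughly 87% of the messages are ham, any accuracy number should be read against that baseline.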

### Purpose
This data was initially collected for studies on telling spam messages apart from ham (legitimate) messages. Uses for this research include advanced spam filtering technology and improved data sets for machine learning programs. However, a slight problem with this data set, as with most localized language-based data sets, is that the relatively small sampling area produces a lot of regional data points (such as slang and acronyms) that can be considered "useless" if a much more generalized data set is wanted. For our specific project, however, we keep all of this data so that we can analyze it and get a better understanding of our data.
___

## Preparation (40 points total)

### [20 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).   

In [1]:
import pandas as pd
import numpy as np
import requests
import re
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

descriptors_url = 'https://raw.githubusercontent.com/LukeWoodSMU/TextAnalysis/master/data/SMSSpamCollection'
descriptors = requests.get(descriptors_url).text
texts = []


for line in descriptors.splitlines():
    texts.append(line.rstrip().split("\t"))

After a first look at the data we noticed a lot of phone numbers. Since almost every number was unique, we concluded that the numbers were not useful as words. We considered grouping all number tokens into one "word" and analyzing the presence of that word, but we decided to start by just removing the numbers.
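
To make the two options above concrete, here is a minimal sketch of both: dropping every run of digits, and collapsing every run of digits into one shared placeholder word. The helper names `strip_numbers` and `tag_numbers` and the placeholder `numtoken` are hypothetical and are not part of the notebook's pipeline.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+")  # one or more consecutive digits


def strip_numbers(text):
    """Replace every run of digits with a single space."""
    return NUMBER_PATTERN.sub(" ", text)


def tag_numbers(text, token="numtoken"):
    """Replace every run of digits with one shared placeholder word."""
    return NUMBER_PATTERN.sub(f" {token} ", text)


sample = "Text FA to 87121 to receive entry"
print(strip_numbers(sample))
print(tag_numbers(sample))
```

The placeholder variant keeps the signal that a number was present, which may matter since spam messages often contain short codes and phone numbers.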

In [2]:
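# Remove numbers
# NOTE: this pattern matches a digit or hyphen followed by one or more '3's and
# then the rest of the line, so most numbers are left in place (see the output below).
texts = list(zip([a for a,b in texts], [re.sub(r'[0-9-]3+.*', ' ', b) for a,b in texts]))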
texts[:10]

[('ham',
  'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 ('ham', 'Ok lar... Joking wif u oni...'),
 ('spam',
  "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"),
 ('ham', 'U dun say so early hor... U c already then say...'),
 ('ham', "Nah I don't think he goes to usf, he lives around here though"),
 ('spam',
  "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"),
 ('ham',
  'Even my brother is not like to speak with me. They treat me like aids patent.'),
 ('ham',
  "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"),
 ('spam',
  'WINNER!! As a valued network customer you have been selected to receivea £

In [3]:
import numpy as np
from keras.preprocessing import sequence

Using TensorFlow backend.


In [4]:
X = [x[1] for x in texts]
y = [x[0] for x in texts]
X = np.array(X)
print(type(X))

<class 'numpy.ndarray'>


In [5]:
import keras
y = [0 if y_ == "spam" else 1 for y_ in y]
y_ohe = keras.utils.to_categorical(y)
y_ohe

array([[ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       ..., 
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])
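
For readers who want to see what the one-hot step produces without keras, the following is a small pure-Python sketch of the same encoding. It assumes the label mapping used in the cell above (spam is class 0, ham is class 1); the function name `one_hot` is ours.

```python
LABEL_TO_INDEX = {"spam": 0, "ham": 1}  # same mapping as the cell above


def one_hot(labels, label_to_index=LABEL_TO_INDEX):
    """Turn a list of string labels into a list of one-hot rows."""
    n_classes = len(label_to_index)
    rows = []
    for label in labels:
        row = [0.0] * n_classes
        row[label_to_index[label]] = 1.0  # mark the column of the true class
        rows.append(row)
    return rows


print(one_hot(["ham", "ham", "spam"]))
```

With this mapping, column 0 is the spam indicator and column 1 is the ham indicator.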

In [6]:
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

NUM_TOP_WORDS = None

tokenizer = Tokenizer(num_words=NUM_TOP_WORDS)
tokenizer.fit_on_texts(X)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(X)
sequences = pad_sequences(sequences)
sequences

MAX_TEXT_LEN = len(sequences[0]) # after padding, every sequence has the length of the longest message
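
The padding step is what makes `len(sequences[0])` a valid length for every message. Below is a minimal pure-Python sketch of left-padding with zeros, which is how `pad_sequences` behaves with its default arguments. The helper name `pad_left` is ours, and this is an illustration rather than the keras implementation.

```python
def pad_left(seqs, value=0):
    """Left-pad every sequence with `value` up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [[value] * (max_len - len(s)) + list(s) for s in seqs]


toy_sequences = [[5, 2], [7], [1, 4, 9]]
padded = pad_left(toy_sequences)
print(padded)
print("common length:", len(padded[0]))
```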


In [7]:
EMBED_SIZE = 100
# the embed size should match the file you load glove from
embeddings_index = {}
# save key/array pairs of the embeddings
#  the key of the dictionary is the word, the array is the embedding
# (the context manager closes the file for us; the explicit encoding avoids
#  platform-dependent decoding errors)
with open('GLOVE/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# now fill in the matrix, using the ordering from the
#  keras word tokenizer from before
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_SIZE))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)

Found 400000 word vectors.
(8621, 100)
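
Since words without a GloVe vector stay all-zero in the matrix, it can be useful to know how much of the vocabulary is actually covered. The sketch below shows one way to check this. The helper `embedding_coverage` and the toy dictionaries are hypothetical; in the notebook one would pass `word_index` and `embeddings_index` instead.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return (number of words with a vector, list of words without one)."""
    missing = [w for w in word_index if w not in embeddings_index]
    found = len(word_index) - len(missing)
    return found, missing


# Toy stand-ins for the tokenizer vocabulary and the loaded GloVe vectors.
toy_word_index = {"free": 1, "txt": 2, "wkly": 3}
toy_embeddings = {"free": [0.1, 0.2], "txt": [0.3, 0.4]}

found, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(f"{found} of {len(toy_word_index)} words have a vector; missing: {missing}")
```

SMS slang and misspellings are the most likely candidates for the missing list.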


In [8]:
from sklearn.model_selection import train_test_split
# Split it into train / test subsets
X_train, X_test, y_train_ohe, y_test_ohe = train_test_split(sequences, y_ohe, test_size=0.2,
                                                            stratify=y_ohe, 
                                                            random_state=42)
NUM_CLASSES = len(y_train_ohe[0])
NUM_CLASSES

2
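
The `stratify` argument above keeps the ham/spam ratio the same in the train and test subsets. As a rough illustration of the idea (not the scikit-learn implementation), the sketch below splits indices class by class so that each class contributes the same fraction to the test set. The function name `stratified_split` and the toy labels are ours.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class contributes `test_size` of its items to test."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within the class before cutting
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


toy_labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
print("spam in test:", sum(toy_labels[i] == "spam" for i in test_idx))
```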

# Evaluation Metrics
We decided that, because our business case is a potential spam filter, our largest cost is false positives, that is, legitimate messages flagged as spam.  It would be incredibly frustrating to have a real text filtered out, so we evaluate our models with this in mind.  To do so we use the precision score, which penalizes false positives.  Precision has been removed from keras, but luckily the old code is available in one of keras' older versions.

In [13]:
# Old version of keras had precision score, copied the code to re-implement it.
import keras.backend as K
def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

Citation: the `precision` function above is copied from an older version of keras.
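
To connect the metric back to the business case, here is a small worked example of precision for a single class, computed as true positives divided by everything predicted as that class. It assumes we treat "spam" as the positive class, so a false positive is a ham message flagged as spam. The helper `precision_for_class` and the toy labels are ours and are not results from this notebook.

```python
def precision_for_class(y_true, y_pred, positive):
    """Precision = TP / (TP + FP) for the class named `positive`."""
    predicted_positive = sum(1 for p in y_pred if p == positive)
    true_positive = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    # Guard against division by zero when nothing was predicted positive.
    return true_positive / predicted_positive if predicted_positive else 0.0


# Toy labels: one ham message is wrongly flagged as spam.
y_true = ["ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "spam", "ham"]

print("spam precision:", precision_for_class(y_true, y_pred, "spam"))
print("ham precision:", precision_for_class(y_true, y_pred, "ham"))
```

Two of the three messages flagged as spam really are spam, so the spam precision is 2/3; the single filtered ham message is exactly the kind of error we want the metric to punish.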

# Building a Single Model

In [14]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBED_SIZE,
                            weights=[embedding_matrix],
                            input_length=MAX_TEXT_LEN,
                            trainable=False)
metrics=[precision,"accuracy"]

In [15]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

rnn = Sequential()
rnn.add(embedding_layer)
rnn.add(LSTM(100,dropout=0.2, recurrent_dropout=0.2))
rnn.add(Dense(NUM_CLASSES, activation='sigmoid'))
rnn.compile(loss='categorical_crossentropy', 
              optimizer='rmsprop', 
              metrics=metrics)
print(rnn.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 189, 100)          862100    
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 202       
Total params: 942,702.0
Trainable params: 80,602.0
Non-trainable params: 862,100.0
_________________________________________________________________
None


In [12]:
rnn.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

Train on 4459 samples, validate on 1115 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x126f3d668>

# Comparing Different Model Types

In [13]:
from keras.layers import LSTM, GRU, SimpleRNN

rnns = []

for func in [SimpleRNN, LSTM, GRU]:
    rnn = Sequential()
    rnn.add(embedding_layer)
    rnn.add(func(100,dropout=0.2, recurrent_dropout=0.2))
    rnn.add(Dense(NUM_CLASSES, activation='sigmoid'))

    rnn.compile(loss='categorical_crossentropy', 
                  optimizer='rmsprop', 
                  metrics=metrics)
    rnns.append(rnn)

In [14]:
for rnn, name in zip(rnns,['simple','lstm','gru']):
    print('\nTesting Cell Type: ',name,'========')
    rnn.fit(X_train, y_train_ohe, epochs=3, batch_size=64, validation_data=(X_test, y_test_ohe))

Train on 4459 samples, validate on 1115 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 4459 samples, validate on 1115 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 4459 samples, validate on 1115 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


As we can see, the GRU model performs the best by a large margin.  If we continue to train the GRU model, it seems likely that we will get strong results.  We will also try to find the best hyperparameters for the GRU model.
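
Rather than reading the comparison off the training logs, the final validation scores can be compared programmatically. Keras' `fit` returns a `History` object whose `history` attribute maps metric names to per-epoch values; the sketch below works on plain dictionaries of that shape. The key name `val_precision` is an assumption based on our metric function being named `precision`, and the numbers are made up for illustration, not results from this notebook.

```python
def best_run(histories, metric="val_precision"):
    """Pick the run with the highest final-epoch value of `metric`.

    `histories` maps a run name to a History.history-style dict.
    """
    final_scores = {name: h[metric][-1] for name, h in histories.items()}
    best_name = max(final_scores, key=final_scores.get)
    return best_name, final_scores


# Placeholder values only; replace with the `history` dicts returned by fit().
toy_histories = {
    "simple": {"val_precision": [0.80, 0.82, 0.83]},
    "lstm": {"val_precision": [0.85, 0.88, 0.90]},
    "gru": {"val_precision": [0.90, 0.94, 0.96]},
}

best_name, final_scores = best_run(toy_histories)
print(best_name, final_scores)
```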

# Gridsearch on GRU Model

In [15]:
dropouts=[.1,.2,.3]
recurrent_dropouts=[.1,.2,.3]

for dropout in dropouts:
    for recurrent_dropout in recurrent_dropouts:
        rnn = Sequential()
        rnn.add(embedding_layer)
        # use GRU explicitly instead of the leftover loop variable `func`
        rnn.add(GRU(100,dropout=dropout, recurrent_dropout=recurrent_dropout))
        rnn.add(Dense(NUM_CLASSES, activation='sigmoid'))

        rnn.compile(loss='categorical_crossentropy', 
                      optimizer='rmsprop', 
                      metrics=metrics)
        print("Hyper Paramater Set:\n\tdropout=%.1f\n\trecurrent_dropout=%.1f" % (dropout,recurrent_dropout))
        # NOTE: validation_data is the training split here, so val_* scores are training scores
        rnn.fit(X_train,y_train_ohe,epochs=3, batch_size=64, validation_data=(X_train,y_train_ohe))

Hyper Paramater Set:
	dropout=0.1
	recurrent_dropout=0.1
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.1
	recurrent_dropout=0.2
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.1
	recurrent_dropout=0.3
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.2
	recurrent_dropout=0.1
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.2
	recurrent_dropout=0.2
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.2
	recurrent_dropout=0.3
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.3
	recurrent_dropout=0.1
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.3
	recurrent_dropout=0.

### We get very high accuracy with both of our hyperparameters set to .1

As we can see, with dropout and recurrent dropout at .1 we get strong results, with accuracy getting as high as 98.6%.  The model gets .997 precision and .98 accuracy on the validation data with these hyperparameters.  Note that the grid search passes `validation_data=(X_train, y_train_ohe)`, so these scores are measured on the training split rather than on held-out messages.

We actually get a similar precision score with a few sets of hyperparameters, but we get a higher accuracy with the .1 and .1 set, so this is our most effective model.
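
The nested loops of the grid search can also be written so that the best combination is tracked automatically. The sketch below uses `itertools.product` over a parameter grid and a scoring callback. The names `grid_search` and `toy_score` are hypothetical; in practice the callback would build, fit and evaluate a model on held-out data and return the metric of interest. The toy scoring function is only a stand-in so the example runs without keras.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid`; return the best params and score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(dropout, recurrent_dropout):
    # Stand-in for "train a model and return its held-out precision".
    return 1.0 - dropout - 0.5 * recurrent_dropout


grid = {"dropout": [0.1, 0.2, 0.3], "recurrent_dropout": [0.1, 0.2, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, round(best_score, 3))
```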

In [15]:
best_model = Sequential()
best_model.add(embedding_layer)
best_model.add(GRU(100,dropout=.1, recurrent_dropout=.1))
best_model.add(Dense(NUM_CLASSES, activation='sigmoid'))
best_model.compile(loss='categorical_crossentropy', 
                      optimizer='rmsprop', 
                      metrics=metrics)

# Running Our Best Model With More Epochs

In [16]:
best_model.fit(X_train,y_train_ohe,epochs=10, batch_size=64, validation_data=(X_train,y_train_ohe))

Train on 4459 samples, validate on 4459 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x12a7f0b70>

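We end up getting above 99.5% accuracy and a precision score of .9986 on the validation data.  This is a very good score on this dataset.  However, `validation_data` is again the training split (4459 samples), so these numbers should be confirmed on the held-out test split before we use this model to publish a spam filter.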

# Grid Search Using LSTM

In [16]:
dropouts=[.1,.2,.3]
recurrent_dropouts=[.1,.2,.3]

for dropout in dropouts:
    for recurrent_dropout in recurrent_dropouts:
        rnn = Sequential()
        rnn.add(embedding_layer)
        rnn.add(LSTM(100,dropout=dropout, recurrent_dropout=recurrent_dropout))
        rnn.add(Dense(NUM_CLASSES, activation='sigmoid'))

        rnn.compile(loss='categorical_crossentropy', 
                      optimizer='rmsprop', 
                      metrics=metrics)
        print("Hyper Paramater Set:\n\tdropout=%.1f\n\trecurrent_dropout=%.1f" % (dropout,recurrent_dropout))
        rnn.fit(X_train,y_train_ohe,epochs=3, batch_size=64, validation_data=(X_train,y_train_ohe))

Hyper Paramater Set:
	dropout=0.1
	recurrent_dropout=0.1
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.1
	recurrent_dropout=0.2
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.1
	recurrent_dropout=0.3
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.2
	recurrent_dropout=0.1
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.2
	recurrent_dropout=0.2
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.2
	recurrent_dropout=0.3
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.3
	recurrent_dropout=0.1
Train on 4459 samples, validate on 4459 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Hyper Paramater Set:
	dropout=0.3
	recurrent_dropout=0.

### [10 points] Choose and explain what metric(s) you will use to evaluate your algorithm’s performance. You should give a detailed argument for why this (these) metric(s) are appropriate on your data. That is, why is the metric appropriate for the task (e.g., in terms of the business case for the task). Please note: rarely is accuracy the best evaluation metric to use. Think deeply about an appropriate measure of performance.

### [10 points] Choose the method you will use for dividing your data into training and testing (i.e., are you using Stratified 10-fold cross validation? Shuffle splits? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. Convince me that your cross validation method is a realistic mirroring of how an algorithm would be used in practice. 

## Modeling (50 points total)

### [25 points] Investigate at least two different recurrent network architectures (perhaps LSTM and GRU). Adjust hyper-parameters of the networks as needed to improve generalization performance. 

### [25 points] Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab. Visualize the best results of the RNNs.   

## Exceptional Work (10 points total)

You have free rein to provide additional analyses.

One idea: Use more than a single chain of LSTMs or GRUs (i.e., use multiple parallel chains). 

Another idea: Try to create a RNN for generating novel text. 

# NLTK tokenize vs keras tokenizer.

We thought it could be interesting to compare the generalized NLTK tokenizer to the keras tokenizer.  We decided to compare them using basic LSTM networks.

In [17]:
from nltk.tokenize import word_tokenize
X_nltk = [word_tokenize(x) for x in X]

In [18]:
encoder = {}
counter = 0
def encode_sentence(seq):
    global encoder, counter
    fseq = []
    for x in seq:
        if x not in encoder:
            encoder[x] = counter
            counter+=1
        fseq.append(encoder[x])
    return fseq

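# NOTE: this iterates over the raw strings in X, not the word_tokenize output
# above, so each message is encoded character by character (consistent with
# the 910-step input in the model summary below).
X_nltk = [encode_sentence(x) for x in X]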
X_nltk = pad_sequences(X_nltk, maxlen=None)
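
One detail worth keeping in mind with a hand-rolled encoder is that `pad_sequences` pads with 0, so index 0 should not also stand for a real token. The sketch below shows a token-level encoder that reserves 0 for padding. The helper names `build_vocab` and `encode`, and the toy token lists, are ours; the token lists stand in for the output of a tokenizer such as `word_tokenize`.

```python
def build_vocab(tokenized_texts):
    """Assign each distinct token an index starting at 1 (0 is kept for padding)."""
    vocab = {}
    for tokens in tokenized_texts:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode(tokens, vocab):
    """Map a list of tokens to their integer indices."""
    return [vocab[tok] for tok in tokens]


# Toy token lists, as a tokenizer would produce them.
toy_tokens = [["Ok", "lar", "..."], ["Ok", "free", "entry"]]
vocab = build_vocab(toy_tokens)
encoded = [encode(tokens, vocab) for tokens in toy_tokens]
print(vocab)
print(encoded)
```

Note that indices produced this way do not line up with the keras `word_index`, so an embedding matrix built from `word_index` would need to be rebuilt from this vocabulary before the two are combined.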

In [19]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBED_SIZE,
                            weights=[embedding_matrix],
                            input_length=len(X_nltk[0]),
                            trainable=False)

In [20]:
rnn = Sequential()
rnn.add(embedding_layer)
rnn.add(LSTM(100,dropout=0.2, recurrent_dropout=0.2))
rnn.add(Dense(NUM_CLASSES, activation='sigmoid'))
rnn.compile(loss='categorical_crossentropy', 
              optimizer='rmsprop', 
              metrics=metrics)
print(rnn.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 910, 100)          862100    
_________________________________________________________________
lstm_12 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 202       
Total params: 942,702.0
Trainable params: 80,602.0
Non-trainable params: 862,100.0
_________________________________________________________________
None


In [None]:
X_train, X_test, y_train_ohe, y_test_ohe = train_test_split(X_nltk, y_ohe, test_size=0.2,
                                                            stratify=y_ohe, 
                                                            random_state=42)

In [None]:
rnn.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

Train on 4459 samples, validate on 1115 samples
Epoch 1/3


# KerasGlove Published to PyPI
I really liked being able to easily use GloVe embeddings with keras, so I published a package to PyPI for it.  It's available under the name kerasglove and removes the need for a lot of the code in this notebook.  Here is a sample usage of it:

In [None]:
from kerasglove import GloveEmbedding
EMBED_SIZE=100
embedding_layer = GloveEmbedding(
                            EMBED_SIZE,
                            MAX_TEXT_LEN,
                            word_index)
embedding_layer

The full source is here:
https://github.com/LukeWoodSMU/KerasGlove