# Spam Detection using LSTM neural network
In this notebook, I implement a recurrent neural network that performs spam detection. Unlike a feedforward network or logistic regression over bag-of-words features, an RNN/LSTM can use information about the order of the words, which can make it more accurate on this kind of text.

In [1]:
import tensorflow as tf
import numpy as np
import os
from distutils.version import LooseVersion
import warnings

In [2]:
# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

TensorFlow Version: 1.2.0


# Load data

In [3]:
with open('Assets/SMSSpamCollection', 'r') as f:
    data = f.read()
    
data[:300]

"ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 084528100"

# Preprocessing


In [4]:
# Remove punctuation and lowercase
from string import punctuation
all_text = ''.join([c for c in data if c not in punctuation])
all_text = all_text.lower()

# split label and text of each line.
messages = all_text.split('\n')
messages = [x.split('\t') for x in messages if len(x)>=1]
[labels, texts] = np.array([list(x) for x in zip(*messages)])
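
As a quick check of the cleaning logic above, the same steps can be applied to a tiny hand-written sample. The `raw` string and the `split_messages` helper are made up for illustration; only the punctuation stripping, lowercasing and tab/newline splitting mirror the cell above.

```python
from string import punctuation

def split_messages(raw):
    """Strip punctuation, lowercase, and split into (labels, texts)."""
    cleaned = ''.join(c for c in raw if c not in punctuation).lower()
    # Tabs and newlines are not in string.punctuation, so the
    # label/text separator and the line separator survive the cleaning.
    rows = [line.split('\t') for line in cleaned.split('\n') if line]
    row_labels = [row[0] for row in rows]
    row_texts = [row[1] for row in rows]
    return row_labels, row_texts

# Hypothetical two-line sample in the same "label<TAB>text" layout.
raw = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry!! Text FA to 87121\n"
sample_labels, sample_texts = split_messages(raw)
print(sample_labels)
print(sample_texts)
```

Note that removing punctuation also glues together tokens that were only separated by punctuation, which is why strings such as `questionstd` appear in the cleaned data.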

In [5]:
print("Example: ")
print("Label: {},\tText: {}".format(labels[0],texts[0]))
print("Label: {},\tText: {}".format(labels[1],texts[1]))
print("Label: {},\tText: {}".format(labels[2],texts[2]))

Example: 
Label: ham,	Text: go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Label: ham,	Text: ok lar joking wif u oni
Label: spam,	Text: free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s




Our labels are "spam" or "ham". To use these labels in our network, we need to convert them to 0 and 1.

In [6]:
labels = np.array([1 if each == 'spam' else 0 for each in labels])
labels

array([0, 0, 1, ..., 0, 0, 0])

In [7]:
# All words
all_text = ' '.join(texts)
words = all_text.split()
words[:20]

['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'wat']

# Encoding the words
Encode the words with integers and build a dictionary that maps words to integers.

In [8]:
from collections import Counter
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse = True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab)}
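
The cell above numbers words from 0, so the most frequent word receives the same value that is later used for padding. Below is a minimal sketch, on a made-up word list, of one way to keep the two apart: start the enumeration at 1 so that 0 stays free. The names `toy_words` and `toy_vocab_to_int` are illustrative only. If this variant were adopted in the notebook, the embedding table would also need one extra row (`len(vocab_to_int) + 1`).

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and split into words.
toy_words = "free entry to win free tickets to win free".split()
toy_counts = Counter(toy_words)

# Most frequent words get the smallest ids; starting at 1 reserves 0 for padding.
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: idx for idx, word in enumerate(toy_vocab, 1)}

encoded = [toy_vocab_to_int[w] for w in "win free tickets".split()]
print(toy_vocab_to_int)
print(encoded)
```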

In [9]:
vocab_to_int

{'todayfrom': 2907,
 'mmmmmm': 2908,
 'tbspersolvo': 4392,
 'bruce': 2909,
 'anderson': 4393,
 'discussed': 2910,
 'mental': 2911,
 '08000930705': 678,
 '08704439680tscs': 4394,
 'tables': 4395,
 'window': 2912,
 'fool': 1841,
 'real': 327,
 'mushy': 4396,
 'switch': 2913,
 'hunlove': 4397,
 'ffffuuuuuuu': 4398,
 'loo': 4399,
 'sf': 4400,
 'drinks': 1678,
 'home': 80,
 '09063458130': 2228,
 'gbp5month': 4401,
 'continue': 2915,
 'minutes': 366,
 'spelled': 4402,
 'ivatte': 4403,
 'pierre': 4404,
 'financial': 4405,
 'asus': 4406,
 'c': 160,
 'hicts': 8696,
 '4few': 6876,
 'newest': 2336,
 'lot': 333,
 'subscribers': 5417,
 'screamed': 2916,
 'rem': 1599,
 'rearrange': 4411,
 'walls': 2229,
 'dhorte': 4413,
 'yourinclusive': 4414,
 'stdtxtrate': 4415,
 'confused': 2917,
 'outreach': 4416,
 'addicted': 1842,
 'moms': 1042,
 'swt': 2918,
 'kissing': 4417,
 'fantastic': 956,
 'we': 37,
 'newsby': 4536,
 'lt3': 2251,
 '40gb': 1405,
 'blah': 2230,
 '07801543489': 4419,
 'wwwtxttowincouk': 29

In [10]:
# Convert each message to a list of integers, one integer per word
text_ints = []
for each in texts:
    text_ints.append([vocab_to_int[word] for word in each.split()])

In [11]:
text_ints[3]

[5, 234, 140, 23, 355, 3900, 5, 160, 144, 59, 140]

In [12]:
from collections import Counter
text_lens = Counter([len(x) for x in text_ints])
print("Zero-length text: {}".format(text_lens[0]))
print("Maximum text length: {}".format(max(text_lens)))

Zero-length text: 2
Maximum text length: 171


In [13]:
# Use a fresh loop variable so the texts array from above is not shadowed
non_zero_idx = [ii for ii, ints in enumerate(text_ints) if len(ints) != 0]
len(non_zero_idx)

5572

In [14]:
len(texts)

5574

In [15]:
# Filter out the messages with zero length, keeping labels aligned with texts
text_ints = [text_ints[ii] for ii in non_zero_idx]
labels = np.array([labels[ii] for ii in non_zero_idx])

Create an array, features, that contains the data we'll pass to the network. Each row should be 170 elements long. For texts shorter than 170 words, left-pad with 0s. For texts longer than 170 words (the longest message has 171), use only the first 170 words as the feature vector.

In [16]:
seq_len = 170
features = np.zeros((len(text_ints), seq_len), dtype=int)
for i, row in enumerate(text_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]
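
The slicing in the cell above is compact, so here is the same left-padding and truncation applied to made-up rows with a short length of 5. `pad_left` and `toy_rows` are illustrative names, not part of the notebook.

```python
import numpy as np

def pad_left(rows, seq_len):
    """Left-pad each row with zeros, keeping at most the first seq_len ids."""
    padded = np.zeros((len(rows), seq_len), dtype=int)
    for i, row in enumerate(rows):
        kept = np.array(row, dtype=int)[:seq_len]   # truncate long rows
        padded[i, seq_len - len(kept):] = kept      # right-align the kept ids
    return padded

# One short row (gets padded) and one long row (gets truncated).
toy_rows = [[7, 3], [4, 9, 2, 8, 6, 1, 5]]
toy_features = pad_left(toy_rows, seq_len=5)
print(toy_features)
```

Writing the start index as `seq_len - len(kept)` also copes with an empty row, whereas `-len(row)` evaluates to 0 for an empty row and would select the whole row. That is one reason the zero-length messages are filtered out before this step.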

In [17]:
features[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

# Training, Validation, Test

In [18]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(4457, 170) 
Validation set: 	(557, 170) 
Test set: 		(558, 170)
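
The shapes printed above follow directly from the two integer truncations in the split. This small sketch reproduces the arithmetic for 5,572 messages without needing the data; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n_samples, train_frac=0.8, val_frac_of_rest=0.5):
    """Return (train, validation, test) sizes for a sequential split."""
    n_train = int(n_samples * train_frac)      # truncates, like int() in the cell above
    n_rest = n_samples - n_train
    n_val = int(n_rest * val_frac_of_rest)
    n_test = n_rest - n_val
    return n_train, n_val, n_test

sizes = split_sizes(5572)
print(sizes)
```

Because the split is sequential, it relies on the order of the file being unrelated to the label. Shuffling the row indices first (for example with `np.random.permutation`) would likely be a safer default.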


# Build the Neural Network

In [19]:
lstm_size = 256
lstm_layers = 2
batch_size = 250
learning_rate = 0.001
drop_out = 0.5
epochs = 5

Create TF Placeholders for the Neural Network.

In [20]:
n_words = len(vocab_to_int)

# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

### Embedding
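There are several thousand distinct words in our vocabulary (the indices printed above run past 8,000), so one-hot encoding each word would be massively inefficient. Instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table that maps each word index to a dense vector.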

In [21]:
# Size of the embedding vectors (number of units in the embedding layer)
embed_size = 300 

with graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
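
To see why a lookup table can replace one-hot encoding, note that multiplying a one-hot vector by the embedding matrix simply selects one row of it. Below is a NumPy sketch with a made-up 6-word vocabulary and 4-dimensional vectors; the sizes are chosen for readability and are not the notebook's.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embed_dim = 6, 4
embedding_matrix = rng.uniform(-1, 1, size=(vocab_size, embed_dim))

word_ids = np.array([2, 5, 2])   # a hypothetical three-word message

# One-hot route: build a (3, 6) matrix and multiply by the (6, 4) table.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ embedding_matrix

# Lookup route: index the rows directly, with no large sparse matrix.
via_lookup = embedding_matrix[word_ids]

print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same vectors, but the lookup never materialises a row of zeros per word, which matters once the vocabulary has thousands of entries.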

### Build RNN Cell and Initialize
Stack one or more LSTMCells in a MultiRNNCell. (If you are using TensorFlow < 1.0, you may get errors.)

In [22]:
with graph.as_default():
    def lstm_cell():
        cell = tf.contrib.rnn.LSTMCell(lstm_size, 
                                       initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2),
                                       state_is_tuple=True)
        drop = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
        return drop
    
    stack_cells = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(lstm_layers)])
    
    initial_state = state = stack_cells.zero_state(batch_size, tf.float32)

In [23]:
with graph.as_default():
    outputs, final_state = tf.nn.dynamic_rnn(stack_cells, embed, initial_state=initial_state)
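
For intuition about what a single LSTM layer does at each time step, here is a plain NumPy sketch of the standard LSTM update, unrolled over a short made-up sequence. It is a textbook formulation under simplifying assumptions (one layer, no dropout, zero biases, an arbitrary gate ordering) and is not a reproduction of TensorFlow's `LSTMCell` internals; all names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step for a batch.

    x: (batch, input_dim); h_prev, c_prev: (batch, hidden)
    W: (input_dim + hidden, 4 * hidden); b: (4 * hidden,)
    """
    z = np.concatenate([x, h_prev], axis=1) @ W + b
    i, f, g, o = np.split(z, 4, axis=1)                  # input, forget, candidate, output
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)    # update the cell state
    h = sigmoid(o) * np.tanh(c)                          # expose part of it as output
    return h, c

rng = np.random.RandomState(1)
batch, steps, input_dim, hidden = 2, 3, 4, 5
W = rng.uniform(-0.1, 0.1, size=(input_dim + hidden, 4 * hidden))
b = np.zeros(4 * hidden)

h = np.zeros((batch, hidden))   # plays the role of the zero initial state
c = np.zeros((batch, hidden))
sequence = rng.uniform(-1, 1, size=(batch, steps, input_dim))
for t in range(steps):
    h, c = lstm_step(sequence[:, t], h, c, W, b)

print(h.shape)   # the output at the last time step, one vector per message
```

The final `h` corresponds to the last time step's output, which is the slice a classifier on top of the recurrent layers would read.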

In [24]:
with graph.as_default():
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

In [25]:
with graph.as_default():
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
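
The cost and accuracy nodes above can be checked by hand on a few made-up numbers. The sketch below mirrors the same two computations in NumPy: mean squared error between the sigmoid outputs and the 0/1 labels, and accuracy after rounding the outputs. The values in `probs` and `targets` are invented for illustration.

```python
import numpy as np

probs = np.array([[0.91], [0.08], [0.60], [0.35]])   # made-up sigmoid outputs
targets = np.array([[1], [0], [0], [0]])             # 1 = spam, 0 = ham

# Round each probability to the nearest label, then compare with the targets.
pred_labels = np.round(probs).astype(int)
acc = float(np.mean(pred_labels == targets))

# Mean squared error between labels and probabilities.
mse = float(np.mean((targets - probs) ** 2))
print(acc, mse)
```

Three of the four rounded predictions match, so the accuracy here is 0.75. The third message is a ham message scored at 0.60, which counts as a false positive.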

In [26]:
def get_batches(x, y, batch_size=100):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
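
One consequence of `get_batches` is easy to miss: only full batches are yielded, so a trailing partial batch is silently dropped. The sketch below repeats the generator on a dummy array with 557 rows (the size of the validation set) to show the effect; `toy_x` and `toy_y` are placeholders, not real features.

```python
import numpy as np

def get_batches(x, y, batch_size=100):
    """Yield full batches only; a trailing partial batch is dropped."""
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]

toy_x = np.arange(557 * 3).reshape(557, 3)   # 557 dummy rows
toy_y = np.zeros(557, dtype=int)

batch_shapes = [bx.shape for bx, _ in get_batches(toy_x, toy_y, batch_size=250)]
dropped = len(toy_x) - sum(shape[0] for shape in batch_shapes)
print(batch_shapes, "dropped:", dropped)
```

With a batch size of 250, two batches are produced and 57 rows are never seen. Fixed-size batches are needed here because the initial state is built with `zero_state(batch_size, ...)`, but it also means that accuracy figures computed through this generator cover only the rows that fit into full batches.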

In [27]:
with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_prob: drop_out,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(stack_cells.zero_state(batch_size, tf.float32))
                for x, y in get_batches(val_x, val_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1
    saver.save(sess, "checkpoints/sentiment.ckpt")

Epoch: 0/5 Iteration: 5 Train loss: 0.089
Epoch: 0/5 Iteration: 10 Train loss: 0.070
Epoch: 0/5 Iteration: 15 Train loss: 0.044
Epoch: 1/5 Iteration: 20 Train loss: 0.044
Epoch: 1/5 Iteration: 25 Train loss: 0.016
Val acc: 0.956
Epoch: 1/5 Iteration: 30 Train loss: 0.020
Epoch: 2/5 Iteration: 35 Train loss: 0.018
Epoch: 2/5 Iteration: 40 Train loss: 0.013
Epoch: 2/5 Iteration: 45 Train loss: 0.016
Epoch: 2/5 Iteration: 50 Train loss: 0.013
Val acc: 0.972
Epoch: 3/5 Iteration: 55 Train loss: 0.017
Epoch: 3/5 Iteration: 60 Train loss: 0.008
Epoch: 3/5 Iteration: 65 Train loss: 0.017
Epoch: 4/5 Iteration: 70 Train loss: 0.002
Epoch: 4/5 Iteration: 75 Train loss: 0.004
Val acc: 0.976
Epoch: 4/5 Iteration: 80 Train loss: 0.016
Epoch: 4/5 Iteration: 85 Train loss: 0.006


In [28]:
test_acc = []
with tf.Session(graph=graph) as sess:
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    test_state = sess.run(stack_cells.zero_state(batch_size, tf.float32))
    for ii, (x, y) in enumerate(get_batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_prob: 1,
                initial_state: test_state}
        batch_acc, test_state = sess.run([accuracy, final_state], feed_dict=feed)
        test_acc.append(batch_acc)
    print("Test accuracy: {:.3f}".format(np.mean(test_acc)))

INFO:tensorflow:Restoring parameters from checkpoints/sentiment.ckpt
Test accuracy: 0.988
