# Spam Detection with an RNN

>Using an RNN rather than a strictly feedforward network can be more accurate, because the model can use information about the *sequence* of words. 

#### Dataset: Kaggle SMS Spam Collection
[Sms Spam](https://www.kaggle.com/uciml/sms-spam-collection-dataset/downloads/spam.csv/1)

### Load in and visualize the data

In [1]:
import numpy as np
from numpy import  genfromtxt


In [2]:
data = genfromtxt('data/spam.csv',delimiter = '\n', dtype='str')


print("Shape of data:", data.shape)
print(data[0], "\n")
print(data[1],"\n")
print(data[2],"\n")




Shape of data: (5575,)
v1,v2,,, 

ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",,, 

ham,Ok lar... Joking wif u oni...,,, 



In [3]:
# Drop the first row (the CSV header).
data = data[1:]

# Separate into labels and messages.
# Each line looks like `ham,<text>,,,` or `spam,<text>,,,`

labels, messages = zip(*map(
    lambda x: (x[:3], x[4:-3]) if x.startswith('h') else (x[:4], x[5:-3]),
    data))
labels = np.array(labels)
messages = np.array(messages)

In [4]:
print(len(messages), len(labels))

for i in range(15):
    print("Message {}: {} \n Label: {} \n ".format(i,messages[i], labels[i]))

5574 5574
Message 0: "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." 
 Label: ham 
 
Message 1: Ok lar... Joking wif u oni... 
 Label: ham 
 
Message 2: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 
 Label: spam 
 
Message 3: U dun say so early hor... U c already then say... 
 Label: ham 
 
Message 4: "Nah I don't think he goes to usf, he lives around here though" 
 Label: ham 
 
Message 5: "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv" 
 Label: spam 
 
Message 6: Even my brother is not like to speak with me. They treat me like aids patent. 
 Label: ham 
 
Message 7: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your fri

#### Convert labels to float. If spam then 1, else 0.

In [5]:
labels[labels == "spam"] = 1
labels[labels == "ham"] = 0
print(labels[:10])
print(labels[0].dtype)

#convert to float
labels = labels.astype('float') 
print(labels[:10])

['0' '0' '1' '0' '0' '1' '0' '0' '1' '1']
<U1
[0. 0. 1. 0. 0. 1. 0. 0. 1. 1.]


#### Remove punctuation

In [6]:
from string import punctuation

for k in range(messages.shape[0]):
    messages[k] = messages[k].lower()
    messages[k] = "".join( [s for s in   messages[k] if s not in punctuation])


print(messages[0],"\n")
print(messages[1])


go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat 

ok lar joking wif u oni


In [7]:
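# NOTE: joining with "" fuses the last word of each message with the first word
# of the next one ('watok', 'onifree', ... in the output below). " ".join(messages)
# would keep them apart, but it would also change the word counts recorded below.
all_messages="".join( [s.lower() for s in messages if s not in punctuation] )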

all_messages[:1000]

'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore watok lar joking wif u onifree entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18su dun say so early hor u c already then saynah i dont think he goes to usf he lives around here thoughfreemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send å£150 to rcveven my brother is not like to speak with me they treat me like aids patentas per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertunewinner as a valued network customer you have been selected to receivea å£900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours onlyhad your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for 

In [8]:
words = all_messages.split()
print("Number of words", len(words))
words[:20]

Number of words 78233


['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'watok']

### Encoding the words

The embedding lookup requires that we pass integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our messages into a list of integers that can be passed into the network.
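
As a minimal sketch of that idea, the toy example below builds a word-to-integer dictionary from three made-up messages and then encodes a new one. The sample messages, the `min_count` threshold of 2, and the helper name `encode` are assumptions for illustration only; index 0 is reserved for unknown words, as in the notebook.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
toy_messages = ["win a free prize now", "call me now", "free entry call now"]

# Count how often each word appears across all messages
counts = Counter(word for msg in toy_messages for word in msg.split())

# Keep words seen at least `min_count` times; index 0 is reserved for unknown words
min_count = 2
toy_word_to_index = {"UNK": 0}
for word, n in counts.items():
    if n >= min_count:
        toy_word_to_index[word] = len(toy_word_to_index)

def encode(message, mapping):
    """Map each word to its integer index, falling back to 0 for unknown words."""
    return [mapping.get(word, 0) for word in message.lower().split()]

print(toy_word_to_index)
print(encode("Free prize call", toy_word_to_index))  # [1, 0, 3]
```

Rare words such as "prize" fall below the threshold and are encoded as 0, which keeps the vocabulary (and therefore the embedding table) small.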

In [9]:


## Build a dictionary that maps words to integers
UNK_TOKEN = 'UNK'

#Count words
word_count ={}
for word in words:
    r = word_count.get(word,None)
    
    if r :
        word_count[word]+=1
    else:
        word_count[word] = 1
        
        

#word to index
word_to_index = {}

keys = word_count.keys()
# Begin indexing with 1
i= 1
for key in  keys:
    
    if word_count[key] >= 5:
        word_to_index[key] = i
        i+= 1
# Add the unknown token at index 0 (the same value is later used for padding)
word_to_index[UNK_TOKEN] = 0 
        
        
print(len(word_to_index.keys() ))
word_to_index
    

1645


{'go': 1,
 'until': 2,
 'point': 3,
 'crazy': 4,
 'available': 5,
 'only': 6,
 'in': 7,
 'bugis': 8,
 'n': 9,
 'great': 10,
 'world': 11,
 'la': 12,
 'e': 13,
 'cine': 14,
 'there': 15,
 'got': 16,
 'lar': 17,
 'wif': 18,
 'u': 19,
 'entry': 20,
 '2': 21,
 'a': 22,
 'wkly': 23,
 'comp': 24,
 'to': 25,
 'win': 26,
 'cup': 27,
 'final': 28,
 'may': 29,
 'text': 30,
 'receive': 31,
 'txt': 32,
 'apply': 33,
 'dun': 34,
 'say': 35,
 'so': 36,
 'early': 37,
 'c': 38,
 'already': 39,
 'then': 40,
 'i': 41,
 'dont': 42,
 'think': 43,
 'he': 44,
 'goes': 45,
 'usf': 46,
 'around': 47,
 'here': 48,
 'hey': 49,
 'darling': 50,
 'its': 51,
 'been': 52,
 '3': 53,
 'weeks': 54,
 'now': 55,
 'and': 56,
 'no': 57,
 'word': 58,
 'back': 59,
 'id': 60,
 'like': 61,
 'some': 62,
 'fun': 63,
 'you': 64,
 'up': 65,
 'for': 66,
 'it': 67,
 'still': 68,
 'ok': 69,
 'xxx': 70,
 'std': 71,
 'send': 72,
 'å£150': 73,
 'my': 74,
 'brother': 75,
 'is': 76,
 'not': 77,
 'speak': 78,
 'with': 79,
 'me': 80,
 'they

In [52]:
index_to_word = {}

for key in word_to_index.keys():
    
    index_to_word[ word_to_index[key] ] = key

    

### Save Dictionaries 

In [62]:
import os
import pickle
def save_obj(obj,path ):
    
    direct, fname = os.path.split(path)
    
    if not os.path.exists(direct):
        os.makedirs(direct)
    
    
    
    with open(path + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(path):
    with open(path  + '.pkl', 'rb') as f:
        return pickle.load(f)

    

In [66]:
# Save index to word 
save_obj(index_to_word, "data/index_to_word")

#d = load_obj("data/index_to_word")




In [67]:
# Save word to index 
save_obj(word_to_index, "data/word_to_index")
#e = load_obj("data/word_to_index")



### Get word and get index functions

In [11]:
def get_word(index_to_word, index):
    """
    index_to_word: dictionary
        index to word dict
    index: int
    
    Return the word for the given index. If the index (key) is not in the dict, return the unknown token 'UNK'.
    """
    
    result = index_to_word.get(index,None)
    
    if result:
        return result
    return UNK_TOKEN
    

In [12]:
def get_index(word_to_index, word):
    """
    word_to_index: dictionary
        word to index dict
    word: string
    Return the index of the word from word_to_index.
    If the word is not in word_to_index, return 0, the index of the unknown token.
    
    """
    
    result = word_to_index.get(word,None)
    
    if result: 
        return result
    return 0

In [13]:
print(get_word(index_to_word, 12),"\n",get_index(word_to_index,"hi" ))

la 
 412


### Messages to vectors

In [14]:
vectors = []

for message in messages:
    
    vector = [ get_index(word_to_index,w) for w in message.split()]
    vectors.append(vector)


for  j in range(10):
    print(vectors[j],"\n")
    
print("Zero length messages")    
for  j in range(len(vectors)):
    if   len(vectors[j]) == 0:
        
        #print(vectors[j],"\n")
        pass
        


[1, 2, 0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 14, 15, 16, 0, 562] 

[69, 17, 0, 18, 19, 0] 

[124, 20, 7, 21, 22, 23, 24, 25, 26, 0, 27, 28, 0, 0, 29, 0, 30, 0, 25, 0, 25, 31, 20, 0, 32, 0, 33, 0] 

[19, 34, 35, 36, 37, 0, 19, 38, 39, 40, 35] 

[0, 41, 42, 43, 44, 45, 25, 46, 44, 0, 47, 48, 814] 

[0, 49, 15, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 0, 69, 70, 71, 0, 25, 72, 73, 25, 0] 

[320, 74, 75, 76, 77, 61, 25, 78, 79, 80, 81, 82, 80, 61, 0, 0] 

[89, 83, 84, 85, 86, 86, 0, 0, 0, 0, 87, 52, 88, 89, 84, 90, 66, 91, 92, 93, 94, 25, 95, 84, 96, 90] 

[527, 89, 22, 97, 98, 99, 64, 100, 52, 101, 25, 0, 102, 103, 104, 25, 105, 106, 0, 105, 107, 0, 108, 109, 110, 6] 

[259, 84, 111, 112, 113, 114, 115, 19, 116, 117, 25, 118, 25, 119, 120, 121, 122, 79, 123, 66, 124, 106, 119, 111, 118, 125, 124, 126, 0] 

Zero length messages


### Remove Outliers

1. Get rid of the outliers. In this notebook that means dropping messages with zero length; no upper length limit is applied.
2. Pad or truncate the remaining messages so that they all have the same length.
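
The second step can be sketched on plain Python lists. The helper name `pad_left` and the sample sequences are hypothetical; the behaviour (zeros on the left, keep only the first `seq_length` tokens) is meant to mirror what the notebook's padding does, not to replace it.

```python
def pad_left(sequence, seq_length, pad_value=0):
    """Left-pad with pad_value, or keep only the first seq_length items."""
    truncated = sequence[:seq_length]
    return [pad_value] * (seq_length - len(truncated)) + truncated

print(pad_left([5, 8, 2], 6))              # [0, 0, 0, 5, 8, 2]
print(pad_left([1, 2, 3, 4, 5, 6, 7], 6))  # [1, 2, 3, 4, 5, 6]
```

Because 0 is also the index of the unknown token in this notebook, the network cannot tell padding apart from an out-of-vocabulary word; reserving a separate padding index would be one way to change that.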

In [15]:
from collections import Counter

#outliers messages stats
messages_lens = Counter([len(x)  for x in vectors])
print("Zero-length messages: {}".format(messages_lens[0]))
print("Maximum message length: {}".format(max(messages_lens)))

Zero-length messages: 12
Maximum message length: 171


In [16]:
print("Number of messages before removing outliers: ", len(vectors))

Number of messages before removing outliers:  5574


In [17]:
import torch as th

non_zero_idx = [i for i,message in enumerate(vectors) if len(message)!= 0 ]

# remove 0-length messages and their labels 
vectors = [vectors[i] for i in non_zero_idx ]
labels = np.array([labels[i] for i in non_zero_idx])

print("Number of messages after removing outliers: ",len(vectors))



len(vectors) == labels.shape[0]

Number of messages after removing outliers:  5562


True

## Padding sequences

In [18]:
def pad_features(vectors, seq_length):
    ''' Return features of vectors, where each message is left-padded with 0's 
        or truncated to the input seq_length.
    '''
    
    # getting the correct rows x cols shape
    features = np.zeros((len(vectors), seq_length), dtype=int)

    # copy each message into the right-hand end of its row, keeping at most
    # seq_length tokens (rows must be non-empty, hence the earlier filtering)
    for i, row in enumerate(vectors):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

In [19]:
seq_length = 200

features = pad_features(vectors, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(vectors), "Your features should have as many rows as vectors."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."



# print the first 10 rows (all 200 columns)
print(features[:10,:200])




[[  0   0   0 ...  16   0 562]
 [  0   0   0 ...  18  19   0]
 [  0   0   0 ...   0  33   0]
 ...
 [  0   0   0 ...  84  96  90]
 [  0   0   0 ... 109 110   6]
 [  0   0   0 ... 124 126   0]]


## Training, Validation, Test

In [20]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))



			Feature Shapes:
Train set: 		(4449, 200) 
Validation set: 	(556, 200) 
Test set: 		(557, 200)


In [21]:
import torch as th
from torch.utils.data import TensorDataset, DataLoader


# Create Tensor datasets
# CONVERT TO int64 for embedding layer.
train_data = TensorDataset(th.from_numpy(train_x).to(th.int64), th.from_numpy(train_y))
valid_data = TensorDataset(th.from_numpy(val_x).to(th.int64)  , th.from_numpy(val_y))
test_data = TensorDataset(th.from_numpy(test_x).to(th.int64)  , th.from_numpy(test_y))

batch_size = 32

# make sure to SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)


In [22]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # the built-in next() works across PyTorch versions

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x, sample_x.dtype)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([32, 200])
Sample input: 
 tensor([[   0,    0,    0,  ...,    7,  414,  487],
        [   0,    0,    0,  ..., 1578,    9,    0],
        [   0,    0,    0,  ...,   77,  390,  925],
        ...,
        [   0,    0,    0,  ...,   25,  267,    0],
        [   0,    0,    0,  ...,  255,    0,    0],
        [   0,    0,    0,  ...,  128,  232,    0]]) torch.int64

Sample label size:  torch.Size([32])
Sample label: 
 tensor([0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       dtype=torch.float64)


# Define Model

In [23]:
# First checking if GPU is available
train_on_gpu=th.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [24]:
import torch.nn as nn

class SpamRNN(nn.Module):
    """
    The RNN model that will be used to perform Spam Detection.
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SpamRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob,
                            batch_first=True
                           )
        
        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        batch_size = x.size(0)

        # embeddings and lstm_out
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)
        # sigmoid function
        sig_out = self.sig(out)
        
        # reshape to (batch_size, seq_length)
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # keep only the output at the last time step
        
        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        train_on_gpu=th.cuda.is_available()
        #print("Train on GPU:", train_on_gpu)

        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden
        

## Instantiate the network

Here, we'll instantiate the network. First up, defining the hyperparameters.

* `vocab_size`: Size of our vocabulary, i.e. the range of values for our input word tokens.
* `output_size`: Size of our desired output; here a single value, the probability that a message is spam.
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Larger values give the model more capacity, but train more slowly and overfit more easily. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network, typically between 1 and 3.

> **Important:** Define the model  hyperparameters.


In [25]:
# Instantiate the model w/ hyperparams
vocab_size = len(index_to_word) 
output_size = 1
embedding_dim = 200
hidden_dim = 128
n_layers = 2

net = SpamRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SpamRNN(
  (embedding): Embedding(1645, 200)
  (lstm): LSTM(200, 128, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sig): Sigmoid()
)
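
To get a feel for what these hyperparameters cost, the sketch below counts the trainable parameters implied by the layer sizes printed above. It assumes PyTorch's usual `nn.LSTM` layout (input-to-hidden weights, hidden-to-hidden weights, and two bias vectors for each of the four gates); the helper name `lstm_layer_params` is hypothetical.

```python
vocab_size, embedding_dim, hidden_dim, n_layers, output_size = 1645, 200, 128, 2, 1

# Embedding table: one row of length embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

# The first LSTM layer reads embeddings; later layers read the previous layer's output
lstm_params = sum(
    lstm_layer_params(embedding_dim if layer == 0 else hidden_dim, hidden_dim)
    for layer in range(n_layers)
)

# Fully connected layer: weights plus bias
fc_params = hidden_dim * output_size + output_size

total = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total)  # 329000 301056 129 630185
```

Under these assumptions, a little over half of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` influence model size about as much as the LSTM settings do.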


### Training 

Hyperparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient norm; gradients whose overall norm exceeds it are rescaled (to prevent exploding gradients).
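
The `clip` value is handed to `nn.utils.clip_grad_norm_` in the training loop. The idea behind norm clipping can be sketched in plain Python: if the L2 norm of all gradients together exceeds the limit, every gradient is scaled by the same factor. The function name `clip_by_norm` is hypothetical, and the sketch leaves out the numerical safeguards of the real implementation.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm (direction unchanged)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 5.0))  # norm is 5, left unchanged: [3.0, 4.0]
print(clip_by_norm([6.0, 8.0], 5.0))  # norm is 10, scaled by 0.5: [3.0, 4.0]
```

Scaling all components together keeps the direction of the update and only limits its size, which is why it is preferred here over clipping each value separately.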

In [26]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = th.optim.Adam(net.parameters(), lr=lr)


In [27]:
# training params

epochs = 3 # 3-4 is approx where I noticed the validation loss stop decreasing

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net = net.cuda()
    

net.train()
# train for some number of epochs
for e in range(epochs):

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()
            
        # initialize hidden state
        # Variable batch size for len of dataset  not divisible by batch_size
        current_batch_size = inputs.shape[0] 
        h = net.init_hidden(current_batch_size)
        

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()
        
        #print("h shape", h[0].shape)
        # get the output from the model
        output, h = net(inputs, h)

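        # calculate the loss and perform backprop
        # (output already has shape [batch_size]; squeezing it would turn a final
        # batch of one sample into a 0-d tensor that no longer matches labels)
        loss = criterion(output, labels.float())
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)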
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            #val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:
                # initialize hidden state
                # Variable batch size for len of  dataset  not divisible by batch_size
                current_batch_size = inputs.shape[0] 
                val_h = net.init_hidden(current_batch_size)

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()
                #print("inputs shape", inputs.shape)
                #print("val_h shape", val_h[0].shape, val_h[1].shape)
                output, val_h = net(inputs, val_h)
                val_loss = criterion(output, labels.float())  # output is already [batch_size]

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/3... Step: 100... Loss: 0.048679... Val Loss: 0.087054


Epoch: 2/3... Step: 200... Loss: 0.083733... Val Loss: 0.075113
Epoch: 3/3... Step: 300... Loss: 0.177228... Val Loss: 0.072674
Epoch: 3/3... Step: 400... Loss: 0.002527... Val Loss: 0.074421


---
## Testing

There are a few ways to test your network.

* **Test data performance:** First, we'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.

* **Inference on user-generated data:** Second, we'll see if we can input just one example message at a time (without a label), and see what the trained model predicts. Looking at new, user input data like this, and predicting an output label, is called **inference**.
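
The accuracy reported for the test data is the share of messages whose rounded probability matches the label. A small sketch with made-up probabilities and labels follows; the function name `accuracy` and the sample values are assumptions for illustration.

```python
def accuracy(probabilities, targets, threshold=0.5):
    """Fraction of examples whose thresholded probability equals the target."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == t) for pred, t in zip(predictions, targets))
    return correct / len(targets)

probs = [0.99, 0.01, 0.60, 0.30]   # hypothetical sigmoid outputs
labels_true = [1, 0, 0, 0]         # 1 = spam, 0 = ham
print(accuracy(probs, labels_true))  # 0.75
```

The third message is scored 0.60 and therefore counted as spam although it is labelled ham, which is the single error in this toy batch.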

In [28]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0



net.eval()
# iterate over test data
for inputs, labels in test_loader:
    
    
    # initialize hidden state
    # Variable batch size for len of dataset  not divisible by batch_size
    current_batch_size = inputs.shape[0] 
    h = net.init_hidden(current_batch_size)

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, h = net(inputs, h)
    
    # calculate loss
    test_loss = criterion(output, labels.float())  # output is already [batch_size]
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = th.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())  # .cpu() is a no-op for CPU tensors
    num_correct += np.sum(correct)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.072
Test accuracy: 0.978


In [29]:
no_spam = "Hi, will go to the beach this week. You are invited"

### Inference on a new message



In [30]:
from string import punctuation

def tokenize_message(message):
    message = message.lower() # lowercase
    # get rid of punctuation
    test_text = ''.join([c for c in message if c not in punctuation])

    # splitting by spaces
    test_words = test_text.split()

    # tokens
    test_ints = []
    test_ints.append([get_index(word_to_index,word) for word in test_words])

    return test_ints

# test code and generate tokenized message
test_ints = tokenize_message(no_spam)
print(test_ints)

[[412, 163, 1, 25, 119, 0, 134, 151, 64, 231, 524]]


In [31]:
# test sequence padding
seq_length=200
features = pad_features(test_ints, seq_length)

print(features)

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0 412 163   1  25 119   0 134 151  64
  231 524]]


In [32]:
# test conversion to tensor and pass into your model
feature_tensor = th.from_numpy(features)
print(feature_tensor.size())

torch.Size([1, 200])


In [33]:
def predict(net, message, sequence_length=200):
    
    net.eval()
    
    # tokenize message
    test_ints = tokenize_message(message)
    
    # pad tokenized sequence
    seq_length=sequence_length
    features = pad_features(test_ints, seq_length)
    
    # convert to tensor to pass into your model
    feature_tensor = th.from_numpy(features)
    # cast tensor to int64
    feature_tensor = feature_tensor.to(th.int64)
    
    batch_size = feature_tensor.size(0)
    
    # initialize hidden state
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
        feature_tensor = feature_tensor.cuda()
    
    # get the output from the model
    output, h = net(feature_tensor, h)
    
    # convert output probabilities to predicted class (0 or 1)
    pred = th.round(output.squeeze()) 
    # printing output value, before rounding
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    # print custom response
    if(pred.item()==1):
        print("SPAM detected!")
    else:
        print("No spam message")
        
    return 1 if pred.item()==1 else 0
        

In [34]:
# spam message
spam = 'URGENT! claim your prize. Be a winner, new lot ends at nigth. Like and suscribe '

no_spam = 'Hi, will go to the beach this week. You are invited'


In [48]:
# call function
seq_length=200 # good to use the length that was trained on

predict(net, spam, seq_length)
predict(net, no_spam, seq_length)

Prediction value, pre-rounding: 0.990310
SPAM detected!
Prediction value, pre-rounding: 0.006003
No spam message


# Save and Load Model

In [49]:
# Save Model

import os 

model_path = "./models/"
if not os.path.exists(model_path):
    os.makedirs(model_path)

file_path = model_path + "rnn"
    
th.save(net.state_dict(),file_path)

In [51]:
# Load Model
# Model class must be defined somewhere
model = SpamRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
model.load_state_dict(th.load(file_path))
model.eval()





SpamRNN(
  (embedding): Embedding(1645, 200)
  (lstm): LSTM(200, 128, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=128, out_features=1, bias=True)
  (sig): Sigmoid()
)

In [68]:


# Load dictionaries 
index_to_word= load_obj("data/index_to_word")
word_to_index = load_obj("data/word_to_index")

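def classify(model, message="", sequence_length=200):
    
    # predict() moves its inputs to the GPU when one is available,
    # so the model has to live there as well
    if train_on_gpu:
        model = model.cuda()
    
    # use the model that was passed in, not the global `net`;
    # predict() already tokenizes and pads the message
    return predict(model, message, sequence_length)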

In [71]:
classify(model, "New offers come to nigth. 30% or until 90% off")

Prediction value, pre-rounding: 0.037792
No spam message
