# Spam Detection with an RNN

>Using an RNN rather than a strictly feedforward network can be more accurate, because the network sees the *sequence* of words and not just which words occur.

#### Dataset: Kaggle SMS Spam Collection
[SMS Spam Collection](https://www.kaggle.com/uciml/sms-spam-collection-dataset/downloads/spam.csv/1)

### Load in and visualize the data

In [99]:
import numpy as np
from numpy import genfromtxt


In [100]:
# Read the file line by line: each element of `data` is one raw CSV line.
data = genfromtxt('data/spam.csv', delimiter='\n', dtype='str')


print("Shape of data:", data.shape)
print(data[0], "\n")
print(data[1],"\n")
print(data[2],"\n")




Shape of data: (5575,)
v1,v2,,, 

ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",,, 

ham,Ok lar... Joking wif u oni...,,, 



In [101]:
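# Drop the first row (the CSV header).
data = data[1:]

# Separate each line into a label and a message.
# Lines look like 'ham,<message>,,,' or 'spam,<message>,,,', so slice off
# the label, the comma after it, and the trailing ',,,'.
labels, messages = zip(*(
    (x[:3], x[4:-3]) if x.startswith('h') else (x[:4], x[5:-3])
    for x in data
))
labels = np.array(labels)
messages = np.array(messages)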
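
The slicing above relies on every line starting with `ham,` or `spam,` and ending with `,,,`. As an alternative sketch, assuming the same line layout as the sample rows printed earlier, Python's standard `csv` module can split the fields. Unlike the slicing, it also strips the surrounding double quotes, so its output differs slightly. The `sample_lines` list is illustrative data copied from the output above, not the full file.

```python
import csv

# Two raw lines copied from the printed sample (illustrative only).
sample_lines = [
    'ham,"Go until jurong point, crazy.. Available only in bugis n great '
    'world la e buffet... Cine there got amore wat...",,,',
    'ham,Ok lar... Joking wif u oni...,,,',
]

labels_demo, messages_demo = [], []
# csv.reader accepts any iterable of strings and handles quoted commas.
for row in csv.reader(sample_lines):
    labels_demo.append(row[0])    # first field: 'ham' or 'spam'
    messages_demo.append(row[1])  # second field: the message text

print(labels_demo)
print(messages_demo[1])
```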

In [102]:
for i in range(10):
    print("Message {}: {} \n Label: {} \n ".format(i,messages[i], labels[i]))

Message 0: "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." 
 Label: ham 
 
Message 1: Ok lar... Joking wif u oni... 
 Label: ham 
 
Message 2: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's 
 Label: spam 
 
Message 3: U dun say so early hor... U c already then say... 
 Label: ham 
 
Message 4: "Nah I don't think he goes to usf, he lives around here though" 
 Label: ham 
 
Message 5: "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv" 
 Label: spam 
 
Message 6: Even my brother is not like to speak with me. They treat me like aids patent. 
 Label: ham 
 
Message 7: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Calle

#### Convert labels to floats: 1 for spam, 0 for ham

In [103]:
# The array holds strings, so these assignments store '1' and '0'.
labels[labels == "spam"] = 1
labels[labels == "ham"] = 0
print(labels[:10])
print(labels[0].dtype)

# Convert the string digits to floats.
labels = labels.astype('float')
print(labels[:10])

['0' '0' '1' '0' '0' '1' '0' '0' '1' '1']
<U1
[0. 0. 1. 0. 0. 1. 0. 0. 1. 1.]
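
A more direct route, sketched here on a small hypothetical label array rather than the real data, is to compare against `"spam"` and cast the boolean result. This skips the intermediate string values `'1'` and `'0'`.

```python
import numpy as np

# Hypothetical labels, in the same string form as the parsed data.
labels_demo = np.array(["ham", "ham", "spam", "ham", "spam"])

# The comparison yields booleans; casting maps True -> 1.0 and False -> 0.0.
labels_float = (labels_demo == "spam").astype("float")
print(labels_float)
```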


#### Lowercase and remove punctuation

In [104]:
from string import punctuation

for k in range(messages.shape[0]):
    # Lowercase, then drop every punctuation character.
    messages[k] = messages[k].lower()
    messages[k] = "".join(s for s in messages[k] if s not in punctuation)


print(messages[0],"\n")
print(messages[1])


go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat 

ok lar joking wif u oni
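
As a sketch of an equivalent approach, `str.translate` with a table built by `str.maketrans` deletes every character in `string.punctuation` in a single pass. The sample string is the first raw message from the output above.

```python
from string import punctuation

# Translation table that deletes every punctuation character.
strip_punct = str.maketrans("", "", punctuation)

raw = ('"Go until jurong point, crazy.. Available only in bugis n great '
       'world la e buffet... Cine there got amore wat..."')

# Lowercase first, then remove punctuation.
cleaned = raw.lower().translate(strip_punct)
print(cleaned)
```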


In [105]:
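# Concatenate all messages into one string. The messages are already
# lowercased and punctuation-free, so no further filtering is needed.
# Note: joining with "" puts no space between consecutive messages.
all_messages = "".join(messages)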

all_messages[:1000]

'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore watok lar joking wif u onifree entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18su dun say so early hor u c already then saynah i dont think he goes to usf he lives around here thoughfreemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send å£150 to rcveven my brother is not like to speak with me they treat me like aids patentas per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertunewinner as a valued network customer you have been selected to receivea å£900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours onlyhad your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for 

In [113]:
words = all_messages.split()
print("Number of words", len(words))
words[:20]

Number of words 78233


['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'watok']
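
The last entry above, `'watok'`, is `'wat'` (the end of the first message) fused with `'ok'` (the start of the second), because the messages were concatenated without a separator. A small sketch of the difference, using the first two cleaned messages:

```python
msgs = [
    "go until jurong point crazy available only in bugis n great world la e "
    "buffet cine there got amore wat",
    "ok lar joking wif u oni",
]

# No separator: the boundary words merge into one token.
fused = "".join(msgs).split()

# Space separator: every word stays intact.
separated = " ".join(msgs).split()

print(len(fused), fused[-6:])
print(len(separated), separated[-7:])
```

Joining with a space would change the word count and the vocabulary, so the outputs recorded in this notebook correspond to the join without a separator.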

### Encoding the words

The embedding lookup requires that we pass integers to our network. A simple way to do this is to build dictionaries that map each word in the vocabulary to an integer. We can then convert each message into a list of integers to pass into the network. Words that occur fewer than 5 times are left out of the vocabulary and share index 0, the index of the unknown token `UNK`.
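
The mapping described above can be sketched on a toy corpus before applying it to the full word list. This version uses `collections.Counter` and a hypothetical `min_count` of 2, chosen only so that the tiny example contains both kept and dropped words. The names `toy_words`, `vocab`, and `encoded` are illustrative.

```python
from collections import Counter

UNK = "UNK"
toy_words = "free entry free prize call now call free".split()
min_count = 2  # hypothetical threshold for this toy example

counts = Counter(toy_words)

# Index 0 is reserved for the unknown token; real words start at 1.
vocab = {UNK: 0}
for word, n in counts.items():  # iterates in order of first appearance
    if n >= min_count:
        vocab[word] = len(vocab)

# Rare words fall back to the unknown index.
encoded = [vocab.get(w, vocab[UNK]) for w in toy_words]
print(vocab)
print(encoded)
```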

In [107]:


# Build a dictionary that maps words to integers
UNK_TOKEN = 'UNK'

# Count how often each word occurs
word_count = {}
for word in words:
    word_count[word] = word_count.get(word, 0) + 1

# Word to index: keep only words that occur at least 5 times
word_to_index = {}

# Begin indexing with 1; index 0 is reserved for the unknown token
i = 1
for key in word_count:
    if word_count[key] >= 5:
        word_to_index[key] = i
        i += 1

# Add the unknown token
word_to_index[UNK_TOKEN] = 0

print(len(word_to_index))
word_to_index
    

1645


{'go': 1,
 'until': 2,
 'point': 3,
 'crazy': 4,
 'available': 5,
 'only': 6,
 'in': 7,
 'bugis': 8,
 'n': 9,
 'great': 10,
 'world': 11,
 'la': 12,
 'e': 13,
 'cine': 14,
 'there': 15,
 'got': 16,
 'lar': 17,
 'wif': 18,
 'u': 19,
 'entry': 20,
 '2': 21,
 'a': 22,
 'wkly': 23,
 'comp': 24,
 'to': 25,
 'win': 26,
 'cup': 27,
 'final': 28,
 'may': 29,
 'text': 30,
 'receive': 31,
 'txt': 32,
 'apply': 33,
 'dun': 34,
 'say': 35,
 'so': 36,
 'early': 37,
 'c': 38,
 'already': 39,
 'then': 40,
 'i': 41,
 'dont': 42,
 'think': 43,
 'he': 44,
 'goes': 45,
 'usf': 46,
 'around': 47,
 'here': 48,
 'hey': 49,
 'darling': 50,
 'its': 51,
 'been': 52,
 '3': 53,
 'weeks': 54,
 'now': 55,
 'and': 56,
 'no': 57,
 'word': 58,
 'back': 59,
 'id': 60,
 'like': 61,
 'some': 62,
 'fun': 63,
 'you': 64,
 'up': 65,
 'for': 66,
 'it': 67,
 'still': 68,
 'ok': 69,
 'xxx': 70,
 'std': 71,
 'send': 72,
 'å£150': 73,
 'my': 74,
 'brother': 75,
 'is': 76,
 'not': 77,
 'speak': 78,
 'with': 79,
 'me': 80,
 'they

In [108]:
# Invert the mapping: index to word
index_to_word = {index: word for word, index in word_to_index.items()}

index_to_word

{1: 'go',
 2: 'until',
 3: 'point',
 4: 'crazy',
 5: 'available',
 6: 'only',
 7: 'in',
 8: 'bugis',
 9: 'n',
 10: 'great',
 11: 'world',
 12: 'la',
 13: 'e',
 14: 'cine',
 15: 'there',
 16: 'got',
 17: 'lar',
 18: 'wif',
 19: 'u',
 20: 'entry',
 21: '2',
 22: 'a',
 23: 'wkly',
 24: 'comp',
 25: 'to',
 26: 'win',
 27: 'cup',
 28: 'final',
 29: 'may',
 30: 'text',
 31: 'receive',
 32: 'txt',
 33: 'apply',
 34: 'dun',
 35: 'say',
 36: 'so',
 37: 'early',
 38: 'c',
 39: 'already',
 40: 'then',
 41: 'i',
 42: 'dont',
 43: 'think',
 44: 'he',
 45: 'goes',
 46: 'usf',
 47: 'around',
 48: 'here',
 49: 'hey',
 50: 'darling',
 51: 'its',
 52: 'been',
 53: '3',
 54: 'weeks',
 55: 'now',
 56: 'and',
 57: 'no',
 58: 'word',
 59: 'back',
 60: 'id',
 61: 'like',
 62: 'some',
 63: 'fun',
 64: 'you',
 65: 'up',
 66: 'for',
 67: 'it',
 68: 'still',
 69: 'ok',
 70: 'xxx',
 71: 'std',
 72: 'send',
 73: 'å£150',
 74: 'my',
 75: 'brother',
 76: 'is',
 77: 'not',
 78: 'speak',
 79: 'with',
 80: 'me',
 81: '

### Get-word and get-index functions

In [109]:
def get_word(index_to_word, index):
    """
    index_to_word: dictionary
        index-to-word dict
    index: int

    Return the word for the given index. If the index (key) is not in the
    dict, return 'UNK', the unknown token.
    """

    result = index_to_word.get(index, None)

    if result:
        return result
    return UNK_TOKEN
    

In [114]:
def get_index(word_to_index, word):
    """
    word_to_index: dictionary
        word-to-index dict
    word: string

    Return the index of the word from word_to_index. If the word is not in
    word_to_index, return 0, the index of the unknown token.
    """

    result = word_to_index.get(word, None)

    if result:
        return result
    return 0

In [115]:
print(get_word(index_to_word, 12), "\n",
      get_index(word_to_index, "hi"))

la 
 412
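
Both helpers could instead lean on the default argument of `dict.get`. In `get_index`, a stored index of 0 is falsy, so the `if result:` check gives the right answer only because 0 is also the fallback value. A sketch with small hypothetical dictionaries (the `_demo` names are illustrative):

```python
UNK_TOKEN = "UNK"

# Hypothetical miniature vocabulary.
word_to_index_demo = {UNK_TOKEN: 0, "go": 1, "until": 2}
index_to_word_demo = {i: w for w, i in word_to_index_demo.items()}

def get_index_demo(word_to_index, word):
    # Fall back to the unknown token's index for out-of-vocabulary words.
    return word_to_index.get(word, word_to_index[UNK_TOKEN])

def get_word_demo(index_to_word, index):
    # Fall back to the unknown token for indices not in the dict.
    return index_to_word.get(index, UNK_TOKEN)

print(get_index_demo(word_to_index_demo, "go"))
print(get_index_demo(word_to_index_demo, "jurong"))
print(get_word_demo(index_to_word_demo, 99))
```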


### Messages to vectors

In [117]:
vectors = []

for message in messages:
    # Map each word to its index; out-of-vocabulary words become 0 (UNK).
    vector = [get_index(word_to_index, w) for w in message.split()]
    vectors.append(vector)


for  j in range(10):
    print(vectors[j],"\n")

[1, 2, 0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 14, 15, 16, 0, 562] 

[69, 17, 0, 18, 19, 0] 

[124, 20, 7, 21, 22, 23, 24, 25, 26, 0, 27, 28, 0, 0, 29, 0, 30, 0, 25, 0, 25, 31, 20, 0, 32, 0, 33, 0] 

[19, 34, 35, 36, 37, 0, 19, 38, 39, 40, 35] 

[0, 41, 42, 43, 44, 45, 25, 46, 44, 0, 47, 48, 814] 

[0, 49, 15, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 0, 69, 70, 71, 0, 25, 72, 73, 25, 0] 

[320, 74, 75, 76, 77, 61, 25, 78, 79, 80, 81, 82, 80, 61, 0, 0] 

[89, 83, 84, 85, 86, 86, 0, 0, 0, 0, 87, 52, 88, 89, 84, 90, 66, 91, 92, 93, 94, 25, 95, 84, 96, 90] 

[527, 89, 22, 97, 98, 99, 64, 100, 52, 101, 25, 0, 102, 103, 104, 25, 105, 106, 0, 105, 107, 0, 108, 109, 110, 6] 

[259, 84, 111, 112, 113, 114, 115, 19, 116, 117, 25, 118, 25, 119, 120, 121, 122, 79, 123, 66, 124, 106, 119, 111, 118, 125, 124, 126, 0] 
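
As a sanity check, a vector can be decoded back into words. Out-of-vocabulary positions come back as the unknown token. The dictionary below is a small hand-copied subset of the `word_to_index` output shown earlier, used only for illustration.

```python
# Subset of the vocabulary printed earlier (illustrative, not the full dict).
subset = {"UNK": 0, "lar": 17, "wif": 18, "u": 19, "ok": 69}
inverse = {i: w for w, i in subset.items()}

message = "ok lar joking wif u oni"

# Encode: words missing from the vocabulary map to 0.
vector = [subset.get(w, 0) for w in message.split()]

# Decode: index 0 maps back to 'UNK', so rare words are not recoverable.
decoded = " ".join(inverse.get(i, "UNK") for i in vector)

print(vector)
print(decoded)
```

The encoded vector matches the second vector printed above, and the decoded text shows which words ('joking' and 'oni') fell below the frequency threshold.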

