# Naive Bayes SPAM detection model


Create a Naive Bayes SPAM detection model for the SMS SPAM Collection.


# SMS SPAM Collection

The SMS SPAM Collection is a corpus of real text messages (SMS messages) that have been classified as either SPAM or HAM (i.e. not SPAM). The corpus contains 5,574 documents, 747 of which are SPAM and 4,827 of which are HAM. You can find the readme for the corpus [here](https://storage.googleapis.com/wd13/SMSSpamCollectionReadme).

The following code downloads a copy of the SMS SPAM Corpus and saves it in a variable `sms_corpus`.

In [None]:
import urllib.request

# each line of the file has the form "<label><TAB><text>"; split it into [label, text]
sms_corpus = []
with urllib.request.urlopen("https://storage.googleapis.com/wd13/SMSSpamCollection.txt") as url:
  for line in url.readlines():
    sms_corpus.append(line.decode().split('\t'))

`sms_corpus` is a list. Each element is a two-item list of the form `[label, text]`, where the label is `'ham'` or `'spam'` and the text is the message itself.

In [None]:
# print the text and label of document 16
docid = 16
print(sms_corpus[docid])


['ham', "Oh k...i'm watching here:)\n"]
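
The printed text still ends with the newline character from the source file. A minimal sketch of a parser that strips it and splits only on the first tab could look like the following. `parse_sms_line` is a hypothetical helper that is not part of the notebook, and it assumes every line has the label, a tab, and then the text.

```python
def parse_sms_line(raw_line):
    # Hypothetical helper: assumes the layout label, tab, text, line ending.
    # Drop the trailing line ending, then split on the first tab only,
    # so a tab inside the message text would not create extra fields.
    label, text = raw_line.rstrip("\r\n").split("\t", 1)
    return label, text


sample = "ham\tOh k...i'm watching here:)\n"
print(parse_sms_line(sample))
```

Whether to strip the newline is a design choice; the tokenizer used later ignores whitespace, so the scores do not depend on it.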


In [None]:
# print the label of document 16
docid = 16
print(sms_corpus[docid][0])

ham


In [None]:
# print the text of document 0
docid = 0
print(sms_corpus[docid][1])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...



# Create a tokenizer

Write a function `tokenize` that takes a string and returns a list of tokens.

In [None]:

import re

# old tokenizer from homework 2 (kept inside a string literal, so it is never executed)
"""
def tokenize(doc):
  my_list=[]
  filtered_list=[]
  doc= doc.lower()
  newDocString = doc.split(' ')

  for x in newDocString:
      my_list.append(re.sub('[^A-Za-z0-9]+', '', x))

  filtered_list = list(filter(lambda x: len(x) > 3, my_list))
  return filtered_list
"""
# new tokenizer
# builds the token list with a regular expression
def tokenize(doc):
  doc = doc.lower()
  # \w+ matches runs of letters, digits, and underscores
  tokens = re.findall(r'\w+', doc)
  # drop single-character tokens
  filtered_list = list(filter(lambda x: len(x) > 1, tokens))
  return filtered_list
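
To see what the tokenizer keeps, it can be run on the text of document 16 shown earlier. The block below repeats the definition so that it runs on its own; the second sample message is made up for illustration.

```python
import re

def tokenize(doc):
    # Same logic as the notebook's tokenizer: lowercase, split on
    # non-word characters, drop single-character tokens.
    tokens = re.findall(r'\w+', doc.lower())
    return [t for t in tokens if len(t) > 1]

print(tokenize("Oh k...i'm watching here:)\n"))  # ['oh', 'watching', 'here']
print(tokenize("FREE entry 2 win"))              # ['free', 'entry', 'win']
```

Note that `i'm` is split at the apostrophe into `i` and `m`, and both pieces are then dropped by the length filter.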


# Import log function

In [None]:
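import math
from math import log

def safe_log(x, y):
    # log(x / y), or 0 when either value is non-positive, so that a token
    # missing from one class adds nothing instead of raising an error
    if x <= 0 or y <= 0:
        return 0
    return math.log(x / y)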
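
Returning 0 treats a token that never appears in one class as uninformative. A common alternative is add-one (Laplace) smoothing, which keeps both rates strictly positive. The sketch below is an assumption about how that could look here, not the method from class; `smoothed_log_ratio` and its `alpha` parameter are hypothetical names, and the counts in the example call are made up.

```python
import math

def smoothed_log_ratio(spam_count, ham_count, total_spam, total_ham, alpha=1.0):
    # Smoothed per-class document rates. A token is either present or
    # absent in a document, so the denominator grows by 2 * alpha.
    p_token_given_spam = (spam_count + alpha) / (total_spam + 2 * alpha)
    p_token_given_ham = (ham_count + alpha) / (total_ham + 2 * alpha)
    return math.log(p_token_given_spam / p_token_given_ham)

# A token counted in 10 SPAM messages and no HAM messages (made-up counts)
print(smoothed_log_ratio(10, 0, 747, 4827))
```

With smoothing, a token seen only in SPAM gets a large but finite positive value rather than 0.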

# Calculate token scores

Calculate scores for every token in the corpus, using the method discussed in class. Store these scores in a dictionary called `token_scores`.

In [None]:
token_scores = {}
token_list=[]

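# collect every distinct token in the corpus, in order of first appearance
for i in range(len(sms_corpus)):
  for j in tokenize(sms_corpus[i][1]):
    if j not in token_list:
      token_list.append(j)

# for each token, store [spam count, ham count]: the number of documents
# of each class whose text contains the token
# NOTE: `token in text` is a case-sensitive substring test on the raw message,
# so the token 'in' also matches 'point'; the counts printed in the
# Discussion section reflect this
for i in range(len(token_list)):
  spam=0
  ham=0
  for j in range(len(sms_corpus)):
    if token_list[i] in sms_corpus[j][1]:
      if sms_corpus[j][0]=='spam':
        spam+=1
      else:
        ham+=1

  token_scores[token_list[i]]=[spam,ham]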
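
The loop above tests `token_list[i] in sms_corpus[j][1]`, which checks for a substring of the raw text rather than for a whole token. A sketch of counting whole tokens instead, in a single pass over the corpus, might look like this. `count_token_documents` and `toy_corpus` are hypothetical names, and the three messages are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(doc):
    # Same tokenizer as in the notebook, repeated so this block runs alone.
    return [t for t in re.findall(r'\w+', doc.lower()) if len(t) > 1]

def count_token_documents(corpus):
    # Map each token to [spam_docs, ham_docs], counting whole tokens only.
    counts = defaultdict(lambda: [0, 0])
    for label, text in corpus:
        column = 0 if label == 'spam' else 1
        # A set counts each token at most once per document.
        for token in set(tokenize(text)):
            counts[token][column] += 1
    return dict(counts)

toy_corpus = [
    ['spam', 'Win a free prize now\n'],
    ['ham', 'Going to the point now\n'],
    ['ham', 'Are you free in the evening\n'],
]
toy_scores = count_token_documents(toy_corpus)
print(toy_scores['free'])  # [1, 1]
print(toy_scores['in'])    # [0, 1]
```

With the substring test, `in` would also be counted for "Win", "Going", and "point"; with token sets it is counted only for the message that contains the word itself.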



# Create a score message function

Write a function `score_message` that takes an SMS message `doc` and returns a SPAM score, using the method discussed in class.

In [None]:
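# corpus sizes, as given in the SMS SPAM Collection description
totalDocuments=5574
TotalSpamDoc=747
TotalHamDoc=4827
P_spam=TotalSpamDoc/totalDocuments
P_ham=TotalHamDoc/totalDocuments

def score_message(doc):
  # start from the prior log-odds of SPAM versus HAM
  score=log(P_spam/P_ham)
  tokenized_list=tokenize(doc)

  for key in tokenized_list:
    # tokens never seen in the corpus carry no evidence, so skip them
    if key not in token_scores:
      continue
    # add the log ratio of the token's SPAM and HAM document rates
    score += safe_log(token_scores[key][0]/TotalSpamDoc,token_scores[key][1]/TotalHamDoc)

  return score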
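
Read off the code, the score computed by `score_message` can be written as a single formula. The symbols are assumptions introduced here for illustration: $n_{\text{spam}}(t)$ and $n_{\text{ham}}(t)$ stand for `token_scores[t][0]` and `token_scores[t][1]`, and $N_{\text{spam}}$ and $N_{\text{ham}}$ stand for `TotalSpamDoc` and `TotalHamDoc`.

```latex
\operatorname{score}(d)
  = \log\frac{P(\text{spam})}{P(\text{ham})}
  + \sum_{t \,\in\, \operatorname{tokens}(d)}
    \log\frac{n_{\text{spam}}(t) / N_{\text{spam}}}{n_{\text{ham}}(t) / N_{\text{ham}}}
```

A term in the sum is replaced by 0 whenever `safe_log` falls back to 0.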

# Score some messages



In [None]:
print(score_message('hello bob'))
print(score_message('free tv'))

-1.8659152505276757
0.5929495747147882
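
Under the naive Bayes independence assumption, the score can be read as the log-odds of SPAM versus HAM, so a positive score favors SPAM and a negative one favors HAM. Assuming that reading, the sketch below converts the two scores printed above into probabilities with the logistic function; `log_odds_to_probability` is a hypothetical helper name.

```python
import math

def log_odds_to_probability(score):
    # Inverse of log(p / (1 - p)), i.e. the logistic function
    return 1.0 / (1.0 + math.exp(-score))

recorded_scores = [
    ('hello bob', -1.8659152505276757),
    ('free tv', 0.5929495747147882),
]
for message, score in recorded_scores:
    print(message, round(log_odds_to_probability(score), 3))
```

The score for `'hello bob'` matches the prior term log(747/4827), which is about -1.8659, so neither token appears to have added any evidence to that message.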


# Discussion

What tokens are most predictive of a message being SPAM?

In [None]:
from operator import itemgetter

# rank tokens by the number of SPAM documents counted for them
# NOTE: this is a raw count, so tokens that are common in every message rank high
token_score_spam={}
for i in token_scores:
  token_score_spam[i]=token_scores[i][0]

res = dict(sorted(token_score_spam.items(), key = itemgetter(1), reverse = True)[:10])
# single-character tokens such as 'e' and 'o' do not appear because the tokenizer filters them out
print("Top 10 tokens are most predictive of a message being SPAM: "+str(res))

Top 10 tokens are most predictive of a message being SPAM: {'in': 536, 'to': 536, 'ou': 520, 'on': 495, 'er': 488, 're': 472, 'al': 454, 'll': 437, 'or': 435, 'ur': 415}


What tokens are most predictive of a message being HAM?

In [None]:
from operator import itemgetter

# rank tokens by the number of HAM documents counted for them
# NOTE: this is a raw count, so tokens that are common in every message rank high
token_score_ham={}
for i in token_scores:
  token_score_ham[i]=token_scores[i][1]

res = dict(sorted(token_score_ham.items(), key = itemgetter(1), reverse = True)[:10])
# single-character tokens such as 'e' and 'o' do not appear because the tokenizer filters them out
print("Top 10 tokens are most predictive of a message being HAM: "+str(res))

Top 10 tokens are most predictive of a message being HAM: {'in': 2633, 'th': 2281, 'ou': 2150, 'he': 2103, 're': 1971, 'an': 1943, 'er': 1879, 'at': 1811, 'to': 1716, 'on': 1711}
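
Both lists above rank tokens by raw document counts, which is why several entries appear in both. One way to measure how strongly a token points toward one class is to rank by the log ratio of its SPAM and HAM rates instead. The sketch below illustrates this on made-up counts; `toy_token_scores`, `log_ratio`, and the smoothing parameter `alpha` are hypothetical and are not taken from the notebook's results.

```python
import math

# Made-up [spam_docs, ham_docs] counts, for illustration only
toy_token_scores = {
    'to': [500, 1700],
    'free': [150, 50],
    'claim': [100, 1],
    'home': [2, 160],
    'later': [1, 130],
}
TOTAL_SPAM, TOTAL_HAM = 747, 4827

def log_ratio(counts, alpha=1.0):
    # Add-one smoothing keeps both rates positive for one-sided tokens.
    spam_docs, ham_docs = counts
    p_spam = (spam_docs + alpha) / (TOTAL_SPAM + 2 * alpha)
    p_ham = (ham_docs + alpha) / (TOTAL_HAM + 2 * alpha)
    return math.log(p_spam / p_ham)

ranked = sorted(toy_token_scores,
                key=lambda t: log_ratio(toy_token_scores[t]),
                reverse=True)
print("Most SPAM-like:", ranked[:2])
print("Most HAM-like:", ranked[-2:][::-1])
```

In this toy data `to` has the highest raw SPAM count but falls in the middle of the ranking, because it is also frequent in HAM.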
