# Building a Spam Filter with Naive Bayes
In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, so we expect more than 80% of new messages to be labeled correctly as spam or ham (non-spam).

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). You can also download the dataset directly [from this link](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection). The data collection process is described in more detail on [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition), where you can also find some of the authors' papers.


## Exploring the Dataset


In [1]:
import pandas as pd
sms_spam=pd.read_csv('SMSSpamCollection',sep='\t',header=None,
                    names=['Label','SMS'])
print(sms_spam.shape)
sms_spam.head()

(5572, 2)


|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
sms_spam.describe()

|   | Label | SMS |
|---|---|---|
| count | 5572 | 5572 |
| unique | 2 | 5169 |
| top | ham | Sorry, I'll call later |
| freq | 4825 | 30 |


In [3]:
sms_spam['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

About 87% of the messages are ham and the remaining 13% are spam. This split looks plausible for building a spam filter, since in practice most messages people receive are ham.

## Training and Test Set
We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
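
The two sizes above follow from simple arithmetic. A minimal sketch, using hypothetical variable names that are not part of the notebook:

```python
# Split sizes for an 80/20 split of the 5,572 messages in the dataset.
n_messages = 5572
n_train = round(n_messages * 0.8)   # 4457.6 rounds up to 4458
n_test = n_messages - n_train       # the remainder goes to the test set

print(n_train, n_test)
```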


In [4]:
# Randomizing the dataset
data_randomized=sms_spam.sample(frac=1,random_state=1)

# Calculating the index for splitting
splitting_index=round(len(data_randomized)*0.8)

# Splitting the dataset
training_set=data_randomized[:splitting_index].reset_index(drop=True)
test_set=data_randomized[splitting_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [5]:
training_set['Label'].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [6]:
test_set['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

## Data Cleaning


In [7]:
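# Replace every non-word character with a space.
# regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default.
training_set['SMS']=training_set['SMS'].str.replace(r'\W',' ',regex=True)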
training_set['SMS']=training_set['SMS'].str.lower()
training_set.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
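
To see what the cleaning step does to a single message, here is a minimal sketch that applies the same pattern with Python's `re` module. The helper name `clean_message` and the sample strings are assumptions made for illustration.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, then lowercase."""
    # r'\W' matches any character that is not a letter, digit, or underscore.
    return re.sub(r'\W', ' ', text).lower()

# Hypothetical messages, not taken from the dataset.
print(clean_message("Free entry, call NOW!!").split())
print(clean_message("there's"))
```

Note that the apostrophe is replaced too, which is why contractions such as "there's" show up as two tokens ("there" and "s") in the cleaned training set.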


## Creating the Vocabulary


In [8]:
vocabulary=[]
training_set['SMS']=training_set['SMS'].str.split()

for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

vocabulary=list(set(vocabulary))    
print(len(vocabulary))


7783


Our vocabulary comprises 7,783 unique words, drawn from all the messages in the training set.
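
The same deduplication can be sketched on a tiny corpus. The messages and the `toy_` names below are hypothetical, and the set comprehension is just a compact alternative to the nested loop above.

```python
# Hypothetical, already-cleaned messages (not taken from the dataset).
toy_messages = [
    "secret prize claim now",
    "coming to my secret party",
    "winner claim your prize now",
]

# Split each message into words and collect them in a set,
# which keeps only one copy of each word.
toy_vocabulary = sorted({word for message in toy_messages for word in message.split()})

print(len(toy_vocabulary))
print(toy_vocabulary)
```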

## The Final Training Set

In [9]:
word_counts_per_sms={unique_word: [0]*len(training_set['SMS']) for unique_word in vocabulary}
for index,sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index]+=1

In [10]:
word_counts=pd.DataFrame(word_counts_per_sms)
word_counts.head()

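|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |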


In [11]:
training_set_clean=pd.concat([training_set,word_counts],axis=1)
training_set_clean.head()

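|   | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |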
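
The transformation in this section turns each tokenized message into one row of word counts. A minimal sketch of the same dictionary-of-lists approach on two hypothetical messages (all `toy_` names are assumptions, not part of the notebook):

```python
# Hypothetical tokenized messages (not taken from the dataset).
toy_messages = [
    ["secret", "prize", "claim", "now"],
    ["secret", "party", "secret", "code"],
]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of counts per vocabulary word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

for word, counts in toy_counts.items():
    print(word, counts)
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`, which is why "secret" ends up with a count of 2 in the second row.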


## Calculating Constants


In [12]:
spam_messages=training_set_clean[training_set_clean['Label']=='spam']
ham_messages=training_set_clean[training_set_clean['Label']=='ham']

p_spam=len(spam_messages)/len(training_set_clean)
p_ham=len(ham_messages)/len(training_set_clean)

n_words_per_spam_message=spam_messages['SMS'].apply(len)
n_spam=n_words_per_spam_message.sum()

n_words_per_ham_message=ham_messages['SMS'].apply(len)
n_ham=n_words_per_ham_message.sum()

n_vocabulary=len(vocabulary)

alpha=1





## Calculating Parameters
Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

The parameters are calculated using the formulas:

$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$
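
To make the formulas concrete, here is a small worked example with made-up counts. The function name `smoothed_probability` and the numbers are assumptions chosen only for illustration.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """P(w_i | class) with additive (Laplace) smoothing."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical counts: 7 words across all spam messages, a 9-word vocabulary.
toy_n_spam = 7
toy_n_vocabulary = 9

seen_twice = smoothed_probability(2, toy_n_spam, toy_n_vocabulary)   # (2 + 1) / (7 + 9)
never_seen = smoothed_probability(0, toy_n_spam, toy_n_vocabulary)   # (0 + 1) / (7 + 9)
print(seen_twice, never_seen)
```

With $\alpha = 1$, a word that never appears in spam still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.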

In [13]:
# Initiate Parameters
parameters_spam={unique_word:0 for unique_word in vocabulary}
parameters_ham={unique_word:0 for unique_word in vocabulary} 

for word in vocabulary:
    n_word_given_spam=spam_messages[word].sum()
    p_word_given_spam=(n_word_given_spam+alpha)/(n_spam+alpha*n_vocabulary)
    parameters_spam[word]=p_word_given_spam
    
    n_word_given_ham=ham_messages[word].sum()
    p_word_given_ham=(n_word_given_ham+alpha)/(n_ham+alpha*n_vocabulary)
    parameters_ham[word]=p_word_given_ham

## Classifying A New Message
Now that we've calculated all the constants and parameters we need, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn)
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn)
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:

  - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
  - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
  - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
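
Under the naive independence assumption, the two quantities compared in these steps are usually written as the proportionalities below. They use the constants and parameters from the previous sections. Because the shared denominator is dropped, the values computed this way are unnormalized scores rather than true probabilities, which is enough for deciding which class scores higher.

$$
P(Spam|w_1, w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i|Spam)
$$

$$
P(Ham|w_1, w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i|Ham)
$$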
      
      

In [14]:
import re

def classify(message):
    # message is a string    
    # Raw string avoids Python's invalid-escape warning for '\W'
    message=re.sub(r'\W',' ',message)
    message=message.lower()
    message=message.split()
    
    p_spam_given_message=p_spam
    p_ham_given_message=p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *=parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *=parameters_ham[word]
            
    print('P(Spam|Message)',p_spam_given_message)
    print('P(Ham|Message)',p_ham_given_message)
    
    if p_spam_given_message<p_ham_given_message:
        print('Label:Ham')
    elif p_spam_given_message>p_ham_given_message:
        print('Label:Spam')
    else:
        print('Equal probabilities, needs to be classified by a human')
        
    

In [15]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|Message) 1.3481290211300841e-25
P(Ham|Message) 1.9368049028589875e-27
Label:Spam


In [16]:
classify('Sounds good, Tom, then see u there')

P(Spam|Message) 2.4372375665888117e-25
P(Ham|Message) 3.687530435009238e-21
Label:Ham
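
The printed scores are already very small (around 1e-21 to 1e-27) for these short messages. SMS texts are short enough that this works, but for much longer texts a product of many small probabilities can underflow to zero. A common alternative, which this notebook does not use, is to add logarithms instead of multiplying. The sketch below uses hypothetical parameters and a hypothetical helper named `log_score`.

```python
import math

def log_score(words, prior, params):
    """Sum of logs instead of a product of probabilities."""
    score = math.log(prior)
    for word in words:
        if word in params:          # skip words outside the vocabulary
            score += math.log(params[word])
    return score

# Hypothetical parameters: 100 words, each with probability 1e-5.
toy_params = {'w%d' % i: 1e-5 for i in range(100)}
toy_words = list(toy_params)

product = 0.5
for word in toy_words:
    product *= toy_params[word]

print(product)                                  # the plain product underflows
print(log_score(toy_words, 0.5, toy_params))    # the log score stays finite
```

Since the logarithm is monotonic, comparing log scores gives the same label as comparing the products whenever the products are representable.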


## Measuring the Spam Filter's Accuracy


In [17]:
def classify_test_set(message):

    # Raw string avoids Python's invalid-escape warning for '\W'
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [18]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

|   | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [22]:
correct=0
total=len(test_set['Label'])

for row in test_set.iterrows():
    row=row[1]
    if row['Label']==row['predicted']:
        correct+=1
print('Correct',correct)
print('Incorrect',total-correct)
print('Accuracy',(correct)/total*100)

Correct 1100
Incorrect 14
Accuracy 98.74326750448833
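
Accuracy here is simply the share of messages whose predicted label matches the true label. A minimal sketch of that calculation on hypothetical labels (the `toy_` lists are made up and are not the notebook's test set):

```python
# Hypothetical true labels and predictions.
toy_labels      = ['ham', 'spam', 'ham', 'ham', 'spam']
toy_predictions = ['ham', 'spam', 'ham', 'spam', 'spam']

# Count the positions where the prediction matches the label.
toy_correct = sum(label == predicted
                  for label, predicted in zip(toy_labels, toy_predictions))
toy_accuracy = toy_correct / len(toy_labels) * 100

print(toy_correct, toy_accuracy)
```

On the real test set the same ratio is 1,100 correct out of 1,114, which is the 98.74% reported above and well over the 80% goal set at the start. With pandas, an equivalent one-liner would be `(test_set['Label'] == test_set['predicted']).mean() * 100`.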
