# Building a Spam Filter Using NLP and Naive Bayes

In this project, I'm going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. My goal is to write a program that classifies new messages with an accuracy greater than 80%, meaning that more than 80% of new messages should be classified correctly as spam or ham (non-spam).

To train the algorithm, I'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository.


In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
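# The NLTK stop-word corpus must be available locally; run nltk.download('stopwords') once if it is missing
stopwords=nltk.corpus.stopwords.words('english')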

In [None]:
# Read the dataset
df=pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])
print(df.shape)
df.head()

In [None]:
# About 87% of the messages are ham, and the remaining 13% are spam
df['Label'].value_counts(normalize=True)

In [None]:
# Randomize the dataset
df=df.sample(frac=1,random_state=100)
# Training/test split (80% for training and 20% for test)
training_set=df[:4458].reset_index(drop=True)
# Start at 4458 so that no row is skipped between the two sets
test_set=df[4458:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)
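
The cut-off of 4,458 is 80% of 5,572 rounded to the nearest row. Here is a minimal sketch of deriving that index from the data instead of hard-coding it. The ten-row `toy_df` and the names `split_index`, `toy_training` and `toy_test` are hypothetical stand-ins, not part of the real dataset.

```python
import pandas as pd

# Toy stand-in for the SMS dataset (made-up rows, for illustration only)
toy_df = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham'],
    'SMS': [f'message {i}' for i in range(10)],
})

# Shuffle, then derive the cut-off from the number of rows
shuffled = toy_df.sample(frac=1, random_state=100)
split_index = round(len(shuffled) * 0.8)

# Slicing with the same index on both sides guarantees no row is lost
toy_training = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

print(toy_training.shape, toy_test.shape)
```

Because both slices share one index, the two sets always add up to the full dataset.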

In [None]:
# Both sets should keep roughly the same ham/spam proportions as the full dataset
print(training_set['Label'].value_counts(normalize=True))
print(test_set['Label'].value_counts(normalize=True))

## Data Cleaning


In [None]:
# Before cleaning
training_set.head()

In [None]:
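# After cleaning (removing punctuation and converting to lower case)
# regex=True is required because recent pandas versions treat the pattern as a literal string by default
training_set['SMS']=training_set['SMS'].str.replace(r'\W+',' ',regex=True).str.lower()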
training_set.head()
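
To make the effect of the `\W+` pattern concrete, here is a small sketch that applies the same cleaning to a single string. The `sample` message is made up for illustration.

```python
import re

# A made-up message, not taken from the dataset
sample = "WINNER!! You've won 1,000 cash. Call now!!!"

# Replace every run of non-word characters with one space, then lower-case
cleaned = re.sub(r'\W+', ' ', sample).lower()
tokens = cleaned.split()

print(tokens)
```

Note that the apostrophe and the thousands separator are treated as separators too, so `You've` becomes `you` and `ve`, and `1,000` becomes `1` and `000`.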

In [None]:
training_set['SMS']=training_set['SMS'].str.split()

# Remove stop words (a set keeps the membership test fast)
stopword_set=set(stopwords)
training_set['SMS']=training_set['SMS'].apply(lambda words: [word for word in words if word not in stopword_set])

In [None]:
# Creating a vocabulary list containing unique words
vocabulary=list({word for row in training_set['SMS'] for word in row if word not in stopwords})
len(vocabulary)
# There are 7,783 unique words in all the messages of the training set

In [None]:
# Creating a new dataset that counts each word in each message
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        # A dictionary lookup is much faster than scanning the vocabulary list
        if word in word_counts_per_sms:
            word_counts_per_sms[word][index] += 1

In [None]:
word_counts_df=pd.DataFrame(word_counts_per_sms)
word_counts_df.head()
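
The resulting table has one row per message and one column per vocabulary word. A minimal sketch of the same counting logic on two made-up messages shows the shape more clearly. `toy_messages`, `toy_vocabulary` and `toy_counts_df` are hypothetical names used only here.

```python
import pandas as pd

# Two made-up, already tokenized messages
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'soon']]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        counts[word][index] += 1

toy_counts_df = pd.DataFrame(counts)
print(toy_counts_df)
```

Here the `free` column holds 2 for the first message and 0 for the second, which is exactly the kind of count the parameter calculation sums up later.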

In [None]:
# Concatenate the two data sets (training_set and word_counts_df)
training_set_clean=pd.concat([training_set,word_counts_df],axis=1)
training_set_clean.head()

## Calculating Parameters

In [None]:
# Prior probabilities of spam and ham (indexed by label rather than by position)
label_shares=training_set_clean['Label'].value_counts(normalize=True)
p_spam=label_shares['spam']
p_ham=label_shares['ham']

In [None]:
# Number of words in spam messages
n_spam=0
for row in training_set_clean['SMS'][training_set_clean['Label']=='spam']:
    n_spam += len(row)

# Number of words in ham messages
n_ham=0
for row in training_set_clean['SMS'][training_set_clean['Label']=='ham']:
    n_ham += len(row)

# Number of unique words
n_vocabulary=len(vocabulary)
# Laplace (add-one) smoothing constant
alpha=1

In [None]:
# Isolating spam and ham messages
spam_messages=training_set_clean[training_set_clean['Label']=='spam']
ham_messages=training_set_clean[training_set_clean['Label']=='ham']

# Initialize parameters
parameter_spam={}
parameter_ham={}

# Calculate parameters
for word in vocabulary:
    # Calculate probability of a word given spam messages
    n_word_given_spam=spam_messages[word].sum()
    p_word_given_spam=(n_word_given_spam+alpha)/(n_spam+alpha*n_vocabulary)
    parameter_spam[word]=p_word_given_spam
    
    # Calculate probability of a word given ham messages
    n_word_given_ham=ham_messages[word].sum()
    p_word_given_ham=(n_word_given_ham+alpha)/(n_ham+alpha*n_vocabulary)
    parameter_ham[word]=p_word_given_ham
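
To see what the smoothing term does, here is a small worked example of the same formula with made-up counts. The four-word vocabulary and the numbers are assumptions chosen so the arithmetic is easy to follow, and `Fraction` is used only to keep the results exact.

```python
from fractions import Fraction

alpha = 1
n_vocabulary = 4   # toy vocabulary: free, prize, see, soon
n_spam = 3         # total number of words across the toy spam messages

# How often each vocabulary word appears in the toy spam messages
spam_word_counts = {'free': 2, 'prize': 1, 'see': 0, 'soon': 0}

# Same formula as above: (count + alpha) / (n_spam + alpha * n_vocabulary)
p_word_given_spam = {
    word: Fraction(count + alpha, n_spam + alpha * n_vocabulary)
    for word, count in spam_word_counts.items()
}

for word, probability in p_word_given_spam.items():
    print(word, probability)
```

Words that never occur in spam (`see`, `soon`) still get a probability of 1/7 instead of 0, so a single unseen word cannot wipe out the whole product, and the four probabilities still sum to 1.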
    

## Creating a Function to Classify a New Message

In [None]:
# Create a classifying function
def classify_test_set(message):
    message = re.sub(r'\W+', ' ', message)
    message = message.lower()
    message = message.split()
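    # Drop stop words, mirroring the cleaning applied to the training set
    message = [word for word in message if word not in stopwords]

    p_spam_given_message=p_spam
    p_ham_given_message=p_ham
    for word in message:
        # Words never seen in training are ignored
        if word in parameter_spam: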
            p_word_given_spam=parameter_spam[word]
            p_spam_given_message *= p_word_given_spam
            
            p_word_given_ham=parameter_ham[word]
            p_ham_given_message *= p_word_given_ham
      
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
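
One caveat with multiplying many small probabilities is floating-point underflow: for a very long message both products can round to 0.0 and the function falls into the tie branch. A common workaround is to add logarithms instead. The sketch below is a variant under that assumption, not the function used in this notebook. The `toy_` parameters are made-up values, and `classify_log` is a hypothetical name.

```python
import math

# Made-up parameters for illustration (not computed from the dataset)
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameter_spam = {'free': 0.05, 'prize': 0.04, 'meeting': 0.001}
toy_parameter_ham = {'free': 0.005, 'prize': 0.001, 'meeting': 0.02}

def classify_log(words):
    """Compare log-probabilities instead of raw products."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        if word in toy_parameter_spam:
            log_spam += math.log(toy_parameter_spam[word])
            log_ham += math.log(toy_parameter_ham[word])
    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log(['free', 'prize']))
print(classify_log(['meeting']))
print(classify_log(['free'] * 500))
```

With 500 repetitions of `free`, the raw products `0.05 ** 500` and `0.005 ** 500` both underflow to 0.0, while the log version still separates the two classes.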

## Measuring the Spam Filter's Accuracy Using the Test Set

In [None]:
test_set['predicted']= test_set['SMS'].apply(classify_test_set)

# The accuracy is close to 98.74%, well above the 80% goal.
accuracy_score(test_set['Label'],test_set['predicted'])
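
Since about 87% of the messages are ham, accuracy alone can hide how well spam itself is caught. As a hedged sketch, the confusion matrix, precision and recall from scikit-learn give a more detailed view. The labels below are made up for illustration and are not results from this dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels (not output of the filter above)
toy_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham']
toy_pred = ['ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham']

# Rows are true labels, columns are predicted labels, in the order given
matrix = confusion_matrix(toy_true, toy_pred, labels=['ham', 'spam'])

# Precision: share of predicted spam that really is spam
precision = precision_score(toy_true, toy_pred, pos_label='spam')
# Recall: share of real spam that was caught
recall = recall_score(toy_true, toy_pred, pos_label='spam')

print(matrix)
print(precision, recall)
```

The same calls should work on `test_set['Label']` and `test_set['predicted']`, provided any `'needs human classification'` rows are handled first.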

## An Alternative Method (Using CountVectorizer and MultinomialNB from scikit-learn)

In [None]:
df=pd.read_csv('SMSSpamCollection',sep='\t',header=None,names=['Label','SMS'])

In [None]:
X=df['SMS']
y=df['Label']

# Split the data into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=101)

In [None]:
pipeline=Pipeline([
    ('cv',CountVectorizer(stop_words=stopwords)),
    ('nb',MultinomialNB()),
])

In [None]:
pipeline.fit(X_train,y_train)
predictions=pipeline.predict(X_test)

In [None]:
accuracy_score(y_test,predictions)

## Summary

Using CountVectorizer and MultinomialNB yields a similar result in far fewer steps. In addition, the pipeline makes it easy to swap in other models such as random forests or k-nearest neighbors.
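
As a rough illustration of swapping the model, the sketch below replaces the `'nb'` step with a random forest. It is fitted on a handful of made-up messages so that it runs on its own. The toy data, the `rf_pipeline` name, the hyperparameters and the use of scikit-learn's built-in `'english'` stop-word list (instead of the NLTK list used above) are all assumptions for this example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up training messages (not from the SMS dataset)
toy_texts = [
    'Free prize! Call now to claim your cash',
    'Are we still meeting for lunch today',
    'Win cash now, text WIN to claim',
    'Can you pick up milk on the way home',
    'Urgent: claim your free prize today',
    'See you at the gym tonight',
]
toy_labels = ['spam', 'ham', 'spam', 'ham', 'spam', 'ham']

# Same two-step structure as before; only the final estimator changes
rf_pipeline = Pipeline([
    ('cv', CountVectorizer(stop_words='english')),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=101)),
])

rf_pipeline.fit(toy_texts, toy_labels)
prediction = rf_pipeline.predict(['free prize waiting, call now'])
print(prediction)
```

On the real data, the same pipeline could be fitted with `X_train` and `y_train` and scored with `accuracy_score` for a direct comparison with Naive Bayes.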