# Building a Spam Filter with Naive Bayes
Building a spam filter for SMS messages is a practical application of machine learning: it helps users avoid being bombarded with unwanted messages.

![](https://www.informaticsinc.com/application/files/4515/2718/5763/iStock-122143117.jpg)

In this project, we will build a spam filter for SMS messages. The aim is to train a computer to classify SMS messages as either spam or non-spam with an accuracy greater than 80%. This will be done by:
- leveraging human knowledge of how messages are classified,
- using that knowledge to estimate the probability of a new message being either spam or non-spam,
- having the computer classify messages as either spam or non-spam based on the probability values.

To achieve this, we will use the **Multinomial Naive Bayes** algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans. Tiago A. Almeida and José María Gómez Hidalgo compiled the [dataset](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection), which is available for download from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

> _If you want to learn more about how the data used in this project was collected, you can visit [this page](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition). The page also lists papers by the creators of the dataset, which you may find useful._

> _Due to the nature of spam messages, the dataset contains content that may be offensive to some users._

The dataset comprises SMS messages that have already been labelled as either spam or non-spam, and it will be used to train the algorithm. Once the algorithm is trained, it can be used to classify new messages as spam or non-spam.
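
For orientation, the probability comparison that a Naive Bayes classifier performs can be sketched as follows. The notation is chosen for this sketch only: $w_1, \dots, w_n$ stand for the words of a new message, and "Ham" stands for non-spam.

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})
$$

$$
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$

The message is labelled spam when the first quantity is larger than the second, and non-spam otherwise. The "naive" part is the assumption that words occur independently of one another given the class, which is what allows the product over individual words.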

## Importing Libraries

In [1]:
import requests
import zipfile
import io
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set the default behaviour of plots
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Exploring the Dataset

### 1. Downloading the dataset
First, we will download the file from the URL using the [requests](https://pypi.org/project/requests/) library, then extract its contents into a local folder using the [zipfile](https://docs.python.org/3/library/zipfile.html) library.

In [2]:
# Pull the data from the URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
response = requests.get(url)
response.raise_for_status()  # Stop early if the download failed

# Extract the contents into a folder locally
folder_name = url.split('/')[-1][:-4]
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall("./" + folder_name)

# Check if the extraction process was successful
! ls

notebook.ipynb    smsspamcollection


### 2. Checking the file structure
Next, we will check how the file contents are structured (by displaying the first five rows of the file). Understanding how the data is structured makes it easier to work with the data in Pandas.

In [3]:
# Preview from the command line
! head -n 5 smsspamcollection/SMSSpamCollection

ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though


The data lacks a header row, and the columns are separated by tabs. We can account for these structural attributes when reading the data with Pandas.

### 3. Exploring with Pandas
Finally, we can read the dataset and explore its attributes further:

In [4]:
# Read-in the data with pandas
df = pd.read_csv(
    "./smsspamcollection/SMSSpamCollection",
    sep="\t",
    header=None,
    names=["label", "SMS"],
)

# Explore dataset attributes
print(f'STRUCTURE:\nThe SMS dataset has {df.shape[0]} rows and {df.shape[1]} columns.\n')
print(f'COMPLETENESS:\nThe dataset has {df.isnull().sum().sum()} null values.\n')

print('PREVIEW:\nPrinting the first five rows of the dataset...')
df.head()

STRUCTURE:
The SMS dataset has 5572 rows and 2 columns.

COMPLETENESS:
The dataset has 0 null values.

PREVIEW:
Printing the first five rows of the dataset...


|   | label | SMS |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [5]:
# Check the proportion of spam vs ham messages
print('SPAM VS HAM (NON-SPAM):\n\
Printing the proportion of spam vs ham messages...')
df.label.value_counts(normalize= True).to_frame(name='proportion')

SPAM VS HAM (NON-SPAM):
Printing the proportion of spam vs ham messages...


|      | proportion |
|------|------------|
| ham  | 0.865937 |
| spam | 0.134063 |


Of the SMS messages in the dataset, about 87% are labelled "ham" and the remaining 13% are labelled "spam". This is consistent with the common experience that most of the messages people receive are legitimate ("ham") rather than unsolicited ("spam"), although the proportions alone do not prove that the sample is representative.

With that caveat, the dataset looks appropriate for training our filter to distinguish between spam and ham messages.

## Creating the Training and Test Datasets
We found that about 87% of the messages in the dataset are ham and 13% are spam. To build a spam filter, it is helpful to first design a test to ensure that the filter works effectively. 

We will split our dataset into two subsets as follows:
1. A training set for teaching the computer how to classify messages.
2. A test set for evaluating the performance of the spam filter. 
> _We'll keep **80%** for training and **20%** for testing._ 

Our aim is to create a spam filter that can classify new messages with an accuracy greater than 80%. To test this, we will compare the algorithm's classifications with the human labels on the test set.
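
The comparison described above can be sketched as follows. This is a minimal illustration, not the notebook's final evaluation code: the `predicted` column and the toy rows are hypothetical, and the helper names `accuracy` and `majority_baseline` are introduced here only for the example. The second helper shows the accuracy that a trivial rule (always predicting the most frequent label) would reach, which is a useful reference point when one class dominates.

```python
import pandas as pd


def accuracy(actual: pd.Series, predicted: pd.Series) -> float:
    """Share of messages whose predicted label matches the human label."""
    return float((actual == predicted).mean())


def majority_baseline(actual: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return float(actual.value_counts(normalize=True).max())


# Hypothetical toy data: 'label' mimics the human labels,
# 'predicted' is made up for illustration only
toy = pd.DataFrame({
    "label":     ["ham", "ham", "spam", "ham", "spam"],
    "predicted": ["ham", "ham", "spam", "spam", "spam"],
})

print(f"Accuracy: {accuracy(toy['label'], toy['predicted']):.2f}")
print(f"Majority-class baseline: {majority_baseline(toy['label']):.2f}")
```

On a dataset where about 87% of the messages are ham, the majority-class baseline is itself about 87%, so it is worth reading the 80% target alongside that figure.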

In [6]:
# Randomize entire dataframe
randomized_df = df.sample(frac=1, random_state=1)

# Create training and test set
num_records = randomized_df.shape[0]
train_set_size = round(num_records * 0.8)

train_set = randomized_df.iloc[:train_set_size].reset_index(drop=True)
test_set = randomized_df.iloc[train_set_size:].reset_index(drop=True)

# Verify training set as 80% and test set as 20% of dataframe
assert train_set.shape[0] == round(0.8 * randomized_df.shape[0])
assert test_set.shape[0] == round(0.2 * randomized_df.shape[0])
print(f'The training set has {train_set.shape[0]} records.')
print(f'The test set has {test_set.shape[0]} records.')

The training set has 4458 records.
The test set has 1114 records.


In the next step, we will examine the proportion of spam and ham messages in both the training and test sets. We anticipate that these proportions will be similar to the ones found in the complete dataset, where approximately 87% of the messages are ham and the remaining 13% are spam.

In [7]:
print("Calculating spam vs ham proportions...")

pd.concat(
    [
        df.label.value_counts(normalize=True).round(2),
        train_set.label.value_counts(normalize=True).round(2),
        test_set.label.value_counts(normalize=True).round(2),
    ],
    axis=1,
    keys=['full_dataset', 'train_set', 'test_set'],
)

Calculating spam vs ham proportions...


|      | full_dataset | train_set | test_set |
|------|--------------|-----------|----------|
| ham  | 0.87 | 0.87 | 0.87 |
| spam | 0.13 | 0.13 | 0.13 |


Since the proportions in both subsets match those of the full dataset, the random split appears representative, and we can proceed.
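
Had the proportions diverged, one alternative would be a stratified split, which samples the same fraction from each label so the class balance is preserved by construction. The sketch below is an assumption-laden illustration rather than part of the original workflow: `stratified_split` is a hypothetical helper, the toy data is made up, and it relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later) and on the dataframe having a unique index.

```python
import pandas as pd


def stratified_split(data: pd.DataFrame, label_col: str = "label",
                     train_frac: float = 0.8, seed: int = 1):
    """Sample train_frac of the rows within each label; the rest form the test set."""
    # Sampling per group keeps the label proportions the same in both subsets
    train = data.groupby(label_col).sample(frac=train_frac, random_state=seed)
    # Rows not drawn for training become the test set (requires a unique index)
    test = data.drop(train.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Hypothetical toy data with an 80/20 ham-to-spam ratio
toy = pd.DataFrame({
    "label": ["ham"] * 80 + ["spam"] * 20,
    "SMS": [f"message {i}" for i in range(100)],
})

toy_train, toy_test = stratified_split(toy)
print(toy_train["label"].value_counts(normalize=True))
print(toy_test["label"].value_counts(normalize=True))
```

Note that the training rows returned this way are grouped by label rather than shuffled, which does not matter for counting-based methods but may matter for order-sensitive ones.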

## Cleaning the Dataset