## CLASSIFICATION OF MESSAGES AS EITHER SPAM OR NON-SPAM

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%. In other words, more than 80% of new messages should be classified correctly as spam or ham (non-spam).
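
For reference, multinomial Naive Bayes with additive (Laplace) smoothing is commonly written as shown below. The notation is introduced here for illustration and is not taken from the notebook itself: $w_1, \dots, w_n$ are the words of a message, $N_{w_i \mid Spam}$ is the number of times $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, $N_{Vocabulary}$ is the number of unique words in the vocabulary, and $\alpha$ is the smoothing parameter (often set to 1).

$$
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \prod_{i=1}^{n} P(w_i \mid Spam),
\qquad
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

The same expressions hold with Ham in place of Spam, and a message is assigned to whichever class gives the larger value.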


In [40]:
# Import the libraries needed for the analysis
import pandas as pd
import numpy as np

In [41]:
#read the file
text = pd.read_csv("SMSSpamCollection",sep='\t',header=None,names=['Label', 'SMS'])
text.head(3)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


In [42]:
#Find the number of rows and columns
text.shape

(5572, 2)

In [43]:
text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [44]:
# Find the proportion of messages labelled as spam or ham (non-spam)
text['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

### SPLIT THE DATA INTO TRAINING AND TEST SETS
We're now going to split our dataset into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%.

In [45]:
#Randomize the dataset and use random state to ensure results are reproducible

text2=text.sample(frac=1,random_state=1)
text2.head(2)

Unnamed: 0,Label,SMS
1078,ham,"Yep, by the pretty sculpture"
4028,ham,"Yes, princess. Are you going to make me moan?"


In [46]:

# Calculate index for split
training_test_index = round(len(text2) * 0.8)

# Training/Test split
training_set = text2[:training_test_index].reset_index(drop=True)
test_set = text2[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)
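
Because the split is a plain slice of the shuffled rows, it can be worth checking that both parts keep roughly the same spam/ham ratio as the full dataset. The sketch below illustrates such a check on a small made-up DataFrame. The names `toy`, `toy_train` and `toy_test` and their contents are hypothetical stand-ins for `text`, `training_set` and `test_set`, not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the SMS dataset (not the real messages)
toy = pd.DataFrame({
    "Label": ["ham"] * 80 + ["spam"] * 20,
    "SMS": [f"message {i}" for i in range(100)],
})

# Shuffle with a fixed seed, then slice 80% / 20% as in the cells above
shuffled = toy.sample(frac=1, random_state=1)
split_index = round(len(shuffled) * 0.8)
toy_train = shuffled[:split_index].reset_index(drop=True)
toy_test = shuffled[split_index:].reset_index(drop=True)

# Compare the class proportions in each part
train_props = toy_train["Label"].value_counts(normalize=True)
test_props = toy_test["Label"].value_counts(normalize=True)
print(train_props)
print(test_props)
```

On the real data, the same two `value_counts(normalize=True)` calls on `training_set['Label']` and `test_set['Label']` would show whether both sets stay close to the overall proportions of about 87% ham and 13% spam.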


## LETTER CASE AND PUNCTUATION
To calculate all the probabilities required by the algorithm, we'll first need to perform a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need.




In [47]:
# Preview the training set before cleaning
training_set.head(3)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired


In [48]:
# Convert every message to lower case, then replace punctuation with a space.
# Replacing with a space (not an empty string) keeps the words separated, and
# regex=True makes pandas treat \W as a regular expression.
training_set['SMS'] = training_set['SMS'].str.lower()
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set.head(3)

Unnamed: 0,Label,SMS
0,ham,yep  by the pretty sculpture
1,ham,yes  princess  are you going to make me moan 
2,ham,welp apparently he retired
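
The choice of replacement string matters for the next step, because `\W` matches spaces as well as punctuation. The sketch below uses the standard-library `re` module on one message from the training set to compare the two options. The variable names `fused` and `spaced` are introduced here for illustration only.

```python
import re

message = "Yep, by the pretty sculpture".lower()

# Replacing non-word characters with nothing also deletes the spaces,
# so the whole message collapses into a single token
fused = re.sub(r"\W", "", message)

# Replacing them with a space keeps the word boundaries intact
spaced = re.sub(r"\W", " ", message)

print(fused.split())   # one long token
print(spaced.split())  # five separate words
```

Splitting on whitespace afterwards absorbs the doubled spaces left behind by punctuation, so no extra cleanup is needed before tokenizing.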


### CREATING THE VOCABULARY
Next, we create a vocabulary for the messages in the training set. The vocabulary is a Python list containing all the unique words across all messages, where each word is represented as a string.

1. Transform each message in the SMS column into a list of words by splitting the string on whitespace with the Series.str.split() method.
2. Initialize an empty list named vocabulary.
3. Iterate over the SMS column and, using a nested loop, over the words of each message, appending each word to the vocabulary list.
4. Transform the vocabulary list into a set using the set() function. This removes the duplicates from the vocabulary list.
5. Transform the vocabulary set back into a list using the list() function.

In [49]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary=[]
for i in training_set['SMS']:
    for word in i:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
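
The same result can be reached more compactly with a set comprehension. The sketch below is one possible variant, not the notebook's own code: `tokenized_messages` is a small hypothetical list standing in for `training_set['SMS']`, and the result is sorted because a set has no guaranteed order, so an unsorted vocabulary may come out differently on each run.

```python
# Hypothetical tokenized messages standing in for training_set['SMS']
tokenized_messages = [
    ["yep", "by", "the", "pretty", "sculpture"],
    ["welp", "apparently", "he", "retired"],
    ["by", "the", "way", "he", "retired"],
]

# Collect every word once; the set drops duplicates automatically
unique_words = {word for message in tokenized_messages for word in message}

# Sort so the vocabulary order is the same on every run
vocabulary_sketch = sorted(unique_words)
print(len(vocabulary_sketch))
print(vocabulary_sketch)
```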


In [50]:
#preview vocabulary
print(vocabulary)

