**Business Case: Spam filtering with a naive Bayes classifier that predicts, from a message's content, whether a new message is spam or not spam**

In [2]:
# importing required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import string
import matplotlib.pyplot as plt

In [3]:
# Load the dataset

data = pd.read_csv("spam.tsv", sep="\t", names=['Class', 'Message'])
data.head(8)

Unnamed: 0,Class,Message
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!!
5,ham,As per your request 'Melle Melle (Oru Minnamin...
6,spam,WINNER!! As a valued network customer you have...
7,spam,Had your mobile 11 months or more? U R entitle...


In [4]:
data.loc[:0] #returns the first row of the DataFrame as a new DataFrame.

Unnamed: 0,Class,Message
0,ham,I've been searching for the right words to tha...


In [5]:
# Summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5567 entries, 0 to 5566
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Class    5567 non-null   object
 1   Message  5567 non-null   object
dtypes: object(2)
memory usage: 87.1+ KB


In [6]:
data['Length'] = data['Message'].apply(len)

In [7]:
data.head(1)

Unnamed: 0,Class,Message,Length
0,ham,I've been searching for the right words to tha...,196


In [8]:
data['Length']

0       196
1       155
2        61
3        77
4        36
       ... 
5562    160
5563     36
5564     57
5565    125
5566     26
Name: Length, Length: 5567, dtype: int64

In [9]:
# Group the data by 'Class' and count the messages in each class

data.groupby('Class').count()

Class,Message,Length
ham,4821,4821
spam,746,746
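
The counts above show that the two classes are far from balanced. As a quick sketch, the share of each class can be worked out from those two counts alone; the Series below is typed in by hand from the output rather than read from `spam.tsv`.

```python
import pandas as pd

# Class counts copied from the groupby output above
counts = pd.Series({"ham": 4821, "spam": 746}, name="Message")

# Share of each class in the full dataset
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 13% of the messages are spam, which is worth keeping in mind when reading the accuracy figure later on.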


In [10]:
data['Length'].describe() # summary statistics for message length, including the maximum

count    5567.000000
mean       80.450153
std        59.891023
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: Length, dtype: float64

The summary shows a 910-character message. Let's use a Boolean mask to find it:

In [11]:
# the message that has the max characters
data[data['Length']==910]['Message']

1080    For me the love should start with attraction.i...
Name: Message, dtype: object

In [12]:
# view the message that has 910 characters in it
data[data['Length']==910]['Message'].iloc[0]

"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

In [13]:
# View the message that has min characters
data[data['Length']==2]['Message'].iloc[0]

'Ok'
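
The masks above hard-code the lengths 910 and 2. A possible alternative, sketched here on a small made-up frame (`toy` is hypothetical and not part of the notebook), is to look the rows up with `idxmax` and `idxmin` so the code keeps working if the data changes.

```python
import pandas as pd

# Made-up messages standing in for the notebook's `data`
toy = pd.DataFrame({"Message": ["Ok", "See you at 5", "Free entry in 2 a wkly comp"]})
toy["Length"] = toy["Message"].str.len()

# idxmax/idxmin return the index label of the longest/shortest message
longest = toy.loc[toy["Length"].idxmax(), "Message"]
shortest = toy.loc[toy["Length"].idxmin(), "Message"]

print(longest)
print(shortest)
```

If several messages tie for the extreme length, `idxmax` and `idxmin` return only the first one, whereas the Boolean mask returns all of them.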

# Text Preprocessing

In [35]:
# Target values as an array (the output reflects the label encoding from cells 33 and 34, which were run first)
dObject = data['Class'].values
dObject

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [33]:
# Let's encode ham as 1

data.loc[data['Class']=="ham", "Class"] = 1

In [34]:
# Let's encode spam as 0

data.loc[data['Class']=="spam", "Class"] = 0

In [36]:
dObject2=data['Class'].values
dObject2

array([1, 0, 1, ..., 1, 1, 1], dtype=object)
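
Note that the array above still has `dtype=object` after the two `.loc` assignments. One way to encode both labels in a single step is `Series.map` with a dictionary. The sketch below uses a small hypothetical frame (`toy`) and an illustrative name (`label_map`) instead of the real data.

```python
import pandas as pd

# Toy frame standing in for the notebook's `data` (hypothetical rows)
toy = pd.DataFrame({"Class": ["ham", "spam", "ham"]})

# Map both labels in one pass, using the same encoding as the notebook
label_map = {"ham": 1, "spam": 0}
toy["Class"] = toy["Class"].map(label_map)

print(toy["Class"].tolist())
print(toy["Class"].dtype)
```

Any label missing from the dictionary would become `NaN`, so this approach assumes the column contains only the two expected values.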

In [37]:
data.head()

Unnamed: 0,Class,Message,Length,text_clean
0,1,I've been searching for the right words to tha...,196,Ive been searching for the right words to than...
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
2,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
3,1,Even my brother is not like to speak with me. ...,77,Even my brother is not like to speak with me T...
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36,I HAVE A DATE ON SUNDAY WITH WILL


First, remove punctuation. Python's built-in string module provides a ready-made string of all ASCII punctuation characters:

In [38]:
# default list of punctuation

import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [39]:
# Why removing punctuation is important: identical strings compare equal

"This message is spam" == "This message is spam"

True

In [40]:
# Why removing punctuation is important: one extra punctuation mark makes the strings differ

"This message is spam" == "This message is spam?"

False

In [41]:
# Let's remove the punctuation

def remove_punct(text):
    """Return text with every character in string.punctuation removed."""
    return "".join(char for char in text if char not in string.punctuation)

data['text_clean'] = data['Message'].apply(remove_punct)

data.head()

Unnamed: 0,Class,Message,Length,text_clean
0,1,I've been searching for the right words to tha...,196,Ive been searching for the right words to than...
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
2,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
3,1,Even my brother is not like to speak with me. ...,77,Even my brother is not like to speak with me T...
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36,I HAVE A DATE ON SUNDAY WITH WILL
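
The character-by-character join works well on a dataset of this size. As an optional variation, the same result can be obtained with `str.translate`, which is usually faster on large corpora. The name `remove_punct_fast` is hypothetical and only used for this sketch.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Hypothetical variant of remove_punct built on str.translate."""
    return text.translate(PUNCT_TABLE)

print(remove_punct_fast("I HAVE A DATE ON SUNDAY WITH WILL!!!"))
```

Like the original function, this removes only ASCII punctuation and joins words that were separated by punctuation alone, for example "So...any" becomes "Soany".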


Next, we need to convert each message into a numeric vector that scikit-learn's models, including the one we are going to use, can work with.

In [42]:
# CountVectorizer converts text into a matrix of token counts.

# Initialize the CountVectorizer, dropping common English stop words
CV = CountVectorizer(stop_words="english")

In [43]:
data

Unnamed: 0,Class,Message,Length,text_clean
0,1,I've been searching for the right words to tha...,196,Ive been searching for the right words to than...
1,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
2,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
3,1,Even my brother is not like to speak with me. ...,77,Even my brother is not like to speak with me T...
4,1,I HAVE A DATE ON SUNDAY WITH WILL!!!,36,I HAVE A DATE ON SUNDAY WITH WILL
...,...,...,...,...
5562,0,This is the 2nd time we have tried 2 contact u...,160,This is the 2nd time we have tried 2 contact u...
5563,1,Will ü b going to esplanade fr home?,36,Will ü b going to esplanade fr home
5564,1,"Pity, * was in mood for that. So...any other s...",57,Pity was in mood for that Soany other suggest...
5565,1,The guy did some bitching but I acted like i'd...,125,The guy did some bitching but I acted like id ...


In [45]:
CV

In [46]:
# Splitting x and y

xSet = data['text_clean'].values
ySet = data['Class'].values
ySet

array([1, 0, 1, ..., 1, 1, 1], dtype=object)

In [47]:
xSet

array(['Ive been searching for the right words to thank you for this breather I promise i wont take your help for granted and will fulfil my promise You have been wonderful and a blessing at all times',
       'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s',
       'Nah I dont think he goes to usf he lives around here though', ...,
       'Pity  was in mood for that Soany other suggestions',
       'The guy did some bitching but I acted like id be interested in buying something else next week and he gave it to us for free',
       'Rofl Its true to its name'], dtype=object)

# Splitting Train and Test Data

In [56]:
# The dtype of y is object; convert it to int
ySet = ySet.astype('int')
ySet

array([1, 0, 1, ..., 1, 1, 1])

In [57]:
xSet_train, xSet_test, ySet_train, ySet_test = train_test_split(xSet, ySet, test_size = 0.2, random_state = 42)
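
The split above is a plain random split. Because spam is the minority class, one optional variation (not used in this notebook) is to pass `stratify` so both splits keep the same spam share. The sketch below uses synthetic labels with the dataset's class counts and placeholder features, so `X` and `y` here are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (4821 ham, 746 spam)
y = np.array([1] * 4821 + [0] * 746)
X = np.arange(len(y))  # placeholder features, one id per message

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y the spam share is nearly identical in both splits
print(len(y_train), len(y_test))
print(1 - y_train.mean(), 1 - y_test.mean())
```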

In [59]:
xSet_train_CV = CV.fit_transform(xSet_train)
xSet_train_CV

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 34481 stored elements and shape (4453, 8216)>

# Creating a model

In [62]:
NB = MultinomialNB()

In [63]:
NB.fit(xSet_train_CV, ySet_train)
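
`MultinomialNB` estimates, for each class, how likely each word is, using additive (Laplace) smoothing with `alpha=1.0` by default. As a small sanity check on a made-up count matrix (not the real data), the smoothed likelihoods can be reproduced by hand and compared with the fitted model's `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: 3 messages x 3 vocabulary terms
X = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3]])
y = np.array([0, 0, 1])  # 0 = spam, 1 = ham, as in the notebook

nb = MultinomialNB()  # alpha=1.0 is the default
nb.fit(X, y)

# Smoothed likelihood for class 0, computed by hand:
# (term count in class + alpha) / (total count in class + alpha * vocabulary size)
class0_counts = X[y == 0].sum(axis=0)
manual = (class0_counts + 1.0) / (class0_counts.sum() + 1.0 * X.shape[1])

print(manual)
print(np.exp(nb.feature_log_prob_[0]))
```

The smoothing is what keeps a word that never appeared in a class (the third term for class 0 here) from getting a probability of zero.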

In [64]:
# Transform the test data with the vectorizer fitted on the training data
xSet_test_CV = CV.transform(xSet_test)
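
Calling `transform` rather than `fit_transform` on the test set keeps the vocabulary learned from the training data. One way to make that hard to get wrong is to wrap both steps in a scikit-learn pipeline. This is a sketch on a tiny made-up corpus; `spam_pipeline` and the example texts are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (1 = ham, 0 = spam, matching the notebook's encoding)
train_texts = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch today",
    "call me when you get home",
]
train_labels = [0, 0, 1, 1]

# fit() learns the vocabulary and the model from the training text only;
# predict() reuses that vocabulary through transform() internally
spam_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

preds = spam_pipeline.predict(["free prize waiting", "lunch at home"])
print(preds)
```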

In [65]:
# prediction for xSet_test_CV

ySet_predict = NB.predict(xSet_test_CV)
ySet_predict

array([1, 1, 1, ..., 1, 1, 1])

In [66]:
# Checking accuracy

accuracyScore = accuracy_score(ySet_test,ySet_predict)

print("Prediction Accuracy :",accuracyScore)

Prediction Accuracy : 0.9892280071813285


In [67]:
from sklearn.metrics import classification_report

In [68]:
# Print classification report
print("Classification Report:")
print(classification_report(ySet_test, ySet_predict))

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96       145
           1       0.99      0.99      0.99       969

    accuracy                           0.99      1114
   macro avg       0.98      0.98      0.98      1114
weighted avg       0.99      0.99      0.99      1114
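
To put the accuracy in context, it helps to compare it with a trivial baseline that labels every message as ham. The sketch below derives that baseline from the support column of the report above.

```python
# Support counts from the classification report above
n_spam, n_ham = 145, 969
n_total = n_spam + n_ham

# Accuracy of a trivial model that labels every message as ham (class 1)
baseline_accuracy = n_ham / n_total
print(f"Majority-class baseline: {baseline_accuracy:.3f}")
```

The baseline already reaches about 87% accuracy on this test set, so the per-class precision and recall for spam are the more informative numbers.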



The naive Bayes classifier with CountVectorizer features performed well at separating spam from ham. On the 1,114 held-out messages it reached an accuracy of about 99%, with precision and recall of 0.96 for spam (class 0) and 0.99 for ham (class 1). These results suggest it is a reliable approach for this filtering task: most spam is caught, and few legitimate messages are incorrectly flagged.