Spam Filter: Develop a spam filter using a Naïve Bayes classifier that analyses emails and classifies them as spam or non-spam based on their features.

## Implementation of a Naive Bayes Classifier for Spam Detection

#### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

#### Load Data

In [2]:
msgs = pd.read_csv("spam.csv",encoding="ISO-8859-1")
msgs

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


#### Data Cleaning

In [3]:
msgs.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)  # drop unwanted columns

msgs.rename(columns={'v2': 'text_msgs', 'v1': 'label'}, inplace=True)  # rename columns

msgs.label = msgs.label.apply(lambda x: 1 if x == 'spam' else 0)  # convert label to binary, i.e. spam = 1, ham = 0

msgs.text_msgs = msgs.text_msgs.apply(lambda t: t.lower().translate(str.maketrans('', '', string.punctuation)))  # lowercase text_msgs and remove punctuation

In [4]:
print(msgs.label.unique()) # checking the types of labels

[0 1]


In [5]:
print(msgs.head(5))

   label                                          text_msgs
0      0  go until jurong point crazy available only in ...
1      0                            ok lar joking wif u oni
2      1  free entry in 2 a wkly comp to win fa cup fina...
3      0        u dun say so early hor u c already then say
4      0  nah i dont think he goes to usf he lives aroun...


#### Split the Data

In [6]:
X = msgs.text_msgs  # instance features
y = msgs.label  # label
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3,random_state=1)
print("Dimension of each set : \n",X_train.shape,X_test.shape,y_train.shape,y_test.shape)

Dimension of each set : 
 (3900,) (1672,) (3900,) (1672,)


#### Transform the Words to Word Counts

In [7]:
cv = CountVectorizer()
X_train_count = cv.fit_transform(X_train.values)
print("Transformed messages : \n",X_train_count.toarray())

Transformed messages : 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
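
The array above is almost entirely zeros because each message uses only a handful of the words in the vocabulary. As a minimal sketch of what `CountVectorizer` produces, the example below fits it on a tiny corpus. The corpus and the names `toy_msgs`, `toy_cv`, and `toy_counts` are hypothetical and are not taken from `spam.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration only (not from spam.csv)
toy_msgs = ["free entry win prize", "ok see you later", "win free prize now"]

toy_cv = CountVectorizer()
# Sparse matrix: one row per message, one column per vocabulary word
toy_counts = toy_cv.fit_transform(toy_msgs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())

# Only the non-zero counts are stored, which is why the result is sparse
total_cells = toy_counts.shape[0] * toy_counts.shape[1]
print("Non-zero entries:", toy_counts.nnz, "of", total_cells)
```

Each row holds the word counts for one message, so the words in a message appear as non-zero entries in that row and every other column stays at zero.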


#### Train the Model

In [8]:
model = MultinomialNB()
model.fit(X_train_count,y_train)

MultinomialNB()
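
To make the training step less opaque, the sketch below recomputes a multinomial Naive Bayes prediction by hand on a toy dataset and compares it with `MultinomialNB`. It assumes the textbook formulation with Laplace smoothing (`alpha = 1.0`, the scikit-learn default). The data and the names `toy_texts`, `toy_labels`, `log_prior`, and `log_likelihood` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data for illustration only (1 = spam, 0 = ham)
toy_texts = ["win free prize now", "free entry win cash",
             "see you at lunch", "are you coming home"]
toy_labels = np.array([1, 1, 0, 0])

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(toy_texts).toarray()

alpha = 1.0  # Laplace smoothing strength (assumed; matches the MultinomialNB default)

# log P(class): fraction of training messages in each class, ordered (ham, spam)
log_prior = np.log(np.array([(toy_labels == c).mean() for c in (0, 1)]))

# log P(word | class): smoothed word counts per class, normalised over the vocabulary
word_totals = np.array([counts[toy_labels == c].sum(axis=0) for c in (0, 1)]) + alpha
log_likelihood = np.log(word_totals / word_totals.sum(axis=1, keepdims=True))

# Score a new message: sum of count * log-likelihood per word, plus the log prior
new_counts = toy_cv.transform(["free prize for you"]).toarray()
scores = new_counts @ log_likelihood.T + log_prior
manual_pred = scores.argmax(axis=1)

sk_model = MultinomialNB(alpha=alpha).fit(counts, toy_labels)
sk_pred = sk_model.predict(new_counts)
print("Manual prediction :", manual_pred)
print("sklearn prediction:", sk_pred)
```

Words that never appeared in the training messages (here, "for") are dropped by `transform`, and smoothing keeps a word seen in only one class from forcing the other class's probability to zero.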

#### Testing Sample Emails

In [9]:
# test a sample email to check whether it is correctly classified as not spam
email_1 = ['Hey whanna meet up?']
email_1_count = cv.transform(email_1)

print("Model prediction : ",model.predict(email_1_count))

Model prediction :  [0]


In [10]:
# test a sample email to check whether it is correctly classified as spam
email_2 = ['offer for a lucky winner']
email_2_count = cv.transform(email_2)

print("Model prediction :",model.predict(email_2_count))

Model prediction : [1]


#### Predict on the Test Data

In [11]:
X_test_count = cv.transform(X_test.values)  # reuse the vocabulary fitted on the training set
y_pred = model.predict(X_test_count)

#### Model Score

In [12]:
print("Model Score : ",model.score(X_test_count,y_test))  # mean accuracy on the test set

Model Score :  0.9802631578947368
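
`model.score` reports accuracy, which can look high even when spam is missed if ham messages dominate the test set. As a sketch of a complementary check, the example below computes a confusion matrix with precision and recall. The label arrays `y_true_demo` and `y_pred_demo` are hypothetical stand-ins and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (1 = spam, 0 = ham)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision: share of predicted spam that is really spam
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: share of actual spam that was caught
recall = recall_score(y_true_demo, y_pred_demo)
print("Precision :", precision)
print("Recall    :", recall)
```

In the notebook itself, the same calls could be applied to `y_test` and the `y_pred` computed in cell 11.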
