<a href="https://colab.research.google.com/github/SakethMattupalli/Spam-vs-Ham/blob/master/spam_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Filtering using Multinomial Naive Bayes
Bag of words (Count Vectorizer)

# Task 1: Importing necessary libraries

In [0]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from math import log, sqrt
import pandas as pd
import numpy as np
import re
%matplotlib inline

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

**Import the data set**

In [0]:
data = pd.read_csv('https://raw.githubusercontent.com/SakethMattupalli/csv/master/spam.csv', encoding = 'latin-1')

In [4]:
data.head() # check the data

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [14]:
data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

**Remove the irrelevant columns**

In [0]:
data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1, inplace = True) 

In [6]:
data.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Rename the columns to descriptive names**

In [7]:
data.rename(columns= {'v1':'labels', 'v2':'message'}, inplace= True)
data.head()

| | labels | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [232]:
data.shape

(5572, 2)

In [233]:
print(data.columns)
data['labels'].value_counts()


Index(['labels', 'message'], dtype='object')


ham     4825
spam     747
Name: labels, dtype: int64
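
The counts above show that the two classes are imbalanced, so a useful reference point is the accuracy of a trivial classifier that always predicts ham. The sketch below computes that baseline from the printed counts. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Accuracy of a trivial classifier that labels every message as ham
baseline_accuracy = ham_count / total
spam_fraction = spam_count / total

print(f"total messages: {total}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.2%}")
print(f"spam fraction: {spam_fraction:.2%}")
```

Any accuracy figure reported for this data set is best read against this baseline of roughly 86.6%, not against 50%.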

*Encode the labels as integers: ham ==> 0, spam ==> 1*

In [8]:
data['labels'] = data['labels'].map({'ham':0, 'spam':1})
data.head()

| | labels | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [9]:
X = data.message
Y = data.labels
print(X[0])
print(Y[0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
0


**Training size: 75%**
**Testing size: 25%**

In [0]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size = 0.25)
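
As a quick sanity check on the split, the sizes of the two parts follow directly from the row count and `test_size`. The sketch below only does the arithmetic and does not call scikit-learn. The variable names are illustrative.

```python
n_samples = 5572      # rows in the data set, from data.shape above
test_size = 0.25      # fraction passed to train_test_split

# 5572 * 0.25 is exactly 1393, so no rounding is involved here
n_test = int(n_samples * test_size)
n_train = n_samples - n_test

print("train:", n_train)
print("test:", n_test)
```

Because the call above does not fix `random_state`, which rows land in each part changes from run to run, so the classification counts further down may differ slightly when the notebook is re-executed. Passing a fixed `random_state` to `train_test_split` would make the split repeatable.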

In [0]:
vectorizer = CountVectorizer()
x_train_cv = vectorizer.fit_transform(x_train.values)

In [12]:
arr = x_train_cv.toarray()
arr

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
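
The array above is mostly zeros because each row has one column per vocabulary word, and a short message uses only a handful of them. To make the representation concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is an illustration, not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary`, and `to_counts` are made up for this example, and the tokenizer only approximates the default behaviour of `CountVectorizer` (lowercasing and keeping tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(messages):
    # Map each distinct token in the training messages to a column index
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_counts(message, vocabulary):
    # Build one row of counts; tokens outside the vocabulary are ignored
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(message)).items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

# Toy messages, made up for illustration
train_messages = ["Free entry to win a prize", "Ok see you later", "Win win win"]
vocabulary = fit_vocabulary(train_messages)
print(vocabulary)

new_row = to_counts("win a free prize, win now", vocabulary)
print(new_row)
```

The last call mirrors what happens when a fitted vectorizer transforms unseen text: words that were not in the training vocabulary (here `now`) contribute nothing to the row.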

In [0]:
classifier = MultinomialNB()

In [14]:
classifier.fit(x_train_cv,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
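
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter: it keeps a word that never appeared in one class from forcing that class's probability to zero. The sketch below is a simplified pure-Python version of multinomial Naive Bayes with that smoothing, written for illustration only. The functions `train_nb` and `predict_nb`, the whitespace tokenization, and the toy messages are assumptions of this example, not the notebook's or scikit-learn's code.

```python
from collections import Counter
from math import log

def train_nb(docs, labels, alpha=1.0):
    # Vocabulary: every distinct whitespace-separated word in the training docs
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # Prior: fraction of training documents that belong to class c
        log_prior[c] = log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        # Smoothed denominator: class word total plus alpha per vocabulary word
        denom = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(model, doc):
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Sum log-probabilities; words outside the vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (made up for illustration): 1 = spam, 0 = ham
docs = ["win free prize now", "free entry win cash", "see you later", "call me later ok"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "free prize"))
print(predict_nb(model, "see you later"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.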

In [0]:
x_test_cv = vectorizer.transform(x_test.values)
y_pred = classifier.predict(x_test_cv)

In [16]:
y_pred #predicted values

array([0, 0, 0, ..., 0, 0, 0])

In [17]:
y_test = np.array(y_test)
y_test # actual values

array([0, 0, 0, ..., 0, 0, 0])

In [28]:
comp = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
comp['message'] = x_test.values
comp.head()

| | actual | predicted | message |
|---|---|---|---|
| 0 | 0 | 0 | Right on brah, see you later |
| 1 | 0 | 0 | Convey my regards to him |
| 2 | 0 | 0 | Company is very good.environment is terrific a... |
| 3 | 0 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | 0 | K I'll be there before 4. |


In [0]:
#a1 = comp[(comp['predicted'] == 1) & (comp['actual'] == 1) ]
#a1.iloc[174].message

In [18]:
# Count the test messages whose predicted label matches the actual label
count = int((y_test == y_pred).sum())
print('correctly classified:', count)
print('total:', len(y_test))

correctly classified: 1369
total: 1393


**Accuracy**

In [19]:
accuracy = count/len(y_test)
print(accuracy*100) 

98.27709978463747
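
Because ham makes up most of the data set, accuracy alone can hide poor performance on the spam class. As a complement, precision and recall for spam can be computed from the four confusion-matrix counts. The sketch below uses small made-up label lists and a hypothetical helper named `spam_metrics`. In the notebook the same function could be applied to `y_test` and `y_pred`.

```python
def spam_metrics(actual, predicted, positive=1):
    # Tally the four confusion-matrix cells for the positive (spam) class
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "accuracy": accuracy}

# Toy labels, made up for illustration: 1 = spam, 0 = ham
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
metrics = spam_metrics(actual, predicted)
print(metrics)
```

Precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam messages, how many were caught".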


# Try with your own custom message

In [278]:
custom_message = input('Enter message: ')
# Reuse the fitted vectorizer so the message maps onto the training vocabulary
custom_cv = vectorizer.transform([custom_message])
pred = classifier.predict(custom_cv)
if pred[0] == 1:
  print('Given message is spam')
else:
  print('Given message is not spam')

Enter message: get mobile + laptop free
Given message is spam
