<a href="https://colab.research.google.com/github/Jumaantony/basic-ml-course/blob/main/Naive_bayes_spam_email_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Email Spam Classification Using Naive Bayes Classifier**

Naive Bayes classifiers are a family of classification algorithms based on Bayes' Theorem. They share a common principle: every feature is assumed to be conditionally independent of every other feature, given the class.
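
Written out for this task, the shared principle gives the decision rule below for a message made of the words $w_1, \dots, w_n$. The notation is an assumption chosen to line up with the variable names used later in this notebook (`p_spam`, `p_ham`, `parameters_spam`, `parameters_ham`); the message gets whichever label has the larger value.

```latex
\begin{align}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{align}
```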

In [32]:
import pandas as pd

# header=None treats every row of the file as data;
# names=['Label', 'SMS'] sets the column names
email_spam = pd.read_csv('/content/drive/MyDrive/emails.csv', header=None, names=['Label', 'SMS'])

print(email_spam.shape)
email_spam.head()

(5572, 2)


Unnamed: 0,Label,SMS
0,0,1
1,ham,"""Go until jurong point, crazy.. Available only..."
2,ham,Ok lar... Joking wif u oni...
3,spam,Free entry in 2 a wkly comp to win FA Cup fina...
4,ham,U dun say so early hor... U c already then say...


In [60]:
email_spam['Label'].value_counts(normalize=True)*100

ham     86.575736
spam    13.406317
0        0.017947
Name: Label, dtype: float64

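In the output above, about 87% of the messages are ham (non-spam) and about 13% are spam. The single `0` label (about 0.02%) is most likely the file's own header row, read as data because `header=None` was passed to `read_csv`.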
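
Because the classes are this imbalanced, a classifier that always answers "ham" already scores well on accuracy, which is worth keeping in mind when reading the final result. The sketch below computes that majority-class baseline; `majority_baseline` and `toy_labels` are hypothetical names, and the toy data only mimics the rounded 87/13 split.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, share of that label in `labels`)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels with roughly the same imbalance as the dataset (hypothetical data).
toy_labels = ["ham"] * 87 + ["spam"] * 13
label, share = majority_baseline(toy_labels)
print(label, share)  # ham 0.87
```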

**Training and Testing**

70% of the data will be used for training and the remaining 30% for testing.

In [62]:
# randomizing the dataset
data_randomized = email_spam.sample(frac=1, random_state=1)

# Calculating index for split
training_test_index = round(len(data_randomized) * 0.7)

# Split into training and test sets
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(3900, 2)
(1672, 2)


In [63]:
# analyzing the spam and ham percentage
training_set['Label'].value_counts(normalize=True)

ham     0.865641
spam    0.134359
Name: Label, dtype: float64
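
The random split above happens to preserve the ham/spam ratio. If that ratio needs to be preserved by construction, one option is a stratified split, which is a different technique from the plain shuffle used in this notebook. A minimal standard-library sketch, assuming rows are `(label, text)` tuples; `stratified_split` and the toy rows are hypothetical:

```python
import random
from collections import defaultdict

def stratified_split(rows, train_frac=0.7, seed=1):
    """Split (label, text) rows so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)                      # shuffle within each class
        cut = round(len(label_rows) * train_frac)    # per-class split index
        train.extend(label_rows[:cut])
        test.extend(label_rows[cut:])
    rng.shuffle(train)                               # mix the classes again
    rng.shuffle(test)
    return train, test

# Hypothetical toy data: 80 ham and 20 spam messages.
rows = [("ham", f"ham message {i}") for i in range(80)]
rows += [("spam", f"spam message {i}") for i in range(20)]
train, test = stratified_split(rows)
print(len(train), len(test))                          # 70 30
print(sum(label == "spam" for label, _ in train))     # 14
```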

**Data Cleaning**

In [40]:
# Before cleaning
training_set.head(5)

Unnamed: 0,Label,SMS
0,ham,"""Yep, by the pretty sculpture"""
1,ham,"""Yes, princess. Are you going to make me moan?"""
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [44]:
# After cleaning
training_set['SMS'] = training_set['SMS'].str.replace(
   r'\W', ' ', regex=True) # Replaces every non-word character (punctuation) with a space
training_set['SMS'] = training_set['SMS'].str.lower()
training_set.head(5)


Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
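
The same cleaning can be tried on a single string with the standard `re` module, which makes it easier to see what `\W` keeps: letters (including non-ASCII ones such as `ü`), digits, and underscores survive, everything else becomes a space. The helper name `tokenize` is hypothetical; it also performs the whitespace split that the notebook applies in the next step.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    return re.sub(r"\W", " ", message).lower().split()

print(tokenize("I forgot 2 ask ü all smth.. There's a card"))
# ['i', 'forgot', '2', 'ask', 'ü', 'all', 'smth', 'there', 's', 'a', 'card']
print(tokenize("free_entry NOW!!"))
# ['free_entry', 'now']
```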


**Creating the vocabulary**

The vocabulary is the list of all unique words in the training set.

In [45]:
# Convert each message in the SMS column into a list of words by splitting
# on whitespace
training_set['SMS'] = training_set['SMS'].str.split()

# Initialize an empty list
vocabulary = []

# Iterate over the training set and append every word of every message to the list
for sms in training_set['SMS']:
   for word in sms:
      vocabulary.append(word)

# Remove all duplicates using the set() function
vocabulary = list(set(vocabulary))

In [46]:
# available unique words
len(vocabulary)

7792

**Final Training Set**

*Transforming the Data into a DataFrame*

In [47]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

# Loop over the training set to get each message and its index
for index, sms in enumerate(training_set['SMS']):
   # Loop over the words of the message and count each occurrence
   for word in sms:
      word_counts_per_sms[word][index] += 1

In [48]:
# Final transformation for the training set
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,inlude,400,disagreeable,cakes,individual,batch,heart,temp,po,hor,...,alter,girl,tomo,pocay,shoes,urself,floor,tt,sometimes,mylife
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
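
The table above is almost entirely zeros, and the later steps only use the per-class column sums. As an alternative sketch (not the approach this notebook takes), the same totals can be accumulated directly with `collections.Counter`, without building the full message-by-word table. `count_words_by_label` and the toy messages are hypothetical.

```python
from collections import Counter

def count_words_by_label(labelled_tokens):
    """Return {label: Counter of word frequencies} for (label, tokens) pairs."""
    totals = {}
    for label, tokens in labelled_tokens:
        totals.setdefault(label, Counter()).update(tokens)
    return totals

# Hypothetical toy messages, already cleaned and split into words.
toy_messages = [
    ("ham", ["see", "you", "there"]),
    ("spam", ["win", "money", "now"]),
    ("spam", ["win", "a", "prize"]),
]
totals = count_words_by_label(toy_messages)
toy_vocabulary = sorted(set().union(*totals.values()))

print(totals["spam"]["win"])  # 2
print(totals["ham"]["win"])   # 0 (a Counter returns 0 for unseen words)
print(len(toy_vocabulary))    # 8
```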




In [49]:
# Adding the Label and SMS columns
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,inlude,400,disagreeable,cakes,individual,batch,heart,temp,...,alter,girl,tomo,pocay,shoes,urself,floor,tt,sometimes,mylife
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Calculating Constants**

In [50]:
# Isolating spam and ham messages
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages) / len(training_set_clean)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

**Calculating Parameters**

In [51]:
# Initiate parameters
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters
for word in vocabulary:
   n_word_given_spam = spam_messages[word].sum() # spam_messages already defined
   p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
   parameters_spam[word] = p_word_given_spam

   n_word_given_ham = ham_messages[word].sum() # ham_messages already defined
   p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
   parameters_ham[word] = p_word_given_ham
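
For reference, the loop above evaluates the following smoothed estimates for every word $w_i$ in the vocabulary. The symbols are a transcription of the code's variable names (`n_word_given_spam`, `n_spam`, `n_vocabulary`, `alpha`), not notation taken from the original text.

```latex
\begin{align}
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}} \\
P(w_i \mid \text{Ham}) &= \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \, N_{\text{Vocabulary}}}
\end{align}
```

With $\alpha = 1$, a word that never occurs in spam still gets a small nonzero probability, so a single unseen word cannot force the whole product to zero.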

**Classifying A New Message**

In [52]:
import re

def classify(message):
   '''
   message: a string
   '''

   message = re.sub(r'\W', ' ', message)
   message = message.lower().split()

   p_spam_given_message = p_spam
   p_ham_given_message = p_ham

   for word in message:
      if word in parameters_spam:
         p_spam_given_message *= parameters_spam[word]

      if word in parameters_ham: 
         p_ham_given_message *= parameters_ham[word]

   print('P(Spam|message):', p_spam_given_message)
   print('P(Ham|message):', p_ham_given_message)

   if p_ham_given_message > p_spam_given_message:
      print('Label: Ham')
   elif p_ham_given_message < p_spam_given_message:
      print('Label: Spam')
   else:
      print('Equal probabilities, have a human classify this!')

In [53]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3419772268006177e-25
P(Ham|message): 1.932019374430739e-27
Label: Spam


In [54]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4216990000062882e-25
P(Ham|message): 3.725501990221994e-21
Label: Ham
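
The products printed above are already around 1e-25 for very short messages, so a long message could push them toward floating-point underflow. A common remedy is to add logarithms instead of multiplying probabilities; the comparison between the two classes is unchanged because the logarithm is monotonic. The sketch below is an assumption-labelled variant, not the notebook's `classify`: `log_score` and all `toy_*` values are hypothetical.

```python
import math

def log_score(tokens, prior, word_probs):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    score = math.log(prior)
    for word in tokens:
        if word in word_probs:  # unknown words are skipped, as in classify()
            score += math.log(word_probs[word])
    return score

# Hypothetical parameters for illustration only.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {"win": 0.02, "money": 0.01, "see": 0.001}
toy_parameters_ham = {"win": 0.001, "money": 0.002, "see": 0.01}

tokens = ["win", "money", "unknownword"]
spam_score = log_score(tokens, toy_p_spam, toy_parameters_spam)
ham_score = log_score(tokens, toy_p_ham, toy_parameters_ham)
print("spam" if spam_score > ham_score else "ham")  # spam

# Why this matters: a raw product of many small factors underflows to 0.0,
# while the equivalent sum of logs stays finite.
raw_product = 1e-5 ** 100
log_sum = 100 * math.log(1e-5)
print(raw_product, math.isfinite(log_sum))  # 0.0 True
```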


**Measuring The Accuracy**

In [55]:
def classify_test_set(message):
   '''
   message: a string
   '''

   message = re.sub(r'\W', ' ', message)
   message = message.lower().split()

   p_spam_given_message = p_spam
   p_ham_given_message = p_ham

   for word in message:
      if word in parameters_spam:
         p_spam_given_message *= parameters_spam[word]

      if word in parameters_ham:
         p_ham_given_message *= parameters_ham[word]

   if p_ham_given_message > p_spam_given_message:
      return 'ham'
   elif p_spam_given_message > p_ham_given_message:
      return 'spam'
   else:
      return 'needs human classification'

In [56]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"""All done, all handed in. Don't know if mega s...",ham


In [57]:
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows():
   row = row[1]
   if row['Label'] == row['predicted']:
      correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1098
Incorrect: 16
Accuracy: 0.9856373429084381
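
Since most messages are ham, accuracy alone says little about how well spam itself is caught. Precision and recall for the spam class give a fuller picture. The sketch below is a minimal standard-library version; `spam_metrics` and the toy labels are hypothetical, and no figures for the real test set are claimed here.

```python
def spam_metrics(actual, predicted):
    """Return (precision, recall) for the 'spam' class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == "spam" and p == "spam" for a, p in pairs)  # spam caught
    fp = sum(a != "spam" and p == "spam" for a, p in pairs)  # ham flagged as spam
    fn = sum(a == "spam" and p != "spam" for a, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy labels: 2 true positives, 1 false positive, 1 false negative.
actual = ["ham", "ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham", "spam"]
precision, recall = spam_metrics(actual, predicted)
print(precision, recall)
```

Under the assumption that `test_set` has the `Label` and `predicted` columns created above, `spam_metrics(test_set['Label'], test_set['predicted'])` should work unchanged, because iterating over a pandas Series yields its values.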
