* Q. Email service providers aim to automatically classify incoming emails as Spam or Not Spam based on the words or patterns they contain.

A small company wants to develop a simple spam detection model using Naive Bayes classification.

Using the dataset below, the task is to:

`Train a Naive Bayes classifier using historical data.`

`Calculate probabilities for each feature given the class.`

`Predict whether a new incoming email is Spam or Not Spam based on its words.`

* Storing Data / Extracting The Data

In [None]:
data = [
    ['Contains_Free',' Contains_Offer', 'Contains_Win', 'Label'],
    ["Yes", "Yes", "No", "Spam"],
    ["No", "Yes", "Yes", "Spam"],
    ["Yes", "No", "No", "Not Spam"],
    ["No", "Yes", "No", "Not Spam"],
    ["Yes", "Yes", "Yes", "Spam"]
]

#Import from csv
# import pandas as pd 
# df = pd.read_csv('email_spam_data.csv')
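
The commented-out import above can also be done with the standard library alone. The following is a minimal sketch, assuming the CSV file has a header row followed by one row per email, in the same column order as `data`. The helper name `load_rows` and the sample contents are hypothetical, since the file `email_spam_data.csv` itself is not shown here.

```python
import csv
import io

def load_rows(file_obj):
    # Hypothetical helper: read every CSV row into a list of lists,
    # stripping stray whitespace around each cell and skipping empty lines
    return [[cell.strip() for cell in row] for row in csv.reader(file_obj) if row]

# Stand-in for open('email_spam_data.csv', newline=''); the contents are illustrative
sample_csv = io.StringIO(
    "Contains_Free,Contains_Offer,Contains_Win,Label\n"
    "Yes,Yes,No,Spam\n"
    "No,Yes,Yes,Spam\n"
)

rows = load_rows(sample_csv)
print(rows[0])   # header row
print(rows[1:])  # data rows
```

The result has the same list-of-lists shape as `data`, so the later steps could run on it unchanged.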




1) Extract Feature Names

In [14]:
# The header row holds the column names (the three features plus the label)
features_name = list(data[0])

print(features_name)

# The last column is the class label
class_name = data[0][-1]

print(class_name)

['Contains_Free', ' Contains_Offer', 'Contains_Win', 'Label']
Label


2) Create Column Dictionary

In [2]:
columns = {}

# Build one list per column, skipping the header row at data[0]
for j in range(0,len(data[0])):
  columns[features_name[j]] = [data[i][j] for i in range(1,len(data))]

print(columns)

{'Contains_Free': ['Yes', 'No', 'Yes', 'No', 'Yes'], ' Contains_Offer': ['Yes', 'Yes', 'No', 'Yes', 'Yes'], 'Contains_Win': ['No', 'Yes', 'No', 'No', 'Yes'], 'Label': ['Spam', 'Spam', 'Not Spam', 'Not Spam', 'Spam']}


3) Find Unique Attribute Values

In [3]:
unique_attribute = {}

# sorted() makes the order of the values reproducible; a bare set has no guaranteed order
for i in range(0,len(data[0])):
  unique_attribute[features_name[i]] = sorted(set(columns[features_name[i]]))

print(unique_attribute)

{'Contains_Free': ['No', 'Yes'], ' Contains_Offer': ['No', 'Yes'], 'Contains_Win': ['No', 'Yes'], 'Label': ['Not Spam', 'Spam']}


4) Calculate Class Probabilities

In [4]:
# Prior probabilities: probability_of_yes is P(Spam), probability_of_no is P(Not Spam)
probability_of_yes = columns[class_name].count('Spam')/len(columns[class_name])
probability_of_no = columns[class_name].count('Not Spam')/len(columns[class_name])

print(probability_of_yes)
print(probability_of_no)

0.6
0.4
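
The same priors can be computed without hard-coding the two label strings, which generalises to any number of classes. A minimal sketch, assuming the labels are available as a plain list; `labels`, `counts`, and `priors` are illustrative names that do not appear in the cells above.

```python
from collections import Counter

# Label column copied from the dataset above
labels = ['Spam', 'Spam', 'Not Spam', 'Not Spam', 'Spam']

# Count each class once, then divide by the number of rows
counts = Counter(labels)
priors = {label: count / len(labels) for label, count in counts.items()}

print(priors)
```

For this dataset the dictionary holds 0.6 for Spam and 0.4 for Not Spam, matching the two values printed above.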


5) Calculate Conditional Probability

In [5]:
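def probability(column_name,attribute,class_value):
  # Estimate P(column_name = attribute | class = class_value) by counting rows
  mycolumn = columns[column_name]
  classcolumn = columns[class_name]

  count = 0

  # Count rows that have both the attribute value and the class
  for i in range(0,len(mycolumn)):
    if mycolumn[i] == attribute and classcolumn[i]==class_value:
      count=count+1

  # Divide by the number of rows in the class
  return count/classcolumn.count(class_value)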

6) Store Conditional Probabilities

In [6]:
features_probability_spam = {}
features_probability_Notspam = {}

for feature in features_name[:-1]:
  for attribute in unique_attribute[feature]:
    features_probability_spam[f'{feature}_{attribute}'] = probability(feature, attribute, 'Spam')
    features_probability_Notspam[f'{feature}_{attribute}'] = probability(feature, attribute, 'Not Spam')


print("Conditional probabilities given Spam:", features_probability_spam)
print("Conditional probabilities given Not Spam:", features_probability_Notspam)

Conditional probabilities given Spam: {'Contains_Free_No': 0.3333333333333333, 'Contains_Free_Yes': 0.6666666666666666, ' Contains_Offer_No': 0.0, ' Contains_Offer_Yes': 1.0, 'Contains_Win_No': 0.3333333333333333, 'Contains_Win_Yes': 0.6666666666666666}
Conditional probabilities given Not Spam: {'Contains_Free_No': 0.5, 'Contains_Free_Yes': 0.5, ' Contains_Offer_No': 0.5, ' Contains_Offer_Yes': 0.5, 'Contains_Win_No': 1.0, 'Contains_Win_Yes': 0.0}
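
Some of the estimates above are exactly 0.0 (for example `' Contains_Offer_No'` given Spam), and a single zero factor forces the whole product for that class to zero. One common remedy is Laplace (add-one) smoothing, which the cells above do not use. A minimal sketch, assuming every feature takes two possible values; `smoothed_probability`, `n_values`, and `alpha` are illustrative names.

```python
def smoothed_probability(feature_column, class_column, attribute, class_value,
                         n_values=2, alpha=1):
    # Count rows where the feature equals `attribute` within the given class
    match = sum(1 for f, c in zip(feature_column, class_column)
                if f == attribute and c == class_value)
    class_total = class_column.count(class_value)
    # Add `alpha` pseudo-counts per possible value so no estimate is exactly zero
    return (match + alpha) / (class_total + alpha * n_values)

# Columns copied from the dataset above
offer = ['Yes', 'Yes', 'No', 'Yes', 'Yes']
label = ['Spam', 'Spam', 'Not Spam', 'Not Spam', 'Spam']

print(smoothed_probability(offer, label, 'No', 'Spam'))
print(smoothed_probability(offer, label, 'Yes', 'Spam'))
```

With `alpha=1` the estimate for `No` given Spam moves from 0.0 to 0.2 and the estimate for `Yes` from 1.0 to 0.8, so the two values still sum to 1.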


7) Prediction Using Naive Bayes Formula

$$\text{Prediction} = \arg\max_{C_k} P(C_k) \cdot \prod_{i=1}^{n} P(x_i \mid C_k)$$
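
As a worked instance of the formula, consider an email with Contains_Free = Yes, Contains_Offer = Yes and Contains_Win = No, using the probabilities computed in steps 4 and 6:

$$P(\text{Spam}) \cdot \prod_{i} P(x_i \mid \text{Spam}) = \frac{3}{5} \cdot \frac{2}{3} \cdot 1 \cdot \frac{1}{3} = \frac{2}{15} \approx 0.1333$$

$$P(\text{Not Spam}) \cdot \prod_{i} P(x_i \mid \text{Not Spam}) = \frac{2}{5} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot 1 = \frac{1}{10} = 0.1$$

The Spam score is larger, so the arg max selects Spam for this email.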

In [8]:
def predict(Contains_Free_value, Contains_Offer_value, Contains_Win_value, label):
  # Returns the unnormalised score P(label) * product of P(feature value | label)
  values = [Contains_Free_value, Contains_Offer_value, Contains_Win_value]
  if label == 'Spam':
    score, table = probability_of_yes, features_probability_spam
  else:
    score, table = probability_of_no, features_probability_Notspam
  for feature, value in zip(features_name[:-1], values):
    score = score * table[f'{feature}_{value}']
  return score

8) Test

In [13]:
p_spam = predict("Yes","Yes","No","Spam")
p_notspam = predict("Yes", "Yes", "No", "Not Spam")

print(p_spam,p_notspam,sep='\n')

predict_class = 'Spam' if p_spam>p_notspam else 'Not Spam'
print(predict_class)


0.1333333333333333
0.1
Spam
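
The two scores printed above are unnormalised: they are proportional to the posterior probabilities but do not sum to 1. Dividing each score by their sum gives the posteriors. A minimal sketch, with the two scores copied from the output above; `total`, `posterior_spam`, and `posterior_notspam` are illustrative names.

```python
# Unnormalised scores copied from the output above
p_spam = 0.1333333333333333
p_notspam = 0.1

# Dividing by the total rescales the scores so they sum to 1
total = p_spam + p_notspam
posterior_spam = p_spam / total
posterior_notspam = p_notspam / total

print(round(posterior_spam, 3), round(posterior_notspam, 3))
```

For this email the classifier assigns roughly 57% probability to Spam and 43% to Not Spam, so the prediction is not a confident one.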
