#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


Let:

- $P(S)$ be the probability that an employee is a smoker.
- $P(H)$ be the probability that an employee uses the health insurance plan.
- $P(S \mid H)$ be the probability that an employee is a smoker given that the employee uses the health insurance plan.

From the given data:

- $P(H) = 0.70$ (70% of employees use the health insurance plan).
- $P(S \mid H) = 0.40$ (40% of the employees who use the plan are smokers).

The question asks for $P(S \mid H)$, which the survey states directly: $P(S \mid H) = 0.40$. Bayes' theorem confirms that no further calculation is needed:

$$P(S \mid H) = \frac{P(H \mid S) \times P(S)}{P(H)}$$

Neither $P(H \mid S)$ (the probability of using the health insurance plan given that the employee is a smoker) nor $P(S)$ (the overall probability of being a smoker) is provided. Applying Bayes' theorem in the other direction gives:

$$P(H \mid S) = \frac{P(S \mid H) \times P(H)}{P(S)}$$

Substituting this into the first equation cancels $P(S)$ and $P(H)$:

$$P(S \mid H) = \frac{P(S)}{P(H)} \times \frac{P(S \mid H) \times P(H)}{P(S)} = \frac{0.40 \times 0.70}{0.70} = 0.40$$

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40.
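
As a quick numeric check of this result, the sketch below rebuilds the joint probability from the two given figures and confirms that a Bayes round trip returns 0.40 whichever overall smoking rate is assumed. The values tried for `p_s` are hypothetical, since the survey does not report the overall share of smokers.

```python
# Figures given in the question
p_h = 0.70           # P(H): employee uses the health insurance plan
p_s_given_h = 0.40   # P(S|H): smoker, given plan usage

# Joint probability P(S and H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Hypothetical overall smoking rates (not given in the question);
# each must be at least P(S and H) to be a valid probability.
results = []
for p_s in (0.30, 0.50, 0.90):
    p_h_given_s = p_s_and_h / p_s                 # Bayes: P(H|S)
    p_s_given_h_check = p_h_given_s * p_s / p_h   # Bayes back to P(S|H)
    results.append(round(p_s_given_h_check, 10))

print(round(p_s_and_h, 2))
print(results)
```

Every entry in `results` equals the stated conditional probability, which is why the missing value of P(S) does not matter here.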


#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


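Bernoulli Naive Bayes: Assumes that features are binary-valued (e.g., whether a word occurs in a document). Each feature is modeled with a Bernoulli distribution, so the absence of a feature also contributes to the likelihood.

Multinomial Naive Bayes: Assumes that features represent counts (e.g., how many times each word occurs in a document). The feature vector is modeled with a multinomial distribution, so each feature contributes in proportion to its count.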
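
The difference can be illustrated with a minimal, from-scratch sketch of the two class-conditional log-likelihoods. The function names, the three-word vocabulary, and the parameter values are all hypothetical and chosen only for illustration; this is not scikit-learn's implementation.

```python
import math

def bernoulli_log_likelihood(counts, p_present):
    """Log-likelihood under a Bernoulli model: only presence/absence matters."""
    total = 0.0
    for count, p in zip(counts, p_present):
        # Absent words contribute log(1 - p); present words contribute log(p).
        total += math.log(p) if count > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, theta):
    """Log-likelihood under a multinomial model, omitting the multinomial
    coefficient, which does not depend on the class."""
    return sum(count * math.log(t) for count, t in zip(counts, theta))

# Hypothetical parameters of one class for a three-word vocabulary
p_present = [0.8, 0.1, 0.5]  # P(word appears at least once | class)
theta = [0.6, 0.1, 0.3]      # P(a token is this word | class), sums to 1

doc_once = [1, 0, 1]      # first word appears once
doc_repeated = [3, 0, 1]  # first word appears three times

bern_once = bernoulli_log_likelihood(doc_once, p_present)
bern_repeated = bernoulli_log_likelihood(doc_repeated, p_present)
multi_once = multinomial_log_likelihood(doc_once, theta)
multi_repeated = multinomial_log_likelihood(doc_repeated, theta)

print(bern_once == bern_repeated)    # repeats are ignored by the Bernoulli model
print(multi_once == multi_repeated)  # repeats change the multinomial likelihood
```

Repeating a word leaves the Bernoulli score unchanged but changes the multinomial score, while the word that never occurs affects only the Bernoulli score.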

#### Q3. How does Bernoulli Naive Bayes handle missing values?


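Bernoulli Naive Bayes has no dedicated mechanism for missing values. The usual convention is to treat a missing value as a non-occurrence, i.e., as the absence of the feature (0). In scikit-learn this has to be done explicitly before fitting, because `BernoulliNB` raises an error when the input contains NaN.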
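
A minimal sketch of the "missing means absent" convention with scikit-learn follows, assuming missing entries are encoded as NaN. The toy data is hypothetical. `BernoulliNB` itself does not accept NaN input, so the example imputes 0 with `SimpleImputer` before the classifier.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (NaN)
X = np.array([[1, 0, np.nan],
              [1, np.nan, 0],
              [0, 1, 1],
              [np.nan, 1, 1]], dtype=float)
y = np.array([1, 1, 0, 0])

# Fitting directly on data containing NaN is rejected
try:
    BernoulliNB().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# Treat every missing value as a non-occurrence (0), then fit
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)
model.fit(X, y)

# The same imputation is applied automatically at prediction time
pred = model.predict(np.array([[1, 0, np.nan]]))
print(raw_fit_failed, pred)
```

Putting the imputer inside the pipeline keeps training and prediction consistent. Whether 0 is a sensible fill value depends on why the data is missing.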

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?


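Yes, Gaussian Naive Bayes can be used for multi-class classification. It assumes that, within each class, every feature follows a Gaussian (normal) distribution. For each class it estimates a prior and a per-feature mean and variance, computes the likelihood with the Gaussian probability density function, and predicts the class with the highest posterior. Nothing in this procedure limits the number of classes to two.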
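
To make the multi-class case concrete, here is a from-scratch sketch of Gaussian Naive Bayes with three classes. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names, the toy data, and the small constant added to each variance are all hypothetical choices.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps each variance strictly positive
        # (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior (up to a constant)."""
    def log_posterior(params):
        prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical toy data: three classes, two features
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.8]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, x) for x in ([1.0, 1.0], [5.0, 5.0], [9.0, 1.0])]
print(predictions)
```

The prediction step is an argmax over however many classes were seen during fitting, so the binary case is just a special case.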

#### Q5. Assignment:

Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.

Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:

Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:

Summarise your findings and provide some suggestions for future work.

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset (57 input features plus the binary "spam" label)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
names = ["word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our",
         "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail",
         "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses",
         "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
         "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp", "word_freq_hpl",
         "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet",
         "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology",
         "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs",
         "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu",
         "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!",
         "char_freq_$", "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
         "capital_run_length_total", "spam"]
data = pd.read_csv(url, names=names, header=None)

# Split features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Classifiers with default hyperparameters
classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Metrics to report, mapped to scikit-learn scorer names
scoring = {"Accuracy": "accuracy", "Precision": "precision", "Recall": "recall", "F1 Score": "f1"}

# Perform 10-fold cross-validation once per classifier, collecting all
# metrics in a single pass, and print the mean of each metric over the folds
print("Performance Metrics:")
for i, (name, clf) in enumerate(classifiers.items()):
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(("\n" if i else "") + f"{name}:")
    for metric in scoring:
        print(f"{metric}:", scores[f"test_{metric}"].mean())
```


Performance Metrics:
Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995
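
For reference, the four reported metrics can all be derived from confusion-matrix counts. The sketch below uses a hypothetical helper and hypothetical counts; they are not taken from the Spambase run above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted spam that is spam
    recall = tp / (tp + fn)      # share of actual spam that is caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration only
accuracy, precision, recall, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(accuracy, precision, recall, f1)
```

Note that each cross-validated figure above is the mean of per-fold scores, so the reported F1 need not equal the F1 computed from the mean precision and mean recall.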
