In [1]:
# Q1. A company conducted a survey of its employees and found that 70% of the employees use the
# company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
# probability that an employee is a smoker given that he/she uses the health insurance plan?

In [2]:
# Given information:

# 70% of employees use the company's health insurance plan: P(Uses Plan) = 0.7
# 40% of the employees who use the plan are smokers: P(Smoker | Uses Plan) = 0.4

# P(Smoker | Uses Plan) = ?

# The quantity asked for is exactly the conditional probability stated in the question, so no
# further calculation is needed:
# P(Smoker | Uses Plan) = 0.4

# Check with the definition of conditional probability:
# P(Smoker and Uses Plan) = P(Smoker | Uses Plan) * P(Uses Plan) = 0.4 * 0.7 = 0.28
# P(Smoker | Uses Plan) = P(Smoker and Uses Plan) / P(Uses Plan) = 0.28 / 0.7 = 0.4

# Bayes' Theorem is not needed here. Applying it would require P(Uses Plan | Smoker) and P(Smoker),
# and neither can be derived from the given information.

# The probability that an employee is a smoker given that they use the health insurance plan is 0.4 (40%).
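
# A quick numeric check of the definition of conditional probability, using only the two figures
# given in the question. The variable names are illustrative and not part of the original notebook.

```python
# Probabilities stated in the question
p_uses_plan = 0.7            # P(Uses Plan)
p_smoker_given_plan = 0.4    # P(Smoker | Uses Plan)

# Joint probability from the multiplication rule
p_smoker_and_plan = p_smoker_given_plan * p_uses_plan

# Conditional probability recovered from its definition
p_check = p_smoker_and_plan / p_uses_plan

print(f"P(Smoker and Uses Plan) = {p_smoker_and_plan:.2f}")
print(f"P(Smoker | Uses Plan)   = {p_check:.2f}")
```

# Dividing the joint probability by P(Uses Plan) returns the value that was given, which is why
# P(Uses Plan) = 0.7 does not change the answer.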

In [3]:
# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

In [4]:
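# Feature type: Bernoulli Naive Bayes is used for binary features (e.g. whether a word is present or absent),
# while Multinomial Naive Bayes is used for discrete count features (e.g. how many times each word occurs).

# Distribution: Bernoulli Naive Bayes models each feature with a Bernoulli distribution, while Multinomial Naive Bayes
# models the feature counts of each class with a multinomial distribution.

# Likelihood calculation: for a class c with per-feature parameters p_i,
#     Bernoulli:   P(x | c) = product over i of p_i^x_i * (1 - p_i)^(1 - x_i), so an absent feature (x_i = 0) still
#                  contributes a factor (1 - p_i).
#     Multinomial: P(x | c) is proportional to the product over i of p_i^x_i, so an absent feature contributes nothing.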
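
# The difference in the likelihood calculation can be sketched in plain Python. This is a minimal
# illustration, not scikit-learn's implementation: the function names, the three-word vocabulary and
# the probabilities are all hypothetical, and smoothing and class priors are left out.

```python
import math

def bernoulli_log_likelihood(x, p):
    """log P(x | c) for binary features x, where p[i] = P(feature i present | c)."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(x, p)
    )

def multinomial_log_likelihood(counts, p):
    """log P(counts | c) without the multinomial coefficient, which is identical for every class."""
    return sum(n_i * math.log(p_i) for n_i, p_i in zip(counts, p))

# Hypothetical word counts for one message over a three-word vocabulary
counts = [3, 0, 1]
present = [1 if n > 0 else 0 for n in counts]   # binarized view used by the Bernoulli model

p_bernoulli = [0.8, 0.5, 0.2]     # assumed P(word i present | class)
p_multinomial = [0.6, 0.3, 0.1]   # assumed P(word i | class), sums to 1

bern_ll = bernoulli_log_likelihood(present, p_bernoulli)
multi_ll = multinomial_log_likelihood(counts, p_multinomial)

print(f"Bernoulli log-likelihood:   {bern_ll:.4f}")
print(f"Multinomial log-likelihood: {multi_ll:.4f}")
```

# The Bernoulli model pays a cost for the absent second word, log(1 - 0.5), while the multinomial
# model ignores it and instead weights the first word by its count of 3.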

In [5]:
# Q3. How does Bernoulli Naive Bayes handle missing values?

In [6]:
# Bernoulli Naive Bayes has no built-in mechanism for missing values. In principle, because the features are treated as
# conditionally independent given the class, a missing feature can simply be left out of the likelihood product for that
# sample. In practice, scikit-learn's BernoulliNB does not accept NaN inputs, so missing values must be handled beforehand.

# There are two common approaches to handle missing values in Naive Bayes:

#     Imputation: Estimate the missing values from the other observations in the dataset (for binary features, the
#     most frequent value is a common choice).

#     Using a different algorithm: Some algorithms handle missing values natively and can be used instead of
#     Naive Bayes when a large share of the data is missing.
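
# The imputation approach can be sketched with a scikit-learn pipeline. This is one possible setup,
# not the only one: the toy matrix is made up for illustration, and the "most_frequent" strategy is
# an assumption that suits binary features.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary feature matrix with missing entries marked as np.nan
X_toy = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing value with the most frequent value of its column, then fit the classifier
pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
pipeline.fit(X_toy, y_toy)

# The same imputation is applied automatically to new samples at prediction time
toy_pred = pipeline.predict(np.array([[1, 0, np.nan]]))
print(toy_pred)
```

# Keeping the imputer inside the pipeline means its fill values are learned from the training data
# only, which matters once cross-validation is used.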

In [7]:
# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

In [8]:
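# Yes, Gaussian Naive Bayes can be used for multi-class classification. It estimates a prior and a per-feature mean and
# variance for every class, then predicts the class with the highest posterior probability, so the number of classes is
# not limited to two.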
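
# A minimal sketch of multi-class use, assuming the three-class Iris dataset bundled with
# scikit-learn as a stand-in (it is not part of this assignment).

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 numeric features
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

print(gnb.classes_)                           # class labels seen during fit
print(gnb.theta_.shape)                       # one row of feature means per class
print(gnb.predict_proba(X_iris[:1]).shape)    # one posterior probability per class
```

# No extra configuration is needed: the classifier learns one set of Gaussian parameters per label
# found in y.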

In [9]:
# Q5. Assignment:
# Data preparation:
# Download the "Spambase Data Set" from the UCI Machine Learning Repository
# (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to
# predict whether a message is spam or not based on several input features.
# Implementation:
# Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
# scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
# dataset. You should use the default hyperparameters for each classifier.
# Results:
# Report the following performance metrics for each classifier:
# Accuracy
# Precision
# Recall
# F1 score
# Discussion:
# Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
# the case? Are there any limitations of Naive Bayes that you observed?
# Conclusion:
# Summarise your findings and provide some suggestions for future work.

# Note: Create your assignment in Jupyter notebook and upload it to GitHub & share that github repository
# link through your dashboard. Make sure the repository is public.
# Note: This dataset contains a binary classification problem with multiple features. The dataset is
# relatively small, but it can be used to demonstrate the performance of the different variants of Naive
# Bayes on a real-world problem.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

In [20]:
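# Load the dataset. Note: the column names in the outputs below are the values of the first data row,
# which indicates the file has no header row and pandas consumed that first sample as the header.
df = pd.read_csv('spambase.csv')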


In [25]:
df.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.41,0.42,0.43,0.778,0.44,0.45,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [21]:
df.columns

Index(['0', '0.64', '0.64.1', '0.1', '0.32', '0.2', '0.3', '0.4', '0.5', '0.6',
       '0.7', '0.64.2', '0.8', '0.9', '0.10', '0.32.1', '0.11', '1.29', '1.93',
       '0.12', '0.96', '0.13', '0.14', '0.15', '0.16', '0.17', '0.18', '0.19',
       '0.20', '0.21', '0.22', '0.23', '0.24', '0.25', '0.26', '0.27', '0.28',
       '0.29', '0.30', '0.31', '0.33', '0.34', '0.35', '0.36', '0.37', '0.38',
       '0.39', '0.40', '0.41', '0.42', '0.43', '0.778', '0.44', '0.45',
       '3.756', '61', '278', '1'],
      dtype='object')

In [32]:
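# The last column holds the class label (1 = spam, 0 = not spam); it is named '1' here because
# the column names were taken from the first data row
target_column = '1'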

In [33]:
# Separate the feature matrix X from the target vector y
X = df.drop(target_column, axis=1)
y = df[target_column]

In [34]:
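# Hold out 30% of the samples for testing.
# Note: this is a single hold-out split, not the 10-fold cross-validation requested in the assignment.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)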

In [35]:
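# Standardize the features; the scaler is fitted on the training set only so that
# no test-set statistics leak into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)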

In [36]:
models = {
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': GaussianNB()
}

In [37]:
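for name, model in models.items():
    if name == 'GaussianNB':
        # GaussianNB is trained on the standardized features
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        # BernoulliNB and MultinomialNB use the raw features: MultinomialNB requires
        # non-negative inputs, and BernoulliNB binarizes each feature at 0 by default
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    # Report the four metrics on the held-out test set (positive class = spam)
    print(f'{name} - Accuracy: {accuracy_score(y_test, y_pred):.4f}')
    print(f'{name} - Precision: {precision_score(y_test, y_pred):.4f}')
    print(f'{name} - Recall: {recall_score(y_test, y_pred):.4f}')
    print(f'{name} - F1 Score: {f1_score(y_test, y_pred):.4f}')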

BernoulliNB - Accuracy: 0.8717
BernoulliNB - Precision: 0.8846
BernoulliNB - Recall: 0.7972
BernoulliNB - F1 Score: 0.8387
MultinomialNB - Accuracy: 0.7717
MultinomialNB - Precision: 0.7426
MultinomialNB - Recall: 0.6950
MultinomialNB - F1 Score: 0.7180
GaussianNB - Accuracy: 0.8145
GaussianNB - Precision: 0.7093
GaussianNB - Recall: 0.9428
GaussianNB - F1 Score: 0.8095
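
# The assignment asks for 10-fold cross-validation, while the scores above come from a single
# train/test split. One way to run the cross-validated evaluation is sketched below. The helper
# name evaluate_naive_bayes and the synthetic stand-in data are assumptions added for illustration;
# on the real dataset, pass the Spambase X and y instead. Scores on the stand-in data say nothing
# about Spambase.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_naive_bayes(X, y, n_splits=10, random_state=42):
    """Return mean accuracy, precision, recall and F1 per classifier over stratified k-fold CV."""
    models = {
        'BernoulliNB': BernoulliNB(),
        'MultinomialNB': MultinomialNB(),
        # Scaling sits inside the pipeline so it is re-fitted on each training fold
        'GaussianNB': make_pipeline(StandardScaler(), GaussianNB()),
    }
    scoring = ['accuracy', 'precision', 'recall', 'f1']
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[name] = {metric: scores[f'test_{metric}'].mean() for metric in scoring}
    return results

# Synthetic, non-negative stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.poisson(lam=1.0 + 2.0 * y_demo[:, None], size=(200, 5)).astype(float)

cv_results = evaluate_naive_bayes(X_demo, y_demo)
for name, metrics in cv_results.items():
    print(name, {metric: round(value, 4) for metric, value in metrics.items()})
```

# Stratified folds keep the spam/non-spam ratio roughly constant across folds, and averaging over
# 10 folds gives a more stable estimate than one 70/30 split.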
