Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

In [1]:
# Given probabilities
P_A = 0.70  # P(A): probability that an employee uses the health insurance plan
P_B_given_A = 0.40  # P(B|A): probability that an employee is a smoker given that they use the plan

# The question asks for P(B|A), which the survey states directly.
# No Bayes' theorem step and no assumption about P(B|not A) is needed;
# inverting the condition would instead give P(A|B), a different quantity.
print("Probability that an employee is a smoker given that they use the health insurance plan:", P_B_given_A)


Probability that an employee is a smoker given that they use the health insurance plan: 0.4
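
Because the question asks for P(smoker | uses plan), the definition P(B|A) = P(A and B) / P(A) can be checked with counts. The sketch below assumes a hypothetical workforce of 1,000 employees (the headcount is not given in the question) and uses the helper name `conditional_probability`, which is introduced here only for illustration.

```python
from fractions import Fraction


def conditional_probability(joint_count, condition_count):
    """Return P(B|A) = count(A and B) / count(A) as an exact fraction."""
    return Fraction(joint_count, condition_count)


total_employees = 1000  # hypothetical headcount, not from the question
plan_users = total_employees * 70 // 100  # 70% use the plan
smoking_plan_users = plan_users * 40 // 100  # 40% of plan users smoke

p_smoker_given_plan = conditional_probability(smoking_plan_users, plan_users)
p_smoker_and_plan = Fraction(smoking_plan_users, total_employees)

print("P(smoker | uses plan) =", float(p_smoker_given_plan))
print("P(smoker and uses plan) =", float(p_smoker_and_plan))
```

The conditional probability stays at 40% whatever headcount is assumed, while the joint probability of being a smoker and a plan user is 0.40 * 0.70 = 0.28.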


Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes:
Deals with binary features (presence/absence of a specific feature).
Models the probability of a feature being present or absent given a class, so the absence of a feature also counts as evidence.
Useful for data where each feature is a "yes/no" or "true/false" indicator.
Example: Email spam classification (spam/not spam) based on whether specific keywords appear in the message.


Multinomial Naive Bayes:
Deals with discrete count features, such as how many times each word occurs in a document.
Models the feature counts for each class with a multinomial distribution, so repeated occurrences of a feature add evidence.
Used for data where features are non-negative counts or frequencies over a finite vocabulary.
Example: Text classification (positive/negative review) based on word frequency.
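
The difference can be seen on a toy word-count matrix. This is a minimal sketch: the four documents, the three-word vocabulary, and the labels are made up for illustration. With its default `binarize=0.0`, scikit-learn's `BernoulliNB` turns every count greater than zero into 1, so it cannot tell one occurrence of a word from four, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical counts of the words ["free", "meeting", "win"] in four emails.
X_counts = np.array([
    [3, 0, 2],
    [2, 0, 1],
    [0, 2, 0],
    [0, 1, 0],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam (made-up labels)

multinomial_nb = MultinomialNB().fit(X_counts, y)
bernoulli_nb = BernoulliNB().fit(X_counts, y)  # counts > 0 are binarized to 1

once = np.array([[1, 0, 0]])        # "free" appears once
four_times = np.array([[4, 0, 0]])  # "free" appears four times

# Bernoulli NB only sees presence/absence, so both inputs look identical.
bernoulli_same = np.allclose(
    bernoulli_nb.predict_proba(once), bernoulli_nb.predict_proba(four_times)
)
# Multinomial NB becomes more confident as the count grows.
multinomial_same = np.allclose(
    multinomial_nb.predict_proba(once), multinomial_nb.predict_proba(four_times)
)

print("Bernoulli NB gives identical probabilities:", bernoulli_same)
print("Multinomial NB gives identical probabilities:", multinomial_same)
```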

Q3. Handling missing values in Bernoulli Naive Bayes:

Bernoulli Naive Bayes doesn't have a built-in mechanism to handle missing values directly. Common approaches include:

Dropping instances with missing values: This discards data and can introduce bias, especially if the values are not missing at random.
Imputation: Filling in missing values with estimates based on other data points (e.g., the most frequent value (mode) of the feature, since the features are binary, or a dedicated imputation technique).
Encoding missing values as a separate category: This can be done by adding an extra binary "missing" indicator for each affected feature.
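
The second and third approaches can be combined in scikit-learn. The following is a sketch on a made-up 4x2 binary matrix: `SimpleImputer` with `strategy="most_frequent"` fills each gap with the column mode, and `add_indicator=True` appends one binary "was missing" column per feature that had gaps, so the classifier still sees where values were absent.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features with two missing entries (np.nan).
X = np.array([
    [1.0, 0.0],
    [np.nan, 1.0],
    [1.0, np.nan],
    [0.0, 1.0],
])
y = np.array([1, 1, 0, 0])  # made-up labels

# Fill gaps with the column mode and append missing-value indicator columns.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
X_imputed = imputer.fit_transform(X)

# The result is fully binary, so Bernoulli NB can be fitted on it directly.
model = BernoulliNB().fit(X_imputed, y)

print(X_imputed)
print(model.predict(X_imputed))
```

The first two columns of `X_imputed` are the imputed features and the last two are the indicators. In a real evaluation the imputer would be fitted inside each cross-validation fold (for example via a `Pipeline`) so that no information leaks from the held-out data.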

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. It is the variant of the Naive Bayes algorithm commonly used for classification tasks where the features are continuous and assumed to follow a Gaussian (normal) distribution within each class.

In multi-class classification problems, Gaussian Naive Bayes calculates the posterior probability of each class given the features using Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a Gaussian distribution within each class.

The algorithm estimates the mean and variance of each feature for each class and then uses these parameters to calculate the likelihood of the observed features under each class. Finally, it assigns the class with the highest posterior probability as the predicted class for the input data point. Nothing in this procedure depends on the number of classes, so it works the same way for two classes or for many.
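
The steps described above can be sketched from scratch with NumPy on a three-class problem. This is an illustrative simplification rather than scikit-learn's implementation: the function names, the synthetic cluster centers, and the small constant `eps` added to the variances are all assumptions made for this example.

```python
import numpy as np


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate per-class feature means, variances, and class priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors


def predict_gaussian_nb(model, X):
    """Return the class with the highest log-posterior for each row of X."""
    classes, means, variances, priors = model
    diff = X[:, None, :] - means[None, :, :]  # (n_samples, n_classes, n_features)
    # Sum of per-feature Gaussian log-densities (the independence assumption).
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :]
    ).sum(axis=2)
    log_posterior = log_likelihood + np.log(priors)[None, :]
    return classes[np.argmax(log_posterior, axis=1)]


# Synthetic data: three well-separated 2-D Gaussian clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

model = fit_gaussian_nb(X, y)
predictions_at_centers = predict_gaussian_nb(model, centers)
train_accuracy = (predict_gaussian_nb(model, X) == y).mean()

print("Predicted classes at the cluster centers:", predictions_at_centers)
print("Training accuracy:", train_accuracy)
```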

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 


{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

In [4]:
from sklearn.model_selection import train_test_split

In [6]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


In [7]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [8]:
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [11]:
bernoulli_nb.fit(x_train,y_train)

  y = column_or_1d(y, warn=True)


In [20]:
scores_accuracy = cross_val_score(bernoulli_nb, x_train, y_train, cv=10, scoring='accuracy')
scores_precision = cross_val_score(bernoulli_nb, x_train, y_train, cv=10, scoring='precision')
scores_recall = cross_val_score(bernoulli_nb, x_train, y_train, cv=10, scoring='recall')
scores_f1 = cross_val_score(bernoulli_nb, x_train, y_train, cv=10, scoring='f1')

In [21]:

print("Accuracy:", scores_accuracy.mean())
print("Precision:", scores_precision.mean())
print("Recall:", scores_recall.mean())
print("F1 score:", scores_f1.mean())

Accuracy: 0.883768115942029
Precision: 0.8791415378468012
Recall: 0.813141061609247
F1 score: 0.8440783264421254


In [22]:
scores_accuracy = cross_val_score(multinomial_nb, x_train, y_train, cv=10, scoring='accuracy')
scores_precision = cross_val_score(multinomial_nb, x_train, y_train, cv=10, scoring='precision')
scores_recall = cross_val_score(multinomial_nb, x_train, y_train, cv=10, scoring='recall')
scores_f1 = cross_val_score(multinomial_nb, x_train, y_train, cv=10, scoring='f1')

In [24]:
print("Accuracy:", scores_accuracy.mean())
print("Precision:", scores_precision.mean())
print("Recall:", scores_recall.mean())
print("F1 score:", scores_f1.mean())

Accuracy: 0.7878260869565218
Precision: 0.7411018084158372
Recall: 0.698019301986309
F1 score: 0.7184613753647053


In [25]:
scores_accuracy = cross_val_score(gaussian_nb, x_train, y_train, cv=10, scoring='accuracy')
scores_precision = cross_val_score(gaussian_nb, x_train, y_train, cv=10, scoring='precision')
scores_recall = cross_val_score(gaussian_nb, x_train, y_train, cv=10, scoring='recall')
scores_f1 = cross_val_score(gaussian_nb, x_train, y_train, cv=10, scoring='f1')

In [26]:
print("Accuracy:", scores_accuracy.mean())
print("Precision:", scores_precision.mean())
print("Recall:", scores_recall.mean())
print("F1 score:", scores_f1.mean())

Accuracy: 0.8159420289855073
Precision: 0.6919704567487083
Recall: 0.9528840758612951
F1 score: 0.8011249285678238
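
The four metrics reported above can also be collected in a single cross-validation pass per classifier with `cross_validate`, which accepts a list of scorers. The sketch below is self-contained, so it uses a small synthetic, non-negative count matrix as a stand-in for Spambase; the data, the sample sizes, and the `results` dictionary are assumptions for illustration, and the numbers it prints are not comparable to the scores above. With the notebook's data, passing a 1-D target such as `y_train.values.ravel()` instead of the one-column DataFrame should avoid the `column_or_1d` conversion warning.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in for Spambase: non-negative counts, higher rates for class 1.
rng = np.random.default_rng(42)
n_samples, n_features = 400, 10
y = rng.integers(0, 2, size=n_samples)  # already 1-D, so no conversion warning
rates = np.where(y[:, None] == 1, 3.0, 1.0) * np.ones((1, n_features))
X = rng.poisson(rates).astype(float)

scoring = ["accuracy", "precision", "recall", "f1"]
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}

results = {}
for name, clf in classifiers.items():
    # One 10-fold run per classifier evaluates all four metrics together.
    cv_scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    results[name] = {metric: cv_scores[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in results.items():
    summary = ", ".join(f"{metric}={value:.3f}" for metric, value in metrics.items())
    print(f"{name}: {summary}")
```

Evaluating all metrics on the same folds also keeps the four numbers consistent with each other, since each call to `cross_val_score` otherwise refits the model ten times per metric.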
