Q1. Probability of smoker given health insurance:

This is a direct application of conditional probability. We know:

P(Uses health insurance) = 70%
P(Smoker | Uses health insurance) = 40% (the share of plan users who are smokers)

The question asks for P(Smoker | Uses health insurance), and that value is given directly: 40%. Bayes' theorem is not needed.

The definition of conditional probability confirms this:
P(Smoker | Uses health insurance) = P(Smoker and Uses health insurance) / P(Uses health insurance)

The joint probability follows from the two given values: P(Smoker and Uses health insurance) = 0.40 × 0.70 = 0.28, so 28% of all employees are smokers who use the plan. Dividing by P(Uses health insurance) = 0.70 gives 0.28 / 0.70 = 0.40.

The reverse quantity, P(Uses health insurance | Smoker), cannot be computed from this information, because Bayes' theorem would also require P(Smoker), the overall share of smokers among employees, which is not given.

Answer: the probability of being a smoker given health insurance is 0.40 (40%).
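
As a quick numerical check, the short sketch below plugs the two figures from the question into the definition of conditional probability. The variable names are illustrative, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction

# Figures given in the question
p_insurance = Fraction(70, 100)               # P(Uses health insurance)
p_smoker_given_insurance = Fraction(40, 100)  # P(Smoker | Uses health insurance)

# Joint probability from the product rule
p_smoker_and_insurance = p_smoker_given_insurance * p_insurance

# Dividing the joint by P(Uses health insurance) recovers the conditional
p_check = p_smoker_and_insurance / p_insurance

print("P(Smoker and Uses health insurance):", float(p_smoker_and_insurance))
print("P(Smoker | Uses health insurance):", float(p_check))
```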

Q2. Bernoulli vs Multinomial Naive Bayes:

Bernoulli Naive Bayes: This is suitable when each feature is binary (e.g., yes/no, true/false, word present/absent). It models each feature as a Bernoulli variable that is conditionally independent of the others given the class label, and it explicitly accounts for the absence of a feature (e.g., predicting a disease from which symptoms are present or absent).

Multinomial Naive Bayes: This is suitable when features are non-negative counts or frequencies (e.g., how many times each word occurs in an email). It models the feature vector of each class with a multinomial distribution, so how often a feature occurs matters, not just whether it occurs.

In simpler terms, Bernoulli Naive Bayes deals with yes/no features, while Multinomial Naive Bayes handles count features. The distinction concerns the features, not the number of classes: both variants work for binary and multi-class problems.
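
To make the contrast concrete, here is a minimal sketch on a tiny made-up count matrix (the data and variable names are purely illustrative). It uses scikit-learn's BernoulliNB with its binarize threshold at 0.0, so every positive count becomes 1, whereas MultinomialNB works with the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are three hypothetical words
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 2, 2]])
y_toy = np.array([1, 1, 0, 0])

bnb = BernoulliNB(binarize=0.0).fit(X_counts, y_toy)  # sees only presence/absence
mnb = MultinomialNB().fit(X_counts, y_toy)            # sees the actual counts

# Two documents with the same words present but different counts
doc_a = np.array([[3, 0, 1]])
doc_b = np.array([[1, 0, 1]])

# Bernoulli: both rows binarize to [1, 0, 1], so the probabilities match
same_for_bernoulli = np.allclose(bnb.predict_proba(doc_a), bnb.predict_proba(doc_b))
# Multinomial: the counts differ, so the probabilities differ
same_for_multinomial = np.allclose(mnb.predict_proba(doc_a), mnb.predict_proba(doc_b))

print("Bernoulli gives identical probabilities:", same_for_bernoulli)
print("Multinomial gives identical probabilities:", same_for_multinomial)
```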

Q3. Handling missing values in Bernoulli Naive Bayes:

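Missing values can be problematic for Naive Bayes as they disrupt the calculation of feature probabilities. scikit-learn's BernoulliNB does not accept NaN inputs, so missing values must be handled before fitting. Here are some common approaches:

Ignoring instances with missing values: This is simple, but it discards data and can bias the model, especially if missing data is frequent.

Imputation: Estimate the missing value from the available data. For binary features, the most frequent value (mode) is the natural choice; mean or median imputation would produce non-binary values. More complex model-based methods are also possible.

Smoothing: Laplace (additive) smoothing is sometimes listed here, but it addresses a different problem: it prevents zero probabilities for feature values never observed with a class. It does not fill in missing entries, so it complements the approaches above rather than replacing them.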
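
The imputation approach can be sketched as follows. This is a minimal illustration, not a recommendation: the toy data is made up, and the choice of scikit-learn's SimpleImputer with the "most_frequent" strategy is an assumption about what suits binary features.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary feature matrix with two missing entries (np.nan)
X_missing = np.array([[1, 0, 1],
                      [1, np.nan, 0],
                      [0, 1, 1],
                      [0, 1, np.nan],
                      [1, 0, 0],
                      [0, 1, 1]], dtype=float)
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing entry with the column's most frequent value, then fit BernoulliNB
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_toy)

# The same imputation is applied automatically at prediction time
query = np.array([[1, np.nan, 0]])
pred = model.predict(query)
print("Predicted class:", pred[0])
```

Putting the imputer inside the pipeline means it is refit on each training fold during cross-validation, which avoids leaking information from the held-out fold.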

Q4. Gaussian Naive Bayes for multi-class classification:

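Naive Bayes assumes features are conditionally independent given the class; the variants differ in the distribution assumed for each feature. Gaussian Naive Bayes assumes each continuous feature follows a Gaussian (normal, bell-curve) distribution within each class. It is not limited to binary classification: it estimates a mean and variance per feature for every class and predicts the class with the highest posterior probability, so it handles any number of classes.

For multi-class classification with continuous features, Gaussian Naive Bayes is a fast baseline. Other methods such as Support Vector Machines (SVM) or Multi-Layer Perceptrons (MLP) may be more accurate when features are strongly correlated or clearly non-Gaussian.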
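
As a small illustration of Gaussian Naive Bayes on a problem with more than two classes, the sketch below fits scikit-learn's GaussianNB on the bundled three-class iris dataset. The dataset choice is an assumption made for the example and is not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 continuous features, 3 classes (ships with scikit-learn)
X_iris, y_iris = load_iris(return_X_y=True)

clf = GaussianNB().fit(X_iris, y_iris)

# One row of per-feature means for each class
print("Classes:", clf.classes_)
print("Shape of per-class feature means:", clf.theta_.shape)

# 5-fold cross-validated accuracy
iris_scores = cross_val_score(GaussianNB(), X_iris, y_iris, cv=5)
print("Mean accuracy:", iris_scores.mean())
```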

In [8]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [9]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
names = ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our',
         'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail',
         'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses',
         'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit',
         'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp',
         'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs',
         'word_freq_telnet', 'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
         'word_freq_technology', 'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct',
         'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re',
         'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(',
         'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#', 'capital_run_length_average',
         'capital_run_length_longest', 'capital_run_length_total', 'is_spam']
data = pd.read_csv(url, names=names, header=None)

In [10]:
X = data.drop('is_spam', axis=1)
y = data['is_spam']

In [11]:
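# BernoulliNB binarizes features at 0.0 by default, so each frequency becomes present/absent
bernoulli_nb = BernoulliNB()
# MultinomialNB requires non-negative features, which holds for these frequencies
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()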

In [12]:
scoring = ['accuracy', 'precision', 'recall', 'f1']

In [14]:
# 10-fold cross-validation; for classifiers an integer cv uses stratified folds
scores_bernoulli = cross_validate(bernoulli_nb, X, y, cv=10, scoring=scoring)
scores_multinomial = cross_validate(multinomial_nb, X, y, cv=10, scoring=scoring)
scores_gaussian = cross_validate(gaussian_nb, X, y, cv=10, scoring=scoring)

In [16]:
# Average every entry over the 10 folds (the dicts also contain fit_time and score_time)
mean_scores_bernoulli = {k: np.mean(v) for k, v in scores_bernoulli.items()}
mean_scores_multinomial = {k: np.mean(v) for k, v in scores_multinomial.items()}
mean_scores_gaussian = {k: np.mean(v) for k, v in scores_gaussian.items()}

In [17]:
print("Bernoulli Naive Bayes:")
print("Accuracy:", mean_scores_bernoulli['test_accuracy'])
print("Precision:", mean_scores_bernoulli['test_precision'])
print("Recall:", mean_scores_bernoulli['test_recall'])
print("F1 score:", mean_scores_bernoulli['test_f1'])
print()

print("Multinomial Naive Bayes:")
print("Accuracy:", mean_scores_multinomial['test_accuracy'])
print("Precision:", mean_scores_multinomial['test_precision'])
print("Recall:", mean_scores_multinomial['test_recall'])
print("F1 score:", mean_scores_multinomial['test_f1'])
print()

print("Gaussian Naive Bayes:")
print("Accuracy:", mean_scores_gaussian['test_accuracy'])
print("Precision:", mean_scores_gaussian['test_precision'])
print("Recall:", mean_scores_gaussian['test_recall'])
print("F1 score:", mean_scores_gaussian['test_f1'])

Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 score: 0.8481249015095276

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 score: 0.7282909724016348

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 score: 0.8130660909542995
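
For easier comparison, the reported means can be collected into one table. The sketch below copies the figures printed above, rounded to four decimals, rather than rerunning the experiment; the DataFrame layout and names are illustrative.

```python
import pandas as pd

# Mean 10-fold scores copied from the printed output, rounded to four decimals
results = pd.DataFrame(
    {
        "Accuracy": [0.8839, 0.7863, 0.8218],
        "Precision": [0.8870, 0.7393, 0.7104],
        "Recall": [0.8152, 0.7215, 0.9570],
        "F1": [0.8481, 0.7283, 0.8131],
    },
    index=["Bernoulli", "Multinomial", "Gaussian"],
)

print(results)

# Best-scoring model for each metric
best = results.idxmax()
print(best)
```

On these numbers, Bernoulli Naive Bayes leads on accuracy, precision, and F1, while Gaussian Naive Bayes has the highest recall but the lowest precision.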
