In [None]:
Q1:

The quantity asked for is P(smoker | uses health insurance).

We are given P(uses health insurance) = 0.7 and P(smoker | uses health insurance) = 0.4, so the answer is stated directly: P(smoker | uses health insurance) = 0.4.

Bayes' theorem,

P(smoker | uses health insurance) = P(uses health insurance | smoker) * P(smoker) / P(uses health insurance)

is not needed here, and it could not be evaluated anyway because neither P(smoker) nor P(uses health insurance | smoker) is provided. Finding the overall P(smoker) would require more information, for example the percentage of smokers among employees who do not use the health insurance plan.
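The arithmetic behind this can be checked with a few lines of Python. This is only a sketch using the two figures quoted above; the variable names are illustrative. It computes the joint probability with the product rule and shows that the overall smoker rate can only be bounded, not determined.

```python
# Values quoted in the answer above.
p_insurance = 0.7               # P(uses health insurance)
p_smoker_given_insurance = 0.4  # P(smoker | uses health insurance)

# Product rule: P(smoker and insured) = P(smoker | insured) * P(insured).
p_smoker_and_insurance = p_smoker_given_insurance * p_insurance

# P(smoker) = P(smoker and insured) + P(smoker and not insured).
# The second term is unknown; it lies between 0 and P(not insured),
# so P(smoker) can only be bounded.
p_smoker_lower = p_smoker_and_insurance
p_smoker_upper = p_smoker_and_insurance + (1 - p_insurance)

print("P(smoker and insured):", round(p_smoker_and_insurance, 2))
print("P(smoker) lies between", round(p_smoker_lower, 2), "and", round(p_smoker_upper, 2))
```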

In [None]:
Q2:

Bernoulli Naive Bayes: Assumes binary features (0 or 1), like "present" or "absent" for a specific word in a document. Ideal for text classification or data with boolean features.
Multinomial Naive Bayes: Assumes features are counts or frequencies, like word count in a document. Suitable for text classification with "bag-of-words" representation.
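
The difference can be illustrated with a small sketch. The toy word-count matrix below is made up for illustration only. With `binarize=0.0`, scikit-learn's `BernoulliNB` turns every positive count into 1, so it cannot tell one occurrence of a word from five, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy corpus: rows are documents, columns are counts of three words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 0],
])
y = np.array([1, 1, 0, 0])

# MultinomialNB models how often each word occurs.
mnb = MultinomialNB().fit(X_counts, y)

# BernoulliNB models only presence/absence; binarize=0.0 maps counts > 0 to 1.
bnb = BernoulliNB(binarize=0.0).fit(X_counts, y)

# Two documents that differ only in how often the first word appears.
doc_once = np.array([[1, 0, 0]])
doc_five_times = np.array([[5, 0, 0]])

print("Bernoulli:  ", bnb.predict_proba(doc_once), bnb.predict_proba(doc_five_times))
print("Multinomial:", mnb.predict_proba(doc_once), mnb.predict_proba(doc_five_times))
```

The Bernoulli model assigns both documents the same class probabilities, while the multinomial model becomes more confident as the count grows.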

In [1]:
Q3:

With Bernoulli Naive Bayes, missing values are usually handled in one of two ways:

Ignoring: Skip the missing entry when calculating the probabilities for that feature. This assumes values are missing at random and carry no information.
Imputation: Replace missing values before calculating probabilities. For binary features the most frequent value (the mode) is a natural placeholder, because a mean is generally not 0 or 1.
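
The imputation route might look like the sketch below. The data are invented for illustration, and the choice of `SimpleImputer(strategy="most_frequent")` is an assumption rather than the only option. To my knowledge scikit-learn's `BernoulliNB` rejects NaN inputs, so the imputer is placed in front of it in a pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with a few missing entries (np.nan).
X = np.array([
    [1, 0, 1],
    [1, np.nan, 0],
    [0, 1, np.nan],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing value with the column's most frequent value (the mode),
# which keeps the features binary, then fit Bernoulli Naive Bayes.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# The same imputation is applied automatically at prediction time.
pred = model.predict(np.array([[1, np.nan, 1]]))
print("Predicted class:", pred[0])
```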



In [None]:
Q4:

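Yes. Gaussian Naive Bayes is a classifier, not a regression method, and it handles multi-class problems as naturally as binary ones: it estimates a mean and variance per feature for each class and predicts the class with the highest posterior probability. It assumes continuous features that are normally distributed (Gaussian) within each class, so it works best when that assumption roughly holds.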
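
As a quick illustration, the sketch below fits scikit-learn's `GaussianNB` to the Iris data bundled with scikit-learn, which has three classes and four continuous features. The dataset and the 5-fold setting are chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 continuous features, 3 classes.
X_iris, y_iris = load_iris(return_X_y=True)

# Cross-validated accuracy of Gaussian Naive Bayes on a three-class problem.
scores = cross_val_score(GaussianNB(), X_iris, y_iris, cv=5, scoring="accuracy")

# Fit once on all data to inspect the classes the model learned.
model = GaussianNB().fit(X_iris, y_iris)

print("Classes seen by the model:", model.classes_)
print("Mean 5-fold accuracy:", scores.mean())
```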

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Spambase dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
names = [
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our",
    "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail",
    "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses",
    "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp", "word_freq_hpl",
    "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet", "word_freq_857",
    "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999", "word_freq_parts",
    "word_freq_pm", "word_freq_direct", "word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project",
    "word_freq_re", "word_freq_edu", "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(",
    "char_freq_[", "char_freq_!", "char_freq_$", "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "is_spam"
]
data = pd.read_csv(url, names=names)

# Prepare data
X = data.drop('is_spam', axis=1)
y = data['is_spam']

# Initialize classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation and compute performance metrics
def evaluate_classifier(classifier, X, y):
    """Return per-fold accuracy, precision, recall and F1 from 10-fold CV."""
    # Each metric triggers its own cross-validation run. With an integer cv and
    # a classifier, scikit-learn uses unshuffled stratified folds, so the
    # splits are identical across the four calls.
    scores = [
        cross_val_score(classifier, X, y, cv=10, scoring=metric)
        for metric in ("accuracy", "precision", "recall", "f1")
    ]
    return tuple(scores)

# Evaluate Bernoulli Naive Bayes
accuracy_bernoulli, precision_bernoulli, recall_bernoulli, f1_bernoulli = evaluate_classifier(bernoulli_nb, X, y)

# Evaluate Multinomial Naive Bayes
accuracy_multinomial, precision_multinomial, recall_multinomial, f1_multinomial = evaluate_classifier(multinomial_nb, X, y)

# Evaluate Gaussian Naive Bayes
accuracy_gaussian, precision_gaussian, recall_gaussian, f1_gaussian = evaluate_classifier(gaussian_nb, X, y)

# Report performance metrics
print("Bernoulli Naive Bayes:")
print("Accuracy:", np.mean(accuracy_bernoulli))
print("Precision:", np.mean(precision_bernoulli))
print("Recall:", np.mean(recall_bernoulli))
print("F1 Score:", np.mean(f1_bernoulli))
print("\n")

print("Multinomial Naive Bayes:")
print("Accuracy:", np.mean(accuracy_multinomial))
print("Precision:", np.mean(precision_multinomial))
print("Recall:", np.mean(recall_multinomial))
print("F1 Score:", np.mean(f1_multinomial))
print("\n")

print("Gaussian Naive Bayes:")
print("Accuracy:", np.mean(accuracy_gaussian))
print("Precision:", np.mean(precision_gaussian))
print("Recall:", np.mean(recall_gaussian))
print("F1 Score:", np.mean(f1_gaussian))


Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276


Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348


Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995
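
The four metrics above come from four separate cross-validation runs per classifier. A possible alternative, sketched here on synthetic data from `make_classification` so that it runs without downloading Spambase, is scikit-learn's `cross_validate`, which scores several metrics in a single pass. The sample size, feature count and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in for the real data (assumption: binary target, 20 features).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# Shuffled, stratified 10-fold splitter shared by all classifiers.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1"]

summary = {}
for name, clf in {"BernoulliNB": BernoulliNB(), "GaussianNB": GaussianNB()}.items():
    # One cross-validation run yields every requested metric.
    results = cross_validate(clf, X_demo, y_demo, cv=cv, scoring=scoring)
    summary[name] = {metric: results[f"test_{metric}"].mean() for metric in scoring}

for name, metrics in summary.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
```

`MultinomialNB` is left out of this sketch because it requires non-negative features, which the synthetic data do not guarantee.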


In [12]:
Q5:

The steps for this assignment, implemented in the code cell above, are outlined below:

Data Preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository.
Explore the data, understand the features, and check for missing values.
Preprocess the data (e.g., handle missing values, convert categorical features to numerical).
Split the data into training and testing sets; 10-fold cross-validation does this automatically for each fold.

Implementation:

Use the scikit-learn library in Python to implement:
Bernoulli Naive Bayes
Multinomial Naive Bayes
Gaussian Naive Bayes
Use 10-fold cross-validation for each classifier.
Evaluate performance using accuracy, precision, recall, and F1 score.

Results:

Report the performance metrics for each classifier.
Analyze the results and compare performance.

Discussion:

Discuss which variant performed best and why. In the results above, Bernoulli Naive Bayes has the highest accuracy (0.884) and F1 score (0.848), while Gaussian Naive Bayes has the highest recall (0.957) but the lowest precision (0.710).
Consider factors like data characteristics, feature types, and the assumptions of each classifier.
Mention any limitations you observed with Naive Bayes (e.g., the feature-independence assumption).

Conclusion:

Summarize your findings and insights.
Suggest potential improvements or future work using other algorithms or techniques.

Additional Tips:

Explore hyperparameter tuning for each classifier to potentially improve performance.
Visualize the results (e.g., confusion matrix) to gain further insights.
Consider class imbalance if present and explore techniques to address it.
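
The confusion-matrix tip can be sketched as follows. The example uses synthetic, mildly imbalanced data from `make_classification` instead of Spambase so that it runs offline, and `GaussianNB` is picked arbitrarily. `cross_val_predict` returns one out-of-fold prediction per sample, which `confusion_matrix` then tabulates.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: roughly 60/40 class split).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.6, 0.4], random_state=1
)

# Out-of-fold predictions: each sample is predicted by a model that never saw it.
y_pred = cross_val_predict(GaussianNB(), X_demo, y_demo, cv=10)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("False positives:", fp, "False negatives:", fn)
```

Reading the false-positive and false-negative counts separately shows whether a classifier trades precision for recall, a trade-off that a single accuracy figure hides.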

