Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

The problem states this conditional probability directly: P(smoker | uses the plan) = 40% = 0.4. The 70% figure is the share of employees who use the plan and is not needed to answer the question.
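As a quick check, the same answer falls out of the definition P(A | B) = P(A and B) / P(B). The sketch below uses exact fractions to avoid floating-point noise; the variable names are illustrative and not part of the original question.

```python
from fractions import Fraction

# Figures taken from the question
p_plan = Fraction(70, 100)               # P(uses plan)
p_smoker_given_plan = Fraction(40, 100)  # P(smoker | uses plan), stated directly

# Joint probability: P(smoker and uses plan) = P(smoker | plan) * P(plan)
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Recover the conditional probability from the definition
p_recovered = p_smoker_and_plan / p_plan

print(float(p_smoker_and_plan))  # share of all employees who smoke and use the plan
print(float(p_recovered))        # P(smoker | uses plan)
```

The joint probability works out to 28% of all employees, and dividing by 70% returns the 40% given in the question.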

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes is used for binary/boolean features, where each feature represents the presence or absence of something (e.g., a word in a document). Multinomial Naive Bayes is used for discrete, count-based features, typically representing the frequency of occurrences (e.g., the number of times a word appears in a document).
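A minimal sketch of the difference, using a made-up word-count matrix (the data and variable names are assumptions for illustration). With scikit-learn's default `binarize=0.0`, BernoulliNB only sees whether a count is above zero, whereas MultinomialNB accumulates the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are vocabulary terms
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y_toy = np.array([1, 1, 0, 0])

# Presence/absence version of the same data
X_binary = (X_counts > 0).astype(int)

# BernoulliNB thresholds its input at 0 by default, so both fits see the same data
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit(X_binary, y_toy)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

# MultinomialNB sums the raw counts per class
mnb = MultinomialNB().fit(X_counts, y_toy)

print(same_model)
print(bnb_binary.feature_count_)  # number of documents per class containing each term
print(mnb.feature_count_)         # total occurrences of each term per class
```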

Q3. How does Bernoulli Naive Bayes handle missing values?

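Bernoulli Naive Bayes has no built-in mechanism for missing values. The model expects every feature to be 0 or 1, and scikit-learn's BernoulliNB raises an error if the input contains NaN. Missing values therefore have to be handled before training, for example by imputation. A common convention is to encode a missing value as absent (i.e., zero or false), which assumes the feature is not present.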
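One way to apply the "missing means absent" convention is to impute a constant 0 before the classifier. This is a sketch on made-up data, and the choice of 0 as the fill value is an assumption that should be checked against what a missing entry means in the dataset at hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with some missing entries (illustrative data)
X_missing = np.array([
    [1.0, np.nan, 0.0],
    [1.0, 1.0, np.nan],
    [0.0, 0.0, 1.0],
    [np.nan, 0.0, 1.0],
])
y_missing = np.array([1, 1, 0, 0])

# Replace NaN with 0 ("absent"), then fit Bernoulli Naive Bayes
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)
model.fit(X_missing, y_missing)
predictions = model.predict(X_missing)

# Inspect what the classifier actually received
imputed = model.named_steps["simpleimputer"].transform(X_missing)
print(imputed)
print(predictions)
```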

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

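Yes. Gaussian Naive Bayes estimates a prior for each class and a per-class mean and variance for every feature, then predicts the class with the highest posterior probability, so it handles any number of classes without modification.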
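A short illustration on the three-class Iris dataset bundled with scikit-learn (the dataset choice is an assumption for demonstration and is unrelated to the assignment data):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three target classes
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# One posterior probability per class for each sample
class_probabilities = gnb.predict_proba(X_iris[:5])

print(gnb.classes_)
print(class_probabilities.shape)
```

Each row of `class_probabilities` holds one posterior per class, and the predicted label is the class with the largest value.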

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase).
This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

In [1]:
'''Assignment
Data Preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository.

Implementation:
Load Data:
Load the dataset using pandas.
Preprocess Data:
Split features and labels.
Perform any necessary preprocessing.
Implement Classifiers:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes using scikit-learn.
Evaluate Performance:
Use 10-fold cross-validation to evaluate accuracy, precision, recall, and F1 score for each classifier.
Results:
Report the following performance metrics for each classifier:

Accuracy
Precision
Recall
F1 Score
Discussion:
Best Performing Variant:
Identify which Naive Bayes variant performed the best.
Discuss why this variant performed better in terms of the dataset characteristics.
Limitations:
Note any limitations observed with Naive Bayes classifiers.
Conclusion:
Summarize the findings.
Provide suggestions for future work.
Implementation:
Let's start by implementing this in Python.

First, download the dataset and load it:'''

import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset (the file has no header row; the last column is the spam label)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

# Split features and labels
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Initialize classifiers with default hyperparameters.
# GaussianNB is wrapped in a pipeline so the scaler is fit on the training
# folds only; scaling the full dataset up front would leak test-fold statistics.
classifiers = {
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': make_pipeline(StandardScaler(), GaussianNB())
}

# Scoring metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1'
}

# Evaluate each classifier using 10-fold cross-validation
results = {}
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    # Average each metric over the 10 test folds
    results[name] = {metric: scores[f'test_{metric}'].mean() for metric in scoring}

results
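# Results:

# Example result structure only. The values below are illustrative placeholders,
# not measured output; the actual numbers come from running the code above.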
{
    'BernoulliNB': {'accuracy': 0.89, 'precision': 0.87, 'recall': 0.84, 'f1': 0.85},
    'MultinomialNB': {'accuracy': 0.92, 'precision': 0.91, 'recall': 0.89, 'f1': 0.90},
    'GaussianNB': {'accuracy': 0.81, 'precision': 0.79, 'recall': 0.82, 'f1': 0.80}
}
'''Discussion:
Best Performing Variant: In the example figures above, MultinomialNB scores highest. MultinomialNB often does well on text data because it models how often each word occurs, which suits spam detection. Note that the Spambase features are word and character frequencies expressed as percentages, plus capital-run-length statistics, rather than raw counts, so the ranking should be confirmed against the actual cross-validation output.
Limitations:
Naive Bayes assumes the features are conditionally independent given the class, which rarely holds in real-world data.
GaussianNB may perform poorly on features that are far from normally distributed, such as the skewed, zero-heavy frequency features in Spambase.
Conclusion:
Summary: In the example figures, MultinomialNB outperforms BernoulliNB and GaussianNB, consistent with its suitability for frequency-based features in text classification.
Future Work:
Experiment with feature engineering and hyperparameter tuning (for example, the smoothing parameter alpha).
Compare Naive Bayes with other classifiers such as SVM or Random Forest.'''
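The future-work items could be sketched as follows. To keep the example self-contained it uses synthetic data from `make_classification` rather than Spambase, so the grid of `alpha` values, the forest size, and any scores it prints are assumptions for illustration and say nothing about the assignment dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data; MultinomialNB requires non-negative features
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=0)
X_syn = np.abs(X_syn)

# Hyperparameter tuning: search over the smoothing parameter alpha
alpha_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(MultinomialNB(), alpha_grid, cv=10, scoring="f1")
search.fit(X_syn, y_syn)

# Comparison with another classifier under the same 10-fold protocol
rf_scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_syn, y_syn, cv=10, scoring=["accuracy", "f1"],
)

print(search.best_params_, search.best_score_)
print(rf_scores["test_accuracy"].mean(), rf_scores["test_f1"].mean())
```

Swapping `X_syn` and `y_syn` for the Spambase `X` and `y` would apply the same procedure to the assignment data.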

