# Naïve Bayes-2

Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Q3. How does Bernoulli Naive Bayes handle missing values?

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: Create your assignment in a Jupyter notebook, upload it to a public GitHub repository, and share the
repository link through your dashboard.
Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

The quantity asked for is a conditional probability, and the survey states it directly. Let's denote the events as follows:

- A: Employee uses the health insurance plan.
- B: Employee is a smoker.

We are given the following probabilities:
- P(A) = 0.70 (probability that an employee uses the health insurance plan).
- P(B|A) = 0.40 (probability that an employee is a smoker given that they use the plan).

We want to find P(B|A), the probability that an employee is a smoker given that they use the plan. The statement "40% of the employees who use the plan are smokers" is already restricted to plan users, so it is exactly P(B|A).

Bayes' theorem,
\[ P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} \]
would be the right tool if we knew P(A|B) and P(B), but the survey provides neither, and neither is needed here. Note that we cannot simply set P(A|B) = P(A): that would assume plan use and smoking are independent, which the survey does not state.

What the two given numbers do determine is the joint probability that an employee both uses the plan and smokes:
\[ P(A \cap B) = P(B|A) \cdot P(A) = 0.40 \times 0.70 = 0.28 \]

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 40%.
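
As a quick numerical check, the same figures can be reproduced by counting heads in a hypothetical workforce. The sketch below assumes 1,000 employees (a number chosen only for illustration, not taken from the survey) and uses exact fractions to avoid rounding noise.

```python
from fractions import Fraction

# Hypothetical workforce size, chosen only to make the counts easy to read.
total_employees = 1000

p_plan = Fraction(70, 100)               # P(A): employee uses the plan
p_smoker_given_plan = Fraction(40, 100)  # P(B|A): smoker among plan users

# Head counts implied by the two survey figures.
plan_users = total_employees * p_plan                     # 700 employees
smoking_plan_users = plan_users * p_smoker_given_plan     # 280 employees

# Joint probability P(A and B), then the conditional recovered from it.
p_joint = smoking_plan_users / total_employees
p_cond = p_joint / p_plan

print(f"P(A and B) = {float(p_joint):.2f}")
print(f"P(B|A)     = {float(p_cond):.2f}")
```

Dividing the joint probability by P(A) returns the 40% we started with, which confirms that the survey figure is already the conditional probability.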

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The key difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the types of data they are suitable for and the underlying assumptions:

1. **Bernoulli Naive Bayes:**
   - Suitable for binary data (where features are either present or absent).
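   - Assumes that features are binary (0 or 1) and conditionally independent given the class. Unlike Multinomial Naive Bayes, it explicitly penalizes the absence of a feature that is typical of a class.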
   - Commonly used for text classification problems, where each feature represents the presence or absence of a word in a document.
   - Example: Spam detection, sentiment analysis.

2. **Multinomial Naive Bayes:**
   - Suitable for discrete data, typically count-based (e.g., word counts or term frequencies).
   - Assumes that features follow a multinomial distribution (counts of occurrences).
   - Often used for text classification when features represent word counts or frequencies in documents.
   - Works well with integer-valued features that represent the number of occurrences.
   - Example: Document classification based on word counts.

In summary, Bernoulli Naive Bayes is designed for binary data with a focus on presence or absence, while Multinomial Naive Bayes is suitable for count-based discrete data, such as word counts or frequencies.
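
The difference can be seen on a tiny example. The sketch below uses a made-up document-term count matrix (the words, counts, and labels are hypothetical) and scikit-learn's `BernoulliNB` and `MultinomialNB` with default smoothing. It shows that Bernoulli Naive Bayes only looks at whether a word occurs, while Multinomial Naive Bayes reacts to how often it occurs.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts; columns stand for the words "free", "meeting", "win".
X_counts = np.array([
    [3, 0, 2],
    [1, 0, 1],
    [0, 2, 0],
    [0, 1, 0],
    [2, 0, 3],
    [0, 3, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])  # 1 = spam, 0 = not spam (made-up labels)

# Presence/absence view of the same data.
X_binary = (X_counts > 0).astype(int)

# BernoulliNB binarizes its input at the `binarize` threshold (0.0 by default),
# so fitting on counts or on the 0/1 matrix yields the same model.
bnb_on_counts = BernoulliNB(binarize=0.0).fit(X_counts, y)
bnb_on_binary = BernoulliNB(binarize=0.0).fit(X_binary, y)
same_model = np.allclose(
    bnb_on_counts.feature_log_prob_, bnb_on_binary.feature_log_prob_
)

# MultinomialNB uses the counts themselves.
mnb = MultinomialNB().fit(X_counts, y)

# Two new messages containing the same words, with "free" repeated 1 vs 5 times.
X_new = np.array([[1, 1, 0], [5, 1, 0]])
bnb_proba = bnb_on_counts.predict_proba(X_new)
mnb_proba = mnb.predict_proba(X_new)

print("Bernoulli models identical:", same_model)
print("BernoulliNB P(spam):  ", bnb_proba[:, 1])
print("MultinomialNB P(spam):", mnb_proba[:, 1])
```

Both messages get the same spam probability from `BernoulliNB`, because they contain the same set of words, whereas `MultinomialNB` assigns a higher spam probability to the message that repeats "free" five times.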

Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in mechanism for missing values. The model expects every feature to be a binary value (0 or 1), and scikit-learn's `BernoulliNB` rejects input that contains NaN. Missing values therefore have to be handled before training, and you typically have a few options:

1. **Impute Missing Values:** You can impute missing values by assigning them a specific value, such as 0 or 1, based on some criterion. For example, you might impute missing values with 0 to indicate the absence of a feature or with 1 to indicate the presence. The choice of imputation method should align with the specific problem and domain knowledge.

2. **Treat Missing Values as a Separate Category:** Instead of only imputing missing values, you can record the fact that a value was missing, for example by adding a binary "is missing" indicator column for each affected feature. This approach allows the model to learn from the absence of information explicitly.

3. **Ignore Rows with Missing Values:** If missing values are relatively rare and don't significantly impact the dataset's size, you can choose to remove rows with missing values.

The approach you choose depends on the nature of your data, the impact of missing values on your model's performance, and your domain-specific knowledge.
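
Options 1 and 2 can be combined in a single preprocessing step. The following is a minimal sketch, assuming scikit-learn's `SimpleImputer` with `strategy="most_frequent"` and `add_indicator=True`; the toy feature matrix and labels are hypothetical and serve only to show the mechanics.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan).
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [0, 1, np.nan],
    [1, 0, 0],
    [0, np.nan, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])  # made-up class labels

# Impute each missing value with the column's most frequent value (option 1)
# and append one 0/1 "was missing" column per feature that had gaps (option 2).
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# A new sample with a missing second feature goes through the same steps.
X_new = np.array([[1, np.nan, 0]])
X_new_imputed = model.named_steps["simpleimputer"].transform(X_new)
pred = model.predict(X_new)

print("Columns seen by BernoulliNB:", X_new_imputed.shape[1])
print("Predicted class:", pred[0])
```

Keeping the imputer inside the pipeline means it is refit on each training fold during cross-validation, so no information from the held-out fold leaks into the imputed values.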

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks. Gaussian Naive Bayes is a variant of Naive Bayes that is suitable for continuous or real-valued features. It assumes that the features follow a Gaussian (normal) distribution within each class.

Nothing in the model is specific to two classes: it estimates a separate set of Gaussian parameters (one mean and one variance per feature) for every class. At prediction time, the model computes the posterior probability of each class given the observed feature values, and the class with the highest posterior probability is predicted.

Here are the steps for using Gaussian Naive Bayes for multi-class classification:

1. Calculate the class priors (prior probabilities) for each class based on the training data.

2. Estimate the mean and variance of each feature within each class.

3. Given a new data point with feature values, calculate the likelihood of the data point under each class as the product of the per-feature Gaussian densities, using that class's means and variances.

4. Multiply the likelihood by the class prior for each class to obtain the unnormalized posterior probabilities.

5. Normalize the posterior probabilities to ensure that they sum to 1.

6. Predict the class with the highest posterior probability as the final prediction.

Gaussian Naive Bayes is commonly used for multi-class classification when dealing with continuous data or data that can be approximated as continuous, such as sensor readings, measurements, or other real-valued features.
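
The six steps above can be sketched directly in NumPy. The example below is an illustrative implementation, not scikit-learn's internal code: it uses the three-class Iris dataset bundled with scikit-learn (an assumption made here for convenience), works in log space for numerical stability, and omits the small variance smoothing that `GaussianNB` applies by default. It then compares its predictions with `GaussianNB`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes
classes = np.unique(y)

# Step 1: class priors. Step 2: per-class mean and variance of each feature.
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])


def predict_proba_manual(X_new):
    """Posterior class probabilities, shape (n_samples, n_classes)."""
    # Step 3: log-likelihood under independent per-feature Gaussians.
    diff = X_new[:, None, :] - means[None, :, :]
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :],
        axis=2,
    )
    # Step 4: add the log prior to get the unnormalized log posterior.
    log_post = log_lik + np.log(priors)
    # Step 5: normalize so that each row sums to 1.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


# Step 6: predict the class with the highest posterior probability.
manual_proba = predict_proba_manual(X)
manual_pred = classes[manual_proba.argmax(axis=1)]

sk_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sk_pred)
print(f"Agreement with GaussianNB: {agreement:.3f}")
```

The loop over `classes` is the only place where the number of classes matters, which is why the same procedure works unchanged for two classes or for many.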

Q5. Assignment: Implementing Naive Bayes Classifiers

To complete the assignment of implementing Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers on the "Spambase Data Set" and evaluating their performance, you can follow these steps in a Jupyter notebook:

1. Download the "Spambase Data Set" from the provided UCI Machine Learning Repository link.

2. Load the dataset into your Python environment using a library like pandas.

3. Preprocess the data by:
   - Handling any missing values (if present).
   - Splitting the data into features (input) and the target variable (output).
   - Encoding the target variable (e.g., 0 for non-spam and 1 for spam).

4. Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using scikit-learn.

5. Use 10-fold cross-validation to evaluate the performance of each classifier. You can use scikit-learn's `cross_val_score` (one metric per call) or `cross_validate` (several metrics in a single pass), or perform cross-validation manually.

6. For each classifier, calculate the following performance metrics:
   - Accuracy
   - Precision
   - Recall
   - F1 score

7. Discuss the results and compare the performance of the three Naive Bayes variants. Analyze which one performed the best and why. Mention any limitations or observations you made during the evaluation.

8. Summarize your findings and provide suggestions for future work or improvements.

Make sure to organize your Jupyter notebook with clear sections, explanations, and code comments to document each step. Once you've completed the assignment, upload your Jupyter notebook to a public GitHub repository and share the repository link through your dashboard for review.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the Spambase dataset (57 feature columns followed by the 0/1 spam label)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = [f"feature_{i}" for i in range(57)] + ["is_spam"]
data = pd.read_csv(url, header=None, names=column_names)

# Split the data into features and target.
# The label column is already encoded as 0 (non-spam) / 1 (spam),
# so no additional label encoding is required.
X = data.drop("is_spam", axis=1)
y = data["is_spam"]

# Initialize classifiers with their default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Metrics to collect in a single 10-fold cross-validation run per classifier
SCORING = ["accuracy", "precision", "recall", "f1"]

# Perform 10-fold cross-validation and report the mean of each metric
def evaluate_classifier(classifier, name):
    scores = cross_validate(classifier, X, y, cv=10, scoring=SCORING)

    print(f"{name} Naive Bayes Classifier:")
    print(f"Accuracy: {scores['test_accuracy'].mean():.4f}")
    print(f"Precision: {scores['test_precision'].mean():.4f}")
    print(f"Recall: {scores['test_recall'].mean():.4f}")
    print(f"F1 Score: {scores['test_f1'].mean():.4f}")
    print()

# Evaluate Bernoulli Naive Bayes
evaluate_classifier(bernoulli_nb, "Bernoulli")

# Evaluate Multinomial Naive Bayes
evaluate_classifier(multinomial_nb, "Multinomial")

# Evaluate Gaussian Naive Bayes
evaluate_classifier(gaussian_nb, "Gaussian")


Bernoulli Naive Bayes Classifier:
Accuracy: 0.8839
Precision: 0.8870
Recall: 0.8152
F1 Score: 0.8481

Multinomial Naive Bayes Classifier:
Accuracy: 0.7863
Precision: 0.7393
Recall: 0.7215
F1 Score: 0.7283

Gaussian Naive Bayes Classifier:
Accuracy: 0.8218
Precision: 0.7104
Recall: 0.9570
F1 Score: 0.8131

