**Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?**

**ANSWER:--------**


To find the probability that an employee is a smoker given that he/she uses the health insurance plan, we need to use conditional probability.

Let's denote:
- \( P(S) \) as the probability that an employee is a smoker.
- \( P(H) \) as the probability that an employee uses the health insurance plan.
- \( P(S \mid H) \) as the probability that an employee is a smoker given that they use the health insurance plan.

From the problem statement:
- \( P(H) = 0.70 \), since 70% of the employees use the health insurance plan.
- \( P(S \mid H) = 0.40 \), since 40% of the employees who use the health insurance plan are smokers.

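Because this value is stated directly in the problem, the answer is already 0.40. As a consistency check, we can work through the definition of conditional probability: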
\[ P(S \mid H) = \frac{P(S \cap H)}{P(H)} \]

Where:
- \( P(S \cap H) \) is the probability that an employee is both a smoker and uses the health insurance plan.

We can find \( P(S \cap H) \) using:
\[ P(S \cap H) = P(S \mid H) \cdot P(H) \]

Substituting the values we know:
\[ P(S \cap H) = 0.40 \cdot 0.70 = 0.28 \]

Dividing by \( P(H) \) recovers the conditional probability given in the problem, which confirms the result:
\[ P(S \mid H) = \frac{0.28}{0.70} = \frac{4}{10} = 0.4 \]

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is \( \boxed{0.4} \), or 40%.
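The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative and do not come from the problem statement.

```python
# Quantities given in the problem statement
p_h = 0.70          # P(H): employee uses the health insurance plan
p_s_given_h = 0.40  # P(S | H): employee is a smoker, given plan use

# Multiplication rule: P(S and H) = P(S | H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S | H) = P(S and H) / P(H)
p_check = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S | H)   = {p_check:.2f}")
```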

**Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?**

**ANSWER:-------**


The main difference between Bernoulli Naive Bayes (BNB) and Multinomial Naive Bayes (MNB) lies in how they model the feature probabilities:

1. **Bernoulli Naive Bayes (BNB)**:
   - BNB is typically used when features are binary (i.e., presence or absence of a feature).
   - It models each feature as a binary variable, so only whether a feature occurs matters, not how often it occurs. Unlike MNB, the absence of a feature also counts explicitly as evidence in the likelihood.
   - Example: Text classification where each term's presence or absence (whether a term appears in the document or not) is considered.

2. **Multinomial Naive Bayes (MNB)**:
   - MNB is used when features represent counts or frequencies of events (e.g., word counts in document classification).
   - It assumes that features follow a multinomial distribution (which is a generalization of the binomial distribution for more than two categories).
   - Example: Document classification based on word counts, where the frequency of each term in the document matters.

In summary:
- **BNB** deals with binary feature vectors, where the presence or absence of each feature is the focus.
- **MNB** deals with count-based feature vectors, where the frequency of each feature (e.g., word counts) is considered.

Both BNB and MNB are variants of the Naive Bayes classifier, which assumes that features are independent given the class label (hence "naive"). They are commonly used in text classification and other tasks with discrete features.
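The difference can be seen on a toy example. The sketch below uses a made-up term-count matrix and scikit-learn's `BernoulliNB` and `MultinomialNB` with default settings. It relies on `BernoulliNB` binarizing its input at 0.0 by default, so feeding it counts or presence/absence flags should give the same fitted model, whereas `MultinomialNB` is sensitive to the magnitudes.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy term-count matrix: rows are documents, columns are terms (made-up data)
X_counts = np.array([
    [3, 0, 1],
    [5, 1, 0],
    [0, 2, 4],
    [0, 1, 6],
])
y = np.array([0, 0, 1, 1])

# The same documents reduced to presence/absence flags
X_binary = (X_counts > 0).astype(int)

# BernoulliNB binarizes at 0.0 by default, so both inputs give the same model
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
same_bernoulli = np.allclose(bnb_counts.feature_log_prob_,
                             bnb_binary.feature_log_prob_)

# MultinomialNB uses the count magnitudes, so the two inputs give different models
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)
same_multinomial = np.allclose(mnb_counts.feature_log_prob_,
                               mnb_binary.feature_log_prob_)

print("BernoulliNB unchanged by counts:  ", same_bernoulli)
print("MultinomialNB unchanged by counts:", same_multinomial)
```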

**Q3. How does Bernoulli Naive Bayes handle missing values?**

**ANSWER:--------**


Bernoulli Naive Bayes (BNB) has no built-in mechanism for missing values; they are usually handled during preprocessing, before the model is trained. Common approaches are:

1. **Imputation as Absence (0)**:
   - Missing values are replaced with 0, so the feature is treated as absent for that instance.
   - This approach assumes that the absence of information implies the feature is not present, aligning with the binary nature of Bernoulli Naive Bayes where features are either present (1) or absent (0).

2. **Separate Category for Missing Values**:
   - Another approach is to explicitly consider missing values as a separate category or state of the feature.
   - This involves modifying the feature representation to include an additional category that explicitly denotes missing values.
   - During training, the model learns how to classify instances with missing values based on the available information from other features.

3. **Ignoring Missing Values**:
   - Because Naive Bayes treats features as conditionally independent given the class, a missing feature can be left out of the likelihood product for that instance, and the remaining features determine the prediction.
   - This requires an implementation that supports skipping features, and it can hurt accuracy if values are not missing at random.

The handling of missing values depends on the library. scikit-learn's `BernoulliNB`, for example, does not accept missing values (`NaN`) and raises an error, so they must be imputed or encoded before training. Check the documentation of the implementation you use to see how missing values are treated in practice.
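As an illustration of the first approach (imputation as absence), the sketch below places scikit-learn's `SimpleImputer` in front of `BernoulliNB` in a pipeline. The data is made up, and filling with 0 is one possible choice rather than a recommendation. The sketch also shows that fitting directly on `NaN` values is rejected by scikit-learn's input validation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with missing entries encoded as NaN (made-up data)
X = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
    [np.nan, 1.0, 1.0],
])
y = np.array([0, 0, 1, 1])

# Fitting directly on NaN is rejected by scikit-learn's input validation
try:
    BernoulliNB().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# Impute missing values as "absent" (0), then fit Bernoulli Naive Bayes
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)
model.fit(X, y)
predictions = model.predict(X)

print("Fit on raw NaN data failed:", raw_fit_failed)
print("Predictions after imputation:", predictions)
```

To treat missingness as its own category instead, one option is to add an indicator column per feature (for example with `SimpleImputer(add_indicator=True)`), so the model can learn from the fact that a value was missing.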

**Q4. Can Gaussian Naive Bayes be used for multi-class classification?**

**ANSWER:---------**


Yes, Gaussian Naive Bayes (GNB) can be used for multi-class classification. GNB is a variant of Naive Bayes that assumes each continuous-valued feature follows a Gaussian (normal) distribution within each class, so it is best suited to continuous data that is approximately normally distributed per class.

Here's how Gaussian Naive Bayes handles multi-class classification:

1. **Modeling Class Conditional Distributions**:
   - For each class \( C_k \), GNB models the distribution of feature values as Gaussian (normal) distributions with mean \( \mu_{k,i} \) and variance \( \sigma_{k,i}^2 \) for each feature \( i \).

2. **Class Prior Probability**:
   - GNB calculates the prior probability \( P(C_k) \) for each class \( C_k \) based on the relative frequency of each class in the training data.

3. **Posterior Probability Calculation**:
   - To classify a new instance with feature vector \( \mathbf{x} = (x_1, x_2, \ldots, x_n) \):
     \[ P(C_k \mid \mathbf{x}) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k) \]
   - \( P(x_i \mid C_k) \) is computed using the Gaussian probability density function with parameters \( \mu_{k,i} \) and \( \sigma_{k,i}^2 \).

4. **Decision Rule**:
   - The class \( C_k \) that maximizes \( P(C_k \mid \mathbf{x}) \) is chosen as the predicted class for the instance \( \mathbf{x} \).

Therefore, Gaussian Naive Bayes handles multiple classes without modification: the decision rule compares the posterior of every class, so nothing in the method is specific to two classes. Each class is modeled with its own set of Gaussian distributions for the features, and classification picks the most probable class given the observed feature values.
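The four steps above can be sketched from scratch with NumPy. This is an illustrative implementation, not scikit-learn's: the function names `fit_gnb` and `predict_gnb`, the small variance floor of `1e-9`, and the synthetic three-class data are all assumptions made for this example. Log-probabilities are used in place of the raw product to avoid numerical underflow.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps variances strictly positive (assumption of this sketch)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gnb(X, classes, priors, means, variances):
    """Return the class with the highest log-posterior for each row of X."""
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # log P(C_k) + sum_i log P(x_i | C_k)
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Synthetic data with three well-separated classes (made up for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

params = fit_gnb(X, y)
y_pred = predict_gnb(X, *params)
accuracy = np.mean(y_pred == y)
print(f"Training accuracy on 3 classes: {accuracy:.2f}")
```

Nothing in `predict_gnb` depends on the number of classes; the `argmax` simply runs over however many classes were seen during fitting.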

**Q5. Assignment:**


**Data preparation:**

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.


**Implementation:**

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.


**Results:**

Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score


**Discussion:**

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?


**Conclusion:**
Summarise your findings and provide some suggestions for future work.


Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

**ANSWER:------**


To proceed with implementing and evaluating the Naive Bayes classifiers on the Spambase dataset, we'll follow these steps:

### Data Preparation

1. **Download the Dataset**: Obtain the Spambase dataset from the UCI Machine Learning Repository.
2. **Load the Dataset**: Read the dataset into a pandas DataFrame and prepare it for training and evaluation.

### Implementation

3. **Implement Naive Bayes Classifiers**:
   - Bernoulli Naive Bayes
   - Multinomial Naive Bayes
   - Gaussian Naive Bayes

4. **Use scikit-learn Library**: Utilize scikit-learn's implementations of these classifiers.

### Evaluation

5. **Perform 10-fold Cross-Validation**: Evaluate each classifier using 10-fold cross-validation to obtain robust performance metrics.

### Performance Metrics

6. **Report Metrics**: Compute and report the following metrics for each classifier:
   - Accuracy
   - Precision
   - Recall
   - F1 score

### Discussion

7. **Discuss Results**: Analyze and compare the performance of the classifiers. Discuss which variant of Naive Bayes performed the best and why. Highlight any limitations observed during the evaluation.

### Conclusion

8. **Summarize Findings**: Provide a summary of the findings and suggest potential areas for future work or improvements.

The steps below carry out this plan in Python with scikit-learn.

### Step 1: Data Preparation

Download the Spambase dataset from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Spambase). Save the dataset file (`spambase.data`) in your working directory.
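If you work from the saved local copy, a small loader such as the one below may be convenient. It is a sketch under stated assumptions: the helper name `load_spambase` is hypothetical, the file is assumed to have no header row, and the last column is assumed to be the spam label (1 = spam, 0 = not spam), as in the UCI distribution.

```python
import pandas as pd

def load_spambase(path="spambase.data"):
    """Read a local copy of the Spambase file and split features from the label.

    Assumes no header row and that the last column is the spam label.
    """
    data = pd.read_csv(path, header=None)
    X = data.iloc[:, :-1]             # all columns except the last are features
    y = data.iloc[:, -1].astype(int)  # last column: 1 = spam, 0 = not spam
    return X, y
```

Calling `X, y = load_spambase()` after saving the file should give a feature table with 57 columns. Checking `X.shape` and `y.value_counts()` is a quick way to confirm the file was read as expected.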

### Step 2: Implementation

Let's implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using scikit-learn.


```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset (the file has no header row, so column names are supplied)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
names = [
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our",
    "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order", "word_freq_mail",
    "word_freq_receive", "word_freq_will", "word_freq_people", "word_freq_report", "word_freq_addresses",
    "word_freq_free", "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp", "word_freq_hpl",
    "word_freq_george", "word_freq_650", "word_freq_lab", "word_freq_labs", "word_freq_telnet",
    "word_freq_857", "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology",
    "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs",
    "word_freq_meeting", "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu",
    "word_freq_table", "word_freq_conference", "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!",
    "char_freq_$", "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "spam"
]
data = pd.read_csv(url, names=names, header=None)

# Separate features and target variable
X = data.drop('spam', axis=1)
y = data['spam']

# Classifiers with default hyperparameters, keyed by display name
classifiers = {
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Gaussian Naive Bayes': GaussianNB(),
}

# Metrics computed in a single 10-fold cross-validation pass per classifier
scoring = ['accuracy', 'precision', 'recall', 'f1']

for clf_name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)

    # Print the mean of each metric across the 10 folds
    print(f"Classifier: {clf_name}")
    print(f"Accuracy: {scores['test_accuracy'].mean():.4f}")
    print(f"Precision: {scores['test_precision'].mean():.4f}")
    print(f"Recall: {scores['test_recall'].mean():.4f}")
    print(f"F1 Score: {scores['test_f1'].mean():.4f}")
    print()
```


```
Classifier: Bernoulli Naive Bayes
Accuracy: 0.8839
Precision: 0.8870
Recall: 0.8152
F1 Score: 0.8481

Classifier: Multinomial Naive Bayes
Accuracy: 0.7863
Precision: 0.7393
Recall: 0.7215
F1 Score: 0.7283

Classifier: Gaussian Naive Bayes
Accuracy: 0.8218
Precision: 0.7104
Recall: 0.9570
F1 Score: 0.8131
```



### Discussion

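Bernoulli Naive Bayes performed best overall, with the highest accuracy (0.8839), precision (0.8870), and F1 score (0.8481). Gaussian Naive Bayes reached the highest recall (0.9570) but the lowest precision (0.7104), so it catches most spam at the cost of flagging many legitimate messages. Multinomial Naive Bayes had the lowest accuracy (0.7863), recall (0.7215), and F1 score (0.7283).

A likely explanation is that `BernoulliNB` binarizes each feature at 0 by default, so it uses only whether a word or character occurs in a message, which appears to be a strong signal for spam. The Spambase features are percentages and run lengths, not raw counts, which fits the multinomial model poorly, and many of them are skewed, far from the normal distribution that Gaussian Naive Bayes assumes. A limitation shared by all three variants is the assumption that features are conditionally independent given the class, which is unlikely to hold for word frequencies in email.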

### Conclusion

On the Spambase dataset with default hyperparameters and 10-fold cross-validation, Bernoulli Naive Bayes gave the best balance of precision and recall, Gaussian Naive Bayes favored recall over precision, and Multinomial Naive Bayes had the weakest accuracy and F1 score. Possible directions for future work include transforming or scaling the features before fitting the Gaussian and multinomial models, tuning the smoothing hyperparameters, and comparing the results against classifiers that do not assume feature independence.