# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?
To calculate the probability that an employee is a smoker given that he/she uses the health insurance plan, we can use conditional probability. The notation for this probability is \( P(\text{Smoker}|\text{Uses Insurance}) \), and it is calculated using the formula:

\[ P(\text{Smoker}|\text{Uses Insurance}) = \frac{P(\text{Smoker} \cap \text{Uses Insurance})}{P(\text{Uses Insurance})} \]

The information given in the problem is as follows:

- \( P(\text{Uses Insurance}) = 0.70 \) (the probability that an employee uses the health insurance plan).
- \( P(\text{Smoker}|\text{Uses Insurance}) = 0.40 \) (the probability that an employee is a smoker given that he/she uses the health insurance plan; the problem states this directly, since 40% of the employees who use the plan are smokers).

Because the problem states the conditional probability directly, no further calculation is strictly needed. As a consistency check, first compute the joint probability \( P(\text{Smoker} \cap \text{Uses Insurance}) \) using the multiplication rule:

\[ P(\text{Smoker} \cap \text{Uses Insurance}) = P(\text{Smoker}|\text{Uses Insurance}) \cdot P(\text{Uses Insurance}) \]

Substitute the given values:

\[ P(\text{Smoker} \cap \text{Uses Insurance}) = 0.40 \cdot 0.70 = 0.28 \]

Substituting this joint probability back into the definition recovers the given value:

\[ P(\text{Smoker}|\text{Uses Insurance}) = \frac{0.40 \cdot 0.70}{0.70} = \frac{0.28}{0.70} = 0.40 \]

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
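
As a quick sanity check of the arithmetic above, the same quantities can be computed from head counts. This is only a sketch: the workforce size of 1,000 is an assumption made for illustration and is not part of the problem.

```python
from fractions import Fraction

# Hypothetical workforce size (an assumption for illustration only)
n_employees = 1000

# 70% of employees use the plan; 40% of those plan users smoke
n_insured = n_employees * Fraction(70, 100)
n_insured_smokers = n_insured * Fraction(40, 100)

# Joint probability P(Smoker and Uses Insurance)
p_joint = n_insured_smokers / n_employees

# Marginal probability P(Uses Insurance)
p_insured = n_insured / n_employees

# Conditional probability P(Smoker | Uses Insurance)
p_smoker_given_insured = p_joint / p_insured

print(float(p_joint))                 # 0.28
print(float(p_smoker_given_insured))  # 0.4
```

Any other workforce size gives the same ratios, because the head count cancels in the division.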

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?
Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes classifier, and they are designed for different types of data. Here are the key differences between the two:

1. **Nature of the Features:**
   - **Bernoulli Naive Bayes:** It is suitable for binary data, where features represent the presence or absence of a particular attribute. The model assumes that each feature is a binary-valued variable.
   - **Multinomial Naive Bayes:** It is designed for discrete data, typically used when features represent counts or frequencies. This is commonly used in text classification where features are word counts.

2. **Feature Representation:**
   - **Bernoulli Naive Bayes:** Features are represented as binary variables (0 or 1), indicating the absence or presence of a feature.
   - **Multinomial Naive Bayes:** Features are typically represented as integer counts, indicating the number of occurrences of a feature.

3. **Probability Distribution:**
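   - **Bernoulli Naive Bayes:** Assumes an independent Bernoulli distribution for each feature within a class. The absence of a feature contributes to the likelihood as well, so a missing word counts as evidence.
   - **Multinomial Naive Bayes:** Assumes that the feature counts within a class follow a single multinomial distribution over all features jointly. Each feature's parameter is its relative frequency in that class, and features with a count of zero do not contribute to the likelihood.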

4. **Use Cases:**
   - **Bernoulli Naive Bayes:** Often used in document classification tasks, especially when the presence or absence of certain words in a document is crucial (e.g., spam detection).
   - **Multinomial Naive Bayes:** Commonly used in natural language processing tasks, such as document classification or sentiment analysis, where the frequency of words in a document is important.

5. **Application in Scikit-Learn:**
   - **Bernoulli Naive Bayes:** In scikit-learn, you can use the `BernoulliNB` class.
   - **Multinomial Naive Bayes:** In scikit-learn, you can use the `MultinomialNB` class.

In summary, the choice between Bernoulli Naive Bayes and Multinomial Naive Bayes depends on the nature of your data. If your features are binary or represent the presence/absence of attributes, use Bernoulli Naive Bayes. If your features are counts or frequencies, especially in the context of text data, use Multinomial Naive Bayes.
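
The difference in feature handling can be seen in a minimal sketch. The count matrix and labels below are made-up toy data, not taken from any real corpus. With its default `binarize=0.0`, `BernoulliNB` turns every positive count into 1, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (made-up data): rows are documents, columns are words
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [4, 1, 0],
    [0, 2, 3],
    [0, 1, 4],
    [1, 3, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

# MultinomialNB uses the counts themselves
mnb = MultinomialNB().fit(X_counts, y)

# BernoulliNB thresholds inputs at binarize=0.0 by default,
# so any count greater than 0 becomes 1
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y)

# Fitting on counts or on pre-binarized data gives the same Bernoulli model
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)
print("BernoulliNB ignores count magnitude:", same_model)

# Per-class word distributions learned by MultinomialNB (each row sums to 1)
print(np.exp(mnb.feature_log_prob_).round(3))
```

If the magnitude of the counts carries signal, that information is lost under `BernoulliNB`, which is one practical reason to prefer `MultinomialNB` for count features.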

# Q3. How does Bernoulli Naive Bayes handle missing values?
scikit-learn's `BernoulliNB` has no built-in mechanism for missing values. Its input validation rejects `NaN` entries, so both `fit` and `predict` raise a `ValueError` when the feature matrix contains them. Missing values are treated as absent (0) only if they were encoded as 0 before reaching the model, for example in a bag-of-words matrix where an unobserved word is simply stored as zero.

Here is what that means in each phase:

1. **Training Phase:**
   - During the training phase, the model estimates, for each class, the probability that each feature is present (1) or absent (0).
   - If missing entries were encoded as 0 upstream, they are counted as absences and lower the estimated probability of presence. The model cannot distinguish "absent" from "unknown."

2. **Prediction Phase:**
   - When making predictions, the model uses the probabilities learned during training.
   - A missing entry encoded as 0 contributes the learned probability of absence to the likelihood, exactly like a genuinely absent feature. A `NaN` entry causes an error instead.

In other words, any treatment of missing values as absent comes from how the data were encoded, not from the model. If "missing" and "absent" mean different things in your data, encoding both as 0 will bias the estimated probabilities.

Because `NaN` is rejected, missing values must be handled before training. Imputation fills in missing values with estimated or substituted values, and scikit-learn provides tools such as the `SimpleImputer` class for this purpose. After imputation, you can proceed to train your Bernoulli Naive Bayes model.
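
A minimal sketch of imputation before Bernoulli Naive Bayes follows. The tiny feature matrix is made-up data, and the `most_frequent` strategy is just one reasonable assumption for binary features, not a recommendation from the original text.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with missing entries (made-up data)
X = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Fitting directly on data that contains NaN fails input validation
try:
    BernoulliNB().fit(X, y)
    nan_rejected = False
except ValueError as err:
    nan_rejected = True
    print("BernoulliNB rejected NaN:", type(err).__name__)

# Impute each column's most frequent value, then fit the classifier
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Placing the imputer inside a pipeline means that, under cross-validation, the fill values are learned from the training folds only.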

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?
Yes, Gaussian Naive Bayes can be used for multi-class classification. The Gaussian Naive Bayes classifier is a variant of the Naive Bayes algorithm that assumes that the features follow a Gaussian (normal) distribution. It is well-suited for continuous data.

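Naive Bayes is inherently a multi-class method. Gaussian Naive Bayes estimates a class prior and per-class Gaussian parameters (the mean and variance of each feature) for every class, then predicts the class with the highest posterior probability. No decomposition into binary problems is needed. For comparison, classifiers that are inherently binary are usually extended to multiple classes with the "one-vs-all" (OvA) or "one-vs-one" (OvO) strategy. Here's a brief explanation of both strategies: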

1. **One-vs-All (OvA):**
   - Also known as "one-vs-rest," this strategy involves training a separate binary classifier for each class. Each classifier is trained to distinguish one class from the rest.
   - During prediction, the class with the highest predicted probability among all the binary classifiers is chosen as the final predicted class.

2. **One-vs-One (OvO):**
   - In this strategy, a binary classifier is trained for every pair of classes. For \(K\) classes, this results in \(K(K-1)/2\) binary classifiers.
   - During prediction, each classifier "votes" for one of the classes. The class that receives the most votes is the final predicted class.

Scikit-learn, a popular machine learning library in Python, provides a `GaussianNB` class for Gaussian Naive Bayes classification. This class supports multi-class classification natively, without OvA or OvO: `fit` accepts a target vector with any number of distinct classes, and `predict` returns the class with the highest posterior probability.

Here's a simple example of using Gaussian Naive Bayes for multi-class classification in scikit-learn:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset as an example
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Gaussian Naive Bayes model
model = GaussianNB()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

```text
Accuracy: 1.00
```
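
To make the multi-class behaviour concrete, the sketch below inspects the class posteriors of a fitted `GaussianNB`. It fits on the full Iris dataset purely for illustration, which is an assumption of this sketch and not an evaluation protocol.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# One posterior column per class; no binary sub-classifiers are involved
proba = model.predict_proba(X)
print(model.classes_)   # the three Iris class labels
print(proba.shape)      # (150, 3)

# predict() returns the class with the highest posterior probability
manual_pred = model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(manual_pred, model.predict(X)))
```

Each row of `proba` sums to 1 across the three classes, which is what a single multi-class model produces.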


# Q5. Assignment:

**Data preparation:** Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

**Implementation:** Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results:** Report the following performance metrics for each classifier:

- Accuracy
- Precision
- Recall
- F1 score

**Discussion:** Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion:** Summarise your findings and provide some suggestions for future work.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the Spambase dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
columns = ["word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", "word_freq_our",
           "word_freq_over", "word_freq_remove", "word_freq_internet", "word_freq_order",
           "word_freq_mail", "word_freq_receive", "word_freq_will", "word_freq_people",
           "word_freq_report", "word_freq_addresses", "word_freq_free", "word_freq_business",
           "word_freq_email", "word_freq_you", "word_freq_credit", "word_freq_your",
           "word_freq_font", "word_freq_000", "word_freq_money", "word_freq_hp",
           "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab",
           "word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_data",
           "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999",
           "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs",
           "word_freq_meeting", "word_freq_original", "word_freq_project",
           "word_freq_re", "word_freq_edu", "word_freq_table", "word_freq_conference",
           "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!", "char_freq_$",
           "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
           "capital_run_length_total", "is_spam"]

# The file is comma-separated with no header row, so column names are supplied explicitly
data = pd.read_csv(url, header=None, names=columns)

# Separate features (all columns except the last) and target variable
X = data.iloc[:, :-1]
y = data['is_spam']

# Initialize classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation and collect the accuracy of each fold
metrics_bernoulli = cross_val_score(bernoulli_nb, X, y, cv=10, scoring='accuracy')
metrics_multinomial = cross_val_score(multinomial_nb, X, y, cv=10, scoring='accuracy')
metrics_gaussian = cross_val_score(gaussian_nb, X, y, cv=10, scoring='accuracy')

# Display the mean accuracy across folds
print("Bernoulli Naive Bayes Metrics:")
print("Accuracy:", metrics_bernoulli.mean())
print()

print("Multinomial Naive Bayes Metrics:")
print("Accuracy:", metrics_multinomial.mean())
print()

print("Gaussian Naive Bayes Metrics:")
print("Accuracy:", metrics_gaussian.mean())
```


```text
Bernoulli Naive Bayes Metrics:
Accuracy: 0.8839380364047911

Multinomial Naive Bayes Metrics:
Accuracy: 0.7863496180326323

Gaussian Naive Bayes Metrics:
Accuracy: 0.8217730830896915
```
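
The run above reports accuracy only, while the assignment also asks for precision, recall, and F1 score. One possible way to collect all four is scikit-learn's `cross_validate` with a list of scorers, sketched below. The helper name `evaluate_nb_variants` is hypothetical, and the demo uses synthetic count data (an assumption so that the sketch runs without downloading Spambase).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

SCORING = ["accuracy", "precision", "recall", "f1"]


def evaluate_nb_variants(X, y, cv=10):
    """Return mean cross-validated metrics for the three Naive Bayes variants.

    Hypothetical helper for this sketch; it is not part of scikit-learn.
    """
    classifiers = {
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "GaussianNB": GaussianNB(),
    }
    rows = {}
    for name, clf in classifiers.items():
        # cross_validate returns one "test_<metric>" array per requested scorer
        scores = cross_validate(clf, X, y, cv=cv, scoring=SCORING)
        rows[name] = {metric: scores[f"test_{metric}"].mean() for metric in SCORING}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic stand-in data (an assumption so the sketch runs offline):
# non-negative counts whose rates differ by class, since MultinomialNB needs X >= 0
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
rates = np.where(y_demo[:, None] == 1, [3.0, 0.5, 1.0, 0.2], [0.5, 3.0, 0.2, 1.0])
X_demo = rng.poisson(rates)

results = evaluate_nb_variants(X_demo, y_demo)
print(results.round(3))
```

For the assignment, calling `evaluate_nb_variants(X, y)` with the Spambase features and labels would produce the full metric table. No Spambase values are shown here because precision, recall, and F1 were not part of the recorded run.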
