# Assignment - Naïve bayes-2

#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

#### Answer:

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, we can use the conditional probability formula. Let's denote:

- \( A \): the event that an employee uses the health insurance plan.
- \( B \): the event that an employee is a smoker.

The probability that an employee uses the health insurance plan is denoted by \( P(A) \), and the probability that an employee who uses the plan is a smoker is denoted by \( P(B|A) \).

The conditional probability formula is given by:

\[ P(B|A) = \frac{P(A \cap B)}{P(A)} \]

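The question states both quantities directly: 70% of the employees use the health insurance plan (\( P(A) = 0.7 \)), and 40% of the employees who use the plan are smokers (\( P(B|A) = 0.4 \)). The joint probability follows from the multiplication rule, \( P(A \cap B) = P(B|A) \, P(A) = 0.4 \times 0.7 = 0.28 \). Substituting it into the formula confirms the stated value:

\[ P(B|A) = \frac{0.28}{0.7} = 0.4 \]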

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4 or 40%.
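
As a quick numerical check of the arithmetic above, here is a minimal sketch. The variable names are illustrative and not part of the original question.

```python
# Values stated in the question
p_plan = 0.7                # P(A): employee uses the health insurance plan
p_smoker_given_plan = 0.4   # P(B|A): smoker among plan users

# Multiplication rule: P(A and B) = P(B|A) * P(A)
p_plan_and_smoker = p_smoker_given_plan * p_plan

# Conditional probability recovered from the joint probability
p_recovered = p_plan_and_smoker / p_plan

print(f"P(A and B) = {p_plan_and_smoker:.2f}")
print(f"P(B|A)     = {p_recovered:.2f}")
```

Note that \( P(A) = 0.7 \) cancels out, so the answer does not depend on how many employees use the plan.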

#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

#### Answer:

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the types of features they are designed to handle and the underlying assumptions about the distribution of these features.

1. **Bernoulli Naive Bayes:**
   - **Features:** Bernoulli Naive Bayes is designed for binary or boolean features, meaning each feature can take on one of two values (0 or 1).
   - **Assumption:** It assumes that features are binary variables representing the presence (1) or absence (0) of certain characteristics.
   - **Use Cases:**
      - Document classification tasks where each term is either present (1) or absent (0).
      - Any situation where features can be represented as binary values.

   ```python
   from sklearn.naive_bayes import BernoulliNB
   classifier = BernoulliNB()
   ```

2. **Multinomial Naive Bayes:**
   - **Features:** Multinomial Naive Bayes is designed for discrete features, typically representing counts or frequencies of events.
   - **Assumption:** It assumes that features are multinomially distributed, meaning they represent counts of occurrences of different events.
   - **Use Cases:**
      - Text classification tasks where features are word frequencies or term frequencies.
      - Any situation where features can be represented as counts of occurrences.

   ```python
   from sklearn.naive_bayes import MultinomialNB
   classifier = MultinomialNB()
   ```

In summary, the choice between Bernoulli Naive Bayes and Multinomial Naive Bayes depends on the nature of your features:

- Use **Bernoulli Naive Bayes** if your features are binary (0 or 1).
- Use **Multinomial Naive Bayes** if your features are discrete and represent counts or frequencies.

Both classifiers follow the same principles of Naive Bayes, assuming independence between features given the class, and they are commonly used in natural language processing tasks such as text classification or spam filtering. The choice should align with the characteristics of your data and the specific requirements of your problem.
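
To make the contrast concrete, the following sketch fits both classifiers on a tiny, made-up word-count matrix. The data, the word labels, and the variable names are assumptions for illustration only. Multinomial Naive Bayes receives the raw counts, while Bernoulli Naive Bayes receives only presence/absence indicators.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts: rows are documents, columns are counts of the
# words "free", "meeting", "offer" (illustrative data, not from a real corpus)
X_counts = np.array([
    [3, 0, 2],
    [2, 0, 1],
    [0, 2, 0],
    [0, 3, 1],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# Multinomial NB models how often each word occurs
mnb = MultinomialNB().fit(X_counts, y)

# Bernoulli NB models only whether each word occurs at all
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_binary, y)

# A new document that repeats "free" four times
new_counts = np.array([[4, 0, 0]])
new_binary = (new_counts > 0).astype(int)

print("MultinomialNB probabilities:", mnb.predict_proba(new_counts))
print("BernoulliNB probabilities:  ", bnb.predict_proba(new_binary))
```

The explicit binarization is shown for clarity. `BernoulliNB` also has a `binarize` parameter (default `0.0`) that thresholds the input itself, so repeated occurrences of a word carry no extra weight for it, whereas they do for `MultinomialNB`.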

#### Q3. How does Bernoulli Naive Bayes handle missing values?

#### Answer:

scikit-learn's implementation of Bernoulli Naive Bayes does not handle missing values itself: `BernoulliNB` rejects input that contains NaN. The algorithm assumes that the input data consists of binary features, where each feature is either present (1) or absent (0). Missing values therefore have to be dealt with before training, for example by treating a missing feature as absent (0) when that interpretation makes sense for the data.

Here are a few points to consider:

1. **Binary Features:**
   - Bernoulli Naive Bayes is designed for binary features, and it expects the input data to consist of binary values (0 or 1).
   - If a feature has a missing value, it can be treated as 0, assuming the feature is absent.

2. **Imputation:**
   - If missing values are explicitly present in the dataset and you want to handle them, you may need to perform imputation before applying the Bernoulli Naive Bayes algorithm.
   - Common imputation techniques include replacing missing values with the mode (most frequent value) or using more advanced imputation methods based on the characteristics of your data.

3. **Preprocessing:**
   - Ensure that the dataset is preprocessed and transformed into a suitable format for Bernoulli Naive Bayes, with binary features.

Example of handling missing values:

```python
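import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

# Small illustrative binary feature matrix with missing entries (np.nan);
# replace X and y with your own feature matrix and target variable
X = np.array([[1, 0, np.nan], [0, 1, 1], [np.nan, 1, 0], [1, 1, 1],
              [0, 0, 1], [1, np.nan, 0], [0, 1, 0], [1, 0, 1]])
y = np.array([1, 0, 0, 1, 0, 1, 0, 1])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)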

# Impute missing values (replace NaN with 0 for binary features)
imputer = SimpleImputer(strategy='constant', fill_value=0)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Create and train Bernoulli Naive Bayes classifier
classifier = BernoulliNB()
classifier.fit(X_train_imputed, y_train)

# Make predictions
y_pred = classifier.predict(X_test_imputed)
```

In this example, the `SimpleImputer` is used to replace missing values with 0, assuming that the missing value corresponds to the absence of the feature. It's important to choose an imputation strategy that aligns with the semantics of your data and the assumptions of the Bernoulli Naive Bayes algorithm.
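
The need for imputation can be checked directly. The sketch below, which assumes a recent scikit-learn release and uses made-up data, tries to fit `BernoulliNB` on a matrix containing NaN and records whether a `ValueError` is raised.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Illustrative binary data with one missing entry
X_with_nan = np.array([[1, 0], [np.nan, 1], [0, 1], [1, 1]], dtype=float)
y_small = np.array([1, 0, 0, 1])

try:
    BernoulliNB().fit(X_with_nan, y_small)
    fit_failed = False
except ValueError as err:
    # scikit-learn's input validation rejects NaN values
    fit_failed = True
    print(f"fit raised ValueError: {err}")

print("Fit failed on NaN input:", fit_failed)
```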

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

#### Answer:

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is an extension of the Naive Bayes algorithm that is suitable for continuous data, assuming that each class is characterized by a Gaussian (normal) distribution of the features.

In the context of multi-class classification, Gaussian Naive Bayes models the likelihood of the features given the class using Gaussian distributions. The class with the highest posterior probability is then predicted for a given set of features.

Here's how you can use Gaussian Naive Bayes for multi-class classification in scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset as an example
data = load_iris()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Gaussian Naive Bayes classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)
```

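In this example, the Iris dataset is used for demonstration purposes, but you can apply the same approach to other datasets with continuous features. The `GaussianNB` class in scikit-learn handles multi-class classification natively: it estimates a prior and per-feature Gaussian parameters for every class and predicts the class with the highest posterior probability, so no one-vs-all decomposition is involved.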

Remember that Gaussian Naive Bayes makes the assumption that the features within each class follow a Gaussian distribution. If this assumption aligns with the characteristics of your data, Gaussian Naive Bayes can be a simple and effective choice for multi-class classification problems.
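
The multi-class behaviour can be inspected through the fitted model's posterior probabilities. The following minimal sketch fits on the full Iris data purely for illustration (no train/test split). It shows one probability column per class and that the predicted label is the class with the largest posterior.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# One entry per class learned from the training labels
print("Classes:", model.classes_)

# Posterior probabilities: one column per class, each row sums to 1
proba = model.predict_proba(X[:5])
print("Posterior shape:", proba.shape)

# The predicted label is the class with the highest posterior
pred_from_proba = model.classes_[np.argmax(proba, axis=1)]
print("Matches predict():", np.array_equal(pred_from_proba, model.predict(X[:5])))
```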

#### Q5. Assignment:

**Data preparation:**
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

**Implementation:**
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results:**
Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

**Discussion:**
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion:**
Summarise your findings and provide some suggestions for future work.

#### Answer:

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the Spambase dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = [
    "word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
    "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will",
    "word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free",
    "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money",
    "word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650",
    "word_freq_lab", "word_freq_labs", "word_freq_telnet", "word_freq_857",
    "word_freq_data", "word_freq_415", "word_freq_85", "word_freq_technology",
    "word_freq_1999", "word_freq_parts", "word_freq_pm", "word_freq_direct",
    "word_freq_cs", "word_freq_meeting", "word_freq_original", "word_freq_project",
    "word_freq_re", "word_freq_edu", "word_freq_table", "word_freq_conference",
    "char_freq_;", "char_freq_(", "char_freq_[", "char_freq_!", "char_freq_$",
    "char_freq_#", "capital_run_length_average", "capital_run_length_longest",
    "capital_run_length_total", "spam"
]
data = pd.read_csv(url, header=None, names=column_names)

# Separate features and target variable
X = data.drop("spam", axis=1)
y = data["spam"]

# The three Naive Bayes variants, each with default hyperparameters
classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Metrics to report; each is computed per fold by cross_validate.
# (Passing cross_val_score's fold accuracies to accuracy_score and similar
# functions as if they were predictions would fail, since there are only
# 10 fold scores but one label per email.)
metrics = ["accuracy", "precision", "recall", "f1"]

# Evaluate each classifier with 10-fold cross-validation and report fold means
for name, classifier in classifiers.items():
    cv_results = cross_validate(classifier, X, y, cv=10, scoring=metrics)
    print(f"{name} Metrics:")
    for metric in metrics:
        mean_score = cv_results[f"test_{metric}"].mean()
        print(f"  {metric}: {mean_score:.4f}")
    print()
```