**`Q.No-01`    A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?**

**Ans :-**

**This is a conditional probability question. Bayes' theorem is not needed here, because the quantity being asked for is stated directly in the problem.**

**`Given information` -**

$$P(\text{uses health insurance plan}) = P(A) = 0.7$$

$$P(\text{smoker} \mid \text{uses health insurance plan}) = P(B \mid A) = 0.4$$

**We have to find** $P(B|A)$**, which is already given:**

$$ P(B|A) = 0.4 \quad \text{or} \quad 40\% $$

**40% is the probability that an employee is a smoker given that he/she uses the health insurance plan**
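As a side calculation that the question does not ask for, the two given values also determine the joint probability that an employee both uses the plan and smokes. With $A$ and $B$ defined as above, the product rule gives:

$$P(A \cap B) = P(B \mid A)\,P(A) = 0.4 \times 0.7 = 0.28$$

So about 28% of all employees are plan users who smoke, while the conditional probability asked for remains 40%.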

---------------------------------------------------------------------------------------------------------------------------------------------------------

**`Q.No-02`    What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?**

**Ans :-**

**Both Bernoulli Naive Bayes and Multinomial Naive Bayes are types of Naive Bayes classifiers, a popular machine learning method based on Bayes' theorem.**

**They share the core principle of assuming independence between features, but differ in how they `handle the features` themselves -**

* **Bernoulli Naive Bayes :** This is suited for **binary features**, meaning each feature can only take on two values, typically represented as presence (1) or absence (0) of a specific characteristic. For example, in spam filtering a feature might record whether the word "free" appears in an email (1) or not (0). The model then compares how often each word is present or absent in spam emails versus non-spam emails.

* **Multinomial Naive Bayes :** This is designed for features with **discrete counts**. Here, a feature can have multiple values, but each value represents a count or frequency. A common application is text classification. Each word in a document is considered a feature, and its value is the number of times it appears in that document. The model then analyzes how frequently each word shows up in different categories (e.g., sports news vs. entertainment news).

**`In simpler terms`, Bernoulli Naive Bayes cares about "yes" or "no" for a feature, while Multinomial Naive Bayes cares about "how many times" for a feature.**

**Here's a table `summarizing the key differences` -**

| Feature Type | Naive Bayes Variant | Description |
|---|---|---|
| Binary | Bernoulli | Focuses on presence (1) or absence (0) of a feature |
| Discrete Counts | Multinomial | Analyzes the number of times a feature appears |
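The difference can be seen on a tiny example. The sketch below uses a made-up word-count matrix and illustrative labels (both are assumptions, not data from this assignment). It checks that scikit-learn's `BernoulliNB`, with its default `binarize=0.0`, learns the same parameters from raw counts as from a 0/1 version of them, while `MultinomialNB` keeps the per-class totals.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (made up): rows are documents, columns are words
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y_toy = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam (illustrative labels)

# BernoulliNB binarizes its input (default binarize=0.0): any count > 0 becomes 1
X_binary = (X_counts > 0).astype(int)
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit(X_binary, y_toy)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

# MultinomialNB keeps the magnitudes: feature_count_ holds per-class word totals
mnb = MultinomialNB().fit(X_counts, y_toy)

print("Bernoulli ignores magnitudes:", same_model)
print("Multinomial per-class word totals:\n", mnb.feature_count_)
```

The rows of `feature_count_` follow the sorted class labels, so the first row belongs to class 0 and the second to class 1.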

-------------------------------------------------------------------------------------------------------------------------------------------------------

**`Q.No-03`    How does Bernoulli Naive Bayes handle missing values?**

**Ans :-**

**Bernoulli Naive Bayes, a type of Naive Bayes classifier for binary features, doesn't have a built-in way to handle missing values directly. This can be problematic because the algorithm relies on probabilities calculated from feature values, and missing data can skew these calculations.**

**Here are some common approaches to address missing values before using Bernoulli Naive Bayes :**

1. **Dropping Data Points -** The simplest method is to remove any data points containing missing values.  This can be a good option if the amount of missing data is small. However, it can also lead to a loss of information, potentially affecting the model's performance.

2. **Imputation -** Another approach is to impute missing values. This involves estimating a value to fill in the missing spot. For binary features, the usual choice is the most frequent value (the mode) of the feature, since a mean or median may not be a valid 0/1 value. More sophisticated, model-based imputation methods are also available.

3. **Ignoring the Feature -** When a data point has a missing value, you can skip that feature for that particular data point, so its likelihood term is left out of the product when the class probabilities are computed. A related but distinct option is to treat "missing" as its own category, for example by adding a binary indicator feature that records whether the value was missing.

`While scikit-learn's BernoulliNB model doesn't directly handle missing values`, **these strategies can be applied before feeding the data to the model. There are also libraries offering alternative Naive Bayes implementations that might have built-in missing value handling capabilities.**
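The first two strategies can be sketched with scikit-learn as follows. This is a minimal illustration on made-up binary data, where `np.nan` marks a missing entry. The array values, the variable names and the choice of `strategy="most_frequent"` are assumptions for the example, not part of the original answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with gaps (np.nan marks a missing entry); values are made up
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Option 1: drop every row that contains a missing value
complete_rows = ~np.isnan(X_missing).any(axis=1)
X_dropped, y_dropped = X_missing[complete_rows], y_toy[complete_rows]

# Option 2: fill each gap with the column's most frequent value, then fit
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_toy)
predictions = model.predict(X_missing)

print("Rows kept after dropping:", X_dropped.shape[0])
print("Predictions after imputation:", predictions)
```

Putting the imputer inside a pipeline means that, under cross-validation, the fill values are learned from the training folds only.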

--------------------------------------------------------------------------------------------------------------------------------------------------------

**`Q.No-04`    Can Gaussian Naive Bayes be used for multi-class classification?**

**Ans :-**

**`Yes, Gaussian Naive Bayes (GNB) can indeed be used for multi-class classification tasks`. Despite its "naive" assumption of feature independence, which may not always hold true in practice, GNB is still commonly employed for its simplicity and efficiency, especially when dealing with relatively small datasets.**

For multi-class classification, GNB extends naturally: it applies Bayes' theorem to calculate the probability of each class given the input features and then selects the class with the highest probability. It assumes that each feature follows a Gaussian distribution within each class, hence the name "Gaussian" Naive Bayes.

`However`, it's worth noting that GNB might not perform optimally compared to more complex models like Support Vector Machines, Random Forests, or deep learning models, especially when dealing with highly correlated features or complex relationships within the data. Nonetheless, it can still serve as a baseline model or be used in scenarios where computational efficiency and simplicity are crucial.
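As a small illustration of the multi-class case, the sketch below fits scikit-learn's `GaussianNB` on the bundled three-class Iris dataset. Iris is an assumption chosen for convenience and is not part of this assignment. The code checks that the predicted class is the one with the highest posterior probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes, so it serves as a small multi-class example
X_iris, y_iris = load_iris(return_X_y=True)

gnb_iris = GaussianNB().fit(X_iris, y_iris)

# One posterior probability per class for each sample; each row sums to 1
posteriors = gnb_iris.predict_proba(X_iris[:5])
predicted = gnb_iris.predict(X_iris[:5])

# The predicted label should be the class with the largest posterior
matches_argmax = np.array_equal(
    predicted, gnb_iris.classes_[posteriors.argmax(axis=1)]
)

print("Classes seen during fit:", gnb_iris.classes_)
print("Posterior shape:", posteriors.shape)
print("Prediction equals argmax of posterior:", matches_argmax)
```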

--------------------------------------------------------------------------------------------------------------------------------------------------------

**`Q.No-05`    Assignment :-**

-    **Data preparation :** Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.


-    **Implementation :** Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.


-    **Results :** Report the following performance metrics for each classifier -

        -    Accuracy

        -    Precision

        -    Recall

        -    F1 score


-    **Discussion :** Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?


-    **Conclusion :** Summarise your findings and provide some suggestions for future work.

**Ans :-**

In [21]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
spambase = fetch_ucirepo(id=94) 

In [23]:
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets.values.ravel()  # flatten to 1-D, as scikit-learn estimators expect

In [24]:
# metadata 
display(spambase.metadata) 

{'uci_id': 94,
 'name': 'Spambase',
 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase',
 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv',
 'abstract': 'Classifying Email as Spam or Non-Spam',
 'area': 'Computer Science',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 4601,
 'num_features': 57,
 'feature_types': ['Integer', 'Real'],
 'demographics': [],
 'target_col': ['Class'],
 'index_col': None,
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 1999,
 'last_updated': 'Mon Aug 28 2023',
 'dataset_doi': '10.24432/C53G6X',
 'creators': ['Mark Hopkins',
  'Erik Reeber',
  'George Forman',
  'Jaap Suermondt'],
 'intro_paper': None,
 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spa

In [25]:
# variable information 
display(spambase.variables)

| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | word_freq_make | Feature | Continuous | | | | no |
| 1 | word_freq_address | Feature | Continuous | | | | no |
| 2 | word_freq_all | Feature | Continuous | | | | no |
| 3 | word_freq_3d | Feature | Continuous | | | | no |
| 4 | word_freq_our | Feature | Continuous | | | | no |
| 5 | word_freq_over | Feature | Continuous | | | | no |
| 6 | word_freq_remove | Feature | Continuous | | | | no |
| 7 | word_freq_internet | Feature | Continuous | | | | no |
| 8 | word_freq_order | Feature | Continuous | | | | no |
| 9 | word_freq_mail | Feature | Continuous | | | | no |


In [29]:
from sklearn.model_selection import cross_validate

# Initialize classifiers with their default hyperparameters
bnb = BernoulliNB()
mnb = MultinomialNB()
gnb = GaussianNB()

classifiers = {'Bernoulli NB': bnb, 'Multinomial NB': mnb, 'Gaussian NB': gnb}
metrics = ['accuracy', 'precision', 'recall', 'f1']

# scikit-learn expects a 1-D target; flattening avoids the DataConversionWarning
# that the blanket warnings filter was hiding
y_flat = np.ravel(y)

# Run 10-fold cross-validation once per classifier, scoring all metrics together
results = {}
for name, clf in classifiers.items():
    cv_scores = cross_validate(clf, X, y_flat, cv=10, scoring=metrics)
    results[name] = {metric: cv_scores[f'test_{metric}'].mean() for metric in metrics}

# Rows are metrics, columns are classifiers
results_df = pd.DataFrame(results, index=metrics)
display(results_df)

| Metric | Bernoulli NB | Multinomial NB | Gaussian NB |
|---|---|---|---|
| accuracy | 0.883938 | 0.78635 | 0.821773 |
| precision | 0.886962 | 0.739318 | 0.710373 |
| recall | 0.815239 | 0.721498 | 0.956952 |
| f1 | 0.848125 | 0.728291 | 0.813066 |


**`Based on the performance metrics, we can discuss the following trends` :**

1. **Accuracy -** Bernoulli Naive Bayes achieved the highest accuracy among the three variants, followed by Gaussian Naive Bayes and Multinomial Naive Bayes.

2. **Precision -** Bernoulli Naive Bayes also exhibited the highest precision (0.887), followed by Multinomial Naive Bayes (0.739) and Gaussian Naive Bayes (0.710).

3. **Recall -** Gaussian Naive Bayes had the highest recall, followed by Bernoulli Naive Bayes and Multinomial Naive Bayes.

4. **F1 Score -** Bernoulli Naive Bayes showed the highest F1 score, followed by Gaussian Naive Bayes and Multinomial Naive Bayes.

These results indicate that for this Spambase dataset, Bernoulli Naive Bayes generally outperformed the other variants in terms of accuracy, precision, and F1 score. However, Gaussian Naive Bayes performed notably well in terms of recall.

`The reason behind Bernoulli Naive Bayes' superior performance could be attributed to how it treats the features`. The Spambase features are not binary: they are continuous word and character frequencies plus capital-letter run-length statistics. With its default setting (`binarize=0.0`), scikit-learn's BernoulliNB converts every feature to 1 if it is greater than zero and 0 otherwise, so the model effectively uses only whether a word or character occurs in an email. That presence/absence signal appears to be more useful here than the raw frequency values.

`On the other hand`, Multinomial Naive Bayes assumes count-like features, whereas the Spambase features are percentages and run lengths on very different scales, which may explain its weaker results. Gaussian Naive Bayes assumes each feature is normally distributed within a class, which is unlikely to hold for frequency features that tend to be sparse and skewed. Even so, it achieved the highest recall (0.957), at the cost of the lowest precision (0.710).

`In conclusion, based on the evaluation results`, **Bernoulli Naive Bayes appears to be the most suitable variant for the task of classifying spam emails in the Spambase dataset. Its superior performance in terms of accuracy, precision, and F1 score suggests that the presence or absence of particular words and characters is an effective signal for this task.**

However, it's worth noting that Gaussian Naive Bayes showed remarkable performance in terms of recall, indicating its potential usefulness in capturing spam instances effectively, albeit with some trade-offs in other performance metrics.

Future work could involve exploring feature engineering techniques to enhance the performance of Naive Bayes classifiers further. Additionally, experimenting with other classification algorithms and ensemble methods could provide insights into whether alternative approaches could yield even better results for spam classification tasks. Moreover, investigating the impact of different hyperparameters and preprocessing techniques could also be valuable for improving classification performance.
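One way to start on the hyperparameter suggestion is a small grid search. The sketch below is illustrative only. It uses synthetic non-negative features as a stand-in for frequency-style data (not the Spambase dataset), and the grid values for `alpha` and `binarize` are hypothetical choices rather than recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for frequency-style features (NOT the Spambase data)
rng = np.random.default_rng(0)
X_syn = rng.exponential(scale=1.0, size=(200, 5))
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 2.0).astype(int)

# Hypothetical grid: smoothing strength and binarization threshold
param_grid = {"alpha": [0.1, 1.0, 10.0], "binarize": [0.0, 0.5, 1.0]}

# 5-fold cross-validated search, scored by F1
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Best mean F1:", round(search.best_score_, 3))
```

The same pattern would apply to the real data by passing the Spambase `X` and a 1-D `y` to `search.fit`. The best parameters found on synthetic data say nothing about Spambase itself.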