Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

The probability that an employee is a smoker given that he/she uses the health insurance plan can be calculated using the conditional probability formula. This probability is written \( P(\text{Smoker}|\text{Uses Insurance}) \).

The formula for conditional probability is given by:

\[ P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \]

In this case:
- Let \( A \) be the event "employee is a smoker."
- Let \( B \) be the event "employee uses the health insurance plan."

The information provided is:
- \( P(\text{Uses Insurance}) = 0.70 \) (probability that an employee uses the health insurance plan).
- \( P(\text{Smoker}|\text{Uses Insurance}) = 0.40 \) (the 40% is stated as a share of the employees who use the plan, so it is already a conditional probability).

We want to find \( P(\text{Smoker}|\text{Uses Insurance}) \), which the problem statement therefore gives directly. The formula confirms this value.

Using the conditional probability formula:

\[ P(\text{Smoker}|\text{Uses Insurance}) = \frac{P(\text{Smoker and Uses Insurance})}{P(\text{Uses Insurance})} \]

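By the multiplication rule, \( P(\text{Smoker and Uses Insurance}) = P(\text{Smoker}|\text{Uses Insurance}) \times P(\text{Uses Insurance}) = 0.40 \times 0.70 \). Substitute the given values: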

\[ P(\text{Smoker}|\text{Uses Insurance}) = \frac{0.40 \times 0.70}{0.70} \]

Cancel out the common factor (0.70):

\[ P(\text{Smoker}|\text{Uses Insurance}) = 0.40 \]

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40 or 40%.
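The arithmetic above can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative, and exact fractions are used so that no floating-point rounding creeps into the comparison.

```python
from fractions import Fraction

# Values stated in the question
p_insured = Fraction(70, 100)               # P(Uses Insurance)
p_smoker_given_insured = Fraction(40, 100)  # P(Smoker | Uses Insurance)

# Multiplication rule: P(Smoker and Uses Insurance)
p_smoker_and_insured = p_smoker_given_insured * p_insured

# Conditional probability formula recovers the stated 40%
recovered = p_smoker_and_insured / p_insured

print(float(p_smoker_and_insured), float(recovered))
```

The joint probability (28% of all employees are smokers who use the plan) is the only quantity here that is not already given in the question.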

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes classifier, and their main difference lies in the nature of the features they are designed to handle.

### Bernoulli Naive Bayes:
1. **Nature of Features:**
   - **Binary Features:** Bernoulli Naive Bayes is designed for binary or Boolean features, where each feature can take only two possible values (0 or 1).
   - It is particularly suitable for problems where features represent the presence or absence of certain attributes.

2. **Use Cases:**
   - **Text Classification:** It is commonly used in text classification problems where features represent the presence or absence of words in a document.
   - **Binary Data:** Any problem involving binary data, such as spam detection based on the presence or absence of specific words.

3. **Probability Calculation:**
   - It models the probability of the presence (1) or absence (0) of each feature given the class label.

### Multinomial Naive Bayes:
1. **Nature of Features:**
   - **Discrete Features:** Multinomial Naive Bayes is designed for features that represent counts or frequencies of events in a fixed-size sample.
   - It is suitable for problems where features are integer counts (e.g., word counts in a document).

2. **Use Cases:**
   - **Text Classification:** Commonly used in text classification tasks where features represent the frequency of words in a document (bag-of-words models).
   - **Document Classification:** Problems where features represent the frequency of certain events or words.

3. **Probability Calculation:**
   - It models the probability distribution of the counts or frequencies of each feature given the class label.

### Summary of Differences:
- **Bernoulli Naive Bayes:** Designed for binary features, suitable for presence or absence data.
- **Multinomial Naive Bayes:** Designed for discrete features representing counts or frequencies.

When choosing between Bernoulli and Multinomial Naive Bayes, consider the nature of your data. If your features are binary, Bernoulli Naive Bayes may be more appropriate. If your features are counts or frequencies, Multinomial Naive Bayes may be a better choice. In practice, both can be tested, and the performance on a specific dataset can guide the selection.
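To make the difference concrete, here is a minimal sketch of the two class-conditional log-likelihoods for one document. The function names and all probability values are hypothetical, chosen only for illustration; a real classifier would estimate them from training data with smoothing.

```python
import math

def bernoulli_log_likelihood(x_binary, p):
    """log P(x | class) when each feature is a presence/absence flag.

    Absent features contribute log(1 - p_j), so absence is evidence too.
    """
    return sum(
        math.log(p_j) if x_j else math.log(1.0 - p_j)
        for x_j, p_j in zip(x_binary, p)
    )

def multinomial_log_likelihood(counts, theta):
    """log P(x | class) for count features, up to the multinomial
    coefficient, which is identical for every class and can be dropped.

    A zero count contributes nothing; repeated words are weighted by count.
    """
    return sum(c * math.log(t) for c, t in zip(counts, theta))

# One document over a three-word vocabulary (hypothetical numbers)
counts = [3, 0, 1]                             # word counts
present = [1 if c > 0 else 0 for c in counts]  # presence/absence view

p_spam = [0.8, 0.1, 0.5]      # assumed P(word present | spam)
theta_spam = [0.6, 0.1, 0.3]  # assumed per-word probabilities, summing to 1

bern = bernoulli_log_likelihood(present, p_spam)
multi = multinomial_log_likelihood(counts, theta_spam)
print(bern, multi)
```

The Bernoulli score uses the second word's absence (the `1 - p` term) but ignores that the first word occurred three times; the multinomial score does the opposite.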

Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in notion of a missing value, so the handling depends on the specific implementation or library used. scikit-learn's `BernoulliNB`, for example, raises an error on input containing NaN, which means missing values have to be dealt with before fitting. The usual options and considerations are:

1. **Missing Values as a Separate Category:**
   - In Bernoulli Naive Bayes, where features are binary (0 or 1), "missing" can be treated as a third state.
   - Since the model itself only accepts binary inputs, this is typically encoded as an extra 0/1 indicator feature ("value was missing") next to the original feature, which is then filled with a default value.
   - The indicator contributes to the probability calculations like any other binary feature when predicting the class label.

2. **Imputation Strategies:**
   - Depending on the implementation or the specific requirements of the problem, missing values might be imputed before applying the Bernoulli Naive Bayes model.
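   - Imputation can involve replacing missing values with a default value (e.g., 0 or 1), the most frequent value of the feature, or another imputation strategy. The mean is a poor choice here because it is generally not a valid binary value.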

3. **Consideration of Impact:**
   - It's crucial to consider the impact of missing values on the problem at hand and whether treating them as a separate category makes sense.
   - The appropriateness of imputation strategies should also be carefully evaluated based on the nature of the data and the assumptions of the model.

4. **Library-Specific Handling:**
   - Different machine learning libraries and implementations may have variations in how they handle missing values in the context of Bernoulli Naive Bayes.
   - When using a specific library, it's advisable to consult the documentation to understand the default behavior or options for dealing with missing values.

As a best practice, it's essential to explore the dataset, understand the reasons for missing values, and choose an approach that aligns with the problem's requirements and the characteristics of the data. Experimentation and validation using cross-validation techniques can help assess the impact of handling missing values in a particular way on the performance of the Bernoulli Naive Bayes model.
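The imputation and separate-category ideas above can be combined in a small preprocessing step. The sketch below is one possible approach, not the behaviour of any particular library: `impute_most_frequent` is a hypothetical helper, and `None` is assumed to mark a missing entry.

```python
from collections import Counter

def impute_most_frequent(rows, missing=None):
    """Fill missing binary values with each column's most frequent
    observed value and return 0/1 'was missing' indicators alongside."""
    n_features = len(rows[0])
    modes = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not missing]
        modes.append(Counter(observed).most_common(1)[0][0])

    imputed = [
        [modes[j] if value is missing else value for j, value in enumerate(row)]
        for row in rows
    ]
    indicators = [
        [1 if value is missing else 0 for value in row]
        for row in rows
    ]
    return imputed, indicators

# Toy binary data with gaps (None = missing)
rows = [
    [1, 0, None],
    [1, None, 1],
    [None, 0, 1],
    [0, 0, 1],
]
imputed, indicators = impute_most_frequent(rows)

# Original features plus indicator features, all binary, ready for a Bernoulli model
features = [x + m for x, m in zip(imputed, indicators)]
print(features)
```

Keeping the indicator columns lets the classifier learn whether missingness itself is informative; dropping them reduces this to plain most-frequent imputation.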

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is the variant of the Naive Bayes algorithm designed for continuous or numerical features. It is often introduced with binary examples, but the model applies unchanged to any number of classes.

In the context of multi-class classification with Gaussian Naive Bayes:

1. **Assumption:**
   - Gaussian Naive Bayes assumes that the features within each class are normally distributed.

2. **Multi-class Extension:**
   - For multiple classes, the model calculates the likelihood of each feature for each class assuming a Gaussian distribution with a class-specific mean and variance.
   - The class with the highest posterior score (prior times likelihood) for a given set of features is then predicted as the output.

3. **Decision Rule:**
   - The class means and variances are fitted by maximum likelihood estimation (MLE); prediction then uses the maximum a posteriori (MAP) rule over all classes.

4. **Probability Calculation:**
   - The probability of a class given the features is proportional to the product of the class prior probability and the likelihood of the features given the class.

5. **Mathematical Formulation:**
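   - For class \( c \) with prior \( P(c) \) and per-feature mean \( \mu_{c,j} \) and variance \( \sigma_{c,j}^2 \), the prediction is \( \hat{y} = \arg\max_c \left[ \log P(c) + \sum_j \log \mathcal{N}(x_j; \mu_{c,j}, \sigma_{c,j}^2) \right] \). Nothing in this rule depends on the number of classes.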

In summary, Gaussian Naive Bayes can naturally extend to handle multi-class classification problems. It is a computationally efficient and easy-to-implement algorithm, and its performance can be quite competitive, especially when the assumptions of the Gaussian distribution hold reasonably well for the data. However, it's important to note that the performance of Gaussian Naive Bayes may vary depending on the characteristics of the dataset. As always, it's advisable to experiment with different algorithms and perform model evaluation to choose the best approach for a specific problem.
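A from-scratch sketch shows that nothing special is needed for more than two classes. The helper names (`fit_gaussian_nb`, `predict`), the toy data, and the small `var_floor` added to each variance are all assumptions made for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate log prior, per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model

def predict(model, row):
    """Return the class with the largest log prior + Gaussian log-likelihood."""
    def log_posterior(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=lambda label: log_posterior(model[label]))

# Three well-separated classes with two continuous features (toy data)
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
     [5.0, 5.0], [5.2, 4.8], [4.8, 5.2],
     [9.0, 1.0], [9.2, 0.8], [8.8, 1.2]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, point) for point in ([1.1, 2.1], [5.1, 5.1], [9.1, 0.9])]
print(predictions)
```

The `max` over classes is the only place the number of classes appears, which is why the binary and multi-class cases share the same code path.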

Q5. Assignment:

Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

Results:

Report the following performance metrics for each classifier:

Accuracy

Precision

Recall

F1 score

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:

Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [4]:
df=pd.read_csv("spambase copy.csv",header=None)

In [18]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [6]:
df.shape

(4601, 58)

In [7]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4601 non-null   float64
 1   1       4601 non-null   float64
 2   2       4601 non-null   float64
 3   3       4601 non-null   float64
 4   4       4601 non-null   float64
 5   5       4601 non-null   float64
 6   6       4601 non-null   float64
 7   7       4601 non-null   float64
 8   8       4601 non-null   float64
 9   9       4601 non-null   float64
 10  10      4601 non-null   float64
 11  11      4601 non-null   float64
 12  12      4601 non-null   float64
 13  13      4601 non-null   float64
 14  14      4601 non-null   float64
 15  15      4601 non-null   float64
 16  16      4601 non-null   float64
 17  17      4601 non-null   float64
 18  18      4601 non-null   float64
 19  19      4601 non-null   float64
 20  20      4601 non-null   float64
 21  21      4601 non-null   float64
 22  

In [9]:
df.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
42    0
43    0
44    0
45    0
46    0
47    0
48    0
49    0
50    0
51    0
52    0
53    0
54    0
55    0
56    0
57    0
dtype: int64

In [17]:
# Columns 0-56 are the 57 input features; column 57 is the spam label
X=df.iloc[:,:57]
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [21]:
# Last column is the target: 1 = spam, 0 = not spam
y=df.iloc[:,-1]
y.value_counts()

0    2788
1    1813
Name: 57, dtype: int64

In [10]:
from sklearn.model_selection import train_test_split

In [22]:
# Hold out 30% of the rows as a test set (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Gaussian Naive Bayes

In [23]:
from sklearn.naive_bayes import GaussianNB

In [25]:
gnb=GaussianNB()

In [26]:
gnb.fit(X_train,y_train)

In [43]:
y_pred=gnb.predict(X_test)
y_pred

array([1, 1, 0, ..., 1, 1, 1])

In [45]:
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

In [44]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[597 225]
 [ 34 525]]
              precision    recall  f1-score   support

           0       0.95      0.73      0.82       822
           1       0.70      0.94      0.80       559

    accuracy                           0.81      1381
   macro avg       0.82      0.83      0.81      1381
weighted avg       0.85      0.81      0.81      1381

0.8124547429398986
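The spam-class figures in the report above can be reproduced by hand from the printed confusion matrix. This is a small sketch; `binary_metrics` is a hypothetical helper, and the four counts are copied from the Gaussian Naive Bayes output above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class (spam = 1)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# scikit-learn's confusion_matrix layout for labels [0, 1] is [[tn, fp], [fn, tp]]
gnb_metrics = binary_metrics(tn=597, fp=225, fn=34, tp=525)
for name, value in gnb_metrics.items():
    print(f"{name}: {value:.4f}")
```

Reading the matrix this way makes the trade-off visible: the default Gaussian model misses only 34 spam messages but flags 225 legitimate ones, which is why its spam recall is high and its spam precision is low.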


In [28]:
from sklearn.model_selection import GridSearchCV

In [30]:
# Candidate var_smoothing values: 100 points log-spaced from 1 down to 1e-9
param_grid_nb = {
    'var_smoothing': np.logspace(0,-9, num=100)
}

In [31]:
grid=GridSearchCV(gnb,cv=10,param_grid=param_grid_nb,verbose=3)

In [32]:
grid.fit(X_train,y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
[CV 1/10] END ................var_smoothing=1.0;, score=0.637 total time=   0.0s
[CV 2/10] END ................var_smoothing=1.0;, score=0.643 total time=   0.0s
[CV 3/10] END ................var_smoothing=1.0;, score=0.652 total time=   0.0s
[CV 4/10] END ................var_smoothing=1.0;, score=0.643 total time=   0.0s
[CV 5/10] END ................var_smoothing=1.0;, score=0.634 total time=   0.0s
[CV 6/10] END ................var_smoothing=1.0;, score=0.637 total time=   0.0s
[CV 7/10] END ................var_smoothing=1.0;, score=0.637 total time=   0.0s
[CV 8/10] END ................var_smoothing=1.0;, score=0.634 total time=   0.0s
[CV 9/10] END ................var_smoothing=1.0;, score=0.624 total time=   0.0s
[CV 10/10] END ...............var_smoothing=1.0;, score=0.646 total time=   0.0s
[CV 1/10] END .var_smoothing=0.8111308307896871;, score=0.640 total time=   0.0s
[CV 2/10] END .var_smoothing=0.8111308307896

In [33]:
grid.best_params_

{'var_smoothing': 1.519911082952933e-06}

In [47]:
y_pred=grid.predict(X_test)
y_pred

array([1, 0, 0, ..., 0, 1, 0])

In [48]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[718 104]
 [ 99 460]]
              precision    recall  f1-score   support

           0       0.88      0.87      0.88       822
           1       0.82      0.82      0.82       559

    accuracy                           0.85      1381
   macro avg       0.85      0.85      0.85      1381
weighted avg       0.85      0.85      0.85      1381

0.8530050687907313


## Bernoulli Naive Bayes

In [39]:
from sklearn.naive_bayes import BernoulliNB

In [40]:
bnb=BernoulliNB()

In [41]:
bnb.fit(X_train,y_train)

In [50]:
y_pred=bnb.predict(X_test)
y_pred

array([1, 0, 0, ..., 0, 1, 0])

In [51]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[769  53]
 [119 440]]
              precision    recall  f1-score   support

           0       0.87      0.94      0.90       822
           1       0.89      0.79      0.84       559

    accuracy                           0.88      1381
   macro avg       0.88      0.86      0.87      1381
weighted avg       0.88      0.88      0.87      1381

0.8754525706010138


In [68]:
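# Single-candidate grid, so GridSearchCV below only runs 10-fold cross-validation
# on one fixed configuration rather than tuning anything
param={
    "alpha":[1],
    "force_alpha":[False],
    "binarize":[0],
    "fit_prior":[True],
    "class_prior":[None]
}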

In [69]:
grid1=GridSearchCV(bnb,param_grid=param,cv=10,verbose=3)

In [70]:
grid1.fit(X_train,y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV 1/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.910 total time=   0.0s
[CV 2/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.863 total time=   0.0s
[CV 3/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.870 total time=   0.0s
[CV 4/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.876 total time=   0.0s
[CV 5/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.894 total time=   0.0s
[CV 6/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.882 total time=   0.0s
[CV 7/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.888 total time=   0.0s
[CV 8/10] END alpha=1, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.91

In [71]:
grid1.best_params_

{'alpha': 1,
 'binarize': 0,
 'class_prior': None,
 'fit_prior': True,
 'force_alpha': False}

In [73]:
y_pred=grid1.predict(X_test)
y_pred

array([1, 0, 0, ..., 0, 1, 0])

In [74]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[769  53]
 [119 440]]
              precision    recall  f1-score   support

           0       0.87      0.94      0.90       822
           1       0.89      0.79      0.84       559

    accuracy                           0.88      1381
   macro avg       0.88      0.86      0.87      1381
weighted avg       0.88      0.88      0.87      1381

0.8754525706010138


## Multinomial Naive Bayes

In [75]:
from sklearn.naive_bayes import MultinomialNB

In [76]:
mnb=MultinomialNB()

In [77]:
mnb.fit(X_train,y_train)

In [79]:
y_pred=mnb.predict(X_test)
y_pred

array([1, 0, 0, ..., 0, 1, 0])

In [80]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[699 123]
 [140 419]]
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       822
           1       0.77      0.75      0.76       559

    accuracy                           0.81      1381
   macro avg       0.80      0.80      0.80      1381
weighted avg       0.81      0.81      0.81      1381

0.8095582910934106


In [84]:
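# Single-candidate grid, so GridSearchCV below only runs 10-fold cross-validation
# on one fixed configuration rather than tuning anything
param={
    "alpha":[1],
    "force_alpha":[False],
    "fit_prior":[True],
    "class_prior":[None]
}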

In [85]:
grid2=GridSearchCV(mnb,param_grid=param,cv=10,verbose=3)

In [86]:
grid2.fit(X_train,y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV 1/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.823 total time=   0.0s
[CV 2/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.776 total time=   0.0s
[CV 3/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.801 total time=   0.0s
[CV 4/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.848 total time=   0.0s
[CV 5/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.780 total time=   0.0s
[CV 6/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.826 total time=   0.0s
[CV 7/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.783 total time=   0.0s
[CV 8/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;, score=0.801 total time=   0.0s
[CV 9/10] END alpha=1, class_prior=None, fit_prior=True, force_alpha=False;

In [87]:
grid2.best_params_

{'alpha': 1, 'class_prior': None, 'fit_prior': True, 'force_alpha': False}

In [88]:
y_pred=grid2.predict(X_test)
y_pred

array([1, 0, 0, ..., 0, 1, 0])

In [89]:
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

[[699 123]
 [140 419]]
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       822
           1       0.77      0.75      0.76       559

    accuracy                           0.81      1381
   macro avg       0.80      0.80      0.80      1381
weighted avg       0.81      0.81      0.81      1381

0.8095582910934106
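The assignment asks for 10-fold cross-validation on all four metrics, whereas the cells above score a single 70/30 hold-out split. A minimal sketch of how the three default classifiers could be compared fold by fold is shown here. `evaluate_naive_bayes` is a hypothetical helper, the shuffled `StratifiedKFold` is an assumption, and synthetic Poisson counts stand in for Spambase so the block runs without the CSV file.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

def evaluate_naive_bayes(X, y, n_splits=10, seed=0):
    """Mean accuracy, precision, recall and F1 over stratified k-fold CV
    for the three Naive Bayes variants with default hyperparameters."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scoring = ["accuracy", "precision", "recall", "f1"]
    models = {
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "GaussianNB": GaussianNB(),
    }
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[name] = {
            metric: float(np.mean(scores[f"test_{metric}"])) for metric in scoring
        }
    return results

# Synthetic stand-in data: non-negative counts whose rate depends on the label
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.poisson(lam=1.0 + 2.0 * y_demo[:, None], size=(200, 5)).astype(float)

demo_results = evaluate_naive_bayes(X_demo, y_demo)
for name, metrics in demo_results.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
```

In this notebook the helper would be called as `evaluate_naive_bayes(X, y)` with the `X` and `y` defined earlier; the features are non-negative, which `MultinomialNB` requires.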
