## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, you can use conditional probability. You want to find P(Smoker | Uses Health Insurance), which is read as "the probability of being a smoker given that one uses the health insurance plan."

You are given the following information:

- P(Uses Health Insurance) = 70% = 0.70
- P(Smoker | Uses Health Insurance) = 40% = 0.40

You can use the formula for conditional probability:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

In this case:

- A is the event of being a smoker.
- B is the event of using the health insurance plan.

So, you want to find $P(A|B)$, which is the probability of being a smoker given that one uses the health insurance plan. The question already states this quantity directly: 40% of the employees who use the plan are smokers, so

$P(\text{Smoker} \mid \text{Uses Health Insurance}) = 0.40$

No further calculation is needed for the conditional probability. The formula is still useful for recovering the joint probability, that is, the probability that an employee both uses the plan and smokes. Rearranging it gives:

$P(\text{Smoker} \cap \text{Uses Health Insurance}) = P(\text{Smoker} \mid \text{Uses Health Insurance}) \times P(\text{Uses Health Insurance}) = 0.40 \times 0.70 = 0.28$

Substituting back confirms the result:

$P(\text{Smoker} \mid \text{Uses Health Insurance}) = \frac{0.28}{0.70} = 0.40$

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40 or 40%. The value 0.28 (28%) is a different quantity: the probability that a randomly chosen employee both uses the plan and is a smoker.
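To keep the two quantities apart, the short check below computes the joint probability from the given figures and then recovers the conditional probability from the formula. The variable names are illustrative only.

```python
import math

p_uses_plan = 0.70            # P(Uses Health Insurance)
p_smoker_given_plan = 0.40    # P(Smoker | Uses Health Insurance), stated in the question

# Joint probability: the employee uses the plan AND is a smoker.
p_smoker_and_plan = p_smoker_given_plan * p_uses_plan

# Conditional probability recovered from P(A|B) = P(A and B) / P(B).
p_conditional = p_smoker_and_plan / p_uses_plan

print(f"P(Smoker and Uses plan) = {p_smoker_and_plan:.2f}")
print(f"P(Smoker | Uses plan)   = {p_conditional:.2f}")
```

The joint probability comes out as 0.28 and the conditional probability as 0.40, which matches the figure given in the question.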

## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are two different variants of the Naive Bayes algorithm used in machine learning, and they are primarily used for text classification tasks. The main difference between them lies in how they handle the features and the underlying probability distributions they assume.

1. **Bernoulli Naive Bayes**:

   - **Features**: Bernoulli Naive Bayes is typically used for binary feature vectors, where each feature is either present or absent. This makes it suitable for tasks like text classification, where the presence or absence of specific words or terms is important.

   - **Probability Distribution**: It models the probability of a document belonging to a particular class as a product of the probabilities of the presence or absence of each term in the document given the class. It assumes a Bernoulli distribution for each feature.

   - **Use Cases**: It is commonly used for problems like sentiment analysis, spam detection, and document classification, where the presence or absence of specific words is more important than their frequency.

2. **Multinomial Naive Bayes**:

   - **Features**: Multinomial Naive Bayes is used when dealing with discrete data, such as word counts in text data. It is suitable for text classification tasks where the frequency of words in a document matters.

   - **Probability Distribution**: It models the probability of a document belonging to a particular class using the word counts or frequencies in the document given the class. It assumes that the vector of feature counts for a document follows a multinomial distribution given the class, so each additional occurrence of a word contributes to the likelihood.

   - **Use Cases**: It is commonly used for tasks like document categorization, topic classification, and spam filtering, where the frequency of words or terms in a document is crucial.
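The difference can be seen on a tiny example. The sketch below uses a made-up word-count matrix (an assumption for illustration) and scikit-learn's `BernoulliNB` and `MultinomialNB`. With its default `binarize=0.0`, `BernoulliNB` only looks at whether a count is above zero, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: 4 documents x 3 vocabulary terms.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])
X_binary = (X_counts > 0).astype(int)  # presence/absence encoding

# BernoulliNB binarizes its input at threshold 0.0 by default, so counts
# and their presence/absence encoding yield the same fitted parameters.
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
bernoulli_same = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

# MultinomialNB uses the counts themselves, so binarizing changes the model.
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)
multinomial_same = np.allclose(mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_)

print("BernoulliNB unchanged by binarizing:  ", bernoulli_same)
print("MultinomialNB unchanged by binarizing:", multinomial_same)
```

On this toy data the Bernoulli model is identical either way, while the multinomial model changes once the frequency information is thrown away.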


## Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes assumes that all features are binary (i.e., each is either present or absent) and, like the other variants of Naive Bayes, has no built-in mechanism for handling missing values. When using Bernoulli Naive Bayes, it's essential to preprocess your data so that every feature is binary-encoded and contains no missing entries.

Here are some common strategies for dealing with missing values when using Bernoulli Naive Bayes:

1. **Imputation**:
   - One approach is to impute missing values before applying Bernoulli Naive Bayes. You can replace missing values with a specific value (e.g., 0 or 1) based on domain knowledge or by analyzing the data. This ensures that all features are binary.

2. **Dropping Missing Data**:
   - If the number of instances with missing values is relatively small, you may choose to remove those instances from your dataset. However, be cautious when doing this, as it can result in a loss of information.

3. **Use a Special Category**:
   - Another option is to create a special category for missing values and encode it as a binary feature. This allows you to explicitly model the absence of data as a feature in your Bernoulli Naive Bayes classifier.

4. **Data Imputation Algorithms**:
   - You can apply data imputation techniques specifically designed for binary data. These methods aim to estimate missing values based on the observed data while maintaining the binary nature of the features.

5. **Advanced Models**:
   - In some cases, if dealing with missing values is a significant concern, you might consider classification algorithms whose implementations handle missing data natively, such as some decision tree and gradient-boosted tree implementations. Support vector machines, like Naive Bayes, require complete inputs, so they are not a remedy here.
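As a minimal sketch of strategies 1 and 3 combined, the example below fills missing entries with the most frequent value per feature and adds a binary "was missing" indicator column before fitting `BernoulliNB`. The toy data is made up, and the choice of `SimpleImputer` with `add_indicator=True` is one possible approach, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan).
X = np.array([
    [1, 0, np.nan],
    [0, np.nan, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, np.nan, 0],
    [0, 1, 1],
], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# Impute with the most frequent value and append one indicator column
# for each feature that had missing values during fitting.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
model = make_pipeline(imputer, BernoulliNB())
model.fit(X, y)

X_filled = model.named_steps["simpleimputer"].transform(X)
predictions = model.predict(X)
print(X_filled.shape)
print(predictions)
```

Keeping the imputer inside the pipeline means the same fill values learned from the training data are reused at prediction time.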



## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes that the features follow a Gaussian (normal) distribution. It's particularly suited for continuous or real-valued features.

Naive Bayes is inherently a multi-class method, so Gaussian Naive Bayes needs no extension (such as one-vs-rest) to go beyond two classes. It evaluates the Gaussian probability density function separately for each class, combines it with that class's prior probability, and assigns the class with the highest resulting posterior probability as the predicted class.

Here's how Gaussian Naive Bayes works for multi-class classification:

1. **Model Building**:
   - For each class in the dataset, Gaussian Naive Bayes estimates the mean and variance of each feature. These parameters are used to characterize the Gaussian distribution for each class.

2. **Class Probability**:
   - When making a prediction for a new data point, the algorithm computes, for each class, the class prior multiplied by the product of the per-feature Gaussian densities evaluated with that class's mean and variance. This product is proportional to the posterior probability of the class.

3. **Prediction**:
   - The algorithm assigns the class with the highest calculated probability as the predicted class for the new data point.

Gaussian Naive Bayes can be effective for multi-class classification problems, especially when the assumption of Gaussian distribution for the features is reasonable. However, it's essential to keep in mind that the "naive" assumption of independence between features still applies, which may or may not hold in your specific dataset.
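The steps above can be sketched with scikit-learn's `GaussianNB` on the three-class Iris dataset that ships with the library. The dataset choice is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features.
X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)

print(clf.classes_)       # the class labels seen during fitting
print(clf.theta_.shape)   # per-class feature means: (n_classes, n_features)

# One posterior probability per class for each sample.
proba = clf.predict_proba(X[:5])
pred = clf.predict(X[:5])

# The predicted class is the one with the highest posterior probability.
print(np.array_equal(pred, clf.classes_[proba.argmax(axis=1)]))
```

No special multi-class setting is needed: the same `fit` and `predict` calls work for two classes or for many.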



## Q5. Assignment:
## Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.

## Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

## Results:

Report the following performance metrics for each classifier:

- Accuracy
- Precision
- Recall
- F1 score

## Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

## Conclusion:

Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')

In [2]:

df=pd.read_csv(r'C:\Users\tanji\Desktop\myPW\assignments\datasets\emails.csv')
df.head()

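| | Email No. | the | to | ect | and | for | of | a | you | hou | ... | connevey | jay | valued | lay | infrastructure | military | allowing | ff | dry | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Email 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Email 2 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | 1 | 27 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Email 3 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Email 4 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | 2 | 10 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Email 5 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |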


In [3]:
df.Prediction.value_counts()

Prediction
0    3672
1    1500
Name: count, dtype: int64

In [4]:
df_minority=df[df['Prediction']==1]
df_majority=df[df['Prediction']==0]

In [5]:
from sklearn.utils import resample

# Upsample the minority class (spam) with replacement to match the majority class size
df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)
df_minority_upsampled.shape

(3672, 3002)

In [6]:
newdf_upsampled=pd.concat([df_majority,df_minority_upsampled])
newdf_upsampled['Prediction'].value_counts()

Prediction
0    3672
1    3672
Name: count, dtype: int64

In [7]:
newdf_upsampled.drop('Email No.', axis=1, inplace=True)

In [8]:
X = newdf_upsampled.iloc[:, :-1]  # Features (word counts)
y = newdf_upsampled.iloc[:, -1]   # Target (Prediction column)

## Bernoulli, Multinomial, and Gaussian NB

In [9]:
# Instantiate the three Naive Bayes variants with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Mean 10-fold scores, one entry per classifier in the order of `classifiers`
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

classifiers = [bernoulli_nb, multinomial_nb, gaussian_nb]
for classifier in classifiers:
    print(classifier)
    # One 10-fold cross-validation run per metric
    accuracy = cross_val_score(classifier, X, y, cv=10, scoring='accuracy', verbose=3)
    precision = cross_val_score(classifier, X, y, cv=10, scoring='precision', verbose=3)
    recall = cross_val_score(classifier, X, y, cv=10, scoring='recall', verbose=3)
    f1 = cross_val_score(classifier, X, y, cv=10, scoring='f1', verbose=3)

    # Average the ten fold scores for each metric
    accuracy_scores.append(accuracy.mean())
    precision_scores.append(precision.mean())
    recall_scores.append(recall.mean())
    f1_scores.append(f1.mean())


Per-fold test scores from the 10-fold cross-validation runs:

| Classifier | Metric | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BernoulliNB() | accuracy | 0.863 | 0.814 | 0.850 | 0.854 | 0.873 | 0.868 | 0.846 | 0.837 | 0.858 | 0.759 |
| BernoulliNB() | precision | 0.918 | 0.899 | 0.951 | 0.928 | 0.939 | 0.950 | 0.899 | 0.924 | 0.928 | 0.765 |
| BernoulliNB() | recall | 0.796 | 0.706 | 0.739 | 0.769 | 0.798 | 0.777 | 0.779 | 0.733 | 0.777 | 0.747 |
| BernoulliNB() | f1 | 0.853 | 0.791 | 0.832 | 0.841 | 0.863 | 0.855 | 0.835 | 0.818 | 0.846 | 0.756 |
| MultinomialNB() | accuracy | 0.947 | 0.940 | 0.965 | 0.951 | 0.956 | 0.965 | 0.946 | 0.937 | 0.940 | 0.891 |
| MultinomialNB() | precision | 0.946 | 0.945 | 0.980 | 0.956 | 0.972 | 0.967 | 0.934 | 0.935 | 0.945 | 0.844 |
| MultinomialNB() | recall | 0.948 | 0.935 | 0.948 | 0.946 | 0.940 | 0.962 | 0.959 | 0.940 | 0.935 | 0.959 |
| MultinomialNB() | f1 | 0.947 | 0.940 | 0.964 | 0.951 | 0.956 | 0.964 | 0.946 | 0.938 | 0.940 | 0.898 |
| GaussianNB() | accuracy | 0.961 | 0.965 | 0.986 | 0.976 | 0.981 | 0.970 | 0.955 | 0.962 | 0.970 | 0.940 |
| GaussianNB() | precision | 0.940 | 0.948 | 0.981 | 0.970 | 0.971 | 0.950 | 0.937 | 0.943 | 0.948 | 0.907 |
| GaussianNB() | recall | 0.984 | 0.984 | 0.992 | 0.981 | 0.992 | 0.992 | 0.975 | 0.984 | 0.995 | 0.981 |
| GaussianNB() | f1 | 0.961 | 0.965 | 0.986 | 0.976 | 0.981 | 0.971 | 0.956 | 0.963 | 0.971 | 0.942 |

Each cross-validation run took roughly 15 to 16 s for BernoulliNB, 4 to 5 s for MultinomialNB, and about 13 s for GaussianNB.



In [10]:
accuracy_scores, precision_scores, recall_scores,f1_scores

([0.8421824315557286, 0.9437598472631559, 0.9665003985245324],
 [0.9103036502195325, 0.942356569721567, 0.9494866420361572],
 [0.7619868795166449, 0.9471678414879754, 0.9858384670062789],
 [0.8288040782726682, 0.9443154187716232, 0.9672347345739196])
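
The four metrics can also be collected in one cross-validation pass per classifier with scikit-learn's `cross_validate`, instead of four separate `cross_val_score` runs. The sketch below uses synthetic count data as a stand-in (an assumption, not the email dataset), so its numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in for a word-count matrix (not the email data).
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
rates = np.ones((200, 20))
rates[y == 1, :10] = 3.0      # class 1 uses the first 10 "words" more often
X = rng.poisson(rates)

scoring = ["accuracy", "precision", "recall", "f1"]
summary = {}
for clf in (BernoulliNB(), MultinomialNB(), GaussianNB()):
    # A single 10-fold run returns one array of fold scores per metric.
    results = cross_validate(clf, X, y, cv=10, scoring=scoring)
    summary[type(clf).__name__] = {m: results[f"test_{m}"].mean() for m in scoring}

for name, metrics in summary.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

Because every metric is computed on the same fitted model in each fold, this also cuts the number of model fits to a quarter.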

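## Discussion

- GaussianNB performs best on this data, with the highest mean accuracy (0.967), precision (0.949), recall (0.986), and F1 score (0.967). The word-count features are right-skewed rather than strictly normal, but a per-class Gaussian fit still separated the two classes better than the other two models.

- MultinomialNB also performs well (accuracy 0.944, F1 0.944), which is expected because the features are word counts from text, the kind of data this model is designed for.

- BernoulliNB performs worst (accuracy 0.842, recall 0.762) because it is designed for binary data: it reduces every word count to present or absent and so discards the frequency information.

- A limitation of this evaluation is that the minority class was upsampled before cross-validation, so copies of the same spam email can appear in both the training and test folds. The scores are therefore likely optimistic. The naive assumption of independent features also rarely holds for words in real emails.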
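
One limitation worth checking: the notebook upsamples the spam class before splitting into folds, so copies of the same email can land in both the training and the test folds. Below is a minimal sketch of upsampling inside each training fold instead. The synthetic count data and the helper `upsample_minority` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample

# Synthetic, imbalanced count data (not the email dataset).
rng = np.random.default_rng(42)
y = np.array([0] * 150 + [1] * 50)
rates = np.ones((200, 20))
rates[y == 1, :10] = 3.0
X = rng.poisson(rates)


def upsample_minority(X_train, y_train, random_state=42):
    """Hypothetical helper: resample class 1 up to the size of class 0."""
    X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                        random_state=random_state)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
    return X_bal, y_bal


fold_f1 = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y):
    # Upsample only the training part; the test fold keeps its original rows.
    X_bal, y_bal = upsample_minority(X[train_idx], y[train_idx])
    clf = MultinomialNB().fit(X_bal, y_bal)
    fold_f1.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(f"Mean F1 over {len(fold_f1)} folds: {np.mean(fold_f1):.3f}")
```

With this arrangement every test fold contains only original, unduplicated rows, so the reported score reflects performance on emails the model has not seen.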

## Conclusion

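GaussianNB is the best suited of the three models for this particular dataset, with a mean 10-fold accuracy of about 0.967, the highest of the three. MultinomialNB follows at about 0.944 and BernoulliNB at about 0.842. For future work, the evaluation could be repeated with the upsampling applied inside each training fold only, and with non-default hyperparameters such as the smoothing parameter `alpha`.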