## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, we can use Bayes' theorem. Let's define the events:

A: Employee uses the health insurance plan.
B: Employee is a smoker.

We are given the following probabilities:

1. \( P(A) \): Probability that an employee uses the health insurance plan = 0.70 (70%)
2. \( P(B|A) \): Probability that an employee is a smoker given that he/she uses the health insurance plan = 0.40 (40%)

We want to find \( P(B|A) \), the probability that an employee is a smoker given that he/she uses the health insurance plan.

Bayes' theorem states:

\[ P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)} \]

Where:
- \( P(B|A) \) is the probability of B (smoker) given A (uses health insurance plan).
- \( P(A|B) \) is the probability of A (uses health insurance plan) given B (smoker). This is what we want to find.
- \( P(B) \) is the probability of B (smoker).
- \( P(A) \) is the probability of A (uses health insurance plan).

We know \( P(A|B) = 0.40 \) (40% of employees who use the plan are smokers) and \( P(A) = 0.70 \) (70% of employees use the plan). We need to calculate \( P(B) \), the probability of an employee being a smoker.

To calculate \( P(B) \), we can use the Law of Total Probability:

\[ P(B) = P(B|A) \cdot P(A) + P(B|\neg A) \cdot P(\neg A) \]

Where:
- \( P(B|\neg A) \) is the probability of B (smoker) given not A (does not use health insurance plan). This information is not given, so we'll assume it to be independent of health insurance usage and set it to a value (e.g., 0.20).
- \( P(\neg A) \) is the probability of not A (does not use health insurance plan) = 1 - \( P(A) \) = 1 - 0.70 = 0.30.

Now, let's calculate \( P(B) \) and then use Bayes' theorem to find \( P(B|A) \):

```python
# Given probabilities
P_A = 0.70
P_B_given_A = 0.40
P_B_given_not_A = 0.20
P_not_A = 1 - P_A

# Calculate P(B) using the Law of Total Probability
P_B = P_B_given_A * P_A + P_B_given_not_A * P_not_A

# Use Bayes' theorem to find P(B|A)
P_B_given_A = (P_B_given_A * P_A) / P_B

print("Probability that an employee is a smoker given that he/she uses the health insurance plan:", P_B_given_A)
```

After running the above code, you will get the probability that an employee is a smoker given that he/she uses the health insurance plan.

## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes classifier, but they differ in the type of data they can handle and the way they model the features. The key differences between them are as follows:

**1. Data Representation:**
- **Bernoulli Naive Bayes:** It is designed for binary data, where each feature can take on only two values, typically 0 or 1. It assumes that each feature is a binary variable and models the presence or absence of a feature in a document.
- **Multinomial Naive Bayes:** It is suitable for discrete data, where each feature represents the count or frequency of occurrences of a specific event. It works with integer feature counts and can handle multiple categories or classes for each feature.

**2. Feature Model:**
- **Bernoulli Naive Bayes:** In Bernoulli Naive Bayes, each feature is treated as a binary variable, and the probability of each feature occurring is modeled using a Bernoulli distribution. It focuses on whether a feature is present (1) or absent (0) in a document.
- **Multinomial Naive Bayes:** In Multinomial Naive Bayes, the probability of each feature occurring is modeled using a multinomial distribution. It considers the counts or frequencies of different categories or events for each feature.

**3. Handling of Absent Features:**
- **Bernoulli Naive Bayes:** It takes into account the absence of features (i.e., features that are not present) and incorporates them into the classification process.
- **Multinomial Naive Bayes:** It typically ignores the absence of features and only focuses on the counts or frequencies of the features that are present.

**4. Use Cases:**
- **Bernoulli Naive Bayes:** It is commonly used in text classification tasks where the features represent the presence or absence of words in a document, such as spam detection or sentiment analysis.
- **Multinomial Naive Bayes:** It is widely used in text classification tasks where the features represent the counts or frequencies of words or tokens in a document, such as document categorization or topic modeling.

In summary, the choice between Bernoulli Naive Bayes and Multinomial Naive Bayes depends on the type of data and the representation of features in the dataset. If the features are binary (presence/absence), Bernoulli Naive Bayes is more appropriate. If the features are discrete counts or frequencies, Multinomial Naive Bayes is a better choice. Both classifiers are simple and fast, making them popular choices for text and document classification tasks in natural language processing.

## Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes is designed to handle binary data, where each feature can take on only two values, typically represented as 0 and 1. When dealing with missing values in Bernoulli Naive Bayes, the approach taken depends on how the missing values are encoded or treated in the data.

There are two common approaches to handle missing values in Bernoulli Naive Bayes:

**1. Ignoring Missing Values:**
One approach is to simply ignore the missing values and exclude those instances from the analysis. In this case, the instances with missing values are not used during training the model. During prediction, if a new instance contains missing values for some features, those features are ignored, and the model is applied only to the non-missing features. This is the simplest approach, but it may lead to a loss of information if there is valuable data in the missing features.

**2. Encoding Missing Values:**
Another approach is to encode missing values as a separate category or value, distinct from 0 and 1. This way, missing values are treated as a third category in the binary feature representation. For example, if the original binary features are represented as 0 and 1, the missing values can be encoded as -1 (or any other value that does not represent 0 or 1). During training, the model will consider the missing values as a separate category and learn the probability of this category being associated with the target class.

When using the encoding approach, it is important to be mindful of how the missing values are handled during data preprocessing and model training. One common practice is to use the training data to estimate the probability of the missing value category being associated with each class. During prediction, if a new instance contains missing values, the model will use these estimated probabilities to classify the instance, taking into account the missing value category.

In summary, Bernoulli Naive Bayes can handle missing values by either ignoring them or encoding them as a separate category. The choice of approach depends on the specific characteristics of the data and the impact of missing values on the analysis.

## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is an extension of the Naive Bayes algorithm that is suitable for continuous or real-valued features. It assumes that each feature follows a Gaussian (normal) distribution within each class.

For multi-class classification, where there are more than two classes to predict, Gaussian Naive Bayes can handle the problem by applying the algorithm to each class separately and then making a decision based on the probabilities obtained for each class.

Here's how Gaussian Naive Bayes works for multi-class classification:

1. **Training Phase:**
   - For each class, calculate the mean and variance of each feature using the training data instances belonging to that class.
   - This involves calculating the mean and variance for each feature for each class, assuming they are normally distributed.

2. **Prediction Phase:**
   - Given a new instance with continuous feature values, calculate the likelihood of each feature's value belonging to each class using the Gaussian probability density function.
   - Combine the individual feature probabilities for each class using Bayes' theorem to get the posterior probability for each class.
   - Assign the new instance to the class with the highest posterior probability.

The decision rule for multi-class classification using Gaussian Naive Bayes is to select the class with the highest posterior probability given the observed feature values. The posterior probability is calculated as the product of the prior probability of the class and the likelihoods of the individual features, assuming they are independent.

While Gaussian Naive Bayes can be used for multi-class classification, it may not always be the best choice for highly complex or high-dimensional data. In such cases, other algorithms like Multinomial Naive Bayes, Decision Trees, Random Forests, or Support Vector Machines may offer better performance. However, Gaussian Naive Bayes remains a simple and computationally efficient algorithm for multi-class classification tasks, especially when dealing with continuous features.

## Q5. Assignment:

Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/
datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:

- Report the following performance metrics for each classifier:
 -Accuracy
- Precision
- Recall
- F1 score

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:


Summarise your findings and provide some suggestions for future work.


Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

### Data preparation:

In [8]:
with open ('spambase/spambase.names','r') as f:
    a=f.read()

In [9]:
print(a)

| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
| 48 continuous real [0,100] attributes of type word_freq_WORD 
| = percentage of words in the e-mail that match WORD,
| i.e. 100 * (number of times the WORD appears in the e-mail) / 
| total number of words in e-mail.  A "word" in this case is any 
| string of alphanumeric characters bounded by non-alphanumeric 
| characters or end-of-string.
|
| 6 continuous real [0,100] attributes of type char_freq_CHAR
| = percentage of characters in the e-mail that match CHAR,
| i.e. 100 * (number of CHAR occurences) / total characters in e-mail
|
| 1 continuous real [1,...] attribute of type capital_run_length_average
| = average length of uninterrupted sequences of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_longest
| = length of longest uninterrupted sequence of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_total
| = sum of length of uninterrupted sequences of

In [10]:
with open ('spambase/spambase.DOCUMENTATION', 'r') as f1:
    b=f1.read()

In [11]:
print(b)

1. Title:  SPAM E-mail Database

2. Sources:
   (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
        Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
   (b) Donor: George Forman (gforman at nospam hpl.hp.com)  650-857-7835
   (c) Generated: June-July 1999

3. Past Usage:
   (a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
   (b) Determine whether a given email is spam or not.
   (c) ~7% misclassification error.
       False positives (marking good mail as spam) are very undesirable.
       If we insist on zero false positives in the training/testing set,
       20-25% of the spam passed through the filter.

4. Relevant Information:
        The "spam" concept is diverse: advertisements for products/web
        sites, make money fast schemes, chain letters, pornography...
	Our collection of spam e-mails came from our postmaster and 
	individuals who had filed spam.  Our collection of non-spam 
	e-mails came from filed work a

In [13]:
import pandas as pd
df= pd.read_csv('spambase/spambase.data',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [16]:
features=[]
for i in range(df.shape[1]) :
    if i!=57:
        f='f'+str(i+1)
        features.append(f)
    else:
        fs='target'
        features.append(fs)

In [17]:
df.columns=features

In [18]:
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f49,f50,f51,f52,f53,f54,f55,f56,f57,target
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [19]:
df.isnull().sum()

f1        0
f2        0
f3        0
f4        0
f5        0
f6        0
f7        0
f8        0
f9        0
f10       0
f11       0
f12       0
f13       0
f14       0
f15       0
f16       0
f17       0
f18       0
f19       0
f20       0
f21       0
f22       0
f23       0
f24       0
f25       0
f26       0
f27       0
f28       0
f29       0
f30       0
f31       0
f32       0
f33       0
f34       0
f35       0
f36       0
f37       0
f38       0
f39       0
f40       0
f41       0
f42       0
f43       0
f44       0
f45       0
f46       0
f47       0
f48       0
f49       0
f50       0
f51       0
f52       0
f53       0
f54       0
f55       0
f56       0
f57       0
target    0
dtype: int64

### Implementation:

In [20]:
### Segreating independent and dependate features
X=df.drop(columns="target")

In [22]:
y=df['target']

In [23]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=20)

#### Bernoulli Naive Bayes,`

In [36]:
from sklearn.naive_bayes import BernoulliNB
ber=BernoulliNB()
ber.fit(X_train,y_train.values.flatten())

In [32]:
from sklearn.model_selection import StratifiedKFold
skf =  StratifiedKFold(n_splits=10,shuffle=True,random_state=42)
from sklearn.model_selection import cross_val_score
scores_bnb = cross_val_score(BernoulliNB(),X_train,y_train.values.flatten(),cv=skf,scoring='f1')
scores_bnb

array([0.88372093, 0.84644195, 0.81509434, 0.85384615, 0.824     ,
       0.81509434, 0.86363636, 0.86590038, 0.88461538, 0.824     ])

In [33]:
import numpy as np
mean_score_bnb = np.mean(scores_bnb)
print('Results for BernoulliNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_bnb:.4f}')

Results for BernoulliNB :
Mean 10 fold cross validation f1 score is : 0.8476


#### Multinomial Naive Bayes,

In [37]:
from sklearn.naive_bayes import MultinomialNB
mul=MultinomialNB()
mul.fit(X_train,y_train.values.flatten())

In [39]:
from sklearn.model_selection import StratifiedKFold
stk=StratifiedKFold(n_splits=10,shuffle=True,random_state=42)
from sklearn.model_selection import cross_val_score
scores_mnb=cross_val_score(mul,X_train,y_train.values.flatten(),scoring='f1',cv=stk)
scores_mnb

array([0.81081081, 0.6770428 , 0.72324723, 0.74264706, 0.76534296,
       0.73356401, 0.68965517, 0.78700361, 0.72030651, 0.72289157])

In [40]:
mean_score_mnb = np.mean(scores_mnb)
print('Results for MultinomialNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_mnb:.4f}')

Results for MultinomialNB :
Mean 10 fold cross validation f1 score is : 0.7373


#### Gaussian Naive Bayes

In [41]:
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
gnb.fit(X_train,y_train.values.flatten())

In [46]:
from sklearn.model_selection import cross_val_score,StratifiedKFold
skc=StratifiedKFold(n_splits=10,shuffle=True,random_state=42)
score_gnb=cross_val_score(gnb,X_train,y_train.values.flatten())
score_gnb

array([0.81449275, 0.82318841, 0.84057971, 0.82318841, 0.83768116])

In [50]:

import numpy as np
mean_score_gnb = np.mean(score_gnb)
print('Results for Gaussian Naive Bayes')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_gnb:.4f}')

Results for Gaussian Naive Bayes
Mean 10 fold cross validation f1 score is : 0.8278


### Results

In [51]:
# Define a function to store all above metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def evaluate_model(x,y,model):
    ypred = model.predict(x)
    acc = accuracy_score(y,ypred)
    pre = precision_score(y,ypred)
    rec = recall_score(y,ypred)
    f1 = f1_score(y,ypred)
    print(f'Accuracy  : {acc:.4f}')
    print(f'Precision : {pre:.4f}')
    print(f'Recall    : {rec:.4f}')
    print(f'F1 Score  : {f1:.4f}')
    return acc, pre, rec, f1

#### Evaluate GaussianNB

In [52]:
print('Gaussian Naive Bayes Results : \n')
acc_gnb, pre_gnb, rec_gnb, f1_gnb = evaluate_model(X_test,y_test.values.flatten(),gnb)

Gaussian Naive Bayes Results : 

Accuracy  : 0.8115
Precision : 0.6933
Recall    : 0.9547
F1 Score  : 0.8033


#### Evaluate BernoulliNB

In [54]:
print('Bernoulli Naive Bayes Results : \n')
acc_bnb, pre_bnb, rec_bnb, f1_bnb = evaluate_model(X_test,y_test.values.flatten(),ber)

Bernoulli Naive Bayes Results : 

Accuracy  : 0.8983
Precision : 0.9044
Recall    : 0.8362
F1 Score  : 0.8690


#### Evaluate MultinomialNB

In [56]:
print('Multinomial Naive Bayes Results : \n')
acc_mnb, pre_mnb, rec_mnb, f1_mnb = evaluate_model(X_test,y_test.values.flatten(),mul)

Multinomial Naive Bayes Results : 

Accuracy  : 0.7897
Precision : 0.7534
Recall    : 0.7112
F1 Score  : 0.7317


### Discussion:

In [57]:
# Creating a dictionary for dataframe
dct = {
    'score':['accuracy','precision','recall','f1'],
    'Gaussian':[acc_gnb,pre_gnb,rec_gnb,f1_gnb],
    'Bernoulli':[acc_bnb,pre_bnb,rec_bnb,f1_bnb],
    'Multinomial':[acc_mnb,pre_mnb,rec_mnb,f1_mnb]
}

In [58]:

# Creating a DataFrame
df_compare = pd.DataFrame(dct)
df_compare

Unnamed: 0,score,Gaussian,Bernoulli,Multinomial
0,accuracy,0.811468,0.898349,0.789748
1,precision,0.693271,0.904429,0.753425
2,recall,0.954741,0.836207,0.711207
3,f1,0.803264,0.868981,0.731707


In [59]:
dct_crossval = {
    'models':['Gaussian','Bernoulli','Multinomial'],
    'cross_val_score_mean':[mean_score_gnb,mean_score_bnb,mean_score_mnb]
}

In [60]:
df_crossval = pd.DataFrame(dct_crossval)
df_crossval

Unnamed: 0,models,cross_val_score_mean
0,Gaussian,0.827826
1,Bernoulli,0.847635
2,Multinomial,0.737251


### Conclusion



Summarise your findings and provide some suggestions for future work.

Best Model for above data is Bernoulli Naive Bayes

Bernoulli Naive Bayes is best model because of below reasons :

BernoulliNB has highest test f1 score of 0.8509

BernoulliNB has highest test accuracy of 0.8870

BernoulliNB has highest 10 fold cross validation F1 score of 0.8492

Although Naive Bayes algorithm is a powerful and widely used algorithm, it also has some limitations, including:

1.The assumption of feature independence: The Naive Bayes algorithm assumes that the features are independent of each other. However, in real-world scenarios, this assumption is not always true, and features may be dependent on each other.

2.Sensitivity to input data: Naive Bayes algorithm is very sensitive to input data, and even a slight change in the input data can significantly affect the accuracy of the model.

3.Lack of tuning parameters: Naive Bayes algorithm does not have many tuning parameters that can be adjusted to improve its performance.

4.Data sparsity problem: Naive Bayes algorithm relies on a lot of training data to estimate the probabilities of different features. However, if some features have very low frequencies in the training data, the algorithm may not be able to accurately estimate their probabilities.

5.Class-conditional independence assumption: Naive Bayes algorithm assumes that each feature is conditionally independent given the class. However, in many cases, this assumption may not hold, and the algorithm may not perform well.

6.Imbalanced class distribution: Naive Bayes algorithm assumes that the classes are equally likely, but in real-world scenarios, the class distribution may be imbalanced, which can lead to biased results.

7.The need for continuous data: Naive Bayes algorithm assumes that the input features are continuous, which may not always be the case in real-world scenarios where the input features are discrete.

Below are conclusions for above model

1.Bernoulli Naive Bayes performed best on both cross validation and test dataset.

2.For Email Classification Neural Network is better suited algorithm as it is able to provide better results and has lot of tunable paramenters.



### Model saving

In [62]:
# Saving the BernoulliNB file to pickle for future use
import pickle
with open('BernoulliModel.pkl','wb') as f:
    pickle.dump(ber,file=f)