## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

The question states the required conditional probability directly, so Bayes' theorem is not actually needed here. Let's define:

A: an employee uses the company's health insurance plan
B: an employee is a smoker

We want to find the probability of an employee being a smoker given that he/she uses the health insurance plan, which is P(B|A).

We know that 70% of the employees use the health insurance plan, which means P(A) = 0.7.

We also know that 40% of the employees who use the plan are smokers, which is exactly P(B|A) = 0.4.

As a consistency check, the definition of conditional probability gives:

P(B|A) = P(A and B) / P(A)

P(A and B) = P(B|A) * P(A) = 0.4 * 0.7 = 0.28

P(B|A) = 0.28 / 0.7 = 0.4

Bayes' theorem states that P(B|A) = P(A|B) * P(B) / P(A). Applying it would require P(B), the probability of an employee being a smoker regardless of plan usage, which by the law of total probability is:

P(B) = P(B|A) * P(A) + P(B|A') * P(A')

where A' means an employee does not use the health insurance plan. The survey does not report P(B|A'), the smoking rate among non-users, so P(B) cannot be determined from the given data. It is also not needed to answer this question.

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4 or 40%.
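
The arithmetic can be checked with a few lines of Python. This is only a sketch using the two numbers given in the question; the variable names are chosen here for illustration.

```python
# Values stated in the question
p_a = 0.7          # P(A): employee uses the health insurance plan
p_b_given_a = 0.4  # P(B|A): smoker, given that the employee uses the plan

# Product rule: P(A and B) = P(B|A) * P(A)
p_a_and_b = p_b_given_a * p_a

# Definition of conditional probability: P(B|A) = P(A and B) / P(A)
recovered = p_a_and_b / p_a

print(f"P(A and B) = {p_a_and_b:.2f}")
print(f"P(B|A)     = {recovered:.2f}")
```

Note that P(B) never appears: the joint probability and P(A) are enough.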



## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm used for classification tasks, especially in natural language processing and text classification. They are both based on Bayes' theorem and the assumption of conditional independence among features given the class label. However, they differ in the type of data they are designed to handle and their underlying probability models.

### Bernoulli Naive Bayes:

- Suitable for binary feature data, where each feature can take only two values (usually 0 and 1).
- The presence or absence of a feature in a document is represented by binary values (0 or 1).
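- Models each feature as a Bernoulli variable that is conditionally independent of the others given the class label, so the absence of a feature (value 0) contributes to the likelihood just as its presence does.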
- Works well for tasks like document classification, spam filtering, sentiment analysis, etc., where the presence or absence of certain words or features in a document is essential.

### Multinomial Naive Bayes:

- Suited for discrete count data, commonly used for text classification tasks, where features represent word counts or frequency.
- Instead of binary values, it deals with integer counts of occurrences of each feature (word) in a document.
- Assumes that each feature's count (word frequency) is conditionally independent given the class label.
- It is particularly useful when we want to take into account the frequency or distribution of words in a document.
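
The difference between the two likelihood models can be sketched in plain Python. The vocabulary size, the class parameters, and the function names below are made up for illustration and are not taken from any library.

```python
import math

def bernoulli_log_likelihood(counts, p_present):
    """Log-likelihood of a document under the Bernoulli model.

    Only presence/absence matters: a present word contributes log(p),
    an absent word contributes log(1 - p).
    """
    total = 0.0
    for count, p in zip(counts, p_present):
        total += math.log(p) if count > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, p_word):
    """Log-likelihood under the multinomial model, up to the multinomial
    coefficient (which is the same for every class).

    Each occurrence of a word contributes log(p), so counts act as weights
    and absent words contribute nothing.
    """
    return sum(count * math.log(p) for count, p in zip(counts, p_word))

# Hypothetical parameters for one class over a three-word vocabulary
p_present = [0.8, 0.3, 0.1]  # P(word appears in a document | class)
p_word = [0.6, 0.3, 0.1]     # P(a drawn token is this word | class), sums to 1

doc_once = [1, 0, 0]    # first word appears once
doc_thrice = [3, 0, 0]  # first word appears three times

# Bernoulli: both documents get the same score
print(bernoulli_log_likelihood(doc_once, p_present))
print(bernoulli_log_likelihood(doc_thrice, p_present))

# Multinomial: the score scales with the word count
print(multinomial_log_likelihood(doc_once, p_word))
print(multinomial_log_likelihood(doc_thrice, p_word))
```

The Bernoulli scores are identical because repetition is discarded, while the multinomial score for the second document is three times that of the first.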

## Q3. How does Bernoulli Naive Bayes handle missing values?

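The Bernoulli Naive Bayes model has no dedicated mechanism for missing values, but the conditional independence assumption makes them easy to handle in principle: when a feature's value is unknown for an instance, that feature can be left out of the product of per-feature probabilities for that instance, which is equivalent to marginalizing over it. A missing value is different from a feature that is absent (value 0): absence is an observed value and contributes P(x_i = 0 | class) to the likelihood. Library implementations do not necessarily do this automatically. scikit-learn's BernoulliNB, for example, rejects inputs containing NaN, so missing values have to be imputed or removed first.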

To understand this better, let's recap the fundamental steps of the Bernoulli Naive Bayes algorithm:

### Training Phase:

- Bernoulli Naive Bayes estimates probabilities for each feature (binary-valued) in each class based on the training data. It calculates the probabilities of a feature being 1 (present) and 0 (absent) for each class.

### Classification Phase:

- When classifying a new instance (document), Bernoulli Naive Bayes calculates the posterior probability of each class given the presence or absence of each feature (word) in the instance.
- If a feature's value is missing (not available) for the instance, its term can be omitted from the probability calculation.

Because an omitted feature contributes no term, it does not influence the class posteriors, which avoids basing predictions on guessed values. This should not be confused with sparse data, where most features are observed to be 0 (absent): those zeros are real observations and do enter the calculation.

It's worth noting that the Naive Bayes algorithm, including Bernoulli Naive Bayes, tolerates missing values well because of the conditional independence assumption. Each feature contributes a separate factor given the class label, so dropping one factor leaves the others unchanged. How much the classification decision changes depends on how informative the dropped feature was.
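
The idea of leaving a missing feature out of the product can be sketched as follows. This is a hand-written illustration, not scikit-learn's behaviour; the class names, parameters, and the use of `None` as a missing-value marker are assumptions made for the example.

```python
import math

def bernoulli_nb_log_score(x, log_prior, p_present):
    """Unnormalised log-posterior of one class for a binary feature vector.

    Entries of x are 1 (present), 0 (absent) or None (missing). Missing
    entries add no term, which is the same as marginalising them out.
    """
    score = log_prior
    for value, p in zip(x, p_present):
        if value is None:
            continue  # unknown value: skip this feature's factor
        score += math.log(p) if value == 1 else math.log(1.0 - p)
    return score

# Made-up parameters for two classes over three binary features
params = {
    "spam": {"log_prior": math.log(0.4), "p_present": [0.9, 0.2, 0.7]},
    "ham": {"log_prior": math.log(0.6), "p_present": [0.1, 0.3, 0.2]},
}

def predict(x):
    """Return the class with the highest unnormalised log-posterior."""
    scores = {label: bernoulli_nb_log_score(x, **p) for label, p in params.items()}
    return max(scores, key=scores.get)

print(predict([1, 0, 1]))           # all features observed
print(predict([1, None, 1]))        # second feature missing
print(predict([None, None, None]))  # nothing observed: falls back to the prior
```

With every feature missing, the score reduces to the class prior, so the most frequent class is predicted.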

## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm designed to handle continuous (real-valued) features. It assumes that, within each class, every feature follows a Gaussian (normal) distribution. Whereas Bernoulli and Multinomial Naive Bayes handle binary and count-based data respectively, Gaussian Naive Bayes is appropriate for datasets with continuous numeric features.


Multi-class classification is a scenario where each instance belongs to exactly one of several classes (more than two). For example, classifying an image of an animal into categories like "cat," "dog," "bird," or "elephant" is a multi-class classification task.


In Gaussian Naive Bayes for multi-class classification, the algorithm estimates the parameters of the Gaussian distribution (mean and variance) for each continuous feature in each class during the training phase. Then, during the classification phase, it calculates the probability of an instance belonging to each class using the Gaussian probability density function based on the feature values of the instance and the estimated parameters for each class.

The final decision for the class label is made based on the class with the highest probability for the given instance.
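
The training and classification steps described above can be sketched in plain Python for a three-class toy problem. The data, the class labels, and the small variance floor are assumptions made for this illustration. A real project would normally use scikit-learn's `GaussianNB` instead.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def fit(X, y):
    """Estimate the log-prior and per-feature mean and variance for each class."""
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The 1e-9 floor keeps variances strictly positive (an assumption of this sketch)
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(y)), means, variances)
    return model

def predict(model, x):
    """Pick the class with the highest log-prior plus summed log-densities."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        scores[label] = log_prior + sum(
            gaussian_log_pdf(v, m, s) for v, m, s in zip(x, means, variances))
    return max(scores, key=scores.get)

# Toy data: three classes, two continuous features
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
     [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.8, 0.9]]
y = ["cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird", "bird"]

model = fit(X, y)
print(predict(model, [1.1, 1.0]))
print(predict(model, [5.0, 5.0]))
print(predict(model, [9.1, 1.0]))
```

Nothing in the procedure is specific to two classes: one set of means and variances is stored per class, and the prediction is an argmax over however many classes there are.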

Gaussian Naive Bayes is computationally efficient and easy to implement. However, it assumes that the features in each class follow a Gaussian distribution and that they are conditionally independent given the class label. These assumptions may not always hold in practice, especially for complex and high-dimensional data.

While Gaussian Naive Bayes can be used for multi-class classification, it's essential to consider the characteristics of your data and the assumptions of the algorithm. Depending on the specific problem and dataset, other classifiers like Logistic Regression, Decision Trees, Random Forests, or Support Vector Machines may be more suitable for multi-class classification tasks. It's always a good idea to experiment with multiple algorithms and compare their performance to find the best approach for your specific problem.






## Q5. Assignment:
### Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

### Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

#### Results:

Report the following performance metrics for each classifier:

- Accuracy
- Precision
- Recall
- F1 score

#### Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

#### Conclusion:

Summarise your findings and provide some suggestions for future work.

Introduction: In this assignment, we will implement and compare the performance of three variants of Naive Bayes classifiers: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes on the "Spambase Data Set" from the UCI Machine Learning Repository. We will use the scikit-learn library in Python for implementation and 10-fold cross-validation for evaluation.

Data Preparation: First, we need to download the Spambase Data Set from the UCI Machine Learning Repository. The dataset contains 4601 email messages, where the goal is to predict whether a message is spam or not based on 57 input features. These are 48 word-frequency features and 6 character-frequency features (each expressed as a percentage of the words or characters in the message), plus 3 features describing runs of capital letters (average run length, longest run, and total run length).

Implementation: We will now implement the three variants of Naive Bayes classifiers using the scikit-learn library in Python. The implementation is straightforward, and we will use the default hyperparameters for each classifier.



In [64]:
with open('spambase.names','r') as f:
    a = f.read()

In [66]:
print(a)

| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
| 48 continuous real [0,100] attributes of type word_freq_WORD 
| = percentage of words in the e-mail that match WORD,
| i.e. 100 * (number of times the WORD appears in the e-mail) / 
| total number of words in e-mail.  A "word" in this case is any 
| string of alphanumeric characters bounded by non-alphanumeric 
| characters or end-of-string.
|
| 6 continuous real [0,100] attributes of type char_freq_CHAR
| = percentage of characters in the e-mail that match CHAR,
| i.e. 100 * (number of CHAR occurences) / total characters in e-mail
|
| 1 continuous real [1,...] attribute of type capital_run_length_average
| = average length of uninterrupted sequences of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_longest
| = length of longest uninterrupted sequence of capital letters
|
| 1 continuous integer [1,...] attribute of type capital_run_length_total
| = sum of length of uninterrupted sequences of

In [71]:
with open('spambase.DOCUMENTATION',mode='r') as f1:
    b=f1.read()

In [72]:
print(b)

1. Title:  SPAM E-mail Database

2. Sources:
   (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
        Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
   (b) Donor: George Forman (gforman at nospam hpl.hp.com)  650-857-7835
   (c) Generated: June-July 1999

3. Past Usage:
   (a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
   (b) Determine whether a given email is spam or not.
   (c) ~7% misclassification error.
       False positives (marking good mail as spam) are very undesirable.
       If we insist on zero false positives in the training/testing set,
       20-25% of the spam passed through the filter.

4. Relevant Information:
        The "spam" concept is diverse: advertisements for products/web
        sites, make money fast schemes, chain letters, pornography...
	Our collection of spam e-mails came from our postmaster and 
	individuals who had filed spam.  Our collection of non-spam 
	e-mails came from filed work a

In [73]:
import pandas as pd

In [74]:
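# Note: spambase.data has no header row, so without header=None the first record is used as the column names
df = pd.read_csv('spambase.data')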

In [77]:
df.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.41,0.42,0.43,0.778,0.44,0.45,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1
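
The column labels in the output above (0, 0.64, 0.64.1, ...) are data values, which suggests that the first record of the headerless file was consumed as the header. A minimal sketch of the difference, using two made-up records and hypothetical column names rather than the real file:

```python
import io
import pandas as pd

# Two made-up records in the same headerless, comma-separated layout
raw = "0.0,0.64,1\n0.21,0.28,0\n"

# Default behaviour: the first record is consumed as column names
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every record; names= supplies hypothetical column labels
df_all = pd.read_csv(io.StringIO(raw), header=None, names=["f1", "f2", "target"])

print(len(df_default), list(df_default.columns))
print(len(df_all), list(df_all.columns))
```

Applied to `spambase.data`, passing `header=None` would keep all 4601 records. The results reported in this notebook were produced without it, so they are based on one record fewer.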


In [79]:
# Name the feature columns f1..f57 and the last column 'target'
n_features = df.shape[1] - 1
features = [f'f{i + 1}' for i in range(n_features)] + ['target']

In [80]:
df.columns = features
df.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f49,f50,f51,f52,f53,f54,f55,f56,f57,target
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [84]:
X = df.drop('target',axis=1)
y = df['target']

In [88]:
# Hold out 30% of the rows as a test set; the fixed seed makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [103]:
## Import the three Naive Bayes variants and the cross-validation helper

from sklearn.naive_bayes import GaussianNB , MultinomialNB , BernoulliNB
from sklearn.model_selection import cross_val_score

In [97]:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

In [98]:
gnb.fit(X_train,y_train)


In [99]:
from sklearn.model_selection import StratifiedKFold
skf =  StratifiedKFold(n_splits=10,shuffle=True,random_state=42)

In [101]:
from sklearn.model_selection import cross_val_score
scores_gnb = cross_val_score(GaussianNB(),X_train,y_train.values.flatten(),cv=skf,scoring='f1')
scores_gnb

array([0.79734219, 0.83737024, 0.7826087 , 0.75767918, 0.79725086,
       0.79861111, 0.80272109, 0.84722222, 0.82926829, 0.81333333])

In [102]:
import numpy as np
mean_score_gnb = np.mean(scores_gnb)
print('Results for Gaussian Naive Bayes')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_gnb:.4f}')

Results for Gaussian Naive Bayes
Mean 10 fold cross validation f1 score is : 0.8063


In [107]:
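### Bernoulli Naive Bayes
# Note: with default settings, BernoulliNB binarizes the continuous features at a threshold of 0.0 (binarize=0.0)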
bnb.fit(X_train,y_train)
scores_bnb = cross_val_score(BernoulliNB(),X_train,y_train.values.flatten(),cv=skf,scoring='f1')
scores_bnb


array([0.85365854, 0.89539749, 0.86580087, 0.80672269, 0.84518828,
       0.81512605, 0.85217391, 0.87136929, 0.86075949, 0.87029289])

In [108]:
mean_score_bnb = np.mean(scores_bnb)
print('Results for BernoulliNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_bnb:.4f}')


Results for BernoulliNB :
Mean 10 fold cross validation f1 score is : 0.8536
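
The Spambase features are continuous, so it helps to see what BernoulliNB actually models. With its default `binarize=0.0`, values greater than 0 become 1 and everything else becomes 0. The sketch below reproduces that thresholding by hand on a few made-up rows; `X_demo` is a stand-in for illustration, not the real data.

```python
import numpy as np

# A few made-up rows of frequency-style features
X_demo = np.array([[0.00, 0.64, 3.756],
                   [0.21, 0.00, 5.114]])

# Same rule as binarize=0.0: strictly positive values map to 1, the rest to 0
X_binary = (X_demo > 0.0).astype(int)
print(X_binary)
```

Each word-frequency feature therefore becomes "does this word occur at all", while features that are always positive, such as the capital-run lengths, become constant 1 and carry no information for this model.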


In [111]:
# Multinomial Naive Bayes
mnb.fit(X_train,y_train)
scores_mnb = cross_val_score(MultinomialNB(),X_train,y_train,cv=skf,scoring='f1')
scores_mnb

array([0.7394958 , 0.728     , 0.79352227, 0.66666667, 0.69026549,
       0.72289157, 0.7295082 , 0.73858921, 0.71836735, 0.75      ])

In [112]:
mean_score_mnb = np.mean(scores_mnb)
print('Results for MultinomialNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_mnb:.4f}')


Results for MultinomialNB :
Mean 10 fold cross validation f1 score is : 0.7277


In [113]:
# Define a function that computes, prints, and returns the four metrics for a fitted model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def evaluate_model(x,y,model):
    ypred = model.predict(x)
    acc = accuracy_score(y,ypred)
    pre = precision_score(y,ypred)
    rec = recall_score(y,ypred)
    f1 = f1_score(y,ypred)
    print(f'Accuracy  : {acc:.4f}')
    print(f'Precision : {pre:.4f}')
    print(f'Recall    : {rec:.4f}')
    print(f'F1 Score  : {f1:.4f}')
    return acc, pre, rec, f1

In [116]:
## Evaluate GaussianNB

print('Gaussian Naive Bayes Results : \n')
acc_gnb, pre_gnb, rec_gnb, f1_gnb = evaluate_model(X_test,y_test.values.flatten(),gnb)

Gaussian Naive Bayes Results : 

Accuracy  : 0.8181
Precision : 0.7122
Recall    : 0.9480
F1 Score  : 0.8134


In [118]:
## Evaluate BernoulliNB

print('Bernoulli Naive Bayes Results : \n')
acc_bnb, pre_bnb, rec_bnb, f1_bnb = evaluate_model(X_test,y_test.values.flatten(),bnb)


Bernoulli Naive Bayes Results : 

Accuracy  : 0.8717
Precision : 0.8846
Recall    : 0.7972
F1 Score  : 0.8387


In [119]:
## Evaluate MultinomialNB
print('Multinomial Naive Bayes Results : \n')
acc_mnb, pre_mnb, rec_mnb, f1_mnb = evaluate_model(X_test,y_test.values.flatten(),mnb)

Multinomial Naive Bayes Results : 

Accuracy  : 0.7717
Precision : 0.7426
Recall    : 0.6950
F1 Score  : 0.7180
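
The assignment asks for all four metrics under 10-fold cross-validation, whereas the cells above cross-validate F1 only and report the other metrics on a single held-out split. One way to obtain all four per fold is scikit-learn's `cross_validate` with a list of scorers. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo` are made up), so its numbers say nothing about Spambase. Substituting the real `X` and `y` is left to the reader.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in data: non-negative count features and a binary target
rng = np.random.RandomState(42)
X_demo = rng.poisson(lam=1.0, size=(200, 5)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] >= 2).astype(int)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]
models = {
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
}

summary = {}
for name, model in models.items():
    # cross_validate returns one array of fold scores per metric, keyed "test_<metric>"
    result = cross_validate(model, X_demo, y_demo, cv=cv, scoring=scoring)
    summary[name] = {metric: result[f"test_{metric}"].mean() for metric in scoring}
    print(name, {metric: round(value, 4) for metric, value in summary[name].items()})
```

Averaging over folds gives a less split-dependent estimate than a single 70/30 split, at the cost of fitting each model ten times.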


### Discussion:
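The implementation of Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python showed that the Bernoulli Naive Bayes classifier performed the best on the "Spambase Data Set" from the UCI Machine Learning Repository, with a test accuracy of 0.8717 and a mean 10-fold cross-validation F1 score of 0.8536. The Spambase features are continuous frequencies and run lengths rather than binary values, but BernoulliNB binarizes its inputs (by default at a threshold of 0.0), so each feature becomes an indicator of whether a word or character occurs in the message at all. A plausible explanation for the result is that the presence or absence of these tokens is a more robust signal than their exact frequencies. Multinomial Naive Bayes performed the worst (test accuracy 0.7717, F1 score 0.7180), possibly because the features are percentages and run lengths on very different scales rather than the raw counts the model assumes. Gaussian Naive Bayes fell in between (test accuracy 0.8181), and its normality assumption is a poor fit for skewed, non-negative features with many zeros.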

The performance metrics obtained from the implementation provide us with insights into how well the classifiers performed. Test accuracy ranged from 0.7717 (Multinomial) to 0.8717 (Bernoulli), so only the Bernoulli and Gaussian variants exceeded 80%. However, accuracy alone is not a sufficient measure of performance. Precision, recall, and F1 score provide a more comprehensive picture. Precision ranged from 0.7122 (Gaussian) to 0.8846 (Bernoulli); higher precision means fewer legitimate messages flagged as spam. Recall ranged from 0.6950 (Multinomial) to 0.9480 (Gaussian); higher recall means fewer spam messages missed. The F1 score, the harmonic mean of precision and recall, ranged from 0.7180 (Multinomial) to 0.8387 (Bernoulli).
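
How the four metrics relate to the confusion-matrix counts can be made concrete with a small sketch. The counts below are made up for illustration and are not the notebook's results; `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Made-up counts: 80 spam caught, 20 good mails flagged, 10 spam missed, 90 good mails passed
acc, pre, rec, f1 = metrics_from_counts(tp=80, fp=20, fn=10, tn=90)
print(f"Accuracy  : {acc:.4f}")
print(f"Precision : {pre:.4f}")
print(f"Recall    : {rec:.4f}")
print(f"F1 Score  : {f1:.4f}")
```

False positives lower precision and false negatives lower recall, which is why a classifier can score high on one and low on the other.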

Ranked by accuracy on the held-out test set, the Bernoulli Naive Bayes classifier performed the best with 87.17%, followed by the Gaussian Naive Bayes classifier with 81.81% and the Multinomial Naive Bayes classifier with 77.17%. The mean 10-fold cross-validation F1 scores on the training set give the same ordering: 0.8536 (Bernoulli), 0.8063 (Gaussian), and 0.7277 (Multinomial). This ordering is consistent with BernoulliNB's binarization of the continuous Spambase features into presence/absence indicators.

The per-metric results show different error profiles. Gaussian Naive Bayes had the highest recall (0.9480) but the lowest precision (0.7122), meaning it caught almost all spam at the cost of flagging many legitimate messages, which the dataset documentation describes as very undesirable. Bernoulli Naive Bayes had the highest precision (0.8846) with a more moderate recall (0.7972). Multinomial Naive Bayes had the lowest recall (0.6950) and the lowest F1 score (0.7180).

### Limitations:
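Naive Bayes classifiers assume that the features are conditionally independent of each other given the class, which may not always be the case. In addition, each variant assumes a specific per-feature distribution: Gaussian Naive Bayes assumes normally distributed features, Multinomial Naive Bayes assumes count data, and Bernoulli Naive Bayes assumes binary features. When the data do not match these assumptions, as with the skewed frequency features here, performance can suffer, and the results above show how strongly the outcome depends on the choice of variant. Another limitation is that every feature contributes its own independent factor to the posterior, so correlated or redundant features are effectively counted more than once.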

### Conclusion:
In conclusion, the implementation of Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers on the "Spambase Data Set" showed that the Bernoulli Naive Bayes classifier performed the best, followed by Gaussian and then Multinomial Naive Bayes. The most likely reason is that binarizing the frequency features into presence/absence indicators suits this data better than modelling the raw values as Gaussian or as counts. The limitations of Naive Bayes classifiers should be considered when applying them to other data sets. Future work could involve exploring classification algorithms that do not make these assumptions, transforming the features so that they better match each variant's assumed distribution, or finding ways to adapt Naive Bayes classifiers to correlated and non-normal features.