# **`Naïve Bayes-2`**

`Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?`

We can use the definition of conditional probability to find the probability that an employee is a smoker given that he/she uses the health insurance plan:

Let S be the event that an employee is a smoker, and H be the event that an employee uses the health insurance plan.

We want to calculate P(S|H), the probability that an employee is a smoker given that he/she uses the health insurance plan.

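Conditional probability is defined as follows. If P(H) > 0, the probability of S given H is

P(S | H) = P(S ∩ H) / P(H)

where
* P(S ∩ H) = probability that an employee is a smoker and uses the health insurance plan
* P(H) = probability that an employee uses the health insurance plan

The data give:
- P(H) = 0.7
- P(S | H) = 0.4, because the 40% is stated for the employees who use the plan

The question therefore gives the answer directly. As a consistency check, P(S ∩ H) = P(S | H) × P(H) = 0.4 × 0.7 = 0.28, so

P(S | H) = 0.28/0.7 = 0.4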
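As a quick numerical check of the definition, the sketch below reads the 40% in the question as P(S | H), since it is stated for employees who use the plan. It derives the joint probability with the product rule and then recovers the conditional probability from it. The variable names are illustrative only.

```python
# Q1 figures: 70% of employees use the plan; 40% of plan users smoke.
p_h = 0.70           # P(H)
p_s_given_h = 0.40   # P(S | H), stated directly in the question

# Product rule: P(S ∩ H) = P(S | H) * P(H)
p_s_and_h = p_s_given_h * p_h

# The definition of conditional probability recovers P(S | H)
recovered = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S | H)   = {recovered:.2f}")
```

Dividing the 40% by P(H) again would treat it as a joint probability, which is not what the question states.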



`Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?`

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm, which are used for text classification and other applications.

The main difference between these two algorithms lies in the way they model the data.

In Bernoulli Naive Bayes, the data is binary, i.e., it takes on values of 0 or 1. For example, in text classification, a document can be represented as a binary vector where each element represents the presence or absence of a word in the document. Bernoulli Naive Bayes assumes that each feature (i.e., word) is conditionally independent given the class label, and models the data as a set of binary variables.

In Multinomial Naive Bayes, the data is represented as a count of occurrences of each feature. For example, in text classification, a document can be represented as a vector of word counts. Multinomial Naive Bayes assumes that the data follows a multinomial distribution, and models the data as a set of count variables.

To summarize:

1. Bernoulli Naive Bayes is used when the data is binary (0/1).
2. Multinomial Naive Bayes is used when the data is a count of occurrences.
3. Bernoulli Naive Bayes models the data as a set of binary variables.
4. Multinomial Naive Bayes models the data as a set of count variables.
5. Both algorithms assume that the features are conditionally independent given the class label.
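The difference between the two likelihoods can be made concrete with a small hand-written sketch. The vocabulary, the per-class probabilities, and the function names below are made-up assumptions for illustration; they are not taken from any dataset or library.

```python
import math

# Hypothetical parameters of one class for a three-word vocabulary.
vocab = ["free", "meeting", "offer"]
p_present = [0.8, 0.1, 0.6]  # Bernoulli: P(word appears in a document)
p_token = [0.5, 0.1, 0.4]    # Multinomial: P(a token is this word), sums to 1

def bernoulli_log_likelihood(counts, p_present):
    """Uses only presence/absence; an absent word contributes log(1 - p)."""
    total = 0.0
    for c, p in zip(counts, p_present):
        total += math.log(p) if c > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, p_token):
    """Uses counts; each occurrence adds log(p).

    The multinomial coefficient is omitted because it does not depend
    on the class and so does not affect the comparison between classes.
    """
    return sum(c * math.log(p) for c, p in zip(counts, p_token))

doc_a = [3, 0, 1]  # "free" three times, "offer" once
doc_b = [1, 0, 1]  # same words present, different counts

b_a = bernoulli_log_likelihood(doc_a, p_present)
b_b = bernoulli_log_likelihood(doc_b, p_present)
m_a = multinomial_log_likelihood(doc_a, p_token)
m_b = multinomial_log_likelihood(doc_b, p_token)

print("Bernoulli:  ", b_a, b_b)
print("Multinomial:", m_a, m_b)
```

The two documents contain the same words, so the Bernoulli likelihood is identical for both, while the multinomial likelihood changes with the counts.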

`Q3. How does Bernoulli Naive Bayes handle missing values?`

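Bernoulli Naive Bayes assumes that the input data is a binary feature vector, where each feature is either present (1) or absent (0). In principle the algorithm can handle missing data, because attributes are treated separately at both model construction time and prediction time. If a data instance has a missing value for an attribute, that attribute can be left out of the counts when the model is fitted and left out of the product when the probability of a class is calculated. This is a property of the algorithm rather than of every implementation: scikit-learn's `BernoulliNB` rejects input that contains NaN, so with that library missing values have to be imputed or the affected rows dropped before fitting.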
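The idea of ignoring a missing attribute at prediction time can be sketched in a few lines. This is an illustration of the principle only, not of how scikit-learn behaves; the class names, priors, and feature probabilities are hypothetical, and `None` is an assumed marker for a missing value.

```python
import math

def bernoulli_log_posterior(x, log_prior, p_feature):
    """Unnormalised log P(class | x), skipping features whose value is None."""
    total = log_prior
    for value, p in zip(x, p_feature):
        if value is None:  # missing: contributes no factor for any class
            continue
        total += math.log(p) if value == 1 else math.log(1.0 - p)
    return total

# Hypothetical parameters: (log prior, P(feature = 1 | class)) per class
params = {
    "spam": (math.log(0.4), [0.8, 0.3, 0.6]),
    "ham": (math.log(0.6), [0.1, 0.5, 0.2]),
}

x = [1, None, 0]  # the second feature is missing

scores = {c: bernoulli_log_posterior(x, lp, pf) for c, (lp, pf) in params.items()}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because the missing feature is skipped for every class, the comparison between classes rests only on the observed features.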

`Q4. Can Gaussian Naive Bayes be used for multi-class classification?`

Yes, Gaussian Naive Bayes can be used for multi-class classification. In Gaussian Naive Bayes, each feature is assumed to follow a Gaussian (normal) distribution, and the class-conditional probability density function is estimated for each class. Given a new input, the model calculates the probability of the input belonging to each class, and assigns the input to the class with the highest probability.

For multi-class classification, the same approach is used, but with more than two classes. Specifically, for a dataset with K classes, the model estimates K class-conditional Gaussian distributions for each feature, and uses Bayes' theorem to calculate the probability of each class given the input. The class with the highest probability is then assigned as the predicted class label.

Note that while Gaussian Naive Bayes can be used for multi-class classification, it may not always be the best choice, especially if the classes are not well-separated or if there are complex dependencies between the features. In such cases, other algorithms such as decision trees or neural networks may be more effective.
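To show that nothing changes when K is larger than two, here is a minimal pure-Python sketch of Gaussian Naive Bayes on a three-class toy problem. The data values, the function names, and the small variance floor are assumptions made for the illustration; it is not scikit-learn's implementation.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda c: log_posterior(*model[c]))

# Toy data with three classes (values are made up)
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
print(predict(model, [5.0, 5.0]))
print(predict(model, [9.1, 1.0]))
```

The prediction step is a single `max` over however many classes were seen during fitting, so the binary case is just K = 2.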

`Q5. Assignment:`

`Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/
datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.`

`Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.`

`Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score`

`Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?`

`Conclusion:`
`Summarise your findings and provide some suggestions for future work.`

In [51]:
import pandas as pd
import numpy as np
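# spambase.data has no header row. Without header=None pandas uses the first
# record as column names, which is why the outputs below show 4600 rows and
# numeric column names. Pass header=None to keep that record as data.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data")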

In [5]:
df.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.41,0.42,0.43,0.778,0.44,0.45,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 58 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       4600 non-null   float64
 1   0.64    4600 non-null   float64
 2   0.64.1  4600 non-null   float64
 3   0.1     4600 non-null   float64
 4   0.32    4600 non-null   float64
 5   0.2     4600 non-null   float64
 6   0.3     4600 non-null   float64
 7   0.4     4600 non-null   float64
 8   0.5     4600 non-null   float64
 9   0.6     4600 non-null   float64
 10  0.7     4600 non-null   float64
 11  0.64.2  4600 non-null   float64
 12  0.8     4600 non-null   float64
 13  0.9     4600 non-null   float64
 14  0.10    4600 non-null   float64
 15  0.32.1  4600 non-null   float64
 16  0.11    4600 non-null   float64
 17  1.29    4600 non-null   float64
 18  1.93    4600 non-null   float64
 19  0.12    4600 non-null   float64
 20  0.96    4600 non-null   float64
 21  0.13    4600 non-null   float64
 22  

In [7]:
#splitting the dataset into independent and dependent features
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [9]:
#confirming the split
X.shape,y.shape

((4600, 57), (4600,))

In [11]:
#train test splitting the data
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y , test_size=0.2, random_state=42)

In [13]:
X_train.shape , X_test.shape

((3680, 57), (920, 57))

In [14]:
#BernoulliNB
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB()

In [38]:
##Cross validation
parameters = {
    "alpha" : [1.0],
    "force_alpha" : [False] ,
    "binarize" : [0],
    "fit_prior" : [True ],
    "class_prior" : [None]
}

In [39]:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(classifier ,param_grid=parameters,scoring="accuracy" , cv = 10 , verbose= 3)

In [40]:
clf.fit(X_train,y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV 1/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.883 total time=   0.0s
[CV 2/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.856 total time=   0.0s
[CV 3/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.913 total time=   0.0s
[CV 4/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.870 total time=   0.0s
[CV 5/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.902 total time=   0.0s
[CV 6/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.880 total time=   0.0s
[CV 7/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.897 total time=   0.0s
[CV 8/10] END alpha=1.0, binarize=0, class_prior=None, fit_prior=True, force_alpha=Fa

In [42]:
y_pred_BNB = clf.predict(X_test)

In [43]:
from sklearn.metrics import classification_report
print("BernoulliNB classification report")
print(classification_report(y_test , y_pred_BNB))

BernoulliNB classification report
              precision    recall  f1-score   support

           0       0.86      0.93      0.89       530
           1       0.89      0.79      0.84       390

    accuracy                           0.87       920
   macro avg       0.88      0.86      0.87       920
weighted avg       0.87      0.87      0.87       920



In [44]:
#Multinomial
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()

In [45]:
##Cross validation
parameters = {
    "alpha" : [1.0],
    "force_alpha" : [False],
    "fit_prior" : [True ],
    "class_prior" : [None]
}

In [46]:
clf = GridSearchCV(classifier ,param_grid=parameters,scoring="accuracy" , cv = 10 , verbose= 3)

In [47]:
clf.fit(X_train,y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV 1/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.777 total time=   0.0s
[CV 2/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.818 total time=   0.0s
[CV 3/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.818 total time=   0.0s
[CV 4/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.802 total time=   0.0s
[CV 5/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.818 total time=   0.0s
[CV 6/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.799 total time=   0.0s
[CV 7/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.818 total time=   0.0s
[CV 8/10] END alpha=1.0, class_prior=None, fit_prior=True, force_alpha=False;, score=0.739 total time=   0.0s
[CV 9/10] END alpha=1.0, class_prior=None, fit_prior=True, 

In [48]:
y_pred_MNB = clf.predict(X_test)

In [49]:
print("Multinomial classification report")
print(classification_report(y_test , y_pred_MNB))

Multinomial classification report
              precision    recall  f1-score   support

           0       0.78      0.83      0.80       530
           1       0.75      0.68      0.71       390

    accuracy                           0.77       920
   macro avg       0.76      0.76      0.76       920
weighted avg       0.77      0.77      0.76       920



In [50]:
#Gaussian NB
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()

In [52]:
##Cross validation
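parameters = {
    "priors" : [None],
    # np.e**(-9) is about 1.23e-4, not scikit-learn's default var_smoothing of 1e-9;
    # use 1e-9 here to run with the default, as the assignment asks
    "var_smoothing" :[np.e**(-9)]
}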

In [53]:
clf = GridSearchCV(classifier ,param_grid=parameters,scoring="accuracy" , cv = 10 , verbose= 3)

In [54]:
clf.fit(X_train,y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV 1/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.658 total time=   0.0s
[CV 2/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.666 total time=   0.0s
[CV 3/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.688 total time=   0.0s
[CV 4/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.688 total time=   0.0s
[CV 5/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.668 total time=   0.0s
[CV 6/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.707 total time=   0.0s
[CV 7/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.666 total time=   0.0s
[CV 8/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.663 total time=   0.0s
[CV 9/10] END priors=None, var_smoothing=0.00012340980408667962;, score=0.652 total time=   0.0s
[CV 10/10] END priors=None, var_smoothing=0.00012340980408667962;,

In [55]:
y_pred_GNB = clf.predict(X_test)

In [58]:
print("GaussianNB classification report")
print(classification_report(y_test , y_pred_GNB))

Note: the output recorded for this cell was produced with `y_pred_MNB` passed in place of `y_pred_GNB`, so it repeated the Multinomial report above line for line. The Gaussian test-set figures are therefore not available from this run.



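Final scores (test set; precision, recall and F1 are for the spam class):
|model|accuracy|precision|recall|f1 score|
|-----|--------|---------|------|--------|
|BernoulliNB|87%|89%|79%|84%|
|MultinomialNB|77%|75%|68%|71%|
|GaussianNB|not available|not available|not available|not available|

The GaussianNB row cannot be filled in from this run: the report cell passed `y_pred_MNB`, so the figures it printed were the Multinomial ones. The nine cross-validation folds shown for GaussianNB scored between 0.652 and 0.707 accuracy.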
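The assignment asks for accuracy, precision, recall and F1 under 10-fold cross-validation. One way to get all four in a single pass is scikit-learn's `cross_validate` with a list of scorers, sketched below. The data here is a synthetic stand-in (Poisson counts with made-up rates) so that the example runs offline; `X_demo`, `y_demo` and the rates are assumptions, and the printed numbers say nothing about Spambase.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic count data: two classes with different per-feature rates.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=500)
rates = np.where(y_demo[:, None] == 1,
                 [2.0, 0.2, 1.0, 0.5],   # hypothetical rates for class 1
                 [0.3, 1.5, 1.0, 0.5])   # hypothetical rates for class 0
X_demo = rng.poisson(rates)

scoring = ["accuracy", "precision", "recall", "f1"]
models = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=10, scoring=scoring)
    # cross_validate returns one array per metric under "test_<metric>"
    results[name] = {m: float(cv[f"test_{m}"].mean()) for m in scoring}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```

Passing the full `X` and `y` from the notebook instead of the synthetic arrays would give cross-validated estimates of all four metrics with the default hyperparameters, without a separate train/test split.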

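Discussion: 
The Bernoulli Naive Bayes model had the highest test accuracy (87%, against 77% for Multinomial) and also the highest precision, recall and F1 score for the spam class. Its cross-validation fold scores (0.856 to 0.913 in the folds shown) were likewise above those of Multinomial (0.739 to 0.818) and Gaussian (0.652 to 0.707). The reason is not that the target is binary, since all three variants handle a spam (1) / not spam (0) label equally well. What differs is how each variant models the features. Most Spambase features are word and character frequencies that are zero for the majority of emails, and with `binarize=0` Bernoulli NB reduces each of them to "appears / does not appear", a representation that is probably more robust to their skewed values. Multinomial NB is designed for counts, whereas here the inputs are percentages mixed with capital-run-length statistics on a much larger scale, which can dominate the likelihood. Gaussian NB assumes that each feature is normally distributed within a class, which likely fits these zero-heavy features poorly. It was also run with `var_smoothing = np.e**(-9)` (about 1.2e-4) rather than the default `1e-9`.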

While Naive Bayes is a powerful and widely used classification algorithm, it also has some limitations:

1. **`Strong independence assumption`**: Naive Bayes assumes that all features are conditionally independent given the class, which is often not the case in real-world data. This can lead to suboptimal performance and overconfident probabilities if there are correlations or dependencies between features.

2. **`Sensitivity to irrelevant features`**: Naive Bayes can be sensitive to irrelevant or redundant features. Every feature contributes a factor to the posterior and the algorithm has no mechanism for down-weighting a feature, so noisy features add noise and correlated features are effectively counted more than once.

3. **`Limited expressive power`**: Bernoulli and Multinomial Naive Bayes produce linear decision boundaries in log space, and Gaussian Naive Bayes with per-class variances produces at most quadratic ones. None of the variants can represent interactions between features, which can be a limitation when dealing with complex datasets.

4. **`Limited data`**: Naive Bayes requires a sufficient amount of data to estimate the class-conditional probabilities accurately. If the amount of training data is limited, the estimates are noisy, and a feature value never seen with a class gets zero probability unless smoothing (the `alpha` parameter) is applied.

5. **`Handling of continuous data`**: The Bernoulli and Multinomial variants assume that the input features are binary or counts. Handling continuous data requires either discretizing or binarizing the features, which discards information, or modeling them with a parametric density as Gaussian Naive Bayes does, which performs poorly when the features are far from normally distributed.

Despite these limitations, Naive Bayes remains a popular and effective classification algorithm, particularly for high-dimensional datasets with sparse features. Its simplicity, efficiency, and ease of implementation make it a good choice for many real-world applications.
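The effect of the independence assumption on correlated features can be seen with a small odds calculation. In the sketch below each feature multiplies the prior odds by its likelihood ratio, which is what the naive assumption implies; the function name and the ratio of 3 are hypothetical values chosen for illustration.

```python
def posterior_spam(likelihood_ratios, prior_spam=0.5):
    """P(spam | features) when every feature multiplies the odds independently."""
    odds = prior_spam / (1.0 - prior_spam)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# One feature that is three times as likely under spam as under not-spam
single = posterior_spam([3.0])

# The same feature entered twice (a perfectly correlated copy):
# no new evidence is added, but Naive Bayes counts it again
duplicated = posterior_spam([3.0, 3.0])

print(single, duplicated)
```

The duplicated feature raises the posterior even though it carries no additional information, which is how correlated features lead to overconfident predictions.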

Conclusion: 

1. Bernoulli NB had the highest scores on every reported metric, ahead of Multinomial NB on the test set and ahead of both Multinomial and Gaussian NB in cross-validation. The likely reason is that presence/absence features suit this dataset, not that the target variable is binary.
2. Bernoulli NB assumes that the features are binary, so applying it to continuous data means binarizing them first. This discards the magnitude of each value and may not yield the best results.
3. Bernoulli NB does not attain very high accuracy on the roughly 4,600 observations available. More data may help to develop better models, although the independence assumption is likely to remain a limiting factor.
4. The models were trained with a single fixed parameter setting, so to improve accuracy we can try out different parameters for our models during cross-validation. The Gaussian model should also be rerun with the default `var_smoothing` and its test-set metrics reported.
5. Logistic regression is well suited to situations where the features are continuous and the outcome is binary, and this can be tested in the future.
6. A random forest classifier may also be suitable for our problem statement, and this can also be tried in the future.