# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

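The probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40 (40%). The question states this conditional probability directly: 40% of the employees who use the plan are smokers. The 70% figure is the marginal probability of using the plan and is not needed to answer the question.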
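
The same value falls out of the definition of conditional probability, P(smoker | plan) = P(smoker and plan) / P(plan). A minimal sketch, with illustrative variable names, in which the 70% figure only serves to form the joint probability:

```python
# Given in the question
p_plan = 0.70               # P(uses plan)
p_smoker_given_plan = 0.40  # P(smoker | uses plan)

# Joint probability of using the plan AND being a smoker
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Dividing the joint by P(plan) recovers the conditional probability
p_recovered = p_smoker_and_plan / p_plan

print(f"P(smoker and plan) = {p_smoker_and_plan:.2f}")
print(f"P(smoker | plan)   = {p_recovered:.2f}")
```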

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

1. **Bernoulli Naive Bayes**:
   - Bernoulli Naive Bayes is suitable for features that are binary or Boolean (i.e., they take on values of either 0 or 1).
   - It assumes a Bernoulli distribution for the features, where each feature is considered as a binary random variable indicating the presence (1) or absence (0) of a particular term or attribute.
   - Bernoulli Naive Bayes is commonly used in text classification tasks, where features represent the presence or absence of specific words in documents or texts.

2. **Multinomial Naive Bayes**:
   - Multinomial Naive Bayes is appropriate for features that represent counts or frequencies of events.
   - It assumes a multinomial distribution for the features, where each feature represents the frequency of occurrence of a term or attribute in a document or sample.
   - Multinomial Naive Bayes is commonly used in text classification tasks, where features represent the frequency of words or tokens in documents, such as bag-of-words representations.
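
The difference can be illustrated on a toy word-count matrix. This is only a sketch: the counts, labels, and variable names are made up. It relies on scikit-learn's `BernoulliNB` thresholding features at `binarize=0.0` by default, so the model only sees presence or absence, whereas `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (rows = documents, columns = hypothetical vocabulary terms)
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
labels = np.array([1, 1, 0, 0])

# Fit BernoulliNB once on raw counts and once on explicit 0/1 indicators
bnb_counts = BernoulliNB().fit(X_counts, labels)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), labels)

# Both fits learn the same parameters because counts are binarized internally
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

# MultinomialNB estimates per-class term probabilities from the actual counts
mnb = MultinomialNB().fit(X_counts, labels)

print("BernoulliNB ignores count magnitude:", same_model)
print("MultinomialNB per-class term log-probabilities:")
print(mnb.feature_log_prob_)
```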

# Q3. How does Bernoulli Naive Bayes handle missing values?

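Bernoulli Naive Bayes has no built-in mechanism for missing values (scikit-learn's `BernoulliNB`, for example, rejects input containing NaN), so they have to be handled before training. Common approaches are:

1. **Treat missing values as a separate category**: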
   - One approach is to treat missing values as a distinct category or state. This means that for each feature with missing values, a new category representing the absence of any observed value is created.
   - When making predictions for instances with missing values, the classifier considers the presence or absence of each feature, including the missing values as a separate category.

2. **Imputation with a specific value**:
   - Another approach is to impute missing values with a specific value, such as 0 or 1, depending on the context of the problem.
   - For example, in a binary feature representing the presence or absence of a term in a document, missing values could be imputed with 0 to indicate the absence of the term.

3. **Use of other imputation methods**:
   - Alternatively, more sophisticated imputation methods can be used to estimate missing values from the observed data. For binary features, most-frequent (mode) imputation is the natural simple choice, since mean or median imputation can produce values that are not 0 or 1. More complex techniques such as k-nearest neighbors (KNN) imputation are also possible.
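
As a rough sketch of the imputation route, the pipeline below fills each column with its most frequent value before fitting `BernoulliNB`. The data and variable names are invented for illustration, and `most_frequent` is just one reasonable strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary data with missing entries (np.nan); values are made up
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, np.nan],
])
labels = np.array([1, 1, 0, 0, 1, 0])

# Impute each column with its mode, then fit Bernoulli Naive Bayes
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, labels)

# The same imputation is applied automatically at prediction time
predictions = model.predict(X_missing)
print(predictions)
```

Keeping the imputer inside the pipeline means it is re-fitted on each training fold during cross-validation, which avoids leaking information from held-out data.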

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. It is the variant of Naive Bayes that models each continuous feature with a Gaussian (normal) distribution within each class. Naive Bayes is not restricted to two classes: Bayes' rule is applied to every class in the same way, so no special adaptation is needed for multi-class tasks.

In multi-class classification, Gaussian Naive Bayes estimates a separate set of feature distributions for each class from the training data. When a new instance is presented for classification, the algorithm computes, for each class, a posterior probability that is proportional to the class prior multiplied by the likelihood of the observed feature values under that class. The class with the highest posterior probability is chosen as the predicted class.

Training estimates the parameters of the Gaussian distributions (mean and variance) for every class-feature pair, along with the prior probability of each class. These parameters define the probability density function (PDF) of each feature given each class. Under the conditional independence assumption, the likelihood of all features given a class is the product of the per-feature PDFs. Multiplying this product by the class prior gives a score proportional to the posterior, and the class with the highest score is predicted.
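
The procedure can be sketched from scratch in NumPy for three classes. This is an illustrative implementation on synthetic data, not scikit-learn's `GaussianNB` (which additionally applies variance smoothing); the names `X_toy`, `y_toy`, and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-class data: well-separated Gaussian blobs (illustrative only)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X_toy = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y_toy = np.repeat([0, 1, 2], 50)

# Per-class parameters: feature means, feature variances, and log priors
classes = np.unique(y_toy)
means = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])
variances = np.array([X_toy[y_toy == c].var(axis=0) for c in classes])
log_priors = np.log(np.array([(y_toy == c).mean() for c in classes]))

def predict(X):
    # Log Gaussian density for every (sample, class, feature) combination
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    # Summing logs over features equals the product of PDFs; then add the log prior
    log_posterior = log_pdf.sum(axis=2) + log_priors[None, :]
    return classes[np.argmax(log_posterior, axis=1)]

accuracy = (predict(X_toy) == y_toy).mean()
print(f"Training accuracy on 3 classes: {accuracy:.3f}")
```

Working in log space avoids numerical underflow when many small densities are multiplied together.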

# Q5. Assignment:

# Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains features extracted from email messages, where the goal is to predict whether a message is spam or not based on several input features.

In [1]:
pip install ucimlrepo

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.6-py3-none-any.whl.metadata (5.3 kB)
Downloading ucimlrepo-0.0.6-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.6


In [130]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 


In [131]:
X

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78


In [132]:
y

Unnamed: 0,Class
0,1
1,1
2,1
3,1
4,1
...,...
4596,0
4597,0
4598,0
4599,0


In [133]:
import numpy as np

y = np.ravel(y)

In [134]:
y

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

# Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

In [135]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

In [136]:
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [137]:
cv_scores_bernoulli = cross_val_score(bernoulli_nb, X, y, cv=10)
cv_scores_multinomial = cross_val_score(multinomial_nb, X, y, cv=10)
cv_scores_gaussian = cross_val_score(gaussian_nb, X, y, cv=10)

In [138]:
print(cv_scores_bernoulli.mean())
print(cv_scores_multinomial.mean())
print(cv_scores_gaussian.mean())

0.8839380364047911
0.7863496180326323
0.8217730830896915
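
For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, and its default score is accuracy. The sketch below makes the splitter explicit and collects several metrics in one pass with `cross_validate`. It runs on synthetic stand-in data rather than Spambase, and the shuffling and `random_state` are added choices, so its numbers are not comparable to the ones above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in data (not Spambase): sparse, non-negative features
# that are more often non-zero for class 1 than for class 0
y_demo = rng.integers(0, 2, size=400)
present = rng.random((400, 10)) < (0.3 + 0.4 * y_demo[:, None])
X_demo = rng.random((400, 10)) * present

# Explicit 10-fold stratified splitter; shuffling is an added choice here
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

metric_names = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(BernoulliNB(), X_demo, y_demo, cv=cv, scoring=metric_names)

# Report the mean of each metric across the 10 folds
for metric in metric_names:
    print(f"{metric}: {results['test_' + metric].mean():.3f}")
```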


# Results:
Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

In [118]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [145]:
y_pred_bernoulli = cross_val_predict(bernoulli_nb, X, y, cv=10)
y_pred_multinomial = cross_val_predict(multinomial_nb, X, y, cv=10)
y_pred_gaussian = cross_val_predict(gaussian_nb, X, y, cv=10)

In [153]:
print(accuracy_score(y, y_pred_bernoulli))
print(metrics.precision_score(y, y_pred_bernoulli))
print(metrics.recall_score(y, y_pred_bernoulli))
print(metrics.f1_score(y, y_pred_bernoulli))

0.8839382742881983
0.8813357185450209
0.815223386651958
0.8469914040114613
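
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above. A small sketch using those reported numbers:

```python
# Values reported above for Bernoulli Naive Bayes (positive class = spam)
precision = 0.8813357185450209
recall = 0.815223386651958

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The result should agree with the reported F1 of roughly 0.847.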


In [156]:
print(classification_report(y,y_pred_bernoulli))
print(accuracy_score(y,y_pred_bernoulli))

              precision    recall  f1-score   support

           0       0.89      0.93      0.91      2788
           1       0.88      0.82      0.85      1813

    accuracy                           0.88      4601
   macro avg       0.88      0.87      0.88      4601
weighted avg       0.88      0.88      0.88      4601

0.8839382742881983


In [154]:
print(classification_report(y,y_pred_multinomial))
print(accuracy_score(y,y_pred_multinomial))

              precision    recall  f1-score   support

           0       0.82      0.83      0.82      2788
           1       0.73      0.72      0.73      1813

    accuracy                           0.79      4601
   macro avg       0.78      0.78      0.78      4601
weighted avg       0.79      0.79      0.79      4601

0.786350793305803


In [155]:
print(classification_report(y,y_pred_gaussian))
print(accuracy_score(y,y_pred_gaussian))

              precision    recall  f1-score   support

           0       0.96      0.73      0.83      2788
           1       0.70      0.96      0.81      1813

    accuracy                           0.82      4601
   macro avg       0.83      0.85      0.82      4601
weighted avg       0.86      0.82      0.82      4601

0.8217778743751358


# Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

Based on the provided results:

1. **Bernoulli Naive Bayes**:
   - Accuracy: 0.8839
   - Precision (class 0): 0.89
   - Recall (class 0): 0.93
   - F1-score (class 0): 0.91
   - Precision (class 1): 0.88
   - Recall (class 1): 0.82
   - F1-score (class 1): 0.85

2. **Multinomial Naive Bayes**:
   - Accuracy: 0.7864
   - Precision (class 0): 0.82
   - Recall (class 0): 0.83
   - F1-score (class 0): 0.82
   - Precision (class 1): 0.73
   - Recall (class 1): 0.72
   - F1-score (class 1): 0.73

3. **Gaussian Naive Bayes**:
   - Accuracy: 0.8218
   - Precision (class 0): 0.96
   - Recall (class 0): 0.73
   - F1-score (class 0): 0.83
   - Precision (class 1): 0.70
   - Recall (class 1): 0.96
   - F1-score (class 1): 0.81

Based on accuracy, Bernoulli Naive Bayes performed the best with an accuracy of 0.8839, followed by Gaussian Naive Bayes with an accuracy of 0.8218, and Multinomial Naive Bayes with an accuracy of 0.7864.

**Reasons for Performance:**
- **Bernoulli Naive Bayes**: The Spambase features are not binary; they are continuous word and character frequencies plus capital-run-length statistics. However, scikit-learn's `BernoulliNB` thresholds every feature at `binarize=0.0` by default, so each feature is reduced to whether a word or character occurs at all. Its strong result suggests that the presence or absence of these terms carries most of the signal for spam detection.
- **Multinomial Naive Bayes**: This variant models features as counts of occurrences. Here the word and character features are fractional frequencies, and the `capital_run_length_*` columns take far larger values (hundreds to thousands in the rows shown), so a few large-valued columns likely dominate the likelihood. This mismatch is a plausible reason for its lower accuracy.
- **Gaussian Naive Bayes**: This variant assumes that each feature follows a Gaussian distribution within each class. Many of the features appear to be zero for most emails and heavily skewed, which a normal distribution models poorly. It reached high recall for spam (0.96) but low precision (0.70), meaning it labels many legitimate emails as spam. Dependencies between features, which violate the conditional independence assumption, may also contribute.

**Limitations of Naive Bayes**:
- **Strong Independence Assumption**: Naive Bayes assumes that features are conditionally independent given the class, which might not hold true in real-world datasets.
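- **Sensitivity to Feature Correlations**: When features are correlated, Naive Bayes effectively counts the same evidence more than once. This can make its predicted probabilities overconfident and can degrade accuracy.
- **Zero Frequency Problem**: In Multinomial Naive Bayes, a term that never occurs with a class in the training data would get a zero probability without smoothing, which zeroes out the whole product for that class. Additive (Laplace) smoothing avoids this; scikit-learn applies it by default through `alpha=1.0`.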
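
The effect of additive smoothing on the zero frequency problem can be shown with a short calculation. This is a sketch with hypothetical term counts for a single class; `smoothed_probs` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical term counts for one class; the second term never occurs
counts = np.array([3.0, 0.0, 1.0])

def smoothed_probs(counts, alpha):
    # Additive smoothing: (count + alpha) / (total + alpha * number_of_terms)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

unsmoothed = smoothed_probs(counts, alpha=0.0)
smoothed = smoothed_probs(counts, alpha=1.0)

print("alpha=0:", unsmoothed)  # the unseen term gets probability 0
print("alpha=1:", smoothed)    # every term gets a non-zero probability
```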

# Conclusion:
 Summarise your findings and provide some suggestions for future work.

- Bernoulli Naive Bayes achieved the highest accuracy among the three classifiers, followed by Gaussian Naive Bayes and then Multinomial Naive Bayes.
- Bernoulli Naive Bayes performed well, likely because reducing each feature to presence or absence (its default binarization) suits this dataset, even though the raw features are not binary.
- Multinomial Naive Bayes showed relatively lower performance, suggesting that the multinomial assumption might not be the best fit for the dataset.
- Gaussian Naive Bayes performed reasonably well but not as well as Bernoulli Naive Bayes, indicating that the Gaussian distribution assumption might not perfectly model the data.

Suggestions for future work:

1. **Feature Engineering**: Explore additional feature engineering techniques to extract more informative features from the dataset, which could potentially improve the performance of all classifiers.
  
2. **Algorithm Tuning**: Experiment with hyperparameter tuning for each variant of Naive Bayes to optimize their performance further. Techniques like grid search or randomized search can be employed for this purpose.

3. **Ensemble Methods**: Investigate ensemble learning techniques such as bagging, boosting, or stacking, which combine multiple classifiers to improve overall performance.

4. **Model Evaluation**: Apart from accuracy, precision, recall, and F1-score, consider evaluating the models using other metrics such as ROC curve and AUC score to gain a more comprehensive understanding of their performance.

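5. **Cross-Validation Strategies**: `cross_val_score` with an integer `cv` already uses stratified folds for classifiers, so explore alternatives such as shuffled or repeated stratified k-fold, or nested cross-validation when tuning hyperparameters, to ensure robust evaluation of the models.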

6. **Data Preprocessing**: Evaluate the impact of different data preprocessing techniques, such as normalization, scaling, or handling of imbalanced classes, on the performance of the classifiers.
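
As a starting point for the tuning and evaluation suggestions, the sketch below runs a small grid search over `BernoulliNB`'s `alpha` and `binarize` parameters, scored by ROC AUC. It uses synthetic stand-in data rather than Spambase, and the grid values are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in data (not Spambase): sparse, non-negative features
y_demo = rng.integers(0, 2, size=400)
present = rng.random((400, 10)) < (0.3 + 0.4 * y_demo[:, None])
X_demo = rng.random((400, 10)) * present

# Illustrative grid; these values are arbitrary starting points
param_grid = {"alpha": [0.01, 0.1, 1.0], "binarize": [0.0, 0.1, 0.5]}

search = GridSearchCV(
    BernoulliNB(),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best mean ROC AUC: {search.best_score_:.3f}")
```

For an unbiased performance estimate, a search like this would normally be wrapped in an outer cross-validation loop (nested cross-validation).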