# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?
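
The question states the requested quantity directly: 40% of the employees who use the plan are smokers, so P(smoker | uses plan) = 0.40. The 70% figure is only needed for the joint probability. A minimal sketch that checks this against the definition of conditional probability (the variable names are illustrative, not from the survey):

```python
from fractions import Fraction

# Figures given in the question
p_plan = Fraction(70, 100)               # P(uses plan)
p_smoker_given_plan = Fraction(40, 100)  # P(smoker | uses plan)

# Joint probability via the product rule: P(smoker and uses plan)
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Recover the conditional from the definition P(A | B) = P(A and B) / P(B)
p_check = p_smoker_and_plan / p_plan

print(float(p_smoker_and_plan))  # 0.28
print(float(p_check))            # 0.4
```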

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Both Bernoulli Naive Bayes and Multinomial Naive Bayes are variants of the Naive Bayes algorithm used for text classification and other discrete data classification tasks. The key difference lies in the type of data they are suited for:

Bernoulli Naive Bayes: This is used when the features are binary (i.e., presence or absence of a particular feature). It's commonly used for document classification tasks, where each feature represents the presence or absence of a word in the document.

Multinomial Naive Bayes: This is used when the features represent counts (non-negative integers), such as how many times each word occurs in a document. The multinomial model formally assumes integer counts, although fractional weights such as term frequency-inverse document frequency (TF-IDF) values often work in practice.
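
The difference in input representation can be sketched with scikit-learn. The toy corpus and labels below are made up for illustration; `CountVectorizer(binary=True)` is used here as one convenient way to get presence/absence features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus, made up for illustration (1 = spam, 0 = ham)
docs = [
    "free money free money now",
    "win free prize now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
labels = [1, 1, 0, 0]

# Multinomial NB: each feature is how many times a word occurs
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
multinomial = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB: each feature is 1 if the word occurs at all, else 0
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bernoulli = BernoulliNB().fit(X_binary, labels)

new_doc = ["free free free prize"]
print(count_vec.transform(new_doc).toarray())   # "free" is counted 3 times
print(binary_vec.transform(new_doc).toarray())  # "free" is recorded as 1
print(multinomial.predict(count_vec.transform(new_doc)))
print(bernoulli.predict(binary_vec.transform(new_doc)))
```

The repeated word changes the input to the multinomial model but not to the Bernoulli model, which only sees whether the word occurred.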

# Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes assumes binary features (presence or absence of a feature) and has no built-in mechanism for missing values, so they have to be handled before the model is trained or applied. Common options are to drop the affected rows, to encode missingness as a separate category (for example, an extra indicator feature), or to impute a value. In practice, many pipelines simply convert missing values to "absent" (0).
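
A minimal sketch of the "treat missing as absent" option, assuming scikit-learn, where NaN inputs generally have to be imputed before `BernoulliNB` is fitted. The data is a toy example made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary feature matrix with missing entries (np.nan), made up for illustration
X = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
    [np.nan, 1.0, 1.0],
])
y = np.array([1, 1, 0, 0])

# Replace every missing value with 0 ("absent") before the classifier sees it
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)
model.fit(X, y)

# The same imputation is applied automatically at prediction time
X_new = np.array([[1.0, np.nan, np.nan]])
print(model.predict(X_new))
```

Putting the imputer inside the pipeline keeps training and prediction consistent. Whether "absent" is a sensible default depends on why the values are missing.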

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. The Gaussian Naive Bayes assumes that the features follow a Gaussian (normal) distribution and can handle continuous numeric features. In the context of multi-class classification, each class would have its own set of Gaussian distribution parameters (mean and variance) for each feature.

When dealing with a new instance during classification, the algorithm calculates the probability of the instance belonging to each class based on the Gaussian distributions of the features for that class. The class with the highest probability is chosen as the predicted class for the instance.
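
As a minimal sketch of the multi-class case, the example below fits scikit-learn's `GaussianNB` on the bundled three-class Iris dataset (chosen here only for illustration) and inspects the per-class means and the class probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 continuous features, 3 classes
X, y = load_iris(return_X_y=True)

model = GaussianNB().fit(X, y)

# One row of per-feature means for each class
print(model.classes_)      # [0 1 2]
print(model.theta_.shape)  # (3, 4)

# Posterior probability of each class for one sample; each row sums to 1
proba = model.predict_proba(X[:1])
print(proba.round(3))

# The predicted class is the one with the highest posterior
predicted = model.classes_[np.argmax(proba, axis=1)]
print(predicted, model.predict(X[:1]))
```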

# Q5. Assignment:

In [5]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load the dataset
data = pd.read_csv("spambase.data", header=None)


In [6]:
# Display the first few rows of the dataset
print(data.head())

# Explore the column names
print(data.columns)

# Summary statistics
print(data.describe())


     0     1     2    3     4     5     6     7     8     9   ...    48  \
0  0.00  0.64  0.64  0.0  0.32  0.00  0.00  0.00  0.00  0.00  ...  0.00   
1  0.21  0.28  0.50  0.0  0.14  0.28  0.21  0.07  0.00  0.94  ...  0.00   
2  0.06  0.00  0.71  0.0  1.23  0.19  0.19  0.12  0.64  0.25  ...  0.01   
3  0.00  0.00  0.00  0.0  0.63  0.00  0.31  0.63  0.31  0.63  ...  0.00   
4  0.00  0.00  0.00  0.0  0.63  0.00  0.31  0.63  0.31  0.63  ...  0.00   

      49   50     51     52     53     54   55    56  57  
0  0.000  0.0  0.778  0.000  0.000  3.756   61   278   1  
1  0.132  0.0  0.372  0.180  0.048  5.114  101  1028   1  
2  0.143  0.0  0.276  0.184  0.010  9.821  485  2259   1  
3  0.137  0.0  0.137  0.000  0.000  3.537   40   191   1  
4  0.135  0.0  0.135  0.000  0.000  3.537   40   191   1  

[5 rows x 58 columns]
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            

In [8]:
from sklearn.model_selection import train_test_split

# Splitting into features (X) and target (y)
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # The last column (target)


In [9]:
# Initialize classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform cross-validation and calculate metrics
classifiers = [bernoulli_nb, multinomial_nb, gaussian_nb]
classifier_names = ["Bernoulli Naive Bayes", "Multinomial Naive Bayes", "Gaussian Naive Bayes"]
metrics = ["accuracy", "precision", "recall", "f1"]

for clf, clf_name in zip(classifiers, classifier_names):
    print(f"Classifier: {clf_name}")
    for metric in metrics:
        scores = cross_val_score(clf, X, y, cv=10, scoring=metric)
        average_score = scores.mean()
        print(f"{metric.capitalize()}: {average_score:.4f}")
    print("="*30)

Classifier: Bernoulli Naive Bayes
Accuracy: 0.8839
Precision: 0.8870
Recall: 0.8152
F1: 0.8481
Classifier: Multinomial Naive Bayes
Accuracy: 0.7863
Precision: 0.7393
Recall: 0.7215
F1: 0.7283
Classifier: Gaussian Naive Bayes
Accuracy: 0.8218
Precision: 0.7104
Recall: 0.9570
F1: 0.8131
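
The loop above runs a separate 10-fold cross-validation for every metric. A possible alternative, sketched here, is `cross_validate`, which scores several metrics in a single run per classifier. The helper name `summarize_cv` and the synthetic data are assumptions for illustration only; the data is not Spambase, so the printed numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB


def summarize_cv(classifiers, X, y, cv=10):
    """Return {name: {metric: mean score}} using one CV run per classifier."""
    metrics = ["accuracy", "precision", "recall", "f1"]
    summary = {}
    for name, clf in classifiers.items():
        results = cross_validate(clf, X, y, cv=cv, scoring=metrics)
        summary[name] = {m: results[f"test_{m}"].mean() for m in metrics}
    return summary


# Synthetic stand-in data (NOT Spambase); abs() keeps the features
# non-negative, which MultinomialNB requires
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_demo = np.abs(X_demo)

demo_classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
}

demo_summary = summarize_cv(demo_classifiers, X_demo, y_demo, cv=5)
for name, scores in demo_summary.items():
    print(name, {m: round(s, 4) for m, s in scores.items()})
```

With the notebook's data, the call would be `summarize_cv(demo_classifiers, X, y, cv=10)`.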


Discussion:
Based on the results obtained, the Bernoulli Naive Bayes classifier performed best overall among the three variants. It achieved the highest accuracy (0.8839), precision (0.8870), and F1 score (0.8481). The Gaussian Naive Bayes classifier achieved the highest recall (0.9570) but the lowest precision (0.7104): it catches most spam at the cost of more false positives. The Multinomial Naive Bayes classifier had the lowest accuracy, recall, and F1 score.

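Bernoulli Naive Bayes may have performed best because of how it treats the features. The "Spambase" features are not binary: as the first rows printed above show, most columns hold continuous word and character frequencies, and the last three feature columns hold much larger capital-letter run-length statistics. scikit-learn's BernoulliNB binarizes its inputs at a threshold of 0.0 by default (binarize=0.0), so each feature is reduced to whether it is present at all. This presence/absence representation appears to suit the data better than the raw values, which are not integer counts (as Multinomial Naive Bayes assumes) and do not look normally distributed (as Gaussian Naive Bayes assumes).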
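
The role of binarization in `BernoulliNB` can be checked with a small sketch. The data below is a synthetic stand-in for sparse frequency features (it is not Spambase), and the labeling rule is made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in for frequency features (NOT Spambase): mostly zeros,
# with positive continuous values where a "word" occurs
X = rng.exponential(scale=1.0, size=(200, 5)) * (rng.random((200, 5)) < 0.3)
y = (X[:, 0] > 0).astype(int)  # label tied to presence of the first feature

# Default threshold applied to every feature before fitting
print(BernoulliNB().binarize)  # 0.0

# Fit once on the raw continuous values, once on explicit presence indicators
X_presence = (X > 0).astype(float)
clf_raw = BernoulliNB().fit(X, y)
clf_presence = BernoulliNB().fit(X_presence, y)

# Both models end up seeing the same 0/1 features, so their predictions agree
same = np.array_equal(clf_raw.predict(X), clf_presence.predict(X_presence))
print(same)  # True
```

The magnitude of a frequency therefore never reaches the Bernoulli model under the default setting; only whether it is greater than zero does.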

Limitations of Naive Bayes:
While Naive Bayes classifiers are simple and efficient, they do have some limitations:

Strong Independence Assumption: Naive Bayes assumes that features are conditionally independent given the class. This assumption rarely holds exactly in real-world data, which can reduce accuracy.
Sensitive to Feature Correlations: Because correlated features are treated as independent, their evidence is effectively counted more than once, which pushes predicted probabilities toward 0 or 1 and makes them poorly calibrated.
Lack of Tuning Flexibility: Naive Bayes has few hyperparameters to tune (mainly the smoothing strength), limiting its flexibility in model optimization.
Zero-Frequency Problem: If a feature value never occurs with a class in the training data, the unsmoothed estimate of its conditional probability is zero, which zeroes out the posterior for that class regardless of the other features. Additive (Laplace) smoothing, which scikit-learn applies by default with alpha=1.0, mitigates this.
Limited Representation: Bernoulli Naive Bayes and Multinomial Naive Bayes are designed for discrete data, so continuous features must be binarized or discretized first, which can discard information.
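
The effect of a feature value that never occurs with a class in training can be sketched with `MultinomialNB` by comparing the default smoothing with an almost unsmoothed model. The word counts below are a toy example made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word counts for two words [w0, w1], made up for illustration.
# Class 0 never contains w1 in training.
X_train = np.array([[2, 0], [3, 0], [1, 2], [0, 3]])
y_train = np.array([0, 0, 1, 1])

# A document dominated by w0, with a single occurrence of w1
X_test = np.array([[5, 1]])

smoothed = MultinomialNB(alpha=1.0).fit(X_train, y_train)     # Laplace smoothing
nearly_raw = MultinomialNB(alpha=1e-9).fit(X_train, y_train)  # almost no smoothing

print(smoothed.predict(X_test))    # [0]
print(nearly_raw.predict(X_test))  # [1]
```

With almost no smoothing, the single unseen word drives the likelihood of class 0 close to zero and overrides the five occurrences of w0. With alpha=1.0, the unseen word is penalized but does not veto the class.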
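
Conclusion:
In this analysis, the Bernoulli Naive Bayes classifier achieved the best accuracy, precision, and F1 score on the "Spambase" dataset, while Gaussian Naive Bayes achieved the best recall. A plausible explanation is that reducing each frequency feature to presence or absence suits this data better than modeling the raw values with a multinomial or Gaussian distribution. The choice of classifier therefore depends heavily on the dataset characteristics and on how well each model's assumptions fit them.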

Future work could involve experimenting with feature engineering techniques, trying different hyperparameter settings, and exploring more sophisticated classifiers to see if further performance improvements can be achieved. Additionally, investigating techniques to address the limitations of Naive Bayes, such as feature correlation handling, could lead to enhanced model performance.