# **Naïve Bayes-2**

### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

**Answer:**
We need to find \( P(\text{Smoker} | \text{Uses Plan}) \).

Given:
- \( P(\text{Uses Plan}) = 0.70 \)
- \( P(\text{Smoker} | \text{Uses Plan}) = 0.40 \)

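The question states this conditional probability directly: 40% of the employees who use the plan are smokers. The 70% figure, \( P(\text{Uses Plan}) \), is not needed for the answer. The probability that an employee is a smoker given that he/she uses the health insurance plan is therefore 0.40, or 40%.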
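As a quick check of the reasoning above, the following sketch uses exact fractions to compute the joint probability with the product rule and then recovers the conditional probability from its definition. The variable names are illustrative only.

```python
from fractions import Fraction

p_plan = Fraction(70, 100)               # P(Uses Plan), given in the question
p_smoker_given_plan = Fraction(40, 100)  # P(Smoker | Uses Plan), given in the question

# Product rule: P(Smoker and Uses Plan) = P(Smoker | Uses Plan) * P(Uses Plan)
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Definition of conditional probability: P(A | B) = P(A and B) / P(B)
recovered = p_smoker_and_plan / p_plan

print(float(p_smoker_and_plan))  # 0.28
print(float(recovered))          # 0.4
```

Dividing the joint probability by \( P(\text{Uses Plan}) \) returns the 0.40 that the question already supplied, which is why the 70% figure does not change the answer.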

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

**Answer:**
- **Bernoulli Naive Bayes:**
  - Suitable for binary/Boolean data (features are either 0 or 1).
  - Models the presence or absence of features.
  - Each feature is treated as an independent Bernoulli variable.
  - Typically used in text classification tasks where the focus is on whether a particular word occurs in a document.

- **Multinomial Naive Bayes:**
  - Suitable for discrete data (features are counts or frequencies).
  - Models the occurrence counts of features.
  - The feature vector of each class is modelled as counts drawn from a single multinomial distribution over the features.
  - Commonly used in text classification tasks where the frequency of words matters.
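The difference between the two likelihoods can be illustrated with a small pure-Python sketch. The per-class probabilities below are made-up values for a three-word vocabulary, not parameters learned from any dataset, and the helper function names are hypothetical.

```python
import math

# Hypothetical parameters for one class and a 3-word vocabulary (illustrative only).
p_present = [0.8, 0.1, 0.5]  # Bernoulli: P(word appears at least once | class)
p_word = [0.6, 0.1, 0.3]     # Multinomial: P(a token is this word | class), sums to 1


def bernoulli_log_likelihood(counts, p_present):
    """Score presence/absence only; an absent word contributes log(1 - p)."""
    total = 0.0
    for count, p in zip(counts, p_present):
        total += math.log(p) if count > 0 else math.log(1.0 - p)
    return total


def multinomial_log_likelihood(counts, p_word):
    """Score every occurrence. The multinomial coefficient is omitted
    because it does not depend on the class."""
    return sum(count * math.log(p) for count, p in zip(counts, p_word))


doc_once = [1, 0, 1]      # first word appears once
doc_repeated = [3, 0, 1]  # first word appears three times

bern_once = bernoulli_log_likelihood(doc_once, p_present)
bern_repeated = bernoulli_log_likelihood(doc_repeated, p_present)
multi_once = multinomial_log_likelihood(doc_once, p_word)
multi_repeated = multinomial_log_likelihood(doc_repeated, p_word)

print(bern_once == bern_repeated)    # repetition is invisible to the Bernoulli model
print(multi_once == multi_repeated)  # repetition changes the multinomial score
```

The Bernoulli score also penalises the absent second word through the `log(1 - p)` term, whereas the multinomial score simply ignores words with a count of zero.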

### Q3. How does Bernoulli Naive Bayes handle missing values?

**Answer:**
Bernoulli Naive Bayes has no built-in mechanism for missing values; they must be imputed or otherwise handled before the classifier is applied. Common strategies include:
- Imputing with the mode (most frequent value) of the feature. The mean or median of a 0/1 feature may not itself be 0 or 1, so these are less suitable for binary data.
- Using a separate category or value to denote missing values, for example an additional binary indicator feature.
- Excluding instances with missing values (if the dataset is large enough).
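A minimal sketch of the imputation strategy, assuming scikit-learn's `SimpleImputer` with `strategy="most_frequent"` as the imputer and a tiny made-up binary dataset (`X_toy`, `y_toy` are illustrative names, not part of the assignment):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary data with two missing entries (np.nan); values are made up.
X_toy = np.array([
    [1.0, 0.0, np.nan],
    [0.0, 1.0, 1.0],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
])
y_toy = np.array([1, 0, 1, 0])

# Replace each missing entry with the most frequent value of its column.
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X_toy)
print(X_imputed)

# Chaining the imputer and the classifier ensures that the imputation
# learned during fit is also applied to any data passed to predict.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds)
```

Putting the imputer inside a pipeline is a design choice rather than a requirement; it mainly helps when the model is later cross-validated, because the imputation values are then learned from the training folds only.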

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

**Answer:**
Yes. Gaussian Naive Bayes assumes that, within each class, every feature follows a Gaussian (normal) distribution. It estimates a mean and variance for each feature and class, computes the posterior probability of every class, and predicts the class with the highest posterior. Nothing in this procedure is specific to two classes, so it extends directly to any number of classes.
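As an illustration, the sketch below fits `GaussianNB` on the three-class Iris dataset bundled with scikit-learn (chosen here only as a convenient multi-class example; it is not part of the assignment) and checks that the prediction is the class with the largest posterior.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes, so it exercises the multi-class case.
X_iris, y_iris = load_iris(return_X_y=True)

model = GaussianNB().fit(X_iris, y_iris)

# One posterior probability per class for each of the first five samples.
posteriors = model.predict_proba(X_iris[:5])
predicted = model.predict(X_iris[:5])

# Picking the arg-max posterior by hand should reproduce predict().
from_argmax = model.classes_[np.argmax(posteriors, axis=1)]

print(posteriors.shape)
print(np.array_equal(predicted, from_argmax))
```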

### Q5. Assignment:

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("spambase.csv")

In [3]:
df.head()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [5]:
df.describe()

| | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | spam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | ... | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 | 4601.0 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.038575 | 0.13903 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 52.172789 | 283.289285 | 0.394045 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 194.89131 | 606.347851 | 0.488698 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.588 | 6.0 | 35.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.065 | 0.0 | 0.0 | 0.0 | 0.0 | 2.276 | 15.0 | 95.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.42 | 0.0 | 0.38 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 | ... | 0.0 | 0.188 | 0.0 | 0.315 | 0.052 | 0.0 | 3.706 | 43.0 | 266.0 | 1.0 |
| max | 4.54 | 14.28 | 5.1 | 42.81 | 10.0 | 5.88 | 7.27 | 11.11 | 5.26 | 18.18 | ... | 4.385 | 9.752 | 4.081 | 32.478 | 6.003 | 19.829 | 1102.5 | 9989.0 | 15841.0 | 1.0 |


In [19]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into features and target variable
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Standardize the features for GaussianNB only; MultinomialNB requires
# non-negative inputs, so the other classifiers are trained on the unscaled X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
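The cell above fits the scaler on the full dataset before cross-validation, so the scaling statistics are influenced by every validation fold. One common way to avoid this is to put the scaler and the classifier in a pipeline so that scaling is refit on each training fold. The sketch below illustrates the idea on synthetic data from `make_classification` (a stand-in, since `spambase.csv` is not bundled here); it is not the code that produced the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is refit inside every training fold, so no statistics
# from the validation fold leak into the preprocessing step.
pipeline = make_pipeline(StandardScaler(), GaussianNB())
scores = cross_val_score(pipeline, X_demo, y_demo, cv=10, scoring="accuracy")

print(scores.mean())
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y` from the cell above.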



In [20]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define classifiers
classifiers = {
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': GaussianNB()
}

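# Function to evaluate a classifier with 10-fold cross-validation
def evaluate_classifier(clf, X, y):
    # Accuracy is the mean of the 10 per-fold accuracy scores
    accuracy = cross_val_score(clf, X, y, cv=10, scoring='accuracy').mean()
    # Precision, recall, and F1 are computed once on the pooled out-of-fold predictions
    y_pred = cross_val_predict(clf, X, y, cv=10)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    return accuracy, precision, recall, f1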

# Evaluate each classifier
results = {}
for name, clf in classifiers.items():
    if name == 'GaussianNB':
        X_to_use = X_scaled
    else:
        X_to_use = X
    accuracy, precision, recall, f1 = evaluate_classifier(clf, X_to_use, y)
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

results


{'BernoulliNB': {'Accuracy': 0.8839380364047911,
  'Precision': 0.8813357185450209,
  'Recall': 0.815223386651958,
  'F1 Score': 0.8469914040114613},
 'MultinomialNB': {'Accuracy': 0.7863496180326323,
  'Precision': 0.7323628219484882,
  'Recall': 0.7214561500275786,
  'F1 Score': 0.7268685746040566},
 'GaussianNB': {'Accuracy': 0.8187296048288222,
  'Precision': 0.6963497793822704,
  'Recall': 0.9575289575289575,
  'F1 Score': 0.8063167673014399}}
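The evaluation function above runs cross-validation twice per classifier, once for accuracy and once for the predictions. A possible alternative is scikit-learn's `cross_validate`, which scores several metrics in a single pass. The sketch below uses synthetic data as a stand-in for Spambase. Note that it averages each metric over the folds, whereas the notebook computes precision, recall, and F1 on pooled out-of-fold predictions, so the two approaches would not give identical numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]

# One cross-validation run that records every metric for every fold.
cv_results = cross_validate(BernoulliNB(), X_demo, y_demo, cv=10, scoring=metrics)

# Average each metric across the 10 folds.
summary = {metric: cv_results[f"test_{metric}"].mean() for metric in metrics}

for metric, value in summary.items():
    print(f"{metric}: {value:.4f}")
```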

In [21]:
for name, metrics in results.items():
    print(f"Classifier: {name}")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.4f}")
    print("\n")


Classifier: BernoulliNB
Accuracy: 0.8839
Precision: 0.8813
Recall: 0.8152
F1 Score: 0.8470


Classifier: MultinomialNB
Accuracy: 0.7863
Precision: 0.7324
Recall: 0.7215
F1 Score: 0.7269


Classifier: GaussianNB
Accuracy: 0.8187
Precision: 0.6963
Recall: 0.9575
F1 Score: 0.8063




### Discussion:

The results for the classifiers are as follows:

- **Bernoulli Naive Bayes:**
  - Accuracy: 0.8839
  - Precision: 0.8813
  - Recall: 0.8152
  - F1 Score: 0.8470

- **Multinomial Naive Bayes:**
  - Accuracy: 0.7863
  - Precision: 0.7324
  - Recall: 0.7215
  - F1 Score: 0.7269

- **Gaussian Naive Bayes:**
  - Accuracy: 0.8187
  - Precision: 0.6963
  - Recall: 0.9575
  - F1 Score: 0.8063

**Which variant of Naive Bayes performed the best?**

Based on the results, **Bernoulli Naive Bayes** performed the best overall, achieving the highest accuracy (0.8839), precision (0.8813), and F1 score (0.8470). Gaussian Naive Bayes had the highest recall (0.9575), but at the cost of much lower precision (0.6963).

**Why do you think that is the case?**

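The Spambase features are continuous word and character frequencies plus capital-run-length statistics, not binary values (see the `df.describe()` output above). However, scikit-learn's `BernoulliNB` binarizes its inputs by default (`binarize=0.0`), so every feature is reduced to whether it is present (greater than zero) or absent. Most features are zero for the majority of emails, so the presence or absence of a word appears to carry most of the signal, and binarizing also removes the influence of extreme values and differing feature scales. This likely explains why the Bernoulli model fits this dataset better than the other two variants.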
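As the `df.describe()` output shows, the feature values in this dataset are continuous frequencies, and `BernoulliNB` thresholds its inputs through the `binarize` parameter (default `0.0`). The sketch below, on synthetic sparse data standing in for frequency features, checks that the default behaviour matches fitting on a manually binarized matrix.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Synthetic stand-in for frequency features: mostly zeros, some positive values.
X_freq = rng.random((200, 5)) * (rng.random((200, 5)) < 0.3)
y_demo = (X_freq[:, 0] > 0).astype(int)

# Default model: inputs are thresholded at binarize=0.0 internally.
default_model = BernoulliNB().fit(X_freq, y_demo)

# Explicit version: binarize by hand, then disable internal binarization.
X_binary = (X_freq > 0).astype(float)
explicit_model = BernoulliNB(binarize=None).fit(X_binary, y_demo)

same_predictions = np.array_equal(
    default_model.predict(X_freq), explicit_model.predict(X_binary)
)
print(same_predictions)
```

Raising the threshold (for example `binarize=0.1`) would treat very small frequencies as absent, which is one of the few knobs this classifier offers.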

**Are there any limitations of Naive Bayes that you observed?**

- **Assumption of Independence:** Naive Bayes classifiers assume that features are conditionally independent given the class label, which is often not the case in real-world data. Despite this strong assumption, Naive Bayes can still perform surprisingly well, but it can be a limitation in some scenarios.
- **Sensitivity to Feature Distribution:** Gaussian Naive Bayes assumes that each feature follows a Gaussian distribution within each class. In Spambase most features have a median of zero and a long right tail (see the `df.describe()` output), so this assumption fits poorly, which is consistent with the low precision (0.6963) observed for this variant.
- **Handling of Zero Probabilities:** Naive Bayes can encounter issues with zero probabilities if a feature value never occurs in the training set for a given class. This can be mitigated by using techniques like Laplace smoothing.
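The zero-probability problem and its Laplace (additive) smoothing fix can be shown with a few lines of arithmetic. The counts below are hypothetical, and `feature_probability` is an illustrative helper rather than a library function.

```python
from fractions import Fraction


def feature_probability(count_in_class, n_class, alpha=0, n_values=2):
    """Estimate P(feature value | class) with additive smoothing.

    alpha=0 gives the raw relative frequency; alpha=1 is Laplace smoothing.
    n_values is the number of values the feature can take (2 for a binary feature).
    """
    return Fraction(count_in_class + alpha, n_class + alpha * n_values)


# Hypothetical counts: a word appears in 0 of the 50 non-spam training emails.
unsmoothed = feature_probability(0, 50)
smoothed = feature_probability(0, 50, alpha=1)

print(unsmoothed)  # 0
print(smoothed)    # 1/52
```

Without smoothing, the zero estimate would force the whole class likelihood to zero for any email containing that word. In scikit-learn, `BernoulliNB` and `MultinomialNB` expose this smoothing strength as the `alpha` parameter, which defaults to `1.0`.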

### Conclusion:

**Summary of Findings:**

- Bernoulli Naive Bayes performed the best on the Spambase dataset, achieving the highest accuracy, precision, and F1 score. This is likely because `BernoulliNB` binarizes the frequency features into presence/absence indicators, which suits features that are zero for most emails.
- Multinomial Naive Bayes performed worst of the three on accuracy and F1 score, likely because it is designed for count data, whereas the Spambase features are percentages and run-length statistics on very different scales.
- Gaussian Naive Bayes had the highest recall (0.9575), indicating it is very good at identifying spam emails, but it had lower precision (0.6963), leading to more false positives.

**Suggestions for Future Work:**

1. **Feature Engineering:** Further improve the feature extraction process to enhance the classifier’s performance. This can include creating additional features, performing dimensionality reduction, or applying advanced text preprocessing techniques.
2. **Hyperparameter Tuning:** Explore hyperparameter tuning for each Naive Bayes classifier to optimize their performance. Techniques such as grid search or random search could be used.
3. **Combine Classifiers:** Consider using an ensemble method that combines the strengths of different Naive Bayes classifiers or other machine learning algorithms to achieve better overall performance.
4. **Addressing Assumptions:** Investigate methods to relax the strong independence assumption of Naive Bayes, such as using Bayesian Networks or other models that capture dependencies between features.
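5. **Handling Imbalanced Data:** Explore techniques for handling class imbalance, such as oversampling the minority class or using cost-sensitive learning methods. About 39% of the emails are labelled spam (the mean of the `spam` column is 0.394), so the imbalance here is moderate.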

Implementing these suggestions may further improve the performance of Naive Bayes classifiers on the Spambase dataset and lead to better spam detection.
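The hyperparameter tuning suggestion could be sketched with `GridSearchCV` as follows. The data is synthetic (a stand-in for Spambase), and the grid values for `alpha` and `binarize` are arbitrary starting points chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate smoothing strengths and binarization thresholds (illustrative values).
param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],
    "binarize": [0.0, 0.5],
}

# Exhaustively evaluate every combination with 5-fold cross-validation on F1.
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

The same pattern applies to `MultinomialNB` (tuning `alpha`) and `GaussianNB` (tuning `var_smoothing`), with the grid adjusted to each estimator's parameters.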