### Q1. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

This is a conditional probability problem. We are asked to find \( P(S | H) \), the probability that an employee is a **smoker** given that they use the company's **health insurance** plan.

- \( P(H) \) (probability of using the health insurance plan) = 0.70
- \( P(S | H) \) (probability of being a smoker given that they use the plan) = 0.40

Therefore, the probability that an employee is a smoker given that they use the health insurance plan is simply **0.40** or **40%**, as this was provided directly in the problem.
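For reference, the definition of conditional probability ties the two given values together. The joint probability \( P(S \cap H) \) is not stated in the problem; it is derived here from the given figures only as a consistency check:

\[
P(S \mid H) = \frac{P(S \cap H)}{P(H)}
\quad\Longrightarrow\quad
P(S \cap H) = P(S \mid H)\,P(H) = 0.40 \times 0.70 = 0.28
\]

In other words, 28% of all employees both use the plan and smoke, and dividing that by the 70% who use the plan recovers the 40% given in the problem.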

---

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The key differences between **Bernoulli Naive Bayes** and **Multinomial Naive Bayes** are:

1. **Type of Features**:
   - **Bernoulli Naive Bayes**: It is used when the features are **binary** (0 or 1), meaning each feature can either be present (1) or absent (0). It is well-suited for situations where the presence or absence of a feature matters more than the frequency of occurrence.
   - **Multinomial Naive Bayes**: It is used for **discrete, count-based features**. This is often applied when features represent frequencies or counts, such as the number of times a word appears in a text document. It is common in text classification tasks.

2. **Feature Interpretation**:
   - In **Bernoulli Naive Bayes**, each feature is considered as either present or absent (yes/no).
   - In **Multinomial Naive Bayes**, each feature represents the number of times it occurs in a given instance.

3. **Application Examples**:
   - **Bernoulli Naive Bayes**: Used in binary data, such as document classification based on the presence or absence of certain words.
   - **Multinomial Naive Bayes**: Commonly used for **bag-of-words** models in text classification, where the frequency of words matters.
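The difference in how the two models read the same features can be sketched with scikit-learn on a tiny, made-up word-count matrix (the array values and the names `X_counts` and `y_toy` are illustrative assumptions, not part of the assignment data). `BernoulliNB` binarizes its input at its default threshold (`binarize=0.0`), so fitting on counts or on a 0/1 version of the same matrix should give the same model, whereas `MultinomialNB` accumulates the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (rows = documents, columns = words); values are made up.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y_toy = np.array([1, 1, 0, 0])

# BernoulliNB binarizes its input (default binarize=0.0),
# so only the presence or absence of each word is used.
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y_toy)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

# MultinomialNB sums the actual counts per class (classes are ordered 0, 1).
mnb = MultinomialNB().fit(X_counts, y_toy)

print("BernoulliNB ignores magnitudes:", same_model)
print("MultinomialNB per-class feature counts:")
print(mnb.feature_count_)
```

Comparing `mnb.feature_count_` with the raw matrix shows that the multinomial model keeps the frequency information that the Bernoulli model discards.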

---

### Q3. How does Bernoulli Naive Bayes handle missing values?

**Bernoulli Naive Bayes** assumes that every feature is binary and takes either 0 or 1 (representing the absence or presence of a feature). Missing values can be handled by interpreting them as "absent" (0) in the feature space. However, this approach may not always be ideal if the absence of a feature conveys a different meaning than missing data. 

To address missing values explicitly:
1. **Imputation**: Missing binary values can be imputed, e.g., with the mode (the most frequent value) of the feature.
2. **Ignore Missing Values**: Alternatively, depending on the implementation, some models allow the classifier to skip features that have missing data for a particular instance during prediction.
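As a minimal sketch of the imputation option, the snippet below assumes scikit-learn, whose `BernoulliNB` does not accept `NaN` inputs, so missing entries have to be filled before fitting. The toy matrix and the names `X_missing` and `y_toy` are made up for illustration; `SimpleImputer(strategy="most_frequent")` fills each column with its mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary feature matrix with missing entries (np.nan); values are made up.
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 1],
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, np.nan],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Standalone imputation, to inspect what the missing cells are replaced with.
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X_missing)

# The same step inside a pipeline, so imputation is learned from training data only.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_toy)
preds = model.predict(X_missing)

print(X_imputed)
print(preds)
```

Putting the imputer in a pipeline matters under cross-validation, because the fill values are then recomputed on each training fold rather than leaking from the held-out fold.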

---

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, **Gaussian Naive Bayes** can be used for **multi-class classification**. In fact, Naive Bayes classifiers, including Gaussian Naive Bayes, naturally extend to multi-class classification by calculating the likelihood of each class independently and choosing the class with the highest posterior probability.

For multi-class problems, Gaussian Naive Bayes works similarly to binary classification:
- It computes the probability for each class using the Gaussian distribution for each feature.
- It then selects the class with the highest probability as the predicted class.

Gaussian Naive Bayes assumes that each class has a separate normal (Gaussian) distribution for each feature, even in multi-class settings.
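A short sketch of the multi-class case, assuming scikit-learn and its bundled three-class Iris dataset (chosen here only because it ships with the library; it is not part of this assignment). The fitted model stores one mean per class and feature in `theta_`, and `predict` returns the class with the highest posterior probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features.
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# One posterior probability per class for each sample.
proba = gnb.predict_proba(X_iris[:5])
pred = gnb.predict(X_iris[:5])

print("Classes:", gnb.classes_)
print("Per-class feature means shape:", gnb.theta_.shape)
print(proba.round(3))
print("Predicted:", pred)
```

No one-vs-rest wrapper is needed: the same fit and predict calls work for two classes or many.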

In [2]:
pip install ucimlrepo




In [4]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 


{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

In [13]:
spambase

{'data': {'ids': None,
  'features':       word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
  0               0.00               0.64           0.64           0.0   
  1               0.21               0.28           0.50           0.0   
  2               0.06               0.00           0.71           0.0   
  3               0.00               0.00           0.00           0.0   
  4               0.00               0.00           0.00           0.0   
  ...              ...                ...            ...           ...   
  4596            0.31               0.00           0.62           0.0   
  4597            0.00               0.00           0.00           0.0   
  4598            0.30               0.00           0.30           0.0   
  4599            0.96               0.00           0.00           0.0   
  4600            0.00               0.00           0.65           0.0   
  
        word_freq_our  word_freq_over  word_freq_remove  word_freq_interne

In [17]:
# Check the first few rows of features and target
print(X.head())



   word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
0            0.00               0.64           0.64           0.0   
1            0.21               0.28           0.50           0.0   
2            0.06               0.00           0.71           0.0   
3            0.00               0.00           0.00           0.0   
4            0.00               0.00           0.00           0.0   

   word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
0           0.32            0.00              0.00                0.00   
1           0.14            0.28              0.21                0.07   
2           1.23            0.19              0.19                0.12   
3           0.63            0.00              0.31                0.63   
4           0.63            0.00              0.31                0.63   

   word_freq_order  word_freq_mail  ...  word_freq_conference  char_freq_;  \
0             0.00            0.00  ...                   0.0 

In [19]:
print(y.head())

   Class
0      1
1      1
2      1
3      1
4      1


In [23]:
# Import necessary libraries
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# X and y were loaded above with the ucimlrepo package
# X contains the features; y contains the target values (1 = spam, 0 = not spam)

# scikit-learn expects a 1-D target array; y is a single-column DataFrame,
# so flatten it here rather than suppressing all warnings
y = np.ravel(y)

# Standardize the features for GaussianNB. This is not strictly required:
# GaussianNB fits a per-class mean and variance for each feature, so scaling
# mainly affects how var_smoothing acts on features of very different scale.
# BernoulliNB and MultinomialNB use the unscaled X (MultinomialNB needs non-negative inputs).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create Naive Bayes classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Function to evaluate model performance with 10-fold cross-validation
def evaluate_model(model, X, y):
    accuracy = cross_val_score(model, X, y, cv=10, scoring='accuracy').mean()
    precision = cross_val_score(model, X, y, cv=10, scoring='precision').mean()
    recall = cross_val_score(model, X, y, cv=10, scoring='recall').mean()
    f1 = cross_val_score(model, X, y, cv=10, scoring='f1').mean()
    
    return {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Evaluate Bernoulli Naive Bayes
bernoulli_results = evaluate_model(bernoulli_nb, X, y)

# Evaluate Multinomial Naive Bayes
multinomial_results = evaluate_model(multinomial_nb, X, y)

# Evaluate Gaussian Naive Bayes (with scaled features)
gaussian_results = evaluate_model(gaussian_nb, X_scaled, y)

# Display results
print("Bernoulli Naive Bayes Results:", bernoulli_results)
print("Multinomial Naive Bayes Results:", multinomial_results)
print("Gaussian Naive Bayes Results:", gaussian_results)


Bernoulli Naive Bayes Results: {'Accuracy': 0.8839380364047911, 'Precision': 0.8869617393737383, 'Recall': 0.8152389047416673, 'F1 Score': 0.8481249015095276}
Multinomial Naive Bayes Results: {'Accuracy': 0.7863496180326323, 'Precision': 0.7393175533565436, 'Recall': 0.7214983911116508, 'F1 Score': 0.7282909724016348}
Gaussian Naive Bayes Results: {'Accuracy': 0.8187296048288222, 'Precision': 0.706348431872469, 'Recall': 0.9575040981118329, 'F1 Score': 0.8105561813371891}
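The evaluation function above runs a separate 10-fold cross-validation for each metric. As a hedged alternative, scikit-learn's `cross_validate` accepts several scorers and computes them in a single pass over the folds. The sketch below uses synthetic data from `make_classification` so that it runs on its own; `X_demo`, `y_demo`, and `summary` are illustrative names, and the numbers it prints are unrelated to the Spambase results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data, used only to keep the sketch self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# All four metrics are computed on the same 10 folds in one call.
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(GaussianNB(), X_demo, y_demo, cv=10, scoring=scoring)

# cross_validate stores each metric under the key "test_<metric>".
summary = {metric: cv_results[f"test_{metric}"].mean() for metric in scoring}
print(summary)
```

Because each model is fitted once per fold instead of four times, this does roughly a quarter of the training work for the same set of metrics.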


In [29]:
import pandas as pd

# Create a DataFrame to summarize results
results_df = pd.DataFrame({
    'Classifier': ['BernoulliNB', 'MultinomialNB', 'GaussianNB'],
    'Accuracy': [bernoulli_results['Accuracy'], multinomial_results['Accuracy'], gaussian_results['Accuracy']],
    'Precision': [bernoulli_results['Precision'], multinomial_results['Precision'], gaussian_results['Precision']],
    'Recall': [bernoulli_results['Recall'], multinomial_results['Recall'], gaussian_results['Recall']],
    'F1 Score': [bernoulli_results['F1 Score'], multinomial_results['F1 Score'], gaussian_results['F1 Score']]
})

# Display the summary
print(results_df)


      Classifier  Accuracy  Precision    Recall  F1 Score
0    BernoulliNB  0.883938   0.886962  0.815239  0.848125
1  MultinomialNB  0.786350   0.739318  0.721498  0.728291
2     GaussianNB  0.818730   0.706348  0.957504  0.810556



#### 1. **Which variant of Naive Bayes performed the best?**
   - Based on the results, **Bernoulli Naive Bayes** performed the best overall, achieving the highest **accuracy** (0.884), **precision** (0.887), and **F1 score** (0.848).
   - **Gaussian Naive Bayes** achieved the highest **recall** (0.958) but the lowest precision (0.706), while **Multinomial Naive Bayes** had the lowest accuracy (0.786) and F1 score (0.728).
   
#### 2. **Why did Bernoulli Naive Bayes perform better?**
   - The **Spambase dataset** stores word and character frequencies as continuous relative values (e.g., 0.64), not raw integer counts. `BernoulliNB` binarizes its input (default threshold `binarize=0.0`), so it effectively models whether each word or character appears in an email at all. For this dataset, presence or absence appears to be a strong spam signal on its own.
   - **Multinomial Naive Bayes** is designed for discrete counts. Here it receives relative frequencies together with capital-run-length features on a much larger scale, which likely fit its assumptions less well and may explain its lower scores.
   - **Gaussian Naive Bayes** assumes each feature is normally distributed within a class. The features are mostly zero and heavily skewed, so the Gaussian fit is probably poor. Its combination of very high recall and low precision indicates that it labels many legitimate emails as spam.

#### 3. **Limitations of Naive Bayes observed:**
   - **Independence assumption**: Naive Bayes assumes that all features are conditionally independent given the class, which is often not true in real-world data like text. Words in an email are often correlated (e.g., "cheap" and "offer" are likely to appear together in spam emails), and Naive Bayes does not model this correlation.
   - **Sensitivity to feature representation**: The performance of Naive Bayes models depends heavily on how the features are represented. For example, the same features yielded an accuracy of 0.884 when binarized by Bernoulli Naive Bayes but only 0.786 when treated as counts by Multinomial Naive Bayes.

### Step 3: Conclusion

- **Summary of Findings**:
   - **Bernoulli Naive Bayes** provided the best overall results (highest accuracy, precision, and F1 score), making it the most suitable of the three variants for this task.
   - **Gaussian Naive Bayes** caught the most spam (highest recall) but at the cost of the lowest precision, so it is only preferable when missing spam is costlier than flagging legitimate email.
   - **Multinomial Naive Bayes** was the least effective overall, likely because the features are relative frequencies rather than raw counts.

- **Suggestions for Future Work**:
   1. **Feature engineering**: Consider trying different feature extraction techniques like TF-IDF (Term Frequency-Inverse Document Frequency), which can help improve classification performance.
   2. **Model comparison**: Test other classifiers such as Logistic Regression, SVM, or tree-based models (e.g., Random Forest) to see if they outperform Naive Bayes.
   3. **Hyperparameter tuning**: While this analysis used the default hyperparameters, fine-tuning parameters such as smoothing in Naive Bayes could further improve performance.
   4. **Ensemble methods**: Investigate whether ensemble models, such as a voting classifier that combines multiple Naive Bayes variants or other models, can boost overall performance.

