In [1]:
# Q1. A company conducted a survey of its employees and found that 70% of the employees use the
# company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
# probability that an employee is a smoker given that he/she uses the health insurance plan?

The question asks for the probability that an employee is a smoker given that he/she uses the health insurance plan. That conditional probability is stated directly in the problem, so Bayes' theorem is not needed. Let's denote the events as follows:

- Event \( A \): Employee uses the health insurance plan.
- Event \( B \): Employee is a smoker.

We are given:

- \( P(A) \), the probability that an employee uses the health insurance plan, which is 70% or 0.70.
- \( P(B|A) \), the conditional probability that an employee is a smoker given that they use the health insurance plan, which is 40% or 0.40.

The quantity we need is \( P(B|A) \), which is exactly the second given value:

\[ P(B|A) = 0.40 \]

Bayes' theorem,

\[ P(B|A) = \frac{P(A|B) \times P(B)}{P(A)} \]

would only be required if we were given \( P(A|B) \) and \( P(B) \) instead. Neither is provided here, and the overall smoking rate \( P(B) \) cannot be determined because nothing is said about employees who do not use the plan.

As a consistency check, the definition of conditional probability gives the joint probability and recovers the same answer:

\[ P(A \cap B) = P(B|A) \times P(A) = 0.40 \times 0.70 = 0.28 \]

\[ P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{0.28}{0.70} = 0.40 \]

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 40%.
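
The arithmetic can be checked with a few lines of Python. This is only a sketch: the variable names are chosen here for illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Values stated in the question
p_plan = Fraction(70, 100)               # P(A): employee uses the plan
p_smoker_given_plan = Fraction(40, 100)  # P(B|A): smoker, given plan use

# Joint probability from the definition of conditional probability
p_plan_and_smoker = p_smoker_given_plan * p_plan

# Dividing the joint probability by P(A) recovers P(B|A)
recovered = p_plan_and_smoker / p_plan

print(float(p_plan_and_smoker), float(recovered))
```

About 28% of all employees both use the plan and smoke, and dividing by the 70% who use the plan returns the stated 40%.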

In [2]:
# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of data they are designed to handle and the underlying assumptions about the distribution of features:

1. **Bernoulli Naive Bayes**:
   - Bernoulli Naive Bayes is suitable for binary feature data, where features represent presence or absence of certain attributes.
   - It assumes that, given the class, the features are conditionally independent binary variables, each following a Bernoulli distribution. The absence of a feature therefore counts as evidence too.
   - It is commonly used in text classification tasks, such as document classification or sentiment analysis, where features represent the presence or absence of words in documents.
   - Example: Spam email detection, where features represent the presence or absence of specific keywords in an email.

2. **Multinomial Naive Bayes**:
   - Multinomial Naive Bayes is suitable for data with features that represent counts or frequencies, such as word counts in text data.
   - It assumes that, given the class, the feature vector is drawn from a multinomial distribution, where each feature value is the number of times a particular attribute occurs.
   - It is commonly used in text classification tasks, similar to Bernoulli Naive Bayes, but it considers the frequency of words rather than just their presence or absence.
   - Example: Document classification based on word counts, where features represent the frequency of words in documents.

In summary, Bernoulli Naive Bayes is used for binary feature data, while Multinomial Naive Bayes is used for data with features representing counts or frequencies. The choice between the two depends on the nature of the data and the assumptions about the distribution of features.
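
To make the difference concrete, here is a minimal sketch of the two class-conditional log-likelihoods in plain Python. The function names, the three-word vocabulary, and the parameter values are all hypothetical and chosen only for illustration. These are not scikit-learn's internals.

```python
import math

def bernoulli_log_likelihood(counts, p_present):
    """Bernoulli model: only whether each word is present or absent matters."""
    total = 0.0
    for n, p_j in zip(counts, p_present):
        # Presence contributes log(p_j); absence contributes log(1 - p_j)
        total += math.log(p_j) if n > 0 else math.log(1.0 - p_j)
    return total

def multinomial_log_likelihood(counts, theta):
    """Multinomial model, up to a constant that does not depend on the class."""
    return sum(n * math.log(t) for n, t in zip(counts, theta))

# Hypothetical parameters for one class over a 3-word vocabulary
p_present = [0.8, 0.1, 0.5]  # P(word j appears in a document | class)
theta = [0.6, 0.1, 0.3]      # P(a token is word j | class), sums to 1

doc = [2, 0, 1]              # word counts in a document
doc_doubled = [4, 0, 2]      # same words, each occurring twice as often

bern_a = bernoulli_log_likelihood(doc, p_present)
bern_b = bernoulli_log_likelihood(doc_doubled, p_present)
multi_a = multinomial_log_likelihood(doc, theta)
multi_b = multinomial_log_likelihood(doc_doubled, theta)
print(bern_a, bern_b)
print(multi_a, multi_b)
```

Doubling every count leaves the Bernoulli log-likelihood unchanged, because the same words are present, while the multinomial log-likelihood doubles.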

In [3]:
# Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in mechanism for missing values. The model only defines probabilities for a feature being 0 or 1, and scikit-learn's `BernoulliNB` raises an error if the input contains NaN. Missing values therefore have to be dealt with before training and prediction.

Here are common approaches to handling missing values with Bernoulli Naive Bayes:

1. **Drop Incomplete Instances**: Exclude any instance that contains a missing value from training. This is simple but discards data, and it does not help at prediction time if a new instance has missing features.

2. **Imputation**: Replace missing values with a default value, such as 0 or 1, depending on the context of the data. For a binary feature, the mode (most common value) of that feature is a natural choice.

3. **Other Preprocessing**: General-purpose techniques such as mean or median imputation can also be applied, but on binary features the mean is usually a fraction rather than 0 or 1. With the default `binarize=0.0` in `BernoulliNB`, any positive imputed value is then treated as 1, so mode imputation or an explicit "missing" indicator feature is usually easier to reason about.

It's important to note that the choice of handling missing values depends on the specific characteristics of the dataset and the assumptions about the nature of the missingness. It's advisable to carefully consider the implications of each approach and experiment with different strategies to determine the most suitable method for your particular dataset and problem.
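
As an illustration of the imputation approach, the sketch below chains scikit-learn's `SimpleImputer` (mode imputation) with `BernoulliNB` in a pipeline. The toy data and the names `X_toy`, `y_toy`, and `nb_with_imputer` are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with missing entries marked as np.nan (made-up data)
X_toy = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing value with the most frequent value of its column,
# then fit Bernoulli Naive Bayes on the completed data.
nb_with_imputer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    BernoulliNB(),
)
nb_with_imputer.fit(X_toy, y_toy)

# The same imputation is applied automatically at prediction time
toy_pred = nb_with_imputer.predict(np.array([[1, np.nan, 0]]))
print(toy_pred)
```

Keeping the imputer inside the pipeline means the column modes are learned from the training data only, which matters when the pipeline is used inside cross-validation.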

In [4]:
# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks, and it handles them natively. The model is defined for any number of classes, so no "one-vs-all" (also known as "one-vs-rest") wrapper is required.

For each class, the model estimates a prior probability and, for every feature, a mean and a variance. During prediction, it evaluates the posterior for every class and assigns the class with the highest value:

\[ \hat{y} = \arg\max_{c} \; P(c) \prod_{j} \mathcal{N}\left(x_j \mid \mu_{c,j}, \sigma^2_{c,j}\right) \]

Here's how Gaussian Naive Bayes handles a multi-class problem:

1. **Training**:
   - For each class \(i\) in the multi-class problem:
     - Estimate the prior \(P(i)\) as the fraction of training points that belong to class \(i\).
     - Estimate the mean and variance of each feature using only the training points that belong to class \(i\).

2. **Prediction**:
   - For a new input instance, compute for each class the prior multiplied by the product of the per-feature Gaussian densities.
   - Assign the class with the highest resulting posterior probability as the predicted class for the input instance.

Because the same procedure applies to two classes or to many, Gaussian Naive Bayes needs no modification for multi-class problems. However, it's important to note that the performance of the classifier depends on how well the per-class Gaussian and independence assumptions match the data.
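
As a small illustration, scikit-learn's `GaussianNB` can be fit directly on a three-class problem without any wrapper. The Iris dataset is used here only because it ships with scikit-learn; the names `X_iris`, `y_iris`, and `iris_nb` are chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four numeric features
X_iris, y_iris = load_iris(return_X_y=True)

# A single GaussianNB model is fit on all three classes at once
iris_nb = GaussianNB().fit(X_iris, y_iris)

print(iris_nb.classes_)      # the class labels the model learned
print(iris_nb.theta_.shape)  # per-class, per-feature means

# One posterior probability per class for the first sample
iris_proba = iris_nb.predict_proba(X_iris[:1])
print(iris_proba)
```

The fitted model stores one row of feature means per class, and `predict_proba` returns one probability per class for each sample.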

In [5]:
# Q5. Assignment:
# Data preparation:
# Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/
# datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
# is spam or not based on several input features.

In [8]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 


{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

In [9]:
# Implementation:
# Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
# scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
# dataset. You should use the default hyperparameters for each classifier.

In [11]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the Spambase dataset from OpenML (requires network access on first use)
dataset = fetch_openml(name='spambase', version=1)

# Split the dataset into features (X) and target variable (y)
X = dataset.data
y = dataset.target

# Initialize classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation for each classifier
# Bernoulli Naive Bayes
bernoulli_scores = cross_val_score(bernoulli_nb, X, y, cv=10)

# Multinomial Naive Bayes
multinomial_scores = cross_val_score(multinomial_nb, X, y, cv=10)

# Gaussian Naive Bayes
gaussian_scores = cross_val_score(gaussian_nb, X, y, cv=10)

# Print the mean accuracy of each classifier
print("Bernoulli Naive Bayes Mean Accuracy:", bernoulli_scores.mean())
print("Multinomial Naive Bayes Mean Accuracy:", multinomial_scores.mean())
print("Gaussian Naive Bayes Mean Accuracy:", gaussian_scores.mean())



Bernoulli Naive Bayes Mean Accuracy: 0.8839380364047911
Multinomial Naive Bayes Mean Accuracy: 0.7863496180326323
Gaussian Naive Bayes Mean Accuracy: 0.8217730830896915


In this implementation:

1. We load the dataset using the `fetch_openml()` function from scikit-learn, assuming the dataset is available in the OpenML repository.
2. We split the dataset into features (`X`) and target variable (`y`).
3. We initialize Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the respective classes from scikit-learn.
4. We perform 10-fold cross-validation for each classifier using the `cross_val_score()` function.
5. Finally, we print the mean accuracy of each classifier across all folds of the cross-validation.

In [12]:
# Results:
# Report the following performance metrics for each classifier:
# Accuracy
# Precision
# Recall
# F1 score

To report the performance metrics for each classifier (Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes), we can calculate the following metrics using the cross-validation results:

1. Accuracy: The proportion of correctly classified instances.
2. Precision: The ratio of true positive predictions to the total number of positive predictions.
3. Recall: The ratio of true positive predictions to the total number of actual positive instances.
4. F1 score: The harmonic mean of precision and recall, providing a balanced measure between the two.
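
The four definitions above can be written out directly from the confusion-matrix counts. This is a minimal sketch in plain Python; the function name `classification_metrics` and the toy labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Toy labels: 1 = spam, 0 = not spam
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]

toy_metrics = classification_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75.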

These metrics must be computed from per-sample predictions, not from the per-fold accuracy values returned by `cross_val_score()`. One way to do this is to obtain out-of-fold predictions for every instance with `cross_val_predict()` and pass them, together with the true target values, to the metric functions in `sklearn.metrics` as `y_pred` and `y_true`.

In [16]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_predict

# The *_scores arrays hold one accuracy value per fold, so they cannot be used
# as y_pred. The metrics need one predicted label per instance instead.
# Labels are cast to int so that spam (1) is the positive class; this assumes
# the target values are 0/1, possibly stored as strings.
y_true = np.asarray(y).astype(int).ravel()

classifiers = {
    "Bernoulli Naive Bayes": bernoulli_nb,
    "Multinomial Naive Bayes": multinomial_nb,
    "Gaussian Naive Bayes": gaussian_nb,
}

# Print the performance metrics for each classifier
print("Results:")
for name, clf in classifiers.items():
    # Out-of-fold predictions from 10-fold cross-validation
    y_pred = cross_val_predict(clf, X, y_true, cv=10)
    print(f"Performance Metrics for {name}:")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:", recall_score(y_true, y_pred))
    print("F1 Score:", f1_score(y_true, y_pred))
    print()


In [17]:
# Discussion:
# Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
# the case? Are there any limitations of Naive Bayes that you observed?

Based on the results obtained from evaluating the performance metrics for each variant of Naive Bayes (Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes), we can discuss the following observations:

1. **Performance Comparison**:
   - Mean 10-fold cross-validation accuracy was about 0.884 for Bernoulli Naive Bayes, 0.822 for Gaussian Naive Bayes, and 0.786 for Multinomial Naive Bayes.
   - Precision, recall, and F1 score should be compared alongside accuracy, because in spam filtering a false positive (a legitimate email marked as spam) usually costs more than a false negative.

2. **Best Performing Variant**:
   - On accuracy, Bernoulli Naive Bayes performed best, roughly 6 percentage points ahead of Gaussian Naive Bayes and 10 ahead of Multinomial Naive Bayes.
   - It's essential to consider all metrics collectively rather than relying solely on one metric to determine the best-performing variant.

3. **Reasons for Performance**:
   - The Spambase features are mostly word and character frequencies (expressed as percentages) plus a few capital-letter run-length statistics. With default settings, scikit-learn's `BernoulliNB` binarizes every feature at 0, so it effectively uses whether a word or character occurs at all. That presence/absence signal appears to be strong and robust for this dataset.
   - Gaussian Naive Bayes assumes each feature is normally distributed within a class. The frequency features are non-negative, contain many zeros, and are heavily skewed, so this assumption is likely a poor fit.
   - Multinomial Naive Bayes is designed for counts. Here the features are percentages and run lengths on very different scales, so the large-valued run-length features can dominate the likelihood, which may explain its lower accuracy.

4. **Limitations of Naive Bayes**:
   - Despite its simplicity and efficiency, Naive Bayes classifiers assume that features are conditionally independent given the class, which rarely holds exactly in real datasets.
   - Correlated features are a particular problem: the same evidence is effectively counted more than once, which makes the predicted probabilities overconfident and can degrade accuracy.
   - Moreover, Naive Bayes classifiers can be affected by imbalanced class distributions. By default, scikit-learn estimates the class priors from the training frequencies, so predictions tend to lean toward the majority class, and accuracy alone can hide poor recall on the minority class.

Overall, while Naive Bayes classifiers are often considered robust and efficient, their performance can vary depending on the specific characteristics of the dataset and how well their assumptions align with the underlying data distribution. It's essential to carefully evaluate the performance of each variant and consider their limitations when applying Naive Bayes in real-world scenarios.
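
The effect of violating the independence assumption can be seen with a tiny worked example. The sketch below is plain Python with made-up probabilities. The function `spam_posterior` is hypothetical and computes the Naive Bayes posterior when one binary feature is duplicated, that is, when perfectly correlated copies are treated as independent evidence.

```python
def spam_posterior(n_copies, p_spam=0.8, p_ham=0.4, prior_spam=0.5):
    """P(spam | feature present) when the same feature is counted n_copies times.

    p_spam and p_ham are P(feature present | class); values are illustrative.
    """
    score_spam = prior_spam * p_spam ** n_copies
    score_ham = (1.0 - prior_spam) * p_ham ** n_copies
    return score_spam / (score_spam + score_ham)

posterior_single = spam_posterior(1)   # the feature counted once
posterior_tripled = spam_posterior(3)  # three identical copies of the feature
print(posterior_single, posterior_tripled)
```

With one copy the posterior is 2/3; with three identical copies it rises to 8/9, even though no new information was added. This is the sense in which correlated features make Naive Bayes overconfident.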

In [18]:
# Conclusion:
# Summarise your findings and provide some suggestions for future work.

In conclusion, our evaluation of different variants of Naive Bayes classifiers (Bernoulli, Multinomial, and Gaussian) yielded insights into their performance on a binary classification task. Here are the key findings and suggestions for future work:

1. **Findings**:
   - On the Spambase dataset, with default hyperparameters, mean 10-fold cross-validation accuracy was about 0.884 for Bernoulli, 0.822 for Gaussian, and 0.786 for Multinomial Naive Bayes.
   - Bernoulli Naive Bayes was therefore the best-performing variant on accuracy. Precision, recall, and F1 score should be compared in the same way before drawing a final conclusion.
   - The choice of the best-performing variant depends on the nature of the dataset and how well the assumptions of each variant align with the underlying data distribution.

2. **Suggestions for Future Work**:
   - Investigate Ensemble Methods: Future work could explore ensemble methods, such as Bagging or Boosting, to combine predictions from multiple Naive Bayes classifiers and potentially improve classification performance.
   - Feature Engineering: Conduct further feature engineering to identify and select relevant features or to transform existing features to better align with the assumptions of each Naive Bayes variant.
   - Address Class Imbalance: Implement techniques to address class imbalance, such as oversampling, undersampling, or using different evaluation metrics tailored for imbalanced datasets, to enhance the performance of Naive Bayes classifiers.
   - Evaluate on Diverse Datasets: Test the performance of Naive Bayes classifiers on a diverse range of datasets with varying characteristics to gain a deeper understanding of their strengths and limitations across different domains.

By addressing these suggestions and conducting further research, we can gain a better understanding of the applicability and performance of Naive Bayes classifiers in practical scenarios and potentially enhance their effectiveness in real-world applications.