## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

The quantity asked for is a conditional probability that the problem statement already provides, so Bayes' theorem is not actually needed. Let's define:

A: an employee uses the company's health insurance plan<br>
B: an employee is a smoker

We want to find the probability of an employee being a smoker given that he/she uses the health insurance plan, which is P(B|A).

We know that 70% of the employees use the health insurance plan, which means P(A) = 0.7.

We also know that 40% of the employees who use the plan are smokers, which is exactly the statement P(B|A) = 0.4.

We can confirm this with the definition of conditional probability: **P(B|A) = P(A and B) / P(A)**

The joint probability follows from the product rule:

**P(A and B) = P(B|A) * P(A)**

P(A and B) = 0.4 * 0.7 = 0.28

Plugging this back into the definition:

P(B|A) = 0.28 / 0.7

P(B|A) = 0.4

Bayes' theorem, **P(B|A) = P(A|B) * P(B) / P(A)**, gives the same value, but applying it would require P(B) and P(A|B). The problem says nothing about smokers among employees who do not use the plan, so neither quantity can be computed without an extra assumption, and none is needed here.

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4 or 40%.
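
The result can be sanity-checked with plain counts. The snippet below is a minimal sketch: the head count of 1000 employees is a hypothetical number chosen for illustration, and only the 70% and 40% figures come from the question.

```python
# Hypothetical head count; only the 70% and 40% figures come from the question.
total_employees = 1000

# P(A) = 0.7: employees who use the health insurance plan.
plan_users = round(0.70 * total_employees)

# P(B|A) = 0.4: smokers among the plan users.
smokers_among_users = round(0.40 * plan_users)

p_a = plan_users / total_employees
p_a_and_b = smokers_among_users / total_employees

# Definition of conditional probability: P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

print(f"P(A)       = {p_a:.2f}")
print(f"P(A and B) = {p_a_and_b:.2f}")
print(f"P(B|A)     = {p_b_given_a:.2f}")
```

Dividing the number of smoking plan users by the number of plan users recovers the 40% stated in the question, whatever head count is assumed.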

## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes algorithm, which is a popular algorithm for classification tasks in machine learning. While they are both based on the same underlying principles, there are some differences in the way they handle data.

Bernoulli Naive Bayes is typically used when the features are binary, i.e. each feature takes only the values 0 or 1. It is commonly used in text classification tasks, where each feature represents the presence or absence of a particular word in a document. Each feature is modeled as a Bernoulli random variable, with the assumption that the features are conditionally independent given the class: once the class is known, the presence or absence of one feature does not change the probability of the presence or absence of any other. The algorithm estimates, for each class, the probability that each feature is present, and then uses Bayes' theorem to compute the probability of each class given the observed pattern of presences and absences. Note that an absent feature also counts as evidence, contributing a factor of (1 - p) to the likelihood.

Multinomial Naive Bayes, on the other hand, is used when the features are discrete counts, i.e. non-negative integers. It is commonly used in text classification tasks, where each feature represents the number of times a particular word occurs in a document. The vector of counts for a document is modeled as a draw from a class-specific multinomial distribution, with the assumption that each word occurrence is drawn independently given the class. The algorithm estimates, for each class, the probability of each word, and then uses Bayes' theorem to compute the probability of each class given the observed counts. Unlike the Bernoulli variant, a word with a count of zero contributes nothing to the likelihood, while a repeated word contributes once per occurrence.

In summary, Bernoulli Naive Bayes is used for binary features, while Multinomial Naive Bayes is used for discrete count features. Both algorithms assume that each feature is conditionally independent given the class, and both calculate the conditional probability of each class given the features using Bayes' theorem.
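
To make the contrast concrete, here is a minimal sketch of the two class-conditional likelihoods for a single class. The three-word vocabulary, the probabilities, and the toy document are all hypothetical values chosen for illustration; they are not estimated from any dataset.

```python
import math

# Hypothetical parameters for ONE class over a 3-word vocabulary.
vocab = ["free", "meeting", "money"]
# Bernoulli NB: probability that each word appears at least once in a document.
p_present = [0.8, 0.1, 0.6]
# Multinomial NB: probability that a single word occurrence is this word (sums to 1).
p_word = [0.5, 0.1, 0.4]

# Toy document: "free" occurs 3 times, "meeting" 0 times, "money" once.
counts = [3, 0, 1]


def bernoulli_log_likelihood(counts, p_present):
    """Sum over the whole vocabulary; an absent word contributes log(1 - p)."""
    total = 0.0
    for count, p in zip(counts, p_present):
        present = count > 0  # counts are binarized: only presence matters
        total += math.log(p) if present else math.log(1.0 - p)
    return total


def multinomial_log_likelihood(counts, p_word):
    """Each occurrence contributes log(p); zero-count words contribute nothing.

    The multinomial coefficient is omitted because it is identical for every
    class and therefore does not affect which class scores highest.
    """
    return sum(count * math.log(p) for count, p in zip(counts, p_word))


bern = bernoulli_log_likelihood(counts, p_present)
multi = multinomial_log_likelihood(counts, p_word)
print(f"Bernoulli log-likelihood:   {bern:.4f}")
print(f"Multinomial log-likelihood: {multi:.4f}")

# Doubling the occurrences of "free" leaves the Bernoulli score unchanged
# but changes the multinomial score.
repeated = [6, 0, 1]
bern_repeated = bernoulli_log_likelihood(repeated, p_present)
multi_repeated = multinomial_log_likelihood(repeated, p_word)
print(f"Bernoulli with repeats:     {bern_repeated:.4f}")
print(f"Multinomial with repeats:   {multi_repeated:.4f}")
```

The Bernoulli score depends only on which words are present (and which are absent), whereas the multinomial score depends on how often each word occurs.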

## Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes is a classification algorithm that is commonly used in natural language processing tasks such as text classification. It is a variant of the Naive Bayes algorithm that assumes that the features are binary or Boolean, indicating whether a particular feature is present or not.

Bernoulli Naive Bayes has no built-in mechanism for missing values. In principle, the independence assumption would allow a missing feature's factor to be left out of the likelihood, but standard implementations do not do this; scikit-learn's `BernoulliNB`, for example, rejects input that contains NaN. Missing entries therefore have to be dealt with before the model is fitted. A common shortcut is to encode a missing entry as 0, which treats it as "feature absent". This is simple, but it conflates "not observed" with "observed to be absent", and Bernoulli Naive Bayes explicitly uses absences as evidence.

Because the presence or absence of certain features can have a significant impact on the classification accuracy of the algorithm, it is better to handle missing values explicitly before applying Bernoulli Naive Bayes. For binary features, suitable options are imputing the most frequent value (the mode) of that feature or adding a separate indicator feature that marks where values were missing. Mean or median imputation is generally not appropriate here, since the mean of a binary feature is usually not 0 or 1.
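
One way to fill in missing binary features before fitting is most-frequent (mode) imputation. The snippet below is a minimal pure-Python sketch on a hypothetical toy matrix in which `None` marks a missing entry; the helper name `impute_most_frequent` is introduced here for illustration only.

```python
from collections import Counter

# Hypothetical binary feature matrix; None marks a missing entry.
X = [
    [1, 0, None],
    [1, None, 0],
    [0, 0, 1],
    [1, 1, 1],
    [None, 0, 1],
]


def impute_most_frequent(rows):
    """Replace None in each column with that column's most frequent observed value."""
    n_cols = len(rows[0])
    fill_values = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # most_common(1) returns [(value, count)]; on a tie, the value
        # encountered first in the column wins.
        fill_values.append(Counter(observed).most_common(1)[0][0])
    imputed = [
        [fill_values[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]
    return imputed, fill_values


X_imputed, fill_values = impute_most_frequent(X)
print("Fill value per column:", fill_values)
for row in X_imputed:
    print(row)
```

In a scikit-learn workflow, `SimpleImputer(strategy="most_frequent")` plays the same role; it should be fitted on the training folds only so that no information from the test fold leaks into the imputation.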

## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks, and it handles them natively. For every class, the algorithm estimates a prior probability and, for each feature, a mean and a variance. During prediction, it computes the posterior probability of each class for the input instance using Bayes' theorem and selects the class with the highest probability, regardless of how many classes there are.

A decomposition such as "one-vs-rest" (also called "one-vs-all"; the two names refer to the same strategy) is therefore not required. In that strategy, one binary classifier is trained per class, treating instances of that class as positive examples and instances of all other classes as negative examples, and the class whose classifier gives the highest score is predicted. It is mainly useful for models that are inherently binary; Gaussian Naive Bayes can be wrapped this way, but there is usually no need to.

Overall, Gaussian Naive Bayes is a simple and efficient algorithm for multi-class classification tasks, especially in situations where the feature variables are continuous and approximately Gaussian within each class.
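
As a minimal sketch of how a Gaussian Naive Bayes model can score several classes directly, the pure-Python code below fits one Gaussian per feature per class and predicts by taking the argmax over all classes. The toy data, the function names, and the small `var_floor` constant (a simplified stand-in for the variance smoothing that library implementations apply) are assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior plus per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps the variance strictly positive (illustrative simplification).
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_posterior(model, x):
    """Unnormalized log P(class | x) for every class in the model."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        scores[label] = score
    return scores


def predict(model, x):
    """Pick the class with the highest posterior; works for any number of classes."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)


# Hypothetical toy data with three well-separated classes.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
     [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.8, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, x) for x in [[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]]]
print(predictions)
```

Nothing in `predict` depends on the number of classes being two: adding a class only adds one more entry to the dictionary of scores.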

<hr>

## Q5. Assignment

**Data preparation**: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains numeric features extracted from email messages (word frequencies, character frequencies, and capital-letter run lengths), and the goal is to predict whether a message is spam or not based on these input features.

**Implementation**: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results**: Report the following performance metrics for each classifier: Accuracy, Precision, Recall & F1 score.

**Discussion**: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion**: Summarise your findings and provide some suggestions for future work.

**PLEASE NOTE: This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.**

<hr>

In [14]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate
# Create classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [15]:
# Load data
df = pd.read_csv('spambase.data', delimiter=',', header=None)

In [16]:
df.head()

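| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |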


In [17]:
# Assign descriptive column names to the DataFrame
column_names = ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our',
                'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail',
                'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses',
                'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit',
                'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl',
                'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
                'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',
                'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs',
                'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu',
                'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!',
                'char_freq_$', 'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest',
                'capital_run_length_total', 'is_spam']

df.columns = column_names
df.head()

| | `word_freq_make` | `word_freq_address` | `word_freq_all` | `word_freq_3d` | `word_freq_our` | `word_freq_over` | `word_freq_remove` | `word_freq_internet` | `word_freq_order` | `word_freq_mail` | ... | `char_freq_;` | `char_freq_(` | `char_freq_[` | `char_freq_!` | `char_freq_$` | `char_freq_#` | `capital_run_length_average` | `capital_run_length_longest` | `capital_run_length_total` | `is_spam` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [18]:
# Separate features and labels
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [23]:
# Evaluate each classifier on the dataset with 10-fold cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']

for clf in [bernoulli_nb, multinomial_nb, gaussian_nb]:
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(f"{clf.__class__.__name__}:")
    print(f"  Accuracy: {scores['test_accuracy'].mean():.3f}")
    print(f"  Precision: {scores['test_precision'].mean():.3f}")
    print(f"  Recall: {scores['test_recall'].mean():.3f}")
    print(f"  F1 score: {scores['test_f1'].mean():.3f}", '\n')

BernoulliNB:
  Accuracy: 0.884
  Precision: 0.887
  Recall: 0.815
  F1 score: 0.848 

MultinomialNB:
  Accuracy: 0.786
  Precision: 0.739
  Recall: 0.721
  F1 score: 0.728 

GaussianNB:
  Accuracy: 0.822
  Precision: 0.710
  Recall: 0.957
  F1 score: 0.813 
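
The mean scores printed above can be compared metric by metric. The snippet below is a small sketch that copies those reported numbers into a dictionary and picks the best classifier per metric; the variable names are introduced here for illustration, and no new results are computed.

```python
# Mean 10-fold cross-validation scores copied from the printed output.
results = {
    "BernoulliNB":   {"accuracy": 0.884, "precision": 0.887, "recall": 0.815, "f1": 0.848},
    "MultinomialNB": {"accuracy": 0.786, "precision": 0.739, "recall": 0.721, "f1": 0.728},
    "GaussianNB":    {"accuracy": 0.822, "precision": 0.710, "recall": 0.957, "f1": 0.813},
}

metrics = ["accuracy", "precision", "recall", "f1"]

# For each metric, find the classifier with the highest mean score.
best = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in metrics
}

for metric in metrics:
    winner = best[metric]
    print(f"{metric:<10} best: {winner} ({results[winner][metric]:.3f})")
```

On these numbers, BernoulliNB leads on accuracy, precision, and F1 score, while GaussianNB has the highest recall but the lowest precision of the three.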

