Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Ans: The question asks for $ P(S|H) $, and that value is stated directly in the problem, so Bayes' Theorem is not needed.

Let:
- $ S $ be the event that an employee is a smoker.
- $ H $ be the event that an employee uses the health insurance plan.

We are given:
- $ P(H) = 0.70 $ (probability that an employee uses the health insurance plan)
- $ P(S|H) = 0.40 $ (probability that an employee is a smoker given that they use the health insurance plan)

The statement "40% of the employees who use the plan are smokers" is exactly the conditional probability of being a smoker given plan usage. Therefore:

$ P(S|H) = 0.40 $

Bayes' Theorem, $ P(S|H) = \frac{P(H|S) \times P(S)}{P(H)} $, would be required only if we were given $ P(H|S) $ and $ P(S) $ instead. Neither quantity is given here (in particular, $ P(H|S) $ is not 0.40), and neither is needed.

So the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
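
As a quick numeric check using only the two figures stated in the question, the sketch below applies the product rule to get the joint probability $ P(S \cap H) $ and then divides by $ P(H) $ to recover the conditional probability. The variable names are illustrative and not part of the original problem.

```python
from fractions import Fraction

# Values stated in the question (exact fractions avoid float rounding).
p_h = Fraction(70, 100)          # P(H): employee uses the plan
p_s_given_h = Fraction(40, 100)  # P(S|H): smoker among plan users

# Product rule: P(S and H) = P(S|H) * P(H).
p_s_and_h = p_s_given_h * p_h

# Dividing the joint probability by P(H) recovers the conditional probability.
recovered = p_s_and_h / p_h

print(float(p_s_and_h))  # 0.28
print(float(recovered))  # 0.4
```

In other words, 28% of all employees are smokers who use the plan, and among plan users the share of smokers is 40%.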

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Ans: The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in how they handle feature variables and the underlying distribution of the data:

1. **Feature Variables:**
   - **Bernoulli Naive Bayes:** Assumes that features are binary variables, where each feature represents the presence or absence of a particular characteristic. It is typically used for binary feature data.
   - **Multinomial Naive Bayes:** Assumes that features represent counts or frequencies of events, often in the form of integer counts. It is suitable for data with multiple discrete features, such as word counts in text classification.

2. **Underlying Distribution:**
   - **Bernoulli Naive Bayes:** Assumes a Bernoulli distribution for each feature, where each feature is a binary variable with some probability of being present (1) or absent (0) in a given class. The absence of a feature contributes to the likelihood explicitly, so a word that does not appear still counts as evidence.
   - **Multinomial Naive Bayes:** Assumes that the feature vector as a whole follows a multinomial distribution given the class, where each feature is the occurrence count of a discrete event (for example, a word). A feature with a count of zero contributes nothing to the likelihood.

3. **Data Representation:**
   - **Bernoulli Naive Bayes:** Often used with binary feature representations, such as document-term matrices with 0s and 1s indicating absence or presence of words.
   - **Multinomial Naive Bayes:** Typically used with count-based representations such as word counts. The model is defined for integer counts, although fractional values such as term frequency-inverse document frequency (TF-IDF) weights often work in practice.

4. **Usage:**
   - **Bernoulli Naive Bayes:** Commonly used in text classification tasks where the presence or absence of words (binary features) in documents is relevant.
   - **Multinomial Naive Bayes:** Widely used in text classification tasks for modeling word frequencies or counts, as well as in other applications involving count-based data.

In summary, while both Bernoulli Naive Bayes and Multinomial Naive Bayes are variants of Naive Bayes classifiers and share similarities in their assumptions and application areas, they differ in how they represent and model the underlying distribution of feature variables, making each more suitable for certain types of data.
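
The difference in how the two models score a document can be sketched with a few lines of plain Python. This is an illustration only: the three-word vocabulary and the parameter values in `p_present` and `p_word` are hypothetical, and a real classifier would estimate them from training data with smoothing.

```python
import math

# Hypothetical per-class parameters for a 3-word vocabulary (illustrative values).
p_present = [0.8, 0.5, 0.1]  # Bernoulli: P(word i appears in a document | class)
p_word = [0.6, 0.3, 0.1]     # Multinomial: P(a token is word i | class), sums to 1

def bernoulli_log_likelihood(x_binary, p):
    """Sum over ALL features; an absent word contributes log(1 - p_i)."""
    return sum(
        math.log(p_i) if x_i else math.log(1.0 - p_i)
        for x_i, p_i in zip(x_binary, p)
    )

def multinomial_log_likelihood(x_counts, p):
    """Each count weights its word's log-probability; a zero count adds nothing.
    The multinomial coefficient is omitted because it is identical for every class."""
    return sum(c * math.log(p_i) for c, p_i in zip(x_counts, p))

counts = [3, 0, 1]                            # word counts in one document
binary = [1 if c > 0 else 0 for c in counts]  # presence/absence view of the same document

bern_ll = bernoulli_log_likelihood(binary, p_present)
multi_ll = multinomial_log_likelihood(counts, p_word)
print(bern_ll, multi_ll)
```

Note that the second word, which does not occur, changes the Bernoulli score through `log(1 - 0.5)` but leaves the multinomial score untouched, while the count of 3 for the first word matters only to the multinomial score.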

Q3. How does Bernoulli Naive Bayes handle missing values?

Ans: The Bernoulli Naive Bayes model does not itself define how missing values are treated, so handling depends on the implementation and on preprocessing. scikit-learn's `BernoulliNB`, for example, raises an error on input containing NaN, so missing values must be dealt with before fitting. Here are some common approaches:

1. **Ignoring Missing Values:**
   - Instances with missing values can be dropped before training. Alternatively, because Naive Bayes treats features as conditionally independent given the class, an implementation can skip the likelihood terms of the missing features and predict from the observed features only.

2. **Imputation:**
   - Another approach is to impute missing values before applying Bernoulli Naive Bayes. This could involve replacing missing values with a specific placeholder value (e.g., 0 or 1), the mode (most common value) of the feature, or some other statistic calculated from the available data.

3. **Model-Based Imputation:**
   - More sophisticated methods involve using a model to predict missing values based on other features. For example, a separate classifier or regression model could be trained to predict the missing values based on the observed features. The predicted values can then be used in the Bernoulli Naive Bayes classifier.

4. **Treating Missing Values as a Separate Category:**
   - In some cases, missing values may be treated as a separate category or class. This means that a new category is introduced to represent missing values, and the classifier learns to distinguish between instances with missing values and instances with observed values.

The choice of handling missing values depends on various factors such as the nature of the data, the amount of missing data, and the specific requirements of the problem at hand. It's essential to carefully consider the implications of each approach and its potential impact on the performance of the classifier. Additionally, it's advisable to experiment with different strategies and evaluate their effectiveness through cross-validation or other validation techniques.
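
The simple imputation strategy described above can be sketched as follows. This is a minimal illustration, assuming missing entries are marked with `None` and that every column has at least one observed value; the helper name `impute_mode` and the toy data are hypothetical.

```python
from collections import Counter

def impute_mode(rows):
    """Replace None in each binary column with that column's most common observed value.
    Assumes every column contains at least one observed (non-None) value."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Toy binary data; None marks a missing value (hypothetical example).
X_missing = [
    [1, 0, None],
    [1, None, 0],
    [0, 0, 0],
    [1, 1, 1],
]
X_filled = impute_mode(X_missing)
print(X_filled)  # [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
```

The imputed matrix contains only 0s and 1s, so it can be passed to a Bernoulli Naive Bayes classifier without further changes.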

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Ans: Yes. Gaussian Naive Bayes handles multi-class classification natively; no one-vs-all (OvA) or one-vs-one (OvO) decomposition is required.

For each class $ c $, the model estimates a prior $ P(c) $ and, for every feature $ x_i $, a mean $ \mu_{c,i} $ and variance $ \sigma_{c,i}^2 $ from the training instances of that class. At prediction time it evaluates $ P(c) \prod_i \mathcal{N}(x_i \mid \mu_{c,i}, \sigma_{c,i}^2) $ for every class and assigns the class with the largest value. The procedure is identical whether there are 2 classes or $ n $ classes.

OvA and OvO are generic reductions intended for classifiers that are inherently binary. OvA (also known as one-vs-rest) trains $ n $ binary classifiers, each separating one class from all others, and predicts the class whose classifier gives the highest score. OvO trains one binary classifier per pair of classes, $ \frac{n(n-1)}{2} $ in total, and predicts by majority vote.

Either reduction can be wrapped around Gaussian Naive Bayes, but doing so adds training cost without being necessary, since the model already produces a posterior probability for every class.
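
As a minimal sketch of multi-class prediction with Gaussian Naive Bayes, the code below fits one prior and one set of per-feature means and variances per class, then predicts by taking the class with the highest log-posterior. It is a plain-Python illustration on hypothetical toy data, not scikit-learn's implementation; the function names and the `var_floor` smoothing constant are assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior plus per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor keeps the variance strictly positive (assumed smoothing constant).
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior (up to a shared constant)."""
    def log_posterior(params):
        prior, means, variances = params
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data with three classes (hypothetical values).
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
predictions = [predict(model, p) for p in ([1.0, 1.0], [5.0, 5.0], [9.0, 1.0])]
print(predictions)  # ['a', 'b', 'c']
```

The same `max` over classes is used regardless of how many classes the training labels contain.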

Q5. Assignment:
- Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.
- Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.
- Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
- Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
- Conclusion:
Summarise your findings and provide some suggestions for future work.
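
For reference, the four metrics requested in the Results step can be computed from confusion-matrix counts. The sketch below is a plain-Python illustration on hypothetical labels; the helper name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Return 0.0 when a denominator is zero (convention assumed for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision penalises legitimate messages flagged as spam (false positives), while recall penalises spam that slips through (false negatives); F1 is the harmonic mean of the two.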

In [3]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [48]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from ucimlrepo import fetch_ucirepo

# Fetch the Spambase dataset (UCI repository id 94)
spambase = fetch_ucirepo(id=94)

# Features and target are returned as pandas DataFrames
df = spambase.data.features
X = df.copy()
# Flatten the single-column target to a 1-D array, the shape scikit-learn expects
y = spambase.data.targets.to_numpy().ravel()


In [49]:
X.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [50]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 57 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 

In [51]:
X.isnull().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

In [52]:
X.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.031869,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.285735,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0


- The dataset has no missing values.

In [53]:
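# BernoulliNB binarizes at 0.0 by default, so after standardization a feature is 1 when above its mean
bernoulli_nb = make_pipeline(StandardScaler(), BernoulliNB())
# MultinomialNB requires non-negative inputs, so features are scaled to [0, 1]
multinomial_nb = make_pipeline(MinMaxScaler(), MultinomialNB())
# GaussianNB fits a per-class mean and variance to each standardized feature
gaussian_nb = make_pipeline(StandardScaler(), GaussianNB())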


In [54]:
scorers = {"Accuracy": make_scorer(accuracy_score),
           "Precision": make_scorer(precision_score),
           "Recall": make_scorer(recall_score),
           "F1 Score": make_scorer(f1_score)}


In [55]:
classifiers = {"Bernoulli Naive Bayes": bernoulli_nb,
               "Multinomial Naive Bayes": multinomial_nb,
               "Gaussian Naive Bayes": gaussian_nb}

In [56]:
from sklearn.model_selection import cross_validate

results = {}
for clf_name, clf in classifiers.items():
    # One 10-fold run per classifier evaluates all four scorers on the same fitted models
    cv_out = cross_validate(clf, X, y, cv=10, scoring=scorers)
    results[clf_name] = {name: cv_out[f"test_{name}"].mean() for name in scorers}

In [57]:
results_df = pd.DataFrame(results)

In [58]:
print(results_df)

           Bernoulli Naive Bayes  Multinomial Naive Bayes  \
Accuracy                0.900891                 0.878074   
Precision               0.909100                 0.905360   
Recall                  0.834539                 0.776705   
F1 Score                0.869448                 0.833776   

           Gaussian Naive Bayes  
Accuracy               0.818730  
Precision              0.706348  
Recall                 0.957504  
F1 Score               0.810556  
