**Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?**

**Answer:**

Let S be the event that an employee is a smoker, and H be the event that the employee uses the health insurance plan. 

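We want P(S|H), the probability that an employee is a smoker given that he/she uses the health insurance plan.

The statement "40% of the employees who use the plan are smokers" is already a conditional probability, so it gives the answer directly:

P(S|H) = 0.4

Bayes' theorem, P(S|H) = P(H|S) * P(S) / P(H), is not needed here. It also cannot be evaluated on its own, because neither P(S) nor P(H|S) is given. As a consistency check, use the definition of conditional probability:

P(H) = 0.7

P(S and H) = P(S|H) * P(H) = 0.4 * 0.7 = 0.28

P(S|H) = P(S and H) / P(H) = 0.28 / 0.7 = 0.4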

The probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4 or 40%.
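
The arithmetic can be checked with exact fractions. This is only a small sketch; the variable names are illustrative and the two input values are the ones stated in the question.

```python
from fractions import Fraction

# Values taken from the question.
p_h = Fraction(7, 10)          # P(H): employee uses the plan
p_s_given_h = Fraction(4, 10)  # P(S|H): smoker among plan users

# Joint probability from the product rule: P(S and H) = P(S|H) * P(H).
p_s_and_h = p_s_given_h * p_h

# Recover the conditional probability from its definition.
recovered = p_s_and_h / p_h

print(p_s_and_h)   # 7/25, i.e. 0.28
print(recovered)   # 2/5, i.e. 0.4
```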



**Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?**

**Answer:**

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in how they represent the input features. Bernoulli Naive Bayes assumes that the input features are binary (i.e., 0 or 1) while Multinomial Naive Bayes assumes that the input features are counts of occurrences (i.e., non-negative integers).

In Bernoulli Naive Bayes, each feature is treated as a binary variable indicating whether or not it occurs in the document. For example, in a text classification problem, each feature could represent the presence or absence of a particular word in the document. The probability of each feature given the class is modeled using a Bernoulli distribution, so the absence of a word also contributes evidence to the prediction.

In Multinomial Naive Bayes, each feature represents the count of occurrences of a particular word in the document. The vector of counts given the class is modeled using a multinomial distribution, so only the words that actually occur contribute, weighted by how often they occur.

In summary, Bernoulli Naive Bayes assumes binary features and uses a Bernoulli distribution to model the probabilities of the features given the class, while Multinomial Naive Bayes assumes count features and uses a Multinomial distribution to model the probabilities of the features given the class.
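
The difference shows up in the statistics each model collects during training. The sketch below uses a made-up word-count matrix (the data and variable names are assumptions for illustration) and scikit-learn's `BernoulliNB` and `MultinomialNB`, whose fitted `feature_count_` attribute holds the per-class totals.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: 4 documents x 3 vocabulary words (made-up data).
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
], dtype=float)
y = np.array([1, 1, 0, 0])

# BernoulliNB binarizes at 0.0 by default, so only presence/absence is used.
bnb = BernoulliNB().fit(X_counts, y)
# MultinomialNB uses the counts themselves.
mnb = MultinomialNB().fit(X_counts, y)

# Rows follow the sorted class labels, here [0, 1].
print(bnb.feature_count_)  # documents per class in which each word appears
print(mnb.feature_count_)  # total occurrences of each word per class
```

For class 0, the second word appears in two documents (Bernoulli view) but five times in total (multinomial view).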

**Q3. How does Bernoulli Naive Bayes handle missing values?**

**Answer:**

Bernoulli Naive Bayes has no built-in mechanism for missing values. Because the model uses both the presence and the absence of each feature as evidence, silently treating a missing value as 0 is not neutral: it tells the model that the feature was absent. In a custom implementation, a missing feature can be skipped by leaving its term out of the likelihood product, which the conditional independence assumption makes straightforward. Library implementations may be stricter; scikit-learn's BernoulliNB, for example, rejects inputs that contain NaN.

In practice, instances with missing values are either excluded from the training set or imputed with some default value (e.g., 0, 1, or the most frequent value of the feature). The choice of imputation method can affect the model's performance and should be based on the specific problem and dataset.
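
As a sketch of the imputation option with scikit-learn, the pipeline below fills each missing entry with the most frequent value of its column before fitting `BernoulliNB`. The data is made up for illustration, and most-frequent imputation is just one possible choice, not a recommendation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with two missing entries (np.nan).
X = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
y = np.array([1, 1, 0, 0, 1, 1])

# Impute first, then classify; the imputer is fit on the training data only.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# Inspect what the classifier actually saw after imputation.
X_imputed = model.named_steps["simpleimputer"].transform(X)
print(X_imputed)
print(model.predict(X))
```

Putting the imputer inside the pipeline means that, under cross-validation, the fill values are learned from each training fold rather than from the full dataset.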

**Q4. Can Gaussian Naive Bayes be used for multi-class classification?**

**Answer:**

Yes, Gaussian Naive Bayes can be used for multi-class classification. In multi-class classification, the goal is to predict a target variable with three or more possible outcomes, and Gaussian Naive Bayes handles this natively.

No one-vs-rest scheme or other extension of a binary classifier is required. For each class, the algorithm estimates a prior probability and, for every feature, a mean and a variance. For example, if there are three classes (A, B, and C), it estimates three sets of parameters, one per class. To classify a new sample, it computes the posterior probability of each class from these parameters and predicts the class with the highest posterior.
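
A minimal from-scratch sketch of the multi-class case: estimate a prior plus per-feature mean and variance for each class, then predict the class with the highest log-posterior. The function names, the toy data, and the small variance floor are assumptions of this sketch, not part of any library.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant keeps variances strictly positive (sketch assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        params[label] = (n / len(X), means, variances)
    return params


def predict_gaussian_nb(params, row):
    """Return the class with the highest log-posterior for one sample."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total

    return max(params, key=lambda label: log_posterior(*params[label]))


# Made-up two-feature data with three well-separated classes.
X = [[1.0, 2.1], [1.2, 1.9], [5.0, 5.2], [5.1, 4.8], [9.0, 1.0], [9.2, 1.1]]
y = ["A", "A", "B", "B", "C", "C"]

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, [5.05, 5.0]))  # B
print(predict_gaussian_nb(params, [9.1, 1.0]))   # C
```

The number of classes only changes how many parameter sets are stored; the prediction rule is the same for two classes or twenty.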




**Q5. Assignment:**

**Data preparation:**

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). 

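Each row of this dataset describes one email message through numeric features (word frequencies, character frequencies, and capital-letter run lengths). The goal is to predict whether a message is spam or not based on these features.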

**Implementation:**

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results:**

Report the following performance metrics for each classifier:

Accuracy

Precision

Recall

F1 score

**Discussion:**

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion:**

Summarise your findings and provide some suggestions for future work.

In [1]:
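import urllib.request

base_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/"

# The attribute names file is read later to label the columns
urllib.request.urlretrieve(base_url + "spambase.names", "spambase.names")

# Download the data itself (comma-separated, no header row)
url = base_url + "spambase.data"
filename = "spambase.csv"

urllib.request.urlretrieve(url, filename)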

('spambase.csv', <http.client.HTTPMessage at 0x7f4c24fa0b80>)

In [6]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import precision_score, recall_score, f1_score

# Read the attribute names: each line after the header has the form "name: type"
names = pd.read_csv('spambase.names', skiprows=32, sep=':\\s+', engine='python', names=['attr', 'type'])
names = list(names['attr'])
# The last column of the data file is the class label (1 = spam, 0 = not spam)
names.append('label')

# Load the dataset using the attribute names as column headers
df = pd.read_csv('spambase.csv', names=names)
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,label
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [22]:
# Create feature and labels
X = df.drop('label', axis=1)
y = df['label']

In [23]:
# Create Bernoulli Naive Bayes classifier and evaluate its performance using 10-fold cross-validation
bnb = BernoulliNB()
bnb.fit(X, y)
bnb_scores = cross_val_score(bnb, X, y, cv=10)
bnb_accuracy = bnb_scores.mean()

# Create Multinomial Naive Bayes classifier and evaluate its performance using 10-fold cross-validation
mnb = MultinomialNB()
mnb.fit(X, y)
mnb_scores = cross_val_score(mnb, X, y, cv=10)
mnb_accuracy = mnb_scores.mean()

# Create Gaussian Naive Bayes classifier and evaluate its performance using 10-fold cross-validation
gnb = GaussianNB()
gnb.fit(X, y)
gnb_scores = cross_val_score(gnb, X, y, cv=10)
gnb_accuracy = gnb_scores.mean()

In [24]:
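# Calculate precision, recall, and F1 score for each classifier using the default decision threshold
# Note: these are computed on the same data the models were fit on (not cross-validated),
# so they are likely more optimistic than the cross-validated accuracy above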
bnb_precision = precision_score(y, bnb.predict(X))
bnb_recall = recall_score(y, bnb.predict(X))
bnb_f1 = f1_score(y, bnb.predict(X))

mnb_precision = precision_score(y, mnb.predict(X))
mnb_recall = recall_score(y, mnb.predict(X))
mnb_f1 = f1_score(y, mnb.predict(X))

gnb_precision = precision_score(y, gnb.predict(X))
gnb_recall = recall_score(y, gnb.predict(X))
gnb_f1 = f1_score(y, gnb.predict(X))

In [26]:
# Print the performance metrics for each classifier
print("Bernoulli Naive Bayes Accuracy:", bnb_accuracy)
print("Bernoulli Naive Bayes Precision:", bnb_precision)
print("Bernoulli Naive Bayes Recall:", bnb_recall)
print("Bernoulli Naive Bayes F1 Score:", bnb_f1)
print('---------------------------------------------------------')
print("Multinomial Naive Bayes Accuracy:", mnb_accuracy)
print("Multinomial Naive Bayes Precision:", mnb_precision)
print("Multinomial Naive Bayes Recall:", mnb_recall)
print("Multinomial Naive Bayes F1 Score:", mnb_f1)
print('---------------------------------------------------------')
print("Gaussian Naive Bayes Accuracy:", gnb_accuracy)
print("Gaussian Naive Bayes Precision:", gnb_precision)
print("Gaussian Naive Bayes Recall:", gnb_recall)
print("Gaussian Naive Bayes F1 Score:", gnb_f1)

Bernoulli Naive Bayes Accuracy: 0.8839380364047911
Bernoulli Naive Bayes Precision: 0.8860911270983214
Bernoulli Naive Bayes Recall: 0.815223386651958
Bernoulli Naive Bayes F1 Score: 0.8491812697500718
---------------------------------------------------------
Multinomial Naive Bayes Accuracy: 0.7863496180326323
Multinomial Naive Bayes Precision: 0.7440273037542662
Multinomial Naive Bayes Recall: 0.7214561500275786
Multinomial Naive Bayes F1 Score: 0.7325679081489778
---------------------------------------------------------
Gaussian Naive Bayes Accuracy: 0.8217730830896915
Gaussian Naive Bayes Precision: 0.7012096774193548
Gaussian Naive Bayes Recall: 0.9591836734693877
Gaussian Naive Bayes F1 Score: 0.8101560680177031
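
The precision, recall, and F1 values above come from predictions on the same rows the models were trained on, whereas the accuracy is cross-validated. One way to obtain all four metrics from the same 10 folds is scikit-learn's `cross_validate` with several scorers. The sketch below runs on synthetic non-negative data so that it is self-contained (`X_demo` and `y_demo` are stand-ins and an assumption of this example); substituting the notebook's `X` and `y` would score Spambase, with results that are not shown here.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic, non-negative stand-in data (an assumption of this sketch);
# replace X_demo / y_demo with the notebook's X and y to score Spambase.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rates = np.where(y_demo[:, None] == 1, 2.0, 0.5) * np.ones((1, 10))
X_demo = rng.poisson(rates).astype(float)

scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = {}
for name, clf in [("Bernoulli", BernoulliNB()),
                  ("Multinomial", MultinomialNB()),
                  ("Gaussian", GaussianNB())]:
    # Every metric is computed on the held-out fold only.
    scores = cross_validate(clf, X_demo, y_demo, cv=10, scoring=scoring)
    cv_results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in cv_results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```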


**Discussion and Conclusion**

In the above code, we implemented Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. We then used 10-fold cross-validation to evaluate the accuracy of each classifier on the Spambase dataset.

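From the results obtained, Bernoulli Naive Bayes achieved the highest mean cross-validated accuracy (about 0.88), followed by Gaussian Naive Bayes (about 0.82) and Multinomial Naive Bayes (about 0.79). Gaussian Naive Bayes had the highest recall (0.96) but the lowest precision (0.70), which means it flags many legitimate messages as spam. Note that the precision, recall, and F1 values were computed on the same data the models were fit on, so they are likely optimistic compared with the cross-validated accuracy.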

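The reason Bernoulli Naive Bayes performed the best might be that, with its default `binarize=0.0` setting, it reduces each frequency feature to whether the word or character appears in the message at all, and the presence or absence of certain tokens is a strong spam signal. The raw features, by contrast, are skewed frequencies and capital-letter run lengths on very different scales, which match neither the normality assumption of Gaussian Naive Bayes nor the count assumption of Multinomial Naive Bayes particularly well.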

However, one limitation of Naive Bayes that we observed is that it makes the strong assumption that the features are independent, which may not always hold true in real-world datasets. In addition, Naive Bayes can also suffer from the problem of overfitting if the training data is too small or if the feature space is too large.

Here are some suggestions to improve the accuracy of the Naive Bayes classifiers:

**Feature engineering:** One way to improve accuracy is to use feature engineering techniques to create new features that capture more information about the dataset. For example, in the case of text classification, you could use techniques such as TF-IDF to weight the importance of different words in the text.

**Hyperparameter tuning:** The scikit-learn library provides various hyperparameters that can be tuned to improve the performance of the Naive Bayes classifier. For example, for the Bernoulli Naive Bayes classifier, we can tune the binarization threshold or the alpha value. For Gaussian Naive Bayes, we can adjust the var_smoothing hyperparameter.

**Handling missing values:** If the dataset has missing values, we could use techniques such as mean imputation, median imputation, or mode imputation to fill in the missing values before training the classifier.

**Other algorithms:** Finally, it's important to note that Naive Bayes is not always the best algorithm for every type of dataset. Therefore, it's worth exploring other algorithms such as decision trees, random forests, or support vector machines to see if they perform better on the given dataset.
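
As a sketch of the hyperparameter tuning suggestion above, `GridSearchCV` can search over `alpha` and `binarize` for `BernoulliNB` and over `var_smoothing` for `GaussianNB`. The grids, the 5-fold setting, the F1 scorer, and the synthetic stand-in data (`X_demo`, `y_demo`) are all assumptions chosen for illustration; suitable values depend on the dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic, non-negative stand-in data (an assumption of this sketch);
# replace X_demo / y_demo with the real feature matrix and labels.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rates = np.where(y_demo[:, None] == 1, 2.0, 0.5) * np.ones((1, 10))
X_demo = rng.poisson(rates).astype(float)

searches = {
    "Bernoulli": GridSearchCV(
        BernoulliNB(),
        {"alpha": [0.01, 0.1, 1.0], "binarize": [0.0, 1.0, 2.0]},
        cv=5, scoring="f1"),
    "Gaussian": GridSearchCV(
        GaussianNB(),
        {"var_smoothing": [1e-9, 1e-6, 1e-3]},
        cv=5, scoring="f1"),
}

for name, search in searches.items():
    # Fits one model per parameter combination and fold, then refits the best.
    search.fit(X_demo, y_demo)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The best score reported by the search is itself tuned on these folds, so a separate held-out set or nested cross-validation gives a fairer final estimate.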

In conclusion, Naive Bayes classifiers are simple and fast machine learning algorithms that can perform well on certain types of datasets. While they have some limitations, they can be a useful tool in a data scientist's toolbox. For future work, one could explore different types of datasets and compare the performance of Naive Bayes with other machine learning algorithms to get a better understanding of its strengths and weaknesses.