### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

Although this question appears in a Naive Bayes context, it needs only the definition of conditional probability. No conditional-independence assumption is involved, because there is a single event of interest and a single conditioning event.

Let S denote the event that an employee is a smoker, and H denote the event that an employee uses the company's health insurance plan. We want the probability of S given H, i.e., P(S|H).

From the given information:

P(H) = 0.7 (70% of employees use the insurance plan)

P(S|H) = 0.4 (40% of the employees who use the plan are smokers)

The second statement is already the quantity we are asked for, since "of the employees who use the plan" restricts attention to H. We can confirm this through the joint probability:

P(S ∩ H) = P(S|H) * P(H) = 0.4 * 0.7 = 0.28

P(S|H) = P(S ∩ H) / P(H) = 0.28 / 0.7 = 0.4

Bayes' theorem, P(S|H) = P(H|S) * P(S) / P(H), must give the same answer, but it requires P(H|S) and P(S), neither of which is provided. The overall smoking rate P(S) cannot be recovered from the question, because nothing is said about employees who do not use the plan.

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4, or 40%.
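
As a sanity check, the same conditional probability can be obtained by counting. The sketch below assumes a hypothetical workforce of 1,000 employees (the question gives only percentages), of whom 700 use the plan and 280 of those smoke, and divides the joint probability by P(H).

```python
from fractions import Fraction

# Hypothetical head counts (assumption): the question only states percentages.
total_employees = 1000
plan_users = 700             # 70% of all employees use the plan
smokers_among_users = 280    # 40% of the 700 plan users smoke

p_h = Fraction(plan_users, total_employees)                # P(H)
p_s_and_h = Fraction(smokers_among_users, total_employees) # P(S ∩ H)
p_s_given_h = p_s_and_h / p_h                              # P(S|H) = P(S ∩ H) / P(H)

print(float(p_s_given_h))
```

Using `Fraction` keeps the arithmetic exact, so the result is not affected by floating-point rounding.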






### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm used for text classification and other machine learning tasks. The main difference between them lies in the assumption they make about the distribution of the features.

Bernoulli Naive Bayes assumes that the features are binary (i.e., they take on values of 0 or 1) and that each one follows a Bernoulli distribution within a class. It is suitable for classification problems where the presence or absence of a feature matters but its frequency does not. An example is spam detection, where the presence or absence of certain words in an email can indicate whether it is spam.

In contrast, Multinomial Naive Bayes assumes that the features are counts of occurrences (i.e., non-negative integers) and that the feature vector follows a multinomial distribution within a class. It is suitable for problems where frequency matters, such as document classification, where the number of times a word appears in a document helps determine its category.

Another key difference is how the two models treat features that do not occur. Bernoulli Naive Bayes explicitly models absence: a feature with value 0 contributes a factor of 1 − P(feature | class) to the likelihood, so the absence of a word counts as evidence. Multinomial Naive Bayes accumulates evidence only from features with non-zero counts, so a word that does not occur contributes nothing. Neither variant handles genuinely missing values (e.g., NaN) on its own; in scikit-learn such values must be imputed or removed before fitting.

### Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes has no built-in mechanism for missing values. The model is defined over features that are either 0 or 1, so any missing entry has to be resolved before or outside the model. A common convention is to encode a missing value as 0, which treats the feature as absent.

For example, suppose we are using Bernoulli Naive Bayes for spam detection, and we have a feature for the presence of the word "viagra" in an email. If an email does not contain the word, the corresponding value in the feature vector is 0. If the value is missing and we set it to 0, we are assuming that the word is not present. This conflates "unknown" with "absent", and because Bernoulli Naive Bayes counts absence as evidence, the choice can change the prediction.

The treatment therefore depends on the preprocessing and the implementation used. scikit-learn's `BernoulliNB` raises an error when the input contains NaN, so missing values must be handled beforehand, for example by imputing them (with 0 or with the most frequent value) or by dropping the affected rows. A custom implementation can instead skip the missing feature's term in the likelihood; under the naive independence assumption this amounts to marginalizing that feature out. In any case, it is important to decide how missing values are handled when applying Bernoulli Naive Bayes to a particular dataset.
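
The difference between filling a missing value with 0 and leaving it out of the likelihood can be made concrete. The sketch below is a hand-rolled illustration, not a scikit-learn feature; it uses `None` to mark a missing entry, and the probabilities in `p` are made-up values.

```python
import math

def bernoulli_log_likelihood(x, p, missing="skip"):
    """Class-conditional log-likelihood of a binary feature vector.

    x: list of 0, 1, or None (None marks a missing value)
    p: list of P(feature = 1 | class)
    missing: "skip" drops the missing feature's term (marginalizes it out);
             "zero" treats the missing feature as absent.
    """
    total = 0.0
    for xi, pi in zip(x, p):
        if xi is None:
            if missing == "skip":
                continue
            xi = 0  # "zero" strategy: assume the feature is absent
        total += math.log(pi if xi == 1 else 1.0 - pi)
    return total

# Hypothetical P(word present | spam) for three words (assumption).
p_spam = [0.7, 0.9, 0.4]
email = [1, None, 0]  # the second feature is missing

skipped = bernoulli_log_likelihood(email, p_spam, missing="skip")
zero_filled = bernoulli_log_likelihood(email, p_spam, missing="zero")

print(skipped, zero_filled)
```

Here zero-filling lowers the spam log-likelihood by `log(1 - 0.9)`, because the model reads the filled-in 0 as evidence that a strongly spam-associated word is absent.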

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification, and it supports it natively. No "one-vs-all" or "one-vs-rest" wrapper is needed, because the model is generative: it estimates a separate set of parameters for every class and compares the classes directly.

During training, the algorithm estimates for each class a prior probability and, for each feature, a mean and a variance. During prediction, it computes the posterior score of every class, which is the class prior multiplied by the product of the per-feature Gaussian densities, and selects the class with the highest score. The procedure is identical for two classes or for many.

For example, with three classes A, B, and C, a single Gaussian Naive Bayes model stores three priors and three sets of feature means and variances. For a given input it evaluates the score of A, B, and C and predicts the largest. A one-vs-rest scheme could still be applied, but it would train three binary models to do what one model already does.
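
A minimal sketch of direct multi-class scoring with Gaussian class-conditional densities is shown below. It is a hand-written illustration rather than scikit-learn's code, and the class parameters and priors are made-up values for three classes A, B, and C.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def predict(x, params, priors):
    """Return the class with the highest log prior + summed log densities."""
    scores = {}
    for label, feature_params in params.items():
        score = math.log(priors[label])
        for xi, (mean, var) in zip(x, feature_params):
            score += gaussian_log_pdf(xi, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical (mean, variance) per feature for each class (assumption).
params = {
    "A": [(0.0, 1.0), (0.0, 1.0)],
    "B": [(5.0, 1.0), (5.0, 1.0)],
    "C": [(10.0, 1.0), (0.0, 1.0)],
}
priors = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

pred_a = predict([0.1, -0.3], params, priors)
pred_b = predict([4.8, 5.1], params, priors)
pred_c = predict([9.5, 0.2], params, priors)
print(pred_a, pred_b, pred_c)
```

All three classes are scored by the same model in one pass; adding a fourth class would only add one more entry to `params` and `priors`.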

### Q5. Assignment:

#### Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

#### Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

#### Results:
Report the following performance metrics for each classifier:

- Accuracy
- Precision
- Recall
- F1 score

#### Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

#### Conclusion:
Summarise your findings and provide some suggestions for future work.

In [49]:
import pandas as pd
import numpy as np

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#### Data Preparation

In [35]:
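# Parse the column names from the dataset's .names description file.
with open('data/spambase/spambase.names', 'r') as f:
    attribute_names = []
    for line in f:
        # Skip comment lines (leading '|') and indented continuation lines.
        if line.startswith('|') or line.startswith(' '):
            continue
        # The class line has no colon; keep the text before '|' (this yields '1, 0.').
        elif line.startswith('1') or line.startswith('0'):
            attribute_names.append(line.split('|')[0].strip())
        # Attribute lines have the form "name: type"; keep the text before ':'.
        else:
            attribute_names.append(line.split(':')[0].strip())
    # Drop the empty strings produced by blank lines.
    attributes = [attr for attr in attribute_names if attr != '']
    # The class label comes first in the file but is the last column in the data.
    attributes = attributes[1:] + [attributes[0]]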

In [36]:
print(attributes)

['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our', 'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail', 'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses', 'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet', 'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology', 'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu', 'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#', 'capita

In [37]:
spam_df = pd.read_csv("data/spambase/spambase.data",names=attributes)
spam_df.head()


Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,"1, 0."
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [38]:
spam_df = spam_df.rename(columns={'1, 0.' : 'target'})

In [42]:
spam_df.sample(5)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,target
2355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.558,0.0,0.0,2.0,7,28,0
1712,0.09,0.49,0.59,0.0,0.39,0.19,0.0,0.0,0.09,0.39,...,0.765,0.037,0.0,5.828,1.308,0.0,6.047,54,768,1
644,0.89,0.0,0.89,0.0,0.0,0.0,1.78,0.0,0.0,0.0,...,0.0,0.0,0.0,1.344,0.0,0.0,5.25,16,84,1
2557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,2.0,4,6,0
1556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.215,0.0,0.0,0.215,0.0,3.937,18,63,1


In [43]:
spam_df.isna().sum()

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

In [45]:
X = spam_df.drop('target', axis=1)
y = spam_df['target']
X.shape, y.shape

((4601, 57), (4601,))

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3220, 57), (1381, 57), (3220,), (1381,))

#### Implementation

In [48]:
bnb = BernoulliNB()
mnb = MultinomialNB()
gnb = GaussianNB()

In [60]:
bnb.fit(X_train, y_train)

BernoulliNB()

In [61]:
mnb.fit(X_train, y_train)

MultinomialNB()

In [62]:
gnb.fit(X_train, y_train)

GaussianNB()

In [63]:
bnb_cv = cross_val_score(bnb, X_train, y_train, cv=10)
print("Cross Val score for Bernoulli : ",bnb_cv)
print("Mean cv score for Bernoulli", bnb_cv.mean())

Cross Val score for Bernoulli :  [0.87267081 0.90993789 0.91925466 0.89130435 0.88198758 0.88198758
 0.86024845 0.89130435 0.88819876 0.87267081]
Mean cv score for Bernoulli 0.8869565217391303


In [64]:
mnb_cv = cross_val_score(mnb, X_train, y_train, cv=10)
print("Cross Val score for Multinomial : ",mnb_cv)
print("Mean cv score for Multinomial", mnb_cv.mean())

Cross Val score for Multinomial :  [0.80434783 0.79503106 0.78571429 0.75776398 0.80434783 0.78571429
 0.80745342 0.77018634 0.77950311 0.80124224]
Mean cv score for Multinomial 0.7891304347826087


In [65]:
gnb_cv = cross_val_score(gnb, X_train, y_train, cv=10)
print("Cross Val score for Gaussian: ",gnb_cv)
print("Mean cv score for Gaussian", gnb_cv.mean())

Cross Val score for Gaussian:  [0.85714286 0.81987578 0.82298137 0.81987578 0.81677019 0.80745342
 0.83229814 0.82608696 0.80124224 0.81677019]
Mean cv score for Gaussian 0.8220496894409937
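
`cross_val_score` reports accuracy by default for classifiers. If precision, recall, and F1 are also wanted per fold, scikit-learn's `cross_validate` accepts a list of scorers. The sketch below is one possible way to do this; to stay self-contained it uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the Spambase split), clipped at zero so that `MultinomialNB` receives non-negative inputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in for X_train / y_train (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_demo = np.clip(X_demo, 0, None)  # MultinomialNB requires non-negative features

scoring = ["accuracy", "precision", "recall", "f1"]
models = {
    "Bernoulli": BernoulliNB(),
    "Multinomial": MultinomialNB(),
    "Gaussian": GaussianNB(),
}

cv_summary = {}
for name, model in models.items():
    # Returns a dict with one "test_<metric>" array of 10 fold scores per metric.
    scores = cross_validate(model, X_demo, y_demo, cv=10, scoring=scoring)
    cv_summary[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
    print(name, cv_summary[name])
```

In the notebook, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`.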


#### Results

In [68]:
def cal_results(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    
    print("Accuracy :", accuracy)
    print("Precision : ", precision)
    print("Recall : ", recall)
    print("F1 score : ", f1)
    
    return {'accuracy': accuracy,'precision':precision,'recall':recall,'f1':f1}

In [71]:
y_pred_bnb = bnb.predict(X_test)
bnb_score = cal_results(y_test,y_pred_bnb)

Accuracy : 0.8834178131788559
Precision :  0.898
Recall :  0.8032200357781754
F1 score :  0.8479697828139755
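
The four metrics follow directly from confusion-matrix counts. The counts below are not printed anywhere in this notebook; they are inferred here as the integers consistent with the Bernoulli scores above on the 1,381-row test set, so treat them as an assumption. The sketch recomputes each metric from its definition.

```python
# Confusion-matrix counts inferred from the reported Bernoulli scores (assumption).
tp, fp, fn, tn = 449, 51, 110, 771

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted spam that is spam
recall = tp / (tp + fn)                     # share of actual spam that is caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy :", accuracy)
print("Precision :", precision)
print("Recall :", recall)
print("F1 score :", f1)
```

Under these counts, the model flags 500 emails as spam and is wrong on 51 of them, while 110 of the 559 spam emails pass undetected.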


In [72]:
y_pred_mnb = mnb.predict(X_test)
mnb_score = cal_results(y_test,y_pred_mnb)

Accuracy : 0.775524981897176
Precision :  0.7398843930635838
Recall :  0.6869409660107334
F1 score :  0.712430426716141


In [73]:
y_pred_gnb = gnb.predict(X_test)
gnb_score = cal_results(y_test,y_pred_gnb)

Accuracy : 0.8167994207096307
Precision :  0.7007874015748031
Recall :  0.9552772808586762
F1 score :  0.8084784254352764


#### Discussion

Based on the results obtained, Bernoulli Naive Bayes performed the best with an accuracy of 0.8834, followed by Gaussian Naive Bayes with an accuracy of 0.8168 and Multinomial Naive Bayes with an accuracy of 0.7755.

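One likely reason for the superior performance of Bernoulli Naive Bayes lies in how each variant treats the Spambase features. The features are not binary: they are continuous word and character frequencies (expressed as percentages) plus three capital-run-length statistics. `BernoulliNB`, however, binarizes its input at a threshold of 0.0 by default, so it effectively uses only whether each word or character appears in the email. That presence/absence signal appears to suit spam detection well and is insensitive to the heavy skew in the raw values. Multinomial Naive Bayes expects occurrence counts; here it receives percentages alongside run lengths that range into the hundreds and thousands, so the large-valued columns probably dominate its likelihood. Gaussian Naive Bayes assumes that each feature is normally distributed within a class, which is a poor fit for features that are mostly zero with a long right tail.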
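
With default settings, `BernoulliNB` thresholds every input value at `binarize=0.0` before fitting and predicting, so each frequency column is reduced to a present/absent indicator. A minimal sketch of that thresholding, applied to the first five feature values of row 0 in the `spam_df.head()` output above:

```python
# First five feature values of row 0 (word_freq_make ... word_freq_our).
row = [0.0, 0.64, 0.64, 0.0, 0.32]

threshold = 0.0  # default value of BernoulliNB's `binarize` parameter

# Values strictly greater than the threshold become 1, everything else 0.
binary_row = [1 if value > threshold else 0 for value in row]

print(binary_row)  # [0, 1, 1, 0, 1]
```

After this step the model no longer distinguishes a frequency of 0.32 from one of 0.64; only the fact that the word occurs is kept.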

It is also worth noting that the precision and recall values for each classifier vary, with Bernoulli Naive Bayes having the highest precision but a lower recall compared to Gaussian Naive Bayes, which has the highest recall but a lower precision. This trade-off between precision and recall is a common issue in classification problems, and the choice of the appropriate metric depends on the specific application.

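The most visible limitation in this experiment is sensitivity to the distributional assumption: on the same data, test accuracy ranged from 0.7755 to 0.8834 depending only on which feature distribution the variant assumes. A second, general limitation is the assumption of independence between features given the class, which rarely holds for word frequencies, since spam-related terms tend to occur together. Naive Bayes predictions can also be biased by the class priors when classes are heavily imbalanced, although that effect was not examined here.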


#### Conclusion

In conclusion, our evaluation of three variants of Naive Bayes classifiers on the Spambase dataset showed that Bernoulli Naive Bayes outperformed Multinomial and Gaussian Naive Bayes in both accuracy (0.8834) and F1 score (0.8480) on the held-out test set, and in mean 10-fold cross-validation accuracy (0.8870). Gaussian Naive Bayes achieved the highest recall (0.9553) at the cost of the lowest precision. The choice of variant therefore depends on the characteristics of the features and on the desired trade-off between precision and recall. Future work could examine feature transformations and non-default hyperparameters, and could explore more advanced techniques, such as ensemble methods or deep learning, to further improve the performance of spam classification models.