Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


Answer--> The question already states the quantity being asked for. "40% of the employees who use the plan are smokers" describes plan users only, so it is the conditional probability itself:

\[ P(\text{Smoker} \mid \text{Uses Health Insurance}) = 0.40 \]

This can be confirmed with the conditional probability formula:

\[ P(\text{Smoker} \mid \text{Uses Health Insurance}) = \frac{P(\text{Smoker and Uses Health Insurance})}{P(\text{Uses Health Insurance})} \]

Given the information:
- \( P(\text{Uses Health Insurance}) = 0.70 \) (70% of employees use the health insurance plan)
- \( P(\text{Smoker} \mid \text{Uses Health Insurance}) = 0.40 \) (40% of the employees who use the plan are smokers)

The survey does not give \( P(\text{Smoker}) \) for all employees, because it says nothing about smoking among employees who do not use the plan.

First, calculate \( P(\text{Smoker and Uses Health Insurance}) \):
\[ P(\text{Smoker and Uses Health Insurance}) = P(\text{Smoker} \mid \text{Uses Health Insurance}) \times P(\text{Uses Health Insurance}) = 0.40 \times 0.70 = 0.28 \]

Now, calculate \( P(\text{Smoker} \mid \text{Uses Health Insurance}) \):
\[ P(\text{Smoker} \mid \text{Uses Health Insurance}) = \frac{P(\text{Smoker and Uses Health Insurance})}{P(\text{Uses Health Insurance})} = \frac{0.28}{0.70} = 0.40 \]

So, the probability that an employee is a smoker given that they use the health insurance plan is \(0.40\) or \(40\%\).
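
A head-count check of this conditional probability, reading "40% of the employees who use the plan are smokers" as a statement about plan users only. The sketch assumes a hypothetical workforce of 1,000 employees; the survey gives only percentages, so the headcount is an assumption chosen to make the counts whole numbers.

```python
from fractions import Fraction

total_employees = 1000  # hypothetical headcount, not given in the question

# 70% of all employees use the health insurance plan.
plan_users = total_employees * Fraction(70, 100)

# 40% of the plan users (not of all employees) are smokers.
smokers_among_users = plan_users * Fraction(40, 100)

p_uses = plan_users / total_employees
p_smoker_and_uses = smokers_among_users / total_employees
p_smoker_given_uses = p_smoker_and_uses / p_uses

print(float(p_uses))               # 0.7
print(float(p_smoker_and_uses))    # 0.28
print(float(p_smoker_given_uses))  # 0.4
```

Dividing the joint probability by the probability of using the plan returns the 40% stated in the question, whatever headcount is assumed.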

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Answer--> Here's the difference between the two:

1. **Bernoulli Naive Bayes**:
   - **Input Data**: Bernoulli Naive Bayes is typically used for binary feature data, where each feature represents the presence or absence of a particular word or term. It's commonly applied in text classification tasks where the focus is on whether a word appears in a document or not.
   - **Feature Representation**: Each feature is binary, representing the presence (1) or absence (0) of a specific term in a document.
   - **Example**: Email spam detection, sentiment analysis (where the presence or absence of certain words in a text is important).

2. **Multinomial Naive Bayes**:
   - **Input Data**: Multinomial Naive Bayes is used for data with discrete counts, often representing word frequencies or term occurrences. It's suitable for text classification tasks where features are counts of words or terms in a document.
   - **Feature Representation**: Features are represented by counts of occurrences (non-negative integers).
   - **Example**: Document classification, topic categorization, spam filtering (where the frequency of words matters).
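
The contrast between the two feature representations can be sketched with scikit-learn on a toy word-count matrix. The vocabulary, counts, and labels below are made up for illustration. The sketch relies on `BernoulliNB` thresholding its input at 0 by default (`binarize=0.0`), so it only sees whether a word occurs, while `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are words.
# Hypothetical vocabulary: ["free", "meeting", "win"]
X_counts = np.array([
    [3, 0, 2],   # spam
    [5, 0, 1],   # spam
    [0, 2, 0],   # ham
    [1, 3, 0],   # ham
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# The same documents reduced to presence/absence.
X_binary = (X_counts > 0).astype(int)

bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)

# BernoulliNB learns the same parameters from counts and from 0/1 indicators,
# because it binarizes the counts before fitting.
bernoulli_ignores_counts = np.allclose(
    bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_
)

# MultinomialNB's parameters change when the counts are collapsed to 0/1.
multinomial_uses_counts = not np.allclose(
    mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_
)

print(bernoulli_ignores_counts)  # True
print(multinomial_uses_counts)   # True
```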


Q3. How does Bernoulli Naive Bayes handle missing values?

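Answer--> Bernoulli Naive Bayes has no built-in mechanism for missing values: the model expects every feature to be 0 or 1 (scikit-learn's `BernoulliNB`, for example, raises an error if the input contains NaN). Missing values therefore have to be handled before the data reaches the model, and there are a couple of common ways to do that:

1. **Treating Missing Values as Absent**: Replace each missing value with 0, so the feature counts as absent. This assumes that missingness carries no information of its own, which can bias the estimates if values are missing for a systematic reason.

2. **Imputing Missing Values**: Replace each missing value with an estimate, such as the most frequent value of that feature in the training data. If the fact that a value is missing may itself be informative, an extra binary "was missing" feature can be added alongside the imputed one.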
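
Both approaches can be sketched as a preprocessing step in front of the classifier. This is a minimal sketch, assuming scikit-learn's `SimpleImputer` and `make_pipeline`; the toy data is made up, and the choice of fill strategy is an assumption that should be checked against the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with some missing entries (np.nan).
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 1],
    [0, 1, 0],
    [np.nan, 1, 0],
])
y = np.array([1, 1, 0, 0])

# Option 1: treat missing as "absent" by filling with 0.
fill_zero = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)

# Option 2: fill with the most frequent value of each feature.
fill_mode = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    BernoulliNB(),
)

fill_zero.fit(X, y)
fill_mode.fit(X, y)

# A new sample with missing entries goes through the same imputation.
X_new = np.array([[1, np.nan, np.nan]])
pred_zero = fill_zero.predict(X_new)
pred_mode = fill_mode.predict(X_new)
print(pred_zero, pred_mode)
```

Keeping the imputer inside the pipeline means the fill values are learned from the training folds only when the pipeline is cross-validated.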


Q4. Can Gaussian Naive Bayes be used for multi-class classification?

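Answer--> Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes continuous features follow a Gaussian (normal) distribution within each class. Nothing in the method is specific to two classes: it estimates a prior and a per-feature mean and variance for every class, computes the posterior of each class with Bayes' theorem, and predicts the class with the highest posterior. It therefore handles any number of classes without modification.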
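
A minimal sketch of multi-class use, assuming scikit-learn and its bundled three-class Iris dataset (neither is part of the question). The classifier stores one mean per feature per class and returns one posterior probability per class.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

clf = GaussianNB().fit(X, y)

print(clf.classes_)       # the three class labels
print(clf.theta_.shape)   # (3, 4): per-class mean of each feature

proba = clf.predict_proba(X[:1])
print(proba.shape)        # (1, 3): one posterior per class
```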

Q5. Assignment:

Data preparation: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

In [1]:
import pandas as pd

# Step 1: Load Column Names from .names File
column_names = []
with open('spam.names', 'r') as names_file:
    for line in names_file:
        if line.startswith('|'):
            continue
        if ':' in line:
            column_name = line.split(':')[0].strip()
            column_names.append(column_name)

# Step 2: Load Data from .data File
data = []
with open('spam.data', 'r') as data_file:
    for line in data_file:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        data.append(line.split(','))

# Step 3: Create Pandas DataFrame
df = pd.DataFrame(data, columns=column_names)
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
0,0.0,0.64,0.64,0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0,0.135,0.0,0.0,3.537,40,191,1
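
If the data file is a plain comma-separated file with no header row, which is what the manual `split(',')` in the cell above suggests, pandas can parse it directly and infer numeric dtypes. The following is a sketch under that assumption; `load_spambase` is a hypothetical helper, and the in-memory sample uses a handful of illustrative values rather than the real file.

```python
import io
import pandas as pd

def load_spambase(data_source, column_names):
    """Read a headerless comma-separated file into a numeric DataFrame.

    Hypothetical helper for illustration; assumes the file has no header row
    and one column name per field.
    """
    return pd.read_csv(data_source, header=None, names=column_names)

# Tiny in-memory stand-in for the real data file (illustrative values only).
sample = io.StringIO("0.0,0.64,278,1\n0.21,0.28,1028,1\n0.0,0.0,35,0\n")
sample_columns = ["word_freq_make", "word_freq_address",
                  "capital_run_length_total", "spam"]

sample_df = load_spambase(sample, sample_columns)
print(sample_df.dtypes)
```

With the real files the call would look like `load_spambase('spam.data', column_names)`, and the columns would arrive as numbers instead of `object` strings.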


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   word_freq_make              4601 non-null   object
 1   word_freq_address           4601 non-null   object
 2   word_freq_all               4601 non-null   object
 3   word_freq_3d                4601 non-null   object
 4   word_freq_our               4601 non-null   object
 5   word_freq_over              4601 non-null   object
 6   word_freq_remove            4601 non-null   object
 7   word_freq_internet          4601 non-null   object
 8   word_freq_order             4601 non-null   object
 9   word_freq_mail              4601 non-null   object
 10  word_freq_receive           4601 non-null   object
 11  word_freq_will              4601 non-null   object
 12  word_freq_people            4601 non-null   object
 13  word_freq_report            4601 non-null   obje

In [3]:
df.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
count,4601,4601,4601,4601,4601,4601,4601,4601,4601,4601,...,4601,4601,4601,4601,4601,4601,4601,4601,4601,4601
unique,142,171,214,43,255,141,173,170,144,245,...,313,641,225,964,504,316,2161,271,919,2
top,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,5,0
freq,3548,3703,2713,4554,2853,3602,3794,3777,3828,3299,...,3811,1886,4072,2343,3201,3851,349,349,115,2788


In [4]:
# Cast every column from string (object) to float so the features are numeric

df = df.astype(float)

In [5]:
df.describe()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spam
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


## Model training

In [6]:
x = df.drop("spam", axis = 1)
y = df.spam

In [7]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate

# Initialize classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Perform 10-fold cross-validation and calculate metrics
classifiers = [('Bernoulli NB', bernoulli_nb), ('Multinomial NB', multinomial_nb), ('Gaussian NB', gaussian_nb)]
metrics = ['accuracy', 'precision', 'recall', 'f1']

for classifier_name, classifier in classifiers:
    print(f"{classifier_name} Classifier:")
    # One cross-validation run per classifier scores all four metrics on the same folds
    cv_results = cross_validate(classifier, x, y, cv=10, scoring=metrics)
    for metric in metrics:
        avg_score = cv_results[f"test_{metric}"].mean()
        print(f" - {metric}: {avg_score:.4f}")
    print("-" * 30)

Bernoulli NB Classifier:
 - accuracy: 0.8839
 - precision: 0.8870
 - recall: 0.8152
 - f1: 0.8481
------------------------------
Multinomial NB Classifier:
 - accuracy: 0.7863
 - precision: 0.7393
 - recall: 0.7215
 - f1: 0.7283
------------------------------
Gaussian NB Classifier:
 - accuracy: 0.8218
 - precision: 0.7104
 - recall: 0.9570
 - f1: 0.8131
------------------------------


Discussion:
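Based on the performance metrics, the Bernoulli Naive Bayes classifier performed the best overall among the three variants on this dataset. It achieved the highest accuracy (0.8839), precision (0.8870), and F1 score (0.8481). Its recall (0.8152) was not the highest, however: Gaussian Naive Bayes reached a recall of 0.9570, but with the lowest precision (0.7104), meaning it flags many legitimate emails as spam. Multinomial Naive Bayes had the lowest accuracy, recall, and F1 score.

The features in this dataset are not binary; they are continuous word and character frequencies plus capital-run-length statistics. Bernoulli Naive Bayes likely does well because scikit-learn's `BernoulliNB` binarizes its input by default (`binarize=0.0`), turning each feature into "present" or "absent". The `describe()` output above shows a median of 0 for most word-frequency columns, so the mere presence of a word is already informative. Multinomial Naive Bayes is designed for counts, so the percentage-valued frequencies and the much larger `capital_run_length_*` columns (maximum 15841) probably distort its estimates. Gaussian Naive Bayes assumes each feature is normally distributed within a class, which is a poor fit for these zero-heavy, strongly skewed features.

Limitations: Naive Bayes is a simple model with limited expressive power. It assumes the features are conditionally independent given the class, so it cannot capture interactions between features (for example, words that tend to occur together). Each variant is also sensitive to how well its distributional assumption matches the data, as the gap between the three classifiers shows.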
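
One concrete form of this limitation is the conditional independence assumption behind Naive Bayes: correlated features are treated as separate pieces of evidence. The sketch below uses synthetic data (not Spambase) with balanced classes and duplicates a single feature. The duplicated column adds no information, yet the model's log posterior odds double, so its probabilities become overconfident.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# One informative feature, two balanced classes (synthetic data).
x0 = rng.normal(loc=0.0, scale=1.0, size=(50, 1))
x1 = rng.normal(loc=1.0, scale=1.0, size=(50, 1))
X_single = np.vstack([x0, x1])
y = np.array([0] * 50 + [1] * 50)

# Duplicate the feature: no new information, but perfectly correlated columns.
X_dup = np.hstack([X_single, X_single])

def log_odds(model, X):
    """Log posterior odds of class 1 versus class 0."""
    log_proba = model.predict_log_proba(X)
    return log_proba[:, 1] - log_proba[:, 0]

odds_single = log_odds(GaussianNB().fit(X_single, y), X_single)
odds_dup = log_odds(GaussianNB().fit(X_dup, y), X_dup)

# Naive Bayes counts the duplicated evidence twice, so the log-odds double.
print(np.allclose(odds_dup, 2 * odds_single))  # True
```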

Conclusion:
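In conclusion, with default hyperparameters and 10-fold cross-validation, Bernoulli Naive Bayes gave the best accuracy, precision, and F1 score on the spam classification task, Gaussian Naive Bayes gave the highest recall at the expense of precision, and Multinomial Naive Bayes had the lowest accuracy, recall, and F1 score. The results suggest that the choice of Naive Bayes variant should follow how the features are actually distributed, and that on this dataset word presence matters more than the exact frequency values.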