Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?


In [None]:
"""
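The probability that an employee is a smoker given that he/she uses the health insurance plan is 40%.
The question states this conditional probability directly: P(smoker | plan) = 0.40. The 70% figure, P(plan),
would only be needed for the joint probability P(smoker and plan) = 0.40 * 0.70 = 0.28.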
"""
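
The arithmetic behind the Q1 answer can be checked in a few lines. This is only a sketch; the variable names are
illustrative and not part of the original question.

```python
# Worked check of the conditional probability in Q1.
p_plan = 0.70                # P(employee uses the plan)
p_smoker_given_plan = 0.40   # P(smoker | uses the plan), stated directly in the question

# Joint probability of being a plan user AND a smoker.
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Recover the conditional from the definition P(A | B) = P(A and B) / P(B).
p_recovered = p_smoker_and_plan / p_plan

print(f"P(smoker and plan) = {p_smoker_and_plan:.2f}")
print(f"P(smoker | plan)   = {p_recovered:.2f}")
```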

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


In [None]:
"""
Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes classification algorithm,
primarily differing in the types of data they are designed to handle and how they model feature probabilities.

Bernoulli Naive Bayes is suited to binary data, where each feature is either 0 (absent) or 1 (present). It is
often used in text classification when features record whether a specific word or attribute occurs in a document.
Its likelihood includes a term for every feature, so the absence of a word also counts as evidence.

Multinomial Naive Bayes, on the other hand, is intended for discrete count data. It models how often each feature
occurs, making it a natural choice for text classification where features are word counts or term frequencies (TF).
Fractional weights such as term frequency-inverse document frequency (TF-IDF) are not true counts, but they often
work in practice as well. Words that do not occur in a document contribute nothing to its likelihood.

The choice between Bernoulli and Multinomial Naive Bayes depends on the nature of the data and the problem domain,
particularly in applications involving text analysis and feature representation.
"""
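
The difference between the two likelihoods can be made concrete with a small sketch in plain Python. The function
names and the probabilities below are hypothetical and chosen only for illustration; the multinomial version omits
the multinomial coefficient, which is the same for every class and so does not affect the prediction.

```python
import math

def bernoulli_log_likelihood(present, p):
    """log P(x | class) when each feature is a 0/1 presence indicator.

    present: list of 0/1 values; p: per-word probability of presence in this class.
    Absent words contribute log(1 - p), so absence is evidence too.
    """
    return sum(
        math.log(pi) if xi else math.log(1.0 - pi)
        for xi, pi in zip(present, p)
    )

def multinomial_log_likelihood(counts, theta):
    """log P(x | class), up to a class-independent constant, for count features.

    theta: per-word occurrence probabilities for this class (they sum to 1).
    Words with a count of 0 contribute nothing.
    """
    return sum(c * math.log(t) for c, t in zip(counts, theta))

# Hypothetical three-word vocabulary and one document.
counts = [3, 0, 1]                              # word counts in the document
present = [1 if c > 0 else 0 for c in counts]   # the same document as presence flags

print(bernoulli_log_likelihood(present, p=[0.8, 0.5, 0.4]))
print(multinomial_log_likelihood(counts, theta=[0.6, 0.3, 0.1]))
```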

Q3. How does Bernoulli Naive Bayes handle missing values?


In [None]:
"""
Bernoulli Naive Bayes, like the other variants of the Naive Bayes algorithm, has no built-in mechanism for missing
values. The model assumes that each feature is a binary variable, representing the presence (1) or absence (0) of a
specific attribute or event. A missing entry is neither, so it has to be resolved before training. In scikit-learn,
BernoulliNB raises an error if the input contains NaN.

In practice, this means you should preprocess the data first. Common options are:

->Impute 0, which treats "missing" as "absent". This is only appropriate when that reading makes sense in the domain.
->Impute the most frequent value of the feature.
->Add a separate binary indicator feature that records whether the value was missing.
->Drop the affected rows or features if only a few are affected.

In principle, Naive Bayes can also skip a missing feature at prediction time, because the independence assumption
allows that feature's likelihood term to be left out of the product. scikit-learn does not do this automatically.

Whichever option you choose, it requires careful consideration and domain expertise to ensure that the assumptions
made align with the nature of the data and the problem you are trying to solve.
"""
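
One way to apply such preprocessing is to put an imputer in front of the classifier. The sketch below uses
scikit-learn's SimpleImputer with the "most_frequent" strategy; the toy arrays X_toy and y_toy are hypothetical, and
the choice of strategy is an assumption that should be revisited for real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = word present, 0 = word absent, NaN = value missing.
X_toy = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
])
y_toy = np.array([1, 1, 0, 0])

# Fill each missing entry with its column's most frequent value, then fit the classifier.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_toy, y_toy)

print(model.predict(X_toy))
```

Keeping the imputer inside the pipeline means it is refitted on each training fold during cross-validation, so no
information from the held-out fold leaks into the imputed values.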

Q4. Can Gaussian Naive Bayes be used for multi-class classification?


In [None]:
"""
Yes. Gaussian Naive Bayes handles multi-class classification directly. Naive Bayes is not restricted to binary
problems: for each class k it estimates a prior P(y = k) and, in the Gaussian variant, a mean and a variance for
every feature (the features are assumed to follow a Gaussian (normal) distribution within each class). It then
predicts the class with the highest posterior probability. Adding more classes only adds more sets of parameters,
so no one-vs-rest or similar adaptation is needed.
"""
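
A short sketch of Gaussian Naive Bayes on a three-class problem. It uses the Iris data set bundled with scikit-learn
purely as an illustration; that data set is not part of this assignment, and the names X_iris, y_iris and clf are
chosen here for the example.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four numeric features.
X_iris, y_iris = load_iris(return_X_y=True)

clf = GaussianNB().fit(X_iris, y_iris)

print(clf.classes_)                            # class labels seen during fit
print(clf.theta_.shape)                        # per-class feature means: (n_classes, n_features)
print(clf.predict_proba(X_iris[:1]).round(3))  # one posterior probability per class
```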

Q5. Assignment:
    
Data preparation:
    
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to
predict whether a message is spam or not based on several input features.

Implementation:
    
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
    
Report the following performance metrics for each classifier:
    
Accuracy

Precision

Recall

F1 score

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:

Summarise your findings and provide some suggestions for future work.

Note: Create your assignment in a Jupyter notebook, upload it to GitHub, and share the GitHub repository
link through your dashboard. Make sure the repository is public.

Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

In [30]:
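# Load the Spambase data set.
# Note: spambase.data has no header row, so read_csv uses the first record as column
# names here; passing header=None would keep that record as data instead.
import pandas as pd
df=pd.read_csv('spambase.data')
df.head()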

      0  0.64  0.64.1  0.1  0.32   0.2   0.3   0.4   0.5   0.6  ...  0.41   0.42  0.43  0.778   0.44   0.45  3.756   61   278  1
0  0.21  0.28     0.5  0.0  0.14  0.28  0.21  0.07   0.0  0.94  ...   0.0  0.132   0.0  0.372   0.18  0.048  5.114  101  1028  1
1  0.06   0.0    0.71  0.0  1.23  0.19  0.19  0.12  0.64  0.25  ...  0.01  0.143   0.0  0.276  0.184   0.01  9.821  485  2259  1
2   0.0   0.0     0.0  0.0  0.63   0.0  0.31  0.63  0.31  0.63  ...   0.0  0.137   0.0  0.137    0.0    0.0  3.537   40   191  1
3   0.0   0.0     0.0  0.0  0.63   0.0  0.31  0.63  0.31  0.63  ...   0.0  0.135   0.0  0.135    0.0    0.0  3.537   40   191  1
4   0.0   0.0     0.0  0.0  1.85   0.0   0.0  1.85   0.0   0.0  ...   0.0  0.223   0.0    0.0    0.0    0.0    3.0   15    54  1


In [31]:
# define X and y
X=df.iloc[:,:-1]
y=df.iloc[:,-1]

In [39]:
# split train and test
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.20,random_state=44)

In [40]:
# Import Model Libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

In [41]:
# Gaussian Naive Bayes 
G_clf=GaussianNB()

G_clf.fit(X_train,y_train)

y_pred=G_clf.predict(X_test)

# Calculate the accuracy of the predictions
import numpy as np

accuracy = np.sum(y_pred == y_test) / len(y_test)
print('Accuracy:', accuracy)

Accuracy: 0.808695652173913


In [42]:
# Bernoulli Naive bayes 
B_clf=BernoulliNB()

B_clf.fit(X_train,y_train)

y_pred=B_clf.predict(X_test)

# Calculate the accuracy of the predictions
import numpy as np

accuracy = np.sum(y_pred == y_test) / len(y_test)
print('Accuracy:', accuracy)

Accuracy: 0.8771739130434782


In [43]:
# Multinomial Naive Bayes
M_clf=MultinomialNB()

M_clf.fit(X_train,y_train)

y_pred=M_clf.predict(X_test)

# Calculate the accuracy of the predictions
import numpy as np

accuracy = np.sum(y_pred == y_test) / len(y_test)
print('Accuracy:', accuracy)

Accuracy: 0.783695652173913
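
The assignment asks for 10-fold cross-validation and for precision, recall and F1 score in addition to accuracy. A
minimal sketch of how that could be done with scikit-learn's cross_validate is given below. The helper name
evaluate_nb_variants, the stratified shuffled folds and the seed of 44 are assumptions made for this example, and
no results are claimed for it here.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

def evaluate_nb_variants(X, y, n_splits=10, seed=44):
    """Return mean cross-validated scores for the three Naive Bayes variants.

    All classifiers use their default hyperparameters. X must be non-negative
    because MultinomialNB rejects negative feature values.
    """
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scoring = ["accuracy", "precision", "recall", "f1"]
    models = {
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "GaussianNB": GaussianNB(),
    }
    results = {}
    for name, model in models.items():
        # cross_validate returns one array per metric under the key "test_<metric>".
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
    return results
```

With the X and y defined earlier in this notebook, calling evaluate_nb_variants(X, y) returns a dictionary that maps
each classifier name to its four mean scores.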


#### Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?


In [None]:

"""
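In general, Naive Bayes classifiers are known for their simplicity and effectiveness. They are also relatively easy to train
and can be used with a variety of different datasets.

In the case of the Spambase dataset, the Bernoulli Naive Bayes classifier performed the best, with an accuracy of about 87.7%
on the test set, compared with 80.9% for Gaussian Naive Bayes and 78.4% for Multinomial Naive Bayes. Bernoulli Naive Bayes
assumes that the features (not the target) are binary. With its default settings, scikit-learn's BernoulliNB binarizes every
feature at a threshold of 0, so each word or character frequency becomes a "present / absent" indicator. For spam detection,
whether a word occurs at all appears to be more informative than how often it occurs.

The other two variants are likely held back by the shape of the features. They are percentages and run lengths rather than raw
counts, and most of them are heavily skewed towards zero, which fits neither the multinomial count model nor the Gaussian
assumption of normally distributed features well.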
"""
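
BernoulliNB thresholds its inputs before fitting. The sketch below reproduces what the default threshold does to
frequency-style features; the freqs array is hypothetical and only mimics the kind of values found in Spambase.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word-frequency rows (percent of words in an email).
freqs = np.array([
    [0.00, 0.64, 0.32],
    [0.21, 0.00, 0.00],
])

threshold = BernoulliNB().binarize   # default value of the binarize parameter: 0.0

# Values strictly above the threshold become 1, everything else becomes 0.
indicators = (freqs > threshold).astype(int)
print(indicators)
```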


#### Summarise your findings and provide some suggestions for future work.

In [None]:
"""
Summary of findings:

->Naive Bayes classifiers are a simple and effective way to classify email messages.
->The Bernoulli Naive Bayes classifier performed the best on the Spambase dataset, with an accuracy of about 87.7% on
  the test set.
->The Gaussian Naive Bayes classifier (80.9%) and the Multinomial Naive Bayes classifier (78.4%) were noticeably
  less accurate.
->These figures come from a single 80/20 train/test split and cover accuracy only.
->One limitation of Naive Bayes classifiers is that they make the assumption that the features of a data point 
  are independent of each other given the class, which rarely holds for word frequencies in real emails.
->Another limitation is that Naive Bayes classifiers can be sensitive to outliers. This applies to Gaussian Naive
  Bayes in particular, because outliers distort the estimated means and variances.


Suggestions for future work:

->Investigate the performance of Naive Bayes classifiers on other spam filtering datasets.
->Explore ways to address the limitations of Naive Bayes classifiers, such as the independence assumption and 
  sensitivity to outliers.
->Develop new variants of Naive Bayes classifiers that are specifically designed for spam filtering tasks.
->Integrate Naive Bayes classifiers with other machine learning algorithms to improve the overall performance 
  of spam filtering systems.
"""