# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, we can use Bayes' theorem.

Let's define the events:
- A: An employee uses the company's health insurance plan.
- B: An employee is a smoker.

We are given the following probabilities:
- \( P(A) \): Probability that an employee uses the health insurance plan = 0.70 (70%).
- \( P(B|A) \): Probability that an employee is a smoker given that they use the health insurance plan = 0.40 (40%).

Now, we want to find \( P(B|A) \), the probability that an employee is a smoker given that they use the health insurance plan. This can be calculated using Bayes' theorem:


P(B∣A)= P(A∣B)⋅P(B)/P(A)

Since we are not given the probability ( P(B)) directly, we can calculate it using the law of total probability:

 P(B) = P(B|A). P(A) + P(B-A). P(- A) 

where (- A ) represents the event that an employee does not use the health insurance plan.

Since we know that ( P(A) = 0.70 ) and ( P(B|A) = 0.40 ), we can find ( P(B) ) as follows:
\[ P(B) = 0.40 . 0.70 + P(B|- A). (1 - 0.70)

Now, we need one more piece of information: the proportion of employees who do not use the health insurance plan ( P(- A)), which can be calculated as:
 P(- A) = 1 - P(A) = 1 - 0.70 = 0.30 

With this information, we can find ( P(B)
P(B) = 0.40 .0.70 + P(B|- A) . 0.30

Now we can calculate  P(B|A) using Bayes' theorem:
 P(B|A) = P(A|B). P(B)/ P(A)

Given that  P(A|B) = P(B|A) = 0.40 (this information is given in the problem), we can now find  P(B|A)
P(B|A) = 0.40. 0.40 /0.70

Calculate the value:
P(B|A) = 0.16/0.70 
approx =0.229 

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is approximately 0.229 or 22.9%.

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of data they are best suited for and how they handle features.

Bernoulli Naive Bayes:
Best suited for binary data, where each feature can take on only two values: 0 or 1 (representing the absence or presence of a particular feature).
It assumes that each feature is binary and independent of each other, meaning the presence of one feature does not affect the presence of another.
Typically used for text classification tasks where the features represent the presence or absence of words in a document or message.

Multinomial Naive Bayes:
Suited for discrete count data, where each feature represents the count or frequency of a specific event.
It is commonly used for text classification tasks where features are word frequencies or term counts.
Unlike Bernoulli Naive Bayes, Multinomial Naive Bayes allows for features with multiple discrete values, not just binary.
Both Bernoulli and Multinomial Naive Bayes are variants of the Naive Bayes algorithm. They are based on the same underlying principles but make different assumptions about the nature of the data.

In summary, choose Bernoulli Naive Bayes when dealing with binary data (presence/absence), and choose Multinomial Naive Bayes when dealing with count data (word frequencies, term counts) or other discrete data with multiple values. The choice of which variant to use depends on the specific characteristics of your data and the nature of the classification problem you are trying to solve.

# Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes handles missing values by ignoring them during the probability calculations. When encountering a missing value for a particular feature, the algorithm simply excludes that feature from the probability calculations for the corresponding class.

Here's how Bernoulli Naive Bayes handles missing values step-by-step:

Training Phase:

During the training phase, the algorithm calculates probabilities for each class based on the presence or absence of features in the training data.
If a feature is missing for a specific instance in the training data, the algorithm treats it as if the feature is not present (i.e., it assumes the value is 0).
The algorithm calculates the probabilities of each class based on the presence or absence of each feature in the training data, including the missing values treated as 0.
Testing Phase:

During the testing phase, when the algorithm encounters an instance with missing values, it still uses the same probabilities calculated during training.
For each missing feature in the instance, the algorithm ignores that feature during the probability calculations for each class.
The missing feature is treated as if it were not present, so its absence contributes to the probability calculations, just like during training.
By treating missing values as the absence of features, Bernoulli Naive Bayes effectively ignores the missing values and continues with the classification process. This approach simplifies the implementation and is particularly useful when dealing with sparse binary data, as is common in text classification tasks.

However, it's important to note that the handling of missing values in Bernoulli Naive Bayes might not be optimal in all cases. In some scenarios, a more sophisticated approach, such as imputation or considering missing values as a separate category, may be more appropriate to handle missing data effectively.

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is one of the variants of the Naive Bayes algorithm and is suitable for handling continuous or numeric features. It assumes that the features follow a Gaussian (normal) distribution within each class.

For multi-class classification, where the target variable can take on more than two distinct classes, Gaussian Naive Bayes extends its capabilities to handle multiple classes. It does this by estimating the parameters (mean and variance) of the Gaussian distribution for each feature within each class.

Here's how Gaussian Naive Bayes works for multi-class classification:

Training Phase:

During the training phase, the algorithm calculates the mean and variance of each feature for each class using the training data.
For each class, it calculates the mean and variance of each feature based on the instances belonging to that class.
These mean and variance values are used to model the Gaussian distribution for each feature within each class.

Testing Phase:

During the testing phase, when the algorithm encounters a new instance with its feature values, it calculates the probability of the instance belonging to each class using the Gaussian probability density function (PDF) for each feature.
It applies Bayes' theorem to find the conditional probability of the instance belonging to each class given its feature values.
The class with the highest conditional probability is predicted as the output class for the instance.
Gaussian Naive Bayes is a powerful and computationally efficient algorithm for multi-class classification, especially when dealing with continuous-valued features that follow a Gaussian distribution. However, it assumes that the features are independent of each other given the class, which may not always hold true in real-world scenarios. Nonetheless, Gaussian Naive Bayes can still perform well in many multi-class classification tasks, especially when the independence assumption approximately holds or when there is a large amount of data available for training.








# Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:
Summarise your findings and provide some suggestions for future work.

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import warnings

data = fetch_openml(name='spambase')
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Next, we can create instances of the Bernoulli Naive Bayes, Multinomial Naive Bayes,
# and Gaussian Naive Bayes classifiers and fit them to the training data:


  warn(


In [3]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

# Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

#To evaluate the performance of each classifier using 10-fold cross-validation, 
#we can use the cross_val_score function from scikit-learn:

In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, classification_report

# Bernoulli Naive Bayes
bnb_scores = cross_val_score(bnb, X, y, cv=10)
print("Bernoulli Naive Bayes:")
print("Accuracy:", bnb_scores.mean())
print("Precision:", )
print("Recall:", )
print("F1 score:", )

# Multinomial Naive Bayes
mnb_scores = cross_val_score(mnb, X, y, cv=10)
print("Multinomial Naive Bayes:")
print("Accuracy:", mnb_scores.mean())
print("Precision:", )
print("Recall:", )
print("F1 score:", )

# Gaussian Naive Bayes
gnb_scores = cross_val_score(gnb, X, y, cv=10)
print("Gaussian Naive Bayes:")
print("Accuracy:", gnb_scores.mean())
print("Precision:", )
print("Recall:", )
print("F1 score:", )
print(classification_report(y_test, y_pred))

Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision:
Recall:
F1 score:
Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision:
Recall:
F1 score:
Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision:
Recall:
F1 score:


NameError: name 'y_pred' is not defined