Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

ans->A: Employee uses the health insurance plan.
B: Employee is a smoker

P(A) = 70% = 0.70 (Probability that an employee uses the health insurance plan)
P(B|A) = 40% = 0.40 (Probability that an employee is a smoker given that they use the health insurance plan)

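The question asks for P(B∣A), the probability that an employee is a smoker given that they use the health insurance plan. This value is stated directly in the problem, so Bayes' Theorem is not needed:

P(B∣A) = 0.40

As a check, the joint probability is P(A∩B) = P(B∣A)⋅P(A) = (0.40)⋅(0.70) = 0.28, and dividing by P(A) recovers P(B∣A) = 0.28 / 0.70 = 0.40.

Computing P(B) or P(A∣B) would also require P(B∣¬A), the smoking rate among employees who do not use the plan, which the problem does not provide.

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 40%.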
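
The relationship between the joint and conditional probabilities in Q1 can be checked numerically. This is a minimal sketch that uses only the two figures stated in the question; the variable names are illustrative.

```python
# Figures stated in Q1
p_a = 0.70          # P(A): employee uses the health insurance plan
p_b_given_a = 0.40  # P(B|A): employee smokes, given that they use the plan

# Joint probability from the product rule: P(A and B) = P(B|A) * P(A)
p_a_and_b = p_b_given_a * p_a

# Dividing the joint probability by P(A) recovers the conditional probability
recovered = p_a_and_b / p_a

print(round(p_a_and_b, 2), round(recovered, 2))
```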

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

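ans->Bernoulli Naive Bayes: Assumes binary features (a word is present or absent) and models each feature with a Bernoulli distribution. The absence of a feature also counts as evidence, because the likelihood includes a (1 - p) factor for every feature that does not occur. It is often used for text classification where the presence or absence of words matters, such as spam detection.

Multinomial Naive Bayes: Assumes the features are non-negative counts (such as word counts) drawn from a multinomial distribution, so how often a word occurs affects the likelihood. It is commonly used for document classification based on word counts in the documents.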
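
The difference can be seen on a toy example. The sketch below assumes scikit-learn's `BernoulliNB` with its default `binarize=0.0` (any count above zero becomes 1) and uses a small made-up word-count matrix; the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are word counts
X_toy = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 4, 1],
                  [0, 1, 2]])
y_toy = np.array([1, 1, 0, 0])

bnb_toy = BernoulliNB().fit(X_toy, y_toy)    # sees only presence/absence
mnb_toy = MultinomialNB().fit(X_toy, y_toy)  # uses the counts themselves

# Two documents that differ only in how often the first word occurs
once = np.array([[1, 0, 1]])
many = np.array([[5, 0, 1]])

# BernoulliNB binarizes both inputs to [1, 0, 1], so its output is identical
bern_same = np.allclose(bnb_toy.predict_proba(once), bnb_toy.predict_proba(many))
# MultinomialNB weights the first word five times in `many`, so its output changes
multi_same = np.allclose(mnb_toy.predict_proba(once), mnb_toy.predict_proba(many))

print(bern_same, multi_same)
```

If the repetition of a word is informative for the task, the multinomial model can use it; the Bernoulli model cannot.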

Q3. How does Bernoulli Naive Bayes handle missing values?

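ans->Bernoulli Naive Bayes has no built-in mechanism for missing values; how they are handled depends on the implementation and on preprocessing. One common approach is to treat a missing value as the feature being absent (encode it as 0). This is simple, but an absent feature still contributes a (1 - p) factor to the likelihood in Bernoulli Naive Bayes, so it is a modelling choice rather than a neutral one. The other approach is to skip the missing feature so that it contributes nothing to the likelihood of any class for that sample; because Naive Bayes treats features as conditionally independent given the class, dropping one factor is straightforward. Note that scikit-learn's BernoulliNB does not accept NaN inputs, so missing values must be imputed or removed before fitting and predicting.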
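
In scikit-learn, one way to deal with missing binary features is to impute them before the classifier. The sketch below is one option among several, not the only correct treatment: it assumes most-frequent-value imputation via `SimpleImputer` and uses made-up data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary feature matrix with two missing entries
X_missing = np.array([[1, 0, 1],
                      [1, np.nan, 0],
                      [0, 1, 1],
                      [0, 1, np.nan]], dtype=float)
y_small = np.array([1, 1, 0, 0])

# Fitting BernoulliNB directly on NaN values is rejected with a ValueError
try:
    BernoulliNB().fit(X_missing, y_small)
    raised = False
except ValueError:
    raised = True

# Assumption: fill each missing entry with the column's most frequent value
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_small)
pred = model.predict(X_missing)

print(raised, pred.shape)
```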

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

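ans->Yes, Gaussian Naive Bayes can be used for multi-class classification. It is the variant of Naive Bayes for continuous features, where the likelihood of each feature within a class is assumed to be Gaussian (normally distributed). The model estimates a prior and a per-feature mean and variance for every class, computes a posterior for each class, and predicts the class with the highest posterior. Nothing in this procedure depends on the number of classes, so it works for both binary and multi-class problems.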
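
A short illustration of the multi-class case, assuming the three-class Iris dataset bundled with scikit-learn as a stand-in (it is not part of the assignment data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

clf = GaussianNB().fit(X_tr, y_tr)

# One posterior probability per class for every test sample
proba = clf.predict_proba(X_te)

print(clf.classes_)           # the class labels the model learned
print(proba.shape)            # (number of test samples, number of classes)
print(clf.score(X_te, y_te))  # test accuracy
```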

Q5. Assignment:

In [4]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 


{'uci_id': 94, 'name': 'Spambase', 'repository_url': 'https://archive.ics.uci.edu/dataset/94/spambase', 'data_url': 'https://archive.ics.uci.edu/static/public/94/data.csv', 'abstract': 'Classifying Email as Spam or Non-Spam', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 4601, 'num_features': 57, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': ['Class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1999, 'last_updated': 'Mon Aug 28 2023', 'dataset_doi': '10.24432/C53G6X', 'creators': ['Mark Hopkins', 'Erik Reeber', 'George Forman', 'Jaap Suermondt'], 'intro_paper': None, 'additional_info': {'summary': 'The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...\n\nThe classification task for this dataset is to determine whether a given email is spam or not.\n\t\nOur collecti

In [3]:
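from sklearn.model_selection import train_test_split
# 70/30 split; ravel() turns the one-column target DataFrame into the 1-D array estimators expect
X_train,X_test,y_train,y_test=train_test_split(X,y.values.ravel(),test_size=0.3,random_state=42)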

In [24]:
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,precision_score,recall_score,f1_score
import warnings
warnings.filterwarnings('ignore')

# Gaussian

In [11]:
gnb=GaussianNB()
gnb.fit(X_train,y_train)
y_pred=gnb.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.8247646632874729


In [18]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
}
grid=GridSearchCV(gnb,param_grid=param_grid,cv=10,verbose=3,scoring='accuracy')
grid.fit(X_train, y_train)

Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV 1/10] END ..............var_smoothing=1e-09;, score=0.801 total time=   0.0s
[CV 2/10] END ..............var_smoothing=1e-09;, score=0.773 total time=   0.0s
[CV 3/10] END ..............var_smoothing=1e-09;, score=0.811 total time=   0.0s
[CV 4/10] END ..............var_smoothing=1e-09;, score=0.792 total time=   0.0s
[CV 5/10] END ..............var_smoothing=1e-09;, score=0.814 total time=   0.0s
[CV 6/10] END ..............var_smoothing=1e-09;, score=0.817 total time=   0.0s
[CV 7/10] END ..............var_smoothing=1e-09;, score=0.823 total time=   0.0s
[CV 8/10] END ..............var_smoothing=1e-09;, score=0.848 total time=   0.0s
[CV 9/10] END ..............var_smoothing=1e-09;, score=0.786 total time=   0.0s
[CV 10/10] END .............var_smoothing=1e-09;, score=0.842 total time=   0.0s
[CV 1/10] END ..............var_smoothing=1e-08;, score=0.798 total time=   0.0s
[CV 2/10] END ..............var_smoothing=1e-08;

In [20]:
grid.best_params_

{'var_smoothing': 1e-06}

In [25]:
y_pred=grid.predict(X_test)
print(accuracy_score(y_test,y_pred))
print (precision_score(y_test,y_pred))
print (recall_score(y_test,y_pred))
print (f1_score(y_test,y_pred))

0.8551774076755974
0.7621696801112656
0.949740034662045
0.845679012345679
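
The precision, recall, and F1 values printed above all derive from the confusion matrix. The sketch below shows that relationship on a small set of made-up labels (not the Spambase predictions), using `confusion_matrix` from scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam emails, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

High recall with low precision, as in the Gaussian model, corresponds to few false negatives but many false positives.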


# Bernoulli

In [26]:
bnb=BernoulliNB()
bnb.fit(X_train,y_train)
y_pred=bnb.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.8790731354091238


In [27]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]
}
grid=GridSearchCV(bnb,param_grid=param_grid,cv=10,verbose=3,scoring='accuracy')
grid.fit(X_train, y_train)

Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV 1/10] END ........................alpha=0.1;, score=0.916 total time=   0.0s
[CV 2/10] END ........................alpha=0.1;, score=0.866 total time=   0.0s
[CV 3/10] END ........................alpha=0.1;, score=0.866 total time=   0.0s
[CV 4/10] END ........................alpha=0.1;, score=0.888 total time=   0.0s
[CV 5/10] END ........................alpha=0.1;, score=0.885 total time=   0.0s
[CV 6/10] END ........................alpha=0.1;, score=0.876 total time=   0.0s
[CV 7/10] END ........................alpha=0.1;, score=0.907 total time=   0.0s
[CV 8/10] END ........................alpha=0.1;, score=0.882 total time=   0.0s
[CV 9/10] END ........................alpha=0.1;, score=0.876 total time=   0.0s
[CV 10/10] END .......................alpha=0.1;, score=0.919 total time=   0.0s
[CV 1/10] END ........................alpha=0.5;, score=0.916 total time=   0.0s
[CV 2/10] END ........................alpha=0.5;

In [28]:
grid.best_params_

{'alpha': 0.1}

In [29]:
y_pred=grid.predict(X_test)
print(accuracy_score(y_test,y_pred))
print (precision_score(y_test,y_pred))
print (recall_score(y_test,y_pred))
print (f1_score(y_test,y_pred))

0.8790731354091238
0.8882575757575758
0.8128249566724437
0.848868778280543


# Multinomial

In [32]:
mnb=MultinomialNB()
mnb.fit(X_train,y_train)
y_pred=mnb.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.782041998551774


In [33]:
from sklearn.model_selection import GridSearchCV
param_grid = {
     'alpha': [0.1, 0.5, 1.0, 1.5, 2.0]
}
grid=GridSearchCV(mnb,param_grid=param_grid,cv=10,verbose=3,scoring='accuracy')
grid.fit(X_train, y_train)

Fitting 10 folds for each of 5 candidates, totalling 50 fits
[CV 1/10] END ........................alpha=0.1;, score=0.823 total time=   0.0s
[CV 2/10] END ........................alpha=0.1;, score=0.748 total time=   0.0s
[CV 3/10] END ........................alpha=0.1;, score=0.780 total time=   0.0s
[CV 4/10] END ........................alpha=0.1;, score=0.770 total time=   0.0s
[CV 5/10] END ........................alpha=0.1;, score=0.798 total time=   0.0s
[CV 6/10] END ........................alpha=0.1;, score=0.795 total time=   0.0s
[CV 7/10] END ........................alpha=0.1;, score=0.835 total time=   0.0s
[CV 8/10] END ........................alpha=0.1;, score=0.823 total time=   0.0s
[CV 9/10] END ........................alpha=0.1;, score=0.780 total time=   0.0s
[CV 10/10] END .......................alpha=0.1;, score=0.770 total time=   0.0s
[CV 1/10] END ........................alpha=0.5;, score=0.823 total time=   0.0s
[CV 2/10] END ........................alpha=0.5;

In [34]:
grid.best_params_

{'alpha': 0.1}

In [35]:
y_pred=grid.predict(X_test)
print(accuracy_score(y_test,y_pred))
print (precision_score(y_test,y_pred))
print (recall_score(y_test,y_pred))
print (f1_score(y_test,y_pred))

0.782766111513396
0.7628083491461101
0.6967071057192374
0.7282608695652174


Conclusion:

Bernoulli Naive Bayes is the best-performing model overall on this dataset. After tuning, it has the highest accuracy (0.879), precision (0.888), and F1-score (0.849), although its F1-score is only marginally ahead of Gaussian Naive Bayes (0.846).

Gaussian Naive Bayes has by far the highest recall (0.950) but much lower precision (0.762) than Bernoulli NB, so it catches most spam at the cost of flagging more legitimate emails.

Multinomial Naive Bayes has the lowest accuracy (0.783), recall (0.697), and F1-score (0.728); its precision (0.763) is essentially tied with Gaussian NB. It is the weakest choice for this dataset.


When choosing the best model, consider the specific requirements of your problem. If precision matters most, as in spam filtering where marking a legitimate email as spam is costly, Bernoulli NB is the better choice. If recall is critical (e.g., identifying fraudulent transactions, where missing a positive case is costly), Gaussian NB could be preferred. Where a balance of precision and recall is desired, Bernoulli NB has the highest F1-score, though only narrowly.
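
The comparison can be tabulated directly from the test-set scores printed for the tuned models. The sketch below copies those scores, rounded to three decimals, and reports which model is best on each metric; the dictionary layout is illustrative.

```python
# Test-set metrics of the tuned models, copied from the notebook outputs and rounded
results = {
    "GaussianNB": {"accuracy": 0.855, "precision": 0.762, "recall": 0.950, "f1": 0.846},
    "BernoulliNB": {"accuracy": 0.879, "precision": 0.888, "recall": 0.813, "f1": 0.849},
    "MultinomialNB": {"accuracy": 0.783, "precision": 0.763, "recall": 0.697, "f1": 0.728},
}
metrics = ["accuracy", "precision", "recall", "f1"]

# Print a simple aligned table
print(f"{'model':<15}" + "".join(f"{m:>11}" for m in metrics))
for name, scores in results.items():
    print(f"{name:<15}" + "".join(f"{scores[m]:>11.3f}" for m in metrics))

# Best model per metric
best = {m: max(results, key=lambda name: results[name][m]) for m in metrics}
print(best)
```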