# Naïve Bayes Classifier
Naïve Bayes is a family of probabilistic supervised machine learning algorithms used to solve classification problems.

It is based on Bayes' theorem and assumes that the features are conditionally independent given the class.
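
As a quick illustration of Bayes' theorem, here is a minimal sketch that computes a posterior probability from a prior and two likelihoods. All the numbers are made up for illustration only; they do not come from the breast cancer dataset.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_malignant = 0.4                # prior P(malignant)
p_large_given_malignant = 0.9    # likelihood P(large tumor | malignant)
p_large_given_benign = 0.2       # likelihood P(large tumor | benign)

p_benign = 1 - p_malignant

# Law of total probability: P(large tumor)
p_large = (p_large_given_malignant * p_malignant
           + p_large_given_benign * p_benign)

# Bayes' theorem: P(malignant | large tumor)
posterior = p_large_given_malignant * p_malignant / p_large
print(round(posterior, 3))
```

A Naive Bayes classifier does the same calculation for every class, multiplying one likelihood per feature, and then picks the class with the highest posterior.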


Here we implement a Naive Bayes classifier to predict whether a breast tumor is malignant or benign.

We will use the breast cancer dataset that ships with scikit-learn. Let's get started.

# Import Libraries

In [46]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Model specific Library
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

# Load Dataset

In [47]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()

In [None]:
breast_cancer

We have loaded the breast cancer dataset, but the output does not look like a plain table. That is because `load_breast_cancer()` returns the data together with its metadata.

The result is a dictionary-like object, so we can list its keys with `keys()`.

In [None]:
breast_cancer.keys()

The keys include `data`, `feature_names`, `target`, and `target_names`, among others. We will focus on these four; if you are interested, explore the remaining keys as well.

In [None]:
breast_cancer.data
# This is our actual data.

In [None]:
# These are the feature names for our dataset (data)
breast_cancer.feature_names

In [None]:
# These are our target data
breast_cancer.target

In [None]:
breast_cancer.target_names

These are the string labels of our target classes.

A target value of 0 means `malignant`, i.e. the tumor is cancerous.

A target value of 1 means `benign`, i.e. the tumor is not cancerous.
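
The mapping between the integer codes and the class names can be checked directly. The sketch below reloads the dataset so that it runs on its own; the names `data`, `label_of` and `counts` are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Position in target_names == integer code used in target
label_of = {code: str(name) for code, name in enumerate(data.target_names)}

# Number of samples per class code
counts = np.bincount(data.target)

for code, name in label_of.items():
    print(code, name, counts[code])
```

This also shows that the two classes are not equally frequent, which is worth keeping in mind when reading accuracy scores later on.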

# Create Dataframe 
Create a dataframe from the keys that are of interest to us.

In [54]:
df = pd.DataFrame(
    np.c_[breast_cancer.data, breast_cancer.target], 
    columns = [list(breast_cancer.feature_names)+ ['target']]
                 )

In [None]:
df.head()

In [56]:
# what is the shape of the data
# put your code here

In [57]:
# how to describe the data
# note this is a very informative function that is only available in pandas
# put your code here

In [None]:
# how do you examine the null values?
# count the rows that contain at least one null value


print(df.isnull().any(axis=1).sum())

In [None]:
# another way to check for null values
df.info()

# Split the data into X and y

In [60]:
X = df.iloc[:, 0:-1]
y = df.iloc[:,-1]


In [None]:
X.shape, y.shape

In [62]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 999)
# remember that the random state is important for reproducibility

In [None]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape
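
Because the classes are imbalanced, one option worth knowing is a stratified split, which keeps the class proportions roughly equal in both parts. This is a sketch of an alternative, not what the notebook above does; it works on the raw arrays via `return_X_y=True` so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_arr, y_arr = load_breast_cancer(return_X_y=True)

# stratify=y_arr keeps the malignant/benign ratio similar in both splits
Xs_train, Xs_val, ys_train, ys_val = train_test_split(
    X_arr, y_arr, test_size=0.2, random_state=999, stratify=y_arr
)

print(Xs_train.shape, Xs_val.shape)
# Share of benign samples (class 1) overall, in train, and in validation
print(y_arr.mean(), ys_train.mean(), ys_val.mean())
```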

# Train Naive Bayes Classifier Model

## Gaussian Naive Bayes

In [64]:
clf = GaussianNB()

In [None]:
clf.fit(X_train, y_train)
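
To see what fitting actually learned, the following sketch recomputes the class probabilities of one sample by hand from the fitted attributes `class_prior_`, `theta_` (per-class feature means) and `var_` (per-class feature variances). It fits a separate model on the full raw arrays so that it is self-contained; the variable names are illustrative, and `sigma_` is used as a fallback for older scikit-learn releases.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X_arr, y_arr = load_breast_cancer(return_X_y=True)
model = GaussianNB().fit(X_arr, y_arr)

x = X_arr[0]  # one sample with 30 features

# Per-class variances: var_ in recent scikit-learn, sigma_ in older releases
var = model.var_ if hasattr(model, "var_") else model.sigma_

# log P(class) + sum over features of log N(x_j | mean, variance)
log_prior = np.log(model.class_prior_)
log_likelihood = -0.5 * np.sum(
    np.log(2 * np.pi * var) + (x - model.theta_) ** 2 / var, axis=1
)
joint = log_prior + log_likelihood

# Normalize in a numerically stable way to get posterior probabilities
manual = np.exp(joint - joint.max())
manual /= manual.sum()

print(manual)
print(model.predict_proba(x.reshape(1, -1))[0])
```

The two printed arrays should agree, which shows that Gaussian NB is nothing more than Bayes' theorem with one normal distribution per feature and class.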

In [None]:
clf.score(X_val, y_val)

Congrats! We reached about 92% accuracy on the first try.
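
For a classifier, `score` returns the mean accuracy. Accuracy alone hides which kind of mistake the model makes, so a confusion matrix is a useful companion. The sketch below repeats the split and fit on the raw arrays so that it runs on its own; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_arr, y_arr = load_breast_cancer(return_X_y=True)
Xa_train, Xa_val, ya_train, ya_val = train_test_split(
    X_arr, y_arr, test_size=0.2, random_state=999
)

model = GaussianNB().fit(Xa_train, ya_train)
y_pred = model.predict(Xa_val)

acc = accuracy_score(ya_val, y_pred)   # same value as model.score(...)
cm = confusion_matrix(ya_val, y_pred)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

In the matrix, row 0 / column 1 counts malignant tumors predicted as benign, which is usually the more costly error in this setting.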

Let's move on to another type of Naive Bayes.

## Multinomial Naive Bayes

In [67]:
clf_mn = MultinomialNB()

In [None]:
clf_mn.fit(X_train, y_train)

In [None]:
clf_mn.score(X_val, y_val)

The score goes down compared to Gaussian NB.

Let's try the third type as well.

## Bernoulli Naive Bayes

In [None]:
clf_b = BernoulliNB()
clf_b.fit(X_train, y_train)
clf_b.score(X_val, y_val)

Ouch, this score is much worse.

But don't worry. Our features are continuous numerical values, so Gaussian NB is the best fit here. Multinomial NB expects count-like features (such as word counts) and Bernoulli NB expects binary features (such as word presence), which is why they are typically used on text datasets.
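
One way to see why Bernoulli NB struggles here: by default it binarizes every feature at a threshold of 0.0 (`binarize=0.0`), and almost all measurements in this dataset are strictly positive. The sketch below mimics that thresholding on the raw arrays; `binarized`, `share_of_ones` and `constant_columns` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer

X_arr, _ = load_breast_cancer(return_X_y=True)

# What BernoulliNB effectively sees with its default binarize=0.0
binarized = (X_arr > 0.0).astype(int)

share_of_ones = binarized.mean()                             # fraction of entries equal to 1
constant_columns = int((binarized.min(axis=0) == 1).sum())   # columns that are all ones

print(share_of_ones, constant_columns)
```

When nearly every entry becomes 1, the binarized features carry almost no information about the class, so the model can do little better than predicting the majority class.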

# Predict

To try out prediction, let's create our own test record.
Since we are not medical experts, we will simply copy one record from our dataset and change a few values.

The `display.max_columns` option controls the number of columns to be printed. It receives an int or None (to print all the columns):

In [71]:
pd.set_option('display.max_columns', None)

In [None]:
# print(df.iloc[99])
df[99:100]

In [None]:
patient1 = [14.42,19.77,94.48,642.5,0.09752,0.1141,0.09388,0.05839,0.1879,0.0639,0.2895,1.851,2.376,26.85,0.008005,0.02895,0.03321,0.01424,0.01462,0.004452,16.33,30.86,109.5,826.4,0.1431,0.3026,0.3194,0.1565,0.2718,0.09353]
patient1

`predict` expects a 2-dimensional array (samples × features), so we need to convert `patient1` into 2 dimensions.

In [None]:
patient1 = np.array([patient1])
patient1
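
An equivalent way to get the 2-dimensional shape is `reshape(1, -1)`, which reads as "one row, as many columns as needed". The sketch below uses a shortened, hypothetical record with only three values to keep it readable.

```python
import numpy as np

sample = [14.42, 19.77, 94.48]  # shortened hypothetical record (3 features)

as_row = np.asarray(sample).reshape(1, -1)  # 1 sample, 3 features
wrapped = np.array([sample])                # same result by wrapping in a list

print(as_row.shape, wrapped.shape)
```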

In [None]:
clf.predict(patient1)

In [None]:
pred = clf.predict(patient1)

if pred[0] == 0:
    print("Patient is suffering from Cancer (Malignant Tumor)")
else:
    print("Patient has no Cancer (Benign)")

# Probability
Let's look at the predicted probability of each class.

In [None]:
pred_prob = clf.predict_proba(patient1)
pred_prob
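
The columns of `predict_proba` follow the order of `classes_`, so column 0 is class 0 (malignant) and column 1 is class 1 (benign), and each row sums to 1. A small self-contained sketch, fitted on the full raw arrays with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
model = GaussianNB().fit(data.data, data.target)

proba = model.predict_proba(data.data[:3])  # shape (3 samples, 2 classes)

# Pair each probability with its class name, in classes_ order
names = [str(data.target_names[c]) for c in model.classes_]
for row in proba:
    print(dict(zip(names, np.round(row, 4))))
```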

# Visualize it

In [78]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
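# GaussianNB does not expose feature importances.
# clf.theta_[1] holds the mean of each feature for class 1 (benign);
# it is used below as a rough ranking and is dominated by large-scale features.
feature_importance = clf.theta_[1]
print(feature_importance)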


In [None]:
# plot the 5 features with the largest class-1 (benign) means
plt.figure(figsize=(10, 6))
# pair each feature name with its mean and sort in descending order
zip_data = zip(breast_cancer.feature_names, feature_importance)
zip_data = sorted(zip_data, key=lambda x: x[1], reverse=True)
sorted_names, sorted_means = zip(*zip_data)
sns.barplot(x=list(sorted_means)[:5], y=list(sorted_names)[:5])
plt.xlabel('Mean value for the benign class (clf.theta_[1])')
plt.ylabel('Feature Names')
plt.title('Top 5 Features by Benign-Class Mean')
plt.show()
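
Class means are not a true importance measure. One model-agnostic alternative is permutation importance from `sklearn.inspection`: shuffle one feature at a time and measure how much the validation score drops. This is a sketch of that swapped-in technique, not what the cells above compute; it repeats the split and fit on the raw arrays so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
Xp_train, Xp_val, yp_train, yp_val = train_test_split(
    data.data, data.target, test_size=0.2, random_state=999
)
model = GaussianNB().fit(Xp_train, yp_train)

# Shuffle each feature 10 times and record the mean drop in accuracy
result = permutation_importance(
    model, Xp_val, yp_val, n_repeats=10, random_state=0
)

# Indices of the 5 features whose shuffling hurts accuracy the most
top5 = result.importances_mean.argsort()[::-1][:5]
for i in top5:
    print(data.feature_names[i], round(result.importances_mean[i], 4))
```

Unlike the raw means, this ranking does not depend on the scale of each feature.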


In [None]:

plt.bar(breast_cancer.target_names, pred_prob[0])
plt.title('Prediction Probability for Malignant vs Benign')
plt.xlabel('Class')
plt.ylabel('Probability')
plt.show()



In [None]:
sns.barplot(y = pred_prob[0], x = breast_cancer.target_names)


Annotate the bars with their values.

In [None]:
bars = plt.bar(breast_cancer.target_names, pred_prob[0])
plt.xlabel('Class')
plt.ylabel('Probability')

# bars on a categorical axis sit at x = 0, 1, ...
for i in range(2):
    plt.annotate(str(round(pred_prob[0][i], 2)), xy=(i, pred_prob[0][i]), ha='center', va='bottom')

plt.show()

In [44]:
# vary the 'worst area' feature of a single patient
# build 10 evenly spaced values between the min and max 'worst area' in the training set
# later we predict the probability for each of these values

worst_area = X_train['worst area'].values.ravel()
# 10 evenly spaced values
max_worst_area = worst_area.max()
min_worst_area = worst_area.min()
bins = np.linspace(min_worst_area, max_worst_area, 10)


In [None]:
# sweep the 'worst area' value of patient1 and watch the prediction change
permutated_patient1 = patient1.copy()

# position of 'worst area' among the feature columns
worst_area_index = list(breast_cancer.feature_names).index('worst area')

prob_list = []
class_list = []
for value in np.ravel(bins):
    permutated_patient1[0][worst_area_index] = value
    # column 0 of predict_proba is class 0 (malignant)
    prob_list.append(clf.predict_proba(permutated_patient1)[0][0])
    class_list.append(clf.predict(permutated_patient1)[0])

# plot the malignant probability against the swept 'worst area' values
plt.plot(np.ravel(bins), prob_list)
plt.scatter(np.ravel(bins), prob_list)
plt.xlabel('Worst Area')
plt.ylabel('Probability of having Cancer (Malignant)')
plt.title('Effect of Worst Area on the Prediction')
plt.show()


# Summary and Q&A

In this workshop, we learned how to implement and evaluate different types of Naive Bayes classifiers using scikit-learn. 
We explored the **GaussianNB**, **MultinomialNB**, and **BernoulliNB** models and applied them to the breast cancer dataset.

Feel free to ask questions or experiment further with the code!
