# Naive Bayes Classifier

I am building a Naive Bayes classifier with the `scikit-learn` library, using `pandas` for data handling and `plotly` for plots. Naive Bayes is based on Bayes' theorem, which is as follows:

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

Naive Bayes uses this theorem to calculate the probability of each class given a set of features, under the "naive" assumption that the features are independent of one another given the class. Despite this assumption, it performs well in many practical applications, especially with high-dimensional data. The algorithm computes the posterior probability of each class and assigns the class label with the highest posterior probability. It is commonly used in text classification, such as spam detection and sentiment analysis. Naive Bayes is efficient, easy to implement, and works well with small datasets.
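
Written out, the decision rule looks roughly like this. Here $c$ stands for a class label and $x_1, \dots, x_n$ for the feature values; these symbols are introduced only for illustration and are not used elsewhere in this notebook:

$$
P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c), \qquad \hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The denominator $P(x_1, \dots, x_n)$ from Bayes' theorem is the same for every class, so it can be dropped when comparing classes. The product over features is where the independence assumption enters.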

I will be looking at a dataset from `kaggle.com` for predicting whether or not someone has diabetes.

# **:)**

In [1]:
# importing libraries

import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import GaussianNB

In [2]:
# Loading in the dataset

df = pd.read_csv('diabetic_data.csv')
df.head()

   glucose  bloodpressure  diabetes
0       40             85         0
1       40             92         0
2       45             63         1
3       45             80         0
4       40             73         1


In [3]:
# gonna do some EDA

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   glucose        995 non-null    int64
 1   bloodpressure  995 non-null    int64
 2   diabetes       995 non-null    int64
dtypes: int64(3)
memory usage: 23.4 KB


In [4]:
df.describe()

         glucose  bloodpressure  diabetes
count      995.0          995.0     995.0
mean   44.306533      79.184925  0.500503
std     6.707567       9.340204  0.500251
min         20.0           50.0       0.0
25%         40.0           72.0       0.0
50%         45.0           80.0       1.0
75%         50.0           87.0       1.0
max         70.0          100.0       1.0


In [5]:
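# Visualizing glucose data
# With y='diabetes', each bar is the sum of the 0/1 diabetes column in that
# glucose bin, i.e. the number of positive cases at that glucose level.

fig = px.histogram(df,
                   x='glucose',
                   y='diabetes',
                   color_discrete_sequence=px.colors.sequential.Cividis)
fig.update_layout(title="Positive Diabetes Cases by Glucose Level",
                  xaxis_title='Glucose Level',
                  yaxis_title='Positive diabetes cases (sum)')
fig.show()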

In [6]:
# Visualizing blood pressure data
# As above, each bar is the sum of the 0/1 diabetes column in that bin.

fig = px.histogram(df,
                   x='bloodpressure',
                   y='diabetes',
                   color_discrete_sequence=px.colors.sequential.Inferno)
fig.update_layout(title="Positive Diabetes Cases by Blood Pressure",
                  xaxis_title='Blood Pressure',
                  yaxis_title='Positive diabetes cases (sum)')
fig.show()

# Making the Model

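Because both features are continuous and look roughly normally distributed (Gaussian), `GaussianNB` is the natural Naive Bayes variant to use: it models each feature within each class as a normal distribution.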
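
To make the Gaussian assumption concrete, here is a minimal hand-rolled sketch of the idea in plain Python. It is not scikit-learn's implementation (for one thing it skips variance smoothing), the function names are my own, and the toy rows are made up rather than taken from the dataset:

```python
import math

def fit_gaussian_nb(rows, labels):
    """Estimate each class's prior plus per-feature mean and variance."""
    params = {}
    for c in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == c]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Population variance of each feature within this class
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        params[c] = (n / len(rows), means, variances)
    return params

def log_gaussian(x, mean, var):
    """Log of the normal density at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(params, row):
    """Return the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for c, (prior, means, variances) in params.items():
        scores[c] = math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(scores, key=scores.get)

# Made-up rows in the form [glucose, bloodpressure]; labels are hypothetical
toy_X = [[40, 85], [42, 90], [38, 88], [50, 70], [52, 65], [48, 72]]
toy_y = [0, 0, 0, 1, 1, 1]

toy_params = fit_gaussian_nb(toy_X, toy_y)
print(predict(toy_params, [41, 87]))  # close to the class 0 rows
print(predict(toy_params, [51, 68]))  # close to the class 1 rows
```

Summing log densities instead of multiplying raw densities is a common way to avoid numerical underflow when there are many features.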

In [7]:
# Feature / label selection

X = df[['glucose', 'bloodpressure']]
y = df['diabetes'].to_numpy()  # 1-D array of labels

In [8]:
# Splitting the data and fitting the model

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=100)

bayes = GaussianNB()
bayes.fit(X_train, y_train)

In [9]:
# Making Predictions

y_pred = bayes.predict(X_test)
print(y_pred)

[1 1 1 1 1 0 0 0 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 1 0 0 1 0
 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0
 0 0 1 0 1 1 0 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0 0 1 0 0 1 0 0 0
 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 0 1
 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 0
 0 0 1 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 1 0 0 0 1
 0 0 0 0 0 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0 1 1 1
 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0
 1 0 1 1 0 0 0 0 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0 0 0]


In [10]:
# Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

[[159  11]
 [ 10 149]]


In [11]:
cr = classification_report(y_test, y_pred)
print(cr)
# Note: this scores the model on the training set, not the test set
print(f"Training Accuracy Score: {bayes.score(X_train, y_train):.2f}")

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       170
           1       0.93      0.94      0.93       159

    accuracy                           0.94       329
   macro avg       0.94      0.94      0.94       329
weighted avg       0.94      0.94      0.94       329

Training Accuracy Score: 0.93
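
The numbers in the report can be checked by hand from the confusion matrix. A small sketch, assuming scikit-learn's layout where rows are true labels and columns are predicted labels, and treating class 1 as the positive class (the variable names are mine):

```python
# Values copied from the confusion matrix printed earlier
cm_values = [[159, 11],
             [10, 149]]

tn, fp = cm_values[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm_values[1]  # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)  # of everything predicted 1, how much was right
recall_1 = tp / (tp + fn)     # of everything truly 1, how much was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

These round to the same test-set accuracy and class 1 precision, recall, and F1 that the classification report shows.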


In [12]:
# Hyperparameter grid for the search

param_grid = {
    'var_smoothing': [1e-8, 1e-9]  # candidate smoothing values
}

In [13]:
grid_search = GridSearchCV(bayes, # Adding the Model that We made 
                           param_grid, # Throwing in the param grid
                           cv = 5, # Cross Validation
                           scoring = 'accuracy', # Scoring is based off of Accuracy
                           n_jobs = -1) 
grid_search.fit(X_train, y_train) # Fitting the gridsearch with training data

In [14]:
print(f"Best Params: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")

Best Params: {'var_smoothing': 1e-08}
Best Score: 0.9309168443496801
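
Since the search is over `var_smoothing`, a rough picture of what that parameter does may help. As I understand the scikit-learn documentation, `GaussianNB` adds `var_smoothing` times the largest feature variance to every per-class variance for numerical stability. The sketch below mimics that in plain Python; the function name and the variance numbers are made up for illustration and are not computed from this dataset:

```python
def smoothed_variances(class_variances, feature_variances, var_smoothing=1e-9):
    """Add a small constant, scaled by the largest feature variance, to each variance."""
    epsilon = var_smoothing * max(feature_variances)
    return [v + epsilon for v in class_variances], epsilon

# Hypothetical variances (not taken from diabetic_data.csv)
feature_variances = [45.0, 87.0]  # variance of each feature over all rows
class_variances = [30.0, 60.0]    # variance of each feature within one class

for vs in (1e-8, 1e-9):
    adjusted, epsilon = smoothed_variances(class_variances, feature_variances, vs)
    print(f"var_smoothing={vs:g} epsilon={epsilon:.2e} adjusted={adjusted}")
```

With candidate values this small, the added constant is negligible next to variances of this size, which would be consistent with the grid search scoring about the same as the default model. A wider grid, for example one spanning several orders of magnitude up to around 1e-2, would probably be needed before smoothing changed the predictions noticeably.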
