# Naïve Bayes Classifier

Probability measures how likely something is to happen. When all outcomes are equally likely, it is calculated by dividing the number of ways something can happen by the total number of possible outcomes. For example, when flipping a coin there are 2 possible outcomes, so the probability of getting heads is 50% (1 way to get heads out of 2 possible outcomes). The formula looks like this:

### \begin{align} \text{probability} = \frac{\text{number of chances}}{\text{total outcomes}} \end{align}
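As a quick check of the coin example, here is a minimal sketch of the formula in Python. The helper name `probability` is only an illustration and is not used elsewhere in this notebook.

```python
from fractions import Fraction

def probability(chances, total_outcomes):
    """Number of chances divided by total outcomes, as an exact fraction."""
    return Fraction(chances, total_outcomes)

# Coin flip: 1 chance to get heads out of 2 possible outcomes
p_heads = probability(1, 2)
print(float(p_heads))  # 0.5
```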

The Naïve Bayes classification model is an algorithm based on Bayes' Theorem, which gives the probability of an event given that another event is already known to have occurred. It is represented by the following formula:

### \begin{align} P(B|A) = \frac{P(B)\times P(A|B)}{P(A)} \end{align}

Here, the probability of B given that A happened is equal to the probability of B times the probability of A given that B happened, divided by the probability of A. For example, in a bag of 2 blue marbles and 3 red marbles, if a blue marble is pulled from the bag and not put back, then the probability of getting another blue marble is affected by the fact that a blue marble was already drawn (there are now fewer blue marbles in the bag).

<center>![Marbles Probability](https://notebooks.azure.com/priesterkc/projects/testdb/raw/marbles.png "Probability using marbles")</center>
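The marble example can be checked by listing every possible pair of draws. The sketch below assumes the marbles are drawn without replacement and, as a labelling choice made here, takes A to be "the first marble is blue" and B to be "the second marble is blue".

```python
from fractions import Fraction
from itertools import permutations

bag = ["blue", "blue", "red", "red", "red"]

# Every ordered pair of distinct marbles: (first draw, second draw)
draws = list(permutations(range(len(bag)), 2))

def prob(event):
    """Fraction of all equally likely draws for which the event is true."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

def first_blue(d):
    return bag[d[0]] == "blue"

def second_blue(d):
    return bag[d[1]] == "blue"

def both_blue(d):
    return first_blue(d) and second_blue(d)

p_a = prob(first_blue)                # P(A)
p_b = prob(second_blue)               # P(B)
p_a_given_b = prob(both_blue) / p_b   # P(A|B)

# Bayes' Theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a
print(p_b_given_a)  # 1/4
```

After one blue marble is removed, 1 of the 4 remaining marbles is blue, which matches the 1/4 that the formula returns.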

## Naïve Bayes Probability Calculation

In the following dataset, let's find the probability of a student passing a test (60% or higher) given that they studied 5 hours or less. Here are the things we'll need to know:

- the total number of students
- the number of students that passed the test
- the number of students that studied 5 hours or less
- the number of students that studied 5 hours or less, given that they already passed

Using those values, we can then calculate:

- the probability of passing the test
- the probability of studying 5 hours or less
- the probability of studying 5 hours or less, given already passing the test

In [1]:
import pandas as pd
import numpy as np

In [2]:
#load data
filename = "EduGradeData.csv"
df = pd.read_csv(filename)

df.head() #first 5 rows

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH


In [3]:
#descriptive statistics
df.describe()

Unnamed: 0,age,exercise,hours,grade
count,2000.0,2000.0,2000.0,2000.0
mean,16.5785,3.0005,10.9885,82.55605
std,1.696254,1.423205,4.063942,9.747593
min,14.0,0.0,0.0,32.0
25%,15.0,2.0,8.0,75.575
50%,17.0,3.0,11.0,82.7
75%,18.0,4.0,14.0,89.7
max,19.0,5.0,20.0,100.0


In [4]:
#total number of students
total = len(df)

In [5]:
#rows of students that passed the test
df_pass = df[df['grade'] >= 60]

#number of students that passed
numpass = len(df_pass)

In [6]:
#rows of students that studied 5 hours or less
df_less5hr = df[df['hours'] <= 5]

#number of students that studied 5 hours or less
num_less5hr = len(df_less5hr)

In [7]:
#rows of students that studied 5 hours or less and passed
df_5less_pass = df_pass[df_pass['hours'] <= 5]

#number of students that studied 5 hours or less and passed
num_5less_pass = len(df_5less_pass)

In [8]:
#probability of passing the test
#number of students that passed divided by total number of students
P_pass = numpass/total
P_pass

0.993

In [9]:
#probability of studying 5 hours or less
#number of students that studied 5 hours or less divided by total number of students
P_less5hr = num_less5hr/total
P_less5hr

0.1005

In [10]:
#probability of studying 5 hours or less given that you passed
#number of students that studied 5 hours or less given they passed, divided by total students that passed
P_5hr_pass = num_5less_pass/numpass
P_5hr_pass

0.094662638469285

In [11]:
#SOLUTION: probability of passing given that you studied 5 hours or less

#probability of passing times probability of studying 5 hours or less given that you passed
#divided by probability of studying 5 hours or less
P_pass_less5hr = (P_pass * P_5hr_pass)/(P_less5hr)
P_pass_less5hr

0.9353233830845771

#### The probability of passing the test, given that a student studied 5 hours or less, is about 93.5%. So such a student has only about a 6.5% chance of failing. That's not too bad; maybe the test is fairly easy.
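One way to sanity-check the Bayes calculation is to compute the same conditional probability directly: the number of students who studied 5 hours or less and passed, divided by the number who studied 5 hours or less. The counts in this sketch are not printed in the notebook; they are back-calculated from the probabilities shown above (0.993 × 2000, 0.1005 × 2000, and 0.094662638469285 × 1986), so treat them as assumptions.

```python
# Counts inferred from the printed probabilities (assumed, not printed above)
total = 2000
numpass = 1986          # 0.993 * 2000
num_less5hr = 201       # 0.1005 * 2000
num_5less_pass = 188    # 0.094662638469285 * 1986

# Bayes' Theorem, as in the cell above
p_bayes = (numpass / total) * (num_5less_pass / numpass) / (num_less5hr / total)

# Direct conditional probability from the counts
p_direct = num_5less_pass / num_less5hr

print(round(p_bayes, 4), round(p_direct, 4))  # 0.9353 0.9353
```

Both routes give the same answer because the total number of students and the number who passed cancel out of the Bayes expression.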

***

## Naïve Bayes using Scikit-Learn

Let's use the same dataset as above and build a Naïve Bayes classification model to predict whether a student passed or failed the test.

### Gaussian Naïve Bayes

There are different types of Naïve Bayes classifiers; in the examples below, we will use Gaussian Naïve Bayes to build the predictive model. Gaussian Naïve Bayes applies conditional probability under the assumption that the values of each feature are normally distributed within each class.
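To make the idea concrete, here is a minimal sketch of the Gaussian calculation for a single feature, written in plain Python. The toy data and function names are hypothetical, and this is not how scikit-learn is implemented internally (for instance, `GaussianNB` also applies variance smoothing). With several features, the "naïve" step is to multiply the per-feature likelihoods together as if the features were independent.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical toy data: hours studied and whether the student passed (1) or failed (0)
hours = [1, 2, 3, 8, 10, 12, 14]
passed = [0, 0, 0, 1, 1, 1, 1]

def fit(xs, ys):
    """Estimate the prior, mean and variance of the feature for each class."""
    params = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        params[label] = (len(values) / len(xs), mean(values), pvariance(values))
    return params

def predict_proba(x, params):
    """P(class | x): prior times Gaussian likelihood, normalised to sum to 1."""
    scores = {c: prior * gaussian_pdf(x, mu, var)
              for c, (prior, mu, var) in params.items()}
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

params = fit(hours, passed)
print(predict_proba(4, params))   # class 0 (failed) is more likely
print(predict_proba(9, params))   # class 1 (passed) is more likely
```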

In [12]:
from sklearn.naive_bayes import GaussianNB   #import Gaussian Bayes modeling function
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [13]:
#check to see if there are any missing values
df.count()

fname             2000
lname             2000
gender            2000
age               2000
exercise          2000
level_of_fit      2000
hours             2000
level_of_study    1997
grade             2000
home_state        2000
dtype: int64

In [14]:
df.dtypes

fname              object
lname              object
gender             object
age                 int64
exercise            int64
level_of_fit       object
hours               int64
level_of_study     object
grade             float64
home_state         object
dtype: object

In [15]:
#create a dataframe with columns to use in the model
#.copy() makes modeldf independent of df, so later column assignments are safe
modeldf = df[['gender', 'age', 'exercise', 'hours', 'grade']].copy()
modeldf.head()

Unnamed: 0,gender,age,exercise,hours,grade
0,female,17,3,10,82.4
1,male,18,4,4,78.2
2,male,18,5,9,79.3
3,female,14,2,7,83.2
4,female,18,4,15,87.4


In [16]:
#transform gender column to binary values (0,1)
modeldf['gender'] = modeldf['gender'].map({'female': 0, 'male': 1})
modeldf.head()



Unnamed: 0,gender,age,exercise,hours,grade
0,0,17,3,10,82.4
1,1,18,4,4,78.2
2,1,18,5,9,79.3
3,0,14,2,7,83.2
4,0,18,4,15,87.4


In [17]:
#see which features are correlated to each other
modeldf.corr()

Unnamed: 0,gender,age,exercise,hours,grade
gender,1.0,0.006192,-0.032681,0.013906,-0.016547
age,0.006192,1.0,-0.003643,-0.017467,-0.00758
exercise,-0.032681,-0.003643,1.0,0.021105,0.161286
hours,0.013906,-0.017467,0.021105,1.0,0.801955
grade,-0.016547,-0.00758,0.161286,0.801955,1.0


In [18]:
#create a column to label if a student passed or failed a test
modeldf['passed'] = np.where(df['grade']>= 60, 1, 0)

#drop grade column
modeldf.drop('grade', axis=1, inplace=True)



In [19]:
#dataframe with predicting features
X = modeldf.drop('passed', axis=1)

#column of predictive target values
y = modeldf['passed']

In [20]:
#create training and test data
#will leave test size at default (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)

In [21]:
#initialize Gaussian Bayes classifier
gnb = GaussianNB()

In [22]:
#train the model to learn trends
gnb.fit(X_train, y_train)

GaussianNB()

In [23]:
#predictive score of the model on the training data
gnb.score(X_train, y_train)

0.9933333333333333

In [24]:
#test the model on unseen data
#store predicted values in a variable
y_pred = gnb.predict(X_test)

In [25]:
#Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Failed', 'Predicted Passed'],
    index=['True Failed', 'True Passed']
)

cm

Unnamed: 0,Predicted Failed,Predicted Passed
True Failed,0,4
True Passed,0,496


In [26]:
#frequency of passed students to failed students in the test dataset
y_test.value_counts()

1    496
0      4
Name: passed, dtype: int64

In [27]:
#predictive score of the model on the test data
gnb.score(X_test, y_test)

0.992

In [28]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         4
           1       0.99      1.00      1.00       496

    accuracy                           0.99       500
   macro avg       0.50      0.50      0.50       500
weighted avg       0.98      0.99      0.99       500
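The numbers in the report can be reproduced by hand from the confusion matrix. This sketch uses the four counts shown above; the `safe_div` helper is introduced here only to handle the 0/0 case for the "failed" class, which the report shows as 0.00.

```python
# Counts read from the confusion matrix above (class 1 = passed, class 0 = failed)
tn, fp = 0, 4      # true failed: predicted failed, predicted passed
fn, tp = 0, 496    # true passed: predicted failed, predicted passed

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

accuracy = (tp + tn) / (tp + tn + fp + fn)

precision_pass = safe_div(tp, tp + fp)   # of predicted passes, how many passed
recall_pass = safe_div(tp, tp + fn)      # of actual passes, how many were found

precision_fail = safe_div(tn, tn + fn)   # 0/0: no student was predicted to fail
recall_fail = safe_div(tn, tn + fp)      # of actual fails, how many were found

print(accuracy, precision_pass, recall_pass, precision_fail, recall_fail)
# 0.992 0.992 1.0 0.0 0.0
```

The high accuracy comes entirely from the "passed" class; the model did not identify any of the 4 students who failed.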





### Bernoulli Naïve Bayes

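The Bernoulli Naïve Bayes classifier is designed for features that are binary (Boolean; True/False or 1/0 values). scikit-learn's `BernoulliNB` converts numeric features to binary values using its `binarize` threshold (0.0 by default). Since the target here is also binary (passed/failed), let's try this method on the dataset from the previous example.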
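It helps to see what the features look like once they are converted to binary values. `BernoulliNB` has a `binarize` parameter that defaults to 0.0, meaning values greater than 0 become 1 and everything else becomes 0. The sketch below mimics that step in plain Python on the first five rows shown earlier (gender, age, exercise, hours); the `binarize` function here is a stand-in written for illustration, not the library's own code.

```python
# First five rows of the features, as shown by modeldf.head():
# gender, age, exercise, hours
rows = [
    [0, 17, 3, 10],
    [1, 18, 4, 4],
    [1, 18, 5, 9],
    [0, 14, 2, 7],
    [0, 18, 4, 15],
]

def binarize(row, threshold=0.0):
    """Map each value to 1 if it is greater than the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

binary_rows = [binarize(row) for row in rows]
for row in binary_rows:
    print(row)
# [0, 1, 1, 1]
# [1, 1, 1, 1]
# [1, 1, 1, 1]
# [0, 1, 1, 1]
# [0, 1, 1, 1]
```

In these rows, age, exercise and hours all collapse to 1, so only gender still varies. With the default threshold, most of the information in the numeric columns is lost, which is worth keeping in mind when reading the results below.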

In [29]:
#import Bernoulli Naïve Bayes function from scikit-learn library
from sklearn.naive_bayes import BernoulliNB

In [30]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [31]:
#build the model with training data
bnb.fit(X_train, y_train)

BernoulliNB()

In [32]:
#model's predictive score on the training data
bnb.score(X_train, y_train)

0.9933333333333333

In [33]:
#test the model on unseen data
#store predicted values in a variable
y_pred = bnb.predict(X_test)

In [34]:
#Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Failed', 'Predicted Passed'],
    index=['True Failed', 'True Passed']
)

cm

Unnamed: 0,Predicted Failed,Predicted Passed
True Failed,0,4
True Passed,0,496


In [35]:
#predictive score of the model on the test data
bnb.score(X_test, y_test)

0.992

Overall, the model predicts "passed" for every student in the test set: it correctly identifies all 496 students who passed but none of the 4 who failed, so the 99.2% score mostly reflects how rare failing students are. With so few failing students, the model didn't have enough data points to learn which feature values are associated with failing the test. One way to improve the results would be to rebalance the training data, for example by reducing the number of passing students, so that data points for failing students carry more weight. This dataset is also small, so new data with more students that failed could help the model see the trends for failing students. Lastly, it could just be that Naïve Bayes isn't the best model to use for the data and we should compare its results to other predictive classification models.
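A simple way to put the 99.2% score in context is to compare it with a baseline that always predicts "passed", and to look at a score that weights both classes equally. The sketch below uses the test-set counts shown earlier; balanced accuracy is computed here by hand as the average of the two per-class recalls.

```python
# Class counts in the test set, from y_test.value_counts()
n_pass, n_fail = 496, 4

# Baseline: always predict the majority class ("passed")
baseline_accuracy = n_pass / (n_pass + n_fail)

# Per-class recall of the model, from the confusion matrix
recall_pass = 496 / n_pass   # every passing student was predicted as passed
recall_fail = 0 / n_fail     # no failing student was predicted as failed

# Balanced accuracy: the average recall, so both classes count equally
balanced_accuracy = (recall_pass + recall_fail) / 2

print(baseline_accuracy, balanced_accuracy)  # 0.992 0.5
```

The model's test score equals the baseline, and its balanced accuracy of 0.5 matches the macro-average recall in the classification report.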