# Naïve Bayes Classifier

Probability is a measure of how likely something is to happen. When all outcomes are equally likely, it is calculated by taking the number of ways the event can happen and dividing it by the total number of possible outcomes. For example, when flipping a coin there are 2 possible outcomes, so the probability of getting heads is 50% (1 way to get heads out of 2 possible outcomes). The formula looks like:

### \begin{align} \text{probability} = \frac{\text{number of chances}}{\text{total outcomes}} \end{align}
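
As a quick check of the formula, the coin-flip example can be worked in Python. This is only a minimal sketch; the helper name `probability` is made up for illustration.

```python
from fractions import Fraction

def probability(favorable, total):
    """Number of favorable outcomes divided by the total number of outcomes."""
    return Fraction(favorable, total)

# a coin has 2 possible outcomes, 1 of which is heads
p_heads = probability(1, 2)
print(float(p_heads))  # 0.5
```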

The Naïve Bayes classification model is an algorithm based on Bayes' Theorem, which gives the probability of an event given that another event is already known to have occurred. It is represented by the following formula:

### \begin{align} P(B|A) = \frac{P(B)\times P(A|B)}{P(A)} \end{align}

That is, the probability of B given that A happened is equal to the probability of B times the probability of A given that B happened, divided by the probability of A. For example, in a bag of 2 blue marbles and 3 red marbles, if a blue marble is drawn and not put back, the probability of drawing another blue marble drops from 2/5 to 1/4, because only 1 of the 4 remaining marbles is blue.

<center>![Marbles Probability](https://notebooks.azure.com/priesterkc/projects/testdb/raw/marbles.png "Probability using marbles")</center>
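
The marble example can be checked by brute force. The sketch below enumerates every ordered way to draw two marbles without replacement, then applies Bayes' Theorem with A = "first marble is blue" and B = "second marble is blue". The helper `prob` and the event names are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations

bag = ['blue', 'blue', 'red', 'red', 'red']

# every ordered way to draw 2 marbles without replacement (20 equally likely pairs)
draws = list(permutations(range(len(bag)), 2))

def prob(event, given=lambda d: True):
    """P(event | given), counted over the equally likely draws."""
    pool = [d for d in draws if given(d)]
    return Fraction(sum(1 for d in pool if event(d)), len(pool))

first_blue = lambda d: bag[d[0]] == 'blue'    # event A
second_blue = lambda d: bag[d[1]] == 'blue'   # event B

p_a = prob(first_blue)                              # P(A) = 2/5
p_b = prob(second_blue)                             # P(B) = 2/5
p_a_given_b = prob(first_blue, given=second_blue)   # P(A|B) = 1/4

# Bayes' Theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a
print(p_b_given_a)  # 1/4
```

Counting the conditional probability directly, with `prob(second_blue, given=first_blue)`, gives the same 1/4.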

## Naïve Bayes Probability Calculation

In the following dataset, let's find the probability that a passenger survived the Titanic disaster given a feature such as their sex, passenger class, or age. For a single feature value (for example, being female), here are the things we'll need to know:

- the total number of passengers
- the number of passengers that survived
- the number of passengers that were female
- the number of passengers that were female, given that they survived

Using those values, we can then calculate:

- the probability of surviving
- the probability of being female
- the probability of being female, given that the passenger survived
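
For a single feature such as sex, the counting steps listed above can be sketched as follows. The records are a made-up toy list, not the Titanic data, and all variable names are chosen for illustration only.

```python
from fractions import Fraction

# made-up toy records (NOT the Titanic data): (sex, survived)
passengers = [
    ('female', 1), ('female', 1), ('female', 0), ('male', 0),
    ('male', 0), ('male', 1), ('male', 0), ('female', 1),
]

# the counts we need
total = len(passengers)
survived = sum(1 for sex, s in passengers if s == 1)
female = sum(1 for sex, s in passengers if sex == 'female')
female_and_survived = sum(1 for sex, s in passengers if sex == 'female' and s == 1)

# the probabilities built from those counts
p_survived = Fraction(survived, total)                             # P(survived)
p_female = Fraction(female, total)                                 # P(female)
p_female_given_survived = Fraction(female_and_survived, survived)  # P(female | survived)

# Bayes' Theorem: P(survived | female) = P(survived) * P(female | survived) / P(female)
p_survived_given_female = p_survived * p_female_given_survived / p_female
print(p_survived_given_female)  # 3/4
```

In this toy list 3 of the 4 women survived, so the direct count agrees with the result from the formula.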

In [1]:
import pandas as pd
import numpy as np

In [2]:
#load data
filename = "../datasets/titanic.xls"
df = pd.read_excel(filename)

df.head() #first 5 rows

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


In [3]:
#descriptive statistics
df.describe()

| | pclass | survived | age | sibsp | parch | fare | body |
|---|---|---|---|---|---|---|---|
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 121.0 |
| mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.4135 | 1.041658 | 0.86556 | 51.758668 | 97.696922 |
| min | 1.0 | 0.0 | 0.1667 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 0.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 72.0 |
| 50% | 3.0 | 0.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 155.0 |
| 75% | 3.0 | 1.0 | 39.0 | 1.0 | 0.0 | 31.275 | 256.0 |
| max | 3.0 | 1.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 328.0 |


About 38% of the passengers in the dataset survived the Titanic disaster (the mean of `survived` is 0.38).  
Passenger ages range from about 0.17 to 80 years.

In [4]:
df.keys()

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [5]:
#total missing values
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [6]:
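#fill missing values for age with the mean age of passengers sharing
#the same survival status, sex, and passenger class
#assign the result back instead of using inplace=True on a column selection
df['age'] = df['age'].fillna(df.groupby(['survived', 'sex', 'pclass'])['age'].transform('mean'))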

In [7]:
#only 2 missing values so we'll fill with most common embarkation point
df['embarked'].value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

In [8]:
#fill missing values with 'S', the most common embarkation point
df['embarked'] = df['embarked'].fillna('S')

In [9]:
modeldf = df.drop(['name','ticket','fare', 'cabin', 'boat', 'body', 'home.dest'], axis=1)

In [10]:
#total missing values
modeldf.isnull().sum()

pclass      0
survived    0
sex         0
age         0
sibsp       0
parch       0
embarked    0
dtype: int64

## Naïve Bayes using Scikit-Learn

Let's use the same dataset above and build a Naïve Bayes classification model to predict passenger survival.

### Gaussian Naïve Bayes

There are different types of Naïve Bayes models; in the examples below, we will use Gaussian Naïve Bayes to build the predictive model. Gaussian Naïve Bayes applies conditional probability under the assumption that each continuous feature is normally distributed within each class.
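
To make the normal-distribution assumption concrete: for a continuous feature, a Gaussian Naïve Bayes model estimates a mean and variance per class and scores a new value with the normal density. The sketch below does this by hand for one feature using only the standard library. The ages are made up (not Titanic data), the helper `gaussian_pdf` is hypothetical, and the sketch ignores scikit-learn's `var_smoothing` adjustment.

```python
import math
from statistics import mean, pvariance

# made-up ages per class (0 = died, 1 = survived), for illustration only
ages = {0: [22.0, 35.0, 40.0, 28.0], 1: [4.0, 30.0, 18.0, 26.0, 2.0]}

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

new_age = 5.0
n = sum(len(values) for values in ages.values())

scores = {}
for label, values in ages.items():
    prior = len(values) / n                                              # P(class)
    likelihood = gaussian_pdf(new_age, mean(values), pvariance(values))  # P(age | class)
    scores[label] = prior * likelihood                                   # numerator of Bayes' Theorem

# dividing by the sum plays the role of P(A) and turns the scores into probabilities
posterior = {label: s / sum(scores.values()) for label, s in scores.items()}
print(posterior)
```

With several features, the "naïve" step is to multiply the per-feature likelihoods together as if the features were independent within each class.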

In [11]:
from sklearn.naive_bayes import GaussianNB   #import Gaussian Bayes modeling function
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [12]:
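#create a dataframe with columns to use in the model
#.copy() makes an independent dataframe so later column assignments
#do not trigger SettingWithCopyWarning
modeldf = df[['sex', 'age', 'sibsp', 'pclass', 'survived', 'parch','embarked']].copy()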
modeldf.head()

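| | sex | age | sibsp | pclass | survived | parch | embarked |
|---|---|---|---|---|---|---|---|
| 0 | female | 29.0 | 0 | 1 | 1 | 0 | S |
| 1 | male | 0.9167 | 1 | 1 | 1 | 2 | S |
| 2 | female | 2.0 | 1 | 1 | 0 | 2 | S |
| 3 | male | 30.0 | 1 | 1 | 0 | 2 | S |
| 4 | female | 25.0 | 1 | 1 | 0 | 2 | S |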


In [13]:
#check to see if there are any missing values
modeldf.count()

sex         1309
age         1309
sibsp       1309
pclass      1309
survived    1309
parch       1309
embarked    1309
dtype: int64

In [14]:
modeldf.dtypes

sex          object
age         float64
sibsp         int64
pclass        int64
survived      int64
parch         int64
embarked     object
dtype: object

In [15]:
#change sex values to binary
#female=0, male=1
modeldf['sex'] = modeldf['sex'].map({'female':0, 'male':1})
modeldf.head()


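| | sex | age | sibsp | pclass | survived | parch | embarked |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 29.0 | 0 | 1 | 1 | 0 | S |
| 1 | 1 | 0.9167 | 1 | 1 | 1 | 2 | S |
| 2 | 0 | 2.0 | 1 | 1 | 0 | 2 | S |
| 3 | 1 | 30.0 | 1 | 1 | 0 | 2 | S |
| 4 | 0 | 25.0 | 1 | 1 | 0 | 2 | S |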


In [16]:
#see which numeric features are correlated with each other
#embarked is text, so only numeric columns are selected
modeldf.select_dtypes(include='number').corr()

| | sex | age | sibsp | pclass | survived | parch |
|---|---|---|---|---|---|---|
| sex | 1.0 | 0.080752 | -0.109609 | 0.124617 | -0.528693 | -0.213125 |
| age | 0.080752 | 1.0 | -0.201513 | -0.444002 | -0.060032 | -0.134548 |
| sibsp | -0.109609 | -0.201513 | 1.0 | 0.060832 | -0.027825 | 0.373587 |
| pclass | 0.124617 | -0.444002 | 0.060832 | 1.0 | -0.312469 | 0.018322 |
| survived | -0.528693 | -0.060032 | -0.027825 | -0.312469 | 1.0 | 0.08266 |
| parch | -0.213125 | -0.134548 | 0.373587 | 0.018322 | 0.08266 | 1.0 |


In [17]:
#create new column based on number of family members
#drop sibsp and parch columns
modeldf['family_num'] = modeldf['sibsp'] + modeldf['parch']
modeldf.drop(['sibsp', 'parch'], axis=1, inplace=True)
modeldf.head()


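| | sex | age | pclass | survived | embarked | family_num |
|---|---|---|---|---|---|---|
| 0 | 0 | 29.0 | 1 | 1 | S | 0 |
| 1 | 1 | 0.9167 | 1 | 1 | S | 3 |
| 2 | 0 | 2.0 | 1 | 0 | S | 3 |
| 3 | 1 | 30.0 | 1 | 0 | S | 3 |
| 4 | 0 | 25.0 | 1 | 0 | S | 3 |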


In [18]:
#columns left in our dataframe
modeldf.columns

Index(['sex', 'age', 'pclass', 'survived', 'embarked', 'family_num'], dtype='object')

In [19]:
#dummy variables for passenger class and embarkation port
#get_dummies will auto-drop columns that dummies were created from
#dtype=int keeps the dummies as 0/1 rather than booleans
modeldf = pd.get_dummies(data=modeldf, columns=['pclass','embarked'], dtype=int)
modeldf.head()

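| | sex | age | survived | family_num | pclass_1 | pclass_2 | pclass_3 | embarked_C | embarked_Q | embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 29.0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0.9167 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 2.0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 30.0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 25.0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 |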


In [20]:
modeldf['TravelAlone']=np.where((modeldf['family_num'] > 0), 0, 1)
modeldf.head()

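| | sex | age | survived | family_num | pclass_1 | pclass_2 | pclass_3 | embarked_C | embarked_Q | embarked_S | TravelAlone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 29.0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0.9167 | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 2.0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 30.0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 25.0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |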


In [21]:
#dataframe with predicting features
X = modeldf.drop('survived', axis=1)

#column of predictive target values
y = modeldf['survived']

In [22]:
#create training and test data
#will leave test size at default (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)

In [23]:
#initialize Gaussian Bayes classifier
gnb = GaussianNB()

In [24]:
#train the model to learn trends
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [25]:
#predictive score of the model on the training data
gnb.score(X_train, y_train)

0.7696228338430173

In [26]:
#test the model on unseen data
#store predicted values in a variable
y_pred = gnb.predict(X_test)

In [27]:
#confusion matrix shows which values the model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

cm

| | Predicted Died | Predicted Survived |
|---|---|---|
| True Died | 169 | 31 |
| True Survived | 44 | 84 |


In [28]:
#number of non-survivors (0) and survivors (1) in the test dataset
y_test.value_counts()

0    200
1    128
Name: survived, dtype: int64

In [29]:
#predictive score of the model on the test data
gnb.score(X_test, y_test)

0.7713414634146342

In [30]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.84      0.82       200
           1       0.73      0.66      0.69       128

   micro avg       0.77      0.77      0.77       328
   macro avg       0.76      0.75      0.75       328
weighted avg       0.77      0.77      0.77       328
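
The figures in the report can be reproduced by hand from the four counts in the Gaussian model's confusion matrix (169, 31, 44, 84). The sketch below does so for class 1 (survived); the variable names are chosen for illustration only.

```python
# counts copied from the Gaussian model's confusion matrix
tn, fp = 169, 31   # true class 0: predicted 0, predicted 1
fn, tp = 44, 84    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that were correct
precision_1 = tp / (tp + fp)                 # of those predicted 1, how many were truly 1
recall_1 = tp / (tp + fn)                    # of those truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.77 0.73 0.66 0.69
```

The accuracy, 253/328, is the same value returned by `gnb.score(X_test, y_test)`.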



### Bernoulli Naïve Bayes

In [31]:
#import Bernoulli Naïve Bayes function from scikit-learn library
from sklearn.naive_bayes import BernoulliNB

In [32]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [33]:
#build the model with training data
bnb.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
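
The `binarize=0.0` setting shown in the output matters for this dataset: per the scikit-learn documentation, `BernoulliNB` thresholds each feature at this value, so anything greater than 0 is treated as 1 and everything else as 0. The pure-Python sketch below mimics that thresholding on a few feature values from the passenger at index 1. The helper `binarize_row` is hypothetical and is not part of scikit-learn.

```python
def binarize_row(row, threshold=0.0):
    """Map each value to 1 if it is greater than threshold, else 0."""
    return {name: int(value > threshold) for name, value in row.items()}

# some feature values of the passenger at index 1 in modeldf
row = {'sex': 1, 'age': 0.9167, 'family_num': 3, 'pclass_1': 1, 'TravelAlone': 0}
print(binarize_row(row))
# {'sex': 1, 'age': 1, 'family_num': 1, 'pclass_1': 1, 'TravelAlone': 0}
```

Because every age in the data is positive, `age` becomes 1 for all passengers under this threshold, so it likely contributes little to the Bernoulli model, while `family_num` collapses to "has family aboard or not".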

In [34]:
#model's predictive score on the training data
bnb.score(X_train, y_train)

0.7522935779816514

In [35]:
#test the model on unseen data
#store predicted values in a variable
y_pred = bnb.predict(X_test)

In [36]:
#confusion matrix shows which values the model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

cm

In [45]:
#predictive score of the model on the test data
bnb.score(X_test, y_test)

In [37]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

