### Naïve Bayes Classification: Titanic Dataset

Using the Titanic dataset, clean the data (handle missing values by removing or filling them, and convert non-numerical data to numeric values), then build Gaussian and Bernoulli Naïve Bayes models to predict each passenger's survival status (1 = survived, 0 = did not survive). Compare the two models. Did one perform better than the other? How do the two models compare to the other classification algorithms, logistic regression and decision trees?


For a bonus challenge, try different methods of preparing your data (cleaning, choosing rows/columns) to see if that affects your results.


*To see an example of the predictive output of logistic regression and decision trees, run the code in the Lv 1 Module 8: Logistic Regression and Module 9: Decision Trees notebooks.





In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB   #import Gaussian Bayes modeling function
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [2]:
#load data
filename = "titanic.xls"
df = pd.read_excel(filename)

df.head() #first 5 rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [43]:
df['embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [3]:
#descriptive statistics
df.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [4]:
#find columns that have missing values
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [5]:
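#fill missing ages with the mean age of passengers sharing the same survival status, sex, and passenger class
#assign the result back to the column; inplace=True on a column selection is unreliable in recent pandas versions
df['age'] = df['age'].fillna(df.groupby(['survived', 'sex', 'pclass'])['age'].transform('mean'))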
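The group-wise fill can be hard to picture from a one-liner, so here is a minimal sketch on toy data (the values are made up, not taken from `titanic.xls`). It groups on `sex` and `pclass` only; including `survived` in the grouping, as the cell above does, uses the target label during imputation, which can make test scores look better than they would be for passengers whose outcome is unknown.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from titanic.xls)
toy = pd.DataFrame({
    "sex":    ["female", "female", "female", "male", "male", "male"],
    "pclass": [1, 1, 1, 3, 3, 3],
    "age":    [30.0, np.nan, 40.0, 20.0, 24.0, np.nan],
})

# Mean age per (sex, pclass) group, broadcast back to the original row order
group_mean = toy.groupby(["sex", "pclass"])["age"].transform("mean")

# Fill only the missing ages and assign the result back to the column
toy["age"] = toy["age"].fillna(group_mean)

print(toy)
```

Each missing age is replaced by the mean of the known ages in its own group, and the known ages are left unchanged.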

In [6]:
#find columns that have missing values after filling missing value for age 
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [7]:
#only 2 missing values so we'll fill with most common embarkation point
df['embarked'].value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

In [8]:
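#fill the 2 missing values with 'S', the most common embarkation point
#assign the result back to the column instead of using inplace=True
df['embarked'] = df['embarked'].fillna('S')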

In [9]:
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        0
boat          823
body         1188
home.dest     564
dtype: int64

In [10]:
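#keep only the columns used as model features
#name and ticket are identifiers; cabin, boat, body, and home.dest have many missing values; fare is also left out here
modeldf = df.drop(['name','ticket','fare', 'cabin', 'boat', 'body', 'home.dest'], axis=1)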

In [11]:
modeldf.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,embarked
0,1,1,female,29.0,0,0,S
1,1,1,male,0.9167,1,2,S
2,1,0,female,2.0,1,2,S
3,1,0,male,30.0,1,2,S
4,1,0,female,25.0,1,2,S


In [12]:
#transform Sex column to binary values (0,1)
modeldf['sex'] = modeldf['sex'].map({'female': 0, 'male': 1})
modeldf.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,embarked
0,1,1,0,29.0,0,0,S
1,1,1,1,0.9167,1,2,S
2,1,0,0,2.0,1,2,S
3,1,0,1,30.0,1,2,S
4,1,0,0,25.0,1,2,S


In [16]:
#dummy variables for passenger class and embarkation port
#get_dummies replaces each listed column with its dummy columns
modeldf = pd.get_dummies(data=modeldf, columns=['pclass','embarked'])
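As a small illustration of what this encoding produces, the sketch below runs `get_dummies` on a three-row toy frame (hypothetical values). The `dtype=int` argument is an addition not used in the cell above: recent pandas versions return `True`/`False` dummy columns by default, and `dtype=int` asks for the 0/1 columns shown in this notebook's output.

```python
import pandas as pd

# Toy frame with one already-numeric column and two categorical columns
toy = pd.DataFrame({
    "sex": [0, 1, 0],
    "pclass": [1, 3, 2],
    "embarked": ["S", "C", "Q"],
})

# One 0/1 column per category; the original pclass and embarked columns are replaced
encoded = pd.get_dummies(data=toy, columns=["pclass", "embarked"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one 1 among the `pclass_*` columns and exactly one 1 among the `embarked_*` columns.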


In [17]:
modeldf.head()

Unnamed: 0,survived,sex,age,sibsp,parch,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S
0,1,0,29.0,0,0,1,0,0,0,0,1
1,1,1,0.9167,1,2,1,0,0,0,0,1
2,0,0,2.0,1,2,1,0,0,0,0,1
3,0,1,30.0,1,2,1,0,0,0,0,1
4,0,0,25.0,1,2,1,0,0,0,0,1


## Naïve Bayes using Scikit-Learn
Let's use the dataset prepared above and build Naïve Bayes classification models to predict passenger survival.

###  1. Gaussian Naïve Bayes
There are several Naïve Bayes variants. In this section, I will use Gaussian Naïve Bayes to build the predictive model. Gaussian Naïve Bayes applies Bayes' theorem under the assumption that, within each class, every feature is independent of the others and follows a normal (Gaussian) distribution.
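To make the idea concrete, here is a minimal from-scratch sketch of the Gaussian Naïve Bayes calculation on toy data with a single feature. The function names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical helpers written for this illustration, not part of scikit-learn, and the sketch omits the small variance-smoothing term that `GaussianNB` adds.

```python
import numpy as np

# Toy data: one continuous feature (hypothetical ages) and a binary label
X = np.array([[22.0], [25.0], [30.0], [35.0], [48.0], [54.0], [60.0], [62.0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature mean, and per-feature variance for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (len(rows) / len(X), rows.mean(axis=0), rows.var(axis=0))
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the highest log prior plus summed log likelihood."""
    scores = {}
    for label, (prior, mean, var) in params.items():
        # Log of the normal density, summed over features (the "naive" independence step)
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        scores[label] = np.log(prior) + log_likelihood
    return max(scores, key=scores.get)

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([28.0])))
print(predict_gaussian_nb(params, np.array([58.0])))
```

A value near the class 1 mean is assigned to class 1, and a value near the class 0 mean is assigned to class 0.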

In [18]:
#check to see if there are any missing values
modeldf.count()

survived      1309
sex           1309
age           1309
sibsp         1309
parch         1309
pclass_1      1309
pclass_2      1309
pclass_3      1309
embarked_C    1309
embarked_Q    1309
embarked_S    1309
dtype: int64

In [21]:
#dataframe with predicting features
X = modeldf.drop('survived', axis=1)

#column of predictive target values
y = modeldf['survived']

In [22]:
#create training and test data
#will leave test size at default (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)
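As a quick check on the 25% default, the arithmetic below reproduces the split sizes for the 1,309 rows in `modeldf`. It assumes scikit-learn rounds the test count up and gives the remaining rows to training, which is consistent with the 328 test rows reported later in this notebook.

```python
import math

n_samples = 1309   # rows in modeldf
test_size = 0.25   # scikit-learn default when neither size is given

n_test = math.ceil(test_size * n_samples)   # test count is rounded up
n_train = n_samples - n_test                # the remaining rows go to training

print(n_train, n_test)
```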

In [23]:
#initialize Gaussian Bayes classifier
gnb = GaussianNB()

In [24]:
#train the model to learn trends
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [25]:
#predictive score of the model on the training data
gnb.score(X_train, y_train)

0.7492354740061162

In [26]:
#test the model on unseen data
#store predicted values in a variable
y_pred = gnb.predict(X_test)

In [29]:
#Confusion matrix shows which values model predicted correctly vs incorrectly
#look at true and false predictions
pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Not Survival', 'Predicted Survival'],
    index=['True Not Survival', 'True Survival']
)

Unnamed: 0,Predicted Not Survival,Predicted Survival
True Not Survival,149,51
True Survival,38,90


In [30]:
#frequency of non survived to survived in the test dataset
y_test.value_counts()

0    200
1    128
Name: survived, dtype: int64

In [31]:
#predictive score of the model on the test data
gnb.score(X_test, y_test)

0.7286585365853658

In [32]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.74      0.77       200
           1       0.64      0.70      0.67       128

   micro avg       0.73      0.73      0.73       328
   macro avg       0.72      0.72      0.72       328
weighted avg       0.73      0.73      0.73       328



Precision and recall for the survived class (1), taken from the output above, are summarized below.

Precision = True Positives / (True Positives + False Positives) = 90 / (90 + 51) = 0.638

Precision tells us how often the prediction is correct when the model predicts the positive class. In other words, it measures how precise the classifier is when predicting positive instances.

Recall (Sensitivity) = True Positives / (True Positives + False Negatives) = 90 / (90 + 38) = 0.703

Recall tells us how often the prediction is correct when the actual value is positive. It is also called the true positive rate.
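The same figures can be recomputed directly from the four cells of the Gaussian Naïve Bayes confusion matrix. This is a plain-Python sketch; the variable names `tn`, `fp`, `fn`, and `tp` are labels chosen for this illustration.

```python
# Counts from the Gaussian Naïve Bayes confusion matrix above
tn, fp = 149, 51   # actual class 0: predicted 0, predicted 1
fn, tp = 38, 90    # actual class 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

The accuracy computed this way should agree with `gnb.score(X_test, y_test)`, and the precision, recall, and F1 values with the class 1 row of the classification report.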


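### 2. Bernoulli Naïve Bayes

The Bernoulli Naïve Bayes classifier is designed for features that are binary (Boolean; True/False or 1/0 values). Features that are not already binary are converted to 0 or 1 using the classifier's `binarize` threshold. Let's try this method on the dataset from the previous example.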
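`BernoulliNB` thresholds every feature at its `binarize` value, which is 0.0 by default, before fitting. The sketch below mimics that step with NumPy on the `sex`, `age`, `sibsp`, and `parch` values from the first two rows of `modeldf`, to show what the model actually sees.

```python
import numpy as np

# Columns: sex, age, sibsp, parch (values from the first two rows of modeldf)
rows = np.array([
    [0, 29.0,   0, 0],
    [1, 0.9167, 1, 2],
])

# binarize=0.0 maps every value greater than 0 to 1 and everything else to 0
binarized = (rows > 0.0).astype(int)
print(binarized)
```

Because every age is greater than 0, the `age` column becomes a constant 1 and carries no information for this model, while `sibsp` and `parch` are reduced to "none" versus "at least one".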

In [33]:
#import Bernoulli Naïve Bayes function from scikit-learn library
from sklearn.naive_bayes import BernoulliNB

In [34]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [35]:
#build the model with training data
bnb.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [36]:
#model's predictive score on the training data
bnb.score(X_train, y_train)

0.7614678899082569

In [37]:
#test the model on unseen data
#store predicted values in a variable
y_pred = bnb.predict(X_test)

In [38]:
#Confusion matrix shows which values model predicted correctly vs incorrectly
#look at true and false predictions
pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Not Survival', 'Predicted Survival'],
    index=['True Not Survival', 'True Survival']
)

Unnamed: 0,Predicted Not Survival,Predicted Survival
True Not Survival,169,31
True Survival,48,80


In [39]:
#predictive score of the model on the test data
bnb.score(X_test, y_test)

0.7591463414634146

In [40]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.84      0.81       200
           1       0.72      0.62      0.67       128

   micro avg       0.76      0.76      0.76       328
   macro avg       0.75      0.73      0.74       328
weighted avg       0.76      0.76      0.76       328



As before, precision tells us how often the prediction is correct when the model predicts the positive class, and recall (the true positive rate) tells us how often the prediction is correct when the actual value is positive. For the survived class (1), precision = 80 / (80 + 31) = 0.721 and recall = 80 / (80 + 48) = 0.625.


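# Comparing models

On the test set, Bernoulli Naïve Bayes has the higher accuracy and precision, while Gaussian Naïve Bayes has the higher recall for the survived class.


    Model                      Overall score (test accuracy)    Precision (class 1)    Recall (class 1)

    Bernoulli Naïve Bayes      0.759                            0.72                   0.62
    Gaussian Naïve Bayes       0.729                            0.64                   0.70
    Decision Tree
    Logistic Regression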
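One possible way to fill in the remaining rows is to fit all four classifiers on the same train/test split and collect the same three metrics. The sketch below does this on synthetic stand-in data from `make_classification`, so its numbers say nothing about the Titanic results; the `models` and `results` names, the `max_iter=1000` setting, and the decision tree's `random_state` are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace X and y with the Titanic features and target
X, y = make_classification(n_samples=500, n_features=8, random_state=109)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)

models = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=109),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the same split and score it on the same test set
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
    )

print(f"{'Model':<24}{'Accuracy':>10}{'Precision':>11}{'Recall':>8}")
for name, (acc, prec, rec) in results.items():
    print(f"{name:<24}{acc:>10.3f}{prec:>11.3f}{rec:>8.3f}")
```

Using one shared split for every model keeps the comparison fair, since each classifier is scored on exactly the same unseen rows.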

  
