# Part 3: Titanic Classification  

The titanic dataset (included in the input_data directory) consists of instances representing passengers aboard the Titanic ship that sank in the North Atlantic Ocean on 15 April 1912. The dataset has three attributes describing a passenger (class, age, sex) and a class label (survived) denoting whether the passenger survived the shipwreck or not.

Write a Jupyter Notebook (pa5-titanic.ipynb) that uses your mysklearn package to build Naive Bayes and $k$-nearest neighbor classifiers to predict survival from the titanic dataset (titanic.txt). Your classifiers should use the class, age, and sex attributes to determine the survival value. Note that since class, age, and sex are categorical attributes, you will need to update your kNN implementation to properly compute the distance between categorical attributes. See the B Nearest Neighbors Classification notes on GitHub for how to go about doing this.
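
One common way to handle categorical attributes in a kNN distance is to score each attribute as 0 when the two values match and 1 when they differ. The sketch below illustrates that idea only; `mixed_distance` is a hypothetical helper, not a function from mysklearn, and the attribute values shown are illustrative.

```python
def mixed_distance(a, b):
    """Euclidean-style distance between two instances.

    Assumption: any attribute stored as a str is categorical and
    contributes 0 if the values are equal, 1 otherwise. Numeric
    attributes contribute their squared difference as usual.
    """
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return total ** 0.5


# Two passengers that differ only in age are at distance 1.0.
print(mixed_distance(["first", "adult", "male"], ["first", "child", "male"]))
# Passengers that differ in all three attributes are at distance sqrt(3).
print(mixed_distance(["crew", "adult", "male"], ["first", "child", "female"]))
```

With a distance like this in place, subtracting two strings never happens, so the neighbor search can run on the raw categorical values.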


In [2]:
# some useful mysklearn package import statements and reloads
import importlib

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier, MyZeroRClassifier, MyRandomClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

Evaluate the performance of your classifiers using stratified k-fold cross validation (with k = 10) and generate confusion matrices for the two classifiers. As in PA4, report both accuracy and error rate for the two approaches.
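
As a rough illustration of what stratification means here, the sketch below groups instance indices by label and deals them out to the folds in turn, so each fold keeps roughly the same class proportions as the full dataset. `stratified_fold_indices` is a hypothetical helper written for this example; it is not the `myevaluation.stratified_kfold_cross_validation` implementation and its return format is an assumption.

```python
from collections import defaultdict


def stratified_fold_indices(y, n_splits=10):
    """Return n_splits lists of indices with labels spread evenly across them."""
    by_label = defaultdict(list)
    for i, label in enumerate(y):
        by_label[label].append(i)

    folds = [[] for _ in range(n_splits)]
    position = 0
    # Deal indices round-robin, one label group after another.
    for label in by_label:
        for i in by_label[label]:
            folds[position % n_splits].append(i)
            position += 1
    return folds


# Toy labels: 6 "no" and 3 "yes", split into 3 folds.
y_toy = ["no"] * 6 + ["yes"] * 3
folds = stratified_fold_indices(y_toy, n_splits=3)
for test_idx in folds:
    # Every index that is not in the test fold is used for training.
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    print("test:", test_idx, "train:", sorted(train_idx))
```

Each of the three toy folds ends up with two "no" instances and one "yes" instance, matching the 2:1 ratio of the full label list.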

### Naive Bayes Classifier (Confusion Matrices)
* The first step is to load and prepare the data: rows with missing values are removed and the class, age, and sex columns are extracted. The instances are used to fit the classifier and are split into folds using stratified k-fold cross validation. The classifier then predicts the test instances, and the predictions are tabulated in a confusion matrix.
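
For categorical attributes like these, Naive Bayes reduces to counting: a prior for each label and, per label, the relative frequency of each attribute value. The following is a minimal sketch under that assumption, with no smoothing for unseen values. The function names and toy rows are hypothetical and are not taken from `MyNaiveBayesClassifier`.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(X, y):
    """Count label frequencies and per-label attribute value frequencies."""
    label_totals = Counter(y)
    priors = {label: count / len(y) for label, count in label_totals.items()}
    counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            counts[(label, j)][value] += 1
    return priors, counts, label_totals


def predict_naive_bayes(model, row):
    """Return the label with the largest prior * product of conditionals."""
    priors, counts, label_totals = model
    scores = {}
    for label, prior in priors.items():
        score = prior
        for j, value in enumerate(row):
            score *= counts[(label, j)][value] / label_totals[label]
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
X_toy = [
    ["first", "adult", "female"],
    ["first", "adult", "male"],
    ["crew", "adult", "male"],
    ["third", "adult", "male"],
    ["third", "child", "female"],
]
y_toy = ["yes", "yes", "no", "no", "no"]

model = fit_naive_bayes(X_toy, y_toy)
print(predict_naive_bayes(model, ["first", "adult", "female"]))
print(predict_naive_bayes(model, ["third", "adult", "male"]))
```

On the toy rows, the first query scores 0.2 for "yes" and 0 for "no", and the second scores 0 for "yes" and about 0.178 for "no".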


In [3]:
import os
importlib.reload(myutils)

# Get the file data
fname = os.path.join("input_data", "titanic.txt")
titanic_data = MyPyTable().load_from_file(fname)
titanic_data.remove_rows_with_missing_values() # prep the data by removing any missing values

# Grab the class, age, and sex columns as lists
titanic_class = titanic_data.get_column('class')
titanic_age = titanic_data.get_column('age')
titanic_sex = titanic_data.get_column('sex')

# build the instance rows (X) and the class labels (y)
X_train = [[titanic_class[i], titanic_age[i], titanic_sex[i]] for i in range(len(titanic_class))]
y_train = titanic_data.get_column('survived')

# Fit to the Naive Bayes Classifier
mnbc = MyNaiveBayesClassifier()
mnbc.fit(X_train, y_train) # fit the data using Naive Bayes

# fold the column data
strattrain_folds, strattest_folds = myevaluation.stratified_kfold_cross_validation(X_train, y_train, 10)
X_train_strat, y_train_strat, X_test_strat, y_test_strat = myutils.get_from_folds(X_train, y_train, strattrain_folds, strattest_folds)

# make a prediction using Naive Bayes
predicted_bayes = mnbc.predict(X_test_strat)

print("===========================================")
print("Naive Bayes Confusion Matrix")
print("===========================================")

# create the confusion matrix
matrix = myevaluation.confusion_matrix(y_test_strat, predicted_bayes, ['yes','no'])

# print the data
table_header = ['Survived', 'yes', 'no', 'Total', 'Recognition (%)']
myutils.titanic_stats(matrix)
myutils.print_tabulate(matrix, table_header)

Naive Bayes Confusion Matrix
Survived      yes    no    Total    Recognition (%)
1            1364   126     1490              91.54
2             362   349      711              49.09
Total        1726   475     2201              77.83
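
Accuracy and error rate follow directly from the confusion matrix: accuracy is the sum of the diagonal divided by the total, and error rate is one minus accuracy. The short sketch below applies that to the four counts printed above; `accuracy_and_error` is a hypothetical helper, not part of `myevaluation`.

```python
def accuracy_and_error(matrix):
    """Compute accuracy and error rate from a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    accuracy = correct / total
    return accuracy, 1 - accuracy


# Counts copied from the Naive Bayes table above, rows in the order printed.
nb_matrix = [[1364, 126], [362, 349]]
acc, err = accuracy_and_error(nb_matrix)
print(f"accuracy = {acc:.4f}, error rate = {err:.4f}")
# prints: accuracy = 0.7783, error rate = 0.2217
```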


### kNN Classifier (Confusion Matrix)

In [8]:
# Fit to the kNN classifier
mknc = MyKNeighborsClassifier(10)
mknc.fit(X_train, y_train) # fit the data using Knn

# fold the column data
strattrain_folds, strattest_folds = myevaluation.stratified_kfold_cross_validation(X_train, y_train, 10)
X_train_strat, y_train_strat, X_test_strat, y_test_strat = myutils.get_from_folds(X_train, y_train, strattrain_folds, strattest_folds)

# make a prediction using Knn
predicted = mknc.predict(X_test_strat)
print(predicted)

print("===========================================")
print("Knn Confusion Matrix")
print("===========================================")

# create the confusion matrix
matrix = myevaluation.confusion_matrix(y_test_strat, predicted, ['yes','no'])

# print the data
table_header = ['Survived', 'yes', 'no', 'Total', 'Recognition (%)']
myutils.titanic_stats(matrix)
myutils.print_tabulate(matrix, table_header)

TypeError: unsupported operand type(s) for -: 'str' and 'str'

### Zero R Classifier (Confusion Matrix)
* Fit the Zero R classifier, which always predicts the most frequent training label, and generate its predictions. Tabulate these predictions in a confusion matrix.
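
For reference, a Zero R baseline can be sketched in a few lines: remember the most frequent training label and return it for every test instance. `ZeroRSketch` is a hypothetical stand-in that mirrors the `fit`/`predict` calls used in this notebook; it is not the `MyZeroRClassifier` source.

```python
from collections import Counter


class ZeroRSketch:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, X_train, y_train):
        # X_train is ignored; only the label distribution matters.
        self.most_common_label = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, X_test):
        return [self.most_common_label for _ in X_test]


# Toy labels, invented for illustration.
zero_r = ZeroRSketch().fit([["a"], ["b"], ["c"]], ["no", "no", "yes"])
print(zero_r.predict([["x"], ["y"]]))  # ['no', 'no']
```

Because every prediction is the same label, one column of the resulting confusion matrix is always zero.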

In [11]:
print("===========================================")
print("MyZeroRClassifier Confusion Matrix")
print("===========================================")

mzrc = MyZeroRClassifier()
mzrc.fit(X_train_strat, y_train_strat)

predicted_zero = mzrc.predict(X_test_strat)

matrix = myevaluation.confusion_matrix(y_test_strat, predicted_zero, ['yes','no'])
table_header = ['Survived', 'yes', 'no', 'Total', 'Recognition (%)']
myutils.titanic_stats(matrix)
myutils.print_tabulate(matrix, table_header)

MyZeroRClassifier Confusion Matrix
Survived      yes    no    Total    Recognition (%)
1            1490     0     1490              100
2             711     0      711                0
Total        2201     0     2201               67.7


### Random Classifier (Confusion Matrix)
* Fit the random classifier and generate its predictions. Tabulate these predictions in a confusion matrix.
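
One plausible way to build such a baseline is to draw each prediction at random, weighted by how often each label appears in the training set. The sketch below assumes that weighting; whether `MyRandomClassifier` weights its draws this way is not stated here, and `RandomSketch` with its `seed` parameter is hypothetical.

```python
import random
from collections import Counter


class RandomSketch:
    """Baseline that samples labels in proportion to training frequency."""

    def __init__(self, seed=None):
        # A private generator keeps results reproducible when a seed is given.
        self.rng = random.Random(seed)

    def fit(self, X_train, y_train):
        counts = Counter(y_train)
        self.labels = list(counts)
        self.weights = [counts[label] for label in self.labels]
        return self

    def predict(self, X_test):
        return self.rng.choices(self.labels, weights=self.weights, k=len(X_test))


# Toy labels, invented for illustration: "no" is twice as likely as "yes".
random_clf = RandomSketch(seed=0).fit([["a"], ["b"], ["c"]], ["no", "no", "yes"])
toy_predictions = random_clf.predict([["x"]] * 5)
print(toy_predictions)
```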

In [12]:
print("===========================================")
print("MyRandomClassifier Confusion Matrix")
print("===========================================")

mrc = MyRandomClassifier()
mrc.fit(X_train_strat, y_train_strat)

predicted_random = mrc.predict(X_test_strat)

matrix = myevaluation.confusion_matrix(y_test_strat, predicted_random, ['yes','no'])
table_header = ['Survived', 'yes', 'no', 'Total', 'Recognition (%)']
myutils.titanic_stats(matrix)
myutils.print_tabulate(matrix, table_header)

MyRandomClassifier Confusion Matrix
Survived      yes    no    Total    Recognition (%)
1            1017   473     1490              68.26
2             481   230      711              32.35
Total        1498   703     2201              56.66
