# Titanic Classification

In this lab you will build and evaluate a classification model that predicts whether a passenger survived the sinking of the Titanic. You are required to explore the data first so that you can decide which features to build into your model.

## Read in the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
titanic = pd.read_csv('../data/titanic.csv',index_col='PassengerId')

titanic.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
titanic.shape

(891, 11)

## EDA

### Which feature has the most null values? 

In [None]:
# A:

Click here for solution

<span style ='color:white'>

titanic.isnull().sum()

### Drop the `Cabin` feature, then remove any rows that still contain null values

In [None]:
# A:

Click here for solution

<span style ='color:white'>

titanic.drop('Cabin',axis=1,inplace=True)
titanic.dropna(inplace=True)
titanic.shape

### Investigate the relationships between features using Seaborn's pairplot function

In [None]:
# A:

Click here for solution

<span style ='color:white'>

sns.pairplot(titanic);

### Which features are categorical?

In [None]:
# A:
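
One way to approach this question is to look at each column's dtype alongside its number of distinct values: text columns and numeric columns with only a handful of levels are usually the categorical ones. The sketch below is only illustrative. The helper name `summarise_columns` and the tiny `sample` frame are made up for the example; in the lab you would pass `titanic` instead.

```python
import pandas as pd


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and the number of distinct values for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })


# Tiny made-up sample that reuses some of the lab's column names
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "S"],
})

summary = summarise_columns(sample)
print(summary)
```

A column such as `Pclass` is stored as an integer but has very few distinct values, which is a hint that it may be better treated as a category than as a continuous number.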

### Use `pd.get_dummies` to dummy code these features

In [None]:
# A:

Click here for solution

<span style ='color:white'>

titanic_dummy=pd.get_dummies(titanic,columns=['Pclass','Sex','Embarked'],drop_first=True)

### Produce a heatmap to show correlation between features

In [None]:
# A:

Click here for solution

<span style ='color:white'>

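# numeric_only=True skips text columns such as Name and Ticket, which recent pandas versions refuse to correlate
sns.heatmap(titanic_dummy.corr(numeric_only=True));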

### Extend: Investigate the survival rates for passengers based on their class and gender

In [None]:
# A:

Click here for solution

<span style ='color:white'>

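# Mean survival rate for each combination of sex and passenger class
titanic.pivot_table(index='Sex', columns='Pclass', aggfunc={'Survived':'mean'})

# `height` replaces the older `size` argument, which newer seaborn releases no longer accept
grid = sns.FacetGrid(titanic, row='Embarked', height=4.5, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette=None, order=None, hue_order=None)
grid.add_legend()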

## Modelling

### Build a logistic regression model that predicts `Survived` based on the passenger's `Age`

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
# A:

Click here for solution

<span style ='color:white'>

X=titanic[['Age']]
y=titanic.Survived


X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,shuffle=True)
lr=LogisticRegression().fit(X_train,y_train)

### Evaluate your model by quoting its accuracy and comparing it to the baseline accuracy

In [None]:
# A:

Click here for solution

<span style ='color:white'>

print('Train Accuracy: ' + str(lr.score(X_train,y_train)))
print('Test Accuracy: ' + str(lr.score(X_test,y_test)))
print('Baseline: ' + str(titanic.Survived.value_counts(normalize=True).max()))     

### Build a confusion matrix using the test set

In [None]:
# A:

Click here for solution

<span style ='color:white'>

from sklearn.metrics import confusion_matrix

preds=lr.predict(X_test)

pd.DataFrame(confusion_matrix(y_test,preds),columns=['Predict Not Survived','Predict Survived'],
             index=['Actual Not Survived','Actual Survived'])

### Build a Better Model

### Improve your model by adding new features based on your EDA

Compare your new model with the previous one by calculating performance metrics. If you use multiple features, evaluate which have the biggest effect in determining the chance of survival.
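
Accuracy is not the only metric a confusion matrix gives you. As a minimal sketch, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes, with "not survived" first), precision and recall can be read off the four cells. The helper `metrics_from_confusion` and the counts in `cm_demo` are made up for illustration and are not results from the Titanic data.

```python
import numpy as np


def metrics_from_confusion(cm):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix.

    Expects the layout [[TN, FP], [FN, TP]].
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Of the passengers predicted to survive, how many actually did
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the passengers who actually survived, how many were found
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts, not taken from the lab data
cm_demo = [[50, 10],
           [20, 20]]

metrics = metrics_from_confusion(cm_demo)
print(metrics)
```

Two models with similar accuracy can differ noticeably in precision and recall, so quoting all three gives a fairer comparison.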

In [None]:
# A:

Click here for sample solution

<span style ='color:white'>

X=titanic_dummy.drop(['Name','Ticket','Fare'],axis=1)
y=X.pop('Survived')

X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,shuffle=True)
lr=LogisticRegression().fit(X_train,y_train)

print('Train Accuracy: ' + str(lr.score(X_train,y_train)))
print('Test Accuracy: ' + str(lr.score(X_test,y_test)))
print('Baseline: ' + str(titanic.Survived.value_counts(normalize=True).max()))

preds=lr.predict(X_test)
pd.DataFrame(confusion_matrix(y_test,preds),columns=['Predict Not Survived','Predict Survived'],
             index=['Actual Not Survived','Actual Survived'])

pd.DataFrame({'Feature':X.columns,'Effect':lr.coef_[0]})
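
Raw coefficients are hard to compare when features sit on different scales (for example `Age` in years against a 0/1 dummy column). One possible approach, sketched here on synthetic data rather than the lab data, is to standardise the features inside a pipeline before fitting, then rank the coefficients by absolute size. The frame `X_demo`, the labelling rule for `y_demo`, and all other names are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab features (NOT the real Titanic data)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Sex_male": rng.integers(0, 2, n),
})
# Made-up labelling rule: women and young children are marked as survivors
y_demo = ((X_demo["Sex_male"] == 0) | (X_demo["Age"] < 10)).astype(int)

# Scale every feature to zero mean and unit variance, then fit the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_demo, y_demo)

# Coefficients are now per standard deviation of each feature
coefs = pd.Series(
    model.named_steps["logisticregression"].coef_[0],
    index=X_demo.columns,
)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

With the lab data you would fit the same pipeline on `X_train` and `y_train`; the sign of each coefficient shows the direction of the effect and its magnitude shows the relative strength.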

## Extend: How much better can you make this model? 

Try to engineer other features (e.g. titles from passenger name, or group ages into bands). Try other classification models, or standardising the features first. 
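
As a starting point for the two feature ideas mentioned above, the sketch below pulls the title out of `Name` with a regular expression and groups `Age` into bands with `pd.cut`. The function name `add_engineered_features`, the band edges and the band labels are arbitrary choices for illustration, and the last row of the sample frame is a made-up passenger.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with hypothetical Title and AgeBand columns added."""
    out = df.copy()
    # The title sits between the comma and the first full stop, e.g. "Braund, Mr. Owen"
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Band edges are an arbitrary choice; each band includes its upper edge
    out["AgeBand"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 40, 60, 120],
        labels=["child", "teen", "adult", "middle", "senior"],
    )
    return out


sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Doe, Master. John",  # made-up passenger for illustration
    ],
    "Age": [22.0, 26.0, 35.0, 4.0],
})

engineered = add_engineered_features(sample)
print(engineered[["Title", "AgeBand"]])
```

Both new columns are categorical, so they would need dummy coding with `pd.get_dummies` before being passed to a model.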