# Logistic regression exercise with Titanic data

## Introduction

- Data from Kaggle's Titanic competition: [data](../data/titanic.csv), [data dictionary](https://www.kaggle.com/c/titanic/data)
- **Goal**: Predict survival based on passenger characteristics
- `titanic.csv` is already in our repo, so there is no need to download the data from the Kaggle website

## Step 1: Read the data into a Pandas dataframe

In [2]:
# Read the data into a Panda's dataframe and display the head of the file.  Use PassengerID as the index_col
%matplotlib inline
import numpy as np
import pandas as pd

data = pd.read_csv("../data/titanic.csv")

In [40]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [46]:
data['sex_binary']=np.where(data.Sex=='male',1,0)

In [53]:
data.Age.mean()
data["Age"]=np.where(data.Age.isnull,data.Age.mean(),data.Age)

## Step 2: Create X and y

Define **Pclass** and **Parch** as the features, and **Survived** as the response.

In [58]:
feature_cols=["Pclass", "Parch", "sex_binary", "Age"]
X= data[feature_cols]
y = data.Survived

## Step 3: Split the data into training and testing sets

In [59]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

## Step 4: Fit a logistic regression model and examine the coefficients

Confirm that the coefficients make intuitive sense.

In [60]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
logreg.coef_

array([[-0.90438535, -0.08542545, -2.55091863,  0.10663506]])

## Step 5: Make predictions on the testing set and calculate the accuracy

In [61]:
# class predictions
train_predictions = logreg.predict(X_test)

In [62]:
# calculate classification accuracy
from sklearn.metrics import accuracy_score
accuracy_score(train_predictions,y_test)

0.7982062780269058

## Step 6: Compare your testing accuracy to the null accuracy

In [64]:
y_test.value_counts().head(1) / len(y_test)

0    0.623318
Name: Survived, dtype: float64