# Kaggle Titanic survival - logistic regression model

In this notebook we repeat our basic logistic regression model as previously described:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/02_logistic_regression.ipynb

We will extend the model to report a range of accuracy measures, as described:

https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/05_accuracy_standalone.ipynb

We will go through the following steps:

* Download and save pre-processed data
* Split data into features (X) and label (y)
* Split data into training and test sets (we will test on data that has not been used to fit the model)
* Standardise data
* Fit a logistic regression model (from scikit-learn)
* Predict survival of the test set
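
The steps listed above can be sketched end to end on a small synthetic data set. This is a minimal illustration only: the data frame `toy`, its column names (`feature_a`, `feature_b`, `label`), the 25% test fraction and the random seed are all assumptions made for the example, and are not the Titanic data or the settings used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed data (NOT the real Titanic data)
rng = np.random.default_rng(42)
n_rows = 200
toy = pd.DataFrame({
    'feature_a': rng.normal(size=n_rows),
    'feature_b': rng.normal(size=n_rows),
})
# The label depends loosely on feature_a, so the model has something to learn
noise = 0.5 * rng.normal(size=n_rows)
toy['label'] = (toy['feature_a'] + noise > 0).astype(int)

# Split data into features (X) and label (y)
X_toy = toy.drop('label', axis=1)
y_toy = toy['label']

# Split into training and test sets (25% held back for testing; assumed value)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42)

# Standardise: fit the scaler on the training set only, then apply to both sets
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Fit a logistic regression model with scikit-learn's default settings
model = LogisticRegression()
model.fit(X_train_std, y_train)

# Predict the label for the test set and report simple accuracy
y_pred = model.predict(X_test_std)
accuracy = np.mean(y_pred == y_test.values)
print(f'Toy test accuracy: {accuracy:.2f}')
```

Fitting the scaler on the training set alone keeps information about the test set out of the model, which is the reason the split comes before standardisation in the list of steps.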

https://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination

## Load modules

A standard Anaconda install of Python (https://www.anaconda.com/distribution/) contains all the necessary modules.

In [1]:
import numpy as np
import pandas as pd
# Import machine learning methods
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load data

The section below downloads the pre-processed data and saves it to a `data` subfolder of the directory from which this code is run.
If the data has already been downloaded, that cell may be skipped.

Code that was used to pre-process the data ready for machine learning may be found at:
https://github.com/MichaelAllen1966/1804_python_healthcare/blob/master/titanic/01_preprocessing.ipynb

In [2]:
download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

In [3]:
data = pd.read_csv('data/processed_data.csv')

The first column is a passenger index number. We will remove this, as this is not part of the original Titanic passenger data.

In [4]:
# Drop PassengerId (axis=1 indicates we are removing a column rather than a row)
# We drop the passenger ID as it is not original data

data.drop('PassengerId', inplace=True, axis=1)

## Divide into X (features) and y (labels)

We will separate out our features (the data we use to make a prediction) from our label (what we are trying to predict).
By convention our features are called `X` (usually upper case to denote multiple features), and the label (survived or not) is called `y`.

In [5]:
X = data.drop('Survived', axis=1) # X = all 'data' except the 'Survived' column
y = data['Survived'] # y = 'Survived' column from 'data'

In [8]:
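from scipy.stats import pearsonr

# Calculate the Pearson correlation (and its p-value) between each feature
# and survival
features = list(X)
correlation = []
significance = []
for feature in features:
    correl = pearsonr(X[feature].values, y.values)
    correlation.append(correl[0])  # correlation coefficient
    significance.append(correl[1])  # p-value

# Collate the results in a DataFrame
df = pd.DataFrame()
df['feature'] = features
df['correlation'] = correlation
df['abs_correlation'] = np.abs(correlation)
df['significance'] = significance
# Flag features with p < 0.05
df['significant'] = df['significance'] < 0.05
# Rank features by strength of correlation, regardless of sign
df.sort_values(by='abs_correlation', ascending=False, inplace=True)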

In [9]:
df

| | feature | correlation | abs_correlation | significance | significant |
|---|---|---|---|---|---|
| 10 | male | -0.543351 | 0.543351 | 1.406066e-69 | True |
| 0 | Pclass | -0.338481 | 0.338481 | 2.537047e-25 | True |
| 9 | CabinNumberImputed | -0.321842 | 0.321842 | 6.404266e-23 | True |
| 7 | CabinLetterImputed | -0.316912 | 0.316912 | 3.090891e-22 | True |
| 23 | CabinLetter_missing | -0.316912 | 0.316912 | 3.090891e-22 | True |
| 4 | Fare | 0.257307 | 0.257307 | 6.120189e-15 | True |
| 8 | CabinNumber | 0.235409 | 0.235409 | 1.100977e-12 | True |
| 16 | CabinLetter_B | 0.175095 | 0.175095 | 1.441584e-07 | True |
| 11 | Embarked_C | 0.16824 | 0.16824 | 4.397151e-07 | True |
| 13 | Embarked_S | -0.15566 | 0.15566 | 3.036111e-06 | True |
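
To make the `correlation` and `significance` columns above more concrete, the sketch below calls `pearsonr` on a single binary feature and a binary label. The eight values in `toy_feature` and `toy_label` are made up for illustration and are not taken from the Titanic data; the names are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up binary feature (e.g. 1 = male) and label (1 = survived)
toy_feature = np.array([1, 1, 1, 0, 0, 0, 1, 0])
toy_label = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# pearsonr returns the correlation coefficient first and the p-value second,
# which is why the loop above reads correl[0] and correl[1]
result = pearsonr(toy_feature, toy_label)
r = result[0]
p_value = result[1]

# A negative r means the label tends to be 0 when the feature is 1
print(f'correlation: {r:.4f}')
print(f'abs correlation: {abs(r):.4f}')
print(f'p-value: {p_value:.4f}')
print(f'significant at 0.05: {p_value < 0.05}')
```

The sign of the coefficient gives the direction of the association, while ranking by absolute value, as in the table, orders features by the strength of the association regardless of direction.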
