# Kaggle Titanic survival - logistic regression model

Can we predict which passengers would survive the sinking of the Titanic?

See:

https://www.kaggle.com/c/titanic/overview/evaluation

https://gitlab.com/michaelallen1966/1908_coding_club_kaggle_titanic

The data comes from:

https://www.kaggle.com/c/titanic/data

And the data includes:

Variable  | Definition
----------|-----------
survival  | Survival (0 = No, 1 = Yes)
pclass    | Ticket class
sex       | Sex
Age       | Age in years
sibsp     | # of siblings / spouses aboard the Titanic
parch     | # of parents / children aboard the Titanic
ticket    | Ticket number
fare      | Passenger fare
cabin     | Cabin number
embarked  | Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Load modules

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Load data

In [None]:
data = pd.read_csv('data/processed_data.csv')

In [None]:
# Drop PassengerId (axis=1 indicates we are removing a column rather than a row)

data.drop('PassengerId', inplace=True, axis=1)

## Divide into X (features) and y (labels)

In [None]:
X = data.drop('Survived',axis=1)
y = data['Survived']

## Divide into training and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25)

## Standardise data

We want all of our features to be on roughly the same scale.

We will use standardisation: for each feature we subtract the mean of the training set values and divide by the standard deviation of the training set values. Note that the mean and standard deviation of the training data are used to standardise the test set data as well, so no information from the test set is used when setting up the scaler.

In [None]:
def standardise_data(X_train, X_test):

    # Initialise a new scaling object for standardising input data
    sc = StandardScaler()

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_std = sc.transform(X_train)
    test_std = sc.transform(X_test)

    return train_std, test_std

In [None]:
X_train_std, X_test_std = standardise_data(X_train, X_test)
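
As a check on what the scaler does, the following minimal sketch standardises a small made-up array by hand and compares the result with `StandardScaler`. The toy arrays and the `_toy` and `manual_` names are assumptions for illustration only; they are not part of the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0], [5.0, 150.0]])

# Mean and standard deviation are taken from the training set only
train_mean = X_train_toy.mean(axis=0)
train_sd = X_train_toy.std(axis=0)  # population SD (ddof=0), as StandardScaler uses

# The same training-set values are applied to both sets
manual_train_std = (X_train_toy - train_mean) / train_sd
manual_test_std = (X_test_toy - train_mean) / train_sd

# Compare with StandardScaler fitted on the training set
sc = StandardScaler().fit(X_train_toy)
sk_train_std = sc.transform(X_train_toy)
sk_test_std = sc.transform(X_test_toy)

print(np.allclose(manual_train_std, sk_train_std))
print(np.allclose(manual_test_std, sk_test_std))
```

The standardised training set has a mean of 0 and a standard deviation of 1 for each feature. The standardised test set generally will not, because it is scaled with the training set values.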

## Fit logistic regression model

In [None]:
model = LogisticRegression()
model.fit(X_train_std,y_train)

## Predict values

In [None]:
# Predict training and test set labels
y_pred_train = model.predict(X_train_std)
y_pred_test = model.predict(X_test_std)

## Calculate accuracy

In [None]:
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)

print('Accuracy of predicting training data =', accuracy_train)
print('Accuracy of predicting test data =', accuracy_test)
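
Taking the mean of a boolean comparison works because `True` counts as 1 and `False` as 0, so the mean is the proportion of correct predictions. The short sketch below shows this on made-up labels and compares it with scikit-learn's `accuracy_score`. The `_toy` arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: four of the five predictions are correct
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

# Proportion of matching labels, calculated as in the cell above
manual_accuracy = np.mean(y_pred_toy == y_true_toy)

# The same measure using scikit-learn
sk_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, sk_accuracy)
```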

## Examining the model coefficients (weights)

In [None]:
co_eff = model.coef_[0]
co_eff

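Put the coefficients in a DataFrame and sort by absolute coefficient size.

Because the features were standardised, the coefficient magnitudes are roughly comparable between features. This ranking might be used to select the top features (and refit the model).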

In [None]:
co_eff_df = pd.DataFrame()

In [None]:
co_eff_df['feature'] = list(X)
co_eff_df['co_eff'] = co_eff
co_eff_df['abs_co_eff'] = np.abs(co_eff)
co_eff_df.sort_values(by='abs_co_eff', ascending=False, inplace=True)

In [None]:
co_eff_df
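
One possible way to select the top features and refit is sketched below. It uses synthetic data from `make_classification` as a stand-in for the standardised Titanic data, and the names `X_demo`, `feature_names`, `ranking`, `top_n` and `reduced_model` are hypothetical. The choice of three features is arbitrary. With the real data, the refit would use the training set, and the test set would be used to check whether accuracy is maintained.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=42)
feature_names = [f'feature_{i}' for i in range(X_demo.shape[1])]
X_demo_std = StandardScaler().fit_transform(X_demo)

# Fit a model on all features and rank them by absolute coefficient
full_model = LogisticRegression()
full_model.fit(X_demo_std, y_demo)

ranking = pd.DataFrame({
    'feature': feature_names,
    'abs_co_eff': np.abs(full_model.coef_[0])})
ranking.sort_values(by='abs_co_eff', ascending=False, inplace=True)

# Keep the top features (top_n is an arbitrary choice for this sketch)
top_n = 3
top_features = list(ranking['feature'].head(top_n))
top_idx = [feature_names.index(name) for name in top_features]

# Refit using only the selected columns
reduced_model = LogisticRegression()
reduced_model.fit(X_demo_std[:, top_idx], y_demo)

print(top_features)
print(reduced_model.coef_.shape)
```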

## Show predicted probabilities

The predicted probabilities are for the two alternative classes: 0 (does not survive) and 1 (survives).

Later we will use these to adjust the sensitivity of our model to detecting survivors or non-survivors.

In [None]:
probabilities = model.predict_proba(X_test_std)

In [None]:
# Show first five values
probabilities[0:5]
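
As a rough preview of how the probabilities could be used to adjust sensitivity, the sketch below applies two different cut-offs to the probability of survival. The probabilities are made up for illustration, and the threshold of 0.3 is an arbitrary assumption. The column order of `predict_proba` follows `model.classes_`, so for labels 0 and 1 the second column is the probability of class 1.

```python
import numpy as np

# Made-up probabilities in the same layout as the predict_proba output:
# column 0 = P(does not survive), column 1 = P(survives)
probabilities_toy = np.array([
    [0.90, 0.10],
    [0.65, 0.35],
    [0.45, 0.55],
    [0.20, 0.80]])

# Probability of survival (class 1)
survival_prob = probabilities_toy[:, 1]

# A cut-off of 0.5 classifies as survivor when survival is the more likely class
default_pred = (survival_prob >= 0.5).astype(int)

# A lower cut-off (0.3 here, an arbitrary choice) labels more passengers as survivors
sensitive_pred = (survival_prob >= 0.3).astype(int)

print(default_pred)
print(sensitive_pred)
```

Lowering the cut-off labels more passengers as survivors, which should catch more true survivors at the cost of more false positives.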