### Overview

The goal of this script is to build a model with maximum feature interpretability while achieving a cross-validated prediction accuracy no more than 25 percent below that of the best predictive model.

### Prediction Modeling Recap

- The model with the highest predictive ability was the gradient boosting classifier at 83.9 percent accuracy
- Twenty-five percent below 83.9 percent is 62.9 percent
- The logistic regression model with an L1 penalty achieved an accuracy of 79.1 percent with median-imputed values, well above 62.9 percent
- The logistic regression model also has the highest interpretability of all of the prediction models
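
The threshold arithmetic in the list above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the accuracies are the figures quoted in the recap.

```python
# Accuracies (in percent) quoted in the recap above.
best_accuracy = 83.9    # gradient boosting classifier
logit_accuracy = 79.1   # L1-penalized logistic regression, median imputation

# The goal allows at most a 25 percent relative drop from the best model.
max_relative_drop = 0.25
threshold = best_accuracy * (1 - max_relative_drop)

meets_goal = logit_accuracy >= threshold
print(f"threshold = {threshold:.1f}%, meets goal: {meets_goal}")
```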

### Import Libraries

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.ensemble
import sklearn.model_selection
import warnings
warnings.filterwarnings('ignore')
import wrangle #Custom scripts I created. See documentation.

### Retrieve Training and Test Data

In [32]:
#Load data
filepath1 = "Data/train.csv"
filepath2 = "Data/test.csv"
train = pd.read_csv(filepath1)
test = pd.read_csv(filepath2)

#Wrangle data
train = wrangle.wrangle(train)
test = wrangle.wrangle(test)

### Logistic Regression Model Feature Diagnostics

In [33]:
#Retrieve full y column for model fitting
y_full = train.Survived

#Drop the target (Survived) and PassengerId from the feature set
train = train.drop(["Survived","PassengerId"], axis=1)

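#Imputation of missing values for training data (column-wise median)
#Note: iloc[:,1:] skips the first column of train, so X_median has one column fewer than train.columns
imp_median = sklearn.preprocessing.Imputer(missing_values='NaN', strategy='median', axis=0)
X_median = imp_median.fit_transform(train.iloc[:,1:])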
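
`sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22; newer releases provide `sklearn.impute.SimpleImputer(strategy="median")` for the same purpose. To make clear what column-wise median imputation does, here is a minimal standard-library sketch. The helper name `impute_median` and the toy rows are assumptions for illustration, not part of the original script.

```python
import math
from statistics import median


def impute_median(rows):
    """Replace NaN entries with the median of the observed values in the same column."""
    columns = list(zip(*rows))
    medians = [median(v for v in col if not math.isnan(v)) for col in columns]
    return [[medians[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]


# Toy (Age, Fare) rows with one missing value in each column.
rows = [
    [22.0, 7.25],
    [float("nan"), 71.2833],
    [26.0, float("nan")],
    [35.0, 53.1],
]
filled = impute_median(rows)
for row in filled:
    print(row)
```

Each column's median is computed from its observed values only, so a missing entry never influences the value used to fill it.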

In [42]:
train.head(5)

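| | Pclass | Age | SibSp | Parch | Fare | Ticket_Num | Cabin_Yes | Cabin_A | Cabin_G | Cabin_T | ... | Ticket_Prefix_SOP | Ticket_Prefix_SOPP | Ticket_Prefix_SOTONO2 | Ticket_Prefix_SOTONOQ | Ticket_Prefix_SP | Ticket_Prefix_STONO 2 | Ticket_Prefix_STONO2 | Ticket_Prefix_SWPP | Ticket_Prefix_WC | Ticket_Prefix_WEP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 21171.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 17599.0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 3101282.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 113803.0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 373450.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |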


In [46]:
log_cv = sklearn.linear_model.LogisticRegressionCV(cv = 10, solver ="liblinear", penalty = "l1", random_state = 100)
log_cv = log_cv.fit(X_median, y_full)

#C_ holds the inverse regularization strength selected by cross-validation (a one-element array for a binary target)
C = float(log_cv.C_[0])

In [47]:
# Subsample the data and repeatedly fit an L1-penalized logistic regression model, using the C selected by the previous CV model
# Assess which features the L1 model selects and the fraction of resamplings in which each feature is selected at that same C
log_rand = sklearn.linear_model.RandomizedLogisticRegression(C = C, n_resampling=200, random_state = 100)
log_rand = log_rand.fit(X_median, y_full)
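
`RandomizedLogisticRegression` was deprecated in scikit-learn 0.19 and removed in 0.21, so the cell above only runs on older releases. The subsampling idea can be approximated with plain `LogisticRegression`, as in the sketch below. The function name `selection_frequencies`, the 75 percent subsample fraction, and the toy data are assumptions for illustration. The sketch also omits the random feature rescaling that the original estimator applied, so its scores should not be expected to match `log_rand.scores_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def selection_frequencies(X, y, C=1.0, n_resampling=200,
                          sample_fraction=0.75, seed=100):
    """Fraction of subsamples in which each feature gets a non-zero L1 coefficient."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_resampling):
        # Draw a subsample of rows without replacement.
        idx = rng.choice(n_samples, size=n_sub, replace=False)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[idx], y[idx])
        # A feature counts as selected when its coefficient is not exactly zero.
        counts += (model.coef_.ravel() != 0)
    return counts / n_resampling


# Toy data (made up): only the first of five features drives the label.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)
freq = selection_frequencies(X_toy, y_toy, C=1.0, n_resampling=50)
print(dict(zip(["f0", "f1", "f2", "f3", "f4"], freq.round(2))))
```

On the notebook's data, the call would presumably look like `selection_frequencies(X_median, y_full, C=C)`.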

In [48]:
#Random forest classifier
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state = 100)
rf = rf.fit(X_median,y_full)

In [49]:
#Extremely randomized trees
ert = sklearn.ensemble.ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state = 100)
ert = ert.fit(X_median,y_full)

In [50]:
#Gradient boosting classifier
gbc = sklearn.ensemble.GradientBoostingClassifier(learning_rate = 0.1, n_estimators=100, max_depth = 3, random_state = 100)
gbc = gbc.fit(X_median,y_full)

In [51]:
#Table of feature selection diagnostics
pd.DataFrame(list(zip(train.columns,
                      np.transpose(log_cv.coef_),
                      np.transpose(log_rand.scores_),
                      np.transpose(rf.feature_importances_),
                      np.transpose(ert.feature_importances_),
                      np.transpose(gbc.feature_importances_))),
             columns=["Feature",
                      "L1Penalty",
                      "RandomScore",
                      "RandForest",
                      "ExtrRandTree",
                      "GradBoosClass"])

| | Feature | L1Penalty | RandomScore | RandForest | ExtrRandTree | GradBoosClass |
|---|---|---|---|---|---|---|
| 0 | Pclass | [-0.0343631133848] | 0.965 | 0.154221 | 0.153258 | 0.163858 |
| 1 | Age | [-0.383505017789] | 0.97 | 0.041081 | 0.049421 | 0.059779 |
| 2 | SibSp | [-0.174662703157] | 0.255 | 0.033352 | 0.040553 | 0.019473 |
| 3 | Parch | [0.00716789891859] | 0.805 | 0.158171 | 0.129568 | 0.170888 |
| 4 | Fare | [-2.05748510042e-07] | 0.13 | 0.201754 | 0.178487 | 0.258652 |
| 5 | Ticket_Num | [1.64273318149] | 1.0 | 0.034068 | 0.046758 | 0.0202 |
| 6 | Cabin_Yes | [0.0] | 0.13 | 0.003356 | 0.003477 | 0.004704 |
| 7 | Cabin_A | [-2.89037247205] | 0.665 | 0.001606 | 0.003318 | 0.005491 |
| 8 | Cabin_G | [0.0] | 0.13 | 0.000272 | 0.000291 | 0.0 |
| 9 | Cabin_T | [-0.515523043254] | 0.04 | 0.007309 | 0.006141 | 0.0 |
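
The RandomScore column above can be filtered against a selection-frequency cutoff to list the stable features. The sketch below does this for the ten rows shown, using a cutoff of 0.70. The dictionary is copied by hand from the table, and the variable names are made up for illustration.

```python
# RandomScore values copied from the ten rows of the diagnostics table above.
random_score = {
    "Pclass": 0.965,
    "Age": 0.97,
    "SibSp": 0.255,
    "Parch": 0.805,
    "Fare": 0.13,
    "Ticket_Num": 1.0,
    "Cabin_Yes": 0.13,
    "Cabin_A": 0.665,
    "Cabin_G": 0.13,
    "Cabin_T": 0.04,
}

# Keep features selected in more than 70 percent of the resamplings.
cutoff = 0.70
stable = [name for name, score in random_score.items() if score > cutoff]
print(stable)
```

Against the full output, the same filter could be applied to `log_rand.scores_` paired with the matching column names.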


### Stability Selection Results

- Ten features were selected in more than 70 percent of the resamplings when repeatedly fitting an L1-penalized logistic regression model on subsamples, with a penalty value C of approximately 2.78:
    - Ticket class
    - Age
    - Number of parents and children aboard Titanic
    - Ticket number
    - Cabin F
    - Cabin D
    - female
    - Ticket prefix "SP"
    - Ticket prefix "STONO2"
    - Ticket prefix "SWPP"