In [1]:
%autosave 0

Autosave disabled


In [2]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from adam_prepare import titanic_pipeline

Let's read in the data using the titanic pipeline function!

In [3]:
train, val, test = titanic_pipeline()
train.head()

Unnamed: 0,survived,age,sibsp,parch,fare,alone,sex_male,class_First,class_Second,class_Third,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
748,0,19.0,1,0,53.1,0,1,1,0,0,0,0,1
45,0,29.0,0,0,8.05,1,1,0,0,1,0,0,1
28,1,29.0,0,0,7.8792,1,0,0,0,1,0,1,0
633,0,29.0,0,0,0.0,1,1,1,0,0,0,0,1
403,0,28.0,1,0,15.85,0,1,0,0,1,0,0,1


Let's define a function to create X and y splits of our data.

In [4]:
def xy_split(df):
    """Split a dataframe into features X and the target y ('survived')."""
    return df.drop(columns=['survived']), df.survived

In [5]:
X_train, y_train = xy_split(train)
X_val, y_val = xy_split(val)

X_train.head()

Unnamed: 0,age,sibsp,parch,fare,alone,sex_male,class_First,class_Second,class_Third,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
748,19.0,1,0,53.1,0,1,1,0,0,0,0,1
45,29.0,0,0,8.05,1,1,0,0,1,0,0,1
28,29.0,0,0,7.8792,1,0,0,0,1,0,1,0
633,29.0,0,0,0.0,1,1,1,0,0,0,0,1
403,28.0,1,0,15.85,0,1,0,0,1,0,0,1


Now it's time to build our [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model!

With the default max_iter of 100, the solver stops before it converges on this data and scikit-learn emits a ConvergenceWarning. Raising max_iter gives the solver enough iterations to settle on coefficients for all features.

In [6]:
seed = 42

logreg = LogisticRegression(random_state = seed, max_iter = 400)

logreg.fit(X_train, y_train)

logreg.score(X_train, y_train), logreg.score(X_val, y_val)

(0.7993579454253612, 0.8507462686567164)
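
To make the convergence point concrete, here is a minimal sketch that checks whether a fit triggered a ConvergenceWarning. It uses synthetic data from make_classification as a stand-in (an assumption, not the Titanic data), and fit_and_check is a hypothetical helper name introduced only for illustration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic set).
X, y = make_classification(n_samples=300, n_features=6, random_state=42)


def fit_and_check(max_iter):
    """Fit a model; return (iterations used, whether a ConvergenceWarning fired)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = LogisticRegression(random_state=42, max_iter=max_iter).fit(X, y)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), warned


# A deliberately tiny iteration cap versus a generous one.
iters_low, warned_low = fit_and_check(max_iter=1)
iters_high, warned_high = fit_and_check(max_iter=1000)

print(iters_low, warned_low)
print(iters_high, warned_high)
```

The n_iter_ attribute reports how many iterations the solver actually used, so comparing it against max_iter is a quick way to see whether the cap was the limiting factor.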

We can show the coefficients applied to each feature in our dataset.

In [7]:
logreg.coef_

array([[-0.04131956, -0.49520052, -0.17900688,  0.00252894, -0.62470577,
        -2.47895528,  0.98241479,  0.0508135 , -1.01420055, -0.14031909,
         0.22561708, -0.41096248]])
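
As a rough illustration of how these coefficients are used, the sketch below rebuilds a binary model's predicted probabilities by hand from coef_ and intercept_. It runs on synthetic data (an assumption, not the Titanic data), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Titanic set).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(random_state=42, max_iter=400).fit(X, y)

# Linear score (log-odds): one coefficient per feature plus the intercept.
log_odds = X @ model.coef_[0] + model.intercept_[0]

# The sigmoid maps log-odds to the probability of the positive class.
prob_manual = 1.0 / (1.0 + np.exp(-log_odds))

# Compare the hand-built values with scikit-learn's own outputs.
same_scores = np.allclose(log_odds, model.decision_function(X))
same_probs = np.allclose(prob_manual, model.predict_proba(X)[:, 1])
print(same_scores, same_probs)
```

Each coefficient therefore shifts the log-odds of the positive class, which is why its sign matters more directly than its raw size when features are on different scales.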

By creating a dataframe, we can pair each coefficient with its feature name!

A negative coefficient lowers the predicted log-odds of the positive class (survived = 1) as the feature increases, holding the other features fixed, while a positive coefficient raises them.

In [8]:
pd.DataFrame({'feature': X_train.columns,
              'coefficient': logreg.coef_[0]})

Unnamed: 0,feature,coefficient
0,age,-0.04132
1,sibsp,-0.495201
2,parch,-0.179007
3,fare,0.002529
4,alone,-0.624706
5,sex_male,-2.478955
6,class_First,0.982415
7,class_Second,0.050813
8,class_Third,-1.014201
9,embark_town_Cherbourg,-0.140319
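
One common way to read these numbers is to exponentiate them into odds ratios. The sketch below does that for three coefficients copied from the table above. The interpretation assumes the other features are held fixed, and because the model is regularized the magnitudes are somewhat shrunk, so treat the ratios as approximate.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "age": -0.04132,
    "sex_male": -2.478955,
    "class_First": 0.982415,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in the feature, other features held fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

A ratio below 1 means the odds of survival go down as the feature increases, and a ratio above 1 means they go up.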


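Let's try again, this time with L1 (or Lasso) regularization!

LogisticRegression applies an L2 penalty by default, which shrinks coefficients toward zero without eliminating them. An L1 penalty can push some coefficients to exactly zero, so we should expect a few features to drop out of the model. The default lbfgs solver doesn't support L1, so we switch to liblinear.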

In [9]:
seed = 42

logreg = LogisticRegression(random_state = seed, max_iter = 400,
                            solver = 'liblinear', penalty = 'l1')

logreg.fit(X_train, y_train)

logreg.score(X_train, y_train), logreg.score(X_val, y_val)

(0.8009630818619583, 0.8507462686567164)
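
To see the sparsity effect in isolation, here is a small sketch that counts exactly-zero coefficients under each penalty. The synthetic data, the value C=0.05, and the helper name count_zero_coefs are all assumptions chosen for illustration; none of them come from the notebook above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with only 3 informative features out of 20 (an assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=42)


def count_zero_coefs(penalty):
    """Fit with the given penalty and count coefficients that are exactly zero."""
    # C=0.05 is an assumed, fairly strong regularization setting.
    model = LogisticRegression(penalty=penalty, solver="liblinear", C=0.05,
                               random_state=42, max_iter=400).fit(X, y)
    return int(np.sum(model.coef_[0] == 0.0))


n_zero_l1 = count_zero_coefs("l1")
n_zero_l2 = count_zero_coefs("l2")
print(n_zero_l1, n_zero_l2)
```

Smaller values of C mean stronger regularization, so lowering C under an L1 penalty will typically zero out more coefficients.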

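The dataframe of features and coefficients shows that class_Third and embark_town_Cherbourg had their coefficients reduced to exactly zero, so they no longer contribute to the model's predictions. Validation accuracy is unchanged at about 0.851.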

In [10]:
pd.DataFrame({'feature': X_train.columns,
              'coefficient': logreg.coef_[0]})

Unnamed: 0,feature,coefficient
0,age,-0.038327
1,sibsp,-0.418894
2,parch,-0.136355
3,fare,0.002931
4,alone,-0.434766
5,sex_male,-2.530612
6,class_First,1.966966
7,class_Second,1.009704
8,class_Third,0.0
9,embark_town_Cherbourg,0.0
