## Machine Learning: apply logistic regression modeling to predict the survival likelihood
 
After exploring the data, we found that age, sex, ticket class, and possibly port of embarkation are correlated with the likelihood of survival. The impact of variables such as the number of siblings/spouses on board is less clear. Ticket number, fare, and cabin number, on the other hand, did not appear to correlate with the survival rate.

A quick look at the variables in this dataset again:

|Variable |Definition                                  |Key                                             |
|:--------|:-------------------------------------------|:-----------------------------------------------|
|survival |Survival                                    |0 = No, 1 = Yes                                 |
|pclass   |Ticket class                                |1 = 1st, 2 = 2nd, 3 = 3rd                       |
|sex      |Sex                                         |                                                |
|Age      |Age in years                                |                                                |
|sibsp    |# of siblings / spouses aboard the Titanic  |                                                |
|parch    |# of parents / children aboard the Titanic  |                                                |
|ticket   |Ticket number                               |                                                |
|fare     |Passenger fare                              |                                                |
|cabin    |Cabin number                                |                                                |
|embarked |Port of Embarkation                         |C = Cherbourg, Q = Queenstown, S = Southampton  |

*(Variable Notes: pclass is a proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower)*

We will therefore use pclass, sex, age, sibsp, parch, and embarked as the predictors for the model, with survival as the response variable to be predicted.

In [87]:
# First let's import the necessary libraries. 

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

In [88]:
# Read in both the train data and test data.

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [89]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [90]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


### Data cleaning: missing values
Recall from the earlier data exploration that some variables have missing values, so we need to clean the data before fitting the model.

In [91]:
# Take another look at the missing values.

train.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

In [92]:
test.isnull().mean()

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.205742
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.002392
Cabin          0.782297
Embarked       0.000000
dtype: float64

Therefore, we have to either impute or drop the missing data in the 'Age' variable. Since about 20% of the values are missing, dropping those rows would discard too much data, so it makes sense to impute them with the median instead. Fare and Cabin also have missing data, but since I am not going to include them in the predictors, I do not have to do anything about them. Finally, about 0.2% of the values in Embarked are missing in the training set, which I will impute with the mode.

In [93]:
# Impute the missing data in 'Age' with the median of the training set (used for both train and test).

age_median = train['Age'].median()

train['Age'] = train['Age'].fillna(age_median)

test['Age'] = test['Age'].fillna(age_median)

In [94]:
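# Impute the missing data in 'Embarked' with its mode.
# Series.mode() returns a Series (ties are possible), so take its first value;
# passing the Series itself to fillna would align on the index and leave the gaps unfilled.

embarked_mode = train['Embarked'].mode()[0]

train['Embarked'] = train['Embarked'].fillna(embarked_mode)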
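
One detail worth checking when imputing with the mode is that `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on the index. The following is a minimal sketch on a made-up toy column (not the Titanic data) that compares filling with the Series against filling with its first value:

```python
import numpy as np
import pandas as pd

# Toy column standing in for train['Embarked']; the values are made up.
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', np.nan])

mode_series = embarked.mode()      # a Series, because several values can tie
mode_value = mode_series.iloc[0]   # the scalar label

# Filling with the Series aligns on the index, so NaNs at other positions stay.
filled_with_series = embarked.fillna(mode_series)

# Filling with the scalar replaces every NaN.
filled_with_scalar = embarked.fillna(mode_value)

print(filled_with_series.isnull().sum())
print(filled_with_scalar.isnull().sum())
```

Running `train['Embarked'].isnull().sum()` after the imputation step is a quick way to confirm that no gaps remain.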

### Data cleaning: categorical variables
Note that we also need to handle the categorical variables by creating dummy variables before fitting the model. Before doing so, we first create the predictor matrix X and the response vector y.

In [95]:
# Decide predictor matrix X and response vector y 

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']

X_train = train[predictors]

X_test = test[predictors]

y_train = train['Survived'].values

In [96]:
# Define a function create_dummy_df to create dummy variables for the categorical columns in the datasets.

def create_dummy_df(df, cat_cols, dummy_na):
    '''
    INPUT:
    df - pandas dataframe with categorical variables you want to dummy
    cat_cols - a list of strings that contain the names of the categorical columns
    dummy_na - Bool holding whether you want to dummy NA vals of categorical columns or not
    
    OUTPUT:
    df - a new dataframe that has the following characteristics:
            1. contains all columns that were not specified as categorical
            2. removes all the original columns in cat_cols
            3. dummy columns for each of the categorical columns in cat_cols
            4. if dummy_na is True - it also contains dummy columns for the NaN values
            
    '''
    for col in cat_cols:
        if col not in df.columns:
            # Skip names that are not present in this dataframe.
            continue
        # Replace the original column with one dummy column per category.
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=False, dummy_na=dummy_na)
        df = pd.concat([df.drop(col, axis=1), dummies], axis=1)

    return df

In [97]:
cat_cols = ['Sex', 'Embarked']

In [98]:
X_train = create_dummy_df(X_train, cat_cols, dummy_na=False).values

In [99]:
X_test = create_dummy_df(X_test, cat_cols, dummy_na=False).values
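
Because the train and test frames are dummy-encoded separately and then converted to plain arrays, the model relies on both having the same columns in the same order. That holds when both files contain the same categories. As a hedged guard for the case where they do not, the sketch below uses made-up toy frames (the names `toy_train` and `toy_test` are illustrative only) and reindexes the test dummies to the training columns:

```python
import pandas as pd

# Toy frames; the test frame happens to lack the 'Q' category.
toy_train = pd.DataFrame({'Age': [22.0, 38.0, 26.0], 'Embarked': ['S', 'C', 'Q']})
toy_test = pd.DataFrame({'Age': [35.0, 54.0], 'Embarked': ['S', 'C']})

train_dummies = pd.get_dummies(toy_train, columns=['Embarked'])
test_dummies = pd.get_dummies(toy_test, columns=['Embarked'])

# Reindex the test frame to the training columns; dummies that are
# missing from the test frame are added and filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

The same `reindex` call could be applied to the output of `create_dummy_df` before taking `.values`.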

### Fitting the model
Now that the missing values have been imputed and the categorical variables encoded, we are ready to fit the model.

In [100]:
model = LogisticRegression()

In [101]:
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [102]:
# Make predictions for the test dataset using model.predict.

y_predict = model.predict(X_test)

In [103]:
y_predict

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
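
`model.predict` returns hard 0/1 labels. Since the goal is a survival likelihood, it may also be useful to look at the class probabilities from `predict_proba`. The sketch below uses synthetic data as a stand-in for the Titanic features (`X_toy`, `y_toy`, and `toy_model` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix and the 0/1 labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# One row per sample, one column per class, ordered as in toy_model.classes_.
proba = toy_model.predict_proba(X_toy[:5])
p_survive = proba[:, 1]              # estimated probability of class 1
labels = toy_model.predict(X_toy[:5])

print(toy_model.classes_)
print(p_survive.round(3))
print(labels)
```

With the fitted model from this notebook, the analogous call would be `model.predict_proba(X_test)[:, 1]`.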

In [None]:
# If we had the true values for the test dataset (a hypothetical y_true), we could also compute the accuracy.

(y_true == y_predict).mean()
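
The test file has no `Survived` column (see `test.info()` above), so the expression above cannot be evaluated on it. One common alternative is to hold out part of the labelled training data and measure accuracy there. The sketch below illustrates the idea on synthetic data. The split size, the random seeds, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / y_train.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 2] > 0).astype(int)

# Hold out 25% of the labelled rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

val_model = LogisticRegression().fit(X_fit, y_fit)
y_val_pred = val_model.predict(X_val)

# Fraction of held-out rows predicted correctly; model.score reports the same quantity.
manual_accuracy = (y_val == y_val_pred).mean()
print(manual_accuracy, val_model.score(X_val, y_val))
```

Applied to this notebook, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`.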