# Logistic Regression Models

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [28]:
df_not_imputed = pd.read_csv('./data/train_not_imputed.csv')
df_median_imputed = pd.read_csv('./data/train_median_imputed.csv')
df_pclass_imputed = pd.read_csv('./data/train_pclass_imputed.csv')

Set the index to `PassengerId`.

In [29]:
df_not_imputed.set_index('PassengerId', drop=True, inplace=True)
df_median_imputed.set_index('PassengerId', drop=True, inplace=True)
df_pclass_imputed.set_index('PassengerId', drop=True, inplace=True)

Separate each training dataset into `X` and `y`.

In [30]:
X_ni = df_not_imputed.drop('Survived', axis=1)
y_ni = df_not_imputed['Survived']
X_mi = df_median_imputed.drop('Survived', axis=1)
y_mi = df_median_imputed['Survived']
X_pi = df_pclass_imputed.drop('Survived', axis=1)
y_pi = df_pclass_imputed['Survived']

Perform a train-test split on each dataset.

In [31]:
X_ni_train, X_ni_valid, y_ni_train, y_ni_valid = train_test_split(X_ni, y_ni, test_size=.3, random_state=8801)
X_mi_train, X_mi_valid, y_mi_train, y_mi_valid = train_test_split(X_mi, y_mi, test_size=.3, random_state=8801)
X_pi_train, X_pi_valid, y_pi_train, y_pi_valid = train_test_split(X_pi, y_pi, test_size=.3, random_state=8801)

Fit a logistic regression model on each dataset.

In [32]:
ni_logreg = LogisticRegression()
mi_logreg = LogisticRegression()
pi_logreg = LogisticRegression()

In [33]:
ni_logreg.fit(X_ni_train, y_ni_train)
mi_logreg.fit(X_mi_train, y_mi_train)
pi_logreg.fit(X_pi_train, y_pi_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LogisticRegression()
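The warning above says the solver stopped at its iteration limit before converging, and it names two remedies: scale the data or raise `max_iter`. A minimal sketch of both, run on synthetic data rather than the Titanic features; the demo arrays, the pipeline name `scaled_logreg`, and the value `max_iter=1000` are illustrative assumptions, not part of the original notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption): columns on very
# different scales, e.g. a fare-like column next to a 0/1 indicator.
rng = np.random.default_rng(8801)
n = 400
fare_like = rng.gamma(shape=2.0, scale=30.0, size=n)
age_like = rng.normal(30.0, 12.0, size=n)
flag = rng.integers(0, 2, size=n)
X_demo = np.column_stack([fare_like, age_like, flag])
y_demo = (0.02 * fare_like - 0.03 * age_like + 1.5 * flag
          + rng.normal(0.0, 1.0, size=n) > 1.0).astype(int)

# Standardize the features, then fit with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# Number of iterations the solver actually used (last pipeline step).
n_iter_used = int(scaled_logreg[-1].n_iter_[0])
print(n_iter_used)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training rows only and then reused when predicting on validation rows.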

Make predictions for every model.

In [34]:
ni_pred = ni_logreg.predict(X_ni_valid)
mi_pred = mi_logreg.predict(X_mi_valid)
pi_pred = pi_logreg.predict(X_pi_valid)

Print the performance report of every model for comparison.

In [36]:
print('Performance of the logistic regression trained on a not imputed dataset:')
print(metrics.classification_report(y_ni_valid, ni_pred))

Performance of the logistic regression trained on a not imputed dataset:
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       127
           1       0.79      0.72      0.75        87

    accuracy                           0.81       214
   macro avg       0.80      0.80      0.80       214
weighted avg       0.81      0.81      0.81       214



In [37]:
print('Performance of the logistic regression trained on a median-imputed dataset:')
print(metrics.classification_report(y_mi_valid, mi_pred))

Performance of the logistic regression trained on a median-imputed dataset:
              precision    recall  f1-score   support

         0.0       0.81      0.93      0.87       150
         1.0       0.89      0.73      0.80       117

    accuracy                           0.84       267
   macro avg       0.85      0.83      0.83       267
weighted avg       0.84      0.84      0.84       267



In [38]:
print('Performance of the logistic regression trained on a Pclass-imputed dataset:')
print(metrics.classification_report(y_pi_valid, pi_pred))

Performance of the logistic regression trained on a Pclass-imputed dataset:
              precision    recall  f1-score   support

           0       0.81      0.93      0.87       150
           1       0.89      0.73      0.80       117

    accuracy                           0.84       267
   macro avg       0.85      0.83      0.83       267
weighted avg       0.84      0.84      0.84       267



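The model trained on the smaller, non-imputed dataset scored lower (accuracy 0.81, F1 for survivors 0.75) than the models trained on the imputed datasets (accuracy 0.84, F1 for survivors 0.80). The comparison is only indicative, because the validation sets differ in size (214 vs. 267 rows) and none of the fits converged. The choice of imputation method makes no visible difference: the two imputed models produce identical reports.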
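One way to compare the models on the same passengers is to split the passenger IDs once and reuse that partition for every dataset, instead of calling `train_test_split` separately on frames of different lengths. A minimal sketch on synthetic data; the frames `df_imputed_demo` and `df_dropped_demo` and the helper `split_by_ids` are hypothetical and not part of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption): the imputed frame keeps every passenger,
# the other frame has dropped the rows that had missing values.
rng = np.random.default_rng(0)
all_ids = np.arange(1, 101)
df_imputed_demo = pd.DataFrame(
    {'Age': rng.normal(30, 10, size=100),
     'Survived': rng.integers(0, 2, size=100)},
    index=pd.Index(all_ids, name='PassengerId'),
)
df_dropped_demo = df_imputed_demo.drop(index=all_ids[::5])  # 20 rows removed

# Split the passenger IDs once, then reuse that partition for every dataset.
train_ids, valid_ids = train_test_split(all_ids, test_size=.3, random_state=8801)

def split_by_ids(df, train_ids, valid_ids):
    """Return (train, valid) frames restricted to the IDs present in df."""
    return (df.loc[df.index.intersection(train_ids)],
            df.loc[df.index.intersection(valid_ids)])

imp_train, imp_valid = split_by_ids(df_imputed_demo, train_ids, valid_ids)
drp_train, drp_valid = split_by_ids(df_dropped_demo, train_ids, valid_ids)

# Validation rows on which both models can be scored.
shared_valid_ids = imp_valid.index.intersection(drp_valid.index)
print(len(imp_valid), len(drp_valid), len(shared_valid_ids))
```

With a shared partition, no passenger ends up in one model's training set and another model's validation set, so scores on `shared_valid_ids` are directly comparable.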

Next, fit a model on the Pclass-imputed dataset with the `Title` columns dropped.

In [41]:
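# Drop every column from position 11 onward; this relies on the Title
# columns being the last columns of the frame.
df_pclass_imputed_no_titles = df_pclass_imputed.drop(columns=df_pclass_imputed.columns[11:])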
X_pi_no_title = df_pclass_imputed_no_titles.drop('Survived', axis=1)
y_pi_no_title = df_pclass_imputed_no_titles['Survived']
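Dropping columns by position breaks silently if the column order changes. A small sketch of selecting the columns by name instead; it assumes the one-hot title columns share a `Title_` prefix, which is a guess, since the notebook does not show the actual column names:

```python
import pandas as pd

# Hypothetical column layout (assumption): title indicators are named
# 'Title_<value>'. Replace the prefix with whatever the real frame uses.
df_demo = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 2],
    'Title_Mr': [1, 0, 0],
    'Title_Mrs': [0, 1, 0],
    'Title_Miss': [0, 0, 1],
})

# Collect the title columns by name and drop them in one call.
title_cols = [col for col in df_demo.columns if col.startswith('Title_')]
df_demo_no_titles = df_demo.drop(columns=title_cols)
print(df_demo_no_titles.columns.tolist())  # ['Survived', 'Pclass']
```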

In [42]:
X_pint_train, X_pint_valid, y_pint_train, y_pint_valid = train_test_split(X_pi_no_title, y_pi_no_title, test_size=.3, random_state=8801)

In [44]:
pint_logreg = LogisticRegression()
pint_logreg.fit(X_pint_train, y_pint_train)
pint_pred = pint_logreg.predict(X_pint_valid)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [45]:
print('Performance of the logistic regression trained on a Pclass-imputed dataset (without "Title" columns):')
print(metrics.classification_report(y_pint_valid, pint_pred))

Performance of the logistic regression trained on a Pclass-imputed dataset (without "Title" columns):
              precision    recall  f1-score   support

           0       0.78      0.92      0.84       150
           1       0.87      0.67      0.75       117

    accuracy                           0.81       267
   macro avg       0.82      0.79      0.80       267
weighted avg       0.82      0.81      0.80       267



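Without the `Title` columns, accuracy drops from 0.84 to 0.81 and recall for survivors from 0.73 to 0.67. This suggests that the passenger's `Title` helps, to some extent, to predict survival, although a single train-test split is not conclusive evidence.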
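Cross-validation is one way to check whether a gap of this size exceeds fold-to-fold variation. The sketch below compares a model with and without a title indicator on synthetic data; the data-generating process, the single `Title_Mr` column, and the scaled pipeline are all assumptions made for illustration, so the printed numbers say nothing about the Titanic data itself:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the target depends on a title indicator.
rng = np.random.default_rng(8801)
n = 600
X_demo = pd.DataFrame({
    'Age': rng.normal(30, 12, size=n),
    'Fare': rng.gamma(2.0, 30.0, size=n),
    'Title_Mr': rng.integers(0, 2, size=n),
})
logit = 1.0 - 2.0 * X_demo['Title_Mr'] + 0.01 * X_demo['Fare']
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Same folds for both feature sets, so the scores are paired.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=8801)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores_with = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
scores_without = cross_val_score(model, X_demo.drop(columns=['Title_Mr']),
                                 y_demo, cv=cv, scoring='accuracy')

print(f'with title:    {scores_with.mean():.3f} +/- {scores_with.std():.3f}')
print(f'without title: {scores_without.mean():.3f} +/- {scores_without.std():.3f}')
```

If the difference between the two means is small relative to the reported standard deviations, the feature group's contribution is hard to separate from split noise.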