## Logistic Regression Exercises

In these exercises, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
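
One way to pick such a threshold is to score a grid of candidate cutoffs against the predicted probabilities and keep the one with the highest accuracy. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the titanic splits, and the helper name `best_threshold` and the 0.05-step grid are assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in the titanic train/validate splits).
X, y = make_classification(n_samples=500, n_features=4, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

model = LogisticRegression(random_state=123).fit(X_train, y_train)


def best_threshold(model, X, y, thresholds=np.arange(0.05, 1.0, 0.05)):
    """Return (threshold, accuracy) for the cutoff with the highest accuracy."""
    proba = model.predict_proba(X)[:, 1]  # probability of the positive class
    scores = [accuracy_score(y, (proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])


# Choose the threshold on train, then check how it holds up on validate.
threshold, train_accuracy = best_threshold(model, X_train, y_train)
validate_pred = (model.predict_proba(X_validate)[:, 1] >= threshold).astype(int)
validate_accuracy = accuracy_score(y_validate, validate_pred)
print(f"threshold: {threshold:.2f}")
print(f"train accuracy: {train_accuracy:.2f}, validate accuracy: {validate_accuracy:.2f}")
```

`model.predict` uses a fixed cutoff of 0.5, so applying a different threshold means going through `predict_proba` as shown.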

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
from pydataset import data

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, confusion_matrix 
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

from acquire import get_iris_data
from acquire import get_titanic_data
from acquire import get_telco_data
from prepare import split_data
import os
import acquire
from env import get_db_url

In [2]:
titanic_df = get_titanic_data()
titanic_df

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.9250,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1000,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.0500,S,Third,,Southampton,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,,Southampton,1
887,887,1,1,female,19.0,0,0,30.0000,S,First,B,Southampton,1
888,888,0,3,female,,1,2,23.4500,S,Third,,Southampton,0
889,889,1,1,male,26.0,0,0,30.0000,C,First,C,Cherbourg,1


In [3]:
def clean_titanic(df):

    df = df.drop(columns =['embark_town','class','deck'])

    df.embarked = df.embarked.fillna(value='S')

    dummy_df = pd.get_dummies(df[['sex','embarked']], drop_first=True)
    df = pd.concat([df, dummy_df], axis=1)
    return df

In [4]:
titanic_df = clean_titanic(titanic_df)


In [98]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  891 non-null    int64  
 1   survived      891 non-null    int64  
 2   pclass        891 non-null    int64  
 3   sex           891 non-null    object 
 4   age           714 non-null    float64
 5   sibsp         891 non-null    int64  
 6   parch         891 non-null    int64  
 7   fare          891 non-null    float64
 8   embarked      891 non-null    object 
 9   alone         891 non-null    int64  
 10  sex_male      891 non-null    uint8  
 11  embarked_Q    891 non-null    uint8  
 12  embarked_S    891 non-null    uint8  
dtypes: float64(2), int64(6), object(2), uint8(3)
memory usage: 79.2+ KB


In [99]:
titanic_df.head().T

Unnamed: 0,0,1,2,3,4
passenger_id,0,1,2,3,4
survived,0,1,1,1,0
pclass,3,1,3,1,3
sex,male,female,female,female,male
age,22.0,38.0,26.0,35.0,35.0
sibsp,1,1,0,1,0
parch,0,0,0,0,0
fare,7.25,71.2833,7.925,53.1,8.05
embarked,S,C,S,S,S
alone,0,0,1,0,1


In [5]:
train, validate, test = split_data(titanic_df, col_to_stratify='survived')
train.shape, validate.shape, test.shape

((534, 13), (178, 13), (179, 13))

In [101]:
X_train = train.drop(columns=['survived', 'passenger_id', 'sex', 'embarked', 'sibsp', 'parch', 'alone', 'sex_male', 'embarked_Q', 'embarked_S'])
y_train = train.survived

X_validate = validate.drop(columns=['survived', 'passenger_id', 'sex', 'embarked', 'sibsp', 'parch', 'alone', 'sex_male', 'embarked_Q', 'embarked_S'])
y_validate = validate.survived

X_test = test.drop(columns=['survived', 'passenger_id', 'sex', 'embarked', 'sibsp', 'parch', 'alone', 'sex_male', 'embarked_Q', 'embarked_S'])
y_test = test.survived

X_train.isnull().sum()

pclass      0
age       110
fare        0
dtype: int64

In [102]:
# find the mode of age to use as the fill value for missing ages
X_train.age.mode()

0    18.0
1    22.0
2    24.0
3    25.0
Name: age, dtype: float64

In [103]:
# replace all null values with one of the modes (24), using a numeric fill value
X_train['age'] = X_train.age.fillna(value=24)
X_train.isnull().sum()

pclass    0
age       0
fare      0
dtype: int64
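
The fill value can also be learned from the training split and then reused on validate and test, which keeps information from the other splits out of the model. A minimal sketch with `SimpleImputer` (imported above) follows; the toy frames and their values are illustrative assumptions, not rows from the titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_validate (illustrative values only).
X_train = pd.DataFrame({"age": [22.0, 24.0, np.nan, 24.0, 35.0]})
X_validate = pd.DataFrame({"age": [np.nan, 40.0]})

# Learn the most frequent age from the training split only.
imputer = SimpleImputer(strategy="most_frequent")
X_train[["age"]] = imputer.fit_transform(X_train[["age"]])

# Reuse the training fill value for the other splits; do not refit.
X_validate[["age"]] = imputer.transform(X_validate[["age"]])

print(X_train.age.tolist())
print(X_validate.age.tolist())
```

When several values tie for most frequent, as they do for age in the output above, `SimpleImputer` fills with the smallest of them, so the result may differ from a hand-picked mode.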

In [104]:
def establish_baseline(y_train):

    baseline_prediction = y_train.mode()

    y_train_pred = pd.Series((baseline_prediction[0]), range(len(y_train)))

    cm = confusion_matrix(y_train, y_train_pred)
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp+tn)/(tn+fp+fn+tp)
    return accuracy

In [105]:
establish_baseline(y_train)

0.6161048689138576

In [107]:
baseline_accuracy = (train.survived == 0).mean()
round(baseline_accuracy, 2)


0.62

1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

In [108]:

logit = LogisticRegression(random_state=123)


features = ["age", "pclass", "fare"]


logit.fit(X_train[features], y_train)


y_pred = logit.predict(X_train[features])

print("Baseline is", round(baseline_accuracy, 2))
print("Logistic Regression using age, pclass, and fare features")
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train[features], y_train)))

Baseline is 0.62
Logistic Regression using age, pclass, and fare features
Accuracy of Logistic Regression classifier on training set: 0.70


2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [7]:
X_train = train.drop(columns=['survived', 'passenger_id', 'sex', 'embarked', 'sibsp', 'parch', 'alone', 'embarked_Q', 'embarked_S'])
y_train = train.survived

X_validate = validate.drop(columns=['survived', 'passenger_id', 'sex', 'embarked', 'sibsp', 'parch', 'alone', 'embarked_Q', 'embarked_S'])
y_validate = validate.survived

X_test = test.drop(columns=['survived', 'passenger_id', 'sex', 'embarked', 'sibsp', 'parch', 'alone', 'embarked_Q', 'embarked_S'])
y_test = test.survived

X_train.isnull().sum()

pclass        0
age         104
fare          0
sex_male      0
dtype: int64

In [8]:
X_train['age'] = X_train.age.fillna(value=24)
X_train.isnull().sum()

pclass      0
age         0
fare        0
sex_male    0
dtype: int64

In [9]:
# Create the logistic regression
logit1 = LogisticRegression(random_state=123)

# specify the features we're using
features = ["age", "pclass", "fare", "sex_male"]

# Fit a model using only these specified features
logit1.fit(X_train[features], y_train)

y_pred = logit1.predict(X_train[features])

print("Logistic Regression using age, pclass, fare, and gender features")
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit1.score(X_train[features], y_train)))

Logistic Regression using age, pclass, fare, and gender features
Accuracy of Logistic Regression classifier on training set: 0.78


3. Try out other combinations of features and models.

In [11]:
X_train = train.drop(columns=['survived', 'passenger_id', 'sex', 'embarked'])
y_train = train.survived

X_validate = validate.drop(columns=['survived', 'passenger_id', 'sex', 'embarked'])
y_validate = validate.survived

X_test = test.drop(columns=['survived', 'passenger_id', 'sex', 'embarked'])
y_test = test.survived

X_train.isnull().sum()

pclass          0
age           104
sibsp           0
parch           0
fare            0
alone           0
sex_male        0
embarked_Q      0
embarked_S      0
dtype: int64

In [12]:
X_train['age'] = X_train.age.fillna(value=24)
X_train.isnull().sum()

pclass        0
age           0
sibsp         0
parch         0
fare          0
alone         0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64

In [13]:
logit2 = LogisticRegression(random_state=123)

logit2.fit(X_train, y_train)

y_pred = logit2.predict(X_train)

print("Model trained on all features")
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit2.score(X_train, y_train)))



Model trained on all features
Accuracy of Logistic Regression classifier on training set: 0.81


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [14]:
logit3 = LogisticRegression(random_state=123, class_weight='balanced')

logit3.fit(X_train, y_train)

y_pred = logit3.predict(X_train)

accuracy = logit3.score(X_train, y_train)

print("All Features and we're setting the class_weight hyperparameter")
print(f'Accuracy of Logistic Regression classifier on training set: {accuracy:.2}')

All Features and we're setting the class_weight hyperparameter
Accuracy of Logistic Regression classifier on training set: 0.79


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [15]:
features = ["pclass"]


logit5 = LogisticRegression(random_state=123)

logit5.fit(X_train[features], y_train)

y_pred = logit5.predict(X_train[features])
accuracy = logit5.score(X_train[features], y_train)

print("Model trained on pclass only")
print(f'Accuracy of Logistic Regression classifier on training set: {accuracy:.2}')


Model trained on pclass only
Accuracy of Logistic Regression classifier on training set: 0.66


#### minus age, plus embarkation, parch, sibs

4. Use your best 3 models to predict and evaluate on your validate sample.

In [17]:
X_validate['age'] = X_validate.age.fillna(value=24)
X_validate.isnull().sum()

pclass        0
age           0
sibsp         0
parch         0
fare          0
alone         0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64

In [18]:
features = ["age", "pclass", "fare", "sex_male"]

y_pred = logit1.predict(X_validate[features])

print('Logit1 model using age, pclass, fare, and sex_male as the features')
print(classification_report(y_validate, y_pred))

Logit1 model using age, pclass, fare, and sex_male as the features
              precision    recall  f1-score   support

           0       0.85      0.83      0.84       110
           1       0.73      0.76      0.75        68

    accuracy                           0.80       178
   macro avg       0.79      0.80      0.79       178
weighted avg       0.81      0.80      0.80       178



In [19]:
y_pred = logit2.predict(X_validate)

print("Logit2 model using all features and all model defaults")
print(classification_report(y_validate, y_pred))


Logit2 model using all features and all model defaults
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       110
           1       0.76      0.71      0.73        68

    accuracy                           0.80       178
   macro avg       0.79      0.78      0.79       178
weighted avg       0.80      0.80      0.80       178



In [20]:
y_pred = logit3.predict(X_validate)

print("Logit3 model using all features, class_weight='balanced', and all other hyperparameters as default")
print(classification_report(y_validate, y_pred))


Logit3 model using all features, class_weight='balanced', and all other hyperparameters as default
              precision    recall  f1-score   support

           0       0.86      0.81      0.84       110
           1       0.72      0.79      0.76        68

    accuracy                           0.80       178
   macro avg       0.79      0.80      0.80       178
weighted avg       0.81      0.80      0.80       178
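
With several candidates, it can help to put train and validate accuracy side by side so the gap between them is easy to read. The sketch below is a hedged illustration on synthetic data; the `models` dictionary and the column names are hypothetical and only mirror the default and `class_weight='balanced'` variants tried above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the titanic splits).
X, y = make_classification(
    n_samples=600, n_features=6, weights=[0.62], random_state=123
)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)

# Hypothetical candidates mirroring the default and balanced variants.
models = {
    "default": LogisticRegression(random_state=123),
    "balanced": LogisticRegression(random_state=123, class_weight="balanced"),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    validate_acc = model.score(X_validate, y_validate)
    rows.append({
        "model": name,
        "train_accuracy": train_acc,
        "validate_accuracy": validate_acc,
        "gap": train_acc - validate_acc,  # a large positive gap suggests overfitting
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(2))
```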



5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

In [21]:
X_test['age'] = X_test.age.fillna(value=24)
X_test.isnull().sum()

pclass        0
age           0
sibsp         0
parch         0
fare          0
alone         0
sex_male      0
embarked_Q    0
embarked_S    0
dtype: int64

In [22]:
y_pred = logit3.predict(X_test)

print("Logit3 model using all features, class_weight='balanced', and all other hyperparameters as default")
print(classification_report(y_test, y_pred))


Logit3 model using all features, class_weight='balanced', and all other hyperparameters as default
              precision    recall  f1-score   support

           0       0.85      0.81      0.83       110
           1       0.72      0.77      0.74        69

    accuracy                           0.79       179
   macro avg       0.78      0.79      0.78       179
weighted avg       0.80      0.79      0.79       179



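The metrics are close across all three splits. For logit3, accuracy is 0.79 on train, 0.80 on validate, and 0.79 on test, and the per-class precision and recall on test are within about 0.02 of the validate values, so the model does not appear to be overfit.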