<div class="alert alert-block alert-warning">

# Logistic Regression Exercises

In these exercises, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
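
One way to pick such a threshold is to scan a grid of cutoffs over the predicted positive-class probabilities and keep the one with the highest accuracy. The sketch below is only illustrative: `best_threshold` is a hypothetical helper, and the labels and probabilities are made-up values, not Titanic results. With a fitted scikit-learn classifier, the probabilities would typically come from `model.predict_proba(X)[:, 1]`.

```python
import numpy as np

def best_threshold(y_true, y_proba, thresholds=None):
    """Return the (threshold, accuracy) pair with the highest accuracy.

    y_proba holds the predicted probability of the positive class.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    if thresholds is None:
        # candidate cutoffs from 0.05 to 0.95 in steps of 0.05
        thresholds = np.linspace(0.05, 0.95, 19)

    # accuracy when every probability >= t is classified as 1
    accuracies = [np.mean((y_proba >= t).astype(int) == y_true) for t in thresholds]

    # np.argmax returns the first (lowest) threshold among ties
    best = int(np.argmax(accuracies))
    return float(thresholds[best]), float(accuracies[best])

# Illustrative values only
y_true = [0, 0, 0, 1, 1, 1]
y_proba = [0.12, 0.31, 0.62, 0.58, 0.71, 0.91]

threshold, accuracy = best_threshold(y_true, y_proba)
print(f'Best threshold: {threshold:.2f}, accuracy: {accuracy:.3f}')
```

To stay consistent with the instructions above, the threshold would be chosen on train or validate, never on test.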

Create a new notebook, logistic_regression, and use it to answer the following questions:

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression

from acquire import new_titanic_data
from prepare import prep_titanic, split_data

import warnings
warnings.filterwarnings("ignore")

#### Acquire 

In [2]:
# Acquire data
titanic = prep_titanic(new_titanic_data())
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,3,male,22.0,1,0,7.25,Southampton,0
1,1,1,female,38.0,1,0,71.2833,Cherbourg,0
2,1,3,female,26.0,0,0,7.925,Southampton,1
3,1,1,female,35.0,1,0,53.1,Southampton,0
4,0,3,male,35.0,0,0,8.05,Southampton,1


In [3]:
titanic['sex'] = titanic.sex.map({'male': 1, 'female': 0})
titanic['embark_town'] = titanic.embark_town.map({'Southampton': 0, 'Queenstown': 1, 'Cherbourg': 2})
titanic['age'] = titanic.age.astype(int)

In [4]:
# take a look
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,3,1,22,1,0,7.25,0,0
1,1,1,0,38,1,0,71.2833,2,0
2,1,3,0,26,0,0,7.925,0,1
3,1,1,0,35,1,0,53.1,0,0
4,0,3,1,35,0,0,8.05,0,1


#### Prepare

In [5]:
# Split the data into train, validate, and test sets
train, validate, test = split_data(titanic, 'survived')

#### Isolate the target variable

In [6]:
# we know what our X and y are, let's be explicit about defining them
X_train = train.drop(columns='survived')
y_train = train.survived

X_val = validate.drop(columns='survived')
y_val = validate.survived

X_test = test.drop(columns='survived')
y_test = test.survived

#### Create the baseline

In [7]:
# write a function to compute the baseline for a classification model
def establish_baseline(y_train):
    # establish the value we will predict for all observations (the most common class)
    baseline_prediction = y_train.mode()[0]

    # create a series of predictions with that value,
    # the same length (and index) as our training set
    y_train_pred = pd.Series(baseline_prediction, index=y_train.index)

    # compute accuracy of baseline
    cm = confusion_matrix(y_train, y_train_pred)
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp+tn)/(tn+fp+fn+tp)
    return accuracy

In [8]:
baseline_accuracy = establish_baseline(y_train)
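
Because the baseline predicts the most common class for every row, its accuracy equals the share of that class in the target. A minimal sketch of that shortcut, using a hypothetical helper name and a toy series rather than the Titanic target:

```python
import pandas as pd

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    # value_counts(normalize=True) gives each class's share of the rows
    return y.value_counts(normalize=True).max()

# Toy target: three 0s and two 1s
y_example = pd.Series([0, 0, 0, 1, 1])
baseline = majority_baseline_accuracy(y_example)
print(f'Baseline accuracy: {baseline:.2f}')
```

This should agree with the confusion-matrix approach whenever the target is a single pandas Series.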

<div class="alert alert-block alert-success">

1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

In [9]:
# create algorithm object
logit1 = LogisticRegression(C=1, random_state=42, intercept_scaling=1, solver='liblinear')

# fit model with age, pclass and fare as only features
logit1.fit(X_train[['age', 'pclass', 'fare']], y_train)

# compute accuracy
train_accuracy = logit1.score(X_train[['age', 'pclass', 'fare']], y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_accuracy}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.6867469879518072
Baseline Accuracy: 0.5943775100401606


<div class="alert alert-block alert-success">

2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [10]:
# Did not create dummy variables; 'sex' was mapped to 0/1 earlier instead

# create algorithm object
logit2 = LogisticRegression(C=1, random_state=42, intercept_scaling=1, solver='liblinear')

# fit model with age, pclass, fare and sex as only features
logit2.fit(X_train[['age', 'pclass', 'fare', 'sex']], y_train)

# compute accuracy
train_accuracy = logit2.score(X_train[['age', 'pclass', 'fare', 'sex']], y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_accuracy}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.7871485943775101
Baseline Accuracy: 0.5943775100401606
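
For comparison, the dummy-variable route mentioned in the question could look like the sketch below. It uses `pd.get_dummies` on a small made-up frame; the column values are illustrative, and `drop_first=True` is one possible choice that leaves a single `sex_male` indicator.

```python
import pandas as pd

# Small made-up frame standing in for the unencoded data
df = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'fare': [7.25, 71.28, 7.93, 8.05],
})

# One-hot encode 'sex'; drop_first removes the 'female' column,
# so 'sex_male' is 1 for male passengers and 0 otherwise
encoded = pd.get_dummies(df, columns=['sex'], drop_first=True, dtype=int)
print(encoded)
```

For a two-level feature this carries the same information as the 0/1 mapping. For a feature with three or more levels, such as `embark_town`, dummy columns avoid implying an order among the categories.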


<div class="alert alert-block alert-success">

3. Try out other combinations of features and models.

In [11]:
# Test model with all features

# create algorithm object
logit3 = LogisticRegression(C=1, random_state=42, intercept_scaling=1, solver='liblinear')

# fit model with all features
logit3.fit(X_train, y_train)

# compute accuracy
train_accuracy = logit3.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_accuracy}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.8112449799196787
Baseline Accuracy: 0.5943775100401606


In [12]:
# Try changing the 'solver' parameter to 'lbfgs'

# create algorithm object
logit4 = LogisticRegression(C=1, random_state=42, intercept_scaling=1, solver='lbfgs')

# fit model with all features
logit4.fit(X_train, y_train)

# compute accuracy
train_acc4 = logit4.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc4}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.7911646586345381
Baseline Accuracy: 0.5943775100401606


In [13]:
# Try changing 'class_weight' to 'balanced'

# create algorithm object
logit5 = LogisticRegression(C=1, class_weight='balanced', random_state=42, intercept_scaling=1, solver='lbfgs')

# fit model with all features
logit5.fit(X_train, y_train)

# compute accuracy
train_acc5 = logit5.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc5}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.7991967871485943
Baseline Accuracy: 0.5943775100401606


In [14]:
# Try changing C (the inverse of regularization strength) from 1 to 0.1; smaller C means stronger regularization

# create algorithm object
logit6 = LogisticRegression(C=0.1, random_state=123, intercept_scaling=1, solver='lbfgs')

# fit model with all features
logit6.fit(X_train, y_train)

# compute accuracy
train_acc6 = logit6.score(X_train, y_train)

# compare this model with baseline
print(f'Train Accuracy: {train_acc6}')
print(f'Baseline Accuracy: {baseline_accuracy}')

Train Accuracy: 0.8032128514056225
Baseline Accuracy: 0.5943775100401606


My 3 best models are currently: logit3, logit5, logit6

<div class="alert alert-block alert-success">

4. Use your best 3 models to predict and evaluate on your validate sample.

In [15]:
# use each model to make predictions for the X_val observations
y_val_pred3 = logit3.predict(X_val)
# compute accuracy on train and validate for logit3
train_acc3 = logit3.score(X_train, y_train)
val_acc3 = logit3.score(X_val, y_val)
# create a list and add to a dataframe at the end comparing all the models.
model3 = [3, train_acc3, val_acc3]

y_val_pred5 = logit5.predict(X_val)
val_acc5 = logit5.score(X_val, y_val)
model5 = [5, train_acc5, val_acc5]

y_val_pred6 = logit6.predict(X_val)
val_acc6 = logit6.score(X_val, y_val)
model6 = [6, train_acc6, val_acc6]

pd.DataFrame([model3, model5, model6], columns=['model', 'in-sample accuracy', 'out-of-sample accuracy'])

Unnamed: 0,model,in-sample accuracy,out-of-sample accuracy
0,3,0.811245,0.803738
1,5,0.799197,0.785047
2,6,0.803213,0.794393


Models 3 and 6 perform similarly on validate; Model 6 is used for the next question.
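
Collecting the scores in a loop keeps each model's train and validate accuracy paired with the right model, which is easy to get wrong when variables are assigned by hand. The following is a sketch under stated assumptions: the data is synthetic (`make_classification`), and the model names and settings are illustrative rather than the ones fitted above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared train/validate data
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Candidate models keyed by a descriptive name
models = {
    'C=1': LogisticRegression(C=1, solver='lbfgs', max_iter=1000),
    'C=0.1': LogisticRegression(C=0.1, solver='lbfgs', max_iter=1000),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        'model': name,
        'in-sample accuracy': model.score(X_train, y_train),
        'out-of-sample accuracy': model.score(X_val, y_val),
    })

results = pd.DataFrame(rows)
print(results)
```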

<div class="alert alert-block alert-success">

5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

In [16]:
print('Coefficient: \n', logit6.coef_)
print('Intercept: \n', logit6.intercept_)

Coefficient: 
 [[-0.62625505 -1.23437875 -0.02221037 -0.13097001 -0.11431645  0.00636409
   0.12663798 -0.39492829]]
Intercept: 
 [2.53392906]
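
The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, and passing the linear combination through the sigmoid gives a predicted probability. The sketch below reuses the printed values. The feature order is an assumption: it presumes the coefficients follow the column order shown in `titanic.head()` with `survived` dropped, which would be worth confirming against `X_train.columns`.

```python
import numpy as np

# Coefficients and intercept copied from the printed output
coef = np.array([-0.62625505, -1.23437875, -0.02221037, -0.13097001,
                 -0.11431645, 0.00636409, 0.12663798, -0.39492829])
intercept = 2.53392906

# ASSUMPTION: coefficient order matches these columns
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embark_town', 'alone']

# exp(coefficient) = multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = dict(zip(features, np.exp(coef)))

def predict_survival_probability(x):
    """Sigmoid of the linear combination intercept + coef . x."""
    z = intercept + np.dot(coef, x)
    return 1.0 / (1.0 + np.exp(-z))

# First row of the encoded table: pclass=3, sex=1 (male), age=22,
# sibsp=1, parch=0, fare=7.25, embark_town=0, alone=0
p = predict_survival_probability(np.array([3, 1, 22, 1, 0, 7.25, 0, 0]))

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio {ratio:.3f}')
print(f'Predicted survival probability: {p:.3f}')
```

An odds ratio below 1 means the feature lowers the odds of survival as it increases; under the assumed ordering, that applies to `sex` (coded 1 for male) and `pclass`.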


In [17]:
# Validate model 6

y_pred6 = logit6.predict(X_val)
print("Model 6: solver = lbfgs, C = 0.1")
# accuracy of model 6
print('Accuracy: {:.2f}'.format(logit6.score(X_val, y_val)))
# confusion matrix of model 6
print(confusion_matrix(y_val, y_pred6))
# classification report of model 6
print(classification_report(y_val, y_pred6))

Model 6: solver = lbfgs, C = 0.1
Accuracy: 0.79
[[56  8]
 [14 29]]
              precision    recall  f1-score   support

           0       0.80      0.88      0.84        64
           1       0.78      0.67      0.72        43

    accuracy                           0.79       107
   macro avg       0.79      0.77      0.78       107
weighted avg       0.79      0.79      0.79       107
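
The class-1 figures in the report can be recomputed directly from the confusion matrix, which is a useful check on how the metrics are defined. A short sketch using the validate matrix printed above (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Validate confusion matrix for model 6, copied from the output
cm = np.array([[56, 8],
               [14, 29]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted survivors, share who survived
recall = tp / (tp + fn)           # of actual survivors, share predicted to survive
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}')
```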



In [18]:
# Test Model 6

y_pred6 = logit6.predict(X_test)
y_pred_proba = logit6.predict_proba(X_test)
print("Model 6: solver = lbfgs, C = 0.1")
print('Accuracy: {:.2f}'.format(logit6.score(X_test, y_test)))
print(confusion_matrix(y_test, y_pred6))
print(classification_report(y_test, y_pred6))

Model 6: solver = lbfgs, C = 0.1
Accuracy: 0.78
[[184  28]
 [ 51  93]]
              precision    recall  f1-score   support

           0       0.78      0.87      0.82       212
           1       0.77      0.65      0.70       144

    accuracy                           0.78       356
   macro avg       0.78      0.76      0.76       356
weighted avg       0.78      0.78      0.77       356
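
To answer the comparison part of the question, the three accuracies for model 6 can be put side by side. This sketch only reuses numbers already printed in the notebook: the train accuracy from the model 6 fit, and the validate and test confusion matrices.

```python
# Accuracies for model 6, taken from the outputs in this notebook
scores = {
    'train': 0.8032128514056225,     # printed train accuracy for logit6
    'validate': (56 + 29) / 107,     # from the validate confusion matrix
    'test': (184 + 93) / 356,        # from the test confusion matrix
}

# Drop in accuracy from each earlier split to test
gaps = {name: acc - scores['test'] for name, acc in scores.items() if name != 'test'}

for name, acc in scores.items():
    print(f'{name:>8}: {acc:.3f}')
for name, gap in gaps.items():
    print(f'{name} minus test: {gap:.3f}')
```

With these numbers, test accuracy sits about 2.5 percentage points below train and about 1.6 below validate, which would suggest the model generalizes reasonably well rather than being badly overfit.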

