# Logistic Regression Exercises

In these exercises, we'll continue working with the titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.
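
One way to pick such a threshold is to score a grid of candidate cutoffs against the predicted probabilities. The sketch below is an illustration, not part of the original exercise: the helper name `best_accuracy_threshold`, the threshold grid, and the synthetic stand-in data are all assumptions, and it assumes the target is coded 0/1 so that column 1 of `predict_proba` is the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def best_accuracy_threshold(model, X, y, thresholds=np.linspace(0.05, 0.95, 19)):
    """Return the (threshold, accuracy) pair with the highest accuracy on (X, y)."""
    # Column 1 of predict_proba is the probability of the positive class (label 1)
    proba = model.predict_proba(X)[:, 1]
    # Predict 1 whenever the probability reaches the candidate threshold
    scores = [accuracy_score(y, (proba >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])


# Synthetic stand-in data (NOT the titanic data), only so the sketch runs on its own
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)
threshold, accuracy = best_accuracy_threshold(demo_model, X_demo, y_demo)
print(f"best threshold: {threshold:.2f}, accuracy: {accuracy:.3f}")
```

With the notebook's data, the call would presumably look like `best_accuracy_threshold(lr, X_train[features], y_train)`, with the chosen threshold then checked on validate and never tuned on test.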

Create a new notebook, `logistic_regression`, and use it to answer the following questions:

In [2]:
#data manipulation
import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

#stats is great
from scipy import stats

#my own files with my own functions
import acquire
import prepare

# os provides operating-system utilities (file paths, checking whether a file exists)
# env is my py file to access SQL databases
import os
import env

# In case I want to pull other practice datasets (they come raw, not cleaned)
from pydataset import data

# ML stuff: (modeling imports)
from sklearn.model_selection import train_test_split

# The big 4 for classification
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression #logistic not linear!
from sklearn.neighbors import KNeighborsClassifier #pick the classifier one

# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [3]:
df = acquire.get_titanic_data()

this file exists, reading csv


In [4]:
df = prepare.clean_titanic(df)

In [5]:
train, validate, test = prepare.splitting_data(df, 'survived', seed=123)

In [6]:
train, validate, test = prepare.preprocess_titanic(train, validate, test)

In [7]:
### We want everything EXCEPT the target variable
X_train = train.drop(columns = 'survived')
X_validate = validate.drop(columns = 'survived')
X_test = test.drop(columns = 'survived')

In [10]:
### We want ONLY the target variable
y_train = train.survived
y_validate = validate.survived
y_test = test.survived

### The X/y splits above carry over from the decision tree exercises and are reused for the remaining machine learning models and ensemble methods

In [11]:
X_train.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
702,3,18.0,0,1,14.4542,0,0,0,0
199,2,24.0,0,0,13.0,1,0,0,1
108,3,38.0,0,0,7.8958,1,1,0,1
872,1,33.0,0,0,5.0,1,1,0,1
827,2,1.0,0,2,37.0042,0,1,0,0


In [16]:
df.survived.value_counts()

survived
0    424
1    288
Name: count, dtype: int64

In [22]:
424/ (424+288)

0.5955056179775281

In [23]:
### baseline prediction is not surviving (Perished)
### baseline accuracy is ~60%
### meaning if we choose the mode, we'd be right 60% of the time

print(df.survived.mode())
(df.survived == 0).mean()

0    0
Name: survived, dtype: int64


0.5955056179775281
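
The baseline above is computed on the full `df`. A small helper makes it easy to compute the same number on any split, for example `y_train` alone, which keeps validate and test out of the baseline. This is a sketch; the helper name `baseline_accuracy` is an assumption, and the demo labels are rebuilt from the class counts printed above.

```python
import pandas as pd


def baseline_accuracy(y):
    """Return (most common class, accuracy of always predicting that class)."""
    y = pd.Series(y)
    most_common = y.mode()[0]
    # The share of rows equal to the mode is the accuracy of the constant prediction
    return most_common, float((y == most_common).mean())


# Class counts taken from the value_counts() output above: 424 perished, 288 survived
labels = [0] * 424 + [1] * 288
baseline_class, baseline_acc = baseline_accuracy(labels)
print(baseline_class, round(baseline_acc, 4))
```

Calling `baseline_accuracy(y_train)` would give the training-split baseline, which may differ slightly from the full-dataset figure.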

## 1. Create a model that includes only age, fare, and pclass. Does this model perform better than your baseline?

In [12]:
features = ['age', 'fare', 'pclass']
X_train[features].head()

Unnamed: 0,age,fare,pclass
702,18.0,14.4542,3
199,24.0,13.0,2
108,38.0,7.8958,3
872,33.0,5.0,1
827,1.0,37.0042,2


In [13]:
#make it
lr = LogisticRegression()

#fit it
lr.fit(X_train[features], y_train)

In [24]:
lr.score(X_train[features], y_train)
### yes, ~74% train accuracy beats the ~60% baseline

0.7353629976580797

## 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [25]:
features = ['age', 'fare', 'pclass', 'sex_male']
X_train[features].head()

Unnamed: 0,age,fare,pclass,sex_male
702,18.0,14.4542,3,0
199,24.0,13.0,2,0
108,38.0,7.8958,3,1
872,33.0,5.0,1,1
827,1.0,37.0042,2,1


In [26]:
# C (inverse regularization strength) is the most commonly tuned argument of LogisticRegression(); the default C=1.0 is used here
lr1 = LogisticRegression()
lr1.fit(X_train[features], y_train)

In [27]:
lr1.score(X_train[features], y_train)

0.8032786885245902

## 3. Try out other combinations of features and models.

In [31]:
features = ['age', 'fare', 'pclass', 'sex_male', 'parch', 'sibsp']
X_train[features].head()

Unnamed: 0,age,fare,pclass,sex_male,parch,sibsp
702,18.0,14.4542,3,0,1,0
199,24.0,13.0,2,0,0,0
108,38.0,7.8958,3,1,0,0
872,33.0,5.0,1,1,0,0
827,1.0,37.0042,2,1,2,0


In [32]:
lr2 = LogisticRegression()
lr2.fit(X_train[features], y_train)

In [33]:
lr2.score(X_train[features], y_train)
### train accuracy is no higher after adding 'parch' and 'sibsp'

0.8032786885245902

## 4. Use your best 3 models to predict and evaluate on your validate sample.
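
A possible way to approach this step is to refit each feature combination and score it on both splits side by side. The sketch below is illustrative only: the helper name `compare_feature_sets`, the `max_iter=1000` setting, and the synthetic stand-in data are assumptions, not the notebook's actual results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def compare_feature_sets(feature_sets, X_train, y_train, X_validate, y_validate):
    """Fit one LogisticRegression per feature list and score it on both splits."""
    rows = []
    for name, cols in feature_sets.items():
        # max_iter is raised above the default (an assumption) to reduce
        # convergence warnings on unscaled columns
        model = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
        rows.append({
            "model": name,
            "train_accuracy": model.score(X_train[cols], y_train),
            "validate_accuracy": model.score(X_validate[cols], y_validate),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic stand-in data (NOT the titanic data), only so the sketch runs on its own
rng = np.random.default_rng(123)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y_demo = (X_demo["a"] + X_demo["b"] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_tr, X_val = X_demo.iloc[:200], X_demo.iloc[200:]
y_tr, y_val = y_demo.iloc[:200], y_demo.iloc[200:]

demo_results = compare_feature_sets(
    {"a_only": ["a"], "a_and_b": ["a", "b"], "all": ["a", "b", "c"]},
    X_tr, y_tr, X_val, y_val,
)
print(demo_results.sort_values("validate_accuracy", ascending=False))
```

In this notebook the dictionary would hold the three feature lists tried above, passed together with `X_train`, `y_train`, `X_validate`, and `y_validate`. A large gap between the train and validate columns would suggest overfitting.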

## 5. Choose your best model based on validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?
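
Once a single model is chosen, its metrics on all three splits can be lined up in one pass. The following is a minimal sketch under stated assumptions: `evaluate_on_splits` is a hypothetical helper, and the data is synthetic stand-in data rather than the titanic splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def evaluate_on_splits(model, splits):
    """Return accuracy for each named (X, y) split using an already-fitted model."""
    return {name: accuracy_score(y, model.predict(X)) for name, (X, y) in splits.items()}


# Synthetic stand-in data (NOT the titanic data), only so the sketch runs on its own
rng = np.random.default_rng(123)
X = rng.normal(size=(500, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
splits = {
    "train": (X[:300], y[:300]),
    "validate": (X[300:400], y[300:400]),
    "test": (X[400:], y[400:]),
}

# Fit on train only, then score every split with the same fitted model
final_model = LogisticRegression().fit(*splits["train"])
accuracies = evaluate_on_splits(final_model, splits)
print(accuracies)

# Per-class detail for the test split
X_test_demo, y_test_demo = splits["test"]
test_pred = final_model.predict(X_test_demo)
test_matrix = confusion_matrix(y_test_demo, test_pred)
print(test_matrix)
print(classification_report(y_test_demo, test_pred))
```

Test accuracy that sits close to the validate accuracy is a sign the model generalizes; the test split should be scored only once, after the model has been selected.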

**Bonus1**: How do different strategies for handling the missing values in the age column affect model performance?

**Bonus2**: How do different strategies for encoding sex affect model performance?

**Bonus3**: scikit-learn's LogisticRegression classifier applies a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. The strength of the penalty can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.
C = .01, .1, 1, 10, 100, 1000
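
The sweep over C described above could be sketched as a simple loop. The helper name `sweep_c`, the `max_iter=1000` setting, and the synthetic stand-in data are assumptions made for illustration; the coefficient norm is included as one convenient summary of how strongly the penalty shrinks the coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def sweep_c(X_train, y_train, X_validate, y_validate,
            c_values=(0.01, 0.1, 1, 10, 100, 1000)):
    """Fit one model per C; collect coefficients and train/validate accuracy."""
    results = []
    for c in c_values:
        # Smaller C means a stronger penalty on the coefficients
        model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        results.append({
            "C": c,
            "coef": model.coef_[0].round(3),
            "coef_norm": float(np.linalg.norm(model.coef_)),
            "train_accuracy": model.score(X_train, y_train),
            "validate_accuracy": model.score(X_validate, y_validate),
        })
    return results


# Synthetic stand-in data (NOT the titanic data), only so the sketch runs on its own
rng = np.random.default_rng(123)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

c_results = sweep_c(X[:200], y[:200], X[200:], y[200:])
for row in c_results:
    print(row)
```

With the notebook's data, the arguments would be `X_train[features]`, `y_train`, `X_validate[features]`, and `y_validate`.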

**Bonus Bonus**: How does scaling the data interact with your choice of C?
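
Because the penalty acts on coefficient magnitudes, and those magnitudes depend on each feature's units, the same C can penalize features on small numeric scales more heavily than features on large ones. One way to explore this is to run the same C values with and without standardization. The sketch below assumes a hypothetical helper `validate_accuracy_by_c` and uses synthetic stand-in features on deliberately mismatched scales.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

C_VALUES = (0.01, 0.1, 1, 10, 100, 1000)


def validate_accuracy_by_c(X_train, y_train, X_validate, y_validate, scale):
    """Return {C: validate accuracy}, optionally standardizing features first."""
    scores = {}
    for c in C_VALUES:
        # The scaler is fit on the training data only, inside the pipeline
        steps = [StandardScaler()] if scale else []
        model = make_pipeline(*steps, LogisticRegression(C=c, max_iter=1000))
        model.fit(X_train, y_train)
        scores[c] = model.score(X_validate, y_validate)
    return scores


# Synthetic stand-in data (NOT the titanic data): one large-scale and one small-scale feature
rng = np.random.default_rng(123)
n = 400
big = rng.normal(scale=50.0, size=n)
small = rng.normal(scale=0.1, size=n)
y = (big / 50.0 + small / 0.1 + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.column_stack([big, small])

unscaled_scores = validate_accuracy_by_c(X[:300], y[:300], X[300:], y[300:], scale=False)
scaled_scores = validate_accuracy_by_c(X[:300], y[:300], X[300:], y[300:], scale=True)
for c in C_VALUES:
    print(f"C={c}: unscaled={unscaled_scores[c]:.3f}, scaled={scaled_scores[c]:.3f}")
```

Putting the scaler inside the pipeline keeps validate data from influencing the scaling parameters, which matches the train/validate discipline used throughout these exercises.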