# Logistic Regression Exercises
In this exercise, we'll continue working with the Titanic dataset and building logistic regression models. Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets. The test dataset should only be used for your final model.

For all of the models you create, choose a threshold that optimizes for accuracy.

Do your work for these exercises in either a notebook or a Python script named `model` within your `classification-exercises` repository. Add, commit, and push your work.

In [10]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import acquire

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [7]:
titanic = acquire.get_titanic_data()
titanic = acquire.prep_titanic(titanic)

In [8]:
titanic.head()

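| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | embark_town | alone | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | Southampton | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | Cherbourg | 0 | 0 | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | Southampton | 1 | 0 | 1 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | Southampton | 0 | 0 | 1 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | Southampton | 1 | 0 | 1 |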


In [12]:
# Handle missing values (e.g. in the `age` column) by dropping every row that contains a NaN.
titanic.dropna(inplace=True)

In [21]:
# prep from curriculum
X = titanic[['pclass','fare']]
y = titanic[['survived']]

X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size = .20, random_state = 123)

X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size = .30, random_state = 123)

print("train: ", X_train.shape, ", validate: ", X_validate.shape, ", test: ", X_test.shape)
print("train: ", y_train.shape, ", validate: ", y_validate.shape, ", test: ", y_test.shape)

train:  (398, 2) , validate:  (171, 2) , test:  (143, 2)
train:  (398, 1) , validate:  (171, 1) , test:  (143, 1)
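
The same two-step split is repeated for every feature set in this notebook, so it may be worth wrapping in a helper. The following is a minimal sketch under stated assumptions: `split_xy` is a hypothetical name, the `demo` frame is synthetic stand-in data with the same row count as the cleaned dataset, and `y` is returned as a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_xy(df, features, target="survived", seed=123):
    """Split df into train, validate, and test X/y pairs (hypothetical helper)."""
    X = df[features]
    y = df[target]
    # First carve off 20% of the rows for test ...
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    # ... then split the remainder 70/30 into train and validate.
    X_train, X_validate, y_train, y_validate = train_test_split(
        X_tv, y_tv, test_size=0.30, random_state=seed
    )
    return X_train, X_validate, X_test, y_train, y_validate, y_test


# Synthetic stand-in with the same number of rows as the cleaned dataset above.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=712),
    "fare": rng.uniform(5, 100, size=712),
    "survived": rng.integers(0, 2, size=712),
})
X_train, X_validate, X_test, y_train, y_validate, y_test = split_xy(
    demo, ["pclass", "fare"]
)
print(X_train.shape, X_validate.shape, X_test.shape)
```

Passing a different `features` list is then the only change needed when a new column such as `age` is added to the model.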


In [41]:
# More zeros, which means most did not survive
y_train.survived.value_counts()

0    249
1    149
Name: survived, dtype: int64

In [52]:
# Our baseline model predicts that no passenger survives (the majority class)
# Copy y_train so that adding columns here cannot modify the original target frame
models = y_train.copy()
models['baseline'] = 0
models.columns = ['actual','baseline']
models.head()

| | actual | baseline |
|---|---|---|
| 717 | 1 | 0 |
| 471 | 0 | 0 |
| 161 | 1 | 0 |
| 678 | 0 | 0 |
| 543 | 1 | 0 |


In [53]:
# cross tab of our baseline versus actual
pd.crosstab(models['baseline'], models['actual'])

| baseline \ actual | 0 | 1 |
|---|---|---|
| 0 | 249 | 149 |


In [74]:
# Calculate the baseline accuracy.
# Here the positive class is "did not survive" (0), the class the baseline always predicts.
# accuracy = (TP + TN) / (TP + TN + FP + FN)
true_p = 249
false_p = 149
true_n = 0
false_n = 0

base_acc = (true_p + true_n) / (true_p + true_n + false_p + false_n)
base_acc

0.6256281407035176
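
The counts above are typed in by hand. As a hedged alternative, the same baseline accuracy can be derived directly from the labels. The sketch below rebuilds the class counts from the `value_counts()` output (249 zeros, 149 ones) instead of loading the real data; the variable names are illustrative only.

```python
import pandas as pd

# Rebuild the class counts shown above: 249 non-survivors and 149 survivors.
actual = pd.Series([0] * 249 + [1] * 149, name="actual")

# The baseline predicts the most frequent class for every passenger.
baseline_class = actual.mode()[0]
baseline_acc = (actual == baseline_class).mean()
print(baseline_class, baseline_acc)
```

Computing the baseline this way keeps it correct if the split or the cleaning step changes.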

### 1. Create another model that includes age in addition to fare and pclass. Does this model perform better than your previous one (the baseline model)?

# Steps
- Create Basic Model

    1. Create model object

    2. Fit the model to the data

    3. Predict labels

    4. Estimate the probability of each label

- Evaluate Model

    1. Accuracy

    2. Classification report

    3. Confusion Matrix
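
The steps above can be sketched end to end as follows. This is an illustration only: the data is synthetic, the model uses scikit-learn's default settings rather than the ones chosen later in this notebook, and the names `X_demo` and `y_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Synthetic two-feature data, used only so the sketch runs on its own.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Create basic model
model = LogisticRegression(random_state=123)     # 1. create the model object
model.fit(X_demo, y_demo)                        # 2. fit the model to the data
labels = model.predict(X_demo)                   # 3. predict labels
probs = model.predict_proba(X_demo)              # 4. estimate class probabilities

# Evaluate model
acc = accuracy_score(y_demo, labels)             # 1. accuracy
report = classification_report(y_demo, labels)   # 2. classification report
cm = confusion_matrix(y_demo, labels)            # 3. confusion matrix
print(acc)
print(report)
print(cm)
```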

In [56]:
X = titanic[['pclass','fare','age']]
y = titanic[['survived']]

X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size = .20, random_state = 123)

X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size = .30, random_state = 123)

print("train: ", X_train.shape, ", validate: ", X_validate.shape, ", test: ", X_test.shape)
print("train: ", y_train.shape, ", validate: ", y_validate.shape, ", test: ", y_test.shape)

train:  (398, 3) , validate:  (171, 3) , test:  (143, 3)
train:  (398, 1) , validate:  (171, 1) , test:  (143, 1)


## Create Basic Model

### Create Logistic Regression Object

In [57]:
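# from sklearn.linear_model import LogisticRegression
# Note: class_weight={0:1, 1:99} weights survivors 99 times more heavily than
# non-survivors, which pushes the model toward predicting 1 for every passenger.
logit = LogisticRegression(C=1, class_weight={0:1, 1:99}, random_state=123, intercept_scaling=1, solver='lbfgs')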

### Fit Model to the Data

In [58]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, class_weight={0: 1, 1: 99}, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Print the coefficients and intercept of the model

In [59]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-1.21384777e+00 -4.02559277e-05 -3.12383602e-02]]
Intercept: 
 [7.66577917]
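
Logistic regression coefficients are on the log-odds scale, so exponentiating them can make them easier to read. The sketch below copies the coefficients printed above (in the feature order `pclass`, `fare`, `age`) and converts them to odds ratios; it does not refit the model.

```python
import numpy as np

# Coefficients copied from the output above, in the order pclass, fare, age.
features = ["pclass", "fare", "age"]
coefs = np.array([-1.21384777e+00, -4.02559277e-05, -3.12383602e-02])

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(coefs)
for name, ratio in zip(features, odds_ratios):
    print(f"{name}: {ratio:.4f}")
```

An odds ratio below 1 means the odds of survival fall as the feature increases. These values come from a model fit with heavy class weighting, so they are best read as a demonstration of the calculation.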


### Estimate whether or not a passenger would survive, using the training data

In [63]:
y_pred = logit.predict(X_train)

### Estimate the probability of a passenger surviving, using the training data

In [64]:
y_pred_proba = logit.predict_proba(X_train)
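
The exercise asks for a threshold that optimizes accuracy, while `predict` always uses a fixed 0.5 cutoff on the positive-class probability. One possible approach, sketched here on synthetic data with illustrative variable names, is to scan a grid of candidate thresholds over the second column of `predict_proba` and keep the most accurate one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic, imbalanced data so the sketch is self-contained.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=400) > 0.5).astype(int)

model = LogisticRegression(random_state=123).fit(X_demo, y_demo)

# Probability of the positive class (column order follows model.classes_).
p_positive = model.predict_proba(X_demo)[:, 1]

# Scan candidate thresholds and keep the one with the highest accuracy.
thresholds = np.linspace(0.05, 0.95, 19)
accuracies = [
    accuracy_score(y_demo, (p_positive >= t).astype(int)) for t in thresholds
]
best_threshold = thresholds[int(np.argmax(accuracies))]
best_accuracy = max(accuracies)
print(best_threshold, best_accuracy)
```

In this workflow the threshold would be chosen on the train split and then checked on validate, so that the test split stays untouched until the final model.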

## Evaluate Model on Train

### Compute the accuracy

In [65]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.37


### Create a confusion matrix

In [81]:
print("Confusion Matrix for Model 1\n",confusion_matrix(y_train, y_pred))

Confusion Matrix for Model 1
 [[  0 249]
 [  0 149]]


### Compute Precision, Recall, F1-score, and Support

In [67]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       249
           1       0.37      1.00      0.54       149

    accuracy                           0.37       398
   macro avg       0.19      0.50      0.27       398
weighted avg       0.14      0.37      0.20       398
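
The class 1 row of the report follows directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1, and accuracy from the matrix printed above, using scikit-learn's layout of actual classes in rows and predicted classes in columns.

```python
import numpy as np

# Confusion matrix from above: rows are actual classes, columns are predictions.
cm = np.array([[0, 249],
               [0, 149]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # share of predicted survivors who survived
recall_1 = tp / (tp + fn)      # share of actual survivors the model found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

Because the model predicts 1 for every passenger, recall for class 1 is perfect while precision and accuracy both equal the share of survivors in the training split.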



In [76]:
from sklearn.metrics import accuracy_score

model1_acc = accuracy_score(y_train, y_pred)

### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

#### Adding the sex feature and adjusting the value of C

In [96]:
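# create dummy variables for sex
df_dummies = pd.get_dummies(titanic['sex'], drop_first=True)

# add dummy variables to original df
# Note: running this cell more than once appends a second `male` column,
# which is why the output below shows `male.1` and the split has 5 columns.
titanic = pd.concat([titanic, df_dummies], axis=1)

# check to see it was added correctly
titanic.head()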

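| | passenger_id | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | embark_town | alone | Q | S | male | male.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | Southampton | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | Cherbourg | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | Southampton | 1 | 0 | 1 | 0 | 0 |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | Southampton | 0 | 0 | 1 | 0 | 0 |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | Southampton | 1 | 0 | 1 | 1 | 1 |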
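
The output above contains the `male` column twice, which happens when the concat cell is run more than once. A guarded version of the encoding step, sketched here on a small stand-in frame rather than the real `titanic` data, only adds the dummy column if it is not already present.

```python
import pandas as pd

# Small stand-in frame; the real `titanic` frame has many more columns.
demo = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

for _ in range(2):  # simulate running the cell twice
    if "male" not in demo.columns:
        # drop_first=True keeps only `male`, since `female` is implied by male == 0
        dummies = pd.get_dummies(demo["sex"], drop_first=True).astype(int)
        demo = pd.concat([demo, dummies], axis=1)

print(demo.columns.tolist())
```

With a single `male` column, selecting `['pclass','fare','age','male']` yields four features instead of five, and the model no longer fits two identical coefficients.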


In [97]:
X = titanic[['pclass','fare','age','male']]
y = titanic[['survived']]

X_train_validate, X_test, y_train_validate, y_test = train_test_split(X, y, test_size = .20, random_state = 123)

X_train, X_validate, y_train, y_validate = train_test_split(X_train_validate, y_train_validate, test_size = .30, random_state = 123)

print("train: ", X_train.shape, ", validate: ", X_validate.shape, ", test: ", X_test.shape)
print("train: ", y_train.shape, ", validate: ", y_validate.shape, ", test: ", y_test.shape)

train:  (398, 5) , validate:  (171, 5) , test:  (143, 5)
train:  (398, 1) , validate:  (171, 1) , test:  (143, 1)


In [99]:
#Create Logistic Regression Object
# from sklearn.linear_model import LogisticRegression
# Where we adjust the value of C
logit = LogisticRegression(C=.01, class_weight={0:1, 1:99}, random_state=123, intercept_scaling=1, solver='lbfgs')

In [100]:
#Fit Model to the Data
logit.fit(X_train, y_train)

LogisticRegression(C=0.01, class_weight={0: 1, 1: 99}, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [101]:
# Print the coefficients and intercept of the model
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[-0.45118491  0.00602921 -0.01868668 -0.57462945 -0.57462945]]
Intercept: 
 [5.89088279]


In [102]:
# Estimate whether or not a passenger would survive, using the training data
y_pred = logit.predict(X_train)

In [103]:
# Estimate the probability of a passenger surviving, using the training data
y_pred_proba = logit.predict_proba(X_train)

In [104]:
# Evaluate Model on Train
# Compute the accuracy
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'.format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.37


### 3. Try out other combinations of features and models.

### 4. Use your best 3 models to predict and evaluate on your validate sample.

### 5. Choose your best model based on its validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?
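
One way to organize steps 4 and 5 is to fit each candidate on train, score it on train and validate, and send only the winner to test. The sketch below is an assumption-laden illustration: the data is synthetic, the survival rule is invented for the example, and the feature-set names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names mirror the features used above.
rng = np.random.default_rng(123)
n = 712
demo = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "fare": rng.uniform(5, 100, size=n),
    "age": rng.uniform(1, 80, size=n),
    "male": rng.integers(0, 2, size=n),
})
# Hypothetical rule: survival depends mostly on sex and class, plus noise.
signal = 1.5 - 2.0 * demo["male"] - 0.5 * (demo["pclass"] - 2)
demo["survived"] = (signal + rng.normal(size=n) > 0).astype(int)

train_validate, test = train_test_split(demo, test_size=0.20, random_state=123)
train, validate = train_test_split(train_validate, test_size=0.30, random_state=123)

feature_sets = {
    "pclass_fare": ["pclass", "fare"],
    "pclass_fare_age": ["pclass", "fare", "age"],
    "pclass_fare_age_male": ["pclass", "fare", "age", "male"],
}

rows = []
fitted = {}
for name, cols in feature_sets.items():
    model = LogisticRegression(random_state=123, max_iter=1000)
    model.fit(train[cols], train["survived"])
    fitted[name] = model
    rows.append({
        "model": name,
        "train_acc": model.score(train[cols], train["survived"]),
        "validate_acc": model.score(validate[cols], validate["survived"]),
    })

results = pd.DataFrame(rows).sort_values("validate_acc", ascending=False)
print(results)

# Only the single best model (by validate accuracy) touches the test split.
best_name = results.iloc[0]["model"]
best_cols = feature_sets[best_name]
test_acc = fitted[best_name].score(test[best_cols], test["survived"])
print(best_name, test_acc)
```

A large gap between `train_acc` and `validate_acc` for a model would suggest overfitting, which is worth checking before comparing validate and test.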

- Bonus1: How do different strategies for handling the missing values in the age column affect model performance?

- Bonus2: How do different strategies for encoding sex affect model performance?

- Bonus3: scikit-learn's LogisticRegression classifier applies a regularization penalty to the coefficients by default. This penalty causes the magnitude of the coefficients in the resulting model to be smaller than they otherwise would be. The strength of the penalty can be modified with the C hyperparameter. Small values of C correspond to a larger penalty, and large values of C correspond to a smaller penalty.
    - Try out the following values for C and note how the coefficients and the model's performance on both the dataset it was trained on and on the validate split are affected.
    - C = .01, .1, 1, 10, 100, 1000
- Bonus Bonus: How does scaling the data interact with your choice of C?
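
A possible starting point for the C experiments is a loop that refits the model for each value and records the coefficients alongside train and validate accuracy. The sketch below uses synthetic data with features on deliberately different scales and wraps the model in a `StandardScaler` pipeline; the data, the variable names, and the choice to scale are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales.
rng = np.random.default_rng(123)
X_demo = np.column_stack([
    rng.integers(1, 4, size=500),      # small-range feature
    rng.uniform(5, 500, size=500),     # large-range feature
])
y_demo = (X_demo[:, 0] + rng.normal(size=500) < 2).astype(int)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

coef_norms = {}
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    # Scaling first puts every coefficient on a comparable footing,
    # so the penalty controlled by C treats the features evenly.
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=C, random_state=123, max_iter=1000),
    )
    pipe.fit(X_train, y_train)
    coefs = pipe.named_steps["logisticregression"].coef_
    coef_norms[C] = float(np.linalg.norm(coefs))
    print(C, coefs.round(3), round(pipe.score(X_train, y_train), 3),
          round(pipe.score(X_validate, y_validate), 3))
```

Removing the `StandardScaler` step and rerunning the loop gives a direct comparison for the scaling question, since the penalty then acts on coefficients whose natural sizes depend on each feature's units.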