Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression
- do train/validate/test split
- begin with baselines for classification
- express and explain the intuition and interpretation of Logistic Regression
- use sklearn.linear_model.LogisticRegression to fit and interpret Logistic Regression models

Logistic regression is a standard baseline for classification models, as well as a handy way to predict probabilities, since its outputs always lie in the unit interval. While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

### Setup

Run the code cell below. You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab.

Libraries:
- category_encoders
- numpy
- pandas
- scikit-learn

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Do train/validate/test split

## Overview

### Predict Titanic survival 🚢

Kaggle is a platform for machine learning competitions. [Kaggle has used the Titanic dataset](https://www.kaggle.com/c/titanic/data) for their most popular "getting started" competition. 

Kaggle splits the data into train and test sets for participants. Let's load both:

In [2]:
import pandas as pd
train = pd.read_csv(DATA_PATH+'titanic/train.csv')
test = pd.read_csv(DATA_PATH+'titanic/test.csv')

Notice that the train set has one more column than the test set:

In [4]:
train.shape, test.shape  # test data is unlabeled

((891, 12), (418, 11))

Which column is in train but not test? The target!

In [5]:
set(train.columns) - set(test.columns)

{'Survived'}

### Why doesn't Kaggle give you the target for the test set?

#### Rachel Thomas, [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)

> One great thing about Kaggle competitions is that they force you to think about validation sets more rigorously (in order to do well). For those who are new to Kaggle, it is a platform that hosts machine learning competitions. Kaggle typically breaks the data into two sets you can download:
>
> 1. a **training set**, which includes the _independent variables,_ as well as the _dependent variable_ (what you are trying to predict).
>
> 2. a **test set**, which just has the _independent variables._ You will make predictions for the test set, which you can submit to Kaggle and get back a score of how well you did.
>
> This is the basic idea needed to get started with machine learning, but to do well, there is a bit more complexity to understand. **You will want to create your own training and validation sets (by splitting the Kaggle “training” data). You will just use your smaller training set (a subset of Kaggle’s training data) for building your model, and you can evaluate it on your validation set (also a subset of Kaggle’s training data) before you submit to Kaggle.**
>
> The most important reason for this is that Kaggle has split the test data into two sets: for the public and private leaderboards. The score you see on the public leaderboard is just for a subset of your predictions (and you don’t know which subset!). How your predictions fare on the private leaderboard won’t be revealed until the end of the competition. The reason this is important is that you could end up overfitting to the public leaderboard and you wouldn’t realize it until the very end when you did poorly on the private leaderboard. Using a good validation set can prevent this. You can check if your validation set is any good by seeing if your model has similar scores on it to compared with on the Kaggle test set. ...
>
> Understanding these distinctions is not just useful for Kaggle. In any predictive machine learning project, you want your model to be able to perform well on new data.

### 2-way train/test split is not enough

#### Hastie, Tibshirani, and Friedman, [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/), Chapter 7: Model Assessment and Selection

> If we are in a data-rich situation, the best approach is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a "vault," and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with the smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

#### Andreas Mueller and Sarah Guido, [Introduction to Machine Learning with Python](https://books.google.com/books?id=1-4lDQAAQBAJ&pg=PA270)

> The distinction between the training set, validation set, and test set is fundamentally important to applying machine learning methods in practice. Any choices made based on the test set accuracy "leak" information from the test set into the model. Therefore, it is important to keep a separate test set, which is only used for the final evaluation. It is good practice to do all exploratory analysis and model selection using the combination of a training and a validation set, and reserve the test set for a final evaluation - this is even true for exploratory visualization. Strictly speaking, evaluating more than one model on the test set and choosing the better of the two will result in an overly optimistic estimate of how accurate the model is.

#### Hadley Wickham, [R for Data Science](https://r4ds.had.co.nz/model-intro.html#hypothesis-generation-vs.hypothesis-confirmation)

> There is a pair of ideas that you must understand in order to do inference correctly:
>
> 1. Each observation can either be used for exploration or confirmation, not both.
>
> 2. You can use an observation as many times as you like for exploration, but you can only use it once for confirmation. As soon as you use an observation twice, you’ve switched from confirmation to exploration.
>
> This is necessary because to confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis. Otherwise you will be over optimistic. There is absolutely nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.
>
> If you are serious about doing an confirmatory analysis, one approach is to split your data into three pieces before you begin the analysis.


#### Sebastian Raschka, [Model Evaluation](https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html)

> Since “a picture is worth a thousand words,” I want to conclude with a figure (shown below) that summarizes my personal recommendations ...

<img src="https://sebastianraschka.com/images/blog/2018/model-evaluation-selection-part4/model-eval-conclusions.jpg" width="600">

Usually, we want to do **"Model selection (hyperparameter optimization) _and_ performance estimation."** (The green box in the diagram.)

Therefore, we usually do **"3-way holdout method (train/validation/test split)"** or **"cross-validation with independent test set."**
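
As a minimal sketch of the 3-way holdout method, `train_test_split` can be applied twice: once to set aside a test set, then again to split the remainder into training and validation sets. The DataFrame `df`, its column names, and the 60/20/20 proportions are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a labeled dataset (not the Titanic data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'feature': rng.normal(size=100),
    'label': rng.integers(0, 2, size=100),
})

# First hold out a test set, then split the remainder into train/validation.
# New variable names are used so that re-running the cell does not keep
# shrinking the original DataFrame.
train_val, test_set = train_test_split(df, test_size=0.2, random_state=42)
train_set, val_set = train_test_split(train_val, test_size=0.25, random_state=42)

print(train_set.shape, val_set.shape, test_set.shape)
```

Splitting into fresh names (`train_set`, `val_set`, `test_set`) rather than overwriting `df` keeps the cell safe to re-run.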

### What's the difference between Training, Validation, and Testing sets?

#### Brandon Rohrer, [Training, Validation, and Testing Data Sets](https://end-to-end-machine-learning.teachable.com/blog/146320/training-validation-testing-data-sets)

> The validation set is for adjusting a model's hyperparameters. The testing data set is the ultimate judge of model performance.
>
> Testing data is what you hold out until very last. You only run your model on it once. You don’t make any changes or adjustments to your model after that. ...

## Follow Along

> You will want to create your own training and validation sets (by splitting the Kaggle “training” data).

Do this, using the [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function:

In [6]:
from sklearn.model_selection import train_test_split


In [12]:
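# Caution: this cell overwrites `train`; re-running it splits the reduced set again (hence 375 + 126 = 501 below, not 891).
train, val = train_test_split(train, random_state=42)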

In [13]:
train.shape, val.shape, test.shape

((375, 12), (126, 12), (418, 11))

## Challenge

For your assignment, you'll do a 3-way train/validate/test split.

Then next sprint, you'll begin to participate in a private Kaggle challenge, just for your cohort! 

You will be provided with data split into 2 sets: training and test. You will create your own training and validation sets, by splitting the Kaggle "training" data, so you'll end up with 3 sets total.

# Begin with baselines for classification

## Overview

We'll begin with the **majority class baseline.**

[Will Koehrsen](https://twitter.com/koehrsen_will/status/1088863527778111488)

> A baseline for classification can be the most common class in the training dataset.

[*Data Science for Business*](https://books.google.com/books?id=4ZctAAAAQBAJ&pg=PT276), Chapter 7.3: Evaluation, Baseline Performance, and Implications for Investments in Data

> For classification tasks, one good baseline is the _majority classifier,_ a naive classifier that always chooses the majority class of the training dataset (see Note: Base rate in Holdout Data and Fitting Graphs). This may seem like advice so obvious it can be passed over quickly, but it is worth spending an extra moment here. There are many cases where smart, analytical people have been tripped up in skipping over this basic comparison. For example, an analyst may see a classification accuracy of 94% from her classifier and conclude that it is doing fairly well—when in fact only 6% of the instances are positive. So, the simple majority prediction classifier also would have an accuracy of 94%. 
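
The 94% example in the quote can be reproduced with scikit-learn's `DummyClassifier`, which offers a `most_frequent` strategy. This is only a sketch on made-up labels (6 positives out of 100); the arrays `X` and `y` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 positives out of 100, matching the 6% in the quote.
y = np.array([1] * 6 + [0] * 94)
X = np.zeros((len(y), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print(baseline_accuracy)
```

A model scoring 94% on such data has learned nothing beyond the base rate.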

## Follow Along

Determine majority class

In [15]:
target = 'Survived'
y_train = train[target]
y_train.value_counts(normalize=True)  # majority class is 0 (did not survive)


0    0.621333
1    0.378667
Name: Survived, dtype: float64

In [16]:
y_train.mode()

0    0
dtype: int64

What if we guessed the majority class for every prediction?

In [17]:
majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)
y_pred[:5]

[0, 0, 0, 0, 0]

#### Use a classification metric: accuracy

[Classification metrics are different from regression metrics!](https://scikit-learn.org/stable/modules/model_evaluation.html)
- Don't use _regression_ metrics to evaluate _classification_ tasks.
- Don't use _classification_ metrics to evaluate _regression_ tasks.

[Accuracy](https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) is a common metric for classification. Accuracy is the ["proportion of correct classifications"](https://en.wikipedia.org/wiki/Confusion_matrix): the number of correct predictions divided by the total number of predictions.
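
To make the definition concrete, here is a small sketch that computes accuracy by hand and compares it with `accuracy_score`. The two label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Number of correct predictions divided by total number of predictions.
manual_accuracy = (y_true == y_hat).mean()
sklearn_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, sklearn_accuracy)
```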

What is the baseline accuracy if we guessed the majority class for every prediction?

In [18]:
from sklearn.metrics import accuracy_score
# Training accuracy of the majority class baseline
accuracy_score(y_train, y_pred)

0.6213333333333333

In [20]:
y_val = val[target]
y_pred = [majority_class] * len(y_val)
accuracy_score(y_val, y_pred)

0.6349206349206349

## Challenge

In your assignment, your Sprint Challenge, and your upcoming Kaggle challenge, you'll begin with the majority class baseline. How quickly can you beat this baseline?

# Express and explain the intuition and interpretation of Logistic Regression


## Overview

To help us get an intuition for *Logistic* Regression, let's start by trying *Linear* Regression instead, and see what happens...

## Follow Along

### Linear Regression?

In [21]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 375.0 | 375.0 | 375.0 | 298.0 | 375.0 | 375.0 | 375.0 |
| mean | 460.205333 | 0.378667 | 2.309333 | 28.877248 | 0.557333 | 0.330667 | 34.607676 |
| std | 258.596123 | 0.485703 | 0.82753 | 14.184096 | 1.138201 | 0.664206 | 59.146306 |
| min | 4.0 | 0.0 | 1.0 | 0.67 | 0.0 | 0.0 | 0.0 |
| 25% | 233.5 | 0.0 | 2.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 471.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4 |
| 75% | 693.0 | 1.0 | 3.0 | 36.0 | 1.0 | 0.0 | 31.33125 |
| max | 891.0 | 1.0 | 3.0 | 70.0 | 8.0 | 3.0 | 512.3292 |


In [22]:
# 1. Import estimator class
from sklearn.linear_model import LinearRegression

# 2. Instantiate this class
linear_reg = LinearRegression()

# 3. Arrange X feature matrices (already did y target vectors)
features = ['Pclass', 'Age', 'Fare']
X_train = train[features]
X_val = val[features]

# Impute missing values (fit the imputer on train only, then apply it to val)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()  # the default strategy fills with the column mean
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

# 4. Fit the model
linear_reg.fit(X_train_imputed, y_train)

# 5. Apply the model to new data.
# The predictions look like this ...
linear_reg.predict(X_val_imputed)

array([ 0.89901554,  0.24295209,  0.36839134,  0.26079796,  0.35549756,
        0.62324252,  0.66427592,  0.16999163,  0.92515875,  0.22992764,
        0.22326344,  0.16322963,  0.23311601,  0.38394742,  0.4991074 ,
        0.19143749,  0.26914245,  0.65825367,  0.39471238,  0.42316224,
        0.21762929,  0.53307606,  0.5898381 ,  0.28857595,  0.2886008 ,
        0.28233828,  0.21750924,  0.4426173 ,  0.4091377 ,  0.2174844 ,
        0.47949999,  0.37087581,  0.28206499,  0.21736435,  0.42258034,
        0.17470298,  0.43142375,  0.21659824,  0.2174678 ,  0.21696266,
        0.52726595,  0.62954695,  0.51673506,  0.42488795,  0.50558696,
        0.14152038,  0.39189773,  0.38953029,  0.29065466,  0.21044445,
        0.43528091,  0.84317908,  0.23308689,  0.20962875,  0.24875661,
        0.40345609,  0.38146294,  0.61045246,  0.26243274,  0.21696683,
        0.27641118,  0.26615062,  0.47453232,  0.62905005,  0.26926668,
        0.26847165,  0.38630504,  0.74920527,  0.47117135,  0.26

In [24]:
# Get coefficients
pd.Series(linear_reg.coef_, features)

Pclass   -0.194727
Age      -0.006536
Fare      0.000994
dtype: float64

In [36]:
test_case = [[1, 70, 100]]
linear_reg.predict(test_case)
linear_reg.intercept_ , linear_reg.coef_

(0.9826992972806641, array([-0.19472692, -0.0065358 ,  0.00099379]))

# Math for linear regression with three features (x1 = Pclass, x2 = Age, x3 = Fare): z = a + b*x1 + c*x2 + d*x3
linear_reg.intercept_ + linear_reg.coef_[0]*x1 + linear_reg.coef_[1]*x2 + linear_reg.coef_[2]*x3


In [30]:
test_case = [[1, 5, 500]]  # 1st class, 5-year old, Rich
linear_reg.predict(test_case)  # extrapolation is not valid: the predicted survival "probability" exceeds 1

array([1.2521881])
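
The prediction above can be reproduced by hand as a sketch, using the intercept and coefficients printed earlier (rounded as displayed, so the result agrees only to about six decimal places). The helper name `linear_prediction` is introduced here for illustration.

```python
import numpy as np

# Values copied from the linear_reg.intercept_ / linear_reg.coef_ output above.
intercept = 0.9826992972806641
coef = np.array([-0.19472692, -0.0065358, 0.00099379])  # Pclass, Age, Fare

def linear_prediction(x):
    """Intercept plus the dot product of coefficients and features."""
    return intercept + np.dot(coef, x)

rich_child = np.array([1, 5, 500])  # 1st class, 5 years old, fare of 500
pred = linear_prediction(rich_child)
print(pred)
```

Nothing in this formula keeps the output between 0 and 1, which is why a large enough fare pushes the "probability" past 1.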

### Logistic Regression!

In [31]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='lbfgs')
log_reg.fit(X_train_imputed, y_train)

print('Validation Accuracy', log_reg.score(X_val_imputed, y_val))  # .score() returns accuracy

Validation Accuracy 0.7063492063492064


In [32]:
# The predictions look like this
log_reg.predict(X_val_imputed)

array([1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0])

In [33]:
log_reg.predict(test_case)

array([1])

In [34]:
log_reg.predict_proba(test_case)

array([[0.01030178, 0.98969822]])

In [37]:
# What's the math?
log_reg.coef_


array([[-0.79674509, -0.03045842,  0.00714001]])

In [38]:
log_reg.intercept_

array([1.94411596])

In [39]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.e**(-x))

In [42]:
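# This replicates log_reg.predict_proba(test_case)[:, 1], the probability of class 1.
# The output below corresponds to test_case = [[1, 70, 100]]: cell In [36] was run after In [34].
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))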

array([[0.43273091]])
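
As a standalone sketch, the same calculation can be done with the coefficient and intercept values printed above, for both test cases used in this notebook. The numbers are copied from the outputs (rounded as displayed); the function name `survival_probability` is introduced here for illustration.

```python
import numpy as np

# Values copied from the log_reg.intercept_ / log_reg.coef_ output above.
intercept = 1.94411596
coef = np.array([-0.79674509, -0.03045842, 0.00714001])  # Pclass, Age, Fare

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

def survival_probability(x):
    """Probability of class 1: sigmoid of the linear combination."""
    return sigmoid(intercept + np.dot(coef, x))

p_child = survival_probability(np.array([1, 5, 500]))   # 1st class, age 5, fare 500
p_elder = survival_probability(np.array([1, 70, 100]))  # 1st class, age 70, fare 100
print(p_child, p_elder)
```

These agree with the `predict_proba` outputs shown in this section (about 0.9897 and 0.4327) up to rounding of the displayed coefficients.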

In [52]:
def my_predict(obs, thresh=0.6):  # predict with a custom probability threshold
    prob = log_reg.predict_proba(obs)
    print(prob)
    return 1 if prob[0][1] > thresh else 0

my_predict(test_case)

[[0.56726909 0.43273091]]


0
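
The function above handles a single observation. A vectorized variant, sketched below, applies a threshold to a whole `predict_proba`-style array at once. The name `predict_with_threshold` is hypothetical, and the probabilities are the two rows shown in this section.

```python
import numpy as np

def predict_with_threshold(proba, thresh=0.5):
    """Turn an (n, 2) predict_proba-style array into 0/1 labels."""
    proba = np.asarray(proba)
    # Column 1 holds the probability of the positive class.
    return (proba[:, 1] > thresh).astype(int)

proba = np.array([[0.56726909, 0.43273091],
                  [0.01030178, 0.98969822]])

labels_default = predict_with_threshold(proba)       # threshold 0.5
labels_strict = predict_with_threshold(proba, 0.99)  # much stricter threshold
print(labels_default, labels_strict)
```

Raising the threshold makes the classifier more reluctant to predict the positive class.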

So, clearly a more appropriate model in this situation! For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).
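
One common way to interpret logistic regression coefficients is as odds ratios: exponentiating a coefficient gives the factor by which the odds of the positive class are multiplied for a one-unit increase in that feature, holding the others fixed. The sketch below uses the coefficients printed above; it is an illustration, not a full analysis.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the log_reg.coef_ output above.
coef = pd.Series([-0.79674509, -0.03045842, 0.00714001],
                 index=['Pclass', 'Age', 'Fare'])

# exp(coefficient) = multiplicative change in the odds per unit increase.
odds_ratios = np.exp(coef)
print(odds_ratios)
```

A ratio below 1 (here `Pclass` and `Age`) means the odds of survival fall as the feature increases; a ratio above 1 (`Fare`) means they rise.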

# Use sklearn.linear_model.LogisticRegression to fit and interpret Logistic Regression models

## Overview

Now that we have more intuition and interpretation of Logistic Regression, let's use it within a realistic, complete scikit-learn workflow, with more features and transformations.

## Follow Along

Select these features: `['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']`

(Why shouldn't we include the `Name` or `Ticket` features? What would happen here?) 

Fit this sequence of transformers & estimator:

- [category_encoders.one_hot.OneHotEncoder](http://contrib.scikit-learn.org/category_encoders/onehot.html)
- [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)
- [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
- [sklearn.linear_model.LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)

Get validation accuracy.
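
The four steps can also be chained into a single scikit-learn pipeline. The sketch below is a stand-in under stated assumptions: it runs on invented toy data rather than the Titanic set, and it swaps `category_encoders.OneHotEncoder` for scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` so that it has no extra dependency.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one categorical and one numeric feature.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Sex': rng.choice(['male', 'female'], size=n),
    'Age': rng.normal(30, 10, size=n),
})
# Label depends on Sex, with roughly 10% of labels flipped as noise.
toy['Survived'] = ((toy['Sex'] == 'female') ^ (rng.random(n) < 0.1)).astype(int)
toy.loc[toy.index[::10], 'Age'] = np.nan  # inject some missing ages

toy_train, toy_val = train_test_split(toy, random_state=42)

# One-hot encode 'Sex', pass 'Age' through unchanged.
encode = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Sex'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
pipeline = make_pipeline(encode, SimpleImputer(strategy='mean'),
                         StandardScaler(), LogisticRegressionCV(cv=5))

# Every transformer is fit on the training rows only.
pipeline.fit(toy_train[['Sex', 'Age']], toy_train['Survived'])
val_accuracy = pipeline.score(toy_val[['Sex', 'Age']], toy_val['Survived'])
print('Validation accuracy', val_accuracy)
```

A pipeline guarantees that the encoder, imputer, and scaler are fit on training data only and then reused unchanged on validation and test data.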

In [54]:
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'  # a string, so y is a 1-D Series rather than a one-column DataFrame
x_train = train[features]
y_train = train[target]
x_val = val[features]
y_val = val[target]

x_train.shape

(375, 7)

Plot coefficients:

In [56]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler




Generate [Kaggle](https://www.kaggle.com/c/titanic) submission:

In [58]:
x_train.head()

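| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 821 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 868 | 3 | male | NaN | 0 | 0 | 9.5 | S |
| 628 | 3 | male | 26.0 | 0 | 0 | 7.8958 | S |
| 749 | 3 | male | 31.0 | 0 | 0 | 7.75 | Q |
| 875 | 3 | female | 15.0 | 0 | 0 | 7.225 | C |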


In [60]:
x_val.head()

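| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 88 | 1 | female | 23.0 | 3 | 2 | 263.0 | S |
| 73 | 3 | male | 26.0 | 1 | 0 | 14.4542 | C |
| 265 | 2 | male | 36.0 | 0 | 0 | 10.5 | S |
| 858 | 3 | female | 24.0 | 0 | 3 | 19.2583 | C |
| 489 | 3 | male | 9.0 | 1 | 1 | 15.9 | S |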


In [63]:
encoder = ce.OneHotEncoder(use_cat_names=True)
x_train_encoded = encoder.fit_transform(x_train)

In [66]:
x_val_encoded = encoder.transform(x_val)

In [69]:
# Fill missing values with the column mean (fit on train, apply to val)
imputer = SimpleImputer(strategy='mean')
x_train_imputed = imputer.fit_transform(x_train_encoded)
x_val_imputed = imputer.transform(x_val_encoded)  # returns a NumPy array, not a DataFrame

In [74]:
x_val_imputed[:5]  

array([[  1.    ,   0.    ,   1.    ,  23.    ,   3.    ,   2.    ,
        263.    ,   1.    ,   0.    ,   0.    ,   0.    ],
       [  3.    ,   1.    ,   0.    ,  26.    ,   1.    ,   0.    ,
         14.4542,   0.    ,   0.    ,   1.    ,   0.    ],
       [  2.    ,   1.    ,   0.    ,  36.    ,   0.    ,   0.    ,
         10.5   ,   1.    ,   0.    ,   0.    ,   0.    ],
       [  3.    ,   0.    ,   1.    ,  24.    ,   0.    ,   3.    ,
         19.2583,   0.    ,   0.    ,   1.    ,   0.    ],
       [  3.    ,   1.    ,   0.    ,   9.    ,   1.    ,   1.    ,
         15.9   ,   1.    ,   0.    ,   0.    ,   0.    ]])

In [77]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_imputed)
x_val_scaled = scaler.transform(x_val_imputed)
x_val_scaled[:5]

array([[-1.58433265, -1.34890655,  1.34890655, -0.46559658,  2.14894441,
         2.51663503,  3.86663974,  0.59891206, -0.30544142, -0.45790547,
        -0.0732252 ],
       [ 0.83572741,  0.7413412 , -0.7413412 , -0.22793609,  0.38943752,
        -0.49850278, -0.34119462, -1.66969422, -0.30544142,  2.18385686,
        -0.0732252 ],
       [-0.37430262,  0.7413412 , -0.7413412 ,  0.56426557, -0.49031592,
        -0.49850278, -0.40813849,  0.59891206, -0.30544142, -0.45790547,
        -0.0732252 ],
       [ 0.83572741, -1.34890655,  1.34890655, -0.38637642, -0.49031592,
         4.02420393, -0.2598621 , -1.66969422, -0.30544142,  2.18385686,
        -0.0732252 ],
       [ 0.83572741,  0.7413412 , -0.7413412 , -1.5746789 ,  0.38943752,
         1.00906612, -0.31671749,  0.59891206, -0.30544142, -0.45790547,
        -0.0732252 ]])
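
For intuition about what the scaler does, the sketch below checks on a tiny made-up array that `StandardScaler` subtracts the training mean and divides by the training standard deviation (population form, `ddof=0`). The arrays and the name `toy_scaler` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and one new row to transform.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_new = np.array([[4.0, 30.0]])

# Fit on training data only, then apply the same statistics to new data.
toy_scaler = StandardScaler().fit(X_tr)
scaled = toy_scaler.transform(X_new)

# Same computation by hand, using the training mean and standard deviation.
manual = (X_new - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(scaled, manual)
```

Because the statistics come from the training set, validation and test rows can land well outside the range seen in training, as some of the values above do.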

## Challenge

You'll use Logistic Regression for your assignment, your Sprint Challenge, and optionally for your first model in our Kaggle challenge!

In [None]:
# Kaggle submission
# `model` is not defined earlier in this notebook; as an assumption, fit the
# LogisticRegressionCV listed in the Follow Along steps on the scaled training data.
model = LogisticRegressionCV()
model.fit(x_train_scaled, y_train)

x_test = test[features]
x_test_encoded = encoder.transform(x_test)
x_test_imputed = imputer.transform(x_test_encoded)
x_test_scaled = scaler.transform(x_test_imputed)
y_pred = model.predict(x_test_scaled)

y_pred[:5]

# We have predictions, but can't calculate accuracy ourselves
# because we don't have labels. Kaggle does,
# so we'll save and submit the predictions to them.

submission = test[['PassengerId']].copy()
submission['Survived'] = y_pred
submission.head()

submission.to_csv('nameofdata.csv', index=False)

# Review

For your assignment, you'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- Begin with baselines for classification.
- Use scikit-learn for logistic regression.
- Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- Get your model's test accuracy. (One time, at the end.)
- Commit your notebook to your fork of the GitHub repo.
- Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
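
The time-based split described in the list above might look like the following sketch. The DataFrame `reviews` and its column names `Date` and `Great` are assumptions made for illustration; check the actual burrito dataset for the real column names.

```python
import pandas as pd

# Hypothetical reviews; the column names 'Date' and 'Great' are assumptions.
reviews = pd.DataFrame({
    'Date': pd.to_datetime(['2015-06-01', '2016-12-31', '2017-01-01',
                            '2017-08-15', '2018-01-01', '2019-03-10']),
    'Great': [0, 1, 0, 1, 1, 0],
})

# Split by year: train on 2016 & earlier, validate on 2017, test on 2018 & later.
year = reviews['Date'].dt.year
train_reviews = reviews[year <= 2016]
val_reviews = reviews[year == 2017]
test_reviews = reviews[year >= 2018]

print(len(train_reviews), len(val_reviews), len(test_reviews))
```

Splitting by date rather than at random mimics how the model would be used: trained on the past and evaluated on later data.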

# Sources
- Brandon Rohrer, [Training, Validation, and Testing Data Sets](https://end-to-end-machine-learning.teachable.com/blog/146320/training-validation-testing-data-sets)
- Hadley Wickham, [R for Data Science](https://r4ds.had.co.nz/model-intro.html#hypothesis-generation-vs.hypothesis-confirmation), Hypothesis generation vs. hypothesis confirmation
- Hastie, Tibshirani, and Friedman, [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/), Chapter 7: Model Assessment and Selection
- Mueller and Guido, [Introduction to Machine Learning with Python](https://books.google.com/books?id=1-4lDQAAQBAJ&pg=PA270), Chapter 5.2.2: The Danger of Overfitting the Parameters and the Validation Set
- Provost and Fawcett, [Data Science for Business](https://books.google.com/books?id=4ZctAAAAQBAJ&pg=PT276), Chapter 7.3: Evaluation, Baseline Performance, and Implications for Investments in Data
- Rachel Thomas, [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)
- Sebastian Raschka, [Model Evaluation](https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html)
- Will Koehrsen, ["A baseline for classification can be the most common class in the training dataset."](https://twitter.com/koehrsen_will/status/1088863527778111488)