## Classification
 
* Response variable is qualitative (i.e., categorical)
* Predict qualitative response
* Some methods first predict probability that observation belongs to a category
* Set of observations

$(x_1,y_1),(x_2,y_2),...,(x_n,y_n)$

- Train on training data and predict on test data

### Default data set

* Predict whether a person will default on their credit card based on income and credit card balance

In [1]:

library(ISLR)
head(Default)
str(Default)


| default | student | balance | income |
|---|---|---|---|
| No | No | 729.5265 | 44361.625 |
| No | Yes | 817.1804 | 12106.135 |
| No | No | 1073.5492 | 31767.139 |
| No | No | 529.2506 | 35704.494 |
| No | No | 785.6559 | 38463.496 |
| No | Yes | 919.5885 | 7491.559 |


'data.frame':	10000 obs. of  4 variables:
 $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
 $ balance: num  730 817 1074 529 786 ...
 $ income : num  44362 12106 31767 35704 38463 ...


  
    
![](Classification.png)
 

### Why Not Linear Regression?

* Code response as 1 = default, 0 = no default
* A fitted line is not bounded by 0 and 1: it predicts values below 0 and above 1, which cannot be interpreted as probabilities


![](LogisticCurve.png)
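To make the problem with a straight line concrete, here is a minimal sketch that fits simple least squares to a 0/1 response. The eight data points are made up for illustration and are not taken from the Default data set.

```python
# Toy data (made up for illustration): balance in dollars, default coded 0/1.
balance = [200, 500, 800, 1100, 1400, 1700, 2000, 2300]
default = [0, 0, 0, 0, 0, 1, 1, 1]

n = len(balance)
mean_x = sum(balance) / n
mean_y = sum(default) / n

# Closed-form simple least squares: slope = Sxy / Sxx.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(balance, default))
sxx = sum((x - mean_x) ** 2 for x in balance)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def linear_fit(x):
    """Fitted value of the straight line at balance x."""
    return intercept + slope * x

low = linear_fit(0)      # fitted value for a zero balance
high = linear_fit(3000)  # fitted value for a very large balance
print(low, high)
```

On this toy data the fitted value is negative at a balance of 0 and greater than 1 at a balance of 3000, so neither can be read as a probability.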

### Logistic Regression

* Models the probability that the response Y belongs to a particular category
* Sigmoid (S-shaped) function with asymptotes at 0 and 1
    - logistic function
  

$$p(X) = Pr(Y = 1|X) = \frac{e^{\beta_0 + \beta_1X}} {1+e^{\beta_0 + \beta_1X}}$$

 
#### Logit (Log odds)
 
* Odds: Probability of winning divided by probability of not winning, $\frac{P(\text{Win})}{P(\text{Not Win})}$
    - If the probability of winning is 0.8, the odds are 0.8/(1 - 0.8) = 4
    - 4 to 1 odds: if the odds are 4 to 1, you will win 4 times out of 5
    
* Logit is the inverse of the logistic function: $logit(p) =  log(\frac{p}{1-p})$
    
* Substitute for P(X) from above:
 
$$\frac{p(X)}{1 - p(X)} = \frac{\frac{e^{\beta_0 + \beta_1X}} {1+e^{\beta_0 + \beta_1X}}} {1 - \frac{e^{\beta_0 + \beta_1X}} {1+e^{\beta_0 + \beta_1X}}}
= \frac{\frac{e^{\beta_0 + \beta_1X}} {1+e^{\beta_0 + \beta_1X}}} {\frac{1+e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}} - \frac{e^{\beta_0 + \beta_1X}} {1+e^{\beta_0 + \beta_1X}}}={e^{\beta_0 + \beta_1X}}$$
  
* Log odds: Take log of both sides:
 
$$log(\frac{p(X)}{1 - p(X)}) = \beta_0 + \beta_1 X$$
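The relationship between probability, odds and log odds can be checked numerically. The sketch below uses only the standard library; the function names `logistic` and `logit` are chosen here for illustration. It writes the logistic function as $1/(1+e^{-z})$, which is algebraically the same as $e^{z}/(1+e^{z})$.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) to its log odds."""
    return math.log(p / (1.0 - p))

p = 0.8
odds = p / (1 - p)          # about 4, i.e. 4 to 1
log_odds = logit(p)         # log of the odds
p_back = logistic(log_odds) # applying the inverse recovers p

print(odds, log_odds, p_back)
```

Because the two functions are inverses, `logistic(logit(p))` returns `p` up to floating point rounding.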
 
 

### Generalized Linear Model (GLM)

* Response is 0 or 1, so it follows a binomial distribution, not a normal one
* Logit is a transformation of the mean of the response, p(X)
* To generalize linear models:
    - Allow response distribution to be a member of the exponential family
    - Allow a transformation of the mean of the response (link function)
* glm function
    - Specify a family: 
    - Logistic regression: family = "binomial", link = logit
    - Poisson regression: family = "poisson", link = log
    - Linear regression: family = "gaussian", link = identity
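As a rough illustration of what a link function does, the sketch below pairs each family with its link and inverse link. The `LINKS` table is a hypothetical helper written for this note, not part of R or of any Python library; its keys simply mirror the R family names (R calls the normal family `gaussian`).

```python
import math

# Hypothetical lookup table: family name -> (link, inverse link).
LINKS = {
    "binomial": (lambda mu: math.log(mu / (1.0 - mu)),       # logit
                 lambda eta: 1.0 / (1.0 + math.exp(-eta))),  # logistic
    "poisson":  (math.log, math.exp),                        # log / exp
    "gaussian": (lambda mu: mu, lambda eta: eta),            # identity
}

eta = 0.7  # an arbitrary value of the linear predictor beta_0 + beta_1 * X

# The inverse link maps the linear predictor to the mean of the response.
means = {family: inverse(eta) for family, (link, inverse) in LINKS.items()}
print(means)
```

The same linear predictor gives a probability for the binomial family, a positive rate for the Poisson family, and is left unchanged for the gaussian family.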
    

### Estimating the coefficients

- Maximum Likelihood 
    - We want to estimate the coefficients such that when plugged into p(x), we get a number close to 1 for those individuals who defaulted and close to 0 for those who didn't
- Maximize the likelihood function L($\beta_0$,$\beta_1$)
    
$$L(\beta_0,\beta_1) = \prod_{i:y_i=1}p(x_i)\prod_{j:y_j=0}(1 - p(x_j))$$
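The likelihood is usually handled on the log scale, where the product becomes a sum. The sketch below evaluates the log-likelihood for two candidate coefficient pairs on a small made-up data set (balance in thousands of dollars); the data and the candidate values are assumptions chosen for illustration.

```python
import math

def p_default(x, b0, b1):
    """Logistic model p(x) for a single predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log p(x_i) over defaulters and log(1 - p(x_j)) over the rest."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = p_default(x, b0, b1)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# Toy data (made up): balance in thousands of dollars, default coded 0/1.
xs = [0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0, 2.3]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

flat = log_likelihood(0.0, 0.0, xs, ys)     # p = 0.5 for everyone
better = log_likelihood(-4.0, 3.0, xs, ys)  # p rises with balance

print(flat, better)
```

The second pair assigns high probability to the defaulters and low probability to the others, so its log-likelihood is larger (closer to zero). Maximum likelihood picks the pair for which this value is largest.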

### Logistic Regression Model

In [2]:

glm.fit1=glm(default~balance,data=Default,family=binomial)
coef(glm.fit1)
summary(glm.fit1)



Call:
glm(formula = default ~ balance, family = binomial, data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2697  -0.1465  -0.0589  -0.0221   3.7589  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1596.5  on 9998  degrees of freedom
AIC: 1600.5

Number of Fisher Scoring iterations: 8


#### Coefficients
 
* One unit increase in balance results in an increase in the log odds of default by 0.0055 units
* Std. Error measures the accuracy of the coefficient estimates
* z statistic plays same role as t statistic in regression $z = \frac{\hat{\beta_1}}{SE(\hat{\beta_1})}$

* Large z value is strong evidence against the null hypothesis that $\beta_1$ = 0
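The z value and the odds interpretation can be reproduced from the printed estimates. This is a small sketch using the rounded numbers in the summary output above, so the last digit may differ slightly from what R reports.

```python
import math

# Estimate and standard error for balance, copied from summary(glm.fit1).
beta1 = 5.499e-03
se_beta1 = 2.204e-04

z = beta1 / se_beta1  # z statistic for the null hypothesis beta_1 = 0

# Two-sided p-value under a standard normal reference distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

# A coefficient on the log-odds scale multiplies the odds when exponentiated.
odds_factor_per_dollar = math.exp(beta1)
odds_factor_per_100 = math.exp(100 * beta1)

print(round(z, 2), p_value, odds_factor_per_dollar, odds_factor_per_100)
```

Exponentiating shows the practical size of the effect: each additional 100 dollars of balance multiplies the odds of default by roughly 1.73.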

 
#### Deviance

* A measure of goodness (badness) of fit in a GLM
* Higher numbers imply worse fit
* Null deviance: model with just the intercept
* Residual deviance: full model
    * Much better fit including balance
  
#### Akaike Information Criterion (AIC)

* Assesses the quality of the model relative to other models
* Select model with lowest AIC
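For a 0/1 response the residual deviance equals minus twice the maximized log-likelihood, so the reported AIC is the residual deviance plus twice the number of coefficients. The sketch below checks this against the printed values and compares the three models fitted in these notes; all numbers are copied from the R summaries.

```python
# Numbers copied from summary(glm.fit1).
null_deviance = 2920.6      # intercept-only model
residual_deviance = 1596.5  # model with balance
n_coefficients = 2          # intercept and balance

deviance_drop = null_deviance - residual_deviance
aic = residual_deviance + 2 * n_coefficients

# AIC values reported for the three models in these notes.
aics = {
    "default ~ balance": 1600.5,
    "default ~ student": 2912.7,
    "default ~ balance + income + student": 1579.5,
}
best = min(aics, key=aics.get)  # model with the lowest AIC

print(deviance_drop, aic, best)
```

Adding balance lowers the deviance by about 1324, and the model with all three predictors has the lowest AIC of the three.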
 
#### Fisher Scoring

* Method for solving maximum likelihood problems
* Converged in 8 iterations
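For logistic regression with the logit link, Fisher scoring coincides with Newton-Raphson. The following is a minimal sketch for a single predictor, written in plain Python with the 2x2 linear system solved by hand. It is not R's implementation; the function name `fit_logistic`, the tolerance and the toy data are assumptions made for illustration.

```python
import math

def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Fisher scoring for p(x) = logistic(b0 + b1 * x); returns (b0, b1, iterations)."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0         # score vector (gradient of the log-likelihood)
        i00 = i01 = i11 = 0.0 # Fisher information matrix entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            i00 += w
            i01 += w * x
            i11 += w * x * x
        # Solve information * step = score for the 2x2 case.
        det = i00 * i11 - i01 * i01
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            return b0, b1, iteration
    return b0, b1, max_iter

# Toy data (made up): balance in thousands of dollars, default coded 0/1.
xs = [0.2, 0.5, 0.8, 1.1, 1.4, 1.7, 2.0, 2.3]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat, n_iter = fit_logistic(xs, ys)
print(b0_hat, b1_hat, n_iter)
```

At convergence the score is approximately zero, which is the condition for a maximum of the likelihood. The toy classes overlap, so the maximum is finite.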
 

### Prediction

$\hat{P}(X) = \frac{e^{-10.65 + 0.0055X}} {1+e^{-10.65 + 0.0055X}}$

* If X = 1000, $\hat{P}$(X) = 0.00576
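The same prediction can be worked by hand from the fitted equation. The sketch below uses the rounded coefficients shown in the formula above, so the result agrees with R only to about three significant digits.

```python
import math

# Rounded coefficients from the fitted equation above.
b0 = -10.65
b1 = 0.0055

def predict_default(balance):
    """Estimated probability of default for a given balance."""
    z = b0 + b1 * balance
    return math.exp(z) / (1.0 + math.exp(z))

p_1000 = predict_default(1000.0)
p_2000 = predict_default(2000.0)
print(p_1000, p_2000)
```

A balance of 1000 gives a probability of default below 1%, while a balance of 2000 gives a probability of roughly 0.59.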

In [3]:
predict(glm.fit1,newdata=data.frame(balance=c(1000.00,2000.00)),type="response")

### Logistic Regression Model in Python

In [3]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#%%

# Importing the dataset
dataset = pd.read_csv('Default.csv')
X = dataset.iloc[:, 2].values
y = dataset.iloc[:, 0].values
X = X.reshape(-1,1)  # Classifier wants a 2D array

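# Fitting Logistic Regression on default~balance
# Note: scikit-learn applies L2 regularization by default (C=1.0), so these
# coefficients need not match R's unpenalized glm fit above
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 1234)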
classifier.fit(X, y)
print([classifier.intercept_,classifier.coef_])

[array([-9.46506555]), array([[ 0.00478248]])]


In [4]:
# Prediction
X_pred = np.array((1000.0,2000.0)).reshape(2,1)
y_pred = classifier.predict(X_pred)
print("Predictions: ",y_pred)
y_prob = classifier.predict_proba(X_pred)
print("Prediction Probabilities: ",y_prob)


Predictions:  ['No' 'Yes']
Prediction Probabilities:  [[ 0.99082984  0.00917016]
 [ 0.4750484   0.5249516 ]]


### Using qualitative predictors


In [5]:

glm.fit = glm(default~student,data=Default,family=binomial)
summary(glm.fit)


Call:
glm(formula = default ~ student, family = binomial, data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2970  -0.2970  -0.2434  -0.2434   2.6585  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 2908.7  on 9998  degrees of freedom
AIC: 2912.7

Number of Fisher Scoring iterations: 6


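$$\widehat{\Pr}(\text{default}=\text{Yes}|\text{student}=\text{Yes}) = \frac{e^{-3.5041+0.4049 \times 1}}{1+e^{-3.5041+0.4049 \times 1}} = 0.0431$$
$$\widehat{\Pr}(\text{default}=\text{Yes}|\text{student}=\text{No}) = \frac{e^{-3.5041+0.4049 \times 0}}{1+e^{-3.5041+0.4049 \times 0}} = 0.0292$$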
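With a qualitative predictor, R creates a dummy variable (studentYes) that is 1 for students and 0 otherwise. The two probabilities above can be reproduced with a short sketch that uses the estimates printed in the summary.

```python
import math

# Estimates copied from the summary of default ~ student.
intercept = -3.50413
beta_student = 0.40489

def prob_default(is_student):
    """Estimated probability of default; is_student is the 0/1 dummy variable."""
    z = intercept + beta_student * is_student
    return math.exp(z) / (1.0 + math.exp(z))

p_student = prob_default(1)
p_non_student = prob_default(0)
print(round(p_student, 4), round(p_non_student, 4))
```

In this single-predictor model students have the higher estimated probability of default.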

### Multiple Logistic Regression

$$p(X) = Pr(Y = 1|X) = \frac{e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}} {1+e^{\beta_0 + \beta_1X_1+...+\beta_pX_p}}$$

$$log(\frac{p(X)}{1 - p(X)}) = \beta_0 + \beta_1X_1+...+\beta_pX_p$$

In [6]:

glm.fit = glm(default~balance+income+student,data=Default,family=binomial)
summary(glm.fit)

 


Call:
glm(formula = default ~ balance + income + student, family = binomial, 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8


### Confounding
  
* Coefficient for studentYes is negative, but when we ran the model with student as the only predictor, the coefficient was positive
* A negative coefficient indicates that a student is less likely to default than a non-student; a positive one indicates more likely
* For a fixed balance and income, students are less likely to default;
    however, averaged over balance and income, students are more likely to default (see left figure below)
* Student and balance are correlated.
    - Students tend to have higher balances, and people with higher balances tend to default more
    - Therefore, overall, students tend to default at a higher rate
* If we don't know the balance, students are riskier; if we know the balance, students are less risky (than a non-student with the same balance)
    - Confounding: results obtained using one predictor may differ from results obtained with multiple predictors, especially if there are correlations among predictors
 
![](confounding.png)
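The effect of holding balance and income fixed can be seen by plugging values into the multiple logistic regression fit. The sketch below uses the coefficients from the summary above; the balance of 1500 and income of 40000 are illustrative values chosen here, not taken from the notes.

```python
import math

# Coefficients copied from the summary of default ~ balance + income + student.
b_intercept = -1.087e+01
b_balance = 5.737e-03
b_income = 3.033e-06
b_student = -6.468e-01

def prob_default(balance, income, is_student):
    """Estimated probability of default; is_student is the 0/1 dummy variable."""
    z = (b_intercept + b_balance * balance
         + b_income * income + b_student * is_student)
    return 1.0 / (1.0 + math.exp(-z))

# Same (illustrative) balance and income, differing only in student status.
p_student = prob_default(1500.0, 40000.0, 1)
p_non_student = prob_default(1500.0, 40000.0, 0)
print(round(p_student, 3), round(p_non_student, 3))
```

At the same balance and income the student has the lower estimated probability of default (about 0.058 against about 0.105), which is the opposite of the ordering in the student-only model.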
 
 

In [4]:
library(caTools)
set.seed(1234)
df <- read.csv("lr_example.csv")

split = sample.split(df$Species, SplitRatio = 0.8)
train = subset(df, split == TRUE)
test = subset(df, split == FALSE)

glm.fit = glm(Species~Petal.Length+Petal.Width,data=train,family=binomial)
glm.sum = summary(glm.fit)
glm.sum
contrasts(train$Species)
predict.preds = predict(glm.fit,test,type="response")
predict.preds = ifelse (predict.preds > .5, "virginica", "versicolor")
test.preds = test$Species
test_error = mean(predict.preds != test.preds)
print("Test Error:")
test_error
cm = table(test.preds,predict.preds)
print("Confusion Matrix:")
cm
test_error = 1 - (sum(diag(cm))/sum(cm))
print("Test Error:")
test_error


Call:
glm(formula = Species ~ Petal.Length + Petal.Width, family = binomial, 
    data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.82880  -0.04096  -0.00014   0.01585   1.77240  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)   -45.483     15.808  -2.877  0.00401 **
Petal.Length    5.810      2.575   2.256  0.02406 * 
Petal.Width    10.590      4.446   2.382  0.01722 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110.904  on 79  degrees of freedom
Residual deviance:  16.499  on 77  degrees of freedom
AIC: 22.499

Number of Fisher Scoring iterations: 9


           virginica
versicolor         0
virginica          1


[1] "Test Error:"


[1] "Confusion Matrix:"


            predict.preds
test.preds   versicolor virginica
  versicolor          9         1
  virginica           0        10

[1] "Test Error:"


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('lr_example.csv')
X = dataset.iloc[:, [3, 4]].values
y = dataset.iloc[:, 5].values

# Split the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)

# Fit Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
lr_classifier = LogisticRegression(random_state = 1234)
lr_classifier.fit(X_train, y_train)
# Predict the Test set results
y_pred = lr_classifier.predict(X_test)

# Make the Confusion Matrix and calculate the test error rate
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
test_error = 1 - (np.trace(cm)/np.sum(cm))
test_error

[[10  0]
 [ 1  9]]




0.050000000000000044
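The test error can also be read off the confusion matrix by hand. The sketch below uses the matrix printed above, in plain Python; rows are true classes and columns are predicted classes, following the `confusion_matrix(y_test, y_pred)` convention.

```python
# Confusion matrix copied from the output above.
cm = [[10, 0],
      [1, 9]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
wrong = total - correct                          # off-diagonal entries

error_from_counts = wrong / total
error_from_accuracy = 1 - correct / total

# Per-class recall: share of each true class that was predicted correctly.
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(error_from_counts, error_from_accuracy, recall)
```

One observation out of 20 is misclassified, a test error of 0.05. Computing it as one minus the accuracy gives the same value up to floating point rounding, which is why the output above shows 0.050000000000000044.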

#### Splitting data by indexing
 
* In R the split function splits data into groups by a factor
* When you don't have a factor, split using indexing 

In [2]:
n = floor(.8 * nrow(airquality))
indxs = sample(1:nrow(airquality), n) # randomly choose n of the row indices
train = airquality[indxs,]
test = airquality[-indxs,]
nrow(train)
nrow(test)
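The same index-based split can be sketched in Python with the standard library. The row count of 153 matches R's built-in airquality data set; the seed value is arbitrary and the variable names are chosen here for illustration.

```python
import random

n_rows = 153                 # number of rows in airquality
n_train = int(0.8 * n_rows)  # 80% of the rows, rounded down

rng = random.Random(1234)    # seeded generator so the split is repeatable

# Draw n_train distinct row indices at random from all rows.
train_idx = rng.sample(range(n_rows), n_train)

# Every index not drawn for training goes to the test set.
train_set = set(train_idx)
test_idx = [i for i in range(n_rows) if i not in train_set]

print(len(train_idx), len(test_idx))
```

Sampling from all row indices, rather than from the first 80% of them, ensures that every row has the same chance of landing in the test set.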

"Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R"  (Springer, 2013) with permission from the authors: G. James, D. Witten,  T. Hastie and R. Tibshirani " 