Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
331 lines (228 sloc) 9.31 KB
title author date output
STAT/MATH 495: Problem Set 07
Andrew Kim & Albert Y. Kim
2017-10-24
html_document
toc toc_float toc_depth collapsed smooth_scroll df_print
true
true
2
false
false
kable
knitr::opts_chunk$set(
  echo = TRUE, fig.width=8, fig.height=4.5, message=FALSE, warning = FALSE
  )
set.seed(76)

# Load packages
library(tidyverse)
library(broom)
library(ROCR)
library(scales)
library(knitr)
library(gridExtra)

train <- read_csv("data/cs-training.csv") %>% 
  rename(Id = X1)
test <- read_csv("data/cs-test.csv") %>% 
  rename(Id = X1)
submission <- read_csv("data/sampleEntry.csv")

Information on the competition can be found here.

Collaboration

Please indicate who you collaborated with on this assignment:

Build binary classifier

Build the binary classifier based on a single predictor variable: DebtRatio, age, or MonthlyIncome. Justify this choice.

Note: The score used by this Kaggle competition is Area under the ROC curve.

Using debt ratio

Model fit

model_formula <- as.formula(SeriousDlqin2yrs~DebtRatio)
model_logistic <- glm(model_formula, data=train, family="binomial")

Get fitted values on training set and predicted values on test set:

fitted_values <- model_logistic %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))
predictions <- model_logistic %>% 
  broom::augment(newdata=test) %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))

Plot

  • The training observations with black points. Recall each individual either was delinquent or not, thus we only observe 0 & 1's along the y-axis.
  • The fitted logistic regression curve with a blue line
  • The predicted values with red points
plot_debt_ratio <- ggplot(NULL) +
  geom_point(data=fitted_values, aes(x=DebtRatio, y=SeriousDlqin2yrs)) +
  geom_line(data=fitted_values, aes(x=DebtRatio, y=p_hat), col="blue") +
  geom_point(data=predictions, aes(x=DebtRatio, y=p_hat), col="red") +
  labs(x="Debt Ratio (percentage)", y="p_hat", title="Fitted probability of delinquency vs debt ratio")
plot_debt_ratio

This is a little hard to see because of:

  1. the overplotting for both $y=0$ and $y=1$
  2. the right-skew in the debt ratio predictor.

To rectify this let's

  • Put a little vertical jitter in the y-values to break the overplotting
  • narrow the range of the x-axis. We observe that overall:
plot_debt_ratio <- ggplot(NULL) +
  geom_jitter(data=fitted_values, aes(x=DebtRatio, y=SeriousDlqin2yrs), height=0.05, alpha=0.05) +
  geom_line(data=fitted_values, aes(x=DebtRatio, y=p_hat), col="blue") +
  geom_point(data=predictions, aes(x=DebtRatio, y=p_hat), col="red") +
  labs(x="Debt Ratio (percentage)", y="p_hat", title="Fitted probability of delinquency vs debt ratio") +
  coord_cartesian(xlim=c(0, 5*10^4))
plot_debt_ratio

We observe:

  • There is generally less than 10% chance anyone will be delinquent no matter the debt ratio
  • As the debt ratio increases, the probability of delinquency decreases. This makes sense since the debt ratio variable is:

$$ \frac{\mbox{Monthly debt payments, alimony, living costs}}{\mbox{monthy gross income}} $$

Score

pred_debt_ratio <- prediction(predictions = fitted_values$p_hat, labels = fitted_values$SeriousDlqin2yrs)
perf_debt_ratio <- performance(pred_debt_ratio, "tpr","fpr")
auc_debt_ratio <- as.numeric(performance(pred_debt_ratio,"auc")@y.values)
auc_debt_ratio

Using age

Model fit

model_formula <- as.formula(SeriousDlqin2yrs~age)
model_logistic <- glm(model_formula, data=train, family="binomial")

Get fitted values on training set and predicted values on test set:

fitted_values <- model_logistic %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))
predictions <- model_logistic %>% 
  broom::augment(newdata=test) %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))

Plot

  • The training observations with black points (with a little jitter along y-axis to break overplotting)
  • The fitted logistic regression curve with a blue line
  • The predicted values with red points
plot_age <- ggplot(NULL) +
  geom_jitter(data=fitted_values, aes(x=age, y=SeriousDlqin2yrs), height=0.05, alpha=0.05) +
  geom_line(data=fitted_values, aes(x=age, y=p_hat), col="blue") +
  geom_point(data=predictions, aes(x=age, y=p_hat), col="red") +
  labs(x="Age (in years)", y="p_hat", title="Fitted probability of delinquency vs age")
plot_age
  • There is generally less than 10% chance anyone will be delinquent no matter the age again
  • There is a negative relationship between fitted probability of delinquency and age
  • However there seems to be a bigger spread in the predicted values (red points) in this case than with debt ratio

Score

pred_age <- prediction(predictions = fitted_values$p_hat, labels = fitted_values$SeriousDlqin2yrs)
perf_age <- performance(pred_age, "tpr","fpr")
auc_age <- as.numeric(performance(pred_age,"auc")@y.values)
auc_age

Using monthly income

Model fit

model_formula <- as.formula(SeriousDlqin2yrs~MonthlyIncome)
model_logistic <- glm(model_formula, data=train, family="binomial")

Get fitted values on training set and predicted values on test set:

fitted_values <- model_logistic %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))
predictions <- model_logistic %>% 
  broom::augment(newdata=test) %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))

Plot

  • The training observations with black points (with a little jitter along y-axis to break overplotting)
  • The fitted logistic regression curve with a blue line
  • The predicted values with red points
plot_monthly_income <- ggplot(NULL) +
  geom_jitter(data=fitted_values, aes(x=MonthlyIncome, y=SeriousDlqin2yrs), height=0.05, alpha=0.05) +
  geom_line(data=fitted_values, aes(x=MonthlyIncome, y=p_hat), col="blue") +
  geom_point(data=predictions, aes(x=MonthlyIncome, y=p_hat), col="red") +
  labs(x="Monthly income (in USD)", y="p_hat", title="Fitted probability of delinquency vs monthly income")
plot_monthly_income

This is a little hard to see because of the right-skew in the monthly income predictor. Let's narrow the range of the x-axis. We observe that overall:

  • There is generally less than 10% chance anyone will be delinquent no matter the monthly income
  • As the monthly income increases, the probability of delinquency decreases.
plot_monthly_income <- plot_monthly_income +
  coord_cartesian(xlim=c(0, 0.75*10^6))
plot_monthly_income

Score

pred_monthly_income <- prediction(predictions = fitted_values$p_hat, labels = fitted_values$SeriousDlqin2yrs)
perf_monthly_income <- performance(pred_monthly_income, "tpr","fpr")
auc_monthly_income <- as.numeric(performance(pred_monthly_income,"auc")@y.values)
auc_monthly_income

Comparison

Let's compare AUC's. It seems from best to worse its: age, monthly income, and debt ratio.

data_frame(
  predictor = c("Debt ratio", "Age", "Monthly income"),
  AUC = c(auc_debt_ratio, auc_age, auc_monthly_income)
  ) %>% 
  kable(digits = 3)

Let's compare plots. Let's compare the spread of the fitted probabilities along both the $x$ and $y$ axes, or the amount of discrimination each predictor variable provides. Compare these to the AUC values above.

grid.arrange(plot_debt_ratio, plot_monthly_income, plot_age, ncol=2)

ROC curve

Based on the ultimate classifier you choose, plot a corresponding ROC curve.

Since age was best, let's go with that:

plot(perf_age, main=paste("Area Under the Curve =", round(auc_age, 3)))
abline(c(0, 1), lty=2)

ROC curve for random guessing

Instead of using any predictor information as you did above, switch your predictions to random guesses and plot the resulting ROC curve.

There are many ways of doing random guess, as long as your guesses are not informed by any of the predictor variables:

Method 1: Shuffle the predictions from an earlier model fit using sample().

pred_random <- prediction(predictions = sample(fitted_values$p_hat), labels = fitted_values$SeriousDlqin2yrs)
perf_random <- performance(pred_random, "tpr","fpr")
auc_random <- as.numeric(performance(pred_random,"auc")@y.values)

plot(perf_random, main=paste("Area Under the Curve =", round(auc_random, 3)))
abline(c(0, 1), lty=2)

Method 2: Replace the fitted values with uniform probabilities between 0 and 1:

pred_random <- prediction(predictions = runif(n=nrow(fitted_values), min=0, max=1), labels = fitted_values$SeriousDlqin2yrs)
perf_random <- performance(pred_random, "tpr","fpr")
auc_random <- as.numeric(performance(pred_random,"auc")@y.values)

plot(perf_random, main=paste("Area Under the Curve =", round(auc_random, 3)))
abline(c(0, 1), lty=2)

Method 3: Replace the fitted values with just any scalar between 0 and 1:

pred_random <- prediction(predictions = rep(1, times=nrow(fitted_values)), labels = fitted_values$SeriousDlqin2yrs)
perf_random <- performance(pred_random, "tpr","fpr")
auc_random <- as.numeric(performance(pred_random,"auc")@y.values)

plot(perf_random, main=paste("Area Under the Curve =", round(auc_random, 3)))
abline(c(0, 1), lty=2)