---
title: "Example gapply task"
author: "Nina Zumel"
date: "December 5, 2016"
output:
  md_document:
    variant: markdown_github
---
# Simple Example of using `gapply`
```{r setup, message=FALSE, warning=FALSE}
library(tidyr)
library(dplyr)
library(replyr)
library(sigr)
```
The task: evaluate how several different models perform on the same problem, in terms of deviance explained, accuracy, precision, and recall.
The classification task: identify credit accounts that will default in the next month (data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)). The target prevalence is about 22%.
Read in the data.
```{r readdata}
ccdata = read.table("creditCardDefault.txt", header=TRUE, sep="\t")
```
Categoricals are coded as integers; convert them to strings (for clarity).
```{r convert}
# map coded factor variables to strings
ccdata$SEX = with(ccdata, ifelse(SEX==1, 'Male', 'Female'))
# I'm not sure of the exact education levels beyond grad school,
# university, and high school; the documentation of the data set is
# incomplete. I'm extrapolating that a higher numeric code means
# less education.
# Original coding: 0:6
edlevels = c("unknown", "grad school", "university", "high school",
             "level4", "level5", "level6")
# Again, the documentation is incomplete. Original coding: 0:3
mlevels = c("unknown", "married", "single", "other")
ccdata$EDUCATION = edlevels[ccdata$EDUCATION + 1]
ccdata$MARRIAGE = mlevels[ccdata$MARRIAGE + 1]
```
Define the variables and the outcome column, then split the data into train and test sets.
```{r split}
ccdata$defaults = ccdata$default.payment.next.month==1
outcome = "defaults"
varlist = setdiff(colnames(ccdata), c("defaults", "default.payment.next.month", "ID"))
#
# split into train and test
#
set.seed(252627)
N = nrow(ccdata)
isTrain = runif(N) < 0.8
train = ccdata[isTrain,]
test = ccdata[!isTrain,]
```
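As a quick sanity check (not part of the original write-up), we can confirm the split sizes and that each partition shows roughly the 22% target prevalence noted above.
```{r checksplit}
# check split sizes and outcome prevalence in each partition
nrow(train)
nrow(test)
mean(train$defaults)
mean(test$defaults)
```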
Make a list of the algorithms to try.
```{r message=FALSE, warning=FALSE}
# load the file of model fitting and prediction functions
source("modelfitting.R")
algolist = list(glm=glm_predictor, gam=gam_predictor, rangerRF=ranger_predictor)
```
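`modelfitting.R` isn't reproduced here, so below is a minimal sketch of the interface the rest of this example assumes: each `*_predictor` entry takes the outcome name, variable list, and training frame, and returns a function that maps a data frame to a vector of predicted probabilities. The name `glm_predictor_sketch` is illustrative, not from the actual file.
```{r predictorsketch, eval=FALSE}
# Illustrative sketch only; the real definitions live in modelfitting.R.
glm_predictor_sketch = function(outcome, varlist, train) {
  fmla = as.formula(paste(outcome, paste(varlist, collapse=" + "), sep=" ~ "))
  model = glm(fmla, data=train, family=binomial)
  # return a prediction function: data frame -> probability vector
  function(newdata) predict(model, newdata=newdata, type="response")
}
```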
Fit models for each algorithm, and gather together the predictions each model makes on the test set.
```{r message=FALSE, warning=FALSE}
predictors = fit_models(algolist, outcome, varlist, train)
predframe = make_predictions(predictors, test, outcome)
replyr_summary(predframe)[, c("column", "class", "nunique")]
replyr_uniqueValues(predframe, "model")
```
## Model Evaluation
First, some functions to evaluate the predictions:
```{r}
cmat = function(tval, pred) {
  # pick a score threshold: 0.5 if the predictions straddle it,
  # otherwise fall back on the target prevalence
  if(max(pred) > 0.5 && min(pred) <= 0.5) {
    threshold = 0.5
  } else {
    prevalence = mean(tval)
    threshold = prevalence # not ideal, but it will do for now
  }
  table(truth=tval, pred=pred > threshold)
}

accuracy = function(cmat) {
  sum(diag(cmat))/sum(cmat)
}

recall = function(cmat) {
  cmat["TRUE", "TRUE"]/sum(cmat["TRUE", ])
}

precision = function(cmat) {
  cmat["TRUE", "TRUE"]/sum(cmat[, "TRUE"])
}
```
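As a quick sanity check (not in the original example), the metric functions can be exercised on a small synthetic truth/prediction pair:
```{r metriccheck}
# hypothetical toy data, just to exercise the metric functions
truth = c(TRUE, TRUE, FALSE, FALSE, FALSE)
preds = c(0.9, 0.3, 0.2, 0.6, 0.1)
cm = cmat(truth, preds)
cm
accuracy(cm)   # (TP + TN) / total = 3/5
precision(cm)  # TP / predicted positive = 1/2
recall(cm)     # TP / actual positive = 1/2
```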
Write a function `metric_row` that takes a frame of predictions (and truth) for a single model, and returns a frame of all the performance metrics.
```{r}
metric_row = function(subframe,
                      yvar,
                      pred,
                      label) {
  confmat = cmat(subframe[[yvar]], subframe[[pred]])
  devExplained = formatChiSqTest(subframe, pred, yvar)$pseudoR2
  tframe = data.frame(devExplained=devExplained,
                      accuracy=accuracy(confmat),
                      precision=precision(confmat),
                      recall=recall(confmat))
  tframe$model = subframe[[label]][1] # assuming there is only one label
  tframe
}
# Example output of metric_row. In this case it's a 1-row data frame,
# but you could return a multirow frame, for example one row each for
# test and training performance.
metric_row(subset(predframe, model=="glm"), outcome, "pred", "glm")
```
Use `gapply` to apply `metric_row` to the predictions for all of the models, returning a single frame of performance metrics for easy comparison of the different models.
```{r}
#
# compute performance metrics for all the model types
#
replyr::gapply(predframe, 'model',
               function(fi) metric_row(fi, outcome,
                                       'pred',
                                       'model'),
               partitionMethod = 'split') %>%
  dplyr::arrange(desc(devExplained))
```
Using `split` explicitly, for comparison:
```{r}
split(predframe, predframe$model) %>%
  lapply(function(fi) {metric_row(fi, outcome,
                                  'pred',
                                  'model')}) %>%
  dplyr::bind_rows() %>%
  dplyr::arrange(desc(devExplained))
```
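As a final check (not in the original write-up), the two approaches should produce the same table; a sketch of that verification:
```{r agreementcheck}
# verify that gapply and the explicit split/lapply pipeline agree
res_gapply = replyr::gapply(predframe, 'model',
                            function(fi) metric_row(fi, outcome, 'pred', 'model'),
                            partitionMethod = 'split') %>%
  dplyr::arrange(desc(devExplained))
res_split = split(predframe, predframe$model) %>%
  lapply(function(fi) metric_row(fi, outcome, 'pred', 'model')) %>%
  dplyr::bind_rows() %>%
  dplyr::arrange(desc(devExplained))
all.equal(as.data.frame(res_gapply), as.data.frame(res_split))
```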