---
title: "Example gapply task"
author: "Nina Zumel"
date: "December 5, 2016"
output:
  md_document:
    variant: markdown_github
---
# Simple Example of using `gapply`
```{r setup, message=FALSE, warning=FALSE}
library(tidyr)
library(dplyr)
library(replyr)
library(sigr)
```
The task: evaluate how several different models perform on the same problem, in terms of deviance explained, accuracy, precision, and recall.
The classification task: identify credit accounts that will default in the next month (data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)). The target prevalence is about 22%.
Read in the data.
```{r readdata}
ccdata = read.table("creditCardDefault.txt", header=TRUE, sep="\t")
```
Categoricals are coded as integers; convert them to strings (for clarity).
```{r convert}
# map coded factor variables to strings
ccdata$SEX = with(ccdata, ifelse(SEX==1, 'Male', 'Female'))
# I'm not sure of the exact education levels beyond grad school,
# university, and high school; the documentation of the data set is
# incomplete. I'm extrapolating that a higher numeric code means
# less education.
# Original coding: 0:6
edlevels = c("unknown", "grad school", "university", "high school",
             "level4", "level5", "level6")
# Again, the documentation is incomplete. Original coding: 0:3
mlevels = c("unknown", "married", "single", "other")
ccdata$EDUCATION = edlevels[ccdata$EDUCATION + 1]
ccdata$MARRIAGE = mlevels[ccdata$MARRIAGE + 1]
```
Define the variables and the outcome column, then split the data into train and test sets.
```{r split}
ccdata$defaults = ccdata$default.payment.next.month==1
outcome = "defaults"
varlist = setdiff(colnames(ccdata), c("defaults", "default.payment.next.month", "ID"))
#
# split into train and test
#
set.seed(252627)
N = nrow(ccdata)
isTrain = runif(N) < 0.8
train = ccdata[isTrain,]
test = ccdata[!isTrain,]
```
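As a quick sanity check (not part of the original write-up), we can confirm the split sizes and that each partition shows roughly the 22% target prevalence noted above.
```{r checksplit}
# check split sizes and outcome prevalence in each partition
nrow(train)
nrow(test)
mean(train$defaults)
mean(test$defaults)
```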
Make a list of the algorithms to try.
```{r message=FALSE, warning=FALSE}
# load the file of model fitting and prediction functions
source("modelfitting.R")
algolist = list(glm=glm_predictor, gam=gam_predictor, rangerRF=ranger_predictor)
```
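`modelfitting.R` isn't reproduced here, so below is a minimal sketch of the interface the rest of this example assumes: each `*_predictor` entry takes the outcome name, variable list, and training frame, and returns a function that maps a data frame to a vector of predicted probabilities. The name `glm_predictor_sketch` is illustrative, not from the actual file.
```{r predictorsketch, eval=FALSE}
# Illustrative sketch only; the real definitions live in modelfitting.R.
glm_predictor_sketch = function(outcome, varlist, train) {
  fmla = as.formula(paste(outcome, paste(varlist, collapse=" + "), sep=" ~ "))
  model = glm(fmla, data=train, family=binomial)
  # return a prediction function: data frame -> probability vector
  function(newdata) predict(model, newdata=newdata, type="response")
}
```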
Fit models for each algorithm, and gather together the predictions each model makes on the test set.
```{r message=FALSE, warning=FALSE}
predictors = fit_models(algolist, outcome, varlist, train)
predframe = make_predictions(predictors, test, outcome)
replyr_summary(predframe)[, c("column", "class", "nunique")]
replyr_uniqueValues(predframe, "model")
```
## Model Evaluation
First, some functions to evaluate the predictions:
```{r}
cmat = function(tval, pred) {
  # pick a score threshold: 0.5 if the predictions straddle it,
  # otherwise fall back on the target prevalence
  if(max(pred) > 0.5 && min(pred) <= 0.5) {
    threshold = 0.5
  } else {
    prevalence = mean(tval)
    threshold = prevalence # not ideal, but it will do for now
  }
  table(truth=tval, pred=pred > threshold)
}

accuracy = function(cmat) {
  sum(diag(cmat))/sum(cmat)
}

recall = function(cmat) {
  cmat["TRUE", "TRUE"]/sum(cmat["TRUE", ])
}

precision = function(cmat) {
  cmat["TRUE", "TRUE"]/sum(cmat[, "TRUE"])
}
```
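As a quick sanity check (not in the original example), the metric functions can be exercised on a small synthetic truth/prediction pair:
```{r metriccheck}
# hypothetical toy data, just to exercise the metric functions
truth = c(TRUE, TRUE, FALSE, FALSE, FALSE)
preds = c(0.9, 0.3, 0.2, 0.6, 0.1)
cm = cmat(truth, preds)
cm
accuracy(cm)   # (TP + TN) / total = 3/5
precision(cm)  # TP / predicted positive = 1/2
recall(cm)     # TP / actual positive = 1/2
```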
Write a function `metric_row` that takes a frame of predictions (and truth) for a single model, and returns a frame of all the performance metrics.
```{r}
metric_row = function(subframe,
                      yvar,
                      pred,
                      label) {
  confmat = cmat(subframe[[yvar]], subframe[[pred]])
  devExplained = formatChiSqTest(subframe, pred, yvar)$pseudoR2
  tframe = data.frame(devExplained=devExplained,
                      accuracy=accuracy(confmat),
                      precision=precision(confmat),
                      recall=recall(confmat))
  tframe$model = subframe[[label]][1] # assuming there is only one label
  tframe
}
# Example output of metric_row. In this case it's a 1-row data frame,
# but you could return a multirow frame, for example one row each for
# test and training performance.
metric_row(subset(predframe, model=="glm"), outcome, "pred", "glm")
```
Use `gapply` to apply `metric_row` to the predictions for all of the models, returning a single frame of performance metrics for easy comparison of the different models.
```{r}
#
# compute performance metrics for all the model types
#
replyr::gapply(predframe, 'model',
               function(fi) metric_row(fi, outcome,
                                       'pred',
                                       'model'),
               partitionMethod = 'split') %>%
  dplyr::arrange(desc(devExplained))
```
Using `split` explicitly, for comparison:
```{r}
split(predframe, predframe$model) %>%
  lapply(function(fi) {metric_row(fi, outcome,
                                  'pred',
                                  'model')}) %>%
  dplyr::bind_rows() %>%
  dplyr::arrange(desc(devExplained))
```
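As a final check (not in the original write-up), the two approaches should produce the same table; a sketch of that verification:
```{r agreementcheck}
# verify that gapply and the explicit split/lapply pipeline agree
res_gapply = replyr::gapply(predframe, 'model',
                            function(fi) metric_row(fi, outcome, 'pred', 'model'),
                            partitionMethod = 'split') %>%
  dplyr::arrange(desc(devExplained))
res_split = split(predframe, predframe$model) %>%
  lapply(function(fi) metric_row(fi, outcome, 'pred', 'model')) %>%
  dplyr::bind_rows() %>%
  dplyr::arrange(desc(devExplained))
all.equal(as.data.frame(res_gapply), as.data.frame(res_split))
```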