---
title: "Logistic Model Transfer"
author: "Nina Zumel"
date: "2024-07-08"
output: github_document
---
In this notebook, we'll show an example of fitting a regularized logistic model in R using the `glmnet` package, then transferring the model to another generalized linear modeling framework, in this example `glm`. A similar procedure can be used to transfer an R `glmnet` model to a framework in another language, for example `scikit-learn` or pytorch in Python, or `linfa` in Rust.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
First, we'll simulate some data. This example is taken from the [EPIB-621 course notes](http://www.medicine.mcgill.ca/epidemiology/Joseph/courses/EPIB-621/logconfound.pdf) by Lawrence Joseph at McGill University.
The data is designed to demonstrate the effects of collinear variables when fitting logistic regression models.
```{r}
library(ggplot2)
set.seed(20240707)
# Create the first independent variable (round to one decimal place)
x1 <- round(rnorm(400, mean = 0, sd = 1), 1)
# Create the second independent variable (round to one decimal place)
x2 <- round(rnorm(400, mean = 4, sd = 2), 1)
# Now create a third independent variable that is an exact linear function
# of the first two variables
x3 <- 3*x1 + 2*x2
# Create a binary outcome variable that depends on all three variables.
# Note that the success probability of the binomial is an inverse-logit
# function of the linear predictor
y <- rbinom(400, 1, exp(x1 + 2*x2 - 3*x3) / (1 + exp(x1 + 2*x2 - 3*x3)))
# Put all variables into a data frame
df <- data.frame(x1 = x1, x2 = x2, x3 = x3, y = y)
# mark some of it as holdout data, for didactic reasons
df$gp = ifelse(runif(400) < 0.1, 'holdout', 'train')
dftrain = subset(df, gp == 'train')
```
Next, we'll fit an ordinary (unregularized) logistic regression model to see the pathology.
```{r}
# try to fit a logistic regression
mod1 = glm(y ~ x1 + x2 + x3, data=dftrain, family = binomial)
summary(mod1)
```
Note that `glm` is unable to assign a coefficient to `x3`, because it is a linear combination of `x1` and `x2`.
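To see the collinearity directly, here is a quick, optional check (using base R's `model.matrix` and `qr`) that the design matrix is rank-deficient:
```{r}
# the design matrix has 4 columns (intercept, x1, x2, x3),
# but only rank 3, because x3 = 3*x1 + 2*x2 exactly
mm = model.matrix(y ~ x1 + x2 + x3, data = dftrain)
ncol(mm)
qr(mm)$rank
```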
This singularity is clearly an issue when the data scientist is interested in correctly inferring the parameter values of the model (in other words, in estimating the effect of a variable on the outcome). In situations where predicting outcomes is the primary goal, such a model *may* still give correct answers, but it can also be quite sensitive to any deviations from the training data in the parameter distributions. Logistic models fit to nearly collinear variables can also be unstable.
If you are more interested in stable prediction of outcomes than in correct inference of the parameters, regularizing the regression is a recommended procedure.
So we will try fitting a logistic model to this data with L2 regularization (AKA ridge regression), to manage the collinearity. We'll use `cv.glmnet` to automatically pick a good value for the regularization parameter, and return the appropriate model.
```{r}
library(glmnet)
# glmnet logistic regression takes a numeric matrix of inputs; here we use a factor for the binary outcome
dmat = as.matrix(dftrain[, c('x1', 'x2', 'x3' )])
outcome = as.factor(dftrain$y)
# alpha = 0 : ridge regression
modglmn = cv.glmnet(dmat, outcome, alpha = 0, family=binomial)
modglmn
coef(modglmn)
```
Let's compare the predictions of the two models on all the data (including holdout).
```{r}
dmat_all = as.matrix(df[, c('x1', 'x2', 'x3' )])
# get the predicted probabilities from both models
cframe = data.frame(
mod1 = predict(mod1, newdata=df, type='response'),
glmn = as.numeric(predict(modglmn, dmat_all, type='response')),
group = df$gp
)
palette = c(train='darkgrey', holdout='#d95f02')
# unregularized model goes to 0/1 much faster
ggplot(cframe, aes(x=glmn, y=mod1, color=group)) +
geom_point() + geom_abline(color='darkblue') +
scale_color_manual(values=palette) +
xlab("regularized model") + ylab("unregularized model") +
ggtitle('Unregularized vs regularized model')
```
Note that the unregularized model returns much larger link values, leading the probabilities to saturate at nearly 0 or nearly 1 much sooner than the regularized model. Regularization produces what you might call a "more conservative" model.
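If you want to see this on the link (log-odds) scale directly, here is an optional check, reusing `mod1`, `modglmn`, and `dmat_all` from above:
```{r}
# compare the two models on the link (log-odds) scale;
# the unregularized model's link values have a much wider range
link_compare = data.frame(
  mod1_link = predict(mod1, newdata = df, type = 'link'),
  glmn_link = as.numeric(predict(modglmn, dmat_all, type = 'link'))
)
summary(link_compare)
```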
## Transfer the `glmnet` model to the `glm` framework
If the resulting model is good enough for your needs, and you are willing to work with your data as matrices and factors, then you're done. But you might want to keep the data as a data frame for other reasons (for example, compatibility with the rest of the workflow). In this case, you might find it useful to port the `glmnet` model to `glm` for ease of use.
As discussed in our main article, this only requires evaluating the `glmnet` model on a full rank set of rows. Since all the variables here are numeric, we only need 4 independent rows: one for each variable, plus the intercept.
Note that in the representation below, the intercept column is implicit, so you can think of `evalframe` having an invisible intercept column `c(1, 1, 1, 1)`, making `evalframe` a 4 by 4 full rank matrix.
```{r}
# you can build this frame any way you like; this form just happens to be readable
evalframe = wrapr::build_frame(
'x1', 'x2', 'x3' |
0, 0, 0 | # for intercept
1, 0, 0 |
0, 1, 0 |
0, 0, 1
)
# attach predictions from glmnet model
emat = as.matrix(evalframe)
evalframe$y = as.numeric(predict(modglmn, emat, type='response'))
evalframe
```
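As an optional sanity check, we can confirm that the evaluation matrix, with the implicit intercept column made explicit, really is 4 by 4 and full rank:
```{r}
# optional check: the intercept-augmented evaluation matrix has rank 4
qr(cbind(intercept = 1, emat))$rank
```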
In the code that builds `evalframe`, I used `wrapr::build_frame` to specify the data frame in a row-wise fashion. This is of course completely optional; I just didn't want to transpose the frame in my head in order to write it down in the default column-wise manner.
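For comparison, a plain column-wise construction of the same evaluation points in base R would look like this (a hypothetical alternative; the notebook itself uses the `build_frame` version above):
```{r}
# the same evaluation points, specified column by column
evalframe_alt = data.frame(
  x1 = c(0, 1, 0, 0),
  x2 = c(0, 0, 1, 0),
  x3 = c(0, 0, 0, 1)
)
evalframe_alt
```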
Now, let's fit a `glm` model to the evaluation frame. In R, we can use the predicted probabilities as outputs directly; the `glm` function will complain, but the answer will be correct.
```{r}
# fit the model using glm, family binomial.
# It will complain, but the answer will be correct
modport = glm(y ~ x1 + x2 + x3, data=evalframe, family=binomial)
summary(modport)
# compare coefficients
data.frame(
ported_model = coef(modport),
original_model = as.numeric(coef(modglmn))
)
```
The coefficients are the same. We can check that the predictions are the same, even on new data.
```{r}
# compare the predictions on all the data (including holdout)
cframe = data.frame(
modport = predict(modport, newdata=df, type='response'),
glmn = as.numeric(predict(modglmn, dmat_all, type='response')),
group = df$gp
)
ggplot(cframe, aes(x=glmn, y=modport, color=group)) +
geom_point() + geom_abline(color='gray') +
scale_color_manual(values=palette) +
xlab("Original regularized model") + ylab("Ported model") +
ggtitle('Ported vs original regularized model')
```
### Porting to `scikit-learn`
The above procedure should work fine when porting the model to a framework like pytorch or tensorflow. However, there are caveats when porting to scikit-learn.
First of all: scikit-learn's `LogisticRegression` regularizes by default, using ridge regression with a fairly large default regularization (`C = 1.0`, where `C` is the inverse of the regularization strength; compare this to `cv.glmnet`, which, for our problem, chose regularization `lambda` = 0.02621). So in our example scenario, you'd probably prefer to fit the model directly in scikit-learn anyway. But there are other situations where you might want to fit a logistic model outside of scikit-learn and then port it. For example, you might prefer to fit using `statsmodels` in Python, or you may have priors on the coefficients that you want to enforce using Bayesian regression (either in Python or in R).
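For reference, the regularization values chosen by cross-validation can be read straight off the `cv.glmnet` object (note that `coef` and `predict` on a `cv.glmnet` object use `lambda.1se` by default):
```{r}
# the regularization values selected by cross-validation
c(lambda.min = modglmn$lambda.min, lambda.1se = modglmn$lambda.1se)
```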
In any case, to port your model to scikit-learn's `LogisticRegression`: first, set `penalty=None` to turn off the regularization; second, augment your evaluation frame into a "weighted logistic regression" representation. Here we show such a representation, in R.
```{r}
# weights for positive examples
posframe = evalframe
posframe$wt = posframe$y # probability of success
posframe$y = 1
# weights for negative examples
negframe = posframe
negframe$wt = 1 - negframe$wt # probability of failure
negframe$y = 0
wtframe = data.table::rbindlist(list(posframe, negframe)) |> as.data.frame()
wtframe
```
The new evaluation frame has two rows for every combination of features. One row has outcome 1, and weight $p$, where $p$ is the predicted probability of success from the original model. The other row has outcome 0, and weight $1-p$, the predicted probability of failure. The input data, outcome vector, and weight vector can all be passed to the `fit()` call of `LogisticRegression`.
We can do the same thing in R, with `glm`.
```{r}
# it still complains, but the answer is still correct
modport2 = glm(y ~ x1 + x2 + x3, data=wtframe, weights = wtframe$wt, family=binomial)
summary(modport2)
# compare coefficients
data.frame(
port_using_weights = coef(modport2),
port_using_probs = coef(modport),
original_model = as.numeric(coef(modglmn))
)
```
Both ported models have the same coefficients as the original model that was fit with `cv.glmnet`.
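As a final optional check, we can confirm that the weighted-representation port also reproduces the original regularized model's predictions on the full data set:
```{r}
# maximum absolute difference in predicted probabilities; should be ~0
max(abs(
  predict(modport2, newdata = df, type = 'response') -
    as.numeric(predict(modglmn, dmat_all, type = 'response'))
))
```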