Multinomial and Cox redux; also fixing showarray error #50

Merged
merged 66 commits into from
Dec 5, 2020
Changes from 56 commits
a71d478
add support for survival data
linxihui Jul 19, 2015
73a94b9
add CV for coxnet
linxihui Jul 20, 2015
468f801
add coxDeviance
linxihui Jul 21, 2015
bfcbd1d
support for multi-class classification
linxihui Jul 21, 2015
01841db
replace tab by spaces
linxihui Jul 21, 2015
dff9e40
define GLMNetPathMultinomial as output
linxihui Jul 22, 2015
d3b1615
add another method for coxnet
linxihui Jul 22, 2015
d79809a
code clean up
linxihui Jul 25, 2015
43279e8
add loss/deviance function
linxihui Jul 25, 2015
fcc1155
code clean up
linxihui Jul 25, 2015
aa7af09
Add export CoxNet; change coxDeviance to CoxDeviance
linxihui Jul 25, 2015
2694ed3
add export; change GLMNetPathMultinomial to LogNetPath
linxihui Jul 25, 2015
fcdbf14
change GLMNetCrossValidation.path to Any; include Multinomial, CoxNet
linxihui Jul 25, 2015
2195fd1
big bug: x->X; mean/std does not reduce dim; add offsets to glmnetcv;…
linxihui Jul 25, 2015
aac5379
fixed bug: x->X; define StringVector; fixed bug in CoxDeviance
linxihui Jul 25, 2015
5a62436
add offsets to glmnetcv
linxihui Jul 25, 2015
dcfe72c
add predict function on GLMNetCrossValidation object; code clean up
linxihui Jul 25, 2015
3adf4a5
add glmnet method on AbstractMatrix
linxihui Jul 25, 2015
383d05e
fixed bug in glmnetcv: pass weights/offsets to glmnet
linxihui Jul 25, 2015
3356d7a
fixed bug in glmnetcv: pass weights/offsets to glmnet
linxihui Jul 25, 2015
14d3d98
fix bug in glmnetcv: no offsets for Normal family
linxihui Jul 25, 2015
449e663
add convenient function coef, lambdamin
linxihui Jul 25, 2015
14bbfe8
add to predict: output type
linxihui Jul 25, 2015
88a142b
update for new functionality; added multiclass example
linxihui Jul 25, 2015
4352ada
in glmnetcv, change default weights from ones(legnth(y)) to ones(size…
linxihui Jul 25, 2015
d3cd8e7
add method glmentcv(Matrix, y::StringVector)
linxihui Jul 25, 2015
52bab2f
change outtype to be keyword parameter
linxihui Jul 25, 2015
d339b5d
fix typo
linxihui Jul 25, 2015
5648f49
clean example
linxihui Jul 26, 2015
ae761b9
glmnetcv: fixed bug on offsets when binomial; test passed
linxihui Jul 27, 2015
aee5496
glmnetcv: offsets issue for multivariate
linxihui Jul 27, 2015
3eb0d91
add plot functions to visualize shrinkage path
linxihui Jul 30, 2015
521b454
add plotting example
linxihui Aug 1, 2015
f96a8ed
add Gadfly to dependency
linxihui Aug 1, 2015
f1d0ec8
add test for multinomial and survival outcome
linxihui Aug 1, 2015
1e23a5a
fixed bug in predict, ccall; add offsets to predict, loss and cv
linxihui Aug 1, 2015
c6a5902
add keyword argument offset to predict, loss, glmnetcv; include plot.jl
linxihui Aug 1, 2015
1c00f36
add integerity checking on x, y; re-indent
linxihui Aug 1, 2015
fa312cd
fixed bug in predict, ccall; add offsets to predict, loss and cv
linxihui Aug 1, 2015
f73ae64
re-indent
linxihui Aug 1, 2015
a48e07e
import Gadfly function
linxihui Aug 1, 2015
1e01a8d
replace DataFrames.rep by repmat
linxihui Aug 1, 2015
64c3361
load DataFrames
linxihui Aug 1, 2015
2a180fd
explicitly import predict from DataFrame, which solves that problem t…
linxihui Aug 1, 2015
5624014
add coef method for Multinomial
linxihui Aug 1, 2015
f902852
more tests
linxihui Aug 1, 2015
80d638b
fix bug, clean up, better test coverage
linxihui Aug 1, 2015
b2b41e6
update examples
linxihui Aug 1, 2015
e0a9a50
update examples
linxihui Aug 1, 2015
f9e5ade
update examples
linxihui Aug 1, 2015
ee0f482
replaced PNG by SVG images
linxihui Aug 1, 2015
344de80
change links to svg images
linxihui Aug 1, 2015
7e4c990
Merge remote-tracking branch 'MultinomialCox/master' into HEAD
Dec 1, 2020
d80bd0f
modernizing and dropping gadfly dependencies.
Dec 2, 2020
0c90a11
showarray isn't defined
Dec 2, 2020
1bce066
@testsets are nice
Dec 2, 2020
c51eb94
readme update
Dec 4, 2020
cc66fb2
tweaking glmnetcv for cox to allow determined rng
Dec 4, 2020
3e79249
bye bye plots
Dec 4, 2020
db88aad
bug when using CategoricalArrays for Multinomial
Dec 4, 2020
b0aaf12
parameteric types are nice (just GLMNetPath now)
Dec 4, 2020
8a251bf
this conversion breaks Multinomial regression
Dec 4, 2020
1b0727c
fixing CategoricalArray dependency, revert #18
Dec 4, 2020
db10587
Merge commit '456b9aeabe9e07ea72452461232a162df1f41837' into multinom…
Dec 5, 2020
726b25c
same error, but in the cox version
Dec 5, 2020
0506b46
also in Multinomial, add CategoricalArrays in tests
Dec 5, 2020
1 change: 1 addition & 0 deletions Project.toml
@@ -3,6 +3,7 @@ uuid = "8d5ece8b-de18-5317-b113-243142960cc6"
version = "0.5.2"

[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
124 changes: 99 additions & 25 deletions README.md
@@ -7,7 +7,7 @@

## Quick start

To fit a basic model:
To fit a basic regression model:

```julia
julia> using GLMNet
@@ -17,63 +17,137 @@ julia> y = collect(1:100) + randn(100)*10;
julia> X = [1:100 (1:100)+randn(100)*5 (1:100)+randn(100)*10 (1:100)+randn(100)*20];

julia> path = glmnet(X, y)
Least Squares GLMNet Solution Path (55 solutions for 4 predictors in 163 passes):
55x3 DataFrame:
df pct_dev λ
[1,] 0 0.0 27.1988
[2,] 1 0.154843 24.7825
[3,] 1 0.283396 22.5809
:
[53,] 2 0.911956 0.215546
[54,] 2 0.911966 0.196397
[55,] 2 0.911974 0.17895
Least Squares GLMNet Solution Path (74 solutions for 4 predictors in 832 passes):
74x3 DataFrame
| Row | df | pct_dev | λ |
|-----|----|----------|-----------|
| 1 | 0 | 0.0 | 29.6202 |
| 2 | 1 | 0.148535 | 26.9888 |
| 3 | 1 | 0.271851 | 24.5912 |
| 4 | 1 | 0.37423 | 22.4066 |
| 70 | 4 | 0.882033 | 0.0482735 |
| 71 | 4 | 0.882046 | 0.043985 |
| 72 | 4 | 0.882058 | 0.0400775 |
| 73 | 4 | 0.882067 | 0.0365171 |
| 74 | 4 | 0.882075 | 0.033273 |
```

`path` represents the Lasso or ElasticNet fits for varying values of λ. The intercept for each λ value is stored in `path.a0`. The coefficients for each fit are stored in compressed form in `path.betas`.

```julia
julia> path.betas
4x55 CompressedPredictorMatrix:
0.0 0.083706 0.159976 0.22947 … 0.929157 0.929315
0.0 0.0 0.0 0.0 0.00655753 0.00700862
0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0
4x74 CompressedPredictorMatrix:
0.0 0.091158 0.174218 … 0.913497 0.915593 0.917647
0.0 0.0 0.0 0.128054 0.127805 0.127568
0.0 0.0 0.0 -0.126211 -0.128015 -0.129776
0.0 0.0 0.0 0.108217 0.108254 0.108272
```

This CompressedPredictorMatrix can be indexed as any other AbstractMatrix, or converted to a Matrix using `convert(Matrix, path.betas)`.

One can visualize the path by

```julia
julia> using Gadfly

julia> plot(path, Guide.xlabel("||β||₁"), Guide.ylabel("βᵢ"), x=:norm1)
```
![regression-lasso-path](https://rawgit.com/linxihui/Misc/master/Images/GLMNet.jl/regression_lasso_path.svg)

One can see that the LASSO path is piecewise linear.

To predict the output for each model along the path for a given set of predictors, use `predict`:

```julia
julia> predict(path, [22 22+randn()*5 22+randn()*10 22+randn()*20])
1x55 Array{Float64,2}:
51.7098 49.3242 47.1505 45.1699 … 25.1036 25.0878 25.0736
1x74 Array{Float64,2}:
50.8669 48.2689 45.9017 … 21.9344 21.9377 21.9407
```
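
For the least-squares family, the prediction at each λ is the intercept plus the linear combination of the predictors with that λ's coefficients. The sketch below illustrates this arithmetic with plain arrays; `a0`, `betas`, and `Xnew` are small hypothetical stand-ins for `path.a0`, a dense copy of `path.betas`, and a new predictor matrix, not values produced by the package. Other families apply a link function on top of this, which the sketch does not cover.

```julia
# Hypothetical stand-ins, NOT output of glmnet:
a0 = [10.0, 8.0, 6.0]          # one intercept per λ (3 values of λ)
betas = [0.0 0.5 1.0;          # p × nλ coefficient matrix (2 predictors)
         0.0 0.0 0.25]
Xnew = [2.0 4.0;               # n × p matrix of new observations
        1.0 0.0]

# Linear predictor: rows are observations, columns are values of λ.
# a0' is a 1 × nλ row, so broadcasting adds each intercept to its column.
yhat = Xnew * betas .+ a0'

println(size(yhat))            # one row per observation, one column per λ
```

The result has the same shape as the `predict` output shown above: one row per observation and one column per model along the path.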

To find the best value of λ by cross-validation, use `glmnetcv`:

```julia
julia> cv = glmnetcv(X, y)
Least Squares GLMNet Cross Validation
55 models for 4 predictors in 10 folds
Best λ 0.343 (mean loss 76.946, std 12.546)
74 models for 4 predictors in 10 folds
Best λ 0.450 (mean loss 129.720, std 14.871)

julia> argmin(cv.meanloss)
48

julia> cv.path.betas[:, 48]
julia> cv.path.betas[:, 46]
4-element Array{Float64,1}:
0.926911
0.00366805
0.781119
0.128094
0.0
0.103008

julia> coef(cv)
4-element Array{Float64,1}:
0.781119
0.128094
0.0
0.103008
```
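
The selection step shown above amounts to picking the column of the coefficient matrix whose cross-validated mean loss is smallest. A minimal sketch with plain arrays follows; `lambda`, `meanloss`, and `betas` are hypothetical stand-ins for the corresponding fields of a cross-validation result, and the numbers are made up for illustration.

```julia
# Hypothetical stand-ins, NOT output of glmnetcv:
lambda   = [1.0, 0.5, 0.25, 0.125]   # λ values along the path
meanloss = [4.0, 2.5, 2.0, 2.2]      # mean CV loss for each λ
betas = [0.0 0.3 0.6 0.7;            # p × nλ coefficients
         0.0 0.0 0.1 0.2]

best = argmin(meanloss)              # index of the λ with the lowest mean loss
best_lambda = lambda[best]
best_coef = betas[:, best]           # coefficients of the selected model

println("best index = ", best, ", λ = ", best_lambda)
```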

### A classification example

```julia
julia> using RDatasets

julia> iris = dataset("datasets", "iris");

julia> X = convert(Matrix, iris[:, 1:4]);

julia> y = convert(Vector, iris[:Species]);

julia> iTrain = sample(1:size(X,1), 100, replace = false);

julia> iTest = setdiff(1:size(X,1), iTrain);

julia> iris_cv = glmnetcv(X[iTrain, :], y[iTrain])
Multinomial GLMNet Cross Validation
100 models for 4 predictors in 10 folds
Best λ 0.001 (mean loss 0.124, std 0.046)

julia> yht = round(predict(iris_cv, X[iTest, :], outtype = :prob), 3);

julia> DataFrame(target=y[iTest], set=yht[:,1], ver=yht[:,2], vir=yht[:,3])[5:5:50,:]
10x4 DataFrame
| Row | target | set | ver | vir |
|-----|--------------|-------|-------|-------|
| 1 | "setosa" | 0.999 | 0.001 | 0.0 |
| 2 | "setosa" | 1.0 | 0.0 | 0.0 |
| 3 | "setosa" | 1.0 | 0.0 | 0.0 |
| 4 | "versicolor" | 0.0 | 0.983 | 0.017 |
| 5 | "versicolor" | 0.002 | 0.961 | 0.037 |
| 6 | "versicolor" | 0.0 | 0.067 | 0.933 |
| 7 | "versicolor" | 0.0 | 0.993 | 0.007 |
| 8 | "virginica" | 0.0 | 0.0 | 1.0 |
| 9 | "virginica" | 0.0 | 0.397 | 0.603 |
| 10 | "virginica" | 0.0 | 0.025 | 0.975 |

julia> plot(iris_cv.path, Scale.x_log10, Guide.xlabel("λ"), Guide.ylabel("βᵢ"))
```
![iris-lasso-path](https://rawgit.com/linxihui/Misc/master/Images/GLMNet.jl/iris_lasso_path.svg)

```julia
julia> plot(iris_cv)
```
![iris-cv](https://rawgit.com/linxihui/Misc/master/Images/GLMNet.jl/iris_cv.svg)

## Fitting models

`glmnet` has two required parameters: the m x n predictor matrix `X` and the dependent variable `y`. It additionally accepts an optional third argument, `family`, which can be used to specify a generalized linear model. Currently, only `Normal()` (least squares, default), `Binomial()` (logistic), and `Poisson()` are supported, although the glmnet Fortran code also implements a Cox model. For logistic models, `y` is a m x 2 matrix, where the first column is the count of negative responses for each row in `X` and the second column is the count of positive responses. For all other models, `y` is a vector.
`glmnet` has two required parameters: the m x n predictor matrix `X` and the dependent variable `y`. It additionally accepts an optional third argument, `family`, which can be used to specify a generalized linear model. Currently, `Normal()` (least squares, default), `Binomial()` (logistic), `Poisson()`, `Multinomial()`, and `CoxPH()` (Cox model) are supported.

- For linear and Poisson models, `y` is a numerical vector.
- For logistic models, `y` is either a string vector or an m x 2 matrix, where the first column is the count of negative responses for each row in `X` and the second column is the count of positive responses.
- For multinomial models, `y` is either a string vector (with at least 3 unique values) or an m x k matrix, where k is the number of unique values (classes).
- For Cox models, `y` is a 2-column matrix, where the first column is the survival time and the second column is the (right) censoring status. For survival data, `glmnet` also has the method `glmnet(X::Matrix, time::Vector, status::Vector)`; the same holds for `glmnetcv`.
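
The matrix forms of `y` listed above can be built with plain Julia. The sketch below uses made-up data and hypothetical variable names; it only constructs the inputs and leaves the `glmnet` calls as comments. It assumes that a status of 1 marks an observed event and 0 a censored one, and that the multinomial matrix is an indicator (one-hot) matrix with classes in sorted order; the text above does not state either convention.

```julia
# Logistic: m × 2 matrix, column 1 = negative count, column 2 = positive count.
labels = ["no", "yes", "yes", "no"]                  # hypothetical binary outcome
y_logistic = Float64.(hcat(labels .== "no", labels .== "yes"))

# Multinomial: m × k indicator matrix, one column per class.
# Assumption: one-hot rows with classes in sorted order.
species = ["setosa", "virginica", "versicolor", "setosa"]
classes = sort(unique(species))
y_multi = Float64[s == c for s in species, c in classes]

# Cox: 2-column matrix of survival time and censoring status.
# Assumption: status 1 = event observed, 0 = right-censored.
surv_time = [5.0, 8.0, 3.0, 9.0]
censor_status = [1.0, 0.0, 1.0, 1.0]
y_cox = hcat(surv_time, censor_status)

# Calls as described in the text above (not executed here):
#   glmnet(X, y_logistic, Binomial())
#   glmnet(X, y_multi, Multinomial())
#   glmnet(X, y_cox, CoxPH())   or   glmnet(X, surv_time, censor_status)
println(size(y_logistic), " ", size(y_multi), " ", size(y_cox))
```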


`glmnet` also accepts many optional parameters, described below:
`glmnet` also accepts many optional keyword parameters, described below:

- `weights`: A vector of weights for each sample of the same size as `y`.
- `alpha`: The tradeoff between lasso and ridge regression. This defaults to `1.0`, which specifies a lasso model.