# Prostate Data  

This demonstration solves a regular, unconstrained lasso problem using
the constrained lasso solution path (`lsq_classopath.jl`).
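
For reference, the optimization problem can be sketched as follows. This assumes the formulation given in the package documentation, where $\rho \ge 0$ is the tuning parameter and the constrained lasso adds linear constraints $A\beta = b$ and $C\beta \le d$. The symbols $A$, $b$, $C$, $d$ do not appear elsewhere on this page. With no constraints supplied, the problem reduces to the ordinary lasso:

```math
\text{minimize} \quad \frac{1}{2} \|y - X\beta\|_2^2 + \rho \|\beta\|_1
```

Each value of $\rho$ gives one coefficient vector $\hat{\beta}(\rho)$, and the solution path traces $\hat{\beta}(\rho)$ as $\rho$ varies.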

The `prostate` data come from a study that examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy ([Stamey et al. (1989)](../references.md#4)).

Let's load and organize the `prostate` data. We use the following variables as predictors, so we extract them into a design matrix `Xz`:

* `lcavol` : log(cancer volume)
* `lweight`: log(prostate weight)
* `age`    : age
* `lbph`   : log(benign prostatic hyperplasia amount)
* `svi`    : seminal vesicle invasion
* `lcp`    : log(capsular penetration)
* `gleason`: Gleason score
* `pgg45`  : percentage Gleason scores 4 or 5

The response variable is `lpsa`, which is log(prostate specific antigen). 

In [None]:
using ConstrainedLasso 

In [2]:
prostate = readcsv(joinpath(Pkg.dir("ConstrainedLasso"), "docs/src/demo/misc/prostate.csv"), header=true)
tmp = Int[]
labels = ["lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason" "pgg45"]
for i in labels
    push!(tmp, find(x -> x == i, prostate[2])[1])
end
Xz = Array{Float64}(prostate[1][:, tmp])

97×8 Array{Float64,2}:
 -0.579818  2.76946  50.0  -1.38629   0.0  -1.38629   6.0   0.0
 -0.994252  3.31963  58.0  -1.38629   0.0  -1.38629   6.0   0.0
 -0.510826  2.69124  74.0  -1.38629   0.0  -1.38629   7.0  20.0
 -1.20397   3.28279  58.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.751416  3.43237  62.0  -1.38629   0.0  -1.38629   6.0   0.0
 -1.04982   3.22883  50.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.737164  3.47352  64.0   0.615186  0.0  -1.38629   6.0   0.0
  0.693147  3.53951  58.0   1.53687   0.0  -1.38629   6.0   0.0
 -0.776529  3.53951  47.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.223144  3.24454  63.0  -1.38629   0.0  -1.38629   6.0   0.0
  0.254642  3.60414  65.0  -1.38629   0.0  -1.38629   6.0   0.0
 -1.34707   3.59868  63.0   1.26695   0.0  -1.38629   6.0   0.0
  1.61343   3.02286  63.0  -1.38629   0.0  -0.597837  7.0  30.0
  ⋮                                         ⋮                  
  3.30285   3.51898  64.0  -1.38629   1.0   2.32728   7.0  60.0
  2.02419   3.731

In [3]:
y = Array{Float64}(prostate[1][:, end-1])

97-element Array{Float64,1}:
 -0.430783
 -0.162519
 -0.162519
 -0.162519
  0.371564
  0.765468
  0.765468
  0.854415
  1.04732 
  1.04732 
  1.26695 
  1.26695 
  1.26695 
  ⋮       
  3.63099 
  3.68009 
  3.71235 
  3.98434 
  3.9936  
  4.02981 
  4.12955 
  4.38515 
  4.68444 
  5.14312 
  5.47751 
  5.58293 

First we standardize each predictor (each column of `Xz`) by subtracting its mean and dividing by its standard deviation.

In [4]:
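n, p = size(Xz)   # n observations (rows), p predictors (columns)
for i in 1:p
    # center column i, then scale it to unit sample standard deviation
    Xz[:, i] -= mean(Xz[:, i])
    Xz[:, i] /= std(Xz[:, i])
end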
Xz

97×8 Array{Float64,2}:
 -1.63736   -2.00621    -1.86243    …  -0.863171  -1.04216   -0.864467
 -1.98898   -0.722009   -0.787896      -0.863171  -1.04216   -0.864467
 -1.57882   -2.18878     1.36116       -0.863171   0.342627  -0.155348
 -2.16692   -0.807994   -0.787896      -0.863171  -1.04216   -0.864467
 -0.507874  -0.458834   -0.250631      -0.863171  -1.04216   -0.864467
 -2.03613   -0.933955   -1.86243    …  -0.863171  -1.04216   -0.864467
 -0.519967  -0.362793    0.0180011     -0.863171  -1.04216   -0.864467
 -0.557313  -0.208757   -0.787896      -0.863171  -1.04216   -0.864467
 -1.80425   -0.208757   -2.26537       -0.863171  -1.04216   -0.864467
 -0.956085  -0.897266   -0.116315      -0.863171  -1.04216   -0.864467
 -0.92936   -0.0578992   0.152317   …  -0.863171  -1.04216   -0.864467
 -2.28833   -0.0706369  -0.116315      -0.863171  -1.04216   -0.864467
  0.223498  -1.41472    -0.116315      -0.299282   0.342627   0.199211
  ⋮                                 ⋱   ⋮             
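
The column-by-column standardization above can also be written with broadcasting. The following is a minimal sketch, assuming Julia 1.x with the `Statistics` standard library (the cells on this page use pre-1.0 functions such as `readcsv`). The matrix `X` holds made-up toy values, not the prostate data.

```julia
using Statistics  # standard library; provides mean and std

# Toy 3×2 matrix (made-up values, not the prostate data)
X = [1.0 10.0;
     2.0 20.0;
     3.0 60.0]

# Column means and sample standard deviations
# (std uses the n-1 denominator by default)
μ = mean(X, dims=1)
σ = std(X, dims=1)

# Broadcasting subtracts each column's mean and divides by its standard deviation
Z = (X .- μ) ./ σ

# Each standardized column should have mean ≈ 0 and standard deviation ≈ 1
@assert all(abs.(mean(Z, dims=1)) .< 1e-12)
@assert all(abs.(std(Z, dims=1) .- 1) .< 1e-12)
```

Standardizing puts all predictors on a common scale, so the lasso penalty treats their coefficients comparably.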

Now we solve the problem using the solution path algorithm.

In [None]:
βpath, ρpath, = lsq_classopath(Xz, y);

In [6]:
βpath

8×9 Array{Float64,2}:
 0.000197119  0.421559  0.461915   …   0.597645    0.603245    0.665561 
 0.0          0.0       0.0            0.232715    0.246191    0.266408 
 0.0          0.0       0.0           -0.0601318  -0.0936838  -0.158234 
 0.0          0.0       0.0            0.0882392   0.108105    0.14034  
 0.0          0.0       0.0403562      0.243534    0.252539    0.315269 
 0.0          0.0       0.0        …   0.0         0.0        -0.148508 
 0.0          0.0       0.0            0.0         0.0121929   0.0354652
 0.0          0.0       0.0            0.0646193   0.0699873   0.125787 
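
Each row of `βpath` corresponds to one predictor and each column to one value in `ρpath`. One way to summarize such a path is to count the nonzero coefficients at each path point. The sketch below assumes Julia 1.x and uses hypothetical arrays `βtoy` and `ρtoy` with made-up values, so that it runs without the package. They are not the prostate results.

```julia
# Hypothetical 3×4 coefficient path: rows are predictors, columns are path points.
# These numbers are made up for illustration and are NOT the prostate results.
βtoy = [0.0 0.2 0.4  0.5;
        0.0 0.0 0.1  0.3;
        0.0 0.0 0.0 -0.1]
ρtoy = [4.0, 2.5, 1.0, 0.0]   # one penalty value per column of βtoy

# Number of coefficients that are nonzero (beyond a small tolerance) at each path point
nactive = vec(sum(abs.(βtoy) .> 1e-8, dims=1))

for (k, ρ) in enumerate(ρtoy)
    println("ρ = ", ρ, ": ", nactive[k], " active predictor(s)")
end
```

The same expression applied to `βpath` would show how many predictors are active at each value in `ρpath`.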

We plot the solution path below. 

In [12]:
using Plots; pyplot(); 
colors = [:green :orange :black :purple :red :grey :brown :blue] 
plot(ρpath, βpath', xaxis = ("ρ", (minimum(ρpath),
      maximum(ρpath))), yaxis = ("β̂(ρ)"), label=labels, color=colors)
title!("Prostate Data: Solution Path via Constrained Lasso")

Below, we solve the same problem using the `GLMNet.jl` package.

In [None]:
using GLMNet;  
path = glmnet(Xz, y, intercept=false);
path.betas

In [11]:
plot(path.lambda, path.betas', color=colors, label=labels, 
		xaxis=("λ"), yaxis= ("β̂(λ)"))
title!("Prostate Data: Solution Path via GLMNet.jl")
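
When comparing the two plots, note that their horizontal axes are on different scales. As a rough sketch, assume `lsq_classopath` minimizes $\frac{1}{2}\|y - X\beta\|_2^2 + \rho\|\beta\|_1$ and that glmnet's Gaussian lasso objective (with no intercept) carries a $1/(2n)$ factor on the squared error:

```math
\frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
\quad\Longleftrightarrow\quad
\frac{1}{2} \|y - X\beta\|_2^2 + n\lambda \|\beta\|_1 ,
\qquad \text{so } \rho = n\lambda .
```

Under these assumptions, multiplying `path.lambda` by $n = 97$ would put both paths on the same scale. This ignores any internal standardization that glmnet applies to the predictors.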

*Follow the [link](https://github.com/Hua-Zhou/ConstrainedLasso.jl/blob/master/docs/src/demo/prostate.ipynb) to access the .ipynb file of this page.*