# Binary classification with the crabs data

This example shows how the GP Monte Carlo function can be used for supervised classification. We use the crabs dataset from the R package MASS and aim to predict whether a crab is of the blue or the orange colour form.

In [1]:
using GaussianProcesses

train=readdlm("data/crabs_train.txt",',');

train[train[:,1].==-1.0,1]=0.0;
y = convert(Vector{Bool},train[:,1]);       # response
X = train[:,2:end];                         # predictors

We assume a zero-mean GP with a Matern 3/2 kernel. We use the automatic relevance determination (ARD) form of the kernel so that each predictor dimension has its own length scale. As this is binary classification, we use the Bernoulli likelihood.

In [2]:
#Select mean, kernel and likelihood function
mZero = MeanZero()   #Zero mean function
kern = Matern(3/2,zeros(5),0.0)   #Matern 3/2 ARD kernel (note that hyperparameters are on the log scale)
lik = BernLik()

Type: GaussianProcesses.BernLik, Params: Any[]
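
To make the role of the ARD length scales concrete, here is a minimal sketch of a Matern 3/2 ARD covariance between two input vectors. It assumes the six hyperparameters are five log length scales followed by one log signal standard deviation; `mat32_ard` is a hypothetical helper written for illustration, not the package's internal implementation.

```julia
# Illustrative Matern 3/2 ARD covariance (hypothetical helper, not package code).
# ll : vector of log length scales, one per input dimension
# lσ : log signal standard deviation
function mat32_ard(x, y, ll, lσ)
    ℓ = exp.(ll)                          # length scales on the natural scale
    σ2 = exp(2 * lσ)                      # signal variance
    r = sqrt(sum(abs2, (x .- y) ./ ℓ))    # distance scaled per dimension
    return σ2 * (1 + sqrt(3) * r) * exp(-sqrt(3) * r)
end

# Toy inputs, made up for illustration
a = [0.0, 1.0]
b = [1.0, 1.0]
println(mat32_ard(a, a, zeros(2), 0.0))      # covariance of a point with itself
println(mat32_ard(a, b, zeros(2), 0.0))      # unit length scales
println(mat32_ard(a, b, [1.0, 0.0], 0.0))    # longer length scale in dimension 1
```

A longer length scale in one dimension makes differences along that dimension matter less, so the covariance between `a` and `b` rises. This is how ARD lets the model downweight less relevant predictors.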


We fit the GP using the general `GP` function, which is shorthand for the `GPMC` function used to generate Monte Carlo approximations of the latent function when the likelihood is non-Gaussian.

In [3]:
gp = GP(X',vec(y),mZero,kern,lik)

GP Monte Carlo object:
  Dim = 5
  Number of observations = 100
  Mean function:
    Type: GaussianProcesses.MeanZero, Params: Float64[]
  Kernel:
    Type: GaussianProcesses.Mat32Ard, Params: [-0.0,-0.0,-0.0,-0.0,-0.0,0.0]
  Likelihood:
    Type: GaussianProcesses.BernLik, Params: Any[]
  Input observations = 
[8.1 8.8 … 17.5 19.2; 6.7 7.7 … 16.7 16.5; … ; 19.0 20.8 … 44.5 47.9; 7.0 7.4 … 17.0 18.1]
  Output observations = Bool[false,false,false,false,false,false,false,false,false,false  …  true,true,true,true,true,true,true,true,true,true]
  Log-posterior = -161.209
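
For orientation, the model being fitted can be sketched as follows. This rests on two assumptions that are not stated above: that `BernLik` uses a probit link $\Phi$ (the standard normal CDF), and that the Monte Carlo GP samples a whitened latent vector $v$ rather than $f$ directly.

$$
\begin{aligned}
v &\sim \mathcal{N}(0, I_n), \\
f &= L_\theta v, \qquad L_\theta L_\theta^\top = K_\theta(X, X), \\
y_i \mid f_i &\sim \mathrm{Bernoulli}\big(\Phi(f_i)\big), \qquad i = 1, \dots, n .
\end{aligned}
$$

Here $n = 100$ is the number of observations and $\theta$ collects the six kernel hyperparameters. Under this parameterisation the sampler's state would hold the 100 entries of $v$ plus the 6 entries of $\theta$, which would be consistent with the 106 rows of the sample matrix returned by `mcmc` later in this example.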

We assign `Normal` priors from the `Distributions` package to each of the six Matern 3/2 kernel parameters.

In [5]:
set_priors!(gp.k,[Distributions.Normal(0.0,2.0) for i in 1:6])

6-element Array{Distributions.Normal{Float64},1}:
 Distributions.Normal{Float64}(μ=0.0, σ=2.0)
 Distributions.Normal{Float64}(μ=0.0, σ=2.0)
 Distributions.Normal{Float64}(μ=0.0, σ=2.0)
 Distributions.Normal{Float64}(μ=0.0, σ=2.0)
 Distributions.Normal{Float64}(μ=0.0, σ=2.0)
 Distributions.Normal{Float64}(μ=0.0, σ=2.0)
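
Because the kernel hyperparameters are on the log scale, a `Normal(0.0, 2.0)` prior on a log parameter corresponds to a log-normal prior on the parameter itself. The sketch below computes the central 95% prior interval on the natural scale. `lognormal_interval` is a hypothetical helper, and the constant `z` is the 97.5% quantile of the standard normal distribution.

```julia
# Central 95% prior interval for a parameter whose logarithm has a Normal(μ, σ) prior.
# Hypothetical helper for illustration; z is the 97.5% standard normal quantile.
function lognormal_interval(μ, σ; z = 1.959964)
    return (exp(μ - z * σ), exp(μ + z * σ))
end

lo, hi = lognormal_interval(0.0, 2.0)
println("95% prior interval on the natural scale: (", lo, ", ", hi, ")")
```

The interval runs from roughly 0.02 to roughly 50, so this prior is fairly diffuse and spans several orders of magnitude for each length scale and for the signal standard deviation.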

Samples of the latent function $f|X,y,\theta$ are drawn by MCMC. The MCMC routine uses the `Klara` package. By default, the `mcmc` function uses the Metropolis-adjusted Langevin algorithm (MALA), with the step size tuned automatically towards a desired acceptance rate. Alternative samplers, such as HMC and NUTS, can also be used.

In [6]:
samples = mcmc(gp)

BasicMCJob:
  Variable [1]: p (Klara.BasicContMuvParameter)
  GenericModel: 1 variables, 0 dependencies (directed graph)
  MALA sampler: drift step = 0.1
  VanillaMCTuner: period = 100, verbose = false
  BasicMCRange: number of steps = 5000, burnin = 1000, thinning = 1
  plain = true (job flow not controlled by tasks)

106×4000 Array{Float64,2}:
  0.0218528    0.256832     0.256832    …  -0.302273   -0.302273   -0.302273 
 -0.684932    -0.429229    -0.429229       -0.149953   -0.149953   -0.149953 
  0.437747     0.44674      0.44674         0.287418    0.287418    0.287418 
  0.177475    -0.00499257  -0.00499257     -0.38544    -0.38544    -0.38544  
  0.00438975   0.187532     0.187532        0.876611    0.876611    0.876611 
 -1.0175      -1.13479     -1.13479     …   0.0531735   0.0531735   0.0531735
  0.333472     0.382272     0.382272       -0.582206   -0.582206   -0.582206 
 -1.3892      -1.45687     -1.45687        -1.12528    -1.12528    -1.12528  
  0.523746     0.125239     0.125239       -0.551809   -0.551809   -0.551809 
  0.269302     0.913518     0.913518        0.0344306   0.0344306   0.0344306
 -0.442526    -1.04142     -1.04142     …  -0.442074   -0.442074   -0.442074 
  0.16869      0.1078       0.1078          1.51187     1.51187     1.51187  
 -0.328974    -0.463307    -0.463307 
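
As a rough illustration of what a MALA sampler does, the sketch below runs a single chain on a toy target. It is not Klara's implementation: `mala` is a hypothetical helper, the step size `ε` is fixed rather than tuned, and the target is a standard bivariate normal chosen only because its gradient is trivial.

```julia
# Minimal MALA sketch (hypothetical helper, fixed step size, no tuning).
# logp     : function returning the log-density of the target (up to a constant)
# gradlogp : function returning the gradient of logp
function mala(logp, gradlogp, x0, ε, nsteps)
    d = length(x0)
    chain = zeros(d, nsteps)
    x = copy(x0)
    accepted = 0
    for t in 1:nsteps
        # Langevin proposal: gradient drift plus Gaussian noise
        μx = x .+ (ε^2 / 2) .* gradlogp(x)
        xnew = μx .+ ε .* randn(d)
        # Mean of the reverse proposal, needed for the Metropolis-Hastings correction
        μnew = xnew .+ (ε^2 / 2) .* gradlogp(xnew)
        logq_fwd = -sum(abs2, xnew .- μx) / (2 * ε^2)
        logq_rev = -sum(abs2, x .- μnew) / (2 * ε^2)
        logα = logp(xnew) - logp(x) + logq_rev - logq_fwd
        if log(rand()) < logα
            x = xnew
            accepted += 1
        end
        chain[:, t] = x    # a rejected proposal repeats the previous state
    end
    return chain, accepted / nsteps
end

# Toy target: standard bivariate normal
logp(x) = -sum(abs2, x) / 2
gradlogp(x) = -x

chain, rate = mala(logp, gradlogp, zeros(2), 0.5, 2000)
println("acceptance rate: ", rate)
```

Each column of `chain` is one state of the sampler, mirroring the layout of the sample matrix above. Since a rejected proposal repeats the previous state, rejections would explain the identical adjacent columns visible in that output.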

We test the predictive accuracy of the fitted model against a hold-out dataset.

In [7]:
test = readdlm("data/crabs_test.txt",',');
test[test[:,1].==-1.0,1]=0.0;

xTest = test[:,2:end];
yTest = test[:,1];

In [8]:
ymean = zeros(size(samples,2),size(xTest,1));   # one row of predictive means per MCMC sample

for i in 1:size(samples,2)
    set_params!(gp,samples[:,i])
    update_target!(gp)
    ymean[i,:] = predict_y(gp,xTest')[1]
end
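
The loop above stores one row of predictive probabilities per posterior sample. One way to turn these into a single accuracy figure is to average over samples and threshold at 0.5, as in the sketch below. `posterior_accuracy` is a hypothetical helper, not part of GaussianProcesses.jl, and the toy inputs are made up for illustration.

```julia
# Average per-sample predictive probabilities and score them against 0/1 labels.
# Hypothetical helper for illustration.
# ymean : S × N matrix, one row of predictive probabilities per MCMC sample
# ytest : length-N vector of labels coded as 0.0 / 1.0
function posterior_accuracy(ymean, ytest; threshold = 0.5)
    S, N = size(ymean)
    pbar = [sum(ymean[:, j]) / S for j in 1:N]   # Monte Carlo average over samples
    yhat = pbar .> threshold                     # predicted class for each test point
    acc = sum(yhat .== (ytest .== 1.0)) / N      # fraction classified correctly
    return pbar, acc
end

# Toy inputs: 2 posterior samples, 3 test points
toy_ymean = [0.9 0.2 0.6; 0.7 0.4 0.2]
toy_ytest = [1.0, 0.0, 1.0]
pbar, acc = posterior_accuracy(toy_ymean, toy_ytest)
println("averaged probabilities: ", pbar)
println("accuracy: ", acc)
```

In this notebook the call would be `posterior_accuracy(ymean, yTest)` once `ymean` has been filled. The 0.5 threshold is a default choice and can be changed through the `threshold` keyword.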

In [9]:
using Plots
gr()

plot(ymean',leg=false)
scatter!(yTest)