# Supervised Learning (Regression)

Originally written for [Edward](http://edwardlib.org/tutorials/supervised-regression).

In supervised learning, the task is to infer hidden structure from
labeled data, which consists of training examples $\{(x_n, y_n)\}$.
Regression typically means that the output $y$ takes continuous values.

Let us import the necessary packages.

In [41]:
using Turing
using Distributions
using Plots
using StatsPlots
using Random
using LinearAlgebra
# use Distances for Kullback-Leibler divergence
using Distances

## Data

Simulate training and test sets of $40$ data points each. Each set
consists of pairs of inputs $\mathbf{x}_n\in\mathbb{R}^{10}$ and outputs
$y_n\in\mathbb{R}$. The outputs depend linearly on the inputs, with
normally distributed noise.
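
As a sketch of the generative process that `build_toy_dataset` implements, reading the second argument of `Normal` as a standard deviation (the symbol $\epsilon_n$ is introduced here only for illustration):

\begin{align*}
  x_{nd} &\sim \text{Normal}(0, 2^2),
  \\
  \epsilon_n &\sim \text{Normal}(0, 0.01^2),
  \\
  y_n &= \mathbf{x}_n^\top\mathbf{w}_{\text{true}} + \epsilon_n.
\end{align*}

Note that the simulated data contains no intercept term.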

In [42]:
function build_toy_dataset(N, w)
    D = length(w)
    x = rand(Normal(0.0, 2.0), N, D)
    y = x * w + rand(Normal(0.0, 0.01), N)
    return x, y
end

Random.seed!(42)

N = 40 # number of data points
D = 10 # number of features

w_true = randn(D) * 0.5
X_train, y_train = build_toy_dataset(N, w_true)
X_test, y_test = build_toy_dataset(N, w_true)

([2.48075 -1.03266 … -2.63031 1.20771; 2.0663 -6.04472 … -1.7286 -0.107753; … ; -3.3115 4.05319 … 0.933275 -2.30527; -3.94518 -0.0104893 … -3.32086 -0.467877], [5.36258, 2.71703, 0.0280316, -2.08799, -3.31021, 6.71276, 2.73633, 0.922324, 3.03641, -4.17485  …  -1.50087, -3.37694, 1.06012, 1.38397, 4.31679, -0.100416, -2.02223, 1.88883, -0.456922, 2.43781])

## Model

Posit the model as Bayesian linear regression (Murphy, 2012).
It assumes a linear relationship between the inputs
$\mathbf{x}\in\mathbb{R}^D$ and the outputs $y\in\mathbb{R}$.

For a set of $N$ data points $(\mathbf{X},\mathbf{y})=\{(\mathbf{x}_n, y_n)\}$,
the model posits the following distributions:

\begin{align*}
  p(\mathbf{w})
  &=
  \text{Normal}(\mathbf{w} \mid \mathbf{0}, \sigma_w^2\mathbf{I}),
  \\[1.5ex]
  p(b)
  &=
  \text{Normal}(b \mid 0, \sigma_b^2),
  \\
  p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X})
  &=
  \prod_{n=1}^N
  \text{Normal}(y_n \mid \mathbf{x}_n^\top\mathbf{w} + b, \sigma_y^2).
\end{align*}

The latent variables are the linear model's weights $\mathbf{w}$ and
intercept $b$, also known as the bias.
Assume $\sigma_w^2,\sigma_b^2$ are known prior variances and $\sigma_y^2$ is a
known likelihood variance. The mean of the likelihood is given by a
linear transformation of the inputs $\mathbf{x}_n$.
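
To make the three factors concrete, the following is a minimal sketch of the log joint density $\log p(\mathbf{w}) + \log p(b) + \log p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X})$ in plain Julia. The names `normal_logpdf` and `log_joint`, and the keyword arguments, are hypothetical helpers chosen for this illustration; they are not part of Turing or Distributions.

```julia
using Random

# Log density of a univariate Normal with mean μ and standard deviation σ at z.
normal_logpdf(z, μ, σ) = -0.5 * log(2π) - log(σ) - 0.5 * ((z - μ) / σ)^2

# Log joint density of the Bayesian linear regression model.
# (Illustrative helper; the standard deviations default to 1.)
function log_joint(w, b, X, y; σ_w=1.0, σ_b=1.0, σ_y=1.0)
    lp = sum(normal_logpdf.(w, 0.0, σ_w))          # prior on the weights
    lp += normal_logpdf(b, 0.0, σ_b)               # prior on the intercept
    lp += sum(normal_logpdf.(y, X * w .+ b, σ_y))  # likelihood, one term per data point
    return lp
end

# Small usage example on synthetic values.
Random.seed!(1)
X = randn(5, 3)
w = randn(3)
b = 0.5
y = X * w .+ b
println(log_joint(w, b, X, y))
```

Each `sum` corresponds to one factor in the equations above, so the function returns a single scalar that an inference algorithm could evaluate for any candidate $(\mathbf{w}, b)$.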

Let's build the model in Turing, fixing $\sigma_w=\sigma_b=\sigma_y=1$.

In [43]:
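@model function regression(x, y)
    D = size(x, 2)
    # Priors on the weights and intercept, with σ_w = σ_b = 1
    w ~ MvNormal(zeros(D), I)
    b ~ Normal(0, 1)
    # Likelihood with σ_y = 1: y_n ~ Normal(x_nᵀw + b, 1)
    y ~ MvNormal(x * w .+ b, I)
end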

regression (generic function with 3 methods)

## Inference

We now turn to inferring the posterior using variational inference.
Define the variational model to be a fully factorized normal across
the weights.
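
One way to write such a family is sketched here. The parameter names $m_d$, $s_d$, $m_b$, $s_b$ are notation assumed for this illustration, and including the intercept $b$ in the factorization is likewise an assumption:

\begin{align*}
  q(\mathbf{w}, b)
  &=
  q(b) \prod_{d=1}^D q(w_d),
  \\
  q(w_d)
  &=
  \text{Normal}(w_d \mid m_d, s_d^2),
  \\
  q(b)
  &=
  \text{Normal}(b \mid m_b, s_b^2).
\end{align*}

Variational inference then adjusts these means and standard deviations so that $q$ is close, in Kullback-Leibler divergence, to the posterior $p(\mathbf{w}, b \mid \mathbf{X}, \mathbf{y})$.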