# Chapter 14: Least squares classification 

## 14.1 Classification 

**Note on Boolean values.** Julia has the Boolean values `true` and `false`, which behave as the numbers 1 and 0 in arithmetic. VMLS encodes the two classes as +1 and -1, so we convert a Boolean `b` with `2*b-1` or with the ternary operator `b ? 1 : -1`.

In [38]:
bool_2_signed(b) = 2*b-1
bool_2_signed(true), bool_2_signed(false)

(1, -1)

In [3]:
b = [true, false, true]
bool_2_signed.(b)

3-element Vector{Int64}:
  1
 -1
  1
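The ternary form mentioned in the note is not shown in the cells above, so here is a minimal sketch of it. The name `bool_2_signed_ternary` is hypothetical and used only to avoid clashing with `bool_2_signed`.

```julia
# Hypothetical helper (name not from the original notebook):
# ternary version of the Boolean-to-signed encoding.
bool_2_signed_ternary(b) = b ? 1 : -1

b = [true, false, true]

# Both encodings should agree elementwise on Boolean input.
bool_2_signed_ternary.(b) == 2 .* b .- 1
```

One difference worth keeping in mind: the ternary form only accepts a `Bool`, whereas the arithmetic form `2*b-1` also works on numeric 0/1 values, such as the entries of a `Float64` one-hot matrix.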

**Confusion matrix.** We calculate the prediction errors and the confusion matrix for a randomly generated Boolean data vector `y` and a prediction vector `yhat`, each of length `N`.

In [13]:
using VMLS
N = 100;
y = rand(Bool, N); yhat = rand(Bool, N);

Ntp(y,yhat) = sum((y .== true) .& (yhat .== true));
Nfn(y,yhat) = sum((y .== true) .& (yhat .== false));
Nfp(y,yhat) = sum((y .== false) .& (yhat .== true));
Ntn(y,yhat) = sum((y .== false) .& (yhat .== false));

error_rate(y,yhat) = avg(y .!= yhat) # equivalently (Nfn(y,yhat) + Nfp(y,yhat)) / length(y)
recall(y,yhat) = Ntp(y,yhat) / (Ntp(y,yhat) + Nfn(y,yhat))
false_positive_rate(y,yhat) = Nfp(y,yhat) / (Nfp(y,yhat) + Ntn(y,yhat))

confusion_matrix(y,yhat) = [Ntp(y,yhat) Nfn(y,yhat);
                            Nfp(y,yhat) Ntn(y,yhat)];

confusion_matrix(y,yhat)

2×2 Matrix{Int64}:
 24  27
 23  26

In [14]:
error_rate(y,yhat), recall(y,yhat), false_positive_rate(y,yhat)

(0.5, 0.47058823529411764, 0.46938775510204084)
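Because the data above are random, the counts are hard to check by eye. The sketch below repeats the same definitions on a small made-up data set (the five labels are invented for illustration) so that each entry can be verified by hand. The error rate is written with `sum` and `length` rather than `avg`, so the block does not depend on VMLS.

```julia
# Made-up labels and predictions, small enough to count by hand.
y    = [true, true, true, false, false]
yhat = [true, false, true, true, false]

# Same definitions as in the cell above, repeated so this block runs on its own.
Ntp(y,yhat) = sum((y .== true) .& (yhat .== true))
Nfn(y,yhat) = sum((y .== true) .& (yhat .== false))
Nfp(y,yhat) = sum((y .== false) .& (yhat .== true))
Ntn(y,yhat) = sum((y .== false) .& (yhat .== false))

# Rows are the true class, columns the predicted class (positive first).
C = [Ntp(y,yhat) Nfn(y,yhat);
     Nfp(y,yhat) Ntn(y,yhat)]

err = sum(y .!= yhat) / length(y)        # 2 of 5 examples are misclassified
rec = Ntp(y,yhat) / (Ntp(y,yhat) + Nfn(y,yhat))   # 2 of 3 positives are found
fpr = Nfp(y,yhat) / (Nfp(y,yhat) + Ntn(y,yhat))   # 1 of 2 negatives is flagged
```

Here `C` is `[2 1; 1 1]`: examples 1 and 3 are true positives, example 2 is a false negative, example 4 is a false positive, and example 5 is a true negative.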

## 14.2 Least squares classifier 

We can calculate $\hat f(x) = \operatorname{sign}(\tilde f(x))$ using `ftilde(x) > 0`, which returns a Boolean.
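As a minimal sketch of this idea, assume a linear regression model with a hypothetical coefficient vector `theta`. The values below are invented for illustration and are not fitted to any data.

```julia
# Hypothetical coefficients, chosen only for illustration.
theta = [-1.0, 2.0]

ftilde(x) = x' * theta      # real-valued regression output
fhat(x) = ftilde(x) > 0     # Boolean prediction from the sign of ftilde

p1 = fhat([1.0, 1.0])       # ftilde = -1 + 2 = 1, so the prediction is true
p2 = fhat([1.0, 0.0])       # ftilde = -1, so the prediction is false
(p1, p2)
```

With this comparison, a value of exactly zero maps to `false`. The Boolean result can be turned into a $\pm 1$ label with `2*fhat(x)-1`.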

**Iris flower classification.** The Iris data set contains 150 examples of 3 types of iris flowers (50 examples of each class). For each example, 4 features are provided. The following code reads in a dictionary containing three $50 \times 4$ matrices, `setosa`, `versicolor`, and `virginica`, and then computes a Boolean classifier that distinguishes *Iris Virginica* from the other 2 classes.

In [21]:
using VMLS
D = iris_data();
iris = vcat(D["setosa"], D["versicolor"], D["virginica"])
y = [zeros(Bool, 100); ones(Bool, 50)]
println(y)
ysigned = bool_2_signed.(y)
A = [ones(150) iris]
theta = A \ ysigned

Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


5-element Vector{Float64}:
 -2.3905637266512034
 -0.09175216910134672
  0.4055367711191063
  0.007975822012794512
  1.103558649867573

In [22]:
yhat = A*theta .> 0
C = confusion_matrix(y, yhat)

2×2 Matrix{Int64}:
 46   4
  7  93

In [23]:
error_rate(y, yhat)

0.07333333333333333
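The recall and false positive rate of this classifier can be read off the confusion matrix directly. The sketch below copies the matrix from the output above instead of recomputing it; the variable names are chosen here for illustration.

```julia
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class; Virginica is the positive class).
C = [46 4; 7 93]

recall_iris = C[1,1] / (C[1,1] + C[1,2])   # 46 of the 50 Virginica examples
fpr_iris    = C[2,1] / (C[2,1] + C[2,2])   # 7 of the 100 other examples
err_iris    = (C[1,2] + C[2,1]) / sum(C)   # 11 of 150 examples misclassified
```

The last quantity reproduces the error rate of about 0.073 reported above.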

## 14.3 Multi-class classifiers 

**Multi-class error rate and confusion matrix.**

In [24]:
function k_confusion_matrix(y,yhat,K)
    C = zeros(K,K)
    for i=1:K
        for j=1:K
            C[i,j] = sum( (y .== i) .& (yhat .== j) )
        end
    end
    return C
end;

In [33]:
K = 4;
y = rand(1:K, 100); yhat = rand(1:K, 100)

C = k_confusion_matrix(y,yhat,K)

4×4 Matrix{Float64}:
  8.0   5.0  8.0  6.0
  7.0   5.0  6.0  7.0
 11.0  10.0  7.0  2.0
  3.0   5.0  8.0  2.0

In [34]:
using LinearAlgebra
error_rate(y,yhat), 1-sum(diag(C))/sum(C)

(0.78, 0.78)

**Least squares multi-class classifier.** A $K$-class classifier (with regression model) can be expressed as

\begin{align}
\hat f(x) = \operatorname*{argmax}_{k=1,\dots,K} \tilde f_k(x),
\end{align}

where $\tilde f_k(x) = x^T \theta_k$. The $n$-vectors $\theta_1,\dots,\theta_K$ are the coefficients or parameters in the model. We can express this in matrix-vector notation as

\begin{align}
\hat f(x) = \operatorname{argmax}(x^T \Theta),
\end{align}

where $\Theta = [\theta_1,\dots,\theta_K]$ is the $n \times K$ matrix of model coefficients. 

In [70]:
# size of x'Theta is 1xK
# for N examples X, X'Theta gives NxK matrix 

# define function that will return N-vector 
# calculating fhat for all N examples 
# i.e. taking the argmax of each row in the NxK matrix

row_argmax(m) = [argmax(m[i, :]) for i=1:size(m,1)];
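A quick check of `row_argmax` on a small made-up matrix is sketched below, together with one possible alternative based on the `dims` keyword of `argmax`. The alternative is not part of the original notebook.

```julia
# Definition repeated from the cell above so this block runs on its own.
row_argmax(m) = [argmax(m[i, :]) for i=1:size(m,1)]

# Made-up scores for N = 2 examples and K = 3 classes.
m = [1 3 2;
     4 0 -1]

r = row_argmax(m)   # largest entry is in column 2 for row 1, column 1 for row 2

# Alternative sketch: argmax along dims=2 returns CartesianIndex values;
# the second component of each index is the column, i.e. the class.
r_alt = [c[2] for c in vec(argmax(m, dims=2))]

r == r_alt
```

When a row contains ties, `argmax` returns the first maximal index, so the lowest class number wins.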


**Matrix least squares.** We use least squares to find the coefficient matrix $\Theta$ for a multi-class classifier with $n$ features and $K$ classes, from a data set of $N$ examples. We assume the data $X$ is given as an $n \times N$ matrix and the classes of the examples are given as an $N$-vector $y^{cl}$ with entries in $\{1,\dots,K\}$.

The least squares objective can be expressed as a matrix norm squared,

\begin{align}
\|X^T \Theta - Y\|^2
\end{align}

where $Y$ is the $N \times K$ matrix with 

\begin{align}
Y_{ij} = \begin{cases}
            1 & y_i^{cl} = j\\
            -1 & y_i^{cl} \not =  j
         \end{cases}
\end{align}

The least squares solution is given by $\hat \Theta = (X^T)^{\dagger} Y $.
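As a small numerical sketch of this formula, the code below compares the pseudo-inverse expression with Julia's backslash operator on made-up data. It assumes $X^T$ has linearly independent columns; the data values are invented for illustration.

```julia
using LinearAlgebra

# Made-up data: n = 2 features (the first is the constant 1),
# N = 4 examples, K = 2 classes.
X = [1.0 1.0 1.0 1.0;
     -2.0 -1.0 1.0 2.0]
Y = [ 1.0 -1.0;
      1.0 -1.0;
     -1.0  1.0;
     -1.0  1.0]

Xt = Matrix(X')              # N x n matrix X^T

Theta_pinv = pinv(Xt) * Y    # pseudo-inverse formula from the text
Theta_bs   = Xt \ Y          # backslash solves the same least squares problem

Theta_pinv ≈ Theta_bs
```

For this data the fitted intercepts are zero and the slopes are $-0.6$ and $0.6$, which can be confirmed by hand since the feature values sum to zero. The backslash form avoids forming the pseudo-inverse explicitly, which is why the notebook code uses it.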

In [71]:
# function returns an N x K matrix Y
# with a one-hot encoding of the class labels
# (1 where y[i] == j, 0 elsewhere); convert to -1/+1 afterwards
function one_hot(y, K)
    N = length(y)
    Y = zeros(N,K)
    for j=1:K
        Y[findall(y .== j), j] .= 1
    end
    return Y
end;

# example 
K = 4
ycl = rand(1:K,6)
Y = one_hot(ycl, K)

6×4 Matrix{Float64}:
 0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0
 0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0

In [72]:
# use our function to convert to -1/1 scale
bool_2_signed.(Y)

6×4 Matrix{Float64}:
 -1.0  -1.0  -1.0   1.0
  1.0  -1.0  -1.0  -1.0
 -1.0  -1.0  -1.0   1.0
 -1.0  -1.0  -1.0   1.0
  1.0  -1.0  -1.0  -1.0
  1.0  -1.0  -1.0  -1.0

In [77]:
# define function for matrix least squares 
# multi-class classifier 
function ls_multiclass(X, y, K)
    n, N = size(X)
    Y = bool_2_signed.(one_hot(y, K))
    Theta = X' \ Y 
    yhat = row_argmax(X'*Theta)
    return Theta, yhat
end;
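Before applying `ls_multiclass` to real data, it can help to run it on a problem small enough to solve by hand. The sketch below uses made-up data with one scalar feature, a constant feature, and $K = 2$ classes (a deliberately small case, so the fit can be checked on paper). The helper definitions are repeated in compact form so the block runs on its own.

```julia
# Definitions repeated (in compact form) from the cells above.
bool_2_signed(b) = 2*b-1
row_argmax(m) = [argmax(m[i, :]) for i=1:size(m,1)]
function one_hot(y, K)
    Y = zeros(length(y), K)
    for j=1:K
        Y[findall(y .== j), j] .= 1
    end
    return Y
end
function ls_multiclass(X, y, K)
    Y = bool_2_signed.(one_hot(y, K))
    Theta = X' \ Y
    return Theta, row_argmax(X'*Theta)
end

# Made-up data: first row is the constant feature, second row the scalar feature.
X = [1.0 1.0 1.0 1.0;
     -2.0 -1.0 1.0 2.0]
ycl = [1, 1, 2, 2]   # negative feature values are class 1, positive are class 2

Theta, yhat = ls_multiclass(X, ycl, 2)
yhat == ycl
```

The fitted scores are $-0.6x$ for class 1 and $0.6x$ for class 2, so every training example is assigned to its own class.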

**Iris flower multi-class classification**

In [102]:
using Random
D = iris_data();
setosa = D["setosa"];
versicolor = D["versicolor"];
virginica = D["virginica"];

# pick 3 random perms of 1-50 (one for each class)
idx1 = Random.randperm(50);
idx2 = Random.randperm(50);
idx3 = Random.randperm(50);

# training set is 40 randomly picked examples per class 
Xtrain =  [ setosa[     idx1[1:40],:];
            versicolor[ idx2[1:40],:];
            virginica[  idx3[1:40],:]   
          ]';
# add a new constant feature 
Xtrain = [ones(1,120); Xtrain];
#ylabels 
ytrain = [ones(40); 2*ones(40); 3*ones(40)];

# test set is all other examples not picked for training
Xtest =   [ setosa[     idx1[41:end],:];
            versicolor[ idx2[41:end],:];
            virginica[  idx3[41:end],:]   
          ]';
Xtest = [ones(1,30); Xtest];
ytest = [ones(10); 2*ones(10); 3*ones(10)];

Theta, yhat = ls_multiclass(Xtrain, ytrain, 3);

In [103]:
Theta

5×3 Matrix{Float64}:
 -0.810683   2.43811    -2.62743
  0.125552  -0.0473458  -0.0782058
  0.499109  -0.941416    0.442307
 -0.41321    0.346798    0.0664124
 -0.187698  -0.759335    0.947033

In [104]:
Ctrain = k_confusion_matrix(ytrain, yhat, 3)

3×3 Matrix{Float64}:
 40.0   0.0   0.0
  0.0  29.0  11.0
  0.0   4.0  36.0

In [105]:
yhat_test = row_argmax(Xtest'*Theta)
Ctest = k_confusion_matrix(ytest, yhat_test, 3)

3×3 Matrix{Float64}:
 10.0  0.0  0.0
  0.0  7.0  3.0
  0.0  2.0  8.0

In [106]:
error_train = error_rate(ytrain, yhat)

0.125

In [107]:
error_test = error_rate(ytest, yhat_test)

0.16666666666666666
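As a consistency check, both error rates can also be recovered from the confusion matrices, since the correctly classified examples sit on the diagonal. The sketch below copies the two matrices from the outputs above; `err_from_C` is a hypothetical helper name.

```julia
using LinearAlgebra

# Confusion matrices copied from the outputs above.
Ctrain = [40 0 0; 0 29 11; 0 4 36]
Ctest  = [10 0 0; 0 7 3; 0 2 8]

# Hypothetical helper: one minus the fraction of examples on the diagonal.
err_from_C(C) = 1 - sum(diag(C)) / sum(C)

e_train = err_from_C(Ctrain)   # 15 of 120 training examples misclassified
e_test  = err_from_C(Ctest)    # 5 of 30 test examples misclassified
(e_train, e_test)
```

Up to rounding, these agree with the values 0.125 and 0.1667 returned by `error_rate`.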