# Least squares classification


In [45]:
tf2int(b) = 2*b-1

tf2int (generic function with 1 method)

In [46]:
tf2int(true)

1

In [47]:
tf2int(false)

-1

In [48]:
b = [true, false, false]

3-element Vector{Bool}:
 1
 0
 0

In [49]:
tf2int.(b)

3-element Vector{Int64}:
  1
 -1
 -1

# Confusion matrix

Evaluate the prediction errors

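Given a set of data $y$ and predictions $\hat{y}$

- Both are stored as arrays (vectors) of Boolean values, of length $N$
- Count the errors and the correct predictions
- Arrange the four counts in a 2×2 confusion matrix, with rows indexed by the true value and columns by the prediction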
  

In [50]:
Ntp(y, yhat) = sum((y .== true) .& (yhat .== true))
Nfn(y, yhat) = sum((y .== true) .& (yhat .== false))
Nfp(y, yhat) = sum((y .== false) .& (yhat .== true))
Ntn(y, yhat) = sum((y .== false) .& (yhat .== false))
confusion_matrix(y, yhat) = [Ntp(y, yhat) Nfn(y, yhat); Nfp(y, yhat) Ntn(y, yhat)]
error_rate(y, yhat) = (Nfp(y, yhat)+Nfn(y, yhat))/length(y)

error_rate (generic function with 1 method)

In [51]:
y = [true, true, false]
yhat = [false, true, false]
Ntp(y, yhat)
confusion_matrix(y, yhat)

2×2 Matrix{Int64}:
 1  1
 0  1

In [52]:
y = rand(Bool, 100); yhat = rand(Bool, 100);
confusion_matrix(y, yhat)

2×2 Matrix{Int64}:
 22  20
 31  27

In [53]:
error_rate(y, yhat)

0.51

In [54]:
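# ftilde is the real-valued regression model; fhat thresholds it at zero.
# fhat reads theta and v from global scope, so both must be defined before it is called.
ftilde(x, theta, v) = x'*theta + v
fhat(x) = ftilde(x, theta, v) > 0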

fhat (generic function with 1 method)
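
As a minimal sketch of how the Boolean classifier behaves, the following evaluates it at two points. The values of `theta`, `v`, `x1`, and `x2` are illustrative assumptions, not coefficients fitted to any data.

```julia
# Illustrative coefficients (assumed values, not fitted to data)
theta = [1.0, -2.0]
v = 0.5

ftilde(x, theta, v) = x'*theta + v   # real-valued regression output
fhat(x) = ftilde(x, theta, v) > 0    # Boolean prediction: true when ftilde is positive

x1 = [2.0, 1.0]   # ftilde = 2 - 2 + 0.5 = 0.5
x2 = [0.0, 1.0]   # ftilde = 0 - 2 + 0.5 = -1.5
fhat(x1), fhat(x2)
```

The first point lands on the positive side of the decision boundary and the second on the negative side, so the pair evaluates to `(true, false)`.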

# Iris flower classification

The iris data set has **150 examples** of three types of iris flowers

There are 50 examples of each class

For each data point, four features are provided

The following code reads in a dictionary containing three $50 \times 4$ matrices

- setosa
- versicolor
- virginica

with the examples for each class, then computes a classifier that distinguishes **Iris Virginica** from the other two classes


In [55]:
using VMLS
using LinearAlgebra

D = iris_data()
X = [D["setosa"]; D["versicolor"]; D["virginica"]]

150×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 ⋮              
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8

In [56]:
# y[k] is true if virginica, false otherwise
# (rows 1-100 of X are setosa and versicolor, rows 101-150 are virginica)
y = [falses(100); trues(50)]

150-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 1
 1
 1
 1
 1
 1
 1
 1
 1

In [57]:
A = [ones(150) X]
theta = A \ (2*y .- 1 )

5-element Vector{Float64}:
 -2.3905637266512034
 -0.09175216910134672
  0.4055367711191063
  0.007975822012794512
  1.103558649867573

In [58]:
ytilde = A * theta 
yhat = ytilde .> 0
C = confusion_matrix(y, yhat)

2×2 Matrix{Int64}:
 46   4
  7  93

In [59]:
errorRate = (C[1, 2] + C[2, 1]) / length(y)


0.07333333333333333

In [60]:
avg(y .!= yhat)

0.07333333333333333
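
The same confusion matrix also gives the rates for each kind of error separately. The sketch below reuses the matrix printed above; the helper names `true_positive_rate`, `false_positive_rate`, and `error_rate_from_C` are hypothetical and are not part of VMLS.

```julia
# Confusion matrix from the Iris example above:
# rows are the true value (virginica, other), columns are the prediction
C = [46 4; 7 93]

# Hypothetical helper names, introduced only for this illustration
true_positive_rate(C)  = C[1, 1] / (C[1, 1] + C[1, 2])   # virginica examples correctly detected
false_positive_rate(C) = C[2, 1] / (C[2, 1] + C[2, 2])   # other examples flagged as virginica
error_rate_from_C(C)   = (C[1, 2] + C[2, 1]) / sum(C)    # all errors over all examples

true_positive_rate(C), false_positive_rate(C), error_rate_from_C(C)
```

For this matrix the true positive rate is 46/50, the false positive rate is 7/100, and the error rate is 11/150, which matches the value computed above.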

# Least squares multi-class classifier

A $K$-class classifier (with regression models) can be expressed as

\begin{align}
    \hat{f}(x) &= \operatorname*{argmax}_{k=1,2,\ldots,K} \tilde{f}_{k}(x) \\
    \text{where } \tilde{f}_{k}(x) &= x^T\theta^{(k)}
\end{align}

The $n$-vectors $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}$ are the coefficients or parameters in the model

**Express equations (1), (2) in matrix-vector notation**

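\begin{align}
    \hat{f}(x^{(i)}) = \operatorname{argmax}\left((x^{(i)})^T\Theta\right)
\end{align}

where $\Theta = \begin{bmatrix} \theta^{(1)} & \theta^{(2)} & \cdots & \theta^{(K)} \end{bmatrix}$ is the $n \times K$ matrix of model coefficients, and the argmax is taken over the $K$ entries of the row vector $(x^{(i)})^T\Theta$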
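
To see that the matrix form agrees with equations (1) and (2), the sketch below computes the predictions both ways on a small made-up example. The matrices `X` and `Theta` and the names `pred_loop` and `pred_matrix` are assumptions chosen for illustration.

```julia
# Made-up data: N = 3 examples (rows of X), n = 2 features, K = 3 classes
X = [1.0 2.0; 3.0 -1.0; 0.0 1.0]
Theta = [1.0 0.0 -1.0; 0.0 2.0 1.0]   # column k holds theta^(k)
N, K = size(X, 1), size(Theta, 2)

# Equations (1)-(2): evaluate every class model for every example
pred_loop = [argmax([X[i, :]' * Theta[:, k] for k in 1:K]) for i in 1:N]

# Matrix form: row i of X*Theta holds the K scores of example i
scores = X * Theta
pred_matrix = [argmax(scores[i, :]) for i in 1:N]

pred_loop, pred_matrix
```

Both approaches give the class indices `[2, 1, 2]` here, since row `i` of `X*Theta` contains exactly the values $\tilde{f}_1(x^{(i)}), \ldots, \tilde{f}_K(x^{(i)})$.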

In [65]:
# row_argmax(A) computes the index of the largest entry for each row
row_argmax(A) = [argmax(A[i, :]) for i =1:size(A, 1)]

row_argmax (generic function with 1 method)

In [62]:
A = randn(3,3)

3×3 Matrix{Float64}:
 -0.856962  -1.23766    1.30344
 -1.19168    0.169778  -1.08716
 -0.149733  -0.481873   0.321372

In [63]:
row_argmax(A)

3-element Vector{Int64}:
 3
 2
 3

In [66]:
# find the N-vector of predictions
fhat(X, theta) = row_argmax(X*theta)

fhat (generic function with 2 methods)

# Matrix least squares 

Use least squares to find the coefficient matrix $\Theta$ for a **multi-class classifier** with

- $n$ features and $K$ classes
- from a data set of $N$ examples

Data is given as an $N \times n$ matrix **X** and an $N$-vector **y**

- the entries $y^{(i)} \in \{1, \ldots, K\}$ give the classes of the examples

The least squares objective can be expressed as a matrix norm squared

$
\| X\Theta - Y \|^2
$

where Y is the $N \times K$ matrix with

$
Y_{ij} =
\begin{cases}
1 & y^{(i)} = j\\
-1 & y^{(i)} \ne j
\end{cases}
$

In other words, the rows of **Y** describe the classes using one-hot encoding, converted from 0/1 to -1/+1 values. 

The least squares solution is given by $\hat{\Theta} = X^{\dagger}Y$
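
As a small check of this formula, the sketch below compares `pinv(X) * Y` with Julia's backslash operator and with solving one ordinary least squares problem per column of $Y$. The data `X` and `Y` are made up for illustration, and the comparison assumes that `X` has linearly independent columns.

```julia
using LinearAlgebra

# Made-up data: N = 4 examples, n = 2 (a constant feature plus one feature), K = 2 classes
X = [1.0 0.0; 1.0 1.0; 1.0 2.0; 1.0 3.0]
Y = [1.0 -1.0; 1.0 -1.0; -1.0 1.0; -1.0 1.0]   # targets in -1/+1 encoding

Theta_pinv = pinv(X) * Y    # formula from the text
Theta_bs = X \ Y            # backslash solves the same least squares problem

# The matrix problem separates into K independent problems, one per column of Y
Theta_cols = hcat([X \ Y[:, k] for k in 1:size(Y, 2)]...)

Theta_pinv ≈ Theta_bs, Theta_bs ≈ Theta_cols
```

The three results agree up to rounding, because the squared matrix norm is the sum of the squared norms of the columns of $X\Theta - Y$, so each column of $\hat{\Theta}$ can be fitted on its own.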

In [10]:
function one_hot(y, K=3)
    N = length(y)
    Y = zeros(N, K)
    for j in 1:K # for each column j
        Y[findall(y .== j), j] .= 1
    end
    return Y 
end
K = 3
y = rand(1:K, 6)

6-element Vector{Int64}:
 3
 1
 3
 1
 1
 2

In [11]:
one_hot(y)

6×3 Matrix{Float64}:
 0.0  0.0  1.0
 1.0  0.0  0.0
 0.0  0.0  1.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 0.0  1.0  0.0
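
An alternative to the loop in `one_hot` is a single broadcast comparison of the labels against the row `1:K`. This is a sketch of an equivalent construction rather than the notebook's own method; the names `onehot` and `Ypm` are assumptions.

```julia
y = [3, 1, 3, 1, 1, 2]   # class labels from the example above
K = 3

# Comparing the N-vector y with the 1 x K row (1:K)' broadcasts to an N x K BitMatrix
onehot = y .== (1:K)'

# Convert the 0/1 entries to -1/+1
Ypm = 2 .* onehot .- 1
```

Each row of `Ypm` contains a single `1` in the column of the example's class and `-1` elsewhere.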

In [13]:
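function multi_classifier(X, y, K)
    Y = 2*one_hot(y, K) .- 1      # N x K target matrix with entries -1/+1
    Theta = X \ Y                 # least squares fit, one column per class
    yhat = row_argmax(X*Theta)    # predicted class for each training example
    return Theta, yhat
end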

multi_classifier (generic function with 1 method)

# Iris flower classification 

We compute a 3-class classifier for the Iris flower data set. 

We split the data set of 150 examples into 

- a training set of 120 examples (40 per category)
- and a test set of 30 (10 per category)

The code calls the function we defined above

In [15]:
using LinearAlgebra
using VMLS
D = iris_data()



Dict{String, Matrix{Float64}} with 3 entries:
  "virginica"  => [6.3 3.3 6.0 2.5; 5.8 2.7 5.1 1.9; … ; 6.2 3.4 5.4 2.3; 5.9 3…
  "setosa"     => [5.1 3.5 1.4 0.2; 4.9 3.0 1.4 0.2; … ; 5.3 3.7 1.5 0.2; 5.0 3…
  "versicolor" => [7.0 3.2 4.7 1.4; 6.4 3.2 4.5 1.5; … ; 5.1 2.5 3.0 1.1; 5.7 2…

In [29]:
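# C[i, j] counts the examples with true class i that were predicted as class j.
# Because K has a default value, this also replaces the two-argument Boolean version defined earlier.
function confusion_matrix(y, yhat, K=3)
    C = zeros(K, K)
    for i = 1:K
        for j = 1:K
            C[i, j] = sum((y .== i) .& (yhat .== j))
        end
    end
    return C
end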

confusion_matrix (generic function with 2 methods)

In [49]:
setosa = D["setosa"]
versicolor = D["versicolor"]
virginica = D["virginica"]

using Random
I1 = Random.randperm(50)
I2 = Random.randperm(50)
I3 = Random.randperm(50)

Train1 = setosa[I1[1:40], :]
Train2 = versicolor[I2[1:40], :]
Train3 = virginica[I3[1:40], :]

XTrain = [Train1; Train2; Train3]
XTrain = [ones(120) XTrain]

yTrain = [ones(40); 2*ones(40); 3*ones(40)]
Theta, yhat = multi_classifier(XTrain, yTrain, 3)

train_confusion = confusion_matrix(yTrain, yhat)

function error_rate(C)
    return 1 - sum(diag(C))/sum(C)
end

function error_rate(y, yhat)
    return 1 - sum(y .== yhat) / length(y)
end

error_rate(train_confusion), error_rate(yTrain, yhat)

(0.14166666666666672, 0.14166666666666672)

In [59]:
Test1 = setosa[I1[41:50], :]
Test2 = versicolor[I2[41:50], :]
Test3 = virginica[I3[41:50], :]

XTest = [Test1; Test2; Test3]
XTest = [ones(30) XTest]

yTest = [ones(10); 2*ones(10); 3*ones(10)]
ytestHat = row_argmax(XTest*Theta )


30-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 2
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 2
 3

In [64]:
Ctest = confusion_matrix(yTest, ytestHat)
error_rate(Ctest)

0.16666666666666663

In [66]:
Ctest = confusion_matrix(yTest, ytestHat)

3×3 Matrix{Float64}:
 9.0  1.0  0.0
 0.0  7.0  3.0
 0.0  1.0  9.0
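
The test confusion matrix can also be broken down by class. The sketch below reuses the matrix printed above; the names `per_class_accuracy` and `overall_error` are assumptions. Since the train/test split is random, these numbers change from run to run.

```julia
using LinearAlgebra

# Test confusion matrix from the output above (rows: true class, columns: predicted class)
Ctest = [9.0 1.0 0.0; 0.0 7.0 3.0; 0.0 1.0 9.0]

# Fraction of each true class that was predicted correctly
per_class_accuracy = diag(Ctest) ./ vec(sum(Ctest, dims=2))

# Same formula as error_rate(C): off-diagonal entries over all entries
overall_error = 1 - sum(diag(Ctest)) / sum(Ctest)

per_class_accuracy, overall_error
```

For this split, setosa and virginica are each classified correctly 9 times out of 10 and versicolor 7 times out of 10, giving 5 errors in 30 test examples.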