## Classification
Put simply, classification is the task of predicting a label for a given observation. For example, given a physical description of an animal, your task is to classify it as either a dog or a cat. Here, we will classify iris flowers.

We will try several classifiers in this notebook and compare them at the end. To get it out of the way, we define our accuracy function now: a simple function that returns the ratio of correctly classified observations to the total number of predictions.

In [2]:
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)

findaccuracy (generic function with 1 method)
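
As a quick sanity check, here is a minimal sketch of the accuracy function on made-up vectors. The values in `predicted` and `truth` are illustrative assumptions, not iris data.

```julia
# Same definition as above, repeated so that the snippet runs on its own
findaccuracy(predictedvals, groundtruthvals) = sum(predictedvals .== groundtruthvals) / length(groundtruthvals)

predicted = [1, 2, 3, 3, 2]   # hypothetical predictions
truth     = [1, 2, 3, 2, 2]   # hypothetical ground truth
acc = findaccuracy(predicted, truth)   # 4 of the 5 entries agree
```

Note that the function compares the two vectors element by element, so both must have the same length and the same ordering of observations.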

In [3]:
#installing the packages that we still need
using Pkg
# Pkg.add("GLMNet")
# Pkg.add("DecisionTree")
# Pkg.add("NearestNeighbors")
# Pkg.add("Random")
# Pkg.add("DataStructures")
# Pkg.add("LIBSVM")

In [4]:
using GLMNet
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using LinearAlgebra
using DataStructures
using LIBSVM

Get the data first

In [6]:
iris = dataset("datasets", "iris") #measurements on flowers

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |


In [73]:
X = Matrix(iris[:,1:4]) #take the first four columns of the dataframe as a matrix (to pass to the classification methods)
irislabels = iris[:,5] #the labels
unique(irislabels) #there are three different species

3-element Vector{String}:
 "setosa"
 "versicolor"
 "virginica"

In [8]:
X #our matrix

150×4 Matrix{Float64}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 ⋮              
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8

In [9]:
#converting the string flower type into numbers from 1 - 3
irislabelsmap = labelmap(irislabels) 
y = labelencode(irislabelsmap, irislabels)

150-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3

In classification, we often want to use some of the data to fit a model, and the rest of the data to validate (commonly known as `training` and `testing` data). We will get this data ready now so that we can easily use it in the rest of this notebook.

In [11]:
# for each type of flower, some data will be put in training and some in testing
function perclass_splits(y,at) 
    uids = unique(y)
    keepids = []
    for ui in uids
        curids = findall(y.==ui)
        rowids = randsubseq(curids, at)  #Bernoulli sampling: each id is kept with probability `at`
        push!(keepids,rowids...)
    end
    return keepids
end

perclass_splits (generic function with 1 method)

In [12]:
?randsubseq

search: randsubseq randsubseq! randcycle Transpose subset transpose! randstring



```
randsubseq([rng=default_rng(),] A, p) -> Vector
```

Return a vector consisting of a random subsequence of the given array `A`, where each element of `A` is included (in order) with independent probability `p`. (Complexity is linear in `p*length(A)`, so this function is efficient even if `p` is small and `A` is large.) Technically, this process is known as "Bernoulli sampling" of `A`.

# Examples

```jldoctest
julia> randsubseq(Xoshiro(123), 1:8, 0.3)
2-element Vector{Int64}:
 4
 7
```


In [13]:
trainids = perclass_splits(y,0.7) #roughly 70% of the data goes to training and the rest to testing
testids = setdiff(1:length(y),trainids) #the remaining ids form the testing set (49 elements in this run)

49-element Vector{Any}:
   4
   6
   8
  13
  15
  17
  22
  23
  25
  26
  27
  30
  31
   ⋮
 112
 113
 115
 117
 118
 120
 122
 124
 128
 130
 140
 150
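
Because `randsubseq` keeps each id independently with probability `at`, the training set size varies from run to run (101 observations here rather than exactly 105). If a fixed per-class count is preferred, one possible variant shuffles each class and takes the first `at` fraction. The function name `perclass_splits_exact`, the seeded generator, and the toy labels below are assumptions made for this sketch.

```julia
using Random

# Hypothetical variant of perclass_splits: exact per-class counts instead of Bernoulli sampling
function perclass_splits_exact(rng, y, at)
    keepids = Int[]
    for ui in unique(y)
        curids = findall(y .== ui)                        # ids belonging to this class
        ntrain = round(Int, at * length(curids))          # how many of them go to training
        append!(keepids, shuffle(rng, curids)[1:ntrain])  # random subset of exactly that size
    end
    return sort(keepids)
end

rng = MersenneTwister(42)            # seeded so that the split is reproducible
ytoy = repeat([1, 2, 3], inner=10)   # toy labels: 10 observations per class
trainids_toy = perclass_splits_exact(rng, ytoy, 0.7)
testids_toy = setdiff(1:length(ytoy), trainids_toy)
```

With 10 observations per class and `at = 0.7`, every class contributes 7 training and 3 testing observations, whichever seed is used.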

We need one more helper: a function that maps a continuous predicted value to the nearest class label (1, 2, or 3).

In [15]:
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1,2,3]))

assign_class (generic function with 1 method)
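
A few made-up values show how this behaves; the numbers in `raw` are illustrative and not taken from any model in this notebook.

```julia
# Same definition as above, repeated so that the snippet runs on its own
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1, 2, 3]))

raw = [0.87, 1.49, 2.42, 2.68, 3.4]   # hypothetical continuous predictions
classes = assign_class.(raw)          # nearest label for each value
```

Values below 1 or above 3 are still mapped to the closest valid label, so the result is always 1, 2, or 3.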

### 🟣 Method 1: Lasso

In [17]:
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids]) #cross validation function which picks the best lambda

Least Squares GLMNet Cross Validation
57 models for 4 predictors in 10 folds
Best λ 0.006 (mean loss 0.048, std 0.008)

In [18]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids]) 
cv = glmnetcv(X[trainids,:], y[trainids])
mylambda = path.lambda[argmin(cv.meanloss)] #we are picking the one with the minimum mean loss

            #training data    #labels   #specifying the lambda
path = glmnet(X[trainids,:], y[trainids],lambda=[mylambda]);

In [19]:
q = X[testids,:]; #testing data
predictions_lasso = GLMNet.predict(path,q)

49×1 Matrix{Float64}:
 0.9843015599733884
 1.0847986979494137
 0.9621092120213237
 0.9118606430333109
 0.8729895502098545
 1.0251520767396272
 1.0697702859792304
 0.8727560368743805
 1.0217558332311105
 1.00661066459319
 1.1068742892337415
 0.9918157659584803
 0.9992132152758351
 ⋮
 2.6842883449705375
 2.8217727295333375
 2.9717011398232174
 2.626992596668381
 3.0064597095788446
 2.4168337818300296
 2.6821709853983817
 2.559715012805765
 2.5375226648537
 2.5418741406657492
 2.799463624913536
 2.5673459754585934

In [20]:
predictions_lasso = assign_class.(predictions_lasso) 
findaccuracy(predictions_lasso,y[testids]) #find the accuracy -- 95.92% of the test labels were predicted correctly

0.9591836734693877
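
The underlying idea is to treat the integer labels as a regression target and then snap each continuous prediction to the nearest label. The sketch below shows that idea with ordinary least squares instead of lasso (no penalty, no `GLMNet`), on a tiny made-up data set; `Xtoy`, `ytoy`, and `A` are assumptions for illustration.

```julia
# Same helper as above, repeated so that the snippet runs on its own
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1, 2, 3]))

# Toy data: two features, three classes, two observations per class
Xtoy = [1.0 0.2; 1.2 0.3; 3.0 1.1; 3.2 1.2; 5.0 2.0; 5.1 2.2]
ytoy = [1, 1, 2, 2, 3, 3]

A = hcat(ones(size(Xtoy, 1)), Xtoy)   # add an intercept column
beta = A \ ytoy                       # least-squares coefficients
raw = A * beta                        # continuous predictions
pred = assign_class.(raw)             # snap to the nearest class label
```

This only makes sense when the label order is meaningful for the features, as it roughly is here (class 2 lies between classes 1 and 3).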

### 🟣 Method 2: Ridge
We will use the same function but set alpha to zero.

In [22]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0); #alpha=0 gives ridge; the default alpha=1 gives lasso
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0,lambda=[mylambda]);
q = X[testids,:];
predictions_ridge = GLMNet.predict(path,q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge,y[testids]) #accuracy decreased!

0.9387755102040817

### 🟣 Method 3: Elastic Net
We will use the same function but set alpha to 0.5 (it's the combination of lasso and ridge).
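
For reference, the penalized least-squares objective of the glmnet family is usually written as below. The notation is introduced here and is not part of the original notebook: $N$ is the number of training observations, $x_i$ the feature vector of observation $i$, $y_i$ its encoded label, and $\beta_0$, $\beta$ the intercept and coefficients.

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
$$

Setting $\alpha = 1$ leaves only the $\ell_1$ term (lasso), $\alpha = 0$ leaves only the squared $\ell_2$ term (ridge), and $\alpha = 0.5$ weights the two equally.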

In [24]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0.5)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0.5,lambda=[mylambda]);
q = X[testids,:];
predictions_EN = GLMNet.predict(path,q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN,y[testids]) 
#this gives us the same as lasso

0.9591836734693877

### 🟣 Method 4: Decision Trees
We will use the `DecisionTree` package here.

In [26]:
#different package now!

#building our tree now
model = DecisionTreeClassifier(max_depth=2) 

#run the fit function and use the training data and training label
DecisionTree.fit!(model, X[trainids,:], y[trainids])

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [1, 2, 3]
root:                     Decision Tree
Leaves: 3
Depth:  2

In [27]:
q = X[testids,:]; #put testing data here
predictions_DT = DecisionTree.predict(model, q) #run the prediction
findaccuracy(predictions_DT,y[testids]) #accuracy is around 93.9%

0.9387755102040817
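
A tree with depth 2 and 3 leaves amounts to two nested threshold tests. The sketch below shows what such a predictor could look like; the chosen features and the thresholds `2.45` and `1.75` are assumptions for illustration and are not read from the fitted model above.

```julia
# Hypothetical depth-2 tree with 3 leaves; x = [SepalLength, SepalWidth, PetalLength, PetalWidth]
function toy_tree_predict(x)
    petal_length, petal_width = x[3], x[4]
    if petal_length < 2.45       # first split (assumed threshold)
        return 1                 # leaf: class 1
    elseif petal_width < 1.75    # second split (assumed threshold)
        return 2                 # leaf: class 2
    else
        return 3                 # leaf: class 3
    end
end

rows = [[5.1, 3.5, 1.4, 0.2],   # first row of X
        [6.0, 2.9, 4.5, 1.5],   # illustrative mid-sized flower
        [5.9, 3.0, 5.1, 1.8]]   # last row of X
preds = toy_tree_predict.(rows)
```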

### 🟣 Method 5: Random Forests
The `RandomForestClassifier` is available through the `DecisionTree` package as well.

In [29]:
#we are using decision tree package again
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

RandomForestClassifier
n_trees:             20
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           -1
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             [1, 2, 3]
ensemble:            Ensemble of Decision Trees
Trees:      20
Avg Leaves: 6.15
Avg Depth:  4.4

In [30]:
q = X[testids,:];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF,y[testids])

0.9183673469387755

### 🟣 Method 6: Using a Nearest Neighbor method
We will use the `NearestNeighbors` package here.

In [32]:
Xtrain = X[trainids,:] #use this to build the kdtree
ytrain = y[trainids]
kdtree = KDTree(Xtrain') #build kdtree

KDTree{StaticArraysCore.SVector{4, Float64}, Euclidean, Float64, StaticArraysCore.SVector{4, Float64}}
  Number of points: 101
  Dimensions: 4
  Metric: Euclidean(0.0)
  Reordered: true

In [33]:
queries = X[testids,:]

49×4 Matrix{Float64}:
 4.6  3.1  1.5  0.2
 5.4  3.9  1.7  0.4
 5.0  3.4  1.5  0.2
 4.8  3.0  1.4  0.1
 5.8  4.0  1.2  0.2
 5.4  3.9  1.3  0.4
 5.1  3.7  1.5  0.4
 4.6  3.6  1.0  0.2
 4.8  3.4  1.9  0.2
 5.0  3.0  1.6  0.2
 5.0  3.4  1.6  0.4
 4.7  3.2  1.6  0.2
 4.8  3.1  1.6  0.2
 ⋮              
 6.4  2.7  5.3  1.9
 6.8  3.0  5.5  2.1
 5.8  2.8  5.1  2.4
 6.5  3.0  5.5  1.8
 7.7  3.8  6.7  2.2
 6.0  2.2  5.0  1.5
 5.6  2.8  4.9  2.0
 6.3  2.7  4.9  1.8
 6.1  3.0  4.9  1.8
 7.2  3.0  5.8  1.6
 6.9  3.1  5.4  2.1
 5.9  3.0  5.1  1.8

In [34]:
#for each query, it takes the five closest elements
idxs, dists = knn(kdtree, queries', 5, true)

([[3, 30, 24, 6, 21], [13, 8, 29, 31, 14], [25, 32, 1, 12, 18], [2, 7, 30, 21, 3], [11, 13, 8, 23, 20], [8, 14, 23, 20, 31], [14, 31, 12, 4, 1], [5, 3, 26, 4, 22], [9, 16, 25, 21, 7], [21, 7, 2, 30, 32]  …  [95, 75, 92, 76, 99], [91, 100, 83, 86, 74], [85, 71, 73, 89, 79], [47, 54, 99, 75, 87], [95, 75, 92, 45, 54], [82, 99, 47, 87, 54], [92, 82, 45, 54, 87], [81, 84, 72, 80, 78], [98, 94, 78, 93, 80], [92, 95, 45, 54, 82]], [[0.24494897427831802, 0.26457513110645925, 0.29999999999999954, 0.2999999999999997, 0.3000000000000007], [0.33166247903553986, 0.3464101615137753, 0.3741657386773947, 0.38729833462074226, 0.38729833462074226], [0.09999999999999964, 0.14142135623730964, 0.17320508075688762, 0.1999999999999999, 0.22360679774997916], [0.1414213562373099, 0.17320508075688812, 0.19999999999999998, 0.20000000000000034, 0.264575131106459], [0.5477225575051664, 0.5567764362830022, 0.5830951894845297, 0.5916079783099616, 0.6855654600401041], [0.3464101615137753, 0.38729833462074226, 0.4582

In [35]:
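# c is a 5×49 matrix: column i holds the training labels of the 5 nearest neighbors of query i
c = ytrain[hcat(idxs...)] 
# count how often each label occurs among the neighbors of each query
possible_labels = map(i->counter(c[:,i]),1:size(c,2)) 
# predict the most frequent label (the string round trip converts the key back to an Int)
predictions_NN = map(i->parse(Int,string(string(argmax(possible_labels[i])))),1:size(c,2))
findaccuracy(predictions_NN,y[testids])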

0.9795918367346939
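
The same vote can be written without a k-d tree. The sketch below is a brute-force version in plain Julia on made-up data: it computes all distances, takes the `k` smallest, and returns the most frequent label. The helper names `majority_label` and `knn_predict`, the toy data, and the tie-breaking rule are assumptions for illustration.

```julia
using LinearAlgebra

# Most frequent label; ties go to the smallest label (an arbitrary choice for this sketch)
function majority_label(labels)
    candidates = sort(unique(labels))
    counts = [count(==(c), labels) for c in candidates]
    return candidates[argmax(counts)]
end

# Brute-force k nearest neighbors for a single query point (rows of Xtrain are observations)
function knn_predict(Xtrain, ytrain, query, k)
    dists = [norm(Xtrain[i, :] .- query) for i in 1:size(Xtrain, 1)]  # Euclidean distances
    nearest = partialsortperm(dists, 1:k)                             # indices of the k closest rows
    return majority_label(ytrain[nearest])
end

Xtoy = [1.0 1.0; 1.1 0.9; 0.9 1.2; 5.0 5.0; 5.2 4.9; 4.8 5.1]
ytoy = [1, 1, 1, 2, 2, 2]
pred_a = knn_predict(Xtoy, ytoy, [1.0, 1.1], 3)   # query near the first cluster
pred_b = knn_predict(Xtoy, ytoy, [5.1, 5.0], 3)   # query near the second cluster
```

The brute-force search costs one distance computation per training row and query, which is fine for 150 flowers; a `KDTree` pays off on larger data sets.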

### 🟣 Method 7: Support Vector Machines
We will use the `LIBSVM` package here.

In [37]:
Xtrain = X[trainids,:]
ytrain = y[trainids]

101-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3

In [38]:
model = svmtrain(Xtrain', ytrain)

LIBSVM.SVM{Int64, LIBSVM.Kernel.KERNEL}(SVC, LIBSVM.Kernel.RadialBasis, nothing, 4, 101, 3, [1, 2, 3], Int32[1, 2, 3], Float64[], Int32[], LIBSVM.SupportVectors{Vector{Int64}, Matrix{Float64}}(34, Int32[5, 15, 14], [1, 1, 1, 1, 1, 2, 2, 2, 2, 2  …  3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [4.3 5.7 … 6.3 6.5; 3.0 4.4 … 2.5 3.0; 1.1 1.5 … 5.0 5.2; 0.1 0.4 … 1.9 2.0], Int32[10, 11, 16, 27, 29, 33, 35, 36, 39, 43  …  82, 85, 87, 88, 91, 92, 94, 95, 99, 100], LIBSVM.SVMNode[LIBSVM.SVMNode(1, 4.3), LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 4.5), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 7.0), LIBSVM.SVMNode(1, 6.5), LIBSVM.SVMNode(1, 4.9), LIBSVM.SVMNode(1, 5.0), LIBSVM.SVMNode(1, 5.6)  …  LIBSVM.SVMNode(1, 6.2), LIBSVM.SVMNode(1, 7.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 6.4), LIBSVM.SVMNode(1, 6.0), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 5.8), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.5)]), 0.0, [0.0 0.041885424956808255; 0.3804450072984313 

In [39]:
predictions_SVM, decision_values = svmpredict(model, X[testids,:]')
findaccuracy(predictions_SVM,y[testids])

0.9795918367346939

Putting all the results together:

In [41]:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)

7×2 Matrix{Any}:
 "lasso"  0.959184
 "ridge"  0.938776
 "EN"     0.959184
 "DT"     0.938776
 "RF"     0.918367
 "kNN"    0.979592
 "SVM"    0.979592

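To read the comparison more easily, the methods can be ranked by accuracy. The sketch below copies the accuracies printed above into literal vectors so that it runs on its own; the variable names `method_names`, `accuracies`, `order`, and `ranked` are introduced here for illustration.

```julia
method_names = ["lasso", "ridge", "EN", "DT", "RF", "kNN", "SVM"]
accuracies = [0.959184, 0.938776, 0.959184, 0.938776, 0.918367, 0.979592, 0.979592]

order = sortperm(accuracies, rev=true)                # indices from best to worst
ranked = hcat(method_names[order], accuracies[order]) # two-column table, best method first
```

With only 49 test observations, a single misclassified flower changes the accuracy by about 2 percentage points, so small gaps between methods should not be over-interpreted.
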
# Finally...
After finishing this notebook, you should be able to:
- [ ] split your data into training and testing data to test the effectiveness of a certain method
- [ ] apply a simple accuracy function to test the effectiveness of a certain method
- [ ] run multiple classification algorithms:
    - [ ] LASSO
    - [ ] Ridge
    - [ ] ElasticNet
    - [ ] Decision Tree
    - [ ] Random Forest
    - [ ] Nearest Neighbors
    - [ ] Support Vector Machines

# 🥳 One cool finding

We ran several classification methods on the `iris` dataset, which contains measurements for three species of iris flowers. We split the data into training and testing sets and applied each method. Because the split is random and no seed is set, the exact scores change from run to run. Here is the scoreboard for the run reported above:

| method | accuracy score |
|---|---|
| lasso  |0.959184|
| ridge  |0.938776|
| EN     |0.959184|
| DT     |0.938776|
| RF     |0.918367|
| kNN    |0.979592|
| SVM    |0.979592|