## Classification
Put simply, classification is the task of predicting a label for a given observation. For example: you are given certain physical descriptions of an animal, and your task is to classify it as either a dog or a cat. Here, we will classify iris flowers.

We will try several classifiers and compare them at the end of this notebook. To get it out of the way, we define our accuracy function now: it returns the ratio of correctly classified observations to the total number of predictions.

In [31]:
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)

findaccuracy (generic function with 1 method)
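
As a quick sanity check, the sketch below applies `findaccuracy` to two short hand-made vectors. The `toy_*` values are illustrative assumptions, not taken from the iris data. Three of the four predictions match, so the ratio should be 0.75.

```julia
# Same definition as in the cell above.
findaccuracy(predictedvals, groundtruthvals) =
    sum(predictedvals .== groundtruthvals) / length(groundtruthvals)

# Hypothetical predictions and ground-truth labels, for illustration only.
toy_predictions = [1, 2, 3, 3]
toy_truth       = [1, 2, 3, 2]

toy_accuracy = findaccuracy(toy_predictions, toy_truth)
println(toy_accuracy)  # 3 matches out of 4, prints 0.75
```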

In [41]:
using GLMNet
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using LinearAlgebra
using DataStructures
using LIBSVM

First, load the data.

In [42]:
iris = dataset("datasets", "iris")

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Categorical… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |


In [28]:
X = Matrix(iris[:,1:4])
irislabels = iris[:,5]

150-element CategoricalArray{String,1,UInt8}:
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 "setosa"
 ⋮
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [43]:
X

150×4 Array{Float64,2}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 ⋮              
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8

In [5]:
irislabelsmap = labelmap(irislabels)
y = labelencode(irislabelsmap, irislabels)

150-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
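
The encoding step turns each species name into an integer code. The sketch below shows the idea with a plain `Dict` instead of `MLBase`. The `toy_*` names and the choice to number labels in order of first appearance are assumptions made for illustration, not a description of how `labelmap` is implemented.

```julia
# Hypothetical labels, for illustration only.
toy_labels = ["setosa", "versicolor", "setosa", "virginica", "versicolor"]

# Assign each distinct label an integer code in order of first appearance.
toy_lookup = Dict(label => code for (code, label) in enumerate(unique(toy_labels)))

# Encode every label by looking up its integer code.
toy_encoded = [toy_lookup[label] for label in toy_labels]
println(toy_encoded)  # prints [1, 2, 1, 3, 2]
```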

In classification, we usually fit a model on part of the data and evaluate it on the rest (commonly known as `training` and `testing` data). We prepare this split now so that we can reuse it throughout the notebook.

In [44]:
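function perclass_splits(y,at)
    uids = unique(y)
    keepids = Int[]                      # row indices kept for training
    for ui in uids
        curids = findall(y.==ui)         # rows belonging to class ui
        rowids = randsubseq(curids, at)  # keep each row with probability at
        append!(keepids, rowids)
    end
    return keepids
end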

perclass_splits (generic function with 1 method)

In [55]:
?randsubseq

search: randsubseq randsubseq! RandomSub StratifiedRandomSub



```
randsubseq([rng=GLOBAL_RNG,] A, p) -> Vector
```

Return a vector consisting of a random subsequence of the given array `A`, where each element of `A` is included (in order) with independent probability `p`. (Complexity is linear in `p*length(A)`, so this function is efficient even if `p` is small and `A` is large.) Technically, this process is known as "Bernoulli sampling" of `A`.

# Examples

```jldoctest
julia> rng = MersenneTwister(1234);

julia> randsubseq(rng, collect(1:8), 0.3)
2-element Array{Int64,1}:
 7
 8
```


In [54]:
trainids = perclass_splits(y,0.7)
testids = setdiff(1:length(y),trainids)

52-element Array{Int64,1}:
   1
   3
   5
   6
   8
  10
  14
  17
  19
  21
  22
  24
  28
   ⋮
 121
 124
 126
 127
 132
 136
 139
 143
 144
 146
 148
 150
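
Because `randsubseq` keeps each row independently with probability `at`, the size of the training set varies from run to run (here 52 of the 150 rows ended up in the test set rather than 45). If a fixed fraction per class is preferred, one alternative is to shuffle each class and take an exact count. The helper `exact_perclass_splits` below is a hypothetical sketch of that alternative, not the function this notebook uses.

```julia
using Random

# Hypothetical helper: take exactly round(at * n) training rows from each class.
function exact_perclass_splits(y, at; rng = Random.default_rng())
    keepids = Int[]
    for label in unique(y)
        classids = findall(y .== label)              # rows of this class
        ntrain = round(Int, at * length(classids))   # exact number to keep
        append!(keepids, shuffle(rng, classids)[1:ntrain])
    end
    return sort(keepids)
end

# Toy labels: 10 observations in each of 3 classes (illustrative only).
toy_y = repeat(1:3, inner = 10)
toy_trainids = exact_perclass_splits(toy_y, 0.7; rng = MersenneTwister(1))
toy_testids = setdiff(1:length(toy_y), toy_trainids)
println((length(toy_trainids), length(toy_testids)))  # prints (21, 9)
```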

We need one more function: when a method returns continuous predictions, it assigns each predicted value to the nearest class label (1, 2, or 3).

In [56]:
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1,2,3]))

assign_class (generic function with 1 method)
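
To see what `assign_class` does, the sketch below applies it to a few continuous values. The `toy_predictions` values are made up for illustration. Each value is mapped to whichever of 1, 2, or 3 is closest.

```julia
# Same definition as in the cell above.
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1, 2, 3]))

# Hypothetical continuous predictions, for illustration only.
toy_predictions = [0.91, 1.16, 2.58, 3.03]
toy_classes = assign_class.(toy_predictions)
println(toy_classes)  # prints [1, 1, 3, 3]

# Ties go to the lower class, because argmin returns the first minimum.
println(assign_class(1.5))  # prints 1
```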

### 🟣 Method 1: Lasso

In [58]:
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])

Least Squares GLMNet Cross Validation
72 models for 4 predictors in 10 folds
Best λ 0.002 (mean loss 0.051, std 0.007)

In [59]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids])
cv = glmnetcv(X[trainids,:], y[trainids])
mylambda = path.lambda[argmin(cv.meanloss)]

path = glmnet(X[trainids,:], y[trainids],lambda=[mylambda]);

In [60]:
q = X[testids,:];
predictions_lasso = GLMNet.predict(path,q)

52×1 Array{Float64,2}:
 0.9118087620695705
 0.9562094597484808
 0.92498141235741
 1.0441948673224948
 0.9583283426087003
 0.9286090749989088
 0.9160189670188558
 0.933080059009638
 0.9450803140421167
 0.9500586892676182
 1.0420759844622753
 1.1627982192086392
 0.9236306997734071
 ⋮
 2.962019394073047
 2.579309062907373
 2.72734333596564
 2.564703011116998
 2.8181128576261925
 2.951048315563958
 2.5910483116926772
 2.768678663179472
 3.033533562603853
 2.8606076406038072
 2.7304360466212128
 2.6903411823016974

In [61]:
predictions_lasso = assign_class.(predictions_lasso)
findaccuracy(predictions_lasso,y[testids])

0.9615384615384616

### 🟣 Method 2: Ridge
We will use the same `glmnet` function but set `alpha` to zero.

In [62]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0,lambda=[mylambda]);
q = X[testids,:];
predictions_ridge = GLMNet.predict(path,q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge,y[testids])

0.9807692307692307

### 🟣 Method 3: Elastic Net
We will use the same `glmnet` function but set `alpha` to 0.5, which mixes the lasso (L1) and ridge (L2) penalties.

In [88]:
# choose the best lambda to predict with.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0.5)
mylambda = path.lambda[argmin(cv.meanloss)]
path = glmnet(X[trainids,:], y[trainids],alpha=0.5,lambda=[mylambda]);
q = X[testids,:];
predictions_EN = GLMNet.predict(path,q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN,y[testids])

0.9615384615384616
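
The three methods so far differ only in the penalty added to the least-squares loss. In the glmnet formulation, that penalty is λ·((1 − α)/2·‖β‖₂² + α·‖β‖₁), so `alpha=1` gives lasso, `alpha=0` gives ridge, and values in between give elastic net. The sketch below evaluates this penalty for a made-up coefficient vector. The function name `elastic_net_penalty` and the values of `β` and `λ` are assumptions for illustration and are not part of `GLMNet`.

```julia
# Hypothetical helper (not part of GLMNet): the elastic net penalty term
# λ * ((1 - α)/2 * ‖β‖₂² + α * ‖β‖₁) for a coefficient vector β.
elastic_net_penalty(β, λ, α) = λ * ((1 - α) / 2 * sum(abs2, β) + α * sum(abs, β))

β = [1.0, -2.0]   # illustrative coefficients
λ = 1.0           # illustrative penalty strength

println(elastic_net_penalty(β, λ, 1.0))  # lasso (L1 only), prints 3.0
println(elastic_net_penalty(β, λ, 0.0))  # ridge (L2 only), prints 2.5
println(elastic_net_penalty(β, λ, 0.5))  # elastic net mix, prints 2.75
```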

### 🟣 Method 4: Decision Trees
We will use the `DecisionTree` package here.

In [65]:
model = DecisionTreeClassifier(max_depth=2)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [1, 2, 3]
root:                     Decision Tree
Leaves: 3
Depth:  2

In [66]:
q = X[testids,:];
predictions_DT = DecisionTree.predict(model, q)
findaccuracy(predictions_DT,y[testids])

0.9230769230769231

### 🟣 Method 5: Random Forests
The `RandomForestClassifier` is available through the `DecisionTree` package as well.

In [67]:
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

RandomForestClassifier
n_trees:             20
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           -1
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             [1, 2, 3]
ensemble:            Ensemble of Decision Trees
Trees:      20
Avg Leaves: 5.35
Avg Depth:  3.75

In [68]:
q = X[testids,:];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF,y[testids])

0.9615384615384616

### 🟣 Method 6: Using a Nearest Neighbor method
We will use the `NearestNeighbors` package here.

In [17]:
Xtrain = X[trainids,:]
ytrain = y[trainids]
kdtree = KDTree(Xtrain')

KDTree{StaticArrays.SArray{Tuple{4},Float64,1,4},Euclidean,Float64}
  Number of points: 109
  Dimensions: 4
  Metric: Euclidean(0.0)
  Reordered: true

In [70]:
queries = X[testids,:]

52×4 Array{Float64,2}:
 5.1  3.5  1.4  0.2
 4.7  3.2  1.3  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 5.0  3.4  1.5  0.2
 4.9  3.1  1.5  0.1
 4.3  3.0  1.1  0.1
 5.4  3.9  1.3  0.4
 5.7  3.8  1.7  0.3
 5.4  3.4  1.7  0.2
 5.1  3.7  1.5  0.4
 5.1  3.3  1.7  0.5
 5.2  3.5  1.5  0.2
 ⋮              
 6.9  3.2  5.7  2.3
 6.3  2.7  4.9  1.8
 7.2  3.2  6.0  1.8
 6.2  2.8  4.8  1.8
 7.9  3.8  6.4  2.0
 7.7  3.0  6.1  2.3
 6.0  3.0  4.8  1.8
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.0  5.2  2.3
 6.5  3.0  5.2  2.0
 5.9  3.0  5.1  1.8

In [72]:
idxs, dists = knn(kdtree, queries', 5, true)

([[1, 17, 5, 32, 23], [3, 37, 4, 12, 7], [5, 1, 17, 33, 8], [6, 18, 38, 36, 19], [8, 32, 39, 1, 17], [10, 2, 25, 12, 39], [13, 31, 35, 9, 37], [16, 38, 28, 19, 6], [18, 6, 38, 20, 16], [20, 26, 23, 32, 38]  …  [91, 73, 94, 105, 86], [92, 89, 100, 108, 97], [95, 83, 98, 76, 91], [98, 94, 73, 88, 91], [100, 52, 92, 109, 89], [72, 104, 80, 87, 109], [105, 86, 90, 106, 102], [107, 103, 101, 79, 102], [77, 78, 82, 107, 56], [109, 100, 72, 104, 52]], [[0.0, 0.09999999999999998, 0.1414213562373093, 0.14142135623730964, 0.14142135623730995], [0.0, 0.14142135623730978, 0.24494897427831802, 0.26457513110645897, 0.264575131106459], [0.0, 0.1414213562373093, 0.17320508075688756, 0.17320508075688767, 0.22360679774997916], [0.0, 0.33166247903553986, 0.3605551275463989, 0.3872983346207422, 0.38729833462074226], [0.0, 0.09999999999999964, 0.14142135623730964, 0.17320508075688762, 0.1999999999999999], [0.0, 0.17320508075688784, 0.1732050807568881, 0.17320508075688815, 0.26457513110645875], [0.0, 0.2449

In [74]:
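# labels of the 5 nearest training points, one column per query
c = ytrain[hcat(idxs...)]
# count how often each label occurs among a query's neighbors
possible_labels = map(i->counter(c[:,i]),1:size(c,2))
# predict the most frequent label (majority vote)
predictions_NN = map(i->parse(Int,string(argmax(DataFrame(possible_labels[i])[1,:]))),1:size(c,2))
findaccuracy(predictions_NN,y[testids])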

0.9615384615384616
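
The prediction step above is a majority vote among the 5 nearest training points. The sketch below shows the same idea without `KDTree` or `counter`, using a brute-force distance computation on made-up 2-D data. The function `knn_predict` and the `toy_*` values are hypothetical, and ties are broken in favor of the smallest label, which is a choice made here for determinism.

```julia
# Hypothetical brute-force k-nearest-neighbor classifier, for illustration.
function knn_predict(Xtrain, ytrain, query, k)
    # Euclidean distance from the query to every training row.
    dists = [sqrt(sum(abs2, Xtrain[i, :] .- query)) for i in 1:size(Xtrain, 1)]
    # Labels of the k closest training points.
    neighborlabels = ytrain[sortperm(dists)[1:k]]
    # Count votes per candidate label; ties go to the smallest label.
    candidates = sort(unique(neighborlabels))
    votes = [count(neighborlabels .== c) for c in candidates]
    return candidates[argmax(votes)]
end

# Two well-separated toy clusters (illustrative values only).
toy_X = [1.0 1.0; 1.2 0.9; 0.8 1.1; 5.0 5.0; 5.1 4.9; 4.9 5.2]
toy_y = [1, 1, 1, 2, 2, 2]

println(knn_predict(toy_X, toy_y, [1.1, 1.0], 3))  # prints 1
println(knn_predict(toy_X, toy_y, [5.0, 5.1], 3))  # prints 2
```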

### 🟣 Method 7: Support Vector Machines
We will use the `LIBSVM` package here.

In [79]:
Xtrain = X[trainids,:]
ytrain = y[trainids]

98-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3
 3

In [80]:
model = svmtrain(Xtrain', ytrain)

LIBSVM.SVM{Int64}(SVC, LIBSVM.Kernel.RadialBasis, nothing, 4, 3, [1, 2, 3], Int32[1, 2, 3], Float64[], Int32[], LIBSVM.SupportVectors{Int64,Float64}(35, Int32[4, 16, 15], [1, 1, 1, 1, 2, 2, 2, 2, 2, 2  …  3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [5.7 4.8 … 6.9 6.3; 4.4 3.4 … 3.1 2.5; 1.5 1.9 … 5.1 5.0; 0.4 0.2 … 2.3 1.9], Int32[9, 13, 24, 26, 31, 33, 35, 36, 39, 41  …  78, 79, 80, 81, 84, 86, 89, 90, 95, 97], LIBSVM.SVMNode[LIBSVM.SVMNode(1, 5.7), LIBSVM.SVMNode(1, 4.8), LIBSVM.SVMNode(1, 4.5), LIBSVM.SVMNode(1, 5.1), LIBSVM.SVMNode(1, 7.0), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 4.9), LIBSVM.SVMNode(1, 5.0), LIBSVM.SVMNode(1, 6.1)  …  LIBSVM.SVMNode(1, 7.7), LIBSVM.SVMNode(1, 7.7), LIBSVM.SVMNode(1, 6.0), LIBSVM.SVMNode(1, 5.6), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 7.2), LIBSVM.SVMNode(1, 6.3), LIBSVM.SVMNode(1, 6.1), LIBSVM.SVMNode(1, 6.9), LIBSVM.SVMNode(1, 6.3)]), 0.0, [0.4703629508897133 0.9529273421609011; 0.36561894494384256 0.01507807475671298; … ; -0.2421

In [81]:
predictions_SVM, decision_values = svmpredict(model, X[testids,:]')
findaccuracy(predictions_SVM,y[testids])

0.9615384615384616

Putting all the results together:

In [83]:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)

7×2 Array{Any,2}:
 "lasso"  0.961538
 "ridge"  0.980769
 "EN"     0.961538
 "DT"     0.923077
 "RF"     0.961538
 "kNN"    0.961538
 "SVM"    0.961538

# Finally...
After finishing this notebook, you should be able to:
- [ ] split your data into training and testing data to test the effectiveness of a certain method
- [ ] apply a simple accuracy function to test the effectiveness of a certain method
- [ ] run multiple classification algorithms:
    - [ ] LASSO
    - [ ] Ridge
    - [ ] ElasticNet
    - [ ] Decision Tree
    - [ ] Random Forest
    - [ ] Nearest Neighbors
    - [ ] Support Vector Machines

# 🥳 One cool finding

We ran seven classification methods on the `iris` dataset, which contains three species of iris flowers. We split the data into training and testing sets, fit each method on the training set, and scored it on the testing set. Here is the scoreboard:

| method | accuracy score |
|---|---|
| lasso  |0.961538|
| ridge  |0.980769|
| EN     |0.961538|
| DT     |0.923077|
| RF     |0.961538|
| kNN    |0.961538|
| SVM    |0.961538|
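
When reading the scoreboard, it helps to remember that the test set has only 52 observations, so each misclassified flower changes the accuracy by roughly 1.9 percentage points. The sketch below converts the reported accuracies back into counts of correct predictions. The variable names are illustrative; the numbers are copied from the table above.

```julia
# Accuracies copied from the scoreboard; the test set has 52 observations.
methods_used = ["lasso", "ridge", "EN", "DT", "RF", "kNN", "SVM"]
accuracies = [0.961538, 0.980769, 0.961538, 0.923077, 0.961538, 0.961538, 0.961538]
ntest = 52

# Convert each accuracy back into a count of correct predictions.
ncorrect = round.(Int, accuracies .* ntest)  # [50, 51, 50, 48, 50, 50, 50]

for (name, n) in zip(methods_used, ncorrect)
    println(rpad(name, 6), n, " / ", ntest)
end
```

Seen this way, the best and worst methods differ by three test flowers, so a different random split could easily reorder the ranking.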