# Exercise 3: Classification 

Use the Stock Market data from Chapter 4 of the ISL book to perform classification with the following algorithms:

* LASSO
* Ridge
* Elastic Net
* Decision Tree
* Random Forest
* Nearest Neighbors
* Support Vector Machines (SVM)

You will need to do the following:

* Get the data.
* Clean it.
* Use Go or Julia to write or use the classification algorithms.
* Evaluate the accuracy of each prediction using a confusion matrix and a ROC curve.
* Submit your project following the submission guidelines.


In [1]:
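import Pkg
# Uncomment on first run to install the packages used below.
# Random and LinearAlgebra ship with Julia's standard library and need no installation.
# Pkg.add("GLMNet");
# Pkg.add("DataFrames");
# Pkg.add("CSV");
# Pkg.add("RDatasets");
# Pkg.add("DataStructures");
# Pkg.add("LIBSVM");
# Pkg.add("MLBase");
# Pkg.add("Plots");
# Pkg.add("DecisionTree");
# Pkg.add("Distances");
# Pkg.add("NearestNeighbors");
# Pkg.add("EvalMetrics");
# Pkg.add("PrettyTables");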

using GLMNet
using DataFrames
using CSV
using LinearAlgebra
using RDatasets
using MLBase
using Plots
using DecisionTree
using Distances
using NearestNeighbors
using Random
using DataStructures
using LIBSVM
using PrettyTables

### Useful functions

In [4]:
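# Map a continuous prediction to the nearest class label (1 or 2).
# Comparing against the scalar 1 alone would always return index 1, i.e. class 1 for every sample.
assign_class(predictedvalue) = argmin(abs.(predictedvalue .- [1, 2]))

# Fraction of predictions that match the ground truth.
findaccuracy(predictedvals,groundtruthvals) = sum(predictedvals.==groundtruthvals)/length(groundtruthvals)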

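# Sample roughly a fraction `at` of the rows within each distinct value of `years`
# and return the selected row indices.
function peryear_splits(years,at)
    uids = unique(years)
    keepids = Int[]
    for ui in uids
        curids = findall(years.==ui)
        rowids = randsubseq(curids, at)  # keeps each index independently with probability `at`
        append!(keepids,rowids)
    end
    return keepids
end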


peryear_splits (generic function with 1 method)
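The idea behind the two scoring helpers can be checked on a toy input. The following is only a minimal sketch, assuming two classes encoded as 1 and 2; `nearest_label` and `accuracy` are hypothetical names introduced here for illustration, and the numbers are made up.

```julia
# Hypothetical helpers for illustration only (not part of the notebook).
# Pick the class label closest to a continuous prediction.
nearest_label(v, labels=[1, 2]) = labels[argmin(abs.(v .- labels))]

# Fraction of predictions that equal the ground truth.
accuracy(pred, truth) = sum(pred .== truth) / length(truth)

raw = [0.9, 1.4, 1.6, 2.3]      # made-up continuous predictions
pred = nearest_label.(raw)      # broadcast over each prediction
truth = [1, 2, 2, 2]            # made-up ground truth

println(pred)                   # [1, 1, 2, 2]
println(accuracy(pred, truth))  # 0.75
```

Values below 1.5 land on class 1 and values above it on class 2, so three of the four toy predictions match.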

In [None]:
marketdf = CSV.read("./dat/Smarket.csv", DataFrame);
@show(marketdf)
@show(size(marketdf))


### Using only the Lag1 to Lag5 and Volume columns to predict Direction

In [7]:
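X = Matrix(marketdf[:,3:8])     # Lag1 through Lag5 and Volume
directions = marketdf[:,10]     # Direction

# Encode the Direction labels as integers (1 and 2) for the classifiers.
directions_map=labelmap(directions)
y = labelencode(directions_map, directions)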

1250-element Vector{Int64}:
 1
 1
 2
 1
 1
 1
 2
 1
 1
 1
 ⋮
 2
 2
 1
 1
 1
 2
 1
 2
 2

### Splitting the data: 70% for training and 30% for testing

In [8]:
# Stratify by year. Column 2 of marketdf is assumed to hold Year; X[:,2] would be Lag2.
trainids = peryear_splits(marketdf[:,2],0.7)
@show size(trainids)
testids = setdiff(1:size(X,1),trainids)
@show size(testids)

size(trainids) = (866,)
size(testids) = (384,)


(384,)
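Because `randsubseq` keeps each index independently with the given probability, the split sizes vary from run to run and only approximate 70/30 (866 rows here rather than 875). A minimal sketch of a reproducible variant follows, assuming an explicit seeded random number generator is acceptable; `grouped_split`, the seed, and the toy group labels are hypothetical and not part of the notebook.

```julia
using Random

# Hypothetical helper: sample roughly a fraction `at` of the row indices
# within each group, using an explicit RNG so the split can be reproduced.
function grouped_split(groups, at; rng=MersenneTwister(1234))
    train = Int[]
    for g in unique(groups)
        ids = findall(groups .== g)
        append!(train, randsubseq(rng, ids, at))
    end
    test = setdiff(1:length(groups), train)
    return train, test
end

groups = repeat(2001:2005, inner=50)   # made-up group labels, 250 rows
train, test = grouped_split(groups, 0.7)
println(length(train), " train rows, ", length(test), " test rows")
```

Calling `grouped_split` again with the default `rng` returns the same indices, which makes downstream accuracies comparable across runs.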

## 🟣 Method 1: Lasso

In [9]:
# Fit the lasso path (alpha defaults to 1), then pick the lambda with the lowest cross-validated loss.
path = glmnet(X[trainids,:],y[trainids])
cv = glmnetcv(X[trainids,:],y[trainids])
mylambda = cv.lambda[argmin(cv.meanloss)]
# Refit at the chosen lambda and predict on the test rows.
path = glmnet(X[trainids,:],y[trainids],lambda=[mylambda]);
q = X[testids,:];
predictions_lasso = GLMNet.predict(path,q)
predictions_lasso = assign_class.(predictions_lasso)
findaccuracy(predictions_lasso,y[testids])

0.5260416666666666

## 🟣 Method 2: Ridge

In [10]:
# Ridge uses alpha=0. Choose the lambda with the lowest cross-validated loss.
path = glmnet(X[trainids,:], y[trainids],alpha=0);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0)
mylambda = cv.lambda[argmin(cv.meanloss)]
# Refit at the chosen lambda and predict on the test rows.
path = glmnet(X[trainids,:], y[trainids],alpha=0,lambda=[mylambda]);
q = X[testids,:];
predictions_ridge = GLMNet.predict(path,q)
predictions_ridge = assign_class.(predictions_ridge)
findaccuracy(predictions_ridge,y[testids])

0.5260416666666666

## 🟣 Method 3: Elastic Net

In [11]:
# Elastic net mixes both penalties (alpha=0.5). Choose the lambda with the lowest cross-validated loss.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5);
cv = glmnetcv(X[trainids,:], y[trainids],alpha=0.5)
mylambda = cv.lambda[argmin(cv.meanloss)]
# Refit at the chosen lambda and predict on the test rows.
path = glmnet(X[trainids,:], y[trainids],alpha=0.5,lambda=[mylambda]);
q = X[testids,:];
predictions_EN = GLMNet.predict(path,q)
predictions_EN = assign_class.(predictions_EN)
findaccuracy(predictions_EN,y[testids])

0.5260416666666666

## 🟣 Method 4: Decision Trees

In [12]:
model = DecisionTreeClassifier(max_depth=2)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

DecisionTreeClassifier
max_depth:                2
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  [1, 2]
root:                     Decision Tree
Leaves: 4
Depth:  2

In [13]:
q = X[testids,:];
predictions_DT = DecisionTree.predict(model, q)
findaccuracy(predictions_DT,y[testids])

0.5052083333333334

## 🟣 Method 5: Random Forests

In [14]:
model = RandomForestClassifier(n_trees=20)
DecisionTree.fit!(model, X[trainids,:], y[trainids])

RandomForestClassifier
n_trees:             20
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           -1
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             [1, 2]
ensemble:            Ensemble of Decision Trees
Trees:      20
Avg Leaves: 143.35
Avg Depth:  21.8

In [15]:
q = X[testids,:];
predictions_RF = DecisionTree.predict(model, q)
findaccuracy(predictions_RF,y[testids])

0.4973958333333333

## 🟣 Method 6: Using a Nearest Neighbor method

In [16]:
Xtrain = X[trainids,:]
ytrain = y[trainids]
kdtree = KDTree(Xtrain')

KDTree{StaticArraysCore.SVector{6, Float64}, Euclidean, Float64, StaticArraysCore.SVector{6, Float64}}
  Number of points: 866
  Dimensions: 6
  Metric: Euclidean(0.0)
  Reordered: true

In [17]:
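queries = X[testids,:]
# Find the 5 nearest training points for each test row (one point per column).
idxs, dists = knn(kdtree, queries', 5, true)
c = ytrain[hcat(idxs...)]   # 5 × n_test matrix of neighbor labels
# Count the labels among each query's neighbors and keep the most frequent one.
possible_labels = map(i->counter(c[:,i]),1:size(c,2))
predictions_NN = map(i->argmax(possible_labels[i]),1:size(c,2))
findaccuracy(predictions_NN,y[testids])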

0.5390625
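The voting step of the nearest-neighbor prediction can be seen in isolation. This is a minimal sketch using only Base Julia, with made-up neighbor labels; `majority_vote` is a hypothetical helper, and breaking ties toward the smallest label is an assumption of this sketch (with 5 neighbors and two classes no tie can occur).

```julia
# Hypothetical helper: return the most frequent label among the neighbors.
# Ties are broken toward the smallest label (an assumption of this sketch).
function majority_vote(labels)
    counts = Dict{Int,Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    best = maximum(values(counts))
    return minimum(l for (l, c) in counts if c == best)
end

# Made-up labels of the 5 nearest neighbors for two query points.
neighbor_labels = [[1, 2, 2, 1, 2], [1, 1, 1, 2, 2]]
votes = majority_vote.(neighbor_labels)
println(votes)   # [2, 1]
```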

## 🟣 Method 7: Support Vector Machines

In [18]:
Xtrain = X[trainids,:]
ytrain = y[trainids]
model = svmtrain(Xtrain', ytrain)
predictions_SVM, decision_values = svmpredict(model, X[testids,:]')
findaccuracy(predictions_SVM,y[testids])

0.5338541666666666

## Results

In [19]:
overall_accuracies = zeros(7)
methods = ["lasso","ridge","EN", "DT", "RF","kNN", "SVM"]
ytest = y[testids]
overall_accuracies[1] = findaccuracy(predictions_lasso,ytest)
overall_accuracies[2] = findaccuracy(predictions_ridge,ytest)
overall_accuracies[3] = findaccuracy(predictions_EN,ytest)
overall_accuracies[4] = findaccuracy(predictions_DT,ytest)
overall_accuracies[5] = findaccuracy(predictions_RF,ytest)
overall_accuracies[6] = findaccuracy(predictions_NN,ytest)
overall_accuracies[7] = findaccuracy(predictions_SVM,ytest)
hcat(methods, overall_accuracies)

7×2 Matrix{Any}:
 "lasso"  0.526042
 "ridge"  0.526042
 "EN"     0.526042
 "DT"     0.505208
 "RF"     0.497396
 "kNN"    0.539062
 "SVM"    0.533854

## Confusion matrices

In [20]:
# confusmat(k, truth, pred): rows are true classes, columns are predicted classes.
println("Confusion matrix: predictions_lasso")
pretty_table(confusmat(2,y[testids], predictions_lasso[:]))
println("Confusion matrix: predictions_ridge")
pretty_table(confusmat(2,y[testids], predictions_ridge[:]))
println("Confusion matrix: predictions_EN")
pretty_table(confusmat(2,y[testids], predictions_EN[:]))
println("Confusion matrix: predictions_DT")
pretty_table(confusmat(2,y[testids], predictions_DT[:]))

Confusion matrix: predictions_lasso
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│    202 │      0 │
│    182 │      0 │
└────────┴────────┘
Confusion matrix: predictions_ridge
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│    202 │      0 │
│    182 │      0 │
└────────┴────────┘
Confusion matrix: predictions_EN
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│    202 │      0 │
│    182 │      0 │
└────────┴────────┘
Confusion matrix: predictions_DT
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│    171 │     31 │
│    159 │     23 │
└────────┴────────┘


In [21]:

println("Confusion matrix: predictions_RF")
pretty_table(confusmat(2,y[testids], predictions_RF[:]))
println("Confusion matrix: predictions_NN")
pretty_table(confusmat(2,y[testids], predictions_NN[:]))
println("Confusion matrix: predictions_SVM")
pretty_table(confusmat(2,y[testids], predictions_SVM[:]))

Confusion matrix: predictions_RF
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│     99 │    103 │
│     90 │     92 │
└────────┴────────┘
Confusion matrix: predictions_NN
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│    115 │     87 │
│     90 │     92 │
└────────┴────────┘
Confusion matrix: predictions_SVM
┌────────┬────────┐
│ Col. 1 │ Col. 2 │
├────────┼────────┤
│    138 │     64 │
│    115 │     67 │
└────────┴────────┘
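The exercise also asks for a ROC curve, which needs a continuous score per test row rather than a hard class label. The following is a minimal sketch in Base Julia, assuming that higher scores mean "more likely class 2"; `roc_points`, `auc`, and the toy data are hypothetical and do not come from the notebook's models.

```julia
# Hypothetical helper: sweep a threshold over the scores and record one
# (false positive rate, true positive rate) point per distinct score.
function roc_points(scores, truth; positive=2)
    thresholds = sort(unique(scores), rev=true)
    P = count(==(positive), truth)   # number of positive samples
    N = length(truth) - P            # number of negative samples
    fpr = [0.0]
    tpr = [0.0]
    for t in thresholds
        predpos = scores .>= t
        push!(tpr, count(predpos .& (truth .== positive)) / P)
        push!(fpr, count(predpos .& (truth .!= positive)) / N)
    end
    return fpr, tpr
end

# Area under the ROC curve by the trapezoidal rule.
auc(fpr, tpr) = sum((fpr[i+1] - fpr[i]) * (tpr[i+1] + tpr[i]) / 2 for i in 1:length(fpr)-1)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # made-up scores
truth  = [2,   2,   1,   2,   1,   1]     # made-up labels
fpr, tpr = roc_points(scores, truth)
println(collect(zip(fpr, tpr)))
println("AUC = ", auc(fpr, tpr))
```

The resulting `fpr` and `tpr` vectors can be passed to `plot(fpr, tpr)` from Plots to draw the curve. For the models above, one candidate score is the continuous glmnet prediction before `assign_class` is applied; an AUC near 0.5 would be consistent with the near-chance accuracies reported in the results table.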
