# Decision trees in Julia

*Alexandre Slonina*  
Subject link: https://github.com/bensadeghi/DecisionTree.jl

### First prerequisite: install Julia

You can run this notebook on Google Colab using the method explained in [Julia for Pythonistas](https://github.com/ageron/julia_notebooks/blob/master/Julia_for_Pythonistas.ipynb), but personally I would not recommend it.

To install Julia with Anaconda:  
`$ conda install -c conda-forge julia`

If that doesn't work, install Julia manually:  
`$ sudo apt install julia` (Linux) or https://julialang.org/downloads

Launch Julia:  
`$ julia`

Then run the following lines in the Julia REPL:
- `using Pkg`
- `Pkg.add("IJulia")`

Julia should now be available in your conda environment!  
You should see "Julia 1.X.X" instead of "Python 3" at the top right of your screen. If not, try relaunching Jupyter.   


## First, why use Julia instead of Python?

Sorry, Dennis spoiled you during the last OBD course ...  
Julia is a recent language, released in 2012 and designed with data science and linear algebra in mind. In short, the advantages of Julia are:
- The syntax is optimized for math and machine learning
- Speed (it aims for C-like performance)
- Native machine learning libraries

Many data scientists keep using Python because it is still more popular and has more third-party packages (PyTorch and TensorFlow being the main ones).

If you are interested in the subject:  
[Link1: datascientest.com](https://datascientest.com/python-vs-julia-quel-est-le-meilleur-langage-pour-la-data-science)  
[Link2: geeksforgeeks.org](https://www.geeksforgeeks.org/julia-vs-python/)   
[Link3: analyticsvidhya.com](https://www.analyticsvidhya.com/blog/2020/08/what-is-better-for-data-science-learning-and-work-julia-or-python/) 


### Some important differences
  * Arrays in Julia are indexed starting from 1.
  * In Julia classes (i.e. types) don't own methods. Methods are implementations of generic functions and are invoked in a "static style", i.e. instead of Python's str1.rstrip(), we will have rstrip( str1 ), instead of file1.close(), close( file1 ).
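
As a minimal sketch of these two points (the variable names `xs` and `str1` are purely illustrative):

```julia
# Arrays are 1-indexed: the first element is xs[1], the last is xs[end]
xs = [10, 20, 30]
first_item = xs[1]
last_item = xs[end]

# Functions are called on values rather than through them
str1 = "hello   "
trimmed = rstrip(str1)      # Python: str1.rstrip()
n_chars = length(trimmed)   # Python: len(...)
```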

### Some important similarities.


| *Python*                 | *Julia* | *Comments* |
| -------------------------|---------|----------|
| `True` | `true` | |
| `False` | `false` | | 
| `None` | `nothing` |  |
| `type( obj )` | `typeof( obj )` |  |
| `{}` | `Dict{KeyType,ValueType}()` |  |
| `elif` | `elseif` |  |
| `lambda x, y : y + x * 2` | `(x,y) -> y + x * 2` |  |
| `"string %s interpolation %d" % ( str1, i1)` | `"string $str1 interpolation $i1"` | You can interpolate arbitrary expressions by enclosing them in parentheses, as in `"$(x+y)"` |
| `xrange(10,4,-2)` | `10:-2:4` | Julia ranges include the end point, so `10:-2:4` also yields 4, which Python's `xrange(10,4,-2)` excludes |
| `range(10,4,-2)` | `collect(10:-2:4)` | Do this only if you really have to, as it will consume memory proportional to the length of the range |
| `id( obj )` | `objectid( obj )` | |
| `raise exception` | `throw( exception )` | |


| *Python*                 | *Julia* |
| -------------------------|-----------|
| `str1 + str2 + str3` |  `string( str1, str2, str3 )`  |
| `len( str1 )` |  `length( str1 )` | 
| `str1.rstrip()` | `rstrip( str1 )` | 
 
[Source](https://gist.github.com/svaksha/bf2b287e85967dcaad03a26d8b1e523d)  
To go further: [Julia official doc](https://docs.julialang.org/en/v1/manual/noteworthy-differences/#Noteworthy-differences-from-Python)
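
A few of the equivalents from the tables can be tried directly. The snippet below is only a small sketch with made-up values (`x`, `y`, `d`, and `f` are illustrative names):

```julia
x, y = 3, 4

# String interpolation: plain variables with $, expressions with $( )
msg = "x is $x and the sum is $(x + y)"

# Ranges are lazy; collect materializes them into an array
r = 10:-2:4
v = collect(r)

# Typed dictionary and anonymous function
d = Dict{String,Int}()
d["answer"] = 42
f = (x, y) -> y + x * 2

println(msg)
println(v)
println(f(x, y))
```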

For those who are interested in Julia, there are many tutorials to get started:  
[JuliaAcademy](https://juliaacademy.com/courses/)  
[GithubRepository](https://github.com/JuliaAcademy/JuliaTutorials)

Here is the notebook for pythonistas with all you need: [Julia for Pythonistas](https://github.com/ageron/julia_notebooks/blob/master/Julia_for_Pythonistas.ipynb) 

## A quick reminder of Decision Trees

**Hierarchical description of data based on logical (binary) questions**.  
Basic Idea: Test the attributes (features) sequentially
= Ask questions about the target/status sequentially

At each step, ask about the attribute that maximizes the expected
reduction of the entropy.
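
To make "expected reduction of the entropy" concrete, here is a small sketch that computes the Shannon entropy of a label vector and the gain of a candidate split. The helper names `label_entropy` and `information_gain` and the toy data are assumptions made for illustration; they are not part of DecisionTree.jl.

```julia
# Shannon entropy (in bits) of a vector of class labels
function label_entropy(labels)
    n = length(labels)
    h = 0.0
    for c in unique(labels)
        p = count(isequal(c), labels) / n
        h -= p * log2(p)
    end
    return h
end

# Expected entropy reduction when `labels` is split by a boolean mask
function information_gain(labels, mask)
    left, right = labels[mask], labels[.!mask]
    n = length(labels)
    weighted = (length(left) / n) * label_entropy(left) +
               (length(right) / n) * label_entropy(right)
    return label_entropy(labels) - weighted
end

toy_labels = ["a", "a", "b", "b"]
toy_feature = [1.0, 2.0, 3.0, 4.0]

gain_good = information_gain(toy_labels, toy_feature .< 2.5)  # separates the classes
gain_poor = information_gain(toy_labels, toy_feature .< 1.5)  # isolates one sample
```

A tree builder would evaluate such a gain for every candidate test and keep the one with the largest value; here the threshold 2.5 wins over 1.5.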

Ingredients:
- Nodes<br>
Each node contains a **test** on the features which **partitions** the data.
- Edges<br>
The outcome of a node's test leads to one of its child edges.
- Leaves<br>
A terminal node, or leaf, holds a **decision value** for the output variable.
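
The three ingredients map naturally onto a recursive data structure. The sketch below uses hypothetical types `ToyLeaf` and `ToyNode` (named so as not to clash with anything in DecisionTree.jl) and a hand-built tree, just to show how a test at each node routes a sample to a leaf:

```julia
abstract type ToyTree end

# A leaf holds the decision value
struct ToyLeaf <: ToyTree
    decision::String
end

# A node holds a test (feature index and threshold) and two child edges
struct ToyNode <: ToyTree
    feature::Int
    threshold::Float64
    left::ToyTree    # followed when x[feature] < threshold
    right::ToyTree   # followed otherwise
end

decide(leaf::ToyLeaf, x) = leaf.decision
decide(node::ToyNode, x) =
    x[node.feature] < node.threshold ? decide(node.left, x) : decide(node.right, x)

# Hand-built tree with two tests and three leaves
toy_tree = ToyNode(1, 2.5,
                   ToyLeaf("small"),
                   ToyNode(2, 1.0, ToyLeaf("medium"), ToyLeaf("large")))

println(decide(toy_tree, [1.0, 0.0]))
println(decide(toy_tree, [3.0, 2.0]))
```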

![Tree example](img/dt1.png)

If you want to revisit the underlying math, you can quickly go through the MLclass notebook: [8-Decision Trees](https://github.com/erachelson/MLclass/tree/master/8%20-%20Decision%20Trees)

### Ensemble methods used with Decision Trees


**Boosting:** A set of weak learners is combined into a single strong
classifier. Apply the learner to weighted samples, then increase the weights of the misclassified examples.  
[9-Boosting](https://github.com/erachelson/MLclass/tree/master/9%20-%20Boosting)
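
As an illustration of the reweighting step, here is a sketch of one round of an AdaBoost-style update. AdaBoost is assumed here as the concrete boosting algorithm, and the `misclassified` vector is a made-up outcome of one weak learner:

```julia
n = 4
weights = fill(1 / n, n)                      # start with uniform sample weights
misclassified = [false, true, false, false]   # hypothetical result of one weak learner

err = sum(weights[misclassified])             # weighted error rate of the learner
alpha = 0.5 * log((1 - err) / err)            # weight of this learner in the final vote

# Increase the weights of misclassified samples, decrease the others, renormalize
weights .*= exp.(alpha .* ifelse.(misclassified, 1.0, -1.0))
weights ./= sum(weights)

println(weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.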

**Bagging:** (Bootstrap Aggregating) Build bootstrap replicates of the training set by sampling with replacement, and learn one model on each replicate. Each individual model is a high-variance, low-bias classifier; combining them reduces the variance of the overall classifier. It's a powerful technique for controlling overfitting.  
[10-Bagging](https://github.com/erachelson/MLclass/tree/master/10%20-%20Bagging)  
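
The two building blocks of bagging, bootstrap sampling and aggregation, can be sketched with the standard library alone. The helpers `bootstrap_indices` and `majority_vote` are illustrative names, not functions of DecisionTree.jl, and the votes are made up:

```julia
using Random

# One bootstrap replicate: n indices drawn with replacement from 1:n
bootstrap_indices(rng, n) = rand(rng, 1:n, n)

# Combine the predictions of several models by majority vote
function majority_vote(votes)
    counts = Dict{eltype(votes),Int}()
    for v in votes
        counts[v] = get(counts, v, 0) + 1
    end
    best, best_count = first(votes), 0
    for (label, c) in counts
        if c > best_count
            best, best_count = label, c
        end
    end
    return best
end

rng = MersenneTwister(42)
idx = bootstrap_indices(rng, 10)   # some indices repeat, others are left out
vote = majority_vote(["virginica", "versicolor", "virginica"])
```

Each model would be trained on `features[idx, :]` and `labels[idx]` for its own `idx`, and the final prediction is the vote over all models.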

**Forest:** Bagging + Random feature selection at each node   
[11-Random Forest](https://github.com/erachelson/MLclass/tree/master/11%20-%20Random%20Forests)

Visual examples for boosting and bagging are [here](ressources/10-ensemble-6.pdf)

### Second prerequisite: install packages

The Python keyword **import** corresponds to **using** in Julia. To install the "example" package in Julia, run `Pkg.add("example")`, which is the equivalent of `pip install example` in Python. Once it is installed, you never have to do it again on your machine. If the packages are not installed yet, uncomment the lines below.

In [1]:
using Pkg
#Pkg.add("DecisionTree")
#Pkg.add("ScikitLearn")

## A simple classification example using Iris dataset
The API is nearly the same as scikit-learn's in Python.

In [2]:
#Here we import the package
using DecisionTree

This package supports the ScikitLearn interface. The available models are: `DecisionTreeClassifier, DecisionTreeRegressor, RandomForestClassifier, RandomForestRegressor, AdaBoostStumpClassifier`.

In [3]:
?DecisionTreeClassifier

search: DecisionTreeClassifier



```
DecisionTreeClassifier(; pruning_purity_threshold=0.0,
                       max_depth::Int=-1,
                       min_samples_leaf::Int=1,
                       min_samples_split::Int=2,
                       min_purity_increase::Float=0.0,
                       n_subfeatures::Int=0,
                       rng=Random.GLOBAL_RNG)
```

Decision tree classifier. See [DecisionTree.jl's documentation](https://github.com/bensadeghi/DecisionTree.jl)

Hyperparameters:

  * `pruning_purity_threshold`: (post-pruning) merge leaves having `>=thresh` combined purity (default: no pruning)
  * `max_depth`: maximum depth of the decision tree (default: no maximum)
  * `min_samples_leaf`: the minimum number of samples each leaf needs to have (default: 1)
  * `min_samples_split`: the minimum number of samples in needed for a split (default: 2)
  * `min_purity_increase`: minimum purity needed for a split (default: 0.0)
  * `n_subfeatures`: number of features to select at random (default: keep all)
  * `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator.

Implements `fit!`, `predict`, `predict_proba`, `get_classes`


In [4]:
features, labels = load_data("iris")

# the data loaded are of type Array{Any}
# cast them to concrete types for better performance
features = float.(features)
labels   = string.(labels)

150-element Array{String,1}:
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 "Iris-setosa"
 ⋮
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"
 "Iris-virginica"

In [5]:
# train depth-truncated classifier
model = DecisionTreeClassifier(max_depth=2)
fit!(model, features, labels)
# pretty print of the tree, to a depth of 3 nodes (useful if you change the max_depth above)
print_tree(model,3)

Feature 3, Threshold 2.45
L-> Iris-setosa : 50/50
R-> Feature 4, Threshold 1.75
    L-> Iris-versicolor : 49/54
    R-> Iris-virginica : 45/46


We obtain the same result as in Python. Here is the result for the same tree from the 9th notebook:
![dt2](img/iris_dt2.png)

In [6]:
# run 3-fold cross validation
# See ScikitLearn.jl for installation instructions
using ScikitLearn.CrossValidation: cross_val_score
accuracy = cross_val_score(model, features, labels, cv=3)

3-element Array{Float64,1}:
 0.9607843137254902
 0.9019607843137255
 0.9791666666666666
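
The per-fold scores are usually summarized by their mean and spread. A small sketch using only the `Statistics` standard library, with the fold accuracies copied from the output above (`fold_scores` is an illustrative name):

```julia
using Statistics

# Fold accuracies copied from the output above
fold_scores = [0.9607843137254902, 0.9019607843137255, 0.9791666666666666]

mean_score = mean(fold_scores)   # average accuracy over the 3 folds
spread = std(fold_scores)        # sample standard deviation across folds

println("accuracy: ", round(mean_score, digits=3), " ± ", round(spread, digits=3))
```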

In [7]:
# let's say we have the features of an iris: which species is it?
myIris = [5.9,3.0,5.1,1.9]
# apply learned model
print("myIris type is ")
println(predict(model, myIris))
# get the probability of each label
println(predict_proba(model, myIris))
println(get_classes(model)) # returns the ordering of the columns in predict_proba's output

myIris type is Iris-virginica
[0.0, 0.021739130434782608, 0.9782608695652174]
["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


In [9]:
preds = predict(model, features)
confusion_matrix(labels, preds)

3×3 Array{Int64,2}:
 50   0   0
  0  49   1
  0   5  45

Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.96
Kappa:    0.94
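
The accuracy and kappa values can be recomputed by hand from the matrix. The sketch below copies the confusion matrix from the output above and applies the usual definition of Cohen's kappa (observed agreement corrected by the agreement expected by chance); the variable names are illustrative:

```julia
using LinearAlgebra

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = [50  0  0;
       0 49  1;
       0  5 45]

n = sum(cm)
accuracy = tr(cm) / n                        # correct predictions are on the diagonal

# Agreement expected by chance, from the row and column totals
row_totals = vec(sum(cm, dims=2))
col_totals = vec(sum(cm, dims=1))
expected = sum(row_totals .* col_totals) / n^2

kappa = (accuracy - expected) / (1 - expected)

println("Accuracy: ", accuracy)
println("Kappa:    ", round(kappa, digits=2))
```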

### Random forest
It also works with random forests; the syntax is the same.

In [10]:
?RandomForestClassifier

search: RandomForestClassifier



```
RandomForestClassifier(; n_subfeatures::Int=-1,
                       n_trees::Int=10,
                       partial_sampling::Float=0.7,
                       max_depth::Int=-1,
                       rng=Random.GLOBAL_RNG)
```

Random forest classification. See [DecisionTree.jl's documentation](https://github.com/bensadeghi/DecisionTree.jl)

Hyperparameters:

  * `n_subfeatures`: number of features to consider at random per split (default: -1, sqrt(# features))
  * `n_trees`: number of trees to train (default: 10)
  * `partial_sampling`: fraction of samples to train each tree on (default: 0.7)
  * `max_depth`: maximum depth of the decision trees (default: no maximum)
  * `min_samples_leaf`: the minimum number of samples each leaf needs to have
  * `min_samples_split`: the minimum number of samples in needed for a split
  * `min_purity_increase`: minimum purity needed for a split
  * `rng`: the random number generator to use. Can be an `Int`, which will be used to seed and create a new random number generator. Multi-threaded forests must be seeded with an `Int`

Implements `fit!`, `predict`, `predict_proba`, `get_classes`


In [11]:
model = RandomForestClassifier(n_trees=12,partial_sampling=0.6,max_depth=5)
fit!(model,features,labels)
accuracy = cross_val_score(model, features, labels, cv=3)

3-element Array{Float64,1}:
 0.9607843137254902
 0.9215686274509803
 0.9583333333333334

## Let's practice on a harder example (NIST digits)

And see if Julia works better than Python!
In this example we'll fit a random forest classifier on the NIST digits dataset and try to optimize its hyperparameters with `GridSearchCV` from ScikitLearn.jl.
GridSearchCV is explained [here](https://github.com/cstjean/ScikitLearn.jl/blob/master/examples/Randomized_Search.ipynb), and the following code is adapted from that notebook.

In [12]:
using ScikitLearn.GridSearch: GridSearchCV
using Printf, Statistics

features, labels = load_data("digits")

# the data loaded are of type Array{Any}
# cast them to concrete types for better performance
features = float.(features)
labels   = string.(labels)

1797-element Array{String,1}:
 "1"
 "2"
 "3"
 "4"
 "5"
 "6"
 "7"
 "8"
 "9"
 "10"
 "1"
 "2"
 "3"
 ⋮
 "8"
 "10"
 "6"
 "5"
 "9"
 "9"
 "5"
 "10"
 "1"
 "9"
 "10"
 "9"

In [13]:
model = RandomForestClassifier(n_trees=100,partial_sampling=0.7)
fit!(model,features,labels)
accuracy = cross_val_score(model, features, labels, cv=3)

3-element Array{Float64,1}:
 0.9335548172757475
 0.9499165275459098
 0.924496644295302

This random forest classifies our data quite well. Let's try to optimize the hyperparameters. Run a first search with the `param_grid` below; at the end, if you have time, you can come back here and try more parameters.

In [15]:
#?RandomForestClassifier

In [19]:
param_grid = Dict("max_depth"=> [3,10,100],
                  "n_subfeatures"=> [1,3,10],
                  "min_samples_split"=> [2,3,10],
                  "min_samples_leaf"=> [1,3,10],
                  "n_trees"=> [10, 100])

# run grid search
model = RandomForestClassifier(rng=2)
grid_search = GridSearchCV(model, param_grid)

start = time()
fit!(grid_search, features, labels)

@printf("GridSearchCV took %.2f seconds for %d candidate parameter settings.\n",
time() - start, length(grid_search.grid_scores_))

GridSearchCV took 33.36 seconds for 162 candidate parameter settings.
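
The figure of 162 candidates is simply the number of combinations in the grid. A short sketch that recomputes it from the same `param_grid` (the names `n_candidates`, `param_names`, and `candidates` are illustrative):

```julia
param_grid = Dict("max_depth" => [3, 10, 100],
                  "n_subfeatures" => [1, 3, 10],
                  "min_samples_split" => [2, 3, 10],
                  "min_samples_leaf" => [1, 3, 10],
                  "n_trees" => [10, 100])

# One candidate per combination of values: 3 * 3 * 3 * 3 * 2
n_candidates = prod(length(v) for v in values(param_grid))

# Enumerate the combinations explicitly with a Cartesian product
param_names = collect(keys(param_grid))
candidates = vec([Dict(zip(param_names, combo))
                  for combo in Iterators.product((param_grid[k] for k in param_names)...)])

println(n_candidates)
println(first(candidates))
```

Adding a value to any list multiplies the number of candidates, so the search time grows quickly with the size of the grid.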


In [20]:
# Utility function to report best scores
function report(grid_scores, n_top=3)
    top_scores = sort(grid_scores, by=x->x.mean_validation_score, rev=true)[1:n_top]
    for (i, score) in enumerate(top_scores)
        println("Model with rank:$i")
        @printf("Mean validation score: %.3f (std: %.3f)\n",
                score.mean_validation_score,
                std(score.cv_validation_scores))
        println("Parameters: $(score.parameters)")
        println("")
    end
end

report(grid_search.grid_scores_)

Model with rank:1
Mean validation score: 0.945 (std: 0.008)
Parameters: Dict{Symbol,Any}(:min_samples_split => 2,:n_trees => 100,:n_subfeatures => 3,:min_samples_leaf => 1,:max_depth => 100)

Model with rank:2
Mean validation score: 0.943 (std: 0.009)
Parameters: Dict{Symbol,Any}(:min_samples_split => 3,:n_trees => 100,:n_subfeatures => 3,:min_samples_leaf => 1,:max_depth => 100)

Model with rank:3
Mean validation score: 0.942 (std: 0.004)
Parameters: Dict{Symbol,Any}(:min_samples_split => 3,:n_trees => 100,:n_subfeatures => 3,:min_samples_leaf => 1,:max_depth => 10)



## Let's compare with the random forest from Python
This may take a few minutes to run.

In [3]:
using PyCall
py"""
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
digits = datasets.load_digits()
X = digits.data
y = digits.target
param_grid = {"max_depth": [3, 10, 100],
                  "max_features": [1, 3, 10],
                  "min_samples_split": [2, 3, 10],  
                  "min_samples_leaf": [1, 3, 10],   
                  "n_estimators":[10, 100]}
rf = RandomForestClassifier()
start_time = time.time()
clf = GridSearchCV(rf, param_grid)
clf.fit(X,y)
end_time = time.time()
"""
print("GridSearchCV time is ")
py"end_time-start_time"

GridSearchCV time is 

114.10692858695984

In [7]:
print("Best parameters are ")
py"clf.best_params_"

Best parameters are 

Dict{Any,Any} with 5 entries:
  "min_samples_split" => 2
  "max_depth"         => 100
  "min_samples_leaf"  => 1
  "n_estimators"      => 100
  "max_features"      => 3

In [8]:
print("Best score is ")
py"clf.best_score_"

Best score is 

0.9421448467966573

### Using the Python module with Julia's ScikitLearn

**Please restart the kernel here**, otherwise there will be a conflict between the two `RandomForestClassifier` definitions because they share the same name. This approach mixes Julia and Python: Julia imports a Python object (the classifier) and uses it. We'll discuss the performance of this method later.

In [1]:
using ScikitLearn, Printf, Statistics
using PyCall
using ScikitLearn.GridSearch: GridSearchCV
@sk_import datasets: load_digits
@sk_import ensemble: RandomForestClassifier

PyObject <class 'sklearn.ensemble._forest.RandomForestClassifier'>

In [2]:
digits = load_digits()
X, y = digits["data"], digits["target"]
param_grid = Dict("max_depth"=> [3, 10, 100],
                  "max_features"=> [1, 3, 10],
                  "min_samples_split"=> [2, 3, 10],  
                  "min_samples_leaf"=> [1, 3, 10],   
                  "n_estimators"=>[10, 100])

clf = RandomForestClassifier(random_state=2)
# run grid search
grid_search = GridSearchCV(clf, param_grid)

start = time()
fit!(grid_search, X, y)

@printf("GridSearchCV took %.2f seconds for %d candidate parameter settings.\n",
time() - start, length(grid_search.grid_scores_))

GridSearchCV took 75.74 seconds for 162 candidate parameter settings.


In [3]:
function report(grid_scores, n_top=3)
    top_scores = sort(grid_scores, by=x->x.mean_validation_score, rev=true)[1:n_top]
    for (i, score) in enumerate(top_scores)
        println("Model with rank:$i")
        @printf("Mean validation score: %.3f (std: %.3f)\n",
                score.mean_validation_score,
                std(score.cv_validation_scores))
        println("Parameters: $(score.parameters)")
        println("")
    end
end
report(grid_search.grid_scores_)

Model with rank:1
Mean validation score: 0.945 (std: 0.000)
Parameters: Dict{Symbol,Any}(:max_features => 3,:min_samples_split => 2,:min_samples_leaf => 1,:n_estimators => 100,:max_depth => 10)

Model with rank:2
Mean validation score: 0.945 (std: 0.004)
Parameters: Dict{Symbol,Any}(:max_features => 3,:min_samples_split => 3,:min_samples_leaf => 1,:n_estimators => 100,:max_depth => 10)

Model with rank:3
Mean validation score: 0.945 (std: 0.000)
Parameters: Dict{Symbol,Any}(:max_features => 3,:min_samples_split => 2,:min_samples_leaf => 1,:n_estimators => 100,:max_depth => 100)



## Comparison of execution time

Here are the results from my computer; they may vary, but the order of magnitude should be the same. The best score varies slightly between methods (around 0.94), which could be an effect of the randomness involved in the algorithm.

|Method| GridSearchCV time (seconds) |
|-----|------|
|Julia Random Forest| 33  |
|Python Random Forest|115 |
| Mix Julia using Python Random Forest|75|

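We clearly see that Julia is very efficient when the method is written purely in Julia. Even when calling the Python module, it runs faster than the pure Python version. Keep in mind that these are single runs relying on each library's default settings (for example the number of cross-validation folds, which may differ), so the timings are indicative rather than a strict benchmark. So far, few algorithms have been written in native Julia; decision trees and random forests are among them.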
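
The ratios between the three timings can be computed from the rounded values in the table (the dictionary keys below are just labels chosen for this sketch):

```julia
# Rounded timings in seconds, copied from the table above
timings = Dict("Julia" => 33.0,
               "Python" => 115.0,
               "Julia calling Python" => 75.0)

speedup_vs_python = timings["Python"] / timings["Julia"]
speedup_vs_mix = timings["Julia calling Python"] / timings["Julia"]

println("Pure Julia vs pure Python:     ", round(speedup_vs_python, digits=1), "x")
println("Pure Julia vs Julia + sklearn: ", round(speedup_vs_mix, digits=1), "x")
```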






## Other classifiers 

Here is a comparison of the Julia decision tree and random forest against various Python classifiers, with the associated accuracy scores: ![classifiers comparison](img/classifierspic.png)
You can see the source code for this plot [here](https://github.com/cstjean/ScikitLearn.jl/blob/master/examples/Classifier_Comparison_Julia.ipynb)