# Usage

In [1]:
using Pkg
Pkg.activate(".")

using Revise
using DecisionTrees
using Statistics

  Activating project at `d:\projects_julia\jul-project\DecisionTrees\examples`

### Data initialization

Construct an input matrix `X` of size `n x m`, holding `n` samples with `m` features each.
Features may be of type `Real`, `String` or `Bool`.
Also create a vector of labels `Y` with one label per sample.

This can be done manually, as below, or with the functions provided
in the file [data.jl](), as in [examples/titanic.ipynb]().

In [2]:
X = [
    -1.5  "a"     1 true; 
    -1.14 "b"     5 false; 
    -0.45 "bb"   -8 false; 
     2.5  "aaa"  -1 true; 
    27.4  "aaaa"  5 true]
Y = [0, 1, 0, 1, 1];
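
A quick sanity check before training can catch shape mismatches early. The sketch below uses only base Julia and repeats the example data so it runs on its own; the check is a suggestion, not something the package is known to require.

```julia
# Example data from above, repeated so the snippet is self-contained.
X = [
    -1.5  "a"     1 true;
    -1.14 "b"     5 false;
    -0.45 "bb"   -8 false;
     2.5  "aaa"  -1 true;
    27.4  "aaaa"  5 true]
Y = [0, 1, 0, 1, 1]

n, m = size(X)              # n samples (rows), m features (columns)
@assert n == length(Y)      # one label per sample

# Mixing Float64, String, Int and Bool gives the matrix element type Any.
println("samples: $n, features: $m, eltype: $(eltype(X))")
```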

### Decision tree

First, an empty `DecisionTree` struct must be initialized.

In [3]:
dt = DecisionTree()

Decision tree
    Maximal depth: nothing
    Attribute count: nothing

    Nodes:
        Leaf node:  
        Decision: nothing  
        Confidence: nothing

Calling `learn!` builds the decision tree from the input pair `X`, `Y`,
choosing the split in each decision node that maximizes information gain.

In [4]:
learn!(dt, X, Y)
dt

Decision tree
    Maximal depth: 1000
    Attribute count: 4

    Nodes:
        Decision node
        Type: real
        Parameter index: 1
        θ: 1.025
    
        Decision node
            Type: stringequality
            Parameter index: 2
            θ: b
    
            Leaf node:  
                Decision: 1  
                Confidence: 1.0
            Leaf node:  
                Decision: 0  
                Confidence: 1.0
        Leaf node:  
            Decision: 1  
            Confidence: 1.0
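
To make the root split above less opaque, here is a minimal sketch of information gain for a real-valued feature. It is not the package's implementation: the helper names `label_entropy` and `information_gain` are hypothetical, and taking midpoints between consecutive sorted values as candidate thresholds is an assumption (it is at least consistent with the reported `θ: 1.025`).

```julia
# Shannon entropy (in bits) of a label vector.
function label_entropy(labels)
    n = length(labels)
    h = 0.0
    for c in unique(labels)
        p = count(==(c), labels) / n
        h -= p * log2(p)
    end
    return h
end

# Information gain of splitting (x, y) at threshold θ into x ≤ θ and x > θ.
function information_gain(x, y, θ)
    left  = y[x .<= θ]
    right = y[x .> θ]
    n = length(y)
    return label_entropy(y) -
           length(left) / n * label_entropy(left) -
           length(right) / n * label_entropy(right)
end

x = [-1.5, -1.14, -0.45, 2.5, 27.4]   # feature 1 of the example data
y = [0, 1, 0, 1, 1]                   # labels of the example data

# Candidate thresholds: midpoints between consecutive sorted values (assumption).
xs = sort(x)
candidates = [(xs[i] + xs[i + 1]) / 2 for i in 1:length(xs) - 1]
gains = [information_gain(x, y, θ) for θ in candidates]
best_θ = candidates[argmax(gains)]

println("best θ ≈ $(round(best_θ, digits=3)), gain ≈ $(round(maximum(gains), digits=3))")
```

Under these assumptions, the threshold with the highest gain on feature 1 lies between `-0.45` and `2.5`, which matches the root node shown above.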

Labels are then predicted by calling `evaluate`.
Because the default maximal depth (1000) is far larger than this small dataset
requires, the tree can separate all training samples, so the train error should be 0.0.

In [5]:
Y_ = evaluate(dt, X)
println("Train error: $(mean(Y .!= Y_))")

Train error: 0.0


#### Learning attributes
- __depth__: Sets the maximal depth of the built decision tree.
When the depth is limited enough, the train error becomes larger than 0.0.

In [19]:
dt_d = DecisionTree()
learn!(dt_d, X, Y; depth=1)
dt_d

Decision tree
    Maximal depth: 1
    Attribute count: 4

    Nodes:
        Decision node
        Type: real
        Parameter index: 3
        θ: 3.0
    
        Leaf node:  
            Decision: 0  
            Confidence: 0.6666666666666666
        Leaf node:  
            Decision: 1  
            Confidence: 1.0

In [20]:
Y_ = evaluate(dt_d, X)
println("Train error: $(mean(Y .!= Y_))")

Train error: 0.2
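
The leaf confidences and the train error of 0.2 shown above can be reproduced by hand. This sketch assumes that a leaf predicts its majority label, that confidence is the fraction of training samples in the leaf carrying that label, and that samples with a feature value at or below θ fall into the first leaf. None of this is confirmed by the package source here, and `leaf_summary` is a hypothetical helper.

```julia
using Statistics

x3 = [1, 5, -8, -1, 5]      # feature 3 of the example data
Y  = [0, 1, 0, 1, 1]        # labels of the example data
θ  = 3.0                    # threshold reported for the depth-1 tree

# Majority label of a leaf and the fraction of samples carrying it.
function leaf_summary(labels)
    classes = unique(labels)
    counts = [count(==(c), labels) for c in classes]
    i = argmax(counts)
    return classes[i], counts[i] / length(labels)
end

below = x3 .<= θ            # assumption: x ≤ θ goes to the first leaf
d_lo, c_lo = leaf_summary(Y[below])
d_hi, c_hi = leaf_summary(Y[.!below])

# Predict each sample with the decision of the leaf it falls into.
Y_pred = [b ? d_lo : d_hi for b in below]

println("first leaf:  decision $d_lo, confidence $c_lo")
println("second leaf: decision $d_hi, confidence $c_hi")
println("Train error: $(mean(Y .!= Y_pred))")
```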


- __attribute_count__: Sets the number of randomly selected features that are
considered when searching for the optimal split.
When the attribute count is lower than the total number of features,
the decision tree becomes nondeterministic.

In [26]:
dt_a = DecisionTree()
learn!(dt_a, X, Y; attribute_count=1)
dt_a

Decision tree
    Maximal depth: 1000
    Attribute count: 1

    Nodes:
        Decision node
        Type: real
        Parameter index: 1
        θ: 1.025
    
        Decision node
            Type: real
            Parameter index: 1
            θ: -1.3199999999999998
    
            Leaf node:  
                Decision: 0  
                Confidence: 1.0
            Leaf node:  
                Decision: 0  
                Confidence: 0.5
        Leaf node:  
            Decision: 1  
            Confidence: 1.0

In [27]:
Y_ = evaluate(dt_a, X)
println("Train error: $(mean(Y .!= Y_))")

Train error: 0.2
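
The source of the randomness can be illustrated with a minimal sketch of drawing a feature subset. It uses `randperm` from the `Random` standard library; whether the package draws the subset once per tree or anew at every node, and how it does so internally, is not stated here, so treat this purely as an illustration.

```julia
using Random

m = 4                   # total number of features in the example data
attribute_count = 1     # number of features to consider for a split

# Draw `attribute_count` distinct feature indices uniformly at random.
candidate_features = randperm(m)[1:attribute_count]

println("features considered for this split: $candidate_features")
```

Because the subset changes from run to run, two trees trained on the same data can end up with different splits.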


### Random forest

A random forest is a model that combines several nondeterministically constructed
decision trees and produces its result by averaging their decisions.

First, an empty `RandomForest` struct must be initialized with a single argument,
__size__, which sets the number of `DecisionTree`s used.

In [29]:
rf = RandomForest(10)

Random Forest
    Tree count: 10
    Bagging: false
    
    Each tree:
        Maximal depth: nothing
        Attribute count: nothing

The function `learn!` constructs all trees in the random forest.
The trees are built the same way as in the `DecisionTree` model, so the same
learning arguments are available (`depth`, `attribute_count`).
Learning might take a while, so a progress bar is displayed.

In [30]:
learn!(rf, X, Y)
rf

Progress: 100%|█████████████████████████████████████████| Time: 0:00:00


Random Forest
    Tree count: 10
    Bagging: false
    
    Each tree:
        Maximal depth: 1000
        Attribute count: 4

Labels are predicted by calling `evaluate`.
Once again, when all trees in the forest are deterministic (all features
considered and no bagging), the train error is 0.0.

In [31]:
Y_ = evaluate(rf, X)
println("Train error: $(mean(Y .!= Y_))")

Train error: 0.0
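
How the per-tree decisions might be combined can be sketched as follows. The `votes` matrix is made up for illustration and does not come from the forest above. Thresholding the average at 0.5 for binary labels is also an assumption about how "averaging" becomes a final label.

```julia
using Statistics

# Hypothetical predictions of 3 trees (rows) for 5 samples (columns), binary labels.
votes = [0 1 0 1 1;
         0 1 0 0 1;
         1 1 0 1 1]

avg = vec(mean(votes, dims=1))      # average decision per sample
Y_pred = Int.(avg .>= 0.5)          # assumption: threshold the average at 0.5

println(Y_pred)
```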


#### Learning attributes
- __bagging__: When set to `true`, each tree is trained on a different dataset
of the same size as the input dataset, generated by randomly sampling from it
with replacement.
This also produces nondeterministic behaviour.

In [54]:
rf_b = RandomForest(10)
learn!(rf_b, X, Y; bagging=true)
rf_b

Random Forest
    Tree count: 10
    Bagging: true
    
    Each tree:
        Maximal depth: 1000
        Attribute count: 4

In [55]:
Y_ = evaluate(rf_b, X)
println("Train error: $(mean(Y .!= Y_))")

Train error: 0.0
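
Sampling with replacement, as described for `bagging`, can be sketched in a few lines of base Julia. This only illustrates the idea of a bootstrap sample and is not taken from the package. The names `idx`, `X_boot` and `Y_boot` are hypothetical.

```julia
using Random

# Example data from above, repeated so the snippet is self-contained.
X = [
    -1.5  "a"     1 true;
    -1.14 "b"     5 false;
    -0.45 "bb"   -8 false;
     2.5  "aaa"  -1 true;
    27.4  "aaaa"  5 true]
Y = [0, 1, 0, 1, 1]

n = size(X, 1)
idx = rand(1:n, n)      # n row indices drawn uniformly with replacement
X_boot = X[idx, :]      # bootstrap sample: same size, rows may repeat
Y_boot = Y[idx]         # labels stay aligned with the sampled rows

println("rows used: $idx")
```

Some rows typically appear more than once and others not at all, which is why each tree sees a slightly different dataset.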
