# Iris CART using DecisionTree.jl
Loading packages that we will use

In [1]:
using DecisionTree
using Random, Statistics
using DataFrames

Loading Data (Iris dataset)

In [2]:
X, y = load_data("iris")
DF1 = DataFrame(X, :auto)
DF2 = DataFrame("Type" => y )
DF = DataFrame(x1 = float.(DF1[!,1]),
x2 = float.(DF1[!,2]),
x3 = float.(DF1[!,3]),
x4 = float.(DF1[!,4]),
y = string.(DF2[!,1]));
Matrix(DF[!,1:4]);


Now use PyCall to split the data into training and test sets with scikit-learn's `train_test_split`

In [3]:
using PyCall
ModelSelection = pyimport("sklearn.model_selection")
X_train, X_test, y_train, y_test =
 ModelSelection.train_test_split(Matrix(DF[!,1:4]),DF[!,5]);
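
The same kind of split can also be done without leaving Julia. The following is a minimal sketch, not the method used in this notebook: `split_train_test` is a hypothetical helper, the 25% test fraction mirrors scikit-learn's default, and the toy arrays stand in for the iris data.

```julia
using Random

# Hypothetical helper (not part of DecisionTree.jl): shuffle the row indices
# and hold out a fraction of them as the test set.
function split_train_test(X::AbstractMatrix, y::AbstractVector;
                          test_fraction::Real = 0.25,
                          rng::AbstractRNG = Random.default_rng())
    n = size(X, 1)
    @assert n == length(y) "X and y must have the same number of rows"
    idx = shuffle(rng, 1:n)                  # random permutation of the rows
    n_test = ceil(Int, test_fraction * n)    # size of the held-out set
    test_idx, train_idx = idx[1:n_test], idx[n_test+1:end]
    return X[train_idx, :], X[test_idx, :], y[train_idx], y[test_idx]
end

# Toy data standing in for the iris features and labels.
X_demo = reshape(collect(1.0:40.0), 10, 4)
y_demo = repeat(["a", "b"], 5)
Xtr, Xte, ytr, yte = split_train_test(X_demo, y_demo; rng = MersenneTwister(42))
```

With 150 rows and a test fraction of 0.25 this holds out 38 rows, the same test-set size that appears in the outputs of this notebook. Passing an explicit `rng` makes the split reproducible.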

Decision Tree

In [4]:
model = DecisionTreeClassifier(max_depth = 2);
fit!(model, X_train, y_train);
#Printing the tree
print_tree(model)

Feature 3 < 2.6 ?
├─ Iris-setosa : 38/38
└─ Feature 3 < 4.75 ?
    ├─ Iris-versicolor : 36/36
    └─ Iris-virginica : 35/38
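
Each leaf in the printout shows the majority class and how many of the training samples in that leaf belong to it. As a rough sketch, the class proportions of the last leaf can be worked out by hand. It is an assumption here that the 3 samples that are not Iris-virginica are all Iris-versicolor, which is consistent with the values 0.0789474 and 0.921053 in the probability output further down.

```julia
# Counts read off the last leaf of the printed tree ("Iris-virginica : 35/38").
# Assumption: the 3 remaining samples in that leaf are Iris-versicolor.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
leaf_counts = Dict("Iris-setosa" => 0, "Iris-versicolor" => 3, "Iris-virginica" => 35)

total = sum(values(leaf_counts))
leaf_proba = [leaf_counts[c] / total for c in classes]   # proportions per class

println(round.(leaf_proba, sigdigits = 6))
```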


In [5]:
train = [X_train y_train]
# view decision node data subset

train_R = train[train[:, 4] .> 0.8, :];


Ready to make some predictions

In [6]:
y_hat = predict(model, X_test)
#and checking the accuracy
accuracy = mean(y_hat .== y_test)

0.8947368421052632

Let's see where the model gets confused

In [7]:
DecisionTree.confusion_matrix(y_test,y_hat)

3×3 Matrix{Int64}:
 12  0   0
  0  8   3
  0  1  14

Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.8947368421052632
Kappa:    0.8393234672304438
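
The accuracy and kappa values can be reproduced from the matrix itself. The sketch below uses the standard definition of Cohen's kappa, taking rows as true classes and columns as predictions; it is assumed, not confirmed from the package source, that DecisionTree.jl computes kappa the same way.

```julia
using LinearAlgebra  # for tr

# Confusion matrix copied from the output above.
cm = [12 0 0;
       0 8 3;
       0 1 14]

n = sum(cm)
p_observed = tr(cm) / n                       # accuracy: diagonal / total
row_totals = vec(sum(cm, dims = 2))           # samples per true class
col_totals = vec(sum(cm, dims = 1))           # samples per predicted class
p_expected = sum(row_totals .* col_totals) / n^2   # agreement expected by chance
kappa = (p_observed - p_expected) / (1 - p_expected)

println("accuracy = ", p_observed, ", kappa = ", kappa)
```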

Display results

In [8]:
check = [y_hat[i] == y_test[i] for i in 1:length(y_hat)]
check_display = [y_hat y_test check]

38×3 Matrix{Any}:
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-versicolor"  "Iris-virginica"   false
 "Iris-versicolor"  "Iris-versicolor"   true
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-versicolor"  "Iris-versicolor"   true
 "Iris-versicolor"  "Iris-versicolor"   true
 "Iris-virginica"   "Iris-virginica"    true
 "Iris-versicolor"  "Iris-versicolor"   true
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-virginica"   "Iris-virginica"    true
 "Iris-virginica"   "Iris-virginica"    true
 "Iris-virginica"   "Iris-virginica"    true
 "Iris-virginica"   "Iris-versicolor"  false
 ⋮                                     
 "Iris-virginica"   "Iris-virginica"    true
 "Iris-virginica"   "Iris-versicolor"  false
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-versicolor"  "Iris-versicolor"   true
 "Iris-setosa"      "Iris-setosa"       true
 "Iris-versicolor"  "Iris-versicolor"   true

Display probability of each prediction

In [9]:
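# Note: sort(X_test, dims = 1) sorts each column independently, so the rows passed
# to predict_proba no longer correspond to the original test samples or to y_test.
prob = predict_proba(model, sort(X_test,dims = 1))
display(prob)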

38×3 Matrix{Float64}:
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 1.0  0.0        0.0
 0.0  1.0        0.0
 ⋮               
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
 0.0  0.0789474  0.921053
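
One caveat about `sort(X_test, dims = 1)`: it sorts every column separately, so the resulting rows are generally not rows of the original matrix. The small example below, with made-up numbers, contrasts it with `sortslices`, which reorders whole rows.

```julia
# Toy matrix: 3 samples (rows), 2 features (columns).
A = [3 10;
     1 30;
     2 20]

col_sorted = sort(A, dims = 1)         # each column sorted on its own
row_sorted = sortslices(A, dims = 1)   # whole rows reordered lexicographically

# Helper to compare matrices as sets of rows.
rows(M) = [M[i, :] for i in 1:size(M, 1)]

println(Set(rows(col_sorted)) == Set(rows(A)))   # rows were mixed up
println(Set(rows(row_sorted)) == Set(rows(A)))   # rows were kept intact
```

If the goal is only to group similar samples in the display, `sortslices` keeps each sample's features together.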

We can improve our predictions, but any improvement is subject to the bias-variance tradeoff.
Ensemble learning can improve on a single decision tree. Its two main techniques are bagging and boosting.

## Random Forest
A random forest is an example of bagging. It reduces the variance of the model (by increasing the independence of the features each tree uses or by increasing the number of models), but you get slightly higher bias.
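
To make the idea concrete, here is a minimal sketch of the two ingredients of bagging: bootstrap resampling of the training rows and a majority vote over the models. `bootstrap_indices`, `majority_vote` and the per-tree predictions are hypothetical illustrations, not DecisionTree.jl internals, and the random feature selection a random forest also uses is left out.

```julia
using Random

# Draw n row indices with replacement: one bootstrap sample.
bootstrap_indices(rng::AbstractRNG, n::Integer) = rand(rng, 1:n, n)

# Majority vote over the predictions of several models for one sample.
function majority_vote(votes::AbstractVector)
    counts = Dict{eltype(votes), Int}()
    for v in votes
        counts[v] = get(counts, v, 0) + 1
    end
    best = first(votes)
    for v in votes   # scan in vote order so ties go to the earliest label
        if counts[v] > counts[best]
            best = v
        end
    end
    return best
end

rng = MersenneTwister(1)
idx = bootstrap_indices(rng, 10)   # rows one tree would be trained on

# Hypothetical per-tree predictions: rows are samples, columns are trees.
tree_preds = ["setosa"    "setosa"     "versicolor";
              "virginica" "versicolor" "virginica"]
ensemble_pred = [majority_vote(tree_preds[i, :]) for i in 1:size(tree_preds, 1)]
```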

In [10]:
#Creating a random forest with 20 trees
model1 = RandomForestClassifier(n_trees = 20);
fit!(model1, X_train, y_train)
#making predictions
ŷ = predict(model1, X_test)
#checking accuracy 
accuracy1 = mean(y_test .== ŷ)
#confusion matrix
@show DecisionTree.confusion_matrix(y_test, ŷ)
#checking the probability (sort with dims = 1 sorts each column independently, so rows no longer match y_test)
prob1 = predict_proba(model1, sort(X_test, dims = 1))

3×3 Matrix{Int64}:
 12   0   0
  0  10   1
  0   2  13

DecisionTree.confusion_matrix(y_test, ŷ) = Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.9210526315789473
Kappa:    0.8810020876826722


38×3 Matrix{Float64}:
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 1.0  0.0  0.0
 0.0  1.0  0.0
 ⋮         
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0
 0.0  0.0  1.0

## AdaBoost
AdaBoost is a boosting technique. In ensemble learning, boosting usually lowers the bias.
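
Boosting lowers bias by making each new weak learner concentrate on the samples the previous ones got wrong. The sketch below shows one reweighting round of the classic binary AdaBoost formulation. `adaboost_reweight` is a hypothetical helper, and the multi-class implementation inside DecisionTree.jl may differ in its details.

```julia
# One AdaBoost round: given the current sample weights and a mask of the
# samples the weak learner misclassified, return updated weights and the
# learner's vote weight alpha. Assumes 0 < weighted error < 1.
function adaboost_reweight(weights::Vector{Float64}, misclassified::AbstractVector{Bool})
    err = sum(weights[misclassified])            # weighted error of the learner
    alpha = 0.5 * log((1 - err) / err)           # larger when the error is small
    new_weights = [w * (m ? exp(alpha) : exp(-alpha))
                   for (w, m) in zip(weights, misclassified)]
    return new_weights ./ sum(new_weights), alpha   # renormalise to sum to 1
end

w0 = fill(0.25, 4)                               # four samples, equal weights
w1, alpha = adaboost_reweight(w0, [true, false, false, false])
println(w1)
```

After the update the misclassified sample carries half of the total weight, so the next stump is pushed towards classifying it correctly.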

In [11]:
model2 = AdaBoostStumpClassifier(n_iterations = 20);
fit!(model2, X_train, y_train)
#making predictions
ŷ = predict(model2, X_test)
#checking accuracy
accuracy2 = mean(y_test .== ŷ)
#confusion matrix
@show DecisionTree.confusion_matrix(y_test, ŷ)
#checking the probability
prob2 = predict_proba(model2, sort(X_test, dims = 1))


3×3 Matrix{Int64}:
 12  0   0
  0  9   2
  0  1  14

DecisionTree.confusion_matrix(y_test, ŷ) = Classes:  ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
Matrix:   
Accuracy: 0.9210526315789473
Kappa:    0.88


38×3 Matrix{Float64}:
 0.613395  0.386605  0.0
 0.613395  0.386605  0.0
 0.613395  0.386605  0.0
 0.613395  0.386605  0.0
 0.613395  0.386605  0.0
 0.613395  0.386605  0.0
 0.613395  0.339775  0.0468293
 0.613395  0.339775  0.0468293
 0.613395  0.339775  0.0468293
 0.613395  0.339775  0.0468293
 0.613395  0.339775  0.0468293
 0.613395  0.339775  0.0468293
 0.0       0.614547  0.385453
 ⋮                   
 0.0       0.36444   0.63556
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804
 0.0       0.320196  0.679804

## Using scikit-learn with PyCall

In [12]:
using MLBase

In [13]:
LabelMap = labelmap(DF[:,5])
y_train1 = labelencode(LabelMap, y_train)
y_test1 = labelencode(LabelMap, y_test);
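
Label encoding replaces each class name with an integer code. A minimal plain-Julia sketch of the idea follows; the names are illustrative, and the exact numbering MLBase assigns may differ from the order of first appearance assumed here.

```julia
labels = ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"]

classes = unique(labels)                                   # distinct class names
code_of = Dict(c => i for (i, c) in enumerate(classes))    # name => integer code
encoded = [code_of[l] for l in labels]                     # integer labels
decoded = classes[encoded]                                 # back to class names

println(encoded)
```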


In [14]:
np = pyimport("numpy")
skl = pyimport("sklearn")
ModelEnsemble = pyimport("sklearn.ensemble")
SklMtr = pyimport("sklearn.metrics")
jol = pyimport("joblib")

PyObject <module 'joblib' from 'C:\\Users\\PC\\.julia\\conda\\3\\lib\\site-packages\\joblib\\__init__.py'>

In [15]:
model3 = ModelEnsemble.RandomForestClassifier()
model3.fit(X_train,y_train1)
y_pred = model3.predict(X_test)
model3.score(X_test,y_test1)
SklMtr.confusion_matrix(y_test1, y_pred)

3×3 Matrix{Int64}:
 12  0   0
  0  9   2
  0  1  14

Saving model3

In [16]:
joblib_file = "IrisDetection.joblib"
jol.dump(model3, joblib_file)

1-element Vector{String}:
 "IrisDetection.joblib"

In [21]:
using JLD2, Pkg
Pkg.add("Genie")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\PC\.julia\environments\v1.7\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\PC\.julia\environments\v1.7\Manifest.toml`


In [20]:
@save "mdl.jld2" model2
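
A saved file is only useful if it can be read back. The sketch below does a round trip with a small stand-in `Dict` instead of the trained model, and the temporary path is an assumption for illustration. Reading back an actual model would also require DecisionTree to be loaded in the session.

```julia
using JLD2

# Stand-in object; in the notebook this would be the trained model.
params = Dict("n_iterations" => 20,
              "classes" => ["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

path = joinpath(mktempdir(), "demo.jld2")   # temporary file for the demo
@save path params                           # stored under the name "params"

# Read the object back by the name it was saved under.
restored = jldopen(path, "r") do file
    file["params"]
end
```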

In [23]:
using Genie
using Pkg
Pkg.update("Genie")

[32m[1m    Updating[22m[39m registry at `C:\Users\PC\.julia\registries\General.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\PC\.julia\environments\v1.7\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\PC\.julia\environments\v1.7\Manifest.toml`
┌ Info: We haven't cleaned this depot up for a bit, running Pkg.gc()...
└ @ Pkg C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.7\Pkg\src\Pkg.jl:639
[32m[1m      Active[22m[39m manifest files: 2 found
[32m[1m      Active[22m[39m artifact files: 105 found
[32m[1m      Active[22m[39m scratchspaces: 9 found
[32m[1m     Deleted[22m[39m no artifacts, repos, packages or scratchspaces


In [25]:
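# `using Genie` does not export `newapp_webservice`, so this unqualified call fails (assumed fix in Genie 5: Genie.Generator.newapp_webservice)
newapp_webservice("IrisDetector")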

LoadError: UndefVarError: newapp_webservice not defined