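# Julia Machine Learning: DecisionTree (Decision Trees)

## Assignment 030: Breast Cancer Prediction Dataset

Use a random forest model to build a classifier that predicts whether each tumor in the breast cancer dataset is benign or malignant.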

In [1]:
using DecisionTree, RDatasets, DataFrames, MLDataUtils, Statistics

## Load the data

In [2]:
biopsy = dataset("MASS", "biopsy")

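| Row | ID | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int32 | Int32 | Int32 | Int32 | Int32 | Int32⍰ | Int32 | Int32 | Int32 | Categorical… |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | benign |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | malignant |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | benign |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | benign |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | benign |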


## Remove missing values

In [3]:
biopsy = dropmissing(biopsy)

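| Row | ID | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | String | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Int32 | Categorical… |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | benign |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | malignant |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | benign |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | benign |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | benign |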
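
Before dropping rows it can help to see how many values are actually missing. The sketch below uses only Base Julia on a made-up column (the values are hypothetical, not taken from `biopsy`) whose element type, `Union{Missing, Int32}`, is what the `Int32⍰` marker on `V6` appears to indicate. It shows the count of missing entries and how the element type narrows once they are removed.

```julia
# Hypothetical toy column; the values are made up, not taken from `biopsy`.
v6 = Union{Missing, Int32}[1, 10, missing, 4, missing, 2]

# Count the missing entries before removing anything.
n_missing = count(ismissing, v6)

# Drop them; the element type narrows from Union{Missing, Int32} to Int32.
v6_clean = collect(skipmissing(v6))

println("missing: ", n_missing, ", kept: ", length(v6_clean), ", eltype: ", eltype(v6_clean))
```

On the real data, the same check would be `count(ismissing, biopsy.V6)`, run before the `dropmissing` call.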


## Training and test datasets

In [4]:
indecies = MLDataUtils.shuffleobs(collect(1:nrow(biopsy)))

683-element view(::Array{Int64,1}, [64, 636, 242, 472, 316, 468, 637, 348, 444, 294  …  593, 128, 588, 619, 517, 533, 557, 352, 276, 523]) with eltype Int64:
  64
 636
 242
 472
 316
 468
 637
 348
 444
 294
 177
 166
 605
   ⋮
 365
 633
 593
 128
 588
 619
 517
 533
 557
 352
 276
 523

In [5]:
collect(1:nrow(biopsy))

683-element Array{Int64,1}:
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
   ⋮
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683

In [6]:
train_ind, test_ind = MLDataUtils.splitobs(indecies, at = 0.8);
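
The shuffle-then-split step can also be written with the standard library alone, which makes it easy to fix the random seed so that the split is reproducible. This is a minimal sketch, not how `MLDataUtils` implements it; the helper name `shuffled_split` and the seed `42` are hypothetical choices for illustration.

```julia
using Random

# Hypothetical helper (not part of MLDataUtils): shuffle 1:n, then cut at a fraction.
function shuffled_split(n::Integer; at::Real = 0.8, rng::AbstractRNG = Random.default_rng())
    idx = shuffle(rng, 1:n)            # random permutation of the row indices
    n_train = round(Int, at * n)       # number of rows assigned to the training set
    return idx[1:n_train], idx[n_train+1:end]
end

# 683 rows remain after dropping missing values; a fixed seed repeats the same split.
train_idx, test_idx = shuffled_split(683; at = 0.8, rng = MersenneTwister(42))

println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

With 683 rows and `at = 0.8` this gives 546 training rows and 137 test rows, the same test-set size as the 137-element prediction output in this notebook.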

In [7]:
features = Matrix{Float64}(biopsy[!, 2:10])

683×9 Array{Float64,2}:
 5.0   1.0   1.0  1.0  2.0   1.0   3.0   1.0  1.0
 5.0   4.0   4.0  5.0  7.0  10.0   3.0   2.0  1.0
 3.0   1.0   1.0  1.0  2.0   2.0   3.0   1.0  1.0
 6.0   8.0   8.0  1.0  3.0   4.0   3.0   7.0  1.0
 4.0   1.0   1.0  3.0  2.0   1.0   3.0   1.0  1.0
 8.0  10.0  10.0  8.0  7.0  10.0   9.0   7.0  1.0
 1.0   1.0   1.0  1.0  2.0  10.0   3.0   1.0  1.0
 2.0   1.0   2.0  1.0  2.0   1.0   3.0   1.0  1.0
 2.0   1.0   1.0  1.0  2.0   1.0   1.0   1.0  5.0
 4.0   2.0   1.0  1.0  2.0   1.0   2.0   1.0  1.0
 1.0   1.0   1.0  1.0  1.0   1.0   3.0   1.0  1.0
 2.0   1.0   1.0  1.0  2.0   1.0   2.0   1.0  1.0
 5.0   3.0   3.0  3.0  2.0   3.0   4.0   4.0  1.0
 ⋮                           ⋮                
 3.0   1.0   1.0  1.0  2.0   1.0   2.0   3.0  1.0
 4.0   1.0   1.0  1.0  2.0   1.0   1.0   1.0  1.0
 1.0   1.0   1.0  1.0  2.0   1.0   1.0   1.0  8.0
 1.0   1.0   1.0  3.0  2.0   1.0   1.0   1.0  1.0
 5.0  10.0  10.0  5.0  4.0   5.0   4.0   4.0  1.0
 3.0   1.0   1.0  1.0  2.0   

In [8]:
labels = Vector{String}(biopsy[!, :Class])

683-element Array{String,1}:
 "benign"
 "benign"
 "benign"
 "benign"
 "benign"
 "malignant"
 "benign"
 "benign"
 "benign"
 "benign"
 "benign"
 "benign"
 "malignant"
 ⋮
 "benign"
 "benign"
 "benign"
 "benign"
 "malignant"
 "benign"
 "benign"
 "benign"
 "benign"
 "malignant"
 "malignant"
 "malignant"

## Random forest model

In [9]:
# Still need to work out how to evaluate these parameter choices
model = DecisionTree.RandomForestClassifier(n_trees=500, max_depth=9)

RandomForestClassifier
n_trees:             500
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           9
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             nothing
ensemble:            nothing
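
One common way to evaluate parameter choices such as `n_trees` and `max_depth` is k-fold cross-validation: fit on k-1 folds, score on the held-out fold, and average the k scores. The sketch below only builds the fold indices, using the standard library; `kfold_indices`, `k = 5`, and the seed are hypothetical choices, and the model fitting is left as a comment because it depends on the rest of the notebook.

```julia
using Random

# Hypothetical helper: split 1:n into k folds and return (train, test) index pairs.
function kfold_indices(n::Integer, k::Integer; rng::AbstractRNG = Random.default_rng())
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    idx = shuffle(rng, 1:n)
    folds = [idx[i:k:end] for i in 1:k]    # fold i takes every k-th shuffled index
    return [(train = reduce(vcat, folds[setdiff(1:k, j)]), test = folds[j]) for j in 1:k]
end

splits = kfold_indices(683, 5; rng = MersenneTwister(1))

# For each candidate setting (for example a value of max_depth), one would fit a
# model on `s.train`, score it on `s.test` for every split `s`, and compare the
# mean accuracy across settings.
for (j, s) in enumerate(splits)
    println("fold ", j, ": train = ", length(s.train), ", test = ", length(s.test))
end
```

DecisionTree.jl also provides an `nfoldCV_forest` helper for this purpose; see its documentation for the argument order.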

## Training

In [10]:
DecisionTree.fit!(model, features[train_ind, :], labels[train_ind])

RandomForestClassifier
n_trees:             500
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           9
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             ["benign", "malignant"]
ensemble:            Ensemble of Decision Trees
Trees:      500
Avg Leaves: 17.824
Avg Depth:  7.616
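
The fitted ensemble holds 500 trees. Random forest classifiers usually combine their trees by majority vote: each tree predicts a class and the most frequent class becomes the forest's prediction. The sketch below illustrates that idea on its own, independent of DecisionTree.jl's internals; `majority_vote` is a hypothetical helper, and breaking ties alphabetically is an arbitrary choice.

```julia
# Hypothetical helper: return the most frequent label among the tree votes.
# Ties are broken alphabetically because the classes are sorted and
# `argmax` returns the first maximum.
function majority_vote(votes::AbstractVector{<:AbstractString})
    isempty(votes) && throw(ArgumentError("votes must not be empty"))
    classes = sort(unique(votes))
    tallies = [count(==(c), votes) for c in classes]
    return classes[argmax(tallies)]
end

# Made-up votes from five trees for a single sample.
tree_votes = ["benign", "malignant", "malignant", "benign", "malignant"]
println(majority_vote(tree_votes))  # prints malignant
```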

## Prediction

In [11]:
ŷ = DecisionTree.predict(model, features[test_ind, :])

137-element Array{String,1}:
 "malignant"
 "benign"
 "benign"
 "malignant"
 "benign"
 "malignant"
 "malignant"
 "benign"
 "benign"
 "malignant"
 "malignant"
 "benign"
 "benign"
 ⋮
 "benign"
 "malignant"
 "benign"
 "benign"
 "benign"
 "benign"
 "benign"
 "benign"
 "malignant"
 "benign"
 "malignant"
 "benign"

## Model evaluation

In [12]:
accuracy(xs, ys) = mean(xs .== ys)

accuracy (generic function with 1 method)

In [13]:
accuracy(ŷ, labels[test_ind])

0.9708029197080292

↑ So there really is some error: about 2.9% of the 137 test samples (4 of them) are misclassified.
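
Accuracy alone does not say which kind of error occurred, and for a tumor classifier a malignant case predicted as benign is usually the more costly mistake. The sketch below counts the four outcomes with Base Julia only; `confusion_counts` is a hypothetical helper, the toy vectors are made up, and treating `"malignant"` as the positive class is an assumption.

```julia
# Hypothetical helper: count true/false positives and negatives for one class.
function confusion_counts(y_pred::AbstractVector, y_true::AbstractVector; positive = "malignant")
    length(y_pred) == length(y_true) ||
        throw(DimensionMismatch("y_pred and y_true must have the same length"))
    tp = count((y_pred .== positive) .& (y_true .== positive))  # malignant, caught
    fp = count((y_pred .== positive) .& (y_true .!= positive))  # benign, flagged
    fn = count((y_pred .!= positive) .& (y_true .== positive))  # malignant, missed
    tn = count((y_pred .!= positive) .& (y_true .!= positive))  # benign, cleared
    return (tp = tp, fp = fp, fn = fn, tn = tn)
end

# Made-up predictions and ground truth for five samples.
y_pred = ["malignant", "benign", "benign", "malignant", "benign"]
y_true = ["malignant", "benign", "malignant", "benign", "benign"]

c = confusion_counts(y_pred, y_true)
sensitivity = c.tp / (c.tp + c.fn)   # share of malignant cases that were caught
println(c, ", sensitivity = ", sensitivity)
```

In this notebook the call would be `confusion_counts(ŷ, labels[test_ind])`.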