# Decision Trees (basic)

1. Introduction

    - Decision trees are supervised machine learning algorithms used for classification and regression analysis.  
    
    - Decision trees are used for issues where we have continuous but also categorical input and target features. 
    
    - A decision tree mainly contains of a root node, interior nodes, and leaf nodes which are then connected by branches.
    
<img src="data/images/Decisiontree.png" width="45%">

2. General method
    - Find descriptive features (can be criterion set) which contain the most "information" regarding the target feature and then split the dataset along the values of these features such that the target feature values for the resulting sub_datasets are as pure as possible. 
    - Process of finding the "most informative" feature is done until we accomplish a stopping criteria where we then finally end up in so called leaf nodes.
    - The leaf nodes contain the predictions we will make for new query instances presented to our trained model. 
    
    
3. Advantages of Decision-Tree 
    * Generate easy rules to understand
    * Perform classification without needing much computation
    * Indicate which fields are most important for classification


4. Disadvantages of Decision-Tree
    * Not too suitable for numerical prediction via regression 
    * Cannot handle cases having many classes and relatively small number of training examples
    * Generally computationally expensive to train for very large training examples. The process of growing a decision tree is computationally expensive. 
    * At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive since many candidate sub-trees must be formed and compared, and then prune appropriately.

In [1]:
# import relevant libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree

model = tree.DecisionTreeClassifier(criterion = 'gini')
# DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of size [n_samples, n_features] 
# holding the training samples, and an array Y of integer values, size [n_samples], 
# holding the class labels for the training samples
# criterion can either be 'entropy' for information gain or 'gini' for Gini impurity

4. Reading in data

   * Using zoo data from data folder

In [2]:
# loading in zoo data
# dataset = pd.read_csv('data/zoo.data.txt', names=['animal_name','hair','feathers','eggs','milk',
#                                                    'airbone','aquatic','predator','toothed','backbone',
#                                                   'breathes','venomous','fins','legs','tail','domestic','catsize','class',])
dataset = pd.read_csv('zoo_decision_tree.txt', names=['animal_name','hair','feathers','eggs','milk',
                                                   'airbone','aquatic','predator','toothed','backbone',
                                                  'breathes','venomous','fins','legs','tail','domestic','catsize','class',])
dataset.drop(columns = ['animal_name'],axis = 1,inplace=True)

x_categories = ['hair','feathers','eggs','milk',
                'airbone','aquatic','predator','toothed','backbone',
                'breathes','venomous','fins','legs','tail','domestic','catsize']

test_classes = {}
extract_class_column = dataset['class']
print(extract_class_column)

for i in range(len(extract_class_column)):
    if extract_class_column[i] in test_classes.keys():
        test_classes[extract_class_column[i]] += 1
    else:
        test_classes[extract_class_column[i]] = 1

test_classes_names = [item for item in test_classes.keys()]
class_name_convert = ['a','d','b','g','f','e','c'] # a - 1, b - 2, c - 3 etc.

0      1
1      1
2      4
3      1
4      1
5      1
6      1
7      4
8      4
9      1
10     1
11     2
12     4
13     7
14     7
15     7
16     2
17     1
18     4
19     1
20     2
21     2
22     1
23     2
24     6
25     5
26     5
27     1
28     1
29     1
      ..
71     2
72     7
73     4
74     1
75     1
76     3
77     7
78     2
79     2
80     3
81     7
82     4
83     2
84     1
85     7
86     4
87     2
88     6
89     5
90     3
91     3
92     4
93     1
94     1
95     2
96     1
97     6
98     1
99     7
100    2
Name: class, Length: 101, dtype: int64


5. Data management to split dataset for training and validation

    * Available train_test_split function in sklearn.model_selection can do the splitting
    
      see more information sklearn.model_selection https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    

In [3]:
# data management for setting up training and validation dataset
x_train, x_test = train_test_split(dataset.iloc[:,:-1], test_size = 0.20)
y_train, y_test = train_test_split(dataset.iloc[:,-1], test_size = 0.20)

6. Model-training to build classification model based on decision tree

In [4]:
# model training
trained_model = model.fit(x_train,y_train)
pred_result_train = trained_model.predict(x_train)
y_train = list(y_train)

counts = 0
for i in range(len(pred_result_train)):
    if str(pred_result_train[i]) == str(y_train[i]):
        counts += 1
accuracy = (counts/len(pred_result_train)) * 100
accuracy = round(accuracy,1)

print('The accuracy of the trained model from the training step is ' + str(accuracy) + '% .')

The accuracy of the trained model from the training step is 72.5% .


7. Model validation to test predictive capability of classification model

In [5]:
# model validation
test_results = trained_model.predict(x_test)
y_test = list(y_test)
counts = 0
for i in range(len(test_results)):
    if str(test_results[i]) == str(y_test[i]):
        counts += 1
accuracy = (counts/len(test_results)) * 100
accuracy = round(accuracy,1)

print('The accuracy of the trained model from the validation step is ' + str(accuracy) + '% .')

The accuracy of the trained model from the validation step is 19.0% .


In [6]:
import graphviz
import pydot
from graphviz import Graph
dot_data = tree.export_graphviz(trained_model, out_file=None,
                                feature_names=x_categories,
                               class_names=class_name_convert,
                               filled=True, rounded=True,
                               special_characters=False)
graph = graphviz.Source(dot_data)
graph.render()

'Source.gv.pdf'