# Machine Learning / Aprendizagem Automática

## Sara C. Madeira, 2019/20

# Practical 05 - Decision Trees in Scikit-Learn

## 0. Requirements

This practical studies decision trees with [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org), [Scikit-learn](http://scikit-learn.org/stable/), as well as other Python technical libraries, such as [Pandas](http://pandas.pydata.org) and [NumPy](http://www.numpy.org).

**Decision trees can handle both categorical and numerical features in the same dataset. This is actually one of their strengths, together with interpretability.** Some decision tree learning algorithms, such as C4.5, deal with categorical features natively, so the way the different attribute types are managed is transparent to the user, while others, such as CART, require data transformations to deal with categorical features. Besides the algorithm, its implementation may also force the user to transform the data before the algorithm can run and/or be used correctly. **In Scikit-learn, for example, all learning algorithms are implemented to receive numeric values as input, preventing the use of certain types of features, such as categorical ones, without data transformations.**

Last week we used `Orange3` and its implementation of the **C4.5 algorithm** (Quinlan, 1993) to learn decision trees. C4.5 uses the **Information Gain** (in its normalised form, the Gain Ratio) as attribute selection measure (impurity measure), natively supports any type of feature (numeric and categorical attributes), and can learn **decision trees that might not be binary**. C4.5 allows binary and multiclass classification but does not work with a numeric class, that is, it cannot be used to learn a regression tree. **The Orange3 implementation of C4.5 handles any feature type correctly and natively.**

**Scikit-learn uses an optimised version of the CART algorithm** (Breiman, Friedman, Olshen, Stone, 1984). CART (Classification and Regression Trees) uses the **Gini Index** as attribute selection measure; allows binary and multiclass classification together with regression; and learns **binary trees**.
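To make the two attribute selection measures mentioned above more concrete, the following minimal sketch computes the Gini Index, the entropy, and the Information Gain of a candidate split on a toy set of labels. The function names and the toy labels are illustrative assumptions, not part of Scikit-learn or Orange3.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy (in bits) of a set of class labels: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Toy example (hypothetical labels): a balanced node split into two purer nodes
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("Gini(parent)    =", gini(parent))
print("Entropy(parent) =", entropy(parent))
print("Information gain of the split =", information_gain(parent, [left, right]))
```

Both measures are largest for a balanced node and shrink as a node becomes purer, which is why a good split is one whose children have low impurity.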

**In order to learn decision trees or any other classifier from non-numerical data, Scikit-learn requires feature transformations, aka encoding.**

This tutorial has two main parts and a third that summarizes the topic of **learning decision trees in Scikit-Learn**:

1. Learning Decision Trees from **Numerical Features**

2. Learning Decision Trees from **Categorical Features**

3. Learning Decision Trees from **Multiple Types of Features**


### Decision Trees and Encoding

Summing up, when learning decision tree models from categorical features, we might come across three types of algorithms/implementations:

1. **Algorithms handling categorical features CORRECTLY**. We input the categorical features to the algorithm in the appropriate format, as we do with the numeric features (since we can have features of any type), and the machine learning algorithm processes categorical features correctly as categorical. This is the BEST CASE since it fits our needs and we do not have to worry about feature transformations. This is the case of Orange3, and it is also the case of Weka and other machine learning tools.

2. **Models handling categorical features INCORRECTLY**. We input the categorical features to the algorithm in the appropriate format, BUT the machine learning algorithm processes them incorrectly, applying some opaque internal processing to transform them into something usable. This is the WORST CASE EVER since it probably does not do what we expect, and thus features are wrongly transformed and, consequently, the model performance is compromised.

3. **Models NOT handling categorical features at all**. In this case we have to preprocess manually the categorical features in order to have them in an appropriate format for the machine learning algorithm, usually numeric features. **This is the case in Scikit-Learn, where we have to transform (aka ENCODE) the categorical features before learning the decision tree.** But how do we transform (aka ENCODE) them? There are many methods to encode categorical features. We are going to explore the use of three of them: **Binary Encoding**, **One-Hot encoding (Dummy Variables)**, and **Numeric Encoding**.
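As a quick preview of the three encodings named in the last item, here is a small sketch using only Pandas. The toy `DataFrame`, its column names, and the category order chosen for the numeric encoding are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
df = pd.DataFrame({
    "HasChildren": ["yes", "no", "yes"],
    "Education": ["Primary", "University", "Secondary"],
})

# Binary encoding: a two-valued feature becomes a single 0/1 column
binary = df["HasChildren"].map({"no": 0, "yes": 1})

# One-hot encoding (dummy variables): one 0/1 column per category
one_hot = pd.get_dummies(df["Education"], prefix="Education").astype(int)

# Numeric encoding: each category is replaced by an integer code.
# This imposes an order on the values, so it suits ordinal features best.
order = ["Primary", "Secondary", "University"]  # assumed order
numeric = pd.Categorical(df["Education"], categories=order, ordered=True).codes

print(binary.tolist())
print(one_hot)
print(list(numeric))
```

Note that one-hot encoding adds one column per category, whereas binary and numeric encoding keep a single column per feature.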

## 1. Learning Decision Trees from Numeric Features

### 1.1. Getting Started: Learning a Decision Tree using All Data

We introduce decision trees in Scikit-learn using datasets where the features are numeric. We first use the well-known [iris dataset](https://archive.ics.uci.edu/ml/datasets/iris).

As you might remember, the goal is to distinguish the species of iris flowers, given that they are characterised by the length and width of the petals and the length and width of the sepals, all measured in centimeters.

In this context, we have at hand a **multi-class classification problem, where the class has 3 possible values (Setosa, Versicolor, or Virginica) and the iris examples are characterized by 4 numeric features**.

In [None]:
#load dataset

from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [None]:
# general view of the dataset

iris_dataset

In [None]:
iris_dataset['feature_names']

In [None]:
# Values in key target_names

iris_dataset['target_names']

In [None]:
# Shape of the dataset

# The shape of the data array is the number of examples times the number of features.
# This is a convention in scikit-learn and your data will always be assumed to be in this shape.

iris_dataset['data'].shape

In [None]:
# feature values of the first 5 examples

iris_dataset['data'][:5]

In [None]:
# Target of the first 5 learning examples

# (Setosa, Versicolor, Virginica) are coded as (0, 1, 2)

iris_dataset['target'][:5]

In [None]:
# The target is always a one-dimensional array, with one entry per example

iris_dataset['target'].shape

In [None]:
# The species are encoded as integers from 0 to 2. 
# The meaning of the numbers is given by the iris_dataset['target_names'] array: 
# 0 means Setosa
# 1 means Versicolor
# 2 means Virginica

iris_dataset['target_names']

We can now learn the **decision tree classifier** (http://scikit-learn.org/stable/modules/tree.html) using the class `DecisionTreeClassifier` (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). 

**The parameter `criterion` defines the function to measure the quality of a split. Supported criteria are “gini” for the Gini Index and “entropy” for the Information Gain. Gini is the `default`.**

In [None]:
from sklearn import tree

# Learning a decision tree using CART and the Gini index as impurity criterion (default)

dtc_Gini = tree.DecisionTreeClassifier() # criterion="gini" is the default

dtc_Gini = dtc_Gini.fit(iris_dataset.data, iris_dataset.target)

dtc_Gini

**Scikit-learn allows us to export the decision tree as a .dot file after training**, which we can visualize using **`GraphViz`**. `GraphViz` is freely available at http://www.graphviz.org and it is supported by Linux, Windows, and Mac OS X.

**If `GraphViz` is not installed in your computer you should install it. If you are not able to install it now don't worry, there is an alternative below.**

In this context, we first **create the .dot file via scikit-learn using the `export_graphviz` function from the `tree` submodule**, as follows:

In [None]:
# export_graphviz writes the file and returns None when out_file is given
tree.export_graphviz(dtc_Gini, out_file="iris_Gini.dot",
                     feature_names=iris_dataset.feature_names,
                     class_names=iris_dataset.target_names,
                     filled=True, rounded=True,
                     special_characters=True)

After installing GraphViz, we can **convert the `.dot` file into a PNG file** by executing the following code:

In [None]:
from subprocess import call
call(['dot', '-T', 'png', 'iris_Gini.dot', '-o', 'iris_Gini.png'])

Now go to your working folder and **open the file `'iris_Gini.png'`**. 

As you can see **the decision tree learned using the Gini Index is the following:**

<img src="iris_Gini.png" alt="iris_Gini.png" style="width: 900px;"/>

**In case you are not allowed or are not able to install Graphviz**, you can **visualise the `.dot` file online** by using, for instance, [`GraphvizOnline`](https://dreampuf.github.io/GraphvizOnline/).

You should first copy/paste the `iris_Gini.dot` file contents to the editor area (left).

<img src="graphviz_online.png" alt="graphviz_online.png" style="width: 900px;"/>
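Another option that needs no external tool at all is a plain-text rendering of the tree. The sketch below assumes a Scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later); the variable names are illustrative and `random_state=0` is only set to make the result repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Plain-text rendering of the learned rules; Graphviz is not needed
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a test on one feature, and the leaves report the predicted class as its integer code.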

**Let's now check if we obtain the same decision tree if we use Information Gain instead of Gini Index as impurity measure:**

In [None]:
from sklearn import tree

# Learning a decision tree using CART and the Information Gain as impurity criterion

dtc_IG = tree.DecisionTreeClassifier(criterion="entropy")
dtc_IG = dtc_IG.fit(iris_dataset.data, iris_dataset.target)
dtc_IG

In [None]:
# create a .dot file with the tree
tree.export_graphviz(dtc_IG, out_file="iris_IG.dot",
                     feature_names=iris_dataset.feature_names,
                     class_names=iris_dataset.target_names,
                     filled=True, rounded=True,
                     special_characters=True)

# create a .png file from the .dot file 
from subprocess import call
call(['dot', '-T', 'png', 'iris_IG.dot', '-o', 'iris_IG.png']) 

Go to your working folder and **open the file `'iris_IG.png'`**.

As you can see **the decision tree learned using the Information Gain is the following:**

<img src="iris_IG.png" alt="iris_IG.png" style="width: 900px;"/>
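Besides comparing the two pictures by eye, the two trees can also be compared programmatically. The following is a minimal sketch, assuming a Scikit-learn version where `get_depth` and `get_n_leaves` are available (0.21 or later); `random_state=0` is an assumption added only to make the comparison repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Learn one tree per impurity criterion on the full dataset
trees = {
    crit: DecisionTreeClassifier(criterion=crit, random_state=0).fit(iris.data, iris.target)
    for crit in ("gini", "entropy")
}

for crit, clf in trees.items():
    root_feature = iris.feature_names[clf.tree_.feature[0]]  # feature tested at the root
    print(crit, "| depth:", clf.get_depth(),
          "| leaves:", clf.get_n_leaves(),
          "| root feature:", root_feature)
```

If depth, number of leaves, or the feature tested at the root differ, the two criteria have led to different trees.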

### 1.2 Learning a Decision Tree using Train and Test Data

Recall what we did at the end of Practical 02 + 03 with the Iris dataset. **Repeat the train and test scheme followed then, now using the decision tree as classifier instead of the K-NN used before.**

In [None]:
# Now it's up to you to code
# ...
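One possible starting point is sketched below. It assumes a simple hold-out split with `train_test_split`; the exact scheme used in Practical 02 + 03 may differ, and the split proportion and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hold out 25% of the examples for testing, keeping class proportions (stratify)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target)

# Learn the tree on the training set only, then evaluate on the unseen test set
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```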

### 1.3 Additional Exercise: Learn a Decision Tree using Train and Test Data for the Wine Dataset

The file `wine.csv` contains the [Wine Dataset](https://archive.ics.uci.edu/ml/datasets/wine). These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

**You should first load the dataset from the .csv file. You can use the code below to do it.**

In [None]:
import numpy as np
import pandas as pd

def load_data(fname):
    """Load CSV file with any number of consecutive features, starting in column 0, where last column is the class"""
    df = pd.read_csv(fname)
    nc = df.shape[1] # number of columns
    matrix = df.values # Convert dataframe to ndarray (as_matrix() was removed in pandas 1.0)
    table_X = matrix [:, 0:nc-1] # get features 
    table_y = matrix [:, nc-1] # get class (last columns)           
    features = df.columns.values[0:nc-1] #get features names
    target = df.columns.values[nc-1] #get target name
    return table_X, table_y, features, target

In [None]:
# load dataset `wine.csv`
table_X, table_y, features, target = load_data('wine.csv')

In [None]:
# Now it's up to you to code
# train and test the decision tree
# ...


## 2. Learning Decision Trees from Categorical Attributes

In the previous examples, the features were numeric (real-valued). In many settings, this is not the case, and some or even all features are categorical.  

**Scikit-learn only handles numeric features, but provides mechanisms to convert categorical features into numeric ones**. 

### 2.1 Handling Categorical Features with Two Values

**For this example, all features are categorical, each with two possible values.** 

**We will convert each feature into an integer value, with two possible values, either 0 or 1.**

We consider the dataset `hike.csv`, where all features take values in {"yes", "no"}.

In [None]:
import numpy as np
import pandas as pd

def load_data(fname):
    """Load CSV file with any number of consecutive features, starting in column 0, where last column is the class"""
    df = pd.read_csv(fname)
    nc = df.shape[1] # number of columns
    matrix = df.values # Convert dataframe to ndarray (as_matrix() was removed in pandas 1.0)
    table_X = matrix [:, 0:nc-1] # get features 
    table_y = matrix [:, nc-1] # get class (last columns)           
    features = df.columns.values[0:nc-1] #get features names
    target = df.columns.values[nc-1] #get target name
    return table_X, table_y, features, target

In [None]:
table_X, table_y, features, target = load_data('hike.csv')

In [None]:
# feature names
features

In [None]:
# Data from which we should learn (features)
table_X

In [None]:
# the first column 'Sample' should not be used to learn the decision tree since it is the identifier 
# let's remove it from table_X

nc = table_X.shape[1] # number of columns
table_X = table_X[:, 1:nc] # remove column 0
table_X

In [None]:
# let's also remove 'Sample' from the features names
features = features [1:features.size]
features

In [None]:
# Target name
target

In [None]:
# Vector with what we should learn (Class)
table_y

**Now we need to define utility functions to transform the features and the class into binary values.**

In [None]:
from sklearn.preprocessing import LabelEncoder

def int_encode_class(vect):
    """Map each distinct value in vect to an integer code (0, 1, ...)."""
    # LabelEncoder assigns the codes following the sorted order of the values
    label_encoder = LabelEncoder().fit(vect)
    return label_encoder.transform(vect)
    
def int_encode_feature(vect):
    return int_encode_class(vect)
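To see what `LabelEncoder` does on its own, here is a minimal sketch on a toy vector (the values are illustrative). It shows that the integer codes follow the sorted order of the distinct values, which is why "no" becomes 0 and "yes" becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy column with two possible values (hypothetical data)
values = np.array(["yes", "no", "no", "yes"], dtype=object)

enc = LabelEncoder().fit(values)
codes = enc.transform(values)  # position of each value in enc.classes_

print(enc.classes_)                   # distinct values, in sorted order
print(codes)
print(enc.inverse_transform(codes))   # back to the original values
```

Keeping the fitted encoder around is useful when the integer predictions of a model have to be mapped back to the original labels.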

Given the utility functions, we can now encode the data for the scikit-learn classifier as follows:

In [None]:
# ENCODE table_X (FEATURES) with integers 

table_X[:, 0] = int_encode_feature(table_X[:, 0])
table_X[:, 1] = int_encode_feature(table_X[:, 1])
table_X[:, 2] = int_encode_feature(table_X[:, 2])
table_X[:, 3] = int_encode_feature(table_X[:, 3])

table_X

In [None]:
# ENCODE table_y (CLASS) with integers

table_y = int_encode_class(table_y)

table_y

**We still have to convert the binary values into real values.**

In [None]:
# Convert FEATURES into REAL VALUES

table_X = table_X.astype(float)

table_X

In [None]:
# Convert CLASS to REAL VALUES

table_y = table_y.astype(float)

table_y
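The reason for this extra step can be seen in a small sketch: an array that originally held strings keeps the generic `object` dtype even after integer codes are written back into it, and `astype(float)` turns it into a proper numeric array. The toy array below is hypothetical.

```python
import numpy as np

# Toy object array, similar to what a DataFrame with string columns yields
table = np.array([["yes", "no"], ["no", "no"]], dtype=object)

# Writing integer codes back into the array does not change its dtype
table[:, 0] = [1, 0]
table[:, 1] = [0, 0]
print(table.dtype)

# Explicit conversion gives a numeric array
numeric = table.astype(float)
print(numeric.dtype)
print(numeric)
```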

We can now finally learn the **decision tree classifier** (http://scikit-learn.org/stable/modules/tree.html) using the class `DecisionTreeClassifier` (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). Remember that the parameter `criterion` defines the function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Gini is the default.

In [None]:
from sklearn import tree

# Learning a decision tree using CART and the Gini index as impurity criterion (default)

dtc_Gini = tree.DecisionTreeClassifier() #criterion='gini'
dtc_Gini = dtc_Gini.fit(table_X, table_y)

dtc_Gini

First, we **create the .dot file via scikit-learn using the export_graphviz function from the tree submodule**, as follows:

In [None]:
# create a .dot file with the tree
tree.export_graphviz(dtc_Gini, out_file='hike_Gini.dot',
                     feature_names=['Lecture', 'Concert', 'ArtExpo', 'SeasonSales'],
                     class_names=['No', 'Yes'],
                     filled=True, rounded=True,
                     special_characters=True)

#after executing this code you should have the file "hike_Gini.dot" in your working directory

After we have installed GraphViz on our computer, we can **convert the `.dot` file into a PNG file** by executing the following code:

In [None]:
# create a .png file from the .dot file
from subprocess import call
call(['dot', '-T', 'png', 'hike_Gini.dot', '-o', 'hike_Gini.png'])

#after executing this code you should have the file "hike_Gini.png" in your working directory

**The decision tree learned using the Gini Index is the following:**

<img src="hike_Gini.png" alt="hike_Gini.png" style="width: 300px;"/>

**Let's now check if we obtain the same decision tree if we use Information Gain instead of Gini Index:**

In [None]:
from sklearn import tree

# Learning a decision tree using CART and the Information Gain as impurity criterion (not the default)

dtc_IG = tree.DecisionTreeClassifier(criterion = "entropy")
dtc_IG = dtc_IG.fit(table_X, table_y)

dtc_IG

In [None]:
# create a .dot file with the tree
tree.export_graphviz(dtc_IG, out_file='hike_IG.dot',
                     feature_names=['Lecture', 'Concert', 'ArtExpo', 'SeasonSales'],
                     class_names=['No', 'Yes'],
                     filled=True, rounded=True,
                     special_characters=True)

# create a .png file from the .dot file
from subprocess import call
call(['dot', '-T', 'png', 'hike_IG.dot', '-o', 'hike_IG.png'])

**The decision tree learned using the Information Gain is the following:**

<img src="hike_IG.png" alt="hike_IG.png" style="width: 300px;"/>

#### Additional Exercise: Learn Decision Trees for the Zoo Dataset

Do the necessary encodings to learn a decision tree for the Zoo dataset in file `zoo.csv`. 

You can also use the files  `zoo_train.csv` and `zoo_test.csv` as train and test sets as we did in Orange3. The first loading is already done for you.

In [None]:
import numpy as np
import pandas as pd

def load_data(fname):
    """Load CSV file with any number of consecutive features, starting in column 0, where last column is the class"""
    df = pd.read_csv(fname)
    nc = df.shape[1] # number of columns
    matrix = df.values # Convert dataframe to ndarray (as_matrix() was removed in pandas 1.0)
    table_X = matrix [:, 0:nc-1] # get features 
    table_y = matrix [:, nc-1] # get class (last columns)           
    features = df.columns.values[0:nc-1] #get features names
    target = df.columns.values[nc-1] #get target name
    return table_X, table_y, features, target

table_X, table_y, features, target = load_data('zoo.csv')

In [None]:
# Now it's up to you to code
#  ...



### 2.2 Handling Categorical Features with Multiple Values

**For categorical features with more than two possible values, a different approach is used. The idea is to encode each possible value as a distinct feature, using the so-called one-hot-encoding.**

Let's see how this is done using the small dataset in `votingIssue.csv`.

In [None]:
import numpy as np
import pandas as pd

def load_data(fname):
    """Load CSV file with any number of consecutive features, starting in column 0, where last column is the class"""
    df = pd.read_csv(fname)
    nc = df.shape[1] # number of columns
    matrix = df.values # Convert dataframe to ndarray (as_matrix() was removed in pandas 1.0)
    table_X = matrix [:, 0:nc-1] # get features 
    table_y = matrix [:, nc-1] # get class (last columns)           
    features = df.columns.values[0:nc-1] #get features names
    target = df.columns.values[nc-1] #get target name
    return table_X, table_y, features, target

table_X, table_y, features, target = load_data('votingIssue.csv')

In [None]:
features

In [None]:
table_X

In [None]:
# the first column 'Person' should not be used to learn the decision tree since it is the identifier 
# let's remove it from table_X
nc = table_X.shape[1] # number of columns
table_X = table_X[:, 1:nc] # remove column 0
table_X

In [None]:
# let's also remove 'Person' from the features names
features = features [1:features.size]
features

In [None]:
target

**We start by defining the function `ohenc_encode_feature` that, given the data table (`table_X`), a target column (`col`), the number of rows (`nrow`) and the number of possible values (`ndim`), replaces the original column with `ndim` new binary columns, which are appended at the end of the table.**

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def ohenc_encode_feature(table_X, col, nrow, ndim):
    enc = LabelEncoder()
    label_encoder = enc.fit(table_X[:, col])
    integer_classes = label_encoder.transform(label_encoder.classes_).reshape(ndim, 1)
    enc = OneHotEncoder()
    one_hot_encoder = enc.fit(integer_classes)
    # First, convert feature values to 0-(N-1) integers using label_encoder
    num_of_rows = nrow
    t = label_encoder.transform(table_X[:, col]).reshape(num_of_rows, 1)
    # Second, create a sparse matrix with ndim columns, each one indicating
    # whether the instance has the corresponding value
    new_features = one_hot_encoder.transform(t)
    # Add the new features to table_X
    table_X = np.concatenate([table_X, new_features.toarray()], axis = 1)
    # Eliminate converted columns
    table_X = np.delete(table_X, [col], 1)
    return table_X
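For comparison, recent Scikit-learn versions (0.20 or later, an assumption about your installation) let `OneHotEncoder` work directly on string values, without the intermediate `LabelEncoder` step. The sketch below uses a hypothetical toy column and shows, through `categories_`, in which order the new binary columns are created.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column (one feature, four examples); values are illustrative
col = np.array([["Single"], ["Married"], ["Divorced"], ["Married"]], dtype=object)

enc = OneHotEncoder()                        # returns a sparse matrix by default
one_hot = enc.fit_transform(col).toarray()   # dense array for easy inspection

print(enc.categories_[0])  # order of the new columns: sorted category values
print(one_hot)
```

Since the columns follow the sorted order of the category values, any feature names given to the new columns should be listed in that same order.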

**We still need the functions `int_encode_feature` and `int_encode_class`, since we have two features with two values ('Sex' and 'HasChildren'), and need to encode the target.**

In [None]:
from sklearn.preprocessing import LabelEncoder

def int_encode_class(vect):
    enc = LabelEncoder()
    label_encoder = enc.fit(vect)
    integer_classes = label_encoder.transform(label_encoder.classes_)
    t = label_encoder.transform(vect)
    return t
    
def int_encode_feature(vect):
    return int_encode_class(vect)

**Putting it all together, we can perform the encoding for the VotingIssue dataset as follows:**

In [None]:
# Encode  feature 'Sex'

table_X[:, 2] = int_encode_feature(table_X[:, 2])
table_X

In [None]:
# Encode  feature 'HasChildren'

table_X[:, 3] = int_encode_feature(table_X[:, 3])
table_X

In [None]:
# Encode feature 'Education'

# 1st - int_encode_feature
table_X[:,0] = int_encode_feature(table_X[:, 0])

# 2nd - ohenc_encode_feature
num_of_rows = table_X.shape[0]
table_X = ohenc_encode_feature(table_X, 0, num_of_rows, 3)

# 3rd Update feature names
features = ['MaritalStatus', 'Sex', 'HasChildren', 'Education-Primary', 'Education-Secondary', 'Education-University']
table_X

In [None]:
# Encode feature 'MaritalStatus'

# 1st - int_encode_feature
table_X[:,0] = int_encode_feature(table_X[:, 0])

# 2nd - ohenc_encode_feature
num_of_rows = table_X.shape[0]
table_X = ohenc_encode_feature(table_X, 0, num_of_rows, 3)

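# 3rd Update feature names
# (the new columns follow the sorted order of the category values used by
# LabelEncoder, so check that the names below are listed in that same order)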
features = ['Sex', 'HasChildren', 'Education-Primary', 'Education-Secondary', 
            'Education-University', 'MaritalStatus-Single', 'MaritalStatus-Married', 'MaritalStatus-Divorced']
table_X

In [None]:
# Convert table_X to numerical values
table_X = table_X.astype(float)
table_X

In [None]:
# ENCODE table_y (CLASS) with integers
table_y = int_encode_class(table_y)

# Convert table_y to numerical values
table_y = table_y.astype(float)
table_y

**Now that the data is encoded we can learn the decision tree.**

In [None]:
# Learn the decision tree

from sklearn import tree

cdt = tree.DecisionTreeClassifier(criterion='gini')
cdt = cdt.fit(table_X, table_y)
cdt

In [None]:
# Visualize the decision tree

tree.export_graphviz(cdt, out_file='votingIssue.dot',
                     feature_names=features,
                     filled=True, rounded=True,
                     special_characters=True)

from subprocess import call
call(['dot', '-T', 'png', 'votingIssue.dot', '-o', 'votingIssue.png'])

**The decision tree learned using the Gini Index is the following:**

<img src="votingIssue.png" alt="votingIssue.png" style="width: 750px;"/>

## 3. Learning Decision Trees from Multiple Types of Features
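When numeric and categorical features appear in the same table, only the categorical columns need encoding. One way to express this, sketched below under the assumption that `sklearn.compose.ColumnTransformer` is available (Scikit-learn 0.20 or later), is to one-hot encode the categorical columns and pass the numeric ones through unchanged. The toy `DataFrame` and its column names are hypothetical and do not come from the course datasets.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data mixing one numeric and two categorical features
df = pd.DataFrame({
    "Price": [10.0, 25.0, 40.0, 15.0],
    "Type": ["Thai", "French", "Burger", "Thai"],
    "Raining": ["no", "yes", "no", "yes"],
    "WillWait": ["yes", "no", "yes", "yes"],
})
X, y = df.drop(columns="WillWait"), df["WillWait"]

# One-hot encode the categorical columns; numeric columns pass through unchanged
pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Type", "Raining"])],
    remainder="passthrough")

# Chain the preprocessing and the classifier so both are fitted together
model = make_pipeline(pre, DecisionTreeClassifier(random_state=0)).fit(X, y)
print(model.predict(X))
```

Wrapping the encoding and the tree in a single pipeline means the same transformations are applied automatically to any new data passed to `predict`.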

### 3.1. A Small Dataset: The Restaurant Dataset

Do the necessary encodings to learn a decision tree for the Restaurant dataset in file `restaurante.csv`. The loading is already done for you.

In [None]:
import numpy as np
import pandas as pd

def load_data(fname):
    """Load CSV file with any number of consecutive features, starting in column 0, where last column is the class"""
    df = pd.read_csv(fname)
    nc = df.shape[1] # number of columns
    matrix = df.values # Convert dataframe to ndarray (as_matrix() was removed in pandas 1.0)
    table_X = matrix [:, 0:nc-1] # get features 
    table_y = matrix [:, nc-1] # get class (last columns)           
    features = df.columns.values[0:nc-1] #get features names
    target = df.columns.values[nc-1] #get target name
    return table_X, table_y, features, target

table_X, table_y, features, target = load_data('restaurante.csv')

In [None]:
table_X

In [None]:
table_y

In [None]:
features

In [None]:
target

In [None]:
# Now it's up to you to code
#  ...



### 3.2. A Not That Small Dataset: The Titanic Dataset

Do the necessary data preprocessing to learn a decision tree for the Titanic dataset in file `titanic.csv`. The loading is already done for you.

In [None]:
import numpy as np
import pandas as pd

def load_data(fname):
    """Load CSV file with any number of consecutive features, starting in column 0, where last column is the class"""
    df = pd.read_csv(fname)
    nc = df.shape[1] # number of columns
    matrix = df.values # Convert dataframe to ndarray (as_matrix() was removed in pandas 1.0)
    table_X = matrix [:, 0:nc-1] # get features 
    table_y = matrix [:, nc-1] # get class (last columns)           
    features = df.columns.values[0:nc-1] #get features names
    target = df.columns.values[nc-1] #get target name
    return table_X, table_y, features, target

table_X, table_y, features, target = load_data('titanic.csv')

In [None]:
table_X

In [None]:
table_X.shape

In [None]:
features

In [None]:
target

In [None]:
# Now it's up to you to code
#  ...
