## Building your first Decision Tree

In this tutorial, we walk through the steps involved in building a decision tree: formulating your data, one-hot encoding it, training a decision tree model, and validating the accuracy of the trained tree on a test dataset. We will use sklearn's DecisionTreeClassifier API to build the decision tree.

### Formulating the data

In this section, we discuss how the data must be formulated before it can be passed to the API. Consider the mushroom edibility dataset, which contains various features for each mushroom entry, such as odour and habitat. There are 22 such features and a total of 8124 data points in this dataset.

The following lines of code load the data and organise it into features and labels.

In [9]:
import numpy as np
import pandas as pd

# Loading the data from CSV file
data = pd.read_csv("mushrooms.csv")

# The first column of the csv file contains the labels
y = data.iloc[:, 0] # All rows, 0th column

# The rest of the columns of the csv file contains the features
X = data.iloc[:, 1:].copy() # All rows, column 1 to end (copied so X can be modified safely later)

In [10]:
# Let us look at what a row from the data contains

print("Features of Row 1: ", X.iloc[1].values)
print("Class label of Row 1: ", y.iloc[1])

# Let us see the shape of X
print("Dimensions of X: ", X.shape)

Features of Row 1:  ['x' 's' 'y' 't' 'a' 'f' 'c' 'b' 'k' 'e' 'c' 's' 's' 'w' 'w' 'p' 'w' 'o'
 'p' 'n' 'n' 'g']
Class label of Row 1:  e
Dimensions of X:  (8124, 22)


The data has been loaded and, as you can see, each feature is represented by a character. All the features are categorical. A categorical feature is one that takes its value from a discrete set of possible values. For example, the cap-surface feature can be one of 'fibrous', 'grooves', 'scaly' or 'smooth', each of which is represented by a character in the dataset. In contrast, a numerical feature such as length can take any numerical value within a given range. Also, as expected, X has 22 columns, one for each feature in the dataset.

Moving forward, there are some requirements that your data needs to satisfy before it can be used to create the decision tree. Firstly, sklearn's decision tree classifier requires numeric data to be passed to it. Therefore, the categorical features, represented by characters, must be converted into numeric codes. This is done by mapping each distinct value of a feature to a corresponding integer.


In [11]:
# Map the categorical features to numerical values

for col in X.columns:
    # Each distinct value gets an integer code, in order of first appearance
    X[col] = X[col].map({label: idx for idx, label in enumerate(X[col].unique())})

# Let us see what the features look like now.
print("Features of Row 1 after numerical encoding", X.iloc[1].values)

Features of Row 1 after numerical encoding [0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1]


As you can see, the categorical data is now represented by numbers instead of characters, so the API will be able to process our data.
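To make the mapping step concrete, here is a minimal sketch in plain Python that applies the same idea to a single toy column. The column values are made up for illustration and are not taken from the mushroom dataset; each distinct value receives an integer in order of first appearance, which is the order that `unique()` returns in the code above.

```python
# Toy column of single-character category codes (hypothetical values).
cap_surface = ["s", "y", "s", "f", "y", "g"]

# Assign each distinct value an integer code in order of first appearance.
mapping = {}
for value in cap_surface:
    if value not in mapping:
        mapping[value] = len(mapping)

# Replace every value in the column by its integer code.
encoded = [mapping[value] for value in cap_surface]

print(mapping)   # {'s': 0, 'y': 1, 'f': 2, 'g': 3}
print(encoded)   # [0, 1, 0, 2, 1, 3]
```

Note that the codes depend only on the order in which values happen to appear, so they carry no meaning beyond identifying the category.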

However, representing categorical data by integer codes is not ideal. The codes imply an ordering and a magnitude that the categories do not have, so a machine learning algorithm may treat one category as greater than another simply because its code is numerically larger. A common strategy to overcome this problem is to one-hot encode categorical data.

#### One-hot Encoding

One-hot encoding is used to binarize categorical data. In other words, one-hot encoding modifies the data representation such that, if a feature has 'n' possible values, that feature is represented by 'n' columns, each of which contains either 1 or 0, indicating the presence or absence of the corresponding value respectively.

Let us consider an example. Look at the following data:

<table>
  <tr>
    <th>Colour</th>
    <th>Shape</th> 
    <th>Size</th>
  </tr>
  <tr>
    <td>Red</td>
    <td>Circle</td> 
    <td>Small</td>
  </tr>
  <tr>
    <td>White</td>
    <td>Box</td> 
    <td>Small</td>
  </tr>
  <tr>
    <td>Red</td>
    <td>Box</td> 
    <td>Medium</td>
  </tr>
  <tr>
    <td>Black</td>
    <td>Circle</td> 
    <td>Large</td>
  </tr>
</table>

This data has 3 features: Colour, Shape and Size. If we were to encode each of the categorical features numerically, this is one possible way of doing it:<br>
<b>Colour</b> : Red = 0, White = 1, Black = 2 <br>
<b>Shape</b> : Circle = 0, Box = 1 <br>
<b>Size</b> : Small = 0, Medium = 1, Large = 2<br>

The first row in the data (Colour = Red, Shape = Circle, Size = Small) can be numerically encoded as (0, 0, 0).  
With one-hot encoding, the same row is represented as [[1, 0, 0], [1, 0], [1, 0, 0]]. Each inner list corresponds to one feature; 1 marks the value that is present and 0 marks the values that are absent. Thus [1, 0, 0] for Colour means that Red is present while White and Black are absent.

On concatenation, the first row is represented as [1, 0, 0, 1, 0, 1, 0, 0]. This is the final one-hot encoding of that row.
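The worked example can be reproduced with a few lines of plain Python. This is only a sketch of the idea, using the integer codes listed above; the helper names `one_hot` and `encode_row` are hypothetical and are not part of sklearn.

```python
# Integer codes for each feature, as listed in the example.
colour = {"Red": 0, "White": 1, "Black": 2}
shape = {"Circle": 0, "Box": 1}
size = {"Small": 0, "Medium": 1, "Large": 2}

rows = [
    ("Red", "Circle", "Small"),
    ("White", "Box", "Small"),
    ("Red", "Box", "Medium"),
    ("Black", "Circle", "Large"),
]

def one_hot(index, n_values):
    """Return a list of n_values zeros with a single 1 at position index."""
    vector = [0] * n_values
    vector[index] = 1
    return vector

def encode_row(row):
    """One-hot encode each feature of a row and concatenate the results."""
    c, sh, sz = row
    return (one_hot(colour[c], len(colour))
            + one_hot(shape[sh], len(shape))
            + one_hot(size[sz], len(size)))

encoded_rows = [encode_row(row) for row in rows]
for encoded_row in encoded_rows:
    print(encoded_row)
# [1, 0, 0, 1, 0, 1, 0, 0]
# [0, 1, 0, 0, 1, 1, 0, 0]
# [1, 0, 0, 0, 1, 0, 1, 0]
# [0, 0, 1, 1, 0, 0, 0, 1]
```

Every row ends up with 3 + 2 + 3 = 8 columns, one per distinct value across the three features.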

Let us look at sklearn.preprocessing.OneHotEncoder, which we will use to one-hot encode the categorical features in our mushroom dataset. In older versions of scikit-learn, the data passed to this API had to be numerical, which is why we mapped the characters to integers above (since version 0.20 the encoder also accepts string categories directly). First, an object of the OneHotEncoder class is created and fitted to the data. The data is then passed to the transform() function of the fitted object (or to fit_transform(), which combines both steps), which outputs the data with every column one-hot encoded.

For more details, have a look at the documentation http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [4]:
# Create the one-hot encoding of the data.

from sklearn.preprocessing import OneHotEncoder

# Fit the encoder and keep it, so that any data acquired later can be encoded the same way.
encoder = OneHotEncoder().fit(X)

# Transform the data to its one-hot encoding (returned as a sparse matrix by default)
X = encoder.transform(X)

# Let us see the shape of X
print(X.shape)

(8124, 117)


As you can see, the number of columns has increased from 22 to 117, because each feature now occupies one column per distinct value it takes.
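The relationship between the number of distinct values and the number of output columns can be checked on a small example. The sketch below uses a made-up integer-encoded array (named `X_toy` here purely for illustration) rather than the mushroom data, and converts the sparse result to a dense array so that it can be printed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy integer-encoded data: 3 features with 3, 2 and 3 distinct values.
X_toy = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [2, 0, 2],
])

# Fit the encoder, transform the data and convert the sparse result to dense.
encoder_toy = OneHotEncoder().fit(X_toy)
X_toy_encoded = encoder_toy.transform(X_toy).toarray()

# Count the distinct values in each column of the original data.
n_distinct = [len(np.unique(X_toy[:, j])) for j in range(X_toy.shape[1])]

print(n_distinct)            # [3, 2, 3]
print(X_toy_encoded.shape)   # (4, 8)
```

The encoded array has 3 + 2 + 3 = 8 columns, in the same way that the 22 mushroom features expand to 117 columns.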

The data can now be passed to the sklearn decision tree classifier API, as it is one-hot encoded. Before doing so, it is essential to split the data into train and test datasets.

#### Splitting the data into train and test

It is essential to split the data into train and test sets. If we validate the accuracy of the model on the training data itself, we cannot tell whether the model has learnt patterns that generalise or has merely memorised noise that is specific to the training data. By validating the accuracy on a test dataset that the model has never seen, we can measure in a valid manner whether the model has learnt the general patterns in the data.

This is done using the train_test_split function in sklearn.model_selection, as shown in the code below.

For more details, have a look at the documentation http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [5]:
# Here, X is the input features, y is the input labels. X_train and X_test are the train and test splits 
# respectively of the features. Likewise for the labels in y. The test_size gives the fraction of the total 
# data that is in the test data. The random_state is used to create the same split each time the function is called.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=4)
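To see the effect of test_size and random_state in isolation, here is a small sketch on made-up data (10 samples, named `X_toy` and `y_toy` for illustration only). It checks that 30% of the samples end up in the test set and that calling the function twice with the same random_state gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and one label per sample.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with identical arguments, including the same random_state.
a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4)
c_train, c_test, d_train, d_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4)

print(a_train.shape, a_test.shape)      # (7, 2) (3, 2)
print(np.array_equal(a_test, c_test))   # True: the split is reproducible
```

Leaving out random_state would give a different shuffle on each call, which makes results harder to compare between runs.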

Now, your data is ready to be used to build a decision tree and evaluate its accuracy.

### Training your decision tree

The next step is to fit a decision tree to your training data. For this we will use the DecisionTreeClassifier API, which can be imported from sklearn.tree. The API takes a number of parameters that give you finer control over what the decision tree learns. For now, we will use the API with all of its parameters set to their default values. The fit() function makes the decision tree learn from the training data, which is passed to it as parameters.

For more details, have a look at the documentation http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


In [6]:
from sklearn.tree import DecisionTreeClassifier

# Create an object of the DecisionTreeClassifier class.
model_tree = DecisionTreeClassifier()

# Call fit() on the created object, passing the training features and labels.
model_tree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
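As one illustration of how the parameters control what the tree learns, the sketch below compares a tree with default settings against one restricted with max_depth=1. The four-sample dataset is made up for this example (the label is 1 only when both features are 1), and random_state=0 is set only to make the run repeatable.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label is 1 only when both binary features are 1.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 0, 1])

# Default parameters: the tree grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# max_depth=1 allows a single split only.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

print(full_tree.get_depth(), full_tree.score(X_toy, y_toy))   # 2 1.0
print(stump.get_depth(), stump.score(X_toy, y_toy))           # 1 0.75
```

The unrestricted tree needs two levels to separate the classes, whereas the depth-limited tree can test only one feature and therefore misclassifies one of the four samples.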

### Test and evaluate your model

Now that your decision tree model has been trained, we will compute its accuracy on the test dataset. This gives us a measure of how well the decision tree has been trained. A high accuracy on the test dataset indicates that the decision tree has learnt patterns that generalise beyond the training data. The score() function of the DecisionTreeClassifier object outputs the accuracy. The features and labels of the test data are passed to the function as parameters.

In [7]:
# The score() function is called with the test data as parameters.
accuracy = model_tree.score(X_test, y_test)

# The accuracy is output as a fraction. We multiply by 100 to obtain the accuracy in percentage.
print(accuracy * 100)

100.0


You should get an accuracy of 100% on the test dataset.
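For a classifier, the accuracy returned by score() is the fraction of samples whose predicted label matches the true label. The following plain-Python sketch computes that fraction by hand on made-up labels; the lists `y_true` and `y_pred` are hypothetical and do not come from the model above.

```python
# Hypothetical true labels and predictions for five samples.
y_true = ["e", "p", "e", "e", "p"]
y_pred = ["e", "p", "p", "e", "p"]

# Count the positions where the prediction matches the true label.
correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)

# Accuracy is the fraction of correct predictions.
accuracy_toy = correct / len(y_true)

print(correct, "of", len(y_true), "predictions are correct")
print(accuracy_toy)
```

Here four of the five predictions match, so the accuracy is 0.8, or 80% after multiplying by 100.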

### Using RandomForestClassifier to classify data 

Using sklearn.ensemble.RandomForestClassifier is similar to using the DecisionTreeClassifier discussed above. The first stage is to fit an object of RandomForestClassifier to the training data. This is done using the fit() function, passing the training data as parameters. After that, the accuracy of the RandomForestClassifier can be obtained by calling the score() function with the test data as parameters. This is done in the code given below.

In [8]:
from sklearn.ensemble import RandomForestClassifier

# Create an object of RandomForestClassifier
clf = RandomForestClassifier()

# Fit your training data
clf.fit(X_train, y_train)

# Check the accuracy 
y_acc = clf.score(X_test, y_test)
print(y_acc * 100)

100.0
