# Decision Trees

## Daniel Wilcox: 19147414

These exercises come from Chapter 6 ("Decision Trees") of "Hands-On Machine Learning with Scikit-Learn and TensorFlow" by Aurélien Géron.

This project investigates the theory behind Decision Trees and how to implement them.

In [1]:
import numpy as np

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.model_selection import ShuffleSplit

from sklearn.base import clone

from scipy.stats import mode



# Exercises
### 7. Train and fine-tune a Decision Tree for the moons dataset.


a. Generate a moons dataset using __make_moons(n_samples=10000, noise=0.4)__

In [2]:
#Generate Data:
X_moon, y_moon = make_moons(n_samples=10000, noise=0.4, random_state=42)

b. Split it into a training set and a test set using __train_test_split()__

In [3]:
#Split into training and testing data:
X_train, X_test, y_train, y_test = train_test_split(X_moon, y_moon, test_size=0.2, random_state=42)
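
As a quick sanity check on the two steps above, the sketch below regenerates the data with the same arguments and confirms the split sizes and the class balance. The variable names (`X_demo`, `X_tr`, and so on) are introduced here for illustration only and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Same arguments as the cells above
X_demo, y_demo = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8000, 2) (2000, 2)
print(np.bincount(y_demo))     # [5000 5000]
```

With `test_size=0.2`, 2,000 of the 10,000 instances are held out, so each of the 1,000 subsets drawn later in exercise 8 comes from the remaining 8,000.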

c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.


In [4]:
#Tune hyper-parameters:

#Model
desTree = DecisionTreeClassifier(random_state=42)

#Parameters
tree_param = {
    'criterion' : ['gini'],
    'max_leaf_nodes': [2, 5, 10, 50, 100],
    'max_depth' : [1, 2, 5],
    'min_samples_split' : [2, 5, 10],
    'min_samples_leaf' : [1, 5, 10],
}

#Grid-search
gs_desTree = GridSearchCV(desTree, tree_param, n_jobs=-1, verbose=1, cv=3)

#Fit
gs_desTree.fit(X_train, y_train)

Fitting 3 folds for each of 135 candidates, totalling 405 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed:    2.5s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'criterion': ['gini'], 'max_leaf_nodes': [2, 5, 10, 50, 100], 'max_depth': [1, 2, 5], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
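
The fitted search object also exposes the winning combination and its cross-validated score, which the output above does not show. The following is a minimal sketch under stated assumptions: it uses a smaller dataset and a reduced grid (both chosen here for speed, not taken from the cell above), so the values it prints will differ from those of `gs_desTree`.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the notebook, purely to keep the sketch fast
X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Reduced, illustrative grid: 4 x 3 = 12 candidate combinations
param_grid = {"max_leaf_nodes": [2, 5, 10, 50], "max_depth": [1, 2, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=3)
search.fit(X_tr, y_tr)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)         # 12
print(search.best_params_)  # combination with the best mean CV accuracy
print(search.best_score_)   # that mean CV accuracy
```

The same attributes are available on `gs_desTree`. The candidate count is the product of the list lengths in the grid, which is why the notebook's grid (1 x 5 x 3 x 3 x 3) reports 135 candidates.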

d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.

In [5]:
#Predict from test set:
y_pred = gs_desTree.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Decision Tree Accuracy Score: {:.2f}%".format(100*acc))

Decision Tree Accuracy Score: 85.45%


### 8. Grow a forest
a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn's ShuffleSplit class for this.

In [6]:
trees = 1000
instances = 100

len_train = len(X_train)
len_test = len_train - instances

#Draw 100 random training indices, 1000 times
shuffle = ShuffleSplit(n_splits=trees, test_size=len_test, random_state=42)

forest_sets = []

#Collect the (X, y) subset for each split:
for train_idx, test_idx in shuffle.split(X_train):
    X_train_f = X_train[train_idx]
    y_train_f = y_train[train_idx]
    forest_sets.append((X_train_f, y_train_f))
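
The cell above relies on `ShuffleSplit` treating everything outside `test_size` as the "train" part of each split. The sketch below illustrates that behaviour on placeholder data; it draws only 5 subsets instead of 1,000, and `X_dummy` is a stand-in introduced for this example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_train, n_subsets, subset_size = 8000, 5, 100  # 5 subsets for brevity
X_dummy = np.arange(n_train).reshape(-1, 1)     # placeholder data

splitter = ShuffleSplit(
    n_splits=n_subsets,
    test_size=n_train - subset_size,  # leaves exactly 100 indices on the train side
    random_state=42,
)
splits = list(splitter.split(X_dummy))

sizes = [(len(train_idx), len(rest_idx)) for train_idx, rest_idx in splits]
print(sizes)  # each entry is (100, 7900)

# Within one split the 100 indices are distinct
print(all(len(np.unique(train_idx)) == subset_size for train_idx, _ in splits))  # True
```

Within a split the indices are sampled without repetition, but separate splits are drawn independently, so the same training instance can appear in several subsets.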
    

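b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

In [7]:
#Get the best tree found by the grid search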
best_tree = gs_desTree.best_estimator_

#Clone best tree 1000 times:
forest = [clone(best_tree) for _ in range(trees)]

acc_score = []

#Fit each tree on its own subset and score it on the test set
for tree, (X_train_f, y_train_f) in zip(forest, forest_sets):
    tree.fit(X_train_f, y_train_f)
    
    y_pred = tree.predict(X_test)
    acc_score.append(accuracy_score(y_test, y_pred))

#Get mean accuracy score:
f_acc = np.mean(acc_score)
print("Forest Mean Accuracy Score: {:.2f}%".format(100*f_acc))

Forest Mean Accuracy Score: 79.66%



c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy's mode() function for this). This gives you majority-vote predictions over the test set.

In [8]:
#Empty array to be filled with each tree's prediction of the test data
vote_pred = np.empty([trees, len(X_test)], dtype=np.uint8)

#predict test data for each tree
for tree_idx, tree in enumerate(forest):
    vote_pred[tree_idx] = tree.predict(X_test)

#For each test instance, take the most frequent prediction and its vote count
y_pred_hard, n_votes = mode(vote_pred, axis=0)
y_pred_hard.reshape([-1])

array([1, 1, 0, ..., 0, 0, 0], dtype=uint8)
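
To make the voting step concrete, here is a small sketch on hypothetical votes from three trees for four test instances. It assumes only that `mode` returns a result with `mode` and `count` fields; `np.ravel` is used because the shape of those fields differs between SciPy versions.

```python
import numpy as np
from scipy.stats import mode

# Rows are trees, columns are test instances (hypothetical votes)
votes = np.array(
    [
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 0],
    ],
    dtype=np.uint8,
)

result = mode(votes, axis=0)
majority = np.ravel(result.mode)  # flatten: shape is (1, n) or (n,) depending on SciPy version
counts = np.ravel(result.count)

print(majority)  # [0 1 1 0]
print(counts)    # [2 3 2 3]
```

The `reshape([-1])` call in the cell above serves the same purpose as `np.ravel` here: it yields a flat array of one prediction per test instance.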


d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!

In [9]:
#get accuracy of voting:
acc_vote = accuracy_score(y_test, y_pred_hard.reshape([-1]))
print("Forest (hard) Vote Accuracy Score: {:.2f}%".format(100*acc_vote))

Forest (hard) Vote Accuracy Score: 83.30%
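
For comparison, Scikit-Learn ships its own `RandomForestClassifier`. The sketch below is only a rough analogue of the hand-built forest, not a reproduction of it: it draws bootstrap samples rather than `ShuffleSplit` subsets, averages predicted class probabilities instead of counting hard votes, and uses illustrative, untuned settings on a smaller dataset.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Smaller dataset than in the notebook, to keep the sketch fast
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative settings: 100 trees, each fit on 100 bootstrap samples
rf = RandomForestClassifier(
    n_estimators=100,
    max_samples=100,
    max_leaf_nodes=10,
    random_state=42,
)
rf.fit(X_tr, y_tr)

rf_acc = accuracy_score(y_te, rf.predict(X_te))
print("Random Forest Accuracy Score: {:.2f}%".format(100 * rf_acc))
```

Because the data size and hyperparameters differ, the printed score is not directly comparable with the 83.30% reported above.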
