# Chapter 6: Decision Trees

This chapter briefly introduces the principles of Decision Trees and how to use them with scikit-learn. This notebook contains my solutions to Exercises 7 and 8. Because of my laptop's limited computing power, grid search is not performed.

## Exercise 7: Decision Tree on the moons dataset

Requirement: Train and fine-tune a Decision Tree on the moons dataset.

The first step is to prepare the dataset. scikit-learn's `make_moons` generates the data, and `train_test_split` splits it into training and test sets.

In [32]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
data, target = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(data, target)
y_test

array([1, 0, 0, ..., 0, 1, 1])
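The cell above draws a different dataset and split on every run, so the scores reported later will vary slightly. A minimal sketch of a reproducible version follows; the `random_state=42` seeds and the explicit `test_size=0.2` are assumptions for illustration, not values used in this notebook (which keeps the default 25% test split).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Fixing random_state makes both the generated points and the split repeatable.
data, target = make_moons(n_samples=10000, noise=0.4, random_state=42)

# test_size=0.2 is an assumed value; the notebook relies on the default (0.25).
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)
```

With 10,000 samples and a 20% test split, this gives 8,000 training instances and 2,000 test instances, each with two features.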

With the dataset ready, train a Decision Tree on it. The grid search is left commented out because of the computing power limitations; the hyperparameters are set by hand instead.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# grid_search = GridSearchCV(
#     DecisionTreeClassifier(),
#     param_grid={"max_leaf_nodes": [], "max_depth": []}, 
#     cv=5,
#     n_jobs=-1,
#     scoring="precision"
# )
# grid_search.fit(X_train, y_train)
# model = grid_search.best_estimator_

model = DecisionTreeClassifier(max_leaf_nodes=17, max_depth=None)

model.fit(X_train, y_train)
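For readers who do want to run the search, here is a minimal sketch of what a small grid could look like. The candidate values in `param_grid`, the reduced sample size, `cv=3`, and the seeds are all assumptions chosen to keep the run cheap; they are not the grid intended by the commented-out cell above, whose lists are left empty.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# A smaller copy of the moons data so the search finishes quickly (assumed sizes and seeds).
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Hypothetical candidate values, for illustration only.
param_grid = {"max_leaf_nodes": [4, 8, 17, 32], "max_depth": [None, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,                 # 3-fold cross-validation: 12 combinations x 3 folds = 36 fits
    scoring="precision",  # same metric as in the commented-out cell
)
search.fit(X_tr, y_tr)

best_params = search.best_params_
best_model = search.best_estimator_  # already refit on the full training set
print("Best parameters:", best_params)
print("Test accuracy:", best_model.score(X_te, y_te))
```

A grid this small only needs a few dozen tree fits, so it may be feasible even on modest hardware; widening the lists increases the cost multiplicatively.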

Finally, check how well the model generalizes by comparing its scores on the training and test sets.

In [None]:
from sklearn.metrics import precision_score, roc_auc_score
predictions = model.predict(X_train)
print("---------- Training ----------")
print("Precision Score:", precision_score(y_train, predictions))
print("AUC:", roc_auc_score(y_train, predictions))
predictions = model.predict(X_test)
print("---------- Testing ----------")
print("Precision Score:", precision_score(y_test, predictions))
print("AUC:", roc_auc_score(y_test, predictions))
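The cell above passes hard 0/1 predictions to `roc_auc_score`, which evaluates the ROC curve at a single threshold only. A minimal sketch of the alternative, passing class probabilities from `predict_proba`, is shown below. The dataset size and seeds are assumptions, and the variable names (`tree`, `auc_from_labels`, `auc_from_scores`) are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed small, reproducible dataset for the comparison.
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_leaf_nodes=17, random_state=42).fit(X_tr, y_tr)

# AUC from hard labels: only one operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_te, tree.predict(X_te))

# AUC from the predicted probability of class 1: uses every distinct threshold.
auc_from_scores = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

For a Decision Tree, the probability is the class fraction in the leaf that an instance falls into, so a tree with 17 leaves yields at most 17 distinct scores.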

## Exercise 8: Grow a "forest"

Requirement: Grow a Random Forest from Decision Trees, based on the dataset and model above.

We draw 1,000 random subsets of the training set, each containing 100 instances. Then, using the hyperparameters above, we train one Decision Tree on each subset. Each test instance is predicted by all 1,000 trees, and the most frequent prediction is kept (a majority vote). This builds a simple Random Forest by hand.

In [None]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
n_trees = 1000
n_samples = 100
rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_samples, random_state=42)
mini_sets = []
for train_index, val_index in rs.split(X_train):
    X_train_fold, y_train_fold = X_train[train_index], y_train[train_index]
    mini_sets.append((X_train_fold, y_train_fold))
    
model_collection = [DecisionTreeClassifier(max_leaf_nodes=17) for _ in range(n_trees)]
for model, (X_train_fold, y_train_fold) in zip(model_collection, mini_sets):
    model.fit(X_train_fold, y_train_fold)
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

Check whether the Random Forest performs better than a single Decision Tree.

In [None]:
# A helper function to find the element that occurs most frequently in a list.
def max_occurence(input_array):
    categories = []
    count_dict = {}
    for item in input_array:
        if str(item) not in categories:
            categories.append(str(item))
            count_dict[str(item)] = 1
        else:
            count_dict[str(item)] += 1

    max_value = 0
    max_key = ""
    for key, value in count_dict.items():
        if value > max_value:
            max_value = value
            max_key = key

    return int(max_key)


# Use voting to create a random forest.
import numpy as np
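# Predict the whole test set with every tree at once. predict() expects a 2D array,
# so passing single 1D instances would fail. Result: one row per tree, one column per instance.
all_predictions = np.array([model.predict(X_test) for model in model_collection])

# Majority vote: for each test instance, keep the class predicted most often.
prediction_result = []
for prediction_collection in all_predictions.T:
    prediction_result.append(max_occurence(prediction_collection))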

prediction_result = np.array(prediction_result)
print("Final Precision Score:", precision_score(y_test, prediction_result))
print("Final AUC:", roc_auc_score(y_test, prediction_result))
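Because the labels here are only 0 and 1, the majority vote can also be computed without a helper function: the mean of the votes for an instance exceeds 0.5 exactly when most trees predicted 1. The sketch below shows this under stated assumptions: a smaller dataset, 100 trees instead of 1,000, fixed seeds, and hypothetical names (`forest`, `all_preds`, `forest_pred`). Ties are broken in favor of class 0, which is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed small, reproducible setup so the example runs in a few seconds.
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

n_trees, n_samples = 100, 100
splitter = ShuffleSplit(n_splits=n_trees, train_size=n_samples, random_state=42)

# Train one tree per random subset of 100 training instances.
forest = [
    DecisionTreeClassifier(max_leaf_nodes=17, random_state=42).fit(X_tr[idx], y_tr[idx])
    for idx, _ in splitter.split(X_tr)
]

# Shape (n_trees, n_test): row i holds tree i's predictions for the whole test set.
all_preds = np.array([tree.predict(X_te) for tree in forest])

# Binary majority vote: mean vote above 0.5 means most trees said 1 (ties go to 0).
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean([np.mean(pred == y_te) for pred in all_preds])
forest_acc = np.mean(forest_pred == y_te)
print("Mean accuracy of individual trees:", single_acc)
print("Accuracy of the majority vote:", forest_acc)
```

This shortcut only applies to binary 0/1 labels; with more classes, a per-column count of the votes would be needed instead.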