# Exercises

1. What is the approximate depth of a decision tree trained (without restrictions) on a training set with one million instances?
2. Is a node's Gini impurity generally lower or greater than its parent's? Is it *generally* lower/greater or *always* lower/greater?
3. If a decision tree is overfitting the training set, is it a good idea to try decreasing `max_depth`?
4. If a decision tree is underfitting the training set, is it a good idea to try scaling the input features?
5. If it takes one hour to train a decision tree on a training set containing 1 million instances, roughly how much time will it take to train another decision tree on a training set containing 10 million instances?
6. If your training set contains 100,000 instances, will setting `presort = True` speed up training?
7. Train & fine-tune a decision tree for the moons dataset by following these steps:
   * Use `make_moons(n_samples = 10000, noise = 0.4)` to generate a moons dataset.
   * Use `train_test_split()` to split the dataset into a training set & a test set.
   * Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. Hint: try various values for `max_leaf_nodes`.
   * Train it on the full training set using these hyperparameters, & measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.
8. Grow a forest by following these steps:
   * Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: use scikit-learn's `ShuffleSplit()` class for this.
   * Train one decision tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform worse than the first decision tree, achieving only about 80% accuracy. 
   * Now here comes the magic. For each test instance, generate the predictions of the 1,000 decision trees, & keep only the most frequent prediction (you can use SciPy's `mode()` function for this). This approach gives you *majority-vote predictions* over the test set.
   * Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a random forest classifier!

---

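1. By default the depth of a decision tree is unrestricted (`max_depth = None`), so the tree keeps growing until its leaves are pure or cannot be split further. A reasonably balanced binary tree with one leaf per training instance has a depth of about $\log_2(m)$, so for one million instances the depth is roughly $\log_2(10^6) \approx 20$, and somewhat more in practice because the tree is rarely perfectly balanced. To cap the depth, set the `max_depth` hyperparameter to a positive integer.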
2. A node's Gini impurity is generally lower than its parent's, but not always. The CART algorithm doesn't try to minimise each child's Gini impurity individually. It picks the split that minimises the weighted sum of the Gini impurities of the left & right subsets, weighted by their sizes. Because of this, the Gini impurity of, say, the right subset can be higher than the parent's, as long as the Gini impurity of the left subset is low enough to compensate.
3. Yes. If a decision tree is overfitting the training set, decreasing `max_depth` is a good idea, since it constrains (regularises) the model.
4. No. Decision trees are not affected by whether the input features are scaled or centred, so scaling will not help. Instead, to fix an underfitting decision tree, you should decrease the `min_*` hyperparameters & increase the `max_*` hyperparameters.
5. Training a decision tree has a complexity of $O(n \times m \log_2(m))$, where $m$ is the number of training instances & $n$ the number of features. Multiplying the training set size by 10 gives $O(n \times 10m \log_2(10m))$, so the training time is multiplied by $\frac{n \times 10m \log_2(10m)}{n \times m \log_2(m)} = \frac{10 \log_2(10m)}{\log_2(m)}$. With $m = 10^6$ this gives $\frac{232.53}{19.93} \approx 11.67$, so if training on 1 million instances takes one hour, training on 10 million instances should take approximately 11 hours & 40 minutes.
6. No. Setting `presort = True` only speeds up training for small training sets (fewer than a few thousand instances). With 100,000 instances it would slow training down considerably.
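
The arithmetic behind answers 1, 2 & 5 can be checked with a short script. This is only a sketch: the `gini` helper & the five-instance node (four instances of class A, one of class B, split into {A, B} & {A, A, A}) are hypothetical illustrations, not something taken from the exercises.

```python
import math


def gini(counts):
    """Gini impurity of a node, given its per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Answer 1: a roughly balanced binary tree with one leaf per instance
# has a depth of about log2(m).
m = 1_000_000
approx_depth = math.log2(m)

# Answer 2: hypothetical parent node with 4 instances of class A & 1 of class B,
# split into a left child {A, B} & a right child {A, A, A}.
parent = gini([4, 1])
left = gini([1, 1])    # higher than the parent's impurity
right = gini([3, 0])   # pure node
weighted = (2 / 5) * left + (3 / 5) * right  # what CART actually minimises

# Answer 5: ratio of n * 10m * log2(10m) to n * m * log2(m), with m = 10^6.
ratio = 10 * math.log2(10 * m) / math.log2(m)

print(f"approximate depth: {approx_depth:.2f}")
print(f"parent: {parent:.2f}, left: {left:.2f}, right: {right:.2f}, weighted: {weighted:.2f}")
print(f"training time ratio: {ratio:.2f}")
```

In this toy split the left child is more impure than its parent, yet the weighted impurity of the two children is still lower than the parent's, which is why the answer to exercise 2 is "generally" rather than "always".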

# 7.

In [2]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Note: the exercise asks for n_samples = 10000; the outputs below were produced with 1000
X, y = make_moons(n_samples = 1000, noise = 0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 32)
X_train

array([[ 0.65815584, -0.97254898],
       [-0.51110222, -0.03089614],
       [ 0.20462693,  1.31207084],
       ...,
       [-0.23060328,  1.0514219 ],
       [ 0.67051093,  0.83189349],
       [ 1.8084936 , -0.16454155]])

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

tree_classifier = DecisionTreeClassifier()
param_search_space = {"max_depth": list(range(1, 6)),
                      "min_samples_leaf": list(range(20, 31)),
                      "max_leaf_nodes": list(range(2, 7))}
grid_search = GridSearchCV(tree_classifier, param_search_space, cv = 5, n_jobs = 7,
                           scoring = "accuracy", return_train_score = True)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'max_depth': 4, 'max_leaf_nodes': 5, 'min_samples_leaf': 20}

In [5]:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(grid_search.best_estimator_, X_train, y_train, 
                         scoring = "accuracy", cv = 10, n_jobs = 7)
print("Accuracy Scores: ", scores)
print("Mean: ", scores.mean())
print("StdDev: ", scores.std())

Accuracy Scores:  [0.8125 0.85   0.9375 0.8875 0.85   0.75   0.85   0.8875 0.7875 0.8125]
Mean:  0.8424999999999999
StdDev:  0.05159941860137572


In [6]:
from sklearn.metrics import accuracy_score

y_pred = grid_search.best_estimator_.predict(X_test)
accuracy_score(y_test, y_pred)

0.83
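
For comparison, here is a minimal sketch that follows the exercise statement literally: 10,000 samples & a grid search over `max_leaf_nodes` only. The `random_state = 42` values, the `range(2, 40)` search space & `cv = 3` are my own assumptions, chosen to keep the run short & reproducible. The resulting accuracy depends on them, so none is quoted here.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# 10,000 samples, as the exercise specifies (the seed is an arbitrary assumption)
X, y = make_moons(n_samples = 10000, noise = 0.4, random_state = 42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 42)

# Search over max_leaf_nodes only, following the hint (the range is an assumption)
param_grid = {"max_leaf_nodes": list(range(2, 40))}
search = GridSearchCV(DecisionTreeClassifier(random_state = 42), param_grid,
                      cv = 3, scoring = "accuracy")
search.fit(X_train, y_train)

# With the default refit = True, the best model is retrained on the full training set
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_accuracy)
```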

---

# 8.

In [7]:
from sklearn.model_selection import ShuffleSplit

all_subsets = []

randomsplit = ShuffleSplit(n_splits = 1000, train_size = 100, random_state = 32)
# Draw 1,000 random subsets of 100 training instances each (the test indices are unused)
for train_index, _ in randomsplit.split(X_train):
    X_subset_train, y_subset_train = X_train[train_index], y_train[train_index]
    all_subsets.append((X_subset_train, y_subset_train))

In [9]:
from sklearn.base import clone

forest = [clone(grid_search.best_estimator_) for _ in range(1000)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, all_subsets):
    tree.fit(X_mini_train, y_mini_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.769685

In [10]:
# One row of test-set predictions per tree
y_pred = np.empty([1000, len(X_test)], dtype = np.uint8)

for tree_index, tree in enumerate(forest):
    y_pred[tree_index] = tree.predict(X_test)

In [11]:
from scipy.stats import mode

# Most frequent prediction across the 1,000 trees, for each test instance
y_pred_majority, n_votes = mode(y_pred, axis = 0)
accuracy_score(y_test, y_pred_majority.reshape([-1]))

0.785

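Where is the magic? Here the majority vote (0.785) beats the average individual tree (about 0.770), but it still falls short of the single tuned tree from exercise 7 (0.83).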
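
To see what the majority vote does on a small scale, here is a toy sketch with a hypothetical 5 × 4 matrix of predictions (5 trees, 4 test instances). It uses a NumPy-only alternative to `mode()` that works for binary labels: an instance is assigned class 1 when more than half of the trees vote for it.

```python
import numpy as np

# Hypothetical predictions: one row per tree, one column per test instance
votes = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
], dtype = np.uint8)

# Fraction of trees voting for class 1 on each test instance
share_of_ones = votes.mean(axis = 0)

# Majority vote for binary labels (a tie would go to class 0 with this comparison)
majority = (share_of_ones > 0.5).astype(np.uint8)
print(majority)
```

With an odd number of trees a tie cannot happen for binary labels, so the tie-breaking rule only matters for an even number of voters, such as the 1,000 trees used above.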