In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X,y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [2]:
from sklearn.tree import export_graphviz

# image_path() is a helper that is not defined in this notebook,
# so write the .dot file to the current working directory instead.
export_graphviz(
        tree_clf,
        out_file='iris_tree.dot',
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
)

In [3]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X,y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=2,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

## Exercises

1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with one million instances?

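A well-balanced binary tree with m leaves has a depth of about log2(m). An unrestricted Decision Tree ends up with roughly one leaf per training instance, so for one million instances the depth is approximately log2(10^6) ≈ 20. In practice it will be somewhat deeper, because the tree is rarely perfectly balanced.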
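The estimate can be checked with a one-line calculation. This is only a sketch of the arithmetic, assuming a perfectly balanced tree with one training instance per leaf; the variable names are illustrative.

```python
import math

n_instances = 1_000_000

# Depth of a perfectly balanced binary tree with one instance per leaf.
balanced_depth = math.log2(n_instances)

print(f"log2({n_instances}) = {balanced_depth:.2f}")
print(f"approximate depth: {round(balanced_depth)}")
```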

2. Is a node's Gini impurity generally lower or greater than its parent's? Is it generally lower/greater, or always lower/greater?

A node's Gini impurity is generally lower than its parent's, but not always. The CART cost function chooses each split to minimize the weighted sum of the two children's impurities, so the children are purer than the parent on average. A single child can still have a higher impurity than its parent, as long as that increase is more than offset by a decrease in the other child's impurity.
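A small worked example may make this concrete. The sketch below uses a hypothetical one-dimensional node holding the classes A, B, A, A, A (in feature order) and splits it after the second instance; the `gini` helper is written here for illustration and is not part of scikit-learn.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Hypothetical parent node, instances sorted by their single feature.
parent = ["A", "B", "A", "A", "A"]
left, right = parent[:2], parent[2:]   # split after the second instance

parent_gini = gini(parent)
left_gini = gini(left)
right_gini = gini(right)

# Weighted sum of the children's impurities, which is what the split minimizes.
weighted_children = (len(left) * left_gini + len(right) * right_gini) / len(parent)

print(f"parent: {parent_gini:.2f}")
print(f"left child: {left_gini:.2f}, right child: {right_gini:.2f}")
print(f"weighted children: {weighted_children:.2f}")
```

The left child is more impure than the parent, yet the weighted impurity of the two children is lower than the parent's, so the split still improves the cost.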

3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

Yes. Decreasing max_depth constrains the model, which regularizes it, and regularization is a standard way to reduce overfitting.
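The effect can be observed by comparing an unrestricted tree with a depth-limited one. The following is a minimal sketch on a small moons dataset; the dataset size, `max_depth=3`, and the variable names are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Unrestricted tree: grows until every training instance is classified correctly.
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Regularized tree: the depth is capped (3 is an arbitrary illustrative value).
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

deep_train_acc = deep_tree.score(X_train, y_train)
shallow_train_acc = shallow_tree.score(X_train, y_train)

print("unrestricted: train", deep_train_acc, "test", deep_tree.score(X_test, y_test))
print("max_depth=3:  train", shallow_train_acc, "test", shallow_tree.score(X_test, y_test))
```

A large gap between training and test accuracy for the unrestricted tree is the usual symptom of overfitting; the depth-limited tree gives up some training accuracy in exchange for a smaller gap.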

4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

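No. Decision Trees are insensitive to whether the input features are scaled or centered, because each split only compares a single feature against a threshold. Scaling the inputs will therefore not fix underfitting; relaxing the model's constraints (for example, increasing max_depth) is the more useful thing to try.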
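To illustrate why scaling has no effect, here is a pure-Python sketch of a single threshold split on one feature. It is a simplified stand-in for the split search, not scikit-learn's implementation, and the data, the helper names, and the scaling constants are all made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split_partition(xs, ys):
    """Return the indices sent to the left child by the best threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate two equal values
        left, right = order[:k], order[k:]
        cost = (len(left) * gini([ys[i] for i in left])
                + len(right) * gini([ys[i] for i in right])) / len(xs)
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

# Toy feature values and labels (illustrative only).
xs = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 1.7, 5.1]
ys = [0, 0, 0, 1, 1, 1, 0, 1]

# Shift and rescale the feature; the constants are arbitrary.
xs_scaled = [(x - 3.1) / 1.7 for x in xs]

raw_partition = best_split_partition(xs, ys)
scaled_partition = best_split_partition(xs_scaled, ys)
same_partition = raw_partition == scaled_partition
print(same_partition)
```

Because the split depends only on the ordering of the feature values, any increasing transformation of the feature yields the same partition of the instances; only the numeric threshold changes.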

5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

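The computational complexity of training a Decision Tree is O(n × m log(m)), where m is the number of instances and n the number of features. Multiplying the training set size by 10 multiplies the training time by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). With m = 10^6, K ≈ 11.7, so training on 10 million instances should take roughly 11.7 hours.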
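The ratio can be computed directly. This is a sketch of the arithmetic only, assuming training time scales as m log(m) with the number of features held fixed; the variable names are illustrative.

```python
import math

m_small = 10**6        # instances in the original training set
m_large = 10**7        # instances in the larger training set
hours_small = 1.0      # training time observed for the original set

# Training time is assumed proportional to m * log(m); the log base cancels out.
ratio = (m_large * math.log(m_large)) / (m_small * math.log(m_small))
hours_large = hours_small * ratio

print(f"time ratio: {ratio:.2f}")
print(f"estimated training time: {hours_large:.1f} hours")
```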

6. If your training set contains 100,000 instances, will setting presort=True speed up training?

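No. Presorting the training set speeds up training only when the dataset is smaller than a few thousand instances. With 100,000 instances, setting presort=True would slow training down considerably. Note also that the estimator outputs above print presort='deprecated', so the parameter is deprecated in the scikit-learn version used here.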

7. Train and fine-tune a Decision Tree for the moons dataset by following these steps:

    a. Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset.
    
    b. Use train_test_split() to split the dataset into a training set and a test set.
    
    c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.
    
    d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.

In [4]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
from sklearn.model_selection import GridSearchCV

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2,3,4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 294 candidates, totalling 882 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 882 out of 882 | elapsed:    7.6s finished


GridSearchCV(cv=3, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,


In [7]:
grid_search_cv.best_estimator_

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=17,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [8]:
from sklearn.metrics import accuracy_score

y_pred = grid_search_cv.predict(X_test)
accuracy_score(y_test, y_pred)

0.8695

8. Grow a forest by following these steps:

    a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn's ShuffleSplit class for this.

    b. Train one Decision Tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the final Decision Tree, achieving only about 80% accuracy.

    c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy's mode() function for this). This approach gives you majority-vote predictions over the test set.

    d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest Classifier!

In [10]:
from sklearn.model_selection import ShuffleSplit

n_trees = 1000
n_instances = 100

mini_sets = []

rs = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances,
                  random_state=42)
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

In [12]:
from sklearn.base import clone
import numpy as np

forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    
np.mean(accuracy_scores)

0.8054499999999999

In [14]:
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

In [15]:
from scipy.stats import mode

y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [16]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.872