## Decision Tree Questions and Answers

1. **Approximate Depth of an Unrestricted Decision Tree**
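   An unrestricted Decision Tree usually ends up fairly well balanced, so its depth is roughly log2(m) for m training instances when each leaf holds a single instance. For one million instances that gives log2(1,000,000) ≈ 20, and in practice somewhat more because the tree is rarely perfectly balanced. Duplicate instances can make it shallower, and a badly unbalanced tree could be much deeper, but a depth anywhere near the number of instances is a worst case rather than the typical outcome.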

2. **Node's Gini Impurity Compared to Its Parent**
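   A node's Gini impurity is generally lower than its parent's, because each split is chosen to minimize the weighted sum of the children's Gini impurities. An individual child can still have a higher impurity than its parent, as long as the other child's lower impurity more than compensates. For example, a node holding four instances of class A and one of class B has a Gini impurity of 1 - (1/5)^2 - (4/5)^2 = 0.32. Splitting it into {A, B} and {A, A, A} yields children with impurities 0.5 and 0: the first is higher than the parent's, but the weighted average, (2/5)(0.5) + (3/5)(0) = 0.2, is lower.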

3. **Decision Tree Overfitting and `max_depth`**
   Decreasing `max_depth` is a good strategy to combat overfitting in a Decision Tree as it limits the complexity of the model, thereby making it generalize better to unseen data.

4. **Underfitting and Scaling Input Features**
   If a Decision Tree is underfitting, scaling the input features will not help: Decision Trees are insensitive to the scale and centering of their inputs, so the effort would be wasted. Increasing model complexity by relaxing constraints such as `max_depth` is what helps.

5. **Training Time Estimation with Increased Dataset Size**
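   Training a Decision Tree has a computational complexity of roughly O(n × m log(m)) for m instances and n features, so training time grows slightly faster than linearly with the number of instances. Multiplying the training set size by 10 multiplies the training time by about K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). For m = 1,000,000 this gives K ≈ 11.7, so training on 10 million instances should take roughly 11.7 times as long as training on 1 million, before accounting for effects such as memory pressure.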

6. **Using `presort=True` on Large Datasets**
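   Setting `presort=True` on large datasets is not recommended because it slows training down considerably. Presorting speeds up training only when the dataset is small (up to a few thousand instances), where the cost of sorting is relatively low. Recent Scikit-Learn releases have deprecated and then removed the `presort` parameter, so this question applies to older versions only.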

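To make answer 2 concrete, here is a minimal sketch that computes Gini impurity for a small hand-made node and one possible split of it. The `gini` helper and the toy labels are assumptions chosen for illustration; they are not part of Scikit-Learn.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Toy node (illustrative data): four instances of class A, one of class B
parent = ["A", "B", "A", "A", "A"]
# One possible split of that node into two children
left, right = ["A", "B"], ["A", "A", "A"]

g_parent = gini(parent)
g_left, g_right = gini(left), gini(right)
# Weighted average of the children's impurities, weighted by node size
g_children = (len(left) * g_left + len(right) * g_right) / len(parent)

print(f"parent: {g_parent:.2f}, left: {g_left:.2f}, right: {g_right:.2f}")
print(f"weighted children: {g_children:.2f}")
```

The left child is more impure than the parent (0.50 versus 0.32), yet the weighted impurity of the two children (0.20) is lower, which is the quantity the split is chosen to reduce.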

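The estimates in answers 1 and 5 can be reproduced with a few lines of arithmetic. This is a rough sketch under two stated assumptions: the tree is approximately balanced with one instance per leaf, and training cost scales as n × m × log(m).

```python
import math

m = 1_000_000  # number of training instances

# Assumption: a roughly balanced binary tree with one instance per leaf,
# so the depth is about log2(m), rounded up.
approx_depth = math.ceil(math.log2(m))

# Assumption: training cost scales as n * m * log(m). When m grows tenfold,
# the number of features n cancels out of the ratio.
ratio = 10 * math.log(10 * m) / math.log(m)

print(approx_depth)    # 20
print(f"{ratio:.1f}")  # 11.7
```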
In [2]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# a. Generate a moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# b. Split the dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# c. Grid search with cross-validation to find good hyperparameter values
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, cv=3, verbose=1)
grid_search_cv.fit(X_train, y_train)

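# d. Train it on the full training set using these hyperparameters.
#    GridSearchCV refits the best estimator on the whole training set by default
#    (refit=True), so best_estimator_ is already trained and needs no extra fit().
best_tree_clf = grid_search_cv.best_estimator_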

# Measure your model's performance on the test set
accuracy = best_tree_clf.score(X_test, y_test)
print(f"Decision Tree accuracy: {accuracy:.2%}")

Fitting 3 folds for each of 294 candidates, totalling 882 fits
Decision Tree accuracy: 86.95%

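The grid search above regularizes the tree through `max_leaf_nodes`. As a hedged illustration of why some constraint is needed, the sketch below compares an unrestricted tree with a constrained one on a smaller moons dataset. The sample size, the value `max_leaf_nodes=8`, and the `*_demo` variable names are arbitrary choices for this example, not tuned values from the cell above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the exercise, to keep the demo fast (assumption)
X_demo, y_demo = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Unrestricted tree: grows until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
# Constrained tree: at most 8 leaves (illustrative value, not tuned)
restricted = DecisionTreeClassifier(max_leaf_nodes=8, random_state=42).fit(X_tr, y_tr)

gaps = {}
for name, clf in [("unrestricted", unrestricted), ("max_leaf_nodes=8", restricted)]:
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    gaps[name] = train_acc - test_acc  # a large gap suggests overfitting
    print(f"{name}: leaves={clf.get_n_leaves()}, "
          f"train={train_acc:.2%}, test={test_acc:.2%}")
```

A large difference between training and test accuracy for the unrestricted tree, and a much smaller one for the constrained tree, is the pattern that motivates tuning `max_leaf_nodes` or `max_depth`.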

In [3]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.base import clone
from scipy.stats import mode


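# a. Generate 1,000 subsets of the training set, each containing 100 randomly
#    selected instances (everything else goes to ShuffleSplit's unused "test" part)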
rs = ShuffleSplit(n_splits=1000, test_size=len(X_train) - 100, random_state=42)
mini_sets = []
for mini_train_index, mini_test_index in rs.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))

# b. Train one Decision Tree on each subset
forest = [clone(best_tree_clf) for _ in range(1000)]
accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

print(f"Average accuracy of individual trees: {np.mean(accuracy_scores):.2%}")

# c. Now comes the magic: majority-vote predictions over the test set
Y_pred = np.array([tree.predict(X_test) for tree in forest])
Y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

# d. Evaluate these predictions on the test set
accuracy_majority_vote = accuracy_score(y_test, Y_pred_majority_votes.reshape(-1))
print(f"Majority-vote accuracy: {accuracy_majority_vote:.2%}")


Average accuracy of individual trees: 80.55%
Majority-vote accuracy: 87.20%

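The majority-vote step can also be written with NumPy alone, which makes the mechanics explicit. This is a sketch on a small hypothetical prediction matrix, not the 1,000 trees above; `majority_vote` and `Y_pred_demo` are illustrative names, and the helper assumes class labels are non-negative integers.

```python
import numpy as np

# Hypothetical predictions: 5 classifiers (rows) x 4 test instances (columns)
Y_pred_demo = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

def majority_vote(Y):
    """Return the most frequent label in each column of Y.

    Assumes labels are non-negative integers. Ties go to the smallest label.
    """
    n_classes = Y.max() + 1
    # counts[k, j] = number of classifiers that predicted class k for instance j
    counts = np.apply_along_axis(np.bincount, 0, Y, minlength=n_classes)
    return counts.argmax(axis=0)

votes = majority_vote(Y_pred_demo)
print(votes)  # [0 1 1 0]
```

Each column is counted independently, so an instance that a few trees misclassify is still labeled correctly whenever most trees agree on the right class.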



In [9]:
y_train

array([0, 0, 1, ..., 1, 1, 0], dtype=int64)