## 1.
_Exercise: train and fine-tune a Decision Tree for the moons dataset._

a. Generate a moons dataset using `make_moons(n_samples=10000, noise=0.4)`.

We add `random_state=42` to make this notebook's output reproducible:

In [1]:
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=42)

b. Split it into a training set and a test set using `train_test_split()`.

In [2]:
from sklearn.model_selection import train_test_split

# Split the moons dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, test_size=0.2, random_state=42)

c. Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. Hint: try various values for `max_leaf_nodes`.

In [3]:
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV


# Set up the parameter grid for max_leaf_nodes
param_grid = {"max_leaf_nodes": list(range(2, 100))}

dt_clf = DecisionTreeClassifier(random_state=42)

grid_search = GridSearchCV(dt_clf, param_grid, cv=3, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best hyperparameters
best_leaf_nodes = grid_search.best_params_["max_leaf_nodes"]
best_leaf_nodes

17

In [4]:
# Train the best Decision Tree on the full training set
best_tree = DecisionTreeClassifier(max_leaf_nodes=best_leaf_nodes, random_state=42)
best_tree.fit(X_train, y_train)

d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.

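By default, `GridSearchCV` retrains the best model it found on the whole training set (you can disable this by setting `refit=False`), so `grid_search.best_estimator_` is already fitted and the explicit retraining of `best_tree` above is not strictly necessary. Both trees use the same hyperparameters, the same `random_state`, and the same training data, so they should be equivalent. We can simply evaluate `best_tree` on the test set: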

In [5]:
from sklearn.metrics import accuracy_score

# Evaluate the best model on the test set

y_pred = best_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.8695
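The `refit` behaviour described above can be checked directly. The following is a minimal sketch rather than part of the exercise: it assumes a smaller dataset and a shorter grid (both chosen only to keep it fast), and the names `search`, `manual_tree`, `X_tr`, and `X_te` are illustrative. It compares the estimator that `GridSearchCV` refits automatically with a tree retrained by hand on the same data.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: smaller dataset and grid than in the exercise, to keep this quick.
X, y = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_leaf_nodes": [4, 8, 16, 32]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # refit=True by default, so best_estimator_ is fitted on X_tr

# Retrain by hand with the winning hyperparameters, mirroring cell In [4].
manual_tree = DecisionTreeClassifier(random_state=42, **search.best_params_)
manual_tree.fit(X_tr, y_tr)

refit_pred = search.best_estimator_.predict(X_te)
manual_pred = manual_tree.predict(X_te)

refit_accuracy = accuracy_score(y_te, refit_pred)
manual_accuracy = accuracy_score(y_te, manual_pred)
same_predictions = bool((refit_pred == manual_pred).all())
print(refit_accuracy, manual_accuracy, same_predictions)
```

With a fixed `random_state`, both trees see the same data and the same hyperparameters, so their predictions should match exactly.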

## 2.

_Exercise: Grow a forest._

a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn's `ShuffleSplit` class for this.

In [6]:
from sklearn.model_selection import ShuffleSplit


# Generate 1,000 subsets of the training set, each with 100 instances
n_trees = 1000
n_instances = 100
subsets = []
ss = ShuffleSplit(n_splits=n_trees, train_size=n_instances, random_state=42)
for train_idx, _ in ss.split(X_train):
    X_subset = X_train[train_idx]
    y_subset = y_train[train_idx]
    subsets.append((X_subset, y_subset))
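To make the behaviour of `ShuffleSplit` more concrete, here is a small sketch on a stand-in array (`X_demo` is hypothetical, only its row count matters, and 5 splits are used instead of 1,000). It checks that each split yields exactly `train_size` indices, with no index repeated inside a subset.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Assumption: a stand-in for the 8,000-instance training set.
X_demo = np.arange(8000 * 2).reshape(8000, 2)

splitter = ShuffleSplit(n_splits=5, train_size=100, random_state=42)
index_sets = [train_idx for train_idx, _ in splitter.split(X_demo)]

# Every subset has exactly 100 indices.
subset_sizes = [len(idx) for idx in index_sets]

# Within one subset, instances are drawn without replacement.
no_repeats_within_subset = all(
    len(np.unique(idx)) == len(idx) for idx in index_sets
)

# Different subsets are drawn independently, so they are allowed to overlap.
shared = np.intersect1d(index_sets[0], index_sets[1])
print(subset_sizes, no_repeats_within_subset, len(shared))
```

Note that this differs from bootstrap sampling, where an instance can appear several times in the same subset.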

b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

In [7]:
# Train one Decision Tree on each subset and evaluate on the test set
import numpy as np
from sklearn.base import clone

forest = []
for X_subset, y_subset in subsets:
    # clone() copies best_tree's hyperparameters into a new, unfitted tree
    tree = clone(best_tree)
    tree.fit(X_subset, y_subset)
    forest.append(tree)

# Accuracy of each tree on the test set, then the mean over all 1,000 trees
all_preds = np.array([tree.predict(X_test) for tree in forest])
individual_accuracies = np.mean(all_preds == y_test, axis=1)
np.mean(individual_accuracies)

0.805471

c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy's `mode()` function for this). This gives you _majority-vote predictions_ over the test set.

In [8]:
from scipy.stats import mode

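# Majority-vote predictions for each test instance.
# all_preds has shape (n_trees, n_test_instances), so the mode is taken over axis 0.
# Note: the keepdims argument requires SciPy 1.9 or later.
majority_votes, _ = mode(all_preds, axis=0, keepdims=False)
ensemble_predictions = majority_votes
ensemble_predictions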

array([1, 1, 0, ..., 0, 0, 0], dtype=int64)

In [9]:
# Evaluate ensemble accuracy
ensemble_accuracy = accuracy_score(y_test, ensemble_predictions)
ensemble_accuracy

0.872
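The majority vote is easier to follow on a toy example. The sketch below uses a hypothetical 3-tree, 4-instance prediction matrix (`toy_preds`) and NumPy only. Instead of `mode()`, it relies on the fact that the labels are 0 and 1: the majority class is 1 exactly when more than half of the trees predict 1.

```python
import numpy as np

# Hypothetical predictions: 3 trees (rows) on 4 test instances (columns).
toy_preds = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
])

# The column mean is the fraction of trees voting 1 for that instance.
vote_share = toy_preds.mean(axis=0)

# More than half of the votes for class 1 means the majority prediction is 1.
toy_votes = (vote_share > 0.5).astype(int)
print(toy_votes)  # [1 1 1 0]
```

This shortcut only applies to binary 0/1 labels, and in this sketch an exact tie (possible with an even number of trees) resolves to class 0.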

d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher).

In [10]:
# The ensemble accuracy should be slightly higher than the best single tree
ensemble_accuracy

0.872

In [11]:
# Compare with the best single tree accuracy
accuracy

0.8695