#### 7. Train and fine-tune a decision tree for the moons dataset by following these steps:
- Use `make_moons(n_samples=10000, noise=0.4)` to generate a moons dataset.
- Use `train_test_split()` to split the dataset into a training set and a test set.
- Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. Hint: try various values for `max_leaf_nodes`.
- Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy. 

In [106]:
# Step 1: Generate the moons dataset
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=43)

In [107]:
# Step 2: Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, test_size=0.2, random_state=43)

In [108]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(8000, 2)
(2000, 2)
(8000,)
(2000,)


In [109]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=43)

In [110]:
from sklearn.model_selection import GridSearchCV

tree_param_grid = {
    "max_depth" : [1, 2, 3, 4, 5],
    "min_samples_split" : [2, 3, 4, 5],
    "min_samples_leaf" : [1, 2, 3, 4, 5],
    "max_leaf_nodes" : [2, 3, 4, 5]
}

grid_search_tree = GridSearchCV(
    tree_clf, tree_param_grid, cv=5
)
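
For a sense of how much work this search involves, the candidates can be counted with `ParameterGrid` before fitting anything. This is a small sketch that repeats the grid from the cell above; `n_candidates` and `n_fits` are illustrative names, not part of the original notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
tree_param_grid = {
    "max_depth": [1, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [1, 2, 3, 4, 5],
    "max_leaf_nodes": [2, 3, 4, 5],
}

# Every combination of the four lists is one candidate
n_candidates = len(ParameterGrid(tree_param_grid))

# With cv=5, each candidate is fitted once per fold; GridSearchCV then
# does one more fit of the best candidate on the whole training set
# (refit=True is the default)
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

Because `max_depth` and `max_leaf_nodes` both cap the size of the tree, many of these candidates are likely to produce the same tree, so a narrower grid may find the same result with far fewer fits.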

In [111]:
grid_search_tree.fit(X_train, y_train)

In [112]:
print(f"Best parameters for Decision Tree: {grid_search_tree.best_params_}")

Best parameters for Decision Tree: {'max_depth': 2, 'max_leaf_nodes': 4, 'min_samples_leaf': 1, 'min_samples_split': 2}


In [113]:
best_tree = grid_search_tree.best_estimator_

In [114]:
from sklearn.model_selection import cross_val_score

cross_val_score(best_tree, X_train, y_train, cv=3, scoring="accuracy")

array([0.85076865, 0.86689164, 0.84096024])

In [115]:
cross_val_score(best_tree, X_train, y_train, cv=3, scoring="accuracy").mean()

0.8528735108411523

In [116]:
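# Caveat: cross_val_score() refits clones of best_tree on folds of the test set;
# it does not score the model that was trained on the training set.
cross_val_score(best_tree, X_test, y_test, cv=3, scoring="accuracy").mean()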

0.8625034329682005
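
The exercise asks for the performance of the model trained on the full training set, which is what `best_estimator_.score()` measures. Here is a minimal, self-contained sketch of that check. The one-parameter grid and the names `search` and `test_accuracy` are assumptions made for illustration, not the grid used above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=43
)

# Small illustrative grid (an assumption, not the grid from the cells above)
param_grid = {"max_leaf_nodes": [2, 4, 8, 16, 32]}
search = GridSearchCV(DecisionTreeClassifier(random_state=43), param_grid, cv=3)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ has already been retrained
# on the whole training set, so it can be scored on the test set directly
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The test set is used only once here, for the final measurement, and never for fitting.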

#### 8. Grow a forest by following these steps:
- Continuing the previous exercise, generate 1000 subsets of the training set, each containing 100 instances selected randomly. *Hint: you can use Scikit-Learn's `ShuffleSplit` class for this*. 
- Train one decision tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform worse than the first decision tree, achieving only about 80% accuracy.
- Now comes the magic. For each test set instance, generate the predictions of the 1000 decision trees, and keep only the most frequent prediction (you can use SciPy's `mode()` function for this). This approach gives you *majority-vote predictions* over the test set.
- Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5% to 1.5% higher). Congratulations, you have trained a random forest classifier!
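
The steps above can be sketched end to end as follows. This is one possible reading of the exercise, not a reference solution: `base_tree` stands in for the tuned tree from exercise 7, and its `max_leaf_nodes=4` setting, like the names `forest`, `all_preds`, and `vote_accuracy`, is an assumption made for illustration.

```python
import numpy as np
from scipy.stats import mode
from sklearn.base import clone
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=43
)

# Stand-in for the tuned tree from exercise 7 (hyperparameters are assumed)
base_tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=43)

# 1000 random subsets of the training set, 100 instances each
n_trees, n_instances = 1000, 100
splitter = ShuffleSplit(n_splits=n_trees, train_size=n_instances, random_state=43)

# clone() gives every subset its own unfitted copy of the tree
forest = []
for subset_idx, _ in splitter.split(X_train):
    tree = clone(base_tree)
    tree.fit(X_train[subset_idx], y_train[subset_idx])
    forest.append(tree)

# Accuracy of each individual fitted tree on the test set
mean_tree_accuracy = np.mean([tree.score(X_test, y_test) for tree in forest])

# One row of predictions per tree, then the most frequent label per column
all_preds = np.array([tree.predict(X_test) for tree in forest])
majority_vote, _ = mode(all_preds, axis=0)
majority_vote = np.ravel(majority_vote)  # flatten: older SciPy returns shape (1, n)

vote_accuracy = accuracy_score(y_test, majority_vote)
print(f"Mean single-tree accuracy: {mean_tree_accuracy:.3f}")
print(f"Majority-vote accuracy:    {vote_accuracy:.3f}")
```

Two details matter here: the subsets are drawn from `X_train` only, so the test set stays unseen, and each tree is scored with its own fitted state rather than being refitted during evaluation.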

In [117]:
# Draw 1000 random subsets of the training set, 100 instances each
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1000, train_size=100, random_state=43)
rs.get_n_splits(X_train)

1000

In [118]:
# Train a fresh copy of the best tree on each 100-instance subset.
# clone() is needed here: refitting best_estimator_ itself would leave the
# list holding 1000 references to one and the same tree.
from sklearn.base import clone

trained_best_trees = []
for train_index, _ in rs.split(X_train):
    best_tree_curr = clone(grid_search_tree.best_estimator_)
    best_tree_curr.fit(X_train[train_index], y_train[train_index])
    trained_best_trees.append(best_tree_curr)


In [119]:
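# Evaluating the trained trees on the test set
# Caveat: cross_val_score() refits a clone of each tree on folds of the test
# set, so the fitted trees themselves are never scored and every value printed
# below is identical. tree.score(X_test, y_test) would evaluate each fitted tree.
for tree in trained_best_trees:
    curr_tree_score = cross_val_score(tree, X_test, y_test, cv=3, scoring="accuracy").mean()
    print(f"Current Tree Accuracy on the Test Set: {curr_tree_score}")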

Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree Accuracy on the Test Set: 0.8625034329682005
Current Tree A