#### 7. Train and fine-tune a decision tree for the moons dataset by following these steps:
- Use `make_moons(n_samples=10000, noise=0.4)` to generate a moons dataset.
- Use `train_test_split()` to split the dataset into a training set and a test set.
- Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. Hint: try various values for `max_leaf_nodes`.
- Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy. 

In [294]:
# Step 1: Generate the moons dataset
from sklearn.datasets import make_moons

X_moons, y_moons = make_moons(n_samples=10000, noise=0.4, random_state=43)

In [295]:
# Step 2: Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, test_size=0.2, random_state=43)

In [296]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(8000, 2)
(2000, 2)
(8000,)
(2000,)


In [297]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=43)

In [298]:
from sklearn.model_selection import GridSearchCV

tree_param_grid = {
    "max_depth" : [1, 2, 3, 4, 5],
    "min_samples_split" : [2, 3, 4, 5],
    "min_samples_leaf" : [1, 2, 3, 4, 5],
    "max_leaf_nodes" : [2, 3, 4, 5]
}

grid_search_tree = GridSearchCV(
    tree_clf, tree_param_grid, cv=5
)

In [299]:
grid_search_tree.fit(X_train, y_train)

In [300]:
print(f"Best parameters for Decision Tree: {grid_search_tree.best_params_}")

Best parameters for Decision Tree: {'max_depth': 2, 'max_leaf_nodes': 4, 'min_samples_leaf': 1, 'min_samples_split': 2}
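The exercise hint suggests trying various values for `max_leaf_nodes`, while the grid above stops at 5 and the best value sits at the top of that range. A minimal sketch of a wider search over that one hyperparameter follows; the range 2 to 50, `cv=3`, and the names `wide_grid` and `wide_search` are assumptions chosen for illustration, not values from the cells above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the data and split used in this exercise
X, y = make_moons(n_samples=10000, noise=0.4, random_state=43)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=43)

# Hypothetical wider grid: only max_leaf_nodes, from 2 to 50
wide_grid = {"max_leaf_nodes": list(range(2, 51))}

wide_search = GridSearchCV(
    DecisionTreeClassifier(random_state=43),
    wide_grid,
    cv=3,
    scoring="accuracy",
)
wide_search.fit(X_tr, y_tr)

print(wide_search.best_params_)  # best max_leaf_nodes within the wider range
print(wide_search.best_score_)   # mean cross-validated accuracy for that value
```

If the best value again lands on the edge of the range, that is usually a sign the range should be extended further.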


In [301]:
best_tree = grid_search_tree.best_estimator_

In [302]:
from sklearn.model_selection import cross_val_score

cross_val_score(best_tree, X_train, y_train, cv=3, scoring="accuracy")

array([0.85076865, 0.86689164, 0.84096024])

In [303]:
cross_val_score(best_tree, X_train, y_train, cv=3, scoring="accuracy").mean()

0.8528735108411523

In [304]:
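# Note: this cross-validates clones of the tree on the test set alone (each clone is
# refit on two test folds), so it does not score the tree trained on the training set.
cross_val_score(best_tree, X_test, y_test, cv=3, scoring="accuracy").mean()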

0.8625034329682005
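The exercise asks for the accuracy of the tuned tree on the test set. A minimal sketch of that measurement, assuming the default `refit=True` so that `best_estimator_` is already trained on the full training set. The names `search`, `final_tree`, and `test_accuracy` and the small grid are illustrative, not taken from the cells above.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the data and split used in this exercise
X, y = make_moons(n_samples=10000, noise=0.4, random_state=43)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=43)

# Small illustrative grid; refit=True (the default) retrains the best
# candidate on all of X_tr once the search finishes
search = GridSearchCV(
    DecisionTreeClassifier(random_state=43),
    {"max_leaf_nodes": [2, 3, 4, 5]},
    cv=5,
)
search.fit(X_tr, y_tr)

# Score the already-trained tree on the held-out test set, without refitting
final_tree = search.best_estimator_
test_accuracy = accuracy_score(y_te, final_tree.predict(X_te))
print(f"Test accuracy: {test_accuracy:.4f}")
```

Scoring this way keeps the test set strictly held out: the model never sees any test instance during training.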

#### 8. Grow a forest by following these steps:
- Continuing the previous exercise, generate 1000 subsets of the training set, each containing 100 instances selected randomly. *Hint: you can use Scikit-Learn's `ShuffleSplit` class for this*. 
- Train one decision tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1000 decision trees on the test set. Since they were trained on smaller sets, these decision trees will likely perform worse than the first decision tree, achieving only about 80% accuracy.
- Now comes the magic. For each test set instance, generate the predictions of the 1000 decision trees, and keep only the most frequent prediction (you can use SciPy's `mode()` function for this). This approach gives you *majority-vote predictions* over the test set.
- Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5% to 1.5% higher). Congratulations, you have trained a random forest classifier!

In [305]:
# Draw 1000 random subsets of 100 instances each
from sklearn.model_selection import ShuffleSplit
import numpy as np

# Caution: this indexes the full 10,000-instance dataset, so a subset can include
# test instances; the exercise asks for subsets of the training set only.
n_samples = len(X_moons)
subset_size = 100
n_splits = 1000  # Number of subsets

# ShuffleSplit's "test" side is used here as the 100-instance subset
shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=subset_size, random_state=42)

# Generate 1000 subsets
subset_idxs = []
for _, subset_index in shuffle_split.split(np.arange(n_samples)):
    subset_idxs.append(subset_index)  # Store indices of each subset

# Convert to a NumPy array of shape (1000, 100)
subset_idxs = np.array(subset_idxs)

# Print first subset
print(subset_idxs[0])

[6252 4684 1731 4742 4521 6340  576 5202 6363  439 2750 7487 5272 5653
 3999 6033  582 9930 7051 8158 9896 2249 4640 9485 4947 9920 1963 8243
 6590 8847  321 2678 4625 4949 8328 3337 5589  251 3973 6630 5547   35
 8362 1513 9317   39 4819 3465 1760 2304 3723 8284 4993 8127 3032 7938
 3039 9655 2545 2592 1188 7966 6077  107 1315 8187 2753 9753 6231 2876
 5323  799 3570 2894 2927 8178  971 6687 8575 2020 9054  952 5359 3857
 5861 3145 3305 3006 9001 7770 7438 7942 9238 1056 3154 3787 9189 7825
 7539 7231]
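The exercise asks for subsets of the training set, so none of the trees ever sees a test instance. A minimal sketch of that variant, assuming `ShuffleSplit` is given `train_size=100` and applied to the training array directly. The names `splitter` and `train_subsets` are hypothetical, and the printed indices above were not produced by this code.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import ShuffleSplit, train_test_split

# Recreate the data and split used in this exercise
X, y = make_moons(n_samples=10000, noise=0.4, random_state=43)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=43)

# With train_size=100 and test_size left unset, each split's "train" side
# holds 100 indices into X_tr; the remaining indices are ignored here
splitter = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)

# Collect the (features, labels) pair for every mini training set
train_subsets = [
    (X_tr[subset_idx], y_tr[subset_idx])
    for subset_idx, _ in splitter.split(X_tr)
]

print(len(train_subsets))         # number of subsets
print(train_subsets[0][0].shape)  # shape of the first subset's features
```

Because the indices refer to `X_tr` rather than the full dataset, the test set stays untouched until the final evaluation.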


In [306]:
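# Train one tree per subset, using the best hyperparameters from the grid search
trained_best_trees = []
for subset_idx in subset_idxs:
    best_tree_curr = DecisionTreeClassifier(**grid_search_tree.best_params_)
    trained_best_trees.append(
        best_tree_curr.fit(X_moons[subset_idx], y_moons[subset_idx])
    )
    # 3-fold cross-validation accuracy within this 100-instance subset (one line per tree)
    print(cross_val_score(
        best_tree_curr, X_moons[subset_idx], y_moons[subset_idx],
        cv=3, scoring="accuracy").mean())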

0.8398692810457516
0.7195484254307783
0.8398692810457516
0.749554367201426
0.7593582887700535
0.6794414735591205
0.8199643493761141
0.8000594177064766
0.7106357694592988
0.8300653594771242
0.7795603089720737
0.8995840760546643
0.7700534759358288
0.7697563874034463
0.8992869875222818
0.8600713012477718
0.8793820558526441
0.8401663695781343
0.8294711824123588
0.7703505644682115
0.8505644682115271
0.7709447415329768
0.8193701723113488
0.9197860962566845
0.7706476530005942
0.8493761140819965
0.7507427213309565
0.7697563874034463
0.8300653594771242
0.809863339275104
0.8199643493761141
0.7391562685680332
0.8701723113487819
0.8012477718360071
0.7596553773024363
0.8202614379084968
0.7801544860368389
0.8606654783125371
0.8205585264408793
0.859180035650624
0.8208556149732621
0.7798573975044564
0.8704693998811646
0.880273321449792
0.8395721925133689
0.8291740938799762
0.9093879976232917
0.7201426024955437
0.8303624480095069
0.8900772430184195
0.7807486631016043
0.8306595365418895
0.85918003565062

In [307]:
# Evaluating the trained trees on the test set
from sklearn.metrics import accuracy_score

accuracy_scores = []
for tree in trained_best_trees:
    y_pred = tree.predict(X_test)
    curr_tree_accuracy_score = accuracy_score(y_test, y_pred)
    print(type(y_pred))
    accuracy_scores.append(curr_tree_accuracy_score)

    # print(f"Current Tree Accuracy on the Test Set: {curr_tree_accuracy_score}")

print(f"Mean Accuracy on Test Set: {np.array(accuracy_scores).mean()}")


<class 'numpy.ndarray'>

In [308]:
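from scipy.stats import mode

# Majority vote: for each test instance, collect the prediction of every tree
# and keep the most frequent class
forest_preds = []
for test_instance in X_test:
    preds_on_this_instance = []
    for tree in trained_best_trees:
        # predict() expects a 2D input, hence the single-row list
        preds_on_this_instance.append(tree.predict([test_instance]))
    # The stacked predictions have shape (1000, 1); mode() reduces over the trees
    forest_preds.append(mode(np.array(preds_on_this_instance)).mode)

print(forest_preds)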

[array([1]), array([1]), array([1]), array([1]), array([0]), array([0]), array([1]), array([1]), array([1]), array([1]), array([1]), array([1]), array([1]), array([1]), array([1]), array([1]), array([0]), array([0]), array([1]), array([0]), array([0]), array([0]), array([1]), array([0]), array([1]), array([1]), array([0]), array([1]), array([0]), array([1]), array([1]), array([0]), array([0]), array([0]), array([0]), array([1]), array([1]), array([0]), array([1]), array([1]), array([1]), array([1]), array([1]), array([0]), array([0]), array([0]), array([1]), array([0]), array([1]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([0]), array([1]), array([0]), array([0]), array([0]), array([0]), array([1]), array([0]), array([0]), array([1]), array([1]), array([0]), array([1]), array([1]), array([0]), array([0]), array([0]), array([1]), array([0]), array([0]), array([1]), array([1]), array([0]), array([1]), array([0]), array([1]), arr

In [309]:
print(type(np.array(forest_preds)))

<class 'numpy.ndarray'>


In [311]:
print(f"Final Accuracy Score: {accuracy_score(y_test, np.array(forest_preds))}")

Final Accuracy Score: 0.8635
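The nested loop above calls `predict()` once per tree per test instance, which is two million calls. A vectorized sketch of the same majority vote is shown below, assuming trees trained on 100-instance subsets of the training set. The names `forest`, `all_preds`, and `majority_vote` are illustrative, `max_leaf_nodes=4` mirrors the best value reported earlier, and the resulting accuracy may differ from the figure above.

```python
import numpy as np
from scipy.stats import mode
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the data and split used in this exercise
X, y = make_moons(n_samples=10000, noise=0.4, random_state=43)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=43)

# Train 1000 small trees, each on 100 training instances
splitter = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)
forest = [
    DecisionTreeClassifier(max_leaf_nodes=4, random_state=42).fit(X_tr[idx], y_tr[idx])
    for idx, _ in splitter.split(X_tr)
]

# One predict() call per tree: rows are trees, columns are test instances
all_preds = np.array([tree.predict(X_te) for tree in forest])

# Most frequent prediction per column; ravel() flattens the result so the
# code works whether or not this SciPy version keeps the reduced axis
majority_vote = np.ravel(mode(all_preds, axis=0).mode)

forest_accuracy = accuracy_score(y_te, majority_vote)
print(f"Majority-vote accuracy: {forest_accuracy:.4f}")
```

Stacking the predictions into one array also makes it easy to compare each tree's individual accuracy with the voted result, for example with `(all_preds == y_te).mean(axis=1)`.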


**Great! By combining the trees into a forest with majority voting, I improved test-set accuracy from 82.6% to 86.35%.**