# Summary

This lab compares a single Decision Tree with a hand-built Random Forest on the moons dataset, a synthetic dataset of two interleaving half-circles with added noise. In the run recorded below, the tuned Decision Tree reaches 87.0% accuracy on the test set, and majority voting over 1,000 trees trained on small random subsets reaches 87.2%. The gain is small (0.2 percentage points), but it shows that an ensemble of trees that each see only 100 instances can match or slightly exceed a single tree trained on the full training set.

The moons dataset was generated using `make_moons` with 10,000 samples and a noise level of 0.4, so the two classes overlap noticeably. The dataset was split into a training set (80%, 8,000 instances) and a test set (20%, 2,000 instances) to evaluate model performance.

A Decision Tree classifier was trained on the training set. Hyperparameter tuning was performed using `GridSearchCV` with 5-fold cross-validation over `max_leaf_nodes` values from 10 to 100 in steps of 10. The best-performing model used `max_leaf_nodes=20` and achieved an accuracy of 87.0% on the test set.

To build a Random Forest, 1,000 subsets of the training data were created, each containing 100 randomly selected instances. A Decision Tree was trained on each subset using the optimal `max_leaf_nodes` value. The predictions of all trees were aggregated using majority voting, and the ensemble achieved an accuracy of 87.2% on the test set.

# Results and Conclusion

**Single Decision Tree:** Achieved 87.0% accuracy on the test set.

**Random Forest:** Achieved 87.2% accuracy on the test set, an improvement of 0.2 percentage points over the single Decision Tree.

In this run, the majority-vote ensemble slightly outperforms the single Decision Tree on the moons dataset, even though each of its trees was trained on only 100 instances. The margin is small (0.2 percentage points on 2,000 test instances), so it is best read as consistent with the benefit of ensembling rather than as proof of it. Future work could explore the performance of other ensemble techniques, such as gradient boosting, on similar datasets.
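To put the size of the gap in context, the sketch below computes the binomial standard error of an accuracy estimate. It assumes the 2,000 test instances are independent and uses the two accuracies printed in the notebook outputs (0.87 and 0.872). The helper name `accuracy_standard_error` is hypothetical, and this is a rough check, not a paired significance test.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test independent instances."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


n_test = 2000        # 20% of the 10,000 generated samples
tree_acc = 0.87      # single Decision Tree, from the notebook output
forest_acc = 0.872   # majority-vote forest, from the notebook output

se = accuracy_standard_error(tree_acc, n_test)
gap = forest_acc - tree_acc

print(f"standard error of the tree accuracy: {se:.4f}")  # 0.0075
print(f"forest minus tree accuracy:          {gap:.4f}")  # 0.0020
```

Under these assumptions the gap is smaller than one standard error, which is why repeating the experiment with several random seeds would give a more reliable comparison.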

# Lab 07: Decision Trees



#### Part 1

Train and fine-tune a decision tree for the moons dataset:

* Use `make_moons(n_samples=10000, noise=0.4)` to generate a moons dataset.
* Use `train_test_split()` to split the dataset into a training set and a test set.
* Use grid search with cross-validation (with the help of the `GridSearchCV` class) to find good hyperparameter values for a `DecisionTreeClassifier`. Hint: try various values for `max_leaf_nodes`.
* Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.

In [5]:
```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Generate the moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# Step 2: Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Use GridSearchCV to find good hyperparameter values for DecisionTreeClassifier
param_grid = {
    'max_leaf_nodes': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}

dt_clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(dt_clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best hyperparameters
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

# Step 4: Train the model on the full training set using the best hyperparameters
best_dt_clf = DecisionTreeClassifier(max_leaf_nodes=best_params['max_leaf_nodes'], random_state=42)
best_dt_clf.fit(X_train, y_train)

# Step 5: Measure the model's performance on the test set
y_pred = best_dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy:", accuracy)
```

```
Best hyperparameters: {'max_leaf_nodes': 20}
Test set accuracy: 0.87
```
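To see how sensitive the result is to `max_leaf_nodes`, it can help to look at the full cross-validation table rather than only the winner. The sketch below reruns the same grid search and prints the mean and standard deviation of the 5-fold accuracy for each candidate, using the `cv_results_` attribute of `GridSearchCV`. It also relies on the default `refit=True`, which means `best_estimator_` has already been retrained on the full training set. The variable names are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data, split, and grid as in Part 1
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {"max_leaf_nodes": list(range(10, 101, 10))}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# One row per candidate: mean and spread of the 5 validation-fold accuracies
results = grid_search.cv_results_
for n_leaves, mean, std in zip(
    results["param_max_leaf_nodes"],
    results["mean_test_score"],
    results["std_test_score"],
):
    print(f"max_leaf_nodes={int(n_leaves):>3}: CV accuracy {mean:.4f} +/- {std:.4f}")

# With refit=True (the default), this estimator is already trained on all of X_train
best_tree = grid_search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)
print("Test set accuracy:", test_accuracy)
```

If several candidates score within one standard deviation of each other, the choice among them is largely arbitrary, and the smaller tree is usually the safer pick.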


#### Part 2: Grow a forest by following these steps:

* Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s `ShuffleSplit` class for this.
* Train one Decision Tree on each subset, using the best hyperparameter values found in the previous exercise. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.
* Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s `mode()` function for this). This approach gives you majority-vote predictions over the test set.
* Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!
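The majority-vote step can be illustrated on a toy example before applying it to 1,000 trees. The sketch below uses a made-up 3-by-4 array of votes (rows are trees, columns are test instances) and takes the per-column mode with SciPy. The `.mode` attribute is wrapped in `np.asarray(...).ravel()` because the shape returned by `mode()` differs between SciPy versions. For 0/1 labels, thresholding the fraction of 1-votes gives the same answer when there are no ties.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical votes: 3 trees (rows) predicting 4 test instances (columns)
votes = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

# Most frequent label in each column; ravel() handles both (4,) and (1, 4) shapes
majority = np.asarray(mode(votes, axis=0).mode).ravel()
print(majority)  # [0 1 0 0]

# Equivalent for binary labels without ties: more than half of the trees vote 1
alt = (votes.mean(axis=0) > 0.5).astype(int)
print(alt)  # [0 1 0 0]
```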

In [6]:
```python
from sklearn.model_selection import ShuffleSplit
from scipy.stats import mode

# Step 1: Generate 1,000 subsets of the training set, each containing 100 instances
n_trees = 1000
n_instances = 100

shuffle_split = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42)
forest = [None] * n_trees

for tree_idx, (train_index, _) in enumerate(shuffle_split.split(X_train)):
    X_subset = X_train[train_index]
    y_subset = y_train[train_index]
    forest[tree_idx] = DecisionTreeClassifier(max_leaf_nodes=best_params['max_leaf_nodes'], random_state=42)
    forest[tree_idx].fit(X_subset, y_subset)

# Step 2: Evaluate these 1,000 Decision Trees on the test set
tree_predictions = np.zeros((n_trees, len(X_test)))

for tree_idx, tree in enumerate(forest):
    tree_predictions[tree_idx] = tree.predict(X_test)

# Step 3: Generate majority-vote predictions over the test set
y_pred_majority_votes, _ = mode(tree_predictions, axis=0)
y_pred_majority_votes = y_pred_majority_votes.ravel()

# Step 4: Evaluate these predictions on the test set
forest_accuracy = accuracy_score(y_test, y_pred_majority_votes)
print("Random Forest accuracy:", forest_accuracy)
```

```
Random Forest accuracy: 0.872
```
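The exercise also asks for the accuracy of the 1,000 individual trees, which the cell above does not print. The self-contained sketch below rebuilds the same forest and reports the mean, minimum, and maximum per-tree test accuracy next to the majority-vote accuracy. It hard-codes `max_leaf_nodes=20`, taken from the Part 1 output, and the variable names are illustrative. No expected numbers are shown because they were not recorded in the original run.

```python
import numpy as np
from scipy.stats import mode
from sklearn.datasets import make_moons
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as in Part 1
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

n_trees, n_instances = 1000, 100
splitter = ShuffleSplit(
    n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42
)

# Row i holds the test-set predictions of tree i
predictions = np.empty((n_trees, len(X_test)), dtype=y_test.dtype)
for i, (subset_idx, _) in enumerate(splitter.split(X_train)):
    tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=42)  # value from Part 1
    tree.fit(X_train[subset_idx], y_train[subset_idx])
    predictions[i] = tree.predict(X_test)

# Accuracy of each tree on its own (fraction of test instances predicted correctly)
tree_accuracies = (predictions == y_test).mean(axis=1)

# Accuracy of the majority vote across all trees
majority = np.asarray(mode(predictions, axis=0).mode).ravel()
forest_accuracy = (majority == y_test).mean()

print(f"Individual trees: mean {tree_accuracies.mean():.4f}, "
      f"min {tree_accuracies.min():.4f}, max {tree_accuracies.max():.4f}")
print(f"Majority vote:    {forest_accuracy:.4f}")
```

Comparing the mean per-tree accuracy with the majority-vote accuracy shows how much of the final score comes from aggregation rather than from any single tree.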
