# ML Challenge (Optional)

Train, test, optimize, and analyze the performance of a classification model using a methodology of your choice for the randomly generated moons dataset.

You are not being evaluated for the performance of your model. Instead, we are interested in whether you can implement a simple but rigorous ML workflow.

Show all of your work in this notebook.

In [6]:
# you are free to use any package you deem fit
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import seaborn as sns

pd.options.mode.chained_assignment = None

## Dataset

In [2]:
# DO NOT MODIFY
from sklearn.datasets import make_moons

X, Y = make_moons(random_state=42, n_samples=(50, 450), noise=0.25)

In [19]:
print(X.shape)
print(Y.shape)

(500, 2)
(500,)


## Training

In [73]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=50)
base_model = RandomForestClassifier(n_estimators=100,
                                    max_depth=None,
                                    min_samples_split=2,
                                    min_samples_leaf=1,
                                    random_state = 2)
base_model.fit(X_train,Y_train)

Y_pred_base = base_model.predict(X_test)
accuracy_base = accuracy_score(Y_test, Y_pred_base)
print(f'Base Model Testing Accuracy: {accuracy_base}')

Base Model Testing Accuracy: 0.94


## Testing / Optimization

In [80]:
#Grid search approach
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=base_model, param_grid=param_grid, cv=3)
grid_search.fit(X_train, Y_train)
best_params = grid_search.best_params_
grid_model= RandomForestClassifier(n_estimators = best_params['n_estimators'], 
                                    max_depth = best_params['max_depth'],
                                    min_samples_split = best_params['min_samples_split'],
                                    min_samples_leaf = best_params['min_samples_split'],
                                    random_state = 2 )
grid_model.fit(X_train, Y_train)
Y_pred_grid = grid_model.predict(X_test)
accuracy_grid = accuracy_score(Y_test, Y_pred_grid)
print(f'Grid Model Testing Accuracy: {accuracy_grid}')


Grid Model Testing Accuracy: 0.94


## Performance Analysis

In [81]:
accuracy_base_train = accuracy_score(Y_train, base_model.predict(X_train))
print(f'Base Model Training Accuracy: {accuracy_base_train}')

accuracy_grid_train = accuracy_score(Y_train, grid_model.predict(X_train))
print(f'Grid Model Training Accuracy: {accuracy_grid_train}')

Base Model Training Accuracy: 1.0
Grid Model Training Accuracy: 0.9675


The observation that the accuracy score increased on the test set while decreasing on the training set suggests that the initial model was likely overfitting the training data. The grid search appears to have identified hyperparameters that make the model more general, enabling it to perform better on new, unseen data. This trade-off often indicates a well-tuned model that is likely to generalize better to future data, even if it performs slightly worse on the training set.