In this example we fit regression trees to a data set of prices for a collection of Toyota Corollas.

We will also learn how to run a *grid search*, which is a way to optimize several parameters at once.

We start by installing a new Python library called `dmba` (the Python library that accompanies the Shmueli book and provides its data sets)
and then reading "ToyotaCorolla.csv" from that package:

In [None]:
!pip install dmba

In [None]:
from dmba import load_data

# Load the ToyotaCorolla dataset
toyota_df = load_data('ToyotaCorolla.csv')


# Rename some of the annoyingly named features
toyota_df = toyota_df.rename(columns={'Age_08_04': 'Age', 'Quarterly_Tax': 'Tax'})


# Display the first few rows
print(toyota_df.head())
print(toyota_df.info())

In [None]:
toyota_df.describe()

# Regression Trees
Fitting regression trees to the Toyota Corolla data (Shmueli, Chapter 9.6; the data are introduced in Chapter 6).

Let's import some other important packages:



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from dmba import regressionSummary, classificationSummary


In [None]:

predictors = ['Age', 'KM', 'Fuel_Type', 'HP', 'Met_Color', 'Automatic', 'CC',
              'Doors', 'Tax', 'Weight']
outcome = 'Price'

X = pd.get_dummies(toyota_df[predictors], drop_first=True)
y = toyota_df[outcome]

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.4, random_state=1)

In [None]:
# fit a regression tree of depth 3
regTree = DecisionTreeRegressor(max_depth=3, random_state=13, criterion="absolute_error")
regTree.fit(train_X, train_y)
# plot it
plt.figure(figsize=(20,15))
plot_tree(regTree, feature_names=X.columns, filled=True, rounded=True)
plt.show()

### Grid Search

Previously we learned how to optimize a single parameter (like `max_depth`) while fitting a tree model. However, most models have several parameters that control complexity. How can we optimize across several of them at once?


**Grid Search** is one way to do this. You list candidate values for each parameter, and `GridSearchCV()` uses cross-validation to evaluate every combination of those values and keeps the best one.

First, define the parameters and their candidate values in `param_grid`:

In [None]:
# use grid search to find an optimized tree
param_grid = {
    'max_depth': [1, 5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 10, 15, 20, 25, 30],
    'max_features': [1,3,5,7,9]
}

gridSearch = GridSearchCV(DecisionTreeRegressor(random_state=13), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(train_X, train_y)
print('First level optimal parameters: ', gridSearch.best_params_)

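A grid search gets expensive quickly, because every combination of values is fit once per cross-validation fold. As a rough sketch, scikit-learn's `ParameterGrid` can count the combinations in the grid used above (the variable names `coarse_grid`, `n_combinations`, and `n_folds` are chosen here for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the first grid search above
coarse_grid = {
    'max_depth': [1, 5, 10, 15, 20, 25],
    'min_samples_split': [2, 5, 10, 15, 20, 25, 30],
    'max_features': [1, 3, 5, 7, 9]
}

# ParameterGrid enumerates every combination of the listed values
n_combinations = len(ParameterGrid(coarse_grid))
n_folds = 5  # matches cv=5

print('parameter combinations:', n_combinations)
print('cross-validation fits: ', n_combinations * n_folds)
```

With the default `refit=True`, `GridSearchCV` performs one more fit at the end, training the best combination on the full training set.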

In [None]:

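## now that we are narrowing in on the best options, let's refine some more
## (note: this search uses criterion="absolute_error"; the first search used the default criterion)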

param_grid = {
    'max_depth': [6,7,8,9,10,11,12,13,14],
    'min_samples_split': [6,7,8,9,10,11,12,13,14,15],
    'max_features': [4,5,6]
}
gridSearch = GridSearchCV(DecisionTreeRegressor(random_state=13,criterion="absolute_error"), param_grid, cv=5, n_jobs=-1)
gridSearch.fit(train_X, train_y)
print('Improved parameters: ', gridSearch.best_params_)

regTree = gridSearch.best_estimator_

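Beyond `best_params_`, a fitted search also stores the cross-validated score of every combination it tried. The sketch below shows one way to inspect those results; it uses synthetic data from `make_regression` as a stand-in (an assumption made so the example runs on its own) and a deliberately small grid:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Toyota data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# A small illustrative grid: 3 x 2 = 6 combinations
demo_grid = {'max_depth': [2, 4, 6], 'min_samples_split': [2, 10]}
demo_search = GridSearchCV(DecisionTreeRegressor(random_state=13), demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays with one entry per combination
results = pd.DataFrame(demo_search.cv_results_)
cols = ['param_max_depth', 'param_min_samples_split',
        'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))

# With no scoring argument, a regressor is scored by R^2
print('Best mean CV score:', demo_search.best_score_)
```

Comparing `mean_test_score` against `std_test_score` helps show whether the top-ranked combination is clearly better than its neighbors or only marginally so.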

In [None]:
print("summary on holdout data")
regressionSummary(test_y, regTree.predict(test_X))

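`regressionSummary` reports several error measures for the holdout predictions. As a hedged illustration of what measures of this kind mean, the sketch below computes mean error, RMSE, MAE, and MAPE by hand on a few made-up prices. The arrays are hypothetical, and the sign convention (actual minus predicted) is an assumption about how the residuals are defined:

```python
import numpy as np

# Made-up actual and predicted prices (hypothetical values)
y_true = np.array([10000.0, 12000.0, 8000.0, 15000.0])
y_pred = np.array([11000.0, 11500.0, 8000.0, 13000.0])

# Residuals, assuming the convention actual minus predicted
residuals = y_true - y_pred

me = residuals.mean()                           # mean error (signed bias)
rmse = np.sqrt((residuals ** 2).mean())         # root mean squared error
mae = np.abs(residuals).mean()                  # mean absolute error
mape = 100 * np.abs(residuals / y_true).mean()  # mean absolute percentage error

print(f"ME={me:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  MAPE={mape:.2f}%")
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two suggests that a few cars are predicted much worse than the rest.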
In [None]:
# plot the best tree
plt.figure(figsize=(40,35))
plot_tree(regTree, feature_names=X.columns, filled=True, rounded=True)
plt.show()

In [None]:
## and you can still plot feature importances

# pair each importance score with its feature name, largest first
scores = regTree.feature_importances_
sorted_pairs = sorted(zip(scores, X.columns), reverse=True)
sorted_scores, sorted_names = zip(*sorted_pairs)

# plot it
plt.bar(sorted_names, sorted_scores, color='skyblue')  # You can change the color

# Add title and labels
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')

# Display the plot
plt.xticks(rotation=45)  # Rotate names to prevent overlap
plt.show()

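The importance scores of a fitted scikit-learn tree are normalized, so they sum to 1 and can be read as shares. A minimal sketch, using synthetic data and made-up feature names (`f0` to `f3` are hypothetical), shows an alternative way to label and sort them with a pandas `Series`:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Toyota data)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, n_informative=2,
                                 noise=5.0, random_state=1)
names = ['f0', 'f1', 'f2', 'f3']  # hypothetical feature names

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=13).fit(X_demo, y_demo)

# Label each score with its feature name and sort, largest first
importances = pd.Series(demo_tree.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

print(importances)
print('sum of importances:', importances.sum())
```

A `Series` like this can also be plotted directly with `importances.plot.bar()`.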