# Ensembles: From Decision Trees to Extra Trees

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, ExtraTreesRegressor

cars = pd.read_csv('data/cars.csv')

In [None]:
cars.shape

In [None]:
cars.tail()

In [None]:
# Let's check the datatypes of these columns.

cars.dtypes

In [None]:
# Let's look for nulls.

cars.isnull().sum()

In [None]:
# Great! Let's map the cubicinches values to floats.



In [None]:
# What happened?



In [None]:
# Fix the problem!



In [None]:
# What about the weight column?



In [None]:
# Fixing



In [None]:
# Let's check the correlations of the other features with MPG.



## Fitting a Decision Tree

In [None]:
X = cars.drop(['mpg', ' brand'], axis=1)
y = cars['mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state=1)

rt = DecisionTreeRegressor(random_state=1)
rt.fit(X_train, y_train)

In [None]:
rt.score(X_test, y_test)

In [None]:
rt.feature_importances_

A single decision tree will often overfit your training data. There are steps one can take to help with this, like limiting the depth of the tree. But it's often better to do something else: Plant another tree!
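To make the overfitting claim concrete, here is a minimal sketch on synthetic data. The data-generating function, the `*_demo` names, and the `max_depth=3` cap are all assumptions for illustration, not part of the cars analysis. It compares an unrestricted tree with a depth-limited one on training and test $R^2$.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy data (an assumption for illustration; not the cars data).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=1)

# An unrestricted tree keeps splitting until it reproduces the training targets.
deep = DecisionTreeRegressor(random_state=1).fit(Xd_train, yd_train)
# Capping the depth limits how closely the tree can chase the noise.
shallow = DecisionTreeRegressor(max_depth=3, random_state=1).fit(Xd_train, yd_train)

deep_train = deep.score(Xd_train, yd_train)
deep_test = deep.score(Xd_test, yd_test)
shallow_train = shallow.score(Xd_train, yd_train)
shallow_test = shallow.score(Xd_test, yd_test)

print(f"deep:    train R^2 = {deep_train:.3f}, test R^2 = {deep_test:.3f}")
print(f"shallow: train R^2 = {shallow_train:.3f}, test R^2 = {shallow_test:.3f}")
```

The gap between the deep tree's training and test scores is the symptom of overfitting to look for.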

Of course, if a second tree is going to be of any value, it has to be *different from* the first. Here's a good algorithm for achieving that:

## Fitting a Set of Bagged Decision Trees

### Bagging Algorithm

Draw a random sample, with replacement, from the rows of your training data (`X_train` and the matching `y_train`) and fit a decision tree to it. <br/>
Put the sampled rows back, so that every draw comes from the full training set, and repeat. <br/>
When you've got as many trees as you like, combine all the individual trees' predictions into a single prediction. (Most obviously, we could take the average of the predictions, but there are other methods we might try.)

<br/>

Because we're resampling our data with replacement, we're *bootstrapping*. <br/>
Because we're combining the predictions of the many trees fit to those samples, we're *aggregating*. <br/>
Because we're bootstrapping and aggregating all in the same algorithm, we're *bagging*.
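The two steps can be sketched by hand in a few lines. This is a minimal illustration on synthetic stand-in data (an assumption; the `*_demo` names and the choice of 25 trees are hypothetical), not how `BaggingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Synthetic stand-in data (an assumption; not the cars dataset).
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.3, size=200)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the data.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=i).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Aggregate: average the individual trees' predictions for a few rows.
all_preds = np.stack([t.predict(X_demo[:5]) for t in trees])  # (n_trees, 5)
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

`BaggingRegressor` packages the same idea: its default base estimator is a decision tree, and `bootstrap=True` is its default.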

In [None]:
# Instantiate a bagging regressor



In [None]:
# Fit it.



In [None]:
# Score it.



That's a significant improvement in $R^2$! Let's see if we can do even better.

## Fitting a Random Forest

### Random Forest Algorithm

Let's add an extra layer of randomization: Instead of considering *all* the features when searching for the best split at each node, each tree considers only a random subset of the features, redrawn at every node.
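As a small sketch of that idea, the snippet below fits a forest on synthetic data with nine features (an assumption for illustration; `rf_demo` and the other names are hypothetical) and inspects how many features each split is allowed to consider.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 9 features (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=9,
                                 noise=10.0, random_state=1)

# max_features='sqrt': each split considers sqrt(9) = 3 randomly chosen features.
rf_demo = RandomForestRegressor(n_estimators=20, max_features='sqrt',
                                random_state=1)
rf_demo.fit(X_demo, y_demo)

first_tree = rf_demo.estimators_[0]
# max_features_ is the resolved number of features considered per split.
print(first_tree.max_features_)
```

Note that `RandomForestRegressor` considers all features at each split by default, so this extra randomization has to be requested through `max_features`.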

In [None]:
# Let's try a forest with 100 trees.



In [None]:
# Score it.



## Fitting a Stand of Extremely Randomized Trees

### Extra Trees Algorithm

Sometimes we might want even one more bit of randomization. Instead of always searching for the *optimal* split threshold, we can draw a threshold at random for each candidate feature and keep the best of those random splits. If we're doing that, then we've got extremely randomized trees.
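One way to see the difference is to compare the individual trees inside a random forest and an extra-trees ensemble. The sketch below uses synthetic data and hypothetical `*_demo` names (assumptions for illustration only).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic data (an assumption for illustration; not the cars dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=1)

rf_demo = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)
et_demo = ExtraTreesRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)

# The individual trees differ in how they pick a split threshold.
rf_splitter = rf_demo.estimators_[0].splitter  # 'best': search for the optimal threshold
et_splitter = et_demo.estimators_[0].splitter  # 'random': draw thresholds at random
print(rf_splitter, et_splitter)

# Random forests bootstrap by default; extra trees do not unless asked.
print(rf_demo.bootstrap, et_demo.bootstrap)
```

Because `ExtraTreesRegressor` defaults to `bootstrap=False`, bootstrapping has to be switched on explicitly if it is wanted.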

In [None]:
# Again let's try 100 trees. We'll also set `bootstrap=True`.



In [None]:
# Scoring



In [None]:
# Checking the feature importances (average of feature importances over all trees)



## Gridsearching

One method of hyperparameter tuning is **gridsearching**. The idea is to build multiple models with different hyperparameter values and then see which one performs the best. The hyperparameters and the values to try form a sort of *grid* along which we are looking for the best performance.

Scikit-Learn has a `GridSearchCV` class whose `fit()` method runs this procedure. Note that this can be quite computationally expensive since:

- A model is constructed for each combination of hyperparameter values that we input; and
- Each model is cross-validated.
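The cost can be estimated before fitting anything. The sketch below counts candidates and fits for a hypothetical grid (the grid values and `cv_folds` are assumptions chosen only to illustrate the counting).

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, chosen only to illustrate the counting.
demo_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, None],
}
cv_folds = 3

n_candidates = len(ParameterGrid(demo_grid))  # one candidate per combination
n_fits = n_candidates * cv_folds              # each candidate is fit once per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one more fit at the end, retraining the best candidate on the full training set.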

In [None]:
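# Define param grid.
# Note: scikit-learn 1.0 renamed the 'mse' and 'mae' criteria to
# 'squared_error' and 'absolute_error'; the old names were later removed.

param_grid = {
    'max_features': ['sqrt', 'log2', 0.1],
    'criterion': ['squared_error', 'absolute_error']
}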

**Question: How many models will we be constructing with this grid?**

In [None]:
# Initialize the gridsearch object with five-fold cross-validation.
# `et` is the extra-trees regressor instantiated in the Extra Trees section above.

gs = GridSearchCV(estimator=et, param_grid=param_grid, cv=5)

In [None]:
# Fit it.

gs.fit(X_train, y_train)

In [None]:
# Score it.

gs.score(X_test, y_test)

In [None]:
# Get the best parameter values!

gs.best_params_

In [None]:
# And the best score

gs.best_score_

In [None]:
# And the best estimator

gs.best_estimator_

## Building a Model that Takes Raw Data as Input

Suppose we go with the best estimator according to our gridsearch results. If we want our model to be able to predict MPG for an uncleaned row of input, we'll need to clean the row before passing it to the model.

The main cleaning moves that we made above were to fix the problems in the cubicinches and the weightlbs columns. We'll also need to drop the columns that don't belong in the model.
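As a reminder of what such a column-level fix might look like, here is a minimal sketch on a toy column. It assumes the problem values are blank strings and fills them with the median; both the bad value and the fill strategy are assumptions for illustration, not necessarily what was done above.

```python
import pandas as pd

# Toy column; the blank-string entry is an assumed example of a bad value.
raw = pd.Series(['350', ' ', '120'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(raw, errors='coerce')

# Fill the gaps with the median of the parseable values (one possible choice).
cleaned = numeric.fillna(numeric.median())
print(cleaned.tolist())
```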

Let's write functions that will take care of those problems.

Suppose we've got our best-performing model already available:

In [None]:
# 'absolute_error' is the current name for the criterion formerly called 'mae'.
et_best = ExtraTreesRegressor(n_estimators=100, bootstrap=True, random_state=1,
                             max_features='sqrt', criterion='absolute_error').fit(X_train, y_train)

In [None]:
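def clean(row):
    """Return the row as a Series with the string-valued columns cast to float."""
    series = pd.Series(row)
    # These two columns arrive as strings in the raw data.
    for col in [' cubicinches', ' weightlbs']:
        series[col] = float(series[col])
    return series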

In [None]:
def drop(row):
    return row.drop([' brand', 'mpg'])

In [None]:
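def model_predict(row):
    """Clean a raw row, drop the non-feature columns, and predict MPG."""
    row_clean = clean(row)
    row_preds = drop(row_clean)
    # Build a one-row DataFrame in the training column order so the
    # features line up with what the model saw during fitting.
    features = pd.DataFrame([row_preds])[X_train.columns].astype(float)
    return et_best.predict(features)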

In [None]:
new_row = {'mpg': 25.0, ' cylinders': 6, ' cubicinches': '202', ' hp': 120,
          ' weightlbs': '2300', ' time-to-60': 13, ' year': 2019, ' brand': 'China.'}

In [None]:
model_predict(new_row)

## Exercise

Use a Random Forest Classifier to predict the category of price range for the phones in this dataset. Try tuning some hyperparameters using GridSearch, and then write up a short paragraph about your findings.

In [None]:
phones_train = pd.read_csv('data/train.csv')

phones_test = pd.read_csv('data/test.csv')