# Challenge: If a tree falls in the forest...
Pick a dataset. It could be old or new. Then build the best decision tree you can. Now try to match that with the simplest random forest you can. For our purposes measure simplicity with runtime.

> Initially I was working with the auto-mpg data found on the UCI dataset website, but it has many issues with nonlinearity and produced very poor scores no matter what feature engineering I applied. So instead I'm working with EPA fuel efficiency data from 2016, found here: https://www.epa.gov/compliance-and-fuel-economy-data/data-cars-used-testing-fuel-economy

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import scipy
from sklearn import tree
from sklearn import ensemble
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split
import time



In [2]:
data = pd.read_csv("16tstcar.csv") 
data.head(2)

| | Model Year | Vehicle Manufacturer Name | Veh Mfr Code | Represented Test Veh Make | Represented Test Veh Model | Test Vehicle ID | Test Veh Configuration # | Test Veh Displacement (L) | Actual Tested Testgroup | Vehicle Type | ... | Set Coef A (lbf) | Set Coef B (lbf/mph) | Set Coef C (lbf/mph**2) | Aftertreatment Device Cd | Aftertreatment Device Desc | Police - Emergency Vehicle? | Averaging Group ID | Averaging Weighting Factor | Averaging Method Cd | Averging Method Desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | aston martin | ASX | Aston Martin | DB9 | 143TT1042 | 0 | 5.9 | DASXV05.9VH1 | Car | ... | 8.35 | 0.299 | 0.0192 | TWC | Three-way catalyst | N | | | N | No averaging |
| 1 | 2016 | aston martin | ASX | Aston Martin | DB9 | 143TT1042 | 0 | 5.9 | DASXV05.9VH1 | Car | ... | 8.35 | 0.299 | 0.0192 | TWC | Three-way catalyst | N | | | N | No averaging |


In [3]:
# Create a new dataframe with only the numerical columns.
# .copy() makes epa independent of data, so adding columns later is safe.
epa = data.select_dtypes([np.number]).copy()

In [4]:
# Add a couple categorical values for reference
epa['Brand'] = data['Vehicle Manufacturer Name']
epa['Vehicle Type'] = data['Vehicle Type']

epa.columns



Index(['Model Year', 'Test Veh Configuration #', 'Test Veh Displacement (L)',
       'Rated Horsepower', '# of Cylinders and Rotors', '# of Gears',
       'Transmission Overdrive Code', 'Equivalent Test Weight (lbs.)',
       'Axle Ratio', 'N/V Ratio', 'Shift Indicator Light Use Cd',
       'ADFE Total Road Load HP', 'ADFE Equiv. Test Weight (lbs.)',
       'ADFE N/V Ratio', 'Test Procedure Cd', 'Test Fuel Type Cd',
       'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)', 'NOx (g/mi)', 'PM (g/mi)',
       'CH4 (g/mi)', 'N2O (g/mi)', 'RND_ADJ_FE', 'FE Bag 1', 'FE Bag 2',
       'FE Bag 3', 'FE Bag 4', 'DT-Inertia Work Ratio Rating',
       'DT-Absolute Speed Change Ratg', 'DT-Energy Economy Rating',
       'Target Coef A (lbf)', 'Target Coef B (lbf/mph)',
       'Target Coef C (lbf/mph**2)', 'Set Coef A (lbf)',
       'Set Coef B (lbf/mph)', 'Set Coef C (lbf/mph**2)',
       'Averaging Weighting Factor', 'Brand', 'Vehicle Type'],
      dtype='object')

In [5]:
epa.isnull().sum()

Model Year                           0
Test Veh Configuration #             0
Test Veh Displacement (L)            0
Rated Horsepower                     0
# of Cylinders and Rotors          140
# of Gears                           0
Transmission Overdrive Code          0
Equivalent Test Weight (lbs.)        0
Axle Ratio                           0
N/V Ratio                            0
Shift Indicator Light Use Cd         0
ADFE Total Road Load HP           4184
ADFE Equiv. Test Weight (lbs.)    4182
ADFE N/V Ratio                    4184
Test Procedure Cd                    0
Test Fuel Type Cd                    0
THC (g/mi)                         437
CO (g/mi)                          436
CO2 (g/mi)                         142
NOx (g/mi)                         475
PM (g/mi)                         3997
CH4 (g/mi)                         794
N2O (g/mi)                        2923
RND_ADJ_FE                           4
FE Bag 1                          2538
FE Bag 2                 

In [6]:
# Some columns have only a small number of missing values.
# Replace them with the mean of that column.
fill_cols = ['# of Cylinders and Rotors', 'THC (g/mi)', 'CO (g/mi)', 'CO2 (g/mi)',
             'NOx (g/mi)', 'CH4 (g/mi)', 'RND_ADJ_FE']
epa[fill_cols] = epa[fill_cols].fillna(epa[fill_cols].mean())

# Drop every column that still contains any null value
# (dropna(axis=1) drops on any null, not only majority-null columns)
epa = epa.dropna(axis=1)



In [7]:
# Rename column for easier reference
epa = epa.rename(columns={'RND_ADJ_FE': 'MPG'})

# Separate features from the target
X = epa.drop('MPG', axis=1)
Y = epa['MPG']
X = pd.get_dummies(X)  # one-hot encode 'Brand' and 'Vehicle Type'
X = X.dropna(axis=1)



## Initial run

In [8]:
# Initialize our tree.
dtc = tree.DecisionTreeRegressor(
    max_features=1,
    max_depth=4
)
dtc.fit(X, Y)

# Report the elapsed time and the R^2 score on the training data.
# Note: the timer starts after fit(), so the runtime excludes training.
start_time = time.time()
print("Runtime --- %s seconds ---" % (time.time() - start_time))
print(dtc.score(X,Y))

Runtime --- 6.008148193359375e-05 seconds ---
0.005354146823850935


In [9]:
# Initialize our forest
rfc = ensemble.RandomForestRegressor()
rfc.fit(X,Y)

start_time = time.time()
print("Runtime --- %s seconds ---" % (time.time() - start_time))
print(rfc.score(X, Y))

Runtime --- 4.8160552978515625e-05 seconds ---
0.9071518555000663


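***Both recorded runtimes are around 5e-05 seconds because the timer starts after `fit()`, so they measure only the gap between two `time.time()` calls and say nothing about training speed. On the training data, the random forest's R² (0.907) is roughly 170 times the decision tree's (0.005). I'm going to do some feature reduction to try to get the decision tree to score better, then split the data into training and test sets.***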
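
To compare training cost, the clock has to start before `fit()`. The following is a minimal sketch of that idea, not the notebook's original code: `timed_fit` is a hypothetical helper, and the data comes from `make_regression` as a synthetic stand-in for the EPA features, so the printed times will not match any number above.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the EPA features and MPG target (an assumption).
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)


def timed_fit(model, X, y):
    """Fit the model and return (fitted model, seconds spent inside fit)."""
    start = time.perf_counter()  # start the clock BEFORE training
    model.fit(X, y)
    return model, time.perf_counter() - start


tree_model, tree_seconds = timed_fit(
    DecisionTreeRegressor(max_depth=4, random_state=0), X_demo, y_demo
)
forest_model, forest_seconds = timed_fit(
    RandomForestRegressor(n_estimators=10, random_state=0), X_demo, y_demo
)

print("tree fit:   %.4f s" % tree_seconds)
print("forest fit: %.4f s" % forest_seconds)
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short intervals.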

## Second attempt

In [10]:
# Get the most important features
features_imp = pd.Series(rfc.feature_importances_,index=X.columns).sort_values(ascending=False)[:10]
features_imp

Test Fuel Type Cd                0.556167
N/V Ratio                        0.085058
Equivalent Test Weight (lbs.)    0.075012
# of Gears                       0.057145
THC (g/mi)                       0.034072
Test Procedure Cd                0.031264
Target Coef B (lbf/mph)          0.026167
Set Coef C (lbf/mph**2)          0.024983
CO2 (g/mi)                       0.017695
Shift Indicator Light Use Cd     0.015261
dtype: float64
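
The ranked Series can also be used to select columns directly, rather than typing the feature names by hand. A small sketch with made-up importances and a toy frame (the names `feat_a` through `feat_e`, `X_demo`, and `top_n` are all hypothetical, standing in for `rfc.feature_importances_` and `X`):

```python
import pandas as pd

# Hypothetical importances standing in for rfc.feature_importances_.
importances = pd.Series(
    {"feat_a": 0.55, "feat_b": 0.20, "feat_c": 0.15, "feat_d": 0.07, "feat_e": 0.03}
)

# Toy feature matrix with the same column names (illustrative values only).
X_demo = pd.DataFrame(
    {
        "feat_a": [1, 2, 3],
        "feat_b": [4, 5, 6],
        "feat_c": [7, 8, 9],
        "feat_d": [0, 1, 0],
        "feat_e": [1, 1, 0],
    }
)

top_n = 3
# Sort by importance and keep the labels of the top_n features.
top_features = importances.sort_values(ascending=False).index[:top_n]
features_demo = X_demo[top_features]  # keep only the top-ranked columns

print(list(features_demo.columns))  # ['feat_a', 'feat_b', 'feat_c']
```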

In [11]:
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import GridSearchCV

features = X[['Test Fuel Type Cd', 'THC (g/mi)', 'NOx (g/mi)', 'CO2 (g/mi)', 'Test Veh Configuration #', 
             'N/V Ratio', 'Test Procedure Cd', 'Set Coef A (lbf)', 'Rated Horsepower']]

parameters = {
    'max_features':[1,3,5],
    'max_depth':[1,2,3,4,5,6]    
}

grid = GridSearchCV(dtc, parameters, cv=10, verbose=0)
#Fit the Data
grid.fit(features, Y)



GridSearchCV(cv=10, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=1,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_features': [1, 3, 5], 'max_depth': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [12]:
print(grid.best_score_)
print(grid.best_params_)

-2.8158302216222353
{'max_depth': 4, 'max_features': 5}
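
With `scoring=None`, `GridSearchCV` falls back to the estimator's own `score` method, which for a regressor is R². A negative R² means the predictions on the held-out folds were worse than always predicting the mean of the target. The toy numbers below are made up purely to show how R² behaves; they are not from the EPA data.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_pred_good = [1.0, 2.0, 3.0]  # perfect predictions
y_pred_mean = [2.0, 2.0, 2.0]  # always predict the mean of y_true
y_pred_bad = [3.0, 2.0, 1.0]   # worse than predicting the mean

print(r2_score(y_true, y_pred_good))  # 1.0
print(r2_score(y_true, y_pred_mean))  # 0.0
print(r2_score(y_true, y_pred_bad))   # -3.0
```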


In [18]:
# Initialize our tree.
dtc = tree.DecisionTreeRegressor(
    max_features=5,
    max_depth=4
)
dtc.fit(features, Y)
# Report the elapsed time and the R^2 score on the training data.
# Note: the timer starts after fit(), so the runtime excludes training.
start_time = time.time()
print("Runtime --- %s seconds ---" % (time.time() - start_time))
print(dtc.score(features,Y))

Runtime --- 5.1021575927734375e-05 seconds ---
0.5188605060046264


***The recorded runtime is essentially unchanged (the timer still starts after `fit()`), but the training R² rose from 0.005 to 0.519.***

In [14]:
# Initialize our forest
rfc = ensemble.RandomForestRegressor()
rfc.fit(features,Y)

start_time = time.time()
print("Runtime --- %s seconds ---" % (time.time() - start_time))
print(rfc.score(features, Y))

Runtime --- 4.982948303222656e-05 seconds ---
0.9724274247283814


***The recorded runtime is about the same, but the training R² rose from 0.907 to 0.972. Because the score is computed on the same rows the forest was fit on, it may reflect overfitting.***

## Third attempt

### Train Test Split

In [15]:
#Split the data into training and validation
X_train, X_test, y_train, y_test = train_test_split(features,Y)

In [24]:
# Create a single tree; the timer now wraps fit()
start_time = time.time()
dtc = tree.DecisionTreeRegressor(
    max_features=5,
    max_depth=4
)
model = dtc.fit(X_train, y_train)
print("Runtime --- %s seconds ---" % (time.time() - start_time))
# Scored on the full dataset (features, Y), which includes the training rows
print(dtc.score(features, Y))

Runtime --- 0.006599903106689453 seconds ---
0.9992712049437592


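***This is by far the best-scoring decision tree so far, and the runtime (0.0066 seconds) now includes `fit()`. Note that the score is computed on the full dataset, which includes the training rows.***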

In [25]:
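# Create the forest; the timer wraps fit()
start_time = time.time()
rfc = RandomForestRegressor()
model = rfc.fit(X_train, y_train)
print("Runtime --- %s seconds ---" % (time.time() - start_time))
# Scored on the full dataset (features, Y), which includes the training rows
print(rfc.score(features, Y))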

Runtime --- 0.16591191291809082 seconds ---
0.9487161558970815
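
Scoring on the held-out rows is what separates a good fit from an overfit one. The sketch below compares train and test R² for both models. It runs on synthetic data from `make_regression` (an assumption, standing in for `features` and `Y`), so its numbers say nothing about the EPA results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for `features` and `Y` (an assumption).
X_demo, y_demo = make_regression(n_samples=600, n_features=9, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=10, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = {
        "train": model.score(X_tr, y_tr),  # R^2 on rows the model has seen
        "test": model.score(X_te, y_te),   # R^2 on held-out rows
    }
    print("%-6s train R^2: %.3f  test R^2: %.3f"
          % (name, scores[name]["train"], scores[name]["test"]))
```

A large gap between the train and test columns is the usual sign of overfitting.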


## Summary
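Fit on the training split, the decision tree scored higher than the random forest (R² of 0.999 versus 0.949) and fit faster (about 0.007 seconds versus 0.166 seconds). Both scores were computed on the full dataset, which includes the training rows, so the tree may actually be overfitting.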
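
Since the challenge measures simplicity by runtime, one way to look for the simplest forest is to sweep `n_estimators` and record fit time next to held-out R². This is a sketch under assumptions: the data is synthetic (`make_regression`), and the candidate sizes and `max_depth=4` are arbitrary choices for illustration, not settings from the runs above.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data standing in for `features` and `Y` (an assumption).
X_demo, y_demo = make_regression(n_samples=600, n_features=9, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

results = []
for n_trees in [1, 2, 5, 10, 20]:
    forest = RandomForestRegressor(n_estimators=n_trees, max_depth=4, random_state=0)
    start = time.perf_counter()  # time only the call to fit()
    forest.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    results.append((n_trees, elapsed, forest.score(X_te, y_te)))

for n_trees, elapsed, r2 in results:
    print("n_estimators=%2d  fit: %.4f s  test R^2: %.3f" % (n_trees, elapsed, r2))
```

The smallest `n_estimators` whose test R² reaches the tree's test R² would be the candidate for the simplest matching forest.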