# Challenge

Now that you've learned about random forests and decision trees let's do an exercise in accuracy. You know that random forests are basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes measure simplicity with runtime. Compare that to the runtime of the decision tree. This is imperfect but just go with it.

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

### Notebook Flow

    * Import Libraries
    * Import homes data
    * Convert categorical features into numeric features
    * Prepare Tree Model for the regression task of predicting Sale price
    * Evaluate Tree
    * Prepare Simple Random Forest
    * Compare accuracy and Runtime

## Importing Libraries and Data

In [1]:
import numpy as np 
import pandas as pd 
from sklearn import preprocessing
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import time

In [2]:
# timer script
start_time = time.time()
print("--- %s seconds ---" % (time.time() - start_time))

--- 0.0 seconds ---


In [3]:
# read in data
PATH = r'C:\Users\latee\Documents\GitHub\homes.csv'
homes = pd.read_csv(PATH)

## Preprocessing

In [4]:
def cat_converter(df):
    # obj holding non-numeric-colmuns
    non_numeric_columns = homes.select_dtypes(['object']).columns
    for cols in df:
        if cols in non_numeric_columns:
            # Create a label (category) encoder object
            le = preprocessing.LabelEncoder()
            # Create a label (category) encoder object
            le.fit(df[cols])
            # Apply the fitted encoder to the pandas column
            df[cols] = le.transform(df[cols]) 
    return df
cat_converter(homes).head()

Unnamed: 0.1,Unnamed: 0,id,mssubclass,mszoning,lotarea,street,lotshape,landcontour,utilities,lotconfig,...,enclosedporch,threessnporch,screenporch,poolarea,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,0,1,60,3,8450,1,3,3,0,4,...,0,0,0,0,0,2,2008,8,4,208500
1,1,2,20,3,9600,1,3,3,0,2,...,0,0,0,0,0,5,2007,8,4,181500
2,2,3,60,3,11250,1,0,3,0,4,...,0,0,0,0,0,9,2008,8,4,223500
3,3,4,70,3,9550,1,0,3,0,0,...,0,0,0,0,0,2,2006,8,0,140000
4,4,5,60,3,13518,1,0,3,0,2,...,0,0,0,0,0,10,2008,8,4,250000


In [5]:
homes = homes.drop(['Unnamed: 0', 'id'], axis=1)

In [6]:
homes.head()

Unnamed: 0,mssubclass,mszoning,lotarea,street,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,...,enclosedporch,threessnporch,screenporch,poolarea,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,60,3,8450,1,3,3,0,4,0,5,...,0,0,0,0,0,2,2008,8,4,208500
1,20,3,9600,1,3,3,0,2,0,24,...,0,0,0,0,0,5,2007,8,4,181500
2,60,3,11250,1,0,3,0,4,0,5,...,0,0,0,0,0,9,2008,8,4,223500
3,70,3,9550,1,0,3,0,0,0,6,...,0,0,0,0,0,2,2006,8,0,140000
4,60,3,13518,1,0,3,0,2,0,15,...,0,0,0,0,0,10,2008,8,4,250000


## Decision Tree Regressor Model 

In [7]:
# selecting the "data" and "target"
X = homes.iloc[:, :-1]
y = homes['saleprice']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)

In [8]:
# create a regressor object 
regressor = DecisionTreeRegressor()  

# fitting model 
regressor.fit(X_train, y_train) 

# making predictions
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# timer script
start_time = time.time()
print("--- %s seconds ---" % (time.time() - start_time))

Mean Absolute Error: 18512.915422885573
Mean Squared Error: 780953927.8059702
Root Mean Squared Error: 27945.552916447552
--- 0.0 seconds ---


In [9]:
scores = cross_val_score(regressor, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.74 (+/- 0.04)


## Random Forest Regressor Model

In [10]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)

In [11]:
# create model instance
f_regressor = ensemble.RandomForestRegressor()

# fitting model
f_regressor.fit(X_train, y_train)

# making predictions
y_pred = f_regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

# timer script
start_time = time.time()
print("--- %s seconds ---" % (time.time() - start_time))

Mean Absolute Error: 13723.038805970149
Mean Squared Error: 384555966.48910445
Root Mean Squared Error: 19610.09858438005
--- 0.0 seconds ---




In [12]:
scores = cross_val_score(f_regressor, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.86 (+/- 0.04)


### In Closing

The forest outperforms that tree model here with a little more variance, but lower error. These cells didn't take long to run because the homes dataset is not that big, but as the size the data increases so does the computational resources required to implement the forest model. That is due to the nature of the algorithm, sorting through the data numerous times to complete each tree in the forest. 

It's not noticeable in the notebook - but the forest model did take a little longer to run. 