# Challenge

Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that a random forest is essentially a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that accuracy with the simplest random forest you can. For our purposes, measure simplicity by [runtime](https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution), and compare it to the runtime of the decision tree. Runtime is an imperfect proxy for simplicity, but just go with it.
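One way to capture runtime is to wrap whatever you want to measure in a small timing helper. The sketch below uses only the standard library; the helper name `timed` and the stand-in workload `slow_sum` are illustrative and not part of the original exercise.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs on its own (hypothetical).
def slow_sum(n):
    return sum(i * i for i in range(n))


result, seconds = timed(slow_sum, 100_000)
print(f"result={result}, runtime={seconds:.4f}s")
```

In the notebook, the same helper could wrap a model evaluation, for example `timed(cross_val_score, model, X, y, cv=10)`, where `model`, `X`, and `y` are whatever estimator and data you settle on. Timing the tree and the forest the same way keeps the comparison fair.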

Hopefully this will show you the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

### Import Statements

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline

# This is the model we'll be using.
from sklearn import tree

# A convenience for displaying visualizations.
from IPython.display import Image

# Packages for rendering our tree.
import pydotplus
import graphviz

ModuleNotFoundError: No module named 'graphviz'

### The Dataframe

In [4]:
lending_df = pd.read_csv('LoanStats3d.csv', nrows=5000, skipinitialspace=True, header=1)
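The `read_csv` arguments above are easier to follow on a tiny example. The sketch below assumes a file whose first line is a free-text notice rather than column names, which is a common reason to pass `header=1`; the inline text is made up for illustration and is not taken from `LoanStats3d.csv`.

```python
import io

import pandas as pd

# Made-up file contents: line 0 is a notice, line 1 holds the column names.
raw = (
    "Notes offered by Prospectus\n"
    "id,loan_amnt,term\n"
    "1,1000, 36 months\n"
    "2,2000, 60 months\n"
    "3,1500, 36 months\n"
)

sample_df = pd.read_csv(
    io.StringIO(raw),
    header=1,               # use the second line as the header row
    nrows=2,                # read only the first two data rows
    skipinitialspace=True,  # drop spaces that follow a delimiter
)
print(sample_df)
```

Reading only `nrows` rows keeps the experiment fast, at the cost of training on a small slice of the data.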

### Data Cleaning

In [5]:
categorical = lending_df.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

term
2
int_rate
42
grade
7
sub_grade
35
emp_title
3113
emp_length
11
home_ownership
3
verification_status
3
issue_d
1
loan_status
7
pymnt_plan
1
url
5000
desc
1
purpose
12
title
12
zip_code
708
addr_state
49
earliest_cr_line
453
revol_util
981
initial_list_status
2
last_pymnt_d
13
next_pymnt_d
2
last_credit_pull_d
14
application_type
2
verification_status_joint
3


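Drop the categorical columns that have 30+ unique values, except `int_rate`, which is converted to numeric instead. Also drop `desc`, which has only a single unique value.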
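The list of high-cardinality columns could also be derived from the unique counts instead of being typed out by hand. The following is a minimal sketch on a toy DataFrame; the names `toy_df` and `MAX_UNIQUE` are hypothetical, and it selects non-numeric columns with `exclude=['number']` rather than the `include=['object']` used in this notebook.

```python
import pandas as pd

# Toy data standing in for the lending DataFrame (hypothetical values).
toy_df = pd.DataFrame({
    "grade": ["A", "B", "A", "C"],
    "url": ["u1", "u2", "u3", "u4"],
    "loan_amnt": [1000, 2000, 1500, 3000],
})

MAX_UNIQUE = 4  # hypothetical threshold; the notebook's rule of thumb is 30

# Count unique values in every non-numeric column.
categorical = toy_df.select_dtypes(exclude=["number"])
unique_counts = categorical.nunique()

# Keep the names of columns at or above the threshold, then drop them.
high_cardinality = unique_counts[unique_counts >= MAX_UNIQUE].index.tolist()
trimmed_df = toy_df.drop(columns=high_cardinality)

print(high_cardinality)
print(trimmed_df.columns.tolist())
```

A rule like this would also flag columns such as `int_rate` that are really numeric strings, so those need converting before the drop is applied.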

In [6]:
# Convert ID and Interest Rate to numeric.
lending_df['id'] = pd.to_numeric(lending_df['id'], errors='coerce')
lending_df['int_rate'] = pd.to_numeric(lending_df['int_rate'].str.strip('%'), errors='coerce')

# Drop other columns with many unique values.
lending_df.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)
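To see what the `int_rate` conversion does, here is a small sketch on a made-up Series. The values are illustrative only, and the extra `.str.strip()` for surrounding whitespace is an addition that the notebook cell does not use.

```python
import pandas as pd

# Made-up percentage strings, including one that cannot be parsed.
rates = pd.Series(["14.85%", " 7.49%", "n/a"])

# Strip whitespace and the percent sign, then coerce failures to NaN.
numeric_rates = pd.to_numeric(
    rates.str.strip().str.strip("%"),
    errors="coerce",
)
print(numeric_rates)
```

With `errors='coerce'`, unparseable entries become `NaN` instead of raising, so it is worth checking how many missing values the conversion introduces.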

In [7]:
# Preview the categorical variables converted into dummy/indicator variables.
# get_dummies returns a new DataFrame; lending_df itself is unchanged here.
pd.get_dummies(lending_df)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,...,last_credit_pull_d_Mar-16,last_credit_pull_d_May-16,last_credit_pull_d_Nov-16,last_credit_pull_d_Oct-16,last_credit_pull_d_Sep-16,application_type_INDIVIDUAL,application_type_JOINT,verification_status_joint_Not Verified,verification_status_joint_Source Verified,verification_status_joint_Verified
0,68009401,72868139,16000,16000,16000,14.85,379.39,48000.0,33.18,0,...,0,0,0,0,0,1,0,0,0,0
1,68354783,73244544,9600,9600,9600,7.49,298.58,60000.0,22.44,0,...,0,0,0,0,0,1,0,0,0,0
2,68466916,73356753,25000,25000,25000,7.49,777.55,109000.0,26.02,0,...,0,0,0,0,0,1,0,0,0,0
3,68466961,73356799,28000,28000,28000,6.49,858.05,92000.0,21.60,0,...,0,0,0,0,0,1,0,0,0,0
4,68495092,73384866,8650,8650,8650,19.89,320.99,55000.0,25.49,0,...,0,0,0,0,0,1,0,0,0,0
5,68506798,73396623,23000,23000,23000,8.49,471.77,64000.0,18.28,0,...,0,0,0,0,0,1,0,0,0,0
6,68566886,73456723,29900,29900,29900,12.88,678.49,65000.0,21.77,0,...,0,0,0,0,0,1,0,0,0,0
7,68577849,73467703,18000,18000,18000,11.99,400.31,112000.0,8.68,0,...,0,0,0,0,0,1,0,0,0,0
8,66310712,71035433,35000,35000,35000,14.85,829.90,110000.0,17.06,0,...,0,0,0,0,0,1,0,0,0,0
9,68476807,73366655,10400,10400,10400,22.45,289.91,104433.0,25.37,1,...,0,0,0,0,0,1,0,0,0,0
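Because `pd.get_dummies` returns a new DataFrame, the encoded result has to be assigned to a name before it can be used for modeling. A minimal sketch on toy data follows; `toy_df` and `encoded_df` are hypothetical names.

```python
import pandas as pd

# Toy data: one categorical column and one numeric column.
toy_df = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "loan_amnt": [1000, 2000, 1500],
})

# Assign the result; the original DataFrame keeps its 'term' column.
encoded_df = pd.get_dummies(toy_df)

print(encoded_df.columns.tolist())
print(toy_df.columns.tolist())
```

Numeric columns pass through unchanged, and each category becomes its own indicator column. Depending on the pandas version, the indicators may display as 0/1 or as True/False.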


### Dropping Dataframe Columns