# Challenge

Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that a random forest is essentially a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes, measure simplicity with [runtime](https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution), and compare it to the runtime of the decision tree. Runtime is an imperfect measure of simplicity, but just go with it.
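As a rough illustration of measuring runtime, the sketch below wraps any callable in a timer. The helper name `time_call` and the `repeats` parameter are hypothetical, and `sum` over a range is only a stand-in for fitting a model.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn(*args, **kwargs) `repeats` times; return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


# Stand-in workload; in the challenge this would be a model fit or a cross-validation call.
seconds, total = time_call(sum, range(1_000_000))
print(f"best of 3 runs: {seconds:.4f} s")
```

Taking the best of several runs reduces noise from other processes, which matters when the two models differ by only fractions of a second.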

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

### Import Statements

In [29]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
%matplotlib inline

# This is the model we'll be using.
from sklearn import tree

# A convenience for displaying visualizations.
from IPython.display import Image

# Packages for rendering our tree.
import pydotplus
import graphviz

ModuleNotFoundError: No module named 'graphviz'

### The Dataframe

In [48]:
lending_df = pd.read_csv('LoanStats3d.csv', nrows=5000, skipinitialspace=True, header=1)

### Data Cleaning

In [49]:
categorical = lending_df.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

term
2
int_rate
42
grade
7
sub_grade
35
emp_title
3113
emp_length
11
home_ownership
3
verification_status
3
issue_d
1
loan_status
7
pymnt_plan
1
url
5000
desc
1
purpose
12
title
12
zip_code
708
addr_state
49
earliest_cr_line
453
revol_util
981
initial_list_status
2
last_pymnt_d
13
next_pymnt_d
2
last_credit_pull_d
14
application_type
2
verification_status_joint
3


Drop the categorical columns that have 30 or more unique values. The exception is `int_rate`, which is converted to a numeric column instead.

In [50]:
# revision_one_lending_df = lending_df.copy()

# Convert ID and Interest Rate to numeric.
lending_df['id'] = pd.to_numeric(lending_df['id'], errors='coerce')
lending_df['int_rate'] = pd.to_numeric(lending_df['int_rate'].str.strip('%'), errors='coerce')

# Drop the other columns with many unique values.
# Pass axis by keyword; the positional form is rejected by newer pandas versions.
lending_df.drop(['url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)
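The list of columns to drop can also be derived from the unique-value counts rather than typed by hand. This is a minimal sketch on a toy frame; the helper `high_cardinality_columns` and the `toy` data are hypothetical, and the threshold of 30 mirrors the rule stated above.

```python
import pandas as pd


def high_cardinality_columns(df, threshold=30):
    """Names of non-numeric columns with at least `threshold` distinct values."""
    counts = df.select_dtypes(exclude=["number"]).nunique()
    return counts[counts >= threshold].index.tolist()


# Toy frame standing in for lending_df.
toy = pd.DataFrame({
    "url": [f"https://example.com/{i}" for i in range(40)],
    "grade": list("ABCD") * 10,
    "loan_amnt": range(40),
})

to_drop = high_cardinality_columns(toy, threshold=30)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(list(cleaned.columns))
```

Columns that should be converted rather than dropped, such as `int_rate`, would need to be converted to numeric first so that the helper no longer selects them.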

In [51]:
revision_one_lending_df = lending_df.copy()

In [52]:
revision_one_lending_df['term'].head()

0    60 months
1    36 months
2    36 months
3    36 months
4    36 months
Name: term, dtype: object
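Since `term` holds strings such as `60 months`, one option is to pull out the number of months, much like the `int_rate` conversion earlier. The sketch below assumes every value starts with digits; `term_months` is a hypothetical name and the three rows are taken from the output above.

```python
import pandas as pd

# A few values copied from the `term` column shown above.
term = pd.Series(["60 months", "36 months", "36 months"], name="term")

# Extract the leading digits and convert them to a numeric dtype.
term_months = pd.to_numeric(term.str.extract(r"(\d+)", expand=False), errors="coerce")
print(term_months.tolist())
```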

In [53]:
# This will convert categorical variables into dummy/indicator variables.
pd.get_dummies(revision_one_lending_df, drop_first=True)

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | ... | last_credit_pull_d_Jul-16 | last_credit_pull_d_Jun-16 | last_credit_pull_d_Mar-16 | last_credit_pull_d_May-16 | last_credit_pull_d_Nov-16 | last_credit_pull_d_Oct-16 | last_credit_pull_d_Sep-16 | application_type_JOINT | verification_status_joint_Source Verified | verification_status_joint_Verified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68009401 | 72868139 | 16000 | 16000 | 16000 | 14.85 | 379.39 | 48000.0 | 33.18 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 68354783 | 73244544 | 9600 | 9600 | 9600 | 7.49 | 298.58 | 60000.0 | 22.44 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 68466916 | 73356753 | 25000 | 25000 | 25000 | 7.49 | 777.55 | 109000.0 | 26.02 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 68466961 | 73356799 | 28000 | 28000 | 28000 | 6.49 | 858.05 | 92000.0 | 21.60 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 68495092 | 73384866 | 8650 | 8650 | 8650 | 19.89 | 320.99 | 55000.0 | 25.49 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 68506798 | 73396623 | 23000 | 23000 | 23000 | 8.49 | 471.77 | 64000.0 | 18.28 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 68566886 | 73456723 | 29900 | 29900 | 29900 | 12.88 | 678.49 | 65000.0 | 21.77 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 68577849 | 73467703 | 18000 | 18000 | 18000 | 11.99 | 400.31 | 112000.0 | 8.68 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 66310712 | 71035433 | 35000 | 35000 | 35000 | 14.85 | 829.90 | 110000.0 | 17.06 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 68476807 | 73366655 | 10400 | 10400 | 10400 | 22.45 | 289.91 | 104433.0 | 25.37 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
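Note that `pd.get_dummies` returns a new frame and leaves its input unchanged, so the result has to be assigned to be used later. A small sketch on hypothetical toy data (the names `toy` and `encoded` are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "term": ["60 months", "36 months", "36 months"],
    "loan_amnt": [16000, 9600, 25000],
})

# get_dummies does not modify `toy`; keep the returned frame.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(toy.columns))      # the original still has the string column
print(list(encoded.columns))  # numeric column plus one indicator for `term`
```

With `drop_first=True`, the first category of each encoded column is dropped, so a two-level column such as `term` becomes a single indicator.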


In [44]:
lending_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 103 entries, id to total_il_high_credit_limit
dtypes: float64(31), int64(56), object(16)
memory usage: 3.6+ MB


In [None]:
# Unexecuted cell, adapted to this notebook's dataframe: encode 'term' as a single indicator column.
revision_one_lending_df['term'] = pd.get_dummies(revision_one_lending_df['term'], drop_first=True)

In [54]:
revision_one_lending_df['term']

0       60 months
1       36 months
2       36 months
3       36 months
4       36 months
5       60 months
6       60 months
7       60 months
8       60 months
9       60 months
10      60 months
11      60 months
12      60 months
13      36 months
14      36 months
15      36 months
16      36 months
17      36 months
18      60 months
19      36 months
20      36 months
21      36 months
22      36 months
23      36 months
24      36 months
25      36 months
26      36 months
27      36 months
28      36 months
29      36 months
          ...    
4970    36 months
4971    36 months
4972    36 months
4973    36 months
4974    36 months
4975    36 months
4976    36 months
4977    36 months
4978    36 months
4979    36 months
4980    60 months
4981    36 months
4982    36 months
4983    36 months
4984    36 months
4985    60 months
4986    36 months
4987    60 months
4988    36 months
4989    36 months
4990    36 months
4991    36 months
4992    36 months
4993    36 months
4994    60

### Running the Random Forest Model

In [35]:
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

rfc = ensemble.RandomForestClassifier()
X = lending_df.drop('loan_status', axis=1)
Y = lending_df['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

cross_val_score(rfc, X, Y, cv=10)



array([0.9860835 , 0.98011928, 0.98605578, 0.98007968, 0.98802395,
       0.98797595, 0.98795181, 0.98594378, 0.98594378, 0.97983871])
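Ten fold scores are easier to compare against another model once they are reduced to a mean and a spread. The sketch below copies the scores printed above into a plain list; `rf_scores` is a hypothetical name.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
rf_scores = [0.9860835, 0.98011928, 0.98605578, 0.98007968, 0.98802395,
             0.98797595, 0.98795181, 0.98594378, 0.98594378, 0.97983871]

print(f"mean accuracy: {mean(rf_scores):.4f}")
print(f"std across folds: {stdev(rf_scores):.4f}")
```

In the notebook itself, the same summary is available directly from the returned array via `.mean()` and `.std()`.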

### Getting the Random Forest Model's Runtime

In [36]:
from timeit import default_timer as timer

start = timer()

rfc = ensemble.RandomForestClassifier()
X = lending_df.drop('loan_status', axis=1)
Y = lending_df['loan_status']
X = pd.get_dummies(X)
X = X.dropna(axis=1)

cross_val_score(rfc, X, Y, cv=10)

end = timer()

print("Time taken:", end - start)



Time taken: 1.0622926249998272
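To complete the comparison the challenge asks for, the same timing pattern can be applied to both models over identical data. The sketch below uses synthetic data from `make_classification` rather than the lending data, and the hyperparameters (`max_depth=4`, `n_estimators=10`, `cv=5`) are illustrative assumptions, so the numbers it prints say nothing about this dataset.

```python
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix and target.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

results = {}
for name, model in models.items():
    # Time only the cross-validation, not the data preparation.
    start = timer()
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    elapsed = timer() - start
    results[name] = (scores.mean(), elapsed)
    print(f"{name}: accuracy={scores.mean():.3f}, seconds={elapsed:.3f}")
```

Keeping preprocessing outside the timed region means the measurement reflects the model cost alone, which makes the two runtimes directly comparable.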


### Code for a Decision Tree

Please note that I'm having trouble importing the graphviz package, which is needed to display a decision tree. I've included the rendering code anyway so that I can run the decision tree once the graphviz error is fixed.

In [37]:
# To build my decision tree, I need to know what the 'loan_status' column's values are.

lending_df['loan_status'].unique()

array(['Current', 'Fully Paid', 'Charged Off', 'Late (31-120 days)',
       'In Grace Period', 'Default', 'Late (16-30 days)'], dtype=object)

In [38]:
# Making a variable using the dataframe's 'loan_status' column.

loan_status = lending_df['loan_status']

In [39]:
# Initialize and train our tree.
decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=4,
    random_state=1337
)
# Fit on the encoded, NaN-free feature matrix X built in the random forest cells.
# Fitting on the raw lending_df instead raises
# "ValueError: could not convert string to float: '60 months'".
decision_tree.fit(X, loan_status)

# Render our tree.
dot_data = tree.export_graphviz(
    decision_tree, out_file=None,
    feature_names=list(X.columns),
    # Class labels in the order the fitted tree stores them.
    class_names=list(decision_tree.classes_),
    filled=True
)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
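While graphviz is unavailable, a fitted tree can still be inspected as plain text with scikit-learn's `export_text`. This is a minimal sketch on the bundled iris data, not the lending data; `demo_tree` and `rules` are hypothetical names, and `max_depth=2` is chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A small tree on a purely numeric dataset, so no encoding step is needed.
demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=1337)
demo_tree.fit(iris.data, iris.target)

# Text rendering of the split rules; requires neither graphviz nor pydotplus.
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```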

In [40]:
lending_df.head()

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | emp_length | ... | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68009401 | 72868139 | 16000 | 16000 | 16000 | 60 months | 14.85 | 379.39 | C | 10+ years | ... | 0 | 2 | 78.9 | 0.0 | 0 | 2 | 298100 | 31329 | 281300 | 13400 |
| 1 | 68354783 | 73244544 | 9600 | 9600 | 9600 | 36 months | 7.49 | 298.58 | A | 8 years | ... | 0 | 2 | 100.0 | 66.7 | 0 | 0 | 88635 | 55387 | 12500 | 75635 |
| 2 | 68466916 | 73356753 | 25000 | 25000 | 25000 | 36 months | 7.49 | 777.55 | A | 10+ years | ... | 0 | 0 | 100.0 | 20.0 | 0 | 0 | 373572 | 68056 | 38400 | 82117 |
| 3 | 68466961 | 73356799 | 28000 | 28000 | 28000 | 36 months | 6.49 | 858.05 | A | 10+ years | ... | 0 | 0 | 91.7 | 22.2 | 0 | 0 | 304003 | 74920 | 41500 | 42503 |
| 4 | 68495092 | 73384866 | 8650 | 8650 | 8650 | 36 months | 19.89 | 320.99 | E | 8 years | ... | 0 | 12 | 100.0 | 50.0 | 1 | 0 | 38998 | 18926 | 2750 | 18248 |
