In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline

In [2]:
# Importing DataFrame and removing columns/null-values that created index problems
# (similar to Guided Example DF):

data = pd.read_csv("LoanStats3d.csv", skipinitialspace = True, header = 1, engine = 'python', skipfooter = 2)

In [3]:
# Cleaning:

# Pass axis by keyword: the positional form is deprecated and rejected by pandas 2.x.
data.drop(['member_id', 'id', 'url', 'emp_title', 'zip_code', 'earliest_cr_line', 'revol_util',
            'sub_grade', 'addr_state', 'desc'], axis=1, inplace=True)

# Giving the 'loan_status' column some values that we can refer to later:

data.replace({"loan_status":{"Charged Off": 0,
                             "Current": 1,
                             "Default":  2,
                             "Fully Paid":  3,
                             "In Grace Period":  4,
                             "Late (16-30 days)":  5,
                             "Late (31-120 days)":  6}}, inplace = True)

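# astype returns a new Series, so assign it back for the cast to take effect:
data['loan_status'] = data['loan_status'].astype('int64')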
data.drop(data.select_dtypes(include = ['object']).keys(), axis = 1, inplace = True)
data.dropna(how = 'any', axis = 1, inplace = True)

# Making a copy of the df so we can assign one to the target data (with loan status info) and one for the training
# data without the loan status info.

data2 = data.copy()

# Dropping the last two problematic rows:

data2 = data2[:-2]
data = data[:-2]

### Complex Trees:

Starting our complex tree here to compare it to the forest model later.

In [4]:
from sklearn import tree


In [5]:
# Dropping the target and the payment columns (columns= already implies axis 1):
x = data.drop(columns = ['out_prncp', 'loan_status',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_amnt'])

y = data['loan_status']

class_list = ["Charged Off", "Current", "Default", "Fully Paid", "In Grace Period", 
                "Late (16-30 days)", "Late (31-120 days)"]

###### Default Tree

In [75]:
# Going with the default tree first (limited max_depth to 10 for runtime considerations):
start_time = datetime.now()
default_tree = tree.DecisionTreeClassifier(criterion = 'gini', 
                                           max_depth = 10, 
                                           max_features = None)

default_tree.fit(x, y)

# Graph:
from IPython.display import Image, display
import pydotplus

dot_data = tree.export_graphviz(default_tree, out_file = None, 
                                feature_names = x.columns, 
                                class_names = class_list, filled=True)

graph = pydotplus.graph_from_dot_data(dot_data)

# display() is needed because Image(...) is not the last expression in the cell:
display(Image(graph.create_png()))
end_time = datetime.now()

In [77]:
print("The mean accuracy score is: {}. \n The runtime is:{}".format(default_tree.score(x, y), (end_time-start_time)))

The mean accuracy score is: 0.5954599102811018. 
 The runtime is:0:00:24.468823


With its default settings, a decision tree keeps splitting until every leaf is pure or no further split is possible, no matter how deep it gets or how many splits that takes. Even with max_depth capped at 10 here, the graph above has so many splits and leaves that it is hard to read unless you export it to a file. Note that the score is computed on the same data the tree was fit on, so it is a training accuracy.
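
To make the growth behaviour concrete, here is a minimal sketch comparing an unconstrained tree with a depth-capped one. It uses a small synthetic dataset from `make_classification` rather than the loan data, and the names `X_demo`, `y_demo`, `full_tree` and `capped_tree` are illustrative assumptions. `get_depth()` and `get_n_leaves()` require scikit-learn 0.21 or later.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     n_informative=5, flip_y=0.1,
                                     random_state=0)

# No depth limit: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Same model, but capped at a depth of 4 (at most 16 leaves).
capped_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)

for name, model in [("unconstrained", full_tree), ("max_depth=4", capped_tree)]:
    print(name,
          "| depth:", model.get_depth(),
          "| leaves:", model.get_n_leaves(),
          "| training accuracy:", round(model.score(X_demo, y_demo), 3))
```

The unconstrained tree should end up far deeper and leafier than the capped one, which is the same effect that makes the loan tree hard to read.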

###### First Tree:
> Making a tree with more depth than the default tree above (max_depth of 16 rather than 10), while also specifying the minimum samples per split, the minimum samples per leaf, and the maximum number of leaf nodes.

In [78]:
# Model:
start_time = datetime.now()

tree_1 = tree.DecisionTreeClassifier(criterion = 'entropy', 
                                     max_depth = 16, 
                                     max_features = None,
                                     min_samples_split = 200,
                                     min_samples_leaf = 70,
                                     max_leaf_nodes = 60)

tree_1.fit(x, y)

# Graph
dot_data = tree.export_graphviz(tree_1, out_file = None,
                                feature_names = x.columns, 
                                class_names = class_list,
                                filled=True)

graph = pydotplus.graph_from_dot_data(dot_data)

display(Image(graph.create_png()))  # display() because this is not the cell's last expression
end_time = datetime.now()

In [79]:
print("The mean accuracy score is: {}. \n The runtime is:{}".format(tree_1.score(x, y), (end_time-start_time)))

The mean accuracy score is: 0.5814796256408917. 
 The runtime is:0:00:09.116010


The parameters on this tree are a little more specific. While the accuracy has gone down (0.595 to 0.581), we are at least able to get a readable visual of the classification process. Controlling max_leaf_nodes is probably the most helpful setting for getting that visual.
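
As a rough illustration of how `max_leaf_nodes` keeps a tree readable, the sketch below fits a tree with at most four leaves and prints it as text with `export_text` (available in scikit-learn 0.21 or later). It uses the bundled iris dataset as a stand-in, which is an assumption for the sake of a small, self-contained example. The names `iris`, `small_tree` and `rules` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small bundled dataset used as a stand-in (assumption: not the loan data).
iris = load_iris()

# Cap the number of leaves so the whole tree fits on screen.
small_tree = DecisionTreeClassifier(criterion="entropy",
                                    max_leaf_nodes=4,
                                    random_state=0)
small_tree.fit(iris.data, iris.target)

# A plain-text rendering of the splits; no graphviz needed.
rules = export_text(small_tree, feature_names=list(iris.feature_names))

print("number of leaves:", small_tree.get_n_leaves())
print(rules)
```

A text rendering like this can be a lightweight alternative when graphviz is not installed, though it loses the colour coding of the graph.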

###### Second Tree:
> Respecifying the parameters in the hope of getting a more diverse representation of the 'loan_status' classes and/or a higher accuracy score. I increased max_depth, removed min_samples_split to give the tree more flexibility, and allowed more leaf nodes; min_samples_leaf went from 70 to 100.

In [80]:
# Model:
start_time = datetime.now()

tree_2 = tree.DecisionTreeClassifier(criterion = 'entropy', 
                                     max_depth = 18, 
                                     max_features = None,
                                     min_samples_leaf = 100,
                                     max_leaf_nodes = 70)
tree_2.fit(x, y)

# Graph:
dot_data = tree.export_graphviz(tree_2, out_file = None,
                                feature_names = x.columns,
                                class_names = class_list,
                                filled=True)

graph = pydotplus.graph_from_dot_data(dot_data)

display(Image(graph.create_png()))  # display() because this is not the cell's last expression
end_time = datetime.now()

In [81]:
print("The mean accuracy score is: {}. \n The runtime is:{}".format(tree_2.score(x, y), (end_time-start_time)))

The mean accuracy score is: 0.5817764721807297. 
 The runtime is:0:00:08.912194


Here, you can still see the class imbalance between 'Fully Paid' and 'Current' loans (especially given that these are the only two of the seven categories we see in the graph). The accuracy score has increased by a minuscule amount (0.5815 to 0.5818) and the runtime is essentially unchanged (about 8.9 seconds versus 9.1 seconds). To increase the accuracy, we would have to allow more depth and more leaves, which would make it difficult to create a readable visual.
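
One way to check the imbalance directly is to look at the class shares and at which classes the tree ever predicts. The sketch below does this on hypothetical data: the labels, the class proportions, and the noise features in `X_demo` are all made up for illustration and are not taken from the loan file.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: two common classes (1, 3) and one rare class (0).
y_demo = pd.Series(rng.choice([1, 3, 0], size=3000, p=[0.55, 0.40, 0.05]),
                   name="loan_status")
# Hypothetical features that carry no real signal about the label.
X_demo = rng.normal(size=(3000, 5))

# Share of each class, and the accuracy of always guessing the largest one.
class_shares = y_demo.value_counts(normalize=True)
baseline = class_shares.max()
print(class_shares)

# A constrained tree, similar in spirit to the trees above.
shallow = DecisionTreeClassifier(max_leaf_nodes=8, min_samples_leaf=100,
                                 random_state=0).fit(X_demo, y_demo)

# Which classes does the tree ever predict on the training data?
predicted_classes = np.unique(shallow.predict(X_demo))
print("classes the tree predicts:", predicted_classes)
print("majority-class baseline:", round(baseline, 3))
print("tree training accuracy:", round(shallow.score(X_demo, y_demo), 3))
```

Comparing the model's score with the majority-class baseline gives a sense of how much the tree has learned beyond the imbalance itself.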

### Simple Forests:
> Using the default constraints here, but limiting the depth for runtime purposes. 

In [83]:
from sklearn import ensemble

start_time = datetime.now()

default_forest = ensemble.RandomForestClassifier(max_depth = 10)

default_forest.fit(x, y)

end_time = datetime.now()

In [84]:
print("The mean accuracy score is: {}. \n The runtime is:{}".format(default_forest.score(x, y), (end_time-start_time)))

The mean accuracy score is: 0.571859422977822. 
 The runtime is:0:00:10.150381


###### Forest 1
Aiming towards simplicity here...

In [91]:
start_time = datetime.now()

forest_1 = ensemble.RandomForestClassifier(n_estimators = 8, 
                                           criterion = 'entropy',
                                           max_features = 4,
                                           max_depth = 6,
                                           min_samples_split = 5000,
                                           min_samples_leaf = 1000,
                                           max_leaf_nodes = 40)

forest_1.fit(x, y)

end_time = datetime.now()

In [92]:
print("The mean accuracy score is: {}. \n The runtime is:{}".format(forest_1.score(x, y), (end_time-start_time)))

The mean accuracy score is: 0.5549011738499572. 
 The runtime is:0:00:04.549697


Compared with the default forest, accuracy has dropped by only about two percentage points (0.572 to 0.555), yet the runtime has fallen by more than half (10.2 seconds to 4.5 seconds), which is an interesting trade-off. Compared with the best decision tree above (the default tree at 0.595), we give up about four percentage points of accuracy in exchange for a score that should be more reliable, since it averages over eight trees, at under a fifth of the runtime (4.5 seconds versus 24.5 seconds, though the tree timings also include rendering the graph). We do not, however, get the visual, which could be useful in certain phases of an analysis.
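
Since every score in this notebook is computed on the data the model was fit on, one way to probe how reliable a score is would be to hold out part of the data. The following is a minimal sketch on synthetic data, not the loan data. The split size, model settings, and names such as `X_demo` and `scores` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data with noisy labels (assumption: not the loan data).
X_demo, y_demo = make_classification(n_samples=3000, n_features=20,
                                     n_informative=6, n_classes=3,
                                     flip_y=0.2, random_state=0)

# Hold out a quarter of the rows; stratify keeps class shares equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "forest (8 trees)": RandomForestClassifier(n_estimators=8, max_depth=6,
                                               random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for each model.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, "| train:", round(scores[name][0], 3),
          "| test:", round(scores[name][1], 3))
```

The gap between the training and held-out accuracy is a rough indicator of how much a model's training score can be trusted.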

###### Forest 2

In [87]:
start_time = datetime.now()

forest_2 = ensemble.RandomForestClassifier(n_estimators = 4, 
                                           criterion = 'entropy',
                                           max_features = 3,
                                           max_depth = 3,
                                           min_samples_split = 2000,
                                           min_samples_leaf = 500,
                                           max_leaf_nodes = 20)

forest_2.fit(x, y)

end_time = datetime.now()

In [88]:
print("The mean accuracy score is: {}. \n The runtime is:{}".format(forest_2.score(x, y), (end_time-start_time)))

The mean accuracy score is: 0.5548821756714075. 
 The runtime is:0:00:01.375754


Here, accuracy has barely changed (0.55490 to 0.55488) while the runtime has dropped to under a third of Forest 1's (1.4 seconds versus 4.5 seconds). The model has reached a plateau: changing the parameters no longer moves the accuracy, and the most significant change is in runtime. With half the estimators and half the leaf nodes, you get essentially the same accuracy in under a third of the time. A single decision tree gives a less reliable accuracy score than even the four trees averaged in this last forest, so this small forest is quite powerful for its cost.
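
The plateau can be explored more systematically by timing a range of forest sizes in a loop. This is a sketch under stated assumptions: it uses synthetic data from `make_classification`, an arbitrary set of `n_estimators` values, and `time.perf_counter` in place of the `datetime` timing used above. Timings will vary by machine.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the loan data).
X_demo, y_demo = make_classification(n_samples=5000, n_features=20,
                                     n_informative=6, random_state=0)

results = []
for n_trees in (2, 4, 8, 16):
    forest = RandomForestClassifier(n_estimators=n_trees, max_depth=3,
                                    max_features=3, random_state=0)
    # Time only the fit, not the scoring.
    start = perf_counter()
    forest.fit(X_demo, y_demo)
    elapsed = perf_counter() - start

    accuracy = forest.score(X_demo, y_demo)  # training accuracy, as in the cells above
    results.append((n_trees, accuracy, elapsed))
    print(n_trees, "trees | accuracy:", round(accuracy, 3),
          "| fit time (s):", round(elapsed, 3))
```

Reading accuracy and fit time side by side in `results` shows where adding trees stops paying for itself.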