# Identifying safe loans with decision trees

[LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers with potential lenders/investors. In this project, we will build a classification model to predict whether or not a loan provided by LendingClub is likely to [default](https://en.wikipedia.org/wiki/Default_%28finance%29).

We will use data from LendingClub to predict whether a loan will be paid off in full or will be [charged off](https://en.wikipedia.org/wiki/Charge-off) and possibly go into default. Specifically, we will:

* Use SFrames to do some feature engineering.
* Train a decision tree on the LendingClub dataset.
* Predict whether a loan will default, along with prediction probabilities (on a validation set).
* Train a complex tree model and compare it to a simple tree model.


In [1]:
import turicreate

In [2]:
loans = turicreate.SFrame('lending-club-data.sframe/')

In [3]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')

In [4]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

In [5]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print("Number of safe loans  : %s" % len(safe_loans_raw))
print("Number of risky loans : %s" % len(risky_loans_raw))

Number of safe loans  : 99457
Number of risky loans : 23150


In [7]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)
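
The arithmetic behind the undersampling step can be checked without the dataset. The sketch below uses only the two class counts printed earlier and plain Python; the per-row random draw is an assumed stand-in for how `sample` picks rows, not a description of the library's internals.

```python
import random

# Class counts reported in the output of the previous cell.
n_safe, n_risky = 99457, 23150

# Fraction of safe loans to keep so both classes end up roughly equal in size.
percentage = n_risky / float(n_safe)

# Assumed stand-in for sampling: keep each safe loan independently
# with probability `percentage`.
rng = random.Random(1)
kept_safe = sum(1 for _ in range(n_safe) if rng.random() < percentage)

print("Sampling fraction :", round(percentage, 4))
print("Safe loans kept   :", kept_safe)
print("Share of safe     :", round(kept_safe / float(kept_safe + n_risky), 3))
```

Under this assumption the number of safe loans kept is only approximately equal to the number of risky loans, so the two classes come out close to, but not exactly, 50/50.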

In [8]:
print("Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data)))
print("Percentage of risky loans                :", len(risky_loans) / float(len(loans_data)))
print("Total number of loans in our new dataset :", len(loans_data))

Percentage of safe loans                 : 0.5022361744216048
Percentage of risky loans                : 0.4977638255783951
Total number of loans in our new dataset : 46508


In [9]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

# Our classifier

In [10]:
decision_tree_model = turicreate.decision_tree_classifier.create(train_data,
                                                                 validation_set=None,
                                                                 target = target,
                                                                 features = features)

In [11]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data

grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none
B,B3,0,11,OWN,11.18,credit_card,36 months,1
D,D1,0,10,RENT,16.85,debt_consolidation,36 months,1
D,D2,0,3,RENT,13.97,other,60 months,0
A,A5,0,11,MORTGAGE,16.33,debt_consolidation,36 months,1

last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
1,82.4,0.0,1
1,96.4,0.0,1
1,59.5,0.0,-1
1,62.1,0.0,-1


In [12]:
decision_tree_model.predict(sample_validation_data)

dtype: int
Rows: 4
[1, -1, -1, 1]
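
To see how these predictions compare with the true labels, we can score the four sample rows by hand. This is a minimal sketch in plain Python; the two lists are copied from the `safe_loans` column of the sample table and from the prediction output above.

```python
# Values copied from the sample table and the prediction output above.
sample_labels = [1, 1, -1, -1]        # safe_loans column of the four sample rows
sample_predictions = [1, -1, -1, 1]   # output of decision_tree_model.predict

# A prediction is correct when it equals the true label.
num_correct = sum(1 for y, y_hat in zip(sample_labels, sample_predictions) if y == y_hat)
sample_accuracy = num_correct / float(len(sample_labels))

print("Correct predictions :", num_correct)      # 2
print("Sample accuracy     :", sample_accuracy)  # 0.5
```

Four rows are far too few to judge the model; the full validation set is used for that later.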

# Smaller model

In [13]:
small_model = turicreate.decision_tree_classifier.create(train_data,
                                                         validation_set=None,
                                                         target = target,
                                                         features = features,
                                                         max_depth = 2)

In [14]:
small_model.predict(sample_validation_data, output_type = 'probability')

dtype: float
Rows: 4
[0.5803016424179077, 0.4085058867931366, 0.4085058867931366, 0.7454202175140381]
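
These probabilities can be turned into class labels by thresholding. The sketch below assumes that each value is the probability of the safe class (+1) and that 0.5 is the cut-off; `to_class` is a hypothetical helper, not part of turicreate.

```python
# Probabilities copied from the small_model output above.
probabilities = [0.5803016424179077, 0.4085058867931366,
                 0.4085058867931366, 0.7454202175140381]


def to_class(p, threshold=0.5):
    """Map a probability of the safe class (+1) to a label.

    Both the meaning of `p` and the 0.5 threshold are assumptions.
    """
    return 1 if p >= threshold else -1


predicted_labels = [to_class(p) for p in probabilities]
print(predicted_labels)  # [1, -1, -1, 1]
```

The second and third loans receive exactly the same probability, which is what we would expect if both fall into the same leaf of the depth-2 tree.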

In [15]:
print(small_model.evaluate(train_data)['accuracy'])
print(decision_tree_model.evaluate(train_data)['accuracy'])

0.6135020416935311
0.6405813453685794


# Bigger model

In [16]:
big_model = turicreate.decision_tree_classifier.create(train_data, validation_set=None,
                                                       target = target, features = features, max_depth = 10)

In [18]:
print(big_model.evaluate(train_data)['accuracy'])
print(big_model.evaluate(validation_data)['accuracy'])

0.665538362346873
0.6274235243429557


In [19]:
print(decision_tree_model.evaluate(validation_data)['accuracy'])

0.6367944851357173
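
One way to read these numbers is to look at the gap between training and validation accuracy for each model. The sketch below does this with the accuracies printed above, rounded to four decimals; the dictionary layout is just for illustration.

```python
# Accuracies copied from the outputs above (rounded to four decimals).
results = {
    "decision_tree_model": {"train": 0.6406, "validation": 0.6368},
    "big_model (max_depth=10)": {"train": 0.6655, "validation": 0.6274},
}

gaps = {}
for name, acc in results.items():
    # A larger gap suggests the model fits the training data
    # better than it generalizes to unseen loans.
    gaps[name] = acc["train"] - acc["validation"]
    print("%-26s train-validation gap: %.4f" % (name, gaps[name]))
```

The deeper tree has the higher training accuracy but the lower validation accuracy, a pattern consistent with overfitting.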


# Getting and evaluating predictions 

In [25]:
predictions = decision_tree_model.predict(validation_data)

In [26]:
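# Count the two kinds of mistakes on the validation set.
false_positives, false_negatives = 0, 0
for i in range(len(predictions)):
    # False positive: predicted safe (+1), but the loan was actually risky (-1)
    if (predictions[i] == 1) and (validation_data['safe_loans'][i] == -1):
        false_positives += 1
    # False negative: predicted risky (-1), but the loan was actually safe (+1)
    if (predictions[i] == -1) and (validation_data['safe_loans'][i] == 1):
        false_negatives += 1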

In [27]:
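# Total cost of mistakes: each false negative is weighted at 10,000
# and each false positive at 20,000
10000*false_negatives + 20000*false_positives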

50280000
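
The counting and cost calculation above can be wrapped in a small function and checked on toy data. This is a minimal sketch in plain Python; `mistake_cost` and the toy lists are hypothetical, and the default weights simply mirror the 10,000 and 20,000 used in the cell above.

```python
def mistake_cost(labels, predictions, cost_fn=10000, cost_fp=20000):
    """Return (false_positives, false_negatives, total_cost) for +1/-1 labels."""
    pairs = list(zip(labels, predictions))
    # False positive: predicted safe (+1) but actually risky (-1).
    false_positives = sum(1 for y, p in pairs if p == 1 and y == -1)
    # False negative: predicted risky (-1) but actually safe (+1).
    false_negatives = sum(1 for y, p in pairs if p == -1 and y == 1)
    total = cost_fn * false_negatives + cost_fp * false_positives
    return false_positives, false_negatives, total


# Toy data, made up for illustration only.
toy_labels      = [1,  1, -1, -1,  1]
toy_predictions = [1, -1,  1, -1, -1]

fp, fn, cost = mistake_cost(toy_labels, toy_predictions)
print(fp, fn, cost)  # 1 2 40000
```

Because a false positive is weighted twice as heavily as a false negative, two models with the same accuracy can have quite different costs.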