# **Machine Learning with Decision Trees and Random Forests**
In this project we explore publicly available data from LendingClub.com. Lending Club is a service that connects borrowers with people who have money (investors). We will try to build a model that predicts whether a borrower is likely to pay the loan back in full.

In [None]:
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# **Data**
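We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. The outcome is recorded in the **not.fully.paid** column (1 if the loan was not paid back in full, and 0 otherwise), which is used in the plots further down.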

The data set contains the following features:

*   **credit.policy**: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
*   **purpose**: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
*   **int.rate**: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
*   **installment**: The monthly installments owed by the borrower if the loan is funded.
*   **log.annual.inc**: The natural log of the self-reported annual income of the borrower.
*   **dti**: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
*   **fico**: The FICO credit score of the borrower.
*   **days.with.cr.line**: The number of days the borrower has had a credit line.
*   **revol.bal**: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
*   **revol.util**: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
*   **inq.last.6mths**: The borrower's number of inquiries by creditors in the last 6 months.
*   **delinq.2yrs**: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
*   **pub.rec**: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
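
To make the encodings above concrete, here is a minimal sketch that converts two of the stored values back into more familiar units. The row values are invented for illustration and are not taken from the dataset, and `decode_row` is a hypothetical helper, not part of any library.

```python
import math

# Hypothetical example row; these values are invented for illustration only.
row = {"int.rate": 0.11, "log.annual.inc": 11.0}

def decode_row(row):
    """Convert encoded feature values into more readable units."""
    return {
        # int.rate is stored as a proportion, so 0.11 means 11%
        "interest_rate_pct": row["int.rate"] * 100,
        # log.annual.inc is a natural log, so exp() recovers the income
        "annual_income": math.exp(row["log.annual.inc"]),
    }

decoded = decode_row(row)
print(decoded)
```

The same conversions could be applied column-wise to the full data once it is loaded, for example when labelling plot axes.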

In [None]:
# Read the loan data into a DataFrame
loans = pd.read_csv('data/loan_data.csv')

In [None]:
loans.info()

In [None]:
loans.describe()

In [None]:
loans.head()

# **Exploratory Data Analysis**
Using quick visualisations, we can explore the relationship between different variables in the dataset.

Let's start with two overlaid histograms of the borrowers' FICO scores, split by credit policy (i.e. whether or not a borrower met the credit underwriting criteria).

In [None]:
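# Overlay the FICO distributions of borrowers who met (1) and did not meet (0) the credit policy
loans[loans['credit.policy'] == 1]['fico'].hist(bins=30, alpha=0.6, label='Credit.Policy=1')
loans[loans['credit.policy'] == 0]['fico'].hist(bins=30, alpha=0.6, label='Credit.Policy=0')
plt.legend()
plt.xlabel('FICO')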

In [None]:
# Same comparison, this time split by the not.fully.paid column
loans[loans['not.fully.paid'] == 1]['fico'].hist(bins=30, alpha=0.6, label='not.fully.paid=1', color='red')
loans[loans['not.fully.paid'] == 0]['fico'].hist(bins=30, alpha=0.6, label='not.fully.paid=0')
plt.legend()
plt.xlabel('FICO')

Let's visualize the counts of loan purposes, split by whether or not the borrower fully paid the loan back.

In [None]:
plt.figure(figsize=(9,5))
# Pass column names with data= so the call also works on recent seaborn versions
sns.countplot(x='purpose', hue='not.fully.paid', data=loans)
plt.tight_layout()
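
Raw counts depend heavily on how common each purpose is, so it can also be useful to look at the share of loans not fully paid within each purpose. A minimal sketch, using an invented `toy` DataFrame rather than the real `loans` data:

```python
import pandas as pd

# Toy stand-in for the `loans` DataFrame; values are invented for illustration.
toy = pd.DataFrame({
    "purpose": ["credit_card", "credit_card", "small_business",
                "small_business", "educational", "credit_card"],
    "not.fully.paid": [0, 1, 1, 1, 0, 0],
})

# Row-normalized cross-tabulation: each row sums to 1, so the column
# labelled 1 is the share of loans in that purpose that were not fully paid
shares = pd.crosstab(toy["purpose"], toy["not.fully.paid"], normalize="index")
print(shares)
```

With the real data, `toy` would be replaced by `loans`.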

Let's see the trend between FICO score and interest rate.

In [None]:
sns.jointplot(x='fico',y='int.rate',data=loans)
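
The strength of the linear trend in a plot like this can be summarized with a Pearson correlation coefficient. The sketch below runs on an invented `toy` DataFrame that was deliberately constructed so that higher FICO scores go with lower rates; it says nothing about the value for the real data.

```python
import pandas as pd

# Toy stand-in for the `loans` DataFrame; values are invented for illustration
# and chosen so that int.rate falls as fico rises.
toy = pd.DataFrame({
    "fico": [650, 680, 700, 720, 760],
    "int.rate": [0.16, 0.14, 0.13, 0.11, 0.08],
})

# Series.corr computes the Pearson correlation by default
r = toy["fico"].corr(toy["int.rate"])
print(round(r, 3))
```

On the real data the same call would be `loans['fico'].corr(loans['int.rate'])`.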