# Random Forest 

For this notebook we will explore publicly available data from [LendingClub.com](www.lendingclub.com). Lending Club connects people who need money (borrowers) with people who have money (investors). As an investor, you would want to invest in borrowers whose profile suggests a high probability of paying you back. We will try to build a model that helps predict this.


We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full (the not.fully.paid column). Use the loan_data.csv file already provided.

Here is what the columns represent:
* credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
* purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
* int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
* installment: The monthly installments owed by the borrower if the loan is funded.
* log.annual.inc: The natural log of the self-reported annual income of the borrower.
* dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
* fico: The FICO credit score of the borrower.
* days.with.cr.line: The number of days the borrower has had a credit line.
* revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
* revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
* inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
* delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
* pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

## Import Libraries

**Import the usual libraries.**
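
The exercise leaves "the usual libraries" open. One common set for a notebook like this, assumed here rather than prescribed by the text, is:

```python
# Assumed library set for the rest of the notebook; the aliases are conventions.
import numpy as np                 # numerical arrays
import pandas as pd                # dataframes
import matplotlib.pyplot as plt    # base plotting
import seaborn as sns              # statistical plots built on matplotlib
```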

## Get the Data

**Use pandas to read loan_data.csv as a dataframe called loans.**

**Check out the info(), head(), and describe() methods on loans.**
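
A minimal sketch of loading and inspecting the data. So that it runs without the real file, it reads a three-row placeholder CSV whose values are made up for illustration; in the notebook you would call `pd.read_csv("loan_data.csv")` instead.

```python
import io
import pandas as pd

# Placeholder CSV text with made-up values (NOT real LendingClub records).
# In the notebook, replace this with: loans = pd.read_csv("loan_data.csv")
sample_csv = io.StringIO(
    "credit.policy,purpose,int.rate,fico,not.fully.paid\n"
    "1,debt_consolidation,0.1189,737,0\n"
    "1,credit_card,0.1071,707,0\n"
    "0,small_business,0.1496,667,1\n"
)
loans = pd.read_csv(sample_csv)

loans.info()             # column dtypes and non-null counts
print(loans.head())      # first rows of the dataframe
print(loans.describe())  # summary statistics for the numeric columns
```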

## Exploratory Data Analysis

**Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.**

**Create a similar figure, except this time select by the not.fully.paid column.**
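
Both figures follow the same pattern, so one possible approach is a small helper that overlays a semi-transparent histogram per group. The helper name `plot_fico_by` is hypothetical, and the dataframe below is a random stand-in for `loans`, not the real data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in for `loans` (illustrative values only).
rng = np.random.default_rng(0)
loans = pd.DataFrame({
    "fico": rng.integers(612, 828, size=300),
    "credit.policy": rng.integers(0, 2, size=300),
    "not.fully.paid": rng.integers(0, 2, size=300),
})

def plot_fico_by(df, group_col, bins=30):
    """Overlay one semi-transparent FICO histogram per value of group_col."""
    fig, ax = plt.subplots(figsize=(10, 6))
    for value, subset in df.groupby(group_col):
        ax.hist(subset["fico"], bins=bins, alpha=0.5,
                label=f"{group_col}={value}")
    ax.set_xlabel("FICO")
    ax.legend()
    return ax

ax_policy = plot_fico_by(loans, "credit.policy")
ax_paid = plot_fico_by(loans, "not.fully.paid")
```

Setting `alpha` below 1 keeps both distributions visible where they overlap.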

**Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.**
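
A sketch of the countplot, again on a random stand-in for `loans` that reuses the purpose values listed in the column descriptions:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Random stand-in for `loans` (illustrative values only).
rng = np.random.default_rng(1)
purposes = ["credit_card", "debt_consolidation", "educational",
            "major_purchase", "small_business", "all_other"]
loans = pd.DataFrame({
    "purpose": rng.choice(purposes, size=300),
    "not.fully.paid": rng.integers(0, 2, size=300),
})

fig, ax = plt.subplots(figsize=(11, 6))
# One bar per (purpose, not.fully.paid) combination.
sns.countplot(data=loans, x="purpose", hue="not.fully.paid", ax=ax)
ax.tick_params(axis="x", labelrotation=30)  # keep long labels readable
```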

**Let's see the trend between FICO score and interest rate with a jointplot.**

**Create lmplots to see whether the trend differs between not.fully.paid and credit.policy. Check the documentation for `lmplot()`.**
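
A possible layout, sketched on synthetic data: split the figure into one panel per not.fully.paid value with `col`, and color the regression lines by credit.policy with `hue`. Which variable goes on `col` and which on `hue` is a choice made here, not something the exercise fixes.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for `loans` (illustrative values only).
rng = np.random.default_rng(3)
fico = rng.integers(612, 828, size=300)
loans = pd.DataFrame({
    "fico": fico,
    "int.rate": 0.30 - 0.00025 * fico + rng.normal(0, 0.01, size=300),
    "credit.policy": rng.integers(0, 2, size=300),
    "not.fully.paid": rng.integers(0, 2, size=300),
})

# One panel per not.fully.paid value, one fitted line per credit.policy value.
g = sns.lmplot(data=loans, x="fico", y="int.rate",
               hue="credit.policy", col="not.fully.paid", height=5)
```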

## Setting up the Data
**Check loans.info() again.**

## Categorical Features

Notice that the **purpose** column is categorical.

We need to transform the **purpose** column into dummy variables (using `pd.get_dummies`) so that sklearn will be able to understand it.


**Create a list of 1 element containing the string 'purpose'. Call this list cat_feats.**

**Now use `pd.get_dummies()` to create a larger dataframe that has new feature columns with dummy variables. Set this dataframe as final_data. Check the documentation to see how `pd.get_dummies()` works.**
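
To see what the encoding does, here is a small sketch on a four-row made-up dataframe. Passing `drop_first=True` is an optional choice assumed here: it drops one dummy column per feature, since that column is implied by the others.

```python
import pandas as pd

# Tiny made-up frame with the same column names as `loans`.
loans = pd.DataFrame({
    "purpose": ["credit_card", "debt_consolidation",
                "educational", "credit_card"],
    "int.rate": [0.11, 0.13, 0.12, 0.10],
})

cat_feats = ["purpose"]

# Replace each column in cat_feats with one indicator column per category.
final_data = pd.get_dummies(loans, columns=cat_feats, drop_first=True)
print(final_data.columns.tolist())
```

The original `purpose` column disappears and is replaced by `purpose_<value>` columns, so every remaining feature is numeric or boolean.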

## Train Test Split

Split the data into a training set and a testing set.

**Use sklearn to split your data into a training set and a testing set.**
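
A minimal sketch of the split, assuming not.fully.paid is the target and every other column is a feature. The 30% test size and the `random_state` value are arbitrary choices, and the dataframe is a random stand-in for `final_data`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for `final_data` (illustrative values only).
rng = np.random.default_rng(0)
final_data = pd.DataFrame({
    "fico": rng.integers(612, 828, size=100),
    "int.rate": rng.uniform(0.06, 0.22, size=100),
    "not.fully.paid": rng.integers(0, 2, size=100),
})

X = final_data.drop("not.fully.paid", axis=1)  # features
y = final_data["not.fully.paid"]               # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
```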

## Training a Decision Tree Model

Let's start by training a single decision tree first.

**Import DecisionTreeClassifier.**

**Create an instance of DecisionTreeClassifier() called dtree and fit it to the training data.**

## Predictions and Evaluation of Decision Tree
**Create predictions from the test set and create a classification report, an accuracy score, and a confusion matrix.**
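
The fit, predict, and evaluate steps for the single tree might look like the following. To stay self-contained, the sketch uses `make_classification` to generate a synthetic two-class problem in place of the loan data; the class weights and sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score)

# Synthetic stand-in for the loan features and the not.fully.paid target.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

dtree = DecisionTreeClassifier(random_state=101)
dtree.fit(X_train, y_train)
tree_pred = dtree.predict(X_test)

print(classification_report(y_test, tree_pred))
tree_acc = accuracy_score(y_test, tree_pred)
# Rows are true classes, columns are predicted classes.
tree_cm = confusion_matrix(y_test, tree_pred, labels=[0, 1])
print(tree_acc)
print(tree_cm)
```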

## Training the Random Forest model

**Create an instance of the RandomForestClassifier class and fit it to our training data.**

## Predictions and Evaluation

Let's predict the y_test values and evaluate our model.

**Predict the class of not.fully.paid for the X_test data.**

**Compute confusion matrix, accuracy score and classification report.**
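
The random forest follows the same fit, predict, and evaluate pattern as the single tree. The sketch below again runs on synthetic data from `make_classification`; `n_estimators=100` and the variable names `rfc` and `rfc_pred` are assumptions, not values the exercise specifies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score)

# Synthetic stand-in for the loan features and the not.fully.paid target.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# An ensemble of 100 trees, each fit on a bootstrap sample of the training set.
rfc = RandomForestClassifier(n_estimators=100, random_state=101)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)

rfc_cm = confusion_matrix(y_test, rfc_pred, labels=[0, 1])
rfc_acc = accuracy_score(y_test, rfc_pred)
print(rfc_cm)
print(rfc_acc)
print(classification_report(y_test, rfc_pred))
```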

## Grid Search

#### Create a parameter grid. Focus especially on these hyperparameters: number of estimators, maximum depth, and maximum features. You can also investigate minimum samples per leaf, minimum samples per split, and bootstrap. Research which values or value ranges are sensible to try for each parameter.

#### Create a base classifier.

#### Perform a grid search.

#### Fit the estimator to the training data (this will take a while).

#### Check the best estimator and score.

#### Predict on the test data.

**Compute confusion matrix, accuracy score and classification report.**
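
The grid search steps above can be sketched end to end as follows. The grid is deliberately tiny so it runs in seconds on synthetic data; the candidate values, `cv=3`, and accuracy scoring are placeholders to replace with your own choices, and a realistic grid on the loan data will take much longer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             accuracy_score)

# Synthetic stand-in for the loan features and the not.fully.paid target.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Placeholder candidate values; every combination is tried (2 * 2 * 2 = 8).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, None],
    "max_features": ["sqrt", "log2"],
}

base_clf = RandomForestClassifier(random_state=101)
grid = GridSearchCV(base_clf, param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)  # 8 combinations x 3 folds, then a final refit

print(grid.best_estimator_)
print(grid.best_score_)     # mean cross-validated accuracy of the best combo

# With the default refit=True, predict() uses the refitted best estimator.
grid_pred = grid.predict(X_test)
grid_cm = confusion_matrix(y_test, grid_pred, labels=[0, 1])
grid_acc = accuracy_score(y_test, grid_pred)
print(grid_cm)
print(grid_acc)
print(classification_report(y_test, grid_pred))
```

Note that `best_score_` is measured by cross-validation on the training set, so the test-set metrics printed at the end are the fairer estimate of how the tuned model generalizes.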