# 1. Introduction

* In this course, we will walk through the full data science life cycle, from data cleaning and feature selection to machine learning. We will focus on `credit modelling`, a well-known data science problem that involves **modeling a borrower's [credit risk.](https://en.wikipedia.org/wiki/Credit_risk)**

* We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/). Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. You can read more about their marketplace [here]().

# 2. Introduction to the data

* Lending Club releases data for all of the approved and declined loan applications periodically on [their website](https://www.lendingclub.com/auth/login?login_url=%2Fstatistics%2Fadditional-statistics%3F). You can select a few different year ranges to download the datasets (in CSV format) for both approved and declined loans.

* Towards the bottom of the page, you'll also find a data dictionary (in XLS format) that describes the different column names. We recommend downloading the data dictionary so you can refer to it whenever you want to learn more about what a column represents in the datasets. Here's a link to the data dictionary file hosted on [Google Drive.](https://docs.google.com/spreadsheets/d/191B2yJ4H1ZPXq0_ByhUgWMFZOYem5jFz0Y3by_7YBY4/edit#gid=2081333097)

* Before diving into the datasets themselves, let's get familiar with the data dictionary. The LoanStats sheet describes the approved loans datasets and the RejectStats sheet describes the rejected loans datasets. Since rejected applications don't appear on the Lending Club marketplace and aren't available for investment, **we'll be focusing on data on approved loans only.**

* The approved loans datasets contain information on current loans, completed loans, and defaulted loans. Let's now define the problem statement for this machine learning project:

`Can we build a machine learning model that can accurately predict if a borrower will pay off their loan on time or not?`

Before we can start doing machine learning, `we need to define what features we want to use and which column represents the target column we want to predict.`

# 3. Reading in to Pandas

* In this mission, we'll focus on approved loans data from 2007 to 2011, since a good number of the loans have already finished. In the datasets for later years, many of the loans are current and still being paid off.

In [1]:
import pandas as pd 

## TODO:
* Read loans_2007.csv into a DataFrame named loans_2007.

* Use the print function to:
  * display the first row of loans_2007 and
  *  the number of columns in loans_2007.

In [2]:
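# dtype='unicode' reads every column as text, so numeric-looking values stay strings for now.
loans_2007 = pd.read_csv('loans_2007.csv', dtype='unicode')
print(loans_2007.iloc[0])
print(loans_2007.shape[1])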

id                                    1077501
member_id                           1296599.0
loan_amnt                              5000.0
funded_amnt                            5000.0
funded_amnt_inv                        4975.0
term                                36 months
int_rate                               10.65%
installment                            162.87
grade                                       B
sub_grade                                  B2
emp_title                                 NaN
emp_length                          10+ years
home_ownership                           RENT
annual_inc                            24000.0
verification_status                  Verified
issue_d                              Dec-2011
loan_status                        Fully Paid
pymnt_plan                                  n
purpose                           credit_card
title                                Computer
zip_code                                860xx
addr_state                        

52

# 4. First group of columns

* `The Dataframe contains many columns and can be cumbersome to try to explore all at once.` Let's `break up the columns into 3 groups of 18 columns and use the data dictionary` to become familiar with what each column represents. As you understand each feature, you want to pay attention to any features that:

  * leak information from the future (after the loan has already been funded)
  * don't affect a borrower's ability to pay back a loan (e.g. a randomly generated ID value by Lending Club)
  * are formatted poorly and need to be cleaned up
  * require more data or a lot of processing to turn into a useful feature
  * contain redundant information

* We need to pay especially close attention to data leakage, since it can cause our model to overfit. This is because the model would be using data about the target column that wouldn't be available when we're using the model on future loans. We encourage you to `spend as much time as you need to understand each column, because a poor understanding could cause you to make mistakes in the data analysis and modeling process.` As you go through the dictionary, keep in mind that `we need to select one of the columns as the target column` we want to use when we move on to the machine learning phase.
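
To make the leakage risk concrete, here is a small sketch built from six of the sample rows shown in this notebook (the `loan_amnt`, `total_rec_prncp` and `loan_status` values are copied from those rows). The comparison rule and the variable names are only an illustration, not part of the course's solution: it shows that a column recorded after funding can "predict" the outcome perfectly, which is exactly why such a column is useless for scoring new loans.

```python
import pandas as pd

# Six finished loans taken from the sample rows in this notebook.
sample = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0, 10000.0, 5600.0, 5375.0],
    "total_rec_prncp": [5000.0, 456.46, 2400.0, 10000.0, 162.02, 673.48],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Charged Off", "Charged Off"],
})

# Illustrative rule: guess "Fully Paid" when all of the principal has been received.
leaky_guess = sample["total_rec_prncp"] >= sample["loan_amnt"]
actual = sample["loan_status"] == "Fully Paid"

# Share of rows where the leaky rule matches the real outcome.
leaky_accuracy = (leaky_guess == actual).mean()
print(leaky_accuracy)
```

The rule looks impressive on past loans, but `total_rec_prncp` is unknown at the moment an investor decides whether to fund a loan, so a model relying on it would not work in practice.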

In [3]:
# Positional slices of 18 columns each; the last slice is cut off at the 52nd column.
column_groups = {'group_1': loans_2007.iloc[:, :18], 'group_2': loans_2007.iloc[:, 18:36], 'group_3': loans_2007.iloc[:, 36:54]}

In [4]:
print(column_groups['group_1'][:10].shape)
column_groups['group_1'][:10]

(10, 18)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n
5,1075269,1311441.0,5000.0,5000.0,5000.0,36 months,7.90%,156.46,A,A4,Veolia Transportaton,3 years,RENT,36000.0,Source Verified,Dec-2011,Fully Paid,n
6,1069639,1304742.0,7000.0,7000.0,7000.0,60 months,15.96%,170.08,C,C5,Southern Star Photography,8 years,RENT,47004.0,Not Verified,Dec-2011,Fully Paid,n
7,1072053,1288686.0,3000.0,3000.0,3000.0,36 months,18.64%,109.43,E,E1,MKC Accounting,9 years,RENT,48000.0,Source Verified,Dec-2011,Fully Paid,n
8,1071795,1306957.0,5600.0,5600.0,5600.0,60 months,21.28%,152.39,F,F2,,4 years,OWN,40000.0,Source Verified,Dec-2011,Charged Off,n
9,1071570,1306721.0,5375.0,5375.0,5350.0,60 months,12.69%,121.45,B,B5,Starbucks,< 1 year,RENT,15000.0,Verified,Dec-2011,Charged Off,n


After analyzing each column, we can conclude that the following features need to be removed:

* `id:` randomly generated field by Lending Club for unique identification purposes only
* `member_id:` also a randomly generated field by Lending Club for unique identification purposes only
* `funded_amnt:` leaks data from the future (after the loan has already started to be funded)
* `funded_amnt_inv:` also leaks data from the future (after the loan has already started to be funded)
* `grade:` contains information that is redundant with the interest rate column (int_rate)
* `sub_grade:` also contains information that is redundant with the interest rate column (int_rate)
* `emp_title:` requires other data and a lot of processing to potentially be useful
* `issue_d:` leaks data from the future (after the loan has already been completely funded)

* Recall that Lending Club assigns a grade and a sub-grade based on the borrower's interest rate. While the grade and sub_grade values are categorical, the int_rate column contains continuous values, which are better suited for machine learning.

* Let's now **drop these columns from the Dataframe** before moving onto the next group of columns.

# 5. First group of columns

## TODO:
* Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:
  * id
  * member_id
  * funded_amnt
  * funded_amnt_inv
  * grade
  * sub_grade
  * emp_title
  * issue_d

In [5]:
loans_2007.drop(['id','member_id','funded_amnt','funded_amnt_inv','grade','sub_grade','emp_title','issue_d'],axis=1,inplace=True)
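
The same step can also be written without mutating the DataFrame in place. The sketch below uses a made-up toy DataFrame and a hypothetical `drop_group_1` list; `errors="ignore"` is an optional choice that lets the cell be re-run without raising a `KeyError` once the columns are already gone.

```python
import pandas as pd

# Toy stand-in for loans_2007 (only a few columns, values for illustration).
toy = pd.DataFrame({
    "id": ["1077501", "1077430"],
    "member_id": ["1296599.0", "1314167.0"],
    "loan_amnt": ["5000.0", "2500.0"],
    "int_rate": ["10.65%", "15.27%"],
})

# funded_amnt is deliberately absent from the toy data.
drop_group_1 = ["id", "member_id", "funded_amnt"]

# columns= is equivalent to axis=1; errors="ignore" skips labels that are missing.
trimmed = toy.drop(columns=drop_group_1, errors="ignore")
print(list(trimmed.columns))
```

Because `drop` returns a new DataFrame here, `toy` keeps all of its original columns, which makes it easier to compare before and after.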

# 6. Second group of features

In [6]:
print(column_groups['group_2'][:10].shape)
column_groups['group_2'][:10]

(10, 18)


Unnamed: 0,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv
0,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.1551866952,5833.84
1,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71
2,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.6668441393,3005.67
3,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.890000000902,12231.89
4,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12
5,wedding,My wedding loan I promise to pay back,852xx,AZ,11.2,0.0,Nov-2004,3.0,9.0,0.0,7963.0,28.3%,12.0,f,0.0,0.0,5632.209999999401,5632.21
6,debt_consolidation,Loan,280xx,NC,23.51,0.0,Jul-2005,1.0,7.0,0.0,17726.0,85.6%,11.0,f,0.0,0.0,10137.840007529006,10137.84
7,car,Car Downpayment,900xx,CA,5.35,0.0,Jan-2007,2.0,4.0,0.0,8221.0,87.5%,4.0,f,0.0,0.0,3939.1352939056974,3939.14
8,small_business,Expand Business & Buy Debt Portfolio,958xx,CA,5.55,0.0,Apr-2004,2.0,11.0,0.0,5210.0,32.6%,13.0,f,0.0,0.0,646.02,646.02
9,other,Building my credit history.,774xx,TX,18.08,0.0,Sep-2004,0.0,2.0,0.0,9279.0,36.5%,3.0,f,0.0,0.0,1476.19,1469.34


Within this group of columns, we need to drop the following columns:

* `zip_code:` redundant with the addr_state column, since only the first 3 digits of the 5-digit zip code are visible (which can only be used to identify the state the borrower lives in)
* `out_prncp:` leaks data from the future (after the loan has already started to be paid off)
* `out_prncp_inv:` also leaks data from the future (after the loan has already started to be paid off)
* `total_pymnt:` also leaks data from the future (after the loan has already started to be paid off)
* `total_pymnt_inv:` also leaks data from the future (after the loan has already started to be paid off)
* `total_rec_prncp:` also leaks data from the future (after the loan has already started to be paid off)

## TODO:
Use the Dataframe method drop to remove the following columns from the loans_2007 Dataframe:

* zip_code
* out_prncp
* out_prncp_inv
* total_pymnt
* total_pymnt_inv
* total_rec_prncp

In [7]:
loans_2007.drop(['zip_code','out_prncp','out_prncp_inv','total_pymnt','total_pymnt_inv','total_rec_prncp'],axis=1,inplace=True)

# 8. Third group of features

In [8]:
print(column_groups['group_3'].shape)

(42538, 16)
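
The third group has 16 columns rather than 18 because only 16 of the 52 columns remain after the first 36. Positional slicing with `iloc` does not raise an error when the stop index runs past the last column; it simply returns what is there. A minimal sketch with 52 made-up column names:

```python
import pandas as pd

# One row with 52 placeholder columns standing in for the real dataset.
toy = pd.DataFrame([list(range(52))], columns=[f"col_{i}" for i in range(52)])

# The stop index 54 exceeds the 52 available columns, so the slice is truncated.
group_3 = toy.iloc[:, 36:54]
print(group_3.shape)
```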


In [9]:
column_groups['group_3'][:10]

Unnamed: 0,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
5,5000.0,632.21,0.0,0.0,0.0,Jan-2015,161.03,Jan-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
6,7000.0,3137.84,0.0,0.0,0.0,May-2016,1313.76,May-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
7,3000.0,939.14,0.0,0.0,0.0,Jan-2015,111.34,Dec-2014,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
8,162.02,294.94,0.0,189.06,2.09,Apr-2012,152.39,Aug-2012,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
9,673.48,533.42,0.0,269.29,2.52,Nov-2012,121.45,Mar-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## TODO:
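* All of the following columns leak data from the future, since they describe the loan after it has already started to be paid off. Use the Dataframe method drop to remove them from the loans_2007 Dataframe: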

  * total_rec_int
  * total_rec_late_fee
  * recoveries
  * collection_recovery_fee
  * last_pymnt_d
  * last_pymnt_amnt
* Use the print function to:

  * display the first row of loans_2007 and
  * the number of columns in loans_2007.

In [10]:
loans_2007.drop(['total_rec_int','total_rec_late_fee','recoveries','collection_recovery_fee','last_pymnt_d','last_pymnt_amnt'],axis=1,inplace=True)

In [11]:
loans_2007.iloc[0]

loan_amnt                          5000.0
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                        24000.0
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                           0.0
earliest_cr_line                 Jan-1985
inq_last_6mths                        1.0
open_acc                              3.0
pub_rec                               0.0
revol_bal                         13648.0
revol_util                          83.7%
total_acc                             9.0
initial_list_status                     f
last_credit_pull_d               J

In [12]:
loans_2007.shape[1]

32

# 10. Target column

* Just by becoming familiar with the columns in the dataset, we were able to reduce the number of columns from 52 to 32 columns. We now need to decide on a target column that we want to use for modeling.

* We should use the `loan_status` column, since it's the only column that directly describes if a loan was paid off on time, had delayed payments, or was defaulted on by the borrower.
* Currently, this column contains text values and we need to convert it to a numerical one for training a model.

## TODO:
* Use the Series method value_counts to return the frequency of the unique values in the loan_status column.
* Display the frequency of each unique value using the print function.

In [13]:
loans_2007['loan_status'].value_counts()

Fully Paid                                             33136
Charged Off                                             5634
Does not meet the credit policy. Status:Fully Paid      1988
Current                                                  961
Does not meet the credit policy. Status:Charged Off      761
Late (31-120 days)                                        24
In Grace Period                                           20
Late (16-30 days)                                          8
Default                                                    3
Name: loan_status, dtype: int64
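
Raw counts are easier to compare as proportions. The sketch below rebuilds the counts from the output above as a Series (the name `status_counts` is just for illustration) and converts them to shares. On the real DataFrame, `loans_2007['loan_status'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts output above.
status_counts = pd.Series({
    "Fully Paid": 33136,
    "Charged Off": 5634,
    "Does not meet the credit policy. Status:Fully Paid": 1988,
    "Current": 961,
    "Does not meet the credit policy. Status:Charged Off": 761,
    "Late (31-120 days)": 24,
    "In Grace Period": 20,
    "Late (16-30 days)": 8,
    "Default": 3,
})

# Share of each status among all loans that have a status.
status_share = status_counts / status_counts.sum()
print(status_share.round(4))

# Share of loans covered by the two most frequent statuses.
top_two_share = status_counts[["Fully Paid", "Charged Off"]].sum() / status_counts.sum()
print(round(top_two_share, 4))
```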

# 11. Binary classification

There are 9 different possible values for the loan_status column. You can read about most of the different loan statuses on the [Lending Club website](https://help.lendingclub.com/hc/en-us/articles/215488038-What-do-the-different-Note-statuses-mean-). Unfortunately, the two values that start with "Does not meet the credit policy" aren't explained there. A quick Google search takes us to explanations from the lending community [here](http://www.lendacademy.com/forum/index.php?topic=2427.msg20813#msg20813).

`From the investor's perspective, we're interested in trying to predict which loans will be paid off on time and which ones won't be.` Only the `Fully Paid` and `Charged Off` values describe the final outcome of the loan. The other values describe loans that are still ongoing, where the jury is still out on whether the borrower will pay back the loan on time or not. While the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance. You can read about the difference [here](https://help.lendingclub.com/hc/en-us).

Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as a **binary classification** one. Let's remove all the loans that don't contain either Fully Paid or Charged Off as the loan's status and then transform the `Fully Paid values to 1` for the positive case and the `Charged Off values to 0` for the negative case.

Lastly, one thing we need to keep in mind is the **class imbalance** between the positive and negative cases. While there are 33,136 loans that have been fully paid off, there are only 5,634 that were charged off. **Class imbalance is a common problem in binary classification. During training, the model tends to develop a strong bias towards predicting the class with more observations in the training set and will rarely predict the class with fewer observations.** The stronger the imbalance, the more biased the model becomes.
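
A quick calculation shows why this matters. Using the two counts quoted above, the sketch below computes the accuracy of a naive baseline that labels every loan as paid off (the variable names are only for illustration).

```python
# Counts of the two final outcomes, taken from the value_counts output.
fully_paid = 33136
charged_off = 5634
total = fully_paid + charged_off

# Accuracy of a "model" that predicts 1 (Fully Paid) for every loan.
baseline_accuracy = fully_paid / total

# Number of fully paid loans for every charged off loan.
imbalance_ratio = fully_paid / charged_off

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"imbalance ratio:   {imbalance_ratio:.2f}")
```

A model can therefore look accurate while never identifying a single charged off loan, so accuracy alone is a poor yardstick for this problem.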

## TODO:
* Remove all rows from loans_2007 that contain values other than Fully Paid or Charged Off for the loan_status column.
* Use the Dataframe method replace to replace:
  * Fully Paid with 1
  * Charged Off with 0

In [14]:
loans_2007=loans_2007[(loans_2007['loan_status']=='Fully Paid')|(loans_2007['loan_status']=='Charged Off')]

In [15]:
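# Assign the result instead of using inplace=True on a filtered DataFrame, which can trigger a SettingWithCopyWarning.
loans_2007 = loans_2007.replace({'loan_status': {'Fully Paid': 1, 'Charged Off': 0}})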
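
An alternative way to express the filter and the encoding is with `isin` and `map`. This is a sketch on a made-up toy DataFrame, and the `final_statuses` dictionary is a hypothetical name; the explicit `.copy()` makes it clear that we are working on a new DataFrame rather than a view of the original.

```python
import pandas as pd

# Toy stand-in for loans_2007 with a mix of final and ongoing statuses.
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", "Default"],
    "loan_amnt": ["5000.0", "2500.0", "3000.0", "2400.0", "1000.0"],
})

# Mapping from the two final outcomes to the binary target.
final_statuses = {"Fully Paid": 1, "Charged Off": 0}

# Keep only rows whose status is one of the two final outcomes.
mask = toy["loan_status"].isin(list(final_statuses))
filtered = toy[mask].copy()

# Encode the text labels as 1 / 0.
filtered["loan_status"] = filtered["loan_status"].map(final_statuses)
print(filtered["loan_status"].tolist())
```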

# 13. Removing single value columns

* Let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing them leaves fewer columns to explore in the next steps.

* We'll need to compute the number of unique values in each column and drop the columns that contain only one unique value. While the Series method unique returns the unique values in a column, it also counts the Pandas missing value object nan as a value. This means we need to drop the null values before counting.
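
The effect of missing values on the count can be seen in a small sketch. The Series below is made up for illustration; it holds a single real value plus one missing entry.

```python
import numpy as np
import pandas as pd

# One real value ("n") and one missing value.
col = pd.Series(["n", "n", np.nan, "n"])

# unique() keeps nan, so the column appears to have two distinct values.
count_with_nan = len(col.unique())

# Dropping nulls first gives the count we actually care about.
count_without_nan = len(col.dropna().unique())

# nunique() ignores nan by default and returns the same number.
count_nunique = col.nunique()

print(count_with_nan, count_without_nan, count_nunique)
```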

## TODO:
* Remove any columns from loans_2007 that contain only one unique value:
  * Create an empty list, drop_columns to keep track of which columns you want to drop
  * For each column:
    * Use the Series method dropna to remove any null values and then use the Series method unique to return the set of non-null unique values
    * Use the len() function to return the number of values in that set
    * Append the column to drop_columns if it contains only 1 unique value
  * Use the Dataframe method drop to remove the columns in drop_columns from loans_2007
* Use the print function to display drop_columns so we know which ones were removed

In [16]:
orig_columns = loans_2007.columns
drop_columns = []
for col in orig_columns:
    col_series = loans_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)

['pymnt_plan', 'initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
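
The loop above can also be written without explicit iteration. The sketch below is an alternative on a made-up toy DataFrame, not the course's solution; it relies on `DataFrame.nunique`, which skips missing values by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two single-valued columns (one with a missing entry) and one varied column.
toy = pd.DataFrame({
    "pymnt_plan": ["n", "n", "n"],
    "policy_code": ["1.0", np.nan, "1.0"],
    "loan_amnt": ["5000.0", "2500.0", "2400.0"],
})

# Number of distinct non-null values per column.
unique_counts = toy.nunique()

# Columns with exactly one distinct non-null value carry no information.
drop_columns = unique_counts[unique_counts == 1].index.tolist()
toy = toy.drop(columns=drop_columns)

print(drop_columns)
print(list(toy.columns))
```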


It looks like we were able to remove 9 more columns since they only contained 1 unique value.

In this mission, we started to become familiar with the columns in the dataset and removed many columns that aren't useful for modeling. We also selected our target column and decided to focus our modeling efforts on binary classification. In the next mission, we'll explore the individual features in greater depth and work towards training our first machine learning model.