`PART 2: Features Preparation`
--------------------------------------------
# Machine learning model that can accurately predict if a borrower will pay off their loan on time or not?
---------------------------------------------------------------------------------

Here we continue our prediction modelling, using the CSV file we saved in *PART 1*.

In this part we focus on preparing features: handling missing values, converting categorical values to numeric values, and removing any extraneous columns we encounter. We need to convert categorical columns to numeric ones because most machine learning algorithms assume the data is numeric and contains no missing values. If this requirement isn't met, sklearn raises an error when fitting models such as `LinearRegression` and `LogisticRegression`.
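The requirement above can be expressed as a quick check. This is a minimal sketch on a small made-up frame, not the real loan data; the helper name `is_model_ready` is hypothetical and only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_model_ready(df: pd.DataFrame) -> bool:
    """Return True when every column is numeric and no value is missing."""
    all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)
    no_missing = not df.isnull().values.any()
    return all_numeric and no_missing

# Toy data (assumed values): one text column and one missing value.
raw = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, None],
    "term": [" 36 months", " 60 months", " 36 months"],
})
# The same idea after cleaning: numeric columns only, no nulls.
clean = pd.DataFrame({"loan_amnt": [5000.0, 2500.0], "term_months": [36, 60]})

print(is_model_ready(raw))    # False
print(is_model_ready(clean))  # True
```

Running a check like this before fitting a model gives a clearer message than waiting for the estimator to fail.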

In [78]:
import pandas as pd
loans = pd.read_csv("filtered_loans.csv")

In [79]:
loans.isnull().sum()

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1036
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         2
pub_rec_bankruptcies     697
dtype: int64

In [80]:
# drop "pub_rec_bankruptcies", as more than 1% of its values are null
loans = loans.drop(["pub_rec_bankruptcies"], axis=1)
# remove rows that still contain any null value
loans = loans.dropna()
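To judge which columns are worth dropping, it can help to look at the share of missing values rather than the raw counts. The sketch below uses a tiny made-up frame standing in for `loans`; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for `loans`; values are made up.
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "emp_length": ["1 year", None, "3 years", "10+ years"],
    "pub_rec_bankruptcies": [0.0, None, None, 0.0],
})

# Fraction of missing values per column, largest first.
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Drop one column outright, then drop any row that still has a null.
cleaned = toy.drop(columns=["pub_rec_bankruptcies"]).dropna()
print(cleaned.shape)
```

Dropping a column keeps all rows but loses a feature, while `dropna()` keeps the features but loses rows, so the null share helps weigh one against the other.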

In [81]:
loans.isnull().sum()

loan_amnt              0
term                   0
int_rate               0
installment            0
emp_length             0
home_ownership         0
annual_inc             0
verification_status    0
loan_status            0
purpose                0
title                  0
addr_state             0
dti                    0
delinq_2yrs            0
earliest_cr_line       0
inq_last_6mths         0
open_acc               0
pub_rec                0
revol_bal              0
revol_util             0
total_acc              0
last_credit_pull_d     0
dtype: int64

In [82]:
# count how many columns there are of each dtype
loans.dtypes.value_counts()

object     11
float64    10
int64       1
dtype: int64

#### Creating a separate dataframe for object-type columns:

In [83]:
object_columns_df = loans.select_dtypes(include=["object"])

In [84]:
object_columns_df.head()

|   | term | int_rate | emp_length | home_ownership | verification_status | purpose | title | addr_state | earliest_cr_line | revol_util | last_credit_pull_d |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 months | 10.65% | 10+ years | RENT | Verified | credit_card | Computer | AZ | Jan-1985 | 83.7% | Jun-2016 |
| 1 | 60 months | 15.27% | < 1 year | RENT | Source Verified | car | bike | GA | Apr-1999 | 9.4% | Sep-2013 |
| 2 | 36 months | 15.96% | 10+ years | RENT | Not Verified | small_business | real estate business | IL | Nov-2001 | 98.5% | Jun-2016 |
| 3 | 36 months | 13.49% | 10+ years | RENT | Source Verified | other | personel | CA | Feb-1996 | 21% | Apr-2016 |
| 4 | 36 months | 7.90% | 3 years | RENT | Source Verified | wedding | My wedding loan I promise to pay back | AZ | Nov-2004 | 28.3% | Jan-2016 |


* `int_rate` and `revol_util` hold numeric values stored as strings with a trailing `%`, as we can see above, so we will convert them to a numeric type.
* `earliest_cr_line` and `last_credit_pull_d` contain dates, which would need careful feature engineering to be useful, so we will drop them.
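For reference, here is a rough sketch of what engineering the date columns could look like if they were kept instead of dropped. It is not part of this notebook's pipeline: the `credit_history_years` feature is a hypothetical example, and the sample strings are taken from the first rows of the table above.

```python
import pandas as pd

dates = pd.DataFrame({
    "earliest_cr_line": ["Jan-1985", "Apr-1999", "Nov-2001"],
    "last_credit_pull_d": ["Jun-2016", "Sep-2013", "Jun-2016"],
})

# Parse the "Mon-YYYY" strings into timestamps.
first_line = pd.to_datetime(dates["earliest_cr_line"], format="%b-%Y")
last_pull = pd.to_datetime(dates["last_credit_pull_d"], format="%b-%Y")

# Hypothetical feature: calendar years between the first credit line
# and the most recent credit pull.
dates["credit_history_years"] = last_pull.dt.year - first_line.dt.year
print(dates["credit_history_years"].tolist())  # [31, 14, 15]
```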


In [85]:
# print the value counts of each remaining object column
for col in object_columns_df.columns.drop(["int_rate","revol_util","earliest_cr_line","last_credit_pull_d"]):
    print(object_columns_df[col].value_counts())

 36 months    28234
 60 months     9441
Name: term, dtype: int64
10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64
RENT        18112
MORTGAGE    16686
OWN          2778
OTHER          96
NONE            3
Name: home_ownership, dtype: int64
Not Verified       16281
Verified           11856
Source Verified     9538
Name: verification_status, dtype: int64
debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64
Debt Consolidation                          20

### Doing further analysis based on columns:

In [86]:
loans["purpose"].value_counts()

debt_consolidation    17751
credit_card            4911
other                  3711
home_improvement       2808
major_purchase         2083
small_business         1719
car                    1459
wedding                 916
medical                 655
moving                  552
house                   356
vacation                348
educational             312
renewable_energy         94
Name: purpose, dtype: int64

In [87]:
loans["title"].value_counts()

Debt Consolidation                          2068
Debt Consolidation Loan                     1599
Personal Loan                                624
Consolidation                                488
debt consolidation                           466
Credit Card Consolidation                    345
Home Improvement                             336
Debt consolidation                           314
Small Business Loan                          298
Credit Card Loan                             294
Personal                                     290
Consolidation Loan                           250
Home Improvement Loan                        228
personal loan                                219
Loan                                         202
Wedding Loan                                 199
personal                                     198
Car Loan                                     188
consolidation                                186
Other Loan                                   168
Wedding             

It seems that **purpose** and **title** contain overlapping information, so we will keep only one of them. We keep the **purpose** column because it has fewer distinct values.
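One way to back up this choice is to compare the number of distinct values in the two columns. The sketch below uses a few made-up rows modelled on the value counts above, not the real data.

```python
import pandas as pd

# Made-up rows in the style of the real columns.
sample = pd.DataFrame({
    "purpose": ["debt_consolidation", "debt_consolidation",
                "debt_consolidation", "car"],
    "title": ["Debt Consolidation", "debt consolidation",
              "Debt Consolidation Loan", "Car Loan"],
})

# Number of distinct values in each column.
print(sample[["purpose", "title"]].nunique())

# Lower-casing and trimming merges some titles, but the free text
# still has more distinct values than `purpose`.
normalized_titles = sample["title"].str.lower().str.strip()
print(normalized_titles.nunique())
```

A free-text column with many near-duplicate spellings would produce a very large number of dummy columns, which is another reason to prefer the fixed categories in `purpose`.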

#### Further cleaning begins:

In [88]:
cols_to_drop = ["last_credit_pull_d", "addr_state", "title", "earliest_cr_line"]
loans = loans.drop(cols_to_drop,axis=1)

In [89]:
# convert `int_rate` and `revol_util` to a numeric type
loans["int_rate"] = loans["int_rate"].str.rstrip('%').astype(float)
loans["revol_util"] = loans["revol_util"].str.rstrip('%').astype(float)
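The conversion above assumes every value parses cleanly once the `%` is removed. As a hedged alternative, the sketch below turns unparseable entries into `NaN` instead of raising; the helper name `percent_to_float` and the sample values are assumptions for illustration.

```python
import pandas as pd

def percent_to_float(series: pd.Series) -> pd.Series:
    """Strip a trailing '%' and convert to float; unparseable entries become NaN."""
    stripped = series.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(stripped, errors="coerce")

# Three well-formed values and one deliberately bad entry.
rates = pd.Series(["10.65%", "15.27%", "21%", "bad value"])
converted = percent_to_float(rates)
print(converted)
print(converted.isnull().sum())  # number of entries that failed to parse
```

With `astype(float)` a malformed entry stops the cell with an error, which is often what you want; `errors="coerce"` is useful when you would rather inspect the failures afterwards.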

In [90]:
loans["emp_length"].value_counts()

10+ years    8545
< 1 year     4513
2 years      4303
3 years      4022
4 years      3353
5 years      3202
1 year       3176
6 years      2177
7 years      1714
8 years      1442
9 years      1228
Name: emp_length, dtype: int64

In [91]:
mapping = {"emp_length": {
    "10+ years": 10,
    "< 1 year": 0,
    "2 years": 2,
    "3 years": 3,
    "4 years": 4,
    "5 years": 5,
    "1 year": 1,
    "6 years": 6,
    "7 years": 7,
    "8 years": 8,
    "9 years": 9,
    "n/a": 0,
}}
loans = loans.replace(mapping)
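`replace` leaves any label that is missing from the dictionary untouched, so an unexpected category would silently stay a string. A small sketch of an alternative using `Series.map`, which makes such labels visible; the `"12 years"` entry is a made-up value added to show the effect.

```python
import pandas as pd

emp_length_map = {
    "< 1 year": 0, "1 year": 1, "2 years": 2, "3 years": 3, "4 years": 4,
    "5 years": 5, "6 years": 6, "7 years": 7, "8 years": 8, "9 years": 9,
    "10+ years": 10,
}

# Three real labels plus one made-up label that is not in the dictionary.
emp = pd.Series(["10+ years", "< 1 year", "3 years", "12 years"])

# Series.map returns NaN for labels missing from the dictionary,
# which makes unexpected categories easy to spot.
mapped = emp.map(emp_length_map)
unmapped = emp[mapped.isnull()].drop_duplicates().tolist()
print(unmapped)  # ['12 years']
```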

In [92]:
loans.head()

|   | loan_amnt | term | int_rate | installment | emp_length | home_ownership | annual_inc | verification_status | loan_status | purpose | dti | delinq_2yrs | inq_last_6mths | open_acc | pub_rec | revol_bal | revol_util | total_acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000.0 | 36 months | 10.65 | 162.87 | 10 | RENT | 24000.0 | Verified | 1 | credit_card | 27.65 | 0.0 | 1.0 | 3.0 | 0.0 | 13648.0 | 83.7 | 9.0 |
| 1 | 2500.0 | 60 months | 15.27 | 59.83 | 0 | RENT | 30000.0 | Source Verified | 0 | car | 1.0 | 0.0 | 5.0 | 3.0 | 0.0 | 1687.0 | 9.4 | 4.0 |
| 2 | 2400.0 | 36 months | 15.96 | 84.33 | 10 | RENT | 12252.0 | Not Verified | 1 | small_business | 8.72 | 0.0 | 2.0 | 2.0 | 0.0 | 2956.0 | 98.5 | 10.0 |
| 3 | 10000.0 | 36 months | 13.49 | 339.31 | 10 | RENT | 49200.0 | Source Verified | 1 | other | 20.0 | 0.0 | 1.0 | 10.0 | 0.0 | 5598.0 | 21.0 | 37.0 |
| 4 | 5000.0 | 36 months | 7.9 | 156.46 | 3 | RENT | 36000.0 | Source Verified | 1 | wedding | 11.2 | 0.0 | 3.0 | 9.0 | 0.0 | 7963.0 | 28.3 | 12.0 |


In [93]:
loans.shape

(37675, 18)

#### Working on home_ownership, verification_status, purpose, and term:
We encode these categorical columns as dummy (one-hot) columns.

In [94]:
col_list = ["home_ownership", "verification_status", "purpose", "term"]
for col in col_list:
    # build the dummy columns, append them, then drop the original column
    dummy = pd.get_dummies(loans[col], prefix=col)
    loans = pd.concat([loans, dummy], axis=1)
    loans = loans.drop(col, axis=1)
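The loop above can also be written as a single `pd.get_dummies` call with the `columns=` argument. The sketch below shows this on a small made-up frame; passing `dtype="uint8"` is a choice made here so the result does not depend on the pandas version, since the default dummy dtype changed to `bool` in pandas 2.0.

```python
import pandas as pd

# Toy frame with two categorical columns; values are made up.
toy = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "term": [" 36 months", " 60 months", " 36 months"],
})

# `columns=` encodes the listed columns and drops the originals in one step.
encoded = pd.get_dummies(toy, columns=["home_ownership", "term"], dtype="uint8")
print(encoded.columns.tolist())
print(encoded.dtypes.value_counts())
```

The leading space in names such as `term_ 36 months` comes from the raw values themselves; stripping whitespace from the column before encoding would give cleaner names.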

In [95]:
loans.head()

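|   | loan_amnt | int_rate | installment | emp_length | annual_inc | loan_status | dti | delinq_2yrs | inq_last_6mths | open_acc | ... | purpose_major_purchase | purpose_medical | purpose_moving | purpose_other | purpose_renewable_energy | purpose_small_business | purpose_vacation | purpose_wedding | term_ 36 months | term_ 60 months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5000.0 | 10.65 | 162.87 | 10 | 24000.0 | 1 | 27.65 | 0.0 | 1.0 | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2500.0 | 15.27 | 59.83 | 0 | 30000.0 | 0 | 1.0 | 0.0 | 5.0 | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2400.0 | 15.96 | 84.33 | 10 | 12252.0 | 1 | 8.72 | 0.0 | 2.0 | 2.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 10000.0 | 13.49 | 339.31 | 10 | 49200.0 | 1 | 20.0 | 0.0 | 1.0 | 10.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 5000.0 | 7.9 | 156.46 | 3 | 36000.0 | 1 | 11.2 | 0.0 | 3.0 | 9.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |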


In [98]:
loans.dtypes.value_counts()

uint8      24
float64    12
int64       2
dtype: int64

We have now successfully converted all the columns to a **numerical type**. Let's save this dataframe.

In [99]:
loans.to_csv("loans.csv",index=False)

## So far..
* Converted the necessary columns to a numerical type
* Removed columns that provide overlapping information
* Added new features using dummy variables
* Cleaned the dataset by removing null values
* Mapped category values to specific integers

We have done a lot of preprocessing so far, and the dataset is now clean. In the next part we will start working on machine learning models.