## Steps for Data Cleansing:
- Import the necessary libraries.
- Load the dataset into a dataframe.
- Data understanding (very important).
- Understand the problem statement.
- Check for missing values.
- Columns with a high percentage of missing values (more than 40 or 50%) are discarded from the analysis outright.
- Columns whose missing values fall within an acceptable range are imputed, that is, the gaps are replaced with suitable values.
- For numerical columns, use the mean or the median (the median is usually preferred because it is not affected by outliers).
- For categorical columns, use the mode.
- If a column has very few missing values (<1%), they can either be imputed or dropped from the dataset.
- Segmentation: segregate the columns based on their nature.
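
The imputation and segmentation steps above can be sketched on a small made-up frame. This is only an illustration under assumed data: the `toy` frame, its column names, and its values are hypothetical and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "income": [40_000, 42_000, np.nan, 45_000, 1_000_000],  # last value is an outlier
    "grade": ["A", "B", "B", None, "C"],
})

# Segmentation: split the columns by their nature.
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# The outlier pulls the mean up sharply, while the median barely moves.
print(toy["income"].mean())    # 281750.0
print(toy["income"].median())  # 43500.0

cleaned = toy.copy()
for col in numeric_cols:
    # Numeric columns: fill the gaps with the median.
    cleaned[col] = cleaned[col].fillna(cleaned[col].median())
for col in categorical_cols:
    # Categorical columns: mode() returns a Series (ties are possible),
    # so take its first entry.
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```

The gap between the mean and the median here is the reason the median is usually the safer choice for skewed columns.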

### Importing necessary libraries

In [186]:
#Numerical & Data Analysis
import pandas as pd
import numpy as np

#Data Visualization
from matplotlib import pyplot as plt

#Others
import warnings
warnings.filterwarnings('ignore')

### Data Loading

In [187]:
df = pd.read_csv('/Users/nikhilnaveen/Downloads/loan.csv')
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,1077501,1296599,5000,5000,4975.0,36 months,10.65%,162.87,B,B2,...,,,,,0.0,0.0,,,,
1,1077430,1314167,2500,2500,2500.0,60 months,15.27%,59.83,C,C4,...,,,,,0.0,0.0,,,,
2,1077175,1313524,2400,2400,2400.0,36 months,15.96%,84.33,C,C5,...,,,,,0.0,0.0,,,,
3,1076863,1277178,10000,10000,10000.0,36 months,13.49%,339.31,C,C1,...,,,,,0.0,0.0,,,,
4,1075358,1311748,3000,3000,3000.0,60 months,12.69%,67.79,B,B5,...,,,,,0.0,0.0,,,,


### Problem Statement:
Analyze the loan dataset and derive the variables/factors that affect the loan approval of customers.

### Missing value check

In [188]:
for column in df.columns:
    if df[column].isnull().any():
        print(f'{column} {100*df[column].isnull().mean()} null values')
#df.isnull().sum() --> count of missing values per column
#df.isnull().mean() --> fraction of missing values per column (multiply by 100 for a percentage)

emp_title 6.191303472064859 null values
emp_length 2.7066495455346575 null values
desc 32.58554271470655 null values
title 0.027695948838029054 null values
mths_since_last_delinq 64.66248709620565 null values
mths_since_last_record 92.98537150338646 null values
revol_util 0.12589067653649572 null values
last_pymnt_d 0.1787647606818239 null values
next_pymnt_d 97.12969257496789 null values
last_credit_pull_d 0.0050356270614598285 null values
collections_12_mths_ex_med 0.1409975577208752 null values
mths_since_last_major_derog 100.0 null values
annual_inc_joint 100.0 null values
dti_joint 100.0 null values
verification_status_joint 100.0 null values
tot_coll_amt 100.0 null values
tot_cur_bal 100.0 null values
open_acc_6m 100.0 null values
open_il_6m 100.0 null values
open_il_12m 100.0 null values
open_il_24m 100.0 null values
mths_since_rcnt_il 100.0 null values
total_bal_il 100.0 null values
il_util 100.0 null values
open_rv_12m 100.0 null values
open_rv_24m 100.0 null values
max_bal_bc
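
The same check can be written without an explicit loop. The helper below is a minimal sketch, not part of the original notebook: the name `missing_percent` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_percent(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first.

    Only columns with at least one missing value are returned.
    """
    pct = 100 * frame.isnull().mean()
    return pct[pct > 0].sort_values(ascending=False)

# Hypothetical toy data with 0%, 25% and 100% missing values.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

report = missing_percent(toy)
print(report)
```

Sorting the result puts the columns that are candidates for dropping at the top of the report.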

In [189]:
#Missing value columns --> tax liens, tot_hi_cred_lim, total_bal_ex_mort, etc

In [190]:
#Drop every column in which more than 60% of the values are null
null_value_mean = 100*df.isnull().mean()
columns_to_drop = null_value_mean[null_value_mean > 60].index
df_new = df.drop(columns=columns_to_drop) #columns= already targets columns, so axis=1 is not needed
df_new.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599,5000,5000,4975.0,36 months,10.65%,162.87,B,B2,...,171.62,May-16,0.0,1,INDIVIDUAL,0,0.0,0,0.0,0.0
1,1077430,1314167,2500,2500,2500.0,60 months,15.27%,59.83,C,C4,...,119.66,Sep-13,0.0,1,INDIVIDUAL,0,0.0,0,0.0,0.0
2,1077175,1313524,2400,2400,2400.0,36 months,15.96%,84.33,C,C5,...,649.91,May-16,0.0,1,INDIVIDUAL,0,0.0,0,0.0,0.0
3,1076863,1277178,10000,10000,10000.0,36 months,13.49%,339.31,C,C1,...,357.48,Apr-16,0.0,1,INDIVIDUAL,0,0.0,0,0.0,0.0
4,1075358,1311748,3000,3000,3000.0,60 months,12.69%,67.79,B,B5,...,67.79,May-16,0.0,1,INDIVIDUAL,0,0.0,0,0.0,0.0
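
The steps list mentions a cut-off of 40 or 50% while the cell above uses 60%, so it may help to treat the threshold as a parameter. The function below is a sketch under that assumption. `drop_sparse_columns` and `max_missing_pct` are hypothetical names, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing_pct: float = 60.0):
    """Drop columns whose share of missing values exceeds max_missing_pct.

    Returns the reduced frame and the list of dropped column names.
    """
    pct = 100 * frame.isnull().mean()
    to_drop = pct[pct > max_missing_pct].index
    return frame.drop(columns=to_drop), list(to_drop)

# Hypothetical toy data: 0%, 50% and 75% missing values.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "half": [1.0, np.nan, 3.0, np.nan],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

kept_60, dropped_60 = drop_sparse_columns(toy, max_missing_pct=60.0)
kept_40, dropped_40 = drop_sparse_columns(toy, max_missing_pct=40.0)
print(dropped_60)  # ['sparse']
print(dropped_40)  # ['half', 'sparse']
```

Returning the dropped names alongside the frame keeps a record of what was removed at each threshold.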


In [191]:
100*df_new.isnull().mean()

id                             0.000000
member_id                      0.000000
loan_amnt                      0.000000
funded_amnt                    0.000000
funded_amnt_inv                0.000000
term                           0.000000
int_rate                       0.000000
installment                    0.000000
grade                          0.000000
sub_grade                      0.000000
emp_title                      6.191303
emp_length                     2.706650
home_ownership                 0.000000
annual_inc                     0.000000
verification_status            0.000000
issue_d                        0.000000
loan_status                    0.000000
pymnt_plan                     0.000000
url                            0.000000
desc                          32.585543
purpose                        0.000000
title                          0.027696
zip_code                       0.000000
addr_state                     0.000000
dti                            0.000000


In [192]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

emp_title 6.191303472064859
emp_length 2.7066495455346575
desc 32.58554271470655
title 0.027695948838029054
revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
last_credit_pull_d 0.0050356270614598285
collections_12_mths_ex_med 0.1409975577208752
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [193]:
df_new["emp_title"].value_counts()

emp_title
US Army                              134
Bank of America                      109
IBM                                   66
AT&T                                  59
Kaiser Permanente                     56
                                    ... 
Community College of Philadelphia      1
AMEC                                   1
lee county sheriff                     1
Bacon County Board of Education        1
Evergreen Center                       1
Name: count, Length: 28820, dtype: int64

In [194]:
#Approach-1 : Fill the missing values with a placeholder label (this is not the mode)
df_new["emp_title"] = df_new["emp_title"].fillna('**No Title**')
100*df_new.isnull().mean()

id                             0.000000
member_id                      0.000000
loan_amnt                      0.000000
funded_amnt                    0.000000
funded_amnt_inv                0.000000
term                           0.000000
int_rate                       0.000000
installment                    0.000000
grade                          0.000000
sub_grade                      0.000000
emp_title                      0.000000
emp_length                     2.706650
home_ownership                 0.000000
annual_inc                     0.000000
verification_status            0.000000
issue_d                        0.000000
loan_status                    0.000000
pymnt_plan                     0.000000
url                            0.000000
desc                          32.585543
purpose                        0.000000
title                          0.027696
zip_code                       0.000000
addr_state                     0.000000
dti                            0.000000
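
Filling with a placeholder label and filling with the mode are two different choices. The sketch below contrasts them on a made-up series. The values are hypothetical and only borrow the `emp_title` column name for readability.

```python
import pandas as pd

# Hypothetical sample of job titles with two missing entries.
titles = pd.Series(["US Army", "IBM", None, "US Army", None], name="emp_title")

# Option A: mark missing entries with a sentinel label, as the cell above does.
with_placeholder = titles.fillna("**No Title**")

# Option B: impute with the mode, the most frequent observed value.
most_common = titles.mode().iloc[0]
with_mode = titles.fillna(most_common)

print(with_placeholder.value_counts())
print(with_mode.value_counts())
```

For a column with tens of thousands of distinct titles, the placeholder keeps "unknown" visible as its own category, whereas the mode would inflate the count of the most common employer.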


In [195]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

emp_length 2.7066495455346575
desc 32.58554271470655
title 0.027695948838029054
revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
last_credit_pull_d 0.0050356270614598285
collections_12_mths_ex_med 0.1409975577208752
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [196]:
#Approach-1 : Fill the missing values with a placeholder label ('n/a' with a forward slash; '\a' in a string is an escape sequence)
df_new["emp_length"] = df_new["emp_length"].fillna('n/a')
100*df_new.isnull().mean()

id                             0.000000
member_id                      0.000000
loan_amnt                      0.000000
funded_amnt                    0.000000
funded_amnt_inv                0.000000
term                           0.000000
int_rate                       0.000000
installment                    0.000000
grade                          0.000000
sub_grade                      0.000000
emp_title                      0.000000
emp_length                     0.000000
home_ownership                 0.000000
annual_inc                     0.000000
verification_status            0.000000
issue_d                        0.000000
loan_status                    0.000000
pymnt_plan                     0.000000
url                            0.000000
desc                          32.585543
purpose                        0.000000
title                          0.027696
zip_code                       0.000000
addr_state                     0.000000
dti                            0.000000
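
One detail worth checking in placeholder strings is the backslash. In a normal Python string literal, `\a` is the escape for the ASCII bell character, so a label typed as `'n\a'` holds two characters rather than the three visible ones. The variable names below are hypothetical and used only for this short demonstration.

```python
# '\a' is the bell escape (code point 7), so this string has two characters.
label_backslash = 'n\a'
# A forward slash gives the three visible characters n, /, a.
label_slash = 'n/a'
# A raw string keeps the backslash literally, if one is really wanted.
label_raw = r'n\a'

print(len(label_backslash), repr(label_backslash))  # 2 'n\x07'
print(len(label_slash), repr(label_slash))          # 3 'n/a'
print(len(label_raw), repr(label_raw))              # 3 'n\\a'
```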


In [197]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

desc 32.58554271470655
title 0.027695948838029054
revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
last_credit_pull_d 0.0050356270614598285
collections_12_mths_ex_med 0.1409975577208752
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [198]:
df_new["desc"] = df_new["desc"].fillna('**No Desc**')
100*df_new.isnull().mean()

id                            0.000000
member_id                     0.000000
loan_amnt                     0.000000
funded_amnt                   0.000000
funded_amnt_inv               0.000000
term                          0.000000
int_rate                      0.000000
installment                   0.000000
grade                         0.000000
sub_grade                     0.000000
emp_title                     0.000000
emp_length                    0.000000
home_ownership                0.000000
annual_inc                    0.000000
verification_status           0.000000
issue_d                       0.000000
loan_status                   0.000000
pymnt_plan                    0.000000
url                           0.000000
desc                          0.000000
purpose                       0.000000
title                         0.027696
zip_code                      0.000000
addr_state                    0.000000
dti                           0.000000
delinq_2yrs              

In [199]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

title 0.027695948838029054
revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
last_credit_pull_d 0.0050356270614598285
collections_12_mths_ex_med 0.1409975577208752
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [200]:
df_new["title"] = df_new["title"].fillna('**No Title**')
100*df_new.isnull().mean()

id                            0.000000
member_id                     0.000000
loan_amnt                     0.000000
funded_amnt                   0.000000
funded_amnt_inv               0.000000
term                          0.000000
int_rate                      0.000000
installment                   0.000000
grade                         0.000000
sub_grade                     0.000000
emp_title                     0.000000
emp_length                    0.000000
home_ownership                0.000000
annual_inc                    0.000000
verification_status           0.000000
issue_d                       0.000000
loan_status                   0.000000
pymnt_plan                    0.000000
url                           0.000000
desc                          0.000000
purpose                       0.000000
title                         0.000000
zip_code                      0.000000
addr_state                    0.000000
dti                           0.000000
delinq_2yrs              

In [201]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
last_credit_pull_d 0.0050356270614598285
collections_12_mths_ex_med 0.1409975577208752
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [202]:
df_new = df_new.drop(columns="collections_12_mths_ex_med") #columns= already targets columns, so axis=1 is not needed
df_new.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599,5000,5000,4975.0,36 months,10.65%,162.87,B,B2,...,Jan-15,171.62,May-16,1,INDIVIDUAL,0,0.0,0,0.0,0.0
1,1077430,1314167,2500,2500,2500.0,60 months,15.27%,59.83,C,C4,...,Apr-13,119.66,Sep-13,1,INDIVIDUAL,0,0.0,0,0.0,0.0
2,1077175,1313524,2400,2400,2400.0,36 months,15.96%,84.33,C,C5,...,Jun-14,649.91,May-16,1,INDIVIDUAL,0,0.0,0,0.0,0.0
3,1076863,1277178,10000,10000,10000.0,36 months,13.49%,339.31,C,C1,...,Jan-15,357.48,Apr-16,1,INDIVIDUAL,0,0.0,0,0.0,0.0
4,1075358,1311748,3000,3000,3000.0,60 months,12.69%,67.79,B,B5,...,May-16,67.79,May-16,1,INDIVIDUAL,0,0.0,0,0.0,0.0


In [203]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
last_credit_pull_d 0.0050356270614598285
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [204]:
df_new["last_credit_pull_d"] = df_new["last_credit_pull_d"].fillna('**No Date**')
100*df_new.isnull().mean()

id                          0.000000
member_id                   0.000000
loan_amnt                   0.000000
funded_amnt                 0.000000
funded_amnt_inv             0.000000
term                        0.000000
int_rate                    0.000000
installment                 0.000000
grade                       0.000000
sub_grade                   0.000000
emp_title                   0.000000
emp_length                  0.000000
home_ownership              0.000000
annual_inc                  0.000000
verification_status         0.000000
issue_d                     0.000000
loan_status                 0.000000
pymnt_plan                  0.000000
url                         0.000000
desc                        0.000000
purpose                     0.000000
title                       0.000000
zip_code                    0.000000
addr_state                  0.000000
dti                         0.000000
delinq_2yrs                 0.000000
earliest_cr_line            0.000000
i

In [205]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

revol_util 0.12589067653649572
last_pymnt_d 0.1787647606818239
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [206]:
df_new['revol_util'] = df_new['revol_util'].str.replace('%', '').astype(float)
100*df_new.isnull().mean()

id                          0.000000
member_id                   0.000000
loan_amnt                   0.000000
funded_amnt                 0.000000
funded_amnt_inv             0.000000
term                        0.000000
int_rate                    0.000000
installment                 0.000000
grade                       0.000000
sub_grade                   0.000000
emp_title                   0.000000
emp_length                  0.000000
home_ownership              0.000000
annual_inc                  0.000000
verification_status         0.000000
issue_d                     0.000000
loan_status                 0.000000
pymnt_plan                  0.000000
url                         0.000000
desc                        0.000000
purpose                     0.000000
title                       0.000000
zip_code                    0.000000
addr_state                  0.000000
dti                         0.000000
delinq_2yrs                 0.000000
earliest_cr_line            0.000000
i

In [207]:
med = df_new['revol_util'].median()
df_new['revol_util'] = df_new['revol_util'].fillna(med)
100*df_new.isnull().mean()

id                          0.000000
member_id                   0.000000
loan_amnt                   0.000000
funded_amnt                 0.000000
funded_amnt_inv             0.000000
term                        0.000000
int_rate                    0.000000
installment                 0.000000
grade                       0.000000
sub_grade                   0.000000
emp_title                   0.000000
emp_length                  0.000000
home_ownership              0.000000
annual_inc                  0.000000
verification_status         0.000000
issue_d                     0.000000
loan_status                 0.000000
pymnt_plan                  0.000000
url                         0.000000
desc                        0.000000
purpose                     0.000000
title                       0.000000
zip_code                    0.000000
addr_state                  0.000000
dti                         0.000000
delinq_2yrs                 0.000000
earliest_cr_line            0.000000
i

In [210]:
df_new['revol_util'] = (df_new['revol_util']).astype(str) + '%'
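
The three `revol_util` cells above form a round trip: strip the percent sign, impute the median, then append the sign again. The sketch below replays that sequence on made-up values, assuming every entry is either missing or a number followed by `%`.

```python
import numpy as np
import pandas as pd

# Hypothetical percentage strings with one missing entry.
util = pd.Series(["83.7%", "9.4%", np.nan, "98.5%"], name="revol_util")

# Step 1: strip the literal '%' and convert to float (missing stays NaN).
as_float = util.str.replace("%", "", regex=False).astype(float)

# Step 2: impute the median of the observed values.
median_value = as_float.median()
filled = as_float.fillna(median_value)

# Step 3: converting back to text restores the original look,
# but the column is no longer numeric afterwards.
as_text = filled.astype(str) + "%"

print(median_value)
print(as_text.tolist())
```

If the column is needed for numeric analysis later, keeping the float version from step 2 avoids having to parse it again.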

In [212]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

last_pymnt_d 0.1787647606818239
chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [213]:
df_new["last_pymnt_d"] = df_new["last_pymnt_d"].fillna('**No Date**')
100*df_new.isnull().mean()

id                          0.000000
member_id                   0.000000
loan_amnt                   0.000000
funded_amnt                 0.000000
funded_amnt_inv             0.000000
term                        0.000000
int_rate                    0.000000
installment                 0.000000
grade                       0.000000
sub_grade                   0.000000
emp_title                   0.000000
emp_length                  0.000000
home_ownership              0.000000
annual_inc                  0.000000
verification_status         0.000000
issue_d                     0.000000
loan_status                 0.000000
pymnt_plan                  0.000000
url                         0.000000
desc                        0.000000
purpose                     0.000000
title                       0.000000
zip_code                    0.000000
addr_state                  0.000000
dti                         0.000000
delinq_2yrs                 0.000000
earliest_cr_line            0.000000
i
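
Filling a date column with a text label such as `'**No Date**'` keeps the column as strings. An alternative, shown here only as a sketch on made-up values, is to parse the `Mon-YY` strings into datetimes and let missing entries become `NaT`. Whether that suits the analysis is a judgment call, and the sketch assumes English month abbreviations.

```python
import numpy as np
import pandas as pd

# Hypothetical 'Mon-YY' strings with one missing entry.
last_pymnt = pd.Series(["Jan-15", "Apr-13", np.nan, "May-16"], name="last_pymnt_d")

# %b is the abbreviated month name, %y the two-digit year.
parsed = pd.to_datetime(last_pymnt, format="%b-%y")

print(parsed)
print(parsed.isna().sum())  # missing entries are kept as NaT
```

With real datetimes, operations such as sorting by date or computing the time between two dates work directly.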

In [214]:
null_value_mean_new = 100*df_new.isnull().mean()
columns_to_display = null_value_mean_new[null_value_mean_new >0].index
for column in columns_to_display:
    print(column, null_value_mean_new[column])

chargeoff_within_12_mths 0.1409975577208752
pub_rec_bankruptcies 1.7549160309187504
tax_liens 0.09819472769846666


In [215]:
100*df_new.isnull().mean()

id                          0.000000
member_id                   0.000000
loan_amnt                   0.000000
funded_amnt                 0.000000
funded_amnt_inv             0.000000
term                        0.000000
int_rate                    0.000000
installment                 0.000000
grade                       0.000000
sub_grade                   0.000000
emp_title                   0.000000
emp_length                  0.000000
home_ownership              0.000000
annual_inc                  0.000000
verification_status         0.000000
issue_d                     0.000000
loan_status                 0.000000
pymnt_plan                  0.000000
url                         0.000000
desc                        0.000000
purpose                     0.000000
title                       0.000000
zip_code                    0.000000
addr_state                  0.000000
dti                         0.000000
delinq_2yrs                 0.000000
earliest_cr_line            0.000000
i

In [217]:
df_new = df_new.drop(columns="tax_liens", errors="ignore") #errors="ignore" keeps the cell re-runnable; without it, a second run raises KeyError: "['tax_liens'] not found in axis"
df_new.head()
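
A column can only be dropped once, so running a drop cell a second time fails unless pandas is told to ignore missing labels. The sketch below reproduces this on a made-up two-column frame. The values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the column that is about to be dropped.
toy = pd.DataFrame({"loan_amnt": [5000, 2500], "tax_liens": [0.0, 0.0]})

# First drop succeeds.
toy = toy.drop(columns="tax_liens")

# A second plain drop raises KeyError because the column is already gone.
try:
    toy.drop(columns="tax_liens")
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# errors="ignore" skips labels that do not exist, so this is safe to repeat.
toy_again = toy.drop(columns="tax_liens", errors="ignore")
print(list(toy_again.columns))  # ['loan_amnt']
```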

In [219]:
100*df_new.isnull().mean()

id                          0.000000
member_id                   0.000000
loan_amnt                   0.000000
funded_amnt                 0.000000
funded_amnt_inv             0.000000
term                        0.000000
int_rate                    0.000000
installment                 0.000000
grade                       0.000000
sub_grade                   0.000000
emp_title                   0.000000
emp_length                  0.000000
home_ownership              0.000000
annual_inc                  0.000000
verification_status         0.000000
issue_d                     0.000000
loan_status                 0.000000
pymnt_plan                  0.000000
url                         0.000000
desc                        0.000000
purpose                     0.000000
title                       0.000000
zip_code                    0.000000
addr_state                  0.000000
dti                         0.000000
delinq_2yrs                 0.000000
earliest_cr_line            0.000000
i

In [220]:
df_new = df_new.drop(columns="chargeoff_within_12_mths") #columns= already targets columns, so axis=1 is not needed
100*df_new.isnull().mean() #only the last expression in a cell is displayed

id                         0.000000
member_id                  0.000000
loan_amnt                  0.000000
funded_amnt                0.000000
funded_amnt_inv            0.000000
term                       0.000000
int_rate                   0.000000
installment                0.000000
grade                      0.000000
sub_grade                  0.000000
emp_title                  0.000000
emp_length                 0.000000
home_ownership             0.000000
annual_inc                 0.000000
verification_status        0.000000
issue_d                    0.000000
loan_status                0.000000
pymnt_plan                 0.000000
url                        0.000000
desc                       0.000000
purpose                    0.000000
title                      0.000000
zip_code                   0.000000
addr_state                 0.000000
dti                        0.000000
delinq_2yrs                0.000000
earliest_cr_line           0.000000
inq_last_6mths             0