<font color=blue>INTRODUCTION</font>

This case study aims to give us an idea of applying EDA in a real business scenario. In this case study, we develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.

<font color=blue>BUSINESS OBJECTIVES</font>


This case study aims to identify patterns that indicate whether a client will have difficulty paying their installments. These patterns can be used to take actions such as denying the loan, reducing the loan amount, or lending (to risky applicants) at a higher interest rate. This ensures that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

 

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment.

To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

In [None]:
# Importing all the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# To suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# To increase the display size for rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [None]:
# Reading the application CSV dataset
df1=pd.read_csv("../input/loan-defaulter/application_data.csv")
df1.head()

In [None]:
# Reading the shape
df1.shape

In [None]:
# checking statistics
df1.describe()

In [None]:
## Finding columns with greater than 40 % null values 
null_column =round((df1.isnull().sum()/len(df1))*100,4) 
null_column_40 = null_column[null_column.values > 40.0000]
null_column_40

In [None]:
## Dropping the columns with more than 40% null values
null_column_40 = list(null_column_40.index)
df1.drop(labels=null_column_40,axis=1,inplace=True)
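
The two cells above can be folded into one reusable step. The following is a minimal sketch, assuming a hypothetical helper name `drop_sparse_columns` and a small made-up frame in place of the application data; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct=40.0):
    """Return (frame without sparse columns, list of dropped column names)."""
    null_pct = df.isnull().mean() * 100               # percentage of nulls per column
    sparse = null_pct[null_pct > threshold_pct].index
    return df.drop(columns=sparse), list(sparse)

# Toy frame standing in for the application data (values are made up)
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5],
    "MOSTLY_MISSING": [np.nan, np.nan, np.nan, 1.0, 2.0],   # 60% null
    "FEW_MISSING": [1.0, np.nan, 3.0, 4.0, 5.0],            # 20% null
})
cleaned, dropped = drop_sparse_columns(toy)
print(dropped)
```

Returning the list of dropped names alongside the frame keeps a record of what was removed, which is useful when the same threshold is applied to a second dataset.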

In [None]:
## After analysing, we found many more columns that are not required for the analysis, so we drop them as well.
list1=['NAME_TYPE_SUITE','REGION_POPULATION_RELATIVE','DAYS_REGISTRATION','DAYS_ID_PUBLISH','FLAG_MOBIL',
           'FLAG_EMP_PHONE','FLAG_WORK_PHONE','FLAG_PHONE','FLAG_CONT_MOBILE','FLAG_EMAIL',
           'REGION_RATING_CLIENT','REGION_RATING_CLIENT_W_CITY','WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START',
           'REG_REGION_NOT_LIVE_REGION','REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION','REG_CITY_NOT_LIVE_CITY',
           'REG_CITY_NOT_WORK_CITY','LIVE_CITY_NOT_WORK_CITY','OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE',   
           'OBS_60_CNT_SOCIAL_CIRCLE',
           'DEF_60_CNT_SOCIAL_CIRCLE','DAYS_LAST_PHONE_CHANGE','FLAG_DOCUMENT_2','FLAG_DOCUMENT_3','FLAG_DOCUMENT_4',
           'FLAG_DOCUMENT_5','FLAG_DOCUMENT_6','FLAG_DOCUMENT_7','FLAG_DOCUMENT_8','FLAG_DOCUMENT_9','FLAG_DOCUMENT_10',
           'FLAG_DOCUMENT_11','FLAG_DOCUMENT_12','FLAG_DOCUMENT_13','FLAG_DOCUMENT_14','FLAG_DOCUMENT_15','FLAG_DOCUMENT_16',
            'FLAG_DOCUMENT_17','FLAG_DOCUMENT_18','FLAG_DOCUMENT_19','FLAG_DOCUMENT_20','FLAG_DOCUMENT_21']
df1.drop(labels=list1,axis=1,inplace=True)

In [None]:
# Checking data types.
df1.dtypes

In [None]:
# Checking for number of unique data in each column
df1.nunique().sort_values()

In [None]:
## Numeric conversion
num_col=['TARGET','CNT_CHILDREN','AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY','DAYS_BIRTH',
                'DAYS_EMPLOYED','EXT_SOURCE_2','EXT_SOURCE_3']
df1[num_col]=df1[num_col].apply(pd.to_numeric)

# Handling missing values and invalid data

In [None]:
##  Column -> OCCUPATION_TYPE
df1.OCCUPATION_TYPE.value_counts()

In [None]:
## Columns -> EXT_SOURCE_2 and EXT_SOURCE_3
print(df1.EXT_SOURCE_2.describe())
print(df1.EXT_SOURCE_3.describe())

> ### <font color=green>The external source data are normalized scores, and imputing them with wrong values may distort the analysis, so the missing values are left as they are.

In [None]:
## Columns -> AMT_REQ_CREDIT_BUREAU.*
print(df1.AMT_REQ_CREDIT_BUREAU_HOUR.value_counts())
print(df1.AMT_REQ_CREDIT_BUREAU_DAY.value_counts())
print(df1.AMT_REQ_CREDIT_BUREAU_WEEK.value_counts())
print(df1.AMT_REQ_CREDIT_BUREAU_MON.value_counts())
print(df1.AMT_REQ_CREDIT_BUREAU_QRT.value_counts())
print(df1.AMT_REQ_CREDIT_BUREAU_YEAR.value_counts())

> ### <font color=green>For every bureau field other than AMT_REQ_CREDIT_BUREAU_YEAR, the majority of records show zero enquiries to the credit bureau.

In [None]:
## Column -> AMT_ANNUITY
df1.AMT_ANNUITY.describe()

In [None]:
## The max is far above the mean (the distribution is right-skewed), so we impute with the median
df1.loc[df1['AMT_ANNUITY'].isnull(),'AMT_ANNUITY']=df1['AMT_ANNUITY'].median()
## Column ->AMT_GOODS_PRICE
df1.AMT_GOODS_PRICE.describe()

In [None]:
## filling the missing values with mean 
df1.loc[df1['AMT_GOODS_PRICE'].isnull(),'AMT_GOODS_PRICE']=df1['AMT_GOODS_PRICE'].mean()
## Column -> CNT_FAM_MEMBERS
df1.loc[df1['CNT_FAM_MEMBERS'].isnull()]

> ### <font color=green>Only 2 records have null values. Neither has payment difficulties, and the family status of both is known, so we impute the default value 1.

In [None]:
# Replacing missing values with 1.
df1['CNT_FAM_MEMBERS'] = df1['CNT_FAM_MEMBERS'].fillna(1.0)
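
The imputation choices in this section rest on how sensitive the mean is to extreme values. As a rough illustration, assuming a made-up annuity series with one extreme entry, the sketch below compares the two statistics before filling the gap.

```python
import numpy as np
import pandas as pd

# Made-up annuity values with one extreme entry and one missing entry
annuity = pd.Series([9000.0, 12000.0, 15000.0, 18000.0, 250000.0, np.nan])

mean_value = annuity.mean()       # pulled up by the extreme value
median_value = annuity.median()   # unaffected by how large the extreme value is

filled = annuity.fillna(median_value)
print(mean_value, median_value)
```

In this toy case the mean is about four times the median, so a mean-imputed record would look like an unusually large annuity, while a median-imputed record sits in the middle of the typical values.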

In [None]:
## Column -> CODE_GENDER
df1.CODE_GENDER.value_counts()

In [None]:
## We need to convert this column to numeric so that it can be used for correlation.
## So converting to 0 and 1 (Male and Female). XNA is mapped to 1.
df1['CODE_GENDER'] = df1['CODE_GENDER'].replace({"F":1, "M":0,"XNA":1}).astype(int)

In [None]:
## Converting Y/N to 1 and 0 respectively for FLAG_OWN_CAR and FLAG_OWN_REALTY, as these two fields are useful for
## correlation analysis (correlation considers only numeric fields)
df1['FLAG_OWN_CAR'] = df1['FLAG_OWN_CAR'].replace({"Y":1, "N":0}).astype(int)
df1['FLAG_OWN_REALTY'] = df1['FLAG_OWN_REALTY'].replace({"Y":1, "N":0}).astype(int)

In [None]:
## Column -> FAMILY_STATUS
df1.NAME_FAMILY_STATUS.value_counts()

In [None]:
## For the records with unknown family status we cannot impute a value, as the number of children is also unknown, so we retain them as Unknown
df1[df1['NAME_FAMILY_STATUS']=="Unknown"]

In [None]:
## Column -> ORGANIZATION_TYPE
df1.ORGANIZATION_TYPE.value_counts()

In [None]:
## 'XNA' is the second most frequent organization type. We cannot map it to any valid value, so we convert
## it to NaN. This raises the share of NaN to 18%.
df1.loc[df1['ORGANIZATION_TYPE'] == 'XNA', 'ORGANIZATION_TYPE'] = np.nan

In [None]:
## Checking Days birth and days employed fields
print(df1['DAYS_BIRTH'].value_counts())
print(df1['DAYS_EMPLOYED'].value_counts())

> ### <font color=green>Days of birth and days employed are negative, so we need to convert them to positive values.
There are no outliers in days of birth.
We add one more column, age group, for our analysis.
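
A note on the conversion: Python's `//` floors toward negative infinity, so the order of `abs` and `//` matters when the day counts are negative. The sketch below uses made-up `DAYS_BIRTH` values to show the difference and then bins the completed years with the same edges as this notebook.

```python
import pandas as pd

# Made-up values: days before the application date, hence negative
days_birth = pd.Series([-9461, -16765, -7300, -25229])

age_floor_then_abs = (days_birth // 365).abs()   # floors toward minus infinity first, so it rounds up
age_abs_then_floor = days_birth.abs() // 365     # completed years

bins = [0, 20, 30, 40, 50, 60, 70, 100]
labels = ['0-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70 and above']
age_group = pd.cut(age_abs_then_floor, bins=bins, labels=labels)

print(age_floor_then_abs.tolist(), age_abs_then_floor.tolist())
```

The two results differ by one year whenever the day count is not an exact multiple of 365, which can move an applicant across a bin edge.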

In [None]:
## Adding a new column called age group. DAYS_BIRTH is negative, so we take the absolute value first and then floor to whole years.
df1['AGE']=abs(df1["DAYS_BIRTH"])//365
slots = ['0-20','20-30','30-40','40-50','50-60','60-70','70 and above']
bins = [0,20,30,40,50,60,70,100]
df1['AGE_GROUP']=pd.cut(df1['AGE'],bins,labels=slots)

In [None]:
#Removing the column Age
df1=df1.drop('AGE',axis=1)

In [None]:
## The majority of the records with 365243 are without payment difficulties. It may be a default or maximum value.
## So filling with NaN
df1.loc[df1['DAYS_EMPLOYED'] == 365243, 'DAYS_EMPLOYED'] = np.nan
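
Placeholder values such as `365243` and `'XNA'` can also be cleared in one pass. This is a minimal sketch on a made-up frame; the dictionary name `sentinels` is an assumption for illustration, and `Series.mask` is used as an alternative to the `.loc` assignments in this notebook.

```python
import pandas as pd

# Made-up rows; 365243 and 'XNA' act as placeholders for "not available"
toy = pd.DataFrame({
    "DAYS_EMPLOYED": [-1200, 365243, -45, 365243],
    "ORGANIZATION_TYPE": ["School", "XNA", "Bank", "XNA"],
})
sentinels = {"DAYS_EMPLOYED": 365243, "ORGANIZATION_TYPE": "XNA"}

cleaned = toy.copy()
for col, sentinel in sentinels.items():
    # mask() replaces the entries where the condition is True with a missing value
    cleaned[col] = cleaned[col].mask(cleaned[col] == sentinel)

print(cleaned.isna().sum())
```

Keeping the placeholders in a single mapping makes it easy to see, and to extend, which values are treated as missing.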

In [None]:
# Plotting to see the Days employed column
abs(df1['DAYS_EMPLOYED']).plot.hist(title = 'Employment days')
plt.xlabel('Days of Employment')

In [None]:
## Adding a new column with years of experience. DAYS_EMPLOYED is negative, so we take the absolute value first and then floor to whole years.
df1["YEARS_EXPERIENCE"]=abs(df1["DAYS_EMPLOYED"])//365

In [None]:
# Removing days employed and days birth as they are not required for the analysis
df1=df1.drop('DAYS_EMPLOYED',axis=1)
df1=df1.drop('DAYS_BIRTH',axis=1)

In [None]:
## Since we have different income category of people it is good to bin the income slot to do the analysis. 
## Adding a new column to see the income range.
slots = ['0-50000','50000-100000','100000-150000', '150000-200000','200000-250000','250000-300000',
        '300000-350000','350000-400000','400000-450000','450000-500000','500000 and above']
bins = [0,50000,100000,150000,200000,250000,300000,350000,400000,450000,500000,10000000000]
df1['AMT_INCOME_RANGE']=pd.cut(df1['AMT_INCOME_TOTAL'],bins,labels=slots)
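
One detail of `pd.cut` is worth keeping in mind when reading the income ranges: by default each bin is open on the left and closed on the right. The sketch below uses made-up incomes and a shortened, hypothetical set of bins to show where the edge values land.

```python
import pandas as pd

income = pd.Series([0, 50000, 50001, 120000, 750000])   # made-up incomes
bins = [0, 50000, 100000, 150000, 500000, 10_000_000_000]
labels = ['0-50000', '50000-100000', '100000-150000', '150000-500000', '500000 and above']

# Default: intervals are (left, right], so 0 falls outside the first bin
default_cut = pd.cut(income, bins=bins, labels=labels)
# include_lowest=True closes the first interval on the left as well
with_lowest = pd.cut(income, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(with_lowest.tolist())
```

An income of exactly 50000 is therefore counted in '0-50000', and an income of 0 becomes NaN unless `include_lowest=True` is passed.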

# Handling outliers

In [None]:
## Checking for Outliers in AMT_ANNUITY and AMT_GOODS_PRICE columns.
print(df1['AMT_ANNUITY'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,0.95,0.99,0.999,1]))
print(df1['AMT_GOODS_PRICE'].quantile([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9,.95,.99,.999,1]))

In [None]:
# Visualising the outliers for AMT_ANNUITY
df1.boxplot(column='AMT_ANNUITY',figsize=(4,4))

In [None]:
# Identifying the record with the max value.
df1.loc[df1['AMT_ANNUITY'] >= 258025.5]

> ### <font color=green>There are outliers in both columns, but the maximum value in both columns belongs to the same record. The applicant has higher education with only 1 year of experience, so this record needs to be removed.

In [None]:
## Visualising Amount Annuity and Credit via scatter plot
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
p=sns.scatterplot(x="AMT_ANNUITY", y="AMT_CREDIT", data=df1)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title(" Annuity vs Credit")

> ### <font color=green>AMT_ANNUITY and AMT_CREDIT show a linear relationship, so it is reasonable to proceed with this data.

In [None]:
## Checking for outliers in AMT_INCOME_TOTAL
df1.boxplot(column='AMT_INCOME_TOTAL',figsize=(4,4))

> ### <font color=green>The minimum value belongs to cleaning staff, which is valid data.

In [None]:
#Checking what is the occupation type of the person with highest AMT_INCOME_TOTAL
df1.loc[df1['AMT_INCOME_TOTAL'] == df1['AMT_INCOME_TOTAL'].max(), ['AMT_INCOME_TOTAL','OCCUPATION_TYPE']]

> ### <font color=green>It is evident that a laborer cannot be in the highest income category. This is a true outlier and needs to be removed, so we analyse further with the z-score method.

In [None]:
## Handling outliers using the z-score approach - finding the rows that are more than 3 standard deviations from the mean.
from scipy import stats
out= df1[np.abs(stats.zscore(df1['AMT_INCOME_TOTAL'])) >3]
print(len(out))

> ### <font color=green>There are 454 records that are outliers and need to be removed.
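
The z-score rule used above can be written out directly, which makes the threshold explicit. This is a sketch with a hypothetical helper `zscore_outliers` and made-up incomes; it uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Boolean mask of values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std(ddof=0)   # ddof=0: population standard deviation
    return z.abs() > threshold

# Made-up incomes: thirty typical values and one extreme value
income = pd.Series([100_000.0] * 30 + [5_000_000.0])
mask = zscore_outliers(income)

print(int(mask.sum()))
```

The mask can then be used either to inspect the flagged rows (`income[mask]`) or to keep the rest (`income[~mask]`).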

In [None]:
## Analysing CNT_CHILDREN 
print(df1.CNT_CHILDREN.value_counts())
df1.boxplot(column='CNT_CHILDREN',figsize=(4,4))

> ### <font color=green>There are a few outliers in the CNT_CHILDREN column. The people with the highest number of children are not married.
A few of the people with more than 10 children are also not married.
We suspect that these people might be running a charity.
Only 1 of them has payment difficulties.

In [None]:
## Analysing AMT_CREDIT column
print(df1.AMT_CREDIT.describe())
df1.boxplot(column='AMT_CREDIT',figsize=(4,4))

In [None]:
df1.loc[df1['AMT_CREDIT'] == df1['AMT_CREDIT'].max(), ['AMT_CREDIT','OCCUPATION_TYPE']]

> ### <font color=green>The maximum credit amount is for managers and accountants, which looks like valid data.

# Checking Imbalance in data

In [None]:
## Checking the imbalance with Target column
(df1.TARGET.sum()/len(df1))*100

In [None]:
## Visualising the % of Defaulters vs Non Defaulters in the dataset.
## 1 indicates Defaulters and 0 indicates non defaulters.
df1.TARGET.value_counts(normalize=True).plot.pie(autopct='%1.1f%%')

In [None]:
### Imbalance ratio
target0_df=df1.loc[df1["TARGET"]==0]
target1_df=df1.loc[df1["TARGET"]==1]
ratio=round(len(target0_df)/len(target1_df),2)
print("Imbalance ratio from Defaulters to non Defaulters is -> 1 :",ratio)

# Univariate Analysis

In [None]:
## Univariate analysis of AMOUNT/EXT/BUREAU/Years fields
col = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE','EXT_SOURCE_2','EXT_SOURCE_3',
       'AMT_REQ_CREDIT_BUREAU_YEAR','YEARS_EXPERIENCE']
# Plotting using box plots. Hiding the outliers with the showfliers option
for i in col:
    plt.figure(figsize=(18,8))
    plt.subplot(1,2,1)
    target0_df.boxplot(column=i,showfliers=False)
    plt.title('Non Defaulters (Target = 0)')
    plt.subplot(1,2,2)
    target1_df.boxplot(column=i,showfliers=False)
    plt.title('Defaulters (Target = 1)')


> ### <font color=green>Conclusions from the above plots:

> ### <font color=green>Nothing can be inferred from AMT_GOODS_PRICE and AMT_ANNUITY, as the distribution is almost the same for defaulters and non-defaulters.
> ### <font color=green>For defaulters, total income is mostly in the third quartile, from 1.35 lakhs to 2 lakhs.
> ### <font color=green>The credit amount is slightly lower for defaulters than for non-defaulters.
> ### <font color=green>The EXT_SOURCE_2 median is 5.75 for defaulters and 4.9 for non-defaulters.
> ### <font color=green>The EXT_SOURCE_3 median is 5.5 for defaulters and 3.9 for non-defaulters.
> ### <font color=green>Defaulters had more credit bureau enquiries: the AMT_REQ_CREDIT_BUREAU_YEAR median is 2 for defaulters, whereas it is 1 for non-defaulters.
> ### <font color=green>For defaulters, years of experience mostly range from 4 to 7, whereas they range from 5 to 9 for non-defaulters.
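
Medians read off a box plot are approximate, so it can help to compute them per target group as a cross-check. The sketch below runs on a small made-up frame, not on the application data, and only illustrates the `groupby` pattern.

```python
import pandas as pd

# Made-up rows standing in for the application data
toy = pd.DataFrame({
    "TARGET":       [0, 0, 0, 0, 1, 1, 1],
    "EXT_SOURCE_2": [0.60, 0.55, 0.70, 0.65, 0.30, 0.45, 0.40],
    "AMT_CREDIT":   [500000, 450000, 600000, 550000, 400000, 420000, 380000],
})

# One row per target value, one column per field, each cell the group median
summary = toy.groupby("TARGET")[["EXT_SOURCE_2", "AMT_CREDIT"]].median()
print(summary)
```

Applied to the real data, the same call with the `col` list from the cell above would give the exact medians behind each pair of box plots.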

In [None]:
## Univariate analysis on Occupation type
plt.figure(figsize=(25,18))
plt.subplot(1,2,1)
p = sns.countplot(data=target0_df, x= 'OCCUPATION_TYPE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target = 0")
plt.subplot(1,2,2)
p = sns.countplot(data=target1_df, x= 'OCCUPATION_TYPE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target = 1")

In [None]:
## Univariate analysis on Gender
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
ax = sns.countplot(data=target0_df, x= 'CODE_GENDER')
plt.ylabel(" Counts")
plt.title("Target = 0")
total = len(target0_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.03
        y = p.get_y() + p.get_height()/3
        ax.annotate(percentage, (x, y))
plt.subplot(1,2,2)
ax = sns.countplot(data=target1_df, x= 'CODE_GENDER')
plt.ylabel("Counts")
plt.title("Target = 1")

total = len(target1_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.03
        y = p.get_y() + p.get_height()/3
        ax.annotate(percentage, (x, y))

In [None]:
## Univariate analysis on Number of children
plt.figure(figsize=(13,8))
plt.subplot(1,2,1)
p = sns.countplot(data=target0_df, x= 'CNT_CHILDREN')
plt.title("Target = 0")
plt.subplot(1,2,2)
p = sns.countplot(data=target1_df, x= 'CNT_CHILDREN')
plt.title("Target = 1")

In [None]:
# Univariate analysis on NAME_CONTRACT_TYPE
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
ax = sns.countplot(data=target0_df, x= 'NAME_CONTRACT_TYPE')
total = len(target0_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
ax = sns.countplot(data=target1_df, x= 'NAME_CONTRACT_TYPE')
total = len(target1_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis on OWN CAR
plt.figure(figsize=(8,8))
plt.subplot(1,2,1)
ax = sns.countplot(data=target0_df, x= 'FLAG_OWN_CAR')
total = len(target0_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))
        
plt.title("Target 0")
plt.subplot(1,2,2)
ax = sns.countplot(data=target1_df, x= 'FLAG_OWN_CAR')
total = len(target1_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y))
plt.title("Target 1")

In [None]:
# Univariate analysis on Income Type
plt.figure(figsize=(10,6))
plt.subplot(1,2,1)
p = sns.countplot(data=target0_df, x= 'NAME_INCOME_TYPE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=target1_df, x= 'NAME_INCOME_TYPE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis on Family Status
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
ax = sns.countplot(data=target0_df, x= 'NAME_FAMILY_STATUS')
plt.xticks(rotation='vertical')

total = len(target0_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/1.25
        ax.annotate(percentage, (x, y))

plt.title("Target 0")
plt.subplot(1,2,2)
ax = sns.countplot(data=target1_df, x= 'NAME_FAMILY_STATUS')
plt.xticks(rotation='vertical')
    

total = len(target1_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/1.25
        ax.annotate(percentage, (x, y))


plt.title("Target 1")

In [None]:
# Univariate analysis on Housing Type
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
ax= sns.countplot(data=target0_df, x= 'NAME_HOUSING_TYPE')


total = len(target0_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/1.25
        ax.annotate(percentage, (x, y))
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
ax = sns.countplot(data=target1_df, x= 'NAME_HOUSING_TYPE')


total = len(target1_df)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/1.25
        ax.annotate(percentage, (x, y))
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis on Organization type
plt.figure(figsize=(25,18))
plt.subplot(1,2,1)
p = sns.countplot(data=target0_df, y= 'ORGANIZATION_TYPE')
plt.xlabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=target1_df, y= 'ORGANIZATION_TYPE')
plt.xlabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis on Age group
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
p = sns.countplot(data=target0_df, x= 'AGE_GROUP')
plt.ylabel(" Counts")
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=target1_df, x= 'AGE_GROUP')
plt.ylabel(" Counts")
plt.xticks(rotation='vertical')
plt.title("Target 1")

In [None]:
# Univariate analysis on Income Range
plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
p = sns.countplot(data=target0_df, x= 'AMT_INCOME_RANGE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=target1_df, x= 'AMT_INCOME_RANGE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

> ### <font color=green>Conclusions from the univariate plots:

> ### <font color=green>Laborers and sales staff tend to take more loans in both cases.
> ### <font color=green>Females take more loans and also appear more often among defaulters.
> ### <font color=green>Cash loans are preferred in both categories.
> ### <font color=green>The number of children is mostly 0 or 1 in both cases.
> ### <font color=green>Married and unmarried applicants are the most likely to take loans.
> ### <font color=green>Applicants with income type Working or Commercial associate are more likely to take loans.
> ### <font color=green>They mostly live in a house/apartment or with parents.
> ### <font color=green>Business entity 2 and Self-employed organization types are more likely to take loans.
> ### <font color=green>There are no defaulters in the Student and Businessman income categories.
> ### <font color=green>All age groups from 20 to 70 are taking loans, and most defaulters are in the 30-40 age group.
> ### <font color=green>Defaulters' income is mostly between 1 lakh and 2 lakhs. The fewest defaulters are in the 4.5 lakhs+ category.
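
Count plots show how many defaulters fall in each category, but a large category will have many defaulters simply because it is large. A per-category default rate separates group size from risk. The following is a sketch on made-up data with a hypothetical helper `default_rate_by`; it is not a result from the application data.

```python
import pandas as pd

# Made-up rows: 8 female and 2 male applicants
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "F", "F", "F", "F", "F", "F", "F", "M", "M"],
    "TARGET":      [0,   0,   0,   0,   0,   0,   1,   1,   0,   1],
})

def default_rate_by(df, column, target="TARGET"):
    """Per-category applicant count and share of defaulters in percent."""
    out = df.groupby(column)[target].agg(applicants="count", default_rate="mean")
    out["default_rate"] = out["default_rate"] * 100
    return out.sort_values("default_rate", ascending=False)

rates = default_rate_by(toy, "CODE_GENDER")
print(rates)
```

In this toy frame the larger group contributes more defaulters in absolute terms (2 against 1) yet has the lower default rate (25% against 50%), which is the distinction a count plot alone cannot show.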

# Univariate Segmented Analysis

In [None]:
# Univariate analysis of Age group and Family members
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
p = sns.countplot(x = "AGE_GROUP", hue= 'CNT_FAM_MEMBERS', data = target0_df)
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "AGE_GROUP", hue= 'CNT_FAM_MEMBERS', data = target1_df)
plt.xticks(rotation='vertical')
plt.title("Target 1")

In [None]:
# Univariate analysis of Education type and Bureau enquiries.
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(x = "NAME_EDUCATION_TYPE", hue= 'AMT_REQ_CREDIT_BUREAU_YEAR', data = target0_df)
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "NAME_EDUCATION_TYPE", hue= 'AMT_REQ_CREDIT_BUREAU_YEAR', data = target1_df)
plt.xticks(rotation='vertical')
plt.title("Target 1")

In [None]:
## Univariate analysis on Age group and Number of children
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
p = sns.countplot(x = "AGE_GROUP", hue= 'CNT_CHILDREN', data = target0_df)
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "AGE_GROUP", hue= 'CNT_CHILDREN', data = target1_df)
plt.xticks(rotation='vertical')
plt.title("Target 1")

In [None]:
# Univariate analysis of Age group with Education type.
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(x = "AGE_GROUP", hue= 'NAME_EDUCATION_TYPE', data = target0_df)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "AGE_GROUP", hue= 'NAME_EDUCATION_TYPE', data = target1_df)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")


In [None]:
# Univariate analysis of Income range with Family members.
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(x = "AMT_INCOME_RANGE", hue= 'CNT_FAM_MEMBERS', data = target0_df)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "AMT_INCOME_RANGE", hue= 'CNT_FAM_MEMBERS', data = target1_df)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")

> ### <font color=green>Conclusions about defaulters from the univariate segmented plots:

> ### <font color=green>The 30-50 age group with 2 family members has more defaulters. The majority of them have only secondary education.
> ### <font color=green>Most of them have either 0 or 1 children.
> ### <font color=green>This age group makes more enquiries with the bank in both cases.
> ### <font color=green>Their income ranges between 1 and 2 lakhs.

# Bivariate Analysis

In [None]:
# Bivariate Analysis of EXT_SOURCE_3 and Income Total
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
p = sns.scatterplot(y="EXT_SOURCE_3", x="AMT_INCOME_TOTAL", data=target0_df)
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.scatterplot(y="EXT_SOURCE_3", x="AMT_INCOME_TOTAL", data=target1_df)
plt.xticks(rotation='vertical')
plt.title("Target 1") 

In [None]:
# Bivariate Analysis of EXT_SOURCE_3 and Amount Credit
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
p = sns.scatterplot(y="EXT_SOURCE_3", x="AMT_CREDIT", data=target0_df)
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.scatterplot(y="EXT_SOURCE_3", x="AMT_CREDIT", data=target1_df)
plt.xticks(rotation='vertical')
plt.title("Target 1")

In [None]:
# Bivariate Analysis of Years Experience and Amount Income total
plt.figure(figsize=(15,8))
plt.subplot(1,2,1)
p=sns.scatterplot(y="YEARS_EXPERIENCE", x="AMT_INCOME_TOTAL", data=target0_df)
plt.xticks(rotation='vertical')
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.scatterplot(y="YEARS_EXPERIENCE", x="AMT_INCOME_TOTAL", data=target1_df)
plt.xticks(rotation='vertical')
plt.title("Target 1")

> ### <font color=green>External source 3 is a good column for analysis. The normalized score is much less scattered for defaulters: it stays below the 10k income range, whereas for non-defaulters it is well distributed up to 2 lakhs. Against credit, the score is scattered up to 1.5 lakhs for defaulters, whereas it reaches 2 lakhs for non-defaulters.
> ### <font color=green>Defaulters maintain a lower total income compared with non-defaulters.

In [None]:
# Bivariate Analysis with box plot for Target 0 ->Occupation type vs Credit
sns.catplot(data =target0_df, x='OCCUPATION_TYPE', y='AMT_CREDIT',kind='box',color='b',showfliers=False)
plt.xticks(rotation='vertical')
plt.title("Occupation vs Credit Amount")

In [None]:
# Bivariate Analysis with box plot for Target 1 ->Occupation type vs Credit
sns.catplot(data =target1_df, x='OCCUPATION_TYPE', y='AMT_CREDIT',kind='box',color='b',showfliers=False)
plt.xticks(rotation='vertical')
plt.title("Occupation vs Credit Amount")

In [None]:
# Bivariate Analysis with box plot for Target 0 ->Occupation type vs Income Total
sns.boxplot(data =target0_df, x='OCCUPATION_TYPE', y='AMT_INCOME_TOTAL',color='b',showfliers=False)
plt.xticks(rotation='vertical')
plt.title("Occupation Type vs Income total")

> ### <font color=green>Among defaulters, managers and accountants have higher credit amounts. Low-skill laborers and waiters have low credit amounts.
> ### <font color=green>Among defaulters, managers have a higher total income, and cleaning staff, cooking staff and low-skill laborers have a low income range.

# Segmented Bivariate Analysis w.r.t Target 1

In [None]:
# Plotting with Family status/Income Total and education type
plt.figure(figsize=(16,12))
sns.boxplot(data =target1_df, x='NAME_FAMILY_STATUS', y='AMT_INCOME_TOTAL', hue ='NAME_EDUCATION_TYPE',showfliers=False)
plt.title('Income Total/Family Status')
plt.show()

> ### <font color=green>Among defaulters, academic degree holders have the highest total income and applicants with lower secondary education have the lowest. Both of these categories belong to widows.

In [None]:
# Plotting with Family Status/Occupation type and Credit amount
plt.figure(figsize=(16,12))
sns.boxplot(data =target1_df, x='NAME_FAMILY_STATUS', y='AMT_CREDIT', hue ='OCCUPATION_TYPE',showfliers=False)
plt.title('Family Status / Occupation type / Credit amount')
plt.show()

> ### <font color=green>For defaulters, the highest and lowest credit amounts are held by widows. The people with the highest credit amounts work as accountants or managers. The people with the lowest credit amounts work as low-skill laborers.

# Correlation

In [None]:
## Converting AGE_GROUP from category to numeric, as this column is needed for the correlation analysis.

age_map = {"0-20":1,"20-30":2,"30-40":3,"40-50":4,"50-60":5,"60-70":6,"70 and above":7}
# astype(object) drops the categorical dtype so that map() returns plain numbers (NaN stays NaN).
# assign() returns a new frame, which avoids modifying a slice of df1 in place.
target0_df = target0_df.assign(
    AGE_GROUP=pd.to_numeric(target0_df['AGE_GROUP'].astype(object).map(age_map), errors='coerce'))
target1_df = target1_df.assign(
    AGE_GROUP=pd.to_numeric(target1_df['AGE_GROUP'].astype(object).map(age_map), errors='coerce'))

In [None]:
## Converting AMT_INCOME_RANGE from category to numeric, as this column is needed for the correlation analysis.

# One rank per income slot, in ascending order ("0-50000" is rank 1, each slot appears exactly once)
income_map = {"0-50000":1,"50000-100000":2,"100000-150000":3,"150000-200000":4,"200000-250000":5,
              "250000-300000":6,"300000-350000":7,"350000-400000":8,"400000-450000":9,
              "450000-500000":10,"500000 and above":11}
# astype(object) drops the categorical dtype so that map() returns plain numbers (NaN stays NaN)
target0_df = target0_df.assign(
    AMT_INCOME_RANGE=pd.to_numeric(target0_df['AMT_INCOME_RANGE'].astype(object).map(income_map), errors='coerce'))
target1_df = target1_df.assign(
    AMT_INCOME_RANGE=pd.to_numeric(target1_df['AMT_INCOME_RANGE'].astype(object).map(income_map), errors='coerce'))
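
Because `pd.cut` returns an ordered categorical, the rank of each range is already stored in the column and can be read from `cat.codes` without a hand-written mapping. This is a sketch of that alternative on made-up incomes with a shortened, hypothetical set of bins; it is not the approach used in this notebook.

```python
import pandas as pd

labels = ['0-50000', '50000-100000', '100000-150000', '150000 and above']
bins = [0, 50000, 100000, 150000, 10_000_000_000]
income = pd.Series([30000, 75000, 140000, 900000, None])   # made-up incomes, one missing

income_range = pd.cut(income, bins=bins, labels=labels)    # ordered categorical

# cat.codes numbers the categories 0..n-1 in bin order and uses -1 for missing values
codes = income_range.cat.codes.astype("int64")
income_rank = (codes + 1).where(codes >= 0)                 # ranks 1..n, NaN where the income was missing

print(income_rank.tolist())
```

Since the codes follow the bin order, adding or removing a bin cannot leave a label unmapped or mapped twice.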

In [None]:
# Correlation matrix columns -> Removed ID, Target, AMT_INCOME_TOTAL (kept the income range instead),
# Bureau fields other than Year

target0_corr=target0_df[['NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_INCOME_TYPE',
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'ORGANIZATION_TYPE',
       'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'AGE_GROUP', 'YEARS_EXPERIENCE',
       'AMT_INCOME_RANGE']]
target1_corr=target1_df[['NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_INCOME_TYPE',
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'ORGANIZATION_TYPE',
       'EXT_SOURCE_2', 'EXT_SOURCE_3','AMT_REQ_CREDIT_BUREAU_YEAR', 'AGE_GROUP', 'YEARS_EXPERIENCE',
       'AMT_INCOME_RANGE']] 

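# Correlation is defined only for numeric columns, so the text columns are excluded explicitly
target0=target0_corr.select_dtypes(include='number').corr()
target1=target1_corr.select_dtypes(include='number').corr()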

In [None]:
# Seeing Target0
target0

In [None]:
# Seeing Target1
target1

In [None]:
# Finding Top 10 correlation matrix for Target = 0
corrdf = target0.where(np.triu(np.ones(target0.shape), k=1).astype(bool))
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Variable_1', 'Variable_2', 'Correlation']
corrdf.dropna(subset = ['Correlation'], inplace = True)
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
corrdf.sort_values(by = 'Correlation', ascending = False).reset_index(drop=True).head(10)

In [None]:
# Top 10 correlation matrix for Target = 1

corrdf = target1.where(np.triu(np.ones(target1.shape), k=1).astype(bool))
corrdf = corrdf.unstack().reset_index()
corrdf.columns = ['Variable_1', 'Variable_2', 'Correlation']
corrdf.dropna(subset = ['Correlation'], inplace = True)
corrdf['Correlation'] = round(corrdf['Correlation'], 2)
corrdf['Correlation'] = abs(corrdf['Correlation'])
corrdf.sort_values(by = 'Correlation', ascending = False).reset_index(drop=True).head(10)

In [None]:
# Visualisation of correlation using a heat map.

def corr_matrix(data,title):
    plt.figure(figsize=(10, 12))
    plt.rcParams['axes.titlesize'] = 25
    plt.rcParams['axes.titlepad'] = 50
    # Masking the upper triangle
    mask = np.zeros_like(data, dtype=bool)
    mask[np.triu_indices_from(mask)] = True

    # Keep the diagonal elements visible
    mask[np.diag_indices_from(mask)] = False

    # Heatmap with a color map of choice
    ax=sns.heatmap(data, cmap="YlGnBu",mask=mask,annot=True,linewidth=.3)
    # Workaround for matplotlib 3.1.1, which cuts off the top and bottom rows of seaborn heatmaps
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)
    plt.title(title)
    plt.yticks(rotation=0)
    plt.show()


In [None]:
# For Target 0
corr_matrix(data=target0,title='Correlation matrix for Target 0')

In [None]:
# For Target 1
corr_matrix(data=target1,title='Correlation matrix for Target 1')

> ### <font color=green>The top 10 correlations of Target 0 and Target 1 are almost the same. Ranks 3/4 and 8/9 are interchanged.
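
The top-10 extraction is repeated once per target, so it can be wrapped in a function. The sketch below is one possible shape for it, with a hypothetical name `top_correlations` and a made-up frame; it keeps each pair once by masking everything except the upper triangle.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=10):
    """Return the n strongest absolute pairwise correlations among numeric columns."""
    corr = df.select_dtypes(include="number").corr().abs()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)   # keep each pair once, skip the diagonal
    pairs = corr.where(upper).stack().dropna()              # drop the masked cells
    pairs.index.names = ["Variable_1", "Variable_2"]
    return pairs.sort_values(ascending=False).head(n).rename("Correlation").reset_index()

# Made-up rows; the text column is ignored by select_dtypes
toy = pd.DataFrame({
    "AMT_CREDIT":       [100.0, 200.0, 300.0, 400.0, 500.0],
    "AMT_GOODS_PRICE":  [90.0, 180.0, 270.0, 360.0, 450.0],   # exactly proportional to AMT_CREDIT
    "CNT_CHILDREN":     [2.0, 0.0, 1.0, 0.0, 2.0],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Working", "Student", "Working"],
})
top = top_correlations(toy, n=3)
print(top)
```

Calling the function on each target subset and placing the two tables side by side makes rank swaps like the ones noted above easier to spot.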

In [None]:
# Reading the previous application CSV dataset
df2=pd.read_csv(r"../input/loan-defaulter/previous_application.csv")
df2.head()

In [None]:
# Checking the information
df2.info()

In [None]:
# Checking the statistics
df2.describe()

In [None]:
## Checking for missing values
df2.isnull().sum()

In [None]:
# Dropping columns with more than 40 percent null values
null_column_prev =round((df2.isnull().sum()/len(df2))*100,4) 
null_column_40_prev = null_column_prev[null_column_prev.values > 40.0000]
null_column_40_prev = list(null_column_40_prev.index)
df2.drop(labels=null_column_40_prev,axis=1,inplace=True)
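
The same 40 percent rule is applied to both CSV files, so it can be wrapped in a small function. The sketch below is one possible way to do that; `drop_sparse_columns` and the toy frame are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=40.0):
    """Return a copy of df without columns whose null percentage exceeds threshold,
    together with the list of dropped column names."""
    null_pct = df.isnull().mean() * 100          # percentage of nulls per column
    sparse = null_pct[null_pct > threshold].index.tolist()
    return df.drop(columns=sparse), sparse

# Hypothetical toy data
demo = pd.DataFrame({
    'kept': [1, 2, 3, 4, 5],
    'mostly_null': [np.nan, np.nan, np.nan, 4, 5],   # 60% null
    'some_null': [1, np.nan, 3, 4, 5],               # 20% null
})
cleaned, dropped = drop_sparse_columns(demo)
print(dropped)
```

Returning the dropped names alongside the cleaned frame makes it easy to record which columns were removed.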

In [None]:
## Removing some more columns which are not important
list2  = ['WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START','FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
          'DAYS_DECISION','NAME_PAYMENT_TYPE','NAME_CLIENT_TYPE','SELLERPLACE_AREA','CNT_PAYMENT','NAME_YIELD_GROUP',
          'PRODUCT_COMBINATION']
df2.drop(labels=list2,axis=1,inplace=True)

In [None]:
## Merging the application and previous application datasets on SK_ID_CURR
merge_out = pd.merge(df1, df2, how='inner', on='SK_ID_CURR')

In [None]:
# Viewing the merged dataset
merge_out.head()

> ### <font color=green>The same SK_ID_CURR can appear with different NAME_CONTRACT_STATUS values, so it is good to keep all the entries
Column names that occur in both files are suffixed with _x (current application) and _y (previous application)
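
The `_x` and `_y` suffixes are the `pd.merge` defaults for overlapping column names. As a small illustration on toy data, the `suffixes` parameter can give them more descriptive labels; the labels `_CURR` and `_PREV` are hypothetical, and the rest of this notebook keeps using `_x` and `_y`.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two CSV files
current = pd.DataFrame({'SK_ID_CURR': [1, 2], 'AMT_CREDIT': [500.0, 800.0]})
previous = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT': [100.0, 200.0, 300.0],
    'NAME_CONTRACT_STATUS': ['Approved', 'Refused', 'Approved'],
})

# One current application can match several previous applications
merged_demo = pd.merge(current, previous, how='inner', on='SK_ID_CURR',
                       suffixes=('_CURR', '_PREV'))
print(merged_demo)
```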

In [None]:
## Replacing the placeholder codes XNA and XAP with NaN in the columns required for the analysis.
## np.nan is used because the np.NaN alias was removed in NumPy 2.0.
merge_out.loc[merge_out['NAME_CASH_LOAN_PURPOSE']=='XNA','NAME_CASH_LOAN_PURPOSE'] = np.nan
merge_out.loc[merge_out['NAME_CASH_LOAN_PURPOSE']=='XAP','NAME_CASH_LOAN_PURPOSE'] = np.nan
merge_out.loc[merge_out['NAME_PORTFOLIO']=='XNA','NAME_PORTFOLIO'] = np.nan
merge_out.loc[merge_out['NAME_GOODS_CATEGORY']=='XNA','NAME_GOODS_CATEGORY'] = np.nan
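
The same replacement can be driven by a single mapping of columns to placeholder codes. This is only a sketch on toy data; the `placeholders` dictionary and the `demo` frame are assumptions made for illustration.

```python
import pandas as pd

# Columns and the placeholder codes that should be treated as missing
placeholders = {
    'NAME_CASH_LOAN_PURPOSE': ['XNA', 'XAP'],
    'NAME_PORTFOLIO': ['XNA'],
    'NAME_GOODS_CATEGORY': ['XNA'],
}

# Hypothetical toy data
demo = pd.DataFrame({
    'NAME_CASH_LOAN_PURPOSE': ['XAP', 'Repairs', 'XNA'],
    'NAME_PORTFOLIO': ['POS', 'XNA', 'Cash'],
    'NAME_GOODS_CATEGORY': ['Mobile', 'XNA', 'XNA'],
})

for column, codes in placeholders.items():
    # mask() sets NaN wherever the condition is True
    demo[column] = demo[column].mask(demo[column].isin(codes))

print(demo.isna().sum())
```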

In [None]:
merge_out.info()

# Univariate Analysis on Merged Dataset

In [None]:
## Splitting merge-out dataset based on Target fields.
mer_tar0 = merge_out[merge_out['TARGET'] == 0]
mer_tar1 = merge_out[merge_out['TARGET'] == 1]

In [None]:
## column Name contract Status
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
ax = sns.countplot(data=mer_tar0, x= 'NAME_CONTRACT_STATUS')
plt.xticks(rotation='vertical')
total = len(mer_tar0)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.03
        y = p.get_y() + p.get_height()/3
        ax.annotate(percentage, (x, y))
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
ax = sns.countplot(data=mer_tar1, x= 'NAME_CONTRACT_STATUS')
plt.xticks(rotation='vertical')
total = len(mer_tar1)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() + 0.03
        y = p.get_y() + p.get_height()/3
        ax.annotate(percentage, (x, y))
plt.ylabel(" Counts")
plt.title("Target 1")
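
The percentages annotated on the bars above are each bar's height divided by the number of rows. The same shares can be read as a table with `value_counts(normalize=True)`; the helper name `category_share` and the toy data below are illustrative assumptions.

```python
import pandas as pd

def category_share(df, column):
    """Percentage of rows per category, rounded to one decimal place."""
    return (df[column].value_counts(normalize=True) * 100).round(1)

# Hypothetical toy data
demo = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved'] * 6 + ['Refused'] * 3 + ['Canceled'] * 1
})
share = category_share(demo, 'NAME_CONTRACT_STATUS')
print(share)
```

Applied to the notebook's frames, this would presumably be `category_share(mer_tar0, 'NAME_CONTRACT_STATUS')` and the same call on `mer_tar1`.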

In [None]:
# Column Goods category
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(data=mer_tar0, x= 'NAME_GOODS_CATEGORY')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=mer_tar1, x= 'NAME_GOODS_CATEGORY')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Column Name Portfolio
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(data=mer_tar0, x= 'NAME_PORTFOLIO')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=mer_tar1, x= 'NAME_PORTFOLIO')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Column Channel type
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
p = sns.countplot(data=mer_tar0, x= 'CHANNEL_TYPE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=mer_tar1, x= 'CHANNEL_TYPE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Column Cash loan purpose
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(data=mer_tar0, x= 'NAME_CASH_LOAN_PURPOSE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(data=mer_tar1, x= 'NAME_CASH_LOAN_PURPOSE')
plt.xticks(rotation='vertical')
plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis of education type with contract status
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(x = "NAME_EDUCATION_TYPE", hue= 'NAME_CONTRACT_STATUS', data = mer_tar0)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "NAME_EDUCATION_TYPE", hue= 'NAME_CONTRACT_STATUS', data = mer_tar1)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis of income type with contract status
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(x = "NAME_INCOME_TYPE", hue= 'NAME_CONTRACT_STATUS', data = mer_tar0)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "NAME_INCOME_TYPE", hue= 'NAME_CONTRACT_STATUS', data = mer_tar1)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis of occupation type with contract status
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
p = sns.countplot(x = "OCCUPATION_TYPE", hue= 'NAME_CONTRACT_STATUS', data = mer_tar0)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 0")
plt.subplot(1,2,2)
p = sns.countplot(x = "OCCUPATION_TYPE", hue= 'NAME_CONTRACT_STATUS', data = mer_tar1)
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis of goods category with contract status for Target 1
plt.figure(figsize=(10,8))
sns.countplot(data =mer_tar1,hue='NAME_CONTRACT_STATUS',x='NAME_GOODS_CATEGORY')
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")

In [None]:
# Univariate analysis of age group with contract status for Target 1
plt.figure(figsize=(10,8))
sns.countplot(data =mer_tar1,hue='NAME_CONTRACT_STATUS',x='AGE_GROUP')
plt.xticks(rotation='vertical')
#plt.ylabel(" Counts")
plt.title("Target 1")

> ### <font color=green>Conclusions from the above univariate plots about defaulters.

> ### <font color=green>Loans are requested mostly for the purchase of mobiles.
> ### <font color=green>The most common portfolio type is POS.
> ### <font color=green>The most common channel type is credit and cash offices.
> ### <font color=green>The purpose of most loans is repairs.
> ### <font color=green>Most cancelled loans come from people with secondary/secondary special education.
> ### <font color=green>Most loan cancellations come from the Laborers and Sales staff occupations.
> ### <font color=green>Customers who were previously approved but are now defaulters are mostly Drivers, Laborers, Sales staff and Core staff, and have higher education.
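
Raw counts in the plots above are dominated by the largest categories, so a row-normalised cross-tabulation can help when checking statements such as which occupations see the most cancellations. The sketch below uses toy data; the values are invented purely to show the call.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'OCCUPATION_TYPE': ['Laborers', 'Laborers', 'Laborers', 'Laborers', 'Drivers', 'Drivers'],
    'NAME_CONTRACT_STATUS': ['Approved', 'Approved', 'Canceled', 'Refused', 'Approved', 'Refused'],
})

# normalize='index' turns each row into shares that sum to 1; scale to percent
status_share = pd.crosstab(demo['OCCUPATION_TYPE'], demo['NAME_CONTRACT_STATUS'],
                           normalize='index') * 100
print(status_share.round(1))
```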

# Bivariate Analysis
Analysis of the merged Target 1 data, comparing previous and current application columns with respect to contract status

In [None]:
# Plotting Education type with Previous credit amount For Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='NAME_EDUCATION_TYPE', y='AMT_CREDIT_y', hue ='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Previous Credit amount vs Education type')
plt.show()

In [None]:
# Plotting Education type with Current credit amount for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='NAME_EDUCATION_TYPE', y='AMT_CREDIT_x', hue ='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Current Credit amount vs Education Type')
plt.show()

In [None]:
# Plotting Education type with Current Annuity amount for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='NAME_EDUCATION_TYPE', y='AMT_ANNUITY_x', hue ='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Current Annuity amount vs Education Type')
plt.show()

In [None]:
# Plotting Education type with Previous Annuity Amount for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='NAME_EDUCATION_TYPE', y='AMT_ANNUITY_y', hue ='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Previous Annuity vs Education Type')
plt.show()

In [None]:
## Plotting current contract type with current credit for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='NAME_CONTRACT_TYPE_x', y='AMT_CREDIT_x', hue ='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Current Contract vs Current Credit')
plt.show()

In [None]:
## Plotting previous contract type with previous credit for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='NAME_CONTRACT_TYPE_y', y='AMT_CREDIT_y', hue ='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Previous contract type vs Previous credit')
plt.show()

In [None]:
## Plotting current goods price with Income type for Target 1
plt.figure(figsize=(20,8))
sns.barplot(data =mer_tar1, x='NAME_INCOME_TYPE',hue='NAME_CONTRACT_STATUS',y='AMT_GOODS_PRICE_x')
plt.title('Current Goods price with Income type')
plt.show()

In [None]:
# Plotting previous goods price with income type for Target 1
plt.figure(figsize=(20,8))
sns.barplot(data =mer_tar1, x='NAME_INCOME_TYPE',hue='NAME_CONTRACT_STATUS',y='AMT_GOODS_PRICE_y')
plt.title('Previous Goods price with Income type')
plt.show()

In [None]:
# Plotting Goods category with Current goods price for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, hue='NAME_GOODS_CATEGORY', y='AMT_GOODS_PRICE_x', x='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Current Goods price vs Goods category')
plt.show()

In [None]:
# Plotting Goods category with Previous goods price for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, hue='NAME_GOODS_CATEGORY', y='AMT_GOODS_PRICE_y', x='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Previous Goods price vs Goods category')
plt.show()

In [None]:
# Plotting Income range with Current credit amount for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='AMT_INCOME_RANGE', y='AMT_CREDIT_x', hue='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Current Credit amount vs Income range')
plt.xticks(rotation='vertical')
plt.show()

In [None]:
# Plotting Previous credit amount with Income range for Target 1
plt.figure(figsize=(16,12))
sns.boxplot(data =mer_tar1, x='AMT_INCOME_RANGE', y='AMT_CREDIT_y', hue='NAME_CONTRACT_STATUS',showfliers=False)
plt.title('Previous Credit amount vs Income range')
plt.xticks(rotation='vertical')
plt.show()

> ### <font color=green>Conclusions from the above bivariate analysis

> ### <font color=green>The credit amount has increased and, as a result, the annuity amount has also increased.
> ### <font color=green>The starting range of the credit amount is 2.5 to 3 lakhs.
> ### <font color=green>Academic degree holders have no refusal history. Previously their credit amount was very low, but in the current application their credit amount is above 6 lakhs.
> ### <font color=green>Previously there were consumer loans; now there are only revolving and cash loans.
> ### <font color=green>More loans were cancelled in the previous application due to higher annuity.
> ### <font color=green>The goods price has increased drastically for almost all items. Previously approved loans were mostly for commodities worth below 2.5 lakhs, while in the current application the range is from 2.5 lakhs to above 7 lakhs.
> ### <font color=green>Loans for higher goods prices are taken by people who are on maternity leave.

> ### <font color=green>Case study final conclusion
> ### <font color=green>AMT_CREDIT (x and y), AMT_INCOME_TOTAL, DAYS_BIRTH (age group after imputing), NAME_CONTRACT_STATUS, AMT_ANNUITY (x and y), NAME_INCOME_TYPE, occupation and organization types are the major fields for analysis.
> ### <font color=green>The imbalance ratio is very high, i.e. the non-defaulters' data is 11.38 times that of the defaulters.
> ### <font color=green>Females tend to take more loans. Among them, widows and people on maternity leave tend to default more.
> ### <font color=green>The majority of loans are taken by the 30 to 40 age group.
> ### <font color=green>Married people take more loans.
> ### <font color=green>Most people who take loans have either 0 or 1 children and live in a House/Apt or with parents.
> ### <font color=green>People prefer cash loans over revolving loans.
> ### <font color=green>Most of the loans are taken for repair work.
> ### <font color=green>Academic degree holders ask for higher credit amounts.
> ### <font color=green>Defaulters mostly have only secondary/secondary special education.
> ### <font color=green>While analysing the previous application vs the current application, we identified the points below that explain why previously approved customers are now in the defaulters list.

> ### <font color=green>The goods price amount is more spread out for all items in the current application compared with the previous application.
> ### <font color=green>The credit amount and annuity amount also increased from the previous to the current application.
> ### <font color=green>A boxplot is plotted below for more clarity.

> ### <font color=green>It would be good to have a date field in the current and previous application files showing the period over which the data was collected, so that we could get a better picture of why the goods price had increased.
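
The imbalance ratio quoted in the final conclusion is the number of non-defaulters divided by the number of defaulters. A minimal sketch of that calculation is shown here on toy data; `imbalance_ratio` is a hypothetical helper name, and the toy series does not reproduce the 11.38 figure.

```python
import pandas as pd

def imbalance_ratio(target):
    """Ratio of non-defaulters (0) to defaulters (1) in a binary target column."""
    counts = target.value_counts()
    return counts.loc[0] / counts.loc[1]

# Hypothetical toy target: 9 non-defaulters and 3 defaulters
demo_target = pd.Series([0] * 9 + [1] * 3)
ratio = imbalance_ratio(demo_target)
print(ratio)
```

On the application data this would presumably be called as `imbalance_ratio(df1['TARGET'])`.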

In [None]:
## Boxplot visualisation of the AMT fields (x indicates current and y indicates previous)
col = ['AMT_GOODS_PRICE_x', 'AMT_GOODS_PRICE_y','AMT_CREDIT_x', 'AMT_CREDIT_y','AMT_ANNUITY_x','AMT_ANNUITY_y']
# Plotting each current/previous pair side by side; showfliers=False hides the outliers
for i in range(0, len(col), 2):
    plt.figure(figsize=(10,6))
    plt.subplot(1,2,1)
    mer_tar1.boxplot(column=col[i],showfliers=False)
    plt.title('Current application')
    plt.subplot(1,2,2)
    mer_tar1.boxplot(column=col[i+1],showfliers=False)
    plt.title('Previous application')
    plt.show()
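
The boxplots compare current and previous amounts visually. As a numeric complement, the medians of each `_x`/`_y` pair can be put side by side. This is a sketch on toy data; `compare_medians` is a hypothetical helper and the numbers are invented.

```python
import pandas as pd

def compare_medians(df, fields):
    """Median of the current (_x) and previous (_y) column for each field, plus their ratio."""
    rows = {
        field: {'current': df[field + '_x'].median(),
                'previous': df[field + '_y'].median()}
        for field in fields
    }
    out = pd.DataFrame.from_dict(rows, orient='index')
    out['ratio'] = out['current'] / out['previous']
    return out

# Hypothetical toy data
demo = pd.DataFrame({
    'AMT_CREDIT_x': [400.0, 600.0, 800.0],
    'AMT_CREDIT_y': [100.0, 200.0, 300.0],
})
summary = compare_medians(demo, ['AMT_CREDIT'])
print(summary)
```

For the merged defaulter data, the call would presumably be `compare_medians(mer_tar1, ['AMT_GOODS_PRICE', 'AMT_CREDIT', 'AMT_ANNUITY'])`.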
