### Business Objectives
> Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data, so that applicants capable of repaying the loan are not rejected.

> The company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

> The data comes from three files:

> 1. 'application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.

> 2. 'previous_application.csv' contains information about the client’s previous loan data. It records whether the previous application was Approved, Cancelled, Refused or an Unused offer.

> 3. 'columns_description.csv' is a data dictionary which describes the meaning of the variables.

### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
App_Data=pd.read_csv('../input/credit-eda-case-study/application_data.csv')


In [None]:
Pre_Data=pd.read_csv('../input/credit-eda-case-study/previous_application-1.csv/previous_application-1.csv')


In [None]:
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)

## First we will check the structure of the Application Data

In [None]:
App_Data.head()

In [None]:
App_Data.shape

In [None]:
App_Data.info(verbose = True)

#### Taking a look at the individual data types.
###### We have three data types: int64, object and float64.
###### A few object columns need to be converted to the category type, which reduces memory usage and therefore improves efficiency.

In [None]:
# Checking for float type
App_Data.select_dtypes('float').columns

##### After checking the columns_description file, we found a few float columns that hold day counts or whole-number counts and should therefore be int:
##### DAYS_REGISTRATION, CNT_FAM_MEMBERS, OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, DAYS_LAST_PHONE_CHANGE, AMT_REQ_CREDIT_BUREAU_HOUR

In [None]:
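# Numbers of days and numbers of enquiries cannot be fractional, so cast them to int.
# errors='ignore' returns a column unchanged if the cast fails (e.g. when it contains NaN).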

App_Data['DAYS_REGISTRATION']=App_Data['DAYS_REGISTRATION'].astype(int,errors='ignore')
App_Data['CNT_FAM_MEMBERS']=App_Data['CNT_FAM_MEMBERS'].astype(int,errors='ignore')
App_Data['OBS_30_CNT_SOCIAL_CIRCLE']=App_Data['OBS_30_CNT_SOCIAL_CIRCLE'].astype(int,errors='ignore')
App_Data['DEF_30_CNT_SOCIAL_CIRCLE']=App_Data['DEF_30_CNT_SOCIAL_CIRCLE'].astype(int,errors='ignore')
App_Data['DAYS_LAST_PHONE_CHANGE']=App_Data['DAYS_LAST_PHONE_CHANGE'].astype(int,errors='ignore')
App_Data['AMT_REQ_CREDIT_BUREAU_HOUR']=App_Data['AMT_REQ_CREDIT_BUREAU_HOUR'].astype(int,errors='ignore')
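
A float column that contains NaN cannot be cast to plain `int`, so with `errors='ignore'` it silently stays float. The following is a minimal sketch of one possible alternative, assuming pandas' nullable `Int64` dtype is acceptable for the rest of the analysis. The helper name `to_nullable_int` and the toy frame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def to_nullable_int(df, columns):
    """Return a copy of df with the given float columns cast to nullable Int64."""
    out = df.copy()
    for col in columns:
        # 'Int64' (capital I) stores missing values as <NA> instead of raising
        out[col] = out[col].astype('Int64')
    return out

# Toy data standing in for App_Data (hypothetical values)
toy = pd.DataFrame({'CNT_FAM_MEMBERS': [2.0, np.nan, 4.0]})
converted = to_nullable_int(toy, ['CNT_FAM_MEMBERS'])
print(converted.dtypes)
```

Code that later selects columns by dtype may need adjusting if this dtype is used, since `Int64` is an extension dtype rather than NumPy's `int64`.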

In [None]:
# Converting object columns (and the binary TARGET flag) to the category dtype
for col in ['TARGET','CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS','NAME_HOUSING_TYPE','OCCUPATION_TYPE','ORGANIZATION_TYPE','WALLSMATERIAL_MODE','EMERGENCYSTATE_MODE']:
    App_Data[col] = App_Data[col].astype('category')
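
To illustrate the memory claim, here is a small sketch comparing the footprint of the same labels stored as plain strings and as a category. The label values are made up for the example; actual savings on the application data will depend on the columns involved.

```python
import pandas as pd

# 100,000 repeated labels, similar in spirit to NAME_INCOME_TYPE (values are illustrative)
labels = pd.Series(['Working', 'Pensioner', 'State servant', 'Working'] * 25000)

# deep=True counts the memory held by the string objects themselves
as_object = labels.memory_usage(deep=True)
as_category = labels.astype('category').memory_usage(deep=True)

print('stored as strings :', as_object, 'bytes')
print('stored as category:', as_category, 'bytes')
```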

In [None]:
#Checking the dtypes again
App_Data.dtypes

In [None]:
App_Data.describe()

### Now we will check The Structure of the Previous_Application Data

In [None]:
Pre_Data.head()

In [None]:
Pre_Data.shape

In [None]:
Pre_Data.info()

In [None]:
# Changing the dtypes
for col in ['NAME_CONTRACT_TYPE', 'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE','NAME_TYPE_SUITE','NAME_CLIENT_TYPE','NAME_PRODUCT_TYPE','CHANNEL_TYPE','NAME_SELLER_INDUSTRY','NAME_YIELD_GROUP','PRODUCT_COMBINATION']:
    Pre_Data[col] = Pre_Data[col].astype('category')

In [None]:
#Checking dtypes once again
Pre_Data.dtypes

In [None]:
Pre_Data.describe()

### Missing value check and Data Quality Check for Application Data

In [None]:
#Calculating the percentage of missing values
percent_App = App_Data.isnull().sum()*100/len(App_Data)
percent_App

#### Many columns in our dataframe have more than 50% missing values. Such columns should be dropped outright, since they add little information and can reduce the efficiency of our analysis.

In [None]:
App_Data=App_Data.loc[:,App_Data.isnull().mean()<=.50]

In [None]:
#Checking App_Data again

percent_App = (App_Data.isnull().sum()*100/len(App_Data)).round(2)
percent_App

### Missing value check and Data Quality Check for Previous Data

In [None]:
#Calculating the percentage of missing values in Previous data
percent_Pre = Pre_Data.isnull().sum()*100/len(Pre_Data)
percent_Pre



##### The same logic applies to the previous application data: columns with more than 50% missing values are dropped.

In [None]:
Pre_Data=Pre_Data.loc[:,Pre_Data.isnull().mean()<=.50]

In [None]:
# Checking the Previous Data Again

(100*Pre_Data.isnull().sum()/len(Pre_Data)).round(2)

### Checking for imputation of values having less than 13% missing values

In [None]:
#Checking the columns having missing values less than 13%
percent_App_13 = percent_App[percent_App <= 13]
percent_App_13

### Imputation should be done for columns where missing values are less than around 13%
##### Our strategy is as follows:
##### 1. Missing values in the categorical variable NAME_TYPE_SUITE can be replaced with the mode, 'Unaccompanied'.
##### 2. Missing values in AMT_REQ_CREDIT_BUREAU_HOUR, AMT_REQ_CREDIT_BUREAU_DAY and AMT_REQ_CREDIT_BUREAU_WEEK can be imputed with 0, because 0 is the most frequent value (the mode) in these columns.
##### 3. Missing values in AMT_GOODS_PRICE can be imputed with the mean (average), because this column holds continuous float values.



### Doing the above-mentioned imputations will improve the usability of our analysis by giving a more holistic view of the data.
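
The notebook describes the imputation strategy but does not apply it. Below is a minimal sketch of how the three rules could be implemented, assuming the column names listed above. The function name `impute_application` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

BUREAU_COLS = ['AMT_REQ_CREDIT_BUREAU_HOUR',
               'AMT_REQ_CREDIT_BUREAU_DAY',
               'AMT_REQ_CREDIT_BUREAU_WEEK']

def impute_application(df):
    """Apply the three imputation rules described above to a copy of df."""
    out = df.copy()
    # 1. Categorical column: fill with its most frequent value (the mode)
    suite_mode = out['NAME_TYPE_SUITE'].mode()[0]
    out['NAME_TYPE_SUITE'] = out['NAME_TYPE_SUITE'].fillna(suite_mode)
    # 2. Enquiry counts: fill with 0, assumed here to be the mode of each column
    for col in BUREAU_COLS:
        out[col] = out[col].fillna(0)
    # 3. Continuous column: fill with the mean
    out['AMT_GOODS_PRICE'] = out['AMT_GOODS_PRICE'].fillna(out['AMT_GOODS_PRICE'].mean())
    return out

# Toy data standing in for App_Data (hypothetical values)
toy = pd.DataFrame({
    'NAME_TYPE_SUITE': ['Unaccompanied', 'Family', None, 'Unaccompanied'],
    'AMT_REQ_CREDIT_BUREAU_HOUR': [0, np.nan, 1, 0],
    'AMT_REQ_CREDIT_BUREAU_DAY': [0, np.nan, 0, 0],
    'AMT_REQ_CREDIT_BUREAU_WEEK': [0, np.nan, 0, 2],
    'AMT_GOODS_PRICE': [100.0, 200.0, np.nan, 300.0],
})
imputed = impute_application(toy)
print(imputed)
```

On the real data this would be called as `App_Data = impute_application(App_Data)`. The mean is sensitive to outliers, so the median is a common alternative for rule 3.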

In [None]:
#checking the data again
App_Data.head()

In [None]:
Pre_Data.head()

### Checking Outliers

In [None]:
App_Data.head()

In [None]:
App_Data.describe()

In [None]:
# For AMT_INCOME_TOTAL
plt.figure(figsize=(8,2))
sns.boxplot(x=App_Data.AMT_INCOME_TOTAL)
plt.show()

In [None]:
App_Data.AMT_INCOME_TOTAL.median()

In [None]:
App_Data.AMT_INCOME_TOTAL.max()

#### Here we can see an extreme income value, possibly belonging to a very wealthy applicant. Such a value should ideally be removed, as it would distort summary statistics and might give misleading results.
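
The notebook does not remove the extreme value. One common way to do so is the IQR rule that a box plot itself uses for its whiskers. The sketch below assumes the conventional multiplier of 1.5, which is a choice rather than something fixed by this analysis. The function name and the toy incomes are illustrative.

```python
import pandas as pd

def iqr_upper_fence(series, k=1.5):
    """Upper outlier fence: third quartile plus k times the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy incomes with one extreme value (hypothetical numbers)
income = pd.Series([100, 120, 130, 150, 160, 180, 200, 117000], dtype=float)
fence = iqr_upper_fence(income)
filtered = income[income <= fence]

print('upper fence:', fence)
print('rows kept  :', len(filtered), 'of', len(income))
```

For a heavily skewed variable such as income, this rule can flag many legitimate values, so a larger `k` or a high quantile cutoff may be more appropriate.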

In [None]:
# For DAYS_EMPLOYED
plt.figure(figsize=(10,2))
sns.boxplot(x=App_Data.DAYS_EMPLOYED)
plt.show()

In [None]:
App_Data.DAYS_EMPLOYED.median()

In [None]:
App_Data.DAYS_EMPLOYED.max()

##### 365243 days is roughly 1,000 years, so it cannot be a real value. Such a number is usually a placeholder used when data is not available, which makes it as good as NA. Therefore we can change it to null for further analysis.
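
The replacement itself is not shown in the notebook. A minimal sketch, assuming 365243 is the only placeholder value in the column; the helper name `mask_sentinel` and the toy values are illustrative.

```python
import pandas as pd

SENTINEL_DAYS_EMPLOYED = 365243  # the maximum value reported above

def mask_sentinel(series, sentinel=SENTINEL_DAYS_EMPLOYED):
    """Return a copy of series with the sentinel value replaced by NaN."""
    # mask() blanks out entries where the condition is True
    return series.mask(series == sentinel)

# Toy data standing in for App_Data.DAYS_EMPLOYED (hypothetical values)
days = pd.Series([-637, -1188, 365243, -225])
cleaned = mask_sentinel(days)
print(cleaned)
```

On the real data this would be `App_Data['DAYS_EMPLOYED'] = mask_sentinel(App_Data['DAYS_EMPLOYED'])`. The column becomes float, because NaN cannot be stored in a NumPy integer column.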

In [None]:
# For AMT_CREDIT
plt.figure(figsize=(10,2))
sns.boxplot(x=App_Data.AMT_CREDIT)
plt.show()

In [None]:
App_Data.AMT_CREDIT.median()

In [None]:
App_Data.AMT_CREDIT.max()

##### Here the outliers don't look that extreme, and a few values are concentrated towards the upper end. Such credit amounts may have been taken by business owners and entrepreneurs.

In [None]:
# For AMT_ANNUITY
plt.figure(figsize=(10,2))
sns.boxplot(x=App_Data.AMT_ANNUITY)
plt.show()

In [None]:
App_Data.AMT_ANNUITY.median()

In [None]:
App_Data.AMT_ANNUITY.max()

###### Again, the max value isn't that far from the median, which suggests there isn't a data-entry error.

In [None]:
# For OBS_30_CNT_SOCIAL_CIRCLE
plt.figure(figsize=(10,2))
sns.boxplot(x=App_Data.OBS_30_CNT_SOCIAL_CIRCLE)
plt.show()

In [None]:
App_Data.OBS_30_CNT_SOCIAL_CIRCLE.median()

In [None]:
App_Data.OBS_30_CNT_SOCIAL_CIRCLE.max()

##### Here the maximum could be due to a data-entry error, as such a deviation from the median is quite unlikely.

### Binning in Continuous Variable

In [None]:
App_Data['AMT_INCOME_TOTAL'].describe()

In [None]:
# Creating  first binned variable

App_Data.loc[:,'Range_Income']=pd.qcut(App_Data.loc[:,'AMT_INCOME_TOTAL'],q=[0,0.20,0.50,0.90,1],labels=['Low','Medium','High','Very_High'])

In [None]:
# Checking the binned variable

App_Data['Range_Income'].value_counts()

In [None]:
# Second binned variable: 'EXT_SOURCE_2' holds a normalized score, which can be binned into a rating


App_Data['EXT_SOURCE_2'].describe()



In [None]:
App_Data.loc[:,'Rating']=pd.qcut(App_Data.loc[:,'EXT_SOURCE_2'],q=[0,0.20,0.50,0.90,1],labels=['Low','Medium','High','Very_High'])

In [None]:
#checking second binned variable 

App_Data['Rating'].value_counts()

### The third binned variable can be built from the 'DAYS_BIRTH' column
#### Only the number of days is given for each customer. From it we can derive the age in years, which can then be binned.

In [None]:
App_Data['DAYS_BIRTH'].describe()

In [None]:
# Age in years is obtained by dividing the day count by 365.25
# DAYS_BIRTH values are negative, so floor-dividing by -365.25 also fixes the sign

App_Data['Age']=App_Data['DAYS_BIRTH']//-365.25
App_Data.drop(['DAYS_BIRTH'],axis=1,inplace=True)

In [None]:
App_Data.Age.describe()

In [None]:
# Now we can create a binned variable for the 'Age' column

App_Data.head()

In [None]:
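# Bins are 5 years wide (20-25, ..., 65-70); include_lowest=True keeps an age of exactly 20 in the first bin
App_Data['Age_Group']=pd.cut(App_Data.Age,bins=np.linspace(20,70,num=11),include_lowest=True)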

In [None]:
# checking  binned variable 

App_Data.Age_Group.value_counts()

### Checking Imbalance

In [None]:
# Checking imbalance for the target variable. As per the columns description file, 'TARGET' has two values, 0 and 1:
# 1 marks defaulters (clients not able to make payments on time); 0 marks non-defaulters.

# Percentage of each class; anything that is not 1 counts as a non-defaulter
count1=(App_Data['TARGET']==1).mean()*100
count0=100-count1

x=['Defaulter(TARGET=1)','Non-Defaulter(TARGET=0)']
y=[count1,count0]

fig1,ax1=plt.subplots()
ax1.pie(y,labels=x,shadow=True,autopct='%1.1f%%')
plt.title('Data imbalance',fontsize=30)
plt.show()

#### The Application data is highly imbalanced: defaulters make up just 8.1% of applicants, compared with 91.9% non-defaulters.
### The imbalance ratio is 91.9 : 8.1, i.e. roughly 11 non-defaulters for every defaulter.
#### This is a good sign and suggests that the bank is able to recover most of its loans.

In [None]:
App_Data.head()

#### Splitting Data with respect to TARGET =0 AND TARGET =1

In [None]:
App_Data_t0 = App_Data[App_Data.TARGET==0]
App_Data_t1 = App_Data[App_Data.TARGET==1]

## Analysis

##### Univariate
- Categorical
- Continuous

##### Bivariate
- Categorical vs. Categorical
- Categorical vs. Continuous
- Continuous vs. Continuous

### Univariate Analysis with respect to target=1 and target=0 for categorical variables

In [None]:
# Function for plotting a categorical variable side by side for both populations
def plotfunc(var):
    plt.figure(figsize=(15,5))
    # Left panel: non-defaulters (TARGET=0)
    plt.subplot(1,2,1)
    sns.countplot(x=var,data=App_Data_t0)
    plt.title('Distribution of '+'%s'%var+' for Non-Defaulter',fontsize=14)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('No of cases for Non-Defaulter')
    # Right panel: defaulters (TARGET=1)
    plt.subplot(1,2,2)
    sns.countplot(x=var,data=App_Data_t1)
    plt.title('Distribution of '+'%s'%var+' for Defaulter',fontsize=14)
    plt.xlabel(var)
    plt.xticks(rotation=90)
    plt.ylabel('No of cases for Defaulter')
    plt.show()
    
    

### Unordered Categorical Variables

In [None]:
plotfunc('NAME_TYPE_SUITE')

#### 1) Whether or not the client is unaccompanied when applying for the loan does not appear to have any impact on default.
#### 2) Both populations show the same proportions.

In [None]:
plotfunc('NAME_CONTRACT_TYPE')

#### 1) Here we can see that revolving loans make up a smaller share of the defaulter population.
#### 2) Revolving loans therefore appear comparatively safer.

In [None]:
plotfunc('NAME_INCOME_TYPE')

#### 1) The highest number of defaulters comes from the working class.
#### 2) Pensioners seem to default less, possibly due to the lower expenses associated with their routine.

In [None]:
plotfunc('NAME_HOUSING_TYPE')

#### 1) Those living in rented apartments or with parents make up a larger share of the defaulter population than of the non-defaulter population.
#### 2) They may have difficulty repaying because their salaries may not be high while their living costs (cash outflow) are.

In [None]:
plotfunc('NAME_FAMILY_STATUS')

#### 1) Single/not married clients make up a larger share of the defaulter population than of the non-defaulter population, probably because they rely on a single income.

### Ordered Categorical Variables

In [None]:
plotfunc('NAME_EDUCATION_TYPE')

#### 1) Higher education accounts for a smaller share of the defaulter population than of the non-defaulter population.
#### 2) The higher the education, the lower the default rate, possibly because such clients earn more and can repay loans more easily.

In [None]:
plotfunc('CNT_FAM_MEMBERS')

#### 1) Larger families (those with more children) make up a bigger share of the defaulter population than of the non-defaulter population.
#### 2) Family size appears to affect the default rate: the larger the family, the greater the expenses.

In [None]:
plotfunc('Range_Income')

#### The Low income range makes up a slightly larger share of the defaulter population than of the non-defaulter population, which suggests that low-income applicants default more.

In [None]:
# We can check the actual shares for the defaulter and non-defaulter populations

Default=App_Data_t1.Range_Income.value_counts(normalize=True)
NonDefault=App_Data_t0.Range_Income.value_counts(normalize=True)
print(Default,NonDefault)

In [None]:
plotfunc('Rating')

#### People with a Low rating are defaulting more. This is what we would expect if the external score (EXT_SOURCE_2) reflects how reliably they have made payments in the past.

In [None]:
plotfunc('Age_Group')

#### The lower the age group, the higher the default rate. One possible reason is that young people are greater risk takers.

In [None]:
# We can check the actual shares for the defaulter and non-defaulter populations

Default=App_Data_t1.Age_Group.value_counts(normalize=True)
NonDefault=App_Data_t0.Age_Group.value_counts(normalize=True)
print(Default,NonDefault)
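
The comparison above shows how each group is distributed within the defaulter and non-defaulter populations. A complementary view is the default rate inside each group. The sketch below assumes `TARGET` holds 0/1 values; the helper name `default_rate_by` and the toy data are illustrative, not part of the notebook.

```python
import pandas as pd

def default_rate_by(df, column, target='TARGET'):
    """Percentage of defaulters (target == 1) within each level of column."""
    is_default = df[target].astype(int) == 1
    # The mean of a boolean series is the share of True values in each group
    rate = is_default.groupby(df[column], observed=True).mean() * 100
    return rate.round(2)

# Toy data standing in for App_Data (hypothetical values)
toy = pd.DataFrame({
    'Age_Group': ['20-25'] * 4 + ['40-45'] * 4,
    'TARGET': [1, 1, 0, 0, 1, 0, 0, 0],
})
rates = default_rate_by(toy, 'Age_Group')
print(rates)
```

On the real data, `default_rate_by(App_Data, 'Age_Group')` or `default_rate_by(App_Data, 'Range_Income')` would give one percentage per bin, which is easier to compare than two separate distributions.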

### Finding correlation for numerical columns for both 0 and 1 case

In [None]:
App_Data.head()

In [None]:
# Selecting float and int for correlation
col_int=list(App_Data_t0.select_dtypes('int64').columns)
col_float=list(App_Data_t0.select_dtypes('float').columns)
col=col_int+col_float
NonDef_num=App_Data_t0[col]
NonDef_corr=NonDef_num.corr()
round(NonDef_corr,3)

In [None]:
l1=NonDef_corr.unstack()

l1.sort_values(ascending=False).drop_duplicates().head(11)

### Top 10 correlations for non default population
#### We extracted these values by applying head() and tail() to the sorted correlations


| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998269 |
| FLOORSMAX_AVG | FLOORSMAX_MEDI | 0.997187 |
| YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_AVG | 0.996124 |
| FLOORSMAX_MODE | FLOORSMAX_MEDI | 0.989195 |
| FLOORSMAX_AVG | FLOORSMAX_MODE | 0.986594 |
| AMT_CREDIT | AMT_GOODS_PRICE | 0.983103 |
| YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MODE | 0.980466 |
| YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_MEDI | 0.978073 |
| REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.956637 |
| CNT_CHILDREN | CNT_FAM_MEMBERS | 0.885484 |
| DEF_30_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | 0.868994 |
| DAYS_EMPLOYED | FLAG_EMP_PHONE | -0.999702 (negative) |
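
Sorting the unstacked matrix and calling `drop_duplicates()` relies on mirrored pairs having identical values, and it also keeps the self-correlation of 1.0 at the top. A possible alternative, sketched here under the assumption that ranking by absolute correlation is what is wanted, keeps only the upper triangle of the matrix. The function name `top_correlations` and the toy frame are illustrative.

```python
import numpy as np
import pandas as pd

def top_correlations(corr, n=10):
    """Return the n variable pairs with the largest absolute correlation, each pair once."""
    # Strict upper triangle: drops the diagonal (self-pairs) and the mirrored lower half
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).unstack().dropna()
    # Rank by magnitude so strong negative correlations are not pushed to the bottom
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy numeric frame (hypothetical values)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 4, 3, 2, 1],
    'd': [1, 3, 2, 5, 4],
})
corr = toy.corr()
top = top_correlations(corr, n=3)
print(top)
```

With the notebook's objects, this would be called as `top_correlations(NonDef_corr, n=10)`.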

In [None]:
# Selecting float and int columns of the defaulter population for correlation
col_int=list(App_Data_t1.select_dtypes('int64').columns)
col_float=list(App_Data_t1.select_dtypes('float').columns)
col=col_int+col_float
Def_num=App_Data_t1[col]
Def_corr=Def_num.corr()
round(Def_corr,3)

In [None]:
l1=Def_corr.unstack()

l1.sort_values(ascending=False).drop_duplicates()

### Top 10 correlations for defaulter population
#### We extracted these values by applying head() and tail() to the sorted correlations

| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998269 |
| FLOORSMAX_AVG | FLOORSMAX_MEDI | 0.997187 |
| YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_AVG | 0.996124 |
| FLOORSMAX_MODE | FLOORSMAX_MEDI | 0.989195 |
| FLOORSMAX_AVG | FLOORSMAX_MODE | 0.986594 |
| AMT_CREDIT | AMT_GOODS_PRICE | 0.983103 |
| YEARS_BEGINEXPLUATATION_AVG | YEARS_BEGINEXPLUATATION_MODE | 0.980466 |
| YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_MEDI | 0.978073 |
| REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | 0.956637 |
| DAYS_EMPLOYED | FLAG_EMP_PHONE | -0.999702 (negative) |



### Univariate Analysis on continuous variables

In [None]:
# Function for plotting histograms of a continuous variable for both populations
def plotcont(var):
    plt.figure(figsize=(15,5))
    plt.subplot(1, 2, 1)
    App_Data_t0[var].plot.hist()
    plt.title('Distribution for Non-Defaulters', fontsize=12)
    plt.xlabel(var)
    plt.subplot(1, 2, 2)
    App_Data_t1[var].plot.hist()
    plt.title('Distribution for Defaulters', fontsize=12)
    plt.xlabel(var)
    plt.show()

In [None]:
plotcont('REGION_POPULATION_RELATIVE')

##### People living in higher-density areas appear to default less.

In [None]:
App_Data.AMT_GOODS_PRICE.mean()

In [None]:
plotcont('AMT_GOODS_PRICE')

##### Defaulters are concentrated in the 0.0 to 0.5 range of AMT_GOODS_PRICE (in the units shown on the plot axis).

###  Segmented Analysis of Male vs Female

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(121)
sns.countplot(data= App_Data_t0 ,x='TARGET',hue='CODE_GENDER')
plt.legend()
plt.subplot(122)
sns.countplot(data= App_Data_t1 ,x='TARGET',hue='CODE_GENDER')
plt.legend()
plt.show()

#### We cannot conclude much on the basis of gender alone, as defaulters and non-defaulters follow the same pattern.
#### So we will instead compare the median income of defaulters and non-defaulters for each gender.

In [None]:
# Median income of non-defaulters, male and female
Ava_income_t_0_m=App_Data_t0[App_Data_t0.CODE_GENDER=='M']['AMT_INCOME_TOTAL'].median()
Ava_income_t_0_f=App_Data_t0[App_Data_t0.CODE_GENDER=='F']['AMT_INCOME_TOTAL'].median()

# Median income of defaulters, male and female

Ava_income_t_1_m=App_Data_t1[App_Data_t1.CODE_GENDER=='M']['AMT_INCOME_TOTAL'].median()
Ava_income_t_1_f=App_Data_t1[App_Data_t1.CODE_GENDER=='F']['AMT_INCOME_TOTAL'].median()

x_Male=['AMT_INCOME_median_T_0_Male','AMT_INCOME_median_T_1_Male']

y_Male=[Ava_income_t_0_m,Ava_income_t_1_m]

x_Female=['AMT_INCOME_median_T_0_Female','AMT_INCOME_median_T_1_Female']

y_Female=[Ava_income_t_0_f,Ava_income_t_1_f]

plt.figure(figsize=(14,6))

plt.subplot(121)
plt.bar(x_Male,y_Male)
plt.subplot(122)
plt.bar(x_Female,y_Female)

plt.show()

#### For both males and females, defaulters have a lower median income than non-defaulters.

In [None]:
App_Data.head()

In [None]:
Pre_Data.head()

### BiVariate Analysis

### Making functions for repetitive plots

#### Note: A few functions might look repetitive. This is because each team member did a different set of analyses, which were combined later.

In [None]:
# Box plots of a continuous variable (var2) per level of var1; despite the function name, var1 is categorical in the calls below
def bivariate_contcont(var1, var2):
    
    plt.figure(figsize=(15,5))

    plt.subplot(1,2,1)
    plt.title('Non-Default')
    sns.boxplot(x=var1,y=var2 ,data=App_Data_t0)
    plt.xticks(fontsize = 15, rotation =90)

    plt.subplot(1,2,2)
    plt.title('Default')
    sns.boxplot(x=var1,y=var2 ,data=App_Data_t1)
    plt.xticks(fontsize = 15, rotation =90)
    plt.show()

In [None]:
# Bar plots of the mean of a continuous variable (var2) per level of var1; var1 is categorical in the call below
def bivariate_contcont_bar(var1, var2):
    
    plt.figure(figsize=(15,5))

    plt.subplot(1,2,1)
    plt.title('Non-Default')
    sns.barplot(x=var1,y=var2 ,data=App_Data_t0)
    plt.xticks(fontsize = 15, rotation =90)

    plt.subplot(1,2,2)
    plt.title('Default')
    sns.barplot(x=var1,y=var2 ,data=App_Data_t1)
    plt.xticks(fontsize = 15, rotation =90)
    plt.show()

In [None]:
# continuous-continuous variables (scatterplot)
def bivariate_contcont_scatter(var1, var2):
    
    plt.figure(figsize=(15,5))

    plt.subplot(1,2,1)
    plt.title('Non-Default')
    sns.scatterplot(x=var1,y=var2 ,data=App_Data_t0)
    plt.xticks(fontsize = 15, rotation =90)

    plt.subplot(1,2,2)
    plt.title('Default')
    sns.scatterplot(x=var1,y=var2 ,data=App_Data_t1)
    plt.xticks(fontsize = 15, rotation =90)
    plt.show()
    
def plotbivarcontcont(var1,var2):
    plt.figure(figsize=(15,5))
    plt.subplot(1, 2, 1)
    sns.scatterplot(x=var1,y=var2,data=App_Data_t0)
    plt.title('TARGET=0')
    plt.xlabel(var1)
    plt.xticks(rotation=90)
    plt.subplot(1, 2, 2)
    sns.scatterplot(x=var1,y=var2,data=App_Data_t1)
    plt.title('TARGET=1')
    plt.xlabel(var1)
    plt.xticks(rotation=90)
    plt.show()

In [None]:
#category-category variable
def bivariate_catcat(var1, var2):
    # Table for Non-Default; var1 and var2 are column names, so index with brackets
    crosstab_0 = pd.crosstab(index=App_Data_t0[var1],columns=App_Data_t0[var2])
    # Table for Default
    crosstab_1 = pd.crosstab(index=App_Data_t1[var1], 
                          columns=App_Data_t1[var2])
    # Stacked bar charts, one panel per population
    fig, axes = plt.subplots (nrows = 1, ncols = 2, figsize=(15,5))

    crosstab_0.plot(ax = axes[0], kind="bar", stacked=True)
    crosstab_1.plot(ax = axes[1], kind="bar", stacked=True)

    axes[0].legend(prop={'size': 10}, loc='upper left')
    axes[1].legend(prop={'size': 10}, loc='upper left')

    axes[0].title.set_text('Non - Defaulters')
    axes[1].title.set_text('Defaulters')
    plt.show()

### Education type vs amount for credit taken [Default:Non-Default]

In [None]:
bivariate_contcont_bar('NAME_EDUCATION_TYPE','AMT_CREDIT')

##### Inference:
##### 1. From the above chart we can observe that, among non-defaulters, people with a higher education level take on and repay larger credit amounts than those with lower education, possibly because people with higher education tend to earn more.
##### 2. People with a lower education level are therefore less likely to be the bank's primary target.

### Education Type vs Income Range

In [None]:
# Creating two way tables(Count) for Category-Category bivariate analysis - For Non-Default
education_RangeIncome_0 = pd.crosstab(index=App_Data_t0["Range_Income"], 
                          columns=App_Data_t0["NAME_EDUCATION_TYPE"])

education_RangeIncome_0

In [None]:
# Table for Default
education_RangeIncome_1 = pd.crosstab(index=App_Data_t1["Range_Income"], 
                          columns=App_Data_t1["NAME_EDUCATION_TYPE"])

education_RangeIncome_1

In [None]:
fig, axes = plt.subplots (nrows = 1, ncols = 2, figsize=(15,5))

education_RangeIncome_0.plot(ax = axes[0], kind="bar", stacked=True)
education_RangeIncome_1.plot(ax = axes[1], kind="bar", stacked=True)

axes[0].legend(prop={'size': 10}, loc='upper left')
axes[1].legend(prop={'size': 10}, loc='upper left')

axes[0].title.set_text('Non - Defaulters')
axes[1].title.set_text('Defaulters')



#### Inference:
#### 1. The table above breaks down the number of non-defaulters by education type within each income range.
#### 2. The numbers are highest for Higher education and Secondary education.
#### 3. Furthermore, most of these people fall in the Medium to High income ranges, which may explain their low level of default.

### Gender vs education

In [None]:
# Creating two way tables(Count) for Category-Category bivariate analysis - For Non-Default
gender_edu_0 = pd.crosstab(index=App_Data_t0["CODE_GENDER"], 
                          columns=App_Data_t0["NAME_EDUCATION_TYPE"])

gender_edu_0

In [None]:
# Table for Default
gender_edu_1 = pd.crosstab(index=App_Data_t1["CODE_GENDER"], 
                          columns=App_Data_t1["NAME_EDUCATION_TYPE"])

gender_edu_1

In [None]:
fig, axes = plt.subplots (nrows = 1, ncols = 2, figsize=(15,5))

gender_edu_0.plot(ax = axes[0], kind="bar", stacked=True)
gender_edu_1.plot(ax = axes[1], kind="bar", stacked=True)

axes[0].legend(prop={'size': 10}, loc='upper left')
axes[1].legend(prop={'size': 10}, loc='upper left')

axes[0].title.set_text('Non - Defaulters')
axes[1].title.set_text('Defaulters')

#### Inference:
#### 1. It is clearly visible that males, especially those with secondary-level education, are among the most frequent defaulters.
#### 2. Females tend to default less than males.

### Gender vs Credit

In [None]:
bivariate_contcont('CODE_GENDER','AMT_CREDIT')

#### Inference:
#### 1. Although there isn't much difference on the basis of gender, we can still notice that men who don't default have taken credit even above 30 lakh (3 million), while those who tend to default don't take credit of more than 30 lakh.

### Education vs Annuity amount

In [None]:
bivariate_contcont('NAME_EDUCATION_TYPE','AMT_ANNUITY')

#### Inference:
#### 1. From the above plot we can observe that, among the defaulters, the higher numbers belong to Academic degree holders.
#### 2. One possible reason could be that these people are taking educational loans.

### Population relative vs Income range

In [None]:
bivariate_contcont('Range_Income','REGION_POPULATION_RELATIVE')

#### Inference:
#### 1. From the plot we can make out that people from higher income categories live in highly populated regions (e.g. metro cities).
#### 2. Out of the high income category who live in densely populated cities, there are greater number of non defaulters from 

### Occupation vs goods price

In [None]:
bivariate_contcont('OCCUPATION_TYPE','AMT_GOODS_PRICE')

### Annuity vs Family status

In [None]:
bivariate_contcont('NAME_FAMILY_STATUS','AMT_ANNUITY')

#### Inference:
#### 1. Comparatively, married people are defaulting more than others, which could be due to additional family-related expenses that hinder their cash flow.
#### 2. People with an annuity above 60,000 appear more often among defaulters than among non-defaulters.

### Housing type vs region population

In [None]:
bivariate_contcont('NAME_HOUSING_TYPE','REGION_POPULATION_RELATIVE')

#### Inference:
#### 1. People residing in rented apartments are comparatively on the lower side of defaulting.
#### 2. These are the same people who are residing in less populated areas.
#### 3. Reason for this could be cheaper property rates on the outskirts of the main city.

### Own realty vs credit

In [None]:
bivariate_contcont('FLAG_OWN_REALTY','AMT_CREDIT')

#### Inference:
#### 1. We can observe that people who don't own realty have taken less credit.
#### 2. This could mean they are renting, which already burdens them with additional expenses and lowers their risk appetite, and so might lead them to take less credit.

### Family status vs credit

In [None]:
bivariate_contcont('NAME_FAMILY_STATUS', 'AMT_CREDIT')

### Credit amount vs Income

In [None]:
plotbivarcontcont('AMT_CREDIT','AMT_INCOME_TOTAL')

### Credit amount vs Goods price

In [None]:
plotbivarcontcont('AMT_CREDIT','AMT_GOODS_PRICE')

#### Inferences:
#### 1. There are fewer defaulters when the price of goods is high and the credit amount is low.

# PREVIOUS APPLICATION DATA ANALYSIS

### Merging the data

In [None]:
Pre_Data.shape

In [None]:
Pre_Data.info()

In [None]:
Pre_Data.dtypes.value_counts()

In [None]:
#Checking missing values
((Pre_Data.isnull().sum()*100)/len(Pre_Data)).round(2)

In [None]:
Pre_Data = Pre_Data.loc[:, Pre_Data.isnull().mean() <= .5]
Pre_Data.info()

In [None]:
# The previous application data is very large, so we keep only the first rows to make the merge manageable.
# .loc[0:70000] is label-based and inclusive, so it keeps 70,001 rows; the cutoff can be changed as needed.
Pre_Data=Pre_Data.loc[0:70000]

In [None]:
Pre_Data.shape

### Combining

In [None]:
Combined = pd.merge(App_Data, Pre_Data, how='left', on=['SK_ID_CURR'])
Combined.shape
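
A left merge keeps every application, repeats a client once per matching previous application, and leaves the previous-application columns empty for clients with no match. The sketch below shows one way to check how many rows fall into each case, using the `indicator=True` option of `pd.merge`. The two toy frames are hypothetical stand-ins for `App_Data` and `Pre_Data`.

```python
import pandas as pd

# Toy stand-ins for App_Data and Pre_Data (hypothetical values)
app = pd.DataFrame({'SK_ID_CURR': [1, 2, 3], 'TARGET': [0, 1, 0]})
prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 3],
    'NAME_CONTRACT_STATUS': ['Approved', 'Refused', 'Approved'],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = pd.merge(app, prev, how='left', on='SK_ID_CURR', indicator=True)
matched_rows = (merged['_merge'] == 'both').sum()
unmatched_rows = (merged['_merge'] == 'left_only').sum()

print(merged)
print('rows with a previous application   :', matched_rows)
print('rows without a previous application:', unmatched_rows)
```

Because `Pre_Data` was truncated before merging, many applications may end up without a match, so checking these counts helps when interpreting the plots that follow.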

In [None]:
Combined.columns

In [None]:
# Checking the distribution of the Contract status column visually
sns.countplot(x=Combined.NAME_CONTRACT_STATUS)
plt.xlabel("Contract Status")
plt.ylabel("Count of Contract Status")
plt.title("Distribution of Contract Status")
plt.show()



#### Based on the above observation, the merged data has been divided into the following categories:
1) Approved
2) Refused
3) Canceled
4) Unused Offer

In [None]:
#Making separate dataframes for these four categories
approved=Combined[Combined.NAME_CONTRACT_STATUS=='Approved']
refused=Combined[Combined.NAME_CONTRACT_STATUS=='Refused']
canceled=Combined[Combined.NAME_CONTRACT_STATUS=='Canceled']
unused=Combined[Combined.NAME_CONTRACT_STATUS=='Unused Offer']

In [None]:
# Defining a function for plotting a variable within the refused, approved and canceled categories
def plot_func(var):
    fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15,5))
    
    s1=sns.countplot(ax=ax1,x=refused[var], data=refused, order= refused[var].value_counts().index,)
    ax1.set_title("Refused", fontsize=10)
    ax1.set_xlabel('%s' %var)
    ax1.set_ylabel("Count of Loans")
    s1.set_xticklabels(s1.get_xticklabels())
    
    s2=sns.countplot(ax=ax2,x=approved[var], data=approved, order= approved[var].value_counts().index,)
    s2.set_xticklabels(s2.get_xticklabels())
    ax2.set_xlabel('%s' %var)
    ax2.set_ylabel("Count of Loans")
    ax2.set_title("Approved", fontsize=10)
    
    
    s3=sns.countplot(ax=ax3,x=canceled[var], data=canceled, order= canceled[var].value_counts().index,)
    ax3.set_title("Canceled", fontsize=10)
    ax3.set_xlabel('%s' %var)
    ax3.set_ylabel("Count of Loans")
    s3.set_xticklabels(s3.get_xticklabels())
    plt.show()

In [None]:
plot_func('TARGET')

### Let us check the Percentage of each categories displayed above

In [None]:
refused.TARGET.value_counts(normalize=True)

In [None]:
approved.TARGET.value_counts(normalize=True)

In [None]:
canceled.TARGET.value_counts(normalize=True)

#### Inference:
#### Loans from clients whose previous applications were refused or cancelled have a higher default rate
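
The three `value_counts` cells above can also be read from a single table. This is a small sketch using `pd.crosstab` with `normalize='index'`, so that each row sums to 100%. The toy frame is a hypothetical stand-in for `Combined`.

```python
import pandas as pd

# Toy stand-in for Combined (hypothetical values)
toy = pd.DataFrame({
    'NAME_CONTRACT_STATUS': ['Approved'] * 4 + ['Refused'] * 4,
    'TARGET': [0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize='index' converts counts to row-wise shares; multiply by 100 for percentages
share = pd.crosstab(toy['NAME_CONTRACT_STATUS'], toy['TARGET'], normalize='index') * 100
print(share)
```

On the merged data the equivalent call would be `pd.crosstab(Combined['NAME_CONTRACT_STATUS'], Combined['TARGET'], normalize='index')`, where the column for `TARGET` = 1 gives the default rate per previous contract status.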

### Application amount vs income

In [None]:
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
sns.scatterplot(x='AMT_APPLICATION',y='AMT_INCOME_TOTAL',data=refused)
plt.title('Refused')

plt.subplot(1,2,2)
sns.scatterplot(x='AMT_APPLICATION',y='AMT_INCOME_TOTAL',data=approved)
plt.title('Approved')
plt.show()

