# <font color = blue> EDA Case Study </font>

### <font color = blue> Introduction </font>

This case study shows how EDA is applied in a real business scenario. Along the way, we will develop a basic understanding of risk analytics in banking and financial services and see how data is used to minimise the risk of losing money when lending to customers.

### <font color = blue> Business Understanding </font>

This case study aims to identify patterns that indicate whether a client will have difficulty paying their installments. Such patterns can be used to take actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate. This ensures that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

 

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default.  The company can utilise this knowledge for its portfolio and risk assessment.

To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).


### <font color = blue> Business Objective </font>

Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by defaulting. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.

 

When the company receives a loan application, it has to decide whether to approve the loan based on the applicant’s profile. Two types of risk are associated with this decision:

- If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.

- If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

### Filtering out the warnings

In [None]:
# Filtering out the warnings

import warnings

warnings.filterwarnings('ignore')

### Importing the libraries

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from mpl_toolkits.mplot3d import Axes3D
#from sklearn.preprocessing import Standard
import os # accessing directory structure
import plotly
import plotly.express as px
import plotly.graph_objects as go


##  Task 1: Reading the data

- ### Subtask 1.1: Read the application Data.

Read the application data and store it in a dataframe.

In [None]:
# Read the csv file using 'read_csv'.

data = pd.read_csv("application_data.csv")

In [None]:
pd.set_option('display.max_columns',122)

In [None]:
data.head()

- ###  Subtask 1.2: Inspect the Dataframe

Inspect the dataframe for dimensions, null-values, and summary of different numeric columns.

In [None]:
# Checking the number of rows and columns in the dataframe

data.shape

In [None]:
# Listing the column names of the dataframe

data.columns

In [None]:
# Checking the data types of columns 

data.info()

In [None]:
# Checking the summary for the numeric columns 

data.describe()

In [None]:
pd.set_option('display.max_rows',122)

In [None]:
# Checking the data types of the columns (also shown by info)

data.dtypes


In [None]:
# Counting the columns that contain at least one null value

(data.isnull().sum()>0).sum()

In [None]:
# Number of null values in each column

sum_null = data.isnull().sum()

In [None]:
sum_null.sort_values(ascending = False)

In [None]:
# Percentage of null values in each column

percentage_null = data.isnull().mean()*100
percentage_null = percentage_null.sort_values(ascending = False)
percentage_null



## Task 2: Data Cleaning and Manipulation

Now that we have loaded and inspected the dataset, we can see null values in many columns and have calculated their percentages. Let's now work on handling these null values.

-  ###  Subtask 2.1: Visualizing the percentage of null values in the columns that have them


In [None]:

null_col = percentage_null.sort_values(ascending = False)
null_col = null_col[null_col>32]
plt.figure(figsize=(25,10))
plt.title("Representation of Null columns vs their percentage", fontdict = {"fontsize":26, "fontweight":20, "color":"Green"})
plt.xlabel("Null columns", fontdict = {"fontsize":25, "fontweight":20, "color":"Green"})
plt.ylabel("Percentage", fontdict = {"fontsize":25, "fontweight":20, "color":"Green"})
null_col.plot.bar(color = 'purple');



In [None]:
# Copying the original dataframe to a new dataframe called 'new_data'
# (.copy() is needed; plain assignment would only create a second name for 'data')

new_data = data.copy()
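Plain assignment binds a second name to the same DataFrame, whereas `.copy()` creates an independent one. The following is a minimal sketch of that difference on a made-up frame (the column names `a` and `b` are purely illustrative):

```python
import pandas as pd

# Toy frame standing in for the application data (values are made up)
df = pd.DataFrame({"a": [1, 2, 3], "b": [None, 5.0, 6.0]})

alias = df             # same object: changes made through alias also change df
snapshot = df.copy()   # independent copy (deep by default)

alias.drop(columns=["b"], inplace=True)

print(list(df.columns))        # ['a']  -> 'b' is gone from df too
print(list(snapshot.columns))  # ['a', 'b']  -> the copy is untouched
```

Working on a copy keeps the raw data available for later comparison.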

-  ###  Subtask 2.2: Removing the columns whose null percentage is 32 or more


In [None]:
# After inspecting the null columns, we found no significant reason to keep columns with 32% or more null values, so we drop them

get_col = new_data.isnull().mean()*100
get_col = get_col[get_col.values>=32].index.to_list()
new_data.drop(labels = get_col,axis =1,inplace=True)
new_data.head()

In [None]:
# Checking the shape of the dataframe after dropping the null columns

new_data.shape
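The threshold-based drop above can be sketched on a small made-up frame to make the mechanics visible. The column names and values here are assumptions for illustration only; the 32% cut-off is the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'mostly_missing' is 75% null, 'few_missing' is 25% null
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "few_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

threshold = 32  # percent, same cut-off as above
null_pct = toy.isnull().mean() * 100               # null percentage per column
to_drop = null_pct[null_pct >= threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)                # returns a new frame
print(to_drop)        # ['mostly_missing']
print(trimmed.shape)  # (4, 2)
```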

-  ###  Subtask 2.3: Handling Missing / Null values in each columns


In [None]:
# Sorting the columns by null percentage in descending order

(new_data.isnull().mean()*100).sort_values(ascending = False).head(18)

In [None]:
pd.set_option('display.max_rows',30751)


-  #### 2.3.1: For the column 'OCCUPATION_TYPE', we can impute the most frequent category, i.e., 'Laborers', since it is a categorical column.



In [None]:
new_data['OCCUPATION_TYPE'].isnull().sum()

In [None]:

new_data["OCCUPATION_TYPE"].value_counts()

#### impute the values

In [None]:
print(new_data["OCCUPATION_TYPE"].mode())

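# Impute the missing values with the mode, 'Laborers'
# (assigning the result back avoids chained in-place modification of a column)

new_data['OCCUPATION_TYPE'] = new_data['OCCUPATION_TYPE'].fillna('Laborers')
new_data['OCCUPATION_TYPE'].isnull().sum()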


- #### 2.3.2: For the columns 'EXT_SOURCE_3', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_QRT', we can impute the missing values with each column's mode. The 'AMT_REQ_CREDIT_BUREAU_*' columns count credit inquiries and their mode is '0', so imputing the mode seems to be a good approach.


#### impute the values

In [None]:
new_data['EXT_SOURCE_3'] = new_data['EXT_SOURCE_3'].fillna(new_data['EXT_SOURCE_3'].mode()[0])
new_data['AMT_REQ_CREDIT_BUREAU_YEAR'] = new_data['AMT_REQ_CREDIT_BUREAU_YEAR'].fillna(new_data['AMT_REQ_CREDIT_BUREAU_YEAR'].mode()[0])
new_data['AMT_REQ_CREDIT_BUREAU_MON'] = new_data['AMT_REQ_CREDIT_BUREAU_MON'].fillna(new_data['AMT_REQ_CREDIT_BUREAU_MON'].mode()[0])
new_data['AMT_REQ_CREDIT_BUREAU_WEEK'] = new_data['AMT_REQ_CREDIT_BUREAU_WEEK'].fillna(new_data['AMT_REQ_CREDIT_BUREAU_WEEK'].mode()[0])
new_data['AMT_REQ_CREDIT_BUREAU_DAY'] = new_data['AMT_REQ_CREDIT_BUREAU_DAY'].fillna(new_data['AMT_REQ_CREDIT_BUREAU_DAY'].mode()[0])
new_data['AMT_REQ_CREDIT_BUREAU_HOUR'] = new_data['AMT_REQ_CREDIT_BUREAU_HOUR'].fillna(new_data['AMT_REQ_CREDIT_BUREAU_HOUR'].mode()[0])
new_data['AMT_REQ_CREDIT_BUREAU_QRT'] = new_data['AMT_REQ_CREDIT_BUREAU_QRT'].fillna(new_data['AMT_REQ_CREDIT_BUREAU_QRT'].mode()[0])
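The repeated assignments above all follow one pattern, so they could also be written as a loop over a list of column names. This is only a sketch on made-up data; `REQ_DAY` and `REQ_WEEK` are hypothetical column names, not columns of the application dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few count-style columns with gaps
toy = pd.DataFrame({
    "REQ_DAY": [0.0, 0.0, np.nan, 1.0],
    "REQ_WEEK": [0.0, np.nan, 0.0, 2.0],
})

cols_to_impute = ["REQ_DAY", "REQ_WEEK"]  # hypothetical column names
for col in cols_to_impute:
    # mode() returns a Series (ties are possible), so take the first entry
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum().sum())  # 0
```

A loop keeps the column list in one place, which makes it harder to mistype a column name in one of several near-identical lines.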


In [None]:
# Checking the null values again after imputing the above columns

new_data.isnull().sum().sort_values(ascending = False).head(10)

In [None]:
# Checking info

new_data.info()

In [None]:
(new_data.isnull().mean()*100).sort_values(ascending = False).head(10)


-  #### 2.3.3: For the column 'NAME_TYPE_SUITE', we can impute the most frequent category (the mode printed below), since it is a categorical column.



In [None]:

new_data["NAME_TYPE_SUITE"].value_counts()

In [None]:
print(new_data['NAME_TYPE_SUITE'].mode())
new_data['NAME_TYPE_SUITE'] = new_data['NAME_TYPE_SUITE'].fillna(new_data['NAME_TYPE_SUITE'].mode()[0])
new_data['NAME_TYPE_SUITE'].isnull().sum()

-  #### 2.3.4: For the columns 'DEF_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', we can impute the most frequent value, i.e., '0'. These columns count how many people in the client's social surroundings were observed with, or defaulted on, 30/60 DPD (days past due), so the mode is a good choice.



#### Impute the values

In [None]:
new_data['DEF_30_CNT_SOCIAL_CIRCLE'] = new_data['DEF_30_CNT_SOCIAL_CIRCLE'].fillna(new_data['DEF_30_CNT_SOCIAL_CIRCLE'].mode()[0])
new_data['DEF_60_CNT_SOCIAL_CIRCLE'] = new_data['DEF_60_CNT_SOCIAL_CIRCLE'].fillna(new_data['DEF_60_CNT_SOCIAL_CIRCLE'].mode()[0])
new_data['OBS_30_CNT_SOCIAL_CIRCLE'] = new_data['OBS_30_CNT_SOCIAL_CIRCLE'].fillna(new_data['OBS_30_CNT_SOCIAL_CIRCLE'].mode()[0])
new_data['OBS_60_CNT_SOCIAL_CIRCLE'] = new_data['OBS_60_CNT_SOCIAL_CIRCLE'].fillna(new_data['OBS_60_CNT_SOCIAL_CIRCLE'].mode()[0])


In [None]:
new_data.isnull().sum().sort_values(ascending = False).head()

In [None]:
new_data.head()

-  #### 2.3.5: Dropping the columns which start with FLAG_DOCUMENT, since they do not contain any information about what type of documents these are.

In [None]:
new_data['FLAG_DOCUMENT_3'].value_counts()

In [None]:
flag_col = list(new_data.loc[:,'FLAG_DOCUMENT_2':'FLAG_DOCUMENT_21'])


In [None]:
flag_col

In [None]:
new_data.drop(flag_col, inplace=True, axis=1)


In [None]:
new_data.head()

In [None]:
new_data.isnull().sum().sort_values(ascending = False).head(6)

-  #### 2.3.6: For the column 'EXT_SOURCE_2', we can impute the most frequent value.

In [None]:
new_data['EXT_SOURCE_2'].mode()

In [None]:
new_data['EXT_SOURCE_2'] = new_data['EXT_SOURCE_2'].fillna(new_data['EXT_SOURCE_2'].mode()[0])

-  #### 2.3.7: For the column 'AMT_GOODS_PRICE', we can impute the median value, which is '450000.0'. This is a skewed numeric column, so the median is a better choice than the mean.


In [None]:
new_data['AMT_GOODS_PRICE'].median()

In [None]:
new_data['AMT_GOODS_PRICE'] = new_data['AMT_GOODS_PRICE'].fillna(new_data['AMT_GOODS_PRICE'].median())

-  #### 2.3.8: For the column 'AMT_ANNUITY', we can impute the median value, which is '24903.0'. This is a skewed numeric column, so the median is a better choice than the mean.


In [None]:
new_data['AMT_ANNUITY'].median()

In [None]:
new_data['AMT_ANNUITY'] = new_data['AMT_ANNUITY'].fillna(new_data['AMT_ANNUITY'].median())
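The reason for preferring the median on skewed columns can be seen in a tiny made-up example: a single very large value pulls the mean far away from the typical value, while the median barely moves. The numbers below are invented for illustration.

```python
import pandas as pd

# Made-up right-skewed amounts: one very large value pulls the mean up
amounts = pd.Series([100, 120, 130, 150, 5000])

print(amounts.mean())    # 1100.0
print(amounts.median())  # 130.0
```

Filling gaps with 1100 would place the imputed rows far from where most of the data sits, whereas 130 is a typical value.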

-  #### 2.3.9: For the column 'CNT_FAM_MEMBERS', we can impute the mode value, which is '2'. Since this is a numeric column holding a discrete count, the mode is a reasonable choice.

In [None]:
new_data['CNT_FAM_MEMBERS'].mode()

In [None]:
new_data['CNT_FAM_MEMBERS'] = new_data['CNT_FAM_MEMBERS'].fillna(new_data['CNT_FAM_MEMBERS'].mode()[0])

-  #### 2.3.10: For the column 'DAYS_LAST_PHONE_CHANGE', we can impute the mode value, which is '0'.


In [None]:
new_data['DAYS_LAST_PHONE_CHANGE'].mode()

In [None]:
new_data['DAYS_LAST_PHONE_CHANGE'] = new_data['DAYS_LAST_PHONE_CHANGE'].fillna(new_data['DAYS_LAST_PHONE_CHANGE'].mode()[0])

### We have now handled all the null/missing values

In [None]:
new_data.isnull().sum().sort_values(ascending = False).head(6)

In [None]:
new_data.head()

In [None]:
#Checking the shape of the dataframe
new_data.shape

In [None]:
#Checking the info

new_data.info()

In [None]:
# describing the data

new_data.describe()

## Task 3: Handling Errors

-  ####  Subtask 3.1: Checking the unique values of the columns that start with "DAYS" and handling the negative values.

In [None]:
print(new_data['DAYS_BIRTH'].unique())
print(new_data['DAYS_EMPLOYED'].unique())
print(new_data['DAYS_REGISTRATION'].unique())
print(new_data['DAYS_ID_PUBLISH'].unique())
print(new_data['DAYS_LAST_PHONE_CHANGE'].unique())


In [None]:
# Converting the negative values to absolute values using the abs function

days_cols = ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'DAYS_LAST_PHONE_CHANGE']
new_data[days_cols] = new_data[days_cols].abs()

In [None]:
new_data.head()

In [None]:
print(new_data['DAYS_BIRTH'].unique())
print(new_data['DAYS_EMPLOYED'].unique())
print(new_data['DAYS_REGISTRATION'].unique())
print(new_data['DAYS_ID_PUBLISH'].unique())
print(new_data['DAYS_LAST_PHONE_CHANGE'].unique())

-  ####  Subtask 3.2: Some columns contain the value 'XNA', which means 'Not Available'. We have to find the affected rows and columns and apply a suitable technique to fill those missing values or to delete them.


#### Let's find the categorical columns that contain 'XNA' values


#### For Gender column


In [None]:
new_data['CODE_GENDER'].unique()

In [None]:
new_data.loc[new_data['CODE_GENDER']=='XNA']

In [None]:
new_data['CODE_GENDER'].mode()

#### We can replace the "XNA" values of CODE_GENDER with the mode, which is F (Female)

In [None]:
new_data.loc[new_data['CODE_GENDER']=='XNA', 'CODE_GENDER'] = 'F'

In [None]:
print(new_data['CODE_GENDER'].unique())
new_data['CODE_GENDER'].value_counts()

In [None]:
new_data.head(2)

#### For ORGANIZATION_TYPE column

In [None]:
new_data['ORGANIZATION_TYPE'].value_counts()

#### The 'ORGANIZATION_TYPE' column has 307511 rows in total, of which 55374 contain 'XNA', i.e. about 18% of the column. We replace these values with NaN, which will not have any major impact on our dataset.

In [None]:
# np.nan is the supported spelling (the np.NaN alias was removed in NumPy 2.0)
new_data = new_data.replace('XNA', np.nan)
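The share of placeholder values and the effect of the replacement can be checked as follows. This is a sketch on a made-up column named `ORG` (an assumed name); only the counts 55374 and 307511 are taken from the text above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ORG": ["School", "XNA", "Bank", "XNA"]})  # made-up values

# Share of placeholder rows before replacing
xna_share = (toy["ORG"] == "XNA").mean() * 100

# Replace the placeholder in this one column only
toy["ORG"] = toy["ORG"].replace("XNA", np.nan)

print(xna_share)                  # 50.0
print(toy["ORG"].isnull().sum())  # 2

# The same arithmetic for the counts quoted above
real_share = round(55374 / 307511 * 100, 1)
print(real_share)                 # 18.0
```

Restricting the replacement to a single column, as in this sketch, avoids touching other columns that might legitimately contain the same string; the notebook cell above replaces it across the whole frame.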


In [None]:
new_data['ORGANIZATION_TYPE'].value_counts()

In [None]:
new_data.head()

## Task 4:  Analysis of Continuous variables and binning when required.
     
Now our task is to analyze the continuous variables and create bins for the required categories.

- #### Subtask 4.1: Analysis of the 'DAYS_BIRTH' column. The column is given in days, so we convert it to years by dividing by 365, then bin it into 'Young', 'Adult', 'Middle_Aged', 'Senior_Citizen' and store the result in a new variable called 'YEAR_BIRTH_BINNING'.


In [None]:
# Analysis of 'DAYS_BIRTH' column.

new_data['DAYS_BIRTH']

In [None]:
# The DAYS_BIRTH column is given in days, so we convert it to years by dividing by 365.

new_data['DAYS_BIRTH']= (new_data['DAYS_BIRTH']/365).astype(int)
new_data['YEAR_BIRTH_BINNING']= new_data['DAYS_BIRTH']


In [None]:
print(new_data['YEAR_BIRTH_BINNING'].unique())
print(new_data['YEAR_BIRTH_BINNING'].min())
print(new_data['YEAR_BIRTH_BINNING'].max())

In [None]:
# Binning the 'DAYS_BIRTH' into 'Young','Adult', 'Middle_Aged', 'Senior_Citizen' and storing in a new variable called 'YEAR_BIRTH_BINNING'

new_data['YEAR_BIRTH_BINNING']=pd.cut(new_data['YEAR_BIRTH_BINNING'], bins=[18,25,35,60,100], labels=['Young','Adult', 'Middle_Aged', 'Senior_Citizen'])

In [None]:
new_data['YEAR_BIRTH_BINNING'].value_counts()
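It is worth knowing how `pd.cut` treats values that fall exactly on a bin edge. By default each bin is open on the left and closed on the right, so with the edges used above an age of exactly 18 would not fall into any bin. A small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([18, 19, 25, 26, 35, 36, 60, 61])  # made-up ages
bins = [18, 25, 35, 60, 100]
labels = ['Young', 'Adult', 'Middle_Aged', 'Senior_Citizen']

# Default right=True gives the intervals (18, 25], (25, 35], (35, 60], (60, 100]
binned = pd.cut(ages, bins=bins, labels=labels)

# 18 lies on the open left edge of the first bin and becomes NaN;
# 25 and 35 fall into the bin that ends at them
print(binned.tolist())
```

If the lowest edge should be included, `pd.cut` accepts `include_lowest=True`.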

- #### Subtask 4.2: The DAYS_EMPLOYED column is given in days, so we convert it to years by dividing by 365 and store the result in a new variable 'YEARS_EMPLOYED'.


In [None]:
# DAYS_EMPLOYED is given in days: divide by 365 and store the result, rounded to one decimal, in 'YEARS_EMPLOYED'

new_data['YEARS_EMPLOYED'] = np.round(new_data.DAYS_EMPLOYED / 365, 1)


In [None]:
new_data['YEARS_EMPLOYED'].value_counts()

In [None]:
# After converting DAYS_EMPLOYED from days to years and storing it in YEARS_EMPLOYED,
# we observe values of about 1000 years employed, which is practically impossible.
# Hence we replace those values with NaN.

new_data[new_data['YEARS_EMPLOYED']>=1000].head()

In [None]:
new_data.loc[new_data['YEARS_EMPLOYED'] >= 1000, 'YEARS_EMPLOYED'] = np.nan
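The conversion and clean-up can be traced on a few made-up day counts. The value 365243 is used here as an assumed placeholder that produces the "about 1000 years" seen above; treat it as an illustration rather than a statement about every row in the data.

```python
import numpy as np
import pandas as pd

# Made-up day counts; 365243 is assumed to be a placeholder value
days = pd.Series([730, 3650, 365243])

years = np.round(days / 365, 1)    # 2.0, 10.0, 1000.7
years = years.mask(years >= 1000)  # impossible values become NaN

print(years.tolist())
print(years.max())  # 10.0 (NaN is skipped)
```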


In [None]:
new_data['YEARS_EMPLOYED'].min()

In [None]:
new_data.head()

- #### Subtask 4.3: Binning the 'AMT_INCOME_TOTAL' and 'AMT_CREDIT' columns as 'VERY_LOW', 'LOW', "MEDIUM", 'HIGH', 'VERY_HIGH' and storing the results in new variables called AMT_INCOME_TOTAL_RANGE and AMT_CREDIT_RANGE.


In [None]:
new_data['AMT_INCOME_TOTAL'].value_counts()

In [None]:
new_data['AMT_INCOME_TOTAL_RANGE'] = pd.qcut(new_data.AMT_INCOME_TOTAL, q=[0, 0.2, 0.5, 0.8, 0.95, 1], labels=['VERY_LOW', 'LOW', "MEDIUM", 'HIGH', 'VERY_HIGH'])
new_data['AMT_INCOME_TOTAL_RANGE'].value_counts()

In [None]:
new_data['AMT_CREDIT'].value_counts()

In [None]:
new_data['AMT_CREDIT_RANGE'] = pd.qcut(new_data.AMT_CREDIT, q=[0, 0.2, 0.5, 0.8, 0.95, 1], labels=['VERY_LOW', 'LOW', "MEDIUM", 'HIGH', 'VERY_HIGH'])
new_data['AMT_CREDIT_RANGE'].head()
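Unlike `pd.cut`, which uses fixed value edges, `pd.qcut` places the edges at quantiles, so each label covers a chosen share of the rows. A sketch with the same quantile list on made-up values from 1 to 100:

```python
import pandas as pd

incomes = pd.Series(range(1, 101))  # made-up values 1..100

ranges = pd.qcut(
    incomes,
    q=[0, 0.2, 0.5, 0.8, 0.95, 1],
    labels=['VERY_LOW', 'LOW', 'MEDIUM', 'HIGH', 'VERY_HIGH'],
)

# Each label receives the share of rows between its two quantiles:
# 20%, 30%, 30%, 15% and 5% of the 100 rows
counts = ranges.value_counts(sort=False)
print(counts)
```

On real data with many repeated values the group sizes are only approximately equal to these shares.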

## Task 5:  Data Analysis.


- ###  Subtask 5.1: Checking data Imbalance

In [None]:
plt.figure(figsize=(5,5))
plt.title("Percentage of clients with payment difficulties vs others", color = 'green')
new_data.TARGET.value_counts(normalize=True).plot.pie(autopct='%1.2f%%')
plt.show()

**Points to be concluded from the above graph**

Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)

1) The pie chart shows that about 8% of clients have payment difficulties.

2) The remaining 92% of clients have no payment difficulties, so the data is highly imbalanced.
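The imbalance can also be expressed as a single ratio of the majority class to the minority class. A minimal sketch on made-up labels that mirror the roughly 92/8 split read from the chart:

```python
import pandas as pd

# Made-up labels mirroring the approximate 92% / 8% split
target = pd.Series([0] * 92 + [1] * 8)

share = target.value_counts(normalize=True) * 100
imbalance_ratio = (target == 0).sum() / (target == 1).sum()

print(share)
print(imbalance_ratio)  # 11.5
```

A ratio of this size means raw counts in later plots will be dominated by the TARGET=0 group, which is why shares within each group are often easier to compare than counts.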



- ###  Subtask 5.2: Splitting Data with respect to TARGET=0 and TARGET=1

In [None]:
target_1 = new_data[new_data.TARGET==1]
target_0 = new_data[new_data.TARGET==0]

- ###  Subtask 5.3: Checking distribution of important columns

In [None]:
new_data.head(1)

In [None]:
# Checking Distribution of 'YEARS_EMPLOYED' column

plt.figure(figsize = [10,5])
sns.set(style="darkgrid")
sns.distplot(new_data.YEARS_EMPLOYED, color = 'magenta');
plt.title("Distribution of years employed", fontdict = {'fontsize': 15, 'fontweight': 20, 'color' : 'brown'})
plt.show()


**Points to be concluded from the above graph**

1)We can observe that the column "YEARS_EMPLOYED" is normally distributed.

2)The experience/years employed ranges from 0 to 50 years

In [None]:
# Checking Distribution of 'DAYS_BIRTH' column

plt.figure(figsize = [10,5])
sns.set(style="darkgrid")
sns.distplot(new_data.DAYS_BIRTH, color = 'violet');
plt.title("Distribution of Days Birth", fontdict = {'fontsize': 15, 'fontweight': 20, 'color' : 'brown'})
plt.show()


**Points to be concluded from the above graph**

1)From the above graph, we can observe that, the column "DAYS_BIRTH" is normally distributed.

2)The age ranges from 25 to 70 years

In [None]:
# Checking Distribution of 'AMT_CREDIT' column

plt.figure(figsize = [10,5])
sns.set(style="darkgrid")
sns.distplot(new_data.AMT_CREDIT, color = 'violet');
plt.title("Distribution of Amount Credit", fontdict = {'fontsize': 15, 'fontweight': 20, 'color' : 'brown'})
plt.show()


**Points to be concluded from the above graph**

1)From the above graph, we can observe that, the column "AMT_CREDIT" distribution curve does not appear to be normal or bell curve.


In [None]:
# Checking Distribution of 'AMT_ANNUITY' column

plt.figure(figsize = [10,5])
sns.set(style="darkgrid")
sns.distplot(new_data.AMT_ANNUITY, color = 'violet');
plt.title("Distribution of AMT_ANNUITY", fontdict = {'fontsize': 15, 'fontweight': 20, 'color' : 'brown'})
plt.show()


**Points to be concluded from the above graph**

1) From the above graph, we can observe that the "AMT_ANNUITY" distribution curve appears roughly bell-shaped.

2) However, the curve is skewed to the right, so it is not truly normal.


In [None]:
# Checking Distribution of 'AMT_GOODS_PRICE' column

plt.figure(figsize = [10,5])
sns.set(style="darkgrid")
sns.distplot(new_data.AMT_GOODS_PRICE, color = 'violet');
plt.title("Distribution of AMT_GOODS_PRICE", fontdict = {'fontsize': 15, 'fontweight': 20, 'color' : 'brown'})
plt.show()


**Points to be concluded from the above graph**

1) From the above graph, we can observe that the "AMT_GOODS_PRICE" distribution curve does not appear to be a normal (bell) curve.

2) The curve is skewed to the right.

In [None]:
# Analysis of column AMT_INCOME_TOTAL to find outliers

fig = px.box(new_data, y="AMT_INCOME_TOTAL", title=' Total Amount Income analysis')

fig.show()

In [None]:
new_data.AMT_INCOME_TOTAL.quantile([.5, .7, .9, .95, 0.99, 0.999, 0.9999])


**Few points can be concluded from the graph above**

Some outliers are visible in the income amount. The box itself appears very thin because the extreme values stretch the axis. The value of 117M seen in the box plot is a clear outlier.
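Box plots conventionally flag points beyond 1.5 times the interquartile range (IQR) from the box as outliers. The same rule can be applied numerically; the sketch below uses made-up income values, not figures from the dataset.

```python
import pandas as pd

# Made-up values with one extreme entry
income = pd.Series([90, 100, 110, 120, 130, 140, 150, 5000])

q1, q3 = income.quantile([0.25, 0.75])  # 107.5 and 142.5
iqr = q3 - q1                           # 35.0
upper_fence = q3 + 1.5 * iqr            # 195.0

outliers = income[income > upper_fence]
print(upper_fence)
print(outliers.tolist())  # [5000]
```

Listing the flagged rows this way complements the quantile table above when deciding whether an extreme value is a data error or a genuine high earner.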

In [None]:
# Analysis of column YEARS_EMPLOYED to find outliers

fig = px.box(new_data, y="YEARS_EMPLOYED", title='Years Employed Analysis', notched=True)
fig.show()


In [None]:
new_data.YEARS_EMPLOYED.quantile([.5, .7, .9, .95, 0.99, 0.999, 0.9999])


**Points to be concluded from the above plot**

The column 'DAYS_EMPLOYED' tells how many days before the application the person started their current employment; 'YEARS_EMPLOYED' is the same quantity in years.

We don't see any outliers in this case, as we have already handled the rows that had a value of about 1000 years.

In [None]:
# Analysis of column DAYS_BIRTH to find outliers

fig = px.box(new_data, y="DAYS_BIRTH", title='Days birth Analysis', notched = True)
fig.show()

**Points to be concluded from the above plot**

There are no outliers present in the age column.

In [None]:
# Analysis of column AMT_CREDIT to find outliers

fig = px.box(new_data, y="AMT_CREDIT",  title='Amount credit Analysis', notched = True)
fig.show() 

**Few points can be concluded from the plot above**

1) Some outliers are noticed in credit amount.

2) The credit amounts are right-skewed, which means most clients' credits lie in the lower part of the range.

In [None]:
# Analysis of column AMT_ANNUITY to find outliers

fig = px.box(new_data, y="AMT_ANNUITY", title='Amount Annuity Analysis', notched = True)
fig.show()


In [None]:
new_data.AMT_ANNUITY.quantile([.5, .7, .9, .95, 0.99])

**Few points can be concluded from the graph above**

1) Some outliers are noticed in annuity amount.

2) The annuity amounts are right-skewed, which means most clients' annuities lie in the lower part of the range.

3) The value above 258000 is an outlier here.

In [None]:
# Analysis of column AMT_GOODS_PRICE to find outliers


fig = px.box(new_data, y="AMT_GOODS_PRICE",title='AMT_GOODS_PRICE Analysis', notched = True)

fig.show()  

In [None]:
new_data.AMT_GOODS_PRICE.quantile([.5, .7, .9, .95, 0.99])

**Few points could be concluded from the above plot**

1) For consumer loans, this column is the price of the goods for which the loan is given.

2) Some outliers are noticed in the goods price.

3) The goods prices are right-skewed, with most values in the lower part of the range.

In [None]:
# Distribution of 'OCCUPATION_TYPE'

plt.figure(figsize=(8,5))

plt.title("The distribution of occupation- applied for loan", fontdict = {"fontsize":20, "fontweight":20, "color":"blue"})
plt.xlabel("Occupation", fontdict = {"fontsize":15, "fontweight":20, "color":"brown"})
plt.ylabel("count", fontdict = {"fontsize":15, "fontweight":20, "color":"brown"})
new_data['OCCUPATION_TYPE'].value_counts().plot.bar(color = 'violet')

plt.show()


**Inference from the above chart**

1) The three categories 'Laborers', 'Sales staff' and 'Core staff' account for the largest number of loan applicants.

2) 'IT staff' apply for the fewest loans.


- ###  Subtask 5.4: Univariate Analysis

Here we analyse single variables for target_1 and target_0 by plotting them side by side and drawing inferences.

Note that we already split the data into target_1 and target_0 in Subtask 5.2.


### Categorical variables:

In [None]:
# Helper to plot the counts of a categorical column side by side
# for target_1 (payment difficulties) and target_0 (all other cases)

def distribution(col, label_rotation=True, horizontal_layout=True, hue=None):
    if horizontal_layout:
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 6))
    else:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(12, 14))

    panels = [(ax1, target_1, 'clients with payment difficulties'),
              (ax2, target_0, 'clients without payment difficulties')]

    for ax, df, label in panels:
        # Bars are ordered by frequency within each group
        sns.countplot(ax=ax, data=df, x=col, order=df[col].value_counts().index, hue=hue, palette='magma')
        if label_rotation:
            ax.tick_params(axis='x', labelrotation=90)
        ax.set_xlabel(col)
        ax.set_ylabel("Count")
        ax.set_title('Distribution of ' + col + ' for ' + label, fontsize=10)

    plt.tight_layout()
    plt.show()

#### Plot to see the distribution of "NAME_INCOME_TYPE" for clients with payment difficulties vs clients without

In [None]:
distribution('NAME_INCOME_TYPE', hue='CODE_GENDER')

**Points to be concluded from the above graph**

1) Clients who are on maternity leave or unemployed show payment difficulties.

2) Here, females hold more credits than males.

3) Students and businessmen have no defaults.

#### Plot to see the distribution of "NAME_EDUCATION_TYPE" for clients with payment difficulties vs clients without

In [None]:
distribution('NAME_EDUCATION_TYPE', hue='CODE_GENDER')


**Points to be concluded from the above graph**


1) The count of clients with payment difficulties is higher for 'Secondary / secondary special' education than for 'Higher education' or 'Incomplete higher'.

2) Clients who have completed or are studying for an academic degree have no payment difficulties.

3) Females hold more credits than males.

#### Ordinal variables

#### Plot to see the distribution of "NAME_TYPE_SUITE" for clients with payment difficulties vs clients without

In [None]:
distribution('NAME_TYPE_SUITE', hue='CODE_GENDER')

**Points to be concluded from the above graph**

1) The count of clients with payment difficulties is higher for 'Unaccompanied' than for all other categories.

2) Categories such as 'Other_A' and 'Group of people' show no payment difficulties.

3) Females hold more credits than males.

#### Plot to see the distribution of "NAME_FAMILY_STATUS" for clients with payment difficulties vs clients without

In [None]:
distribution('NAME_FAMILY_STATUS', hue='CODE_GENDER')

**Points to be concluded from the above graph**

1) The share of separated and widowed clients is lower among those with payment difficulties when the two charts are compared.

2) Clients who are in a civil marriage or who are single default more often.


#### Plot to see the distribution of "NAME_HOUSING_TYPE" for clients with payment difficulties vs clients without

In [None]:
distribution('NAME_HOUSING_TYPE', hue='CODE_GENDER')

**Points to be concluded from the above graph**

1) We can conclude from the chart that most people live in a House/Apartment.

2) The proportion of people who live with their parents is higher among defaulters than among non-defaulters.

3) We infer that applicants who live with their parents have a higher chance of having payment difficulties.
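Comparing proportions across two count plots by eye is error-prone, so the same comparison can be made numerically as a default rate per category. This is a sketch on made-up rows; `HOUSING` is an assumed column name used only for illustration.

```python
import pandas as pd

# Made-up rows: TARGET is 1 for payment difficulties, 0 otherwise
toy = pd.DataFrame({
    "HOUSING": ["House", "House", "House", "House", "With parents", "With parents"],
    "TARGET":  [0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the default rate
default_rate = toy.groupby("HOUSING")["TARGET"].mean() * 100
print(default_rate)
```

In this toy data the rate is 25% for 'House' and 50% for 'With parents'; on the real data the same call on `new_data` would give the rate for each housing type.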


#### Plot to see the distribution of "NAME_CONTRACT_TYPE" for clients with payment difficulties vs clients without

In [None]:
distribution('NAME_CONTRACT_TYPE', hue='CODE_GENDER')

**Points to be concluded from the above graph**

1) The ‘Cash loans’ contract type has a higher number of credits than the ‘Revolving loans’ contract type.

2) Here too, females lead in applying for credit.

3) For target 1, revolving loans are held only by females.

#### Plot to see the distribution of "OCCUPATION_TYPE" for clients with payment difficulties vs clients without

In [None]:
plt.style.use('ggplot')

plt.figure(figsize = (15,6))
plt.subplot(1,2,1)

a1 = target_1[~(target_1.OCCUPATION_TYPE == 'Others')]
ax = (a1['OCCUPATION_TYPE'].value_counts(normalize = True)*100).plot.bar(color = 'green')
plt.title("Clients with payment difficulties", fontdict = {"fontsize":13, "fontweight":20, "color":"brown"})
plt.ylabel("Percentage", fontdict = {'fontsize': 12, 'fontweight': 20, 'color': 'brown'})
ticks = np.arange(0, 40, 5)
labels = ["{}%".format(i) for i in ticks]
plt.yticks(ticks, labels, rotation = 0)
plt.xticks(rotation = 90)
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                 textcoords = 'offset points')


plt.subplot(1,2,2)

b1 = target_0[~(target_0.OCCUPATION_TYPE == 'Others')]
ax = (b1['OCCUPATION_TYPE'].value_counts(normalize = True)*100).plot.bar(color = 'orange')
plt.title("clients with no payment difficulties", fontdict = {"fontsize":13, "fontweight":20, "color":"brown"})
plt.ylabel("Percentage", fontdict = {'fontsize': 12, 'fontweight': 20, 'color': 'brown'})
ticks = np.arange(0, 40, 5)
labels = ["{}%".format(i) for i in ticks]
plt.yticks(ticks, labels, rotation = 0)
plt.xticks(rotation = 90)

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.1f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                 textcoords = 'offset points')

plt.show()

**Inference from the above chart**

1) The three categories 'Laborers', 'Sales staff' and 'Core staff' account for the largest share of loan applicants and also for the largest share of clients with payment difficulties.

2) 'IT staff' apply for the fewest loans and form the smallest share in both groups.


### Plotting for Organization type in logarithmic scale


In [None]:
plt.figure(figsize=(15,30))
plt.title("Distribution of Organization type for target - 1")
plt.xticks(rotation=90)
plt.xscale('log')


sns.countplot(data=target_1,y='ORGANIZATION_TYPE',order=target_1['ORGANIZATION_TYPE'].value_counts().index,palette='cubehelix')
plt.show()

In [None]:
plt.figure(figsize=(15,30))
plt.title("Distribution of Organization type for target - 0")
plt.xticks(rotation=90)
plt.xscale('log')


sns.countplot(data=target_0,y='ORGANIZATION_TYPE',order=target_0['ORGANIZATION_TYPE'].value_counts().index,palette='cubehelix')
plt.show()

**Points to be concluded from the above graph**

1) Most clients who applied for credit come from the organization types ‘Business Entity Type 3’, ‘Self-employed’, ‘Other’, ‘Medicine’ and ‘Government’.

2) The fewest clients come from Industry: type 8, type 6 and type 10, Religion, and Trade: type 5 and type 4.

3) The distribution for target 0 follows the same pattern as for target 1.

#### Pie chart to see the % of "CODE_GENDER" for clients with payment difficulties vs clients without

In [None]:

ax1 = target_1['CODE_GENDER'].value_counts()
ind = ax1.index
val = ax1.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n With payment difficulties \n")])
fig.show()

ax2 = target_0['CODE_GENDER'].value_counts()
ind = ax2.index
val = ax2.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n Non payment difficulties \n")])
fig.show()

**Inference from the above chart**

1) The number of females taking loans is much higher than the number of males in both cases.



### Discrete variable - Binned variables

#### pie chart to see the % of "YEAR_BIRTH_BINNING" with payments difficulties vs Non-payments difficulties

In [None]:

ax1 = target_1['YEAR_BIRTH_BINNING'].value_counts()
ind = ax1.index
val = ax1.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n payment difficulties \n")])
fig.show()

ax2 = target_0['YEAR_BIRTH_BINNING'].value_counts()
ind = ax2.index
val = ax2.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n Non payment difficulties \n")])
fig.show()

**Points to be concluded from the above graph**

1) The graphs show that young clients make up a larger share of the clients with payment difficulties than of the clients without payment difficulties (roughly 7% versus 5%).

2) The same applies to adults.
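
The comparison behind these points can also be made numerically instead of being read off two pie charts. The following is only a minimal sketch on made-up data: the `toy` frame and the helper name `share_by_target` are assumptions for illustration and are not part of the notebook. On the real data one would pass `new_data` and a column such as `YEAR_BIRTH_BINNING`.

```python
import pandas as pd

def share_by_target(df, feature, target="TARGET"):
    """Percentage share of each category within each target group."""
    # normalize="columns" makes every target column sum to 1
    shares = pd.crosstab(df[feature], df[target], normalize="columns")
    return (shares * 100).round(1)

# Hypothetical toy data standing in for new_data
toy = pd.DataFrame({
    "YEAR_BIRTH_BINNING": ["Young", "Adult", "Adult", "Senior",
                           "Young", "Young", "Adult", "Senior"],
    "TARGET": [0, 0, 0, 0, 1, 1, 1, 1],
})

shares = share_by_target(toy, "YEAR_BIRTH_BINNING")
print(shares)
```

A category whose share is clearly larger in the `1` column than in the `0` column is over-represented among clients with payment difficulties.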

In [None]:
new_data.head()

#### pie chart to see the % of   "AMT_INCOME_TOTAL_RANGE"   with payments difficulties vs Non-payments difficulties

In [None]:

ax1 = target_1['AMT_INCOME_TOTAL_RANGE'].value_counts()
ind = ax1.index
val = ax1.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n With payment difficulties \n")])
fig.show();

ax2 = target_0['AMT_INCOME_TOTAL_RANGE'].value_counts()
ind = ax2.index
val = ax2.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n Non payment difficulties \n")])
fig.show();

**Points to be concluded from the above graph**

1) The graphs show that the 'Medium' and 'Low' income ranges make up a larger share of the clients with payment difficulties than of the clients without payment difficulties.

#### pie chart to see the % of   "AMT_CREDIT_RANGE"   with payments difficulties vs Non-payments difficulties

In [None]:
ax1 = target_1['AMT_CREDIT_RANGE'].value_counts()
ind = ax1.index
val = ax1.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n With payment difficulties \n")])
fig.show()

ax2 = target_0['AMT_CREDIT_RANGE'].value_counts()
ind = ax2.index
val = ax2.values
fig = go.Figure(data=[go.Pie(labels=ind, values=val, title = "\n Non payment difficulties \n")])
fig.show()

**Points to be concluded from the above graph**

1) The graphs show that the 'Medium' and 'Low' credit ranges make up a larger share of the clients with payment difficulties than of the clients without payment difficulties.

- ###  Subtask 5.4: Bivariate Analysis

- ### 5.4.1 Bivariate Analysis of Categorical vs Numerical Variables


#### Payment difficulties - NAME_INCOME_TYPE vs AMT_CREDIT

In [None]:

fig = px.box(target_1, x="NAME_INCOME_TYPE", y="AMT_CREDIT", color='NAME_FAMILY_STATUS',
             title="Income type vs Amount Credit - Loan Payment Difficulties") 
fig.show()

#### Non-Payment difficulties - NAME_INCOME_TYPE vs AMT_CREDIT

In [None]:
fig = px.box(target_0, x="NAME_INCOME_TYPE", y="AMT_CREDIT", color='NAME_FAMILY_STATUS',
             title="Income type vs Amount Credit - Non-Loan Payment Difficulties", notched = True) 
fig.show()

**Inference from the above plot**

1) The plots for clients with and without payment difficulties look almost the same.

2) Categories such as 'Pensioner' and 'State servant' show lower credit amounts, which suggests that they have fewer payment difficulties.

3) Outliers are present, and for 'Commercial associate' the outlier values are very large.
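
To put a number on the outliers instead of judging them by eye, the common 1.5 × IQR rule can be applied to a column. The snippet below is a minimal sketch on made-up values; the helper name `iqr_outliers` and the `credit` series are assumptions for illustration, and the plotting library may compute quartiles slightly differently.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical credit amounts with one extreme value
credit = pd.Series([100, 120, 130, 150, 160, 180, 200, 2000])
outliers = iqr_outliers(credit)
print(outliers.tolist())
```

On the real data the same helper could be applied per income type, for example inside a `groupby('NAME_INCOME_TYPE')` loop over `AMT_CREDIT`.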

#### Payment difficulties - AMT_INCOME_TOTAL_RANGE vs AMT_CREDIT

In [None]:

fig = px.box(target_1, x="AMT_INCOME_TOTAL_RANGE", y="AMT_CREDIT", color='NAME_FAMILY_STATUS',
             title="AMT_INCOME_TOTAL_RANGE vs AMT_CREDIT - Payment Difficulties") 
fig.show()

#### Non- Payment difficulties - AMT_INCOME_TOTAL_RANGE vs AMT_CREDIT

In [None]:
fig = px.box(target_0, x="AMT_INCOME_TOTAL_RANGE", y="AMT_CREDIT", color='NAME_FAMILY_STATUS',
             title="AMT_INCOME_TOTAL_RANGE vs AMT_CREDIT --- Non-Payment Difficulties") 
fig.show()

**Inference from the above chart**

1) Both plots appear similar.

2) In the very high income range, clients with family status 'single', 'separated' and 'married' have higher credit amounts than the others.

### Payment difficulties - NAME_EDUCATION_TYPE vs AMT_CREDIT

In [None]:

fig = px.box(target_1, x="NAME_EDUCATION_TYPE", y="AMT_CREDIT", color='NAME_FAMILY_STATUS',
             title="NAME_EDUCATION_TYPE vs AMT_CREDIT - Payment Difficulties") 
fig.show()

### Non- Payment difficulties - NAME_EDUCATION_TYPE vs AMT_CREDIT

In [None]:
# 'NAME_EDUCATION_TYPE' vs 'AMT_CREDIT' for Loan - Non Payment Difficulties

fig = px.box(target_0, x="NAME_EDUCATION_TYPE", y="AMT_CREDIT", color='NAME_FAMILY_STATUS',
             title="NAME_EDUCATION_TYPE vs AMT_CREDIT - Non-Payment Difficulties") 
fig.show()

**Inference from the above chart**

1) The plot is quite similar to the one for target 0. From the box plots we can say that, among clients with an 'Academic degree', the family statuses 'civil marriage', 'married' and 'separated' have higher credit amounts than the others.

2) Most of the outliers come from the education types 'Higher education' and 'Secondary'. For 'Academic degree', clients in a civil marriage have most of their credit amounts in the third quartile.

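3) In absolute numbers, more females than males have payment difficulties, in line with the gender pie charts above; gender itself is not shown in these box plots.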

### plot to see the Payment difficulties - AMT_INCOME_TOTAL vs OCCUPATION_TYPE

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x=target_1.OCCUPATION_TYPE, y=target_1.AMT_INCOME_TOTAL)
plt.xticks(rotation = 90)
plt.title("Payment difficulties - Amt_Income_total vs Occupation_type")
plt.show()

### plot to see the Non-Payment difficulties - AMT_INCOME_TOTAL vs OCCUPATION_TYPE

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x=target_0.OCCUPATION_TYPE, y=target_0.AMT_INCOME_TOTAL)
plt.xticks(rotation = 90)
plt.title("Non-Payment difficulties - Amt_Income_total vs Occupation_type")
plt.show()

**Inferences to be made from the above chart**

1) Both graphs look almost the same.

2) Among clients who do not default, managers have a much higher income than the other categories.

3) The probability of payment difficulties is high for Laborers.


- ### 5.4.2 Bivariate Analysis of Categorical vs Categorical Variables

Here we define a function to analyse the TARGET variable, i.e. payment difficulties, against a categorical feature.

In [None]:
# Define the function: for one categorical feature, plot the number of clients
# per category (first chart) and the share of clients with TARGET = 1 per category (second chart)

def stat_chart(feature,label_rotation=True,horizontal_layout=True):
    temp_data = new_data[feature].value_counts()
    df1 = pd.DataFrame({feature: temp_data.index,'count': temp_data.values})

    # TARGET is 0/1, so its mean per category is the share of clients with target=1
    per = new_data[[feature, 'TARGET']].groupby([feature],as_index=False).mean()
    per.sort_values(by='TARGET', ascending=False, inplace=True)
    
    if horizontal_layout:
        fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,6))
    else:
        fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(12,14))
    sns.set_color_codes("pastel")
    sns.barplot(ax=ax1, x = feature, y="count",data=df1)
    sns.barplot(ax=ax2, x = feature, y='TARGET', order=per[feature], data=per)
    if label_rotation:
        ax1.tick_params(axis='x', labelrotation=90)
        ax2.tick_params(axis='x', labelrotation=90)
    ax2.set_ylabel('Share of clients with TARGET = 1', fontsize=10)
    ax2.tick_params(axis='both', which='major', labelsize=10)

    plt.show()

#### NAME_EDUCATION_TYPE with maximum Loan-Payment Difficulties

In [None]:
stat_chart('NAME_EDUCATION_TYPE')



**Inference**

1) From the above plot, we can infer that clients with 'Lower secondary' education have the highest percentage of loan payment difficulties.

#### NAME_CONTRACT_TYPE with maximum Loan-Payment Difficulties

In [None]:
stat_chart('NAME_CONTRACT_TYPE')



**Inference**

1) The above plot shows that clients with the 'Cash loans' contract type have the highest percentage of loan payment difficulties.

#### CODE_GENDER with maximum Loan-Payment Difficulties

In [None]:
stat_chart('CODE_GENDER')


**Inference**

1) The above plot shows that male clients have the highest percentage of loan payment difficulties.

#### FLAG_OWN_CAR with maximum Loan-Payment Difficulties

In [None]:
stat_chart('FLAG_OWN_CAR')


**Inference**

1) The above plot shows that clients who own a car have a lower percentage of loan payment difficulties than clients without a car.

#### NAME_TYPE_SUITE with maximum Loan-Payment Difficulties

In [None]:
stat_chart('NAME_TYPE_SUITE')


**Inference**

1) The above plot shows that clients in the 'Other-B' category have the highest percentage of loan payment difficulties.

#### NAME_INCOME_TYPE with maximum Loan-Payment Difficulties

In [None]:
stat_chart('NAME_INCOME_TYPE')


**Inference**

1) The above plot shows that clients in the 'Maternity leave' category have the highest percentage of loan payment difficulties.

#### OCCUPATION_TYPE with maximum Loan-Payment Difficulties

In [None]:
stat_chart('OCCUPATION_TYPE')


**Inference**

1) The above plot shows that clients in the 'Low skilled Laborers' category have the highest percentage of loan payment difficulties.

#### YEAR_BIRTH_BINNING with maximum Loan-Payment Difficulties

In [None]:
new_data.head(1)

In [None]:
stat_chart('YEAR_BIRTH_BINNING')


**Inference**

1) The above plot shows that clients in the 'Young' age group have the highest percentage of loan payment difficulties.

#### AMT_INCOME_TOTAL_RANGE with maximum Loan-Payment Difficulties

In [None]:
stat_chart('AMT_INCOME_TOTAL_RANGE')


**Inference**

1) The above plot shows that clients with 'Low' income have the highest percentage of loan payment difficulties.

#### CNT_FAM_MEMBERS with maximum Loan-Payment Difficulties

In [None]:
stat_chart('CNT_FAM_MEMBERS')


**Inference**

1) The above plot shows that clients with 11 family members have the highest percentage of loan payment difficulties.
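
A rate computed for a category is only as reliable as the number of clients behind it, so it can help to look at the count and the rate side by side. The snippet below is a minimal sketch on made-up data; the helper name `rate_with_support` and the `toy` frame are assumptions for illustration and do not reproduce the notebook's figures.

```python
import pandas as pd

def rate_with_support(df, feature, target="TARGET"):
    """Per category: number of clients and share of clients with target == 1."""
    return (df.groupby(feature)[target]
              .agg(clients="count", default_rate="mean")
              .sort_values("default_rate", ascending=False))

# Hypothetical toy data: the largest family size appears only once
toy = pd.DataFrame({
    "CNT_FAM_MEMBERS": [1, 1, 1, 1, 2, 2, 2, 2, 11],
    "TARGET":          [0, 0, 0, 1, 0, 0, 1, 1, 1],
})

summary = rate_with_support(toy, "CNT_FAM_MEMBERS")
print(summary)
```

In this toy example the category with the highest rate rests on a single client, which is the kind of case worth checking before drawing a conclusion from the bar chart.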

- ### 5.4.3 Bivariate Analysis of Numerical vs Numerical Variables


In [None]:
new_data.head(1)

In [None]:
numerc = target_0[['AMT_CREDIT', 'AMT_ANNUITY', 'AMT_INCOME_TOTAL','DAYS_EMPLOYED', 'AMT_GOODS_PRICE', 'DAYS_BIRTH']]
sns.pairplot(numerc)

plt.show()

In [None]:
numerc = target_1[['AMT_CREDIT', 'AMT_ANNUITY', 'AMT_INCOME_TOTAL','DAYS_EMPLOYED', 'AMT_GOODS_PRICE', 'DAYS_BIRTH']]
sns.pairplot(numerc)

plt.show()

**Inferences**

- Credit amount is inversely related to the date of birth, which means the credit amount is higher for younger clients and vice versa.

- AMT_CREDIT and AMT_GOODS_PRICE are highly correlated in both cases (target 0 and target 1).

- Clients with a larger AMT_CREDIT have fewer difficulties in repaying the loan.
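
A relationship seen in a pair plot can be confirmed with a correlation coefficient computed separately for each target group. The following is a minimal sketch on made-up numbers; the helper name `corr_by_target` and the `toy` frame are assumptions for illustration.

```python
import pandas as pd

def corr_by_target(df, col_a, col_b, target="TARGET", method="pearson"):
    """Correlation between two numeric columns, computed per target group."""
    return pd.Series({key: group[col_a].corr(group[col_b], method=method)
                      for key, group in df.groupby(target)})

# Hypothetical toy data standing in for new_data
toy = pd.DataFrame({
    "AMT_CREDIT":      [100, 200, 300, 100, 200, 300],
    "AMT_GOODS_PRICE": [90, 180, 270, 100, 150, 290],
    "TARGET":          [0, 0, 0, 1, 1, 1],
})

result = corr_by_target(toy, "AMT_CREDIT", "AMT_GOODS_PRICE")
print(result)
```

Passing `method="spearman"` gives the rank-based coefficient, the same method used for the heatmaps in the multivariate analysis.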


- ### 5.4.4 Multivariate Analysis


In [None]:
# Spearman correlation of selected numerical columns, computed separately for target 0 and target 1

corr_columns = ["CNT_CHILDREN", "AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "REGION_POPULATION_RELATIVE","DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION", "DAYS_ID_PUBLISH", "HOUR_APPR_PROCESS_START", "REG_REGION_NOT_LIVE_REGION", "REG_REGION_NOT_WORK_REGION", "LIVE_REGION_NOT_WORK_REGION", "REG_CITY_NOT_LIVE_CITY", "REG_CITY_NOT_WORK_CITY", "LIVE_CITY_NOT_WORK_CITY"]

target1_corr=target_1[corr_columns]
target0_corr=target_0[corr_columns]

target0=target0_corr.corr(method='spearman')
target1=target1_corr.corr(method='spearman')

In [None]:
target0

In [None]:
target1

In [None]:
# Now, plotting the above correlations as a heatmap, which is a convenient way to visualize them

def targets_corr(data,title):
    # figure size and title styling
    plt.figure(figsize=(15, 10))
    plt.rcParams['axes.titlesize'] = 25
    plt.rcParams['axes.titlepad'] = 70

    # heatmap with a color map of choice
    sns.heatmap(data, cmap="YlGnBu",annot=True)

    plt.title(title)
    plt.yticks(rotation=0)
    plt.show()

In [None]:
# For Target 0

targets_corr(data=target0,title='Correlation for target 0')

**Inference**

The correlation heatmap above suggests a number of observations:

1) Credit amount is inversely related to the date of birth: it is higher for younger clients and vice versa.

2) Credit amount is inversely related to the number of children: it is higher for clients with fewer children and vice versa.

3) Income is inversely related to the number of children: clients with fewer children have a higher income and vice versa.

4) Clients in densely populated areas have fewer children.

5) Credit amount is higher in densely populated areas.

6) Income is also higher in densely populated areas.

In [None]:
# For Target 1

targets_corr(data=target1,title='Correlation for target 1')

**Inference**


The heatmap for target 1 leads to much the same observations as the one for target 0, with a few differences listed below.

1) Clients whose permanent address does not match their contact address have fewer children, and vice versa.

2) Clients whose permanent address does not match their work address have fewer children, and vice versa.

#### To find top 10 correlation

In [None]:
# Checking the columns
new_data.columns

In [None]:
# Finding top 10 correlation for target 0 (clients with no payment difficulties)
corr_cols = ['AMT_INCOME_TOTAL','AMT_CREDIT', 'CNT_FAM_MEMBERS','AMT_GOODS_PRICE','AMT_ANNUITY','DAYS_EMPLOYED','DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'YEARS_EMPLOYED', 'REGION_RATING_CLIENT']
corr = target_0[corr_cols].corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each pair appears once.
# The built-in bool is used because the np.bool alias was removed from recent NumPy versions.
corr_tar0=corr.where(np.triu(np.ones(corr.shape),k=1).astype(bool))
corr_df=corr_tar0.unstack().reset_index()
corr_df.columns = ['VAR1','VAR2','CORRELATION']
corr_df.dropna(subset=['CORRELATION'],inplace=True)
corr_df['CORR_ABS']=corr_df['CORRELATION'].abs()

# Sorting the values

corr_df.sort_values('CORR_ABS', ascending=False).head(10)


In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=False, cmap='coolwarm')
plt.show()

**Inferences from the above heatmap**

1) There is a high correlation between credit amount and goods price.

2) There is a high correlation between annuity and total income.

3) The number of years employed shows a low correlation with the other variables.

In [None]:
# Finding top 10 correlation for target 1 (clients with payment difficulties)

corr_cols = ['AMT_INCOME_TOTAL','AMT_CREDIT', 'CNT_FAM_MEMBERS','AMT_GOODS_PRICE','AMT_ANNUITY','DAYS_EMPLOYED','DAYS_BIRTH', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'YEARS_EMPLOYED', 'REGION_RATING_CLIENT']
corr = target_1[corr_cols].corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each pair appears once.
# The built-in bool is used because the np.bool alias was removed from recent NumPy versions.
corr_tar1=corr.where(np.triu(np.ones(corr.shape),k=1).astype(bool))
corr_df=corr_tar1.unstack().reset_index()
corr_df.columns = ['VAR1','VAR2','CORRELATION']
corr_df.dropna(subset=['CORRELATION'],inplace=True)
corr_df['CORR_ABS']=corr_df['CORRELATION'].abs()

# Sorting the values

corr_df.sort_values('CORR_ABS', ascending=False).head(10)


In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(corr, annot=False, cmap='ocean_r')
plt.show()

**Inferences**

1) There is a high correlation between credit amount and goods price.

2) There appear to be some differences between the correlations for defaulters and non-defaulters, for example credit amount vs income.

3) The correlation between loan annuity and credit amount is slightly lower for clients with payment difficulties.

4) The heatmap also suggests that the correlation for years employed is higher than for clients who fall under target 0.
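
The top-10 computation is written out twice above, once per target. One way to avoid the repetition is a small helper that takes any numeric frame; the version below is a minimal sketch, and the name `top_correlations` and the `toy` data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=10):
    """Return the n variable pairs with the largest absolute correlation."""
    corr = df.corr()
    # k=1 keeps the strict upper triangle, so each pair is listed once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    pairs = upper.unstack().dropna().reset_index()
    pairs.columns = ["VAR1", "VAR2", "CORRELATION"]
    pairs["CORR_ABS"] = pairs["CORRELATION"].abs()
    return (pairs.sort_values("CORR_ABS", ascending=False)
                 .head(n)
                 .reset_index(drop=True))

# Hypothetical toy data: b is an exact multiple of a, c is negatively related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

top = top_correlations(toy)
print(top)
```

With the notebook's data, the two calls would be `top_correlations(target_0[cols])` and `top_correlations(target_1[cols])` for a shared list of columns `cols`.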


## 1. Analysing previous application

Here we analyse the previous application data and draw inferences.


In [None]:
# Read the previous application data and store it in a dataframe.

df1=pd.read_csv("previous_application.csv")

In [None]:
# Cleaning the missing data

# listing the columns in which more than 30% of the rows are null
# (the threshold is 30% of the number of rows, len(df1), not of the number of columns)

emptycol1=df1.isnull().sum()
emptycol1=emptycol1[emptycol1.values>(0.3*len(df1))]
len(emptycol1)

In [None]:
# Removing the columns listed above

emptycol1 = list(emptycol1.index)
df1.drop(labels=emptycol1,axis=1,inplace=True)

df1.shape
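
The "more than 30% missing" rule can be expressed directly as a fraction per column, which avoids mixing up row counts and column counts. The snippet below is a minimal sketch on made-up data; the helper name `columns_above_missing_threshold` and the `toy` frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(df, threshold=0.3):
    """Names of the columns whose fraction of missing values exceeds threshold."""
    # The mean of a boolean column is the fraction of True values, here the nulls
    missing_fraction = df.isnull().mean()
    return list(missing_fraction[missing_fraction > threshold].index)

# Hypothetical toy data: B is 10% missing, C is 40% missing
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [1, np.nan, 3, 4, 5, 6, 7, 8, 9, 10],
    "C": [np.nan, np.nan, np.nan, np.nan, 5, 6, 7, 8, 9, 10],
})

to_drop = columns_above_missing_threshold(toy)
cleaned = toy.drop(columns=to_drop)
print(to_drop, cleaned.shape)
```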

In [None]:
# Removing the rows where 'NAME_CASH_LOAN_PURPOSE' is 'XNA' or 'XAP'
df1=df1.drop(df1[df1['NAME_CASH_LOAN_PURPOSE']=='XNA'].index)
df1=df1.drop(df1[df1['NAME_CASH_LOAN_PURPOSE']=='XAP'].index)

df1.shape

In [None]:
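# Now merging the application dataset with the previous application dataset.
# Column names present in both get the suffix '_' (application data) or 'x' (previous application data).

new_df=pd.merge(left=new_data,right=df1,how='inner',on='SK_ID_CURR',suffixes=('_', 'x'))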

In [None]:
#renaming cols after merging is done

new_df1 = new_df.rename({'NAME_CONTRACT_TYPE_' : 'NAME_CONTRACT_TYPE','AMT_CREDIT_':'AMT_CREDIT','AMT_ANNUITY_':'AMT_ANNUITY',
                         'WEEKDAY_APPR_PROCESS_START_' : 'WEEKDAY_APPR_PROCESS_START',
                         'HOUR_APPR_PROCESS_START_':'HOUR_APPR_PROCESS_START','NAME_CONTRACT_TYPEx':'NAME_CONTRACT_TYPE_PREV',
                         'AMT_CREDITx':'AMT_CREDIT_PREV','AMT_ANNUITYx':'AMT_ANNUITY_PREV',
                         'WEEKDAY_APPR_PROCESS_STARTx':'WEEKDAY_APPR_PROCESS_START_PREV',
                         'HOUR_APPR_PROCESS_STARTx':'HOUR_APPR_PROCESS_START_PREV'}, axis=1)
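
How the suffixes of a merge end up in the column names is easiest to see on two tiny frames. The snippet below is a minimal sketch; the frames `current` and `previous` and the suffixes `_CURR` and `_PREV` are assumptions chosen for readability and are not the names used in the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the application and previous-application data
current = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1000, 2000]})
previous = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "AMT_CREDIT": [500, 700, 900]})

# Columns that exist in both frames (other than the key) receive the suffixes
merged = pd.merge(current, previous, how="inner", on="SK_ID_CURR",
                  suffixes=("_CURR", "_PREV"))
print(merged)
```

Descriptive suffixes passed directly to `merge` would make a separate renaming step unnecessary. Note also that the merged frame has one row per previous application, so a client with several previous applications appears several times.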

In [None]:
#removing unwanted columns for analysis
new_df1.drop(['SK_ID_CURR','WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START','REG_REGION_NOT_LIVE_REGION', 
              'REG_REGION_NOT_WORK_REGION','LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
              'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY','WEEKDAY_APPR_PROCESS_START_PREV',
              'HOUR_APPR_PROCESS_START_PREV', 'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_LAST_APPL_IN_DAY'],axis=1,inplace=True)

In [None]:
new_df1.head()

In [None]:
new_df1.columns

In [None]:
sns.countplot(x=new_df1.NAME_CONTRACT_STATUS)
plt.xlabel("Contract Status")
plt.ylabel("Count of Contract Status")
plt.title("Distribution of Contract Status")
plt.show()

In [None]:
approved=new_df1[new_df1.NAME_CONTRACT_STATUS=='Approved']
refused=new_df1[new_df1.NAME_CONTRACT_STATUS=='Refused']
canceled=new_df1[new_df1.NAME_CONTRACT_STATUS=='Canceled']
unused=new_df1[new_df1.NAME_CONTRACT_STATUS=='Unused Offer']

In [None]:
def plot_func(var):
    fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15,5))
    
    s1=sns.countplot(ax=ax1,x=refused[var], data=refused, order= refused[var].value_counts().index,)
    ax1.set_title("Refused", fontsize=10)
    ax1.set_xlabel('%s' %var)
    ax1.set_ylabel("Count of Loans")
    s1.set_xticklabels(s1.get_xticklabels(),rotation=90)
    
    s2=sns.countplot(ax=ax2,x=approved[var], data=approved, order= approved[var].value_counts().index,)
    s2.set_xticklabels(s2.get_xticklabels(),rotation=90)
    ax2.set_xlabel('%s' %var)
    ax2.set_ylabel("Count of Loans")
    ax2.set_title("Approved", fontsize=10)
    s3=sns.countplot(ax=ax3,x=canceled[var], data=canceled, order= canceled[var].value_counts().index,)
    ax3.set_title("Canceled", fontsize=10)
    ax3.set_xlabel('%s' %var)
    ax3.set_ylabel("Count of Loans")
    s3.set_xticklabels(s3.get_xticklabels(),rotation=90)
    plt.show()

In [None]:
plot_func('TARGET')

In [None]:
refused.TARGET.value_counts(normalize=True)

In [None]:
approved.TARGET.value_counts(normalize=True)

In [None]:
canceled.TARGET.value_counts(normalize=True)

In [None]:
def plot_func1():
    # AMT_CREDIT vs AMT_GOODS_PRICE, one scatter plot per previous-application status
    fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(15,5))
    
    sns.scatterplot(ax=ax1,x='AMT_CREDIT',y='AMT_GOODS_PRICE',data=refused)
    ax1.set_title("Refused", fontsize=10)
    ax1.tick_params(axis='x', labelrotation=90)
    
    sns.scatterplot(ax=ax2,x='AMT_CREDIT',y='AMT_GOODS_PRICE',data=approved)
    ax2.set_title("Approved", fontsize=10)
    ax2.tick_params(axis='x', labelrotation=90)
    
    sns.scatterplot(ax=ax3,x='AMT_CREDIT',y='AMT_GOODS_PRICE',data=canceled)
    ax3.set_title("Canceled", fontsize=10)
    ax3.tick_params(axis='x', labelrotation=90)
    plt.show()

In [None]:
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
sns.scatterplot(x='AMT_APPLICATION',y='AMT_INCOME_TOTAL',data=refused)
plt.title('Refused')

plt.subplot(1,2,2)
sns.scatterplot(x='AMT_APPLICATION',y='AMT_INCOME_TOTAL',data=approved)
plt.title('Approved')
plt.show()

**Inference**

Loan request higher than 200k had a higher rejection rate. Also loan rejection rate was much lower if the income was higher than 500k.
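
An inference like this can be checked by binning the requested amount and computing the refusal rate per bin. The snippet below is a minimal sketch on made-up data; the `toy` frame, the band labels and the helper columns `AMT_BAND` and `REFUSED` are assumptions for illustration and do not reproduce the notebook's numbers.

```python
import pandas as pd

# Hypothetical toy data standing in for new_df1
toy = pd.DataFrame({
    "AMT_APPLICATION": [50_000, 120_000, 150_000, 250_000, 300_000, 450_000],
    "NAME_CONTRACT_STATUS": ["Approved", "Approved", "Refused",
                             "Refused", "Approved", "Refused"],
})

# Two bands split at 200k; intervals are closed on the right
toy["AMT_BAND"] = pd.cut(toy["AMT_APPLICATION"],
                         bins=[0, 200_000, float("inf")],
                         labels=["<=200k", ">200k"])
toy["REFUSED"] = toy["NAME_CONTRACT_STATUS"].eq("Refused")

# The mean of a boolean column is the share of refused applications per band
refusal_rate = toy.groupby("AMT_BAND", observed=True)["REFUSED"].mean()
print(refusal_rate)
```

The same pattern with `AMT_INCOME_TOTAL` and a split at 500k would address the second half of the inference.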

**Performing univariate analysis**

In [None]:
# Distribution of contract status in logarithmic scale
sns.set_style('whitegrid')
sns.set_context('talk')

plt.figure(figsize=(15,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.xticks(rotation=90)
plt.xscale('log')
plt.title('Distribution of contract status with purposes')
ax = sns.countplot(data = new_df1, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=new_df1['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'NAME_CONTRACT_STATUS',palette='magma')

Points to be concluded from the above plot:

1) Most loan rejections came from the purpose 'Repairs'.

2) For education purposes, the numbers of approvals and rejections are equal.

3) Paying other loans and buying a new car have significantly more rejections than approvals.
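
Because the plot uses a logarithmic scale, differences between approvals and rejections are hard to judge by bar length alone. A row-normalised cross-tabulation gives the shares directly. The snippet below is a minimal sketch on made-up data; the `toy` frame is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for new_df1
toy = pd.DataFrame({
    "NAME_CASH_LOAN_PURPOSE": ["Repairs", "Repairs", "Repairs",
                               "Education", "Education",
                               "Buying a new car", "Buying a new car",
                               "Buying a new car"],
    "NAME_CONTRACT_STATUS": ["Refused", "Refused", "Approved",
                             "Approved", "Refused",
                             "Refused", "Refused", "Approved"],
})

# normalize="index" makes every purpose row sum to 1
status_share = pd.crosstab(toy["NAME_CASH_LOAN_PURPOSE"],
                           toy["NAME_CONTRACT_STATUS"],
                           normalize="index")
print(status_share)
```

Sorting the result by the `Refused` column would list the purposes with the highest rejection share first.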

In [None]:

# Distribution of loan purposes by target in logarithmic scale

sns.set_style('whitegrid')
sns.set_context('talk')

plt.figure(figsize=(15,30))
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['axes.titlesize'] = 22
plt.rcParams['axes.titlepad'] = 30
plt.xticks(rotation=90)
plt.xscale('log')
plt.title('Distribution of purposes with target ')
ax = sns.countplot(data = new_df1, y= 'NAME_CASH_LOAN_PURPOSE', 
                   order=new_df1['NAME_CASH_LOAN_PURPOSE'].value_counts().index,hue = 'TARGET',palette='magma')

A few points we can conclude from the above plot:

1) Loans taken for 'Repairs' face more difficulties with on-time payment.

2) For a few purposes, the number of loans repaid is significantly higher than the number facing difficulties: 'Buying a garage', 'Business development', 'Buying land', 'Buying a new car' and 'Education'. Hence we can focus on these purposes, for which clients have minimal payment difficulties.

**Performing bivariate analysis**

In [None]:
# Box plotting for Credit amount in logarithmic scale

plt.figure(figsize=(16,12))
plt.xticks(rotation=90)
plt.yscale('log')
sns.boxplot(data =new_df1, x='NAME_CASH_LOAN_PURPOSE',hue='NAME_INCOME_TYPE',y='AMT_CREDIT_PREV',orient='v')
plt.title('Prev Credit amount vs Loan Purpose')
plt.show()

From the above we can conclude some points:

1) The credit amount is higher for loan purposes such as 'Buying a home', 'Buying a land', 'Buying a new car' and 'Building a house'.

2) Clients with the income type 'State servant' applied for a significant amount of credit.

3) 'Money for a third person' and 'Hobby' have lower credit amounts applied for.

In [None]:
# Bar plot of previous credit amount vs housing type, split by target

plt.figure(figsize=(16,12))
plt.xticks(rotation=90)
sns.barplot(data =new_df1, y='AMT_CREDIT_PREV',hue='TARGET',x='NAME_HOUSING_TYPE')
plt.title('Prev Credit amount vs Housing type')
plt.show()

Here, for housing type, 'Office apartment' has the higher credit amount for target 0 and 'Co-op apartment' has the higher credit amount for target 1. So, we can conclude that the bank should avoid giving loans to clients living in co-op apartments, as they have difficulties in payment. The bank can focus mostly on the housing types 'With parents', 'House / apartment' or 'Municipal apartment' for successful payments.

### Conclusion

After analysing the application data and the previous application data, we can draw conclusions from the observations made on various columns and identify the attributes that characterise repayers and defaulters.

### Below are the few observed metrics indicating that clients fall under the repayer category:


 - DAYS_BIRTH: Clients aged 50 or above are less likely to default.

 - NAME_INCOME_TYPE: The Student and Businessman categories have no defaults.

 - DAYS_EMPLOYED: Clients with more years of work experience are less likely to default.

 - AMT_INCOME_TOTAL: Clients with a higher income have very few defaults.

 - NAME_CASH_LOAN_PURPOSE: Loans taken for Hobby and Buying a garage are mostly repaid.

 - CNT_CHILDREN: Clients with fewer children or fewer family members show fewer defaults.

 - NAME_EDUCATION_TYPE: People with an Academic degree have fewer defaults.

 - ORGANIZATION_TYPE: Clients with Trade type 4 and 5 and Industry type 8 have shown a pattern of repaying their loans properly.




### Below are the few observed metrics indicating that clients fall under the defaulter category:


- OCCUPATION_TYPE: Low-skill Laborers, Drivers, Security staff, Laborers and Cooking staff have high payment difficulties.

- NAME_INCOME_TYPE: Clients who are either on Maternity leave or Unemployed have a high rate of not repaying their loans.

- DAYS_BIRTH: Young and Adult clients tend to have more difficulty repaying the loan.

- NAME_EDUCATION_TYPE: The Lower secondary and Secondary education categories have a high probability of default.

- DAYS_EMPLOYED: Clients with a shorter employment history have high payment difficulties.

- NAME_FAMILY_STATUS: The Civil marriage and Single categories default a lot, so their applications can be rejected.

- CODE_GENDER: The percentage of loan defaults is higher for men.




### Hence we recommend that the bank look into the above metrics before approving loans to its clients

# <font color = blue> Thank You !! </font>