It's an honor that this notebook was featured in @headsortails' [Hidden Gems series](https://www.kaggle.com/general/232447). I would be happy if you could go through my notebook and leave your honest feedback in the comments. Also, upvote if you liked it! 😄 *Happy Kaggling*

The main aim of this kernel is to use data to answer some very common questions that arise when applying for a loan.

**The Home Mortgage Disclosure Act (HMDA)** in America requires many financial institutions to maintain, report, and publicly disclose loan-level information about mortgages. These public data are important because they help show whether lenders are serving the housing needs of their communities; they give public officials information that helps them make decisions and policies; and they shed light on lending patterns that could be discriminatory.

Let's see a video to get a better view.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('wR9Tsdqgmuk',700,400)

The dataset we have here covers all the mortgages in Washington during the year 2016. I'll analyze this dataset to find patterns in the lending process. It's an interesting dataset that has so far been left unexplored.

In [None]:
import numpy as np 
import pandas as pd
from plotly import tools
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import seaborn as sns
import matplotlib.pyplot as plt

import os
print(os.listdir("../input"))

In [None]:
df=pd.read_csv('../input/washington-state-home-mortgage-hdma2016/Washington_State_HDMA-2016.csv')# load the dataset

A peek at the data.

In [None]:
df.head()

In [None]:
df['action_taken_name'].value_counts()

Loan originated is a widely used term in finance; it means the loan application has been approved. We will remove the rows where the applicant has withdrawn the loan request. Loan purchased means that the lender bought the loan on a secondary market.

**What is a secondary market ?**

>It often happens that after making the loan, the lender sells the loan and its servicing rights to an investor in the secondary market. You might be wondering why a secondary market is needed.

>When a person takes out a home loan, the loan is underwritten, funded and serviced by a bank. Because the bank has used its own funds to make the loan, it will eventually run out of money to lend, so it sells the loan on the secondary market to replenish the money available for more home loans.

Since our analysis only concerns the primary market, where borrowers and lenders are involved, I'll remove the rows where the action taken is "Loan purchased by the institution".

In [None]:
df=df[df['action_taken_name']!="Application withdrawn by applicant"]
df=df[df['action_taken_name']!='Loan purchased by the institution']

Each county has a unique FIPS or county code that is useful in plotting the county map. I'll upload an additional dataset to add this feature.

In [None]:
fips=pd.read_excel("../input/2016-state-county-fips-codes/all-geocodes-v2016.xlsx",converters={'County Code (FIPS)': lambda x: str(x)})
fips=fips[fips["State Code (FIPS)"]==53]# state code for washington
fips['county_code']=fips['State Code (FIPS)'].astype(str).str.cat(fips['County Code (FIPS)'].astype(str))
fips=fips.drop(labels=['Summary Level','State Code (FIPS)','County Code (FIPS)','County Subdivision Code (FIPS)','Place Code (FIPS)','Consolidtated City Code (FIPS)'],axis=1)
fips.columns=['county_name','county_code']

In [None]:
df=pd.merge(df,fips,how="left",on="county_name",sort=False)#merging it into original dataset
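Because this is a left join on `county_name`, a county whose name is spelled differently in the two tables would silently end up with a missing `county_code`. Here is a minimal sketch of one way to check for that. The frames `loans` and `codes` are made-up stand-ins for `df` and `fips`, and the misspelled row is there only to trigger the check.

```python
import pandas as pd

# Hypothetical stand-ins for df and fips; the real tables are much larger.
loans = pd.DataFrame({"county_name": ["King County", "Pierce County", "Kng County"]})
codes = pd.DataFrame({
    "county_name": ["King County", "Pierce County"],
    "county_code": ["53033", "53053"],
})

# indicator=True adds a _merge column saying where each row was found.
merged = pd.merge(loans, codes, how="left", on="county_name", indicator=True)

# Rows marked "left_only" had no matching county name in the code table.
unmatched = merged.loc[merged["_merge"] == "left_only", "county_name"].tolist()
print(unmatched)
```

An empty list would mean every loan row picked up a county code.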

In [None]:
county1=pd.DataFrame(df['county_code'].value_counts())
county1=county1.reset_index()
county1.columns=['county_code','number of loans']#renaming columns

# Which county has the highest number of loan applications?

In [None]:
fips=county1['county_code'].tolist()
values=county1['number of loans'].tolist()
endpts = list(np.mgrid[min(values):max(values):13j])

colorscale = ["#141d43","#15425a","#0a6671","#26897d","#67a989","#acc5a6","#e0e1d2",
              "#f0dbce","#e4ae98","#d47c6f","#bb4f61","#952b5f","#651656","#330d35"] 

fig = ff.create_choropleth(
    fips=fips, values=values, scope=['Washington'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Number of Loans by county',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.2}, exponent_format=False
)
iplot(fig, filename='loans_washington')
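The `np.mgrid[min(values):max(values):13j]` expression in the cell above relies on NumPy's complex-step convention: an imaginary "step" means "this many points, endpoints included" rather than a step size. The sketch below, with made-up bounds, shows that it yields the same 13 evenly spaced bin edges as `np.linspace`. As far as I understand the choropleth helper, 13 edges define 14 bins, which would explain the 14 colors in `colorscale`.

```python
import numpy as np

lo, hi = 10, 250  # made-up minimum and maximum loan counts

edges_mgrid = np.mgrid[lo:hi:13j]         # complex step: 13 points, endpoints included
edges_linspace = np.linspace(lo, hi, 13)  # the same edges, spelled out

print(len(edges_mgrid), edges_mgrid[0], edges_mgrid[-1])
print(np.allclose(edges_mgrid, edges_linspace))
```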

King County has the highest number of loan applications while Garfield County has the lowest. Now, we will add a new feature that is "approved" for originated loans and "not approved" for everything else.

In [None]:
df['loan_status']=["approved" if x=="Loan originated" else "not approved" for x in df['action_taken_name']]

# Which county has the highest approval rate?

In [None]:
df_approved=df[df['loan_status']=='approved']
df_notapproved=df[df['loan_status']=='not approved']

In [None]:
county2=pd.DataFrame(df_approved['county_code'].value_counts())
county2=county2.reset_index()
county2.columns=['county_code','number of loans approved']
county2=pd.merge(county2,county1,how="left",on="county_code",sort=False)
# approval rate (%) = approved loans / all loan applications in the county
county2['approval rate']=county2['number of loans approved']/county2['number of loans']*100
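The same rate can also be computed in one step straight from `loan_status`, because the mean of a boolean column is the share of `True` values. This is only a sketch on a made-up `sample` frame that borrows the column names of `df`:

```python
import pandas as pd

# Made-up sample with the same column names as df.
sample = pd.DataFrame({
    "county_code": ["53033", "53033", "53033", "53019", "53019"],
    "loan_status": ["approved", "approved", "not approved",
                    "approved", "not approved"],
})

# The mean of a boolean column is the fraction of True values.
approval_rate = (
    sample["loan_status"].eq("approved")
    .groupby(sample["county_code"])
    .mean()
    .mul(100)
)
print(approval_rate.round(1))
```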

In [None]:
fips=county2['county_code'].tolist()
values=county2['approval rate'].tolist()
endpts = list(np.mgrid[min(values):max(values):13j])
colorscale = ["#141d43","#15425a","#0a6671","#26897d","#67a989","#acc5a6","#e0e1d2",
              "#f0dbce","#e4ae98","#d47c6f","#bb4f61","#952b5f","#651656","#330d35"]
fig = ff.create_choropleth(
    fips=fips, values=values, scope=['Washington'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Approval rate (%) by county',
    county_outline={'color': 'rgb(255,255,255)', 'width': 0.2},
    exponent_format=True,
)
iplot(fig, filename='approved_loans_washington')

King County has the highest approval rate at 76.2%, so if you apply for a home loan in this county your chances are better. Ferry County has the lowest approval rate in Washington, at 52.1%.

# Does your purpose play a role?
Loans are applied for to purchase a home, to make home improvements, or to refinance an existing mortgage.

In [None]:
df_purpose=pd.crosstab(df['loan_purpose_name'],df['loan_status'])
df_purpose=df_purpose.reset_index()
df_purpose.columns=['purpose','approved count','not approved count']
l=[]
for x in range(3):
    l.append(df_purpose['approved count'][x]/(df_purpose['approved count'][x]+ df_purpose['not approved count'][x]))
df_purpose['percent approved']=[x*100 for x in np.array(l)]
df_purpose['percent not approved']=[100-x for x in df_purpose['percent approved']]
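The percentage columns built above can also come directly from `pd.crosstab` through its `normalize` argument, which avoids hard-coding the number of categories in a loop. A small sketch on made-up rows (the `sample` frame is hypothetical and only reuses the column names):

```python
import pandas as pd

# Made-up rows using the same column names as df.
sample = pd.DataFrame({
    "loan_purpose_name": ["Home purchase", "Home purchase", "Refinancing",
                          "Refinancing", "Home improvement"],
    "loan_status": ["approved", "approved", "approved",
                    "not approved", "not approved"],
})

# normalize="index" turns each row of counts into shares that sum to 1.
shares = pd.crosstab(sample["loan_purpose_name"], sample["loan_status"],
                     normalize="index") * 100
print(shares)
```

Each row then holds the percent approved and percent not approved for one purpose.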

In [None]:
trace1=go.Bar(
x= df_purpose['purpose'],
y= df_purpose['approved count'],
name='approved',
marker=dict(
    color='#009393'))
trace2=go.Bar(
x= df_purpose['purpose'],
y=df_purpose['not approved count'],
name='not approved',
marker=dict(
        color='#930000'))
trace3=go.Bar(
x=df_purpose['purpose'],
y=df_purpose['percent approved'],
name='percent approved',
marker=dict(
    color='#8eb48b'))
trace4=go.Bar(
x=df_purpose['purpose'],
y=df_purpose['percent not approved'],
name="percent not approved",
marker=dict(
        color='#7fc780'))

fig = tools.make_subplots(rows=1, cols=2,subplot_titles=('Approved loans for different purposes','Percent of loans approved for different purposes'))
fig.append_trace(trace1,1,1)
fig.append_trace(trace2,1,1)
fig.append_trace(trace3,1,2)
fig.append_trace(trace4,1,2)

fig['layout'].update(height=600, width=900,barmode='stack')
iplot(fig)

Most of the home purchase loans are approved while it's less likely for a refinancing loan or a home improvement loan to get approved.

# Which type of loans have a better chance of being approved?
Many loans are insured or guaranteed by government programs offered by Federal Housing Administration (FHA), the Department of Veterans Affairs (VA), or the Department of Agriculture's Rural Housing Service (RHS) or Farm Service Agency (FSA). All other loans are classified as conventional.

In [None]:
df_type=pd.crosstab(df['loan_type_name'],df['loan_status'])
df_type=df_type.reset_index()
df_type.columns=['type','approved count','not approved count']
l=[]
for x in range(4):
    l.append(df_type['approved count'][x]/(df_type['approved count'][x]+ df_type['not approved count'][x]))
df_type['percent approved']=[x*100 for x in l]
df_type['percent not approved']=[100-x for x in df_type['percent approved']]

In [None]:
trace1 = {"x":df_type['percent approved'] ,
          "y": df_type['type'] ,
          "marker": {"color": "rgba(255, 182, 193, .9)", "size": 20},
          "mode": "markers",
          "name": "percent approved",
          "type": "scatter"
}

trace2 = {"x": df_type['percent not approved'],
          "y": df_type['type'],
          "marker": {"color": "rgba(152, 0, 0, .8)", "size": 20},
          "mode": "markers",
          "name": "percent not approved",
          "type": "scatter",
}
data = [trace1, trace2]
layout = go.Layout(title="Loan Status for different type of loans",
                  height=500,
                  width=700,
                  autosize=False,
                  margin=go.layout.Margin(
        l=150,
        r=50,
        b=100,
        t=100,
        pad=4
    ))

fig = go.Figure(data=data, layout=layout)
iplot(fig)#let's plot

I use percentages as the measure to avoid any scaling issues. FSA/RHS loans have a good record of getting approved. Even conventional loans, which are not backed by government programmes, have good chances.

Let's have a look at the distribution of applicant's income and loan amount.

In [None]:
sns.set(style="white", palette="deep", font_scale=1.2, 
        rc={"figure.figsize":(15,9)})
ax = sns.scatterplot(x="loan_amount_000s", y="applicant_income_000s", hue="loan_status",data=df)

I'm quite amazed by the blue dot on the far right: a person with a low income gets a loan for a very large amount. However, the orange dots in the top-left corner can be justified, as applications get rejected for lack of documents or unverifiable credentials.

In [None]:
df_temp=df[df['loan_amount_000s']<20000]
ax = sns.scatterplot(x="loan_amount_000s", y="applicant_income_000s", hue="loan_status",data=df_temp)

# Do homes occupied by owners have a higher chance at getting loans?
owner_occupancy_name represents the owner-occupancy status of the property. Second homes, vacation homes, and rental properties are classified as "not owner-occupied as a principal dwelling".

In [None]:
df_owner=pd.crosstab(df['owner_occupancy_name'],df['loan_status'])
df_owner=df_owner.reset_index()
df_owner.columns=['owner_occupancy','approved','not approved']
l=[]
for x in range(3):
    l.append(df_owner['approved'][x]/(df_owner['approved'][x]+ df_owner['not approved'][x]))
df_owner['percent approved']=[x*100 for x in l]
df_owner['percent not approved']=[100-x for x in df_owner['percent approved']]

In [None]:
df_hoepa=pd.crosstab(df['hoepa_status_name'],df['loan_status'])
df_hoepa=df_hoepa.reset_index()
df_hoepa.columns=['hoepa_status','approved','not approved']
l=[]
for x in range(2):
    l.append(df_hoepa['approved'][x]/(df_hoepa['approved'][x]+ df_hoepa['not approved'][x]))
df_hoepa['percent approved']=[x*100 for x in l]
df_hoepa['percent not approved']=[100-x for x in df_hoepa['percent approved']]

In [None]:
trace1=go.Bar(
x= df_owner['owner_occupancy'],
y= df_owner['percent approved'],
name='percent approved',
marker=dict(
    color='rgb(158,202,225)'))
trace2=go.Bar(
x= df_owner['owner_occupancy'],
y=df_owner['percent not approved'],
name='percent not approved',
marker=dict(
        color='rgba(219, 64, 82, 0.7)'))
trace3=go.Bar(
x=df_hoepa['hoepa_status'],
y=df_hoepa['percent approved'],
name='percent approved',
marker=dict(
    color='rgba(204,204,204,1)'))
trace4=go.Bar(
x=df_hoepa['hoepa_status'],
y=df_hoepa['percent not approved'],
name="percent not approved",
marker=dict(
        color='rgba(222,45,38,0.8)'))

fig = tools.make_subplots(rows=1, cols=2,subplot_titles=('Approval % for owner occupancy','Approval % for HOEPA Status'))
fig.append_trace(trace1,1,1)
fig.append_trace(trace2,1,1)
fig.append_trace(trace3,1,2)
fig.append_trace(trace4,1,2)

fig['layout'].update(height=600, width=900,barmode='group')
iplot(fig)

The HOEPA status tells us whether a loan is subject to the Home Ownership and Equity Protection Act or not. From the above chart it seems like 100% of the HOEPA loans got approved.

Homes occupied by owners as principal dwellings have a slightly better chance of getting approved when compared to non-owner-occupied homes. But the difference is very small, so we can't make a statement with much confidence.
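One way to put a number on "how confident" is a two-proportion z-test. The sketch below is an illustration only: the helper `two_proportion_z` is my own hypothetical function and the counts are made up, not taken from the dataset.

```python
from math import sqrt, erfc

def two_proportion_z(approved_a, total_a, approved_b, total_b):
    """Return (z, two-sided p-value) for the difference of two approval rates."""
    p_a = approved_a / total_a
    p_b = approved_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (approved_a + approved_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Made-up counts, not taken from the dataset.
z, p = two_proportion_z(approved_a=700, total_a=1000, approved_b=680, total_b=1000)
print(round(z, 2), round(p, 3))
```

A large p-value would suggest the gap could be due to chance. With the real counts from `df_owner` the result may well differ, since much larger groups make even small gaps detectable.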

# Does the neighbourhood family income affect your chances of getting a loan?
Instead of disclosing the address, lenders disclose the census tract, which is the part of the community where the property is located. Each census tract is located in a Metropolitan Statistical Area/Metropolitan Division (MSA/MD). The hud_median_family_income is the median family income in dollars for the MSA/MD in which the tract is located.

Now, you must be expecting that for a loan to be approved, the applicant's income must be similar to or above the neighbourhood median family income. Right?

Well that turns out to be true!

In [None]:
# Express the HUD median family income in thousands, matching applicant_income_000s
df['hud_median_family_income_000s']=df['hud_median_family_income']/1000
# Rebuild the subsets as copies so they carry the new column without a SettingWithCopyWarning
df_approved=df[df['loan_status']=='approved'].copy()
df_notapproved=df[df['loan_status']=='not approved'].copy()


In [None]:
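# numeric_only=True skips text columns, which newer pandas versions refuse to average
approved_msamd_diff=df_approved.groupby('msamd_name').mean(numeric_only=True)
not_approved_msamd_diff=df_notapproved.groupby('msamd_name').mean(numeric_only=True)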

In [None]:
trace0 = go.Scatter(
    x = approved_msamd_diff.index,
    y = approved_msamd_diff['hud_median_family_income_000s'],
    mode = 'lines+markers',
    name = 'Neighbourhood median family income',
    line = dict(
        color = "#009393")
)
trace1 = go.Scatter(
    x = approved_msamd_diff.index,
    y = approved_msamd_diff['applicant_income_000s'],
    mode = 'lines+markers',
    name = 'Applicant income',
    line= dict( color= "#230405")
)
data=[trace0,trace1]
layout = dict(title = 'Difference in neighborhood median family income and applicant income for approved loans  ',
              xaxis = dict(title = 'MSA/MD'),
              yaxis = dict(title = 'Income'),
              margin=go.layout.Margin(
        l=50,
        r=50,
        b=200,
        t=100,
        pad=4
    )
              )

fig = dict(data=data, layout=layout)

iplot(fig)

What's very unsettling is that for the loans not approved, the applicant's income is still greater than the neighbourhood median family income. Maybe I did something wrong here; ***please leave a comment if you have any idea what went wrong***.

In [None]:
trace0 = go.Scatter(
    x = not_approved_msamd_diff.index,
    y = not_approved_msamd_diff['hud_median_family_income_000s'],
    mode = 'lines+markers',
    name = 'Neighbourhood median family income',
    line = dict(
        color = "#009393")
)
trace1 = go.Scatter(
    x = not_approved_msamd_diff.index,
    y = not_approved_msamd_diff['applicant_income_000s'],
    mode = 'lines+markers',
    name = 'Applicant income',
    line= dict( color= "#230405")
)
data=[trace0,trace1]
layout = dict(title = 'Difference for the loans not approved',
              xaxis = dict(title = 'MSA/MD'),
              yaxis = dict(title = 'Income'),
               margin=go.layout.Margin(
        l=50,
        r=50,
        b=200,
        t=100,
        pad=4
    )
              )

fig = dict(data=data, layout=layout)

iplot(fig)
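One possible explanation, offered as a guess rather than a verified finding: the two charts compare the *mean* applicant income per MSA/MD against a *median* family income, and a mean is pulled upward by a few very high earners. A toy illustration with made-up incomes:

```python
import pandas as pd

# Made-up applicant incomes (in thousands) for one hypothetical MSA/MD.
incomes = pd.Series([40, 45, 50, 55, 60, 900])

print(incomes.mean())    # dragged upward by the single 900 outlier
print(incomes.median())  # barely affected by it
```

If that is what is happening here, grouping by `msamd_name` and taking the median of `applicant_income_000s` instead of the mean might give a fairer comparison.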

# Which property type should you apply for?
There are basically 3 property types: 1-4 family dwelling, multifamily dwelling and manufactured housing. Manufactured homes are housing that is essentially ready for occupancy upon leaving the factory and being transported to a building site.

In [None]:
df_property=pd.crosstab(df['property_type_name'],df['loan_status'])
df_property=df_property.reset_index()

l=[]
for x in range(df_property.shape[0]):
    l.append(df_property['approved'][x]/(df_property['approved'][x]+ df_property['not approved'][x]))
df_property['percent approved']=[x*100 for x in l]
df_property['percent not approved']=[100-x for x in df_property['percent approved']]
df_property['property_type_name']=df_property['property_type_name'].replace("One-to-four family dwelling (other than manufactured housing)",'1-4 Family dwelling')


In [None]:
trace1 = go.Bar(
    y=df_property['property_type_name'],
    x=df_property['percent approved'],
    name='percent approved',
    orientation = 'h',
    marker = dict(
        color = '#7bc043 '
        
    )
)
trace2 = go.Bar(
    y=df_property['property_type_name'],
    x=df_property['percent not approved'],
    name='percent not approved',
    orientation = 'h',
    marker = dict(
        color = '#fdf498 '
       
    )
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='stack',
    title="Effect of Property Type",
    
    margin=go.layout.Margin(
        l=200,
        r=50,
        b=100,
        t=100,
        pad=4
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Multifamily-dwelling is a safe choice when going for house loans.

> Getting a higher residential mortgage on a multifamily dwelling is easier based on the rental income generated, which can cover or reduce the mortgage.

It's tougher to get a loan for manufactured housing.

> Many manufactured home loan programs have strict guidelines about the property condition and age. That’s because manufactured housing tends to depreciate, while traditional home values tend to increase over time.

# What is the effect of tract_to_msamd_income?
tract_to_msamd_income is the median family income of the census tract where the property is located, expressed as a percentage of the median family income for the MSA/MD and rounded to two decimal places.
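As a worked example of that definition, with made-up income figures (the two input variables are hypothetical and do not come from the dataset):

```python
# Made-up median family incomes, in dollars.
tract_median_income = 60_000   # census tract of the property
msamd_median_income = 80_000   # surrounding MSA/MD

tract_to_msamd_income = round(tract_median_income / msamd_median_income * 100, 2)
print(tract_to_msamd_income)  # 75.0
```

A value below 100 means the tract is poorer than its MSA/MD as a whole, and a value above 100 means it is richer.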

In [None]:
trace0 = go.Scatter(
    x = approved_msamd_diff.index,
    y = approved_msamd_diff['tract_to_msamd_income'],
    mode = 'lines+markers',
    name = 'approved',
    line = dict(
        color = "#a77d5f")
)
trace1 = go.Scatter(
    x = not_approved_msamd_diff.index,
    y = not_approved_msamd_diff['tract_to_msamd_income'],
    mode = 'lines+markers',
    name = 'not approved',
    line= dict( color= "#930000")
)
data=[trace0,trace1]
layout = dict(title = '',
              xaxis = dict(title = 'MSA/MD'),
              yaxis = dict(title = 'tract_to_msamd_income'),
              margin=go.layout.Margin(
        l=50,
        r=50,
        b=200,
        t=100,
        pad=4
    )
              )

fig = dict(data=data, layout=layout)

iplot(fig)

This ratio is greater for approved loans than for non-approved loans in every MSA/MD. And this difference matters a lot if you are applying in Kennewick or Yakima.

# How does the lien status affect the loan?
A lien serves to guarantee an underlying obligation of the repayment of a loan. If the underlying obligation is not satisfied, the creditor (lender) may be able to seize the asset that is the subject of the lien. Once executed, a lien becomes the legal right of a creditor to sell the collateral property of a debtor who fails to meet the obligations of a loan or other contract.

Most mortgages are secured by a lien against the property. In the event of a forced liquidation, first lien holders will generally get paid before subordinate lien holders.

In [None]:
df_lien=pd.crosstab(df['lien_status_name'],df['loan_status'])
df_lien=df_lien.reset_index()
df_lien.columns=['lien_status','approved','not approved']
l=[]
for x in range(3):
    l.append(df_lien['approved'][x]/(df_lien['approved'][x]+ df_lien['not approved'][x]))
df_lien['percent approved']=[x*100 for x in l]
df_lien['percent not approved']=[100-x for x in df_lien['percent approved']]

In [None]:
trace1 = go.Bar(
    y=df_lien['lien_status'],
    x=df_lien['percent approved'],
    name='percent approved',
    orientation = 'h',
    marker = dict(
        color = 'rgba(71, 58, 131, 0.8)',
        line = dict(
            color = 'rgba(38, 24, 74, 0.8)',
            width = 3)
    )
)
trace2 = go.Bar(
    y=df_lien['lien_status'],
    x=df_lien['percent not approved'],
    name='percent not approved',
    orientation = 'h',
    marker = dict(
        color = 'rgba(190, 192, 213, 1)',
        line = dict(
            color = 'rgba(164, 163, 204, 0.85)',
            width = 3)
    )
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group',
    title="Effect of lien status",
    margin=go.layout.Margin(
        l=200,
        r=50,
        b=100,
        t=100,
        pad=4
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Loans secured by a first lien get approved 70% of the time. Loans not secured by any lien have the highest percentage of not getting approved.

# What is the loan amount pattern for people in different income categories?
To study this, I'll first categorize people as falling in the low (applicant income up to 100k), medium (above 100k and up to 200k) or high (more than 200k) income range.

In [None]:
# Bucket applicants by income (in thousands of dollars); rows with missing income stay NaN
df['applicant_income_range'] = np.nan
df.loc[df['applicant_income_000s'] <= 100, 'applicant_income_range'] = 'Low'
df.loc[(df['applicant_income_000s'] > 100) & (df['applicant_income_000s'] <= 200), 'applicant_income_range'] = 'Medium'
df.loc[df['applicant_income_000s'] > 200, 'applicant_income_range'] = 'High'
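The three `.loc` assignments above can also be written as a single binning step with `pd.cut`. This is a sketch on a made-up `income` series that assumes the same cut-offs. Note that it returns a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

income = pd.Series([35, 100, 101, 200, 450, np.nan])  # made-up, in thousands

# Right-closed bins: (-inf, 100], (100, 200], (200, inf)
income_range = pd.cut(income, bins=[-np.inf, 100, 200, np.inf],
                      labels=["Low", "Medium", "High"])
print(income_range.tolist())
```

Missing incomes stay missing, just as in the `.loc` version.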

In [None]:
df_approved=df[df['loan_status']=='approved']
df_notapproved=df[df['loan_status']=='not approved']

In [None]:
trace0 = go.Box(
    y=df_approved['loan_amount_000s'],
    x=df_approved['applicant_income_range'],
    name='approved',
    marker=dict(
        color='#3D9970'
    )
)
trace1 = go.Box(
    y=df_notapproved['loan_amount_000s'],
    x=df_notapproved['applicant_income_range'],
    name='not approved',
    marker=dict(
        color='#FF4136'
    )
)
data = [trace0, trace1]
layout = go.Layout(
    yaxis=dict(
        title='',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

Okay, this is a very compressed view. Loans with very high amounts are always rejected for applicants in the low and medium income ranges, while they may get accepted for applicants in the high income range. Let's get a zoomed-in view by limiting the loan amount to less than $1.5 million (a value of 1500 in loan_amount_000s).

In [None]:
df_approved1=df_approved[df_approved['loan_amount_000s']<1500]
df_notapproved1=df_notapproved[df_notapproved['loan_amount_000s']<1500]

In [None]:
trace0 = go.Box(
    y=df_approved1['loan_amount_000s'],
    x=df_approved1['applicant_income_range'],
    name='approved',
    marker=dict(
        color='#3D9970'
    )
)
trace1 = go.Box(
    y=df_notapproved1['loan_amount_000s'],
    x=df_notapproved1['applicant_income_range'],
    name='not approved',
    marker=dict(
        color='#FF4136'
    )
)
data = [trace0, trace1]
layout = go.Layout(
    yaxis=dict(
        title='',
        zeroline=False
    ),
    boxmode='group'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

In all income ranges, approved loans tend to have larger requested amounts than loans that were not approved. As usual, people generally request loan amounts proportionate to their incomes. In most cases, applications where the loan amount far exceeds the applicant's income get rejected.
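To make "proportionate to income" concrete, one could add a loan-to-income ratio column. The sketch below uses made-up rows. The `loan_to_income` column is my own illustrative name, and it is not the same thing as the debt-to-income ratio that lenders report as a denial reason.

```python
import pandas as pd

# Made-up rows using the same column names as df (both in thousands of dollars).
sample = pd.DataFrame({
    "loan_amount_000s": [200, 450, 900],
    "applicant_income_000s": [100, 90, 60],
})

# How many years of gross income the requested loan represents.
sample["loan_to_income"] = sample["loan_amount_000s"] / sample["applicant_income_000s"]
print(sample)
```

Comparing this ratio between `df_approved` and `df_notapproved` would be one way to check the claim on the real data.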

# Is the lending process discriminatory?
Now I analyze whether one gender or race is preferred over another. Not applicable means the applicant is not a natural person, for example an organisation.

In [None]:
df_sex=pd.crosstab(df['applicant_sex_name'],df['loan_status'])
df_sex=df_sex.reset_index()
df_sex.columns=['sex','approved','not approved']
l=[]
for x in range(df_sex.shape[0]):
    l.append(df_sex['approved'][x]/(df_sex['approved'][x]+ df_sex['not approved'][x]))
df_sex['percent approved']=[x*100 for x in np.array(l)]
df_sex['percent not approved']=[100-x for x in df_sex['percent approved']]
df_sex['sex']=df_sex['sex'].replace('Information not provided by applicant in mail, Internet, or telephone application','Info not provided')

In [None]:
df_ethnicity=pd.crosstab(df['applicant_ethnicity_name'],df['loan_status'])
df_ethnicity=df_ethnicity.reset_index()
df_ethnicity.columns=['ethnicity','approved','not approved']
l=[]
for x in range(df_ethnicity.shape[0]):
    l.append(df_ethnicity['approved'][x]/(df_ethnicity['approved'][x]+ df_ethnicity['not approved'][x]))
df_ethnicity['percent approved']=[x*100 for x in np.array(l)]
df_ethnicity['percent not approved']=[100-x for x in df_ethnicity['percent approved']]
df_ethnicity['ethnicity']=df_ethnicity['ethnicity'].replace('Information not provided by applicant in mail, Internet, or telephone application','Info not provided')

In [None]:
df_race=pd.crosstab(df['applicant_race_name_1'],df['loan_status'])
df_race=df_race.reset_index()
df_race.columns=['race','approved','not approved']
l=[]
for x in range(df_race.shape[0]):
    l.append(df_race['approved'][x]/(df_race['approved'][x]+ df_race['not approved'][x]))
df_race['percent approved']=[x*100 for x in np.array(l)]
df_race['percent not approved']=[100-x for x in df_race['percent approved']]
df_race['race']=df_race['race'].replace('Information not provided by applicant in mail, Internet, or telephone application','Info not provided')

In [None]:
trace0=go.Bar(
x=df_ethnicity['ethnicity'],
y=df_ethnicity['percent approved'],
name='percent approved',
marker=dict(color='#051e3e '))
trace1=go.Bar(x=df_ethnicity['ethnicity'],
              y=df_ethnicity['percent not approved'],
              name='percent not approved',
             marker=dict(
             color='#851e3e '))
trace2=go.Bar(x=df_sex['sex'],
             y=df_sex['percent approved'],
             name='percent approved',
             marker=dict(color='#96ceb4  '))
trace3=go.Bar(x=df_sex['sex'],
             y=df_sex['percent not approved'],
             name='percent not approved',
             marker=dict(color='#ff6f69  '))
trace4=go.Scatter(x=df_race['race'],
                 y=df_race['percent approved'],
                 name='percent approved')
trace5=go.Scatter(x=df_race['race'],
                 y=df_race['percent not approved'],
                 name='percent not approved')
fig = tools.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                          subplot_titles=('Applicant ethnicity','Applicant sex', 'Applicant race'))

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 2, 1)
fig.append_trace(trace5, 2, 1)

fig['layout'].update( height=900,width=1000,paper_bgcolor = "rgb(255, 248, 243)",margin=go.layout.Margin(
        l=50,
        r=50,
        b=200,
        t=100,
        pad=4
    ))
iplot(fig)

There isn't much difference between approval rates for females and males; however, there seems to be a little discrimination based on ethnicity and race, favoring applicants who are not Hispanic or Latino and those who are White or Asian.

# What are the major reasons for loan denial?

In [None]:
df_reason=pd.DataFrame(df_notapproved['denial_reason_name_1'].value_counts())
df_reason=df_reason.reset_index()
df_reason.columns=['reason','number of loans']

In [None]:
trace0 = go.Bar(
    x=df_reason['reason'],
    y=df_reason['number of loans'],
    
    marker=dict(
        color='rgb(158,202,225)',
        line=dict(
            color='rgb(8,48,107)',
            width=1.5,
        )
    ),
    opacity=0.6
)

data = [trace0]
layout = go.Layout(
    title='Major loan denial reasons',height=500,
                  width=700,
                  autosize=False,
                  margin=go.layout.Margin(
        l=50,
        r=50,
        b=200,
        t=100,
        pad=4
    )
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Applications are mostly denied for a high debt-to-income ratio, poor credit history, or lack of suitable collateral. The most easily avoidable reason is an incomplete credit application, yet a total of 5433 applications were denied because of it.

## Summary
To increase the chances of getting your loan approved in Washington, you must keep the following points in mind.

* Get a home loan in King County (highest approval rate).
* Application for a home purchase loan has a better chance at getting approved.
* Go for an FSA/RHS guaranteed loan. They also have lenient requirements when compared to conventional loans, but this comes at the cost of higher mortgage interest.
* There isn't much difference, but loans for owner-occupied homes have a slightly better chance.
* Go for HOEPA. In this dataset, loans with HOEPA status always got approved.
* In whichever MSAMD you are buying a home, your income should be greater than or equal to the median family income of that MSAMD.
* Apply for a multifamily or 1-4 family dwelling.
* Get your loan secured by a lien, preferably a first lien.
* Apply for loans proportional to your income. Most loans get rejected for a high debt-to-income ratio.
* I don't think you can change any demographic features.

