# Home Loan Prediction
The dataset `home_loans_1.csv` contains home loan applications in San Diego County; each row is an individual loan application. This data could be used to build a machine learning model that predicts whether to accept or reject a loan application.

**Your goal in this assignment is to understand the data and how biases can emerge in datasets.**


## Part 1: Data Exploration

Upload the .zip file ('data.zip') included in the homework assignment. I **strongly** recommend using the following code rather than the Colab web interface for uploading files, particularly for those with slower internet connections. 

In [2]:
from google.colab import files
uploaded = files.upload()

Saving data.zip to data.zip


In [3]:
import zipfile
import io
zf = zipfile.ZipFile(io.BytesIO(uploaded['data.zip']),"r")
zf.extractall()

The first few exercises will get you used to looking at the data using `pandas`, a widely used Python library for manipulating data. 

> *Optional: Why? Datasets can consume a _lot_ of space in your computer's memory and traditional python data structures like lists or dictionaries will become painfully slow as we add thousands of rows of data. We use a specialized dataset library `pandas` which has a specialized data structure called a `dataframe` designed to be ultra fast & efficient. Documentation is here: https://pandas.pydata.org/pandas-docs/stable/*



In [4]:
import pandas as pd # import pandas library
df = pd.read_csv('data/home_loans_1.csv', low_memory=False) # read the csv file into a pandas dataframe object



To understand what kind of data was collected, `pandas` has some handy commands:
- `df.head()` will show us the first 5 rows of our dataset. You can also specify the first N rows, like `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset
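
As a quick illustration of the commands above, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and its values are assumptions for illustration only; they are not rows from `home_loans_1.csv`.

```python
import pandas as pd

# Toy stand-in for the loan data (made-up values, not taken from home_loans_1.csv).
toy = pd.DataFrame({
    "town_name": ["El Cajon", "Poway", "Del Mar", "Poway"],
    "loan_amount_000s": [600.0, 450.0, 900.0, 500.0],
    "loan_approved": [1, 0, 1, 1],
})

first_rows = toy.head(2)          # first 2 rows
random_rows = toy.sample(2)       # 2 randomly chosen rows
n_rows, n_cols = toy.shape        # shape is (number of rows, number of columns)
column_names = list(toy.columns)  # column labels as a plain list
summary = toy.describe()          # summary statistics for numeric columns only

print(n_rows, n_cols)
print(column_names)
print(summary)
```

Note that `shape` always reports rows first and columns second, which is easy to mix up when reading the tuple.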



### Question 1.A:  How many rows are in this dataset? How many columns?
_Double click to write your answer question here. Show your work in code below if applicable._

60122 rows and 12 columns

In [8]:
# df.shape returns (rows, columns): 60122 rows and 12 columns

df.shape


(60122, 12)

### Question 1.B: One of the columns in the dataset is the outcome value for each application, the value we will try to predict. Which column is that?
_Double click to write your answer question here. Show your work in code below if applicable._

The `loan_approved` column, where 1 means approved and 0 means denied.

In [10]:
df.head(5)


Unnamed: 0,town_name,loan_amount_000s,applicant_income_000s,is_hoepa_loan,occupied_by_owner,loan_purpose_name,loan_approved,denial_reason,co_applicant_sex,co_applicant_race,applicant_sex,applicant_race
0,El Cajon,607.322158,43.881427,1,1,Home purchase,0,Collateral,Male,White,Male,White
1,El Cajon,524.421187,44.530808,1,1,Home purchase,1,,Male,White,Female,White
2,El Cajon,595.130929,57.733958,1,1,Home purchase,1,,No co-applicant,No co-applicant,Male,Asian
3,El Cajon,595.332174,56.69338,1,1,Refinancing,1,,No co-applicant,No co-applicant,"Information not provided by applicant in mail,...","Information not provided by applicant in mail,..."
4,El Cajon,666.25182,49.78161,1,1,Home improvement,0,Credit history,No co-applicant,No co-applicant,Male,White


### Question 1.C: What reasons were given in this dataset for denying a loan application?
Hint: Try looking up the pandas command to list the unique values in a column.

_Double click to write your answer question here. Show your work in code below if applicable._

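The dataset lists nine denial reasons: Collateral, Credit history, Debt-to-income ratio, Credit application incomplete, Mortgage insurance denied, Unverifiable information, Insufficient cash (downpayment, closing costs), Other, and Employment history. The `nan` entry means no denial reason was recorded. Because we want to predict loan approval, it also helps to know which factors lead to denial.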

In [45]:
df['denial_reason'].unique()

array(['Collateral', nan, 'Credit history', 'Debt-to-income ratio',
       'Credit application incomplete', 'Mortgage insurance denied',
       'Unverifiable information',
       'Insufficient cash (downpayment, closing costs)', 'Other',
       'Employment history'], dtype=object)
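
The missing value in a column like this can be handled explicitly. The sketch below uses a small made-up Series (the name `reasons` and its values are assumptions, not the real column) to show how to list distinct reasons without the missing entry and how to count how often each value occurs.

```python
import numpy as np
import pandas as pd

# Made-up denial reasons; NaN marks an application with no denial reason recorded.
reasons = pd.Series(
    ["Collateral", np.nan, "Credit history", np.nan, "Collateral", "Other"],
    name="denial_reason",
)

# Drop missing values before listing the distinct reasons.
distinct_reasons = reasons.dropna().unique()

# Count every value, keeping NaN as its own row in the result.
reason_counts = reasons.value_counts(dropna=False)

print(distinct_reasons)
print(reason_counts)
```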

### Question 1.D: Given the denial reasons and the columns in this dataset, think about what information you _don't_ have about each application. Rank your top 3 _missing_ pieces of information about each application that could help you better predict the application's loan outcome.
_Double click to write your answer question here. Show your work in code below if applicable._
#1.  stability of income (job)
#2.  current value of their own property (and bank balance)
#3.  desired loan term

## Part 2: Understanding Bias in Datasets

### Question 2.A: Does the likelihood of loan approval differ by town in this data?

You may find the groupby function useful for answering this question.

_Double click to write your answer question here. Show your work in code below if applicable._

Loan approval does not appear to differ meaningfully by town. The approval rate for each town (approved applications divided by total applications) is:

Carlsbad 72.5 %

Chula Vista 73.2 %

Coronado 72.2 %

Del Mar 73.3 %

El Cajon 73.7 %

Escondido 73.4 %

La Mesa 71.9 %

National City 72.9 %

Oceanside 72.4 %

Poway 72.1 %

San Diego 73.4 %

Solana Beach 72.4 %

Since every town's approval rate falls between about 72% and 74%, the differences between towns are small.

In [52]:
df.groupby('town_name').size()
df.groupby('town_name')['loan_approved'].sum()

town_name
Carlsbad         3821
Chula Vista      3582
Coronado         3655
Del Mar          3956
El Cajon         3599
Escondido        3590
La Mesa          3513
National City    3571
Oceanside        3505
Poway            3754
San Diego        3623
Solana Beach     3641
Name: loan_approved, dtype: int64
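
The counts above are approved applications, not rates. A minimal sketch of turning them into approval rates follows, assuming `loan_approved` is coded 0/1 as in the `df.head()` output. The `toy` frame is made-up data for illustration.

```python
import pandas as pd

# Made-up applications: 4 in Poway (3 approved), 2 in Del Mar (1 approved).
toy = pd.DataFrame({
    "town_name": ["Poway", "Poway", "Poway", "Poway", "Del Mar", "Del Mar"],
    "loan_approved": [1, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the approval rate.
rate_by_town = toy.groupby("town_name")["loan_approved"].mean()

# Equivalent two-step version: approved count divided by total applications.
approved = toy.groupby("town_name")["loan_approved"].sum()
total = toy.groupby("town_name").size()
rate_check = approved / total

print((rate_by_town * 100).round(1))
```

In a notebook cell only the last expression is displayed, so two `groupby` lines in one cell show just the second result unless each is printed or assigned.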

### Question 2.B: Does the likelihood of loan approval differ by gender in this data?

_Double click to write your answer question here. Show your work in code below if applicable._

About 60% of female applicants were approved, compared with about 80% of male applicants. A gap of roughly 20 percentage points is large, so the likelihood of approval does differ by gender in this data.

In [31]:
df.groupby('applicant_sex').size()
df.groupby('applicant_sex')['loan_approved'].sum()

applicant_sex
Female                                                                               14311
Information not provided by applicant in mail, Internet, or telephone application    10249
Male                                                                                 19250
Name: loan_approved, dtype: int64
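
To compare groups directly, it can help to put application counts, approved counts, and approval rates side by side, and to set aside applicants whose sex was not provided. The sketch below does this on made-up data; the `toy` frame and the constant name `NOT_PROVIDED` are assumptions for illustration, while the category label itself is copied from the output above.

```python
import pandas as pd

# Category label as it appears in the dataset output.
NOT_PROVIDED = (
    "Information not provided by applicant in mail, Internet, "
    "or telephone application"
)

# Made-up applications: 5 female (3 approved), 5 male (4 approved), 2 unknown.
toy = pd.DataFrame({
    "applicant_sex": ["Female"] * 5 + ["Male"] * 5 + [NOT_PROVIDED] * 2,
    "loan_approved": [1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0],
})

# Keep only rows where the applicant's sex is known.
known = toy[toy["applicant_sex"] != NOT_PROVIDED]

# One row per group: number of applications, number approved, approval rate.
summary = known.groupby("applicant_sex")["loan_approved"].agg(
    applications="count", approved="sum", approval_rate="mean"
)

# Gap between the two groups, in percentage points.
gap_points = (
    summary.loc["Male", "approval_rate"] - summary.loc["Female", "approval_rate"]
) * 100

print(summary)
print(round(gap_points, 1))
```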

### Question 2.C: Does the likelihood of loan approval differ by race in this data?

_Double click to write your answer question here. Show your work in code below if applicable._

The approval rate for each race is:

American Indian or Alaska Native: 68.7 %

Asian: 71.3 %

Black or African American: 44.1 %

Multiracial: 44.5 %

Native Hawaiian or Other Pacific Islander: 71.7 %

White: 74.8 %

Black or African American and Multiracial applicants were approved about 45% of the time, whereas the other groups were approved about 70% of the time.

Hence, loan approval does appear to differ by race in this data.



In [54]:
df.groupby('applicant_race').size()
df.groupby('applicant_race')['loan_approved'].sum()

applicant_race
American Indian or Alaska Native                                                       336
Asian                                                                                 3696
Black or African American                                                             1077
Information not provided by applicant in mail, Internet, or telephone application    10249
Multiracial                                                                           2093
Native Hawaiian or Other Pacific Islander                                              353
White                                                                                26006
Name: loan_approved, dtype: int64
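
Group sizes differ a lot across races in the output above, so it is worth looking at counts and shares together. One possible way, sketched here on made-up data (the `toy` frame is an assumption for illustration), is `pd.crosstab`, which tabulates denied (0) and approved (1) applications per group and can normalize each row to proportions.

```python
import pandas as pd

# Made-up applications: 4 White (3 approved), 4 Asian (2 approved), 2 Multiracial (1 approved).
toy = pd.DataFrame({
    "applicant_race": ["White"] * 4 + ["Asian"] * 4 + ["Multiracial"] * 2,
    "loan_approved": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],
})

# Raw counts: rows are races, columns are the outcome values 0 and 1.
counts = pd.crosstab(toy["applicant_race"], toy["loan_approved"])

# Row-normalized shares: each row sums to 1, so column 1 is the approval rate.
shares = pd.crosstab(toy["applicant_race"], toy["loan_approved"], normalize="index")

print(counts)
print(shares.round(3))
```

A rate computed from only a handful of applications is much noisier than one computed from thousands, which is why the counts table is worth keeping next to the shares.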

### Question 2.D: Does the likelihood of loan approval differ by age in this data?

_Double click to write your answer question here. Show your work in code below if applicable._

We cannot tell whether age affects loan approval, because the dataset contains no column for the age of the applicant or co-applicant.
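
A quick way to confirm that no age column exists is to search the column names. The sketch below is a minimal example that builds an empty DataFrame with the column names shown in the `df.head()` output; the names `COLUMN_NAMES` and `toy` are assumptions for illustration.

```python
import pandas as pd

# Column names as shown in the df.head() output (the unnamed index column is omitted).
COLUMN_NAMES = [
    "town_name", "loan_amount_000s", "applicant_income_000s", "is_hoepa_loan",
    "occupied_by_owner", "loan_purpose_name", "loan_approved", "denial_reason",
    "co_applicant_sex", "co_applicant_race", "applicant_sex", "applicant_race",
]

# Empty frame that only carries the column labels.
toy = pd.DataFrame(columns=COLUMN_NAMES)

# Collect every column whose name mentions "age" (case-insensitive).
age_columns = [name for name in toy.columns if "age" in name.lower()]

print(age_columns)
```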

In [79]:
df.sample(5)

Unnamed: 0,town_name,loan_amount_000s,applicant_income_000s,is_hoepa_loan,occupied_by_owner,loan_purpose_name,loan_approved,denial_reason,co_applicant_sex,co_applicant_race,applicant_sex,applicant_race,loan_amount_000s / applicant_income_000s
34404,Chula Vista,640.032417,63.075441,1,1,Home purchase,1,,No co-applicant,No co-applicant,Female,Asian,10.147094
52021,Solana Beach,3299.615855,417.909081,0,0,Home purchase,1,,Male,White,Male,White,7.895535
49740,Solana Beach,3269.465986,370.132865,0,1,Home purchase,0,Credit application incomplete,"Information not provided by applicant in mail,...","Information not provided by applicant in mail,...",Male,White,8.833223
48411,Coronado,1864.854745,397.094016,1,1,Home purchase,1,,Male,White,Male,White,4.696255
2979,El Cajon,660.283664,66.236504,1,1,Refinancing,1,,Female,White,Female,White,9.968577


### Question 2.E: Do you have enough information to determine if differential approval rates are an example of bias? Why or why not?

*Double click to write your answer here.*

*   Not quite. Gender and race are both associated with different approval rates, but they are only two of the factors we can observe. We would need at least two or three more factors that affect loan approval before deciding whether the differences are bias.
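
One way to take another factor into account is to compare approval rates within groups of similar applicants. The sketch below uses made-up data and an arbitrary income cutoff of 100 (in thousands); the `toy` frame, the `income_band` column, and the band labels are assumptions for illustration, not part of the assignment data.

```python
import pandas as pd

# Made-up applications with income in thousands of dollars.
toy = pd.DataFrame({
    "applicant_sex": ["Female", "Female", "Female", "Female",
                      "Male", "Male", "Male", "Male"],
    "applicant_income_000s": [40, 45, 120, 130, 42, 48, 125, 135],
    "loan_approved": [0, 1, 1, 1, 1, 1, 1, 1],
})

# Split applicants into two income bands: (0, 100] and (100, 1000].
toy["income_band"] = pd.cut(
    toy["applicant_income_000s"],
    bins=[0, 100, 1000],
    labels=["up to 100k", "above 100k"],
)

# Approval rate for each sex within each income band.
rates = (
    toy.groupby(["income_band", "applicant_sex"], observed=True)["loan_approved"]
    .mean()
    .unstack()
)

print(rates)
```

If a gap between groups persists among applicants with similar income, income alone does not explain it. That still does not prove bias, because other unobserved factors may differ.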


## Part 3: Helping Others Understand Fairness & Bias

Imagine that you work as a software engineer for a small credit union. Your boss has asked you to build a machine learning system to predict which home loan applications the credit union should approve. 

There are three possible data sets you could use (included in the assignment materials in data.zip: home_loans_1, _2, and _3.csv). You need to design a visualization that will convince your boss to use the data set that you think is the right choice. 

### Part 3.A: List the four most important attributes of the datasets that you think should be considered to decide which dataset to use.

_Double click to write your answer question here._
#1.  loan amount
#2.  applicant income
#3.  applicant sex
#4.  applicant race
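
To compare candidate datasets on an attribute such as applicant sex, one option is a small helper that builds a table with one row per group and one column per dataset. This is only a sketch: the function `approval_rates` and the two toy frames are hypothetical, and the real files would be loaded with `pd.read_csv` instead.

```python
import pandas as pd


def approval_rates(datasets, group_column):
    """Return approval rates with one row per group and one column per dataset."""
    return pd.DataFrame({
        name: frame.groupby(group_column)["loan_approved"].mean()
        for name, frame in datasets.items()
    })


# Two made-up datasets standing in for the real CSV files.
toy_datasets = {
    "toy_a": pd.DataFrame({
        "applicant_sex": ["Female", "Female", "Male", "Male"],
        "loan_approved": [1, 0, 1, 1],
    }),
    "toy_b": pd.DataFrame({
        "applicant_sex": ["Female", "Female", "Male", "Male"],
        "loan_approved": [1, 1, 1, 1],
    }),
}

table = approval_rates(toy_datasets, "applicant_sex")

# Difference in approval rate between the two groups, per dataset.
gap = table.loc["Male"] - table.loc["Female"]

print(table)
print(gap)
```

A table like this maps naturally onto a grouped bar chart, with one cluster of bars per dataset.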

### Part 3.B: Sketch a visualization that your boss (who is not a software engineer) can understand, that will help your boss understand the dataset and the aspects of it that you consider important. 


_Attach a pdf with your sketches. Please include any annotations/description on the pdf itself (not in this notebook)._