# Phase 5

## 1. Introduction

### Research Question
Overall, our research questions are: <b>Is it possible to create a loan approval prediction model that balances fairness and "accuracy" in the form of predictive equality and recall? If so, does this model have any biases towards or against any groups?</b>

Inputs: `Applicant Race` (C), `Applicant Sex` (C), `Loan Type` (C), `Property Type` (C), `Loan Purpose` (C), `Loan Amount` (N), `Applicant Income` (N) \
Outputs: `Action Taken` (C) \
Evaluation Metrics: Recall, Predictive Equality \
\* (C) represents Categorical (Using Label Encoding) and (N) represents Numerical
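
As a minimal sketch of the label encoding mentioned above, the snippet below maps each distinct category to an integer. The helper name `label_encode` and the sample values are illustrative assumptions, not part of our actual pipeline.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order, starting at 0)."""
    # Sorting the categories keeps the mapping deterministic across runs.
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical sample of a categorical column.
loan_types = ["Conventional", "FHA-insured", "Conventional", "VA-guaranteed"]
encoded, mapping = label_encode(loan_types)
```

Here `encoded` holds one integer per row and `mapping` records which integer stands for which category, so the encoding can be reversed when reporting results.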

The inputs we are interested in are `Applicant Race`, `Applicant Sex`, `Loan Type`, `Property Type`, `Loan Purpose`, `Loan Amount`, and `Applicant Income`. Of these, `Applicant Income` and `Loan Amount` are, intuitively, the most relevant to whether a loan gets accepted or not. According to [Investopedia](https://www.investopedia.com/articles/mortgages-real-estate/08/mortgage-candidate.asp) [1], credit score, debt, income, and appraisal value all affect whether an applicant successfully gets a mortgage, so we believe that the variables we selected are the most indicative of these measures. We include `Applicant Race` and `Applicant Sex` because they are sensitive features that, by themselves, should not affect the outcome of a loan application.

The main output we want to check is `Action Taken`, because this column indicates whether the loan was approved (i.e., originated) or not. Other columns of potential interest are `Denial Reason 1`, `Denial Reason 2`, and `Denial Reason 3`, which could provide supplemental information about what was faulty in an application. However, the denial reason will be undefined for most applicants, since most loans in the dataset are approved.
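
One way to turn `Action Taken` into a binary target is sketched below. It assumes, purely for illustration, that only originated loans (code 1) count as approved, only denied applications (code 3) count as not approved, and every other outcome is excluded; the function name `to_binary_target` is hypothetical.

```python
ORIGINATED = 1  # action_taken code for "Loan originated"
DENIED = 3      # action_taken code for "Application denied"


def to_binary_target(action_taken):
    """Return 1 for originated, 0 for denied, None for any other outcome."""
    if action_taken == ORIGINATED:
        return 1
    if action_taken == DENIED:
        return 0
    # Assumption: withdrawn, incomplete, purchased, etc. are left out of training.
    return None


codes = [1, 3, 4, 1, 6]  # made-up example values
labels = [to_binary_target(c) for c in codes]
kept = [y for y in labels if y is not None]
```

Whether outcomes such as "approved but not accepted" should instead count as approvals is a modeling choice that this sketch does not settle.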

Our main evaluation metric is recall because we believe that telling an applicant that they can't get a loan when they actually can is more detrimental than telling them they can when they can't. Although applying for a loan takes time, it's better to apply and get rejected than to not apply at all, because there is still a chance that the applicant could have gotten funding. Even though we are focusing on recall, we will still check other metrics such as precision and F1 to make sure the model is not simply favoring approvals because of a skew in the data.
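
For reference, the metrics named above can be computed from confusion-matrix counts as in the following sketch. It assumes that an approved loan is the positive class (label 1); the function names and sample labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = approved)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    # Share of truly approvable applicants that the model approves.
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    # Share of predicted approvals that were truly approved.
    return tp / (tp + fp) if (tp + fp) else 0.0


def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if (p + r) else 0.0


y_true = [1, 1, 1, 0, 0, 1]  # made-up ground truth
y_pred = [1, 0, 1, 1, 0, 1]  # made-up predictions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
r = recall(tp, fn)
p = precision(tp, fp)
score = f1(p, r)
```

A model that approves everyone reaches a recall of 1.0, which is why precision and F1 are worth checking alongside it.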

We will also evaluate the model across the sensitive features for fairness, focusing on predictive equality while also taking statistical parity and calibration into consideration. We consider predictive equality to be the main focus because we want to ensure that our model isn't unfairly predicting that applicants of one race, gender, or ethnicity would fail to get a loan more often than others.
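
The sketch below shows one way to check predictive equality. It assumes the common definition, equal false positive rates across groups, with an approved loan as the positive class; the function names, group labels, and data are hypothetical.

```python
from collections import defaultdict


def false_positive_rates(y_true, y_pred, groups):
    """Return the false positive rate for each group (label 1 = approved)."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:  # only actual negatives enter the FPR
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / n for g, n in negatives.items()}


def predictive_equality_gap(rates):
    """Largest difference in false positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


y_true = [0, 0, 1, 0, 0, 1, 0, 0]  # made-up ground truth
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]  # made-up predictions
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = false_positive_rates(y_true, y_pred, groups)
gap = predictive_equality_gap(rates)
```

A gap near zero suggests the groups are treated alike under this definition. If the labels are flipped so that denial is the positive class, the same helper measures how often each group is wrongly predicted to be denied.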

### Hypotheses
We predict that white and male applicants will be the most likely to get approved by our model. This is due to the skew in our data toward a large number of white and male applicants. This skew may be due to the makeup of the United States, which is majority white. Conventional loans also appear to be the most general, and therefore the most common, type of loan we would see. By ensuring that our data is balanced across the different sensitive features, we predict that the model will, in turn, become fairer and more representative for each sex, race, and ethnicity.
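
As one possible way to even out the data across a sensitive feature, the sketch below downsamples every group to the size of the smallest one. Downsampling is an assumption made here for illustration (oversampling or reweighting are alternatives), and the function name and sample rows are hypothetical.

```python
import random
from collections import defaultdict


def downsample_to_smallest_group(rows, key, seed=0):
    """Randomly keep the same number of rows from each group of `key`."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[key]].append(row)
    # Every group is cut down to the size of the smallest group.
    target = min(len(members) for members in by_group.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, target))
    return balanced


# Made-up rows: six applicants with sex code 1 and two with sex code 2.
rows = [{"applicant_sex": 1}] * 6 + [{"applicant_sex": 2}] * 2
balanced = downsample_to_smallest_group(rows, "applicant_sex")
counts = {
    code: sum(1 for row in balanced if row["applicant_sex"] == code)
    for code in (1, 2)
}
```

Downsampling discards data from the larger groups, so it trades some training data for balance.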

### Importance
Loans are an important part of financial stability and should be accessible to everyone who deserves them. Our model will provide some insight into the fairness of the loan approval process as it relates to minorities, and into the impact of protected traits on that process. This relates to algorithmic fairness because our model will be trained to balance fairness and "accuracy" (predictive equality and recall). In theory, a fair model should yield fair results (in accordance with our definition of fairness); we want to test whether this is true in order to bring attention to the current state of the loan approval process and to raise conversation about the transparency of reasons for denial.


### Related Work
While there aren't any formal and notable works on loan approval, there are papers covering similar topics, such as the role of protected traits like gender or race in financial opportunities. For example, the paper [Leveraging Gender Proxies Can Lead to Fairer Credit Risk Predictions](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4602450) [2] evaluates gender bias in algorithms, using alternative data to assess credit risk prediction. One of its takeaways is that incorporating strong gender proxies into the credit scoring process can potentially reduce the gender gap in credit risk prediction accuracy and credit allocation. This paper lies in the same domain as our project in that both look for ways to increase fairness in models that deal with data across different groups of people. Another paper that covers similar themes is [Algorithmic Bias, Financial Inclusion, and Gender](https://www.womensworldbanking.org/wp-content/uploads/2021/02/2021_Algorithmic_Bias_Report.pdf) [3], which explores the use of synthetic data to discover potential areas of gender-based bias in digital credit. It discusses different ways to avoid repeating historical gender bias when using machine learning and artificial intelligence. This is similar to our project's goal of discovering potential biases in the current loan approval process and creating a model that is fair and mitigates this discrimination.

The lack of notable loan approval papers could be because loan approval data is more difficult to work with and less transparent than credit risk or digital credit data. Credit is also applicable to larger groups of people: more people use credit cards than buy houses or run businesses, which is probably why there is less demand for papers examining fairness in loan approvals. Furthermore, applying for a loan is a more tedious process than applying for something like a credit card, which may make loan approval data harder to collect.

## 2. Datasheet

### Rows and Columns
The data consists of 29 columns and 9,793,702 rows. The columns are defined as follows.
- `respondent_id`: A 10 character identifier for each respondent.
- `agency_code`: Indicates the agency that the data is from, with codes 1 - Office of the Comptroller of the Currency (OCC), 2 - Federal Reserve System (FRS), 3 - Federal Deposit Insurance Corporation (FDIC), 5 - National Credit Union Administration (NCUA), 7 - Department of Housing and Urban Development (HUD), 9 - Consumer Financial Protection Bureau (CFPB).
- `loan_type`: Type of loan, with 1 - Conventional, 2 - FHA-insured, 3 - VA-guaranteed, 4 - FSA/RHS.
- `property_type`: 1 - One to four-family, 2 - Manufactured housing, 3 - Multifamily.
- `loan_purpose`: 1 - Home purchase, 2 - Home improvement, 3 - Refinancing.
- `owner_occupancy`: 1 - Owner-occupied as principal dwelling, 2 - Not owner-occupied, 3 - Not applicable.
- `loan_amount_000s`: Loan amount in thousands of dollars.
- `preapproval`: 1 - Preapproval requested, 2 - Not requested, 3 - Not applicable.
- `action_taken`: 1 - Loan originated, 2 - Application approved but not accepted, 3 - Application denied, 4 - Application withdrawn, 5 - File closed for incompleteness, 6 - Purchased loan, 7 - Preapproval denied, 8 - Preapproval approved but not accepted.
- `msamd`: Metropolitan Statistical Area/Metropolitan Division code.
- `state_code`: Two-digit FIPS state identifier code.
- `county_code`: Three-digit FIPS county identifier code.
- `census_tract_number`: Census tract number.
- `applicant_ethnicity`: 1 - Hispanic or Latino, 2 - Not Hispanic or Latino, 3 - Not provided, 4 - Not applicable, 5 - No co-applicant.
- `co_applicant_ethnicity`: Same codes as `applicant_ethnicity` for co-applicant.
- `applicant_race_1`: 1 - American Indian/Alaska Native, 2 - Asian, 3 - Black/African American, 4 - Hawaiian/Pacific Islander, 5 - White, 6 - Not provided, 7 - Not applicable, 8 - No co-applicant.
- `co_applicant_race_1`: Same codes as `applicant_race_1` for co-applicant.
- `applicant_sex`: 1 - Male, 2 - Female, 3 - Not provided, 4 - Not applicable, 5 - No co-applicant.
- `co_applicant_sex`: Same codes as `applicant_sex` for co-applicant.
- `applicant_income_000s`: Applicant gross annual income in thousands of dollars.
- `purchaser_type`: 0 - Not originated or sold, 1 - Fannie Mae, 2 - Ginnie Mae, 3 - Freddie Mac, 4 - Farmer Mac, 5 - Private securitization, 6 - Commercial/savings bank, 7 - Life insurance/credit union/mortgage bank, 8 - Affiliate, 9 - Other purchaser.
- `hoepa_status`: 1 - HOEPA loan, 2 - Not a HOEPA loan (for originated/purchased loans only).
- `lien_status`: 1 - Secured by first lien, 2 - Secured by subordinate lien, 3 - Not secured by lien, 4 - Not applicable for purchased loans (applications/originations only).
- `population`: Total population in the census tract.
- `minority_population`: Percentage of minority population to total population for the tract (carried to two decimal places).
- `hud_median_family_income`: FFIEC Median family income in dollars for the MSA/MD in which the tract is located (adjusted annually by FFIEC).
- `tract_to_msamd_income`: Percentage of tract median family income compared to MSA/MD median family income (carried to two decimal places).
- `number_of_owner_occupied_units`: Number of dwellings, including individual condominiums, that are lived in by the owner.
- `number_of_1_to_4_family_units`: Dwellings that are built to house fewer than 5 families.
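
Because most columns are stored as numeric codes, a small lookup table can make results easier to read. The sketch below decodes two of the columns using the codes listed above; the `decode` helper and the "Unknown code" fallback are illustrative assumptions.

```python
ACTION_TAKEN = {
    1: "Loan originated",
    2: "Application approved but not accepted",
    3: "Application denied",
    4: "Application withdrawn",
    5: "File closed for incompleteness",
    6: "Purchased loan",
    7: "Preapproval denied",
    8: "Preapproval approved but not accepted",
}

APPLICANT_SEX = {
    1: "Male",
    2: "Female",
    3: "Not provided",
    4: "Not applicable",
    5: "No co-applicant",
}


def decode(code, table):
    """Translate a raw code (int or numeric string) into its description."""
    try:
        return table[int(code)]
    except (KeyError, ValueError, TypeError):
        # Blank cells and codes outside the table fall through to here.
        return f"Unknown code {code!r}"


denied = decode("3", ACTION_TAKEN)
unknown = decode(9, APPLICANT_SEX)
blank = decode("", APPLICANT_SEX)
```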
### Purpose of Dataset
The dataset was created by the Consumer Financial Protection Bureau (CFPB), which is a government entity. The CFPB is tasked with enforcing fair lending laws like the Equal Credit Opportunity Act. The bureau collects mortgage and loan data so that it can monitor for potential discriminatory lending patterns based on factors like race, gender, and age. Financial institutions are required by the Home Mortgage Disclosure Act (HMDA) to report lending data to the CFPB.
### Dataset funding
Since the CFPB is a government organization, the creation of this dataset was publicly funded.
### Influences in data reporting
The government strictly outlines which values to include in the data. A [reference chart](https://files.consumerfinance.gov/f/documents/cfpb_reportable-hmda-data_regulatory-and-reporting-overview-reference-chart_2023-02.pdf) [4] is given to employees at financial institutions, specifying which data to report and how to report it. The [implementation and guidelines material](https://www.consumerfinance.gov/rules-policy/regulations/1003/) [5] also explains the different sections of the HMDA and how to comply with each one. The law says that institutions must report preapproval requests only if they are denied, are approved by the financial institution but not accepted by the applicant, or result in the origination of home purchase loans.
### Preprocessing
The CSV file that is released by the CFPB has no preprocessing at all. It is a compiled dataset which consists of all the reports by the financial institutions.
### Involvement of People and Knowledge of Data Use
The financial institutions and the government are aware of the data collection and know that its purpose is auditing and consumer safety. Financial institutions are also required to give their customers privacy notices that tell them what data is being collected and how it is used. This is because they are obligated to comply with the privacy component of the [Gramm-Leach-Bliley Act](https://www.ftc.gov/business-guidance/resources/how-comply-privacy-consumer-financial-information-rule-gramm-leach-bliley-act#obligations) [6].
### Raw Data Source
The raw data can be found in the [HMDA data](https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=nationwide&records=all-records&field_descriptions=codes) [7] section of the CFPB website, next to the year 2017.
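
Since the file has millions of rows, it can help to stream it row by row and keep only the columns the model needs. The following is a minimal sketch under that assumption; the function name `iter_model_rows` is hypothetical, and a tiny in-memory sample with made-up values stands in for the downloaded CSV.

```python
import csv
import io

# Columns used as model inputs and output, named as in the datasheet.
MODEL_COLUMNS = [
    "applicant_race_1", "applicant_sex", "loan_type", "property_type",
    "loan_purpose", "loan_amount_000s", "applicant_income_000s", "action_taken",
]


def iter_model_rows(fileobj, columns=MODEL_COLUMNS):
    """Yield one dict per CSV row, keeping only the requested columns."""
    reader = csv.DictReader(fileobj)
    for row in reader:
        yield {col: row[col] for col in columns}


# In-memory stand-in for the real file (values are made up).
sample = io.StringIO(
    "respondent_id,applicant_race_1,applicant_sex,loan_type,property_type,"
    "loan_purpose,loan_amount_000s,applicant_income_000s,action_taken\n"
    "0000000001,5,1,1,1,1,200,85,1\n"
    "0000000002,3,2,2,1,3,150,,3\n"
)
rows = list(iter_model_rows(sample))
```

With the real download, the same generator could be fed a file opened with `open(path, newline="")`, where `path` is wherever the CSV was saved. Values arrive as strings, and blank cells (such as a missing income) come through as empty strings that need handling before modeling.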