# HMDA Data Testing

## TODO:
### Data Collection

### Data Cleaning
- [ ] merge like columns together, e.g. 'denial_reason-1', 'denial_reason-2', 'denial_reason-3', 'denial_reason-4'
- [X] fix interest rate column
- [X] fix loan term column
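
One possible shape for the open "merge like columns" item is sketched below. It assumes the goal is a single list-valued column holding the non-null codes from the four `denial_reason-N` columns; the helper name `merge_denial_reasons`, the new column name `denial_reasons`, and the demo values are hypothetical, and other merged representations (for example a delimited string) may suit the analysis better.

```python
import pandas as pd

def merge_denial_reasons(frame: pd.DataFrame) -> pd.Series:
    """Collect the non-null denial_reason-N values of each row into one list."""
    cols = [f'denial_reason-{i}' for i in range(1, 5)]
    rows = frame[cols].to_numpy(dtype=object)
    # keep only the values that are actually present in each row
    merged = [[value for value in row if pd.notna(value)] for row in rows]
    return pd.Series(merged, index=frame.index)

# Tiny made-up frame standing in for the HMDA data (values are illustrative only)
demo = pd.DataFrame({
    'denial_reason-1': [10, 3, 1],
    'denial_reason-2': [None, 4, None],
    'denial_reason-3': [None, None, None],
    'denial_reason-4': [None, None, None],
})
demo['denial_reasons'] = merge_denial_reasons(demo)
print(demo['denial_reasons'])
```

The original four columns are left in place, so they can be dropped only once the merged column has been checked.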

### Statistics
- [x] summary statistics table: interest_rate by race
- [x] ANOVA test: interest_rate by race

### Documentation
- [ ] data exploration and cleanup process
- [ ] other data (show the download, df load, and df) and how poor it was
- [ ] API fixes - was downloading nationwide, and it was too big for jupyterlab/pandas/computer
- [ ] print example, found problem with 'state'.value_counts()
- [ ] data exploration - print columns, print value_counts for each
- [ ] data exploration - pull out a DF and show it of just the primary columns
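
For the last item, a minimal sketch of pulling out a narrower DataFrame is shown here. Which columns count as "primary" is an assumption made for illustration; the names in `PRIMARY_COLUMNS` all appear in the column listing later in this notebook, but the selection itself and the helper `primary_view` are hypothetical.

```python
import pandas as pd

# Hypothetical choice of "primary" columns; adjust to whatever the analysis needs
PRIMARY_COLUMNS = [
    'activity_year', 'derived_race', 'loan_amount',
    'loan_to_value_ratio', 'interest_rate', 'loan_term',
]

def primary_view(frame: pd.DataFrame, columns=PRIMARY_COLUMNS) -> pd.DataFrame:
    """Return only the requested columns that actually exist in the frame."""
    present = [c for c in columns if c in frame.columns]
    return frame[present]

# Made-up two-row frame with one extra column and one missing column
demo = pd.DataFrame({
    'activity_year': [2020, 2021],
    'lei': ['AAA', 'BBB'],
    'derived_race': ['White', 'Asian'],
    'interest_rate': [3.25, 3.0],
})
print(primary_view(demo))
```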

## Setup
-----

In [100]:
import gzip
import os
import requests
import subprocess
import pandas as pd
import numpy as np
import scipy.stats as stats
from pathlib import Path
from hmda_lib import valid_state_codes
from hmda_lib import valid_years

In [101]:
def download_hmda_data(output_file, state, year):
    """Stream one state/year HMDA CSV from the Data Browser API to output_file."""
    url = f'https://ffiec.cfpb.gov/v2/data-browser-api/view/csv?states={state}&years={year}'

    try:
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            # write the response in chunks so the whole file never sits in memory
            with open(output_file, 'wb') as fd:
                for chunk in response.iter_content(chunk_size=1024):
                    if chunk:
                        fd.write(chunk)
        return True

    except requests.exceptions.RequestException as e:
        print(f"Error downloading data: {e}")
        return False

In [102]:
def compress_hmda_data(f):
    # gzip replaces f with f.gz; check=True raises if the command fails
    subprocess.run(['gzip', f], check=True)
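
Calling the `gzip` command only works where that program is installed. A portable alternative using only the standard library might look like the sketch below; the function name `compress_file` and the demo file are hypothetical, and the behaviour (write `<name>.gz`, then delete the original) is chosen to mirror what the `gzip` command does by default.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def compress_file(path):
    """Gzip `path` to `<path>.gz` and remove the original file."""
    src = Path(path)
    dst = src.with_name(src.name + '.gz')
    with open(src, 'rb') as f_in, gzip.open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams in blocks, no full read into memory
    src.unlink()
    return dst

# Round-trip check on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, 'hmda-demo.csv')
    sample.write_text('a,b\n1,2\n')
    gz_path = compress_file(sample)
    with gzip.open(gz_path, 'rt') as fh:
        round_trip = fh.read()
    original_removed = not sample.exists()
```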

## Data Collection
-----

In [104]:
# download HMDA data from API

state = 'MN'
#years = ['2018', '2019', '2020', '2021', '2022']
# assumed from the output below: three files, activity years 2020-2022
years = ['2020', '2021', '2022']

for year in years:
    output_file = Path('hmda_data', f'hmda-{state}-{year}.csv')
    if os.path.exists(f'{output_file}.gz'):
        print('File exists already! Skipping!')
        continue

    print(f'Downloading HMDA data for: {year} {state}.....', end='')
    if not download_hmda_data(output_file, state, year):
        continue  # the error was already printed; there is nothing to compress
    print(' compressing.....', end='')
    compress_hmda_data(output_file)
    print(' done!')

File exists already! Skipping!
File exists already! Skipping!
File exists already! Skipping!


In [105]:
# load the HMDA data into Pandas DataFrames

data_path = 'hmda_data'
filenames = os.listdir(data_path)
all_dataframes = []

for filename in filenames:
    if filename.endswith('.csv.gz'):
        filepath = Path(data_path, filename)
        with gzip.open(filepath, 'rt') as file:
            # read from the opened gzip handle rather than re-opening the path
            df = pd.read_csv(file)
        all_dataframes.append(df)

unclean_df = pd.concat(all_dataframes, ignore_index=True)



## Data Cleaning
-----

In [106]:
# remove null values and 'Exempt' interest rate from dataframe

df = unclean_df[unclean_df['interest_rate'].notnull()]
df = df.query('interest_rate != "Exempt"')

In [107]:
# remove null loan terms from dataframe

# take a copy so later column assignments do not trigger SettingWithCopyWarning
df = df[df['loan_term'].notnull()].copy()

In [108]:
# data type conversions

df['interest_rate'] = pd.to_numeric(df['interest_rate'], errors='raise')
df['loan_to_value_ratio'] = pd.to_numeric(df['loan_to_value_ratio'], errors='raise')
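
Because `errors='raise'` stops at the first value that cannot be parsed, it can help to list the offending values before converting. The sketch below shows one way to do that with `errors='coerce'`; the sample Series is made up, and only the `'Exempt'` value is taken from the cleaning step above.

```python
import pandas as pd

# Made-up sample of raw interest_rate strings
raw = pd.Series(['3.25', 'Exempt', None, '2.875'], name='interest_rate')

# coerce turns anything unparseable into NaN instead of raising
as_number = pd.to_numeric(raw, errors='coerce')

# values that were present in the raw data but could not be parsed as numbers
unparseable = raw[as_number.isna() & raw.notna()]
print(unparseable.value_counts())
```

Running the same check on `loan_to_value_ratio` before the conversion would show whether it also holds non-numeric markers.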

In [109]:
# rename values

df['derived_race'] = df['derived_race'].replace({
    'Black or African American': 'Black',
    'American Indian or Alaska Native': 'Native',
    'Native Hawaiian or Other Pacific Islander': 'Pacific Islander'
})

## Data Exploration
-----

In [110]:
df['derived_race'].value_counts()

derived_race
White                       596590
Race Not Available          193602
Asian                        33071
Black                        21090
Joint                        15605
Native                        2781
Pacific Islander               748
2 or more minority races       641
Free Form Text Only             43
Name: count, dtype: int64

In [111]:
df.head()

| | activity_year | lei | derived_msa-md | state_code | county_code | census_tract | conforming_loan_limit | derived_loan_product_type | derived_dwelling_category | derived_ethnicity | ... | denial_reason-2 | denial_reason-3 | denial_reason-4 | tract_population | tract_minority_population_percent | ffiec_msa_md_median_family_income | tract_to_msa_income_percentage | tract_owner_occupied_units | tract_one_to_four_family_homes | tract_median_age_of_housing_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2020 | AD6GFRVSDT01YPT1CS68 | 33460 | MN | 27123.0 | 27123040000.0 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 4366 | 38.91 | 97300 | 71 | 750 | 878 | 50 |
| 2 | 2020 | AD6GFRVSDT01YPT1CS68 | 33460 | MN | 27053.0 | 27053030000.0 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 3248 | 8.68 | 97300 | 138 | 1147 | 1246 | 49 |
| 3 | 2020 | AD6GFRVSDT01YPT1CS68 | 33460 | MN | 27163.0 | 27163070000.0 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 3995 | 9.31 | 97300 | 103 | 1367 | 1741 | 38 |
| 4 | 2020 | AD6GFRVSDT01YPT1CS68 | 99999 | MN | 27035.0 | 27035950000.0 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 1910 | 1.2 | 70900 | 90 | 843 | 2542 | 33 |
| 5 | 2020 | AD6GFRVSDT01YPT1CS68 | 33460 | MN | 27139.0 | 27139080000.0 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 7709 | 20.08 | 97300 | 158 | 2415 | 2818 | 13 |


In [112]:
for c in df.columns:
    print(c)

activity_year
lei
derived_msa-md
state_code
county_code
census_tract
conforming_loan_limit
derived_loan_product_type
derived_dwelling_category
derived_ethnicity
derived_race
derived_sex
action_taken
purchaser_type
preapproval
loan_type
loan_purpose
lien_status
reverse_mortgage
open-end_line_of_credit
business_or_commercial_purpose
loan_amount
loan_to_value_ratio
interest_rate
rate_spread
hoepa_status
total_loan_costs
total_points_and_fees
origination_charges
discount_points
lender_credits
loan_term
prepayment_penalty_term
intro_rate_period
negative_amortization
interest_only_payment
balloon_payment
other_nonamortizing_features
property_value
construction_method
occupancy_type
manufactured_home_secured_property_type
manufactured_home_land_property_interest
total_units
multifamily_affordable_units
income
debt_to_income_ratio
applicant_credit_score_type
co-applicant_credit_score_type
applicant_ethnicity-1
applicant_ethnicity-2
applicant_ethnicity-3
applicant_ethnicity-4
applicant_ethnicit

In [113]:
for c in df.columns:
    print(f'Examining column: {c}')
    print(df[c].value_counts())
    print()

Examining column: activity_year
activity_year
2020    352175
2021    340417
2022    171579
Name: count, dtype: int64

Examining column: lei
lei
6BYL5QZYBDK8S7L73M02    71531
KB1H1DSPRFMYMCUFXT09    49175
549300FGXN1K3HLB1R50    40437
549300WYBPIWKK6SQC06    39238
549300HW662MN1WU8550    25701
                        ...  
5493005QK4NV0ZZ5EM64        1
549300V36YE6JCCEJB76        1
549300XOTES5TCS8T794        1
254900I68BOSEM149Q58        1
549300214PKB2Y1ZWH75        1
Name: count, Length: 890, dtype: int64

Examining column: derived_msa-md
derived_msa-md
33460    622484
99999    132228
40340     31026
20260     27884
41060     25350
22020      9669
31860      9506
24220      2999
29100      2171
0           854
Name: count, dtype: int64

Examining column: state_code
state_code
MN    864171
Name: count, dtype: int64

Examining column: county_code
county_code
27053.0    210362
27037.0     81553
27123.0     71397
27003.0     69074
27163.0     55430
            ...  
27077.0       305
270

## Statistics Summaries
-----

In [114]:
# Statistics Summary Table - Interest Rates by Race
race_group = df.groupby('derived_race')

summary_table = pd.DataFrame({
    "Mean Interest Rate": race_group['interest_rate'].mean(),
    "Median Interest Rate": race_group['interest_rate'].median(),
    "Interest Rate Variance": race_group['interest_rate'].var(),
    "Interest Rate Std. Dev.": race_group['interest_rate'].std(),
    "Interest Rate Std. Err.": race_group['interest_rate'].sem()
})

summary_table

| derived_race | Mean Interest Rate | Median Interest Rate | Interest Rate Variance | Interest Rate Std. Dev. | Interest Rate Std. Err. |
|---|---|---|---|---|---|
| 2 or more minority races | 3.715591 | 3.25 | 2.342016 | 1.530365 | 0.060446 |
| Asian | 3.376448 | 3.0 | 1.355509 | 1.164263 | 0.006402 |
| Black | 3.521807 | 3.25 | 1.768312 | 1.329779 | 0.009157 |
| Free Form Text Only | 3.417907 | 3.125 | 1.59835 | 1.264259 | 0.192798 |
| Joint | 3.499418 | 3.125 | 1.383847 | 1.17637 | 0.009417 |
| Native | 3.680538 | 3.25 | 2.090561 | 1.445877 | 0.027418 |
| Pacific Islander | 3.540037 | 3.25 | 1.720836 | 1.311807 | 0.047964 |
| Race Not Available | 3.551026 | 3.25 | 1.455078 | 1.206266 | 0.002741 |
| White | 3.419726 | 3.125 | 1.228057 | 1.108178 | 0.001435 |


In [115]:
# Statistics Summary Table - Loan Amount by Race
race_group = df.groupby('derived_race')

summary_table = pd.DataFrame({
    "Mean Loan Amount": race_group['loan_amount'].mean(),
    "Median Loan Amount": race_group['loan_amount'].median(),
    "Loan Amount Variance": race_group['loan_amount'].var(),
    "Loan Amount Std. Dev.": race_group['loan_amount'].std(),
    "Loan Amount Std. Err.": race_group['loan_amount'].sem()
})

summary_table

| derived_race | Mean Loan Amount | Median Loan Amount | Loan Amount Variance | Loan Amount Std. Dev. | Loan Amount Std. Err. |
|---|---|---|---|---|---|
| 2 or more minority races | 229695.787832 | 225000.0 | 16708070000.0 | 129259.7 | 5105.451397 |
| Asian | 270837.138278 | 255000.0 | 22206330000.0 | 149017.9 | 819.435343 |
| Black | 245320.056899 | 235000.0 | 17586090000.0 | 132612.6 | 913.159007 |
| Free Form Text Only | 180813.953488 | 175000.0 | 10039200000.0 | 100195.8 | 15279.719541 |
| Joint | 283391.541173 | 265000.0 | 31369960000.0 | 177115.7 | 1417.83301 |
| Native | 210415.318231 | 195000.0 | 18328510000.0 | 135382.8 | 2567.219676 |
| Pacific Islander | 177941.176471 | 155000.0 | 15938330000.0 | 126247.1 | 4616.050505 |
| Race Not Available | 308131.579219 | 235000.0 | 1633067000000.0 | 1277915.0 | 2904.337153 |
| White | 241416.533968 | 215000.0 | 26325320000.0 | 162250.8 | 210.062648 |


In [116]:
# Statistics Summary Table - loan_to_value_ratio by Race
race_group = df.groupby('derived_race')

summary_table = pd.DataFrame({
    "Mean Loan to Value Ratio": race_group['loan_to_value_ratio'].mean(),
    "Median Loan to Value Ratio": race_group['loan_to_value_ratio'].median(),
    "Loan to Value Ratio Variance": race_group['loan_to_value_ratio'].var(),
    "Loan to Value Ratio Std. Dev.": race_group['loan_to_value_ratio'].std(),
    "Loan to Value Ratio Std. Err.": race_group['loan_to_value_ratio'].sem()
})

summary_table

| derived_race | Mean Loan to Value Ratio | Median Loan to Value Ratio | Loan to Value Ratio Variance | Loan to Value Ratio Std. Dev. | Loan to Value Ratio Std. Err. |
|---|---|---|---|---|---|
| 2 or more minority races | 82.539619 | 88.131 | 342.754056 | 18.513618 | 0.761548 |
| Asian | 77.4271 | 80.0 | 309.903705 | 17.604082 | 0.099656 |
| Black | 83.885809 | 90.0 | 320.985167 | 17.916059 | 0.128461 |
| Free Form Text Only | 76.931721 | 80.0 | 459.568647 | 21.437552 | 3.269196 |
| Joint | 77.662098 | 80.0 | 302.625725 | 17.396141 | 0.144125 |
| Native | 79.820396 | 80.0 | 376.611868 | 19.40649 | 0.386971 |
| Pacific Islander | 75.343146 | 79.646 | 394.420117 | 19.860013 | 0.750102 |
| Race Not Available | 72.28766 | 75.0 | 347.137081 | 18.631615 | 0.062066 |
| White | 73.648697 | 77.22 | 358.633082 | 18.93761 | 0.025359 |
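
The three summary cells above repeat the same five aggregations with different column names. A small helper could produce each table in one call, as in the sketch below; the function name `summarize_by`, the shortened column labels, and the demo data are hypothetical.

```python
import pandas as pd

def summarize_by(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Mean, median, variance, std. dev. and std. err. of value_col per group."""
    table = frame.groupby(group_col)[value_col].agg(['mean', 'median', 'var', 'std', 'sem'])
    table.columns = ['Mean', 'Median', 'Variance', 'Std. Dev.', 'Std. Err.']
    return table

# Made-up data with two groups, not taken from the HMDA files
demo = pd.DataFrame({
    'derived_race': ['A', 'A', 'B', 'B'],
    'interest_rate': [3.0, 4.0, 2.0, 2.5],
})
table = summarize_by(demo, 'derived_race', 'interest_rate')
print(table)
```

With the notebook's data this would be called as `summarize_by(df, 'derived_race', 'loan_amount')` and so on for each column of interest.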


## Hypothesis Testing
-----

### ANOVA Tests

In [117]:
# Extract the interest rates for each racial group
races = ['White', 'Race Not Available', 'Asian', 'Joint', 'Black', 'Native',
         '2 or more minority races', 'Pacific Islander', 'Free Form Text Only']
groups = [df.loc[df['derived_race'] == race, 'interest_rate'] for race in races]

# Perform the one-way ANOVA test
stats.f_oneway(*groups)

F_onewayResult(statistic=291.626691486737, pvalue=0.0)

#### Interpretation
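Rejecting the null hypothesis: the reported p-value of 0.0 means the true p-value is too small to represent in floating point, which is very strong evidence against the null hypothesis that every racial group has the same mean interest rate. In other words, at least one group's mean interest rate differs significantly from the others. The test does not say which groups differ or how large the difference is, and with 864,171 loans in the sample even small differences in means reach statistical significance.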

Source: Google Gemini
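
A significant F statistic says that the group means differ, not by how much. One common companion measure is eta squared, the share of total variance in interest rate explained by group membership. The sketch below computes it with NumPy; the function name `eta_squared` and the demo numbers are made up for illustration, and no value for the HMDA data is claimed here.

```python
import numpy as np

def eta_squared(groups):
    """Between-group sum of squares divided by total sum of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Made-up groups, not HMDA data
demo_groups = [[3.0, 3.2, 3.1], [3.5, 3.6, 3.4], [3.0, 3.3, 3.2]]
effect = eta_squared(demo_groups)
print(f'eta squared: {effect:.3f}')
```

Passing the per-race interest-rate Series used for the ANOVA test would give the corresponding effect size for this dataset; values near 0 would mean race explains little of the variation in interest rate even though the test is significant.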