# HMDA Data Testing

## TODO:
### Data Collection
- [x] download data straight from the HMDA Data Browser API (https://ffiec.cfpb.gov/documentation/api/data-browser/)
- [ ] check whether the data has already been downloaded; prompt the user before overwriting
- [x] is there a better way to save the data stream from the API?
- [x] write to the output file in chunks, rather than all at once?
- [x] the API call filters/params don't appear to be working as intended?

### Data Cleaning
- [x] determine what columns to keep or drop
- [ ] merge like columns together, e.g. 'denial_reason-1', 'denial_reason-2', 'denial_reason-3', 'denial_reason-4'
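
One possible shape for the "merge like columns" item, as a rough sketch rather than a settled design: collapse every column sharing a prefix into a single list-valued column and drop the missing entries. The helper name `merge_like_columns`, the output column name, and the sample values are all made up for illustration.

```python
import pandas as pd

def merge_like_columns(df, prefix, new_name=None, drop=True):
    """Collapse columns named '<prefix>-1', '<prefix>-2', ... into one list column."""
    cols = [c for c in df.columns if c.startswith(f'{prefix}-')]
    new_name = new_name or f'{prefix}s'
    # For each row, keep only the non-missing values, in column order.
    merged = [
        [v for v in row if pd.notna(v)]
        for row in df[cols].itertuples(index=False, name=None)
    ]
    out = df.drop(columns=cols) if drop else df.copy()
    out[new_name] = pd.Series(merged, index=df.index)
    return out

# Tiny made-up frame, only to show the shape of the result.
sample = pd.DataFrame({
    'action_taken': [3, 1],
    'denial_reason-1': [4, 10],
    'denial_reason-2': [3, None],
    'denial_reason-3': [None, None],
    'denial_reason-4': [None, None],
})
merged = merge_like_columns(sample, 'denial_reason')
print(merged)
```

A list column keeps every reason but is awkward to filter on; one indicator column per reason code would be an alternative if the merged data needs to feed `value_counts()` or a model.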

In [26]:
import gzip
import os
import requests
import subprocess
import pandas as pd
from pathlib import Path
from hmda_lib import valid_state_codes
from hmda_lib import valid_years

In [22]:
def download_hmda_data(state, year):
    """Stream one state/year HMDA CSV to disk; return its path, or None on failure."""
    output_file = Path('hmda_data', f'test-{state}-{year}.csv')
    output_file.parent.mkdir(parents=True, exist_ok=True)  # make sure hmda_data/ exists
    url = f'https://ffiec.cfpb.gov/v2/data-browser-api/view/csv?states={state}&years={year}'

    try:
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            with open(output_file, 'wb') as fd:
                # Write in 1 MiB chunks so the full response never sits in memory.
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    if chunk:
                        fd.write(chunk)
        return output_file
    except requests.exceptions.RequestException as e:
        print(f"Error downloading data: {e}")
        return None

In [23]:
def compress_hmda_data(f):
    # gzip replaces f with f.gz; check=True raises CalledProcessError if gzip fails.
    subprocess.run(['gzip', str(f)], check=True)
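
If shelling out to `gzip` is a problem (for example on a machine without the CLI), the same step can be done with the standard library. This is a sketch of an alternative, not the approach used in the cell above; `compress_file` is a hypothetical name.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def compress_file(path):
    """Gzip `path` to `<path>.gz` and remove the original, like the gzip CLI does."""
    src = Path(path)
    dst = src.with_name(src.name + '.gz')
    with open(src, 'rb') as f_in, gzip.open(dst, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # copies in blocks, not all at once
    src.unlink()
    return dst

# Round-trip check on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp, 'sample.csv')
    sample.write_text('a,b\n1,2\n')
    gz_path = compress_file(sample)
    with gzip.open(gz_path, 'rt') as fh:
        round_trip = fh.read()
    original_removed = not sample.exists()
```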

In [24]:
def existence_check():
    return False
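
`existence_check` is still a stub. A minimal sketch of what it could become, covering the "prompt user for overwrite" TODO item: the name `should_download` and the injectable `ask` parameter are assumptions made here so the function can be exercised without a real prompt.

```python
from pathlib import Path

def should_download(output_file, ask=input):
    """Return True if nothing is on disk yet, or the user agrees to overwrite."""
    # The download step writes <name>.csv and the gzip step leaves <name>.csv.gz,
    # so either file counts as "already downloaded".
    candidates = [Path(output_file), Path(f'{output_file}.gz')]
    existing = [p for p in candidates if p.exists()]
    if not existing:
        return True
    answer = ask(f'{existing[0]} already exists. Overwrite? [y/N] ')
    return answer.strip().lower() in ('y', 'yes')

# Non-interactive demo: a path that does not exist never triggers the prompt.
missing_ok = should_download(Path('hmda_data', 'no-such-file.csv'), ask=lambda _: 'n')
```

In the download loop this would be called before `download_hmda_data`, skipping the year when it returns `False`.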

In [25]:
state = 'MN'
years = ['2020', '2021', '2022']

for year in years:
    print(f'Downloading HMDA data for: {year} {state}.....', end='')
    data_file_path = download_hmda_data(state, year)
    if not data_file_path:
        # Download failed (the error was already printed); nothing to compress.
        continue
    print(' compressing.....', end='')
    compress_hmda_data(data_file_path)
    print(' done!')

Downloading HMDA data for: 2020 MN..... compressing..... done!
Downloading HMDA data for: 2021 MN..... compressing..... done!
Downloading HMDA data for: 2022 MN..... compressing..... done!


In [12]:
df = pd.read_csv(Path('hmda_data', 'test-MN-2022.csv'))



In [28]:
data_path = 'hmda_data'
filenames = os.listdir(data_path)
all_dataframes = []

for filename in filenames:
    if filename.endswith('.csv.gz'):
        filepath = Path(data_path, filename)
        # pandas infers gzip compression from the .gz suffix, so no gzip.open() wrapper is needed.
        df = pd.read_csv(filepath)
        all_dataframes.append(df)

df = pd.concat(all_dataframes, ignore_index=True)
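
In the displayed frame, `county_code` appears as `27021.0` and `census_tract` as `2.702196e+10`, so both identifier columns were parsed as floats. One way to keep them as codes is to pin their dtypes at read time. The sketch below uses an in-memory CSV with made-up values; which columns to pin (`ID_DTYPES`) is an assumption.

```python
import io
import pandas as pd

# Hypothetical choice of columns to pin: both are identifiers, not quantities.
ID_DTYPES = {'county_code': 'string', 'census_tract': 'string'}

# Made-up rows standing in for a real HMDA file; the second row has missing codes.
sample_csv = io.StringIO(
    'activity_year,county_code,census_tract\n'
    '2022,27053,27053026500\n'
    '2022,,\n'
)
sample_df = pd.read_csv(sample_csv, dtype=ID_DTYPES)
print(sample_df.dtypes)
```

The same `dtype=ID_DTYPES` argument can be passed to the `pd.read_csv(filepath)` call in the loading loop.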



In [29]:
df

|  | activity_year | lei | derived_msa-md | state_code | county_code | census_tract | conforming_loan_limit | derived_loan_product_type | derived_dwelling_category | derived_ethnicity | ... | denial_reason-2 | denial_reason-3 | denial_reason-4 | tract_population | tract_minority_population_percent | ffiec_msa_md_median_family_income | tract_to_msa_income_percentage | tract_owner_occupied_units | tract_one_to_four_family_homes | tract_median_age_of_housing_units |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2021 | 549300Q76VHK6FGPX546 | 99999 | MN | 27021.0 | 2.702196e+10 | C | Conventional:Subordinate Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 2310 | 20.35 | 72400 | 74 | 941 | 3848 | 35 |
| 1 | 2021 | 549300Q76VHK6FGPX546 | 99999 | MN | 27133.0 | 2.713357e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 2384 | 3.02 | 72400 | 111 | 782 | 955 | 66 |
| 2 | 2021 | 549300Q76VHK6FGPX546 | 33460 | MN | 27139.0 | 2.713908e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 5230 | 5.51 | 100600 | 124 | 1631 | 1825 | 26 |
| 3 | 2021 | 549300Q76VHK6FGPX546 | 99999 | MN | 27063.0 | 2.706348e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 2357 | 3.52 | 72400 | 103 | 819 | 1146 | 63 |
| 4 | 2021 | 549300Q76VHK6FGPX546 | 99999 | MN | 27035.0 | 2.703595e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... |  |  |  | 5300 | 2.26 | 72400 | 134 | 1728 | 2666 | 30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1200773 | 2022 | 549300RBJCM5B02O5U05 | 99999 | MN | 27091.0 | 2.709179e+10 | C | Conventional:Subordinate Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 1722 | 6.33 | 83600 | 106 | 638 | 935 | 69 |
| 1200774 | 2022 | 549300RBJCM5B02O5U05 | 99999 | MN | 27173.0 | 2.717397e+10 | C | Conventional:Subordinate Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 2302 | 8.73 | 83600 | 101 | 798 | 1120 | 71 |
| 1200775 | 2022 | 549300RBJCM5B02O5U05 | 99999 | MN | 27063.0 | 2.706348e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 2175 | 4.28 | 83600 | 119 | 839 | 1136 | 67 |
| 1200776 | 2022 | 549300RBJCM5B02O5U05 | 99999 | MN | 27033.0 | 2.703327e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... |  |  |  | 3249 | 25.02 | 83600 | 86 | 898 | 1210 | 63 |


In [30]:
for c in df.columns:
    print(c)

activity_year
lei
derived_msa-md
state_code
county_code
census_tract
conforming_loan_limit
derived_loan_product_type
derived_dwelling_category
derived_ethnicity
derived_race
derived_sex
action_taken
purchaser_type
preapproval
loan_type
loan_purpose
lien_status
reverse_mortgage
open-end_line_of_credit
business_or_commercial_purpose
loan_amount
loan_to_value_ratio
interest_rate
rate_spread
hoepa_status
total_loan_costs
total_points_and_fees
origination_charges
discount_points
lender_credits
loan_term
prepayment_penalty_term
intro_rate_period
negative_amortization
interest_only_payment
balloon_payment
other_nonamortizing_features
property_value
construction_method
occupancy_type
manufactured_home_secured_property_type
manufactured_home_land_property_interest
total_units
multifamily_affordable_units
income
debt_to_income_ratio
applicant_credit_score_type
co-applicant_credit_score_type
applicant_ethnicity-1
applicant_ethnicity-2
applicant_ethnicity-3
applicant_ethnicity-4
applicant_ethnicit

In [31]:
for c in df.columns:
    print(f'Examining column: {c}')
    print(df[c].value_counts())
    print()

Examining column: activity_year
activity_year
2020    493242
2021    460643
2022    246893
Name: count, dtype: int64

Examining column: lei
lei
6BYL5QZYBDK8S7L73M02    95149
KB1H1DSPRFMYMCUFXT09    66536
549300FGXN1K3HLB1R50    50244
549300WYBPIWKK6SQC06    44872
549300HW662MN1WU8550    29582
                        ...  
549300LU6Y2TXG48QY48        1
549300MUTFJQGRZJH019        1
2549008684GZZI5B5H82        1
549300C4ZH7G6OB81F33        1
549300323EON4W3CCM44        1
Name: count, Length: 1165, dtype: int64

Examining column: derived_msa-md
derived_msa-md
33460    848390
99999    193196
40340     42995
20260     41354
41060     33821
31860     15684
22020     12749
0          5463
24220      4305
29100      2821
Name: count, dtype: int64

Examining column: state_code
state_code
MN    1200778
Name: count, dtype: int64

Examining column: county_code
county_code
27053.0    285257
27037.0    111249
27123.0     99383
27003.0     94114
27163.0     75582
            ...  
27011.0       561
2