# HMDA Data Testing

## TODO:
### Data Collection
- [x] download data straight from https://ffiec.cfpb.gov/documentation/api/data-browser/ API
- [ ] check for existence of data download - prompt user for overwrite if already there
- [x] is there a better way to save off the data stream from the API?
- [x] write to output file in chunks, rather than all at once?
- [ ] multithreading on pd.read_csv()
- [ ] it doesn't look like the API call filters/params are working as intended??
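
The unchecked overwrite item above could be handled with a small guard that runs before the download starts. This is a minimal sketch, not part of the notebook: `confirm_overwrite` and its `ask` parameter are hypothetical names, and the y/N prompt wording is an assumption.

```python
from pathlib import Path


def confirm_overwrite(path, ask=input):
    """Return True if it is OK to write to `path`.

    A missing file needs no confirmation. An existing file triggers a
    y/N prompt, and anything other than "y" or "yes" counts as "no".
    `ask` is injectable so the prompt can be replaced in tests or scripts.
    """
    path = Path(path)
    if not path.exists():
        return True
    answer = ask(f"{path} already exists. Overwrite? [y/N] ")
    return answer.strip().lower() in ("y", "yes")
```

Inside `download_hmda_data`, a call such as `if not confirm_overwrite(output_file): return False` placed before the request would skip the download when the user declines.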

### Data Cleaning
- [ ] determine what columns to keep or drop
- [ ] merge like columns together, e.g. 'denial_reason-1', 'denial_reason-2', 'denial_reason-3', 'denial_reason-4'
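
One way the column merge might look, as a sketch only: `merge_numbered_columns` is a hypothetical helper, and joining the non-missing values with `;` is an assumed output format, not a decision the notebook has made.

```python
import pandas as pd


def _fmt(value):
    # Codes read from float columns arrive as 1.0, 3.0, ...; show them as 1, 3.
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def merge_numbered_columns(df, prefix, sep=";"):
    """Collapse '<prefix>-1', '<prefix>-2', ... into a single '<prefix>' column."""
    cols = [
        c for c in df.columns
        if c.startswith(f"{prefix}-") and c[len(prefix) + 1:].isdigit()
    ]
    if not cols:
        raise ValueError(f"no columns found for prefix {prefix!r}")
    # Keep the numbered order (-1, -2, ...) regardless of column position.
    cols.sort(key=lambda c: int(c.rsplit("-", 1)[1]))
    merged = df[cols].apply(
        lambda row: sep.join(_fmt(v) for v in row if pd.notna(v)),
        axis=1,
    )
    out = df.drop(columns=cols)
    out[prefix] = merged
    return out
```

Calling `merge_numbered_columns(df, 'denial_reason')` would return a copy of the frame with the four `denial_reason-N` columns replaced by one string column. A row-wise `apply` is slow on large frames, so this is meant to show the shape of the result rather than a tuned implementation.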

In [1]:
import requests
import pandas as pd
from pathlib import Path
from hmda_lib import valid_state_codes
from hmda_lib import valid_years

In [2]:
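def download_hmda_data(state, year):
    """Stream one HMDA CSV extract into hmda_data/ and report success."""
    output_file = Path('hmda_data', f'test-{state}-{year}.csv')
    # Create the output directory on first use so open() below cannot fail on it.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    url = 'https://ffiec.cfpb.gov/v2/data-browser-api/view/nationwide/csv'
    params = {
        'states': state,
        'years': year
    }

    try:
        # stream=True keeps the body out of memory; it is written chunk by chunk.
        with requests.get(url, params=params, stream=True) as response:
            response.raise_for_status()
            with open(output_file, 'wb') as fd:
                # 1 MiB chunks mean far fewer write calls than 1 KiB chunks.
                for chunk in response.iter_content(chunk_size=1024 * 1024):
                    if chunk:  # skip empty keep-alive chunks
                        fd.write(chunk)
        return True
    except requests.exceptions.RequestException as e:
        print(f"Error downloading data: {e}")
        return False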

In [21]:
download_hmda_data('mn', '2021')

True
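
A `True` here only means the HTTP request and the file write succeeded; it does not show that the `states` filter was applied. The sketch below checks the downloaded file directly. The helper names are hypothetical, and it assumes the file has a `state_code` column holding uppercase two-letter codes, as the output later in this notebook suggests. One thing worth checking against the API documentation is whether the `nationwide` endpoint honors a `states` parameter at all; that is a guess, not something this notebook confirms.

```python
import csv
from collections import Counter


def count_state_codes(csv_path, column="state_code"):
    """Count rows per state code, streaming the CSV one row at a time."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fd:
        for row in csv.DictReader(fd):
            counts[row[column]] += 1
    return counts


def unexpected_states(csv_path, requested_state):
    """Return row counts for every state code other than the requested one."""
    wanted = requested_state.upper()
    counts = count_state_codes(csv_path)
    return {code: n for code, n in counts.items() if code != wanted}
```

An empty dict from `unexpected_states(Path('hmda_data', 'test-mn-2021.csv'), 'mn')` would indicate the filter worked; any other result lists the states that leaked through.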

In [3]:
df = pd.read_csv(Path('hmda_data', 'smalltest-mn-2021.csv'))
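
In the frame displayed below, `county_code` appears as `25017.0` and `census_tract` in scientific notation, which suggests pandas inferred floats for identifier columns. A possible adjustment, sketched under the assumption that these two columns should be treated as text, is to pass explicit dtypes. `read_hmda_csv` is a hypothetical wrapper and the sample rows are made up for illustration.

```python
import io

import pandas as pd

# Identifier-like columns that should not be parsed as numbers.
# Treating them as strings is an assumption about downstream use.
ID_COLUMNS = {"county_code": str, "census_tract": str}


def read_hmda_csv(source, **kwargs):
    """Read an HMDA CSV while keeping identifier columns as text."""
    # low_memory=False makes pandas infer the remaining dtypes from the
    # whole file at once instead of chunk by chunk.
    return pd.read_csv(source, dtype=ID_COLUMNS, low_memory=False, **kwargs)


# Made-up rows; the second one has a leading zero and a missing tract.
sample = io.StringIO(
    "activity_year,state_code,county_code,census_tract\n"
    "2021,MA,25017,25017370100\n"
    "2021,CO,08031,\n"
)
df_sample = read_hmda_csv(sample)
print(df_sample.dtypes)
```

If the raw file stores codes such as `08031`, reading them as strings also preserves the leading zero that a numeric dtype would drop.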


In [4]:
df

| | activity_year | lei | derived_msa-md | state_code | county_code | census_tract | conforming_loan_limit | derived_loan_product_type | derived_dwelling_category | derived_ethnicity | ... | denial_reason-2 | denial_reason-3 | denial_reason-4 | tract_population | tract_minority_population_percent | ffiec_msa_md_median_family_income | tract_to_msa_income_percentage | tract_owner_occupied_units | tract_one_to_four_family_homes | tract_median_age_of_housing_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 549300NWBS6MQJX15N44 | 15764 | MA | 25017.0 | 2.501737e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 5103 | 16.07 | 120200 | 179 | 1770 | 1926 | 53 |
| 1 | 2021 | 549300NWBS6MQJX15N44 | 23540 | FL | 12001.0 | 1.200100e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 11217 | 33.20 | 68400 | 144 | 2741 | 3541 | 20 |
| 2 | 2021 | 549300NWBS6MQJX15N44 | 19124 | TX | 48085.0 | 4.808503e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 5312 | 24.51 | 89000 | 107 | 1551 | 1870 | 14 |
| 3 | 2021 | 549300NWBS6MQJX15N44 | 43300 | TX | 48181.0 | 4.818100e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 5771 | 11.61 | 70500 | 115 | 1962 | 2451 | 47 |
| 4 | 2021 | 549300NWBS6MQJX15N44 | 26420 | TX | 48201.0 | 4.820131e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 4366 | 78.63 | 79800 | 75 | 681 | 1872 | 67 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99994 | 2021 | 549300BRJZYHYKT4BJ84 | 99999 | CO | 8107.0 | 8.107000e+09 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 2049 | 1.02 | 73500 | 118 | 595 | 1488 | 26 |
| 99995 | 2021 | 549300BRJZYHYKT4BJ84 | 19740 | CO | 8031.0 | 8.031008e+09 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Joint | ... | | | | 8352 | 80.24 | 104800 | 90 | 1769 | 2423 | 16 |
| 99996 | 2021 | 549300BRJZYHYKT4BJ84 | 29820 | NV | 32003.0 | 3.200301e+10 | C | FHA:First Lien | Single Family (1-4 Units):Site-Built | Not Hispanic or Latino | ... | | | | 8307 | 43.01 | 72400 | 104 | 2168 | 2950 | 12 |
| 99997 | 2021 | 549300BRJZYHYKT4BJ84 | 33340 | WI | 55133.0 | 5.513320e+10 | C | Conventional:First Lien | Single Family (1-4 Units):Site-Built | Ethnicity Not Available | ... | | | | 5031 | 6.86 | 84400 | 156 | 1485 | 1789 | 29 |


In [7]:
for c in df.columns:
    print(c)

activity_year
lei
derived_msa-md
state_code
county_code
census_tract
conforming_loan_limit
derived_loan_product_type
derived_dwelling_category
derived_ethnicity
derived_race
derived_sex
action_taken
purchaser_type
preapproval
loan_type
loan_purpose
lien_status
reverse_mortgage
open-end_line_of_credit
business_or_commercial_purpose
loan_amount
loan_to_value_ratio
interest_rate
rate_spread
hoepa_status
total_loan_costs
total_points_and_fees
origination_charges
discount_points
lender_credits
loan_term
prepayment_penalty_term
intro_rate_period
negative_amortization
interest_only_payment
balloon_payment
other_nonamortizing_features
property_value
construction_method
occupancy_type
manufactured_home_secured_property_type
manufactured_home_land_property_interest
total_units
multifamily_affordable_units
income
debt_to_income_ratio
applicant_credit_score_type
co-applicant_credit_score_type
applicant_ethnicity-1
applicant_ethnicity-2
applicant_ethnicity-3
applicant_ethnicity-4
applicant_ethnicit

In [25]:
for c in df.columns:
    print(f'Examining column: {c}')
    print(df[c].value_counts())
    print()

Examining column: activity_year
activity_year
2021    99999
Name: count, dtype: int64

Examining column: lei
lei
549300BRJZYHYKT4BJ84    72048
549300NWBS6MQJX15N44    11757
549300DPRWSBUY619V27     9051
549300567BJCXPG9IV35     3478
D32W5EBLENJC27207O81     1160
549300TMWSYX6B5ZOK69      722
549300GQOPGZ1DO0MZ49      490
549300TSIYX9RDYWC806      321
54930053KPO7OG48FP72      310
5493001QR7MEE12WC276      268
549300VE85K2XTVRSG76      214
254900SJONZGRA3CJM44      178
B90YWS6AFX2LGWOXJ111        2
Name: count, dtype: int64

Examining column: derived_msa-md
derived_msa-md
99999    8862
31084    4270
40140    3091
19124    2930
49660    2647
         ... 
44940       2
33540       1
21300       1
27060       1
24260       1
Name: count, Length: 404, dtype: int64

Examining column: state_code
state_code
CA    21702
TX     8951
FL     5378
OH     4982
AR     3782
CO     3418
AZ     3369
MO     3293
GA     3179
TN     3169
IL     3167
NJ     2593
NY     2140
WA     1987
UT     1949
KY     1