<a href="https://colab.research.google.com/github/AJLR888/hmda-ny-2007-loan-default/blob/main/ny_2007_data_preprocesing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **HMDA. Mortgage Data Analysis and Modeling Predictions.**

## The Dataset: https://www.consumerfinance.gov/data-research/hmda/

The Home Mortgage Disclosure Act (HMDA) requires many financial institutions to maintain, report, and publicly disclose loan-level information about mortgages. These data help show whether lenders are serving the housing needs of their communities; they give public officials information that helps them make decisions and policies; and they shed light on lending patterns that could be discriminatory. The public data are modified to protect applicant and borrower privacy.

HMDA was originally enacted by Congress in 1975 and is implemented by Regulation C.

## The Goal:
We will use two models, Logistic Regression and LightGBM. To assess their predictions, we will use ROC-AUC and Log Loss. The purpose is to:

1.   **Find out whether there are any intersectional biases** in the models' predictions.
2.   **Identify which features most influenced the models' outcomes**, using SHAP values.
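As a rough illustration of the two evaluation metrics named above, the sketch below scores a handful of made-up labels and predicted probabilities. The arrays are invented for the example and are not HMDA results, and the use of scikit-learn's `roc_auc_score` and `log_loss` is an assumption about the tooling.

```python
from sklearn.metrics import log_loss, roc_auc_score

# Hypothetical ground truth (1 = positive class) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# ROC-AUC measures ranking quality: 1.0 is perfect, 0.5 is random guessing.
auc = roc_auc_score(y_true, y_prob)

# Log loss penalises confident but wrong probabilities: lower is better.
ll = log_loss(y_true, y_prob)

print(f"ROC-AUC:  {auc:.3f}")
print(f"Log loss: {ll:.3f}")
```

ROC-AUC only looks at how well the scores rank positives above negatives, while Log Loss also rewards well-calibrated probabilities, which is why the two are reported together.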

## Steps:

1.   Setup of working environment and libraries.
2.   EDA
  *   Understanding of each feature. What does each feature describe?
  *   Taking a sample to understand the data structure.

      *   df.shape
      *   df.columns
      *   df.dtypes






# Step 1: Setup of working environment and libraries.

In [1]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Setting up GitHub
import os
import subprocess
from getpass import getpass  # Reads the token without echoing it on screen

# Importing working space
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# "!git init" only works in IPython environments such as Google Colab; subprocess.run(["git", "init"]) is plain Python and runs anywhere.
# "git init" is run only once, when starting a new project: it creates the .git directory that stores the repository data in the project folder.
#subprocess.run(["git", "init"])


In [3]:
# Username and email:
subprocess.run(["git", "config", "--global", "user.name", "AJLR888"], check=True)
subprocess.run(["git", "config", "--global", "user.email", "roldan.analytics@gmail.com"], check=True)


# Storing GitHub token and repository details
GITHUB_TOKEN = getpass("Enter GitHub Token:")
REPO_OWNER = "AJLR888"
REPO_NAME = "hmda-ny-2007-loan-default"
BRANCH_NAME = "main"

# Setting the GitHub remote URL with authentication
GIT_REMOTE_URL = f"https://{GITHUB_TOKEN}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
os.system(f"git remote set-url origin {GIT_REMOTE_URL}")

Enter GitHub Token:··········


32768
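The value 32768 printed above is the raw return value of `os.system`. On Unix-like systems (which Colab is assumed to be here) this is a wait status rather than the exit code itself. A minimal sketch of decoding it, assuming Python 3.9 or later:

```python
import os

# Raw value returned by os.system in the cell above (taken from its output).
wait_status = 32768

# On Unix, the exit code is stored in the high byte of the wait status.
exit_code = os.waitstatus_to_exitcode(wait_status)
print(exit_code)

# A non-zero exit code means the command failed.
command_succeeded = exit_code == 0
print(command_succeeded)
```

Git typically exits with a non-zero code such as this one on fatal errors. One plausible cause is that the working directory was not yet a Git repository when the cell ran, although the notebook output does not confirm this.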

In [4]:
# Loading the data
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')

# pd.read_csv already returns a DataFrame, so this conversion is redundant
df = pd.DataFrame(df)


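Identifier columns such as `respondent_id` contain leading zeros (for example `0000018039`). In the full file the column is read as text, but declaring the type explicitly when loading guards against silent conversion. The sketch below uses a tiny in-memory CSV invented for the example; only the `dtype` argument of `pd.read_csv` is the point.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the HMDA CSV (two columns, invented amounts).
csv_text = "respondent_id,loan_amount_000s\n0000018039,120\n0000501105,250\n"

# Without an explicit dtype, pandas parses the identifier as an integer
# and the leading zeros are lost.
df_default = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as a string keeps the identifier intact.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"respondent_id": str})

print(df_default["respondent_id"].iloc[0])
print(df_typed["respondent_id"].iloc[0])
```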


# Step 2: EDA



## A)   Understanding of each feature. What does each feature describe?




1.   **as_of_year**: The year the mortgage was reported.
2.   **respondent_id**: The lender.
3.   **agency_name**: The regulatory agency responsible for overseeing the financial institution that reported the mortgage.
4.   **agency_abbr**: Abbreviation of the agency name.

5.   **agency_code**: Numeric code of the regulator that supervises the lender reporting the data.

6.   **loan_type_name**: The type of covered loan or application (conventional, FHA-insured, VA-guaranteed, or FSA/RHS-guaranteed).

7.   **property_type_name**: The type of property the loan relates to:

  *   One to four-family (other than manufactured housing)
  *   Manufactured housing
  *   Multifamily


8.   **loan_purpose_name**: What the loan was used for (home purchase, refinancing, or home improvement).

9.   **owner_occupancy_name**: Whether the owner intends to live in the property.

10.   **loan_amount_000s**: The loan amount, in thousands of dollars.

11.   **preapproval_name**: Preapproval only applies to home purchase loans, not refinancing or home improvement loans.

12.   **action_taken_name**: The final outcome of a loan application.

13.   **msamd_name**: MSA stands for Metropolitan Statistical Area, a region with a high population density at its core and close economic ties throughout the area. MD stands for Metropolitan Division, a sub-region within a large MSA (used when the area is particularly populous).


  *   *Considerations*:
    *   If the property is outside an MSA/MD, the field may be blank.
    *   It helps analyze lending patterns, approval rates, and borrower demographics in specific regions.
    *   msamd is broader (city or regional level) than census_tract_number, so it is better suited to analyzing housing trends or market segmentation, for instance.

14.   **census_tract_number**: A census tract is a small area designed to be socially and economically homogeneous. In other words, tracts are roughly comparable to neighbourhoods.

16.   **purchaser_type_name**: Describes who purchased the loan in the secondary market after it was originated by the reporting institution.

17.   **denial_reason_name_1**: The main reason the application was denied. Note that NA generally means the loan was not denied.

18.   **rate_spread:** The extra interest that a borrower pays compared with the "best-qualified" borrowers. Generally speaking, a higher spread signals higher risk, although there are nuances to consider: for example, someone can have a higher rate spread because of a bad credit score that doesn't reflect their ability and willingness to pay back.

19.   **hoepa_status_name**: HOEPA stands for Home Ownership and Equity Protection Act, a law designed to protect borrowers from predatory lending.
  *   Considerations:
    *   It can be used to track whether high-cost loans are disproportionately targeted at vulnerable groups.

20.   **lien_status_name**: Indicates whether the loan is secured by a lien on the property (first or subordinate lien) or not secured at all.

21.   **edit_status_name**: Edits are rules that help filers check the accuracy of HMDA data prior to submission. This field indicates whether a record has any accuracy (quality) issues.  
    *   Note: In our dataset almost 200,000 records are marked "Quality edit failure only", which points to some potential data inconsistency.

22.   **minority_population:** Percentage of the tract's total population that belongs to a minority group.

23.   **hud_median_family_income:** Median family income for the MSA (Metropolitan Statistical Area) or MD (Metropolitan Division).
      *   Note: HUD stands for Housing and Urban Development.

24.   **population**: Total population in the tract (tract ~ neighbourhood).

25.   **tract_to_msamd_income:** Indicates how wealthy or poor a tract is compared with its surrounding area. It compares the median family income of the census tract with the median family income of the corresponding MSA or MD. Remember that a tract lies inside an MSA/MD.

26.   **number_of_owner_occupied_units:** Total number of housing units in the census tract that are occupied by their owners rather than rented.

27.   **number_of_1_to_4_family_units:** The number of dwellings in the tract built to house one to four families. The higher the number, the lower the population density tends to be.
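To make the tract_to_msamd_income description concrete, here is a small worked sketch, assuming the field is expressed as a percentage (tract median family income divided by the MSA/MD median, times 100). The function name and the income figures are hypothetical and chosen only for illustration.

```python
def tract_to_msamd_income(tract_median_income: float, msamd_median_income: float) -> float:
    """Return tract median family income as a percentage of the MSA/MD median."""
    if msamd_median_income <= 0:
        raise ValueError("msamd_median_income must be positive")
    return 100.0 * tract_median_income / msamd_median_income


# Hypothetical figures: a tract at 52,000 inside an MSA/MD at 65,000.
ratio = tract_to_msamd_income(52_000, 65_000)
print(round(ratio, 2))
```

Under this reading, values below 100 describe a tract that is poorer than its surrounding area and values above 100 a wealthier one.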

--------------------------------------------------------------------------------

Check the following sources for further understanding:

* Filing instructions guide: https://ffiec.cfpb.gov/documentation/fig/2024/overview

* Glossary: https://www.ffiec.gov/hmda/glossary.htm#top

* Original source: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=nationwide&records=all-records&field_descriptions=labels

* See "lar_record_codes.pdf" in my GitHub repository for further information.

##  B) Taking a sample to understand the data

### Data structure


In [5]:
# Taking the sample
df_sample = df.sample(n=12000, random_state=42)

# Dataset shape
print(df_sample.shape, '\n')


(12000, 78) 



In [13]:
pd.set_option('display.max_rows', None)

print(df_sample.dtypes)

as_of_year                          int64
respondent_id                      object
agency_name                        object
agency_abbr                        object
agency_code                         int64
loan_type_name                     object
loan_type                           int64
property_type_name                 object
property_type                       int64
loan_purpose_name                  object
loan_purpose                        int64
owner_occupancy_name               object
owner_occupancy                     int64
loan_amount_000s                    int64
preapproval_name                   object
preapproval                         int64
action_taken_name                  object
action_taken                        int64
msamd_name                         object
msamd                             float64
state_name                         object
state_abbr                         object
state_code                          int64
county_name                       

In [None]:
# Checking the number of null values in each column
pd.set_option('display.max_rows', None)

print(df_sample.isnull().sum())

as_of_year                            0
respondent_id                         0
agency_name                           0
agency_abbr                           0
agency_code                           0
loan_type_name                        0
loan_type                             0
property_type_name                    0
property_type                         0
loan_purpose_name                     0
loan_purpose                          0
owner_occupancy_name                  0
owner_occupancy                       0
loan_amount_000s                      0
preapproval_name                      0
preapproval                           0
action_taken_name                     0
action_taken                          0
msamd_name                         1087
msamd                              1087
state_name                            0
state_abbr                            0
state_code                            0
county_name                          11
county_code                          11
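Raw null counts are easier to compare when expressed as a share of rows. The sketch below shows one way to do that on a toy frame; the values are invented and `toy` merely stands in for `df_sample`.

```python
import pandas as pd

# Toy frame standing in for df_sample; the values are invented.
toy = pd.DataFrame({
    "msamd_name": ["Albany", None, "Buffalo", None],
    "county_name": ["Albany County", "Erie County", None, "Kings County"],
    "loan_amount_000s": [120, 250, 90, 400],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually have gaps.
print(missing_share[missing_share > 0])
```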


In [None]:
# Checking content in the columns
pd.set_option('display.max_columns', None)
print(df_sample.head(5))

        as_of_year respondent_id                                  agency_name  \
400869        2007    0000018039                 Office of Thrift Supervision   
25128         2007    0000501105                       Federal Reserve System   
73622         2007    0001881185                       Federal Reserve System   
637936        2007    4216200005  Department of Housing and Urban Development   
570628        2007    56-0811711    Office of the Comptroller of the Currency   

       agency_abbr  agency_code loan_type_name  loan_type  \
400869         OTS            4   Conventional          1   
25128          FRS            2   Conventional          1   
73622          FRS            2   Conventional          1   
637936         HUD            7   Conventional          1   
570628         OCC            1   Conventional          1   

                                       property_type_name  property_type  \
400869  One-to-four family dwelling (other than manufa...             

In [None]:
# Statistics of numerical features
print(df.describe())

       as_of_year   agency_code     loan_type  property_type  loan_purpose  \
count   1009451.0  1.009451e+06  1.009451e+06   1.009451e+06  1.009451e+06   
mean       2007.0  3.223797e+00  1.047713e+00   1.024571e+00  2.096983e+00   
std           0.0  2.301860e+00  2.371925e-01   1.877141e-01  9.307315e-01   
min        2007.0  1.000000e+00  1.000000e+00   1.000000e+00  1.000000e+00   
25%        2007.0  1.000000e+00  1.000000e+00   1.000000e+00  1.000000e+00   
50%        2007.0  2.000000e+00  1.000000e+00   1.000000e+00  2.000000e+00   
75%        2007.0  4.000000e+00  1.000000e+00   1.000000e+00  3.000000e+00   
max        2007.0  7.000000e+00  4.000000e+00   3.000000e+00  3.000000e+00   

       owner_occupancy  loan_amount_000s   preapproval  action_taken  \
count     1.009451e+06      1.009451e+06  1.009451e+06  1.009451e+06   
mean      1.084364e+00      2.591246e+02  2.813417e+00  2.658396e+00   
std       2.987558e-01      5.125972e+02  4.372745e-01  1.701818e+00   
min      

In [None]:
# Checking unique values for categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
  print(f"Value counts for: {col}")
  print(df[col].value_counts(), "\n")

Value counts for: respondent_id
respondent_id
0001644643    65163
0000000008    59394
0000001741    54209
0000018039    39448
13-3222578    38384
0000013044    36366
16-1245395    34902
0000008551    29951
0001881185    28936
0003197956    24871
0000003970    24157
0000001461    23068
7069000008    18959
4216200005    17405
75-2921540    16695
0000024571    15681
7185300006    13798
0001868177    13126
0000015044    12041
05-0402708    11458
36-3744610    11027
0000006069    10699
0000501105    10534
13-3210378     9694
41-1704421     8253
0510528989     8231
7604800006     6978
0003394401     6527
0000008531     6496
0000014470     6443
0000000786     6358
0000014761     6256
1463600006     6212
7197000003     6072
0000017094     6026
0000012642     5856
0000000001     5616
0000012504     5609
3027509990     5305
0000008412     5113
0002752527     5048
0002752321     5025
0000017945     4622
0000024563     4513
0001072246     4512
56-0811711     4319
22-3887207     4250
0330661303    

# Data Cleaning

In [None]:
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
  print(f"Value counts for: {col}")
  print(df[col].value_counts(), "\n")

Value counts for: loan_type_name
loan_type_name
Conventional          579399
FHA-insured            19791
VA-guaranteed           1697
FSA/RHS-guaranteed       387
Name: count, dtype: int64 

Value counts for: property_type_name
property_type_name
One-to-four family dwelling (other than manufactured housing)    590101
Manufactured housing                                               7094
Multifamily dwelling                                               4079
Name: count, dtype: int64 

Value counts for: loan_purpose_name
loan_purpose_name
Refinancing         288173
Home purchase       237915
Home improvement     75186
Name: count, dtype: int64 

Value counts for: action_taken_name
action_taken_name
Loan originated                                        231939
Application denied by financial institution            179429
Application withdrawn by applicant                      61678
Application approved but not accepted                   56103
Loan purchased by the institution          

## Excluding irrelevant records from the following columns:

*   applicant_ethnicity_name
*   applicant_race_name_1
*   applicant_sex_name
*   action_taken_name





In [None]:
df = df[
    ~df["action_taken_name"].isin([
        "File closed for incompleteness",
        "Preapproval request denied by financial institution",
        "Preapproval request approved but not accepted",
        "Application withdrawn by applicant"
    ]) &
    ~df["applicant_ethnicity_name"].isin([
        "Information not provided by applicant in mail, Internet, or telephone application",
        "Not applicable"
    ]) &
    ~df["applicant_race_name_1"].isin([
        "Information not provided by applicant in mail, Internet, or telephone application",
        "Not applicable"
    ]) &
    ~df["applicant_sex_name"].isin([
        "Information not provided by applicant in mail, Internet, or telephone application",
        "Not applicable"
    ])
]
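The same exclusion logic can also be written in a data-driven way, which may be easier to extend when more columns need filtering. This is only a sketch on invented rows: the `exclusions` dictionary and the `toy` frame are hypothetical and cover a subset of the columns used above.

```python
import pandas as pd

NOT_PROVIDED = "Information not provided by applicant in mail, Internet, or telephone application"

# Column -> values to exclude, mirroring part of the filter above.
exclusions = {
    "action_taken_name": ["File closed for incompleteness", "Application withdrawn by applicant"],
    "applicant_sex_name": [NOT_PROVIDED, "Not applicable"],
}

# Toy data with invented rows.
toy = pd.DataFrame({
    "action_taken_name": ["Loan originated", "File closed for incompleteness", "Loan originated"],
    "applicant_sex_name": ["Female", "Male", "Not applicable"],
})

# Start with every row kept, then drop rows matching any exclusion list.
keep = pd.Series(True, index=toy.index)
for column, bad_values in exclusions.items():
    keep &= ~toy[column].isin(bad_values)

filtered = toy[keep]
print(filtered)
```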


In [None]:
print(df.shape)

(410858, 16)


## Addressing missing values

In [None]:
print(df.isnull().sum())

loan_type_name                   0
property_type_name               0
loan_purpose_name                0
loan_amount_000s                 0
action_taken_name                0
msamd_name                   36465
census_tract_number            483
applicant_ethnicity_name         0
applicant_race_name_1            0
applicant_sex_name               0
applicant_income_000s        22625
denial_reason_name_1        313621
rate_spread                 372712
lien_status_name                 0
hud_median_family_income       508
tract_to_msamd_income          638
dtype: int64


In [None]:
df = df.assign(
    msamd_name=df['msamd_name'].fillna("Unknown"),
    denial_reason_name_1=df['denial_reason_name_1'].fillna("Unknown"),
    rate_spread=df['rate_spread'].fillna(0)
)

df = df.dropna(subset=['hud_median_family_income', 'tract_to_msamd_income', 'applicant_income_000s'])
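Filling rate_spread with 0 makes "not reported" indistinguishable from a genuine zero. One option, shown here as an assumption rather than the approach taken in this notebook, is to record a flag before filling. The column name `rate_spread_reported` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "rate_spread": [3.5, np.nan, np.nan],
    "applicant_income_000s": [80.0, np.nan, 55.0],
})

toy = toy.assign(
    # Hypothetical flag: record whether a spread was reported before filling.
    rate_spread_reported=toy["rate_spread"].notna(),
    rate_spread=toy["rate_spread"].fillna(0),
)

# Rows without an income value are dropped, as in the cell above.
toy = toy.dropna(subset=["applicant_income_000s"])
print(toy)
```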


In [None]:
print(df.shape)
print(df.isnull().sum())

(387610, 16)
loan_type_name              0
property_type_name          0
loan_purpose_name           0
loan_amount_000s            0
action_taken_name           0
msamd_name                  0
census_tract_number         0
applicant_ethnicity_name    0
applicant_race_name_1       0
applicant_sex_name          0
applicant_income_000s       0
denial_reason_name_1        0
rate_spread                 0
lien_status_name            0
hud_median_family_income    0
tract_to_msamd_income       0
dtype: int64


## Feature selection Part 1
We are considering only individual applicants, hence we keep only the "No co-applicant" records.

In [None]:
df = df[df['co_applicant_ethnicity_name'] == 'No co-applicant']

In [None]:
print(df.dtypes)

as_of_year                          int64
respondent_id                      object
agency_name                        object
agency_abbr                        object
agency_code                         int64
loan_type_name                     object
loan_type                           int64
property_type_name                 object
property_type                       int64
loan_purpose_name                  object
loan_purpose                        int64
owner_occupancy_name               object
owner_occupancy                     int64
loan_amount_000s                    int64
preapproval_name                   object
preapproval                         int64
action_taken_name                  object
action_taken                        int64
msamd_name                         object
msamd                             float64
state_name                         object
state_abbr                         object
state_code                          int64
county_name                       

## Selection of relevant features

In [None]:
df = df[[
    "loan_type_name",
    "property_type_name",
    "loan_purpose_name",
    "loan_amount_000s",
    "action_taken_name",
    "msamd_name",
    "census_tract_number",
    "applicant_ethnicity_name",
    "applicant_race_name_1",
    "applicant_sex_name",
    "applicant_income_000s",
    "denial_reason_name_1",
    "rate_spread",
    "lien_status_name",
    "hud_median_family_income",
    "tract_to_msamd_income"
]]

In [None]:
print(df.dtypes)

loan_type_name               object
property_type_name           object
loan_purpose_name            object
loan_amount_000s              int64
action_taken_name            object
msamd_name                   object
census_tract_number         float64
applicant_ethnicity_name     object
applicant_race_name_1        object
applicant_sex_name           object
applicant_income_000s       float64
denial_reason_name_1         object
rate_spread                 float64
lien_status_name             object
hud_median_family_income    float64
tract_to_msamd_income       float64
dtype: object


In [None]:
print(df.shape)

(601274, 16)


In [None]:
print(df.describe())

       loan_amount_000s  census_tract_number  applicant_income_000s  \
count     601274.000000        600501.000000          558351.000000   
mean         257.485403          1434.016365             115.918044   
std          425.800429          2516.739575             215.812326   
min            1.000000             1.000000               1.000000   
25%           76.000000           132.000000              50.000000   
50%          176.000000           374.000000              80.000000   
75%          368.000000          1351.010000             123.000000   
max        93000.000000          9929.000000            9999.000000   

        rate_spread  hud_median_family_income  tract_to_msamd_income  
count  46312.000000             600473.000000          600246.000000  
mean       5.119319              66569.953853             106.660651  
std        1.658133              14311.443551              46.883337  
min        3.000000              50900.000000               5.050000  
25%  

# Creation of new columns


## ethnicity_race_sex

In [None]:

df['ethnicity_race_sex'] = (
    df['applicant_ethnicity_name'].str.lower()
    + "_" + df['applicant_race_name_1'].str.lower()
    + "_" + df['applicant_sex_name'].str.lower()
)

# Checking column created
print(df[['ethnicity_race_sex']].value_counts())

ethnicity_race_sex                                                     
not hispanic or latino_white_male                                          152266
not hispanic or latino_white_female                                         99758
not hispanic or latino_black or african american_female                     33890
not hispanic or latino_black or african american_male                       27917
hispanic or latino_white_male                                               23592
not hispanic or latino_asian_male                                           16565
hispanic or latino_white_female                                             14471
not hispanic or latino_asian_female                                         10629
not hispanic or latino_american indian or alaska native_male                 1325
hispanic or latino_black or african american_male                            1199
not hispanic or latino_native hawaiian or other pacific islander_male        1152
hispanic or latino_black o
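Once the combined column exists, a first look at intersectional differences can be a per-group outcome rate. The sketch below uses invented rows; the `originated` target is a hypothetical encoding of action_taken_name and not necessarily the target used later in the modeling.

```python
import pandas as pd

# Toy rows with invented values; column names follow the notebook.
toy = pd.DataFrame({
    "ethnicity_race_sex": [
        "not hispanic or latino_white_male",
        "not hispanic or latino_white_male",
        "hispanic or latino_white_female",
        "hispanic or latino_white_female",
    ],
    "action_taken_name": [
        "Loan originated",
        "Application denied by financial institution",
        "Loan originated",
        "Loan originated",
    ],
})

# Hypothetical binary target: 1 if the loan was originated, else 0.
toy["originated"] = (toy["action_taken_name"] == "Loan originated").astype(int)

# Group size and origination rate for each intersectional group.
summary = toy.groupby("ethnicity_race_sex")["originated"].agg(["count", "mean"])
print(summary)
```

Reporting the group size next to the rate matters because several intersectional groups in the real data are small, so their rates are noisy.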

In [None]:
print("541565")

541565


In [None]:
print("0021325")

0021325


# Commit to GitHub


The code below will help us to commit our project onto GitHub:



In [None]:
# Move to the project's location. We use os.chdir() instead of %cd because it is
# plain Python, so the code stays portable across environments, and we are
# trying to replicate a regular working environment.
os.chdir("/content/drive/My Drive/Colab Notebooks/hmda_ny_2007_preprocessing/")

# Source: https://www.tutorialspoint.com/python/os_chdir.htm

In [None]:
# We use subprocess.run(["git", "status"]) instead of "!git status" (Google Colab) so that the cell is plain Python.
# capture_output=True and text=True capture the command's output as text.
result = subprocess.run(["git", "status"], capture_output=True, text=True)

print(result.stdout)  # Print the status message
print(result.stderr)  # Print errors if there are any.

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   ny-2007-data-preprocesing.ipynb

no changes added to commit (use "git add" and/or "git commit -a")
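The capture pattern used above can be wrapped in a small helper so that every Git call returns its exit code and output in the same shape. This is a sketch: `run_and_capture` is a hypothetical name, and the Python interpreter stands in for git so that the example runs even where git is not installed.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command and return (exit code, stdout, stderr) as text."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout.strip(), result.stderr.strip()


# The Python interpreter stands in for git here;
# in the notebook the argument list would be ["git", "status"].
code, out, err = run_and_capture([sys.executable, "-c", "print('clean')"])
print(code, out, err)
```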




In [None]:
# Google Colab: !git add .
# check=True raises CalledProcessError if the command exits with a non-zero status.
subprocess.run(["git", "add", "."], check=True)

CompletedProcess(args=['git', 'add', '.'], returncode=0)
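To see what `check=True` changes, the sketch below runs a command that deliberately fails, first without and then with the flag. The failing command is an invented stand-in for a Git call that goes wrong.

```python
import subprocess
import sys

# A command that exits with status 3, standing in for a failing git call.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]

# Without check=True the failure is silent unless returncode is inspected.
silent = subprocess.run(failing)
print(silent.returncode)

# With check=True a non-zero exit raises CalledProcessError.
try:
    subprocess.run(failing, check=True)
    raised = False
except subprocess.CalledProcessError as exc:
    raised = True
    print(f"command failed with exit code {exc.returncode}")
```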

In [None]:
subprocess.run(["git", "commit", "-m", "Updating-20:00-25032025"], check=True) #Google Colab: !git commit -m "Updating"

CompletedProcess(args=['git', 'commit', '-m', 'Updating-20:19-24032025'], returncode=0)

In [None]:
# TODO: the push below currently fails and still needs fixing.
# subprocess.run(["git", "push", "origin", "master"], check=True)  # Google Colab: !git push origin main
# Possible solution: !git config --global credential.helper store

In [None]:
subprocess.run(args=['git', 'branch'], check=True)


CompletedProcess(args=['git', 'branch'], returncode=0)

In [None]:
logStatus = subprocess.run(["git", "log"], capture_output=True, text=True, check=True) #Google Colab: !git log
print(logStatus.stdout) #Shows output
print(logStatus.stderr) #Shows errors

commit 22f14bb8dd2143947c156433309f96c7048974f4
Author: AJLR888 <roldan.analytics@gmail.com>
Date:   Mon Mar 24 20:21:10 2025 +0000

    Updating-20:19-24032025

commit c7fad8581585547c97e2d4d0653c80278f40d5b0
Author: AJLR888 <roldan.analytics@gmail.com>
Date:   Mon Mar 24 17:57:19 2025 +0000

    Updating-08:40-21032025

commit 3c9d2766364bfb3617c9a311742c26f18461db5e
Author: AJLR888 <roldan.analytics@gmail.com>
Date:   Sun Mar 16 10:21:19 2025 +0000

    Updating10:20-16032025

commit 7a277871e6a4326cefb4c4e12992b04615fdbb6d
Author: AJLR888 <roldan.analytics@gmail.com>
Date:   Sun Mar 16 10:19:04 2025 +0000

    Updating1412-15032025

commit c75bc5b5e448ae2ddf6b9a1baf0728a665f3131f
Author: AJLR888 <roldan.analytics@gmail.com>
Date:   Sat Mar 15 14:29:55 2025 +0000

    Updating1412-15032025

commit f81fdc41532adddf2127dc0d8edd32302a439d01
Author: AJLR888 <roldan.analytics@gmail.com>
Date:   Sat Mar 15 14:28:42 2025 +0000

    Updating1412-15032025

commit d26c1d090a36da8442dc879eccc1