## Target Variable:
- *Current Loan Delinquency Status*: The number of days the borrower is delinquent in making loan payments as of the end of the monthly reporting period, reported as a status code. Used to derive the binary target: in this notebook a loan is flagged as default when the status is 3 or higher (90+ days delinquent) or RA. Guide notes: 0 = Current, 1 = 30-59 days, 2 = 60-89 days, ..., RA = Repayment Plan, RF = REO, 999 = Unknown.

## Predictor Variables (Features):
#### From Origination Data:
- *Credit Score*: The standardized credit score used to evaluate the borrower during the loan origination process. Lower scores indicate higher risk. Guide notes: FICO score, masked as 300 for <300, 850 for >850, or 9999 for missing.
- *Original Combined Loan-to-Value (CLTV)*: The ratio of the original loan amount and any subordinate lien amount to the property value at origination. Higher ratios increase default risk. Guide notes: Rounded to nearest integer, 999 for missing.
- *Original Debt-to-Income (DTI) Ratio*: The ratio of the borrower's total monthly debt payments to gross monthly income at origination. Higher DTI suggests financial strain. Guide notes: Rounded to nearest integer, 999 for missing or not considered.
- *Original Interest Rate*: The interest rate on the loan as stated on the note at the time the loan was originated. Higher rates may lead to higher payments and defaults. Guide notes: Reported to the nearest eighth of a percent.
- *Original Loan Term*: The number of months in which the loan is scheduled to be repaid. Longer terms may reduce monthly payments but increase long-term risk. Guide notes: In months, e.g., 360 for 30-year loans.
- *Number of Borrowers*: The number of borrowers who are obligated to repay the mortgage note. Multiple borrowers may reduce risk. Guide notes: 99 for missing.
- *Property State*: The two-letter postal abbreviation for the state in which the property is located. Captures regional economic factors. Guide notes: U.S. states only.
- *Occupancy Status*: The classification for the property occupancy status at the time the loan was originated. Investment properties have higher risk. Guide notes: P = Primary Residence, S = Second Home, I = Investment Property, 9 = Unknown.

#### From Performance Data:
- *Loan Age*: The number of scheduled monthly payments that have elapsed since the loan was originated. Helps capture loan seasoning. Guide notes: In months, 999 for missing.
- *Remaining Months to Legal Maturity*: The number of months remaining until the loan is scheduled to mature. Shorter terms may indicate higher risk near maturity. Guide notes: In months, 999 for missing.
- *Current Actual UPB*: The unpaid principal balance of the loan as of the end of the monthly reporting period. Higher UPB may correlate with defaults. Guide notes: Rounded to nearest $1,000, 000000 for zero balance.
- *Current Interest Rate*: The interest rate on the loan as of the end of the monthly reporting period. Adjustments can affect affordability. Guide notes: Reported to the nearest eighth of a percent, 99.999 for missing.
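
The guide notes above list sentinel codes for missing values (for example 9999 for Credit Score and 999 for CLTV and DTI). Left in place, these would be read as real magnitudes by a regression. The following is a minimal sketch of one way to mask them, on a toy frame with made-up rows; the names `SENTINELS` and `mask_sentinels` are illustrative and not part of this notebook.

```python
import pandas as pd

# Sentinel codes copied from the guide notes above; the dict name is illustrative.
SENTINELS = {
    "Credit Score": 9999,
    "Original Combined Loan-to-Value (CLTV)": 999,
    "Original Debt-to-Income (DTI) Ratio": 999,
    "Number of Borrowers": 99,
    "Loan Age": 999,
    "Remaining Months to Legal Maturity": 999,
}

def mask_sentinels(frame: pd.DataFrame, sentinels: dict) -> pd.DataFrame:
    """Return a copy in which each listed column's sentinel code is replaced by NaN."""
    out = frame.copy()
    for col, code in sentinels.items():
        if col in out.columns:
            numeric = pd.to_numeric(out[col], errors="coerce")
            # Series.mask replaces entries where the condition is True with NaN
            out[col] = numeric.mask(numeric == code)
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    "Credit Score": [629, 9999, 770],
    "Original Debt-to-Income (DTI) Ratio": [45, 30, 999],
})
cleaned = mask_sentinels(toy, SENTINELS)
print(cleaned.isna().sum())
```

Whether to impute, drop, or add a missing-indicator column afterwards is a modeling choice this sketch leaves open.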

## Rationale for Selection:
These variables cover borrower creditworthiness, loan affordability, property details, and ongoing performance, which are key drivers of default risk. The target is derived from 'Current Loan Delinquency Status' as a binary flag (1 if the loan is 90 or more days delinquent or coded RA, 0 otherwise).

## Key Identifiers:

- *Loan Sequence Number*: A unique identifier for each loan, critical for merging and tracking across origination and performance data. Guide notes: 12-character alphanumeric, masked for privacy.
- *Original Loan-to-Value (LTV)*: The ratio of the original loan amount to the property value at origination, providing additional context to Original Combined Loan-to-Value (CLTV). Guide notes: Rounded to nearest integer, 999 for missing.
- *First Payment Date*: The date of the first scheduled payment, offering a temporal anchor for loan age and performance. Guide notes: Format YYYYMMDD, parsed as datetime64[ns].
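
As a rough illustration of how the Loan Sequence Number ties the two sources together, the sketch below joins toy origination and performance frames on that key and parses First Payment Date from the YYYYMMDD format noted above. The frames `orig` and `perf` and their values are made up, and the sketch assumes the performance data has already been reduced to one row per loan.

```python
import pandas as pd

# Toy stand-ins for the origination and performance files (values are made up)
orig = pd.DataFrame({
    "Loan Sequence Number": ["F20Q10000001", "F20Q10000002"],
    "Credit Score": [661, 681],
    "First Payment Date": ["20200601", "20200301"],
})
perf = pd.DataFrame({
    "Loan Sequence Number": ["F20Q10000001", "F20Q10000002"],
    "Loan Age": [58, 61],
    "Current Loan Delinquency Status": ["0", "0"],
})

# validate="one_to_one" raises MergeError if either side repeats a loan ID
merged = orig.merge(perf, on="Loan Sequence Number", how="inner", validate="one_to_one")

# Parse the YYYYMMDD strings into a datetime column
merged["First Payment Date"] = pd.to_datetime(merged["First Payment Date"], format="%Y%m%d")
print(merged.dtypes)
```

If the performance data still holds one row per loan per month, `validate="one_to_many"` would be the matching check.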

In [2]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv("/Users/61310joy/Default_Predict/Data/raw/merged_loans_2014_2024.csv")


In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Credit Score,First Payment Date,First Time Homebuyer Flag,Maturity Date,Metropolitan Statistical Area (MSA) Or Metropolitan Division,Mortgage Insurance Percentage (MI %),Number of Units,Occupancy Status,Original Combined Loan-to-Value (CLTV),...,Modification Cost,Step Modification Flag,Deferred Payment Plan,Estimated Loan-to-Value (ELTV),Zero Balance Removal UPB,Delinquent Accrued Interest,Delinquency Due to Disaster,Borrower Assistance Status Code,Current Month Modification Cost,Interest Bearing UPB
0,F14Q10000001,629,2014-05-01,N,2029-04-01,17300.0,0,1,P,77,...,,,,50.0,214370.31,,,,,0.0
1,F14Q10000002,770,2014-04-01,N,2029-03-01,,12,1,P,89,...,,,,999.0,52348.54,,,,,0.0
2,F14Q10000003,674,2014-03-01,N,2029-02-01,,0,1,P,89,...,,,,999.0,118062.84,,,,,0.0
3,F14Q10000004,717,2014-04-01,N,2044-03-01,39300.0,0,1,I,77,...,,,,21.0,,,,,,84852.01
4,F14Q10000005,813,2014-05-01,Y,2044-04-01,19780.0,30,1,P,95,...,,,,999.0,152675.92,,,,,0.0


In [6]:
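# The first unnamed CSV column holds the loan ID; give it its proper name
df = df.rename(columns={"Unnamed: 0": "Loan Sequence Number"})

# Drop loans whose sequence number starts with "F19"
df = df[~df["Loan Sequence Number"].astype(str).str.startswith("F19")]
df.head()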

Unnamed: 0,Loan Sequence Number,Credit Score,First Payment Date,First Time Homebuyer Flag,Maturity Date,Metropolitan Statistical Area (MSA) Or Metropolitan Division,Mortgage Insurance Percentage (MI %),Number of Units,Occupancy Status,Original Combined Loan-to-Value (CLTV),...,Modification Cost,Step Modification Flag,Deferred Payment Plan,Estimated Loan-to-Value (ELTV),Zero Balance Removal UPB,Delinquent Accrued Interest,Delinquency Due to Disaster,Borrower Assistance Status Code,Current Month Modification Cost,Interest Bearing UPB
0,F14Q10000001,629,2014-05-01,N,2029-04-01,17300.0,0,1,P,77,...,,,,50.0,214370.31,,,,,0.0
1,F14Q10000002,770,2014-04-01,N,2029-03-01,,12,1,P,89,...,,,,999.0,52348.54,,,,,0.0
2,F14Q10000003,674,2014-03-01,N,2029-02-01,,0,1,P,89,...,,,,999.0,118062.84,,,,,0.0
3,F14Q10000004,717,2014-04-01,N,2044-03-01,39300.0,0,1,I,77,...,,,,21.0,,,,,,84852.01
4,F14Q10000005,813,2014-05-01,Y,2044-04-01,19780.0,30,1,P,95,...,,,,999.0,152675.92,,,,,0.0


In [7]:
# Selected column names for logistic regression
selected_columns = [
    'Loan Sequence Number',
    'Credit Score',
    'Original Loan-to-Value (LTV)',
    'Original Combined Loan-to-Value (CLTV)',
    'Original Debt-to-Income (DTI) Ratio',
    'Original Interest Rate',
    'Original UPB',
    'Current Actual UPB',
    'Loan Age',
    'Remaining Months to Legal Maturity',
    'Estimated Loan-to-Value (ELTV)',
    'Current Loan Delinquency Status',
    'Number of Borrowers',
    'Property State',
    'Current Deferred UPB',
    'Current Interest Rate',
    'Occupancy Status',
    'Original Loan Term',
    'First Payment Date',
]
df = df[selected_columns]
df.head()

Unnamed: 0,Loan Sequence Number,Credit Score,Original Loan-to-Value (LTV),Original Combined Loan-to-Value (CLTV),Original Debt-to-Income (DTI) Ratio,Original Interest Rate,Original UPB,Current Actual UPB,Loan Age,Remaining Months to Legal Maturity,Estimated Loan-to-Value (ELTV),Current Loan Delinquency Status,Number of Borrowers,Property State,Current Deferred UPB,Current Interest Rate,Occupancy Status,Original Loan Term,First Payment Date
0,F14Q10000001,629,71,77,45,3.875,324000,0.0,74,106,50.0,0,2,KY,0.0,3.875,P,180,2014-05-01
1,F14Q10000002,770,89,89,30,3.375,65000,0.0,40,140,999.0,0,2,NY,0.0,3.375,P,180,2014-04-01
2,F14Q10000003,674,76,89,999,3.375,182000,0.0,75,105,999.0,0,1,MI,0.0,3.375,P,180,2014-03-01
3,F14Q10000004,717,77,77,41,5.25,107000,84852.01,132,228,21.0,0,2,RI,0.0,5.25,I,360,2014-04-01
4,F14Q10000005,813,95,95,32,4.125,165000,0.0,47,313,999.0,3,1,IA,0.0,4.125,P,360,2014-05-01


In [8]:
# Derive binary target 'Default' from 'Current Loan Delinquency Status':
# 1 if the status is 3 or higher (90+ days delinquent) or "RA", else 0.
if "Current Loan Delinquency Status" in df.columns:
    # Non-numeric codes (e.g. "RA") become NaN and are filled with 0 here
    delinquency = pd.to_numeric(df["Current Loan Delinquency Status"], errors="coerce").fillna(0)

    df["Default"] = np.where(
        (delinquency >= 3) | (df["Current Loan Delinquency Status"].astype(str) == "RA"),
        1,
        0,
    )
df.head()

Unnamed: 0,Loan Sequence Number,Credit Score,Original Loan-to-Value (LTV),Original Combined Loan-to-Value (CLTV),Original Debt-to-Income (DTI) Ratio,Original Interest Rate,Original UPB,Current Actual UPB,Loan Age,Remaining Months to Legal Maturity,Estimated Loan-to-Value (ELTV),Current Loan Delinquency Status,Number of Borrowers,Property State,Current Deferred UPB,Current Interest Rate,Occupancy Status,Original Loan Term,First Payment Date,Default
0,F14Q10000001,629,71,77,45,3.875,324000,0.0,74,106,50.0,0,2,KY,0.0,3.875,P,180,2014-05-01,0
1,F14Q10000002,770,89,89,30,3.375,65000,0.0,40,140,999.0,0,2,NY,0.0,3.375,P,180,2014-04-01,0
2,F14Q10000003,674,76,89,999,3.375,182000,0.0,75,105,999.0,0,1,MI,0.0,3.375,P,180,2014-03-01,0
3,F14Q10000004,717,77,77,41,5.25,107000,84852.01,132,228,21.0,0,2,RI,0.0,5.25,I,360,2014-04-01,0
4,F14Q10000005,813,95,95,32,4.125,165000,0.0,47,313,999.0,3,1,IA,0.0,4.125,P,360,2014-05-01,1
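
To make the rule in the cell above concrete, the sketch below applies the same logic to a handful of hand-picked status codes. The function name `derive_default` is illustrative, and "XX" stands in for any unrecognized non-numeric code.

```python
import numpy as np
import pandas as pd

def derive_default(status: pd.Series) -> pd.Series:
    """Flag 1 when the status is 3 or higher (90+ days) or equals 'RA', else 0."""
    # Non-numeric codes are coerced to NaN, then treated as 0 (not delinquent)
    numeric = pd.to_numeric(status, errors="coerce").fillna(0)
    is_ra = status.astype(str) == "RA"
    return pd.Series(np.where((numeric >= 3) | is_ra, 1, 0), index=status.index)

# Hand-picked codes for illustration; "XX" is a hypothetical unrecognized code
codes = pd.Series(["0", "1", "2", "3", "12", "RA", "XX", "999"])
flags = derive_default(codes)
print(dict(zip(codes, flags)))
```

Note that a numeric unknown code such as 999 (listed in the guide notes) satisfies `>= 3` and is therefore flagged as 1 under this rule, so it may be worth masking such codes before deriving the target.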


In [11]:
# Keep loans from the 2020Q1-2024Q4 vintages
# (characters 2-3 of the Loan Sequence Number are the two-digit year)
df = df[df["Loan Sequence Number"].str[1:3].astype(int) >= 20].copy()
df.head()

Unnamed: 0,Loan Sequence Number,Credit Score,Original Loan-to-Value (LTV),Original Combined Loan-to-Value (CLTV),Original Debt-to-Income (DTI) Ratio,Original Interest Rate,Original UPB,Current Actual UPB,Loan Age,Remaining Months to Legal Maturity,Estimated Loan-to-Value (ELTV),Current Loan Delinquency Status,Number of Borrowers,Property State,Current Deferred UPB,Current Interest Rate,Occupancy Status,Original Loan Term,First Payment Date,Default
8753304,F20Q10000001,661,36,36,19,2.875,66000,40665.26,58,122,14.0,0,2,MD,0.0,2.875,P,180,2020-06-01,0
8753305,F20Q10000002,681,95,95,13,5.75,52000,46807.7,61,299,34.0,0,1,KS,0.0,5.75,P,360,2020-03-01,0
8753306,F20Q10000003,775,87,87,29,3.25,248000,0.0,24,336,999.0,0,2,CO,0.0,3.25,P,360,2020-04-01,0
8753307,F20Q10000004,770,65,65,14,3.625,125000,89979.7,61,119,999.0,0,1,MO,0.0,3.625,I,180,2020-03-01,0
8753308,F20Q10000005,791,80,80,33,3.875,58000,47698.13,60,300,20.0,0,1,NY,0.0,3.875,P,360,2020-04-01,0


In [12]:
df.tail()

Unnamed: 0,Loan Sequence Number,Credit Score,Original Loan-to-Value (LTV),Original Combined Loan-to-Value (CLTV),Original Debt-to-Income (DTI) Ratio,Original Interest Rate,Original UPB,Current Actual UPB,Loan Age,Remaining Months to Legal Maturity,Estimated Loan-to-Value (ELTV),Current Loan Delinquency Status,Number of Borrowers,Property State,Current Deferred UPB,Current Interest Rate,Occupancy Status,Original Loan Term,First Payment Date,Default
20323114,F24Q40281897,799,90,90,28,5.375,331000,325000.0,4,356,89.0,0,2,TN,0.0,5.375,P,360,2024-12-01,0
20323115,F24Q40281898,699,95,95,43,6.99,214000,213000.0,4,356,93.0,0,1,OH,0.0,6.99,P,360,2024-12-01,0
20323116,F24Q40281899,781,91,91,40,6.375,280000,279000.0,4,356,88.0,0,1,NY,0.0,6.375,P,360,2024-12-01,0
20323117,F24Q40281900,724,90,90,50,5.5,1058000,1048000.0,4,356,87.0,0,1,CA,0.0,5.5,P,360,2024-12-01,0
20323118,F24Q40281901,779,90,90,27,5.625,366000,363000.0,4,356,91.0,0,1,TN,0.0,5.625,P,360,2024-12-01,0
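
The year filter above relies on the layout of the Loan Sequence Number, which in the outputs shown here looks like `F<YY>Q<n>` followed by a running number. A minimal sketch of pulling year and quarter out explicitly is given below; the column names `orig_year` and `orig_quarter` are illustrative, and adding 2000 assumes every vintage falls in this century.

```python
import pandas as pd

# A few IDs in the same layout as the outputs above (the F19 one is made up)
ids = pd.Series(
    ["F14Q10000001", "F19Q30000042", "F20Q10000001", "F24Q40281901"],
    name="Loan Sequence Number",
)

# Named groups become column names; to_numeric turns the captured strings into integers
parts = ids.str.extract(r"^F(?P<orig_year>\d{2})Q(?P<orig_quarter>[1-4])").apply(pd.to_numeric)
parts["orig_year"] = parts["orig_year"] + 2000  # assumption: all vintages are 2000 or later

# Same selection as the cell above, expressed on the parsed year
keep = parts["orig_year"] >= 2020
print(ids[keep].tolist())
```

Keeping the parsed year and quarter as columns also makes it easy to check row counts per vintage after filtering.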


In [9]:
df["Default"].value_counts()

Default
0    18463131
1       78844
Name: count, dtype: int64
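
The counts above show that defaults make up well under 1% of rows, which matters for the logistic regression these columns are selected for. As a sketch, the snippet below computes the positive rate and the inverse-frequency class weights `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for `class_weight="balanced"`. It works from the printed counts only, and whether to reweight at all is a modeling choice left open here.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({0: 18_463_131, 1: 78_844}, name="count")

# Share of rows flagged as default
positive_rate = counts[1] / counts.sum()

# Inverse-frequency weights: n_samples / (n_classes * class_count)
weights = counts.sum() / (len(counts) * counts)

print(f"positive rate: {positive_rate:.4%}")
print(weights.round(2).to_dict())
```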

In [13]:
output_fp = "/Users/61310joy/Default_Predict/Data/regression_data/regression.csv"
df.to_csv(output_fp, index=False)