### Stage 2. Feature engineering for the credit scoring model.

In [156]:
%reload_ext kedro.ipython

In [119]:
df_intermediate = catalog.load("intermediate_data")
columns_list_intermediate = df_intermediate.columns.tolist()
print(columns_list_intermediate)

['loan_amnt', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'purpose', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'mths_since_last_major_derog', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'avg_cur_bal', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_tl_op_past_12m', 'pub_rec_bankruptcies', 'tax_liens', 'revol_bal_joint', 'sec_app_earliest_cr_line', 'hardship_flag', 'hardship_dpd', 'hardship_loan_status', 'is_joint_app', 'loan_status_binary']


### Table of raw and pre-organized fields after pre-processing stage

| Field                                              | Keep/Remove        | Rationale                                                                  |
|----------------------------------------------------|--------------------|----------------------------------------------------------------------------|
| ✅purpose, home_ownership                          | Keep               | Important contextual factors                                               |
| ✅hardship_flag, hardship_loan_status, hardship_dpd | Combine ?         | Considering combining into single hardship indicator                       |
| ✅emp_length, earliest_cr_line                     | Keep               | Important stability indicators                                             |
| ✅loan_amnt, funded_amnt, installment              | Keep               | Core loan characteristics that directly impact risk assessment             |
| ✅fico_range_low, fico_range_high                  | Keep               | Critical credit quality indicators                                         |
| ✅term, int_rate, grade, sub_grade                 | Keep               | Core loan terms and risk assessment                                        |
| ✅delinq_2yrs, mths_since_last_delinq              | Keep               | Key delinquency metrics                                                    |
| ✅ open_acc, total_acc, open_acc_6m                | Keep               | Account management indicators                                              |
| ✅is_joint_app                                     | Keep (!)           | Important application characteristic                                       |
| ✅inq_last_6mths                                   | Keep               | Recent credit seeking behavior                                             |
| ✅pub_rec, mths_since_last_record                  | Keep               | Public record items important for risk                                     |
| ✅mths_since_last_major_derog                      | Keep               | Important derogatory indicator                                             |
| ✅annual_inc, annual_inc_joint                     | Keep one           | Individual income for individual and joint applications                    |
| ✅dti, dti_joint                                   | Keep one           | Same logic as income                                                       |
| ✅revol_util, revol_bal, revol_bal_joint           | Keep one           | Keep individual revolving data for non-joint apps                          |
| ✅initial_list_status                              | Remove?            | Less relevant for risk assessment                                          |
| ✅sec_app_earliest_cr_line                         | Remove if non-joint | Only relevant for joint applications                                       |
| ✅verification_status                              | Keep               | Important for fraud/risk detection — indicates whether income was verified |
| ✅issue_d                                          | Keep               | Useful for time-based feature engineering and cohort analysis              |
| ✅ application_type                                | Delete             | Processed to `is_joint_app`                                                |
| ✅verification_status_joint                        | Keep               | Relevant only for joint apps — consider conditional usage                  |
| ✅ acc_now_delinq                                  | Consider Removing  | Often low variance, but may capture recent delinquencies                   |
| ✅tot_coll_amt                                     | Remove             | Often sparse or zero — limited signal                                      |
| ✅tot_cur_bal                                      | Consider Keeping   | Could reflect overall financial health                                     |
| ✅open_act_il                                      | Remove             | Duplicates of other installment account indicators                         |
| ✅open_il_12m                                      | Remove for now     | Reflects recent borrowing behavior                                         |
| ✅avg_cur_bal                                      | Keep               | Captures financial leverage & liquidity                                    |
| ✅mths_since_recent_bc_dlq                         | Remove             | Highly sparse and inconsistent                                             |
| ✅mths_since_recent_inq                            | Keep               | Shows recent credit-seeking behavior                                       |
| ✅mths_since_recent_revol_delinq                   | Remove             | Highly sparse — consider only if well-populated                            |
| ✅num_tl_op_past_12m                               | Keep               | Indicates recent credit activity — potential risk signal                   |
| ✅pub_rec_bankruptcies                             | Keep               | Key indicator of prior financial distress                                  |
| ✅tax_liens                                        | Remove             | Usually low variance, low signal                                           |
| ✅loan_status_binary                                | Keep (Target)      | This is our main target variable — used to define 'bad' loans              |


### Engineering the `home_ownership` fields with `create_home_ownership_ordinal()` in /feature_engineering/nodes.py

**Functionality:**
Creates an ordinal-encoded feature `home_ownership_ordinal` from `home_ownership`, reflecting risk levels.

**Input:**
- `home_ownership` — categorical field with values like `"own"`, `"mortgage"`, `"rent"`, `"other"`

**Output:**
- `home_ownership_ordinal` — integer feature (0 = safest, 3 = riskiest)

**New field(s):**
- `home_ownership_ordinal` — used in both tree-based and regression models

**Drop field(s):**
- `home_ownership` — used only to create `home_ownership_ordinal`
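
A minimal sketch of the ordinal encoding described above. The function name, the risk ordering in `HOME_OWNERSHIP_RISK`, and the choice to send unknown or missing values to the riskiest bucket are assumptions for illustration; the actual logic lives in `create_home_ownership_ordinal()`.

```python
import pandas as pd

# Hypothetical risk ordering (0 = safest, 3 = riskiest).
HOME_OWNERSHIP_RISK = {"own": 0, "mortgage": 1, "rent": 2, "other": 3}


def create_home_ownership_ordinal_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Add `home_ownership_ordinal` without modifying the input frame."""
    out = df.copy()
    cleaned = out["home_ownership"].str.strip().str.lower()
    # Assumption: unmapped or missing categories fall into the riskiest bucket.
    out["home_ownership_ordinal"] = (
        cleaned.map(HOME_OWNERSHIP_RISK).fillna(3).astype(int)
    )
    return out


demo = pd.DataFrame({"home_ownership": ["own", "RENT", None, "mortgage"]})
result = create_home_ownership_ordinal_sketch(demo)
print(result)
```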

In [120]:
cols_for_reg = ["loan_status_binary"]  # columns for the regression-based models
cols_for_tree = ["loan_status_binary"]  # columns for the tree-based models
drop_cols = ["home_ownership", "tot_coll_amt", "open_act_il", "open_il_12m", "mths_since_recent_bc_dlq", "tax_liens"] # columns to drop entirely

# print(f"Columns for the regression-based models: {cols_for_reg}")
# print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop: {drop_cols}")

Columns to drop: ['home_ownership', 'tot_coll_amt', 'open_act_il', 'open_il_12m', 'mths_since_recent_bc_dlq', 'tax_liens']


### Engineering the `purpose` field with `encode_purpose_field()` in /feature_engineering/nodes.py

**Functionality:**
Generates one-hot encoded features from a cleaned version of the `purpose` field (`purpose_cleaned`), replacing rare categories with `"other"`.
Useful primarily for **tree-based models**, which can leverage one-hot encoded categorical splits.

**Input:**  - `purpose` — a preprocessed categorical field (lowercase, space-normalized)

**Intermediate Output:**  - `purpose_cleaned` — categorical field where rare categories are grouped as `"other"`

**Output:**
- One-hot encoded fields:
  - `purpose_<value>` for each cleaned purpose category
  - Includes `purpose_other` and `purpose_nan`

**New field(s):**
- `purpose_cleaned` — used to generate the one-hot columns
- `purpose_*` — one-hot encoded columns, **tree model-specific**

**Drop field(s):**
- `purpose` — source for `purpose_cleaned`; drop it only after `create_loan_purpose_risk_features()` (below) has also consumed it
- `purpose_cleaned` — optional to drop after one-hot encoding, unless used later

> 💡 Tip: Keep `purpose_cleaned` if any downstream analysis requires interpretable categorical grouping. Otherwise, it can be dropped post one-hot encoding.
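
The grouping and encoding steps above could look roughly like this. The function name and the `min_share` rarity threshold are hypothetical; the sketch relies on `pd.get_dummies(..., dummy_na=True)`, which is what yields a `purpose_nan` column.

```python
import numpy as np
import pandas as pd


def encode_purpose_field_sketch(df: pd.DataFrame, min_share: float = 0.01) -> pd.DataFrame:
    """Group rare purposes into "other", then one-hot encode the result."""
    out = df.copy()
    # Share of each category among non-missing rows.
    shares = out["purpose"].value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Rare categories become "other"; missing values stay missing.
    out["purpose_cleaned"] = out["purpose"].where(~out["purpose"].isin(rare), "other")
    dummies = pd.get_dummies(
        out["purpose_cleaned"], prefix="purpose", dummy_na=True, dtype=int
    )
    return pd.concat([out, dummies], axis=1)


demo = pd.DataFrame(
    {"purpose": ["car"] * 5 + ["medical"] * 4 + ["wedding", np.nan]}
)
# A high threshold is used here only so the tiny demo has a rare category.
encoded = encode_purpose_field_sketch(demo, min_share=0.2)
print(encoded.filter(like="purpose_").columns.tolist())
```

Note that `purpose_other` only appears when at least one category falls below the threshold, whereas `purpose_nan` is always created when `dummy_na=True`.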


In [121]:

cols_for_tree = list(set(cols_for_tree +  ["purpose_other", "purpose_nan"]))

drop_cols = list(set(drop_cols + ["purpose_cleaned"]))

# print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the tree-based models: ['purpose_other', 'loan_status_binary', 'purpose_nan']
Columns to drop ['open_act_il', 'tax_liens', 'tot_coll_amt', 'mths_since_recent_bc_dlq', 'open_il_12m', 'home_ownership', 'purpose_cleaned']


### Engineering the `purpose` fields with `create_loan_purpose_risk_features()` in /feature_engineering/nodes.py

**Functionality:**
Transforms the `purpose` field into multiple risk-based features using business logic (risk mappings), producing derived features useful for both tree-based and regression models.

**Input:**
- `purpose` — a preprocessed categorical feature (e.g., `"debt consolidation"`, `"medical"`)

**Outputs:**
- `purpose_risk_score` — integer score from 1 (low risk) to 4 (very high risk); regression + tree
- `purpose_risk_category` — categorical label (low, medium, etc.); tree only (used in one-hot encoding or tree splits)
- `purpose_high_risk` — binary indicator (1 if `purpose_risk_score` ≥ 3); tree only

**New field(s):**
- `purpose_risk_score` — used in tree-based and regression models
- `purpose_risk_category`, `purpose_high_risk` — used in tree-based models

**Drop field(s):**
- `purpose` — used to generate all three output columns above
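
As an illustration of the three outputs, here is a sketch with a made-up risk mapping. `PURPOSE_RISK`, `RISK_LABELS`, the default score for unmapped purposes, and the function name are all assumptions; only the 1 to 4 score range and the `>= 3` threshold come from the description above.

```python
import pandas as pd

# Hypothetical business mapping, for illustration only.
PURPOSE_RISK = {
    "car": 1,
    "home improvement": 2,
    "debt consolidation": 2,
    "medical": 3,
    "small business": 4,
}
RISK_LABELS = {1: "low", 2: "medium", 3: "high", 4: "very_high"}


def create_loan_purpose_risk_features_sketch(
    df: pd.DataFrame, default_score: int = 2
) -> pd.DataFrame:
    """Derive score, category and high-risk flag from `purpose`."""
    out = df.copy()
    # Assumption: purposes missing from the mapping get a middle-of-the-road score.
    score = out["purpose"].map(PURPOSE_RISK).fillna(default_score).astype(int)
    out["purpose_risk_score"] = score
    out["purpose_risk_category"] = score.map(RISK_LABELS)
    out["purpose_high_risk"] = (score >= 3).astype(int)
    return out


demo = pd.DataFrame({"purpose": ["car", "medical", "small business", "vacation"]})
risk = create_loan_purpose_risk_features_sketch(demo)
print(risk)
```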

In [122]:
cols_for_reg = list(set(cols_for_reg +  ["home_ownership_ordinal", "purpose_risk_score"]))

cols_for_tree = list(set(cols_for_tree + ["home_ownership_ordinal", "purpose_risk_score", "purpose_risk_category", "purpose_high_risk"]))

drop_cols = list(set(drop_cols + ["purpose"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'home_ownership_ordinal', 'loan_status_binary']
Columns for the tree-based models: ['purpose_other', 'purpose_risk_score', 'home_ownership_ordinal', 'purpose_risk_category', 'purpose_nan', 'purpose_high_risk', 'loan_status_binary']
Columns to drop ['open_act_il', 'tax_liens', 'tot_coll_amt', 'mths_since_recent_bc_dlq', 'purpose', 'open_il_12m', 'home_ownership', 'purpose_cleaned']


### Engineering the `hardship_flag` field with `create_has_hardship_flag()` in /feature_engineering/nodes.py

**Functionality:**
Converts the categorical `hardship_flag` column (values like `'y'` / `'n'`) into a binary numeric indicator `has_hardship`.

**Input:**
- `hardship_flag` — a string column indicating if the borrower had a hardship (values like `"y"`, `"n"`, case-insensitive)

**Output:**
- `has_hardship` — binary indicator:
  - `1` if hardship_flag is `"y"`
  - `0` otherwise

**New field(s):**
- `has_hardship` — **used in both tree-based and regression models**

**Drop field(s):**
- `hardship_flag` — only used for generating `has_hardship`, can be dropped after this node
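
A small sketch of this conversion, assuming the flag is stored as text and that whitespace should be ignored (the stripping step and the function name are assumptions):

```python
import pandas as pd


def create_has_hardship_flag_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Map `hardship_flag` ("y"/"n", any case) to a 0/1 `has_hardship` column."""
    out = df.copy()
    flag = out["hardship_flag"].str.strip().str.lower()
    # Missing values compare as False, so they end up as 0.
    out["has_hardship"] = (flag == "y").astype(int)
    return out


demo = pd.DataFrame({"hardship_flag": ["Y", "n", None, " y "]})
flags = create_has_hardship_flag_sketch(demo)
print(flags)
```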


In [123]:
cols_for_reg = list(set(cols_for_reg + ["has_hardship"]))

cols_for_tree = list(set(cols_for_tree + ["has_hardship"]))

drop_cols = list(set(drop_cols + ["hardship_flag"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['has_hardship', 'purpose_risk_score', 'loan_status_binary', 'home_ownership_ordinal']
Columns for the tree-based models: ['purpose_other', 'purpose_risk_score', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'loan_status_binary']
Columns to drop ['open_act_il', 'tax_liens', 'tot_coll_amt', 'hardship_flag', 'mths_since_recent_bc_dlq', 'purpose', 'open_il_12m', 'home_ownership', 'purpose_cleaned']


### Engineering the `hardship_loan_status` field with `create_was_late_before_hardship()` in /feature_engineering/nodes.py

**Functionality:**
Creates a binary feature that indicates whether a borrower entered hardship while already in a **late loan status**.

**Input:**
- `hardship_loan_status` — string field indicating loan status at time of hardship (e.g., `"late (31-120 days)"`, `"fully paid"`, etc.)

**Output:**
- `was_late_before_hardship` — binary indicator:
  - `1` if the loan status at hardship contains the word `"late"` (case-insensitive)
  - `0` otherwise or if field is missing

**New field(s):**
- `was_late_before_hardship` — **used in both tree-based and regression models**

**Drop field(s):**
- `hardship_loan_status` — used only to derive `was_late_before_hardship`, can be dropped after this node
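
The substring rule above maps naturally onto `Series.str.contains`. This is a sketch under that assumption, with a hypothetical function name:

```python
import pandas as pd


def create_was_late_before_hardship_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows whose hardship loan status mentions "late"."""
    out = df.copy()
    is_late = out["hardship_loan_status"].str.contains("late", case=False, na=False)
    # na=False turns missing statuses into False, i.e. 0 after the cast.
    out["was_late_before_hardship"] = is_late.astype(int)
    return out


demo = pd.DataFrame(
    {"hardship_loan_status": ["Late (31-120 days)", "fully paid", None, "late (16-30 days)"]}
)
late = create_was_late_before_hardship_sketch(demo)
print(late)
```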


In [124]:
cols_for_reg = list(set(cols_for_reg + ["was_late_before_hardship"]))

cols_for_tree = list(set(cols_for_tree + ["was_late_before_hardship"]))

drop_cols = list(set(drop_cols + ["hardship_loan_status"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'has_hardship', 'home_ownership_ordinal', 'was_late_before_hardship', 'loan_status_binary']
Columns for the tree-based models: ['purpose_other', 'purpose_risk_score', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'was_late_before_hardship', 'loan_status_binary']
Columns to drop ['open_act_il', 'tax_liens', 'tot_coll_amt', 'hardship_flag', 'mths_since_recent_bc_dlq', 'purpose', 'open_il_12m', 'home_ownership', 'hardship_loan_status', 'purpose_cleaned']


### Engineering the `hardship_dpd` field with `create_hardship_features()` in /feature_engineering/nodes.py

**Functionality:**
Fills missing values in the `hardship_dpd` (days past due during hardship) field and creates a cleaned version for modeling.

**Input:**
- `hardship_dpd` — numeric field representing how many days the loan was past due at the time of hardship (can be NaN)

**Output:**
- `hardship_dpd_filled` — numeric field with missing values filled as `0`

**New field(s):**
- `hardship_dpd_filled` — **used in both tree-based and regression models**

**Drop field(s):**
- `hardship_dpd` — original raw column used only to create the cleaned version, can be dropped after this node


In [125]:
cols_for_reg = list(set(cols_for_reg + ["hardship_dpd_filled"]))

cols_for_tree = list(set(cols_for_tree + ["hardship_dpd_filled"]))

drop_cols = list(set(drop_cols + ["hardship_dpd"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'has_hardship', 'home_ownership_ordinal', 'was_late_before_hardship', 'hardship_dpd_filled', 'loan_status_binary']
Columns for the tree-based models: ['purpose_other', 'purpose_risk_score', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'was_late_before_hardship', 'hardship_dpd_filled', 'loan_status_binary']
Columns to drop ['hardship_dpd', 'open_act_il', 'tax_liens', 'tot_coll_amt', 'hardship_flag', 'mths_since_recent_bc_dlq', 'purpose', 'open_il_12m', 'home_ownership', 'hardship_loan_status', 'purpose_cleaned']


### Engineering the `emp_length` field with `engineer_emp_length_features()` in /feature_engineering/nodes.py

**Functionality:**
Transforms the text-based `emp_length` field into numeric versions for use in both tree-based and regression models. Handles missing values separately depending on the model type.

**Input:**
- `emp_length` — a categorical field representing the number of years employed (e.g., `"10+ years"`, `"< 1 year"`)

**Output:**
- `emp_length_clean` — numeric base transformation (used internally)
- `emp_length_clean_tree` — tree-model-specific version with missing filled as `-1`
- `emp_length_clean_reg` — regression-model-specific version with missing filled as **median**

**New field(s):**
- `emp_length_clean_tree` — **used in tree-based models**
- `emp_length_clean_reg` — **used in regression models**

**Drop field(s):**
- `emp_length` — original text field, can be dropped after this node
- `emp_length_clean` — intermediate transformation, not directly used downstream
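
One way the parsing and the two imputation strategies might be implemented. Treating `"< 1 year"` as `0` and `"10+ years"` as `10` is an assumption, as is the function name.

```python
import pandas as pd


def engineer_emp_length_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Parse `emp_length` text into years and impute per model family."""
    out = df.copy()
    text = out["emp_length"]
    # Pull the first run of digits: "10+ years" -> 10, "3 years" -> 3.
    years = pd.to_numeric(text.str.extract(r"(\d+)", expand=False), errors="coerce")
    # Assumption: "< 1 year" counts as 0 full years.
    years = years.mask(text.str.contains("<", regex=False, na=False), 0)
    out["emp_length_clean"] = years
    out["emp_length_clean_tree"] = years.fillna(-1)  # sentinel for tree models
    out["emp_length_clean_reg"] = years.fillna(years.median())  # median for regression
    return out


demo = pd.DataFrame({"emp_length": ["10+ years", "< 1 year", "3 years", None]})
emp = create = engineer_emp_length_features_sketch(demo)
print(emp)
```

The `-1` sentinel lets a tree split isolate missing values, while median imputation avoids introducing an artificial extreme value into a linear model.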


In [126]:
cols_for_reg = list(set(cols_for_reg + ["emp_length_clean_reg"]))

cols_for_tree = list(set(cols_for_tree + ["emp_length_clean_tree"]))

drop_cols = list(set(drop_cols + ["emp_length", "emp_length_clean"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'has_hardship', 'home_ownership_ordinal', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'loan_status_binary']
Columns for the tree-based models: ['emp_length_clean_tree', 'purpose_other', 'purpose_risk_score', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'was_late_before_hardship', 'hardship_dpd_filled', 'loan_status_binary']
Columns to drop ['hardship_dpd', 'open_act_il', 'tax_liens', 'emp_length_clean', 'tot_coll_amt', 'hardship_flag', 'mths_since_recent_bc_dlq', 'emp_length', 'purpose', 'open_il_12m', 'home_ownership', 'hardship_loan_status', 'purpose_cleaned']


### Engineering the `earliest_cr_line` fields with `create_credit_age_feature()` in /feature_engineering/nodes.py

**Functionality:**
Combines individual and joint applicants’ credit line information into a unified field and calculates credit history length (age) in months.

**Inputs:**
- `earliest_cr_line` — datetime field for the primary applicant’s earliest credit line
- `sec_app_earliest_cr_line` — datetime for secondary applicant (if joint)
- `issue_d` — loan issue date
- `is_joint_app` — binary flag indicating joint application

**Outputs:**
- `earliest_cr_line_final` — unified field (uses secondary applicant’s date if applicable)
- `credit_age_months` — numeric field representing months between credit line and loan issue

**New field(s):**
- `earliest_cr_line_final` — used as temporal reference, especially for joint cases
- `credit_age_months` — used in **both** tree-based and regression models

**Drop field(s):**
- `earliest_cr_line` — superseded by `earliest_cr_line_final`
- `sec_app_earliest_cr_line` — no longer needed after merge
- `issue_d` — also used by other features, so **retain** it until those are built, even though the cell below already lists it in `drop_cols`
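
A sketch of the merge and the month difference, assuming all three date columns are already `datetime64` and that the secondary date is used only when the application is joint and the date is present (that rule and the function name are assumptions):

```python
import pandas as pd


def create_credit_age_feature_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Unify the earliest credit line and compute credit age in months."""
    out = df.copy()
    use_secondary = out["is_joint_app"].eq(1) & out["sec_app_earliest_cr_line"].notna()
    out["earliest_cr_line_final"] = out["earliest_cr_line"].where(
        ~use_secondary, out["sec_app_earliest_cr_line"]
    )
    issue, first = out["issue_d"], out["earliest_cr_line_final"]
    # Whole calendar months between the first credit line and loan issue.
    out["credit_age_months"] = (issue.dt.year - first.dt.year) * 12 + (
        issue.dt.month - first.dt.month
    )
    return out


demo = pd.DataFrame(
    {
        "earliest_cr_line": pd.to_datetime(["2005-01-01", "2010-06-01"]),
        "sec_app_earliest_cr_line": pd.to_datetime([None, "2000-06-01"]),
        "issue_d": pd.to_datetime(["2015-01-01", "2015-12-01"]),
        "is_joint_app": [0, 1],
    }
)
aged = create_credit_age_feature_sketch(demo)
print(aged[["earliest_cr_line_final", "credit_age_months"]])
```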


In [127]:
cols_for_reg = list(set(cols_for_reg + ["credit_age_months"]))

cols_for_tree = list(set(cols_for_tree + ["credit_age_months"]))

drop_cols = list(set(drop_cols + ["earliest_cr_line", "sec_app_earliest_cr_line", "earliest_cr_line_final", "issue_d"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'credit_age_months', 'has_hardship', 'home_ownership_ordinal', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'loan_status_binary']
Columns for the tree-based models: ['emp_length_clean_tree', 'purpose_other', 'purpose_risk_score', 'credit_age_months', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'was_late_before_hardship', 'hardship_dpd_filled', 'loan_status_binary']
Columns to drop ['hardship_dpd', 'open_act_il', 'sec_app_earliest_cr_line', 'tax_liens', 'emp_length_clean', 'tot_coll_amt', 'earliest_cr_line', 'hardship_flag', 'mths_since_recent_bc_dlq', 'purpose_cleaned', 'purpose', 'issue_d', 'open_il_12m', 'home_ownership', 'hardship_loan_status', 'emp_length', 'earliest_cr_line_final']


### Engineering the `loan_amnt` and `installment` fields with `create_loan_amount_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates ratio-based and binned features that capture the relationship between loan amount and monthly installment. These features reflect affordability and general loan size.

**Input Fields:**
- `loan_amnt` – total loan amount
- `installment` – monthly payment amount

**Output Fields:**
- `loan_to_installment_ratio` – ratio of loan amount to installment payment, clipped to 1st–99th percentile. Used in **tree-based and regression** models
- `loan_amount_band` – decile-based binning of `loan_amnt` (0–9), fallback to fixed bins if needed. Used in **tree-based** models (categorical splits)

**Drop Field(s):**
- _None dropped at this stage_
  `loan_amnt` and `installment` are still used in:
  - `income_to_loan_reg`
  - `payment_to_income`
  - Downstream validation and feature profiling

⚠️ _Drop `loan_amnt` and `installment` only after all dependent features are created._
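
The ratio clipping and decile banding could be sketched as follows. The zero-installment guard, the equal-width fallback via `pd.cut`, and the function name are assumptions; the source only states that a fixed-bin fallback exists.

```python
import numpy as np
import pandas as pd


def create_loan_amount_features_sketch(df: pd.DataFrame, n_bands: int = 10) -> pd.DataFrame:
    """Add a percentile-clipped loan/installment ratio and a loan size band."""
    out = df.copy()
    # A zero installment would divide by zero, so treat it as missing.
    ratio = out["loan_amnt"] / out["installment"].replace(0, np.nan)
    lower, upper = ratio.quantile([0.01, 0.99])
    out["loan_to_installment_ratio"] = ratio.clip(lower=lower, upper=upper)
    try:
        # Decile bands labelled 0..9; duplicate edges are merged.
        bands = pd.qcut(out["loan_amnt"], q=n_bands, labels=False, duplicates="drop")
    except ValueError:
        # Hypothetical fallback: equal-width bins when quantile binning fails.
        bands = pd.cut(out["loan_amnt"], bins=n_bands, labels=False)
    out["loan_amount_band"] = bands
    return out


amounts = np.arange(1000, 21000, 1000)
demo = pd.DataFrame(
    {
        "loan_amnt": amounts,
        "installment": np.where(np.arange(20) % 2 == 0, amounts / 30.0, amounts / 50.0),
    }
)
features = create_loan_amount_features_sketch(demo)
print(features.head())
```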


### The `funded_amnt` field will be dropped because it is correlated with `loan_amnt`

In [128]:
cols_for_reg = list(set(cols_for_reg + ["loan_to_installment_ratio"]))

cols_for_tree = list(set(cols_for_tree + ["loan_to_installment_ratio", "loan_amount_band"]))

drop_cols = list(set(drop_cols + ["funded_amnt"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'credit_age_months', 'has_hardship', 'home_ownership_ordinal', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'loan_status_binary']
Columns for the tree-based models: ['emp_length_clean_tree', 'purpose_other', 'purpose_risk_score', 'credit_age_months', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'loan_amount_band', 'loan_status_binary']
Columns to drop ['hardship_dpd', 'open_act_il', 'funded_amnt', 'sec_app_earliest_cr_line', 'tax_liens', 'emp_length_clean', 'tot_coll_amt', 'earliest_cr_line', 'hardship_flag', 'mths_since_recent_bc_dlq', 'emp_length', 'purpose', 'issue_d', 'open_il_12m', 'home_ownership', 'hardship_loan_status', 'purpose_cleaned', 'earliest_cr_line_final']


### Engineering the `fico` fields with `create_fico_score_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates standardized and model-friendly FICO score features by averaging FICO bounds and creating binned risk indicators.

**Inputs:**
- `fico_range_low`: Lower end of the borrower's estimated FICO range
- `fico_range_high`: Upper end of the borrower's estimated FICO range

**Output Fields:**
- `fico_average`: Mean of `fico_range_low` and `fico_range_high`, clipped to [300–850]
  - Used in **tree-based** and **regression** models
- `fico_risk_band`: Categorical score band (A–F), using either quantiles or fallback fixed bins
  - Used in **tree-based** models only

**New Field(s):**
- `fico_average` —  Tree & Regression
- `fico_risk_band` —  Tree only

**Drop Field(s):**
- `fico_range_low` —  Drop after creating `fico_average`
- `fico_range_high` —  Drop after creating `fico_average`
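
A sketch of the averaging and banding logic. The six quantile bands, the fixed fallback edges, the convention that `A` is the highest band, and the function name are assumptions made for illustration.

```python
import pandas as pd


def create_fico_score_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Average the FICO bounds and bucket the result into bands A-F."""
    out = df.copy()
    avg = out[["fico_range_low", "fico_range_high"]].mean(axis=1).clip(300, 850)
    out["fico_average"] = avg
    labels = ["F", "E", "D", "C", "B", "A"]  # lowest scores first
    try:
        bands = pd.qcut(avg, q=6, labels=labels)
    except ValueError:
        # Quantile edges were not unique; fall back to hypothetical fixed edges.
        bands = pd.cut(
            avg,
            bins=[300, 580, 620, 660, 700, 740, 850],
            labels=labels,
            include_lowest=True,
        )
    out["fico_risk_band"] = bands
    return out


demo = pd.DataFrame({"fico_range_low": [660, 700, 740, 780, 620, 580]})
demo["fico_range_high"] = demo["fico_range_low"] + 4
fico = create_fico_score_features_sketch(demo)
print(fico)
```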


In [129]:
cols_for_reg = list(set(cols_for_reg + ["fico_average"]))

cols_for_tree = list(set(cols_for_tree + ["fico_average", "fico_risk_band"]))

drop_cols = list(set(drop_cols + ["fico_range_low", "fico_range_high"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'credit_age_months', 'has_hardship', 'home_ownership_ordinal', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'fico_average', 'emp_length_clean_reg', 'loan_status_binary']
Columns for the tree-based models: ['emp_length_clean_tree', 'purpose_other', 'purpose_risk_score', 'credit_age_months', 'fico_risk_band', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'fico_average', 'loan_amount_band', 'loan_status_binary']
Columns to drop ['tax_liens', 'mths_since_recent_bc_dlq', 'open_il_12m', 'home_ownership', 'fico_range_high', 'open_act_il', 'emp_length_clean', 'tot_coll_amt', 'purpose', 'earliest_cr_line_final', 'hardship_dpd', 'earliest_cr_line', 'issue_d', 'hardship_loan_status', 'purpose_cleaned', 'emp_length', 'fico_range_low', 'sec_app_earliest_cr_line', 'ha

### Engineering the `term` fields with `create_term_model_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates model-specific features from the `term` column (e.g., 36 or 60 months) for both tree-based and regression models.

**Input:**
- `term` — numeric loan term in months (already cleaned during preprocessing)
- `int_rate` — interest rate (used in the term-rate interaction feature)

**Outputs:**
- `term_tree` — Categorical feature used in tree-based models -- Tree-only
- `term_normalized_reg` — Scaled feature (0–1) for regression models  -- Regression-only
- `term_rate_interaction_reg` — Interaction between term and interest rate -- Regression-only

**New field(s):**
- `term_tree`
- `term_normalized_reg`
- `term_rate_interaction_reg`

**Drop field(s):**
- `term` — can be dropped **after** derived features are created, as it is replaced by model-specific versions.
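
The three term features might be derived like this. Min-max scaling for `term_normalized_reg` and multiplying the scaled term by `int_rate` for the interaction are assumptions, as is the function name; the source only states a 0 to 1 scale and a term by rate interaction.

```python
import pandas as pd


def create_term_model_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Build tree and regression variants of the loan term."""
    out = df.copy()
    term = out["term"].astype(float)
    # Tree models get the raw term as a categorical.
    out["term_tree"] = out["term"].astype(int).astype("category")
    # Regression models get a min-max scaled term (0.0 if all terms are equal).
    span = term.max() - term.min()
    out["term_normalized_reg"] = (term - term.min()) / span if span > 0 else 0.0
    out["term_rate_interaction_reg"] = out["term_normalized_reg"] * out["int_rate"]
    return out


demo = pd.DataFrame({"term": [36, 60, 36], "int_rate": [0.10, 0.20, 0.15]})
terms = create_term_model_features_sketch(demo)
print(terms)
```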


In [130]:
cols_for_reg = list(set(cols_for_reg + ["term_normalized_reg", "term_rate_interaction_reg"]))

cols_for_tree = list(set(cols_for_tree + ["term_tree"]))

drop_cols = list(set(drop_cols + ["term"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['purpose_risk_score', 'credit_age_months', 'has_hardship', 'home_ownership_ordinal', 'term_rate_interaction_reg', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'fico_average', 'emp_length_clean_reg', 'term_normalized_reg', 'loan_status_binary']
Columns for the tree-based models: ['emp_length_clean_tree', 'purpose_other', 'purpose_risk_score', 'fico_risk_band', 'credit_age_months', 'has_hardship', 'purpose_risk_category', 'home_ownership_ordinal', 'purpose_nan', 'purpose_high_risk', 'term_tree', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'fico_average', 'loan_amount_band', 'loan_status_binary']
Columns to drop ['tax_liens', 'term', 'mths_since_recent_bc_dlq', 'open_il_12m', 'home_ownership', 'fico_range_high', 'open_act_il', 'emp_length_clean', 'tot_coll_amt', 'purpose', 'earliest_cr_line_final', 'hardship_dpd', 'earliest_cr_line', 'issue_d', 'hardship_loan_status', 'emp_leng

### Engineering the `int_rate`, `grade`, and `sub_grade` fields with `encode_interest_and_grade_fields()` in `/feature_engineering/nodes.py`

**Functionality:**
Transforms key LendingClub classification and interest rate fields into numerical features for modeling.

**Inputs:**
- `int_rate` — interest rate in percentage (e.g., 13.49)
- `grade` — letter grade from A to G (already lowercased in preprocessing)
- `sub_grade` — finer-grained version of `grade`, from A1 to G5 (e.g., "b3", "f2")

**Outputs:**
- `int_rate` — converted from percentage to decimal (e.g., 0.1349) ✅ Regression + Tree
- `grade_encoded` — ordinal feature (1 = best grade A, 7 = worst grade G) ✅ Regression + Tree
- `sub_grade_encoded` — fine-grained ordinal feature (1 = A1, ..., 35 = G5) ✅ Regression + Tree

**New field(s):**
- `grade_encoded`
- `sub_grade_encoded`
- (transformed) `int_rate` (overwrites original)

**Drop field(s):**
- `grade` — used to create `grade_encoded`
- `sub_grade` — used to create `sub_grade_encoded`
- ⚠️ Retain `int_rate` (overwritten with numeric format)
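
The ordinal encodings follow directly from the ranges stated above: with five sub-grades per letter, `sub_grade_encoded = (grade_rank - 1) * 5 + step`, so `a1` maps to 1 and `g5` to 35. A sketch, assuming lowercase two-character sub-grades with no missing values (the function name is hypothetical):

```python
import pandas as pd


def encode_interest_and_grade_fields_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Convert int_rate to a decimal and ordinal-encode grade and sub_grade."""
    out = df.copy()
    out["int_rate"] = out["int_rate"] / 100.0  # 13.49 -> 0.1349
    grade_rank = {letter: i for i, letter in enumerate("abcdefg", start=1)}
    out["grade_encoded"] = out["grade"].map(grade_rank)
    letter = out["sub_grade"].str[0].map(grade_rank)
    step = out["sub_grade"].str[1].astype(int)
    out["sub_grade_encoded"] = (letter - 1) * 5 + step
    return out


demo = pd.DataFrame(
    {
        "int_rate": [13.49, 7.0, 25.0],
        "grade": ["a", "b", "g"],
        "sub_grade": ["a1", "b3", "g5"],
    }
)
graded = encode_interest_and_grade_fields_sketch(demo)
print(graded)
```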


In [131]:
cols_for_reg = list(set(cols_for_reg + ["int_rate", "grade_encoded", "sub_grade_encoded"]))

cols_for_tree = list(set(cols_for_tree + ["int_rate", "grade_encoded", "sub_grade_encoded"]))

drop_cols = list(set(drop_cols + ["grade", "sub_grade"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['sub_grade_encoded', 'purpose_risk_score', 'credit_age_months', 'has_hardship', 'home_ownership_ordinal', 'term_rate_interaction_reg', 'int_rate', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'fico_average', 'emp_length_clean_reg', 'term_normalized_reg', 'grade_encoded', 'loan_status_binary']
Columns for the tree-based models: ['credit_age_months', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'purpose_high_risk', 'fico_average', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'purpose_nan', 'was_late_before_hardship', 'hardship_dpd_filled', 'grade_encoded', 'loan_status_binary', 'purpose_risk_score', 'home_ownership_ordinal', 'term_tree', 'loan_to_installment_ratio']
Columns to drop ['tax_liens', 'term', 'mths_since_recent_bc_dlq', 'open_il_12m', 'home_ownership', 'fico_range_high', 'open_act_il', 'emp_length_clean', 'tot_coll_amt', 'purpose', 'earli

### Engineering credit history indicators with `create_credit_history_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Combines multiple credit history fields into composite features that signal borrower risk due to past derogatory marks or delinquencies.

**Inputs:**
- `delinq_2yrs` — number of delinquencies in the past 2 years
- `pub_rec` — number of derogatory public records
- `mths_since_last_delinq` — months since the last delinquency (may contain NaNs)

**Outputs:**
- `has_derogatory` — binary indicator (1 if `delinq_2yrs` > 0 or `pub_rec` > 0) ✅ Tree + Regression
- `delinq_weight` — weighted delinquency severity score (higher if delinquencies were recent) ✅ Regression only

**New field(s):**
- `has_derogatory` — used in both tree-based and regression models
- `delinq_weight` — used in regression models only (continuous score)

**Drop field(s):**
- None explicitly dropped yet:
  - `delinq_2yrs`, `pub_rec`, and `mths_since_last_delinq` are kept temporarily for use in later features (`create_credit_history_model_features`)
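
A sketch of the two composite features. The `has_derogatory` rule is taken from the description above; the `delinq_weight` formula (delinquency count scaled up linearly by recency within a 24-month horizon) is purely an assumed stand-in, since the source does not give the actual weighting.

```python
import numpy as np
import pandas as pd


def create_credit_history_features_sketch(
    df: pd.DataFrame, horizon_months: float = 24.0
) -> pd.DataFrame:
    """Add a derogatory flag and a recency-weighted delinquency score."""
    out = df.copy()
    out["has_derogatory"] = (
        (out["delinq_2yrs"] > 0) | (out["pub_rec"] > 0)
    ).astype(int)
    # Hypothetical recency in [0, 1]: 1 = delinquent this month,
    # 0 = older than the horizon or no delinquency on record.
    recency = (1 - out["mths_since_last_delinq"] / horizon_months).clip(0, 1).fillna(0)
    out["delinq_weight"] = out["delinq_2yrs"].fillna(0) * (1 + recency)
    return out


demo = pd.DataFrame(
    {
        "delinq_2yrs": [0, 2, 0],
        "pub_rec": [0, 0, 1],
        "mths_since_last_delinq": [np.nan, 6, 30],
    }
)
history = create_credit_history_features_sketch(demo)
print(history)
```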


In [132]:
cols_for_reg = list(set(cols_for_reg + ["has_derogatory", "delinq_weight"]))

cols_for_tree = list(set(cols_for_tree + ["has_derogatory"]))

# drop_cols = list(set(drop_cols + ["grade", "sub_grade"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['sub_grade_encoded', 'purpose_risk_score', 'credit_age_months', 'has_hardship', 'home_ownership_ordinal', 'has_derogatory', 'term_rate_interaction_reg', 'int_rate', 'loan_to_installment_ratio', 'was_late_before_hardship', 'hardship_dpd_filled', 'fico_average', 'emp_length_clean_reg', 'term_normalized_reg', 'delinq_weight', 'grade_encoded', 'loan_status_binary']
Columns for the tree-based models: ['credit_age_months', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'fico_average', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'purpose_nan', 'was_late_before_hardship', 'hardship_dpd_filled', 'grade_encoded', 'loan_status_binary', 'purpose_risk_score', 'home_ownership_ordinal', 'term_tree', 'loan_to_installment_ratio']
Columns to drop ['tax_liens', 'term', 'mths_since_recent_bc_dlq', 'open_il_12m', 'home_ownership', 'fico_range_high', 'open_act_il', 

### Engineering model-specific credit history features with `create_credit_history_model_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates separate credit history features for tree-based and regression models by handling missing values and emphasizing recent delinquencies.

**Inputs:**
- `delinq_2yrs` — number of delinquencies in the past 2 years
- `mths_since_last_delinq` — months since the last delinquency

**Outputs:**

✅ **Tree-based model features**
- `delinq_2yrs_tree` — `delinq_2yrs` with missing values filled by `-1`
- `has_delinq_tree` — binary indicator for any delinquency
- `has_recent_delinq_tree` — binary flag for recent delinquency (< 24 months)

✅ **Regression model features**
- `delinq_2yrs_reg` — `delinq_2yrs` with median imputation
- `delinq_severity_reg` — delinquency severity (higher if recent)

**New field(s):**
- Tree models: `delinq_2yrs_tree`, `has_delinq_tree`, `has_recent_delinq_tree`
- Regression models: `delinq_2yrs_reg`, `delinq_severity_reg`

**Drop field(s):**
- None dropped explicitly
- `delinq_2yrs` and `mths_since_last_delinq` may be dropped later after final model-specific datasets are built
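
A minimal sketch of the tree/regression split described above. The function name, the severity formula, and the choice to derive `has_delinq_tree` from `delinq_2yrs` are assumptions made for illustration; only the `-1` fill, the median imputation, and the 24-month window come from the description.

```python
import numpy as np
import pandas as pd


def credit_history_model_sketch(df: pd.DataFrame, recent_months: int = 24) -> pd.DataFrame:
    """Hypothetical stand-in for create_credit_history_model_features()."""
    out = df.copy()
    delinq = out["delinq_2yrs"]
    months = out["mths_since_last_delinq"]

    # Tree features: the -1 sentinel keeps "missing" as its own split value
    out["delinq_2yrs_tree"] = delinq.fillna(-1)
    out["has_delinq_tree"] = (delinq.fillna(0) > 0).astype(int)
    # Comparisons against NaN are False, so missing recency maps to 0
    out["has_recent_delinq_tree"] = (months < recent_months).astype(int)

    # Regression features: median imputation avoids an artificial -1 on a linear scale
    out["delinq_2yrs_reg"] = delinq.fillna(delinq.median())
    # Assumed severity: count scaled by a recency factor in [0, 1]
    recency = (1 - months.fillna(recent_months) / recent_months).clip(lower=0)
    out["delinq_severity_reg"] = out["delinq_2yrs_reg"] * recency
    return out


demo = pd.DataFrame({
    "delinq_2yrs": [0.0, 2.0, np.nan, 1.0],
    "mths_since_last_delinq": [np.nan, 6.0, np.nan, 30.0],
})
result = credit_history_model_sketch(demo)
```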


In [133]:
cols_for_reg = list(set(cols_for_reg + ["delinq_2yrs_reg", "delinq_severity_reg"]))

cols_for_tree = list(set(cols_for_tree + ["delinq_2yrs_tree", "has_delinq_tree", "has_recent_delinq_tree"]))

# No columns are added to drop_cols in this step.
# TODO: check other functions for `delinq_2yrs` and `mths_since_last_delinq`

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'term_normalized_reg', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'int_rate', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'fico_average', 'has_recent_delinq_tree', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'purpose_nan', 'was_late_before_hardship', 'hardship_dpd_filled', 'has_delinq_tree', 'grade_encoded', 'loan_status_binary', 'purpose_risk_score', 'home_ownership_ordinal', 'term_tree', 'loan_to_installment_ratio']
Columns to drop ['tax_lien

### Handling missing `pub_rec` values with `handle_pub_rec_missing()` in `/feature_engineering/nodes.py`

**Functionality:**
Fills missing values in the `pub_rec` field using a specified imputation strategy.

**Inputs:**
- `pub_rec` — public derogatory record count
- `strategy` — either `"median"` or `"negative_one"` (default is `"median"`)

**Outputs:**
- `pub_rec` — updated with imputed values

**New field(s):**
- None created — the original `pub_rec` field is modified in-place

**Drop field(s):**
- None

**Model usage:**
- `pub_rec` is used by **both tree-based** and **regression** models.
  - Tree: as-is or through features like `has_derogatory`
  - Regression: often used in engineered ratios or severity scores

**Notes:**
- The strategy is configurable through pipeline parameters (`params:pub_rec_strategy`)
- This node ensures that later modeling steps can safely rely on non-missing `pub_rec`
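
The strategy switch can be sketched like this. The function name `handle_pub_rec_missing_sketch` and the error raised for an unknown strategy are assumptions; the two strategy names and the default come from the description above.

```python
import numpy as np
import pandas as pd


def handle_pub_rec_missing_sketch(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Hypothetical stand-in for handle_pub_rec_missing()."""
    out = df.copy()
    if strategy == "median":
        fill_value = out["pub_rec"].median()
    elif strategy == "negative_one":
        fill_value = -1
    else:
        # Assumed behaviour: fail loudly on an unrecognised strategy
        raise ValueError(f"Unknown strategy: {strategy!r}")
    out["pub_rec"] = out["pub_rec"].fillna(fill_value)
    return out


demo = pd.DataFrame({"pub_rec": [0.0, np.nan, 2.0, 0.0]})
median_filled = handle_pub_rec_missing_sketch(demo, strategy="median")
sentinel_filled = handle_pub_rec_missing_sketch(demo, strategy="negative_one")
```

In the pipeline, the `strategy` argument would be supplied from `params:pub_rec_strategy` rather than passed by hand.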


In [134]:
cols_for_reg = list(set(cols_for_reg + ["pub_rec"]))

cols_for_tree = list(set(cols_for_tree + ["pub_rec"]))

# No columns are added to drop_cols in this step.

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'int_rate', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'fico_average', 'has_recent_delinq_tree', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'purpose_nan', 'was_late_before_hardship', 'hardship_dpd_filled', 'has_delinq_tree', 'grade_encoded', 'loan_status_binary', 'purpose_risk_score', 'home_ownership_ordinal', 'term_tree', 'loan_to_installment_ratio']
Colu

### Engineering delinquency features with `create_delinquency_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates composite indicators of borrower delinquency behavior using recent and historical fields.

**Inputs:**
- `delinq_2yrs`: Number of delinquencies in the past 2 years
- `mths_since_last_delinq`: Months since last delinquency
- `acc_now_delinq`: Number of accounts currently delinquent
- `mths_since_recent_revol_delinq`: (present but not used in current logic)

**Outputs:**
- `delinquency_score` — composite feature capturing overall delinquency risk
- `recent_delinq_bin` — ordinal bin of `mths_since_last_delinq` (used to adjust score)

**New field(s):**
- `delinquency_score` — used in **tree-based models** (intended for splits or ranking risk)
- `recent_delinq_bin` — tree-only ordinal bucket useful for boosting models

**Drop field(s):**
- `delinq_2yrs`, `mths_since_last_delinq`, `mths_since_recent_revol_delinq`, and `acc_now_delinq` are added to the drop list and removed at the end of feature creation

**Model usage:**
- `delinquency_score` and `recent_delinq_bin` are intended primarily for **tree-based models**
- Raw fields such as `delinq_2yrs` and `acc_now_delinq` stay in the data until the model-specific features that depend on them have been built

**Notes:**
- Combines multiple weak indicators into a stronger aggregate risk metric
- Supports interpretability in feature importance plots
- Score is clipped to a max of 10 to avoid dominance in feature scales
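
One way the composite score could be assembled is sketched below. The bin edges (12 and 24 months), the weight of 2 on `acc_now_delinq`, and the function name are assumptions; only the clip at 10 and the use of the recency bin to adjust the score come from the notes above.

```python
import numpy as np
import pandas as pd


def delinquency_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_delinquency_features()."""
    out = df.copy()
    months = out["mths_since_last_delinq"]
    # Ordinal recency bin (assumed edges):
    # 0 = none on record, 1 = 24+ months ago, 2 = 12-23 months, 3 = under 12 months
    out["recent_delinq_bin"] = np.select(
        [months.isna(), months < 12, months < 24],
        [0, 3, 2],
        default=1,
    )
    raw_score = (
        out["delinq_2yrs"].fillna(0)
        + 2 * out["acc_now_delinq"].fillna(0)  # assumed extra weight for current delinquency
        + out["recent_delinq_bin"]
    )
    # Cap at 10 so a few extreme borrowers do not dominate the feature scale
    out["delinquency_score"] = raw_score.clip(upper=10)
    return out


demo = pd.DataFrame({
    "delinq_2yrs": [0, 1, 6],
    "acc_now_delinq": [0, 0, 3],
    "mths_since_last_delinq": [np.nan, 18, 2],
})
result = delinquency_sketch(demo)
```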


In [135]:
# No regression-specific columns are added in this step.

cols_for_tree = list(set(cols_for_tree + ["delinquency_score", "recent_delinq_bin"]))

drop_cols = list(set(drop_cols + ["delinq_2yrs", "mths_since_last_delinq", "mths_since_recent_revol_delinq", "acc_now_delinq"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'int_rate', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'fico_average', 'has_recent_delinq_tree', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'purpose_nan', 'was_late_before_hardship', 'hardship_dpd_filled', 'has_delinq_tree', 'grade_encoded', 'loan_status_binary', 'purpose_risk_score', 'home_ownership_ordinal', 'term_tree', 'loan_to_ins

### Engineering credit inquiry features with `create_credit_inquiry_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates new features that quantify recent credit inquiry behavior and potential risk associated with frequent credit seeking.

**Inputs:**
- `inq_last_6mths` — number of inquiries in the last 6 months
- `inq_last_12m` — number of inquiries in the last 12 months
- `open_acc_6m` — number of accounts opened in the last 6 months

**Outputs:**
- `recent_inquiry_intensity` — ratio of inquiries in 6m vs 12m (clipped to 1) — **Regression + Tree**
- `high_recent_inquiries` — binary flag for frequent inquiries (3 or more in last 6m) — **Tree**
- `inq_to_open_acc_ratio` — ratio of inquiries to opened accounts (proxy for credit seeking success rate) — **Regression + Tree**

**New field(s):**
- `recent_inquiry_intensity` — normalized ratio — **used in both models**
- `high_recent_inquiries` — binary risk flag — **used in tree models**
- `inq_to_open_acc_ratio` — success ratio of inquiries — **used in both models**

**Drop field(s):**
- `inq_last_6mths`, `inq_last_12m`, `open_acc_6m` are added to the drop list once the derived features exist
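
A minimal sketch of the three inquiry features. The function name, the zero-denominator handling, and the `+ 1` guard in `inq_to_open_acc_ratio` are assumptions; the clip to 1 and the threshold of 3 inquiries come from the description above.

```python
import numpy as np
import pandas as pd


def inquiry_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_credit_inquiry_features()."""
    out = df.copy()
    inq_6m = out["inq_last_6mths"].fillna(0)
    inq_12m = out["inq_last_12m"].fillna(0)
    opened_6m = out["open_acc_6m"].fillna(0)

    # Share of the last 12 months' inquiries that fall in the last 6 months.
    # A zero denominator yields 0 instead of inf or NaN.
    ratio = inq_6m / inq_12m.replace(0, np.nan)
    out["recent_inquiry_intensity"] = ratio.fillna(0).clip(upper=1)
    out["high_recent_inquiries"] = (inq_6m >= 3).astype(int)
    # Assumed guard: + 1 keeps the ratio finite when no account was opened
    out["inq_to_open_acc_ratio"] = inq_6m / (opened_6m + 1)
    return out


demo = pd.DataFrame({
    "inq_last_6mths": [0, 2, 4],
    "inq_last_12m": [0, 4, 3],
    "open_acc_6m": [0, 1, 1],
})
result = inquiry_sketch(demo)
```

The third row shows why the clip matters: the two source fields can disagree (4 inquiries in 6 months against 3 in 12), and without the clip the ratio would exceed 1.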


In [136]:
cols_for_reg = list(set(cols_for_reg + ["recent_inquiry_intensity", "inq_to_open_acc_ratio"]))

cols_for_tree = list(set(cols_for_tree + ["recent_inquiry_intensity", "high_recent_inquiries", "inq_to_open_acc_ratio"]))

drop_cols = list(set(drop_cols + ["inq_last_6mths", "inq_last_12m", "open_acc_6m"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'int_rate', 'recent_inquiry_intensity', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'fico_average', 'has_recent_delinq_tree', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'recent_inquiry_intensity', 'purpose_nan', 'was_late_before_hardship', 'hardship_dpd_filled', 'has_delinq_tree', 'grade_encoded', 'loan_status_b

### Engineering account activity features with `create_account_activity_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates features that describe the borrower’s recent credit account behavior, such as how quickly they are acquiring new credit and how active their accounts are.

**Inputs:**
- `open_acc` — number of currently open accounts
- `acc_open_past_24mths` — number of accounts opened in the last 24 months
- `total_acc` — total number of credit accounts

**Outputs:**
- `recent_acc_ratio` — share of open accounts that are newly opened
- `rapid_acc_acquisition` — binary flag (1 if more than 50% of open accounts were opened recently)
- `active_acc_ratio` — proportion of total accounts that are still open

**New field(s):**
- `recent_acc_ratio` — **used in tree and regression models**
- `rapid_acc_acquisition` — **used in tree-based models**
- `active_acc_ratio` — **used in tree and regression models**

**Drop field(s):**
- `open_acc`, `acc_open_past_24mths`, `total_acc` may be dropped after model-specific features are finalized if no longer used downstream

**Model usage:**
- `recent_acc_ratio`, `active_acc_ratio` — useful for both **regression** and **tree-based** models
- `rapid_acc_acquisition` — binary feature primarily for **tree-based** models
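
The ratios above can be sketched as follows. The function name and the rule that a zero denominator yields a ratio of 0 are assumptions; the 50% threshold comes from the description.

```python
import numpy as np
import pandas as pd


def account_activity_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_account_activity_features()."""
    out = df.copy()
    # Replace zero denominators with NaN so the division never produces inf
    open_acc = out["open_acc"].replace(0, np.nan)
    total_acc = out["total_acc"].replace(0, np.nan)

    out["recent_acc_ratio"] = (out["acc_open_past_24mths"] / open_acc).fillna(0)
    out["rapid_acc_acquisition"] = (out["recent_acc_ratio"] > 0.5).astype(int)
    out["active_acc_ratio"] = (out["open_acc"] / total_acc).fillna(0)
    return out


demo = pd.DataFrame({
    "open_acc": [10, 4, 0],
    "acc_open_past_24mths": [2, 3, 0],
    "total_acc": [20, 8, 0],
})
result = account_activity_sketch(demo)
```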


In [137]:
cols_for_reg = list(set(cols_for_reg + ["recent_acc_ratio", "active_acc_ratio"]))

cols_for_tree = list(set(cols_for_tree + ["recent_acc_ratio", "rapid_acc_acquisition", "active_acc_ratio"]))

drop_cols = list(set(drop_cols + ["open_acc", "acc_open_past_24mths"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'active_acc_ratio', 'int_rate', 'recent_inquiry_intensity', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'fico_average', 'has_recent_delinq_tree', 'rapid_acc_acquisition', 'active_acc_ratio', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', 'recent_inquiry_intensity', 'purpose_na

### Engineering revolving balance features with `create_joint_revol_bal_feature()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates a unified revolving balance feature `revol_bal_final` that accounts for joint applications by selecting the appropriate value based on whether the loan is a joint application.

**Inputs:**
- `revol_bal` — revolving balance for the individual borrower
- `revol_bal_joint` — revolving balance for joint applications
- `is_joint_app` — binary indicator for whether the loan is joint

**Outputs:**
- `revol_bal_final` — final value of revolving balance, selecting joint where applicable

**New field(s):**
- `revol_bal_final` — **used in downstream feature calculations**, including `inst_to_revol_ratio`

**Drop field(s):**
- `revol_bal`, `revol_bal_joint` — can be dropped after `revol_bal_final` is created

**Model usage:**
- `revol_bal_final` — **used in both tree and regression models**
  (as input to downstream features like debt composition ratio)
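
The joint/individual selection can be sketched with a small helper. The helper name `select_joint_value` and the fallback to the individual value when the joint value is missing are assumptions.

```python
import numpy as np
import pandas as pd


def select_joint_value(
    df: pd.DataFrame, individual: str, joint: str, flag: str = "is_joint_app"
) -> pd.Series:
    """Hypothetical helper: use the joint column for joint applications."""
    # Assumed fallback: keep the individual value if the joint one is missing
    use_joint = (df[flag] == 1) & df[joint].notna()
    values = np.where(use_joint, df[joint], df[individual])
    return pd.Series(values, index=df.index)


demo = pd.DataFrame({
    "revol_bal": [1000.0, 2000.0, 3000.0],
    "revol_bal_joint": [np.nan, 5000.0, np.nan],
    "is_joint_app": [0, 1, 1],
})
demo["revol_bal_final"] = select_joint_value(demo, "revol_bal", "revol_bal_joint")
```

The same pattern would fit the other joint-aware fields in this notebook (`annual_inc_final`, `dti_final`), since each pairs an individual column with a joint one and the `is_joint_app` flag.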


In [138]:
cols_for_reg = list(set(cols_for_reg + ["revol_bal_final"]))

cols_for_tree = list(set(cols_for_tree + ["revol_bal_final"]))

drop_cols = list(set(drop_cols + ["revol_bal", "revol_bal_joint"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'active_acc_ratio', 'int_rate', 'recent_inquiry_intensity', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'fico_average', 'has_recent_delinq_tree', 'rapid_acc_acquisition', 'active_acc_ratio', 'emp_length_clean_tree', 'purpose_other', 'fico_risk_band', 'int_rate', '

### Engineering debt composition features with `create_debt_composition_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates features that describe how a borrower’s debt is distributed between installment and revolving credit, and whether they have a mortgage.

**Inputs:**
- `total_bal_il` — total balance on installment loans
- `revol_bal_final` — final revolving balance (already cleaned/merged)
- `mort_acc` — number of mortgage accounts
- `total_acc` — total number of credit accounts

**Outputs:**
- `inst_to_revol_ratio` — ratio of installment to revolving debt
- `debt_composition_type` — categorical bucket of the borrower's debt profile
- `mortgage_ratio` — share of mortgage accounts out of total accounts
- `has_mortgage` — binary flag (1 if the borrower has at least one mortgage account)

**New field(s):**
- `inst_to_revol_ratio` — **used in tree and regression models**
- `debt_composition_type` — **tree-specific categorical feature**
- `mortgage_ratio` — **used in both model types**
- `has_mortgage` — **binary feature for tree models**

**Drop field(s):**
- `total_bal_il`, `mort_acc`, `total_acc` — can be dropped later if not reused
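
A minimal sketch of the debt composition features. The function name, the bucket labels, the 2x thresholds, and the `+ 1` in the ratio denominator are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def debt_composition_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_debt_composition_features()."""
    out = df.copy()
    inst = out["total_bal_il"].fillna(0)
    revol = out["revol_bal_final"].fillna(0)

    # + 1 avoids division by zero for borrowers with no revolving balance
    out["inst_to_revol_ratio"] = inst / (revol + 1)
    # Assumed buckets: one side "dominates" when it is more than twice the other
    out["debt_composition_type"] = np.select(
        [(inst == 0) & (revol == 0), inst > 2 * revol, revol > 2 * inst],
        ["no_debt", "installment_heavy", "revolving_heavy"],
        default="balanced",
    )
    out["has_mortgage"] = (out["mort_acc"].fillna(0) > 0).astype(int)
    total_acc = out["total_acc"].replace(0, np.nan)
    out["mortgage_ratio"] = (out["mort_acc"] / total_acc).fillna(0)
    return out


demo = pd.DataFrame({
    "total_bal_il": [0, 30000, 1000, 5000],
    "revol_bal_final": [0, 5000, 9000, 4000],
    "mort_acc": [0, 1, 0, 2],
    "total_acc": [0, 10, 5, 8],
})
result = debt_composition_sketch(demo)
```

A categorical bucket such as `debt_composition_type` needs encoding before most tree libraries can use it, which is one reason it is listed as tree-specific.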



In [139]:
cols_for_reg = list(set(cols_for_reg + ["inst_to_revol_ratio", "mortgage_ratio"]))

cols_for_tree = list(
    set(cols_for_tree + ["inst_to_revol_ratio", "debt_composition_type", "mortgage_ratio", "has_mortgage"]))

drop_cols = list(set(drop_cols + ["total_acc", "total_bal_il", "mort_acc"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'mortgage_ratio', 'active_acc_ratio', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'debt_composition_type', 'fico_average', 'has_recent_delinq_tree', 'rapid_acc_acquisition', 'mortgage_ratio', 'active_a

### Engineering derogatory record recency with `create_mths_since_last_record_feature()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates a normalized version of the `mths_since_last_record` field by filling missing values and capping extreme values.
This feature helps model how recent a derogatory public record was reported, if any.

**Inputs:**
- `mths_since_last_record` — months since the last derogatory record; can be missing

**Outputs:**
- `mths_since_last_record_filled` — filled and capped version of `mths_since_last_record`

**New field(s):**
- `mths_since_last_record_filled` — used in both tree-based and regression models

**Drop field(s):**
- `mths_since_last_record` — can be dropped after the filled version is created

**Model usage:**
- `mths_since_last_record_filled` — ✔️ Used in **tree** and **regression** models
  (as an indicator of financial recovery or risk history)
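
Fill-and-cap can be sketched in two chained calls. The cap of 120 months and the choice to treat a missing value as "as old as the cap" (no record on file) are assumptions; the function name is hypothetical.

```python
import numpy as np
import pandas as pd


def record_recency_sketch(df: pd.DataFrame, cap: int = 120) -> pd.DataFrame:
    """Hypothetical stand-in for create_mths_since_last_record_feature()."""
    out = df.copy()
    # Missing is read as "no record on file", so it gets the oldest allowed value
    out["mths_since_last_record_filled"] = (
        out["mths_since_last_record"].fillna(cap).clip(upper=cap)
    )
    return out


demo = pd.DataFrame({"mths_since_last_record": [np.nan, 6.0, 200.0]})
result = record_recency_sketch(demo)
```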


In [140]:
cols_for_reg = list(set(cols_for_reg + ["mths_since_last_record_filled"]))

cols_for_tree = list(
    set(cols_for_tree + ["mths_since_last_record_filled"]))

drop_cols = list(set(drop_cols + ["mths_since_last_record"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'debt_composition_type', 'fico_average', 'has_recent_delinq_tree', 'mths_since_last_re

### Engineering income features with `create_joint_income_feature()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates a unified income field `annual_inc_final` by combining individual and joint income based on application type.

**Inputs:**
- `annual_inc` — income of the primary applicant
- `annual_inc_joint` — combined income of both applicants (if joint application)
- `is_joint_app` — binary flag indicating if application is joint

**Outputs:**
- `annual_inc_final` — unified income field reflecting total reported income

**New field(s):**
- `annual_inc_final` — ✔️ Used in both **tree-based** and **regression** models
  (improves accuracy by properly handling joint applications)

**Drop field(s):**
- `annual_inc`, `annual_inc_joint` — ✅ Can be dropped after feature is created
  (they're fully incorporated into the new `annual_inc_final` field)


In [141]:
cols_for_reg = list(set(cols_for_reg + ["annual_inc_final"]))

cols_for_tree = list(set(cols_for_tree + ["annual_inc_final"]))

drop_cols = list(set(drop_cols + ["annual_inc", "annual_inc_joint"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'debt_composition_type', 'fico_average', 'has_

### Engineering income-related model-specific features with `create_income_model_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates tree-based and regression-based income features from `annual_inc_final`.

**Inputs:**
- `annual_inc_final` — unified income field (from joint or individual income)
- `loan_amnt` — required for `income_to_loan_reg` calculation

**Outputs (tree-based models):**
- `income_band_tree` — income quantile bins (ordinal)
- `is_high_income_tree` — binary flag for high-income borrowers

**Outputs (regression models):**
- `income_log_reg` — log-transformed income for normalization
- `income_to_loan_reg` — income-to-loan ratio (clipped to handle outliers)

**New field(s):**
- `income_band_tree` — ✔️ Used in tree-based models
- `is_high_income_tree` — ✔️ Used in tree-based models
- `income_log_reg` — ✔️ Used in regression models
- `income_to_loan_reg` — ✔️ Used in regression models

**Drop field(s):**
- None. `annual_inc_final` is retained for downstream features and for modeling.
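
The four income features can be sketched as below. The function name, the number of bands (4), the high-income cutoff (75th percentile), and the 99th-percentile clip are assumptions; the description only states that the bands are quantile-based and that the ratio is clipped.

```python
import numpy as np
import pandas as pd


def income_model_sketch(
    df: pd.DataFrame, n_bands: int = 4, high_income_quantile: float = 0.75
) -> pd.DataFrame:
    """Hypothetical stand-in for create_income_model_features()."""
    out = df.copy()
    income = out["annual_inc_final"]

    # Tree features: ordinal quantile band and a high-income flag
    out["income_band_tree"] = pd.qcut(income, q=n_bands, labels=False, duplicates="drop")
    out["is_high_income_tree"] = (
        income >= income.quantile(high_income_quantile)
    ).astype(int)

    # Regression features: log1p compresses the long right tail of income
    out["income_log_reg"] = np.log1p(income)
    ratio = income / out["loan_amnt"].replace(0, np.nan)
    out["income_to_loan_reg"] = ratio.clip(upper=ratio.quantile(0.99))
    return out


demo = pd.DataFrame({
    "annual_inc_final": [20000, 40000, 60000, 80000, 100000, 120000, 140000, 160000],
    "loan_amnt": [10000] * 8,
})
result = income_model_sketch(demo)
```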


In [142]:
cols_for_reg = list(set(cols_for_reg + ["income_log_reg", "income_to_loan_reg"]))

cols_for_tree = list(set(cols_for_tree + ["income_band_tree", "is_high_income_tree"]))

# drop_cols = list(set(drop_cols + []))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'has_hardship', 'purpose_risk_category', 'has_derogatory', 'purpose_high_risk', 'recent_delinq_bin', 'is_hi

### Engineering payment-related features with `create_payment_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates affordability and interest burden indicators from income, loan amount, term, and installment values.

**Inputs:**
- `annual_inc_final` — borrower (or joint) income
- `installment` — monthly loan payment
- `loan_amnt` — total loan amount
- `term` — loan duration in months

**Outputs (tree-based + regression models):**
- `payment_to_income` — percentage of annual income spent on loan payments
- `high_payment_burden` — binary flag for high burden (`payment_to_income` > 20%)
- `total_payments` — total scheduled loan repayment (installment × term)
- `interest_burden_pct` — interest paid as % of loan amount

**New field(s):**
- `payment_to_income` — ✔️ Used in regression and possibly tree-based models
- `high_payment_burden` — ✔️ Useful as a binary feature for tree-based models
- `total_payments` — intermediate feature, used for calculating interest burden
- `interest_burden_pct` — ✔️ Potentially useful for either model type

**Drop field(s):**
- None explicitly dropped
- `total_payments` may be dropped if only `interest_burden_pct` is retained

**Notes:**
- `payment_to_income` is capped at 100% to avoid skew from extreme values
- Useful for identifying overextended borrowers and modeling affordability stress
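
The arithmetic behind these features can be sketched as follows. The function name is hypothetical, and the sketch assumes `term` is already numeric (months) and that `payment_to_income` compares twelve monthly installments with annual income.

```python
import numpy as np
import pandas as pd


def payment_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_payment_features()."""
    out = df.copy()
    # Zero income becomes NaN so the ratio is undefined rather than infinite
    income = out["annual_inc_final"].replace(0, np.nan)
    annual_payment = out["installment"] * 12

    out["payment_to_income"] = (annual_payment / income * 100).clip(upper=100)
    out["high_payment_burden"] = (out["payment_to_income"] > 20).astype(int)
    out["total_payments"] = out["installment"] * out["term"]
    out["interest_burden_pct"] = (
        (out["total_payments"] - out["loan_amnt"]) / out["loan_amnt"] * 100
    )
    return out


demo = pd.DataFrame({
    "annual_inc_final": [60000, 12000],
    "installment": [500, 400],
    "loan_amnt": [15000, 12000],
    "term": [36, 36],
})
result = payment_sketch(demo)
```

For the first borrower, 500 × 12 = 6,000 per year against 60,000 income gives a 10% burden, and 500 × 36 = 18,000 repaid on a 15,000 loan gives a 20% interest burden.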


In [143]:
cols_for_reg = list(set(cols_for_reg + ["payment_to_income", "interest_burden_pct"]))

cols_for_tree = list(set(cols_for_tree + ["payment_to_income", "high_payment_burden", "interest_burden_pct"]))

drop_cols = list(set(drop_cols + ["total_payments"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight']
Columns for the tree-based models: ['high_payment_burden', 'credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_incom

### Engineering the `dti` fields with `create_joint_dti_feature()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates a unified debt-to-income ratio (`dti_final`) that selects between `dti` and `dti_joint` based on whether the application is joint.

**Inputs:**
- `dti` — individual debt-to-income ratio
- `dti_joint` — joint DTI (if application is joint)
- `is_joint_app` — binary indicator (1 if joint application)

**Outputs:**
- `dti_final` — DTI used for modeling, derived from `dti_joint` when available for joint applications

**New field(s):**
- `dti_final` — ✔️ Used in both regression and tree-based models

**Drop field(s):**
- `dti`, `dti_joint` — can be dropped after `dti_final` is created

**Notes:**
- Ensures downstream models use a consistent and context-appropriate DTI value
- Joint DTI is prioritized if available


In [144]:
cols_for_reg = list(set(cols_for_reg + ["dti_final"]))

cols_for_tree = list(set(cols_for_tree + ["dti_final"]))

drop_cols = list(set(drop_cols + ["dti", "dti_joint"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-based models: ['high_payment_burden', 'credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'pay

### Engineering the `dti_final` field with `create_dti_model_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates multiple model-specific features based on the unified debt-to-income ratio (`dti_final`).

**Inputs:**
- `dti_final` — unified debt-to-income ratio (individual or joint, depending on application type)

**Outputs:**
- `dti_band_tree` — decile-based binning for tree-based models
- `is_high_dti_tree` — binary flag for high DTI (above median) — tree model only
- `dti_normalized_reg` — normalized DTI for regression (0–1 scale)
- `dti_log_reg` — log-transformed DTI for regression models

**New field(s):**
- `dti_band_tree`, `is_high_dti_tree` — ✔️ Tree-based models only
- `dti_normalized_reg`, `dti_log_reg` — ✔️ Regression models only

**Drop field(s):**
- None. `dti_final` is retained because it may still be useful for diagnostics or interpretation

**Notes:**
- Decile binning gives bands of roughly equal size, so tree splits are not driven by a handful of extreme DTI values
- Normalization and log transform are standard practices for linear/regression models
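
The four outputs above can be sketched roughly as follows. This is a minimal illustration, not the actual body of `create_dti_model_features()`: the function name `sketch_dti_features`, the min-max form of the normalization, and the use of `log1p` with clipping at zero are all assumptions.

```python
import numpy as np
import pandas as pd


def sketch_dti_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_dti_model_features()."""
    out = df.copy()
    dti = out["dti_final"]

    # Decile bands for tree models; duplicates="drop" guards against tied bin edges
    out["dti_band_tree"] = pd.qcut(dti, q=10, labels=False, duplicates="drop")
    # Binary flag: 1 when DTI is above the median
    out["is_high_dti_tree"] = (dti > dti.median()).astype(int)

    # Min-max scaling to the 0-1 range (assumed form of the normalization)
    span = dti.max() - dti.min()
    out["dti_normalized_reg"] = (dti - dti.min()) / span if span > 0 else 0.0
    # log1p keeps a DTI of zero finite; negatives are clipped first (assumption)
    out["dti_log_reg"] = np.log1p(dti.clip(lower=0))
    return out


demo = pd.DataFrame({"dti_final": [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0]})
result = sketch_dti_features(demo)
print(result)
```

`dti_final` itself is left in place, which matches the decision above to retain it for diagnostics.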


In [145]:
cols_for_reg = list(set(cols_for_reg + ["dti_normalized_reg", "dti_log_reg"]))

cols_for_tree = list(set(cols_for_tree + ["dti_band_tree", "is_high_dti_tree"]))

# drop_cols = list(set(drop_cols + ["dti", "dti_joint"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-based models: ['high_payment_burden', 'credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'annual_inc_final', 'sub_grade_

### Engineering the `revol_util` field with `create_revol_util_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates model-specific features from `revol_util` (revolving credit utilization) by applying tailored missing value strategies.

**Inputs:**
- `revol_util` — percentage of revolving credit lines currently in use (may contain missing values)

**Outputs:**
- `revol_util_tree` — for tree-based models, with missing values filled with `-1`
- `revol_util_reg` — for regression models, with missing values filled using the median

**New field(s):**
- `revol_util_tree` — ✔️ Tree-based models only
- `revol_util_reg` — ✔️ Regression models only

**Drop field(s):**
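- `revol_util` — ✅ Can be dropped after feature generation, but only once `create_utilization_model_features()` (next section) has also consumed it, which is why the drop is commented out in the cell below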

**Notes:**
- Tree-based models can interpret out-of-range values like `-1` as a meaningful split
- Median imputation helps maintain distribution continuity for regression models
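
A minimal sketch of the two imputation strategies, assuming nothing beyond what is listed above. The name `sketch_revol_util_features` is hypothetical, and the median here is taken from the same frame; whether the real node computes it on training data only is not stated in this notebook.

```python
import numpy as np
import pandas as pd


def sketch_revol_util_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_revol_util_features()."""
    out = df.copy()
    util = out["revol_util"]

    # Tree models: an out-of-range sentinel lets a split isolate missing rows
    out["revol_util_tree"] = util.fillna(-1)
    # Regression models: the median keeps imputed rows inside the observed range
    out["revol_util_reg"] = util.fillna(util.median())
    return out


demo = pd.DataFrame({"revol_util": [10.0, np.nan, 50.0, 90.0]})
result = sketch_revol_util_features(demo)
print(result)
```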


In [146]:
cols_for_reg = list(set(cols_for_reg + ["revol_util_reg"]))

cols_for_tree = list(set(cols_for_tree + ["revol_util_tree"]))

# drop_cols = list(set(drop_cols + ["revol_util"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-based models: ['high_payment_burden', 'credit_age_months', 'delinq_2yrs_tree', 'pub_rec', 'recent_acc_ratio', 'loan_amount_band', 'revol_bal_final', 'annual_inc_f

### Engineering the `revol_util` field with `create_utilization_model_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates model-specific credit utilization features using binning and normalization techniques.

**Inputs:**
- `revol_util` — percentage of revolving credit used (e.g., 45.2)

**Outputs:**
- `util_band_tree` — quantile-binned utilization for **tree-based models**
- `is_high_util_tree` — binary flag for high utilization (above median) — **tree only**
- `util_normalized_reg` — scaled utilization from 0–1 — **regression only**
- `util_buckets_reg` — binned utilization buckets for regression modeling — **regression only**

**New field(s):**
- `util_band_tree` — ✔️ Tree models
- `is_high_util_tree` — ✔️ Tree models
- `util_normalized_reg` — ✔️ Regression models
- `util_buckets_reg` — ✔️ Regression models

**Drop field(s):**
- `revol_util` — ✅ Can be dropped after feature creation (used to derive all outputs)

**Notes:**
- `pd.qcut()` creates quantile-based bins for `util_band_tree`, so each band holds roughly the same number of loans
- `pd.cut()` creates fixed utilization brackets (e.g., 0–20%, 20–40%, …) for regression modeling
- This function complements `create_revol_util_features()`, which handled missing value imputation
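
The difference between the quantile bands and the fixed brackets can be illustrated with a short sketch. The function name `sketch_utilization_features`, the choice of five quantiles, the divide-by-100 scaling, and the exact bracket edges are assumptions made for the example; the notebook only states that `qcut` and `pd.cut` are used.

```python
import pandas as pd


def sketch_utilization_features(df: pd.DataFrame, n_quantiles: int = 5) -> pd.DataFrame:
    """Hypothetical stand-in for create_utilization_model_features()."""
    out = df.copy()
    util = out["revol_util"]

    # Quantile bands: bin edges follow the data, so bands are similarly populated
    out["util_band_tree"] = pd.qcut(util, q=n_quantiles, labels=False, duplicates="drop")
    out["is_high_util_tree"] = (util > util.median()).astype(int)

    # Percent to 0-1 scale; values above 100% are clipped (assumption)
    out["util_normalized_reg"] = (util / 100.0).clip(0, 1)
    # Fixed 20-point brackets: edges are the same regardless of the data
    edges = [0, 20, 40, 60, 80, 100]
    out["util_buckets_reg"] = pd.cut(
        util.clip(0, 100), bins=edges, labels=False, include_lowest=True
    )
    return out


demo = pd.DataFrame({"revol_util": [5.0, 25.0, 45.2, 65.0, 85.0, 120.0]})
result = sketch_utilization_features(demo)
print(result)
```

Fixed brackets keep the same meaning across data refreshes, while quantile bands shift with the distribution, which is one reason to reserve the latter for tree models.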


In [147]:
cols_for_reg = list(set(cols_for_reg + ["util_normalized_reg", "util_buckets_reg"]))

cols_for_tree = list(set(cols_for_tree + ["util_band_tree", "is_high_util_tree"]))

drop_cols = list(set(drop_cols + ["revol_util"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-based models: ['high_payment_burden', 'credit_age_months', 'delinq_2yrs_tree', 'is_high_util_tree', 'util_band_tree', '

### Engineering the `initial_list_status` field with `create_initial_list_status_flag()` in `/feature_engineering/nodes.py`

**Functionality:**
Encodes the `initial_list_status` field into a binary flag to represent whether a loan was initially listed as "whole" (`'w'`).

**Input:**
- `initial_list_status` — categorical field with values like `'w'` (whole) or `'f'` (fractional)

**Output:**
- `initial_list_status_flag` — binary flag: 1 if `'w'`, 0 otherwise

**New field(s):**
- `initial_list_status_flag` — ✔️ Used in **both tree-based and regression** models

**Drop field(s):**
- `initial_list_status` — ✅ Can be dropped after feature creation

**Notes:**
- This feature helps capture listing strategy which may correlate with risk
- Default behavior does not drop the original column, but it’s safe to do so in downstream cleanup
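
As a sketch, the encoding reduces to a single comparison. The function name `sketch_initial_list_status_flag` is hypothetical, and treating missing values as 0 follows from the "0 otherwise" rule above.

```python
import pandas as pd


def sketch_initial_list_status_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_initial_list_status_flag()."""
    out = df.copy()
    # 1 for 'w' (whole); 'f', missing, or any other value maps to 0
    out["initial_list_status_flag"] = (out["initial_list_status"] == "w").astype(int)
    return out


demo = pd.DataFrame({"initial_list_status": ["w", "f", "w", None]})
result = sketch_initial_list_status_flag(demo)
print(result)
```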


In [148]:
cols_for_reg = list(set(cols_for_reg + ["initial_list_status_flag"]))

cols_for_tree = list(set(cols_for_tree + ["initial_list_status_flag"]))

drop_cols = list(set(drop_cols + ["initial_list_status", "is_joint_app"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-based models: ['high_payment_burden', 'credit_age_months', 'delinq_2yrs_tree', 'is_high_uti

### Engineering the `mths_since_last_major_derog` field with `create_major_derog_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates model-specific features from the "months since last major derogatory" field.
These features capture recent severe negative credit events and turn them into numerical risk signals.

**Inputs:**
- `mths_since_last_major_derog` — numeric or NaN if no record of major derogatory events.

**Outputs:**
- `mths_since_last_major_derog_filled` — missing values filled with 999 (interpreted as "no derog")
- `recent_major_derog_flag` — binary flag for events within last 24 months — ✔️ Tree models
- `major_derog_score` — inverse score where recent events carry more weight — ✔️ Regression models

**New field(s):**
- `mths_since_last_major_derog_filled` — Intermediate field (used in both model types)
- `recent_major_derog_flag` — ✔️ Tree models
- `major_derog_score` — ✔️ Regression models

**Drop field(s):**
- `mths_since_last_major_derog` — ✅ Can be dropped after transformation

**Notes:**
- `major_derog_score` becomes large when the event is recent (e.g., 1/2 = 0.5 for 2 months ago)
- `NaN` is filled with `999`, read as no derogatory history, so the inverse score stays close to zero for those borrowers
- Useful in both tree-based and regression models with different downstream transformations
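
A rough sketch of the three outputs follows. The exact score formula is not shown in this notebook; `1 / months` is inferred from the `1/2 = 0.5` example above, and the lower clip at 1 (to avoid dividing by zero) and the inclusive 24-month boundary are assumptions. The function name `sketch_major_derog_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_major_derog_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_major_derog_features()."""
    out = df.copy()
    # 999 stands for "no major derogatory event on record"
    filled = out["mths_since_last_major_derog"].fillna(999)
    out["mths_since_last_major_derog_filled"] = filled

    # Tree models: event within the last 24 months (inclusive, by assumption)
    out["recent_major_derog_flag"] = (filled <= 24).astype(int)
    # Regression models: inverse of elapsed months, so recent events score higher
    out["major_derog_score"] = 1.0 / filled.clip(lower=1)
    return out


demo = pd.DataFrame({"mths_since_last_major_derog": [2.0, 24.0, 60.0, np.nan]})
result = sketch_major_derog_features(demo)
print(result)
```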


In [149]:
cols_for_reg = list(set(cols_for_reg + ["major_derog_score", "mths_since_last_major_derog_filled"]))

cols_for_tree = list(set(cols_for_tree + ["recent_major_derog_flag", "mths_since_last_major_derog_filled"]))

drop_cols = list(set(drop_cols + ["mths_since_last_major_derog"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-based models: ['high_payment_bur

### Engineering the `verification_status` fields with `create_joint_verification_feature()` in `/feature_engineering/nodes.py`

**Functionality:**
Creates a unified verification status column `verification_status_final` that accounts for both individual and joint loan applications.

**Inputs:**
- `verification_status` — categorical field for individual application (e.g., "verified", "not verified")
- `verification_status_joint` — corresponding field for joint applicants
- `is_joint_app` — binary flag indicating whether application is joint (1) or individual (0)

**Outputs:**
- `verification_status_final` — merged status, using the joint status if the application is joint

**New field(s):**
- `verification_status_final` — used in both **tree-based** and **regression** models (if encoded later)

**Drop field(s):**
- `verification_status`, `verification_status_joint` — ✅ Can be dropped after generating `verification_status_final`

**Notes:**
- Ensures consistent handling of both joint and individual applications
- Original fields are preserved for now but marked as removable
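
The merge rule can be sketched with `Series.where`. The function name `sketch_joint_verification` is hypothetical, and falling back to the individual status when a joint application has no joint status recorded is an assumption; the notebook does not say how that case is handled.

```python
import pandas as pd


def sketch_joint_verification(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_joint_verification_feature()."""
    out = df.copy()
    joint = out["verification_status_joint"]
    # Use the joint status only for joint applications that actually have one
    use_joint = (out["is_joint_app"] == 1) & joint.notna()
    out["verification_status_final"] = joint.where(use_joint, out["verification_status"])
    return out


demo = pd.DataFrame(
    {
        "verification_status": ["verified", "not verified", "not verified"],
        "verification_status_joint": [None, "verified", None],
        "is_joint_app": [0, 1, 1],
    }
)
result = sketch_joint_verification(demo)
print(result["verification_status_final"].tolist())
```

Because the result is still categorical, it needs an encoding step before it can enter a regression model, as noted above.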


In [150]:
cols_for_reg = list(set(cols_for_reg + ["verification_status_final"]))

cols_for_tree = list(set(cols_for_tree + ["verification_status_final"]))

drop_cols = list(set(drop_cols + ["verification_status", "verification_status_joint"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'verification_status_final', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'dti_final']
Columns for the tree-bas

### Engineering total current balance features with `process_tot_cur_bal_features()` in `/feature_engineering/nodes.py`

**Functionality:**
Generates normalized and interpretable features from the `tot_cur_bal` (total current balance) field for both regression and tree-based models.

**Inputs:**
- `tot_cur_bal` — total current balance across all accounts
- `annual_inc` — annual income
- `loan_amnt` — loan amount

**Outputs:**
- `log_tot_cur_bal` — log-transformed total current balance (used in regression)
- `cur_bal_to_income` — ratio of current balance to annual income (regression + tree)
- `cur_bal_to_loan` — ratio of current balance to loan amount (regression + tree)
- `tot_cur_bal_missing` — binary indicator for missing `tot_cur_bal` values (tree only)

**New field(s):**
- `log_tot_cur_bal` — **regression models**
- `cur_bal_to_income` — **regression and tree models**
- `cur_bal_to_loan` — **regression and tree models**
- `tot_cur_bal_missing` — **tree models only**

**Drop field(s):**
- `tot_cur_bal` — ✅ Can be dropped after these derived features are created, if not reused elsewhere

**Notes:**
- All division operations are safely handled with `clip` and `fillna`
- Log transformation helps normalize highly skewed distributions (for regression models)
- Tree models benefit from the missingness indicator to capture structural nulls
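
One way the `clip` and `fillna` guards could fit together is sketched below. The function name `sketch_tot_cur_bal_features`, the fill value of 0 for a missing balance, and the lower clip of 1 on the denominators are assumptions rather than the confirmed behavior of `process_tot_cur_bal_features()`.

```python
import numpy as np
import pandas as pd


def sketch_tot_cur_bal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for process_tot_cur_bal_features()."""
    out = df.copy()
    bal = out["tot_cur_bal"]

    # Record missingness before it is overwritten by the fill below
    out["tot_cur_bal_missing"] = bal.isna().astype(int)
    bal_filled = bal.fillna(0).clip(lower=0)

    # log1p(0) == 0, so empty balances stay finite
    out["log_tot_cur_bal"] = np.log1p(bal_filled)
    # Denominators are clipped to at least 1 to avoid division by zero
    out["cur_bal_to_income"] = (bal_filled / out["annual_inc"].clip(lower=1)).fillna(0)
    out["cur_bal_to_loan"] = (bal_filled / out["loan_amnt"].clip(lower=1)).fillna(0)
    return out


demo = pd.DataFrame(
    {
        "tot_cur_bal": [50000.0, np.nan, 0.0],
        "annual_inc": [100000.0, 40000.0, 0.0],
        "loan_amnt": [10000.0, 5000.0, 2000.0],
    }
)
result = sketch_tot_cur_bal_features(demo)
print(result)
```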


In [151]:
cols_for_reg = list(set(cols_for_reg + ["log_tot_cur_bal", "cur_bal_to_income", "cur_bal_to_loan"]))

cols_for_tree = list(set(cols_for_tree + ["cur_bal_to_income", "cur_bal_to_loan", "tot_cur_bal_missing"]))

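# Marks loan_amnt and installment for removal; tot_cur_bal itself is not added to drop_cols in this cell
drop_cols = list(set(drop_cols + ["loan_amnt", "installment"]))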

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'cur_bal_to_income', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'verification_status_final', 'inq_to_open_acc_ratio', 'loan_to_installment_ratio', 'delinq_weight', 'log_tot_cur_bal


### Engineering average current balance features with `process_avg_cur_bal()` in `/feature_engineering/nodes.py`

**Functionality:**
Processes and transforms the `avg_cur_bal` (average current balance) column to support tree-based and regression models.

**Inputs:**
- `avg_cur_bal` — average current balance across all accounts
- `open_acc` — number of open accounts (optional, for ratio)

**Outputs:**
- `avg_cur_bal_log` — log-transformed balance (regression)
- `avg_cur_bal_missing` — binary indicator for missingness (tree)
- `avg_bal_per_acc` — ratio of average balance per open account (tree + regression)

**New field(s):**
- `avg_cur_bal_log` — ✔️ **regression models**
- `avg_cur_bal_missing` — ✔️ **tree-based models**
- `avg_bal_per_acc` — ✔️ **used in both model types** if `open_acc` is available

**Drop field(s):**
- `avg_cur_bal` — ✅ **Safe to drop** after transformation
- `open_acc` — ❌ Drop only if it is unused elsewhere

**Notes:**
- Uses `log1p()` for numeric stability and skew reduction
- Missing values handled by creating explicit indicator (`avg_cur_bal_missing`)
- Ratio `avg_bal_per_acc` can signal account-level exposure or risk
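
A minimal sketch of the optional-ratio behavior, assuming a fill value of 0 for missing balances and a lower clip of 1 on `open_acc`. The function name `sketch_avg_cur_bal_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def sketch_avg_cur_bal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for process_avg_cur_bal()."""
    out = df.copy()
    bal = out["avg_cur_bal"]

    out["avg_cur_bal_missing"] = bal.isna().astype(int)
    bal_filled = bal.fillna(0).clip(lower=0)
    # log1p compresses the long right tail and is defined at zero
    out["avg_cur_bal_log"] = np.log1p(bal_filled)

    # The ratio is only built when open_acc is present in the frame
    if "open_acc" in out.columns:
        out["avg_bal_per_acc"] = (bal_filled / out["open_acc"].clip(lower=1)).fillna(0)
    return out


demo = pd.DataFrame({"avg_cur_bal": [12000.0, np.nan, 3000.0], "open_acc": [4, 0, 6]})
result = sketch_avg_cur_bal_features(demo)
without_open_acc = sketch_avg_cur_bal_features(demo[["avg_cur_bal"]])
print(result)
```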

In [152]:
cols_for_reg = list(set(cols_for_reg + ["avg_cur_bal_log", "avg_bal_per_acc"]))

cols_for_tree = list(set(cols_for_tree + ["avg_cur_bal_missing", "avg_bal_per_acc"]))

drop_cols = list(set(drop_cols + ["loan_amnt", "installment", "avg_cur_bal"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'avg_bal_per_acc', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'cur_bal_to_income', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'active_acc_ratio', 'avg_cur_bal_log', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'verification_status_final', 'inq_to_open_acc_ratio', 'loan_to_installment_rat

### Engineering recent inquiry features with `process_mths_since_recent_inq()` in `/feature_engineering/nodes.py`

**Functionality:**
Transforms the `mths_since_recent_inq` column into model-friendly features for both regression and tree-based models by capping the values and adding a missingness flag and a binary recency indicator.

**Inputs:**
- `mths_since_recent_inq` — number of months since the most recent credit inquiry

**Outputs:**
- `mths_since_recent_inq_missing` — binary flag for missing values (tree)
- `mths_since_recent_inq_capped` — capped numeric feature (regression)
- `had_recent_inquiry` — binary indicator if inquiry was within last 6 months (tree)

**New field(s):**
- `mths_since_recent_inq_missing` — ✔️ **tree-based models**
- `mths_since_recent_inq_capped` — ✔️ **regression models**
- `had_recent_inquiry` — ✔️ **tree-based models**

**Drop field(s):**
- `mths_since_recent_inq` — ✅ **Can be dropped** after transformation

**Notes:**
- Capping at 60 months limits outlier influence for regression
- Converts nulls to interpretable form for tree splits
- Binary indicator provides clear signal of recent credit activity
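
The three outputs can be sketched as below. The 60-month cap and the 6-month window come from the description above; filling missing values with the cap in the regression feature and the name `sketch_recent_inq_features` are assumptions.

```python
import numpy as np
import pandas as pd


def sketch_recent_inq_features(df: pd.DataFrame, cap: int = 60) -> pd.DataFrame:
    """Hypothetical stand-in for process_mths_since_recent_inq()."""
    out = df.copy()
    months = out["mths_since_recent_inq"]

    out["mths_since_recent_inq_missing"] = months.isna().astype(int)
    # Cap long gaps; missing is treated like "no inquiry for a long time" (assumption)
    out["mths_since_recent_inq_capped"] = months.clip(upper=cap).fillna(cap)
    # NaN compares as False, so missing rows get 0 here
    out["had_recent_inquiry"] = (months <= 6).astype(int)
    return out


demo = pd.DataFrame({"mths_since_recent_inq": [1.0, 6.0, 30.0, 120.0, np.nan]})
result = sketch_recent_inq_features(demo)
print(result)
```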


In [153]:
cols_for_reg = list(set(cols_for_reg + ["mths_since_recent_inq_capped"]))

cols_for_tree = list(set(cols_for_tree + ["mths_since_recent_inq_missing", "had_recent_inquiry"]))

drop_cols = list(set(drop_cols + ["mths_since_recent_inq"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'avg_bal_per_acc', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'cur_bal_to_income', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'mths_since_recent_inq_capped', 'active_acc_ratio', 'avg_cur_bal_log', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'verification_status_final', 'inq_to_open_acc_

### Engineering recent trade line features with `process_num_tl_op_past_12m()` in `/feature_engineering/nodes.py`

**Functionality:**
Processes the `num_tl_op_past_12m` field (number of trade lines opened in past 12 months) for use in regression and tree-based models by handling invalid values, adding missing flags, and capping.

**Inputs:**
- `num_tl_op_past_12m` — number of new credit trade lines opened in the past 12 months

**Outputs:**
- `num_tl_op_past_12m_missing` — binary flag for missing values (tree)
- `num_tl_op_past_12m_capped` — numeric version capped at 10 (regression)

**New field(s):**
- `num_tl_op_past_12m_missing` — ✔️ **tree-based models**
- `num_tl_op_past_12m_capped` — ✔️ **regression models**

**Drop field(s):**
- `num_tl_op_past_12m` — ✅ **Can be dropped** after transformation

**Notes:**
- Negative values are treated as missing
- A cap of 10 reflects practical business logic and limits the influence of rare, very high counts
- Useful for capturing borrower credit-seeking behavior in the past year
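
The "negative values are treated as missing" step is the part most easily gotten wrong, so a short sketch may help. The name `sketch_num_tl_features` and the fill value of 0 in the capped feature are assumptions.

```python
import numpy as np
import pandas as pd


def sketch_num_tl_features(df: pd.DataFrame, cap: int = 10) -> pd.DataFrame:
    """Hypothetical stand-in for process_num_tl_op_past_12m()."""
    out = df.copy()
    counts = out["num_tl_op_past_12m"].astype(float)
    # Invalid negatives become NaN *before* the flag is computed,
    # so they are counted as missing rather than as real values
    counts = counts.where(counts >= 0)

    out["num_tl_op_past_12m_missing"] = counts.isna().astype(int)
    out["num_tl_op_past_12m_capped"] = counts.clip(upper=cap).fillna(0)
    return out


demo = pd.DataFrame({"num_tl_op_past_12m": [0, 3, 15, -1, np.nan]})
result = sketch_num_tl_features(demo)
print(result)
```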

In [154]:
cols_for_reg = list(set(cols_for_reg + ["num_tl_op_past_12m_capped"]))

cols_for_tree = list(set(cols_for_tree + ["num_tl_op_past_12m_missing"]))

drop_cols = list(set(drop_cols + ["num_tl_op_past_12m"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'avg_bal_per_acc', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'cur_bal_to_income', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'mths_since_recent_inq_capped', 'active_acc_ratio', 'num_tl_op_past_12m_capped', 'avg_cur_bal_log', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_status_flag', 'verification_sta

### Engineering bankruptcy features with `process_pub_rec_bankruptcies()` in `/feature_engineering/nodes.py`

**Functionality:**
Processes the `pub_rec_bankruptcies` field to support both regression and tree-based models through cleaning, missing value handling, and derived risk indicators.

**Inputs:**
- `pub_rec_bankruptcies` — number of public records of bankruptcy (may contain nulls or negatives)

**Outputs:**
- `pub_rec_bankruptcies_missing` — binary flag for missing values (tree)
- `pub_rec_bankruptcies_capped` — version clipped at 3 for outlier handling (regression)
- `has_bankruptcy` — binary feature for whether applicant has ever declared bankruptcy (used in both models)

**New field(s):**
- `pub_rec_bankruptcies_missing` — ✔️ **tree-based models**
- `pub_rec_bankruptcies_capped` — ✔️ **regression models**
- `has_bankruptcy` — ✔️ **tree and regression models**

**Drop field(s):**
- `pub_rec_bankruptcies` — ✅ **Can be dropped** after transformation

**Notes:**
- Negative values are coerced to `NaN`
- Capping at 3 prevents distortion from rare extreme values
- `has_bankruptcy` is a strong signal for credit risk in many scoring models
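
A sketch of how the three bankruptcy outputs relate to one another. The name `sketch_bankruptcy_features` is hypothetical, and treating missing records as "no bankruptcy" in `has_bankruptcy` and in the capped count is an assumption.

```python
import numpy as np
import pandas as pd


def sketch_bankruptcy_features(df: pd.DataFrame, cap: int = 3) -> pd.DataFrame:
    """Hypothetical stand-in for process_pub_rec_bankruptcies()."""
    out = df.copy()
    records = out["pub_rec_bankruptcies"].astype(float)
    # Negative counts are coerced to NaN
    records = records.where(records >= 0)

    out["pub_rec_bankruptcies_missing"] = records.isna().astype(int)
    out["pub_rec_bankruptcies_capped"] = records.clip(upper=cap).fillna(0)
    # Derived from the uncapped count; NaN compares as False and maps to 0
    out["has_bankruptcy"] = (records > 0).astype(int)
    return out


demo = pd.DataFrame({"pub_rec_bankruptcies": [0, 1, 7, -2, np.nan]})
result = sketch_bankruptcy_features(demo)
print(result)
```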

In [155]:
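# has_bankruptcy is listed above as a feature for both model types, so register it in both lists
cols_for_reg = list(set(cols_for_reg + ["pub_rec_bankruptcies_capped", "has_bankruptcy"]))

cols_for_tree = list(set(cols_for_tree + ["pub_rec_bankruptcies_missing", "has_bankruptcy"]))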

drop_cols = list(set(drop_cols + ["pub_rec_bankruptcies"]))

print(f"Columns for the regression-based models: {cols_for_reg}")
print(f"Columns for the tree-based models: {cols_for_tree}")
print(f"Columns to drop {drop_cols}")

Columns for the regression-based models: ['util_normalized_reg', 'major_derog_score', 'credit_age_months', 'avg_bal_per_acc', 'term_rate_interaction_reg', 'income_log_reg', 'util_buckets_reg', 'delinq_2yrs_reg', 'pub_rec', 'term_normalized_reg', 'revol_util_reg', 'recent_acc_ratio', 'revol_bal_final', 'annual_inc_final', 'sub_grade_encoded', 'cur_bal_to_income', 'interest_burden_pct', 'payment_to_income', 'has_hardship', 'has_derogatory', 'pub_rec_bankruptcies_capped', 'fico_average', 'mths_since_last_record_filled', 'mortgage_ratio', 'mths_since_recent_inq_capped', 'active_acc_ratio', 'num_tl_op_past_12m_capped', 'avg_cur_bal_log', 'dti_log_reg', 'inst_to_revol_ratio', 'int_rate', 'recent_inquiry_intensity', 'mths_since_last_major_derog_filled', 'income_to_loan_reg', 'dti_normalized_reg', 'was_late_before_hardship', 'hardship_dpd_filled', 'emp_length_clean_reg', 'grade_encoded', 'loan_status_binary', 'delinq_severity_reg', 'purpose_risk_score', 'home_ownership_ordinal', 'initial_list_