# 🧠 Feature Engineering - Credit Risk Modeling Project

This document outlines the complete **feature engineering workflow** applied to the Home Credit Default Risk dataset. The goal is to transform raw data into meaningful features that improve model performance and interpretability.

---

## 🔹 1. Data Cleaning

- Converted the `DAYS_*` columns, which are stored as negative day offsets relative to the application date, into positive durations (e.g., `DAYS_BIRTH` to `AGE_YEARS`).
- Handled missing values for critical features such as `EXT_SOURCE_2`, `EXT_SOURCE_3`, and `OCCUPATION_TYPE` using imputation strategies.
- Ensured consistent binary flag formatting (0/1) across all `FLAG_*` columns.
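
The cleaning steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual code: the median fill, the `"Unknown"` label, and the `Y`/`N` mapping are assumptions, since the document does not say which imputation strategies were used.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the application table (values are made up).
df = pd.DataFrame({
    "DAYS_BIRTH": [-14600, -10950, -18250],
    "EXT_SOURCE_2": [0.61, np.nan, 0.35],
    "OCCUPATION_TYPE": ["Laborers", None, "Managers"],
    "FLAG_OWN_REALTY": ["Y", "N", "Y"],
})

# Negative day offsets become positive durations in years.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365

# Assumed imputation: median for a numeric score, explicit label for a category.
df["EXT_SOURCE_2"] = df["EXT_SOURCE_2"].fillna(df["EXT_SOURCE_2"].median())
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")

# Normalize a Y/N flag to integer 0/1.
df["FLAG_OWN_REALTY"] = df["FLAG_OWN_REALTY"].map({"N": 0, "Y": 1}).astype("int8")
```

On real data, fill values such as the median would normally be computed on the training split only and then reused for validation and test rows.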

---

## 🔹 2. Categorical Encoding

- **Label Encoding** for binary categorical variables (e.g., `CODE_GENDER`, `FLAG_OWN_CAR`, `FLAG_OWN_REALTY`).
- **One-Hot Encoding** for nominal variables (e.g., `NAME_CONTRACT_TYPE`, `NAME_EDUCATION_TYPE`, `NAME_FAMILY_STATUS`).
- **Frequency Encoding** for high-cardinality variables like `ORGANIZATION_TYPE`.
- Converted `WEEKDAY_APPR_PROCESS_START` to numerical day of the week (Mon=0 to Sun=6).
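
A short sketch of the four encodings listed above, on a toy frame. The category values, the uppercase weekday spelling, the `ORGANIZATION_TYPE_FREQ` column name, and the choice of row share (rather than raw count) for frequency encoding are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "Y", "N"],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans", "Cash loans"],
    "ORGANIZATION_TYPE": ["School", "Bank", "School", "Police"],
    "WEEKDAY_APPR_PROCESS_START": ["MONDAY", "SUNDAY", "WEDNESDAY", "FRIDAY"],
})

# Label encoding for a binary column: an explicit map keeps the 0/1 meaning visible.
df["FLAG_OWN_CAR"] = df["FLAG_OWN_CAR"].map({"N": 0, "Y": 1})

# Frequency encoding: replace each category with its share of rows.
freq = df["ORGANIZATION_TYPE"].value_counts(normalize=True)
df["ORGANIZATION_TYPE_FREQ"] = df["ORGANIZATION_TYPE"].map(freq)

# Weekday names to integers, Monday=0 through Sunday=6.
weekdays = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"]
weekday_codes = {name: i for i, name in enumerate(weekdays)}
df["WEEKDAY_APPR_PROCESS_START"] = df["WEEKDAY_APPR_PROCESS_START"].map(weekday_codes)

# One-hot encoding for a nominal column (one 0/1 column per category).
df = pd.get_dummies(df, columns=["NAME_CONTRACT_TYPE"], dtype=int)
```

Frequency tables like `freq` are best fitted on the training split and reused elsewhere, so that validation rows do not influence the encoding.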

---

## 🔹 3. Numerical Feature Transformation

Derived ratio and duration features that relate the raw amounts and day counts to one another:

- `INCOME_PER_PERSON` = `AMT_INCOME_TOTAL` / `CNT_FAM_MEMBERS`
- `CREDIT_TO_INCOME` = `AMT_CREDIT` / `AMT_INCOME_TOTAL`
- `ANNUITY_TO_INCOME` = `AMT_ANNUITY` / `AMT_INCOME_TOTAL`
- `ANNUITY_TO_CREDIT` = `AMT_ANNUITY` / `AMT_CREDIT`
- `GOODS_TO_CREDIT` = `AMT_GOODS_PRICE` / `AMT_CREDIT`
- `AGE_YEARS` = `-DAYS_BIRTH / 365`
- `EMPLOYMENT_YEARS` = `-DAYS_EMPLOYED / 365`
- `YEARS_SINCE_REGISTRATION` = `-DAYS_REGISTRATION / 365`
- `YEARS_ID_ISSUED` = `-DAYS_ID_PUBLISH / 365`
- `WORKING_LIFE_RATIO` = `EMPLOYMENT_YEARS / AGE_YEARS`
- `TIME_SINCE_LAST_PHONE_CHANGE` = `-DAYS_LAST_PHONE_CHANGE / 365`
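
The formulas above translate almost directly into pandas. The sketch below computes a subset of them on made-up rows. One assumption goes beyond the document: the value `365243` in `DAYS_EMPLOYED` is treated as a placeholder and replaced with `NaN`, because it would otherwise produce a large negative `EMPLOYMENT_YEARS`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [180000.0, 90000.0],
    "AMT_CREDIT": [540000.0, 270000.0],
    "AMT_ANNUITY": [27000.0, 13500.0],
    "CNT_FAM_MEMBERS": [2.0, 1.0],
    "DAYS_BIRTH": [-14600, -10950],
    "DAYS_EMPLOYED": [-3650, 365243],
})

# Ratios between monetary amounts.
df["INCOME_PER_PERSON"] = df["AMT_INCOME_TOTAL"] / df["CNT_FAM_MEMBERS"]
df["CREDIT_TO_INCOME"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
df["ANNUITY_TO_INCOME"] = df["AMT_ANNUITY"] / df["AMT_INCOME_TOTAL"]

# Assumption: 365243 is a placeholder, not a real day count, so mark it missing.
days_employed = df["DAYS_EMPLOYED"].replace(365243, np.nan)

# Day offsets are negative, so negate before converting to years.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365
df["EMPLOYMENT_YEARS"] = -days_employed / 365
df["WORKING_LIFE_RATIO"] = df["EMPLOYMENT_YEARS"] / df["AGE_YEARS"]
```

Rows with a missing `EMPLOYMENT_YEARS` keep a missing `WORKING_LIFE_RATIO`, which tree-based models such as XGBoost and LightGBM can consume directly.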

---

## 🔹 4. Flag Aggregation

Combined multiple binary flags to create summary features:

- `PHONE_FLAGS_SUM` = Sum of `FLAG_EMP_PHONE`, `FLAG_WORK_PHONE`, `FLAG_CONT_MOBILE`, `FLAG_PHONE`, `FLAG_MOBIL`, `FLAG_EMAIL`
- `REGION_MISMATCH_SUM` = Sum of `REG_REGION_NOT_LIVE_REGION`, `REG_REGION_NOT_WORK_REGION`, `LIVE_REGION_NOT_WORK_REGION`, `REG_CITY_NOT_LIVE_CITY`, `REG_CITY_NOT_WORK_CITY`, `LIVE_CITY_NOT_WORK_CITY`
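
Both aggregates are row-wise sums over a fixed list of 0/1 columns. A minimal sketch on two made-up rows:

```python
import pandas as pd

phone_flags = [
    "FLAG_EMP_PHONE", "FLAG_WORK_PHONE", "FLAG_CONT_MOBILE",
    "FLAG_PHONE", "FLAG_MOBIL", "FLAG_EMAIL",
]
region_flags = [
    "REG_REGION_NOT_LIVE_REGION", "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION", "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY", "LIVE_CITY_NOT_WORK_CITY",
]

# Two toy applicants; each list holds one 0/1 value per flag, in the order above.
rows = [
    dict(zip(phone_flags + region_flags, [1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])),
    dict(zip(phone_flags + region_flags, [0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0])),
]
df = pd.DataFrame(rows)

# Count how many flags in each group are set for every applicant.
df["PHONE_FLAGS_SUM"] = df[phone_flags].sum(axis=1)
df["REGION_MISMATCH_SUM"] = df[region_flags].sum(axis=1)
```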

---

## 🔹 5. Social Circle Features

Aggregated social circle statistics:

- `SOCIAL_OBS_TOTAL` = `OBS_30_CNT_SOCIAL_CIRCLE` + `OBS_60_CNT_SOCIAL_CIRCLE`
- `SOCIAL_DEF_TOTAL` = `DEF_30_CNT_SOCIAL_CIRCLE` + `DEF_60_CNT_SOCIAL_CIRCLE`
- `SOCIAL_DEF_RATIO` = `SOCIAL_DEF_TOTAL` / `SOCIAL_OBS_TOTAL` (handled divide-by-zero cases)
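
The document does not say how divide-by-zero cases were handled. One possible convention, shown here as an assumption, is to set the ratio to 0 when no social-circle observations exist:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "OBS_30_CNT_SOCIAL_CIRCLE": [2.0, 0.0],
    "OBS_60_CNT_SOCIAL_CIRCLE": [2.0, 0.0],
    "DEF_30_CNT_SOCIAL_CIRCLE": [1.0, 0.0],
    "DEF_60_CNT_SOCIAL_CIRCLE": [0.0, 0.0],
})

df["SOCIAL_OBS_TOTAL"] = df["OBS_30_CNT_SOCIAL_CIRCLE"] + df["OBS_60_CNT_SOCIAL_CIRCLE"]
df["SOCIAL_DEF_TOTAL"] = df["DEF_30_CNT_SOCIAL_CIRCLE"] + df["DEF_60_CNT_SOCIAL_CIRCLE"]

# Replace a zero denominator with NaN so the division yields NaN instead of inf,
# then apply the assumed convention: no observations means a ratio of 0.
ratio = df["SOCIAL_DEF_TOTAL"] / df["SOCIAL_OBS_TOTAL"].replace(0, np.nan)
df["SOCIAL_DEF_RATIO"] = ratio.fillna(0.0)
```

Note that `fillna(0.0)` also zeroes rows whose inputs were missing to begin with. If that distinction matters, leaving the ratio as `NaN` is a reasonable alternative.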

---

## 🔹 6. Document Flags Aggregation

- `NUM_DOCUMENTS_PROVIDED` = Sum of all `FLAG_DOCUMENT_*` features
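
Selecting the columns by prefix avoids listing every document flag by hand. A minimal sketch with three made-up document columns:

```python
import pandas as pd

df = pd.DataFrame({
    "FLAG_DOCUMENT_2": [0, 0, 1],
    "FLAG_DOCUMENT_3": [1, 0, 1],
    "FLAG_DOCUMENT_4": [0, 0, 0],
    "FLAG_EMAIL": [1, 1, 1],  # not a document flag, must be excluded
})

# Pick every column whose name starts with the document-flag prefix.
doc_cols = [c for c in df.columns if c.startswith("FLAG_DOCUMENT_")]
df["NUM_DOCUMENTS_PROVIDED"] = df[doc_cols].sum(axis=1)
```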

---

## 🔹 7. Interaction Features

Created features that combine external scores and financial ratios:

- `EXT_SOURCE_AVG` = Mean of `EXT_SOURCE_2` and `EXT_SOURCE_3`
- `EXT_SOURCE_MIN`, `EXT_SOURCE_MAX` = Min and Max of `EXT_SOURCE_2`, `EXT_SOURCE_3`
- `CREDIT_INCOME_EXT_SOURCE` = (`AMT_CREDIT` / `AMT_INCOME_TOTAL`) * `EXT_SOURCE_AVG`
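
A sketch of these interaction features on made-up rows. It relies on the pandas default of skipping `NaN` in row-wise `mean`, `min`, and `max`, so a row with only one score available still gets a value. Whether the project imputed the scores first or relied on this behavior is not stated, so treat that as an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "EXT_SOURCE_2": [0.6, 0.4, np.nan],
    "EXT_SOURCE_3": [0.2, np.nan, np.nan],
    "AMT_CREDIT": [400000.0, 200000.0, 100000.0],
    "AMT_INCOME_TOTAL": [100000.0, 100000.0, 100000.0],
})

ext = df[["EXT_SOURCE_2", "EXT_SOURCE_3"]]

# Row-wise statistics; NaN entries are skipped, all-NaN rows stay NaN.
df["EXT_SOURCE_AVG"] = ext.mean(axis=1)
df["EXT_SOURCE_MIN"] = ext.min(axis=1)
df["EXT_SOURCE_MAX"] = ext.max(axis=1)

# Credit-to-income ratio scaled by the average external score.
df["CREDIT_INCOME_EXT_SOURCE"] = (
    df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"] * df["EXT_SOURCE_AVG"]
)
```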

---

## 🔹 8. Temporal Features

- `APPLICATION_HOUR_BIN` = Categorized `HOUR_APPR_PROCESS_START` into time bins (Night, Morning, Afternoon, Evening)
- Converted `WEEKDAY_APPR_PROCESS_START` into integer codes (0–6, with Mon=0 and Sun=6), the same mapping listed under Categorical Encoding
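
The hour binning can be sketched with `pd.cut`. The bin edges below (Night 0 to 5, Morning 6 to 11, Afternoon 12 to 17, Evening 18 to 23) are an assumption, since the document names the bins but not their boundaries.

```python
import pandas as pd

df = pd.DataFrame({"HOUR_APPR_PROCESS_START": [3, 9, 12, 17, 23]})

# Left-closed intervals: [0, 6), [6, 12), [12, 18), [18, 24).
df["APPLICATION_HOUR_BIN"] = pd.cut(
    df["HOUR_APPR_PROCESS_START"],
    bins=[0, 6, 12, 18, 24],
    labels=["Night", "Morning", "Afternoon", "Evening"],
    right=False,
)
```

The result is a categorical column, so it still needs one of the encodings from the Categorical Encoding section before it reaches a model that expects numeric input.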

---

## 🔹 9. Outlier Handling

- Used domain knowledge and the IQR method to flag outliers in:
  - `AMT_INCOME_TOTAL` → `is_outlier_Income`
  - `AMT_CREDIT` → `is_outlier_Credit`
- Log-transformed skewed variables such as `AMT_INCOME_TOTAL` and `AMT_CREDIT`.
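
A minimal sketch of IQR flagging and the log transform for the income column. The conventional fence multiplier `k=1.5`, the helper name `iqr_outlier_flag`, the use of `np.log1p`, and the output column name `AMT_INCOME_TOTAL_LOG` are all assumptions chosen for illustration; the document does not specify them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [90000.0, 112500.0, 135000.0, 157500.0, 180000.0, 9000000.0],
})

def iqr_outlier_flag(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return 1 where a value lies outside [Q1 - k*IQR, Q3 + k*IQR], else 0."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return ((s < q1 - k * iqr) | (s > q3 + k * iqr)).astype(int)

# Flag rather than drop, so the model can still use the row.
df["is_outlier_Income"] = iqr_outlier_flag(df["AMT_INCOME_TOTAL"])

# log1p compresses the long right tail and is defined at zero.
df["AMT_INCOME_TOTAL_LOG"] = np.log1p(df["AMT_INCOME_TOTAL"])
```

The same helper applies unchanged to `AMT_CREDIT` to produce `is_outlier_Credit`.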

---

## ✅ Final Feature Matrix

After the above transformations, the final dataset included:

- Original base features
- 20+ engineered features
- Cleaned and encoded categorical variables
- Aggregated and interaction-based enhancements

This feature matrix was used to train several models, including Logistic Regression, Random Forest, XGBoost, and LightGBM.

---

