# Investigation plan

## Home Credit Default Risk Project

### 1. Introduction

We want to build a **proof-of-concept** for a credit risk prediction service that **banks could eventually use in real life.** 

The dataset comes **from Home Credit** and is widely used as **a benchmark for loan default modeling.**

The idea is that **if we can demonstrate strong results on this dataset**, we’ll have **something credible to show potential clients** when we meet them.

**The plan below is a step-by-step outline** of how we’ll approach this. It covers how we’ll explore the data, engineer useful features, train and compare different models, and finally prepare a working prototype that can be deployed. The goal is to move in stages: first understand the data, then build and evaluate models, and finally produce a deployable pipeline that highlights our ability to deliver accurate, interpretable, and practical risk assessments.

This way, we’ll not only have strong technical results but also a concrete demonstration of how our service could work for a bank.

**Assumptions**

Banks care about early risk detection and predictive power, so discriminative performance (measured by ROC-AUC) will be critical.

Time to build and deploy matters, so we need to balance experimentation with practicality.

Kaggle’s Home Credit dataset is a good proxy for what banks’ data might look like (rich, tabular, multiple related tables).

A diverse set of models will give us flexibility when presenting to clients.

**Overall Objectives**

Understand the dataset thoroughly: distributions, anomalies, correlations, missingness.

Engineer meaningful features combining statistical patterns and domain logic.

Train and compare multiple models (both linear and tree-based).

Optimize and validate top models.

Demonstrate a deployable proof-of-concept pipeline.

### Step-by-Step Plan

#### 1. Background research

Read articles, blogs, and Kaggle discussions about the Home Credit dataset and competition to understand challenges and common approaches.

#### 2. Initial dataset exploration

Load the data, check its size and structure.

Run basic EDA: feature types, missing values, dataset shapes.

Train quick “baseline” models without preprocessing to check feasibility (can our hardware handle the data size, and what is a baseline score?).
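A minimal sketch of the structure and missingness checks in this step, assuming pandas. The toy frame stands in for the main application table: its values are made up, the column names are assumed to match the Kaggle files, and `summarize_frame` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, share of missing values, cardinality."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns with the most gaps come first.
    return summary.sort_values("missing_share", ascending=False)


# Toy stand-in for the application table (hypothetical values).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4],
    "AMT_INCOME_TOTAL": [100000.0, np.nan, 250000.0, 90000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", None, "Revolving loans"],
    "OWN_CAR_AGE": [np.nan, np.nan, np.nan, 12.0],
})

report = summarize_frame(toy)
print(toy.shape)
print(report)
```

On the real files the same helper could be run once per table to decide which columns deserve a closer look.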

#### 3. Train/test comparison

Compare distributions in train vs test to detect strong dataset shifts.

Explore fixes for mismatched features.
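One possible way to screen numeric features for train/test shift is a two-sample Kolmogorov-Smirnov test per column. This is a sketch under assumptions: the plan does not name a specific method, the data below is synthetic, and `numeric_shift_report` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def numeric_shift_report(train, test, columns):
    """Rank numeric columns by the KS distance between train and test."""
    rows = []
    for col in columns:
        stat, p_value = ks_2samp(train[col].dropna(), test[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    report = pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
    return report.reset_index(drop=True)


# Synthetic example: one stable feature and one with a shifted mean.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "stable": rng.normal(0.0, 1.0, 2000),
    "shifted": rng.normal(0.0, 1.0, 2000),
})
test = pd.DataFrame({
    "stable": rng.normal(0.0, 1.0, 2000),
    "shifted": rng.normal(1.0, 1.0, 2000),
})

shift = numeric_shift_report(train, test, ["stable", "shifted"])
print(shift)
```

With large samples even tiny differences produce small p-values, so the KS statistic itself is usually the more useful column for ranking.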

#### 4. Deeper exploratory analysis (train/test)

Generate automated reports (Sweetviz, ydata-profiling).

Investigate missing values and decide which to drop, impute, or flag.

Check for duplicates and data consistency.

Inspect important “days” features (birth, employment, etc.) for anomalies.

Calculate correlations: start with Pearson and Chi², then use PHIK for mixed numeric/categorical data.

Identify the 15–25 most correlated features.

Plot distributions (histograms, boxplots, feature vs target).

Flag any remaining anomalies.

#### 5. Statistical analysis

Run hypothesis testing where useful.
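As an illustration of what such a test could look like, the sketch below runs a chi-square test of independence between a categorical feature and the default flag. The data is simulated, the default rates are invented for the example, and the choice of `FLAG_OWN_CAR` as the feature is an assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Simulated data: default rates of 6% vs 10% are made up for illustration.
rng = np.random.default_rng(42)
n = 20000
owns_car = rng.choice(["Y", "N"], size=n)
default_prob = np.where(owns_car == "Y", 0.06, 0.10)
df = pd.DataFrame({
    "FLAG_OWN_CAR": owns_car,
    "TARGET": (rng.random(n) < default_prob).astype(int),
})

# H0: car ownership and default are independent.
table = pd.crosstab(df["FLAG_OWN_CAR"], df["TARGET"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```

A small p-value only says the association is unlikely to be chance; the default-rate difference in `table` shows whether it is large enough to matter.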

#### 6. Feature engineering

Begin forming feature engineering ideas from EDA + domain knowledge.

#### 7. Feature engineering (additional datasets)

Investigate additional datasets (bureau, previous applications, etc.).

For each: run basic EDA, anomaly checks, auto-reports (Sweetviz, ydata-profiling).

Brainstorm new features: ratios, flags, aggregations, time transformations.

Decide on an aggregation strategy for linking auxiliary datasets back to the main table (select the ~50–80% most useful features, preserve domain-critical ones).
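The aggregation idea can be sketched with pandas: collapse the one-to-many auxiliary table to one row per applicant, then left-join it onto the main table. The frames below are toy data, the column names are assumed to match the Kaggle files, and the `BUREAU_*` feature names are hypothetical.

```python
import pandas as pd

# Toy stand-ins: several bureau records per applicant (hypothetical values).
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT_SUM": [1000.0, 3000.0, 500.0, 1500.0, 4000.0],
    "DAYS_CREDIT": [-400, -100, -900, -300, -50],
    "CREDIT_ACTIVE": ["Active", "Closed", "Closed", "Active", "Active"],
})
application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})

# One row per applicant: counts, averages, recency, and a share-type flag.
bureau_agg = (
    bureau.assign(IS_ACTIVE=bureau["CREDIT_ACTIVE"].eq("Active").astype(int))
    .groupby("SK_ID_CURR")
    .agg(
        BUREAU_LOAN_COUNT=("AMT_CREDIT_SUM", "count"),
        BUREAU_CREDIT_SUM_MEAN=("AMT_CREDIT_SUM", "mean"),
        BUREAU_DAYS_CREDIT_MAX=("DAYS_CREDIT", "max"),
        BUREAU_ACTIVE_SHARE=("IS_ACTIVE", "mean"),
    )
    .reset_index()
)

# Left join keeps applicants with no bureau history (their features are NaN).
enriched = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(enriched)
```

The left join matters here: applicants without any auxiliary records stay in the table, and their missing aggregates can later be turned into an explicit flag.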

#### 8. Preprocessing strategy (train/test)

Apply transformations informed by EDA:

Downcast numerics for memory efficiency.

Handle date anomalies.

Convert DAYS to YEARS for interpretability.

Collapse rare categories.

Add missing-value flags.

Create/correct flags where needed.

Build predictive domain-inspired features.
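Several of the transformations listed above can be sketched in one function. This is an illustration under assumptions, not the final preprocessing: it assumes DAYS columns are negative offsets from the application date, that `DAYS_EMPLOYED` uses 365243 as a placeholder value (commonly reported for this dataset, to be confirmed during EDA), and the `rare_threshold` default and the `preprocess` name are hypothetical.

```python
import pandas as pd
import numpy as np

# Assumption: placeholder value reported for DAYS_EMPLOYED; verify in EDA.
DAYS_EMPLOYED_PLACEHOLDER = 365243


def preprocess(df, categorical_cols, rare_threshold=0.01):
    out = df.copy()

    # Add missing-value flags for columns that have gaps in the raw data.
    for col in [c for c in df.columns if df[c].isna().any()]:
        out[f"{col}_MISSING"] = df[col].isna().astype("int8")

    # Handle the date anomaly: flag the placeholder, then blank it out.
    anomalous = out["DAYS_EMPLOYED"].eq(DAYS_EMPLOYED_PLACEHOLDER)
    out["DAYS_EMPLOYED_ANOM"] = anomalous.astype("int8")
    out["DAYS_EMPLOYED"] = out["DAYS_EMPLOYED"].astype("float64").mask(anomalous)

    # Convert DAYS (assumed negative offsets) to positive YEARS.
    for col in ["DAYS_BIRTH", "DAYS_EMPLOYED"]:
        out[col.replace("DAYS_", "YEARS_")] = -out[col] / 365.25

    # Collapse categories rarer than the threshold into a single level.
    for col in categorical_cols:
        share = out[col].value_counts(normalize=True)
        rare = share[share < rare_threshold].index
        out[col] = out[col].where(~out[col].isin(rare), "Rare")

    # Downcast float64 columns to float32 to save memory.
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype("float32")
    return out


# Toy data (hypothetical values).
toy = pd.DataFrame({
    "DAYS_BIRTH": [-12000, -20000, -15000, -9000],
    "DAYS_EMPLOYED": [-2000, 365243, -500, -100],
    "AMT_ANNUITY": [24000.0, np.nan, 18000.0, 30000.0],
    "ORGANIZATION_TYPE": ["Business", "Business", "Business", "Tiny"],
})
clean = preprocess(toy, categorical_cols=["ORGANIZATION_TYPE"], rare_threshold=0.3)
print(clean.T)
```

Keeping all steps in one function makes it easier to apply exactly the same logic to train and test, which is also what the deployment step will need.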

#### 9. Modeling — selecting best models

Define evaluation metrics: ROC-AUC as the main one, but also track training time, precision-recall, and F1.

Train 5–6 baseline models on preprocessed data (e.g., Logistic Regression, LightGBM, CatBoost, XGBoost).

Pick the 2 strongest candidates.

#### 10. Modeling — enriched dataset

Merge engineered auxiliary datasets into main dataset.

Retrain the 2 best models and evaluate the improvement.

#### 11. Feature selection

Use LightGBM feature importance as the primary method.

Select features with importance > 0, then experiment with top-N cutoffs (e.g., 140, 170).
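The selection rule can be sketched independently of the model that produces the importances. In this example scikit-learn's `GradientBoostingClassifier` stands in for LightGBM (an assumption made to keep the snippet self-contained), the data is synthetic, and `select_by_importance` is a hypothetical helper.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def select_by_importance(feature_names, importances, top_n=None):
    """Keep features with importance > 0, optionally only the top_n of them."""
    ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
    ranked = ranked[ranked > 0]
    if top_n is not None:
        ranked = ranked.head(top_n)
    return ranked.index.tolist()


# Synthetic data: 5 informative features followed by 25 noise features.
X, y = make_classification(
    n_samples=1500, n_features=30, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"f{i}" for i in range(X.shape[1])]

# Stand-in model; with LightGBM the importances would come from the fitted booster.
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
model.fit(X, y)

nonzero = select_by_importance(names, model.feature_importances_)
top5 = select_by_importance(names, model.feature_importances_, top_n=5)
print(len(nonzero), top5)
```

Each candidate cutoff would then be judged by retraining on the reduced feature set and comparing cross-validated ROC-AUC.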

#### 12. Hyperparameter optimization

Use Optuna with cross-validation to tune top models (LightGBM, XGBoost) on selected features.

#### 13. Model training & evaluation

Train optimized models.

Evaluate with ROC, precision-recall curves, calibration curves.

Use SHAP to interpret top 15–20 features.
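The ranking and calibration metrics for this step can be computed with scikit-learn, as sketched below. The model and data are placeholders (logistic regression on synthetic data), not the tuned models; the SHAP analysis is left out of the sketch.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the final feature set.
X, y = make_classification(
    n_samples=4000, n_features=15, n_informative=5,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Placeholder model; the tuned boosters would be evaluated the same way.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_valid)[:, 1]

metrics = {
    "roc_auc": roc_auc_score(y_valid, proba),
    "avg_precision": average_precision_score(y_valid, proba),
    "brier": brier_score_loss(y_valid, proba),
}

# Points of the calibration curve: observed default rate vs mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_valid, proba, n_bins=10, strategy="quantile")

print({k: round(v, 4) for k, v in metrics.items()})
for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted={predicted:.3f} observed={observed:.3f}")
```

Quantile bins are used because default probabilities cluster near zero, which would leave most uniform-width bins nearly empty.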

#### 14. Ensembling

Combine best models into ensembles, with and without calibration.

Compare ensemble results on the test set.

Analyze confusion matrices and threshold tuning options.
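A simple form of ensembling is a (weighted) average of predicted probabilities, followed by a threshold search. The sketch below assumes that approach and picks the threshold that maximizes F1; both choices are illustrative, the scores are simulated, and `blend` and `best_f1_threshold` are hypothetical helpers.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_auc_score


def blend(probas, weights=None):
    """Weighted average of probability vectors from several models."""
    return np.average(np.column_stack(probas), axis=1, weights=weights)


def best_f1_threshold(y_true, proba):
    """Return the decision threshold with the highest F1 score."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # The final precision/recall pair has no matching threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    return thresholds[np.argmax(f1)]


# Simulated labels and two noisy model outputs (hypothetical).
rng = np.random.default_rng(7)
y_true = (rng.random(3000) < 0.1).astype(int)
signal = 1.5 * y_true - 2.0
p_a = 1 / (1 + np.exp(-(signal + rng.normal(0, 1, 3000))))
p_b = 1 / (1 + np.exp(-(signal + rng.normal(0, 1, 3000))))

p_blend = blend([p_a, p_b])
threshold = best_f1_threshold(y_true, p_blend)
cm = confusion_matrix(y_true, (p_blend >= threshold).astype(int))

auc_a = roc_auc_score(y_true, p_a)
auc_b = roc_auc_score(y_true, p_b)
auc_blend = roc_auc_score(y_true, p_blend)
print(f"AUC a={auc_a:.3f} b={auc_b:.3f} blend={auc_blend:.3f}")
print(f"threshold={threshold:.3f}")
print(cm)
```

In practice the threshold would be chosen from the business cost of false positives versus missed defaults rather than from F1 alone, and it should be tuned on validation data, not on the test set.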

#### 15. Submission & validation

Submit test-set predictions to Kaggle.

Track performance vs baseline.

#### 16. Deployment

Save final ensemble model(s) for deployment.

Build preprocessing + inference pipeline.

Prepare Docker container and test locally.

Deploy to Google Cloud.
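One way to keep preprocessing and inference together is a single scikit-learn `Pipeline` that is saved and reloaded as one artifact. This is a sketch under assumptions: the columns, the toy data, the logistic-regression model, and the `credit_pipeline.joblib` file name are all illustrative, and the real pipeline would wrap the final ensemble.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]
categorical_cols = ["NAME_CONTRACT_TYPE"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Unseen categories at inference time are encoded as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
pipeline = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

# Toy training data (hypothetical values).
train = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000, 150000, np.nan, 90000, 200000, 120000, 80000, 300000],
    "AMT_CREDIT": [500000, 300000, 250000, 700000, 400000, 600000, 650000, 200000],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans", "Cash loans",
                           "Revolving loans", "Cash loans", "Cash loans", "Revolving loans"],
    "TARGET": [1, 0, 0, 1, 0, 1, 1, 0],
})
pipeline.fit(train[numeric_cols + categorical_cols], train["TARGET"])

# Save and reload the whole pipeline, as a serving container would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "credit_pipeline.joblib")
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

new_applicant = pd.DataFrame({
    "AMT_INCOME_TOTAL": [np.nan],
    "AMT_CREDIT": [450000],
    "NAME_CONTRACT_TYPE": ["Unknown product"],
})
default_probability = loaded.predict_proba(new_applicant)[:, 1]
print(default_probability)
```

Shipping one artifact means the container only needs to load the file and call `predict_proba`, so training-time and serving-time preprocessing cannot drift apart.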

#### 17. Documentation

Document the process and pipeline (README, interpretation notes).