# Mission 7: Credit Scoring Model Implementation

## Objective
Develop a credit scoring model that predicts the probability that a client defaults on a loan. The model must optimize a business cost metric in which a False Negative (a defaulting client predicted as safe) is penalized 10 times more than a False Positive.

## Workflow
1. **Data Exploration (SQL)**: Load data into SQLite and perform initial exploration.
2. **Feature Engineering**: Create domain-specific features.
3. **Feature Analysis**: Analyze distributions and outliers.
4. **Preprocessing**: Handle missing values, encoding, and scaling.
5. **Model Strategy**: Define the business cost function.
6. **Baseline Model**: Establish a baseline performance.
7. **Model Training & Tuning**: Train models using GridSearchCV and MLflow.
8. **Model Evaluation**: Evaluate models on the test set.
9. **Feature Importance**: Analyze global and local feature importance.
10. **Model Registration**: Register the best model in MLflow.

In [None]:
# Step 0: Imports and Setup
import sys
import os

# Add src to path (works in Docker: /app/src, local: ../src)
if os.path.exists('/app/src'):
    sys.path.insert(0, '/app/src')
    DATA_PATH = '/app/dataset'
else:
    sys.path.insert(0, os.path.abspath('../src'))
    DATA_PATH = '../dataset'

import pandas as pd
import numpy as np
import mlflow
import matplotlib.pyplot as plt
import seaborn as sns

from classes.data_loader import DataLoader
from classes.sqlite_connector import DatabaseConnection
from classes.feature_engineering import FeatureEngineering
from classes.business_scorer import BusinessScorer
from classes.model_trainer import ModelTrainer
from classes.eda_visualizer import EDAVisualizer

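# Configure MLflow experiment.
# The "mlflow" hostname resolves only inside the Docker network, so fall back to localhost elsewhere.
default_uri = "http://mlflow:5005" if os.path.exists('/app/src') else "http://localhost:5005"
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", default_uri))
mlflow.set_experiment("HomeCredit_DefaultRisk")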

print(f"Data path: {DATA_PATH}")
print("Setup complete!")

## Step 1: Data Exploration (SQL)
We will load the CSV data into a SQLite database to enable SQL-based exploration.

In [None]:
# Initialize DataLoader and create SQLite database
loader = DataLoader(DATA_PATH)
db_path = os.path.join(DATA_PATH, 'home_credit.db')

# Create database only if it doesn't exist
if not os.path.exists(db_path):
    print("Creating SQLite database (this may take a few minutes)...")
    loader.create_database(db_path)
else:
    print(f"Database already exists at {db_path}")

# Connect to the database
db = DatabaseConnection(db_path)
print("Tables:", db.get_table_names())

In [None]:
# Example SQL Query: Check target distribution in application_train
query_target = """
SELECT TARGET, COUNT(*) as count 
FROM application_train 
GROUP BY TARGET
"""
df_target = db.execute_query(query_target)
print(df_target)
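
The same query can also report each class as a share of all rows, which makes the imbalance easier to read than raw counts. The sketch below is illustrative only: it uses the standard `sqlite3` module and a made-up five-row table in place of the project's `DatabaseConnection` wrapper, whose internals are not shown here.

```python
import sqlite3

import pandas as pd

# Toy stand-in for application_train; the values are invented for illustration.
toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3, 4, 5], "TARGET": [0, 0, 0, 1, 0]})

conn = sqlite3.connect(":memory:")
toy.to_sql("application_train", conn, index=False)

# Count rows per class and express each count as a percentage of the table.
query = """
SELECT TARGET,
       COUNT(*) AS count,
       ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM application_train), 2) AS pct
FROM application_train
GROUP BY TARGET
ORDER BY TARGET
"""
df_share = pd.read_sql_query(query, conn)
conn.close()

print(df_share)
```

A strongly skewed `pct` column is what motivates the resampling and the cost-sensitive metric used later in the notebook.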

In [None]:
# Load full training data for further processing
df_train = db.read_table('application_train')
print(f"Loaded training data shape: {df_train.shape}")

## Step 2: Feature Engineering
We will create new features based on domain knowledge.

In [None]:
fe = FeatureEngineering()
df_train = fe.simple_feature_engineering(df_train)
print("Feature engineering complete.")
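
The body of `simple_feature_engineering` is not shown in this notebook. As a rough idea of what domain-specific features can look like on this data, here is a minimal sketch; the helper name `add_ratio_features` and the new column names are hypothetical, and treating `365243` in `DAYS_EMPLOYED` as a placeholder value is an assumption based on how the public Home Credit data is commonly handled.

```python
import numpy as np
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with a few ratio features."""
    out = df.copy()
    # Assumption: 365243 is a placeholder, not a real employment duration.
    out["DAYS_EMPLOYED"] = out["DAYS_EMPLOYED"].replace(365243, np.nan)
    # Loan size and yearly payment relative to declared income.
    out["CREDIT_INCOME_RATIO"] = out["AMT_CREDIT"] / out["AMT_INCOME_TOTAL"]
    out["ANNUITY_INCOME_RATIO"] = out["AMT_ANNUITY"] / out["AMT_INCOME_TOTAL"]
    # Fraction of the client's life spent in the current job.
    out["EMPLOYED_BIRTH_RATIO"] = out["DAYS_EMPLOYED"] / out["DAYS_BIRTH"]
    return out


# Invented rows for illustration only.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 200000.0],
    "AMT_CREDIT": [400000.0, 100000.0],
    "AMT_ANNUITY": [20000.0, 10000.0],
    "DAYS_BIRTH": [-10000, -20000],
    "DAYS_EMPLOYED": [-1000, 365243],
})
engineered = add_ratio_features(toy)
print(engineered[["CREDIT_INCOME_RATIO", "ANNUITY_INCOME_RATIO", "EMPLOYED_BIRTH_RATIO"]])
```

Features created this way only influence the model if they are also added to the feature lists defined in Step 4.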

## Step 3: Feature Analysis
Visualize distributions and identify outliers using Plotly.

In [None]:
# Target Distribution
EDAVisualizer.plot_target_distribution(df_train).show()

In [None]:
# Numerical Distributions
numeric_cols = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'DAYS_BIRTH', 'DAYS_EMPLOYED']
EDAVisualizer.plot_numerical_distribution(df_train, columns=numeric_cols).show()

## Step 4: Preprocessing
Prepare data for modeling: imputation, encoding, and scaling.

In [None]:
# Define features
numeric_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'DAYS_BIRTH', 'DAYS_EMPLOYED']
categorical_features = ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']

X = df_train[numeric_features + categorical_features]
y = df_train['TARGET']

print(f"X shape: {X.shape}, y shape: {y.shape}")

# Create preprocessor
preprocessor = fe.create_preprocessor(numeric_features, categorical_features)
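
`create_preprocessor` is a project helper whose implementation is not shown. One plausible shape for it, assuming scikit-learn, is a `ColumnTransformer` that imputes and scales numeric columns and imputes and one-hot encodes categorical ones. The function name `build_preprocessor` and the imputation strategies below are assumptions, not the project's actual choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_features, categorical_features):
    """Hypothetical equivalent of fe.create_preprocessor."""
    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # robust to outliers
        ("scale", StandardScaler()),
    ])
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Ignore categories that appear only at prediction time.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ])


# Invented rows with one missing value per column.
toy = pd.DataFrame({
    "AMT_CREDIT": [100.0, np.nan, 300.0, 200.0],
    "CODE_GENDER": ["F", "M", np.nan, "F"],
})
Xt = build_preprocessor(["AMT_CREDIT"], ["CODE_GENDER"]).fit_transform(toy)
print(Xt.shape)
```

Keeping imputation and scaling inside the transformer means they are re-fitted on each training fold during cross-validation, which avoids leaking validation statistics into training.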

## Step 5: Model Strategy
Define the business cost function: Cost = 10 * FN + 1 * FP.

In [None]:
business_scorer = BusinessScorer(fn_cost=10, fp_cost=1)
scorer = business_scorer.get_scorer()
print("Business scorer created (FN cost=10, FP cost=1)")
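
For reference, the cost formula above can be written directly with scikit-learn. This is a sketch of what a scorer like `BusinessScorer` might compute, not its actual implementation; the function name `business_cost` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer


def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Cost = fn_cost * FN + fp_cost * FP (lower is better)."""
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is absent.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn_cost * fn + fp_cost * fp


# GridSearchCV maximizes its score, so the cost is negated via greater_is_better=False.
cost_scorer = make_scorer(business_cost, greater_is_better=False)

y_true = np.array([1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0])
print(business_cost(y_true, y_pred))  # 2 FN and 1 FP -> 10*2 + 1*1 = 21
```

Because missing a defaulting client costs ten times more than flagging a safe one, a model that over-predicts the positive class can score better on this metric than a more accurate but conservative one.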

## Step 6: Baseline Model
Train a simple Logistic Regression model as a baseline.

In [None]:
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# Use sample for faster training (adjust as needed)
SAMPLE_SIZE = 50000
X_sample = X.sample(n=SAMPLE_SIZE, random_state=42)
y_sample = y.loc[X_sample.index]
print(f"Using sample of {SAMPLE_SIZE} rows for training")

pipeline_baseline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000))
])

param_grid_baseline = {'classifier__C': [1.0]}

trainer = ModelTrainer()
baseline_model = trainer.train_and_log(
    pipeline_baseline, param_grid_baseline, X_sample, y_sample, scorer, 
    run_name="Step6_Baseline_LogReg"
)
print("Baseline model training complete!")

## Step 7: Model Training & Tuning
Train and tune more complex models (e.g., LightGBM) using GridSearchCV.

In [None]:
from lightgbm import LGBMClassifier

pipeline_lgbm = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(random_state=42)),
    ('classifier', LGBMClassifier(random_state=42, verbose=-1))
])

param_grid_lgbm = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__num_leaves': [31, 50]
}

lgbm_model = trainer.train_and_log(
    pipeline_lgbm, param_grid_lgbm, X_sample, y_sample, scorer, 
    run_name="Step7_LGBM_Tuning"
)
print("LightGBM model training complete!")
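
`train_and_log` wraps the grid search and the MLflow logging, and its code is not shown here. The search part can be sketched with plain scikit-learn as below. This is an assumption about the general approach only: the sketch uses synthetic data, swaps SMOTE for `class_weight="balanced"` and LightGBM for logistic regression to stay dependency-light, and leaves out MLflow logging entirely.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Hypothetical cost function: fn_cost * FN + fp_cost * FP."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn_cost * fn + fp_cost * fp


# Synthetic, imbalanced data standing in for the credit dataset.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_grid={"C": [0.1, 1.0]},
    # Negated cost: the search keeps the parameters with the lowest cost.
    scoring=make_scorer(business_cost, greater_is_better=False),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_toy, y_toy)

# best_score_ is the mean negated cost per validation fold; flip the sign to read it.
print(search.best_params_, -search.best_score_)
```

Note that the cost is a count over each validation fold, so values are only comparable between runs that use the same data and the same fold sizes.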

## Step 8: Model Evaluation
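Evaluate the tuned model on rows that were not part of the training sample. Scoring the training sample itself would overstate performance.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Hold-out set: every row that was not drawn into the training sample
X_holdout = X.drop(index=X_sample.index)
y_holdout = y.loc[X_holdout.index]
y_pred = lgbm_model.predict(X_holdout)

print("Classification Report:")
print(classification_report(y_holdout, y_pred))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_holdout, y_pred)
print(cm)

# Business cost on the hold-out set: 10 * FN + 1 * FP
tn, fp, fn, tp = cm.ravel()
print(f"\nBusiness cost: {10 * fn + fp}")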

## Step 9: Feature Importance
Analyze feature importance for the best model.

In [None]:
# Global Feature Importance (LightGBM built-in)
import lightgbm as lgb

lgb.plot_importance(lgbm_model.named_steps['classifier'], max_num_features=20)
plt.title("Feature Importance")
plt.tight_layout()
plt.show()

## Step 10: Model Registration
Register the best model in MLflow Model Registry.
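
Registration can also be scripted. The sketch below relies on `mlflow.register_model`, which takes a model URI of the form `runs:/<run_id>/<artifact_path>` and a registry name. Everything else is an assumption: the helper names, the registry name `HomeCredit_DefaultRisk_LGBM`, and the `model` artifact path all depend on how `train_and_log` logged the model, which is not shown here.

```python
def build_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the runs:/ URI that MLflow uses to locate a logged model."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(run_id: str,
                        name: str = "HomeCredit_DefaultRisk_LGBM",
                        artifact_path: str = "model"):
    """Hypothetical helper: register the model logged under run_id.

    Requires a reachable MLflow tracking server with a model registry.
    """
    import mlflow  # imported here so the URI helper works without MLflow installed

    return mlflow.register_model(
        model_uri=build_model_uri(run_id, artifact_path),
        name=name,
    )


# The run ID comes from the MLflow UI or from the run that logged the model.
print(build_model_uri("<run_id>"))
```

Calling `register_best_model` with the run ID of the best run creates a new version under the given name, or the first version if the name does not exist yet.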

In [None]:
# Register the model (this would typically be done via MLflow UI or API)
print("Please register the best model via the MLflow UI at http://localhost:5005")
print("\n✅ Notebook execution complete!")