# Survival Analysis

## 1. Notebook Styling and Library Installation

In [1]:
# !conda install -c sebp scikit-survival --yes 
# !pip install lifelines
!pip install xgbse

Collecting xgbse
  Using cached xgbse-0.3.3-py3-none-any.whl.metadata (17 kB)
Collecting joblib<2.0.0,>=1.4.2 (from xgbse)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting lifelines<0.30.0,>=0.29.0 (from xgbse)
  Using cached lifelines-0.29.0-py3-none-any.whl.metadata (3.2 kB)
Collecting scikit-learn<2.0.0,>=1.5.0 (from xgbse)
  Using cached scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting nvidia-nccl-cu12 (from xgboost<3.0.0,>=2.1.0->xgbse)
  Using cached nvidia_nccl_cu12-2.25.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.8 kB)
Using cached xgbse-0.3.3-py3-none-any.whl (35 kB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Using cached lifelines-0.29.0-py3-none-any.whl (349 kB)
Using cached scikit_learn-1.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
Using cached nvidia_nccl_cu12-2.25.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.4 

In [2]:
!pip install feature-engine



In [3]:
pip install optunahub

Note: you may need to restart the kernel to use updated packages.


In [4]:
import numpy as np # Library for math operations
import pandas as pd # Library for data handling
from sksurv.nonparametric import kaplan_meier_estimator # Library for survival analysis
import matplotlib.pyplot as plt # Library for plotting
import seaborn as sns # Another library for plotting
plt.style.use('fivethirtyeight') # Set the styling to FiveThirtyEight setting.
from datetime import date

## 2. Read & Process the Data

In [5]:
trainInput = pd.read_csv('trainInput.csv')
testInput = pd.read_csv('testInput.csv')

In [6]:
kaggle = pd.read_csv('kaggle.csv')

In [7]:
trainData = trainInput.drop(columns = ['id','purchased', 'days_on_market'])
trainLabels = trainInput['purchased']
testData = testInput.drop(columns = ['id','purchased', 'days_on_market'])
testLabels = testInput['purchased']

## 3. Data Type Conversion:

In [8]:
trainData[['product_id']] = trainData[['product_id']].astype(str)

testData[['product_id']] = testData[['product_id']].astype(str)

## 4. Pipeline:

### Code Explanation and Design Choices

This cell fits an **XGBoost-based survival model (XGBSE)** and tunes it with **Optuna**. A single search covers the **preprocessing pipeline**, the **feature engineering** steps, and the **model hyperparameters**.

---

### **Key Components and Why They Are Used**

#### **1. Hyperparameter Optimization with Optuna**
- **Why?** Automates the search over preprocessing and model hyperparameters instead of tuning them by hand.
- **How?** An `objective` function samples preprocessing settings and model parameters for each trial, fits the model, and returns the **concordance index (C-index)** on the validation set, which the study maximizes.
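
The objective-and-trial pattern described above can be sketched without any dependencies. This is only an illustration: it uses plain random search as a stand-in for Optuna's sampler (Optuna's default sampler is smarter than this), and the scoring surface inside `toy_objective` is made up.

```python
import random

def toy_objective(params):
    """Made-up score surface that peaks at learning_rate=0.05, max_depth=6."""
    return 1.0 - abs(params["learning_rate"] - 0.05) - 0.01 * abs(params["max_depth"] - 6)

def random_search(objective, n_trials, seed=0):
    """Sample parameters, score them, and keep the best trial (maximize)."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        params = {
            # Log-uniform draw between 1e-3 and 1e-1, the idea behind log=True
            "learning_rate": 10 ** rng.uniform(-3, -1),
            "max_depth": rng.randint(3, 10),
        }
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value

best_params, best_value = random_search(toy_objective, n_trials=200)
print(best_params, best_value)
```

The point of the sketch is the bookkeeping: each trial proposes a full parameter set, the objective returns one number, and only the best-scoring set is kept.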

#### **2. Preprocessing Pipeline**
The pipeline ensures the input data is clean and well-processed before being fed into the model.

- **Outlier Handling (Winsorization)**  
  - Uses `Winsorizer()` to limit extreme values in both tails or a specific tail.
  - **Why?** Controls the impact of outliers on the model.
  
- **Categorical Encoding**
  - `RareLabelEncoder()`: Groups categories whose frequency falls below `tol` into a single "Rare" label.
  - `WoEEncoder()`: Replaces each category with its **Weight of Evidence (WoE)**, computed against the binary `purchased` label (optional, switched on or off per trial).
  - **Why?** Keeps high-cardinality columns manageable and turns categories into numbers the model can use.

- **Feature Selection**
  - `DropConstantFeatures()`: Removes constant and quasi-constant features, i.e. those where a single value accounts for at least a `tol` share of the rows.
  - **Why?** Reduces redundancy and computational cost.

- **Feature Engineering**
  - `MathFeatures()`: Creates interaction terms like products of selected feature pairs.
  - **Why?** Enhances feature representation for better predictive power.

- **Label Encoding for Categorical Features**
  - Uses `LabelEncoder()` on any remaining string columns before `MathFeatures()` runs.
  - **Why?** Converts categorical values into numerical form, necessary for models that don’t handle categorical data natively.

- **MinMax Scaling**
  - Applies `MinMaxScaler()` to scale all features between 0 and 1.
  - **Why?** Ensures consistency across different feature ranges.
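
To make three of these steps concrete, here is a dependency-free sketch of quantile capping, rare-label grouping, and min-max scaling. The helper names and the toy `prices` and `brands` data are hypothetical, and the logic only approximates what the `feature_engine` and scikit-learn transformers do. In the real pipeline the caps, frequent categories, and min/max are learned on the training data and then reused on the test data.

```python
from collections import Counter

def quantile(values, q):
    """Linear-interpolation quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def winsorize(values, fold=0.05):
    """Cap both tails at the `fold` and `1 - fold` quantiles."""
    low, high = quantile(values, fold), quantile(values, 1 - fold)
    return [min(max(v, low), high) for v in values]

def group_rare(labels, tol=0.1):
    """Replace categories seen in less than a `tol` share of rows with 'Rare'."""
    counts = Counter(labels)
    n = len(labels)
    return [lab if counts[lab] / n >= tol else "Rare" for lab in labels]

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]  # one extreme value
brands = ["a"] * 6 + ["b"] * 3 + ["c"]              # "c" appears once

capped = winsorize(prices, fold=0.1)
grouped = group_rare(brands, tol=0.2)
scaled = min_max(capped)
print(max(capped), grouped, scaled)
```

Capping before scaling matters here: without it, the single extreme price would squeeze every other scaled value close to 0.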

#### **3. Survival Model Setup**
- Uses **XGBSEStackedWeibull** as the base model, wrapped in **XGBSEBootstrapEstimator**.
- **Why these models?**
  - **XGBSEStackedWeibull**: Extends XGBoost for survival analysis, leveraging a **Weibull distribution**.
  - **XGBSEBootstrapEstimator**: Reduces variance by using **bootstrap resampling**.

- Converts labels using `convert_to_structured()`, necessary for survival analysis.
- **Why?** Survival models need structured time-to-event data instead of simple class labels.
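
As a rough illustration of the Weibull assumption mentioned above, the sketch below evaluates a Weibull survival function, S(t) = exp(-(t / scale)^shape), at a few time points. The `scale` and `shape` values are hypothetical, and how XGBSE estimates them from XGBoost outputs is not shown here.

```python
import math

def weibull_survival(t, scale, shape):
    """Probability that the event has NOT happened by time t under a Weibull model."""
    return math.exp(-((t / scale) ** shape))

scale, shape = 30.0, 1.5  # hypothetical parameters, not fitted to any data
days = [0, 7, 14, 31, 60]
curve = [weibull_survival(t, scale, shape) for t in days]

# Probability that the event (here: a purchase) happens within 31 days
sold_by_31 = 1 - weibull_survival(31, scale, shape)
print(curve, sold_by_31)
```

A survival curve starts at 1 and can only decrease, which is why a submission value of the form `1 - S(t)` reads as the probability of the event by time `t`.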

#### **4. Model Training and Evaluation**
- **Hyperparameter tuning for the survival model includes:**
  - `learning_rate`: Controls step size (log-scaled for fine control).
  - `max_depth`: Controls tree depth for complexity.
  - `booster`: Chooses between "gbtree" (default) and "dart" (dropout trees for regularization).
  - `subsample`: Selects a subset of data to reduce overfitting.
  - `min_child_weight`: Controls minimum sum of instance weights per leaf.
  - `colsample_bynode`: Controls feature sampling for diversity.

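- Uses **early stopping** to prevent overfitting by monitoring validation performance. Note that the validation set here is `testInput`, which is also used to score every trial, so the reported C-index is likely optimistic.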

- **Evaluation Metric: Concordance Index (C-index)**
  - Measures how well the model ranks survival times.
  - **Why?** Ideal for survival analysis as it focuses on ranking rather than absolute time predictions.
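
The ranking idea behind the C-index can be sketched in a few lines. This is a minimal version of Harrell's C-index over risk scores, written for illustration only; `harrell_c_index` is a hypothetical helper, and `xgbse.metrics.concordance_index` works on predicted survival curves, so its details may differ.

```python
def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs that the risk scores rank correctly.

    A pair (i, j) is comparable when i has the shorter time and its event was
    observed. It is concordant when i also has the higher risk score.
    Ties in risk count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

times = [5, 10, 20, 40]
events = [True, True, False, True]  # False marks a censored observation

good = harrell_c_index(times, events, [0.9, 0.7, 0.4, 0.1])  # risk falls as time grows
bad = harrell_c_index(times, events, [0.1, 0.4, 0.7, 0.9])   # ranking reversed
flat = harrell_c_index(times, events, [0.5, 0.5, 0.5, 0.5])  # no ranking at all
print(good, bad, flat)
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 is what an uninformative model gets, and the censored item only enters pairs where it is the longer-lived one.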

#### **5. Model Persistence (Pickle)**
- Saves the trained model (`xgbse_model.pkl`) and preprocessing pipeline (`preprocessing_pipeline.pkl`) for future inference.
- **Why?** Avoids retraining the model every time predictions are needed.
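
The save-and-reload step can be sketched with the standard library alone. The `artifacts` dictionary below is a hypothetical stand-in for the fitted transformers, and a temporary directory is used so the example leaves no files behind.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for fitted preprocessing objects
artifacts = {"scaler": {"min": 0.0, "max": 500.0}, "rare_tol": 0.01}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessing_pipeline.pkl"

    # Serialize to disk
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)

    # Load it back, as an inference script would
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Two practical caveats: pickled estimators generally need the same library versions at load time as at save time, and pickle files should only be loaded from trusted sources because unpickling can execute arbitrary code.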

---

### **Why This Structure?**
1. **Automated Hyperparameter Optimization**  
   - Optuna systematically tunes preprocessing and model parameters to maximize performance.

2. **Comprehensive Preprocessing**  
   - Handles outliers, rare categories, feature selection, encoding, and scaling.

3. **Feature Engineering to Enhance Predictive Power**  
   - Introduces interaction terms via `MathFeatures()`.

4. **Survival-Specific Label Conversion**  
   - Uses `convert_to_structured()` to properly format target labels.

5. **Bootstrap-Based Survival Modeling**  
   - Reduces variance by aggregating multiple model predictions.

6. **Scalability and Reproducibility**  
   - Saves preprocessing steps and models for easy deployment.
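
The bootstrap idea in point 5 can be illustrated with a deliberately simple "model": the sample mean. The sketch below refits that estimator on resampled copies of a toy dataset and averages the results. The data and function name are hypothetical, and the real `XGBSEBootstrapEstimator` aggregates full survival models rather than means.

```python
import random
import statistics

def bootstrap_estimates(values, n_estimators=200, seed=1):
    """Fit the same simple estimator (the mean) on resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_estimators):
        sample = [rng.choice(values) for _ in values]
        estimates.append(statistics.fmean(sample))
    return estimates

days_on_market = [3, 8, 12, 15, 21, 30, 45, 60, 90, 120]  # toy data
estimates = bootstrap_estimates(days_on_market)

ensemble = statistics.fmean(estimates)  # aggregated prediction
spread = statistics.stdev(estimates)    # how much a single fit varies
print(ensemble, spread)
```

The spread shows how much any single resampled fit can move around, while the aggregated value stays close to the estimate from the full data.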

---

### **Summary**
This code builds a **survival analysis model** with **XGBSE** on top of XGBoost, with preprocessing, feature engineering, and model hyperparameters tuned jointly by **Optuna**. **Bootstrap ensembling** is used to stabilize the predictions for this time-to-event task.

In [None]:
import optuna
import pickle
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from xgbse import (
    XGBSEKaplanNeighbors, XGBSEDebiasedBCE, XGBSEKaplanTree,
    XGBSEStackedWeibull, XGBSEBootstrapEstimator
)
from xgbse.converters import convert_to_structured
from xgbse.metrics import (
    concordance_index, approx_brier_score, dist_calibration_score
)
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import RareLabelEncoder, WoEEncoder
from feature_engine.selection import DropConstantFeatures
from feature_engine.creation import MathFeatures

# Create Storage for Preprocessing Steps and Model
preprocessing_steps = {}
trained_models = {}  # Store trained model here

def objective(trial):
    global preprocessing_steps, trained_models  # Store best transformations and model

    # Preprocessing Hyperparameters

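    # 1. Winsorization (Handle Outliers)
    # NOTE: Winsorizer defaults to capping_method="gaussian", where `fold` is a
    # multiple of the standard deviation. If 0.05-0.1 is meant as a tail
    # percentile, pass capping_method="quantiles" instead.
    winsor_tail = trial.suggest_categorical("winsor_tail", ["both", "right", "left"])
    winsor_limits = trial.suggest_float("winsor_limits", 0.05, 0.1)
    out = Winsorizer(tail=winsor_tail, fold=winsor_limits)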

    # 2. Rare Label Encoding
    rare_tol = trial.suggest_float("rare_tol", 0.0075, 0.01)
    rare_n_categories = trial.suggest_int("rare_n_categories", 1, 5)
    enc = RareLabelEncoder(tol=rare_tol, n_categories=rare_n_categories)

    # 3. Feature Selection (Drop Constant Features)
    drop_tol = trial.suggest_float("drop_tol", 0.95, 0.98)
    con = DropConstantFeatures(tol=drop_tol)

    # 4. Weight of Evidence Encoding (Can be turned ON or OFF)
    apply_woe = trial.suggest_categorical("apply_woe", [True, False])
    enc2 = WoEEncoder() if apply_woe else None

    # 5. Feature Engineering (Math Features)
    use_math_features = trial.suggest_categorical("use_math_features", [True, False])
    possible_math_vars = [["brand", "product_id"], ["price", "color"]]
    
    # Convert feature lists to strings for Optuna
    math_vars_options = [",".join(vars) for vars in possible_math_vars] + [None]
    selected_math_vars = trial.suggest_categorical("math_vars", math_vars_options)

    # Convert back to list (or None) for MathFeatures
    math_vars = selected_math_vars.split(",") if selected_math_vars else None
    mf = MathFeatures(variables=math_vars, func=["prod"]) if use_math_features and math_vars else None

    # Apply Preprocessing Steps
    train_trans, test_trans = trainData.copy(), testData.copy()
    
    train_trans, test_trans = out.fit_transform(train_trans), out.transform(test_trans)
    train_trans, test_trans = enc.fit_transform(train_trans), enc.transform(test_trans)
    train_trans, test_trans = con.fit_transform(train_trans), con.transform(test_trans)
    
    if apply_woe:
        train_trans, test_trans = enc2.fit_transform(train_trans, trainLabels), enc2.transform(test_trans)

    # Convert Categorical Features to Numeric Before Feature Engineering
    label_encoders = {}
    for col in train_trans.select_dtypes(include=["object"]).columns:
        le = LabelEncoder()
        train_trans[col] = le.fit_transform(train_trans[col])
        test_trans[col] = le.transform(test_trans[col])
        label_encoders[col] = le  

    # Apply Math Features AFTER Encoding
    if use_math_features and math_vars and all(var in train_trans.columns for var in math_vars):
        train_trans, test_trans = mf.fit_transform(train_trans), mf.transform(test_trans)

    # MinMax Scaling
    scaler = MinMaxScaler()
    train_trans = pd.DataFrame(scaler.fit_transform(train_trans), columns=train_trans.columns)
    test_trans = pd.DataFrame(scaler.transform(test_trans), columns=test_trans.columns)

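    # Store fitted transformers for later use. Optional steps left over from an
    # earlier trial are dropped first so they cannot leak into this trial's pipeline.
    preprocessing_steps.pop("woe_encoder", None)
    preprocessing_steps.pop("math_features", None)
    preprocessing_steps["winsorizer"] = out
    preprocessing_steps["rare_encoder"] = enc
    preprocessing_steps["drop_constant"] = con
    preprocessing_steps["scaler"] = scaler
    preprocessing_steps["label_encoders"] = label_encoders
    if apply_woe:
        preprocessing_steps["woe_encoder"] = enc2
    # Store MathFeatures only when it was actually fitted above
    if use_math_features and math_vars and all(var in train_trans.columns for var in math_vars):
        preprocessing_steps["math_features"] = mf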

    # Convert Labels for Survival Model
    y_train = convert_to_structured(trainInput["days_on_market"], trainLabels)
    y_val = convert_to_structured(testInput["days_on_market"], testLabels)

    # Model Hyperparameters
    params = {
        "objective": "survival:cox",
        "tree_method": "hist",
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "booster": trial.suggest_categorical("booster", ["gbtree", "dart"]),
        "subsample": trial.suggest_float("subsample", 0.4, 0.9),
        "min_child_weight": trial.suggest_int("min_child_weight", 10, 200),
        "colsample_bynode": trial.suggest_float("colsample_bynode", 0.4, 0.9),
    }

    n_estimators = trial.suggest_int("n_estimators", 50, 500, step=50)
    num_boost_round = trial.suggest_int("num_boost_round", 10, 200, step=10)

    # Create the base model
    base = XGBSEStackedWeibull(params)
    xgbse_model = XGBSEBootstrapEstimator(
        base_estimator=base,
        n_estimators=n_estimators,
        random_state=1
    )

    # Fit model
    xgbse_model.fit(
        train_trans, y_train,
        num_boost_round=num_boost_round,
        validation_data=(test_trans, y_val),
        early_stopping_rounds=20,  
        verbose_eval=False
    )

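    # Predict and evaluate using the concordance index
    preds = xgbse_model.predict(test_trans)
    score = concordance_index(y_val, preds)

    # Keep the model and fitted transformers of the best trial so far. Storing them
    # unconditionally would leave the last trial's artifacts, not the best trial's.
    if score > trained_models.get("best_score", float("-inf")):
        trained_models["best_score"] = score
        trained_models["xgbse_model"] = xgbse_model
        trained_models["preprocessing_steps"] = dict(preprocessing_steps)

    return score

# Run Optuna optimization
study = optuna.create_study(direction="maximize")  # Maximize C-index
study.optimize(objective, n_trials=200)

# Best parameters
best_params = study.best_params
print(f"Best parameters: {best_params}")

# Save the best trial's preprocessing steps and model for later use
with open("preprocessing_pipeline.pkl", "wb") as f:
    pickle.dump(trained_models["preprocessing_steps"], f)

with open("xgbse_model.pkl", "wb") as f:
    pickle.dump(trained_models["xgbse_model"], f)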


[I 2025-02-20 23:27:11,402] A new study created in memory with name: no-name-dadd1e9e-3cb3-4193-b8dd-e49a6c02891a
[I 2025-02-20 23:29:12,790] Trial 0 finished with value: 0.7582376332910143 and parameters: {'winsor_tail': 'left', 'winsor_limits': 0.0848653038673566, 'rare_tol': 0.008852925609886764, 'rare_n_categories': 4, 'drop_tol': 0.969657634376374, 'apply_woe': False, 'use_math_features': False, 'math_vars': None, 'learning_rate': 0.0637679969690931, 'max_depth': 9, 'booster': 'gbtree', 'subsample': 0.5756872586802466, 'min_child_weight': 119, 'colsample_bynode': 0.726708103197294, 'n_estimators': 150, 'num_boost_round': 50}. Best is trial 0 with value: 0.7582376332910143.
[I 2025-02-20 23:31:52,066] Trial 1 finished with value: 0.7541729214230058 and parameters: {'winsor_tail': 'both', 'winsor_limits': 0.07022379718733353, 'rare_tol': 0.009176405409707777, 'rare_n_categories': 2, 'drop_tol': 0.9592582829844837, 'apply_woe': True, 'use_math_features': True, 'math_vars': None, 'lea

## 5. Submission

This script loads the pickled XGBSE survival model and its preprocessing pipeline, applies the saved transformations to the Kaggle data, generates predictions, and writes a submission file. Reusing the fitted steps (outlier handling, encoding, feature selection, and scaling) keeps inference consistent with training.

In [None]:
import pickle
import pandas as pd

# Load Preprocessing Steps
with open("preprocessing_pipeline.pkl", "rb") as f:
    preprocessing_steps = pickle.load(f)

# Load the Trained Model
with open("xgbse_model.pkl", "rb") as f:
    trained_models = pickle.load(f)

print("Preprocessing pipeline and model loaded successfully!")

# Apply Preprocessing to New Data
def apply_preprocessing(new_data, preprocessing_steps):
    """Applies the saved preprocessing steps to new data, in the same order as training."""
    new_data = preprocessing_steps["winsorizer"].transform(new_data)
    new_data = preprocessing_steps["rare_encoder"].transform(new_data)
    new_data = preprocessing_steps["drop_constant"].transform(new_data)

    # WoE encoding ran before label encoding during training
    if "woe_encoder" in preprocessing_steps:
        new_data = preprocessing_steps["woe_encoder"].transform(new_data)

    # Label-encode the remaining categorical columns
    for col, le in preprocessing_steps["label_encoders"].items():
        new_data[col] = le.transform(new_data[col])

    # Reuse the fitted MathFeatures step: transform only, never refit at inference
    if "math_features" in preprocessing_steps:
        new_data = preprocessing_steps["math_features"].transform(new_data)

    # Apply MinMax scaling, keeping column names and row index
    new_data = pd.DataFrame(
        preprocessing_steps["scaler"].transform(new_data),
        columns=new_data.columns,
        index=new_data.index,
    )

    return new_data

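# Apply preprocessing to kaggle, using the same feature columns and dtypes as in training
kaggle_features = kaggle.drop(columns=['id', 'purchased', 'days_on_market'], errors='ignore')
kaggle_features[['product_id']] = kaggle_features[['product_id']].astype(str)
kaggle_transformed = apply_preprocessing(kaggle_features, preprocessing_steps)

# If labels are available, convert them
if {'days_on_market', 'purchased'}.issubset(kaggle.columns):
    from xgbse.converters import convert_to_structured
    kaggle_labels = convert_to_structured(kaggle["days_on_market"], kaggle["purchased"])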

print("kaggle data preprocessed successfully. Ready for predictions!")

# Prepare Submission
submission = kaggle.copy()
submission[['product_id']] = submission[['product_id']].astype(str)

# Expected = 1 - S(t), read from the column labeled 31 of the predicted survival table
submission['Expected'] = 1 - trained_models.predict(kaggle_transformed)[31]
submission['Id'] = submission.index.astype(str)

# Save to CSV
submission[['Id', 'Expected']].to_csv('to_kaggle.csv', index=False)
print("Submission file 'to_kaggle.csv' saved successfully!")

Please submit to: https://www.kaggle.com/t/8a2e03e370c74cafbb28375aed425682

## XGBSE Docs:

https://github.com/loft-br/xgboost-survival-embeddings/blob/main/docs/how_xgbse_works.md