# Ensemble of 3 models: RandomSurvivalForest, XGBRegressor with hyperparameter tuning on the top 25 features, and LGBMClassifier
All models are combined here to produce a joint result.

How it works:
1. The function model_1() reads the training and test data, cleans them, and trains a RandomSurvivalForest model. The predictions are saved to a file named modell_1.csv.
2. The function model_2() reads the training and test data, cleans them, and trains an XGBRegressor model with hyperparameter tuning. The predictions are saved to a file named modell_2.csv.
3. The function model_3() reads the training and test data, cleans them, and trains an LGBMClassifier model. The predictions are saved to a file named modell_3.csv.
4. The individual results are read back in and averaged row by row. The result is saved to submission.csv.
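
Step 4 can be sketched in isolation as follows. This is a minimal illustration, not the notebook's own code: the helper name `average_predictions` and the three small frames are hypothetical, and the values are made up. It only assumes that each per-model file has an `ID` and a `prediction` column, as described in the list above.

```python
import pandas as pd


def average_predictions(frames):
    """Merge per-model frames on 'ID' and return ID plus the row-wise mean prediction."""
    merged = None
    for i, frame in enumerate(frames, start=1):
        # Give every model its own column name so no merge suffixes are needed
        renamed = frame.rename(columns={"prediction": f"prediction_{i}"})
        merged = renamed if merged is None else merged.merge(renamed, on="ID", how="inner")
    pred_cols = [c for c in merged.columns if c.startswith("prediction_")]
    merged["prediction"] = merged[pred_cols].mean(axis=1)
    return merged[["ID", "prediction"]]


# Made-up values for illustration only
m1 = pd.DataFrame({"ID": [1, 2, 3], "prediction": [0.2, 0.5, 0.8]})
m2 = pd.DataFrame({"ID": [1, 2, 3], "prediction": [0.4, 0.5, 0.6]})
m3 = pd.DataFrame({"ID": [1, 2, 3], "prediction": [0.6, 0.5, 0.1]})

submission_demo = average_predictions([m1, m2, m3])
print(submission_demo)
```

Renaming the columns before merging avoids relying on the automatic `_x`/`_y` suffixes that pandas adds when column names collide.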

In [1]:

# Overall prediction with modell1 (RandomSurvivalForest), modell2_new (XGBRegressor with hyperparameter tuning and top 25 features) and modell3 (LGBMClassifier)
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
import re
from collections import defaultdict
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier, early_stopping
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer  
from sklearn.impute import SimpleImputer
# Model 1 (RandomSurvivalForest)

import warnings
warnings.simplefilter("ignore", category=FutureWarning)

def model_1():

    def clean_data(df):
        """Data cleaning: normalize column names and handle missing values."""
        # 1. Standardize column names
        df.columns = df.columns.str.strip().str.lower().str.replace(r"[^a-z0-9_]+", "_", regex=True)
        
        # 2. Survival targets: 'efs' (event indicator) and 'efs_time' (time); missing values become 0
        survival_cols = ['efs', 'efs_time']
        for col in survival_cols:
            if col in df.columns:
                df[col] = df[col].fillna(0)
        
        # 3. Remaining columns: numeric columns are filled with their median.
        # Object (string) columns are skipped here and dropped later by select_dtypes;
        # any other non-numeric dtype is converted to category codes.
        for col in df.columns:
            if col in survival_cols or df[col].dtype == object:
                continue
            if pd.api.types.is_numeric_dtype(df[col]):
                # Assign back instead of using inplace=True to avoid pandas chained-assignment warnings
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].astype('category').cat.codes
    
        return df
    
    def main_modell1():
        # Get the current working directory
        current_dir = os.getcwd().replace("\\", "/")
    
        # Paths to the training and test data
        train_path = os.path.join(current_dir, "data/train.csv")
        test_path = os.path.join(current_dir, "data/test.csv")
    
        # Read the training and test data
        df_train = pd.read_csv(train_path)
        df_test = pd.read_csv(test_path)
    
        # Clean the data
        df_train = clean_data(df_train)
        df_test = clean_data(df_test)
    
        # Check that all required columns are present
        required_cols = ['efs', 'efs_time', 'id']
        for col in required_cols:
            if col not in df_train.columns:
                raise KeyError(f"Missing required column '{col}' in the training data")
    
        # Build the survival target for training
        y_train = Surv.from_dataframe('efs', 'efs_time', df_train)
        X_train = df_train.drop(columns=['efs', 'efs_time', 'id'], errors='ignore')
        X_test = df_test.drop(columns=['efs', 'efs_time', 'id'], errors='ignore')
    
        # Only numeric features are used
        X_train = X_train.select_dtypes(include=np.number)
        X_test = X_test[X_train.columns]
    
        # Initialize the RandomSurvivalForest model
        model = RandomSurvivalForest(
            n_jobs=100, 
            n_estimators=500,
            max_depth=3,
            verbose=2,
            random_state=42,
        )
    
        print("Starte Training...")
        model.fit(X_train, y_train)
    
        # Instead of using model.event_times_, take the median of efs_time in the training data as the evaluation time.
        median_time = np.median(df_train['efs_time'].values)
        print(f"Ausgewählter Zeitpunkt für Vorhersage: {median_time}")
    
        # Predict the survival function: predict_survival_function returns, for each test case,
        # a step function that gives the survival probability at any point in time.
        survival_functions = model.predict_survival_function(X_test)
        
        # Evaluate the survival probability at the chosen time (median_time)
        predictions = np.array([fn(median_time) for fn in survival_functions])
        # Optional: make sure the predictions lie in the interval [0, 1]
        predictions = np.clip(predictions, 0.0, 1.0)
    
        # Save the results
        results = pd.DataFrame({
            'ID': df_test['id'].values,
            'prediction': predictions
        })
        results.to_csv("modell_1.csv", index=False)
        print("✅ Vorhersagen erfolgreich gespeichert")

    main_modell1()


# Model 2 (XGBRegressor with hyperparameter tuning and only the top 25 features)

def model_2():

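    # Columns used as features (top 25). 'ID' is appended so that it survives the column
    # selection and can be written to the output file; note that, as written, it is
    # therefore also passed to the model as a numeric feature.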
    top_feature_cols = ['conditioning_intensity', 'year_hct', 'age_at_hct',
                        'sex_match', 'donor_age', 'prim_disease_hct', 'gvhd_proph', 
                        'comorbidity_score', 'karnofsky_score', 'cyto_score_detail', 
                        'dri_score', 'cmv_status', 'race_group', 'in_vivo_tcd', 'hla_match_drb1_high', 
                        'tbi_status', 'cardiac', 'cyto_score', 'hla_nmdp_6', 'mrd_hct', 'hla_match_dqb1_high', 
                        'hla_match_a_low', 'pulm_severe', 'psych_disturb', 'hla_match_c_high', 'ID']
    
    def clean_data(df, is_train=True):
        """
        Cleans the DataFrame:
        - Replaces missing values (NaN) with 0.
        - Replaces problematic characters in column names with underscores.
        - Converts categorical (object) columns to integer category codes.
        """
        df.fillna(0, inplace=True)
        df.columns = df.columns.str.replace(r"[^a-zA-Z0-9_]", "_", regex=True)
        for col in df.columns:
            if df[col].dtype == 'object':
                df[col] = df[col].astype(str).astype('category').cat.codes
        return df
    
    def model_2_main():
        # Determine the working directory
        current_dir = os.getcwd().replace("\\", "/")
    
        # Read the training and test data
        train_path = os.path.join(current_dir, "data/train.csv")
        test_path = os.path.join(current_dir, "data/test.csv")
    
        df_train = pd.read_csv(train_path)
        df_test = pd.read_csv(test_path)
    
        # Clean the data and restrict it to the required columns
        df_train = clean_data(df_train, is_train=True)
        df_train = df_train[top_feature_cols + ['efs', 'efs_time']]
        
        df_test = clean_data(df_test, is_train=False)
        df_test = df_test[top_feature_cols]
    
        # Define features and target variable
        X_train = df_train.drop(columns=['efs', 'efs_time'], errors='ignore')
        y_train = df_train['efs']
        X_test = df_test
    
        # Use numeric columns only
        X_train = X_train.select_dtypes(include=[np.number])
        X_test = X_test.select_dtypes(include=[np.number])
    
        # Initialize XGBRegressor (no parameters fixed in advance apart from the seed)
        xgb_model = XGBRegressor(random_state=42)
    
        # Parameter grid for hyperparameter tuning
        param_grid = {
            'n_estimators': [100, 300, 500],
            'max_depth': [3, 5, 7],
            'learning_rate': [0.01, 0.05, 0.1]
        }
    
        # Initialize GridSearchCV
        grid_search = GridSearchCV(estimator=xgb_model,
                                   param_grid=param_grid,
                                   scoring='neg_mean_squared_error',  # negative MSE as scoring metric
                                   cv=3,                            # 3-fold cross-validation
                                   verbose=1,
                                   n_jobs=-1)
    
        # Run hyperparameter tuning
        print("Starte Hyperparameter Tuning...")
        grid_search.fit(X_train, y_train)
        print("Beste Parameter gefunden:", grid_search.best_params_)
        print("Bester Score (negatives MSE):", grid_search.best_score_)
    
        # Use the best model
        best_model = grid_search.best_estimator_
    
        # Predict on the test data with the best model
        print("Erstelle Vorhersagen für Testdaten...")
        risk_scores = best_model.predict(X_test)
        
        # Check whether the ID column is present
        if 'ID' in X_test.columns:
            results = pd.DataFrame({'ID': X_test['ID'], 'prediction': risk_scores})
        else:
            results = pd.DataFrame({'prediction': risk_scores})
        
        # Save the results to modell_2.csv
        results.to_csv(os.path.join(current_dir, "modell_2.csv"), index=False)
        print("Ergebnisse wurden in results.csv gespeichert.")
    
    model_2_main()

# Model 3 (LGBMClassifier)

def model_3():

    # 1. Load data
    train_df = pd.read_csv("data/train.csv")
    test_df = pd.read_csv("data/test.csv")
    
    # Helper: data cleaning
    def clean_data(df: pd.DataFrame) -> pd.DataFrame:
        cleaned = df.copy()
        
        # Normalize the target: map text labels to 1/0 and keep only rows with a valid 0/1 value
        if 'efs' in cleaned.columns:
            cleaned['efs'] = cleaned['efs'].replace({'Event': 1, 'Censoring': 0})
            cleaned = cleaned[cleaned['efs'].isin([0, 1])]
        
        return cleaned
    
    # 2. Preprocessing pipeline (ensures train/test consistency)
    cat_cols = ["dri_score", "cyto_score", "graft_type", 
               "conditioning_intensity", "cmv_status", "prim_disease_hct"]
    num_cols = ["comorbidity_score", "age_at_hct", "donor_age", 
               "hla_high_res_8", "karnofsky_score"]
    
    preprocessor = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', Pipeline([
            ('impute', SimpleImputer(strategy='median')),
        ]), num_cols)
    ], remainder='drop')
    
    # 3. Clean and preprocess the data
    train_raw = train_df
    test_raw = test_df
    
    train_df = clean_data(train_raw)
    test_df = clean_data(test_raw)
    
    # IMPORTANT: fit the preprocessor once on the training data and reuse it for the test data
    X_train = preprocessor.fit_transform(train_df)
    y_train = train_df['efs'].values
    X_test = preprocessor.transform(test_df)  # transformed once, reused in every fold
    
    # 4. Model configuration
    model = LGBMClassifier(
        objective="binary",
        class_weight="balanced",
        n_estimators=1000,
        learning_rate=0.05,
        random_state=42
    )
    
    # 5. Cross-Validation
    cv = StratifiedKFold(n_splits=5)
    test_preds = []
    progress_bar = tqdm(cv.split(X_train, y_train), total=5, desc="Training Folds")
    
    for fold, (train_idx, val_idx) in enumerate(progress_bar):
        X_tr, y_tr = X_train[train_idx], y_train[train_idx]
        X_val, y_val = X_train[val_idx], y_train[val_idx]
        
        model.fit(
            X_tr, y_tr,
            eval_set=[(X_val, y_val)],
            callbacks=[early_stopping(stopping_rounds=50)],
            eval_metric="auc",
        )
        
        # Predict on the already preprocessed test set
        test_preds.append(model.predict_proba(X_test)[:, 1])  
    
    # 6. Ensembling
    final_preds = np.mean(test_preds, axis=0)
    
    # Save the final predictions to a file
    submission_df = pd.DataFrame({
        "ID": test_df["ID"],  # IDs taken from the test set
        "prediction": final_preds  # fold-averaged probability of efs = 1
    })
    
    # Write to CSV without the index
    submission_df.to_csv("modell_3.csv", index=False)
    print("Vorhersagen gespeichert in predictions_model_3.csv")
    
    print("done")


    

In [2]:
model_1()
model_2()
model_3()

Starte Training...


[Parallel(n_jobs=100)]: Using backend ThreadingBackend with 100 concurrent workers.


building tree 1 of 500
building tree 2 of 500
building tree 3 of 500
building tree 4 of 500
building tree 5 of 500
building tree 6 of 500
building tree 7 of 500
building tree 8 of 500
building tree 9 of 500
building tree 10 of 500
building tree 11 of 500
building tree 12 of 500
building tree 13 of 500
building tree 14 of 500
building tree 15 of 500
building tree 16 of 500
building tree 17 of 500
building tree 18 of 500
building tree 19 of 500
building tree 20 of 500
building tree 21 of 500
building tree 22 of 500
building tree 23 of 500
building tree 24 of 500
building tree 25 of 500
building tree 26 of 500
building tree 27 of 500
building tree 28 of 500
building tree 29 of 500
building tree 30 of 500
building tree 31 of 500
building tree 32 of 500
building tree 33 of 500
building tree 34 of 500
building tree 35 of 500
building tree 36 of 500
building tree 37 of 500
building tree 38 of 500
building tree 39 of 500
building tree 40 of 500
building tree 41 of 500
building tree 42 of 500
b

[Parallel(n_jobs=100)]: Done 165 tasks      | elapsed:   17.3s


building tree 268 of 500
building tree 269 of 500
building tree 270 of 500
building tree 271 of 500
building tree 272 of 500
building tree 273 of 500
building tree 274 of 500
building tree 275 of 500
building tree 276 of 500
building tree 277 of 500
building tree 278 of 500
building tree 279 of 500
building tree 280 of 500
building tree 281 of 500
building tree 282 of 500
building tree 283 of 500
building tree 284 of 500
building tree 285 of 500
building tree 286 of 500
building tree 287 of 500
building tree 288 of 500
building tree 289 of 500
building tree 290 of 500
building tree 291 of 500
building tree 292 of 500
building tree 293 of 500
building tree 294 of 500
building tree 295 of 500
building tree 296 of 500
building tree 297 of 500
building tree 298 of 500
building tree 299 of 500
building tree 300 of 500
building tree 301 of 500
building tree 302 of 500
building tree 303 of 500
building tree 304 of 500
building tree 305 of 500
building tree 306 of 500
building tree 307 of 500


[Parallel(n_jobs=100)]: Done 500 out of 500 | elapsed:   47.2s finished
[Parallel(n_jobs=100)]: Using backend ThreadingBackend with 100 concurrent workers.
[Parallel(n_jobs=100)]: Done 165 tasks      | elapsed:    0.1s


Ausgewählter Zeitpunkt für Vorhersage: 9.7965


[Parallel(n_jobs=100)]: Done 500 out of 500 | elapsed:    0.2s finished


✅ Vorhersagen erfolgreich gespeichert
Starte Hyperparameter Tuning...
Fitting 3 folds for each of 27 candidates, totalling 81 fits
Beste Parameter gefunden: {'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 300}
Bester Score (negatives MSE): -0.20447468740048724
Erstelle Vorhersagen für Testdaten...
Ergebnisse wurden in results.csv gespeichert.


Training Folds:   0%|                                     | 0/5 [00:00<?, ?it/s]

[LightGBM] [Info] Number of positive: 12426, number of negative: 10614
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001994 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 636
[LightGBM] [Info] Number of data points in the train set: 23040, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Training until validation scores don't improve for 50 rounds


Training Folds:  20%|█████▊                       | 1/5 [00:00<00:03,  1.02it/s]

Early stopping, best iteration is:
[73]	valid_0's auc: 0.721179	valid_0's binary_logloss: 0.612338
[LightGBM] [Info] Number of positive: 12426, number of negative: 10614
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001623 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 636
[LightGBM] [Info] Number of data points in the train set: 23040, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Training until validation scores don't improve for 50 rounds


Training Folds:  40%|███████████▌                 | 2/5 [00:01<00:02,  1.02it/s]

Early stopping, best iteration is:
[134]	valid_0's auc: 0.700997	valid_0's binary_logloss: 0.623178
[LightGBM] [Info] Number of positive: 12426, number of negative: 10614
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001395 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 636
[LightGBM] [Info] Number of data points in the train set: 23040, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Training until validation scores don't improve for 50 rounds


Training Folds:  60%|█████████████████▍           | 3/5 [00:02<00:01,  1.08it/s]

Early stopping, best iteration is:
[116]	valid_0's auc: 0.717004	valid_0's binary_logloss: 0.614395
[LightGBM] [Info] Number of positive: 12425, number of negative: 10615
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001373 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 636
[LightGBM] [Info] Number of data points in the train set: 23040, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Training until validation scores don't improve for 50 rounds


Training Folds:  80%|███████████████████████▏     | 4/5 [00:03<00:00,  1.11it/s]

Early stopping, best iteration is:
[105]	valid_0's auc: 0.713506	valid_0's binary_logloss: 0.616414
[LightGBM] [Info] Number of positive: 12425, number of negative: 10615
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001748 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 636
[LightGBM] [Info] Number of data points in the train set: 23040, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
Training until validation scores don't improve for 50 rounds


Training Folds: 100%|█████████████████████████████| 5/5 [00:04<00:00,  1.06it/s]

Early stopping, best iteration is:
[136]	valid_0's auc: 0.711721	valid_0's binary_logloss: 0.619048
Vorhersagen gespeichert in predictions_model_3.csv
done
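
The five validation AUCs in the LightGBM log above are easier to compare once summarized. The following is a small sketch that only restates numbers copied from that log; the variable names are chosen here for illustration.

```python
import numpy as np

# Best-iteration validation AUC per fold, copied from the LightGBM log above
fold_auc = np.array([0.721179, 0.700997, 0.717004, 0.713506, 0.711721])
# Best iteration per fold, as reported by early stopping
best_iter = np.array([73, 134, 116, 105, 136])

print(f"mean AUC: {fold_auc.mean():.4f}, sample std: {fold_auc.std(ddof=1):.4f}")
print(f"best iterations: min={best_iter.min()}, max={best_iter.max()}")
```

Every fold stops well before the configured `n_estimators=1000`, so early stopping rather than the tree limit determines the model size in this run.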





In [3]:
# Read modell_1.csv, modell_2.csv and modell_3.csv, merge them on ID and average the predictions
df1 = pd.read_csv("modell_1.csv")
df2 = pd.read_csv("modell_2.csv")
df3 = pd.read_csv("modell_3.csv")
submission = pd.merge(df1, df2, on='ID')
submission = pd.merge(submission, df3, on='ID')
# After the two merges the columns are prediction_x (model 1), prediction_y (model 2) and prediction (model 3)
submission['prediction'] = submission[['prediction_x', 'prediction_y', 'prediction']].mean(axis=1)
submission = submission[['ID', 'prediction']]
submission.to_csv("submission.csv", index=False)
print("Submission file created successfully.")
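
One point worth checking before averaging: modell_1.csv stores a survival probability at the median time (larger values mean lower risk), whereas modell_3.csv stores `predict_proba(...)[:, 1]`, the probability of `efs = 1` (larger values mean higher risk), and modell_2.csv stores a regression on `efs`. The sketch below shows one possible way to put such scores on a common orientation and scale before blending. It is not part of the original notebook: the helper `to_risk_rank` is hypothetical, the values are made up, and rank averaging is only a sensible choice under the assumption that the evaluation metric depends on the ordering of the predictions alone.

```python
import pandas as pd


def to_risk_rank(scores, higher_is_risk=True):
    """Convert raw scores to ranks in (0, 1], oriented so that larger means higher risk."""
    s = pd.Series(scores, dtype=float)
    if not higher_is_risk:
        # Flip survival-style scores so that they point in the risk direction
        s = -s
    return s.rank(method="average") / len(s)


# Made-up values for illustration only
surv_prob = [0.9, 0.4, 0.6]    # model-1 style: survival probability
event_score = [0.1, 0.7, 0.5]  # model-2/3 style: event score

blend = (to_risk_rank(surv_prob, higher_is_risk=False) + to_risk_rank(event_score)) / 2
print(blend.tolist())
```

In this toy example both inputs agree on the ordering once the survival probabilities are flipped, so the blended ranks preserve that ordering.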