# Spaceship Titanic - Model Training

This notebook trains and evaluates machine learning classifiers for the [Spaceship Titanic Kaggle competition](https://www.kaggle.com/competitions/spaceship-titanic). The goal is binary classification: predict whether a passenger was transported to an alternate dimension.

## Models Covered
1. **Random Forest** - Ensemble of decision trees using bagging
2. **XGBoost** - Gradient boosting framework (often a top performer on tabular data)
3. **MLP (Multi-Layer Perceptron)** - Neural network for tabular classification

Each model section includes hyperparameter tuning via `RandomizedSearchCV` with 5-fold stratified cross-validation.

# 1. Imports

All required libraries for data processing and model training.

In [1]:
# core
import numpy as np
import pandas as pd

# sklearn - model selection & preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# sklearn - models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# xgboost
from xgboost import XGBClassifier

# distributions for hyperparameter search
from scipy.stats import randint, uniform
from sklearn.utils import check_random_state

# local preprocessing pipeline
from utils import preprocessing_pipeline as p

# 2. Data Loading & Preprocessing

We use a custom preprocessing pipeline (`utils/preprocessing_pipeline.py`) that handles:
- **Feature decomposition**: Extracts components from PassengerId (group info) and Cabin (deck/side)
- **Feature construction**: Creates GroupSize, TotalSpent, spending ratios
- **Age binning**: Uses decision-tree-based optimal binning
- **Imputation**: Handles missing values using domain logic (e.g., CryoSleep passengers have zero spending) and KNN for remaining gaps

The pipeline returns fitted parameters from training data to ensure consistent transformation on test data (preventing data leakage).
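The fit-on-train, apply-to-test pattern described above can be sketched with scikit-learn's `KNNImputer`. This is an illustration only: the toy values below are made up, and the internals of `preprocessing_pipeline` are not shown in this notebook, so its actual implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy stand-ins for spending columns (hypothetical values, not competition data)
train = pd.DataFrame({
    "FoodCourt": [0.0, 10.0, 20.0, 30.0, 40.0],
    "Spa": [0.0, 12.0, np.nan, 33.0, 41.0],
    "VRDeck": [5.0, np.nan, 15.0, 20.0, 45.0],
})
test = pd.DataFrame({
    "FoodCourt": [11.0, 29.0],
    "Spa": [np.nan, 30.0],
    "VRDeck": [7.0, np.nan],
})

# fit on training rows only, so no test-set information reaches the imputer
imputer = KNNImputer(n_neighbors=2).fit(train)

# reuse the same fitted imputer for both splits
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```

Missing test values are filled from neighbours found in the training rows, and observed values pass through unchanged.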

In [2]:
# load raw data
train_raw = pd.read_csv('data/train.csv')
test_raw = pd.read_csv('data/test.csv')

# apply preprocessing pipeline
train_processed, train_params = p.preprocess_train(train_raw)
test_processed = p.preprocess_test(test_raw, train_params)

print(f"Training samples: {len(train_processed)}")
print(f"Test samples: {len(test_processed)}")
train_processed.head()

Step 1: Decomposing PassengerId and Cabin...
Step 2: Constructing features (GroupSize, TotalSpent, etc.)...
Step 3: Binning Age into Age_Group...
Step 4: Imputing spending and CryoSleep...
Step 5: Imputing categorical features...




Step 6: Final CryoSleep/spending cleanup...
Step 6b: KNN imputing remaining spending...
Step 7: Recalculating spending features...
Done! Training data preprocessed.
HomePlanet      True
CryoSleep       True
Destination     True
VIP             True
RoomService     True
FoodCourt       True
ShoppingMall    True
Spa             True
VRDeck          True
Transported     True
Deck            True
CabinNum        True
Side            True
GroupSize       True
Age_Group       True
dtype: bool
Step 1: Decomposing PassengerId and Cabin...
Step 2: Constructing features (GroupSize, TotalSpent, etc.)...
Step 3: Applying Age binning (using train split points)...
Step 4: Imputing spending and CryoSleep...
Step 5: Imputing categorical features...
Step 6: Final CryoSleep/spending cleanup...
Step 6b: KNN imputing remaining spending (using train-fitted imputer)...
Step 7: Recalculating spending features...
Done! Test data preprocessed.
Training samples: 8693
Test samples: 4277




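|   | HomePlanet | CryoSleep | Destination | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported | Deck | CabinNum | Side | GroupSize | Age_Group |
|---|------------|-----------|-------------|-----|-------------|-----------|--------------|-----|--------|-------------|------|----------|------|-----------|-----------|
| 0 | Europa | False | TRAPPIST-1e | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | B | 0.0 | P | 1 | Age_Bin_11 |
| 1 | Earth | False | TRAPPIST-1e | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | F | 0.0 | S | 1 | Age_Bin_5 |
| 2 | Europa | False | TRAPPIST-1e | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | A | 0.0 | S | 2 | Age_Bin_13 |
| 3 | Europa | False | TRAPPIST-1e | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | A | 0.0 | S | 2 | Age_Bin_10 |
| 4 | Earth | False | TRAPPIST-1e | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | F | 1.0 | S | 1 | Age_Bin_3 |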


# 3. Feature Encoding for ML Models

Most scikit-learn models require **numeric inputs**. Our preprocessed data still contains:
- **Boolean columns** (`CryoSleep`, `VIP`, `Transported`) → convert to 0/1
- **Ordinal or binary categories** (`Deck`, `Age_Group`, and the two-valued `Side`) → label encode to a single integer column each
- **Nominal categories** (`HomePlanet`, `Destination`) → encoding depends on model type!

## Two Encoding Strategies

We provide **two encoding functions** because different models have different needs:

| Model Type | Function | Categorical Encoding | Missing Values |
|------------|----------|---------------------|----------------|
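| MLP, Logistic Regression, Random Forest (as used in this notebook) | `encode_features_for_ml()` | One-hot for nominal columns, integer codes for `Deck`/`Side`/`Age_Group` | Fill with 0 |
| XGBoost | `encode_features_for_xgboost()` | pandas `category` dtype (native categorical support) | Keep NaN |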

**Why the difference?**
- **Neural networks** treat features as continuous values. Label encoding (1, 2, 3) implies "3 is closer to 2 than to 1" which is false for categories like "Earth, Europa, Mars". One-hot avoids this.
- **Tree-based models** make binary splits on one feature at a time, so they can express "category == X" rules on a single encoded column without one-hot encoding. Fewer features means faster training and often comparable or better accuracy.
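The two encodings can be compared on a tiny example. The `planets` frame below is a made-up stand-in for the `HomePlanet` column:

```python
import pandas as pd

# hypothetical sample of the HomePlanet column
planets = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# integer codes: compact, but the numbers suggest an order that does not exist
codes = planets["HomePlanet"].astype("category").cat.codes

# one-hot: one indicator column per category, no implied order
one_hot = pd.get_dummies(planets, columns=["HomePlanet"], dtype=int)

print(codes.tolist())  # [0, 1, 2, 0]
print(one_hot)
```

`encode_features_for_ml()` additionally passes `drop_first=True`, which drops the first indicator column because it is implied by the others.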

In [3]:
def encode_features_for_ml(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convert preprocessed DataFrame to ML-ready format for sklearn models.
    Uses one-hot encoding for nominal categoricals (needed for MLP/logistic regression).
    """
    df = df.copy()

    # convert booleans to integers (handle both bool dtype and object with True/False)
    bool_cols = df.select_dtypes(include=['bool']).columns.tolist()
    for c in bool_cols:
        df[c] = df[c].astype(int)
    
    # also handle object columns that contain boolean-like values
    for c in ['CryoSleep', 'VIP']:
        if c in df.columns and df[c].dtype == 'object':
            df[c] = df[c].map({True: 1, False: 0, 'True': 1, 'False': 0}).fillna(0).astype(int)

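    # label encode Deck and Side with the encoders fitted on the training data;
    # the pd.factorize fallback assigns codes by order of appearance, so codes
    # may differ between train and test if the fallback is ever triggered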
    try:
        df['Deck'] = train_params.deck_encoder.transform(df['Deck'])
    except Exception:
        df['Deck'] = pd.factorize(df['Deck'])[0]
    try:
        df['Side'] = train_params.side_encoder.transform(df['Side'])
    except Exception:
        df['Side'] = pd.factorize(df['Side'])[0]

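    # encode Age_Group as integer codes; if the column holds plain strings, codes
    # follow sorted label order ('Age_Bin_10' sorts before 'Age_Bin_3'), not bin order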
    if 'Age_Group' in df.columns:
        df['Age_Group'] = pd.Categorical(df['Age_Group']).codes

    # one-hot encode nominal categorical features (for MLP/linear models)
    cat_cols = [c for c in ['HomePlanet', 'Destination', 'TravelAcompanyStatus'] if c in df.columns]
    if cat_cols:
        df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

    # drop identifier columns
    for col in ('PassengerId', 'Name'):
        if col in df.columns:
            df.drop(col, axis=1, inplace=True)

    # fill any remaining NaN with 0
    df = df.fillna(0)

    return df


def encode_features_for_xgboost(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convert preprocessed DataFrame to XGBoost-optimized format.
    
    Key differences from general encoding:
    1. No one-hot encoding - each categorical stays a single column
    2. Keep NaN values - XGBoost learns optimal direction for missing values
    3. Use pandas Categorical for native XGBoost categorical support
    """
    df = df.copy()

    # convert boolean columns to int (handle both bool dtype and object with True/False)
    bool_cols = df.select_dtypes(include=['bool']).columns.tolist()
    for c in bool_cols:
        df[c] = df[c].astype(int)
    
    # explicitly handle CryoSleep and VIP which may be object type with True/False values
    for c in ['CryoSleep', 'VIP', 'Transported']:
        if c in df.columns:
            if df[c].dtype == 'object' or df[c].dtype == 'bool':
                df[c] = df[c].map({True: 1, False: 0, 'True': 1, 'False': 0})
                df[c] = pd.to_numeric(df[c], errors='coerce').fillna(0).astype(int)

    # convert categorical columns to pandas Categorical type
    cat_cols = ['Deck', 'Side', 'HomePlanet', 'Destination', 'Age_Group']
    for col in cat_cols:
        if col in df.columns:
            df[col] = df[col].astype('category')

    # drop identifier columns
    for col in ('PassengerId', 'Name', 'TravelAcompanyStatus'):
        if col in df.columns:
            df.drop(col, axis=1, inplace=True)

    # IMPORTANT: do NOT fill NaN for numeric columns - XGBoost handles them natively!
    
    return df

# 4. Train/Test Split

We split the labeled training data into:
- **70% training set** - used for model fitting and cross-validation
- **30% holdout test set** - used for final evaluation (simulates unseen data)

**Why stratified sampling?**  
The `stratify=y` parameter ensures both splits maintain the same class distribution as the original data. This is crucial for classification tasks, especially if classes are imbalanced - otherwise one split might have different class proportions leading to biased evaluation.
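The effect of `stratify` is easiest to see on imbalanced labels. A minimal sketch, assuming a made-up label vector with 10% positives (the competition data itself is close to balanced):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 100 positives, 900 negatives
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# both splits keep the 10% positive rate of the full data
print(f"overall: {y.mean():.3f}")
print(f"train:   {y_tr.mean():.3f}")
print(f"holdout: {y_te.mean():.3f}")
```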

In [4]:
# encode features for ML models
train_encoded = encode_features_for_ml(train_processed)

# separate features and target
X = train_encoded.drop('Transported', axis=1)
y = train_encoded['Transported']

# stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True, stratify=y
)

# cross-validation strategy (reused across all models)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Features: {X_train.shape[1]}")
print(f"Target distribution - Train: {y_train.mean():.2%} positive")
print(f"Target distribution - Test: {y_test.mean():.2%} positive")

Training set: 6085 samples
Test set: 2608 samples
Features: 16
Target distribution - Train: 50.37% positive
Target distribution - Test: 50.35% positive


# 5. Random Forest Classifier

**Random Forest** is an ensemble method that builds multiple decision trees and aggregates their predictions (bagging). It's an excellent baseline for tabular data because:
- Handles mixed feature types naturally
- Robust to outliers and noise
- Provides feature importance rankings
- Requires minimal preprocessing (no scaling needed)
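The feature importance rankings mentioned above can be read from a fitted forest's `feature_importances_` attribute. A minimal sketch on synthetic data (the dataset and the `f0`..`f5` names are illustrative assumptions, not the competition features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data: 3 informative features and 3 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# impurity-based importances are non-negative and sum to 1
ranked = sorted(zip(rf.feature_importances_, feature_names), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

In this notebook the same attribute would be available on `best_rf` after the search, with `X_train.columns` supplying the names.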

## Key Hyperparameters

| Parameter | Description | Search Range |
|-----------|-------------|--------------|
| `n_estimators` | Number of trees in the forest | 300-699 |
| `max_depth` | Maximum tree depth (None = unlimited) | None, 5-60 (steps of 5) |
| `min_samples_split` | Minimum samples to split a node | 2-30 |
| `min_samples_leaf` | Minimum samples in a leaf | 1-20 |
| `bootstrap` | Whether to use bootstrap sampling | True/False |
| `class_weight` | Weighting for imbalanced classes | balanced, balanced_subsample, None |

**Trade-offs:**
- More trees (`n_estimators`) = lower variance with diminishing returns, at the cost of slower training
- Deeper trees (`max_depth`) = more complex patterns but risk overfitting
- Higher `min_samples_split/leaf` = regularization, prevents overfitting

In [5]:
# baseline model without tuning (fixed n_estimators=600, balanced class weights)
baseline_rf = RandomForestClassifier(
    n_estimators=600, random_state=42, class_weight="balanced", n_jobs=-1
)
baseline_scores = cross_val_score(baseline_rf, X_train, y_train, cv=cv, scoring="accuracy", n_jobs=-1)
print(f"Baseline RF CV Accuracy: {baseline_scores.mean():.4f} ± {baseline_scores.std():.4f}")

# hyperparameter search space
param_distributions_rf = {
    "n_estimators": randint(300, 700),
    "max_depth": [None] + list(range(5, 61, 5)),
    "min_samples_split": randint(2, 31),
    "min_samples_leaf": randint(1, 21),
    "bootstrap": [True, False],
    "class_weight": ["balanced", "balanced_subsample", None],
}

# randomized search with cross-validation
search_rf = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_distributions_rf,
    n_iter=60,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=1,
    refit=True,
)

search_rf.fit(X_train, y_train)

print(f"\nBest CV accuracy: {search_rf.best_score_:.4f}")
print(f"Best params: {search_rf.best_params_}")

best_rf = search_rf.best_estimator_
rf_test_acc = best_rf.score(X_test, y_test)
print(f"Test accuracy: {rf_test_acc:.4f}")

Baseline RF CV Accuracy: 0.7975 ± 0.0070
Fitting 5 folds for each of 60 candidates, totalling 300 fits

Best CV accuracy: 0.8036
Best params: {'bootstrap': True, 'class_weight': None, 'max_depth': 25, 'min_samples_leaf': 8, 'min_samples_split': 28, 'n_estimators': 436}
Test accuracy: 0.8014


# 6. XGBoost Classifier

**XGBoost** (eXtreme Gradient Boosting) builds trees sequentially, where each new tree corrects errors made by previous trees. Unlike Random Forest's parallel bagging, XGBoost uses **gradient boosting**:

1. Train initial model on data
2. Compute residual errors (gradient of loss)
3. Train next tree to predict residuals
4. Repeat, combining predictions with learning rate shrinkage
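The four steps above can be sketched from scratch for squared-error regression, where the negative gradient of the loss is exactly the residual. This is a simplified illustration on made-up data: XGBoost itself adds second-order gradient information, regularization, and other refinements that are not shown here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up 1-D regression problem
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# step 1: the initial model is a constant (the mean minimizes squared error)
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # step 2: for squared error, the negative gradient is the residual
    residuals = y_demo - prediction
    # step 3: fit a shallow tree to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    trees.append(tree)
    # step 4: add the correction, shrunk by the learning rate
    prediction += learning_rate * tree.predict(X_demo)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```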

## Why XGBoost Needs Different Encoding

**Critical insight**: XGBoost (and tree-based models in general) work *differently* than neural networks:

| Aspect | One-Hot Encoding | Label Encoding |
|--------|------------------|----------------|
| **How it works** | Creates N binary columns | Single column with integers |
| **For trees** | ❌ Fragments splits, needs multiple decisions | ✅ Single split can separate categories |
| **For NNs** | ✅ No ordinal assumption | ❌ Implies false ordering |

**Missing values**: XGBoost learns the optimal split direction for NaN values during training. Filling with 0 removes this advantage!

## Key Hyperparameters

| Parameter | Description | Search Range |
|-----------|-------------|--------------|
| `n_estimators` | Number of boosting rounds | 200-599 |
| `max_depth` | Maximum tree depth (keep low to prevent overfitting) | 3-7 |
| `learning_rate` | Step size shrinkage (η) | 0.01-0.2 |
| `subsample` | Row sampling ratio per tree | 0.7-1.0 |
| `colsample_bytree` | Column sampling ratio per tree | 0.7-1.0 |
| `min_child_weight` | Minimum sum of instance weight in child | 1-9 |
| `gamma` | Minimum loss reduction for split | 0-0.3 |
| `reg_alpha` | L1 regularization on leaf weights | 0-0.5 |
| `reg_lambda` | L2 regularization on leaf weights | 0.5-2.0 |

**Tip:** Lower `learning_rate` + more `n_estimators` often generalizes better, at the cost of slower training.

In [6]:
# use XGBoost-optimized encoding (pandas category dtype, keep NaN)
train_xgb = encode_features_for_xgboost(train_processed)
test_xgb = encode_features_for_xgboost(test_processed)

X_xgb = train_xgb.drop('Transported', axis=1)
y_xgb = train_xgb['Transported']

X_train_xgb, X_test_xgb, y_train_xgb, y_test_xgb = train_test_split(
    X_xgb, y_xgb, test_size=0.3, random_state=42, shuffle=True, stratify=y_xgb
)

print(f"XGBoost training features: {X_train_xgb.shape[1]} (vs {X_train.shape[1]} for one-hot encoded)")
print(f"Categorical columns: {X_train_xgb.select_dtypes('category').columns.tolist()}")

# baseline XGBoost with native categorical support
baseline_xgb = XGBClassifier(
    n_estimators=300, 
    random_state=42, 
    n_jobs=-1, 
    eval_metric='logloss',
    enable_categorical=True,  # enable native categorical support
    tree_method='hist'        # required for categorical support
)
baseline_xgb_scores = cross_val_score(baseline_xgb, X_train_xgb, y_train_xgb, cv=cv, scoring="accuracy", n_jobs=-1)
print(f"\nBaseline XGBoost CV Accuracy: {baseline_xgb_scores.mean():.4f} ± {baseline_xgb_scores.std():.4f}")

# improved hyperparameter search space
param_distributions_xgb = {
    "n_estimators": randint(200, 600),
    "max_depth": randint(3, 8),                    # keep shallow to prevent overfitting
    "learning_rate": uniform(0.01, 0.19),          # 0.01 to 0.2
    "subsample": uniform(0.7, 0.3),                # 0.7 to 1.0
    "colsample_bytree": uniform(0.7, 0.3),         # 0.7 to 1.0
    "min_child_weight": randint(1, 10),
    "gamma": uniform(0, 0.3),                      # regularization
    "reg_alpha": uniform(0, 0.5),                  # L1 regularization
    "reg_lambda": uniform(0.5, 1.5),               # L2 regularization
}

# randomized search
search_xgb = RandomizedSearchCV(
    estimator=XGBClassifier(
        random_state=42, 
        n_jobs=-1, 
        eval_metric='logloss',
        enable_categorical=True,
        tree_method='hist'
    ),
    param_distributions=param_distributions_xgb,
    n_iter=80,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=1,
    refit=True,
)

search_xgb.fit(X_train_xgb, y_train_xgb)

print(f"\nBest CV accuracy: {search_xgb.best_score_:.4f}")
print(f"Best params: {search_xgb.best_params_}")

best_xgb = search_xgb.best_estimator_
xgb_test_acc = best_xgb.score(X_test_xgb, y_test_xgb)
print(f"Test accuracy: {xgb_test_acc:.4f}")

XGBoost training features: 14 (vs 16 for one-hot encoded)
Categorical columns: ['HomePlanet', 'Destination', 'Deck', 'Side', 'Age_Group']

Baseline XGBoost CV Accuracy: 0.7869 ± 0.0115
Fitting 5 folds for each of 80 candidates, totalling 400 fits

Best CV accuracy: 0.8110
Best params: {'colsample_bytree': np.float64(0.8691895276056508), 'gamma': np.float64(0.2522129957633793), 'learning_rate': np.float64(0.026948822455290677), 'max_depth': 6, 'min_child_weight': 9, 'n_estimators': 344, 'reg_alpha': np.float64(0.17146343213913012), 'reg_lambda': np.float64(1.2109549121458343), 'subsample': np.float64(0.8065312909837914)}
Test accuracy: 0.8087


# 7. MLP (Neural Network) Classifier

**Multi-Layer Perceptron (MLP)** is a feedforward neural network with one or more hidden layers. For tabular data, MLPs can capture complex non-linear relationships.

## Architecture Choices

- **Hidden layers**: 1-3 layers with 16-128 neurons each
- **Activation functions**: ReLU (most common) or Tanh
- **Regularization**: L2 penalty (`alpha`) prevents overfitting

## Key Hyperparameters

| Parameter | Description | Search Range |
|-----------|-------------|--------------|
| `hidden_layer_sizes` | Neurons per layer, e.g., (64, 32) | 1-3 layers, 16-128 neurons |
| `activation` | Activation function | relu, tanh |
| `alpha` | L2 regularization strength | 1e-6 to 1e-2 |
| `learning_rate_init` | Initial learning rate | 1e-4 to 1e-2 |
| `batch_size` | Samples per gradient update | 32, 64, 128, 256 |

**Why StandardScaler is essential for MLPs:**  
Neural networks optimize via gradient descent, which is sensitive to feature scales. If one feature ranges 0-1 and another 0-10000, gradients will be dominated by the larger feature. Scaling ensures all features contribute equally to learning.
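A small sketch of what scaling does to two features on very different ranges (the array is made up for illustration). It also shows the `with_mean=False` variant that the pipeline below uses, which rescales to unit variance without centering:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical features: one in 0-1, one in the thousands
X_demo = np.array([[0.1, 1000.0], [0.5, 5000.0], [0.9, 9000.0]])

# default: subtract the mean, divide by the standard deviation
centered = StandardScaler().fit_transform(X_demo)

# with_mean=False: divide by the standard deviation only
scale_only = StandardScaler(with_mean=False).fit_transform(X_demo)

print("centered   mean/std:", centered.mean(axis=0), centered.std(axis=0))
print("scale-only mean/std:", scale_only.mean(axis=0), scale_only.std(axis=0))
```

Both variants put the two columns on the same scale; only the default also moves their means to zero.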

**Custom samplers:** We define `HiddenLayerSampler` to randomly sample network architectures, and `LogUniform10` for logarithmic sampling of learning rates and regularization (common for parameters that span multiple orders of magnitude).
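For comparison, SciPy ships a built-in `scipy.stats.loguniform` distribution with the same log-space behaviour. The sketch below is an alternative to the custom `LogUniform10` class, not what this notebook uses; it checks that each decade between 1e-6 and 1e-2 receives roughly a quarter of the samples.

```python
import numpy as np
from scipy.stats import loguniform

# same range as the alpha search space: 1e-6 to 1e-2
alpha_dist = loguniform(1e-6, 1e-2)
samples = alpha_dist.rvs(size=10_000, random_state=42)

# count samples per decade; uniform-in-log-space means roughly equal shares
exponents = np.log10(samples)
counts, _ = np.histogram(exponents, bins=[-6, -5, -4, -3, -2])
shares = counts / len(samples)

print(shares)
```

A plain `uniform(1e-6, 1e-2)` would instead place about 90% of samples in the top decade, which is why log-space sampling is preferred for these parameters.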

In [7]:
# custom samplers for hyperparameter search
class HiddenLayerSampler:
    """Samples network architectures: tuples of layer sizes (16-128 neurons, 1-3 layers)."""
    def __init__(self, sizes=(16, 32, 64, 128), min_layers=1, max_layers=3):
        self.sizes = np.array(sizes)
        self.min_layers = min_layers
        self.max_layers = max_layers

    def rvs(self, random_state=None):
        rng = check_random_state(random_state)
        n_layers = rng.randint(self.min_layers, self.max_layers + 1)
        return tuple(rng.choice(self.sizes, size=n_layers, replace=True))

class LogUniform10:
    """Samples 10^U(low_exp, high_exp) - uniform in log space."""
    def __init__(self, low_exp, high_exp):
        self.low_exp = low_exp
        self.high_exp = high_exp

    def rvs(self, random_state=None):
        rng = check_random_state(random_state)
        return float(10 ** rng.uniform(self.low_exp, self.high_exp))

# MLP pipeline: scaler + classifier
mlp_pipeline = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),
    ("mlp", MLPClassifier(
        random_state=42,
        solver="adam",
        early_stopping=True,
        validation_fraction=0.1,
        n_iter_no_change=15,
        max_iter=400,
        tol=1e-4
    ))
])

# hyperparameter search space
param_distributions_mlp = {
    "mlp__hidden_layer_sizes": HiddenLayerSampler(sizes=(16, 32, 64, 128), min_layers=1, max_layers=3),
    "mlp__activation": ["relu", "tanh"],
    "mlp__alpha": LogUniform10(-6, -2),              # 1e-6 to 1e-2
    "mlp__learning_rate_init": LogUniform10(-4, -2), # 1e-4 to 1e-2
    "mlp__batch_size": [32, 64, 128, 256],
}

# randomized search (more iterations for neural networks due to higher variance)
search_mlp = RandomizedSearchCV(
    estimator=mlp_pipeline,
    param_distributions=param_distributions_mlp,
    n_iter=500,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
    random_state=42,
    verbose=2,
    refit=True
)

search_mlp.fit(X_train, y_train)

print(f"\nBest CV accuracy: {search_mlp.best_score_:.4f}")
print(f"Best params: {search_mlp.best_params_}")

best_mlp = search_mlp.best_estimator_
mlp_test_acc = accuracy_score(y_test, best_mlp.predict(X_test))
print(f"Test accuracy: {mlp_test_acc:.4f}")

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
[CV] END mlp__activation=relu, mlp__alpha=0.0015352246941973478, mlp__batch_size=128, mlp__hidden_layer_sizes=(np.int64(128), np.int64(16), np.int64(16)), mlp__learning_rate_init=0.00020513382630874485; total time=   0.5s
[CV] END mlp__activation=relu, mlp__alpha=2.5113061677389973e-06, mlp__batch_size=128, mlp__hidden_layer_sizes=(np.int64(128),), mlp__learning_rate_init=0.0001930783753654713; total time=   0.4s
[CV] END mlp__activation=relu, mlp__alpha=2.5113061677389973e-06, mlp__batch_size=128, mlp__hidden_layer_sizes=(np.int64(128),), mlp__learning_rate_init=0.0001930783753654713; total time=   0.6s
[CV] END mlp__activation=relu, mlp__alpha=2.5113061677389973e-06, mlp__batch_size=128, mlp__hidden_layer_sizes=(np.int64(128),), mlp__learning_rate_init=0.0001930783753654713; total time=   0.6s
[CV] END mlp__activation=relu, mlp__alpha=2.5113061677389973e-06, mlp__batch_size=128, mlp__hidden_layer_sizes=(np.int64(128),), 

KeyboardInterrupt: 

# 8. Model Comparison

*TODO: Add comparison of model performances (RF vs XGBoost vs MLP)*

Compare:
- Cross-validation accuracy
- Test set accuracy  
- Training time
- Feature importance (for tree-based models)
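As a starting point, the accuracies already printed in sections 5 and 6 can be collected into one table. This is a sketch with the numbers copied from the cell outputs above; the MLP search was interrupted, so it has no row. In a live session the values would come from `search_rf.best_score_`, `rf_test_acc`, `search_xgb.best_score_`, and `xgb_test_acc` instead of literals.

```python
import pandas as pd

# accuracies copied from the printed outputs of the RF and XGBoost cells
results = pd.DataFrame(
    {
        "model": ["Random Forest", "XGBoost"],
        "cv_accuracy": [0.8036, 0.8110],
        "test_accuracy": [0.8014, 0.8087],
    }
).set_index("model")

# gap between cross-validation and holdout accuracy
results["cv_minus_test"] = results["cv_accuracy"] - results["test_accuracy"]

print(results.sort_values("test_accuracy", ascending=False).round(4))
```

Training time and feature importance are not included because the notebook does not report them.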

# 9. Generate Submission

Use the best performing model to predict on the test set and create the submission file for Kaggle.

In [8]:
# select best model for submission (change as needed)
# options: best_rf, best_xgb, best_mlp
best_model = best_xgb
model_name = "xgb"  # "rf", "xgb", or "mlp"

# encode test data appropriately for the selected model
if model_name == "xgb":
    # XGBoost uses its own encoding
    submission_features = test_xgb
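else:
    # RF and MLP use the standard encoding; align columns with the training
    # features in case one-hot encoding yields a different column set on test data
    submission_features = encode_features_for_ml(test_processed).reindex(
        columns=X.columns, fill_value=0
    )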

# generate predictions
predictions = best_model.predict(submission_features)

# create submission DataFrame
submission = pd.DataFrame({
    'PassengerId': test_raw['PassengerId'],
    'Transported': predictions.astype(bool)
})

# save to CSV
submission.to_csv('submission.csv', index=False)

print(f"Submission saved with {len(submission)} predictions using {model_name.upper()}")
submission.head()

Submission saved with 4277 predictions using XGB


|   | PassengerId | Transported |
|---|-------------|-------------|
| 0 | 0013_01 | False |
| 1 | 0018_01 | False |
| 2 | 0019_01 | True |
| 3 | 0021_01 | True |
| 4 | 0023_01 | True |
