Step-by-Step Plan for the Notebook

Step 1: Load and Inspect Cleaned Data

- Print shapes

- Check Transported.value_counts(normalize=True) to see if the classes are balanced

Step 2: Prep Features and Target

- Drop PassengerId, Name, Surname, etc.

- Split X and y

- Encode categorical features (start with simple OneHotEncoder or pd.get_dummies)

- Optional: Use a ColumnTransformer to prep the pipeline


Step 3: Baseline Models (no tuning yet)

Pick 3 diverse classifiers:

1. RandomForestClassifier → Great benchmark, not too sensitive to scaling or encoding

2. LGBMClassifier (LightGBM) → Fast, robust, and usually competitive on tabular leaderboards

3. LogisticRegression (with regularization) → Simple linear baseline, tells you if nonlinear models are doing anything special

Use cross_val_score with a StratifiedKFold splitter and scoring='accuracy' or 'roc_auc'.
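
As a rough sketch of the cross-validation setup described above: the splitter is passed to cross_val_score through its cv argument. The data here is synthetic (make_classification), standing in for the real X and y, and the names X_demo, y_demo and cv_scores are illustrative only. Note that passing a plain integer such as cv=5 with a classifier also gives stratified folds, but without shuffling; an explicit splitter makes shuffling and the seed visible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's X and y (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Explicit splitter: stratified, shuffled, reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for scoring in ("accuracy", "roc_auc"):
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring=scoring
    )
    cv_scores[scoring] = scores
    print(f"{scoring}: {scores.mean():.4f} ± {scores.std():.4f}")
```

The same cv object can be reused for every model so that all of them are scored on identical folds.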

Step 4: Compare Results

Print:

- Mean + std dev of cross-val scores

- Confusion matrix on train set for quick insight

- Feature importances (where available)
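
A possible way to produce the confusion matrix and feature importances listed above, again on synthetic data with made-up column names (f0 to f5). One deliberate substitution: instead of a confusion matrix on plain training predictions, which tends to look optimistic, this sketch uses out-of-fold predictions from cross_val_predict.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the training data (assumption, not the real data)
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Out-of-fold predictions: each row is predicted by a model that never saw it
oof_pred = cross_val_predict(model, X_demo, y_demo, cv=5)
cm = confusion_matrix(y_demo, oof_pred)
print(cm)  # rows = true class, columns = predicted class

# Feature importances come from a model fitted on all training rows
model.fit(X_demo, y_demo)
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Inside a Pipeline, the fitted model is reachable through named_steps, and the importances then refer to the transformed (one-hot encoded) columns rather than the original ones.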

Step 5: Decide the Winner(s)

Pick the one or two best models to:

- Tune (grid/random search or Optuna)

- Calibrate (optional)

- Ensemble (optional)
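
For the tuning step, a minimal random-search sketch might look like the following. It uses synthetic data and a small stand-in pipeline; the search range for C and the step names "scale" and "model" are illustrative assumptions, not values taken from this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_distributions = {"model__C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

Because the whole pipeline is searched, preprocessing is refit inside every fold, which keeps the scaler from seeing validation rows.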

In [1]:
import pandas as pd

train = pd.read_csv('../data/train_clean.csv')
test = pd.read_csv('../data/test_clean.csv')


# Check initial shape
print("Train shape:", train.shape)
print("Test shape:", test.shape)

# Separate features and labels
X = train.drop(columns=['Transported'])
y = train['Transported'].astype(int)  # Convert boolean to int for modeling
X_test = test.copy()

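# Class balance and dtype counts. Bare expressions in the middle of a cell
# are not displayed; wrap them in print() to see the values.
y.value_counts(normalize=True)
X.dtypes.value_counts()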

cat_cols = X.select_dtypes(include='object').columns.tolist()
num_cols = X.select_dtypes(include='number').columns.tolist()

print("Categorical columns:", cat_cols)
print("Numeric columns:", num_cols)


Train shape: (8693, 33)
Test shape: (4277, 32)
Categorical columns: ['PassengerId', 'HomePlanet', 'Cabin', 'Name', 'CabinDeck', 'CabinSide', 'Surname', 'VIP_confidence', 'Destination_confidence', 'Destination_cleaned', 'Age_confidence']
Numeric columns: ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'GroupID', 'CabinNum', 'Cabin_missing', 'NoAmenities', 'CryoSleep_missing', 'RoomService_missing', 'FoodCourt_missing', 'ShoppingMall_missing', 'Spa_missing', 'VRDeck_missing', 'TotalSpend', 'GroupSize']


The output lists 11 categorical (object-dtype) columns:

`PassengerId`, `Name`, `Surname`, and `Cabin` are identifiers or high-cardinality noise, so drop these.

`CabinDeck`, `CabinSide`, `HomePlanet`, `VIP_confidence`, `Destination_confidence`, `Destination_cleaned`, and `Age_confidence` are genuine categorical features worth encoding.
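
One way to back up the "high-cardinality" call with numbers is to compare each column's distinct-value count to the row count. The sketch below uses a tiny toy frame with made-up values, and the 0.5 cutoff is a hypothetical threshold chosen for illustration, not something used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the notebook's X (hypothetical values)
X_demo = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "HomePlanet": ["Europa", "Earth", "Europa", "Europa"],
    "CabinSide": ["P", "S", "S", "S"],
    "Age": [39.0, 24.0, 58.0, 33.0],
})

# Keep only the non-numeric columns
cat_demo = X_demo.select_dtypes(exclude="number")

# Ratio of distinct values to rows: close to 1.0 means nearly every row is unique
uniq_ratio = cat_demo.nunique() / len(cat_demo)
print(uniq_ratio.sort_values(ascending=False))

# Hypothetical cutoff: flag columns where more than half the rows are unique
high_card = uniq_ratio[uniq_ratio > 0.5].index.tolist()
print("Candidates to drop:", high_card)
```

On the real data the ratio alone is not the whole story: a column such as Cabin may still carry signal once split into parts, which is what CabinDeck, CabinNum and CabinSide already do here.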

In [2]:
# Drop identifier columns that won’t help the model
drop_cols = ['PassengerId', 'Name', 'Surname', 'Cabin']
X = X.drop(columns=drop_cols)
X_test = X_test.drop(columns=drop_cols)

# Redefine categorical columns
cat_cols = ['CabinDeck', 'CabinSide', 'HomePlanet', 'VIP_confidence',
            'Destination_confidence', 'Destination_cleaned', 'Age_confidence']

# Treat every remaining column as numeric (this also picks up any non-object columns such as booleans)
num_cols = [col for col in X.columns if col not in cat_cols]


Define the ColumnTransformer and Pipeline

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

# Preprocessing: encode categoricals and scale numerics
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols)
    ]
)
# Note: sparse_output=False returns a dense NumPy array, which every model below accepts.
# The parameter requires scikit-learn 1.2+ (older versions call it `sparse`).

Define the Model Pipelines

In [5]:
# Define models to test
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'LightGBM': LGBMClassifier(random_state=42)
}

# Wrap each model with preprocessing pipeline
pipelines = {
    name: Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
    for name, model in models.items()
}


Cross-Validation Function

In [6]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, accuracy_score
import numpy as np

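def evaluate_models(X, y, pipelines, scoring='accuracy', cv=5):
    """Cross-validate each pipeline and return {name: array of per-fold scores}.

    With an integer cv and a classifier, scikit-learn uses stratified
    folds without shuffling.
    """
    results = {}
    for name, pipe in pipelines.items():
        print(f"Evaluating {name}...")
        scores = cross_val_score(pipe, X, y, cv=cv, scoring=scoring)
        print(f"  {scoring}: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
        results[name] = scores
    return results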


In [7]:
evaluate_models(X, y, pipelines, scoring='accuracy')
evaluate_models(X, y, pipelines, scoring='f1')


Evaluating RandomForest...
  accuracy: 0.7728 ± 0.0368
Evaluating LogisticRegression...
  accuracy: 0.7832 ± 0.0100
Evaluating LightGBM...
[LightGBM] [Info] Number of positive: 3502, number of negative: 3452
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000537 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2197
[LightGBM] [Info] Number of data points in the train set: 6954, number of used features: 41
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.503595 -> initscore=0.014380
[LightGBM] [Info] Start training from score 0.014380
  accuracy: 0.6804 ± 0.0761
Evaluating RandomForest...
  f1: 0.7402 ± 0.0726
Evaluating LogisticRegression...
  f1: 0.7921 ± 0.0095
Evaluating LightGBM...

{'RandomForest': array([0.62295082, 0.72637431, 0.80607083, 0.82750846, 0.71813031]),
 'LogisticRegression': array([0.77555556, 0.79847078, 0.79793341, 0.7875    , 0.80123584]),
 'LightGBM': array([0.35580524, 0.64115523, 0.72669492, 0.82005013, 0.30694981])}
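
The per-fold f1 scores in the dictionary above are easier to compare as a small table. The sketch below copies those numbers verbatim; the names f1_results and summary are illustrative. The spread per model is worth a look: the first and last LightGBM folds sit far below the middle three, which may point to how the folds are composed (for example, unshuffled folds over rows in file order) rather than to the model itself. That is a hypothesis to check, not a conclusion.

```python
import numpy as np
import pandas as pd

# Per-fold f1 scores copied from the notebook output above
f1_results = {
    "RandomForest": np.array([0.62295082, 0.72637431, 0.80607083, 0.82750846, 0.71813031]),
    "LogisticRegression": np.array([0.77555556, 0.79847078, 0.79793341, 0.7875, 0.80123584]),
    "LightGBM": np.array([0.35580524, 0.64115523, 0.72669492, 0.82005013, 0.30694981]),
}

# One row per model, sorted by mean score
summary = pd.DataFrame({
    "mean": {name: s.mean() for name, s in f1_results.items()},
    "std": {name: s.std() for name, s in f1_results.items()},
    "min": {name: s.min() for name, s in f1_results.items()},
    "max": {name: s.max() for name, s in f1_results.items()},
}).sort_values("mean", ascending=False)

print(summary.round(4))
```

Assigning the return value of evaluate_models to a variable for each scoring metric would make this table available for accuracy as well, since only the last expression of a cell is displayed.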