<a href="https://colab.research.google.com/github/Deep-Dhaduk/AutoGluon/blob/main/Automatic-Feature-Engineering/AutoGluon_Spaceship_Titanic_Drive_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AutoGluon Tabular on **Spaceship Titanic** (Drive-based)

This notebook expects **`spaceship-titanic.zip`** in your Google Drive. The extraction cell below defaults to
`/content/drive/MyDrive/AutoGluon-Datasets/spaceship-titanic.zip`; edit `ZIP_PATH` in that cell if you store the file elsewhere.

Get the ZIP by joining the Kaggle competition and downloading the data from:
```
https://www.kaggle.com/competitions/spaceship-titanic
```
Then upload the file to your Drive and run the cells below.
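
Once the file is uploaded, a quick check that the archive really contains the three CSV files can save a failed run later. The sketch below is one way to do that with the standard library. The helper name `missing_members` and the constant `EXPECTED_FILES` are illustrative and not part of this notebook; the file names are taken from the extraction output shown further down.

```python
import zipfile
from pathlib import Path

# Assumption: the competition archive holds these three files at its top level.
EXPECTED_FILES = {"train.csv", "test.csv", "sample_submission.csv"}

def missing_members(zip_path, expected=EXPECTED_FILES):
    """Return the sorted list of expected file names absent from the archive."""
    zip_path = Path(zip_path)
    if not zip_path.is_file():
        raise FileNotFoundError(f"ZIP not found at {zip_path}")
    with zipfile.ZipFile(zip_path) as archive:
        present = set(archive.namelist())
    return sorted(set(expected) - present)
```

Calling `missing_members(ZIP_PATH)` after Drive is mounted should return an empty list when all three files are present.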

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
%%bash
pip -q install -U pip
pip -q install -U autogluon

  DEPRECATION: Building 'nvidia-ml-py3' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'nvidia-ml-py3'. Discussion can be found at https://github.com/pypa/pip/issues/6334
  DEPRECATION: Building 'seqeval' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'seqeval'. Discussion can be found at https://github.com/pypa/pip/issues/6334
ERROR: pip's dependency resolver does not currently take into account al

In [11]:
from pathlib import Path
import zipfile, os

# ===== Change this if your ZIP lives somewhere else in Drive =====
ZIP_PATH = '/content/drive/MyDrive/AutoGluon-Datasets/spaceship-titanic.zip'
DATA_DIR = Path('/content/spaceship_data')
DATA_DIR.mkdir(parents=True, exist_ok=True)

assert os.path.exists(ZIP_PATH), f'ZIP not found at {ZIP_PATH}. Please upload spaceship-titanic.zip to that location.'
with zipfile.ZipFile(ZIP_PATH) as z:
    z.extractall(DATA_DIR)
print('Extracted files:', os.listdir(DATA_DIR))

Extracted files: ['test.csv', 'train.csv', 'sample_submission.csv']


In [12]:
import pandas as pd
train_path = DATA_DIR / 'train.csv'
test_path  = DATA_DIR / 'test.csv'
train_df = pd.read_csv(train_path)
test_df  = pd.read_csv(test_path)
print(train_df.shape, test_df.shape)
display(train_df.head())
train_df.isna().mean().sort_values(ascending=False).head(12)

(8693, 14) (4277, 13)


Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


column,missing_fraction
CryoSleep,0.024963
ShoppingMall,0.023927
VIP,0.023352
HomePlanet,0.023122
Name,0.023007
Cabin,0.022892
VRDeck,0.021627
Spa,0.021051
FoodCourt,0.021051
Destination,0.020936


## Manual feature engineering
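- `Cabin` → `CabinDeck`, `CabinNum`, `CabinSide` (missing or malformed cabins become `Unknown`, `-1`, `Unknown`)
- `PassengerId` → `Group` (the prefix before `_`), `GroupSize`, `IsAlone`
- `Name` → `Surname`, `SurnameGroupSize`
- `TotalSpend`: sum of `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck`, with missing values counted as 0
- Flags: `IsMinor` (`Age < 18`); `CryoSleep` and `VIP` mapped to 1/0
- Interaction: `AgeTimesGroup`, the median-imputed `Age` multiplied by `GroupSize`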
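
To make the first two bullets concrete, here is a small sketch on made-up rows that derives the cabin and group columns with vectorized pandas string methods. It is an alternative formulation for illustration, not the implementation used in the next cell, and the toy values are invented.

```python
import pandas as pd

# Toy rows in the same format as the competition columns (values are made up).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "A/0/S", None],
})

# Cabin "deck/num/side" -> three columns; a missing cabin yields missing parts.
cabin_parts = toy["Cabin"].str.split("/", expand=True)
toy["CabinDeck"] = cabin_parts[0].fillna("Unknown")
toy["CabinNum"] = pd.to_numeric(cabin_parts[1], errors="coerce").fillna(-1).astype(int)
toy["CabinSide"] = cabin_parts[2].fillna("Unknown")

# The group is the PassengerId prefix; its size is the number of rows sharing that prefix.
toy["Group"] = toy["PassengerId"].str.split("_").str[0]
toy["GroupSize"] = toy.groupby("Group")["PassengerId"].transform("count")
toy["IsAlone"] = (toy["GroupSize"] == 1).astype(int)

print(toy[["CabinDeck", "CabinNum", "CabinSide", "Group", "GroupSize", "IsAlone"]])
```

The passenger with the missing cabin ends up with the same `Unknown` / `-1` / `Unknown` placeholders that the row-wise helper produces.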

In [13]:
import numpy as np
SPEND_COLS = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']

def split_cabin(cabin):
    if pd.isna(cabin):
        return 'Unknown', -1, 'Unknown'
    parts = str(cabin).split('/')
    if len(parts) != 3:
        return 'Unknown', -1, 'Unknown'
    deck, num, side = parts
    try:
        num = int(num)
    except ValueError:  # non-numeric cabin number
        num = -1
    return deck, num, side

def surname_from_name(name):
    if pd.isna(name):
        return 'Unknown'
    parts = str(name).split()
    return parts[-1] if parts else 'Unknown'

def add_engineered_features(df):
    out = df.copy()
    deck, num, side = zip(*out['Cabin'].apply(split_cabin))
    out['CabinDeck'] = list(deck)
    out['CabinNum']  = list(num)
    out['CabinSide'] = list(side)

    out['Group'] = out['PassengerId'].astype(str).str.split('_').str[0]
    gcounts = out['Group'].value_counts()
    out['GroupSize'] = out['Group'].map(gcounts)
    out['IsAlone'] = (out['GroupSize'] == 1).astype(int)

    # Apply directly: astype(str) would turn NaN into the string 'nan' and bypass the pd.isna check.
    out['Surname'] = out['Name'].apply(surname_from_name)
    scounts = out['Surname'].value_counts()
    out['SurnameGroupSize'] = out['Surname'].map(scounts)

    for c in SPEND_COLS:
        if c not in out.columns:
            out[c] = 0
    out['TotalSpend'] = out[SPEND_COLS].fillna(0).sum(axis=1)

    # NaN < 18 evaluates to False, so passengers with a missing Age are not flagged as minors.
    out['IsMinor'] = (out['Age'] < 18).astype(int)
    out['CryoSleep'] = out['CryoSleep'].map({True:1, False:0})
    out['VIP'] = out['VIP'].map({True:1, False:0})
    out['AgeTimesGroup'] = out['Age'].fillna(out['Age'].median()) * out['GroupSize']
    return out

train_eng = add_engineered_features(train_df)
test_eng  = add_engineered_features(test_df)
train_eng.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,...,CabinNum,CabinSide,Group,GroupSize,IsAlone,Surname,SurnameGroupSize,TotalSpend,IsMinor,AgeTimesGroup
0,0001_01,Europa,0.0,B/0/P,TRAPPIST-1e,39.0,0.0,0.0,0.0,0.0,...,0,P,1,1,1,Ofracculy,1,0.0,0,39.0
1,0002_01,Earth,0.0,F/0/S,TRAPPIST-1e,24.0,0.0,109.0,9.0,25.0,...,0,S,2,1,1,Vines,4,736.0,0,24.0
2,0003_01,Europa,0.0,A/0/S,TRAPPIST-1e,58.0,1.0,43.0,3576.0,0.0,...,0,S,3,2,0,Susent,6,10383.0,0,116.0
3,0003_02,Europa,0.0,A/0/S,TRAPPIST-1e,33.0,0.0,0.0,1283.0,371.0,...,0,S,3,2,0,Susent,6,5176.0,0,66.0
4,0004_01,Earth,0.0,F/1/S,TRAPPIST-1e,16.0,0.0,303.0,70.0,151.0,...,1,S,4,1,1,Santantines,6,1091.0,1,16.0


In [14]:
from autogluon.tabular import TabularDataset, TabularPredictor
from autogluon.features.generators import PipelineFeatureGenerator, FillNaFeatureGenerator

label = 'Transported'
train_ag = TabularDataset(train_eng)
test_ag  = TabularDataset(test_eng)
manual_gen = PipelineFeatureGenerator(generators=[FillNaFeatureGenerator()])
predictor_manual = TabularPredictor(label=label, eval_metric='accuracy', path='ag_models__manual').fit(
    train_ag,
    feature_generator=manual_gen,
    presets='medium_quality',
    time_limit=600,
)
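# Note: scoring on train_ag means `score_test` is accuracy on the training data, not on held-out rows.
leaderboard_manual = predictor_manual.leaderboard(train_ag, silent=True)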
leaderboard_manual

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
Memory Avail:       49.21 GB / 50.99 GB (96.5%)
Disk Space Avail:   189.57 GB / 235.68 GB (80.4%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "/content/ag_models__manual"
Train Data Rows:    8693
Train Data Columns: 24
Label Column:       Transported
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.False_, np.True_]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessin

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestGini,0.977453,0.781609,accuracy,0.220209,0.086543,1.124971,0.220209,0.086543,1.124971,1,True,3
1,RandomForestEntr,0.977338,0.78046,accuracy,0.190728,0.087309,1.093383,0.190728,0.087309,1.093383,1,True,4
2,ExtraTreesEntr,0.977108,0.778161,accuracy,0.236465,0.087568,0.851342,0.236465,0.087568,0.851342,1,True,7
3,ExtraTreesGini,0.976073,0.767816,accuracy,0.204359,0.087132,0.871869,0.204359,0.087132,0.871869,1,True,6
4,CatBoost,0.873461,0.809195,accuracy,0.010727,0.003176,3.870257,0.010727,0.003176,3.870257,1,True,5
5,LightGBMLarge,0.865754,0.794253,accuracy,0.012893,0.002349,1.71076,0.012893,0.002349,1.71076,1,True,11
6,WeightedEnsemble_L2,0.83734,0.818391,accuracy,0.16416,0.035026,42.946849,0.002559,0.00088,0.087299,2,True,12
7,XGBoost,0.832739,0.805747,accuracy,0.031018,0.004111,0.799391,0.031018,0.004111,0.799391,1,True,9
8,LightGBMXT,0.824917,0.805747,accuracy,0.037153,0.003887,4.917114,0.037153,0.003887,4.917114,1,True,1
9,LightGBM,0.813528,0.798851,accuracy,0.006945,0.001814,0.539879,0.006945,0.001814,0.539879,1,True,2
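
Because `leaderboard` was called with the training frame, the `score_test` column above is measured on rows that were mostly used for fitting. That is why `RandomForestGini` shows about 0.977 there but about 0.78 on `score_val`. One way to get a meaningful test column is to set rows aside before fitting. The helper below is a minimal sketch of such a split; `holdout_split` and its parameters are hypothetical names, not AutoGluon API, and it assumes the frame has a unique index.

```python
import pandas as pd

def holdout_split(df, frac=0.2, seed=0):
    """Return (fit_df, holdout_df): a random `frac` of rows is set aside.

    Assumes df has a unique index, so dropping by index removes exactly the sampled rows.
    """
    holdout_df = df.sample(frac=frac, random_state=seed)
    fit_df = df.drop(index=holdout_df.index)
    return fit_df, holdout_df

# Small demonstration on a made-up frame.
demo = pd.DataFrame({"x": range(10), "Transported": [True, False] * 5})
fit_df, holdout_df = holdout_split(demo, frac=0.2, seed=0)
print(len(fit_df), len(holdout_df))
```

In this notebook that would mean fitting on the `fit_df` part of `train_eng` and passing `holdout_df` to `leaderboard`, so that `score_test` reflects unseen rows.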


In [15]:
from autogluon.features.generators import AutoMLPipelineFeatureGenerator
custom_gen = AutoMLPipelineFeatureGenerator(
    enable_numeric_features=True,
    enable_categorical_features=True,
    enable_datetime_features=True,
    enable_text_special_features=True,
    enable_text_ngram_features=True,
    enable_raw_text_features=False,
)
predictor_custom = TabularPredictor(label=label, eval_metric='accuracy', path='ag_models__custom').fit(
    TabularDataset(train_df),
    feature_generator=custom_gen,
    presets='medium_quality',
    time_limit=600,
)
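# Note: scoring on train_df means `score_test` is accuracy on the training data, not on held-out rows.
leaderboard_custom = predictor_custom.leaderboard(train_df, silent=True)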
leaderboard_custom

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          8
Memory Avail:       48.36 GB / 50.99 GB (94.8%)
Disk Space Avail:   189.17 GB / 235.68 GB (80.3%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "/content/ag_models__custom"
Train Data Rows:    8693
Train Data Columns: 13
Label Column:       Transported
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.False_, np.True_]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessin

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestGini,0.946049,0.778161,accuracy,0.149901,0.099262,1.355474,0.149901,0.099262,1.355474,1,True,3
1,ExtraTreesEntr,0.945934,0.777011,accuracy,0.218508,0.10136,0.847713,0.218508,0.10136,0.847713,1,True,7
2,RandomForestEntr,0.945588,0.773563,accuracy,0.153128,0.098404,1.362135,0.153128,0.098404,1.362135,1,True,4
3,ExtraTreesGini,0.945473,0.772414,accuracy,0.193034,0.09844,0.863767,0.193034,0.09844,0.863767,1,True,6
4,XGBoost,0.876682,0.803448,accuracy,0.060511,0.010477,1.231745,0.060511,0.010477,1.231745,1,True,9
5,NeuralNetFastAI,0.85402,0.810345,accuracy,0.133308,0.021322,9.676896,0.133308,0.021322,9.676896,1,True,8
6,LightGBMLarge,0.85172,0.801149,accuracy,0.011632,0.004084,1.390018,0.011632,0.004084,1.390018,1,True,11
7,LightGBM,0.843897,0.805747,accuracy,0.030946,0.008094,0.903193,0.030946,0.008094,0.903193,1,True,2
8,WeightedEnsemble_L2,0.825032,0.813793,accuracy,0.111619,0.027066,26.887582,0.001905,0.000812,0.083593,2,True,12
9,LightGBMXT,0.821466,0.810345,accuracy,0.055929,0.008393,0.807874,0.055929,0.008393,0.807874,1,True,1


In [16]:
import pandas as pd
def tidy_lb(lb, tag):
    x = lb.copy()
    x['setup'] = tag
    return x[['model','score_val','fit_time','pred_time_val','setup']]

lb_all = pd.concat([
    tidy_lb(leaderboard_manual, 'manual_fe'),
    tidy_lb(leaderboard_custom, 'custom_generator'),
], ignore_index=True)
display(lb_all.sort_values(['score_val','fit_time'], ascending=[False,True]).head(20))

best_setup = lb_all.sort_values('score_val', ascending=False).iloc[0]['setup']
best_predictor = {'manual_fe': predictor_manual, 'custom_generator': predictor_custom}[best_setup]
print('Best setup:', best_setup)
train_for_fi = train_eng if best_setup=='manual_fe' else train_df
fi = best_predictor.feature_importance(train_for_fi)
fi.head(20)

Unnamed: 0,model,score_val,fit_time,pred_time_val,setup
6,WeightedEnsemble_L2,0.818391,42.946849,0.035026,manual_fe
20,WeightedEnsemble_L2,0.813793,26.887582,0.027066,custom_generator
21,LightGBMXT,0.810345,0.807874,0.008393,custom_generator
17,NeuralNetFastAI,0.810345,9.676896,0.021322,custom_generator
4,CatBoost,0.809195,3.870257,0.003176,manual_fe
7,XGBoost,0.805747,0.799391,0.004111,manual_fe
19,LightGBM,0.805747,0.903193,0.008094,custom_generator
8,LightGBMXT,0.805747,4.917114,0.003887,manual_fe
11,NeuralNetFastAI,0.805747,10.465881,0.01629,manual_fe
22,NeuralNetTorch,0.804598,25.996115,0.017861,custom_generator


These features in provided data are not utilized by the predictor and will be ignored: ['PassengerId']
Computing feature importance via permutation shuffling for 23 features using 5000 rows with 5 shuffle sets...
	13.44s	= Expected runtime (2.69s per shuffle set)


Best setup: manual_fe


	9.44s	= Actual runtime (Completed 5 of 5 shuffle sets)


feature,importance,stddev,p_value,n,p99_high,p99_low
Spa,0.07772,0.004642,1.520414e-06,5,0.087279,0.068161
RoomService,0.06792,0.004963,3.397068e-06,5,0.078139,0.057701
FoodCourt,0.06268,0.00261,3.599243e-07,5,0.068054,0.057306
VRDeck,0.05656,0.004325,4.072111e-06,5,0.065466,0.047654
CryoSleep,0.0516,0.004695,8.132563e-06,5,0.061266,0.041934
ShoppingMall,0.03244,0.003395,1.419218e-05,5,0.039431,0.025449
CabinNum,0.0226,0.001965,6.785136e-06,5,0.026645,0.018555
TotalSpend,0.01452,0.004431,0.0009229095,5,0.023643,0.005397
Age,0.01108,0.001968,0.0001145109,5,0.015132,0.007028
GroupSize,0.00884,0.001203,4.020269e-05,5,0.011318,0.006362
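
The log above says importance is computed by permutation shuffling. As a rough illustration of that idea (not AutoGluon's implementation), the sketch below shuffles one column at a time and records how much accuracy drops. The function name `permutation_importance_sketch`, the toy data, and the toy `predict` rule are all invented for the example.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_shuffles=5, seed=0):
    """Mean accuracy drop per column when that column is randomly permuted."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_shuffles):
            X_perm = X.copy()
            # Break the link between this column and the label; leave other columns intact.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances.append(float(np.mean(drops)))
    return importances

# Toy setup: the label depends only on column 0, and the "model" knows that rule.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 2))
y = X[:, 0] > 0
predict = lambda data: data[:, 0] > 0

importances = permutation_importance_sketch(predict, X, y)
print(importances)
```

Column 0 gets a clearly positive importance, while column 1, which the toy model never reads, gets exactly zero. The same reading applies to the table above: a larger value means accuracy falls more when that feature is scrambled.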


In [17]:
test_for_pred = test_eng if best_setup=='manual_fe' else test_df
pred_test = best_predictor.predict(test_for_pred)
if pred_test.dtype != bool:
    pred_test = pred_test.astype(int).map({1: True, 0: False})
sub = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Transported': pred_test})
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv using', best_setup)
sub.head()

Wrote submission.csv using manual_fe


Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,True
4,0023_01,False
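
Before uploading, it may be worth comparing the file with `sample_submission.csv` from the extracted archive. The following sketch checks the header, the set of `PassengerId` values, and the `Transported` values using only the standard library. `check_submission` is a hypothetical helper, and the rule that `Transported` must be the text `True` or `False` is an assumption based on the output above.

```python
import csv

def check_submission(submission_path, sample_path):
    """Return a list of problems found; an empty list means the basic checks passed."""
    with open(submission_path, newline="") as f:
        sub_rows = list(csv.DictReader(f))
    with open(sample_path, newline="") as f:
        sample_rows = list(csv.DictReader(f))

    problems = []
    expected_cols = ["PassengerId", "Transported"]
    if sub_rows and list(sub_rows[0].keys()) != expected_cols:
        problems.append(f"unexpected columns: {list(sub_rows[0].keys())}")
        return problems
    # Compare IDs order-independently; sorting also catches duplicated or missing rows.
    sub_ids = sorted(r["PassengerId"] for r in sub_rows)
    sample_ids = sorted(r["PassengerId"] for r in sample_rows)
    if sub_ids != sample_ids:
        problems.append("PassengerId values differ from the sample submission")
    bad = [r["PassengerId"] for r in sub_rows if r["Transported"] not in ("True", "False")]
    if bad:
        problems.append(f"{len(bad)} rows have a non-boolean Transported value")
    return problems
```

In this notebook the call would be `check_submission('submission.csv', DATA_DIR / 'sample_submission.csv')`.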
