
# AutoGluon: Tabular Feature Engineering (Colab-Ready)

This notebook mirrors the AutoGluon **Feature Engineering** tutorial and adds a few sanity checks so it runs cleanly in Google Colab.

**What you'll do:**
1. Install AutoGluon (tabular-only with all optional models).
2. Load a sample dataset (Adult Income).
3. Fit a baseline AutoGluon `TabularPredictor` (Auto FE happens here).
4. Inspect engineered features via `predictor.transform_features(...)`.
5. Review permutation feature importance.
6. Compare models on a leaderboard and generate sample predictions.
7. (Optional) Add a custom column after training and see why `transform_features` drops it.


In [1]:

# Step 1: Installs (Colab)
!pip -q install -U pip setuptools wheel
!pip -q install -U "autogluon.tabular[all]" --extra-index-url https://download.pytorch.org/whl/cpu

print("✅ Installed. If you're in Colab and imports fail below, do: Runtime → Restart runtime, then re-run from here.")


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.8.0+cu126 requires torch==2.8.0, but you have torch 2.7.1+cpu which is incompatible.
✅ Installed. If you're in Colab and imports fail below, do: Runtime → Restart runtime, then re-run from here.


In [3]:
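# Step 1b: Upgrade torch, torchaudio and torchvision together from the CPU index,
# intended to resolve the torchaudio/torch version conflict reported above.
!pip install -U torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cpu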


Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch
  Downloading https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (29 kB)
Collecting torchvision
  Using cached https://download.pytorch.org/whl/cpu/torchvision-0.23.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.1 kB)
Downloading https://download.pytorch.org/whl/cpu/torch-2.8.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl (183.9 MB)
Downloading https://download.pytorch.org/whl/cpu/torchvision-0.23.0%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl (1.9 MB)
Installing collected packages: torch, torchvision
  Attempting uninstall: torch
    Found existing installation: torch 2.7.1+cpu
    Uninstalling torch-2

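The install log above reports a torch/torchaudio mismatch, so it can help to confirm which versions actually ended up installed before importing AutoGluon. A minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of the original notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI; adjust the list as needed.
for pkg, ver in installed_versions(
    ["autogluon.tabular", "torch", "torchvision", "torchaudio"]
).items():
    print(f"{pkg:>18}: {ver or 'not installed'}")
```

If the reported torch and torchaudio versions still disagree, restarting the runtime and re-running the install cells is the usual next step.
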
In [5]:

# Step 2: Imports and data
from autogluon.tabular import TabularDataset, TabularPredictor
import pandas as pd

# Adult Income dataset (classification), hosted by AutoGluon
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
test_data  = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')

label = 'class'
print("Train shape:", train_data.shape, "| Test shape:", test_data.shape)
display(train_data.head(3))


Train shape: (39073, 15) | Test shape: (9769, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,178478,Bachelors,13,Never-married,Tech-support,Own-child,White,Female,0,0,40,United-States,<=50K
1,23,State-gov,61743,5th-6th,3,Never-married,Transport-moving,Not-in-family,White,Male,0,0,35,United-States,<=50K
2,46,Private,376789,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,0,15,United-States,<=50K
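
Before fitting, a quick look at the label column is a cheap sanity check. In this dataset the raw label strings carry a leading space (the training log reports the two values as `' <=50K'` and `' >50K'`), which matters if predictions are later compared against hand-typed strings. A sketch on a tiny stand-in frame; `label_summary` is a hypothetical helper and the toy rows are made up for illustration:

```python
import pandas as pd


def label_summary(df, label):
    """Count label values and flag string labels with surrounding whitespace."""
    counts = df[label].value_counts(dropna=False)
    padded = [v for v in counts.index if isinstance(v, str) and v != v.strip()]
    return counts, padded


# Toy stand-in for train_data; the real notebook would pass train_data itself.
toy = pd.DataFrame({"age": [25, 23, 46], "class": [" <=50K", " <=50K", " >50K"]})
counts, padded = label_summary(toy, "class")
print(counts)
print("Labels with leading/trailing whitespace:", padded)
```
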


In [6]:

# Step 3: Fit baseline model (Auto FE occurs here)
predictor = TabularPredictor(label=label, path="ag_feature_eng_output").fit(
    train_data,
    presets="medium_quality_faster_train",  # alias of "medium_quality" (see the log output)
    time_limit=600                           # seconds
)
predictor


Preset alias specified: 'medium_quality_faster_train' maps to 'medium_quality'.
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       11.31 GB / 12.67 GB (89.3%)
Disk Space Avail:   184.59 GB / 225.83 GB (81.7%)
Presets specified: ['medium_quality_faster_train']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "/content/ag_feature_eng_output"
Train Data Rows:    39073
Train Data Columns: 14
Label Column:       class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' <=50K', ' >50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: [

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7ad207b74890>

In [7]:

# Step 4: Inspect engineered features (the label column is dropped, so 14 columns remain)
X_train_fe = predictor.transform_features(train_data)   # model-ready engineered training features
X_test_fe  = predictor.transform_features(test_data)    # model-ready engineered test features

print("Engineered train shape:", X_train_fe.shape, "| Engineered test shape:", X_test_fe.shape)
display(X_train_fe.head(5))


Engineered train shape: (39073, 14) | Engineered test shape: (9769, 14)


Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,native-country
0,25,178478,13,0,0,0,40,4,9,4,13,3,4,38
1,23,61743,3,1,0,0,35,7,4,4,14,1,4,38
2,46,376789,9,1,0,0,15,4,11,4,8,1,4,38
3,55,200235,9,1,0,0,50,0,11,2,0,0,4,38
4,36,224541,4,1,0,0,40,4,5,2,6,0,4,8
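
In the engineered frame above, text columns such as `workclass` and `education` appear as integers. The general idea can be reproduced with pandas category codes. This is only an analogy, assuming each distinct string maps to one integer; AutoGluon's own pipeline may order or handle categories differently, and the toy values below are made up:

```python
import pandas as pd

raw = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})

# Convert the string column to a categorical dtype; pandas sorts the
# categories alphabetically and assigns each one an integer code.
as_cat = raw["workclass"].astype("category")
codes = as_cat.cat.codes

print(dict(enumerate(as_cat.cat.categories)))
print(codes.tolist())
```

With the real predictor, `X_train_fe.dtypes` shows how each engineered column is actually stored.
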


In [8]:

# Step 5: Feature importance
imp = predictor.feature_importance(test_data)
display(imp.head(20))


Computing feature importance via permutation shuffling for 14 features using 5000 rows with 5 shuffle sets...
	6.27s	= Expected runtime (1.25s per shuffle set)
	5.54s	= Actual runtime (Completed 5 of 5 shuffle sets)


feature,importance,stddev,p_value,n,p99_high,p99_low
marital-status,0.05164,0.003321,2e-06,5,0.058478,0.044802
capital-gain,0.04696,0.004853,1.3e-05,5,0.056952,0.036968
education-num,0.0318,0.00503,7.3e-05,5,0.042157,0.021443
age,0.01556,0.003477,0.00028,5,0.022719,0.008401
occupation,0.01456,0.002889,0.000177,5,0.020509,0.008611
capital-loss,0.0126,0.001631,3.3e-05,5,0.015958,0.009242
hours-per-week,0.0082,0.002135,0.000505,5,0.012597,0.003803
workclass,0.00304,0.00178,0.009396,5,0.006705,-0.000625
relationship,0.00212,0.001706,0.024961,5,0.005634,-0.001394
fnlwgt,0.00196,0.001499,0.021555,5,0.005047,-0.001127
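
The log above says importance is computed "via permutation shuffling": shuffle one feature's values, re-score the model, and record how much the metric drops. A generic NumPy sketch of that idea follows. It is not AutoGluon's implementation; `permutation_importance`, the toy model, and the synthetic data are all assumptions for illustration:

```python
import numpy as np


def permutation_importance(predict_fn, X, y, n_shuffles=5, seed=0):
    """Mean accuracy drop per column when that column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict_fn(X) == y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_shuffles):
            X_shuffled = X.copy()
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - np.mean(predict_fn(X_shuffled) == y))
        importances[col] = np.mean(drops)
    return importances


# Synthetic data: the label depends only on column 0; column 1 is noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = X[:, 0] > 0


def toy_model(features):
    """Stand-in 'model' that thresholds column 0 and ignores column 1."""
    return features[:, 0] > 0


importance = permutation_importance(toy_model, X, y)
print(importance)
```

Shuffling the ignored column leaves accuracy unchanged, so its importance is zero, while shuffling the column the model relies on produces a large drop. In the table above, features whose `p99_low` is below zero (`workclass`, `relationship`, `fnlwgt`) have an importance that is not clearly distinguishable from zero at that confidence level.
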


In [9]:

# Step 6: Evaluate + leaderboard
leaderboard = predictor.leaderboard(test_data, silent=True)
display(leaderboard.head(10))

preds = predictor.predict(test_data.drop(columns=[label]))
print("Sample predictions:\n", preds.head())


Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,XGBoost,0.876139,0.8848,accuracy,0.108669,0.031883,4.064157,0.108669,0.031883,4.064157,1,True,9
1,WeightedEnsemble_L2,0.876139,0.8848,accuracy,0.111284,0.033108,4.230777,0.002614,0.001225,0.16662,2,True,12
2,CatBoost,0.875627,0.8836,accuracy,0.036957,0.011469,39.92634,0.036957,0.011469,39.92634,1,True,5
3,LightGBMLarge,0.875422,0.8824,accuracy,0.270722,0.065547,2.458395,0.270722,0.065547,2.458395,1,True,11
4,LightGBM,0.873477,0.8824,accuracy,0.256519,0.088769,2.621393,0.256519,0.088769,2.621393,1,True,2
5,LightGBMXT,0.87143,0.8792,accuracy,0.447932,0.107878,11.25321,0.447932,0.107878,11.25321,1,True,1
6,NeuralNetTorch,0.859658,0.858,accuracy,0.087563,0.032354,106.417527,0.087563,0.032354,106.417527,1,True,10
7,NeuralNetFastAI,0.859556,0.8644,accuracy,0.220608,0.049926,60.077297,0.220608,0.049926,60.077297,1,True,8
8,RandomForestGini,0.859351,0.8612,accuracy,0.986833,0.216309,14.723493,0.986833,0.216309,14.723493,1,True,3
9,RandomForestEntr,0.857611,0.8584,accuracy,0.985484,0.238248,16.08467,0.985484,0.238248,16.08467,1,True,4


Sample predictions:
 0     <=50K
1     <=50K
2      >50K
3     <=50K
4     <=50K
Name: class, dtype: object
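
The leaderboard can also drive a speed/accuracy trade-off: several models are within a fraction of a point of the best test score but differ a lot in prediction time. A sketch using a few rows copied from the leaderboard above; the 0.001 tolerance and the helper name `fastest_near_best` are arbitrary illustrative choices:

```python
import pandas as pd

# Subset of the leaderboard shown above (model, score_test, pred_time_test).
board = pd.DataFrame(
    {
        "model": ["XGBoost", "WeightedEnsemble_L2", "CatBoost", "LightGBMLarge", "LightGBM"],
        "score_test": [0.876139, 0.876139, 0.875627, 0.875422, 0.873477],
        "pred_time_test": [0.108669, 0.111284, 0.036957, 0.270722, 0.256519],
    }
)


def fastest_near_best(df, tolerance=0.001):
    """Fastest model whose test score is within `tolerance` of the best score."""
    cutoff = df["score_test"].max() - tolerance
    candidates = df[df["score_test"] >= cutoff]
    return candidates.sort_values("pred_time_test").iloc[0]["model"]


choice = fastest_near_best(board)
print(choice)
```

In the notebook itself the same function could take the `leaderboard` DataFrame directly, and `predictor.predict(..., model=choice)` would then use that model for inference.
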



### Optional: Add a simple custom feature (post-hoc)

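This block adds a custom column to the raw data *after* training and runs it through `transform_features`. Because the feature pipeline was fitted without that column, the new column is dropped: the output below still has 14 columns, and only `age` matches the filter. For the model to use a custom feature, add it to the data before calling `fit`.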


In [10]:

# Add a simple interaction feature to the *raw* training data, then call
# transform_features again. The fitted pipeline only knows the columns it saw
# during fit(), so the new column is not expected to appear in the output.
train_raw_plus = train_data.copy()
if 'age' in train_raw_plus.columns and 'fnlwgt' in train_raw_plus.columns:
    train_raw_plus['age_x_fnlwgt'] = train_raw_plus['age'] * train_raw_plus['fnlwgt']
    print("Added custom feature: age_x_fnlwgt")

X_train_fe_plus = predictor.transform_features(train_raw_plus)
print("Engineered w/ custom feature shape:", X_train_fe_plus.shape)
display(X_train_fe_plus.filter(like='age', axis=1).head(5))


Added custom feature: age_x_fnlwgt
Engineered w/ custom feature shape: (39073, 14)


Unnamed: 0,age
0,25
1,23
2,46
3,55
4,36
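
Since the fitted pipeline ignores columns it never saw, a custom feature has to exist before training. One way to arrange that is sketched below, assuming the same `train_data`, `test_data`, and `label` as earlier. `add_age_x_fnlwgt`, `refit_with_custom_feature`, and the output path `ag_feature_eng_custom` are hypothetical names, and the retraining function is defined but not called here because it would start a new training run:

```python
import pandas as pd


def add_age_x_fnlwgt(df):
    """Return a copy of df with the age * fnlwgt interaction column added."""
    out = df.copy()
    out["age_x_fnlwgt"] = out["age"] * out["fnlwgt"]
    return out


def refit_with_custom_feature(train_df, label, path="ag_feature_eng_custom"):
    """Train a new predictor on data that already contains the custom column."""
    # Imported here so the helper above stays usable without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    train_plus = add_age_x_fnlwgt(train_df)
    predictor_plus = TabularPredictor(label=label, path=path).fit(
        train_plus, presets="medium_quality", time_limit=600
    )
    return predictor_plus


# Quick check of the helper on toy rows (values taken from the preview above).
toy = pd.DataFrame({"age": [25, 23], "fnlwgt": [178478, 61743]})
toy_plus = add_age_x_fnlwgt(toy)
print(toy_plus)
```

The same helper would need to be applied to the test data before calling `predict`, `leaderboard`, or `transform_features` on the new predictor, so that train and test share the same columns.
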



### Saving with outputs for GitHub

- In Colab: **File → Save a copy in GitHub** and ensure *Include output* is checked.
- Or download via **File → Download → Download .ipynb**, then push to your repo.
- Keep your run artifacts (e.g., the `ag_feature_eng_output/` folder) if you want to show model summaries; it can be large, so you may prefer to re-run during grading.
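
To decide whether the artifact folder is small enough to commit, its size can be measured first. A standard-library sketch; `dir_size_mb` is a hypothetical helper, and the path assumes the `ag_feature_eng_output` folder created in Step 3:

```python
import os


def dir_size_mb(path):
    """Total size of all regular files under path, in megabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


print(f"ag_feature_eng_output: {dir_size_mb('ag_feature_eng_output'):.1f} MB")
```

A missing folder simply reports 0.0 MB, so the cell is safe to run before or after training.
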
