# AutoGluon — Part 2

This notebook demonstrates the **AutoGluon Tabular** capabilities required in Part 2 of your assignment:
1) **Tabular classification** (Adult Income) and **regression** (California Housing)
2) **Multimodal tabular** (numeric + categorical + **text** column)
3) **Automatic feature engineering** + **feature importance**

All sections use small public datasets and short time limits so they run quickly on Colab.

---


Assignment done by: **Dev Mulchandani**

In [1]:
%%capture
!pip -q install -U pip
!pip -q install -U autogluon
print('Installed packages')  # not displayed: %%capture suppresses this cell's output

## 1️⃣ Tabular Classification — Adult Income (binary)
We predict whether income exceeds 50K on the Adult Income dataset (the one used in AutoGluon’s quick start), loaded here from OpenML instead of S3.

In [3]:
# Classification quickstart without S3 (uses OpenML instead)
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from autogluon.tabular import TabularPredictor

# where to save models (edit if you use Drive)
MODEL_DIR = Path('/content/AutoGluonModels')

# 1) Load Adult dataset from OpenML
adult = fetch_openml('adult', version=2, as_frame=True)   # ~48k rows
df = adult.frame.copy()
df['class'] = df['class'].astype(str)                     # ensure string labels

# 2) Train/test split (80/20, stratified on the label)
train_cls, test_cls = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df['class']
)

# 3) Train AutoGluon
label_cls = 'class'
pred_cls = TabularPredictor(label=label_cls, path=str(MODEL_DIR / 'adult_cls')).fit(
    train_cls,
    presets='medium_quality_faster_train',
    time_limit=180
)

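# 4) Quick results
# Note: the leaderboard is scored on train_cls, so 'score_test' below is training
# accuracy; pass test_cls instead for a held-out comparison.
print('Leaderboard (classification):')
display(pred_cls.leaderboard(train_cls, silent=True).head())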
print('Accuracy on test:')
print((pred_cls.predict(test_cls) == test_cls[label_cls]).mean())


Preset alias specified: 'medium_quality_faster_train' maps to 'medium_quality'.
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       11.29 GB / 12.67 GB (89.1%)
Disk Space Avail:   62.15 GB / 107.72 GB (57.7%)
Presets specified: ['medium_quality_faster_train']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 180s
AutoGluon will save models to "/content/AutoGluonModels/adult_cls"
Train Data Rows:    39073
Train Data Columns: 14
Label Column:       class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  ['<=50K', '>50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: 

Leaderboard (classification):


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestEntr | 0.990479 | 0.8528 | accuracy | 3.879886 | 0.20828 | 15.436682 | 3.879886 | 0.20828 | 15.436682 | 1 | True | 4 |
| 1 | RandomForestGini | 0.990428 | 0.852 | accuracy | 3.352239 | 0.351482 | 14.402138 | 3.352239 | 0.351482 | 14.402138 | 1 | True | 3 |
| 2 | ExtraTreesGini | 0.989788 | 0.842 | accuracy | 2.914752 | 0.234976 | 9.171849 | 2.914752 | 0.234976 | 9.171849 | 1 | True | 6 |
| 3 | ExtraTreesEntr | 0.989686 | 0.8404 | accuracy | 3.791875 | 0.276714 | 8.114681 | 3.791875 | 0.276714 | 8.114681 | 1 | True | 7 |
| 4 | WeightedEnsemble_L2 | 0.885368 | 0.8728 | accuracy | 0.562812 | 0.04837 | 55.397773 | 0.004806 | 0.001364 | 0.13844 | 2 | True | 10 |


Accuracy on test:
0.8724536800081891
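
The accuracy printed above is the fraction of test rows whose predicted label equals the true label. To judge whether such a number is good, it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below illustrates both computations on a handful of made-up labels; the helper names `accuracy` and `majority_baseline_accuracy` and the toy data are assumptions for illustration, not part of the notebook or of AutoGluon.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels (hypothetical, not taken from the Adult dataset).
y_true = ['<=50K', '<=50K', '>50K', '<=50K', '>50K', '<=50K', '<=50K', '>50K']
y_pred = ['<=50K', '<=50K', '>50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K']

acc = accuracy(y_true, y_pred)
baseline = majority_baseline_accuracy(y_true)
print(f"accuracy={acc:.3f}, majority baseline={baseline:.3f}")
```

On imbalanced labels like these, a model is only useful to the extent that it beats the baseline, so reporting both numbers side by side gives the test accuracy some context.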


## 1️⃣ Tabular Regression — California Housing (sklearn)
We load the classic California housing dataset from scikit‑learn and train a regression model.

In [4]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from autogluon.tabular import TabularPredictor

cal = fetch_california_housing(as_frame=True)
df_reg = cal.frame.copy()
df_reg.rename(columns={'MedHouseVal':'target'}, inplace=True)
label_reg = 'target'

pred_reg = TabularPredictor(label=label_reg, path=str(MODEL_DIR / 'cal_housing_reg'),
                             eval_metric='root_mean_squared_error') \
    .fit(df_reg, presets='medium_quality_faster_train', time_limit=180)

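# Note: there is no holdout split here; the leaderboard is scored on the same
# rows used for training, so 'score_test' is a training-set score.
print('Leaderboard (regression):')
display(pred_reg.leaderboard(df_reg, silent=True).head())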

Preset alias specified: 'medium_quality_faster_train' maps to 'medium_quality'.
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       10.06 GB / 12.67 GB (79.4%)
Disk Space Avail:   60.89 GB / 107.72 GB (56.5%)
Presets specified: ['medium_quality_faster_train']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 180s
AutoGluon will save models to "/content/AutoGluonModels/cal_housing_reg"
Train Data Rows:    20640
Train Data Columns: 8
Label Column:       target
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (5.00001, 0.14999, 2.06856, 1.15396)
	If 'regression' is not the correct problem_type, please manually specify the problem_t

[1000]	valid_set's rmse: 0.458064
[2000]	valid_set's rmse: 0.448369
[3000]	valid_set's rmse: 0.444722
[4000]	valid_set's rmse: 0.443817
[5000]	valid_set's rmse: 0.443675
[6000]	valid_set's rmse: 0.444089
[7000]	valid_set's rmse: 0.445133


	-0.4436	 = Validation score   (-root_mean_squared_error)
	12.68s	 = Training   runtime
	1.7s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 164.76s of the 164.76s of remaining time.
	Fitting with cpus=1, gpus=0, mem=0.0/10.1 GB


[1000]	valid_set's rmse: 0.424802
[2000]	valid_set's rmse: 0.42382


	-0.4236	 = Validation score   (-root_mean_squared_error)
	5.06s	 = Training   runtime
	0.46s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 159.00s of the 159.00s of remaining time.
	Fitting with cpus=2, gpus=0, mem=0.0/10.0 GB
	-0.5009	 = Validation score   (-root_mean_squared_error)
	55.22s	 = Training   runtime
	0.27s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 95.66s of the 95.66s of remaining time.
	Fitting with cpus=1, gpus=0
	Ran out of time, early stopping on iteration 7853.
	-0.4093	 = Validation score   (-root_mean_squared_error)
	95.74s	 = Training   runtime
	0.02s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 179.90s of the -0.13s of remaining time.
	Ensemble Weights: {'CatBoost': 0.778, 'LightGBM': 0.222}
	-0.408	 = Validation score   (-root_mean_squared_error)
	0.01s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 180.17
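
The `Ensemble Weights` line in the log indicates that `WeightedEnsemble_L2` combines its base models with the listed weights. Assuming the combination is a plain weighted average of the base models' predictions (the usual reading for a regression ensemble), it can be sketched as follows. The base predictions are invented numbers, and `weighted_ensemble` is a hypothetical helper, not an AutoGluon function.

```python
def weighted_ensemble(predictions_by_model, weights):
    """Blend per-model predictions row by row using a weighted average."""
    n_rows = len(next(iter(predictions_by_model.values())))
    blended = [0.0] * n_rows
    for model_name, weight in weights.items():
        for i, value in enumerate(predictions_by_model[model_name]):
            blended[i] += weight * value
    return blended


# Weights copied from the training log above.
weights = {'CatBoost': 0.778, 'LightGBM': 0.222}

# Made-up base-model predictions for three rows (hypothetical values).
base_predictions = {
    'CatBoost': [2.0, 1.0, 4.0],
    'LightGBM': [3.0, 1.0, 3.0],
}

blended = weighted_ensemble(base_predictions, weights)
print([round(v, 3) for v in blended])
```

Because the weights sum to 1, the blend stays on the same scale as the base predictions and leans toward the more heavily weighted model where the two disagree.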

Leaderboard (regression):


| | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | -0.223297 | -0.408002 | root_mean_squared_error | 4.219149 | 0.474523 | 100.801907 | 0.005034 | 0.000513 | 0.010566 | 2 | True | 5 |
| 1 | CatBoost | -0.22579 | -0.40926 | root_mean_squared_error | 0.162361 | 0.015482 | 95.735993 | 0.162361 | 0.015482 | 95.735993 | 1 | True | 4 |
| 2 | LightGBM | -0.230212 | -0.423584 | root_mean_squared_error | 4.051753 | 0.458528 | 5.055348 | 4.051753 | 0.458528 | 5.055348 | 1 | True | 2 |
| 3 | RandomForestMSE | -0.236308 | -0.50092 | root_mean_squared_error | 1.675405 | 0.274807 | 55.218094 | 1.675405 | 0.274807 | 55.218094 | 1 | True | 3 |
| 4 | LightGBMXT | -0.271515 | -0.443633 | root_mean_squared_error | 18.417991 | 1.696069 | 12.683245 | 18.417991 | 1.696069 | 12.683245 | 1 | True | 1 |
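
The scores for `root_mean_squared_error` are negative because AutoGluon presents metrics in higher-is-better form, as the `(-root_mean_squared_error)` annotations in the training log indicate. A minimal sketch of that convention, using made-up targets and a hypothetical `higher_is_better_rmse` helper:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def higher_is_better_rmse(y_true, y_pred):
    """Negated RMSE, so that a larger score always means a better model."""
    return -rmse(y_true, y_pred)


# Made-up targets and predictions (hypothetical values).
y_true = [2.0, 1.0, 3.0, 4.0]
y_pred = [2.5, 0.5, 3.5, 3.5]

error = rmse(y_true, y_pred)
score = higher_is_better_rmse(y_true, y_pred)
print(f"rmse={error}, score={score}")
```

Read this way, a `score_val` of -0.408 corresponds to an RMSE of 0.408 in the target's units.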


## 2️⃣ Multimodal Tabular — add a text column
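We create a small dataset containing numeric, categorical, and **text** features to show that AutoGluon accepts mixed column types through the same API. With only 10 rows (8 for training, 2 held out), this demonstrates the workflow rather than model quality.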

In [5]:
import pandas as pd
from autogluon.tabular import TabularPredictor

mm = pd.DataFrame({
    'age': [25,45,33,52,41,22,61,37,29,48],
    'income': [35000,120000,70000,150000,90000,28000,200000,80000,60000,110000],
    'role': ['entry','exec','engineer','director','pm','intern','c-suite','analyst','engineer','exec'],
    'desc': ['entry-level role','senior executive','mid-level engineer','director position','project manager','intern new grad','c-level executive','experienced analyst','software engineer','executive leader']
})
mm['high_spender'] = [0,1,0,1,0,0,1,0,0,1]

train_mm = mm.sample(frac=0.8, random_state=1)
test_mm  = mm.drop(train_mm.index)

pred_mm = TabularPredictor(label='high_spender', path=str(MODEL_DIR / 'mm')) \
    .fit(train_mm, presets='medium_quality_faster_train', time_limit=60)

print('Predictions on holdout rows:')
display(pred_mm.predict(test_mm))

Preset alias specified: 'medium_quality_faster_train' maps to 'medium_quality'.
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       10.00 GB / 12.67 GB (78.9%)
Disk Space Avail:   60.41 GB / 107.72 GB (56.1%)
Presets specified: ['medium_quality_faster_train']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 60s
AutoGluon will save models to "/content/AutoGluonModels/mm"
Train Data Rows:    8
Train Data Columns: 4
Label Column:       high_spender
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [np.int64(0), np.int64(1)]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one o

Predictions on holdout rows:


| | high_spender |
|---|---|
| 5 | 0 |
| 8 | 0 |
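
A text column such as `desc` has to be converted to numbers before tree models can use it. One common approach is a bag-of-words count matrix; AutoGluon's tabular pipeline can generate n-gram features of a similar flavor for columns it detects as text, though the truncated log above does not show whether it did so for this tiny dataset. The sketch below is a generic illustration with a hypothetical `bag_of_words` helper, not AutoGluon's implementation.

```python
def bag_of_words(texts):
    """Return a sorted vocabulary and a per-text word-count matrix."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, matrix


# Three descriptions taken from the `desc` column defined above.
descs = ['senior executive', 'executive leader', 'software engineer']

vocabulary, matrix = bag_of_words(descs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is a numeric vector that can sit alongside `age` and `income` as ordinary features.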


## 3️⃣ Automatic Feature Engineering & Importances
AutoGluon performs preprocessing and feature generation internally. Using the Adult dataset, we inspect permutation feature importances and preview the transformed features the pipeline produces.

In [8]:
# Feature engineering & importances using the Adult dataset (no S3)

from sklearn.datasets import fetch_openml
import pandas as pd
from autogluon.tabular import TabularPredictor

# 1) Load data
adult = fetch_openml('adult', version=2, as_frame=True)   # public & reliable
train_fe = adult.frame.copy()
train_fe['class'] = train_fe['class'].astype(str)         # ensure string labels
label_fe = 'class'

# 2) Train a quick model
pred_fe = TabularPredictor(label=label_fe, path='AutoGluonModels/feat_eng').fit(
    train_fe, presets='medium_quality_faster_train', time_limit=120
)

# 3) Feature importances (permutation-based; scored here on the training data)
fi = pred_fe.feature_importance(train_fe)
print('Top feature importances:')
display(fi.sort_values('importance', ascending=False).head(10))

# 4) View transformed features produced by AutoGluon’s pipeline
sample_batch = train_fe.head(200)
X_transformed = pred_fe.transform_features(sample_batch)
print('Transformed feature columns (preview):', list(X_transformed.columns)[:12])
display(X_transformed.head())


Preset alias specified: 'medium_quality_faster_train' maps to 'medium_quality'.
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.12.12
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Thu Oct  2 10:42:05 UTC 2025
CPU Count:          2
Memory Avail:       10.04 GB / 12.67 GB (79.3%)
Disk Space Avail:   60.40 GB / 107.72 GB (56.1%)
Presets specified: ['medium_quality_faster_train']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 120s
AutoGluon will save models to "/content/AutoGluonModels/feat_eng"
Train Data Rows:    48842
Train Data Columns: 14
Label Column:       class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  ['<=50K', '>50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: [

Top feature importances:


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| capital-gain | 0.04676 | 0.003448 | 4e-06 | 5 | 0.053859 | 0.039661 |
| occupation | 0.02304 | 0.003968 | 0.000102 | 5 | 0.031211 | 0.014869 |
| marital-status | 0.02176 | 0.003129 | 5e-05 | 5 | 0.028202 | 0.015318 |
| age | 0.01968 | 0.003022 | 6.5e-05 | 5 | 0.025902 | 0.013458 |
| relationship | 0.01812 | 0.002556 | 4.6e-05 | 5 | 0.023382 | 0.012858 |
| education | 0.0138 | 0.003353 | 0.000387 | 5 | 0.020703 | 0.006897 |
| capital-loss | 0.01364 | 0.002355 | 0.000103 | 5 | 0.01849 | 0.00879 |
| hours-per-week | 0.01108 | 0.002941 | 0.000544 | 5 | 0.017136 | 0.005024 |
| workclass | 0.00792 | 0.001425 | 0.000121 | 5 | 0.010855 | 0.004985 |
| fnlwgt | 0.00772 | 0.002081 | 0.000577 | 5 | 0.012006 | 0.003434 |
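
The `importance` column is a permutation importance: the drop in score when one feature's values are shuffled, averaged over `n` shuffles. The idea can be sketched in plain Python. Everything below is a toy: `toy_model` is a hypothetical hand-written rule standing in for a trained predictor, and the rows are invented.

```python
import random
from statistics import mean


def toy_model(row):
    """Hypothetical stand-in for a trained model: uses only capital_gain."""
    return 1 if row['capital_gain'] > 5000 else 0


def accuracy(rows, labels):
    """Fraction of rows where toy_model's prediction matches the label."""
    return mean(1.0 if toy_model(r) == y else 0.0 for r, y in zip(rows, labels))


def permutation_importance(rows, labels, feature, n_repeats=5, seed=0):
    """Average accuracy drop after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled_values = [r[feature] for r in rows]
        rng.shuffle(shuffled_values)
        # Rebuild the rows with only the chosen feature permuted.
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled_values)]
        drops.append(baseline - accuracy(permuted, labels))
    return mean(drops)


# Invented rows; 'noise' is a feature the toy model never looks at.
rows = [
    {'capital_gain': 0, 'noise': 3},
    {'capital_gain': 7688, 'noise': 1},
    {'capital_gain': 0, 'noise': 4},
    {'capital_gain': 15024, 'noise': 1},
    {'capital_gain': 0, 'noise': 5},
    {'capital_gain': 9000, 'noise': 9},
]
labels = [toy_model(r) for r in rows]

importance_gain = permutation_importance(rows, labels, 'capital_gain')
importance_noise = permutation_importance(rows, labels, 'noise')
print(f"capital_gain: {importance_gain:.3f}, noise: {importance_noise:.3f}")
```

A feature the model ignores gets an importance of exactly zero, because shuffling it cannot change any prediction. Small values in the table above, such as the one for `fnlwgt`, should be read in that light.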


Transformed feature columns (preview): ['age', 'fnlwgt', 'education-num', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass', 'education', 'marital-status', 'occupation', 'relationship']


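| | age | fnlwgt | education-num | sex | capital-gain | capital-loss | hours-per-week | workclass | education | marital-status | occupation | relationship | race | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 226802 | 7 | 0 | 0 | 0 | 40 | 3.0 | 1 | 4 | 6.0 | 3 | 2 | 37 |
| 1 | 38 | 89814 | 9 | 0 | 0 | 0 | 50 | 3.0 | 11 | 2 | 4.0 | 0 | 4 | 37 |
| 2 | 28 | 336951 | 12 | 0 | 0 | 0 | 40 | 1.0 | 7 | 2 | 10.0 | 0 | 4 | 37 |
| 3 | 44 | 160323 | 10 | 0 | 7688 | 0 | 40 | 3.0 | 15 | 2 | 6.0 | 0 | 2 | 37 |
| 4 | 18 | 103497 | 10 | 1 | 0 | 0 | 30 | | 15 | 4 | | 3 | 4 | 37 |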
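
In the transformed preview, categorical columns such as `workclass` and `education` appear as integer codes, and missing values stay missing (row 4). A rough sketch of this kind of encoding is shown below. The `encode_categories` helper and its alphabetical code assignment are assumptions for illustration; AutoGluon's actual code assignment may differ.

```python
def encode_categories(values):
    """Map each distinct non-missing category to an integer code."""
    categories = sorted({v for v in values if v is not None})
    code_of = {category: code for code, category in enumerate(categories)}
    # Missing entries (None) are left as None rather than given a code.
    codes = [code_of[v] if v is not None else None for v in values]
    return codes, code_of


# Hypothetical sample of a categorical column with one missing entry.
workclass = ['Private', 'Local-gov', 'Private', None, 'Self-emp-inc']

codes, mapping = encode_categories(workclass)
print(codes)
print(mapping)
```

Integer codes keep the column compact for tree-based models, which can split on the codes directly instead of needing one column per category.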


### ✅ Wrap‑up
We ran: (1) classification and regression with TabularPredictor, (2) a multimodal example with a text column, and (3) AutoGluon’s built‑in feature engineering and importances. For stronger results, increase `time_limit` or switch `presets` to a higher-quality option such as `best_quality`.