# Using AutoGluon for Tabular data

In [2]:
# python -m pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor

Tutorials taken from: https://auto.gluon.ai/stable/tutorials/tabular/index.html

### Load data:

In [3]:
data_url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'
train_data = TabularDataset(f'{data_url}train.csv')  # data_url already ends with '/'
train_data.head()

Unnamed: 0.1,Unnamed: 0,chern_simons,cusp_volume,hyperbolic_adjoint_torsion_degree,hyperbolic_torsion_degree,injectivity_radius,longitudinal_translation,meridinal_translation_imag,meridinal_translation_real,short_geodesic_imag_part,short_geodesic_real_part,Symmetry_0,Symmetry_D3,Symmetry_D4,Symmetry_D6,Symmetry_D8,Symmetry_Z/2 + Z/2,volume,signature
0,70746,0.09053,12.226322,0,10,0.507756,10.685555,1.144192,-0.519157,-2.760601,1.015512,0.0,0.0,0.0,0.0,0.0,1.0,11.393225,-2
1,240827,0.232453,13.800773,0,14,0.413645,10.453156,1.320249,-0.158522,-3.013258,0.827289,0.0,0.0,0.0,0.0,0.0,1.0,12.742782,0
2,155659,-0.144099,14.76103,0,14,0.436928,13.405199,1.101142,0.768894,2.233106,0.873856,0.0,0.0,0.0,0.0,0.0,0.0,15.236505,2
3,239963,-0.171668,13.738019,0,22,0.249481,27.819496,0.493827,-1.188718,-2.042771,0.498961,0.0,0.0,0.0,0.0,0.0,0.0,17.27989,-8
4,90504,0.235188,15.896359,0,10,0.389329,15.330971,1.036879,0.722828,-3.056138,0.778658,0.0,0.0,0.0,0.0,0.0,0.0,16.749298,4


In [None]:
train_data['signature'].describe()

count    10000.000000
mean        -0.022000
std          3.025166
min        -12.000000
25%         -2.000000
50%          0.000000
75%          2.000000
max         12.000000
Name: signature, dtype: float64

### Training
AutoGluon automatically infers the problem type from the label column (here, multiclass classification over discrete values). It also handles missing data, rescaling of feature values, and the train/validation split.

Recommendations from tutorial:

> We recommend users to start with medium_quality to get a sense of the problem and identify any data related issues. If medium_quality is taking too long to train, consider subsampling the training data during this prototyping phase.
>
> Once you are comfortable, next try best_quality. Make sure to specify at least 16x the time_limit value as used in medium_quality. Once finished, you should have a very powerful solution that is often stronger than medium_quality.
>
> Make sure to consider holding out test data that AutoGluon never sees during training to ensure that the models are performing as expected in terms of performance.
>
> Once you evaluate both best_quality and medium_quality, check if either satisfies your needs. If neither do, consider trying high_quality and/or good_quality.
>
> If none of the presets satisfy requirements, refer to Predicting Columns in a Table - In Depth for more advanced AutoGluon options.
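
The recommendations above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's own code: the helper names `best_quality_time_limit` and `fit_medium_then_best` and the `subsample_size` parameter are hypothetical, and the AutoGluon import is deferred into the function so the time-budget helper runs on its own.

```python
def best_quality_time_limit(medium_time_limit: int, factor: int = 16) -> int:
    """Scale the medium_quality time budget for a best_quality run.

    The default factor of 16 reflects the tutorial's "at least 16x" guidance.
    """
    return medium_time_limit * factor


def fit_medium_then_best(train_data, label, medium_time_limit=60, subsample_size=None):
    """Hypothetical helper: prototype with medium_quality, then try best_quality."""
    # Imported here so the rest of the sketch runs without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    prototype_data = train_data
    if subsample_size is not None and subsample_size < len(train_data):
        # Subsample only for the prototyping run, if medium_quality is too slow.
        prototype_data = train_data.sample(n=subsample_size, random_state=0)

    medium = TabularPredictor(label=label).fit(
        prototype_data, time_limit=medium_time_limit, presets='medium_quality'
    )
    best = TabularPredictor(label=label).fit(
        train_data,
        time_limit=best_quality_time_limit(medium_time_limit),
        presets='best_quality',
    )
    return medium, best


print(best_quality_time_limit(60))  # 960
```

Under this guidance, a 60-second medium_quality run would be followed by a best_quality run of at least 960 seconds.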

In [None]:
# Train a predictor to predict the 'signature' label from the training data; stop after 60 seconds
metric = 'accuracy'
predictor = TabularPredictor(label='signature', metric=metric).fit(train_data=train_data, time_limit=60, presets='best_quality')
# To save/load:
model_path = predictor.path
loaded_model = TabularPredictor.load(model_path)

No path specified. Models will be saved in: "AutogluonModels/ag-20241227_130125"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.9.6
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 24.1.0: Thu Oct 10 21:02:45 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T8112
CPU Count:          8
Memory Avail:       1.14 GB / 8.00 GB (14.2%)
Disk Space Avail:   37.81 GB / 460.43 GB (8.2%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and ben

### Data inspection
Investigate what AutoGluon automatically inferred from the data:

In [14]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

AutoGluon infers problem type is:  multiclass
AutoGluon identified the following types of features:
('float', [])     : 9 | ['chern_simons', 'cusp_volume', 'injectivity_radius', 'longitudinal_translation', 'meridinal_translation_imag', ...]
('int', [])       : 3 | ['Unnamed: 0', 'hyperbolic_adjoint_torsion_degree', 'hyperbolic_torsion_degree']
('int', ['bool']) : 5 | ['Symmetry_0', 'Symmetry_D3', 'Symmetry_D4', 'Symmetry_D6', 'Symmetry_Z/2 + Z/2']


View how the data looks after AutoGluon has transformed it into its internal representation:

In [15]:
train_data_transformed = predictor.transform_features(train_data)
train_data_transformed.head()

Unnamed: 0.1,Unnamed: 0,chern_simons,cusp_volume,hyperbolic_adjoint_torsion_degree,hyperbolic_torsion_degree,injectivity_radius,longitudinal_translation,meridinal_translation_imag,meridinal_translation_real,short_geodesic_imag_part,short_geodesic_real_part,Symmetry_0,Symmetry_D3,Symmetry_D4,Symmetry_D6,Symmetry_Z/2 + Z/2,volume
0,70746,0.09053,12.226322,0,10,0.507756,10.685555,1.144192,-0.519157,-2.760601,1.015512,0,0,0,0,1,11.393225
1,240827,0.232453,13.800773,0,14,0.413645,10.453156,1.320249,-0.158522,-3.013258,0.827289,0,0,0,0,1,12.742782
2,155659,-0.144099,14.76103,0,14,0.436928,13.405199,1.101142,0.768894,2.233106,0.873856,0,0,0,0,0,15.236505
3,239963,-0.171668,13.738019,0,22,0.249481,27.819496,0.493827,-1.188718,-2.042771,0.498961,0,0,0,0,0,17.27989
4,90504,0.235188,15.896359,0,10,0.389329,15.330971,1.036879,0.722828,-3.056138,0.778658,0,0,0,0,0,16.749298


View feature importance:
The 'importance' column reflects how much the evaluation metric drops when that feature's values are randomly shuffled across rows (permutation importance, as the log output below indicates). Negative values suggest the feature may be hurting the model, so the metric might improve if the feature were removed and the model re-fitted.

In [16]:
predictor.feature_importance(train_data)

These features in provided data are not utilized by the predictor and will be ignored: ['Symmetry_D8']
Computing feature importance via permutation shuffling for 17 features using 5000 rows with 5 shuffle sets...
	57.26s	= Expected runtime (11.45s per shuffle set)
	17.3s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
meridinal_translation_real,0.53976,0.004175,4.293771e-10,5,0.548356,0.531164
meridinal_translation_imag,0.33308,0.009539,8.063409e-08,5,0.352721,0.313439
longitudinal_translation,0.30212,0.007694,5.042129e-08,5,0.317961,0.286279
short_geodesic_imag_part,0.07732,0.00389,7.662002e-07,5,0.08533,0.06931
hyperbolic_torsion_degree,0.0452,0.002462,1.051616e-06,5,0.050269,0.040131
volume,0.02408,0.001952,5.141192e-06,5,0.0281,0.02006
injectivity_radius,0.02288,0.002331,1.274359e-05,5,0.027679,0.018081
cusp_volume,0.0076,0.001288,9.54293e-05,5,0.010253,0.004947
chern_simons,0.00616,0.001252,0.0001940857,5,0.008738,0.003582
short_geodesic_real_part,0.00516,0.001571,0.0009149927,5,0.008395,0.001925
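
The permutation shuffling mentioned in the log above can be illustrated with a small pure-Python sketch. It is not AutoGluon's implementation: the toy data, the toy model, and the function names (`accuracy`, `permutation_importance`) are assumptions made for the example. The idea is to shuffle one feature's values across rows, re-score the model, and average the drop in the metric over several shuffle sets.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, feature, num_shuffle_sets=5, seed=0):
    """Average drop in accuracy after shuffling one feature across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(num_shuffle_sets):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the shuffled feature.
        permuted = [{**row, feature: value} for row, value in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / len(drops)


# Toy data: the label equals the 'signal' feature; 'noise' is irrelevant.
data_rng = random.Random(42)
rows = [{"signal": data_rng.choice([0, 1]), "noise": data_rng.random()} for _ in range(1000)]
labels = [row["signal"] for row in rows]


def toy_model(row):
    """Toy 'model' that only looks at the 'signal' feature."""
    return row["signal"]


signal_importance = permutation_importance(toy_model, rows, labels, "signal")
noise_importance = permutation_importance(toy_model, rows, labels, "noise")
print(f"signal: {signal_importance:.3f}, noise: {noise_importance:.3f}")
```

In this toy setup the feature the model relies on gets a large importance, while the ignored feature gets exactly zero, mirroring how the features at the top of the table above dominate the ranking.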


### Prediction

In [None]:
eval_data = TabularDataset(f'{data_url}test.csv')
prediction_data = eval_data.drop(columns=['signature'])
y_pred = predictor.predict(prediction_data)
y_pred.head()

Loaded data from: https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/test.csv | Columns = 19 / 19 | Rows = 5000 -> 5000


0   -4
1   -2
2    0
3    4
4    2
Name: signature, dtype: int64

### Eval

In [8]:
predictor.evaluate(eval_data, silent=True)

{'accuracy': 0.9466,
 'balanced_accuracy': 0.7560470577832882,
 'mcc': 0.9345787084902886}
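
The gap between 'accuracy' and 'balanced_accuracy' above is worth a closer look. Balanced accuracy is commonly defined (for example in scikit-learn) as the mean of per-class recall, so rarely occurring classes count as much as frequent ones. The sketch below assumes that definition and uses made-up labels, not the notebook's data.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class has equal weight."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Made-up example: class 0 is frequent, class 2 is rare and half wrong.
y_true = [0] * 8 + [2] * 2
y_pred = [0] * 8 + [0, 2]

toy_accuracy = accuracy(y_true, y_pred)
toy_balanced = balanced_accuracy(y_true, y_pred)
print(toy_accuracy, toy_balanced)  # 0.9 0.75
```

A similar pattern in the real results would suggest that some of the less frequent signature values are predicted less reliably than the common ones.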

In [None]:
# Leaderboard functionality - here the top row is the ensemble model
predictor.leaderboard(eval_data)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,WeightedEnsemble_L2,0.9466,0.962963,accuracy,0.344813,0.07423,29.408378,0.008163,0.00043,0.074384,2,True,12
1,LightGBM,0.9456,0.955956,accuracy,0.090156,0.016611,11.403526,0.090156,0.016611,11.403526,1,True,5
2,XGBoost,0.9448,0.956957,accuracy,0.218064,0.031942,8.626415,0.218064,0.031942,8.626415,1,True,11
3,CatBoost,0.9432,0.955956,accuracy,0.032023,0.004451,13.291127,0.032023,0.004451,13.291127,1,True,8
4,RandomForestEntr,0.9382,0.946947,accuracy,0.109581,0.037111,1.132512,0.109581,0.037111,1.132512,1,True,7
5,RandomForestGini,0.9354,0.943944,accuracy,0.130638,0.036188,0.902218,0.130638,0.036188,0.902218,1,True,6
6,ExtraTreesEntr,0.935,0.944945,accuracy,0.200927,0.034919,0.560115,0.200927,0.034919,0.560115,1,True,10
7,ExtraTreesGini,0.9328,0.944945,accuracy,0.250937,0.035046,0.586366,0.250937,0.035046,0.586366,1,True,9
8,LightGBMXT,0.932,0.945946,accuracy,0.145144,0.030072,14.274436,0.145144,0.030072,14.274436,1,True,4
9,NeuralNetFastAI,0.9306,0.940941,accuracy,0.059484,0.023422,7.399198,0.059484,0.023422,7.399198,1,True,3


Predict with a specific model:

In [24]:
best_model = predictor.model_best
print(f"The best model is: {best_model}")
predictions_from_XGBoost = predictor.predict(eval_data, model='XGBoost')

# get list of models:
print('List of models:')
predictor.model_names() # or predictor.leaderboard()

The best model is: WeightedEnsemble_L2
List of models:


['KNeighborsUnif',
 'KNeighborsDist',
 'NeuralNetFastAI',
 'LightGBMXT',
 'LightGBM',
 'RandomForestGini',
 'RandomForestEntr',
 'CatBoost',
 'ExtraTreesGini',
 'ExtraTreesEntr',
 'XGBoost',
 'WeightedEnsemble_L2']
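
Building on the `model=` argument used above, the following sketch compares the accuracy of several models by calling `predict` once per model name. The helper `accuracy_by_model` and the `StubPredictor` class are hypothetical: the stub only stands in for a trained `TabularPredictor` so that the example runs without AutoGluon, and its predictions are made up.

```python
def accuracy_by_model(predictor, features, labels, model_names):
    """Return {model name: accuracy} using predictor.predict(features, model=name)."""
    true_labels = list(labels)
    scores = {}
    for name in model_names:
        predictions = predictor.predict(features, model=name)
        correct = sum(p == t for p, t in zip(predictions, true_labels))
        scores[name] = correct / len(true_labels)
    return scores


class StubPredictor:
    """Hypothetical stand-in for a trained predictor (made-up predictions)."""

    def __init__(self, predictions_by_model):
        self._predictions_by_model = predictions_by_model

    def predict(self, data, model=None):
        # Ignores the data and returns the canned predictions for this model.
        return self._predictions_by_model[model]


stub = StubPredictor({"ModelA": [0, 2, -2, 4], "ModelB": [0, 0, -2, 0]})
scores = accuracy_by_model(stub, features=None, labels=[0, 2, -2, 4],
                           model_names=["ModelA", "ModelB"])
print(scores)  # {'ModelA': 1.0, 'ModelB': 0.5}
```

With the objects from this notebook, the call would presumably look like `accuracy_by_model(predictor, prediction_data, eval_data['signature'], predictor.model_names())`, which should agree with the `score_test` column of the leaderboard.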