# AutoGluon Tabular - Quick Start

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/tabular/tabular-quick-start.ipynb)
[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/tabular/tabular-quick-start.ipynb)

In this tutorial, we will see how to use AutoGluon's `TabularPredictor` to predict the values of a target column based on the other columns in a tabular dataset.

Begin by making sure AutoGluon is installed, and then import AutoGluon's `TabularDataset` and `TabularPredictor`. We will use the former to load data and the latter to train models and make predictions. 

In [None]:
!pip install -U pip
!pip install -U setuptools wheel
!pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cpu
!apt-get update; apt-get install -y graphviz graphviz-dev
!pip install autogluon kaggle pygraphviz dask[dataframe]

In [5]:
from autogluon.tabular import TabularDataset, TabularPredictor

## Example Data

For this tutorial we will use a dataset from the cover story of [Nature issue 7887](https://www.nature.com/nature/volumes/600/issues/7887): [AI-guided intuition for math theorems](https://www.nature.com/articles/s41586-021-04086-x.pdf). The goal is to predict a knot's signature based on its properties. We sampled 10K training and 5K test examples from the [original data](https://github.com/deepmind/mathematics_conjectures/blob/main/knot_theory.ipynb). The sampled dataset makes this tutorial run quickly, but AutoGluon can handle the full dataset if desired.

We load this dataset directly from a URL. AutoGluon's `TabularDataset` is a subclass of pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), so any `DataFrame` methods can be used on `TabularDataset` as well.

In [6]:
data_url = 'https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/'

In [7]:
!mkdir -p data/knot_theory

In [8]:
!wget -O data/knot_theory/train.csv {data_url}train.csv
!wget -O data/knot_theory/test.csv {data_url}test.csv

--2024-04-28 18:39:50--  https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2068450 (2.0M) [text/plain]
Saving to: ‘data/knot_theory/train.csv’


2024-04-28 18:39:50 (150 MB/s) - ‘data/knot_theory/train.csv’ saved [2068450/2068450]

--2024-04-28 18:39:50--  https://raw.githubusercontent.com/mli/ag-docs/main/knot_theory/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1034198 (1010K) [text/plain]
Saving to: ‘data/knot_theory/test.csv’


2024-04-28 18:39:5

In [9]:
train_data = TabularDataset('data/knot_theory/train.csv')
train_data.head()

|  | Unnamed: 0 | chern_simons | cusp_volume | hyperbolic_adjoint_torsion_degree | hyperbolic_torsion_degree | injectivity_radius | longitudinal_translation | meridinal_translation_imag | meridinal_translation_real | short_geodesic_imag_part | short_geodesic_real_part | Symmetry_0 | Symmetry_D3 | Symmetry_D4 | Symmetry_D6 | Symmetry_D8 | Symmetry_Z/2 + Z/2 | volume | signature |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 70746 | 0.09053 | 12.226322 | 0 | 10 | 0.507756 | 10.685555 | 1.144192 | -0.519157 | -2.760601 | 1.015512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 11.393225 | -2 |
| 1 | 240827 | 0.232453 | 13.800773 | 0 | 14 | 0.413645 | 10.453156 | 1.320249 | -0.158522 | -3.013258 | 0.827289 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.742782 | 0 |
| 2 | 155659 | -0.144099 | 14.76103 | 0 | 14 | 0.436928 | 13.405199 | 1.101142 | 0.768894 | 2.233106 | 0.873856 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.236505 | 2 |
| 3 | 239963 | -0.171668 | 13.738019 | 0 | 22 | 0.249481 | 27.819496 | 0.493827 | -1.188718 | -2.042771 | 0.498961 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.27989 | -8 |
| 4 | 90504 | 0.235188 | 15.896359 | 0 | 10 | 0.389329 | 15.330971 | 1.036879 | 0.722828 | -3.056138 | 0.778658 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.749298 | 4 |


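Our targets are stored in the "signature" column, which takes 13 unique integer values in this training sample (the even numbers from -12 to 12). pandas loads this column as a plain integer type rather than as a categorical one, but AutoGluon will infer that the values are class labels.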


In [10]:
label = 'signature'
train_data[label].describe()

count    10000.000000
mean        -0.022000
std          3.025166
min        -12.000000
25%         -2.000000
50%          0.000000
75%          2.000000
max         12.000000
Name: signature, dtype: float64

In [13]:
vc = train_data[label].value_counts()

In [14]:
vc.sort_index()

signature
-12       1
-10       5
-8       78
-6      412
-4     1076
-2     2124
 0     2685
 2     2059
 4     1084
 6      397
 8       69
 10       9
 12       1
Name: count, dtype: int64

In [15]:
train_data.info()

<class 'autogluon.core.dataset.TabularDataset'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         10000 non-null  int64  
 1   chern_simons                       10000 non-null  float64
 2   cusp_volume                        10000 non-null  float64
 3   hyperbolic_adjoint_torsion_degree  10000 non-null  int64  
 4   hyperbolic_torsion_degree          10000 non-null  int64  
 5   injectivity_radius                 10000 non-null  float64
 6   longitudinal_translation           10000 non-null  float64
 7   meridinal_translation_imag         10000 non-null  float64
 8   meridinal_translation_real         10000 non-null  float64
 9   short_geodesic_imag_part           10000 non-null  float64
 10  short_geodesic_real_part           10000 non-null  float64
 11  Symmetry_0                         10000 non-

In [19]:
train_data['Symmetry_D8'].value_counts()

Symmetry_D8
0.0    10000
Name: count, dtype: int64

## Training

We now construct a `TabularPredictor` by specifying the label column name and then train on the dataset with `TabularPredictor.fit()`. We don't need to specify any other parameters. AutoGluon will recognize this is a multi-class classification task, perform automatic feature engineering, train multiple models, and then ensemble the models to create the final predictor. 

In [20]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20240428_184714"
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20240428_184714"
AutoGluon Version:  1.1.0
Python Version:     3.10.6
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Sat Mar 23 09:49:55 UTC 2024
CPU Count:          2
Memory Avail:       2.62 GB / 3.78 GB (69.3%)
Di

Model fitting should take a few minutes or less depending on your CPU. You can make training faster by specifying the `time_limit` argument. For example, `fit(..., time_limit=60)` will stop training after 60 seconds. Higher time limits will generally result in better prediction performance, and excessively low time limits will prevent AutoGluon from training and ensembling a reasonable set of models.
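As a rough illustration of the trade-off described above, the sketch below passes a time budget and a preset to `fit()`. The helper names `build_fit_kwargs` and `fit_quickly` are hypothetical and exist only for this example; `time_limit` and `presets='medium_quality'` are the options mentioned in the text and in the training log. AutoGluon is imported lazily inside `fit_quickly` so the argument-building helper can be tried on its own.

```python
def build_fit_kwargs(time_limit=None, presets=None):
    """Collect optional fit() arguments, skipping the ones left unset.

    Hypothetical helper for this example; not part of AutoGluon.
    """
    kwargs = {}
    if time_limit is not None:
        if time_limit <= 0:
            raise ValueError("time_limit must be a positive number of seconds")
        kwargs["time_limit"] = time_limit
    if presets is not None:
        kwargs["presets"] = presets
    return kwargs


def fit_quickly(train_data, label, time_limit=60, presets="medium_quality"):
    """Train a predictor under a time budget (hypothetical wrapper)."""
    # Imported here so build_fit_kwargs() works without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    fit_kwargs = build_fit_kwargs(time_limit=time_limit, presets=presets)
    return TabularPredictor(label=label).fit(train_data, **fit_kwargs)


print(build_fit_kwargs(time_limit=60, presets="medium_quality"))
```

With the training data loaded earlier, `fit_quickly(train_data, label)` would stop training after roughly 60 seconds. A budget this small is mainly useful for prototyping, since it may leave too little time to train and ensemble a reasonable set of models.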



## Prediction

Once we have a predictor that is fit on the training dataset, we can load a separate set of data to use for prediction and evaluation.

In [22]:
test_data = TabularDataset('data/knot_theory/test.csv')

Loaded data from: data/knot_theory/test.csv | Columns = 19 / 19 | Rows = 5000 -> 5000


In [23]:
y_pred = predictor.predict(test_data.drop(columns=[label]))
y_pred.head()

0   -4
1   -2
2    0
3    4
4    2
Name: signature, dtype: int64

In [26]:
y_pred_sorted = y_pred.value_counts()

In [27]:
y_pred_sorted.sort_index()

signature
-8      42
-6     211
-4     543
-2    1056
 0    1326
 2    1028
 4     555
 6     191
 8      48
Name: count, dtype: int64
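Comparing the predicted labels with the labels seen during training gives a quick sanity check on class coverage. The sketch below uses only the counts printed in the two `value_counts()` outputs above; the variable names are introduced here for illustration.

```python
# Class labels and counts copied from the value_counts() outputs above.
train_counts = {-12: 1, -10: 5, -8: 78, -6: 412, -4: 1076, -2: 2124, 0: 2685,
                2: 2059, 4: 1084, 6: 397, 8: 69, 10: 9, 12: 1}
pred_counts = {-8: 42, -6: 211, -4: 543, -2: 1056, 0: 1326,
               2: 1028, 4: 555, 6: 191, 8: 48}

# Classes present in the training labels that never appear in the predictions.
never_predicted = sorted(set(train_counts) - set(pred_counts))

# How many training rows belong to those classes.
rare_rows = sum(train_counts[c] for c in never_predicted)
total_rows = sum(train_counts.values())

print(never_predicted)
print(f"{rare_rows} of {total_rows} training rows ({rare_rows / total_rows:.2%})")
```

The classes that are never predicted are the extreme signatures, which together make up only a handful of training rows, so the predictor has very little evidence from which to learn them.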

## Evaluation

We can evaluate the predictor on the test dataset using the `evaluate()` function, which measures how well our predictor performs on data that was not used for fitting the models.

In [28]:
predictor.evaluate(test_data, silent=True)

{'accuracy': 0.9492,
 'balanced_accuracy': 0.7607185299582315,
 'mcc': 0.9377531923409433}

In [30]:
(y_pred == test_data[label]).mean()

0.9492
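The gap between `accuracy` and `balanced_accuracy` in the output above is typical of imbalanced labels. The sketch below assumes the standard definition of balanced accuracy (the unweighted mean of per-class recall) and uses made-up toy labels, not the knot data, to show how one missed rare class lowers it much more than it lowers plain accuracy.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy labels (illustrative only): class 0 is common, class 10 is rare
# and is always misclassified as 8.
y_true = [0] * 8 + [2] + [10]
y_pred = [0] * 8 + [2] + [8]

print(accuracy(y_true, y_pred))           # 9 of 10 correct
print(balanced_accuracy(y_true, y_pred))  # recalls 1, 1 and 0, averaged
```

Because every class counts equally in the average, a class with a single test example and zero recall costs as much as a class with thousands of examples. The predictions shown earlier contain no ±10 or ±12 labels, so any such examples in the test set would contribute zero recall in the same way.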

AutoGluon's `TabularPredictor` also provides the `leaderboard()` function, which allows us to evaluate the performance of each individual trained model on the test data.

In [31]:
df_leaderboard = predictor.leaderboard(test_data)

In [32]:
df_leaderboard

|  | model | score_test | score_val | eval_metric | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | WeightedEnsemble_L2 | 0.9492 | 0.965966 | accuracy | 3.149451 | 0.773011 | 130.094284 | 0.02269 | 0.002604 | 0.340377 | 2 | True | 14 |
| 1 | LightGBM | 0.9456 | 0.955956 | accuracy | 0.77915 | 0.134874 | 8.066734 | 0.77915 | 0.134874 | 8.066734 | 1 | True | 5 |
| 2 | XGBoost | 0.9448 | 0.956957 | accuracy | 2.335654 | 0.404164 | 15.392614 | 2.335654 | 0.404164 | 15.392614 | 1 | True | 11 |
| 3 | LightGBMLarge | 0.9444 | 0.94995 | accuracy | 2.782085 | 0.647386 | 18.709358 | 2.782085 | 0.647386 | 18.709358 | 1 | True | 13 |
| 4 | CatBoost | 0.9432 | 0.955956 | accuracy | 0.060755 | 0.013117 | 80.060095 | 0.060755 | 0.013117 | 80.060095 | 1 | True | 8 |
| 5 | RandomForestEntr | 0.9384 | 0.94995 | accuracy | 0.273951 | 0.15025 | 9.067146 | 0.273951 | 0.15025 | 9.067146 | 1 | True | 7 |
| 6 | ExtraTreesGini | 0.936 | 0.946947 | accuracy | 1.006382 | 0.122427 | 2.960621 | 1.006382 | 0.122427 | 2.960621 | 1 | True | 9 |
| 7 | ExtraTreesEntr | 0.9358 | 0.942943 | accuracy | 1.151344 | 0.127958 | 2.653884 | 1.151344 | 0.127958 | 2.653884 | 1 | True | 10 |
| 8 | NeuralNetFastAI | 0.9356 | 0.93994 | accuracy | 0.100019 | 0.026136 | 15.192991 | 0.100019 | 0.026136 | 15.192991 | 1 | True | 3 |
| 9 | RandomForestGini | 0.9352 | 0.944945 | accuracy | 0.256482 | 0.122114 | 6.336494 | 0.256482 | 0.122114 | 6.336494 | 1 | True | 6 |


In [33]:
predictor.plot_ensemble_model()

'AutogluonModels/ag-20240428_184714/ensemble_model.png'
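The top leaderboard entry, `WeightedEnsemble_L2`, combines the predictions of the level-1 models. The sketch below shows the general idea of weighted averaging over class probabilities. It is a generic illustration, not AutoGluon's actual ensembling code, and the probabilities and weights are made-up values.

```python
def weighted_ensemble(prob_by_model, weights):
    """Blend per-model class probabilities using non-negative model weights."""
    total_weight = sum(weights.values())
    classes = next(iter(prob_by_model.values())).keys()
    return {
        c: sum(weights[m] * prob_by_model[m][c] for m in weights) / total_weight
        for c in classes
    }


# Hypothetical class probabilities for one example from two base models.
probs = {
    "LightGBM": {-2: 0.6, 0: 0.3, 2: 0.1},
    "XGBoost": {-2: 0.2, 0: 0.7, 2: 0.1},
}
# Hypothetical ensemble weights; a real ensemble would learn these on validation data.
weights = {"LightGBM": 0.75, "XGBoost": 0.25}

blended = weighted_ensemble(probs, weights)
predicted = max(blended, key=blended.get)  # class with the highest blended probability

print(blended)
print(predicted)
```

Here the two models disagree on the most likely class, and the blend follows the more heavily weighted model. This is one way an ensemble can score higher than any of its members, as the leaderboard suggests.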

In [35]:
summary_dict = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.965966    accuracy       0.773011  130.094284                0.002604           0.340377            2       True         14
1               XGBoost   0.956957    accuracy       0.404164   15.392614                0.404164          15.392614            1       True         11
2              CatBoost   0.955956    accuracy       0.013117   80.060095                0.013117          80.060095            1       True          8
3              LightGBM   0.955956    accuracy       0.134874    8.066734                0.134874           8.066734            1       True          5
4      RandomForestEntr   0.949950    accuracy       0.150250    9.067146                0.150250           9.067146            1       True          7
5         LightGBMLarge   

In [44]:
summary_dict["model_hyperparams"]['NeuralNetTorch']

{'num_epochs': 500,
 'epochs_wo_improve': 20,
 'activation': 'relu',
 'embedding_size_factor': 1.0,
 'embed_exponent': 0.56,
 'max_embedding_dim': 100,
 'y_range': None,
 'y_range_extend': 0.05,
 'dropout_prob': 0.1,
 'optimizer': 'adam',
 'learning_rate': 0.0003,
 'weight_decay': 1e-06,
 'proc.embed_min_categories': 4,
 'proc.impute_strategy': 'median',
 'proc.max_category_levels': 100,
 'proc.skew_threshold': 0.99,
 'use_ngram_features': False,
 'num_layers': 4,
 'hidden_size': 128,
 'max_batch_size': 512,
 'use_batchnorm': False,
 'loss_function': 'auto'}

## Conclusion

In this quickstart tutorial we saw AutoGluon's basic fit and predict functionality using `TabularDataset` and `TabularPredictor`. AutoGluon simplifies the model training process by not requiring feature engineering or model hyperparameter tuning. Check out the in-depth tutorials to learn more about AutoGluon's other features like customizing the training and prediction steps or extending AutoGluon with custom feature generators, models, or metrics.