TabularPredictor
To start, import AutoGluon's TabularPredictor and TabularDataset classes:

In [1]:
from autogluon.tabular import TabularDataset, TabularPredictor


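Load training data from a CSV file into an AutoGluon Dataset object. This object is essentially equivalent to a Pandas DataFrame, and the same methods can be applied to both.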

In [3]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29128,23,Private,190483,Bachelors,13,Never-married,Sales,Own-child,White,Female,0,0,20,United-States,<=50K
23950,72,?,188009,7th-8th,4,Divorced,?,Not-in-family,White,Male,0,0,30,United-States,<=50K
13700,45,Private,117310,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,50,United-States,>50K
35248,21,Local-gov,596776,Some-college,10,Never-married,Adm-clerical,Own-child,White,Male,0,0,40,Guatemala,<=50K


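Note that we loaded the data from a CSV file stored in the cloud. If you have already downloaded the CSV file to your own machine (e.g., using wget), you can specify a local file path instead. Each row in the table train_data corresponds to a single training example. In this particular dataset, each row corresponds to an individual person, and the columns contain various characteristics reported during a census.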

Let's first use these features to predict whether the person's income exceeds $50,000, which is recorded in the class column of this table.

In [4]:
label = 'class'
print(f"Unique classes: {list(train_data[label].unique())}")

Unique classes: [' >50K', ' <=50K']


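AutoGluon works with raw data, meaning you don't need to perform any data preprocessing before fitting AutoGluon. We strongly recommend that you avoid operations such as missing-value imputation or one-hot encoding, because AutoGluon has dedicated logic to handle these cases automatically. You can learn more about AutoGluon's preprocessing in the Feature Engineering Tutorial.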

Training
Now we initialize and fit AutoGluon's TabularPredictor in one line of code:

In [5]:
predictor = TabularPredictor(label=label).fit(train_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20240316_180950"
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20240316_180950"
AutoGluon Version:  1.0.0
Python Version:     3.11.8
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 22.3.0: Mon Jan 30 20:39:35 PST 2023; root:xnu-8792.81.3~2/RELEASE_ARM64_T8103
CPU C

That's it! We now have a TabularPredictor that is able to make predictions on new data.

Prediction
Next, load separate test data to demonstrate how to make predictions on new examples at inference time:

In [6]:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
test_data.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States,<=50K
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States,<=50K
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States,>50K
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States,<=50K
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States,<=50K


We can now use our trained models to make predictions on the new data:

In [7]:
y_pred = predictor.predict(test_data)
y_pred.head()  # Predictions

0     <=50K
1     <=50K
2      >50K
3     <=50K
4     <=50K
Name: class, dtype: object

In [8]:
y_pred_proba = predictor.predict_proba(test_data)
y_pred_proba.head()  # Prediction Probabilities

Unnamed: 0,<=50K,>50K
0,0.981126,0.018874
1,0.983599,0.016401
2,0.478133,0.521867
3,0.994751,0.005249
4,0.988539,0.011461
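
The labels returned by predict() can be related to these probabilities. Assuming the default decision rule (no custom decision threshold), the predicted label is the column with the highest probability. The minimal sketch below uses plain Python and only the five rows printed above; the helper name `most_likely_class` is hypothetical and not part of AutoGluon.

```python
# Probabilities copied from the predict_proba() output above (rows 0-4).
proba_rows = [
    {"<=50K": 0.981126, ">50K": 0.018874},
    {"<=50K": 0.983599, ">50K": 0.016401},
    {"<=50K": 0.478133, ">50K": 0.521867},
    {"<=50K": 0.994751, ">50K": 0.005249},
    {"<=50K": 0.988539, ">50K": 0.011461},
]


def most_likely_class(row):
    """Return the label with the highest predicted probability (hypothetical helper)."""
    return max(row, key=row.get)


labels = [most_likely_class(row) for row in proba_rows]
print(labels)
```

Row 2 is the only close call (0.478 vs. 0.522), which is why it is the only row predicted as >50K.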


Evaluation
Next, we can evaluate the predictor on the (labeled) test data:

In [9]:
predictor.evaluate(test_data)

{'accuracy': 0.8409253761899887,
 'balanced_accuracy': 0.7475663839529563,
 'mcc': 0.5345297121913682,
 'roc_auc': 0.884716037791454,
 'f1': 0.6296472831267874,
 'precision': 0.7034078807241747,
 'recall': 0.5698878343399483}
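
Several of these metrics are related to each other, which gives a quick sanity check on the output. The sketch below recomputes F1 from the reported precision and recall, and recovers the specificity (true negative rate) implied by the balanced accuracy. It uses only the numbers printed above and standard metric definitions; it does not call AutoGluon.

```python
# Values copied from the evaluate() output above.
precision = 0.7034078807241747
recall = 0.5698878343399483
balanced_accuracy = 0.7475663839529563

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# For two classes, balanced accuracy is the mean of recall (true positive rate)
# and specificity (true negative rate), so specificity can be recovered from it.
specificity = 2 * balanced_accuracy - recall

print(f"f1 = {f1:.4f}, specificity = {specificity:.4f}")
```

The gap between recall and specificity suggests that the model misses positive examples more often than it raises false alarms, which the accuracy figure alone does not reveal.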

We can also evaluate the performance of each individual model:

In [10]:
predictor.leaderboard(test_data)

Unnamed: 0,model,score_test,score_val,eval_metric,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestGini,0.842461,0.84,accuracy,0.069452,0.015696,0.22602,0.069452,0.015696,0.22602,1,True,5
1,XGBoost,0.840925,0.86,accuracy,0.029019,0.003668,0.731228,0.029019,0.003668,0.731228,1,True,10
2,WeightedEnsemble_L2,0.840925,0.86,accuracy,0.030143,0.003964,0.910008,0.001124,0.000296,0.17878,2,True,13
3,RandomForestEntr,0.840925,0.83,accuracy,0.045826,0.015236,0.200196,0.045826,0.015236,0.200196,1,True,6
4,LightGBM,0.839799,0.85,accuracy,0.010307,0.001976,0.590442,0.010307,0.001976,0.590442,1,True,4
5,LightGBMXT,0.836421,0.83,accuracy,0.011586,0.002791,0.507735,0.011586,0.002791,0.507735,1,True,3
6,NeuralNetTorch,0.835705,0.83,accuracy,0.030303,0.005052,0.471932,0.030303,0.005052,0.471932,1,True,11
7,ExtraTreesGini,0.834374,0.82,accuracy,0.097729,0.015282,0.201569,0.097729,0.015282,0.201569,1,True,7
8,ExtraTreesEntr,0.832839,0.81,accuracy,0.057406,0.014712,0.19019,0.057406,0.014712,0.19019,1,True,8
9,LightGBMLarge,0.828949,0.83,accuracy,0.015583,0.002833,2.594767,0.015583,0.002833,2.594767,1,True,12
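
In the leaderboard, the `*_marginal` columns report the time spent by a model itself, while the corresponding non-marginal columns also include the models it depends on. The small check below uses values copied from the table. The numbers are consistent with WeightedEnsemble_L2 building only on XGBoost in this run; that reading is an inference from the table, not something the output states explicitly.

```python
# Values copied from the leaderboard rows for XGBoost and WeightedEnsemble_L2.
xgboost = {"fit_time": 0.731228, "pred_time_test": 0.029019}
ensemble = {
    "fit_time": 0.910008,
    "fit_time_marginal": 0.17878,
    "pred_time_test": 0.030143,
    "pred_time_test_marginal": 0.001124,
}

# Total time = time of the base model + marginal time of the ensemble itself.
fit_total = xgboost["fit_time"] + ensemble["fit_time_marginal"]
pred_total = xgboost["pred_time_test"] + ensemble["pred_time_test_marginal"]

print(round(fit_total, 6), round(pred_total, 6))
```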


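Loading a trained predictor
Finally, we can load the predictor in a new session (or on a new machine) by calling TabularPredictor.load() and specifying the location of the predictor artifact on disk.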

In [11]:
predictor.path  # The path on disk where the predictor is saved

'AutogluonModels/ag-20240316_180950'

In [12]:
# Load the predictor by specifying the path it is saved to on disk.
# You can control where it is saved to by setting the `path` parameter during init
predictor = TabularPredictor.load(predictor.path)

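Now you're ready to try AutoGluon on your own tabular datasets! As long as they're stored in a popular format like CSV, you should be able to achieve strong predictive performance with just 2 lines of code: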

from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label=<variable-name>).fit(train_data=<file-name>)

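Note: This simple call to TabularPredictor.fit() is intended for your first prototype model. In later sections, we demonstrate how to maximize predictive performance by additionally specifying the presets parameter to fit() and the eval_metric parameter to TabularPredictor().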

Description of fit()
Here we discuss what happened during fit().

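Since the class variable has only two possible values, this was a binary classification problem, for which an appropriate performance metric is accuracy. AutoGluon automatically infers this, as well as the type of each feature (i.e., which columns contain continuous numbers vs. discrete categories). AutoGluon can also automatically handle common issues such as missing data and rescaling feature values.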

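We did not specify separate validation data, so AutoGluon automatically chose a random training/validation split of the data. The data used for validation is kept separate from the training data and is used to determine the models and hyperparameter values that produce the best results. Rather than relying on a single model, AutoGluon trains multiple models and ensembles them together to obtain superior predictive performance.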

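By default, AutoGluon tries to fit various types of models, including neural networks and tree ensembles. Each type of model has various hyperparameters, which traditionally the user would have to specify. AutoGluon automates this process.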

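AutoGluon automatically and iteratively tests values for hyperparameters to produce the best performance on the validation data. This involves repeatedly training models under different hyperparameter settings and evaluating their performance. This process can be computationally intensive, so fit() parallelizes it across multiple threads using Ray. To control runtimes, you can specify arguments to fit() such as time_limit, as demonstrated in the subsequent in-depth tutorial.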
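
The idea of holding out validation data and keeping the setting that scores best on it can be sketched in a few lines of plain Python. Everything here is an illustrative assumption: the toy data, the 80/20 split, the tiny k-nearest-neighbors model, and the helper names are hypothetical and do not reflect AutoGluon's actual models or splitting logic.

```python
import random


def holdout_split(rows, val_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def knn_predict(train, x, k):
    """Predict the majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, rows, k):
    """Fraction of rows whose label the k-NN model predicts correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


# Toy data: one numeric feature; the label is 1 when the feature exceeds 50,
# with roughly 10% of the labels flipped to simulate noise.
rng = random.Random(42)
rows = []
for _ in range(200):
    x = rng.uniform(0, 100)
    y = int(x > 50)
    if rng.random() < 0.1:
        y = 1 - y
    rows.append((x, y))

train, val = holdout_split(rows)

# Try several hyperparameter values and keep the one with the best validation score.
scores = {k: accuracy(train, val, k) for k in (1, 5, 15)}
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

The validation rows are never used to make the neighbor lookups, so the scores estimate how each setting behaves on unseen data.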

We can view which properties of our prediction task AutoGluon inferred automatically:

In [13]:
print("AutoGluon infers problem type is: ", predictor.problem_type)
print("AutoGluon identified the following types of features:")
print(predictor.feature_metadata)

AutoGluon infers problem type is:  binary
AutoGluon identified the following types of features:
('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']


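AutoGluon correctly recognizes our prediction problem as a binary classification task and decides that variables such as age should be represented as integers, whereas variables such as workclass should be represented as categorical objects. The feature_metadata attribute lets you see the inferred data type of each predictive variable after preprocessing (this is its raw dtype; some features may also be associated with additional special dtypes if they were produced via feature engineering, e.g., numerical representations of a datetime/text column).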

To transform the data into AutoGluon's internal representation, we can do the following:

In [14]:
test_data_transform = predictor.transform_features(test_data)
test_data_transform.head()

Unnamed: 0,age,fnlwgt,education-num,sex,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,native-country
0,31,169085,7,0,0,0,20,3,1,1,10,5,4,14
1,17,226203,8,1,0,0,45,5,2,3,10,3,4,14
2,47,54260,11,1,0,1887,60,3,7,1,3,0,4,14
3,21,176262,10,0,0,0,30,3,13,3,3,3,4,14
4,17,241185,8,1,0,0,20,3,2,3,8,3,4,14


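Notice how the data is purely numeric after preprocessing (although categorical features will still be treated as categorical downstream).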
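
To make the integer codes in the transformed table less mysterious, here is a minimal sketch of one common way to encode categories as integers. It assigns codes in sorted order, which is an assumption made for illustration; the exact codes AutoGluon assigns (e.g., 3 for Private above) come from its own fitted mapping and will differ. The helper name `fit_category_codes` is hypothetical.

```python
def fit_category_codes(values):
    """Map each distinct category to an integer code (sorted order is an assumption)."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# workclass values of the first five test rows shown earlier.
workclass = ["Private", "Self-emp-not-inc", "Private", "Private", "Private"]

codes = fit_category_codes(workclass)
encoded = [codes[value] for value in workclass]
print(codes, encoded)
```

As in the transformed table, equal categories receive equal codes, so the column stays categorical in meaning even though it is stored as numbers.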

To better understand our trained predictor, we can estimate the overall importance of each feature via TabularPredictor.feature_importance():

In [15]:
predictor.feature_importance(test_data)

Computing feature importance via permutation shuffling for 14 features using 5000 rows with 5 shuffle sets...
	4.04s	= Expected runtime (0.81s per shuffle set)
	1.74s	= Actual runtime (Completed 5 of 5 shuffle sets)


Unnamed: 0,importance,stddev,p_value,n,p99_high,p99_low
marital-status,0.0508,0.003792,3.698489e-06,5,0.058608,0.042992
capital-gain,0.03852,0.002318,1.565361e-06,5,0.043292,0.033748
education-num,0.02968,0.001346,5.063512e-07,5,0.032452,0.026908
age,0.015,0.00285,0.000149044,5,0.020867,0.009133
hours-per-week,0.01172,0.003974,0.00136943,5,0.019902,0.003538
occupation,0.00528,0.001803,0.001406849,5,0.008993,0.001567
relationship,0.00472,0.001154,0.0003967984,5,0.007096,0.002344
native-country,0.00144,0.000654,0.003959537,5,0.002787,9.3e-05
capital-loss,0.00128,0.000415,0.001155921,5,0.002134,0.000426
fnlwgt,0.00108,0.002361,0.1820562,5,0.00594,-0.00378


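The importance column is an estimate of how much the evaluation metric score would drop if the feature were removed from the data. Negative values of importance mean that refitting with the feature removed is likely to improve the results.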
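
The log above mentions permutation shuffling with several shuffle sets. The general technique can be sketched in plain Python: shuffle one feature column at a time, re-score the model, and average the drop in score. The toy model, data, and helper names below are hypothetical, and the sketch omits details such as the p-values and confidence bounds that AutoGluon reports.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_shuffles=5, seed=0):
    """Average drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_shuffles):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild each row with the shuffled value in position j.
            X_perm = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_shuffles)
    return importances


# Toy data: feature 0 fully determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
X = [(data_rng.randint(0, 99), data_rng.randint(0, 99)) for _ in range(200)]
y = [int(a >= 50) for a, _ in X]


def model(row):
    """Stand-in for a trained model; it only looks at feature 0."""
    return int(row[0] >= 50)


importances = permutation_importance(model, X, y)
print(importances)
```

Shuffling the feature the model relies on costs a large share of its accuracy, while shuffling the ignored feature changes nothing, so its importance is exactly zero.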

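When we call predict(), AutoGluon automatically predicts with the model that displayed the best performance on the validation data (i.e., the weighted ensemble).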

In [16]:
predictor.model_best

'WeightedEnsemble_L2'

We can instead specify which model to use for predictions like this:

In [None]:
predictor.predict(test_data, model='LightGBM')

In [17]:
predictor.model_names()

['KNeighborsUnif',
 'KNeighborsDist',
 'LightGBMXT',
 'LightGBM',
 'RandomForestGini',
 'RandomForestEntr',
 'ExtraTreesGini',
 'ExtraTreesEntr',
 'NeuralNetFastAI',
 'XGBoost',
 'NeuralNetTorch',
 'LightGBMLarge',
 'WeightedEnsemble_L2']

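The predictive performance scores above are based on a default evaluation metric (accuracy for binary classification). Performance in some applications may be measured by metrics other than the one AutoGluon optimizes by default. If you know the metric that matters in your application, you should specify it via the eval_metric argument, as demonstrated in the next section.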
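
A small example shows why the choice of metric matters. On imbalanced data, a model that always predicts the majority class can score well on accuracy while being useless for the minority class. The sketch below uses made-up labels (10 positives, 90 negatives) and hand-written metric functions based on the standard definitions; none of it comes from the dataset above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels and a model that always predicts the majority class.
y_true = [">50K"] * 10 + ["<=50K"] * 90
y_pred = ["<=50K"] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(acc, bal)
```

Accuracy rewards this degenerate model, whereas balanced accuracy exposes that it never identifies the minority class.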

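Presets
AutoGluon comes with a variety of presets that can be specified in the call to .fit() via the presets argument. medium_quality is used by default to encourage initial prototyping, but for serious usage the other presets should be used instead.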

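We recommend that users start with medium_quality to get a sense of the problem and identify any data-related issues. If medium_quality takes too long to train, consider subsampling the training data during this prototyping phase.
Once you are comfortable, try best_quality next. Make sure to specify a time_limit at least 16x the value used with medium_quality. Once finished, you should have a very powerful solution that is often stronger than medium_quality.
Make sure to hold out test data that AutoGluon never sees during training, to confirm that the models perform as expected.
Once you have evaluated both best_quality and medium_quality, check whether either satisfies your needs. If neither does, consider trying high_quality and/or good_quality.
If none of the presets satisfies your requirements, refer to Predicting Columns in a Table - In Depth for more advanced AutoGluon options.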

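Maximizing predictive performance
Note: You should not call fit() with entirely default arguments if you are benchmarking AutoGluon-Tabular or hoping to maximize its accuracy! To get the best predictive accuracy with AutoGluon, you should generally use it like this: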

In [18]:
time_limit = 60  # for quick demonstration only; set this to the longest time you are willing to wait (in seconds)
metric = 'roc_auc'  # specify your evaluation metric here
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')

No path specified. Models will be saved in: "AutogluonModels/ag-20240317_030844"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Dynamic stacking is enabled (dynamic_stacking=True). AutoGluon will try to determine whether the input data is affected by stacked overfitting and enable or disable stacking as a consequence.
Detecting stacked overfitting by sub-fitting AutoGluon on the input data. That is, copies of AutoGluon will be sub-fit on subset(s) of the data. Then, the holdout validation data is used to detect stacked overfitting.
Sub-fit(s) time limit is: 60 seconds.
Starting holdout-based sub-fit for dynamic stacking. Context path is: AutogluonModels/ag-20240317_030844/ds_sub_fit/sub_fit_ho.
2024-03-17 11:08:44,978	INFO util.py:159 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
Beginning AutoGluon training ... Time limit = 

In [None]:
predictor.leaderboard(test_data)

This command implements the following strategy to maximize accuracy:

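Specify the argument presets='best_quality', which allows AutoGluon to automatically construct powerful model ensembles based on stacking/bagging, and which will greatly improve the resulting predictions if given sufficient training time. The default value of presets is 'medium_quality', which produces less accurate models but facilitates faster prototyping. With presets, you can flexibly prioritize predictive accuracy vs. training/inference speed. For example, if you care less about predictive performance and want to quickly deploy a basic model, consider using presets=['good_quality', 'optimize_for_deployment'].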

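Provide the eval_metric parameter to TabularPredictor() if you know which metric will be used to evaluate predictions in your application. Some other non-default metrics you might use include 'f1' (for binary classification), 'roc_auc' (for binary classification), 'log_loss' (for classification), 'mean_absolute_error' (for regression), and 'median_absolute_error' (for regression). You can also define your own custom metric function. For more information, refer to Adding a custom metric to AutoGluon.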

Include all your data in train_data and do not provide tuning_data (AutoGluon will split the data more intelligently to fit its needs).

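Do not specify the hyperparameter_tune_kwargs argument (counterintuitively, hyperparameter tuning is not the best way to spend a limited training time budget, as model ensembling is often superior). We recommend using hyperparameter_tune_kwargs only if your goal is to deploy a single model rather than an ensemble.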

Do not specify the hyperparameters argument (allow AutoGluon to adaptively select which models/hyperparameters to use).

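Set time_limit to the longest amount of time (in seconds) that you are willing to wait. AutoGluon's predictive performance improves the longer fit() is allowed to run.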
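
The training log above reports num_bag_folds=8. The idea behind such k-fold bagging is that each copy of a model trains on all folds but one and is validated on the remaining fold, so every row serves as validation data exactly once. This is also why supplying all data in train_data is sufficient. The sketch below shows only the index bookkeeping in plain Python, using simple interleaved folds; that fold assignment and the helper name `kfold_indices` are assumptions for illustration, and AutoGluon's actual splitting may shuffle or stratify the rows.

```python
def kfold_indices(n_rows, n_folds=8):
    """Yield (train_indices, validation_indices) for each of the n_folds folds."""
    folds = [list(range(i, n_rows, n_folds)) for i in range(n_folds)]
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


# 500 rows matches the subsample size used earlier in this tutorial.
splits = list(kfold_indices(500, n_folds=8))

for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Because the validation folds are disjoint and together cover every row, the out-of-fold predictions form a complete set that a stacked model can be trained on without seeing predictions made on a model's own training rows.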

