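# Solving the California House Prices Problem with AutoGluon

This notebook uses an AutoML framework to build a house price predictor quickly and shows AutoGluon's core strengths: simplicity, efficiency, and automation.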

## 1. Importing Libraries and Loading the Data

In [1]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Load the data
train_data = pd.read_csv("./california-house-prices/train.csv")
test_data = pd.read_csv("./california-house-prices/test.csv")

print(f"Training set shape: {train_data.shape}")
print(f"Test set shape: {test_data.shape}")
print("\nData preview:")
print(train_data.head(3))

Training set shape: (47439, 41)
Test set shape: (31626, 40)

Data preview:
   Id         Address  Sold Price  \
0   0     540 Pine Ln   3825000.0   
1   1  1727 W 67th St    505000.0   
2   2  28093 Pine Ave    140000.0   

                                             Summary          Type  \
0  540 Pine Ln, Los Altos, CA 94022 is a single f...  SingleFamily   
1  HURRY, HURRY.......Great house 3 bed and 2 bat...  SingleFamily   
2  'THE PERFECT CABIN TO FLIP!  Strawberry deligh...  SingleFamily   

   Year built                                       Heating  \
0      1969.0  Heating - 2+ Zones, Central Forced Air - Gas   
1      1926.0                                   Combination   
2      1958.0                                    Forced air   

                                             Cooling  \
0    Multi-Zone, Central AC, Whole House / Attic Fan   
1  Wall/Window Unit(s), Evaporative Cooling, See ...   
2                                                NaN   

                              Parking     Lot  

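## 2. Data Preprocessing

AutoGluon automatically handles missing values, categorical encoding, and feature engineering, so we only need to drop columns that are clearly not useful.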

In [2]:
# Drop columns that are clearly not useful as features
columns_to_drop = ['Address', 'Summary', 'Id']

train_df = train_data.drop(columns=columns_to_drop)
test_df = test_data.drop(columns=columns_to_drop)

# Split into training and validation sets
train_split, valid_split = train_test_split(train_df, test_size=0.3, random_state=42)

print(f"Training set: {len(train_split):,} samples")
print(f"Validation set: {len(valid_split):,} samples")
print(f"Number of features: {train_split.shape[1] - 1}")

Training set: 33,207 samples
Validation set: 14,232 samples
Number of features: 37
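
The two sample counts follow from simple arithmetic on the 47,439 training rows. The sketch below reproduces them under the assumption that, for a fractional `test_size`, the held-out count is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_valid) for a fractional hold-out split.

    Assumption: the held-out count is rounded up and the rest goes to training.
    """
    n_valid = math.ceil(test_size * n_samples)
    n_train = n_samples - n_valid
    return n_train, n_valid


# 47,439 rows with a 30% hold-out, as in the cell above.
n_train, n_valid = split_sizes(47439, 0.3)
print(f"Training set: {n_train:,} samples")
print(f"Validation set: {n_valid:,} samples")
```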


## 3. Training with AutoGluon

A few lines of code are enough to train and tune the models.

In [None]:
# Create the predictor
predictor = TabularPredictor(
    label='Sold Price',
    eval_metric='root_mean_squared_error',
    path='./autogluon_models'
)

# Train the models (automatic training with a 5-minute time limit)
predictor.fit(
    train_data=train_split,
    time_limit=300,
    presets='medium_quality',
    verbosity=2
)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.4.0
Python Version:     3.10.19
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 25.0.0: Wed Sep 17 21:41:50 PDT 2025; root:xnu-12377.1.9~141/RELEASE_ARM64_T6030
CPU Count:          11
Memory Avail:       4.54 GB / 18.00 GB (25.2%)
Disk Space Avail:   298.39 GB / 926.35 GB (32.2%)
Presets specified: ['medium_quality']
Using hyperparameters preset: hyperparameters='default'
Beginning AutoGluon training ... Time limit = 300s
AutoGluon will save models to "/Users/haoyiwen/Documents/ai/deeplearning/pytorch_2025/month_11/chapter_3_kaggle/autogluon_models"
Train Data Rows:    33207
Train Data Columns: 37
Label Column:       Sold Price
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (90000000.0, 100500.0, 1297977.23853, 1716926.21183)
	If 'regression' is not the correc

[1000]	valid_set's rmse: 582731


	-581792.8293	 = Validation score   (-root_mean_squared_error)
	10.2s	 = Training   runtime
	0.03s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 268.76s of the 268.76s of remaining time.
	Failed to import torch or check CUDA availability!Please ensure you have the correct version of PyTorch installed by running `pip install -U torch`
	Fitting with cpus=11, gpus=0, mem=1.0/4.8 GB
	-435014.267	 = Validation score   (-root_mean_squared_error)
	2.3s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 266.44s of the 266.44s of remaining time.
	Failed to import torch or check CUDA availability!Please ensure you have the correct version of PyTorch installed by running `pip install -U torch`
	Fitting with cpus=11, gpus=0, mem=0.0/4.7 GB
	-398912.6281	 = Validation score   (-root_mean_squared_error)
	489.6s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training mod

<autogluon.tabular.predictor.predictor.TabularPredictor at 0x10690bbe0>
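
The `WeightedEnsemble_L2` entry in the log blends the predictions of the base models instead of fitting the raw features directly. As a simplified illustration (not AutoGluon's internal implementation), a weighted ensemble for regression can be sketched as a weighted average of per-model predictions. The function name, the prediction values, and the weights below are all made up for the example.

```python
def weighted_ensemble(predictions: dict[str, list[float]],
                      weights: dict[str, float]) -> list[float]:
    """Blend per-model predictions, normalizing the weights to sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(next(iter(predictions.values())))
    blended = [0.0] * n_rows
    for name, preds in predictions.items():
        share = weights.get(name, 0.0) / total
        for i, value in enumerate(preds):
            blended[i] += share * value
    return blended


# Hypothetical predictions (in dollars) from two base models for three houses.
base_predictions = {
    "RandomForestMSE": [800_000.0, 500_000.0, 1_200_000.0],
    "LightGBM": [900_000.0, 450_000.0, 1_000_000.0],
}
# Hypothetical weights; a real ensemble would learn them on validation data.
model_weights = {"RandomForestMSE": 0.75, "LightGBM": 0.25}

blended_predictions = weighted_ensemble(base_predictions, model_weights)
print(blended_predictions)
```

Because the blend can lean on whichever base model validates best, an ensemble of this kind usually scores at least as well on validation data as its strongest member, which matches the ordering in the log.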

## 4. Model Evaluation

In [4]:
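# Evaluate on the validation set
# AutoGluon reports metrics in higher-is-better form, so RMSE comes back negated;
# take the absolute value before printing it as a dollar amount.
valid_score = predictor.evaluate(valid_split)
print(f"Validation RMSE: ${abs(valid_score['root_mean_squared_error']):,.2f}")

# Show the model leaderboard
leaderboard = predictor.leaderboard(valid_split, silent=True)
print("\nModel leaderboard (top 5):")
print(leaderboard.head())

Validation RMSE: $626,985.55

Model leaderboard (top 5):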
                 model     score_test      score_val              eval_metric  \
0  WeightedEnsemble_L2 -626985.545999 -381834.952511  root_mean_squared_error   
1      RandomForestMSE -633727.581121 -398912.628052  root_mean_squared_error   
2             LightGBM -674635.220342 -435014.266992  root_mean_squared_error   
3           LightGBMXT -753977.298874 -581792.829334  root_mean_squared_error   

   pred_time_test  pred_time_val    fit_time  pred_time_test_marginal  \
0        0.816994       0.117480  502.106743                 0.030922   
1        0.523310       0.072539  489.604302                 0.523310   
2        0.086051       0.013623    2.304236                 0.086051   
3        0.176711       0.031132   10.195352                 0.176711   

   pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \
0                0.000185           0.002853            2       True   
1                0.072539         489.604302  
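
The negative values in the leaderboard reflect AutoGluon's convention of reporting every metric in higher-is-better form, so an error metric such as RMSE is shown negated. The sketch below computes a plain RMSE on made-up numbers, negates it the way the leaderboard does, and then uses the two `WeightedEnsemble_L2` scores from the table to show how far the error on `valid_split` (`score_test`) sits above the error on AutoGluon's internal validation data (`score_val`). The helper names are hypothetical.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error; always non-negative."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def higher_is_better(error: float) -> float:
    """Negate an error so that larger scores mean better models."""
    return -error


# Made-up prices and predictions, in dollars.
y_true = [500_000.0, 1_000_000.0]
y_pred = [530_000.0, 960_000.0]
error = rmse(y_true, y_pred)
print(f"RMSE: {error:,.2f}, reported score: {higher_is_better(error):,.2f}")

# Scores copied from the WeightedEnsemble_L2 row of the leaderboard.
score_test = -626985.545999
score_val = -381834.952511
gap = abs(score_test) / abs(score_val) - 1
print(f"Held-out RMSE exceeds internal validation RMSE by {gap:.0%}")
```

A gap of this size suggests that the internal validation score is optimistic for this run, so the held-out figure is the safer one to quote.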

## 5. Test Set Prediction and Submission

In [5]:
# Predict on the test set
test_predictions = predictor.predict(test_df)

print(f"Predictions complete: {len(test_predictions):,} samples")
print(f"Mean prediction: ${test_predictions.mean():,.2f}")
print(f"Median prediction: ${test_predictions.median():,.2f}")

Predictions complete: 31,626 samples
Mean prediction: $881,822.12
Median prediction: $629,903.38
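
The mean prediction is well above the median, which is what one would expect from a right-skewed target such as house prices, where a few very expensive homes pull the average up. A tiny illustration with made-up numbers, using only the standard library:

```python
from statistics import mean, median

# Made-up prices: four ordinary homes and one very expensive one.
prices = [450_000, 520_000, 600_000, 700_000, 3_500_000]

mean_price = mean(prices)
median_price = median(prices)
print(f"Mean: ${mean_price:,.2f}")
print(f"Median: ${median_price:,.2f}")
```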


In [6]:
# Build the submission file
submission = pd.DataFrame({
    'Id': test_data['Id'],
    'Sold Price': test_predictions
})

submission.to_csv('submission_autogluon.csv', index=False)
print("✅ Submission file written: submission_autogluon.csv")
print("\nFirst 10 rows:")
print(submission.head(10))

✅ Submission file written: submission_autogluon.csv

First 10 rows:
      Id    Sold Price
0  47439  8.132494e+05
1  47440  5.584474e+05
2  47441  8.219297e+05
3  47442  8.253087e+05
4  47443  1.136194e+06
5  47444  7.605267e+05
6  47445  1.506724e+06
7  47446  4.502934e+05
8  47447  2.055936e+06
9  47448  4.951464e+05
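
Before uploading, it can be worth checking the file that was just written. The following sketch uses only the standard library to verify the header, the row count, and that every prediction is a finite positive number. `check_submission` is a hypothetical helper, and the demo writes a tiny stand-in file to a temporary directory rather than reading the real `submission_autogluon.csv`.

```python
import csv
import math
import os
import tempfile


def check_submission(path: str, expected_rows: int) -> list[str]:
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["Id", "Sold Price"]:
            return [f"unexpected header: {reader.fieldnames}"]
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for row in rows:
        try:
            price = float(row["Sold Price"])
        except (TypeError, ValueError):
            problems.append(f"Id {row['Id']}: not a number")
            continue
        if not math.isfinite(price) or price <= 0:
            problems.append(f"Id {row['Id']}: implausible price {price}")
    return problems


# Demo on a small stand-in file with the same two columns.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "submission_demo.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Sold Price"])
        writer.writerow([47439, 813249.4])
        writer.writerow([47440, 558447.4])
    demo_problems = check_submission(demo_path, expected_rows=2)
    print(demo_problems or "no problems found")
```

For the real file, the call would be `check_submission('submission_autogluon.csv', expected_rows=31626)`, using the test set size reported earlier.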


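## Summary

**Core strengths of AutoGluon:**
- ✅ Only about 10 lines of core code
- ✅ Trains multiple models automatically under a 5-minute time limit (the logged run reports about 502 s of total fit time, almost all of it from RandomForestMSE)
- ✅ Automatic feature engineering and model ensembling
- ✅ No manual hyperparameter tuning

**Suitable use cases:**
- Rapid prototyping
- Tabular data prediction
- Kaggle competition baselines
- Getting a business model into production quickly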