## 1. Problem Definition

Goal: using the California housing dataset, predict the median house value (`MedHouseVal`) of each block group. This is a typical **regression problem**.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

plt.style.use('seaborn-v0_8-whitegrid')

## 2. Data Acquisition and Exploration (EDA)

In [None]:
# Load the data as a pandas DataFrame (features plus the MedHouseVal target)
housing = fetch_california_housing(as_frame=True)
df = housing.frame

print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Distribution of the target variable
plt.figure(figsize=(10, 6))
sns.histplot(df['MedHouseVal'], kde=True, color='#00D9FF')
plt.title('Distribution of Median House Value (Target)')
plt.xlabel('Value ($100,000)')
plt.show()

In [None]:
# Correlation analysis
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Heatmap')
plt.show()
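
The heatmap shows every pairwise correlation at once, but for modeling the column of correlations with the target is usually the part of most interest. A minimal sketch of how that column could be extracted and ranked, assuming a hypothetical helper `rank_by_target_corr` (not part of the original notebook) and a tiny made-up frame so that it runs on its own:

```python
import pandas as pd

def rank_by_target_corr(frame, target):
    """Return each feature's Pearson correlation with `target`, strongest first."""
    corr = frame.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up values for illustration only; in the notebook, pass df and 'MedHouseVal'
demo = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [5, 3, 4, 1, 2],
    'target': [2, 4, 6, 8, 10],
})
ranked = rank_by_target_corr(demo, 'target')
print(ranked)
```

In the notebook this would be called as `rank_by_target_corr(df, 'MedHouseVal')`. Pearson correlation only captures linear association, so a low value does not rule out a useful nonlinear relationship.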

## 3. Model Training

In [None]:
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare two models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    results[name] = {'RMSE': rmse, 'R2': r2}
    print(f"{name} - RMSE: {rmse:.4f}, R2: {r2:.4f}")
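
To make the two metrics concrete, here is a small sketch that computes RMSE and R2 by hand. The helper names `rmse` and `r2` and the toy values are illustrative assumptions; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (not taken from the housing data)
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]

print(f"RMSE: {rmse(y_true_demo, y_pred_demo):.4f}")
print(f"R2:   {r2(y_true_demo, y_pred_demo):.4f}")
```

Because `MedHouseVal` is expressed in units of $100,000, RMSE is in the same unit: an RMSE of 0.5, for example, would correspond to a typical error of roughly $50,000.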

## 4. Evaluation and Visualization

In [None]:
# Visualize the model performance comparison
res_df = pd.DataFrame(results).T

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
res_df['RMSE'].plot(kind='bar', ax=ax[0], color='#FF6B6B', title='RMSE (lower is better)')
res_df['R2'].plot(kind='bar', ax=ax[1], color='#4ECDC4', title='R2 Score (higher is better)')
plt.tight_layout()
plt.show()

## 5. Conclusion

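The random forest model clearly outperforms linear regression on both RMSE and R2 on the test set, which suggests that the relationship between house value and the features is at least partly nonlinear.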
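
The comparison above rests on a single train/test split. One way to check that the gap is not an artifact of that split is k-fold cross-validation. The sketch below is an assumption-laden illustration rather than part of the original analysis: it uses synthetic nonlinear data from `make_friedman1` so that it runs offline, and the names `candidates` and `cv_scores` are made up for the example.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear data so the sketch runs offline;
# in the notebook, X and y from the housing data would be used instead
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=42)

candidates = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validated R2: each fold serves as the held-out set once
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
    cv_scores[name] = scores
    print(f"{name} - mean R2: {scores.mean():.4f} (std {scores.std():.4f})")
```

If the ranking of the two models is the same across folds, the conclusion is less likely to depend on one particular split.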

---
**Next step**: try a hyperparameter grid search (`GridSearchCV`) to further tune the random forest model.
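
A minimal sketch of what such a grid search might look like. The grid values are hypothetical placeholders rather than tuned recommendations, and synthetic data from `make_regression` stands in for the housing data so that the example runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the notebook, X_train and y_train would be used instead
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Hypothetical grid: illustrative values only
param_grid = {
    'n_estimators': [20, 50],
    'max_depth': [None, 10],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximizes scores, so RMSE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best CV RMSE: {-search.best_score_:.4f}")
```

The search should be fitted on the training split only, with the test set kept aside for one final evaluation of `search.best_estimator_`.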