## 安装依赖并保存

In [None]:
%pip install scikit-learn
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install scipy
%pip freeze > requirements.txt

## 任务1：回归 – 加州房价预测

对回归任务选用了五种具有代表性的监督学习算法进行性能比较。首先使用 线性回归模型 作为基线模型，然后引入 Ridge 和 Lasso 进行正则化约束，缓解过拟合问题，同时比较两者的特征选择能力差异。在此基础上，考虑模型非线性表达能力，引入 随机森林回归（Random Forest） 与 梯度提升回归（Gradient Boosting），用于处理复杂特征关系并提升预测性能。最后，通过均方误差（MSE）与决定系数（R²）评估模型优劣。

In [3]:
# 回归模型：线性回归、Ridge、Lasso、随机森林、梯度提升
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# 1. 加载数据
data = fetch_california_housing()
X, y = data.data, data.target

# 2. 数据集划分
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. 标准化（对线性模型非常重要）
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. 定义模型
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.01),
    "Random Forest": RandomForestRegressor(n_estimators=100),
    "Gradient Boosting": GradientBoostingRegressor(),
}

# 5. 训练与评估
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
print(
    f"{name}: MSE={mean_squared_error(y_test, pred):.4f}, R2={r2_score(y_test, pred):.4f}"
)

Gradient Boosting: MSE=0.2940, R2=0.7756


## 任务2：分类 – 乳腺癌分类

对二分类任务选用了六种具有代表性的监督学习算法进行性能比较。首先使用逻辑回归（Logistic Regression）构建基线模型，对线性可分数据具有良好表现。随后引入支持向量机（SVM）与K 近邻（KNN）算法，分别从间隔最大化与基于距离度量的角度提升模型分类能力。为了考虑模型的非线性表达能力与稳定性，引入了三种基于决策树的模型：决策树（Decision Tree）、随机森林（Random Forest）和自适应提升算法（AdaBoost）。其中随机森林通过集成多棵树降低模型方差，AdaBoost通过迭代提升误分类样本权重实现增强学习效果。实验采用准确率（Accuracy）作为分类性能指标，在标准训练测试划分下分别评估各模型表现，并对比分析其优缺点与适用场景。

In [2]:
# 分类模型：逻辑回归、KNN、决策树、随机森林、SVM、AdaBoost
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

# 1. 加载数据
data = load_breast_cancer()
X, y = data.data, data.target

# 2. 数据集划分
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. 特征标准化（对SVM/KNN有效）
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4. 定义模型
models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(),
}

# 5. 训练与评估
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: Accuracy = {accuracy_score(y_test, pred):.4f}")

Logistic Regression: Accuracy = 0.9737
KNN: Accuracy = 0.9474
Decision Tree: Accuracy = 0.9474
Random Forest: Accuracy = 0.9561
SVM: Accuracy = 0.9825
AdaBoost: Accuracy = 0.9649
