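# Scikit-learn Machine Learning: A Complete Tutorial

This notebook is a broad tour of Scikit-learn, one of the most widely used machine learning libraries for Python. It covers:

- **Machine learning basics**: supervised learning, unsupervised learning, model evaluation
- **Data preprocessing**: feature scaling, encoding, feature selection, data cleaning
- **Classification algorithms**: support vector machines (SVM), decision trees, random forests, logistic regression, and others
- **Regression algorithms**: linear regression, polynomial regression, ridge regression, LASSO regression
- **Clustering algorithms**: K-means, hierarchical clustering, DBSCAN
- **Dimensionality reduction**: PCA, t-SNE, feature selection
- **Model selection**: cross-validation, grid search, evaluation metrics
- **Hands-on project**: MNIST handwritten digit classification and an end-to-end machine learning workflow

**Featured case study**: classifying handwritten digits from the MNIST dataset with a support vector machine (SVM), a classic problem in computer vision and pattern recognition.

Scikit-learn is known for its consistent API, broad collection of algorithms, and thorough documentation, which makes it a common first choice for learning and practicing machine learning.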

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set the random seed so that results are reproducible
np.random.seed(42)

# Matplotlib font setup: CJK-capable fonts first, DejaVu Sans as a fallback
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10, 6)

print("=== Scikit-learn environment setup ===")
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Load the full MNIST dataset (70,000 images, 28x28 pixels) from OpenML.
# This is not scikit-learn's small built-in digits set (load_digits, 8x8 images).
print("\n=== Loading the MNIST dataset ===")
from sklearn.datasets import fetch_openml

# The first call downloads the data; cache=True reuses the local copy afterwards
print("Downloading the MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)
X_mnist, y_mnist = mnist.data, mnist.target.astype(int)

print("MNIST dataset summary:")
print(f"- Feature matrix shape: {X_mnist.shape}")
print(f"- Label vector shape: {y_mnist.shape}")
print(f"- Feature range: {X_mnist.min():.1f} - {X_mnist.max():.1f}")
print(f"- Number of classes: {len(np.unique(y_mnist))}")
print("- Class distribution:")
unique, counts = np.unique(y_mnist, return_counts=True)
for digit, count in zip(unique, counts):
    print(f"  Digit {digit}: {count:,} samples")

# Visualize one sample per digit
fig, axes = plt.subplots(2, 5, figsize=(12, 6))
for i in range(10):
    # Index of the first sample of digit i
    idx = np.where(y_mnist == i)[0][0]
    row, col = divmod(i, 5)

    # Reshape the 784-dimensional vector into a 28x28 image
    image = X_mnist[idx].reshape(28, 28)
    axes[row, col].imshow(image, cmap='gray')
    axes[row, col].set_title(f'Digit {i}')
    axes[row, col].axis('off')

plt.suptitle('MNIST sample digits', fontsize=16)
plt.tight_layout()
plt.show()

print("✓ Environment setup and data loading complete!")

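## 1. Machine Learning Fundamentals

### 1.1 Types of Machine Learning

- **Supervised learning**: learns from labeled training data in order to predict labels for new data
  - Classification: predicts discrete labels (e.g. digit recognition)
  - Regression: predicts continuous values (e.g. house price prediction)

- **Unsupervised learning**: discovers hidden structure in unlabeled data
  - Clustering: groups similar samples together
  - Dimensionality reduction: produces a more compact representation of the data

- **Reinforcement learning**: learns a good policy by interacting with an environment (not covered by Scikit-learn)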
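
To make the supervised/unsupervised distinction concrete, here is a minimal sketch on the iris dataset. The choice of `LogisticRegression` and `KMeans`, and the variable names, are illustrative assumptions rather than part of the original notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the estimator is given both the features X and the labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised: the estimator is given only X and assigns its own group ids,
# which carry no inherent meaning and need not match the label numbering.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_[:5]

print("Predicted labels:", supervised_pred)
print("Cluster ids:     ", cluster_ids)
```

The only difference in the calls is whether `y` is passed to `fit`, which is the practical meaning of "labeled" versus "unlabeled" data.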

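### 1.2 Scikit-learn's Design Philosophy

- **Consistent API**: every estimator exposes `fit()`; predictors add `predict()` and transformers add `transform()`
- **Composability**: components can be chained together, for example with `Pipeline`
- **Sensible defaults**: parameters have default values that work out of the box
- **Inspectability**: hyperparameters and learned state are accessible as public attributes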
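
The four principles above can be seen in a few lines. This is a small sketch, not a recommended model; the pipeline and variable names are chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Composability + sensible defaults: chain a scaler and an SVM, no tuning.
pipe = make_pipeline(StandardScaler(), SVC())

# Consistent API: the pipeline itself is an estimator with fit/predict.
pipe.fit(X, y)
preds = pipe.predict(X[:3])

# Inspectability: learned attributes end with an underscore,
# and hyperparameters are available through get_params().
learned_means = pipe.named_steps["standardscaler"].mean_
default_kernel = pipe.named_steps["svc"].get_params()["kernel"]

print("Per-feature means learned by the scaler:", learned_means)
print("Default SVC kernel:", default_kernel)
```

`make_pipeline` names each step after its lowercased class name, which is why the steps are looked up as `"standardscaler"` and `"svc"`.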

In [None]:
# 1.3 A typical Scikit-learn workflow
print("=== A typical Scikit-learn workflow ===")

# Use the small iris dataset to walk through the complete workflow
from sklearn.datasets import load_iris

# Step 1: load the data
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

print("Step 1: load the data")
print(f"Feature names: {iris.feature_names}")
print(f"Class names: {iris.target_names}")
print(f"Data shape: {X_iris.shape}")

# Step 2: split into training and test sets (stratified by class)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

print("\nStep 2: split the data")
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")

# Step 3: feature scaling (fit on the training set only, then apply to both sets)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nStep 3: feature scaling")
print(f"Original feature range: {X_train.min():.2f} - {X_train.max():.2f}")
print(f"Scaled feature range: {X_train_scaled.min():.2f} - {X_train_scaled.max():.2f}")

# Step 4: train the model
model = SVC(kernel='rbf', random_state=42)
model.fit(X_train_scaled, y_train)

print("\nStep 4: train the model")
print(f"Model type: {type(model).__name__}")
print(f"Model parameters: {model.get_params()}")

# Step 5: predict
y_pred = model.predict(X_test_scaled)

print("\nStep 5: predict")
print(f"Predictions: {y_pred}")
print(f"True labels: {y_test}")

# Step 6: evaluate
accuracy = accuracy_score(y_test, y_pred)
print("\nStep 6: evaluate")
print(f"Accuracy: {accuracy:.4f}")

# Visualize the results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Scatter plot of the first two features, one color per class.
# (Each class is drawn once so the legend has no duplicate entries
# and the points match the axis labels.)
feature_names = iris.feature_names
for class_idx, class_name in enumerate(iris.target_names):
    mask = y_iris == class_idx
    axes[0].scatter(X_iris[mask, 0], X_iris[mask, 1],
                    label=class_name, alpha=0.7)

axes[0].set_xlabel(feature_names[0])
axes[0].set_ylabel(feature_names[1])
axes[0].set_title('Iris dataset: feature distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
im = axes[1].imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
axes[1].set_title('Confusion matrix')
axes[1].set_xlabel('Predicted label')
axes[1].set_ylabel('True label')

# Annotate each cell with its count
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        axes[1].text(j, i, format(cm[i, j], 'd'),
                    ha="center", va="center",
                    color="white" if cm[i, j] > cm.max() / 2 else "black")

plt.tight_layout()
plt.show()

print("\n✓ Workflow demonstration complete!")
print("This is the standard flow of a machine learning project:")
print("load data → split → preprocess features → train → predict → evaluate")

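## 2. Data Preprocessing

Data preprocessing is one of the most important steps in a machine learning project. In practice, careful preprocessing often affects the final result at least as much as the choice of algorithm.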
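
One common way to keep preprocessing organized is to bundle the steps into a single transformer that is fitted on the training rows only. The sketch below assumes a tiny made-up table (the column names `age`, `income`, `color` and the 4/2 row split are hypothetical) and combines imputation, scaling, and one-hot encoding with `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns with gaps, one categorical column.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 35, 28, 41],
    "income": [50000, np.nan, 60000, 70000, 52000, 58000],
    "color": ["red", "blue", "green", "red", "blue", "green"],
})
train, test = df.iloc[:4], df.iloc[4:]

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics (medians, means, categories) come from the training rows only.
train_out = preprocess.fit_transform(train)
test_out = preprocess.transform(test)

print(train_out.shape, test_out.shape)
```

Fitting on the training rows and only calling `transform` on the test rows avoids leaking test-set statistics into the model, which is the same discipline the workflow in section 1.3 follows with `StandardScaler`.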

In [None]:
# 2.1 Feature scaling
print("=== Feature scaling ===")

# Create example data
from sklearn.datasets import make_classification
X_example, y_example = make_classification(n_samples=1000, n_features=4, 
                                          n_informative=3, n_redundant=1, 
                                          random_state=42)

# Simulate features on very different scales
X_example[:, 0] *= 1000  # scale the first feature up by 1000x
X_example[:, 1] *= 0.01  # scale the second feature down by 100x

feature_labels = [f'Feature {i+1}' for i in range(4)]

print("Summary statistics of the original data:")
feature_stats = pd.DataFrame(X_example, columns=feature_labels)
print(feature_stats.describe())

# Three common scalers
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

scalers = {
    'Standardization (StandardScaler)': StandardScaler(),
    'Min-max scaling (MinMaxScaler)': MinMaxScaler(),
    'Robust scaling (RobustScaler)': RobustScaler()
}

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Distribution of the original data.
# Tick labels are set separately because newer Matplotlib releases
# rename the boxplot `labels` keyword.
axes[0, 0].boxplot(X_example)
axes[0, 0].set_xticklabels(feature_labels)
axes[0, 0].set_title('Original data')
axes[0, 0].set_ylabel('Value')

# Effect of each scaler
for idx, (name, scaler) in enumerate(scalers.items()):
    row, col = divmod(idx + 1, 2)
    X_scaled = scaler.fit_transform(X_example)
    axes[row, col].boxplot(X_scaled)
    axes[row, col].set_xticklabels(feature_labels)
    axes[row, col].set_title(name)
    axes[row, col].set_ylabel('Scaled value')

plt.tight_layout()
plt.show()

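# 2.2 Encoding categorical variables
print("\n=== Encoding categorical variables ===")

# Example data with categorical columns
data_cat = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'yellow', 'yellow'],
    'size': ['small', 'medium', 'large', 'small', 'large', 'medium', 'small', 'large'],
    'brand': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'price': [100, 200, 150, 120, 180, 160, 90, 210]
})

print("Original categorical data:")
print(data_cat)

# Label encoding: maps each category to an integer.
# LabelEncoder is designed for target labels; for input features,
# OrdinalEncoder or OneHotEncoder is usually the better fit.
from sklearn.preprocessing import LabelEncoder
le_color = LabelEncoder()
data_cat['color_label'] = le_color.fit_transform(data_cat['color'])

print(f"\nColor label mapping: {dict(zip(le_color.classes_, range(len(le_color.classes_))))}")

# One-hot encoding (the sparse_output argument requires scikit-learn >= 1.2)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
color_onehot = ohe.fit_transform(data_cat[['color']])
color_onehot_df = pd.DataFrame(color_onehot, columns=[f'color_{cat}' for cat in ohe.categories_[0]])

print("\nOne-hot encoding of color:")
print(color_onehot_df.head())

# Ordinal encoding: for categorical variables with a natural order
from sklearn.preprocessing import OrdinalEncoder
# Define the order of the size categories
size_mapping = [['small', 'medium', 'large']]
oe = OrdinalEncoder(categories=size_mapping)
# fit_transform returns a 2-D array of shape (n, 1); flatten it into a column
data_cat['size_ordinal'] = oe.fit_transform(data_cat[['size']]).ravel()

print("\nSize ordinal mapping: small=0, medium=1, large=2")
print(data_cat[['size', 'size_ordinal']])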

# 2.3 Handling missing values
print("\n=== Handling missing values ===")

# Example data containing missing values
data_missing = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 28, np.nan, 32],
    'income': [50000, np.nan, 60000, 70000, np.nan, 55000, 65000],
    'education_years': [16, 18, 12, np.nan, 14, 16, 20]
})

print("Data with missing values:")
print(data_missing)
print("\nMissing values per column:")
print(data_missing.isnull().sum())

from sklearn.impute import SimpleImputer, KNNImputer

# Imputation strategies
imputers = {
    'Mean imputation': SimpleImputer(strategy='mean'),
    'Median imputation': SimpleImputer(strategy='median'),
    'Most-frequent imputation': SimpleImputer(strategy='most_frequent'),
    'KNN imputation': KNNImputer(n_neighbors=3)
}

results = {}
for name, imputer in imputers.items():
    filled_data = imputer.fit_transform(data_missing)
    results[name] = pd.DataFrame(filled_data, columns=data_missing.columns)

# Compare the results of the different strategies
print("\nComparison of imputation strategies:")
for name, result in results.items():
    print(f"\n{name}:")
    print(result.round(0))

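# 2.4 Feature selection
print("\n=== Feature selection ===")

# Demonstrate feature selection on a subset of MNIST.
# Only the first 1000 samples are used to keep the computation fast.
X_subset = X_mnist[:1000]
y_subset = y_mnist[:1000]

print(f"Original number of features: {X_subset.shape[1]}")

# Variance-threshold selection
from sklearn.feature_selection import VarianceThreshold
var_threshold = VarianceThreshold(threshold=100)  # drop features with variance below 100
X_var_selected = var_threshold.fit_transform(X_subset)

print(f"Features kept by the variance threshold: {X_var_selected.shape[1]}")

# Univariate feature selection (chi2 requires non-negative features, which raw pixel values satisfy)
from sklearn.feature_selection import SelectKBest, chi2
k_best = SelectKBest(score_func=chi2, k=100)  # keep the 100 highest-scoring features
X_k_best = k_best.fit_transform(X_subset, y_subset)

print(f"Features kept by SelectKBest: {X_k_best.shape[1]}")

# Recursive feature elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Logistic regression serves as the base estimator.
# step=50 removes 50 features per iteration; the default step=1 would refit
# the model more than 700 times on 784 features.
estimator = LogisticRegression(random_state=42, max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=50, step=50)
X_rfe = rfe.fit_transform(X_subset, y_subset)

print(f"Features kept by RFE: {X_rfe.shape[1]}")

# Show a few of the original images for reference
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i in range(3):
    axes[i].imshow(X_subset[i].reshape(28, 28), cmap='gray')
    axes[i].set_title(f'Original image (label: {y_subset[i]})')
    axes[i].axis('off')

plt.suptitle('Example MNIST images', fontsize=16)
plt.tight_layout()
plt.show()

print("\nKey points on data preprocessing:")
print("✓ Feature scaling matters for distance-based algorithms")
print("✓ Choose the encoding of a categorical variable based on its nature")
print("✓ Handle missing values with the data's distribution and meaning in mind")
print("✓ Feature selection can reduce computation and may improve model performance")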

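## 3. Support Vector Machines (SVM) and MNIST Classification

The support vector machine is a powerful supervised learning algorithm that works well for classifying high-dimensional data. This section covers how SVMs work and then applies one to handwritten digit classification on MNIST.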
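
As a small illustration of what a trained kernel SVM actually stores, the sketch below rebuilds the decision function of a binary RBF `SVC` by hand from its support vectors, dual coefficients, and intercept. The toy data and the fixed `gamma = 0.5` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=42)

gamma = 0.5  # fixed so the kernel can be recomputed by hand
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# f(x) = sum_i dual_coef_i * K(sv_i, x) + intercept,
# with K(a, b) = exp(-gamma * ||a - b||^2)
K = rbf_kernel(svm.support_vectors_, X, gamma=gamma)   # shape (n_sv, n_samples)
manual = (svm.dual_coef_ @ K + svm.intercept_).ravel()

# The sign of f(x) selects between the two classes.
manual_pred = svm.classes_[(manual > 0).astype(int)]

print("Support vectors used:", len(svm.support_vectors_), "of", len(X))
print("Matches decision_function:", np.allclose(manual, svm.decision_function(X)))
```

Only the support vectors enter the sum, which is why the prediction cost of an SVM grows with the number of support vectors rather than with the full training set size.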

In [None]:
# 3.1 SVM basics and kernel functions
print("=== SVM basics ===")

# Start with simple two-class data to illustrate how an SVM works
from sklearn.datasets import make_blobs

# Generate two-class data
X_2d, y_2d = make_blobs(n_samples=100, centers=2, cluster_std=1.2, 
                       center_box=(-2.0, 2.0), random_state=42)

# Fit an SVM with each kernel
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

for idx, kernel in enumerate(kernels):
    row, col = divmod(idx, 2)
    
    # Train the SVM
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_2d, y_2d)
    
    # Build a grid for visualizing the decision regions
    h = 0.02
    x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
    y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict the class of every grid point
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = svm.predict(mesh_points)
    Z = Z.reshape(xx.shape)
    
    # Draw the decision regions
    axes[row, col].contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdBu)
    
    # Draw the data points
    scatter = axes[row, col].scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, cmap=plt.cm.RdBu)
    
    # Circle the support vectors
    axes[row, col].scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
                          s=100, facecolors='none', edgecolors='black', linewidth=2)
    
    axes[row, col].set_title(f'{kernel.upper()} kernel')
    axes[row, col].set_xlabel('Feature 1')
    axes[row, col].set_ylabel('Feature 2')

plt.suptitle('SVM decision boundaries for different kernels', fontsize=16)
plt.tight_layout()
plt.show()

print("SVM kernels at a glance:")
print("• Linear: linear kernel, suited to data that is (nearly) linearly separable")
print("• Polynomial: polynomial kernel, can model non-linear relationships")
print("• RBF (Gaussian): radial basis function kernel, the most common non-linear kernel and the SVC default")
print("• Sigmoid: tanh-based kernel, reminiscent of a neural network activation")

# 3.2 Preparing and preprocessing the MNIST data
print("\n=== Preparing the MNIST data ===")

# Use a subset of MNIST to keep training fast;
# a real project could use the full dataset
n_samples = 5000  # 5000 samples for this demonstration

# Random sampling without replacement
indices = np.random.choice(len(X_mnist), n_samples, replace=False)
X_mnist_subset = X_mnist[indices]
y_mnist_subset = y_mnist[indices]

print(f"MNIST subset: {X_mnist_subset.shape}")
print("Class distribution:")
unique, counts = np.unique(y_mnist_subset, return_counts=True)
for digit, count in zip(unique, counts):
    print(f"  Digit {digit}: {count} samples")

# Preprocessing
# 1. Feature scaling (pixel values from 0-255 to 0-1)
X_mnist_scaled = X_mnist_subset / 255.0

# 2. Train/test split (stratified by digit)
X_train_mnist, X_test_mnist, y_train_mnist, y_test_mnist = train_test_split(
    X_mnist_scaled, y_mnist_subset, test_size=0.3, random_state=42, 
    stratify=y_mnist_subset
)

print("\nSplit result:")
print(f"Training set: {X_train_mnist.shape}")
print(f"Test set: {X_test_mnist.shape}")

# 3.3 Training and tuning the SVM
print("\n=== SVM training and hyperparameter tuning ===")

# First, train an SVM with the default parameters
print("1. SVM with default parameters:")
svm_default = SVC(random_state=42)

# Train the model and time it
import time
start_time = time.time()
svm_default.fit(X_train_mnist, y_train_mnist)
training_time = time.time() - start_time

print(f"Training time: {training_time:.2f} s")

# Evaluate on the test set
y_pred_default = svm_default.predict(X_test_mnist)
accuracy_default = accuracy_score(y_test_mnist, y_pred_default)
print(f"Accuracy with default parameters: {accuracy_default:.4f}")

# Tune the hyperparameters with a grid search
print("\n2. Grid search over hyperparameters:")
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'poly'],
    'gamma': ['scale', 'auto', 0.001, 0.01]
}

# Note: this grid is kept small for demonstration and speed;
# a real project can search a larger grid
grid_search = GridSearchCV(
    SVC(random_state=42), 
    param_grid, 
    cv=3,  # 3-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,  # use all CPU cores
    verbose=1
)

print("Running the grid search...")
start_time = time.time()
grid_search.fit(X_train_mnist, y_train_mnist)
search_time = time.time() - start_time

print(f"Grid search time: {search_time:.2f} s")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_svm = grid_search.best_estimator_
y_pred_best = best_svm.predict(X_test_mnist)
accuracy_best = accuracy_score(y_test_mnist, y_pred_best)
print(f"Accuracy after tuning: {accuracy_best:.4f}")

# 3.4 Detailed model evaluation
print("\n=== Detailed model evaluation ===")

# Confusion matrix
cm = confusion_matrix(y_test_mnist, y_pred_best)

# Classification report
print("Classification report:")
print(classification_report(y_test_mnist, y_pred_best))

# Visualize the results
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. Confusion matrix heatmap
im = axes[0, 0].imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
axes[0, 0].set_title('Confusion matrix')
axes[0, 0].set_xlabel('Predicted label')
axes[0, 0].set_ylabel('True label')

# Annotate each cell with its count
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        axes[0, 0].text(j, i, format(cm[i, j], 'd'),
                       ha="center", va="center",
                       color="white" if cm[i, j] > cm.max() / 2 else "black")

plt.colorbar(im, ax=axes[0, 0])

# 2. Per-digit recall (correct predictions / true samples of that digit)
class_recall = cm.diagonal() / cm.sum(axis=1)
axes[0, 1].bar(range(10), class_recall)
axes[0, 1].set_title('Per-digit recall')
axes[0, 1].set_xlabel('Digit')
axes[0, 1].set_ylabel('Recall')
axes[0, 1].set_xticks(range(10))

# 3. One correctly classified sample (top-right panel)
correct_indices = np.where(y_test_mnist == y_pred_best)[0]
if len(correct_indices) > 0:
    idx = correct_indices[0]
    axes[0, 2].imshow(X_test_mnist[idx].reshape(28, 28), cmap='gray')
    axes[0, 2].set_title(f'Correct: {y_test_mnist[idx]}')
axes[0, 2].axis('off')

# 4. Up to three misclassified samples (bottom row)
incorrect_indices = np.where(y_test_mnist != y_pred_best)[0][:3]
for i in range(3):
    if i < len(incorrect_indices):
        idx = incorrect_indices[i]
        axes[1, i].imshow(X_test_mnist[idx].reshape(28, 28), cmap='gray')
        axes[1, i].set_title(f'Wrong: true {y_test_mnist[idx]} → predicted {y_pred_best[idx]}')
    axes[1, i].axis('off')

plt.tight_layout()
plt.show()

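# 3.5 Performance analysis
print("\n=== Performance analysis ===")

# Compute additional evaluation metrics
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test_mnist, y_pred_best, average='weighted')
recall = recall_score(y_test_mnist, y_pred_best, average='weighted')
f1 = f1_score(y_test_mnist, y_pred_best, average='weighted')

performance_metrics = {
    'Accuracy': accuracy_best,
    'Precision (weighted)': precision,
    'Recall (weighted)': recall,
    'F1 score (weighted)': f1
}

print("Overall metrics:")
for metric, value in performance_metrics.items():
    print(f"{metric}: {value:.4f}")

# Compare the computational cost
print("\nComputational cost:")
print(f"Training time with default parameters: {training_time:.2f} s")
print(f"Grid search time: {search_time:.2f} s")
print(f"Relative accuracy change from tuning: {((accuracy_best - accuracy_default) / accuracy_default * 100):.2f}%")

print("\nKey takeaways for SVM on MNIST:")
print("✓ SVMs can perform well on high-dimensional image data")
print("✓ The RBF kernel is a strong default for this kind of task")
print("✓ Hyperparameter tuning can improve performance; compare the two accuracies above")
print("✓ SVMs are sensitive to feature scale, so preprocessing is needed")
print("✓ The number of support vectors gives a rough indication of how complex the decision boundary is")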

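## 4. Comparing Classification Algorithms

Besides SVM, scikit-learn provides many other capable classification algorithms. This section compares several of them on the MNIST subset to show their characteristics and the situations each one suits.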
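
A single train/test split gives one number per model, which can be noisy. As a complementary sketch (not the procedure this notebook uses), cross-validation reports a mean and a spread per model. It runs on scikit-learn's small built-in `load_digits` data so it finishes quickly; the two candidate models are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset (not MNIST), chosen only for speed
X, y = load_digits(return_X_y=True)

candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

cv_summary = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy: one score per fold
    scores = cross_val_score(model, X, y, cv=5)
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

When two models' means differ by less than their fold-to-fold spread, the ranking between them should be treated as inconclusive.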

In [None]:
# 4.1 Set up several classifiers
print("=== Comparing classification algorithms ===")

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Reuse the MNIST subset prepared earlier
print(f"Dataset: training set {X_train_mnist.shape}, test set {X_test_mnist.shape}")

# Define the classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'Naive Bayes': GaussianNB(),
    'Neural Network': MLPClassifier(hidden_layer_sizes=(100,), random_state=42, max_iter=300)
}

# 4.2 Train and evaluate every classifier
results = {}
training_times = {}

print("\nTraining and evaluating the classifiers...")
for name, classifier in classifiers.items():
    print(f"\nTraining {name}...")
    
    # Time the training
    start_time = time.time()
    classifier.fit(X_train_mnist, y_train_mnist)
    training_time = time.time() - start_time
    training_times[name] = training_time
    
    # Predict on the test set
    y_pred = classifier.predict(X_test_mnist)
    
    # Evaluation metrics
    accuracy = accuracy_score(y_test_mnist, y_pred)
    precision = precision_score(y_test_mnist, y_pred, average='weighted')
    recall = recall_score(y_test_mnist, y_pred, average='weighted')
    f1 = f1_score(y_test_mnist, y_pred, average='weighted')
    
    results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'training_time': training_time
    }
    
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Training time: {training_time:.2f} s")

# 4.3 Visualizing and analyzing the results
print("\n=== Algorithm performance comparison ===")

# Collect the results in a DataFrame
results_df = pd.DataFrame(results).T
results_df = results_df.round(4)

print("Performance of all algorithms:")
print(results_df)

# Visualize the results
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. Accuracy comparison
algorithms = list(results.keys())
accuracies = [results[alg]['accuracy'] for alg in algorithms]

axes[0, 0].bar(range(len(algorithms)), accuracies)
axes[0, 0].set_title('Accuracy comparison')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].set_xticks(range(len(algorithms)))
axes[0, 0].set_xticklabels(algorithms, rotation=45, ha='right')
axes[0, 0].grid(True, alpha=0.3)

# Label each bar with its value
for i, v in enumerate(accuracies):
    axes[0, 0].text(i, v + 0.005, f'{v:.3f}', ha='center', va='bottom')

# 2. Training time comparison
times = [results[alg]['training_time'] for alg in algorithms]

axes[0, 1].bar(range(len(algorithms)), times, color='orange')
axes[0, 1].set_title('Training time comparison')
axes[0, 1].set_ylabel('Training time (s)')
axes[0, 1].set_xticks(range(len(algorithms)))
axes[0, 1].set_xticklabels(algorithms, rotation=45, ha='right')
axes[0, 1].set_yscale('log')  # logarithmic scale
axes[0, 1].grid(True, alpha=0.3)

# 3. Accuracy vs. training time scatter plot
axes[0, 2].scatter(times, accuracies, s=100, alpha=0.7)
for i, alg in enumerate(algorithms):
    axes[0, 2].annotate(alg, (times[i], accuracies[i]), 
                       xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[0, 2].set_xlabel('Training time (s)')
axes[0, 2].set_ylabel('Accuracy')
axes[0, 2].set_title('Accuracy vs. training time')
axes[0, 2].set_xscale('log')
axes[0, 2].grid(True, alpha=0.3)

# 4. F1 score comparison
f1_scores = [results[alg]['f1_score'] for alg in algorithms]

axes[1, 0].bar(range(len(algorithms)), f1_scores, color='green')
axes[1, 0].set_title('F1 score comparison')
axes[1, 0].set_ylabel('F1 score')
axes[1, 0].set_xticks(range(len(algorithms)))
axes[1, 0].set_xticklabels(algorithms, rotation=45, ha='right')
axes[1, 0].grid(True, alpha=0.3)

# 5. Radar chart of the overall metrics
from math import pi

# Show the five most accurate algorithms
top_5_algorithms = sorted(algorithms, key=lambda x: results[x]['accuracy'], reverse=True)[:5]
metrics = ['accuracy', 'precision', 'recall', 'f1_score']

# All four metrics already lie in [0, 1], so no rescaling is needed
normalized_data = {}
for alg in top_5_algorithms:
    normalized_data[alg] = [results[alg][metric] for metric in metrics]

# Angles of the radar axes
angles = [n / float(len(metrics)) * 2 * pi for n in range(len(metrics))]
angles += angles[:1]  # close the polygon

# plt.subplots creates Cartesian axes, which have no set_theta_* methods,
# so replace this slot with a polar axes
axes[1, 1].remove()
ax_radar = fig.add_subplot(2, 3, 5, projection='polar')
ax_radar.set_theta_offset(pi / 2)
ax_radar.set_theta_direction(-1)

colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, alg in enumerate(top_5_algorithms):
    values = normalized_data[alg] + normalized_data[alg][:1]  # close the polygon
    
    ax_radar.plot(angles, values, 'o-', linewidth=2, label=alg, color=colors[i])
    ax_radar.fill(angles, values, alpha=0.25, color=colors[i])

ax_radar.set_xticks(angles[:-1])
ax_radar.set_xticklabels(metrics)
ax_radar.set_ylim(0, 1)
ax_radar.set_title('Radar chart: top 5 algorithms')
ax_radar.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))
ax_radar.grid(True)

# 6. Confusion matrix of the most accurate algorithm
best_algorithm = max(algorithms, key=lambda x: results[x]['accuracy'])
best_classifier = classifiers[best_algorithm]
y_pred_best = best_classifier.predict(X_test_mnist)
cm_best = confusion_matrix(y_test_mnist, y_pred_best)

im = axes[1, 2].imshow(cm_best, interpolation='nearest', cmap=plt.cm.Blues)
axes[1, 2].set_title(f'Confusion matrix of the best algorithm\n({best_algorithm})')
axes[1, 2].set_xlabel('Predicted label')
axes[1, 2].set_ylabel('True label')

# Annotate each cell with its count
for i in range(cm_best.shape[0]):
    for j in range(cm_best.shape[1]):
        axes[1, 2].text(j, i, format(cm_best[i, j], 'd'),
                       ha="center", va="center",
                       color="white" if cm_best[i, j] > cm_best.max() / 2 else "black")

plt.tight_layout()
plt.show()

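# 4.4 Summary of algorithm characteristics
print("\n=== Characteristics of each algorithm ===")

algorithm_characteristics = {
    'Logistic Regression': {
        'advantages': ['fast to train', 'probability outputs', 'interpretable linear model'],
        'disadvantages': ['assumes a linear decision boundary', 'sensitive to feature engineering'],
        'best_for': 'linearly separable problems, or when probabilities are needed'
    },
    'Decision Tree': {
        'advantages': ['highly interpretable', 'no feature scaling needed', 'handles non-linearity'],
        'disadvantages': ['prone to overfitting', 'sensitive to small changes in the data'],
        'best_for': 'problems that require interpretability'
    },
    'Random Forest': {
        'advantages': ['less overfitting than a single tree', 'feature importances', 'robust'],
        'disadvantages': ['high memory use', 'less interpretable'],
        'best_for': 'general-purpose classification, feature selection'
    },
    'SVM (RBF)': {
        'advantages': ['strong on high-dimensional data', 'memory-efficient (keeps only support vectors)', 'flexible kernels'],
        'disadvantages': ['slow to train on large datasets', 'sensitive to hyperparameters', 'no probability output by default'],
        'best_for': 'high-dimensional data, small to medium sample sizes'
    },
    'K-Nearest Neighbors': {
        'advantages': ['simple and intuitive', 'almost no training cost', 'adapts to local patterns'],
        'disadvantages': ['slow prediction', 'suffers from the curse of dimensionality', 'needs feature scaling'],
        'best_for': 'small datasets where local patterns matter'
    },
    'Gradient Boosting': {
        'advantages': ['high predictive accuracy', 'handles complex patterns', 'feature importances'],
        'disadvantages': ['slow to train', 'can overfit', 'many hyperparameters'],
        'best_for': 'tabular-data competitions, high accuracy requirements'
    },
    'Naive Bayes': {
        'advantages': ['very fast to train', 'works well on small data', 'naturally multi-class'],
        'disadvantages': ['assumes independent features', 'numeric features need a distributional assumption'],
        'best_for': 'text classification, small datasets'
    },
    'Neural Network': {
        'advantages': ['learns complex patterns', 'learns features automatically', 'flexible architecture'],
        'disadvantages': ['needs a lot of data', 'black-box model', 'slow to train'],
        'best_for': 'large datasets, complex non-linear patterns'
    }
}

for alg_name in algorithms:
    if alg_name in algorithm_characteristics:
        char = algorithm_characteristics[alg_name]
        print(f"\n{alg_name}:")
        print(f"  Accuracy: {results[alg_name]['accuracy']:.4f}")
        print(f"  Advantages: {', '.join(char['advantages'])}")
        print(f"  Disadvantages: {', '.join(char['disadvantages'])}")
        print(f"  Best for: {char['best_for']}")

fastest_alg = min(algorithms, key=lambda x: results[x]['training_time'])
# Accuracy per second of training; the small floor guards against a zero timing
best_tradeoff = max(algorithms, key=lambda x: results[x]['accuracy'] / max(results[x]['training_time'], 1e-9))

print("\nOn the MNIST handwritten digit task:")
print(f"🏆 Highest accuracy: {best_algorithm} ({results[best_algorithm]['accuracy']:.4f})")
print(f"⚡ Fastest training: {fastest_alg} ({results[fastest_alg]['training_time']:.2f} s)")
print(f"⚖️  Best accuracy per training second: {best_tradeoff}")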

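## 5. Regression Algorithms in Depth

Regression predicts continuous values and is the other major family of supervised learning problems. This section covers a range of regression algorithms, using house price prediction as the worked example.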
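
Before comparing many regressors, it may help to see what regularization does to the coefficients of a linear model. The sketch below uses synthetic data in which only 3 of 10 features are informative; the `alpha` values are arbitrary illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=5.0).fit(X, y)    # L1 penalty: can set coefficients exactly to zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("OLS   coefficient norm:", np.linalg.norm(ols.coef_).round(2))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_).round(2))
print("Lasso coefficients set to zero:", n_zero_lasso, "of", X.shape[1])
```

Ridge keeps every feature but pulls the coefficients toward zero, while Lasso tends to drop uninformative features entirely, which is why it is sometimes used as a feature selection step.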

In [None]:
# 5.1 Load and prepare the regression data
print("=== Regression algorithms in depth ===")

# load_boston was removed in scikit-learn 1.2, so it is not imported here
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Because of ethical concerns about the Boston housing dataset,
# we generate a synthetic dataset with a similar flavor instead
print("Generating the house price dataset...")
np.random.seed(42)

# Dataset size
n_samples = 1000
n_features = 8

# Generate the features and the target
X_house, y_house = make_regression(
    n_samples=n_samples, 
    n_features=n_features, 
    noise=10, 
    random_state=42
)

# Give the features more meaningful names
feature_names = [
    'area', 'rooms', 'bathrooms', 'floor',
    'year_built', 'location_score', 'school_score', 'transit_score'
]

# Convert to a DataFrame for easier analysis
house_data = pd.DataFrame(X_house, columns=feature_names)
house_data['price'] = y_house

# Shift the standard-normal features into more realistic-looking ranges.
# Caveat: the target was generated from the untransformed features, and abs()
# discards the sign, so the four abs-transformed features largely lose their
# linear relationship with the price.
house_data['area'] = house_data['area'] * 10 + 100  # mean 100, std 10 (roughly 70-130 m²)
house_data['rooms'] = np.abs(house_data['rooms']) + 2   # continuous, roughly 2-5
house_data['bathrooms'] = np.abs(house_data['bathrooms']) + 1   # continuous, roughly 1-4
house_data['floor'] = np.abs(house_data['floor']) % 20 + 1      # roughly 1-4 here, since |x| stays far below 20
house_data['year_built'] = 2020 - (np.abs(house_data['year_built']) % 30)  # roughly 2017-2020 here
house_data['price'] = house_data['price'] * 0.01 + 50  # rescale the price range

print("House price dataset summary:")
print(f"Number of samples: {len(house_data)}")
print(f"Number of features: {len(feature_names)}")
print("\nFirst 5 rows:")
print(house_data.head())

print("\nSummary statistics:")
print(house_data.describe().round(2))

# Correlation analysis
correlation_matrix = house_data.corr()

# Visualize the distributions and correlations
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. Price distribution
axes[0, 0].hist(house_data['price'], bins=30, alpha=0.7, color='skyblue')
axes[0, 0].set_title('Price distribution')
axes[0, 0].set_xlabel('Price')
axes[0, 0].set_ylabel('Count')
axes[0, 0].grid(True, alpha=0.3)

# 2. Feature correlation heatmap
im = axes[0, 1].imshow(correlation_matrix, cmap='coolwarm', aspect='auto')
axes[0, 1].set_title('Feature correlation matrix')
axes[0, 1].set_xticks(range(len(house_data.columns)))
axes[0, 1].set_yticks(range(len(house_data.columns)))
axes[0, 1].set_xticklabels(house_data.columns, rotation=45, ha='right')
axes[0, 1].set_yticklabels(house_data.columns)

# Annotate each cell with its correlation value
for i in range(len(house_data.columns)):
    for j in range(len(house_data.columns)):
        text = axes[0, 1].text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}',
                              ha="center", va="center", color="black" if abs(correlation_matrix.iloc[i, j]) < 0.5 else "white")

plt.colorbar(im, ax=axes[0, 1])

# 3. Area vs. price scatter plot
axes[0, 2].scatter(house_data['area'], house_data['price'], alpha=0.6)
axes[0, 2].set_xlabel('Area')
axes[0, 2].set_ylabel('Price')
axes[0, 2].set_title('Area vs. price')
axes[0, 2].grid(True, alpha=0.3)

# 4. Box plot of price by (integer) number of rooms
room_groups = house_data.groupby(house_data['rooms'].astype(int))
room_prices = [group['price'].values for name, group in room_groups]
room_labels = [str(int(name)) for name, group in room_groups]

# Tick labels are set separately because newer Matplotlib releases
# rename the boxplot `labels` keyword
axes[1, 0].boxplot(room_prices)
axes[1, 0].set_xticklabels(room_labels)
axes[1, 0].set_xlabel('Number of rooms')
axes[1, 0].set_ylabel('Price')
axes[1, 0].set_title('Price distribution by number of rooms')
axes[1, 0].grid(True, alpha=0.3)

# 5. Year built vs. price
axes[1, 1].scatter(house_data['year_built'], house_data['price'], alpha=0.6, color='green')
axes[1, 1].set_xlabel('Year built')
axes[1, 1].set_ylabel('Price')
axes[1, 1].set_title('Year built vs. price')
axes[1, 1].grid(True, alpha=0.3)

# 6. First look at feature importance (absolute correlation with price;
#    the last entry after sorting is price itself and is dropped)
feature_importance = correlation_matrix['price'].abs().sort_values(ascending=True)[:-1]
axes[1, 2].barh(range(len(feature_importance)), feature_importance.values)
axes[1, 2].set_yticks(range(len(feature_importance)))
axes[1, 2].set_yticklabels(feature_importance.index)
axes[1, 2].set_xlabel('Absolute correlation with price')
axes[1, 2].set_title('Feature importance (correlation)')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Prepare the training data
X_house_final = house_data[feature_names]
y_house_final = house_data['price']

# Train/test split
X_train_house, X_test_house, y_train_house, y_test_house = train_test_split(
    X_house_final, y_house_final, test_size=0.3, random_state=42
)

print("\nSplit result:")
print(f"Training set: {X_train_house.shape}")
print(f"Test set: {X_test_house.shape}")

# 5.2 Implementing and comparing several regression algorithms
print("\n=== Comparing regression algorithms ===")

# Standardize the features
scaler_house = StandardScaler()
X_train_house_scaled = scaler_house.fit_transform(X_train_house)
X_test_house_scaled = scaler_house.transform(X_test_house)

# Define the regression models.
# Note: the target was rescaled by 0.01 above, so alpha=1.0 is a relatively
# strong penalty here and may shrink many Lasso / Elastic Net coefficients
# to zero; in practice alpha should be tuned.
regressors = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0, random_state=42),
    'Lasso Regression': Lasso(alpha=1.0, random_state=42),
    'Elastic Net': ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Support Vector Regression': SVR(kernel='rbf')
}

# Polynomial regression (built with a Pipeline)
poly_regression = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('linear', LinearRegression())
])
regressors['Polynomial Regression'] = poly_regression

# Train and evaluate every regression model
regression_results = {}
regression_times = {}

print("\nTraining and evaluating the regression models...")
for name, regressor in regressors.items():
    print(f"\nTraining {name}...")
    
    # Penalized and kernel models get the standardized features
    if name in ['Ridge Regression', 'Lasso Regression', 'Elastic Net', 'Support Vector Regression']:
        X_train_use = X_train_house_scaled
        X_test_use = X_test_house_scaled
    else:
        X_train_use = X_train_house
        X_test_use = X_test_house
    
    # Time the training
    start_time = time.time()
    regressor.fit(X_train_use, y_train_house)
    training_time = time.time() - start_time
    regression_times[name] = training_time
    
    # Predict on the test set
    y_pred_house = regressor.predict(X_test_use)
    
    # Evaluation metrics
    mae = mean_absolute_error(y_test_house, y_pred_house)
    mse = mean_squared_error(y_test_house, y_pred_house)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test_house, y_pred_house)
    
    regression_results[name] = {
        'MAE': mae,
        'MSE': mse,
        'RMSE': rmse,
        'R²': r2,
        'Training_Time': training_time
    }
    
    print(f"  R² score: {r2:.4f}")
    print(f"  RMSE: {rmse:.2f}")
    print(f"  Training time: {training_time:.4f} s")

# 5.3 Visualizing and analyzing the regression results
print("\n=== Regression results ===")

# Collect the results in a DataFrame
regression_df = pd.DataFrame(regression_results).T
print("Performance of all regression algorithms:")
print(regression_df.round(4))

# Visualize the results
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. R² score comparison
models = list(regression_results.keys())
r2_scores = [regression_results[model]['R²'] for model in models]

axes[0, 0].bar(range(len(models)), r2_scores, color='lightblue')
axes[0, 0].set_title('R² score comparison')
axes[0, 0].set_ylabel('R² score')
axes[0, 0].set_xticks(range(len(models)))
axes[0, 0].set_xticklabels(models, rotation=45, ha='right')
axes[0, 0].grid(True, alpha=0.3)

# Label each bar with its value
for i, v in enumerate(r2_scores):
    axes[0, 0].text(i, v + 0.01, f'{v:.3f}', ha='center', va='bottom')

# 2. RMSE comparison
rmse_scores = [regression_results[model]['RMSE'] for model in models]

axes[0, 1].bar(range(len(models)), rmse_scores, color='lightcoral')
axes[0, 1].set_title('RMSE comparison (lower is better)')
axes[0, 1].set_ylabel('RMSE')
axes[0, 1].set_xticks(range(len(models)))
axes[0, 1].set_xticklabels(models, rotation=45, ha='right')
axes[0, 1].grid(True, alpha=0.3)

# 3. Training time comparison
training_times_reg = [regression_results[model]['Training_Time'] for model in models]

axes[0, 2].bar(range(len(models)), training_times_reg, color='lightgreen')
axes[0, 2].set_title('Training time comparison')
axes[0, 2].set_ylabel('Training time (s)')
axes[0, 2].set_xticks(range(len(models)))
axes[0, 2].set_xticklabels(models, rotation=45, ha='right')
axes[0, 2].set_yscale('log')
axes[0, 2].grid(True, alpha=0.3)

# 4. Predicted vs. true values for the best model (highest R²)
best_model_name = max(models, key=lambda x: regression_results[x]['R²'])
best_regressor = regressors[best_model_name]

# Predict again for plotting, using the same inputs the model was trained on
if best_model_name in ['Ridge Regression', 'Lasso Regression', 'Elastic Net', 'Support Vector Regression']:
    y_pred_best = best_regressor.predict(X_test_house_scaled)
else:
    y_pred_best = best_regressor.predict(X_test_house)

axes[1, 0].scatter(y_test_house, y_pred_best, alpha=0.6)
axes[1, 0].plot([y_test_house.min(), y_test_house.max()], 
               [y_test_house.min(), y_test_house.max()], 'r--', lw=2)
axes[1, 0].set_xlabel('True price')
axes[1, 0].set_ylabel('Predicted price')
axes[1, 0].set_title(f'Best model: predicted vs. true\n({best_model_name})')
axes[1, 0].grid(True, alpha=0.3)

# 5. Residual analysis
residuals = y_test_house - y_pred_best
axes[1, 1].scatter(y_pred_best, residuals, alpha=0.6)
axes[1, 1].axhline(y=0, color='r', linestyle='--')
axes[1, 1].set_xlabel('Predicted value')
axes[1, 1].set_ylabel('Residual')
axes[1, 1].set_title('Residual plot')
axes[1, 1].grid(True, alpha=0.3)

# 6. Performance vs. complexity (the complexity scores are subjective, for illustration only)
complexity_scores = {
    'Linear Regression': 1,
    'Ridge Regression': 2,
    'Lasso Regression': 2,
    'Elastic Net': 3,
    'Polynomial Regression': 4,
    'Decision Tree': 5,
    'Random Forest': 7,
    'Gradient Boosting': 8,
    'Support Vector Regression': 6
}

complexity_vals = [complexity_scores[model] for model in models]
axes[1, 2].scatter(complexity_vals, r2_scores, s=100, alpha=0.7)
for i, model in enumerate(models):
    axes[1, 2].annotate(model, (complexity_vals[i], r2_scores[i]), 
                       xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1, 2].set_xlabel('模型复杂度')
axes[1, 2].set_ylabel('R² 分数')
axes[1, 2].set_title('模型复杂度 vs 性能')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 5.4 回归算法特点总结
print("\n=== 回归算法特点总结 ===")

regression_characteristics = {
    'Linear Regression': {
        'description': '最简单的线性回归模型',
        'advantages': ['可解释性强', '训练速度快', '无超参数'],
        'disadvantages': ['假设线性关系', '对多重共线性敏感'],
        'when_to_use': '特征数量较少、线性关系明显'
    },
    'Ridge Regression': {
        'description': 'L2正则化的线性回归',
        'advantages': ['防止过拟合', '处理多重共线性', '稳定性好'],
        'disadvantages': ['系数不会为零', '仍假设线性关系'],
        'when_to_use': '特征数量多、存在多重共线性'
    },
    'Lasso Regression': {
        'description': 'L1正则化的线性回归',
        'advantages': ['自动特征选择', '稀疏解', '防止过拟合'],
        'disadvantages': ['可能选择错误特征', '不稳定'],
        'when_to_use': '需要特征选择、稀疏模型'
    },
    'Elastic Net': {
        'description': '结合L1和L2正则化',
        'advantages': ['平衡Ridge和Lasso', '稳定的特征选择'],
        'disadvantages': ['需要调节两个参数', '计算复杂'],
        'when_to_use': '特征数量很多、需要稳定的特征选择'
    },
    'Decision Tree': {
        'description': '基于决策树的回归',
        'advantages': ['非线性建模', '可解释性', '无需特征缩放'],
        'disadvantages': ['容易过拟合', '对数据变化敏感'],
        'when_to_use': '非线性关系、需要可解释性'
    },
    'Random Forest': {
        'description': '多个决策树的集成',
        'advantages': ['减少过拟合', '特征重要性', '鲁棒性'],
        'disadvantages': ['可解释性降低', '内存消耗大'],
        'when_to_use': '通用回归问题、特征重要性分析'
    }
}

for model_name in models[:6]:  # 展示前6个主要算法
    if model_name in regression_characteristics:
        char = regression_characteristics[model_name]
        result = regression_results[model_name]
        
        print(f"\n{model_name}:")
        print(f"  描述: {char['description']}")
        print(f"  R² 分数: {result['R²']:.4f}")
        print(f"  RMSE: {result['RMSE']:.2f}")
        print(f"  优点: {', '.join(char['advantages'])}")
        print(f"  缺点: {', '.join(char['disadvantages'])}")
        print(f"  适用场景: {char['when_to_use']}")

print(f"\n房价预测任务总结:")
print(f"🏆 最佳R²分数: {best_model_name} ({regression_results[best_model_name]['R²']:.4f})")
fastest_reg = min(models, key=lambda x: regression_results[x]['Training_Time'])
print(f"⚡ 最快训练: {fastest_reg} ({regression_results[fastest_reg]['Training_Time']:.4f}秒)")

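print(f"\nRegression metric notes:")
print(f"• MAE (mean absolute error): mean of the absolute differences between predictions and true values")
print(f"• RMSE (root mean squared error): more sensitive to large errors; same unit as the target")
print(f"• R² (coefficient of determination): share of target variance explained by the model; closer to 1 is better")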
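
The three regression metrics used above can be reproduced by hand. The following is a minimal sketch on made-up toy values (not the housing data), checking the manual formulas against the scikit-learn functions of the same name:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (illustrative only, not taken from the housing data)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred                         # residuals
mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

# The hand-computed values should agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

Because RMSE squares the residuals before averaging, the single error of 1.5 contributes most of its value, while MAE weights all errors linearly.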

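## 6. Clustering Algorithms in Depth

Clustering is a major branch of unsupervised learning: it groups samples by similarity without using labels, which helps uncover hidden patterns and structure in the data. This section covers several clustering algorithms and applies them to a customer-segmentation scenario.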
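
To make "grouping by similarity" concrete, here is a minimal sketch of the plain Lloyd iteration behind K-means, written in NumPy. The data, the helper name `lloyd_kmeans`, and the choice to seed one centroid from each group are assumptions made for illustration; scikit-learn's `KMeans` uses a more careful initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated synthetic groups of 50 points each (illustrative data)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
true_groups = np.repeat([0, 1, 2], 50)

def lloyd_kmeans(X, init_centroids, n_iter=20):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    centroids = init_centroids.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return labels, centroids

# Simplifying assumption: seed one centroid from each group so no cluster ends up empty
labels, centroids = lloyd_kmeans(X, X[[0, 50, 100]])
print("Cluster sizes:", np.bincount(labels))
print("Centroids:\n", centroids.round(2))
```

No labels are used during fitting; `true_groups` exists only so the result can be checked afterwards.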

In [None]:
# 6.1 生成客户数据用于聚类分析
print("=== 聚类算法详解 ===")

import time  # used below to measure clustering run time

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, SpectralClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, adjusted_rand_score, calinski_harabasz_score
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# 生成客户消费数据
print("生成客户消费行为数据...")
np.random.seed(42)

# 创建三个不同类型的客户群体
n_customers = 300

# 高价值客户 (收入高，消费多)
high_value = np.random.multivariate_normal(
    mean=[80, 120], cov=[[100, 50], [50, 200]], size=100
)

# 中等价值客户 (收入中等，消费中等)
medium_value = np.random.multivariate_normal(
    mean=[50, 70], cov=[[80, 30], [30, 120]], size=100
)

# 低价值客户 (收入低，消费少)
low_value = np.random.multivariate_normal(
    mean=[25, 35], cov=[[50, 20], [20, 80]], size=100
)

# 合并数据
X_customers = np.vstack([high_value, medium_value, low_value])
true_labels = np.repeat([0, 1, 2], 100)  # integer group labels: 0 = high, 1 = medium, 2 = low value

# 添加更多特征
additional_features = np.random.randn(n_customers, 3)  # 添加3个额外特征
X_customers = np.hstack([X_customers, additional_features])

# 创建DataFrame
feature_names_cluster = ['年收入(千)', '年消费(千)', '购买频率', '客户满意度', '推荐意愿']
customer_data = pd.DataFrame(X_customers, columns=feature_names_cluster)

# 确保数据为正值并调整范围
customer_data['年收入(千)'] = np.abs(customer_data['年收入(千)']) + 20
customer_data['年消费(千)'] = np.abs(customer_data['年消费(千)']) + 10
customer_data['购买频率'] = np.abs(customer_data['购买频率']) * 5 + 1
customer_data['客户满意度'] = ((customer_data['客户满意度'] + 3) * 1.5).clip(1, 10)  # rescale, then clip to a 1-10 score
customer_data['推荐意愿'] = ((customer_data['推荐意愿'] + 3) * 1.5).clip(1, 10)      # rescale, then clip to a 1-10 score

print(f"客户数据概览:")
print(f"样本数量: {len(customer_data)}")
print(f"特征数量: {len(feature_names_cluster)}")
print("\n前5行数据:")
print(customer_data.head())

print("\n数据统计:")
print(customer_data.describe().round(2))

# 数据可视化
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. 收入vs消费散点图
scatter = axes[0, 0].scatter(customer_data['年收入(千)'], customer_data['年消费(千)'], 
                           c=true_labels, cmap='viridis', alpha=0.7)
axes[0, 0].set_xlabel('年收入(千)')
axes[0, 0].set_ylabel('年消费(千)')
axes[0, 0].set_title('客户收入vs消费分布 (真实分组)')
axes[0, 0].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[0, 0])

# 2-5. Distributions of the first four features (panels 2 to 5)
for i, feature in enumerate(feature_names_cluster[:4]):
    row, col = divmod(i + 1, 3)  # skip panel (0, 0), which holds the scatter plot
    axes[row, col].hist(customer_data[feature], bins=20, alpha=0.7)
    axes[row, col].set_title(f'Distribution of {feature}')
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Count')
    axes[row, col].grid(True, alpha=0.3)

# 6. Correlation matrix
correlation_matrix_cluster = customer_data.corr()
im = axes[1, 2].imshow(correlation_matrix_cluster, cmap='coolwarm', aspect='auto')
axes[1, 2].set_title('客户特征相关性')
axes[1, 2].set_xticks(range(len(feature_names_cluster)))
axes[1, 2].set_yticks(range(len(feature_names_cluster)))
axes[1, 2].set_xticklabels(feature_names_cluster, rotation=45, ha='right')
axes[1, 2].set_yticklabels(feature_names_cluster)

for i in range(len(feature_names_cluster)):
    for j in range(len(feature_names_cluster)):
        text = axes[1, 2].text(j, i, f'{correlation_matrix_cluster.iloc[i, j]:.2f}',
                              ha="center", va="center", 
                              color="black" if abs(correlation_matrix_cluster.iloc[i, j]) < 0.5 else "white")

plt.colorbar(im, ax=axes[1, 2])
plt.tight_layout()
plt.show()

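# Standardize the features (distance-based clustering is sensitive to feature scale).
# The feature columns are selected explicitly so that re-running this cell after the
# cluster-label column has been added does not scale the labels as well.
scaler_cluster = StandardScaler()
X_customers_scaled = scaler_cluster.fit_transform(customer_data[feature_names_cluster])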

print("✓ 客户数据准备完成！")

# 6.2 K-means聚类详解
print("\n=== K-means聚类详解 ===")

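# Choose the number of clusters: the elbow method (inertia) and the silhouette score
print("1. Choosing the number of clusters with the elbow method and the silhouette score...")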

inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_customers_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_customers_scaled, kmeans.labels_))

# 可视化肘部法则和轮廓系数
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# 肘部法则
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('聚类数量 (K)')
axes[0].set_ylabel('簇内误差平方和 (Inertia)')
axes[0].set_title('肘部法则')
axes[0].grid(True, alpha=0.3)

# 轮廓系数
axes[1].plot(K_range, silhouette_scores, 'ro-')
axes[1].set_xlabel('聚类数量 (K)')
axes[1].set_ylabel('轮廓系数')
axes[1].set_title('轮廓系数法')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 选择最优K值
optimal_k = K_range[np.argmax(silhouette_scores)]
print(f"基于轮廓系数的最优聚类数量: {optimal_k}")

# 使用最优K值进行K-means聚类
kmeans_optimal = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
kmeans_labels = kmeans_optimal.fit_predict(X_customers_scaled)

print(f"K-means聚类结果:")
print(f"聚类数量: {optimal_k}")
print(f"轮廓系数: {silhouette_score(X_customers_scaled, kmeans_labels):.4f}")
print(f"Calinski-Harabasz指数: {calinski_harabasz_score(X_customers_scaled, kmeans_labels):.2f}")

# 6.3 多种聚类算法对比
print("\n=== 多种聚类算法对比 ===")

# 定义多种聚类算法
clustering_algorithms = {
    'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
    'Hierarchical': AgglomerativeClustering(n_clusters=3),
    'DBSCAN': DBSCAN(eps=0.5, min_samples=5),
    'Gaussian Mixture': GaussianMixture(n_components=3, random_state=42),
    'Spectral Clustering': SpectralClustering(n_clusters=3, random_state=42)
}

# 应用所有聚类算法
clustering_results = {}
cluster_labels = {}

for name, algorithm in clustering_algorithms.items():
    print(f"应用 {name}...")
    
    start_time = time.time()
    # Every estimator in the dict, including GaussianMixture, implements fit_predict
    labels = algorithm.fit_predict(X_customers_scaled)
    
    clustering_time = time.time() - start_time
    cluster_labels[name] = labels
    
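    # Compute evaluation metrics. DBSCAN labels noise points as -1; noise is not a
    # cluster, so it is excluded from the cluster count and from the internal metrics.
    non_noise = labels != -1
    n_found = len(np.unique(labels[non_noise]))
    if n_found > 1:  # silhouette and Calinski-Harabasz need at least two clusters
        silhouette = silhouette_score(X_customers_scaled[non_noise], labels[non_noise])
        calinski = calinski_harabasz_score(X_customers_scaled[non_noise], labels[non_noise])
        
        # Compare with the known generating groups (possible only because the data is synthetic)
        ari = adjusted_rand_score(true_labels, labels)
        
        clustering_results[name] = {
            'silhouette_score': silhouette,
            'calinski_harabasz': calinski,
            'adjusted_rand_index': ari,
            'n_clusters': n_found,
            'clustering_time': clustering_time
        }
    else:
        clustering_results[name] = {
            'silhouette_score': -1,  # sentinel: no valid clustering found
            'calinski_harabasz': -1,
            'adjusted_rand_index': -1,
            'n_clusters': n_found,
            'clustering_time': clustering_time
        }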

# 显示结果
clustering_df = pd.DataFrame(clustering_results).T
print("\n聚类算法性能对比:")
print(clustering_df.round(4))

# 6.4 聚类结果可视化
print("\n=== 聚类结果可视化 ===")

# 为了可视化，我们使用PCA降维到2D
pca_viz = PCA(n_components=2, random_state=42)
X_customers_2d = pca_viz.fit_transform(X_customers_scaled)

fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 绘制不同算法的聚类结果
plot_configs = [
    ('真实分组', true_labels),
    ('K-Means', cluster_labels['K-Means']),
    ('Hierarchical', cluster_labels['Hierarchical']),
    ('DBSCAN', cluster_labels['DBSCAN']),
    ('Gaussian Mixture', cluster_labels['Gaussian Mixture']),
    ('Spectral Clustering', cluster_labels['Spectral Clustering'])
]

for idx, (title, labels) in enumerate(plot_configs):
    row, col = divmod(idx, 3)
    
    # Draw each cluster; noise points (label -1 from DBSCAN) are drawn as black crosses
    unique_labels = np.unique(labels)
    colors = plt.cm.Set1(np.linspace(0, 1, len(unique_labels)))
    
    for label, color in zip(unique_labels, colors):
        mask = labels == label
        if label == -1:
            axes[row, col].scatter(X_customers_2d[mask, 0], X_customers_2d[mask, 1], 
                                 c='black', marker='x', s=50, alpha=0.7, label='Noise')
        else:
            # color= (not c=) keeps matplotlib from reading the RGBA tuple as a value sequence
            axes[row, col].scatter(X_customers_2d[mask, 0], X_customers_2d[mask, 1], 
                                 color=color, alpha=0.7, s=50, label=f'Cluster {int(label)}')
    
    axes[row, col].set_title(f'{title}')
    axes[row, col].set_xlabel(f'PC1 ({pca_viz.explained_variance_ratio_[0]:.2%} variance)')
    axes[row, col].set_ylabel(f'PC2 ({pca_viz.explained_variance_ratio_[1]:.2%} variance)')
    axes[row, col].grid(True, alpha=0.3)
    axes[row, col].legend()

plt.tight_layout()
plt.show()

# 6.5 层次聚类树状图
print("\n=== 层次聚类分析 ===")

# 计算距离矩阵和链接矩阵
# 为了效率，只使用部分数据绘制树状图
sample_size = 50
sample_indices = np.random.choice(len(X_customers_scaled), sample_size, replace=False)
X_sample = X_customers_scaled[sample_indices]

# 计算链接矩阵
linkage_matrix = linkage(X_sample, method='ward')

# 绘制树状图
plt.figure(figsize=(15, 8))
dendrogram(linkage_matrix, orientation='top', distance_sort='descending')
plt.title('层次聚类树状图 (Ward链接)')
plt.xlabel('样本索引')
plt.ylabel('距离')
plt.grid(True, alpha=0.3)
plt.show()

# 6.6 聚类结果解释和业务应用
print("\n=== 聚类结果解释和业务应用 ===")

# 使用K-means结果进行客户细分分析
customer_data['聚类标签'] = kmeans_labels

# 各聚类的特征分析
cluster_analysis = customer_data.groupby('聚类标签').agg({
    '年收入(千)': ['mean', 'std'],
    '年消费(千)': ['mean', 'std'],
    '购买频率': ['mean', 'std'],
    '客户满意度': ['mean', 'std'],
    '推荐意愿': ['mean', 'std']
}).round(2)

print("各客户群体特征分析:")
print(cluster_analysis)

# 各聚类的大小
cluster_sizes = customer_data['聚类标签'].value_counts().sort_index()
print(f"\n各聚类大小:")
for cluster_id, size in cluster_sizes.items():
    percentage = size / len(customer_data) * 100
    print(f"聚类 {cluster_id}: {size} 客户 ({percentage:.1f}%)")

# 客户群体命名和特征描述
cluster_descriptions = {}
for cluster_id in range(optimal_k):
    cluster_data = customer_data[customer_data['聚类标签'] == cluster_id]
    avg_income = cluster_data['年收入(千)'].mean()
    avg_spending = cluster_data['年消费(千)'].mean()
    avg_frequency = cluster_data['购买频率'].mean()
    
    if avg_income > 60 and avg_spending > 80:
        cluster_descriptions[cluster_id] = {
            'name': '高价值客户',
            'description': '高收入、高消费、高频率购买',
            'strategy': '维护关系、提供VIP服务、推荐高端产品'
        }
    elif avg_income > 40 and avg_spending > 50:
        cluster_descriptions[cluster_id] = {
            'name': '潜力客户',
            'description': '中等收入和消费，有提升空间',
            'strategy': '推荐促销活动、提升购买频率、交叉销售'
        }
    else:
        cluster_descriptions[cluster_id] = {
            'name': '价格敏感客户',
            'description': '收入和消费较低，价格敏感',
            'strategy': '提供优惠活动、基础产品推荐、提升满意度'
        }

print(f"\n客户群体描述和营销策略:")
for cluster_id, desc in cluster_descriptions.items():
    print(f"\n聚类 {cluster_id} - {desc['name']}:")
    print(f"  特征: {desc['description']}")
    print(f"  营销策略: {desc['strategy']}")
    print(f"  客户数量: {cluster_sizes[cluster_id]}")

# 6.7 聚类算法特点总结
print(f"\n=== 聚类算法特点总结 ===")

algorithm_comparison = {
    'K-Means': {
        'advantages': ['简单快速', '适合球形聚类', '可扩展性好'],
        'disadvantages': ['需要预设K值', '对异常值敏感', '假设球形聚类'],
        'best_for': '大数据集、球形分布、已知聚类数'
    },
    'Hierarchical': {
        'advantages': ['不需要预设聚类数', '产生聚类层次', '确定性结果'],
        'disadvantages': ['计算复杂度高', '对噪声敏感', '难以处理大数据'],
        'best_for': '小数据集、需要聚类层次、探索性分析'
    },
    'DBSCAN': {
        'advantages': ['发现任意形状聚类', '自动检测噪声', '不需要预设聚类数'],
        'disadvantages': ['参数敏感', '密度差异大时效果差', '高维数据困难'],
        'best_for': '噪声数据、任意形状聚类、异常检测'
    },
    'Gaussian Mixture': {
        'advantages': ['软聚类(概率)', '处理椭圆形聚类', '模型选择灵活'],
        'disadvantages': ['需要预设组件数', '对初始化敏感', '计算复杂'],
        'best_for': '椭圆形聚类、需要概率输出、混合分布'
    }
}

for alg_name, characteristics in algorithm_comparison.items():
    if alg_name in clustering_results:
        result = clustering_results[alg_name]
        print(f"\n{alg_name}:")
        print(f"  轮廓系数: {result['silhouette_score']:.4f}")
        print(f"  优点: {', '.join(characteristics['advantages'])}")
        print(f"  缺点: {', '.join(characteristics['disadvantages'])}")
        print(f"  适用场景: {characteristics['best_for']}")

# 最佳聚类算法推荐
best_clustering = max(clustering_results.keys(), 
                     key=lambda x: clustering_results[x]['silhouette_score'])
print(f"\n🏆 本案例最佳聚类算法: {best_clustering}")
print(f"   轮廓系数: {clustering_results[best_clustering]['silhouette_score']:.4f}")

print(f"\n聚类分析关键要点:")
print(f"✓ 选择合适的聚类数量很重要（肘部法则、轮廓系数）")
print(f"✓ 数据预处理（标准化）对聚类结果有重大影响")
print(f"✓ 不同算法适用于不同的数据分布和业务场景")
print(f"✓ 聚类结果需要结合业务知识进行解释和应用")
print(f"✓ 评估指标要结合多个维度进行综合判断")
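
The point about standardization can be checked on a small synthetic example. The sketch below is an illustration under stated assumptions (two informative features, one large-scale noise feature, all made up here), not a reproduction of the customer data: K-means is run with and without `StandardScaler`, and the adjusted Rand index against the known groups is compared.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Two features separate the groups; a third is pure noise on a much larger scale
groups = np.repeat([0, 1], n // 2)
informative_a = rng.normal(loc=groups * 10.0, scale=1.0)
informative_b = rng.normal(loc=groups * 10.0, scale=1.0)
noise = rng.normal(loc=0.0, scale=1000.0, size=n)
X = np.column_stack([informative_a, informative_b, noise])

def kmeans_ari(data):
    """Fit 2-cluster K-means and score the result against the known groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(data)
    return adjusted_rand_score(groups, labels)

ari_raw = kmeans_ari(X)                                    # noise feature dominates distances
ari_scaled = kmeans_ari(StandardScaler().fit_transform(X)) # all features on a comparable scale

print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling:    {ari_scaled:.3f}")
```

Without scaling, Euclidean distances are driven almost entirely by the large-scale noise column, so the clusters found are unrelated to the real groups.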

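## 7. Dimensionality Reduction in Depth

Dimensionality reduction is a key technique for high-dimensional data: it lowers computational cost, mitigates the curse of dimensionality, and makes visualization easier. This section covers classic techniques such as PCA and t-SNE, followed by feature selection.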
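
As a rough illustration of what PCA computes, the sketch below derives the principal directions from an SVD of the centered data and compares the result with scikit-learn's `PCA`. The random data is made up for the example, and `svd_solver='full'` is chosen so that the comparison is exact rather than approximate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data: 200 samples, 5 features (illustrative only)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# PCA by hand: center the data, then take the singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = S ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two principal directions (the first two rows of Vt)
X_proj = X_centered @ Vt[:2].T

pca = PCA(n_components=2, svd_solver='full').fit(X)
assert np.allclose(explained_ratio[:2], pca.explained_variance_ratio_)
# Component signs are arbitrary, so the projections are compared by absolute value
assert np.allclose(np.abs(X_proj), np.abs(pca.transform(X)))

print("Explained variance ratio:", np.round(explained_ratio, 3))
```

The rows of `Vt` play the role of `pca.components_`, and the squared singular values give the variance captured along each direction.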

In [None]:
# 7.1 主成分分析 (PCA) 详解
print("=== 主成分分析 (PCA) 详解 ===")

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif, mutual_info_classif

# 使用MNIST数据集进行降维演示
# 为了计算效率，使用子集
n_samples_dim = 2000
indices_dim = np.random.choice(len(X_mnist), n_samples_dim, replace=False)
X_mnist_dim = X_mnist[indices_dim] / 255.0  # scale pixel values to [0, 1] (min-max normalization, not standardization)
y_mnist_dim = y_mnist[indices_dim]

print(f"降维数据集信息:")
print(f"原始维度: {X_mnist_dim.shape}")
print(f"类别数量: {len(np.unique(y_mnist_dim))}")

# PCA分析
print("\n1. PCA主成分分析:")

# 计算所有主成分
pca_full = PCA()
pca_full.fit(X_mnist_dim)

# 累积解释方差比
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# 可视化PCA结果
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. 解释方差比
axes[0, 0].plot(range(1, 51), pca_full.explained_variance_ratio_[:50], 'bo-', markersize=4)
axes[0, 0].set_xlabel('主成分')
axes[0, 0].set_ylabel('解释方差比')
axes[0, 0].set_title('前50个主成分的解释方差比')
axes[0, 0].grid(True, alpha=0.3)

# 2. 累积解释方差比
axes[0, 1].plot(range(1, 101), cumulative_variance[:100], 'ro-', markersize=3)
axes[0, 1].axhline(y=0.95, color='green', linestyle='--', label='95%方差')
axes[0, 1].axhline(y=0.99, color='orange', linestyle='--', label='99%方差')
axes[0, 1].set_xlabel('主成分数量')
axes[0, 1].set_ylabel('累积解释方差比')
axes[0, 1].set_title('累积解释方差比')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 找到保留95%和99%方差的主成分数量
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
n_components_99 = np.argmax(cumulative_variance >= 0.99) + 1

print(f"保留95%方差需要的主成分数量: {n_components_95}")
print(f"保留99%方差需要的主成分数量: {n_components_99}")

# 3. 使用不同数量的主成分进行降维
pca_components = [2, 10, 50, 100, 200]
reconstruction_errors = []

for n_comp in pca_components:
    pca = PCA(n_components=n_comp, random_state=42)  # fixed seed in case the randomized SVD solver is selected
    X_transformed = pca.fit_transform(X_mnist_dim)
    X_reconstructed = pca.inverse_transform(X_transformed)
    
    # 计算重构误差
    error = np.mean((X_mnist_dim - X_reconstructed) ** 2)
    reconstruction_errors.append(error)

# 重构误差图
axes[0, 2].plot(pca_components, reconstruction_errors, 'go-')
axes[0, 2].set_xlabel('主成分数量')
axes[0, 2].set_ylabel('重构误差 (MSE)')
axes[0, 2].set_title('主成分数量 vs 重构误差')
axes[0, 2].grid(True, alpha=0.3)

# 4. 2D PCA可视化
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_mnist_dim)

scatter = axes[1, 0].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_mnist_dim, cmap='tab10', alpha=0.7)
axes[1, 0].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%} variance)')
axes[1, 0].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%} variance)')
axes[1, 0].set_title('PCA 2D可视化')
plt.colorbar(scatter, ax=axes[1, 0])

# 5. Visualize the first two principal components as 28x28 images
pca_viz = PCA(n_components=2, random_state=42)
pca_viz.fit(X_mnist_dim)

for i in range(2):
    component = pca_viz.components_[i].reshape(28, 28)
    axes[1, i + 1].imshow(component, cmap='coolwarm')
    axes[1, i + 1].set_title(f'Principal component {i+1}')
    axes[1, i + 1].axis('off')

plt.tight_layout()
plt.show()

# 重构效果展示
print("\n2. PCA重构效果展示:")

# Reconstruct three sample digits with an increasing number of components
fig, axes = plt.subplots(3, 6, figsize=(18, 9))

sample_rows = [0, 100, 200]  # indices of the digits shown in rows 0, 1 and 2
components_to_test = [5, 10, 25, 50, 100]

# Column 0: original images
for row, sample_idx in enumerate(sample_rows):
    axes[row, 0].imshow(X_mnist_dim[sample_idx].reshape(28, 28), cmap='gray')
    axes[row, 0].set_title(f'Original (digit {y_mnist_dim[sample_idx]})')
    axes[row, 0].axis('off')

# Columns 1-5: fit each PCA once and reuse the reconstruction for all three rows
for i, n_comp in enumerate(components_to_test):
    pca_recon = PCA(n_components=n_comp, random_state=42)
    X_recon = pca_recon.inverse_transform(pca_recon.fit_transform(X_mnist_dim))
    
    for row, sample_idx in enumerate(sample_rows):
        axes[row, i+1].imshow(X_recon[sample_idx].reshape(28, 28), cmap='gray')
        axes[row, i+1].axis('off')
    axes[0, i+1].set_title(f'{n_comp} components')

plt.suptitle('PCA重构效果对比', fontsize=16)
plt.tight_layout()
plt.show()

# 7.2 t-SNE非线性降维
print("\n=== t-SNE非线性降维 ===")

# 为了计算效率，使用更小的子集
n_samples_tsne = 1000
indices_tsne = np.random.choice(len(X_mnist_dim), n_samples_tsne, replace=False)
X_tsne_input = X_mnist_dim[indices_tsne]
y_tsne_input = y_mnist_dim[indices_tsne]

print("正在计算t-SNE降维...")
print("注意：t-SNE计算时间较长，请耐心等待...")

# 不同perplexity参数的t-SNE
perplexities = [5, 30, 50]
tsne_results = {}

for perp in perplexities:
    print(f"计算 perplexity={perp} 的t-SNE...")
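    # The iteration-count argument was renamed from n_iter to max_iter in scikit-learn 1.5,
    # so the default (1000 iterations) is used here to stay version-independent
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)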
    X_tsne = tsne.fit_transform(X_tsne_input)
    tsne_results[perp] = X_tsne

# 比较PCA和t-SNE结果
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# PCA 2D结果（从之前计算的结果中取子集）
pca_subset = X_pca_2d[indices_tsne]
scatter = axes[0, 0].scatter(pca_subset[:, 0], pca_subset[:, 1], c=y_tsne_input, cmap='tab10', alpha=0.7)
axes[0, 0].set_title('PCA 2D')
axes[0, 0].set_xlabel('PC1')
axes[0, 0].set_ylabel('PC2')
plt.colorbar(scatter, ax=axes[0, 0])

# 不同perplexity的t-SNE结果
for idx, perp in enumerate(perplexities):
    row, col = divmod(idx + 1, 2)
    X_tsne = tsne_results[perp]
    
    scatter = axes[row, col].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_tsne_input, cmap='tab10', alpha=0.7)
    axes[row, col].set_title(f't-SNE (perplexity={perp})')
    axes[row, col].set_xlabel('t-SNE 1')
    axes[row, col].set_ylabel('t-SNE 2')
    plt.colorbar(scatter, ax=axes[row, col])

plt.tight_layout()
plt.show()

# 7.3 特征选择技术
print("\n=== 特征选择技术 ===")

# 使用完整的MNIST子集进行特征选择
print("应用各种特征选择方法...")

# 1. 方差阈值选择
from sklearn.feature_selection import VarianceThreshold

var_selector = VarianceThreshold(threshold=0.1)
X_var_selected = var_selector.fit_transform(X_mnist_dim)

print(f"方差阈值选择:")
print(f"原始特征数: {X_mnist_dim.shape[1]}")
print(f"选择后特征数: {X_var_selected.shape[1]}")
print(f"保留特征比例: {X_var_selected.shape[1]/X_mnist_dim.shape[1]:.2%}")

# 2. 单变量特征选择
univariate_selectors = {
    'chi2': SelectKBest(score_func=chi2, k=100),
    'f_classif': SelectKBest(score_func=f_classif, k=100),
    'mutual_info': SelectKBest(score_func=mutual_info_classif, k=100)
}

feature_selection_results = {}

for name, selector in univariate_selectors.items():
    print(f"\n{name} 特征选择:")
    X_selected = selector.fit_transform(X_mnist_dim, y_mnist_dim)
    
    # 获取特征得分
    scores = selector.scores_
    selected_features = selector.get_support()
    
    feature_selection_results[name] = {
        'X_selected': X_selected,
        'scores': scores,
        'selected_features': selected_features,
        'n_features': X_selected.shape[1]
    }
    
    print(f"选择的特征数: {X_selected.shape[1]}")
    print(f"Mean feature score: {np.nanmean(scores):.2f}")  # nanmean: constant pixels can get NaN scores

# 3. 递归特征消除 (RFE)
print(f"\n递归特征消除 (RFE):")
from sklearn.feature_selection import RFE

# 使用逻辑回归作为基估计器
estimator = LogisticRegression(random_state=42, max_iter=1000)
rfe_selector = RFE(estimator=estimator, n_features_to_select=100, step=50)

X_rfe_selected = rfe_selector.fit_transform(X_mnist_dim, y_mnist_dim)
print(f"RFE选择的特征数: {X_rfe_selected.shape[1]}")

# 7.4 特征选择效果评估
print("\n=== 特征选择效果评估 ===")

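# Compare how the different feature sets affect classification performance.
# Note: the selectors and PCA used here are fitted on the full subset before cross-validation,
# so these scores are optimistically biased; wrapping each selector in a Pipeline avoids this.
from sklearn.model_selection import cross_val_score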

feature_sets = {
    '原始特征': X_mnist_dim,
    '方差阈值': X_var_selected,
    'Chi2选择': feature_selection_results['chi2']['X_selected'],
    'F分类选择': feature_selection_results['f_classif']['X_selected'],
    'RFE选择': X_rfe_selected,
    'PCA (100维)': PCA(n_components=100).fit_transform(X_mnist_dim)
}

# 使用逻辑回归进行交叉验证
classifier = LogisticRegression(random_state=42, max_iter=1000)
cv_results = {}

print("评估不同特征选择方法的分类性能...")
for name, X_features in feature_sets.items():
    print(f"评估 {name}...")
    scores = cross_val_score(classifier, X_features, y_mnist_dim, cv=3, scoring='accuracy')
    cv_results[name] = {
        'mean_accuracy': scores.mean(),
        'std_accuracy': scores.std(),
        'n_features': X_features.shape[1]
    }

# 可视化特征选择效果
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. 准确率对比
methods = list(cv_results.keys())
accuracies = [cv_results[method]['mean_accuracy'] for method in methods]
errors = [cv_results[method]['std_accuracy'] for method in methods]

axes[0, 0].bar(range(len(methods)), accuracies, yerr=errors, capsize=5)
axes[0, 0].set_title('不同特征选择方法的分类准确率')
axes[0, 0].set_ylabel('准确率')
axes[0, 0].set_xticks(range(len(methods)))
axes[0, 0].set_xticklabels(methods, rotation=45, ha='right')
axes[0, 0].grid(True, alpha=0.3)

# 添加数值标签
for i, (acc, err) in enumerate(zip(accuracies, errors)):
    axes[0, 0].text(i, acc + err + 0.005, f'{acc:.3f}±{err:.3f}', 
                    ha='center', va='bottom', fontsize=9)

# 2. 特征数量对比
n_features = [cv_results[method]['n_features'] for method in methods]
axes[0, 1].bar(range(len(methods)), n_features, color='orange')
axes[0, 1].set_title('特征数量对比')
axes[0, 1].set_ylabel('特征数量')
axes[0, 1].set_xticks(range(len(methods)))
axes[0, 1].set_xticklabels(methods, rotation=45, ha='right')
axes[0, 1].set_yscale('log')
axes[0, 1].grid(True, alpha=0.3)

# 3. 准确率vs特征数量
axes[1, 0].scatter(n_features, accuracies, s=100, alpha=0.7)
for i, method in enumerate(methods):
    axes[1, 0].annotate(method, (n_features[i], accuracies[i]), 
                       xytext=(5, 5), textcoords='offset points', fontsize=8)
axes[1, 0].set_xlabel('特征数量')
axes[1, 0].set_ylabel('准确率')
axes[1, 0].set_title('准确率 vs 特征数量')
axes[1, 0].set_xscale('log')
axes[1, 0].grid(True, alpha=0.3)

# 4. Feature-importance heatmap (chi2 scores); NaN scores of all-zero pixels are shown as 0
chi2_scores = np.nan_to_num(feature_selection_results['chi2']['scores'])
feature_importance_image = chi2_scores.reshape(28, 28)
im = axes[1, 1].imshow(feature_importance_image, cmap='hot', interpolation='nearest')
axes[1, 1].set_title('Chi2特征重要性热力图')
axes[1, 1].axis('off')
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()

# 7.5 降维技术总结
print("\n=== 降维技术总结 ===")

dimensionality_reduction_summary = {
    'PCA': {
        'type': '线性降维',
        'advantages': ['快速计算', '可解释性', '保留最大方差', '可逆变换'],
        'disadvantages': ['线性假设', '可能丢失非线性结构'],
        'best_for': '数据预处理、噪声减少、压缩',
        'parameters': '主成分数量'
    },
    't-SNE': {
        'type': '非线性降维',
        'advantages': ['保留局部结构', '可视化效果好', '发现非线性模式'],
        'disadvantages': ['计算复杂', '不可逆', '对参数敏感'],
        'best_for': '数据可视化、聚类分析',
        'parameters': 'perplexity, learning_rate'
    },
    '方差阈值': {
        'type': '特征选择',
        'advantages': ['简单快速', '移除无用特征', '减少维度'],
        'disadvantages': ['可能移除有用特征', '不考虑目标变量'],
        'best_for': '预处理步骤、移除常数特征',
        'parameters': '方差阈值'
    },
    '单变量选择': {
        'type': '特征选择',
        'advantages': ['考虑目标变量', '统计显著性', '可解释'],
        'disadvantages': ['忽略特征交互', '可能选择冗余特征'],
        'best_for': '初步特征筛选、理解特征重要性',
        'parameters': '评分函数、特征数量'
    }
}

print("降维技术对比:")
for technique, info in dimensionality_reduction_summary.items():
    print(f"\n{technique} ({info['type']}):")
    print(f"  优点: {', '.join(info['advantages'])}")
    print(f"  缺点: {', '.join(info['disadvantages'])}")
    print(f"  适用场景: {info['best_for']}")
    print(f"  关键参数: {info['parameters']}")

# 性能总结
print(f"\n本案例性能总结:")
best_method = max(cv_results.keys(), key=lambda x: cv_results[x]['mean_accuracy'])
most_efficient = min(cv_results.keys(), key=lambda x: cv_results[x]['n_features'])

print(f"🏆 最佳准确率: {best_method} ({cv_results[best_method]['mean_accuracy']:.4f})")
print(f"⚡ 最少特征: {most_efficient} ({cv_results[most_efficient]['n_features']} 特征)")

print(f"\n降维技术选择建议:")
print(f"• 数据预处理: 使用PCA或方差阈值")
print(f"• 可视化分析: 使用t-SNE或PCA")
print(f"• 特征选择: 结合单变量选择和RFE")
print(f"• 计算效率: 优先考虑PCA和方差阈值")
print(f"• 可解释性: 选择PCA或单变量特征选择")
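
When feature selection is combined with cross-validation, one common way to keep validation samples from influencing which features are kept is to place the selector inside a `Pipeline`. The following is a minimal sketch under stated assumptions: it uses the small bundled `load_digits` data instead of the MNIST subset, and `k=30` is an arbitrary illustrative value.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Bundled 8x8 digits data (no download); pixel values are non-negative, as chi2 requires
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]

# The selector is refit on the training part of every fold,
# so validation samples never influence which features are kept
pipe = Pipeline([
    ('select', SelectKBest(score_func=chi2, k=30)),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=3, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The same pattern applies to `PCA`, `VarianceThreshold`, or `RFE`: any step that learns from the data belongs inside the pipeline that is cross-validated.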

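## 8. Model Selection and Evaluation

Model selection and evaluation are central to any machine learning workflow. This section covers cross-validation, grid search, and performance metrics, the key techniques for checking that a model is reliable and generalizes to unseen data.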
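
Grid search and cross-validation are typically used together. The sketch below is an illustration only: the iris dataset, the SVC model, and the parameter grid are assumptions chosen to keep the example fast, not settings taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # small bundled dataset, used only for illustration

# Scaling lives inside the pipeline so it is refit on each training fold
model = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': ['scale', 0.1]}  # illustrative grid

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(model, param_grid, cv=cv, scoring='accuracy')
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Each of the six parameter combinations is scored with 5-fold cross-validation, and `best_score_` is the mean validation accuracy of the winning combination. A separate held-out test set is still needed for an unbiased final estimate.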

In [None]:
# 8.1 交叉验证详解
print("=== 交叉验证详解 ===")

from sklearn.model_selection import (
    KFold, StratifiedKFold, LeaveOneOut, ShuffleSplit,
    cross_val_score, cross_validate, validation_curve, learning_curve
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve,
    make_scorer
)

# 使用之前的MNIST子集数据
print(f"使用数据集: {X_mnist_dim.shape}")

# 准备一个基础分类器
base_classifier = LogisticRegression(random_state=42, max_iter=1000)

# 8.1.1 不同交叉验证策略
print("\n1. 不同交叉验证策略对比:")

cv_strategies = {
    'KFold(5)': KFold(n_splits=5, shuffle=True, random_state=42),
    'StratifiedKFold(5)': StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    'KFold(10)': KFold(n_splits=10, shuffle=True, random_state=42),
    'ShuffleSplit': ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
}

cv_results = {}
for name, cv_strategy in cv_strategies.items():
    print(f"  评估 {name}...")
    scores = cross_val_score(base_classifier, X_mnist_dim, y_mnist_dim, 
                           cv=cv_strategy, scoring='accuracy')
    cv_results[name] = {
        'scores': scores,
        'mean': scores.mean(),
        'std': scores.std()
    }

# 可视化交叉验证结果
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# 1. 交叉验证得分对比
strategies = list(cv_results.keys())
means = [cv_results[s]['mean'] for s in strategies]
stds = [cv_results[s]['std'] for s in strategies]

axes[0, 0].bar(range(len(strategies)), means, yerr=stds, capsize=5)
axes[0, 0].set_title('不同交叉验证策略的准确率对比')
axes[0, 0].set_ylabel('准确率')
axes[0, 0].set_xticks(range(len(strategies)))
axes[0, 0].set_xticklabels(strategies, rotation=45, ha='right')
axes[0, 0].grid(True, alpha=0.3)

for i, (mean, std) in enumerate(zip(means, stds)):
    axes[0, 0].text(i, mean + std + 0.005, f'{mean:.3f}±{std:.3f}', 
                    ha='center', va='bottom', fontsize=9)

# 2. 得分分布箱型图
score_data = [cv_results[s]['scores'] for s in strategies]
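axes[0, 1].boxplot(score_data)
axes[0, 1].set_xticklabels(strategies)  # set separately: the boxplot labels= keyword is deprecated in newer matplotlib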
axes[0, 1].set_title('交叉验证得分分布')
axes[0, 1].set_ylabel('准确率')
axes[0, 1].tick_params(axis='x', rotation=45)
axes[0, 1].grid(True, alpha=0.3)

# 8.1.2 多指标交叉验证
print("\n2. 多指标交叉验证:")

# 定义多个评估指标
scoring_metrics = {
    'accuracy': 'accuracy',
    'precision_macro': 'precision_macro',
    'recall_macro': 'recall_macro',
    'f1_macro': 'f1_macro'
}

# 使用StratifiedKFold进行多指标评估
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
multi_metric_results = cross_validate(
    base_classifier, X_mnist_dim, y_mnist_dim,
    cv=skf, scoring=scoring_metrics, return_train_score=True
)

# 3. 训练集vs验证集性能对比
train_test_comparison = {}
for metric in scoring_metrics.keys():
    train_scores = multi_metric_results[f'train_{metric}']
    test_scores = multi_metric_results[f'test_{metric}']
    
    train_test_comparison[metric] = {
        'train_mean': train_scores.mean(),
        'train_std': train_scores.std(),
        'test_mean': test_scores.mean(),
        'test_std': test_scores.std(),
        'overfitting': train_scores.mean() - test_scores.mean()
    }

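# Visualize training vs. validation performance (scores come from the cross-validation folds, not a held-out test set)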
metrics = list(train_test_comparison.keys())
train_means = [train_test_comparison[m]['train_mean'] for m in metrics]
test_means = [train_test_comparison[m]['test_mean'] for m in metrics]
train_stds = [train_test_comparison[m]['train_std'] for m in metrics]
test_stds = [train_test_comparison[m]['test_std'] for m in metrics]

x = np.arange(len(metrics))
width = 0.35

axes[1, 0].bar(x - width/2, train_means, width, yerr=train_stds, label='训练集', alpha=0.8)
axes[1, 0].bar(x + width/2, test_means, width, yerr=test_stds, label='验证集', alpha=0.8)
axes[1, 0].set_title('训练集 vs 验证集性能对比')
axes[1, 0].set_ylabel('得分')
axes[1, 0].set_xticks(x)
axes[1, 0].set_xticklabels(metrics)
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. 过拟合程度分析
overfitting_scores = [train_test_comparison[m]['overfitting'] for m in metrics]
bars = axes[1, 1].bar(range(len(metrics)), overfitting_scores)
axes[1, 1].set_title('过拟合程度分析 (训练-验证)')
axes[1, 1].set_ylabel('性能差异')
axes[1, 1].set_xticks(range(len(metrics)))
axes[1, 1].set_xticklabels(metrics)
axes[1, 1].axhline(y=0, color='red', linestyle='--', alpha=0.7)
axes[1, 1].grid(True, alpha=0.3)

# 标记过拟合程度
for i, score in enumerate(overfitting_scores):
    color = 'red' if score > 0.05 else 'green'
    axes[1, 1].text(i, score + 0.001, f'{score:.3f}', 
                    ha='center', va='bottom', color=color, fontweight='bold')

plt.tight_layout()
plt.show()

print("交叉验证结果分析:")
for strategy, result in cv_results.items():
    print(f"{strategy}: {result['mean']:.4f} ± {result['std']:.4f}")

print("\n多指标评估结果:")
for metric, result in train_test_comparison.items():
    print(f"{metric}:")
    print(f"  训练集: {result['train_mean']:.4f} ± {result['train_std']:.4f}")
    print(f"  验证集: {result['test_mean']:.4f} ± {result['test_std']:.4f}")
    print(f"  过拟合程度: {result['overfitting']:.4f}")

# 8.2 网格搜索和随机搜索
print("\n=== 网格搜索和随机搜索 ===")

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint, uniform

# 为了演示效果，我们使用随机森林分类器
print("1. 网格搜索超参数优化:")

# 定义参数网格
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 创建随机森林分类器
rf_classifier = RandomForestClassifier(random_state=42)

# 网格搜索
print("正在进行网格搜索...")
grid_search = GridSearchCV(
    rf_classifier, param_grid_rf, 
    cv=3, scoring='accuracy', 
    n_jobs=-1, verbose=1
)

# 使用较小的数据集以节省时间
indices_grid = np.random.choice(len(X_mnist_dim), 1000, replace=False)
X_grid = X_mnist_dim[indices_grid]
y_grid = y_mnist_dim[indices_grid]

start_time = time.time()
grid_search.fit(X_grid, y_grid)
grid_time = time.time() - start_time

print(f"网格搜索完成，耗时: {grid_time:.2f}秒")
print(f"最佳参数: {grid_search.best_params_}")
print(f"最佳得分: {grid_search.best_score_:.4f}")

# 2. 随机搜索
print("\n2. 随机搜索超参数优化:")

# 定义参数分布
param_dist_rf = {
    'n_estimators': randint(50, 300),
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

# 随机搜索
random_search = RandomizedSearchCV(
    rf_classifier, param_dist_rf,
    n_iter=20, cv=3, scoring='accuracy',
    random_state=42, n_jobs=-1
)

print("正在进行随机搜索...")
start_time = time.time()
random_search.fit(X_grid, y_grid)
random_time = time.time() - start_time

print(f"随机搜索完成，耗时: {random_time:.2f}秒")
print(f"最佳参数: {random_search.best_params_}")
print(f"最佳得分: {random_search.best_score_:.4f}")

# 8.3 学习曲线分析
print("\n=== 学习曲线分析 ===")

# 计算学习曲线
print("计算学习曲线...")
train_sizes = np.linspace(0.1, 1.0, 10)

# 使用最优的随机森林模型
best_rf = random_search.best_estimator_

train_sizes_abs, train_scores_lc, val_scores_lc = learning_curve(
    best_rf, X_grid, y_grid,
    train_sizes=train_sizes, cv=3,
    scoring='accuracy', n_jobs=-1
)

# 计算验证曲线
print("计算验证曲线...")
param_name = 'n_estimators'
param_range = [10, 50, 100, 150, 200, 250, 300]

train_scores_vc, val_scores_vc = validation_curve(
    RandomForestClassifier(random_state=42), X_grid, y_grid,
    param_name=param_name, param_range=param_range,
    cv=3, scoring='accuracy', n_jobs=-1
)

# 可视化学习曲线和验证曲线
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. 学习曲线
train_mean_lc = np.mean(train_scores_lc, axis=1)
train_std_lc = np.std(train_scores_lc, axis=1)
val_mean_lc = np.mean(val_scores_lc, axis=1)
val_std_lc = np.std(val_scores_lc, axis=1)

axes[0, 0].plot(train_sizes_abs, train_mean_lc, 'o-', color='blue', label='训练集')
axes[0, 0].fill_between(train_sizes_abs, train_mean_lc - train_std_lc, 
                       train_mean_lc + train_std_lc, alpha=0.1, color='blue')

axes[0, 0].plot(train_sizes_abs, val_mean_lc, 'o-', color='red', label='验证集')
axes[0, 0].fill_between(train_sizes_abs, val_mean_lc - val_std_lc, 
                       val_mean_lc + val_std_lc, alpha=0.1, color='red')

axes[0, 0].set_xlabel('训练集大小')
axes[0, 0].set_ylabel('准确率')
axes[0, 0].set_title('学习曲线')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. 验证曲线
train_mean_vc = np.mean(train_scores_vc, axis=1)
train_std_vc = np.std(train_scores_vc, axis=1)
val_mean_vc = np.mean(val_scores_vc, axis=1)
val_std_vc = np.std(val_scores_vc, axis=1)

axes[0, 1].plot(param_range, train_mean_vc, 'o-', color='blue', label='训练集')
axes[0, 1].fill_between(param_range, train_mean_vc - train_std_vc, 
                       train_mean_vc + train_std_vc, alpha=0.1, color='blue')

axes[0, 1].plot(param_range, val_mean_vc, 'o-', color='red', label='验证集')
axes[0, 1].fill_between(param_range, val_mean_vc - val_std_vc, 
                       val_mean_vc + val_std_vc, alpha=0.1, color='red')

axes[0, 1].set_xlabel(f'{param_name}')
axes[0, 1].set_ylabel('准确率')
axes[0, 1].set_title('验证曲线')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. 网格搜索vs随机搜索效率对比
search_comparison = {
    '网格搜索': {
        'time': grid_time,
        'best_score': grid_search.best_score_,
        'n_combinations': len(grid_search.cv_results_['params'])
    },
    '随机搜索': {
        'time': random_time,
        'best_score': random_search.best_score_,
        'n_combinations': 20
    }
}

methods = list(search_comparison.keys())
times = [search_comparison[m]['time'] for m in methods]
scores = [search_comparison[m]['best_score'] for m in methods]

# 时间对比
axes[1, 0].bar(methods, times, color=['blue', 'orange'])
axes[1, 0].set_title('搜索时间对比')
axes[1, 0].set_ylabel('时间 (秒)')
for i, time_val in enumerate(times):
    axes[1, 0].text(i, time_val + 0.1, f'{time_val:.1f}s', ha='center', va='bottom')

# 得分对比
axes[1, 1].bar(methods, scores, color=['blue', 'orange'])
axes[1, 1].set_title('最佳得分对比')
axes[1, 1].set_ylabel('准确率')
for i, score in enumerate(scores):
    axes[1, 1].text(i, score + 0.001, f'{score:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# 8.4 详细的模型评估指标
print("\n=== 详细的模型评估指标 ===")

# 使用最佳模型进行详细评估
best_model = random_search.best_estimator_
y_pred_detailed = best_model.predict(X_test_mnist)
y_pred_proba = best_model.predict_proba(X_test_mnist)

# 计算各种评估指标
evaluation_metrics = {
    'accuracy': accuracy_score(y_test_mnist, y_pred_detailed),
    'precision_macro': precision_score(y_test_mnist, y_pred_detailed, average='macro'),
    'precision_micro': precision_score(y_test_mnist, y_pred_detailed, average='micro'),
    'precision_weighted': precision_score(y_test_mnist, y_pred_detailed, average='weighted'),
    'recall_macro': recall_score(y_test_mnist, y_pred_detailed, average='macro'),
    'recall_micro': recall_score(y_test_mnist, y_pred_detailed, average='micro'),
    'recall_weighted': recall_score(y_test_mnist, y_pred_detailed, average='weighted'),
    'f1_macro': f1_score(y_test_mnist, y_pred_detailed, average='macro'),
    'f1_micro': f1_score(y_test_mnist, y_pred_detailed, average='micro'),
    'f1_weighted': f1_score(y_test_mnist, y_pred_detailed, average='weighted')
}

print("详细评估指标:")
for metric, value in evaluation_metrics.items():
    print(f"{metric}: {value:.4f}")

# 分类报告
print(f"\n分类报告:")
print(classification_report(y_test_mnist, y_pred_detailed))

# 8.5 模型选择最佳实践总结
print("\n=== 模型选择最佳实践总结 ===")

best_practices = {
    '数据分割': [
        '使用分层抽样保证类别平衡',
        '留出独立的测试集',
        '训练集用于模型训练，验证集用于模型选择'
    ],
    '交叉验证': [
        '分类问题使用StratifiedKFold',
        '小数据集使用LeaveOneOut',
        '大数据集使用ShuffleSplit提高效率'
    ],
    '超参数优化': [
        '先使用随机搜索快速定位区域',
        '再使用网格搜索精细优化',
        '考虑使用贝叶斯优化（hyperopt等）'
    ],
    '模型评估': [
        '使用多个评估指标综合判断',
        '关注训练集和验证集的性能差异',
        '绘制学习曲线诊断过拟合/欠拟合'
    ],
    '模型选择': [
        '简单模型优于复杂模型（奥卡姆剃刀）',
        '考虑模型的可解释性需求',
        '平衡性能、训练时间和预测时间'
    ]
}

print("模型选择最佳实践:")
for category, practices in best_practices.items():
    print(f"\n{category}:")
    for practice in practices:
        print(f"  • {practice}")

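# Explanation of common evaluation metrics
print("\nEvaluation metrics explained:")
metric_explanations = {
    'Accuracy': 'Share of correctly classified samples; most informative when classes are balanced',
    'Precision': 'Share of predicted positives that are truly positive; sensitive to false positives',
    'Recall': 'Share of actual positives that are correctly predicted; sensitive to false negatives',
    'F1-Score': 'Harmonic mean of precision and recall',
    'Macro average': 'Unweighted mean of the per-class metrics; every class counts equally',
    'Micro average': 'Metric computed globally over all samples; every sample counts equally',
    'Weighted average': 'Mean of the per-class metrics weighted by class support; large classes dominate on imbalanced data'
}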

for metric, explanation in metric_explanations.items():
    print(f"• {metric}: {explanation}")

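print("\nSummary of this case study:")
print(f"✓ Compared {len(cv_strategies)} cross-validation strategies")
print(f"✓ Grid search evaluated {search_comparison['网格搜索']['n_combinations']} parameter combinations")
print(f"✓ Random search took {random_time:.1f} s and reached a best CV score of "
      f"{random_search.best_score_:.4f} (grid search: {grid_search.best_score_:.4f})")
print(f"✓ Test-set accuracy of the final model: {evaluation_metrics['accuracy']:.4f}")
# Report the measured gap rather than asserting good generalization unconditionally
print(f"✓ Learning curve: train/validation accuracy gap at the largest training size = "
      f"{train_mean_lc[-1] - val_mean_lc[-1]:.4f}")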
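
The best-practice list in this section suggests locating a promising region with random search and then refining it with grid search. The sketch below illustrates that two-stage idea. It makes several assumptions and is not the notebook's own procedure: it uses scikit-learn's small built-in `load_digits` dataset instead of MNIST so that it runs in seconds, and the search ranges, the refinement windows (±2 for `max_depth`, ±1 for `min_samples_split`), and the names `coarse` and `fine` are chosen only for this example.

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Assumption: the 8x8 digits dataset stands in for MNIST to keep the run short
X, y = load_digits(return_X_y=True)

# Stage 1: coarse random search over wide integer ranges
coarse = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_distributions={
        "max_depth": randint(3, 20),
        "min_samples_split": randint(2, 20),
    },
    n_iter=8, cv=3, scoring="accuracy", random_state=42,
)
coarse.fit(X, y)
best_depth = int(coarse.best_params_["max_depth"])
best_split = int(coarse.best_params_["min_samples_split"])

# Stage 2: small grid centered on the coarse optimum (the center point is included)
fine_grid = {
    "max_depth": sorted({max(2, best_depth - 2), best_depth, best_depth + 2}),
    "min_samples_split": sorted({max(2, best_split - 1), best_split, best_split + 1}),
}
fine = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    fine_grid, cv=3, scoring="accuracy",
)
fine.fit(X, y)

print(f"Coarse search: {coarse.best_params_}, CV accuracy {coarse.best_score_:.4f}")
print(f"Fine search:   {fine.best_params_}, CV accuracy {fine.best_score_:.4f}")
```

Because the fine grid contains the coarse optimum and both stages use the same estimator settings and the same deterministic folds, the refined score cannot be lower than the coarse one. Whether the gain is worth the extra fits depends on the problem.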

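## 9. Comprehensive Project Case Study: A Complete Machine Learning Workflow

This section combines the techniques from the previous sections into one complete machine learning project. It follows the path of a real project, from data exploration through to deployment preparation.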
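
One common way to keep such an end-to-end workflow free of data leakage is to wrap the preprocessing steps and the classifier in a single `Pipeline`, so that cross-validation refits the scaler on each training fold only. The following is a minimal sketch under stated assumptions: it uses the built-in `load_digits` dataset rather than MNIST, a `LogisticRegression` classifier chosen only for speed, and the illustrative names `workflow`, `cv_scores`, and `test_accuracy`.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the small digits dataset stands in for MNIST
X, y = load_digits(return_X_y=True)

# Hold out a stratified test set that is never touched during model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Preprocessing and model live in one estimator, so every CV fold
# fits the scaler on its own training portion only
workflow = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(workflow, X_train, y_train, cv=cv, scoring="accuracy")

# Final fit on the full training set, then a single evaluation on the test set
workflow.fit(X_train, y_train)
test_accuracy = workflow.score(X_test, y_test)

print(f"CV accuracy:   {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The same `workflow` object can be passed directly to `GridSearchCV`, with parameters addressed as `clf__C` or `scaler__with_mean`.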

In [None]:
# 9.1 项目初始化和数据探索
print("=== 综合项目案例：MNIST手写数字识别完整流程 ===")

import pickle
import joblib
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# 项目配置
PROJECT_CONFIG = {
    'project_name': 'MNIST_Digit_Recognition',
    'version': '1.0',
    'author': 'ML_Tutorial',
    'created_date': datetime.now().strftime('%Y-%m-%d'),
    'target_accuracy': 0.95,
    'max_training_time': 300  # 秒
}

print(f"项目: {PROJECT_CONFIG['project_name']} v{PROJECT_CONFIG['version']}")
print(f"目标准确率: {PROJECT_CONFIG['target_accuracy']}")
print(f"最大训练时间: {PROJECT_CONFIG['max_training_time']}秒")

# Draw a random 10,000-sample subset of the full MNIST dataset to limit computation time
n_samples_final = 10000
indices_final = np.random.choice(len(X_mnist), n_samples_final, replace=False)
X_project = X_mnist[indices_final]
y_project = y_mnist[indices_final]

print(f"\n数据集信息:")
print(f"样本数量: {X_project.shape[0]}")
print(f"特征数量: {X_project.shape[1]}")
print(f"类别数量: {len(np.unique(y_project))}")

# 9.1.1 探索性数据分析 (EDA)
print("\n步骤1: 探索性数据分析")

# 类别分布
class_distribution = pd.Series(y_project).value_counts().sort_index()
print("类别分布:")
print(class_distribution)

# 数据质量检查
print(f"\n数据质量检查:")
print(f"缺失值: {np.isnan(X_project).sum()}")
print(f"特征值范围: {X_project.min()} - {X_project.max()}")
print(f"零值比例: {(X_project == 0).mean():.2%}")

# 可视化数据分布
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# 1. 类别分布
axes[0, 0].bar(class_distribution.index, class_distribution.values)
axes[0, 0].set_title('类别分布')
axes[0, 0].set_xlabel('数字')
axes[0, 0].set_ylabel('样本数量')
axes[0, 0].grid(True, alpha=0.3)

# 2. 像素值分布
axes[0, 1].hist(X_project.flatten(), bins=50, alpha=0.7)
axes[0, 1].set_title('像素值分布')
axes[0, 1].set_xlabel('像素值')
axes[0, 1].set_ylabel('频次')
axes[0, 1].grid(True, alpha=0.3)

# 3. 每个样本的非零像素数量
non_zero_pixels = np.sum(X_project > 0, axis=1)
axes[0, 2].hist(non_zero_pixels, bins=30, alpha=0.7, color='green')
axes[0, 2].set_title('每个样本的非零像素数量')
axes[0, 2].set_xlabel('非零像素数')
axes[0, 2].set_ylabel('频次')
axes[0, 2].grid(True, alpha=0.3)

# 4-6. Show one sample image each for the digits 0, 3, and 6 (the first occurrence of each)
for i in range(3):
    digit = i * 3
    if digit < 10:
        # 找到该数字的样本
        digit_indices = np.where(y_project == digit)[0]
        if len(digit_indices) > 0:
            sample_idx = digit_indices[0]
            image = X_project[sample_idx].reshape(28, 28)
            axes[1, i].imshow(image, cmap='gray')
            axes[1, i].set_title(f'数字 {digit} 样本')
            axes[1, i].axis('off')

plt.tight_layout()
plt.show()

# 9.2 数据预处理流水线
print("\n步骤2: 构建数据预处理流水线")

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# 2.1 数据分割
X_train_final, X_test_final, y_train_final, y_test_final = train_test_split(
    X_project, y_project, test_size=0.2, random_state=42, stratify=y_project
)

print(f"数据分割结果:")
print(f"训练集: {X_train_final.shape}")
print(f"测试集: {X_test_final.shape}")

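# 2.2 Build the preprocessing pipeline
# chi2 only accepts non-negative features, so feature selection runs on the raw
# pixel values (0-255) before StandardScaler introduces negative values.
preprocessing_pipeline = Pipeline([
    ('feature_selection', SelectKBest(score_func=chi2, k=400)),
    ('scaler', StandardScaler())
])

print("Preprocessing pipeline components:")
print("1. Feature selection (SelectKBest with chi2, k=400)")
print("2. Standardization (StandardScaler)")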

# 应用预处理
print("\n应用预处理流水线...")
X_train_processed = preprocessing_pipeline.fit_transform(X_train_final, y_train_final)
X_test_processed = preprocessing_pipeline.transform(X_test_final)

print(f"预处理后特征数量: {X_train_processed.shape[1]}")

# 9.3 模型候选池和基准测试
print("\n步骤3: 模型候选池和基准测试")

# 定义候选模型
candidate_models = {
    'Logistic_Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random_Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM_RBF': SVC(kernel='rbf', random_state=42, probability=True),
    'Gradient_Boosting': GradientBoostingClassifier(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

# 基准测试
baseline_results = {}
print("进行基准测试...")

for name, model in candidate_models.items():
    print(f"测试 {name}...")
    
    # 使用3折交叉验证快速评估
    start_time = time.time()
    cv_scores = cross_val_score(model, X_train_processed, y_train_final, 
                               cv=3, scoring='accuracy')
    training_time = time.time() - start_time
    
    baseline_results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'training_time': training_time
    }
    
    print(f"  准确率: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
    print(f"  训练时间: {training_time:.2f}秒")

# 选择最有前景的模型
best_baseline_model = max(baseline_results.keys(), 
                         key=lambda x: baseline_results[x]['cv_mean'])
print(f"\n最佳基准模型: {best_baseline_model}")
print(f"基准准确率: {baseline_results[best_baseline_model]['cv_mean']:.4f}")

# 9.4 超参数优化
print("\n步骤4: 超参数优化")

# 根据基准测试结果，选择最佳模型进行优化
if best_baseline_model == 'Random_Forest':
    param_grid_final = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 15, 20, None],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    best_model_class = RandomForestClassifier(random_state=42)
elif best_baseline_model == 'SVM_RBF':
    param_grid_final = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
    }
    best_model_class = SVC(kernel='rbf', random_state=42, probability=True)
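else:
    # Only Random_Forest and SVM_RBF have a tuning grid here; fall back to a random forest otherwise
    print(f"No parameter grid is defined for {best_baseline_model}; falling back to a random forest")
    param_grid_final = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 15, 20, None],
        'min_samples_split': [2, 5, 10]
    }
    best_model_class = RandomForestClassifier(random_state=42)

print(f"Best baseline model: {best_baseline_model}")
print(f"Model being tuned: {type(best_model_class).__name__}")
print(f"Parameter grid: {param_grid_final}")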

# 网格搜索优化
grid_search_final = GridSearchCV(
    best_model_class, param_grid_final,
    cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1
)

print("进行超参数优化...")
start_time = time.time()
grid_search_final.fit(X_train_processed, y_train_final)
optimization_time = time.time() - start_time

print(f"优化完成，耗时: {optimization_time:.2f}秒")
print(f"最佳参数: {grid_search_final.best_params_}")
print(f"最佳交叉验证得分: {grid_search_final.best_score_:.4f}")

# 9.5 最终模型训练和评估
print("\n步骤5: 最终模型训练和评估")

# 获取最佳模型
final_model = grid_search_final.best_estimator_

# 在测试集上评估
y_pred_final = final_model.predict(X_test_processed)
y_pred_proba_final = final_model.predict_proba(X_test_processed)

# 计算最终评估指标
final_metrics = {
    'accuracy': accuracy_score(y_test_final, y_pred_final),
    'precision_macro': precision_score(y_test_final, y_pred_final, average='macro'),
    'recall_macro': recall_score(y_test_final, y_pred_final, average='macro'),
    'f1_macro': f1_score(y_test_final, y_pred_final, average='macro')
}

print("最终模型性能:")
for metric, value in final_metrics.items():
    print(f"{metric}: {value:.4f}")

# 检查是否达到目标准确率
target_met = final_metrics['accuracy'] >= PROJECT_CONFIG['target_accuracy']
print(f"\n目标准确率 ({PROJECT_CONFIG['target_accuracy']}) 达成: {'✓' if target_met else '✗'}")

# 9.6 结果可视化和分析
print("\n步骤6: 结果可视化和分析")

fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. 模型性能对比
models = list(baseline_results.keys()) + ['Final_Optimized']
accuracies = [baseline_results[m]['cv_mean'] for m in baseline_results.keys()] + [final_metrics['accuracy']]
colors = ['lightblue'] * len(baseline_results) + ['red']

bars = axes[0, 0].bar(range(len(models)), accuracies, color=colors)
axes[0, 0].set_title('模型性能对比')
axes[0, 0].set_ylabel('准确率')
axes[0, 0].set_xticks(range(len(models)))
axes[0, 0].set_xticklabels(models, rotation=45, ha='right')
axes[0, 0].grid(True, alpha=0.3)

# 添加目标线
axes[0, 0].axhline(y=PROJECT_CONFIG['target_accuracy'], color='green', 
                   linestyle='--', label=f"目标准确率 ({PROJECT_CONFIG['target_accuracy']})")
axes[0, 0].legend()

# 2. 混淆矩阵
cm_final = confusion_matrix(y_test_final, y_pred_final)
im = axes[0, 1].imshow(cm_final, interpolation='nearest', cmap=plt.cm.Blues)
axes[0, 1].set_title('最终模型混淆矩阵')
axes[0, 1].set_xlabel('预测标签')
axes[0, 1].set_ylabel('真实标签')

# 添加数值标签
for i in range(cm_final.shape[0]):
    for j in range(cm_final.shape[1]):
        axes[0, 1].text(j, i, format(cm_final[i, j], 'd'),
                       ha="center", va="center",
                       color="white" if cm_final[i, j] > cm_final.max() / 2 else "black")

plt.colorbar(im, ax=axes[0, 1])

# 3. 特征重要性（如果模型支持）
if hasattr(final_model, 'feature_importances_'):
    # 获取原始特征的重要性
    feature_importance = final_model.feature_importances_
    # 反向映射到原始像素位置
    selected_features = preprocessing_pipeline.named_steps['feature_selection'].get_support()
    full_importance = np.zeros(784)
    full_importance[selected_features] = feature_importance
    
    importance_image = full_importance.reshape(28, 28)
    im2 = axes[0, 2].imshow(importance_image, cmap='hot', interpolation='nearest')
    axes[0, 2].set_title('特征重要性热力图')
    axes[0, 2].axis('off')
    plt.colorbar(im2, ax=axes[0, 2])
else:
    axes[0, 2].text(0.5, 0.5, '该模型不支持\n特征重要性分析', 
                   ha='center', va='center', transform=axes[0, 2].transAxes)
    axes[0, 2].set_title('特征重要性')

# 4. 预测概率分布
if y_pred_proba_final is not None:
    max_proba = np.max(y_pred_proba_final, axis=1)
    axes[1, 0].hist(max_proba, bins=30, alpha=0.7)
    axes[1, 0].set_title('预测概率分布')
    axes[1, 0].set_xlabel('最大预测概率')
    axes[1, 0].set_ylabel('频次')
    axes[1, 0].grid(True, alpha=0.3)

# 5. Error analysis
# The bottom row has two free panels (axes[1, 1] and axes[1, 2]) next to the
# probability histogram, so show the first two misclassified test samples there.
incorrect_indices = np.where(y_test_final != y_pred_final)[0]
for col, idx in enumerate(incorrect_indices[:2], start=1):
    true_label = y_test_final.iloc[idx] if hasattr(y_test_final, 'iloc') else y_test_final[idx]
    pred_label = y_pred_final[idx]
    image = X_test_final[idx].reshape(28, 28)

    axes[1, col].imshow(image, cmap='gray')
    axes[1, col].set_title(f'Misclassified: {true_label}→{pred_label}')
    axes[1, col].axis('off')

plt.tight_layout()
plt.show()

# 9.7 模型保存和部署准备
print("\n步骤7: 模型保存和部署准备")

# 创建模型包
model_package = {
    'model': final_model,
    'preprocessing_pipeline': preprocessing_pipeline,
    'feature_names': [f'pixel_{i}' for i in range(784)],
    'target_names': [str(i) for i in range(10)],
    'model_metadata': {
        'model_type': type(final_model).__name__,
        'best_params': grid_search_final.best_params_,
        'cv_score': grid_search_final.best_score_,
        'test_accuracy': final_metrics['accuracy'],
        'training_samples': len(X_train_final),
        'features_selected': X_train_processed.shape[1],
        'created_date': datetime.now().isoformat()
    }
}

# 保存模型（注释掉实际保存，避免文件系统操作）
# model_filename = f"mnist_model_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pkl"
# joblib.dump(model_package, model_filename)
# print(f"模型已保存到: {model_filename}")

print("模型包组件:")
for key, value in model_package.items():
    if key != 'model' and key != 'preprocessing_pipeline':
        print(f"• {key}: {type(value)}")

# 模型预测函数示例
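def predict_digit(image_array, model_package):
    """
    Predict the digit shown in a single handwritten-digit image.

    Parameters:
    image_array: a 28x28 NumPy array or a 784-dimensional vector
    model_package: dict holding the model and the preprocessing pipeline

    Returns:
    prediction: the predicted digit
    probability: array of predicted probabilities, one per class
    """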
    # 确保输入是正确的形状
    if image_array.shape == (28, 28):
        image_vector = image_array.flatten()
    else:
        image_vector = image_array
    
    # 预处理
    image_processed = model_package['preprocessing_pipeline'].transform([image_vector])
    
    # 预测
    prediction = model_package['model'].predict(image_processed)[0]
    probability = model_package['model'].predict_proba(image_processed)[0]
    
    return prediction, probability

# 示例预测
sample_image = X_test_final[0]
pred, prob = predict_digit(sample_image, model_package)
print(f"\n预测示例:")
print(f"真实标签: {y_test_final.iloc[0] if hasattr(y_test_final, 'iloc') else y_test_final[0]}")
print(f"预测标签: {pred}")
print(f"预测概率: {prob[pred]:.4f}")

# 9.8 项目总结报告
print("\n步骤8: 项目总结报告")

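project_summary = f"""
=== {PROJECT_CONFIG['project_name']} Project Summary Report ===

Project information:
• Project version: {PROJECT_CONFIG['version']}
• Created on: {PROJECT_CONFIG['created_date']}
• Target accuracy: {PROJECT_CONFIG['target_accuracy']}

Dataset information:
• Total samples: {len(X_project):,}
• Training samples: {len(X_train_final):,}
• Test samples: {len(X_test_final):,}
• Original features: {X_project.shape[1]}
• Selected features: {X_train_processed.shape[1]}

Model development process:
1. ✓ Exploratory data analysis
2. ✓ Preprocessing pipeline construction
3. ✓ Multi-model baseline benchmark ({len(candidate_models)} models)
4. ✓ Hyperparameter optimization (grid search)
5. ✓ Final model evaluation
6. ✓ Model saving and deployment preparation

Final model performance:
• Model type: {type(final_model).__name__}
• Test accuracy: {final_metrics['accuracy']:.4f}
• Precision (macro average): {final_metrics['precision_macro']:.4f}
• Recall (macro average): {final_metrics['recall_macro']:.4f}
• F1 score (macro average): {final_metrics['f1_macro']:.4f}

Target status:
• Accuracy target: {'met ✓' if target_met else 'not met ✗'}

Computation cost:
• Baseline benchmark time: {sum(baseline_results[m]['training_time'] for m in baseline_results):.1f} s
• Hyperparameter optimization time: {optimization_time:.1f} s
• Total computation time: {sum(baseline_results[m]['training_time'] for m in baseline_results) + optimization_time:.1f} s

Recommendations and follow-up work:
1. Consider deep learning models to improve performance further
2. Set up model monitoring and automatic retraining
3. Optimize prediction latency and memory usage
4. Add data augmentation
5. Deploy to production and collect feedback

Project status: {'completed successfully' if target_met else 'needs further optimization'}
"""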

print(project_summary)

# 保存项目报告（注释掉实际保存）
# report_filename = f"project_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
# with open(report_filename, 'w', encoding='utf-8') as f:
#     f.write(project_summary)
# print(f"\n项目报告已保存到: {report_filename}")

print(f"\n🎉 {PROJECT_CONFIG['project_name']} 项目完成！")
print(f"这个综合案例展示了完整的机器学习项目流程:")
print(f"从数据探索 → 预处理 → 模型选择 → 优化 → 评估 → 部署准备")
print(f"在实际项目中，你可以按照这个流程进行开发和部署。")
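
The save step in the cell above is commented out to avoid touching the file system. As a rough illustration of what a save-and-reload round trip could look like, the sketch below writes a model package to a temporary directory with `joblib` and checks that the restored model gives the same predictions. It is a sketch under assumptions: it trains a small stand-in pipeline on `load_digits` rather than reusing `model_package`, and the file name `digits_model.joblib` and the metadata keys are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small stand-in model trained on the built-in digits dataset
X, y = load_digits(return_X_y=True)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=42)),
]).fit(X, y)

# Bundle the fitted pipeline with illustrative metadata (keys are hypothetical)
package = {
    "model": model,
    "metadata": {"n_features": X.shape[1], "classes": model.classes_.tolist()},
}

# Write to a temporary directory so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "digits_model.joblib"
    joblib.dump(package, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions exactly
predictions_match = np.array_equal(
    model.predict(X[:20]), restored["model"].predict(X[:20])
)
print(f"Predictions identical after reload: {predictions_match}")
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and with a compatible scikit-learn version.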

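## 10. Learning Summary and Next Steps

This tutorial has covered the core skills of machine learning with Scikit-learn. This section summarizes the key points and outlines a path for further study.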

In [None]:
# 10.1 知识体系回顾
print("=== Scikit-learn 学习总结与进阶指南 ===")

# 创建知识体系图
knowledge_map = {
    '机器学习基础': {
        '监督学习': ['分类', '回归'],
        '无监督学习': ['聚类', '降维'],
        '模型评估': ['交叉验证', '评估指标']
    },
    '数据预处理': {
        '特征缩放': ['StandardScaler', 'MinMaxScaler', 'RobustScaler'],
        '分类编码': ['LabelEncoder', 'OneHotEncoder', 'OrdinalEncoder'],
        '缺失值处理': ['SimpleImputer', 'KNNImputer'],
        '特征选择': ['VarianceThreshold', 'SelectKBest', 'RFE']
    },
    '分类算法': {
        '线性模型': ['LogisticRegression'],
        '树模型': ['DecisionTree', 'RandomForest', 'GradientBoosting'],
        '实例学习': ['KNeighbors'],
        '支持向量机': ['SVC'],
        '朴素贝叶斯': ['GaussianNB'],
        '神经网络': ['MLPClassifier']
    },
    '回归算法': {
        '线性回归': ['LinearRegression', 'Ridge', 'Lasso', 'ElasticNet'],
        '树回归': ['DecisionTreeRegressor', 'RandomForestRegressor'],
        '支持向量回归': ['SVR']
    },
    '聚类算法': {
        '划分聚类': ['KMeans'],
        '层次聚类': ['AgglomerativeClustering'],
        '密度聚类': ['DBSCAN'],
        '混合模型': ['GaussianMixture']
    },
    '降维技术': {
        '线性降维': ['PCA', 'TruncatedSVD'],
        '非线性降维': ['t-SNE'],
        '特征选择': ['SelectKBest', 'RFE']
    },
    '模型选择': {
        '交叉验证': ['KFold', 'StratifiedKFold'],
        '超参数优化': ['GridSearchCV', 'RandomizedSearchCV'],
        '模型评估': ['accuracy', 'precision', 'recall', 'f1_score']
    }
}

print("📚 本教程涵盖的知识体系:")
for main_topic, subtopics in knowledge_map.items():
    print(f"\n{main_topic}:")
    for subtopic, methods in subtopics.items():
        print(f"  • {subtopic}: {', '.join(methods)}")

# 10.2 最佳实践总结
print(f"\n=== 机器学习最佳实践总结 ===")

best_practices = {
    '数据准备阶段': [
        '充分理解数据和业务问题',
        '进行探索性数据分析(EDA)',
        '检查数据质量和完整性',
        '合理处理缺失值和异常值',
        '确保数据分割的随机性和代表性'
    ],
    '特征工程阶段': [
        '根据算法特性选择合适的特征缩放方法',
        '正确编码分类变量',
        '创建有意义的特征组合',
        '使用特征选择减少维度诅咒',
        '避免数据泄露(Data Leakage)'
    ],
    '模型选择阶段': [
        '从简单模型开始，逐步增加复杂度',
        '使用多个算法进行基准测试',
        '重视模型的可解释性需求',
        '考虑计算效率和部署约束',
        '平衡偏差和方差'
    ],
    '模型验证阶段': [
        '使用分层交叉验证确保结果可靠',
        '关注多个评估指标而非单一指标',
        '绘制学习曲线诊断过拟合/欠拟合',
        '在独立测试集上最终验证',
        '进行错误分析理解模型缺陷'
    ],
    '超参数优化阶段': [
        '先使用随机搜索快速探索参数空间',
        '再使用网格搜索精细优化',
        '设置合理的搜索范围',
        '使用嵌套交叉验证避免过拟合',
        '考虑计算资源和时间约束'
    ],
    '模型部署阶段': [
        '保存完整的预处理流水线',
        '建立模型监控和预警机制',
        '准备模型回滚方案',
        '定期重训练和更新模型',
        '建立A/B测试框架'
    ]
}

for stage, practices in best_practices.items():
    print(f"\n{stage}:")
    for i, practice in enumerate(practices, 1):
        print(f"  {i}. {practice}")

# 10.3 常见问题和解决方案
print(f"\n=== 常见问题和解决方案 ===")

common_issues = {
    '数据相关问题': {
        '类别不平衡': [
            '使用分层抽样',
            '调整类别权重(class_weight="balanced")',
            '使用SMOTE等重采样技术',
            '选择合适的评估指标(F1, AUC等)'
        ],
        '特征维度过高': [
            '使用特征选择技术',
            '应用PCA等降维方法',
            '使用L1正则化(Lasso)',
            '增加更多训练数据'
        ],
        '数据泄露': [
            '确保时间序列数据的正确分割',
            '避免使用未来信息',
            '仔细检查特征来源',
            '使用管道确保预处理顺序'
        ]
    },
    '模型性能问题': {
        '过拟合': [
            '增加训练数据',
            '使用正则化技术',
            '减少模型复杂度',
            '使用集成方法',
            '应用早停策略'
        ],
        '欠拟合': [
            '增加模型复杂度',
            '创建更多特征',
            '减少正则化强度',
            '选择更强大的算法',
            '增加训练时间'
        ],
        '泛化能力差': [
            '使用交叉验证评估',
            '增加验证数据的多样性',
            '减少模型复杂度',
            '改进特征工程',
            '使用集成学习'
        ]
    },
    '计算效率问题': {
        '训练时间过长': [
            '使用更快的算法',
            '减少特征数量',
            '使用数据子集',
            '并行计算(n_jobs=-1)',
            '使用增量学习算法'
        ],
        '内存不足': [
            '分批处理数据',
            '使用稀疏矩阵',
            '减少特征维度',
            '使用内存映射',
            '选择内存友好的算法'
        ]
    }
}

for category, issues in common_issues.items():
    print(f"\n{category}:")
    for issue, solutions in issues.items():
        print(f"  问题: {issue}")
        for solution in solutions:
            print(f"    • {solution}")

# 10.4 算法选择指南
print(f"\n=== 算法选择指南 ===")

algorithm_guide = {
    '根据问题类型选择': {
        '二分类问题': ['LogisticRegression', 'SVM', 'RandomForest'],
        '多分类问题': ['LogisticRegression', 'RandomForest', 'GradientBoosting'],
        '回归问题': ['LinearRegression', 'RandomForestRegressor', 'SVR'],
        '聚类问题': ['KMeans', 'DBSCAN', 'AgglomerativeClustering'],
        '降维问题': ['PCA', 't-SNE', 'SelectKBest']
    },
    '根据数据特点选择': {
        '小数据集(<1000样本)': ['KNN', 'NaiveBayes', 'LinearRegression'],
        '中等数据集(1000-100k)': ['SVM', 'RandomForest', 'GradientBoosting'],
        '大数据集(>100k样本)': ['LogisticRegression', 'SGD', 'LinearSVM'],
        '高维数据': ['LinearModels', 'SVM', 'NaiveBayes'],
        '非线性关系': ['RandomForest', 'SVM(RBF)', 'NeuralNetwork']
    },
    '根据性能要求选择': {
        '需要概率输出': ['LogisticRegression', 'RandomForest', 'GaussianNB'],
        '需要可解释性': ['LinearRegression', 'DecisionTree', 'LogisticRegression'],
        '追求最高精度': ['GradientBoosting', 'RandomForest', 'SVM'],
        '需要快速预测': ['LinearModels', 'NaiveBayes', 'KNN'],
        '需要快速训练': ['LinearModels', 'NaiveBayes']
    }
}

for criterion, recommendations in algorithm_guide.items():
    print(f"\n{criterion}:")
    for scenario, algorithms in recommendations.items():
        print(f"  {scenario}: {', '.join(algorithms)}")

# 10.5 进阶学习路径
print(f"\n=== 进阶学习路径 ===")

learning_path = {
    '深化Scikit-learn': [
        '学习Pipeline和ColumnTransformer的高级用法',
        '掌握自定义转换器和估计器',
        '研究集成学习的高级技巧',
        '学习半监督和主动学习',
        '掌握时间序列分析方法'
    ],
    '扩展到深度学习': [
        '学习TensorFlow/Keras基础',
        '掌握PyTorch深度学习框架',
        '理解卷积神经网络(CNN)',
        '学习循环神经网络(RNN/LSTM)',
        '探索Transformer架构'
    ],
    '特化领域应用': [
        '自然语言处理(NLP): NLTK, spaCy, transformers',
        '计算机视觉: OpenCV, PIL, torchvision',
        '推荐系统: surprise, lightfm',
        '时间序列: statsmodels, prophet',
        '强化学习: OpenAI Gym, stable-baselines3'
    ],
    '工程化和部署': [
        '学习MLOps工具链: MLflow, DVC, Kubeflow',
        '掌握模型部署: Flask, FastAPI, Docker',
        '学习云平台服务: AWS SageMaker, GCP AI Platform',
        '掌握模型监控和A/B测试',
        '学习分布式机器学习: Dask, Ray'
    ],
    '理论基础加强': [
        '深入理解统计学习理论',
        '学习优化算法原理',
        '掌握概率图模型',
        '研究因果推理方法',
        '学习贝叶斯机器学习'
    ]
}

for area, topics in learning_path.items():
    print(f"\n{area}:")
    for i, topic in enumerate(topics, 1):
        print(f"  {i}. {topic}")

# 10.6 实践项目建议
print(f"\n=== 实践项目建议 ===")

project_suggestions = {
    '初级项目(巩固基础)': [
        '房价预测: 回归分析和特征工程',
        '客户细分: 聚类分析和业务解释',
        '垃圾邮件分类: 文本分类和特征提取',
        '股票价格预测: 时间序列分析',
        '图像分类: 传统ML方法vs深度学习对比'
    ],
    '中级项目(综合应用)': [
        '推荐系统: 协同过滤和内容推荐',
        '欺诈检测: 不平衡数据处理',
        '情感分析: NLP和机器学习结合',
        '销售预测: 多元时间序列分析',
        '用户流失预测: 生存分析和分类模型'
    ],
    '高级项目(工程化)': [
        '端到端ML管道: 从数据到部署',
        '实时推荐系统: 在线学习和冷启动',
        '多模态学习: 文本+图像+表格数据',
        'AutoML系统: 自动特征工程和模型选择',
        'A/B测试平台: 实验设计和因果推理'
    ]
}

for level, projects in project_suggestions.items():
    print(f"\n{level}:")
    for i, project in enumerate(projects, 1):
        print(f"  {i}. {project}")

# 10.7 资源推荐
print(f"\n=== 学习资源推荐 ===")

resources = {
    '官方文档和教程': [
        'Scikit-learn官方文档: https://scikit-learn.org/',
        'Scikit-learn教程: https://scikit-learn.org/stable/tutorial/',
        'Python数据科学手册: Jake VanderPlas',
        'Hands-On Machine Learning: Aurélien Géron'
    ],
    '在线课程': [
        'Andrew Ng机器学习课程 (Coursera)',
        'Fast.ai实用机器学习课程',
        'edX MIT机器学习课程',
        'Kaggle Learn免费课程'
    ],
    '实践平台': [
        'Kaggle竞赛和数据集',
        'Google Colab免费GPU环境',
        'GitHub开源项目学习',
        'Papers With Code论文代码实现'
    ],
    '社区和论坛': [
        'Stack Overflow问答',
        'Reddit r/MachineLearning',
        '知乎机器学习话题',
        'Towards Data Science (Medium)'
    ]
}

for category, resource_list in resources.items():
    print(f"\n{category}:")
    for resource in resource_list:
        print(f"  • {resource}")

# 10.8 结语
print(f"\n=== 结语 ===")

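conclusion = """
🎓 Congratulations on completing this Scikit-learn tutorial!

In this tutorial you have worked through:
✅ The core concepts and workflow of machine learning
✅ Data preprocessing techniques and best practices
✅ The principles and use of the main machine learning algorithms
✅ Methods for model selection, evaluation, and optimization
✅ The development process of a complete project

Machine learning is a hands-on discipline, and theory is only the starting point.
Real progress comes from:
• Building projects yourself
• Taking part in open-source communities
• Keeping up with new techniques
• Applying what you have learned to real problems

The field moves quickly. Stay curious and keep learning,
and you will keep growing in it.

🚀 Start your machine learning journey now!
"""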

print(conclusion)

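# Build a learning checklist
print("\n📋 Learning checklist:")
checklist = [
    "Understand the difference between supervised and unsupervised learning",
    "Carry out data preprocessing from start to finish",
    "Use at least 3 classification algorithms",
    "Use at least 2 regression algorithms",
    "Apply clustering algorithms to customer segmentation",
    "Understand the principles and applications of PCA and other dimensionality reduction techniques",
    "Evaluate model performance with cross-validation",
    "Tune hyperparameters with grid search",
    "Explain model results and their business value",
    "Complete at least one end-to-end ML project"
]

for i, item in enumerate(checklist, 1):
    print(f"  {i:2d}. □ {item}")

print("\nTick off the items you have completed and plan your next steps!")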
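
Several checklist items (preprocessing, cross-validation, grid search) can be practiced together in one short script. The following is a minimal sketch rather than part of the tutorial's own pipeline: it assumes the small `load_digits` dataset bundled with scikit-learn instead of the full MNIST data, and the parameter grid values are illustrative choices only.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: use the bundled 8x8 digits data so the sketch runs quickly offline.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Putting the scaler inside the pipeline means it is re-fit on each
# cross-validation training fold, so no validation data leaks into scaling.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Illustrative grid only; step-name prefixes route values to the SVC step.
param_grid = {"svc__C": [1, 10], "svc__gamma": ["scale", 0.01]}

# 3-fold cross-validated grid search over the whole pipeline.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Score the refit best pipeline on data never seen during the search.
test_accuracy = search.score(X_test, y_test)
print(f"Best parameters: {search.best_params_}")
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

Keeping a separate held-out test set matters here: `best_score_` is selected as the maximum over the grid, so it tends to be slightly optimistic compared with the test accuracy.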

## 📚 Machine Learning and Scikit-learn Learning Resources

### 🎯 Official Scikit-learn Resources
- [**Scikit-learn Official Documentation**](https://scikit-learn.org/stable/) - The authoritative API reference and user guide
- [**Scikit-learn Examples**](https://scikit-learn.org/stable/auto_examples/) - Official gallery of example code
- [**Scikit-learn User Guide**](https://scikit-learn.org/stable/user_guide.html) - Detailed algorithm descriptions and usage guidance
- [**Scikit-learn Tutorials**](https://scikit-learn.org/stable/tutorial/) - Official tutorial series

### 📖 Classic Machine Learning Courses
- [**Machine Learning Course by Andrew Ng**](https://www.coursera.org/learn/machine-learning) - The classic machine learning course on Coursera
- [**CS229 Stanford ML**](http://cs229.stanford.edu/) - Stanford University machine learning course materials
- [**Hung-yi Lee's Machine Learning Course (李宏毅机器学习课程)**](https://speech.ee.ntu.edu.tw/~hylee/ml/2021-spring.html) - National Taiwan University machine learning course taught in Mandarin
- [**MIT 6.034 Artificial Intelligence**](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/) - MIT artificial intelligence course

### 📚 Recommended Textbooks and Books
- [**Hands-On Machine Learning**](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) - By Aurélien Géron, strongly practice-oriented
- [**The Elements of Statistical Learning**](https://hastie.su.stanford.edu/ElemStatLearn/) - Classic statistical learning textbook (free)
- [**Pattern Recognition and Machine Learning**](https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/) - Classic textbook by Christopher Bishop
- [**An Introduction to Statistical Learning**](https://www.statlearning.com/) - Introductory statistical learning text (free online edition)

### 🛠️ Practice and Projects
- [**Kaggle Learn**](https://www.kaggle.com/learn) - Free machine learning micro-courses
- [**Machine Learning Mastery**](https://machinelearningmastery.com/) - Hands-on tutorials by Jason Brownlee
- [**Towards Data Science**](https://towardsdatascience.com/) - Data science articles on Medium
- [**Papers with Code**](https://paperswithcode.com/) - Papers paired with code implementations

### 🏆 Competitions and Datasets
- [**Kaggle Competitions**](https://www.kaggle.com/competitions) - Data science competition platform
- [**UCI ML Repository**](https://archive.ics.uci.edu/ml/) - Classic machine learning datasets
- [**Google Dataset Search**](https://datasetsearch.research.google.com/) - Google's dataset search engine
- [**Awesome Public Datasets**](https://github.com/awesomedata/awesome-public-datasets) - Curated list of public datasets

### 🔬 Going Deeper into Algorithms
- [**Algorithm Visualizations**](https://www.cs.usfca.edu/~galles/visualization/Algorithms.html) - Algorithm visualizations
- [**Machine Learning Yearning**](https://www.deeplearning.ai/machine-learning-yearning/) - Practical guide by Andrew Ng
- [**Interpretable ML Book**](https://christophm.github.io/interpretable-ml-book/) - Interpretable machine learning
- [**Feature Engineering Book**](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/) - Guide to feature engineering

### 🐍 The Python Data Science Ecosystem
- [**Pandas Documentation**](https://pandas.pydata.org/docs/) - Data manipulation and analysis
- [**NumPy User Guide**](https://numpy.org/doc/stable/user/) - Foundations of numerical computing
- [**Matplotlib Gallery**](https://matplotlib.org/stable/gallery/) - Visualization examples
- [**Seaborn Tutorial**](https://seaborn.pydata.org/tutorial.html) - Statistical visualization

### 📊 Visualization and Interpretation
- [**Plotly Python**](https://plotly.com/python/) - Interactive visualization
- [**SHAP (SHapley Additive exPlanations)**](https://shap.readthedocs.io/) - Model explanation tool
- [**LIME**](https://github.com/marcotcr/lime) - Local interpretable model-agnostic explanations
- [**Yellowbrick**](https://www.scikit-yb.org/) - Machine learning visualization

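### 🇨🇳 Chinese-Language Resources
- [**Machine Learning in Action (机器学习实战)**](https://github.com/apachecn/MachineLearning) - ApacheCN tutorial in Chinese
- [**Machine Learning, the "Watermelon Book" (机器学习西瓜书)**](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/MLbook2016.htm) - By Professor Zhou Zhihua
- [**Statistical Learning Methods (统计学习方法)**](https://book.douban.com/subject/10590856/) - By Professor Li Hang
- [**Introduction to Machine Learning with Python (Python机器学习基础教程)**](https://github.com/amueller/introduction_to_ml_with_python) - Companion resources for the Chinese edition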

### 🎓 Online Course Platforms
- [**edX MIT Introduction to ML**](https://www.edx.org/course/introduction-to-machine-learning) - MIT introduction to machine learning
- [**Udacity ML Nanodegree**](https://www.udacity.com/course/machine-learning-engineer-nanodegree--nd009t) - Machine Learning Engineer Nanodegree
- [**DataCamp**](https://www.datacamp.com/) - Online data science learning
- [**Coursera ML Specialization**](https://www.coursera.org/specializations/machine-learning-introduction) - The updated Machine Learning Specialization

### 🧪 Experimentation and Tools
- [**Google Colab**](https://colab.research.google.com/) - Free cloud-hosted Jupyter environment
- [**Jupyter Notebooks Gallery**](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks) - Curated collection of Jupyter notebooks
- [**MLflow**](https://mlflow.org/) - Machine learning lifecycle management
- [**Weights & Biases**](https://wandb.ai/) - Experiment tracking and visualization

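### 📱 Learning on the Go
- [**Machine Learning Mastery Blog**](https://machinelearningmastery.com/blog/) - Regularly updated technical blog
- [**KDnuggets**](https://www.kdnuggets.com/) - Data science news and tutorials
- [**Analytics Vidhya**](https://www.analyticsvidhya.com/) - Data science community
- [**DataScienceCentral**](https://www.datasciencecentral.com/) - Data science news platform

Remember: **machine learning combines theory with practice, and hands-on work and experimentation are what count most!** 🚀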