## 1.2 代码实现 - 回归
使用 diabetes 数据集，演示如何使用 KNN 进行回归任务。

#### 1. 导入数据

In [2]:
from sklearn.datasets import load_diabetes
import pandas as pd

diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.DataFrame(diabetes.target, columns=["target"])
print("X:\n", X)
print("y:\n", y)

X:
           age       sex       bmi        bp        s1        s2        s3  \
0    0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1   -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2    0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3   -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4    0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   
..        ...       ...       ...       ...       ...       ...       ...   
437  0.041708  0.050680  0.019662  0.059744 -0.005697 -0.002566 -0.028674   
438 -0.005515  0.050680 -0.015906 -0.067642  0.049341  0.079165 -0.028674   
439  0.041708  0.050680 -0.015906  0.017293 -0.037344 -0.013840 -0.024993   
440 -0.045472 -0.044642  0.039062  0.001215  0.016318  0.015283 -0.028674   
441 -0.045472 -0.044642 -0.073030 -0.081413  0.083740  0.027809  0.173816   

           s4        s5        s6  
0   -0.002592  0.019907 -0.017646  

#### 2. 划分训练集和测试集

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#### 3. 标准化数据

In [4]:
from sklearn.preprocessing import StandardScaler
scalar = StandardScaler()
X_train_scaled = scalar.fit_transform(X_train) # 只在训练集上 fit
X_test_scaled = scalar.transform(X_test)

#### 4. 创建 KNN 回归器，并且训练模型

In [5]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(
    n_neighbors=5, # 使用5个邻居
    weights="uniform", # 使用均匀权重
    metric="minkowski", # 采用通用的 Minkowski 距离
    p=2  # 欧氏距离
)
model.fit(X_train_scaled, y_train)

0,1,2
,"n_neighbors  n_neighbors: int, default=5 Number of neighbors to use by default for :meth:`kneighbors` queries.",5
,"weights  weights: {'uniform', 'distance'}, callable or None, default='uniform' Weight function used in prediction. Possible values: - 'uniform' : uniform weights. All points in each neighborhood  are weighted equally. - 'distance' : weight points by the inverse of their distance.  in this case, closer neighbors of a query point will have a  greater influence than neighbors which are further away. - [callable] : a user-defined function which accepts an  array of distances, and returns an array of the same shape  containing the weights. Uniform weights are used by default. See the following example for a demonstration of the impact of different weighting schemes on predictions: :ref:`sphx_glr_auto_examples_neighbors_plot_regression.py`.",'uniform'
,"algorithm  algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto' Algorithm used to compute the nearest neighbors: - 'ball_tree' will use :class:`BallTree` - 'kd_tree' will use :class:`KDTree` - 'brute' will use a brute-force search. - 'auto' will attempt to decide the most appropriate algorithm  based on the values passed to :meth:`fit` method. Note: fitting on sparse input will override the setting of this parameter, using brute force.",'auto'
,"leaf_size  leaf_size: int, default=30 Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.",30
,"p  p: float, default=2 Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.",2
,"metric  metric: str, DistanceMetric object or callable, default='minkowski' Metric to use for distance computation. Default is ""minkowski"", which results in the standard Euclidean distance when p = 2. See the documentation of `scipy.spatial.distance `_ and the metrics listed in :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric values. If metric is ""precomputed"", X is assumed to be a distance matrix and must be square during fit. X may be a :term:`sparse graph`, in which case only ""nonzero"" elements may be considered neighbors. If metric is a callable function, it takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. This works for Scipy's metrics, but is less efficient than passing the metric name as a string. If metric is a DistanceMetric object, it will be passed directly to the underlying computation routines.",'minkowski'
,"metric_params  metric_params: dict, default=None Additional keyword arguments for the metric function.",
,"n_jobs  n_jobs: int, default=None The number of parallel jobs to run for neighbors search. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details. Doesn't affect :meth:`fit` method.",


#### 5. 进行预测

In [9]:
y_pred = model.predict(X_test_scaled)
print("True labels:", y_test.values.reshape(-1))
print("Predictions:", y_pred.reshape(-1))

True labels: [219.  70. 202. 230. 111.  84. 242. 272.  94.  96.  94. 252.  99. 297.
 135.  67. 295. 264. 170. 275. 310.  64. 128. 232. 129. 118. 263.  77.
  48. 107. 140. 113.  90. 164. 180. 233.  42.  84. 172.  63.  48. 108.
 156. 168.  90.  52. 200.  87.  90. 258. 136. 158.  69.  72. 171.  95.
  72. 151. 168.  60. 122.  52. 187. 102. 214. 248. 181. 110. 140. 202.
 101. 222. 281.  61.  89.  91. 186. 220. 237. 233.  68. 190.  96.  72.
 153.  98.  37.  63. 184.]
Predictions: [125.6 160.2 153.  238.  153.4 150.4 246.2 170.   94.4 104.6 104.  151.6
 104.  166.6  61.4 105.4 263.8 252.  173.6 215.4 161.   86.6 106.6 184.6
 158.2 168.  196.6 145.6  67.6 117.2 158.  166.6  86.4 158.4 166.4 243.6
  73.8 136.8 148.6 106.6  90.  100.6 134.2 147.6 193.4  81.6  83.6  90.8
  84.4 119.2 117.8  88.2 150.6 112.  210.2 130.8  79.4 175.8 113.4  73.
 155.  117.6  58.   94.6 169.4 196.8 214.4 146.4 145.2 111.4 119.2 195.
 204.4  98.6 107.8 192.4 151.  163.6 210.4 218.2 150.8 139.   67.4  82.4
 105.4 113.2

#### 6. 评估模型性能
1. Mean Absolute Error (MAE): 衡量预测值与真实值之间的平均绝对差异。
2. Mean Squared Error (MSE): 衡量预测值与真实值之间的平均平方差异。
3. R² Score: 衡量模型解释数据变异的能力，取值范围为0到1，值越大表示模型性能越好。

In [11]:
# 评估指标
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
MAE = mean_absolute_error(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
R2 = r2_score(y_test, y_pred)
print("Mean Absolute Error (MAE):", MAE)
print("Mean Squared Error (MSE):", MSE)
print("R² Score:", R2)

Mean Absolute Error (MAE): 42.777528089887646
Mean Squared Error (MSE): 3047.449887640449
R² Score: 0.42480887066066253


#### 7. 可选：使用 Pipeline 来优化流程

In [13]:
from sklearn.pipeline import Pipeline
pipline = Pipeline([
    ("scalar", StandardScaler()),
    ("knn_regressor", KNeighborsRegressor(n_neighbors=7, weights="uniform", metric="minkowski", p=2))
])
# 训练 Pipeline, 此时直接传递 X_train 和 y_train即可，因为 Pipeline 会自动处理标准化步骤
pipline.fit(X_train, y_train)

# 进行预测
y_pred_2 = pipline.predict(X_test_scaled)
print("True labels:", y_test.values.reshape(-1))
print("Predictions from Pipeline:", y_pred_2.reshape(-1))

# 评估 Pipeline 模型性能
MAE_2 = mean_absolute_error(y_test, y_pred_2)
MSE_2 = mean_squared_error(y_test, y_pred_2)
R2_2 = r2_score(y_test, y_pred_2)
print("Pipeline Mean Absolute Error (MAE):", MAE_2)
print("Pipeline Mean Squared Error (MSE):", MSE_2)
print("Pipeline R² Score:", R2_2)

True labels: [219.  70. 202. 230. 111.  84. 242. 272.  94.  96.  94. 252.  99. 297.
 135.  67. 295. 264. 170. 275. 310.  64. 128. 232. 129. 118. 263.  77.
  48. 107. 140. 113.  90. 164. 180. 233.  42.  84. 172.  63.  48. 108.
 156. 168.  90.  52. 200.  87.  90. 258. 136. 158.  69.  72. 171.  95.
  72. 151. 168.  60. 122.  52. 187. 102. 214. 248. 181. 110. 140. 202.
 101. 222. 281.  61.  89.  91. 186. 220. 237. 233.  68. 190.  96.  72.
 153.  98.  37.  63. 184.]
Predictions from Pipeline: [164.14285714 156.         160.28571429 227.14285714 177.14285714
 153.28571429 235.14285714 205.42857143  95.57142857 152.14285714
  67.28571429 123.14285714 112.42857143 262.85714286  83.57142857
  78.         239.42857143 230.57142857 112.14285714 260.28571429
 200.42857143  73.71428571  80.14285714 256.         121.14285714
 257.14285714 303.57142857 204.         106.42857143 142.
 223.85714286  96.71428571 125.71428571 271.71428571 164.42857143
 201.85714286 117.85714286 146.71428571 191.         

