# Grading
**Written test (individual) - 60%**
<br>
The written test consists of 40 multiple-choice questions, each with 4 possible answers, and covers all the theory presented in the lectures. There will be no Python-specific questions. The questions are designed to test your understanding by having you apply the knowledge from the lectures.
<br>

**Project report (group) - 40%**
<br>
1. The actual content of the final report is 20 pages maximum. The appendix may be 20 pages maximum. The content counts from the start of your introduction to the last sentence of your last section (conclusion). 
2. Follow the guidelines in the Writing Guidelines regarding appendices, bibliography and formatting.
3. Only use screenshots of your graphs. Make proper tables for the other components of data understanding (e.g. descriptive statistics, formulas). That is to say – no screenshots of code output. 
4. If you want to show some code to provide evidence towards fulfilling the LOs, put your code in an appendix. You may either make separate entries within that appendix with the relevant code snippets, or put your code in its entirety in one appendix and refer to the relevant lines within the text. Put the code in monospace font to discern it from regular text. Use the Word plugin for code to do this. This means also not putting in screenshots of your code.
5. You take responsibility for paraphrasing correctly (or quoting if necessary) and the ethical use of sources, including accurate source referencing following APA guidelines. Use the standard Word functionality under References for this.

---

## 1. Introduction to AI
### What do we want to achieve with AI?
- **Now**: Narrow AI
  - Automate processes
  - Improve decision-making
  - Build productivity tools
- **Later**: AGI & Superintelligence
  - Perform any task a human can
  - Solve novel problems
  - Show creativity and common sense

### Narrow AI (Machine Learning)
- Works with **available data**
- Focused on **specific tasks**
- Core functions:
  - Pattern recognition: recommendation systems, facial recognition
  - Prediction: predictive maintenance, demand forecasting
  - Content generation: ChatGPT, Copilot, DALL·E

### What makes Narrow AI work?
- **Main ingredients**:
  - Data
  - Model of the data distribution
  - Probability & inferential statistics
  - Model tuning

### AGI (Artificial General Intelligence)
| Narrow AI | AGI |
|-----------|-----|
| Task-specific | Cross-domain |
| Based on past data | Can handle novelty |
| Pattern recognition | Pattern interpretation & reasoning |
| Increases efficiency | Matches/exceeds human flexibility |

<br>

> Adding complexity or data ≠ more intelligence 增加复杂性或数据 ≠ 增强智能
> Intelligence requires reasoning beyond seen data 智能需要超越已知数据的推理能力

---

## 2. Business Understanding (CRISP-DM Phase)
### Key Questions
- **Business Objective**: What is the organization trying to achieve?
- **Business Success Criteria**: When is that considered successful? (Define clear KPIs.)
- **Data Mining Goal**: How can data science help?
- **Data Mining Success Criteria**: When is the data science effort successful? (Define a target metric such as accuracy or RMSE.)

---

## 3. Evaluation & Baseline Models
### Why use a baseline model?
- Understand the baseline performance achievable on your data
- Identify data or modeling issues
- Iterate faster with simpler models
- Give stakeholders interpretable results
- Provide a fair benchmark for advanced models

### How to define Data Mining Success Criteria?
Use one or more of the following methods:
- **Relative improvement over baseline**: "15% improvement over mean prediction"
- **Business impact as a metric**: "10% waste reduction = €500 savings/month"
- **Industry/regulatory benchmarks**: "False negatives < 5% when predicting disease"
- **Statistical significance**: "Improve math scores by +6 points to exceed the natural ±3 point fluctuation"
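
The "relative improvement over baseline" criterion can be checked with a few lines of code. The sketch below is only an illustration: the data, the variable names, and the choice of MAE as the metric are assumptions, not part of the course material.

```python
import numpy as np

# Hypothetical data: training targets, test targets, and predictions from some trained model.
y_train = np.array([20.0, 22.0, 19.0, 25.0, 24.0])
y_test = np.array([21.0, 26.0, 18.0, 23.0])
model_pred = np.array([20.5, 25.0, 19.0, 23.5])

# Baseline: always predict the mean of the training target.
baseline_pred = np.full_like(y_test, y_train.mean())

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

baseline_mae = mae(y_test, baseline_pred)
model_mae = mae(y_test, model_pred)

# Relative improvement of the model over the baseline (0.15 means 15%).
improvement = (baseline_mae - model_mae) / baseline_mae
meets_criterion = improvement >= 0.15
print(f"baseline MAE={baseline_mae:.2f}, model MAE={model_mae:.2f}, improvement={improvement:.0%}")
```

The baseline uses only the training mean, so the comparison stays fair: neither the baseline nor the model has seen the test targets.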


---

# 4. Data Understanding  
## 4.1 Preliminary Data Inspection
- Data types and structure (e.g., numerical, categorical)
- Variable distributions (check for normality)
- Correlations between variables
- Missing values
- Is the dataset suitable for modeling? What are the potential issues?

## 4.2 Distributions  
**Purpose**: Understand the shape, skewness, and normality of variables.

**Visualization methods**:
- Histogram: shows the frequency distribution of a numerical variable
- Bar plot: compares values across categorical groups
- Boxplot: shows central tendency, spread, and outliers

Normality can be checked visually with a histogram or Q-Q plot, or with a statistical test such as **Shapiro-Wilk**. Note that the **t-test** and **ANOVA** do not test for normality; they compare group means and themselves assume approximately normal data. <br>
If the distribution is not normal, apply a **log transformation**: it suits right-skewed data, shortens the tail, and reduces the influence of outliers. <br>
Alternatively, consider using **non-linear models**.
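
To see what a log transformation does to a right-skewed variable, the following sketch compares skewness before and after. The `income` data is simulated and the `skewness` helper is a hypothetical name introduced for illustration.

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean of cubed z-scores (about 0 for a symmetric distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Hypothetical right-skewed variable (e.g., incomes), drawn from a log-normal distribution.
income = rng.lognormal(mean=10, sigma=1, size=1000)

# log1p computes log(1 + x), so it also works when the data contains zeros.
income_log = np.log1p(income)

print(f"skewness before: {skewness(income):.2f}")
print(f"skewness after:  {skewness(income_log):.2f}")
```

A skewness close to 0 after the transformation suggests the variable is now roughly symmetric; a histogram or Q-Q plot is still worth checking.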

## 4.3 Outlier Detection  
**Purpose**: Identify and handle extreme values that may bias the model.

**Detection methods**:  
- **Boxplot method**:  
  - Lower bound = Q1 - 1.5 × IQR  
  - Upper bound = Q3 + 1.5 × IQR  
- **Z-score method**:  
  - Z = (x - mean) / standard deviation  
  - Z < -3 or Z > 3 is considered an outlier  
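
Both detection rules can be applied directly with NumPy. This is a minimal sketch on made-up measurements; the variable names are illustrative.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95], dtype=float)

# Boxplot (IQR) method.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

print("IQR bounds:", lower, upper, "outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)
```

Keep in mind that an extreme value inflates the mean and standard deviation it is compared against, so in small samples the Z-score rule can miss outliers that the IQR rule catches.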

**Handling strategies**:  
- Investigate the cause: data entry error or device malfunction?
- Delete: only if the value is clearly erroneous
- Retain: if it is a meaningful extreme value
- Important: document all changes to avoid introducing bias

Outlier ≠ Error: always investigate the reason.

## 4.4 Feature Scaling  
**Purpose**: Bring variables to a similar scale so that models converge better and no single variable dominates.

**Common scaling methods**:

| Method           | Description                                      | Suitable scenarios                                   |
|------------------|--------------------------------------------------|------------------------------------------------------|
| StandardScaler   | Mean = 0, Std = 1 (Z-score standardization)      | Linear models, neural networks, gradient descent     |
| MinMaxScaler     | Rescales values to the [0, 1] range              | Image processing, interpretable scales               |
| RobustScaler     | Based on median and IQR, resistant to outliers   | Data with many outliers                              |

**Note**:  
- Fit the scaler on the **training set only**, then apply it to the test set, to avoid data leakage.
- Do **not** scale binary or categorical variables.
- Scaling is **not needed** for tree-based models.
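
The "training set only" rule translates into calling `fit_transform` on the training data and plain `transform` on the test data. A minimal sketch with scikit-learn, using simulated features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrix: two numerical features on very different scales.
X = np.column_stack([rng.normal(50, 10, 200), rng.normal(5000, 2000, 200)])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test set
```

The scaled test set will not have exactly mean 0 and standard deviation 1, and that is expected: it is scaled with statistics the model was allowed to see.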

## 4.5 Correlation & Multicollinearity  
**Purpose**: Understand relationships between variables and identify redundant model inputs.

**Correlation metrics**:

| Metric           | Description                                  | Notes                                  |
|------------------|----------------------------------------------|----------------------------------------|
| Pearson's r      | Measures linear correlation, with values in [-1, 1] | Most common; for continuous variables |
| Spearman's ρ     | Measures monotonic rank correlation; suitable for ranked data | Captures non-linear monotonic patterns |
| Distance corr.   | Captures all types of dependence, including non-linear | Powerful but harder to interpret |

**Tip**: Always use scatter plots to visually confirm that a correlation is plausible.
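
The difference between Pearson and Spearman shows up on a relationship that is monotonic but not linear. The sketch below uses a made-up cubic relationship and computes Spearman's ρ as Pearson's r on the ranks.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 21, dtype=float))
y = x ** 3  # monotonic but non-linear relationship

pearson = x.corr(y)                 # Pearson's r (pandas default)
spearman = x.rank().corr(y.rank())  # Spearman's rho = Pearson's r on the ranks

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Spearman reaches 1 because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a straight line.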

### 4.5.1 Multicollinearity  
**Problems**:
- Unstable coefficients that are difficult to interpret
- Poor generalization

**Detection**:
- Correlation heatmap
- Variance Inflation Factor (VIF): VIF > 5 or 10 indicates high multicollinearity

**Solutions**:
- Drop one variable (choose the less relevant or less meaningful one)
- Combine variables (e.g., average them or use PCA)
- Apply regularization methods (e.g., Ridge or Lasso regression)
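
VIF can be computed without a dedicated library. The sketch below relies on the identity that the VIFs equal the diagonal of the inverse correlation matrix; the features are simulated and their names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: rooms is strongly tied to size, age is unrelated.
size = rng.normal(100, 20, n)
rooms = size / 25 + rng.normal(0, 0.2, n)
age = rng.normal(30, 10, n)
X = np.column_stack([size, rooms, age])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the others.
# Equivalently, the VIFs are the diagonal of the inverse correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, v in zip(["size", "rooms", "age"], vif):
    print(f"{name}: VIF = {v:.1f}")
```

Here `size` and `rooms` exceed the VIF > 5 threshold while `age` stays near 1, which matches the advice to drop or combine one of the two correlated variables.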

# 5. Data Understanding & Preparation
Machine learning flow:<br>
**Question: Is labeled data available, or can a target value be generated?**  <br><br>

Yes → **Supervised Learning**
- **Regression**: Used to predict continuous values (e.g., housing price, temperature)
  - Types: Linear Regression / Non-Linear Regression
- **Classification**: Used to predict categories (e.g., spam detection)
  - Types: Linear Classification / Non-Linear Classification

<br>

No → **Unsupervised Learning**
- **Clustering**: Grouping data without target values (e.g., customer segmentation)

## 5.1 Modeling Assumptions & Feature Selection
Row level (modeling assumptions)
- **Representative**: Samples should represent the overall population
- **IID (Independent and Identically Distributed)**: Each row should be independent and drawn from the same distribution

<br>

Column level (each variable)
- **Measurement level**: Type of variable (e.g., numerical, categorical)
- **Distribution**: Shape of the variable's distribution (e.g., skewness, outliers)

<br>

Between columns
- **Relationships**: Correlations among features and between features and the target


## 5.2 Data Type Recognition
- **Independent data**: Observations are unrelated (e.g., random sampling)
- **Autocorrelation**: Data points close in time or space are more similar (e.g., temperature)
- **Intraclass correlation**: Observations within the same group are correlated (e.g., experimental groups)

<br>
How to tell: Combine the purpose of the dataset with visualization (e.g., scatterplot, autocorrelation plot)


## 5.3 Binning
- **Definition**: Convert continuous variables into categories
- **Example**: Wind speed is divided into:
  - "low" < 10 m/s
  - "medium" 10–15 m/s
  - "high" > 15 m/s
- **Purpose**: Simplify model input for classification models
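
The wind speed example above can be written as a small mapping function. This is a sketch; `wind_category` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def wind_category(speed_ms: float) -> str:
    """Map a wind speed in m/s to the three categories from the example."""
    if speed_ms < 10:
        return "low"
    if speed_ms <= 15:
        return "medium"
    return "high"

# Hypothetical wind speed measurements in m/s.
wind = pd.Series([3.2, 9.9, 10.0, 14.5, 15.0, 22.1])
wind_bin = wind.map(wind_category)
print(wind_bin.tolist())
```

For larger datasets `pd.cut` does the same job in one call; its `right` parameter controls which side of each interval is closed, so check how the boundary values 10 and 15 are assigned.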

## 5.4 Lag Feature Engineering
- **Definition**: Use past observations as new features
  - Lag 1 = **yesterday's value**
  - Lag 7 = **the value on the same day last week**
- **When to use**: Time series modeling (e.g., predicting today's temperature)
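
In pandas, lag features are usually built with `shift`. A minimal sketch on made-up daily temperatures (the column names are assumptions):

```python
import pandas as pd

# Hypothetical daily temperatures, ordered by date.
df = pd.DataFrame(
    {"temp": [12.0, 13.5, 11.0, 14.0, 15.5, 16.0, 14.5, 13.0, 12.5, 15.0]},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

df["temp_lag1"] = df["temp"].shift(1)  # yesterday's value
df["temp_lag7"] = df["temp"].shift(7)  # same weekday last week

# The first rows have no history, so their lag values are NaN; drop them before modeling.
model_df = df.dropna()
print(model_df)
```

The data must be sorted by time and have no missing dates before shifting; otherwise "lag 1" would not mean "yesterday".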

## 5.5 Autocorrelation & Trend Visualization

### 5.5.1 Autocorrelation
- Measures the similarity between a time series and a lagged version of itself
- Bars outside the confidence interval → significant autocorrelation exists
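
The bars of an autocorrelation plot and the confidence band can be reproduced by hand. The sketch below uses a simulated series with a weekly cycle; `autocorr` is a hypothetical helper and the ±1.96/√n band is the usual large-sample approximation.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1), as used in ACF plots."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

rng = np.random.default_rng(0)
n = 365
t = np.arange(n)
# Hypothetical daily series with a weekly cycle plus noise.
series = 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, n)

bound = 1.96 / np.sqrt(n)  # approximate 95% confidence bound under no autocorrelation
for lag in (1, 3, 7):
    r = autocorr(series, lag)
    print(f"lag {lag}: r = {r:+.2f}, significant = {abs(r) > bound}")
```

A strong positive value at lag 7 is what motivates a "same day last week" lag feature for data like this.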
  
### 5.5.2 Trend Visualization
- **Histogram**: Shows the frequency distribution of a numerical variable
- **Bar plot**: Compares values across categorical groups
- **Box plot**: Visualizes distribution and outliers
- **Scatterplot**: Shows the relationship between two variables
- **Line plot**: Suitable for time series
- **Heatmap**: Displays correlation or matrix values

## 5.6 Categorical Encoding
### 5.6.1 Encoding Methods
| Method               | Explanation                | Suitable Models                              |
|----------------------|----------------------------|----------------------------------------------|
| One-hot encoding     | One column per category    | Non-linear models                            |
| Dummy encoding       | n categories → n-1 columns | Linear models (avoids perfect collinearity)  |
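**Python tools**:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
dummies = pd.get_dummies(df, drop_first=True)  # dummy encoding; df is your DataFrame
encoder = OneHotEncoder()  # one-hot encoding; call encoder.fit_transform(...) on the categorical columns
```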

## 5.7 Merging

| Join Type  | Result                                                                  |
|------------|-------------------------------------------------------------------------|
| Inner Join | Keeps only the keys present in both tables                              |
| Left Join  | Keeps all rows of the left table plus matching rows from the right      |
| Right Join | Keeps all rows of the right table plus matching rows from the left      |
| Full Join  | Keeps all rows from both tables; unmatched fields are filled with NaN   |
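
In pandas the four join types map to the `how` argument of `DataFrame.merge` (a full join is called `"outer"`). A minimal sketch with two made-up tables:

```python
import pandas as pd

# Hypothetical tables sharing the key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

inner = left.merge(right, on="id", how="inner")       # ids 2, 3
left_join = left.merge(right, on="id", how="left")    # ids 1, 2, 3
right_join = left.merge(right, on="id", how="right")  # ids 2, 3, 4
full = left.merge(right, on="id", how="outer")        # ids 1, 2, 3, 4
```

After a left, right, or full join, check the new NaN values: they are missing data created by the merge itself and feed directly into the missing-value handling discussed next.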

## 5.8 Missing Values

### 5.8.1 Types of Missingness
| Type | Meaning | Example |
|------|---------|---------|
| MCAR | Missing Completely at Random: missingness is unrelated to any variable | Data lost due to a hardware failure |
| MAR  | Missing At Random: missingness is related to other observed variables | Younger people are less willing to report their screen time |
| MNAR | Missing Not At Random: missingness is related to the missing value itself | Heavy drug users conceal how often they use |

### 5.8.2 Why It Matters
- May introduce bias
- Reduces statistical power
- Affects model training

### 5.8.3 Imputation Methods
| Method | When to Use | Advantages | Disadvantages |
|--------|-------------|------------|---------------|
| Deletion (`dropna`) | MCAR | Simple | Information loss |
| Mean/median imputation | MCAR/MAR | Fast | Distorts the distribution |
| Forward/backward fill | Time series | Preserves the trend | Unsuitable for abrupt changes |
| Hot-deck/cold-deck imputation | Similar records are available to borrow values from | Preserves the distribution | May introduce bias |
| Interpolation | A trend can be inferred | Several methods available | Unsuitable for data with jumps |
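
Several of the methods in the table are one-liners in pandas. The sketch below applies them to a made-up series with two gaps so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sensor readings with two gaps.
s = pd.Series([10.0, np.nan, 14.0, 16.0, np.nan, 20.0])

dropped = s.dropna()              # deletion
mean_filled = s.fillna(s.mean())  # mean imputation
ffilled = s.ffill()               # forward fill
interpolated = s.interpolate()    # linear interpolation (pandas default)

print(pd.DataFrame({"mean": mean_filled, "ffill": ffilled, "interp": interpolated}))
```

On this trending series the mean fills both gaps with 15, which flattens the trend, while interpolation follows it; which method is appropriate depends on why the values are missing.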

# 6. Regression Models

CRISP-DM is a framework for data science projects with six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
- Data Understanding: Explore and summarize the data (e.g., check patterns, outliers, or correlations) to prepare for modeling. The patterns it reveals, such as linear or non-linear relationships, help you select an appropriate model.
- Modeling: In this phase, you need to:
  - Split the data: Divide the data into training and test sets to establish a test design. This ensures the model is tested on unseen data, simulating real-world use.
  - Build and train models: Train the model on the training set and make predictions on the test set.

Supervised learning develops predictive models from input (feature) and output (target) data. It covers two main tasks: classification and regression.
- Supervised learning: The model learns from data with known correct answers (the target). For example, given data on house size (input) and price (output), the model learns to predict price from size.
- Classification: Predicts categories (e.g., "Will it rain?" → Yes/No).
- Regression: Predicts numerical values such as sales or temperature (e.g., "What is the house price?" → 300,000).

Core elements of model training:
- X (features / independent variables): The model inputs, which are known to us, such as the columns of a DataFrame (e.g., house size, age).
- y (target / dependent variable): The model output that we want to predict, such as house price.

Training process:
1. Initialize the model (e.g., with random weights).
2. Predict the y values.
3. Calculate the error (the difference between predicted and actual values).
4. Update the model parameters (e.g., adjust the weights).
5. Repeat steps 2-4 until a stopping condition is met (e.g., the error is small enough or no longer improves).
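
The training process above can be sketched as gradient descent on a one-feature linear model. Everything here is an illustrative assumption: the simulated data, the learning rate, and the stopping rule (stop when the MSE no longer improves).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data generated from y = 2 + 3x plus noise.
X = rng.uniform(0, 1, 100)
y = 2 + 3 * X + rng.normal(0, 0.1, 100)

b0, b1 = rng.normal(size=2)  # initialize the model with random weights
learning_rate = 0.5
prev_mse = np.inf

for step in range(10_000):
    y_pred = b0 + b1 * X                 # predict y
    error = y_pred - y                   # calculate the error
    mse = np.mean(error ** 2)
    if prev_mse - mse < 1e-12:           # stopping condition: error no longer improves
        break
    prev_mse = mse
    # Update the parameters along the negative gradient of the MSE.
    b0 -= learning_rate * 2 * np.mean(error)
    b1 -= learning_rate * 2 * np.mean(error * X)

print(f"stopped after {step} steps: b0 = {b0:.2f}, b1 = {b1:.2f}, MSE = {mse:.4f}")
```

Libraries such as scikit-learn solve linear regression directly rather than with this loop; the loop is shown because the same predict, measure, update cycle underlies most model training.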

Assumptions before modeling:<br>
Before modeling, check whether the data meets the conditions (assumptions) of the chosen model; for example, linear regression assumes a linear relationship between the features and the target. These conditions ensure that the results are reliable and can be interpreted correctly.<br>
If the assumptions are violated, you may need to switch to another model or state in the report that the results may be distorted.

## 6.1 Linear Regression
Linear regression finds the straight line that lies closest to all data points by minimizing the errors (the differences between predicted and actual values). It is simple and interpretable, which makes it a good starting point for numerical prediction.<br>
The standard simple linear regression formula is: y = β0 + β1 * X + ϵ
- y: the target (dependent variable); the model's prediction is ŷ = β0 + β1 * X.
- β0: intercept, the value of y when X = 0 (baseline).
- β1: slope, how much y changes when X increases by 1.
- X: independent variable.
- ϵ: error term (the difference between the actual value and the value given by the line).

Multiple linear regression and coefficient interpretation: With several independent variables (features), the model fits a regression plane instead of a line. Each coefficient indicates how strongly that feature affects the target, holding the other features constant. Multiplying a feature's coefficient by that feature's standard deviation puts the coefficients on a comparable scale and gives a measure of the feature's "actual impact" on y.

Types of regularized linear models (Lasso and Ridge):
1. Lasso regression (L1 regularization): Shrinks the coefficients of unimportant features to exactly 0, which performs feature selection. Suitable when only a few features are useful.
2. Ridge regression (L2 regularization): Shrinks the coefficients without setting them to 0, which reduces the risk of overfitting. Suitable when all features may be useful.
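
The contrast between the two penalties is easy to observe with scikit-learn. In this sketch the data is simulated so that only the first two of five features matter; the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# Hypothetical data: 5 features, but only the first two influence y.
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
```

Lasso sets the three irrelevant coefficients to exactly zero, while Ridge keeps all five coefficients non-zero and only shrinks them. In practice `alpha` is tuned, for example with cross-validation.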

### 6.1.1 Linear Regression Assumptions
Linear regression relies on four key assumptions:

| Assumption | Meaning | Check Method | Fix Method | Example |
|----------------|-------------|---------------|---------------|---------------|
| **Linear relationship** | Each feature's relationship with the target must be linear. | Scatter plot / pairplot | Use non-linear models | If sales increase steadily with ad spend, the relationship is linear; if sales plateau after a point, it is non-linear. |
| **Independence of observations** | Each data point (row) should not affect the others. Time series data (e.g., daily sales) often violates this because today's value may depend on yesterday's. | Durbin-Watson test | Add lag variables / use time series models |  |
| **No multicollinearity** | Features should not be strongly correlated with each other (e.g., house size and number of rooms are highly correlated and carry similar information). | Heatmap, VIF (values above 5 indicate a problem) | Remove or combine variables | When predicting car prices, if engine size and horsepower are highly correlated, remove one or combine them. |
| **Normality of residuals / homoscedasticity** | Residuals are the differences between actual and predicted values. Normality means they follow a bell-shaped (normal) distribution; homoscedasticity means their variance is constant across all predicted values (no funnel shape in the residual plot). | Residual plot, Q-Q plot | Transform variables (e.g., log-transform y) or change models | If the residual plot shows a funnel shape (larger errors at higher predictions), the variance is not constant; transform y or switch to a different model. |
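
The Durbin-Watson check from the table is simple enough to compute by hand. The sketch below defines a hypothetical `durbin_watson` helper and applies it to two simulated residual series, one independent and one where each value depends on the previous one.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: about 2 means no first-order autocorrelation,
    values toward 0 mean positive and values toward 4 negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(0)
independent = rng.normal(size=500)          # hypothetical residuals without dependence
trending = np.cumsum(rng.normal(size=500))  # hypothetical residuals that build on the previous value

print(f"independent: {durbin_watson(independent):.2f}")
print(f"trending:    {durbin_watson(trending):.2f}")
```

A value far below 2 for your model's residuals suggests the independence assumption is violated and that lag variables or a time series model are worth trying.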

## 6.2 Non-linear Models
If X and y are not linearly related (e.g., sales increase rapidly at first and then slow down), use one of the following models:
- Decision tree regression: Simple but prone to overfitting.
- Random forest regression: Averages the predictions of many trees, which makes it more robust and reduces overfitting.
- Gradient boosting regression: Builds trees sequentially, each one correcting the errors of the previous trees; usually more accurate.

## 6.3 Time Series Models
Standard models such as linear regression are unsuitable for time series because the data points depend on each other over time (autocorrelation). Time series data, meaning data ordered by time (e.g., daily stock prices, weather, sales), requires special models:
- Moving average: Predicts the next period from the average of the previous periods (e.g., use the average sales of the last 7 days to predict tomorrow), which smooths out noise.
- (S)ARIMA(X): Advanced model family for forecasting trends and seasonality.
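
The moving average forecast takes only a few lines in pandas. A sketch on made-up daily sales figures:

```python
import pandas as pd

# Hypothetical daily sales, oldest first.
sales = pd.Series([100, 120, 90, 110, 130, 105, 115, 125, 95, 140], dtype=float)

window = 7
# Forecast for tomorrow: the mean of the last 7 observed days.
forecast = sales.tail(window).mean()

# Smoothed history: each value is the mean of that day and the 6 days before it.
smoothed = sales.rolling(window).mean()
print(f"forecast for next day: {forecast:.1f}")
```

A moving average like this is also a reasonable baseline model against which to compare a (S)ARIMA(X) forecast.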

## 6.4 Regression Metrics

| Metric | Description | Applicable Scenarios |
|--------------------------|---------------------|---------------------------|
| **MAE (Mean Absolute Error)** | The average absolute difference between predicted and actual values, in the same units as the target. All errors are weighted equally. If MAE = 10,000 for house prices, predictions are off by 10,000 on average. | Everyday use |
| **MSE / RMSE (Mean Squared Error / Root Mean Squared Error)** | MSE squares the errors and averages them; RMSE takes the square root of MSE, which returns the result to the units of the target. Both emphasize large errors, so they suit cases where large errors are costly (e.g., drug dosage prediction). | High-risk areas (e.g., healthcare) |
| **R² (Coefficient of Determination)** | The proportion of variance in the target explained by the model. This interpretation is only reliable for **linear models**. | Closer to 1 is better (note: for non-linear models R² may be hard to interpret, and it turns negative when a model predicts worse than simply using the mean) |
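
The metrics in the table can be computed directly from their definitions. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted house prices (in thousands).
y_true = np.array([300.0, 250.0, 400.0, 350.0])
y_pred = np.array([310.0, 240.0, 380.0, 350.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE = {mae:.1f}, MSE = {mse:.1f}, RMSE = {rmse:.2f}, R² = {r2:.3f}")
```

RMSE (about 12.2) is larger than MAE (10) here because the single error of 20 is weighted more heavily once squared.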

## 6.5 Data Splitting and Validation (Train-Test Split & Cross-Validation)

**Train-Test Split**:
- Training set: Used to fit the model (e.g., 80%).
- Test set: Used to evaluate the model on unseen data (e.g., 20%).

For example, to predict exam scores, train on 80% of the student data and test on the remaining 20% to check whether the model can predict the scores of new students.

**Cross-Validation (K-Fold)**:
K-fold cross-validation divides the data into K parts, trains on K-1 parts, tests on the remaining part, and repeats this K times so that each part serves as the test set once. The results are averaged to obtain a reliable performance estimate. It is especially suitable for small datasets because every observation is used for both training and testing, and the averaged score makes overfitting easier to detect. For example, with 100 house prices, 5-fold cross-validation creates 5 groups of 20; each round trains on 80 and tests on 20, and the errors of the 5 rounds are averaged to evaluate the model.
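
The 100-house example can be reproduced with scikit-learn's `KFold` and `cross_val_score`. The data below is simulated, and linear regression with MAE is just one possible model and metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Hypothetical data: 100 houses, price depends linearly on size.
size = rng.uniform(50, 200, size=(100, 1))
price = 1000 * size[:, 0] + rng.normal(0, 10_000, 100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train), len(test)) for train, test in kfold.split(size)]
print(fold_sizes)  # five folds of 80 training and 20 test rows

# scikit-learn reports errors as negative scores so that higher is always better.
scores = cross_val_score(LinearRegression(), size, price, cv=kfold,
                         scoring="neg_mean_absolute_error")
print(f"mean MAE across folds: {-scores.mean():.0f}")
```

Shuffling is appropriate here because the rows are independent; it should not be used for time-ordered data.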

**Time Series Cross-Validation**:
For time series or other non-i.i.d. (not independent and identically distributed) data, use TimeSeriesSplit to split the data chronologically instead of randomly. Time series data (e.g., stock prices) has temporal dependencies, so random splitting causes data leakage (using future data to predict the past). TimeSeriesSplit always trains on earlier data and tests on the block that follows (e.g., train on days 1–80 and test on days 81–100, then train on days 1–100 and test on days 101–120). This gives realistic performance estimates for time-dependent data and prevents misleadingly good results.
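
The day ranges in the example correspond to `TimeSeriesSplit(n_splits=2, test_size=20)` on 120 observations. A minimal sketch, with a made-up array of day numbers standing in for the real data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series of 120 daily observations, already sorted by time.
days = np.arange(1, 121)

tscv = TimeSeriesSplit(n_splits=2, test_size=20)
splits = []
for train_idx, test_idx in tscv.split(days):
    train_days, test_days = days[train_idx], days[test_idx]
    splits.append((train_days[0], train_days[-1], test_days[0], test_days[-1]))
    print(f"train: days {train_days[0]}-{train_days[-1]}, test: days {test_days[0]}-{test_days[-1]}")
```

Each training window ends right before its test block, so the model never sees data from the future of the period it is evaluated on.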

## 6.6 Comparison of Common Regression Models (Summary)

| Model Type | Advantages | Applicable Scenarios |
|----------------|-----------|------------|
| **Linear Regression** | Simple and interpretable | Clear linear relationships, high-quality data |
| **Ridge / Lasso** | Handle correlated features: Ridge shrinks the influence of correlated features, Lasso sets unimportant features to zero. Regularization helps prevent overfitting | Many features that may be correlated |
| **Random Forest** | Suitable for non-linear data; combines multiple decision trees; robust against overfitting | Complex relationships between features |
| **SVM Regression** | Suitable for high-dimensional, non-linear data | Feature dimensions far exceed sample size |
| **Logistic Regression** | A classification model despite its name: it predicts the probability of an event rather than a continuous value | Binary classification / probability output problems |