# Grading
**Written test (individual) - 60%**
<br>
The written test consists of 40 multiple-choice questions, each with 4 possible answers, and covers all the theory presented in the lectures. There will be no Python-specific questions. The questions are designed to test your understanding by having you apply the knowledge obtained from the lectures.
<br>

**Project report (group) - 40%**
<br>
1. The main content of the final report may be at most 20 pages, and the appendix may also be at most 20 pages. The main content runs from the start of your introduction to the last sentence of your last section (the conclusion).
2. Follow the guidelines in the Writing Guidelines regarding appendices, bibliography and formatting.
3. Use screenshots only for your graphs. Make proper tables for the other components of data understanding (e.g. descriptive statistics, formulas). In other words, no screenshots of code output.
4. If you want to show code as evidence towards fulfilling the LOs, put it in an appendix. You may either make separate entries within that appendix for the relevant code snippets, or put your code in its entirety in one appendix and refer to the relevant lines within the text. Set the code in a monospace font to distinguish it from regular text, using the Word plugin for code. This also means no screenshots of your code.
5. You take responsibility for paraphrasing correctly (or quoting if necessary) and the ethical use of sources, including accurate source referencing following APA guidelines. Use the standard Word functionality under References for this.

---

## 1. Introduction to AI
### What do we want to achieve with AI?
- **Now**: Narrow AI
  - Automate processes
  - Improve decision-making
  - Build productivity tools
- **Later**: AGI & Superintelligence
  - Perform any task a human can
  - Solve novel problems
  - Show creativity and common sense

### Narrow AI (Machine Learning)
- Works with **available data**
- Focused on **specific tasks**
- Core functions:
  - Pattern recognition: recommendation systems, facial recognition
  - Prediction: maintenance, demand forecasting
  - Content generation: ChatGPT, Copilot, DALL·E

### What makes Narrow AI work?
- **Main ingredients**:
  - Data
  - Model of the data distribution
  - Probability & inferential statistics
  - Model tuning

### AGI (Artificial General Intelligence)
| Narrow AI | AGI |
|-----------|-----|
| Task-specific | Cross-domain |
| Based on past data | Can handle novelty |
| Pattern recognition | Pattern interpretation & reasoning |
| Increases efficiency | Matches/exceeds human flexibility |

<br>

> Adding complexity or data ≠ more intelligence.
>
> Intelligence requires reasoning beyond seen data.

---

## 2. Business Understanding (CRISP-DM Phase)
### Key Questions
- **Business Objective**: What is the organization trying to achieve?
- **Business Success Criteria**: When is that considered successful? (Define clear KPIs.)
- **Data Mining Goal**: How can data science help?
- **Data Mining Success Criteria**: When is the data science effort successful? (Specify a metric such as accuracy or RMSE.)

---

## 3. Evaluation & Baseline Models
### Why use a baseline model?
- Understand the basic performance achievable on your data
- Identify data or modeling issues
- Iterate faster with simpler models
- Give stakeholders interpretable results
- Provide a fair benchmark for advanced models

### How to define Data Mining Success Criteria?
Use one or more of the following methods:
- **Relative improvement over baseline**: "15% improvement over mean prediction"
- **Business impact as a metric**: "10% waste reduction = €500 savings/month"
- **Industry/regulatory benchmarks**: "False negatives < 5% when predicting disease"
- **Statistical significance**: "Improve math scores by at least 6 points, which is beyond the natural ±3-point fluctuation"
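The "relative improvement over baseline" criterion can be sketched as follows. This is a minimal illustration, not the course's prescribed method: the data, the `model_pred` values, and the helper names `rmse` and `relative_improvement` are all hypothetical, and the 15% threshold is taken from the example above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def relative_improvement(baseline_error, model_error):
    """Fraction by which the model's error is lower than the baseline's."""
    return (baseline_error - model_error) / baseline_error

# Hypothetical data, for illustration only.
y_train = np.array([10.0, 12.0, 14.0, 16.0])
y_test = np.array([11.0, 13.0, 15.0])
model_pred = np.array([11.5, 13.5, 14.0])  # stand-in for a trained model's predictions

# Baseline model: always predict the mean of the training target.
baseline_pred = np.full_like(y_test, y_train.mean())

baseline_rmse = rmse(y_test, baseline_pred)
model_rmse = rmse(y_test, model_pred)
improvement = relative_improvement(baseline_rmse, model_rmse)

# Success criterion from the example: at least 15% better than mean prediction.
meets_criterion = improvement >= 0.15
print(f"baseline={baseline_rmse:.3f} model={model_rmse:.3f} improvement={improvement:.1%}")
```

The baseline mean is computed on the training data only, so the comparison on the test set stays fair for both models.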


---

# 4. Data Understanding  
## 4.1 Preliminary Data Inspection
- Data types and structure (e.g., numerical, categorical)
- Variable distributions (check for normality)
- Correlations between variables
- Missing values
- Is the dataset suitable for modeling? What are the potential issues?

## 4.2 Distributions  
**Purpose**: Understand the shape, skewness, and normality of variables.

**Visualization methods**:  
- Histogram: shows the frequency distribution of a numerical variable
- Bar plot: compares values across the groups of a categorical variable
- Boxplot: shows central tendency, spread, and outliers

Statistical tests such as **Shapiro-Wilk** or **Kolmogorov-Smirnov** can be used to test for normality; a **t-test** or **ANOVA** compares group means and assumes approximate normality rather than testing for it. <br> 
If the distribution is not normal, apply a **log transformation**: it suits right-skewed data, shortening the tail and reducing the influence of outliers.<br>   
Alternatively, consider using **non-linear models**.  
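As a rough sketch of checking normality and applying a log transformation, the example below uses the Shapiro-Wilk test from SciPy. The choice of that test, the synthetic log-normal data, and the use of `np.log1p` (rather than a plain log) are assumptions made for illustration; they are not taken from the lectures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed variable (log-normal), for illustration only.
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
# W close to 1 and a large p-value are consistent with normality.
w_raw, p_raw = stats.shapiro(x)

# log1p computes log(1 + x), which also tolerates zeros in the data.
x_log = np.log1p(x)
w_log, p_log = stats.shapiro(x_log)

skew_raw = stats.skew(x)
skew_log = stats.skew(x_log)
print(f"raw: W={w_raw:.3f}, skew={skew_raw:.2f}")
print(f"log: W={w_log:.3f}, skew={skew_log:.2f}")
```

A histogram or Q-Q plot of `x` and `x_log` is a useful visual companion to the test, since with large samples even small deviations from normality yield small p-values.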

## 4.3 Outlier Detection  
**Purpose**: Identify and handle extreme values that may bias the model.

**Detection methods**:  
- **Boxplot method**:  
  - Lower bound = Q1 - 1.5 × IQR  
  - Upper bound = Q3 + 1.5 × IQR  
- **Z-score method**:  
  - Z = (x - mean) / standard deviation  
  - Z < -3 or Z > 3 is considered an outlier  

**Handling strategies**:  
- Investigate the cause: data entry error or device malfunction?
- Delete: only if clearly erroneous
- Retain: if it is a meaningful extreme value
- Important: document all changes to avoid introducing bias

Outlier ≠ Error: always investigate the reason.
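Both detection rules above can be sketched in a few lines of NumPy. The function names and the sample data are hypothetical; the thresholds (1.5 × IQR and |Z| > 3) are the ones listed in these notes. The code only flags candidates, and deciding what to do with them remains a manual step.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Boolean mask of points outside the boxplot bounds."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_bounds(values, k)
    return (values < lower) | (values > upper)

def zscore_outliers(values, threshold=3.0):
    """Boolean mask of points whose |z-score| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Hypothetical measurements: 20 ordinary values plus one extreme value.
data = np.concatenate([np.tile([11.0, 12.0, 13.0, 14.0, 15.0], 4), [100.0]])

print(iqr_bounds(data))
print(data[iqr_outliers(data)])
print(data[zscore_outliers(data)])
```

On very small samples the Z-score rule can miss even an obvious extreme value, because |Z| cannot exceed √(n − 1) when the standard deviation is computed from the same n points; the boxplot rule does not have this limitation.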

## 4.4 Feature Scaling  
**Purpose**: Bring variables to a similar scale so that models converge better and no single variable dominates.

**Common scaling methods**:

| Method           | Description                                  | Suitable scenarios                                                               |
|------------------|----------------------------------------------|----------------------------------------------------------------------------------|
| StandardScaler   | Mean = 0, Std = 1 (Z-score standardization)  | Linear models, neural networks, gradient-descent-based algorithms                |
| MinMaxScaler     | Rescales values to the [0, 1] range          | Image processing, interpretable scales (values must map back to business units)  |
| RobustScaler     | Based on the IQR, resistant to outliers      | Data with many outliers                                                          |

**Note**:  
- Always fit the scaler on the **training set only** to avoid data leakage.
- Do **not** scale binary or categorical variables.
- Scaling is **not needed** for tree-based models.
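The "training set only" rule can be illustrated with scikit-learn's `StandardScaler`. The synthetic feature matrix and the 80/20 split are assumptions for the sake of the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical numeric features, for illustration only.
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses those training statistics; the test set is never used for fitting.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0))
```

The scaled test set will generally not have mean exactly 0 and standard deviation exactly 1, which is expected: it is expressed in the training set's units.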

## 4.5 Correlation & Multicollinearity  
**Purpose**: Understand relationships between variables and identify redundant model inputs.

**Correlation metrics**:

| Metric           | Description                                  | Notes                                  |
|------------------|----------------------------------------------|----------------------------------------|
| Pearson's r      | Measures linear correlation, in [-1, 1]      | Most common; for continuous variables  |
| Spearman's ρ     | Measures monotonic rank correlation          | Captures non-linear monotonic patterns |
| Distance corr.   | Captures all types of dependencies           | Powerful but harder to interpret       |

**Tip**: Always use scatter plots to visually confirm that a correlation makes sense.
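The difference between Pearson's r and Spearman's ρ shows up clearly on a relationship that is monotonic but not linear. The sketch below uses pandas' `DataFrame.corr`; the exponential toy data is an assumption chosen to make the contrast visible.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
# y increases with x in every step, but far from linearly.
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson r    = {pearson:.3f}")
print(f"Spearman rho = {spearman:.3f}")
```

Spearman's ρ works on ranks, so any strictly increasing relationship scores 1, while Pearson's r is pulled down by the curvature.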

### 4.5.1 Multicollinearity  
**Problems**:  
- Unstable coefficients that are difficult to interpret
- Poor generalization

**Detection**:  
- Correlation heatmap
- Variance Inflation Factor (VIF): a VIF above 5 (or, by a looser rule of thumb, 10) indicates high multicollinearity

**Solutions**:  
- Drop one of the variables (the less relevant or less meaningful one)
- Combine variables (e.g., average them or use PCA)
- Apply regularization methods (e.g., Ridge or Lasso regression)
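The VIF of a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the other features. The following is a minimal NumPy sketch of that definition; the helper name `vif` and the synthetic data are hypothetical, and a library implementation may be preferable in a real project.

```python
import numpy as np

def vif(X):
    """VIF per column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        result[j] = 1.0 / (1.0 - r2)
    return result

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  # nearly a copy of `a`
c = rng.normal(size=200)                 # unrelated to `a` and `b`

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 1))))
```

Here `a` and `b` should both show a very high VIF, while `c` stays close to 1; dropping either `a` or `b` would resolve the redundancy.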

# 5. Data Understanding & Preparation
Machine learning type flowchart:<br>
**Question: Is labeled data available, or can a target value be generated?**  <br><br>

Yes → **Supervised Learning**
- **Regression**: Used to predict continuous values (e.g., housing price, temperature)
  - Types: Linear Regression / Non-Linear Regression
- **Classification**: Used to predict categories (e.g., spam detection)
  - Types: Linear Classification / Non-Linear Classification

<br>

No → **Unsupervised Learning**
- **Clustering**: Grouping data without target values (e.g., customer segmentation)

## 5.1 Modeling Assumptions & Feature Selection
**Modeling assumptions**:
- **Representative**: Samples should represent the overall population
- **IID (Independent and Identically Distributed)**: Each row should be independent and drawn from the same distribution

<br>

**Row level**:
- **Measurement level**: Types of variables (e.g., numerical, categorical)
- **Distribution**: Shape of each variable's distribution (e.g., skewness, outliers)

<br>

**Column level**:
- **Relationships**: Correlations between features, and between features and the target


## 5.2 Data Type Recognition
- **Independent data**: Observations are unrelated (e.g., random sampling)
- **Autocorrelation**: Data points close in time or space are more similar (e.g., temperature)
- **Intraclass correlation**: Observations within the same group are correlated (e.g., experimental groups)

<br>
How to identify: combine the purpose of the dataset with visualization (e.g., scatterplot, autocorrelation plot).


## 5.3 Binning
- **Definition**: Convert continuous variables into categories
- **Example**: Wind speed is divided into:
  - "low" < 10 m/s
  - "medium" 10–15 m/s
  - "high" > 15 m/s
- **Purpose**: Simplify model input for classification models
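The wind-speed example can be sketched with `np.select`, which makes the boundary handling explicit. The function name `bin_wind_speed` is hypothetical, and the code assumes that exactly 10 and exactly 15 m/s both count as "medium", which is how the ranges above read.

```python
import numpy as np
import pandas as pd

def bin_wind_speed(speed):
    """Map wind speeds (m/s) to the ordered categories low / medium / high."""
    speed = np.asarray(speed, dtype=float)
    # Conditions are checked in order; the first one that holds wins.
    conditions = [speed < 10, speed <= 15, speed > 15]
    labels = np.select(conditions, ["low", "medium", "high"], default="unknown")
    # Labels outside the listed categories (e.g. from missing speeds) become NaN.
    return pd.Categorical(labels, categories=["low", "medium", "high"], ordered=True)

binned = bin_wind_speed([5, 10, 12.5, 15, 15.1])
print(list(binned))
```

`pd.cut` is a common alternative, but it treats every bin edge the same way (all right-closed or all left-closed), so it cannot reproduce "10 to 15 inclusive" without adjusting the edges.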

## 5.4 Lag Feature Engineering
- **Definition**: Use past observations as new features
  - Lag 1 = **yesterday's value**
  - Lag 7 = **the value on the same day last week**
- **When to use**: Time series modeling (e.g., predicting today's temperature)
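With pandas, lag features are typically built with `Series.shift`. The sketch below assumes one row per day with no gaps, so that shifting by 1 row really means "yesterday"; the temperatures and column names are made up for illustration.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10, freq="D")
# Hypothetical daily temperatures, for illustration only.
df = pd.DataFrame(
    {"temp": [5.0, 6.0, 7.0, 6.5, 8.0, 9.0, 8.5, 7.5, 6.0, 5.5]},
    index=dates,
)

df["temp_lag1"] = df["temp"].shift(1)  # yesterday's value
df["temp_lag7"] = df["temp"].shift(7)  # value on the same day last week

# The first rows have no history, so their lag features are NaN; drop them before modeling.
model_df = df.dropna()
print(model_df)
```

Each lag costs as many rows at the start of the series as its length, so long lags on short series leave little data to train on.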

## 5.5 Autocorrelation & Trend Visualization

### 5.5.1 Autocorrelation
- Measures the similarity between a time series and a lagged version of itself
- Bars outside the confidence interval of the autocorrelation plot → significant autocorrelation exists
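The numbers behind an autocorrelation plot can be sketched directly. The example below computes the sample autocorrelation at a given lag and compares it with an approximate 95% band of ±1.96/√n, which is the usual white-noise approximation; whether the course's plotting tool uses exactly this band is an assumption. The function names and the synthetic weekly signal are hypothetical.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

def significant_lags(series, max_lag, z=1.96):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = z / np.sqrt(len(series))
    return [
        lag for lag in range(1, max_lag + 1)
        if abs(autocorrelation(series, lag)) > bound
    ]

# Hypothetical signal with a 7-day cycle, 10 full weeks long.
t = np.arange(70)
weekly = np.sin(2 * np.pi * t / 7)

print(round(autocorrelation(weekly, 7), 3))
print(significant_lags(weekly, max_lag=10))
```

A strong bar at lag 7 like this one suggests that a "same day last week" lag feature is worth adding.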
  
### 5.5.2 Trend Visualization
- **Histogram**: Shows the frequency distribution of a numerical variable
- **Bar plot**: Compares values across the groups of a categorical variable
- **Box plot**: Visualizes distribution and outliers
- **Scatterplot**: Shows the relationship between two variables
- **Line plot**: Suitable for time series
- **Heatmap**: Displays correlations or other matrix values

## 5.6 Categorical Encoding
### 5.6.1 Encoding Methods
| Method               | Explanation                 | Suitable Models                            |
|----------------------|-----------------------------|--------------------------------------------|
| One-hot encoding     | One column per category     | Non-linear models                          |
| Dummy encoding       | n categories → n-1 columns  | Linear models (avoids multicollinearity)   |
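**Python tools**:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
pd.get_dummies(df, drop_first=True)  # dummy encoding: n categories -> n-1 columns (df is your DataFrame)
OneHotEncoder()  # one-hot encoding: one column per category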
``` 
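To make the difference between the two encodings concrete, here is a small sketch using `pd.get_dummies` on a made-up `color` column. The DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with three categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one column per category (3 columns here).
one_hot = pd.get_dummies(df, columns=["color"])

# Dummy encoding: the first category (alphabetically, "blue") is dropped,
# leaving n - 1 = 2 columns; a row of all zeros then means "blue".
dummy = pd.get_dummies(df, columns=["color"], drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```

With all three one-hot columns plus an intercept, the columns always sum to 1, which is the perfect multicollinearity that dummy encoding avoids in linear models.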

## 5.7 Merging

| Join Type              | Behavior                                                              |
|------------------------|-----------------------------------------------------------------------|
| Inner Join             | Keeps only the keys present in both tables                            |
| Left Join              | Keeps all rows of the left table, plus matches from the right table   |
| Right Join             | Keeps all rows of the right table, plus matches from the left table   |
| Full Join              | Keeps all rows from both tables; missing values are filled with NaN   |
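The four join types map onto the `how` parameter of pandas' `DataFrame.merge`, where a full join is called `"outer"`. The two small tables below are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing the key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

inner = left.merge(right, on="id", how="inner")       # ids 2, 3
left_join = left.merge(right, on="id", how="left")    # ids 1, 2, 3
right_join = left.merge(right, on="id", how="right")  # ids 2, 3, 4
full = left.merge(right, on="id", how="outer")        # ids 1, 2, 3, 4

print(full)
```

Checking the row count and the number of NaN values after each merge is a quick way to confirm that the join behaved as intended.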

## 5.8 Missing Values

### 5.8.1 Types of Missingness
| Type         | Explanation                                                                 | Example                                               |
|--------------|-----------------------------------------------------------------------------|-------------------------------------------------------|
| MCAR         | Missing Completely at Random                                                | Data lost due to a hardware failure                   |
| MAR          | Missing at Random: missingness is related to other variables                | Young people are less willing to report screen time   |
| MNAR         | Missing Not at Random: missingness is related to the missing value itself   | Heavy drug users conceal how often they use           |

### 5.8.2 Why It Matters
- May introduce bias
- Reduces statistical power
- Affects model training

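### 5.8.3 Imputation Methods
| Method                          | When to Use                                      | Advantages                  | Disadvantages                    |
|---------------------------------|--------------------------------------------------|-----------------------------|----------------------------------|
| Deletion (`dropna`)             | MCAR                                             | Simple                      | Information loss                 |
| Mean/median imputation          | MCAR/MAR                                         | Fast                        | Distorts the distribution        |
| Forward/backward fill           | Time series                                      | Preserves the trend         | Unsuitable for abrupt changes    |
| Hot-deck/cold-deck imputation   | Similar records are available to borrow from     | Preserves the distribution  | May introduce bias               |
| Interpolation                   | A trend can be inferred                          | Several methods available   | Unsuitable for data with jumps   |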
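Several of the methods in the table are one-liners in pandas. The sketch below applies them to a short made-up series; hot-deck and cold-deck imputation are left out because they depend on how "similar" records are defined for a given dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()                          # deletion
mean_filled = s.fillna(s.mean())              # mean imputation (mean of 1, 3, 5)
forward_filled = s.ffill()                    # carry the last observed value forward
interpolated = s.interpolate(method="linear") # straight line between neighbours

print(pd.DataFrame({
    "original": s,
    "mean": mean_filled,
    "ffill": forward_filled,
    "interp": interpolated,
}))
```

In a modeling pipeline, statistics such as the mean should be computed on the training set only and then reused for the test set, for the same data-leakage reason as with scaling.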