# Data Science Notes 数据科学笔记

## 1. Introduction 介绍

### 1.1 Data and Dataset 数据与数据集

#### 1.1.1 What is Data? 什么是数据？
Data refers to facts, statistics, or values that convey information. It can be structured (tables, databases) or unstructured (slides, documentation).  
数据是事实、统计数据或数值，能够传达信息，可以是**结构化**（表格、数据库）或**非结构化**（幻灯片、文档）。

#### 1.1.2 What is a Dataset? 什么是数据集？
A dataset consists of rows and columns:  
数据集由行和列组成：
- **Rows (observations, 观察值)**: Represent individual cases or instances (e.g., each customer in a dataset).  
- **Columns (variables, 变量)**: Represent features describing the observations.

#### 1.1.3 Types of Data 数据类型
- **Categorical (分类数据)**: Data divided into distinct groups (e.g., colors, cities).  
- **Numerical (数值数据)**: Data expressed as numbers (e.g., age, temperature).
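
The row/column and categorical/numerical distinctions above can be sketched with a small table. This assumes pandas is available; the `customers` table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical customer data: each row is one observation, each column one variable.
customers = pd.DataFrame(
    {
        "city": ["Paris", "Tokyo", "Lima"],      # categorical
        "age": [34, 27, 45],                     # numerical
        "monthly_spend": [120.5, 80.0, 210.3],   # numerical
    }
)

n_rows, n_cols = customers.shape

# Separate numerical columns from the rest using pandas' dtype helper.
numeric_cols = [
    col for col in customers.columns
    if pd.api.types.is_numeric_dtype(customers[col])
]
categorical_cols = [col for col in customers.columns if col not in numeric_cols]

print(n_rows, n_cols)
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```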

#### 1.1.4 Levels of Measurement 测量水平
- **Nominal (名义变量)**: Categories without a meaningful order (e.g., gender, country).  
- **Ordinal (序数变量)**: Ordered categories where the distance between levels is not meaningful (e.g., satisfaction levels).  
- **Interval (区间变量)**: Numeric values with meaningful differences but no true zero (e.g., temperature in Celsius).  
- **Ratio (比率变量)**: Like interval data but with a true zero (e.g., height, weight).
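
A rough illustration of why the level of measurement matters, assuming pandas is available. The survey answers and the numbers below are invented; the point is only that ordering works for ordinal data, while ratios make sense only for ratio data.

```python
import pandas as pd

# Hypothetical survey answers on an ordinal scale.
levels = ["low", "medium", "high"]
satisfaction = pd.Series(
    pd.Categorical(["medium", "low", "high", "medium"], categories=levels, ordered=True)
)

# Ordinal: the order is meaningful, so max() and comparisons are allowed.
highest = satisfaction.max()
at_least_medium = (satisfaction >= "medium").sum()

# Interval: differences are meaningful, ratios are not.
# 20 °C is 10 degrees warmer than 10 °C, but not "twice as warm".
celsius_difference = 20 - 10

# Ratio: a true zero exists, so 180 cm really is 1.5 times 120 cm.
height_ratio = 180 / 120

print(highest, at_least_medium, celsius_difference, height_ratio)
```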

### 1.2 CRISP-DM (Cross Industry Standard Process for Data Mining) 数据挖掘标准流程
A structured framework for conducting data science projects.  
用于执行数据科学项目的标准框架：
1. **Business Understanding (业务理解)** - Define the project goal and understand business needs.
2. **Data Understanding (数据理解)** - Explore, visualize, and analyze data.
3. **Data Preparation (数据准备)** - Clean and transform data.
4. **Modeling (建模)** - Develop predictive models.
5. **Evaluation (评估)** - Assess model performance.
6. **Deployment (部署)** - Implement the model in a real-world scenario.

## 2. Business Understanding 业务理解
- **Business Objective (商业目标)**: Define what the organization aims to achieve.
- **Data Mining Goal (数据挖掘目标)**: Define what data scientists can do to support business objectives.
- **Success Criteria (成功标准)**: Establish measurable criteria to assess model effectiveness.

---

## 3. Data Understanding 数据理解

### 3.1 Descriptive Statistics 描述性统计
#### Central Tendency 中心趋势
- **Mean (均值)**: Uses all data points but **sensitive to outliers**.
- **Median (中位数)**: Robust to outliers.
- **Mode (众数)**: Useful for categorical data.

#### Variability 变异性
- **Variance (方差, σ²)**: Average squared deviation from the mean; measures the spread of the data.
- **Standard Deviation (标准差, σ)**: Square root of the variance, expressed in the same units as the data.
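
A minimal sketch of these measures using Python's standard `statistics` module. The salary figures are invented; adding one extreme value shows how the mean moves while the median stays put. The population forms (`pvariance`, `pstdev`) are used here to match the σ² and σ notation above; `statistics.variance` and `statistics.stdev` give the sample versions.

```python
import statistics

salaries = [30, 32, 35, 35, 38]      # hypothetical values, in thousands
with_outlier = salaries + [400]      # the same data plus one extreme value

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)       # pulled up strongly by 400
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)   # unchanged by the outlier
mode_value = statistics.mode(salaries)           # most frequent value

variance = statistics.pvariance(salaries)        # population variance
std_dev = statistics.pstdev(salaries)            # population standard deviation

print(mean_before, mean_after, median_before, median_after, mode_value)
print(variance, std_dev)
```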

### 3.2 Outliers 异常值
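- **Detection Methods:** Z-score (distance from the mean in standard deviations), IQR rule (values beyond 1.5 × IQR from the quartiles), boxplot.
- **Handling:** Remove, adjust (e.g., cap), or keep after analyzing the business impact.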
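
The Z-score and IQR checks can be sketched with the standard library alone. The data is invented, and the cut-offs (|z| > 3 and 1.5 × IQR) are common conventions rather than fixed rules, so treat them as assumptions.

```python
import statistics

# Hypothetical measurements; 50 is the suspicious value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13] * 2 + [50]

# Z-score: how many standard deviations a value lies from the mean.
mean = statistics.mean(values)
std = statistics.pstdev(values)
z_threshold = 3  # assumed cut-off
z_outliers = [v for v in values if abs((v - mean) / std) > z_threshold]

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

On very small samples the Z-score method can miss an obvious outlier, because the outlier itself inflates the standard deviation; the IQR rule is less affected by this.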

### 3.3 Correlation 相关性分析
**A correlation coefficient measures the strength and direction of the relationship between two variables.**
- Values range from -1 to +1: **Positive correlation (+), Negative correlation (-), No correlation (≈ 0)**
- **Correlation ≠ Causation**

#### 3.3.1 Pearson Correlation (皮尔逊相关系数)
- Measures **linear** relationships.
- **Assumptions:** Linear relationship, continuous variables, no significant outliers, normal distribution (for hypothesis testing).
- **Strength Interpretation Guidelines (absolute value of r):**
  - 0.00-0.19: Very weak
  - 0.20-0.39: Weak
  - 0.40-0.59: Moderate
  - 0.60-0.79: Strong
  - 0.80-1.00: Very strong
- **Statistical Significance:**
  - p-value < 0.05: Significant (if there were no true correlation, a result at least this extreme would occur less than 5% of the time)
  - p-value < 0.01: Highly significant (less than 1% of the time)
  - p-value < 0.001: Very highly significant (less than 0.1% of the time)
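
As a sketch, Pearson's r can be computed directly from its definition and mapped to the strength labels listed above. The helper names `pearson_r` and `strength_label` and the study-hours data are hypothetical. The p-value is not computed here; a statistics library such as SciPy (`scipy.stats.pearsonr`) can supply it.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def strength_label(r):
    """Map |r| to the guideline bands from these notes."""
    size = abs(r)
    if size < 0.20:
        return "very weak"
    if size < 0.40:
        return "weak"
    if size < 0.60:
        return "moderate"
    if size < 0.80:
        return "strong"
    return "very strong"

hours_studied = [1, 2, 3, 4, 5]          # hypothetical data
exam_score = [52, 55, 61, 64, 70]

r = pearson_r(hours_studied, exam_score)
print(round(r, 3), strength_label(r))
```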

#### 3.3.2 Spearman’s Rank (斯皮尔曼秩相关系数)
- Measures **monotonic** relationships (both linear and non-linear).
- **Use cases:**
  - The relationship may be non-linear but still monotonic
  - The data is ordinal
  - The data contains outliers
  - Normality assumptions are violated
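
One way to see the difference from Pearson: Spearman's coefficient is Pearson's r applied to the ranks of the data. The sketch below uses hypothetical helpers (`average_ranks`, `pearson_r`, `spearman_rho`) and a cubic relationship, which is monotonic but not linear.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # monotonic but non-linear

pearson = pearson_r(x, y)
spearman = spearman_rho(x, y)
print(round(pearson, 3), round(spearman, 3))
```

Because `y` always increases with `x`, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's r is lower because the points do not lie on a straight line.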

#### 3.3.3 Correlation Matrices (相关矩阵) & Heatmaps (热图)
- **Visualize relationships between variables.**
- **Feature Redundancy (特征冗余)**: Identify highly correlated features to reduce model complexity.
- **Multicollinearity (多重共线性)**: When features are highly correlated with each other, coefficient estimates become unstable and hard to interpret.
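
A small sketch of using a correlation matrix to spot redundant features, assuming pandas is available. The columns are invented, and the 0.95 threshold is an arbitrary assumption to tune for each project.

```python
import pandas as pd

# Hypothetical features; height_in is almost a rescaled copy of height_cm.
df = pd.DataFrame(
    {
        "height_cm": [150, 160, 170, 180, 190],
        "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
        "commute_min": [30, 12, 45, 20, 33],
    }
)

corr = df.corr(method="pearson")   # method="spearman" is also supported

threshold = 0.95  # assumed cut-off for "too correlated"
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("candidates to drop or merge:", redundant_pairs)
```

A plotting library such as seaborn can render `corr` as a heatmap, which makes the same pattern easier to see when there are many features.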

---

## 4. Data Preparation 数据准备
### 4.1 Data Cleaning 数据清理
- **Typos (拼写错误):** In pandas, use `.unique()` to spot inconsistent spellings and `.replace()` to correct them.
- **Handling Missing Values (处理缺失值):**
  - **Forward Fill (正向填充, ffill)**: Fills missing values with the last observed value.
  - **Backward Fill (反向填充, bfill)**: Fills missing values with the next observed value.
  - **Mean/Median Imputation (均值/中位数填充)**: Suitable for numerical data.
- **Impact of Missing Data (缺失数据的影响)**: Can introduce bias and reduce model accuracy.
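
The cleaning steps above might look like this in pandas. The table, the misspellings, and the replacement mapping are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["London", "Lodnon", "Paris", "paris", "London"],  # contains typos
        "temp": [12.0, None, 15.0, None, 11.0],                    # contains gaps
    }
)

# Inspect distinct spellings, then map the wrong ones to the right ones.
print(df["city"].unique())
df["city"] = df["city"].replace({"Lodnon": "London", "paris": "Paris"})

# Three ways to fill the gaps in a numerical column.
forward_filled = df["temp"].ffill()                    # carry the last value forward
backward_filled = df["temp"].bfill()                   # pull the next value backward
mean_filled = df["temp"].fillna(df["temp"].mean())     # use the column mean

print(forward_filled.tolist())
print(backward_filled.tolist())
print(mean_filled.round(2).tolist())
```

Forward and backward fill suit ordered data such as time series; for unordered rows, mean or median imputation is usually the safer default.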

### 4.2 Independence & Representativity 独立性 & 代表性
- **Independence (独立性)**: No observation should influence another.
- **Representativity (代表性)**: The dataset should reflect the broader population.

---

## 5. Modeling 建模

### 5.1 Benchmark Models 基准模型
- **Baseline comparison (基准比较)**: A simple baseline (e.g., always predicting the mean or the majority class) shows how much a more complex model actually improves results.
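
A minimal sketch of a baseline comparison for a regression task. All numbers are invented, and `model_predictions` stands in for the output of some more complex model that is not shown here.

```python
# Hypothetical target values (e.g., house prices in thousands).
train_y = [200, 220, 250, 270]
test_y = [210, 260]

# Baseline: always predict the mean of the training target.
baseline_prediction = sum(train_y) / len(train_y)
baseline_mae = sum(abs(y - baseline_prediction) for y in test_y) / len(test_y)

# Placeholder predictions from a more complex model (assumed, not computed here).
model_predictions = [215, 250]
model_mae = sum(abs(y - p) for y, p in zip(test_y, model_predictions)) / len(test_y)

improvement = baseline_mae - model_mae
print(baseline_mae, model_mae, improvement)
```

If the complex model does not clearly beat the baseline, the extra complexity is hard to justify.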

### 5.2 Training & Testing 训练与测试
- **Train-Test Split (训练-测试拆分)**: Hold out part of the data for testing; common splits are 70/30 and 80/20.
- **Cross-validation (交叉验证)**: K-fold cross-validation trains and validates on k different splits to assess how well the model generalizes.
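
Both ideas can be sketched in plain Python. The helper names `split_train_test` and `k_fold_indices` are hypothetical; in practice a library such as scikit-learn offers equivalents (`train_test_split`, `KFold`).

```python
import random

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out test_ratio of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; every row is validated exactly once."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size

data = list(range(10))                       # stand-in for 10 observations
train, test = split_train_test(data, test_ratio=0.2)
folds = list(k_fold_indices(len(data), k=5))

print(len(train), len(test))
for train_idx, val_idx in folds:
    print("validate on", val_idx)
```

This sketch builds folds from consecutive indices, so the data should be shuffled first unless the order matters (as it does for time series).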

### 5.3 Regression & Classification 回归与分类
#### Regression (回归)
- **Linear Regression (线性回归)**: Models linear relationships.
- **Ordinary Least Squares (普通最小二乘法, OLS)**: Minimizes the sum of squared residuals.
- **Evaluation Metrics:**
  - **R² (决定系数)**: Measures how well the model explains variance in the target.
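  - **MAE (平均绝对误差)**: Average absolute difference between predictions and actual values.
  - **MSE (均方误差)**: Average squared difference; penalizes large errors more heavily.
  - **RMSE (均方根误差)**: Square root of MSE, in the same units as the target.
- **Overfitting (过拟合)**: When a model fits noise in the training data and fails to generalize to new data.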
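
For a single feature, the OLS solution has a closed form, so the fit and the metrics above can be sketched without any library. The data points are invented. For brevity the metrics are computed on the training data, which is exactly where overfitting would go unnoticed.

```python
import math

x = [1, 2, 3, 4, 5]                   # hypothetical feature
y = [2.1, 3.9, 6.2, 7.8, 10.1]        # hypothetical target

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# OLS for one feature: slope = covariance(x, y) / variance(x).
slope = (
    sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    / sum((a - mean_x) ** 2 for a in x)
)
intercept = mean_y - slope * mean_x

predictions = [intercept + slope * a for a in x]
residuals = [b - p for b, p in zip(y, predictions)]

mae = sum(abs(e) for e in residuals) / n
mse = sum(e ** 2 for e in residuals) / n
rmse = math.sqrt(mse)

ss_res = sum(e ** 2 for e in residuals)        # unexplained variation
ss_tot = sum((b - mean_y) ** 2 for b in y)     # total variation
r2 = 1 - ss_res / ss_tot

print(f"y = {intercept:.2f} + {slope:.2f}x")
print(f"MAE={mae:.3f} MSE={mse:.4f} RMSE={rmse:.3f} R2={r2:.4f}")
```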

#### Classification (分类)
- **Decision Trees (决策树)**: Predict by repeatedly splitting the data on feature values.
- **Evaluation Metrics:**
  - **Confusion Matrix (混淆矩阵)**:
    - **True Positive (TP, 真阳性)**: Correctly predicted positive cases.
    - **True Negative (TN, 真阴性)**: Correctly predicted negative cases.
    - **False Positive (FP, 假阳性)**: Negative cases incorrectly predicted as positive.
    - **False Negative (FN, 假阴性)**: Positive cases incorrectly predicted as negative.
  - **Accuracy (准确率)**: (TP + TN) / (TP + TN + FP + FN)
  - **Precision (精确率)**: TP / (TP + FP)
  - **Recall (召回率)**: TP / (TP + FN)
  - **F1 Score**: Harmonic mean of precision & recall.
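
The formulas above can be checked on a toy example. The labels and predictions below are invented, with 1 as the positive class.

```python
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical actual labels
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # hypothetical model predictions

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # positives found
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # negatives found
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # negatives flagged as positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # positives missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision and recall differ here because the model produces more false positives than false negatives; F1 sits between the two.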

---

## 6. Model Evaluation 模型评估

### 6.1 Regression Metrics 回归指标
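- **MAE (平均绝对误差)**: Average absolute difference between predictions and actual values.
- **MSE (均方误差)**: Average squared difference; penalizes large errors more heavily.
- **RMSE (均方根误差)**: Square root of MSE, in the same units as the target.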

### 6.2 Classification Metrics 分类指标
- **Confusion Matrix (混淆矩阵)**: 
  - **True Positive (TP, 真阳性)**: Correctly predicted positive cases.
  - **True Negative (TN, 真阴性)**: Correctly predicted negative cases.
  - **False Positive (FP, 假阳性)**: Negative cases incorrectly predicted as positive.
  - **False Negative (FN, 假阴性)**: Positive cases incorrectly predicted as negative.
- **Precision (精确率)**: TP / (TP + FP).
- **Recall (召回率)**: TP / (TP + FN).
- **F1 Score**: Harmonic mean of precision & recall.

---

## 7. Data Visualization 数据可视化
- **Histogram (直方图)**: Displays numerical distributions.
- **Bar Plot (条形图)**: Compares categorical values.
- **Scatter Plot (散点图)**: Shows relationships between numerical variables.
- **Line Graph (折线图)**: Tracks changes over time.