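### Chapter 2 : Preliminary Knowledge
- Data manipulation
  - Broadcasting: two tensors of different shapes are each expanded (by copying elements) to a common shape before the elementwise operation
  - Saving memory: use `X[:] = <expression>` or `X += <expression>` to update in place and avoid allocating a new tensor
- Data preprocessing
- Linear algebra
  - Transpose `.T`, norm `norm`
  - Non-reduction sum (`keepdims=True`), cumulative sum `cumsum`
  - `torch.dot` only supports vectors; use `mv` for matrix-vector products and `mm` for matrix-matrix products
- Calculus
  - With $\nabla_x$ the gradient operator: $\nabla_x (Ax) = A^\top$, $\nabla_x (x^\top A) = A$, $\nabla_x (x^\top A x) = (A + A^\top)x$
- Automatic differentiation
  - By default PyTorch accumulates gradients, so the previous values must be cleared before each new backward pass
  - `backward` needs a scalar output; for a non-scalar output, either reduce it to a scalar (e.g. with `sum`) or pass a `gradient` argument with the same shape as the output
  - Detaching computation (`detach`)
- Probability
- Looking up documentation and API guidance
  - `dir` lists the functions and classes that can be called in a module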
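
The gradient identity for the quadratic form $x^\top A x$ listed under calculus can be checked numerically. The sketch below is only an illustration in plain Python (nested lists instead of tensors); the helper names `matvec`, `quadratic_form`, `analytic_gradient` and `numeric_gradient` are made up for this example and are not part of any library.

```python
def matvec(A, x):
    """Return the matrix-vector product A @ x for nested lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def quadratic_form(A, x):
    """Return the scalar x^T A x."""
    return sum(x_i * ax_i for x_i, ax_i in zip(x, matvec(A, x)))


def analytic_gradient(A, x):
    """Gradient of x^T A x according to the identity (A + A^T) x."""
    n = len(x)
    a_plus_at = [[A[i][j] + A[j][i] for j in range(n)] for i in range(n)]
    return matvec(a_plus_at, x)


def numeric_gradient(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad


A = [[1.0, 2.0], [3.0, 4.0]]  # deliberately non-symmetric
x = [0.5, -1.0]
g_exact = analytic_gradient(A, x)
g_numeric = numeric_gradient(lambda v: quadratic_form(A, v), x)
print(g_exact, g_numeric)
```

Using a non-symmetric `A` matters here: for a symmetric matrix the identity collapses to $2Ax$, which would hide a wrong transpose.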

### Chapter 3 : Linear Neural Network
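- Minibatch stochastic gradient descent
- General training loop
  - Compute `y_hat = model.forward(X)`, measure the loss between `y_hat` and the label `y`, backpropagate, then let the optimizer update the parameters from the gradients
- Machine Learning Concept
  - lasso regression: l1 norm penalty; ridge regression: l2 norm penalty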
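
The training loop above (forward pass, loss, gradient, parameter update) can be sketched without any framework. This is a minimal illustration for a one-feature linear model with squared loss; the function `minibatch_sgd`, its default hyperparameters and the synthetic data are assumptions made for the example.

```python
import random


def minibatch_sgd(features, labels, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b with minibatch SGD on the mean of 0.5 * error^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random minibatches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Forward pass: prediction minus label for each example.
            errors = [(w * features[i] + b) - labels[i] for i in batch]
            # Gradients of the minibatch loss with respect to w and b.
            grad_w = sum(e * features[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            # Optimizer step.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(-10, 11)]  # 21 points in [-1, 1]
ys = [2.0 * x - 3.0 for x in xs]       # noiseless labels from w=2, b=-3
w, b = minibatch_sgd(xs, ys)
print(w, b)
```

Because the labels are noiseless, the estimates should land very close to the generating values `w = 2` and `b = -3`.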

## Chapter 4 : Classification
- softmax:
  $y_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$; in practice $\max_j o_j$ is subtracted from every $o_i$ first for numerical stability (the result is unchanged)
- Information theory
  - cross-entropy loss：$l(y, \hat y) = - \sum y_i * \log(\hat y_i)$
  - amount of information $\log{\frac{1}{P(j)}} = - \log{P(j)}$ 
  - entropy $H[P] = \sum -P(j) \log{P(j)}$
  - cross-entropy $H(P, Q) = \sum -P(j) \log{Q(j)}, ~ P=Q \rightarrow H(P, Q) = H(P, P) = H(P)$
- Image Classification Rules:
  - images are stored in (channel, height, width) order.
- Distribution shift:
  - Covariate Shift (feature shift): $p(x) \neq q(x), p(y|x) = q(y|x)$
    - For example: p(x) and q(x) are the features of rural and urban houses and y is the price; we assume the relation between features and label is the same
    - Method: the training data come from $q$, so reweight each example by $\beta(x) = p(x) / q(x)$, since $\int\int l(f(x), y)p(y|x)p(x)dxdy = \int\int l(f(x), y)q(y|x)q(x) \frac{p(x)}{q(x)}dxdy$, which gives the weighted empirical risk $\sum_i \beta_i l(f(x_i), y_i)$. $\beta$ can be estimated with a logistic regression trained to tell samples of $p$ from samples of $q$.
  - Label Shift: $p(y) \neq q(y), p(x|y) = q(x|y)$. The same reweighting applies with $\beta(y) = p(y) / q(y)$, but the target label distribution $p(y)$ is hard to get because the target data are unlabeled. Compute a confusion matrix $C$ on the validation data, average the model's predictions on the target data to get $\mu(\hat y)$, then solve $C p(y) = \mu(\hat y)$ for $p(y)$.
  - Concept Shift: the definition of the label itself changes
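
The softmax stabilization trick and the cross-entropy loss from this chapter fit in a few lines. The sketch below uses plain Python lists; the function names and the example logits are chosen only for illustration.

```python
import math


def softmax(logits):
    """Softmax with the maximum logit subtracted first to avoid overflow."""
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, y_hat):
    """l(y, y_hat) = -sum_i y_i * log(y_hat_i); zero-weight terms are skipped."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, y_hat) if y_i > 0)


# A naive math.exp(1000.0) would raise OverflowError; the shifted version is fine.
logits = [1000.0, 1001.0, 1002.0]
probs = softmax(logits)
loss = cross_entropy([0.0, 0.0, 1.0], probs)  # one-hot label on the last class
print(probs, loss)
```

Subtracting the maximum leaves the result unchanged because the common factor $\exp(-\max_j o_j)$ cancels between numerator and denominator.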

## Chapter 5 : Multilayer Perceptrons
- Activation Function: relu, sigmoid, tanh ($\frac{1 - \exp(-2x)}{1 + \exp(-2x)}$)
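- Numerical stability: vanishing and exploding gradients are common
  - Symmetry: the hidden units of a linear layer (or of a conv layer without weight sharing) can be permuted without changing the function, so with identical initial weights we cannot tell them apart and they stay identical during training. We need to **Break the Symmetry** (e.g. with random initialization or dropout)
  - Xavier initialization: sample weights from a distribution with zero mean and variance $\sigma^2 = 2 / (n_{in} + n_{out})$, i.e. standard deviation $\sigma = \sqrt{2 / (n_{in} + n_{out})}$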
  - Dropout, shared param...
- (Rolnick et al., 2017) has revealed that in the setting of label noise, neural networks tend to fit cleanly labeled data **first** and only subsequently to interpolate the mislabeled data.
  - so we can stop early once the validation error reaches its minimum or the patience runs out; this is usually combined with regularization.
- Dropout:
  - $h^{'} = \begin{cases} 0, & \text{with probability } p \\ \frac{h}{1-p}, & \text{otherwise} \end{cases}$, so $E[h^{'}] = E[h]$
  - Dropout is not used at test time, unless we want to estimate the uncertainty of the model output (by comparing predictions across different dropout masks)
  - Use lower p in lower layers (to preserve low-level features), higher p in higher layers
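
The claim that dropout keeps the expectation unchanged can be checked empirically. The following is a small sketch of inverted dropout on a plain list; `dropout` here is a hypothetical helper written for this note, not the PyTorch layer.

```python
import random


def dropout(h, p, rng):
    """Zero each element with probability p and scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return [0.0 if rng.random() < p else h_i / (1.0 - p) for h_i in h]


rng = random.Random(0)          # fixed seed so the run is repeatable
h = [1.0] * 100_000             # activations with mean exactly 1
out = dropout(h, p=0.3, rng=rng)
mean_after = sum(out) / len(out)
print(mean_after)               # should stay close to the original mean of 1
```

The rescaling by `1 / (1 - p)` is what lets the same network be used at test time without dropout and without any extra correction.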

## Chapter 6 : Builders' Guide
- Tied layers: the gradients from every place where the shared parameters are used are added together
- Custom initialization: `apply` method
- I/O
  - save tensor: `torch.save(x: Union[List[Tensor], Dict], name: str)`, and load it back with `torch.load(name)`
  - save model: the same, but pass the parameter dict of the net (`net.state_dict()`), then restore with `net.load_state_dict(torch.load(name))`
- GPU
  - all tensors in an operation must be on the same GPU
  - printing a tensor or converting it to NumPy copies it to main memory and, even worse, waits on the Python **GIL** (`Global Interpreter Lock`, which ensures that only one thread executes Python bytecode at a time)

## Chapter 7 : CNN
1. **Invariance**: translation equivariance, locality -> The earliest layers should respond similarly to the same patch and focus on local regions.
2. **Convolution**: mathematically $(f * g)(i, j) = \sum_a \sum_b f(a, b)  g(i - a, j - b)$; recall that **cross-correlation** is $(f \star g)(i, j) = \sum_a \sum_b f(a, b)  g(i + a, j + b)$
   - The difference is not important because the kernel is learned: the learned convolution kernel is the learned cross-correlation kernel flipped horizontally and vertically (not transposed), so `conv(X, k_conv_learned) = corr(X, k_corr_learned)`
3. **Receptive Field**: for any element x (of a tensor in a conv layer), all the elements in the previous layers that may affect x during forward propagation.
4. **Padding, Stride**: the output shape is $\lfloor (n_h - k_h + p_h + s_h) / s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w) / s_w \rfloor$; often `p_h = k_h - 1` (the same for `p_w`), so that with stride 1 the output keeps the input size. Here `p_h = p_h_upper + p_h_lower` is the total padding.
5. **Channel**:
   - multi in $c_i$ -> kernel must also have the same channels ($c_i \times k_h \times k_w$), then add them up.
   - multi out $c_o$ -> kernel with $c_o \times c_i \times k_h \times k_w$, get $c_o$ output channels.
6. use `torch.stack` to stack tensors
7. **Pooling**: mitigates the sensitivity of convolutional layers to location and spatially downsamples the representations.
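
The relation between convolution and cross-correlation, and the output-size formula for padding and stride, can both be verified on a tiny example. This is a plain-Python sketch; `corr2d`, `conv2d`, `flip` and `conv_output_size` are names chosen for the illustration, and `p` denotes the total padding along one dimension.

```python
def corr2d(X, K):
    """Valid 2D cross-correlation of X with kernel K (nested lists)."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(X) - kh + 1, len(X[0]) - kw + 1
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def conv2d(X, K):
    """Valid 2D convolution: the kernel index runs against the input index."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(X) - kh + 1, len(X[0]) - kw + 1
    return [[sum(X[i + kh - 1 - a][j + kw - 1 - b] * K[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def flip(K):
    """Flip a kernel vertically and horizontally (a 180 degree rotation)."""
    return [row[::-1] for row in K[::-1]]


def conv_output_size(n, k, p, s):
    """floor((n - k + p + s) / s), with p the total padding on that axis."""
    return (n - k + p + s) // s


X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K = [[0, 1], [2, 3]]
print(corr2d(X, K))                        # cross-correlation
print(conv2d(X, K), corr2d(X, flip(K)))    # these two agree
print(conv_output_size(8, 3, 2, 2))        # 8x8 input, 3x3 kernel, pad 1 per side, stride 2
```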

## Chapter 8 : Modern CNN
1. **AlexNet**: the first successful deep conv net, using dropout, ReLU and pooling
2. **VGG**: blocks of multiple 3 * 3 conv layers (two stacked 3 * 3 convs cover the same 5 * 5 input region as one 5 * 5 conv, but need only 2 * 3 * 3 = 18 < 25 = 5 * 5 weights)
3. **NiN**: addresses 2 problems (1. the MLP at the end uses a lot of memory; 2. MLPs cannot be added between conv layers to increase the degree of nonlinearity because they would destroy the spatial information)
   - use 1 * 1 conv layer to add local nonlinearities across the channel activations
   - use global average pooling to integrate across all locations in the last representation layer. (must combine with added nonlinearities)
4. **GoogLeNet**: Inception block, parallel convs at multiple scales whose outputs are then concatenated
5. **Batch Normalization**:
   - $BN(\mathbf x) = \mathbf{\gamma} \odot \frac{\mathbf x - \mathbf{\mu_B}}{\mathbf{\sigma_B}} + \mathbf \beta$, $\mathbf{\mu_B} = \frac{1}{|B|}\sum_{x \in B} \mathbf x$,
     $\sigma^2_B = \frac{1}{|B|} \sum_{x \in B} (x - \mathbf{\mu_B})^2 + \epsilon$
   - On a linear layer [N, D] the statistics are computed over N, separately for each of the D features (different features are never mixed); on a conv layer [N, C, H, W] they are computed over N, H and W, separately for each channel (the differences between channels are preserved)
     - For example, for an input x of shape [N, C, H, W], take x[:, 0, :, :], compute its mean mu and std, and apply (x[:, 0, :, :] - mu) / std; here mu and std are scalars.
   - At test time we use the mean and variance of the whole dataset instead of the minibatch mean and variance. Like dropout, BN behaves differently in training and testing.
   - So BN also acts as a noise source (minibatch statistics != true mean and var), see Teye et al. (2018) and Luo et al. (2018).
   - It therefore works best for batch sizes of about 50 ~ 100: with larger batches the noise is too small, with smaller ones it is too large.
   - Moving global mean and var: at test time there is no minibatch, so we use global statistics that are stored during training.
     - They are exponentially weighted averages, so the most recent batches have the highest weight
     - $\mu_m = \mu_m * (1 - \tau) + \mu * \tau, \Sigma_m = \Sigma_m * (1 - \tau) + \Sigma * \tau$, where $\tau$ is called the momentum term.
6. **Layer Normalization**: often used in NLP
   - For features like [N, A, B] the statistics are computed within each sample, so the differences between samples (along N) are preserved; A and B are typically seq_len and hidden_size.
7. **ResNet**: residual block, x is passed through a shortcut branch and added back before the final activation function (in the original paper; later versions reorder the layers to BN -> AC -> Conv)
   - So that the passed x has the right shape for the addition, a 1 * 1 conv can be applied to it when needed
   - **Idea**: nested function classes. The functions a shallower net (like ResNet-20) can represent are a subset of those of a deeper net (like ResNet-50), because if the extra layers of ResNet-50 learn f(x) = x, it behaves exactly like ResNet-20. So the best function f' reachable with ResNet-50 on given data is at least as good as the best f of ResNet-20 on the same data.
   - <p align="center">
       <img alt="Residual Block" src="https://d2l.ai/_images/resnet-block.svg" style="background-color: white; display: inline-block;">
       Residual Block
   </p>
   - **ResNeXt**: use g groups of 3 * 3 conv layers between two 1 * 1 conv of channel $b$ and $c_o$, so $\mathcal O(c_i c_o) \rightarrow \mathcal O(g ~ c_i / g ~ c_o / g) = \mathcal O(c_ic_o/g)$
     - This is a **Bottleneck** arch if $b < c_i$
       </br>
   - <img alt="ResNeXt Block" src="https://d2l.ai/_images/resnext-block.svg" style="background-color: white; display: inline-block;">
       ResNeXt Block
8. **DenseNet**: instead of plus x, we concatenate x repeatedly.
   - For example (\<channel\> indicates the channel): x\<c_1\> -> f_1(x)\<c_2\> end up with [x, f_1(x)]\<c_1 + c_2\> -> f_2([x, f_1(x)])\<c_3\> end up with [x, f_1(x), f_2([x, f_1(x)])]\<c_1 + c_2 + c_3\>
   - Stacking too many of these layers makes the channel dimension too large, so some layer is needed to reduce it. The **Transition** layer uses a 1 * 1 conv to reduce the channels and average pooling to halve H and W.
9. **RegNet**:
   - AnyNet: network with **stem** -> **body** -> **head**.
   - Distribution of nets: $F(e,Z)=\frac{1}{n}\sum_{i=1}^{n}\mathbf 1(e_i<e)$ is the empirical CDF of the errors $e_i$ of $n$ networks $Z$ sampled from the architecture distribution $p$; it approximates $F(e, p)$. If $F(e, Z_1) \geq F(e, Z_2)$ for all $e$, then we say $Z_1$ is better: the design choices behind it are better.
   - So for RegNet, they find that the same bottleneck ratio k (k = 1, i.e. no bottleneck, is best according to the paper) and the same group width g can be used for all ResNeXt blocks with no harm, and that the network depth d and width c should increase across stages. The width changes linearly, $c_j = c_0 + c_aj$ with slope $c_a$
   - neural architecture search (NAS): within a given search space, use RL (NASNet), evolutionary algorithms (AmoebaNet), gradient-based search (DARTS) or shared weights (ENAS) to find the model. But it takes too much computation.
   - <img src="https://d2l.ai/_images/anynet.svg" style="background-color: white; display: inline-block;"> AnyNet Structure
   </br>
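
The batch normalization item above (minibatch statistics during training, moving averages at test time) can be sketched for a [N, D] input in plain Python. The function names, the default `momentum` and `eps`, and the toy batch are assumptions made for this illustration; the moving-average update follows the formula with $\tau$ given in the notes.

```python
import math


def batch_norm_train(X, gamma, beta, moving_mean, moving_var, momentum=0.1, eps=1e-5):
    """Normalize each feature over the batch and update the moving statistics."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    var = [sum((row[j] - mean[j]) ** 2 for row in X) / n for j in range(d)]
    Y = [[gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
          for j in range(d)] for row in X]
    # Exponentially weighted averages: recent batches get the highest weight.
    new_mean = [(1 - momentum) * m + momentum * mu for m, mu in zip(moving_mean, mean)]
    new_var = [(1 - momentum) * v + momentum * s for v, s in zip(moving_var, var)]
    return Y, new_mean, new_var


def batch_norm_eval(X, gamma, beta, moving_mean, moving_var, eps=1e-5):
    """At test time, use the stored moving statistics instead of the batch's own."""
    d = len(X[0])
    return [[gamma[j] * (row[j] - moving_mean[j]) / math.sqrt(moving_var[j] + eps) + beta[j]
             for j in range(d)] for row in X]


X = [[1.0, 10.0], [3.0, 30.0]]           # N = 2 samples, D = 2 features
gamma, beta = [1.0, 1.0], [0.0, 0.0]
Y, mm, mv = batch_norm_train(X, gamma, beta, [0.0, 0.0], [1.0, 1.0])
Y_eval = batch_norm_eval(X, gamma, beta, mm, mv)
print(Y, mm, mv)
```

Each feature column of `Y` is centered on zero even though the two features have very different scales, which is the per-feature behavior described for the linear-layer case.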

## Chapter 9 : RNN
- Two forms of sequence-to-sequence tasks:
  - **aligned**: the input at each time step aligns with a corresponding output, like tagging (fight -> verb)
  - **unaligned**: no step-to-step correspondence, like machine translation
- **Autoregressive** model: regress a value on its previous values
  - latent autoregressive models (since $h_t$ is never observed): estimate $P(x_t | x_{t-1} \dots x_1)$ with $\hat x_t = P(x_t | h_t)$ and $h_t = g(h_{t-1}, x_{t-1})$
- **Sequence Model**: to get the joint probability of a sequence $p(x_1, \dots, x_T)$, we rewrite it in an autoregressive form: $p(x_1) \prod_{t=2}^T p(x_t|x_{t-1}, \dots, x_1)$
  - **Markov Condition**: if the condition above can be reduced to $x_{t-1}, \dots, x_{t-\tau}$ without any loss, i.e. the future is conditionally independent of the past given the recent history, then the sequence satisfies a Markov condition, and the model is a $\tau^{th}$-order Markov model.
- Zipf's law: the frequency $n_i$ of the $i$-th most frequent word decays as a power law of its rank, $n_i \propto 1 / i^\alpha$ (a straight line on a log-log plot); n-grams follow it too (with a smaller exponent).
  - So estimating probabilities from raw word counts does not work well. For example, in $\hat p(learning|deep) = n(deep, learning) / n(deep)$, the count $n(deep, learning)$ is very small compared to the denominator. **Laplace smoothing** can be applied, but it does not help much.
- **Perplexity** (how confused the model is): given real test data, the cross-entropy is $J = \frac{1}{n} \sum_{t=1}^n -\log P(x_t | x_{t-1}, \dots, x_1)$, and the perplexity is $\exp(J)$.
- Partitioning the sequence: for a sequence of $T$ token indices, we add some randomness by discarding the first $d$ tokens, with $d$ sampled uniformly from $[0, n)$, and partition the rest into $m = \lfloor (T-d) / n \rfloor$ subsequences of length $n$. For an input subsequence starting at $x_t$, the target is the same subsequence shifted by one token, starting at $x_{t+1}$.
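
The perplexity definition above can be illustrated with two extreme cases: a model that always assigns probability 1 to the true token (perplexity 1) and a model that is uniform over a hypothetical 4-token vocabulary (perplexity about 4, the vocabulary size). The function name and inputs are assumptions for this sketch.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    n = len(token_probs)
    cross_entropy = sum(-math.log(p) for p in token_probs) / n
    return math.exp(cross_entropy)


# Probabilities the model gave to each true next token in a test sequence.
perfect = perplexity([1.0, 1.0, 1.0])   # always certain and right
uniform = perplexity([0.25] * 4)        # uniform guess over 4 tokens
print(perfect, uniform)
```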
---
- **RNN**: for a vocab of size $|\mathcal V|$, the number of parameters of an n-gram style model grows as $|\mathcal V|^n$, where $n$ is the sequence length. So we approximate $P(x_t | x_{t-1} \dots x_1) \approx P(x_t | h_{t-1})$, where $h$ is a **hidden state**: it varies across time steps and carries information from the previous ones. A hidden layer, on the other hand, is a structure; it does not change during the forward computation.
  - recurrent: $H_t = \phi (X_tW_{xh} + H_{t-1}W_{hh} + b_h)$, output is $O_t = H_tW_{hq} + b_q$.
  - <img alt="RNN Block" src="https://d2l.ai/_images/rnn.svg" style="background-color: white; display: inline-block;">
       RNN Block
  - clip the gradient: $g \leftarrow \min(1, \frac{\theta}{|| g ||}) g$; it is a hack but a useful one.
  - **Warm-up**: when predicting, we can first feed a prefix (now called a prompt, I think): just iterate the prefix through the network without generating output until we need to predict.
- For RNN: the input shape is (sequence_length, batch_size, feature_size), first is time_step, third is one-hot dim or word2vec dim.
- **Backpropagation through time**
  - <img alt="Computation graph of RNN" src="https://d2l.ai/_images/rnn-bptt.svg" style="background-color: white; display: inline-block;"> Computation graph of RNN
  - How to reduce gradient explosion or vanishing: truncate the gradient propagation after a certain number of time steps.
  - In the img above: $\frac{\partial L}{\partial h_T} = W_{qh}^T \frac{\partial L}{\partial o_T}$, $\frac{\partial L}{\partial h_t} = \sum_{i=t}^T (W_{hh}^T)^{T-i} W_{qh}^T \frac{\partial L}{\partial o_{T+t-i}}$, $\frac{\partial L}{\partial W_{hx}} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} x_t^T$, $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^T \frac{\partial L}{\partial h_t} h_{t-1}^T$.
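
The gradient clipping rule from the RNN notes, $g \leftarrow \min(1, \theta / \|g\|) g$, rescales the whole gradient vector when its norm exceeds $\theta$ and leaves it alone otherwise. A minimal sketch on a flat list of gradient values (the function name and the zero-norm guard are choices made for this example):

```python
import math


def clip_gradient(grads, theta):
    """Scale grads so that their L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0:
        return list(grads)  # nothing to scale, and avoids division by zero
    scale = min(1.0, theta / norm)
    return [scale * g for g in grads]


clipped = clip_gradient([3.0, 4.0], theta=1.0)     # norm 5 is scaled down to norm 1
unchanged = clip_gradient([0.3, 0.4], theta=1.0)   # norm 0.5 is already small enough
print(clipped, unchanged)
```

Because every component is multiplied by the same factor, the direction of the update is preserved; only its length is bounded.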

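For **deep learning (DL) engineering** and **hardware optimization**, you need a set of tools, techniques and best practices to make sure models can be trained, optimized and deployed efficiently.  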

---

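# **1. Deep Learning Engineering**
**Goal**: not only train models, but also handle **data processing, training, tuning, deployment and maintenance** efficiently in real applications.

## **1.1 Data Engineering**
Deep learning performance depends heavily on data quality and preprocessing efficiency.

### **(1) Data Collection and Storage**
- **Structured data** (SQL, Pandas, BigQuery)
- **Image data** (OpenCV, PIL, TensorFlow Datasets)
- **Text data** (NLTK, Hugging Face Datasets)
- **Streaming data** (Kafka, Apache Spark)

### **(2) Data Preprocessing**
- **Standardization / normalization** (Min-Max Scaling, Z-score)
- **Data augmentation** (images: rotation, cropping; text: synonym replacement)
- **Dimensionality reduction** (PCA, t-SNE, UMAP)
- **Missing-value handling** (mean imputation, interpolation)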
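
Three of the preprocessing steps listed above (min-max scaling, z-score standardization and mean imputation) are simple enough to sketch with the standard library. The function names are hypothetical, and missing values are represented as `None` purely for the example.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


column = fill_missing_with_mean([2.0, None, 6.0])
scaled = min_max_scale(column)
standardized = z_score(column)
print(column, scaled, standardized)
```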

### **(3) Data Loading Optimization**
- **Batch loading**
- **Multi-threaded / multi-process data preprocessing** (PyTorch `DataLoader`, TensorFlow `tf.data`)
- **TFRecord / HDF5** (binary formats that speed up data reading)

---

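## **1.2 Training and Hyperparameter Tuning**
Training a deep learning model is compute-intensive and needs efficient **optimization strategies** and **hyperparameter tuning**.

### **(1) Training Optimization**
- **Optimizer choice**
  - SGD (stochastic gradient descent)
  - Adam / RMSprop (adaptive optimizers)
  - LARS / LAMB (for large-scale distributed training)

- **Regularization**
  - Dropout (randomly drops neurons)
  - Batch Normalization
  - Weight Decay (L2 regularization)

- **Gradient Clipping**
  - Addresses the exploding gradient problem

### **(2) Hyperparameter Optimization**
Automatically search for the best hyperparameters (e.g. learning rate, batch size, weight initialization).
- **Grid Search**
- **Random Search**
- **Bayesian Optimization**
- **Hyperband** (efficient sampling)
- **Optuna / Ray Tune** (automated hyperparameter tuning tools)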
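
Of the search strategies listed above, random search is the easiest to sketch. In the example below, `validation_loss` is a made-up stand-in for a real training-and-evaluation run (its minimum is placed at `lr = 1e-2`, `batch_size = 64` purely for illustration), and the search ranges are assumptions.

```python
import math
import random


def validation_loss(lr, batch_size):
    """Hypothetical objective standing in for 'train, then evaluate on val data'."""
    return (math.log10(lr) + 2.0) ** 2 + (math.log2(batch_size) - 6.0) ** 2


def random_search(objective, n_trials, seed=0):
    """Sample configurations at random and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, 0),                 # log-uniform in [1e-5, 1]
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        loss = objective(**config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(validation_loss, n_trials=200)
print(best_config, best_loss)
```

Sampling the learning rate on a log scale is a common choice because its useful values span several orders of magnitude.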

---

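## **1.3 Training Acceleration**
Large-scale training needs efficient acceleration techniques:

### **(1) GPU Acceleration**
- Use **CUDA** / **cuDNN** as much as possible during training
- **Mixed Precision training**: use FP16 (half precision) to speed up computation
- **Data parallelism (DataParallel)** vs. **model parallelism (ModelParallel)**

### **(2) Distributed Training**
- **Single machine, multiple GPUs (Multi-GPU Training)**
  - PyTorch `DataParallel`
  - PyTorch `DistributedDataParallel (DDP)`

- **Multiple machines, multiple GPUs (Multi-Node Training)**
  - TensorFlow `MultiWorkerMirroredStrategy` (`MirroredStrategy` covers a single machine with multiple GPUs)
  - Horovod (an efficient distributed training framework from Uber)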

---

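## **1.4 Deployment and Inference Optimization**
Deep learning models must not only be trained but also run inference efficiently on **edge devices** or on the **server side**.

### **(1) Model Compression**
- **Pruning**: remove unimportant weights
- **Quantization**:
  - **8-bit INT quantization** (TensorRT, TFLite)
  - **Mixed-precision inference (FP16, INT8)**

- **Knowledge Distillation**:
  - Use a large model to train a small one, improving inference efficiency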
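
To make the 8-bit quantization item concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It only illustrates the idea (one scale factor, rounding to integers in [-127, 127]); it is not how TensorRT or TFLite implement it, and the function names are made up for the example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is at most half the scale, so small weights such as `0.003` can collapse to zero; that loss is the price paid for the smaller storage and faster integer arithmetic.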

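### **(2) Inference Frameworks**
- **ONNX (Open Neural Network Exchange)**: a common model format that can be used to exchange models between PyTorch and TensorFlow
- **TensorRT (NVIDIA)**: efficient GPU-accelerated inference
- **TVM (Apache)**: automatically optimizes model inference

### **(3) Deployment Options**
- **Server deployment**
  - Flask / FastAPI (REST API deployment)
  - TensorFlow Serving / TorchServe (efficient model serving)

- **Mobile / edge deployment**
  - TensorFlow Lite (TFLite)
  - CoreML (iOS devices)
  - NVIDIA Jetson (embedded AI)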

---

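# **2. Hardware Optimization**
Deep learning is extremely compute-heavy, and hardware optimization can **significantly speed up training and inference**.

## **2.1 GPU Computing**
GPUs are the core compute devices for deep learning, and the NVIDIA CUDA ecosystem is essential.

### **(1) GPU Programming Basics**
- CUDA programming (learn to write kernels)
- cuDNN (optimized deep learning library)
- Tensor Cores (used for mixed-precision computation)

### **(2) GPU Training Optimization**
- **Reduce CPU-GPU transfer overhead** (e.g. set `pin_memory=True` in the data loader)
- **Gradient Accumulation**: lowers GPU memory usage by processing a large batch as several smaller micro-batches
- **Train in FP16** (higher throughput)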
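
Gradient accumulation works because the gradient of a mean loss over a large batch equals the weighted sum of the gradients over its micro-batches. The sketch below checks this for a one-parameter model `y = w * x` with squared loss; the function names and data are assumptions made for the illustration.

```python
def mse_grad(w, xs, ys):
    """d/dw of the mean over the batch of 0.5 * (w * x - y)^2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each weighted by its share of the batch."""
    total = 0.0
    for start in range(0, len(xs), micro_batch_size):
        micro_x = xs[start:start + micro_batch_size]
        micro_y = ys[start:start + micro_batch_size]
        # Only one micro-batch needs to be held in memory at a time.
        total += mse_grad(w, micro_x, micro_y) * len(micro_x) / len(xs)
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accumulated)
```

In a real training loop the optimizer step is taken only after all micro-batches have contributed, so the update matches the large-batch update while peak memory follows the micro-batch size.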

---

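## **2.2 Distributed Computing**
Suited to **training at very large scale** (models such as GPT and Llama).

### **(1) Parallelism Strategies**
- **Data Parallelism**
  - Replicate the model on multiple GPUs; each GPU trains on different data
  - PyTorch `DistributedDataParallel (DDP)`

- **Model Parallelism**
  - Suited to very large models (such as GPT-4)
  - Optimized with DeepSpeed / Megatron-LM

- **Pipeline Parallelism**
  - Assign different layers to different GPUs to improve compute efficiency
  - **Suited to Transformer training**

### **(2) Efficient Communication**
- **NCCL (NVIDIA Collective Communication Library)**: optimizes communication between GPUs
- **RDMA (Remote Direct Memory Access)**: used for high-speed communication between GPU servers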

---

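## **2.3 Dedicated AI Hardware**
Besides GPUs, AI training can also be accelerated with dedicated chips:
- **TPU (Google)**: optimized specifically for deep learning computation
- **Graphcore IPU** (optimized for sparse computation)
- **Cerebras Wafer-Scale Engine** (very large-scale AI computation)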

---

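# **3. Summary**
| **Category** | **Key Points** |
|----------|--------------|
| **Data engineering** | Data cleaning, data augmentation, data loading optimization |
| **Training optimization** | Hyperparameter tuning, regularization, optimizer choice |
| **Training acceleration** | GPU acceleration, mixed precision, distributed training |
| **Deployment optimization** | Model quantization, pruning, TensorRT acceleration |
| **Hardware optimization** | CUDA, NCCL, TPU/FPGA |

You already have **matrix factorization and Rust experience**; to go deeper into engineering optimization, you could:
1. **Study PyTorch DDP / DeepSpeed** (distributed training optimization)
2. **Learn CUDA / cuDNN programming** (low-level GPU acceleration)
3. **Try TensorRT / ONNX** (inference acceleration)

This will make you more competitive in **deep learning engineering & hardware optimization** 🚀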