# Some Notes About AI, ML and DL

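These notes were originally written to review for my sophomore-year final exams. They are compiled from lectures, books, and online tutorials.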

## Loss Functions

The smaller, the better.

### Negative Log Likelihood (NLL)

$$
- \log(p(y))
$$

### Focal Loss

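Compared with NLL, the factor $(1 - p(y))^{\gamma}$ down-weights examples that the classifier already gets right with high confidence, so the loss concentrates on the examples where the classifier is not confident, that is, where $p(y)$ is small.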

$$
- (1 - p(y))^{\gamma} \cdot \log(p(y))
$$
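
To see the effect of the modulating factor, here is a minimal sketch that compares NLL with focal loss for a few values of $p(y)$. The default `gamma=2.0` is an assumption chosen for illustration; the notes do not fix a value.

```python
import math


def nll(p_true: float) -> float:
    """Negative log likelihood of the probability assigned to the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss: NLL scaled by (1 - p)^gamma (gamma=2.0 is an assumed default)."""
    return (1.0 - p_true) ** gamma * nll(p_true)


# Confident, uncertain, and badly wrong predictions for the true class.
for p in (0.9, 0.5, 0.1):
    print(f"p={p:.1f}  nll={nll(p):.4f}  focal={focal_loss(p):.4f}")
```

With `gamma=0` the two losses coincide; a larger `gamma` shrinks the loss of confident predictions much more than that of uncertain ones.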

### Cross Entropy

(Usually for classification.)

$$
- \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{K} \mathbb{1}(y_{i} = j) \cdot \log\left(p(\hat{y_{i}}=j)\right)
$$

where $n$ is the number of instances; $K$ is the number of classes; $\mathbb{1}(\cdot)$ is the indicator function such that $\mathbb{1}(true) = 1$ and $\mathbb{1}(false) = 0$. When $y_i \in \{0, 1\}$, it is the same as the objective function of logistic regression.

Note: KL-divergence is closely related to cross entropy but not the same. Suppose we have two distributions, $P$ and $Q$. Cross entropy measures the average number of **total** bits needed to represent an event from $Q$ instead of $P$. KL-divergence measures the average number of **extra** bits needed to represent an event from $Q$ instead of $P$. Thus, if the two distributions are the same, the KL-divergence is zero, while the cross entropy equals the entropy of $P$, which is generally nonzero. ([Reference link](https://machinelearningmastery.com/cross-entropy-for-machine-learning/).)
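
The relation between the two quantities is $H(P, Q) = H(P) + KL(P \Vert Q)$. The sketch below checks it numerically in bits; the two distributions `p` and `q` are made-up examples.

```python
import math


def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(P, Q) in bits: total bits when events from P are coded with Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(P || Q) in bits: extra bits paid for using Q instead of P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]  # illustrative distributions, not from the notes
q = [0.25, 0.25, 0.5]

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
print(cross_entropy(p, p), kl_divergence(p, p))  # identical distributions
```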

### Mean Absolute Error (MAE, L1)

(Usually for regression.)

$$
\frac{1}{n} \sum_{i=1}^{n} \lvert y_{i} - \hat{y_{i}} \rvert
$$

### Mean Square Error (MSE, L2)

(Usually for regression.)

$$
\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y_{i}})^2
$$

### Hinge Loss

$$
\xi_{i} = \max \big( 0, 1-y_{i}\hat{y_{i}} \big)
$$


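## Lecture Notes from Song Hengjie's (宋恒杰) Class

The biggest difference between AI and ML is that the latter uses probabilistic or statistical information, while the former does not. Therefore evolutionary learning and traditional neural networks do not belong to ML; they belong to AI.

Neural networks are not guaranteed to converge.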

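### Decision Trees

Advantages:

1. Clear interpretability
2. Nonlinear relationships are captured implicitly in a simple tree structure
3. A small amount of missing data has little effect on model performance

The key issue in training a decision tree is deciding the order (importance) of the features. In each iteration, the information gain of all remaining features must be recomputed. It cannot be computed once, sorted, and left at that.

Advantage of C4.5: it can process data in parallel.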
    
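### Fuzzy Neural Networks

1. Given the input, once the number of fuzzy subsets is fixed, the number of rule subsets is fixed as well
2. The first and second layers are not fully connected, so the training workload is reduced
3. The connection weights do not need to be trained; only the parameters of the neurons do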


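### Deep Learning

Used for feature learning.

By learning from a set of directly observable features, it obtains a set of features that may not carry any specific physical meaning but that improve the accuracy of inference.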


## Gradient Descent

The parameters should be updated simultaneously, that is:

correct:

$$
temp_1 = \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}
$$

$$
temp_2 = \theta_2 - \alpha \frac{\partial J}{\partial \theta_2}
$$

$$
\theta_1 = temp_1
$$

$$
\theta_2 = temp_2
$$

wrong:

$$
\theta_1 = \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}
$$

$$
\theta_2 = \theta_2 - \alpha \frac{\partial J}{\partial \theta_2}
$$

In the wrong example, $\theta_1$ is overwritten before $\frac{\partial J}{\partial \theta_2}$ is evaluated, so the update of $\theta_1$ will affect the update of $\theta_2$.
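
The difference can be made concrete with a small sketch. The cost $J = \theta_1^2 + \theta_1 \theta_2 + \theta_2^2$ is an assumption chosen only because its two partial derivatives depend on both parameters.

```python
def grad(theta1, theta2):
    """Gradient of the illustrative cost J = theta1^2 + theta1*theta2 + theta2^2."""
    return 2 * theta1 + theta2, theta1 + 2 * theta2


def step_simultaneous(theta1, theta2, alpha):
    """Correct: both gradients are evaluated at the OLD parameters."""
    g1, g2 = grad(theta1, theta2)
    return theta1 - alpha * g1, theta2 - alpha * g2


def step_sequential(theta1, theta2, alpha):
    """Wrong: theta2's gradient already sees the NEW theta1."""
    theta1 = theta1 - alpha * grad(theta1, theta2)[0]
    theta2 = theta2 - alpha * grad(theta1, theta2)[1]
    return theta1, theta2


print(step_simultaneous(1.0, 1.0, 0.1))  # both parameters move by the same amount
print(step_sequential(1.0, 1.0, 0.1))    # theta2 moves by a different amount
```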

## Linear Regression

MSE (L2) loss function.

\begin{aligned}
L(w) & = \frac{1}{2} \Vert y - \hat{y} \Vert _ {2} ^ {2}\\
     & = \frac{1}{2} \Vert y - (Xw + b) \Vert _ {2} ^ {2}
\end{aligned}

### Gradient Descent

\begin{aligned}
\nabla_{w} L(w) & = -X^{T} (y - (Xw + b)) \\
      & = -X^{T} (y - \hat{y})
\end{aligned}

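### Closed-form Solution

$$w^* = \arg\min_w L(w) = (X^{T} X)^{-1} X^{T} y$$

(Here each row of $X$ is one instance, and the bias $b$ is absorbed into $w$ by appending a constant 1 to every row.)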
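
As a sanity check, the sketch below fits a one-feature model with a bias term in two ways: by solving the $2 \times 2$ normal equations by hand, and by gradient descent. The toy data, the step size, and the use of the averaged gradient (a constant rescaling of the loss above) are assumptions made for illustration.

```python
def fit_closed_form(xs, ys):
    """Solve (X^T X) w = X^T y for rows [1, x], i.e. bias absorbed as a constant column."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of the 2x2 matrix X^T X
    bias = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return bias, slope


def fit_gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Batch gradient descent on the squared error, averaged over the instances."""
    bias, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - (slope * x + bias) for x, y in zip(xs, ys)]
        grad_bias = -sum(residuals) / n
        grad_slope = -sum(r * x for r, x in zip(residuals, xs)) / n
        # Simultaneous update of both parameters.
        bias, slope = bias - alpha * grad_bias, slope - alpha * grad_slope
    return bias, slope


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
print(fit_closed_form(xs, ys))
print(fit_gradient_descent(xs, ys))
```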


## Logistic Regression

### Sigmoid Function

\begin{aligned}
\sigma(z) & = \frac{1}{1 + e ^ {-z}} \\
     & = (1 + e ^ {-z}) ^ {-1}
\end{aligned}

\begin{aligned}
\sigma'(z) & = (-1) (1 + e ^ {-z}) ^ {-2} e ^ {-z} (-1) \\
      & = \frac{e ^ {-z}}{(1 + e ^ {-z}) ^ {2}} \\
      & = \frac{1}{1 + e ^ {-z}} \frac{e ^ {-z}}{1 + e ^ {-z}} \\
      & = \sigma(z) \cdot \big( 1-\sigma(z) \big)
\end{aligned}
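
A quick numerical check of the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, using a central finite difference. The step size `eps` is an arbitrary choice for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Analytic derivative from the derivation above."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_grad(f, z, eps=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```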

### Objective Function

Minimize binary cross-entropy.

$$
h_{w}(x) = \sigma(w^{T} x) = P(y = 1 \mid x)
$$

$$
L(w) = - \frac{1}{n} \sum_{i=1}^{n} \bigg[ y_{i} \cdot log \big( h_w(x_{i}) \big) + (1-y_{i}) \cdot log \big( 1-h_{w}(x_{i}) \big) \bigg]
$$

Here we assume $y \in \{0,1\}$.

### Gradient

For one instance ([reference link](https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression)):

$$
\frac{\partial L(w)}{\partial w} = \big( h_{w}(x)-y \big) x
$$

For all instances:

$$
\frac{\partial L(w)}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \big( h_{w}(x_{i})-y_{i} \big) x_{i}
$$
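
The batch gradient above translates directly into a training loop. This is a minimal sketch on a made-up, linearly separable toy set; the learning rate, step count, and the convention of putting a constant 1 in each $x$ for the bias are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x):
    """h_w(x) = sigmoid(w^T x)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def gradient(w, xs, ys):
    """(1/n) * sum_i (h_w(x_i) - y_i) * x_i"""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = predict(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return grad


def train(xs, ys, alpha=0.5, steps=2000):
    w = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(w, xs, ys)
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w


# First component is the constant 1 (bias); labels are in {0, 1}.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]
w = train(xs, ys)
print([round(predict(w, x), 3) for x in xs])
```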


## Softmax Regression

Softmax regression generalizes logistic regression to $K$ possible outcomes instead of 2. It minimizes cross-entropy.

$$
P(y_{i} = j | x_{i}) = \frac{\exp(w_{j}^{T} x_{i})}{\sum_{l=1}^{K} \exp(w_{l}^{T} x_{i})}
$$

$$
L(w) = - \frac{1}{n} \bigg[ \sum_{i=1}^{n} \sum_{j=1}^{K} \mathbb{1}(y_{i} = j) \log(P(y_{i} = j | x_{i})) \bigg]
$$

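The derivative, after adding an L2 regularization term $\frac{\lambda}{2} \sum_{j=1}^{K} \lVert w_{j} \rVert_{2}^{2}$ to $L(w)$, is computed as: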

$$
\frac{\partial L(w)}{\partial w_{j}} = \frac{1}{n} \sum_{i=1}^{n} \bigg[ \big( P(y_{i} = j | x_{i} ; w) - \mathbb{1}(y_{i} = j) \big) x_{i} \bigg] + \lambda w_{j}
$$

where $\mathbb{1}()$ is the indicator function such that $\mathbb{1}(true)=1$ and $\mathbb{1}(false)=0$

## Support Vector Machine (SVM)

### Some Keywords

Maximize the margin, which is to minimize the L2-norm of coefficients $w$.

Lagrange, primal and dual problem, Karush-Kuhn-Tucker (KKT) conditions, support vectors

feature mapping, kernels, Gaussian kernel, kernel matrix

regularization: allow some instances to have a margin of $1 - \xi_{i}$, at a cost of $C \xi_{i}$

### Objective

Distance of point to line $Ax + By + C = 0$:

$$
\frac{|Ax+By+C|}{\sqrt{A^{2}+B^{2}}} = \frac{|\overrightarrow{(A, B)} \cdot \overrightarrow{(x, y)}+C|}{\sqrt{A^{2}+B^{2}}} = \frac{|wx+b|}{||w||_{2}}
$$

where $w = (A, B)$ is the normal vector of the line (it is orthogonal to the line and therefore fixes its direction) and $b = C$.

To maximize the margin (the distance of the support vectors to the line)

$$
max_{w,b} \frac{|wx+b|}{||w||_{2}}
$$

is equivalent to

$$
min_{w, b} \frac{||w||_{2}^{2}}{2}
$$

s.t.

$$
y^{(i)} (w^T x^{(i)} + b) \geq 1
$$

which is

$$
g_{i}(w) \triangleq 1 - y^{(i)} (w^{T} x^{(i)} + b) \leq 0
$$

(assume $y \in \{-1, +1\}$)

### Lagrangian for Optimization Problem

$$
L(w, b, \alpha)=\frac{||w||_2^2}{2} + \sum_{i=1}^{m}\alpha_{i} g_{i}(w) \qquad (\textrm{s.t.} \: \alpha_{i} \geq 0)
$$

### Dual form of the optimization problem

$$
\max_{\alpha \geq 0} \min_{w, b} L(w, b, \alpha) \leq \min_{w, b} \max_{\alpha \geq 0} L(w, b, \alpha)
$$

### Kernel Trick

Question: $K(x, z)$ is defined using $\phi(x)$ and $\phi(z)$. How can we represent or calculate $K$ without knowing $\phi$?

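Answer: We don't need to know what $\phi(x)$ and $\phi(z)$ are. A kernel such as the Gaussian kernel $K(x, z) = \exp\big(-\frac{||x - z||^{2}}{2\sigma^{2}}\big)$ is evaluated directly on the original inputs. We just want to classify the instances, and the classifier only needs these kernel values, no matter what space the instances are projected to.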

### Soft Margin

$$
min_{w, b} \frac{||w||^2}{2} + \frac{C}{n} \ \sum_{i=1}^{n} \xi_{i}
$$

s.t.

$$
y^{(i)} (w^T x^{(i)} + b) \geq 1- \xi_{i}
$$

#### Hinge Loss

$$
\xi_{i} = max \big( 0, 1-y^{(i)} (w^{T} x^{(i)} + b) \big)
$$

### SMO (sequential minimal optimization)

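Coordinate ascent/descent. Select a pair of Lagrange multipliers $\alpha_{i}, \alpha_{j}$, instead of a single one, at each iteration. The reason for selecting a pair is the constraint that the sum $\sum _ {i=1} ^ {m} \alpha _ {i} y ^ {(i)}$ is zero: changing a single $\alpha_{i}$ while holding the others fixed would violate it.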

### Disadvantages

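1. It does not learn features by itself; features have to be selected by hand as input
2. After training, saving the model requires storing training data (the support vectors)
3. All the data must sit on a single machine in order to run the SMO convex optimization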

#### Deal with Disadvantages

1. Feature engineering
2. Artificial support vectors
3. (I forget)

## Decision Tree

- ID3
- C4.5

### ID3

#### Entropy and information gain

The purity of the node (the lower entropy means the higher purity):

$$
Entropy(D) = - \sum_{i \in C} p_{i} log_{2} p_{i}
$$

where $C$ is the set of all classes.

Information gain on attribute $A$:

$$
Gain(D, A) = Entropy(D) - \sum_{v \in Values(A)} \frac{|D_{v}|}{|D|} Entropy(D_{v})
$$

where $D_{v}$ is the subset of the samples that take the value $v$ on the attribute $A$.

The information gain of an attribute with a large number of distinct values (e.g., sample ID) can be misleadingly large, causing over-fitting on the training set.
For instance, suppose "sample ID" is an attribute of the samples in the dataset.
For each value $v$ of "sample ID", there is only one instance in the subset $D_{v}$.
The subset $D_{v}$ will be very "pure" because there is only one class in the subset.
As a result, the algorithm will regard the attribute "sample ID" as a good attribute for splitting the dataset.

### C4.5

Adopt information gain ratio to overcome the disadvantage of the naive information gain.

Gain ratio:

$$
GainRatio(D, A) = \frac{Gain(D, A)}{IV(A)}
$$

where $IV(A)$ is the intrinsic value of the attribute $A$ ([Wikipedia link](https://en.wikipedia.org/wiki/Information_gain_ratio))

$$
IV(A) = - \sum_{v \in Values(A)} \frac{|D_{v}|}{|D|} log_{2} \frac{|D_{v}|}{|D|}
$$
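
The "sample ID" problem and its fix can be reproduced on a tiny dataset. In this sketch the four samples and the attribute names `sample_id` and `windy` are made up for illustration; `gain_ratio` uses the fact that $IV(A)$ is the entropy of the attribute's own value distribution.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Entropy (in bits) of the empirical distribution of `items`."""
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())


def split_by(values, labels):
    """Group the labels by attribute value: one group per D_v."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return groups.values()


def information_gain(values, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in split_by(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    # IV(A) equals the entropy of the attribute values themselves.
    return information_gain(values, labels) / entropy(values)


labels = ["yes", "yes", "no", "no"]
sample_id = [1, 2, 3, 4]      # unique per sample
windy = ["t", "t", "f", "f"]  # hypothetical attribute that separates the classes

for name, attr in (("sample_id", sample_id), ("windy", windy)):
    print(name, information_gain(attr, labels), gain_ratio(attr, labels))
```

Both attributes reach the same information gain, but the gain ratio of `sample_id` is halved by its larger intrinsic value. Note that `gain_ratio` divides by zero for an attribute with a single value; a real implementation would need to guard against that.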

## Ensemble Learning

## Random forest

A forest of uncorrelated trees.

Feature bagging, a random subset of the features.

## Boosting

Ensemble learning, Bootstrapped Aggregation (Bagging), XGBoost, Gradient boosting machine (GBM), Gradient boosting decision tree (GBDT)

### Bootstrapped Aggregation (Bagging)

Random sample with replacement.

### AdaBoost

Base (weak) classifier $h_{m}$, error rate $\epsilon_{m}$

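$$\epsilon_{m} = \sum_{i=1}^{n} w_{m}(i) \, \mathbb{1}\big(h_{m}(x_{i}) \neq y_{i}\big)$$

With uniform weights $w_{m}(i) = \frac{1}{n}$, this is the number of misclassified samples divided by the total number of samples.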

importance for base classifier $\alpha_{m}$

$$\alpha_{m} = \frac{1}{2} log \frac{1-\epsilon_{m}}{\epsilon_{m}}$$

sample weights $w_{m}(i)$ ([How does class_weight work in Decision Tree](https://datascience.stackexchange.com/questions/56250/how-does-class-weight-work-in-decision-tree))

$$w_{m+1}(i) = \frac{w_{m}(i)}{z_{m}} exp\big(-\alpha_{m} y_{i} h_{m}(x_{i})\big)$$

normalization factor $z_{m}$

$$z_{m} = \sum_{i=1}^{n} w_{m}(i) \exp\big(-\alpha_{m} y_{i} h_{m}(x_{i})\big)$$

final classifier

$$H(x) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m h_m(x)\Big)$$
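
One boosting round can be sketched as follows. The code uses the standard AdaBoost definitions: the error rate is the total weight of the misclassified samples, and the normalizer is the sum of the re-weighted samples. The five samples and the predictions of the weak classifier are made up.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost round; labels and predictions take values in {-1, +1}."""
    # Weighted error rate of the base classifier.
    eps = sum(w for w, y, h in zip(weights, labels, predictions) if h != y)
    # Importance of the base classifier.
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Misclassified samples are scaled up, correct ones are scaled down.
    unnormalized = [
        w * math.exp(-alpha * y * h)
        for w, y, h in zip(weights, labels, predictions)
    ]
    z = sum(unnormalized)  # normalization factor
    return eps, alpha, [u / z for u in unnormalized]


labels = [+1, +1, -1, -1, +1]
predictions = [+1, +1, -1, -1, -1]  # the last sample is misclassified
weights = [0.2] * 5                 # uniform initial weights

eps, alpha, new_weights = adaboost_round(weights, labels, predictions)
print(eps, alpha, new_weights)
```

After the update, the misclassified sample carries half of the total weight, which forces the next weak classifier to pay attention to it.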

### Gradient Boosting Decision Tree (GBDT)

Train multiple decision trees sequentially, each one predicting the residual left by the trees before it. The final prediction is the sum, over all trees, of the learning rate times the prediction of that tree.

## Naive Bayes

Text classification, independence assumption, Laplace smoothing

Bayes' theorem:

\begin{aligned}
P(Y | X) & = \frac{P(X, Y)}{P{(X)}} \\
         & = \frac{P(X | Y) P(Y)}{P{(X)}}
\end{aligned}

Independence assumption:

\begin{aligned}
P(X | Y) & = P(x_{1}, x_{2}, ..., x_{n} | Y) \\
         & = P(x_{1} | Y) P(x_{2} | Y) ... P(x_{n} | Y) \\
         & = \prod _ {i = 1} ^ {n} P(x_{i} | Y)
\end{aligned}

Naive Bayes:

\begin{aligned}
P(Y | X) & = \frac{P(X | Y) P(Y)}{P{(X)}} \\
         & = \frac{P(Y) \prod _ {i = 1} ^ {n} P(x_{i} | Y)}{P{(X)}} \\
         & \propto P(Y) \prod _ {i = 1} ^ {n} P(x_{i} | Y) \quad \text{(given the dataset, P(X) is a constant)}
\end{aligned}

Inference:

\begin{aligned}
\hat{Y} &=  argmax_{Y} P(Y | X) \\
        & = argmax_{Y} \log P(Y | X) \\
        & = argmax_{Y} \bigg(\log P(Y) + \sum _ {i = 1} ^ {n} \log P(x_{i} | Y)\bigg)
\end{aligned}
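
The inference rule above, with Laplace smoothing, can be sketched for text classification as follows. The four toy documents, the labels `spam` and `ham`, and the smoothing constant `alpha=1.0` are assumptions; words never seen in training are simply skipped at prediction time.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Estimate log P(Y) and log P(word | Y) with Laplace smoothing."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    log_prior = {y: math.log(c / len(labels)) for y, c in class_counts.items()}
    log_lik = {}
    for y in class_counts:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {
            w: math.log((word_counts[y][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_lik


def predict(doc, log_prior, log_lik):
    """argmax_Y [ log P(Y) + sum_i log P(x_i | Y) ]"""
    scores = {
        y: log_prior[y] + sum(log_lik[y][w] for w in doc if w in log_lik[y])
        for y in log_prior
    }
    return max(scores, key=scores.get)


docs = [
    ["free", "prize", "now"],
    ["free", "money"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_lik = train(docs, labels)

print(predict(["free", "meeting", "prize"], log_prior, log_lik))
print(predict(["lunch", "meeting"], log_prior, log_lik))
```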

## Bayesian network

Chow-Liu tree, exact inference, Junction tree, variable elimination, belief propagation, approximate inference, sampling, structural learning, scoring function

## K-Nearest neighbours (KNN)

## Expectation-Maximization (EM)

Two steps

- E-step: Estimate the missing (latent) variables in the dataset, given the current model parameters.
- M-step: Update the model parameters to maximize the likelihood of the data, given those estimates.

## Clustering

### K-means

K-means can be viewed as a special case of EM with hard assignments. $K$ is a hyperparameter selected by hand.

#### Objective

Suppose we use one-hot encoding for label $r = (r_1, r_2, ..., r_K)$. That is, if an instance $x$ belongs to cluster $i$, then $r_i = 1$ and $r_j = 0$ for all $1 \leq j \leq K, j \neq i$.

If we have a dataset of $n$ instances with $d$ features each, then $r \in \{0, 1\}^{n \times K}$ is an $n$-by-$K$ matrix and $x \in \mathbb{R}^{n \times d}$ is an $n$-by-$d$ matrix.

$$
L(\mu; x, r) = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} ||x_{i} - \mu_{k}||^{2}
$$

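Expectation-Maximization (EM)

- E-step (assign each instance to its nearest centroid): $r_{i} = argmin_{k} ||x_{i} - \mu_{k}||$
- M-step (recompute each centroid as the mean of its cluster): $\mu_{k} = \frac{1}{n_{k}} \sum_{i=1}^{n} r_{ik}x_{i}$, where $n_{k} = \sum_{i=1}^{n} r_{ik}$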
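
The alternation between assigning instances and recomputing centroids can be sketched as below. The six 2-D points, the initial centers, and the fixed iteration count are assumptions; practical implementations usually choose initial centers randomly and stop when the assignments no longer change.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, centers, iterations=10):
    """Alternate between the assignment step and the centroid update step."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            clusters[k].append(p)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[k] for k, c in enumerate(clusters)]
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```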

### DBSCAN

**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise.

First, find all core objects.

Similar to finding all the connected components by breadth-first search (BFS) using a queue. The connection is defined by “density-reachable”. The starting points are core objects.

## Principal Component Analysis (PCA)

Key concepts: variance, covariance, eigenvectors, eigenvalues

- [主成分分析PCA算法：为什么去均值以后的高维矩阵乘以其协方差矩阵的特征向量矩阵就是“投影”？ - YE Y的回答 - 知乎](https://www.zhihu.com/question/30094611/answer/120499954)
- [主成分分析法到底怎么用的？过程模模糊糊的 - 石溪的回答 - 知乎](https://www.zhihu.com/question/30044663/answer/1696535206)

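Find a new set of basis vectors such that the projection of the given data $X$ onto them has as much variance as possible along each direction. The variance should be as large as possible so that the direction is discriminative. If all the data take the same value along a direction, the variance is zero and no data point can be told apart from another along that direction; every data point looks the same.

Related concept: Singular Value Decomposition (SVD)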

## Recommender System

### Model-based Collaborative Filtering

Number of users $m$, number of items $n$, number of features $k$.

User-item matrix $R \in \mathbb{R}^{m \times n}$, user matrix $P \in \mathbb{R}^{m \times k}$, item matrix $Q \in \mathbb{R}^{k \times n}$. $R = PQ$

Suppose $p$ and $q$ are $k$-by-1 matrices, which are $k$-dimensional **column** vectors.

$$P = [p_{1}, p_{2}, ..., p_{m}]^{T}$$

$$Q = [q_{1}, q_{2}, ..., q_{n}]$$

Prediction of one element:

$$\hat{R_{ui}} = P_{u\cdot} Q_{\cdot i} = (p_{u})^{T} q_{i}$$

Loss for one element (squared error loss):

$$L(R_{ui},\hat{R_{ui}}) = (R_{ui}-\hat{R_{ui}})^{2} = (R_{ui}-(p_{u})^{T} q_{i})^{2}$$

Loss for the whole P and Q is the sum of loss of each element, plus regularization part:

$$L = \sum_{u,i}(R_{ui}-(p_{u})^{T} q_{i})^{2} + \lambda ( \sum_{u} n_{p_{u}} ||p_{u}||_{2}^{2} + \sum_{i} n_{q_{i}} ||q_{i}||_{2}^{2} )$$

### ALS

Fixing Q, optimize P, update each $p_{u}$ as:

$$p_{u} \gets \Big( \sum_{i} q_{i} q_{i}^{T} + \lambda n_{p_{u}} I \Big)^{-1} Q R_{u \cdot}^{T}$$

Fixing P, optimize Q, update each $q_{i}$ as:

$$q_{i} \gets \Big( \sum_{u} p_{u} p_{u}^{T} + \lambda n_{q_{i}} I \Big)^{-1} P^{T} R_{\cdot i}$$

ALS is **NOT** scalable to large-scale datasets, but SGD is.

### SGD

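SGD uses the following loss, with separate regularization weights for $P$ and $Q$:

$$L = \sum_{u,i} \Big[ (R_{ui}-(p_{u})^{T} q_{i})^{2} + \lambda_p ||p_{u}||_{2}^{2} + \lambda_q ||q_{i}||_{2}^{2} \Big]$$

For a single observed element $(u, i)$, define the prediction error:

$$E_{ui} = R_{ui} - (p_{u})^{T} q_{i}$$

The gradients of that element's loss term, dropping the constant factor 2, are:

$$\frac{\partial L}{\partial p_{u}} = E_{ui}(-q_{i}) + \lambda_{p} p_{u}$$

$$\frac{\partial L}{\partial q_{i}} = E_{ui}(-p_{u}) + \lambda_{q} q_{i}$$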
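
A single SGD update for one observed rating follows directly from the two gradients. In this sketch the learning rate, the regularization weights, the initial vectors, and the rating value are all made-up illustration values.

```python
def sgd_step(p_u, q_i, r_ui, lr=0.05, lam_p=0.01, lam_q=0.01):
    """One SGD update of a user vector and an item vector for one rating."""
    err = r_ui - sum(a * b for a, b in zip(p_u, q_i))  # E_ui
    grad_p = [-err * q + lam_p * p for p, q in zip(p_u, q_i)]
    grad_q = [-err * p + lam_q * q for p, q in zip(p_u, q_i)]
    # Both gradients were computed from the old vectors (simultaneous update).
    new_p = [p - lr * g for p, g in zip(p_u, grad_p)]
    new_q = [q - lr * g for q, g in zip(q_i, grad_q)]
    return new_p, new_q


def error(p_u, q_i, r_ui):
    return r_ui - sum(a * b for a, b in zip(p_u, q_i))


p_u, q_i, r_ui = [0.1, 0.1], [0.1, 0.1], 1.0
initial_error = error(p_u, q_i, r_ui)
for _ in range(200):
    p_u, q_i = sgd_step(p_u, q_i, r_ui)
final_error = error(p_u, q_i, r_ui)
print(initial_error, final_error)
```

Because of the regularization terms, the error settles at a small positive value instead of reaching exactly zero.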

## Independent components analysis (ICA)

## Reinforcement learning

## Neural network and related concepts

CNN,  
RNN, LSTM,  
BP, ReLU  
GNN  
Auto-encoder  

## Recurrent Neural Network (RNN)

### LSTM - Long Short Term Memory network

Three gates:

- Forget gate ($f$)
- Input gate ($i$)
- Output gate ($o$)

The equations of the three gates ($f, i, o$) all share the form $g_{t} = sigmoid(W_{g}[h_{t-1}, x_{t}] + b_{g})$ where $g \in \{f, i, o\}$.

Equations and variables:

- Forget gate
  - how much old memory (last cell state) to keep (while the rest is forgotten)
  - input: last hidden state $h_{t-1}$, current input $x_{t}$
  - output: the proportion $f_{t}$ of old memory to keep
- Input gate
  - the proportion $i_{t}$ of candidate new cell state $\tilde{C}_{t}$ that should be added to the final new cell state $C_{t}$
  - input: last hidden state $h_{t-1}$, current input $x_{t}$
  - output: the proportion $i_{t}$
- Output gate
  - the proportion $o_{t}$ of the final new cell state $C_{t}$ that should be output
  - input: last hidden state $h_{t-1}$, current input $x_{t}$
  - output: the proportion $o_{t}$
- Candidate new cell state
  - input: last hidden state $h_{t-1}$, current input $x_{t}$
  - output: candidate new cell state $\tilde{C}_{t}$
  - $\tilde{C}_{t} = \tanh(W_{C}[h_{t-1}, x_{t}] + b_{C})$
- Final new cell state
  - part of the old memory to forget, and part of the candidate new memory to add
  - $C_{t} = f_{t} * C_{t-1} + i_{t} * \tilde{C}_{t}$
- Step output
  - part of the final new cell state
  - $h_{t} = o_{t} * \tanh(C_{t})$
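
The whole step can be sketched with scalar states, which keeps the data flow visible without any matrix library. The `params` layout (one `(w_h, w_x, b)` triple per gate and for the candidate) is an assumption made for this sketch; real implementations use weight matrices over vectors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state, and cell state.

    params[g] = (w_h, w_x, b) for g in "f", "i", "o" (gates) and "c" (candidate).
    """
    def affine(g):
        w_h, w_x, b = params[g]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(affine("f"))        # proportion of old memory to keep
    i_t = sigmoid(affine("i"))        # proportion of candidate memory to add
    o_t = sigmoid(affine("o"))        # proportion of cell state to output
    c_tilde = math.tanh(affine("c"))  # candidate new cell state
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# With all-zero parameters every gate outputs 0.5 and the candidate is 0.
zero_params = {g: (0.0, 0.0, 0.0) for g in "fioc"}
h_t, c_t = lstm_step(x_t=0.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h_t, c_t)
```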

References:

- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://d2l.ai/chapter_recurrent-modern/lstm.html

### GRU - Gated Recurrent Unit

Two gates:

- Reset gate (r)
  - How much old memory (last hidden state) is used to compute the candidate new hidden state ($\tilde{h}_{t}$)
- Update gate (z)
  - The proportion ($z$) of the old hidden state and the proportion ($1-z$) of the candidate new hidden state that form the final new hidden state
  - $h_{t} = z_{t} * h_{t-1} + (1 - z_{t}) * \tilde{h}_{t}$

The equations of the two gates (r, z) both share the form $g_{t} = sigmoid(W_{g}[h_{t-1}, x_{t}] + b_{g})$ where $g \in \{r, z\}$.

References:

- https://d2l.ai/chapter_recurrent-modern/gru.html

### Connection Between LSTM and GRU

In both models, the new memory consists of part of the old memory and part of the new candidate memory.

- LSTM
  - $C_{t} = f_{t} * C_{t-1} + i_{t} * \tilde{C}_{t}$
- GRU
  - $h_{t} = z_{t} * h_{t-1} + (1 - z_{t}) * \tilde{h}_{t}$

But LSTM uses two separate gates, forget gate ($f_{t}$) and input gate ($i_{t}$), while GRU uses a single update gate ($z_{t}$) to control the proportions of the old and the new memory.

### Sequence to Sequence

Encoder and decoder. Hidden states. Context vector (the encoder's last hidden state).

Input/output elements are characters or words (index). The indexes need to be changed to embeddings.

Simple seq2seq does not need a maximum sentence length constraint, but seq2seq with **attention mechanism** needs one.

[Pytorch tutorial for seq2seq](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html)

"***Teacher forcing***" is the concept of using the **real target** outputs as each next input, instead of using the **decoder's guess** as the next input. Using teacher forcing causes it to converge faster but when the trained network is exploited, it may exhibit instability.

**Beam search** is often applied to seq2seq models at test time. It keeps **multiple** candidate solutions, instead of only the best one, at each step. The number of solutions kept at each step is controlled by a parameter called the **beam width**. If the beam width is set to 1, beam search degenerates to greedy search. When implementing the algorithm, we usually add log-probabilities instead of multiplying probabilities, to avoid underflow. One disadvantage of beam search is that it tends to prefer shorter sentences, because a longer sentence tends to get a lower probability. Here is [an article about beam search](https://medium.com/@dhartidhami/beam-search-in-seq2seq-model-7606d55b21a5).
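
A minimal sketch of beam search in log space is shown below. The `TOY_MODEL` table stands in for a trained decoder and is entirely made up; it is built so that greedy search (beam width 1) misses the most probable two-token sequence. The sketch decodes a fixed number of steps and ignores end-of-sequence handling and length normalization.

```python
import math

# Hypothetical next-token distributions, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def step_fn(tokens):
    """Return P(next token | tokens) for the toy model."""
    return TOY_MODEL[tuple(tokens)]


def beam_search(step_fn, beam_width, max_len):
    """Return the surviving hypotheses as (log-probability, tokens), best first."""
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in step_fn(tokens).items():
                # Adding log-probabilities avoids underflow from multiplying.
                candidates.append((score + math.log(prob), tokens + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


for width in (1, 2):
    score, tokens = beam_search(step_fn, beam_width=width, max_len=2)[0]
    print(width, tokens, round(math.exp(score), 2))
```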

**Orthogonal initialization** is an approach to deal with the problem of vanishing/exploding gradients. Here is [an article about orthogonal initialization](https://medium.com/@dhartidhami/beam-search-in-seq2seq-model-7606d55b21a5).

## Dropout

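Dropout can be understood as keeping the model from relying too heavily on any single feature, because that feature may be removed (dropped out) at any time.

A drawback of dropout is that the loss function is no longer strictly defined. With a strictly defined loss function, every iteration should move in the direction of decreasing loss; dropout introduces randomness, so a decrease in loss can no longer be guaranteed at every iteration. A practical tip when programming: while debugging, remove dropout or set its probability to 0, and check whether the loss decreases at every iteration. If it does, the model is at least working properly and correctly; then set the dropout probability you want and train.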
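
A minimal sketch of dropout on a list of activations is shown below. It uses the common "inverted dropout" convention (surviving values are rescaled by $1/(1-p)$ so that the expected activation is unchanged), which is an assumption; the notes do not say which variant is meant. Setting `p=0.0` or `training=False` makes the function deterministic, which is what the debugging tip relies on.

```python
import random


def dropout(values, p, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)  # deterministic: nothing is dropped
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, p=0.0))                  # unchanged, for debugging
print(dropout(activations, p=0.5, training=False))  # unchanged at test time
print(dropout(activations, p=0.5, rng=random.Random(0)))
```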

## Optimizers

Common concepts: SGD, momentum, Nesterov accelerated gradient (NAG), Adagrad, Adadelta (RMSProp), Adam, Nadam.

(The following code block is written in [Mermaid](https://mermaid-js.github.io/mermaid). Jupyter Notebook does not support rendering it for now.)
```mermaid
graph LR
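A((SGD)) --> B[Use first-order momentum,<br/>exploit historical direction information] --> C((SGD with<br/>Momentum)) --> J[Look Ahead<br/>estimate the gradient at the<br/>approximate next position] --> K((NAG))
A --> D[Accumulate second-order momentum, use<br/>historical update magnitudes to adjust the learning rate] --> E((AdaGrad)) --> F[Switch from global accumulation to<br/>windowed accumulation, fixing the<br/>endlessly shrinking learning rate] --> G((AdaDelta<br/>- - -<br/>RMSProp))
C & G --> H[Use both first-order<br/>and second-order momentum] --> I((Adam))
I & K --> L[Combine] --> M((NAdam))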
```
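
Two nodes of the diagram, SGD with momentum and Adam, can be sketched as single-parameter update rules. The hyperparameter defaults and the test function $f(\theta) = \theta^2$ are assumptions for illustration; the momentum variant shown folds the learning rate into the velocity, which is one of several common formulations.

```python
import math


def momentum_step(theta, velocity, grad, lr=0.1, mu=0.9):
    """SGD with momentum: the velocity accumulates past gradient directions."""
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity


def adam_step(theta, m, v, t, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first moment (m) and second moment (v) with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def grad_fn(theta):
    """Gradient of the illustrative objective f(theta) = theta^2."""
    return 2 * theta


theta, velocity = 5.0, 0.0
for _ in range(3):
    theta, velocity = momentum_step(theta, velocity, grad_fn(theta))
    print("momentum", theta)

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 4):  # t starts at 1 for the bias correction
    theta, m, v = adam_step(theta, m, v, t, grad_fn(theta))
    print("adam", theta)
```

On the first step, Adam moves by roughly the learning rate regardless of the gradient's magnitude, while the momentum step scales with the gradient.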

- [An overview of gradient descent optimization algorithms (Sebastian Ruder, 2017)](https://arxiv.org/pdf/1609.04747.pdf)
- [优化器(Optimizer) - 杰奏 - 知乎专栏](https://zhuanlan.zhihu.com/p/261695487)