# <center> Project1 </center>

In this problem we will investigate handwritten digit classification. 

The inputs are 16 by 16 grayscale images
of handwritten digits (0 through 9), and the goal is to predict the digit shown in the given image. If you run
example_neuralNetwork it will load this dataset and train a neural network with stochastic gradient descent,
printing the validation set error as it goes. To handle the 10 classes, there are 10 output units (using a {−1, 1}
encoding of each of the ten labels) and the squared error is used during training. Your task in this question is to
modify this training procedure and architecture to optimize performance.

Report the best test error you are able to achieve on this dataset, and report the modifications you made to
achieve this. Please refer to the previous instructions for writing the report.

# <center> task0: Baseline results from the MATLAB code </center>

<div style="text-align:center;">
    <img src="./images/matlab_figure.png" alt="Figure 1" style="width:250px;" />
</div>


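Observations:

The run took 100000 iterations in total, and the final error rate on test_data was 0.457.


From the source code, we can see that:
1. The input data is normalized.
2. The bias b is fixed at the start and never updated afterwards.
3. The final loss function is L2_loss.
4. There is only one hidden layer, of size (d, 10).
5. Each gradient update uses a single sample from X_train (minibatch).

The following tasks build improvements on this baseline.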

# <center> task0.1: Porting the MATLAB code to Python with a few small improvements </center>

<div style="display:flex; justify-content:center;">
    <img src="./images/original_figure.png" alt="Image 1" style="height:180px; margin: 10px;" />
    <img src="./images/original_result.png" alt="Image 2" style="height:180px; margin: 10px;" />
</div>

<div style="display:flex; justify-content:center;">
    <img src="./images/original_figure2.png" alt="Image 3" style="height:200px; margin: 10px;" />
    <img src="./images/original_result2.png" alt="Image 4" style="height:200px; margin: 10px;" />
</div>

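The first pair of figures is for 50000 iterations, with a test_data error rate of 0.171; the second pair is for 100000 iterations, with a test_data error rate of 0.13.

The code keeps:
1. The input data is normalized.
2. The bias b is fixed at the start and never updated afterwards.
3. The final loss function is L2_loss.
4. There is only one hidden layer, of size (d, 10).

The code improves:
1. Each gradient update uses all of the X_train data.

The error rate drops noticeably.

Because each gradient update now uses all of X_train, far fewer iterations are needed overall, so from here on we run 35000 iterations.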

# <center> task1: Change the network structure </center>

Approach: change the structure directly in matlab_original.py (the numbers give the layer dimensions in order).

Source of the design ideas: https://zhuanlan.zhihu.com/p/100419971

Chosen designs:

[128], [128, 64], [128, 64, 16]

```python
for nHidden_layer in [[128], [128, 64], [128, 64, 16]]:
    print('Running with ', nHidden_layer)
    start_time = time.time()
    model = MLP(
        nHidden_layer,
        input_dim=d,
        weight_scale=weight_scale,
        dtype=np.float64
    )
    solver = Backprop(
        model,
        data,
        print_every=3000,
        max_iteration = 35000,
        optim_config={"learning_rate": learning_rate},
    )
    solvers["{}".format(nHidden_layer)] = solver
    solver.train()
```

<div style="text-align:center;">
    <img src="./images/task1_Figure_1.png" alt="Figure 1" style="height:250px;weigh:200px" />
</div>


<div style="display:flex; justify-content:center;">
    <img src="./images/task1_result1.png" alt="Image 3" style="height:170px; margin: 10px;" />
    <img src="./images/task1_result2.png" alt="Image 4" style="height:170px; margin: 10px;" />
    <img src="./images/task1_result3.png" alt="Image 4" style="height:170px; margin: 10px;" />
</div>

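The three nHidden_layer settings give errors of 0.107, 0.11, and 0.133.

A more complex network with more neurons can fit more complex features, but it may also need more training time, which is likely why the errors above increase with depth for the same number of iterations.

The total number of training iterations ties into early stopping and other ways of preventing overfitting while strengthening the network; this is discussed in later tasks.

Overall, though, the trade-off between cost and accuracy is better than the original (0.13 after 100000 iterations).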

# <center> task2: Change the training procedure  </center>

## <center> （1）modifying the sequence of step-sizes </center>

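We follow these guidelines:
1. At the start of training, a learning rate between 0.01 and 0.001 is appropriate.
2. After a certain number of epochs, it is gradually reduced.
3. Near the end of training, the learning rate should have decayed by a factor of 100 or more.

So we add lr_decay = 0.95 to the network and test learning rates of 1e-2 and 1e-3 (originally 1e-3).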

```python
    """ in Backprop.train() """
    ...
            # every 300 iteration decay
            if num_iterations % 300 == 0:
                for k in self.optim_configs:
                    self.optim_configs[k]["learning_rate"] *= self.lr_decay
    ...

    """ when train """
    solver = Backprop(
        model,
        data,
        print_every=3000,
        max_iteration = 35000,
        optim_config={"learning_rate": learning_rate},
        lr_decay = 0.95)
    
```
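As a rough check on guideline 3, the schedule above can be simulated. This is a minimal sketch, assuming one decay per full block of 300 iterations (whether iteration 0 also triggers a decay depends on how `Backprop.train()` counts); `decayed_lr` is a hypothetical helper, not part of the project code.

```python
def decayed_lr(base_lr, iteration, lr_decay=0.95, decay_every=300):
    """Learning rate after `iteration` steps of step-wise exponential decay."""
    num_decays = iteration // decay_every  # number of decays applied so far
    return base_lr * lr_decay ** num_decays


final_lr = decayed_lr(1e-2, 35000)
shrink_factor = 1e-2 / final_lr  # how many times smaller the final step size is
print(f"final lr = {final_lr:.2e}, shrink factor = {shrink_factor:.0f}x")
```

Under these assumptions the step size shrinks by well over 100x across 35000 iterations, which is in line with guideline 3.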

<div style="text-align:center;">
    <img src="./images/task2_Figure_1.png" alt="Figure 1" style="height:270px;" />
</div>

<div style="display:flex; justify-content:center;">
    <img src="./images/task2_result1.png" alt="Image 3" style="height:220px; margin: 10px;" />
    <img src="./images/task2_result2.png" alt="Image 4" style="height:220px; margin: 10px;" />
</div>

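The two runs give errors of 0.046 and 0.119.

The number of iterations is clearly insufficient (the error for 1e-3 with lr_decay is larger than that of the corresponding run without decay in the previous task).

A larger learning_rate converges faster, so from here on we keep the number of iterations unchanged and use a learning_rate of 1e-2.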

## <center> （2）using different step-sizes for different variables </center>

We added SGD_momentum, RMSProp, and Adam.

See original.optim for the optimizer code.

```python
    # Backprop
    # Perform a parameter update
        for p, w in self.model.params.items():
            dw = grads[p]
            config = self.optim_configs[p]
            next_w, next_config = self.update_rule(w, dw, config)
            self.model.params[p] = next_w
            self.optim_configs[p] = next_config
    ...
    # solver.train()
        weight_scale = 1e-1  
        learning_rate = 1e-2
        solver = Backprop(
        model,
        data,
        update_rule=update_rule,
        print_every=3000,
        max_iteration = 35000,
        optim_config={"learning_rate": learning_rate},
        lr_decay = 0.95)
```
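The actual update rules live in original.optim and are not reproduced here. As an illustration only, the following sketch shows how momentum and Adam updates could be written against the same `(w, dw, config)` interface. The config keys (`momentum`, `velocity`, `beta1`, `beta2`, `epsilon`, `m`, `v`, `t`) and the default values are assumptions, not necessarily the ones the project uses.

```python
import numpy as np


def sgd_momentum(w, dw, config=None):
    """SGD with momentum: keep a running velocity and step along it."""
    config = {} if config is None else config
    config.setdefault("learning_rate", 1e-2)
    config.setdefault("momentum", 0.9)
    v = config.get("velocity", np.zeros_like(w))
    v = config["momentum"] * v - config["learning_rate"] * dw
    config["velocity"] = v
    return w + v, config


def adam(w, dw, config=None):
    """Adam: per-parameter step sizes from bias-corrected moment estimates."""
    config = {} if config is None else config
    config.setdefault("learning_rate", 1e-3)
    config.setdefault("beta1", 0.9)
    config.setdefault("beta2", 0.999)
    config.setdefault("epsilon", 1e-8)
    config.setdefault("m", np.zeros_like(w))
    config.setdefault("v", np.zeros_like(w))
    config.setdefault("t", 0)

    config["t"] += 1
    # exponential moving averages of the gradient and its square
    config["m"] = config["beta1"] * config["m"] + (1 - config["beta1"]) * dw
    config["v"] = config["beta2"] * config["v"] + (1 - config["beta2"]) * dw ** 2
    # bias correction for the zero initialization of m and v
    m_hat = config["m"] / (1 - config["beta1"] ** config["t"])
    v_hat = config["v"] / (1 - config["beta2"] ** config["t"])
    next_w = w - config["learning_rate"] * m_hat / (np.sqrt(v_hat) + config["epsilon"])
    return next_w, config
```

Because Adam divides by a running estimate of the gradient magnitude, each parameter effectively gets its own step size, which is the point of this sub-task.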

<div style="text-align:center;">
    <img src="./images/task2_Figure_2.png" alt="Figure 1" style="height:230px;" />
</div>

<div style="display:flex; justify-content:center;">
    <img src="./images/task2_result21.png" alt="Image 5" style="height:90px; weigh: 90px; margin: 10px;" />
    <img src="./images/task2_result22.png" alt="Image 6" style="height:90px; weigh: 90px;margin: 10px;" />
    <img src="./images/task2_result23.png" alt="Image 7" style="height:90px; weigh: 90px;margin: 10px;" />
    <img src="./images/task2_result24.png" alt="Image 8" style="height:90px; weigh: 90px;margin: 10px;" />
</div>

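The four update rules give errors of 0.048, 0.047, 0.103, and 0.064.

Adding momentum speeds up convergence, which means fewer training iterations are needed to reach a more accurate network; adam converges fastest.

Some problems also show up:

For example, with adam the training_acc reaches 1 before 3000 steps, which indicates that the model has overfit. Countermeasures such as early stopping, data augmentation, regularization, or dropout are needed (these are tested in later tasks).

Overall, adding lr_decay and momentum improves the model's error.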

# <center> task3: Vectorize evaluating the loss function  </center>

It turns out I had already done this when porting the code to Python 😊

So each pass computes the gradients for all samples at once.

For example (from layer_utils.py etc.):
```python
    def affine_backward(dout, cache):
        x, w = cache
        N = x.shape[0]
        dx = dout.dot(w.T).reshape(x.shape) # dx = (N, d)
        x_flat = x.reshape(N, -1)
        dw = x_flat.T.dot(dout) # dW = (D, M)
        return dx, dw
    
    def L2_loss(x, y):
        N = x.shape[0]
        C = x.shape[1]
        loss = np.sum(np.sum((1/2)*(x - np.eye(C)[y])**2, axis = 1)) / N
        dx =  (x - np.eye(C)[y]) / N
        return loss, dx
    ...
    def sgd(w, dw, config=None):
        if config is None:
            config = {}
        config.setdefault("learning_rate", 1e-2)

        w -= config["learning_rate"] * dw
        return w, config
    ...
```
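To make the effect of vectorization concrete, the sketch below compares a per-sample loop with a vectorized `L2_loss` equivalent to the one above on random data. The loop version `L2_loss_loop` is hypothetical, written only for this comparison.

```python
import numpy as np


def L2_loss(x, y):
    """Vectorized squared error against one-hot targets (as in the excerpt above)."""
    N, C = x.shape
    diff = x - np.eye(C)[y]
    return np.sum(0.5 * diff ** 2) / N, diff / N


def L2_loss_loop(x, y):
    """Same quantity computed one sample at a time."""
    N, C = x.shape
    loss, dx = 0.0, np.zeros_like(x)
    for n in range(N):
        target = np.zeros(C)
        target[y[n]] = 1.0
        diff = x[n] - target
        loss += 0.5 * np.sum(diff ** 2)
        dx[n] = diff
    return loss / N, dx / N


rng = np.random.default_rng(0)
scores = rng.normal(size=(64, 10))
labels = rng.integers(0, 10, size=64)
loss_vec, dx_vec = L2_loss(scores, labels)
loss_loop, dx_loop = L2_loss_loop(scores, labels)
print(np.isclose(loss_vec, loss_loop), np.allclose(dx_vec, dx_loop))
```

Both versions compute the same loss and gradient; the vectorized one simply replaces the Python loop with array operations.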


See task0.1 for the detailed results.

# <center> task4: Regularization  </center>

## <center> (1) weight decay </center>

I chose to add $l_2$ regularization.
Since the other gradient update rules converge faster (see task2), from here on I only run 15000 iterations.

```python
    # MLP-loss func
        ...
        loss, dout = L2_loss(scores,y)
        for i in range(self.num_layers):
            W = self.params['W' + str(i+1)]
            loss += 0.5 * self.reg * np.sum(W*W) # add
        ...
        dout,dw = affine_backward(dout,caches[self.num_layers - 1])
        # caches[self.num_layers - 1] is the cache of the last layer
        dw += self.reg * self.params['W'+str(self.num_layers)] # add
        ...
        for i in range(self.num_layers-2,-1,-1):
            dout, dw = affine_tanh_backward(dout,caches[i])
            dw += self.reg * self.params['W'+str(i+1)] # add
            grads['W'+str(i+1)] = dw
        ...
    # train
    for reg in [1, 0.5, 2]:
        print('Running with ', reg)
        start_time = time.time()
        model = MLP(
            nHidden_layer,
            input_dim=d,
            reg = reg,
            weight_scale=weight_scale,
            dtype=np.float64)
```
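The `# add` lines rely on the fact that the gradient of `0.5 * reg * np.sum(W*W)` with respect to `W` is `reg * W`. A quick numerical check of that identity, as a sketch (the helpers `reg_loss` and `numerical_grad` are hypothetical and not part of the project code):

```python
import numpy as np


def reg_loss(W, reg):
    """L2 penalty as added to the loss in the excerpt above."""
    return 0.5 * reg * np.sum(W * W)


def numerical_grad(f, W, h=1e-5):
    """Centered finite-difference gradient of f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h
        plus = f(W)
        W[idx] = old - h
        minus = f(W)
        W[idx] = old  # restore the entry
        grad[idx] = (plus - minus) / (2 * h)
    return grad


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
reg = 1e-3
analytic = reg * W
numeric = numerical_grad(lambda M: reg_loss(M, reg), W)
max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```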


<div style="text-align:center;">
    <img src="./images/task4_Figure_1.png" alt="Figure 1" style="height:250px;" />
</div>

<div style="display:flex; justify-content:center;">
    <img src="./images/task4_result1.png" alt="Image 3" style="height:100px;weigh: 90px; margin: 10px;" />
    <img src="./images/task4_result2.png" alt="Image 4" style="height:100px; weigh: 90px; margin: 10px;" />
    <img src="./images/task4_result3.png" alt="Image 4" style="height:100px; weigh: 90px; margin: 10px;" />

</div>

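The results are worse than before: the errors are 0.655, 0.333, and 0.306.

(Possibly because sgd_momentum had not overfit to a training accuracy of 1 in the first place, so adding reg hurt more.)

So I also tried adam, which was already overfitting: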

<div style="text-align:center;">
    <img src="./images/task4_result4.png" alt="Figure 4" style="height:230px;" />
</div>

The results are considerably better than before: the errors are 0.043 and 0.027, amazing!! And this took only 4000 iterations.

This shows that reg is quite effective with adam.

## <center> (2) early stopping </center>

Just add the following to Backprop:
```python
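        # early stop: quit once validation accuracy stops improving
        # (the emptiness check avoids an IndexError on the first evaluation)
        if self.val_acc_history and val_acc <= self.val_acc_history[-1]: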
            break
        else:
            self.train_acc_history.append(train_acc)
            self.val_acc_history.append(val_acc)
            
            print("(training iteration %d / %d) train acc: %f; val_acc: %f"
                    % (t, num_iterations, train_acc, val_acc))
```
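The check above stops at the first evaluation where validation accuracy fails to improve, which can trigger on noise. A common variant waits for several non-improving evaluations before stopping. The sketch below illustrates that idea on a made-up accuracy sequence; `EarlyStopper` and its `patience` parameter are hypothetical and not part of the Backprop class.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_acc = float("-inf")
        self.bad_evals = 0

    def should_stop(self, val_acc):
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.bad_evals = 0  # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# made-up validation accuracies, for illustration only
val_accs = [0.90, 0.93, 0.92, 0.95, 0.94, 0.95, 0.94, 0.96]
stopper = EarlyStopper(patience=3)
stopped_at = None
for step, acc in enumerate(val_accs):
    if stopper.should_stop(acc):
        stopped_at = step
        break
print(stopped_at, stopper.best_acc)
```

With `patience=1` this reduces to the stricter rule used in the excerpt.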



<div style="text-align:center;">
    <img src="./images/task4_result21.png" alt="Figure 4" style="height:70px;" />
</div>

The error is 0.025.

This shows that with adam (and reg = 1e-3), 1500 iterations are enough to reach good results.


# <center> task5: Softmax Loss  </center>

Simply replace the loss in MLP with softmax_loss:
```python
def softmax_loss(x, y):

    loss, dx = None, None

    N = x.shape[0]

    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    
    loss = -np.log(probs[range(N), y]) 
    loss = np.sum(loss) / N
    
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    
    return loss, dx
```
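The subtraction of the row-wise maximum in `softmax_loss` does not change the probabilities but keeps `np.exp` from overflowing. A small sketch of the difference on deliberately large scores (`softmax_probs` is a hypothetical helper that mirrors the first two lines of the function above):

```python
import numpy as np


def softmax_probs(x):
    """Row-wise softmax with the max-subtraction trick."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)


x = np.array([[1000.0, 1001.0, 1002.0]])

# naive version: exp overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)

stable = softmax_probs(x)
print(np.isnan(naive).any(), stable)
```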

<div style="text-align:center;">
    <img src="./images/task5_result1.png" alt="Figure 4" style="height:250px;" />
</div>

I ran adam three times; the errors were 0.033, 0.051, and 0.048,

with reg set to 0.001, 0.01, and 0.01 respectively.

However, the first run overfit 😢

# <center> task6: Each layer has a bias  </center>

I rewrote the MLP so that every layer initializes its own bias b and computes db alongside dw
(the code is long, so only a short excerpt is shown):
```python
    # MLP
            W = np.random.normal(0,weight_scale,(layer_dims[i],layer_dims[i+1]))
            b = np.zeros(layer_dims[i+1])
            self.params['W' + str(i+1)] = W
            self.params['b' + str(i+1)] = b
        ...
            for i in range(self.num_layers-2,-1,-1):

                dout,dw,db = affine_tanh_backward(dout,caches[i])
                dw += self.reg * self.params['W'+str(i+1)]

                grads['W'+str(i+1)] = dw
                grads['b'+str(i+1)] = db
    # layer_utils
        def affine_tanh_forward(x, w, b):

            a, fc_cache = affine_forward(x, w, b)
            out, tanh_cache = tanh_forward(a)
            cache = (fc_cache, tanh_cache)
            return out, cache
        ...
```

<div style="text-align:center;">
    <img src="./images/task6_result1.png" alt="Figure 4" style="height:300px;" />
</div>

The recorded runs and their errors:

| Run | reg | nHidden | error |
|---|---|---|---|
| 1 | 0.001 | [128, 64] | 0.03 |
| 2 | 0.01 | [128, 64] | 0.046 |
| 3 | 0.001 | [128, 64, 16] | 0.034 |
| 4 | 0.001 | [128, 64, 64] | 0.034 |


# <center> task7: Dropout  </center>

Add dropout to the MLP in changed:
```python
    # MLP
        # dropout
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {"mode": "train", "p": dropout_keep_ratio}
```
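The forward and backward passes of the dropout layer are not shown in the excerpt. A minimal sketch of inverted dropout, assuming `p` is the keep probability (as the name `dropout_keep_ratio` suggests) and that `mode` switches between `train` and `test`; the function names and the optional `rng` argument are assumptions:

```python
import numpy as np


def dropout_forward(x, dropout_param, rng=None):
    """Inverted dropout: scale kept units by 1/p at train time, identity at test time."""
    p, mode = dropout_param["p"], dropout_param["mode"]
    rng = np.random.default_rng() if rng is None else rng
    if mode == "train":
        mask = (rng.random(x.shape) < p) / p  # keep with probability p, rescale
        out = x * mask
    else:
        mask = None
        out = x
    return out, (dropout_param, mask)


def dropout_backward(dout, cache):
    """Gradients flow only through the units that were kept."""
    dropout_param, mask = cache
    if dropout_param["mode"] == "train":
        return dout * mask
    return dout


x = np.ones((200, 200))
out_train, cache = dropout_forward(x, {"mode": "train", "p": 0.75}, rng=np.random.default_rng(0))
out_test, _ = dropout_forward(x, {"mode": "test", "p": 0.75})
print(out_train.mean(), out_test.mean())
```

Scaling by `1/p` during training keeps the expected activation unchanged, so nothing needs to be rescaled at test time.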

<div style="display:flex; justify-content:center;">
    <img src="./images/task7_result1.png" alt="Image 5" style="height:210px; margin: 10px;" />
    <img src="./images/task7_result2.png" alt="Image 6" style="height:210px; margin: 10px;" />
</div>

These correspond to:

| Run | lr_rate | reg | nHidden | dropout_keep_ratio | test_error |
|--------|--------|--------|---------|----------|--------|
| 1 | 1e-3 | 1e-3 | [128] | 1 | 0.028 |
| 2 | 1e-3 | 1e-3 | [128] | 0.75 | 0.039 |
| 3 | 1e-3 | 1e-3 | [128] | 0.5 | 0.059 |
| 4 | 1e-2 | 1e-3 | [128, 64] | 1 | 0.03 |
| 5 | 1e-2 | 1e-3 | [128, 64] | 0.75 | 0.038 |
| 6 | 1e-2 | 1e-3 | [128, 64] | 0.5 | 0.054 |

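Dropout may somewhat reduce the network's representational capacity (fewer neurons are used on each pass), but it can improve generalization to some extent. A model with dropout behaves like an ensemble of many models. If P (the fraction of neurons kept) is too small, however, the model struggles to fit the dataset.

Both reg and dropout improve generalization, but the features of this dataset are fairly simple, so adding too much of either can backfire.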

# <center> task8: Fine-tuning </center>

We use the best configuration so far:
<center> L2_loss, weight_scale = 1e-1 , learning_rate = 1e-2, reg = 1e-3, </center>
<center> Hidden_layer = [128, 64], dropout_keep_ratio = 1 for fine-tuning, and then save the parameters </center>

<center> (training iteration 0 / 10000) train acc: 0.352400; val_acc: 0.334400 </center>

<center> (training iteration 1500 / 10000) train acc: 0.999600; val_acc: 0.970400 </center>

<center> Test set error:  0.025 </center>

<center> Running time: 117.42383909225464 s </center>



```python
    dist = {}
    for p, w in best_model.params.items():
        dist.update({p: w})

    with open(filename, "wb") as f:
        pickle.dump(dist, f)
```

Freeze the first two layers:

```python
    
    with open(filename, "rb") as file:  
        loaded_data = pickle.load(file)

    # compute the inputs to the last layer
    x1 = np.tanh(X.dot(loaded_data["W1"]) + loaded_data["b1"]) # (5000,128)
    x2 = np.tanh(x1.dot(loaded_data["W2"]) + loaded_data["b2"]) # (5000,64)
```

The final loss is L2_loss, so W3 can be solved by least squares:

```python
    # solve for W3 by least squares (via the normal equations)
    W3 = np.linalg.lstsq(x2.T @ x2, x2.T @ y3, rcond=None)[0]
```
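Forming `x2.T @ x2` explicitly squares the condition number of the problem. `np.linalg.lstsq` can also be applied to `x2` and the targets directly, which solves the same least-squares problem. The sketch below checks on synthetic data that both routes give the same `W3`. The shapes mirror the comments above, but the random `x2` and `y3` are stand-ins, and the one-hot encoding of `y3` is an assumption (the encoding used in task8.py is not shown here).

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = np.tanh(rng.normal(size=(5000, 64)))        # stand-in for second-layer activations
y3 = np.eye(10)[rng.integers(0, 10, size=5000)]  # stand-in one-hot targets (assumption)

# route 1: normal equations, as in the snippet above
W3_normal = np.linalg.lstsq(x2.T @ x2, x2.T @ y3, rcond=None)[0]
# route 2: least squares on the original system
W3_direct = np.linalg.lstsq(x2, y3, rcond=None)[0]

print(W3_direct.shape, np.allclose(W3_normal, W3_direct))
```

On well-conditioned activations the two agree; the direct form is usually the safer choice when the activations are nearly collinear.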

Finally, predict ytest:

```python
    xtest1 = np.tanh(Xtest.dot(loaded_data["W1"]) + loaded_data["b1"])
    xtest2 = np.tanh(xtest1.dot(loaded_data["W2"]) + loaded_data["b2"])
    xtest_final = xtest2.dot(W3)

    y_test_pred = np.argmax(xtest_final, axis=1)
```

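See task8.py for the code.


The final result is
<center> Test set error:  0.025 </center>

Note: to test a different network or different parameters, uncomment the task8 code in changed_train.py and then run task8.py directly (changing nHidden requires manually adjusting the structure in task8.py to match; the rest stays largely the same).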

# <center> task9: Data augmentation </center>

## <center> (1) Resize </center>

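Shrink the images. The original data is already hard to recognize, and the shrunken version seems slightly better? (The overall image size stays 16*16.)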

<div style="display:flex; justify-content:center;">
    <img src="./images/combined.jpg" alt="Image 5" style="height:200px; margin: 10px;" />
    <img src="./images/combined_image.jpg" alt="Image 6" style="height:200px; margin: 10px;" />
</div>

We generate 2000 images; for example:
```python
    # paste an 8x8 downscaled copy of each sampled digit onto the center of a black 16x16 canvas
    offset = ((16 - 8) // 2, (16 - 8) // 2)
    for i, idx in enumerate(random_numbers1):
        canvas = Image.new('L', (16, 16), color='black')

        img = Image.fromarray(X[idx].reshape((16, 16)).astype(np.uint8))
        smaller_image = img.resize((8, 8))

        canvas.paste(smaller_image, offset)
        canvas_array = np.asarray(canvas).reshape(1, -1)
        new_data1[i] = canvas_array
```
See data_augmentation.py for more.

## <center> (2) Add noise </center>

Add random noise to the images, for example:

<div style="display:flex; justify-content:center;">
    <img src="./images/noisy_image.jpg" alt="Image 5" style="height:180px; margin: 10px;" />
    <img src="./images/noisy_image1.jpg" alt="Image 6" style="height:180px; margin: 10px;" />
</div>

Randomly generate 2000 of them:

```python
    # add Gaussian pixel noise to each sampled image and clip back to the valid 0..255 range
    for x, idx in enumerate(random_numbers2):
        noise = np.random.normal(loc=0, scale=50, size=(1, 256))

        noisy_image_array = np.clip(X[idx] + noise, 0, 255).astype(np.uint8)
        new_data1[x] = noisy_image_array

    new_y2 = np.empty((2000,1))
```
See data_augmentation.py for more.

## <center> (3) Rotation </center>

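Since a 9 turned upside down becomes a 6, and some other digits have similar peculiarities, I did not use this method.

We again try the configuration with the lowest error so far: weight_scale = 1e-1, learning_rate = 1e-2, reg = 1e-3, nHidden_layer = [128, 64].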


<div style="display:flex; justify-content:center;">
    <img src="./images/task9_result.png" alt="Image 5" style="height:200px; margin: 10px;" />
    <img src="./images/task9_result2.png" alt="Image 6" style="height:200px; margin: 10px;" />
</div>

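The errors are 0.05, 0.057, 0.052, 0.061, 0.142, and 0.123.

The datasets used are, in order: with resize added, with noise added, and with both added.

With more kinds of images, the training acc dropped in every case.

So for the second figure I deepened the network to nHidden = [128, 128, 64].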


# <center> task10: 2D convolutional layer </center>

All the changed code is in conv_main.
Only the first layer needs to be replaced with a conv layer:
```python
    #MLP_conv
    for i in range(self.num_layers):
            ## conv
            if i == 0:
                self.params['W' + str(i+1)] = np.random.normal(0.0,weight_scale,(num_filters, C, filter_size, filter_size))
                self.params['b' + str(i+1)] = np.zeros(num_filters)
                ...
    for i in range(self.num_layers -1):
            if i == 0:
                W = self.params['W'+str(i+1)]
                b = self.params['b'+str(i+1)]
                filter_size = W.shape[2]
                conv_param = {"stride": 3}
                out, cache = conv_tanh_forward(x, W, b, conv_param)
                ...
    # layers_utils
    ...
    def conv_forward(x, w, b, conv_param):


        stride = conv_param['stride']
        N, C, H, W = x.shape
        F, C, HH, WW = w.shape

        out = np.zeros((N,F, 1+(H - HH) // stride, 1 + (W -WW) // stride))

        for n in range(N):
            for f in range(F):
                for i in range(0,H - HH + 1,stride):
                    for j in range(0, W - WW + 1,stride):
                        out[n,f,i//stride,j//stride]=np.sum(x[n,:,i:i+HH,j:j+WW] * w[f])+b[f]

        cache = (x, w, b, conv_param)
        return out, cache
    ...
    def conv_backward(dout, cache):
        # requires `import itertools` at the top of the module
        x, w, b, conv_param = cache
        stride = conv_param['stride']

        F, C, HH, WW = w.shape
        N, C, H, W = x.shape

        dw = np.zeros_like(w)
        db = np.sum(dout, axis=(0, 2, 3))  # bias gradient: sum over batch and spatial positions
        dx = np.zeros_like(x)

        # route each output gradient back to the input window and filter that produced it
        for n, f in itertools.product(range(N), range(F)):
            for i, j in itertools.product(range(0, H - HH + 1, stride), range(0, W - WW + 1, stride)):
                dx[n, :, i:i+HH, j:j+WW] += dout[n, f, i//stride, j//stride] * w[f]
                dw[f] += dout[n, f, i//stride, j//stride] * x[n, :, i:i+HH, j:j+WW]

        return dx, dw, db
```



<div style="display:flex; justify-content:center;">
    <img src="./images/task10_result1.png" alt="Image 5" style="height:80px; margin: 10px;" />
    <img src="./images/task10_result2.png" alt="Image 6" style="height:80px; margin: 10px;" />
    <img src="./images/task10_result3.png" alt="Image 6" style="height:80px; margin: 10px;" />

</div>

<div style="display:flex; justify-content:center;">
    <img src="./images/task10_result4.png" alt="Image 5" style="height:120px; margin: 10px;" />
</div>

The results look decent. The MLP after the conv layer is [128, 64] in every run.

| Run | Kernel size | stride | num_filters | error |
| ---|---|---|---|---|
| 1   | 5*5 | 3 | 4 | 0.032 |
| 2   | 3*3 | 3 | 4 | 0.024 |
| 3   | 5*5 | 3 | 8 | 0.027 |
| 4   | 3*3 | 3 | 4 | 0.024 |

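The second run performs best, on par with the strongest model from the previous tasks.

Looking closer, the runs with lower error have a conv output dimension of roughly 100~140, which matches how we chose the sizes of the linear layers earlier.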
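The flattened output size of the conv layer follows from the shape used in `conv_forward`: each spatial side is `1 + (H - HH) // stride`. A small sketch that evaluates this for the kernel sizes and filter counts in the table (`conv_output_dim` is a hypothetical helper; a 16 by 16 input and no padding are assumed):

```python
def conv_output_dim(num_filters, filter_size, stride, input_size=16):
    """Number of features after flattening the conv output (no padding)."""
    side = 1 + (input_size - filter_size) // stride
    return num_filters * side * side


for num_filters, filter_size in [(4, 5), (4, 3), (8, 5)]:
    dim = conv_output_dim(num_filters, filter_size, stride=3)
    print(f"{num_filters} filters of {filter_size}*{filter_size}: {dim} features")
```

Under these assumptions the three configurations give 64, 100, and 128 features, so the two lower-error settings (0.024 and 0.027) are the ones that fall inside the 100~140 range.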

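# <center> Summary </center>

After many experiments, the two best models are:
1. weight_scale = 1e-1, learning_rate = 1e-2, reg = 1e-3, nHidden_layer = [210], update_rule = 'adam', lr_decay = 0.95, dropout_keep_ratio = 1, with early_stop (best error 0.018)
2. Model 1 with its first layer replaced by a conv layer of 4 filters of size 1*3*3, stride = 3, padding = 0, followed by the rest of model 1 (best error 0.024)
   
Both errors are around 0.02.

If the validation set is also used for training:
1. The error of the first model drops to 0.016.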

<div style="display:flex; justify-content:center;">
    <img src="./images/task11_result.png" alt="Image 5" style="height:130px; margin: 10px;" />
</div>

We can visualize which samples the first model failed to classify correctly:


<div style="display:flex; justify-content:center;">
    <img src="./images/task11_result2.png" alt="Image 5" style="height:130px; margin: 10px;" />
</div>

<div style="display:flex; justify-content:center;">
    <img src="./images/task11.png" alt="Image 5" style="height:210px; margin: 10px;" />
</div>

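Some of them are genuinely ambiguous; I cannot tell them apart myself either.

Overall, probably because the MNIST dataset is fairly simple and has few features, very good results can be reached even without conv.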