---
title: "2023-01-03"
# subtitle: Last 
# author: <a href="https://github.com/ChoCho66">ChoCho</a>
author: ChoCho
date: last-modified
date-format: iso
institute: Last update date 
bibliography: ../references.bib
slide-number: c/t
# knitr: true
# jupyter: python3

format:
  revealjs:
    # theme: beige
    # theme: ../custom.scss
    theme: [serif,custom.scss]    # LaTeX-like look
    width: 1800
    height: 1050
    # transition: fade
    # preview-links: auto
    # slide-number: true
    # slide-tone: true
    # show-slide-number: print
    chalkboard:
      theme: whiteboard
      boardmarker-width: 2
      src: "Chalkboard.json"  
    scrollable: true
    echo: true
    # footer: "NCU math"
    # logo: cover.jpg
---

In [2]:
%matplotlib inline
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
from torch.nn import functional as F

- d2l en ch6 Builders’ Guide

- ![](https://d2l.ai/_images/blocks.svg)


## The Sequential Module (6.1.2 en)

In [4]:
#| output-location: column
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X

net = MySequential(
        nn.LazyLinear(256), 
        nn.ReLU(), 
        nn.LazyLinear(10)
    )
X = torch.FloatTensor([1,2,3])
net.state_dict, net(X)



(<bound method Module.state_dict of MySequential(
   (0): Linear(in_features=3, out_features=256, bias=True)
   (1): ReLU()
   (2): Linear(in_features=256, out_features=10, bias=True)
 )>,
 tensor([ 0.9646, -0.2961, -0.0681, -0.0784, -0.7608,  0.7837, -0.5562,  0.3798,
         -0.2103, -0.6713], grad_fn=<AddBackward0>))

- In PyTorch, the built-in `nn.Sequential` does the same job.

In [7]:
#| output-location: column
net2 = nn.Sequential(
  nn.LazyLinear(256), 
  nn.ReLU(), 
  nn.LazyLinear(10)
  )
X = torch.FloatTensor([1,2,3])
net2.state_dict, net2(X)

(<bound method Module.state_dict of Sequential(
   (0): Linear(in_features=3, out_features=256, bias=True)
   (1): ReLU()
   (2): Linear(in_features=256, out_features=10, bias=True)
 )>,
 tensor([ 0.0067, -0.1085, -0.6059, -0.8441, -0.1724,  0.4852, -0.4140,  1.0386,
         -0.5748, -0.0646], grad_fn=<AddBackward0>))

## Random weights which are not model parameters and thus are never updated by backpropagation (en 6.1.3)

In [12]:
#| output-location: column
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((4, 5))
        self.linear = nn.LazyLinear(4)

    def forward(self, X):
        print(X)
        print(self.linear)
        X = self.linear(X)
        print(X)
        print(self.linear)
        X = F.relu(X @ self.rand_weight + 1)
        print(X)
        print(self.linear)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        print(X)
        print(self.linear)
        return X.sum()

net = FixedHiddenMLP()
X = torch.arange(3).to(torch.float32)
net.state_dict, net(X)

tensor([0., 1., 2.])
LazyLinear(in_features=0, out_features=4, bias=True)
tensor([ 0.6522, -0.0703, -0.6604,  0.5343], grad_fn=<AddBackward0>)
Linear(in_features=3, out_features=4, bias=True)
tensor([1.4962, 1.0377, 1.0205, 1.6205, 1.5255], grad_fn=<ReluBackward0>)
Linear(in_features=3, out_features=4, bias=True)




RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x5 and 3x4)
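
The error follows from how `nn.LazyLinear` works: it fixes `in_features` on its first call (3 here), while `X @ self.rand_weight` always yields 5 features, so the second call to the same layer receives the wrong width. A minimal sketch of that inference step, with illustrative variable names that are not part of the original cell:

```python
import torch
from torch import nn

layer = nn.LazyLinear(4)           # in_features is unknown until the first call
layer(torch.zeros(3))              # the first call fixes in_features to 3
print(layer)                       # the layer now reports in_features=3

rand_weight = torch.rand((4, 5))   # same shape as in FixedHiddenMLP
hidden = torch.zeros(4) @ rand_weight
print(hidden.shape)                # 5 features, not 3

try:
    layer(hidden)                  # 5 features into a layer that expects 3
except RuntimeError as err:
    print("RuntimeError:", err)
```

Feeding an input with 5 features makes the first and the second call to `self.linear` agree on the width.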

In [16]:
#| output-location: column
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((4, 5))
        self.linear = nn.LazyLinear(4)

    def forward(self, X):
        print(X)
        print(self.linear)
        print()
        X = self.linear(X)
        print(X)
        print(self.linear)
        print()
        X = F.relu(X @ self.rand_weight + 1)
        print(X)
        print(self.linear)
        print()
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        print(X)
        print(self.linear)
        print()
        return X.sum()

net = FixedHiddenMLP()
X = torch.arange(5).to(torch.float32)
net.state_dict, net(X)

tensor([0., 1., 2., 3., 4.])
LazyLinear(in_features=0, out_features=4, bias=True)

tensor([1.5987, 0.2714, 0.8514, 0.4229], grad_fn=<AddBackward0>)
Linear(in_features=5, out_features=4, bias=True)

tensor([3.4346, 3.5500, 2.4825, 2.7338, 2.0602], grad_fn=<ReluBackward0>)
Linear(in_features=5, out_features=4, bias=True)

tensor([ 1.2755, -1.1270,  0.5474,  2.1472], grad_fn=<AddBackward0>)
Linear(in_features=5, out_features=4, bias=True)





(<bound method Module.state_dict of FixedHiddenMLP(
   (linear): Linear(in_features=5, out_features=4, bias=True)
 )>,
 tensor(2.8432, grad_fn=<SumBackward0>))

In [17]:
#| output-location: column
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
          nn.LazyLinear(3), nn.ReLU(),
          nn.LazyLinear(4), nn.ReLU())
        self.linear = nn.LazyLinear(2)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(
  NestMLP(), 
  nn.LazyLinear(5), 
  FixedHiddenMLP()
  )

X = torch.arange(50).to(torch.float32)
chimera.state_dict(), chimera(X)

tensor([ 0.2760, -2.0931, -3.4741, -5.0602, -0.9771], grad_fn=<AddBackward0>)
LazyLinear(in_features=0, out_features=4, bias=True)

tensor([ 0.7497,  1.5311, -0.5474, -1.3877], grad_fn=<AddBackward0>)
Linear(in_features=5, out_features=4, bias=True)

tensor([0.6991, 1.7620, 1.2413, 0.0840, 1.9946], grad_fn=<ReluBackward0>)
Linear(in_features=5, out_features=4, bias=True)

tensor([ 0.4324,  0.0697, -0.1702,  0.1354], grad_fn=<AddBackward0>)
Linear(in_features=5, out_features=4, bias=True)



(OrderedDict([('0.net.0.weight',
               Parameter containing:
               tensor([[-0.0036,  0.0389, -0.0717, -0.0846, -0.0710, -0.0936,  0.0365, -0.1154,
                        -0.0454,  0.0805,  0.0005, -0.0640, -0.1398, -0.0549, -0.0640, -0.0317,
                         0.1289,  0.0568,  0.1242,  0.0101, -0.0691,  0.1232, -0.0208,  0.0289,
                        -0.1040,  0.0653,  0.0312,  0.0572, -0.0080,  0.1268, -0.0524,  0.0817,
                         0.0638,  0.1405,  0.1176, -0.0147, -0.1191,  0.0176, -0.0174, -0.0558,
                        -0.1079,  0.0407,  0.0873,  0.0093, -0.0576, -0.0788, -0.0341,  0.0834,
                        -0.0486,  0.1231],
                       [ 0.0355, -0.0237, -0.0801, -0.0782, -0.0246,  0.1300, -0.1150, -0.0709,
                        -0.0187, -0.0645, -0.0414, -0.0435, -0.0042,  0.0096, -0.0184,  0.1130,
                         0.0514, -0.0007,  0.0482, -0.1079,  0.0327, -0.0856, -0.0564, -0.0356,
                       

## Sharing parameters in a `Net` (6.2.2 en)

In [28]:
# We need to give the shared layer a name so that we can refer to its
# parameters

m = 5  # the two m values below must match: `shared` receives and returns m features
shared = nn.LazyLinear(m)
net = nn.Sequential(nn.LazyLinear(m), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))
X = torch.FloatTensor([1,2,3])
net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0])
print(net[4].weight.data[0])
net[2].weight.data[0, 0] = 0.1
# Make sure that they are actually the same object rather than just having the
# same value
print()
print(net[2].weight.data[0])
print(net[4].weight.data[0])

tensor([ 0.3106,  0.1066,  0.2790, -0.2405, -0.1878])
tensor([ 0.3106,  0.1066,  0.2790, -0.2405, -0.1878])

tensor([ 0.1000,  0.1066,  0.2790, -0.2405, -0.1878])
tensor([ 0.1000,  0.1066,  0.2790, -0.2405, -0.1878])
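
Since `net[2]` and `net[4]` are the same module, the sharing can also be checked by identity rather than by comparing values. The sketch below assumes the same layout as above but uses explicit `nn.Linear` sizes (an assumption, so that no forward pass is needed to materialize the layers); it also shows that `parameters()` lists the shared tensors only once.

```python
import torch
from torch import nn

m = 5
shared = nn.Linear(m, m)
net = nn.Sequential(nn.Linear(3, m), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(m, 1))

# Both positions hold the very same module object
print(net[2] is net[4])
print(net[2].weight is net[4].weight)

# parameters() skips duplicates, so the shared layer is counted once:
# 3 distinct Linear layers, each with a weight and a bias
n_tensors = len(list(net.parameters()))
print(n_tensors)

# Gradients from both uses accumulate into the one shared weight
net(torch.ones(3)).sum().backward()
print(shared.weight.grad.shape)
```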




## Parameter Initialization (6.3 en)

In [34]:
net = nn.Sequential(
  nn.Linear(3,4)
)
X = torch.FloatTensor([1,2,3])
net[0].weight, net[0].bias, net(X)

(Parameter containing:
 tensor([[ 0.2723, -0.2298, -0.3999],
         [-0.4484, -0.5352, -0.2473],
         [ 0.1999,  0.5098, -0.2889],
         [ 0.3707, -0.3990, -0.2375]], requires_grad=True),
 Parameter containing:
 tensor([0.2529, 0.2601, 0.3132, 0.0876], requires_grad=True),
 tensor([-1.1342, -2.0006,  0.6661, -1.0522], grad_fn=<AddBackward0>))

In [45]:
net = nn.Sequential(
  nn.Linear(3,4)
)
X = torch.FloatTensor([1,2,3])
nn.init.normal_( net[0].weight , 0, 1 )
nn.init.zeros_( net[0].bias )
net[0].weight, net[0].bias, net(X)

(Parameter containing:
 tensor([[ 5.7912e-01,  7.2905e-01,  2.7349e-01],
         [ 8.2351e-01,  9.2833e-01, -6.2403e-04],
         [-4.6876e-02, -1.6262e-01, -1.1401e+00],
         [-1.1399e+00,  3.3388e-01, -8.0548e-01]], requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0.], requires_grad=True),
 tensor([ 2.8577,  2.6783, -3.7925, -2.8886], grad_fn=<AddBackward0>))

In [46]:
net = nn.Sequential(
  nn.Linear(3,4)
)
X = torch.FloatTensor([1,2,3])
nn.init.constant_( net[0].weight , 6 )
net[0].weight, net[0].bias, net(X)

(Parameter containing:
 tensor([[6., 6., 6.],
         [6., 6., 6.],
         [6., 6., 6.],
         [6., 6., 6.]], requires_grad=True),
 Parameter containing:
 tensor([ 0.2872, -0.4617, -0.3226,  0.0261], requires_grad=True),
 tensor([36.2872, 35.5383, 35.6774, 36.0261], grad_fn=<AddBackward0>))

In [47]:
net = nn.Sequential(
  nn.Linear(3,4)
)
X = torch.FloatTensor([1,2,3])
nn.init.xavier_uniform_( net[0].weight )
net[0].weight, net[0].bias, net(X)

(Parameter containing:
 tensor([[-0.3272, -0.6721, -0.7902],
         [ 0.8782,  0.3942, -0.8563],
         [-0.0585,  0.3681, -0.2382],
         [ 0.4020,  0.6003, -0.2567]], requires_grad=True),
 Parameter containing:
 tensor([ 0.4035,  0.0404, -0.2751,  0.5587], requires_grad=True),
 tensor([-3.6384, -0.8621, -0.3121,  1.3913], grad_fn=<AddBackward0>))
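
For reference, `nn.init.xavier_uniform_` samples each weight uniformly from `(-a, a)` with `a = gain * sqrt(6 / (fan_in + fan_out))` and a default gain of 1. The sketch below computes that bound for `nn.Linear(3,4)` and checks it against the weights printed above; the variable names are illustrative.

```python
import math

fan_in, fan_out = 3, 4               # nn.Linear(3, 4)
bound = math.sqrt(6.0 / (fan_in + fan_out))
print(round(bound, 4))

# Weights copied from the xavier_uniform_ output above
weights = [
    [-0.3272, -0.6721, -0.7902],
    [ 0.8782,  0.3942, -0.8563],
    [-0.0585,  0.3681, -0.2382],
    [ 0.4020,  0.6003, -0.2567],
]
inside = all(abs(w) <= bound for row in weights for w in row)
print(inside)
```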

## Custom Initialization (6.3.1.1 en)

In [54]:
net = nn.Sequential(
  nn.Linear(3,4), nn.ReLU(),
  nn.Linear(4,5), nn.ReLU(),
  nn.LazyLinear(2), 
)
X = torch.FloatTensor([1,2,3])

def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        # module.weight.data *= module.weight.data.abs() >= 5
net.apply(my_init)
net[0].weight, net[0].bias, net(X)

Init weight torch.Size([4, 3])
Init weight torch.Size([5, 4])


(Parameter containing:
 tensor([[ 1.2087, -7.8388, -4.4931],
         [ 0.3281,  2.8136, -9.8746],
         [-3.1472, -6.7930,  3.1936],
         [ 6.7071, -9.9570, -3.0954]], requires_grad=True),
 Parameter containing:
 tensor([-0.2979, -0.4321,  0.3465, -0.2627], requires_grad=True),
 tensor([ 0.1822, -0.1544], grad_fn=<AddBackward0>))
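
Only two `Init` lines are printed because `my_init` tests `type(module) == nn.Linear`, and the `nn.LazyLinear(2)` layer has not seen any input yet when `net.apply(my_init)` runs. A minimal sketch of that behavior follows; the helper `count_linear` is hypothetical, and running a dummy forward pass before `apply` is only one possible way to include the lazy layer.

```python
import torch
from torch import nn

def count_linear(net):
    # Count modules whose exact type is nn.Linear, the same test my_init uses
    return sum(type(mod) == nn.Linear for mod in net.modules())

net = nn.Sequential(
    nn.Linear(3, 4), nn.ReLU(),
    nn.Linear(4, 5), nn.ReLU(),
    nn.LazyLinear(2),
)
before = count_linear(net)        # the lazy layer is still a LazyLinear
net(torch.ones(3))                # the first forward pass infers in_features
after = count_linear(net)         # the lazy layer has become a plain Linear
print(before, after)
```

Calling `net.apply(my_init)` after such a forward pass would therefore reach all three linear layers.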

## Observing how the Net changes

In [19]:
%matplotlib inline
import numpy as np
import torch
from torch import nn

from ipywidgets import interactive
import matplotlib.pyplot as plt

m = 16
W1 = nn.Linear(1,m)
W1.weight.requires_grad = False

W2 = nn.Linear(m,m)
W2.weight.requires_grad = False

Wn = nn.Linear(m,m)
Wn.weight.requires_grad = False

sigR = nn.ReLU()
sigS = nn.Sigmoid()

Net = nn.Sequential(
  W1, sigR,
  nn.Linear(m,m), sigR,
  W2, sigR,
  nn.Linear(m,m), sigR,
  Wn, sigR,
  nn.Linear(m,1),
)

def f(w11=2.0, w21=-1.0, wn1=2.0, b=0.0):  
  W1.weight[0,0] = torch.tensor([w11])
  W2.weight[0,0] = torch.tensor([w21])
  Wn.weight[0,0] = torch.tensor([wn1])
  
  plt.figure(2)
  x = np.linspace(-16, 16, num=600)
  x_torch = torch.from_numpy(x).to(torch.float32)
  # np.linspace(-10, 10, num=100000)
  # Evaluate all inputs in one batched call; plotting needs no gradients
  with torch.no_grad():
    y = Net(x_torch.unsqueeze(1)).squeeze(1)
  y = y.numpy() + b
  # print(x[:16])
  print(y[:16])
  plt.plot(x, y)
  plt.ylim(-1, 1)
  plt.show()

interactive_plot = interactive(f, w11 = (-10.0, 10.0),
                                  w21 = (-10.0, 10.0),
                                  wn1 = (-10.0, 10.0),
                                  b = (-20.0, 20.0)
                                  )
output = interactive_plot.children[-1]
output.layout.height = '500px'
interactive_plot

interactive(children=(FloatSlider(value=2.0, description='w11', max=10.0, min=-10.0), FloatSlider(value=-1.0, …

## Questions for MLP

- Deep vs Shallow
  - https://youtu.be/FN8jclCrqY0?t=1674

- The first layer matters most
  - https://youtu.be/FN8jclCrqY0?t=2032
  - https://vigneshgig.medium.com/why-first-hidden-layer-is-very-important-in-building-a-neural-network-model-and-relation-between-6f2943acc847

- How to choose the number of layers and the number of hidden-layer neurons in a neural network.
  - https://zhuanlan.zhihu.com/p/100419971


## CNN RNN

- Find the features.


### 7.1.1. Learning Representations (zh)


## https://vigneshgig.medium.com/why-first-hidden-layer-is-very-important-in-building-a-neural-network-model-and-relation-between-6f2943acc847

- https://www.youtube.com/watch?v=FN8jclCrqY0
- https://www.youtube.com/watch?v=qpuLxXrHQB4

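Hello everyone.
I will explain why the first hidden layer is so important when building a neural network model,
and how the activation function addresses the vanishing gradient problem.
I will use Google's playground.tensorflow to explain the idea;
it is an excellent tool
for visualizing the inner workings of a neural network model.
I recommend trying it yourself
to build more intuition for what happens inside a neural network.

In my previous blog post
I covered why activation functions are used in neural networks, along with the first hidden layer,
so I will not repeat that here.
I recommend reading my post on why activation functions are used in neural networks,
which gives more background on activation functions and the first hidden layer.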

Case 1

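To show why the first hidden layer matters more,
I will restrict the first hidden layer to only two neurons.
After it we can add as many hidden layers as we like, with no limit on their neurons.
If you use the sigmoid function,
add one more hidden layer and no more than two,
or we may end up with vanishing gradients.
If you want to try more than two hidden layers,
use the ReLU activation function,
which avoids the vanishing gradient problem.
Either way,
I will use both the sigmoid and ReLU activation functions.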

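As we can see,
even with three hidden layers
the model cannot fully separate, that is classify, the dataset.
So we need at least three neurons in the first hidden layer.
In mathematical terms,
we need at least three linear equations to separate this nonlinear dataset.
In topological terms, we need one more dimension to map the data from a low-dimensional space into a higher-dimensional one,
so that the nonlinear dataset becomes linearly separable there.
In this case
the dataset is 2D,
so we need one extra dimension
to turn the 2D nonlinear dataset into a 3D linearly separable one.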
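
The dimension-lifting argument can be illustrated without any network. The sketch below assumes a hypothetical dataset of two concentric rings (inner ring is class 0, outer ring is class 1) and a hand-picked third coordinate `x1**2 + x2**2`. This is not the feature a trained first layer would find; it only shows a 2D set that no straight line separates becoming separable by a plane in 3D.

```python
import math

def ring(radius, n=12):
    # n points evenly spaced on a circle of the given radius
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(1.0)   # class 0
outer = ring(3.0)   # class 1

def lift(point):
    # Add a third coordinate; the choice x1^2 + x2^2 is an assumption
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# In 3D the plane z = 4 separates the classes:
# inner points have z close to 1, outer points have z close to 9
threshold = 4.0
inner_ok = all(lift(p)[2] < threshold for p in inner)
outer_ok = all(lift(p)[2] > threshold for p in outer)
print(inner_ok, outer_ok)
```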

Case 2

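Voilà!
With only three neurons in the first hidden layer, the model separates the dataset completely.
If you want to know why, please read my post on why activation functions are used in neural networks.
So when you design a neural network,
pay the most attention to the first hidden layer,
because every other hidden layer depends on it.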

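Because of vanishing gradients,
the effective learning rate of the neurons shrinks from layer to layer, going from the output back toward the input,
so the weights of the earliest neurons barely change.
For example,
take a network with four hidden layers.
Suppose the 4th hidden layer learns at a rate of 0.9;
then the 3rd hidden layer might learn at 0.5,
the 2nd at 0.1,
and the 1st at only 0.001.
Since the first hidden layer is not learning,
the error rate does not go down.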
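
The per-layer rates in this passage are illustrative, but the mechanism behind them can be sketched directly. The derivative of the sigmoid never exceeds 0.25, so by the chain rule each additional sigmoid layer multiplies the backpropagated gradient by at most 0.25 times a weight. The example below assumes a toy chain of four scalar sigmoid layers with every weight equal to 1, and reports the resulting upper bound on the gradient scale at each layer relative to the gradient at the output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 with value 0.25
peak = sigmoid_grad(0.0)

# Toy chain of 4 scalar sigmoid layers, all weights assumed to be 1.
# A gradient reaching layer k has passed through layers k..4,
# so it is scaled by at most peak ** (4 - k + 1).
n_layers = 4
factors = {layer: peak ** (n_layers - layer + 1)
           for layer in range(n_layers, 0, -1)}
for layer, factor in factors.items():
    print(f"hidden layer {layer}: gradient scale <= {factor:.6f}")
```

With ReLU the corresponding factor is 1 for positive inputs, which is why the text recommends it for deeper stacks.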

## 4.9.3 Distribution Shift

A perturbation at a layer grows exponentially in the remaining depth after that layer.
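
A minimal sketch of this statement, assuming a toy network of scalar linear layers that all share a hypothetical gain `gain > 1` (real networks use matrices and nonlinearities, so this shows only the scaling): a perturbation injected after layer `l` of a stack with `depth` layers is amplified by `gain ** (depth - l)`.

```python
def forward(x, gain, depth, perturb_at=None, eps=0.0):
    # Apply `depth` scalar layers h -> gain * h; optionally add eps
    # to the activation right after layer `perturb_at` (1-indexed)
    h = x
    for layer in range(1, depth + 1):
        h = gain * h
        if layer == perturb_at:
            h += eps
    return h

gain, depth, eps = 2.0, 10, 1e-3
clean = forward(1.0, gain, depth)
growths = {}
for l in (2, 6, 10):
    shifted = forward(1.0, gain, depth, perturb_at=l, eps=eps)
    growths[l] = (shifted - clean) / eps
    print(f"perturb after layer {l}: output shift = {growths[l]:.1f} * eps")
```

The earlier the perturbation enters, the more layers remain to amplify it, which is the exponential dependence on the remaining depth.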