### **Building Models in PyTorch**
#### **`torch.nn.Module` and `torch.nn.Parameter`**
In this video, we'll be discussing some of the tools PyTorch makes available for building deep learning networks.

Except for `Parameter`, the classes we discuss in this video are all subclasses of `torch.nn.Module`. This is the PyTorch base class meant to encapsulate behaviors specific to PyTorch Models and their components.

One important behavior of `torch.nn.Module` is registering parameters. If a particular `Module` subclass has learnable weights, these weights are expressed as instances of `torch.nn.Parameter`. The `Parameter` class is a subclass of `torch.Tensor`, with the special behavior that when an instance is assigned as an attribute of a `Module`, it is added to the list of that module's parameters. These parameters may be accessed through the `parameters()` method on the `Module` class.

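* `torch.nn.Module`: the base class for all models and layers in PyTorch; it registers parameters automatically.
* `torch.nn.Parameter`: a subclass of `torch.Tensor` meant for a model's learnable parameters; assigning one as an attribute of a `Module` registers it as a parameter of that module.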
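To make the registration idea concrete, here is a minimal plain-Python sketch of the mechanism. It is only loosely modeled on PyTorch: the `Parameter`, `Module`, `Linear`, and `Tiny` classes below are simplified stand-ins written for illustration, not the real `torch.nn` implementations.

```python
class Parameter:
    """Stand-in for torch.nn.Parameter: it only wraps a value."""
    def __init__(self, data):
        self.data = data


class Module:
    """Stand-in for torch.nn.Module that registers attributes by type."""
    def __init__(self):
        # Create the registries without triggering our own __setattr__.
        object.__setattr__(self, "_parameters", {})
        object.__setattr__(self, "_modules", {})

    def __setattr__(self, name, value):
        # Assigning a Parameter or a sub-Module records it in a registry.
        if isinstance(value, Parameter):
            self._parameters[name] = value
        elif isinstance(value, Module):
            self._modules[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        # Yield our own parameters, then those of every registered child.
        yield from self._parameters.values()
        for child in self._modules.values():
            yield from child.parameters()


class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Parameter([[0.0] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features)


class Tiny(Module):
    def __init__(self):
        super().__init__()
        self.linear1 = Linear(100, 200)
        self.linear2 = Linear(200, 10)
        self.note = "plain attributes are not registered"


model = Tiny()
param_count = len(list(model.parameters()))
layer_param_count = len(list(model.linear2.parameters()))
print(param_count, layer_param_count)  # 4 2
```

The point of the design is that the model author never maintains a parameter list by hand: assigning a layer as an attribute is enough for an optimizer to find its weights later through `parameters()`.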

As a simple example, here's a very simple model with two linear layers and an activation function. We'll create an instance of it and ask it to report on its parameters:

In [1]:
import torch

# Configure how PyTorch prints tensors
torch.set_printoptions(threshold=10000, precision=6, linewidth=200)

class TinyModel(torch.nn.Module):
    # __init__ defines the structure of the model
    def __init__(self):
        super(TinyModel, self).__init__()

        # linear1 is a fully connected layer with 100 inputs and 200 outputs.
        self.linear1 = torch.nn.Linear(100, 200)
        # ReLU is a commonly used non-linear activation function.
        self.activation = torch.nn.ReLU()
        # linear2 is a fully connected layer with 200 inputs and 10 outputs.
        self.linear2 = torch.nn.Linear(200, 10)
        # Softmax, common in multi-class classification, turns the outputs into a probability distribution.
        # No dim is passed here (hence Softmax(dim=None) in the printout); passing it
        # explicitly, e.g. dim=-1, avoids the implicit-dimension warning when the model is called.
        self.softmax = torch.nn.Softmax()
    
    # forward defines the forward pass: how the input flows through the layers to produce the output.
    def forward(self, x):
        # The data goes through linear1, then ReLU, then linear2, and finally Softmax.
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel() # create a TinyModel instance

print('The model;')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

# Loop over and print every parameter of the model (the weights and biases of linear1 and linear2).
print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

# Loop over and print every parameter of the linear2 layer.
print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)

The model;
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Just one layer:
Linear(in_features=200, out_features=10, bias=True)


Model params:
Parameter containing:
tensor([[-3.205977e-02,  7.872678e-03,  5.418869e-02,  ...,  2.005448e-02, -1.961485e-03, -4.266337e-03],
        [-7.094987e-02,  6.961019e-02,  9.257927e-02,  ..., -2.706629e-02,  8.440390e-03,  3.588381e-02],
        [-4.440714e-02,  7.478767e-02,  8.611829e-02,  ..., -4.279068e-02, -3.451542e-02, -9.934898e-02],
        ...,
        [-4.254567e-02,  9.226108e-02,  7.768998e-02,  ...,  8.669520e-02,  2.285053e-02, -1.314601e-02],
        [ 3.112680e-02, -1.330163e-02,  5.061924e-05,  ..., -1.745600e-02,  5.860174e-02,  1.716904e-02],
        [-4.400945e-02,  3.008768e-03,  3.419369e-04,  ...,  4.247188e-02, -7.746973e-02,  5.233642e-02]], requires_grad=True)
Parameter con
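
The tensors printed by `parameters()` are large, so it can help to reason about their sizes instead of their values. The sketch below is plain Python rather than PyTorch, and the helper names are made up for illustration. It relies on the fact that a `Linear(in_features, out_features)` layer holds a weight of shape `(out_features, in_features)` and a bias of shape `(out_features,)`.

```python
def linear_param_shapes(in_features, out_features):
    """Shapes of the weight and bias held by a Linear(in_features, out_features) layer."""
    return [(out_features, in_features), (out_features,)]


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


# The two linear layers of TinyModel; ReLU and Softmax have no parameters.
layers = {"linear1": (100, 200), "linear2": (200, 10)}

counts = {
    name: sum(numel(shape) for shape in linear_param_shapes(*dims))
    for name, dims in layers.items()
}
total = sum(counts.values())
print(counts)  # {'linear1': 20200, 'linear2': 2010}
print(total)   # 22210
```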

#### **Common Layer Types**

**Linear (Fully Connected) Layers**

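The most basic type of neural network layer is a linear or fully connected layer. This is a layer where every input influences every output of the layer to a degree specified by the layer's weights. If a layer has *m* inputs and *n* outputs, the weights form an *m* x *n* set of coefficients; PyTorch stores them as a tensor of shape `(n, m)`, that is, `(out_features, in_features)`.  
The output of a linear layer can be written as `y = xW^T + b`, where `x` is the input, `W` is the weight tensor, and `b` is the bias. For example: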

In [2]:
lin = torch.nn.Linear(3, 2) # a fully connected layer with 3 inputs and 2 outputs
x = torch.rand(1, 3) # a random input of size 1x3
print('Input:')
print(x)

# Print the weight and bias parameters of the lin layer.
print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x) # pass the input x through the linear layer lin to compute the output y
print('\n\nOutput:')
print(y)


Input:
tensor([[0.772721, 0.548795, 0.456000]])


Weight and Bias parameters:
Parameter containing:
tensor([[-0.502086,  0.118776, -0.534763],
        [-0.236756,  0.516878, -0.209722]], requires_grad=True)
Parameter containing:
tensor([-0.316743,  0.322792], requires_grad=True)


Output:
tensor([[-0.883384,  0.327871]], grad_fn=<AddmmBackward0>)


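This code shows how to define and use a linear layer in PyTorch. Linear layers are a core building block in deep learning: they transform an input into an output using a learned set of weights and biases. During training, these weights and biases are adjusted according to the data and the loss function, which is how the model learns patterns in the data.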

If you do the matrix multiplication of x by the linear layer's weights, and add the biases, you'll find that you get the output vector `y`.
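
As a quick check of that claim, the sketch below redoes the arithmetic in plain Python using the values printed above (rounded to six decimals, so the result only matches to about that precision). Note that each row of the weight tensor belongs to one output, so the product is `x` times the transpose of the weights.

```python
# Values copied from the printed output of the cell above.
x = [0.772721, 0.548795, 0.456000]
W = [[-0.502086, 0.118776, -0.534763],
     [-0.236756, 0.516878, -0.209722]]  # shape (out_features=2, in_features=3)
b = [-0.316743, 0.322792]

# y[j] = sum_i x[i] * W[j][i] + b[j], i.e. y = x W^T + b
y = [sum(xi * wji for xi, wji in zip(x, row)) + bj for row, bj in zip(W, b)]

# Should land very close to the printed output [-0.883384, 0.327871].
print(y)
```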

One other important feature to note: when we printed the layer's parameters, each one reported itself as a `Parameter` (which is a subclass of `Tensor`), and the `requires_grad=True` in the output tells us that it's tracking gradients with autograd. This is a default behavior for `Parameter` that differs from `Tensor`.

Linear layers are used widely in deep learning models. One of the most common places you'll see them is in classifier models, which will usually have one or more linear layers at the end, where the last layer will have n outputs, where n is the number of classes the classifier addresses.

**Convolutional Layers**

Convolutional layers are built to handle data with a high degree of spatial correlation. They are very commonly used in computer vision, where they detect close groupings of features which they compose into higher-level features. They pop up in other contexts too - for example, in NLP applications, where a word's immediate context (that is, the other words nearby in the sequence) can affect the meaning of a sentence.

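* Convolutional layers: mainly used for data with a high degree of spatial correlation. They are very common in computer vision because they detect groupings of nearby features in an image and compose them into higher-level features.
* In computer vision: convolutional layers pick out low-level features such as edges and corners, and progressively combine them into complex shapes or objects.
* In NLP: convolutional layers can pick up a word's context (the neighboring words in the sequence), which affects the meaning of the sentence.

We saw convolutional layers in action in LeNet in an earlier video: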


In [3]:
import torch.nn.functional as F # relu, max_pool2d and log_softmax live in torch.nn.functional

class LeNet(torch.nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # First convolutional layer: 1 input image channel (black & white),
        # 6 output channels, 5x5 square convolution kernel.
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        # Second convolutional layer: 6 input channels, 16 output channels, also a 5x5 kernel.
        self.conv2 = torch.nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        # The fully connected layers map the convolutional output to the final class scores.
        # 16 * 6 * 6 assumes a 6x6 feature map after the second pooling; with these kernels
        # that corresponds to a 36x36 input (a 32x32 input would give 16 * 5 * 5).
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120) # 16*6*6 inputs, 120 outputs
        self.fc2 = torch.nn.Linear(120, 84) # 120 inputs, 84 outputs
        self.fc3 = torch.nn.Linear(84, 10) # 84 inputs, 10 outputs, one per class

    def forward(self, x):
        # Convolve, apply ReLU to introduce non-linearity, then max pool over a (2, 2)
        # window to shrink the spatial dimensions.
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x)) # flatten the conv output so it can feed the fully connected layers
        # Fully connected layers, with ReLU after the first two:
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x) # no activation on the last layer; it outputs the final class scores
        return x

    # Compute the number of features after flattening
    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
    

Let's break down what's happening in the convolutional layers of this model. Starting with `conv1`:
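
One rough way to follow the data through the convolutional part of the model is to track the spatial size of the feature map. The sketch below is plain Python, not PyTorch. It uses the usual output-size arithmetic for an unpadded, stride-1 convolution (`out = in - kernel + 1`) and for a 2x2 max pool (`out = in // 2`). The helper names and the 36x36 and 32x32 input sizes are assumptions chosen for illustration.

```python
def conv_out(size, kernel):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, window):
    """Output width of non-overlapping max pooling (stride equal to the window)."""
    return size // window


def lenet_flat_features(input_size, k1=5, k2=5, channels=16):
    """Number of features reaching fc1 for a square input of the given width."""
    s = pool_out(conv_out(input_size, k1), 2)  # conv1 + ReLU + 2x2 pool
    s = pool_out(conv_out(s, k2), 2)           # conv2 + ReLU + 2x2 pool
    return channels * s * s


flat_36 = lenet_flat_features(36)
flat_32 = lenet_flat_features(32)
print(flat_36)  # 576, i.e. 16 * 6 * 6, the size fc1 expects in the class above
print(flat_32)  # 400, i.e. 16 * 5 * 5
```

This kind of bookkeeping is worth doing before wiring a convolutional stack into a linear layer, since a mismatch only surfaces as a shape error when `forward()` is first called.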

**Recurrent Layers**

Recurrent neural networks (or RNNs) are used for sequential data - anything from time-series measurements from a scientific instrument to natural language sentences to DNA nucleotides. An RNN does this by maintaining a hidden state that acts as a sort of memory for what it has seen in the sequence so far.

The internal structure of an RNN layer - or its variants, the LSTM (long short-term memory) and GRU (gated recurrent unit) - is moderately complex and beyond the scope of this video, but we'll show you what one looks like in action with an LSTM-based part-of-speech tagger (a type of classifier that tells you if a word is a noun, verb, etc.):

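An RNN keeps a hidden state that remembers what it has seen earlier in the sequence, which lets it model dependencies across the sequence.

The LSTM and GRU are RNN variants designed specifically for long-range dependencies. Their internal structure is complex, but they mitigate the vanishing-gradient problem of plain RNNs.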

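* `embedding_dim`: the dimension of the word embeddings, that is, the size of each word's representation vector.
* `hidden_dim`: the dimension of the LSTM's hidden layer, that is, the size of the hidden state.
* `vocab_size`: the size of the vocabulary, the total number of distinct words the model can handle.
* `tagset_size`: the size of the tag set, the number of possible part-of-speech tags.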

In [None]:
class LSTMTagger(torch.nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        # The embedding layer turns word indices into embedding vectors, which are the LSTM's inputs.
        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        # These hidden states carry context from the part of the sequence processed so far.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        # It outputs a score per tag for each word.
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence) # turn the sentence (a tensor of word indices) into embedding vectors

        # Feed the embeddings to the LSTM.
        # view(len(sentence), 1, -1) reshapes them to (sequence length, batch of 1, embedding_dim).
        # The LSTM returns the per-step outputs lstm_out and a tuple of final hidden and cell states, ignored here as _.
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))

        # Flatten the LSTM output and pass it to the linear layer to get tag scores for each word.
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))

        # Apply log softmax to the tag scores to get a log-probability distribution over tags.
        # These express how likely each word is to carry each part-of-speech tag.
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores


The constructor has four arguments:

* `vocab_size` is the number of words in the input vocabulary.  
    Each word is a one-hot vector (or unit vector) in a `vocab_size` -dimensional space.
* `tagset_size` is the number of tags in the output set.
* `embedding_dim` is the size of the embedding space for the vocabulary.  
    An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in the space.
* `hidden_dim` is the size of the LSTM's memory.
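
To connect the one-hot picture with what an embedding layer does in practice, here is a small plain-Python sketch. The table values, the vocabulary size of 4, and the embedding size of 2 are made up for illustration. It shows that picking row `i` of the embedding table gives the same result as multiplying the one-hot vector for word `i` by the table, which is why the layer can take plain indices as input.

```python
# A made-up embedding table with vocab_size=4 rows and embedding_dim=2 columns.
embedding_table = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]


def one_hot(index, size):
    """Unit vector with a 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]


sentence = [2, 0, 3]  # word indices

by_lookup = [embedding_table[i] for i in sentence]
by_matmul = [vec_mat(one_hot(i, 4), embedding_table) for i in sentence]

print(by_lookup)               # [[0.5, 0.6], [0.1, 0.2], [0.7, 0.8]]
print(by_lookup == by_matmul)  # True
```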

The input will be a sentence with the words represented as indices of one-hot vectors. The embedding layer will then map these down to an `embedding_dim`-dimensional space. The LSTM takes this sequence of embeddings and iterates over it, fielding an output vector of length `hidden_dim`.
The final linear layer acts as a classifier; applying `log_softmax()` to the output of the final layer converts the output into a normalized set of estimated probabilities that a given word maps to a given tag.
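
The normalization step can be sketched in plain Python as follows. The three scores stand for the output of the final linear layer for a single word with three hypothetical tags; the numbers are made up. Exponentiating the log-softmax values gives probabilities that sum to 1.

```python
import math


def log_softmax(scores):
    """Numerically stable log softmax of a list of scores."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum_exp for s in scores]


tag_space = [2.0, 0.5, -1.0]  # made-up scores for three hypothetical tags
tag_scores = log_softmax(tag_space)
probs = [math.exp(s) for s in tag_scores]

print(tag_scores)  # log-probabilities, all <= 0
print(sum(probs))  # very close to 1.0
```

The ordering of the scores is preserved, so the most likely tag is still the one with the largest raw score. Working in log space pairs naturally with a negative log likelihood loss during training.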

If you'd like to see this network in action, check out the Sequence Models and LSTM Networks tutorial on pytorch.org.

**Transformers**

Transformers are multi-purpose networks that have taken over the state of the art in NLP with models like BERT. A discussion of transformer architecture is beyond the scope of this video, but PyTorch has a `Transformer` class that allows you to define the overall parameters of a transformer model - the number of attention heads, the number of encoder & decoder layers, dropout and activation functions, etc. (You can even build the BERT model from this single class, with the right parameters!) Alongside `torch.nn.Transformer`, `torch.nn` also has classes to encapsulate the individual components (`TransformerEncoder`, `TransformerDecoder`) and subcomponents (`TransformerEncoderLayer`, `TransformerDecoderLayer`). For details, check out the documentation on transformer classes, and the relevant tutorial on pytorch.org.


In short, transformers are a multi-purpose neural network architecture, and PyTorch's `Transformer` class lets you define the overall parameters of a transformer model.

#### **Other Layers and Functions**

**Data Manipulation Layers**  
There are other layer types that perform important functions in models, but don't participate in the learning process themselves.

**Max pooling** (and its twin, min pooling) reduces a tensor by combining cells, and assigning the maximum value of the input cells to the output cell (we saw this in the `forward()` method of LeNet above). For example:

In [4]:
# MaxPool2d(3) splits the 6x6 input tensor into 3x3 regions and takes the maximum of each region.
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))

tensor([[[0.321914, 0.236021, 0.430655, 0.152097, 0.727925, 0.819373],
         [0.189953, 0.555486, 0.326744, 0.420363, 0.899552, 0.278097],
         [0.635880, 0.267926, 0.274840, 0.070747, 0.850823, 0.829280],
         [0.861568, 0.032475, 0.625679, 0.645713, 0.297683, 0.643696],
         [0.359585, 0.745571, 0.288882, 0.101304, 0.324211, 0.133364],
         [0.377945, 0.392677, 0.404468, 0.876196, 0.180975, 0.120142]]])
tensor([[[0.635880, 0.899552],
         [0.861568, 0.876196]]])


If you look closely at the values above, you'll see that each of the values in the maxpooled output is the maximum value of each quadrant of the 6x6 input.
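
To check this by hand, the sketch below re-implements non-overlapping max pooling in plain Python and applies it to the values printed above. The `max_pool_2d` helper is written for illustration and only covers the case where the stride equals the window size, which is the default for `MaxPool2d`.

```python
# Values copied from the printed 6x6 input above (channel dimension dropped).
grid = [
    [0.321914, 0.236021, 0.430655, 0.152097, 0.727925, 0.819373],
    [0.189953, 0.555486, 0.326744, 0.420363, 0.899552, 0.278097],
    [0.635880, 0.267926, 0.274840, 0.070747, 0.850823, 0.829280],
    [0.861568, 0.032475, 0.625679, 0.645713, 0.297683, 0.643696],
    [0.359585, 0.745571, 0.288882, 0.101304, 0.324211, 0.133364],
    [0.377945, 0.392677, 0.404468, 0.876196, 0.180975, 0.120142],
]


def max_pool_2d(grid, k):
    """Non-overlapping k x k max pooling (stride equal to k)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [max(grid[r + dr][c + dc] for dr in range(k) for dc in range(k))
         for c in range(0, cols - k + 1, k)]
        for r in range(0, rows - k + 1, k)
    ]


pooled = max_pool_2d(grid, 3)
print(pooled)  # [[0.63588, 0.899552], [0.861568, 0.876196]]
```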

**Normalization layers** re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting you use higher learning rates without exploding/vanishing gradients.


In [5]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5 # apply a large scale factor and offset to the input tensor
print(my_tensor)

print(my_tensor.mean())

# After the batch normalization layer, the values are smaller and centered around zero, which helps learning.
norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())

tensor([[[ 7.667809,  5.411770, 17.922087, 11.226503],
         [14.156848, 14.394440, 15.735331,  7.223950],
         [ 6.784162, 17.811817,  7.300866, 21.269615],
         [15.609527, 17.357277, 22.560999,  9.459081]]])
tensor(13.243255)
tensor([[[-0.610756, -1.087662,  1.556901,  0.141518],
         [ 0.385399,  0.456980,  0.860964, -1.703342],
         [-1.021549,  0.709588, -0.940436,  1.252398],
         [-0.136175,  0.237336,  1.349421, -1.450583]]], grad_fn=<NativeBatchNormBackward0>)
tensor(1.490116e-08, grad_fn=<MeanBackward0>)


Running the cell above, we've added a large scaling factor and offset to an input tensor; you should see the input tensor's `mean()` somewhere in the neighborhood of 15. After running it through the normalization layer, you can see that the values are smaller, and grouped around zero - in fact, the mean should be very close to zero (on the order of 1e-8 in the output above).
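
The normalization itself is simple enough to redo by hand. For a `(1, 4, 4)` input, `BatchNorm1d(4)` treats each of the four rows as a channel and, in training mode with its initial scale of 1 and shift of 0, subtracts the row's mean and divides by the square root of its (biased) variance plus a small `eps`. The plain-Python sketch below applies that to the values printed above; the `normalize` helper is made up for illustration, and `eps = 1e-5` is the layer's default.

```python
import math

# Values copied from the printed input above; each row is one channel.
rows = [
    [7.667809, 5.411770, 17.922087, 11.226503],
    [14.156848, 14.394440, 15.735331, 7.223950],
    [6.784162, 17.811817, 7.300866, 21.269615],
    [15.609527, 17.357277, 22.560999, 9.459081],
]

EPS = 1e-5  # default eps of torch.nn.BatchNorm1d


def normalize(values, eps=EPS):
    """Subtract the mean and divide by sqrt(biased variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


normed = [normalize(row) for row in rows]
overall_mean = sum(sum(row) for row in normed) / 16

# The first row should be close to the printed [-0.610756, -1.087662, 1.556901, 0.141518].
print(normed[0])
print(overall_mean)  # very close to zero
```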

This is beneficial because many activation functions (discussed below) have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero. Keeping the data centered around the area of steepest gradient will tend to mean faster, better learning and higher feasible learning rates.

**Dropout layers** are a tool for encouraging sparse representations in your model - that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero *during training* - dropout layers are always turned off for inference. This forces the model to learn against this masked or reduced dataset. For example:

In [6]:
my_tensor = torch.rand(1, 4, 4) # a random 3D tensor of size 1x4x4
print(my_tensor)

dropout = torch.nn.Dropout(p=0.4) # p is the probability of zeroing an element

# Dropout is random, so each application gives a different result.
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.630955, 0.035210, 0.435472, 0.337215],
         [0.239945, 0.074656, 0.800279, 0.892582],
         [0.150856, 0.591368, 0.257231, 0.838336],
         [0.893400, 0.000992, 0.684879, 0.894716]]])
tensor([[[0.000000, 0.058684, 0.725787, 0.562025],
         [0.000000, 0.124427, 1.333799, 0.000000],
         [0.251426, 0.000000, 0.428719, 0.000000],
         [0.000000, 0.000000, 1.141465, 0.000000]]])
tensor([[[0.000000, 0.058684, 0.725787, 0.000000],
         [0.399908, 0.124427, 1.333799, 0.000000],
         [0.000000, 0.985613, 0.428719, 1.397227],
         [0.000000, 0.000000, 1.141465, 0.000000]]])


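Above, you can see the effect of dropout on a sample tensor: some elements are zeroed, and the surviving elements are scaled up by 1/(1 - p) so that the expected value of each element is unchanged. You can use the optional `p` argument to set the probability of an individual element dropping out; if you don't, it defaults to 0.5.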
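
You can confirm the scaling from the printed values. With `p = 0.4`, every element that survives is multiplied by `1 / (1 - 0.4)`, so each output is either zero or the input divided by 0.6. The plain-Python check below uses the first row of the input and of the first dropout output above (values rounded to six decimals, hence the small tolerance).

```python
p = 0.4
scale = 1.0 / (1.0 - p)  # survivors are scaled up by this factor during training

# First row of the printed input and of the first dropout output above.
inp = [0.630955, 0.035210, 0.435472, 0.337215]
out = [0.000000, 0.058684, 0.725787, 0.562025]

# Each output is either dropped (zero) or the scaled input.
consistent = all(
    o == 0.0 or abs(o - i * scale) < 1e-5
    for i, o in zip(inp, out)
)
print(consistent)  # True
```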

**Activation Functions**

Activation functions make deep learning possible. A neural network is really a program - with many parameters - that simulates a mathematical function. If all we did was multiply tensors by layer weights repeatedly, we could only simulate linear functions; further, there would be no point to having many layers, as the whole network could be reduced to a single matrix multiplication. Inserting non-linear activation functions between layers is what allows a deep learning model to simulate any function, rather than just linear ones.
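
The collapse of stacked linear layers is easy to see with small matrices. In the plain-Python sketch below (biases left out for brevity, all numbers made up), applying two weight matrices in sequence gives exactly the same result as applying their product once, while putting a ReLU between them changes the answer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    """Element-wise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


x = [[1.0, -2.0]]                  # one input row
W1 = [[1.0, -1.0], [2.0, 0.5]]     # made-up weights of a first layer
W2 = [[0.5, 1.0], [-1.0, 2.0]]     # made-up weights of a second layer

two_linear = matmul(matmul(x, W1), W2)       # two layers, no activation
collapsed = matmul(x, matmul(W1, W2))        # one layer with the product matrix
with_relu = matmul(relu(matmul(x, W1)), W2)  # ReLU between the layers

print(two_linear)  # [[0.5, -7.0]]
print(collapsed)   # [[0.5, -7.0]]
print(with_relu)   # [[0.0, 0.0]]
```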

`torch.nn` has `Module` subclasses encapsulating all of the major activation functions including ReLU and its many variants, Tanh, Hardtanh, sigmoid, and more. It also includes other functions, such as Softmax, that are most useful at the output stage of a model.

**Loss Functions**

Loss functions tell us how far a model's prediction is from the correct answer. PyTorch contains a variety of loss functions, including the common MSE (mean squared error, built on the squared L2 norm of the error), Cross Entropy Loss and Negative Log Likelihood Loss (useful for classifiers), and others.
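
As a rough illustration of what two of these losses compute, here is a plain-Python sketch. The helper functions and the numbers are made up for illustration. They follow the usual definitions, mean of squared differences for MSE and negative log-softmax of the correct class for cross entropy on raw scores, but they are not PyTorch's implementations.

```python
import math


def mse_loss(pred, target):
    """Mean of the squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the correct class, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


mse = mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # (0 + 0 + 4) / 3

logits = [4.0, 0.0, 0.0]               # the model strongly favors class 0
ce_right = cross_entropy(logits, 0)    # class 0 is the correct answer
ce_wrong = cross_entropy(logits, 1)    # class 1 is the correct answer

print(mse)
print(ce_right, ce_wrong)  # the loss is much larger when the favored class is wrong
```

In both cases the loss is small when the prediction is close to the target and grows as it moves away, which is what gives the optimizer a signal to follow.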