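# NN: Natural Language Processing with Transformer
## Tutorial Goals
- This tutorial focuses on natural language processing and mainly covers the `Transformer`.
- The goal is to show how to implement neural networks with Python and PyTorch.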

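## Using an NN for a Chinese text classification task

- In this tutorial we implement Chinese sentiment analysis.
- The dataset consists of user reviews from a food-delivery platform ([download link](https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv)).
- The dataset has two columns: label and review.
- Label 1 is positive, 0 is negative.
- There are 4,000 positive and about 8,000 negative reviews.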

In [None]:
# 0. Download the data and install jieba

!mkdir -p data
!wget https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv -O data/waimai_10k.csv
!pip install jieba

--2025-11-03 08:29:20--  https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 919380 (898K) [text/plain]
Saving to: ‘data/waimai_10k.csv’


2025-11-03 08:29:20 (26.6 MB/s) - ‘data/waimai_10k.csv’ saved [919380/919380]



In [None]:
# 1. Import the required packages
import math

# Third-party packages
import jieba
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [None]:
# 2. Read the data with pandas
# Download the dataset first (step 0)

df = pd.read_csv("./data/waimai_10k.csv")

In [None]:
# 3. Inspect the data

df.head()

Unnamed: 0,label,review
0,1,很快，好吃，味道足，量大
1,1,没有送水没有送水没有送水
2,1,非常快，态度好。
3,1,方便，快捷，味道可口，快递给力
4,1,菜味道很棒！送餐很及时！


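## Building the vocabulary
- A computer cannot tell the meanings of different words apart from their characters alone.
- Computer vision, by contrast, works directly on the pixel values of the image data.
- We make text understandable to a computer by representing it as vectors.
- Expressing the meaning of a word as a vector is called word embeddings.
- Example:
$\textrm{apple}=[0.123, 0.456,0.789,\dots,0.111]$

- How do we build the vector for each word?
    - Traditional approach: count-based methods
    - More recent approach (2013 to present): train word2vec with a (shallow) neural network ([reference](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)); the resulting vectors are called word embeddings
    - Current approach (2018 to present): train Transformer models such as BERT with a (deep) neural network ([reference](https://youtu.be/gh0hewYkjgo)); the resulting vectors are called contextualized embeddings
- Before any of that, we first need to build a vocabulary of individual tokens
    - There are roughly two tokenization schemes:
        1. Split at every character (character-level)
        2. Split into words (word-level)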
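
As a rough illustration of the count-based approach mentioned above, the sketch below represents each sentence by how often each vocabulary token occurs in it. The sentences are hypothetical, pre-tokenized toy data (not taken from the dataset), and `bag_of_words` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical, already-tokenized toy sentences (word-level)
sentences = [
    ["味道", "很", "好"],
    ["送餐", "很", "慢"],
    ["味道", "好", "送餐", "快"],
]

# Fixed token order so every sentence maps to a vector of the same length
tokens = sorted({tok for sent in sentences for tok in sent})

def bag_of_words(sent, tokens):
    """Return one count per vocabulary token for a tokenized sentence."""
    counts = Counter(sent)  # missing tokens count as 0
    return [counts[tok] for tok in tokens]

vectors = [bag_of_words(sent, tokens) for sent in sentences]
for sent, vec in zip(sentences, vectors):
    print(sent, vec)
```

Vectors like these grow with the vocabulary and ignore word order, which is part of the motivation for learned embeddings.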

## Word embeddings
- Well-known methods include:
    1. word2vec: Skip-gram, CBOW (continuous bag-of-words)
    2. GloVe
    3. fastText
- This tutorial uses PyTorch's built-in Embedding layer to implement word embeddings

In [None]:
word_to_idx = {"<pad>": 0, "<unk>": 1, "好吃": 2, "棒": 3, "给力": 4}
embeds = torch.nn.Embedding(5, 5)  # 5 words in vocab, 5 dimensional embeddings

In [None]:
def get_word_id(word, vocab, unk_idx: int = 1):
    return vocab.get(word, unk_idx)

In [None]:
lookup_tensor = torch.tensor(
    [
        get_word_id("好吃", word_to_idx),
        get_word_id("不棒", word_to_idx),
        get_word_id("<unk>", word_to_idx),
    ],
)
word_embed = embeds(lookup_tensor)
print(word_embed)

tensor([[-0.2812, -0.3170, -0.1047,  0.2052, -0.7845],
        [ 0.3511,  0.4559, -1.9599,  1.9738,  0.4159],
        [ 0.3511,  0.4559, -1.9599,  1.9738,  0.4159]],
       grad_fn=<EmbeddingBackward0>)
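
To make the lookup explicit, the following sketch checks that `torch.nn.Embedding` simply returns rows of its weight matrix, and that an out-of-vocabulary word falls back to the `<unk>` row. The names `toy_vocab` and `toy_embeds` and the fixed seed are chosen here for illustration only.

```python
import torch

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

toy_vocab = {"<pad>": 0, "<unk>": 1, "好吃": 2, "棒": 3, "给力": 4}
toy_embeds = torch.nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=5)

# "不棒" is not in the vocabulary, so it maps to the <unk> index
ids = torch.tensor([toy_vocab.get(w, toy_vocab["<unk>"]) for w in ["好吃", "不棒"]])
vectors = toy_embeds(ids)

# The layer is a lookup table: row i of the output is row ids[i] of the weight matrix
print(torch.equal(vectors, toy_embeds.weight[ids]))
print(torch.equal(vectors[1], toy_embeds.weight[toy_vocab["<unk>"]]))
```

This is also why the second and third rows of the output above are identical: both inputs resolve to index 1.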


### Review: using torch.nn.Linear

In [None]:
# Example usage of torch.nn.Linear
# m is a linear transformation layer; so far we have always called it inside forward

m = torch.nn.Linear(20, 30)
inputs = torch.randn(128, 20) # 128 samples with 20 features each
output = m(inputs)

print(output.size())

torch.Size([128, 30])


In [None]:
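# 4. Build the vocabulary
use_jieba=True

vocab = {'<pad>':0, '<unk>':1}

if use_jieba:
    # Word-level tokenization with jieba
    words = []
    for sent in df['review']:
        tokens = jieba.lcut(sent, cut_all=False)
        words.extend(tokens)

else:
    # Character-level tokenization: concatenate all reviews into one string
    words = df['review'].str.cat()

# Remove duplicate tokens (sorted so the ids are reproducible)
words = sorted(set(words))
# Ids 0 and 1 are already taken by <pad> and <unk>, so start at 2
for idx, word in enumerate(words, start=2):
    # Store the word-to-id mapping in the dictionary
    vocab[word] = idx

# Check the vocabulary size
print("The vocab size is {}.".format(len(vocab)))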

Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
DEBUG:jieba:Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.627 seconds.
DEBUG:jieba:Loading model cost 0.627 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.


The vocab size is 11010.
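
Once the vocabulary exists, encoding a sentence is one dictionary lookup per token. The sketch below uses a hypothetical toy vocabulary and pre-tokenized input so that it runs without jieba or the dataset; `encode` is an illustrative helper, not part of the tutorial code.

```python
# Hypothetical toy vocabulary laid out like `vocab` above
toy_vocab = {"<pad>": 0, "<unk>": 1}
for idx, word in enumerate(sorted({"味道", "好", "送餐", "快"}), start=2):
    toy_vocab[word] = idx

def encode(tokens, vocab, max_seq_len=50):
    """Truncate to max_seq_len, then map tokens to ids (unknown tokens -> <unk>)."""
    unk_idx = vocab["<unk>"]
    return [vocab.get(tok, unk_idx) for tok in tokens[:max_seq_len]]

# "包装" and "差" are not in the toy vocabulary, so they become <unk>
ids = encode(["味道", "好", "包装", "差"], toy_vocab)
print(ids)
```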


## Building a Dataset with PyTorch
![Imgur](https://i.imgur.com/wGnfCmH.png)

In [None]:
# 5. Split the data into train / validation / test sets

train_data, test_data = train_test_split(
    df,
    test_size=0.2,
)
train_data, validation_data = train_test_split(
    train_data,
    test_size=0.1,
)
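
The two calls above leave roughly 72% of the rows for training, 8% for validation and 20% for testing. The arithmetic can be sketched as follows, assuming the CSV holds 11,987 reviews (an assumption, though it is consistent with the batch counts in the training logs later on) and assuming `train_test_split` rounds the held-out part up.

```python
import math

n_total = 11987      # assumed number of rows in waimai_10k.csv
batch_size = 64      # same batch size as used in this tutorial

def split_sizes(n, test_size):
    """Return (n_train, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n * test_size)
    return n - n_held_out, n_held_out

n_train_full, n_test = split_sizes(n_total, 0.2)
n_train, n_valid = split_sizes(n_train_full, 0.1)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(name, n, "rows,", math.ceil(n / batch_size), "batches")
```

Because no `random_state` is passed, the actual rows in each split change from run to run; only the sizes stay the same.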

## Defining the hyperparameters



In [None]:
# 6. Define the hyperparameters

parameters = {
    "padding_idx": 0,
    "vocab_size": len(vocab),
    # Hyperparameters
    "embed_dim": 300,
    "hidden_dim": 256,
    "module_name": 'rnn', # options: rnn, lstm, gru, transformer
    "num_layers": 2,
    "learning_rate": 5e-4, # consider 5e-5 when using the Transformer
    "epochs": 10,
    "max_seq_len": 50,
    "batch_size": 64,
    "bidirectional": True,
    # Transformers
    "num_heads": 4,
    "dropout": 0.2,
}

In [None]:
# 7. Build the PyTorch Dataset (define the class)

class WaimaiDataset(torch.utils.data.Dataset):
    # Inherits from torch.utils.data.Dataset
    def __init__(self, vocab, data, max_seq_len, use_jieba):
        self.df = data
        self.max_seq_len = max_seq_len
        # Whether to tokenize with jieba (word-level) or per character
        self.use_jieba = use_jieba
        self.vocab = vocab
        self.unk_idx = self.vocab.get('<unk>')

    # Override the inherited __getitem__
    def __getitem__(self, idx):
        # The first column of the dataframe is the label,
        # the second column is the review text
        label, sent = self.df.iloc[idx, 0:2]
        # Convert the label to float32, as the loss function expects later on
        label_tensor = torch.tensor(label, dtype=torch.float32)
        if self.use_jieba:
            # lcut returns a list of tokens
            tokens = jieba.lcut(sent, cut_all=False)
        else:
            # Split at every character
            tokens = list(sent)

        # Cap the sequence length
        tokens = tokens[:self.max_seq_len]

        # Map each token to its id using vocab
        # Tokens that are not in the vocabulary get the <unk> index
        tokens_id = [self.vocab.get(word, self.unk_idx) for word in tokens]
        tokens_tensor = torch.LongTensor(tokens_id)

        # Index 0 of the returned tuple is the sentence, index 1 is the label
        return tokens_tensor, label_tensor

    # Override the inherited __len__
    def __len__(self):
        return len(self.df)

In [None]:
# 8. Build the PyTorch Datasets (instantiate the class)
# The use_jieba flag from step 4 is reused so tokenization matches the vocabulary

trainset = WaimaiDataset(
    vocab,
    train_data,
    parameters["max_seq_len"],
    use_jieba=use_jieba
)
validset = WaimaiDataset(
    vocab,
    validation_data,
    parameters["max_seq_len"],
    use_jieba=use_jieba
)
testset = WaimaiDataset(
    vocab,
    test_data,
    parameters["max_seq_len"],
    use_jieba=use_jieba
)

In [None]:
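# 9. Assemble the data of one batch (define the function)

def collate_batch(batch):
    # Take element 0 (the token ids) of every sample in the batch (mind the order)
    text = [i[0] for i in batch]
    # Pad to the longest sequence in the batch (pads with 0, the <pad> index)
    text = pad_sequence(text, batch_first=True)

    # Take element 1 (the label) of every sample in the batch (mind the order)
    label = [i[1] for i in batch]
    # Stack the labels of the batch into a single tensor
    label = torch.stack(label)

    return text, label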
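
To see what `pad_sequence` does inside `collate_batch`, here is a small sketch on hypothetical token-id tensors of different lengths. Shorter sequences are right-padded with 0, which matches the `<pad>` index in the vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token ids for three reviews of different lengths
toy_batch = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7]),
    torch.tensor([4, 9]),
]

# batch_first=True gives shape (batch size, longest sequence in the batch)
padded = pad_sequence(toy_batch, batch_first=True, padding_value=0)
print(padded.shape)
print(padded)

# True where a position holds padding rather than a real token
padding_mask = padded == 0
print(padding_mask)
```

Padding is done per batch, so the sequence length can differ from one batch to the next.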

In [None]:
# 10. Create the mini-batches

# Only trainloader is shuffled: reshuffling every epoch changes the composition
# of the batches, which helps reduce overfitting

trainloader = DataLoader(
    trainset,
    batch_size=parameters["batch_size"],
    collate_fn=collate_batch,
    shuffle=True,
)
validloader = DataLoader(
    validset,
    batch_size=parameters["batch_size"],
    collate_fn=collate_batch,
    shuffle=False,
)
testloader = DataLoader(
    testset,
    batch_size=parameters["batch_size"],
    collate_fn=collate_batch,
    shuffle=False,
)

## Building the Transformer model
- As covered in the lecture, the [Transformer](https://arxiv.org/abs/1706.03762) has an encoder part and a decoder part; this tutorial implements the encoder part
- PyTorch already provides [`torch.nn.TransformerEncoderLayer`](https://docs.pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html) and [`torch.nn.TransformerEncoder`](https://docs.pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html)
    - `torch.nn.TransformerEncoderLayer` holds the configuration of each encoder layer, such as `d_model` and `nhead`
    - Once the layer is configured, `torch.nn.TransformerEncoder` stacks it `num_layers` times to build a complete Transformer encoder, for example:
    ```python
    encoder_layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
    transformer_encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=6)
    src = torch.rand(10, 32, 512)  # (sequence length, batch size, d_model)
    out = transformer_encoder(src)
    ```
- If you only want a Transformer decoder, the usage is similar to `TransformerEncoder`: use [`torch.nn.TransformerDecoderLayer`](https://docs.pytorch.org/docs/stable/generated/torch.nn.TransformerDecoderLayer.html) and [`torch.nn.TransformerDecoder`](https://docs.pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html)
- If you want to build a full Transformer directly, you can use [`torch.nn.Transformer`](https://docs.pytorch.org/docs/stable/generated/torch.nn.Transformer.html#torch.nn.Transformer)
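
Padded positions can be hidden from self-attention through the `src_key_padding_mask` argument of `torch.nn.TransformerEncoder`. The sketch below is a minimal example under stated assumptions: the sizes are arbitrary, `batch_first=True` is used for readability (it needs a reasonably recent PyTorch), and dropout is set to 0 so that the check at the end is deterministic.

```python
import torch

torch.manual_seed(0)

d_model, nhead = 16, 4   # arbitrary toy sizes
layer = torch.nn.TransformerEncoderLayer(
    d_model=d_model, nhead=nhead, dim_feedforward=32,
    dropout=0.0, batch_first=True,
)
encoder = torch.nn.TransformerEncoder(layer, num_layers=2)

# Token ids for 2 sequences of length 4; id 0 marks padding
token_ids = torch.tensor([[3, 5, 2, 0],
                          [4, 0, 0, 0]])
padding_mask = token_ids == 0            # (B, S), True = ignore this position

src = torch.rand(2, 4, d_model)          # stand-in for embedded tokens, (B, S, E)
out = encoder(src, src_key_padding_mask=padding_mask)

# Changing the content of padded positions should not affect the real tokens
src_changed = src.clone()
src_changed[padding_mask] = torch.rand(int(padding_mask.sum()), d_model)
out_changed = encoder(src_changed, src_key_padding_mask=padding_mask)
print(torch.allclose(out[~padding_mask], out_changed[~padding_mask], atol=1e-5))
```

Without the mask, every token would also attend to the `<pad>` positions, so the result for a review would depend on how much padding its batch happens to need.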

In [None]:
# 11-1. Build the positional encoding for the Transformer model

class PositionalEncoding(torch.nn.Module):
    r"""Inject some information about the relative or absolute position of the tokens
        in the sequence. The positional encodings have the same dimension as
        the embeddings, so that the two can be summed. Here, we use sine and cosine
        functions of different frequencies.

    Args:
        d_model: the embed dim (required).
        dropout: the dropout value (default=0.1).
        max_len: the max. length of the incoming sequence (default=5000).
    Examples:
        >>> pos_encoder = PositionalEncoding(d_model)
    """

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super().__init__()
        self.dropout = torch.nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        r"""Inputs of forward function
        Args:
            x: the sequence fed to the positional encoder model (required).
        Shape:
            x: [sequence length, batch size, embed dim]
            output: [sequence length, batch size, embed dim]
        Examples:
            >>> output = pos_encoder(x)
        """

        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
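
The buffer built in `__init__` follows the sinusoidal scheme from the Transformer paper: even dimensions use a sine and odd dimensions a cosine of the position, scaled by a per-dimension frequency. A plain-Python sketch of the same formula follows; the helper name `sinusoid_row` is made up for illustration, and it assumes an even `d_model`.

```python
import math

def sinusoid_row(pos, d_model):
    """Positional encoding vector for one position (assumes even d_model)."""
    row = []
    for i in range(0, d_model, 2):
        # Same frequency term as div_term in the class above
        freq = math.exp(i * (-math.log(10000.0) / d_model))
        row.append(math.sin(pos * freq))  # even index
        row.append(math.cos(pos * freq))  # odd index
    return row

print(sinusoid_row(0, 4))
print([round(v, 4) for v in sinusoid_row(1, 4)])
```

Position 0 always encodes to alternating 0 and 1, and higher dimensions oscillate more slowly as the position grows.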

In [None]:
# 11-2. Build the Transformer encoder model

class CustomTransformerEncoder(torch.nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        hidden_dim: int,
        padding_idx: int,
        num_layers: int,
        num_heads: int,
        dropout: float,
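    ):
        """Transformer encoder architecture for sentence classification.
        Arguments:
            - vocab_size: number of entries in the vocabulary
            - embed_dim: embedding dimension (used as d_model)
            - hidden_dim: size of the feed-forward layer inside each encoder layer
            - padding_idx: index of the <pad> token
            - num_layers: number of encoder layers
            - num_heads: number of attention heads
            - dropout: dropout probability
        Returns:
            - None
        """
        super().__init__()

        # Embedding layer
        self.embedding_layer = torch.nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embed_dim,
            padding_idx=padding_idx
        )
        # Dropout layer
        self.embedding_dropout = torch.nn.Dropout(dropout)

        # Positional encoding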
        self.pos_encoder = PositionalEncoding(
            d_model=embed_dim,
            dropout=dropout
        )
        encoder_layer = torch.nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=hidden_dim,
            dropout=dropout
        )
        self.transformer_encoder = torch.nn.TransformerEncoder(
            encoder_layer=encoder_layer,
            num_layers=num_layers
        )
        self.linear_layer = torch.nn.Linear(
            in_features=embed_dim,
            out_features=embed_dim
        )
        self.output_layer = torch.nn.Linear(
            in_features=embed_dim,
            out_features=1
        )

    def forward(self, X):
        """Forward pass of the network.
        Arguments:
            - X: input token ids of shape (B, S), where B is the batch size and S is the sequence length
        Returns:
            - logits: raw model outputs of shape (B, 1)
            - Y: the logits after a sigmoid (probability of the positive class), shape (B, 1)
        """
        # B: batch size; S: sequence length; E: embedding dimension
        E = self.embedding_layer(X) # shape: (B, S, E)
        E = self.embedding_dropout(E)

        # The PyTorch Transformer modules default to seq-first input (seq_len, batch, d_model),
        # and PositionalEncoding indexes positions along dim 0,
        # so swap the batch and sequence dimensions before adding the positions
        E = E.transpose(0, 1) # shape: (S, B, E)

        # Add the positional encoding
        E = self.pos_encoder(E) # shape: (S, B, E)

        # Run the Transformer encoder
        E = self.transformer_encoder(E) # shape: (S, B, E)

        # Switch back to (B, S, E) for the classification layers
        E = E.transpose(0, 1) # shape: (B, S, E)

        H_out = self.linear_layer(E) # shape: (B, S, E)

        # Take the hidden state of the first token
        logits = self.output_layer(H_out[:, 0, :]) # shape: (B, 1)
        Y = torch.sigmoid(logits)

        return logits, Y

In [None]:
# 12. Set up everything needed for training

# model = RNNModel(
#     vocab_size=parameters["vocab_size"],
#     embed_dim=parameters["embed_dim"],
#     hidden_dim=parameters["hidden_dim"],
#     padding_idx=parameters["padding_idx"],
#     bi=parameters["bidirectional"],
# )

model = CustomTransformerEncoder(
    vocab_size=parameters["vocab_size"],
    embed_dim=parameters["embed_dim"],
    hidden_dim=parameters["hidden_dim"],
    padding_idx=parameters["padding_idx"],
    num_layers=parameters["num_layers"],
    num_heads=parameters["num_heads"],
    dropout=parameters["dropout"],
)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=parameters["learning_rate"])
loss_func = torch.nn.BCEWithLogitsLoss() # applies the sigmoid internally, so it takes raw logits
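
`BCEWithLogitsLoss` combines the sigmoid and the binary cross-entropy in one numerically stable step, which is why the training code passes raw logits rather than probabilities. The sketch below spells out the per-example arithmetic in plain Python; the stable form shown is a common formulation and is not claimed to be PyTorch's exact implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_prob(p, y):
    """Binary cross-entropy computed from a probability p."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_from_logit(z, y):
    """Same loss computed directly from the logit z in a stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Hypothetical (logit, label) pairs
for z, y in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    print(z, y, bce_from_prob(sigmoid(z), y), bce_from_logit(z, y))
```

Both columns agree up to floating-point error; the logit form simply avoids taking the logarithm of a probability that has rounded to 0 or 1.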



## Setting up the training loop

In [None]:
# 13. Set up the training loop (define the function)

def train(trainloader, model, optimizer, loss_func):
    """Run one epoch of training.
    Arguments:
        - trainloader: mini-batched dataset built with a PyTorch DataLoader
        - model: the model to train
        - optimizer: the algorithm that optimizes the objective function
        - loss_func: the loss function (binary cross-entropy on logits)
    Returns:
        - train_loss: the model's average training loss over one epoch
    """
    # Put the model in training mode
    model.train()

    # Accumulates the training loss over one epoch
    train_loss = 0
    # Draw one mini-batch at a time from trainloader
    for x, y in tqdm(trainloader, desc="Training"):
        # Move the tensors to the selected device
        x = x.to(device)
        y = y.to(device)

        # Reset the gradients of the model
        optimizer.zero_grad()

        # 1. Forward pass
        logits, pred = model(x)

        # 2. Compute the loss (binary cross-entropy)
        loss = loss_func(logits.squeeze(-1), y)

        # 3. Backpropagate to compute the gradients
        loss.backward()
        # 4. Update the model weights
        optimizer.step()

        # One epoch draws many batches, so add up the loss of every batch
        # .item() returns the Python number stored in a one-element tensor
        train_loss += loss.item()

    return train_loss / len(trainloader)

## Setting up the evaluation loop

In [None]:
# 14. Set up the evaluation loop (define the function)

def evaluate(dataloader, model, loss_func):
    """Evaluate the model on a validation or test set.
    Arguments:
        - dataloader: mini-batched dataset built with a PyTorch DataLoader
        - model: the model to evaluate
        - loss_func: the loss function (binary cross-entropy on logits)
    Returns:
        - loss: the model's average loss on the validation/test set
        - acc: the model's accuracy on the validation/test set
    """
    # Put the model in evaluation mode
    # Dropout is switched off automatically in this mode
    model.eval()
    total_loss = 0 # accumulates the loss
    label_list = []
    prediction_list = []

    # Disable gradient computation
    with torch.no_grad():
        # Draw one mini-batch at a time from dataloader
        for x, y in tqdm(dataloader, desc="Evaluating"):
            x, y = x.to(device), y.to(device)
            logits, pred = model(x)

            # Compute the loss (binary cross-entropy)
            # The model output has shape (B, 1); .squeeze(-1) turns it into (B,)
            loss = loss_func(logits.squeeze(-1), y)
            total_loss += loss.item()

            # A predicted probability above 0.5 counts as class 1, otherwise class 0
            pred = (pred > 0.5) * 1 # pred.shape: (B, 1)
            prediction_list.extend(pred.cpu().squeeze(-1).tolist())
            label_list.extend(y.cpu().tolist())

    avg_loss = total_loss / len(dataloader)
    # Compute the accuracy
    acc = accuracy_score(label_list, prediction_list)

    return avg_loss, acc

## Start training

In [None]:
# 15. Script for the full training and validation process

train_loss_history = []
valid_loss_history = []

for epoch in range(parameters["epochs"]):
    train_loss = train(
        trainloader,
        model,
        optimizer=optimizer,
        loss_func=loss_func,
    )

    print("Training loss at epoch {} is {}.".format(epoch+1, train_loss))
    train_loss_history.append(train_loss)

    if epoch % 2 == 1:
        print("=====Start validation=====")
        valid_loss, valid_acc = evaluate(
            dataloader=validloader,
            model=model,
            loss_func=loss_func,
        )
        valid_loss_history.append(valid_loss)
        print("Validation accuracy at epoch {} is {}, and validation loss is {}."\
              .format(epoch+1, valid_acc, valid_loss))

    torch.save(model.state_dict(), "model_epoch_{}.pkl".format(epoch))

Training: 100%|██████████| 135/135 [00:04<00:00, 28.71it/s]


Training loss at epoch 1 is 0.5304086109002432.


Training: 100%|██████████| 135/135 [00:04<00:00, 30.08it/s]


Training loss at epoch 2 is 0.39762211199159975.
=====Start validation=====


Evaluating: 100%|██████████| 15/15 [00:00<00:00, 33.66it/s]


Validation accuracy at epoch 2 is 0.8633993743482794, and validation loss is 0.3493048369884491.


Training: 100%|██████████| 135/135 [00:04<00:00, 27.14it/s]


Training loss at epoch 3 is 0.3617438483017462.


Training: 100%|██████████| 135/135 [00:04<00:00, 29.82it/s]


Training loss at epoch 4 is 0.3428898274898529.
=====Start validation=====


Evaluating: 100%|██████████| 15/15 [00:00<00:00, 44.98it/s]


Validation accuracy at epoch 4 is 0.8748696558915537, and validation loss is 0.32384509444236753.


Training: 100%|██████████| 135/135 [00:05<00:00, 26.83it/s]


Training loss at epoch 5 is 0.3200523402955797.


Training: 100%|██████████| 135/135 [00:04<00:00, 29.92it/s]


Training loss at epoch 6 is 0.31670454144477844.
=====Start validation=====


Evaluating: 100%|██████████| 15/15 [00:00<00:00, 42.83it/s]


Validation accuracy at epoch 6 is 0.872784150156413, and validation loss is 0.314593979716301.


Training: 100%|██████████| 135/135 [00:05<00:00, 24.88it/s]


Training loss at epoch 7 is 0.29816480963318437.


Training: 100%|██████████| 135/135 [00:04<00:00, 29.29it/s]


Training loss at epoch 8 is 0.29653940465715195.
=====Start validation=====


Evaluating: 100%|██████████| 15/15 [00:00<00:00, 41.68it/s]


Validation accuracy at epoch 8 is 0.8738269030239834, and validation loss is 0.33724465370178225.


Training: 100%|██████████| 135/135 [00:04<00:00, 29.47it/s]


Training loss at epoch 9 is 0.2797621112178873.


Training: 100%|██████████| 135/135 [00:04<00:00, 27.16it/s]


Training loss at epoch 10 is 0.2762188545531697.
=====Start validation=====


Evaluating: 100%|██████████| 15/15 [00:00<00:00, 42.64it/s]

Validation accuracy at epoch 10 is 0.8894681960375391, and validation loss is 0.31517145534356433.
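
Because validation only runs on every second epoch while a checkpoint is saved after every epoch, position `i` in `valid_loss_history` corresponds to the loop counter `epoch = 2 * i + 1`, not to `i` itself. The sketch below applies that mapping to the validation losses printed above (rounded to four decimals); `history_index_to_epoch` is an illustrative helper name.

```python
# Validation losses from the log above, rounded to four decimals
valid_losses = [0.3493, 0.3238, 0.3146, 0.3372, 0.3152]

def history_index_to_epoch(i, every=2):
    """Map a position in the validation history to the 0-based epoch counter."""
    return every * i + (every - 1)

best_index = min(range(len(valid_losses)), key=valid_losses.__getitem__)
best_epoch = history_index_to_epoch(best_index)
print(best_index, best_epoch, "model_epoch_{}.pkl".format(best_epoch))
```

For this run the lowest validation loss sits at history index 2, which is the checkpoint written when `epoch == 5` (reported as epoch 6 in the log).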





In [None]:
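# 16. Predict on the test set

# Validation runs only when epoch % 2 == 1, so entry i of valid_loss_history
# belongs to the checkpoint saved at (0-based) epoch 2 * i + 1
best_epoch = 2 * int(np.argmin(valid_loss_history)) + 1
model.load_state_dict(
    torch.load("model_epoch_{}.pkl".format(best_epoch))
)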

print("=====Start testing=====")
test_loss, test_acc = evaluate(testloader, model, loss_func)
print("Testing accuracy is {}.".format(test_acc))

=====Start testing=====


Evaluating: 100%|██████████| 38/38 [00:00<00:00, 39.76it/s]

Testing accuracy is 0.8669724770642202.



