# Lab Task 1: Pretrained Models

------
### **1. Training a Model on the GPU**

In PyTorch, the following code checks whether a GPU is available in the current environment:

In [None]:
import torch

# Check whether a GPU is available
if torch.cuda.is_available():
    print(f"CUDA is available. Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
    print("CUDA is not available. Using CPU.")

CUDA is available. Number of GPUs: 1
Current device: 0
Device name: Tesla T4


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleModel(nn.Module):
    def __init__(self, dropout_rate=0.0):
        super().__init__()
        self.dropout_rate = dropout_rate
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        # 1600 = 64 channels * 5 * 5, which assumes 1x28x28 inputs
        self.fc1 = nn.Linear(1600, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, 2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        # F.dropout stands in for the undefined dropout_layer helper; it is only
        # active in training mode, which nn.Module tracks in self.training
        x = F.dropout(x, p=self.dropout_rate, training=self.training)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

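If the output is 'CUDA is not available. Using CPU.', check that you launched the correct environment, try reinstalling PyTorch, or contact the teaching assistant.

The following example moves a model to the GPU. After defining the model, call model = model.to(device).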

In [None]:
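# Select the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create the model (SimpleModel must already be defined by running the cell above)
model = SimpleModel()

# Move the model to the selected device
model = model.to(device)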

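The following example moves data to the GPU. Because the model lives on the GPU, its inputs must be on the GPU as well before they can be passed to the model; inputs = inputs.to(device) does this.

The model's outputs are also on the GPU, so the labels must be moved there too in order to compute the loss: labels = labels.to(device).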

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# Training example (fragment): `model` and `device` come from the cell above, and
# `train_loader` is assumed to be a DataLoader that yields (inputs, labels) batches.
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        # Move the batch to the GPU (if available)
        inputs, labels = inputs.to(device), labels.to(device)

        # Forward pass
        outputs = model(inputs)

Following the steps above puts both the data and the model on the GPU, which speeds up training.
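As a self-contained illustration of these steps, the following sketch runs a short training loop end to end. The synthetic dataset, the small `nn.Sequential` model, and all hyperparameters are assumptions made only for this example; they are not part of the lab.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Synthetic stand-in data: 256 samples, 20 features, 4 classes (hypothetical)
features = torch.randn(256, 20)
targets = torch.randint(0, 4, (256,))
train_loader = DataLoader(TensorDataset(features, targets), batch_size=32, shuffle=True)

# Hypothetical small model, moved to the same device the batches will be moved to
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    model.train()
    total_loss = 0.0
    for inputs, labels in train_loader:
        # Inputs and labels must live on the same device as the model
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    epoch_losses.append(total_loss / len(train_loader))
    print(f"Epoch {epoch + 1}, Loss: {epoch_losses[-1]:.4f}")
```

The same loop runs unchanged on a CPU-only machine, because `device` falls back to `'cpu'` when no GPU is found.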

You can check whether the GPU is in use and watch its utilization with the following command:

watch -n 5 nvidia-smi

This command refreshes the NVIDIA GPU status every 5 seconds (-n 5).

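### **2. Understanding Pretrained Language Models**

Below we use BERT as an example and run the experiments with the bert-base-uncased version. We first load the model and the tokenizer with AutoModel and AutoTokenizer. The tokenizer maps each token of the text to its index in the vocabulary, so that BERT's embedding layer can then map indices to embeddings.

The complete code is as follows: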

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer

# Path (or name) of the model; here a local copy stored on Google Drive
model_name = '/content/drive/MyDrive/bert-base-uncased/bert-base-uncased'

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModel.from_pretrained(model_name)

# Input text
input_text = "Here is some text to encode"

# Convert the text to token IDs with the tokenizer
input_ids = tokenizer.encode(input_text, add_special_tokens=True)
print(input_ids)

# Convert to a tensor with a batch dimension of 1
input_ids = torch.tensor([input_ids])

# Run BERT without tracking gradients
with torch.no_grad():
    output = model(input_ids)

# Hidden states of BERT's last layer
output_hidden_states = output.last_hidden_state
output_hidden_states.shape

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


torch.Size([1, 9, 768])

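Tokenization adds special tokens at both ends of the text: the token [CLS] at the beginning and the token [SEP] at the end. Try changing input_text and setting add_special_tokens=False: the leading 101 and the trailing 102 disappear, which shows that these two tokens are encoded as 101 and 102 respectively.

In addition, batching requires every text in a batch to have the same length, so shorter texts are extended with padding tokens. We therefore use attention_mask to mask out these padding tokens so that they do not take part in the self-attention computation.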
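To make the role of padding concrete, here is a minimal sketch that pads two token-ID sequences to a common length by hand and builds the matching attention mask. It assumes the padding ID is 0, which is the [PAD] ID in the bert-base-uncased vocabulary. The helper `pad_batch` and the second, shorter sequence are made up for illustration; in practice the tokenizer does this for you when padding is enabled.

```python
import torch

PAD_ID = 0  # assumption: [PAD] has ID 0, as in bert-base-uncased


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad lists of token IDs to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that self-attention should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return torch.tensor(input_ids), torch.tensor(attention_mask)


batch = [
    [101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102],  # IDs printed above
    [101, 2182, 2003, 102],                                  # shorter text: "here is"
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids.shape)    # torch.Size([2, 9])
print(attention_mask[1])  # tensor([1, 1, 1, 1, 0, 0, 0, 0, 0])
```

The two tensors can then be passed to the model together, as in model(input_ids=input_ids, attention_mask=attention_mask).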

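The final output contains the hidden states of every token in the text.

We can iterate over model.named_parameters() to inspect all of the model's parameters and their shapes. The complete code is as follows: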

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer

# Path (or name) of the model
model_name = '/content/drive/MyDrive/bert-base-uncased/bert-base-uncased'

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model
model = AutoModel.from_pretrained(model_name)

# Print the name and shape of every model parameter
for name, param in model.named_parameters():
    print(f"Parameter Name: {name}, Shape: {param.shape}")

Parameter Name: embeddings.word_embeddings.weight, Shape: torch.Size([30522, 768])
Parameter Name: embeddings.position_embeddings.weight, Shape: torch.Size([512, 768])
Parameter Name: embeddings.token_type_embeddings.weight, Shape: torch.Size([2, 768])
Parameter Name: embeddings.LayerNorm.weight, Shape: torch.Size([768])
Parameter Name: embeddings.LayerNorm.bias, Shape: torch.Size([768])
Parameter Name: encoder.layer.0.attention.self.query.weight, Shape: torch.Size([768, 768])
Parameter Name: encoder.layer.0.attention.self.query.bias, Shape: torch.Size([768])
Parameter Name: encoder.layer.0.attention.self.key.weight, Shape: torch.Size([768, 768])
Parameter Name: encoder.layer.0.attention.self.key.bias, Shape: torch.Size([768])
Parameter Name: encoder.layer.0.attention.self.value.weight, Shape: torch.Size([768, 768])
Parameter Name: encoder.layer.0.attention.self.value.bias, Shape: torch.Size([768])
Parameter Name: encoder.layer.0.attention.output.dense.weight, Shape: torch.Size([768, 7
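The shapes in this listing also tell us how many parameters each block holds. As a small worked example, the sketch below multiplies out the shapes of the five embedding-layer parameters printed above. It uses only those printed shapes, and the dictionary is copied by hand rather than read from the model.

```python
import math

# Shapes copied from the embedding-layer lines of the listing above
embedding_shapes = {
    "embeddings.word_embeddings.weight": (30522, 768),
    "embeddings.position_embeddings.weight": (512, 768),
    "embeddings.token_type_embeddings.weight": (2, 768),
    "embeddings.LayerNorm.weight": (768,),
    "embeddings.LayerNorm.bias": (768,),
}

# Number of scalar parameters = product of the dimensions of each tensor
counts = {name: math.prod(shape) for name, shape in embedding_shapes.items()}
for name, count in counts.items():
    print(f"{name}: {count:,}")

total = sum(counts.values())
print(f"Embedding layer total: {total:,}")  # 23,837,184
```

With the model loaded, the count over all parameters can be obtained the same way with sum(p.numel() for p in model.parameters()).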

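### **3. Text Classification with a Pretrained Model**
You may need to install the transformers package:

pip install transformers

In this section you will build a BERT-based text classifier on the AG NEWS dataset, starting from the BERT code above. You will complete the code below and explore how different sentence aggregation methods affect the results, where sentence aggregation refers to the process of deriving a sentence embedding from the token embeddings. The methods to explore are:

1. Use the embedding of [CLS] directly as the sentence embedding.
2. Use mean pooling: average the embeddings of all tokens in the sentence.
3. Use an attention mechanism to assign a weight to each token and take the weighted sum as the sentence embedding. Any attention mechanism may be used.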
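Before applying these methods to BERT, it can help to see them on a tiny hand-made tensor. The sketch below is only an illustration under assumptions: the hidden states are made-up numbers rather than BERT outputs, and the attention scores come from a fixed, hypothetical scoring vector instead of a learned layer.

```python
import torch

# Toy "hidden states": batch of 1, 3 token positions, hidden size 2 (made-up values)
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
# The last position is padding and must not influence the sentence embedding
attention_mask = torch.tensor([[1, 1, 0]])

# 1. [CLS] pooling: take the first position
cls_emb = hidden[:, 0, :]

# 2. Mean pooling over non-padding tokens only
mask = attention_mask.unsqueeze(-1).float()  # [batch, seq_len, 1]
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# 3. Attention pooling with a fixed (hypothetical) scoring vector
score_vector = torch.tensor([0.0, 0.0])      # gives every token the same score
scores = hidden @ score_vector               # [batch, seq_len]
scores = scores.masked_fill(attention_mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=1)       # padding receives weight 0
attn_emb = (hidden * weights.unsqueeze(-1)).sum(dim=1)

print(cls_emb)   # tensor([[1., 2.]])
print(mean_emb)  # tensor([[2., 3.]])
print(weights)   # tensor([[0.5000, 0.5000, 0.0000]])
```

With equal scores, attention pooling reduces to masked mean pooling; a learned scoring layer would weight the tokens unequally.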

In [3]:
pip install transformers



In [4]:
pip install scikit-learn



Code:

In [11]:
pip install torch --upgrade


In [7]:
import torch
from transformers import AutoModel

print(torch.__version__)

2.6.0+cu124


In [None]:
import torch
import pandas as pd
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm import tqdm


# **1. Load the AG NEWS dataset**
df = pd.read_csv("./drive/MyDrive/ag/train.csv")  # replace with your own file path
df.columns = ["label", "title", "description"]  # the CSV has 3 columns: label, title, description
df["text"] = df["title"] + " " + df["description"]  # concatenate title and description as the input text
df["label"] = df["label"] - 1  # AG NEWS labels are 1-4; convert them to 0-3
train_texts, train_labels = df["text"].tolist(), df["label"].tolist()
number = int(0.3 * len(train_texts))  # keep only the first 30% of the training set
train_texts, train_labels = train_texts[: number], train_labels[: number]

df = pd.read_csv("./drive/MyDrive/ag/test.csv")  # replace with your own file path
df.columns = ["label", "title", "description"]  # the CSV has 3 columns: label, title, description
df["text"] = df["title"] + " " + df["description"]  # concatenate title and description as the input text
df["label"] = df["label"] - 1  # AG NEWS labels are 1-4; convert them to 0-3
test_texts, test_labels = df["text"].tolist(), df["label"].tolist()

# **2. Load the BERT tokenizer**
model_name = "./drive/MyDrive/bert-base-uncased/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# **3. Prepare the data**
class AGNewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=50):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text, truncation=True, padding="max_length", max_length=self.max_length, return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }  # the tokenizer also produces the attention_mask that BERT needs


train_dataset = AGNewsDataset(train_texts, train_labels, tokenizer)
test_dataset = AGNewsDataset(test_texts, test_labels, tokenizer)

train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=128, shuffle=False)

# **4. Define and load the BERT classification model**
# TODO: define the model and move it to the GPU
# (the classifier below loads BERT itself, so no separate AutoModel copy is needed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class BERTClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super(BERTClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size,num_labels)


    def forward(self, input_ids, attention_mask):
        logits = self.bert(input_ids = input_ids, attention_mask = attention_mask)
        logits = logits.last_hidden_state[:,0,:]
        logits = self.dropout(logits)
        logits = self.classifier(logits)
        return logits

model = BERTClassifier(model_name, num_labels=4).to(device)


# **5. Set up the optimizer and the loss function**
# TODO: define the optimizer and the loss function
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

# **6. Train BERT**
EPOCHS = 3

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    loop = tqdm(train_dataloader, desc=f"Epoch {epoch+1}")

    for batch in loop:
        # TODO: define the training step, based on the loss that is printed below
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        output = model(input_ids, attention_mask)
        loss = criterion(output, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss+=loss.item()

    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_dataloader):.4f}")

    # **7. Evaluate the model**
    model.eval()
    preds, true_labels = [], []
    with torch.no_grad():
        for batch in test_dataloader:
            inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
            labels = batch['labels'].to(device)
            outputs = model(**inputs)
            preds.extend(torch.argmax(outputs, dim=1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(true_labels, preds)
    print(f"Test Accuracy: {acc:.4f}")

Epoch 1: 100%|██████████| 282/282 [04:48<00:00,  1.02s/it]


Epoch 1, Loss: 0.3169
Test Accuracy: 0.9217


Epoch 2: 100%|██████████| 282/282 [04:47<00:00,  1.02s/it]


Epoch 2, Loss: 0.1687
Test Accuracy: 0.9246


Epoch 3: 100%|██████████| 282/282 [04:48<00:00,  1.02s/it]


Epoch 3, Loss: 0.1171
Test Accuracy: 0.9221


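Training note: if training feels slow, you can try a larger batch size, but take care not to run out of GPU memory.

In the experiment below, the three sentence aggregation methods reach the following final test accuracy after 3 epochs:

1. [CLS] pooling: 0.9226
2. Mean pooling: 0.9236
3. Attention pooling: 0.9137

Question 1: Of the three ways of obtaining a sentence embedding above, which do you expect to work best, and which worst? Why?

Question 2: If a document contains several sentences and we need an embedding for each of them, how should we use BERT to obtain each sentence's embedding?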

In [2]:
import torch
import pandas as pd
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tqdm import tqdm

df = pd.read_csv("./drive/MyDrive/ag/train.csv")
df.columns = ["label", "title", "description"]
df["text"] = df["title"] + " " + df["description"]
df["label"] = df["label"] - 1
train_texts, train_labels = df["text"].tolist(), df["label"].tolist()
number = int(0.3 * len(train_texts))
train_texts, train_labels = train_texts[:number], train_labels[:number]

df = pd.read_csv("./drive/MyDrive/ag/test.csv")
df.columns = ["label", "title", "description"]
df["text"] = df["title"] + " " + df["description"]
df["label"] = df["label"] - 1
test_texts, test_labels = df["text"].tolist(), df["label"].tolist()

model_name = "./drive/MyDrive/bert-base-uncased/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

class AGNewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=50):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text, truncation=True, padding="max_length",
            max_length=self.max_length, return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "labels": torch.tensor(label, dtype=torch.long),
        }

train_dataset = AGNewsDataset(train_texts, train_labels, tokenizer)
test_dataset = AGNewsDataset(test_texts, test_labels, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=128, shuffle=False)

class BERTClassifier(nn.Module):
    def __init__(self, model_name, num_labels, pool_type='cls'):
        super(BERTClassifier, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.pool_type = pool_type
        self.num_labels = num_labels

        if pool_type == 'attention':
            self.attention = nn.Sequential(
                nn.Linear(self.bert.config.hidden_size, 128),
                nn.Tanh(),
                nn.Linear(128, 1)
            )

        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask)
        last_hidden = outputs.last_hidden_state  # [batch, seq_len, hidden_size]

        # CLS pooling: use the [CLS] token embedding directly
        if self.pool_type == 'cls':
            pooled = last_hidden[:, 0, :]

        # Mean pooling: average over all non-padding tokens
        elif self.pool_type == 'mean':
            mask = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
            sum_hidden = torch.sum(last_hidden * mask, 1)
            sum_mask = torch.clamp(mask.sum(1), min=1e-9)
            pooled = sum_hidden / sum_mask

        # Attention pooling: softmax-weighted sum over non-padding tokens
        elif self.pool_type == 'attention':
            attn_weights = self.attention(last_hidden).squeeze(-1)  # [batch, seq_len]
            attn_weights = attn_weights.masked_fill(attention_mask == 0, float('-inf'))
            attn_weights = torch.softmax(attn_weights, dim=1)
            pooled = torch.sum(last_hidden * attn_weights.unsqueeze(-1), dim=1)

        else:
            raise ValueError(f"Unknown pool_type: {self.pool_type}")

        pooled = self.dropout(pooled)
        return self.classifier(pooled)

def train_and_evaluate(pool_type):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = BERTClassifier(model_name, num_labels=4, pool_type=pool_type).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=2e-5)

    EPOCHS = 3
    for epoch in range(EPOCHS):
        model.train()
        total_loss = 0
        loop = tqdm(train_dataloader, desc=f"Epoch {epoch+1} [{pool_type}]")

        for batch in loop:
            optimizer.zero_grad()
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            loop.set_postfix(loss=loss.item())

        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_dataloader):.4f}")

        model.eval()
        preds, true_labels = [], []
        with torch.no_grad():
            for batch in test_dataloader:
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].cpu().numpy()

                outputs = model(input_ids, attention_mask)
                preds.extend(torch.argmax(outputs, dim=1).cpu().numpy())
                true_labels.extend(labels)

        acc = accuracy_score(true_labels, preds)
        print(f"Test Accuracy ({pool_type}): {acc:.4f}\n")

    return acc

results = {}
for pool_type in ['cls', 'mean', 'attention']:
    print(f"\n=== Evaluating {pool_type} pooling ===")
    results[pool_type] = train_and_evaluate(pool_type)

print("\n=== Final Results ===")
for k, v in results.items():
    print(f"{k} pooling accuracy: {v:.4f}")


=== Evaluating cls pooling ===


Epoch 1 [cls]: 100%|██████████| 282/282 [04:38<00:00,  1.01it/s, loss=0.389]


Epoch 1, Loss: 0.3219
Test Accuracy (cls): 0.9149



Epoch 2 [cls]: 100%|██████████| 282/282 [04:38<00:00,  1.01it/s, loss=0.313]


Epoch 2, Loss: 0.1707
Test Accuracy (cls): 0.9221



Epoch 3 [cls]: 100%|██████████| 282/282 [04:38<00:00,  1.01it/s, loss=0.219]


Epoch 3, Loss: 0.1193
Test Accuracy (cls): 0.9226


=== Evaluating mean pooling ===


Epoch 1 [mean]: 100%|██████████| 282/282 [04:40<00:00,  1.01it/s, loss=0.356]


Epoch 1, Loss: 0.3290
Test Accuracy (mean): 0.9180



Epoch 2 [mean]: 100%|██████████| 282/282 [04:41<00:00,  1.00it/s, loss=0.408]


Epoch 2, Loss: 0.1773
Test Accuracy (mean): 0.9203



Epoch 3 [mean]: 100%|██████████| 282/282 [04:40<00:00,  1.01it/s, loss=0.236]


Epoch 3, Loss: 0.1239
Test Accuracy (mean): 0.9236


=== Evaluating attention pooling ===


Epoch 1 [attention]: 100%|██████████| 282/282 [04:41<00:00,  1.00it/s, loss=0.0723]


Epoch 1, Loss: 0.3276
Test Accuracy (attention): 0.9186



Epoch 2 [attention]: 100%|██████████| 282/282 [04:41<00:00,  1.00it/s, loss=0.0607]


Epoch 2, Loss: 0.1694
Test Accuracy (attention): 0.9243



Epoch 3 [attention]: 100%|██████████| 282/282 [04:41<00:00,  1.00it/s, loss=0.132]


Epoch 3, Loss: 0.1181
Test Accuracy (attention): 0.9137


=== Final Results ===
cls pooling accuracy: 0.9226
mean pooling accuracy: 0.9236
attention pooling accuracy: 0.9137
