# 任务二：SNLI蕴含识别

本Notebook为自然语言推理（NLI）任务，在斯坦福自然语言推理（SNLI）数据集上，实现并对比三种不同的深度学习模型。

三个模型分别是：
1.  **模型一：** GloVe嵌入 + BiLSTM
2.  **模型二：** BERT-base嵌入 + BiLSTM (作为特征提取器)
3.  **模型三：** 微调BERT-base模型

**评价指标：** 准确率（Accuracy）和宏平均F1值（Macro-F1）.

## 1. 环境设置与依赖安装

首先，我们安装必要的库。我们将需要 `torch`、用于BERT模型的 `transformers`、用于加载数据的 `pandas` 和 `pyarrow`，以及用于评估的 `scikit-learn`。

## 2. 数据加载与预处理

我们将从 `.parquet` 文件加载SNLI数据集。数据集包含句子对（前提和假设）和一个标签。

**标签说明：**
- `0`: 蕴含 (Entailment)
- `1`: 中立 (Neutral)
- `2`: 矛盾 (Contradiction)

数据集中包含一些标签为-1的样本，表示标注者无法达成一致意见。我们将过滤掉这些样本。

In [2]:
import pandas as pd

# 定义文件路径
train_file =r"D:\期末大作业\期末大作业\数据集\SNLI\test-00000-of-00001.parquet"
val_file = r"D:\期末大作业\期末大作业\数据集\SNLI\validation-00000-of-00001.parquet"
test_file = r"D:\期末大作业\期末大作业\数据集\SNLI\test-00000-of-00001.parquet"

# 加载数据集
df_train = pd.read_parquet(train_file)
df_val = pd.read_parquet(val_file)
df_test = pd.read_parquet(test_file)

# --- 数据清洗 ---
# SNLI数据集使用-1作为标注者无法达成一致的标签，我们将移除这些样本。
df_train = df_train[df_train['label'] != -1].reset_index(drop=True)
df_val = df_val[df_val['label'] != -1].reset_index(drop=True)
df_test = df_test[df_test['label'] != -1].reset_index(drop=True)

print("训练集大小:", df_train.shape)
print("验证集大小:", df_val.shape)
print("测试集大小:", df_test.shape)

print("\n数据样本示例:")
df_train.head()

训练集大小: (9824, 3)
验证集大小: (9842, 3)
测试集大小: (9824, 3)

数据样本示例:


Unnamed: 0,premise,hypothesis,label
0,This church choir sings to the masses as they ...,The church has cracks in the ceiling.,1
1,This church choir sings to the masses as they ...,The church is filled with song.,0
2,This church choir sings to the masses as they ...,A choir singing at a baseball game.,2
3,"A woman with a green headscarf, blue shirt and...",The woman is young.,1
4,"A woman with a green headscarf, blue shirt and...",The woman is very happy.,0


## 3. 模型一：GloVe + BiLSTM

该模型使用预训练的GloVe词向量作为BiLSTM网络的输入。

### 3.1. GloVe设置
使用300维的向量 (`glove.800B.300d.txt`)。


In [7]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import re
import os
# --- 参数配置 ---
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
NUM_CLASSES = 3
BATCH_SIZE = 128
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
N_LAYERS = 2
DROPOUT = 0.5
EPOCHS = 5
GLOVE_PATH = r"D:\glove_vectors\glove.840B.300d\glove.840B.300d.txt"
# 检查GloVe文件是否存在
if not os.path.exists(GLOVE_PATH):
    raise FileNotFoundError(f"GloVe文件不存在于: {GLOVE_PATH}")
else:
    print(f"找到GloVe文件: {GLOVE_PATH}")

# --- 文本预处理与词汇表构建 ---
def tokenizer(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9' ]+", "", text)
    return text.split()

print("正在构建词汇表...")
word_counts = Counter()
for text in pd.concat([df_train['premise'], df_train['hypothesis']]):
    word_counts.update(tokenizer(text))

vocab = sorted(word_counts, key=word_counts.get, reverse=True)
word_to_idx = {word: i+2 for i, word in enumerate(vocab)} # i+2 是为了给 <pad> 和 <unk> 留出位置
word_to_idx['<pad>'] = 0
word_to_idx['<unk>'] = 1
idx_to_word = {i: word for word, i in word_to_idx.items()}
VOCAB_SIZE = len(word_to_idx)

# --- GloVe 词向量矩阵 ---
print("正在加载GloVe词向量...")
glove_embeddings = np.zeros((VOCAB_SIZE, EMBEDDING_DIM))
word_found = 0
total_lines = 0
error_count = 0

with open(GLOVE_PATH, 'r', encoding='utf-8', errors='ignore') as f:
    for line_num, line in enumerate(f, 1):
        total_lines += 1
        if line_num % 100000 == 0:
            print(f"已处理 {line_num} 行...")
        
        try:
            parts = line.strip().split(' ')
            word = parts[0]
            
            if word in word_to_idx:
                # 确保有足够的值来创建向量
                if len(parts) >= EMBEDDING_DIM + 1:  # 单词 + 300维向量
                    vector = np.array(parts[1:EMBEDDING_DIM+1], dtype=np.float32)
                    if len(vector) == EMBEDDING_DIM:
                        glove_embeddings[word_to_idx[word]] = vector
                        word_found += 1
        except Exception as e:
            error_count += 1
            if error_count < 10:  # 只打印前几个错误
                print(f"第 {line_num} 行解析错误: {e}")
                print(f"问题行内容: {line[:50]}..." if len(line) > 50 else line)
            continue

print(f"GloVe文件共 {total_lines} 行，成功加载 {word_found} 个词向量 ({word_found/VOCAB_SIZE*100:.2f}% 的词汇表)")
print(f"解析过程中遇到 {error_count} 个错误行")

glove_embeddings = torch.tensor(glove_embeddings, dtype=torch.float32)

# --- PyTorch 数据集 ---
class SNLIDataset(Dataset):
    def __init__(self, dataframe, word_to_idx, max_len=50):
        self.df = dataframe
        self.word_to_idx = word_to_idx
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        premise = self.df.loc[idx, 'premise']
        hypothesis = self.df.loc[idx, 'hypothesis']
        label = self.df.loc[idx, 'label']

        premise_tokens = [self.word_to_idx.get(word, self.word_to_idx['<unk>']) for word in tokenizer(premise)]
        hypothesis_tokens = [self.word_to_idx.get(word, self.word_to_idx['<unk>']) for word in tokenizer(hypothesis)]

        # 使用简单拼接（可以改进，例如使用[SEP]标记）
        tokens = premise_tokens + hypothesis_tokens
        
        # 填充/截断
        if len(tokens) < self.max_len:
            tokens.extend([self.word_to_idx['<pad>']] * (self.max_len - len(tokens)))
        else:
            tokens = tokens[:self.max_len]
            
        return torch.tensor(tokens), torch.tensor(label)

# --- 创建数据加载器 ---
train_dataset = SNLIDataset(df_train, word_to_idx)
val_dataset = SNLIDataset(df_val, word_to_idx)
test_dataset = SNLIDataset(df_test, word_to_idx)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

print("\n数据准备完成。")

找到GloVe文件: D:\glove_vectors\glove.840B.300d\glove.840B.300d.txt
正在构建词汇表...
正在加载GloVe词向量...
已处理 100000 行...
已处理 100000 行...
已处理 200000 行...
已处理 200000 行...
已处理 300000 行...
已处理 300000 行...
已处理 400000 行...
已处理 400000 行...
已处理 500000 行...
已处理 500000 行...
已处理 600000 行...
已处理 600000 行...
已处理 700000 行...
已处理 700000 行...
已处理 800000 行...
已处理 800000 行...
已处理 900000 行...
已处理 900000 行...
已处理 1000000 行...
已处理 1000000 行...
已处理 1100000 行...
已处理 1100000 行...
已处理 1200000 行...
已处理 1200000 行...
已处理 1300000 行...
已处理 1300000 行...
已处理 1400000 行...
已处理 1400000 行...
已处理 1500000 行...
已处理 1500000 行...
已处理 1600000 行...
已处理 1600000 行...
已处理 1700000 行...
已处理 1700000 行...
已处理 1800000 行...
已处理 1800000 行...
已处理 1900000 行...
已处理 1900000 行...
已处理 2000000 行...
已处理 2000000 行...
已处理 2100000 行...
已处理 2100000 行...
GloVe文件共 2196017 行，成功加载 6371 个词向量 (96.69% 的词汇表)
解析过程中遇到 0 个错误行

数据准备完成。
GloVe文件共 2196017 行，成功加载 6371 个词向量 (96.69% 的词汇表)
解析过程中遇到 0 个错误行

数据准备完成。


### 3.2. BiLSTM 模型结构

In [8]:
class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout, pretrained_embeddings):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(pretrained_embeddings)
        self.embedding.weight.requires_grad = False # 冻结词向量，不参与训练

        self.lstm = nn.LSTM(embedding_dim, 
                              hidden_dim, 
                              num_layers=n_layers, 
                              bidirectional=True, 
                              dropout=dropout if n_layers > 1 else 0,
                              batch_first=True)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim) # *2因为是双向LSTM
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        embedded = self.dropout(self.embedding(text))
        
        # _ 是所有时间步的输出, (hidden, cell) 是最后一个时间步的隐藏状态和细胞状态
        _, (hidden, cell) = self.lstm(embedded)
        
        # 拼接前向和后向的最终隐藏状态
        # hidden 的形状是 [num_layers * num_directions, batch_size, hidden_dim]
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
            
        return self.fc(hidden)

# 实例化模型
model1 = BiLSTMClassifier(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, NUM_CLASSES, N_LAYERS, DROPOUT, glove_embeddings).to(DEVICE)
print(model1)

BiLSTMClassifier(
  (embedding): Embedding(6589, 300)
  (lstm): LSTM(300, 256, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=3, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)


### 3.3. 训练与评估循环

In [9]:
def train_model(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    
    for batch in iterator:
        text, labels = batch
        text, labels = text.to(DEVICE), labels.to(DEVICE)
        
        optimizer.zero_grad()
        predictions = model(text)
        loss = criterion(predictions, labels)
        
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        
        # 计算准确率
        acc = accuracy_score(labels.cpu(), predictions.argmax(1).cpu())
        epoch_acc += acc
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate_model(model, iterator, criterion):
    epoch_loss = 0
    all_preds = []
    all_labels = []
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            text, labels = batch
            text, labels = text.to(DEVICE), labels.to(DEVICE)
            
            predictions = model(text)
            loss = criterion(predictions, labels)
            
            epoch_loss += loss.item()
            all_preds.extend(predictions.argmax(1).cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            
    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='macro')
    return epoch_loss / len(iterator), acc, f1

# --- 训练模型一 ---
print("开始训练模型一 (GloVe + BiLSTM)...")
optimizer = optim.Adam(model1.parameters())
criterion = nn.CrossEntropyLoss().to(DEVICE)

for epoch in range(EPOCHS):
    train_loss, train_acc = train_model(model1, train_loader, optimizer, criterion)
    valid_loss, valid_acc, valid_f1 = evaluate_model(model1, val_loader, criterion)
    
    print(f'轮次: {epoch+1:02} | 训练损失: {train_loss:.3f} | 训练准确率: {train_acc*100:.2f}% | 验证损失: {valid_loss:.3f} | 验证准确率: {valid_acc*100:.2f}% | 验证F1: {valid_f1:.3f}')

开始训练模型一 (GloVe + BiLSTM)...


  from .autonotebook import tqdm as notebook_tqdm


轮次: 01 | 训练损失: 1.102 | 训练准确率: 32.68% | 验证损失: 1.101 | 验证准确率: 33.58% | 验证F1: 0.269
轮次: 02 | 训练损失: 1.100 | 训练准确率: 32.81% | 验证损失: 1.099 | 验证准确率: 33.10% | 验证F1: 0.261
轮次: 02 | 训练损失: 1.100 | 训练准确率: 32.81% | 验证损失: 1.099 | 验证准确率: 33.10% | 验证F1: 0.261
轮次: 03 | 训练损失: 1.100 | 训练准确率: 33.75% | 验证损失: 1.099 | 验证准确率: 33.82% | 验证F1: 0.169
轮次: 03 | 训练损失: 1.100 | 训练准确率: 33.75% | 验证损失: 1.099 | 验证准确率: 33.82% | 验证F1: 0.169
轮次: 04 | 训练损失: 1.099 | 训练准确率: 33.47% | 验证损失: 1.099 | 验证准确率: 33.15% | 验证F1: 0.314
轮次: 04 | 训练损失: 1.099 | 训练准确率: 33.47% | 验证损失: 1.099 | 验证准确率: 33.15% | 验证F1: 0.314
轮次: 05 | 训练损失: 1.099 | 训练准确率: 33.38% | 验证损失: 1.099 | 验证准确率: 33.11% | 验证F1: 0.247
轮次: 05 | 训练损失: 1.099 | 训练准确率: 33.38% | 验证损失: 1.099 | 验证准确率: 33.11% | 验证F1: 0.247


### 3.4. 在测试集上对模型一进行最终评估

In [10]:
test_loss, test_acc, test_f1 = evaluate_model(model1, test_loader, criterion)
print(f'模型一 测试集结果 -> 准确率: {test_acc*100:.2f}% | Macro-F1: {test_f1:.3f}')
results = {}
results['模型一 (GloVe + BiLSTM)'] = {'Accuracy': test_acc, 'Macro-F1': test_f1}

模型一 测试集结果 -> 准确率: 33.83% | Macro-F1: 0.255


## 4. 模型二：BERT嵌入 + BiLSTM

在这个模型中，我们使用预训练的BERT模型 (`bert-base-uncased`) 作为特征提取器。它的权重被冻结，输出的嵌入向量被送入一个BiLSTM网络，类似于模型一。

In [11]:
from transformers import BertTokenizer, BertModel

# --- 参数配置 ---
BERT_MODEL_NAME = 'bert-base-uncased'
MAX_LEN_BERT = 128 # BERT最大长度是512

# --- BERT 分词器 ---
tokenizer_bert = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

# --- 用于BERT的PyTorch数据集 ---
class SNLIDatasetBERT(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.df = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        premise = self.df.loc[idx, 'premise']
        hypothesis = self.df.loc[idx, 'hypothesis']
        label = self.df.loc[idx, 'label']

        encoding = self.tokenizer.encode_plus(
            premise,
            hypothesis,
            add_special_tokens=True, # 添加 '[CLS]' 和 '[SEP]' 特殊符号
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt', # 返回PyTorch张量
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# --- 创建数据加载器 ---
train_dataset_bert = SNLIDatasetBERT(df_train, tokenizer_bert, MAX_LEN_BERT)
val_dataset_bert = SNLIDatasetBERT(df_val, tokenizer_bert, MAX_LEN_BERT)
test_dataset_bert = SNLIDatasetBERT(df_test, tokenizer_bert, MAX_LEN_BERT)

train_loader_bert = DataLoader(train_dataset_bert, batch_size=32, shuffle=True) # BERT模型需要更小的批量大小
val_loader_bert = DataLoader(val_dataset_bert, batch_size=32)
test_loader_bert = DataLoader(test_dataset_bert, batch_size=32)
print("用于BERT的数据准备完成。")

用于BERT的数据准备完成。


### 4.1. BERT+BiLSTM 模型结构

In [12]:
class BertBiLSTMClassifier(nn.Module):
    def __init__(self, bert, hidden_dim, output_dim, n_layers, dropout):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size'] # BERT的嵌入维度

        self.lstm = nn.LSTM(embedding_dim,
                              hidden_dim,
                              num_layers=n_layers,
                              bidirectional=True,
                              dropout=dropout if n_layers > 1 else 0,
                              batch_first=True)

        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids, attention_mask):
        # 不计算梯度，以冻结BERT
        with torch.no_grad():
            embedded = self.bert(input_ids=input_ids, attention_mask=attention_mask)[0]
        
        _, (hidden, cell) = self.lstm(embedded)
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        
        return self.fc(hidden)

# 加载预训练的BERT模型
bert_model = BertModel.from_pretrained(BERT_MODEL_NAME)

# 冻结BERT的参数
for param in bert_model.parameters():
    param.requires_grad = False

# 实例化模型
model2 = BertBiLSTMClassifier(bert_model, HIDDEN_DIM, NUM_CLASSES, N_LAYERS, DROPOUT).to(DEVICE)
print(model2)

BertBiLSTMClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elem

### 4.2. 训练与评估（模型二）

In [13]:
def train_bert_bilstm(model, iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    for batch in iterator:
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        optimizer.zero_grad()
        predictions = model(input_ids, attention_mask)
        loss = criterion(predictions, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate_bert_bilstm(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in iterator:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['labels'].to(DEVICE)
            
            predictions = model(input_ids, attention_mask)
            loss = criterion(predictions, labels)
            epoch_loss += loss.item()
            all_preds.extend(predictions.argmax(1).cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='macro')
    return epoch_loss / len(iterator), acc, f1

# --- 训练模型二 ---
# 注意：我们只将未冻结层的参数传递给优化器
print("\n开始训练模型二 (BERT + BiLSTM)...")
optimizer = optim.Adam(model2.parameters())
criterion = nn.CrossEntropyLoss().to(DEVICE)

# 由于模型较大，减少训练轮数
for epoch in range(3):
    train_loss = train_bert_bilstm(model2, train_loader_bert, optimizer, criterion)
    valid_loss, valid_acc, valid_f1 = evaluate_bert_bilstm(model2, val_loader_bert, criterion)
    print(f'轮次: {epoch+1:02} | 训练损失: {train_loss:.3f} | 验证损失: {valid_loss:.3f} | 验证准确率: {valid_acc*100:.2f}% | 验证F1: {valid_f1:.3f}')


开始训练模型二 (BERT + BiLSTM)...


  attn_output = torch.nn.functional.scaled_dot_product_attention(


轮次: 01 | 训练损失: 1.041 | 验证损失: 0.979 | 验证准确率: 51.28% | 验证F1: 0.506
轮次: 02 | 训练损失: 0.963 | 验证损失: 0.896 | 验证准确率: 58.97% | 验证F1: 0.586
轮次: 02 | 训练损失: 0.963 | 验证损失: 0.896 | 验证准确率: 58.97% | 验证F1: 0.586
轮次: 03 | 训练损失: 0.922 | 验证损失: 0.854 | 验证准确率: 61.66% | 验证F1: 0.611
轮次: 03 | 训练损失: 0.922 | 验证损失: 0.854 | 验证准确率: 61.66% | 验证F1: 0.611


### 4.3. 在测试集上对模型二进行最终评估

In [14]:
test_loss, test_acc, test_f1 = evaluate_bert_bilstm(model2, test_loader_bert, criterion)
print(f'模型二 测试集结果 -> 准确率: {test_acc*100:.2f}% | Macro-F1: {test_f1:.3f}')
results['模型二 (BERT嵌入 + BiLSTM)'] = {'Accuracy': test_acc, 'Macro-F1': test_f1}

模型二 测试集结果 -> 准确率: 63.20% | Macro-F1: 0.626


## 5. 模型三：微调BERT

这是最常用且最强大的方法。我们采用一个带分类头的预训练BERT模型 (`BertForSequenceClassification`)，并在我们的特定任务上对整个模型进行微调。

In [None]:
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW  # 从PyTorch导入AdamW而不是transformers

# --- 加载模型 ---
model3 = BertForSequenceClassification.from_pretrained(
    BERT_MODEL_NAME,
    num_labels=NUM_CLASSES,
    output_attentions=False,
    output_hidden_states=False,
).to(DEVICE)

# --- 优化器与学习率调度器 ---
optimizer = AdamW(model3.parameters(), lr=2e-5, eps=1e-8)
EPOCHS_BERT_FINETUNE = 3
total_steps = len(train_loader_bert) * EPOCHS_BERT_FINETUNE
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0, 
                                            num_training_steps=total_steps)

criterion = nn.CrossEntropyLoss().to(DEVICE)
print("BERT微调模型已加载。")

ImportError: cannot import name 'AdamW' from 'transformers' (e:\anaconda3\envs\pytorch2.3.0\Lib\site-packages\transformers\__init__.py)

### 5.1. 训练与评估（模型三）

In [None]:
def train_bert_finetune(model, iterator, optimizer, scheduler, criterion):
    model.train()
    epoch_loss = 0
    for batch in iterator:
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        optimizer.zero_grad()
        # 直接将labels传入，模型会自动计算损失
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        epoch_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # 防止梯度爆炸
        optimizer.step()
        scheduler.step()
        
    return epoch_loss / len(iterator)

def evaluate_bert_finetune(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in iterator:
            input_ids = batch['input_ids'].to(DEVICE)
            attention_mask = batch['attention_mask'].to(DEVICE)
            labels = batch['labels'].to(DEVICE)
            
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits
            
            epoch_loss += loss.item()
            all_preds.extend(logits.argmax(1).cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='macro')
    return epoch_loss / len(iterator), acc, f1

# --- 训练模型三 ---
print("\n开始训练模型三 (微调BERT)...")
for epoch in range(EPOCHS_BERT_FINETUNE): # 微调通常需要较少的训练轮数
    train_loss = train_bert_finetune(model3, train_loader_bert, optimizer, scheduler, criterion)
    valid_loss, valid_acc, valid_f1 = evaluate_bert_finetune(model3, val_loader_bert, criterion)
    print(f'轮次: {epoch+1:02} | 训练损失: {train_loss:.3f} | 验证损失: {valid_loss:.3f} | 验证准确率: {valid_acc*100:.2f}% | 验证F1: {valid_f1:.3f}')


### 5.2. 在测试集上对模型三进行最终评估

In [None]:
test_loss, test_acc, test_f1 = evaluate_bert_finetune(model3, test_loader_bert, criterion)
print(f'模型三 测试集结果 -> 准确率: {test_acc*100:.2f}% | Macro-F1: {test_f1:.3f}')
results['模型三 (微调BERT)'] = {'Accuracy': test_acc, 'Macro-F1': test_f1}

## 6. 总结与性能对比

In [None]:
df_results = pd.DataFrame(results).T
df_results['Accuracy'] = df_results['Accuracy'].apply(lambda x: f"{x*100:.2f}%")
df_results['Macro-F1'] = df_results['Macro-F1'].apply(lambda x: f"{x:.4f}")

print("--- SNLI测试集最终性能对比 ---")
print(df_results)

### 实验分析

1.  **模型一 (GloVe + BiLSTM):** 该模型是一个强大的基线模型。GloVe词向量能够捕捉单词间的语义关系，而BiLSTM则从序列中学习上下文信息。然而，它的理解能力是有限的，因为GloVe词向量是静态的——它们不会根据句子的具体上下文而改变。

2.  **模型二 (BERT嵌入 + BiLSTM):** 该模型在模型一的基础上进行了改进，使用了来自BERT的上下文相关词向量。对于同一个词，BERT会根据其周围的词生成不同的向量。将BERT用作固定的特征提取器比微调整个模型的计算成本要低。顶部的BiLSTM有助于聚合这些上下文特征以用于最终的分类决策。我们通常会看到比GloVe有性能上的提升。

3.  **模型三 (微调BERT):** 这通常是性能最好的模型。通过微调整个BERT模型，我们使其内部的权重（特别是注意力机制）能够专门适应SNLI任务的细微差别。模型不仅学习如何表示词语，而且学习如何直接执行推理任务。`[CLS]`标记的最终隐藏状态被设计为整个输入序列的表示，使其非常适合分类任务。这种端到端的训练方式几乎总能在像NLI这样的句子对分类任务上产生最佳结果。