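# IMDB-10 Sentiment Classification: Task 1 Experiment Report

#### Dataset
This experiment uses three user-provided files as the dataset:
- imdb.train.txt.ss
- imdb.dev.txt.ss
- imdb.test.txt.ss

Each file holds one sample per line. A common layout for such files is `<label>\t<text>` or `<text>\t<label>`; the files used here instead separate a user ID, a movie ID, a 1-10 rating, and the review text with double tabs (see the file preview in Section 1). Adjust the parsing code if your files differ.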

## 1. Data Loading and Preprocessing

In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch import nn
from torchtext.vocab import GloVe
from torchtext.data.utils import get_tokenizer
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from torch.optim import AdamW  # AdamW from PyTorch
import numpy as np
from tqdm import tqdm
import os
import random

# Fix the random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
set_seed(42)



In [3]:
def peek_file(filepath, n=5):
    """Print the first n lines of a file."""
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            print(f"First {n} lines of {filepath}:")
            for i, line in enumerate(f):
                if i >= n:
                    break
                # repr() exposes every character, including tabs and newlines
                print(f"Line {i+1}: {repr(line)}")
                # Also print the plain form for easier reading
                print(f"Line {i+1} (plain): {line}")
        return True
    except Exception as e:
        print(f"Error while reading file: {str(e)}")
        return False

# Use an absolute path
train_path = r"D:\期末大作业\期末大作业\数据集\imdb\imdb.train.txt.ss"
peek_file(train_path)

First 5 lines of D:\期末大作业\期末大作业\数据集\imdb\imdb.train.txt.ss:
Line 1: 'ur2480402/\t\t\\tt0119485\t\t10\t\ti excepted a lot from this movie , and it did deliver . <sssss> there is some great buddhist wisdom in this movie . <sssss> the real dalai lama is a very interesting person , and i think there is a lot of wisdom in buddhism . <sssss> the music , of course , sounds like because it is by philip glass . <sssss> this adds to the beauty of the movie . <sssss> whereas other biographies of famous people tend to get very poor this movie always stays focused and gives a good and honest portrayal of the dalai lama . <sssss> all things being equal , it is a great movie , and i really enjoyed it . <sssss> it is not like taxi driver of course but as a biography of a famous person it is really a great film indeed . \n'
Line 1 (plain): ur2480402/		\tt0119485		10		i excepted a lot from this movie , and it did deliver . <sssss> there is some great buddhist wisdom in this movie . <sssss> the real dalai lama is a ve

True

### 1.1 Reading the txt Dataset

In [4]:
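import re

def load_imdb_file(path):
    data = []
    with open(path, encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            try:
                # Find the rating: digits surrounded by double tabs
                match = re.search(r'\t\t(\d+)\t\t', line)

                if not match:
                    print(f"Line {line_num}: rating pattern not found: {line[:50]}...")
                    continue

                # Extract the rating
                rating = match.group(1)

                # The review text follows the rating
                review_start = match.end()
                review = line[review_start:].strip()

                # Convert the rating to an integer label.
                # NOTE: only a rating of 10 is shifted (to 9); ratings 1-9 keep
                # their value, so ratings 9 and 10 share label 9 and label 0 is
                # never produced. `rating_int - 1` would give a one-to-one
                # mapping onto 0-9.
                rating_int = int(rating)
                label = rating_int - 1 if rating_int == 10 else rating_int

                data.append((review, label))

                # Print the first few parsed samples (for debugging)
                if line_num <= 3:
                    print(f"Parsed: rating={rating}, label={label}, review={review[:50]}...")

            except Exception as e:
                print(f"Line {line_num}: processing error: {str(e)}")

    print(f"Loaded {len(data)} samples")
    return data

# Adjust these paths to where the files actually live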
train_path = r"D:\期末大作业\期末大作业\数据集\imdb\imdb.train.txt.ss"
dev_path = r"D:\期末大作业\期末大作业\数据集\imdb\imdb.dev.txt.ss"
test_path = r"D:\期末大作业\期末大作业\数据集\imdb\imdb.test.txt.ss"

train_data = load_imdb_file(train_path)
dev_data = load_imdb_file(dev_path)
test_data = load_imdb_file(test_path)

print('train:', train_data[0])

Parsed: rating=10, label=9, review=i excepted a lot from this movie , and it did deli...
Parsed: rating=1, label=1, review=this movie is not worth seeing . <sssss> has no me...
Parsed: rating=10, label=9, review=this is a truly remarkable horror movie . <sssss> ...
Loaded 67426 samples
Parsed: rating=7, label=7, review=i was born in 1976 , so i sort of grew up with wil...
Parsed: rating=4, label=4, review=this movie has good performances by sylvester stal...
Parsed: rating=4, label=4, review=this is a typical horror movie of that period . <s...
Loaded 8381 samples
Parsed: rating=10, label=9, review=this is a stunningly beautiful movie . <sssss> the...
Parsed: rating=10, label=9, review=this is quite possible one of the best movies made...
Parsed: rating=10, label=9, review=i was astonished to see the relatively low rating ...
Loaded 9112 samples
train: ('i excepted a lot from this movie , and it did deliver . <sssss> there is some great buddhist wisdom in this movie . <sssss> the real dalai lama is a very interesting person , and i think there is a lot of wisdom in buddhism . <sssss> the music , of course , sounds lik
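
The file preview in Section 1 suggests that each line carries four fields separated by double tabs (user ID, movie ID, rating, review text), with `<sssss>` marking sentence boundaries. A minimal sketch of a split-based parser under that assumption follows. The function name `parse_ss_line` and the one-to-one `rating - 1` label mapping are illustrative choices, not the loader that produced the results in this report.

```python
def parse_ss_line(line):
    """Parse one line of an assumed `user<TAB><TAB>movie<TAB><TAB>rating<TAB><TAB>text` file.

    Returns (sentences, label), or None when the line does not match.
    """
    fields = line.rstrip("\n").split("\t\t", 3)
    if len(fields) != 4 or not fields[2].strip().isdigit():
        return None
    rating = int(fields[2])
    if not 1 <= rating <= 10:
        return None
    # Hypothetical mapping: ratings 1-10 onto labels 0-9, one label per rating
    label = rating - 1
    # `<sssss>` separates sentences inside the review text
    sentences = [s.strip() for s in fields[3].split("<sssss>") if s.strip()]
    return sentences, label


sample = "ur1/\t\ttt0000001\t\t10\t\tgreat movie . <sssss> i enjoyed it ."
print(parse_ss_line(sample))
```

Keeping the sentence list, rather than one flat string, leaves room for sentence-level models later; joining the sentences with a space recovers plain text.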

### 1.2 Tokenization, Vocabulary Construction, and Padding

In [5]:
tokenizer = get_tokenizer('basic_english')
max_len = 256

def tokenize_and_cut(text):
    tokens = tokenizer(text)
    return tokens[:max_len]

# Build the vocabulary (from all train+dev data)
all_tokens = set()
for text, _ in train_data + dev_data:
    all_tokens.update(tokenize_and_cut(text))
vocab = {w: i+2 for i, w in enumerate(sorted(all_tokens))}
vocab['<pad>'] = 0
vocab['<unk>'] = 1

def encode(text):
    tokens = tokenize_and_cut(text)
    ids = [vocab.get(t, 1) for t in tokens]
    if len(ids) < max_len:
        ids += [0] * (max_len - len(ids))
    else:
        ids = ids[:max_len]
    return ids
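
The vocabulary above keeps every token seen in train+dev. A common variation drops rare tokens with a frequency cutoff and returns the unpadded length alongside the ids, so later code can tell real tokens from padding. The sketch below illustrates that idea; `build_vocab`, `encode_with_length`, and the `min_freq` parameter are hypothetical names, and a plain whitespace split stands in for the `basic_english` tokenizer.

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # same reserved ids as in the cell above


def build_vocab(texts, min_freq=2):
    """Map tokens seen at least `min_freq` times to ids starting at 2."""
    counts = Counter(tok for text in texts for tok in text.split())
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    vocab.update({tok: i + 2 for i, tok in enumerate(kept)})
    return vocab


def encode_with_length(text, vocab, max_len=256):
    """Return (ids padded to max_len, number of real tokens)."""
    ids = [vocab.get(tok, UNK_ID) for tok in text.split()[:max_len]]
    length = len(ids)
    return ids + [PAD_ID] * (max_len - length), length


toy_vocab = build_vocab(["a good movie", "a bad movie"], min_freq=2)
print(toy_vocab)
print(encode_with_length("a good movie", toy_vocab, max_len=5))
```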

## 2. Model 1: GloVe (300d) + BiLSTM + Fully Connected Layer

In [6]:
import os
from torchtext.vocab import GloVe  # make sure GloVe is imported

# Actual path of your GloVe file
glove_file_path = r"D:\glove_vectors\glove.840B.300d\glove.840B.300d.txt"

# Standard cache directory
cache_dir = os.path.expanduser("~/.vector_cache")
os.makedirs(cache_dir, exist_ok=True)

# Check whether the GloVe file exists
print(f"Checking GloVe file: {glove_file_path}")
if os.path.exists(glove_file_path):
    print("GloVe file found!")

    # Make sure the cache directory holds a correctly named file
    cache_file = os.path.join(cache_dir, "glove.840B.300d.txt")
    if not os.path.exists(cache_file):
        print("Creating a symlink to the GloVe file in the cache directory...")
        try:
            # Creating a symlink on Windows may require administrator rights
            if os.name == 'nt':  # Windows
                try:
                    os.symlink(glove_file_path, cache_file)
                    print("Symlink created")
                except OSError:
                    # No copy is made; the vectors are loaded from the original
                    # directory through the `cache` argument below
                    print("Could not create a symlink; loading from the original directory instead")
            else:  # Unix/Linux
                os.symlink(glove_file_path, cache_file)
        except Exception as e:
            print(f"Error while creating the link: {str(e)}")
else:
    print(f"Warning: GloVe file not found: {glove_file_path}")
    print("An automatic download will be attempted, which may be slow")

# Set TORCH_HOME and keep the GloVe directory, which is passed to torchtext via `cache` below
os.environ['TORCH_HOME'] = os.path.dirname(os.path.dirname(glove_file_path))
vectors_cache = os.path.dirname(glove_file_path)

# Try to load the GloVe vectors
print("Loading GloVe vectors...")
try:
    # Use your own vector directory
    glove = GloVe(name='840B', dim=300, cache=vectors_cache)
    print("GloVe vectors loaded!")
except Exception as e:
    print(f"Failed to load GloVe: {str(e)}")
    print("Falling back to randomly initialized word vectors")
    # Stand-in GloVe object exposing only the attributes used below
    class DummyGlove:
        def __init__(self):
            self.stoi = {}  # empty dict, so every word takes the random branch
        def __getitem__(self, word):
            return torch.zeros(300)  # zero vector
    glove = DummyGlove()

# Rows come from GloVe where available, otherwise from a normal distribution
embedding_matrix = np.zeros((len(vocab), 300))
for word, idx in vocab.items():
    if word in glove.stoi:
        embedding_matrix[idx] = glove[word].numpy()
    else:
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=(300, ))

class IMDBDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        text, label = self.data[idx]
        return torch.tensor(encode(text)), torch.tensor(label)

train_dataset = IMDBDataset(train_data)
dev_dataset = IMDBDataset(dev_data)
test_dataset = IMDBDataset(test_data)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=128)
test_loader = DataLoader(test_dataset, batch_size=128)

Checking GloVe file: D:\glove_vectors\glove.840B.300d\glove.840B.300d.txt
GloVe file found!
Loading GloVe vectors...
GloVe vectors loaded!


In [7]:
class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_dim=128, num_classes=10):
        super().__init__()
        num_embeddings, emb_dim = embedding_matrix.shape
        self.embedding = nn.Embedding(num_embeddings, emb_dim)
        self.embedding.weight.data.copy_(torch.tensor(embedding_matrix, dtype=torch.float))
        self.embedding.weight.requires_grad = False
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)
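    def forward(self, x):
        x = self.embedding(x)
        out, _ = self.lstm(x)
        # Take the output at the last time step. Inputs are right-padded to
        # max_len, so for shorter reviews this position is padding, and the
        # backward direction has processed only one token there.
        out = out[:, -1, :]
        logits = self.fc(out)
        return logits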

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Check and print device information
print("Device:", device)
print("Using GPU:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU name:", torch.cuda.get_device_name(0))
    print("Current GPU memory usage:", torch.cuda.memory_allocated(0)/1024**2, "MB")

model1 = BiLSTMClassifier(embedding_matrix, num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model1.parameters(), lr=1e-3)

Device: cuda
Using GPU: True
GPU name: NVIDIA GeForce RTX 4060 Laptop GPU
Current GPU memory usage: 0.0 MB
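
Because inputs are right-padded, `out[:, -1, :]` in `BiLSTMClassifier` reads a padding position for any review shorter than `max_len`. One common alternative packs each sequence by its true length and concatenates the final hidden states of both directions. The sketch below illustrates that approach; the class name `PackedBiLSTMClassifier` is hypothetical, it assumes pad id 0 appears only as right-padding, and it is not the model behind the results reported here.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence


class PackedBiLSTMClassifier(nn.Module):
    """BiLSTM classifier that skips right-padding (pad id assumed to be 0)."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, num_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # True lengths: number of non-pad ids, at least 1 so packing stays valid
        lengths = (x != 0).sum(dim=1).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            self.embedding(x), lengths, batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-2] is the final forward state, h_n[-1] the final backward state
        features = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(features)


torch.manual_seed(0)
toy_model = PackedBiLSTMClassifier(vocab_size=50).eval()
short = torch.tensor([[5, 6, 7, 0, 0]])
longer_pad = torch.tensor([[5, 6, 7, 0, 0, 0, 0, 0]])
with torch.no_grad():
    print(toy_model(short).shape)
    # The amount of padding should not change the logits
    print(torch.allclose(toy_model(short), toy_model(longer_pad), atol=1e-6))
```

To reuse the pretrained vectors, the embedding weights could be copied from `embedding_matrix` in the same way as in `BiLSTMClassifier`.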


In [8]:
def train_epoch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0
    processed_batches = 0
    print(f"Starting training, {len(loader)} batches in total...")

    for i, (x, y) in enumerate(tqdm(loader)):
        if i == 0:
            print(f"First batch shape: x={x.shape}, y={y.shape}")

        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        # Weight the batch loss by batch size so the epoch average is per sample
        total_loss += loss.item()*x.size(0)
        processed_batches += 1

        # Print progress every 100 batches
        if (i+1) % 100 == 0:
            print(f"Processed {i+1}/{len(loader)} batches...")

    print(f"Training finished, {processed_batches} batches processed")
    return total_loss/len(loader.dataset)

def evaluate(model, loader):
    model.eval()
    correct = 0
    total = 0
    print(f"Starting evaluation, {len(loader)} batches in total...")

    with torch.no_grad():
        for i, (x, y) in enumerate(loader):
            x, y = x.to(device), y.to(device)
            logits = model(x)
            pred = logits.argmax(dim=1)
            correct += (pred==y).sum().item()
            total += y.size(0)

            # Print progress every 50 batches
            if (i+1) % 50 == 0:
                print(f"Evaluated {i+1}/{len(loader)} batches...")

    acc = correct/total
    print(f"Evaluation finished, accuracy: {acc:.4f}")
    return acc
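
Accuracy treats every wrong rating the same, whether it is off by one or by eight. Since the labels are ordered ratings, it can help to also report how far off the predictions are. The sketch below shows one way to do that on plain Python lists; the helper name `rating_metrics` is hypothetical and the numbers in the example are made up for illustration.

```python
import math


def rating_metrics(preds, labels):
    """Accuracy, mean absolute error, and RMSE over integer rating labels."""
    if len(preds) != len(labels) or not preds:
        raise ValueError("preds and labels must be non-empty and equally long")
    diffs = [p - y for p, y in zip(preds, labels)]
    n = len(diffs)
    return {
        "accuracy": sum(d == 0 for d in diffs) / n,
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
    }


# Illustrative predictions and labels, not taken from the experiment
print(rating_metrics([9, 7, 3, 0], [9, 8, 1, 0]))
```

Inside `evaluate`, the inputs could be collected per batch with `pred.tolist()` and `y.tolist()`.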

In [18]:
# Inspect the dataset and model
print(f"Training set size: {len(train_dataset)}")
print(f"Batch size: {train_loader.batch_size}")
print(f"Total batches: {len(train_loader)}")
print(f"Model parameters: {sum(p.numel() for p in model1.parameters())}")

# Training
EPOCHS = 2
print("Starting training of Model 1...")
for epoch in range(EPOCHS):
    print(f"Starting Epoch {epoch+1}/{EPOCHS}...")
    train_loss = train_epoch(model1, train_loader, criterion, optimizer)
    print(f"Training done, loss={train_loss:.4f}")

    print("Starting validation...")
    dev_acc = evaluate(model1, dev_loader)
    print(f"Validation done, acc={dev_acc:.4f}")

    print(f"[Model1][Epoch {epoch+1}/{EPOCHS}] train_loss={train_loss:.4f}, dev_acc={dev_acc:.4f}")

    # Force a flush of the output
    import sys
    sys.stdout.flush()

Training set size: 67426
Batch size: 32
Total batches: 2108
Model parameters: 24341490
Starting training of Model 1...
Starting Epoch 1/2...
Starting training, 2108 batches in total...
First batch shape: x=torch.Size([32, 256]), y=torch.Size([32])
Processed 100/2108 batches...
Processed 200/2108 batches...
Processed 300/2108 batches...
Processed 400/2108 batches...
Processed 500/2108 batches...
Processed 600/2108 batches...
Processed 700/2108 batches...
Processed 800/2108 batches...
Processed 900/2108 batches...
Processed 1000/2108 batches...
Processed 1100/2108 batches...
Processed 1200/2108 batches...
Processed 1300/2108 batches...
Processed 1400/2108 batches...
Processed 1500/2108 batches...
Processed 1600/2108 batches...
Processed 1700/2108 batches...
Processed 1800/2108 batches...
Processed 1900/2108 batches...
Processed 2000/2108 batches...
Processed 2100/2108 batches...
100%|██████████| 2108/2108 [00:24<00:00, 86.22it/s]
Training finished, 2108 batches processed
Training done, loss=1.4709
Starting validation...
Starting evaluation, 66 batches in total...
Evaluated 50/66 batches...
Evaluation finished, accuracy: 0.3830
Validation done, acc=0.3830
[Model1][Epoch 1/2] train_loss=1.4709, dev_acc=0.3830
Starting Epoch 2/2...
Starting training, 2108 batches in total...


First batch shape: x=torch.Size([32, 256]), y=torch.Size([32])
Processed 100/2108 batches...
Processed 200/2108 batches...
Processed 300/2108 batches...
Processed 400/2108 batches...
Processed 500/2108 batches...
Processed 600/2108 batches...
Processed 700/2108 batches...
Processed 800/2108 batches...
Processed 900/2108 batches...
Processed 1000/2108 batches...
Processed 1100/2108 batches...
Processed 1200/2108 batches...
Processed 1300/2108 batches...
Processed 1400/2108 batches...
Processed 1500/2108 batches...
Processed 1600/2108 batches...
Processed 1700/2108 batches...
Processed 1800/2108 batches...
Processed 1900/2108 batches...
Processed 2000/2108 batches...
Processed 2100/2108 batches...
100%|██████████| 2108/2108 [00:24<00:00, 86.89it/s]
Training finished, 2108 batches processed
Training done, loss=1.4022
Starting validation...
Starting evaluation, 66 batches in total...
Evaluated 50/66 batches...
Evaluation finished, accuracy: 0.3890
Validation done, acc=0.3890
[Model1][Epoch 2/2] train_loss=1.4022, dev_acc=0.3890


## 3. Model 2: BERT-base Embeddings + BiLSTM + Classification Head

In [10]:
# Assumes the model files have been downloaded to the directory below
local_model_path = "D:/models/bert-base-uncased"  # change to your actual path

# Check that the BERT model files exist
required_files = ['config.json', 'pytorch_model.bin', 'vocab.txt', 'tokenizer_config.json']
missing_files = [f for f in required_files if not os.path.exists(os.path.join(local_model_path, f))]
if missing_files:
    print(f"Warning: BERT model folder {local_model_path} is missing the following files:")
    for f in missing_files:
        print(f"  - {f}")
    print("Loading from the local path may fail; download the files from Hugging Face: https://huggingface.co/bert-base-uncased/tree/main")

# Load the tokenizer and model from the local path
bert_tokenizer = BertTokenizer.from_pretrained(local_model_path)
bert_model = BertModel.from_pretrained(local_model_path).to(device)
bert_model.eval()

class BERTLSTMDataset(Dataset):
    def __init__(self, data, max_len=128):
        self.data = data
        self.max_len = max_len
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        text, label = self.data[idx]
        encoding = bert_tokenizer(
            text,
            max_length=self.max_len,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        input_ids = encoding['input_ids'].squeeze(0)
        attention_mask = encoding['attention_mask'].squeeze(0)
        return input_ids, attention_mask, torch.tensor(label)

def collate_bert(batch):
    input_ids = torch.stack([b[0] for b in batch])
    attention_mask = torch.stack([b[1] for b in batch])
    labels = torch.stack([b[2] for b in batch])
    return input_ids, attention_mask, labels

# Build the datasets and data loaders
train_bert_dataset = BERTLSTMDataset(train_data)
dev_bert_dataset = BERTLSTMDataset(dev_data)
test_bert_dataset = BERTLSTMDataset(test_data)

train_bert_loader = DataLoader(train_bert_dataset, batch_size=16, shuffle=True, collate_fn=collate_bert)
dev_bert_loader = DataLoader(dev_bert_dataset, batch_size=64, collate_fn=collate_bert)
test_bert_loader = DataLoader(test_bert_dataset, batch_size=64, collate_fn=collate_bert)

In [11]:
class BERTBiLSTMClassifier(nn.Module):
    def __init__(self, bert_model, hidden_dim=128, num_classes=10):
        super().__init__()
        self.bert = bert_model
        self.lstm = nn.LSTM(768, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, num_classes)
        for param in self.bert.parameters():
            param.requires_grad = False
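    def forward(self, input_ids, attention_mask):
        # BERT is frozen, so its forward pass runs without gradient tracking
        with torch.no_grad():
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            x = outputs.last_hidden_state
        out, _ = self.lstm(x)
        # Take the last position; with padding='max_length' this is a [PAD]
        # position for reviews shorter than max_len
        out = out[:, -1, :]
        logits = self.fc(out)
        return logits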

model2 = BERTBiLSTMClassifier(bert_model, num_classes=10).to(device)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
criterion2 = nn.CrossEntropyLoss()
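
In `BERTBiLSTMClassifier.forward`, `out[:, -1, :]` reads the last position regardless of `attention_mask`. One alternative is to average the LSTM outputs over real tokens only. The helper below is a minimal sketch of that masked mean; the name `masked_mean` is hypothetical, and the backward LSTM direction would still pass over padding first, so this is only a partial remedy.

```python
import torch


def masked_mean(hidden, attention_mask):
    """Average `hidden` (B, T, H) over positions where `attention_mask` (B, T) is 1."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Toy input: the third position is padding and must not affect the mean
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
print(masked_mean(hidden, mask))
```

In the model, this would replace `out = out[:, -1, :]` with `out = masked_mean(out, attention_mask)`; the output dimension stays `hidden_dim*2`.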

In [12]:
def train_epoch_bert(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0
    for input_ids, attn_mask, y in tqdm(loader):
        input_ids, attn_mask, y = input_ids.to(device), attn_mask.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(input_ids, attn_mask)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()*input_ids.size(0)
    return total_loss/len(loader.dataset)

def evaluate_bert(model, loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for input_ids, attn_mask, y in loader:
            input_ids, attn_mask, y = input_ids.to(device), attn_mask.to(device), y.to(device)
            logits = model(input_ids, attn_mask)
            pred = logits.argmax(dim=1)
            correct += (pred==y).sum().item()
            total += y.size(0)
    return correct/total

In [13]:
# Training
EPOCHS = 2
for epoch in range(EPOCHS):
    train_loss = train_epoch_bert(model2, train_bert_loader, criterion2, optimizer2)
    dev_acc = evaluate_bert(model2, dev_bert_loader)
    print(f"[Model2][Epoch {epoch}] train_loss={train_loss:.4f}, dev_acc={dev_acc:.4f}")

100%|██████████| 4215/4215 [08:44<00:00,  8.04it/s]



[Model2][Epoch 0] train_loss=1.7394, dev_acc=0.3261


100%|██████████| 4215/4215 [08:41<00:00,  8.08it/s]



[Model2][Epoch 1] train_loss=1.6429, dev_acc=0.3503


## 4. Model 3: BERT-base Fine-Tuning + [CLS] + Classifier

In [14]:
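# Use the local model path instead of 'bert-base-uncased'
local_model_path = "D:/models/bert-base-uncased"  # change to your actual path

# Reuse the file check from the Model 2 cell (`missing_files`)
if missing_files:
    print("Some BERT files are missing locally; loading the classification model may fail")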

model3 = BertForSequenceClassification.from_pretrained(local_model_path, num_labels=10).to(device)
optimizer3 = AdamW(model3.parameters(), lr=2e-5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at D:/models/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
def train_epoch_model3(model, loader, optimizer):
    model.train()
    total_loss = 0
    for input_ids, attn_mask, y in tqdm(loader):
        input_ids, attn_mask, y = input_ids.to(device), attn_mask.to(device), y.to(device)
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attn_mask, labels=y)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()*input_ids.size(0)
    return total_loss/len(loader.dataset)

def evaluate_model3(model, loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for input_ids, attn_mask, y in loader:
            input_ids, attn_mask, y = input_ids.to(device), attn_mask.to(device), y.to(device)
            outputs = model(input_ids=input_ids, attention_mask=attn_mask)
            logits = outputs.logits
            pred = logits.argmax(dim=1)
            correct += (pred==y).sum().item()
            total += y.size(0)
    return correct/total

In [16]:
EPOCHS = 2
for epoch in range(EPOCHS):
    train_loss = train_epoch_model3(model3, train_bert_loader, optimizer3)
    dev_acc = evaluate_model3(model3, dev_bert_loader)
    print(f"[Model3][Epoch {epoch}] train_loss={train_loss:.4f}, dev_acc={dev_acc:.4f}")

100%|██████████| 4215/4215 [17:28<00:00,  4.02it/s]



[Model3][Epoch 0] train_loss=1.6351, dev_acc=0.3854


100%|██████████| 4215/4215 [17:28<00:00,  4.02it/s]



[Model3][Epoch 1] train_loss=1.4364, dev_acc=0.3896
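
The fine-tuning run above keeps the learning rate fixed at 2e-5. BERT fine-tuning is often paired with a linear warmup followed by a linear decay, which was not used here. The sketch below computes such a multiplier in plain Python; the function name `linear_warmup_decay` and the 10% warmup share are assumptions, while the step count comes from the 4215 batches per epoch shown in the log.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: ramps 0 -> 1 during warmup, then decays 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 2 * 4215             # 2 epochs x 4215 batches, as in the log above
warmup_steps = total_steps // 10   # assumed 10% warmup
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, round(linear_warmup_decay(step, warmup_steps, total_steps), 4))
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, with `scheduler.step()` called after each `optimizer.step()`.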


## 5. Test Set Evaluation and Comparison of Results

In [19]:
test_acc1 = evaluate(model1, test_loader)
test_acc2 = evaluate_bert(model2, test_bert_loader)
test_acc3 = evaluate_model3(model3, test_bert_loader)
print(f"Test accuracy: Model1={test_acc1:.4f}, Model2={test_acc2:.4f}, Model3={test_acc3:.4f}")

# Show device usage
print("\nDevice usage:")
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory usage: {torch.cuda.memory_allocated(0)/1024**2:.2f} MB / {torch.cuda.get_device_properties(0).total_memory/1024**2:.2f} MB")
else:
    print("Training on CPU - model training will be slower")

Starting evaluation, 72 batches in total...
Evaluated 50/72 batches...
Evaluation finished, accuracy: 0.3917
Test accuracy: Model1=0.3917, Model2=0.3509, Model3=0.3808

Device usage:
Device: cuda
GPU name: NVIDIA GeForce RTX 4060 Laptop GPU
GPU memory usage: 2228.89 MB / 8187.50 MB
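
To put the three models side by side, the accuracies printed in this report can be collected into one table. The sketch below does that with values copied from the logs (final-epoch dev accuracy and test accuracy); the helper name `to_markdown` and the short model descriptions are illustrative.

```python
# Final-epoch dev accuracy and test accuracy copied from the logs in this report
results = {
    "Model 1: GloVe + BiLSTM": {"dev": 0.3890, "test": 0.3917},
    "Model 2: frozen BERT + BiLSTM": {"dev": 0.3503, "test": 0.3509},
    "Model 3: fine-tuned BERT": {"dev": 0.3896, "test": 0.3808},
}


def to_markdown(rows):
    """Render the results as a Markdown table, one row per model."""
    lines = ["| Model | Dev acc | Test acc |", "| --- | --- | --- |"]
    for name, acc in rows.items():
        lines.append(f"| {name} | {acc['dev']:.4f} | {acc['test']:.4f} |")
    return "\n".join(lines)


print(to_markdown(results))
best_on_test = max(results, key=lambda name: results[name]["test"])
print("Highest test accuracy:", best_on_test)
```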
