# Spam Classification with Hugging Face Transformers

This Jupyter notebook fine-tunes a pretrained Transformer model (`distilbert-base-uncased`) to classify emails as spam or not spam.

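**Workflow:**
1.  **Load data**: Load the data from `train.csv` (labeled) and `test.csv` (unlabeled).
2.  **Prepare data**: Hold out a validation set from the training data for model evaluation.
3.  **Model and tokenizer**: Load the pretrained Hugging Face model and tokenizer.
4.  **Preprocess**: Tokenize and encode the text.
5.  **Train**: Fine-tune the model with the `Trainer` API.
6.  **Predict**: Use the fine-tuned model to predict labels for the unlabeled test set.
7.  **Create the submission file**: Write the predictions to `submission.csv`.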

In [None]:
!unzip team2.zip

unzip:  cannot find or open team2.zip, team2.zip.zip or team2.zip.ZIP.


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Step 0: Environment Setup

Install the required libraries.

In [None]:
!pip install transformers datasets scikit-learn pandas torch numpy

### Import the required libraries

In [None]:
!pip install --upgrade transformers

In [1]:
import pandas as pd
import numpy as np
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

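### Step 1: Load and Prepare the Data

Load the training and test datasets. Part of the labeled training data is held out as a validation set, so that model performance can be measured during training on data the model was not trained on.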

In [2]:
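# Load the labeled training data (assumed to contain 50,000 rows)
print("Loading labeled training data...")
full_train_df = pd.read_csv('/content/drive/MyDrive/team-2-垃圾邮件分类/train.csv') # assumed to contain 'text' and 'label' columns

# Load the unlabeled test data
print("Loading unlabeled test data...")
test_df = pd.read_csv('/content/drive/MyDrive/team-2-垃圾邮件分类/test.csv') # contains 'id' and 'text' columns

# Split the original training data into training and validation sets
print("Splitting training data into training and validation sets...")
train_df, eval_df = train_test_split(
    full_train_df,
    test_size=0.1,  # hold out 10% as the validation set
    random_state=42,
    stratify=full_train_df['label']
)

# Convert the pandas DataFrames to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)
test_dataset = Dataset.from_pandas(test_df)

print("\nData loading and splitting complete:")
print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(eval_dataset)}")
print(f"Test set size: {len(test_dataset)}")

Loading labeled training data...
Loading unlabeled test data...
Splitting training data into training and validation sets...

Data loading and splitting complete:
Training set size: 45000
Validation set size: 5000
Test set size: 10000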
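
The `stratify` argument keeps the label ratio the same in both splits. A minimal sketch on a made-up toy DataFrame (the 80/20 label ratio and the text values are illustrative assumptions, not the real data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 rows with label 0 and 20 rows with label 1; values are hypothetical.
toy_df = pd.DataFrame({
    "text": [f"email {i}" for i in range(100)],
    "label": [0] * 80 + [1] * 20,
})

toy_train, toy_eval = train_test_split(
    toy_df,
    test_size=0.1,
    random_state=42,
    stratify=toy_df["label"],  # preserve the label ratio in each split
)

# Fraction of label-1 rows in each split; stratification keeps both at the source ratio.
train_ratio = toy_train["label"].mean()
eval_ratio = toy_eval["label"].mean()
print(len(toy_train), len(toy_eval), train_ratio, eval_ratio)
```

The split also keeps each row's original index. `Dataset.from_pandas` stores that index as an extra `__index_level_0__` column, which is why Step 3 removes it from the training and validation sets.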


### Step 2: Load the Pretrained Model and Tokenizer

This step uses `distilbert-base-uncased`, loaded from a local directory.

In [3]:
# --- 1. Define the local model path ---
# Replace this with the folder where your model files are actually stored
local_model_path = "/content/drive/MyDrive/distilbert-base-uncased"
# For example:
# local_model_path = "D:/HF_Models/distilbert-base-uncased"  # Windows example
# local_model_path = "/home/user/models/distilbert-base-uncased" # Linux example


# --- 2. Load the model and tokenizer from the local path ---

print(f"\nLoading pretrained model from local path: {local_model_path}...")

# Pass local_model_path where a model_name would otherwise go
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
model = AutoModelForSequenceClassification.from_pretrained(local_model_path, num_labels=2)

print("\nModel and tokenizer loaded successfully from the local path!")


Loading pretrained model from local path: /content/drive/MyDrive/distilbert-base-uncased...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at /content/drive/MyDrive/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Model and tokenizer loaded successfully from the local path!


### Step 3: Preprocess the Data

Define a function that uses the tokenizer to convert text into the numeric token IDs the model takes as input.

In [4]:
def preprocess_function(examples):
    # padding="max_length" pads every example to exactly 128 tokens. padding=True would
    # pad only to the longest text in each map() chunk, so chunks could end up with
    # different lengths and fail to stack into one training batch.
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

print("\nTokenizing the datasets...")
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)

# Remove columns that are no longer needed
tokenized_train_dataset = tokenized_train_dataset.remove_columns(["text", "__index_level_0__"])
tokenized_eval_dataset = tokenized_eval_dataset.remove_columns(["text", "__index_level_0__"])
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["text"])

print("Tokenization complete.")


Tokenizing the datasets...


Map:   0%|          | 0/45000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Tokenization complete.
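
To make truncation, padding, and `max_length` concrete, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: `pad_and_truncate`, the token IDs, and `pad_id=0` are hypothetical stand-ins that only mimic the shape of the tokenizer's `input_ids` and `attention_mask` outputs. The toy pads to a fixed length, which is what `padding="max_length"` does; `padding=True` pads to the longest sequence in the batch instead.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-ID list to max_length, then right-pad it to exactly max_length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    """
    kept = token_ids[:max_length]            # truncation
    n_pad = max_length - len(kept)           # number of pad slots needed
    input_ids = kept + [pad_id] * n_pad      # padding
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_and_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_and_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4] [1, 1, 1, 1, 1]
```

The real tokenizer also keeps its special end-of-sequence token when truncating, which this toy version ignores.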


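### Step 4: Define Evaluation Metrics and Training Arguments

Define a `compute_metrics` function that calculates accuracy, precision, recall, and F1 during evaluation, and configure the training hyperparameters through `TrainingArguments`.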
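
As a sanity check on what the `compute_metrics` function in this step reports, the same four numbers can be computed by hand from the confusion counts. A small sketch with NumPy: the logits and labels are made-up values, and `binary_metrics` is a hypothetical helper that treats class 1 as the positive class, as `average='binary'` does by default.

```python
import numpy as np


def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 for class 1, computed from raw logits."""
    preds = logits.argmax(-1)                        # predicted class per row
    tp = int(np.sum((preds == 1) & (labels == 1)))   # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))   # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = float(np.mean(preds == labels))
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Hypothetical logits for six examples (columns: class 0, class 1) and their labels.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5],
                       [1.2, 0.3], [0.0, 2.0], [1.0, -1.0]])
toy_labels = np.array([0, 1, 1, 1, 0, 1])
toy_metrics = binary_metrics(toy_logits, toy_labels)
print(toy_metrics)
```

Here the predictions are `[0, 1, 1, 0, 1, 0]`, giving 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.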

In [5]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary', zero_division=0)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",      # evaluate on the validation set at the end of each epoch
    save_strategy="epoch",            # save a checkpoint at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,      # reload the best checkpoint when training finishes
    metric_for_best_model="f1",       # use the F1 score to pick the best checkpoint
)
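
The batch size and epoch count determine how many optimizer steps the run takes. A quick back-of-the-envelope calculation, assuming a single device and no gradient accumulation (neither is stated in the notebook), using the 45,000 training and 5,000 validation rows from Step 1:

```python
import math

train_rows = 45_000       # training set size reported in Step 1
eval_rows = 5_000         # validation set size reported in Step 1
train_batch_size = 16     # per_device_train_batch_size
eval_batch_size = 16      # per_device_eval_batch_size
epochs = 3                # num_train_epochs

# Assumption: one device and gradient_accumulation_steps left at 1.
# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(eval_rows / eval_batch_size)

print(steps_per_epoch, total_steps, eval_batches)  # 2813 8439 313
```

The 313 evaluation batches agree with the final evaluation output later in the notebook, where `eval_steps_per_second` times `eval_runtime` (16.522 × 18.9446) comes to about 313.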

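### Step 5: Initialize the Trainer and Train the Model

Pass all the components (model, arguments, datasets, metric function) to a `Trainer` object, then call `.train()` to start training. With `load_best_model_at_end=True`, `trainer.model` after training is the checkpoint that scored best on the validation set.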

In [6]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset, # evaluate on the validation set
    compute_metrics=compute_metrics,
)

print("\nStarting Transformer fine-tuning...")
trainer.train()

# After training, run a final evaluation of the best model on the validation set
print("\nFinal evaluation results on the validation set:")
eval_results = trainer.evaluate()
print(eval_results)




Starting Transformer fine-tuning...


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: picnight (picnight-beihang-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.0182 | 0.039204 | 0.9896 | 0.989401 | 0.988595 | 0.990208 |
| 2 | 0.001 | 0.049311 | 0.991 | 0.990792 | 0.993842 | 0.98776 |
| 3 | 0.0406 | 0.039849 | 0.9932 | 0.99307 | 0.992261 | 0.99388 |



Final evaluation results on the validation set:


{'eval_loss': 0.039848655462265015, 'eval_accuracy': 0.9932, 'eval_f1': 0.9930697105584998, 'eval_precision': 0.9922606924643584, 'eval_recall': 0.9938800489596084, 'eval_runtime': 18.9446, 'eval_samples_per_second': 263.927, 'eval_steps_per_second': 16.522, 'epoch': 3.0}
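
The checkpoint chosen by `load_best_model_at_end=True` with `metric_for_best_model="f1"` can be reproduced by hand from the per-epoch results above. A sketch, assuming higher F1 is better (the `Trainer` default for a metric that is not a loss); `epoch_metrics` simply copies the logged values:

```python
# Per-epoch validation metrics copied from the training log above.
epoch_metrics = [
    {"epoch": 1, "eval_loss": 0.039204, "eval_f1": 0.989401},
    {"epoch": 2, "eval_loss": 0.049311, "eval_f1": 0.990792},
    {"epoch": 3, "eval_loss": 0.039849, "eval_f1": 0.99307},
]

# Pick the epoch with the highest F1, mirroring metric_for_best_model="f1".
best_by_f1 = max(epoch_metrics, key=lambda m: m["eval_f1"])
# For comparison: the epoch a loss-based criterion would pick (lower is better).
best_by_loss = min(epoch_metrics, key=lambda m: m["eval_loss"])

print(best_by_f1["epoch"], best_by_loss["epoch"])  # 3 1
```

Epoch 3 wins on F1, which matches the final evaluation (`eval_f1` of about 0.9931), while the lowest validation loss occurred at epoch 1. The choice of metric therefore changes which checkpoint is kept.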


### Step 6: Predict on the Unlabeled `test.csv`

Now use the best fine-tuned model to predict labels for the actual test data.

In [7]:
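print("\nPredicting on test.csv...")
# trainer.predict() also works on unlabeled data; it then returns predictions without metrics
predictions = trainer.predict(tokenized_test_dataset)

# predictions.predictions is an array of logits; argmax picks the highest-scoring class per row
predicted_labels = np.argmax(predictions.predictions, axis=1)
print("Prediction complete.")


Predicting on test.csv...


Prediction complete.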
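
Taking `np.argmax` of the logits gives the same label as picking the highest softmax probability, because softmax preserves ordering. If class probabilities are wanted as well (for example, to apply a custom decision threshold), a softmax can be applied to the logits. A sketch with NumPy on made-up logits; `softmax` here is a local helper, not part of the notebook:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical logits for three examples (columns: class 0, class 1).
toy_logits = np.array([[3.2, -2.9], [-1.5, 2.5], [0.2, 0.1]])
probs = softmax(toy_logits)
labels_from_logits = np.argmax(toy_logits, axis=1)
labels_from_probs = np.argmax(probs, axis=1)

print(probs.round(3))
print(labels_from_logits, labels_from_probs)  # [0 1 0] [0 1 0]
```

In the notebook, the analogous call would be `softmax(predictions.predictions)`, whose second column is the predicted probability of class 1.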


### Step 7: Create and Save the Submission File

Finally, pair each prediction with the corresponding `id` from the test set and write a submission file.

In [8]:
print("Creating submission file submission.csv...")
submission_df = pd.DataFrame({
    'id': test_df['id'],
    'label': predicted_labels
})

submission_df.to_csv('submission.csv', index=False)

print("\nDone! Predictions saved to submission.csv")

Creating submission file submission.csv...

Done! Predictions saved to submission.csv
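
Before uploading, it can be worth checking that the file has one row per test example, the expected columns, and only valid labels. A sketch that uses an in-memory CSV with made-up IDs and labels so that it runs on its own; `check_submission` is a hypothetical helper, and the `id`/`label` column names follow the cell above:

```python
import io

import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    problems = []
    if list(df.columns) != ["id", "label"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if "id" in df.columns and df["id"].duplicated().any():
        problems.append("duplicate ids")
    if "label" in df.columns and not df["label"].isin([0, 1]).all():
        problems.append("labels outside {0, 1}")
    return problems


# Stand-in for submission.csv; the ids and labels here are hypothetical.
buffer = io.StringIO()
pd.DataFrame({"id": [1, 2, 3], "label": [0, 1, 0]}).to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
print(check_submission(loaded, expected_rows=3))  # []
```

For the real file, the analogous call would be `check_submission(pd.read_csv('submission.csv'), expected_rows=len(test_df))`.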
