# Dataset Introduction

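<font size=4>[TREC 2006 Spam Track Public Corpora](https://plg.uwaterloo.ca/~gvcormac/treccorpus06/about.html) is a public spam email corpus provided by the Text REtrieval Conference (TREC). It consists of an English dataset (trec06p) and a Chinese dataset (trec06c). All messages come from real email and keep their original format and content.</font>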

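<font size=4>Besides TREC 2006, there are also English spam datasets from TREC 2005 and TREC 2007 (that's right, no Chinese version). This project uses only the Chinese dataset from TREC 2006 for demonstration. The TREC 2005-2007 spam datasets have all been collected in the dataset mounted to this project; interested readers can fork it.</font>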

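<font size=4>Directory layout: `delay` and `full` each correspond to a filtering mechanism of a spam filter. The index under `full` holds the ideal classification of every message, which we can treat as the labels for this study.</font>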
```
trec06c
│
└───data
│   │   000
│   │   001
│   │   ...
│   └───215
└───delay
│   │   index
└───full
│   │   index  
```
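
Each line of an `index` file pairs a label with a relative path to a message file. The sketch below assumes lines of the form `spam ../data/000/000`, which is the shape implied by the parsing code later in this notebook. `parse_index_line` is a hypothetical helper written for illustration, not part of the corpus tooling.

```python
def parse_index_line(line):
    """Split one index line into (label, relative_path).

    Assumes the line looks like 'spam ../data/000/000'.
    """
    label, rel_path = line.strip().split(" ", 1)
    if label not in ("spam", "ham"):
        raise ValueError(f"unexpected label: {label!r}")
    return label, rel_path


# Example lines written by hand for illustration
sample_lines = ["spam ../data/000/000\n", "ham ../data/000/001\n"]
parsed = [parse_index_line(line) for line in sample_lines]
print(parsed)
```

Because the paths are relative to the directory holding the index, they have to be joined with `trec06c/full/` before the files are opened.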

Sample email contents:
```
负责人您好我是深圳金海实业有限公司 广州 东莞 等省市有分公司我司有良好的社会关系和实力 因每月进项多出项少现有一部分发票可优惠对外代开税率较低 增值税发票为 其它国税 地税运输 广告等普通发票为 的税点 还可以根据数目大小来衡量优惠的多少 希望贵公司 商家等来电商谈欢迎合作本公司郑重承诺所用票据可到税务局验证或抵扣欢迎来电进一步商谈电话 小时服务信箱联系人 张海南顺祝商祺深圳市金海实业有限公司
```
```
GG非常好的朋友H在计划马上的西藏自助游（完全靠搭车的那种），我和H也是很早认识的朋友，他有女朋友，在一起10年了，感情很好。
GG对旅游兴趣不大。而且喜欢跟着旅行社的那种。所以肯定不去。
我在没有认识GG前，时常独自去一些地方，从南到北，觉得旅行不应该目的那么强。
```

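# 1. Environment Setup

This project is written for Paddle 2.0. If your environment uses a different version, first follow the official [installation guide](https://www.paddlepaddle.org.cn/install/quick) to install Paddle 2.0.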

In [1]:
pip install translate

Collecting translate
  Downloading translate-3.6.1-py2.py3-none-any.whl.metadata (7.7 kB)
Collecting lxml (from translate)
  Downloading lxml-5.3.0-cp39-cp39-win_amd64.whl.metadata (3.9 kB)
Collecting libretranslatepy==2.1.1 (from translate)
  Downloading libretranslatepy-2.1.1-py3-none-any.whl.metadata (233 bytes)
Downloading translate-3.6.1-py2.py3-none-any.whl (12 kB)
Downloading libretranslatepy-2.1.1-py3-none-any.whl (3.2 kB)
Downloading lxml-5.3.0-cp39-cp39-win_amd64.whl (3.8 MB)
Installing collected packages: libretranslatepy, lxml, translate
Successfully installed libretranslatepy-2.1.1 lxml-5.3.0 translate-3.6.1
Note: you may need to restart the kernel to use updated packages.


In [2]:

from translate import Translator

# Translate between two languages; here, from English to Chinese
translator = Translator(from_lang="english", to_lang="chinese")
translation = translator.translate("can you speak english?")
print(translation)


def translate_text(text):
    """Translate text to Chinese; return None if the translation call fails."""
    try:
        translator = Translator(to_lang="zh")
        translated_text = translator.translate(text)
        return translated_text
    except Exception:
        return None

你会说英文吗
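
`translate_text` returns `None` when the translation call fails, so downstream code has to skip those entries before they reach the tokenizer. Below is a minimal sketch of that guard. It uses a stand-in `fake_translate` function (hypothetical, so the example runs without network access) in place of the real translator.

```python
def fake_translate(text):
    """Stand-in for translate_text: returns None to simulate a failed call."""
    if not text.strip():
        return None
    return "[zh] " + text


def translate_all(texts, translate_fn):
    """Translate each text and drop the ones whose translation failed."""
    results = []
    for text in texts:
        translated = translate_fn(text)
        if translated is None:
            continue  # skip failed translations instead of passing None on
        results.append(translated)
    return results


print(translate_all(["hello", "   ", "see you"], fake_translate))
```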


In [1]:
# Import the required modules
import re
import jieba
import os 
import random
import paddle
import paddlenlp as ppnlp
from paddlenlp.data import Stack, Pad, Tuple
import paddle.nn.functional as F
import paddle.nn as nn
from visualdl import LogWriter
import numpy as np
from functools import partial  # partial() fixes some argument values and returns a new callable

PLEASE USE OMP_NUM_THREADS WISELY.


In [2]:
print(paddle.__version__)

2.5.2


# 2. Data Loading

## 2.1 Preparing the Dataset


In [6]:
# Extract the dataset archive
!tar xvf data/data89631/trec06c.tgz

tar: Error opening archive: Failed to open 'data/data89631/trec06c.tgz'


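## 2.2 Extracting Email Content and Splitting into Training, Validation, and Test Sets
<font size=4 color=red>This project uses the last 200 characters extracted from each Chinese email as the input to the text classification task. Note, however, that extraction yields `null` for a few emails, and an empty input raises an error when fine-tuning a pretrained BERT model.</font>

<font size=4>The data therefore has to be cleaned before the training, validation, and test sets are generated.</font>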

In [None]:
# Remove non-Chinese characters
def clean_str(string):
    string = re.sub(r"[^\u4e00-\u9fff]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()

# Read the content of the email file at the given path
# (save_path is unused and kept only so existing calls keep working)
def get_data_in_a_file(original_path, save_path='all_email.txt'):
    email = ''
    # Read as GB2312 and ignore bytes that cannot be decoded
    with open(original_path, 'r', encoding='gb2312', errors='ignore') as f:
        for line in f:
            # Strip surrounding whitespace, including the newline
            line = line.strip()
            # Remove non-Chinese characters
            line = clean_str(line)
            email += line
    # Keep only the last 200 characters
    return email[-200:]
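
To make the effect of the cleaning step concrete, the sketch below repeats the two substitutions from `clean_str` and applies them to a few hand-written lines. `tail_of_email` is a hypothetical helper that works on a list of strings instead of a file, and the sample lines are invented for illustration.

```python
import re


def clean_str(string):
    # Same two substitutions as in the notebook: keep CJK characters only
    string = re.sub(r"[^\u4e00-\u9fff]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()


def tail_of_email(lines, max_chars=200):
    """Clean each line, concatenate, and keep the last max_chars characters."""
    email = "".join(clean_str(line.strip()) for line in lines)
    return email[-max_chars:]


mixed = ["Subject: 发票 invoice 2006", "电话 12345 联系人"]
print(repr(tail_of_email(mixed)))
# A message with no Chinese characters at all is reduced to an empty string
print(repr(tail_of_email(["Hello, world! 123"])))
```

The second call shows where the empty inputs mentioned above come from: a message without any Chinese characters leaves nothing behind after cleaning.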

In [9]:
# Translate samples from train_list2.txt and append the results to train_list3.txt
i = 0
with open('train_list2.txt', 'r') as f1:
    for line in f1:
        line = line.strip()
        # Each line is assumed to be "<text>\t<label>", so the label is the last character
        translation = translate_text(line[:-2])
        if translation is None:
            continue
        i += 1
        print(i)
        # Stop once the counter reaches 50 (the 50th sample is not written)
        if i == 50:
            break
        label = line[-1]
        translation = clean_str(translation)
        translation = translation.replace(" ", ",")
        with open("train_list3.txt", "a+") as f2:
            f2.write(translation + '\t' + label + '\n')

In [15]:
# Read the label index file
with open('trec06c/full/index', 'r') as f_index:
    for line in f_index:
        str_list = line.split(" ")
        # Spam is labelled 0
        if str_list[0] == 'spam':
            label = '0'
        # Ham (legitimate email) is labelled 1
        elif str_list[0] == 'ham':
            label = '1'
        else:
            # Skip lines with an unexpected label
            continue
        text = get_data_in_a_file('trec06c/full/' + str(str_list[1].split("\n")[0]))
        with open("all_email.txt", "a+", encoding='utf-8') as f_out:
            f_out.write(text + '\t' + label + '\n')

In [3]:
data_list_path = "./"

# Opening a file in 'w' mode truncates it, which clears any lists left from a previous run
for name in ('eval_list.txt', 'train_list.txt', 'test_list.txt'):
    with open(os.path.join(data_list_path, name), 'w', encoding='utf-8'):
        pass

with open(os.path.join(data_list_path, 'all_email.txt'), 'r', encoding='utf-8') as f_data:
    lines = f_data.readlines()

i = 0
with open(os.path.join(data_list_path, 'eval_list.txt'), 'a', encoding='utf-8') as f_eval, \
        open(os.path.join(data_list_path, 'test_list.txt'), 'a', encoding='utf-8') as f_test, \
        open(os.path.join(data_list_path, 'train_list.txt'), 'a', encoding='utf-8') as f_train:
    for line in lines:
        # Extract the label
        label = line.split('\t')[-1].replace('\n', '')
        # Extract the input text
        words = line.split('\t')[0]
        # The extracted text contains many spaces; replace them all with commas
        words = words.replace(' ', ',')
        # Data cleaning: drop samples with empty text so BERT fine-tuning does not fail
        if len(words) > 0:
            labs = words + '\t' + label + '\n'
            # Validation set
            if i % 10 == 1:
                f_eval.write(labs)
            # Test set
            elif i % 10 == 2:
                f_test.write(labs)
            # Training set
            else:
                f_train.write(labs)
            i += 1

print("Data lists generated!")

Data lists generated!
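
The modulo rule used for the split sends one sample in ten to the validation set, one in ten to the test set, and the rest to the training set. A small sketch that counts the resulting proportions follows; `assign_split` is a hypothetical name, and the 1000 samples are a toy number rather than the real corpus size.

```python
from collections import Counter


def assign_split(i):
    """Map the running index of a non-empty sample to a split name."""
    if i % 10 == 1:
        return "eval"
    if i % 10 == 2:
        return "test"
    return "train"


counts = Counter(assign_split(i) for i in range(1000))
print(counts)
```

Since the counter only advances for non-empty samples, the 80/10/10 ratio holds for the cleaned data rather than for the raw file.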


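## 2.3 Custom Dataset
In the example project, the dataset used to fine-tune BERT is ChnSentiCorp, a public Chinese sentiment analysis dataset, which can be loaded with PaddleNLP's `ppnlp.datasets.ChnSentiCorp.get_datasets` method.

<font size=4 color=red>In this project we need to define a custom dataset whose data format matches exactly what `ppnlp.datasets.ChnSentiCorp.get_datasets(['train','dev','test'])` returns.</font>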

In [4]:
class SelfDefinedDataset(paddle.io.Dataset):
    def __init__(self, data):
        super(SelfDefinedDataset, self).__init__()
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
        
    def get_labels(self):
        return ["0", "1"]

def txt_to_list(file_name):
    # Each line is "<text>\t<label>"; return a list of [text, label] pairs
    res_list = []
    with open(file_name, encoding='utf-8') as f:
        for line in f:
            res_list.append(line.strip().split('\t'))
    return res_list

trainlst = txt_to_list('train_list.txt')
devlst = txt_to_list('eval_list.txt')
testlst = txt_to_list('test_list.txt')

train_ds = SelfDefinedDataset(trainlst)
dev_ds = SelfDefinedDataset(devlst)
test_ds = SelfDefinedDataset(testlst)
label_list = train_ds.get_labels()
print(label_list)
from paddlenlp.datasets import MapDataset
train_ds = MapDataset(train_ds)
dev_ds = MapDataset(dev_ds)
test_ds = MapDataset(test_ds)
print("训练集数据：{}\n".format(train_ds[0:3]))
print("验证集数据:{}\n".format(dev_ds[0:3]))
print("测试集数据:{}\n".format(test_ds[0:3]))

['0', '1']
训练集数据：[['公司艾克森金山石化,中国化工进出口公司,正大集团大福饲料,厦华集团,灿坤股份东金电子,太原钢铁集团,深圳开发科技,大冷王运输制冷,三洋华强,等知名企业提供项目辅导或专题培训,王老师授课狙榉岣唬风格幽默诙谐,逻辑清晰,过程互动,案例生动,深受学员喜爱授课时间,地点月,周六,日,上海课,程,费,用元,人,包含培训费用,午餐,证书,资料优惠,三人以上参加,赠送一名额联,系,我,们联系人,桂先生电话,传真', '0'], ['贵公司负责人,经理,财务,您好深圳市华龙公司受多家公司委托向外低点代开部分增值税电脑发票,左右,和普通商品销售税发票,国税,地税运输,广告,服务等票,左右,还可以根据所做数量额度的大小来商讨优惠的点数本公司郑重承诺所用绝对是真票,可验证后付款此信息长期有效,如须进一步洽商请电联系人,刘剑辉顺祝商祺低点代开发票', '0'], ['用付出劳动,那就交注册费吧,呵呵,让网站去赚你注册费的,吧,你注册费的,付给你的上线,那样你真的赚到什么了吗,真搞不懂当您发展下线时,只需将本页的注册连接中的,换成您在,的用户名即可独乐乐,不如众乐乐,大家一起赚美国人的钱吧把这个连接,全部蓝色部份,复制到浏览器地址栏中,回车即可进入注册界面我的邮件地址广告,网络电话包年卡,元,长途市话全包最快的论坛邮址搜索专家,最好的邮件群发专家论坛短信群发专家', '0']]

验证集数据:[['讲的是孔子后人的故事,一个老领导回到家乡,跟儿子感情不和,跟贪财的孙子孔为本和睦老领导的弟弟魏宗万是赶马车的有个洋妞大概是考察民俗的,在他们家过年孔为本总想出国,被爷爷教育了最后,一家人基本和解顺便问另一类电影,北京青年电影制片厂的,中越战背景,一军人被介绍了一个对象,去相亲,女方是军队医院的护士,犹豫不决,总是在回忆战场上负伤的男友,好像还没死,最后男方表示理解,归队了', '1'], ['贵公司负责人,经理,财务,您好深圳市华龙公司受多家公司委托向外低点代开部分增值税电脑发票,左右,和普通商品销售税发票,国税,地税运输,广告,服务等票,左右,还可以根据所做数量额度的大小来商讨优惠的点数本公司郑重承诺所用绝对是真票,可验证后付款此信息长期有效,如须进一步洽商请电联系人,刘剑辉顺祝商祺低点代开发票', '0'], ['可以代理代办其它发票如,广告,运输,建筑

In [5]:


# Inspect the data: print the first 3 samples of the training, validation, and test sets.
print("训练集数据：{}\n".format(train_ds[0:3]))
print("验证集数据:{}\n".format(dev_ds[0:3]))
print("测试集数据:{}\n".format(test_ds[0:3]))

print("训练集样本个数:{}".format(len(train_ds)))
print("验证集样本个数:{}".format(len(dev_ds)))
print("测试集样本个数:{}".format(len(test_ds)))


## 2.4 Data Preprocessing

In [6]:
# Use ppnlp.transformers.BertTokenizer for preprocessing; the tokenizer converts raw input text into the input format the model accepts.
tokenizer = ppnlp.transformers.BertTokenizer.from_pretrained("bert-base-chinese")

# Data preprocessing
def convert_example(example,tokenizer,label_list,max_seq_length=256,is_test=False):
    if is_test:
        text = example
    else:
        text, label = example
    # tokenizer.encode splits the text into tokens, maps them to token IDs, and adds the special tokens
    encoded_inputs = tokenizer.encode(text=text, max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    # Note: in earlier PaddleNLP versions, token_type_ids was called segment_ids
    segment_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label_map = {}
        for (i, l) in enumerate(label_list):
            label_map[l] = i

        label = label_map[label]
        label = np.array([label], dtype="int64")
        return input_ids, segment_ids, label
    else:
        return input_ids, segment_ids

# Build a data loader
def create_dataloader(dataset, trans_fn=None, mode='train', batch_size=1, use_gpu=True, pad_token_id=0, batchify_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn, lazy=True)

    if mode == 'train' and use_gpu:
        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True)
    else:
        shuffle = True if mode == 'train' else False  # do not shuffle outside of training
        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)  # create a batch sampler
    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn)
    return dataloader

# Use partial() to fix the tokenizer, label_list, max_seq_length, and is_test arguments of convert_example
trans_fn = partial(convert_example, tokenizer=tokenizer, label_list=label_list, max_seq_length=128, is_test=False)
batchify_fn = lambda samples, fn=Tuple(Pad(axis=0,pad_val=tokenizer.pad_token_id), Pad(axis=0, pad_val=tokenizer.pad_token_id), Stack(dtype="int64")):[data for data in fn(samples)]
# Training set loader
train_loader = create_dataloader(train_ds, mode='train', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)
# Validation set loader
dev_loader = create_dataloader(dev_ds, mode='dev', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)
# Test set loader
test_loader = create_dataloader(test_ds, mode='test', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)

[2024-10-17 17:23:11,227] [    INFO] - Already cached C:\Users\biang\.paddlenlp\models\bert-base-chinese\bert-base-chinese-vocab.txt
[2024-10-17 17:23:11,238] [    INFO] - tokenizer config file saved in C:\Users\biang\.paddlenlp\models\bert-base-chinese\tokenizer_config.json
[2024-10-17 17:23:11,240] [    INFO] - Special tokens file saved in C:\Users\biang\.paddlenlp\models\bert-base-chinese\special_tokens_map.json


In [None]:
#devlist2=txt_to_list('train_list3.txt')
#dev_ds2=SelfDefinedDataset(devlist2)
#dev_ds2=MapDataset(dev_ds2)
#dev_loader2 = create_dataloader(dev_ds2, mode='dev', batch_size=64, batchify_fn=batchify_fn, trans_fn=trans_fn)
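
The `batchify_fn` above pads the token ID and segment ID sequences in a batch to a common length and stacks the labels. The sketch below is a NumPy-only approximation of that behaviour, not PaddleNLP's actual `Pad` and `Stack` implementation. `pad_batch` and `stack_labels` are hypothetical helpers, and the token IDs are made up for illustration.

```python
import numpy as np


def pad_batch(sequences, pad_val=0):
    """Right-pad variable-length id lists to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_val, dtype="int64")
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch


def stack_labels(labels):
    """Stack per-sample label arrays into one (batch, 1) array."""
    return np.stack(labels).astype("int64")


# Made-up token ids; 101 and 102 are [CLS] and [SEP] in the bert-base-chinese vocabulary
input_ids = [[101, 2769, 102], [101, 2769, 1962, 3221, 102]]
labels = [np.array([0]), np.array([1])]

padded = pad_batch(input_ids)
stacked = stack_labels(labels)
print(padded)
print(stacked.shape)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is presumably why the loader pads inside `batchify_fn`.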

# 3. Model Training
## 3.1 Loading the Pretrained BERT Model

In [7]:
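# Load BertForSequenceClassification, the fine-tuning network for text classification built on pretrained BERT; it adds a fully connected layer after the BERT model for classification.
# Spam detection here is a binary classification problem, so num_classes is set to 2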
model = ppnlp.transformers.BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=2)

[2024-10-17 17:23:15,115] [    INFO] - Already cached C:\Users\biang\.paddlenlp\models\bert-base-chinese\model_state.pdparams
[2024-10-17 17:23:15,116] [    INFO] - Loading weights file model_state.pdparams from cache at C:\Users\biang\.paddlenlp\models\bert-base-chinese\model_state.pdparams
[2024-10-17 17:23:15,704] [    INFO] - Loaded weights file from disk, setting weights to model.
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to us
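
The classification head maps the pooled BERT output to one logit per class, and softmax turns those logits into probabilities. Here is a NumPy-only sketch of that last step. It uses random weights and a toy hidden size of 8 as assumptions for illustration; the real model has a hidden size of 768 and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, num_classes = 8, 2               # toy sizes for illustration only
pooled = rng.normal(size=(4, hidden_size))    # stand-in for pooled BERT output, batch of 4
weight = rng.normal(size=(hidden_size, num_classes))
bias = np.zeros(num_classes)

logits = pooled @ weight + bias               # fully connected layer: one logit per class


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


probs = softmax(logits)
pred = probs.argmax(axis=-1)                  # 0 = spam, 1 = ham in this notebook
print(probs.shape, pred)
```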

In [11]:
pip install matplotlib seaborn scikit-learn


Looking in indexes: https://mirror.baidu.com/pypi/simple/, https://mirrors.aliyun.com/pypi/simple/

[notice] A new release of pip available: 22.1.2 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [18]:
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.transformers import BertForSequenceClassification
from visualdl import LogWriter
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

## 3.2 Training
<font size=4>To monitor training, we use VisualDL to record the training `log`. It is added as follows:</font>
```python
from visualdl import LogWriter
with LogWriter(logdir="./log") as writer:
    writer.add_scalar(tag="train/loss", step=global_step, value=loss)
    writer.add_scalar(tag="train/acc", step=global_step, value=acc)
    writer.add_scalar(tag="eval/loss", step=epoch, value=eval_loss)
    writer.add_scalar(tag="eval/acc", step=epoch, value=eval_acc)
```
<font size=4>See the code below for the full implementation.</font>

In [8]:
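# Training hyperparameters

# Learning rate
learning_rate = 1e-6 
# Number of training epochs
epochs = 3
# Proportion of training steps used for learning-rate warmup
warmup_proportion = 0.1
# Weight decay coefficient
weight_decay = 0.01

num_training_steps = len(train_loader) * epochs
num_warmup_steps = int(warmup_proportion * num_training_steps)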

def get_lr_factor(current_step):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    else:
        return max(0.0,
                    float(num_training_steps - current_step) /
                    float(max(1, num_training_steps - num_warmup_steps)))
# Learning-rate scheduler
lr_scheduler = paddle.optimizer.lr.LambdaDecay(learning_rate, lr_lambda=lambda current_step: get_lr_factor(current_step))

# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

# Loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# Evaluation metric
metric = paddle.metric.Accuracy()
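
The scheduler multiplies the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then falls linearly back to 0. The sketch below restates the rule from `get_lr_factor` as a standalone function (`lr_factor`, a hypothetical name) and evaluates it with toy step counts, not the real number of training steps.

```python
def lr_factor(step, total_steps, warmup_steps):
    """Linear warmup to 1.0, then linear decay to 0.0 (same rule as get_lr_factor)."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-6
total_steps, warmup_steps = 100, 10   # toy numbers, not the real step counts
for step in (0, 5, 10, 55, 100):
    print(step, base_lr * lr_factor(step, total_steps, warmup_steps))
```

With a warmup proportion of 0.1, the learning rate peaks after the first tenth of training and halves by the midpoint of the decay phase.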

In [9]:
# Evaluation function; returns the loss and accuracy so VisualDL can record them
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
    return np.mean(losses), accu

In [10]:
# Start training
global_step = 0
with LogWriter(logdir="./log") as writer:
    for epoch in range(1, epochs + 1):
        print("epoch:",epoch)
        for step, batch in enumerate(train_loader, start=1):  # iterate over batches from the training loader
            print("step:",step)
            input_ids, segment_ids, labels = batch
            print("input_ids:",input_ids)
            logits = model(input_ids, segment_ids)
            print("logits:",logits)
            loss = criterion(logits, labels)  # compute the loss
            print("loss:",loss)
            probs = F.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()
            print("acc:",acc)

            global_step += 1
            print('global_step:',global_step)
            if global_step % 50 == 0 :
                print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
                # Record training progress
                writer.add_scalar(tag="train/loss", step=global_step, value=loss)
                writer.add_scalar(tag="train/acc", step=global_step, value=acc)
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_gradients()
        eval_loss, eval_acc = evaluate(model, criterion, metric, dev_loader)
        #eval_loss2, eval_acc2 = evaluate(model, criterion, metric, dev_loader2)
        # Save the model after each epoch
        paddle.save(model.state_dict(), f'./saved_models/model_epoch{epoch}.pdparams')
        # Record evaluation results
        writer.add_scalar(tag="eval/loss", step=epoch, value=eval_loss)
        writer.add_scalar(tag="eval/acc", step=epoch, value=eval_acc)
        #writer.add_scalar(tag="eval/loss2", step=epoch, value=eval_loss2)
        #writer.add_scalar(tag="eval/acc2", step=epoch, value=eval_acc2)



epoch: 1
step: 1
input_ids: Tensor(shape=[64, 128], dtype=int32, place=Place(gpu_pinned), stop_gradient=True,
       [[101 , 5470, 1184, ..., 117 , 704 , 102 ],
        [101 , 762 , 1874, ..., 4638, 2130, 102 ],
        [101 , 1912, 2845, ..., 1071, 2124, 102 ],
        ...,
        [101 , 4500, 4638, ..., 4638, 798 , 102 ],
        [101 , 5106, 3796, ..., 117 , 1555, 102 ],
        [101 , 2245, 1184, ..., 117 , 2769, 102 ]])
logits: Tensor(shape=[64, 2], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [[ 0.20209090, -0.39877081],
        [ 0.74738932, -0.73671073],
        [ 0.78613943, -0.59866434],
        [ 0.77173334, -0.18493524],
        [ 0.52312410, -0.38009256],
        [ 0.91436362, -0.06754762],
        [ 0.58960402, -0.68650067],
        [ 0.16917586, -0.26603153],
        [ 0.53388309, -0.63708985],
        [ 0.65927619, -0.55190331],
        [ 0.63688868, -0.80141580],
        [ 0.27530465, -0.15779850],
        [ 0.54590845, -0.77733517],
        [ 0.5555


KeyboardInterrupt



In [25]:
# A second model instance, used to load saved weights for verification
model2 = ppnlp.transformers.BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=2)

[2024-10-17 16:08:21,155] [    INFO] - Already cached C:\Users\biang\.paddlenlp\models\bert-base-chinese\model_state.pdparams
[2024-10-17 16:08:21,157] [    INFO] - Loading weights file model_state.pdparams from cache at C:\Users\biang\.paddlenlp\models\bert-base-chinese\model_state.pdparams
[2024-10-17 16:08:22,026] [    INFO] - Loaded weights file from disk, setting weights to model.
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to us

In [None]:
# Save the final model
paddle.save(model.state_dict(), './saved_models/model_final.pdparams')

<font size=4>As shown, fine-tuning the pretrained BERT model brings validation accuracy above 99.4% after the second epoch and above 99.6% by the third.</font>

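# 4. Prediction Results

With training complete, we have a model that can tell from the content of a Chinese email whether it is spam. Next, we look at how well the model generalizes on the test set.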

In [23]:
def predict(model, data, tokenizer, label_map, batch_size=1):
    examples = []
    for text in data:
        input_ids, segment_ids = convert_example(text, tokenizer, label_list=label_map.values(),  max_seq_length=128, is_test=True)
        examples.append((input_ids, segment_ids))

    batchify_fn = lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id), Pad(axis=0, pad_val=tokenizer.pad_token_id)): fn(samples)
    batches = []
    one_batch = []
    for example in examples:
        one_batch.append(example)
        if len(one_batch) == batch_size:
            batches.append(one_batch)
            one_batch = []
    if one_batch:
        batches.append(one_batch)

    results = []
    model.eval()
    for batch in batches:
        input_ids, segment_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        segment_ids = paddle.to_tensor(segment_ids)
        logits = model(input_ids, segment_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
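
Two steps inside `predict` are easy to check in isolation: splitting the examples into batches, and turning each row of logits into a readable label. The sketch below reproduces both with NumPy. `chunk` and `logits_to_labels` are hypothetical helper names, and the logit values are invented for illustration.

```python
import numpy as np


def chunk(items, batch_size):
    """Split a list into consecutive batches; the last one may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def logits_to_labels(logits, label_map):
    """Pick the highest-scoring class per row and map it to a readable label."""
    idx = np.argmax(logits, axis=1).tolist()
    return [label_map[i] for i in idx]


label_map = {0: "spam", 1: "ham"}
batches = chunk(["a", "b", "c", "d", "e"], batch_size=2)
print(batches)
# Logit values are invented for illustration
labels = logits_to_labels(np.array([[2.0, -1.0], [0.1, 0.3]]), label_map)
print(labels)
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits gives the same class as taking it after softmax.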

In [None]:

label_map = {0: 'spam', 1: 'ham'}
# Translate two English sentences into Chinese before classifying them
data1 = translate_text("It doesn't look like there's much interest in a bof;maybe next time.")
data2 = translate_text("if you want a bof, schedule it and see what happens.i'm a little curious myself.")
data = ['您好我公司有多余的发票可以向外代开,国税,地税,运输,广告,海关缴款书如果贵公司,厂,有需要请来电洽谈,咨询联系电话,罗先生谢谢顺祝商祺', data1, data2]
predictions = predict(model, data, tokenizer, label_map, batch_size=32)
for idx, text in enumerate(data):
    print('Text: {} \nLabel: {}'.format(text, predictions[idx]))

In [26]:
# Load the weights saved after epoch 1 into the second model instance
params_path = './saved_models/model_epoch1.pdparams'
state_dict = paddle.load(params_path)
model2.set_state_dict(state_dict)
label_map = {0: 'spam', 1: 'ham'}
# Translate two English sentences into Chinese before classifying them
data1 = translate_text("It doesn't look like there's much interest in a bof;maybe next time.")
data2 = translate_text("if you want a bof, schedule it and see what happens.i'm a little curious myself.")
data = ['老师好，请问一下这周需要上课吗', data1, data2]
predictions = predict(model2, data, tokenizer, label_map, batch_size=32)
for idx, text in enumerate(data):
    print('Text: {} \nLabel: {}'.format(text, predictions[idx]))



Text: 老师好，请问一下这周需要上课吗 
Label: ham
Text: 看起来对bof不太感兴趣；也许下次吧。 
Label: ham
Text: 如果你想要一个bof ，安排它，看看会发生什么。我自己也有点好奇。 
Label: ham


In [29]:
!pip install matplotlib seaborn


Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2


In [30]:
# Add further evaluation metrics
import numpy as np
import paddle
import paddle.nn.functional as F
from paddle.io import DataLoader
from paddle.metric import Accuracy
from visualdl import LogWriter
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import seaborn as sns

# Load BertForSequenceClassification, the fine-tuning network for text classification built on pretrained BERT
model = ppnlp.transformers.BertForSequenceClassification.from_pretrained("bert-base-chinese", num_classes=2)

# Training hyperparameters
learning_rate = 1e-6
epochs = 1
warmup_proportion = 0.1
weight_decay = 0.01

num_training_steps = len(train_loader) * epochs
num_warmup_steps = int(warmup_proportion * num_training_steps)

def get_lr_factor(current_step):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    else:
        return max(0.0, float(num_training_steps - current_step) / float(max(1, num_training_steps - num_warmup_steps)))

# Learning-rate scheduler
lr_scheduler = paddle.optimizer.lr.LambdaDecay(learning_rate, lr_lambda=lambda current_step: get_lr_factor(current_step))

# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
)

# Loss function
criterion = paddle.nn.loss.CrossEntropyLoss()
# Evaluation metric
metric = Accuracy()

def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    all_labels = []
    all_preds = []

    for batch in data_loader:
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        
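        # Get the predicted class for each sample
        preds = F.softmax(logits, axis=1).argmax(axis=1).numpy()
        # Flatten labels from shape (batch, 1) to a flat list for scikit-learn
        all_labels.extend(labels.numpy().flatten().tolist())
        all_preds.extend(preds.tolist())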
        
        correct = metric.compute(logits, labels)
        metric.update(correct)

    accu = metric.accumulate()
    
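    # Compute precision, recall, and F1; with average='binary', scikit-learn scores the class with label 1 (ham here) by default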
    precision = metrics.precision_score(all_labels, all_preds, average='binary')
    recall = metrics.recall_score(all_labels, all_preds, average='binary')
    f1 = metrics.f1_score(all_labels, all_preds, average='binary')
    
    print(f"eval loss: {np.mean(losses):.5f}, accu: {accu:.5f}, precision: {precision:.5f}, recall: {recall:.5f}, f1: {f1:.5f}")
    
    # Compute the confusion matrix
    confusion_mtx = metrics.confusion_matrix(all_labels, all_preds)
    plot_confusion_matrix(confusion_mtx)
    
    model.train()
    metric.reset()
    return np.mean(losses), accu 

def plot_confusion_matrix(confusion_mtx):
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_mtx, annot=True, fmt="d", cmap="Blues", cbar=False)
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted Labels')
    plt.ylabel('True Labels')
    plt.show()

# Start training
global_step = 0
with LogWriter(logdir="./log") as writer:
    for epoch in range(1, epochs + 1):
        for step, batch in enumerate(train_loader, start=1):
            input_ids, segment_ids, labels = batch
            logits = model(input_ids, segment_ids)
            loss = criterion(logits, labels)  # compute the loss
            probs = F.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()

            global_step += 1
            if global_step % 50 == 0:
                print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
                # Record training progress
                writer.add_scalar(tag="train/loss", step=global_step, value=loss)
                writer.add_scalar(tag="train/acc", step=global_step, value=acc)
            
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.clear_gradients()

        eval_loss, eval_acc = evaluate(model, criterion, metric, dev_loader)
        paddle.save(model.state_dict(), f'./saved_models/model_epoch{epoch}.pdparams')
        # Record evaluation results
        writer.add_scalar(tag="eval/loss", step=epoch, value=eval_loss)
        writer.add_scalar(tag="eval/acc", step=epoch, value=eval_acc)

[2024-10-17 16:16:01,395] [    INFO] - Already cached C:\Users\biang\.paddlenlp\models\bert-base-chinese\model_state.pdparams
[2024-10-17 16:16:01,396] [    INFO] - Loading weights file model_state.pdparams from cache at C:\Users\biang\.paddlenlp\models\bert-base-chinese\model_state.pdparams
[2024-10-17 16:16:02,069] [    INFO] - Loaded weights file from disk, setting weights to model.
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to us

KeyboardInterrupt: 
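
The precision, recall, and F1 values reported by the extended `evaluate` function depend on which class counts as positive. The sketch below computes them by hand with NumPy for either choice. `binary_metrics` is a hypothetical helper, and the labels are toy data invented for illustration, not results from this model.

```python
import numpy as np


def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels invented for illustration: 0 = spam, 1 = ham
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
ham_scores = binary_metrics(y_true, y_pred, positive=1)    # ham as the positive class
spam_scores = binary_metrics(y_true, y_pred, positive=0)   # spam as the positive class
print(ham_scores)
print(spam_scores)
```

For a spam filter it is often more informative to report the scores with spam as the positive class, which in scikit-learn would mean passing `pos_label=0`.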