<center><font size=6>微博情感分析</font></center>

## 模型训练说明
模型基于BERT-chinse-base进行finetune
* 请在根目录下创建目标目录（若已创建，则跳过）
* 请将此notebook另存到上一步所创建的目录下
* 请初始化相关路径变量

In [1]:
! jupyter notebook --version

6.5.4


## 导入包

In [2]:
import os
os.environ['TRANSFORMERS_CACHE'] = 'My_Model'
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

import pandas as pd
import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import transformers
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from tqdm import trange, notebook
from tqdm import *

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,f1_score, roc_auc_score

import warnings
warnings.filterwarnings("ignore")

In [3]:
print(f'torch version: {torch.__version__}\ntransformers version: {transformers.__version__}')

torch version: 2.0.1+cu117
transformers version: 4.27.1


## 初始化模型

In [4]:
model_name = 'bert-base-chinese'
# 下面三个文件的路径为 bert-base-chinese 文件夹，根据自己的存储路径更换
config = BertConfig.from_pretrained('model/'+model_name, finetuning_task='binary')  # BERT 模型配置
tokenizer = BertTokenizer.from_pretrained('model/'+model_name)  # BERT 的分词器
model = BertForSequenceClassification.from_pretrained('model/'+model_name, num_labels=2)  # BERT 的文本分类模型
 
# 用于将文本转换为BERT模型的输入标记
def get_tokens(text, tokenizer, max_seq_length, add_special_tokens=True): 
    # 使用分词器将文本转换为模型可以接受的输入格式
    input_ids = tokenizer.encode(text,
                                 add_special_tokens=add_special_tokens,
                                 truncation=True,
                                 max_length=max_seq_length,
                                 pad_to_max_length=True)
    # 创建一个关注掩码，标记哪些标记是真实文本标记
    attention_mask = [int(id > 0) for id in input_ids]
    # 确保输入标记和关注掩码的长度等于最大序列长度
    assert len(input_ids) == max_seq_length
    assert len(attention_mask) == max_seq_length
    return (input_ids, attention_mask)

Some weights of the model checkpoint at ../model/bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model chec

## 导入标注数据

In [5]:
# 目标：保留话题的文字、移除poi超链接、视频超链接、保留表情图片中的标题文字、移除其他html标签
def prepare(text):
    import re
    # 删除包含 wbicon 的 <i> 标签以及它们之间的内容
    tokens = re.sub(re.compile(r'(<i\s+.*?wbicon.*?>.*?</i>)', re.S), '', text)
    # 删除与 HTML 匹配的标签，包括尖括号 < 和 > 之间的内容
    tokens = re.sub(re.compile(r'<(.*?)>', re.S), '', text)
    # 保留 <img title="xxx"> 中的 title 信息
    tokens = re.sub(re.compile(r'<img.*?(alt=["|\']{0,1}(.*?)["|\']{0,1}|title=["|\']{0,1}(.*?)["|\']{0,1})\s+.*?>', re.S|re.M), '\g<2>', text).strip()
    # 将匹配到的 <br/> 替换为 \n
    tokens = re.sub(re.compile(r'<br/>',re.S),'\n',text)
    # 移除文本内的连续重复内容
    # 移除连续发生3次及以上次数的重复性内容
    # 重复内容的字符串长度>=3
    tokens = re.sub(re.compile(r'([\s|\S]{2,}?)\1{2,}',re.S|re.M),'\g<1>',text)
    # 移除 poi链接、视频链接、直播链接
    urls=re.findall(r"<a.*?href=.*?<\/a>", text, re.I|re.S|re.M)
    url=[u for u in urls if '>2<' in u or 'location_default.png' in u or '视频</a>' in u or '视频</span></a>' in u or '直播</a>' in u]
    if len(url)>0:
        for u in url:
            tokens=text.replace(u,'')
            
    return tokens

In [6]:
df_label = pd.read_csv('data/weibo_label.csv')

In [7]:
import psutil
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=psutil.cpu_count(logical=False))

INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [8]:
df_label['message'] = df_label['message'].parallel_apply(prepare)

In [9]:
df_label.sentiment.value_counts()

sentiment
 0    2771
 1    2102
-1     683
 6     638
Name: count, dtype: int64

## 垃圾文本预测

In [10]:
df = df_label.copy()
df.loc[df[df.sentiment!=6].index,'sentiment'] = 1

In [11]:
df = df[(df['sentiment']==1)|(df['sentiment']==6)]
# 随机丢弃标注列表中量较多的数据，以保持二者的标注量基本相同，提高后期模型预测的准确率
drop_size = len(df[df['sentiment']==1].sentiment)-len(df[df['sentiment']==6].sentiment)
df.drop(df[df['sentiment']==1].sample(drop_size).index, inplace=True)

In [12]:
df.sentiment.unique()

array([6, 1])

In [13]:
df.loc[df[df.sentiment==6].index,'sentiment'] = 0

In [14]:
df.sentiment.unique()

array([0, 1])

In [15]:
df.sample(10)

Unnamed: 0,mid,message,sentiment
5879,4818951786202064,#一条plog告别九月# [二哈],0
4461,4820570703136111,🌄 🌅,0
731,4260379876100120,分享图片,0
2358,4260900208896280,分享视频,0
199,4260015919234630,睡前打卡[月亮] 酵素果冻水果味，大人小孩都爱吃[耶] 排毒 养颜，清肠 治便秘，净...,0
5723,4821704956706891,江同学，你好，我是f班的袁湘琴。,1
96,4262178615743880,只有几件抹胸吊带裙。随便秒杀￥75包邮。 夏季出游拍图美美哒,0
1072,4262059300643330,7.17 薛之谦 \n 唯爱薛之谦@薛之谦 \n #薛之谦717生日快乐##薛之谦##薛...,1
4703,4820576437276314,迷途漫漫，终有一归🌈,1
1905,4260256810421990,一叶孤……,1


### 数据集划分

In [16]:
X_train, X_test, Y_train, Y_test = train_test_split(df['message'],  # 文本消息数据
                                                    df['sentiment'],  # 文本情感标签
                                                    test_size=0.2,    # 测试集占总数据的比例
                                                    random_state=42,  # 随机种子，以确保可重复性
                                                    stratify=df['sentiment'])  # 根据情感标签进行分层抽样
# 使用自定义函数 get_tokens 对训练集和测试集的文本进行分词，每个文本最多包含150个标记
X_train_tokens = X_train.apply(get_tokens, args=(tokenizer, 150))
X_test_tokens = X_test.apply(get_tokens, args=(tokenizer, 150))

### 训练准备

In [17]:
# 将训练集的文本特征转换为PyTorch张量
input_ids_train = torch.tensor(
    [features[0] for features in X_train_tokens.values], dtype=torch.long)  # 输入特征 ID
input_mask_train = torch.tensor(
    [features[1] for features in X_train_tokens.values], dtype=torch.long)  # 输入掩码
label_ids_train = torch.tensor(Y_train.values, dtype=torch.long)  # 标签 ID

# # 输出训练集张量的形状
# print(input_ids_train.shape)  # 输出训练集输入特征的形状
# print(input_mask_train.shape)  # 输出训练集输入掩码的形状
# print(label_ids_train.shape)  # 输出训练集标签的形状

# 创建训练数据集
train_dataset = TensorDataset(input_ids_train, input_mask_train, label_ids_train)

# 将测试集的文本特征转换为PyTorch张量
input_ids_test = torch.tensor([features[0] for features in X_test_tokens.values], dtype=torch.long)
input_mask_test = torch.tensor([features[1] for features in X_test_tokens.values], dtype=torch.long)
label_ids_test = torch.tensor(Y_test.values, dtype=torch.long)

# 创建测试数据集
test_dataset = TensorDataset(input_ids_test, input_mask_test, label_ids_test)

In [18]:
# 训练批次大小和训练周期数
train_batch_size = 64
num_train_epochs = 3

# 创建训练数据采样器和数据加载器
train_sampler = RandomSampler(train_dataset)  # 随机采样器，用于随机选择训练样本
train_dataloader = DataLoader(train_dataset, 
                              sampler=train_sampler, 
                              batch_size=train_batch_size)  # 创建训练数据加载器
t_total = len(train_dataloader) // num_train_epochs  # 计算总的训练步数

# 输出一些训练相关的信息
print("样本数量 =", len(train_dataset))  # 输出训练集样本数量
print("训练周期数 =", num_train_epochs)  # 输出训练周期数
print("总的训练批次大小 =", train_batch_size)  # 输出总的训练批次大小
print("总的优化步数 =", t_total)  # 输出总的优化步数

# 优化器和学习率调度器的设置
learning_rate = 5e-5  # 学习率
adam_epsilon = 1e-8  # Adam优化器的epsilon值
warmup_steps = 0  # 学习率预热步数

# 创建AdamW优化器和学习率调度器
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=warmup_steps, 
                                            num_training_steps=t_total)  # 创建学习率调度器

样本数量 = 1020
训练周期数 = 3
总的训练批次大小 = 64
总的优化步数 = 5


### 训练

In [19]:
# 检测是否有GPU可用，如果有则使用GPU，否则使用CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 创建一个描述训练周期的迭代器
train_iterator = trange(num_train_epochs, desc="Epoch")

# 将模型置于 train 模式
model.train()

for epoch in train_iterator:
    # 创建一个描述迭代的迭代器
    epoch_iterator = tqdm(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):
        # 重置每个迭代开始时的所有梯度
        model.zero_grad()
        
        # 将模型和输入数据移到GPU（如果可用）
        # torch.cuda.empty_cache()  # 清理GPU缓存
        model.to(device)  # 将模型移到GPU或CPU
        cuda = next(model.parameters()).device
        batch = tuple(t.to(cuda) for t in batch)  # 将批次数据移到GPU或CPU

        # 确定传递给模型的输入
        inputs = {
            'input_ids': batch[0],      # 输入特征ID
            'attention_mask': batch[1], # 输入掩码
            'labels': batch[2]         # 标签
        }

        # 通过模型进行前向传播：输入 -> 模型 -> 输出
        outputs = model(**inputs)

        # 计算损失
        loss = outputs[0]

        # 打印当前损失值
        print("\r%f" % loss, end='')

        # 反向传播损失，自动计算梯度
        loss.backward()

        # 通过将梯度限制在一定范围内来防止梯度爆炸
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # 更新模型参数和学习率
        optimizer.step()
        scheduler.step()

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s][A

0.830740


Iteration:   6%|▋         | 1/16 [00:02<00:38,  2.55s/it][A

0.622954


Iteration:  12%|█▎        | 2/16 [00:03<00:22,  1.58s/it][A

0.636609


Iteration:  19%|█▉        | 3/16 [00:04<00:16,  1.27s/it][A

0.508802


Iteration:  25%|██▌       | 4/16 [00:05<00:13,  1.12s/it][A

0.478594


Iteration:  31%|███▏      | 5/16 [00:06<00:11,  1.04s/it][A

0.536109


Iteration:  38%|███▊      | 6/16 [00:07<00:09,  1.01it/s][A

0.479835


Iteration:  44%|████▍     | 7/16 [00:07<00:08,  1.03it/s][A

0.417668


Iteration:  50%|█████     | 8/16 [00:08<00:07,  1.06it/s][A

0.513813


Iteration:  56%|█████▋    | 9/16 [00:09<00:06,  1.07it/s][A

0.492849


Iteration:  62%|██████▎   | 10/16 [00:10<00:05,  1.08it/s][A

0.472073


Iteration:  69%|██████▉   | 11/16 [00:11<00:04,  1.09it/s][A

0.442426


Iteration:  75%|███████▌  | 12/16 [00:12<00:03,  1.09it/s][A

0.498893


Iteration:  81%|████████▏ | 13/16 [00:13<00:02,  1.10it/s][A

0.541201


Iteration:  88%|████████▊ | 14/16 [00:14<00:01,  1.10it/s][A

0.508828


Iteration:  94%|█████████▍| 15/16 [00:15<00:00,  1.10it/s][A

0.536815


Iteration: 100%|██████████| 16/16 [00:16<00:00,  1.00s/it][A
Epoch:  33%|███▎      | 1/3 [00:16<00:32, 16.05s/it]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s][A

0.527742


Iteration:   6%|▋         | 1/16 [00:00<00:13,  1.10it/s][A

0.475472


Iteration:  12%|█▎        | 2/16 [00:01<00:12,  1.10it/s][A

0.486746


Iteration:  19%|█▉        | 3/16 [00:02<00:11,  1.10it/s][A

0.567740


Iteration:  25%|██▌       | 4/16 [00:03<00:10,  1.10it/s][A

0.443561


Iteration:  31%|███▏      | 5/16 [00:04<00:09,  1.10it/s][A

0.492978


Iteration:  38%|███▊      | 6/16 [00:05<00:09,  1.10it/s][A

0.463012


Iteration:  44%|████▍     | 7/16 [00:06<00:08,  1.11it/s][A

0.445318


Iteration:  50%|█████     | 8/16 [00:07<00:07,  1.10it/s][A

0.422389


Iteration:  56%|█████▋    | 9/16 [00:08<00:06,  1.10it/s][A

0.510062


Iteration:  62%|██████▎   | 10/16 [00:09<00:05,  1.10it/s][A

0.475671


Iteration:  69%|██████▉   | 11/16 [00:09<00:04,  1.11it/s][A

0.487049


Iteration:  75%|███████▌  | 12/16 [00:10<00:03,  1.11it/s][A

0.547127


Iteration:  81%|████████▏ | 13/16 [00:11<00:02,  1.11it/s][A

0.439794


Iteration:  88%|████████▊ | 14/16 [00:12<00:01,  1.10it/s][A

0.450694


Iteration:  94%|█████████▍| 15/16 [00:13<00:00,  1.10it/s][A

0.470267


Iteration: 100%|██████████| 16/16 [00:14<00:00,  1.11it/s][A
Epoch:  67%|██████▋   | 2/3 [00:30<00:15, 15.10s/it]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s][A

0.565314


Iteration:   6%|▋         | 1/16 [00:00<00:13,  1.11it/s][A

0.522709


Iteration:  12%|█▎        | 2/16 [00:01<00:12,  1.11it/s][A

0.482386


Iteration:  19%|█▉        | 3/16 [00:02<00:11,  1.11it/s][A

0.514656


Iteration:  25%|██▌       | 4/16 [00:03<00:10,  1.11it/s][A

0.488819


Iteration:  31%|███▏      | 5/16 [00:04<00:09,  1.11it/s][A

0.441887


Iteration:  38%|███▊      | 6/16 [00:05<00:09,  1.10it/s][A

0.466163


Iteration:  44%|████▍     | 7/16 [00:06<00:08,  1.10it/s][A

0.470273


Iteration:  50%|█████     | 8/16 [00:07<00:07,  1.11it/s][A

0.452852


Iteration:  56%|█████▋    | 9/16 [00:08<00:06,  1.11it/s][A

0.523107


Iteration:  62%|██████▎   | 10/16 [00:09<00:05,  1.10it/s][A

0.466437


Iteration:  69%|██████▉   | 11/16 [00:09<00:04,  1.10it/s][A

0.511291


Iteration:  75%|███████▌  | 12/16 [00:10<00:03,  1.11it/s][A

0.439552


Iteration:  81%|████████▏ | 13/16 [00:11<00:02,  1.11it/s][A

0.446591


Iteration:  88%|████████▊ | 14/16 [00:12<00:01,  1.11it/s][A

0.466913


Iteration:  94%|█████████▍| 15/16 [00:13<00:00,  1.11it/s][A

0.487008


Iteration: 100%|██████████| 16/16 [00:14<00:00,  1.11it/s][A
Epoch: 100%|██████████| 3/3 [00:44<00:00, 14.96s/it]


保存模型

In [20]:
model.save_pretrained('My_Model/weibo-bert-rubbish-model')

### 验证模型并评估精度

In [21]:
# 测试批次大小
test_batch_size = 64

# 创建测试数据采样器和数据加载器
test_sampler = SequentialSampler(test_dataset)  # 顺序采样器，用于顺序选择测试样本
test_dataloader = DataLoader(test_dataset, 
                             sampler=test_sampler, 
                             batch_size=test_batch_size)  # 创建测试数据加载器

# 加载之前保存的预训练模型
# model = model.from_pretrained('/outputs')

# 初始化预测和实际标签
preds = None
out_label_ids = None

# 将模型置于 eval 模式
model.eval()

for batch in tqdm(test_dataloader, desc="评估中"):
    # 将模型和输入数据移到GPU（如果可用）
    model.to(device)
    batch = tuple(t.to(device) for t in batch)
    
    # 在 eval 模式下不跟踪任何梯度
    with torch.no_grad():
        inputs = {
            'input_ids': batch[0],  # 输入特征ID
            'attention_mask': batch[1],  # 输入掩码
            'labels': batch[2]  # 标签
        }        

        # 通过模型进行前向传播
        outputs = model(**inputs)

        # 我们得到损失，因为我们提供了标签
        tmp_eval_loss, logits = outputs[:2]

        # 测试数据集可能包含多个批次的项目
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, 
                                      inputs['labels'].detach().cpu().numpy(), 
                                      axis=0)

# 计算最终损失、预测和准确度
preds = np.argmax(preds, axis=1)  # 获取预测类别
acc_score = accuracy_score(preds, out_label_ids)  # 计算准确度
f1_score = f1_score(preds, out_label_ids)  # 计算F1分数
print ('测试集中的Accuracy分数: ', acc_score)
print ('测试集中的F1分数: ', f1_score)

评估中: 100%|██████████| 4/4 [00:01<00:00,  3.38it/s]

测试集中的Accuracy分数:  0.8046875
测试集中的F1分数:  0.8299319727891157





### 预测

In [22]:
df_origin = pd.read_csv('data/weibo_origin.csv')
df_origin['label'] = 0 #统一初始化为0
df_origin['text'] = df_origin.message.str.replace('\n',' ')
df_origin

Unnamed: 0,id,userid,message,ts_created,label,text
0,4817910549973967,6339251717,记录一下合肥一日游[送花花][送花花],2022-09-26 11:27:30,0,记录一下合肥一日游[送花花][送花花]
1,4817738155690682,6080243764,真诚才是爱的秘密,2022-09-26 00:02:28,0,真诚才是爱的秘密
2,4819531555412098,5867909333,回家[给力],2022-09-30 22:48:48,0,回家[给力]
3,4818048550179045,3289196050,躺平大师,2022-09-26 20:35:52,0,躺平大师
4,4819813207117305,6495662753,终于[泪],2022-10-01 17:27:59,0,终于[泪]
...,...,...,...,...,...,...
25989,4820530244883475,6219366090,热傻了今天,2022-10-03 16:57:14,0,热傻了今天
25990,4821179631666949,6557015016,#挪威海底电缆断裂#北约除了干瞪眼！实在做不出别的事情 跟北溪二号一样 哑巴吃黄连，有苦...,2022-10-05 11:57:40,0,#挪威海底电缆断裂#北约除了干瞪眼！实在做不出别的事情 跟北溪二号一样 哑巴吃黄连，有苦...
25991,4821641589163051,5639007389,今天：合肥下雨了，晚上加班到十点半 不过还好，心情不是很糟糕 我还是心心念念想要养一只小猫,2022-10-06 18:33:19,0,今天：合肥下雨了，晚上加班到十点半 不过还好，心情不是很糟糕 我还是心心念念想要养一只小猫
25992,4818771782408232,2072019043,参观晚清重臣李鸿章故居，重温那段风云变幻的历史，不胜感慨！“时来天地皆同力，运去英雄不自由”...,2022-09-28 20:29:44,0,参观晚清重臣李鸿章故居，重温那段风云变幻的历史，不胜感慨！“时来天地皆同力，运去英雄不自由”...


In [23]:
X_pred=df_origin['text']
Y_pred=df_origin['label']
X_pred_tokens = X_pred.parallel_apply(get_tokens, args=(tokenizer, 150))

input_ids_pred = torch.tensor(
    [features[0] for features in X_pred_tokens.values], dtype=torch.long)
input_mask_pred = torch.tensor(
    [features[1] for features in X_pred_tokens.values], dtype=torch.long)
label_pred=torch.tensor(Y_pred.values,dtype=torch.long)
pred_dataset = TensorDataset(input_ids_pred,input_mask_pred,label_pred)

pred_batch_size = 256
pred_sampler = SequentialSampler(pred_dataset)
pred_dataloader = DataLoader(pred_dataset, 
                             sampler=pred_sampler, 
                             batch_size=pred_batch_size)

In [24]:
# 调用训练好的模型
model = model.from_pretrained('My_Model/weibo-bert-rubbish-model')
preds = None
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [25]:
# 预测
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for batch in tqdm(pred_dataloader, desc="Predict"):
    
    batch = tuple(t.to(device) for t in batch)
    
    with torch.no_grad():
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2]
        }

        outputs = model(**inputs)
        _, logits = outputs[:2]

        if preds is None:
            preds = logits.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)

Predict: 100%|██████████| 102/102 [00:36<00:00,  2.76it/s]


In [26]:
prob = torch.nn.functional.softmax(torch.tensor(preds), dim=1)  # 使用softmax函数计算预测的概率分布
preds = np.argmax(preds, axis=1)  # 计算每个样本的最终预测类别
df_origin['ad_prob'] = [p[1].item() for p in prob]  # 将概率分布的第二列（表示"1"类别的概率）添加到DataFrame中
df_origin['pred'] = preds  # 将最终的预测类别添加到DataFrame中

In [37]:
df_origin.sample(10)

Unnamed: 0,id,userid,message,ts_created,label,text,ad_prob,pred
13042,4822375830913754,2162769183,🍽☕️🍰,2022-10-08 19:10:55,0,🍽☕️🍰,0.378869,0
21280,4820922705642071,7119677364,冬季开业 现在不是冬季是什么[疑问],2022-10-04 18:56:44,0,冬季开业 现在不是冬季是什么[疑问],0.710206,1
8411,4821971952737979,6176975248,#国庆假期最后1天#如果今天你觉得七天很短，那么明天开始，你就会觉得七天很长了[裂开],2022-10-07 16:26:04,0,#国庆假期最后1天#如果今天你觉得七天很短，那么明天开始，你就会觉得七天很长了[裂开],0.671589,1
11910,4818458758614365,5605622990,分享图片,2022-09-27 23:45:53,0,分享图片,0.144969,0
4001,4819522551546846,6410240102,分享图片,2022-09-30 22:13:00,0,分享图片,0.144969,0
16594,4819645582803688,5367195554,刚刚知道主人将玩具扔柜顶上，猫咪首次捡回失败后，第二次2秒制定路线潇洒完成。 #刚刚知道#,2022-10-01 06:21:54,0,刚刚知道主人将玩具扔柜顶上，猫咪首次捡回失败后，第二次2秒制定路线潇洒完成。 #刚刚知道#,0.649496,1
18033,4821591760307368,3936801473,[doge],2022-10-06 15:15:19,0,[doge],0.708079,1
18908,4821670345048669,1282849795,#非摄不可##索尼大法好##索尼a7m4# 国庆假期肥东撮街遇到了舞龙表演，看个新鲜！ @...,2022-10-06 20:27:34,0,#非摄不可##索尼大法好##索尼a7m4# 国庆假期肥东撮街遇到了舞龙表演，看个新鲜！ @...,0.686128,1
14406,4820637301080697,5247208178,[月亮] [收到] 打工人！好快乐！！！,2022-10-04 00:02:37,0,[月亮] [收到] 打工人！好快乐！！！,0.726042,1
10655,4821544843346970,1740304344,兰花村,2022-10-06 12:08:53,0,兰花村,0.457931,0


## 多分类