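## Competition Background
With the rapid growth of internet content, online platforms face increasingly severe challenges in moderating prohibited information. Prohibited terms (for example politically sensitive content, pornography and vulgarity, violent crime, and religious superstition) keep evading traditional detection rules through dynamic variants such as homophones, abbreviations, inserted symbols, and mixed languages, which seriously threatens the safety of the online ecosystem and the user experience. Existing approaches based on keyword matching or simple rules struggle with semantic ambiguity and adversarial interference in complex contexts, so AI techniques are urgently needed to make prohibited-content detection more intelligent.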

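## Competition Task
The competition asks participants to build efficient machine learning or deep learning models that accurately identify the category of prohibited terms in text. To that end, a large-scale, desensitized dataset of real web text is provided, including but not limited to social media posts and news comments.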

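## Evaluation Rules
1. Data description

The data provided to participants consists of user posts collected from major social platforms. Each record contains the main text and its corresponding prohibited-content category for participants to use.

2. Evaluation metric

Submissions are scored on the submitted result file using the macro F1-score.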
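
Because macro F1 averages the per-class F1 scores without weighting by class size, a rare category counts as much as a frequent one. The following is a minimal pure-Python sketch of that computation, intended to agree with `f1_score(..., average="macro")` from scikit-learn. The toy labels are made up for illustration and are not taken from the competition data.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: the classifier always predicts the majority class.
y_true = ["色情", "色情", "色情", "政治敏感"]
y_pred = ["色情", "色情", "色情", "色情"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
```

In this toy case accuracy looks reasonable while macro F1 is much lower, because the class that is never predicted contributes an F1 of zero.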

In [3]:
import pandas as pd
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

In [4]:
train = pd.read_csv("dataset/train_all.csv")
test = pd.read_csv("dataset/test_text.csv")

In [13]:
category_list = list(train["类别"].unique())

In [14]:
train

Unnamed: 0,id,类别,文本
0,0,种族歧视,这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里...
1,1,政治敏感,刁进平
2,2,微侵犯(MA),一直都没发现这条惊天大驴，直到突然怀崽才发现，一怀孕就开始歪屁股 举报 ...
3,3,色情,干的爽
4,4,种族歧视,黑人在美国侮辱华人，那我支持白人警察打击黑人有什么问题・_・?
...,...,...,...
25095,25095,种族歧视,实际上，如果你不是白种女孩的话是很容易被歧视的，如果你是黑人，
25096,25096,政治敏感,大纪
25097,25097,色情,逃亡艳旅*
25098,25098,基于文化背景的刻板印象(SCB),并且总感觉是举着帮助女人学会经营情感的大旗干着教女人怎么讨好迎合男人的勾当


## TFIDF

In [9]:
tfidf = TfidfVectorizer(tokenizer=jieba.lcut)
train_tfidf = tfidf.fit_transform(train["文本"])
test_tfidf = tfidf.transform(test["文本"])

In [17]:
pred = cross_val_predict(
    LinearSVC(),
    train_tfidf,
    train["类别"]
)
f1_score(train["类别"], pred, average="macro")

0.5846150237635646
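
To make the features behind this baseline more concrete, here is a small pure-Python sketch of TF-IDF weighting with a smoothed IDF and L2 normalisation, which is the formulation `TfidfVectorizer` uses with its default settings. The token lists are hypothetical placeholders; in the notebook the tokens come from `jieba.lcut`.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """TF-IDF with smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

# Hypothetical pre-tokenised documents: "a" occurs everywhere, "b" only in the first one.
docs = [["a", "b", "b"], ["a", "c"], ["a", "d"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that appears in every document ("a") gets the smallest weight, while a term concentrated in one document ("b") dominates that document's vector.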

## BERT

In [1]:
import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import random
import re

x_train, x_test, train_label, test_label =  train_test_split(train["文本"].values, 
                                                             train["类别"].values, 
                                                             test_size=0.2, 
                                                             stratify=train["类别"].values)

In [10]:
from transformers import BertTokenizer
# Tokenizer and vocabulary

tokenizer = BertTokenizer.from_pretrained('/home/lyz/hf-models/bert-base-chinese')
train_encoding = tokenizer(list(x_train), truncation=True, padding=True, max_length=64)
test_encoding = tokenizer(list(x_test), truncation=True, padding=True, max_length=64)

In [15]:
# Dataset wrapper
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    # Fetch a single sample
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(category_list.index(self.labels[idx])))
        return item
    
    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encoding, train_label)
test_dataset = NewsDataset(test_encoding, test_label)
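
`category_list.index(...)` scans the list on every `__getitem__` call. That is harmless with a handful of categories, but a dictionary built once gives the same index and also provides the reverse mapping needed when decoding predictions. A minimal sketch, assuming a hypothetical `category_list` that holds only the first few categories visible in the sample rows:

```python
# Hypothetical subset of the categories; the real list comes from train["类别"].unique().
category_list = ["种族歧视", "政治敏感", "微侵犯(MA)", "色情"]

# Build both directions once instead of calling category_list.index() per sample.
label2id = {name: idx for idx, name in enumerate(category_list)}
id2label = dict(enumerate(category_list))

print(label2id["色情"])
print(id2label[1])
```

The names `label2id` and `id2label` are illustrative and do not appear in the original notebook.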

In [16]:
train_dataset[0]

{'input_ids': tensor([ 101, 1762,  704, 1066, 2190, 1920, 3791, 1355, 1220, 3655, 6999, 6833,
         2154, 1400, 8024, 3300,  782, 2218, 3221, 1728,  711,  704, 1066, 4638,
         6439, 6470, 2798, 1343,  749, 6237, 1920, 3791, 8024,  794, 5445, 2533,
         3791, 4638, 8024, 3300, 4638, 3221,  711,  749, 1091, 2821, 1161, 1920,
         3791, 4638, 3152, 4995, 3341,  749, 6237, 1920, 3791, 8024, 3297, 1400,
         3152, 4995, 3766,  102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor(4)}

In [18]:
# Accuracy computation
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
model = BertForSequenceClassification.from_pretrained('/home/lyz/hf-models/bert-base-chinese', num_labels=len(category_list))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Wrap the datasets in batched loaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)  # the held-out split does not need shuffling

# Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /home/lyz/hf-models/bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
# Training loop for one epoch
# (named train_one_epoch so it does not shadow the `train` DataFrame loaded earlier)
def train_one_epoch(epoch):
    model.train()
    total_train_loss = 0
    iter_num = 0
    total_iter = len(train_loader)
    for batch in train_loader:
        # Reset gradients, then run the forward pass
        optim.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()
        
        # Backward pass with gradient clipping
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Parameter update
        optim.step()

        iter_num += 1
        if iter_num % 100 == 0:
            print("epoth: %d, iter_num: %d, loss: %.4f, %.2f%%" % (epoch, iter_num, loss.item(), iter_num/total_iter*100))
        
    print("Epoch: %d, Average training loss: %.4f"%(epoch, total_train_loss/len(train_loader)))
    
def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass only, no gradients
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        
        loss = outputs[0]
        logits = outputs[1]

        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        
    # Mean of per-batch accuracies over the held-out split
    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f"%(total_eval_loss/len(test_dataloader)))
    print("-------------------------------")
    

for epoch in range(2):
    print("------------Epoch: %d ----------------" % epoch)
    train_one_epoch(epoch)
    validation()

------------Epoch: 0 ----------------
epoth: 0, iter_num: 100, loss: 0.4526, 7.97%
epoth: 0, iter_num: 200, loss: 0.3962, 15.94%
epoth: 0, iter_num: 300, loss: 0.1823, 23.90%
epoth: 0, iter_num: 400, loss: 0.2082, 31.87%
epoth: 0, iter_num: 500, loss: 0.2768, 39.84%
epoth: 0, iter_num: 600, loss: 0.0532, 47.81%
epoth: 0, iter_num: 700, loss: 0.8204, 55.78%
epoth: 0, iter_num: 800, loss: 0.6132, 63.75%
epoth: 0, iter_num: 900, loss: 0.1600, 71.71%
epoth: 0, iter_num: 1000, loss: 0.5317, 79.68%
epoth: 0, iter_num: 1100, loss: 0.8300, 87.65%
epoth: 0, iter_num: 1200, loss: 0.3585, 95.62%
Epoch: 0, Average training loss: 0.3998
Accuracy: 0.9145
Average testing loss: 0.2834
-------------------------------
------------Epoch: 1 ----------------
epoth: 1, iter_num: 100, loss: 0.1693, 7.97%
epoth: 1, iter_num: 200, loss: 0.1318, 15.94%
epoth: 1, iter_num: 300, loss: 0.1054, 23.90%
epoth: 1, iter_num: 400, loss: 0.4504, 31.87%
epoth: 1, iter_num: 500, loss: 0.0180, 39.84%
epoth: 1, iter_num: 600

KeyboardInterrupt: 

In [25]:
test_encoding = tokenizer(list(test["文本"]), truncation=True, padding=True, max_length=64)
# NewsDataset requires labels, so a placeholder class is passed; it does not affect the predictions
test_dataset = NewsDataset(test_encoding, ['种族歧视'] * len(test["文本"]))
test_dataloader = DataLoader(test_dataset, batch_size=16, shuffle=False)

In [31]:
label = []
model.eval()  # switch off dropout for inference
for batch in test_dataloader:
    with torch.no_grad():
        # Forward pass only; labels are not needed for prediction
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
    pred = outputs.logits.cpu().numpy().argmax(1)
    label += [category_list[x] for x in pred]

In [33]:
test["类别"] = label

In [36]:
test[["id", "类别"]].to_csv("submit_bert.csv", index=False)

## Qwen

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 

from datasets import Dataset
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForSeq2Seq, TrainingArguments, Trainer, GenerationConfig

train_data = pd.read_csv("dataset/train_all.csv")
test = pd.read_csv("dataset/test_text.csv")

2025-06-30 21:08:48.798987: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
train_data["input"] = ""
train_data.columns = ["id","output", "instruction", "input"]
ds = Dataset.from_pandas(train_data)
ds[:3]

{'id': [0, 1, 2],
 'output': ['种族歧视', '政治敏感', '微侵犯(MA)'],
 'instruction': ['这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里的坏人就必须是官二代城市人的身份咯？少民肯定也不行咯？女人LGBT黑人宗教人士肯定也都不行咯？ 一群白（黄？）左在下面跟着高潮也是醉了',
  '刁进平',
  '一直都没发现这条惊天大驴，直到突然怀崽才发现，一怀孕就开始歪屁股    \xa0举报    \xa0        赞[12]        \xa0回复        \xa0    05月12日 00:29\xa0来自网页'],
 'input': ['', '', '']}

In [3]:
tokenizer = AutoTokenizer.from_pretrained("/home/lyz/hf-models/Qwen/Qwen1.5-1.8B-Chat/", use_fast=False, trust_remote_code=True)
tokenizer

Qwen2Tokenizer(name_or_path='/home/lyz/hf-models/Qwen/Qwen1.5-1.8B-Chat/', vocab_size=151643, model_max_length=32768, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [4]:
def process_func(example):
    MAX_LENGTH = 128    # The tokenizer may split one Chinese character into several tokens, so leave some headroom to keep samples intact
    input_ids, attention_mask, labels = [], [], []
    instruction = tokenizer(f"<|im_start|>system\n违禁文本分类<|im_end|>\n<|im_start|>user\n{example['instruction'] + example['input']}<|im_end|>\n<|im_start|>assistant\n", add_special_tokens=False)  # add_special_tokens=False: do not prepend special tokens
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the end marker (here the pad token <|endoftext|>) must be attended to as well, so its mask is 1
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]  
    if len(input_ids) > MAX_LENGTH:  # truncate over-long samples
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
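
Two details of `process_func` are easier to see on toy data: prompt positions are labelled `-100` so that the loss only covers the answer, and truncating from the tail removes the answer first when the text is long. The sketch below uses made-up token ids instead of a real tokenizer and shows one possible alternative that shortens only the user text. The helper name `build_example` and its parameters are hypothetical, not part of the notebook.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_example(prefix_ids, text_ids, suffix_ids, response_ids, end_id, max_length):
    """Assemble one causal-LM training sample, truncating only the user text."""
    # Tokens that must always survive: chat template parts, the answer, the end marker.
    fixed = len(prefix_ids) + len(suffix_ids) + len(response_ids) + 1
    budget = max(max_length - fixed, 0)
    prompt_ids = prefix_ids + text_ids[:budget] + suffix_ids
    input_ids = prompt_ids + response_ids + [end_id]
    # Prompt positions are masked out; only the answer and end marker are supervised.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [end_id]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }

# Made-up token ids standing in for real tokenizer output.
sample = build_example(
    prefix_ids=[1, 2], text_ids=list(range(100, 120)), suffix_ids=[3, 4],
    response_ids=[50, 51], end_id=9, max_length=12,
)
print(sample["input_ids"])
print(sample["labels"])
```

With this layout the supervised answer tokens always remain at the end of the sample, as long as the fixed parts alone fit within `max_length`.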

In [5]:
tokenized_id = ds.map(process_func, remove_columns=ds.column_names)
tokenizer.decode(tokenized_id[0]['input_ids'])

Map:   0%|          | 0/25100 [00:00<?, ? examples/s]

'<|im_start|>system\n违禁文本分类<|im_end|>\n<|im_start|>user\n这政治正确玩的，真NM6。 剧黑农民了么？农民出身的官员确实有腐败现象能否认么？以后反腐剧里的坏人就必须是官二代城市人的身份咯？少民肯定也不行咯？女人LGBT黑人宗教人士肯定也都不行咯？ 一群白（黄？）左在下面跟着高潮也是醉了<|im_end|>\n<|im_start|>assistant\n种族歧视<|endoftext|>'

In [6]:
tokenizer.decode(list(filter(lambda x: x != -100, tokenized_id[1]["labels"])))

'政治敏感<|endoftext|>'

In [7]:
train_ds = Dataset.from_pandas(train_data.iloc[:-1000])
eval_ds = Dataset.from_pandas(train_data[-1000:])

train_tokenized_id = train_ds.map(process_func, remove_columns=ds.column_names)
eval_tokenized_id = eval_ds.map(process_func, remove_columns=ds.column_names)

Map:   0%|          | 0/24100 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [8]:
import torch

model = AutoModelForCausalLM.from_pretrained("/home/lyz/hf-models/Qwen/Qwen1.5-1.8B-Chat/", device_map="auto")
model.enable_input_require_grads() # required when gradient checkpointing is enabled

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


In [9]:
from peft import LoraConfig, TaskType, get_peft_model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, 
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False, # training mode
    r=8, # LoRA rank
    lora_alpha=32, # LoRA alpha; the low-rank update is scaled by lora_alpha / r
    lora_dropout=0.1 # dropout rate
)

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 7,495,680 || all params: 1,844,324,352 || trainable%: 0.4064187512284173


In [10]:
args = TrainingArguments(
    output_dir="./output_Qwen1.5",
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,
    logging_steps=100,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=50,
    num_train_epochs=5,
    save_steps=50,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    load_best_model_at_end=True
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tokenized_id,
    eval_dataset=eval_tokenized_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss,Validation Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 1.24 GiB. GPU 0 has a total capacity of 10.90 GiB of which 508.44 MiB is free. Including non-PyTorch memory, this process has 10.40 GiB memory in use. Of the allocated memory 9.84 GiB is allocated by PyTorch, and 403.75 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
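
A rough back-of-the-envelope estimate helps explain this error. The sketch below only counts the memory needed to store the parameters, using the parameter count and GPU capacity reported above. It assumes the model was loaded in float32, which is plausible because no `torch_dtype` was passed to `from_pretrained`, but the notebook does not confirm it. Activations, gradients, optimizer state, and framework overhead come on top of these figures.

```python
GIB = 1024 ** 3

def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed just to hold the parameters, in GiB."""
    return n_params * bytes_per_param / GIB

total_params = 1_844_324_352  # "all params" reported by print_trainable_parameters()
gpu_capacity_gib = 10.90      # total capacity reported in the OutOfMemoryError

fp32 = weight_memory_gib(total_params, 4)  # assumption: weights stored as float32
half = weight_memory_gib(total_params, 2)  # the same weights in a 16-bit dtype

print(f"float32 weights: {fp32:.2f} GiB")
print(f"16-bit weights:  {half:.2f} GiB")
print(f"headroom with float32 weights: {gpu_capacity_gib - fp32:.2f} GiB")
```

Under that assumption most of the card is taken up by the weights alone, so options worth trying include loading the model in a 16-bit dtype through the `torch_dtype` argument of `from_pretrained`, lowering `per_device_train_batch_size`, or reducing `MAX_LENGTH`. None of these was tested in this notebook.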

In [None]:
from tqdm.auto import tqdm

pred_label = []
model.eval()
for test_text in tqdm(test["文本"].values):
    messages = [
        # Same system prompt as in process_func, so inference matches the training format
        {"role": "system", "content": "违禁文本分类"},
        {"role": "user", "content": test_text}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # `device` is not defined in this session, so use the model's own device
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=16,  # a category name is only a few tokens long
    )
    # Keep only the newly generated tokens, dropping the prompt
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    pred_label += [response]
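
A generative model returns free text, so a response is not guaranteed to be an exact category name. One possible post-processing step, sketched here with a hypothetical helper `normalize_prediction`, maps each response onto a known category and falls back to a default otherwise. The category list contains only the labels visible in the sample rows of this notebook, and the choice of default is arbitrary.

```python
def normalize_prediction(response, categories, default):
    """Map a free-form generation onto one of the known category names."""
    cleaned = response.strip()
    if cleaned in categories:
        return cleaned
    # Fall back to the first category whose name appears inside the response.
    for name in categories:
        if name in cleaned:
            return name
    return default

# Categories visible in the sample rows; the full list is not shown in this notebook.
categories = ["种族歧视", "政治敏感", "微侵犯(MA)", "色情", "基于文化背景的刻板印象(SCB)"]

print(normalize_prediction(" 色情\n", categories, default="政治敏感"))
print(normalize_prediction("类别：种族歧视。", categories, default="政治敏感"))
print(normalize_prediction("无法判断", categories, default="政治敏感"))
```

Applied to every entry of `pred_label`, this would guarantee that the submission file contains only valid category names.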