# Hyperpartisan Bias Analysis of Foreign-Language News with SKEP (In Progress)

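# 1. Project Background
Text sentiment analysis means, simply put, giving a model a piece of text and having it judge the text's emotional tone; it is one of the classic classification tasks in NLP. Sentiment analysis is applied in scenarios such as product preference, purchase decisions, and public-opinion analysis.  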
  
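Sentiment analysis research usually aims to decide whether a text is positive, negative, or neutral. Polarity is not the only dimension of a text's stance, however: foreign-language news articles can also be hyperpartisan, and for such texts detecting hyperpartisan bias is just as important.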
  
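Hyperpartisan content ignores objective facts and argues purely from party interest, opposing for the sake of opposing; it is most common in Western countries. Hyperpartisan news tends to be neither objective nor neutral, which undermines its credibility, so readers of foreign-language news often need to recognize such bias. Using machine learning to analyze news text and detect hyperpartisan bias can support news filtering and recommendation, and therefore has research value.  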
  
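This project uses the dataset of the Hyperpartisan News Detection 2019 competition and the SKEP model to analyze hyperpartisan bias in foreign-language news. It currently reaches an accuracy of 0.90625 on the development set and is still being optimized.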

# 2. Data Processing

## 2.1 Loading and Inspecting the Data

In [15]:
# Load the data
from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset('hyp', splits=["train", "dev", "test"])
# Inspect a few samples
print(test_ds[0])
print(test_ds[1])
print(test_ds[2])
print(test_ds[3])

{'text': '<p>A woman is facing charges as part an investigation by the Manatee County Sheriff’s Office to crack down on illegal slot machines.</p> <p>Phalla Colman, 54, was arrested Thursday and charged with operating an illegal gambling establishment, operating an illegal lottery and possession of slot machines, according to a news release.</p> <p>Detectives began investigating Colman’s business at 3325 15 St. E., Bradenton, several weeks ago after she was suspected of running an illegal internet café or gambling establishment. Detectives were able to develop the case and obtain warrants for Colman’s arrest and the search the establishment, the release said.</p> <p>On Thursday, they searched her business and arrested her. Detectives seized 111 computer monitors, 79 computer towers and $3,415 in currency.</p> Breaking News <p>Be the first to know when big news breaks</p> <p>Recaptcha requires verification.</p> <p>protected by reCAPTCHA</p> <p><a href="https://www.google.com/intl/en/pol

Consider the text field of one of the records:  
  {'text': '<p>The FBI is advising people to hang up if they receive a call from a woman screaming for help.</p> <p>A decades-old <a href="https://www.fbi.gov/news/stories/virtual-kidnapping" type="external">virtual kidnapping scam is placing more U.S. residents at risk of becoming potential victims</a>, the FBI warned on its website.</p> <p>The scheme takes many forms, but basically callers trick victims into paying ransoms to free family or friends who have been “kidnapped.” The virtual abductors coerce victims to pay a ransom before victims can find out that no one has been kidnapped.</p> <p>The FBI had been tracking virtual kidnapping calls primarily from prisons in Mexico between 2013 and 2015. The callers targeted individuals who spoke Spanish, the FBI said. Most of the victims were from the Los Angeles and Houston areas.</p> Today\'s top news by email <p>The local news you need to start your day</p> <p>Recaptcha requires verification.</p> <p>protected by reCAPTCHA</p> <p><a href="https://www.google.com/intl/en/policies/privacy/" type="external">Privacy</a> - <a href="https://www.google.com/intl/en/policies/terms/" type="external">Terms</a></p> <p><a href="https://www.google.com/intl/en/policies/privacy/" type="external">Privacy</a> - <a href="https://www.google.com/intl/en/policies/terms/" type="external">Terms</a></p> <p>The FBI issued the warning because kidnappers have widened their pool of potential victims by no longer targeting only specific individuals and Spanish speakers. The callers also are cold-calling numbers in various cities.</p> <p>In a recent investigation, the FBI found that more than 80 people have fallen victim to the new tactics in California, Minnesota, Idaho and Texas. 
Those victims paid more than $87,000 in ransom.</p> <p>An FBI spokeswoman for the Kansas City offices said she was not aware of any instances that have occurred in Kansas or Missouri where the FBI was involved.</p> <p>The scam works this way: The potential victim answers a call and hears a woman screaming, “Help me!” The victim might blurt out a name, like Mary, asking her if she’s OK. At that point, the caller will tell the victim that Mary has been kidnapped and she will be harmed if ransom isn’t paid quickly.</p> <p>The scam is successful when victims don’t know the whereabouts of their loved ones.</p> <p>The scammers typically demand that the ransoms are wired to Mexico. The amount is typically less than $2,000 because of legal restrictions for wiring larger amounts across the border, the FBI said.</p> <p>However, two victims in Houston were coerced to pay larger ransoms by making money drops.</p> <p>A federal grand jury charged a 34-year-old Houston woman in July with 10 counts, including wire fraud and money laundering, for her involvement in the scam. The charges are the first federal indictment in a virtual kidnapping case, the FBI said.</p> <p>
Robert A. Cronkleton: <a href="" type="external">816-234-4261</a>, <a href="https://twitter.com/cronkb" type="external">@cronkb</a></p> <p>How to avoid falling victim</p> <p>If you receive a call from someone demanding ransom for an alleged kidnap victim, the FBI suggests:</p> Hang up the phone. Don’t call out your loved one’s name if you do respond to call. Request to speak with the alleged victim. Ask questions only the alleged victim would know. Listen to the voice of the alleged victim if they do speak. Try to contact the alleged victim on the phone, text or social media. Don’t agree to pay a ransom. <p>If you believe a real kidnapping has taken place, call 911.</p> <p>Source: FBI</p>', 'label': 0}

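The record shows that each sample consists of two parts, a text and a label. The text is HTML and contains the news headline, the source website, other external links from the news page, and links to the social-media pages of people quoted in the article.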
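Because the raw text is HTML, a good share of each record is markup and link residue rather than article prose. The sketch below is one possible cleaning step and is not part of this notebook's pipeline: it uses Python's standard `html.parser` to keep only the visible text. The helper names `_TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "div", "br", "li"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")


def strip_html(raw):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


sample = ('<p>The FBI warned on its <a href="https://www.fbi.gov" '
          'type="external">website</a>.</p> <p>Source: FBI</p>')
print(strip_html(sample))
```

Whether stripping markup helps the classifier is an open question for this dataset; it does free up room within the sequence-length limit for actual article text.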

## 2.2 Converting Text into Features

In [3]:
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

In [4]:
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en")

In [5]:
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    
    # Convert the raw example into the format the model reads; encoded_inputs
    # is a dict with fields such as input_ids and token_type_ids
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: the vocabulary ids of the tokens produced by tokenizing the text
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether each token belongs to sentence 1 or sentence 2 (the segment ids)
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
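To make the two returned fields concrete, here is a toy stand-in for the tokenizer call. It is only an illustration under stated assumptions: `toy_encode` and its six-word vocabulary are hypothetical, it splits on whitespace, and it adds no special tokens, all of which differ from the real `SkepTokenizer`.

```python
def toy_encode(text, vocab, max_seq_len):
    """Toy stand-in for a tokenizer call: whitespace split, id lookup, truncation."""
    # Truncate to at most max_seq_len tokens.
    tokens = text.lower().split()[:max_seq_len]
    # Words missing from the vocabulary share the id of the unknown token.
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # Single-sentence input: every token belongs to segment 0.
    token_type_ids = [0] * len(input_ids)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


vocab = {"[UNK]": 0, "the": 1, "fbi": 2, "issued": 3, "a": 4, "warning": 5}
encoded = toy_encode("The FBI issued a new warning", vocab, max_seq_len=5)
print(encoded["input_ids"])       # "new" is unknown, "warning" is truncated away
print(encoded["token_type_ids"])
```

Since each news article is fed as a single text, all of its segment ids are the same; the second segment only matters for sentence-pair tasks.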

## 2.3 Batchify and Data Loading

In [7]:
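# Batch size
batch_size = 20
# Maximum length of a text sequence
max_seq_length = 256

# Convert each sample into the format the model reads
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Assemble samples into batches:
# pad text sequences of different lengths to the longest sequence in the batch,
# and stack the labels of all samples together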
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
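The effect of `Tuple(Pad, Pad, Stack)` can be sketched in plain Python. This is a simplified illustration rather than PaddleNLP's implementation: the helpers `pad_batch` and `collate` are hypothetical, they return nested lists instead of arrays, and padding on the right with a pad id of 0 is assumed.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


def collate(samples, pad_token_id=0, pad_token_type_id=0):
    """Mimic Tuple(Pad, Pad, Stack): one output per field of the samples."""
    input_ids, token_type_ids, labels = zip(*samples)
    return [
        pad_batch(input_ids, pad_token_id),            # like Pad for input_ids
        pad_batch(token_type_ids, pad_token_type_id),  # like Pad for token_type_ids
        [list(label) for label in labels],             # like Stack for labels
    ]


# Two samples in the (input_ids, token_type_ids, label) layout of convert_example.
samples = [
    ([5, 6, 7], [0, 0, 0], [1]),
    ([8, 9], [0, 0], [0]),
]
batch_ids, batch_types, batch_labels = collate(samples)
print(batch_ids)     # [[5, 6, 7], [8, 9, 0]]
print(batch_types)   # [[0, 0, 0], [0, 0, 0]]
print(batch_labels)  # [[1], [0]]
```

Padding to the longest sequence in each batch, rather than to `max_seq_length`, keeps batches of short articles small.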

# 3. Model Training

## 3.1 Model Overview
  
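In recent years, a large body of research has shown that pretrained models (PTMs) trained on large corpora learn general-purpose language representations, which benefit downstream NLP tasks and avoid training a model from scratch. With growing computing power, the arrival of deep models such as the Transformer, and better training techniques, PTMs have steadily evolved from shallow to deep.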

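SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) is a sentiment pretraining algorithm proposed by a Baidu research team and accepted at ACL 2020. It mines sentiment knowledge automatically with an unsupervised method and then uses that knowledge to construct pretraining objectives, so that the model learns to understand sentiment semantics. SKEP provides a unified and powerful sentiment representation for a range of sentiment analysis tasks, and it surpassed the previous state of the art on 14 typical Chinese and English sentiment analysis tasks.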

![](https://ai-studio-static-online.cdn.bcebos.com/f83b5220eab54333b3937ec83d88a01c86e54c106af944cabbe3fa3e3c837edc)  
The Baidu research team further validated SKEP on three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling, covering 14 Chinese and English datasets in total.



## 3.2 Model Setup

In [None]:
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_2.0_large_en", num_classes=len(train_ds.label_list))

In [8]:
import time

from utils import evaluate

# Number of training epochs
epochs = 3
# Directory for saving model parameters during training
ckpt_dir = "skep_ckpt"
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()
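`num_training_steps` is computed above but never used: the optimizer runs with a constant learning rate of `2e-5`. A common use for the total step count is a linear warmup-then-decay schedule. The sketch below shows that schedule in plain Python; the function name `linear_warmup_decay` and the 10% warmup proportion are assumptions for illustration, not settings from this project.

```python
def linear_warmup_decay(step, total_steps, base_lr=2e-5, warmup_proportion=0.1):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 26 steps per epoch is inferred from this notebook's training log
# (global step 30 is batch 4 of epoch 2); 3 epochs as configured above.
total_steps = 26 * 3
for step in (0, 7, 40, 78):
    print("step %2d -> lr %.2e" % (step, linear_warmup_decay(step, total_steps)))
```

Whether such a schedule improves on the constant rate for this small dataset would have to be checked experimentally.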

## 3.3 Training

In [None]:
# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the batch to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        
        # Backpropagate the gradients and update the parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 10 == 0:
            save_dir = './output'
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the current model on the dev set
            evaluate(model, criterion, metric, dev_data_loader)
            # Save the current model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary and related files
            tokenizer.save_pretrained(save_dir)

global step 10, epoch: 1, batch: 10, loss: 0.19500, accu: 0.93724, speed: 0.77 step/s
eval loss: 0.89132, accu: 0.78125
global step 20, epoch: 1, batch: 20, loss: 0.09746, accu: 0.98500, speed: 0.57 step/s
eval loss: 0.41448, accu: 0.89062
global step 30, epoch: 2, batch: 4, loss: 0.00545, accu: 1.00000, speed: 0.57 step/s
eval loss: 0.44573, accu: 0.89062
global step 40, epoch: 2, batch: 14, loss: 0.03099, accu: 1.00000, speed: 0.56 step/s
eval loss: 0.49891, accu: 0.89062
global step 50, epoch: 2, batch: 24, loss: 0.00080, accu: 0.99500, speed: 0.61 step/s
eval loss: 0.40838, accu: 0.90625
global step 60, epoch: 3, batch: 8, loss: 0.00086, accu: 1.00000, speed: 0.57 step/s
eval loss: 0.39832, accu: 0.89062
global step 70, epoch: 3, batch: 18, loss: 0.00238, accu: 1.00000, speed: 0.57 step/s
eval loss: 0.37430, accu: 0.90625
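The training loop overwrites `./output` at every evaluation, so the parameters left on disk are the most recent ones rather than the best ones. As a sketch of how the best evaluation point could be identified, the snippet below parses the log printed above and picks the step with the highest dev accuracy, breaking ties by lower loss. The helper `parse_evals` and the tie-breaking rule are assumptions for illustration.

```python
import re

# Training log copied from the output of the training cell.
LOG = """\
global step 10, epoch: 1, batch: 10, loss: 0.19500, accu: 0.93724, speed: 0.77 step/s
eval loss: 0.89132, accu: 0.78125
global step 20, epoch: 1, batch: 20, loss: 0.09746, accu: 0.98500, speed: 0.57 step/s
eval loss: 0.41448, accu: 0.89062
global step 30, epoch: 2, batch: 4, loss: 0.00545, accu: 1.00000, speed: 0.57 step/s
eval loss: 0.44573, accu: 0.89062
global step 40, epoch: 2, batch: 14, loss: 0.03099, accu: 1.00000, speed: 0.56 step/s
eval loss: 0.49891, accu: 0.89062
global step 50, epoch: 2, batch: 24, loss: 0.00080, accu: 0.99500, speed: 0.61 step/s
eval loss: 0.40838, accu: 0.90625
global step 60, epoch: 3, batch: 8, loss: 0.00086, accu: 1.00000, speed: 0.57 step/s
eval loss: 0.39832, accu: 0.89062
global step 70, epoch: 3, batch: 18, loss: 0.00238, accu: 1.00000, speed: 0.57 step/s
eval loss: 0.37430, accu: 0.90625
"""

STEP_RE = re.compile(r"global step (\d+),")
EVAL_RE = re.compile(r"eval loss: ([\d.]+), accu: ([\d.]+)")


def parse_evals(log):
    """Pair each 'eval' line with the most recent 'global step' line."""
    evals, step = [], None
    for line in log.splitlines():
        step_match = STEP_RE.match(line)
        eval_match = EVAL_RE.match(line)
        if step_match:
            step = int(step_match.group(1))
        elif eval_match and step is not None:
            evals.append((step,
                          float(eval_match.group(1)),
                          float(eval_match.group(2))))
    return evals


evals = parse_evals(LOG)
# Highest accuracy wins; ties are broken by the lower evaluation loss.
best = max(evals, key=lambda e: (e[2], -e[1]))
print("best step %d: eval loss %.5f, accu %.5f" % best)
```

Inside the loop, the same comparison could gate the `save_pretrained` calls so that only an improved model overwrites `./output`. That would require `evaluate` to return the accuracy, which the `utils.evaluate` helper used here is not shown to do.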


# 4. Model Testing

In [9]:
import numpy as np
import paddle

# Prepare the test set
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

In [10]:
params_path = 'output/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load the model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

Loaded parameters from output/model_state.pdparams


In [11]:
label_map = {0: '0', 1: '1'}
results = []
# Switch the model to evaluation mode, which disables dropout and other sources of randomness
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids = batch
    # Feed the batch to the model
    logits = model(input_ids, token_type_ids)
    # Predict the class
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    results.extend(labels)
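The prediction step, softmax over the logits followed by argmax, can be traced on hand-written numbers. The sketch below is plain Python with made-up logits; the helpers `softmax` and `predict` are hypothetical and only mirror what the loop above does per batch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(batch_logits, label_map):
    """Return (label, probability) for the most likely class of each row."""
    results = []
    for logits in batch_logits:
        probs = softmax(logits)
        idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
        results.append((label_map[idx], probs[idx]))
    return results


label_map = {0: '0', 1: '1'}
batch_logits = [[2.0, -1.0], [0.5, 1.5]]  # made-up values for two articles
for label, prob in predict(batch_logits, label_map):
    print("label %s with probability %.4f" % (label, prob))
```

Softmax does not change the ordering of the logits, so taking the argmax of the logits directly yields the same labels; the probabilities are only needed if a confidence score is wanted.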

In [13]:
res_dir = "result"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
# Write the predictions
with open(os.path.join(res_dir, 'ChnSentiCorp.tsv'), 'w', encoding="utf8") as f:
    f.write("prediction\n")
    for label in results:
        f.write(label+"\n")
