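## 1. Task Description
A clinical trial is a scientific study conducted on human volunteers (also called subjects) to determine the efficacy, safety, and side effects of a drug or treatment. Such trials play a key role in advancing medicine and improving human health. Depending on the purpose of the trial, subjects may be patients or healthy volunteers. Eligibility criteria are the main indicators, drawn up by the trial's investigators, used to decide whether a subject qualifies for a given clinical trial. They are divided into inclusion criteria and exclusion criteria, and are generally written as unstructured free text. Subject recruitment is usually done by manually comparing medical records against the eligibility criteria, which is time-consuming, labor-intensive, and inefficient. As a result, clinical trials face many difficulties: for example, slow recruitment prevents trials from finishing on schedule, and dropout of enrolled patients undermines a trial's validity. In recent years, as clinical trial designs have become more complex and trials more numerous, systems based on natural language processing and information extraction have begun to emerge in subject recruitment, showing good results and considerable promise for practical application and clinical value. Most of this research focuses on English eligibility criteria and English electronic health records. Research on Chinese electronic health data has also made considerable progress, but natural language processing research on Chinese clinical trial eligibility criteria remains scarce. This task arose against that background and was released as an evaluation task at the CHIP2019 conference (http://cips-chip.org.cn/).
The main goal of this evaluation task is to classify clinical trial eligibility criteria. All text data comes from real clinical trials: the short texts are taken from the eligibility criteria section of the trial information published on the Chinese Clinical Trial Registry website (http://chictr.org.cn/). The data is open and transparent, and the official site also provides download links.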

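## 2. Task Specification
In this evaluation, we provide 44 predefined semantic categories of eligibility criteria (see category.xlsx in the attachment) and a set of sentences describing Chinese clinical trial eligibility criteria. Participants must return the category of each criterion.
Examples of annotated data (ID, criterion text, category):
- S1 年龄>80岁 (age > 80 years) Age
- S2 近期颅内或椎管内手术史 (recent history of intracranial or intraspinal surgery) Therapy or Surgery
- S3 血糖<2.7mmol/L (blood glucose < 2.7 mmol/L) Laboratory Examinations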
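## 3. Evaluation Metric
This task is evaluated with the macro-averaged F1 score (Macro-F1, also called Average-F1), and the final ranking is based on Macro-F1. Suppose there are n categories, C1, …, Ci, …, Cn.
- Precision Pi = number of samples correctly predicted as Ci / number of samples predicted as Ci.
- Recall Ri = number of samples correctly predicted as Ci / number of samples whose true category is Ci.
- $\text{Macro-F1} = \frac{1}{n}\sum_{i=1}^{n}\frac{2 P_i R_i}{P_i + R_i}$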
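
As a small worked illustration of the metric, the following pure-Python sketch computes per-class precision, recall, and F1 and then averages them. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the official scorer; the sketch also assumes that a class with no predictions or no true samples contributes an F1 of 0.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Macro-F1 over a fixed label list (hypothetical helper).

    Assumption: a class whose precision + recall is 0 contributes F1 = 0.
    """
    pred_count = Counter(y_pred)   # samples predicted as each class
    true_count = Counter(y_true)   # samples truly in each class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    f1_scores = []
    for c in labels:
        precision = correct[c] / pred_count[c] if pred_count[c] else 0.0
        recall = correct[c] / true_count[c] if true_count[c] else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)


# Toy data: three classes, four samples.
y_true = ["Age", "Age", "Disease", "Consent"]
y_pred = ["Age", "Disease", "Disease", "Age"]
labels = ["Age", "Disease", "Consent"]
score = macro_f1(y_true, y_pred, labels)
print(score)
```

In this toy case the per-class F1 values are 1/2 (Age), 2/3 (Disease), and 0 (Consent), so rare classes that are never predicted pull the macro average down as much as frequent ones.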
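## 4. Evaluation Data
The evaluation provides 22,962 training examples, 7,682 validation examples, and 10,000 test examples (note: the leaderboard test set is not the same as the test set of the original CHIP evaluation task; 10,000 examples were newly annotated).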

In [1]:
# dataset
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import json


class TrainDataset(Dataset):
    # Labeled dataset (used for both the training and dev splits)
    def __init__(self, data_path):
        with open(data_path, encoding="utf-8") as f:
            self.data_list = json.load(f)

    def __getitem__(self, index):
        # Return the (label, text) pair stored at the given index
        text = self.data_list[index]["text"]
        label = self.data_list[index]["label"]
        return (label, text)

    def __len__(self):
        # Return the total number of examples
        return len(self.data_list)


class TestDataset(Dataset):
    # Unlabeled test dataset
    def __init__(self, data_path):
        with open(data_path, encoding="utf-8") as f:
            self.data_list = json.load(f)

    def __getitem__(self, index):
        # Return the text stored at the given index
        text = self.data_list[index]["text"]
        return text

    def __len__(self):
        # Return the total number of examples
        return len(self.data_list)

train_data_path = r"data\CHIP-CTC\CHIP-CTC_train.json"
test_data_path = r"data\CHIP-CTC\CHIP-CTC_test.json"
dev_data_path = r"data\CHIP-CTC\CHIP-CTC_dev.json"

In [2]:
import jieba
import numpy as np

# Data preprocessing
# Convert text and labels into the input format fastText expects
def text_preprocess(text):
    return " ".join(jieba.cut(text))


def label_preprocess(label):
    return "__label__" + label.replace(" ", "_")


def label_depreprocess(label):
    return label[len("__label__"):].replace("_", " ")


train_dataset_loader = DataLoader(dataset=TrainDataset(train_data_path), batch_size=1, shuffle=True, drop_last=True)
# Raw string avoids the invalid "\C" escape sequence in the Windows path
output_file_path = r"models\CTC_fasttext_data.txt"
# Write one training example per line: segmented text, a tab, then the __label__ tag
with open(output_file_path, 'w', encoding="utf-8") as file:
    for label, text in train_dataset_loader:
        line_formatted = text_preprocess(text[0]) + '\t' + label_preprocess(label[0])
        file.write(line_formatted + '\n')
dev_dataset_loader = DataLoader(dataset=TrainDataset(dev_data_path), batch_size=1, shuffle=False, drop_last=True)
test_dataset_loader = DataLoader(dataset=TestDataset(test_data_path), batch_size=1, shuffle=False, drop_last=False)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ROSENB~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.813 seconds.
Prefix dict has been built successfully.


In [3]:
%%time
# fasttext
import fasttext

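# Raw string avoids the invalid "\C" escape sequence in the Windows path
model_path = r"models\CTC_fasttext.model"


def build_classify_model():
    # Train the classifier on the preprocessed file and save it to disk
    model = fasttext.train_supervised(output_file_path, epoch=140, wordNgrams=5, minCount=3)
    model.save_model(model_path)
    return model


def get_classify_model():
    # Load the saved classifier from disk
    model = fasttext.load_model(model_path)
    return model


# Set to False to skip training and reuse the saved model
train = True
if train:
    model = build_classify_model()
model = get_classify_model()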
    

Wall time: 47.9 s


## fasttext
![fasttext](resrc\faxttext.png)

[Bag of Tricks for Efficient Text Classification](file:///D:/Files/NLP%20Project/works/TianChi/paper/CHC/Bag%20of%20Tricks%20for%20Efficient%20Text%20Classification.pdf)
- Features are embedded and averaged to form the hidden representation.
- A bag of n-grams is added as extra features to capture partial information about local word order.
- To improve running time, a hierarchical softmax based on a Huffman coding tree is used.
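
To make the first two points concrete, here is a minimal pure-Python sketch of building a bag of word n-grams and averaging the embeddings of the features found in a lookup table. The helper names (`word_ngrams`, `average_embedding`) and the toy embedding table are assumptions for illustration only; this is not how the fastText library is implemented internally (among other differences, it hashes n-grams into a fixed number of buckets rather than storing them by string).

```python
def word_ngrams(tokens, max_n):
    """Return all word n-grams of length 1..max_n (hypothetical helper)."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def average_embedding(features, table, dim):
    """Average the vectors of features present in `table`; zeros if none match."""
    vectors = [table[f] for f in features if f in table]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Tokens as a word segmenter might produce them for "年龄>80岁".
tokens = ["年龄", ">", "80", "岁"]
features = word_ngrams(tokens, 2)  # unigrams followed by bigrams

# Toy 2-dimensional embedding table (made-up values).
toy_table = {"年龄": [1.0, 0.0], "岁": [0.0, 1.0], "80 岁": [1.0, 1.0]}
hidden = average_embedding(features, toy_table, 2)
print(features)
print(hidden)
```

The averaged vector plays the role of the hidden representation that is then fed to the (hierarchical) softmax classifier; because the bigram "80 岁" has its own vector, some word-order information survives the averaging.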
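## Experiments
- Interestingly, the model only reached reasonably good performance on the validation set after training for about 140 epochs, an unusually large number.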

In [4]:
# evaluate
label_name="""Disease
Symptom
Sign
Pregnancy-related Activity
Neoplasm Status
Non-Neoplasm Disease Stage
Allergy Intolerance
Organ or Tissue Status
Life Expectancy
Oral related
Pharmaceutical Substance or Drug
Therapy or Surgery
Device
Nursing
Diagnostic
Laboratory Examinations
Risk Assessment
Receptor Status
Age
Special Patient Characteristic
Literacy
Gender
Education
Address
Ethnicity
Consent
Enrollment in other studies
Researcher Decision
Capacity
Ethical Audit
Compliance with Protocol
Addictive Behavior
Bedtime
Exercise
Diet
Alcohol Consumer
Sexual related
Smoking Status
Blood Donation
Encounter
Disabilities
Healthy
Data Accessible
Multiple"""


def evaluation_macro_f1(model, test_lines, test_labels):
    # Hand-written Macro-F1 computation
    prediction = np.array(model.predict(test_lines)[0])
    prediction = prediction.reshape(prediction.shape[0])
    test_labels = np.array(test_labels)
    accurate = (prediction == test_labels)
    label_names = [label_preprocess(line) for line in label_name.split('\n')]
    count_prediction_label = np.zeros(len(label_names))
    count_prediction_accurate_label = np.zeros(len(label_names))
    count_total_label = np.zeros(len(label_names))
    for j, name in enumerate(label_names):
        predicted_as_j = (prediction == name)
        # Number of samples predicted as class j
        count_prediction_label[j] = predicted_as_j.sum()
        # Number of samples correctly predicted as class j
        count_prediction_accurate_label[j] = (predicted_as_j & accurate).sum()
        # Number of samples whose true class is j
        count_total_label[j] = (test_labels == name).sum()
    # 0/0 yields NaN for classes that never occur; nan_to_num maps those to 0
    with np.errstate(divide="ignore", invalid="ignore"):
        P = np.nan_to_num(count_prediction_accurate_label / count_prediction_label)
        R = np.nan_to_num(count_prediction_accurate_label / count_total_label)
        result = np.nan_to_num((2 * P * R) / (P + R))
    return result.mean()

test_lines = list()
test_labels = list()
for label, text in dev_dataset_loader:
    test_lines.append(text_preprocess(text[0]))
    test_labels.append(label_preprocess(label[0]))
print(evaluation_macro_f1(model, test_lines, test_labels))

0.6360891888929809


In [5]:
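# Dump the test-set predictions to JSON
# Raw string avoids the invalid "\C" escape sequence in the Windows path
dump_file_path = r"result\CHIP-CTC_test.json"
with open(test_data_path, 'r', encoding="utf-8") as source: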
    texts = list()
    data = json.load(source)
    for text in test_dataset_loader:
        texts.append(text_preprocess(text[0]))
    
    prediction = np.array(model.predict(texts)[0])
    prediction = prediction.reshape(prediction.shape[0])
    
    for index in range(len(data)):
        data[index]["label"] = label_depreprocess(prediction[index])
    json_result = json.dumps(data, sort_keys=True, ensure_ascii=False)

with open(dump_file_path,'w',encoding="utf-8") as destination:
    destination.write(json_result)