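## 1. Task description
Clinical terminology normalization is an indispensable task in medical statistics. In clinical practice, the same diagnosis, operation, drug, examination, lab test, or symptom is often written in hundreds or even thousands of different ways. Normalization (standardization) maps each of these variants to its standard term. Only with normalized terminology can researchers carry out downstream statistical analysis on electronic medical records. In essence, clinical terminology normalization is a form of semantic similarity matching. However, because the original mentions are expressed in so many different ways, a single matching model rarely performs well. The task arose against this background and was released as an evaluation task at the CHIP2020 conference (http://cips-chip.org.cn/).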

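## 2. Task details
The main goal of this evaluation task is to semantically normalize real diagnosis entities mined from Chinese electronic medical records. Given an original diagnosis mention, the system must output the corresponding standard diagnosis term(s). All original mentions come from real medical data and were annotated against the vocabulary 《国际疾病分类 ICD-10 北京临床版v601》 (International Classification of Diseases ICD-10, Beijing Clinical Edition v601). Annotated examples follow (note: a mention may map to several standard terms, separated by "##"):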

- 右肺结节转移可能大 <-> 肺占位性病变##肺继发恶性肿瘤##转移性肿瘤
- 右肺结节住院 <-> 肺占位性病变
- 左上肺胸膜下结节待查 <-> 胸膜占位

## 3. Evaluation metric
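The F1 score is computed over (original mention, standard term) pairs. Suppose the test set contains m such pairs, the system predicts n pairs, and k of the predicted pairs are correct. Then:

$P = k / n$

$R = k / m$

$F1 = 2 \cdot P \cdot R / (P+R)$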
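
The metric can be sketched in a few lines of Python. The function names `to_pairs` and `pair_f1` and the dict-based input are assumptions made for illustration, not the official scorer; each value uses the same `##`-separated format as the annotation examples, and the prediction "肺结节" is made up to show a wrong pair.

```python
def to_pairs(records):
    """Expand {mention: "term1##term2"} into a set of (mention, term) pairs."""
    pairs = set()
    for mention, joined in records.items():
        for term in joined.split("##"):
            if term:
                pairs.add((mention, term))
    return pairs


def pair_f1(gold, predicted):
    """Return (P, R, F1) computed over (mention, standard term) pairs."""
    gold_pairs = to_pairs(gold)          # m pairs
    pred_pairs = to_pairs(predicted)     # n pairs
    k = len(gold_pairs & pred_pairs)     # correctly predicted pairs
    p = k / len(pred_pairs) if pred_pairs else 0.0
    r = k / len(gold_pairs) if gold_pairs else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = {"右肺结节转移可能大": "肺占位性病变##肺继发恶性肿瘤##转移性肿瘤",
        "右肺结节住院": "肺占位性病变"}
pred = {"右肺结节转移可能大": "肺占位性病变##转移性肿瘤",
        "右肺结节住院": "肺结节"}
print(pair_f1(gold, pred))
```

Here m = 4, n = 3 and k = 2, so P = 2/3, R = 1/2 and F1 = 4/7.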

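## 4. Evaluation data
This evaluation releases 6,000 training examples, 2,000 validation examples, and 10,000 test examples. (Note: the original CHIP evaluation provided only 8,000 training examples; on the advice of the data provider's experts, this leaderboard splits them into a training set of 6,000 and a validation set of 2,000, and the validation set must not be used for training.)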

In [1]:
# data read-in
import pandas as pd

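# Read the candidate standard terms from the Excel file.
# Note: the sheet apparently has no header row, so pandas uses the first data
# row as the header; that is why the term column is named "霍乱".
terminology_table = pd.read_excel(r"data\CHIP-CDN\国际疾病分类 ICD-10北京临床版v601.xlsx")
terminologies = terminology_table["霍乱"]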
print(terminologies)

0           霍乱,由于01群霍乱弧菌,霍乱生物型所致
1                        古典生物型霍乱
2          霍乱,由于01群霍乱弧菌,埃尔托生物型所致
3                       埃尔托生物型霍乱
4                         未特指的霍乱
                  ...           
40468      与烷化剂有关与治疗有关的骨髓增生异常综合征
40469    与表鬼臼毒素有关与治疗有关的骨髓增生异常综合征
40470                  骨髓增生异常综合征
40471                      白血病前期
40472                   白血病前期综合征
Name: 霍乱, Length: 40473, dtype: object


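## Notes on the candidate terms
- Note that there are more than 40,000 candidate terms to match against
    - If time cost were no concern at all, a DNN could of course score every candidate one by one
    - But this is clearly impractical, and exhaustive matching is also incompatible with the binary-classification Siamese network used below
        - If a binary classifier really were applied to every candidate, it is easy to predict that the network would judge a large number of terms to be matches, which is clearly unreasonable
        - This already points to an inherent flaw in the pipeline below, which is discussed in the model section
- Second, the terms are not entirely unorganized
    - At the top level the terminology is divided into 27 categories, which is why `k_clusters` is set to 27 when recalling with cluster pruning below
        - However, because the cluster leaders are chosen at random in cluster pruning, the 27 clusters cannot be said to correspond one-to-one to the original categories
        - Still, by informal inspection, recall quality improved visibly after setting this parameter
            - This is not evidence that the inherent categorization of the terms is being exploited
            - Changing `k_clusters` may well improve results simply because it effectively enlarges the search scope; note that in pysparnn `k_clusters` appears to control how many clusters are searched per level rather than how many clusters exist, so a larger value widens the search
    - The terms are further subdivided into subcategories and sub-subcategories
    - Given the above, possible solutions for such a large candidate set follow naturally
        - Multi-level classification, similar to hierarchical softmax
        - Unsupervised grouping based on preset conditions, followed by mapping the candidate terms into a semantic space for recall
    - These ideas are discussed in detail in the model section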

In [2]:
# split into characters (one token per character)
def tokenize(text):
    result = list(text)
    return result

## For reflections on this tokenization method, see KUAKE-QIC

In [3]:
%%time
# Recall
import os  # needed by Sentence2Vector.get_cp below
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
import pysparnn.cluster_index as ci


recall_search_index_path = r"models\recall_search.index"
class Sentence2Vector:
    def __init__(self):
        pass

    def build_vectors(self, sentences):
        lines_cuted = [" ".join(tokenize(sentence)) for sentence in sentences]
        # TF-IDF weights each token by its frequency in a term and its rarity across all terms
        vertorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
        feature_vec = vertorizer.fit_transform(lines_cuted)
        search_index = self.get_cp(feature_vec, sentences)
        return vertorizer, feature_vec, lines_cuted, search_index

    def build_cp(self, vectors, data):
        search_index = ci.MultiClusterIndex(vectors, data)
        pickle.dump(search_index, open(recall_search_index_path, "wb"))
        return search_index

    def get_cp(self, vectors, data):
        if(os.path.exists(recall_search_index_path)):
            search_index = pickle.load(open(recall_search_index_path, "rb"))
        else:
            search_index = self.build_cp(vectors, data)
        return search_index

class Recall:
    def __init__(self, sentences, k=10):
        self.k = k
        sentence_vec = Sentence2Vector()
        self.vertorizer, self.feature_vec, self.lines_cuted, self.search_index = sentence_vec.build_vectors(sentences)

    def predict(self, sentence):
        sentence_vector = self.vertorizer.transform(sentence)
        return self.search_index.search(sentence_vector, k=self.k, k_clusters=27, return_distance=False)

print(Recall(terminologies).predict([" ".join(tokenize("糖尿病反复低血糖;骨质疏松;高血压冠心病不稳定心绞痛"))])[0])

['不稳定性心绞痛', '稳定性心绞痛', '糖尿病性高血压', '糖尿病性低血糖症', '糖尿病性心肌病', '糖尿病性缺血性心肌病', '糖尿病性体位性低血压', '1型糖尿病性高血压', '2型糖尿病性高血压', '2型糖尿病性低血糖症']
Wall time: 1.39 s


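## Recall
#### For more, see Chapter 6 of [信息检索导论 (Introduction to Information Retrieval)](file:///D:/Files/NLP%20Project/material/More/%E4%BF%A1%E6%81%AF%E6%A3%80%E7%B4%A2%E5%AF%BC%E8%AE%BA.pdf)
### TfidfVectorizer
- Term frequency is written $tf_{t,d}$, where the two subscripts denote the term and the document; it is the number of times term t occurs in document d
- Document frequency $df_t$ is the number of documents that contain t
    - The idf (inverse document frequency) of term t is defined as $idf_t = \log{\frac{N}{df_t}}$
        - N is the total number of documents
- The tf-idf weighting scheme assigns term t in document d the weight $\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$
- tf-idf vectorization: a document is treated as a vector with one component per dictionary term, whose value is the weight computed by the formula above
- The exact formula is not unique and has many refinements; what matters is its meaning. For example, scikit-learn's `TfidfVectorizer` by default uses a smoothed idf, $\ln\frac{1+N}{1+df_t}+1$, and L2-normalizes each vector
- The problems this kind of vectorization can cause have already been mentioned above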
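
As a worked example of the textbook weighting, the sketch below computes $tf_{t,d} \times \log(N/df_t)$ by hand with character tokens. It deliberately uses the plain formula rather than scikit-learn's smoothed variant, so its numbers will differ from `TfidfVectorizer`; the helper name `tfidf_vectors` is an assumption for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    tokenized = [list(doc) for doc in documents]      # one token per character
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))                        # count each token once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


docs = ["古典生物型霍乱", "埃尔托生物型霍乱", "白血病前期"]
vectors = tfidf_vectors(docs)
# "霍" occurs in 2 of 3 documents, "古" in only 1, so "古" gets the larger weight.
print(vectors[0]["霍"], vectors[0]["古"])
```

Tokens shared by many terms (such as "霍乱" across the cholera entries) receive small weights, while tokens that distinguish one term from the others dominate its vector.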

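### Vector space model (VSM)
- To offset the effect of document length on similarity, the standard way to measure the similarity of two documents $d_1$ and $d_2$ is cosine similarity: $sim(d_1, d_2)=\frac{V(d_1) \cdot V(d_2)}{|V(d_1)||V(d_2)|}$, where $V(d)$ is the vector representation of document d; here the vectorization is the tf-idf vectorization described above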
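
A minimal sketch of cosine similarity over sparse vectors stored as dicts. The function name and the zero-vector convention are choices made here for illustration; the weights in the example are arbitrary.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {token: weight} dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0          # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


a = {"糖": 1.0, "尿": 1.0, "病": 1.0}
b = {"糖": 2.0, "尿": 2.0, "病": 2.0}   # same direction, twice the length
c = {"骨": 1.0, "折": 1.0}              # no shared tokens
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Vectors `a` and `b` differ only in length, so their similarity is (up to rounding) 1, which is the length invariance the bullet above refers to; `a` and `c` share no tokens and score 0.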

### [Cluster pruning](https://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html)
In cluster pruning we have a preprocessing step during which we cluster the document vectors. Then at query time, we consider only documents in a small number of clusters as candidates for which we compute cosine scores.

1. Pick $\sqrt{N}$ documents at random from the collection. Call these leaders.
2. For each document that is not a leader, we compute its nearest leader.

We refer to documents that are not leaders as followers. Intuitively, in the partition of the followers induced by the use of $\sqrt{N}$ randomly chosen leaders, the expected number of followers for each leader is $\approx N/\sqrt{N} = \sqrt{N}$. Next, query processing proceeds as follows:

1. Given a query $q$, find the leader $L$ that is closest to $q$. This entails computing cosine similarities from $q$ to each of the $\sqrt{N}$ leaders.
2. The candidate set $A$ consists of $L$ together with its followers. We compute the cosine scores for all documents in this candidate set.

The use of randomly chosen leaders for clustering is fast and likely to reflect the distribution of the document vectors in the vector space: a region of the vector space that is dense in documents is likely to produce multiple leaders and thus a finer partition into sub-regions. 

![Cluster pruning](resrc\cluster_pruning.png)
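
The two preprocessing steps and the two query steps quoted above can be sketched in plain Python. This is the single-level textbook version with dense toy vectors, not pysparnn's implementation; all function names and the random test data are assumptions for illustration.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def nearest_leader(vector, leaders, vectors):
    """Index (into `vectors`) of the leader most similar to `vector`."""
    return max(leaders, key=lambda i: cosine(vector, vectors[i]))


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach every follower to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(vectors))))
    leaders = rng.sample(range(len(vectors)), n_leaders)
    clusters = {leader: [leader] for leader in leaders}
    for i, vector in enumerate(vectors):
        if i not in clusters:                       # followers only
            clusters[nearest_leader(vector, leaders, vectors)].append(i)
    return leaders, clusters


def search(query, vectors, leaders, clusters, k=3):
    """Score only the documents in the cluster of the leader closest to the query."""
    candidates = clusters[nearest_leader(query, leaders, vectors)]
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


rng = random.Random(1)
vectors = [[rng.random() for _ in range(5)] for _ in range(100)]
leaders, clusters = build_index(vectors)
print(search(vectors[42], vectors, leaders, clusters))
```

With 100 documents the index holds 10 leaders, so a query computes 10 leader similarities plus the similarities inside one cluster instead of all 100. Searching several nearby clusters instead of one trades speed for recall.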


In [4]:
# word to sequence
UNK_TAG = "UNK"
PAD_TAG = "PAD"
class Word2Sequence():
    UNK = 0
    PAD = 1

    def __init__(self):
        self.word2index_dict = {
            UNK_TAG : self.UNK,
            PAD_TAG : self.PAD,
        }
        self.count = {}


    def fit(self, sentence):
        # Count token frequencies for one tokenized sentence
        for word in sentence:
            self.count[word] = self.count.get(word, 0) + 1
        

    def build_vocab(self,min=0,max=None,max_features=None):
        self.count = {word:value for word,value in self.count.items() if value > min}
        if(max is not None):
            self.count = {word:value for word,value in self.count.items() if value < max}
        if max_features is not None:
            self.count = dict(sorted(self.count.items(), key = lambda x:x[-1], reverse=True)[:max_features])

        for word in self.count:
            self.word2index_dict[word] = len(self.word2index_dict)
        self.index2word_dict = dict(zip(self.word2index_dict.values(), self.word2index_dict.keys()))


    def words2index_transform(self, sentence, max_len=None):
        if max_len is not None:
            if max_len > len(sentence):
                sentence = sentence + [PAD_TAG] * (max_len - len(sentence))
            else:
                sentence = sentence[:max_len]
        return [self.word2index_dict.get(word, self.UNK) for word in sentence]


    def index2words_transform(self, sentence):
        return [self.index2word_dict.get(index) for index in sentence]

    
    def __len__(self):
        return len(self.word2index_dict)

In [5]:
# dictionary build
import pickle
from tqdm import tqdm
import json
import os

train_data_path = r"data\CHIP-CDN\CHIP-CDN_train.json"
test_data_path = r"data\CHIP-CDN\CHIP-CDN_test.json"
dev_data_path = r"data\CHIP-CDN\CHIP-CDN_dev.json"
# Cache the fitted vocabulary; build it only if the cache file is missing
vocab_path = r"models/CHIP-CDN_Word2Sequence.pkl"
if(not os.path.exists(vocab_path)):
    word_index_tranformer = Word2Sequence()
    with open(train_data_path, encoding="utf-8") as f:
        for data in tqdm(json.load(f)):
            word_index_tranformer.fit(tokenize(data['text']))
            word_index_tranformer.fit(tokenize(data['normalized_result'].replace("##","")))
    for terminology in terminologies:
        word_index_tranformer.fit(tokenize(terminology))
    word_index_tranformer.build_vocab()
    pickle.dump(word_index_tranformer, open(vocab_path, 'wb'))
else:
    word_index_tranformer = pickle.load(open(vocab_path, 'rb'))
print("\n" + str(len(word_index_tranformer)))

100%|██████████| 6000/6000 [00:00<00:00, 146381.87it/s]
2560



In [6]:
# dataset
import torch
from torch.utils.data import Dataset
import json
import pickle
from tqdm import tqdm

max_sentece_length = 20
class RosDataset(Dataset):
    def __init__(self, data_path, mode):
        # Negative examples are built from the recall results
        # mode:
        # 0->train
        # 1->dev
        # 2->test

        self.mode = mode
        self.data = list()
        with open(data_path, encoding="utf-8") as f:
            data_list = json.load(f)
            # Separate name for the cache file so the data_path argument is not shadowed
            cache_path = r"models\CHIP-CDN_trainData"
            if(self.mode == 0):
                if(os.path.exists(cache_path)):
                    self.data = pickle.load(open(cache_path, "rb"))
                else:
                    recall = Recall(terminologies, 5)
                    for pair in tqdm(data_list, desc="train"):
                        text = tokenize(pair["text"])
                        matched_results = pair["normalized_result"].split("##")
                        ret = recall.predict([" ".join(text)])[0]
                        # Gold terms become positives; drop them from the recalled list
                        for r in matched_results:
                            self.data.append([text, tokenize(r), 1])
                            if r in ret:
                                ret.remove(r)
                        # The remaining recalled terms become negatives
                        for r in ret:
                            self.data.append([text, tokenize(r), 0])
                    pickle.dump(self.data, open(cache_path, "wb"))
            elif(self.mode == 1):
                for pair in tqdm(data_list, desc="dev"):
                    text = pair["text"]
                    matched_results = pair["normalized_result"].split("##")
                    self.data.append([text, matched_results])
            else:
                for pair in tqdm(data_list, desc="test"):
                    text = pair["text"]
                    self.data.append([text])

    def __getitem__(self, index):
        # Return the example at the given index
        text = self.data[index][0]
        if(self.mode == 0):
            indexed_text = torch.LongTensor(word_index_tranformer.words2index_transform(text, max_len=max_sentece_length))
            indexed_match = torch.LongTensor(word_index_tranformer.words2index_transform(self.data[index][1], max_len=max_sentece_length))
            label = self.data[index][2]
            return label, indexed_text, indexed_match
        elif(self.mode == 1):
            text = self.data[index][0]
            matches = self.data[index][1]
            return text, matches
        else:
            return text

    def __len__(self):
        # Return the total number of examples
        return len(self.data)

train_dataset = RosDataset(train_data_path, mode=0)
dev_dataset = RosDataset(dev_data_path, mode=1)
test_dataset = RosDataset(test_data_path, mode=2)
print("\n---------------------------------")
print(train_dataset[0], len(train_dataset))
print("---------------------------------")
print(dev_dataset[1], len(dev_dataset))
print("---------------------------------")
print(test_dataset[0], len(test_dataset))

dev: 100%|██████████| 2000/2000 [00:00<00:00, 502251.71it/s]
test: 100%|██████████| 10000/10000 [00:00<00:00, 1249420.32it/s]
---------------------------------
(1, tensor([2, 3, 4, 5, 6, 7, 8, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), tensor([ 3, 10, 11, 12, 13,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1])) 36250
---------------------------------
('卵巢Ca', ['卵巢恶性肿瘤', '癌']) 2000
---------------------------------
﻿左前降支中段肌桥－壁冠状动脉MB-MCA 10000



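## Notes on dataset construction
- Because we still use a model that compares two inputs and performs binary classification, the raw data must be converted into matching pairs
- The training set only gives each mention together with its normalized results, so in effect only positive examples exist and negative examples have to be constructed
- The construction we ended up with is as follows
    - Every original mention paired with each of its normalized results is added to the dataset as a positive example
    - For each mention, the recall module returns candidate terms; here we take 5 candidates
        - The number of candidates is a free choice; 5 was chosen so that the ratio of positives to negatives is roughly 1:4
            - On average a mention has slightly more than one standard term; allowing for recall misses, the ratio comes to roughly 1:4
    - If one of the 5 candidates is a normalized result, nothing needs to be done, because it is already in the dataset as a positive example
    - If it is not a normalized result, it is added as a negative example
        - We could instead sample some non-matching terms at random from the 40,000 candidates as negatives, but the drawback is obvious
        - At prediction time we always recall first and then match against the recalled results
        - So the key to training is to pick the correct normalized result out of the recalled candidates
            - That is, to select correctly among candidates that are already fairly similar
            - This requires pulling apart vectors that are already close in the document vector space, which is exactly what the negative examples are for
        - Random sampling cannot give the model this discriminative ability
            - Ideally, the model should easily learn to separate completely dissimilar terms without dedicated training
            - Moreover, since recall has already pre-filtered the candidates, it matters little if the model is weak in this respect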
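
The construction rule can be isolated in a small function. This is a sketch of the rule only, with a hand-written recall list: the helper name `build_pairs` is an assumption, and the recalled terms "肺良性肿瘤" and "肺结节病" are hypothetical stand-ins for what the recall module might return.

```python
def build_pairs(mention, gold_terms, recalled_terms):
    """Return (mention, term, label) triples: gold terms are positives,
    recalled terms that are not gold are hard negatives."""
    pairs = [(mention, term, 1) for term in gold_terms]
    gold = set(gold_terms)
    pairs += [(mention, term, 0) for term in recalled_terms if term not in gold]
    return pairs


# Hypothetical recall output for one mention.
pairs = build_pairs(
    mention="右肺结节转移可能大",
    gold_terms=["肺占位性病变", "肺继发恶性肿瘤", "转移性肿瘤"],
    recalled_terms=["肺占位性病变", "胸膜占位", "肺继发恶性肿瘤", "肺良性肿瘤", "肺结节病"],
)
for pair in pairs:
    print(pair)
```

Gold terms that recall missed (here "转移性肿瘤") still enter the dataset as positives, while recalled gold terms are not duplicated as negatives, so every negative is a near miss that the recall stage could not rule out.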

In [7]:
# dataloader
from torch.utils.data import DataLoader
import torch

def collate_fn(data):
    # batch_size is 1 for dev, so unwrap the single (text, matches) example as-is
    return data[0][0], data[0][1]

train_data_loader = DataLoader(dataset=train_dataset,batch_size=128,shuffle=True)
dev_data_loader = DataLoader(dataset=dev_dataset,batch_size=1,shuffle=True,collate_fn=collate_fn)
test_data_loader = DataLoader(dataset=test_dataset,batch_size=1,shuffle=False, drop_last=False)

for index, (text, matches) in enumerate(dev_data_loader):
    if(index > 5):
        break
    print(f"{index}:{text}, {matches}")

for index, text in enumerate(test_data_loader):
    if(index > 5):
        break
    print(f"{index}:{text}")

0:颅内脱髓鞘病变待诊, ['脱髓鞘病']
1:上消化道出血(食道胃底静脉曲张), ['上消化道出血', '食管静脉曲张', '胃底静脉曲张']
2:冠心病高血压病高脂血症, ['冠状动脉粥样硬化性心脏病', '高血压', '高脂血症']
3:骨髓淋巴细胞增生, ['淋巴细胞增殖性疾病']
4:颈椎椎间盘膨出, ['其他的颈椎间盘移位']
5:子宫颈鳞癌IIB期放化疗后, ['子宫颈恶性肿瘤', '鳞状细胞癌', '恶性肿瘤放疗', '化学治疗']
0:['\ufeff左前降支中段肌桥－壁冠状动脉MB-MCA']
1:['子宫脱出III阴道前壁膨出III']
2:['右手掌撕脱伤并手内肌损伤']
3:['右侧第11肋骨骨折']
4:['皮肤色素沉着原因待查1Addison病']
5:['十二指肠及空肠切除术']


In [8]:
# Siamese network
import torch.nn as nn
import torch.nn.functional as F
import torch

class SiameseNetwork(nn.Module):
    def __init__(self):
        super(SiameseNetwork, self).__init__()
        self.embedding = nn.Embedding(num_embeddings=len(word_index_tranformer),embedding_dim=300,padding_idx=word_index_tranformer.PAD)
        self.gru1 = nn.GRU(input_size=300,hidden_size=256,num_layers=2,batch_first=True,bidirectional=True)
        self.gru2 = nn.GRU(input_size=256*4,hidden_size=256,num_layers=1,batch_first=True,bidirectional=False)
        self.dnn = nn.Sequential(
            nn.Linear(256*4,256),
            nn.ELU(inplace=True),
            nn.BatchNorm1d(256),
            nn.Dropout(0.3),

            nn.Linear(256,256),
            nn.ELU(inplace=True),
            nn.BatchNorm1d(256),
            nn.Dropout(0.3),

            nn.Linear(256, 2)
        )


    def forward(self, input1, input2):
        mask1 = input1.eq(word_index_tranformer.PAD)
        mask2 = input2.eq(word_index_tranformer.PAD)
        input1 = self.embedding(input1)
        input2 = self.embedding(input2)
        output1,_ = self.gru1(input1)
        output2,_ = self.gru1(input2)
        
        output1_align, output2_align = self.soft_attention_align(output1, output2, mask1, mask2)
        output1 = torch.cat([output1, output1_align], 2)
        output2 = torch.cat([output2, output2_align], 2)
        
        gru2_output1,_ = self.gru2(output1)
        gru2_output2,_ = self.gru2(output2)
        
        output1_pooled = self.apply_pooling(gru2_output1)
        output2_pooled = self.apply_pooling(gru2_output2)
        out = torch.cat([output1_pooled, output2_pooled], dim=-1)
        out = self.dnn(out)
        
        return F.log_softmax(out, dim=-1)


    def apply_pooling(self, output):
        avg_pooled = F.avg_pool1d(output.transpose(1,2), kernel_size=output.size(1)).squeeze(-1)
        max_pooled = F.max_pool1d(output.transpose(1,2), kernel_size=output.size(1)).squeeze(-1)
        return torch.cat([avg_pooled, max_pooled], dim=-1)


    def soft_attention_align(self, x1, x2, mask1, mask2):
        mask1 = mask1.float().masked_fill_(mask1, float("-inf"))
        mask2 = mask2.float().masked_fill_(mask2, float("-inf"))

        attention_weight = x1.bmm(x2.transpose(1, 2))
        x1_weight = F.softmax(attention_weight + mask2.unsqueeze(1), dim=-1) 
        x2_output = x1_weight.bmm(x2)

        x2_weight = F.softmax(attention_weight.transpose(1, 2) + mask1.unsqueeze(1), dim=-1) 
        x1_output = x2_weight.bmm(x1)
        
        return x1_output, x2_output

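## Notes on the model
The model itself is essentially the same as in the other tasks, so it is not described again here.

### Details not covered in the other tasks
- dropout
    - $y=f(W\,mask(x)+b)$
    - During training, $mask(x)$ is generated at random from a Bernoulli distribution with probability p (here p is the probability that a unit is kept); at test time the input x is multiplied by p to compensate
    - Note that PyTorch's `nn.Dropout(p)` takes p as the drop probability and rescales the kept units by $1/(1-p)$ during training, so no compensation is needed at test time
- batch normalization
    - Batch normalization (BN) is a layer-wise normalization method that can normalize any intermediate layer of a neural network
    - The formula is: $\hat{z}^{(l)}=\frac{z^{(l)}-\mu_B}{\sqrt{\sigma^2_B+\epsilon}}\odot\gamma+\beta$
    - Here $z^{(l)}$ is the output of the preceding affine transformation, $\mu_B$ is the mean of the current batch, and $\sigma^2_B$ is the variance of the current batch; $\gamma$ and $\beta$ are learnable scale and shift parameters
- soft-attention
    - ![attention](resrc\attention.png)
- pooling
- GRU
    - ![GRU](resrc\GRU.png)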
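#### For more details, see [神经网络与深度学习 (Neural Networks and Deep Learning)](file:///D:/Files/NLP%20Project/material/ML/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C%E4%B8%8E%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0.pdf)
## Other notes
#### Note that everything below is an idea that has not been rigorously tested
### Inherent flaws
- This is a pipeline approach, so errors from an earlier stage inevitably become the bottleneck for the later stage
- Here, recall accuracy becomes the bottleneck of the whole system
- In addition, because the dataset was built from candidates pre-filtered by recall, the robustness of the model itself is questionable
### Multi-level classification
- This idea addresses the overly large candidate set; it is inspired by hierarchical softmax

![hierarchical softmax](resrc\层次化softmax.jpg)
- Hierarchical softmax recasts a multi-class problem as a sequence of binary (0-1) decisions across several levels, and the same idea applies here
- Moreover, since the 40,000 terms are themselves organized into categories, multi-level classification can follow those categories directly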
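
One way to picture the two-level idea is sketched below: level 1 picks a category from a cheap per-category profile, and level 2 compares the mention only with the terms in that category. Everything here is an illustrative assumption, including the category labels, the character-overlap scoring (a stand-in for trained classifiers), and the function names; it has not been evaluated on this task.

```python
from collections import defaultdict


def char_overlap(a, b):
    """Toy similarity: Jaccard overlap of character sets (stand-in for a trained model)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def build_hierarchy(term_to_category):
    """Group terms by category and build one character profile per category."""
    hierarchy = defaultdict(list)
    for term, category in term_to_category.items():
        hierarchy[category].append(term)
    # Level 1 consults only these profiles, never the individual terms.
    profiles = {c: set("".join(terms)) for c, terms in hierarchy.items()}
    return hierarchy, profiles


def two_level_match(mention, hierarchy, profiles):
    """Level 1: pick the best category; level 2: pick the best term inside it."""
    chars = set(mention)
    best_category = max(profiles, key=lambda c: len(chars & profiles[c]) / len(chars))
    best_term = max(hierarchy[best_category], key=lambda t: char_overlap(mention, t))
    return best_category, best_term


# Hypothetical category labels; the real vocabulary's grouping is not reproduced here.
term_to_category = {
    "古典生物型霍乱": "infectious",
    "埃尔托生物型霍乱": "infectious",
    "白血病前期": "neoplasm",
    "骨髓增生异常综合征": "neoplasm",
}
hierarchy, profiles = build_hierarchy(term_to_category)
print(two_level_match("白血病前期综合征", hierarchy, profiles))
```

The cost per query becomes the number of categories plus the size of one category, rather than the full vocabulary. The trade-off is the same as for the pipeline: a wrong choice at level 1 cannot be recovered at level 2.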
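### Unsupervised
- The flaw of the pipeline has been noted above, so a natural idea is to train a DNN as an embedding tool and then use cluster pruning to do everything in one step
- Both unsupervised and supervised methods could achieve this
    - In particular, the fastText model used in CHIP-CTC might also work to some extent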

In [9]:
from tqdm import tqdm
from torch.optim import Adam


def train(epochs, model, model_path=None, optimizer_path=None, device=None):
    model = model.to(device)
    model.train()
    optimizer = Adam(model.parameters(), lr=0.001)
    t = tqdm(range(epochs), desc="Train")
    for epoch in t:
        for index, (label, text1, text2) in enumerate(train_data_loader):
            if device is not None:
                label = label.to(device)
                text1 = text1.to(device)
                text2 = text2.to(device)
            optimizer.zero_grad()
            output = model(text1, text2)
            loss = F.nll_loss(output, label)
            t.set_description(f"epoch:{epoch}")
            t.set_postfix(loss=loss.item())
            loss.backward()
            optimizer.step()
    
    if model_path is not None:
        torch.save(model.state_dict(), model_path)
    if optimizer_path is not None:
        torch.save(optimizer.state_dict(), optimizer_path)

In [10]:
train_mode = True
model = SiameseNetwork()
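# Raw strings keep the Windows backslashes from being read as escape sequences
model_path = r"models\CHIP-CDN_SiameseNetwork.pth"
optimizer_path = r"models\CHIP-CDN_SiameseNetwork_optim.pth"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if(train_mode):
    train(epochs=16, model=model, model_path=model_path, optimizer_path=optimizer_path, device=device)
else:
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    model.load_state_dict(torch.load(model_path, map_location=device))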

In [11]:
class Prediction:
    def __init__(self, model, recall, device):
        self.device = device
        self.recall = recall
        self.model = model
        model = model.to(device)

    def predict(self, sentence):
        # Prediction first recalls a fixed number of candidate terms, then scores each one with the model
        rec = self.recall.predict([" ".join(tokenize(sentence))])
        result = list()
        self.model.eval()
        with torch.no_grad():
            for r in rec[0]:
                indexed_sentence = torch.LongTensor(word_index_tranformer.words2index_transform(tokenize(sentence), max_len=max_sentece_length)).to(self.device)
                indexed_r = torch.LongTensor(word_index_tranformer.words2index_transform(tokenize(r), max_len=max_sentece_length)).to(self.device)
                model_prediction = self.model(indexed_sentence.unsqueeze(0), indexed_r.unsqueeze(0)).argmax().item()
                if(model_prediction == 1):
                    result.append(r)
        if(len(result) == 0):
            result.append(rec[0][0])
        return result

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
p = Prediction(model, Recall(terminologies, 5), device)
print(p.predict("肿标升高待查全麻胃肠镜丙泊酚不耐受"))

['肿瘤标记物升高', '乳糖不耐受']


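## The prediction process
1. Use the recall module to retrieve candidate terms
2. Use the model to select the final results
3. If the model selects none of them, fall back to the top result from the recall module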
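
The three steps reduce to a short control flow, sketched here with stand-in components. `predict_terms`, `toy_recall` and `toy_match` are hypothetical names used only for illustration; they are not the notebook's `Recall` class or Siamese model.

```python
def predict_terms(mention, recall_fn, match_fn):
    """Recall candidates, keep those the matcher accepts, fall back to the top candidate."""
    candidates = recall_fn(mention)                                       # step 1
    accepted = [term for term in candidates if match_fn(mention, term)]  # step 2
    if not accepted and candidates:
        accepted = [candidates[0]]                                        # step 3
    return accepted


# Stand-in components for illustration only.
def toy_recall(mention):
    return ["不稳定性心绞痛", "糖尿病性高血压", "糖尿病性低血糖症"]


def toy_match(mention, term):
    return term in mention


print(predict_terms("冠心病不稳定性心绞痛", toy_recall, toy_match))
print(predict_terms("骨质疏松", toy_recall, toy_match))
```

In the second call the matcher accepts nothing, so the first recalled term is returned even though it is unrelated. The fallback guarantees a non-empty answer but can only be as good as the recall ranking.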

In [12]:
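def evaluation_accuracy(prediction, test_data_loader, device=None):
    # Averages, over mentions, the fraction of predicted terms that are correct
    # (a per-mention precision, not the pair-level F1 defined in section 3)
    count_correct = 0.0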
    t = tqdm(test_data_loader, desc="Evaluation")
    for text, matches in t:
        result = prediction.predict(text)
        for r in result:
            if (r) in matches:
                count_correct += 1 / len(result)
        t.set_postfix_str(count_correct / len(test_data_loader))
    return count_correct / len(test_data_loader)

print(evaluation_accuracy(p, dev_data_loader, device=device))

Evaluation: 100%|██████████| 2000/2000 [03:21<00:00,  9.92it/s, 0.4053666666666667]


0.4053666666666667

In [13]:
dump_file_path = r"result\CHIP-CDN_test.json"
with open(test_data_path,'r',encoding="utf-8") as source:
    data = json.load(source)
    with torch.no_grad():
        for index,text in tqdm(enumerate(test_data_loader), desc="Evaluation", total=len(test_data_loader)):
            data[index]["normalized_result"] = "##".join(p.predict(text[0]))

# Serialize once, after all predictions have been filled in
json_result = json.dumps(data, ensure_ascii=False)
with open(dump_file_path,'w',encoding="utf-8") as destination:
    destination.write(json_result)

Evaluation: 100%|██████████| 10000/10000 [16:04<00:00, 10.37it/s]
