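## 1. Task description
In medical search, assessing how well the topic expressed by a search query (Query) matches the topic expressed by a landing-page title (Title) is an important task that directly affects the accuracy of search results. The topic of a query is what the query focuses on: when entering a query, the user hopes to find pages related to that topic. The task is to decide whether the Query topic and the Title topic agree, and to what degree. The dataset for this task was created against this background.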

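## 2. Task specification
Query-Title relevance is graded on 4 levels (0-3), where 0 is the worst match and 3 is the best.

- 3: the topics match completely.
- 2: the topics match partially.
- 1: the topics barely match, but the title still has some reference value.
- 0: the topics do not match at all, or the title has no reference value.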

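### Annotation examples
#### 3 (highly consistent / complete match):
The Title topic matches the Query topic completely.
- Q=缺维生素b的症状 (symptoms of vitamin B deficiency)
- T=维生素b缺乏症的主要表现 (main manifestations of vitamin B deficiency)
- Q=排卵期有什么症状 (what are the symptoms of ovulation)
- T=女性排卵期有什么症状？ (what symptoms do women have during ovulation?)
#### 2 (partial match, Title broader or narrower):
The Title topic is slightly narrower than the Query topic, and the Title topic is one of the main aspects of the Query topic.

The Query topic is slightly narrower than the Title topic, but the Query topic is one of the main aspects of the Title topic.

- Q > T
    - Q=拉了绿色的稀便 (passed green loose stools) T=拉绿色的屎怎么回事 (why is my stool green)
- Q < T
    - Q=大腿软组织损伤怎么办 (what to do about a soft-tissue injury of the thigh) T=腿部软组织损伤怎么办 (what to do about a soft-tissue injury of the leg)
- Topics partially match
    - Q=排卵期有什么症状 (what are the symptoms of ovulation) T=女性排卵期有什么症状？女性备孕吃什么有助排卵？ (what symptoms do women have during ovulation? what should women trying to conceive eat to help ovulation?)
#### 1 (little match, some reference value)

- Q=小腿抽筋是什么原因引起的 (what causes calf cramps) T=小腿抽筋后一直疼怎么办 (what to do when the calf keeps hurting after a cramp)
- Q=眉心长痘痘是什么原因 (what causes pimples between the eyebrows) T=脸颊长痘痘是什么原因 (what causes pimples on the cheeks)
#### 0 (no match at all, irrelevant)

- Q=挑食是什么原因造成的 (what causes picky eating) T=影响身高的因素 (factors that affect height)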
## 3. Evaluation metric
This task is evaluated with accuracy, defined as:
### Accuracy = number of correctly predicted items / total number of predicted items
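As a minimal sketch of this metric in plain Python (the function name `accuracy` and the sample labels are illustrative, not part of the task's official scorer):

```python
def accuracy(predictions, labels):
    """Fraction of items whose predicted label equals the gold label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Illustrative labels on the 0-3 relevance scale (made-up values)
gold = [3, 2, 0, 1]
pred = [3, 1, 0, 1]
print(accuracy(pred, gold))  # 3 of 4 correct -> 0.75
```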

## 4. Evaluation data
The released data consists of 24174 training examples, 2913 validation examples, and 5465 test examples.
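
The code later in this notebook reads the `query`, `title` and `label` fields of each record. A sketch of parsing one such record follows; the sample text is taken from the first 3-point example above, and storing the label as a string is an assumption (the notebook converts it with `int(...)`, which accepts either form).

```python
import json

# One hypothetical record with the three fields the notebook reads
raw = '[{"query": "缺维生素b的症状", "title": "维生素b缺乏症的主要表现", "label": "3"}]'

records = json.loads(raw)
for record in records:
    label = int(record["label"])  # relevance grade on the 0-3 scale
    print(label, record["query"], record["title"])
```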

In [1]:
# Character-level tokenization: split the text into single characters
def tokenize(text):
    return list(text)
print(tokenize("如果这是一句话。"))

['如', '果', '这', '是', '一', '句', '话', '。']


In [2]:
# word to sequence
UNK_TAG = "UNK"
PAD_TAG = "PAD"
class Word2Sequence():
    # Converts tokenized sentences to index sequences and back.
    # Usage:
    #   1. fit(): accumulate token counts
    #   2. build_vocab(): build the vocabulary
    #   3. words2index_transform() / index2words_transform(): convert

    UNK = 0
    PAD = 1

    def __init__(self):
        self.word2index_dict = {
            UNK_TAG : self.UNK,
            PAD_TAG : self.PAD,
        }
        self.count = {}


    def fit(self, sentence):
        # Add the tokens of one sentence to the frequency counts

        for word in sentence:
            self.count[word] = self.count.get(word, 0) + 1


    def build_vocab(self,min=0,max=None,max_features=None):
        # Call explicitly after fit() to build the vocabulary
        # min: keep only words that occur at least min times
        # max: keep only words that occur at most max times
        # max_features: keep only the max_features most frequent words

        self.count = {word:value for word,value in self.count.items() if value >= min}
        if max is not None:
            self.count = {word:value for word,value in self.count.items() if value <= max}
        if max_features is not None:
            self.count = dict(sorted(self.count.items(), key = lambda x:x[-1], reverse=True)[:max_features])

        for word in self.count:
            self.word2index_dict[word] = len(self.word2index_dict)
        self.index2word_dict = dict(zip(self.word2index_dict.values(), self.word2index_dict.keys()))


    def words2index_transform(self, sentence, max_len=None):
        # Convert a token list to indices, padding or truncating to max_len
        if max_len is not None:
            if max_len > len(sentence):
                sentence = sentence + [PAD_TAG] * (max_len - len(sentence))
            else:
                sentence = sentence[:max_len]
        return [self.word2index_dict.get(word, self.UNK) for word in sentence]


    def index2words_transform(self, sentence):
        # Convert indices back to tokens (unknown indices map to None)
        return [self.index2word_dict.get(index) for index in sentence]


    def __len__(self):
        return len(self.word2index_dict)
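
To make the fit / build / transform flow concrete, here is a small self-contained sketch of the same idea using plain functions rather than the class above. The helper names `build_vocab` and `encode` are illustrative only; indices 0 and 1 are reserved for UNK and PAD as in `Word2Sequence`.

```python
from collections import Counter

UNK, PAD = 0, 1


def build_vocab(sentences):
    """Map each token to an index, in order of first appearance."""
    counts = Counter(token for sentence in sentences for token in sentence)
    vocab = {"UNK": UNK, "PAD": PAD}
    for token in counts:  # Counter keeps first-seen order
        vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Truncate to max_len, look up indices, then pad with PAD."""
    tokens = list(tokens)[:max_len]
    ids = [vocab.get(token, UNK) for token in tokens]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab([list("排卵期"), list("排卵")])
print(encode(list("排卵期"), vocab, max_len=5))  # [2, 3, 4, 1, 1]
print(encode(list("备孕"), vocab, max_len=3))    # [0, 0, 1]
```

Characters that never appeared in the fitted corpus fall back to index 0, which is why the second call starts with two zeros.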

In [3]:
# dictionary build
import pickle
from tqdm import tqdm
import json
import os

train_data_path = "data/KUAKE-QTR/KUAKE-QTR_train.json"
test_data_path = "data/KUAKE-QTR/KUAKE-QTR_test.json"
dev_data_path = "data/KUAKE-QTR/KUAKE-QTR_dev.json"
vocab_path = "models/KUAKE-QTR_Word2Sequence.pkl"
# Build the vocabulary from the training corpus, or load the cached one if it exists
if not os.path.exists(vocab_path):
    word_index_tranformer = Word2Sequence()
    with open(train_data_path, encoding="utf-8") as f:
        for data in tqdm(json.load(f)):
            word_index_tranformer.fit(tokenize(data['query']))
            word_index_tranformer.fit(tokenize(data['title']))
    word_index_tranformer.build_vocab()
    with open(vocab_path, 'wb') as f:
        pickle.dump(word_index_tranformer, f)
else:
    with open(vocab_path, 'rb') as f:
        word_index_tranformer = pickle.load(f)
print("Total words amount:" + str(len(word_index_tranformer)))

Total words amount:2821


In [4]:
# dataset
import torch
from torch.utils.data import Dataset
import json

max_sentece_length = 30
class RosDataset(Dataset):
    def __init__(self, data_path, train=True):
        # Load the dataset
        # train: True for the training or validation set (items include the label), False for the test set (no label)
        self.train = train
        with open(data_path, encoding="utf-8") as f:
            self.data_list = json.load(f)

    def __getitem__(self, index):
        # Return the example at the given index
        cuted_text1 = tokenize(self.data_list[index]["query"])
        cuted_text2 = tokenize(self.data_list[index]["title"])
        indexed_text1 = torch.LongTensor(word_index_tranformer.words2index_transform(cuted_text1, max_len=max_sentece_length))
        indexed_text2 = torch.LongTensor(word_index_tranformer.words2index_transform(cuted_text2, max_len=max_sentece_length))
        if(self.train):
            label = int(self.data_list[index]["label"])
            return label, indexed_text1, indexed_text2
        else:
            return indexed_text1, indexed_text2

    def __len__(self):
        # Return the total number of examples
        return len(self.data_list)

train_dataset = RosDataset(train_data_path)
dev_dataset = RosDataset(dev_data_path)
test_dataset = RosDataset(test_data_path, train=False)
print(train_dataset[0], "\n", len(train_dataset))

(3, tensor([ 2,  3,  4,  5,  6,  6,  7,  8,  9, 10, 11,  4, 12, 13,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]), tensor([ 2,  3,  4,  5, 14,  6,  6,  7, 15, 16, 17,  9, 10, 11, 12, 13, 18, 19,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])) 
 24174


In [5]:
# dataloader
from torch.utils.data import DataLoader
import torch

# The training set uses batch_size=128
train_data_loader = DataLoader(dataset=train_dataset,batch_size=128,shuffle=True)
# Evaluation does not need shuffling; the test loader must keep the file order
dev_data_loader = DataLoader(dataset=dev_dataset,batch_size=1,shuffle=False)
test_data_loader = DataLoader(dataset=test_dataset,batch_size=1,shuffle=False)
for index, (label, indexed_text1, indexed_text2) in enumerate(train_data_loader):
    if(index > 0):
        break
    print(f"{index}:{label},{indexed_text1},{indexed_text2}")

0:tensor([2, 0, 2, 1, 3, 2, 3, 3, 3, 2, 3, 2, 0, 2, 3, 3, 1, 1, 3, 3, 1, 1, 3, 1,
        3, 2, 2, 0, 0, 3, 1, 3, 3, 0, 3, 0, 0, 0, 3, 0, 3, 2, 0, 3, 1, 1, 0, 3,
        3, 2, 3, 2, 2, 0, 3, 3, 3, 0, 3, 0, 2, 2, 2, 0, 3, 3, 2, 0, 1, 2, 3, 1,
        3, 3, 2, 2, 2, 3, 3, 3, 1, 1, 3, 3, 2, 3, 1, 0, 2, 2, 0, 3, 2, 1, 3, 3,
        2, 3, 3, 0, 1, 3, 0, 3, 3, 3, 3, 3, 0, 3, 1, 1, 3, 3, 2, 2, 0, 3, 3, 3,
        0, 2, 2, 2, 3, 1, 3, 2]),tensor([[ 799, 1687,  202,  ...,    1,    1,    1],
        [ 191,  142,   24,  ...,    1,    1,    1],
        [  27,   54,   12,  ...,    1,    1,    1],
        ...,
        [ 722,  430,  108,  ...,    1,    1,    1],
        [ 341,  342,  201,  ...,    1,    1,    1],
        [ 701,   46,  199,  ...,    1,    1,    1]]),tensor([[343,  72, 799,  ...,   1,   1,   1],
        [295, 199, 237,  ...,   1,   1,   1],
        [  2,   3,  54,  ...,   1,   1,   1],
        ...,
        [ 73,  74,  82,  ...,   1,   1,   1],
        [341, 342,  29,  ...,   1,   1,   

In [6]:
# Siamese network
import torch.nn as nn
import torch.nn.functional as F
import torch

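class SiameseNetwork(nn.Module):
    def __init__(self):
        super(SiameseNetwork, self).__init__()
        # Character embedding; the PAD row is initialised to zeros and not updated
        self.embedding = nn.Embedding(num_embeddings=len(word_index_tranformer),embedding_dim=384,padding_idx=word_index_tranformer.PAD)
        # First encoder, shared by both sentences; output size is 256*2 (bidirectional)
        self.gru1 = nn.GRU(input_size=384,hidden_size=256,num_layers=3,batch_first=True,bidirectional=True)
        # Second encoder over [encoding ; aligned encoding], hence input size 256*4
        self.gru2 = nn.GRU(input_size=256*4,hidden_size=256,num_layers=6,batch_first=True,bidirectional=True)
        # Classifier over the two concatenated sentence vectors (256*2 each)
        self.dnn = nn.Sequential(
            nn.Linear(256*4,256),
            nn.ELU(inplace=True),
            nn.BatchNorm1d(256),
            nn.Dropout(0.3),

            nn.Linear(256,64),
            nn.ELU(inplace=True),
            nn.BatchNorm1d(64),
            nn.Dropout(0.3),

            nn.Linear(64, 4)
        )


    def forward(self, input1, input2):
        # Boolean masks that are True at PAD positions
        mask1 = input1.eq(word_index_tranformer.PAD)
        mask2 = input2.eq(word_index_tranformer.PAD)
        input1 = self.embedding(input1)
        input2 = self.embedding(input2)
        # Both sentences go through the same gru1 (shared weights)
        output1,hidden_state1 = self.gru1(input1)
        output2,hidden_state2 = self.gru1(input2)

        # Cross-attention between the two encoded sentences
        output1_align, output2_align = self.soft_attention_align(output1, output2, mask1, mask2)
        output1 = torch.cat([output1, output1_align], 2)
        output2 = torch.cat([output2, output2_align], 2)

        # Both sentences go through the same gru2 (shared weights)
        gru2_output1,gru2_hidden_state1 = self.gru2(output1)
        gru2_output2,gru2_hidden_state2 = self.gru2(output2)

        # Use the output at the last time step of each sequence as its sentence vector
        out = torch.cat([gru2_output1[:,-1,:], gru2_output2[:,-1,:]], dim=-1)
        out = self.dnn(out)

        # Log-probabilities over the 4 relevance classes (used with nll_loss)
        return F.log_softmax(out, dim=-1)


    def soft_attention_align(self, x1, x2, mask1, mask2):
        # Additive masks: 0 at real tokens, -inf at PAD positions
        mask1 = mask1.float().masked_fill_(mask1, float("-inf"))
        mask2 = mask2.float().masked_fill_(mask2, float("-inf"))

        # Dot-product score between every x1 position and every x2 position
        attention_weight = x1.bmm(x2.transpose(1, 2))
        # For each x1 position: attention over x2 (PAD columns get weight 0)
        x1_weight = F.softmax(attention_weight + mask2.unsqueeze(1), dim=-1)
        x2_output = x1_weight.bmm(x2)

        # For each x2 position: attention over x1 (PAD columns get weight 0)
        x2_weight = F.softmax(attention_weight.transpose(1, 2) + mask1.unsqueeze(1), dim=-1)
        x1_output = x2_weight.bmm(x1)

        # Note: x1_output has one row per x2 position and x2_output one row per
        # x1 position; concatenating them in forward() relies on both inputs
        # being padded to the same length
        return x1_output, x2_output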
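
The masking trick used in `soft_attention_align` can be checked in isolation. The sketch below uses plain Python lists instead of tensors (the function `masked_softmax` is illustrative and not part of the model): adding -inf to the score of a padded position makes its softmax weight exactly 0. It assumes each sequence has at least one real token.

```python
import math


def masked_softmax(scores, is_pad):
    """Softmax over scores after adding -inf at padded positions."""
    neg_inf = float("-inf")
    masked = [s + (neg_inf if pad else 0.0) for s, pad in zip(scores, is_pad)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Two real tokens followed by two PAD positions with large raw scores
weights = masked_softmax([2.0, 1.0, 5.0, 5.0], [False, False, True, True])
print(weights)
```

Even though the padded positions have the largest raw scores, all of the attention mass ends up on the two real tokens.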

# GRU+Attention Siamese Network
![](resrc\GRU+Attention.png)
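## Notes
- The embedding layer converts each index into a vector of length 384; it is not drawn in the figure
- An overline marks a matrix of max_sentence length
- GRU1_A and GRU1_B, and GRU2_A and GRU2_B, share weights
    - In the implementation, both sentences simply pass through the same network
    - Siamese neural network
        - Maps the two sentences into a semantic space and measures their similarity there
            - idea: change how sentences are represented in the semantic space so that it fits the classification defined by the task
        - One option is to take the final output embedding vectors directly and treat the task as binary classification with Contrastive Loss
            - [Manhattan LSTM Model](file:///D:/Files/NLP%20Project/works/TianChi/paper/KUAKE/Siamese%20Recurrent%20Architectures%20for%20Learning%20Sentence%20Similarity.pdf)
            - The idea of Contrastive Loss is to pull similar pairs close together and push dissimilar pairs apart
            - idea: extend Contrastive Loss to fit this particular n-class problem
                - The 4 classes of the task are not simply similarity levels ordered from high to low; they also carry some semantic information
                - TODO: design a loss so that the similarity measured between sentence vectors fits the task definition better
        - Since this is a four-class problem, FC + softmax is used directly here
            - [Siamese Recurrent Neural Network](file:///D:/Files/NLP%20Project/works/TianChi/paper/KUAKE/Learning%20Text%20Similarity%20with%20Siamese%20Recurrent%20Networks.pdf)
            - One could use the FC layers to produce a sentence embedding vector and then measure similarity
            - Here the FC layers serve as the classifier
- Soft attention is the mutual attention between the two sentences
    - Note that in this layer padding positions are filled with -inf, so they receive no weight in the attention, i.e. padding is ignored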
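
To illustrate the Contrastive Loss idea mentioned in the notes above, here is a minimal sketch for a single pair in plain Python. It uses a common formulation, `y * d**2 + (1 - y) * max(0, m - d)**2`, with Euclidean distance `d` and margin `m`. This is an illustration only: it is not necessarily the exact loss of the linked papers and it is not what this notebook trains with, and the names `contrastive_loss` and `margin` are assumptions.

```python
import math


def contrastive_loss(vec_a, vec_b, similar, margin=1.0):
    """Contrastive loss for one pair of sentence vectors.

    similar=1 penalizes distance (pulls the pair together);
    similar=0 penalizes closeness until the distance reaches `margin`.
    """
    distance = math.dist(vec_a, vec_b)  # Euclidean distance
    pull = similar * distance ** 2
    push = (1 - similar) * max(0.0, margin - distance) ** 2
    return pull + push


near_similar = contrastive_loss([0.0, 0.0], [0.3, 0.4], similar=1)
near_dissimilar = contrastive_loss([0.0, 0.0], [0.3, 0.4], similar=0)
far_dissimilar = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=0)
print(near_similar, near_dissimilar, far_dissimilar)
```

A dissimilar pair that is already farther apart than the margin contributes no loss, which is what lets the embedding space stop pushing once pairs are separated enough.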
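## Experiments
- Increasing the number of GRU layers raised validation accuracy by more than ten percentage points
- Beyond roughly 10 layers, however, performance dropped sharply
    - One more detail worth noting (layers in gru1 + gru2: validation accuracy):
        - 5+5: 28%
        - 3+6: 55%
- Similar models for other tasks include a pooling layer, but I have not verified its effect here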

In [7]:
from tqdm import tqdm
from torch.optim import Adam


def train(epochs, model, model_path=None, optimizer_path=None, device=None):
    # device: device the model runs on
    # model_path: path to save the model state_dict to after training
    # optimizer_path: path to save the optimizer state_dict to after training
    model = model.to(device)
    model.train()
    optimizer = Adam(model.parameters(), lr=0.001)
    t = tqdm(range(epochs), desc="Train")
    for epoch in t:
        for index, (label, text1, text2) in enumerate(train_data_loader):
            if device is not None:
                label = label.to(device)
                text1 = text1.to(device)
                text2 = text2.to(device)
            optimizer.zero_grad()
            output = model(text1, text2)
            # The model returns log-probabilities, so nll_loss gives cross-entropy
            loss = F.nll_loss(output, label)
            t.set_description(f"epoch:{epoch}")
            t.set_postfix(loss=loss.item())
            loss.backward()
            optimizer.step()

    if model_path is not None:
        torch.save(model.state_dict(), model_path)
    if optimizer_path is not None:
        torch.save(optimizer.state_dict(), optimizer_path)
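
`train` pairs the model's `F.log_softmax` output with `F.nll_loss`, which together amount to cross-entropy. Below is a plain-Python sketch of that computation for a single example; the function names and the scores are illustrative, and the real code operates on batched tensors and averages the loss over the batch.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


logits = [1.0, 2.0, 0.5, 3.0]  # made-up scores for the 4 relevance classes
log_probs = log_softmax(logits)
loss = nll_loss(log_probs, target=3)
print(round(loss, 4))
```

The loss is small when the target class has the highest score and grows as probability mass moves to other classes.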

In [8]:
def evaluation_accuracy(model, test_data_loader, device=None):
    # model: the model whose accuracy is measured
    # test_data_loader: DataLoader yielding (label, text1, text2) batches
    # device: device to run on
    count_correct = 0
    total = 0
    model = model.to(device)
    model.eval()
    with torch.no_grad():
        for label, text1, text2 in tqdm(test_data_loader, desc="Evaluation"):
            if device is not None:
                label = label.to(device)
                text1 = text1.to(device)
                text2 = text2.to(device)
            # argmax over the class dimension works for any batch size
            prediction = model(text1, text2).argmax(dim=-1)
            count_correct += (prediction == label).sum().item()
            total += label.size(0)
    print(f"\n{count_correct}/{total}")
    return count_correct / total

In [9]:
train_mode = False
model = SiameseNetwork()
model_path = "models/QTR_SiameseNetwork.pth"
optimizer_path = "models/QTR_SiameseNetwork_optim.pth"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if train_mode:
    train(epochs=37, model=model, model_path=model_path, optimizer_path=optimizer_path, device=device)
else:
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    model.load_state_dict(torch.load(model_path, map_location=device))

print('\n' + str(evaluation_accuracy(model, dev_data_loader, device=device)))

Evaluation: 100%|██████████| 2913/2913 [02:07<00:00, 22.81it/s]
1608/2913

0.5520082389289392



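- At one point tuning gave an accuracy of a little over 0.60
- but I changed the hyperparameters afterwards and forgot to save them
- and since the model is fairly large, one training run takes thirty to forty minutes
- so I did not feel like tuning it again orz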

In [10]:
# dump to json
dump_file_path = "result/KUAKE-QTR_test.json"
with open(test_data_path,'r',encoding="utf-8") as source:
    data = json.load(source)

model = model.to(device)
model.eval()
with torch.no_grad():
    # test_data_loader uses batch_size=1 and shuffle=False, so index lines up with data[index]
    for index,(text1, text2) in tqdm(enumerate(test_data_loader), desc="Evaluation", total=len(test_data_loader)):
        if device is not None:
            text1 = text1.to(device)
            text2 = text2.to(device)
        data[index]["label"] = str(model(text1, text2).argmax().item())

# Serialize once, after every prediction has been filled in
with open(dump_file_path,'w',encoding="utf-8") as destination:
    destination.write(json.dumps(data, ensure_ascii=False))

Evaluation: 100%|██████████| 5465/5465 [04:52<00:00, 18.71it/s]
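
The write-back step can be sketched without the model. Below, `fill_labels` is a hypothetical helper that attaches one predicted label per record as a string and serializes the list once; `ensure_ascii=False` keeps the Chinese text readable in the output instead of escaping it. The record and the prediction are made up.

```python
import json


def fill_labels(records, predictions):
    """Return a JSON string with one string-valued label per record."""
    if len(records) != len(predictions):
        raise ValueError("need exactly one prediction per record")
    labelled = [dict(record, label=str(pred))
                for record, pred in zip(records, predictions)]
    return json.dumps(labelled, ensure_ascii=False)


# Made-up test record and prediction
records = [{"query": "排卵期有什么症状", "title": "女性排卵期有什么症状？"}]
result = fill_labels(records, [3])
print(result)
```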
