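### Pipeline overview
- Document preprocessing packs sentences: each segment holds as many consecutive sentences as fit within the maximum input length, on the assumption that an answer is a contiguous run of complete sentences.
- For positive and negative sample selection, training uses all positive samples plus one randomly chosen negative sample.
- Training is multi-task: the model predicts both whether the answer lies in the current segment and where the answer span is located. The two losses are added with suitable weights.
- At prediction time, every token outputs a start logit and an end logit. The top-k start positions and top-k end positions are combined into candidate spans, and each span is scored as span_score = start_logit + end_logit - cls_start_logit - cls_end_logit. Candidates are sorted by span score in descending order, and the highest-scoring span is taken as the final answer.
- Further gains could come from better retrieval accuracy, from changes to the re-ranking score and ranking method, and from the multi-task training setup, all of which improve single-model performance. A better-suited pretrained language model should also improve results directly.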
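
The sentence-packing step in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the author's preprocessing code: the helper names `split_sentences` and `pack_sentences`, the punctuation set used for splitting, and the character-based length limit are all assumptions made for the example. The notebook cells further down use the tokenizer's stride-based windows instead.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation (assumed set: 。！？；)."""
    return re.findall(r"[^。！？；]*[。！？；]+|[^。！？；]+", text)

def pack_sentences(sentences, max_chars):
    """Greedily group consecutive sentences into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for sent in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sent) > max_chars:
            chunks.append(current)
            current = ""
        current += sent  # a single over-long sentence ends up as its own chunk
    if current:
        chunks.append(current)
    return chunks

text = "甲乙丙。丁戊！己庚辛壬？癸。"
chunks = pack_sentences(split_sentences(text), max_chars=8)
print(chunks)
```

Because chunks only break at sentence boundaries, an answer made of complete sentences is never cut in the middle of a sentence, although a multi-sentence answer can still straddle two chunks.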

In [1]:
import torch

from datasets import Features, Value, ClassLabel, load_dataset, load_metric
from pathlib import Path
from transformers import AutoTokenizer
from transformers import TrainingArguments, Trainer
from transformers import default_data_collator

from model import BertForQuestionAnsweringWithMultiTask
from utils import  Prepare_Train_Features,show_random_elements,Prepare_Train_Features_For_CRF


In [3]:
PATH = Path(r"C:\Users\hp\Desktop\2021.02.24 BertForQuestionAnsweringWithMultiTask\squad_chinese")
 
features = Features({'answers': Value('string'), 'context': Value('string'), 'id': Value('string'), 'question': Value('string'), 'title': Value('string')})

file_dict = {'train': PATH/'train-zen-v1.0.json','dev': PATH/'dev-zen-v1.0.json'} # load the training and dev sets

datasets = load_dataset(
                       path = r"C:\Users\hp\Desktop\2021.02.24 BertForQuestionAnsweringWithMultiTask\squad_chinese", 
                       data_files=file_dict, 
                       script_version='master', 
                       #split='train',
                       cache_dir=r"squad_chinese",
                       
                         
                        )


Using custom data configuration default


Downloading and preparing dataset squad/default-16fd39ceb5503ccf (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to squad_chinese\squad\default-16fd39ceb5503ccf\0.0.0\6f2fe9bd41d2e840220525d0b40a863add682e883599aca2de6940b809fa66a5...



Dataset squad downloaded and prepared to squad_chinese\squad\default-16fd39ceb5503ccf\0.0.0\6f2fe9bd41d2e840220525d0b40a863add682e883599aca2de6940b809fa66a5. Subsequent calls will reuse this data.


In [4]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4997
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4997
    })
})

In [5]:
datasets['validation'][0]

{'answers': {'answer_start': [41],
  'text': ['省人社厅、省工信厅、省教育厅、省财政厅、省交通运输厅、省卫健委']},
 'context': '福建：6部门联合出台暖企措施支持复工稳岗 为解决企业复产的用工困难，经省政府同意，省人社厅、省工信厅、省教育厅、省财政厅、省交通运输厅、省卫健委联合下发通知，出台一系列暖企措施支持疫情防控期间复工稳岗。 通知明确，切实发挥各级农民工工作领导小组办公室的统筹协调作用, 加强劳务用工有效对接，对具备外出务工条件、可成规模输送到我省用工地，并在出行前14天内及在途没有相关症状的，可由用工地和输出地联合开展“点对点、一站式”直达企业的专门运输。省级公共就业服务机构可与主要劳务输出省份签订劳务协作协议、设立劳务协作工作站，对每个工作站给予一次性10万元就业服务经费补助。鼓励优先聘用本地劳务人员。 未经省应对新冠肺炎疫情工作有关机构确认的疫情防控急需物资生产企业引进劳动力的，一次性用工服务奖补标准最高提到每人2000元。对上述企业坚持在生产一线工作的职工，给予每人每天100元的生活补助，纳入一次性用工服务奖补范畴。对春节当月至疫情一级响应结束月，采取稳定职工队伍保持连续生产的企业，给予一次性稳就业奖补。 加大失业保险稳岗返还力度，将中小微企业稳岗返还政策裁员率标准调整为不高于上年度全国调查失业率的控制目标，对参保职工30人（含）以下的企业，裁员率调整为不超过企业参保职工总数的20%。对不裁员或少裁员，符合条件的参保企业，可返还其上年度实际缴纳失业保险费的50%。对受疫情影响面临暂时性生产经营困难且恢复有望、坚持不裁员或少裁员、符合条件的参保企业，按6个月的当地月人均失业保险金和参保职工人数落实失业保险稳岗返还政策。 加强职业技能培训，鼓励技工院校学生在符合疫情防控条件下参加实习实训，探索简易岗前技能培训。对企业因生产急需新录用的人员，按每人200元标准一次性给予企业简易岗前技能培训补贴。鼓励实施线上培训，对受疫情影响的企业，在停工期、恢复期组织职工参加各类线上或线下职业培训的，可按规定纳入补贴类培训范围。 通知要求，各地要着力提升政策措施的精准度和有效性，提升各类企业享受政策措施的获得感。各类企业要落实落细防控主体责任，严格落实返岗信息登记、班车错峰接送、员工分散用餐、体温监测等具体应对措

In [6]:
show_random_elements(datasets["train"], num_examples=1)

Unnamed: 0,answers,context,id,question,title
0,"{'answer_start': [1221], 'text': ['1个月房租免收、2个月租金减半。']}",江西省人民政府印发关于有效应对疫情稳定经济增长20条政策措施的通知 各市、县（区）人民政府，赣江新区管委会，省政府各部门： 现将《关于有效应对疫情稳定经济增长20条政策措施》印发给你们，请认真贯彻执行。 2020年2月4日 （此件主动公开） 关于有效应对疫情稳定经济增长20条政策措施 为深入贯彻习近平总书记关于坚决打赢疫情防控阻击战的重要指示精神，全面落实党中央、国务院有关决策部署，在全力做好疫情防控工作的同时，着力促进全省经济平稳增长，现提出以下政策措施： 一、加强对疫情防控物资和生活必需品生产企业的扶持 1.支持全国性商业银行、国家开发银行、农业发展银行、进出口银行等在赣分支机构加大服务对接力度，全力满足疫情防控融资需求。实行疫情防控重点企业融资白名单制，支持江西银行、九江银行和进贤农商行利用专项再贷款为企业提供优惠利率信贷支持，最高不得超过最近公布的一年期lpr减100个基点。（省金融监管局、人行南昌中心支行牵头，江西银保监局、省工业和信息化厅、省发展改革委等配合） 2.对2020年新增的全省疫情防控重点保障企业专项再贷款，在人民银行专项再贷款支持金融机构提供优惠利率信贷、中央财政按人民银行再贷款利率的50%给予贴息的基础上，省财政统筹资金再给予25%的贴息支持，贴息期限不超过1年。对疫情防控重点保障企业给予稳岗补贴和创业担保贷款支持。对在疫情防控、生活必需品保供稳价工作中主动让利的重点企业和商户，各地可从价格调节基金或其他可用财力中给予一定补助，在项目安排等扶持政策上给予倾斜。（省财政厅、省人力资源社会保障厅、省发展改革委牵头，省工业和信息化厅、人行南昌中心支行，各市、县〔区〕人民政府、赣江新区管委会配合） 3.积极帮助疫情防控重点物资和生活必需品生产企业复工复产，安排专人进行“一对一”蹲点帮扶，协调解决设备、原辅料、人工、资金、运输及用能等实际困难。对扩大疫情防控重点物资产能的企业，经主管部门同意后，可先扩产再补办相关审批手续。疫情防控重点物资生产企业扩大产能、改造生产线发生的实际投入，纳入省级企业技术改造项目支持。（省工业和信息化厅、省发展改革委牵头，省人力资源社会保障厅、省卫生健康委、省财政厅等配合） 4.全省药品补充申请、再注册收费标准和二类医疗器械首次注册、变更注册、延续注册收费标准降低30%。（省发展改革委牵头，省财政厅、省市场监管局等配合） 5.严格落实“一断三不断”要求，稳妥处置未经批准擅自设卡拦截、断路阻碍交通等行为，确保疫情防控物资和必要的生产生活物资运输通畅。简化绿色通道查验手续和程序。对于疫情防控应急物资、由省新型冠状病毒感染的肺炎疫情防控应急指挥部或有关成员单位统一调拨转运的重要生活物资等保障车辆，疫情期间免除高速公路通行费用。（省交通运输厅、省公安厅牵头，各市、县〔区〕人民政府、赣江新区管委会配合） 二、扶持实体企业渡难关 6.对承租国有资产类生产经营用房的企业，1个月房租免收、2个月租金减半。对租用其他经营用房的，鼓励业主为租户减免租金，具体由双方协商解决。对在疫情期间为承租的中小企业减免租金的创业园、科技企业孵化器、创业孵化基地等各类载体，优先予以政策扶持。（省财政厅、省国资委牵头，省工业和信息化厅、省科技厅、省人力资源社会保障厅，各市、县〔区〕人民政府、赣江新区管委会配合） 7.对受疫情影响较大的批发零售、住宿餐饮、物流运输、文化旅游等行业企业和未能及时充分复工的工业企业，及时辅导落实好小微企业普惠性减税等政策。对因疫情原因导致企业发生重大损失，正常生产经营活动受到重大影响，纳税确有困难的，依法予以减免房产税、城镇土地使用税。（省税务局牵头，各市、县〔区〕人民政府、赣江新区管委会配合） 8.对按月申报的纳税人、扣缴义务人，将2020年2月份的法定申报纳税期限延长至2月24日；在申报纳税期限延长后，办理仍有困难的，可依法申请进一步延期，延期期间不征收滞纳金。纳税人受疫情影响确有特殊困难不能按期缴纳税款的，还可依法申请办理延期缴纳税款，最长不超过3个月。（省税务局牵头） 9.加大对企业的金融支持，确保2020年企业信贷余额不低于2019年同期余额。鼓励金融机构适当下调贷款利率，增加信用贷款和中长期贷款额度。引导中小微企业通过江西省小微客户融资服务平台、一站式金融综合服务平台申请贷款。对受疫情影响较大的批发零售、住宿餐饮、物流运输、文化旅游、“三农”领域等行业，以及有发展前景但受疫情影响暂时受困的企业，金融机构不得盲目抽贷、断贷、压贷；对到期还款困难的企业，应予以展期或续贷。（省金融监管局牵头，人行南昌中心支行、江西银保监局等配合） 
10.各级政府性融资担保再担保机构要加强与银行机构合作，针对受疫情影响严重行业和疫情防控行业定制担保产品，对因疫情暂遇困难企业特别是小微企业，取消反担保抵质押要求，降低担保费。对受疫情影响严重地区的融资担保机构，省融资担保股份有限公司及各设区市再担保机构减半收取再担保费。（省金融监管局牵头，人行南昌中心支行、江西银保监局，各市、县〔区〕人民政府、赣江新区管委会配合） 11.充分发挥国有大中型企业的中坚作用，依法依规在货款回收、原材料供应、项目发包等方面，加大对产业链上中小企业的支持，确保产业链运行平稳。（省国资委牵头，省工业和信息化厅，各市、县〔区〕人民政府、赣江新区管委会配合） 三、以扩投资为重点稳需求 12.积极推广不见面招商，充分利用赣服通、政务网、公众号等平台，高频次、高精度、大范围进行招商引资项目推介。对成熟且有签约意向的项目，要加强网上对接、洽谈力度，确保尽快签约。对已签约项目，要全力做好项目的立项、开工、投产全过程服务，确保项目尽快落地。（省商务厅牵头，各市、县〔区〕人民政府、赣江新区管委会配合） 13.充分发挥全省投资项目在线审批监管平台作用，全面推广网上收件、网上审批和网上出件。对按规定确需提交纸质材料原件的，除特殊情况外，由项目单位通过在线平台或电子邮件提供电子材料后先行办理；项目单位应对提供的电子材料真实性负责，待疫情结束后补交纸质材料原件。（省发展改革委牵头，省住房城乡建设厅、省自然资源厅、省生态环境厅、省水利厅，各市、县〔区〕人民政府、赣江新区管委会配合）。 14.将省重点工程建设单位人员疫情防控、生活物资保障、施工物资供应等工作纳入地方保供范围，切实协调解决项目建设涉及的市政配套、水电接入、资金落实等问题。对项目建设中确因受疫情影响或疫情防控工作需要不能按时履行合同的，允许合理延后合同执行期限，不作违约处理。（省发展改革委牵头，省直有关部门，各市、县〔区〕人民政府、赣江新区管委会配合） 15.对确因受疫情影响不能及时复产履约的外贸企业，指导企业向国家有关部门申请“新型冠状病毒感染的肺炎疫情不可抗力事实证明”，及时帮助企业最大限度减小损失。设立进出口商品绿色通道，确保进出口商品快速通关。鼓励中国信保江西分公司为因疫情遇到困难的出口企业提供风险保障、保单融资等服务。（省商务厅、南昌海关牵头） 16.支持传统商贸主体电商化、数字化改造升级，积极培育网络诊疗、在线办公、在线教育、线上文化娱乐、影视及智能家居等新兴消费业态和消费热点，繁荣“宅经济”。大力实施“互联网+”农产品出村进城工程，加快城乡商品要素流动。（省商务厅牵头，省发展改革委、省工业和信息化厅、省卫生健康委、省文化和旅游厅、省教育厅、省农业农村厅等配合） 四、加大企业稳岗和就业促进力度 17.鼓励受疫情影响企业与职工协商采取调整薪酬（生活费）、轮岗轮休、缩短工时、待岗等方式稳定工作岗位。对不裁员或少裁员的参保企业，可返还其上年度实际缴纳失业保险费的50％。2020年1月1日至2020年12月31日，对面临暂时性生产经营困难、坚持不裁员或少裁员的参保企业，返还标准按上年度6个月的统筹地月人均失业保险金和企业上年度月均参保职工人数确定。所需资金从失业保险基金列支。（省人力资源社会保障厅、省财政厅牵头） 18.对受疫情影响的企业、灵活就业人员和城乡居民，未能按时办理社会保险缴费业务的，可延长至疫情解除后补办。逾期缴纳社会保险费期间，免收滞纳金，不影响个人权益。相关补办手续在疫情解除后三个月内完成。（省人力资源社会保障厅牵头，省医保局、省财政厅等配合） 19.对已发放的个人创业担保贷款，受疫情影响还款出现困难的，可向贷款银行申请展期还款，展期期限原则上不超过1年，省财政继续给予贴息支持；对受疫情影响未能按时完成展期手续的，免于信用惩戒。对受疫情影响暂时失去收入来源的个人和小微企业，各有关部门要在其申请创业担保贷款时优先给予支持。（省人力资源社会保障厅牵头，省财政厅、省金融监管局、人行南昌中心支行、省发展改革委等配合） 20.密切关注全省农民工、应届毕业生等重点群体就业状况和省内外用工需求，充分发挥人力资源市场网络平台作用，及时发布就业信息、企业开复工信息，开展网上招聘。建立返乡务工人员滞留省内就业应对机制，促进与用工企业精准对接。（省人力资源社会保障厅牵头，省发展改革委、省商务厅、省工业和信息化厅，各市、县〔区〕人民政府、赣江新区管委会配合）,762fdb59ef70306fb8c28171b429ef29,江西省对承租国有资产类生产经营用房的企业有何政策？,


In [7]:
model_checkpoint = r'chinese-bert-wwm-ext'
base_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [8]:
chars = list(set(''.join(list(set(datasets["train"]["context"])))))

In [9]:
add_vocabs = ['’','‘','“','”']
for i in chars:
    if i not in base_tokenizer.vocab.keys():
        add_vocabs.append(i) 

In [10]:
with open('voc.txt','a+',encoding = 'utf-8') as f:
    for i in add_vocabs:
        f.write(str(i) +'\n')

In [11]:
with open('voc.txt','r',encoding = 'utf-8') as f:
    add_vocabs = [i.strip() for i in f.readlines()]
    
base_tokenizer.add_tokens(add_vocabs)


132

In [12]:
class Prepare_Train_Features_For_CRF:
    def __init__(self,tokenizer,max_length = 384,stride = 128,pad_on_right = "right"):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.stride = stride
        self.pad_on_right = pad_on_right
        
    def FindOffset(self,tokens_id, answer_id):
        n = len(tokens_id)
        m = len(answer_id)
        if n < m:
            return False
        for i in range(n - m + 1):
            if tokens_id[i:i + m] == answer_id:
                return (i, i + m)
        return False

    def prepare_train_features(self,examples):
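        # Tokenize the examples with truncation and padding, but keep the overflow using a stride.
        # A long context can therefore yield several features for one example,
        # each with a context that overlaps slightly with that of the previous feature.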
        tokenized_examples = self.tokenizer(
            examples["question" if self.pad_on_right else "context"],
            examples["context" if self.pad_on_right else "question"],
            truncation="only_second" if self.pad_on_right else "only_first",
            max_length= self.max_length,
            stride= self.stride,
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length",
        )
        
        # Since one example might give us several features if it has a long context, we need a map from a feature to
        # its corresponding example. This key gives us just that.
        sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
        # The offset mappings will give us a map from token to character position in the original context. This will
        # help us compute the start_positions and end_positions.
        offset_mapping = tokenized_examples.pop("offset_mapping")

        # Let's label those examples!
        tokenized_examples["start_positions"] = []
        tokenized_examples["end_positions"] = []
        tokenized_examples["answer_offset"] = []
        tokenized_examples["answer_seq_label"] = []
        tokenized_examples["labels"] = []
        for i, offsets in enumerate(offset_mapping):
            # We will label impossible answers with the index of the CLS token.
            input_ids = tokenized_examples["input_ids"][i]           
            answer_seq_label = len(input_ids) * [0]
            #print(len( answer_seq_label))
            cls_index = input_ids.index(self.tokenizer.cls_token_id)

            # Grab the sequence corresponding to that example (to know what is the context and what is the question).
            sequence_ids = tokenized_examples.sequence_ids(i)

            # One example can give several spans, this is the index of the example containing this span of text.
            sample_index = sample_mapping[i]
            
           
            answers = examples["answers"][sample_index]
            
            # If no answers are given, set the cls_index as answer.
            if len(answers["answer_start"]) == 0:
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
                tokenized_examples["answer_offset"].append((-1,-1))
                tokenized_examples["answer_seq_label"].append(answer_seq_label)
                tokenized_examples["labels"].append(0)
            else:
                # Start/end character index of the answer in the text.
                start_char = answers["answer_start"][0]
                end_char = start_char + len(answers["text"][0])

                # Start token index of the current span in the text.
                token_start_index = 0
                while sequence_ids[token_start_index] != (1 if self.pad_on_right else 0):
                    token_start_index += 1

                # End token index of the current span in the text.
                token_end_index = len(input_ids) - 1
                while sequence_ids[token_end_index] != (1 if self.pad_on_right else 0):
                    token_end_index -= 1

                # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
                if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                    tokenized_examples["start_positions"].append(cls_index)
                    tokenized_examples["end_positions"].append(cls_index)
                    tokenized_examples["answer_offset"].append((-1,-1))
                    tokenized_examples["answer_seq_label"].append(answer_seq_label)
                    tokenized_examples["labels"].append(0)

                else:
                    
                   
                    # Otherwise move token_start_index and token_end_index to the two ends of the answer.
                    # Note: we could go past the last offset if the answer is the last character (rare edge case).
                    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                        token_start_index += 1
                    tokenized_examples["start_positions"].append(token_start_index - 1)
                    while offsets[token_end_index][1] >= end_char:
                        token_end_index -= 1
                    tokenized_examples["end_positions"].append(token_end_index + 1)

                    # Locate the answer's token ids inside input_ids (FindOffset may return False).
                    answer_tokens = self.tokenizer.encode(answers["text"][0])[1:-1]  # strip [CLS] and [SEP]
                    answer_offset = self.FindOffset(input_ids, answer_tokens)
                    if answer_offset:
                        tokenized_examples["answer_offset"].append(answer_offset)
                        answer_seq_label[answer_offset[0]:answer_offset[1]] = [1] * len(answer_tokens)
                        tokenized_examples["answer_seq_label"].append(answer_seq_label)
                        tokenized_examples["labels"].append(1)
                    else:
                        # Answer tokens not found in this feature: use the same (-1, -1) sentinel
                        # as the no-answer branches instead of an undefined or stale answer_offset.
                        tokenized_examples["answer_offset"].append((-1, -1))
                        tokenized_examples["answer_seq_label"].append(answer_seq_label)
                        tokenized_examples["labels"].append(0)

        return tokenized_examples
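
The `FindOffset` helper above is a plain sublist search over token ids. A standalone sketch of the same idea is shown below. The name `find_sublist` and the choice to return `None` rather than `False` are assumptions made for this example, and the token ids are made up.

```python
def find_sublist(haystack, needle):
    """Return (start, end) of the first occurrence of needle in haystack, or None."""
    n, m = len(haystack), len(needle)
    if m == 0 or n < m:
        return None
    for i in range(n - m + 1):
        if haystack[i:i + m] == needle:
            return (i, i + m)  # end index is exclusive
    return None

# Made-up ids: 101 and 102 stand in for [CLS] and [SEP].
input_ids = [101, 7, 8, 102, 5, 7, 8, 9, 102]
print(find_sublist(input_ids, [7, 8, 9]))
print(find_sublist(input_ids, [7, 8]))
```

One caveat with this approach: the search returns the first match, which may fall inside the question or on an earlier repetition of the answer text (the second call above matches at index 1, before the first separator). Restricting the search to the context tokens, or deriving the label from `start_positions` and `end_positions`, would avoid that.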
    
    

In [13]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two part of the context when splitting it is needed.
#prepare_features = Prepare_Train_Features(base_tokenizer,max_length = 384,stride = 128,pad_on_right = "right")
prepare_features = Prepare_Train_Features_For_CRF(base_tokenizer,max_length = 384,stride = 128,pad_on_right = "right")
#ds = prepare_features.prepare_train_features(datasets['train'][21:22])
#ds

In [14]:
base_tokenizer.decode([4689, 782, 4852, 1324, 510, 4689, 2339, 928, 1324, 510, 4689, 3136, 5509, 1324, 510, 4689, 6568, 3124, 1324, 
                       510, 4689, 769, 6858, 6817, 6783, 1324, 510, 4689, 1310, 978, 1999])

'省 人 社 厅 、 省 工 信 厅 、 省 教 育 厅 、 省 财 政 厅 、 省 交 通 运 输 厅 、 省 卫 健 委'

In [15]:
tokenized_datasets = datasets.map(
                                  prepare_features.prepare_train_features, 
                                  batched=True,
                                  load_from_cache_file=False,
                                  remove_columns=datasets["train"].column_names
                                 )






In [16]:
print(tokenized_datasets['train'][1120])

{'answer_offset': [171, 201], 'answer_seq_label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

### Checking that the data is correct

In [17]:
ex_id = 1120
indexs = tokenized_datasets['train'][ex_id]['answer_offset']
print(base_tokenizer.decode(tokenized_datasets['train'][ex_id]['input_ids']))
print(''.join(base_tokenizer.decode(tokenized_datasets['train'][ex_id]['input_ids']).split()[indexs[0]: indexs[1]]))

[CLS] 针 对 重 要 民 生 商 品 ， 市 场 监 管 总 局 已 推 多 家 企 业 向 社 会 做 出 怎 样 的 承 诺 ？ [SEP] 门 从 重 、 从 快 、 从 严 打 击 各 类 价 格 违 法 行 为 。 截 至 2 月 7 日 ， 全 国 市 场 监 管 部 门 共 立 案 涉 及 民 生 商 品 价 格 的 案 件 360 多 件 ， 已 经 处 罚 100 多 件 。 “ 重 拳 出 击 ， 铁 腕 治 乱 ， 决 不 让 违 法 行 为 形 成 气 候 。 ” 他 说 。 据 介 绍 ， 市 场 监 管 总 局 已 推 动 物 美 、 阿 里 巴 巴 、 武 汉 中 百 仓 储 超 市 有 限 公 司 等 30 多 家 企 业 主 动 向 社 会 做 出 承 诺 ， 在 疫 情 防 控 期 间 重 要 民 生 商 品 “ 价 格 不 涨 、 质 量 不 降 、 供 应 不 断 ” 。 截 止 到 目 前 为 止 ， 已 有 150 多 家 企 业 主 动 参 与 ， 涉 及 门 店 1. 6 万 多 家 。 陈 志 江 透 露 ， 从 目 前 巡 查 监 测 的 40 个 城 市 数 据 来 看 ， 总 体 来 说 米 袋 子 、 菜 篮 子 重 要 民 生 商 品 价 格 基 本 平 稳 ， 供 应 充 足 ， 一 些 果 蔬 类 商 品 价 格 下 降 比 较 明 显 。 “ 我 们 将 继 续 关 注 重 要 民 生 商 品 价 格 ， 加 强 巡 查 监 测 ， 强 化 执 法 办 案 ， 引 导 社 会 各 界 共 同 维 护 重 要 民 生 商 品 价 格 秩 序 。 ” 他 说 。 核 酸 检 测 试 剂 可 靠 吗 ？ “ 四 个 最 严 ” 监 管 ！ 新 冠 肺 炎 病 毒 核 酸 [SEP]
在疫情防控期间重要民生商品“价格不涨、质量不降、供应不断”。


In [18]:
from collections import Counter

c = Counter([len(i) for i in tokenized_datasets['train']['answer_seq_label']])
print (dict(c))

{384: 31708}


In [19]:
from collections import Counter

c = Counter([len(i) for i in tokenized_datasets['train']['input_ids']])
print (dict(c))

{384: 31708}
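
The overview mentions training on all positive features plus one random negative. The cells shown here do not include that sampling step, so the following is only a sketch of one possible reading: one negative per source example. The field `example_index` is hypothetical; `prepare_train_features` as written does not store it, so it would have to be added from `sample_mapping`.

```python
import random
from collections import defaultdict

def sample_features(features, seed=2021):
    """Keep every positive feature and one randomly chosen negative per example."""
    rng = random.Random(seed)
    negatives = defaultdict(list)
    kept = []
    for idx, feat in enumerate(features):
        if feat["labels"] == 1:
            kept.append(idx)  # positives are always kept
        else:
            negatives[feat["example_index"]].append(idx)
    for candidates in negatives.values():
        kept.append(rng.choice(candidates))  # one negative per example
    return sorted(kept)

# Toy features: "labels" is 1 when the answer lies inside the feature.
features = [
    {"example_index": 0, "labels": 0},
    {"example_index": 0, "labels": 1},
    {"example_index": 0, "labels": 0},
    {"example_index": 1, "labels": 0},
    {"example_index": 1, "labels": 0},
    {"example_index": 2, "labels": 1},
]
kept = sample_features(features)
print(kept)
```

With the `datasets` library, the resulting indices could then be passed to `Dataset.select` to build the reduced training set.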


In [20]:
import torch
import numpy as np
import torch.nn as nn
from dataclasses import dataclass
from typing import Optional, Tuple
from transformers import AutoModelForQuestionAnswering, BertPreTrainedModel,BertModel
from transformers.modeling_outputs import QuestionAnsweringModelOutput
from utils import CRF

 
@dataclass        
class QuestionAnsweringModelOutputWithMultiTask_CRF(QuestionAnsweringModelOutput):
    loss: Optional[torch.FloatTensor] = None
    categorized_logits: torch.FloatTensor = None
    crf_logits: torch.FloatTensor = None
    start_logits: torch.FloatTensor = None
    end_logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None

        
# Example Usage:- smooth_one_hot(torch.tensor([2, 3]), classes=10, smoothing=0.1)
def smooth_one_hot(true_labels: torch.Tensor, classes: int, smoothing=0.0):
    """
  if smoothing == 0, it's one-hot method
  if 0 < smoothing < 1, it's smooth method
  """
    assert 0 <= smoothing < 1
    confidence = 1.0 - smoothing
    #print(f"Confidence:{confidence}")
    label_shape = torch.Size((true_labels.size(0), classes))
    #print(f"Label Shape:{label_shape}")
    with torch.no_grad():
        true_dist = torch.empty(size=label_shape, device=true_labels.device)
        #print(f"True Distribution:{true_dist}")
        true_dist.fill_(smoothing / (classes - 1))
        #print(f"First modification to True Distribution:{true_dist}")
        true_dist.scatter_(1, true_labels.data.unsqueeze(1), confidence)
    #print(f"Modified Distribution:{true_dist}")
    return true_dist

def cross_entropy(input, target, size_average=True):
    """Cross entropy that accepts soft targets.

    Args:
        input: raw (unnormalized) logits, shape (batch_size, num_classes)
        target: target distribution, can be soft, same shape as input
        size_average: if False, the sum is returned instead of the mean
    """
    logsoftmax = nn.LogSoftmax(dim=1)
    if size_average:
        return torch.mean(torch.sum(-target * logsoftmax(input), dim=1))
    else:
        return torch.sum(torch.sum(-target * logsoftmax(input), dim=1))
      
def loss_fn(start_logits, end_logits, start_positions, end_positions):
    # Use the actual sequence length instead of a hard-coded 384,
    # so inputs padded to another length also work.
    num_positions = start_logits.size(1)
    smooth_start_positions = smooth_one_hot(start_positions, classes=num_positions, smoothing=0.1)
    smooth_end_positions = smooth_one_hot(end_positions, classes=num_positions, smoothing=0.1)

    start_loss = cross_entropy(start_logits, smooth_start_positions)
    end_loss = cross_entropy(end_logits, smooth_end_positions)
    total_loss = start_loss + end_loss

    return total_loss

@dataclass        
class QuestionAnsweringModelOutputWithMultiTask(QuestionAnsweringModelOutput):
    loss: Optional[torch.FloatTensor] = None
    categorized_logits: torch.FloatTensor = None
    start_logits: torch.FloatTensor = None
    end_logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None


class BertForQuestionAnsweringWithMultiTask_CRF(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.high_dropout = nn.Dropout(p=0.5) 
        self.dropout = nn.Dropout(p=0.2) 
        self.bert = BertModel(config)
        self.span_classifier = nn.Linear(config.hidden_size*2, config.num_labels, bias=True)
        self.include_classifier = nn.Linear(config.hidden_size, config.num_labels, bias=True)
        self.CRF_fc1 = nn.Sequential(
            self.high_dropout,
            nn.Linear(config.hidden_size, config.num_labels + 2, bias=True),
            )
        
        self.CRF = CRF(target_size = self.bert.config.num_labels,device= torch.device("cuda"))
        self.CrossEntropyLoss = nn.CrossEntropyLoss()
        self.fc2 = nn.Linear(config.hidden_size, 2, bias=True)
        
        assert config.num_labels == 2
        
        torch.nn.init.normal_(self.span_classifier.weight, std=0.02)
        torch.nn.init.normal_(self.include_classifier.weight, std=0.02)
        
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids= None,
        answer_offset= None,
        answer_seq_label= None,
        labels=None,
        head_mask=None,
        inputs_embeds=None,
        start_positions=None,
        end_positions=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
       
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        bert_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states = True,
            return_dict= True,
        )
        
        
        last_hidden_state_output = bert_output.last_hidden_state[:,0,:]  # [CLS] vector: (batch_size, hidden_size)
        last_hidden_state_output = self.dropout(last_hidden_state_output)
        include_logits = self.include_classifier(last_hidden_state_output)
        
        #################################### CRF #############################################################################
        
        batch_size, seq_length = input_ids[:,1:].size() # sequence length excluding [CLS]

        # CRF mask: all ones (padding is not masked out), created on the same device as the inputs
        mask = torch.ones((batch_size, seq_length), dtype=torch.bool, device=input_ids.device)  # (batch_size, seq_length)
        #print('mask',mask.shape )

        # No [CLS]
        #print(bert_output.last_hidden_state[:,1:,:].shape)
        crf_logits = self.CRF_fc1(bert_output.last_hidden_state[:,1:,:] )
        #_, CRFprediction = self.CRF.forward(feats=crf_logits, mask=mask)   
        
        #################################### Span #############################################################################
        span_hidden_states = bert_output.hidden_states # tuple with one (batch_size, sequence_length, hidden_size) tensor per layer
        span_out = torch.stack((span_hidden_states[-1], span_hidden_states[-2], span_hidden_states[-3], span_hidden_states[-4]), dim=0)  # stack the last four layers
        span_out_mean = torch.mean(span_out, dim=0)
        span_out_max, _ = torch.max(span_out, dim=0)
        span_out = torch.cat((span_out_mean, span_out_max), dim=-1)
        span_logits = torch.mean(torch.stack([ self.span_classifier(self.high_dropout(span_out))for _ in range(5) ], dim=0), dim=0)
        #######################################################################################################################
        start_logits, end_logits = span_logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)  # (batch_size, sequence_length)
        end_logits = end_logits.squeeze(-1)  # (batch_size, sequence_length)
    
        total_loss = None
        if start_positions is not None and end_positions is not None and labels is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)

            # Positions outside the model inputs would break the scatter in smooth_one_hot,
            # so clamp them to the last valid index.
            max_index = start_logits.size(1) - 1
            start_positions = start_positions.clamp(0, max_index)
            end_positions = end_positions.clamp(0, max_index)
            span_loss = loss_fn(start_logits, end_logits, start_positions, end_positions)
            include_loss = nn.CrossEntropyLoss()(include_logits, labels)
            crf_loss = self.CRF.neg_log_likelihood_loss(feats=crf_logits, mask=mask, tags=answer_seq_label[:,1:] )
            total_loss = span_loss + include_loss + crf_loss
            
        return QuestionAnsweringModelOutputWithMultiTask_CRF(
                        loss= total_loss,
                        start_logits=start_logits,
                        end_logits=end_logits,
                        categorized_logits = include_logits  ,
                        crf_logits=crf_logits,
                        hidden_states= bert_output.hidden_states,
                        attentions= bert_output.attentions  )
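
To make the span loss above more concrete, here is a small pure-Python walk-through of label smoothing followed by cross entropy against a soft target. It mirrors the idea behind `smooth_one_hot` and `cross_entropy` for a single example. The function names and the toy logits are assumptions chosen for illustration, not part of the notebook.

```python
import math

def smoothed_target(label, classes, smoothing):
    """Put 1 - smoothing on the true class and spread smoothing evenly over the rest."""
    off_value = smoothing / (classes - 1)
    dist = [off_value] * classes
    dist[label] = 1.0 - smoothing
    return dist

def soft_cross_entropy(logits, target):
    """Cross entropy between a (possibly soft) target and softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))

logits = [0.0, 0.0, 4.0, 0.0, 0.0]
target = smoothed_target(label=2, classes=5, smoothing=0.1)
hard = soft_cross_entropy(logits, smoothed_target(2, 5, 0.0))
soft = soft_cross_entropy(logits, target)
print(target, hard, soft)
```

With smoothing, the loss stays above the hard-label loss even when the model is confident about the correct position, which discourages over-confident start and end logits.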
  

In [21]:
mode_path =  r'chinese-bert-wwm-ext'

model = BertForQuestionAnsweringWithMultiTask_CRF.from_pretrained(mode_path)
model.resize_token_embeddings(len(base_tokenizer)) 
question, text = "毛泽东是谁？", "毛泽东是国家主席。"
inputs = base_tokenizer(question, text, return_tensors='pt')
start_positions = torch.tensor([4])
end_positions = torch.tensor([7])
labels =  torch.tensor([1])
outputs = model(**inputs, 
                #start_positions=start_positions, 
                #end_positions=end_positions,
                #labels =labels
               )


Some weights of the model checkpoint at chinese-bert-wwm-ext were not used when initializing BertForQuestionAnsweringWithMultiTask_CRF: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForQuestionAnsweringWithMultiTask_CRF from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnsweringWithMultiTask_CRF from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnsweringWithMultiTask_CR

In [25]:
base_tokenizer.decode(base_tokenizer.encode("毛泽东是谁？", "毛泽东是国家“主席”。"))

'[CLS] 毛 泽 东 是 谁 ？ [SEP] 毛 泽 东 是 国 家 “ 主 席 ” 。 [SEP]'

In [26]:
[torch.argmax(outputs.start_logits),torch.argmax(outputs.end_logits)]

[tensor(16), tensor(11)]

In [27]:
base_tokenizer.save_pretrained("test-squad-trained")

('test-squad-trained\\tokenizer_config.json',
 'test-squad-trained\\special_tokens_map.json',
 'test-squad-trained\\vocab.txt',
 'test-squad-trained\\added_tokens.json')

In [30]:
args = TrainingArguments(
    f"test-squad-trained",
    do_train = True,
    do_eval = True,
    #evaluation_strategy = "epoch",
    evaluation_strategy = "steps",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps =2000,
    save_steps = 1500,
    eval_steps = 2000,
    num_train_epochs=1,
    save_total_limit = 10,
    weight_decay=0.01,
    label_names = ["start_positions", "end_positions"] ,
    seed = 2021
            
                 )


In [31]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator= default_data_collator,
    tokenizer=base_tokenizer,
   
       )


In [32]:
trainer.train()


test-squad-trained


wandb: Currently logged in as: gaochangkuan (use `wandb login --relogin` to force relogin)


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

In [33]:
import gc
import torch



gc.collect()
torch.cuda.empty_cache()


In [34]:
trainer.save_model("test-squad-trained")
base_tokenizer.save_pretrained("test-squad-trained")

('test-squad-trained\\tokenizer_config.json',
 'test-squad-trained\\special_tokens_map.json',
 'test-squad-trained\\vocab.txt',
 'test-squad-trained\\added_tokens.json')

In [35]:
import torch

for batch in trainer.get_eval_dataloader():
    break
    

In [36]:
ds = trainer.eval_dataset[:40]
batch = {k: torch.tensor(v).to(trainer.args.device) for k, v in ds.items()}

In [37]:
#batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
trainer.model.eval()
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()


odict_keys(['loss', 'start_logits', 'end_logits', 'hidden_states', 'categorized_logits', 'crf_logits'])

In [38]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'))

In [39]:
import numpy as np

n_best_size = 10

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )
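
The cell above leaves `"text"` empty and scores spans with the raw logits only. Below is a sketch of how the span score from the overview (start logit plus end logit, minus the two `[CLS]` logits) could be combined with an offset mapping to recover the answer text for one feature. The function name `best_span`, the `max_answer_len` limit, and the convention that `offsets[i]` is `None` for tokens outside the context are assumptions for this example. Ranking across several features of the same example is not shown.

```python
def best_span(start_logits, end_logits, offsets, context, k=10, max_answer_len=30, cls_index=0):
    """Return the best-scoring candidate span of one feature, or None if there is none."""
    def top_k(values):
        return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]

    # Score of the "no answer" prediction, which points at [CLS].
    null_score = start_logits[cls_index] + end_logits[cls_index]
    candidates = []
    for s in top_k(start_logits):
        for e in top_k(end_logits):
            # Skip tokens that are not part of the context (question, special tokens).
            if offsets[s] is None or offsets[e] is None:
                continue
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e] - null_score
            text = context[offsets[s][0]:offsets[e][1]]
            candidates.append({"score": score, "text": text})
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[0] if candidates else None

# Toy feature: three non-context tokens, six context tokens, one trailing separator.
context = "省人社厅联合下发通知"
offsets = [None, None, None, (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), None]
start_logits = [1.0, 0.0, 0.0, 5.0, 0.5, 0.2, 0.1, 0.3, 0.0, 0.0]
end_logits = [0.5, 0.0, 0.0, 0.1, 0.2, 0.3, 4.0, 0.6, 0.0, 0.0]
best = best_span(start_logits, end_logits, offsets, context, k=3)
print(best)
```

Subtracting the `[CLS]` score makes spans from different features comparable: a positive score means the span is preferred over "no answer in this feature".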

In [40]:
# We can now sort valid_answers by score and keep only the best one. What remains is to check that a given span lies
# inside the context (and not the question) and to recover the corresponding text. To do this, the validation
# features need two additional fields:
#   1) the ID of the example that generated the feature (each example can generate several features, as seen before);
#   2) the offset mapping, which maps token indices to character positions in the context.
# This is why the validation set is re-processed with the following function, which differs slightly from
# prepare_train_features.

def prepare_validation_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example may produce several features when its context is long, and the context of each feature
    # slightly overlaps the context of the previous one.
    tokenized_examples = base_tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
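To make the offset masking concrete, here is a small sketch on a hand-built feature. The sequence ids and offsets are made up to resemble a `[CLS] question [SEP] context [SEP]` input; `mask_non_context_offsets` is a hypothetical helper that mirrors the list comprehension above, not a library function.

```python
def mask_non_context_offsets(offsets, sequence_ids, context_index=1):
    """Keep offsets of context tokens; replace all others (question, special tokens) with None."""
    return [o if sid == context_index else None for o, sid in zip(offsets, sequence_ids)]


# Hypothetical feature layout: [CLS] q q [SEP] c c c [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 1), (1, 2), (0, 0), (0, 1), (1, 2), (2, 3), (0, 0)]
masked = mask_non_context_offsets(toy_offsets, toy_sequence_ids)

# A predicted span over tokens 5..6 maps back to characters 1..3 of the context.
toy_context = "北京市"
start_tok, end_tok = 5, 6
span_text = toy_context[masked[start_tok][0] : masked[end_tok][1]]
print(masked)
print(span_text)
```

With this masking in place, checking `offset_mapping[i] is None` is enough to reject a start or end index that points into the question or at a special token.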


In [41]:
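# pad_on_right must be a boolean; the string 'right' only worked because any non-empty string is truthy.
pad_on_right = base_tokenizer.padding_side == "right"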

validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
                )






In [42]:
print(validation_features[1])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [None]:
raw_predictions = trainer.predict(validation_features)


In [43]:
import gc
import torch



gc.collect()
torch.cuda.empty_cache()


In [44]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))


In [47]:
ex_index = 25
max_answer_length = 50
n_best_size = 20

start_logits = output.start_logits[ex_index].cpu().numpy()
end_logits = output.end_logits[ex_index].cpu().numpy()
offset_mapping = validation_features[ex_index]["offset_mapping"]

# Note: this assumes that feature `ex_index` was produced by example `ex_index`, which only holds if no earlier
# example was split into several features. In the general case, match the feature's example_id to an example index.

context = datasets["validation"][ex_index]["context"]
print('question:', datasets["validation"][ex_index]["question"],
      'context:',context,
      'answer:',datasets["validation"][ex_index]["answers"]['text'][0])

# Gather the indices the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index:  # Always true here: invalid spans were already skipped by the checks above.
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

question: 北京市网上办理登记注册事项的服务平台是什么？ context: 北京市市场监督管理局关于在疫情防控期间提倡网上办理登记注册事项的倡议书 为全面加强疫情防控，北京市市场监督管理局向您发出倡议:“网上办理业务，有效减少聚集，强化疫情防控”。 请登录北京市企业登记“e窗通”服务平台(网址：etc.scjgj.beijing.gov.cn)全程网上办理相关业务，我们将为你提供便捷高效的服务。 需现场提交材料的业务，建议您延期办理。您可选择通过在线咨询（各登记注册部门咨询电话详见下表）、网上预审的方式先行沟通。登记注册部门已安排专业人员进行业务办理咨询和指导。 确需现场办理的事项，请您自觉正确佩戴口罩，业务办理完毕後尽快离开。伴有发热、咳嗽等不适症状者，建议暂不到现场办理业务。 在此期间，给您带来的不便敬请谅解，衷心感谢您的理解和支持！业务咨询、投诉建议请致电010-11616611。相关办事指南、政策宣传敬请关注“北京市市场监督管理局”官网。                        北京市市场监督管理局                          二○二○年二月三日 各登记注册部门咨询电话 answer: “e窗通”服务平台(网址：etc.scjgj.beijing.gov.cn)


[{'score': -6.7814484, 'text': '办理相关业'},
 {'score': -7.08699, 'text': '办理相关'},
 {'score': -7.089334, 'text': 'gov.cn)全程网上办理相关业'},
 {'score': -7.3057723, 'text': '业'},
 {'score': -7.3867755, 'text': '办理'},
 {'score': -7.3948755, 'text': 'gov.cn)全程网上办理相关'},
 {'score': -7.4955273, 'text': 'gov.'},
 {'score': -7.5838537, 'text': '已安排专'},
 {'score': -7.603981, 'text': '.cn)全程网上办理相关业'},
 {'score': -7.6684933, 'text': 'v.cn)全程网上办理相关业'},
 {'score': -7.670457, 'text': 'gov'},
 {'score': -7.6859384, 'text': '办理相关业务'},
 {'score': -7.694661, 'text': 'gov.cn)全程网上办理'},
 {'score': -7.7009935, 'text': '相关业'},
 {'score': -7.727461, 'text': '已安'},
 {'score': -7.739602, 'text': '关业'},
 {'score': -7.8180265, 'text': '已安排'},
 {'score': -7.832142, 'text': '发热、咳嗽等'},
 {'score': -7.8447456, 'text': '有发热、咳嗽等'},
 {'score': -7.846612, 'text': '等'}]

In [None]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
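The grouping above can be checked on a toy case. In this sketch the ids are made up: three examples, the first split into two features and the third into three.

```python
import collections

# Hypothetical ids: one entry per example, and one entry per feature naming its source example.
toy_example_ids = ["q1", "q2", "q3"]
toy_feature_example_ids = ["q1", "q1", "q2", "q3", "q3", "q3"]

example_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}
features_per_example = collections.defaultdict(list)
for feature_index, example_id in enumerate(toy_feature_example_ids):
    features_per_example[example_id_to_index[example_id]].append(feature_index)

print(dict(features_per_example))
```

Each example index now maps to the list of feature indices whose logits must be pooled before picking that example's answer.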
    

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what allows us to map positions in our logits back to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(base_tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score

            # Go through all combinations of the `n_best_size` highest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where there is not a single non-null prediction, create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
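The processing notes at the top of this notebook describe the span score as `startlogit + endlogit - cls_start - cls_end`, that is, the span score measured relative to the "no answer" score at the `[CLS]` position. A minimal sketch of that scoring and of the resulting answer/no-answer decision follows; `relative_span_score`, `pick_answer` and the `threshold` parameter are illustrative assumptions, not functions from the notebook.

```python
def relative_span_score(start_logits, end_logits, start, end, cls_index=0):
    """Span score minus the null score taken at the [CLS] position."""
    null_score = start_logits[cls_index] + end_logits[cls_index]
    return start_logits[start] + end_logits[end] - null_score


def pick_answer(candidates, threshold=0.0):
    """candidates: list of (text, relative_score). Return "" when nothing beats the null answer."""
    if not candidates:
        return ""
    text, score = max(candidates, key=lambda c: c[1])
    return text if score > threshold else ""


# Made-up logits for a four-token feature with [CLS] at index 0.
toy_start = [1.0, 0.5, 3.0, 0.2]
toy_end = [0.5, 0.1, 0.4, 2.5]
score = relative_span_score(toy_start, toy_end, start=2, end=3)
print(score)
print(pick_answer([("answer A", score), ("answer B", -1.0)]))
```

For a single feature and `threshold=0.0`, this is the same comparison as `best_answer["score"] > min_null_score` in the function above; a different threshold would trade answered questions against empty predictions.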

In [None]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

In [None]:
if squad_v2:
    import os
    # Adapt this to your local environment
    path_to_transformers = "../git/transformers"
    path_to_qa_examples = os.path.join(path_to_transformers, "examples/question-answering")
    metric = load_metric(os.path.join(path_to_qa_examples, "squad_v2_local"))
    # Uncomment when the fix is merged in master and has been released.
    # metric = load_metric("squad_v2")
else:
    metric = load_metric("squad")

In [None]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)
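To get a feel for the two numbers such a metric reports, the sketch below computes exact match and an overlap-based F1 for a single prediction. It is a simplified, character-level illustration and an assumption on my part, not the implementation behind `load_metric("squad")`, which normalizes the text and tokenizes on whitespace.

```python
import collections


def exact_match(prediction, reference):
    """1.0 if the two strings are identical, else 0.0."""
    return float(prediction == reference)


def char_f1(prediction, reference):
    """Harmonic mean of character-level precision and recall."""
    pred_chars, ref_chars = list(prediction), list(reference)
    common = collections.Counter(pred_chars) & collections.Counter(ref_chars)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


print(exact_match("办理相关", "办理相关业务"))
print(char_f1("办理相关", "办理相关业务"))
```

Counting characters rather than whitespace-separated tokens is one possible choice for Chinese text, where an answer usually contains no spaces.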

In [None]:
import numpy as np
import pandas as pd

def answerQuestion(question, paper):
    """
    Return the best answer found by the Q&A model across all chunks of the input paper,
    together with the confidence score of that answer and the chunk that contains it.
    """
    paper = [paragraph for paragraph in paper if len(paragraph)>0]
    inputs = [base_tokenizer.encode_plus(
        question, paragraph.replace('\n','').replace('\t','').replace(' ',''), 
               add_special_tokens=True, return_tensors="pt") for paragraph in paper ]
    answers = []
    confidence_scores = []
    for n, Input in enumerate(inputs):
        input_ids = Input['input_ids'].to(torch_device)
        token_type_ids = Input['token_type_ids'].to(torch_device)
        attention_masks = Input['attention_mask'].to(torch_device)
        if len(input_ids[0]) > 512:
            input_ids = input_ids[:, :512]
            token_type_ids = token_type_ids[:, :512]
            attention_masks = attention_masks[:, :512]
            
        text_tokens = base_tokenizer.convert_ids_to_tokens(input_ids[0])
        outputs = model(    input_ids,
                            token_type_ids =token_type_ids,
                            attention_mask = attention_masks  
                                              )
        
        start_scores = outputs.start_logits
        end_scores = outputs.end_logits
        answer_start = torch.argmax(start_scores)
        answer_end = torch.argmax(end_scores)
        
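        # If the predicted start token falls inside the question, move it to the first token of the
        # context (the token right after the first [SEP]).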
        check = text_tokens.index("[SEP]")
        if int(answer_start) <= check:
            answer_start = check+1
        answer = base_tokenizer.convert_tokens_to_string(text_tokens[answer_start:(answer_end+1)])
        answer = answer.replace('[SEP]', '')
        confidence = start_scores[0][answer_start] + end_scores[0][answer_end]
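        # Drop a leading punctuation mark together with the space that follows it in the joined token string.
        if answer.startswith('。') or answer.startswith('，'):
            answer = answer[2:]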
        answers.append(answer)
        confidence_scores.append(float(confidence))
    
    maxIdx = np.argmax(confidence_scores)
    confidence = confidence_scores[maxIdx]
    best_answer = answers[maxIdx]
    best_paragraph = paper[maxIdx]

    return best_answer.replace(' ',''), confidence, best_paragraph.replace(' ','')

In [None]:

def checkAnyStop(token_list, token_stops):
    return any([stop in token_list for stop in token_stops])

def firstFullStopIdx(token_list, token_stops):
    """
    Returns the index of first full-stop token appearing.  
    """
    idxs = []
    for stop in token_stops:
        if stop in token_list:
            idxs.append(token_list.index(stop))
    minIdx = min(idxs) if idxs else None
    return minIdx


puncts = ['？', '。', '?', '；',"！","!",';']
puncts_tokens = [base_tokenizer.tokenize(x)[0] for x in puncts]

def splitTokens(tokens, punct_tokens, split_length):
    """
    To avoid cutting a sentence in half and losing its meaning, the paper is split into chunks
    such that each chunk ends with a sentence-ending token (one of `punct_tokens`).
    """
    splitted_tokens = []
    while len(tokens) > 0:
        if len(tokens) < split_length or not checkAnyStop(tokens, punct_tokens):
            splitted_tokens.append(tokens)
            break
        # To avoid overly long paragraphs, search for the nearest full stop both before
        # and after the split point. Use the `punct_tokens` argument rather than the global list.
        prev_stop_idx = firstFullStopIdx(tokens[:split_length][::-1], punct_tokens)
        next_stop_idx = firstFullStopIdx(tokens[split_length:], punct_tokens)
        if pd.isna(next_stop_idx):
            splitted_tokens.append(tokens[:split_length - prev_stop_idx])
            tokens = tokens[split_length - prev_stop_idx:]
        elif pd.isna(prev_stop_idx):
            splitted_tokens.append(tokens[:split_length + next_stop_idx + 1])
            tokens = tokens[split_length + next_stop_idx + 1:] 
        elif prev_stop_idx < next_stop_idx:
            splitted_tokens.append(tokens[:split_length - prev_stop_idx])
            tokens = tokens[split_length - prev_stop_idx:]
        else:
            splitted_tokens.append(tokens[:split_length + next_stop_idx + 1])
            tokens = tokens[split_length + next_stop_idx + 1:] 
    return splitted_tokens

def splitParagraph(text, split_length=128):
    text = text.replace('\n','').replace('\xa0','').replace('\t','')
    tokens = base_tokenizer.tokenize(text)
    splitted_tokens = splitTokens(tokens, puncts_tokens, split_length)
    return [''.join(base_tokenizer.convert_tokens_to_string(x)).replace(' ','') for x in splitted_tokens]
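The processing notes at the top of this notebook describe packing consecutive whole sentences into each chunk up to a maximum length. The sketch below shows a simpler greedy variant of that idea. It measures length in characters rather than tokens and does not reproduce the nearest-stop search of `splitTokens`; `split_sentences` and `pack_sentences` are hypothetical helper names.

```python
def split_sentences(text, stops="？。?；！!;"):
    """Split text into sentences, keeping each stop character attached to its sentence."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch in stops:
            sentences.append(current)
            current = ""
    if current:  # trailing text without a final stop
        sentences.append(current)
    return sentences


def pack_sentences(sentences, max_len):
    """Greedily concatenate consecutive sentences without exceeding max_len characters."""
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            chunks.append(current)
            current = ""
        current += sent  # a single over-long sentence still becomes its own chunk
    if current:
        chunks.append(current)
    return chunks


toy_text = "今天下雨。我们在家！你呢？好的。"
chunks = pack_sentences(split_sentences(toy_text), max_len=10)
print(chunks)
```

Because chunks never break inside a sentence, an answer that is a complete run of sentences is more likely to fall entirely within one chunk.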


In [None]:
import requests
from gne import GeneralNewsExtractor

url ='https://baike.baidu.com/item/%E6%AF%9B%E6%B3%BD%E4%B8%9C/113835?fromtitle=%E6%AF%9B%E4%B8%BB%E5%B8%AD&fromid=380922'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3823.400 QQBrowser/10.7.4307.400'}
resp = requests.get(url ,headers=headers).text


In [None]:
extractor = GeneralNewsExtractor()
result = extractor.extract(resp, noise_node_list=['//div[@class="comment-list"]'])
print(result['title'])

In [None]:
torch_device = 'cuda'
model = BertForQuestionAnsweringWithMultiTask.from_pretrained(r'test-squad-trained')
model.to(torch_device)
model.eval()

In [None]:
question  = '''毛泽东的籍贯是哪里？'''
question  = '''毛泽东在经济建设方面的思想是什么？'''
question  = '''毛泽东的主要军事思想是什么？'''
#question  = '''整风运动是什么时候开展的？'''
#question  = '''《双十协定》是在哪里签署的？'''
#question  = '''《双十协定》是什么时候签署的？'''
#question  = '''毛泽东是什么时候出生的？'''
#question = '''六所私塾读书是什么时间段？'''
#question = '''毛泽东是什么时候死的？'''
#question = '''毛泽东的'三个世界'的论断是什么时候提出的？'''
#question = '''《国民政府与中共代表会谈纪要》是什么时候签署的？'''
#question = '''枪杆子里面出政权是哪里提出的？'''
#question = '''秋收起义是什么时候爆发的？'''
#question = '''文化大革命是怎样爆发的？'''
#question = '''整风运动是在什么时候发生的？'''
#question = '''毛泽东因为什么事很惊讶？'''

In [None]:
context = splitParagraph(result['content'], split_length=128)
answerQuestion(question, context )


In [None]:
answer_text =  [
    
'''1月22日，华晨宇承认：我们有一个孩子。随后，张碧晨发长文承认，她于2018年秋天怀孕，当时选择了独自离开，“所以在花花完全不知情的情况下，我独自完成了孕育和生产，成功升级成一个妈妈。”''',
'''谈到不公开的原因，华晨宇称主要事情有些复杂，怕说不清楚的话会让歌迷们担心，同时也可以让孩子在安静的环境里成长，而不被外界关注。现在既然被曝光出来了，那我们都会坦然面对大家的疑惑。这件事情可能会让歌迷们感觉到很突然，我只能希望大家理解，谢谢大家。''',
'''华晨宇长文中谈到孩子带来对她的影响，他表示“这个孩子的到来真的治愈了我很多，我很开心上天给我带来了这样一份特别的礼物，虽然很突然，但是也很开心，我们会给孩子带来健康快乐的成长环境。最惊喜的是，她很喜欢音乐，会经常自己一个人拿着麦克风边跳边唱《斗牛》，这是她最爱的一首歌，连睡觉都要听着这首歌入睡。”''',
'''华晨宇谈到与女儿相处细节，表示她很会撒娇，“想吃零食的时候总是用各种方式哄你开心，同样也会关心人，每次自己拿到好吃的食物的时候，总是会先说，‘爸爸妈妈吃’。她真的很可爱，也真的成长的很好，看见她我就觉得很幸福。”''',
'''2018年秋，当我知道自己怀孕的时候，我整个人都懵了。我和花花虽然在一起，我们也憧憬过未来的生活，但计划里从没有过生孩子结婚，至少几年之内没有，所以我当时完全慌了，不知道该怎么做才是对的。可能对我而言，30岁之前生一个自己和自己爱的人的孩子，是除了唱歌做歌手以外最大的梦想。但当我做好了要生下这个孩子的决定的时候，我混乱到完全不知道怎么跟花花说，也没去想他会怎么回应我，我顾自选择了离开，选择不告诉他不让他知道，自己去完成这一切。''',
'''我离开了他，走的时候没有说任何理由，只说了以后别联系了。很长一段时间，我不接他的电话，不回他的微信，让他找不到我，慢慢的我们就断了联系。我知道我这么做其实很愚蠢，但我实在太慌乱太害怕了，当时这个事情远远超出了我世界里的所有认知。所以在花花完全不知情的情况下，我独自完成了孕育和生产，成功升级成一个妈妈。''' ,
'''所以，虽然我们分开这么久了，我们的生活在分开期间也发生了很大的改变，但我们努力去重新磨合，最重要的是让孩子感受到爱，感受到家庭的温暖。''',
'''孩子健康聪明，每天都有无数的爱围绕她，她的爷爷奶奶、姥姥姥爷、爸爸妈妈都非常地爱她，她真的成长地很好。''',
'''很抱歉这件事隐瞒了这么久，一切的隐瞒更多的是为了保护这个孩子，想让她在平静快乐的环境里茁壮长大，给她充满爱的生活。对歌迷们和所有关心我们的人说抱歉，也感谢你们看完我的文字。'''
                
                
               
               ]




question = '华晨宇担心什么？'
question = '张碧晨抱歉什么？'
question = '张碧晨因为什么而懵逼？'
#question = '华晨宇为什么狂喜？'


In [None]:
answerQuestion(question, answer_text )