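# Lab 2: Named Entity Recognition

**Tasks:**
1. Read, understand, and run the provided code that performs named entity recognition with HMM and CRF. (1 point)
1. Implement an NER evaluation program that computes, at the entity level, Precision, Recall, and F1 for each entity type and overall. (3 points)
1. Following the given hints, implement a maximum entropy based entity recognition system at the designated place in this Notebook. Train it on train.txt in the ner_char_data directory and test it on test.txt. (3 points)
1. At the designated place in this Notebook, train HMM, ME, and CRF models on train.txt in the ner_clue_data directory and test all three on dev.txt. Write each model's predictions on dev.txt to its own result file, and print each model's per-entity-type and overall Precision, Recall, and F1 in the Notebook. (6 points)
1. At the designated place in this Notebook, and based on the given reference project, train a BiLSTM+CRF model on train.txt in the ner_clue_data directory and test it on dev.txt. Output the model's predictions on dev.txt, and print its per-entity-type and overall Precision, Recall, and F1 in the Notebook. (2 points)
1. Complete the lab report using the template provided in the course QQ group and submit it as required. (5 points)

**Submission deadline:**
* December 10, 2022, 10:00 PM

**Submission method:**
* Submit through the course on Baidu AI Studio, uploading the lab report as an attachment.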

In [1]:
from score import SeqEntityScore
from tqdm import tqdm
import numpy as np
import paddle

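# 1. Dataset overview and data loading practice
## 1.1 Reading the data in the ner_char_data directory

The ner_char_data directory contains preprocessed data: the training file train.txt and the test file test.txt. Each line of a data file has two parts, a character and the entity tag for that character, separated by a space. Entity tags follow the BMES scheme (B = begin, M = middle, E = end, S = single-character entity, plus O for characters outside any entity). Sentences are separated by blank lines.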
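To make the BMES scheme concrete, here is a minimal sketch that decodes a tag sequence into entity spans. The function name `bmes_to_entities` is hypothetical and not part of the provided code. It assumes tags of the form `B-NAME`, `M-NAME`, `E-NAME`, `S-NAME`, and `O`, and it simply drops malformed spans (for example a `B-` that is never closed by a matching `E-`). The tag sequence in the usage example is an assumed gold annotation, chosen for illustration.

```python
def bmes_to_entities(tags):
    """Decode BMES tags into (entity_type, start, end) spans; end is inclusive."""
    entities = []
    start, current = None, None  # start index and type of the open span, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-character entity
            entities.append((etype, i, i))
            start, current = None, None
        elif prefix == "B":
            # Open a new span (an unfinished previous span is discarded)
            start, current = i, etype
        elif prefix == "M":
            # A middle tag must continue a span of the same type
            if current != etype:
                start, current = None, None
        elif prefix == "E":
            # Close the span only if it was opened with the same type
            if current == etype and start is not None:
                entities.append((etype, start, i))
            start, current = None, None
    return entities


chars = list("常建良，男，")
tags = ["B-NAME", "M-NAME", "E-NAME", "O", "O", "O"]  # assumed gold tags
for etype, s, e in bmes_to_entities(tags):
    print("".join(chars[s:e + 1]), etype)
```

Returning inclusive `(type, start, end)` triples keeps the spans directly comparable between gold and predicted sequences, which is what an entity-level evaluation needs.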

In [None]:
""" Read and process the data files in the ner_char_data directory """
def data_build(file_name: str, make_vocab=True):
    word_lists = []
    tag_lists = []
    with open('./ner_char_data/' + file_name, 'r', encoding='utf-8') as file_read:
        word_list = []
        tag_list = []
        for line in file_read:
            if line != '\n':
                word, tag = line.strip('\n').split()
                word_list.append(word)
                tag_list.append(tag)
            elif word_list:
                # A blank line ends the current sentence
                word_lists.append(word_list)
                tag_lists.append(tag_list)
                word_list = []
                tag_list = []
        # Keep the last sentence when the file does not end with a blank line
        if word_list:
            word_lists.append(word_list)
            tag_lists.append(tag_list)

    if make_vocab:
        word2id = {}
        for word_list in word_lists:
            for word in word_list:
                if word not in word2id:
                    word2id[word] = len(word2id)
        tag2id = {}
        for tag_list in tag_lists:
            for tag in tag_list:
                if tag not in tag2id:
                    tag2id[tag] = len(tag2id)
        return word_lists, tag_lists, word2id, tag2id
    return word_lists, tag_lists

train_word_lists, train_tag_lists, word2id, tag2id = data_build(file_name="train.txt", make_vocab=True)
id2tag = dict([items[1], items[0]] for items in tag2id.items())   
test_word_lists, test_tag_lists = data_build(file_name="test.txt", make_vocab=False)

In [None]:
tag2id

{'B-NAME': 0,
 'E-NAME': 1,
 'O': 2,
 'B-CONT': 3,
 'M-CONT': 4,
 'E-CONT': 5,
 'B-RACE': 6,
 'E-RACE': 7,
 'B-TITLE': 8,
 'M-TITLE': 9,
 'E-TITLE': 10,
 'B-EDU': 11,
 'M-EDU': 12,
 'E-EDU': 13,
 'B-ORG': 14,
 'M-ORG': 15,
 'E-ORG': 16,
 'M-NAME': 17,
 'B-PRO': 18,
 'M-PRO': 19,
 'E-PRO': 20,
 'S-RACE': 21,
 'S-NAME': 22,
 'B-LOC': 23,
 'M-LOC': 24,
 'E-LOC': 25,
 'M-RACE': 26,
 'S-ORG': 27}

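## 1.2 Reading the data in the ner_clue_data directory

### Entity categories:
```
There are 10 labels: address, book, company, game, government, movie, name, organization, position, scene
```
### Label definitions & annotation rules:
```
address: **province **city **district **street **number, **road, **subdistrict, **village, etc. (also annotated when they appear alone). Addresses are annotated as completely as possible, down to the finest level.
book: novels, magazines, exercise books, textbooks, study guides, atlases, cookbooks, and any other kind of book sold in bookstores, including e-books.
company: **company, **group, **bank (except the central bank (央行) and the People's Bank of China (中国人民银行), which both count as government bodies), e.g. 新东方; also includes 新华网, 中国军网, etc.
game: common games. Some games are adapted from novels or TV series, so the context must be analyzed to decide whether a mention really refers to a game.
government: covers central and local administrative organs. Central administrative organs include the State Council, its constituent departments (ministries, commissions, the People's Bank of China, and the National Audit Office), organs directly under the State Council (such as the customs, taxation, industry and commerce, and environmental protection administrations), the military, etc.
movie: films, including documentaries shown in cinemas. If a film is adapted from a book, use the context to decide whether the mention is the film title or the book title.
name: usually personal names, including characters in novels (宋江, 武松, 郭靖), nicknames of fictional characters (及时雨, 花和尚), and aliases of famous people that identify a specific person.
organization: basketball teams, football teams, orchestras, clubs, etc.; also includes sects in novels such as 少林寺, 丐帮, 铁掌帮, 武当, 峨眉.
position: historical titles such as 巡抚, 知州, 国师; modern ones such as general manager, journalist, president, artist, collector.
scene: common tourist attractions such as 长沙公园, 深圳动物园, aquariums, botanical gardens, the Yellow River, the Yangtze River.
```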

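### Data source:
[Data download](https://github.com/CLUEbenchmark/CLUENER2020)

```
This dataset is built on THUCTC, the open-source text classification dataset from Tsinghua University: a subset was selected and annotated with fine-grained named entities. The original data comes from Sina News RSS.
```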

### Data distribution:
```
Training set (train.txt): 10748
Validation set (dev.txt): 1343

Training set distribution by label (note: every entity occurring in a record is annotated; if one record contains two address entities, it counts as two when tallying the address label):
[Training set] label distribution:
address: 2829
book: 1131
company: 2897
game: 2325
government: 1797
movie: 1109
name: 3661
organization: 3075
position: 3052
scene: 1462

[Validation set] label distribution:
address: 364
book: 152
company: 366
game: 287
government: 244
movie: 150
name: 451
organization: 344
position: 425
scene: 199
```

### Data fields:
```
Taking train.json as an example, each record has two fields, text & label: text is the sentence, and label lists every entity in the sentence that belongs to one of the 10 categories.
For example:
  text: "北京勘察设计协会副会长兼秘书长周荫如"
  label: {"organization": {"北京勘察设计协会": [[0, 7]]}, "name": {"周荫如": [[15, 17]]}, "position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}}
  Here organization, name, and position are entity categories.
  "organization": {"北京勘察设计协会": [[0, 7]]}: in the original text, "北京勘察设计协会" is an entity of category "organization", with start_index 0 and end_index 7 (note: indices are 0-based and end_index is inclusive)
  "name": {"周荫如": [[15, 17]]}: in the original text, "周荫如" is an entity of category "name", with start_index 15 and end_index 17
  "position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}: in the original text, "副会长" is an entity of category "position", with start_index 8 and end_index 10; "秘书长" is also an entity of category "position",
  with start_index 12 and end_index 14
```
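As a worked example of the field layout above, the following sketch converts the sample record into a per-character tag sequence, using `B-`/`I-` for multi-character entities and `S-` for single-character ones. The helper name `spans_to_tags` is hypothetical; the tagging convention matches the one used by the loading code in this notebook, but this is an illustration rather than the reference implementation.

```python
import json


def spans_to_tags(text, label):
    """Turn a CLUENER-style label dict into one tag per character."""
    tags = ["O"] * len(text)
    for etype, mentions in label.items():
        for mention, spans in mentions.items():
            for start, end in spans:  # end index is inclusive
                if text[start:end + 1] != mention:
                    raise ValueError(f"span {start}-{end} does not match {mention!r}")
                if start == end:
                    tags[start] = "S-" + etype
                else:
                    tags[start] = "B-" + etype
                    tags[start + 1:end + 1] = ["I-" + etype] * (end - start)
    return tags


record = json.loads(
    '{"text": "北京勘察设计协会副会长兼秘书长周荫如", '
    '"label": {"organization": {"北京勘察设计协会": [[0, 7]]}, '
    '"name": {"周荫如": [[15, 17]]}, '
    '"position": {"副会长": [[8, 10]], "秘书长": [[12, 14]]}}}'
)
for char, tag in zip(record["text"], spans_to_tags(record["text"], record["label"])):
    print(char, tag)
```

The only character left as `O` in this record is "兼" at index 11, which lies between the two position entities.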




In [6]:
# Practice reading and processing the data in the ner_clue_data directory here.
# train_word_lists, train_tag_lists, word2id, tag2id 
""" Read and process the data files in the ner_clue_data directory """
import json
    
def data_build_gluener(file_name:str, make_vocab=True):
    word_lists = []
    tag_lists = []
    with open('./ner_clue_data/' + file_name, 'r', encoding='utf-8') as f:

        for line in f:
            line = json.loads(line.strip())
            text = line['text']
            # Records without a 'label' field are tagged as all 'O'
            label_items = line.get('label', None)
            
            # Build the tag sequence: B-/I- for multi-character entities, S- for single characters
            labels = ['O']*len(text)
            if label_items is not None:
                for key,value in label_items.items():
                    for name, index in value.items():
                        for start_idx, end_idx in index:
                            # end_idx is inclusive
                            assert text[start_idx:end_idx + 1] == name
                            if start_idx == end_idx:
                                labels[start_idx] = 'S-' + key
                            else:
                                labels[start_idx] = 'B-'+ key
                                labels[start_idx+1: end_idx+1] = ['I-'+key]*(end_idx - start_idx)
            word_lists.append(list(text))
            tag_lists.append(labels)

    if make_vocab == True:
        word2id = {}
        for word_list in word_lists:
            for word in word_list:
                if word not in word2id:
                    word2id[word] = len(word2id)
        tag2id = {}
        for tag_list in tag_lists:
            for tag in tag_list:
                if tag not in tag2id:
                    tag2id[tag] = len(tag2id)
        return word_lists, tag_lists, word2id, tag2id

    return word_lists, tag_lists
             

clue_train_word_lists, clue_train_tag_lists, clue_word2id, clue_tag2id = data_build_gluener(file_name="train.txt", make_vocab=True)
clue_dev_word_lists, clue_dev_tag_lists = data_build_gluener(file_name="dev.txt", make_vocab=False)

In [None]:
clue_tag2id

{'B-company': 0,
 'I-company': 1,
 'O': 2,
 'B-name': 3,
 'I-name': 4,
 'B-game': 5,
 'I-game': 6,
 'B-organization': 7,
 'I-organization': 8,
 'B-movie': 9,
 'I-movie': 10,
 'B-position': 11,
 'I-position': 12,
 'B-address': 13,
 'I-address': 14,
 'B-government': 15,
 'I-government': 16,
 'B-scene': 17,
 'I-scene': 18,
 'B-book': 19,
 'I-book': 20,
 'S-company': 21,
 'S-address': 22,
 'S-name': 23,
 'S-position': 24}

# 2. Hidden Markov Model (HMM)
## 2.1 Computing the HMM parameters
Estimate the HMM parameters **A**, **B**, and **Pi** from the training data.
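As a reading aid, the counting procedure in the next cell corresponds to the usual maximum-likelihood estimates. The count notation $C(\cdot)$ is introduced here for illustration only and does not appear in the provided code. The code additionally replaces every zero count with $10^{-10}$ before normalizing, so that no probability is exactly zero when logarithms are taken later.

$$
A_{ij} = \frac{C(i \to j)}{\sum_{k} C(i \to k)}, \qquad
B_{i}(o) = \frac{C(i, o)}{\sum_{o'} C(i, o')}, \qquad
\Pi_{i} = \frac{C_{\text{start}}(i)}{\sum_{k} C_{\text{start}}(k)}
$$

Here $C(i \to j)$ is the number of times tag $i$ is immediately followed by tag $j$, $C(i, o)$ is the number of times character $o$ carries tag $i$, and $C_{\text{start}}(i)$ is the number of training sentences whose first tag is $i$.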

In [None]:
""" Build the HMM parameters """
import numpy as np
# N: number of states, i.e. the number of distinct tags
# M: number of observations, i.e. the number of distinct characters
N, M = len(tag2id), len(word2id)
# State transition matrix: A[i][j] is the probability of moving from state i to state j
A = np.zeros(shape=(N, N), dtype=float)
# Emission matrix: B[i][j] is the probability of emitting observation j in state i
B = np.zeros(shape=(N, M), dtype=float)
# Initial state distribution: Pi[i] is the probability of starting in state i
Pi = np.zeros(shape=N, dtype=float)

""" Build the transition matrix """
for tag_list in train_tag_lists:
    seq_len = len(tag_list)
    for i in range(seq_len - 1):
        current_tagid = tag2id[tag_list[i]]
        next_tagid = tag2id[tag_list[i+1]]
        A[current_tagid][next_tagid] += 1
A[A == 0.] = 1e-10  # smoothing: avoids log(0) during decoding
A = A / np.sum(a=A, axis=1, keepdims=True)

""" Build the emission matrix """
for tag_list, word_list in zip(train_tag_lists, train_word_lists):
    assert len(tag_list) == len(word_list)
    for tag, word in zip(tag_list, word_list):
        tag_id = tag2id[tag]
        word_id = word2id[word]
        B[tag_id][word_id] += 1
B[B == 0.] = 1e-10  # smoothing
B = B / np.sum(a=B, axis=1, keepdims=True)

""" Build the initial state distribution """
for tag_list in train_tag_lists:
    init_tagid = tag2id[tag_list[0]]
    Pi[init_tagid] += 1
Pi[Pi == 0.] = 1e-10  # smoothing
Pi = Pi / np.sum(a=Pi)


## 2.2 Implementing the Viterbi algorithm

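In [None]:
""" Viterbi algorithm """
def viterbi(word_list, word2id, tag2id):
    """
    Use the Viterbi algorithm to find the state sequence for a given observation
    sequence, i.e. the tag sequence for a sequence of characters.
    Viterbi solves the HMM decoding problem with dynamic programming: it finds the
    most probable path (the optimal path), where each path corresponds to one
    state sequence.
    Note: this function reads the globals A, B, Pi, and N built in the previous cell.
    """
    # Problem: on a long chain, multiplying many small probabilities can underflow.
    # Solution: work with log probabilities, so that tiny probabilities map to
    # large negative numbers and products become simple sums.
    ALog = np.log(A)
    BLog = np.log(B)
    PiLog = np.log(Pi)

    # Initialize the Viterbi matrix with shape [number of states, sequence length].
    # viterbi[i, j] is the maximum (log) probability over all partial tag sequences
    # (i_1, i_2, ..., i_j) whose j-th tag is i.
    seq_len = len(word_list)
    viterbi = np.zeros(shape=(N, seq_len), dtype=float)
    # backpointer has the same shape as viterbi.
    # backpointer[i, j] stores the id of tag j-1 when tag j is i.
    # During decoding we follow backpointer backwards to recover the best path.
    backpointer = np.zeros(shape=(N, seq_len), dtype=int)

    # PiLog[i] is the log probability that the first character has tag i
    # Bt[word_id] holds the log emission probability of word_id under each tag
    # ALog[:, tag_id] holds the log probability of moving from each state to tag_id

    # First step
    start_wordid = word2id.get(word_list[0], None)
    Bt = BLog.T
    if start_wordid is None:
        # If the character is not in the vocabulary, assume a uniform distribution over states
        bt = np.log(np.ones(shape=N, dtype=float) / N)
    else:
        bt = Bt[start_wordid]
    viterbi[:, 0] = PiLog + bt
    backpointer[:, 0] = -1

    # Recurrence (in probability space):
    #   viterbi[tag_id, step] = max(viterbi[:, step-1] * A[:, tag_id]) * B[tag_id, word]
    # where word is the character at position step; in log space the products become sums.
    for step in range(1, seq_len):
        wordid = word2id.get(word_list[step], None)
        # bt is the log emission score of the current character under each state
        if wordid is None:
            # If the character is not in the vocabulary, assume a uniform distribution over states
            bt = np.log(np.ones(N) / N)
        else:
            bt = Bt[wordid]  # otherwise read bt from the emission matrix
        for tag_id in range(len(tag2id)):
            # Score accumulated up to step-1, plus the transition score into tag_id
            scores = viterbi[:, step - 1] + ALog[:, tag_id]
            max_id = np.argmax(scores)
            viterbi[tag_id, step] = scores[max_id] + bt[tag_id]
            backpointer[tag_id, step] = max_id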

    # Termination: the maximum of viterbi[:, seq_len - 1] is the (log) probability of the best path
    best_path_prob = np.max(a=viterbi[:, seq_len - 1], axis=0)
    best_path_pointer = np.argmax(a=viterbi[:, seq_len - 1], axis=0)

    # Backtrack to recover the best path (collected from last tag to first)
    best_path_pointer = int(best_path_pointer)
    best_path = [best_path_pointer]

    for back_step in range(seq_len-1, 0, -1):
        best_path_pointer = backpointer[best_path_pointer, back_step]
        best_path_pointer = int(best_path_pointer)
        best_path.append(best_path_pointer)

    # Convert the sequence of tag ids back into tags
    assert len(best_path) == len(word_list)
    id2tag = dict((id_, tag) for tag, id_ in tag2id.items())
    tag_list = [id2tag[id_] for id_ in reversed(best_path)]

    return tag_list
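The recursion above can be checked by hand on a toy model. The sketch below is a standalone, pure-Python log-space Viterbi over a two-state HMM. The function name `viterbi_log` and all probabilities are made up for illustration and are unrelated to the trained `A`, `B`, and `Pi`.

```python
import math


def viterbi_log(obs, log_pi, log_A, log_B):
    """Return the most probable state path for obs, working in log space."""
    n_states = len(log_pi)
    # score[t][s]: best log probability of any path ending in state s at time t
    score = [[log_pi[s] + log_B[s][obs[0]] for s in range(n_states)]]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for o in obs[1:]:
        prev = score[-1]
        row, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: prev[p] + log_A[p][s])
            row.append(prev[best_prev] + log_A[best_prev][s] + log_B[s][o])
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)
    # Backtrack from the best final state
    state = max(range(n_states), key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def to_log(matrix):
    return [[math.log(x) for x in row] for row in matrix]


# Toy parameters (assumed): state 0 mostly emits symbol 0, state 1 mostly emits symbol 1
toy_pi = [math.log(0.6), math.log(0.4)]
toy_A = to_log([[0.7, 0.3], [0.4, 0.6]])
toy_B = to_log([[0.9, 0.1], [0.2, 0.8]])

print(viterbi_log([0, 0, 1], toy_pi, toy_A, toy_B))
```

For the observation sequence `[0, 0, 1]` the best path stays in state 0 for the first two steps and switches to state 1 for the last symbol, which can be confirmed by multiplying the probabilities along each candidate path.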

## 2.3 Entity recognition with the HMM and Viterbi

In [None]:
""" Use the HMM to tag the data in test.txt under ner_char_data """
pred_tag_lists = []
for word_list in test_word_lists:
    pred_tag_list = viterbi(word_list, word2id, tag2id)
    pred_tag_lists.append(pred_tag_list)

## 2.4 Tag-level evaluation of the HMM results

In [None]:
""" HMM evaluation """
from evaluating import Metrics
metrics = Metrics(test_tag_lists, pred_tag_lists, remove_O=False)
metrics.report_scores()
metrics.report_confusion_matrix()

                 precision    recall  f1-score   support
         B-NAME     0.9800    0.8750    0.9245       112
         M-NAME     0.9459    0.8537    0.8974        82
         E-NAME     0.9000    0.8036    0.8491       112
              O     0.9568    0.9177    0.9369      5190
          B-PRO     0.5581    0.7273    0.6316        33
          E-PRO     0.6512    0.8485    0.7368        33
          B-EDU     0.9000    0.9643    0.9310       112
          E-EDU     0.9167    0.9821    0.9483       112
        B-TITLE     0.8811    0.8925    0.8867       772
        M-TITLE     0.9038    0.8751    0.8892      1922
        E-TITLE     0.9514    0.9637    0.9575       772
          B-ORG     0.8422    0.8879    0.8644       553
          M-ORG     0.9002    0.9327    0.9162      4325
          E-ORG     0.8262    0.8680    0.8466       553
         B-CONT     0.9655    1.0000    0.9825        28
         M-CONT     0.9815    1.0000    0.9907        53
         E-CONT     0.9655    1

In [None]:
# Entity-level evaluation of the HMM
from score import SeqEntityScore
def print_score(id2tag ,gold_tag_lists, pred_tag_lists):
    metrics = SeqEntityScore(id2tag= id2tag)
    metrics.update(gold_tag_lists,pred_tag_lists)
    # Compute the final precision, recall, and F1 from the counts collected by metrics
    result = metrics.get_result()
    metrics.report_scores(result)

id2tag = dict((id_, tag) for tag, id_ in tag2id.items())
print_score(id2tag,test_tag_lists, pred_tag_lists)

                 precision    recall  f1-score   support
           NAME     0.9800    0.8750    0.9245       112
            PRO     0.5581    0.7273    0.6316        33
            EDU     0.9000    0.9643    0.9310       112
          TITLE     0.8811    0.8925    0.8867       772
            ORG     0.8422    0.8879    0.8644       553
           CONT     0.9655    1.0000    0.9825        28
           RACE     1.0000    0.9286    0.9630        14
            LOC     0.3333    0.3333    0.3333         6
      avg/total     0.8669    0.8914    0.8790      1630


# 3. Conditional Random Field (CRF)

## 3.1 Installing sklearn-crfsuite

In [3]:
!pip install sklearn-crfsuite

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting sklearn-crfsuite
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/16/c0/e61ec91560d34518a4986a29898f15248a226e7bf201ade882f5fda8f7c1/python_crfsuite-0.9.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (965 kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.8 sklearn-crfsuite-0.3.6

## 3.2 Importing CRF from the sklearn_crfsuite module

In [4]:
from sklearn_crfsuite import CRF



## 3.3 Feature extraction for the CRF

In [None]:
def word2features(sent, i):
    """Extract the features of a single character"""
    word = sent[i]
    prev_word = "<s>" if i == 0 else sent[i-1]
    next_word = "</s>" if i == (len(sent)-1) else sent[i+1]
    # Features used:
    # previous character, current character, next character,
    # previous+current, current+next, previous+current+next,
    # plus a constant bias term
    features = {
        'w': word,
        'w-1': prev_word,
        'w+1': next_word,
        'w-1:w': prev_word+word,
        'w:w+1': word+next_word,
        'w-1:w:w+1':prev_word+word+next_word,
        'bias': 1
    }
    return features


def sent2features(sent):
    """Extract the features of a whole sequence"""
    return [word2features(sent, i) for i in range(len(sent))]

## 3.4 Implementing the CRF model

In [None]:
class CRFModel(object):
    def __init__(self,
                 algorithm='lbfgs',
                 c1=0.1,
                 c2=0.1,
                 max_iterations=100,
                 all_possible_transitions=False
                 ):

        self.model = CRF(algorithm=algorithm,
                         c1=c1,
                         c2=c2,
                         max_iterations=max_iterations,
                         all_possible_transitions=all_possible_transitions)

    def train(self, sentences, tag_lists):
        features = [sent2features(s) for s in sentences]
        self.model.fit(features, tag_lists)

    def test(self, sentences):
        features = [sent2features(s) for s in sentences]
        pred_tag_lists = self.model.predict(features)
        return pred_tag_lists

## 3.5 Training, testing, and evaluating the CRF model

In [None]:
from evaluating import Metrics
# Train the CRF model
crf_model = CRFModel()
crf_model.train(train_word_lists, train_tag_lists)

pred_tag_lists = crf_model.test(test_word_lists)

metrics = Metrics(test_tag_lists, pred_tag_lists, remove_O=False)
metrics.report_scores()
metrics.report_confusion_matrix()

                 precision    recall  f1-score   support
         B-NAME     1.0000    0.9821    0.9910       112
         M-NAME     1.0000    0.9756    0.9877        82
         E-NAME     1.0000    0.9821    0.9910       112
              O     0.9653    0.9659    0.9656      5190
          B-PRO     0.9375    0.9091    0.9231        33
          E-PRO     0.9375    0.9091    0.9231        33
          B-EDU     0.9820    0.9732    0.9776       112
          E-EDU     0.9910    0.9821    0.9865       112
        B-TITLE     0.9417    0.9417    0.9417       772
        M-TITLE     0.9447    0.9069    0.9254      1922
        E-TITLE     0.9819    0.9819    0.9819       772
          B-ORG     0.9550    0.9584    0.9567       553
          M-ORG     0.9436    0.9630    0.9532      4325
          E-ORG     0.9189    0.9222    0.9206       553
         B-CONT     1.0000    1.0000    1.0000        28
         M-CONT     1.0000    1.0000    1.0000        53
         E-CONT     1.0000    1

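# 4. Implementing an entity-level evaluation program

In the Cell below, implement an NER evaluation program that computes, at the entity level, Precision, Recall, and F1 for each entity type and overall. (3 points)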
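The cell below delegates the counting to `SeqEntityScore` from the provided `score` module, whose source is not shown here. To illustrate what entity-level scoring involves, the following is a minimal standalone sketch. The function name `entity_prf` is hypothetical, entities are assumed to be already extracted as `(type, start, end)` triples, an entity counts as correct only on an exact match of type and both boundaries, and the overall score is assumed to be micro-averaged, which may differ from how `SeqEntityScore` aggregates.

```python
from collections import Counter


def entity_prf(gold_spans, pred_spans):
    """Per-type and overall (micro) precision, recall, F1 from entity spans.

    gold_spans / pred_spans: one list of (type, start, end) triples per sentence.
    """
    gold_n, pred_n, right_n = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_spans, pred_spans):
        gold, pred = set(gold), set(pred)
        for etype, _, _ in gold:
            gold_n[etype] += 1
        for etype, _, _ in pred:
            pred_n[etype] += 1
        for etype, _, _ in gold & pred:  # exact match on type and boundaries
            right_n[etype] += 1

    def prf(right, n_pred, n_gold):
        p = right / n_pred if n_pred else 0.0
        r = right / n_gold if n_gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    report = {t: prf(right_n[t], pred_n[t], gold_n[t])
              for t in sorted(set(gold_n) | set(pred_n))}
    report["overall"] = prf(sum(right_n.values()),
                            sum(pred_n.values()),
                            sum(gold_n.values()))
    return report


gold = [[("NAME", 0, 2), ("ORG", 5, 8)]]
pred = [[("NAME", 0, 2), ("ORG", 5, 7)]]  # ORG boundary is off by one
for etype, (p, r, f1) in entity_prf(gold, pred).items():
    print(f"{etype:>8}  P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```

In the usage example the ORG prediction overlaps the gold span but misses its right boundary, so it counts as both a false positive and a false negative. This strictness is the main reason entity-level scores are lower than tag-level ones.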

In [None]:
from score import SeqEntityScore
def print_score(id2tag ,gold_tag_lists, pred_tag_lists):
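    """
    @id2tag: dictionary mapping tag ids to tags; the tags are expected to follow the BIO scheme
    @gold_tag_lists: list of gold tag sequences
    @pred_tag_lists: list of predicted tag sequences
    """
    metrics = SeqEntityScore(id2tag= id2tag)
    metrics.update(gold_tag_lists,pred_tag_lists)
    # Compute the final precision, recall, and F1 from the counts collected by metrics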
    result = metrics.get_result()
    metrics.report_scores(result)

In [None]:
print_score(id2tag,test_tag_lists,pred_tag_lists)

                 precision    recall  f1-score   support
           NAME     1.0000    0.9821    0.9910       112
            PRO     0.9375    0.9091    0.9231        33
            EDU     0.9820    0.9732    0.9776       112
          TITLE     0.9417    0.9417    0.9417       772
            ORG     0.9550    0.9584    0.9567       553
           CONT     1.0000    1.0000    1.0000        28
           RACE     1.0000    1.0000    1.0000        14
            LOC     1.0000    0.8333    0.9091         6
      avg/total     0.9545    0.9528    0.9536      1630


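# 5. Entity recognition with a maximum entropy model
In the Cell below, implement a maximum entropy based entity recognition system. Train it on train.txt in the ner_char_data directory and test it on test.txt. (3 points)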
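For orientation, a maximum entropy tagger in this setting can be read as multinomial logistic regression over per-character feature vectors. The symbols below are notation assumed here for illustration: $f(x)$ is the feature vector of a character in its context, and $w_y$, $b_y$ are the weight vector and bias of tag $y$.

$$
P(y \mid x) = \frac{\exp\left(w_y^{\top} f(x) + b_y\right)}{\sum_{y'} \exp\left(w_{y'}^{\top} f(x) + b_{y'}\right)}
$$

Each character is classified independently, so unlike the HMM and the CRF there is no modelling of transitions between adjacent tags. Any context the model sees has to come from the features themselves.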

In [None]:
# Implement a maximum entropy based entity recognition system here
# https://blog.csdn.net/gary101818/article/details/121902646
# Import LogisticRegression from sklearn (multinomial logistic regression is the maximum entropy model)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from evaluating import Metrics

# Flatten the per-sentence tag lists into one list with one tag per character
train_tag_lists_flatten = [x for train_tag_list in train_tag_lists for x in train_tag_list]

In [None]:
# model
class MaxEntropyModel(object):
    def __init__(self):
        self.model = LogisticRegression(solver="saga", max_iter=100, n_jobs=-1)
        self.vectorizer = DictVectorizer()
        
    def train(self, sentences, tag_lists):
        features = [feature for s in sentences for feature in sent2features(s) ]
        features = self.vectorizer.fit_transform(features)
        self.model.fit(features, tag_lists)

    def test(self, sentences):
        features = [self.vectorizer.transform(sent2features(s)) for s in sentences]
        pred_tag_lists = [self.model.predict(feature).tolist() for feature in features]
        return pred_tag_lists

In [None]:
# train
logistic_model = MaxEntropyModel()
logistic_model.train(train_word_lists, train_tag_lists_flatten)



In [None]:
pred_tag_lists = logistic_model.test(test_word_lists)
# Entity level
print_score(id2tag,test_tag_lists, pred_tag_lists)
# Character level
metrics = Metrics(test_tag_lists, pred_tag_lists, remove_O=False)
metrics.report_scores()
metrics.report_confusion_matrix()

           precision    recall  f1-score   support
     NAME     0.8203    0.9375    0.8750       112
      PRO     0.8500    0.5152    0.6415        33
      EDU     0.9292    0.9375    0.9333       112
    TITLE     0.9119    0.8718    0.8914       772
      ORG     0.8893    0.8571    0.8729       553
     CONT     0.6000    0.9643    0.7397        28
     RACE     1.0000    0.9286    0.9630        14
      LOC     0.0000    0.0000    0.0000         6
avg/total     0.8882    0.8675    0.8777      1630
           precision    recall  f1-score   support
   B-NAME     0.8203    0.9375    0.8750       112
   M-NAME     0.9762    0.5000    0.6613        82
   E-NAME     0.9478    0.9732    0.9604       112
        O     0.9606    0.9435    0.9520      5190
    B-PRO     0.8500    0.5152    0.6415        33
    E-PRO     0.7750    0.9394    0.8493        33
    B-EDU     0.9292    0.9375    0.9333       112
    E-EDU     0.9908    0.9643    0.9774       112
  B-TITLE     0.9119    0.8718 

In [None]:
import score
import evaluating
import importlib
importlib.reload(score)
importlib.reload(evaluating)
from score import SeqEntityScore
from evaluating import Metrics

In [None]:
text = pred_tag_lists[0]
print("".join(text))
xx = pred_tag_lists[0]
print(xx)

for text, tag in zip(test_word_lists[:5], pred_tag_lists[:5]):
    text = "".join(text)
    print(text)
    print(tag)
    entities = SeqEntityScore(tag2id).get_entities_bio(tag)
    for entity in entities:
        entity_type, start, end = entity
        print(f"{text[start:end+1]} : {entity_type}")
    print(">"*10)

B-NAMEM-ORGE-NAMEOOO
['B-NAME', 'M-ORG', 'E-NAME', 'O', 'O', 'O']
常建良，男，
['B-NAME', 'M-ORG', 'E-NAME', 'O', 'O', 'O']
常 : NAME
>>>>>>>>>>
1963年出生，工科学士，高级工程师，北京物资学院客座副教授。
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-TITLE', 'M-ORG', 'B-EDU', 'E-EDU', 'O', 'B-TITLE', 'M-TITLE', 'M-TITLE', 'M-TITLE', 'E-TITLE', 'O', 'B-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'E-ORG', 'B-TITLE', 'M-TITLE', 'M-TITLE', 'M-TITLE', 'E-TITLE', 'O']
工 : TITLE
学 : EDU
高 : TITLE
北 : ORG
客 : TITLE
>>>>>>>>>>
1985年8月—1993年在国家物资局、物资部、国内贸易部金属材料流通司从事国家统配钢材中特种钢材品种的全国调拔分配工作，先后任科员、副主任科员、主任科员。
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'E-ORG', 'O', 'B-TITLE', 'M-TITLE', 'O', 'O', 'O', 'M-ORG', 'M-ORG', 'M-TITLE', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'E-ORG', 'O', 'O', 'O', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'O', 'O', 'O', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'O', 'O', 'O

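# 6. Retraining and testing the HMM, ME, and CRF models on the new data
In the Cells below: train HMM, ME, and CRF models on train.txt in the ner_clue_data directory and test all three on dev.txt. Write each model's predictions on dev.txt to its own result file, and print each model's per-entity-type and overall Precision, Recall, and F1 in the Notebook. (6 points)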
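The task asks for one result file per model. One possible way to write such a file is sketched below. The function name `save_predictions`, the output file name, and the output format (one `character<TAB>tag` pair per line, with a blank line between sentences) are all assumptions, since the required format is not specified here. A tab is used as the separator because CLUENER sentences may themselves contain space characters.

```python
import os
import tempfile


def save_predictions(path, word_lists, tag_lists):
    """Write predictions as 'char<TAB>tag' lines, one blank line between sentences."""
    with open(path, "w", encoding="utf-8") as f:
        for words, tags in zip(word_lists, tag_lists):
            if len(words) != len(tags):
                raise ValueError("words and tags must have the same length")
            for word, tag in zip(words, tags):
                f.write(f"{word}\t{tag}\n")
            f.write("\n")


# Demo on a temporary directory with a made-up one-sentence prediction
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "hmm_dev_pred.txt")  # hypothetical file name
    save_predictions(out_path, [["周", "荫", "如"]], [["B-name", "I-name", "I-name"]])
    with open(out_path, encoding="utf-8") as f:
        print(f.read())
```

In the notebook the same helper could be called once per model, for example with `clue_dev_word_lists` and that model's `pred_tag_lists`.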

In [7]:
# Retrain and test the HMM, ME, and CRF models on the new data here
clue_train_word_lists, clue_train_tag_lists, clue_word2id, clue_tag2id = data_build_gluener(file_name="train.txt", make_vocab=True)
clue_dev_word_lists, clue_dev_tag_lists = data_build_gluener(file_name="dev.txt", make_vocab=False)

clue_test_word_lists, clue_test_tag_lists = data_build_gluener(file_name="test.txt", make_vocab=False)

In [None]:
clue_tag2id

{'B-company': 0,
 'I-company': 1,
 'O': 2,
 'B-name': 3,
 'I-name': 4,
 'B-game': 5,
 'I-game': 6,
 'B-organization': 7,
 'I-organization': 8,
 'B-movie': 9,
 'I-movie': 10,
 'B-position': 11,
 'I-position': 12,
 'B-address': 13,
 'I-address': 14,
 'B-government': 15,
 'I-government': 16,
 'B-scene': 17,
 'I-scene': 18,
 'B-book': 19,
 'I-book': 20,
 'S-company': 21,
 'S-address': 22,
 'S-name': 23,
 'S-position': 24}

HMM

In [None]:
import numpy as np
# N: number of states, i.e. the number of distinct tags
# M: number of observations, i.e. the number of distinct characters
tag2id = clue_tag2id
word2id = clue_word2id

N, M = len(tag2id), len(word2id)
# State transition matrix: A[i][j] is the probability of moving from state i to state j
A = np.zeros(shape=(N, N), dtype=float)
# Emission matrix: B[i][j] is the probability of emitting observation j in state i
B = np.zeros(shape=(N, M), dtype=float)
# Initial state distribution: Pi[i] is the probability of starting in state i
Pi = np.zeros(shape=N, dtype=float)

""" Build the transition matrix """
for tag_list in clue_train_tag_lists:
    seq_len = len(tag_list)
    for i in range(seq_len - 1):
        current_tagid = tag2id[tag_list[i]]
        next_tagid = tag2id[tag_list[i+1]]
        A[current_tagid][next_tagid] += 1
A[A == 0.] = 1e-10  # smoothing
A = A / np.sum(a=A, axis=1, keepdims=True)

""" Build the emission matrix """
for tag_list, word_list in zip(clue_train_tag_lists, clue_train_word_lists):
    assert len(tag_list) == len(word_list)
    for tag, word in zip(tag_list, word_list):
        tag_id = tag2id[tag]
        word_id = word2id[word]
        B[tag_id][word_id] += 1
B[B == 0.] = 1e-10  # smoothing
B = B / np.sum(a=B, axis=1, keepdims=True)

""" Build the initial state distribution """
for tag_list in clue_train_tag_lists:
    init_tagid = tag2id[tag_list[0]]
    Pi[init_tagid] += 1
Pi[Pi == 0.] = 1e-10  # smoothing
Pi = Pi / np.sum(a=Pi)


""" Use the HMM to tag the data in dev.txt under ner_clue_data """
pred_tag_lists = []
for word_list in clue_dev_word_lists:
    pred_tag_list = viterbi(word_list, word2id, tag2id)
    pred_tag_lists.append(pred_tag_list)


In [None]:
# Invert tag2id into an id -> tag dictionary (the original built a set of tuples)
clue_id2tag = {y: x for x, y in tag2id.items()}

In [None]:
# Entity level
print_score(clue_id2tag,clue_dev_tag_lists, pred_tag_lists)
# Character level
metrics = Metrics(clue_dev_tag_lists, pred_tag_lists, remove_O=False)
metrics.report_scores()
metrics.report_confusion_matrix()

                 precision    recall  f1-score   support
           name     0.5456    0.5656    0.5554       465
        address     0.3531    0.3190    0.3352       373
   organization     0.4601    0.4714    0.4657       367
           game     0.5460    0.6644    0.5994       295
          scene     0.4010    0.3780    0.3892       209
           book     0.3361    0.2597    0.2930       154
        company     0.4592    0.4471    0.4531       378
       position     0.5666    0.6189    0.5916       433
     government     0.3656    0.4737    0.4127       247
          movie     0.4518    0.4967    0.4732       151
      avg/total     0.4689    0.4880    0.4782      3072
                 precision    recall  f1-score   support
         B-name     0.6826    0.7075    0.6948       465
         I-name     0.5434    0.7111    0.6160      1021
              O     0.9438    0.9175    0.9305     36747
      B-address     0.5104    0.4611    0.4845       373
      I-address     0.5780    0

CRF

In [None]:
# Train the CRF model
crf_model = CRFModel()
crf_model.train(clue_train_word_lists, clue_train_tag_lists)

pred_tag_lists = crf_model.test(clue_dev_word_lists)

In [None]:
# Entity level
print_score(clue_id2tag,clue_dev_tag_lists, pred_tag_lists)
# Character level
metrics = Metrics(clue_dev_tag_lists, pred_tag_lists, remove_O=False)
metrics.report_scores()
metrics.report_confusion_matrix()



                 precision    recall  f1-score   support
           name     0.7778    0.6925    0.7327       465
        address     0.6021    0.4665    0.5257       373
   organization     0.7934    0.7221    0.7561       367
           game     0.7845    0.7898    0.7872       295
          scene     0.7211    0.5072    0.5955       209
           book     0.7742    0.6234    0.6906       154
        company     0.7822    0.7222    0.7510       378
       position     0.8226    0.7067    0.7602       433
     government     0.7924    0.7571    0.7743       247
          movie     0.7279    0.7086    0.7181       151
      avg/total     0.7638    0.6735    0.7158      3072
                 precision    recall  f1-score   support
         B-name     0.8382    0.7462    0.7895       465
         I-name     0.7781    0.7522    0.7649      1021
              O     0.9461    0.9760    0.9608     36747
      B-address     0.7232    0.5603    0.6314       373
      I-address     0.7514    0

Maximum entropy

In [None]:
clue_train_tag_lists_flatten = [x for train_tag_list in clue_train_tag_lists for x in train_tag_list]
print("done train_tag_lists_flatten")
# train
logistic_model = MaxEntropyModel()
logistic_model.train(clue_train_word_lists, clue_train_tag_lists_flatten)

# test
pred_tag_lists = logistic_model.test(clue_dev_word_lists)

print_score(clue_id2tag, clue_dev_tag_lists, pred_tag_lists)

metrics = Metrics(clue_dev_tag_lists, pred_tag_lists, remove_O=False)
metrics.report_scores()

done train_tag_lists_flatten




                 precision    recall  f1-score   support
           name     0.5785    0.4753    0.5218       465
        address     0.2986    0.2306    0.2602       373
   organization     0.6300    0.5613    0.5937       367
           game     0.6339    0.6339    0.6339       295
          scene     0.3608    0.1675    0.2288       209
           book     0.3279    0.2597    0.2899       154
        company     0.4841    0.4021    0.4393       378
       position     0.7957    0.6028    0.6859       433
     government     0.5130    0.3198    0.3940       247
          movie     0.3276    0.2517    0.2846       151
      avg/total     0.5386    0.4248    0.4750      3072
                 precision    recall  f1-score   support
         B-name     0.8010    0.6581    0.7226       465
         I-name     0.6560    0.6650    0.6605      1021
              O     0.9169    0.9719    0.9436     36747
      B-address     0.5938    0.4584    0.5174       373
      I-address     0.6236    0

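# 7. Training and testing a BiLSTM+CRF model
In the Cells below: using Baidu's Paddle framework, train a BiLSTM+CRF model on train.txt in the ner_clue_data directory and test it on dev.txt. Output the model's predictions on dev.txt, and print its per-entity-type and overall Precision, Recall, and F1 in the Notebook. (2 points)

Reference project: https://aistudio.baidu.com/aistudio/projectdetail/4877149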

In [None]:
# Implement a BiLSTM+CRF entity recognition model here, based on Baidu's Paddle framework

from score import SeqEntityScore
from tqdm import tqdm
import numpy as np
import paddle


## 7.1 Data loading

In [None]:
# train_word_lists, train_tag_lists, word2id, tag2id 
""" Read and process the data files in the ner_clue_data directory """
import json
    
def data_build_gluener(file_name:str, make_vocab=True):
    word_lists = []
    tag_lists = []
    with open('./ner_clue_data/' + file_name, 'r', encoding='utf-8') as f:

        for line in f:
            line = json.loads(line.strip())
            text = line['text']
            # Records without a 'label' field are tagged as all 'O'
            label_items = line.get('label', None)
            
            # Build the tag sequence
            labels = ['O']*len(text)
            if label_items is not None:
                for key,value in label_items.items():
                    for name, index in value.items():
                        for start_idx, end_idx in index:
                            assert text[start_idx:end_idx + 1] == name
                            if(start_idx == end_idx):
                                labels[start_idx] = 'S-' + key
                            else:
                                labels[start_idx] = 'B-'+ key
                                labels[start_idx+1: end_idx+1] = ['I-'+key]*(end_idx - start_idx)
            word_lists.append(list(text))
            tag_lists.append(labels)

    if make_vocab == True:
        word2id = {}
        for word_list in word_lists:
            for word in word_list:
                if word not in word2id:
                    word2id[word] = len(word2id)
        tag2id = {}
        for tag_list in tag_lists:
            for tag in tag_list:
                if tag not in tag2id:
                    tag2id[tag] = len(tag2id)
        return word_lists, tag_lists, word2id, tag2id

    return word_lists, tag_lists

# Load a dictionary file (one tab-separated "id<TAB>entry" pair per line)
import os
def load_dict(dict_name):
    assert dict_name in ["tag", "vocab"]
    path = './ner_clue_data/'
    data_path = os.path.join(path, dict_name+".dict")
    data_dict = {}
    with open(data_path, "r", encoding="utf-8") as f:
        lines = [item.strip().split("\t") for item in f.readlines()]
        data_dict = dict([(item[1], int(item[0])) for item in lines])

    return data_dict              


# clue_train_word_lists, clue_train_tag_lists, clue_word2id, clue_tag2id = data_build_gluener(file_name="train.txt", make_vocab=True)
# clue_dev_word_lists, clue_dev_tag_lists = data_build_gluener(file_name="dev.txt", make_vocab=False)
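
To make the span-to-tag conversion in `data_build_gluener` concrete, here is a small self-contained sketch that applies the same rule to one record. The sample line is illustrative and only assumed to follow the layout the function reads (`text` plus a `label` mapping of entity type to mention to inclusive `[start, end]` spans); `spans_to_bios` is a hypothetical helper name.

```python
import json

def spans_to_bios(text, label_items):
    """Turn inclusive [start, end] spans into one BIOS tag per character."""
    labels = ["O"] * len(text)
    for entity_type, mentions in (label_items or {}).items():
        for mention, spans in mentions.items():
            for start, end in spans:
                # Spans are inclusive on both ends
                assert text[start:end + 1] == mention
                if start == end:
                    labels[start] = "S-" + entity_type
                else:
                    labels[start] = "B-" + entity_type
                    labels[start + 1:end + 1] = ["I-" + entity_type] * (end - start)
    return labels

line = ('{"text": "浙商银行企业信贷部叶老桂博士", '
        '"label": {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}}')
record = json.loads(line)
labels = spans_to_bios(record["text"], record.get("label"))
for char, tag in zip(record["text"], labels):
    print(char, tag)
```

Single-character entities get an `S-` tag, which is why the tag dictionary contains `S-*` entries in addition to `B-*` and `I-*`.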

In [None]:
clue_train_word_lists, clue_train_tag_lists = data_build_gluener(file_name="train.txt", make_vocab=False)
clue_dev_word_lists, clue_dev_tag_lists = data_build_gluener(file_name="dev.txt", make_vocab=False)

clue_word2id = load_dict(dict_name="vocab")
clue_tag2id = load_dict(dict_name="tag")
clue_id2tag = dict([items[1], items[0]] for items in clue_tag2id.items())   
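
`load_dict` reads each line as an id followed by a tab and the entry, and builds an entry-to-id mapping. The sketch below reproduces that parsing on an in-memory sample (the three sample lines are illustrative, and `parse_dict` is a hypothetical name), then inverts the mapping the same way `clue_id2tag` is built.

```python
import io

def parse_dict(lines):
    """Parse 'id<TAB>entry' lines into an entry -> id mapping."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        idx, entry = line.split("\t")
        mapping[entry] = int(idx)
    return mapping

sample = io.StringIO("0\tO\n1\tB-address\n2\tB-book\n")
tag2id = parse_dict(sample)
id2tag = {idx: tag for tag, idx in tag2id.items()}
print(tag2id)
print(id2tag)
```

This sketch strips only the trailing newline rather than all surrounding whitespace, so an entry that is itself a whitespace character would survive parsing.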

In [None]:
clue_id2tag

{0: 'O',
 1: 'B-address',
 2: 'B-book',
 3: 'B-company',
 4: 'B-game',
 5: 'B-government',
 6: 'B-movie',
 7: 'B-name',
 8: 'B-organization',
 9: 'B-position',
 10: 'B-scene',
 11: 'I-address',
 12: 'I-book',
 13: 'I-company',
 14: 'I-game',
 15: 'I-government',
 16: 'I-movie',
 17: 'I-name',
 18: 'I-organization',
 19: 'I-position',
 20: 'I-scene',
 21: 'S-address',
 22: 'S-book',
 23: 'S-company',
 24: 'S-game',
 25: 'S-government',
 26: 'S-movie',
 27: 'S-name',
 28: 'S-organization',
 29: 'S-position',
 30: 'S-scene'}

In [None]:
# utils
import paddle

def conver_word_to_id(word_list, word2id):
    # Fall back to the "[UNK]" entry (when the dictionary has one) for out-of-vocabulary items
    unk_id = word2id.get("[UNK]")
    return [word2id.get(word, unk_id) for word in word_list]

def conver_tag_to_id(tag_lists, tag2id):
    return conver_word_to_id(tag_lists, tag2id)

def zip_data (word_lists, tag_lists, word2id, tag2id):
    """
    Returns [(word_ids, tag_ids, length), (word_ids, tag_ids, length), ...]
    """
    return [(conver_word_to_id(x,word2id), conver_tag_to_id(y, tag2id),len(x)) for x,y in zip (word_lists,tag_lists)]

train_set = zip_data(clue_train_word_lists, clue_train_tag_lists, clue_word2id, clue_tag2id)
dev_set = zip_data(clue_dev_word_lists, clue_dev_tag_lists, clue_word2id, clue_tag2id)


In [None]:
import random

class DatasetLoader(object):
    def __init__(self, data, batch_size, shuffle=False, sort=True, drop_last=False):
        self.examples = data
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.sort = sort
        self.drop_last = drop_last
        self.split_data()
    
    # Split the data into mini-batches
    def split_data(self):
        # Sort the examples by length, longest first
        if self.sort:
            self.examples = sorted(self.examples, key=lambda x: x[2], reverse=True)
        if self.shuffle:
            indices = list(range(len(self.examples)))
            random.shuffle(indices)
            self.examples = [self.examples[i] for i in indices]
        self.batch_data = [self.examples[i:i + self.batch_size] for i in range(0, len(self.examples), self.batch_size)]
        # Use the loader's own batch size rather than a global variable
        if self.drop_last and len(self.batch_data[-1]) < self.batch_size:
            self.batch_data = self.batch_data[:-1]

    def padding_sequence(self, batch, batch_size, mask=None):
        # Pad every example in the batch to the length of the longest one
        tokens_list, tags_list, lens = batch
        max_len = max(lens)
        batch_size = len(lens)
        batch_token = paddle.full(shape=[batch_size, max_len], fill_value=0, dtype="int64")
        batch_tag = paddle.full(shape=[batch_size, max_len], fill_value=0, dtype="int64")
        batch_mask = paddle.full(shape=[batch_size, max_len], fill_value=0, dtype="int64")
        for i in range(batch_size):

            batch_token[i, :lens[i]] = paddle.to_tensor(tokens_list[i], dtype="int64")
            batch_tag[i, :lens[i]] = paddle.to_tensor(tags_list[i], dtype="int64")

            if mask:
                batch_mask[i, :lens[i]] = paddle.to_tensor([1] * lens[i], dtype="int64") 
        if mask:
            return batch_token, batch_tag, batch_mask

        return batch_token, batch_tag

    def __len__(self):
        return len(self.batch_data)

    def __getitem__(self, index):
        # Each call returns one batch
        # Each example in the batch has three parts: text, tag, len
        batch = self.batch_data[index]
        batch_size = len(batch)
        # Transpose the batch into three lists: text, tag, len
        batch = list(zip(*batch))
        batch_ids, batch_tags, batch_mask = self.padding_sequence(batch, batch_size, mask=True)
        batch_lens = paddle.to_tensor(batch[2])

        return (batch_ids, batch_tags, batch_mask, batch_lens)
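
The padding and mask logic of `padding_sequence` can be seen without Paddle. This is a plain-Python sketch under the same assumptions as the class above (pad id 0, mask value 1 for real tokens and 0 for padding); `pad_batch` is a hypothetical name.

```python
def pad_batch(token_ids, tag_ids, pad_id=0):
    """Pad variable-length sequences to the batch maximum and build a 0/1 mask."""
    max_len = max(len(seq) for seq in token_ids)
    padded_tokens, padded_tags, mask = [], [], []
    for tokens, tags in zip(token_ids, tag_ids):
        n_pad = max_len - len(tokens)
        padded_tokens.append(list(tokens) + [pad_id] * n_pad)
        padded_tags.append(list(tags) + [pad_id] * n_pad)
        mask.append([1] * len(tokens) + [0] * n_pad)
    return padded_tokens, padded_tags, mask

tokens, tags, mask = pad_batch([[5, 8, 2], [7]], [[3, 13, 13], [0]])
print(tokens)
print(tags)
print(mask)
```

Because tag id 0 is also the id of `O`, padded tag positions are indistinguishable from real `O` tags unless the mask or the true lengths are used downstream.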


## 7.2 Model architecture

In [None]:
import paddlenlp.layers.crf  as crf

class NERModel(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_size, hidden_size, label2id, n_layers=2, drop_p = 0.1 ):
        super(NERModel, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.label2id = label2id

        self.embedding = paddle.nn.Embedding(vocab_size, embedding_size)
        self.bilstm = paddle.nn.LSTM(input_size = embedding_size, hidden_size = hidden_size,direction="bidirectional", num_layers=n_layers, dropout=drop_p )
        self.layer_norm = paddle.nn.LayerNorm(hidden_size * 2)
        self.dropout_emb = paddle.nn.Dropout(drop_p)
        self.classifier = paddle.nn.Linear(hidden_size*2 , len(label2id)+2) # add START and STOP tag

        self.crf = crf.LinearChainCrf(len(label2id), crf_lr = 0.001, with_start_stop_tag = True)
        self.crf_loss = crf.LinearChainCrfLoss(self.crf)
        self.viterbi_decoder = crf.ViterbiDecoder(self.crf.transitions)
    
    def forward(self, input_id, input_mask):
        # input_id: [batch_size, seq_len]
        embs = self.embedding(input_id) #[batch_size, seq_len, embedding_size]
        embs = self.dropout_emb(embs)
        embs = embs*paddle.to_tensor(input_mask, dtype = 'float32').unsqueeze(2)

        last_layer_hiddens, _ =  self.bilstm(embs)
        last_layer_hiddens = self.layer_norm(last_layer_hiddens)
        features = self.classifier(last_layer_hiddens)
        
        return features
    
    def forward_loss(self, input_ids, input_mask, input_lens, input_tags=None):
        features = self.forward(input_ids, input_mask)
        if input_tags is not None:
            return features, self.crf_loss(features, input_lens,"",input_tags)
        return features
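
The model above hands its emission scores to a CRF, and decoding picks the tag sequence with the best combined emission and transition score. The following is a simplified single-sequence Viterbi sketch in plain Python; it is not the `paddlenlp` implementation, and it assumes no START/STOP tags, no batching, and hand-picked scores purely for illustration.

```python
def viterbi_decode(emissions, transitions):
    """emissions[t][j]: score of tag j at step t; transitions[i][j]: score of moving i -> j."""
    n_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_tags):
            # Best previous tag for landing on tag j at this step
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            step_back.append(best_i)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final tag
    best_last = max(range(n_tags), key=lambda j: scores[j])
    path = [best_last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return scores[best_last], path

# Tags: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalised.
emissions = [[1.0, 0.9, 0.0],
             [0.0, 0.0, 1.0]]
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
score, path = viterbi_decode(emissions, transitions)
greedy = [max(range(3), key=lambda j: e[j]) for e in emissions]
print("viterbi:", path, "greedy:", greedy)
```

Per-position argmax would choose O then I here, which is not a valid BIOS sequence; the transition scores let the decoder prefer B then I instead.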

## 7.3 Training and saving

In [None]:
paddle.get_device()

'gpu:0'

In [None]:
# Hyperparameters
n_epochs = 100
batch_size = 32
vocab_size = len(clue_word2id.keys())
embedding_size = 128
hidden_size = 384
n_layers = 2
dropout_rate = 0.1
learning_rate = 0.001

In [None]:
train_loader = DatasetLoader(train_set, batch_size, False, True)
test_loader = DatasetLoader(dev_set, batch_size,  False ,True)


use_gpu = True if paddle.get_device().startswith("gpu") else False
if use_gpu:
    paddle.set_device('gpu:0')

# Instantiate the model
ner_model = NERModel(vocab_size=vocab_size, embedding_size=embedding_size,
                     hidden_size=hidden_size,label2id=clue_tag2id, n_layers=n_layers, drop_p=dropout_rate)

# Specify the optimizer
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                  parameters=ner_model.parameters())

In [None]:

# Model evaluation
def evaluate(model,test_loader):
    # Metric object that accumulates entity-level statistics
    metric = SeqEntityScore(clue_id2tag)
    model.eval()

    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            
            # Forward pass to obtain the emission scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Decode with the CRF (Viterbi) from the emission scores
            scores, pred_paths = model.viterbi_decoder(features, batch_lens)

            # Trim the predictions to each sequence's true length and map ids to tag names
            seq_lens = batch_lens.numpy().tolist()
            pred_paths = [[clue_id2tag[int(tag_id)] for tag_id in tag_seq[:tag_len]]
                          for tag_seq, tag_len in zip(pred_paths.numpy().tolist(), seq_lens)]
            
            # Trim the gold tag sequences to each sequence's true length
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, seq_lens)]

            # Update the metric statistics
            metric.update(pred_paths=pred_paths, real_paths=real_paths)

    # Compute the final precision, recall, and F1 from the accumulated statistics
    result = metric.get_result()
    #format_print(result)
    metric.format_print(result)

    return result 

from evaluating import Metrics
from score import SeqEntityScore
def evaluate_detail(model,test_loader):
    # Metric object that accumulates entity-level statistics
    metric_ent = SeqEntityScore(clue_id2tag)
    model.eval()
    
    test_tag_lists = []
    pred_tag_lists = []
    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            
            # Forward pass to obtain the emission scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Decode with the CRF (Viterbi) from the emission scores
            scores, pred_paths = model.viterbi_decoder(features, batch_lens)

            # Trim the predictions to each sequence's true length and map ids to tag names,
            # so that the token-level lists below stay aligned with the gold tags
            seq_lens = batch_lens.numpy().tolist()
            pred_paths = [[clue_id2tag[int(tag_id)] for tag_id in tag_seq[:tag_len]]
                          for tag_seq, tag_len in zip(pred_paths.numpy().tolist(), seq_lens)]
            
            # Trim the gold tag sequences to each sequence's true length
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, seq_lens)]

            # Update the metric statistics
            metric_ent.update(pred_paths=pred_paths, real_paths=real_paths)

            test_tag_lists.extend([clue_id2tag[i] for real_path in real_paths for i in real_path  ] )
            pred_tag_lists.extend([pred_path for pred_path in pred_paths])
    
    # Compute the final precision, recall, and F1 from the accumulated statistics
    result = metric_ent.get_result()
    metric_ent.report_scores(result)
    
    metrics = Metrics(test_tag_lists, pred_tag_lists, remove_O=False)
    metrics.report_scores()
    metrics.report_confusion_matrix()

    return result 
    

# Model training
def train(model, train_loader, test_loader):

    for epoch in range(1, 1 + n_epochs):
       
        model.train()
        print(f"Epoch {epoch}/{n_epochs}")
        for step, batch in enumerate(train_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            # Forward pass and loss computation
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)
            loss = paddle.mean(loss)
            
            # Backpropagate and update the parameters
            loss.backward()
            optimizer.step()
            optimizer.clear_gradients()

            # Log progress periodically
            if step % 20 ==0:
                print(f"epoch: {epoch}, step: {step}, loss: {float(loss)}")
        
        # Evaluate on the dev set after each epoch
        evaluate(model, test_loader)
    evaluate_detail(model, test_loader)
    
train(ner_model, train_loader, test_loader)


## 7.4 Saving

In [None]:
# Name under which the model is saved
model_name = "ner_model"
# Save the model
paddle.save(ner_model.state_dict(), "{}.pdparams".format(model_name))
paddle.save(optimizer.state_dict(), "{}.optparams".format(model_name))

In [None]:
# load
layer_state_dict = paddle.load("ner_model.pdparams")
opt_state_dict = paddle.load("ner_model.optparams")

# Instantiate the model
ner_model = NERModel(vocab_size=vocab_size, embedding_size=embedding_size,
                     hidden_size=hidden_size,label2id=clue_tag2id, n_layers=n_layers, drop_p=dropout_rate)

# Specify the optimizer
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                  parameters=ner_model.parameters())
                                
ner_model.set_state_dict(layer_state_dict)
optimizer.set_state_dict(opt_state_dict)

In [None]:
metric = SeqEntityScore(clue_id2tag)
def infer(model, text, word2id, id2tag):
    model.eval()
    # Preprocess: map characters to ids, falling back to "[UNK]" for unseen characters
    tokens = [word2id.get(w, word2id.get("[UNK]")) for w in list(text)]
    tokens_len = len(tokens)

    # Build the model inputs
    tokens = paddle.to_tensor(tokens, dtype="int64").unsqueeze(0)
    tokens_mask = paddle.to_tensor([1] * tokens_len, dtype="int64").unsqueeze(0)
    tokens_len = paddle.to_tensor([tokens_len], dtype="int64")

    # Compute the emission scores
    features = model.forward_loss(tokens, tokens_mask, tokens_len)

    # Decode from the emission scores
    _, pred_paths = model.viterbi_decoder(features, tokens_len)

    print(pred_paths[0])
    print([id2tag.get(int(x)) for x in pred_paths[0] ])
    print(text)
    # Extract the entities from the decoded path
    entities = SeqEntityScore(id2tag).get_entities_bios(pred_paths[0])
    print(entities)
    for entity in entities:
        entity_type, start, end = entity
        print(f"{text[start:end+1]} : {entity_type}")


text="今年1月中国光大银行沈阳银行致电，称由于他的信用记录良好，银行可把普通信用卡免费升级为白金卡"
infer(ner_model, text, clue_word2id, clue_id2tag)
    
    

In [None]:
seq = [2, 2, 2, 2, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
metric.get_entities_bio(seq)

[['company', 4, 13]]
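
The output above has the shape `[type, start, end]` with inclusive indices. As a rough sketch of how such spans can be read off a BIOS tag sequence (this is one possible convention written for illustration, not the code of `SeqEntityScore`; `extract_entities` is a hypothetical name), an `I-` tag only extends an open entity of the same type, and a stray `I-` tag is ignored.

```python
def extract_entities(tags):
    """Collect [type, start, end] spans (inclusive) from BIOS tag strings."""
    entities = []
    current = None  # [type, start, end] of the entity currently being built
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S" and entity_type:
            if current:
                entities.append(current)
            entities.append([entity_type, i, i])
            current = None
        elif prefix == "B" and entity_type:
            if current:
                entities.append(current)
            current = [entity_type, i, i]
        elif prefix == "I" and current and current[0] == entity_type:
            current[2] = i
        else:
            # "O", or an "I-" tag that does not continue the open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

tags = ["B-company", "I-company", "I-company", "I-company", "O", "O", "O", "O", "O",
        "B-name", "I-name", "I-name", "O", "S-position"]
entities = extract_entities(tags)
print(entities)
```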

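# Retraining without the CRF layer
Are the results actually higher than with the CRF?
See the next section, which removes the CRF layer directly from the BiLSTM+CRF model.

## Model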


In [None]:
import paddle

class NERModel_withoutCRF(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_size, hidden_size, label2id, n_layers=2, drop_p = 0.1 ):
        super(NERModel_withoutCRF, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.label2id = label2id

        self.embedding = paddle.nn.Embedding(vocab_size, embedding_size)
        self.bilstm = paddle.nn.LSTM(input_size = embedding_size, hidden_size = hidden_size,direction="bidirectional", num_layers=n_layers, dropout=drop_p )
        self.layer_norm = paddle.nn.LayerNorm(hidden_size * 2)
        self.dropout_emb = paddle.nn.Dropout(drop_p)
        self.classifier = paddle.nn.Linear(hidden_size*2 , len(label2id)+2) # add START and STOP tag
        
    
    def decoder(self, features):
        """
        features [N, seqlen, tag_size]
        """
        return paddle.argmax(features, axis= -1) #(N,seqlen)
    
    def forward(self, input_id, input_mask):
        # input_id: [batch_size, seq_len]
        embs = self.embedding(input_id) #[batch_size, seq_len, embedding_size]
        embs = self.dropout_emb(embs)
        embs = embs*paddle.to_tensor(input_mask, dtype = 'float32').unsqueeze(2)

        last_layer_hiddens, _ =  self.bilstm(embs)
        last_layer_hiddens = self.layer_norm(last_layer_hiddens)
        features = self.classifier(last_layer_hiddens)
        
        return features
    
    def forward_loss(self, input_ids, input_mask, input_lens, input_tags=None):
        features = self.forward(input_ids, input_mask)
        # Without gold tags (inference) there is no loss to compute
        if input_tags is None:
            return features
        # Token-level cross-entropy replaces the CRF loss; padded positions are not masked out here
        input_tags = paddle.unsqueeze(input_tags, axis=-1)
        loss = paddle.nn.functional.softmax_with_cross_entropy(logits=features, label=input_tags)
        return features,loss
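
The cross-entropy loss in this model is computed at every position of the padded batch, so padded positions (tag id 0, the id of `O`) also contribute once the loss is averaged. One possible refinement, shown here as a plain-Python sketch rather than as the notebook's method, is to average only over positions where the mask is 1. The function name and the toy numbers are assumptions made for illustration.

```python
import math

def masked_cross_entropy(logits, labels, mask):
    """Mean token-level cross-entropy over positions where mask == 1."""
    total, count = 0.0, 0
    for seq_logits, seq_labels, seq_mask in zip(logits, labels, mask):
        for token_logits, label, keep in zip(seq_logits, seq_labels, seq_mask):
            if not keep:
                continue
            # Numerically stable log-sum-exp
            m = max(token_logits)
            log_z = m + math.log(sum(math.exp(x - m) for x in token_logits))
            total += log_z - token_logits[label]
            count += 1
    return total / count if count else 0.0

# Two sequences of length 2; the last position of the second one is padding.
logits = [[[2.0, 0.0], [0.0, 2.0]],
          [[0.0, 0.0], [5.0, 0.0]]]
labels = [[0, 1], [1, 0]]
mask = [[1, 1], [1, 0]]
loss = masked_cross_entropy(logits, labels, mask)
print(round(loss, 6))
```

Without the mask, the confidently "correct" padded position would pull the average loss down, which makes losses from batches with different amounts of padding hard to compare.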

In [None]:
# Hyperparameters
n_epochs = 50
batch_size = 32
vocab_size = len(clue_word2id.keys())
embedding_size = 128
hidden_size = 384
n_layers = 2
dropout_rate = 0.1
learning_rate = 0.001


train_loader = DatasetLoader(train_set, batch_size, False, True)
test_loader = DatasetLoader(dev_set, batch_size,  False ,True)
use_gpu = True if paddle.get_device().startswith("gpu") else False
if use_gpu:
    paddle.set_device('gpu:0')

# Instantiate the model
model_withouCRF = NERModel_withoutCRF(vocab_size=vocab_size, embedding_size=embedding_size,
                     hidden_size=hidden_size,label2id=clue_tag2id, n_layers=n_layers, drop_p=dropout_rate)

# Specify the optimizer
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                  parameters=model_withouCRF.parameters())

In [None]:

# Model training
def train(model, train_loader, test_loader):

    for epoch in range(1, 1 + n_epochs):
        model.train()
        print(f"Epoch {epoch}/{n_epochs}")
        for step, batch in enumerate(train_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            # Forward pass and loss computation
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)
            loss = paddle.mean(loss)
            
            # Backpropagate and update the parameters
            loss.backward()
            optimizer.step()
            optimizer.clear_gradients()

            # Log progress periodically
            if step % 20 ==0:
                print(f"epoch: {epoch}, step: {step}, loss: {float(loss)}")
        
        # Evaluate on the dev set after each epoch
        evaluate(model, test_loader)
    
    evaluate_detail(model, test_loader)

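# Model evaluation
def evaluate(model,test_loader):
    # Metric object that accumulates entity-level statistics
    metric = SeqEntityScore(clue_id2tag)
    model.eval()

    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            
            # Forward pass to obtain the per-token tag scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Greedy decoding (no CRF): take the argmax tag at each position
            pred_paths = model.decoder(features) #(N,seqlen)

            # Trim the predictions to each sequence's true length and map ids to tag names;
            # without trimming, arbitrary predictions on padding would be scored as well
            seq_lens = batch_lens.numpy().tolist()
            pred_paths = [[clue_id2tag[int(tag_id)] for tag_id in tag_seq[:tag_len]]
                          for tag_seq, tag_len in zip(pred_paths.numpy().tolist(), seq_lens)]
            
            # Trim the gold tag sequences to each sequence's true length
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, seq_lens)]

            # Update the metric statistics
            metric.update(pred_paths=pred_paths, real_paths=real_paths)

    # Compute the final precision, recall, and F1 from the accumulated statistics
    result = metric.get_result()
    #format_print(result)
    metric.format_print(result)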

    return result 


from evaluating import Metrics
from score import SeqEntityScore
def evaluate_detail(model,test_loader):
    # Metric object that accumulates entity-level statistics
    metric_ent = SeqEntityScore(clue_id2tag)
    model.eval()
    
    test_tag_lists = []
    pred_tag_lists = []
    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            
            # Forward pass to obtain the per-token tag scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Greedy decoding (no CRF): take the argmax tag at each position
            pred_paths = model.decoder(features) #(N,seqlen)

            # Trim the predictions to each sequence's true length and map ids to tag names,
            # so that the token-level lists below stay aligned with the gold tags
            seq_lens = batch_lens.numpy().tolist()
            pred_paths = [[clue_id2tag[int(tag_id)] for tag_id in tag_seq[:tag_len]]
                          for tag_seq, tag_len in zip(pred_paths.numpy().tolist(), seq_lens)]
            
            # Trim the gold tag sequences to each sequence's true length
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, seq_lens)]

            # Update the metric statistics
            metric_ent.update(pred_paths=pred_paths, real_paths=real_paths)

            test_tag_lists.extend([clue_id2tag[i] for real_path in real_paths for i in real_path  ] )
            pred_tag_lists.extend([pred_path for pred_path in pred_paths])
    
    # Compute the final precision, recall, and F1 from the accumulated statistics
    result = metric_ent.get_result()
    metric_ent.report_scores(result)
    
    metrics = Metrics(test_tag_lists, pred_tag_lists, remove_O=False)
    metrics.report_scores()
    metrics.report_confusion_matrix()

    return result 


train(model_withouCRF, train_loader, test_loader)


In [None]:
# Name under which the model is saved
model_name = "withoutCRF_model"
# Save the model
paddle.save(model_withouCRF.state_dict(), "{}.pdparams".format(model_name))
paddle.save(optimizer.state_dict(), "{}.optparams".format(model_name))

In [None]:
# load
layer_state_dict = paddle.load("withoutCRF_model.pdparams")
opt_state_dict = paddle.load("withoutCRF_model.optparams")

# Instantiate the model
model_withouCRF = NERModel_withoutCRF(vocab_size=vocab_size, embedding_size=embedding_size,
                     hidden_size=hidden_size,label2id=clue_tag2id, n_layers=n_layers, drop_p=dropout_rate)

# Specify the optimizer
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                  parameters=model_withouCRF.parameters())
                                
model_withouCRF.set_state_dict(layer_state_dict)
optimizer.set_state_dict(opt_state_dict)

In [None]:
evaluate_detail(model_withouCRF, test_loader)

In [None]:


metric = SeqEntityScore(clue_id2tag)
def infer(model, text, word2id, id2tag):
    model.eval()
    # Preprocess: map characters to ids, falling back to "[UNK]" for unseen characters
    tokens = [word2id.get(w, word2id.get("[UNK]")) for w in list(text)]
    tokens_len = len(tokens)

    # Build the model inputs
    tokens = paddle.to_tensor(tokens, dtype="int64").unsqueeze(0)
    tokens_mask = paddle.to_tensor([1] * tokens_len, dtype="int64").unsqueeze(0)

    # Compute the per-token tag scores; no gold tags exist at inference time,
    # so call the forward pass directly instead of forward_loss
    features = model(tokens, tokens_mask)

    # Greedy decoding: decoder returns a single tensor of argmax tag ids, shape (1, seq_len)
    pred_paths = model.decoder(features)

    print(pred_paths[0])
    print([id2tag.get(int(x)) for x in pred_paths[0] ])
    print(text)
    # Extract the entities from the decoded path
    entities = SeqEntityScore(id2tag).get_entities_bios(pred_paths[0])
    print(entities)
    for entity in entities:
        entity_type, start, end = entity
        print(f"{text[start:end+1]} : {entity_type}")


text="今年1月中国光大银行沈阳银行致电，称由于他的信用记录良好，银行可把普通信用卡免费升级为白金卡"
infer(model_withouCRF, text, clue_word2id, clue_id2tag)
    
    

#  BI-LSTM+CRF

In [8]:
import os
import json
import random
import numpy as np
import paddle
import paddle.nn as nn
import paddlenlp.layers.crf  as crf
from score import SeqEntityScore
from evaluating import Metrics

def load_data(path):
    # Load one split of the dataset
    def load_dataset(mode="train"):
        assert mode in ["train", "dev", "test"]
        data_path = os.path.join(path, mode+".json")
        examples = []
        with open(data_path, "r", encoding="utf-8") as f:
            for idx, line in enumerate(f):
                # Holds the processed fields for this line's example
                example = {}
                line = json.loads(line.strip())
                text = line["text"]
                tag_entities = line.get("label", None)
                words = list(text)
                # Every character starts with the outside tag "O" (the letter O, not the digit 0)
                tags = ["O"] * len(words)
                if tag_entities is not None:
                    for tag_name, tag_value in tag_entities.items():
                        for entity_name, entity_index in tag_value.items():
                            for start_index, end_index in entity_index:
                                assert "".join(words[start_index:end_index+1]) == entity_name
                                if start_index == end_index:
                                    tags[start_index] = "S-" + tag_name
                                else:
                                    tags[start_index] = "B-" + tag_name
                                    tags[start_index + 1:end_index + 1] = ["I-" + tag_name] * (len(entity_name) - 1)
                example["text"] = " ".join(words)
                example["tag"] = " ".join(tags)
                examples.append(example)

        return examples
    
    # Load a dictionary file
    def load_dict(dict_name):
        assert dict_name in ["tag", "vocab"]
        data_path = os.path.join(path, dict_name+".dict")
        data_dict = {}
        with open(data_path, "r", encoding="utf-8") as f:
            lines = [item.strip().split("\t") for item in f.readlines()]
            data_dict = dict([(item[1], int(item[0])) for item in lines])

        return data_dict 

    train_data = load_dataset(mode="train")
    test_data = load_dataset(mode="dev")

    word2id = load_dict(dict_name="vocab")
    tag2id = load_dict(dict_name="tag")

    return train_data, test_data, word2id, tag2id   

In [3]:
def conver_word_to_id(data, word2id, tag2id):
    examples = []
    for example in data:
        text = example['text']
        tokens = [word2id.get(w, word2id["[UNK]"]) for w in text.split(" ")]
        text_real_len = len(tokens)
        tag = example['tag']
        tag_ids = [tag2id.get(t, tag2id["O"]) for t in tag.split(" ")]
        examples.append((tokens, tag_ids, text_real_len))
    return examples

In [4]:
class DatasetLoader(object):
    def __init__(self, data, batch_size, shuffle=False, sort=True, drop_last=False):
        self.examples = data
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.sort = sort
        self.drop_last = drop_last
        self.split_data()
    
    # Split the data into mini-batches
    def split_data(self):
        # Sort the examples by length, longest first
        if self.sort:
            self.examples = sorted(self.examples, key=lambda x: x[2], reverse=True)
        if self.shuffle:
            indices = list(range(len(self.examples)))
            random.shuffle(indices)
            self.examples = [self.examples[i] for i in indices]
        self.batch_data = [self.examples[i:i + self.batch_size] for i in range(0, len(self.examples), self.batch_size)]
        # Use the loader's own batch size rather than a global variable
        if self.drop_last and len(self.batch_data[-1]) < self.batch_size:
            self.batch_data = self.batch_data[:-1]

    def padding_sequence(self, batch, batch_size, mask=None):
        # Pad every example in the batch to the length of the longest one
        tokens_list, tags_list, lens = batch
        max_len = max(lens)
        batch_size = len(lens)
        batch_token = paddle.full(shape=[batch_size, max_len], fill_value=0, dtype="int64")
        batch_tag = paddle.full(shape=[batch_size, max_len], fill_value=0, dtype="int64")
        batch_mask = paddle.full(shape=[batch_size, max_len], fill_value=0, dtype="int64")
        for i in range(batch_size):
            batch_token[i, :lens[i]] = paddle.to_tensor(tokens_list[i], dtype="int64")
            batch_tag[i, :lens[i]] = paddle.to_tensor(tags_list[i], dtype="int64")

            if mask:
                batch_mask[i, :lens[i]] = paddle.to_tensor([1] * lens[i], dtype="int64") 
        if mask:
            return batch_token, batch_tag, batch_mask

        return batch_token, batch_tag

    def __len__(self):
        return len(self.batch_data)

    def __getitem__(self, index):
        # Each call returns one batch
        # Each example in the batch has three parts: text, tag, len
        batch = self.batch_data[index]
        batch_size = len(batch)
        # Transpose the batch into three lists: text, tag, len
        batch = list(zip(*batch))
        batch_ids, batch_tags, batch_mask = self.padding_sequence(batch, batch_size, mask=True)
        batch_lens = paddle.to_tensor(batch[2])

        return (batch_ids, batch_tags, batch_mask, batch_lens)

In [5]:
class NERModel(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_size, hidden_size, label2id, n_layers=2, drop_p=0.1):
        super(NERModel, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size
        self.label2id = label2id

        self.embedding = paddle.nn.Embedding(vocab_size, embedding_size)
        self.bilstm = paddle.nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, direction="bidirectional", num_layers=n_layers, dropout=drop_p)
        self.layer_norm = paddle.nn.LayerNorm(hidden_size*2)
        self.dropout_emb = paddle.nn.Dropout(p=drop_p)
        # The CRF implementation introduces two auxiliary tags, <START> and <STOP>
        self.classifier = paddle.nn.Linear(hidden_size*2, len(label2id)+2) # add START and STOP tag

        # Pass the number of labels to the CRF; note that this count excludes the START and STOP tags
        self.crf = crf.LinearChainCrf(len(label2id), crf_lr=0.001, with_start_stop_tag=True)
        self.crf_loss = crf.LinearChainCrfLoss(self.crf)
        self.viterbi_decoder = crf.ViterbiDecoder(self.crf.transitions)


    def forward(self, input_ids, input_mask):
        # This forward pass returns the emission scores computed from the last BiLSTM layer's hidden states
        # input_ids: [batch_size, seq_len]
        # embs: [batch_size, seq_len, embedding_size]
        embs = self.embedding(input_ids)
        embs = self.dropout_emb(embs)
        embs = embs * paddle.to_tensor(input_mask, dtype="float32").unsqueeze(2)
        # last_layer_hiddens: [batch_size, seq_len, hidden_size*2]
        last_layer_hiddens, _ = self.bilstm(embs)
        last_layer_hiddens = self.layer_norm(last_layer_hiddens)
        # features: [batch_size, seq_len, n_labels+2]
        features = self.classifier(last_layer_hiddens)

        return features

    def forward_loss(self, input_ids, input_mask, input_lens, input_tags=None):
        # input_ids: [batch_size, seq_len]
        # input_mask: [batch_size, seq_len]
        # input_lens: [batch_size] Tensor
        # input_tags: [batch_size, seq_len]

        # features: [batch_size, seq_len, n_labels+2]
        features = self.forward(input_ids, input_mask)

        if input_tags is not None:
            return features, self.crf_loss(features, input_lens, "", input_tags)
        return features

In [6]:
def addTags_2(tag2id):  
    tag2id['<Start>'] = len(tag2id)
    tag2id['<End>'] = len(tag2id)
    return tag2id
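
A minimal sketch of the tag bookkeeping around `addTags_2`, using a toy three-tag dictionary rather than the real CLUE tag set. It only illustrates that the helper modifies the dictionary in place and appends the two special tags at the end, so that after inverting the mapping the two highest ids (31 and 32 for the 31 CLUE tags) are valid keys. The toy tags are made up for the example.

```python
def addTags_2(tag2id):
    # Same helper as above: append the two special tags at the end.
    tag2id['<Start>'] = len(tag2id)
    tag2id['<End>'] = len(tag2id)
    return tag2id


# Toy tag set (illustrative only; the real one has 31 entries).
toy_tag2id = {'O': 0, 'B-name': 1, 'I-name': 2}

returned = addTags_2(toy_tag2id)
toy_id2tag = {idx: tag for tag, idx in toy_tag2id.items()}

print(returned is toy_tag2id)   # the dict is modified in place
print(toy_id2tag)
```

Because the dictionary is modified in place, anything built from `tag2id` afterwards sees the enlarged size. `NERModel`, for instance, adds two more outputs on top of `len(label2id)`, so the order of these calls matters.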

In [7]:
# Load the dataset
root_path = "./ner_clue_data/"
train_data, test_data, word2id, tag2id = load_data(root_path)
train_set = conver_word_to_id(train_data, word2id, tag2id)
test_set = conver_word_to_id(test_data, word2id, tag2id)

addTags_2(tag2id)

# Hyperparameters
n_epochs = 50
batch_size = 32
vocab_size = len(word2id.keys())
embedding_size = 128
hidden_size = 384
n_layers = 2
dropout_rate = 0.1
learning_rate = 0.001

# Build the data loaders so that data can be fetched batch by batch
train_loader = DatasetLoader(train_set, batch_size, shuffle=False, sort=True)
test_loader = DatasetLoader(test_set, batch_size, shuffle=False, sort=False)

# Use the GPU when one is available
use_gpu = paddle.get_device().startswith("gpu")
if use_gpu:
    paddle.set_device('gpu:0')

# Instantiate the model
ner_model = NERModel(vocab_size=vocab_size, embedding_size=embedding_size,
                     hidden_size=hidden_size, label2id=tag2id, n_layers=n_layers, drop_p=dropout_rate)

# Set up the optimizer
optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                  parameters=ner_model.parameters())

# Invert tag2id to obtain the id2tag dictionary
id2tag = {idx: tag for tag, idx in tag2id.items()}

W1118 22:18:20.714541   143 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W1118 22:18:20.760452   143 device_context.cc:465] device: 0, cuDNN Version: 7.6.


In [9]:
id2tag

{0: 'O',
 1: 'B-address',
 2: 'B-book',
 3: 'B-company',
 4: 'B-game',
 5: 'B-government',
 6: 'B-movie',
 7: 'B-name',
 8: 'B-organization',
 9: 'B-position',
 10: 'B-scene',
 11: 'I-address',
 12: 'I-book',
 13: 'I-company',
 14: 'I-game',
 15: 'I-government',
 16: 'I-movie',
 17: 'I-name',
 18: 'I-organization',
 19: 'I-position',
 20: 'I-scene',
 21: 'S-address',
 22: 'S-book',
 23: 'S-company',
 24: 'S-game',
 25: 'S-government',
 26: 'S-movie',
 27: 'S-name',
 28: 'S-organization',
 29: 'S-position',
 30: 'S-scene',
 31: '<Start>',
 32: '<End>'}

In [10]:
 

# Model evaluation
def evaluate(model, test_loader):
    # Metric object that accumulates entity-level statistics
    metric = SeqEntityScore(id2tag)
    model.eval()

    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch

            # Forward pass to obtain the emission scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Decode the best tag paths from the emission scores with the CRF
            scores, pred_paths = model.viterbi_decoder(features, batch_lens)

            # Map the predicted tag ids to their tag strings via id2tag
            pred_paths = [[id2tag[int(tag_id)] for tag_id in tag_seq] for tag_seq in pred_paths]

            # Truncate the gold tag sequences to the true length of each sentence
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, batch_lens)]

            # Update the metric statistics
            metric.update(pred_paths=pred_paths, real_paths=real_paths)

    # Compute the final precision, recall and F1 from the accumulated statistics
    result = metric.get_result()
    metric.format_print(result)

    return result


# Model training
def train(model, train_loader):

    for epoch in range(1, 1 + n_epochs):
        model.train()
        print(f"Epoch {epoch}/{n_epochs}")
        for step, batch in enumerate(train_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch
            # Run the forward pass and compute the loss
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)
            loss = paddle.mean(loss)

            # Backpropagate and update the parameters
            loss.backward()
            optimizer.step()
            optimizer.clear_gradients()

            # Log progress during training
            if step % 20 == 0:
                print(f"epoch: {epoch}, step: {step}, loss: {loss.numpy()[0]}")

        # Evaluate on the test loader after each epoch
        evaluate(model, test_loader)


train(ner_model, train_loader)
# Base name for the saved files
model_name = "ReRunNer_model"
# Save the model and optimizer state
paddle.save(ner_model.state_dict(), "{}.pdparams".format(model_name))
paddle.save(optimizer.state_dict(), "{}.optparams".format(model_name))

Epoch 1/50
epoch: 1, step: 0, loss: 254.2734832763672
epoch: 1, step: 20, loss: 66.33065795898438
epoch: 1, step: 40, loss: 63.17569351196289
epoch: 1, step: 60, loss: 42.73284149169922
epoch: 1, step: 80, loss: 37.45672607421875
epoch: 1, step: 100, loss: 31.633586883544922
epoch: 1, step: 120, loss: 35.52635192871094
epoch: 1, step: 140, loss: 33.965660095214844
epoch: 1, step: 160, loss: 25.99199676513672
epoch: 1, step: 180, loss: 24.506038665771484
epoch: 1, step: 200, loss: 21.9235782623291
epoch: 1, step: 220, loss: 27.952251434326172
epoch: 1, step: 240, loss: 21.93344497680664
epoch: 1, step: 260, loss: 14.469120979309082
epoch: 1, step: 280, loss: 13.816789627075195
epoch: 1, step: 300, loss: 16.045164108276367
epoch: 1, step: 320, loss: 8.633285522460938
Total: Precision: 0.3315 - Recall: 0.3831 - F1: 0.3554
Epoch 2/50
epoch: 2, step: 0, loss: 32.469356536865234
epoch: 2, step: 20, loss: 22.231422424316406
epoch: 2, step: 40, loss: 24.642513275146484
epoch: 2, step: 60, loss: 21.22760009765625
epoch: 2, step: 80, loss: 17.25895
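
The `Total: Precision ... Recall ... F1` line printed after each epoch is an entity-level score produced by `SeqEntityScore`, whose source is not shown in this section. As a rough illustration of what such a score measures, the sketch below extracts `(type, start, end)` spans from B-/I-/S- tag sequences and counts exact matches. The function names and the handling of malformed sequences are assumptions made for the example, not the behaviour of `score.py`.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a B-/I-/S- tag sequence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'I' and etype == label:
            continue                      # still inside the open entity
        if etype is not None:             # any other tag closes the open entity
            entities.add((etype, start, i - 1))
            start, etype = None, None
        if prefix == 'S':
            entities.add((label, i, i))   # single-character entity
        elif prefix == 'B':
            start, etype = i, label       # open a new entity
        # 'O' and stray 'I-' tags open nothing
    if etype is not None:
        entities.add((etype, start, len(tags) - 1))
    return entities


def entity_prf(real_paths, pred_paths):
    """Micro-averaged precision, recall and F1 over exact span matches."""
    n_real = n_pred = n_hit = 0
    for real, pred in zip(real_paths, pred_paths):
        real_set, pred_set = extract_entities(real), extract_entities(pred)
        n_real += len(real_set)
        n_pred += len(pred_set)
        n_hit += len(real_set & pred_set)
    p = n_hit / n_pred if n_pred else 0.0
    r = n_hit / n_real if n_real else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


real = [['B-name', 'I-name', 'O', 'S-game', 'O']]
pred = [['B-name', 'I-name', 'O', 'O', 'B-book']]
print(entity_prf(real, pred))
```

In this toy case one of the two predicted spans matches one of the two gold spans, so precision, recall and F1 are all 0.5. A per-type breakdown like the ones reported in this notebook would apply the same counting to the spans of each entity type separately.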

In [105]:
def load():
    # load
    RERUNNER_layer_state_dict = paddle.load("ReRunNer_model.pdparams")
    RERUNNER_opt_state_dict = paddle.load("ReRunNer_model.optparams")

    # Instantiate the model
    RERUNNER__model = NERModel(vocab_size=vocab_size, embedding_size=embedding_size,
                        hidden_size=hidden_size,label2id=tag2id, n_layers=n_layers, drop_p=dropout_rate)

    # Set up the optimizer
    optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                    parameters=RERUNNER__model.parameters())
                                    
    RERUNNER__model.set_state_dict(RERUNNER_layer_state_dict)
    optimizer.set_state_dict(RERUNNER_opt_state_dict)
    return RERUNNER__model

## BiLSTMCRF

In [12]:
# Model evaluation
import score
import importlib
importlib.reload(score)
from score import SingleClassificationScore


def evaluate_detail_char_entity(model, test_loader):
    # Metric objects: entity-level and character-level
    metric = SeqEntityScore(id2tag)
    char_metric = SingleClassificationScore(id2tag)
    model.eval()

    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch

            # Forward pass to obtain the emission scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Decode the best tag paths from the emission scores with the CRF
            scores, pred_paths = model.viterbi_decoder(features, batch_lens)

            # Map the predicted tag ids to their tag strings via id2tag
            pred_paths = [[id2tag[int(tag_id)] for tag_id in tag_seq] for tag_seq in pred_paths]

            # Truncate the gold tag sequences to the true length of each sentence
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, batch_lens)]

            # Update the metric statistics
            # print([y for x in pred_paths for y in x ])
            assert len(pred_paths) == len(real_paths)
            metric.update(pred_paths=pred_paths, real_paths=real_paths)
            char_metric.update(pred_labels=[y for x in pred_paths for y in x], real_labels=[y for x in real_paths for y in x])

    # Compute the final precision, recall and F1 from the accumulated statistics
    result = metric.get_result()
    metric.format_print(result)
    metric.report_scores(result)

    result_2 = char_metric.get_result()
    char_metric.report_scores(result_2)

    return result

evaluate_detail_char_entity(RERUNNER__model, test_loader)

Total: Precision: 0.6346 - Recall: 0.6491 - F1: 0.6418
                 precision    recall  f1-score   support
           name     0.7085    0.6796    0.6937       465
        address     0.4297    0.4424    0.4359       373
   organization     0.6548    0.7493    0.6989       367
           game     0.7021    0.8068    0.7508       295
          scene     0.4922    0.4545    0.4726       209
           book     0.6410    0.6494    0.6452       154
        company     0.6136    0.6429    0.6279       378
       position     0.7210    0.6744    0.6969       433
     government     0.6552    0.6923    0.6732       247
          movie     0.6972    0.6556    0.6758       151
      avg/total     0.6346    0.6491    0.6418      3072
                 precision    recall  f1-score   support
         B-name     0.0381    0.0366    0.0373       465
         I-name     0.0446    0.0421    0.0433      1021
              O     0.5530    0.8032    0.6550     36747
      B-address     0.0130    0.0

{'name': {'Precision': 0.7085, 'Recall': 0.6796, 'F1': 0.6937, 'support': 465},
 'address': {'Precision': 0.4297,
  'Recall': 0.4424,
  'F1': 0.4359,
  'support': 373},
 'organization': {'Precision': 0.6548,
  'Recall': 0.7493,
  'F1': 0.6989,
  'support': 367},
 'game': {'Precision': 0.7021, 'Recall': 0.8068, 'F1': 0.7508, 'support': 295},
 'scene': {'Precision': 0.4922,
  'Recall': 0.4545,
  'F1': 0.4726,
  'support': 209},
 'book': {'Precision': 0.641, 'Recall': 0.6494, 'F1': 0.6452, 'support': 154},
 'company': {'Precision': 0.6136,
  'Recall': 0.6429,
  'F1': 0.6279,
  'support': 378},
 'position': {'Precision': 0.721,
  'Recall': 0.6744,
  'F1': 0.6969,
  'support': 433},
 'government': {'Precision': 0.6552,
  'Recall': 0.6923,
  'F1': 0.6732,
  'support': 247},
 'movie': {'Precision': 0.6972,
  'Recall': 0.6556,
  'F1': 0.6758,
  'support': 151},
 'Total': {'Precision': 0.6346,
  'Recall': 0.6491,
  'F1': 0.6418,
  'support': 3072}}
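
As a quick consistency check on the numbers above, F1 is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. The sketch below recomputes it from the reported `Total` precision and recall. Those inputs are already rounded to four decimals, so agreement is only expected up to rounding; the helper name `f1_score` is introduced here for illustration.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 'Total' row reported above for the BiLSTM+CRF model.
total_f1 = f1_score(0.6346, 0.6491)
print(round(total_f1, 4))
```

The recomputed value rounds to 0.6418, matching the reported total F1.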

## Removing the CRF layer

In [13]:
# Model evaluation
import score
import importlib
importlib.reload(score)
from score import SingleClassificationScore


def evaluate_detail_char_entity_no_CRF(model, test_loader):
    # Metric objects: entity-level and character-level
    metric = SeqEntityScore(id2tag)
    char_metric = SingleClassificationScore(id2tag)
    model.eval()

    with paddle.no_grad():
        for step, batch in enumerate(test_loader):
            # Unpack the batch
            batch_ids, batch_tags, batch_mask, batch_lens = batch

            # Forward pass to obtain the emission scores
            features, loss = model.forward_loss(batch_ids, batch_mask, batch_lens, batch_tags)

            # Skip CRF decoding and take the highest-scoring tag at each position instead
            # scores, pred_paths = model.viterbi_decoder(features, batch_lens)
            pred_paths = paddle.argmax(features, axis=-1)

            # Map the predicted tag ids to their tag strings via id2tag
            pred_paths = [[id2tag[int(tag_id)] for tag_id in tag_seq] for tag_seq in pred_paths]

            # Truncate the gold tag sequences to the true length of each sentence
            batch_tags = batch_tags.numpy().tolist()
            real_paths = [tag_seq[:tag_len] for tag_seq, tag_len in zip(batch_tags, batch_lens)]

            # Update the metric statistics
            # print([y for x in pred_paths for y in x ])
            assert len(pred_paths) == len(real_paths)
            metric.update(pred_paths=pred_paths, real_paths=real_paths)
            char_metric.update(pred_labels=[y for x in pred_paths for y in x], real_labels=[y for x in real_paths for y in x])

    # Compute the final precision, recall and F1 from the accumulated statistics
    result = metric.get_result()
    metric.format_print(result)
    metric.report_scores(result)

    result_2 = char_metric.get_result()
    char_metric.report_scores(result_2)

    return result

evaluate_detail_char_entity_no_CRF(RERUNNER__model, test_loader)

Total: Precision: 0.5004 - Recall: 0.6374 - F1: 0.5606
                 precision    recall  f1-score   support
           name     0.5957    0.6495    0.6214       465
        address     0.2548    0.4236    0.3182       373
   organization     0.6247    0.7439    0.6791       367
           game     0.6676    0.7898    0.7236       295
          scene     0.3186    0.4498    0.3730       209
           book     0.4692    0.6429    0.5425       154
        company     0.3552    0.6296    0.4542       378
       position     0.7132    0.6721    0.6920       433
     government     0.6429    0.6923    0.6667       247
          movie     0.6600    0.6556    0.6578       151
      avg/total     0.5004    0.6374    0.5606      3072
                 precision    recall  f1-score   support
         B-name     0.0316    0.0344    0.0329       465
         I-name     0.0240    0.0852    0.0374      1021
              O     0.5565    0.6421    0.5962     36747
      B-address     0.0081    0.0

{'name': {'Precision': 0.5957, 'Recall': 0.6495, 'F1': 0.6214, 'support': 465},
 'address': {'Precision': 0.2548,
  'Recall': 0.4236,
  'F1': 0.3182,
  'support': 373},
 'organization': {'Precision': 0.6247,
  'Recall': 0.7439,
  'F1': 0.6791,
  'support': 367},
 'game': {'Precision': 0.6676, 'Recall': 0.7898, 'F1': 0.7236, 'support': 295},
 'scene': {'Precision': 0.3186, 'Recall': 0.4498, 'F1': 0.373, 'support': 209},
 'book': {'Precision': 0.4692, 'Recall': 0.6429, 'F1': 0.5425, 'support': 154},
 'company': {'Precision': 0.3552,
  'Recall': 0.6296,
  'F1': 0.4542,
  'support': 378},
 'position': {'Precision': 0.7132,
  'Recall': 0.6721,
  'F1': 0.692,
  'support': 433},
 'government': {'Precision': 0.6429,
  'Recall': 0.6923,
  'F1': 0.6667,
  'support': 247},
 'movie': {'Precision': 0.66, 'Recall': 0.6556, 'F1': 0.6578, 'support': 151},
 'Total': {'Precision': 0.5004,
  'Recall': 0.6374,
  'F1': 0.5606,
  'support': 3072}}
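
When comparing the two result sets, one detail of the evaluation cells is worth keeping in mind. The gold labels are truncated to each sentence's true length, while the predicted sequences appear to be used at the padded batch length. After flattening, the two lists handed to the character-level metric can then differ in length and drift out of alignment, and in the argmax variant padded positions may also contribute predicted tags. The sketch below shows the effect on toy data and one possible fix: truncating the predictions with the same lengths. The data and the helper names are made up, and whether `SingleClassificationScore` already compensates for this is not visible here.

```python
def truncate_paths(paths, lengths):
    # Keep only the first `length` positions of each sequence.
    return [list(seq[:int(n)]) for seq, n in zip(paths, lengths)]


def flatten(paths):
    # Concatenate all sequences into one flat list of tags.
    return [tag for seq in paths for tag in seq]


# Toy batch padded to length 4; the true lengths are 2 and 4.
lengths = [2, 4]
real_padded = [['B-name', 'I-name', 'O', 'O'],
               ['O', 'S-game', 'O', 'O']]
pred_padded = [['B-name', 'I-name', 'S-book', 'O'],   # 'S-book' sits on padding
               ['O', 'S-game', 'O', 'O']]

real = truncate_paths(real_padded, lengths)

# Without truncating the predictions, the flattened lists differ in length.
print(len(flatten(real)), len(flatten(pred_padded)))

# Truncating the predictions with the same lengths restores alignment.
pred = truncate_paths(pred_padded, lengths)
print(flatten(pred) == flatten(real))
```

In the toy batch the spurious `S-book` on a padded position disappears once the predictions are truncated, and the flattened prediction and gold lists line up position by position.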

In [106]:
RERUNNER__model = load()
transitionArray = RERUNNER__model.crf.transitions

In [107]:
transitionArray

Parameter containing:
Tensor(shape=[33, 33], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[-0.08526525, -0.03793711,  0.02859537, ...,  0.24255468,
          0.28833938, -0.25289056],
        [ 0.03477779, -0.29694635,  0.13645241, ..., -0.05103235,
          0.24255556, -0.01399392],
        [ 0.03728428, -0.25427681,  0.13968380, ...,  0.26362556,
          0.16084141, -0.20982440],
        ...,
        [ 0.23981276,  0.19680473,  0.27714926, ...,  0.19684142,
          0.19561511,  0.09494364],
        [-0.10466477,  0.16904950,  0.21952766, ..., -0.17706488,
         -0.11380799, -0.17421100],
        [-0.27031475, -0.07000922,  0.22302569, ..., -0.15513097,
          0.30032444, -0.01853615]])

In [109]:
print(transitionArray[-1,:])

Tensor(shape=[33], dtype=float32, place=CPUPlace, stop_gradient=False,
       [-0.27031475, -0.07000922,  0.22302569, -0.11706991, -0.25752637,
        -0.02069395, -0.11952141, -0.20108281, -0.28450519, -0.12538649,
         0.24515440, -0.25085741, -0.09736122,  0.03068645,  0.14794850,
         0.23071198,  0.14207439, -0.30041534,  0.13280988,  0.11484459,
         0.17873847,  0.30134013, -0.17815727,  0.17397724,  0.08293276,
         0.07850233, -0.28681079, -0.15948965, -0.24004097, -0.23333408,
        -0.15513097,  0.30032444, -0.01853615])


In [96]:
# In transitions, index [-1] is start and index [-2] is stop
# Assuming x[i, j] denotes the move j -> i, setting transitions[-1, :] = -1000 rules out any move into start
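
A minimal sketch of the masking idea in the comment above, on a toy problem with two real tags plus stop and start in the last two positions. It follows the same assumption as the comment, namely that `trans[i][j]` scores the move `j -> i`, so the last row holds the moves into start. If the library stores the matrix as `[from, to]` instead, the last column would have to be masked; this should be checked against the `crf` source. The Viterbi routine is a plain-Python stand-in written for the example, not `crf.ViterbiDecoder`.

```python
NEG = -1000.0


def viterbi(emissions, trans):
    """Best tag path; trans[i][j] scores the move j -> i (assumed convention)."""
    n_tags = len(trans)
    score = list(emissions[0])
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for i in range(n_tags):                       # tag at this position
            prev = max(range(n_tags), key=lambda j: score[j] + trans[i][j])
            pointers.append(prev)
            new_score.append(score[prev] + trans[i][prev] + emit[i])
        score = new_score
        back.append(pointers)
    best = max(range(n_tags), key=lambda i: score[i])
    path = [best]
    for pointers in reversed(back):                   # follow the back-pointers
        path.append(pointers[path[-1]])
    return path[::-1]


# Tags: 0 = 'O', 1 = 'B-name', 2 = stop, 3 = start (start is index -1).
trans = [[0.0] * 4 for _ in range(4)]
emissions = [[0.1, 0.2, 0.0, 0.0],
             [0.3, 0.1, 0.0, 0.9],    # the emission alone would pick start here
             [0.2, 0.4, 0.0, 0.0]]

print(viterbi(emissions, trans))      # unmasked: start can appear mid-sequence

trans[-1] = [NEG] * 4                 # forbid every move into start
print(viterbi(emissions, trans))
```

With the unmasked matrix the best path passes through the start tag at the second position; after masking, the path avoids it. The mask only affects transitions, so a start tag at the very first position would have to be excluded separately.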

In [146]:
id2tag

{0: 'O',
 1: 'B-address',
 2: 'B-book',
 3: 'B-company',
 4: 'B-game',
 5: 'B-government',
 6: 'B-movie',
 7: 'B-name',
 8: 'B-organization',
 9: 'B-position',
 10: 'B-scene',
 11: 'I-address',
 12: 'I-book',
 13: 'I-company',
 14: 'I-game',
 15: 'I-government',
 16: 'I-movie',
 17: 'I-name',
 18: 'I-organization',
 19: 'I-position',
 20: 'I-scene',
 21: 'S-address',
 22: 'S-book',
 23: 'S-company',
 24: 'S-game',
 25: 'S-government',
 26: 'S-movie',
 27: 'S-name',
 28: 'S-organization',
 29: 'S-position',
 30: 'S-scene'}

In [150]:
# Modify the loaded parameters and test
def load():
    # load
    RERUNNER_layer_state_dict = paddle.load("ReRunNer_model.pdparams")
    RERUNNER_opt_state_dict = paddle.load("ReRunNer_model.optparams")

    # Instantiate the model
    RERUNNER__model = NERModel(vocab_size=vocab_size, embedding_size=embedding_size,
                        hidden_size=hidden_size,label2id=tag2id, n_layers=n_layers, drop_p=dropout_rate)

    # Set up the optimizer
    optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, beta1=0.9, beta2=0.99,
                                    parameters=RERUNNER__model.parameters())
    
    transitions = RERUNNER_layer_state_dict['viterbi_decoder.transitions'].detach()

    ###
    print(transitions)
    # transitions[-1,:] = -100
    # transitions[-1,:] = -10000  # probability of moving into start (-1) becomes 0
    # transitions[:,-2] = -10000  # probability of moving out of stop (-2) becomes 0

    # transitions[12:20,1 ] = -10000
    transitions = transitions * 1000

    RERUNNER_layer_state_dict['viterbi_decoder.transitions'] = transitions
    print(RERUNNER_layer_state_dict['viterbi_decoder.transitions'])
    ###

    RERUNNER__model.set_state_dict(RERUNNER_layer_state_dict)
    optimizer.set_state_dict(RERUNNER_opt_state_dict)
    return RERUNNER__model
RERUNNER__model = load()


Tensor(shape=[33, 33], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[-0.08526525, -0.03793711,  0.02859537, ...,  0.24255468,
          0.28833938, -0.25289056],
        [ 0.03477779, -0.29694635,  0.13645241, ..., -0.05103235,
          0.24255556, -0.01399392],
        [ 0.03728428, -0.25427681,  0.13968380, ...,  0.26362556,
          0.16084141, -0.20982440],
        ...,
        [ 0.23981276,  0.19680473,  0.27714926, ...,  0.19684142,
          0.19561511,  0.09494364],
        [-0.10466477,  0.16904950,  0.21952766, ..., -0.17706488,
         -0.11380799, -0.17421100],
        [-0.27031475, -0.07000922,  0.22302569, ..., -0.15513097,
          0.30032444, -0.01853615]])
Tensor(shape=[33, 33], dtype=float32, place=CPUPlace, stop_gradient=True,
       [[-85.26525116 , -37.93711090 ,  28.59536743 , ...,
          242.55467224,  288.33938599, -252.89056396],
        [ 34.77779388 , -296.94635010,  136.45240784, ...,
         -51.03234482 ,  242.55555725, -13.99392033 ]

In [151]:
evaluate_detail_char_entity(RERUNNER__model, test_loader)



KeyError: 31
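
The `KeyError: 31` is consistent with the decoder returning tag id 31, presumably one of the two special CRF tags, while the 31-entry `id2tag` shown in In [146] only covers ids 0 to 30. Below is a hedged sketch of two ways to make the lookup robust: falling back to `'O'` for unknown ids, or extending the mapping with the special tags. The toy mapping and the helper name `ids_to_tags` are assumptions for illustration.

```python
def ids_to_tags(path, id2tag, default='O'):
    # Map tag ids to tag strings, falling back to `default` for unknown ids.
    return [id2tag.get(int(tag_id), default) for tag_id in path]


# Toy mapping standing in for the 31-entry id2tag (ids 0..2 here).
id2tag = {0: 'O', 1: 'B-name', 2: 'I-name'}
path = [1, 2, 3, 0]          # id 3 plays the role of the unexpected id 31

try:
    [id2tag[i] for i in path]
except KeyError as err:
    print('KeyError:', err)

# Option 1: fall back to 'O' for ids outside the mapping.
print(ids_to_tags(path, id2tag))

# Option 2: extend the mapping with the special tags, as addTags_2 does for tag2id.
id2tag_ext = dict(id2tag)
id2tag_ext[len(id2tag_ext)] = '<Start>'
id2tag_ext[len(id2tag_ext)] = '<End>'
print(ids_to_tags(path, id2tag_ext))
```

Either option only removes the exception. A decoded path that contains a special tag in the middle of a sentence may indicate that the transition scores, multiplied by 1000 in the cell above, now dominate the emission scores.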