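# dh_msra Description
0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/dh_msra/dh_msra.zip)
1. **Overview:** more than 50,000 annotated entries for Chinese named entity recognition ([IOB2](https://dl.acm.org/citation.cfm?id=977059) format, following the [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/) and [CRF++](https://taku910.github.io/crfpp/#format) conventions)
2. **Recommended experiment:** Chinese named entity recognition
3. **Data source:** unknown
4. **Original dataset:** [zh-NER-TF](https://github.com/Determined22/zh-NER-TF); collected online, with the specific author and source unknown; possibly derived from the MSRA corpus
5. **Processing:**
    1. Merged the 2 original files (train and test) into 1 file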
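
To make the IOB2 convention concrete: each line of the file holds one token and its label separated by whitespace, a blank line separates sentences, the first token of an entity is tagged `B-TYPE`, the following tokens of the same entity `I-TYPE`, and all other tokens `O`. The sketch below checks a label sequence against that rule. The helper name `is_valid_iob2` and the sample sentence are illustrative assumptions and are not taken from the dataset or its tooling.

```python
def is_valid_iob2(labels):
    """Return True if `labels` follows the IOB2 rule.

    Hypothetical helper for illustration: an I-X label is only valid
    directly after B-X or I-X of the same entity type X.
    """
    prev = 'O'
    for label in labels:
        if label != 'O' and label[:2] not in ('B-', 'I-'):
            return False  # label is neither O, B-X nor I-X
        if label.startswith('I-'):
            # The previous label must belong to the same entity type
            if prev == 'O' or prev[2:] != label[2:]:
                return False
        prev = label
    return True


# Made-up sample, not taken from the corpus
sample_tokens = ['王', '明', '在', '北', '京']
sample_labels = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']

print(is_valid_iob2(sample_labels))
print(is_valid_iob2(['O', 'I-LOC']))
```

A check like this can be run over every sentence after loading to spot annotation glitches before training.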

In [1]:
import codecs
import random

import numpy as np

In [2]:
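path = 'path/to/dh_msra/'  # placeholder: folder containing dh_msra.txt; keep the trailing slash, the file name is appended with +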

# 1. dh_msra.txt

## Loading the data

In [3]:
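def load_iob2(file_path):
    '''Load IOB2 data: one "token label" pair per line, blank line between sentences.'''
    token_seqs = []
    label_seqs = []
    tokens = []
    labels = []
    # Assumes the file is UTF-8 encoded; without an explicit encoding the
    # platform default is used, which may mis-decode Chinese text.
    with codecs.open(file_path, 'r', encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            items = line.strip().split()
            if len(items) == 2:
                token, label = items
                tokens.append(token)
                labels.append(label)
            elif len(items) == 0:
                if tokens:
                    token_seqs.append(tokens)
                    label_seqs.append(labels)
                    tokens = []
                    labels = []
            else:
                print('Format error. Line: {} Content: {}'.format(line_no, line))

    if tokens:  # no trailing blank line: append the last sentence manually
        token_seqs.append(tokens)
        label_seqs.append(labels)

    # Sentences differ in length, so the arrays hold Python lists (dtype=object)
    return np.array(token_seqs, dtype=object), np.array(label_seqs, dtype=object)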


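def show_iob2(token_seqs, label_seqs, num=5, shuffle=True):
    '''Print `num` sentences as token/label pairs (random picks if shuffle, else the first `num`).'''
    if shuffle:
        length = len(token_seqs)
        # Sampled with replacement, so the same sentence may appear twice
        indexes = [random.randrange(0, length) for _ in range(num)]
        # Indexing with a list requires NumPy arrays, as returned by load_iob2
        zip_seqs = zip(token_seqs[indexes], label_seqs[indexes])
    else:
        zip_seqs = zip(token_seqs[0:num], label_seqs[0:num])

    for tokens, labels in zip_seqs:
        for token, label in zip(tokens, labels):
            print('{}/{} '.format(token, label), end='')
        print('\n')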

In [None]:
token_seqs, label_seqs = load_iob2(path+'dh_msra.txt')

print(len(token_seqs), len(label_seqs))
print()    
show_iob2(token_seqs, label_seqs)
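
Because the original train and test files were merged into a single file, an experiment needs its own split. The following is a minimal sketch, assuming a simple random hold-out is acceptable. The function name `split_dataset`, the default test ratio, and the seed are illustrative choices; the result does not reproduce the original partition from zh-NER-TF.

```python
import random


def split_dataset(token_seqs, label_seqs, test_ratio=0.1, seed=42):
    """Randomly split parallel sequences into train and test parts.

    Hypothetical helper: it does NOT reproduce the original train/test partition.
    """
    assert len(token_seqs) == len(label_seqs)
    indexes = list(range(len(token_seqs)))
    random.Random(seed).shuffle(indexes)  # local RNG keeps the split reproducible
    n_test = int(len(indexes) * test_ratio)
    test_idx, train_idx = indexes[:n_test], indexes[n_test:]
    train = ([token_seqs[i] for i in train_idx], [label_seqs[i] for i in train_idx])
    test = ([token_seqs[i] for i in test_idx], [label_seqs[i] for i in test_idx])
    return train, test


# Toy data standing in for the output of load_iob2
toy_tokens = [[c] for c in 'abcdefghij']
toy_labels = [['O'] for _ in range(10)]
(train_x, train_y), (test_x, test_y) = split_dataset(toy_tokens, toy_labels, test_ratio=0.2)
print(len(train_x), len(test_x))
```

Splitting on whole sentences, as done here, keeps each token sequence aligned with its label sequence.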

## Label description

| Label | Description |
| ---- | ---- |
| LOC | Location |
| ORG | Organization |
| PER | Person |
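
In an IOB2 file these entity types would be expected to appear with a `B-` or `I-` prefix, next to `O` for tokens outside any entity. This is an assumption based on the IOB2 format stated in the overview, and the cell at the end of this section lists the labels actually present. A minimal sketch of turning one labeled sentence back into entity strings follows; `extract_entities` and the sample sentence are hypothetical and not part of the dataset.

```python
def extract_entities(tokens, labels):
    """Collect (entity_text, entity_type) pairs from one IOB2 sentence.

    Hypothetical helper. A stray I-X without a matching B-X starts a new entity here.
    Tokens are joined without spaces, which suits character-level Chinese text.
    """
    entities = []
    current, current_type = [], None
    for token, label in zip(tokens, labels):
        starts_new = label.startswith('B-') or (
            label.startswith('I-') and label[2:] != current_type)
        if starts_new:
            if current:
                entities.append((''.join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith('I-'):
            current.append(token)  # continue the open entity
        else:  # 'O' closes any open entity
            if current:
                entities.append((''.join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((''.join(current), current_type))
    return entities


# Made-up sample, not taken from the corpus
sample_tokens = ['王', '明', '在', '北', '京', '大', '学']
sample_labels = ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']
print(extract_entities(sample_tokens, sample_labels))
```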

In [None]:
{label for labels in label_seqs for label in labels}
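
Beyond the set of distinct labels, label frequencies give a quick picture of class balance. The sketch below uses `collections.Counter`; the toy `toy_label_seqs` is a made-up stand-in for the `label_seqs` returned by `load_iob2`, and the `B-`/`I-` prefixes are assumed from the IOB2 format.

```python
from collections import Counter

# Toy stand-in for label_seqs; replace with the real array from load_iob2
toy_label_seqs = [
    ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC'],
    ['O', 'O', 'B-ORG', 'I-ORG'],
]

# Frequency of every label over all tokens
label_counts = Counter(label for labels in toy_label_seqs for label in labels)

# Entities per type: in IOB2 each entity starts with exactly one B- label
entity_counts = Counter(
    label[2:]
    for labels in toy_label_seqs
    for label in labels
    if label.startswith('B-')
)

print(label_counts.most_common())
print(dict(entity_counts))
```

Counting `B-` labels rather than all prefixed labels gives the number of entities, not the number of tokens inside entities.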