# Recognizing Named Entities in News Data with a CNN (Convolutional Neural Network)

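*In this tutorial we use a convolutional neural network (CNN) to solve the named entity recognition (NER) problem.*
  
*Named entity recognition is a common task in natural language processing. Its goal is to extract entities such as persons, organizations and locations from text.*
  
*Here we run experiments on recognizing named entities in the news articles of the CoNLL-2003 dataset.*  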
  
  For example, suppose we want to extract the names of persons and organizations from the following sentence:
>Yan Goodfellow works for Google Brain
  
  The NER model should produce the following sequence of tags:
>B-PER I-PER    O     O   B-ORG  I-ORG
  
  
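  There are two prefixes, plus the *O* tag:  
*B-* marks the beginning of an entity  
*I-* marks the inside of an entity  
*O* marks a token that does not belong to any entity  
Tags built with these prefixes are called BIO markup. The markup was introduced to tell apart consecutive entities of the same type.  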
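
To make the BIO convention concrete, here is a minimal sketch that groups tagged tokens back into entities. The helper name `bio_to_spans` is hypothetical and is not part of any library used in this tutorial; it only illustrates how the B- and I- prefixes delimit entities.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag of a different type, starts a new entity
        starts_new = tag.startswith('B-') or (
            tag.startswith('I-') and tag[2:] != current_type)
        if tag == 'O' or starts_new:
            # Close the entity that was being collected, if any
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != 'O':
            current_type = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


tokens = ['Yan', 'Goodfellow', 'works', 'for', 'Google', 'Brain']
tags = ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG']
print(bio_to_spans(tokens, tags))
```

Two adjacent `B-LOC` tags yield two separate locations, which is exactly the case that a plain `LOC` tag without prefixes could not express.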
  
  
  Solving this task relies on neural networks, and on **convolutional neural networks** in particular.  
  


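## Data
The cell below downloads all the data needed for this task into the `data/` folder. The library's download utility is used to fetch and extract the archive.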

In [1]:
import deeppavlov
from deeppavlov.core.data.utils import download_decompress
download_decompress('http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz', 'data/')

2018-08-20 17:28:32.227 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): files.deeppavlov.ai:80
2018-08-20 17:28:32.529 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://files.deeppavlov.ai:80 "GET /deeppavlov_data/conll2003_v2.tar.gz HTTP/1.1" 302 None
2018-08-20 17:28:32.541 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): 202.112.144.234:80
2018-08-20 17:28:32.599 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://202.112.144.234:80 "GET /files/3146000001CF3D15/lnsigo.mipt.ru/export/deeppavlov_data/conll2003_v2.tar.gz HTTP/1.1" 200 957092
2018-08-20 17:28:32.601 INFO in 'deeppavlov.core.data.utils'['utils'] at line 62: Downloading from http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz to data/conll2003_v2.tar.gz
100%|██████████| 957k/957k [00:00<00:00, 14.3MB/s]
2018-08-20 17:28:32.691 INFO in 'deeppavlov.core.data.utils'['utils'

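## Loading the CoNLL-2003 named entity recognition corpus
We work with a corpus of news articles annotated with named entity tags. A typical NER data file contains *tokens* (words or punctuation marks) and *tags*, separated by spaces. Additional information, such as POS tags, is sometimes included as well.
Documents are separated by a line that starts with **-DOCSTART-**, and sentences are separated by an empty line.
For example: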
~~~
  
  -DOCSTART- -X- -X- O

  EU NNP B-NP B-ORG  
  rejects VBZ B-VP O  
  German JJ B-NP B-MISC  
  call NN I-NP O  
  to TO B-VP O  
  boycott VB I-VP O  
  British JJ B-NP B-MISC  
  lamb NN I-NP O  
  . . O O  

  Peter NNP B-NP B-PER  
  Blackburn NNP I-NP I-PER  
  
~~~
In this tutorial we only use the tokens and the tags (the first and the last element of each line) and ignore the POS and chunk columns between them.  
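
The file format described above can be parsed with a few lines of Python. The following is only an illustrative sketch: `parse_conll` is a hypothetical helper, not the library's reader, and unlike the library it simply drops the `-DOCSTART-` lines instead of keeping them as samples.

```python
def parse_conll(lines):
    """Parse CoNLL-2003-style lines into (tokens, tags) samples.

    Only the first and the last column of each line are kept.
    """
    samples, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('-DOCSTART-'):
            # A blank line or a document marker ends the current sentence
            if tokens:
                samples.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])
        tags.append(columns[-1])
    if tokens:
        samples.append((tokens, tags))
    return samples


raw = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
for sample in parse_conll(raw.splitlines()):
    print(sample)
```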
  

We read the dataset with a Conll2003DatasetReader class. Its read() method returns a dictionary with three fields: train, valid and test. Each field stores a list of samples, and each sample is a tuple (tokens, tags) in which tokens and tags are both lists.
The example below shows the structure of this dictionary:  
~~~
{'train': [(['Mr.', 'Dwag', 'are', 'derping', 'around'], ['B-PER', 'I-PER', 'O', 'O', 'O']), ....],
 'valid': [...],
 'test': [...]}

~~~  
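The dataset is split into three parts:  
1. train: used to train the model  
2. valid: used for evaluation and hyperparameter tuning  
3. test: used for the final evaluation of the model  
The three parts are stored in three separate txt files.  
  
We use the library's Conll2003DatasetReader class to read the data, that is, to convert the text into the form described above.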


In [2]:
from deeppavlov.dataset_readers.conll2003_reader import Conll2003DatasetReader
dataset = Conll2003DatasetReader().read('data/')

We should always know what kind of data we are dealing with, so the code below prints a few samples.

In [3]:
for sample in dataset['train'][:5]:
    for token,tag in zip(*sample):
        print('%s\t%s' % (token,tag))
    print()
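    # zip() takes iterables and pairs up their corresponding elements into tuples;
    # in Python 3 it returns an iterator over those tuples.
    # If the iterables differ in length, the result is as long as the shortest one.
    # The * operator unpacks the (tokens, tags) sample into two separate arguments.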

<DOCSTART>	O

EU	B-ORG
rejects	O
German	B-MISC
call	O
to	O
boycott	O
British	B-MISC
lamb	O
.	O

Peter	B-PER
Blackburn	I-PER

BRUSSELS	B-LOC
1996-08-22	O

The	O
European	B-ORG
Commission	I-ORG
said	O
on	O
Thursday	O
it	O
disagreed	O
with	O
German	B-MISC
advice	O
to	O
consumers	O
to	O
shun	O
British	B-MISC
lamb	O
until	O
scientists	O
determine	O
whether	O
mad	O
cow	O
disease	O
can	O
be	O
transmitted	O
to	O
sheep	O
.	O



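## Preparing the vocabularies
To train a neural network we need two mappings:  
* {token}$\to${token id}: addresses the row of the embedding matrix for the current token
* {tag}$\to${tag id}: builds the one-hot ground-truth probability distribution vectors used to compute the loss at the network output.  
  
The library's SimpleVocabulary performs these mappings.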

In [4]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

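Next we prepare the vocabularies for tokens and tags. A vocabulary sometimes contains special tokens, such as an unknown-word token that is used whenever we meet a word outside the vocabulary. Here we use the special token `<UNK>` for out-of-vocabulary words.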

In [5]:
special_tokens = ['<UNK>']
token_vocab = SimpleVocabulary(special_tokens, save_path='model/token.dict')
tag_vocab = SimpleVocabulary(save_path='model/tag.dict')



Then we fit the vocabularies on the training part of the data.

In [6]:
all_tokens_by_sentences = [tokens for tokens, tags in dataset['train']]
all_tags_by_sentences = [tags for tokens, tags in dataset['train']]  # a list of tag lists

token_vocab.fit(all_tokens_by_sentences)
tag_vocab.fit(all_tags_by_sentences)

Let's try to get the indices. Keep in mind that we work with batches of the following structure:
~~~
[['utt0_tok0', 'utt0_tok1', ...], ['utt1_tok0', 'utt1_tok1', ...], ...]
~~~

In [7]:
token_vocab([['How', 'to', 'do', 'a', 'barrel', 'roll', '?']])

[[10167, 6, 168, 7, 6097, 5518, 1865]]

In [8]:
tag_vocab([['O', 'O', 'O'], ['B-ORG', 'I-ORG']])

[[0, 0, 0], [3, 5]]

In [9]:
tag_vocab([['I-ORG']])

[[5]]

Next, let's try the conversion from indices back to tokens.

In [10]:
import numpy as np
token_vocab([np.random.randint(0, 512, size=10)])
# Usage of numpy.random.randint():
# numpy.random.randint(low, high=None, size=None, dtype='l')
# low : int, lower bound of the random integers (inclusive)
# high : int, optional, upper bound (exclusive): generated values are strictly less than high
# size : int or tuple of ints, optional, output shape
# dtype : dtype, optional, desired dtype of the result



[['31',
  '10',
  'go',
  'loss',
  'minutes',
  'so',
  'matches',
  'then',
  'for',
  'following']]

In [11]:
token_vocab([[1, 2]])  # indices 1 and 2 correspond to '.' and ','

[['.', ',']]
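
To show what such a two-way mapping does internally, here is a minimal sketch of a vocabulary. `ToyVocabulary` is a hypothetical class written for illustration; ordering ids by frequency and mapping unknown tokens to id 0 are assumptions made for the sketch, not a description of how SimpleVocabulary is implemented.

```python
from collections import Counter


class ToyVocabulary:
    """Maps tokens to ids and ids back to tokens."""

    def __init__(self, special_tokens=()):
        self.special_tokens = list(special_tokens)
        self.idx2tok = []
        self.tok2idx = {}

    def fit(self, batch):
        counts = Counter(tok for sample in batch for tok in sample)
        # Special tokens come first, then the rest by decreasing frequency
        ordered = self.special_tokens + [
            tok for tok, _ in counts.most_common()
            if tok not in self.special_tokens]
        self.idx2tok = ordered
        self.tok2idx = {tok: i for i, tok in enumerate(ordered)}

    def __call__(self, batch):
        # Strings are looked up as tokens, integers as ids
        return [[self._convert(item) for item in sample] for sample in batch]

    def _convert(self, item):
        if isinstance(item, str):
            if item in self.tok2idx:
                return self.tok2idx[item]
            if self.special_tokens:
                return 0  # id of the first special token, e.g. '<UNK>'
            raise KeyError(item)
        return self.idx2tok[item]


vocab = ToyVocabulary(['<UNK>'])
vocab.fit([['a', 'b', 'a'], ['c', 'a']])
print(vocab([['a', 'never_seen', 'c']]))
print(vocab([[1, 3]]))
```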

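## Dataset iterator
Neural networks are usually trained in batches, which means that every weight update is based on several sequences at once. All sequences within a batch must have the same length, so we pad the token sequences with the special `<UNK>` token, and the tag sequences are padded in the same way. For a recurrent neural network (RNN) it is also good practice to provide the sequence lengths so that it can skip the computation on the padded parts. To save time, batching is handled here by the library's DataLearningIterator.  
  
  An important concept in batch generation is shuffling, that is, drawing the samples from the dataset in random order. Training on shuffled data matters because a long run of consecutive samples of the same class can lead to poor model quality.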
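
The batching and shuffling idea can be sketched in plain Python. `toy_batches` below is a hypothetical generator written for illustration; it only mimics the shape of the batches used in this tutorial (a tuple of token lists and a tuple of tag lists) and is not the library's iterator.

```python
import random


def toy_batches(samples, batch_size, shuffle=True, seed=None):
    """Yield (tokens_batch, tags_batch) from a list of (tokens, tags) samples."""
    order = list(range(len(samples)))
    if shuffle:
        # Visit the samples in random order
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        chunk = [samples[i] for i in order[start:start + batch_size]]
        # zip(*chunk) regroups [(tokens, tags), ...] into (all tokens, all tags)
        yield tuple(zip(*chunk))


samples = [(['a'], ['O']),
           (['b', 'c'], ['B-PER', 'I-PER']),
           (['d'], ['O'])]
for tokens_batch, tags_batch in toy_batches(samples, batch_size=2, seed=0):
    print(tokens_batch, tags_batch)
```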


In [12]:
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator

Create the dataset iterator from the loaded dataset:

In [13]:
data_iterator = DataLearningIterator(dataset)

Try generating a batch:

In [14]:
next(data_iterator.gen_batches(2, shuffle=True))

((['Algeria',
   ',',
   'fighting',
   'a',
   'vicious',
   'war',
   'against',
   'Moslem',
   'fundamentalist',
   'guerrillas',
   ',',
   'attacked',
   'Britain',
   'on',
   'Wednesday',
   'for',
   'allowing',
   'Islamist',
   'groups',
   'to',
   'meet',
   'in',
   'London',
   '.'],
  ['Perot',
   'won',
   'his',
   'party',
   "'s",
   'official',
   'nomination',
   'as',
   'its',
   'presidential',
   'candidate',
   'in',
   'a',
   'secret',
   'ballot',
   'earlier',
   'this',
   'month',
   '.']),
 (['B-LOC',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'B-MISC',
   'O',
   'O',
   'O',
   'O',
   'B-LOC',
   'O',
   'O',
   'O',
   'O',
   'B-MISC',
   'O',
   'O',
   'O',
   'O',
   'B-LOC',
   'O'],
  ['B-PER',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O']))

In [15]:
next(data_iterator.gen_batches(5, shuffle=True))

((['The',
   'Egyptian',
   'government',
   'will',
   'have',
   'nothing',
   'more',
   'to',
   'do',
   'with',
   'the',
   'Sudanese',
   'government',
   'because',
   'it',
   'continues',
   'to',
   'shelter',
   'and',
   'support',
   'Egyptian',
   'militants',
   ',',
   'President',
   'Hosni',
   'Mubarak',
   'said',
   'in',
   'a',
   'speech',
   'on',
   'Thursday',
   '.'],
  ['Glickman', 'says', 'USDA', 'monitoring', 'aflatoxin', 'in', 'Texas', '.'],
  ['Although',
   'Christie',
   ',',
   'who',
   'is',
   'not',
   'racing',
   'the',
   'individual',
   '100',
   'metres',
   'in',
   'Berlin',
   ',',
   'took',
   'his',
   'time',
   'to',
   'agree',
   'to',
   'run',
   ',',
   'the',
   'veteran',
   'was',
   'clearly',
   'delighted',
   'to',
   'be',
   'part',
   'of',
   'the',
   'tribute',
   'to',
   'the',
   'black',
   'American',
   '.'],
  ['SepOct', '733.75', '743.50', 'unq', 'unq'],
  ['Extras', '(', 'nb-5', ')', '5']),
 (['O',
   'B

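## Generating the mask
One last thing about preparing the training data: we need a binary mask that is 1 at token positions and 0 at padding positions.  
This mask stops gradients from backpropagating through the padding.  
An example of such a mask: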
~~~
[[1, 1, 0, 0, 0],
 [1, 1, 1, 1, 1]]
~~~
for these sentences:
~~~
 [['The', 'roof'],
  ['This', 'is', 'my', 'domain', '!']]
~~~
The mask length must equal the length of the longest sentence in the batch.

In [16]:
from deeppavlov.models.preprocessors.mask import Mask
get_mask = Mask()

In [17]:
get_mask([['Try', 'to', 'get', 'the', 'mask'], ['Check', 'paddings']])

array([[1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0.]], dtype=float32)
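
For reference, the same kind of mask can be built by hand with NumPy. `make_mask` is a hypothetical helper written for this sketch, not the library's Mask class; it assumes the batch is a list of token lists.

```python
import numpy as np


def make_mask(tokens_batch):
    """Return a float32 matrix with 1 for real tokens and 0 for padding."""
    max_len = max(len(tokens) for tokens in tokens_batch)
    mask = np.zeros((len(tokens_batch), max_len), dtype=np.float32)
    for row, tokens in enumerate(tokens_batch):
        mask[row, :len(tokens)] = 1.0
    return mask


print(make_mask([['The', 'roof'], ['This', 'is', 'my', 'domain', '!']]))
```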

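## Building a convolutional neural network (CNN)
This is the most important part of the tutorial: here we specify the network architecture from TensorFlow building blocks.  
We create a convolutional neural network (CNN) that produces a probability distribution over tags for every token in a sentence. A CNN is used because its filters take both the left and the right context of a token into account. A dense layer on top performs the tag classification.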

In [18]:
import tensorflow as tf
import numpy as np

np.random.seed(42)
tf.set_random_seed(42)

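An important component of almost every network in NLP is the word embedding. We pass text to the network as a sequence of tokens, each represented by its index. Every token (index) has a vector, and together these vectors form an embedding matrix. The matrix can be pre-trained with a general-purpose algorithm such as Skip-Gram or CBOW, or it can be initialized with random values and trained together with the other parameters of the network. In this tutorial we follow the **second option**.  

We need a function that takes a tensor of token indices with shape \[batch_size, num_token\] and, for every index in it, retrieves the corresponding vector from the embedding matrix. This produces a new tensor of shape \[batch_size, num_token, emb_dim\].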

In [19]:
def get_embeddings(indices, vocabulary_size, emb_dim):
    # Initialize the random gaussian matrix with dimensions [vocabulary_size, embedding_dimension]
    # The **VARIANCE** of the random samples must be 1 / embedding_dimension
    emb_mat = np.random.randn(vocabulary_size, emb_dim).astype(np.float32) / np.sqrt(emb_dim) # YOUR CODE HERE
    emb_mat = tf.Variable(emb_mat, name='Embeddings', trainable=True)
    emb = tf.nn.embedding_lookup(emb_mat, indices)
    return emb
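
The shape change performed by the lookup is easy to check in plain NumPy. The sketch below is illustrative only: the sizes are arbitrary, and integer-array indexing stands in for `tf.nn.embedding_lookup`.

```python
import numpy as np

rng = np.random.RandomState(0)
vocabulary_size, emb_dim = 6, 4

# Same scaling as in get_embeddings: standard normal divided by sqrt(emb_dim)
emb_mat = rng.randn(vocabulary_size, emb_dim).astype(np.float32) / np.sqrt(emb_dim)

# A batch of 2 sentences with 3 token ids each: shape [batch_size, num_token]
indices = np.array([[1, 5, 0],
                    [2, 2, 4]])

# Integer-array indexing picks one row of the matrix per id
emb = emb_mat[indices]
print(emb.shape)  # (2, 3, 4), i.e. [batch_size, num_token, emb_dim]
```

Equal ids always map to the same vector, so the two occurrences of id 2 in the second sentence receive identical embeddings.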

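The body of the network is made of convolutional layers. The basic idea behind a convolution is to apply the same dense layer to every window of n consecutive samples (tokens, in our case). A simplified example is shown below:  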
![](https://github.com/deepmipt/DeepPavlov/raw/c7896c6db96f43f57cacd9a6a471e37cb70bf07a/examples/tutorials/img/convolution.png)  
Here the number of input features and the number of output features both equal 1.  
Let's try it on a simple example:  

In [20]:
# Create a tensor with shape [batch_size, number_of_tokens, number_of_features]
x = tf.random_normal(shape=[2, 10, 100])
y = tf.layers.conv1d(x, filters=200, kernel_size=8)
print(y)

Tensor("conv1d/BiasAdd:0", shape=(2, 3, 200), dtype=float32)


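As you can see, because there is no zero padding (zeros added at the beginning and the end of the input), the resulting tensor is smaller along the number_of_tokens dimension.  
To use padding and preserve the size along the convolved dimension, pass the padding='same' argument to the function.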


In [21]:
y_with_padding = tf.layers.conv1d(x, filters=200, kernel_size=8, padding='same')
print(y_with_padding)

Tensor("conv1d_1/BiasAdd:0", shape=(2, 10, 200), dtype=float32)
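
The two shapes printed above follow from simple arithmetic. The sketch below assumes stride 1 and no dilation, which are the defaults used in the cells above; `conv1d_output_length` is a hypothetical helper name.

```python
def conv1d_output_length(n_tokens, kernel_size, padding='valid'):
    """Output length of a stride-1 1D convolution along the token axis."""
    if padding == 'same':
        # Zero padding keeps the length unchanged
        return n_tokens
    if padding == 'valid':
        # The kernel fits into n_tokens - kernel_size + 1 positions
        return max(n_tokens - kernel_size + 1, 0)
    raise ValueError("padding must be 'valid' or 'same'")


print(conv1d_output_length(10, 8))          # 3
print(conv1d_output_length(10, 8, 'same'))  # 10
```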


Now stack several layers with the sizes given by the n_hidden_list variable.

In [22]:
def conv_net(units, n_hidden_list, cnn_filter_width, activation=tf.nn.relu):
    # Use activation(units) to apply activation to units
    for n_hidden in n_hidden_list:
        
        units = tf.layers.conv1d(units,
                                 n_hidden,
                                 cnn_filter_width,
                                 padding='same')
        units = activation(units)
    return units

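A common loss for classification tasks is cross-entropy. Why classification? Because for every token the network must decide which tag to predict. Cross-entropy has the following form: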

$$H(P, Q) = -E_{x \sim P} \log Q(x)$$

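It measures the dissimilarity between the ground-truth distribution and the predicted distribution. In most cases the ground-truth distribution is one-hot. Fortunately, this loss is already implemented in TensorFlow.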

In [23]:
# The logits
l = tf.random_normal([1, 4, 3]) # shape [batch_size, number_of_tokens, number of classes]
indices = tf.placeholder(tf.int32, [1, 4])

# Make one-hot distribution from indices for 3 types of tag
p = tf.one_hot(indices, depth=3)
loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=p, logits=l)
print(loss_tensor)

Tensor("softmax_cross_entropy_with_logits/Reshape_2:0", shape=(1, 4), dtype=float32)
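
To see what this loss computes for one-hot labels, here is a NumPy sketch of the same quantity. It is an illustration under the assumption of integer class labels, not the TensorFlow implementation; `softmax_cross_entropy` is a hypothetical name.

```python
import numpy as np


def softmax_cross_entropy(logits, label_indices):
    """Per-token cross-entropy between one-hot labels and softmax(logits)."""
    # Subtract the maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # With one-hot labels only the log-probability of the true class remains
    picked = np.take_along_axis(log_probs, label_indices[..., None], axis=-1)
    return -picked[..., 0]


logits = np.zeros((1, 4, 3))       # a uniform prediction over 3 tags
labels = np.array([[0, 2, 1, 0]])  # shape [batch_size, number_of_tokens]
print(softmax_cross_entropy(logits, labels))  # every entry equals log(3)
```

A uniform prediction over three classes costs log(3) per token, while a confident and correct prediction drives the loss towards zero.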


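All sentences in a batch have the same length because each one is padded to the length of the longest sentence. The ends of the shorter sentences are therefore padding, and pushing the network to predict the padding usually hurts quality. So we multiply the loss tensor by the binary mask to prevent gradients from flowing through the padding.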

In [24]:
mask = tf.placeholder(tf.float32, shape=[1, 4])
loss_tensor *= mask

The last step is to compute the mean of the loss tensor:

In [25]:
loss = tf.reduce_mean(loss_tensor)

Now define a function that returns a scalar masked cross-entropy loss:

In [26]:
def masked_cross_entropy(logits, label_indices, number_of_tags, mask):
    ground_truth_labels = tf.one_hot(label_indices, depth=number_of_tags)
    loss_tensor = tf.nn.softmax_cross_entropy_with_logits_v2(labels=ground_truth_labels, logits=logits)
    loss_tensor *= mask
    loss = tf.reduce_mean(loss_tensor)
    return loss
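
One detail of this function is worth spelling out: after masking, the padded positions contribute zero to the sum, but the mean is still taken over all positions, padding included. The NumPy sketch below uses made-up loss values to compare that with a mean over real tokens only. The second variant is shown purely as an alternative normalization and is not what this tutorial uses.

```python
import numpy as np

# Per-token losses for 2 sentences padded to length 5 (illustrative values)
loss_tensor = np.array([[2.0, 4.0, 9.0, 9.0, 9.0],
                        [1.0, 1.0, 1.0, 1.0, 1.0]])
mask = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
                 [1.0, 1.0, 1.0, 1.0, 1.0]])

masked = loss_tensor * mask

# What masked_cross_entropy computes: mean over all 10 positions
mean_over_all = masked.mean()

# Alternative: mean over the 7 real tokens only
mean_over_tokens = masked.sum() / mask.sum()

print(mean_over_all, mean_over_tokens)
```

With the first variant the loss scale depends on how much padding a batch contains; with the second it does not.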

Next, put everything together into one class:

In [27]:
import numpy as np
import tensorflow as tf

class NerNetwork:
    def __init__(self,
                 n_tokens,
                 n_tags,
                 token_emb_dim=100,
                 n_hidden_list=(128,),
                 cnn_filter_width=7,
                 use_batch_norm=False,
                 embeddings_dropout=False,
                 top_dropout=False,
                 **kwargs):
        
        # ================ Building inputs =================
        
        self.learning_rate_ph = tf.placeholder(tf.float32, [])
        self.dropout_keep_ph = tf.placeholder(tf.float32, [])
        self.token_ph = tf.placeholder(tf.int32, [None, None], name='token_ind_ph')
        self.mask_ph = tf.placeholder(tf.float32, [None, None], name='Mask_ph')
        self.y_ph = tf.placeholder(tf.int32, [None, None], name='y_ph')
        
        # ================== Building the network ==================
        
        # Now embed the token indices using the get_embeddings function
        
        ######################################
        ########## YOUR CODE HERE ############
        emb = get_embeddings(self.token_ph, n_tokens, token_emb_dim)
        ######################################

        emb = tf.nn.dropout(emb, self.dropout_keep_ph, (tf.shape(emb)[0], 1, tf.shape(emb)[2]))
        
        # Build a multilayer CNN on top of the embeddings.
        # The number of units in each layer must match the
        # corresponding number from n_hidden_list.
        # Use ReLU activation
        ######################################
        ########## YOUR CODE HERE ############
        units = conv_net(emb, n_hidden_list, cnn_filter_width)
        ######################################
        units = tf.nn.dropout(units, self.dropout_keep_ph, (tf.shape(units)[0], 1, tf.shape(units)[2]))
        logits = tf.layers.dense(units, n_tags, activation=None)
        self.predictions = tf.argmax(logits, 2)
        
        # ================= Loss and train ops =================
        # Use cross-entropy loss. check the tf.nn.softmax_cross_entropy_with_logits_v2 function
        ######################################
        ########## YOUR CODE HERE ############
        self.loss = masked_cross_entropy(logits, self.y_ph, n_tags, self.mask_ph)
        ######################################

        # Create a training operation to update the network parameters.
        # We propose to use the Adam optimizer as it works well in
        # most cases. Check tf.train to find an implementation.
        # Put the train operation into the attribute self.train_op
        
        ######################################
        ########## YOUR CODE HERE ############
        optimizer = tf.train.AdamOptimizer(self.learning_rate_ph)
        self.train_op = optimizer.minimize(self.loss)
        ######################################

        # ================= Initialize the session =================
        
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def __call__(self, tok_batch, mask_batch):
        feed_dict = {self.token_ph: tok_batch,
                     self.mask_ph: mask_batch,
                     self.dropout_keep_ph: 1.0}
        return self.sess.run(self.predictions, feed_dict)

    def train_on_batch(self, tok_batch, tag_batch, mask_batch, dropout_keep_prob, learning_rate):
        feed_dict = {self.token_ph: tok_batch,
                     self.y_ph: tag_batch,
                     self.mask_ph: mask_batch,
                     self.dropout_keep_ph: dropout_keep_prob,
                     self.learning_rate_ph: learning_rate}
        self.sess.run(self.train_op, feed_dict)

Create a NerNetwork instance:

In [28]:
nernet = NerNetwork(len(token_vocab),
                    len(tag_vocab),
                    n_hidden_list=[100, 100])

We usually want to check the score on the validation part of the dataset after every epoch. In most NER tasks the classes are imbalanced, so accuracy is not the best measure of performance. If 95% of the tags are O, a dummy classifier that always predicts O reaches 95% accuracy. To address this, the F1 score is used. It is defined as:
$$F1 = \frac{2 P R}{P + R}$$


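where P is precision and R is recall.  
We need to write the evaluation function: it collects the predictions for the whole given part of the dataset and computes F1.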
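
As a worked example of the formula, the sketch below recomputes precision, recall and F1 from phrase counts. The counts are taken from the first validation log line of the training run in this tutorial (5942 phrases in the data, 4832 found, 3116 correct). The helper `precision_recall_f1_from_counts` is hypothetical and is not the library function imported below.

```python
def precision_recall_f1_from_counts(n_correct, n_found, n_true):
    """Compute precision, recall and F1 (in percent) from phrase counts."""
    precision = n_correct / n_found if n_found else 0.0
    recall = n_correct / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


p, r, f1 = precision_recall_f1_from_counts(n_correct=3116, n_found=4832, n_true=5942)
print('precision: %.2f%%; recall: %.2f%%; F1: %.2f' % (p, r, f1))
```

The result matches the values reported in that log line: 64.49% precision, 52.44% recall and an F1 of 57.84.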

In [29]:
from deeppavlov.models.ner.evaluation import precision_recall_f1
# The function precision_recall_f1 takes two lists: y_true and y_predicted
# the tag sequences of all sentences should be merged into one big list
from deeppavlov.core.data.utils import zero_pad
# zero_pad takes a batch of lists of token indices, pads it with zeros to the
# maximal length and converts it to a numpy matrix
from itertools import chain


def eval_valid(network, batch_generator):
    total_true = []
    total_pred = []
    for x, y_true in batch_generator:

        # Prepare token indices from tokens batch
        x_inds = token_vocab(x) # YOUR CODE HERE

        # Pad the indices batch with zeros
        x_batch = zero_pad(x_inds) # YOUR CODE HERE

        # Get the mask using get_mask
        mask = get_mask(x) # YOUR CODE HERE
        
        # We call the instance of the NerNetwork because we have defined __call__ method
        y_inds = network(x_batch, mask)

        # For every sentence in the batch extract all tags up to paddings
        y_inds = [y_inds[n][:len(x[n])] for n, y in enumerate(y_inds)] # YOUR CODE HERE
        y_pred = tag_vocab(y_inds)

        # Add fresh predictions 
        total_true.extend(chain(*y_true))
        total_pred.extend(chain(*y_pred))
    res = precision_recall_f1(total_true, total_pred, print_results=True)
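
The padding step used in this function can be pictured with a small NumPy sketch. `zero_pad_sketch` is a hypothetical stand-in based only on the comment describing `zero_pad` in the cell above; the library function may differ in details such as the dtype.

```python
import numpy as np


def zero_pad_sketch(batch, dtype=np.int32):
    """Pad lists of indices with zeros to a common length; return a matrix."""
    max_len = max(len(sample) for sample in batch)
    padded = np.zeros((len(batch), max_len), dtype=dtype)
    for row, sample in enumerate(batch):
        padded[row, :len(sample)] = sample
    return padded


print(zero_pad_sketch([[5, 6, 7], [8]]))
```

Because padding uses index 0, padded tag positions coincide with the id of the `O` tag shown in the vocabulary output earlier, and the mask is what keeps them out of the loss.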

You can start from the following recommended values:
* batch_size: 32;
* n_epochs: 10;
* starting value of learning_rate: 0.001;
* learning_rate_decay: the square root of 2;
* dropout_keep_probability: 0.7 for training (typical values range from 0.3 to 0.9).  

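A very effective way to manage the learning rate is to decrease it once training has converged. The learning rate is usually divided by a factor such as 2, 3 or 10.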
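
One possible way to implement such a decay is sketched below, under the assumption that "converged" means the validation F1 did not improve on its previous best. The function `decayed_learning_rate` is hypothetical, and the training loop in this tutorial keeps the learning rate constant instead. The F1 values are the first four validation scores from the training log further down.

```python
import math


def decayed_learning_rate(learning_rate, f1_history, decay=math.sqrt(2)):
    """Divide the learning rate by `decay` when the latest F1 did not improve."""
    if len(f1_history) >= 2 and f1_history[-1] <= max(f1_history[:-1]):
        return learning_rate / decay
    return learning_rate


lr = 0.001
history = []
for f1 in [57.84, 77.15, 82.00, 81.83]:
    history.append(f1)
    lr = decayed_learning_rate(lr, history)
print(lr)
```

Only the fourth epoch fails to improve on the best score so far, so the learning rate is divided by the square root of 2 exactly once.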

In [30]:
batch_size = 16 # YOUR HYPERPARAMETER HERE
n_epochs = 20 # YOUR HYPERPARAMETER HERE
learning_rate = 0.001 # YOUR HYPERPARAMETER HERE
dropout_keep_prob = 0.5 # YOUR HYPERPARAMETER HERE

Now we iterate over the dataset batch by batch and pass the data to the training step.

In [31]:
for epoch in range(n_epochs):
    for x, y in data_iterator.gen_batches(batch_size, 'train'):
        # Convert tokens to indices via Vocab
        x_inds = token_vocab(x) # YOUR CODE 
        # Convert tags to indices via Vocab
        y_inds = tag_vocab(y) # YOUR CODE 
        
        # Pad every sample with zeros to the maximal length
        x_batch = zero_pad(x_inds)
        y_batch = zero_pad(y_inds)

        mask = get_mask(x)
        nernet.train_on_batch(x_batch, y_batch, mask, dropout_keep_prob, learning_rate)
    print('Evaluating the model on valid part of the dataset')
    eval_valid(nernet, data_iterator.gen_batches(batch_size, 'valid'))

Evaluating the model on valid part of the dataset


2018-08-20 17:28:53.519 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 4832 phrases; correct: 3116.

precision:  64.49%; recall:  52.44%; FB1:  57.84

	LOC: precision:  69.90%; recall:  69.41%; F1:  69.65 1824

	MISC: precision:  53.96%; recall:  19.96%; F1:  29.14 341

	ORG: precision:  51.90%; recall:  44.89%; F1:  48.14 1160

	PER: precision:  70.01%; recall:  57.27%; F1:  63.00 1507




Evaluating the model on valid part of the dataset


2018-08-20 17:29:12.258 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5301 phrases; correct: 4337.

precision:  81.81%; recall:  72.99%; FB1:  77.15

	LOC: precision:  86.92%; recall:  82.47%; F1:  84.64 1743

	MISC: precision:  80.29%; recall:  71.15%; F1:  75.45 817

	ORG: precision:  75.27%; recall:  62.86%; F1:  68.51 1120

	PER: precision:  81.62%; recall:  71.82%; F1:  76.41 1621




Evaluating the model on valid part of the dataset


2018-08-20 17:29:30.967 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5411 phrases; correct: 4655.

precision:  86.03%; recall:  78.34%; FB1:  82.00

	LOC: precision:  91.30%; recall:  84.59%; F1:  87.82 1702

	MISC: precision:  86.21%; recall:  75.27%; F1:  80.37 805

	ORG: precision:  79.01%; recall:  72.41%; F1:  75.56 1229

	PER: precision:  85.73%; recall:  77.96%; F1:  81.66 1675




Evaluating the model on valid part of the dataset


2018-08-20 17:29:49.663 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5254 phrases; correct: 4581.

precision:  87.19%; recall:  77.10%; FB1:  81.83

	LOC: precision:  93.37%; recall:  85.08%; F1:  89.03 1674

	MISC: precision:  85.80%; recall:  77.33%; F1:  81.35 831

	ORG: precision:  81.26%; recall:  73.38%; F1:  77.12 1211

	PER: precision:  85.89%; recall:  71.72%; F1:  78.17 1538




Evaluating the model on valid part of the dataset


2018-08-20 17:30:09.538 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5376 phrases; correct: 4700.

precision:  87.43%; recall:  79.10%; FB1:  83.05

	LOC: precision:  93.68%; recall:  87.15%; F1:  90.30 1709

	MISC: precision:  84.97%; recall:  79.07%; F1:  81.91 858

	ORG: precision:  81.45%; recall:  74.65%; F1:  77.90 1229

	PER: precision:  86.65%; recall:  74.32%; F1:  80.01 1580




Evaluating the model on valid part of the dataset


2018-08-20 17:30:29.531 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5260 phrases; correct: 4656.

precision:  88.52%; recall:  78.36%; FB1:  83.13

	LOC: precision:  94.29%; recall:  85.47%; F1:  89.66 1665

	MISC: precision:  86.88%; recall:  78.31%; F1:  82.37 831

	ORG: precision:  83.29%; recall:  74.35%; F1:  78.57 1197

	PER: precision:  87.24%; recall:  74.21%; F1:  80.20 1567




Evaluating the model on valid part of the dataset


2018-08-20 17:30:48.736 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5482 phrases; correct: 4771.

precision:  87.03%; recall:  80.29%; FB1:  83.53

	LOC: precision:  91.06%; recall:  88.73%; F1:  89.88 1790

	MISC: precision:  87.41%; recall:  78.31%; F1:  82.61 826

	ORG: precision:  79.08%; recall:  77.78%; F1:  78.42 1319

	PER: precision:  88.95%; recall:  74.70%; F1:  81.20 1547




Evaluating the model on valid part of the dataset


2018-08-20 17:31:07.955 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5336 phrases; correct: 4692.

precision:  87.93%; recall:  78.96%; FB1:  83.21

	LOC: precision:  93.35%; recall:  86.34%; F1:  89.71 1699

	MISC: precision:  87.41%; recall:  79.83%; F1:  83.45 842

	ORG: precision:  82.32%; recall:  75.02%; F1:  78.50 1222

	PER: precision:  86.71%; recall:  74.05%; F1:  79.88 1573




Evaluating the model on valid part of the dataset


2018-08-20 17:31:27.110 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5299 phrases; correct: 4708.

precision:  88.85%; recall:  79.23%; FB1:  83.76

	LOC: precision:  94.20%; recall:  86.72%; F1:  90.31 1691

	MISC: precision:  85.61%; recall:  79.39%; F1:  82.39 855

	ORG: precision:  83.62%; recall:  76.14%; F1:  79.70 1221

	PER: precision:  88.90%; recall:  73.94%; F1:  80.74 1532




Evaluating the model on valid part of the dataset


2018-08-20 17:31:47.200 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5247 phrases; correct: 4683.

precision:  89.25%; recall:  78.81%; FB1:  83.71

	LOC: precision:  93.80%; recall:  87.37%; F1:  90.47 1711

	MISC: precision:  88.65%; recall:  79.61%; F1:  83.89 828

	ORG: precision:  85.07%; recall:  76.06%; F1:  80.31 1199

	PER: precision:  87.74%; recall:  71.88%; F1:  79.02 1509




Evaluating the model on valid part of the dataset


2018-08-20 17:32:07.819 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5198 phrases; correct: 4639.

precision:  89.25%; recall:  78.07%; FB1:  83.29

	LOC: precision:  93.48%; recall:  86.61%; F1:  89.91 1702

	MISC: precision:  88.37%; recall:  79.93%; F1:  83.94 834

	ORG: precision:  83.82%; recall:  75.69%; F1:  79.55 1211

	PER: precision:  89.32%; recall:  70.36%; F1:  78.71 1451




Evaluating the model on valid part of the dataset


2018-08-20 17:32:28.986 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5204 phrases; correct: 4632.

precision:  89.01%; recall:  77.95%; FB1:  83.12

	LOC: precision:  93.45%; recall:  86.99%; F1:  90.10 1710

	MISC: precision:  89.12%; recall:  79.07%; F1:  83.79 818

	ORG: precision:  83.33%; recall:  76.44%; F1:  79.74 1230

	PER: precision:  88.52%; recall:  69.49%; F1:  77.86 1446




Evaluating the model on valid part of the dataset


2018-08-20 17:32:49.334 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5187 phrases; correct: 4612.

precision:  88.91%; recall:  77.62%; FB1:  82.88

	LOC: precision:  93.16%; recall:  86.06%; F1:  89.47 1697

	MISC: precision:  89.33%; recall:  79.93%; F1:  84.37 825

	ORG: precision:  84.38%; recall:  76.14%; F1:  80.05 1210

	PER: precision:  87.49%; recall:  69.11%; F1:  77.22 1455




Evaluating the model on valid part of the dataset


2018-08-20 17:33:09.46 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5170 phrases; correct: 4611.

precision:  89.19%; recall:  77.60%; FB1:  82.99

	LOC: precision:  93.75%; recall:  86.61%; F1:  90.04 1697

	MISC: precision:  89.08%; recall:  79.61%; F1:  84.08 824

	ORG: precision:  83.73%; recall:  76.36%; F1:  79.88 1223

	PER: precision:  88.50%; recall:  68.51%; F1:  77.23 1426




Evaluating the model on valid part of the dataset


2018-08-20 17:33:30.215 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5177 phrases; correct: 4625.

precision:  89.34%; recall:  77.84%; FB1:  83.19

	LOC: precision:  93.24%; recall:  86.28%; F1:  89.62 1700

	MISC: precision:  89.24%; recall:  80.04%; F1:  84.39 827

	ORG: precision:  84.78%; recall:  75.17%; F1:  79.68 1189

	PER: precision:  88.57%; recall:  70.25%; F1:  78.35 1461




Evaluating the model on valid part of the dataset


2018-08-20 17:33:50.712 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5219 phrases; correct: 4633.

precision:  88.77%; recall:  77.97%; FB1:  83.02

	LOC: precision:  92.75%; recall:  87.04%; F1:  89.81 1724

	MISC: precision:  88.10%; recall:  80.26%; F1:  84.00 840

	ORG: precision:  85.08%; recall:  74.87%; F1:  79.65 1180

	PER: precision:  87.46%; recall:  70.03%; F1:  77.78 1475




Evaluating the model on valid part of the dataset


2018-08-20 17:34:09.840 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5192 phrases; correct: 4615.

precision:  88.89%; recall:  77.67%; FB1:  82.90

	LOC: precision:  92.94%; recall:  86.72%; F1:  89.72 1714

	MISC: precision:  89.96%; recall:  79.72%; F1:  84.53 817

	ORG: precision:  83.18%; recall:  74.50%; F1:  78.60 1201

	PER: precision:  88.22%; recall:  69.92%; F1:  78.01 1460




Evaluating the model on valid part of the dataset


2018-08-20 17:34:29.318 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5170 phrases; correct: 4628.

precision:  89.52%; recall:  77.89%; FB1:  83.30

	LOC: precision:  94.96%; recall:  85.19%; F1:  89.81 1648

	MISC: precision:  89.78%; recall:  79.07%; F1:  84.08 812

	ORG: precision:  83.22%; recall:  76.21%; F1:  79.56 1228

	PER: precision:  88.53%; recall:  71.23%; F1:  78.94 1482




Evaluating the model on valid part of the dataset


2018-08-20 17:34:49.29 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5283 phrases; correct: 4668.

precision:  88.36%; recall:  78.56%; FB1:  83.17

	LOC: precision:  93.16%; recall:  86.72%; F1:  89.82 1710

	MISC: precision:  88.88%; recall:  79.72%; F1:  84.05 827

	ORG: precision:  81.53%; recall:  75.39%; F1:  78.34 1240

	PER: precision:  88.25%; recall:  72.15%; F1:  79.39 1506




Evaluating the model on valid part of the dataset


2018-08-20 17:35:08.617 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 51363 tokens with 5942 phrases; found: 5180 phrases; correct: 4610.

precision:  89.00%; recall:  77.58%; FB1:  82.90

	LOC: precision:  93.26%; recall:  86.55%; F1:  89.78 1705

	MISC: precision:  90.21%; recall:  79.93%; F1:  84.76 817

	ORG: precision:  83.36%; recall:  75.09%; F1:  79.01 1208

	PER: precision:  88.00%; recall:  69.27%; F1:  77.52 1450




Evaluate on the test data:

In [32]:
eval_valid(nernet, data_iterator.gen_batches(batch_size, 'test'))


2018-08-20 17:35:10.85 DEBUG in 'deeppavlov.models.ner.evaluation'['evaluation'] at line 213: processed 46436 tokens with 5648 phrases; found: 4533 phrases; correct: 3728.

precision:  82.24%; recall:  66.01%; FB1:  73.23

	LOC: precision:  88.05%; recall:  80.88%; F1:  84.31 1532

	MISC: precision:  77.05%; recall:  70.80%; F1:  73.79 645

	ORG: precision:  79.23%; recall:  61.11%; F1:  69.00 1281

	PER: precision:  80.65%; recall:  53.62%; F1:  64.41 1075




Next, use the model to tag sentences of our own:

In [33]:
sentence = 'Petr stole my vodka in America'
x = [sentence.split()]

x_inds = token_vocab(x)
x_batch = zero_pad(x_inds)
mask = get_mask(x)
y_inds = nernet(x_batch, mask)
print(x[0])
print(tag_vocab(y_inds)[0])

['Petr', 'stole', 'my', 'vodka', 'in', 'America']
['B-PER', 'O', 'O', 'O', 'O', 'B-LOC']


In [34]:
sentence = 'Wu Jiahang is the fastest man alive in China'
x = [sentence.split()]

x_inds = token_vocab(x)
x_batch = zero_pad(x_inds)
mask = get_mask(x)
y_inds = nernet(x_batch, mask)
print(x[0])
print(tag_vocab(y_inds)[0])

['Wu', 'Jiahang', 'is', 'the', 'fastest', 'man', 'alive', 'in', 'China']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC']


In [35]:
sentence = 'YuLinzhu is the stupidest woman alive in China'
x = [sentence.split()]

x_inds = token_vocab(x)
x_batch = zero_pad(x_inds)
mask = get_mask(x)
y_inds = nernet(x_batch, mask)
print(x[0])
print(tag_vocab(y_inds)[0])

['YuLinzhu', 'is', 'the', 'stupidest', 'woman', 'alive', 'in', 'China']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC']
