
arXiv-2019/8-Simplify the Usage of Lexicon in Chinese NER #280

Open
BrambleXu opened this issue Nov 7, 2019 · 0 comments
Labels
C Code Implementation CN(P) Chinese NLP Problem NER(T) Named Entity Recognition Task


Summary:

Because Lattice-LSTM (#279) is computationally inefficient, this paper aims to incorporate lexicon information into character representations in a more efficient way.

Resource:

Paper information:

  • Author: Minlong Peng, Ruotian Ma, Qi Zhang, Xuanjing Huang (Fudan University)
  • Dataset:
  • keywords:

Notes:

The authors analyze the strengths and weaknesses of Lattice-LSTM.

Strengths:

  • It preserves all the words that possibly match the input sentence.
  • It can incorporate pre-trained word embeddings into the system.
  • Its attention mechanism automatically assigns weights to the matched words.

The idea of this paper is therefore to keep the strengths above while discarding the LSTM architecture itself; what the authors propose is a new encoding scheme.

Each character c of a sentence s has four corresponding word sets, marked with the four tags "BMES":

  • B(c): all matched words that begin with character c
  • M(c): all matched words in which c occurs as a middle character
  • E(c): all matched words that end with character c
  • S(c): the single-character word consisting of c alone

If a set is empty, its only member is the special token NONE.

Consider the sentence s = {c1, · · · , c5} and suppose that {c1, c2}, {c1, c2, c3}, {c2, c3, c4}, and {c2, c3, c4, c5} match the lexicon. Then, for c2, B(c2) = {{c2, c3, c4}, {c2, c3, c4, c5}}, M(c2) = {{c1, c2, c3}}, E(c2) = {{c1, c2}}, and S(c2) = {NONE}.

In this example B(c2) = {{c2, c3, c4}, {c2, c3, c4, c5}}; a concrete instance would be B(“南”) = {南京市, 南京大桥}.
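A minimal Python sketch (not the authors' implementation; names are illustrative) of how the four word sets can be built from a lexicon, reproducing the c1..c5 example above:

```python
def build_bmes_sets(sentence, lexicon):
    """Return, for each character position, its B/M/E/S word sets.

    Empty sets are filled with the special token "NONE", as in the paper.
    Words are represented as tuples of characters so they can live in sets.
    """
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    # Enumerate every substring and keep those that match the lexicon.
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = tuple(sentence[i:j])
            if word not in lexicon:
                continue
            if len(word) == 1:
                sets[i]["S"].add(word)        # the character alone is a word
            else:
                sets[i]["B"].add(word)        # word begins at position i
                sets[j - 1]["E"].add(word)    # word ends at position j - 1
                for k in range(i + 1, j - 1):
                    sets[k]["M"].add(word)    # word-internal character
    for per_char in sets:
        for tag in "BMES":
            if not per_char[tag]:
                per_char[tag] = {"NONE"}
    return sets

# The paper's example: matched words {c1,c2}, {c1,c2,c3}, {c2,c3,c4}, {c2,c3,c4,c5}.
sentence = ["c1", "c2", "c3", "c4", "c5"]
lexicon = {("c1", "c2"), ("c1", "c2", "c3"),
           ("c2", "c3", "c4"), ("c2", "c3", "c4", "c5")}
sets = build_bmes_sets(sentence, lexicon)
# For c2 (index 1): B = {(c2,c3,c4), (c2,c3,c4,c5)}, M = {(c1,c2,c3)},
# E = {(c1,c2)}, S = {"NONE"}.
```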

[image]

V^s is a mapping function that condenses a word set into a fixed-dimensional vector. Mean-pooling is introduced to compute the vector representation of a word set S:

[image: mean-pooling equation]
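Based on the description above, the mean-pooling over a word set can be written as (notation reconstructed from the notes, not copied from the paper):

```latex
v^s(S) = \frac{1}{|S|} \sum_{w \in S} e^w(w)
```

where e^w denotes the pre-trained word-embedding lookup.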

However, mean-pooling does not perform well. Lattice-LSTM uses a dynamic weighting algorithm; to preserve speed, this paper instead uses the frequency of each word as an indication of its weight. The basic idea beneath this algorithm is that the more times a character sequence occurs in the data, the more likely it is a word. Note that the frequency of a word is a static value that can be obtained offline, which greatly accelerates the calculation of each word's weight (e.g., via a lookup table).

[image: frequency-weighted pooling equation]
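A small NumPy sketch of frequency-weighted pooling as described above (the normalization here is illustrative; the paper's exact formula may differ, and all names are assumptions):

```python
import numpy as np

def weighted_pool(word_set, freq, emb):
    """Frequency-weighted average of word embeddings over a word set.

    `freq` maps each word to its statically pre-computed corpus frequency
    (the lookup-table weight from the notes); `emb` maps each word to its
    pre-trained embedding vector.
    """
    words = list(word_set)
    z = np.array([freq[w] for w in words], dtype=float)   # static weights
    vecs = np.stack([emb[w] for w in words])
    return (z[:, None] * vecs).sum(axis=0) / z.sum()      # weighted average

# Toy usage: two words with different frequencies (values made up).
freq = {"南京市": 20.0, "南京大桥": 5.0}
emb = {"南京市": np.array([1.0, 0.0]), "南京大桥": np.array([0.0, 1.0])}
v = weighted_pool({"南京市", "南京大桥"}, freq, emb)
# v = (20*[1,0] + 5*[0,1]) / 25 = [0.8, 0.2]
```

Because the weights are static, the whole pooling step reduces to table lookups plus one weighted sum, which is where the speedup over Lattice-LSTM's dynamic attention comes from.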

In addition, the weights of infrequent words are specifically boosted:

[image: equation boosting the weights of infrequent words]

Model Graph:

Result:

Thoughts:

Next Reading:
