
ACL-2018-Subcharacter Information in Japanese Embeddings: When Is It Worth It? #291

Open
BrambleXu opened this issue Dec 4, 2019 · 1 comment
Labels: Embedding Embedding/Pre-train Model/Task, JP(P) Japanese NLP Problem

@BrambleXu (Owner):

Summary:

Subcharacter information is known to help Chinese embeddings; does it also help Japanese? The paper finds that the gains seen for Chinese do not transfer reliably to Japanese (presumably because of hiragana and katakana). In kanji-heavy settings, however, character n-grams do give a consistent improvement. Notably, in the experiments even the enhanced skip-gram models could not beat single-character n-gram fastText.

Resource:

  • pdf
  • code
  • paper-with-code

Paper information:

  • Author:
  • Dataset:
  • keywords:

Notes:


fastText is a subword-level model: it represents a word as the sum of its character n-gram vectors.
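As a minimal sketch of what "character n-grams" means here (a pure-Python illustration; the hypothetical function `char_ngrams` is not from the paper or the fastText library), fastText wraps each word in boundary markers `<` and `>` and extracts all n-grams between a minimum and maximum length:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, fastText-style.

    The word is wrapped in boundary markers '<' and '>' before
    extracting n-grams, and the full wrapped word is kept as well.
    """
    wrapped = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    ngrams.add(wrapped)  # the whole word is also a "subword"
    return ngrams

# With min_n = max_n = 1, the subwords are just the single characters
# (plus boundary markers) -- the character-level special case the
# paper discusses.
print(sorted(char_ngrams("東京", 1, 1)))
```

In gensim's fastText implementation the same knobs are exposed as the `min_n` and `max_n` parameters.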


  • SG (modified): the target word vector w is summed with the vectors of its constituent characters c1 and c2. This can be regarded as a special case of fastText in which both the minimum and maximum n-gram sizes are set to 1.
  • SG+kanji: learns word embeddings from characters and sub-characters, following the approach for Chinese (Yu et al. 2017, Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components)
  • SG+kanji+bushu: additionally adds radical (部首, bushu) information
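The modified-SG idea above can be sketched in a few lines (a toy illustration with made-up vectors; a real model learns the word and character tables jointly during training):

```python
# Hypothetical toy lookup tables -- NOT trained values.
word_vecs = {"東京": [0.2, -0.1, 0.4]}
char_vecs = {
    "東": [0.1, 0.0, 0.1],
    "京": [0.0, 0.2, -0.1],
}

def sg_char_vector(word):
    """Target-word representation in the modified SG model:
    the word vector summed with the vectors of its constituent
    characters (equivalent to fastText with min_n = max_n = 1)."""
    vec = list(word_vecs[word])
    for ch in word:
        vec = [v + c for v, c in zip(vec, char_vecs[ch])]
    return vec

print(sg_char_vector("東京"))
```

SG+kanji and SG+kanji+bushu extend the same summation to sub-character components and radicals, respectively.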

Model Graph:

Result:

Thoughts:

Next Reading:

BrambleXu self-assigned this and added the labels on Dec 4, 2019.
@Crescentz:

Is the code open-sourced?
