Fairy Morphological Annotated Corpus

CircleCI Apache License

This corpus provides partial morphological annotations for Japanese Wikipedia text. It is intended mainly for checking the errors of morphological analyzers rather than for training them. The data published here is a sample.

また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。
(Furthermore, it is also 400 times heavier than the black hole at Sagittarius A* in the Milky Way.)

| indicates a word boundary. Each ? between the first and last | indicates a word boundary candidate.

This corpus reveals that some morphological analyzers wrongly parse the sentence above as あるい|て (あるい (walk) and て (and)), although the annotation places a word boundary between ある and いて. All annotations follow the JUMAN part-of-speech system, which is an extension of the Masuoka and Takubo grammar.
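The boundary notation described above can be read programmatically. The following is a minimal sketch, not part of this repository: the function name parse_annotation is hypothetical, and it assumes that every | is a boundary marker and that a ? outside the first and last | is a literal character.

```python
def parse_annotation(annotated):
    """Split an annotated line into plain text and boundary offsets.

    Hypothetical helper (not from this repository). Assumptions:
    every "|" is a word boundary, and "?" counts as a boundary
    candidate only when it lies between the first and last "|".
    Offsets refer to positions in the returned plain text.
    """
    first = annotated.find("|")
    last = annotated.rfind("|")
    chars = []
    boundaries = []
    candidates = []
    for i, ch in enumerate(annotated):
        if ch == "|":
            boundaries.append(len(chars))
        elif ch == "?" and first < i < last:
            candidates.append(len(chars))
        else:
            # Any other character, including "?" outside the
            # annotated span, is kept as ordinary text.
            chars.append(ch)
    return "".join(chars), boundaries, candidates


example = "また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。"
text, boundaries, candidates = parse_annotation(example)
print(text)
print(boundaries)   # offsets of "|" in the plain text
print(candidates)   # offsets of "?" in the plain text
```

An analyzer's output could then be compared against these offsets: a predicted segmentation that lacks a boundary at one of the | offsets, as in あるい|て, contradicts the annotation.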

Files

  • corpus
    • The first column of each .tsv file contains the annotated text.
    • Other columns contain additional information.
  • scripts
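
As an illustration of the file layout above, the annotated text can be pulled from the first column of each .tsv file with the standard csv module. This is a sketch under assumptions that are not stated in this README: the files are UTF-8, have no header row, and do not use quoting. The function names first_column and iter_corpus_texts are hypothetical.

```python
import csv
from pathlib import Path


def first_column(lines):
    """Yield the first tab-separated field of each non-empty row.

    Hypothetical helper; assumes no header row and no quoting.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row[0]


def iter_corpus_texts(corpus_dir="corpus"):
    """Yield annotated texts from every .tsv file under corpus_dir.

    Assumes UTF-8 encoded files (an assumption, not documented here).
    """
    for path in sorted(Path(corpus_dir).glob("*.tsv")):
        with open(path, encoding="utf-8", newline="") as f:
            yield from first_column(f)
```

For example, `for text in iter_corpus_texts("corpus"): print(text)` would print each annotated text, ignoring the additional columns.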

References

@INPROCEEDINGS{hayashibe:2017:SIGNL231,
    author    = {林部祐太},
    title     = {日本語部分形態素アノテーションコーパスの構築},
    booktitle = "情報処理学会第231回自然言語処理研究会",
    year      = "2017",
    pages     = "NL-231-9:1-8",
    publisher = "情報処理学会",
}

License