Fairy Morphological Annotated Corpus

CircleCI Apache License

This corpus provides partial morphological annotations for Japanese Wikipedia text. It is intended mainly for checking the errors of morphological analyzers rather than for training them. The data in this repository is a sample.

また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。
(Furthermore, it is 400 times heavier than the black hole at Sagittarius A* in our galaxy.)

| indicates a word boundary. Each ? between the first and the last | indicates a word boundary candidate.
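
The notation above can be read mechanically. The following is a minimal sketch, not the project's own tooling: `parse_annotation` and `PartialAnnotation` are hypothetical names, and the code assumes that `|` and `?` never occur as literal characters in the underlying text.

```python
from typing import List, NamedTuple


class PartialAnnotation(NamedTuple):
    text: str              # the sentence with all markers removed
    boundaries: List[int]  # character offsets marked with "|"
    candidates: List[int]  # character offsets marked with "?"


def parse_annotation(annotated: str) -> PartialAnnotation:
    """Split an annotated line into raw text and marker offsets.

    Assumption: "|" and "?" are used only as markers.
    """
    chars: List[str] = []
    boundaries: List[int] = []
    candidates: List[int] = []
    for ch in annotated:
        if ch == "|":
            boundaries.append(len(chars))  # offset into the raw text
        elif ch == "?":
            candidates.append(len(chars))
        else:
            chars.append(ch)
    return PartialAnnotation("".join(chars), boundaries, candidates)


example = "また、銀河系にある|いて?座?A?*|のブラックホールの400倍も重い。"
parsed = parse_annotation(example)
# parsed.boundaries -> [9, 14]
# parsed.candidates -> [11, 12, 13]
# parsed.text[9:14] -> "いて座A*"
```

Offsets are counted in characters of the marker-free text, so they can be compared directly with token boundaries produced by an analyzer.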

This corpus reveals that some morphological analyzers wrongly parse this span as あるい|て (あるい "walk" and て "and"). All annotations follow the JUMAN part-of-speech system, which is an extension of the Masuoka and Takubo grammar.
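
An error check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the repository's scripts: `token_boundaries` and `find_conflicts` are hypothetical helpers, offsets are character positions in the marker-free sentence, and the sketch assumes that inside the annotated span (between the first and the last `|`) only positions marked `|` or `?` may be word boundaries.

```python
from typing import Iterable, List, Set


def token_boundaries(tokens: Iterable[str]) -> Set[int]:
    """Character offsets at which an analyzer's tokens start or end."""
    offsets = {0}
    pos = 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets


def find_conflicts(tokens: Iterable[str],
                   boundaries: Iterable[int],
                   candidates: Iterable[int]) -> List[int]:
    """Return offsets where a segmentation disagrees with the annotation."""
    predicted = token_boundaries(tokens)
    required = set(boundaries)
    allowed = required | set(candidates)
    # A boundary marked "|" that the analyzer did not produce.
    conflicts = {b for b in required if b not in predicted}
    if required:
        lo, hi = min(required), max(required)
        # A predicted boundary inside the span that is neither "|" nor "?".
        conflicts |= {p for p in predicted if lo < p < hi and p not in allowed}
    return sorted(conflicts)


# Offsets for また、銀河系にある|いて?座?A?*|の...
boundaries = [9, 14]
candidates = [11, 12, 13]

wrong = ["また", "、", "銀河系", "に", "あるい", "て", "座", "A", "*", "の"]
right = ["また", "、", "銀河系", "に", "ある", "いて", "座", "A*", "の"]

print(find_conflicts(wrong, boundaries, candidates))  # [9, 10]
print(find_conflicts(right, boundaries, candidates))  # []
```

For the あるい|て segmentation, offset 9 is reported because the required boundary after ある is missing, and offset 10 because a boundary was placed where the annotation allows none.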

Files

  • corpus
    • The first column of each .tsv file contains the annotated text.
    • Other columns contain additional information.
  • scripts
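
Given the layout described for the corpus files, the annotated text can be pulled out of a .tsv file with a few lines of Python. This is a minimal sketch: `annotated_texts` is a hypothetical helper, and it assumes UTF-8 files with one record per line, tab-separated columns, and no quoting.

```python
from typing import Iterable, Iterator


def annotated_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the first tab-separated column of every non-empty line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        yield line.split("\t", 1)[0]


# Works on any iterable of lines, for example an open file:
#   with open(path_to_tsv, encoding="utf-8") as f:   # path_to_tsv is a placeholder
#       for text in annotated_texts(f):
#           ...
sample = ["ある|いて?座|\textra\tcolumns\n", "\n", "no-tab-line\n"]
print(list(annotated_texts(sample)))
```

The remaining columns are ignored here because their contents are not specified beyond "additional information".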

References

@INPROCEEDINGS{hayashibe:2017:SIGNL231,
    author    = {林部祐太},
    title     = {日本語部分形態素アノテーションコーパスの構築},
    booktitle = "情報処理学会第231回自然言語処理研究会",
    year      = "2017",
    pages     = "NL-231-9:1-8",
    publisher = "情報処理学会",
}

License
