Different tasks call for different corpora.
Most commonly used corpora (for all tasks):
• Wiki dump: 2 billion words https://dumps.wikimedia.org/
• Baidu Encyclopedia: 5422K http://research.baidu.com/Downloads
• People’s Daily News: 31 million
• Sogou News: 1226K http://www.sogou.com/labs/
• Weibo: 850K http://www.nlpir.org/download/weibo.7z
• Chinese Gigaword(v5) https://catalog.ldc.upenn.edu/ldc2011t13
• OntoNotes (used a lot in ACL 2017 and 2018 papers), includes text from various genres https://catalog.ldc.upenn.edu/LDC2013T19
Corpora used in ACL 2017, 2018 for some specific tasks:
Sentiment Analysis
• HowNet
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1276017
http://www.keenage.com/html/e_index.html
POS Tagging
• Penn Chinese Treebank https://verbs.colorado.edu/chinese/
Summarization
• Large Scale Chinese Short Text Summarization Dataset (LCSTS)
https://arxiv.org/pdf/1506.05865.pdf
http://icrc.hitsz.edu.cn/Article/show/139.html
Poem Generation
• Chinese poem corpus (CPC), Chinese quatrain corpus (CQC)
This step is basically the same as in English text preprocessing.
Eliminate tags, parse sentences from XML…
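As a rough illustration of this cleanup step, the sketch below strips the markup from a small XML snippet and splits the remaining text into sentences. The `<doc>`/`<text>` layout, the `RAW_XML` sample, and the helper names are assumptions made for the example; real dumps each have their own schema.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical corpus snippet; the element names are assumed for illustration.
RAW_XML = """<doc id="1">
  <title>範例</title>
  <text>我就讀清大。目前大二!你呢?</text>
</doc>"""

def extract_text(xml_string, tag="text"):
    """Return the text content of every <tag> element, with markup removed."""
    root = ET.fromstring(xml_string)
    # itertext() also collects text nested inside child elements.
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]

def split_sentences(text):
    """Split on Chinese sentence-final punctuation, keeping the punctuation."""
    return [s.strip() for s in re.findall(r"[^。!?]+[。!?]?", text) if s.strip()]

sentences = [s for block in extract_text(RAW_XML) for s in split_sentences(block)]
print(sentences)
```

Splitting on 。!? is a simplification; quoted speech and ellipses need extra rules in practice.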
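Because Simplified-to-Traditional conversion is one-to-many at the character level, and even at the word level, some drop in performance after conversion is hard to avoid.
Most commonly used toolkits: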
• OpenCC https://github.com/BYVoid/OpenCC
• hanziconv https://pypi.org/project/hanziconv/
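To make the one-to-many problem concrete, here is a toy converter in plain Python. It is not how OpenCC or hanziconv are implemented; the two small tables and the function name `to_traditional` are made up for this sketch. It only shows why a character-by-character mapping picks wrong forms and why phrase-level lookup helps.

```python
# Toy tables for illustration only; real converters ship far larger ones.
# Each Simplified character gets a single default Traditional form.
CHAR_MAP = {"头": "頭", "发": "發", "后": "後"}
# Phrase-level overrides for cases where the default form is wrong.
PHRASE_MAP = {"头发": "頭髮", "皇后": "皇后"}

def to_traditional(text, phrase_map=PHRASE_MAP):
    """Greedy longest-match conversion: try phrases first, then single characters."""
    max_len = max((len(k) for k in phrase_map), default=1)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in phrase_map:
                out.append(phrase_map[chunk])
                i += n
                break
        else:
            # No phrase matched here: fall back to the per-character default.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(to_traditional("皇后的头发"))                 # phrase-aware
print(to_traditional("皇后的头发", phrase_map={}))  # character-only
```

With the phrase table the result is 皇后的頭髮; without it, 后 and 发 fall back to their defaults and the output becomes 皇後的頭發, which is incorrect.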
Unlike in English, word segmentation is probably the most important part of Chinese text preprocessing.
• Character-based: “我|就|讀|清|大|,|目|前|大|二|。”
• Word-based: “我|就讀|清大|,|目前|大二|。”
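The two granularities can be reproduced with a few lines of Python. The word-based version below uses forward maximum matching against a tiny hand-written vocabulary; both `VOCAB` and the function names are assumptions for this sketch, and real segmenters rely on much larger dictionaries and statistical models.

```python
def char_tokenize(text):
    """Character-based: every character is its own token."""
    return list(text)

def max_match_tokenize(text, vocab):
    """Word-based: forward maximum matching against a known vocabulary."""
    max_len = max((len(w) for w in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if n == 1 or chunk in vocab:
                tokens.append(chunk)
                i += n
                break
    return tokens

# Toy vocabulary, assumed for illustration.
VOCAB = {"就讀", "清大", "目前", "大二"}
sentence = "我就讀清大,目前大二。"

print("|".join(char_tokenize(sentence)))
print("|".join(max_match_tokenize(sentence, VOCAB)))
```

Greedy matching is easy to follow but mis-segments ambiguous strings, which is one reason the dedicated tools are preferred in practice.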
The most commonly used tokenization tools include:
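• THULAC (THU Lexical Analyzer for Chinese) (Simplified Chinese), Natural Language Processing and Computational Social Science Lab, Tsinghua University https://github.com/thunlp/THULAC-Python
• Jieba (結巴) (Simplified and Traditional Chinese), developed by Baidu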
https://pypi.org/project/jieba/
https://github.com/fxsjy/jieba
• ChineseWordSegmentation(CWS) https://github.com/Moonshile/ChineseWordSegmentation
• CKIP word segmentation tool (Traditional Chinese), Academia Sinica http://ckip.iis.sinica.edu.tw:8080/contact/
• CoreNLP Stanford (Java) (multi-language) https://stanfordnlp.github.io/CoreNLP/human-languages.html
• FudanNLP, a Chinese natural language processing toolkit https://github.com/FudanNLP/fnlp
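In Chinese NLP tasks, POS tagging is usually not necessary, except for specific tasks such as sentiment analysis.
Most commonly used tools: THULAC and Stanford CoreNLP (both are toolkits that provide word segmentation as well as tagging).
Stop words generally refer to characters and words that contribute nothing to the features of a text, such as punctuation marks, modal particles, and personal pronouns.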
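Stop-word removal is usually applied after segmentation. The following is a minimal sketch; the `STOP_WORDS` set is a small hand-picked sample (punctuation, particles, pronouns) chosen for illustration, not a standard list.

```python
# Illustrative stop-word sample; real lists contain hundreds of entries.
STOP_WORDS = {
    ",", "。", "!", "?",      # punctuation
    "的", "了", "嗎", "吧",      # particles
    "我", "你", "他",            # personal pronouns
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["我", "就讀", "清大", ",", "目前", "大二", "。"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Whether to drop pronouns or particles depends on the task; for sentiment analysis, some particles may carry useful signal.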
Chinese NLP resources
Very detailed! A collection of many different Chinese NLP resources
(including Chinese Word Segmentation tool, Chinese Corpora, Popular Toolkits)
https://github.com/crownpku/Awesome-Chinese-NLP
CA8 Chinese Word Embedding Evaluation tool
(ACL2018) a new Chinese word embedding evaluation toolkit
(the GitHub repository includes word vectors pretrained on different corpora)
https://github.com/Embedding/Chinese-Word-Vectors
http://aclweb.org/anthology/P18-2023
Peking University Open Research Data Platform
http://opendata.pku.edu.cn/
Harbin Institute of Technology (HIT) pretrained ELMo for many languages (including Simplified and Traditional Chinese)
https://github.com/HIT-SCIR/ELMoForManyLangs/blob/master/README.md
Not covered by the links above:
Chinese Xinhua Dictionary database
https://github.com/pwxcoo/chinese-xinhua
GluonNLP
https://github.com/dmlc/gluon-nlp