
Chinese-Preprocessing

Flowchart

1) Find Suitable Corpora

Different tasks call for different corpora.

Most commonly-used corpora (for all tasks):
    • Wiki dump: 2 billion words https://dumps.wikimedia.org/
    • Baidu Encyclopedia: 5422K http://research.baidu.com/Downloads
    • People’s Daily News: 31 million
    • Sogou News: 1226K http://www.sogou.com/labs/
    • Weibo: 850K http://www.nlpir.org/download/weibo.7z
    • Chinese Gigaword (v5) https://catalog.ldc.upenn.edu/ldc2011t13
    • OntoNotes (used heavily in ACL 2017 and 2018): includes text from various genres https://catalog.ldc.upenn.edu/LDC2013T19

Corpora used in ACL 2017 and 2018 for specific tasks:
Sentiment Analysis
    • HowNet
        https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1276017
        http://www.keenage.com/html/e_index.html
POS Tagging
    • Penn Chinese Treebank https://verbs.colorado.edu/chinese/
Summarization
    • Large Scale Chinese Short Text Summarization Dataset (LCSTS)
        https://arxiv.org/pdf/1506.05865.pdf
        http://icrc.hitsz.edu.cn/Article/show/139.html
Poem Generation
    • Chinese poem corpus (CPC), Chinese quatrain corpus (CQC)

2) Extract Needed Content/Sentences (語料清洗)

Basically, this step is the same as for English text preprocessing:
eliminate tags, parse sentences out of the XML, and so on.
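
As a sketch of this step for a Wikipedia dump, gensim's WikiCorpus reader strips the wiki markup and yields one article at a time; the dump file name below is illustrative, and word segmentation is still handled later in step 4.

```python
# A minimal sketch, assuming a zhwiki dump downloaded from
# https://dumps.wikimedia.org/ (the file name below is illustrative).
from gensim.corpora import WikiCorpus

# dictionary={} skips vocabulary building; we only want the cleaned text
wiki = WikiCorpus("zhwiki-latest-pages-articles.xml.bz2", dictionary={})

with open("zhwiki_text.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():          # one article as a list of tokens
        out.write(" ".join(tokens) + "\n")   # one cleaned article per line
```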

3) Simplified/Traditional Chinese Conversion

Because simplified-to-traditional conversion is one-to-many at the character level, and even more so at the word level, some drop in performance after conversion is unavoidable.

Most commonly-used toolkits (see the sketch after the list):
    • OpenCC https://github.com/BYVoid/OpenCC
    • hanziconv https://pypi.org/project/hanziconv/
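
A minimal conversion sketch with OpenCC's Python bindings (assuming the opencc-python-reimplemented package is installed):

```python
# Simplified-to-traditional conversion with OpenCC; "s2t" converts
# Simplified to Traditional, "t2s" goes the other way.
from opencc import OpenCC

cc = OpenCC("s2t")
print(cc.convert("汉字转换"))  # -> 漢字轉換
```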

4) Tokenize (分詞)

Unlike English, Chinese is written without spaces between words, so segmentation is probably the most important step in Chinese text preprocessing (a jieba sketch follows the tool list below).

    • Character-based: “我|就|讀|清|大|,|目|前|大|二|。”
    • Word-based: “我|就讀|清大|,|目前|大二|。”

Most commonly-used tokenization tools include:
    • THULAC (THU Lexical Analyzer for Chinese) (Simplified Chinese), by the Natural Language Processing and Computational Social Science Lab at Tsinghua University https://github.com/thunlp/THULAC-Python
    • Jieba (結巴) (Simplified and Traditional Chinese), developed at Baidu
        https://pypi.org/project/jieba/
        https://github.com/fxsjy/jieba
    • ChineseWordSegmentation (CWS) https://github.com/Moonshile/ChineseWordSegmentation
    • CKIP word segmentation tool (Traditional Chinese), by Academia Sinica http://ckip.iis.sinica.edu.tw:8080/contact/
    • Stanford CoreNLP (Java) (multi-language) https://stanfordnlp.github.io/CoreNLP/human-languages.html
    • FudanNLP, a Chinese NLP toolkit https://github.com/FudanNLP/fnlp
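
A minimal jieba sketch using the example sentence above; the exact segmentation depends on the loaded dictionary (jieba ships a larger dict.txt.big that tends to work better on Traditional Chinese):

```python
import jieba

# jieba.set_dictionary("dict.txt.big")  # optional: bigger dictionary for Traditional Chinese
print(jieba.lcut("我就讀清大,目前大二。"))
# dictionary-dependent, e.g. ['我', '就讀', '清大', ',', '目前', '大二', '。']
```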

5) POS Tagging (詞性標注)

In Chinese NLP pipelines this step is often optional; it is needed mainly for specific tasks such as sentiment analysis.

Most commonly-used tools: THULAC and Stanford CoreNLP (both are full toolkits that provide word segmentation as well as POS tagging).
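
A minimal THULAC-Python sketch; the default model handles Simplified Chinese and both segments and tags:

```python
import thulac

thu = thulac.thulac()                        # seg_only=False: segment + tag
print(thu.cut("我爱北京天安门", text=True))   # -> 我_r 爱_v 北京_ns 天安门_ns
```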

6) Remove Stop Words (去除停用詞)

Stop words are words and symbols that contribute nothing to the features of a text, such as punctuation marks, modal particles, and personal pronouns.
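
A minimal sketch of filtering segmented text against a stop-word list; stopwords.txt is a hypothetical one-word-per-line file (the HIT and Baidu stop-word lists are common choices):

```python
import jieba

# stopwords.txt is hypothetical: one stop word per line
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f}

tokens = jieba.lcut("我就讀清大,目前大二。")
print([t for t in tokens if t.strip() and t not in stopwords])
```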

Other Resources

Resources related to Chinese NLP
Very detailed! Collects many different Chinese NLP resources
(including Chinese word segmentation tools, Chinese corpora, and popular toolkits)
https://github.com/crownpku/Awesome-Chinese-NLP

CA8 Chinese Word Embedding Evaluation tool
(ACL2018) a new Chinese word embedding evaluation toolkit
(the GitHub repo includes word vectors pretrained on different corpora)
https://github.com/Embedding/Chinese-Word-Vectors
http://aclweb.org/anthology/P18-2023

Peking University Open Research Data Platform
http://opendata.pku.edu.cn/

Pretrained ELMo models from Harbin Institute of Technology (HIT) for many languages (including Simplified and Traditional Chinese)
https://github.com/HIT-SCIR/ELMoForManyLangs/blob/master/README.md

Not covered by the link above:
Xinhua Dictionary database (中华新华字典数据库)
https://github.com/pwxcoo/chinese-xinhua

GluonNLP
https://github.com/dmlc/gluon-nlp
