HanziGraph

Visualization for information about Chinese characters via Neo4j & Text augmentation via Chinese characters and words (typos, synonym, antonym, similar entity, numeric, etc.).

Introduction:

I try to integrate several open source Chinese characters or words corpora to build a visualized graph, named HanziGraph, for these characters, motivated by the demand to deal with character-level similarity comparison, nlp data-augumentation, and curiosity (｡･ω･｡)ﾉ
Furthermore, a light Chinese Text Augmentation code is provided based on several clean word-level corpora integrated by myself from open-source datasets like The Contemporary Chinese Dictionary and BigCilin.

File Dependency:

-> corpus -> char_number: 汉字笔画材料
         |-> char_part: 汉字偏旁材料
         |-> char_pronunciation: 汉字拼音材料
         |-> char_similar: 汉字结构分类、四角编码信息、形近字、音近字材料
         |-> char_split: 汉字拆字、简繁字体对照材料
         |-> basic_dictionary_similar.json
         |-> basic_triple.xlsx
         |-> corpus_handian -> word_handian: 现代汉语词典材料、生成的json数据
                           |-> get_handian.py  # 处理汉典数据的脚本
                           |-> combine_n.py  # 合并汉典与词林名词数据的脚本
         |-> corpus_dacilin -> word_dacilin: 大词林材料、生成的json数据
                           |-> get_dacilin.py  # 处理词林数据的脚本
  |-> prepro.py  # 预处理汉字各数据集的脚本
  |-> build_graph.py  # 生成汉字图谱的脚本
  |-> text_augmentation.py  # 利用词典扩增文本的脚本

Dataset

basic_dictionary_similar.json: a file in json format to store a large dictionary, where each entry has the same data structure below:

entry (i.e. each character) = {
                               "split_to(拆字方案)": ["part(偏旁) atom(子字) ...", ...],
                               "has_atom(包含哪些子字)": [atom(子字), ...],
                               "is_atom_of(为哪些字的子字)": [char(字), ...],
                               "has_part(包含哪些偏旁部首)": [part(偏旁), ...],
                               "is_part_of(为哪些字的偏旁部首)": [char(字), ...],
                               "pronunciation(有哪些读音)": [pronunciation(带音调拼音), ...],
                               "number(笔画数)": number(笔画数),
                               "is_simple_to(是哪些繁体的简体形式)": [char(字), ...],
                               "is_traditional_to(是哪些简体的繁体形式)": [char(字), ...],
                               "similar_to(有哪些近似字)": [char(字), ...]
                               }

basic_triple.xlsx: a file in xlsx format has (head, relation, tail) triple in each row transformed from basic_dictionary_similar.json.
hanzi_entity.csv and hanzi_relation.csv: two necessary files in csv format to build the graph in Neo4j, transformed from basic_triple.xlsx, too large to upload here.
corpus4n, corpus4v, corpus4adj, corpus4adv, corpus4typos are dictionary in .json format for text augmentation function.

Command Line:

prepro: this code is used to intergrate corpora into the large dictionary, and then transfor the dictionary to the triple-based data:

python prepro.py

build_graph: this code is used to transfor the triple-based data into entity/relation-based data:

python build_graph.py

import data into Neo4j: this code is used in command line to import entity/relation-based data to Neo4j:

./neo4j-import -into /your_path/neo4j-community-3.5.5/data/databases/graph.db/ --nodes /your_path/hanzi_entity.csv --relationships /your_path/hanzi_relation.csv --ignore-duplicate-nodes=true --ignore-missing-nodes=true

start up Neo4j: open the browser, input the address http://localhost:7474/, check the graph:

./neo4j console

generate dictionaries for text augmentation: using functions from get_handian.py and get_dacilin.py, and function create_corpus4typos in prepro.py to generate dictionaries for text augmentation.
text augmentation: using functions in this code to generate new samples. Run the code to see examples. The detailed design method could be found here.

python text_augmentation.py

Requirements

Python = 3.6.9
Neo4j = 3.5.5
pypinyin = 0.41.0
pandas = 0.22.0
fuzzywuzzy = 0.17.0
LTP 4 = 4.0.9
tqdm = 4.39.0

References

pinyin-data、phrase-pinyin-data、pypinyin by Huang Huang.
语言文字规范标准 from 国家语言文字信息管理司.
SimilarCharacter by XiaoFang.
CharMap by guo-yong-zhi.
漢語拆字字典 by 開放詞典(kfcd).
funNLP by Yang fighting41love.
《现代汉语词典》（第7版） by CNMan.
chinese-xinhua 中华新华字典数据库 by Xiance Wu.
CJKV (Chinese Japanese Korean Vietnamese) Ideograph Database by CJKVI.

📖 See more ...

Cite this work as:

Chen J., He Z., Zhu Y., Xu L. (2021) TKB²ert: Two-Stage Knowledge Infused Behavioral Fine-Tuned BERT. In: Wang L., Feng Y., Hong Y., He R. (eds) Natural Language Processing and Chinese Computing. NLPCC 2021. Lecture Notes in Computer Science, vol 13029. Springer, Cham. https://doi.org/10.1007/978-3-030-88483-3_35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

corpus

corpus

README.md

README.md

build_graph.py

build_graph.py

prepro.py

prepro.py

text_augmentation.py

text_augmentation.py

Repository files navigation

HanziGraph

Introduction:

File Dependency:

Dataset

Command Line:

Requirements

References

Cite this work as:

Use Case

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
corpus		corpus
README.md		README.md
build_graph.py		build_graph.py
prepro.py		prepro.py
text_augmentation.py		text_augmentation.py

Schlampig/HanziGraph

Folders and files

Latest commit

History

Repository files navigation

HanziGraph

Introduction:

File Dependency:

Dataset

Command Line:

Requirements

References

Cite this work as:

Use Case

About

Topics

Resources

Stars

Watchers

Forks

Languages