# XDHYCD7th dictionary processing

I found a Chinese dictionary here: https://github.com/CNMan/XDHYCD7th/blob/master/XDHYCD7th.txt

In a notebook, I removed the preamble and postamble.

Preamble:

拟在“华宇拼音输入法论坛”网友wangyanhan制作的《现代汉语词典》第5版全文TXT基础上更新到《现代汉语词典》第7版
项目地址：https://github.com/CNMan/XDHYCD7th
欢迎各路网友参与、协作修订
原则上字、词头只增不减（即不删除新版删掉的字、词头），字、词头释义合并除外
备用黑色圆圈数字：❶❷❸❹❺❻❼❽❾❿⓫⓬⓭⓮⓯⓰⓱⓲⓳⓴
备用白色圆圈数字：①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳㉑㉒㉓㉔㉕㉖㉗㉘㉙㉚㉛㉜㉝㉞㉟㊱㊲㊳㊴㊵㊶㊷㊸㊹㊺㊻㊼㊽㊾㊿
备用上标数字：⁰¹²³⁴⁵⁶⁷⁸⁹
备用下标数字：₀₁₂₃₄₅₆₇₈₉
备用汉语拼音小写：āáǎàōóǒòēéěèīíǐìūúǔùüǖǘǚǜêê̄ếê̌ềm̄ḿm̀ńňǹẑĉŝŋ
备用汉语拼音大写：ĀÁǍÀŌÓǑÒĒÉĚÈĪÍǏÌŪÚǓÙÜǕǗǙǛÊÊ̄ẾÊ̌ỀM̄ḾM̀ŃŇǸẐĈŜŊ
━━━━━━━━━━━━━━━
《现代汉语词典》（第7版）全文TXT
━━━━━━━━━━━━━━━


The postamble is just an empty line.
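The stripping step could also be scripted. The following is a minimal sketch, not the method used here. It assumes the preamble ends at the second `━` rule line (as in the listing above) and that the postamble consists only of blank lines. `strip_preamble` is a hypothetical helper name.

```python
def strip_preamble(lines, rule_char="━"):
    """Return the lines after the second rule line, without blank lines at either end."""
    rules_seen = 0
    start = 0
    for i, line in enumerate(lines):
        text = line.strip()
        # A rule line is non-empty and made up solely of the rule character
        if text and set(text) == {rule_char}:
            rules_seen += 1
            if rules_seen == 2:  # assumption: the preamble ends at the second rule
                start = i + 1
                break

    body = list(lines[start:])
    # Drop blank lines at both ends (the postamble is an empty line)
    while body and not body[0].strip():
        body.pop(0)
    while body and not body[-1].strip():
        body.pop()
    return body


sample = [
    "项目地址：https://github.com/CNMan/XDHYCD7th",
    "━━━",
    "《现代汉语词典》（第7版）全文TXT",
    "━━━",
    "",
    "【吖】ā见下。",
    "",
]
print(strip_preamble(sample))
```

If fewer than two rule lines are found, the sketch leaves the start of the file untouched and only trims blank lines.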

In [6]:
import pandas as pd
from tqdm import tqdm
import re

In [7]:
fileName = "../../../RawData/Dictionaries/Chinese/XDHyCD7th_stripped.txt"

In [8]:
# Initialize lists to store extracted data
hanzi_list = []
pinyin_list = []
definition_list = []

# Pattern: 【headword】, then the pinyin (letters with tone marks), then the rest of the line as the definition
pattern = r'【(.*?)】[^()]*?([A-Za-z0-9•ɑāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜüńňǹɡĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜŃŇǸ]*)(.*)'

# Determine the total number of lines (for tqdm progress bar)
with open(fileName, 'r', encoding='utf-8') as file:
    total_lines = sum(1 for _ in file)

# Read the file line by line and extract data
with open(fileName, 'r', encoding='utf-8') as file:
    for line in tqdm(file, total=total_lines, desc='Processing'):
        match = re.match(pattern, line)

        if not match:
            # Show lines that do not look like a 【headword】 entry, then move on
            print(line)
            continue

        hanzi, pinyin, definition = match.groups()
        hanzi_list.append(hanzi)
        pinyin_list.append(pinyin)
        definition_list.append(definition)

# Create a DataFrame
raw = pd.DataFrame({
    'Hanzi': hanzi_list,
    'Pinyin': pinyin_list,
    'Definition': definition_list
})

# Display the DataFrame
raw.head()

Processing: 100%|█████████████████████████████████████████████████████████████| 70414/70414 [00:00<00:00, 228307.36it/s]


| | Hanzi | Pinyin | Definition |
|---|---|---|---|
| 0 | 吖 | ā | 见下。 |
| 1 | 吖嗪 | āqín | 〈名〉有机化合物的一类，呈环状结构，含有一个或几个氮原子，如吡啶、哒嗪、嘧啶等。[英azine] |
| 2 | 阿 | ā | 〈方〉前缀。❶用在排行、小名或姓的前面，有亲昵的意味：～大｜～宝｜～唐。❷用在某些亲属名称的... |
| 3 | 阿鼻地狱 | ābídìyù | 佛教指最深层的地狱，是犯了重罪的人死后灵魂永远受苦的地方。[阿鼻，梵avīci] |
| 4 | 阿昌族 | Āchānɡzú | 〈名〉我国少数民族之一，分布在云南。 |


In [9]:
raw[raw.Hanzi == '襄理']

| | Hanzi | Pinyin | Definition |
|---|---|---|---|
| 56844 | 襄理 | xiānɡlǐ | ❶〈书〉〈动〉帮助办理：～军务。❷〈名〉规模较大的银行或企业中协助经理主持业务的人，地位次于协理。 |


In [10]:
raw[raw.Hanzi == '做证']

| | Hanzi | Pinyin | Definition |
|---|---|---|---|
| 70410 | 做证 | | |


## Different meanings

The different senses of a word are disambiguated with circled-number markers (❶, ❷, and so on). Teasing the senses apart might take a little while.
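To illustrate how such markers behave with `re.split`, here is a small sketch. It uses a shortened marker class (`DEMO_MARKERS`, a hypothetical name covering only ❶ to ❸) and the 襄理 definition from the output above.

```python
import re

# Shortened marker set for illustration only; the real text uses many more markers
DEMO_MARKERS = r'([❶❷❸])'

text = "❶〈书〉〈动〉帮助办理：～军务。❷〈名〉规模较大的银行或企业中协助经理主持业务的人，地位次于协理。"

# Because the pattern has a capturing group, re.split keeps the markers:
# the result alternates text and marker, starting with the text before the first marker
parts = re.split(DEMO_MARKERS, text)

# The sense texts sit at the even positions after the first marker
senses = [p for p in parts[2::2] if p]

print(parts[:2])
print(senses)
```

Here `parts[0]` is an empty string because the definition starts with a marker. When an entry has shared text before the first marker, it shows up in `parts[0]` instead.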

## Pinyin

Right now the pinyin is kept in a separate column. How to display phonetic information hasn't been considered yet, so for now let's just prepend it to the definition and deal with that problem later.

## Conclusion
 
Splitting each entry on the sense markers and merging the pinyin into the definition should do for now.

## 见下 - "see below"

Some definitions are just a pointer, 见下 ("see below"). We'll strip those out, since dropping them loses no information.
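One way to drop such rows is with the vectorized `Series.str.endswith`. This is a sketch on a tiny stand-in frame (`demo` is hypothetical; its rows are copied from this notebook's outputs), and it assumes every definition is a string.

```python
import pandas as pd

# Tiny stand-in for the exploded frame
demo = pd.DataFrame({
    "Hanzi": ["虼", "辱骂"],
    "Definition": ["ɡè。见下。", "rǔmà。〈动〉污辱谩骂。"],
})

# True for definitions that end with the "see below" pointer
is_see_below = demo["Definition"].str.endswith("见下。")

# Keep everything else
kept = demo[~is_see_below]
print(kept["Hanzi"].tolist())
```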

## Missing definitions

Some words, such as 做证, don't have a definition. We'll skip those.
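Before skipping them, it can help to list which headwords are affected. A minimal sketch, using a small stand-in frame (`demo` is hypothetical; its rows are taken from the outputs above):

```python
import pandas as pd

# Tiny stand-in for `raw`
demo = pd.DataFrame({
    "Hanzi": ["吖", "做证"],
    "Pinyin": ["ā", ""],
    "Definition": ["见下。", ""],
})

# Rows where the regex captured neither pinyin nor definition text
empty_rows = demo[(demo["Pinyin"] == "") & (demo["Definition"] == "")]
print(empty_rows["Hanzi"].tolist())
```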

In [11]:
# Regular expression pattern for disambiguation markers
markers_pattern = r'(❶|❷|❸|❹|❺|❻|❼|❽|❾|❿|⓫|⓬|⓭|⓮|⓯|⓰|⓱|⓲|⓳|⓴|①|②|③|④|⑤|⑥|⑦|⑧|⑨|⑩|⑪|⑫|⑬|⑭|⑮|⑯|⑰|⑱|⑲|⑳|㉑|㉒|㉓|㉔|㉕|㉖|㉗|㉘|㉙|㉚|㉛|㉜|㉝|㉞|㉟|㊱|㊲|㊳|㊴|㊵|㊶|㊷|㊸|㊹|㊺|㊻|㊼|㊽|㊾|㊿)'


# Split a definition into a list with one string per numbered sense
def split_definitions(definition):
    # The capturing group makes re.split return [prefix, marker, text, marker, text, ...]
    parts = re.split(markers_pattern, definition)
    if len(parts) <= 1:  # No markers found, keep the definition as is
        return [definition]

    # Text before the first marker (may be empty)
    initial_description = parts[0]

    # Sometimes there is text before the first marker; for now we prepend it to every sense
    if initial_description != '':
        senses = [initial_description + parts[i] for i in range(2, len(parts), 2) if parts[i]]
    else:
        # Empty strings that survive here are filtered out in a later cell
        senses = [part for part in parts[::2] if not part.isspace()]

    return senses


raw['Definition'] = raw['Definition'].apply(split_definitions)

# Explode the lists into separate rows
raw_exploded = raw.explode('Definition')

In [12]:
raw_exploded = raw_exploded[raw_exploded.Definition != "。"]
raw_exploded = raw_exploded[raw_exploded.Definition != ""]

In [13]:
raw_exploded["Definition"] = raw_exploded["Pinyin"] + "。" + raw_exploded["Definition"]

In [14]:
raw_exploded[raw_exploded.Definition.apply(lambda s : s[-3:] == "见下。")].sample(3)

| | Hanzi | Pinyin | Definition |
|---|---|---|---|
| 191 | 叆 | | 。（靉）ài见下。 |
| 17587 | 虼 | ɡè | ɡè。见下。 |
| 3794 | 哱 | bō | bō。见下。 |
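The 叆 row has an empty Pinyin, apparently because the headword is followed by a traditional form in full-width parentheses, so the pinyin character class matches nothing and everything lands in the definition. A possible variant of the pattern is sketched below. It is an untested idea rather than the pattern used in this notebook: `variant` is a hypothetical name, the sample line is reconstructed from the output above, and the variant discards the parenthesised form.

```python
import re

# Pinyin character class copied from the notebook's pattern
PINYIN = r'[A-Za-z0-9•ɑāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜüńňǹɡĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜŃŇǸ]*'

# Hypothetical variant: optionally skip one full-width parenthesised form after the headword
variant = r'【(.*?)】(?:（[^）]*）)?(' + PINYIN + r')(.*)'

# Line reconstructed from the output above (an assumption about the raw text)
line = "【叆】（靉）ài见下。"

hanzi, pinyin, definition = re.match(variant, line).groups()
print(hanzi, pinyin, definition)
```

Entries without a parenthesised form are matched the same way as before, since the extra group is optional.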


In [15]:
raw_exploded = raw_exploded[raw_exploded.Definition.apply(lambda s : s[-3:] != "见下。")] # drop "see belows"

In [16]:
raw_exploded = raw_exploded.drop(columns = ["Pinyin"]).reset_index(drop = True).reset_index()

In [17]:
raw_exploded.sample(5)

| | index | Hanzi | Definition |
|---|---|---|---|
| 29569 | 29569 | 火力圈 | huǒlìquān。〈名〉在一个区域内各种火力所及的范围。 |
| 45620 | 45620 | 瞑 | mínɡ。眼花：耳聋目～。 |
| 83384 | 83384 | 只要 | zhǐyào。〈连〉表示必要的条件（下文常用“就”或“便”呼应）：～肯干，就会干出成绩来｜～... |
| 54933 | 54933 | 辱骂 | rǔmà。〈动〉污辱谩骂。 |
| 66158 | 66158 | 顽健 | wánjiàn。〈书〉〈形〉谦称自己身体强健。 |


In [18]:
processed = raw_exploded.rename(columns = {"index" : "ID", "Hanzi" : "Word"})

In [19]:
processed.head()

| | ID | Word | Definition |
|---|---|---|---|
| 0 | 0 | 吖嗪 | āqín。〈名〉有机化合物的一类，呈环状结构，含有一个或几个氮原子，如吡啶、哒嗪、嘧啶等。[... |
| 1 | 1 | 阿 | ā。〈方〉前缀。用在排行、小名或姓的前面，有亲昵的意味：～大｜～宝｜～唐。 |
| 2 | 2 | 阿 | ā。〈方〉前缀。用在某些亲属名称的前面：～婆｜～爸｜～哥。另见2页•ɑ；339页ē。 |
| 3 | 3 | 阿鼻地狱 | ābídìyù。佛教指最深层的地狱，是犯了重罪的人死后灵魂永远受苦的地方。[阿鼻，梵avīci] |
| 4 | 4 | 阿昌族 | Āchānɡzú。〈名〉我国少数民族之一，分布在云南。 |


In [20]:
import os

save_dir = "../../../ProcessedData/Dictionaries/Chinese/"
# Create the output directory if it does not exist yet
os.makedirs(save_dir, exist_ok=True)

In [21]:
processed.to_csv(save_dir + "XDHyCD7th.csv", encoding='utf-16', index=False)
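Because the file is written as UTF-16, the same encoding has to be passed when reading it back. A minimal round-trip sketch, using a temporary directory and a one-row stand-in frame (`demo` and the file name are hypothetical) instead of the real output path:

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for `processed`
demo = pd.DataFrame({
    "ID": [0],
    "Word": ["辱骂"],
    "Definition": ["rǔmà。〈动〉污辱谩骂。"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    demo.to_csv(path, encoding="utf-16", index=False)
    # The encoding must match the one used for writing
    restored = pd.read_csv(path, encoding="utf-16")

print(restored)
```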