Chengyu are Chinese idioms, typically four characters long, that are known for packing a lot of meaning into very few characters. They also matter for learning Chinese at an advanced level, since using chengyu is generally seen as a sign of a good education. As a Chinese learner myself, I wanted a way to keep track of chengyu, decide which ones to learn first, and help myself learn them, so I collected a dataset of information about some of the most common ones. An example row from the finished dataset is shown below.

In [3]:
import pandas as pd
df = pd.read_csv('Chengyu-Final.csv') #read csv file
del df['Unnamed: 0'] #get rid of column of numbers
print(df.iloc[1]) #print the 2nd row of the dataset

Chengyu                                                             牙牙学语
Appearances                                                            2
Topic                                                            culture
Definition                                       《庄子·逍遥游》：“鹪鹩巢于深林，不过一枝。”
Explanation                                           全面考核事物的称说是不是与实际相符。
Frequency                                                              2
Pinyin                                                      yá yá xué yǔ
English Definition     "Zhuangzi, Xiaoyaoyou": "The wren nests in the...
English Explanation    A comprehensive assessment of whether the clai...
Example                          第一次对阅读萌发兴趣，往往就始于牙牙学语之时，耳旁响起的父母读出的一段段文字。
Name: 1, dtype: object


The first thing I needed was a list of chengyu. I took one from 《<a href="https://ankiweb.net/shared/info/1464206393">成语小酷</a>》 (a small collection of chengyu), an online deck for the digital flashcard program Anki that is intended for native speakers. The deck lists each chengyu with a poetically worded definition and an explanation of that definition in more everyday language. I exported the deck from the Anki app as a text file and extracted these fields with regular expressions. Beyond this basic information, I also wanted English translations of the definitions and explanations, a Pinyin transcription, an example sentence, a frequency score, and the topic that each chengyu typically deals with.

I got the translations from the Google and DeepL translation APIs. I used Google for the poetically worded definitions because, when I tested it on sample definitions, it appeared to handle the more classical language better. I used DeepL for the more modern explanations, which it seemed better suited to. Because the Google API has stricter usage limits, translating the definitions was more of a hassle: I had to split them into several batches instead of translating them all at once.

To get the transcriptions, I used the pypinyin library to generate a Pinyin transcription of each chengyu. This was the simplest of all the data categories to collect.

Getting the example sentences was also fairly simple. I downloaded a set of Chinese sentences from a HuggingFace dataset called <a href="https://huggingface.co/datasets/madao33/new-title-chinese">new-title-chinese</a>, joined all of its sentences into one string, and ran a regular expression over that string to pull out a sentence containing each chengyu. This did not find an example for every chengyu, even though I had earlier filtered out the chengyu that do not appear in the dataset, so there may be a problem with the RegEx. Only a few chengyu are missing example sentences, though, and they are mostly uncommon ones, so it doesn't bother me much.
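A minimal sketch of this kind of sentence lookup is shown here. It is not the original code: the function name `find_example`, the tiny sample corpus, and the choice of `。！？` plus newlines as sentence boundaries are all assumptions made for illustration.

```python
import re

def find_example(chengyu, corpus):
    """Return the first sentence in `corpus` that contains `chengyu`, or None."""
    # Assumption: a "sentence" is a run of characters with no sentence-ending
    # punctuation or newline, followed by an optional closing 。！ or ？.
    pattern = re.compile(
        r"[^。！？\n]*" + re.escape(chengyu) + r"[^。！？\n]*[。！？]?"
    )
    match = pattern.search(corpus)
    return match.group(0).strip() if match else None

# Hypothetical stand-in for the joined new-title-chinese sentences.
corpus = "\n".join([
    "今天天气很好。",
    "他迫不及待地打开了礼物。大家都笑了。",
])

print(find_example("迫不及待", corpus))  # 他迫不及待地打开了礼物。
print(find_example("牙牙学语", corpus))  # None
```

Joining with newlines and excluding `\n` from the match keeps a result from running across two dataset entries that lack closing punctuation.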

To get the frequency, I counted how many times each chengyu appears in the <a href="https://huggingface.co/datasets/madao33/new-title-chinese">new-title-chinese</a> dataset. To speed this up, instead of calling the count() method once per chengyu, I wrote a function that returns a counter object tallying every substring of a given length in a string, including overlapping ones. Based on these appearance counts, I then gave each chengyu a frequency score from 1 to 5.
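The counting idea can be sketched roughly as follows. The names `count_substrings` and `frequency_score` are illustrative, and the cut-off values in `frequency_score` are hypothetical placeholders, since the thresholds actually used for the 1 to 5 score are not stated here.

```python
from bisect import bisect_right
from collections import Counter

def count_substrings(text, length):
    """Count every substring of `length` characters in `text`, overlaps included."""
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))

def frequency_score(appearances, cutoffs=(1, 5, 20, 100)):
    """Map an appearance count to a score from 1 to 5.

    The cut-offs are hypothetical; each one reached raises the score by one.
    """
    return 1 + bisect_right(cutoffs, appearances)

# One pass over the text yields the counts for every four-character window,
# so each chengyu lookup afterwards is a single dictionary access.
counts = count_substrings("迫不及待迫不及待", 4)
print(counts["迫不及待"])                   # 2
print(frequency_score(counts["迫不及待"]))  # 2
```

The speed-up comes from scanning the corpus once rather than once per chengyu. A missing key in a `Counter` returns 0, so chengyu that never appear need no special handling.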

The methods I used to get the topic of each chengyu were relatively computationally intensive, so I used the Julia programming language for most of the process. The first part was in Python, however, where I used a pre-trained Chinese topic classification model called <a href="https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese">uer/roberta-base-finetuned-chinanews-chinese</a> to categorize each sentence in the <a href="https://huggingface.co/datasets/madao33/new-title-chinese">new-title-chinese</a> dataset. Then, in Julia, I gathered every sentence that contained a chengyu (again using RegEx) and iterated through the chengyu, collecting all of the sentences each one appeared in. For each chengyu, I looked up the previously classified topic of each of those sentences, built a counter over the topics, and selected the most common one.

However, I noticed that the topic 'mainland Chinese politics' accounted for a large majority of the topics in the sentence dataset. To compensate, I wrote a function that scales each counter value by an amount that represents how underrepresented its topic is compared to the most common topic ('mainland Chinese politics'). The function also manually sets the adjustment for the topic 'None' to 1, so that it would not receive more chengyu than it should. This allowed for more diversity in the topics.
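The adjustment step might look something like the sketch below. It is written in Python rather than Julia, and the weighting formula (the dominant topic's corpus count divided by each topic's own corpus count) is an assumption; the text only says that the adjustment reflects how underrepresented a topic is. The name `adjusted_topic` and the sample counts are made up for illustration.

```python
from collections import Counter

def adjusted_topic(chengyu_topics, corpus_topics):
    """Pick a topic for one chengyu, up-weighting topics that are rare in the corpus."""
    top_count = max(corpus_topics.values())
    adjusted = {}
    for topic, n in chengyu_topics.items():
        # Assumed formula: rarer topics get a proportionally larger weight.
        # 'None' is pinned to a weight of 1 so it is never boosted.
        weight = 1.0 if topic == "None" else top_count / corpus_topics[topic]
        adjusted[topic] = n * weight
    return max(adjusted, key=adjusted.get)

# Hypothetical topic counts over the whole sentence dataset.
corpus_topics = Counter({
    "mainland Chinese politics": 800,
    "culture": 100,
    "sports": 100,
    "None": 50,
})
# Hypothetical topic counts for the sentences containing one chengyu.
chengyu_topics = Counter({"mainland Chinese politics": 5, "culture": 2, "None": 1})

print(chengyu_topics.most_common(1)[0][0])            # mainland Chinese politics
print(adjusted_topic(chengyu_topics, corpus_topics))  # culture
```

With these sample numbers, the raw counter picks the dominant topic, while the weighted version picks 'culture' because its two sentences count for much more once corpus-wide rarity is factored in.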

As a final tool for myself, I made a CLI that can either give information on a single chengyu or go through a sentence, find all of the chengyu in it, and give information on each of them. I added the second mode because it is sometimes difficult for me to tell which words in a Chinese sentence are chengyu. Example calls of the two modes are shown below. The Jupyter notebook kernel seems unable to handle the UTF-8 encoded characters, but the commands work perfectly fine on the normal command line.

In [None]:
!python get_info.py chengyu 迫不及待

In [None]:
!python get_info.py sentence 他迫不及待地吃个饭。
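For readers curious how such a two-mode tool could be put together, here is a minimal sketch. It is not the actual get_info.py: the functions `find_chengyu`, `build_parser`, and `run` are hypothetical, the `known` set stands in for the dataset's Chengyu column, and the sentence scan assumes every chengyu is exactly four characters long.

```python
import argparse

def find_chengyu(sentence, known, length=4):
    """Return the known chengyu found in `sentence`, in order of appearance."""
    found = []
    # Slide a window of `length` characters across the sentence.
    for i in range(len(sentence) - length + 1):
        window = sentence[i:i + length]
        if window in known and window not in found:
            found.append(window)
    return found

def build_parser():
    """Build a parser with a mode argument and the text to look up."""
    parser = argparse.ArgumentParser(description="Look up chengyu.")
    parser.add_argument("mode", choices=["chengyu", "sentence"])
    parser.add_argument("text")
    return parser

def run(argv, known):
    """Return the list of chengyu that the given arguments refer to."""
    args = build_parser().parse_args(argv)
    if args.mode == "chengyu":
        return [args.text] if args.text in known else []
    return find_chengyu(args.text, known)

# Hypothetical stand-in for the set of chengyu in the dataset.
known = {"迫不及待", "牙牙学语"}
print(run(["chengyu", "迫不及待"], known))              # ['迫不及待']
print(run(["sentence", "他迫不及待地吃个饭。"], known))  # ['迫不及待']
```

A real tool would then print the dataset row for each chengyu returned, for example by filtering the DataFrame on its Chengyu column.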