<a href="https://colab.research.google.com/github/Huangphoux/standard-han-nom/blob/main/standard_han_nom.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone -q https://github.com/Huangphoux/standard-han-nom
%cd /content/standard-han-nom/
!pip install -q -r requirements.txt
!pip install -q genanki
%mkdir results

import pandas as pd
import numpy as np
import genanki

from google.colab import files

# Load the two source tables and the reformed-Chinese mapping
standard_table_lv1 = pd.read_excel("standard-han-nom-lv1.xlsx")
standard_table_lv2 = pd.read_excel("standard-han-nom-lv2.xlsx")
reform = pd.read_csv('reformed-chinese.tsv', sep='\t')

/content/standard-han-nom
ERROR: Could not find a version that satisfies the requirement pywin32==308 (from versions: none)
ERROR: No matching distribution found for pywin32==308

## 2. Data preprocessing

Although the source claims the tables contain 3,975 characters, the converted tables initially held only 3,974. I planned to deal with this problem later.
Update: the cause is row 2192. The converter did not read the character 洛 in this row (possibly because it is not fully displayed in the PDF file), so one character was missing.

The converter did a good job overall, but the resulting table is not perfect. One issue is the stray column "Unnamed: 3", whose contents should have been merged into the "Examples" column.

In [2]:
standard_table_lv1['Examples'] = standard_table_lv1['Examples'].str.cat(standard_table_lv1['Unnamed: 3'], sep = ' ', na_rep='')

standard_table_lv1.drop('Unnamed: 3', axis=1, inplace=True)

standard_table_lv1.loc[2192, 'Character'] = '洛'
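
To make the merge step easier to follow, here is a minimal sketch of how `str.cat` behaves with `na_rep=''` on a toy table. The values are made up for illustration and are not rows from the real spreadsheet.

```python
import pandas as pd

# Toy stand-in for the converted table; the values are made up for illustration.
toy = pd.DataFrame({
    "Examples": ["a tòng", "êm ả", None],
    "Unnamed: 3": ["a dua", None, "á kim"],
})

# na_rep="" treats a missing cell on either side as an empty string,
# so the row is kept instead of becoming NaN.
merged = toy["Examples"].str.cat(toy["Unnamed: 3"], sep=" ", na_rep="")

# The separator is still inserted next to an empty side, so strip stray spaces.
stripped = merged.str.strip()
print(stripped.tolist())
```

Without `na_rep`, any row with a missing cell on either side would come out as NaN, which is probably why the notebook passes it explicitly.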

Next, I deal with the missing values in the Reading, Character, and unicode columns (the unicode column is created from the Note column later).

In [3]:
# The Reading column appears to contain a missing value; the line below lists the affected rows
# standard_table_lv1[standard_table_lv1['Reading'].isnull()]

In [4]:
# The readings "NA" and "nan" are genuine readings, but pandas parsed them as missing values (NaN), so I restore them.
standard_table_lv1.loc[2812, 'Reading'] = "NA"
standard_table_lv2.loc[1904, 'Reading'] = "nan"
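
As a possible alternative to patching the two cells by hand, the sketch below shows why pandas turns these readings into NaN and how `keep_default_na=False` prevents it. The CSV text and the placeholder characters X, Y, Z are hypothetical; only the readings "NA" and "nan" come from the notebook.

```python
import io
import pandas as pd

# Hypothetical two-column extract; "NA" and "nan" are genuine readings here.
raw = "Character,Reading\nX,NA\nY,nan\nZ,xang\n"

# By default, strings such as "NA" and "nan" are treated as missing values.
default = pd.read_csv(io.StringIO(raw))

# With keep_default_na=False they are kept as literal text.
literal = pd.read_csv(io.StringIO(raw), keep_default_na=False)

print(default["Reading"].isna().tolist())
print(literal["Reading"].tolist())
```

`pd.read_excel` accepts the same parameter. The trade-off is that empty cells would then load as empty strings rather than NaN, so the later forward-fill step would need them converted back first. Patching the two known rows, as done above, avoids that.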

In [5]:
standard_table_lv1 = pd.concat([standard_table_lv1, standard_table_lv2], ignore_index=True)

In [6]:
# Add a 'unicode' column extracted from the 'Note' column.
# Each entry is a list of code points (rows without a code point become NaN).
standard_table_lv1['unicode'] = standard_table_lv1['Note'].str.findall(r'(U\+[0-9A-Fa-f]+)')
standard_table_lv1['unicode'] = standard_table_lv1['unicode'].apply(lambda x: np.nan if x == [] else x)

# Strip the code points, and the empty parentheses they leave behind, from 'Note'
standard_table_lv1['Note'] = standard_table_lv1['Note'].str.replace(r'(U\+[0-9A-Fa-f]+)', "", regex=True)
standard_table_lv1['Note'] = standard_table_lv1['Note'].str.replace('()', "", regex=False)

#standard_table_lv1.drop('Note', axis=1, inplace=True)
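
The extraction can be checked on a small sample. This sketch assumes the code points appear in the Note text in a "(U+XXXX)" shape, which is inferred from the cleanup code rather than confirmed; `to_char` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

# Made-up Note values in the "(U+XXXX)" shape the regex above targets.
notes = pd.Series(["[翻] (U+963F)", "(U+554A)", None])

# findall returns one list of matches per row.
codes = notes.str.findall(r"(U\+[0-9A-Fa-f]+)")

def to_char(code):
    """Convert a 'U+XXXX' string to the character it names (hypothetical helper)."""
    return chr(int(code[2:], 16))

print(codes[0], codes[1])
print(to_char("U+963F"), to_char("U+554A"))
```

Converting a code point back to its character like this is one way to cross-check the Character column against the unicode column after the missing values have been filled.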

In [7]:
# standard_table_lv1

Now I will fill the NaN values in the 'Character' and 'unicode' columns. Every character is assigned at least one code point, and there are 3,975 characters and 3,975 lists, so each character is associated with exactly one list.

This gives the following layout (keep in mind that one character may have n readings):

| Character                         | Reading   | unicode                                     |
|-----------------------------------|-----------|---------------------------------------------|
| the character can be in this cell | reading_1 | the list of code points can be in this cell |
| or this cell                      | reading_2 | or this cell                                |
| or this one                       | reading_n | or this one                                 |

I assume that the list of code points sits in the same row as reading_1.
The character is expected to be in that row too, but that is not always the case. The next code cell shows that the character is found only in the row of reading_1 or the row of reading_2.


In [8]:
# Rows that carry a code-point list but no character: the character sits one row below
temp1 = standard_table_lv1['Character'].isna() & ~standard_table_lv1['unicode'].isna()
special_index_list = []

for index, value in temp1.items():
    if value:
        special_index_list.append(index)
        special_index_list.append(index+1)

# standard_table_lv1.iloc[special_index_list]

In [9]:
# Move each character from its reading_2 row back up to its reading_1 row.
for index, value in temp1.items():
    if value:
        standard_table_lv1.loc[index, 'Character'] = standard_table_lv1.loc[index + 1, 'Character']
        standard_table_lv1.loc[index + 1, 'Character'] = np.nan

In [10]:
# Forward-fill the remaining NaN values in the Character and unicode columns
standard_table_lv1['Character'] = standard_table_lv1['Character'].ffill()
standard_table_lv1['unicode'] = standard_table_lv1['unicode'].ffill()

standard_table_lv1['Note'] = standard_table_lv1['Note'].fillna('')
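
The move-then-fill logic can be sketched without an explicit loop. This is an illustrative toy, not the notebook's method: the second reading of 亞 is invented, and the code points are plain strings here rather than lists.

```python
import numpy as np
import pandas as pd

# Toy table: 啊 landed on its reading_2 row, 亞 is already on its reading_1 row.
toy = pd.DataFrame({
    "Character": [np.nan, "啊", "亞", np.nan],
    "Reading": ["à", "ã", "Á", "á"],
    "unicode": ["U+554A", np.nan, "U+4E9E", np.nan],
})

# Rows that have a code point but no character: the character is one row below.
misplaced = toy["Character"].isna() & toy["unicode"].notna()

# Pull the character up, then blank the row it came from.
toy.loc[misplaced, "Character"] = toy["Character"].shift(-1)[misplaced]
toy.loc[misplaced.shift(1, fill_value=False), "Character"] = np.nan

# Forward-fill so every reading row carries its character and code point.
toy[["Character", "unicode"]] = toy[["Character", "unicode"]].ffill()
print(toy)
```

The forward fill is only safe after the move. Otherwise the misplaced row would inherit the previous character instead of its own.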



Now we will assign each character a grade from 1 to 6. Characters from the level-2 list get grade 7, and characters found in none of the lists get grade 8.

In [11]:
# First, I extracted the characters from the PDF with the three lines of code below,
# then copied the results into .txt files manually.

# from PyPDF2 import PdfReader
# reader = PdfReader("characters-by-grade.pdf")
# print(reader.pages[7].extract_text())

In [12]:
characters_by_grade = {}
characters_lv2 = []

for grade in range(1, 7):  # grades 1 to 6
    with open(f'new-characters-by-grade/grade{grade}.txt', 'r', encoding='utf8') as file:
        characters_by_grade[grade] = file.read().split()

with open('new-characters-by-grade/level2.txt', 'r', encoding='utf8') as file:
    characters_lv2 = file.read().split()

standard_table_lv1['grade'] = -1

for grade in range(1, 7):
    for character in characters_by_grade[grade]:
        row = standard_table_lv1.loc[standard_table_lv1['Character'].str.startswith(character, na=False)]
        standard_table_lv1.loc[row.index, 'grade'] = grade

# 7: in the level-2 list
# 8: not in any list

for character in characters_lv2:
    row = standard_table_lv1.loc[standard_table_lv1['Character'].str.startswith(character, na=False)]
    standard_table_lv1.loc[row.index, 'grade'] = 7

standard_table_lv1['grade'] = standard_table_lv1['grade'].apply(lambda x: 8 if x == -1 else x)
# standard_table_lv1 = standard_table_lv1[standard_table_lv1['grade'].notna()]

# Remove all newlines from Reading and Examples, and the interpunct separators from Examples
standard_table_lv1['Reading'] = standard_table_lv1['Reading'].str.replace('\n', '', regex=True)
standard_table_lv1['Examples'] = standard_table_lv1['Examples'].str.replace('\n', '', regex=True)
standard_table_lv1['Examples'] = standard_table_lv1['Examples'].str.replace(' ?· ?', ' ', regex=True)

standard_table_lv1['Note'] = standard_table_lv1['Note'].astype(str)


# Export
standard_table_lv1.to_excel('after-processing-list.xlsx', index=False)
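
The grade assignment can also be expressed as a dictionary lookup. This is a minimal sketch under the assumption that every entry in the grade files is a single code point (the notebook's `startswith` matching does not need that assumption). The grade lists and the placeholder character Z are hypothetical.

```python
import pandas as pd

# Hypothetical grade lists standing in for the grade*.txt and level2.txt files.
characters_by_grade = {1: ["一", "二"], 2: ["三"]}
characters_lv2 = ["唱"]

# Later entries overwrite earlier ones, mirroring the order of the loops above.
grade_of = {}
for grade, chars in characters_by_grade.items():
    for ch in chars:
        grade_of[ch] = grade
for ch in characters_lv2:
    grade_of[ch] = 7

toy = pd.DataFrame({"Character": ["一", "三", "唱", "Z"]})

# Look up the first code point of each cell; anything unmatched gets grade 8.
toy["grade"] = toy["Character"].str[0].map(grade_of).fillna(8).astype(int)
print(toy)
```

A single `map` call avoids scanning the whole table once per character, which may matter for a table of this size.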

In [13]:
standard_table_lv1

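|      | Character | Reading | Examples                                          | Note | unicode  | grade |
|------|-----------|---------|---------------------------------------------------|------|----------|-------|
| 0    | 阿        | A       | 阿從 a tòng 阿諛 a dua 阿片 a phiến 阿羅漢 A La H... | [翻] | [U+963F] | 3     |
| 1    | 妸        | ả       | 淹妸 êm ả 妸陶 ả đào                               |      | [U+59B8] | 6     |
| 2    | 亞        | Á       | 洲亞 Châu Á 亞金 á kim 亞聖 á thánh                 |      | [U+4E9E] | 3     |
| 3    | 啊        | à       | 勢啊？ Thế à? [嘆]                                 |      | [U+554A] | 5     |
| 4    | 啊        | ã       | 嗢啊 ồn ã [𠸨]                                     |      | [U+554A] | 5     |
| ...  | ...       | ...     | ...                                               | ...  | ...      | ...   |
| 8895 | 昌        | XƯƠNG   | 昌盛 xương thịnh                                   |      | [U+660C] | 6     |
| 8896 | 唱        | XƯỚNG   | 喝唱 hát xướng                                     |      | [U+5531] | 7     |
| 8897 | 唱        | xang    | 吋唱 xốn xang [𠸨] 𫕸唱 xênh xang [𠸨]              |      | [U+5531] | 7     |
| 8898 | 䉅        | xụp     | 㡴䉅 lụp xụp 嚏䉅 xì xụp                            |      | [U+4245] | 7     |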


Now we have the complete table needed for creating the Anki decks.

## 3. Create Anki decks

In [14]:
# Create an Anki model (card layout)
han_nom_to_quoc_ngu_model = genanki.Model(
    1607392319,
    'Hán Nôm',
    fields=[
        {'name': 'Character'},
        {'name': 'Readings'},
        {'name': 'Audio'},
        {'name': 'Picture'},
        {'name': 'Examples'},
        {'name': 'Notes'},
        {'name': 'Grade'},
    ],
    templates=[
        {
            'name': 'Card 1',
            'qfmt': """<div style="font-size: 4em; font-family: minh">
{{Character}}
</div>""",
            'afmt': """{{FrontSide}}

<hr id=answer>

<div style="font-size: 2em; font-family: minh">{{Readings}}</div>
<div>{{Audio}}</div>
<div>{{Picture}}</div>
<div class="examples">{{Examples}}</div>
<div>{{Notes}}</div>
<div>{{Grade}}</div>
""",
        },
    ],
    css = """.card {
    font-family: gothic;
    font-size: 2em;
    text-align: center;
    color: black;
    background-color: #fdf6e3;
}

@font-face {
  font-family: gothic;
  src: url("_gothic.ttf");
}

@font-face {
  font-family: minh;
  src: url("_minh.ttf");
}""",
)
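
The model above uses a hard-coded ID. genanki expects model and deck IDs to stay the same between runs so that Anki can recognise them on re-import. One way to get such IDs without picking numbers by hand is to derive them from a name, as in this sketch. `stable_id` is a hypothetical helper, not part of genanki.

```python
import hashlib

def stable_id(name: str) -> int:
    """Map a name to a fixed integer in [2**30, 2**31) (hypothetical helper)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Take 4 bytes of the hash and fold them into a 30-bit offset.
    return (1 << 30) + int.from_bytes(digest[:4], "big") % (1 << 30)

print(stable_id("Hán Nôm"))
print(stable_id("Lớp 1"), stable_id("Lớp 2"))
```

Unlike Python's built-in `hash`, which is randomised per process for strings, a SHA-256 digest gives the same number on every run.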

In [15]:
def create_deck(new_df, deck_name, model, grade):
    # Subdeck names: 'Cấp 2' = level 2, 'Ngoài bảng' = outside the table, 'Lớp N' = grade N
    match grade:
        case 7:
            subdeck_name = 'Cấp 2'
        case 8:
            subdeck_name = 'Ngoài bảng'
        case _:
            subdeck_name = f'Lớp {grade}'

    subdeck = genanki.Deck(2059400110+grade+ord(deck_name[0]), subdeck_name)  # Subdeck ID

    package = genanki.Package(subdeck)
    package.media_files = ['_gothic.ttf', '_minh.ttf']

    for _, row in new_df.iterrows():
        note = genanki.Note(
            model=model,
            fields=[str(row['Character']) if pd.notna(row['Character']) else '',
                    str(row['Readings']) if pd.notna(row['Readings']) else '',
                    '',
                    '',
                    str(row['Examples']) if pd.notna(row['Examples']) else '',
                    str(row['Note']) if pd.notna(row['Note']) else '',
                    str(row['grade']) if pd.notna(row['grade']) else '',]
        )
        subdeck.add_note(note)

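    # Write via the package so the fonts listed in media_files are bundled
    # (Deck.write_to_file builds a fresh Package without them).
    # Both .ttf files must exist in the working directory.
    package.write_to_file(f'results/{subdeck_name}.apkg')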

def create_HanNom_to_Quocngu_deck(df, deck_name, grade, model = han_nom_to_quoc_ngu_model):
    df_by_grade = df[df['grade'] == grade]

    new_df = pd.DataFrame(columns=['Character', 'Readings', 'Examples', 'Note', 'grade'])

    for _, row in df_by_grade.iterrows():
        character = str(row['Character']) if pd.notna(row['Character']) else ''
        reading = str(row['Reading']) if pd.notna(row['Reading']) else ''
        examples = str(row['Examples']) if pd.notna(row['Examples']) else ''
        note = str(row['Note']) if pd.notna(row['Note']) else ''
        grade_val = str(row['grade']) if pd.notna(row['grade']) else ''

        if character in new_df['Character'].values:
            new_df.loc[new_df['Character'] == character, 'Readings'] += ', ' + reading
            new_df.loc[new_df['Character'] == character, 'Examples'] += '<br>' + examples
        else:
            new_df.loc[len(new_df)] = [character, reading, examples.rstrip(), note, grade_val]

    temp = pd.DataFrame(columns=['Character', 'Readings', 'Examples', 'Note', 'grade'])

    # Handle cases for grades 1-6, 7 (level 2), and 8 (outside the table)
    if grade in range(1, 7):
        for character in characters_by_grade[grade]:
            row = new_df[new_df['Character'] == character]
            if not row.empty:
                temp.loc[len(temp)] = row.iloc[0].values
    elif grade == 7:
        for character in characters_lv2:
            row = new_df[new_df['Character'] == character]
            if not row.empty:
                temp.loc[len(temp)] = row.iloc[0].values
    elif grade == 8:
        temp = new_df  # For grade 8, keep all characters in new_df

    new_df = temp

    create_deck(new_df, deck_name, model, grade)

# Iterate through grades 1 to 8
for i in range(1, 9):
    create_HanNom_to_Quocngu_deck(standard_table_lv1, 'Hán Nôm', i)
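
The row-by-row merging of readings and examples in `create_HanNom_to_Quocngu_deck` is essentially a group-and-join. Here is a minimal sketch of that idea with `groupby` on toy rows. It assumes the columns hold strings with no NaN left, and it ignores the Note and grade columns, which the function above takes from the first row of each character.

```python
import pandas as pd

# Toy rows in the shape of the processed table (values shortened for illustration).
toy = pd.DataFrame({
    "Character": ["啊", "啊", "亞"],
    "Reading": ["à", "ã", "Á"],
    "Examples": ["Thế à?", "ồn ã", "Châu Á"],
})

# sort=False keeps characters in order of first appearance.
cards = (
    toy.groupby("Character", sort=False)
       .agg({"Reading": ", ".join, "Examples": "<br>".join})
       .rename(columns={"Reading": "Readings"})
       .reset_index()
)
print(cards)
```

Each output row then corresponds to one card: one character, all of its readings joined with commas, and its examples separated by `<br>` as in the card template.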

In [16]:
!zip -r /content/results.zip results

files.download("/content/results.zip")

  adding: results/ (stored 0%)
  adding: results/Ngoài bảng.apkg (deflated 85%)
  adding: results/Lớp 3.apkg (deflated 74%)
  adding: results/Cấp 2.apkg (deflated 65%)
  adding: results/Lớp 6.apkg (deflated 66%)
  adding: results/Lớp 5.apkg (deflated 76%)
  adding: results/Lớp 1.apkg (deflated 78%)
  adding: results/Lớp 4.apkg (deflated 75%)
  adding: results/Lớp 2.apkg (deflated 74%)
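
Outside Colab, where the `!zip` shell command may not be available, the same archive could be built with the standard library. This is a sketch in a temporary directory with a placeholder file standing in for the real decks.

```python
import pathlib
import shutil
import tempfile

# Work in a throwaway directory so the sketch does not touch real results.
workdir = pathlib.Path(tempfile.mkdtemp())
results = workdir / "results"
results.mkdir()
(results / "example.apkg").write_bytes(b"placeholder")  # stand-in for a real deck

# Writes <workdir>/results.zip containing a top-level results/ folder.
archive = shutil.make_archive(
    str(workdir / "results"), "zip", root_dir=str(workdir), base_dir="results"
)
print(archive)
```

`shutil.make_archive` returns the path of the archive it wrote, which can then be passed on to whatever download step is in use.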

