# Create anki decks for standard-han-nom

In this project, I set out to create an Anki deck for standard-Han-Nom, organized by grade.
The list was taken from: https://www.hannom-rcv.org/

## 1. Extract the standard Han Nom level 1 list

First, I used [this converter](https://www.adobe.com/acrobat/online/pdf-to-excel.html) to turn the table in the PDF file into an Excel file. Then I loaded the resulting xlsx file to check the table.

In [1]:
import pandas as pd
import numpy as np

standard_table_lv1 = pd.read_excel("standard-han-nom-lv1.xlsx")

standard_table_lv1

Unnamed: 0,Character,Reading,Examples,Unnamed: 3,Note
0,阿,A,阿從 a tòng · 阿諛 a dua · 阿片 a phiến · 阿羅漢 ...,,[翻] U+963F
1,妸,ả,淹妸,êm ả · 妸陶 ả đào,U+59B8
2,亞,Á,洲亞 Châu Á · 亞金 á kim · 亞聖 á thánh,,U+4E9E
3,啊,à,勢啊？ Thế à? [嘆],,U+554A
4,,ã,嗢啊,ồn ã [𠸨],
...,...,...,...,...,...
5104,𬺗,xuống,𬨠𬺗 lên xuống · 𨀈𬺗 bước xuống · 𬺗𩯀 xuống tóc,,[異] 𫴋 𠖈 U+2CE97
5105,𦩰,xuồng,𣛥𦩰 be xuồng,,U+26A70
5106,𩩫,xương,𩩫骨 xương cốt · 𤐚𩩫 hầm xương · 𩩫𦘹 xương sườn,,[異] 昌 U+29A6B
5107,廠,XƯỞNG,工廠 công xưởng,,U+5EE0


In [2]:
standard_table_lv1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5109 entries, 0 to 5108
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Character   3974 non-null   object
 1   Reading     5108 non-null   object
 2   Examples    4982 non-null   object
 3   Unnamed: 3  18 non-null     object
 4   Note        4461 non-null   object
dtypes: object(5)
memory usage: 199.7+ KB


## 2. Data preprocessing

Although the source states that the table contains 3975 characters, only 3974 appear after conversion. The reason is in row 2192: the converter did not read the character 洛 in this row (perhaps because it is not fully displayed in the PDF file), so one character was missing. I fill it in manually in the next cell.

The converter did a good job overall, but the table is not perfect. The column "Unnamed: 3" holds text that should have been part of the "Examples" column, so I merge the two.

In [3]:
standard_table_lv1['Examples'] = standard_table_lv1['Examples'].str.cat(standard_table_lv1['Unnamed: 3'], sep = ' ', na_rep='')

standard_table_lv1.drop('Unnamed: 3', axis=1, inplace=True)

standard_table_lv1.loc[2192, 'Character'] = '洛'

standard_table_lv1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5109 entries, 0 to 5108
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Character  3975 non-null   object
 1   Reading    5108 non-null   object
 2   Examples   5109 non-null   object
 3   Note       4461 non-null   object
dtypes: object(4)
memory usage: 159.8+ KB
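The merge can be checked on a small toy frame. The following is a minimal sketch, not part of the original notebook: the two rows are copied from the preview in section 1, the names `toy` and `merged` are illustrative, and the extra `.str.strip()` is an assumption added here to drop the trailing space that `na_rep=''` leaves when "Unnamed: 3" is empty.

```python
import numpy as np
import pandas as pd

# Two rows shaped like the converter output: the first is split across columns,
# the second has nothing in "Unnamed: 3".
toy = pd.DataFrame({
    "Examples": ["淹妸", "洲亞 Châu Á · 亞金 á kim · 亞聖 á thánh"],
    "Unnamed: 3": ["êm ả · 妸陶 ả đào", np.nan],
})

# Concatenate row-wise; NaN is treated as an empty string.
merged = toy["Examples"].str.cat(toy["Unnamed: 3"], sep=" ", na_rep="")

# Assumption: strip the trailing space left behind for rows without an overflow cell.
merged = merged.str.strip()
print(merged.tolist())
```

Without the final strip, rows that had no overflow text end with a space, which is harmless on an Anki card but visible in the cell outputs.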


Next, I deal with the missing values in the columns Reading, Character, and unicode (the last one is created from the Note column later).

In [4]:
# The info above shows one missing value in the Reading column
temp = standard_table_lv1[standard_table_lv1['Reading'].isnull()]
temp

Unnamed: 0,Character,Reading,Examples,Note
2812,挪,,挪威 Na Uy,U+632A


In [5]:
# The reading is "NA", but pandas parsed it as a NaN value, so I restore it.
standard_table_lv1.loc[2812, 'Reading'] = "NA"
standard_table_lv1.iloc[2812]

Character             挪
Reading              NA
Examples     挪威  Na Uy 
Note             U+632A
Name: 2812, dtype: object
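The same effect can be reproduced in isolation. This sketch uses `pd.read_csv` on an in-memory string instead of the xlsx file so that it runs without any input file; `keep_default_na` is also a parameter of `pd.read_excel`. The variable names are illustrative. Switching the default markers off may also change how empty cells are read, so the non-null counts shown earlier could differ; it is not presented here as a drop-in change for the notebook.

```python
import io
import pandas as pd

csv_text = "Character,Reading\n挪,NA\n阿,A\n"

# Default behaviour: the string "NA" is on pandas' list of missing-value markers.
default_df = pd.read_csv(io.StringIO(csv_text))

# With the default markers switched off, "NA" survives as an ordinary string.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["Reading"].isna().tolist())
print(kept_df["Reading"].tolist())
```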

In [6]:
# Add a unicode column extracted from the 'Note' column.
# Each cell of the unicode column is a list of codes (or NaN where Note is NaN).
standard_table_lv1['unicode'] = standard_table_lv1['Note'].str.findall(r'(U\+[0-9A-Fa-f]+)')

#standard_table_lv1.drop('Note', axis=1, inplace=True)

standard_table_lv1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5109 entries, 0 to 5108
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Character  3975 non-null   object
 1   Reading    5109 non-null   object
 2   Examples   5109 non-null   object
 3   Note       4461 non-null   object
 4   unicode    4461 non-null   object
dtypes: object(5)
memory usage: 199.7+ KB
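To make the behaviour of `str.findall` concrete, here is a small sketch on a few Note values taken from the previews in this notebook. The helper `code_to_char` is hypothetical (it is not in the original code); it is one way to cross-check an extracted code against the Character column.

```python
import numpy as np
import pandas as pd

notes = pd.Series(["[翻] U+963F", "U+59B8", "[翻]", np.nan])

# findall returns a list per cell: possibly empty, and NaN where the cell is NaN.
codes = notes.str.findall(r"U\+[0-9A-Fa-f]+")

def code_to_char(code):
    """Convert a string such as 'U+963F' to the character it encodes."""
    return chr(int(code[2:], 16))

print(codes.tolist())
print(code_to_char("U+963F"))
```

A Note that holds only a tag such as `[翻]` yields an empty list, while a missing Note stays NaN, which is why both cases show up in the counts that follow.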


In [7]:
# Some character cells have 2 or 3 unicode codes.
# Checking the data shows that the length of those lists can only be 0, 1, 2, or 3.

lst0 = []
lst1 = []
lst2 = []
lst3 = []

nonlist = []

for i, value in standard_table_lv1['unicode'].items():
    if isinstance(value, list):
        if len(value) == 0:
            lst0.append(i)
        if len(value) == 1:
            lst1.append(i)
        if len(value) == 2:
            lst2.append(i)
        if len(value) == 3:
            lst3.append(i)
    else:
        nonlist.append(i)

print('Number of list cell:', len(lst1) + len(lst2) + len(lst3) + len(lst0))
print('Number of non-list cells:', len(nonlist))
print('Number of []:', len(lst0))

print('Number of non-empty list', len(lst1) + len(lst2) + len(lst3))

unique_types = standard_table_lv1['unicode'].apply(type).unique()
print(unique_types)

not_list_value = standard_table_lv1['unicode'].iloc[nonlist].unique()
print(not_list_value)


Number of list cell: 4461
Number of non-list cells: 648
Number of []: 486
Number of non-empty list 3975
[<class 'list'> <class 'float'>]
[nan]
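The same tally can be written more compactly. This is a sketch on made-up values, not the notebook's data: `unicode_col` stands in for `standard_table_lv1['unicode']`, and `-1` is an arbitrary marker chosen here for cells that are not lists.

```python
import numpy as np
import pandas as pd

# Toy column: lists of varying length plus one NaN (values are illustrative only).
unicode_col = pd.Series([["U+0041"], [], np.nan, ["U+0041", "U+0042"], ["U+0043"]])

# Length of each list; -1 marks cells that are not lists (NaN here).
lengths = unicode_col.map(lambda v: len(v) if isinstance(v, list) else -1)

# Tally how many cells have each length.
length_counts = lengths.value_counts().sort_index()
print(length_counts)
```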


In [8]:
# Turn all empty lists into NaN.
standard_table_lv1['unicode'] = standard_table_lv1['unicode'].apply(lambda x: np.nan if x == [] else x)
print(standard_table_lv1.info())
standard_table_lv1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5109 entries, 0 to 5108
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Character  3975 non-null   object
 1   Reading    5109 non-null   object
 2   Examples   5109 non-null   object
 3   Note       4461 non-null   object
 4   unicode    3975 non-null   object
dtypes: object(5)
memory usage: 199.7+ KB
None


Unnamed: 0,Character,Reading,Examples,Note,unicode
0,阿,A,阿從 a tòng · 阿諛 a dua · 阿片 a phiến · 阿羅漢 ...,[翻] U+963F,[U+963F]
1,妸,ả,淹妸 êm ả · 妸陶 ả đào,U+59B8,[U+59B8]
2,亞,Á,洲亞 Châu Á · 亞金 á kim · 亞聖 á thánh,U+4E9E,[U+4E9E]
3,啊,à,勢啊？ Thế à? [嘆],U+554A,[U+554A]
4,,ã,嗢啊 ồn ã [𠸨],,
...,...,...,...,...,...
5104,𬺗,xuống,𬨠𬺗 lên xuống · 𨀈𬺗 bước xuống · 𬺗𩯀 xuống tóc,[異] 𫴋 𠖈 U+2CE97,[U+2CE97]
5105,𦩰,xuồng,𣛥𦩰 be xuồng,U+26A70,[U+26A70]
5106,𩩫,xương,𩩫骨 xương cốt · 𤐚𩩫 hầm xương · 𩩫𦘹 xương sườn,[異] 昌 U+29A6B,[U+29A6B]
5107,廠,XƯỞNG,工廠 công xưởng,U+5EE0,[U+5EE0]


Now I will fill the NaN values in the 'Character' and 'unicode' columns. Every character is assigned at least one code, and there are 3,975 characters and 3,975 non-empty lists, so each character is associated with exactly one list.

Thus, the table has this shape (keep in mind that one character might have n readings):

| Character                     | Reading   | unicode                           |
|-------------------------------|-----------|-----------------------------------|
| character can be in this cell | reading_1 | the list (of code) can be in this |
| or this cell                  | reading_2 | or this                           |
| or this one                   | reading_n | or this                           |

I assume that the list of codes is in the same row as reading_1.
The character is expected to be in that row too, but that is not always true. The next code cell shows that the character is in either the row of reading_1 or the row of reading_2.


In [9]:
temp1 = standard_table_lv1['Character'].isna() & ~standard_table_lv1['unicode'].isna()
special_index_list = []
for index, value in temp1.items():
    if value:
        special_index_list.append(index)
        special_index_list.append(index+1)

print(special_index_list)

standard_table_lv1.iloc[special_index_list]

[151, 152, 182, 183, 470, 471, 507, 508, 990, 991, 1334, 1335, 2419, 2420, 2437, 2438, 2585, 2586, 2702, 2703, 3019, 3020, 3313, 3314, 3372, 3373, 4337, 4338, 4568, 4569]


Unnamed: 0,Character,Reading,Examples,Note,unicode
151,,BÀNG,彷徨 bàng hoàng,U+5F77,[U+5F77]
152,彷,PHẢNG,彷彿 phảng phất,[翻],
182,,BÀO,炮製 bào chế,U+70AE,[U+70AE]
183,炮,PHÁO,炮臺 pháo đài · 炮花 pháo hoa,,
470,,CẠNH,競爭 cạnh tranh,U+7AF6,[U+7AF6]
471,競,ganh,競𨅮 ganh đua · 競比 ganh tị,,
507,,CÂU,俱樂部 câu lạc bộ,U+4FF1,[U+4FF1]
508,俱,CỤ,俱備 cụ bị · 俱全 cụ toàn,,
990,,DỊCH,演繹 diễn dịch,U+7E79,[U+7E79]
991,繹,dếch,阿繹拜間 A-déc-bai-gian [摱],𡨸尼主要得使用抵翻音。䀡附錄。 Chữ này chủ yếu được sử dụng ...,


In [10]:
# Move each character from its reading_2 row back to its reading_1 row.
for index, value in temp1.items():
    if value:
        standard_table_lv1.loc[index, 'Character'] = standard_table_lv1.loc[index + 1, 'Character']
        standard_table_lv1.loc[index + 1, 'Character'] = np.nan

standard_table_lv1.iloc[special_index_list]

Unnamed: 0,Character,Reading,Examples,Note,unicode
151,彷,BÀNG,彷徨 bàng hoàng,U+5F77,[U+5F77]
152,,PHẢNG,彷彿 phảng phất,[翻],
182,炮,BÀO,炮製 bào chế,U+70AE,[U+70AE]
183,,PHÁO,炮臺 pháo đài · 炮花 pháo hoa,,
470,競,CẠNH,競爭 cạnh tranh,U+7AF6,[U+7AF6]
471,,ganh,競𨅮 ganh đua · 競比 ganh tị,,
507,俱,CÂU,俱樂部 câu lạc bộ,U+4FF1,[U+4FF1]
508,,CỤ,俱備 cụ bị · 俱全 cụ toàn,,
990,繹,DỊCH,演繹 diễn dịch,U+7E79,[U+7E79]
991,,dếch,阿繹拜間 A-déc-bai-gian [摱],𡨸尼主要得使用抵翻音。䀡附錄。 Chữ này chủ yếu được sử dụng ...,
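The move-up step and the forward fill that follows it can be tried on a three-row toy frame. This is a sketch under stated assumptions: it swaps the explicit loop for `shift`-based masks (a different technique from the notebook's loop, with the same intent), and the names `toy` and `misplaced` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Character": [np.nan, "彷", "阿"],
    "Reading": ["BÀNG", "PHẢNG", "A"],
    "unicode": [["U+5F77"], np.nan, ["U+963F"]],
})

# Rows that carry the code list but lack the character.
misplaced = toy["Character"].isna() & toy["unicode"].notna()

# Pull the character up from the following row...
toy.loc[misplaced, "Character"] = toy["Character"].shift(-1)[misplaced]
# ...and blank the row it came from.
toy.loc[misplaced.shift(1, fill_value=False), "Character"] = np.nan

# Forward-fill so every reading row has its character and code list.
toy["Character"] = toy["Character"].ffill()
toy["unicode"] = toy["unicode"].ffill()
print(toy)
```

Forward filling is only safe once the character sits in its first reading row; otherwise the first reading would inherit the previous character.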


In [11]:
# Forward-fill all NaN values in the Character and unicode columns

standard_table_lv1['Character'] = standard_table_lv1['Character'].ffill()
standard_table_lv1['unicode'] = standard_table_lv1['unicode'].ffill()

standard_table_lv1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5109 entries, 0 to 5108
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Character  5109 non-null   object
 1   Reading    5109 non-null   object
 2   Examples   5109 non-null   object
 3   Note       4461 non-null   object
 4   unicode    5109 non-null   object
dtypes: object(5)
memory usage: 199.7+ KB



Now we will assign each character to a grade from 1 to 6.

In [12]:
# First, I extracted all the characters from the PDF with the three commented lines below,
# then copied the result into .txt files manually.

# from PyPDF2 import PdfReader
# reader = PdfReader("characters-by-grade.pdf")
# print(reader.pages[7].extract_text())

characters_by_grade = {}

for grade in range(1, 7): # there are grade 1,2,3,4,5,6
    with open('characters-by-grade/grade' + str(grade) + '.txt', 'r', encoding='utf8') as file:
        characters_by_grade[grade] = file.read().split()
        
standard_table_lv1['grade'] = -1
for grade in range(1, 7):
    for character in characters_by_grade[grade]:
        row = standard_table_lv1.loc[standard_table_lv1['Character'].str.startswith(character, na=False)]
        standard_table_lv1.loc[row.index, 'grade'] = grade

# Remove all \n in Reading and Examples

standard_table_lv1['Reading'] = standard_table_lv1['Reading'].str.replace('\n', '', regex=True)
standard_table_lv1['Examples'] = standard_table_lv1['Examples'].str.replace('\n', '', regex=True)


# Export
#standard_table_lv1.to_excel('after-processing-list.xlsx', index=False)
standard_table_lv1


Unnamed: 0,Character,Reading,Examples,Note,unicode,grade
0,阿,A,阿從 a tòng · 阿諛 a dua · 阿片 a phiến · 阿羅漢 ...,[翻] U+963F,[U+963F],2
1,妸,ả,淹妸 êm ả · 妸陶 ả đào,U+59B8,[U+59B8],6
2,亞,Á,洲亞 Châu Á · 亞金 á kim · 亞聖 á thánh,U+4E9E,[U+4E9E],3
3,啊,à,勢啊？ Thế à? [嘆],U+554A,[U+554A],3
4,啊,ã,嗢啊 ồn ã [𠸨],,[U+554A],3
...,...,...,...,...,...,...
5104,𬺗,xuống,𬨠𬺗 lên xuống · 𨀈𬺗 bước xuống · 𬺗𩯀 xuống tóc,[異] 𫴋 𠖈 U+2CE97,[U+2CE97],3
5105,𦩰,xuồng,𣛥𦩰 be xuồng,U+26A70,[U+26A70],6
5106,𩩫,xương,𩩫骨 xương cốt · 𤐚𩩫 hầm xương · 𩩫𦘹 xương sườn,[異] 昌 U+29A6B,[U+29A6B],4
5107,廠,XƯỞNG,工廠 công xưởng,U+5EE0,[U+5EE0],6
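The grade assignment can also be expressed as a dictionary lookup. This is a sketch, not the notebook's method: it matches characters exactly instead of using `str.startswith`, the grade lists are a toy stand-in for the .txt files, and `char_to_grade` and the `"?"` placeholder row are hypothetical names and values introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the per-grade character lists read from the .txt files.
characters_by_grade = {2: ["阿"], 3: ["亞", "啊"], 6: ["妸"]}

# Invert to character -> grade; a character listed twice keeps the later grade.
char_to_grade = {
    char: grade
    for grade, chars in characters_by_grade.items()
    for char in chars
}

# "?" is a placeholder for a character that appears in no grade list.
toy = pd.DataFrame({"Character": ["阿", "亞", "啊", "啊", "妸", "?"]})

# Exact-match lookup; characters absent from every list get -1.
toy["grade"] = toy["Character"].map(char_to_grade).fillna(-1).astype(int)
print(toy)
```

Any row left at `-1` points to a character missing from the grade files, which is worth checking before the decks are built.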


Now we have the complete table needed for creating the Anki decks.

## 3. Create anki decks

In [13]:
import genanki

# Create an Anki model (card layout)
my_model = genanki.Model(
    1607392319,
    'Basic Model',
    fields=[
        {'name': 'Character'},
        {'name': 'Readings'},
        {'name': 'Examples'},
    ],
    templates=[
        {
            'name': 'Card 1',
            'qfmt': '{{Character}}',
            'afmt': '{{FrontSide}}\n<hr id=answer>\n{{Readings}}\n<br><br>\nVí dụ: {{Examples}}',
        },
    ],
    css='.card {font-family: arial; font-size: 20px; text-align: center; color: black; background-color: white;}',
)

#df = pd.read_excel('after-processing-list.xlsx') # "NA" would be parsed as a NaN value again on reload
#df.loc[2812, 'Reading'] = "NA"

def create_deck(new_df, deck_name, model, grade):
    
    subdeck_name = f'{deck_name}::Lớp {grade}'

    subdeck = genanki.Deck(2059400110+grade, subdeck_name)  # Subdeck ID

    for _, row in new_df.iterrows():
        note = genanki.Note(
            model=model,
            fields=[row['Character'], row['Readings'], row['Examples']],
        )
        subdeck.add_note(note)

    print(f'Successfully adding {len(new_df)} cards')
    
    subdeck.write_to_file(f'results/Standard chữ Hán Nôm to chữ Quốc Ngữ - grade {grade}.apkg')

def create_cards_by_grade(df, deck_name, model, grade):

    df_by_grade = df[df['grade'] == grade]

    ## new dataframe for creating deck
    new_df = pd.DataFrame(columns=['Character', 'Readings', 'Examples'])

    for _, row in df_by_grade.iterrows():
        if row['Character'] in new_df['Character'].values:
            new_df.iloc[-1, new_df.columns.get_loc('Readings')] += ', ' + row['Reading']
            new_df.iloc[-1, new_df.columns.get_loc('Examples')] += '\n' + row['Examples']
        else:
            new_df.loc[len(new_df)] = [row['Character'], row['Reading'], row['Examples']]

    create_deck(new_df, deck_name, model, grade)

for i in range(1, 7):
    create_cards_by_grade(standard_table_lv1, 'Standard chữ Hán Nôm to chữ Quốc Ngữ (level 1)', my_model, i)

print("Anki deck created from DataFrame successfully!")


Successfully adding 244 cards
Successfully adding 342 cards
Successfully adding 464 cards
Successfully adding 478 cards
Successfully adding 490 cards
Successfully adding 1957 cards
Anki deck created from DataFrame successfully!
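Collapsing the readings of one character into a single card can also be done with a `groupby`. This is an alternative sketch, not the code used above: it groups every row with the same character, adjacent or not, whereas the loop in `create_cards_by_grade` always appends to the most recently added card. The toy rows are shortened versions of rows from the previews, and `cards` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Character": ["亞", "啊", "啊"],
    "Reading": ["Á", "à", "ã"],
    "Examples": ["洲亞 Châu Á", "勢啊？ Thế à?", "嗢啊 ồn ã"],
})

# One row per character, in order of first appearance:
# readings joined with ", ", examples joined with a newline.
cards = (
    toy.groupby("Character", sort=False, as_index=False)
       .agg({"Reading": ", ".join, "Examples": "\n".join})
       .rename(columns={"Reading": "Readings"})
)
print(cards)
```

The resulting frame has the same three columns that `create_deck` reads (`Character`, `Readings`, `Examples`).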
