# Data Cleaning
This script takes a raw dataset and cleans it.
The cleaned data has the following format.

```kana<TAB>alpha```

kana: [ァ-ヴー]

alpha: [a-zç]

In [1]:
import sys
import utils.character_converters as cc

In [2]:
RAW_INPUT_FILE = 'training_data/scraped_aggregated_kana_alpha.txt'
OUTPUT_FILE = 'training_data/alpha_to_kana_cleaned.txt'

cleaned_pairs = []
kanas_map = cc.create_char_maps_kana()
alphas_map = cc.create_char_maps_alpha()

In [3]:
with open(RAW_INPUT_FILE, "r", encoding="utf-8") as fp:

    line_no = 1
    num_good = 0
    num_bad = 0
    for line in fp:
        fields = line.strip().split('\t')
        if len(fields) != 2:
            print(f'[{line_no}]: invalid line found in "{line.strip()}"]')
            num_bad += 1
            continue
            
        raw_kana  = fields[0]
        raw_alpha = fields[1]
        good = True
        alpha = ""
        kana  = ""

        for c in raw_alpha.strip():
            if c not in alphas_map:
                print(f'[{line_no}]: invalid alphabet found in "{raw_alpha}"]')
                good = False
                break
            else:
                alpha += alphas_map[c]

        for c in raw_kana.strip():
            if c not in kanas_map:
                print(f'[{line_no}]: invalid kana found in "{raw_kana}"]')
                good = False
                break
            else:
                kana += kanas_map[c]
        if good:   
            cleaned_pairs.append( (kana, alpha) )
            num_good +=1
        else:
            num_bad += 1
        line_no += 1


[755]: invalid alphabet found in "Aung San"]
[756]: invalid alphabet found in "Aung San Suu Kyi"]
[1097]: invalid alphabet found in "Aguilar Zinser"]
[1163]: invalid alphabet found in "Aksel Lund"]
[1346]: invalid alphabet found in "À Kempis"]
[1708]: invalid line found in "アス"]
[2308]: invalid line found in "アッス"]
[2898]: invalid alphabet found in "Ad Rock"]
[3127]: invalid alphabet found in "Anuman Rajadhom"]
[3435]: invalid alphabet found in "Abī al"]
[3439]: invalid alphabet found in "Abī al-Qāsim"]
[3440]: invalid alphabet found in "Abī al-khayr"]
[3452]: invalid alphabet found in "Abī Waqqās"]
[3512]: invalid alphabet found in "Afdal ad-dīn"]
[3585]: invalid alphabet found in "Axundzad∽e"]
[3590]: invalid alphabet found in "Abu Eitta"]
[3603]: invalid alphabet found in "Abu Shanab"]
[3616]: invalid alphabet found in "Abd Allah"]
[3639]: invalid alphabet found in "Abd al-Salām"]
[3661]: invalid alphabet found in "Abd al-Karīm"]
[3664]: invalid alphabet found in "Abdel Shafi"]
[367

[42224]: invalid alphabet found in "Sian Chin"]
[42252]: invalid alphabet found in "Hsien Loong"]
[42253]: invalid alphabet found in "Hsien Loong"]
[42402]: invalid alphabet found in "Cielo Filho"]
[42404]: invalid alphabet found in "Shien Biau"]
[42440]: invalid alphabet found in "Siow Yue"]
[42447]: invalid alphabet found in "Siong Kie"]
[42550]: invalid alphabet found in "Sigur●＠7AB3dur"]
[42861]: invalid alphabet found in "Si tu"]
[43298]: invalid alphabet found in "Simonis dze"]
[43319]: invalid alphabet found in "Simonsz."]
[43522]: invalid alphabet found in "Șaguna"]
[43916]: invalid alphabet found in "Shams al-Din"]
[43920]: invalid alphabet found in "Shams al-Dīn"]
[43956]: invalid alphabet found in "Sharaf al"]
[43957]: invalid alphabet found in "Sharaf al-Dīn"]
[44003]: invalid alphabet found in "Sharīat Madārī"]
[44123]: invalid alphabet found in "Śar ba pa"]
[44230]: invalid alphabet found in "Śangs ston"]
[44303]: invalid alphabet found in "Sharwood Smith"]
[44418]: inval

[80414]: invalid alphabet found in "Mac Anthony"]
[80808]: invalid alphabet found in "Mac Murrough"]
[80809]: invalid alphabet found in "Mac Murrough"]
[81496]: invalid alphabet found in "Masjed Jamei"]
[81690]: invalid alphabet found in "Mata Hari"]
[81737]: invalid alphabet found in "Ma cig"]
[83126]: invalid alphabet found in "Maha Swe"]
[83151]: invalid alphabet found in "Maha Rahtathara"]
[83285]: invalid alphabet found in "Ma Ma Lay"]
[83558]: invalid alphabet found in "Marie France"]
[83716]: invalid alphabet found in "Mary Lynn"]
[84347]: invalid alphabet found in "Mar pa"]
[85184]: invalid alphabet found in "Māja al-Qazwīnī"]
[85549]: invalid alphabet found in "Min Thu Wun"]
[85639]: invalid alphabet found in "Mi bskyod"]
[85740]: invalid alphabet found in "Migel Angel"]
[86249]: invalid alphabet found in "Michael Wiśniowiecki"]
[86285]: invalid alphabet found in "Mi pham rgya mtsho"]
[86354]: invalid alphabet found in "Mya Than Tint"]
[86355]: invalid alphabet found in "Mya M

[126484]: invalid alphabet found in "De Vries"]
[126486]: invalid alphabet found in "De Vries"]
[126533]: invalid alphabet found in "De Broglie"]
[126552]: invalid alphabet found in "De Broglie"]
[126561]: invalid alphabet found in "De Bouillon"]
[126562]: invalid alphabet found in "Khro phu"]
[126576]: invalid alphabet found in "De Bellis"]
[126589]: invalid alphabet found in "De Boor"]
[126591]: invalid alphabet found in "De Botton"]
[126601]: invalid alphabet found in "De Beaune"]
[126687]: invalid alphabet found in "De Murville"]
[126713]: invalid alphabet found in "Khrom bsher"]
[126729]: invalid alphabet found in "De Maiziére"]
[126733]: invalid alphabet found in "De Médicis"]
[126750]: invalid alphabet found in "De Moivre"]
[126756]: invalid alphabet found in "De Morgan"]
[126759]: invalid alphabet found in "De Montigny"]
[126811]: invalid alphabet found in "De Lahunta"]
[126873]: invalid alphabet found in "De La Geniere"]
[126879]: invalid alphabet found in "De la Saussaye"]
[1

The numbers above shows the total line numbers processed, number of good pairs, and the nuber of bad pairs removed.

In [4]:
print (f'Lines processed:[{line_no - 1}], Good pairs:[{num_good}], Bad pairs removed:[{num_bad}]')

Lines processed:[149459], Good pairs:[148365], Bad pairs removed:[1113]


In [5]:
with open(OUTPUT_FILE, "w", encoding="utf-8") as fp:
    for k, a in cleaned_pairs:
        fp.write(f'{a}\t{k}\n')