# Korean Character Recognition with Convnets

---
## Introduction 
This semester, I am studying abroad at Yonsei University in South Korea. 

![Yonsei's Campus](./images/yonsei_campus.JPG)

I spend two hours per day in Korean class, so I wanted to make at least one post related to Korean. I figured using convnets to recognize Korean characters would be fun, and it's also quite a challenge. MNIST has 10 digits and the English alphabet has 26 letters, but the Korean alphabet can be combined into 11,172 possible characters. In practice, however, only 2,350 of them are frequently used ([source](https://ko.m.wikipedia.org/wiki/%ED%95%9C%EA%B8%80_%EC%9D%8C%EC%A0%88)).

> (Translated from Korean) Every combination of Korean letters can be written as one of these characters, but the KS X 1001 precomposed encoding contains only the 2,350 frequently used characters, so the remaining 8,822 cannot be expressed. The more recent extended precomposed encoding and the Unicode family support all 11,172 characters.
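
The 11,172 figure follows from how Korean syllable blocks are built: each one combines one of 19 initial consonants, one of 21 vowels, and one of 28 finals (27 final consonants plus "no final"). As a quick sanity check, here is a minimal sketch that reproduces the count and composes a character from its letter indices using the Unicode Hangul Syllables block, which starts at U+AC00. The helper name `compose_syllable` is hypothetical and only used for illustration.

```python
# Hangul syllable blocks: 19 initials x 21 vowels x 28 finals (27 + "none")
NUM_INITIALS, NUM_VOWELS, NUM_FINALS = 19, 21, 28
HANGUL_BASE = 0xAC00  # code point of 가, the first precomposed syllable

def compose_syllable(initial, vowel, final=0):
    """Hypothetical helper: build a syllable from zero-based letter indices."""
    offset = (initial * NUM_VOWELS + vowel) * NUM_FINALS + final
    return chr(HANGUL_BASE + offset)

total = NUM_INITIALS * NUM_VOWELS * NUM_FINALS
print(total)                       # 11172
print(total - 2350)                # 8822 characters outside KS X 1001
print(compose_syllable(0, 0))      # 가
print(compose_syllable(18, 0, 4))  # 한
```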

In this post, we will use the PHD08 Korean character dataset, which contains 2,187 samples for each of the 2,350 Korean character classes, for a total of 5,139,450 samples. Uncompressed, the dataset is 7.52 GB and it's in a... "unique" format, so we'll get to spend an entire section reformatting it 👍

---
## Preparing the Data

### Downloading the Data
You can download the data from its [original provider](http://cv.jbnu.ac.kr/index.php?mid=notice&document_srl=189) or use this direct [Dropbox link](https://www.dropbox.com/s/69cwkkqt4m1xl55/phd08.alz?dl=0). 

### Unzipping the Data
The data is in the proprietary `.alz` format. If you're on Windows, you can unzip it using ALZip. If you are on Mac, I recommend using The Unarchiver. If you are asked for the encoding format when unzipping it, select "(MS, DOC) Korean", which should show an output file like 가.txt.

![You may not like it, but this is what peak proprietary formats look like](./images/unzip.jpg)

### Inspecting the Data
Go ahead and look at the output files. You'll notice that there are 2,350 `.txt` files, one for each Korean character. Each text file contains 2,187 samples of its character, and each sample is stored in the following format: `sample id`, `dimensions` (rows then columns), `binary representation`. If you squint hard enough at the example below, you should see that it resembles the character 가.

```
s_0_0_0_0_1
22 29
00000000000000000000011100000
00000000000000000000011100000
11111111111111100000011100000
11111111111111100000011100000
00000000000011100000011100000
00000000000001100000011000000
00000000000011100000011100000
00000000000011100000011000000
00000000000001100000011100000
00000000000001100000011111111
00000000000011100000011111111
00000000000011100000011100000
00000000000011100000011000000
00000000000000000000011100000
00000000000000000000011100000
00000000000000000000011000000
00000000000000000000011000000
00000000000000000000011000000
00000000000000000000011000000
00000000000000000000011000000
00000000000000000000011100000
00000000000000000000011000000
```
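
If squinting isn't working, swapping the digits for block characters makes the shape easier to see. The snippet below is a small illustrative sketch, not part of the conversion pipeline: it takes a handful of rows copied from the sample above, checks their width against the declared column count (29), and prints the `1`s as solid blocks. The `render` helper is a hypothetical name.

```python
# A few rows copied from the sample above (its header says 22 rows x 29 columns)
sample_rows = """\
00000000000000000000011100000
11111111111111100000011100000
00000000000011100000011100000
00000000000001100000011111111
00000000000000000000011000000"""

def render(rows_text, on="█", off=" "):
    """Hypothetical helper: swap 1s and 0s for visible block characters."""
    return "\n".join(
        line.replace("1", on).replace("0", off)
        for line in rows_text.splitlines()
    )

# Every row should be as wide as the declared column count
assert all(len(line) == 29 for line in sample_rows.splitlines())
print(render(sample_rows))
```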

### What We Need To Do
We need to complete the following steps:

1. Write a file parser to turn each sample into a 2D NumPy array
2. Write a function to save all of the 2D NumPy arrays as images
3. Iterate through all the files and save the images to train/validation/test folders

#### Parsing Files Into NumPy Arrays
The following code parses one character's file and turns each sample in it into a 2D NumPy array.

```python
import re
import numpy as np

def load_images(path):
    """Return every sample in a file as a list of 2D numpy arrays"""
    # Precompile match patterns for reuse
    image_dimensions_regex = re.compile(r'^(\d+) (\d+)$')
    image_binary_regex = re.compile(r'^(\d+)$')

    images = []
    image = None      # array for the sample currently being read
    row_index = None  # iterator over the row numbers of that array

    with open(path, 'r') as f:
        for line in f:
            line = line.strip()

            # Blank line: the current sample is complete
            if not line:
                if image is not None:
                    images.append(image)
                    image = None
                continue

            # Sample ID (e.g. s_0_0_0_0_1): nothing to store
            if '_' in line:
                continue

            # Image dimensions: create a numpy array of zeros
            dims = image_dimensions_regex.match(line)
            if dims:
                rows = int(dims.group(1))
                cols = int(dims.group(2))
                row_index = iter(range(rows))
                image = np.zeros((rows, cols))
                continue

            # Binary image: copy the digits into the next row of the array
            row_data = image_binary_regex.match(line)
            if row_data:
                image[next(row_index)] = [int(c) for c in row_data.group(1)]

    # Keep the last sample if the file does not end with a blank line
    if image is not None:
        images.append(image)

    return images
```

#### Saving NumPy Arrays as Images
Given a list of images and an output directory, this code saves each image as a JPEG. The arrays contain only 0s and 1s, so we multiply them by 255 to use the full 8-bit grayscale range. This means we will have to scale the pixel values back down when we load the images into our convnet.

```python
import os
import numpy as np
from PIL import Image

def save_images(images, output_dir):
    """Saves a list of 2D numpy arrays as JPEGs in the specified output directory"""
    for idx, image in enumerate(images):
        # Scale 0/1 values to 0/255 and store them as 8-bit grayscale
        img = Image.fromarray((image * 255).astype(np.uint8))
        img_name = str(idx) + '.jpg'
        output_path = os.path.join(output_dir, img_name)
        img.save(output_path)
```
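
To make the scaling remark concrete, here is a hedged sketch of what loading one of these images back might look like. It writes a toy 0/1 array as an 8-bit grayscale JPEG (to an in-memory buffer rather than to disk), reloads it, and divides by 255 to get values between 0 and 1. Because JPEG is a lossy format, the reloaded pixels are not guaranteed to be exactly 0 or 255, so the example also thresholds at 0.5. That threshold is an assumption made for illustration, not something the dataset prescribes.

```python
import io
import numpy as np
from PIL import Image

# Toy binary "character": a vertical bar in a 22 x 29 image
original = np.zeros((22, 29))
original[:, 21:24] = 1

# Save as an 8-bit grayscale JPEG, but in memory instead of on disk
buffer = io.BytesIO()
Image.fromarray((original * 255).astype(np.uint8)).save(buffer, format="JPEG")
buffer.seek(0)

# Reload and scale pixel values from [0, 255] back to [0, 1]
reloaded = np.array(Image.open(buffer)).astype(np.float32) / 255.0

# Optional (assumption): snap compression artifacts back to hard 0/1 values
binary = (reloaded > 0.5).astype(np.float32)

print(reloaded.shape, reloaded.min(), reloaded.max())
```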

#### Splitting Into Train, Validation, and Test Folders
Finally, this script creates all the output directories, iterates through all the files, and saves the images accordingly. Change the source and target directories to match your setup. Of the 2,187 samples per character, we use 1,187 for training, 500 for validation, and the remaining 500 for testing. You can adjust this split as well.

NOTE: This took 2 hours and 10 minutes to run on my MacBook Pro 👎🤕👎

```python
import os

# ------------------------------------------------------------
# UPDATE THESE
# ------------------------------------------------------------
phd08_source_dir = '/Users/jtbergman/Datasets/phd08/'
phd08_target_dir = '/Users/jtbergman/Datasets/phd08processed'

# Paths for train, test, validation directories
train_dir = os.path.join(phd08_target_dir, 'train')
test_dir = os.path.join(phd08_target_dir, 'test')
val_dir = os.path.join(phd08_target_dir, 'validation')

# Train / Val / Test split (the test set gets whatever remains)
train_split = 1187
val_split = 500

# ------------------------------------------------------------
# DON'T CHANGE
# ------------------------------------------------------------

def mkdir(directories):
    """Create directories if they don't already exist."""
    if not isinstance(directories, list):
        directories = [directories]
    for d in directories:
        os.makedirs(d, exist_ok=True)

mkdir([phd08_target_dir, train_dir, test_dir, val_dir])

def output_directories_for_file(file):
    """Return output directories for a character's images."""
    filename = os.path.basename(file)
    character = os.path.splitext(filename)[0]
    train_out = os.path.join(train_dir, character)
    test_out = os.path.join(test_dir, character)
    val_out = os.path.join(val_dir, character)
    mkdir([train_out, test_out, val_out])
    return train_out, test_out, val_out

# Iterate over all the files and save them to train/validation/test
for file in os.listdir(phd08_source_dir):
    # Skip anything that isn't a character file (e.g. .DS_Store on macOS)
    if not file.endswith('.txt'):
        continue
    train, test, val = output_directories_for_file(file)
    images = load_images(os.path.join(phd08_source_dir, file))
    save_images(images[:train_split], train)
    save_images(images[train_split:train_split+val_split], val)
    save_images(images[train_split+val_split:], test)
```
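
Once the script finishes, it is worth checking that every character folder received the expected number of images (1,187 / 500 / 500 with the split above). The following is a minimal sketch of such a check. `count_images` is a hypothetical helper, and it assumes the `train`, `validation`, and `test` folder layout created by the script. It is demonstrated here on a throwaway temporary directory so that it runs anywhere.

```python
import os
import tempfile

def count_images(split_dir):
    """Hypothetical helper: map each character folder to its .jpg count."""
    counts = {}
    for character in sorted(os.listdir(split_dir)):
        folder = os.path.join(split_dir, character)
        if os.path.isdir(folder):
            counts[character] = sum(
                name.endswith('.jpg') for name in os.listdir(folder)
            )
    return counts

# Demo on a temporary directory that mimics the layout (3 empty fake images)
with tempfile.TemporaryDirectory() as root:
    char_dir = os.path.join(root, 'train', '가')
    os.makedirs(char_dir)
    for idx in range(3):
        open(os.path.join(char_dir, f'{idx}.jpg'), 'w').close()
    demo_counts = count_images(os.path.join(root, 'train'))

print(demo_counts)
```

On the real output, and assuming every file parsed cleanly, calling `count_images` on the train folder should report 1,187 for each character, and 500 each for the validation and test folders.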

---
## Architecture Decisions