#### Introduction
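- This notebook shows how to extend the "OCR model for reading Captchas" example to the IAM dataset, whose text labels have variable length.
- Each sample in the dataset is an image of handwritten text, and its target is the string that appears in the image.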

#### Data
- Preview how the dataset is organized.
- Lines starting with # are just metadata.

In [15]:
!head -20 data/IAM_Words/words.txt

#--- words.txt ---------------------------------------------------------------#
#
# iam database word information
#
# format: a01-000u-00-00 ok 154 1 408 768 27 51 AT A
#
#     a01-000u-00-00  -> word id for line 00 in form a01-000u
#     ok              -> result of word segmentation
#                            ok: word was correctly
#                            er: segmentation of word can be bad
#
#     154             -> graylevel to binarize the line containing this word
#     1               -> number of components for this word
#     408 768 27 51   -> bounding box around this word in x,y,w,h format
#     AT              -> the grammatical tag for this word, see the
#                        file tagset.txt for an explanation
#     A               -> the transcription for this word
#
a01-000u-00-00 ok 154 408 768 27 51 AT A
a01-000u-00-01 ok 154 507 766 213 48 NN MOVE
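
- The header above describes the fields of a record. As a minimal sketch (the helper name `parse_word_line` is hypothetical and not part of the notebook), a record can be split so that the first field is the word id, the second is the segmentation result, and the last is the transcription. Only these three positions are relied on, because the sample rows shown above have one field fewer than the format line in the header.

```python
def parse_word_line(line):
    """Return (word_id, status, transcription) for one record of words.txt.

    Hypothetical helper for illustration; returns None for blank and
    metadata lines.
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.strip().split(" ")
    # First field: word id, second: segmentation result, last: transcription.
    return fields[0], fields[1], fields[-1]


print(parse_word_line("a01-000u-00-01 ok 154 507 766 213 48 NN MOVE\n"))
# ('a01-000u-00-01', 'ok', 'MOVE')
```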


#### Setup

In [16]:
from keras.layers import StringLookup
import keras

import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import os

#### Dataset splitting

In [17]:
base_path = "data/IAM_Words/"
words_list = []

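# Read the annotation file and close the handle when done
with open(os.path.join(base_path, "words.txt"), "r") as f:
    words = f.readlines()
for line in words:
    if line[0] == "#":
        continue
    if line.split(" ")[1] != "err":  # skip words flagged with a segmentation error
        words_list.append(line)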

print(len(words_list))
np.random.shuffle(words_list)

96456


In [18]:
# Split the dataset into 90% train, 5% validation, and 5% test
split_idx = int(0.9 * len(words_list))
train_samples = words_list[:split_idx]
test_samples = words_list[split_idx:]

val_split_idx  = int(0.5 * len(test_samples))
validation_samples = test_samples[:val_split_idx]
test_samples = test_samples[val_split_idx:]

assert len(words_list) == len(train_samples) + len(validation_samples) + len(test_samples)

print(f"Total training samples: {len(train_samples)}")
print(f"Total validation samples: {len(validation_samples)}")
print(f"Total test samples: {len(test_samples)}")

Total training samples: 86810
Total validation samples: 4823
Total test samples: 4823
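
- The split arithmetic above can be wrapped in a small function. This is only a sketch: `split_samples` and its `train_frac` parameter are hypothetical names, and the held-out part is halved between validation and test as in the cell above.

```python
def split_samples(samples, train_frac=0.9):
    """Split a list into train / validation / test parts.

    Hypothetical helper: the first `train_frac` of the list is used for
    training, and the remainder is halved into validation and test.
    """
    split_idx = int(train_frac * len(samples))
    train, held_out = samples[:split_idx], samples[split_idx:]
    val_split_idx = int(0.5 * len(held_out))
    return train, held_out[:val_split_idx], held_out[val_split_idx:]


# Toy data with the same number of items as the filtered word list above.
train, validation, test = split_samples(list(range(96456)))
print(len(train), len(validation), len(test))
# 86810 4823 4823
```

- Shuffling is left to the caller, so the function itself stays deterministic.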


#### Data input pipeline

In [19]:
# Prepare the image paths
base_image_path = os.path.join(base_path)

def get_image_paths_and_labels(samples):
    paths = []
    corrected_samples = []
    for (i, file_line) in enumerate(samples):
        line_split = file_line.strip()
        line_split = line_split.split(" ")

        # After splitting a line, the matching image path has the form:
        # part1/part1-part2/part1-part2-part3.png
        image_name = line_split[0]
        partI = image_name.split("-")[0]
        partII = image_name.split("-")[1]
        img_path = os.path.join(
            base_image_path, partI, partI + "-" + partII, image_name + ".png"
        )
        if os.path.getsize(img_path):
            paths.append(img_path)
            corrected_samples.append(file_line.split("\n")[0])

    return paths, corrected_samples

train_img_paths, train_labels = get_image_paths_and_labels(train_samples)
validation_img_paths, validation_labels = get_image_paths_and_labels(validation_samples)
test_img_paths, test_labels = get_image_paths_and_labels(test_samples)
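
- To make the path layout concrete, the sketch below builds the path for a single word id. `word_id_to_path` is a hypothetical helper, and the base directory is passed in as an argument instead of being read from the notebook's `base_image_path`.

```python
import os


def word_id_to_path(base_dir, word_id):
    """Build base_dir/part1/part1-part2/word_id.png from a word id."""
    # A word id such as "a01-000u-00-00" starts with part1 and part2.
    part1, part2 = word_id.split("-")[:2]
    return os.path.join(base_dir, part1, f"{part1}-{part2}", word_id + ".png")


print(word_id_to_path("data/IAM_Words", "a01-000u-00-00"))
# On a POSIX system: data/IAM_Words/a01/a01-000u/a01-000u-00-00.png
```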

In [22]:
# Find the maximum label length and the vocabulary size in the training data
train_labels_cleaned = []
characters = set()
max_len = 0

for label in train_labels:
    label = label.split(" ")[-1].strip()
    for char in label:
        characters.add(char)

    max_len = max(max_len, len(label))
    train_labels_cleaned.append(label)

characters = sorted(list(characters))

print(f"Maximum length: {max_len}")
print(f"Vocab size: {len(characters)}")

# Check a few labels
train_labels_cleaned[:10]

Maximum length: 21
Vocab size: 78


['present', 'the', 'is', 'one', 'was', 'was', 'that', 'Europe', 'get', 'those']
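
- The same loop can be tried on the two records from the words.txt preview to see what it computes. `label_stats` is a hypothetical helper name. The sketch mirrors the cell above and does not change it.

```python
def label_stats(labels):
    """Return (cleaned labels, sorted characters, maximum label length)."""
    cleaned, characters, max_len = [], set(), 0
    for label in labels:
        # The transcription is the last space-separated field of a record.
        text = label.split(" ")[-1].strip()
        characters.update(text)
        max_len = max(max_len, len(text))
        cleaned.append(text)
    return cleaned, sorted(characters), max_len


records = [
    "a01-000u-00-00 ok 154 408 768 27 51 AT A",
    "a01-000u-00-01 ok 154 507 766 213 48 NN MOVE",
]
cleaned, characters, max_len = label_stats(records)
print(cleaned, characters, max_len)
# ['A', 'MOVE'] ['A', 'E', 'M', 'O', 'V'] 4
```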