In [1]:
from docx import Document
import time
from OCRDataGenerator import OCRDataGenerator

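def extract_docx_text(docx_path, max_chars=30):
    """Pack the document's words into chunks of at most max_chars characters."""
    doc = Document(docx_path)
    all_text = []
    buffer_text = ""

    for para in doc.paragraphs:
        # split() without arguments also discards empty strings from repeated spaces
        for word in para.text.split():
            if buffer_text and len(buffer_text) + len(word) + 1 > max_chars:
                # The word does not fit: flush the current chunk and start a new one
                all_text.append(buffer_text)
                buffer_text = word
            elif buffer_text:
                buffer_text += " " + word
            else:
                # First word of a chunk: no leading space, no empty chunk emitted
                buffer_text = word

    # Flush whatever remains after the last paragraph
    if buffer_text:
        all_text.append(buffer_text)

    return all_text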

def generate_images_from_docx(docx_path, fonts, output_dir):
    generator = OCRDataGenerator(font_paths=fonts)

    texts = extract_docx_text(docx_path)

    for index, text in enumerate(texts):
        text = text.strip()

        # Skip empty chunks
        if not text:
            continue

        image, metadata = generator.generate_image(
            text=text,
            min_font_size=24,
            max_font_size=48,
            horizontal_padding=40,
            vertical_padding=20,
            min_height=64,
            add_noise=False,
            random_transform=False
        )
        print(f"Text: {text}")
        print(f"Image size: {metadata['image_size']}\n")

        # A nanosecond timestamp plus the chunk index keeps file names unique
        # even when two images are generated within the same clock tick
        stem = f"{output_dir}/{time.time_ns()}_{index}"

        # Save the rendered line as a TIFF image
        image.save(f"{stem}.tif")

        # Save the matching ground-truth text next to the image
        with open(f"{stem}.gt.txt", "w", encoding="utf-8") as text_file:
            text_file.write(text)

        print(f"Saved image for word: {text}")

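The chunking step in `extract_docx_text` can be tried on plain strings, without a `.docx` file or the image generator. The sketch below is a minimal stand-alone version of the same greedy packing; `chunk_words` and the sample sentences are hypothetical and not part of the notebook, and the limit is lowered to 20 only to keep the example short.

```python
def chunk_words(paragraphs, max_chars=30):
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars becomes its own oversized chunk.
    The buffer is carried across paragraph boundaries, as in the notebook.
    """
    chunks = []
    buffer = ""
    for para in paragraphs:
        for word in para.split():
            if buffer and len(buffer) + 1 + len(word) > max_chars:
                # Word does not fit: close the current chunk, start a new one
                chunks.append(buffer)
                buffer = word
            elif buffer:
                buffer += " " + word
            else:
                buffer = word
    if buffer:
        chunks.append(buffer)
    return chunks


# Hypothetical sample input, two "paragraphs"
sample = ["the quick brown fox jumps over the lazy dog", "pack my box"]
chunks = chunk_words(sample, max_chars=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Note that `len` counts Unicode code points, not visible glyphs. Shan text uses many combining vowel and tone marks, so a 30-character chunk usually renders as noticeably fewer than 30 glyph clusters.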

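Each sample is written as a `.tif` image plus a `.gt.txt` transcription sharing the same base name, so after a run of the next cell it may be worth checking that no file is left without its partner. This is a rough sketch of such a check; `find_unpaired` is a hypothetical helper, and the temporary directory with empty files only stands in for a real output folder.

```python
import tempfile
from pathlib import Path


def find_unpaired(output_dir):
    """Return (images without a transcription, transcriptions without an image)."""
    out = Path(output_dir)
    # Strip the full suffix by hand: base names may themselves contain dots
    # (for example a float timestamp), so Path.stem would cut in the wrong place
    image_stems = {p.name[: -len(".tif")] for p in out.glob("*.tif")}
    text_stems = {p.name[: -len(".gt.txt")] for p in out.glob("*.gt.txt")}
    return sorted(image_stems - text_stems), sorted(text_stems - image_stems)


# Stand-in output folder: "a" is a complete pair, "b" has no transcription
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.tif", "a.gt.txt", "b.tif"):
        (Path(tmp) / name).write_bytes(b"")
    missing_text, missing_image = find_unpaired(tmp)
    print("images without text:", missing_text)
    print("text without images:", missing_image)
```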
In [2]:
from pathlib import Path

datasets = Path("./kawtai-dataset")
output_dir = "./output"
fonts = [
    "./Shan.ttf",
    "./PangLong.ttf"
]

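# Make sure the output directory exists before any file is written to it
Path(output_dir).mkdir(parents=True, exist_ok=True)

for file_path in datasets.rglob("*.docx"):
    # Skip directories and Word lock files, whose names start with "~$"
    if file_path.is_file() and not file_path.name.startswith("~$"):
        generate_images_from_docx(str(file_path), fonts, output_dir)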

Text: 10။ သိုဝ်ႇၶၢဝ်ႇ လႄႈ
Image size: (454, 107)

Saved image for word: 10။ သိုဝ်ႇၶၢဝ်ႇ လႄႈ
Text: ၵၢၼ်မိူင်း
Image size: (246, 103)

Saved image for word: ၵၢၼ်မိူင်း
Text: ႁဝ်းၸၢမ်းဝူၼ်ႉတူၺ်းလူး
Image size: (374, 81)

Saved image for word: ႁဝ်းၸၢမ်းဝူၼ်ႉတူၺ်းလူး
Text: ဝႃႈႁဝ်းလႆႈႁူႉၸွမ်း ငဝ်းလၢႆး
Image size: (430, 85)

Saved image for word: ဝႃႈႁဝ်းလႆႈႁူႉၸွမ်း ငဝ်းလၢႆး
Text: လႄႈ
Image size: (160, 70)

Saved image for word: လႄႈ
Text: ၶေႃႈမုၼ်းၵၢၼ်မိူင်းၸိူဝ်းၼႆႉ
Image size: (614, 107)

Saved image for word: ၶေႃႈမုၼ်းၵၢၼ်မိူင်းၸိူဝ်းၼႆႉ
Text: တီႈလႂ် မႃး။ ၵမ်ႈၼမ်ၼမ်တႄႉ
Image size: (502, 88)

Saved image for word: တီႈလႂ် မႃး။ ၵမ်ႈၼမ်ၼမ်တႄႉ
Text: တိုၼ်းတေႁပ်ႉႁၼ်ဝႃႈ
Image size: (453, 101)

Saved image for word: တိုၼ်းတေႁပ်ႉႁၼ်ဝႃႈ
Text: လႆႈတီႈသိုဝ်ႇၶၢဝ်ႇမႃးၼႆယူႇ။
Image size: (433, 88)

Saved image for word: လႆႈတီႈသိုဝ်ႇၶၢဝ်ႇမႃးၼႆယူႇ။
Text: ပေႃးၼၼ်
Image size: (232, 91)

Saved image for word: ပေႃးၼၼ်
Text: သိုဝ်ႇၶၢဝ်ႇဢၼ်ဝႃႈၼၼ်ႉသမ်ႉ
Image size: (419, 80)

Saved image for word: သိုဝ်ႇၶၢဝ်ႇဢၼ်ဝႃႈၼၼ