## OCR Training (PaddleOCR)

(This notebook only sets up the files needed for training: the label files, the character dictionary, and the config.)

### 1. Install Dependencies & Clone PaddleOCR Repo

In [1]:
!pip install paddlepaddle-gpu -q
!pip install paddleocr pyyaml -q
!pip install pandas -q
!pip install numpy -q
!pip install scikit-learn -q
!pip install jiwer -q

# Clone the repo if not already cloned.
# A Python check replaces the bash-style `if`, which fails on Windows
# with "! was unexpected at this time."
import os
if not os.path.isdir("PaddleOCR"):
    !git clone https://github.com/PaddlePaddle/PaddleOCR.git


In [2]:
# Install the repo's requirements
!pip install -r PaddleOCR/requirements.txt -q




### 2. Define Paths and Load DataFrames

In [None]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import yaml # For creating the config file
from jiwer import wer, cer # For evaluation
from sklearn.model_selection import train_test_split # To split our data

def _wer(gt, pred):
    return wer(gt, pred)

def _cer(gt, pred):
    return cer(gt, pred)

# --- OUTPUT DIRECTORY FOR PADDLE TRAINING DATA ---
PADDLE_DATA_DIR = Path("./paddle_training_data")
PADDLE_DATA_DIR.mkdir(exist_ok=True)

# --- FILE PATHS ---
GT_CSV = Path("./ground_truth_lines.csv") # Main ground truth file

TRAIN_LABEL_FILE = PADDLE_DATA_DIR / "train_label.txt"
TEST_LABEL_FILE = PADDLE_DATA_DIR / "test_label.txt"
DICT_FILE = PADDLE_DATA_DIR / "dict.txt"

# --- LOAD DATAFRAME ---
# TODO: drop the train/test split and read from folders instead when training on the full dataset
try:
    gt_df = pd.read_csv(GT_CSV)
    print(f"Loaded {len(gt_df)} total samples from {GT_CSV}.")

    # Validate the required columns before doing anything else with the data
    if 'image_path' not in gt_df.columns or 'transcription' not in gt_df.columns:
        raise ValueError("CSV must contain 'image_path' and 'transcription' columns.")

    # Split the data into training and testing sets (90% train, 10% test)
    train_df, test_df = train_test_split(gt_df, test_size=0.1, random_state=42)

    print(f"Split into {len(train_df)} training samples and {len(test_df)} testing samples.")

    print("\nSample image path from CSV:")
    print(train_df.iloc[0]['image_path'])
    print("\nTraining DataFrame head:")
    print(train_df.head())

except FileNotFoundError:
    print(f"Error: Could not find {GT_CSV}.")
except Exception as e:
    print(f"An error occurred: {e}")

Loaded 9 total samples from ground_truth_lines.csv.
Split into 8 training samples and 1 testing samples.

Sample image path from CSV:
splits/data_1/line_1.png

Training DataFrame head:
                  image_path  \
1   splits/data_1/line_1.png   
5   splits/data_1/line_6.png   
0   splits/data_1/line_0.png   
8   splits/data_1/line_9.png   
2  splits/data_1/line_10.png   

                                       transcription  
1                 Sig: 1 tab once a morning for itch  
5                                                 #5  
0                             Loratadine 10mg/tab #5  
8            3. Clobetasol Propionate 0.05% cream #1  
2  Sig: Ipahid nang manipis sa mga apektadong bahagi  


### 3. Prepare PaddleOCR Data Files

PaddleOCR's recognition training script needs two kinds of files:
1.  **Label files (`.txt`):** text files in which each line is `image_path\ttranscription`, i.e. the image path and its transcription separated by a single tab.
2.  **Dictionary file (`dict.txt`):** a vocabulary file listing every unique character that appears in the transcriptions, one character per line.
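
To make the two formats concrete, here is a minimal sketch that builds label lines and a vocabulary from two rows taken from the DataFrame preview above. The helper names `to_label_line` and `build_vocab` are made up for illustration and are not part of PaddleOCR.

```python
# Two sample rows copied from the DataFrame preview above.
rows = [
    ("splits/data_1/line_0.png", "Loratadine 10mg/tab #5"),
    ("splits/data_1/line_6.png", "#5"),
]

def to_label_line(image_path: str, transcription: str) -> str:
    """Format one sample as a tab-separated label line (hypothetical helper)."""
    return f"{image_path}\t{transcription}\n"

def build_vocab(transcriptions) -> list:
    """Return the sorted unique characters across all transcriptions (hypothetical helper)."""
    return sorted(set("".join(transcriptions)))

label_lines = [to_label_line(path, text) for path, text in rows]
vocab = build_vocab(text for _, text in rows)

print(repr(label_lines[0]))  # the \t separator is visible in the repr
print(vocab)                 # one entry per unique character, space included
```

Each element of `vocab` would become one line of `dict.txt`.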

In [None]:
# --- 1. Generate Label Files ---
print(f"Writing {TRAIN_LABEL_FILE}...")
with open(TRAIN_LABEL_FILE, "w", encoding="utf-8") as f:
    for _, row in train_df.iterrows():
        f.write(f"{str(row['image_path']).replace('//', '/')}\t{row['transcription']}\n")

print(f"Writing {TEST_LABEL_FILE}...")
with open(TEST_LABEL_FILE, "w", encoding="utf-8") as f:
    for _, row in test_df.iterrows():
        f.write(f"{str(row['image_path']).replace('//', '/')}\t{row['transcription']}\n")

# --- 2. Generate Dictionary (Vocabulary) File ---
print(f"Generating {DICT_FILE}...")
# Cast to str so non-string cells (e.g. purely numeric transcriptions) do not break the join
all_text = "".join(gt_df["transcription"].astype(str).tolist())
unique_chars = sorted(set(all_text))

with open(DICT_FILE, "w", encoding="utf-8") as f:
    for char in unique_chars:
        f.write(f"{char}\n")

print(f"Dictionary created with {len(unique_chars)} unique characters.")

Writing paddle_training_data\train_label.txt...
Writing paddle_training_data\test_label.txt...
Generating paddle_training_data\dict.txt...
Dictionary created with 38 unique characters.
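
Before training, it can help to sanity-check the generated label lines against the dictionary. The sketch below is one possible check, not something PaddleOCR provides: `find_label_problems` is a hypothetical helper, and the length limit of 50 mirrors `max_text_length` from the config further down, on the assumption that longer samples are skipped during label encoding.

```python
def find_label_problems(label_lines, vocab, max_text_length=50):
    """Return (line_number, reason) pairs for lines that look unlikely to train cleanly."""
    allowed = set(vocab)
    problems = []
    for number, line in enumerate(label_lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            problems.append((number, "expected exactly one tab"))
            continue
        _, text = parts
        if not text:
            problems.append((number, "empty transcription"))
        elif len(text) > max_text_length:
            problems.append((number, "longer than max_text_length"))
        elif not set(text) <= allowed:
            problems.append((number, "character missing from dictionary"))
    return problems

# Made-up lines for illustration only.
sample_vocab = list("#5abt ")
sample_lines = [
    "a.png\t#5\n",
    "b.png\ttab 5\n",
    "c.png no tab here\n",
    "d.png\tXYZ\n",
]
print(find_label_problems(sample_lines, sample_vocab))
```

In the notebook, the same function could be applied to the lines read back from `TRAIN_LABEL_FILE` together with `unique_chars`.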


### Test: Using a Pre-trained Model as a Baseline

In [None]:
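import tarfile
import urllib.request

PRETRAINED_MODEL_URL = "https://paddleocr.bj.bcebos.com/PP-OCRv4/english/en_PP-OCRv4_rec_train.tar"
PRETRAINED_MODEL_TAR = Path("./en_PP-OCRv4_rec_train.tar")
PRETRAINED_MODEL_PATH = Path("./en_PP-OCRv4_rec_train")

if not PRETRAINED_MODEL_PATH.exists():
    if not PRETRAINED_MODEL_TAR.exists():
        print("Downloading pre-trained model...")
        urllib.request.urlretrieve(PRETRAINED_MODEL_URL, PRETRAINED_MODEL_TAR)
    print("Extracting pre-trained model...")
    with tarfile.open(PRETRAINED_MODEL_TAR) as tar:
        tar.extractall()
    print(f"Model downloaded and extracted to {PRETRAINED_MODEL_PATH}")
else:
    print(f"Pre-trained model already exists at {PRETRAINED_MODEL_PATH}")

# Resolve directly to an absolute path; base_dir is only defined in a later cell
pretrained_model_path = str((PRETRAINED_MODEL_PATH / "best_accuracy").resolve())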

Pre-trained model already exists at en_PP-OCRv4_rec_train
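
A baseline is only useful with a metric attached. The notebook imports `wer` and `cer` from `jiwer` for this; the sketch below is a dependency-free approximation of character error rate (edit distance divided by reference length) for a single pair of strings. It is a stand-in for illustration, `jiwer` may normalize text differently, and the prediction string is made up rather than actual model output.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

ground_truth = "Loratadine 10mg/tab #5"
prediction = "Loratadine 10mg/tab #6"  # hypothetical prediction, one wrong character
print(f"CER: {char_error_rate(ground_truth, prediction):.4f}")
```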


### 4. Create Custom Training Configuration (`.yml`)

PaddleOCR training is driven by a YAML configuration file. We write a standalone file, `my_custom_config.yml`, that defines a CRNN recognition model and points the dataset, dictionary, and output paths at our data.

In [None]:
# absolute paths because the training script runs from within the 'PaddleOCR' directory
base_dir = os.path.abspath(os.getcwd())
dict_path = os.path.join(base_dir, DICT_FILE)
train_label_path = os.path.join(base_dir, TRAIN_LABEL_FILE)
test_label_path = os.path.join(base_dir, TEST_LABEL_FILE)
save_model_dir = os.path.join(base_dir, "output/my_paddle_model")

In [5]:
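# Use forward slashes: these two paths are embedded in double-quoted YAML strings,
# where a backslash would start an escape sequence
train_label_path = train_label_path.replace('\\', '/')
test_label_path = test_label_path.replace('\\', '/')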

use_gpu = "true"
epochs = 100
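# Absolute root: image paths in the label files are relative to the notebook directory,
# but the training script runs from inside the 'PaddleOCR' directory
images_root = base_dir.replace('\\', '/')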
batch_size = 8
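
The slash replacement matters because the label paths end up inside double-quoted YAML strings, where a backslash begins an escape sequence. As an alternative sketch, `pathlib.PureWindowsPath.as_posix()` performs the same conversion and can be tried on any operating system. The helper name and the example path are made up for illustration.

```python
from pathlib import PureWindowsPath

def to_forward_slashes(path: str) -> str:
    """Convert a Windows-style path to forward slashes (hypothetical helper)."""
    return PureWindowsPath(path).as_posix()

example = r"C:\work\paddle_training_data\train_label.txt"  # hypothetical path
print(to_forward_slashes(example))
print(to_forward_slashes("paddle_training_data/train_label.txt"))  # already normalized, unchanged
```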

In [6]:
# Define the custom config content
config_content = f"""
Global:
  debug: false
  use_gpu: {use_gpu}
  epoch_num: {epochs}
  distributed: false
  save_model_dir: {save_model_dir}
  save_epoch_step: 10
  print_batch_step: 10
  eval_batch_step: [0, 1000] # [start_iter, interval]: evaluate every 1000 iterations after iteration 0
  save_best: "acc"
  cal_metric_during_train: true
  pretrained_model: ""
  checkpoints: ""
  save_inference_dir: {save_model_dir}/inference
  use_visualdl: true
  infer_img: ""
  character_dict_path: {dict_path}
  max_text_length: 50
  use_space_char: true
  infer_mode: false
  log_smooth_window: 20
  save_res_path: {save_model_dir}/train_results.txt
Optimizer:
  name: Adam
  lr:
    name: Cosine
    learning_rate: 0.001
  regularizer:
    name: L2
    factor: 0.00001
Architecture:
  model_type: rec
  algorithm: CRNN
  Transform:
  Backbone:
    name: ResNet
    layers: 34
  Neck:
    name: SequenceEncoder
    encoder_type: rnn
    hidden_size: 256
  Head:
    name: CTCHead
    fc_decay: 0.00001
Loss:
  name: CTCLoss
PostProcess:
  name: CTCLabelDecode
  character_dict_path: {dict_path}
  use_space_char: true
Metric:
  name: RecMetric
  main_indicator: acc
  character_dict_path: {dict_path}
  use_space_char: true
Train:
  dataset:
    name: SimpleDataSet
    data_dir: {images_root}
    label_file_list: ["{train_label_path}"]
    transforms:
      - DecodeImage: {{img_mode: BGR, channel_first: false}}
      - RecResizeImg: {{image_shape: [3, 32, 320]}}
      - CTCLabelEncode: {{}}
      - KeepKeys: {{keep_keys: ["image", "label", "length"]}}
  loader:
    shuffle: true
    batch_size_per_card: {batch_size}
    drop_last: true
    num_workers: 4
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: {images_root}
    label_file_list: ["{test_label_path}"]
    transforms:
      - DecodeImage: {{img_mode: BGR, channel_first: false}}
      - RecResizeImg: {{image_shape: [3, 32, 320]}}
      - CTCLabelEncode: {{}}
      - KeepKeys: {{keep_keys: ["image", "label", "length"]}}
  loader:
    shuffle: false
    batch_size_per_card: {batch_size}
    drop_last: false
    num_workers: 2
"""

# Save the config file inside the PaddleOCR directory
CONFIG_FILE_PATH = Path("PaddleOCR/configs/rec/my_custom_config.yml")
with open(CONFIG_FILE_PATH, "w", encoding="utf-8") as f:
    f.write(config_content)

print(f"Custom config file saved to {CONFIG_FILE_PATH}")

Custom config file saved to PaddleOCR\configs\rec\my_custom_config.yml
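
It is worth checking how the schedule in this config plays out on a dataset this small. The sketch below does the arithmetic for the 8 training samples, batch size 8, `drop_last: true`, and 100 epochs used above. The evaluation rule in `eval_steps` is an assumption based on how the training loop appears to interpret `eval_batch_step: [start, interval]` on a single device, so treat the result as indicative rather than authoritative.

```python
def iterations_per_epoch(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches the loader yields per epoch."""
    full_batches, remainder = divmod(num_samples, batch_size)
    if drop_last or remainder == 0:
        return full_batches
    return full_batches + 1

def eval_steps(total_iterations: int, start: int, interval: int) -> list:
    """Iterations at which evaluation would run (assumed rule, see lead-in)."""
    return [step for step in range(1, total_iterations + 1)
            if step > start and (step - start) % interval == 0]

per_epoch = iterations_per_epoch(num_samples=8, batch_size=8, drop_last=True)
total = per_epoch * 100  # epoch_num from the config
print(f"{per_epoch} iteration(s) per epoch, {total} in total")
print("evaluation at iterations:", eval_steps(total, start=0, interval=1000))
```

Under that assumption no evaluation would run within 100 iterations, so a smaller interval in `eval_batch_step` may be needed for a best-accuracy checkpoint to be saved on a dataset of this size.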
