# Handwritten Chinese OCR Training Notebook

This notebook trains a ResNet + CTC OCR model for handwritten Chinese text recognition on the CASIA-HWDB2.x dataset.

**Requirements:**
- Google Drive with the project folder containing `main.py`, `test.py`, and the preprocessed dataset
- GPU runtime recommended (Runtime → Change runtime type → GPU)

## 1. Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
if os.path.exists('/content/drive/MyDrive'):
    print("✅ Google Drive already mounted")
else:
    drive.mount('/content/drive')
    print("✅ Google Drive mounted successfully")

## 2. Set Project Path

**⚠️ Modify `PROJECT_PATH` below to match your Google Drive folder location.**

In [None]:
# ⚠️ MODIFY THIS PATH to your project folder in Google Drive
PROJECT_PATH = '/content/drive/MyDrive/handwritten-chinese-ocr-samples'

# Change to project directory
%cd {PROJECT_PATH}

# Verify project structure
!ls -la
print("\n" + "="*50)
print(f"✅ Working directory: {os.getcwd()}")

## 3. Install Dependencies

In [None]:
# Install dependencies from requirements.txt
!pip install -q -r requirements.txt

# Verify PyTorch installation
import torch
print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")

## 4. Data Preparation Notes

### ⚠️ IMPORTANT: Dataset Preprocessing Required

Before training, the raw **CASIA-HWDB2.x** binary files (`.dgrl`, `.dgr`, `.gnt` formats) must be preprocessed into image files (PNG) and label files.

### Preprocessing Utilities

The following utility scripts are available in `utils/casia-hwdb-data-preparation/`:

| Script | Purpose | Input Format |
|--------|---------|-------------|
| `preprocess_dgrl.py` | Convert DGRL text line files to PNG + labels | `.dgrl` |
| `dgr2png.c` | C/C++ utility for converting DGR files | `.dgr` |
| `gnt2png.py` | Convert GNT character files to PNG | `.gnt` |

### Expected Dataset Structure After Preprocessing

```
data/hwdb2.0/
├── train/
│   ├── 000000.png
│   └── ...
├── val/
├── test/
├── train_img_id_gt.txt    # Format: image_name,label_text
├── val_img_id_gt.txt
├── test_img_id_gt.txt
└── chars_list.txt         # One character per line
```
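
The ground-truth files pair each image with its transcription. Below is a minimal sketch of reading one, assuming each line has the `image_name,label_text` form noted in the tree above and that only the first comma is a separator, so a label may itself contain commas. `parse_gt_line` and `load_gt_file` are hypothetical helper names, not part of the project, and whether `image_name` carries the `.png` extension is not specified here.

```python
from typing import Dict, Tuple


def parse_gt_line(line: str) -> Tuple[str, str]:
    """Split one 'image_name,label_text' line at the first comma only."""
    image_name, sep, label = line.rstrip("\r\n").partition(",")
    if not sep or not image_name:
        raise ValueError(f"Malformed ground-truth line: {line!r}")
    return image_name, label


def load_gt_file(path: str) -> Dict[str, str]:
    """Read a *_img_id_gt.txt file into {image_name: label_text}."""
    labels: Dict[str, str] = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                name, label = parse_gt_line(line)
                labels[name] = label
    return labels
```

For example, `load_gt_file('data/hwdb2.0/train_img_id_gt.txt')` would return a dictionary keyed by image name, provided the file follows the assumed layout.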

### Example: Preprocessing DGRL Files

```bash
python preprocess_dgrl.py --train_dir HWDB2.0Train --test_dir HWDB2.0Test --output_dir data/hwdb2.0 --val_split 0.1
```

### Compiling dgr2png.c (if needed)

```bash
cd utils/casia-hwdb-data-preparation
gcc -o dgr2png dgr2png.c
./dgr2png <input.dgr> <output_dir>
```

## 5. Verify Dataset

In [None]:
# Dataset path (modify if using a different location)
DATASET_PATH = 'data/hwdb2.0'

import os

required_files = [
    f'{DATASET_PATH}/train_img_id_gt.txt',
    f'{DATASET_PATH}/val_img_id_gt.txt', 
    f'{DATASET_PATH}/test_img_id_gt.txt',
    f'{DATASET_PATH}/chars_list.txt',
    f'{DATASET_PATH}/train',
    f'{DATASET_PATH}/val',
    f'{DATASET_PATH}/test'
]

print("Dataset verification:")
print("="*50)
all_ok = True
for f in required_files:
    exists = os.path.exists(f)
    status = "✅" if exists else "❌"
    print(f"{status} {f}")
    if not exists:
        all_ok = False

if all_ok:
    for split in ['train', 'val', 'test']:
        gt_file = f'{DATASET_PATH}/{split}_img_id_gt.txt'
        with open(gt_file, 'r', encoding='utf-8') as f:
            count = len(f.readlines())
        print(f"   {split}: {count} samples")
    with open(f'{DATASET_PATH}/chars_list.txt', 'r', encoding='utf-8') as f:
        num_chars = len(f.readlines())
    print(f"   Character vocabulary: {num_chars}")
else:
    print("\n⚠️ Some files are missing. Please preprocess the dataset first.")
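
Counting lines confirms that the files exist, but not that the labels and the vocabulary agree. The sketch below counts label characters that are missing from the vocabulary. It runs on toy in-memory data standing in for `chars_list.txt` and the label column of a ground-truth file. `find_unknown_chars` is a hypothetical helper, and how `main.py` treats out-of-vocabulary characters is not stated in this notebook.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def find_unknown_chars(labels: Iterable[str], vocab: Set[str]) -> Dict[str, int]:
    """Count label characters that do not appear in the vocabulary."""
    unknown: Counter = Counter()
    for text in labels:
        for ch in text:
            if ch not in vocab:
                unknown[ch] += 1
    return dict(unknown)


# Toy data for illustration only (not taken from CASIA-HWDB).
toy_vocab = {"你", "好", "世", "界"}
toy_labels = ["你好", "世界!", "好!"]
print(find_unknown_chars(toy_labels, toy_vocab))
```

An empty result means every label character is covered by the vocabulary.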

## 6. Training

In [None]:
# Training configuration
DATASET_PATH = 'data/hwdb2.0'
BATCH_SIZE = 8
EPOCHS = 10
PRINT_FREQ = 50
NUM_WORKERS = 2

# Run training
!python main.py -m hctr \
    -d {DATASET_PATH} \
    -b {BATCH_SIZE} \
    -ep {EPOCHS} \
    -pf {PRINT_FREQ} \
    -j {NUM_WORKERS}

## 7. Find Best Model

In [None]:
import glob
import os

model_files = glob.glob('hctr_*.pth.tar')

if model_files:
    print("Saved models:")
    for f in sorted(model_files):
        size_mb = os.path.getsize(f) / (1024*1024)
        print(f"  📁 {f} ({size_mb:.1f} MB)")
    
    acc_models = [f for f in model_files if 'acc' in f]
    if acc_models:
        BEST_MODEL = sorted(acc_models)[-1]
    else:
        BEST_MODEL = 'hctr_checkpoint.pth.tar'
    print(f"\n✅ Selected model: {BEST_MODEL}")
else:
    print("❌ No model files found. Please run training first.")
    BEST_MODEL = None
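
Note that `sorted(acc_models)[-1]` compares filenames as strings, so a name containing `9.5` sorts after one containing `10.2`. If the checkpoint names embed the accuracy as a number (an assumption, since the naming scheme used by `main.py` is not shown here), a numeric comparison is safer. The sketch below uses a hypothetical helper, `pick_best_by_accuracy`, and made-up filenames.

```python
import re
from typing import List, Optional

# Assumed pattern: the text "acc", optional non-digit separators, then a number.
_ACC_RE = re.compile(r"acc[^0-9]*([0-9]+(?:\.[0-9]+)?)")


def pick_best_by_accuracy(filenames: List[str]) -> Optional[str]:
    """Return the filename with the highest embedded accuracy, or None."""
    scored = []
    for name in filenames:
        match = _ACC_RE.search(name)
        if match:
            scored.append((float(match.group(1)), name))
    if not scored:
        return None
    return max(scored)[1]


# Hypothetical filenames for illustration only.
candidates = [
    "hctr_acc_9.5.pth.tar",
    "hctr_acc_10.2.pth.tar",
    "hctr_checkpoint.pth.tar",
]
print(pick_best_by_accuracy(candidates))  # hctr_acc_10.2.pth.tar
```

If the real filenames follow a different scheme, the regular expression would need adjusting before this is used to set `BEST_MODEL`.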

## 8. Evaluation

In [None]:
# Evaluation configuration
DATASET_PATH = 'data/hwdb2.0'
MODEL_FILE = BEST_MODEL
TEST_PATH = f'{DATASET_PATH}/test'
BATCH_SIZE = 16

if MODEL_FILE and os.path.exists(MODEL_FILE):
    print(f"Evaluating model: {MODEL_FILE}")
    print(f"Test set: {TEST_PATH}")
    print("="*50)
    
    !python test.py -m hctr \
        -f {MODEL_FILE} \
        -i {TEST_PATH} \
        -b {BATCH_SIZE} \
        -bm \
        -dm greedy-search \
        -pf 20
else:
    print("❌ Model file not found. Please run training first.")
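
The `-dm greedy-search` option selects greedy decoding of the CTC output. As a sketch of the general technique (not necessarily how `test.py` implements it): take the most likely class at each time step, merge consecutive repeats, then drop blanks. The blank index of 0, the class layout, and the toy scores are assumptions made for this example.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    classes: Sequence[str],
    blank: int = 0,
) -> str:
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.

    frame_scores[t][k] is the score of class k at time step t, and
    classes[k] is the character for class k (classes[blank] is unused).
    """
    decoded: List[str] = []
    previous = blank
    for scores in frame_scores:
        best = max(range(len(scores)), key=lambda k: scores[k])
        # Emit only when the class is not blank and differs from the last frame.
        if best != blank and best != previous:
            decoded.append(classes[best])
        previous = best
    return "".join(decoded)


# Toy example: the per-frame argmax sequence is 1, 1, 0, 1, 2, 2.
toy_classes = ["<blank>", "你", "好"]
toy_frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(toy_frames, toy_classes))  # 你你好
```

The blank between the second and fourth frames is what lets the repeated character survive: without it, the two runs of class 1 would merge into one.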

## 9. Save Model to Drive

In [None]:
import glob
import os
import shutil

SAVE_DIR = f'{PROJECT_PATH}/checkpoints'
os.makedirs(SAVE_DIR, exist_ok=True)

model_files = glob.glob('hctr_*.pth.tar')
for f in model_files:
    dst = os.path.join(SAVE_DIR, f)
    shutil.copy2(f, dst)
    print(f"✅ Saved: {dst}")

print(f"\nAll models saved to: {SAVE_DIR}")