
LeeHakHo/clipstr


VLM-based model with contrastive learning for Scene Text Recognition

[Figure: CLIPSTR structure]
[Figure: CLIPSTR example]

CLIPSTR Training Instructions

1. Project Directory

/home/ohh/PycharmProject/clipseq/

2. Execution Command

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NCCL_P2P_DISABLE=1 \
HYDRA_FULL_ERROR=1 \
python3 train.py \
  +experiment=parclip \
  dataset=real \
  charset=36_lowercase

3. Key Code Files

  • /strhub/models/parclip/system.py
  • /strhub/models/parclip/simclr.py

4. Experiment Setup

  1. Base configuration: /configs/main.yaml. Adjust dataset paths, batch sizes, or learning rates there.
  2. For fresh encoding + contrastive training, in system.py set:
    • self.prompt = True
    • self.new = True
    • self.contrastive = True
    • All other flags (load, save, etc.) = False
    To persist encoded features, also set self.save = True.
  3. To skip encoding and reuse saved features, in system.py set:
    • self.prompt = True
    • self.new = True
    • self.load = True
    • All other flags = False
    This will load precomputed embeddings and resume training without re-encoding.
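The two flag combinations above can be summarized in a small lookup. This is only a sketch: the flag names (prompt, new, contrastive, load, save) come from the steps above, but the preset names and the `flags_for` helper are hypothetical and do not exist in system.py. The flags themselves still have to be set by editing system.py.

```python
# Hypothetical summary of the flag settings described above.
# Only the flag names come from the instructions; system.py may define
# further flags ("etc." above) that are not listed here.
ALL_FLAGS = ("prompt", "new", "contrastive", "load", "save")

FLAG_PRESETS = {
    # Step 2: fresh encoding + contrastive training
    "fresh_contrastive": {"prompt", "new", "contrastive"},
    # Step 3: skip encoding and reuse saved features
    "reuse_saved_features": {"prompt", "new", "load"},
}


def flags_for(preset, persist_features=False):
    """Return a {flag_name: bool} mapping for one of the presets above."""
    enabled = set(FLAG_PRESETS[preset])
    if persist_features:
        # The instructions mention self.save only for the fresh-encoding case.
        if preset != "fresh_contrastive":
            raise ValueError("save is only described for fresh encoding")
        enabled.add("save")
    return {name: name in enabled for name in ALL_FLAGS}


for preset in FLAG_PRESETS:
    print(preset, flags_for(preset))
```

Every flag not listed in a preset comes out as False, which mirrors the "all other flags = False" rule.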

Prior Work

This work extends PARSeq (Apache License 2.0; arXiv preprint; in Proc. ECCV 2022; Gradio demo).

Sample Results

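| Input Image   | PARSeq-SA     | ABINet        | TRBA          | ViTSTR-S      | CRNN          |
|:--------------|:--------------|:--------------|:--------------|:--------------|:--------------|
| CHEWBACCA     | CHEWBACCA     | CHEWBAGGA     | CHEWBACCA     | CHEWBACCA     | CHEWUACCA     |
| Chevron       | Chevrol       | Chevro_       | Chevro_       | Chevr__       | Chevr__       |
| SALMON        | SALMON        | SALMON        | SALMON        | SALMON        | SA_MON        |
| Verbandstoffe | Verbandsteffe | Verbandsteffe | Verbandstelle | Verbandsteffe | Verbandsleffe |
| Kappa         | Kappa         | Kappa         | Kaspa         | Kappa         | Kaada         |
| 3rdAve        | 3rdAve        | 3=-Ave        | 3rdAve        | 3rdAve        | Coke          |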

NOTE: Bold letters and underscores indicate wrong and missing character predictions, respectively.

Installation

Requires Python >= 3.8 and PyTorch 1.10 through 1.13. The default requirements files install the latest versions of the dependencies as of June 1, 2023.

# Use specific platform build. Other PyTorch 1.13 options: cu116, cu117, rocm5.2
platform=cpu
# Generate requirements files for specified PyTorch platform
make torch-${platform}
# Install the project and core + train + test dependencies. Subsets: [train,test,bench,tune]
pip install -r requirements/core.${platform}.txt -e .[train,test]

Updating dependency version pins

pip install pip-tools
make clean-reqs reqs  # Regenerate all the requirements files

Datasets

Download the datasets from the following links:

  1. LMDB archives for MJSynth, SynthText, IIIT5k, SVT, SVTP, IC13, IC15, CUTE80, ArT, RCTW17, ReCTS, LSVT, MLT19, COCO-Text, and Uber-Text.
  2. LMDB archives for TextOCR and OpenVINO.

Pretrained Models via Torch Hub

Available models are: abinet, crnn, trba, vitstr, parseq_tiny, and parseq.

import torch
from PIL import Image
from strhub.data.module import SceneTextDataModule

# Load model and image transforms
parseq = torch.hub.load('baudm/parseq', 'parseq', pretrained=True).eval()
img_transform = SceneTextDataModule.get_transform(parseq.hparams.img_size)

img = Image.open('/path/to/image.png').convert('RGB')
# Preprocess. Model expects a batch of images with shape: (B, C, H, W)
img = img_transform(img).unsqueeze(0)

logits = parseq(img)
logits.shape  # torch.Size([1, 26, 95]), 94 characters + [EOS] symbol

# Greedy decoding
pred = logits.softmax(-1)
label, confidence = parseq.tokenizer.decode(pred)
print('Decoded label = {}'.format(label[0]))
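
For intuition, greedy decoding takes the most probable class at each position and stops at the first [EOS]. The sketch below shows this on a toy distribution in plain Python. The three-letter charset, the `greedy_decode` helper, and the choice of index 0 for [EOS] are assumptions made for illustration; they are not taken from parseq.tokenizer.

```python
def greedy_decode(probs, charset, eos_id=0):
    """Greedy decoding sketch.

    probs: list of rows, one per output position, each a list of class
    probabilities. Class eos_id is [EOS]; classes 1..N map to charset[0..N-1]
    (an assumed layout for this toy example).
    """
    chars, confidences = [], []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        confidences.append(row[best])
        if best == eos_id:
            break  # everything after the first [EOS] is ignored
        chars.append(charset[best - 1])
    return "".join(chars), confidences


toy_charset = "abc"
toy_probs = [
    [0.1, 0.7, 0.1, 0.1],      # most likely: 'a'
    [0.0, 0.2, 0.1, 0.7],      # most likely: 'c'
    [0.9, 0.0, 0.1, 0.0],      # most likely: [EOS]
    [0.25, 0.25, 0.25, 0.25],  # never reached
]
label, confidence = greedy_decode(toy_probs, toy_charset)
print(label, confidence)  # ac [0.7, 0.7, 0.9]
```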

Training

The training script can train any supported model. You can override any configuration using the command line. Please refer to Hydra docs for more info about the syntax. Use ./train.py --help to see the default configuration.

Sample commands for different training configurations

Finetune using pretrained weights

./train.py pretrained=parseq-tiny  # Not all experiments have pretrained weights

Train a model variant/preconfigured experiment

The base model configurations are in configs/model/, while variations are stored in configs/experiment/.

./train.py +experiment=parseq-tiny  # Some examples: abinet-sv, trbc

Specify the character set for training

./train.py charset=94_full  # Other options: 36_lowercase or 62_mixed-case. See configs/charset/

Specify the training dataset

./train.py dataset=real  # Other option: synth. See configs/dataset/

Change general model training parameters

./train.py 'model.img_size=[32,128]' model.max_label_length=25 model.batch_size=384  # quote the list and omit spaces so the shell passes it as one argument

Change data-related training parameters

./train.py data.root_dir=data data.num_workers=2 data.augment=true

Change pytorch_lightning.Trainer parameters

./train.py trainer.max_epochs=20 trainer.accelerator=gpu trainer.devices=2

Note that you can pass any Trainer parameter; prefix it with + if it is not already specified in configs/main.yaml.
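
The + prefix rule can be sketched as a small helper that builds the override list. This is illustrative only: `build_overrides` is a hypothetical helper, and the set of keys treated as already present in configs/main.yaml is an assumption, so check the actual file before relying on it.

```python
def build_overrides(params, known_keys):
    """Turn a {key: value} dict into Hydra-style command-line overrides.

    Keys missing from known_keys get a '+' prefix, which asks Hydra to add
    them to the config instead of overriding an existing entry.
    """
    overrides = []
    for key, value in params.items():
        prefix = "" if key in known_keys else "+"
        if isinstance(value, bool):
            value = str(value).lower()  # true/false, as in the commands above
        overrides.append(f"{prefix}{key}={value}")
    return overrides


# Assumed to be present in configs/main.yaml (taken from the sample command above).
known_keys = {"trainer.max_epochs", "trainer.accelerator", "trainer.devices"}

args = build_overrides(
    {"trainer.max_epochs": 20, "trainer.log_every_n_steps": 50},
    known_keys,
)
print("./train.py " + " ".join(args))
# ./train.py trainer.max_epochs=20 +trainer.log_every_n_steps=50
```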

Resume training from checkpoint (experimental)

./train.py +experiment=<model_exp> ckpt_path=outputs/<model>/<timestamp>/checkpoints/<checkpoint>.ckpt
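
To fill in the placeholders, a short helper can locate the most recently modified checkpoint under the outputs/<model>/<timestamp>/checkpoints/ layout shown above. This is a sketch under that assumption; `latest_checkpoint` is a hypothetical helper and not part of the project.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(outputs_dir="outputs") -> Optional[Path]:
    """Return the newest *.ckpt under <outputs_dir>/<model>/<timestamp>/checkpoints/."""
    candidates = list(Path(outputs_dir).glob("*/*/checkpoints/*.ckpt"))
    if not candidates:
        return None
    # Pick by file modification time rather than by parsing the timestamp folder.
    return max(candidates, key=lambda p: p.stat().st_mtime)


ckpt = latest_checkpoint()
if ckpt is None:
    print("No checkpoints found under outputs/")
else:
    print(f"./train.py +experiment=<model_exp> ckpt_path={ckpt}")
```

The experiment name is left as a placeholder because it has to match the experiment the checkpoint was trained with.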

Evaluation

The test script, test.py, can be used to evaluate any model trained with this project. For more info, see ./test.py --help.

PARSeq runtime parameters can be passed using the format param:type=value. For example, PARSeq NAR decoding can be invoked via ./test.py parseq.ckpt refine_iters:int=2 decode_ar:bool=false.
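
The param:type=value format can be read as "name, then a type tag, then the raw string to convert". The parser below is a minimal sketch of that idea; `parse_runtime_args` is a hypothetical name and this is not necessarily how test.py implements it.

```python
def parse_runtime_args(args):
    """Parse ['name:type=value', ...] into a {name: converted_value} dict."""
    converters = {
        "int": int,
        "float": float,
        "str": str,
        # bool('false') would be True, so compare the text instead
        "bool": lambda v: v.lower() == "true",
    }
    kwargs = {}
    for arg in args:
        name_and_type, value = arg.split("=", maxsplit=1)
        name, type_name = name_and_type.split(":", maxsplit=1)
        kwargs[name] = converters[type_name](value)
    return kwargs


print(parse_runtime_args(["refine_iters:int=2", "decode_ar:bool=false"]))
# {'refine_iters': 2, 'decode_ar': False}
```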

Sample commands for reproducing results

Lowercase alphanumeric comparison on benchmark datasets (Table 6)

./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt  # or use the released weights: ./test.py pretrained=parseq

Sample output:

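| Dataset   | # samples | Accuracy | 1 - NED | Confidence | Label Length |
|:----------|----------:|---------:|--------:|-----------:|-------------:|
| IIIT5k    |      3000 |    99.00 |   99.79 |      97.09 |         5.09 |
| SVT       |       647 |    97.84 |   99.54 |      95.87 |         5.86 |
| IC13_1015 |      1015 |    98.13 |   99.43 |      97.19 |         5.31 |
| IC15_2077 |      2077 |    89.22 |   96.43 |      91.91 |         5.33 |
| SVTP      |       645 |    96.90 |   99.36 |      94.37 |         5.86 |
| CUTE80    |       288 |    98.61 |   99.80 |      96.43 |         5.53 |
| Combined  |      7672 |    95.95 |   98.78 |      95.34 |         5.33 |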
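
The Combined row in the sample output is consistent with a sample-weighted mean of the per-dataset rows. The check below recomputes it from the numbers shown; this is an observation about the figures, not a description of how test.py aggregates them.

```python
# (samples, accuracy, 1 - NED) per dataset, copied from the sample output
rows = {
    "IIIT5k":    (3000, 99.00, 99.79),
    "SVT":       (647,  97.84, 99.54),
    "IC13_1015": (1015, 98.13, 99.43),
    "IC15_2077": (2077, 89.22, 96.43),
    "SVTP":      (645,  96.90, 99.36),
    "CUTE80":    (288,  98.61, 99.80),
}

total = sum(n for n, _, _ in rows.values())
# Weight each dataset's score by its number of samples
combined_acc = sum(n * acc for n, acc, _ in rows.values()) / total
combined_ned = sum(n * ned for n, _, ned in rows.values()) / total

print(total, round(combined_acc, 2), round(combined_ned, 2))
# 7672 95.95 98.78
```

Because the mean is weighted, the large IC15_2077 set pulls the combined accuracy further down than a plain average of the six rows would.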

Benchmark using different evaluation character sets (Table 4)

./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt  # lowercase alphanumeric (36-character set)
./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --cased  # mixed-case alphanumeric (62-character set)
./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --cased --punctuation  # mixed-case alphanumeric + punctuation (94-character set)
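
The three character-set sizes follow from digits, letters, and ASCII punctuation. The sketch below builds them with the standard library and shows one plausible way to normalize labels before comparison. The exact characters and their order in configs/charset/ may differ, and `normalize` is a hypothetical helper, not the project's own filtering code.

```python
import string

# 10 digits + 26 lowercase = 36; + 26 uppercase = 62; + 32 punctuation = 94
CHARSETS = {
    "36_lowercase": string.digits + string.ascii_lowercase,
    "62_mixed-case": string.digits + string.ascii_letters,
    "94_full": string.digits + string.ascii_letters + string.punctuation,
}

for name, chars in CHARSETS.items():
    print(name, len(chars))


def normalize(label, charset):
    """Lowercase the label if the charset has no uppercase, then drop unsupported characters."""
    if not any(c.isupper() for c in charset):
        label = label.lower()
    return "".join(c for c in label if c in charset)


print(normalize("3rd-Ave", CHARSETS["36_lowercase"]))   # 3rdave
print(normalize("3rd-Ave", CHARSETS["62_mixed-case"]))  # 3rdAve
print(normalize("3rd-Ave", CHARSETS["94_full"]))        # 3rd-Ave
```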

Lowercase alphanumeric comparison on more challenging datasets (Table 5)

./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --new

Benchmark Model Compute Requirements (Figure 5)

./bench.py model=parseq model.decode_ar=false model.refine_iters=3
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8fcae67ee0>
model(x)
  Median: 14.87 ms
  IQR:    0.33 ms (14.78 to 15.12)
  7 measurements, 10 runs per measurement, 1 thread
| module                | #parameters   | #flops   | #activations   |
|:----------------------|:--------------|:---------|:---------------|
| model                 | 23.833M       | 3.255G   | 8.214M         |
|  encoder              |  21.381M      |  2.88G   |  7.127M        |
|  decoder              |  2.368M       |  0.371G  |  1.078M        |
|  head                 |  36.575K      |  3.794M  |  9.88K         |
|  text_embed.embedding |  37.248K      |  0       |  0             |

Latency Measurements vs Output Label Length (Appendix I)

./bench.py model=parseq model.decode_ar=false model.refine_iters=3 +range=true

Orientation robustness benchmark (Appendix J)

./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --cased --punctuation  # no rotation
./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --cased --punctuation --rotation 90
./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --cased --punctuation --rotation 180
./test.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --cased --punctuation --rotation 270

Using trained models to read text from images (Appendix L)

./read.py outputs/<model>/<timestamp>/checkpoints/last.ckpt --images demo_images/*  # Or use ./read.py pretrained=parseq
Additional keyword arguments: {}
demo_images/art-01107.jpg: CHEWBACCA
demo_images/coco-1166773.jpg: Chevrol
demo_images/cute-184.jpg: SALMON
demo_images/ic13_word_256.png: Verbandsteffe
demo_images/ic15_word_26.png: Kaopa
demo_images/uber-27491.jpg: 3rdAve

# use NAR decoding + 2 refinement iterations for PARSeq
./read.py pretrained=parseq refine_iters:int=2 decode_ar:bool=false --images demo_images/*

Tuning

We use Ray Tune for automated parameter tuning of the learning rate. See ./tune.py --help. Extend tune.py to support tuning of other hyperparameters.

./tune.py tune.num_samples=20  # find optimum LR for PARSeq's default config using 20 trials
./tune.py +experiment=tune_abinet-lm  # find the optimum learning rate for ABINet's language model
