VLPT-STD

Vision-Language Pre-Training for Boosting Scene Text Detectors

The official PyTorch implementation of VLPT-STD (CVPR 2022).

VLPT-STD is a new pre-training paradigm for scene text detection that requires only text annotations. We propose three vision-language pre-training pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM), and word-in-image prediction (WIP). Together they learn contextualized, joint representations that improve the performance of scene text detectors. Extensive experiments on standard benchmarks show that the proposed paradigm significantly improves the performance of various representative text detectors.
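To make the ITC objective more concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired image and text embeddings. It is an illustration only, not the loss code from this repository: the function name `itc_loss`, the plain-list embeddings, and the temperature value of 0.07 are all assumptions chosen for the example.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # L2-normalize so that the dot product becomes cosine similarity.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i]).

    The temperature default is an assumed value, not taken from the paper.
    """
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: scaled cosine similarity between image i and text j.
    sim = [[_dot(img, txt) / temperature for txt in texts] for img in images]
    # Image-to-text: each image should pick out its own text among all texts.
    i2t = sum(_cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: each text should pick out its own image among all images.
    t2i = sum(_cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return (i2t + t2i) / 2
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are shuffled, which is the signal that pulls matching image and text representations together.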

Paper

VLPT-STD Model

Install requirements

pip3 install -r requirements.txt

Dataset

Download the SynthText dataset.

  • The data folder should be structured as follows:
data
└── SynthText
    ├── 1
    ├── 2
    ├── 3
    ├── ...
    └── gt.mat
  • Use write_synthtext_pyarrow.py to convert the data to the Arrow format for pre-training.
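Before running the conversion, it can help to confirm that the folder matches the layout above. The following is a small sketch of such a check; the helper name `check_synthtext_layout` and the default root `data` are assumptions for illustration and are not part of this repository.

```python
from pathlib import Path


def check_synthtext_layout(data_root="data"):
    """Return a list of problems found in the expected SynthText layout."""
    problems = []
    synth_dir = Path(data_root) / "SynthText"
    if not synth_dir.is_dir():
        return [f"missing directory: {synth_dir}"]
    # The annotation file sits next to the numbered image folders.
    if not (synth_dir / "gt.mat").is_file():
        problems.append(f"missing annotation file: {synth_dir / 'gt.mat'}")
    # Image folders are named with integers (1, 2, 3, ...).
    image_dirs = [p for p in synth_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not image_dirs:
        problems.append(f"no numbered image folders found in {synth_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_synthtext_layout():
        print(problem)
```

An empty result means the directory, gt.mat, and at least one numbered image folder were found; it does not validate the contents of the files.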

Pretrained Models

A pretrained ResNet-50 is available at this URL.

Training

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 main.py --exp_name base

Benchmarks

Results with EAST, DB, and PSENet are summarized below (P: precision, R: recall, F: F-measure, all in %):

                    ICDAR2015           ICDAR2017           MSRA-TD500
Method              P     R     F       P     R     F       P     R     F
EAST + SynthText    89.6  81.5  85.3    75.1  61.9  67.9    86.9  77.6  82.0
EAST + VLPT-STD     91.5  85.4  88.3    77.7  64.6  70.5    88.5  76.7  82.2

                    ICDAR2015           Total-Text          MSRA-TD500
Method              P     R     F       P     R     F       P     R     F
DB + SynthText      88.2  82.7  85.4    87.1  82.5  84.7    91.5  79.2  84.9
DB + VLPT-STD       92.0  81.6  86.5    88.7  84.0  86.3    92.3  84.9  88.5

                    ICDAR2015           Total-Text          CTW1500
Method              P     R     F       P     R     F       P     R     F
PSENet + SynthText  84.3  78.4  81.3    89.2  79.2  83.9    83.6  79.7  81.6
PSENet + VLPT-STD   86.0  82.8  84.3    90.8  82.0  86.1    86.3  80.7  83.3
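In these tables, F is the harmonic mean of precision (P) and recall (R). The short sketch below recomputes it for one row as a sanity check; the function name `f_measure` is chosen for illustration. Because the reported P and R are already rounded, a recomputed F may occasionally differ from the table in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# EAST + VLPT-STD on ICDAR2015: P = 91.5, R = 85.4
print(round(f_measure(91.5, 85.4), 1))  # 88.3
```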

Acknowledgements

This implementation is based on ViLT.

Citation

If you find this work useful, please cite:

@inproceedings{song2022vision,
  title={Vision-Language Pre-Training for Boosting Scene Text Detectors},
  author={Song, Sibo and Wan, Jianqiang and Yang, Zhibo and Tang, Jun and Cheng, Wenqing and Bai, Xiang and Yao, Cong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15681--15691},
  year={2022}
}

License

VLPT-STD is released under the terms of the Apache License, Version 2.0.

VLPT-STD is an algorithm for scene text detection pre-training. The code and models herein, created by the authors from Alibaba, can only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd. 

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.