The official code of NASTR (TCSVT 2025).
The language modeling paradigm for scene text recognition (STR) has demonstrated impressive universal capabilities across extensive STR scenarios. However, existing methods still struggle with text images of irregular shapes and diverse appearances (e.g., curved, artistic, multi-oriented) due to the absence of contextual information during initial decoding. In this work, inspired by the 'forest before trees' principle of human visual perception, we introduce NASTR, a non-autoregressive scene text recognizer that endows the attentional decoder with global awareness. Specifically, we design a global-to-local attention procedure that simulates how globally holistic visual signal processing precedes locally detailed responses in the human visual system. This is achieved by using global image information queries to condition the generation of glimpse vectors at each decoding time step. This procedure enables NASTR to match the performance of its state-of-the-art autoregressive counterparts while executing fully in parallel. Moreover, we propose multiple optional and flexible encoding constraint components to alleviate the representation quality degradation caused by the global image information queries on multilingual and multi-domain STR tasks. These components constrain the global image features from the perspectives of global structure, global semantics, and linguistic knowledge. Extensive experimental results demonstrate that NASTR consistently outperforms existing methods on both Chinese and English STR benchmarks.
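As a rough illustration of the global-to-local idea (this is a hypothetical sketch, not the repository's implementation; all names here are made up), the per-step attention queries can be conditioned on a pooled global image feature before they attend over local visual features, so all glimpse vectors are produced in parallel:

```python
import torch
import torch.nn as nn

class GlobalToLocalAttention(nn.Module):
    """Toy sketch: condition each decoding query on a global image vector,
    then attend over local visual features to form glimpse vectors."""
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.pos_queries = nn.Embedding(max_len, d_model)  # one query per decoding step
        self.fuse = nn.Linear(2 * d_model, d_model)        # mixes global info into each query

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, D) local visual features from the encoder
        B, N, D = feats.shape
        global_feat = feats.mean(dim=1)                               # (B, D) holistic 'forest' signal
        q = self.pos_queries.weight.unsqueeze(0).expand(B, -1, -1)    # (B, T, D) step queries
        g = global_feat.unsqueeze(1).expand(-1, q.size(1), -1)        # (B, T, D) broadcast global
        q = self.fuse(torch.cat([q, g], dim=-1))                      # global-aware queries
        attn = torch.softmax(q @ feats.transpose(1, 2) / D ** 0.5, dim=-1)  # (B, T, N)
        return attn @ feats                                           # glimpse vectors for all steps at once

glimpses = GlobalToLocalAttention(d_model=32, max_len=8)(torch.randn(2, 16, 32))
print(glimpses.shape)  # torch.Size([2, 8, 32])
```

Because every decoding step's query exists up front (rather than depending on the previous prediction), the whole sequence of glimpses is computed in one pass, which is what makes the decoder non-autoregressive.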
The adopted datasets can be downloaded from the following links:
- Union14M, Revisiting Scene Text Recognition: A Data Perspective
- Scene and Web, Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study
For the Chinese datasets, python==3.11, pytorch==2.3.1, and torchvision==0.18.1 are recommended.
First, clone the repository locally:

```shell
git clone https://github.com/ML-HDU/NASTR.git
```
Then, install PyTorch 1.11.0+, torchvision 0.12.0+, and other requirements:

```shell
conda create -n nastr python=3.8
conda activate nastr
conda install pytorch==1.11.0 torchvision==0.12.0 -c pytorch
pip install -r requirements.txt
```
- Union14M Dataset: Average accuracy on Union14M-Benchmark (Curve, Multi-oriented, Artistic, Contextless, Salient, Multi-Words, and General)
| Model | Average Acc | Pre-trained weights | Training log | Testing log |
|---|---|---|---|---|
| | 76.05 | Google Drive | log | log |
| | 78.21 | Google Drive | log | log |
| | 79.26 | Google Drive | log | log |
- Chinese Datasets

| Model | Scene | Pre-trained weights | Training log | Testing log |
|---|---|---|---|---|
| | 78.50 | Google Drive | log | log |
| | 77.34 | Google Drive | log | log |

| Model | Web | Pre-trained weights | Training log | Testing log |
|---|---|---|---|---|
| | 70.72 | Google Drive | log | log |
| | 72.10 | Google Drive | log | log |
- Modify the `train_dataset` and `val_dataset` args in the `config.json` file, including `alphabet` and `save_dir`.
- Modify `util_files/keys.txt` if needed, according to the vocabulary of your dataset.
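For instance, the relevant fields might look like the following (the field layout here is illustrative, including the `data_path` names; consult the shipped `config.json` files for the exact schema):

```json
{
  "save_dir": "saved/",
  "train_dataset": {
    "args": {
      "data_path": "path/to/train_lmdb",
      "alphabet": "util_files/keys.txt"
    }
  },
  "val_dataset": {
    "args": {
      "data_path": "path/to/val_lmdb",
      "alphabet": "util_files/keys.txt"
    }
  }
}
```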
```shell
python train_STR.py -c configs/config_lmdb_scene.json
```

or

```shell
python train_STR.py -c configs/config_lmdb_web.json
```

For English datasets:

```shell
python train_STR_en.py -c configs/config_English.json
```
- Modify the `checkpoint_path` and `test_dataset` args in the `config.json` file, including `alphabet` and `save_dir`.
- Modify `model_arch` to match the model architecture in the pre-trained weights. E.g., for $\mathtt{NASTR}_{\mathtt{ViT}}$, set `model_arch` ➡️ `encoder_kwargs` ➡️ `type` to `"vit"`.
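Illustratively (the exact schema and checkpoint path are assumptions; check the provided test configs), these fields might look like:

```json
{
  "checkpoint_path": "saved/models/NASTR/example_0101_120000/model_best.pth",
  "model_arch": {
    "encoder_kwargs": {
      "type": "vit"
    }
  }
}
```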
```shell
python test_STR.py -c configs/config_lmdb_TEST.json
```

or, for English datasets:

```shell
python test_STR_en.py -c configs/config_lmdb_TEST_English.json
```
You can specify the name of the training session in the `config.json` files:

```json
"name": "NASTR",
"run_id": "example"
```
The best-performing checkpoint from the training stage will be saved to `save_dir/models/name/run_id_timestamp/model_best.pth`, where `timestamp` is in `mmdd_HHMMSS` format.
This project is licensed under the MIT License. See LICENSE for more details.

