EMBiL is a curated bi-lingual benchmark dataset for scene text detection and language identification, comprising English and Manipuri (Meitei Mayek / Meetei Mayek) text embedded in scene images. The paper has been accepted at CAIP '23, to be held in Limassol, Cyprus, in September.
The Manipuri language, written in the Meetei Mayek script, is one of India's officially recognized scheduled languages. Statistically, it is used by only 0.15% (3.6 million out of 1.4 billion) of India's total population.
The dataset includes a variety of naturally occurring visual noise and distortions, collected from diverse scenarios such as local markets, billboards, navigation and traffic signs, graffiti, and shop banners. Owing to differences in language, culture, and history, scene text images in Manipur have distinctive features that combine English and Meetei Mayek text.
We describe the diversity of EMBiL at three levels: (1) image-level diversity, (2) scene-level diversity, and (3) text instance-level diversity.
EMBiL contains a total of 720 bi-lingual scene text images, divided into a 70% train set, a 20% validation set, and a 10% test set.
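As a rough illustration of what the 70/20/10 proportions mean for 720 images, the following minimal Python sketch computes the split sizes and partitions a list of file names. The file names, the `split_dataset` helper, and the seeded random shuffle are all assumptions made for illustration; the way the images are actually assigned to each split is not described here, so use the split shipped with the dataset when reproducing results.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle items and split them into train/validation/test lists.

    Hypothetical helper: the official EMBiL split assignment may differ.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n = len(items)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Placeholder file names; the real naming scheme is not specified.
image_names = [f"img_{i:04d}.jpg" for i in range(720)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))  # 504 144 72
```

With 720 images, the proportions correspond to 504 training, 144 validation, and 72 test images.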
To request the complete dataset, email veronica.naosekpam@iiitg.ac.in.
@inproceedings{naosekpam2023embil,
title={EMBiL: An English-Manipuri Bi-lingual Benchmark for Scene Text Detection and Language Identification},
author={Naosekpam, Veronica and Islam, Mushtaq and Chourasia, Amul and Sahu, Nilkanta},
booktitle={International Conference on Computer Analysis of Images and Patterns},
pages={65--75},
year={2023},
organization={Springer}
}
Naosekpam, Veronica, and Nilkanta Sahu. "Multi-label Indian scene text language identification." Intelligent Systems and Applications in Computer Vision (2023).