
A curated bi-lingual scene text detection and language identification benchmark dataset named EMBiL, comprising English and Manipuri texts embedded in the scene images. The paper is accepted at CAIP ' 23.


Naosekpam/EMBiL-English-Manipuri-Benchmark-for-scene-text-detection-and-language-identification


EMBiL-Dataset

EMBiL is a curated bi-lingual benchmark dataset for scene text detection and language identification. It comprises English and Manipuri (Meitei Mayek / Meetei Mayek) text embedded in scene images. The paper has been accepted at CAIP '23, to be held in Limassol, Cyprus, in September.

Manipuri, written in the Meetei Mayek script, is one of India's officially recognized scheduled languages. It is used by only 0.15% (3.6 million out of 1.4 billion) of India's total population.

The images were collected from diverse scenarios, such as local markets, billboards, navigation and traffic signs, graffiti, and shop banners, and include various naturally occurring visual noise and distortions. Owing to differences in language, culture, and history, scene text images in Manipur have distinctive features that combine English and Meetei Mayek text.

We describe the diversity of EMBiL at three levels: (1) image-level diversity, (2) scene-level diversity, and (3) text instance-level diversity.

EMBiL contains a total of 720 bi-lingual text images, divided into a training set (70%), a validation set (20%), and a test set (10%).
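As a rough illustration of the 70/20/10 division, the sketch below splits a list of 720 image identifiers into three parts. It is not the procedure used to build EMBiL: the identifier format (`img_0000`, ...), the `split_ids` helper, and the random seed are all assumptions made for the example, and the official split may assign images differently.

```python
import random


def split_ids(ids, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle ids reproducibly and cut them into train/val/test lists.

    Hypothetical helper for illustration only; the test set receives
    whatever remains after the train and validation cuts.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Assumed identifiers; the real dataset's file names may differ.
image_ids = [f"img_{i:04d}" for i in range(720)]
train_ids, val_ids, test_ids = split_ids(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

With 720 images, these fractions give 504 training, 144 validation, and 72 test images.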

For the complete dataset, email veronica.naosekpam@iiitg.ac.in.

Baseline architecture:

Please cite the following papers if you use the code or any part of it:

@inproceedings{naosekpam2023embil, 
  title={EMBiL: An English-Manipuri Bi-lingual Benchmark for Scene Text Detection and Language Identification}, 
  author={Naosekpam, Veronica and Islam, Mushtaq and Chourasia, Amul and Sahu, Nilkanta}, 
  booktitle={International Conference on Computer Analysis of Images and Patterns},
  pages={65--75},
  year={2023}, 
  organization={Springer} 
}

Naosekpam, Veronica, and Nilkanta Sahu. "Multi-label Indian scene text language identification." Intelligent Systems and Applications in Computer Vision (2023).
