🔖 The Indic NLP Catalog

A Collaborative Catalog of Resources for Indic Language NLP

The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.

Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:

[Wikipedia Dumps](https://dumps.wikimedia.org/)

Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.

👍 Featured Resources

Indian language NLP has come a long way. We feature a few resources that illustrate recent trends along various axes and point to a bright future.

  • Universal Language Contribution API (ULCA): ULCA is a standard API and open, scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the Bhashini mission. You can upload and discover models, datasets, and benchmarks here. This is a repository we really need, and we hope it evolves into a standard, large-scale platform for resource discovery and dissemination.
  • We are seeing the rise of large-scale datasets across many tasks like IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs), etc. These are being built through large-scale mining of web resources, large human annotation efforts, or both.
  • As we aim higher, datasets and models are achieving wider language coverage. Where earlier datasets were available for only a handful of Indian languages, and later for 10-12 languages, we are now reaching the next frontier: resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages), and IndoWordNet (wordnet/18 languages) span almost all languages listed in the Indian constitution, and more.
  • In particular, datasets are being created for extremely low-resource languages, or languages not previously covered in any dataset, such as Bodo, Kangri, and Khasi.
  • From a handful of institutes that pioneered the development of NLP in India, we now have a growing number of institutes, interest groups, and passionate volunteers - like AI4Bharat, BUET CSE NLP, KMI, L3Cube, iNLTK, IIT Patna, etc. - contributing to building resources for Indian languages.

Browse the entire catalog...

🙋 Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.

Major Indic Language NLP Repositories

Libraries and Tools

  • Indic NLP Library: Python library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc. (see the usage sketch after this list)
  • pyiwn: Python interface to IndoWordNet
  • Indic-OCR: OCR for Indic scripts
  • CLTK: Toolkit for many of the world's classical languages. Supports Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
  • iNLTK: iNLTK aims to provide out-of-the-box support for various NLP tasks that an application developer might need for Indic languages.
  • Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
  • Smart Sanskrit Annotator: Annotation tool for Sanskrit [paper]
  • BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging, and NER support
  • CodeSwitch: Language identification, POS tagging, NER, and sentiment analysis support for code-mixed data, including Hindi and Nepali
  • IndIE: An open information extraction tool (triple extractor) for Hindi. It is conjectured to work for Tamil, Telugu, and Urdu as well.
  • Hindi-BenchIE: A triple evaluation tool covering 112 Hindi sentences.
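
As a quick taste of these toolkits, here is a minimal sketch of tokenization and script conversion with the Indic NLP Library. It assumes the package is installed from PyPI (`pip install indic-nlp-library`); the function names follow the project's documented API, but check the README for your installed version.

```python
# Minimal sketch: tokenization and script conversion with the Indic NLP Library.
# Assumes `pip install indic-nlp-library`; API names follow the project docs.
from indicnlp.tokenize import indic_tokenize
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

text = 'यह एक वाक्य है।'  # "This is a sentence." (Hindi)

# Word tokenization for Indic scripts (splits punctuation such as the danda '।').
tokens = indic_tokenize.trivial_tokenize(text, lang='hi')
print(tokens)  # e.g. ['यह', 'एक', 'वाक्य', 'है', '।']

# Script conversion between Indic scripts via their parallel Unicode layouts.
tamil = UnicodeIndicTransliterator.transliterate(text, 'hi', 'ta')
print(tamil)
```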

Evaluation Benchmarks

Benchmarks spanning multiple tasks.

  • AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
  • AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks: biography generation, sentence summarization, headline generation, paraphrase generation and question generation.
  • GLUECoS: A benchmark for Hindi-English code-mixed data covering the following tasks: Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), and Natural Language Inference (NLI).
  • AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
  • WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.

Standards

Text Corpora

Monolingual Corpus

Language Identification

Lexical Resources and Semantic Similarity

NER Corpora

Parallel Translation Corpus

MT Evaluation

  • WMT23 QE task: QE datasets for 5 Indian languages in the En-to-Indic direction (mr, hi, gu, ta, te) with DA annotations. References are also available, so these can also be used for reference-based metrics. For Marathi, post-edits and word-level error annotations are also available. 26k training sentences for Marathi, 7k for each of the others. report
  • AI4Bharat IndicMT-Eval: MT evaluation datasets for 5 Indian languages in the En-to-Indic direction (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems).

Parallel Transliteration Corpus

Text Classification

Textual Entailment/Natural Language Inference

Paraphrase

Sentiment, Sarcasm, Emotion Analysis

Hate Speech and Offensive Comments

Question Answering

  • Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
  • TyDi QA datasets: QA dataset for Bengali and Telugu.
  • bAbi 1.2 dataset: Contains a Hindi version of the bAbI tasks in romanized Hindi.
  • MMQA dataset: Hindi QA dataset described in this paper
  • XQuAD: Test set for Hindi QA created by human translation of a subset of SQuAD v1.1. Described in this paper
  • XQA: Test set for Tamil QA. Described in this paper
  • HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
  • IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
  • Chaii: A Kaggle challenge consisting of 1104 questions in Hindi and Tamil. Also, here is a good collection of papers on multilingual question answering.
  • csebuetnlp Bangla QA: A Question Answering (QA) dataset for Bengali. Described in this paper.
  • XOR QA: A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in this paper. More information is available here.
  • IITB HiQuAD: A question answering dataset in Hindi consisting of 6555 question-answer pairs. Described in this paper.

Dialog

Discourse

Information Extraction

  • EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
  • [EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, and English. Described in this paper.
  • Amazon MASSIVE: A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in this paper.
  • Facebook - MTOP Benchmark: A comprehensive multilingual task-oriented semantic parsing benchmark with a dataset comprising 100k annotated utterances in 6 languages (including one Indic language: Hindi) across 11 domains. Described in this paper.

POS Tagged corpus

Chunk Corpus

Dependency Parse Corpus

Coreference Corpus

Summarization

  • XL-Sum: A large-scale multilingual abstractive summarization dataset covering 44 languages, comprising 1 million professionally annotated article-summary pairs from BBC. Spans 150k examples across 10 Indic languages. Described in this paper.
  • TeSum: Telugu Abstractive Summarization dataset containing 20k+ article-summary pairs, with the summaries being manually created. [paper]
  • WikiLingua: Cross-lingual summarization dataset created from WikiHow. Contains 9k English-Hindi article-summary pairs. [paper]
  • MassiveSum: A large summarization dataset covering 13 Indian languages with ~1.9 million article-summary pairs. The summaries are mined from article metadata. [paper]

Data to Text

  • XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages, comprising a high-quality XF2T dataset in 7 languages (Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali) and a monolingual dataset in English. The dataset is available upon request. Described in this paper.

Models

Language Identification

  • NLLB-200: LID for 200 languages, including 27 Indic languages (released as a fastText model; see the sketch below).
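
A hedged sketch of how such a fastText-based LID model is typically used. The checkpoint filename below is an assumption based on the NLLB release; point it at wherever you downloaded the model.

```python
# Minimal sketch: language identification with a fastText LID model,
# such as the one released with NLLB-200. The filename is an assumption;
# replace it with the path to your downloaded checkpoint.
import fasttext

model = fasttext.load_model('lid218e.bin')

labels, probs = model.predict('यह हिंदी में एक वाक्य है।', k=3)
print(list(zip(labels, probs)))
# Labels look like '__label__hin_Deva' (language code + script).
```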

Word Embeddings

Pre-trained Language Models

  • AI4Bharat IndicBERT: Multilingual ALBERT-based embeddings spanning 12 languages (including Indian English) for Natural Language Understanding (see the loading sketch after this list).
  • AI4Bharat IndicBART: A multilingual, sequence-to-sequence pre-trained model based on the mBART architecture, focusing on 11 Indic languages and English, for Natural Language Generation in Indic languages. Described in this paper.
  • MuRIL: Multilingual mBERT-based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
  • BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
  • mBART50: seq2seq pre-trained model trained on CommonCrawl of many languages (including major Indic languages).
  • BLOOM: GPT-3-like multilingual transformer-decoder language model (includes major Indic languages).
  • iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
  • albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
  • RoBERTa-hindi-guj-san: Multilingual RoBERTa-like model trained on Hindi, Sanskrit, and Gujarati.
  • Bangla-BERT-Base: Bengali BERT model trained on Bengali Wikipedia and OSCAR datasets.
  • BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in this paper.
  • EM-ALBERT: The first ALBERT model available for the Manipuri language, trained on 1,034,715 Manipuri sentences.
  • LaBSE: Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [paper].
  • LASER3: Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languages).
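
Most of the encoder models above are published on the Hugging Face Hub, so loading one takes a few lines with `transformers`. A minimal sketch, assuming the `ai4bharat/indic-bert` checkpoint and the `transformers`, `sentencepiece`, and `torch` packages:

```python
# Minimal sketch: loading a pre-trained Indic encoder from the Hugging Face Hub.
# Assumes `pip install transformers sentencepiece torch` and the
# `ai4bharat/indic-bert` checkpoint; other IDs (e.g. `google/muril-base-cased`)
# can be swapped in the same way.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ai4bharat/indic-bert')
model = AutoModel.from_pretrained('ai4bharat/indic-bert')

inputs = tokenizer('यह एक उदाहरण वाक्य है।', return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```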

Multilingual Word Embeddings

Morphanalyzers

Translation Models

  • IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
  • Shata-Anuvaadak: SMT for 110 language pairs (all pairs among English and 10 Indian languages).
  • LTRC Vanee: Dependency based Statistical MT system from English to Hindi.
  • NLLB-200: Models for 200 languages, including 27 Indic languages (see the usage sketch below).
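
Since NLLB-200 checkpoints are also distributed through `transformers`, a hedged sketch of English-to-Hindi translation looks like the following; the model ID and FLORES-style language codes follow the public NLLB release.

```python
# Minimal sketch: English -> Hindi translation with an NLLB-200 checkpoint.
# Assumes `pip install transformers torch` and the distilled 600M model;
# NLLB uses FLORES-style language codes such as 'eng_Latn' and 'hin_Deva'.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = 'facebook/nllb-200-distilled-600M'
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang='eng_Latn')
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer('NLP resources for Indian languages are growing rapidly.',
                   return_tensors='pt')
generated = model.generate(
    **inputs,
    # Force the decoder to start in the target language.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids('hin_Deva'),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```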

Transliteration Models

  • AI4Bharat IndicXlit: A transformer-based multilingual transliteration model with 11M parameters for Roman-to-native script conversion and vice versa, supporting 21 Indic languages (see the sketch below). Described in this paper.
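
IndicXlit is also distributed as a pip package. A minimal sketch, assuming the `ai4bharat-transliteration` package and its `XlitEngine` interface; the exact API may differ across versions, so check the project README.

```python
# Minimal sketch: Roman -> native-script transliteration with IndicXlit.
# Assumes `pip install ai4bharat-transliteration`; the XlitEngine interface
# follows the project README and may vary between releases.
from ai4bharat.transliteration import XlitEngine

engine = XlitEngine('hi', beam_width=10)

# Top-5 Devanagari candidates for a romanized Hindi word.
print(engine.translit_word('namaste', topk=5))

# Whole-sentence transliteration.
print(engine.translit_sentence('namaste duniya'))
```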

Speech Models

NER

Speech Corpora

OCR Corpora

Multimodal Corpora

Language Specific Catalogs

Pointers to language-specific NLP resource catalogs