AI4Bharat/Indic-Glossaries

Indic Glossary Datasets & Workflows


An umbrella portal for all the glossary creation workflows and datasets

License: MIT


Indic-Glossaries is an open-source portal that brings together the workflows for creating Indic glossaries and the created and curated glossary datasets themselves.

Datasets

The datasets are licensed under CC BY 4.0.

(All of the collected glossary datasets have been submitted to the ULCA platform. Please refer to https://bhashini.gov.in/ulca for further details.)

Breakdown by Collection Source

| Collection Source | Glossary Corpus Count | Download link |
| --- | --- | --- |
| IndoWordNet | 1,724,816 | download |
| Bharatavani | 866,423 | download |
| CSTT | 319,420 | download |
| OSF | 212,951 | download |
| NLPC - Univ of Moratuwa | 19,273 | download |
| Anuvaad | 1,690 | download |
| NCF-NCERT | 272 | download |
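
As a rough illustration of how these figures can be used, the sketch below tallies the per-source counts from the table above and reports each source's share of the total. The counts are copied from the table; the names `SOURCE_COUNTS` and `summarize` are hypothetical and not part of this repository.

```python
# Counts copied verbatim from the "Breakdown by Collection Source" table.
# SOURCE_COUNTS and summarize are illustrative names, not part of the repository.
SOURCE_COUNTS = {
    "IndoWordNet": 1_724_816,
    "Bharatavani": 866_423,
    "CSTT": 319_420,
    "OSF": 212_951,
    "NLPC - Univ of Moratuwa": 19_273,
    "Anuvaad": 1_690,
    "NCF-NCERT": 272,
}


def summarize(counts):
    """Return the total entry count and each key's fraction of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(SOURCE_COUNTS)
print(f"total entries: {total:,}")
# List sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{share:6.1%}")
```

The same helper can be applied to the language-pair or domain breakdowns by swapping in the corresponding counts.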

Breakdown by Language Pair

| Language Pair | Glossary Corpus Count |
| --- | --- |
| English-Assamese | 75,256 |
| English-Bengali | 102,855 |
| English-Bodo | 158,457 |
| English-Dogri | 8,624 |
| English-Goan Konkani | 87,300 |
| English-Gujarati | 168,343 |
| English-Hindi | 852,324 |
| English-Kannada | 156,777 |
| English-Kashmiri | 69,673 |
| English-Maithili | 8,297 |
| English-Malayalam | 98,092 |
| English-Manipuri | 7,183 |
| English-Marathi | 106,787 |
| English-Nepali | 65,795 |
| English-Odia | 130,801 |
| English-Punjabi | 152,081 |
| English-Sanskrit | 130,042 |
| English-Sindhi | 4,797 |
| English-Tamil | 316,976 |
| English-Telugu | 92,839 |
| English-Urdu | 79,253 |
| Hindi-Tamil | 113 |
| Hindi-Telugu | 13,340 |
| Hindi-Urdu | 3,133 |
| Hindi-English | 130,422 |
| Malayalam-English | 9,269 |
| Sanskrit-Hindi | 111,155 |
| Tamil-Hindi | 4,868 |

Breakdown by Domain

| Domain | Glossary Corpus Count |
| --- | --- |
| general | 2,209,449 |
| economy | 21,762 |
| technology | 26,360 |
| education | 351,387 |
| geography | 92,169 |
| legal | 46,870 |
| financial | 17,163 |
| automobile | 6,303 |
| healthcare | 278,436 |
| national-security-and-defence | 30,570 |
| agriculture | 4,807 |
| parliamentary | 22,180 |
| history | 23,955 |
| news | 7,303 |
| lifestyle | 3,506 |
| entertainment | 1,143 |
| philosophy | 1,489 |

Goal

The goal is to build high-quality glossary datasets for Indian languages across various domains (General, Legal, Education, Healthcare, Automobile, News, etc.).

Read more about Glossary Explorer at https://glossary.ai4bharat.org

Communication Forum

For information, help, or discussion, use the project's discussion forum: https://github.com/AI4Bharat/Indic-Glossaries/discussions

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@ai4bharat.org.
