An umbrella portal for all the glossary creation workflows and datasets
Indic-Glossaries is an open-source umbrella portal that exposes the workflows for creating Indic glossaries, along with the created and curated glossary datasets.
The datasets are licensed under :
(All collected glossary datasets are submitted to the ULCA platform. Please refer to https://bhashini.gov.in/ulca for further details.)
Collection Source | Glossary Corpus Count | Download link |
---|---|---|
IndoWordNet | 1,724,816 | download |
Bharatavani | 866,423 | download |
CSTT | 319,420 | download |
OSF | 212,951 | download |
NLPC - Univ of Moratuwa | 19,273 | download |
Anuvaad | 1,690 | download |
NCF-NCERT | 272 | download |
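
To put the per-source counts in perspective, the short sketch below sums them and prints each source's share of the total. The figures are copied from the table above; the names `SOURCE_COUNTS` and `source_shares` are illustrative only and are not part of the project.

```python
# Counts copied from the collection-source table above.
SOURCE_COUNTS = {
    "IndoWordNet": 1_724_816,
    "Bharatavani": 866_423,
    "CSTT": 319_420,
    "OSF": 212_951,
    "NLPC - Univ of Moratuwa": 19_273,
    "Anuvaad": 1_690,
    "NCF-NCERT": 272,
}


def source_shares(counts):
    """Return (total, {source: percentage of the total})."""
    total = sum(counts.values())
    shares = {name: 100.0 * n / total for name, n in counts.items()}
    return total, shares


total, shares = source_shares(SOURCE_COUNTS)
print(f"Total entries: {total:,}")
# List sources from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{pct:6.2f}%")
```

The same helper can be applied to the language-pair or domain counts by swapping in a different dictionary.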

Language Pair | Glossary Corpus Count |
---|---|
English-Assamese | 75,256 |
English-Bengali | 102,855 |
English-Bodo | 158,457 |
English-Dogri | 8,624 |
English-Goan Konkani | 87,300 |
English-Gujarati | 168,343 |
English-Hindi | 852,324 |
English-Kannada | 156,777 |
English-Kashmiri | 69,673 |
English-Maithili | 8,297 |
English-Malayalam | 98,092 |
English-Manipuri | 7,183 |
English-Marathi | 106,787 |
English-Nepali | 65,795 |
English-Odia | 130,801 |
English-Punjabi | 152,081 |
English-Sanskrit | 130,042 |
English-Sindhi | 4,797 |
English-Tamil | 316,976 |
English-Telugu | 92,839 |
English-Urdu | 79,253 |
Hindi-Tamil | 113 |
Hindi-Telugu | 13,340 |
Hindi-Urdu | 3,133 |
Hindi-English | 130,422 |
Malayalam-English | 9,269 |
Sanskrit-Hindi | 111,155 |
Tamil-Hindi | 4,868 |

Domain | Glossary Corpus Count |
---|---|
general | 2,209,449 |
economy | 21,762 |
technology | 26,360 |
education | 351,387 |
geography | 92,169 |
legal | 46,870 |
financial | 17,163 |
automobile | 6,303 |
healthcare | 278,436 |
national-security-and-defence | 30,570 |
agriculture | 4,807 |
parliamentary | 22,180 |
history | 23,955 |
news | 7,303 |
lifestyle | 3,506 |
entertainment | 1,143 |
philosophy | 1,489 |
The goal is to build high-quality glossary datasets for Indian languages across various domains (General, Legal, Education, Healthcare, Automobile, News, etc.).
Read more about Glossary Explorer at https://glossary.ai4bharat.org
For information, help, or discussion, use the GitHub Discussions page: https://github.com/AI4Bharat/Indic-Glossaries/discussions
This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@ai4bharat.org.