Indic Language NLP is a Python-based solution for natural language processing tasks for Indic languages. It includes a variety of tools and models for tasks such as language identification, named entity recognition, part-of-speech tagging, sentiment analysis, and more. The repository contains code samples, documentation, and data resources that can be used to build and customize these models for different languages and domains.
Here are some of the main features of Indic Language NLP:
Indic Language NLP includes models for word-level language identification, which label each word of a given input text with its language. It currently supports several Indic languages, including Hindi, Bengali, Tamil, and Telugu.
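To make the idea of word-level labelling concrete, here is a minimal sketch that guesses a language for each word from the Unicode script of its characters. It is not the repository's model: the function names, the `SCRIPT_TO_LANG` table, and the one-script-one-language mapping are assumptions made for illustration (Devanagari, for example, is also used for Marathi and Nepali).

```python
import unicodedata
from collections import Counter

# Assumed script-to-language mapping, for illustration only.
SCRIPT_TO_LANG = {
    "DEVANAGARI": "hi",
    "BENGALI": "bn",
    "TAMIL": "ta",
    "TELUGU": "te",
    "LATIN": "en",
}

def guess_word_language(word):
    """Return the language code whose script covers most characters of `word`."""
    counts = Counter()
    for ch in word:
        # Unicode character names start with the script, e.g. "DEVANAGARI LETTER KA".
        script = unicodedata.name(ch, "").split(" ", 1)[0]
        if script in SCRIPT_TO_LANG:
            counts[SCRIPT_TO_LANG[script]] += 1
    if not counts:
        return "und"  # undetermined: digits, punctuation, unsupported scripts
    return counts.most_common(1)[0][0]

def tag_words(text):
    """Split on whitespace and label every word independently."""
    return [(word, guess_word_language(word)) for word in text.split()]

print(tag_words("मैं office जा रहा हूँ"))
```

A script lookup like this cannot separate romanized Hindi from English, since both use Latin letters. That case is where a trained word-level model is needed.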
Indic Language NLP includes models for transliteration, which convert text from one script to another. It currently supports transliteration of Hindi-English code-mixed text, but can be extended to support other languages as well.
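As an illustration of what script conversion involves, the sketch below romanizes a small subset of Devanagari with hand-written rules. It is a toy stand-in, not the repository's transliteration model: the tables, the romanization scheme, and the `transliterate` function are assumptions, and only a handful of characters are covered.

```python
# Partial, hand-picked tables (assumed romanization scheme, not a standard).
CONSONANTS = {"क": "k", "ग": "g", "त": "t", "द": "d", "न": "n", "प": "p",
              "म": "m", "र": "r", "ल": "l", "स": "s", "ह": "h"}
VOWELS = {"अ": "a", "आ": "aa", "इ": "i", "ई": "ii", "उ": "u", "ऊ": "uu",
          "ए": "e", "ओ": "o"}
MATRAS = {"ा": "aa", "ि": "i", "ी": "ii", "ु": "u", "ू": "uu", "े": "e", "ो": "o"}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant

def transliterate(text):
    """Romanize the supported Devanagari characters; pass everything else through."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in MATRAS:
                out.append(MATRAS[nxt])  # dependent vowel replaces the inherent "a"
                i += 1
            elif nxt == VIRAMA:
                i += 1                   # conjunct: no vowel after this consonant
            else:
                out.append("a")          # inherent vowel
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        else:
            out.append(ch)               # Latin text, spaces, unsupported characters
        i += 1
    return "".join(out)

print(transliterate("नमस्ते world"))
```

The rules keep every inherent vowel, so word-final schwa deletion in Hindi is not handled (राम comes out as "raama"). Context-dependent cases like this are one reason to use a learned model rather than a lookup table.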
Indic Language NLP includes models for named entity recognition (NER), which can identify and classify entities such as people, organizations, locations, and more in a given input text. It supports several Indic languages and can be trained on custom datasets for specific domains.
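Token-level NER models commonly emit one tag per token, and a post-processing step groups those tags into entities. The sketch below shows that grouping step under the assumption that tags follow the BIO scheme (`B-PER`, `I-PER`, `O`, ...). The source does not state which tagging scheme the models use, and `bio_to_spans` is a hypothetical helper rather than part of the repository.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity currently being built, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans

tokens = ["नरेंद्र", "मोदी", "दिल्ली", "में", "हैं"]
tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]  # hand-written tags, not model output
print(bio_to_spans(tokens, tags))
```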
Indic Language NLP includes models for part-of-speech (POS) tagging, which can identify and tag the parts of speech in a given input text. It supports several Indic languages and can be trained on custom datasets for specific domains.
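When training a tagger on a custom dataset, a most-frequent-tag baseline is a useful reference point for judging whether a trained model adds value. The sketch below is such a baseline, not the repository's tagger. The function names, the tag set, and the three training sentences are made up for illustration.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Learn each word's most frequent tag, plus a fallback tag for unseen words."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    if not all_tags:
        raise ValueError("training data is empty")
    lexicon = {word: counts.most_common(1)[0][0] for word, counts in per_word.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag

def tag(words, lexicon, default_tag):
    """Tag known words from the lexicon and unknown words with the default tag."""
    return [(word, lexicon.get(word, default_tag)) for word in words]

# Tiny hand-made training set (illustrative, not from the repository's data).
train = [
    [("राम", "PROPN"), ("घर", "NOUN"), ("गया", "VERB")],
    [("सीता", "PROPN"), ("स्कूल", "NOUN"), ("गई", "VERB")],
    [("घर", "NOUN"), ("बड़ा", "ADJ"), ("है", "AUX")],
]
lexicon, default_tag = train_baseline_tagger(train)
print(tag(["सीता", "घर", "आई"], lexicon, default_tag))
```

The unseen word आई falls back to the most common training tag (`NOUN` here), which is wrong for a verb. A contextual model is meant to fix exactly this kind of error.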
Indic Language NLP includes models for sentiment analysis, which can classify the sentiment of a given input text as positive, negative, or neutral. It supports several Indic languages and can be trained on custom datasets for specific domains.
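A three-way sentiment classifier typically produces one raw score per class, which is then converted into a label and a confidence. The sketch below shows that last step only. The label order in `LABELS` and the `classify` helper are assumptions, since the source does not describe the models' output format.

```python
import math

# Assumed class order; check the label mapping of the model you actually use.
LABELS = ("negative", "neutral", "positive")

def classify(logits):
    """Turn raw class scores into (label, probability) using a softmax."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one score per label")
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]  # shift for numerical stability
    total = sum(exps)
    probs = [value / total for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = classify([0.1, 0.3, 2.0])  # made-up scores
print(label, round(confidence, 3))
```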
Indic Language NLP provides pre-trained models and datasets that can be used out-of-the-box for several tasks. The repository also includes resources for training custom models on specific domains and languages.
The pre-trained models and datasets in Indic Language NLP were developed using data provided by the National Informatics Centre (NIC), a department of the Indian Government. The dataset therefore cannot be publicly shared or distributed, and the models should be used only for research and non-commercial purposes. Users can still train models on their own datasets, or fine-tune the pre-trained models for their own domains and languages.
Indic Language NLP is a code pipeline for performing natural language processing tasks on Indic languages. The code is provided as a Google Colab notebook. To use it, open the notebook in Colab and run the cells in order.
We welcome contributions from the community to improve and extend Indic Language NLP. You can contribute by reporting issues, suggesting new features, or submitting pull requests.
Indic Language NLP is licensed under the MIT License. See the LICENSE file for more details.
Indic Language NLP uses several open-source tools and libraries, including the Hugging Face Transformers library, spaCy, scikit-learn, and more. We would like to acknowledge the contributions of these communities to the field of natural language processing.