Sugoto/Indic-Language-NLP

Indic Language NLP 🌐📊📝

Indic Language NLP is a Python-based solution for natural language processing tasks for Indic languages. It includes a variety of tools and models for tasks such as language identification 🌐, named entity recognition 📊, part-of-speech tagging 📝, sentiment analysis 🤔, and more. The repository contains code samples, documentation, and data resources that can be used to build and customize these models for different languages and domains.

Features

Here are some of the main features of Indic Language NLP:

Language Identification 🌐

Indic Language NLP includes models that identify the language of each word in an input text, which makes them suitable for code-mixed input. Supported languages currently include Hindi, Bengali, Tamil, and Telugu, among others.
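
As a rough illustration of word-level tagging, the sketch below assigns a language code to each word based on the Unicode script of its characters. It is a baseline heuristic, not the repository's model: the names `word_script` and `tag_languages` and the script-to-language table are assumptions made for this example. Script alone cannot separate languages that share a script (Hindi and Marathi both use Devanagari), and it labels romanized Hindi as English, which is where a trained model is needed.

```python
import unicodedata
from collections import Counter

# Assumed mapping from Unicode script name to a language code.
# Devanagari is mapped to Hindi purely for illustration.
SCRIPT_TO_LANG = {
    "DEVANAGARI": "hi",
    "BENGALI": "bn",
    "TAMIL": "ta",
    "TELUGU": "te",
    "LATIN": "en",
}


def word_script(word):
    """Return the most frequent known script in `word`, or None."""
    counts = Counter()
    for ch in word:
        # Unicode character names start with the script name,
        # e.g. "DEVANAGARI LETTER KA" or "LATIN SMALL LETTER A".
        script = unicodedata.name(ch, "").split(" ", 1)[0]
        if script in SCRIPT_TO_LANG:
            counts[script] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def tag_languages(text):
    """Tag each whitespace-separated word with a language code ("und" if unknown)."""
    return [(w, SCRIPT_TO_LANG.get(word_script(w), "und")) for w in text.split()]


if __name__ == "__main__":
    for word, lang in tag_languages("मैं school जा रहा हूँ"):
        print(word, lang)
```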

Transliteration 🌟

Indic Language NLP includes models for transliteration, which convert text from one script to another. Transliteration is currently supported for Hindi-English code-mixed text and can be extended to other languages.
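
To make the idea of script conversion concrete, here is a minimal rule-based sketch that romanizes a small subset of Devanagari. It is not the repository's transliteration model, and the direction (Devanagari to Latin), the function name `devanagari_to_latin`, and the partial character tables are all assumptions for illustration. It handles the inherent vowel "a", vowel signs, and the virama, but not Hindi's final schwa deletion, so "राम" comes out as "raama" rather than "raam".

```python
# Partial, illustrative tables; a real system needs the full script.
CONSONANTS = {
    "क": "k", "ग": "g", "त": "t", "द": "d", "न": "n", "प": "p",
    "म": "m", "र": "r", "ल": "l", "स": "s", "ह": "h",
}
INDEPENDENT_VOWELS = {
    "अ": "a", "आ": "aa", "इ": "i", "ई": "ii", "उ": "u", "ए": "e", "ओ": "o",
}
# Dependent vowel signs (matras), written as escapes for readability.
MATRAS = {
    "\u093e": "aa",  # vowel sign AA
    "\u093f": "i",   # vowel sign I
    "\u0940": "ii",  # vowel sign II
    "\u0941": "u",   # vowel sign U
    "\u0947": "e",   # vowel sign E
    "\u094b": "o",   # vowel sign O
}
VIRAMA = "\u094d"  # suppresses the inherent vowel


def devanagari_to_latin(text):
    """Romanize the supported Devanagari subset; other characters pass through."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent "a"
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in MATRAS and pending_a:
            out.append(MATRAS[ch])  # the vowel sign replaces the inherent "a"
            pending_a = False
        elif ch == VIRAMA and pending_a:
            pending_a = False  # consonant cluster: drop the inherent "a"
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(INDEPENDENT_VOWELS.get(ch, ch))
    if pending_a:
        out.append("a")
    return "".join(out)


if __name__ == "__main__":
    print(devanagari_to_latin("नमस्ते school"))
```

Characters outside the tables, including Latin-script words in code-mixed text, are copied through unchanged.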

Named Entity Recognition (WIP 🏗️) 📊

Indic Language NLP includes models for named entity recognition (NER), which can identify and classify entities such as people, organizations, locations, and more in a given input text. It supports several Indic languages and can be trained on custom datasets for specific domains.
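
NER predictions are often represented as one tag per token in the BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The README does not say which scheme this project uses, so the sketch below assumes BIO; the function name `bio_to_spans`, the tag names, and the example sentence are made up for illustration. It shows how token-level tags are grouped into entity spans.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open entity:
            # close the open entity and skip this token.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


if __name__ == "__main__":
    tokens = ["सचिन", "तेंदुलकर", "मुंबई", "में", "रहते", "हैं"]
    tags = ["B-PER", "I-PER", "B-LOC", "O", "O", "O"]
    print(bio_to_spans(tokens, tags))
```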

Part-of-Speech Tagging (WIP 🏗️) 📝

Indic Language NLP includes models for part-of-speech (POS) tagging, which can identify and tag the parts of speech in a given input text. It supports several Indic languages and can be trained on custom datasets for specific domains.

Sentiment Analysis (WIP 🏗️) 🤔

Indic Language NLP includes models for sentiment analysis, which can classify the sentiment of a given input text as positive, negative, or neutral. It supports several Indic languages and can be trained on custom datasets for specific domains.

Models and Datasets 🧰

Indic Language NLP provides pre-trained models and datasets that can be used out of the box for several tasks. The repository also includes resources for training custom models on specific domains and languages.

The pre-trained models and datasets were developed using data provided by NIC (National Informatics Centre), a department of the Indian Government. For that reason, the dataset cannot be publicly shared or distributed, and the models should be used only for research and non-commercial purposes. Users can still train models on their own datasets, or fine-tune the pre-trained models for their own domains and languages.

Usage 🚀

Indic Language NLP is a code pipeline for performing natural language processing tasks on Indic languages. The code is provided as a Google Colab notebook. To use it, open the notebook in Colab and run the cells in order.

Contributing 🤝

We welcome contributions from the community to improve and extend Indic Language NLP. You can contribute by reporting issues, suggesting new features, or submitting pull requests.

License 🔖

Indic Language NLP is licensed under the MIT License. See the LICENSE file for more details.

Acknowledgments 🙏

Indic Language NLP uses several open-source tools and libraries, including the Hugging Face Transformers library, spaCy, scikit-learn, and more. We would like to acknowledge the contributions of these communities to the field of natural language processing.
