Add nlp tools #379
@@ -455,6 +455,7 @@ This repository contains a curated list of awesome open source libraries that wi

## Industry Strength NLP
* [AdaptNLP](https://github.com/Novetta/adaptnlp) ![](https://img.shields.io/github/stars/Novetta/adaptnlp.svg?style=social) - Built atop Zalando Research's Flair and Hugging Face's Transformers library, AdaptNLP provides Machine Learning Researchers and Scientists a modular and adaptive approach to a variety of NLP tasks with an Easy API for training, inference, and deploying NLP-based microservices.
* [Argilla](https://github.com/argilla-io/argilla) ![](https://img.shields.io/github/stars/argilla-io/argilla.svg?style=social) - Argilla helps domain experts and data teams to build better NLP datasets in less time.
It seems Argilla is more relevant for data labelling
That is a good question, @axsaucedo. I am thinking about the single-category structure of our list, since we are not allowed to assign more than one label to a single tool, as indicated by #383.
As far as I understand, if a tool is designed for data labeling, it should be usable for data labeling universally; ideally, practitioners can use it to label their data regardless of their working domain.
On the other hand, if a tool only caters to a specific domain, such as NLP datasets, it is a domain tool built exclusively for NLP practitioners (since other practitioners cannot use it for other labeling purposes). Then we should categorize it as an NLP tool rather than a general data labeling tool. What's your idea? @axsaucedo
Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well
> Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well

I know this tool has dataset management functionality, but it is exclusively focused on NLP datasets. For example, we cannot use this tool to create/manage/visualize datasets of other modalities such as images, audio, code, etc. @axsaucedo
IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality
> IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality

I think it is better to discuss it over here: #387
I have found that this kind of issue often creates a dilemma for contributors deciding which category to choose for a new tool. @axsaucedo
For sure. If you move it to data labelling, we can continue discussing in the issue; the alternative option is to remove it from the PR
I think it is better to know what makes things right and then do the right thing, rather than make a detour. I have proposed some issues but nobody discussed them at all. If so, why should I make the issues? @axsaucedo
@@ -469,10 +470,13 @@ This repository contains a curated list of awesome open source libraries that wi
* [Haystack](https://github.com/deepset-ai/haystack) ![](https://img.shields.io/github/stars/deepset-ai/haystack.svg?style=social) - Haystack is an open source NLP framework to interact with your data using Transformer models and LLMs (GPT-3 and alike). Haystack offers production-ready tools to quickly build ChatGPT-like question answering, semantic search, text generation, and more.
* [Kashgari](https://github.com/BrikerMan/Kashgari) ![](https://img.shields.io/github/stars/BrikerMan/Kashgari.svg?style=social) - Kashgari is a simple and powerful NLP Transfer learning framework, build a state-of-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.
* [LangChain](https://github.com/hwchase17/langchain) ![](https://img.shields.io/github/stars/hwchase17/langchain.svg?style=social) - LangChain assists in building applications with LLMs through composability.
* [NLTK](https://github.com/nltk/nltk) ![](https://img.shields.io/github/stars/nltk/nltk.svg?style=social) - NLTK (Natural Language Toolkit) is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.
NLTK is a great library, but it is more of a library than a productionisation-relevant framework
> NLTK is a great library, but it is more of a library than a productionisation-relevant framework

I have seen a lot of NLP teams still taking advantage of NLTK. Though it is losing market share to other tools such as spaCy, it is still impactful as a powerful NLP tool. @axsaucedo
Yes, NLTK is a good general library for NLP, but it is more of a library than a productionisation-relevant framework
> Yes, NLTK is a good general library for NLP, but it is more of a library than a productionisation-relevant framework

What does productionisation mean in your mind? I mean, if this general tool belongs to the machine learning category and contributes to real-world products, does it count as a productionisation tool? In that case, NLTK is a productionisation tool. @axsaucedo Maybe we can open up an issue for it.
The way to think about it is that this list should have MLOps tools that are used to productionise, not tools that end up in production because they are used primarily for experimentation and then get deployed/productionised. This would be part of issue #367, but I still don't see it as part of this list, which is why I am removing it for now
If this tool helps with the deployment of NLP products, then it should be included in the list. After viewing the tools in the list, the scope does not seem to be narrowed down exclusively to deployment/post-deployment. For example, there are a lot of tools that focus mostly on experimentation, such as MLflow, ModelDB, JAX, etc. If we do not intend to include tools that help with ML production, then we should remove more than half of the tools in our list. @axsaucedo
* [OpenAI GPT-2](https://github.com/openai/gpt-2) ![](https://img.shields.io/github/stars/openai/gpt-2.svg?style=social) - OpenAI's code from their paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).
* [refinery](https://github.com/code-kern-ai/refinery) ![](https://img.shields.io/github/stars/code-kern-ai/refinery.svg?style=social) - The data scientist's open-source choice to scale, assess and maintain natural language data.
This seems also more relevant for data labelling section
That is a good question, @axsaucedo. I am thinking about the single-category structure of our list, since we are not allowed to assign more than one label to a single tool, as indicated by #383.
As far as I understand, if a tool is designed for data labeling, it should be usable for data labeling universally; ideally, practitioners can use it to label their data regardless of their working domain.
On the other hand, if a tool only caters to a specific domain, such as NLP datasets, it is a domain tool built exclusively for NLP practitioners (since other practitioners cannot use it for other labeling purposes). Then we should categorize it as an NLP tool rather than a general data labeling tool. What's your idea? @axsaucedo
Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well
> Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well

I know this tool has dataset management functionality, but it is exclusively focused on NLP datasets. For example, we cannot use this tool to create/manage/visualize datasets of other modalities such as images, audio, code, etc. @axsaucedo
IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality
> IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality

I think it is better to discuss it over here: #387
I have found that this kind of issue often creates a dilemma for contributors deciding which category to choose for a new tool. @axsaucedo
For sure. If you move it to data labelling, we can continue discussing in the issue; the alternative option is to remove it from the PR
I think it is better to know what makes things right and then do the right thing, rather than make a detour. I have proposed some issues but nobody discussed them at all. If so, why should I make the issues? @axsaucedo
Sorry @zhimin-z, but there's only so much back and forth we can have on a PR that adds 3 lines, two of which are being requested to be removed. Closing, as unfortunately there needs to be a decision and we can't be discussing forever.
Hi @axsaucedo, I do suggest we first settle the issues before making comments in the PRs. It does not make sense to me to turn down a PR without fully discussing a still-open issue.