
Add nlp tools #379

Closed
wants to merge 4 commits into from

zhimin-z
Collaborator

@zhimin-z zhimin-z commented Mar 3, 2023

No description provided.

@@ -455,6 +455,7 @@ This repository contains a curated list of awesome open source libraries that wi

## Industry Strength NLP
* [AdaptNLP](https://github.com/Novetta/adaptnlp) ![](https://img.shields.io/github/stars/Novetta/adaptnlp.svg?style=social) - Built atop Zalando Research's Flair and Hugging Face's Transformers library, AdaptNLP provides Machine Learning Researchers and Scientists a modular and adaptive approach to a variety of NLP tasks with an Easy API for training, inference, and deploying NLP-based microservices.
* [Argilla](https://github.com/argilla-io/argilla) ![](https://img.shields.io/github/stars/argilla-io/argilla.svg?style=social) - Argilla helps domain experts and data teams to build better NLP datasets in less time.
Collaborator

It seems Argilla is more relevant for data labelling

Collaborator Author

That is a good question, @axsaucedo. I am thinking about the single-category structure of our list, since we are not allowed to assign more than one label to a single tool, as indicated by #383.

As far as I understand, if a tool is designed for data labeling, it should be universally usable for data labeling, which means that, ideally, practitioners can use it to label their data regardless of their working domain.

On the other hand, if a tool only caters to a specific domain, such as NLP datasets, it is a domain tool built exclusively for NLP practitioners (since other practitioners cannot use it for other labeling purposes). Then we should categorize it as an NLP tool rather than a general data labeling tool. What do you think? @axsaucedo

Collaborator

Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well

Collaborator Author

> Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well

I know this tool has dataset management functionality, but it is focused exclusively on NLP datasets. For example, we cannot use it to create, manage, or visualize datasets of other modalities such as images, audio, code, etc. @axsaucedo

Collaborator

IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality

Collaborator Author

> IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality

I think it is better to discuss it over here: #387
I have found that this kind of issue often creates a dilemma for contributors deciding which category to choose for a new tool. @axsaucedo

Collaborator

For sure, if you move it to data labelling we can continue discussing in the issue; the alternative option is to remove it from the PR

Collaborator Author

I think it is better to know what makes things right and then do the right thing, rather than make a detour. I have proposed some issues but nobody discussed them at all. If so, why should I make the issues? @axsaucedo

@@ -469,10 +470,13 @@ This repository contains a curated list of awesome open source libraries that wi
* [Haystack](https://github.com/deepset-ai/haystack) ![](https://img.shields.io/github/stars/deepset-ai/haystack.svg?style=social) - Haystack is an open source NLP framework to interact with your data using Transformer models and LLMs (GPT-3 and alike). Haystack offers production-ready tools to quickly build ChatGPT-like question answering, semantic search, text generation, and more.
* [Kashgari](https://github.com/BrikerMan/Kashgari) ![](https://img.shields.io/github/stars/BrikerMan/Kashgari.svg?style=social) - Kashgari is a simple and powerful NLP Transfer learning framework, build a state-of-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.
* [LangChain](https://github.com/hwchase17/langchain) ![](https://img.shields.io/github/stars/hwchase17/langchain.svg?style=social) - LangChain assists in building applications with LLMs through composability.
* [NLTK](https://github.com/nltk/nltk) ![](https://img.shields.io/github/stars/nltk/nltk.svg?style=social) - NLTK (Natural Language Toolkit) is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.
Collaborator

NLTK is a great library, but it is more of a library than a productionisation-relevant framework

Collaborator Author

@zhimin-z zhimin-z Mar 5, 2023

> NLTK is a great library, but it is more of a library than a productionisation-relevant framework

I see a lot of NLP teams still taking advantage of NLTK. Though it is losing market share to other tools such as spaCy, it is still impactful as a powerful NLP tool. @axsaucedo

Collaborator

Yes, NLTK is a good general library for NLP, but it is more of a library than a productionisation-relevant framework

Collaborator Author

> Yes, NLTK is a good general library for NLP, but it is more of a library than a productionisation-relevant framework

What does productionisation mean to you? I mean, if this general tool belongs to the machine learning category and contributes to real-world products, does it count as a productionisation tool? In that case, NLTK is a productionisation tool. @axsaucedo Maybe we can open an issue for it.

Collaborator

The way to think about it is that this list should have MLOps tools that are used to productionise, not tools that are present in production because they are used primarily for experimentation and then deployed/productionised. This would be part of issue #367, but I still don't see it as part of this list, hence removing it for now

Collaborator Author

If this tool helps with the deployment of NLP products, then it should be included in the list. Looking at the tools in the list, the scope does not seem to be narrowed down exclusively to deployment/post-deployment. For example, there are a lot of tools that focus mostly on experimentation, such as MLflow, ModelDB, JAX, etc. If we do not intend to include tools that help with ML production, then we should remove more than half of the tools in our list. @axsaucedo

* [OpenAI GPT-2](https://github.com/openai/gpt-2) ![](https://img.shields.io/github/stars/openai/gpt-2.svg?style=social) - OpenAI's code from their paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).
* [refinery](https://github.com/code-kern-ai/refinery) ![](https://img.shields.io/github/stars/code-kern-ai/refinery.svg?style=social) - The data scientist's open-source choice to scale, assess and maintain natural language data.
Collaborator

This also seems more relevant for the data labelling section

Collaborator Author

That is a good question, @axsaucedo. I am thinking about the single-category structure of our list, since we are not allowed to assign more than one label to a single tool, as indicated by #383.

As far as I understand, if a tool is designed for data labeling, it should be universally usable for data labeling, which means that, ideally, practitioners can use it to label their data regardless of their working domain.

On the other hand, if a tool only caters to a specific domain, such as NLP datasets, it is a domain tool built exclusively for NLP practitioners (since other practitioners cannot use it for other labeling purposes). Then we should categorize it as an NLP tool rather than a general data labeling tool. What do you think? @axsaucedo

Collaborator

Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well

Collaborator Author

> Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset mgmt as well

I know this tool has dataset management functionality, but it is focused exclusively on NLP datasets. For example, we cannot use it to create, manage, or visualize datasets of other modalities such as images, audio, code, etc. @axsaucedo

Collaborator

IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality

Collaborator Author

@zhimin-z zhimin-z Mar 6, 2023

> IMO that's ok; what you suggest is just multimodal data labelling. It's ok for data labelling to be single-modality

I think it is better to discuss it over here: #387
I have found that this kind of issue often creates a dilemma for contributors deciding which category to choose for a new tool. @axsaucedo

Collaborator

For sure, if you move it to data labelling we can continue discussing in the issue; the alternative option is to remove it from the PR

Collaborator Author

I think it is better to know what makes things right and then do the right thing, rather than make a detour. I have proposed some issues but nobody discussed them at all. If so, why should I make the issues? @axsaucedo

@axsaucedo
Collaborator

Sorry @zhimin-z, but there's only so much back and forth we can have on a PR that adds 3 lines, two of which are being requested to be removed. Closing, as unfortunately there needs to be a decision and we can't keep discussing forever

@axsaucedo axsaucedo closed this Mar 7, 2023
@zhimin-z
Collaborator Author

zhimin-z commented Mar 7, 2023

Hi @axsaucedo, I do suggest we first settle the issues before making comments in the PRs. It does not make sense to me to turn down a PR without fully discussing a still-open issue.
