Add nlp tools #379

zhimin-z · 2023-03-03T03:22:42Z

No description provided.

axsaucedo · 2023-03-05T17:51:47Z

README.md

@@ -455,6 +455,7 @@ This repository contains a curated list of awesome open source libraries that wi

 ## Industry Strength NLP
 * [AdaptNLP](https://github.com/Novetta/adaptnlp) ![](https://img.shields.io/github/stars/Novetta/adaptnlp.svg?style=social) - Built atop Zalando Research's Flair and Hugging Face's Transformers library, AdaptNLP provides Machine Learning Researchers and Scientists a modular and adaptive approach to a variety of NLP tasks with an Easy API for training, inference, and deploying NLP-based microservices.
+* [Argilla](https://github.com/argilla-io/argilla) ![](https://img.shields.io/github/stars/argilla-io/argilla.svg?style=social) - Argilla helps domain experts and data teams to build better NLP datasets in less time.


It seems Argilla is more relevant for data labelling

That is a good question, @axsaucedo. I have been thinking about the single-category structure of our list, since we are not allowed to assign more than one label to a tool, as indicated by #383.

As far as I understand, if a tool is designed for data labeling, it should be usable for data labeling universally: ideally, practitioners can use it to label their data regardless of their working domain.

On the other hand, if a tool caters only to a specific domain, such as NLP datasets, it is a domain tool built exclusively for NLP practitioners (since other practitioners cannot use it for other labeling purposes). Then we should categorize it as an NLP tool rather than a general data labeling tool. What do you think? @axsaucedo

Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset management as well.

I know this tool has dataset management functionality, but it is exclusively focused on NLP datasets. For example, we cannot use it to create, manage, or visualize datasets of other modalities such as images, audio, code, etc. @axsaucedo

IMO that's OK. What you suggest is just multimodal data labelling; it's fine for a data labelling tool to cover a single modality.

I think it is better to discuss this over in #387.
I have often found that this kind of issue creates a dilemma for contributors deciding which category to choose for a new tool. @axsaucedo

For sure. If you move it to data labelling, we can continue discussing in the issue; the alternative is to remove it from the PR.

I think it is better to know what makes things right and then do the right thing, rather than make a detour. I have proposed some issues but nobody discussed them at all. If so, why should I make the issues? @axsaucedo

axsaucedo · 2023-03-05T17:52:10Z

README.md

@@ -469,10 +470,13 @@ This repository contains a curated list of awesome open source libraries that wi
 * [Haystack](https://github.com/deepset-ai/haystack) ![](https://img.shields.io/github/stars/deepset-ai/haystack.svg?style=social) - Haystack is an open source NLP framework to interact with your data using Transformer models and LLMs (GPT-3 and alike). Haystack offers production-ready tools to quickly build ChatGPT-like question answering, semantic search, text generation, and more.
 * [Kashgari](https://github.com/BrikerMan/Kashgari) ![](https://img.shields.io/github/stars/BrikerMan/Kashgari.svg?style=social) - Kashgari is a simple and powerful NLP Transfer learning framework, build a state-of-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.
 * [LangChain](https://github.com/hwchase17/langchain) ![](https://img.shields.io/github/stars/hwchase17/langchain.svg?style=social) - LangChain assists in building applications with LLMs through composability.
+* [NLTK](https://github.com/nltk/nltk) ![](https://img.shields.io/github/stars/nltk/nltk.svg?style=social) - NLTK (Natural Language Toolkit) is a suite of open source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.


NLTK is a great library, but it is more of a library than a productionisation-relevant framework.

I have seen a lot of NLP teams still using NLTK. Though it is losing market share to other tools such as spaCy, it remains an impactful and powerful NLP tool. @axsaucedo

Yes, NLTK is a good general library for NLP, but it is more of a library than a productionisation-relevant framework.

What does productionisation mean to you? I mean, if a general tool belongs to the machine learning category and contributes to real-world products, does it count as a productionisation tool? In that case, NLTK is a productionisation tool. @axsaucedo Maybe we can open an issue for it.

The way to think about it is that this list should have MLOps tools that are used to productionise, not tools that are merely present in production because they are used primarily for experimentation and then get deployed/productionised. That would be part of issue #367, but I still don't see it as part of this list, which is why it is being removed for now.

If this tool helps with the deployment of NLP products, then it should be included in the list. Looking at the tools in the list, the scope does not seem to be narrowed exclusively to deployment and post-deployment. For example, there are many tools that focus mostly on experimentation, such as MLflow, ModelDB, JAX, etc. If we do not intend to include tools that help with ML production, then we should remove more than half of the tools in our list. @axsaucedo

axsaucedo · 2023-03-05T17:53:15Z

README.md

 * [OpenAI GPT-2](https://github.com/openai/gpt-2) ![](https://img.shields.io/github/stars/openai/gpt-2.svg?style=social) - OpenAI's code from their paper ["Language Models are Unsupervised Multitask Learners"](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf).
+* [refinery](https://github.com/code-kern-ai/refinery) ![](https://img.shields.io/github/stars/code-kern-ai/refinery.svg?style=social) - The data scientist's open-source choice to scale, assess and maintain natural language data.


This also seems more relevant for the data labelling section.

That is a good question, @axsaucedo. I have been thinking about the single-category structure of our list, since we are not allowed to assign more than one label to a tool, as indicated by #383.

As far as I understand, if a tool is designed for data labeling, it should be usable for data labeling universally: ideally, practitioners can use it to label their data regardless of their working domain.

On the other hand, if a tool caters only to a specific domain, such as NLP datasets, it is a domain tool built exclusively for NLP practitioners (since other practitioners cannot use it for other labeling purposes). Then we should categorize it as an NLP tool rather than a general data labeling tool. What do you think? @axsaucedo

Looking at the repo, it does seem a better fit for data labelling than general NLP, as it involves dataset management as well.

I know this tool has dataset management functionality, but it is exclusively focused on NLP datasets. For example, we cannot use it to create, manage, or visualize datasets of other modalities such as images, audio, code, etc. @axsaucedo

IMO that's OK. What you suggest is just multimodal data labelling; it's fine for a data labelling tool to cover a single modality.

I think it is better to discuss this over in #387.
I have often found that this kind of issue creates a dilemma for contributors deciding which category to choose for a new tool. @axsaucedo

For sure. If you move it to data labelling, we can continue discussing in the issue; the alternative is to remove it from the PR.

I think it is better to know what makes things right and then do the right thing, rather than make a detour. I have proposed some issues but nobody discussed them at all. If so, why should I make the issues? @axsaucedo


axsaucedo · 2023-03-07T05:42:34Z

Sorry @zhimin-z, but there's only so much back and forth we can have on a PR that adds 3 lines, two of which are being requested for removal. I'm closing this, as unfortunately there needs to be a decision and we can't keep discussing forever.

zhimin-z · 2023-03-07T05:49:47Z

Hi @axsaucedo, I do suggest we first settle the issues before making comments in the PRs. It does not make sense to me to turn down a PR without fully discussing a still-open issue.

zhimin-z added 4 commits March 2, 2023 20:17

add Argilla

d4a76b5

add refinery

d35a55f

add Gretel Synthetics

7d95ff0

add NLTK

5993f85

axsaucedo reviewed Mar 5, 2023


zhimin-z mentioned this pull request Mar 6, 2023

Proposal to make the categorization standard explicit and clear #387

Closed

axsaucedo closed this Mar 7, 2023

zhimin-z mentioned this pull request Mar 8, 2023

Proposal for better guideline of commitment and merge #394

Closed

zhimin-z deleted the add-nlp-tools branch March 25, 2023 13:46
