Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removal of the "low interest" modules/packages #196

Closed
mariosasko opened this issue Oct 8, 2020 · 6 comments
Closed

Removal of the "low interest" modules/packages #196

mariosasko opened this issue Oct 8, 2020 · 6 comments
Milestone

Comments

@mariosasko
Copy link
Collaborator

mariosasko commented Oct 8, 2020

IMO, there are some modules/packages that don't add any value to the project or are of low interest for English speaking users. By removing them, the codebase gets cleaner and we no longer have to maintain such modulues/packages. If we decide to keep them, our project will not have a clear direction. Consequently, it could affect the number of users in the long run.

The removal of the module/package is a 5-step process:

  1. See if the module/package is used in the project examples. This requires further discussion
  2. Remove the module (the .py file) or the package directory.
  3. Check the __init__.py in the parent directory and delete the related entries.
  4. Remove the corresponding test file/directory if there is one.
  5. Optionally, remove the doc entry if there is one.
  6. See if the removed module/package requires a specific dependency. If it does, remove the dependency from setup.py.

I'll edit this post to discuss and stage the module/package removal.

Feel free to comment.

cc @mttk @FilipBolt @ivansmokovic

@mttk
Copy link
Member

mttk commented Oct 8, 2020

Absolutely agreed. Removal of WIP modules is also something that can increase cleanliness, not only for English speaking users.

@mariosasko
Copy link
Collaborator Author

mariosasko commented Oct 11, 2020

I propose removal of the following packages/modules:

  • metrics/* - just a thin wrapper around sklearn.metrics
  • preproc/lemmatizer/* - it uses our internal servers and it's bound only to Croatian, we can use spacy for that (Croatian is supported)
  • preproc/stemmer/* - bound only to Croatian. I assume it's not easy to found a Croatian stemmer (not sure if the code is copied though) so this is a valid argument for keeping it, but still users can use Croatian lemmatizer via spacy and lemmatization is preferred over stemming in most cases
  • preproc/yake.py - this component is a thin wrapper around the yake library, doesn't interect with podium
  • preproc/stop_words.py - holds a list of Croatian stopwords, spacy has its own for Croatian language model, pretty sure there are other online sources too
  • validation/* - not sure but kfold cross validation is applicable only if the model/dataset is small, sadly not the case in modern DL/NLP, imo this method is too specific and more related to ML models (SVM, ...) and not to DL/NLP

Please note that I am not claiming that the packages/modules listed above need to be removed completely. If they are not hosted on any other platform, we can create a separate repo to hold them there.

@mariosasko
Copy link
Collaborator Author

One more thing. If we remove preproc/stemmer/* then we no longer need zip_safe=False in setup.py because, by doing so, no files will be left that are holding data. Having zip_safe set to True is preferred whenever possible. This stemmer (python source) is already published online.

@mariosasko
Copy link
Collaborator Author

One more reason to remove preproc/yake.py. Rn, PyPI doesn't support links to online repos in the list of required packages in setup.py. The yake library relies on this, but we don't really need this component (it doesn't interact with podium at all).

@FilipBolt
Copy link
Collaborator

Adding a few things here. Instead of complete removal since some downstream dependencies already depend on these libraries (and more will depend soon) let's try to move most of them a new respository (something like podium-models).

I propose that the only the podium/metrics/metrics.py is removed. The following modules I propose to move to the other (non-core repository):

  • podium/preproc/lemmatizer/
  • podium/preproc/stemmer/
  • podium/preproc/yake.py
  • podium/preproc/util.py (used only by lemmatizer and stemmer)
  • podium/dataload/eurovoc.py
  • podium/dataload/ner_croatian.py
  • podium/models/impl/blcc/chain_crf.py
  • podium/models/impl/blcc_model.py
  • podium/models/eurovoc_models/multilabel_svm.py

Migrating these modules to a separate repository should allow for also migrating classes SCPDownloader (podium/storage/resources/downloader.py) and SCPLargeResource (podium/storage/resources/large_resource.py) which should significantly simplify the downloading and large resource modules.

@mttk
Copy link
Member

mttk commented Mar 28, 2021

More or less done IMO with the recent private transfer & changes.

@mttk mttk closed this as completed Mar 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants