implement rclone-based storage.py equivalent #2942

Closed
RafalSkolasinski opened this issue Feb 10, 2021 — with Board Genius Sync · 3 comments · Fixed by #3159

Comments

Contributor

RafalSkolasinski commented Feb 10, 2021

Initial tests with rclone are very positive. The library performs very well, is easy to use, ticks all of the boxes, and the community is really great and quick to respond.

We already have an example on building rclone-based init container here.

There are, however, a few things to consider:

  • we need to configure the new tool to work out-of-the-box with public S3 / GCS buckets
  • provide a compatibility layer for URIs defined with the older pattern (s3://seldon/sklearn/iris vs. the s3:seldon/sklearn/iris format that rclone would prefer)
  • provide a compatibility layer for secrets (environment variables) configured for storage.py
  • provide a nice Python API so the tool can easily be used from a Jupyter notebook to upload artefacts to model repositories
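
The URI compatibility point could be handled with a small translation step. The following is only a sketch: the function name `to_rclone_uri` is hypothetical, and it assumes that the scheme of the legacy URI (for example `s3` or `gs`) is also the name of the configured rclone remote.

```python
import re

# Matches legacy URIs such as "s3://bucket/path": a remote name, "://", then the rest.
_LEGACY_URI = re.compile(r"^(?P<remote>[A-Za-z0-9_]+)://(?P<path>.*)$")


def to_rclone_uri(uri: str) -> str:
    """Rewrite "remote://path" as "remote:path"; return other inputs unchanged.

    Hypothetical helper: assumes the URI scheme doubles as the rclone remote name.
    """
    match = _LEGACY_URI.match(uri)
    if match is None:
        return uri
    return f"{match.group('remote')}:{match.group('path')}"


print(to_rclone_uri("s3://seldon/sklearn/iris"))
```

URIs that are already in the rclone form pass through untouched, so the helper can be applied unconditionally.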

This probably closes #1028

Note: storage.py is used here more in the context of the kfserving storage initializer mechanism than of the actual Python package.

@RafalSkolasinski RafalSkolasinski added the triage Needs to be triaged and prioritised accordingly label Feb 10, 2021
@RafalSkolasinski
Contributor Author

We should then use this Python API in our Jupyter notebook examples that require artefact operations, for example: https://docs.seldon.io/projects/seldon-core/en/latest/examples/minio-sklearn.html

@ukclivecox ukclivecox removed the triage Needs to be triaged and prioritised accordingly label Feb 11, 2021
@RafalSkolasinski
Contributor Author

With Minio's mc client it is

mc config host add gcs https://storage.googleapis.com "" ""
mc config host add minio-seldon http://localhost:8090 minioadmin minioadmin

mc mb minio-seldon/iris -p
mc cp gcs/seldon-models/sklearn/iris/model.joblib minio-seldon/iris/

The rclone solution would require either the following .config/rclone/rclone.conf config:

[minio-seldon]
type = s3
provider = minio
env_auth = false

access_key_id = minioadmin
secret_access_key = minioadmin
endpoint = http://localhost:8090

[gcs]
type = google cloud storage
anonymous = true

or the equivalent environment variable config:

RCLONE_CONFIG_MINIO_SELDON_TYPE=s3
RCLONE_CONFIG_MINIO_SELDON_PROVIDER=minio
RCLONE_CONFIG_MINIO_SELDON_ENV_AUTH=false
RCLONE_CONFIG_MINIO_SELDON_ACCESS_KEY_ID=minioadmin
RCLONE_CONFIG_MINIO_SELDON_SECRET_ACCESS_KEY=minioadmin
RCLONE_CONFIG_MINIO_SELDON_ENDPOINT=http://localhost:8090

RCLONE_CONFIG_GCS_TYPE=google cloud storage
RCLONE_CONFIG_GCS_ANONYMOUS=true

with the command being:

rclone copy gcs:seldon-models/sklearn/iris/model.joblib minio-seldon:iris/
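
The mapping from a config file section to environment variables follows a fixed pattern (`RCLONE_CONFIG_` + remote name + `_` + option name, upper-cased), so it can be generated. A minimal sketch, where `rclone_env` is a hypothetical helper name: rclone's documentation notes that a remote defined through environment variables can only have letters, digits and underscores in its name, so the sketch rejects anything else.

```python
import re
from typing import Dict, Mapping


def rclone_env(remote: str, options: Mapping[str, str]) -> Dict[str, str]:
    """Build RCLONE_CONFIG_<REMOTE>_<OPTION> variables for one remote.

    Hypothetical helper. The remote name must be usable inside an
    environment variable name: letters, digits and underscores only.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", remote):
        raise ValueError(f"remote name {remote!r} cannot be used in an env variable")
    prefix = f"RCLONE_CONFIG_{remote.upper()}_"
    return {prefix + key.upper(): str(value) for key, value in options.items()}


# Same values as the anonymous GCS remote in the config above.
for name, value in rclone_env(
    "gcs", {"type": "google cloud storage", "anonymous": "true"}
).items():
    print(f"{name}={value}")
```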

@ukclivecox ukclivecox removed their assignment Feb 18, 2021
@axsaucedo axsaucedo changed the title implement rclone-based storage.py equivalent OSS-200: implement rclone-based storage.py equivalent Apr 26, 2021
@axsaucedo axsaucedo changed the title OSS-200: implement rclone-based storage.py equivalent implement rclone-based storage.py equivalent Apr 27, 2021
@RafalSkolasinski
Contributor Author

RafalSkolasinski commented May 4, 2021

There are, however, a few things to consider:

  1. we need to configure the new tool to work out-of-the-box with public S3 / GCS buckets
  2. provide a compatibility layer for URIs defined with the older pattern (s3://seldon/sklearn/iris vs. the s3:seldon/sklearn/iris format that rclone would prefer)
  3. provide a compatibility layer for secrets (environment variables) configured for storage.py
  4. provide a nice Python API so the tool can easily be used from a Jupyter notebook to upload artefacts to model repositories

Ad 1. Public S3 may be tricky, as it often requires configuring the region first. It seems we do not use examples from AWS S3, so I believe the only out-of-the-box compatibility should be for the example models we have on Google Cloud, e.g. gs://seldon-models/sklearn/iris

This can easily be done by embedding these two env variables into the image:

RCLONE_CONFIG_GS_TYPE=google cloud storage
RCLONE_CONFIG_GS_ANONYMOUS=true

Ad 2. It turns out that rclone seems to ignore the leading slashes, so gs://seldon-models/sklearn/iris works just like gs:seldon-models/sklearn/iris would. No problem here.

Ad 3. TBC but IMO does not bring benefit in the long run as it only adds complexity and maintenance burden - much better to ask users to redefine secrets using rclone-compatible format.
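
For a sense of what such a compatibility layer would involve, here is a minimal sketch. Everything in it is an assumption: the legacy variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_ENDPOINT_URL`) are a guess at what a storage.py-style secret contains and should be checked against the secrets actually in use, and `legacy_s3_env_to_rclone` is a hypothetical name.

```python
from typing import Dict, Mapping

# Assumed legacy variable -> rclone S3 option; verify against the real secrets.
_LEGACY_TO_RCLONE = {
    "AWS_ACCESS_KEY_ID": "ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY": "SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL": "ENDPOINT",
}


def legacy_s3_env_to_rclone(environ: Mapping[str, str], remote: str = "s3") -> Dict[str, str]:
    """Translate assumed legacy S3 variables into rclone remote variables."""
    prefix = f"RCLONE_CONFIG_{remote.upper()}_"
    translated = {prefix + "TYPE": "s3", prefix + "ENV_AUTH": "false"}
    for legacy_name, option in _LEGACY_TO_RCLONE.items():
        if legacy_name in environ:
            translated[prefix + option] = environ[legacy_name]
    return translated


legacy = {"AWS_ACCESS_KEY_ID": "minioadmin", "AWS_ENDPOINT_URL": "http://localhost:8090"}
for name, value in sorted(legacy_s3_env_to_rclone(legacy).items()):
    print(f"{name}={value}")
```

Even this small mapping has to be kept in sync with two sets of option names, which is the maintenance burden mentioned above.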

Ad 4. Not a priority for making rclone-based init containers the default. It's a nice-to-have but can be done as a follow-up.
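
As a rough idea of the follow-up, a notebook-friendly API could start as a thin wrapper around the `rclone copy` command. This is a sketch only: the function names `build_copy_command` and `copy` are hypothetical, and it assumes the `rclone` binary is available on `PATH`.

```python
import os
import subprocess
from typing import List, Mapping, Optional


def build_copy_command(source: str, destination: str) -> List[str]:
    """Return the argument list for `rclone copy <source> <destination>`."""
    return ["rclone", "copy", source, destination]


def copy(source: str, destination: str, extra_env: Optional[Mapping[str, str]] = None) -> None:
    """Run `rclone copy`, optionally adding remote definitions via env variables.

    Hypothetical wrapper; raises subprocess.CalledProcessError if rclone fails.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    subprocess.run(build_copy_command(source, destination), env=env, check=True)


print(build_copy_command("gs://seldon-models/sklearn/iris", "./iris"))
```

From a notebook, the example model could then be fetched with something like `copy("gs://seldon-models/sklearn/iris", "./iris", extra_env={"RCLONE_CONFIG_GS_TYPE": "google cloud storage", "RCLONE_CONFIG_GS_ANONYMOUS": "true"})`, where `./iris` is just an illustrative local directory.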


Considering the above, in order to make https://github.com/SeldonIO/seldon-core/tree/master/components/rclone-storage-initializer the default Storage Initializer, I suggest the following:

  • Make it work out of the box with GCS, e.g. gs://seldon-models/sklearn/iris
  • Add details on the rclone-based storage initializer to the pre-packaged model server docs (example configs, links, etc.)
  • Add dedicated upgrading notes that explain how to fall back to the kfserving storage initializer
  • Make it default in helm charts and make sure it passes all the tests we have
