Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: properly assign default server option values #649

Merged
merged 22 commits into from
May 31, 2021

Conversation

olevski
Copy link
Member

@olevski olevski commented May 10, 2021

This properly assigns the default values to server options in all cases.

The logic is as follows:

  • You do not have to specify serverOptions at all in the api call to start a session. In this case all the defaults from the values.yaml are used.
  • If you do specify the server options, then the parameters for cpu, memory, default url and LFS auto-fetch are mandatory. The parameters for disk size and gpu are still allowed to be missing and replaced with defaults from values.yaml.
  • The disk size request and gpus are not mandatory in values.yaml so in the case that the deployment does not include them in the values.yaml file then backup defaults are hardcoded in the renku notebooks code. These backup defaults are used only if the values.yaml file does not specify these parameters. The backup defaults are 0 gpus and 1G of disk space.

As for limiting the size when using emptyDir, I tested and k8s will evict the pods if they even go over the limit for their emptyDir by a few megabytes. When the limit is not respected the eviction happens in a matter of minutes if not seconds. However there are a few issues:

  • The autosave branch is not created so all work is lost
  • The status of the pod is evicted but the UI and notebooks do not really use or properly communicate this status. So a session is essentially stuck in limbo because it does not go away but its status is set to evicted but from the perspective of the notebooks it considers anything that is not with status running to be loading. The UI does not allow the user to shut down a session while it is loading so the evicted session can only be cleaned up by using kubectl or calling the API directly.

I have created a follow up issue to properly handle the size limit for emptyDir and cleanup of evicted sessions here #648.

One last question I have is what we want to do with the gpu and disk size portions for server options in values.yaml. As the code is written right now the cpu, memory, default url and LFS auto-fetch are mandatory in the values.yaml. Gpus and disk size are not mandatory and defaults for them are hardcoded in renku-notebooks code.

/deploy

@olevski olevski requested a review from a team as a code owner May 10, 2021 22:24
@olevski olevski temporarily deployed to renku-ci-nb-649 May 10, 2021 22:25 Inactive
@RenkuBot
Copy link
Contributor

You can access the deployment of this PR at https://renku-ci-nb-649.dev.renku.ch

@olevski olevski temporarily deployed to renku-ci-nb-649 May 10, 2021 22:37 Inactive
@olevski olevski temporarily deployed to renku-ci-nb-649 May 10, 2021 22:50 Inactive
@ableuler
Copy link
Contributor

ableuler commented May 11, 2021

Thanks for the immediate action on this! I have 4 questions/remarks:

If you do specify the server options, then the parameters for cpu, memory, default url and LFS auto-fetch are mandatory. The parameters for disk size and gpu are still allowed to be missing and replaced with defaults from values.yaml.

Why not use the same logic for all parameters, ie allow all parameters to be missing and simply use the defaults from values.yaml for all missing params? I think that would be the simpler logic and would allow admins to for example disable showing the memory selection in the UI through the values file if they for some reason wanted to do this.

These backup defaults are used only if the values.yaml file does not specify these parameters. The backup defaults are 0 gpus and 1G of disk space.

I would prefer to have these defaults (also) set in the values.yaml that we ship with the renku-notebook chart. The helm chart is currently the only way people deploy this application and the values.yaml is where people go first to look for configuration. If defaults are also set in the code, preferably only in the app config.

The status of the pod is evicted but the UI and notebooks do not really use or properly communicate this status. So a session is essentially stuck in limbo because it does not go away but its status is set to evicted but from the perspective of the notebooks it considers anything that is not with status running to be loading. The UI does not allow the user to shut down a session while it is loading so the evicted session can only be cleaned up by using kubectl or calling the API directly.

I guess that is the case already now for pods that are evicted because of insufficient disk space on a node. So in that sense it's already an improvement over the status quo since this PR makes sure that the eviction hits a session that indeed consumes a lot of disk space.

One more thing to figure out for sure: do you know if the notebook containers usage of the empyDir volume is also counted against the containers ephemeral-storage quota? I noticed that yesterday a pod without any limit on the empty dir got evicted when I added more than ~22GB of data to the emptyDir volume from within the session. This session had the default ephemeral-storage limit of 20GB. I'd find this a rather unexpected behaviour. But if what I describe above holds true, we might want to set the ephemeral storage limit on the notebooks container to 20GB + emptyDir limit to get a more expected behaviour.

@olevski
Copy link
Member Author

olevski commented May 11, 2021

One more thing to figure out for sure: do you know if the notebook containers usage of the empyDir volume is also counted against the containers ephemeral-storage quota? I noticed that yesterday a pod without any limit on the empty dir got evicted when I added more than ~22GB of data to the emptyDir volume from within the session. This session had the default ephemeral-storage limit of 20GB. I'd find this a rather unexpected behaviour. But if what I describe above holds true, we might want to set the ephemeral storage limit on the notebooks container to 20GB + emptyDir limit to get a more expected behaviour.

You are correct @ableuler, the emptydir usage counts against the ephemeral storage limit. I started a session where the ephemeral storage limit was set to 1GB but the size limit on empty dir was set to 10GB. After adding 2GB worth of data in the user session the pod was evicted with the message:

Warning  Evicted              13s    kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
Normal   Killing              13s    kubelet            Stopping container notebook

@olevski olevski requested a review from a team as a code owner May 12, 2021 13:53
@olevski olevski temporarily deployed to renku-ci-nb-649 May 12, 2021 14:09 Inactive
@olevski olevski temporarily deployed to renku-ci-nb-649 May 12, 2021 15:18 Inactive
@olevski
Copy link
Member Author

olevski commented May 12, 2021

@ableuler I addressed all your comments from above. In addition to this I already implemented the changes required for #651 because it was easier to just include this rather than not and then add logic to parse and extract data from the current way we provide the server options.

Now the values.yaml files have the following sections:

serverOptionsUI:
  defaultUrl:
    order: 1
    displayName: Default Environment
    type: enum
    default: /lab
    options: [/lab]
  cpu_request:
    order: 2
    displayName: Number of CPUs
    type: enum
    default: 0.5
    options: [0.5, 1.0]
  # ....
serverOptionsDefaults:
  defaultUrl: /lab
  cpu_request: 0.5
  mem_request: 1G
  disk_request: 1G
  gpu_request: 0
  lfs_auto_fetch: false

The serverOptionsUI do not have to be mandatory, but it seems that the UI cannot handle server options that are fully empty because I think it always expects to have to render something. When I deployed with no server options for the UI the page for launching a new environment would never finish loading. I will submit an issue about this on the UI repo. So for now I made it so that at least the option for lfs auto fetch has to be provided in the values file - all other fields are optional.

Other than that all the other points we discussed are implemented:

  • the defaults are always used and overwritten by any server options if any are passed in the post request for launching a session
  • launching a session succeeds even if absolutely no server options are passed or even if an empty dictionary is passed for them (in this case the defaults from values.yaml are fully used)
  • the ephemeral storage limit and disk size limit are both utilized - I added a fairly detailed explanation in the values file about this
  • now the person who deploys renku can decide what server options to allow the user to change and which should not be changed and their only valid values are the values specified under serverOptionsDefaults in the values.yaml file.

I think this covers all use cases we may come across now and in the future.

@olevski
Copy link
Member Author

olevski commented May 12, 2021

After this PR is merged then this PR should also be merged: SwissDataScienceCenter/renku#2121

@olevski olevski temporarily deployed to renku-ci-nb-649 May 17, 2021 13:16 Inactive
jupyterhub/spawners.py Show resolved Hide resolved
helm-chart/renku-notebooks/templates/configmap.yaml Outdated Show resolved Hide resolved
@olevski olevski temporarily deployed to renku-ci-nb-649 May 18, 2021 10:12 Inactive
@olevski olevski requested a review from ableuler May 18, 2021 10:17
Copy link
Contributor

@ableuler ableuler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for the updates! 👍

@olevski olevski deployed to renku-ci-nb-649 May 18, 2021 21:11 Active
@rokroskar
Copy link
Member

@olevski any reason not to merge this?

@olevski
Copy link
Member Author

olevski commented May 31, 2021

@olevski any reason not to merge this?

No, no reason at all. I should have merged it earlier.

@olevski olevski merged commit eda1685 into master May 31, 2021
@olevski olevski deleted the fix-properly-assign-default-server-option-values branch May 31, 2021 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants