Add SFTP service to basehub #20

Open
7 tasks
Tracked by #2201
consideRatio opened this issue Feb 15, 2023 · 4 comments

Comments


consideRatio commented Feb 15, 2023

This service, which ships with the yuvipanda/jupyterhub-ssh helm chart, is used to bring data into and out of home directories. It works without involving a user server, by having an SFTP server mount the user storage directly.

The jupyterhub-ssh chart also provides another kind of service: starting or accessing already-started user servers via SSH. Setting that up is not part of this issue - only the SFTP server is.
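Once enabled, end users would transfer files with a standard SFTP client. A minimal sketch, assuming the hostname and port from the JMTE example further down in this issue; `<username>` is a placeholder for a real JupyterHub username, and the snippet only prints the command a user would run:

```shell
# Assumptions: hostname and SFTP port taken from the JMTE example in this
# issue; "<username>" is a placeholder for a JupyterHub username.
HUB_HOST=hub.jupytearth.org
SFTP_PORT=2222
SFTP_CMD="sftp -P ${SFTP_PORT} <username>@${HUB_HOST}"
# Print the command rather than running it, since it requires a live hub.
echo "${SFTP_CMD}"
```

The jupyterhub-ssh project's docs describe authenticating with the user's JupyterHub API token as the password; the same is expected to apply here.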

Action point

  • Find agreement to work towards adding this to basehub - use 👍 or comment why not
  • Upstream relevant chore maintenance
  • Add an opt-in dependency on the helm chart and trial it
    See the technical notes for an example of what this can involve, as exemplified by configuration from the JMTE deployment I've managed.
  • Write documentation
    • User facing docs
    • Engineer facing docs
  • Follow up by reopening closed support tickets
    • Freshdesk ticket 447

Technical notes

This ought to be an opt-in feature, not enabled by default initially.

Below are snippets of config from configuring jupyterhub-ssh (both the ssh and sftp parts) for hub.jupytearth.org, also referred to as the JMTE project, as seen in the changes in 2i2c-org/infrastructure#436. There, jupyterhub-ssh was added as a dependency of the daskhub helm chart.

From config/clusters/jmte/common.values.yaml

basehub:
  jupyterhub:
    proxy:
      service:
        # jupyterhub-ssh/sftp integration part 1/3:
        #
        # We must accept traffic to the k8s Service (proxy-public) receiving traffic
        # from the internet. Port 22 is typically used for both SSH and SFTP, but we
        # can't use the same port for both so we use 2222 for SFTP in this example.
        #
        extraPorts:
          - name: ssh
            port: 22
            targetPort: ssh
          - name: sftp
            port: 2222
            targetPort: sftp
      traefik:
        # jupyterhub-ssh/sftp integration part 2/3:
        #
        # We must accept traffic arriving to the autohttps pod (traefik) from the
        # proxy-public service. Expose a port and update the NetworkPolicy
        # to tolerate incoming (ingress) traffic on the exposed port.
        #
        extraPorts:
          - name: ssh
            containerPort: 8022
          - name: sftp
            containerPort: 2222
        networkPolicy:
          allowedIngressPorts: [http, https, ssh, sftp]
        # jupyterhub-ssh/sftp integration part 3/3:
        #
        # extraStaticConfig is adjusted by staging/prod values
        # extraDynamicConfig is adjusted by staging/prod values



# jupyterhub-ssh values.yaml reference:
# https://github.com/yuvipanda/jupyterhub-ssh/blob/main/helm-chart/jupyterhub-ssh/values.yaml
#
jupyterhub-ssh:
  hubUrl: http://proxy-http:8000

  ssh:
    enabled: true

  sftp:
    # enabled is adjusted by staging/prod values
    # enabled: true
    pvc:
      enabled: true
      name: home-nfs

From config/clusters/jmte/prod.values.yaml:

basehub:
  jupyterhub:
    proxy:
      traefik:
        # jupyterhub-ssh/sftp integration part 3/3:
        #
        # We must let traefik know it should listen for traffic (traefik entrypoint)
        # and route it (traefik router) onwards to the jupyterhub-ssh k8s Service
        # (traefik service).
        #
        extraStaticConfig:
          entryPoints:
            ssh-entrypoint:
              address: :8022
            sftp-entrypoint:
              address: :2222
        extraDynamicConfig:
          tcp:
            services:
              ssh-service:
                loadBalancer:
                  servers:
                    - address: jupyterhub-ssh:22
              sftp-service:
                loadBalancer:
                  servers:
                    - address: jupyterhub-sftp:22
            routers:
              ssh-router:
                entrypoints: [ssh-entrypoint]
                rule: HostSNI(`*`)
                service: ssh-service
              sftp-router:
                entrypoints: [sftp-entrypoint]
                rule: HostSNI(`*`)
                service: sftp-service



jupyterhub-ssh:
  sftp:
    enabled: true

From helm-charts/daskhub/Chart.yaml:

  - name: jupyterhub-ssh
    version: 0.0.1-n142.h402a3d6
    repository: https://yuvipanda.github.io/jupyterhub-ssh/

From helm-charts/daskhub/values.schema.yaml:

  # jupyterhub-ssh is a dependent helm chart, we rely on its schema validation
  # for values passed to it and are not imposing restrictions on them in this
  # helm chart.
  jupyterhub-ssh:
    type: object
    additionalProperties: true
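For the basehub opt-in itself, the dependency entry from the daskhub example above could be carried over. A sketch of what basehub's Chart.yaml might gain; the `condition` key is an assumption about how the opt-in could be wired (helm skips installing a dependency when the referenced value is false):

```yaml
# Sketch for a hypothetical basehub Chart.yaml entry; the version pin is
# copied from the daskhub example above.
dependencies:
  - name: jupyterhub-ssh
    version: 0.0.1-n142.h402a3d6
    repository: https://yuvipanda.github.io/jupyterhub-ssh/
    # condition is an assumption: it disables the subchart unless a hub
    # opts in by setting jupyterhub-ssh.sftp.enabled: true
    condition: jupyterhub-ssh.sftp.enabled
```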
@consideRatio changed the title on Feb 15, 2023 from "Add SFTP service to basehub, as that service is used to bring data in and out of home directories. JupyterHub SSH itself doesn't seem to be used much, so we can probably ignore that one." to "Add SFTP service to basehub".
@yuvipanda

I think there's a lot of demand for sftp in particular, so I think we should definitely do this! I would also say we should turn on sftp by default (but not ssh), as it just uses OpenSSH and is fairly secure.

@consideRatio

@yuvipanda I think it's reasonable to expose this by default long term, but for the sake of stability across hubs we should let it be piloted in a few hubs until we have ensured that the dependency is sufficiently mature.

I'm thinking, for example, of this service being enabled in a hub where security is critical while we end up running a quite old build of a docker image with outdated dependencies that have known vulnerabilities, for example in OpenSSH.

@yuvipanda

Yep makes sense! Let's not turn it on by default to start with.


tsnow03 commented Apr 10, 2024

Hi @yuvipanda and @consideRatio. Bumping this issue about SSH capabilities in the cloud. We have been getting a number of use cases for SSH capabilities in CryoCloud. Here are some of the use cases, including the one mentioned in CryoInTheCloud/hub-image/issues/54:

  • A student running simulations in the Ice Sheet System Model (ISSM), which runs using SLURM on a supercomputer. The workflow becomes cumbersome if only one person (the advisor) has access to an HPC (in this case one requiring NASA credentials). Small changes regularly need to be made and tested to configure a model run. Right now, if the student wants to test a new configuration, they need to send the files and data to their advisor (or share them on CryoCloud), the advisor needs to pull the files out of CryoCloud, run them through the NASA HPC, and send the output back to the student to put back into CryoCloud for post-processing. With SSH capabilities in CryoCloud, the student could easily do the pre- and post-processing in CryoCloud, sharing the workflow and having the advisor test or run the new configurations with no transfer of data or files required. This sort of workflow is critical for any of the modelers in our group. In particular, we are hoping to use this capability for the upcoming ISMIP7 (Ice Sheet Model Intercomparison Project) effort for the next IPCC reports, which will start to ramp up in the next 3-6 months. These are users that have access to the GHub (Buffalo) and NASA HPCs.

  • There are similar use cases for other glacier modeling groups, compute-intensive geostatistics (UF NVIDIA), and massive remote sensing processing projects (ITS_LIVE) that need HPC compute regularly but where you want to do the pre- and post-processing and data streaming in the cloud. NASA has solved this problem for some of their internal users by creating a cloud HPC-like setup that uses SLURM, but it is very costly and seems like a waste of resources when there is an HPC sitting at NASA that could be used for the same purpose. It would be extremely cost-saving for our community to have SSH capabilities.

  • As mentioned above, SSH or SFTP capability from the terminal command line within CryoCloud would enable users to upload their own data remotely, streamlining data access and transfer that is currently a little clunkier.
