
create basic example workflow in kubernetes #1

Open
mbjones opened this issue Apr 8, 2022 · 11 comments
@mbjones
Member

mbjones commented Apr 8, 2022

Create an example workflow that executes a parsl job on a kubernetes cluster.

@mbjones mbjones self-assigned this Apr 8, 2022
mbjones added a commit that referenced this issue Apr 8, 2022
@mbjones
Member Author

mbjones commented Apr 8, 2022

Committed a minimal example in sha a9cfb14 that uses rancher desktop on localhost with hard-coded configuration.

@julietcohen
Collaborator

julietcohen commented Nov 27, 2023

Progress with Docker & Kubernetes

I successfully ran a simple PDG viz workflow in serial using docker on my laptop. This means I was able to:

  1. build a docker image that installs the PDG packages and a few other requirements, and copies in the input data and workflow script
  2. run a container that executes a simple version of the workflow with a small amount of input data
  3. save the workflow outputs to a persistent local directory

Now it's time to bring kubernetes and parsl into the mix. Matt created an example repo that uses kubernetes and parsl here. This repo does not contain a script that runs the viz workflow, but rather a simple python script for practicing running code on the kubernetes cluster with parsl, accessed through the Datateam server.
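As a rough sketch of what such a practice script looks like (the executor label, image name, namespace, and toy app below are my assumptions, not the contents of Matt's repo, and running it requires a live cluster):

```python
# Sketch of a minimal parsl-on-kubernetes practice script.
# The image name, namespace, and toy app are illustrative assumptions.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import KubernetesProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="kube-htex",
            provider=KubernetesProvider(
                image="python:3.9",   # must match the parsl/python versions in this env
                namespace="default",  # use the dev namespace/context for testing
            ),
        )
    ]
)

@python_app
def double(x):
    # trivial work, just to prove pods start and return results
    return 2 * x

if __name__ == "__main__":
    parsl.load(config)
    print(double(21).result())
```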

The following steps document basics for how to use the kubernetes cluster in this context. These will be moved into a formal documentation file in this repository when I make them more comprehensive and generalized.

  • create or update the .kube files in your home dir on Datateam (a config file and a config-dev file)
  • clone the repo of interest (Matt's repo) into home dir on datateam
  • create a virtualenv and install the requirements.txt located in the repo
    • issues could arise from discrepancies between the parsl and python versions in the terminal env versus those in the image, or from using a conda env, so a virtualenv may be safer, though this has yet to be proven
  • ensure the correct context is activated. It's important to use dev contexts for testing.
    • run kubectl config get-contexts to check all available contexts
    • run kubectl config use-context dev-pdgrun to use the dev-pdgrun context
    • run kubectl config current-context to simply check which one you're using
  • run the python script that includes the parsl & kubernetes config and does work
  • if the script fails, check for hanging pods with kubectl get pods, then delete them one by one with kubectl delete pod {pod-name}
  • if it succeeds, the pods will be deleted by parsl (it's part of the parsl cleanup process). It's still a good idea to check for hanging pods, though, because parsl only cleans up the pods that started while the work was being done. If the work finished quickly, pods that were queued but never started before the script finished will still need to be deleted manually.
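The failure-cleanup step above can be scripted. As a hypothetical helper (the pod names and statuses in the sample are made up, not from a real cluster), one way to pick out the pods worth deleting from `kubectl get pods` output:

```python
# Hypothetical helper: parse `kubectl get pods` output and list the pods
# that are not Running or Completed, so they can be deleted one by one.

def pods_to_delete(kubectl_output: str) -> list[str]:
    """Return names of pods whose STATUS column is not Running/Completed."""
    lines = kubectl_output.strip().splitlines()
    stale = []
    for line in lines[1:]:  # skip the NAME/READY/STATUS header row
        fields = line.split()
        if len(fields) >= 3 and fields[2] not in ("Running", "Completed"):
            stale.append(fields[0])
    return stale

# illustrative sample output; real pod names will differ
sample = """\
NAME                  READY   STATUS      RESTARTS   AGE
parsl-worker-aaaa     1/1     Running     0          5m
parsl-worker-bbbb     0/1     Pending     0          5m
parsl-worker-cccc     0/1     Error       1          5m
"""

for name in pods_to_delete(sample):
    # each name would then be removed with: kubectl delete pod <name>
    print(name)
```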

@julietcohen
Collaborator

Progress with Kubernetes and parsl

I successfully ran the python script in Matt's repo, using a modified local copy of the requirements.txt to build the env activated in my terminal. To make the version of parsl in my env match the version in the image, I had to pin it in the requirements.txt to the older version 2023.07.10 printed in an error message. If I pinned the newer parsl version 2023.11.27, or did not pin one at all, the versions did not match, because the image was not being rebuilt: it had already been built with 2023.07.10. Based on my understanding of that hurdle, it will be essential moving forward to pin the parsl version in the requirements.txt before building and pointing to the image.
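Concretely, the pin looks like this in the requirements.txt (version taken from the error message described above; any other entries would go alongside it):

```
# requirements.txt (excerpt)
# pin parsl to the exact version baked into the already-built image
parsl==2023.07.10
```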

@mbjones
Member Author

mbjones commented Nov 28, 2023

Great catch, @julietcohen . By the way, one other thing we could have done to fix this is to rebuild the image to use the newer version of parsl (which is probably a good thing). But the important part is that they match, so I agree we should specify the exact version in the requirements file.

@julietcohen
Collaborator

Progress with Kubernetes and parsl on Datateam

  • got permission to publish packages to the repo (was added to the docker group)
  • working in the branch docker-parsl-workflow
  • published very first package to viz-workflow repo, version 0.1
  • in order to pull data files into the viz workflow and write output files, we need to add a persistent volume differently than with the -v option of the docker run command, which is what I used while learning docker and kubernetes on my local computer (example: docker run -v /Users/jcohen/Documents/docker/repositories/docker_python_basics/app:/app image_name). Instead, Matt or Nick will set up a persistent volume on Datateam. I will be able to copy input data files there, and write output files there as well. To point the workflow at that volume, I need to add its name to the parsl config with persistent_volumes=[('name_of_volume', '/var/data/...')]
  • long-term to-do item: to get rid of the warning Running pip as the 'root' user can result in broken permissions and conflicting behavior with the system package manager. It is recommended to use a virtual environment instead, I will need to have the Dockerfile create a virtual env, activate it, and install the requirements.txt into that env
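For the persistent-volume bullet above, the relevant piece is the KubernetesProvider's persistent_volumes argument, a list of (volume name, mount path) tuples. A sketch with placeholder values (the image name, volume name, and mount path here are assumptions, not the real ones):

```python
from parsl.providers import KubernetesProvider

provider = KubernetesProvider(
    image="my-registry/viz-workflow:0.1",  # hypothetical image name
    # mount the cluster's persistent volume into every worker pod;
    # both values below are placeholders, not the real volume/path
    persistent_volumes=[("name_of_volume", "/var/data")],
)
```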

@julietcohen
Collaborator

julietcohen commented Jan 12, 2024

Updates

  • Matt and Nick helpfully set up a persistent volume for Datateam, called pdgrun-dev-0
  • this persistent volume is only needed when using parsl. If I am just running a smaller, simpler docker image on Datateam for testing, I would still create a persistent volume the way I was doing on my local computer: with the -v option in the docker run command, as noted in the previous comment
  • Currently working in my personal repo to test how to make every component work, so that I don't publish too much to the viz-workflow repo before I know what I'm doing
  • First, I successfully built and ran a simple docker image in that personal repo on Datateam. It just uses geopandas to read in a data file, subset it, and save it to a persistent directory that I specified in my home dir
  • Next, I ran a simple version of the viz workflow on a small number of polygons, without docker, parsl, or kubernetes at all, using a fresh env built from just the requirements.txt that will be used for the docker & parsl & kubernetes workflow. This was successful
  • Next, I built a docker image (on Datateam, not pushed to GitHub) for the same simple version of the workflow, still without parsl, then got an error when I ran docker run -v /home/jcohen/docker_python_basics/app-data:/app-data {image_name} that seems to be related to dependencies of geopandas: rtree and libspatialindex
    • the same error is documented here, but the suggested fix, sudo apt install libspatialindex-dev python-rtree, doesn't work in a requirements.txt file
error

```
Traceback (most recent call last):
  File "/home/jcohen/docker_python_basics/./simple_workflow.py", line 10, in <module>
    import pdgstaging
  File "/usr/local/lib/python3.10/site-packages/pdgstaging/__init__.py", line 1, in <module>
    from .ConfigManager import ConfigManager
  File "/usr/local/lib/python3.10/site-packages/pdgstaging/ConfigManager.py", line 5, in <module>
    from .Deduplicator import deduplicate_neighbors, deduplicate_by_footprint
  File "/usr/local/lib/python3.10/site-packages/pdgstaging/Deduplicator.py", line 2, in <module>
    import geopandas as gpd
  File "/usr/local/lib/python3.10/site-packages/geopandas/__init__.py", line 1, in <module>
    from geopandas._config import options
  File "/usr/local/lib/python3.10/site-packages/geopandas/_config.py", line 109, in <module>
    default_value=_default_use_pygeos(),
  File "/usr/local/lib/python3.10/site-packages/geopandas/_config.py", line 95, in _default_use_pygeos
    import geopandas._compat as compat
  File "/usr/local/lib/python3.10/site-packages/geopandas/_compat.py", line 242, in <module>
    import rtree  # noqa: F401
  File "/usr/local/lib/python3.10/site-packages/rtree/__init__.py", line 9, in <module>
    from .index import Rtree, Index  # noqa
  File "/usr/local/lib/python3.10/site-packages/rtree/index.py", line 6, in <module>
    from . import core
  File "/usr/local/lib/python3.10/site-packages/rtree/core.py", line 77, in <module>
    rt.Error_GetLastErrorNum.restype = ctypes.c_int
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 387, in __getattr__
    func = self.__getitem__(name)
  File "/usr/local/lib/python3.10/ctypes/__init__.py", line 392, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: python: undefined symbol: Error_GetLastErrorNum
```

Next step: determine why there is a dependency error inside the docker image but not in the simple workflow outside of it. It probably has something to do with the base image needing another system install that is not needed in my independent python env.
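One way to address both the missing system library and the earlier pip-as-root warning is in the Dockerfile rather than the requirements.txt. A hedged sketch (the venv path and script file name are assumptions, not the actual project layout):

```dockerfile
# Sketch of a Dockerfile that installs rtree's system dependency and
# builds a virtualenv so pip does not run as root against the system python.
FROM python:3.9

# libspatialindex is the C library that the rtree python package links against
RUN apt-get update && apt-get install -y --no-install-recommends \
        libspatialindex-dev \
    && rm -rf /var/lib/apt/lists/*

# create a venv and put it first on PATH so later pip/python calls use it
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# file name is a placeholder for the workflow script
COPY simple_workflow.py .
CMD ["python", "simple_workflow.py"]
```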

@julietcohen
Collaborator

Using the base image FROM python:3.9 rather than 3.10 seemed to solve the dependency error.

@julietcohen
Collaborator

Moving on to the docker & kubernetes & parsl workflow, using a docker image published to the repo (with python:3.9 as the base image), I immediately got the following error output. It seems to be another error related to package versions.

error

```
Traceback (most recent call last):
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/parsl/process_loggers.py", line 27, in wrapped
    r = func(*args, **kwargs)
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/parsl/dataflow/dflow.py", line 1216, in cleanup
    block_ids = executor.scale_in(len(job_ids))
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/parsl/executors/high_throughput/executor.py", line 666, in scale_in
    managers = self.connected_managers()
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/parsl/executors/high_throughput/executor.py", line 540, in connected_managers
    return self.command_client.run("MANAGERS")
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/parsl/executors/high_throughput/zmq_pipes.py", line 58, in run
    self.zmq_socket.send_pyobj(message, copy=True)
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/zmq/sugar/socket.py", line 955, in send_pyobj
    return self.send(msg, flags=flags, **kwargs)
  File "/home/jcohen/anaconda3/envs/testing_python3-9/lib/python3.9/site-packages/zmq/sugar/socket.py", line 696, in send
    return super().send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 742, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 789, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 250, in zmq.backend.cython.socket._send_copy
  File "zmq/backend/cython/checkrc.pxd", line 13, in zmq.backend.cython.checkrc._check_rc
```

@mbjones
Member Author

mbjones commented Jan 13, 2024

One thought: are you using identical versions of python on datateam and in your image? i.e., are both Anaconda, with the same python release and the same environment? I've mentioned to you before that I've struggled with Anaconda in the past, so I tend to avoid it in favor of the Ubuntu-shipped python. YMMV. But they do need to match.

@julietcohen
Collaborator

I am using the same version of python for the terminal environment and in the container. I'm using a conda environment in the terminal built with python==3.9 and the only installs are from the requirements.txt that is used to build the environment in the container. The base image is FROM python:3.9. I am not specifying a more specific release of python than 3.9. I have not yet gotten to resolving issue #30 so the Dockerfile currently builds the environment in the container. It is a good point to also try venv instead of conda.

The error pasted above that seemed to be related to package versions has been resolved. This issue has been broken into sub tasks that are documented as separate issues. The most important to solve first is #36. Other subtasks are #33 and #30.

@julietcohen
Collaborator

Note that @shishichen from Google has successfully run the PDG workflow with Google Kubernetes Engine and parsl. 🎉 More documentation to come, and hopefully this will be scaled up in the coming weeks to run full datasets. See Shishi's fork of the viz-workflow repo here.

Additionally, the QGreenland development team, including @rushirajnenuji and @mfisher87, has been working on running the viz workflow with parsl, kubernetes, and Argo. See this repo for Argo exploration and this repo for parsl exploration.
