Understanding the architecture and questions on user space #118

Open
sneumann opened this issue Jun 22, 2021 · 1 comment

@sneumann

Hi team calrissian,

I am working on a REST service that will execute some simple(!) data / file conversion tools, so not any real workflows. In my first prototype, I manually assembled a command line that performs the conversion; in my second prototype, I had three runners (local, localdocker and kubernetes). For K8S I (also) need a ReadWriteMany shared volume between the REST server (which places input files into said volume) and the K8S jobs.

So after the first prototypes (and before adding more functionality), we'd like to get the architecture right and improve maintainability :-) Hence we are going to 1) use CWL to describe the conversion tools and 2) consider cwl-runner and calrissian as job runners.

Currently a calrissian CWL job is submitted as a K8S job by crafting a K8S job definition like
https://github.com/Duke-GCB/calrissian/blob/master/examples/CalrissianJob-revsort.yaml#L3
using the dukegcb/calrissian:latest image as the master pod and passing arguments to the calrissian Python code, which in turn builds pods to execute the actual CWL steps.
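For context, a minimal sketch of submitting such a master Job with the Kubernetes Python client; the names, namespace, label, and calrissian arguments are illustrative placeholders loosely mirroring the linked YAML example, not a prescribed invocation:

```python
# Hypothetical sketch: submit a calrissian "master" Job with the
# Kubernetes Python client. Image/args/names below are placeholders
# loosely mirroring examples/CalrissianJob-revsort.yaml.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="calrissian",
    image="dukegcb/calrissian:latest",
    command=["calrissian"],
    args=[
        "--max-ram", "16G",
        "--max-cores", "8",
        "--outdir", "/calrissian/output-data",
        "revsort-array.cwl",        # placeholder workflow file
        "revsort-array-job.json",   # placeholder job order file
    ],
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="calrissian-revsort",
        labels={"app": "calrissian-demo"},  # label used below to watch for completion
    ),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[container],
                # volume mounts for the ReadWriteMany PVCs omitted for brevity
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="calrissian-demo", body=job)
```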

The main benefits I get are

  • calrissian takes the hints: DockerRequirement: dockerPull: whateverimage:latest and 1) puts that image into the pod definition and 2) strips the requirement from the CWL run inside that pod, to avoid confusing cwl-runner (see the sketch after this list)
  • it maintains a simple JobResourceQueue
  • There is some convenient usage reporting.
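To illustrate the first point, here is a hypothetical sketch of reading such a hint from a CWL document; the file name and printout are made up, and this is not calrissian's actual code:

```python
# Hypothetical sketch of reading a DockerRequirement hint from a CWL
# document -- not calrissian's actual code. "tool.cwl" is a placeholder.
import yaml

with open("tool.cwl") as f:
    tool = yaml.safe_load(f)

# CWL allows hints either as a list of {"class": ...} maps or as a
# mapping keyed by class name; handle both forms.
hints = tool.get("hints", {})
if isinstance(hints, list):
    docker = next((h for h in hints if h.get("class") == "DockerRequirement"), {})
else:
    docker = hints.get("DockerRequirement", {})

image = docker.get("dockerPull")
print(f"container image for the step pod: {image}")
```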

I was wondering:

  • How do I know that my input job is finished? Do I need to keep the K8S job id of my CalrissianJob-revsort and poll its status? Or did I miss an easier way?
  • Why not use K8S jobs instead of the JobResourceQueue, rather than building another scheduler/queue into calrissian? I found https://de.slideshare.net/DanLeehr/cwl-on-kubernetes-183727221
    => what is missing, and is that still missing today? Is it the maximum memory and max CPU? Are jobs still tenacious?
  • How do I access the usage reports?

Thanks in advance, Yours, Steffen

@johnbradley
Copy link
Contributor

Hi @sneumann. See below for my thoughts on your questions.


How do I know that my input job is finished? Do I need to keep the K8S job id of my CalrissianJob-revsort and poll its status? Or did I miss an easier way?

We attached a label to each job and watched for K8S events on jobs carrying that label. Here is the code we used to watch for job status changes: wait_for_job_events.
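A minimal sketch of that pattern with the Kubernetes Python client; the namespace and label selector are placeholders matching the submission sketch above, not the labels calrissian itself applies:

```python
# Hypothetical sketch: block until a labeled Job succeeds or fails,
# using a watch instead of polling. Namespace/label are placeholders.
from kubernetes import client, config, watch

config.load_kube_config()
batch = client.BatchV1Api()

w = watch.Watch()
for event in w.stream(batch.list_namespaced_job,
                      namespace="calrissian-demo",
                      label_selector="app=calrissian-demo"):
    job = event["object"]  # a V1Job
    if job.status.succeeded:
        print(f"{job.metadata.name} succeeded")
        w.stop()
    elif job.status.failed:
        print(f"{job.metadata.name} failed")
        w.stop()
```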


Why not use K8S jobs instead of the JobResourceQueue, rather than building another scheduler/queue into calrissian? I found https://de.slideshare.net/DanLeehr/cwl-on-kubernetes-183727221
=> what is missing, and is that still missing today? Is it the maximum memory and max CPU? Are jobs still tenacious?

We found that K8S Jobs would retry jobs that had already failed after running for quite some time, wasting resources. For example, if there is a problem with a job's data and the job fails after 3 hours, a K8S Job will retry it some number of times. We did, however, need to retry when the problem was temporary (which we found rather common in K8S).
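For what it's worth, if you do go the plain K8S Job route, the Job spec lets you cap or disable those retries; a sketch with the Python client, with illustrative values:

```python
# Hypothetical sketch: cap (or disable) K8S-level Job retries so a
# long-running failure isn't blindly re-run. Values are illustrative.
from kubernetes import client

spec = client.V1JobSpec(
    backoff_limit=0,  # 0 = do not retry failed pods at the Job level
    # activeDeadlineSeconds bounds total runtime, another guard against
    # wasting resources on a stuck job:
    active_deadline_seconds=4 * 3600,
    template=client.V1PodTemplateSpec(
        spec=client.V1PodSpec(
            restart_policy="Never",  # fail the pod instead of restarting its container in place
            containers=[client.V1Container(name="step", image="whateverimage:latest")],
        )
    ),
)
```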


How do I access the usage reports?

I assume you are referring to the --usage-report command line option. This should write a JSON file in the location you specify once the calrissian process completes.
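Assuming the report lands on the shared volume, the REST service can pick it up with plain JSON parsing; the path below is a placeholder, and the report's exact schema isn't assumed here:

```python
# Hypothetical sketch: load the JSON usage report after the calrissian
# Job finishes. The path is a placeholder on the shared volume.
import json

with open("/calrissian/output-data/usage-report.json") as f:
    report = json.load(f)

print(json.dumps(report, indent=2)[:500])  # peek at the report's structure
```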
