Understanding the architecture and questions on user space #118

Open
sneumann opened this issue Jun 22, 2021 · 1 comment

@sneumann

Hi team calrissian,

I am working on a REST service that will execute some simple(!) data / file conversion tools, so not any real workflows. In my first prototype, I manually assembled a command line that performs the conversion; in my second prototype, I had three runners (local, localdocker and kubernetes). For K8S I (also) need a ReadWriteMany shared volume between the REST server (which places input files into said volume) and the K8S jobs.

So after the first prototypes (and before adding more functionality), we'd like to get the architecture right and improve maintainability :-) Hence we are going to 1) use CWL to describe the conversion tools and 2) consider cwl-runner and calrissian as job runners.

Currently a calrissian CWL job is submitted as a K8S job by crafting a K8S job definition like
https://github.com/Duke-GCB/calrissian/blob/master/examples/CalrissianJob-revsort.yaml#L3
using the dukegcb/calrissian:latest image as the master pod and passing arguments to the calrissian Python code, which in turn builds pods to execute the actual CWL steps.
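For context, a minimal sketch of submitting such a master Job with the Kubernetes Python client; the names, namespace, label, and calrissian arguments are illustrative placeholders loosely mirroring the linked YAML example, not a prescribed invocation:

```python
# Hypothetical sketch: submit a calrissian "master" Job with the
# Kubernetes Python client. Image/args/names below are placeholders
# loosely mirroring examples/CalrissianJob-revsort.yaml.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

container = client.V1Container(
    name="calrissian",
    image="dukegcb/calrissian:latest",
    command=["calrissian"],
    args=[
        "--max-ram", "16G",
        "--max-cores", "8",
        "--outdir", "/calrissian/output-data",
        "revsort-array.cwl",        # placeholder workflow file
        "revsort-array-job.json",   # placeholder job order file
    ],
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(
        name="calrissian-revsort",
        labels={"app": "calrissian-demo"},  # label used below to watch for completion
    ),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[container],
                # volume mounts for the ReadWriteMany PVCs omitted for brevity
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="calrissian-demo", body=job)
```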

The main benefits I get are

  • calrissian takes the hints: DockerRequirement: dockerPull: whateverimage:latest and 1) puts that image into the pod definition and 2) strips the requirement from the CWL run inside that pod, to avoid confusing cwl-runner (see the sketch after this list)
  • it maintains a simple JobResourceQueue
  • There is some convenient usage reporting.
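To illustrate the first point, here is a hypothetical sketch of reading such a hint from a CWL document; the file name and printout are made up, and this is not calrissian's actual code:

```python
# Hypothetical sketch of reading a DockerRequirement hint from a CWL
# document -- not calrissian's actual code. "tool.cwl" is a placeholder.
import yaml

with open("tool.cwl") as f:
    tool = yaml.safe_load(f)

# CWL allows hints either as a list of {"class": ...} maps or as a
# mapping keyed by class name; handle both forms.
hints = tool.get("hints", {})
if isinstance(hints, list):
    docker = next((h for h in hints if h.get("class") == "DockerRequirement"), {})
else:
    docker = hints.get("DockerRequirement", {})

image = docker.get("dockerPull")
print(f"container image for the step pod: {image}")
```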

I was wondering:

  • How do I know that my input job is finished? Do I need to keep the K8S job id of my CalrissianJob-revsort and poll its status? Or did I miss an easier way?
  • Why not use K8S jobs instead of the JobResourceQueue, rather than building another scheduler/queue into calrissian? I found https://de.slideshare.net/DanLeehr/cwl-on-kubernetes-183727221
    => what is missing, and is that still missing today? Is it the maximum memory and max CPU? Are jobs still tenacious?
  • How do I access the usage reports?

Thanks in advance, Yours, Steffen

@johnbradley
Copy link
Contributor

Hi @sneumann. See below for my thoughts on your questions.


How do I know that my input job is finished? Do I need to keep the K8S job id of my CalrissianJob-revsort and poll its status? Or did I miss an easier way?

We attached a label to each job and watched for K8S events on jobs carrying that label. Here is the code we used to watch for job status changes: wait_for_job_events.
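A minimal sketch of that pattern with the Kubernetes Python client; the namespace and label selector are placeholders matching the submission sketch above, not the labels calrissian itself applies:

```python
# Hypothetical sketch: block until a labeled Job succeeds or fails,
# using a watch instead of polling. Namespace/label are placeholders.
from kubernetes import client, config, watch

config.load_kube_config()
batch = client.BatchV1Api()

w = watch.Watch()
for event in w.stream(batch.list_namespaced_job,
                      namespace="calrissian-demo",
                      label_selector="app=calrissian-demo"):
    job = event["object"]  # a V1Job
    if job.status.succeeded:
        print(f"{job.metadata.name} succeeded")
        w.stop()
    elif job.status.failed:
        print(f"{job.metadata.name} failed")
        w.stop()
```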


Why not use K8S jobs instead of the JobResourceQueue, rather than building another scheduler/queue into calrissian? I found https://de.slideshare.net/DanLeehr/cwl-on-kubernetes-183727221
=> what is missing, and is that still missing today? Is it the maximum memory and max CPU? Are jobs still tenacious?

We found that K8S Jobs would retry jobs that had already failed after running for quite some time, wasting resources. For example, if there is a problem with a job's data and the job fails after 3 hours, a K8S Job will retry it some number of times. We did, however, need to retry when the problem was temporary (which we found rather common in K8S).
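For what it's worth, if you do go the plain K8S Job route, the Job spec lets you cap or disable those retries; a sketch with the Python client, with illustrative values:

```python
# Hypothetical sketch: cap (or disable) K8S-level Job retries so a
# long-running failure isn't blindly re-run. Values are illustrative.
from kubernetes import client

spec = client.V1JobSpec(
    backoff_limit=0,  # 0 = do not retry failed pods at the Job level
    # activeDeadlineSeconds bounds total runtime, another guard against
    # wasting resources on a stuck job:
    active_deadline_seconds=4 * 3600,
    template=client.V1PodTemplateSpec(
        spec=client.V1PodSpec(
            restart_policy="Never",  # fail the pod instead of restarting its container in place
            containers=[client.V1Container(name="step", image="whateverimage:latest")],
        )
    ),
)
```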


How do I access the usage reports?

I assume you are referring to the --usage-report command line option. This should write a JSON file in the location you specify once the calrissian process completes.
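Assuming the report lands on the shared volume, the REST service can pick it up with plain JSON parsing; the path below is a placeholder, and the report's exact schema isn't assumed here:

```python
# Hypothetical sketch: load the JSON usage report after the calrissian
# Job finishes. The path is a placeholder on the shared volume.
import json

with open("/calrissian/output-data/usage-report.json") as f:
    report = json.load(f)

print(json.dumps(report, indent=2)[:500])  # peek at the report's structure
```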
