
Containerize custom tasks #5

Closed · adlersantos opened this issue Apr 22, 2021 · 5 comments
Labels: feature request (New feature or request)

adlersantos commented Apr 22, 2021

Note: The following is taken from @tswast's recommendation on a separate thread.

What are you trying to accomplish?

One of the Airflow "gotchas" is that workers share resources with the scheduler, so any "real work" that uses CPU and/or memory can cause slowdowns in the scheduler or even instability if memory is used up.

The recommendation is to do any "real work" in one of:

What challenges are you running into?

In the generated DAG, I see the following operator:

    # Run the custom/csv_transform.py script to process the raw CSV contents into a BigQuery friendly format
    process_raw_csv_file = bash_operator.BashOperator(
        task_id="process_raw_csv_file",
        bash_command="SOURCE_CSV=$airflow_home/data/$dataset/$pipeline/{{ ds }}/raw-data.csv TARGET_CSV=$airflow_home/data/$dataset/$pipeline/{{ ds }}/data.csv python $airflow_home/dags/$dataset/$pipeline/custom/csv_transform.py\n",
        env={'airflow_home': '{{ var.json.shared.airflow_home }}', 'dataset': 'covid19_tracking', 'pipeline': 'city_level_cases_and_deaths'},
    )

I haven't looked closely at the csv_transform.py script yet, but I'd expect it to use non-trivial CPU / memory resources.
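For illustration, a script of that shape might look like the following. This is only a sketch, not the repository's actual csv_transform.py: the one transformation shown (snake_casing the header row) is an assumption; the only thing taken from the operator above is the SOURCE_CSV / TARGET_CSV environment-variable contract.

```python
import csv
import os


def transform_csv(source_path: str, target_path: str) -> None:
    """Rewrite a raw CSV into a BigQuery-friendly form.

    Illustrative only: here we just snake_case the header row, a common
    normalization step; the real csv_transform.py may do much more.
    """
    with open(source_path, newline="") as source:
        rows = list(csv.reader(source))
    header, data = rows[0], rows[1:]
    normalized = [col.strip().lower().replace(" ", "_") for col in header]
    with open(target_path, "w", newline="") as target:
        writer = csv.writer(target)
        writer.writerow(normalized)
        writer.writerows(data)


# The generated BashOperator passes both paths via environment variables.
if "SOURCE_CSV" in os.environ and "TARGET_CSV" in os.environ:
    transform_csv(os.environ["SOURCE_CSV"], os.environ["TARGET_CSV"])
```

Whatever the real script does, reading and rewriting a whole file like this runs entirely on the Airflow worker, which is exactly the resource concern described above.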

For custom Python scripts such as this, I'd expect us to use the KubernetesPodOperator, where the work is scheduled on a separate node pool.
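As a sketch of what that could look like, here is the same task as a KubernetesPodOperator, assuming the transform script is baked into a container image and a dedicated node pool exists. This is a DAG configuration fragment, not working project code: the image name, namespace, paths, and node-pool label are placeholders that would depend on the actual setup.

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

process_raw_csv_file = KubernetesPodOperator(
    task_id="process_raw_csv_file",
    name="process-raw-csv-file",
    namespace="default",
    image="gcr.io/PROJECT_ID/csv-transform:latest",  # placeholder image
    env_vars={
        "SOURCE_CSV": "/path/to/raw-data.csv",  # placeholder path
        "TARGET_CSV": "/path/to/data.csv",      # placeholder path
    },
    # Pin the pod to a separate node pool so the work doesn't compete
    # with the scheduler for CPU/memory.
    affinity={
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "cloud.google.com/gke-nodepool",
                                "operator": "In",
                                "values": ["heavy-work-pool"],  # placeholder
                            }
                        ]
                    }
                ]
            }
        }
    },
)
```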

Checklist

  • I created this issue in accordance with the Code of Conduct.
  • This issue is appropriately labeled.
adlersantos added the "feature request" label on Apr 22, 2021
leahecole commented Apr 26, 2021

Hey @adlersantos! When the time comes we may also want to look at the GKEPodOperator which is an extension of the KubernetesPodOperator that can create new clusters and then launch pods into the specified cluster.

adlersantos commented:

Hey @leahecole. Thanks for the suggestion! We definitely need a way for pipelines to use their own GKE clusters in various contexts. I'll look into it as I add support for KubernetesPodOperator.

leahecole commented:

Happy to work with you on this. I started going down this road with my interns last summer, and I can look back on their notes and share them whenever you're ready.

adlersantos commented May 11, 2021

@leahecole I'm prioritizing this since we're starting to receive onboarding requests with "heavier" workloads. It'd be nice to chat with you about it and look through your interns' notes as well. See ya!

adlersantos commented:

Closing this. We now support operators that can create and delete GKE clusters, plus start GKE pods in those clusters.
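For reference, the general shape of that create-cluster / run-pod / tear-down pattern with the Google provider operators looks roughly like the fragment below. It's a configuration sketch under stated assumptions: the project ID, zone, cluster name, and image are placeholders, not the values this repository actually uses.

```python
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKECreateClusterOperator,
    GKEDeleteClusterOperator,
    GKEStartPodOperator,
)

create_cluster = GKECreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",   # placeholder
    location="us-central1-a",  # placeholder
    body={"name": "pipeline-cluster", "initial_node_count": 1},
)

transform = GKEStartPodOperator(
    task_id="process_raw_csv_file",
    project_id="my-project",
    location="us-central1-a",
    cluster_name="pipeline-cluster",
    name="process-raw-csv-file",
    namespace="default",
    image="gcr.io/my-project/csv-transform:latest",  # placeholder
)

delete_cluster = GKEDeleteClusterOperator(
    task_id="delete_cluster",
    project_id="my-project",
    location="us-central1-a",
    name="pipeline-cluster",
    trigger_rule="all_done",  # tear down even if the transform fails
)

create_cluster >> transform >> delete_cluster
```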
