
Containerize custom tasks #5

Closed · adlersantos opened this issue Apr 22, 2021 · 5 comments
Labels: feature request (New feature or request)

adlersantos commented Apr 22, 2021

Note: The following is taken from @tswast's recommendation on a separate thread.

What are you trying to accomplish?

One of the Airflow "gotchas" is that workers share resources with the scheduler, so any "real work" that uses CPU and/or memory can cause slowdowns in the scheduler or even instability if memory is used up.

The recommendation is to do any "real work" in one of:

What challenges are you running into?

In the generated DAG, I see the following operator:

    # Run the custom/csv_transform.py script to process the raw CSV contents into a BigQuery friendly format
    process_raw_csv_file = bash_operator.BashOperator(
        task_id="process_raw_csv_file",
        bash_command="SOURCE_CSV=$airflow_home/data/$dataset/$pipeline/{{ ds }}/raw-data.csv TARGET_CSV=$airflow_home/data/$dataset/$pipeline/{{ ds }}/data.csv python $airflow_home/dags/$dataset/$pipeline/custom/csv_transform.py\n",
        env={'airflow_home': '{{ var.json.shared.airflow_home }}', 'dataset': 'covid19_tracking', 'pipeline': 'city_level_cases_and_deaths'},
    )

I haven't looked closely at the csv_transform.py script yet, but I'd expect it to use non-trivial CPU / memory resources.
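For illustration, a script of that shape might look like the following. This is only a sketch, not the repository's actual csv_transform.py: the one transformation shown (snake_casing the header row) is an assumption; the only thing taken from the operator above is the SOURCE_CSV / TARGET_CSV environment-variable contract.

```python
import csv
import os


def transform_csv(source_path: str, target_path: str) -> None:
    """Rewrite a raw CSV into a BigQuery-friendly form.

    Illustrative only: here we just snake_case the header row, a common
    normalization step; the real csv_transform.py may do much more.
    """
    with open(source_path, newline="") as source:
        rows = list(csv.reader(source))
    header, data = rows[0], rows[1:]
    normalized = [col.strip().lower().replace(" ", "_") for col in header]
    with open(target_path, "w", newline="") as target:
        writer = csv.writer(target)
        writer.writerow(normalized)
        writer.writerows(data)


# The generated BashOperator passes both paths via environment variables.
if "SOURCE_CSV" in os.environ and "TARGET_CSV" in os.environ:
    transform_csv(os.environ["SOURCE_CSV"], os.environ["TARGET_CSV"])
```

Whatever the real script does, reading and rewriting a whole file like this runs entirely on the Airflow worker, which is exactly the resource concern described above.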

For custom Python scripts such as this, I'd expect us to use the KubernetesPodOperator, where the work is scheduled on a separate node pool.
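As a sketch of what that could look like, here is the same task as a KubernetesPodOperator, assuming the transform script is baked into a container image and a dedicated node pool exists. This is a DAG configuration fragment, not working project code: the image name, namespace, paths, and node-pool label are placeholders that would depend on the actual setup.

```python
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

process_raw_csv_file = KubernetesPodOperator(
    task_id="process_raw_csv_file",
    name="process-raw-csv-file",
    namespace="default",
    image="gcr.io/PROJECT_ID/csv-transform:latest",  # placeholder image
    env_vars={
        "SOURCE_CSV": "/path/to/raw-data.csv",  # placeholder path
        "TARGET_CSV": "/path/to/data.csv",      # placeholder path
    },
    # Pin the pod to a separate node pool so the work doesn't compete
    # with the scheduler for CPU/memory.
    affinity={
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "cloud.google.com/gke-nodepool",
                                "operator": "In",
                                "values": ["heavy-work-pool"],  # placeholder
                            }
                        ]
                    }
                ]
            }
        }
    },
)
```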

Checklist

  • I created this issue in accordance with the Code of Conduct.
  • This issue is appropriately labeled.
adlersantos added the "feature request" label on Apr 22, 2021
leahecole commented Apr 26, 2021

Hey @adlersantos! When the time comes we may also want to look at the GKEPodOperator which is an extension of the KubernetesPodOperator that can create new clusters and then launch pods into the specified cluster.

adlersantos commented:

Hey @leahecole. Thanks for the suggestion! We definitely need a way for pipelines to use their own GKE clusters in various contexts. I'll look into it as I add support for KubernetesPodOperator.

leahecole commented:

Happy to work with you on this. I started going down this road with my interns last summer, and I can look back on their notes and share them whenever you're ready.

adlersantos commented May 11, 2021

@leahecole I'm prioritizing this since we're starting to receive onboarding requests with "heavier" workloads. It'd be nice to chat with you about it and look through your interns' notes as well. See ya!

adlersantos commented:

Closing this. We now support operators that can create and delete GKE clusters, plus start GKE pods in those clusters.
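For reference, the general shape of that create-cluster / run-pod / tear-down pattern with the Google provider operators looks roughly like the fragment below. It's a configuration sketch under stated assumptions: the project ID, zone, cluster name, and image are placeholders, not the values this repository actually uses.

```python
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKECreateClusterOperator,
    GKEDeleteClusterOperator,
    GKEStartPodOperator,
)

create_cluster = GKECreateClusterOperator(
    task_id="create_cluster",
    project_id="my-project",   # placeholder
    location="us-central1-a",  # placeholder
    body={"name": "pipeline-cluster", "initial_node_count": 1},
)

transform = GKEStartPodOperator(
    task_id="process_raw_csv_file",
    project_id="my-project",
    location="us-central1-a",
    cluster_name="pipeline-cluster",
    name="process-raw-csv-file",
    namespace="default",
    image="gcr.io/my-project/csv-transform:latest",  # placeholder
)

delete_cluster = GKEDeleteClusterOperator(
    task_id="delete_cluster",
    project_id="my-project",
    location="us-central1-a",
    name="pipeline-cluster",
    trigger_rule="all_done",  # tear down even if the transform fails
)

create_cluster >> transform >> delete_cluster
```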
