
feat: Support building and pushing container images shared within a dataset folder #27

Merged
merged 13 commits into from
May 18, 2021

Conversation

adlersantos
Member

@adlersantos adlersantos commented May 17, 2021

Description

Based on a discussion in #5.

We need to support operators such as KubernetesPodOperator and other GKE operators that allow tasks to run in specific clusters, isolating resource usage from other data pipelines.

As a prerequisite for such tasks, Docker images need to be defined, built, and pushed to GCR. This PR adds a workflow for building and pushing images.

  1. Create an _images folder under your dataset folder if it doesn't exist.

  2. Inside the _images folder, create another folder and name it after what the image is expected to do, e.g. process_shapefiles, read_cdf_metadata.

  3. In that subfolder, create a Dockerfile and any scripts you need to process the data. Use the COPY command in your Dockerfile to include your scripts in the image.
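As an illustration, a minimal Dockerfile for one of these subfolders might look like the following (the base image and script name are assumptions for the sketch, not requirements of this PR):

```dockerfile
# Hypothetical example: a small Python base image
FROM python:3.8-slim

# Include the processing script in the image, as described in step 3
COPY script.py .

# Run the script when the container starts
CMD ["python", "script.py"]
```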

The resulting file tree for a dataset that uses two container images may look like this:

datasets
└── sample_patterns
    ├── _images
    │   ├── get_google_cloud_urls
    │   │   ├── Dockerfile
    │   │   └── script.py
    │   └── ping_google
    │       ├── Dockerfile
    │       └── script.py
    ├── _terraform/**/*
    ├── containerized_pipeline/**/*
    └── dataset.yaml

Now, running scripts/generate_dag.py will build and push the containers automatically.

Here's a screenshot of a successful run on the Airflow webserver (Cloud Composer):


Checklist

  • Tests pass.
  • Linters pass.
  • Please merge this PR for me once it is approved.
  • If this PR adds/edits/deletes a feature, I have updated the README accordingly.

@adlersantos adlersantos added revision: readme Improvements or additions to the README feature request New feature or request labels May 17, 2021
@adlersantos adlersantos changed the title feat: Supports building and pushing container images feat: Support building and pushing container images per dataset May 17, 2021
@adlersantos adlersantos changed the title feat: Support building and pushing container images per dataset feat: Support building and pushing container images shared within a dataset folder May 17, 2021
Contributor

@leahecole leahecole left a comment


Some questions

  • How will we expect users to represent the requirements for the scripts in their images (a requirements file? a Pipfile?)? Are we going to be prescriptive, or is that out of scope? Either way, we may want to mention it and show an example of our preferred approach, because their Dockerfile will need to copy in those requirements and run the installation as well.
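One way to address this question (purely a sketch; the project had not prescribed a convention at this point) is a requirements.txt per image subfolder, copied and installed in that image's Dockerfile:

```dockerfile
FROM python:3.8-slim

# Hypothetical convention: each _images subfolder ships its own requirements.txt.
# Copy and install dependencies before the script so this layer is cached.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY script.py .
CMD ["python", "script.py"]
```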

README.md (review comments, resolved)
scripts/generate_dag.py (review comments, resolved)
@adlersantos adlersantos merged commit de9d1b9 into main May 18, 2021
@adlersantos adlersantos deleted the kpod-operator branch May 18, 2021 22:02