Airflow example running Snowflake queries and publishing OpenLineage events. #2

Closed
wants to merge 9 commits

Conversation

@fm100 commented Apr 13, 2022

Description

This Airflow example uses SnowflakeOperator to run queries and publishes OpenLineage events, which are auto-generated in the OPEN_LINEAGE_ACCESS_HISTORY view, to the configured OpenLineage backend.
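
For orientation, one of the query DAGs boils down to something like the sketch below; the dag_id, target table, SQL, and connection id are illustrative assumptions rather than the exact code in this PR:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="etl_categories",          # illustrative name
    start_date=datetime(2022, 4, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_categories = SnowflakeOperator(
        task_id="load_categories",
        snowflake_conn_id="snowflake_default",   # assumed connection id
        sql="""
            INSERT INTO food_delivery.categories
            SELECT * FROM food_delivery.tmp_categories
        """,
        # Tagging the query lets the extraction DAG correlate the resulting
        # ACCESS_HISTORY records with this task when building OpenLineage events.
        session_parameters={"QUERY_TAG": "etl_categories"},
    )
```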


Signed-off-by: Minkyu Park <minkyu@datakin.com>
@sfc-gh-kbijon left a comment

looks good @fm100. I left some comments. Please see if those make sense.
One other thing: instead of a single directory "dags", should we create two separate directories?
a) dag_queries - contains all the files that generate queries for the DAG
b) dag_extract - contains etl_openlineage.py
WDYT?

that sends generated OpenLineage events to the configured backend.

## Prerequisite
* `OPEN_LINEAGE` database and `FOOD_DELIVERY` schema in Snowflake need to be created to run this example.

Should we provide a setup.py to create this database and schema in the account, and document that in the prerequisite?
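
For illustration, such a setup.py could be as small as the following sketch using the Snowflake Python connector; the IF NOT EXISTS guards and overall structure are assumptions, not code from this PR:

```python
import os

import snowflake.connector

# Create the database and schema named in the prerequisite section.
conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
)
try:
    cur = conn.cursor()
    cur.execute("CREATE DATABASE IF NOT EXISTS OPEN_LINEAGE")
    cur.execute("CREATE SCHEMA IF NOT EXISTS OPEN_LINEAGE.FOOD_DELIVERY")
finally:
    conn.close()
```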


Nit: we can probably rename the database to OPEN_LINEAGE_EXAMPLE?


I would actually prefer OPENLINEAGE_EXAMPLE, as OpenLineage is one word.


Instead of a script, possibly a code block here with the statements:

CREATE DATABASE ...
...

that way it is easier for the user to follow along. Ross, I think you'll have more feedback when you try it out.


@julienledem Because it's important to keep current_organization as a session variable, I chose to do this as a SnowSQL example. Let me know what you think.

* `OPEN_LINEAGE` database and `FOOD_DELIVERY` schema in Snowflake need to be created to run this example.

## Environment Variables
Following environment variables need to be set in order to send the OpenLineage events to the OL backend:

what does OL stand for?


I think we should avoid shortening OpenLineage to OL in these docs, just in case.


agreed. Let's spell out OpenLineage

@@ -0,0 +1,54 @@
from airflow import DAG

I wonder if we should briefly explain what this script does. We can probably also document these scripts in the readme. WDYT?


agreed, we could add some doc here


I added a bit of content to the README that explains what these scripts do 👍

* SNOWFLAKE_USER
* SNOWFLAKE_PASSWORD
* SNOWFLAKE_ACCOUNT
* OPENLINEAGE_URL
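
As an aside on these variables, a fail-fast check along the lines of the sketch below could verify them before anything connects; this is purely illustrative and not code from the PR:

```python
import os

# Variable names taken from the README excerpt above.
REQUIRED = ["SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD", "SNOWFLAKE_ACCOUNT", "OPENLINEAGE_URL"]
missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
```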

I wonder if we should also document the following:

  1. what are the things the user needs to install in order to run this example?
  2. The steps to execute:
    a) First run all the query generator scripts
    b) Wait for 30 min
    c) Run etl_openlineage.py


Agreed. We could point to documentation about:

  1. installing Marquez
  2. getting a Datakin instance.


I've added some of this to the README, trying to keep it minimal. Let me know what you think.


looks good. ty!

@julienledem left a comment

The screenshot looks nice!


FROM food_delivery.tmp_categories
''',
session_parameters={
'QUERY_TAG': 'etl_categories'

I would add a comment explaining why we add this

SNOWFLAKE_PASSWORD = os.getenv('SNOWFLAKE_PASSWORD')
SNOWFLAKE_ACCOUNT = os.getenv('SNOWFLAKE_ACCOUNT')



We should add some doc explaining that this is the logic forwarding the OpenLineage events from Snowflake to an OpenLineage endpoint
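
For context, that forwarding logic amounts to something like the rough sketch below; the view name, column name, and the /api/v1/lineage endpoint path are assumptions for illustration, not the exact contents of etl_openlineage.py:

```python
import json
import os

import requests
import snowflake.connector

conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
)
cur = conn.cursor()
# Each row of the view is assumed to carry one auto-generated OpenLineage event.
cur.execute("SELECT EVENT FROM OPEN_LINEAGE_ACCESS_HISTORY")
for (event,) in cur:
    # Forward the event to the configured OpenLineage backend.
    requests.post(
        f"{os.environ['OPENLINEAGE_URL']}/api/v1/lineage",
        json=json.loads(event),
        timeout=10,
    )
conn.close()
```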

Ross Turk added 5 commits April 18, 2022 23:05
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
@rossturk

> looks good @fm100. I left some comments. Please see if those make sense.
> One other thing: instead of a single directory "dags", should we create two separate directories?
> a) dag_queries - contains all the files that generate queries for the DAG
> b) dag_extract - contains etl_openlineage.py
> WDYT?

I agree - I chose to name them etl and lineage. My rationale is that they are already in a directory called dags, and the word extract applies to both metadata and data. Let me know if you like these names.

@sfc-gh-kbijon left a comment

@rossturk The README looks good. I left some comments. Please take a look. thank you!


Once the environment is prepared, initialize Airflow with docker-compose:
```bash
% docker-compose up airflow-init

Instead of attaching docker-compose.yaml, I wonder if we should document the steps similar to "Installing Marquez", i.e., git clone https://github.com/anilkulkarni87/airflow-docker and document the steps. That way, the scripts in the repo are purely the examples and not related to any prerequisite files, etc. WDYT?


That's actually how I started, but it requires a lot of changes to docker-compose.yaml to make this example work smoothly:

  • a custom image build with extra Python requirements
  • passing the environment through
  • skipping the loading of example DAGs

I opted to keep the procedure simple.


Hmm, that file has a license and I actually didn't talk to our Legal team about that. Is it very complex to document what to do with docker-compose.yaml instead of directly posting the file here?


Yeah, it's pretty complex and would be difficult to describe. This file is Apache 2 licensed, which is permissive and non-viral.


@sfc-gh-kbijon I have updated the comment block at the top of the file to clarify its intended use, warn that it's not for production, and point folks to the original.


left some comments in the file. please address.

examples/airflow/README.md (resolved)
Ross Turk added 2 commits April 19, 2022 19:20
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
@sfc-gh-kbijon left a comment

thanks @rossturk. It looks good. I left one comment. Please address that. ty!


examples/airflow/docker-compose.yaml (resolved)
Signed-off-by: Ross Turk <ross@datakin.com>
@sfc-gh-kbijon left a comment

@rossturk @fm100 the changes look good. thanks a lot!


This example uses Airflow to run a collection of Snowflake queries for a fictitious food delivery service. Lineage data for these queries is recorded within Snowflake [ACCESS_HISTORY](https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html) and, using the OpenLineage Access History View, emitted to an OpenLineage backend.

This is done using a series of DAGs in `dags/queries` that each use SnowflakeOperator to run queries, along with a DAG in `dags/extract` that uses PythonOperator to send generated OpenLineage events to the configured backend.

@rossturk this needs to be updated to dags/etl and dags/lineage

@rossturk commented Apr 21, 2022

@sfc-gh-kbijon very odd, I have made this change and committed it to the branch from @fm100 but it does not show up here. I think we should close this PR and open a new one, to see if it includes all of the changes.
