Airflow example running Snowflake queries and publishing OpenLineage events. #2
Conversation
…nts. Signed-off-by: Minkyu Park <minkyu@datakin.com>
Looks good @fm100. I left some comments; please see if those make sense.
One other thing:
Instead of a single directory "dags", should we create two separate directories?
a) dag_queries - contains all the files that generate queries for the DAG
b) dag_extract - contains etl_openlineage.py
WDYT?
examples/airflow/README.md
Outdated
that sends generated OpenLineage events to the configured backend.

## Prerequisite
* `OPEN_LINEAGE` database and `FOOD_DELIVERY` schema in Snowflake need to be created to run this example.
Should we provide a setup.py to create this database and schema in the account and document that in the prerequisite?
Nit: we can probably rename the database OPEN_LINEAGE_EXAMPLE?
I would actually prefer OPENLINEAGE_EXAMPLE, as OpenLineage is one word.
Instead of a script, possibly a code block here with the statements:
CREATE DATABASE ...
...
that way it is easier for the user to follow along. Ross, I think you'll have more feedback when you try it out.
@julienledem Because it's important to keep current_organization as a session variable, I chose to do this as a SnowSQL example. Let me know what you think.
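For reference, a minimal sketch of what such a setup script could look like, assuming `snowflake-connector-python` and the environment variables listed later in the README; the README itself uses a SnowSQL snippet so the session is preserved, and the names below simply follow the prerequisite under discussion:

```python
import os
import snowflake.connector

# Connect using the same environment variables the example relies on elsewhere.
conn = snowflake.connector.connect(
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
    account=os.getenv("SNOWFLAKE_ACCOUNT"),
)

# The statements suggested above; both run in a single session, so any session
# variables set alongside them (e.g. current_organization) stay in effect.
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS OPEN_LINEAGE")
cur.execute("CREATE SCHEMA IF NOT EXISTS OPEN_LINEAGE.FOOD_DELIVERY")
conn.close()
```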
examples/airflow/README.md
Outdated
* `OPEN_LINEAGE` database and `FOOD_DELIVERY` schema in Snowflake need to be created to run this example.

## Environment Variables
Following environment variables need to be set in order to send the OpenLineage events to the OL backend:
what does OL stand for?
I think we should avoid shortening OpenLineage to OL in these docs, just in case.
agreed. Let's spell out OpenLineage
@@ -0,0 +1,54 @@
from airflow import DAG
I wonder if we should briefly explain what this script does. We can probably also document these scripts in the readme. WDYT?
agreed, we could add some doc here
I added a bit of content to the README that explains what these scripts do 👍
* SNOWFLAKE_USER
* SNOWFLAKE_PASSWORD
* SNOWFLAKE_ACCOUNT
* OPENLINEAGE_URL
I wonder if we should also document the following:
- what the user needs to install in order to run this example
- the steps to execute (see the sketch below):
  a) First run all the query generator scripts
  b) Wait for 30 min
  c) Run etl_openlineage.py
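For illustration only, here is one way those steps could be scripted against Airflow's stable REST API. The DAG ids, credentials, and URL are assumptions based on the defaults of the docker-compose setup, not code from this PR:

```python
import time
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"  # assumed local docker-compose webserver
AUTH = ("airflow", "airflow")                 # default credentials in the compose quickstart

# Hypothetical DAG ids for the query generator DAGs; the real ones live in the queries directory.
for dag_id in ["etl_categories", "etl_orders"]:
    resp = requests.post(f"{AIRFLOW_API}/dags/{dag_id}/dagRuns", json={"conf": {}}, auth=AUTH)
    resp.raise_for_status()

# ACCESS_HISTORY is populated with some latency, hence the wait before extracting.
time.sleep(30 * 60)

# Finally trigger the extraction DAG that forwards events to the OpenLineage backend.
resp = requests.post(f"{AIRFLOW_API}/dags/etl_openlineage/dagRuns", json={"conf": {}}, auth=AUTH)
resp.raise_for_status()
```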
Agreed. We could point to documentation about:
- installing Marquez
- getting a Datakin instance.
I've added some of this to the README, trying to keep it minimal. Let me know what you think.
looks good. ty!
The screenshot looks nice!
FROM food_delivery.tmp_categories
''',
session_parameters={
    'QUERY_TAG': 'etl_categories'
I would add a comment explaining why we add this
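For instance, a hedged sketch of what that comment might explain, mirroring the pattern in the diff above; the DAG id, connection id, and SQL are illustrative rather than the exact code in this PR:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("etl_categories", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    load_categories = SnowflakeOperator(
        task_id="load_categories",
        snowflake_conn_id="snowflake_default",
        sql="INSERT INTO food_delivery.categories SELECT * FROM food_delivery.tmp_categories",
        # QUERY_TAG labels this query in Snowflake's ACCESS_HISTORY/QUERY_HISTORY views,
        # which is presumably how the extraction DAG maps each access record back to the
        # Airflow task that issued it when building OpenLineage events.
        session_parameters={"QUERY_TAG": "etl_categories"},
    )
```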
SNOWFLAKE_PASSWORD = os.getenv('SNOWFLAKE_PASSWORD')
SNOWFLAKE_ACCOUNT = os.getenv('SNOWFLAKE_ACCOUNT')
We should add some doc explaining that this is the logic forwarding the OpenLineage events from Snowflake to an OpenLineage endpoint
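Roughly, that forwarding logic amounts to something like the sketch below. The view name comes from the PR description, while the column name, schema, and endpoint path are assumptions (the standard OpenLineage HTTP transport posts events to /api/v1/lineage):

```python
import json
import os

import requests
import snowflake.connector

OPENLINEAGE_URL = os.getenv("OPENLINEAGE_URL")

conn = snowflake.connector.connect(
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
    account=os.getenv("SNOWFLAKE_ACCOUNT"),
)

# Each row of the view is assumed to contain one complete OpenLineage event as JSON.
cursor = conn.cursor()
cursor.execute("SELECT EVENT FROM OPEN_LINEAGE.PUBLIC.OPEN_LINEAGE_ACCESS_HISTORY")
for (event,) in cursor:
    # Forward the event to the configured OpenLineage backend (Marquez, Datakin, ...).
    requests.post(
        f"{OPENLINEAGE_URL}/api/v1/lineage",
        data=event if isinstance(event, str) else json.dumps(event),
        headers={"Content-Type": "application/json"},
    ).raise_for_status()
```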
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
I agree - I chose to name them |
@rossturk The README looks good. I left some comments. Please take a look. thank you!
Once the environment is prepared, initialize Airflow with docker-compose:
```bash
% docker-compose up airflow-init
```
Instead of attaching docker-compose.yaml, I wonder if we should document the steps similar to "Installing Marquez", i.e., git clone https://github.com/anilkulkarni87/airflow-docker and document the steps. That way, the scripts in the repo are purely the examples and not related to any prerequisite files, etc. WDYT?
That's actually how I started, but it requires a lot of changes to docker-compose.yaml to make this example work smoothly:
- a custom image build with extra Python requirements
- passing through environment variables
- skipping the loading of example DAGs

I opted to keep the procedure simple.
Hmm, that file has a license and I actually didn't talk to our Legal team about that. Is it very complex to document what to do with docker-compose.yaml instead of directly posting the file here?
Yeah, it's pretty complex and would be difficult to describe. This file is Apache 2 licensed, which is permissive and non-viral.
@sfc-gh-kbijon I have updated the comment block at the top of the file to clarify its intended use, warn that it's not for production, and point folks to the original
left some comments in the file. please address.
Signed-off-by: Ross Turk <ross@datakin.com>
Signed-off-by: Ross Turk <ross@datakin.com>
thanks @rossturk. It looks good. I left one comment. Please address that. ty!
Signed-off-by: Ross Turk <ross@datakin.com>
This example uses Airflow to run a collection of Snowflake queries for a fictitious food delivery service. Lineage data for these queries is recorded within Snowflake [ACCESS_HISTORY](https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html) and, using the OpenLineage Access History View, emitted to an OpenLineage backend.

This is done using a series of DAGs in `dags/queries` that each use SnowflakeOperator to run queries, along with a DAG in `dags/extract` that uses PythonOperator to send generated OpenLineage events to the configured backend.
@rossturk this needs to be updated to dags/elt and dags/lineage
@sfc-gh-kbijon very odd, I have made this change and committed it to the branch from @fm100 but it does not show up here. I think we should close this PR and open a new one, to see if it includes all of the changes.
Description
This Airflow example uses SnowflakeOperator to run queries and publishes the OpenLineage events that are auto-generated in the OPEN_LINEAGE_ACCESS_HISTORY view to the configured OpenLineage backend.