Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement --export-slice-airflow #320

Merged
merged 16 commits into from
Oct 25, 2021
Merged

Implement --export-slice-airflow #320

merged 16 commits into from
Oct 25, 2021

Conversation

marov
Copy link
Contributor

@marov marov commented Oct 18, 2021

Closes #275

  • Adds a new CLI option --export-slice-airflow (or simply --airflow
  • Exports to a working DAG sliced_housing_dag.py
  • Currently simply prints out p from lineapy tests/housing.py --slice "p value" --airflow sliced_housing_dag
  • Install airflow
  • Tests and doc

@@ -156,6 +161,30 @@ def sliced_func(self, slice_name: str, func_name: str) -> str:
full_code = format_str(full_code, mode=black_mode)
return full_code

def sliced_aiflow_dag(self, slice_name: str, func_name: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving the airflow specific code to a different folder, like linea/airflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created plugins/airflow.py, wdyt?

@marov marov requested a review from yifanwu October 19, 2021 05:32
@marov
Copy link
Contributor Author

marov commented Oct 19, 2021

@saulshanabrook @yifanwu This produces a valid DAG, but I still need to add tests, hence Draft.
Will be adding tests tomorrow evening

@yifanwu
Copy link
Contributor

yifanwu commented Oct 19, 2021

Nit: please add the example to README: lineapy --slice 'p value' --export-slice sliced_housing tests/housing.py

Please also add instructions in the readme about how we can run the airflow jobs?

@yifanwu
Copy link
Contributor

yifanwu commented Oct 19, 2021

Also for future reference, please have, in your PR description, a brief summary of your changes?

@marov marov marked this pull request as ready for review October 20, 2021 06:46
+ AIRFLOW_IMPORTS_TEMPLATE
+ "\n\n"
+ code_block
+ f"\n\tprint({artifact_var})" # TODO What to do with artifact_var in a DAG?
Copy link
Contributor

@yifanwu yifanwu Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to discuss and design this in more detail before we can commit this PR. @marov you are better equipped than us to help design what to do with this artifact in a deployed airflow process. Let's look at some reference cases together on our next meeting (Friday).

@@ -84,6 +84,19 @@ jupyter nbconvert --to notebook --execute tests/test_notebook.ipynb --inplace --
Or you can open it in a notebook UI (JupyterLab, JupyterNotebook, VS Code, etc.)
and re-run it manually

### Airflow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be in the README.md, but since it's not ready for "release", maybe we can create another doc called EXPERIMENTAL.md or something and move this there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under impression that CONTRIBUTING.md serves this function.

```
lineapy tests/housing.py --slice "p value" --airflow sliced_housing_dag
```
This creates a `sliced_housing_dag.py` file in the current dir. It can be executed with:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this into a separate folder (when generating)? Cleaner organization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, how would you call it? output or export?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or something like airflow_jobs/airflow_dags?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this we are saving to the same dir where the original file is, which is tests for the housing.py. This seems to be OK for me.
I can change to any other place of course :)


dag = DAG(
dag_id="DAG_NAME_dag",
schedule_interval="*/15 * * * *", # Every 15 minutes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out how to parametrize this for API design. I'm not sure about the workflow where the user goes in and manually modifies things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we need to think about the entire UX. We can't just keep adding options to CLI, right?
I think a config filler a set of configs are in order. I'd vote for YAML, but open to anything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes let's talk about it this afternoon.

default_dag_args = {"owner": "airflow", "retries": 2, "start_date": days_ago(1)}

dag = DAG(
dag_id="DAG_NAME_dag",
Copy link
Contributor

@yifanwu yifanwu Oct 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we have two artifacts? Will this logic break?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we can only slice one artifact through --slice, it produces code which we save as this DAG. If there are other artifacts in DB it won't be a problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I'm not following, do the dag_ids not need to be unique?

@saulshanabrook
Copy link
Contributor

I was trying to debug the mypy error. I re-run an earlier commit here on CI and found this error in install:

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
The conflict is caused by:
    lineapy[dev] 0.0.1 depends on click>=8.0.0
    black 18.3a0 depends on click
    lineapy 0.0.1 depends on click>=8.0.0
    flask 1.1.4 depends on click<8.0 and >=5.1
    lineapy[dev] 0.0.1 depends on click>=8.0.0
    black 18.3a0 depends on click
    lineapy 0.0.1 depends on click>=8.0.0
    flask 1.1.3 depends on click<8.0 and >=5.1

I wonder why flask is being pulled in?

@marov
Copy link
Contributor Author

marov commented Oct 21, 2021

I was trying to debug the mypy error. I re-run an earlier commit here on CI and found this error in install:

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
The conflict is caused by:
    lineapy[dev] 0.0.1 depends on click>=8.0.0
    black 18.3a0 depends on click
    lineapy 0.0.1 depends on click>=8.0.0
    flask 1.1.4 depends on click<8.0 and >=5.1
    lineapy[dev] 0.0.1 depends on click>=8.0.0
    black 18.3a0 depends on click
    lineapy 0.0.1 depends on click>=8.0.0
    flask 1.1.3 depends on click<8.0 and >=5.1

I wonder why flask is being pulled in?

I think Airflow uses flask for it's UI, see here https://github.com/apache/airflow/blob/f5ad26dcdd7bcb724992528dce71056965b94d26/setup.py#L422

@saulshanabrook
Copy link
Contributor

saulshanabrook commented Oct 22, 2021

@marov thanks for checking that out. I am trying to debug the versioning problems with mypy. It looks like currently it's installing an older version of sqlalchemy, because of some transitive dependency of airflow, which doesn't include the mypy plugin, which is breaking the tests. I am looking into if there is a solution that can work for us, to keep a newer version of sqlalchemy.

@saulshanabrook
Copy link
Contributor

it seems the issue is that flask-appbuilder depends on an old version of sqlalchemy (dpgaspar/Flask-AppBuilder#1710), which is in turn waiting on this PR to be merged into flask-sqlalchemy (pallets-eco/flask-sqlalchemy#1001)

@@ -20,7 +20,7 @@ relative_files = true

# https://docs.sqlalchemy.org/en/14/orm/extensions/mypy.html
# https://pydantic-docs.helpmanual.io/mypy_plugin/#enabling-the-plugin
plugins = ["sqlalchemy.ext.mypy.plugin", "pydantic.mypy"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we try just removing the sqlalchemy plugin but keeping the mypy one? And commenting that its disabled until we can add sqlalchemy >= 1.4, which adds mypy support, but is blocked on flask-appbuilder, which is a dep of airflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that. However, as I've mentioned to @yifanwu today, Airflow is but one of the data tools that linea can have "plugins" to.
New tools will come with new dependencies and more issues like this.
Hence I propose:

  • Merge this PR (removing sqlalchemy plugin and leaving a TODO in comments)
  • Next create in a different repo (or just a folder?) a place for external plugins with examples and tests
  • Each of the "plugins" will have its own setup.py decoupling it from the rest
  • Is Jupyter also falls in this category? One can argue both ways - we can say that Jupyter support is the core feature of lineapy, but I'd argue that it lineapy can work without Jupyter. And Jupyter is not the only notebook server either (Zeppelin etc.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next create in a different repo (or just a folder?) a place for external plugins with examples and tests

I am +1 on monorepos when the code is tightly coupled!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, @yifanwu @saulshanabrook how shall we proceed?

@saulshanabrook
Copy link
Contributor

We talked about this at standup today and sounds good to merge now.

@saulshanabrook saulshanabrook merged commit 0ef50e4 into main Oct 25, 2021
@saulshanabrook saulshanabrook deleted the export-slice-airflow branch October 25, 2021 17:31
@saulshanabrook saulshanabrook self-assigned this Oct 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a test Airflow DAG from housing.py
4 participants