
UX Design: Deploying Workflows #236

Open
yifanwu opened this issue Oct 1, 2021 · 12 comments
Labels: documentation, User Story

Comments


yifanwu commented Oct 1, 2021

High-level goal: for any code that's executed on Linea, we want to ensure that it can also be run on an Airflow-connected EC2 instance.

This means that we need to address a few reproducibility and networking challenges, all of which may have existing solutions via Airflow:

  • We need to ensure that we have access to the data from the users' resources (there are operators for existing platforms, like the GCP operators, that we can use; a sketch with one such operator is below).
  • We need a server that can also access the code that Lineapy captured locally (or maybe we just re-package the code into Airflow-DAG-compatible code so that it is shared transparently).
  • We need to create containers on the server machine so that we can support multiple runs (it seems like we can use the k8s executor).
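
As a minimal sketch of the first point, an existing provider operator could pull user data onto the worker before the pipeline runs (the bucket, object, and file names are placeholders, and the DAG wiring is illustrative):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_local import (
    GCSToLocalFilesystemOperator,
)

with DAG(
    dag_id="fetch_user_data",  # illustrative name
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Copy the user's dataset from GCS onto the worker's local disk.
    fetch = GCSToLocalFilesystemOperator(
        task_id="fetch_dataset",
        bucket="users-bucket",         # placeholder
        object_name="data/train.csv",  # placeholder
        filename="/tmp/train.csv",
    )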

Here are some notes on Airflow, and maybe @dorx can comment:

Here is the architecture: https://airflow.apache.org/docs/apache-airflow/2.0.0/start.html#basic-airflow-architecture

This makes me think that it makes the most sense for us to synthesize Airflow code, though even better would be if they had some server-level APIs. Airflow seems pretty extensible (example here).

Need to explore more!

DAG Factories

Working with notebooks

https://medium.com/ai%C2%B3-theory-practice-business/how-to-build-machine-learning-pipelines-with-airflow-papermill-6baef3832bc6

yifanwu added the documentation label Oct 1, 2021

yifanwu commented Oct 1, 2021

My current plan is to have a CLI command that supports creating an Airflow DAG; it will create a new code file (note that the abstraction is mediated through code) that loads the artifact's graph and calls the Executor on it.

Alternatively, we can just emit the raw code and wrap it in a function. The latter is directly transparent to the user, but the former allows us to do execution optimizations as well as making it easier to set up the environments automatically. A sketch of what the generated file might look like is below.
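
For concreteness, here is a minimal sketch of the kind of file the CLI could generate under the first approach. The DAG wiring uses real Airflow APIs, but the body of run_artifact is a placeholder, not an actual lineapy call:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_artifact():
    # Placeholder: load the artifact's graph from the Linea DB and
    # re-execute it with the Executor. The exact lineapy API is TBD.
    raise NotImplementedError


with DAG(
    dag_id="linea_artifact_dag",  # illustrative name
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_artifact", python_callable=run_artifact)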

The other thing that's weird right now is that Linea also stores the results, whereas with vanilla Airflow, users have to save the values themselves---we should probably think through the UX there.

I'm going to have a config mode to support both and see how it goes.

This line of thinking also starts to sketch out an interface for Linea as a set of execution and value APIs.


yifanwu commented Oct 2, 2021

Some open questions (an ongoing list):

UX

  • If we notice an error in one of the nodes (e.g., a function didn't return the result we had expected, maybe because a service is down)---how do we surface that through Airflow?

Implementation

  • I don't know how Airflow controls the execution environment for each job (this would be a good interview question)---the website seems to suggest that you can run the whole thing in Docker, but that's different from having a container for each task; a sketch of the per-task option is below.
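
For what it's worth, Airflow's Docker provider does offer a per-task container option; a minimal sketch (the image and command are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="per_task_container",  # illustrative name
    start_date=datetime(2021, 10, 2),
    schedule_interval=None,
    catchup=False,
) as dag:
    # This task runs inside its own container, isolated from the host environment.
    train = DockerOperator(
        task_id="train_model",
        image="my-org/train:latest",  # placeholder image
        command="python train.py",    # placeholder command
        docker_url="unix://var/run/docker.sock",
        auto_remove=True,
    )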


yifanwu commented Oct 6, 2021

Here is another reference that Daniel L found, https://stackoverflow.com/questions/51573768/how-to-run-jupyter-notebook-in-airflow, which talks more about notebooks on Airflow.

Another good reference here: https://hex.tech/blog/hex-two-point-oh


yifanwu commented Oct 6, 2021

@marov here is a discussion on the pipeline automation work that we would love to get your help on! (The previous comment was in the wrong issue, sorry!)


yifanwu commented Oct 11, 2021

Took some notes from what @marov has shared:

  • Instead of just focusing on Airflow, we can convert to a working DAG for several targets, e.g., Argo/Prefect. This fits nicely with the DAG factory pattern.
  • Mike thinks that we shouldn't need to host Airflow ourselves: data engineering is too broad a scope, we shouldn't handle bespoke devops concerns, and there are lots of issues related to running DAGs---it's just ONE of the components, alongside storage, pub/sub, and writing to DBs.
  • In terms of whether we use raw code for execution or go via the lineapy executor, Mike proposes the three biggest factors:
    • Portability---Linea becomes an interpreter that's used throughout; if we want to work with enterprises, this would be a hard dependency and cause headaches for devops.
    • The BIGGEST is performance and predictability.
      • They care because hardware capacity is not keeping up with the amount of data. If throwing money at a problem solves it, that's a good situation; a bad situation is when you cannot throw money at it fast enough. Start with a small number of machines and then scale up. For instance, an input might have hundreds of features, and suddenly an unusual input makes our system 10x slower.
      • Y: remember the 10am heuristic that Ryan mentioned---if we are slow, that breaks assumptions.
    • Liability is another problem when there is an error.
    • Transparency is NOT a big usability issue---so long as we show the code, no one will poke around.

Our proposed initial plan of attack:

  • Adding boilerplate is not too difficult. Take the extracted code, put it in a folder, and start with Airflow---a simple Airflow server just to test. We don't need to run the Airflow server (there is a test harness; a sketch of one way to test is below). Will get this done by EOW.
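
One way to exercise generated DAGs without a running server is a plain import test against Airflow's DagBag; the dags/ folder path is an assumption about where generated files land:

from airflow.models import DagBag


def test_generated_dags_import_cleanly():
    # Parse every file in the folder where generated DAGs are dropped;
    # any syntax or wiring error surfaces as an import error.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}
    assert len(dag_bag.dags) > 0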


dorx commented Oct 11, 2021

Per our discussions this morning, we would produce the sliced code, and possibly the sliced code translated into an Airflow DAG, as the output program---not one where we call the lineapy executor to run the lineapy DAG.

In this mode, optimizations involving reusing cached results would result in altered output programs, e.g., for a simple pipeline

a = foo()
b = bar(a)
c = b.func()

reusing the result for b would produce a program that looks something like

b = load_pickle("/path/to/where/we/saved/the/pickle/file")
c = b.func()

Note that instead of a custom lineapy.load function, we're calling a standard data loading function in Python.
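
For illustration, a minimal sketch of such a load_pickle helper using only the standard library (the helper name comes from the example above; its body here is an assumption):

import pickle


def load_pickle(path):
    # Load a previously cached value from disk with plain pickle.
    with open(path, "rb") as f:
        return pickle.load(f)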

@saulshanabrook

Note that I just closed #69 and #30, which were about storing some version information. If we intend to take on storing enough to reproduce an environment in Linea, we can re-open them.

Issue Triage automation moved this from Proposed UX Features to Done Oct 14, 2021
saulshanabrook moved this from Done to Proposed UX Features in Issue Triage Oct 14, 2021
@saulshanabrook

We should write a more explicit user story here.


yifanwu commented Oct 21, 2021

Some new questions surfaced from PR #320. They all fall under a more robust templating mechanism (and, by implication, some abstraction for how we think about deployment). Right now we are hard-coding quite a few important things.

  • We are currently print-ing the artifact. That's probably not what a pipeline wants to do (I'm imagining saving something to disk or a DB, but @marov would know more).
  • It's not clear how related artifacts compose---e.g., if one artifact is a cleaned DF and another artifact is a model that's trained on the cleaned DF. The template is hard-coded for unrelated artifacts.
  • I think the template only allows for ONE job right now (the job ID is hard-coded).
  • We need to figure out how to expose Airflow configs, e.g., schedule_interval="*/15 * * * *" (every 15 minutes); see the sketch after this list. Currently the user can come in and change things directly, but that potentially limits Linea's visibility into what's actually deployed.
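
On the last point, that config lives in the DAG constructor, so exposing it could be as simple as templating these keyword arguments. A minimal sketch (the dag_id is illustrative; the schedule is the example from above):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="linea_pipeline",           # illustrative name
    schedule_interval="*/15 * * * *",  # every 15 minutes
    start_date=datetime(2021, 10, 21),
    catchup=False,
) as dag:
    ...  # generated tasks would be wired up here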


yifanwu commented Oct 27, 2021

As I was creating my own Airflow DAGs, I realized that there are a few heuristics we can use:

  • When sharing data between functions in the DAG, it's done via XCom, and (we need to investigate) it might hurt performance to have too many functions triggering these writes/reads; see the sketch after this list.
  • Some workflows have shared processes, and we can lift out the shared methods automatically.
  • When there are nodes that do not depend on other nodes, they can be run in parallel.
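
A minimal sketch of that XCom-mediated data passing, using the Airflow 2 TaskFlow API (the DAG and task names are illustrative):

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2021, 10, 27), catchup=False)
def xcom_demo():
    @task
    def clean():
        # The return value is serialized and pushed to XCom.
        return {"rows": 1000}

    @task
    def train(stats):
        # Pulled back out of XCom; every hop is a metadata-DB write/read.
        print(stats["rows"])

    train(clean())


demo_dag = xcom_demo()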

Will note down more things to consider as they surface.

yifanwu changed the title from "Deploying workflows" to "UX Design: Deploying Workflows" Nov 3, 2021

marov commented Nov 4, 2021

I use joblib simply to persist dataframes, see https://joblib.readthedocs.io/en/latest/persistence.html
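
For reference, the joblib pattern is just a dump/load pair (the file name is a placeholder):

import joblib
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Persist the dataframe to disk; joblib handles large numpy-backed objects well.
joblib.dump(df, "df.joblib")

# Reload it in a downstream task or process.
df_restored = joblib.load("df.joblib")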


yifanwu commented Dec 2, 2021

I ran across another talk https://youtube.com/watch?v=ja2siGyklq0&list=PLGudixcDaxY1noceCfAKU-kfpYIOgbC84&index=9 that uses "Dataclasses as Pipeline Definitions in Airflow". It's pretty cool and might be a more elegant design than the DAG factory one.
