
UX Design: Deploying Workflows #236

Open
yifanwu opened this issue Oct 1, 2021 · 12 comments
Labels: documentation, User Story

Comments


yifanwu commented Oct 1, 2021

High-level goal: for any code that's executed on Linea, we want to ensure that it can also be run on an Airflow-connected EC2 instance.

This means that we need to address a few reproducibility and networking challenges, all of which may have existing solutions via Airflow:

  • We need to ensure that we have access to the data from the users' resources (there are operators for existing platforms, like the GCP operators, that we can use; a sketch with one such operator is below).
  • We need a server that can also access the code that Lineapy captured locally (or maybe we just re-package the code into Airflow-DAG-compatible code so that it is shared transparently).
  • We need to create containers on the server machine so that we can support multiple runs (it seems like we can use the k8s executor).
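
As a minimal sketch of the first point, an existing provider operator could pull user data onto the worker before the pipeline runs (the bucket, object, and file names are placeholders, and the DAG wiring is illustrative):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_local import (
    GCSToLocalFilesystemOperator,
)

with DAG(
    dag_id="fetch_user_data",  # illustrative name
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Copy the user's dataset from GCS onto the worker's local disk.
    fetch = GCSToLocalFilesystemOperator(
        task_id="fetch_dataset",
        bucket="users-bucket",         # placeholder
        object_name="data/train.csv",  # placeholder
        filename="/tmp/train.csv",
    )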

Here are some notes on Airflow, and maybe @dorx can comment:

Here is the architecture: https://airflow.apache.org/docs/apache-airflow/2.0.0/start.html#basic-airflow-architecture

This makes me think that it makes the most sense for us to synthesize Airflow code, though even better would be if they had some server-level APIs. Airflow seems pretty extensible (example here).

Need to explore more!

DAG Factories

Working with notebooks

https://medium.com/ai%C2%B3-theory-practice-business/how-to-build-machine-learning-pipelines-with-airflow-papermill-6baef3832bc6

yifanwu added the documentation label Oct 1, 2021

yifanwu commented Oct 1, 2021

My current plan is to have a CLI command that supports creating an Airflow DAG; it will create a new code file (note that the abstraction is mediated through code) that loads the artifact's graph and calls the Executor on it.

Alternatively, we can just emit the raw code and wrap it in a function. The latter is directly transparent to the user, but the former allows us to do execution optimizations as well as making it easier to set up the environments automatically. A sketch of what the generated file might look like is below.
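
For concreteness, here is a minimal sketch of the kind of file the CLI could generate under the first approach. The DAG wiring uses real Airflow APIs, but the body of run_artifact is a placeholder, not an actual lineapy call:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_artifact():
    # Placeholder: load the artifact's graph from the Linea DB and
    # re-execute it with the Executor. The exact lineapy API is TBD.
    raise NotImplementedError


with DAG(
    dag_id="linea_artifact_dag",  # illustrative name
    start_date=datetime(2021, 10, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="run_artifact", python_callable=run_artifact)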

The other thing that's weird right now is that Linea also stores the results, whereas with vanilla Airflow, users have to save the values themselves---we should probably think through the UX there.

I'm going to have a config mode to support both and see how it goes.

This line of thinking also starts to sketch out an interface for Linea as a set of execution and value APIs.


yifanwu commented Oct 2, 2021

Some open questions (an ongoing list):

UX

  • If we notice an error in one of the nodes (e.g., a function didn't return the result we had expected, maybe because a service is down)---how do we surface that through Airflow?

Implementation

  • I don't know how Airflow controls the execution environment for each job (this would be a good interview question)---the website seems to suggest that you can run the whole thing in Docker, but that's different from having a container for each task; a sketch of the per-task option is below.
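
For what it's worth, Airflow's Docker provider does offer a per-task container option; a minimal sketch (the image and command are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="per_task_container",  # illustrative name
    start_date=datetime(2021, 10, 2),
    schedule_interval=None,
    catchup=False,
) as dag:
    # This task runs inside its own container, isolated from the host environment.
    train = DockerOperator(
        task_id="train_model",
        image="my-org/train:latest",  # placeholder image
        command="python train.py",    # placeholder command
        docker_url="unix://var/run/docker.sock",
        auto_remove=True,
    )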


yifanwu commented Oct 6, 2021

Here is another reference that Daniel L found, https://stackoverflow.com/questions/51573768/how-to-run-jupyter-notebook-in-airflow, which talks more about notebooks on Airflow.

Another good reference here: https://hex.tech/blog/hex-two-point-oh


yifanwu commented Oct 6, 2021

@marov here is a discussion on the pipeline automation work that we would love to get your help on! (The previous comment was in the wrong issue, sorry!)


yifanwu commented Oct 11, 2021

Took some notes from what @marov has shared:

  • Instead of just focusing on Airflow, we can convert to a working DAG for several targets, e.g., Argo/Prefect. This fits nicely with the DAG factory pattern.
  • Mike thinks that we shouldn't need to host Airflow ourselves: data engineering is too broad a scope, we shouldn't handle bespoke devops concerns, and there are lots of issues related to running DAGs---it's just ONE of the components, alongside storage, pub/sub, and writing to DBs.
  • In terms of whether we use raw code for execution or go via the lineapy executor, Mike proposes the three biggest factors:
    • Portability---Linea becomes an interpreter that's used throughout; if we want to work with enterprises, this would be a hard dependency and cause headaches for devops.
    • The BIGGEST is performance and predictability.
      • They care because hardware capacity is not keeping up with the amount of data. If throwing money at a problem solves it, that's a good situation; a bad situation is when you cannot throw money at it fast enough. Start with a small number of machines and then scale up. For instance, an input might have hundreds of features, and suddenly an unusual input makes our system 10x slower.
      • Y: remember the 10am heuristic that Ryan mentioned---if we are slow, that breaks assumptions.
    • Liability is another problem when there is an error.
    • Transparency is NOT a big usability issue---so long as we show the code, no one will poke around.

Our proposed initial plan of attack:

  • Adding boilerplate is not too difficult. Take the extracted code, put it in a folder, and start with Airflow---a simple Airflow server just to test. We don't need to run the Airflow server (there is a test harness; a sketch of one way to test is below). Will get this done by EOW.
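
One way to exercise generated DAGs without a running server is a plain import test against Airflow's DagBag; the dags/ folder path is an assumption about where generated files land:

from airflow.models import DagBag


def test_generated_dags_import_cleanly():
    # Parse every file in the folder where generated DAGs are dropped;
    # any syntax or wiring error surfaces as an import error.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}
    assert len(dag_bag.dags) > 0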


dorx commented Oct 11, 2021

Per our discussions this morning, we would produce the sliced code, and possibly the sliced code translated into an Airflow DAG, as the output program---not one where we call the lineapy executor to run the lineapy DAG.

In this mode, optimizations involving reusing cached results would result in altered output programs, e.g., for a simple pipeline

a = foo()
b = bar(a)
c = b.func()

reusing the result for b would produce a program that looks something like

b = load_pickle("/path/to/where/we/saved/the/pickle/file")
c = b.func()

Note that instead of a custom lineapy.load function, we're calling a standard data loading function in Python.
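
For illustration, a minimal sketch of such a load_pickle helper using only the standard library (the helper name comes from the example above; its body here is an assumption):

import pickle


def load_pickle(path):
    # Load a previously cached value from disk with plain pickle.
    with open(path, "rb") as f:
        return pickle.load(f)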

@saulshanabrook

Note that I just closed #69 and #30, which were about storing some version information. If we intend to take on storing enough to reproduce an environment in Linea, we can re-open them.

Issue Triage automation moved this from Proposed UX Features to Done Oct 14, 2021
saulshanabrook moved this from Done to Proposed UX Features in Issue Triage Oct 14, 2021
@saulshanabrook

We should write a more explicit user story here.


yifanwu commented Oct 21, 2021

Some new questions surfaced from PR #320. They all fall under a more robust templating mechanism (and, by implication, some abstraction for how we think about deployment). Right now we are hard-coding quite a few important things.

  • We are currently print-ing the artifact. That's probably not what a pipeline wants to do (I'm imagining saving something to disk or a DB, but @marov would know more).
  • It's not clear how related artifacts compose---e.g., if one artifact is a cleaned DF and another artifact is a model that's trained on the cleaned DF. The template is hard-coded for unrelated artifacts.
  • I think the template only allows for ONE job right now (the job ID is hard-coded).
  • We need to figure out how to expose Airflow configs, e.g., schedule_interval="*/15 * * * *" (every 15 minutes); see the sketch after this list. Currently the user can come in and change things directly, but that potentially limits Linea's visibility into what's actually deployed.
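
On the last point, that config lives in the DAG constructor, so exposing it could be as simple as templating these keyword arguments. A minimal sketch (the dag_id is illustrative; the schedule is the example from above):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="linea_pipeline",           # illustrative name
    schedule_interval="*/15 * * * *",  # every 15 minutes
    start_date=datetime(2021, 10, 21),
    catchup=False,
) as dag:
    ...  # generated tasks would be wired up here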


yifanwu commented Oct 27, 2021

As I was creating my own Airflow DAGs, I realized that there are a few heuristics we can use:

  • When sharing data between functions in the DAG, it's done via XCom, and (we need to investigate) it might hurt performance to have too many functions triggering these writes/reads; see the sketch after this list.
  • Some workflows have shared processes, and we can lift out the shared methods automatically.
  • When there are nodes that do not depend on other nodes, they can be run in parallel.
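
A minimal sketch of that XCom-mediated data passing, using the Airflow 2 TaskFlow API (the DAG and task names are illustrative):

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2021, 10, 27), catchup=False)
def xcom_demo():
    @task
    def clean():
        # The return value is serialized and pushed to XCom.
        return {"rows": 1000}

    @task
    def train(stats):
        # Pulled back out of XCom; every hop is a metadata-DB write/read.
        print(stats["rows"])

    train(clean())


demo_dag = xcom_demo()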

Will note down more things to consider as they surface.

yifanwu changed the title from "Deploying workflows" to "UX Design: Deploying Workflows" Nov 3, 2021

marov commented Nov 4, 2021

I use joblib simply to persist dataframes, see https://joblib.readthedocs.io/en/latest/persistence.html
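
For reference, the joblib pattern is just a dump/load pair (the file name is a placeholder):

import joblib
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Persist the dataframe to disk; joblib handles large numpy-backed objects well.
joblib.dump(df, "df.joblib")

# Reload it in a downstream task or process.
df_restored = joblib.load("df.joblib")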


yifanwu commented Dec 2, 2021

I ran across another talk https://youtube.com/watch?v=ja2siGyklq0&list=PLGudixcDaxY1noceCfAKU-kfpYIOgbC84&index=9 that uses "Dataclasses as Pipeline Definitions in Airflow". It's pretty cool and might be a more elegant design than the DAG factory one.
