Building DVC Pipelines #765

Closed
pd-t opened this issue Sep 6, 2022 · 11 comments

Comments

@pd-t
Contributor

pd-t commented Sep 6, 2022

As a DVC user, I would like to transform my Jupyter notebook into a DVC pipeline using LineaPy.

At the moment, Airflow pipelines can be generated as described here. The idea is to add another framework option for DVC, e.g. framework="DVC".

Following the Airflow example, a set of Python source files would be generated, e.g. an iris_preprocessed.py file

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")
color_map = {"Setosa": "green", "Versicolor": "blue", "Virginica": "red"}
df["variety_color"] = df["variety"].map(color_map)
df["d_versicolor"] = df["variety"].apply(lambda x: 1 if x == "Versicolor" else 0)
df["d_virginica"] = df["variety"].apply(lambda x: 1 if x == "Virginica" else 0)
# Placeholder: save the artifacts to the IRIS_PREPROCESSED_FOLDER here.

and an iris_model.py file

from sklearn.linear_model import LinearRegression

# Placeholder: load the artifact from the IRIS_PREPROCESSED_FOLDER here.
df_processed = ...
mod = LinearRegression()
mod.fit(
    X=df_processed[["petal.width", "d_versicolor", "d_virginica"]],
    y=df_processed["sepal.width"],
)
# Placeholder: save the artifact to the IRIS_MODEL_FOLDER here.
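
The save and load placeholders in the two scripts above could be filled in many ways. The following is only a minimal sketch, assuming each artifact is pickled into its stage folder; the helper names `save_artifact` and `load_artifact` and the `.pkl` file layout are hypothetical and are not what LineaPy generates.

```python
import pickle
from pathlib import Path


def save_artifact(obj, folder, name):
    """Pickle obj to <folder>/<name>.pkl, creating the folder if needed."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{name}.pkl"
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_artifact(folder, name):
    """Load the object previously stored by save_artifact."""
    with (Path(folder) / f"{name}.pkl").open("rb") as fh:
        return pickle.load(fh)
```

With these helpers, iris_preprocessed.py could end with `save_artifact(df, "data/IRIS_PREPROCESSED_FOLDER", "iris_preprocessed")`, and iris_model.py could start with the matching `load_artifact` call.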

In addition to this, a DVC yaml file has to be generated, e.g.

stages:
  iris_preprocessed:
    cmd: python iris_preprocessed.py
    deps:
      - iris_preprocessed.py
    outs:
      - data/IRIS_PREPROCESSED_FOLDER
  iris_model:
    cmd: python iris_model.py
    deps:
      - iris_model.py
      - data/IRIS_PREPROCESSED_FOLDER
    outs:
      - data/IRIS_MODEL_FOLDER
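
To illustrate how a pipeline writer might emit such a file, here is a rough sketch that renders the YAML above from a plain mapping of stage specs. The function `render_dvc_yaml` is a hypothetical helper, not part of LineaPy or DVC, and it assumes stage names and paths need no YAML quoting.

```python
def render_dvc_yaml(stages):
    """Render dvc.yaml text from a {stage_name: spec} mapping.

    Each spec is a dict with a "cmd" string and optional "deps"/"outs" lists.
    """
    lines = ["stages:"]
    for name, spec in stages.items():
        lines.append(f"  {name}:")
        lines.append(f"    cmd: {spec['cmd']}")
        for key in ("deps", "outs"):
            entries = spec.get(key, [])
            if entries:
                lines.append(f"    {key}:")
                lines.extend(f"      - {entry}" for entry in entries)
    return "\n".join(lines) + "\n"


# The two stages from the example above.
stages = {
    "iris_preprocessed": {
        "cmd": "python iris_preprocessed.py",
        "deps": ["iris_preprocessed.py"],
        "outs": ["data/IRIS_PREPROCESSED_FOLDER"],
    },
    "iris_model": {
        "cmd": "python iris_model.py",
        "deps": ["iris_model.py", "data/IRIS_PREPROCESSED_FOLDER"],
        "outs": ["data/IRIS_MODEL_FOLDER"],
    },
}
dvc_yaml = render_dvc_yaml(stages)
```

A real writer would more likely use a YAML library so that unusual paths are escaped correctly; plain string formatting is used here only to keep the sketch dependency-free.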

Last but not least, a requirements.txt file would complete the DVC pipeline and make it ready for production!
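
One possible way to produce that requirements.txt is to pin whatever versions are installed in the environment that ran the notebook. The sketch below assumes that approach; `render_requirements` and its `version_of` parameter are hypothetical names introduced for illustration.

```python
from importlib import metadata


def render_requirements(packages, version_of=metadata.version):
    """Return requirements.txt text, pinning each package that can be resolved."""
    lines = []
    for name in sorted(packages):
        try:
            lines.append(f"{name}=={version_of(name)}")
        except metadata.PackageNotFoundError:
            # Not installed in this environment: leave the requirement unpinned.
            lines.append(name)
    return "\n".join(lines) + "\n"
```

For the example above, the call would be `render_requirements(["pandas", "scikit-learn"])`; the `version_of` hook exists only so the lookup can be replaced in tests.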

@andycui97
Contributor

Hi @pd-t !

Thanks for the suggestion. More integrations with different orchestrators and execution engines (DVC included) are something we are currently tracking and planning to work on, so it's great to hear that there is a need for these features.

For DVC specifically, I'll loop in @hogepodge to provide a more concrete timeline and how we're prioritizing orchestrators and execution engines.

@andycui97
Contributor

@pd-t quick update. I synced with our team on this and we’re putting it on our roadmap.

We've also reached out to the DVC dev team to partner with them to implement this integration. We'll keep you updated on the progress!

@pd-t
Contributor Author

pd-t commented Sep 8, 2022

@andycui97 Great news, I am very curious already!

@andycui97
Contributor

@pd-t , I've put out the first PR for DVC pipeline support. Right now you can call to_pipeline(..., framework='DVC', ...) and it should produce a dvc.yaml file with one stage that runs all the modules.
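
As a rough sketch of what such a call might look like for the iris example from this issue: the keyword names below are assumed to mirror the Airflow example in the LineaPy documentation, the artifacts are assumed to have been stored earlier with `lineapy.save`, and `pipeline_name` and `output_dir` are placeholder values.

```python
def build_dvc_pipeline(output_dir="output/dvc_pipeline"):
    """Ask LineaPy to write a DVC pipeline for two previously saved artifacts."""
    # Imported lazily so this sketch can be loaded without LineaPy installed.
    import lineapy

    return lineapy.to_pipeline(
        artifacts=["iris_preprocessed", "iris_model"],
        pipeline_name="iris_pipeline",  # placeholder name
        # iris_model is computed from iris_preprocessed.
        dependencies={"iris_model": {"iris_preprocessed"}},
        output_dir=output_dir,
        framework="DVC",
    )
```

After this runs, the output directory should contain the generated dvc.yaml next to the module code, as described in the comment above.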

We're planning to add more options to modularize the stages, as in the example you provided, so that each stage corresponds to a session or an artifact and has the right dependencies and outputs.
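
To make the stage-per-artifact idea concrete, here is a minimal sketch that derives one stage per artifact from a dependency mapping, with upstream outputs listed as downstream deps. The function name, the `data/<NAME>_FOLDER` convention (borrowed from the example in the original request), and the use of `graphlib` (Python 3.9+) are assumptions for illustration, not LineaPy's implementation.

```python
from graphlib import TopologicalSorter


def stages_per_artifact(dependencies, data_dir="data"):
    """Build one stage spec per artifact from {artifact: set_of_upstream_artifacts}."""
    stages = {}
    # static_order() yields upstream artifacts before the artifacts that need them.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = sorted(dependencies.get(name, ()))
        stages[name] = {
            "cmd": f"python {name}.py",
            # A stage depends on its own script plus every upstream output folder.
            "deps": [f"{name}.py"]
            + [f"{data_dir}/{dep.upper()}_FOLDER" for dep in upstream],
            "outs": [f"{data_dir}/{name.upper()}_FOLDER"],
        }
    return stages


stages = stages_per_artifact({"iris_model": {"iris_preprocessed"}})
```

Because each stage declares the upstream folder as a dependency, DVC can rerun only the stages whose inputs changed.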

Feel free to check the PR here and add any comments you may have: #801

@pd-t
Contributor Author

pd-t commented Oct 4, 2022

@andycui97 Great! Unfortunately I was too late for the review :( I had a conference talk today, but I will have a look at it this week!

@casperdcl

casperdcl commented Oct 18, 2022

> We've also reached out to the DVC dev team

Just noticed this comment but we don't see anything; 😅 did you reach out via one of these options?

@Zweigg

Zweigg commented Oct 21, 2022

Hi! 👋 Daniel here from team LineaPy. It looks like Doris (Linea's founder) is in touch directly with Dmitry. We plan on following up early next week with more details through one of the links you shared. Thx & look forward to collaborating! 👍

@pd-t
Contributor Author

pd-t commented Dec 7, 2022

@andycui97 I added a pull request for the 'StagePerArtifact' flavour: StagePerArtifact

@pd-t pd-t mentioned this issue Dec 7, 2022
2 tasks
@pd-t
Contributor Author

pd-t commented Jan 10, 2023

@andycui97 With the last pull request, the question now is how to proceed with the module file. Should I try to insert the necessary code sections into the task files with the help of the BasePipelineWriter?

@pd-t
Contributor Author

pd-t commented Jan 10, 2023

However, all the requirements described above are met with the PR. So we can also close this ticket and open a new one for this issue.

@andycui97
Contributor

I think so, yes: insert the necessary code into the task files by overriding BasePipelineWriter's _write_module method.

I also agree, let's open a new Issue for this.
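
The override pattern suggested here could look roughly like the sketch below. The base class is a hypothetical stand-in written for this example, NOT LineaPy's real BasePipelineWriter; only the method name `_write_module` comes from the discussion, and the constructor arguments and file names are assumptions.

```python
from pathlib import Path


class BasePipelineWriterStandIn:
    """Hypothetical stand-in for illustration, not LineaPy's actual class."""

    def __init__(self, output_dir, task_code):
        self.output_dir = Path(output_dir)
        self.task_code = task_code  # {task_name: python_source}

    def _write_module(self):
        # Default in this sketch: all task code goes into one shared module file.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        combined = "\n".join(self.task_code.values())
        (self.output_dir / "pipeline_module.py").write_text(combined)


class TaskFileWriter(BasePipelineWriterStandIn):
    def _write_module(self):
        # Override: write each task's code into its own task file instead.
        self.output_dir.mkdir(parents=True, exist_ok=True)
        for name, source in self.task_code.items():
            (self.output_dir / f"task_{name}.py").write_text(source)
```

Each resulting task file can then serve as the script behind one DVC stage.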


4 participants