Lin 467 artifact dependencies within same session#714

Merged
mingjerli merged 39 commits into main from LIN-467-artifact-dependencies-within-same-session
Jul 28, 2022

Conversation


@mingjerli mingjerli commented Jul 2, 2022

Description

This PR implements the SessionArtifacts class, which analyzes variable and artifact dependencies within a session.
It breaks the session graph into multiple segments; each segment is responsible for one artifact, or for the calculation of common variables used by multiple artifacts. Each node of the graph is assigned to only one artifact to make sure there is no duplicated execution.
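The segmentation idea can be sketched as follows. This is a hypothetical illustration, not the actual SessionArtifacts implementation: `segment_nodes` and its input shape are made up for this example.

```python
# Hypothetical sketch of the segmentation idea (not the SessionArtifacts API):
# each node ends up in exactly one segment, so no node is executed twice.
def segment_nodes(artifact_ancestors):
    """artifact_ancestors maps artifact name -> set of node ids it needs."""
    assigned = {}
    for name, ancestors in artifact_ancestors.items():
        for node in ancestors:
            if node in assigned and assigned[node] != name:
                # needed by more than one artifact -> shared "common" segment
                assigned[node] = "common"
            elif node not in assigned:
                assigned[node] = name
    return assigned
```

With this sketch, nodes needed by a single artifact stay in that artifact's segment, while shared ancestors fall into a "common" segment, mirroring the description above.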

Example output can be found in the test files.

Fixes: LIN-467

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Add new tests.


yoonspark commented Jul 3, 2022

With the toy example we used, i.e.,

art = {}
a0 = 0
a0 += 1
art['a0'] = lineapy.save(a0, "a0")
a = 1
art['a'] = lineapy.save(a, "a")

a += 1
b = a * 2 + a0
c = b + 3
d = a * 4
e = d + 5
e += 6
art['c'] = lineapy.save(c, "c")
art['e'] = lineapy.save(e, "e")

f = c + 7
art['f'] = lineapy.save(f, "f")
a += 1
g = c + e * 2
art['g'] = lineapy.save(g, 'g')
h = a + g
art['h'] = lineapy.save(h, 'h')

the current implementation produces:

def pipeline():
    a = get_a()
    a = get_a_for_artifact_c_and_downstream(a)
    c = get_c(a)
    f = get_f(c)
    e = get_e(a)
    g = get_g(c, e)
    h = get_h(a, g)
    return a, c, f, e, g, h

which is missing a = get_a_for_artifact_h_and_downstream(a). That is, h relies on yet another mutated a, but this second mutation of a is not captured in the graph refactoring.

UPDATE: @mingjerli rightly pointed out that this is actually expected behavior: since the last a += 1 affects h only, it is "absorbed" into the definition of get_h, which is in fact the desired behavior. So, marking this discussion thread resolved.
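To illustrate the absorption, the generated get_h could (hypothetically) carry the mutation itself, rather than there being a separate get_a_for_artifact_h_and_downstream function:

```python
# Hypothetical sketch of the "absorbed" mutation: the final `a += 1`
# affects only h, so it lives inside get_h instead of a separate
# get_a_for_artifact_h_and_downstream function.
def get_h(a, g):
    a += 1  # the last mutation of `a`, absorbed here
    h = a + g
    return h
```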

(parent_id, node.id)
for node in nodes
for parent_id in node.parents()
if parent_id in set(self.ids.keys())
@mingjerli (author):

Need this to make sure the graph is self-contained within the list of input nodes.
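The constraint can be illustrated with a small standalone sketch (the node ids and edge list here are hypothetical; only the filtering pattern matches the snippet above):

```python
# Standalone sketch of the edge filter above: only edges whose parent is
# inside the tracked node set are kept, so the subgraph stays self-contained.
tracked_ids = {"n1", "n2", "n3"}
all_edges = [("n0", "n1"), ("n1", "n2"), ("n2", "n3")]
edges = [
    (parent_id, child_id)
    for (parent_id, child_id) in all_edges
    if parent_id in tracked_ids
]
# ("n0", "n1") is dropped: n0 lies outside the tracked nodes
```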

@mingjerli mingjerli marked this pull request as ready for review July 6, 2022 02:03
@mingjerli mingjerli requested a review from yifanwu July 8, 2022 21:37
@yoonspark

With the following session code:

# Load train data
train_df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Initiate the model
mod = LinearRegression()

# Fit the model
mod.fit(
    X=train_df[["petal.width"]],
    y=train_df["petal.length"],
)

# Save the fitted model as an artifact
lineapy.save(mod, "iris_model")

# Load data to predict (assume it comes from a different source)
pred_df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Make predictions
petal_length_pred = mod.predict(X=pred_df[["petal.width"]])

# Save the predictions
lineapy.save(petal_length_pred, "iris_petal_length_pred")

we get the following refactored code:

def get_pd_train_df_for_artifact_iris_model_and_downstream():
    import pandas as pd
    return pd, train_df

def get_iris_model(pd):
    from sklearn.linear_model import LinearRegression
    train_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    mod = LinearRegression()
    mod.fit(
        X=train_df[["petal.width"]],
        y=train_df["petal.length"],
    )
    return mod

def get_iris_petal_length_pred(mod, pd, train_df):
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred

def pipeline():
    pd, train_df = get_pd_train_df_for_artifact_iris_model_and_downstream()
    mod = get_iris_model(pd)
    petal_length_pred = get_iris_petal_length_pred(mod, pd, train_df)
    return mod, petal_length_pred

if __name__ == "__main__":
    pipeline()

which is not exactly what we want.


dorx commented Jul 10, 2022

Additionally, the import behavior is inconsistent. While it might be desirable to do imports within specific functions, we currently don't always import everything a function needs inside the function itself, e.g., in

def get_iris_petal_length_pred(mod, pd, train_df):
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred

we need to import pandas as pd for it to be self-contained.
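A self-contained version might look like this. This is a sketch of the suggested fix, not actual generated output; dropping the unused pd and train_df parameters is also an assumption made for the example:

```python
# Sketch of a self-contained version: pandas is imported inside the
# function, and the unused pd/train_df parameters are dropped.
def get_iris_petal_length_pred(mod):
    import pandas as pd
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred
```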

@mingjerli mingjerli requested review from lionsardesai and removed request for yifanwu July 12, 2022 15:50

@yifanwu yifanwu left a comment


Just left some notes from a quick skim of the code. I will do a more thorough review once I get to check out the code and test out corner cases.

@dataclass
class GraphSegment:
"""
This class contains information required to wrap a lineapy Graph object as a function

Should have richer documentation here. Ideally an example code snippet and the corresponding GraphSegment.

)

# Determine the tracked variables of each node
# Fix me when you see return variables in refactor behaves strange

Not sure what this FIXME is.

self.input_variables = input_variables
self.return_variables = return_variables

def get_function_definition(self, indentation=4) -> str:

This might be in the territory of "future proofing", but there is danger in our function generation being purely text manipulation (as opposed to graph-to-graph manipulation followed by graph-to-text generation): we will not be able to reuse this analysis and logic later for execution features (e.g., caching). cc @moustafa-a for things to watch out for on the platform side.
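The kind of text-based generation in question can be sketched roughly like this (`render_function` is a hypothetical helper, not the actual get_function_definition; it shows why only strings, and no graph-level information, survive the step):

```python
# Rough sketch of text-based function generation: the function body is
# assembled as strings, so no graph structure survives for later reuse
# (e.g., caching at execution time).
def render_function(name, input_vars, body_lines, return_vars, indentation=4):
    indent = " " * indentation
    header = f"def get_{name}({', '.join(input_vars)}):"
    body = [indent + line for line in body_lines]
    ret = indent + "return " + ", ".join(return_vars)
    return "\n".join([header, *body, ret])
```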

(parent_id, node.id)
for node in nodes
for parent_id in node.parents()
if parent_id in set(self.ids.keys())

In what scenarios will the parent_id be missing from self.ids.keys()? Should we throw an error or let it pass? What are the downstream implications?

@mingjerli (author):

The Graph.get_subgraph(nodes) method needs this constraint to guarantee the returned graph does not contain extra nodes.


return codeblock

def get_import_block(self, indentation=0) -> str:

Fix the existing get_import_block function and use it here instead of redefining it.
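A single shared helper might look like this (a hypothetical sketch of the merge being requested; the signature and the module-name-strings input are assumptions, not the existing get_import_block):

```python
# Hypothetical sketch of one shared import-block generator that both call
# sites could reuse instead of maintaining two copies.
def get_import_block(modules, indentation=0):
    indent = " " * indentation
    # dedupe and sort so generated import blocks are deterministic
    return "\n".join(f"{indent}import {m}" for m in sorted(set(modules)))
```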

# right after calculation in case of mutation downstream
artifacts = []
df = get_df()
artifacts.append(copy.deepcopy(df))

Only concern here is that deepcopy might not work with all object types, and this code generation might restrict us in the future.
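The concern is concrete: some object types cannot be deep-copied at all. For example:

```python
import copy

# Generators (similarly open file handles, locks, etc.) cannot be
# deep-copied; copy.deepcopy raises TypeError for them, so generated
# code that deepcopies every artifact value would break on such objects.
gen = (i for i in range(3))
try:
    copy.deepcopy(gen)
except TypeError as err:
    print("deepcopy failed:", err)
```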

@lionsardesai lionsardesai left a comment


Approving, conditional on a refactor after multi-session collections and script writers are done (sangyoon + mjl going to sync on this). Also conditional on a change that merges the existing import function with the new one introduced by MJL here.

@mingjerli mingjerli merged commit a909a0c into main Jul 28, 2022
@lionsardesai lionsardesai deleted the LIN-467-artifact-dependencies-within-same-session branch October 20, 2022 17:57