Lin 467 artifact dependencies within same session #714
Conversation
With the toy example we used, i.e.,

```python
art = {}
a0 = 0
a0 += 1
art['a0'] = lineapy.save(a0, "a0")
a = 1
art['a'] = lineapy.save(a, "a")
a += 1
b = a*2 + a0
c = b + 3
d = a*4
e = d + 5
e += 6
art['c'] = lineapy.save(c, "c")
art['e'] = lineapy.save(e, "e")
f = c + 7
art['f'] = lineapy.save(f, "f")
a += 1
g = c + e*2
art['g'] = lineapy.save(g, 'g')
h = a + g
art['h'] = lineapy.save(h, 'h')
```

the current implementation produces:

```python
def pipeline():
    a = get_a()
    a = get_a_for_artifact_c_and_downstream(a)
    c = get_c(a)
    f = get_f(c)
    e = get_e(a)
    g = get_g(c, e)
    h = get_h(a, g)
    return a, c, f, e, g, h
```

which is missing …

UPDATE: @mingjerli rightly pointed out that this is actually expected behavior: since the last …
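The ordering constraint in play here — every artifact function must run after the functions that produce its inputs — amounts to a topological sort over artifact dependencies. Below is a hypothetical sketch (the `artifact_deps` mapping is hand-derived from the toy example above, not produced by LineaPy) using the standard-library `graphlib`:

```python
from graphlib import TopologicalSorter

# Hypothetical mapping: each artifact -> the artifacts it depends on,
# hand-derived from the toy example above.
artifact_deps = {
    "a0": set(),
    "a": set(),
    "c": {"a", "a0"},   # c = b + 3, where b = a*2 + a0
    "e": {"a"},         # e builds on d = a*4
    "f": {"c"},         # f = c + 7
    "g": {"c", "e"},    # g = c + e*2
    "h": {"a", "g"},    # h = a + g
}

# static_order yields each artifact only after all of its dependencies.
order = list(TopologicalSorter(artifact_deps).static_order())
```

Any valid call order in the generated `pipeline()` must be consistent with such a sort; the order shown in the PR output above is one of them.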
```python
(parent_id, node.id)
for node in nodes
for parent_id in node.parents()
if parent_id in set(self.ids.keys())
```

Need this to make sure the graph is self-contained within the list of input nodes.
With the following session code:

```python
# Load train data
train_df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Initiate the model
mod = LinearRegression()

# Fit the model
mod.fit(
    X=train_df[["petal.width"]],
    y=train_df["petal.length"],
)

# Save the fitted model as an artifact
lineapy.save(mod, "iris_model")

# Load data to predict (assume it comes from a different source)
pred_df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Make predictions
petal_length_pred = mod.predict(X=pred_df[["petal.width"]])

# Save the predictions
lineapy.save(petal_length_pred, "iris_petal_length_pred")
```

we get the following refactored code:

```python
def get_pd_train_df_for_artifact_iris_model_and_downstream():
    import pandas as pd
    return pd, train_df

def get_iris_model(pd):
    from sklearn.linear_model import LinearRegression
    train_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    mod = LinearRegression()
    mod.fit(
        X=train_df[["petal.width"]],
        y=train_df["petal.length"],
    )
    return mod

def get_iris_petal_length_pred(mod, pd, train_df):
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred

def pipeline():
    pd, train_df = get_pd_train_df_for_artifact_iris_model_and_downstream()
    mod = get_iris_model(pd)
    petal_length_pred = get_iris_petal_length_pred(mod, pd, train_df)
    return mod, petal_length_pred

if __name__ == "__main__":
    pipeline()
```

which is not exactly what we want (note, for example, that the first function returns `train_df` without ever defining it).
Additionally, the import behavior is also inconsistent. While it might be desirable to do imports within specific functions, we currently don't always import things that are needed within a function in the function itself, e.g., in

```python
def get_iris_petal_length_pred(mod, pd, train_df):
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred
```

we need to …
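One possible direction for the import inconsistency — purely a sketch, not the LineaPy implementation — is to scan each generated function for the names it actually uses and emit the matching import statements inside the function body. The `localize_imports` helper and the `known_imports` mapping below are hypothetical:

```python
import ast

def localize_imports(func_source: str, known_imports: dict) -> str:
    """Insert the imports a generated function actually uses into its body.

    known_imports maps a bound name (e.g. "pd") to its import statement.
    Hypothetical sketch: assumes the `def` line fits on a single line.
    """
    tree = ast.parse(func_source)
    # Collect every name referenced anywhere in the function.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    needed = [stmt for name, stmt in known_imports.items() if name in used]
    header = "".join(f"    {stmt}\n" for stmt in needed)
    lines = func_source.splitlines(keepends=True)
    # Insert the import lines right after the `def` line.
    return lines[0] + header + "".join(lines[1:])

src = (
    "def get_pred(mod, url):\n"
    "    pred_df = pd.read_csv(url)\n"
    "    return mod.predict(X=pred_df[['petal.width']])\n"
)
out = localize_imports(src, {"pd": "import pandas as pd",
                             "np": "import numpy as np"})
```

Here only `import pandas as pd` is inserted, because `np` never appears in the function body.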
yifanwu left a comment
Just left some notes from a quick skim of the code. I will do a more thorough review once I get to check out the code and test out corner cases.
```python
@dataclass
class GraphSegment:
    """
    This class contains information required to wrap a lineapy Graph object as a function
```
Should have richer documentation here. Ideally an example code snippet and the corresponding GraphSegment.
```python
# Determine the tracked variables of each node
# Fix me when you see return variables in refactor behaves strange
```
Not sure what this FIXME is.
```python
self.input_variables = input_variables
self.return_variables = return_variables
```

```python
def get_function_definition(self, indentation=4) -> str:
```
This might be in the territory of "future proofing", but there is danger in our function generation being purely text manipulation (as opposed to graph-to-graph manipulation, followed by graph-to-text generation): we will not be able to reuse these analyses and this logic later for execution features (e.g., caching). cc @moustafa-a for things to watch out for on the platform side.
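To illustrate the graph-to-graph-then-text alternative raised above, here is a hedged sketch (the `segment_to_function` helper is hypothetical, not LineaPy code) that builds a function as a Python AST and only renders text at the final step with `ast.unparse`, so intermediate passes could still analyze or transform the structure:

```python
import ast

def segment_to_function(name, params, body_expr):
    """Render a single-expression segment as a function via the AST.

    Hypothetical sketch: starts from a parsed template so all AST
    fields are populated, then swaps in the segment's pieces.
    """
    template = ast.parse("def _f():\n    return None")
    func = template.body[0]
    func.name = name
    func.args.args = [ast.arg(arg=p, annotation=None) for p in params]
    func.body = [ast.Return(value=ast.parse(body_expr, mode="eval").body)]
    # Text generation happens only here, at the very end.
    return ast.unparse(ast.fix_missing_locations(template))

src = segment_to_function("get_c", ["a", "a0"], "a * 2 + a0 + 3")
```

The payoff is that caching or re-execution features could operate on the AST (or the graph behind it) instead of re-parsing generated strings.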
```python
(parent_id, node.id)
for node in nodes
for parent_id in node.parents()
if parent_id in set(self.ids.keys())
```

In what scenarios will the parent_id be missing from self.ids.keys()? Should we throw an error or let it pass? What are the downstream implications?
The Graph.get_subgraph(nodes) method needs this constraint to guarantee the returned graph does not have extra nodes.
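The constraint being discussed can be boiled down to: keep only edges whose endpoints are both inside the node subset, so the extracted subgraph is self-contained. A minimal standalone sketch (plain string ids and a dict-based graph, not LineaPy's actual node types):

```python
def subgraph_edges(parents, subset):
    """Return edges (parent_id, node_id) that stay inside `subset`.

    parents maps node_id -> list of parent ids (hypothetical structure).
    Edges whose parent lies outside the subset are dropped, so the
    resulting subgraph never references nodes it does not contain.
    """
    return [
        (parent_id, node_id)
        for node_id in subset
        for parent_id in parents[node_id]
        if parent_id in subset  # drop edges pointing outside the subset
    ]

# "x" belongs to another segment, so the edge ("x", "c") is filtered out.
parents = {"a": [], "b": ["a"], "c": ["a", "x"]}
edges = subgraph_edges(parents, {"a", "b", "c"})
```

Whether dropping such edges silently (as here) or raising is the right behavior is exactly the open question in the comment above.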
```python
return codeblock
```

```python
def get_import_block(self, indentation=0) -> str:
```
Fix the existing get_import_block function and use it here instead of redefining it.
```python
# right after calculation in case of mutation downstream
artifacts = []
df = get_df()
artifacts.append(copy.deepcopy(df))
```
Only concern here is that deepcopy might not work with all object types, and this code generation might restrict us in the future.
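The concern is concrete: objects such as thread locks, open sockets, or generators raise `TypeError` under `copy.deepcopy`. A generated pipeline could hedge with a fallback like the sketch below (the `safe_snapshot` helper is hypothetical, and falling back to the live reference gives up mutation protection for those objects):

```python
import copy
import threading

def safe_snapshot(obj):
    """Deep-copy if possible; otherwise return the live reference.

    Hypothetical sketch of a fallback for the deepcopy concern: plain
    data is snapshotted, while uncopyable objects pass through as-is.
    """
    try:
        return copy.deepcopy(obj)
    except TypeError:
        # e.g. thread locks, open sockets, generators cannot be deep-copied
        return obj

lock = threading.Lock()
snap = safe_snapshot(lock)        # falls back: returns the lock itself
copied = safe_snapshot([1, [2]])  # plain data is deep-copied
```

This keeps the generated code from crashing on exotic artifact types, at the cost of weaker guarantees for them.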
lionsardesai left a comment
Approving, conditional on a refactor after multi-session collections and script writers are done (sangyoon + mjl going to sync on this). Also conditional on a change that merges the existing import function with the new one introduced by MJL here.
Description
This PR implements the SessionArtifacts class, which analyzes variable and artifact dependencies within a session.
It breaks the session graph into multiple segments; each segment is responsible for one artifact, or for the calculation of common variables used by multiple artifacts. Each node of the graph is assigned to only one segment, to ensure there is no duplicated execution.
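The node-to-segment assignment described above can be sketched as follows (hypothetical illustration, not LineaPy internals): compute each artifact's ancestor closure, then route nodes needed by several artifacts to a shared segment so they execute only once.

```python
def assign_segments(parents, artifacts):
    """Assign each session-graph node to exactly one segment.

    parents maps node -> list of parent nodes; artifacts maps an
    artifact name -> its sink node. Nodes in the closure of several
    artifacts land in a shared "common" segment (hypothetical naming).
    """
    def ancestors(node):
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(parents.get(n, []))
        return seen

    closures = {name: ancestors(sink) for name, sink in artifacts.items()}
    assignment = {}
    for node in parents:
        owners = [name for name, cl in closures.items() if node in cl]
        if len(owners) == 1:
            assignment[node] = owners[0]
        elif owners:
            assignment[node] = "common"  # shared: execute only once
    return assignment

# "a" feeds both artifacts, so it goes to the shared segment.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"], "e": ["d"]}
seg = assign_segments(parents, {"art_c": "c", "art_e": "e"})
```

This mirrors the invariant in the description: every node has exactly one owner, so no computation is duplicated across artifact functions.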
Example output can be found in the test files.
Fixes # (issue)
LIN-467
Type of change
How Has This Been Tested?
Added new tests.