Lin 467 artifact dependencies within same session#714

Merged
mingjerli merged 39 commits into main from LIN-467-artifact-dependencies-within-same-session
Jul 28, 2022

Conversation


@mingjerli mingjerli commented Jul 2, 2022

Description

This PR implements the SessionArtifacts class, which analyzes variable and artifact dependencies within a session.
It breaks the session graph into multiple segments; each segment is responsible for one artifact, or for the calculation of common variables used by multiple artifacts. Each node of the graph is assigned to only one artifact to make sure there is no duplicated execution.
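The segmentation idea can be sketched as follows. This is a hypothetical illustration, not the actual SessionArtifacts implementation: `segment_nodes` and its input shape are made up for this example.

```python
# Hypothetical sketch of the segmentation idea (not the SessionArtifacts API):
# each node ends up in exactly one segment, so no node is executed twice.
def segment_nodes(artifact_ancestors):
    """artifact_ancestors maps artifact name -> set of node ids it needs."""
    assigned = {}
    for name, ancestors in artifact_ancestors.items():
        for node in ancestors:
            if node in assigned and assigned[node] != name:
                # needed by more than one artifact -> shared "common" segment
                assigned[node] = "common"
            elif node not in assigned:
                assigned[node] = name
    return assigned
```

With this sketch, nodes needed by a single artifact stay in that artifact's segment, while shared ancestors fall into a "common" segment, mirroring the description above.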

Example output can be found in the test files.

Fixes: LIN-467

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Add new tests.


yoonspark commented Jul 3, 2022

With the toy example we used, i.e.,

art = {}
a0 = 0
a0 += 1
art['a0'] = lineapy.save(a0, "a0")
a = 1
art['a'] = lineapy.save(a, "a")

a += 1
b = a * 2 + a0
c = b + 3
d = a * 4
e = d + 5
e += 6
art['c'] = lineapy.save(c, "c")
art['e'] = lineapy.save(e, "e")

f = c + 7
art['f'] = lineapy.save(f, "f")
a += 1
g = c + e * 2
art['g'] = lineapy.save(g, 'g')
h = a + g
art['h'] = lineapy.save(h, 'h')

the current implementation produces:

def pipeline():
    a = get_a()
    a = get_a_for_artifact_c_and_downstream(a)
    c = get_c(a)
    f = get_f(c)
    e = get_e(a)
    g = get_g(c, e)
    h = get_h(a, g)
    return a, c, f, e, g, h

which is missing a = get_a_for_artifact_h_and_downstream(a). That is, h relies on yet another mutated a, but this second mutation of a is not captured in the graph refactoring.

UPDATE: @mingjerli rightly pointed out that this is actually expected behavior: since the last a += 1 affects h only, it is "absorbed" into the definition of get_h, which is in fact the desired behavior. So, marking this discussion thread resolved.
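To illustrate the absorption, the generated get_h could (hypothetically) carry the mutation itself, rather than there being a separate get_a_for_artifact_h_and_downstream function:

```python
# Hypothetical sketch of the "absorbed" mutation: the final `a += 1`
# affects only h, so it lives inside get_h instead of a separate
# get_a_for_artifact_h_and_downstream function.
def get_h(a, g):
    a += 1  # the last mutation of `a`, absorbed here
    h = a + g
    return h
```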

(parent_id, node.id)
for node in nodes
for parent_id in node.parents()
if parent_id in set(self.ids.keys())
@mingjerli (author):

Need this to make sure the graph is self-contained within the list of input nodes.
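The constraint can be illustrated with a small standalone sketch (the node ids and edge list here are hypothetical; only the filtering pattern matches the snippet above):

```python
# Standalone sketch of the edge filter above: only edges whose parent is
# inside the tracked node set are kept, so the subgraph stays self-contained.
tracked_ids = {"n1", "n2", "n3"}
all_edges = [("n0", "n1"), ("n1", "n2"), ("n2", "n3")]
edges = [
    (parent_id, child_id)
    for (parent_id, child_id) in all_edges
    if parent_id in tracked_ids
]
# ("n0", "n1") is dropped: n0 lies outside the tracked nodes
```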

@mingjerli mingjerli marked this pull request as ready for review July 6, 2022 02:03
@mingjerli mingjerli requested a review from yifanwu July 8, 2022 21:37
@yoonspark

With the following session code:

# Load train data
train_df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Initiate the model
mod = LinearRegression()

# Fit the model
mod.fit(
    X=train_df[["petal.width"]],
    y=train_df["petal.length"],
)

# Save the fitted model as an artifact
lineapy.save(mod, "iris_model")

# Load data to predict (assume it comes from a different source)
pred_df = pd.read_csv("https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv")

# Make predictions
petal_length_pred = mod.predict(X=pred_df[["petal.width"]])

# Save the predictions
lineapy.save(petal_length_pred, "iris_petal_length_pred")

we get the following refactored code:

def get_pd_train_df_for_artifact_iris_model_and_downstream():
    import pandas as pd
    return pd, train_df

def get_iris_model(pd):
    from sklearn.linear_model import LinearRegression
    train_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    mod = LinearRegression()
    mod.fit(
        X=train_df[["petal.width"]],
        y=train_df["petal.length"],
    )
    return mod

def get_iris_petal_length_pred(mod, pd, train_df):
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred

def pipeline():
    pd, train_df = get_pd_train_df_for_artifact_iris_model_and_downstream()
    mod = get_iris_model(pd)
    petal_length_pred = get_iris_petal_length_pred(mod, pd, train_df)
    return mod, petal_length_pred

if __name__ == "__main__":
    pipeline()

which is not exactly what we want.


dorx commented Jul 10, 2022

Additionally, the import behavior is inconsistent. While it might be desirable to do imports within specific functions, we currently don't always import everything a function needs inside the function itself, e.g., in

def get_iris_petal_length_pred(mod, pd, train_df):
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred

we need to import pandas as pd for it to be self-contained.
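A self-contained version might look like this. This is a sketch of the suggested fix, not actual generated output; dropping the unused pd and train_df parameters is also an assumption made for the example:

```python
# Sketch of a self-contained version: pandas is imported inside the
# function, and the unused pd/train_df parameters are dropped.
def get_iris_petal_length_pred(mod):
    import pandas as pd
    pred_df = pd.read_csv(
        "https://raw.githubusercontent.com/LineaLabs/lineapy/main/examples/tutorials/data/iris.csv"
    )
    petal_length_pred = mod.predict(X=pred_df[["petal.width"]])
    return petal_length_pred
```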

@mingjerli mingjerli requested review from lionsardesai and removed request for yifanwu July 12, 2022 15:50

@yifanwu yifanwu left a comment


Just left some notes from a quick skim of the code. I will do a more thorough review once I get to check out the code and test out corner cases.

@dataclass
class GraphSegment:
"""
This class contains information required to wrap a lineapy Graph object as a function

Should have richer documentation here. Ideally an example code snippet and the corresponding GraphSegment.

)

# Determine the tracked variables of each node
# Fix me when you see return variables in refactor behaves strange

Not sure what this FIXME is.

self.input_variables = input_variables
self.return_variables = return_variables

def get_function_definition(self, indentation=4) -> str:

This might be in the territory of "future proofing", but there is danger in our function generation being purely text manipulation (as opposed to graph-to-graph manipulation followed by graph-to-text generation): we will not be able to reuse this analysis and logic later for execution features (e.g., caching). cc @moustafa-a for things to watch out for on the platform side.
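The kind of text-based generation in question can be sketched roughly like this (`render_function` is a hypothetical helper, not the actual get_function_definition; it shows why only strings, and no graph-level information, survive the step):

```python
# Rough sketch of text-based function generation: the function body is
# assembled as strings, so no graph structure survives for later reuse
# (e.g., caching at execution time).
def render_function(name, input_vars, body_lines, return_vars, indentation=4):
    indent = " " * indentation
    header = f"def get_{name}({', '.join(input_vars)}):"
    body = [indent + line for line in body_lines]
    ret = indent + "return " + ", ".join(return_vars)
    return "\n".join([header, *body, ret])
```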

(parent_id, node.id)
for node in nodes
for parent_id in node.parents()
if parent_id in set(self.ids.keys())

In what scenarios will the parent_id be missing from self.ids.keys()? Should we throw an error or let it pass? What are the downstream implications?

@mingjerli (author):

The Graph.get_subgraph(nodes) method needs this constraint to guarantee the returned graph does not contain extra nodes.


return codeblock

def get_import_block(self, indentation=0) -> str:

Fix the existing get_import_block function and use it here instead of redefining it.
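A single shared helper might look like this (a hypothetical sketch of the merge being requested; the signature and the module-name-strings input are assumptions, not the existing get_import_block):

```python
# Hypothetical sketch of one shared import-block generator that both call
# sites could reuse instead of maintaining two copies.
def get_import_block(modules, indentation=0):
    indent = " " * indentation
    # dedupe and sort so generated import blocks are deterministic
    return "\n".join(f"{indent}import {m}" for m in sorted(set(modules)))
```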

# right after calculation in case of mutation downstream
artifacts = []
df = get_df()
artifacts.append(copy.deepcopy(df))

Only concern here is that deepcopy might not work with all object types, and this code generation might restrict us in the future.
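The concern is concrete: some object types cannot be deep-copied at all. For example:

```python
import copy

# Generators (similarly open file handles, locks, etc.) cannot be
# deep-copied; copy.deepcopy raises TypeError for them, so generated
# code that deepcopies every artifact value would break on such objects.
gen = (i for i in range(3))
try:
    copy.deepcopy(gen)
except TypeError as err:
    print("deepcopy failed:", err)
```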

@lionsardesai lionsardesai left a comment


Approving, conditional on a refactor after multi-session collections and script writers are done (sangyoon + mjl going to sync on this). Also conditional on a change that merges the existing import function with the new one introduced by MJL here.

@mingjerli mingjerli merged commit a909a0c into main Jul 28, 2022
@lionsardesai lionsardesai deleted the LIN-467-artifact-dependencies-within-same-session branch October 20, 2022 17:57