Return specified intermediate outputs #104

Closed
AlexanderGeiger opened this issue Aug 28, 2019 · 3 comments · Fixed by #105
AlexanderGeiger commented Aug 28, 2019

Description

We want to introduce a way of specifying exactly which variable(s) from which primitive(s) should be returned. This way we would have the ability to get multiple intermediate outputs from the pipeline without needing to return the whole context.

Possible approach

We let the user define in the pipeline JSON which intermediate outputs they want to see. Using that information, MLBlocks keeps track of the outputs while iterating over the primitives and returns a dictionary containing all of them.
The JSON could look like:

{
    ...
    "intermediate_output": [
        "sklearn.preprocessing.MinMaxScaler#1.X",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y"
    ]
}

Also, we might want to add a general output field to the JSON, where the user can specify what the final output of the pipeline will be; that output would then be returned as an array.
Then we would have both the general output of the pipeline and the intermediate outputs.
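
As a rough illustration, the tracking inside the produce loop could look something like this (all names here are hypothetical, not actual MLBlocks internals):

# Hypothetical sketch: self.blocks maps "{primitive-name}#{counter}" to a
# block, and self.intermediate_output is the list read from the pipeline JSON.
def produce(self, context):
    captured = {}
    for block_name, block in self.blocks.items():
        # run the primitive and merge its outputs into the context
        context.update(block.produce(**context))
        # capture any requested variables produced by this primitive
        for spec in self.intermediate_output:
            spec_block, _, variable = spec.rpartition('.')
            if spec_block == block_name:
                captured[spec] = context[variable]

    return captured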

@csala you already had some specifics about the implementation in mind, so please let me know what you think about it and how you would do it.

@AlexanderGeiger AlexanderGeiger changed the title Specified intermediate outputs Return specified intermediate outputs Aug 28, 2019
dyuliu commented Aug 29, 2019

@AlexanderGeiger

I like this way of dumping the intermediate outputs.

One question about adding a general output field to the JSON: what is the difference between the "general output field" and "intermediate output"? They seem to serve the same purpose. In that regard, why not just rename "intermediate_output" to "output"?


csala commented Aug 29, 2019

I like the approach of specifying the outputs in the JSON file, but I want to suggest a slightly different JSON structure.

The concepts would be:

  • An output specification is a string that follows the same pattern as the current output_ specification: {primitive-name}#{counter}.{variable-name}. However, contrary to the output_ argument of fit and predict, in this case the {variable-name} part is mandatory and cannot be skipped (see the small parsing sketch after this list).
  • Optionally, the output specification can be a list of strings instead of a single string, all of them following the same pattern.
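
Since the {variable-name} part never contains a dot, splitting such a specification into its components is straightforward. An illustrative helper (not MLBlocks code):

# Illustrative only: split "{primitive-name}#{counter}.{variable-name}".
def parse_output_spec(spec):
    block_name, _, variable = spec.rpartition('.')
    primitive, _, counter = block_name.rpartition('#')
    return primitive, int(counter), variable

parse_output_spec("keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat")
# ('keras.Sequential.LSTMTimeSeriesRegressor', 1, 'y_hat')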

Then, in the JSON I would do the following:

  • Add an outputs field in the JSON. This field is optional, and can be:
    • missing: all the outputs from the produce method of the last pipeline step will be taken as the default output specification, in the same order (this is the current behavior).
    • A string or a list: it is just the default pipeline output specification.
    • A dictionary: it must contain at least one entry called default. This entry will be considered the default pipeline output specification, just like in the previous cases. And, apart from the default, any number of other named output specifications can be added.

Some examples of possible specifications:

"output": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"

"output": [
    "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
]

"output": {
    "default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
}

"output": {
    "default": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y,
    "debug": [
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
        "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
    ]
}

Notice how the first three examples are completely equivalent, and only the last one introduces an alternative output.
Also, internally, all three options will end up represented in the dict format, with an entry called "default".
Finally, notice how the default output is NOT included in the list of debug outputs.

Now, the behavior will be the following: when executing the pipeline, the output_ argument will allow the user either to specify individual outputs (as in the current behavior) or to give "named outputs".
These named outputs can be "default" or any other name specified, like "debug".

And the internal behavior will be:

  • If no output_ is given, "default" will be used.
  • If a string is given, it can either be one of the named outputs ("default", "debug", etc.) or one individual output specification ("{primitive-name}#{counter}.{variable-name}").
  • If a list is given, each element in the list can either be a named output or an individual output specification. If named outputs are given, they will be concatenated, forming a single output specification that contains all the elements from all the named outputs, in order.

Finally, when returning, if the output specification ends up having a single element, that element will be returned alone.
If more than one element exists in the output specification, all the elements will be returned as a tuple, in the exact same order, like in any multi-output method call.
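
A minimal sketch of this resolution and return logic, assuming the named outputs live in a hypothetical self._outputs dict built from the JSON (illustrative names only):

def _resolve_output_specs(self, output_):
    # Hypothetical: normalize output_ into a flat list of individual
    # "{primitive-name}#{counter}.{variable-name}" specifications.
    if output_ is None:
        output_ = 'default'
    if isinstance(output_, str):
        output_ = [output_]

    specs = []
    for item in output_:
        if item in self._outputs:
            # named output ("default", "debug", ...): concatenate in order
            named = self._outputs[item]
            specs.extend(named if isinstance(named, list) else [named])
        else:
            # individual output specification
            specs.append(item)

    return specs

def _format_result(values):
    # a single element is returned alone; several, as a tuple
    return values[0] if len(values) == 1 else tuple(values)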

Following this specification, if a pipeline is created using the last output example above, all these calls would be valid:

# return the default output, which is the y in the last primitive
anomalies = pipeline.predict(X)
anomalies = pipeline.predict(X, output_="default")

# return ONLY the debug outputs
X, y, target_index, y_hat = pipeline.predict(X, output_="debug")

# return BOTH the default and the debug outputs
anomalies, X, y, target_index, y_hat = pipeline.predict(X, output_=["default", "debug"])

# return ONLY one variable, y_hat
y_hat = pipeline.predict(X, output_="keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat")

# return the default output and also one variable
anomalies, y_hat = pipeline.predict(X, output_=["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])

On a side note, the "get the whole context" behavior from the current implementation should be kept. This means that, even though the JSON specification will always require the {variable-name}, the output_ contents can point at a particular primitive context without a variable. In this case, a deep copy of that context will be returned in its place.
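
That fallback could be handled at extraction time, for example (again, only a sketch):

import copy

def _extract_output(spec, context):
    # If there is no "{variable-name}" after the counter, the spec points
    # at a primitive context: return a deep copy of the whole context.
    tail = spec.rsplit('#', 1)[-1]
    if '.' not in tail:
        return copy.deepcopy(context)
    return context[spec.rsplit('.', 1)[-1]]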

@csala csala added the under discussion The issue is still being discussed label Aug 29, 2019

csala commented Aug 29, 2019

Here is an additional proposal on top of the previous one.

Apart from specifying the outputs in the JSON file as a single string, allow them to be specified as a dictionary with two entries:

  • name: a final name for the output. For example, "anomalies".
  • variable: the output specification from above: {primitive-name}#{counter}.{variable-name}.

On top of that, add these two methods to the MLPipeline object:

  • get_outputs(outputs=None): Return the list of dictionaries with the specification of the outputs that will be returned. If no outputs are passed, return the default outputs. Otherwise, resolve the given outputs specification and return the list of their specifications.
  • get_output_names(outputs=None): Just like get_outputs, but return the name of each output instead of its complete specification. If an output has no name because it was given as a single string, return the string itself. (A sketch of both methods follows this list.)
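
A sketch of how both methods could behave, assuming the outputs are stored internally in the dict format described above (hypothetical internals):

def get_outputs(self, outputs=None):
    # Hypothetical: resolve `outputs` into the list of specifications.
    if outputs is None:
        outputs = ['default']
    elif not isinstance(outputs, list):
        outputs = [outputs]

    specs = []
    for output in outputs:
        if output in self._outputs:
            named = self._outputs[output]
            specs.extend(named if isinstance(named, list) else [named])
        else:
            # a raw "{primitive-name}#{counter}.{variable-name}" string
            specs.append(output)

    return specs

def get_output_names(self, outputs=None):
    # Entries given as plain strings have no name, so the string itself
    # is returned in their place.
    return [
        spec['name'] if isinstance(spec, dict) else spec
        for spec in self.get_outputs(outputs)
    ]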

For example, if the pipeline JSON specifies:

"output": {
    "default": {
        "name": "events"
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y",
    },
    "debug": [
        {
            "name": "X",
            "variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.X",
        },
        {
            "name": "y",
            "variable": "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.y",
        },
        {
            "name": "index",
            "variable: "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences#1.target_index",
        {
            "name": "y_hat",
            "variable": "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat",
        }
    ]
}

One can do:

>>> pipeline.get_outputs()
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    }
]
>>> pipeline.get_output_names()
["events"]
>>> pipeline.get_outputs(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
[
    {
        "name": "events",
        "variable": "mlprimitives.custom.timeseries_anomalies.find_anomalies#1.y"
    },
    "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"
]
>>> pipeline.get_output_names(["default", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"])
["events", "keras.Sequential.LSTMTimeSeriesRegressor#1.y_hat"]
>>> pipeline.get_output_names(["default", "debug"])
["events", "X", "y", "index", "y_hat"]

And, potentially:

>>> outputs = ["default", "debug"]
>>> output_names = pipeline.get_output_names(outputs)
>>> output_values = pipeline.predict(data, output_=outputs)
>>> output_dict = dict(zip(output_names, output_values))
>>> output_dict
{
    "events": ...,
    "X": ...,
    "y": ...,
    ...
}

@csala csala self-assigned this Aug 30, 2019
@csala csala added approved The issue is approved and work can be started new feature and removed under discussion The issue is still being discussed labels Aug 30, 2019
@csala csala added this to the 0.3.3 milestone Aug 30, 2019
@csala csala closed this as completed in #105 Sep 5, 2019