# Flexible outputs specification

In a previous tutorial we learnt how to obtain intermediate pipeline
outputs in order to debug the pipeline's internal behavior.

In this guide we will go a bit further and learn how to define flexible outputs
for the pipeline in order to obtain the output of multiple primitives
at once.

Note that, for simplicity, some steps are not explained here. Full details
about them can be found in the previous parts of the tutorial.

We will:

1. Load a pipeline and a dataset
2. Explore the output specification formats

## Load a pipeline and a dataset

The first step will be to load the Census dataset and the pipeline that we will be using.

In [1]:
from mlprimitives.datasets import load_dataset

dataset = load_dataset('census')

In [2]:
X_train, X_test, y_train, y_test = dataset.get_splits(1)

In [3]:
from mlblocks import MLPipeline

primitives = [
    'mlprimitives.custom.preprocessing.ClassEncoder',
    'mlprimitives.custom.feature_extraction.CategoricalEncoder',
    'sklearn.impute.SimpleImputer',
    'xgboost.XGBClassifier',
    'mlprimitives.custom.preprocessing.ClassDecoder'
]
pipeline = MLPipeline(primitives)

Also, just as a reminder, let's have a quick look at the steps of this pipeline:

In [4]:
pipeline.primitives

['mlprimitives.custom.preprocessing.ClassEncoder',
 'mlprimitives.custom.feature_extraction.CategoricalEncoder',
 'sklearn.impute.SimpleImputer',
 'xgboost.XGBClassifier',
 'mlprimitives.custom.preprocessing.ClassDecoder']

Let's also look at the `X` and `y` variables that we will be passing to our pipeline.

`X` is a `pandas.DataFrame` that contains the demographics data of the subjects:

In [5]:
X_train.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28291 | 25 | Private | 193379 | Assoc-acdm | 12 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 45 | United-States |
| 28636 | 55 | Federal-gov | 176904 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States |
| 7919 | 30 | Private | 284395 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States |
| 24861 | 17 | Private | 239346 | 10th | 6 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 18 | United-States |
| 23480 | 51 | Private | 57698 | HS-grad | 9 | Married-spouse-absent | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |


`y` is a `numpy.ndarray` that contains the labels indicating whether each subject's
salary is above or below 50K.

In [6]:
y_train[0:5]

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K'], dtype=object)

## Explore the output specification formats

In the previous tutorial we learnt that the output of a pipeline can be specified
in multiple formats:

* An integer indicating the pipeline step index, which will return the complete
  context after the corresponding step has been produced.
* A string indicating the name of a step, which will also return the complete
  context after the corresponding step has been produced.
  
Apart from these two options, there are a few more.

### Single variable specification

Variables can be individually specified by passing a string in the format
`{pipeline-step-name}.{variable-name}`.

Note that the `pipeline-step-name` part is not only the primitive name, but
also the counter number at the end of it.
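
As a rough illustration of this naming rule, the sketch below derives step names by
appending a per-primitive counter, then builds a variable specification and splits it
back into its two parts. Both `step_names` and `split_spec` are hypothetical helpers
written for this guide, not part of the MLBlocks API, and the `#` counter format is
assumed from the step names used in this tutorial.

```python
from collections import Counter


def step_names(primitives):
    """Hypothetical helper: append a per-primitive counter to each name."""
    counts = Counter()
    names = []
    for primitive in primitives:
        counts[primitive] += 1
        names.append('{}#{}'.format(primitive, counts[primitive]))
    return names


def split_spec(spec):
    """Hypothetical helper: split a spec on its last dot."""
    # The step name itself contains dots, so only the last one is a separator.
    step_name, _, variable = spec.rpartition('.')
    return step_name, variable


names = step_names([
    'mlprimitives.custom.preprocessing.ClassEncoder',
    'sklearn.impute.SimpleImputer',
    'sklearn.impute.SimpleImputer',
])

# Build a '{pipeline-step-name}.{variable-name}' specification for the first step.
spec = '{}.{}'.format(names[0], 'classes')

print(names)
print(split_spec(spec))
```

In this sketch a primitive that appears twice gets the suffixes `#1` and `#2`, which
is why the counter has to be part of the step name.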

For example, if we want to explore the `classes` variable generated by
the `ClassEncoder` primitive during `fit`, we can do the following:

In [7]:
output_spec = 'mlprimitives.custom.preprocessing.ClassEncoder#1.classes'
pipeline.fit(X_train, y_train, output_=output_spec)

array([' <=50K', ' >50K'], dtype=object)

**NOTE**: Just like with the full context specification, when a variable is specified
the pipeline will be executed only up to the step that produces the indicated variable.

### List of variables

In some cases we will be interested in obtaining more than one variable
at a time.

In order to do this, instead of a single string specification we can pass
a list of strings.

In [8]:
output_spec = [
    'mlprimitives.custom.preprocessing.ClassEncoder#1.y',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.classes',
]
out = pipeline.fit(X_train, y_train, output_=output_spec)

The output will be a `tuple` containing the variables in the specified order.

In [9]:
y, classes = out
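
Because the tuple follows the order of the specification, the two can be paired to
look up each value by name. The sketch below uses placeholder values instead of a
real pipeline output, so the contents of `out` here are illustrative assumptions only.

```python
output_spec = [
    'mlprimitives.custom.preprocessing.ClassEncoder#1.y',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.classes',
]

# Placeholder standing in for the tuple returned by pipeline.fit(...);
# the values are made up for illustration.
out = ([0, 0, 1], [' <=50K', ' >50K'])

# Pair each specification with the value found at the same position.
named = dict(zip(output_spec, out))
classes = named['mlprimitives.custom.preprocessing.ClassEncoder#1.classes']
print(classes)
```

This can be handy when the list of requested variables grows and positional
unpacking becomes hard to read.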

If we want to obtain variables from multiple pipeline steps we simply need
to specify all of them at once. Again, **MLBlocks** will run all the necessary
pipeline steps, accumulating the desired variables up to the last step needed.

In [10]:
output_spec = [
    'sklearn.impute.SimpleImputer#1.X',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.y',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.classes',
]
X, y, classes = pipeline.fit(X_train, y_train, output_=output_spec)
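
One way to picture which steps have to run is to find, for each requested variable,
the position of the step that produces it and take the largest one. The sketch below
does only that bookkeeping. `last_step_needed` is a hypothetical helper and is not
meant to reflect how MLBlocks implements this internally; the `#1` step names are
assumed from the specifications used in this tutorial.

```python
def last_step_needed(step_names, output_spec):
    """Hypothetical helper: index of the last step referenced by any spec."""
    indexes = []
    for spec in output_spec:
        # Split on the last dot: the step name itself contains dots.
        step_name, _, _variable = spec.rpartition('.')
        indexes.append(step_names.index(step_name))
    return max(indexes)


steps = [
    'mlprimitives.custom.preprocessing.ClassEncoder#1',
    'mlprimitives.custom.feature_extraction.CategoricalEncoder#1',
    'sklearn.impute.SimpleImputer#1',
    'xgboost.XGBClassifier#1',
    'mlprimitives.custom.preprocessing.ClassDecoder#1',
]
output_spec = [
    'sklearn.impute.SimpleImputer#1.X',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.y',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.classes',
]

last = last_step_needed(steps, output_spec)
print(last)
```

Under this reading, the imputer is the last step needed for the specification above,
so the classifier and the decoder would not have to be run.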

If required, we can even capture the same variable at different pipeline steps!

In [11]:
output_spec = [
    'mlprimitives.custom.feature_extraction.CategoricalEncoder#1.X',
    'sklearn.impute.SimpleImputer#1.X',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.y',
    'mlprimitives.custom.preprocessing.ClassEncoder#1.classes',
]
X_1, X_2, y, classes = pipeline.fit(X_train, y_train, output_=output_spec)

In [12]:
X_1.shape

(24420, 108)

In [13]:
X_2.shape

(24420, 108)