# Partial execution and pipeline debugging

In this guide we will show you how to execute a pipeline partially in order to
debug its internal behavior or optimize tuning processes.

For brevity, some steps are not explained here. Full details
about them can be found in the previous parts of the tutorial.

We will:

1. Load a pipeline and a dataset.
2. Explore the context after fitting the first primitive.
3. Fit the rest of the pipeline.
4. Execute the pipeline partially during predict.
5. Rerun the last steps with different hyperparameters.

## Load a pipeline and a dataset

The first step will be to load the Census dataset.

In [1]:
from mlprimitives.datasets import load_dataset

dataset = load_dataset('census')

In [2]:
X_train, X_test, y_train, y_test = dataset.get_splits(1)

As a reminder, let's have a look at the `X` and `y` variables that we will be passing to our
pipeline.

`X` is a `pandas.DataFrame` that contains the demographic data of the subjects:

In [3]:
X_train.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28291 | 25 | Private | 193379 | Assoc-acdm | 12 | Never-married | Craft-repair | Not-in-family | White | Male | 0 | 0 | 45 | United-States |
| 28636 | 55 | Federal-gov | 176904 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States |
| 7919 | 30 | Private | 284395 | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States |
| 24861 | 17 | Private | 239346 | 10th | 6 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 18 | United-States |
| 23480 | 51 | Private | 57698 | HS-grad | 9 | Married-spouse-absent | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |


And `y` is a `numpy.ndarray` that contains the label that indicates whether the subject has a salary
above or under 50K.

In [4]:
y_train[0:5]

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K'], dtype=object)

Next, we build a suitable pipeline for our dataset.

In [5]:
from mlblocks import MLPipeline

primitives = [
    'mlprimitives.custom.preprocessing.ClassEncoder',
    'mlprimitives.custom.feature_extraction.CategoricalEncoder',
    'sklearn.impute.SimpleImputer',
    'xgboost.XGBClassifier',
    'mlprimitives.custom.preprocessing.ClassDecoder'
]
pipeline = MLPipeline(primitives)
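Each step of this pipeline can be referred to either by its position in the list or by a name made of the primitive name and a counter, such as `xgboost.XGBClassifier#1`. As a rough illustration of that naming scheme, here is a minimal sketch in plain Python. It assumes that the counter simply numbers repeated occurrences of the same primitive, starting at 1. The `name_steps` helper is hypothetical and is not part of MLBlocks.

```python
from collections import Counter


def name_steps(primitives):
    """Hypothetical helper: append a per-primitive counter to each name."""
    seen = Counter()
    names = []
    for primitive in primitives:
        seen[primitive] += 1  # how many times this primitive has appeared so far
        names.append('{}#{}'.format(primitive, seen[primitive]))
    return names


# Illustrative list in which one primitive is used twice.
step_names = name_steps([
    'sklearn.impute.SimpleImputer',
    'xgboost.XGBClassifier',
    'sklearn.impute.SimpleImputer',
])
```

Under this assumption, a primitive that appears only once always gets the `#1` suffix, which matches the step names used in the rest of this guide.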

## Explore the context after fitting the first primitive

Now that we know which primitives the pipeline contains, we will execute only the first one
and see how the context changes after it.

For this, we will execute the `fit` method passing the index of the last pipeline
step that we want to execute before returning. In this case, `0`.

In [6]:
fit_context = pipeline.fit(X_train, y_train, output_=0)

**NOTE**: Optionally, instead of passing the pipeline step index, we could pass the complete name
of the step, including the counter number: `mlprimitives.custom.preprocessing.ClassEncoder#1`.

In [7]:
output_step = 'mlprimitives.custom.preprocessing.ClassEncoder#1'
fit_context = pipeline.fit(X_train, y_train, output_=output_step)

In both cases, the output will be a dictionary containing all the context variables after
the first pipeline step has been fitted and has produced its output.

In [8]:
fit_context.keys()

dict_keys(['X', 'y', 'classes'])

Notice how we find the `X` and `y` variables that we passed to the `fit` method, but also a new `classes` variable
that was generated by the `mlprimitives.custom.preprocessing.ClassEncoder` primitive of the first pipeline step.

This `classes` variable contains the list of unique values that the variable `y` originally had.

In [9]:
fit_context['classes']

array([' <=50K', ' >50K'], dtype=object)

Also notice that the variable `y` has been transformed by the primitive into an array of
integer values.

In [10]:
fit_context['y'][0:5]

array([0, 0, 0, 0, 0])
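To make this transformation concrete, the following is a minimal sketch in plain Python of the kind of encoding observed above. It is not the `ClassEncoder` implementation. The `encode_classes` and `decode_classes` helpers are hypothetical and only mimic the visible behavior: collect the sorted unique labels, then replace each label with its position in that list.

```python
def encode_classes(y):
    """Hypothetical helper: return (classes, encoded_y) for a list of labels."""
    classes = sorted(set(y))  # unique labels in a stable order
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in y]


def decode_classes(classes, encoded):
    """Hypothetical inverse: map integer codes back to the original labels."""
    return [classes[i] for i in encoded]


labels = [' <=50K', ' >50K', ' <=50K', ' <=50K', ' >50K']
classes, encoded = encode_classes(labels)
restored = decode_classes(classes, encoded)
```

Keeping `classes` in the context is what allows a later step to translate integer predictions back into the original labels.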

## Fit the rest of the pipeline

After exploring the context generated by the first pipeline step, we will now run
a few more steps, up to the point where the feature matrix is ready for the `XGBClassifier`.

For this we will run the `fit` method again passing back the context that we just obtained
as well as the `start_` argument indicating that we need to start fitting on the second
step of the pipeline, skipping the first one, and the `output_` argument indicating that
we want to stop on the third step, right before the `XGBClassifier` primitive.

Note how the context is passed using a double asterisk `**` syntax, but that individual
variables could also be passed as keyword arguments.
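The equivalence between the two calling styles is standard Python behavior and can be checked with a small stand-in function. The `fit` function below is hypothetical: it only echoes the arguments it receives and is not the `MLPipeline.fit` method.

```python
def fit(X=None, y=None, start_=None, output_=None, **context):
    """Hypothetical stand-in that reports the arguments it received."""
    received = {'X': X, 'y': y, **context}
    return start_, output_, received


context = {'X': [[1.0]], 'y': [0], 'classes': ['a', 'b']}

# Unpacking the dictionary with ** ...
unpacked = fit(start_=1, output_=2, **context)
# ... is the same as spelling out each variable as a keyword argument.
explicit = fit(start_=1, output_=2, X=[[1.0]], y=[0], classes=['a', 'b'])
```

Both calls deliver exactly the same arguments, so the choice between them is a matter of convenience.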

In [11]:
fit_context = pipeline.fit(start_=1, output_=2, **fit_context)

The context still contains the same variables as before:

In [12]:
fit_context.keys()

dict_keys(['classes', 'X', 'y'])

But the variable `X` has been completely transformed by the `CategoricalEncoder` and `SimpleImputer`
primitives, so it is now a fully numerical `numpy.ndarray` ready for the `XGBClassifier`:

In [13]:
fit_context['X'][0]

array([2.50000e+01, 1.93379e+05, 1.20000e+01, 0.00000e+00, 0.00000e+00,
       4.50000e+01, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00,
       0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.000

Finally, we can pass the new context to the rest of the pipeline to finish fitting it.

Note how, just like the `output_`, the `start_` step can also be indicated using the step
name instead of the index.

In [14]:
pipeline.fit(start_='xgboost.XGBClassifier#1', **fit_context)
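The way `start_` and `output_` select a slice of the pipeline can be pictured with a toy model. The sketch below is written in plain Python under simplifying assumptions and is not how MLBlocks is implemented. The `run_steps` function and the step names are hypothetical, each step is a function that reads the context and returns the variables to update, and the full context is always returned.

```python
def run_steps(steps, context, start_=None, output_=None):
    """Toy model of partial execution over a list of (name, function) steps."""
    names = [name for name, _ in steps]

    def to_index(step):
        # Accept either an integer index or a step name such as 'double#1'.
        return names.index(step) if isinstance(step, str) else step

    first = 0 if start_ is None else to_index(start_)
    last = len(steps) - 1 if output_ is None else to_index(output_)

    context = dict(context)  # work on a copy, leave the caller's dict untouched
    for _, function in steps[first:last + 1]:
        context.update(function(context))
    return context


steps = [
    ('double#1', lambda ctx: {'X': [x * 2 for x in ctx['X']]}),
    ('increment#1', lambda ctx: {'X': [x + 1 for x in ctx['X']]}),
    ('total#1', lambda ctx: {'total': sum(ctx['X'])}),
]

partial = run_steps(steps, {'X': [1, 2, 3]}, output_=0)        # stop after step 0
resumed = run_steps(steps, partial, start_='increment#1')      # continue from step 1
complete = run_steps(steps, {'X': [1, 2, 3]})                  # run everything at once
```

In this toy model, stopping after a step and resuming from the next one yields the same final context as a single complete run, which is the property that partial execution relies on.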

## Partial execution during Predict

Just like the `fit` method, the `predict` method accepts the `output_` argument to stop the execution after a given step and return the context:

In [15]:
predict_context = pipeline.predict(X_test, output_=2)

In [16]:
predict_context.keys()

dict_keys(['X', 'y'])

The `predict` method also accepts the `start_` argument to resume the execution from a specific pipeline step:

In [17]:
predictions = pipeline.predict(start_=3, **predict_context)

In [18]:
predictions[0:5]

array([' >50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

## Rerunning the last steps

One of the key advantages of the partial execution that we just explored is the
possibility to re-fit and make new predictions multiple times with different
hyperparameter values for the last half of the pipeline without the need to
re-fit and re-execute the first half.

This has the potential to greatly accelerate tuning processes when the preprocessing
steps take a long time to execute but have no tunable hyperparameters, or none that
we want to tune.

As an example, let's evaluate the performance of the pipeline and try to optimize
it by changing some hyperparameters of the classifier.

In [19]:
dataset.score(y_test, predictions)

0.8602137329566393

In [20]:
hyperparameters = {
    'xgboost.XGBClassifier#1': {
        'learning_rate': 0.5
    }
}
pipeline.set_hyperparameters(hyperparameters)

In [21]:
pipeline.fit(start_=3, **fit_context)
predictions = pipeline.predict(start_=3, **predict_context)

In [22]:
dataset.score(y_test, predictions)

0.872251566146665
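The same pattern extends naturally to a loop over several candidate values. The sketch below illustrates the idea with plain Python stand-ins rather than MLBlocks calls. The `preprocess` and `fit_and_score` functions, the toy data, and the `threshold` parameter are all hypothetical. The point is only that the expensive first half runs once while the tunable second half runs once per candidate.

```python
calls = {'preprocess': 0, 'fit_and_score': 0}


def preprocess(X):
    """Stand-in for the slow, non-tunable first half of a pipeline."""
    calls['preprocess'] += 1
    return [x / 10 for x in X]


def fit_and_score(features, threshold):
    """Stand-in for the tunable second half: accuracy of a threshold rule."""
    calls['fit_and_score'] += 1
    targets = [0, 0, 1, 1]
    predictions = [int(f > threshold) for f in features]
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


features = preprocess([1, 4, 6, 9])  # run the slow part only once

# Re-run only the last half for each candidate hyperparameter value.
scores = {t: fit_and_score(features, t) for t in (0.2, 0.5, 0.8)}
best = max(scores, key=scores.get)
```

With the pipeline from this guide, `features` corresponds to the saved `fit_context` and `predict_context`, and each iteration would set new hyperparameters before calling `fit` and `predict` with `start_=3`.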