# ML Pipelines with ZenML

***Key Concepts:*** *ML Pipelines, Steps*

In this notebook, we will learn how to easily convert existing ML code into ML pipelines using ZenML.

Machine learning in production consists of a wide variety of tasks, ranging from experiment tracking to orchestration, from model deployment to monitoring, and from drift detection to feature stores. Even though there are already some seemingly well-established solutions for these tasks, it can become increasingly difficult to establish a reliable and modular production system once all these solutions are brought together. This problem is especially critical when switching from a research setting to a production setting. Due to a lack of standards, the time and resources invested in proofs of concept frequently go to waste because the initial system cannot easily be transferred to a production-grade setting.

ZenML was introduced to solve this problem. It provides a set of standards and well-structured abstractions. It is essential that these abstractions cover not only concepts such as pipelines and steps but also the infrastructure elements on which the pipelines run, which helps to simplify infrastructure configuration and management. ZenML is a framework for creating reproducible, production-ready machine learning pipelines. It is built for data scientists to transition their models from a local experimental setup to a robust, modern MLOps infrastructure in production.

Since we will build models with [sklearn](https://scikit-learn.org/stable/), you will need to have the ZenML sklearn integration installed. You can install ZenML and the sklearn integration with the following command, which will also restart the kernel of your notebook.

In [None]:
%pip install zenml
!zenml integration install sklearn -y
%pip install pyparsing==2.4.2  # required for Colab

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

As an ML practitioner, you are probably familiar with building ML models using Scikit-learn, PyTorch, TensorFlow, or similar. An **[ML Pipeline](https://docs.zenml.io/developer-guide/steps-and-pipelines)** is simply an extension, including other steps you would typically do before or after building a model, like data acquisition, preprocessing, model deployment, or monitoring. The ML pipeline essentially defines a step-by-step procedure of your work as an ML practitioner. Defining ML pipelines explicitly in code is great because:
- We can easily rerun all of our work, not just the model, eliminating bugs and making our models easier to reproduce.
- Data and models can be versioned and tracked, so we can see at a glance which dataset a model was trained on and how it compares to other models.
- If the entire pipeline is coded up, we can automate many operational tasks, like retraining and redeploying models when the underlying problem or data changes or rolling out new and improved models with CI/CD workflows.

Having a clearly defined ML pipeline is essential for ML teams that aim to serve models on a large scale.

## ZenML Setup
Throughout this series, we will define our ML pipelines using [ZenML](https://github.com/zenml-io/zenml/). ZenML is an excellent tool for this task, as it is straightforward and intuitive to use and has [integrations](https://docs.zenml.io/mlops-stacks/integrations) with most of the advanced MLOps tools we will want to use later. Make sure you have ZenML installed (via `pip install zenml`). Let's run some commands to make sure you start with a fresh ML stack. You can ignore the details for now, as we will cover them in a later chapter.

In [None]:
!rm -rf .zen

### Initialize the ZenML repository

The `zenml init` command below creates a local directory containing the configuration for your MLOps stack. Stacks represent different configurations of MLOps tools and infrastructure; each stack consists of multiple Stack Components, each of which comes in several Flavors. The default local stack is named `default`. This local configuration only takes effect when you run ZenML from the initialized repository root or from one of its subdirectories. The default stack consists of:

- Orchestrator: This is essentially your Python kernel.

- Artifact store: This stores all the artifacts that flow between steps.

- Metadata store: This keeps track of all the parameters that flow through your pipeline.

Repositories link stacks to the pipeline and step code of your ML projects. 

In [None]:
!zenml init

Profiles manage these stacks and enable having various ZenML configurations on the same machine. 

In [None]:
!zenml profile create zenbytes

In [None]:
!zenml profile set zenbytes

In [None]:
!zenml stack set default

In [None]:
!zenml stack get

## Example Experimentation ML Code
Let us get started with some simple example ML code. In the following, we train a scikit-learn SVC classifier to classify images of handwritten digits. We load the data, train a model on the training set, then test it on the test set.
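
As a reference point, the experiment described above can be sketched in plain scikit-learn, before any ZenML is involved. This is a minimal sketch, not necessarily the exact code of the original experiment; it assumes the same unshuffled 80/20 split and `SVC(gamma=0.001)` settings that the pipeline steps later in this notebook use. The function name `train_test` is chosen here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_test():
    """Train and test an SVC classifier on the sklearn digits dataset."""
    digits = load_digits()
    # Flatten each 8x8 image into a 64-dimensional feature vector.
    data = digits.images.reshape((len(digits.images), -1))
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.2, shuffle=False
    )
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    # Mean accuracy on the held-out test set.
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc


test_acc = train_test()
```

Everything lives in one function here, which is convenient for experimentation but makes it hard to rerun or swap out individual parts.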

Let's first do the imports:

In [None]:
!git clone https://github.com/ultralytics/yolov5

In [2]:
%cd yolov5

C:\Users\vsriniva\Desktop\object_detection\yolov5


In [4]:
%pip install -r requirements.txt --proxy  http://approxy.rockwellcollins.com:9090





In [7]:
!python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache

In [8]:
!python train.py --img 320 --batch 1 --epochs 2 --data dataset.yaml --weights yolov5s.pt --cache

[34m[1mtrain: [0mweights=yolov5s.pt, cfg=, data=dataset.yaml, hyp=data\hyps\hyp.scratch-low.yaml, epochs=2, batch_size=1, imgsz=320, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=ram, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
fatal: unable to access 'https://github.com/ultralytics/yolov5/': Could not resolve proxy: approxy.rockwellcollins.com
YOLOv5  v6.2-53-gf0e5a60 Python-3.8.8 torch-1.12.1+cpu CPU

[34m[1mhyperparameters: [0mlr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0,

Command 'git fetch origin' returned non-zero exit status 128.


## Turning experiments into ML pipelines with ZenML

In ZenML, every part of a workflow can be defined as a function using the functional API. All you have to do is define a function, declare its inputs and its output, and write Python code in between. You then decorate that function with the `@step` decorator, which you import from ZenML. Steps are the atomic components of a ZenML pipeline. Each step is defined by its inputs, the logic it applies, and its outputs.
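
To make the idea of "a function plus a decorator" concrete, here is a toy sketch in plain Python. It is not ZenML's implementation: `ToyStep` and `toy_step` are hypothetical names invented for this illustration. The sketch only shows how a decorator can read a function's type annotations to learn its inputs and output while leaving the logic untouched.

```python
from typing import Callable, get_type_hints


class ToyStep:
    """Hypothetical stand-in for a pipeline step (not ZenML's class)."""

    def __init__(self, func: Callable):
        self.func = func
        self.name = func.__name__
        hints = get_type_hints(func)
        # The return annotation describes the step's output ...
        self.output_type = hints.pop("return", None)
        # ... and the remaining annotations describe its inputs.
        self.input_types = hints

    def __call__(self, *args, **kwargs):
        # Run the wrapped logic unchanged.
        return self.func(*args, **kwargs)


def toy_step(func: Callable) -> ToyStep:
    """Hypothetical decorator that wraps a function into a ToyStep."""
    return ToyStep(func)


@toy_step
def add_one(x: int) -> int:
    """The step logic: plain Python between the inputs and the output."""
    return x + 1


print(add_one.name, add_one.input_types, add_one.output_type)
result = add_one(x=41)
```

The point of the sketch is that type-annotated inputs and outputs give a framework enough information to connect steps to each other, which is why the steps in this notebook are fully annotated.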

As a simple illustration, assume your ML workflow consists of data loading, model training, and model evaluation. In practice, your ML workflows will, of course, be much more complicated than that. You might have complex preprocessing that you do not want to redo every time you train a model, you will need to compare the performance of different models, deploy them in a production setting, and much more. This is where ML pipelines come into play, allowing us to define our workflows in modular steps that we can then mix and match.

![Digits Pipeline](https://github.com/zenml-io/zenbytes/blob/main/_assets/1-1/digits_pipeline.png?raw=1)

We can identify three distinct steps in our example: data loading, model training, and model evaluation. Let us now define each of them as a ZenML **[Pipeline Step](https://docs.zenml.io/developer-guide/steps-and-pipelines#step)** simply by moving each step to its own function and decorating them with ZenML's `@step` [Python decorator](https://realpython.com/primer-on-python-decorators/).


In [None]:
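import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from zenml.steps import step, Output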

@step
def importer() -> Output(
    X_train=np.ndarray,
    X_test=np.ndarray,
    y_train=np.ndarray,
    y_test=np.ndarray,
):
    """Load the digits dataset as numpy arrays."""
    digits = load_digits()
    data = digits.images.reshape((len(digits.images), -1))
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.2, shuffle=False
    )
    return X_train, X_test, y_train, y_test

As this step has multiple outputs, we need to use the `zenml.steps.step_output.Output` class to indicate the name of each output. These names can be used to directly access the outputs of steps after running a pipeline.

Let's come up with a second step that consumes the output of our first step and performs some sort of transformation on it. In this case, let's train a support vector machine classifier on the training data using sklearn:

In [None]:
@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn SVC classifier."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model

In [None]:
@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the test set accuracy of an sklearn model."""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc

Similarly, we can use ZenML's `@pipeline` decorator to connect all of our steps into an ML pipeline. The pipeline is agnostic of the step implementations; it is defined by routing outputs through the steps within the pipeline.

Note that the pipeline definition does not depend on the concrete step functions we defined above; it merely establishes a recipe for how data moves through the steps. This means we can replace steps as we wish, e.g., to run the same pipeline with different models to compare their performances.

In [None]:
from zenml.pipelines import pipeline


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

If you want to run a step function outside the context of a ZenML pipeline, all you need to do is call its `.entrypoint()` method with the same input signature. For example:

`trainer.entrypoint(X_train=..., y_train=...)`
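
The claim that the pipeline definition is only a recipe, independent of the concrete steps, can be illustrated with a plain-Python analogue. This sketch deliberately uses ordinary functions rather than ZenML steps, and all names in it (`digits_recipe`, `load_data`, `train_svc`, `train_tree`, `score`) are hypothetical. The decision tree is just an assumed alternative model to show the swap.

```python
from sklearn.base import ClassifierMixin
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def load_data():
    """Load the digits dataset and split it without shuffling."""
    digits = load_digits()
    data = digits.images.reshape((len(digits.images), -1))
    return train_test_split(data, digits.target, test_size=0.2, shuffle=False)


def train_svc(X_train, y_train) -> ClassifierMixin:
    return SVC(gamma=0.001).fit(X_train, y_train)


def train_tree(X_train, y_train) -> ClassifierMixin:
    # Hypothetical alternative trainer, used only to show the swap.
    return DecisionTreeClassifier(random_state=0).fit(X_train, y_train)


def score(X_test, y_test, model) -> float:
    return model.score(X_test, y_test)


def digits_recipe(importer, trainer, evaluator) -> float:
    """Plain-Python analogue of the pipeline: only the data flow is fixed."""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train, y_train)
    return evaluator(X_test, y_test, model)


# Same recipe, two different trainers.
svc_acc = digits_recipe(load_data, train_svc, score)
tree_acc = digits_recipe(load_data, train_tree, score)
print(f"SVC: {svc_acc:.3f}, decision tree: {tree_acc:.3f}")
```

Because the recipe only fixes how data flows between the three roles, comparing models comes down to passing in a different trainer.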

## Running ZenML Pipelines
Finally, we initialize our pipeline with concrete step functions and call the `run()` method to run it. With the pipeline recipe in hand, you can specify which concrete step implementations to use when instantiating the pipeline:

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)

Currently, you cannot use the same step twice in a pipeline because step names must be unique. If you would like to reuse a step, use the `clone_step()` utility function to create a copy of the step with a new name.

By default, when you run a pipeline by calling `my_pipeline.run()`, ZenML uses the current date and time as the name of the pipeline run. To give a run a custom name, pass `run_name` as a parameter to the `run()` function:

`pipeline_instance.run(run_name="custom_pipeline_run_name")`

Pipeline run names must be unique, so make sure to compute the name dynamically if you plan to run your pipeline multiple times.
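
One possible way to compute a run name dynamically is to combine a readable timestamp with a short random suffix. The helper below is a sketch under that assumption; `make_run_name` and the `digits_svc` prefix are hypothetical names, not part of ZenML.

```python
import uuid
from datetime import datetime, timezone


def make_run_name(prefix: str = "digits_svc") -> str:
    """Build a run name that is unique for practical purposes.

    The timestamp keeps names readable and sortable; the short random
    suffix guards against two runs started within the same second.
    """
    timestamp = datetime.now(timezone.utc).strftime("%Y_%m_%d-%H_%M_%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}-{timestamp}-{suffix}"


run_name = make_run_name()
print(run_name)
# The name could then be passed on, for example:
# pipeline_instance.run(run_name=run_name)
```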

You can then execute your pipeline instance with the `.run()` method:

In [None]:
digits_svc_pipeline.run()

And that's it, we just built our first ML pipeline! Great job!