# Introduction

**ZenML** is an open-source MLOps library designed to manage the **end-to-end** lifecycle of machine learning (**ML**) pipelines. In this tutorial, we'll walk through how ZenML can be used in a typical data science project to manage the following steps:

- Loading Data
- Processing Data
- Exploring Models
- Fine-tuning Models
- Deployment

We will use ZenML to structure the pipeline and to automate versioning, reproducibility, and monitoring. The running example is customer churn prediction (ChurnDetect), but the same structure applies to fraud detection and other ML use cases.

In [None]:
%pip install zenml pandas scikit-learn optuna

### Terms

- **Pipeline**:
A pipeline is a sequence of steps that define the end-to-end workflow of your machine learning process. Each step performs a specific task like data cleaning, transformation, model training, etc.

- **Step**:
A step is a single operation or transformation in the pipeline. Steps can be anything from data preprocessing, feature engineering, training a model, to model evaluation.

- **Artifact**:
Artifacts are data objects passed between steps in the pipeline. They can be datasets, models, or any data produced or consumed by steps. Artifacts help ZenML track data flow between steps.

- **DataArtifact**:
A specific type of artifact that contains data (like a dataset). It is passed between the steps of the pipeline and can be saved or loaded from various sources.

- **Component**:
A component (also called a stack component) is a configurable piece of infrastructure that pipelines run on, such as an orchestrator or an artifact store. Components are combined into a stack, so the same pipeline code can run on different stacks.

- **Metadata Store**:
The metadata store tracks the state of your pipelines, steps, and artifacts. It stores the parameters, metrics, and outputs of each step, enabling reproducibility.

- **Run**:
A run is an execution of the pipeline. Each time you execute a pipeline, ZenML records the inputs, outputs, and metadata of the run.

- **Experiment**:
An experiment is a group of related runs. It helps in organizing different configurations or variations of a pipeline that you want to compare.

- **Context**:
Context refers to the environment or session in which a pipeline or step is executed. This includes the configuration, dependencies, and available resources.

- **Artifact Store**:
An artifact store is where artifacts are saved. It can be a cloud storage, local file system, or any other storage backend.

- **Orchestrator**:
An orchestrator manages the execution of ZenML pipelines. It helps execute the steps across distributed systems, and ZenML supports different orchestrators like Kubeflow, Airflow, and others.

- **Visualizer**:
Visualizer is a component used to visualize the results or outcomes of pipeline runs, including metrics, performance graphs, and other visualizations.

- **Flavor**:
A flavor is a concrete implementation of a component type. For example, an artifact store can have a local file system flavor or a cloud storage flavor, and an orchestrator can have a Kubeflow or an Airflow flavor.

- **Registry**:
A registry stores different versions of components and models. It allows you to version and store your models or data transformations for reuse.

- **Custom Step**:
A custom step is a user-defined step that you create to perform a specific operation in your pipeline that isn't already available in ZenML.

- **Versioning**:
Versioning refers to tracking changes in models, steps, or data. ZenML tracks versions of artifacts, pipeline runs, and components to ensure reproducibility.
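
To make the relationship between pipelines, steps, artifacts, and runs concrete, here is a minimal plain-Python sketch. It does not use ZenML, and the names `ToyPipeline`, `Run`, `drop_missing`, and `double` are hypothetical; it only illustrates how a pipeline can execute steps in order and record each step's output per run.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Run:
    """One execution of a pipeline, with the artifact each step produced."""
    run_id: int
    artifacts: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToyPipeline:
    """An ordered list of steps; each step consumes the previous step's artifact."""
    steps: List[Callable[[Any], Any]]
    runs: List[Run] = field(default_factory=list)

    def run(self, initial_input: Any) -> Run:
        run = Run(run_id=len(self.runs) + 1)
        artifact = initial_input
        for step in self.steps:
            artifact = step(artifact)
            run.artifacts[step.__name__] = artifact  # record the step's output
        self.runs.append(run)
        return run


def drop_missing(rows):
    return [r for r in rows if r is not None]


def double(rows):
    return [r * 2 for r in rows]


toy = ToyPipeline(steps=[drop_missing, double])
first_run = toy.run([1, None, 3])
print(first_run.artifacts)  # {'drop_missing': [1, 3], 'double': [2, 6]}
```

Because every run keeps its own artifacts, two runs of the same pipeline can be compared step by step, which is the idea behind the **Run** and **Artifact** terms above.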

# 1. Loading Data

ZenML makes it easy to handle data loading by integrating with **various data sources**, ensuring that datasets are **versioned** and can be reused in **future experiments**.

### 1.1 Create a ZenML Artifact Store
ZenML stores datasets and models as **artifacts**, and you need an artifact store to save these objects.

In [None]:
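from zenml.client import Client

# Look up the artifact store of the active stack (local or cloud-based)
# Attribute names follow recent ZenML releases and may differ in older versions
artifact_store = Client().active_stack.artifact_store
print(artifact_store.path)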

### 1.2 Define a Data Loading Step
You can create a custom data loading function as part of your ZenML pipeline.

In [None]:
import pandas as pd
from zenml import step

@step
def load_data() -> pd.DataFrame:
    """Load the raw churn dataset from a local CSV file."""
    data = pd.read_csv('customer_churn_data.csv')
    return data  # ZenML stores the returned DataFrame as an artifact

### 1.3 Run the Step
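Preview the loaded data. ZenML tracks the output as an artifact once the step runs inside a pipeline.

In [None]:
# Call the step's underlying function directly to preview the data
# (`entrypoint` bypasses ZenML, so nothing is tracked by this call)
data = load_data.entrypoint()
print(data.head())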

## Benefits:
- Data is versioned automatically by ZenML.
- You can track different versions of the dataset used in your experiments, making it easy to go back to previous versions.
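
One way to picture dataset versioning is as a fingerprint of the data's content. The sketch below is a plain-Python illustration under that assumption, not a description of how ZenML computes versions; `dataset_version` and the two inline CSV snippets are hypothetical.

```python
import hashlib


def dataset_version(csv_bytes: bytes) -> str:
    """Return a short, deterministic fingerprint of a dataset's raw bytes."""
    return hashlib.sha256(csv_bytes).hexdigest()[:12]


original = b"customer_id,tenure,churn\n1,12,0\n2,3,1\n"
modified = b"customer_id,tenure,churn\n1,12,0\n2,3,0\n"

v1 = dataset_version(original)
v2 = dataset_version(modified)
print(v1 == dataset_version(original))  # same bytes give the same version
print(v1 != v2)  # a single changed label gives a different version
```

Recording such a fingerprint with every experiment makes it possible to tell later which exact dataset a model was trained on.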

# 2. Processing Data

Once the data is loaded, it's time to preprocess it. This includes **cleaning** the data, **handling missing values**, **encoding categorical** variables, and **feature scaling**.
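
Three of these operations (dropping rows with missing values, encoding a categorical column, and scaling a numeric column) can be sketched without any library. The helper names and the four sample rows are hypothetical. The sorted category order and the population standard deviation mirror the defaults of scikit-learn's `LabelEncoder` and `StandardScaler`, which the steps in this section use.

```python
from statistics import fmean, pstdev


def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


rows = [
    {"gender": "Male", "tenure": 2.0},
    {"gender": "Female", "tenure": None},
    {"gender": "Female", "tenure": 4.0},
    {"gender": "Male", "tenure": 6.0},
]

# Cleaning: drop every row that has a missing value
clean = [r for r in rows if all(v is not None for v in r.values())]

codes, classes = label_encode([r["gender"] for r in clean])
scaled = standardize([r["tenure"] for r in clean])
print(classes, codes)  # ['Female', 'Male'] [1, 0, 1]
print([round(s, 3) for s in scaled])  # [-1.225, 0.0, 1.225]
```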

### 2.1 Define Data Preprocessing Step:
You can create a reusable data preprocessing function that will be part of your ZenML pipeline.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from zenml import pipeline, step

@step
def clean_data(data: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with missing values from the dataset."""
    return data.dropna()

@step
def encode_categorical_features(data: pd.DataFrame) -> pd.DataFrame:
    """Encode categorical columns in the dataset."""
    data = data.copy()  # avoid modifying the input DataFrame in place
    encoder = LabelEncoder()
    data['gender'] = encoder.fit_transform(data['gender'])
    return data

@step
def feature_engineering(data: pd.DataFrame) -> pd.DataFrame:
    """Create new features from existing ones."""
    data = data.copy()
    # Example: interaction between tenure and monthly charges
    data['tenure_monthly_charge'] = data['tenure'] * data['monthly_charges']
    return data

@step
def scale_data(data: pd.DataFrame) -> pd.DataFrame:
    """Scale numeric features to zero mean and unit variance."""
    data = data.copy()
    numeric_columns = ['tenure', 'monthly_charges', 'total_charges']
    data[numeric_columns] = StandardScaler().fit_transform(data[numeric_columns])
    return data

@pipeline
def churn_pipeline():
    """Load the raw data and run every preprocessing step in order."""
    raw_data = load_data()
    cleaned_data = clean_data(data=raw_data)
    encoded_data = encode_categorical_features(data=cleaned_data)
    engineered_data = feature_engineering(data=encoded_data)
    scale_data(data=engineered_data)

### 2.2 Run Preprocessing Step:
You can chain the data preprocessing step after the data loading step in your pipeline.

In [None]:
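# Run the pipeline
churn_pipeline()

# Load the final output (scaled data) from the most recent run
# (accessor names follow recent ZenML releases and may differ in older versions)
from zenml.client import Client
last_run = Client().get_pipeline("churn_pipeline").last_run
final_data = last_run.steps["scale_data"].output.load()
print(final_data.head())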

## Benefits:
- Ensures that preprocessing steps are reproducible.
- Allows easy adjustments and fine-tuning of preprocessing techniques during different experiments.
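
Reproducible preprocessing also means that parameters learned from the data, such as a scaler's mean and standard deviation, are learned once and then reused on new data. The plain-Python sketch below separates the two stages; `fit_scaler`, `apply_scaler`, and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the scaling parameters (mean and population std) from data."""
    return {"mean": fmean(values), "std": pstdev(values)}


def apply_scaler(params, values):
    """Scale values with previously learned parameters."""
    return [(v - params["mean"]) / params["std"] for v in values]


train_tenure = [2.0, 4.0, 6.0]
new_tenure = [4.0, 8.0]

params = fit_scaler(train_tenure)  # learned once from the training data
scaled_new = apply_scaler(params, new_tenure)  # reused on unseen data
print(params["mean"], [round(v, 3) for v in scaled_new])  # 4.0 [0.0, 2.449]
```

Keeping the fitted parameters as their own artifact lets a later experiment, or the deployed model, apply exactly the same transformation.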

# 3. Exploring Models

### 3.1 Define a Model Training Step:
Create a reusable step for training models. This allows you to experiment with different models easily.

In [None]:
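import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from zenml import step

@step
def train_model(data: pd.DataFrame) -> ClassifierMixin:
    """Train a random forest and report its accuracy on held-out data."""
    # Split data into features and target
    X = data.drop('churn', axis=1)
    y = data['churn']

    # Hold out 20% of the rows so accuracy is not measured on the training data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train a Random Forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate the model on the held-out rows
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f'Model accuracy: {accuracy}')

    return model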

Exploring different models involves experimenting with various **machine learning algorithms** to find the best one for your task. ZenML allows you to organize and **track the different models** you test.
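
When comparing candidate models, the score used for the comparison should come from data the model did not see during training. The plain-Python sketch below shows why, using a hypothetical 1-nearest-neighbor model on synthetic noisy data: the model memorizes its training set, so its training accuracy says little about how it generalizes.

```python
import random


def nearest_neighbor_predict(train, point):
    """Predict the label of the closest training point (1-nearest-neighbor)."""
    closest = min(train, key=lambda item: abs(item[0] - point))
    return closest[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label is predicted correctly."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(42)
# Synthetic data: the label is 1 when x > 0.5, with about 20% of labels flipped
xs = [rng.random() for _ in range(200)]
data = [(x, int(x > 0.5) ^ int(rng.random() < 0.2)) for x in xs]

train, test = data[:150], data[150:]
train_acc = accuracy(train, train)
test_acc = accuracy(train, test)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The training accuracy is perfect because each training point is its own nearest neighbor, while the held-out accuracy reflects the label noise. Only the held-out number is a fair basis for choosing between models.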

### 3.2 Run Model Exploration Step:
You can define multiple steps to explore different models.

In [None]:
# Call the step's underlying function directly on the preprocessed data from section 2.2
model_result = train_model.entrypoint(data=final_data)

## Benefits:
- Keeps track of multiple models and their results, allowing easy comparisons.
- Allows experimentation with various algorithms in a clean and organized way.

# 4. Fine-Tuning Models

### 4.1 Define a Hyperparameter Tuning Step:
For example, you can implement the tuning step with Optuna.

In [None]:
import optuna
import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from zenml import step

@step
def tune_model(data: pd.DataFrame) -> ClassifierMixin:
    """Search random forest hyperparameters with Optuna and refit the best model."""
    # Split once so that both the objective and the final fit can use X and y
    X = data.drop('churn', axis=1)
    y = data['churn']

    def objective(trial):
        n_estimators = trial.suggest_int('n_estimators', 50, 200)
        max_depth = trial.suggest_int('max_depth', 1, 20)
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        # Score with 3-fold cross-validation instead of training accuracy
        return cross_val_score(model, X, y, cv=3).mean()

    # Start the hyperparameter search
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=10)

    best_params = study.best_params
    print(f'Best hyperparameters: {best_params}')

    # Train the best model with optimal parameters
    best_model = RandomForestClassifier(**best_params, random_state=42)
    best_model.fit(X, y)

    return best_model

### 4.2 Run Fine-Tuning Step:

In [None]:
# Call the step's underlying function directly on the preprocessed data from section 2.2
tuned_model_result = tune_model.entrypoint(data=final_data)

## Benefits:
- Automates the hyperparameter tuning process.
- Ensures that the model fine-tuning process is well-documented and reproducible.
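
The loop that a tuning library automates can be sketched as a plain random search. This is a simplified stand-in: Optuna's samplers are more sophisticated, and the `objective` below is a hypothetical score surface rather than a trained model. The search ranges match the ones used in the tuning step.

```python
import random


def objective(n_estimators: int, max_depth: int) -> float:
    """Hypothetical score surface that peaks at n_estimators=120, max_depth=8."""
    return 1.0 - ((n_estimators - 120) / 150) ** 2 - ((max_depth - 8) / 19) ** 2


def random_search(n_trials: int, seed: int = 0):
    """Sample hyperparameters at random and keep the best-scoring trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "n_estimators": rng.randint(50, 200),
            "max_depth": rng.randint(1, 20),
        }
        score = objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=10)
print(best_params, round(best_score, 3))
```

Fixing the seed makes the search repeatable, which is what allows a tuning run to be documented and reproduced later.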

# 5. Deployment

Once you have the best-performing model, ZenML can help you deploy it to a production environment. You can automate the deployment process as part of the pipeline.

### 5.1 Define a Deployment Step:
ZenML makes it easy to define a deployment pipeline step that pushes your model into a production environment.

In [None]:
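import pickle

from sklearn.base import ClassifierMixin
from zenml import step

@step
def deploy_model(model: ClassifierMixin) -> None:
    """Serialize the model to a local file as a stand-in for a real deployment."""
    # Save the model to a file (or deploy to a server)
    with open('deployed_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    print("Model deployed successfully!")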

### 5.2 Run Deployment Step:
You can chain the deployment step after model training and fine-tuning.

In [None]:
# Call the step's underlying function directly with the tuned model from section 4.2
deploy_model.entrypoint(model=tuned_model_result)

## Benefits:
- Automates the deployment process, ensuring consistent deployment configurations.
- Ensures that the deployed model is the same version that was trained and tuned.
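
One simple way to check that a deployed file really is the model that was trained is to record a checksum when saving and verify it when loading. The sketch below assumes a pickle-based deployment like the one in this section; `save_model`, `load_model`, and the dictionary standing in for a trained model are hypothetical.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def save_model(model, path: Path) -> str:
    """Pickle the model to disk and return the SHA-256 of the written bytes."""
    payload = pickle.dumps(model)
    path.write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()


def load_model(path: Path, expected_digest: str):
    """Load a pickled model, refusing files whose checksum does not match."""
    payload = path.read_bytes()
    if hashlib.sha256(payload).hexdigest() != expected_digest:
        raise ValueError("model file does not match the recorded checksum")
    return pickle.loads(payload)


# Stand-in for a trained model; any picklable object works the same way
model = {"n_estimators": 100, "max_depth": 8}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "deployed_model.pkl"
    digest = save_model(model, model_path)
    restored = load_model(model_path, digest)

print(restored == model)  # True
```

Note that unpickling executes code from the file, so only load pickle files from sources you trust.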

# Storing and Loading ZenML Pipelines

## Storing Pipelines
ZenML records each pipeline you run, together with its steps, configuration, and run history, and saves the step inputs and outputs as artifacts in the artifact store. This lets you retrieve the pipeline and its results later.

### Define the Pipeline:
Once you’ve defined all the steps in your pipeline (like loading data, processing data, training models, etc.), you need to compose them into a pipeline.

In [None]:
from zenml import pipeline

# Define your pipeline
@pipeline
def churn_pipeline():
    data = load_data()
    cleaned_data = clean_data(data)
    encoded_data = encode_categorical_features(cleaned_data)
    engineered_data = feature_engineering(encoded_data)
    processed_data = scale_data(engineered_data)
    train_model(processed_data)  # baseline model, kept for comparison
    tuned_model = tune_model(processed_data)
    deploy_model(tuned_model)

# Run the pipeline
churn_pipeline()

### Store the Pipeline:
ZenML registers the pipeline when it runs, so no separate save call is needed. The pipeline's steps, configuration, and run history are recorded in a persistent backend, and the step outputs are saved to the artifact store.

In [None]:
from zenml.client import Client

# Look up the stored pipeline by name
stored_pipeline = Client().get_pipeline("churn_pipeline")
pipeline_id = stored_pipeline.id
print(f"Pipeline stored with ID: {pipeline_id}")

After a run, the pipeline, along with its steps and configuration, can be retrieved from ZenML, and its outputs can be loaded from the artifact store.

## Loading Pipelines
Once you have stored your pipeline, you can load it from a different file, project, or environment. This is useful when you need to reuse the same pipeline configuration or when you are working across different environments (e.g., testing vs. production).

### Load a Pipeline:
You can look up an existing pipeline by its name or ID with `Client().get_pipeline`.

In [None]:
from zenml.client import Client

# Load the stored pipeline
pipeline_id = 'your-pipeline-id'  # Replace with the actual pipeline name or ID
loaded_pipeline = Client().get_pipeline(pipeline_id)
print(f"Pipeline loaded with ID: {loaded_pipeline.id}")

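### Run the Loaded Pipeline:
The loaded object describes the stored pipeline and its run history. To execute the pipeline again, call the pipeline function from its Python definition.

In [None]:
# Inspect the most recent run of the loaded pipeline
print(loaded_pipeline.last_run.status)

# Run the pipeline again from its definition
churn_pipeline()

This executes the steps in the pipeline again. With caching enabled, which is ZenML's default, steps whose code, inputs, and configuration are unchanged reuse their earlier outputs.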
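
The idea of storing a pipeline and loading it elsewhere can be illustrated in plain Python: persist a description of the pipeline (here, the ordered step names) and rebuild it from a registry of step functions. This is a conceptual sketch only, not ZenML's storage mechanism; `STEP_REGISTRY`, `register`, `save_pipeline`, `load_and_run`, and the two toy steps are hypothetical.

```python
import json

# Registry of available step functions, keyed by name (hypothetical steps)
STEP_REGISTRY = {}


def register(func):
    """Make a step function discoverable by its name."""
    STEP_REGISTRY[func.__name__] = func
    return func


@register
def add_one(values):
    return [v + 1 for v in values]


@register
def square(values):
    return [v * v for v in values]


def save_pipeline(step_names) -> str:
    """Serialize a pipeline as the ordered list of its step names."""
    return json.dumps({"steps": list(step_names)})


def load_and_run(serialized: str, data):
    """Rebuild the pipeline from its serialized form and execute it."""
    for name in json.loads(serialized)["steps"]:
        data = STEP_REGISTRY[name](data)
    return data


stored = save_pipeline(["add_one", "square"])
print(load_and_run(stored, [1, 2, 3]))  # [4, 9, 16]
```

The stored description contains names rather than code, so the environment that loads it must provide the same step implementations, which is why versioning the steps matters alongside versioning the data.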