# Objective

Provide a comprehensive understanding of MLOps: an overview of the concept, the role of model pipelines on cloud platforms such as AWS and Azure, and the integration of version control using Git.

# Introduction

## DevOps -> MLOps

![devops](assets/devops-cycle.png)

[Source](https://www.hiclipart.com/free-transparent-background-png-clipart-pytym)

The DevOps cycle encompasses a set of steps that facilitate the development, deployment, and operation of software systems. In the context of machine learning, these steps can be adapted to create an ML DevOps cycle, which focuses on managing and automating the machine learning workflow. Here's an explanation of each step in the ML DevOps cycle, along with an example for each stage:

1. Code:
   - In the ML DevOps cycle, the "Code" stage involves developing and maintaining the machine learning codebase. This includes writing code for data preprocessing, model training, evaluation, and deployment.
   - Example: A data scientist writes Python scripts to preprocess data, build and train machine learning models, and evaluate their performance.

2. Build:
   - The "Build" stage involves packaging the code and its dependencies into a deployable format. This step ensures that the ML codebase can be easily reproduced and deployed in different environments.
   - Example: Using tools like Docker, the ML codebase is containerized, creating a portable and self-contained package that encapsulates the required dependencies and configurations.

3. Test:
   - The "Test" stage focuses on verifying the functionality, quality, and performance of the ML codebase. This includes running unit tests, integration tests, and evaluating model performance on test datasets.
   - Example: Unit tests are written to verify the correctness of individual functions or modules in the ML codebase, while integration tests assess the interoperability of different components.

4. Release:
   - The "Release" stage involves preparing the ML codebase for deployment in a production environment. This includes generating deployment artifacts, documenting release notes, and ensuring that all necessary dependencies are included.
   - Example: A trained machine learning model is packaged along with the required preprocessing scripts, trained weights, and configuration files for seamless deployment.

5. Deploy:
   - In the "Deploy" stage, the ML codebase is deployed to a production environment, making it available for serving predictions or integrating with other systems.
   - Example: The packaged ML codebase is deployed to a cloud-based server or an edge device, allowing real-time predictions to be made based on incoming data.

6. Operate:
   - The "Operate" stage focuses on monitoring and managing the deployed ML system. This includes logging relevant metrics, handling errors, and ensuring the system's health and availability.
   - Example: A monitoring system is set up to track the prediction latency, accuracy, and other relevant metrics of the deployed ML model. Alerts are generated if the system's performance deviates from defined thresholds.

7. Monitor:
   - The "Monitor" stage involves continuously monitoring the ML system's performance, data quality, and model behavior in a production environment. This helps identify anomalies, detect drift, and ensure ongoing reliability.
   - Example: A monitoring pipeline periodically collects real-time data and evaluates the model's performance over time. Anomaly detection algorithms are applied to identify unexpected behavior or data drift.

8. Plan:
   - The "Plan" stage focuses on gathering feedback, analyzing performance, and incorporating improvements into future iterations of the ML system. It involves planning for enhancements, bug fixes, and updates.
   - Example: Based on feedback from users or performance metrics, the ML team identifies areas for improvement, such as collecting additional data, retraining the model with new algorithms, or optimizing specific parts of the pipeline.
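
To make the "Test" stage above more concrete, here is a minimal sketch of a unit test that a CI system could run on every commit. The helper `scale_to_unit_range` is hypothetical and not part of this document's codebase; it only stands in for a typical preprocessing function.

```python
def scale_to_unit_range(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column has no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_scale_to_unit_range():
    # Typical case: evenly spaced values.
    assert scale_to_unit_range([10, 20, 30]) == [0.0, 0.5, 1.0]
    # Edge case: constant input must not divide by zero.
    assert scale_to_unit_range([7, 7]) == [0.0, 0.0]


test_scale_to_unit_range()
```

A test runner such as `pytest` would typically discover and run functions like `test_scale_to_unit_range` automatically; the explicit call at the end only keeps the sketch self-contained.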

## Continuous Integration/Continuous Deployment (CI/CD)

CI/CD (Continuous Integration/Continuous Deployment) is a crucial component of the DevOps cycle in the context of machine learning. It focuses on automating the process of integrating code changes, testing, and deploying ML models, ensuring a streamlined and efficient workflow. Here's a brief summary of CI/CD in the context of the DevOps cycle for machine learning:

Continuous Integration (CI):
- CI involves automating the integration of code changes made by different developers into a shared repository. It aims to prevent integration conflicts and maintain code quality.
- In the ML context, CI ensures that changes to the ML codebase, including data preprocessing, model training, and evaluation, are automatically integrated into a central repository.
- CI systems, such as AWS CodeBuild, Azure DevOps or Jenkins, trigger automated builds and tests whenever changes are pushed to the repository.
- Continuous integration facilitates collaboration, identifies integration issues early, and promotes a consistent and stable codebase.

Continuous Deployment (CD):
- CD extends CI by automating the deployment process, allowing ML models and related components to be deployed to production environments in a reliable and reproducible manner.
- CD enables the automated execution of the machine learning pipeline, including model training, evaluation, packaging, and deployment.
- CD systems leverage CI artifacts and trigger deployment processes based on predefined criteria, such as passing tests or specific branch merges.
- CD ensures that ML models are consistently deployed to production, reducing manual effort and minimizing the risk of human error.
- It enables faster and more frequent deployments, enabling rapid iteration and quicker delivery of ML-based applications.

CI/CD for machine learning encompasses the integration, testing, and deployment of ML code changes, resulting in a streamlined and automated pipeline. By leveraging CI/CD practices, ML teams can achieve greater efficiency, collaboration, and reliability in developing and deploying machine learning models.
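
The "predefined criteria" that gate a deployment can be sketched as a small function. This is an illustration only: the `BuildResult` class, the `should_deploy` function, and the branch name `main` are assumptions, not part of any particular CI/CD product.

```python
from dataclasses import dataclass


@dataclass
class BuildResult:
    """Summary of a CI run (hypothetical structure)."""
    branch: str
    tests_passed: bool


def should_deploy(result, release_branch="main"):
    """Deploy only when the build is on the release branch and its tests passed."""
    return result.branch == release_branch and result.tests_passed


print(should_deploy(BuildResult(branch="main", tests_passed=True)))
print(should_deploy(BuildResult(branch="feature/new-model", tests_passed=True)))
```

Real CD systems usually express such rules in their own configuration format rather than in Python, but the logic they encode is the same.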

## Key steps in CI/CD for MLOps

An operational CI/CD MLOps system hinges on two key steps:

1. Pipeline Delineation:
   A machine learning pipeline defines the sequence of steps required to train, evaluate, and deploy machine learning models. It encompasses various stages, such as data preprocessing, feature engineering, model training, validation, testing, and deployment. The pipeline ensures consistency and reproducibility in the machine learning workflow.

   The pipeline can be represented as a series of interconnected components or modules, each responsible for a specific task. These components can be implemented as scripts, functions, or Docker containers. The pipeline delineation specifies the order of execution and the dependencies between the components.

2. Version Control System for Tracking Pipeline Changes:
   In an MLOps workflow, it is crucial to track changes made to the machine learning pipeline, including modifications to code, configurations, and dependencies. A version control system, such as Git, serves as the foundation for tracking and managing these changes.

   With a version control system, each change made to the pipeline is recorded as a commit, capturing the specific modifications made at a given point in time. This enables traceability, reproducibility, and collaboration among team members.

   A version control system records changes but does not run anything by itself. Instead, a CI/CD service connected to the repository (for example, through a webhook) detects each new commit and triggers the execution of the pipeline. This ensures that any modifications to the pipeline automatically initiate the necessary steps for retraining, reevaluation, or redeployment.

   For example, if a developer changes the preprocessing component of the pipeline, such as adding new data transformations or modifying existing ones, the version control system records the changes as a commit. As a result, the pipeline execution is triggered, rerunning the preprocessing step with the updated logic.

   Similarly, if a change is made to the model training component, such as using a different algorithm or adjusting hyperparameters, the commit that captures the modifications causes the CI/CD service to initiate the retraining process.

   By coupling the version control system with automatic pipeline execution, the MLOps system ensures that any changes to the pipeline are automatically incorporated into the workflow, reducing the manual effort required for execution and promoting reproducibility across different stages of the machine learning lifecycle.
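
The link between a commit and the steps that need to run again can be sketched as follows. This is a simplified illustration of what a CI job might do after a commit: the two mappings are hypothetical, and a real setup may simply rerun the whole pipeline.

```python
# Hypothetical mapping from each pipeline step to the steps it depends on.
# The insertion order is also a valid execution order.
DEPENDS_ON = {
    "preprocess": [],
    "train": ["preprocess"],
    "evaluate": ["preprocess", "train"],
}

# Hypothetical mapping from source files to the step they implement.
FILE_TO_STEP = {
    "transform.py": "preprocess",
    "dt.py": "train",
    "evaluate.py": "evaluate",
}


def steps_to_rerun(changed_files):
    """Return the steps affected by the changed files, in execution order."""
    affected = {FILE_TO_STEP[f] for f in changed_files if f in FILE_TO_STEP}
    # Propagate downstream: a step is affected if any of its dependencies is.
    grew = True
    while grew:
        grew = False
        for step, deps in DEPENDS_ON.items():
            if step not in affected and any(d in affected for d in deps):
                affected.add(step)
                grew = True
    return [step for step in DEPENDS_ON if step in affected]


print(steps_to_rerun(["dt.py"]))
```

A change to the training script affects training and evaluation but not preprocessing, while a change to an unrelated file affects no step at all.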

Let us now look at each of these components in detail.

# Pipelines

## Concept

A pipeline is a series of connected steps, each of which accomplishes a specific machine learning task.

For example, a typical machine learning pipeline might include the following steps:
- Data ingestion and preprocessing: Cleansing, transformation, and feature engineering on raw data.
- Model training: Training machine learning models using a specific algorithm and hyperparameters.
- Model evaluation: Assessing the model's performance using appropriate metrics and validation techniques.
- Model deployment: Deploying the trained model for inference or integration into a production environment.

![ml-pipeline-example](assets/pipeline.drawio.png)
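
Before turning to the cloud implementations, the idea of connected steps can be sketched in plain Python, where each function consumes the output of the previous one. The toy data and the one-parameter model are made up for illustration and are unrelated to the datasets used later.

```python
def ingest():
    """Return toy (carat, price) records; None marks a missing price."""
    return [(0.3, 500.0), (0.5, None), (1.0, 4000.0), (1.5, 9000.0)]


def preprocess(rows):
    """Drop records with a missing price."""
    return [row for row in rows if row[1] is not None]


def train(rows):
    """Fit price = slope * carat by least squares through the origin."""
    numerator = sum(carat * price for carat, price in rows)
    denominator = sum(carat * carat for carat, _ in rows)
    return numerator / denominator


def evaluate(slope, rows):
    """Mean absolute error of the fitted line on the given rows."""
    return sum(abs(price - slope * carat) for carat, price in rows) / len(rows)


def run_pipeline():
    rows = preprocess(ingest())
    slope = train(rows)
    # For brevity the training rows are reused for evaluation;
    # a real pipeline would evaluate on held-out test data.
    return {"slope": slope, "mae": evaluate(slope, rows)}


print(run_pipeline())
```

Managed pipeline services follow the same pattern, but run each step on separate infrastructure and pass data between steps through storage rather than function arguments.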

## Implementation

### Setup

**General Imports**

In [1]:
import logging
import json

**SageMaker Imports**

In [2]:
import sagemaker

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.properties import PropertyFile

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.sklearn.estimator import SKLearn

from sagemaker.inputs import TrainingInput

The SageMaker imports now include the classes that enable us to assemble pipelines as a series of steps (e.g., `ProcessingStep` and `TrainingStep`).

**Azure ML imports**

In [3]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

from azure.ai.ml import command, Input, Output, dsl

A new import here is `dsl`, a domain-specific language for defining and building pipelines with the Azure ML Pipelines framework.

**SageMaker Authentication**

In [4]:
pipeline_session = sagemaker.Session(
    default_bucket="sagemaker-pipeline-examples"
)

In [5]:
try:
    aws_role = sagemaker.get_execution_role()
except ValueError:
    print('Local configuration is not complete; use SageMaker Studio')

In [6]:
print(f"AWS execution role associated with the account {aws_role}")
print(f"Default bucket associated with the account: {pipeline_session.default_bucket()}")
print(f"Default boto region associated with the account: {pipeline_session.boto_region_name}")

AWS execution role associated with the account arn:aws:iam::321112151583:role/default-sagemaker-access
Default bucket associated with the account: sagemaker-pipeline-examples
Default boto region associated with the account: ap-south-1


**Azure authentication**

In [7]:
subscription_id = "5bcad9c4-40fb-4136-b614-cc90116dd8b3"
resource_group = "tf"
workspace = "cloud-teach"

We can stop the Azure SDK from logging verbose HTTP messages by raising the level of its HTTP logging policy logger so that only warnings and errors are reported.

In [8]:
logger = logging.getLogger("azure.core.pipeline.policies.http_logging_policy")
logger.setLevel(logging.WARNING)

In [9]:
az_credentials = DefaultAzureCredential(
    exclude_interactive_browser_credential=False
)

In [10]:
ml_client = MLClient(
    az_credentials, subscription_id, resource_group, workspace
)

In [11]:
for registered_data in ml_client.data.list():
    print(registered_data.name)

winequality-local
winequality-red
user-likes-media
socialmediaengagement
imdb_reviews
diamond-prices-jan
diamond-prices-feb
wine-quality-indicator
diamond-prices-may


### SageMaker Pipelines

SageMaker Pipelines is a powerful feature of Amazon SageMaker that allows you to create, orchestrate, and automate end-to-end machine learning workflows. It provides a high-level abstraction for defining and managing interconnected steps within a pipeline and automating the handoff between these steps using SageMaker abstractions. Here's an overview of how SageMaker Pipelines works:

1. Step Definition:
   - Each step in the pipeline represents a unit of work, such as data preprocessing, model training, or model deployment.
   - You define each step as a reusable component using the step classes of the SageMaker Python SDK (for example, `ProcessingStep` and `TrainingStep`).
   - The step definition includes the specific actions, algorithms, configurations, and inputs/outputs required for that particular step.

2. Pipeline Definition:
   - You define the pipeline as a series of interconnected steps using the SageMaker Pipelines SDK.
   - The pipeline definition specifies the sequence of steps, the input/output dependencies between them, and any conditions or branching logic.
   - You can parameterize the pipeline to make it configurable and flexible, allowing for easy experimentation with different settings.

3. Data Flow and Handoff:
   - SageMaker Pipelines automatically manages the flow of data between steps in the pipeline.
   - Each step consumes the output produced by its preceding steps and produces outputs that can be consumed by subsequent steps.
   - The data handoff between steps is automated by SageMaker, ensuring the seamless transfer of data and artifacts without manual intervention.

4. Execution and Monitoring:
   - Once the pipeline is defined, you can execute it using the SageMaker Pipelines SDK or SageMaker Studio.
   - During pipeline execution, SageMaker handles the provisioning and management of the underlying resources required for each step.
   - You can monitor the pipeline's progress, track metrics, and log intermediate outputs using Amazon CloudWatch and Amazon S3.

5. Reusability and Versioning:
   - SageMaker Pipelines encourage reusability and modularity by allowing you to create and reuse step definitions across multiple pipelines.
   - You can version the step definitions and pipelines, making it easier to track changes, rollback to previous versions, and ensure reproducibility.

By leveraging SageMaker Pipelines, you can build complex and scalable machine learning workflows that seamlessly integrate data preprocessing, model training, evaluation, deployment, and monitoring. The automated handoff between steps and the use of SageMaker abstractions simplifies the pipeline creation and management process, enabling efficient and reliable end-to-end machine learning workflows in Amazon SageMaker.

To illustrate how SageMaker Pipelines work, let us assemble and execute a three-step pipeline that processes raw data, trains a model on the processed data and computes evaluation metrics using the test data and the estimated model.

#### Step 1: Preprocessing

![aws-processing-job](assets/aws-processing-job.drawio.png) ![aws-processing-map](assets/aws-processing-map.drawio.png)

In [12]:
input_data_uri = 's3://sagemaker-ap-south-1-321112151583/prices/diamond-prices.csv'

In [13]:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-diamond-prices-process",
    role=aws_role,
    sagemaker_session=pipeline_session
)

In [14]:
step_process = ProcessingStep(
    name="DiamondsProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data_uri, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test")
    ],
    code="aws/transform.py",
)

A key difference from before is the use of the `ProcessingOutput` class to declare the outputs of the step. Each `ProcessingOutput` object records where the step's output is persisted, so that later steps can refer to it. The processing script `transform.py` is exactly the same as before.

At this stage the processing step is defined but not executed.

In [17]:
step_process.inputs

[<sagemaker.processing.ProcessingInput at 0x7f5057d7c640>]

In [16]:
step_process.outputs

[<sagemaker.processing.ProcessingOutput at 0x7f5058126850>,
 <sagemaker.processing.ProcessingOutput at 0x7f50581267c0>]

#### Step 2: Training

![training-job](assets/aws-training-job.drawio.png) ![training-map](assets/aws-training-data-map.drawio.png)

In [22]:
sklearn_estimator = SKLearn(
    entry_point="aws/dt.py",
    framework_version="1.0-1",
    role=aws_role,
    sagemaker_session=pipeline_session,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    volume_size=1
)

As we did before, we instantiate an `Estimator` that assembles the compute environment, the compute infrastructure and the training script that will be run (the training script `dt.py` is unchanged from before).

With this estimator in place, we can now create the training step of the pipeline by linking the outputs of the processing step (`step_process`) to the inputs of the training step. This linking is the hand-off automation.

In [23]:
step_train = TrainingStep(
    name="DiamondsTrain",
    estimator=sklearn_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv"
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            content_type="text/csv"
        )
    }
)

In [24]:
step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri

<sagemaker.workflow.properties.Properties at 0x7f505828ddc0>

The `ProcessingOutput` class defines the output data generated by the processing step. It lets you specify the name of the output (`output_name`) and the path within the processing container (`source`) where the processed data is saved. Subsequent steps in the pipeline can consume these outputs, which is how data and dependencies flow between pipeline components. Note that the expression above evaluates to a `Properties` placeholder rather than a concrete S3 URI; it is resolved to the actual location only when the pipeline runs.
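
The `Properties` object printed above is a reference that is filled in at run time, not a concrete value. The toy class below imitates that idea in plain Python. It is not SageMaker's implementation: the class `StepProperty`, the shape of the reference, and the bucket name are assumptions made for illustration.

```python
class StepProperty:
    """Toy placeholder that records an access path instead of holding a value."""

    def __init__(self, path):
        self._path = path

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, e.g. `.S3Output`.
        if name.startswith("_"):
            raise AttributeError(name)
        return StepProperty(f"{self._path}.{name}")

    def __getitem__(self, key):
        return StepProperty(f"{self._path}['{key}']")

    def to_reference(self):
        """Render the placeholder as a reference that could be stored in a definition."""
        return {"Get": self._path}

    def resolve(self, runtime_values):
        """Look up the concrete value once the producing step has run."""
        return runtime_values[self._path]


outputs = StepProperty("Steps.DiamondsProcess").ProcessingOutputConfig.Outputs
train_uri = outputs["train"].S3Output.S3Uri
print(train_uri.to_reference())

# After the (imaginary) processing step has finished, the value becomes known.
runtime_values = {train_uri.to_reference()["Get"]: "s3://example-bucket/train"}
print(train_uri.resolve(runtime_values))
```

Because the training step only holds a reference to the processing step's output, the pipeline service can infer that processing must finish before training starts.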

#### Step 3: Evaluation

Once the model is trained, we need to evaluate it using an evaluation script (`evaluate.py`). This script generates predictions from the model saved in the previous step and logs metrics on the test data. This evaluation is often used to conditionally trigger the next steps (e.g., if the accuracy does not exceed the current baseline, further execution is not triggered).

In SageMaker pipelines, evaluation metric computation is considered a processing step because it involves performing some data processing or analysis to generate evaluation metrics for a machine learning model. This step is typically implemented using a `ScriptProcessor` and a `PropertyFile` in SageMaker pipelines. 

1. ScriptProcessor:
   - The `ScriptProcessor` is a SageMaker feature that enables the execution of custom scripts as part of a processing step.
   - In the case of evaluation metric computation, a custom script is used to perform the necessary calculations or analysis to derive the desired metrics.

2. PropertyFile:
   - A `PropertyFile` is often used in conjunction with the `ScriptProcessor` to capture and store the evaluation metrics generated by the custom script.
   - The custom script writes the evaluation metrics to a JSON file, which is saved as an output of the processing step; the `PropertyFile` identifies that output and the path of the file within it.
   - This allows the evaluation metrics to be captured and passed to subsequent steps in the pipeline for further processing or analysis.

In this way, we can:

- Maintain a modular and reusable pipeline structure, where each step has a well-defined purpose and encapsulates specific processing or analysis tasks.
- Easily track and manage the flow of data and metrics within the pipeline, enabling seamless integration with other pipeline components.
- Leverage the flexibility and scalability of SageMaker's processing capabilities, such as distributed processing, managed infrastructure, and resource provisioning.

To implement the evaluation step, we need to assemble the execution environment and hand it over to the `ScriptProcessor`.

In [25]:
session_region = 'ap-south-1'  # Replace with your desired region

# Get the specific SKLearn image URI for the given region
sklearn_image_uri = sagemaker.image_uris.retrieve(
    framework='sklearn',
    version='1.0-1',
    region=session_region
)

print(sklearn_image_uri)

720646828776.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3


In [26]:
script_eval = ScriptProcessor(
    image_uri=sklearn_image_uri,
    command=["python"],
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name="script-diamonds-eval",
    role=aws_role,
    sagemaker_session=pipeline_session
)

In [27]:
evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)

Evaluation metric collection within a SageMaker pipeline should follow a [specific format](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-model-quality-metrics.html).
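
Since `evaluate.py` is not reproduced here, the following is a minimal sketch of how such a script might write its report. The `regression_metrics` layout is an assumption based on the format linked above, the function names are hypothetical, and a temporary directory stands in for `/opt/ml/processing/evaluation`.

```python
import json
import math
import os
import tempfile


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def write_evaluation_report(y_true, y_pred, output_dir):
    """Write the metrics to evaluation.json inside output_dir and return its path."""
    report = {
        "regression_metrics": {
            # A single evaluation run has no spread to report.
            "rmse": {"value": rmse(y_true, y_pred), "standard_deviation": "NaN"},
        }
    }
    os.makedirs(output_dir, exist_ok=True)
    report_path = os.path.join(output_dir, "evaluation.json")
    with open(report_path, "w") as f:
        json.dump(report, f)
    return report_path


# Inside the processing container the directory would be
# /opt/ml/processing/evaluation; a temporary directory is used here instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_evaluation_report([3.0, 5.0], [2.0, 6.0], tmp_dir)
    with open(path) as f:
        saved_report = json.load(f)

print(saved_report)
```

The file name `evaluation.json` matches the `path` given to the `PropertyFile` above, which is what allows the pipeline to locate the metrics after the step has run.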

In [28]:
step_eval = ProcessingStep(
    name="DiamondsEval",
    processor=script_eval,
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model"
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="aws/evaluate.py",
    property_files=[evaluation_report]
)

Notice how the input channels take their connections from the outputs of the previous steps.

1. `inputs=[ProcessingInput(...), ProcessingInput(...)]`:

The inputs parameter is a list that defines the input data required for the evaluation step.
The first `ProcessingInput` specifies the source location of the trained model artifacts (`step_train.properties.ModelArtifacts.S3ModelArtifacts`) and sets the destination path within the processing container as `/opt/ml/processing/model`. This provides the model artifacts to the evaluation script.
The second `ProcessingInput` specifies the source location of the processed test data output from a previous step (`step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri`) and sets the destination path within the processing container as `/opt/ml/processing/test`. This provides the test data to the evaluation script.

2. `outputs=[ProcessingOutput(...)]`:

The `outputs` parameter is a list that defines the output generated by the evaluation step.
The `ProcessingOutput` specifies the output name as `evaluation` and sets the source path within the processing container as `/opt/ml/processing/evaluation`. This defines where the evaluation results will be stored. 
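
The evaluation report can then drive the conditional logic mentioned earlier, where further steps are skipped if the model does not beat the baseline. SageMaker Pipelines provides condition steps for this purpose; the sketch below does not use them and only illustrates the idea in plain Python. The helper names, the report layout, and the threshold are assumptions.

```python
def json_get(report, dotted_path):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    value = report
    for key in dotted_path.split("."):
        value = value[key]
    return value


def passes_quality_gate(report, metric_path, max_allowed):
    """Return True when the error metric is at or below the allowed maximum."""
    return json_get(report, metric_path) <= max_allowed


# Hypothetical report, shaped like the contents of evaluation.json.
report = {"regression_metrics": {"rmse": {"value": 812.5}}}

if passes_quality_gate(report, "regression_metrics.rmse.value", max_allowed=1000.0):
    print("Model meets the baseline; continue with the next step.")
else:
    print("Model is worse than the baseline; stop here.")
```

For an error metric such as RMSE, lower is better, so the gate checks an upper bound; for a metric such as accuracy the comparison would be reversed.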

#### Step 4: Assembling the pipeline

We now have all the key components of a training pipeline that processes data, trains a model on the processed data, and computes evaluation metrics on the test data. It is time to assemble the pipeline.

In [29]:
pipeline = Pipeline(
    name="DiamondsPipeline",
    steps=[step_process, step_train, step_eval],
    sagemaker_session=pipeline_session
)

In [30]:
json.loads(pipeline.definition())

Using provided s3_resource


{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'DiamondsProcess',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.m5.xlarge',
      'InstanceCount': 1,
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '720646828776.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3',
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/transform.py']},
    'RoleArn': 'arn:aws:iam::321112151583:role/default-sagemaker-access',
    'ProcessingInputs': [{'InputName': 'input-1',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-ap-south-1-321112151583/prices/diamond-prices.csv',
       'LocalPath': '/opt/ml/processing/input',
       'S3DataType': 'S3Prefix',
       'S3InputMode': 'File',
       'S3Da

Once a pipeline is defined, it needs to be "upserted" (= update + insert) on the SageMaker infrastructure. Calling `pipeline.upsert(role_arn=aws_role)` creates the pipeline in our AWS account or, if a pipeline with the same name already exists, updates its definition. The role passed as `role_arn` is the one the pipeline uses when it runs, so it must have the necessary permissions to execute the steps.

In [31]:
pipeline.upsert(role_arn=aws_role)

Using provided s3_resource
Using provided s3_resource


{'PipelineArn': 'arn:aws:sagemaker:ap-south-1:321112151583:pipeline/diamondspipeline',
 'ResponseMetadata': {'RequestId': '909d8e6e-c938-4e28-b6fe-2c55ab3592fa',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '909d8e6e-c938-4e28-b6fe-2c55ab3592fa',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '85',
   'date': 'Tue, 06 Jun 2023 12:39:07 GMT'},
  'RetryAttempts': 0}}

The pipeline is now ready for execution (`HTTPStatusCode` is 200).

In [32]:
execution = pipeline.start()

The status of execution can be monitored using the `describe()` method of the execution object or in the SageMaker Studio UI.

In [35]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:ap-south-1:321112151583:pipeline/diamondspipeline',
 'PipelineExecutionArn': 'arn:aws:sagemaker:ap-south-1:321112151583:pipeline/diamondspipeline/execution/94zb3pycywap',
 'PipelineExecutionDisplayName': 'execution-1686055153748',
 'PipelineExecutionStatus': 'Executing',
 'PipelineExperimentConfig': {'ExperimentName': 'diamondspipeline',
  'TrialName': '94zb3pycywap'},
 'CreationTime': datetime.datetime(2023, 6, 6, 18, 9, 13, 630000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 6, 6, 18, 9, 13, 630000, tzinfo=tzlocal()),
 'CreatedBy': {},
 'LastModifiedBy': {},
 'ResponseMetadata': {'RequestId': '5ebf94a1-59b7-4cb3-9ad1-cadafdae4009',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '5ebf94a1-59b7-4cb3-9ad1-cadafdae4009',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '489',
   'date': 'Tue, 06 Jun 2023 12:40:13 GMT'},
  'RetryAttempts': 0}}
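
The `PipelineExecutionStatus` field in the output above can be polled until the execution finishes. The sketch below shows one way to do this in plain Python. The function `wait_for_completion` is hypothetical, and a stand-in replaces `execution.describe` so that the example runs without an AWS account.

```python
import time


def wait_for_completion(describe, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Call describe() repeatedly until the execution is no longer in progress."""
    in_progress = {"Executing", "Stopping"}
    for _ in range(max_polls):
        status = describe()["PipelineExecutionStatus"]
        if status not in in_progress:
            return status
        sleep(poll_seconds)
    raise TimeoutError("pipeline execution did not finish in time")


# Stand-in for execution.describe that reports success on the third call.
responses = iter([
    {"PipelineExecutionStatus": "Executing"},
    {"PipelineExecutionStatus": "Executing"},
    {"PipelineExecutionStatus": "Succeeded"},
])
final_status = wait_for_completion(lambda: next(responses), poll_seconds=0)
print(final_status)
```

With a real execution object, the call would look like `wait_for_completion(execution.describe)`. The SageMaker SDK's execution object also offers a `wait()` method that serves the same purpose.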

### Azure ML Pipelines

Azure ML pipelines are a way to define and orchestrate a series of interconnected steps or tasks to automate end-to-end machine learning workflows. These pipelines provide a structured and scalable approach to building, deploying, and managing machine learning workflows using Azure ML abstractions. Here's an explanation of Azure ML pipelines:

1. Components:
   - An Azure ML pipeline consists of multiple steps (called components), where each step represents a specific task or operation within the workflow.
   - Each step can include activities such as data preparation, feature engineering, model training, model evaluation, deployment, and more.
   - Steps can be connected in a sequential manner, where the output of one step serves as the input for the next step, allowing for a seamless flow of data and execution.

2. Interconnected Workflow:
   - Azure ML pipelines enable the creation of a workflow where the steps are interconnected, allowing for the automatic flow of data and dependencies between the steps.
   - By defining the dependencies between steps, Azure ML ensures that each step is executed in the correct order, respecting the dependencies and data flow.

3. Handoff Automation:
   - Azure ML pipelines automate the handoff between steps by managing the input and output data of each step.
   - The output of one step is automatically passed as input to the subsequent step, eliminating the need for manual intervention or data transfer.
   - This automation simplifies the overall workflow and reduces the risk of errors or inconsistencies in data handoff.

#### Step 1: Preprocessing

![azure-processing-map](assets/azure-processing-map.drawio.png)

In [36]:
diamond_prices = ml_client.data.get("diamond-prices-jan", version="1")

In [37]:
step_process = command(
    name="data_prep_diamond_prices",
    display_name="Data preparation for training",
    description="read a .csv input, split the input to train and test",
    inputs={
        "data": Input(type="uri_folder"),
        "test_train_ratio": Input(type="number"),
    },
    outputs=dict(
        train_data=Output(type="uri_folder", mode="rw_mount"),
        test_data=Output(type="uri_folder", mode="rw_mount"),
    ),
    # The source folder of the component
    code='azure/components/data_prep/',
    command="""python tts.py \
            --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} \
            --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}} \
            """,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)

A key difference from before is the usage of the `Output` class to capture the output from the processing script. This object captures the output as a registered folder and holds the state as an attribute. This will enable us to automate the handoff between the processing step and the training step.

Previously, this command could itself be executed using the `ml_client.create_or_update` method. Here, however, we hold off execution and treat each command as a component. So, while the command knows its input and output types, it cannot be executed until concrete values for them are supplied as part of the pipeline.
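
To see why a component cannot run until its inputs and outputs are bound, consider the placeholders in the command string. The toy renderer below substitutes `${{inputs.x}}` and `${{outputs.y}}` with concrete values. It is not how Azure ML implements this; the function `render_command` and the mount paths are assumptions for illustration.

```python
import re


def render_command(template, inputs, outputs):
    """Replace ${{inputs.x}} and ${{outputs.y}} placeholders with concrete values."""
    bindings = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        group, name = match.group(1), match.group(2)
        # A missing binding raises KeyError: the command is not yet runnable.
        return str(bindings[group][name])

    return re.sub(r"\$\{\{(inputs|outputs)\.(\w+)\}\}", substitute, template)


template = (
    "python tts.py --data ${{inputs.data}} "
    "--test_train_ratio ${{inputs.test_train_ratio}} "
    "--train_data ${{outputs.train_data}}"
)

command_line = render_command(
    template,
    inputs={"data": "/mnt/data", "test_train_ratio": 0.2},  # hypothetical values
    outputs={"train_data": "/mnt/out/train"},  # hypothetical mount path
)
print(command_line)
```

Until every placeholder has a value, the template is only a description of work to be done, which is the sense in which a component differs from a job.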

In [38]:
step_process.component.inputs

{'data': {'type': 'uri_folder'}, 'test_train_ratio': {'type': 'number'}}

In [39]:
step_process.component.outputs

{'train_data': {'type': 'uri_folder', 'mode': 'rw_mount'},
 'test_data': {'type': 'uri_folder', 'mode': 'rw_mount'}}

#### Step 2: Training

![azure-train-map](assets/azure-training-map.drawio.png)

In [40]:
step_train = command(
    name="train_diamond_prices_model",
    display_name="Training a diamond price model",
    description="train a gradient boosted regression model",
    inputs={
        "train_data": Input(type="uri_folder"),
        "test_data": Input(type="uri_folder"),
        "learning_rate": Input(type="number"),
        "registered_model_name": Input(type="string")
    },
    outputs=dict(
        model=Output(type="uri_folder", mode="rw_mount")
    ),
    # The source folder of the component
    code='azure/components/train/',
    command="""python gbr.py \
              --train_data ${{inputs.train_data}} \
              --test_data ${{inputs.test_data}} \
              --learning_rate ${{inputs.learning_rate}} \
              --registered_model_name ${{inputs.registered_model_name}} \
              --model ${{outputs.model}}
            """,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
)

Notice that the inputs are specified exactly as before. The script is also parameterized the same way. The output is configured as a folder where the model object will be stored.

In [41]:
step_train.component.inputs

{'train_data': {'type': 'uri_folder'},
 'test_data': {'type': 'uri_folder'},
 'learning_rate': {'type': 'number'},
 'registered_model_name': {'type': 'string'}}

In [42]:
step_train.component.outputs

{'model': {'type': 'uri_folder', 'mode': 'rw_mount'}}

So far we have two components that are defined independently and are not connected to each other. This is where the `dsl.pipeline` decorator comes into play.

#### Step 3: Assemble Pipeline

In [43]:
@dsl.pipeline(
    compute='c002',
    description="data preparation and training pipeline"
)
def diamond_prices_pipeline(
    pipeline_job_data_input,
    pipeline_job_test_train_ratio,
    pipeline_job_learning_rate,
    pipeline_job_registered_model_name,
):
    # call step_process like a Python function, with its own inputs
    data_prep_job = step_process(
        data=pipeline_job_data_input,
        test_train_ratio=pipeline_job_test_train_ratio,
    )

    # call step_train like a Python function, with its own inputs
    train_job = step_train(
        train_data=data_prep_job.outputs.train_data,  # note: using outputs from previous step
        test_data=data_prep_job.outputs.test_data,  # note: using outputs from previous step
        learning_rate=pipeline_job_learning_rate,  # note: using a pipeline input as parameter
        registered_model_name=pipeline_job_registered_model_name,
    )

    # a pipeline returns a dictionary of outputs
    # the keys become the pipeline output identifiers
    return {
        "pipeline_job_train_data": data_prep_job.outputs.train_data,
        "pipeline_job_test_data": data_prep_job.outputs.test_data,
    }

Notice how this function turns each of the previously defined steps into a job by supplying values for its parameters. For the training job, the outputs of the data preparation job (`data_prep_job.outputs`) are passed in as inputs, which is what connects the two components.

In [44]:
registered_model_name = "diamond_prices_model_v1"

In [45]:
pipeline = diamond_prices_pipeline(
    pipeline_job_data_input=Input(type="uri_file", path=diamond_prices.path),
    pipeline_job_test_train_ratio=0.2,
    pipeline_job_learning_rate=0.1,
    pipeline_job_registered_model_name=registered_model_name,
)

In [46]:
pipeline_job = ml_client.jobs.create_or_update(
    pipeline,
    # Project's name
    experiment_name="Training pipeline with registered components"
)

Submitting a pipeline job in Azure ML provides a web view where the job's progress can be tracked. Logs can also be streamed from the pipeline like so:

In [47]:
ml_client.jobs.stream(pipeline_job.name)

RunId: patient_fig_wc9sd73mqr
Web View: https://ml.azure.com/runs/patient_fig_wc9sd73mqr?wsid=/subscriptions/5bcad9c4-40fb-4136-b614-cc90116dd8b3/resourcegroups/tf/workspaces/cloud-teach

Execution Summary
RunId: patient_fig_wc9sd73mqr
Web View: https://ml.azure.com/runs/patient_fig_wc9sd73mqr?wsid=/subscriptions/5bcad9c4-40fb-4136-b614-cc90116dd8b3/resourcegroups/tf/workspaces/cloud-teach



So far, we have built and executed a single pipeline. While this pipeline can be executed on demand, what if we want it to be triggered automatically whenever any component within the pipeline changes? This includes changes in hyperparameter values or evaluation metrics.

To achieve automatic execution upon detecting changes in the pipeline, we need mechanisms to identify specific modifications and define actions to be taken when changes are detected. Let's now focus on the initial aspect of this challenge: tracking changes in the pipeline.
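One simple way to see what "detecting a change" means is to fingerprint a component's source folder and compare the result with the value recorded at the last run. The sketch below only illustrates that idea with the standard library. The function name `fingerprint_folder` is hypothetical, and `git`, covered next, tracks changes far more thoroughly.

```python
import hashlib
from pathlib import Path


def fingerprint_folder(folder):
    """Return a SHA-256 digest over the relative paths and contents of all files in a folder."""
    digest = hashlib.sha256()
    root = Path(folder)
    # Sort the files so the digest does not depend on directory listing order
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so that renaming a file also changes the digest
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Calling `fingerprint_folder("azure/components/train/")` before and after an edit and comparing the two digests tells whether the training component's code has changed.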

# Version Control with `git`

## Introduction

Version control is essential in an MLOps workflow for several reasons.

a. Collaboration and Teamwork:
   - In machine learning projects, multiple team members often work together on different aspects, such as data preprocessing, model development, and evaluation. Version control systems, like Git, support this by allowing team members to work concurrently on the same project without overwriting each other's work.
   - Git allows individuals to work on their own branches and later merge changes from different team members. Hosting platforms built around Git, such as GitHub, add a shared place for coordination and code reviews.

b. Traceability and Reproducibility:
   - MLOps workflows require traceability and reproducibility to ensure transparency and maintain the integrity of the pipeline. Version control systems provide a mechanism to track changes made to the code, data, and configurations over time.
   - With Git, each commit represents a specific version of the codebase, making it possible to trace the evolution of the pipeline. This traceability ensures that the entire history of changes is available, enabling reproducibility and error investigation.

c. Experimentation and Iteration:
   - Machine learning involves experimentation and iteration to improve model performance. Version control systems facilitate the management and tracking of different experiments, enabling teams to compare different approaches and analyze their impact on the model.
   - With Git, each experiment can be treated as a separate branch or commit, allowing for easy comparison and rollback if necessary. This ability to iterate quickly and maintain a historical record of experiments is crucial for optimizing models and achieving desired results.

d. Rollbacks and Bug Fixes:
   - In complex ML pipelines, issues and bugs are inevitable. Version control systems offer the capability to revert to previous working states, providing a safety net for rollbacks and bug fixes.
   - Git allows teams to roll back to a specific commit or create a new branch to fix issues while preserving the integrity of the codebase. It ensures that previous working versions can be easily restored, preventing potential disruptions in the pipeline.
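To make the traceability point concrete, a training run can record the Git commit it was produced from next to its hyperparameters. The following is a minimal sketch, assuming `git` is installed and the script runs inside a repository. `current_commit` and `describe_run` are hypothetical helper names, not part of any library.

```python
import json
import subprocess


def current_commit(repo_dir="."):
    """Return the full hash of HEAD, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository with commits
        return None
    return result.stdout.strip()


def describe_run(params, commit):
    """Bundle the hyperparameters with the commit that produced them as a JSON string."""
    return json.dumps({"commit": commit, "params": params}, sort_keys=True)


print(describe_run({"learning_rate": 0.1}, current_commit()))
```

Storing this record with the trained model makes it possible to check out the exact code version later and rerun the experiment.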

## The `clone` - `commit` - `push` workflow

![git-workflow](assets/git-workflow.drawio.png)

The clone-commit-push workflow is a common Git workflow for working with remote repositories. It involves cloning a repository, making changes, committing those changes, and pushing them back to the remote repository. Here's a step-by-step explanation with examples:

1. Clone the Repository:
   - To clone a remote repository, use the `git clone` command followed by the repository's URL.
   - Example:
     ```
     git clone https://github.com/example-user/example-repo.git
     ```
   - This creates a local copy of the remote repository on your machine.

2. Make Changes:
   - Change into the cloned repository's directory:
     ```
     cd example-repo
     ```
   - Make the necessary changes to the files in the repository using any text editor or IDE.

3. Commit Changes:
   - Stage the changes you want to commit using the `git add` command.
   - Example:
     ```
     git add modified_file.py
     ```
   - Commit the staged changes with a meaningful commit message using the `git commit` command.
   - Example:
     ```
     git commit -m "Update modified_file.py with new feature"
     ```

4. Push Changes:
   - Push the committed changes to the remote repository using the `git push` command.
   - Example:
     ```
     git push origin master
     ```
   - This command pushes the committed changes to the `master` branch of the remote repository named `origin`.

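Note: `origin` is the default name Git assigns to the remote you cloned from. Replace it with the appropriate name if your remote is named differently. Likewise, replace `master` with your repository's default branch if it has a different name (for example, `main`).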

5. Pull Changes (Optional):
   - If you are working in a team or collaborating with others, it's a good practice to pull the latest changes from the remote repository before making your own changes.
   - Use the `git pull` command to fetch and merge the latest changes from the remote repository.
   - Example:
     ```
     git pull origin master
     ```
   - This ensures that your local repository is up to date with the remote repository before you make your own modifications.

The clone-commit-push workflow allows you to work on your local repository, make changes, commit them with informative messages, and push them back to the remote repository. It facilitates collaboration, version control, and the seamless integration of changes into the project.
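In an automated setting, the stage, commit, and push steps are often scripted rather than typed by hand. The sketch below shows one way this could look in Python, using only the standard library. The helper names `commit_and_push_commands` and `run_commands` are hypothetical, and the example only prints the commands instead of running them.

```python
import shlex
import subprocess


def commit_and_push_commands(files, message, remote="origin", branch="master"):
    """Build the git commands for staging files, committing them, and pushing."""
    return [
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]


def run_commands(commands, repo_dir="."):
    """Run each command in order inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


commands = commit_and_push_commands(
    ["modified_file.py"], "Update modified_file.py with new feature"
)
# Print the commands for inspection; call run_commands(commands) to execute them
for command in commands:
    print(shlex.join(command))
```

Passing each command as a list of arguments avoids shell quoting problems with commit messages that contain spaces or special characters.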

## Branching

![git-branches](assets/git-branch.drawio.png)

In a collaborative development environment where multiple developers are working on different aspects of a proposed change, Git branching is a valuable feature that enables an organized and efficient workflow. Each developer can work on a separate branch, allowing them to make independent changes without interfering with each other's work. Here's an explanation of the usage of Git branching in this context:

1. Creating Branches:
   - Each developer starts by creating their own branch, typically based on the main branch (e.g., "master" or "main").
   - Developers can use the `git branch` command to create a new branch or the `git checkout -b` command to create and switch to a new branch in one step.
   - Example:
     ```
     git branch feature-branch
     ```
     or
     ```
     git checkout -b feature-branch
     ```

2. Working on Branches:
   - Each developer now works on their respective branch, focusing on their specific tasks or changes.
   - They can make changes, add new features, fix bugs, or modify code without affecting the main branch or other developers' work.
   - Developers commit their changes locally to their branch using the standard `git add` and `git commit` commands.

3. Sharing Branches:
   - Developers can push their local branches to a shared remote repository to collaborate with others.
   - They use the `git push` command with the branch name and the remote repository to push their branch.
   - Example:
     ```
     git push origin feature-branch
     ```

4. Reviewing and Merging Changes:
   - Once a developer has completed their work on the branch, they can create a pull request or merge request, depending on the Git hosting platform being used (e.g., GitHub, GitLab, Bitbucket).
   - The pull request allows other developers to review the changes, provide feedback, and discuss the proposed changes before merging them into the main branch.
   - After the review and approval process, the changes from the branch can be merged into the main branch using the platform's interface.

5. Updating Branches:
   - During the development process, other developers may make changes to the main branch.
   - To incorporate those changes into their branch, developers can perform a branch update by switching to their branch and using the `git merge` command or `git rebase` command to integrate the latest changes from the main branch.
   - Example:
     ```
     git checkout feature-branch
     git merge main
     ```

By utilizing Git branching, developers can work on different aspects of a proposed change simultaneously, without interfering with each other's work. It allows for parallel development, easy collaboration, and the ability to review, discuss, and merge changes in an organized manner.

# References

- **SageMaker Pipelines**

1. [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html)
2. [Example 1](https://catalog.us-east-1.prod.workshops.aws/workshops/9a6bcca9-93d6-4e09-ada2-64b692267342/en-US/pipelines)
3. [Example 2](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.html)

- **Azure Pipelines**

1. [DSL Documentation](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.dsl?view=azure-python#functions)
2. [Example 1](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-component-pipeline-python?view=azureml-api-2) 