<a href="https://colab.research.google.com/github/Hyperspectral01/AzureML_Step-by-Step_Pipelining/blob/main/readme_MLOps.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**References**:

Sentiment Analysis Model Creation and Pretraining: https://www.kaggle.com/code/priyankdl/sentiment-analysis-imdb-torchtext-gru

What is MLOps?: https://www.youtube.com/watch?v=6SRifO6dmuE&t=663s

Special Thanks to Microsoft Azure for giving Free Credits :)

**Pre-Requisites:**

PyTorch

GRU

NumPy

Familiarity with terms like training, testing, and API endpoints.



**WHY PIPELINES in AI and ML**?

Machine learning models can be deployed directly to platforms such as Microsoft Azure ML or AWS, so why bother creating a pipeline? Suppose you want to deploy a new model that does the same job as the previous one, but better. How would you do it? Would you deploy it on a second REST endpoint and then delete the endpoint of the first model? Consider a model built for cricket that predicts the number of runs a team is likely to score in a match. Such a model has to be trained on data from all previous matches, and at inference time it receives parameters such as run rate and wickets down to produce the final prediction. Because new match data keeps arriving, this process cannot be driven manually: everything from training to testing to deployment on an endpoint has to be automated. That brings us to the first goal of pipelines: AUTOMATION. The second goal is ACCESS CONTROL over the different parts of the pipeline and their subsequent development.

**What is done in this project?**

**Brief Idea:**

We have taken a model for Sentiment Analysis and created CI and CD Pipelines that automate training and deployment in the following way:

The CI Pipeline takes the newly uploaded dataset, trains the already deployed model on it, and registers the result as a separate model. It then checks whether this new model is better than the already deployed model and stores a bool value in a txt file as its output.

The CD Pipeline gets called if the bool value output of the CI Pipeline is true. It then takes the registered model with the latest version and deploys it.
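
The hand-off between the two pipelines can be sketched in plain Python. This is a minimal illustration, not the project's actual script3.py: the function names are hypothetical, and using accuracy as the measure of "better" is an assumption. Only the file name `ci_output_status.txt` and the "True"/"False" convention come from this project.

```python
from pathlib import Path
import tempfile


def write_ci_status(new_accuracy: float, old_accuracy: float, status_file: Path) -> bool:
    """CI side: record whether the retrained model beats the deployed one."""
    is_better = new_accuracy > old_accuracy  # assumption: accuracy is the metric
    status_file.parent.mkdir(parents=True, exist_ok=True)
    status_file.write_text(str(is_better))  # writes the text "True" or "False"
    return is_better


def should_deploy(status_file: Path) -> bool:
    """CD side: deploy only when the status file contains "True"."""
    return status_file.read_text().strip() == "True"


# Small demonstration in a temporary folder (the accuracy values are made up)
with tempfile.TemporaryDirectory() as tmp:
    status_path = Path(tmp) / "outputs" / "ci_output_status.txt"
    write_ci_status(new_accuracy=0.91, old_accuracy=0.88, status_file=status_path)
    print("Run CD pipeline:", should_deploy(status_path))
```

Keeping the decision in a small text file means the CD side needs nothing from the CI run except that one file.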

**Detailed Idea:**

1. Setup Workspace and Compute Resources
   - Created a workspace and configured necessary compute resources for processing and training.

2. Upload Files and Resources
   - Uploaded the pretrained model and vocab.txt to the workspace.

3. Connect to Workspace
   - Established a connection to the Azure workspace for managing and accessing resources.

4. Model Registration
   - Registered the pretrained model within the workspace for version control and accessibility.

5. Upload Vocabulary File
   - Uploaded vocab.txt to the workspace for use in scripts requiring tokenization.

6. Dataset Preparation
   - Stored dataset.csv in Azure Blob Storage to make it accessible for processing scripts.

7. script1.py: Data Preparation
   - Takes the dataset named in the input argument, preprocesses it, splits it into train_x, train_y, test_x, and test_y, and stores them at the location given by the output argument.

8. script2.py: Model Training
   - Takes the registered model together with train_x and train_y from the input argument, trains the model, and registers the newly trained model. It passes test_x and test_y, along with the old and new model names, from the input argument to the output argument.

9. script3.py: Model Comparison and Evaluation
   - Takes test_x, test_y, the old model name, and the new model name from the input argument and tests both models to see whether the new model is better than the old one. If it is, it stores True in the output text file; otherwise it stores False.

10. Pipeline for Model Training and Evaluation
    - Designed and set up a pipeline to automate the steps of training, evaluating, and model comparison.

11. score.py: Web Input Processing and Prediction
    - Wrote a script that tokenizes incoming web input using vocab.txt, feeds it to the model, and generates a prediction output.

12. script4.py: Deployment Script
    - Created a deployment script that deploys the latest registered model along with score.py to serve web requests.

13. Pipeline for Deployment
    - Developed a pipeline to streamline the deployment process, integrating it with the model and input processing script.

14. MLOps Function
    - Takes the uploaded dataset name as a parameter and runs both pipelines.
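
The split described in item 7 can be sketched as follows. This is only an illustration of the idea, not the contents of script1.py: the function name, the 80/20 ratio, the fixed seed, and the in-memory rows are all assumptions. Labels follow this project's convention of 0 for positive and 1 for negative.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=0):
    """Shuffle (text, label) rows and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]
    train_x = [text for text, _ in train_rows]
    train_y = [label for _, label in train_rows]
    test_x = [text for text, _ in test_rows]
    test_y = [label for _, label in test_rows]
    return train_x, train_y, test_x, test_y


# Made-up example rows: (review text, label)
rows = [("great film", 0), ("terrible plot", 1), ("loved it", 0), ("boring", 1), ("fine", 0)]
train_x, train_y, test_x, test_y = split_dataset(rows)
print(len(train_x), "training rows,", len(test_x), "test rows")
```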



---



**STEPS TO FOLLOW TO REPRODUCE THE PROJECT**



---



**STEP 1** : **Login**

Click on [Microsoft Azure](https://azure.microsoft.com/en-in) to open the website, then click the login button in the top right-hand corner. Fill in details such as your Outlook mail ID, phone number, and debit or credit card details, and log in.

**STEP 2:** **Go to Azure ML**

In the Azure Services section, look for Azure ML and click on it.

**STEP 3:** **Initialising a Workspace and a Resource Group:**

Click on the Create button and then New Workspace to create a new workspace.

Subscription Type:

        Free trial or any subscription

Resource Group :

        Create New and then name it.

Workspace Details:

        Name: Give any name

        Region: Whatever is closest to you

        Storage Account, Key Vault, and Application Insights will be filled in by default.

        Container Registry: None (We would not be making any containers explicitly)

Then click on Review + Create and then Create.


**STEP 4:** **Initialising a compute**:

Click on Launch studio, then:

1. Click on the Compute section on the left-hand side of the screen.

2. Click on Compute Clusters at the top.

3. Click on the "New" button.

4. Select the appropriate options and click on NEXT.

5. Give the cluster a Compute Name (you can look it up again later when needed); do not enable SSH validation.

6. Click on "Create".

**STEP 5:** **Install the following dependencies**

Open a Colab notebook (.ipynb) and follow the steps below.


In [None]:
#We use the azureml.core library (the Python SDK) to connect to our Azure workspace as a client
#Restart the session if required

!pip install azureml
!pip install azureml.core
!pip install azureml-dataset-runtime
!pip install azureml.pipeline
!pip install --upgrade azureml-sdk


**STEP 6:** **Import all the dependencies**

**Do not** forget to upload all the files linked in this project to the session storage (the panel on the left-hand side in Colab).

In [None]:
#These classes are used for uploading to and downloading from Azure Blob Storage, and for registering environments, models, and datasets
from azureml.core import Workspace, Datastore, Dataset, Model, Environment, Experiment

#Configuring the environment and the compute for the scripts running in the pipeline
from azureml.core.runconfig import RunConfiguration
from azureml.core.compute import ComputeTarget, AmlCompute

#Initialising pipeline
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

#To get a published pipeline
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.pipeline.core import PublishedPipeline

**STEP 7:** **Connecting to our workspace**

In [None]:
# Connect to the workspace by specifying subscription ID, resource group, and workspace name
# These values can be copied by clicking on "Free Trial" at the top left of the Azure portal after launching the studio
workspace = Workspace(subscription_id="<your-subscription-id>",
                      resource_group="<your-resource-group-name>",
                      workspace_name="<your-workspace-name>")


print("Connection successful")

# Check connection
workspace.get_details()


**STEP 8:** **Upload the existing model (new_model_state_dict.pth)**

In [None]:
#NOTE : MAKE SURE THAT ALL FILES OF THIS PROJECT HAVE BEEN UPLOADED TO SESSION STORAGE IN COLAB ON THE LEFT HAND SIDE

#The upload is done by registering the model on the azure workspace (gets stored there as an artifact)

models = Model.list(workspace)

#Gets the latest version so far (models without a "version" tag are skipped instead of raising an error)
latest_version = max([int(model.tags["version"]) for model in models if "version" in (model.tags or {})], default=0)

# Register the model
model = Model.register(workspace=workspace,
                       model_path="/content/new_model_state_dict.pth",  # Path to your model .pth file
                       model_name="new_model_state_dict.pth",
                       description=f"Model version {latest_version+1}",
                       tags={"version": f"{latest_version+1}" , "deployed": "true"})

print(f"Model registered: {model.name}, Version: {model.version}")

**STEP 9:** **Upload the vocab.txt file**

In [None]:
#We will upload by registering it as a dataset (artifact) as well as uploading it in Azure Blob Storage

# Get the default datastore for the workspace
datastore = workspace.get_default_datastore()

# Upload the vocab.txt file to the datastore
datastore.upload_files(files=["/content/vocab.txt"],
                       target_path="vocab_data/",  # Folder in the datastore where the file will be stored
                       overwrite=True)

# Register the file as a dataset in the workspace (optional)
vocab_dataset = Dataset.File.from_files((datastore, "vocab_data/vocab.txt"))
vocab_dataset = vocab_dataset.register(workspace=workspace,
                                       name="vocab_dataset",
                                       description="Vocabulary file",
                                       create_new_version=True)

print("File uploaded and registered as dataset.")
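
For context on why vocab.txt is needed, here is a rough sketch of how a script such as score.py might turn text into token ids with it. It is not the project's implementation. The file layout (one token per line, id = line index), the `<unk>` fallback token, and the function names are all assumptions; check the actual vocab.txt and score.py before relying on any of them.

```python
def load_vocab(lines):
    """Map each token to its line index (assumed format: one token per line)."""
    return {token.strip(): index for index, token in enumerate(lines) if token.strip()}


def encode(text, vocab, unk_token="<unk>"):
    """Lowercase the text, split on whitespace, and look up each token's id."""
    unk_id = vocab.get(unk_token, 0)  # unknown words fall back to the <unk> id
    return [vocab.get(token, unk_id) for token in text.lower().split()]


# A tiny made-up vocabulary standing in for the lines of vocab.txt
vocab_lines = ["<unk>", "<pad>", "the", "movie", "was", "great"]
vocab = load_vocab(vocab_lines)
print(encode("The movie was GREAT fun", vocab))  # "fun" is not in the vocabulary
```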

**STEP 10:** **Upload the Sample Dataset File**

In [None]:
#Refer to the format of the dataset file for custom dataset csv files

# Get the default datastore (usually points to Azure Blob Storage)
blob_datastore = workspace.get_default_datastore()

# Upload dataset.csv to Blob Storage in a specific folder
blob_datastore.upload_files(files=["/content/dataset.csv"], #Our dataset file is named dataset.csv; any name works as long as it is used consistently across the code
                            target_path="datasets/",   # Folder in Blob Storage
                            overwrite=True)

print("File uploaded to Azure Blob Storage.")
#You can always check out the blob storage in the Data Section in the Azure Machine Learning Studio

**STEP 11:** **Changes in score.py**

1) Go to Microsoft Azure Home, click on Storage Accounts, and select your storage account (refer to Step 3). On the left-hand side, click on Data Storage -> Containers and find the container whose name has the prefix "azureml-blobstore". Copy the full container name and paste it into the "container_name" variable in score.py.

2) Go to Microsoft Azure Home, click on Storage Accounts, and select your storage account (refer to Step 3). On the left-hand side, click on Security + networking -> Access keys, click on "Show" for either connection string, then copy it and paste it into the "connection_string" variable in score.py.

**Changes in script1.py, script2.py, script3.py, script4.py**

In each of these files, change the following lines of code:  


    workspace = Workspace(subscription_id="<your-subscription-id>",
                          resource_group="<your-resource-group-name>",
                          workspace_name="<your-workspace-name>")

so that they match your own subscription ID, resource group, and workspace name.

**STEP 12:** **Registering a New Environment for the pipeline**

In [None]:
# Load the environment from your .yml file
#NOTE: This yml file is according to the scripts used in this pipeline, if some other scripts are used, then make sure to include all dependencies in the env.yml file
env = Environment.from_conda_specification(name="pipeline-env", file_path="/content/env.yml")

# Register the environment (without the version argument)
env.register(workspace=workspace) #It will automatically take care of the version in case the environment name is same

**STEP 13:** **Loading the Environment**

In [None]:
env=Environment.get(workspace=workspace, name="pipeline-env") #Loading the same registered environment with its name
#By default takes the latest version

**STEP 14:** **Initialising the run_config for pipeline**

In [None]:
run_config = RunConfiguration()
run_config.environment = env

**STEP 15:** **Initialising the compute cluster for the pipelines**

In [None]:
compute_target = ComputeTarget(workspace=workspace, name="<name-of-compute-cluster-from-step-4>")  #The compute cluster that we created in step 4
#If you don't remember the name of the cluster, you can look it up in the Compute section on the left-hand side of Azure Machine Learning Studio

**STEP 16:** **Defining Pipeline Parameters**

In [None]:
# Define data paths
processed_data = PipelineData("processed_data", datastore=workspace.get_default_datastore()) #Output of script1.py and input to script2.py
processed_data2 = PipelineData("processed_data2", datastore=workspace.get_default_datastore())  # Output of script2.py and input to script3.py

# Define dataset name parameter for the pipeline
dataset_name_param = PipelineParameter(name="dataset_name", default_value="dataset.csv") #input to script1.py

**STEP 17:** **Creating and Publishing the CI Pipeline**

In [None]:
#DEFINING THE STEPS OF CI PIPELINE

# Step 1: Run script1 (data preprocessing)
script1_step = PythonScriptStep(
    name="Data_Preprocessing",
    script_name="script1.py",
    arguments=["--dataset_name", dataset_name_param,"--processed_data",processed_data],
    outputs=[processed_data],
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/content"  # Ensure this is the correct path
)

# Step 2: Run script2 (model training and registration)
script2_step = PythonScriptStep(
    name="Model_Training",
    script_name="script2.py",
    arguments=["--processed_data", processed_data,"--processed_data2",processed_data2],
    inputs=[processed_data],
    outputs=[processed_data2],  # This should be a different output, not the same as input
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/content"  # Ensure this is the correct path
)

# Step 3: Run script3 (evaluation and tagging)
script3_step = PythonScriptStep(
    name="Model_Evaluation_and_Tagging",
    script_name="script3.py",
    arguments=["--processed_data2",processed_data2],
    inputs=[processed_data2],
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/content"  # Ensure this is the correct path
)

# CREATING THE CI PIPELINE FROM THE STEPS
ci_pipeline = Pipeline(workspace=workspace, steps=[script1_step, script2_step, script3_step])

# Validate the pipeline
ci_pipeline.validate()

# PUBLISHING THE CI PIPELINE
ci_pipeline.publish(name="CI_Pipeline", description="CI pipeline for model training and evaluation.")



#While running this cell, go to Azure Machine Learning Studio (review step 4), click on the Pipelines section on the left-hand side, click on the latest job that shows as running, and double-click on the pipeline block that shows as running.
#Go to the Outputs+logs section and refresh until you see the user logs, then click on std_log.txt and refresh until a link with a code appears. Open the link and paste the code.
#The above steps have to be repeated for all the blocks of the pipelines.
#This is a one-time authentication required by the pipelines to access your resources, such as Azure Blob Storage.
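
Each step's `arguments` list is passed to its script on the command line, with the PipelineData entry arriving as a path, so the script has to parse those flags itself. A minimal sketch of what the argument parsing in script1.py might look like is shown below. The flag names match the `arguments` list of the Data_Preprocessing step, but the function name, the default value, and the example path are assumptions.

```python
import argparse


def parse_script1_args(argv=None):
    """Parse the flags that the Data_Preprocessing step passes to script1.py."""
    parser = argparse.ArgumentParser(description="Data preparation step (sketch)")
    # Name of the uploaded CSV file, supplied through the pipeline parameter
    parser.add_argument("--dataset_name", type=str, default="dataset.csv")
    # Folder where the step should write its outputs
    parser.add_argument("--processed_data", type=str, required=True)
    return parser.parse_args(argv)


# Example call with an explicit argument list (the path is a made-up placeholder)
args = parse_script1_args(["--dataset_name", "dataset.csv", "--processed_data", "/tmp/processed_data"])
print(args.dataset_name, args.processed_data)
```

Inside the real script the function would be called without an argument list so that it reads the flags from the command line.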

**STEP 18:** **Creating and Publishing the CD Pipeline**

In [None]:
# Step 4: Run script4 (deployment)
script4_step = PythonScriptStep(
    name="Model_Deployment",
    script_name="script4.py",
    compute_target=compute_target,
    runconfig=run_config,
    source_directory="/content"
)

# Create CD pipeline
cd_pipeline = Pipeline(workspace=workspace, steps=[script4_step])
cd_pipeline.validate()

# Publish the CD pipeline
cd_pipeline.publish(name="CD_Pipeline", description="CD pipeline for model deployment.")


#While running this cell, go to Azure Machine Learning Studio (review step 4), click on the Pipelines section on the left-hand side, click on the latest job that shows as running, and double-click on the pipeline block that shows as running.
#Go to the Outputs+logs section and refresh until you see the user logs, then click on std_log.txt and refresh until a link with a code appears. Open the link and paste the code.
#This is a one-time authentication required by the pipelines to access your resources, such as Azure Blob Storage.

**STEP 19:** **Getting the Published Pipelines to run them**

In [None]:
# Authentication
interactive_auth = InteractiveLoginAuthentication()
datastore = workspace.get_default_datastore()

# Reference to published CI and CD pipelines by their name
def get_pipeline_by_name(workspace, pipeline_name):

    # List all published pipelines and find the one by name
    published_pipelines = PublishedPipeline.list(workspace)
    for pipeline in published_pipelines:
        if pipeline.name == pipeline_name:
            return pipeline
    raise ValueError(f"Pipeline with name '{pipeline_name}' not found.")

# Get the reference to published CI and CD pipelines
ci_pipeline_published = get_pipeline_by_name(workspace=workspace, pipeline_name="CI_Pipeline")
cd_pipeline_published = get_pipeline_by_name(workspace=workspace, pipeline_name="CD_Pipeline")

**STEP 20:** **Calling a function MLOps() to run the pipelines**

In [None]:
def MLOps(name_of_the_dataset):

    # Run CI pipeline with dynamic dataset name parameter
    ci_run = ci_pipeline_published.submit(
        workspace,
        experiment_name="CI_Pipeline_Run",
        pipeline_parameters={"dataset_name": name_of_the_dataset}
    )
    ci_run.wait_for_completion()

    # Check if CI pipeline run succeeded
    if ci_run.get_status() == "Finished":

        # Path of the CI output file in the datastore (created in script3)
        file_path = 'outputs/ci_output_status.txt'

        # Local folder where the downloaded file is stored
        local_path = './'

        # Download the output of the CI pipeline only after the run has finished
        datastore.download(target_path=local_path, prefix=file_path, overwrite=True)

        # Load the CI output status from the file created in script3
        with open("./outputs/ci_output_status.txt", "r") as f:
            ci_output = f.read().strip() == "True"

        if ci_output:

            # Run CD pipeline if new model is better than the deployed model
            cd_run = cd_pipeline_published.submit(workspace, experiment_name="CD_Pipeline_Run")
            cd_run.wait_for_completion()
            print("Model successfully deployed.")

        else:
            print("Model evaluation did not pass. CD pipeline will not run.")
    else:
        print("CI pipeline failed, stopping deployment.")





MLOps("dataset.csv")  #Here you can put the name of the dataset that you had uploaded



#While running this cell for the first time, go to Azure Machine Learning Studio (review step 4), click on the Pipelines section on the left-hand side, click on the latest job that shows as running, and double-click on the pipeline block that shows as running.
#Go to the Outputs+logs section and refresh until you see the user logs, then click on std_log.txt and refresh until a link with a code appears. Open the link and paste the code.
#The above steps have to be repeated for all the blocks of the pipelines.
#This is a one-time authentication required by the pipelines, when run for the first time, to access your resources, such as Azure Blob Storage.

You can now view your deployment in **Azure Machine Learning Studio** (review Step 4): click on the **Endpoints** section on the left-hand side, then click on **sentiment-analyser-endpoint** to view your API endpoint.

**Incoming JSON Data File Format** : {"input_text" : "a sample text or a review here"}

**Output JSON Data File Format** : {"prediction" : 0 or 1}, where 0 means a positive review and 1 means a negative review
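
A small sketch of building a request body and reading a response in these two formats is given below. The helper names are hypothetical, and the sketch stops short of sending the request, since the scoring URL and key depend on your own endpoint.

```python
import json


def build_request(text):
    """Serialize a review into the input format expected by the endpoint."""
    return json.dumps({"input_text": text})


def describe_response(body):
    """Turn the endpoint's JSON response into a readable label."""
    prediction = json.loads(body)["prediction"]
    return "positive" if prediction == 0 else "negative"  # 0 = positive, 1 = negative


print(build_request("a sample text or a review here"))
print(describe_response('{"prediction": 0}'))  # example response, written by hand
```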