## Preliminary steps
1. Change the variables wherever a cell is marked "TODO:".
2. When uploading the data, select "File (uri_file)" as the type.
3. You may give the dataset a different name, but then update the references to it in the code accordingly.
![Uploading_dataset](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Upload_dataset.png)

4. In the top-right corner, change the kernel from Python 3 (ipykernel) to Python 3.8 - Azure ML.
![Select_kernel.png](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Select_kernel.png)

In [None]:
## Install Azure SDK library
!pip3 install azure-ai-ml

## Connect to the workspace

Before you dive into the code, you'll need to connect to your Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We're using `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` handles most Azure SDK authentication scenarios.

If it doesn't work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
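
If `DefaultAzureCredential` cannot obtain a token in your environment, one option is to try several credential types in order. The sketch below is a minimal illustration rather than part of the original notebook: `first_working_credential` is a hypothetical helper, and it assumes each credential object exposes `get_token(scope)`, as the `azure-identity` credential classes do.

```python
def first_working_credential(factories, scope="https://management.azure.com/.default"):
    """Return the first credential that can fetch a token for `scope`.

    `factories` is an ordered list of zero-argument callables, for example
    credential classes from azure.identity. (Hypothetical helper.)
    """
    failures = []
    for factory in factories:
        try:
            candidate = factory()
            # Requesting a token verifies that the credential actually works
            candidate.get_token(scope)
            return candidate
        except Exception as exc:  # credential types raise different errors
            failures.append(f"{getattr(factory, '__name__', factory)}: {exc}")
    raise RuntimeError("No credential could authenticate:\n" + "\n".join(failures))
```

With `azure-identity` installed, this could be called as `first_working_credential([DefaultAzureCredential, InteractiveBrowserCredential])`, which falls back to an interactive browser login when the default chain fails.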

In the next cell, enter your Subscription ID, Resource Group name and Workspace name. To find these values:

1. In the upper right Azure Machine Learning studio toolbar, select your workspace name.
1. Copy the values for workspace, resource group and subscription ID into the code.
1. You'll need to copy one value, close the area and paste, then come back for the next one.

![Credentials.png](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Credentials.png)

In [None]:
##### TODO: change parameters below
# Get a handle to the workspace

ml_client = MLClient(
    credential=credential,
    subscription_id="xxxxxx",
    resource_group_name="xxxxx",
    workspace_name="xxxxx",
)
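
Rather than pasting the three identifiers into the notebook, you could keep them in a small JSON file. The sketch below assumes a file with the keys `subscription_id`, `resource_group`, and `workspace_name` (the layout of the `config.json` that can be downloaded for a workspace); `load_workspace_config` is a hypothetical helper, not part of the SDK.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read workspace identifiers from a JSON file and validate them."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"Missing workspace settings: {', '.join(missing)}")
    # Reject placeholder values such as "xxxxx"
    placeholders = [
        key for key in REQUIRED_KEYS if set(str(config[key]).lower()) == {"x"}
    ]
    if placeholders:
        raise ValueError(f"Replace the placeholder values for: {', '.join(placeholders)}")
    return {key: config[key] for key in REQUIRED_KEYS}
```

The returned values map onto the `MLClient` arguments `subscription_id`, `resource_group_name` (from `resource_group`), and `workspace_name`.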

## Create a compute resource to run your job

You already have a compute instance that you're using to run the notebook. Now you'll add another type, a **compute cluster**, to run your training job. The compute cluster can be single or multi-node machines with Linux or Windows OS, or a specific compute fabric like Spark.

You'll provision a Linux compute cluster. See the [full list of VM sizes and prices](https://azure.microsoft.com/pricing/details/machine-learning/).

For this example, you only need a basic cluster, so you'll use a Standard_DS11_v2 model with 2 vCPU cores and 14 GB of RAM. If you have already created a compute cluster, specify its name in the **cpu_compute_target** variable.

In [None]:
from azure.ai.ml.entities import AmlCompute

# Name assigned to the compute cluster
cpu_compute_target = "cpu-cluster"

try:
    # let's see if the compute target already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print("Creating a new cpu compute target...")

    # Let's create the Azure ML compute object with the intended parameters
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # Azure ML Compute is the on-demand VM service
        type="amlcompute",
        # VM Family
        size="STANDARD_DS11_V2",
        # Minimum running nodes when there is no job running
        min_instances=0,
        # Maximum number of nodes in the cluster
        max_instances=1,
        # Seconds a node stays idle after a job ends before it is scaled down
        idle_time_before_scale_down=180,
        # Dedicated or LowPriority. The latter is cheaper, but jobs may be preempted
        tier="Dedicated",
    )
    print(
        f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}"
    )
    # Pass the object to begin_create_or_update and wait for the operation to finish
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster).result()



## Create a job environment

To run your AzureML job on your compute cluster, you'll need an [environment](https://docs.microsoft.com/azure/machine-learning/concept-environments). An environment lists the software runtime and libraries that you want installed on the compute where you’ll be training. It's similar to your Python environment on your local machine.

AzureML provides many curated or ready-made environments, which are useful for common training and inference scenarios. You can also create your own custom environments using a docker image, or a conda configuration.

In this example, you'll create a custom conda environment for your jobs, using a conda yaml file.

First, create a directory to store the file in.

In [7]:
import os

dependencies_dir = "./dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

Now, create the file in the dependencies directory. The cell below uses IPython magic to write the conda.yml file into the directory you just created.\
For dependencies, list the Python version and the version of each library.

In [8]:
%%writefile {dependencies_dir}/conda.yml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - seaborn=0.12.2 
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - xlrd==2.0.1
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - psutil>=5.8,<5.9
    - tqdm>=4.59,<4.60
    - ipykernel~=6.0

Overwriting ./dependencies/conda.yml


**Note:** numpy, scikit-learn, scipy, pandas, and seaborn are listed under `dependencies` rather than under the `pip` section of conda.yml because they are available through Conda and can be installed by it directly.

Conda is a package manager that installs packages from channels, including conda-forge, which is specified in the `channels` section of conda.yml. For packages listed under `dependencies`, Conda resolves the specified versions together with their dependencies, which helps keep the environment compatible and stable.

pip, on the other hand, installs Python packages from PyPI. The `pip` section of conda.yml lists the additional packages to be installed with pip, typically because they are not available in the configured Conda channels.
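
The split between the two sections can also be inspected programmatically, for example to spot entries without any version constraint. The sketch below is a simplified line-based reader that only handles the flat layout of the conda.yml shown above; both function names are hypothetical, and a real YAML parser would be more robust.

```python
def split_conda_dependencies(conda_yml_text):
    """Split a simple conda.yml into (conda_deps, pip_deps).

    Simplified line-based reader for the flat layout used in this notebook;
    it is not a general YAML parser.
    """
    conda_deps, pip_deps = [], []
    section = None  # None, "conda", or "pip"
    for raw_line in conda_yml_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "dependencies:":
            section = "conda"
        elif line == "- pip:":
            section = "pip"
        elif not line.startswith("- "):
            section = None  # another top-level key such as "channels:"
        elif section == "conda":
            conda_deps.append(line[2:].strip())
        elif section == "pip":
            pip_deps.append(line[2:].strip())
    return conda_deps, pip_deps


def unpinned(dependencies):
    """Return the entries that carry no version constraint at all."""
    return [dep for dep in dependencies if not any(op in dep for op in "=<>~")]
```

Calling `split_conda_dependencies(open("./dependencies/conda.yml").read())` would return the Conda-managed packages and the pip-managed packages as two lists, and `unpinned` can then be applied to either list.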


Reference this *yaml* file to create and register this custom environment in your workspace:

In [None]:
from azure.ai.ml.entities import Environment

custom_env_name = "aml-scikit-learn"

pipeline_job_env = Environment(
    name=custom_env_name,
    description="Custom environment for Bank Deposit pipeline",
    tags={"scikit-learn": "0.24.2"},
    conda_file=os.path.join(dependencies_dir, "conda.yml"),
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)

print(
    f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)

## What is a command job?

You'll create an Azure ML *command job* to train a model for bank deposit prediction. The command job is used to run a *training script* in a specified environment on a specified compute resource. You've already created the environment and the compute resource. Next you'll create the training script.

The *training script* handles the data preparation, training and registering of the trained model. In this tutorial, you'll create a Python training script.

Command jobs can be run from CLI, Python SDK, or studio interface. In this tutorial, you'll use the Azure ML Python SDK v2 to create and run the command job.

After running the training job, you'll deploy the model, then use it to produce a prediction.

### Build the command job to train
Now that you have all assets required to run your job, it's time to build the job itself, using the Azure ML Python SDK v2. We will be creating a command job.

An AzureML command job is a resource that specifies all the details needed to execute your training code in the cloud: inputs and outputs, the type of hardware to use, software to install, and how to run your code. The command job contains the information needed to execute a single command.

**Create training script**

Let's start by creating the training script - the *main.py* python file.

In [5]:
import os

train_src_dir = "./src"
os.makedirs(train_src_dir, exist_ok=True)

This script handles the preprocessing of the data and splits it into train and test sets. It then uses this data to train a random forest model and returns the trained model. It is essentially the finalized, clean version of **Tutorial - Predictive Model.ipynb**, which focuses on constructing the final model.

[MLflow](https://mlflow.org/docs/latest/tracking.html) will be used to log the parameters and metrics during our pipeline run.

The cell below uses IPython magic to write the training script into the directory you just created. Alternatively, you could create main.py with any text editor.

In [9]:
%%writefile {train_src_dir}/main.py
import pandas as pd
import numpy as np
import warnings
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import os
import argparse
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--max_depth", required=False, default=15, type=int)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()
   
    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    # <Load the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    
    df_bank = pd.read_csv(args.data)

    # Log the size of dataframe
    mlflow.log_metric("num_samples", df_bank.shape[0])
    mlflow.log_metric("num_features", df_bank.shape[1] - 1)
    
    ###################
    # </Load the data>
    ###################
    
    ##################
    #<Data preprocessing>
    ##################
    
    # Copying original dataframe
    df_bank_ready = df_bank.copy()
    
    # Select Features
    feature = df_bank.drop('deposit', axis=1)

    # Select Target
    target = df_bank['deposit'].apply(lambda deposit: 1 if deposit == 'yes' else 0)

    # Set Training and Testing Data
    X_train, X_test, y_train, y_test = train_test_split(feature, target,
                                                        shuffle=True,
                                                        # use the ratio passed on the command line
                                                        test_size=args.test_train_ratio,
                                                        random_state=1)

    # Transform data
    numeric_columns = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous']
    categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

    # define a function that cleans the balance column
    def clean_balance(x):
        return np.maximum(x, 0)

    # define a custom transformer that applies the clean_balance function to the balance column
    clean_balance_transformer = FunctionTransformer(clean_balance)

    scaler = StandardScaler()
    one_hot_encoder = OneHotEncoder()

    # Embed the transformations in a ColumnTransformer so that new data is
    # transformed automatically in the same way
    preprocessor = ColumnTransformer(transformers=[
        ('clean_balance', clean_balance_transformer, ['balance']),
        ('num', scaler, numeric_columns),
        ('cat', one_hot_encoder, categorical_columns)
    ])

    # We fit preprocessor with X_train instead of the whole dataset to prevent data leakage
    preprocessor.fit(X_train)
    
    X_train_preprocessed = preprocessor.transform(X_train)
    X_test_preprocessed = preprocessor.transform(X_test)


    print(f"Training with data of shape {X_train_preprocessed.shape}")
    
    ##################
    #</Data preprocessing>
    ##################

    ##################
    #<train the model>
    ##################
    clf = RandomForestClassifier(
        n_estimators=args.n_estimators, max_depth=args.max_depth,
        min_samples_split=40, min_samples_leaf=60
    )
    clf.fit(X_train_preprocessed, y_train)
    
    pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('model', clf)
    ])

    y_pred = pipeline.predict(X_test)

#     y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=pipeline,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=pipeline,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    
    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()

Overwriting ./src/main.py


As you can see in this script, once the model is trained, the model file is saved and registered to the workspace. Now you can use the registered model in inferencing endpoints.

## Configure the command

Now that you have a script that can perform the desired tasks, you'll use the general purpose **command** that can run command line actions. This command line action can call system commands directly or run a script.

Here, you'll create input variables to specify the input data, split ratio, Random Forest hyperparameters (maximum depth and number of estimators) and registered model name. The command will:
* Use the compute created earlier to run this command.
* Use the environment created earlier - you can use the `@latest` notation to indicate the latest version of the environment when the command is run.
* Configure some metadata like display name, experiment name etc. An *experiment* is a container for all the iterations you do on a certain project. All the jobs submitted under the same experiment name would be listed next to each other in Azure ML studio.
* Configure the command line action itself - `python main.py` in this case. The inputs/outputs are accessible in the command via the `${{ ... }}` notation.
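
To make the `${{inputs.<name>}}` notation concrete, the sketch below imitates the substitution locally. It is a simplified illustration, not Azure ML's actual implementation: `render_command` is a hypothetical helper, and the data path is a made-up stand-in for the location that Azure ML resolves at run time.

```python
import argparse
import re
import shlex


def render_command(template, inputs):
    """Replace each ${{inputs.name}} placeholder with the matching input value."""
    def substitute(match):
        return shlex.quote(str(inputs[match.group(1)]))
    return re.sub(r"\$\{\{\s*inputs\.(\w+)\s*\}\}", substitute, template)


template = (
    "python main.py --data ${{inputs.data}} "
    "--test_train_ratio ${{inputs.test_train_ratio}} "
    "--max_depth ${{inputs.max_depth}}"
)
# Hypothetical values, for illustration only
inputs = {"data": "/mnt/data/bank.csv", "test_train_ratio": 0.2, "max_depth": 10}
rendered = render_command(template, inputs)
print(rendered)

# A script can then read the values back with argparse
parser = argparse.ArgumentParser()
parser.add_argument("--data", type=str)
parser.add_argument("--test_train_ratio", type=float, default=0.25)
parser.add_argument("--max_depth", type=int, default=15)
args = parser.parse_args(shlex.split(rendered)[2:])  # drop "python main.py"
print(args.data, args.test_train_ratio, args.max_depth)
```

An input that is defined but never referenced in the command string is not passed to the script, so the script falls back to its argparse default for that argument.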

In [None]:
#### TODO:

from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = "deposit-prediction-model"

job = command(
    inputs=dict(
        data=Input(
            type="uri_file",
            # TODO: Change the path variable to be your dataset name
            # "path" is a reference to a dataset named "bank-dataset" in the Azure Machine Learning workspace, version 1.
            path="azureml:bank-dataset:1",
        ),
        test_train_ratio=0.2,
        # Specify hyperparameters of Random Forest
        max_depth=10,
        n_estimators=300,
        registered_model_name=registered_model_name,
    ),
    code="./src/",  # location of source code
    command="python main.py --data ${{inputs.data}} --test_train_ratio ${{inputs.test_train_ratio}} --n_estimators ${{inputs.n_estimators}} --max_depth ${{inputs.max_depth}} --registered_model_name ${{inputs.registered_model_name}}",
    environment="aml-scikit-learn@latest",
    compute="cpu-cluster",
    experiment_name="train_model_deposit_prediction",
    display_name="deposit-prediction",
)

## Submit the job 

It's now time to submit the job to run in AzureML. This time you'll use `create_or_update`  on `ml_client.jobs`.

In [None]:
ml_client.jobs.create_or_update(job)
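
If you would rather wait for the job from the notebook than watch it in studio, a sketch like the following could be used. `submit_and_wait` is a hypothetical helper; it assumes `ml_client` is the `MLClient` created earlier and relies on the SDK's `jobs.create_or_update`, `jobs.stream`, and `jobs.get` operations.

```python
def submit_and_wait(ml_client, job):
    """Submit a job, stream its logs until it finishes, and return its final status."""
    returned_job = ml_client.jobs.create_or_update(job)
    print(f"Studio URL: {returned_job.studio_url}")
    # stream() blocks while the job is running and prints its log output
    ml_client.jobs.stream(returned_job.name)
    return ml_client.jobs.get(returned_job.name).status
```

Calling `submit_and_wait(ml_client, job)` in place of the submission cell would keep the cell busy until the job ends, which makes it harder to continue before the model has been registered.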

## View job output and wait for job completion

View the job in Azure ML studio by selecting the link in the output of the previous cell. 

The output of this job will look like this in Azure ML studio. Explore the tabs for various details like metrics, outputs etc. Once completed, the job will register a model in your workspace as a result of training. 

![Screenshot that shows the job overview](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/view-job.gif "View the job in studio")

> [!IMPORTANT]
> Wait until the status of the job is complete before returning to this notebook to continue. The job will take 2 to 3 minutes to run. It could take longer (up to 10 minutes) if the compute cluster has been scaled down to zero nodes and the custom environment is still building.



To see the logs and examine any errors, select the **Outputs + logs** tab.
![Logs](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Outputs+logs.png)


# Deploy model
Once the job is completed, you will see a model in the **Models** tab.
![Models](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Models.png)

Select deposit-prediction-model, open the **Deploy** tab, and choose **Real-time endpoint**.
![Realtime_endpoint](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Realtime_endpoint.png)

Use the following configuration. The **Endpoint name** and **Deployment name** do not have to match those shown in the screenshot. The deployment requires a virtual machine to host the endpoint, which uses the environment specified in conda.yml and the Python script main.py.
![Endpoint_configuration](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Endpoint_configuration.png)

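Make sure to delete this endpoint once you have finished using it, because the virtual machine that hosts it keeps incurring charges for as long as the deployment exists.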
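
The endpoint can be deleted from the studio UI or, as sketched below, from the notebook. `delete_endpoint` is a hypothetical helper that assumes `ml_client` is the `MLClient` created earlier and that `endpoint_name` is the name you chose during deployment; it relies on the SDK's `online_endpoints.begin_delete` operation.

```python
def delete_endpoint(ml_client, endpoint_name):
    """Delete a managed online endpoint and wait until the deletion completes."""
    poller = ml_client.online_endpoints.begin_delete(name=endpoint_name)
    poller.result()  # block until the long-running operation finishes
```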

## Test endpoint
To test the endpoint, go to the **Endpoints** tab and select the endpoint you created.
![Endpoint](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Endpoint.png)

Select the **Test** tab, enter the JSON input shown below, and press the **Test** button.
![Test_endpoint](https://raw.githubusercontent.com/Khaninsi/Azure-MLOps/master/screenshots/Test_endpoint.png)

If no error messages appear, the deployment completed successfully. Congrats!
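
A test request can also be assembled in code. The sketch below is an assumption-laden illustration: it uses the `input_data` layout with `columns`, `index`, and `data` that Azure ML documents for MLflow model deployments, takes the column names from the training script, and fills in a made-up customer. `build_request` is a hypothetical helper; if your endpoint expects the format shown in the screenshot to differ, follow the screenshot.

```python
import json

# Column names taken from the training script
numeric_columns = ["age", "balance", "day", "campaign", "pdays", "previous"]
categorical_columns = ["job", "marital", "education", "default", "housing",
                       "loan", "contact", "month", "poutcome"]


def build_request(rows):
    """Build a request body from a list of {column: value} dictionaries."""
    columns = numeric_columns + categorical_columns
    return {
        "input_data": {
            "columns": columns,
            "index": list(range(len(rows))),
            "data": [[row[col] for col in columns] for row in rows],
        }
    }


# Hypothetical customer, for illustration only
sample_row = {
    "age": 35, "balance": 1200, "day": 15, "campaign": 2, "pdays": -1, "previous": 0,
    "job": "management", "marital": "married", "education": "tertiary",
    "default": "no", "housing": "yes", "loan": "no",
    "contact": "cellular", "month": "may", "poutcome": "unknown",
}
print(json.dumps(build_request([sample_row]), indent=2))
```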