# New Draft for Prototype with a Cloud workstation

[Sanghee's note]
- Compute instance (CI) setup is covered in the resource setup article, which explains that a CI is needed to run Notebook/Terminal. Although a CI is not strictly required, we highly recommend setting one up for the tutorials.
- Order of progression: Data tutorial > prototype > training > deployment > pipeline
- We will explain Notebook in the data tutorial as needed. The same goes for prototyping: we will explain only what the user will use.
- Do we want continuity from the data tutorial to the prototyping tutorial? My inclination is that we teach users how to clean up and register the full training data in the data tutorial, and use the registered data asset in the training tutorial; meanwhile, in the prototyping tutorial, we instruct the user to download and upload a small test dataset. I think that is more realistic. Thoughts?
- Do we want continuity to the training tutorial? Either way is fine, but I am partial to continuity so that users who follow the entire series don't get confused. We could start with a simpler script here, then ask the user to modify it and convert it to main.py. In the training tutorial, we provide the full script again, and users would recognize it as the same script.

## Download assets required for this tutorial
In this tutorial, you'll learn how to bring an existing project to Azure Machine Learning Notebook and run the prototyping code. As a prerequisite, download the sample project files first.

- data set link
- main.py (maybe? The user can copy and paste from the .py file or from the doc. It is a bit awkward because we convert back to .py at the end, so we could just provide the script to copy and paste in the doc as well. We won't be providing a notebook here: Leah and I discussed this, and when we want the user to copy and paste into the notebook, providing another notebook doesn't make sense.)
- conda.yml (maybe? - if this would make sense in the custom env instruction)

## Open Azure Machine Learning Notebook
Re-use the instruction Sheri already has here.

## Connect to Compute Instance if you haven't already
Re-use the instruction Sheri already has here.

## Upload your files
To prototype, you need a small test dataset and a training script. Let's upload them.

Re-use the instruction Sheri already has here.

## Set up a new environment to make your code work (create a new kernel)
For your script to work, you need a development environment configured with the libraries that the training script uses.
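A minimal sketch of how you could check this from a notebook cell, assuming the required libraries are the ones the training script imports (`pandas`, `mlflow`, and `sklearn`). The helper name `find_missing_modules` is illustrative, not part of the tutorial assets.

```python
import importlib.util

# Import names taken from the training script's import statements.
REQUIRED_MODULES = ["pandas", "mlflow", "sklearn"]


def find_missing_modules(module_names):
    """Return the module names that cannot be found in the current kernel."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing from this kernel:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If the check reports missing modules, the current kernel is not the environment the script expects.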

(Sanghee's note: Leah, I am totally making this up based on what I see in Studio as an example; you need to tell me what would work!)

This is just an example. Azure Machine Learning provides pre-made environments, so you don't have to install everything yourself. Let's open the conda.yml file to see which dependencies are required. (This works only if we provide a conda.yml file as part of the downloadable assets.)

In [None]:
name: prototyping-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy=1.21.2
  - pip=21.2.4
  - scikit-learn=0.24.2
  - scipy=1.7.1
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - xlrd==2.0.1
    - mlflow==1.26.1
    - azureml-mlflow==1.42.0
    - psutil>=5.8,<5.9
    - tqdm>=4.59,<4.60
    - ipykernel~=6.0
    - matplotlib

Now let's make a custom environment based on a pre-made environment image.

1. Go to **Environment**
1. Select **Custom Environment**
1. Select **Create**
1. Name it **prototyping-env**
1. Select **Start from an existing environment**
1. Choose **Scikitlearn 1.0**
1. Select **Next**
1. Add dependencies to **RUN pip install**
1. Select **Next**
1. Add tags so you can recognize the custom environment, for example **Scikitlearn**, **mlflow==1.26.1**, and **matplotlib**
1. Select **Create**

(Sanghee's note: I don't know how to load a custom kernel into the notebook CI. We may have to teach the user how to restart the kernel?)

## Run the notebook

Now that the environment is set up, let's run the code.

Re-use additional instructions Sheri already has - this may be a good place to also showcase the Variable Explorer.

Note on the script: this script was originally written to run as a command job package. We need to modify it to run as a prototype (for example, load the data, take out the model registration, and so on).
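A minimal sketch of one such modification, assuming the script keeps its argparse interface. In a notebook, `parse_args()` with no arguments reads `sys.argv`, which holds the kernel's own launch arguments rather than the script's, so passing an explicit list avoids a parse error. The data path is a hypothetical placeholder, and `build_parser` is an illustrative helper that is not part of the original script.

```python
import argparse


def build_parser():
    """Build a parser with the same training arguments as the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, default=0.25)
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    return parser


# Pass the arguments explicitly instead of relying on sys.argv.
# "data/sample_credit_data.xls" is a hypothetical path to the uploaded test dataset.
notebook_args = build_parser().parse_args(
    ["--data", "data/sample_credit_data.xls", "--n_estimators", "50"]
)

print(" ".join(f"{k}={v}" for k, v in vars(notebook_args).items()))
```

With this approach, changing a parameter between runs means editing the list and rerunning the cell, and the same parser still works unchanged when the code is converted back to a script.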

In [None]:

import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, required=True, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, required=True, help="model name")
    args = parser.parse_args()
   
    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    
    credit_df = pd.read_excel(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ###################
    #</train the model>
    ###################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLflow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    ###########################
    #</save and register model>
    ###########################
    
    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()

## Iterate on the code based on the test run result

We explain the test result and instruct the user to change a parameter.
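As a sketch of what such an iteration could look like, the following compares two values of `n_estimators` on synthetic data generated by scikit-learn's `make_classification`. The synthetic data, the parameter values, and the use of accuracy as the metric are assumptions for illustration; the tutorial's real dataset and chosen parameter may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the small test dataset (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train once per candidate value and record the test accuracy.
results = {}
for n_estimators in (10, 50):
    clf = GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=0.1, random_state=0
    )
    clf.fit(X_train, y_train)
    results[n_estimators] = accuracy_score(y_test, clf.predict(X_test))

for n_estimators, accuracy in results.items():
    print(f"n_estimators={n_estimators}: accuracy={accuracy:.3f}")
```

Fixing `random_state` keeps the data split and the model identical between runs, so any difference in the score comes from the parameter being changed.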


## Export the notebook to a Python script

Re-use the instruction Sheri already has.