# Training & Deploying a Speech Recognition Classifier on Azure

_**Note to the Reader**_

_This is a work in progress, and forms part of a series (and, hopefully, book) I hope to complete on MLOps and Deep Learning over the coming quarters. Ultimately, our goal is to create an MLOps platform on Azure, and use it as a platform on top of which to deploy, demo, and, potentially, productize the increasingly complex models we'll build throughout. This notebook demonstrates the end-to-end training of a simple [Speech Emotion Recognition classifier](https://peleke.me/emodb.html) trained on the [German-language EmoDB database](http://emodb.bilderbar.info/docu/#docu), a subject both dear to my heart as an amateur linguist, and, I think, a refreshing departure from the more common, rather trivial examples using the Titanic and Boston Housing data sets. It explains and demonstrates everything required to create `Dataset` instances; use them to generate Feature Sets; train models; track visualizations, parameters, metrics, and models with MLflow; and use it to deploy a blue/green `ManagedOnlineDeployment`._

_At time of upload, the final cells do not execute, as I'm still waiting for the Provider registration of `Microsoft.Network` to complete; the code is, however, otherwise functional, and fully documented._

_The next installment will demonstrate how to build a web application that allows users to record and upload audio via a Next.js UI, backed by a FastAPI server that will transform the raw audio into features to be passed to the newly deployed endpoint, and the result, in turn, to the user. We'll use Docker to containerize the application; deploy it onto App Services, and set up CI/CD pipelines in Azure DevOps, culminating in a truly end-to-end project incorporating Data Science skills via nontrivial use of `sklearn`; MLOps and Cloud Native Architecture on Azure ML; full-stack software engineering through the Next.js/FastAPI stack, with a demonstration of the use of AI-assisted development to develop the frontend; and DevOps best practices, by way of the continuous delivery implementation behind the App Services deployment._

_That piece is well along the way, but in the interest of providing evidence of my (extreme) interest in the role sooner than later, I decided to send this over before completing that one._

_Scripts executed via jobs are listed in the Appendices, but are also visible in the [src](https://github.com/Peleke/azure-mlops/tree/main/SER-on-Azure/src) directory of the corresponding GitHub repository (by personal convention, I use GitHub for public distribution, and ADO for private projects)._ 

---

If you followed the [previous installment of this series](), we now have a fully functional Azure ML Workspace in which to train, evaluate, and deploy our machine learning models. Last time, we focused on setting up the underlying infrastructure, and used Automated ML to train a model on the Boston Housing data set.

This served its purpose of verifying that our Workspace is, indeed, functional. From the ML perspective, though, this is essentially trivial: All we did was tell Azure to solve an easy problem for us.

In this case study, we raise the stakes.

Now that our Workspace is in order, we'll use it to train a Speech Emotion Recognition Classifier for German Sentence Utterances (because, _natürlich_) using the [EmoDB](http://emodb.bilderbar.info/docu/#docu) data set we uploaded as a URI Folder asset last time. Along the way, we'll dip our toes into MLOps, seeing how to track model training with [MLflow](https://learn.microsoft.com/en-us/azure/machine-learning/concept-mlflow?view=azureml-api-2) and integrate a [Feature Store](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-get-started-with-feature-store?view=azureml-api-2&tabs=SDK-track) into our stack in an effort to decouple our features from our training scripts. Once all is said and done, we'll register and deploy the model behind a real-time endpoint, exposing it to the world for consumption.

Then, we'll spread the other wing, detouring into some software development to create a two-tier web application with a [Next.js](https://nextjs.org/) frontend and [FastAPI](https://fastapi.tiangolo.com/) backend, which will allow users to record themselves speaking one of the sentences in the data set, and submit it to our endpoint for a classification. In keeping with our end-to-end focus, we'll then Dockerize this application, using one container for each tier of the application; orchestrate the containers with [Docker Compose](https://docs.docker.com/compose/); and deploy the result to [Azure App Services](https://learn.microsoft.com/en-us/azure/app-service/overview). 

---

_Our model won't be particularly accurate (fine-tuning a [wav2vec2](https://pytorch.org/audio/stable/tutorials/speech_recognition_pipeline_tutorial.html) network would be better), but we'll at least have a foundation for exposing any models we train in a scalable, fully general fashion._

Once this is all said and done, we'll have a _real_ starting point for running and distributing ML experiments. There's only one step after that: We'll update the application with sophisticated logging and monitoring with [Prometheus and Grafana](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/), explore model retraining workflows using [pipelines](), and incorporate tools that track [model drift](https://learn.microsoft.com/en-us/azure/machine-learning/concept-model-monitoring?view=azureml-api-2) to inform the retraining process.

From there, we'll dive into deep learning, from the ground up, using our new platform to share our experiments with the world along the way. 

## Training the Speech Emotion Recognition Classifier 

We won't be covering the details of training the SER Classifier in this article, because... well, it would just be distracting to discuss the [short-time Fourier transform](https://en.wikipedia.org/wiki/Short-time_Fourier_transform), [Mel frequency cepstral coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), and the fundamentals of [scikit-learn](https://scikit-learn.org/stable/index.html) and the ML workflow, in addition to everything else.

For all that background, check out the [EmoDB case study](https://peleke.me/emodb.html), in which I touch on all of the above, including: audio and signal processing concepts; loading and preparing data; exploratory data analysis; featurization and scaling; training; hyperparameter optimization; and model evaluation.

Here, we'll take advantage of the fact this model is already built to simulate a common job scenario. Today, we'll pretend some Data Scientist on the team has figured out how to train this thing, and handed us the notebook containing their thought process and training scripts. It's now our job to:
- Convert the Notebook into a training script
- Persist Features in a Feature Store
- Set up MLflow to enable model tracking and versioning
- Train the model using this distributed machinery
- Deploy a real-time endpoint for consumption by the software development team

### Stepping Back: Scripts, Jobs, Pipelines, Oh My!

Before we dive in, let's review some basic concepts of Azure ML:
- **Scripts** are just that: Standalone Python files containing arbitrary data science logic, which can be executed on Compute Instances or Clusters linked to Azure ML as...
- [Jobs](https://learn.microsoft.com/en-us/cli/azure/ml/job?view=azure-cli-latest), which are Azure ML's abstraction for managed units of work. Jobs encapsulate the request, processing, and result of running a script within the machinery of the Azure ML workspace, and can be composed to create...
- [Pipelines](https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2), which are chains of related jobs, in which each job's inputs are the outputs of the immediately preceding job.
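
To make the chaining idea concrete, here is a minimal sketch in plain Python. It does not use the Azure ML SDK: the `run_pipeline` helper and the three toy "jobs" are hypothetical stand-ins, assumed purely for illustration, for scripts that would really run on managed compute.

```python
from typing import Any, Callable, List

# A "job" here is just a callable; real Azure ML jobs also carry compute,
# environment, and logging details. All names below are illustrative only.
Job = Callable[[Any], Any]


def run_pipeline(jobs: List[Job], initial_input: Any) -> Any:
    """Run jobs in order, feeding each job's output to the next one."""
    result = initial_input
    for job in jobs:
        result = job(result)
    return result


def load(path: str) -> List[float]:
    # Stand-in for a data-loading script; ignores `path` and returns toy data.
    return [1.0, 2.0, 3.0]


def featurize(values: List[float]) -> List[float]:
    # Stand-in for a featurization script: divide every value by the maximum.
    peak = max(values)
    return [v / peak for v in values]


def summarize(features: List[float]) -> float:
    # Stand-in for a training/evaluation script: return the mean feature value.
    return sum(features) / len(features)


pipeline_output = run_pipeline([load, featurize, summarize], "emodb/")
```

The only contract between steps is that each one accepts what the previous one returned, which is the same property that makes real pipeline steps independently replaceable.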

In the previous installment, we used [Automated ML](https://learn.microsoft.com/en-us/azure/machine-learning/concept-automated-ml?view=azureml-api-2) to verify that our setup is in working order, allowing us to simply create and run an `automl.regression` job instance directly.

Today's training process will be significantly more complex, since we'll be doing it from scratch. Rather than allow Auto ML to do everything for us, we'll implement each and every step of the pipeline ourselves, viz.:
- Loading & Preparing Data
- Featurization & Scaling
- MLflow Registration & Training
- Evaluation & Deployment 

Each of these steps will be represented by a **script**, the sequence of which we'll compose into a reproducible **pipeline**.
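
As a rough sketch of what "converting a Notebook into a script" tends to involve, the snippet below shows a script entry point that takes its inputs as command-line arguments instead of notebook variables. The flag names `--input-data` and `--output-dir` are hypothetical, and the body of `main` is a placeholder for the notebook's actual featurization cells.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the inputs a job would pass to this script on the command line."""
    parser = argparse.ArgumentParser(description="Featurize EmoDB audio.")
    # Both flag names are hypothetical; use whatever your job definition passes.
    parser.add_argument("--input-data", required=True, help="Folder of WAV files")
    parser.add_argument("--output-dir", required=True, help="Where to write features")
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> str:
    args = parse_args(argv)
    # The notebook's featurization cells would move here.
    return f"would featurize {args.input_data} into {args.output_dir}"


# Explicit arguments are passed here for demonstration; a real script would
# call main() with no arguments so that argparse reads sys.argv instead.
message = main(["--input-data", "./emodb", "--output-dir", "./features"])
```

Keeping argument parsing separate from the work itself makes the same function callable from a job, a test, or a notebook.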

At this point, it's worth acknowledging that I've been referring to MLflow and this thing called a Feature Store a fair bit. These are foundational concepts in the world of MLOps, and we'll need to have them in place before we implement our scripts.

## Feature Engineering in the ML Workflow

For a Data Scientist, machine learning might begin and end within the Notebook: Load, clean, model. This is fine for test/dev workflows, but bringing models to production requires significantly more sophistication.

Recall that the general ML process consists of the following steps. Embedded within the steps of **Source and Prepare your data** and **Code your model** are the processes of first cleaning the data — things like removing null values, encoding categorical variables, etc. — and subsequent [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), a preprocessing step in which we transform the input columns of our data into representations more useful for the learning process than the raw data.

<div style="text-align: center;">
  <img src="https://cloud.google.com/static/ai-platform/images/ml-workflow.svg" />
</div>

Recall that features are an abstraction representing the inputs to our model. A feature can be anything with predictive value for the model. In the case of the Boston Housing data set, features used to predict the price of a home included its school district, crime rate of its zip code, etc., all characteristics that, intuitively, we associate with home value.

In the case of our Speech Emotion Recognition Classifier, our features include the [Mel frequency cepstral coefficients](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), [spectrogram](https://en.wikipedia.org/wiki/Spectrogram), and [chromagram, or chroma feature](https://en.wikipedia.org/wiki/Chroma_feature) of each audio signal.

Unless you spend a lot of time in [Automatic Speech Recognition/Speech to Text (ASR/STT)](https://maelfabien.github.io/machinelearning/speech_reco/) or [Music Processing](https://www.audiolabs-erlangen.de/resources/MIR/FMP/C0/C0.html), that should essentially be Greek to you. This is the point: Features can be of arbitrary form, vary widely across projects, and be anything from intuitive (_crime rate_) to profoundly sophisticated (_wtf is a Mel frequency cepstral coefficient_).

---

_Some notes_:
- _If you're curious, check out this [interactive spectrogram](https://spectrogram.sciencemusic.org/) for a little insight._
- _The details of the [SER Classifier are available elsewhere on the site](https://peleke.me/emodb.html)_.
- _Meinard Müller's [Fundamentals of Music Processing](https://www.audiolabs-erlangen.de/resources/MIR/FMP/C0/C0.html) is a venerable classic. Incidentally, this group being based out of Germany is half of why the SER Classifier is for German utterances. The other half is just that I like German._

Defining the transformations required to convert raw data into engineered features is an important step of the development process; Data Scientists necessarily dedicate a large chunk of their time and code to this process. When contained in a Notebook, this code typically lives in a cell, to be executed alongside other cells, as a part of the Notebook.

And herein lies our first problem: In this workflow, _feature engineering is coupled to the Notebook_. To figure out which features are in use, we first have to find the Notebook, then read it. To compute those features, we have to run the code... _in the Notebook_. And we can only consume them there, too, since they exist only within the kernel in which they were computed.

All of this becomes prohibitive as soon as _any_ of the following occur:
- The features need to be generated and/or inspected _independently_ of training the model
- The transformation logic needs to be shared across projects
- The features, and/or their transformation logic, need to be reused in a new project, to train a different model

As you'd guess, these are all common occurrences.

The solution? Decouple feature engineering from training — and, indeed, from _everything_ else. We do this by introducing the concept of a **feature store**, an abstraction layer serving to encapsulate the entirety of the feature engineering process, from transformation of the raw input data to provisioning of the results for model training.

### Feature Stores: In-Depth

[Feature stores](https://learn.microsoft.com/en-us/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2) allow ML teams to specify a **feature set** — including inputs and transformation logic — and handle the operational details of serving, securing, and monitoring those features behind the scenes. This decouples feature engineering from the Notebooks in which the logic for the process is typically developed, and "frees you from the overhead of underlying feature engineering pipeline set-up and management." The result is a searchable, reusable index of features and transformations that can be easily shared across teams, projects, and use cases.
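
The "searchable, reusable index" idea can be sketched with a toy in-memory registry. This is not how Azure's managed feature store is implemented; `ToyFeatureRegistry`, `FeatureSetEntry`, and the two example entries are hypothetical names assumed only to illustrate registering a transformation once and finding it again by keyword.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FeatureSetEntry:
    name: str
    version: str
    description: str
    transform: Callable[[List[float]], List[float]]
    tags: List[str] = field(default_factory=list)


class ToyFeatureRegistry:
    """In-memory stand-in for the searchable index a feature store provides."""

    def __init__(self) -> None:
        self._entries: Dict[str, FeatureSetEntry] = {}

    def register(self, entry: FeatureSetEntry) -> None:
        # Key on name and version so definitions can evolve without clobbering.
        self._entries[f"{entry.name}:{entry.version}"] = entry

    def get(self, key: str) -> FeatureSetEntry:
        return self._entries[key]

    def search(self, keyword: str) -> List[str]:
        # Match the keyword against descriptions (substring) and tags (exact).
        keyword = keyword.lower()
        return sorted(
            key
            for key, entry in self._entries.items()
            if keyword in entry.description.lower() or keyword in entry.tags
        )


registry = ToyFeatureRegistry()
registry.register(
    FeatureSetEntry(
        name="energy",
        version="1",
        description="Mean squared amplitude of an audio clip",
        transform=lambda xs: [sum(x * x for x in xs) / len(xs)],
        tags=["audio"],
    )
)
registry.register(
    FeatureSetEntry(
        name="tenure",
        version="1",
        description="Customer tenure in days",
        transform=lambda xs: xs,
        tags=["churn"],
    )
)

found = registry.search("audio")
energy = registry.get(found[0]).transform([1.0, -1.0])
```

Any project with access to the registry can discover `energy:1` and apply the same transformation, without ever opening the notebook in which it was first written.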

<div style="text-align: center;">
  <img caption="Architecture including feature store." src="https://learn.microsoft.com/en-us/azure/machine-learning/media/concept-what-is-managed-feature-store/conceptual-arch.png?view=azureml-api-2" />
</div>

Being an operational concept, feature stores afford principally operational benefits, viz.:
- **Acceleration of model time-to-delivery** by enabling a searchable index of features and, thus, feature reuse; generating reusable transformations; enabling faster experimentation; etc.
- **Improvements in model reliability** by ensuring consistent feature definitions across teams; enabling versioning; minimizing training/serving skew via materialization; etc.
- **Reduction in operational cost** by minimizing transaction costs between teams and duplicated work; letting the system manage materialization and monitoring to reduce engineering costs; etc.

Like most things these days, feature stores come in both **managed** and **self-hosted** varieties. [Feast](https://feast.dev/) is the standard open-source solution for those inclined to self-hosting. Going this route implies creating and configuring a Virtual Network, AKS cluster, and Feast instance, then linking all this up with your Azure ML Workspace, Databricks stack, or whatever else you're using.

In general, that would be me, but since we're specifically exploring an Azure-backed workflow, we'll use Microsoft's managed solution for our use case. This helps keep things focused, and saves us the time of defining the deployment with [Pulumi](https://www.pulumi.com/) (used to manage this blog) or [Terraform](https://www.terraform.io/) (used by Pulumi under the hood), which, while a crucial skill for MLOps practitioners, would be _quite_ the distraction from our already heavy workload for the day.

---

_That said, a proficient MLOps practitioner really **should** be able to get this working from scratch. We'll revisit building out a similar, but cloud-agnostic, setup atop Kubernetes at a later date, where we'll set up both Feast and MLflow as self-hosted systems rather than managed ones._

### Top-Level Entities in Azure's Managed Feature Store

Like Azure ML, Azure Managed Feature Store exposes a set of **capabilities** and **entities** we should be familiar with.

Let's start with the capabilities:
- [Feature Discovery](https://learn.microsoft.com/en-us/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2#discover-and-manage-features): As discussed, feature stores enable developers and data scientists to search and reuse existing features.
- [Feature Transformation](https://learn.microsoft.com/en-us/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2#feature-transformation): This corresponds to the code that would be used to compute the features in a Notebook. For the SER Classifier, this would include, e.g., computing spectrograms from audio signals. In the context of a feature store, these are defined in a **feature spec**.
- [Feature Materialization](https://learn.microsoft.com/en-us/azure/machine-learning/feature-set-materialization-concepts?view=azureml-api-2&tabs=SDK-track): Materialization, or **generation**, is the process of running transformations defined in the spec to compute features. Users define a start and end time to bound how much source data should be included in the materialization run; this interval is called the **feature window**. Materialized features are then persisted, either in an **offline store**, such as Azure Data Lake Gen 2, or an **online store**, such as Redis. Note that an **offline** store is used for model training and batch inference, while **online** stores are used for real-time inference. Without materialization, transformations are run on the fly against the source data at every retrieval, which is slower and can diminish model reliability, among other inefficiencies deleterious to the ML workflow.
- [Feature Retrieval](https://learn.microsoft.com/en-us/azure/machine-learning/feature-retrieval-concepts?view=azureml-api-2): Retrieval is the process of fetching features from the Store for **offline** purposes, i.e., the model training and batch inference phases of an Azure ML pipeline job.
- [Monitoring](https://learn.microsoft.com/en-us/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2#monitoring): Managed feature store allows you to keep track of materialization jobs, and configure notifications to alert upon status updates.
- [Security](https://learn.microsoft.com/en-us/azure/machine-learning/concept-what-is-managed-feature-store?view=azureml-api-2#security): Managed feature store provides built-in RBAC capabilities, allowing fine-grained cross-project access control to features.
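
The feature window is easier to picture with a small example. The sketch below is plain Python, not the managed feature store API: `materialize`, `to_cents`, and the toy rows are hypothetical, and treating the window as half-open (start included, end excluded) is an assumption made for the illustration.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def materialize(
    source_rows: List[Row],
    transform: Callable[[Row], Row],
    window_start: datetime,
    window_end: datetime,
) -> List[Row]:
    """Apply `transform` to rows with window_start <= timestamp < window_end."""
    return [
        transform(row)
        for row in source_rows
        if window_start <= row["timestamp"] < window_end
    ]


def to_cents(row: Row) -> Row:
    # Toy transformation: express spend in whole cents instead of dollars.
    return {
        "customer_id": row["customer_id"],
        "timestamp": row["timestamp"],
        "spend_cents": int(row["spend_usd"] * 100),
    }


rows = [
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 1), "spend_usd": 10.0},
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 15), "spend_usd": 5.0},
    {"customer_id": "c1", "timestamp": datetime(2024, 2, 1), "spend_usd": 8.0},
]

# Only the two January rows fall inside the window; February 1 is excluded.
january = materialize(rows, to_cents, datetime(2024, 1, 1), datetime(2024, 2, 1))
```

The result of such a run is what gets persisted to the offline or online store, so later retrievals read stored values rather than re-running the transformation.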

The **entities** at play in Managed Feature Store are:
- [Feature Store](https://learn.microsoft.com/en-us/azure/machine-learning/concept-top-level-entities-in-managed-feature-store?view=azureml-api-2#feature-store): The Feature Store itself is a container for **feature sets**, which themselves are collections of **features**.
- [Feature Entity](https://learn.microsoft.com/en-us/azure/machine-learning/concept-top-level-entities-in-managed-feature-store?view=azureml-api-2#entities): Entities abstract the index columns associated with the logical entities of an enterprise, e.g., account number, customer ID, etc. These are typically created once and reused across feature sets within the store.
- [Feature Set Specification & Asset](https://learn.microsoft.com/en-us/azure/machine-learning/feature-set-specification-transformation-concepts?view=azureml-api-2): A feature set is a collection of features that result from applying a set of transformations to source data. Feature sets encapsulate the data **source**, PySpark **transformation** logic, and **materialization** settings. The configurations for each of these are specified in a **feature set specification**, and the result of realizing the spec is the **feature set** asset itself.
- [Feature Retrieval Specification](https://learn.microsoft.com/en-us/azure/machine-learning/feature-retrieval-concepts?view=azureml-api-2#create-a-feature-retrieval-specification): A feature retrieval specification defines the list of features associated with a model, which typically acts as input to a training pipeline. It helps generate training data, and can be packaged with the model.

While we're at it, the documentation essentially omits explicit explanation of offline vs online materialization. Let's clear that up a bit.

**Offline** feature materialization is used primarily during model training, according to the following process:
- Features are computed or aggregated from raw data, typically in a batch process.
- The materialized features are stored in a persistent storage system (like a database or a data lake).
- These features are then retrieved during the training phase to build or train machine learning models.

For example, suppose you're training a model to predict customer churn. You might aggregate a customer's transaction history over the past month into a single feature. This feature is computed and stored offline, and then used on-demand during the training process.

In offline materialization, features are often computed in large batches. This makes for easier processing logic, but can be time-consuming. In addition, features are not computed in real-time, so they might not reflect the most current state of the data. Ultimately, the emphasis is on completeness and accuracy of feature transformations, rather than speed.

**Online** feature materialization is used during the inference phase, where real-time predictions are required, according to the following process:
- Features are computed or retrieved on-the-fly, often in response to a real-time event (like a user interaction).
- Materialized features are stored in a fast-access storage system, such as a Redis cache or other in-memory store, to ensure low latency.
- Online features are available immediately for real-time inferencing.
  
Continuing with the customer churn example, when a customer logs into an app, the system might compute or retrieve features such as their activity in the last 5 minutes, then use those features to predict whether they are likely to churn.

In online materialization, processing occurs in real- or near-real-time. The focus is on speed: latency is minimized, and features are computed from the most current data available.

To summarize:
- Offline features are precomputed and stored before they are used, while online features are computed or retrieved in real-time.
- Offline materialization serves training and batch inference, while online materialization serves real-time inference.
- Offline processes prioritize accuracy and completeness, whereas online processes prioritize speed and freshness.
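
A minimal sketch of the distinction, using plain Python containers as stand-ins for the real stores (the text above names Azure Data Lake Gen 2 and Redis as typical choices). The record layout and variable names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# (customer_id, day, transactions) records, already transformed into features
# and listed in chronological order per customer.
materialized: List[Tuple[str, int, int]] = [
    ("c1", 1, 4),
    ("c1", 2, 0),
    ("c2", 1, 7),
    ("c2", 2, 9),
]

# Offline store: keep the full history, since training wants every example.
offline_store: DefaultDict[str, List[Tuple[int, int]]] = defaultdict(list)
# Online store: keep only the latest value per key, for low-latency lookups.
online_store: Dict[str, int] = {}

for customer_id, day, transactions in materialized:
    offline_store[customer_id].append((day, transactions))
    online_store[customer_id] = transactions  # later days overwrite earlier ones

# Training reads everything; real-time inference reads one key.
training_rows = [row for history in offline_store.values() for row in history]
latest_for_c1 = online_store["c1"]
```

The same materialized values feed both stores; what differs is how much history is kept and how the data is looked up.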

This is a lot of concepts without much structure, so allow me to lay out the workflow for working with a feature store, using the terms we've just learned:
- Create a new Feature Store to share across projects.
- Implement a Feature Set Specification.
- Use the Feature Set Spec to instantiate and register a Feature Store Entity.
- Generate a training DataFrame using the Feature Set.
- Enable offline materialization on the source data set.

### Creating a Feature Store

As before, we'll use the SDK for everything. Let's make our imports and define constants like the `SUBSCRIPTION_ID`, etc.

In [3]:
import os

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    FeatureStore,
    FeatureStoreEntity,
    FeatureSet,
)
from azure.identity import DefaultAzureCredential

In [4]:
# first used in: <https://peleke.me/ds-ass.html>
os.environ["SUBSCRIPTION_ID"] = "3edf056b-b303-4dd1-9447-a08c6901065c"
os.environ["RESOURCE_GROUP_NAME"] = "rg-myfirstworkspace"
os.environ["LOCATION"] = "eastus"
os.environ["WORKSPACE_NAME"] = "ml-myfirstworkspace"

In [5]:
feature_store_name = "fs-myfirstworkspace"

Next, we create the `MLClient` instance that we'll use to interact with our existing Workspace.
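
The cell that defines `ml_client` does not appear here, so the following is only a plausible reconstruction, assuming the client is built from `DefaultAzureCredential` and the environment variables set above. The helper name `build_ml_client` is hypothetical, and the imports are deferred into the function so the sketch loads even where the SDK is not installed.

```python
import os


def build_ml_client():
    """Return an MLClient bound to the Workspace named in the environment.

    Assumption: this mirrors the cell that is missing from the text; the
    original may have differed.
    """
    # Imported lazily so this sketch can be loaded without the Azure SDK.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=os.environ["SUBSCRIPTION_ID"],
        resource_group_name=os.environ["RESOURCE_GROUP_NAME"],
        workspace_name=os.environ["WORKSPACE_NAME"],
    )
```

In the notebook, `ml_client = build_ml_client()` would then provide the client used by the cells that follow.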

Define an instance of `FeatureStore`...

In [7]:
fs = FeatureStore(
    name=feature_store_name,
    location=os.environ["LOCATION"],
)

...Then, create it.

In [8]:
fs_poller = ml_client.feature_stores.begin_create(fs)
fs_result = fs_poller.result()

Readonly attribute principal_id will be ignored in class <class 'azure.ai.ml._restclient.v2022_05_01.models._models_py3.UserAssignedIdentity'>
Readonly attribute client_id will be ignored in class <class 'azure.ai.ml._restclient.v2022_05_01.models._models_py3.UserAssignedIdentity'>


In [9]:
fs_result.id

'/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/fs-myfirstworkspace'

Note that this creates a Workspace for the Feature Store itself. This is independent of the Workspace we were using previously. In addition, note the following from the [documentation](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-get-started-with-feature-store?view=azureml-api-2&tabs=SDK-and-CLI-track#create-a-minimal-feature-store):
- The default blob store for the feature store is an ADLS Gen2 container.
- A feature store is always created with an offline materialization store and a user-assigned managed identity (UAI).

Next, we'll need to create a client to interact with the Store. Note that you'll need to `pip` or `pipenv` install `azureml-featurestore` independently from `azureml`, as well as `pyspark`.

In [37]:
!pipenv install azureml-featurestore pyspark

[32m[1mInstalling azureml-featurestore...[0m
[2K[32m⠏[0m ✔ Installation Succeededrestore.....
[1A[2K[32m[1mInstalling pyspark...[0m
[2K[32m⠹[0m ✔ Installation Succeeded
[1A[2K[33m[1mPipfile.lock (7d7f81) out of date, updating to (06fde9)...[0m
Locking[0m [33m[packages][0m dependencies...[0m
[2K[32m⠇[0m ✔ Success!dependencies....
[1A[2KLocking[0m [33m[dev-packages][0m dependencies...[0m
[1mUpdated Pipfile.lock (2f066a018c5793e35b3df1fb0324f9f24cbce180d743f3b0f57412b1f606fde9)![0m
[1mInstalling dependencies from Pipfile.lock (06fde9)...[0m


In [10]:
from azureml.featurestore import FeatureStoreClient

In [11]:
fs_client = FeatureStoreClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group_name=os.environ["RESOURCE_GROUP_NAME"],
    name=feature_store_name,
)

Excellent: We now have a Feature Store in which we can store our Feature Sets. Let's turn our attention to that next.

### Designing a Feature Set

As I've mentioned several times by now, the [Speech Emotion Recognition Classifier](https://peleke.me/emodb.html) presents a deep-dive on audio data and feature engineering, the details of which we'll take for granted here. Remember: We're assuming a scenario where this has been passed off to us for productionization, so "why" isn't so important (here).

The Classifier requires us to compute the following features for the audio files in the EmoDB data set, which we'll review momentarily:
- **Mel frequency cepstral coefficients**, a compact set of numbers summarizing the shape of the spectrum over time, computed on the perceptually motivated Mel scale
- **Mel spectrograms**, which describe the energy in each frequency band over time, with the frequency axis warped onto the Mel scale to approximate human pitch perception
- **Chromagrams**, which map the frequencies of an audio signal onto the twelve pitch classes of the chromatic scale
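
For a little intuition about two of these ideas, here is a small sketch of the frequency mappings involved. It is not the feature code from the SER Classifier notebook: it only shows one widely used Hz-to-mel formula and the standard mapping from frequency to one of the twelve pitch classes that a chromagram bins energy into, assuming A4 is tuned to 440 Hz.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def hz_to_mel(frequency_hz: float) -> float:
    """Convert Hz to mels using the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def hz_to_pitch_class(frequency_hz: float, a4_hz: float = 440.0) -> str:
    """Map a frequency to the nearest of the twelve pitch classes."""
    # MIDI note 69 is A4; each semitone is a factor of 2 ** (1 / 12).
    midi_note = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    return PITCH_CLASSES[midi_note % 12]


mel_at_1khz = hz_to_mel(1000.0)
note_at_440 = hz_to_pitch_class(440.0)
note_at_880 = hz_to_pitch_class(880.0)
```

Note that 440 Hz and 880 Hz land in the same pitch class: a chromagram deliberately folds octaves together, while the Mel scale compresses high frequencies to reflect how pitch is perceived.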

The specific steps we'll take include the following:
- Implement our feature engineering functions
- Use these to generate a DataFrame and, in turn, a Parquet file
- Use the Parquet file to create a Feature Specification
- Store the Parquet file as a URI File Asset
- Use the Feature Spec to create and register a Feature Entity

Fortunately, the data science portions are done — we'll simply copy the feature functions from the SER Classifier notebook, and use the DataFrame they generate to create a Parquet file, the format expected to create an Azure Feature Specification. We'll discuss the format when we generate the file.

#### Loading Data & Correcting a Mistake

We'll start by loading our data. Recall that, last time, we created a URI Folder asset using the EmoDB data. It looked like this:

```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import azure.core.exceptions

emo_db_path = "./emodb"

emodb_asset = Data(
    path=emo_db_path,
    type=AssetTypes.URI_FOLDER,
    description="EmoDB German utterance WAV data.",
    name="EmoDB-Audio",
    version="v1.0.0",
)

try:
    emodb_asset_result = ml_client.data.create_or_update(emodb_asset)
except azure.core.exceptions.HttpResponseError as e:
    print("Attempted to create asset with existing name:version combination; ignoring...")
```

This worked just fine — we can see and retrieve this Data Asset without issue.

In [12]:
emodb_audio = ml_client.data.get(
    name="EmoDB-Audio",
    version="v1.0.0",
)

In [13]:
emodb_audio.path

'azureml://subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourcegroups/rg-myfirstworkspace/workspaces/ml-myfirstworkspace/datastores/workspaceblobstore/paths/LocalUpload/ec087c9a7609d0bedffa3bd212cd2d32/emodb/'

Unfortunately, there's an (intentional) problem. Recall that, in the last article, we trained a model on the Boston Housing data set, using an MLTable to load the data. We uploaded the EmoDB data as an example of creating a URI Folder resource, but since we didn't need to use it, we stopped there.

Turns out, this leaves us in an awkward position. If we simply create a raw Data Asset like this, it _does_ upload the data to Azure. However, the only way for us to use it is to download it from the blob store that contains it.

Incidentally, we'll still need to download the data for local usage after doing it properly. However, we'll be able to do so more easily, in that we'll be able to avoid messing with the details of connecting to the storage account directly. In addition, it ensures we can easily load that data in other notebooks later — the gain is more obvious in a bigger project.

The crucial step we skipped is the creation and registration of a [dataset](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets?view=azureml-api-1) entity. _This_ is what we _actually_ want, i.e., an abstraction of the _loaded data_, not simply persistence of the data itself to the storage account. Note that `Dataset` belongs to the v1 SDK (`azureml-core`), which is separate from the `azure.ai.ml` package we've used so far.

To correct our mistake, we'll take the following steps:
- Create a `Dataset` using the files in the `emodb` directory
- Register the dataset instance

First, we need to install `azureml-core`.

In [113]:
!pipenv install azureml-core

<output trimmed>


We have the `emodb` data in an adjacent folder. After creating a dataset, we won't have to muck with filepaths like this.

In [20]:
emodb_path = "./emodb"

In [21]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import azure.core.exceptions

emodb_asset = Data(
    path=emodb_path,
    type=AssetTypes.URI_FOLDER,
    description="EmoDB German utterance WAV data.",
    name="EmoDB-Audio",
    version="v1.0.1",
)

try:
    emodb_asset_result = ml_client.data.create_or_update(emodb_asset)
    print(f"Version {emodb_asset_result.version} uploaded to: `{emodb_asset_result.path}`")
except azure.core.exceptions.HttpResponseError as e:
    print("Attempted to create asset with existing name:version combination; ignoring...")

Attempted to create asset with existing name:version combination; ignoring...


Excellent. Next, we can create and register our dataset instance.

In [25]:
from azureml.core import Dataset, Datastore, Workspace
from azureml.data.datapath import DataPath

In [26]:
workspace = Workspace(
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group=os.environ["RESOURCE_GROUP_NAME"],
    workspace_name=os.environ["WORKSPACE_NAME"], 
)

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


In [27]:
target_datastore = Datastore.get(workspace, "workspaceblobstore")
target_datastore

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-b3be7294-d76e-40d3-894b-9edac01fb10f",
  "account_name": "mlmyfirsstorage661258faa",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [31]:
emodb_ds = Dataset.File.upload_directory(
    src_dir=emodb_path,
    target=DataPath(target_datastore, '/EmoDB-Audio-2'),
    show_progress=True,
)

Validating arguments.
Arguments validated.
'overwrite' is set to False. Any file already present in the target will be skipped.'
Uploading files from '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/emodb' to '/EmoDB-Audio-2'
Copying 535 files with concurrency set to 16
Skipped /Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/emodb/16a01Fc.wav, file 1 out of 535. Target already exists.
Skipped /Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/emodb/16a04Ab.wav, file 2 out of 535. Target already exists.
Skipped /Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/emodb/13a05Ea.wav, file 3 out of 535. Target already exists.
Skipped /Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/emodb/11a01Ab.wav, file 4 out of 535. Target already exists.
Skipped /Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/emodb/16a02Lb.wav, file 5 out of 535. Targe

Next, we [register](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets?view=azureml-api-1#register-datasets) the dataset, so we can retrieve it without having to mess with the EmoDB data directly next time.

In [32]:
emodb_ds_name = "ds-emodb"

In [33]:
emodb_ds = emodb_ds.register(
    workspace=workspace,
    name=emodb_ds_name,
    description="EmoDB audio (WAV) training data.",
)

Now, we'll be able to easily download this data to another workstation, or, as we'll see momentarily, reference it in Jobs, enabling us to avoid interacting with the storage account directly.

#### Preparing Data

Since we created a [FileDataset](https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py), we'll need to download the data locally to use and explore it.

---

_The key word is "explore". We're still in the development phase — i.e., we haven't yet implemented scripts we can run as jobs — so we need to pull the data to determine how to prepare it. Once we've figured that out, we'll dump the code into a script, which future users can run as a job, obviating the need to download this data ever again._

Unlike using the Storage Account directly, the dataset instance provides abstractions to make this process secure and convenient.

In [165]:
# Download the dataset files
download_folder = 'emodb'
os.makedirs(download_folder, exist_ok=True)

In [None]:
emodb_ds.download(
    target_path=download_folder,
    overwrite=True,
)

In [169]:
emodb_files = os.listdir(download_folder)

In [170]:
print(f"There are {len(emodb_files)} files like: `{emodb_files[0]}`.")

There are 535 files like: `16a02Lb.wav`.


As expected.

Recall that the end goal is to create a feature specification. In order to do so, we need to:
- Load a DataFrame with which we can compute features
- Compute said features
- Export the DataFrame as a Parquet file
- Persist the Parquet file as a URI File Asset
- Use the Parquet file as an input to create the Feature Specification

Remember: Our scenario is such that we assume we've been handed a Notebook with training code in it, and nothing more. At this stage, we have to figure out which code to extract, and how, in order to create the Feature Specification we need. This is the only reason we're downloading the data again: To prototype our feature extraction logic.

Once we know what to do, we'll once again extract code from _this_ Notebook; dump it into a script; and run _that_ to create our Feature Specification.

We proceed by duplicating the logic for loading the DataFrame, given paths to our data files. The dictionaries below are used for parsing filenames...

In [34]:
EMOTION_MAP = {
    'W': 'wut',        # anger
    'L': 'langeweile', # boredom
    'E': 'ekel',       # disgust
    'A': 'angst',      # fear
    'F': 'freude',     # happiness/joy
    'T': 'trauer',     # sadness
    'N': 'neutral',    # neutral
}

In [35]:
SPEAKER_MAP = {
    '03': {
        'gender': 0,
        'age': 31,
    },
    '08': {
        'gender': 1,
        'age': 34,
    },
    '09': {
        'gender': 1,
        'age': 21,
    },
    '10': {
        'gender': 0,
        'age': 32,
    },
    '11': {
        'gender': 0,
        'age': 26,
    },
    '12': {
        'gender': 0,
        'age': 30,
    },
    '13': {
        'gender': 1,
        'age': 32,
    },
    '14': {
        'gender': 1,
        'age': 35,
    },
    '15': {
        'gender': 0,
        'age': 25,
    },
    '16': {
        'gender': 1,
        'age': 31,
    },
}

In [36]:
TEXT_MAP = {
    'a01': {
        'text_de': 'Der Lappen liegt auf dem Eisschrank.',
        'text_en': 'The tablecloth is lying on the fridge.'
    },
    'a02': {
        'text_de': 'Das will sie am Mittwoch abgeben.',
        'text_en': 'She will hand it in on Wednesday.'
    },
    'a04': {
        'text_de': 'Heute abend könnte ich es ihm sagen.',
        'text_en': 'Tonight I could tell him.'
    },
    'a05': {
        'text_de': 'Das schwarze Stück Papier befindet sich da oben neben dem Holzstück.',
        'text_en': 'The black sheet of paper is located up there besides the piece of timber.'
    },
    'a07': {
        'text_de': 'In sieben Stunden wird es soweit sein.',
        'text_en': 'In seven hours it will be.'
    },
    'b01': {
        'text_de': 'Was sind denn das für Tüten, die da unter dem Tisch stehen?',
        'text_en': 'What about the bags standing there under the table?'
    },
    'b02': {
        'text_de': 'Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter.',
        'text_en': 'They just carried it upstairs and now they are going down again.'
    },
    'b03': {
        'text_de': 'An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht.',
        'text_en': 'Currently at the weekends I always went home and saw Agnes.'
    },
    'b09': {
        'text_de': 'Ich will das eben wegbringen und dann mit Karl was trinken gehen.',
        'text_en': 'I will just discard this and then go for a drink with Karl.'
    },
    'b10': {
        'text_de': 'Die wird auf dem Platz sein, wo wir sie immer hinlegen.',
        'text_en': 'It will be in the place where we always store it.'
    }
}

Next, the functions consuming these dictionaries...

In [37]:
def emotion_of(filename: str) -> str:
    return EMOTION_MAP[filename[-2]]

def parse_filename(filepath: str) -> dict[str, str]:
    filename = filepath.split('.')[0]
    return {
        **SPEAKER_MAP[filename[:2]],
        **TEXT_MAP[filename[2:5]],
        'filepath': f'{wav_path}/{filepath}',
        'filename': filename,
        'emotion': EMOTION_MAP[filename[-2]],
        'instance': filename[-1],
    }

def map_filenames(dirpath: str) -> list[dict]:
    # One parsed record per file found in `dirpath`
    return [parse_filename(filename) for filename in os.listdir(dirpath)]
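
To make the slicing in `parse_filename` concrete, here is a minimal standalone sketch that decomposes a single EmoDB filename into its positional codes. The helper name `split_emodb_name` and its field names are hypothetical; the positions simply mirror the slices used above (`[:2]`, `[2:5]`, `[-2]`, `[-1]`).

```python
from pathlib import Path

def split_emodb_name(filepath: str) -> dict[str, str]:
    """Split an EmoDB filename such as `16a02Lb.wav` into its positional codes."""
    stem = Path(filepath).stem  # drop directory and extension -> "16a02Lb"
    return {
        "speaker": stem[:2],    # key into SPEAKER_MAP
        "text": stem[2:5],      # key into TEXT_MAP
        "emotion": stem[-2],    # key into EMOTION_MAP
        "instance": stem[-1],   # trailing letter, stored as `instance` above
    }

parts = split_emodb_name("emodb/16a02Lb.wav")
print(parts)
```

Here `16` selects the speaker, `a02` the sentence, and `L` the emotion (`langeweile`, boredom).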

And we create a DataFrame.

In [38]:
wav_path = "emodb"

In [40]:
import pandas as pd

In [41]:
df = pd.DataFrame(map_filenames(wav_path))

Voilà. The next step is to copy the feature extraction code...

In [47]:
import librosa
import numpy as np
import soundfile

In [48]:
def f_chromagram(waveform, sample_rate):
    """Generate the chromagram of `waveform`'s STFT. Produces 12 features."""
    return np.mean(librosa.feature.chroma_stft(
        S=np.abs(librosa.stft(waveform)),
        sr=sample_rate,
    ).T, axis=0)

def f_mel_spectrogram(waveform, sample_rate):
    """Generate Mel Spectrogram of `waveform`. Generates 128 features."""
    return np.mean(librosa.feature.melspectrogram(
        y=waveform,
        sr=sample_rate,
    ).T, axis=0)

def f_mfcc(waveform, sample_rate, n_mfcc: int = 40):
    """Generate `n_mfcc` Mel-Frequency Cepstral Coefficients of `waveform`. Produces `n_mfcc` features."""
    return np.mean(librosa.feature.mfcc(
        y=waveform,
        sr=sample_rate,
        n_mfcc=n_mfcc,
    ).T, axis=0)

def features(filepath):
    with soundfile.SoundFile(filepath) as audio:
        waveform = audio.read(dtype="float32")
        
        feature_matrix = np.hstack((
            f_chromagram(waveform, audio.samplerate),
            f_mel_spectrogram(waveform, audio.samplerate),
            f_mfcc(waveform, audio.samplerate),
        ))
    return feature_matrix
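
Since `features` concatenates three fixed-length vectors, a little arithmetic shows which positions belong to which extractor. The sketch below assumes the sizes stated in the docstrings above (12 chroma bins, 128 Mel bands, 40 MFCCs); `FEATURE_SIZES` and `feature_slices` are illustrative names, not part of the original notebook.

```python
# Sizes taken from the docstrings above; names here are illustrative only.
FEATURE_SIZES = {"chroma": 12, "mel": 128, "mfcc": 40}

def feature_slices(sizes: dict[str, int]) -> dict[str, range]:
    """Map each feature group to the positions it occupies after `np.hstack`."""
    slices, start = {}, 0
    for name, size in sizes.items():
        slices[name] = range(start, start + size)
        start += size
    return slices

layout = feature_slices(FEATURE_SIZES)
total = sum(FEATURE_SIZES.values())
for name, cols in layout.items():
    print(f"{name}: columns {cols.start}-{cols.stop - 1}")
print(f"total features: {total}")
```

Under these assumptions, each file yields a 180-element vector, with the MFCCs occupying the last 40 positions.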

With this, we can load a DataFrame of features.

In [49]:
X = pd.DataFrame(df['filepath'].apply(features).tolist(), index=df.index)

In [50]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,170,171,172,173,174,175,176,177,178,179
0,0.519366,0.517606,0.551221,0.630626,0.657245,0.616187,0.595714,0.520496,0.512547,0.553091,...,0.52113,0.739171,-0.943298,0.644986,-0.272698,0.350792,3.789986,3.656435,4.393351,2.65079
1,0.496017,0.541108,0.557418,0.548521,0.572088,0.554556,0.581532,0.514501,0.447549,0.464362,...,-1.097012,1.173065,1.638992,0.870183,-0.466121,-0.498711,-1.240703,0.51634,0.868804,1.377451
2,0.449666,0.522166,0.710585,0.764962,0.803045,0.708488,0.636308,0.616403,0.57707,0.563288,...,-3.802398,-2.042081,-4.256393,-3.308553,-2.242352,-0.082309,-0.517581,4.566897,4.787785,7.637536
3,0.600575,0.542146,0.501488,0.518171,0.511432,0.513592,0.539829,0.613019,0.671621,0.65958,...,3.0718,5.410509,0.893856,2.015586,1.085273,0.862223,0.542212,1.65942,-1.383783,1.487293
4,0.54645,0.534972,0.521752,0.544894,0.602076,0.625782,0.607099,0.598428,0.601529,0.588482,...,1.840486,1.413311,0.682267,2.653771,0.906938,0.833642,0.911659,2.030907,1.32378,1.30235


Unfriendly, but it looks as it should. Let's add a couple of bookkeeping columns, then generate a Parquet file.

In [51]:
# `filename` is used as an index column in the Feature Set Specification
X["filename"] = df["filename"]
X["timestamp"] = pd.Timestamp.now()

In [52]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,172,173,174,175,176,177,178,179,filename,timestamp
0,0.519366,0.517606,0.551221,0.630626,0.657245,0.616187,0.595714,0.520496,0.512547,0.553091,...,-0.943298,0.644986,-0.272698,0.350792,3.789986,3.656435,4.393351,2.65079,16a02Lb,2024-08-19 18:35:33.774715
1,0.496017,0.541108,0.557418,0.548521,0.572088,0.554556,0.581532,0.514501,0.447549,0.464362,...,1.638992,0.870183,-0.466121,-0.498711,-1.240703,0.51634,0.868804,1.377451,14a07Wc,2024-08-19 18:35:33.774715
2,0.449666,0.522166,0.710585,0.764962,0.803045,0.708488,0.636308,0.616403,0.57707,0.563288,...,-4.256393,-3.308553,-2.242352,-0.082309,-0.517581,4.566897,4.787785,7.637536,10a07Ad,2024-08-19 18:35:33.774715
3,0.600575,0.542146,0.501488,0.518171,0.511432,0.513592,0.539829,0.613019,0.671621,0.65958,...,0.893856,2.015586,1.085273,0.862223,0.542212,1.65942,-1.383783,1.487293,13a05Ea,2024-08-19 18:35:33.774715
4,0.54645,0.534972,0.521752,0.544894,0.602076,0.625782,0.607099,0.598428,0.601529,0.588482,...,0.682267,2.653771,0.906938,0.833642,0.911659,2.030907,1.32378,1.30235,14a05Wa,2024-08-19 18:35:33.774715


In [53]:
parquet_filename = "features.parquet"

In [54]:
X.to_parquet(parquet_filename, index=False)

[Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a columnar file format common in the [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) ecosystem, and preferred by Azure ML, probably because of its proximity to PySpark.

We don't care about Parquet, really, but rather the fact that the Parquet export of the DataFrame is used to create the Feature Specification we need to register. Since we'll ultimately be executing this logic from a script, we need to upload the Parquet file somewhere it can be either downloaded or mounted to the Compute Instance executing the Job. 

This is a perfect use case for URI Files, again.

Again, we ultimately plan to run this in a script, so let's think of things from that angle. When creating and uploading the Parquet file in a script, we'll want not simply to upload it, but upload it with a version bump. If this is the first time we've uploaded the file, we hit an edge case: There's no version to bump, so we'll need to label it `v1.0.0`.

First, we implement a function to bump the asset's patch version.

---

_If you're not familiar with [semver (semantic versioning)](https://semver.org/), review briefly. Given a string like `v1.0.0`, the first number is the **major** version; the second is the **minor** version; and the final is the **patch** version. There are variants on this format that we needn't concern ourselves with. Our convention will be to only ever bump the patch version via automation; we'll allow the organization to decide on the meaning of major (e.g., training/model-breaking) and minor (e.g., prediction-breaking) changes, and implement such bumps through some manual process._

In [55]:
def bump_patch(version: str) -> str:
    """Given a semver string of form `vx.x.x`, return `vx.x.{x+1}`."""
    comps = [int(el) for el in version.replace("v", "").split(".")]
    comps[-1] = comps[-1] + 1
    return f"v{'.'.join([str(el) for el in comps])}"

Next, let's sketch the logic to retrieve the data asset if it exists, and determine the new version string.

In [56]:
emodb_feature_set_parquet_filename = "emodb-feature-set-parquet-filename"

In [57]:
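def get_new_asset_version(asset_name: str) -> str:
    try:
        asset_versions = [asset.version for asset in ml_client.data.list(name=asset_name)]
    except azure.core.exceptions.ResourceNotFoundError:  # If no versions are found
        return "v1.0.0"
    if not asset_versions:  # Guard against an empty listing, too
        return "v1.0.0"
    # Compare numerically on (major, minor, patch); a plain string sort would rank `v1.0.10` below `v1.0.9`
    latest = max(asset_versions, key=lambda v: tuple(int(el) for el in v.replace("v", "").split(".")))
    return bump_patch(latest)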

In [58]:
get_new_asset_version(emodb_feature_set_parquet_filename)

'v1.0.0'
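
The version-selection step can also be exercised without touching Azure, by separating "pick the next version" from the API call. The following is a minimal sketch under the assumption that every version has the form `vMAJOR.MINOR.PATCH`; `parse_version` and `next_version` are hypothetical helpers, not functions from the notebook.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn `v1.0.9` into `(1, 0, 9)` so versions compare numerically."""
    return tuple(int(part) for part in version.removeprefix("v").split("."))

def next_version(existing: list[str], default: str = "v1.0.0") -> str:
    """Return `default` when nothing is registered yet, else the latest version with its patch bumped."""
    if not existing:
        return default
    major, minor, patch = max(parse_version(v) for v in existing)
    return f"v{major}.{minor}.{patch + 1}"

print(next_version([]))
print(next_version(["v1.0.9", "v1.0.10", "v1.0.2"]))
```

The second call matters because a plain string sort would treat `v1.0.9` as the latest of those three versions, which is why the comparison is done on integer tuples.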

Excellent. From here, we can implement our new URI File logic.

In [59]:
def feature_parquet_create_or_update(ml_client: azure.ai.ml._ml_client.MLClient, parquet_filepath: str, asset_name: str) -> str:
    new_asset = Data(
        name=asset_name,
        version=get_new_asset_version(asset_name),
        path=parquet_filepath,
        type=AssetTypes.URI_FILE,
        description="Parquet file describing the EmoDB feature set.",
    )
    print(type(new_asset))
    print(f"creating resource: {new_asset}\n...with ml_client: {ml_client}")
    # Omitting this line for now -- we'll let it run when we run the script as a job
    # return ml_client.data.create_or_update(new_asset)

In [60]:
feature_parquet_create_or_update(ml_client, parquet_filename, emodb_feature_set_parquet_filename)

<class 'azure.ai.ml.entities._assets._artifacts.data.Data'>
creating resource: description: Parquet file describing the EmoDB feature set.
name: emodb-feature-set-parquet-filename
path: /Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure/features.parquet
properties: {}
tags: {}
type: uri_file
version: v1.0.0

...with ml_client: MLClient(credential=<azure.identity._credentials.default.DefaultAzureCredential object at 0x12bd2ffa0>,
         subscription_id=3edf056b-b303-4dd1-9447-a08c6901065c,
         resource_group_name=rg-myfirstworkspace,
         workspace_name=ml-myfirstworkspace)


We now have everything we need to implement half of our Feature Specification logic — viz., everything required to create and upload the Parquet file. The final step is to use this asset to create the Feature Spec.

In [61]:
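# Imports assumed to come from the `azureml-featurestore` package
from azureml.featurestore import create_feature_set_spec
from azureml.featurestore.contracts import Column, ColumnType, TimestampColumn
from azureml.featurestore.feature_source import ParquetFeatureSource

def create_feature_specification(ml_client: azure.ai.ml._ml_client.MLClient, data_asset: azure.ai.ml.entities.Data) -> ...: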
    feature_set_spec = create_feature_set_spec(
        source=ParquetFeatureSource(
            path=data_asset.path,
            timestamp_column=TimestampColumn(name="timestamp"),
        ),
        index_columns=[Column(name="filename", type=ColumnType.string)],
        infer_schema=True,
    )
    
    return ml_client.featurestore.create_or_update(feature_set_spec)

This function, like the above, actually has side effects — viz., creating the feature specification. As such, we'll avoid running it here, and instead opt to test it in a script.

### Implementing Scripts & Pipelines

We now have code for the following steps:
- Generate & Upload Parquet File
- Create Feature Set Specification 

Since we now understand the required logic, we'll extract code from the cells above to create two corresponding scripts: [generate_emodb_parquet.py](src/generate_emodb_parquet.py), and [create_emodb_feature_set_specification.py](src/create_emodb_feature_set_specification.py). We'll link them together with a pipeline job, and, if all goes well, end up with a Feature Set Specification we can use to train our model.

#### Provisioning a Custom Environment

Just...One thing we need to address first. Recall that, when executing a job, we need to specify an [environment](https://learn.microsoft.com/en-us/azure/machine-learning/concept-environments?view=azureml-api-2) within which to execute the script. The default curated environment doesn't contain all of the modules we need — namely, it lacks `soundfile` and `librosa`. This means we'll need to create our own environment that supplies these dependencies.

This is easy enough: We'll create a new environment based on an available base image, and supply our own `conda-env.yaml` to ensure the custom dependencies are available.

The YAML file looks like the following. Some of the dependencies are redundant (e.g., `azure-ai-ml`), but this won't hurt anything; it's better to specify everything required redundantly than to leave stones unturned.

```yaml
# src/conda-env.yaml
name: env-asr
channels:
  - pypi
dependencies:
  - python=3.11
  - librosa
  - azure-ai-ml
  - azure-ai-ml-entities
  - azure-identity
  - scikit-learn
  - pandas
  - numpy
  - matplotlib
```

We'll use the `mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04` image referenced in the [documentation](https://learn.microsoft.com/en-us/training/modules/work-environments-azure-machine-learning/4-create-use-custom-environments) as our base.

Critical gotcha: We need to make use of `azureml-core` in our script. However, this package is not installable via `conda-forge`, `defaults`, `anaconda`, or any of the other channels you'd think to list in the `conda-env.yaml`, despite the availability of `azureml` from these sources. The (evidently undocumented) solution is to include the following in the `dependencies` block, which first installs `pip` and then uses it to install `azureml-core` from PyPI. Note that adding `pypi` as a source channel does _not_ have the same effect.

```yaml
name: env-asr
channels:
  - conda-forge
dependencies:
  # omitted...
  - pip
  - pip:
    - azureml-core
```

In [1330]:
from azure.ai.ml.entities import Environment

env_asr = Environment(
    image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04",
    conda_file="./src/conda-env.yaml",
    name="env-asr",
    description="Environment created for Automated Speech Recognition.",
)
env_asr_result = ml_client.environments.create_or_update(env_asr)

#### Testing the Parquet Generation Script

Next, we need to choose a Compute target on which to load this environment, and then run our first script within it. We'll use the Cluster for this, since we'll be training soon anyway. Let's fetch all of our compute options in a dictionary, sorted by type (_note that this only works since we have exactly one of each type_).

In [64]:
compute_targets = {el.type:el.name for el in ml_client.compute.list()}
compute_targets

{'computeinstance': 'ci-ml-dev-885ddd69', 'amlcompute': 'training-cluster'}
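
As noted, the dict comprehension keeps only one name per compute type. If a workspace has several targets of the same type, grouping names into lists avoids silently overwriting entries. The sketch below uses `SimpleNamespace` stand-ins rather than real compute objects, and the second cluster name is made up for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins for the objects returned by `ml_client.compute.list()`; only `.type` and `.name` are used.
fake_computes = [
    SimpleNamespace(type="computeinstance", name="ci-ml-dev"),
    SimpleNamespace(type="amlcompute", name="training-cluster"),
    SimpleNamespace(type="amlcompute", name="training-cluster-2"),  # hypothetical second cluster
]

def group_by_type(computes) -> dict[str, list[str]]:
    """Collect compute names under their type, preserving listing order."""
    grouped = defaultdict(list)
    for compute in computes:
        grouped[compute.type].append(compute.name)
    return dict(grouped)

targets_by_type = group_by_type(fake_computes)
print(targets_by_type)
```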

#### Preparing for the Parquet Job

To run the job, we configure a working directory via the `code` keyword argument; pass a `command` to run our script, including CLI arguments; and specify the environment, compute target, and experiment names to associate with the job.

Before we configure the job: I forgot to assign a Managed Identity to both the Compute Instance and Cluster. Since this is the most seamless way for us to authenticate, let's update our Compute with a System Assigned Managed Identity.

Note that you _must_ use `IdentityConfiguration`. Passing the raw `dict` as `{"type": "SystemAssigned"}` breaks (with an uninformative error), because the value of the `identity` property is expected to have a `_to_rest_object` method, which only lives on Entity instances.

In [66]:
compute_updates = []

In [70]:
from azure.ai.ml.entities import ComputeInstance, AmlCompute, IdentityConfiguration

for c in ml_client.compute.list():
    if c.identity is None:
        c.identity = IdentityConfiguration(type="SystemAssigned")
        poller = ml_client.compute.begin_update(c)
        compute_updates.append(poller.result())
    else:
        print(f"{c.name} has identity of type: {c.identity.type}.")

ci-ml-dev-885ddd69 has identity of type: system_assigned.
training-cluster has identity of type: system_assigned.


#### Running the Parquet Step

Next, we configure the first step of the pipeline.

Recall that generating and saving the Parquet file required a few parameters, viz., the name of the dataset, so the EmoDB data could be retrieved; the Parquet filename to use when creating the URI File; etc.

The most straightforward way to handle this is to pass these parameters as command-line arguments via the `command` keyword argument, and retrieve them within the script with [argparse](https://docs.python.org/3/howto/argparse.html).

##### Interlude: `argparse`

[argparse](https://docs.python.org/3/howto/argparse.html) is Python's built-in tool for parsing command-line arguments. It lets users pass data from the command line using either long (`--`) or short (`-`) option styles, e.g.:

```bash
# either...
$ python my_script.py --path /path/to/input/file
# or...
$ python my_script.py -p /path/to/input/file
```

`argparse` is a powerful tool with lots of options for handling user input from the command line. Most of that is beside the point for our use case, however: We're not building a CLI tool, but rather invoking a script programmatically by passing a command to be run on a command line somewhere else.

In short, this means the command we write is essentially static: Users will never be manipulating command-line arguments. Instead, we'll be passing the same set through every time, allowing us to safely ignore almost all of what `argparse` can do.

For most simple workflows, the snippets below should suffice to equip your scripts with the basic machinery required to parse user arguments. They:
- Parse the command line into its arguments (`parse_args`)
- Use the arguments to implement your script logic (`main`)
- Invoke `main` with the parsed arguments as...Well, argument (`if __name__ == '__main__'`).

```python
import argparse

# 1. Define a `parse_args` somewhere
def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # 1a. Add arguments; you can add whatever you want here, but make sure `-x` options do NOT overlap
    parser.add_argument("-a", "--asset-name", type=str)
    parser.add_argument("-d", "--dataset-name", type=str)
    parser.add_argument("-p", "--parquet-output-filename", type=str)
    parser.add_argument("-o", "--download-path", type=str)
    parser.add_argument("-w", "--workspace-name", type=str)
    parser.add_argument("-s", "--subscription-id", type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args

# 2. Define a `main` function that accepts `args` as argument; see 3 for how to use
def main(args):
    # Get values
    dataset_name = args.dataset_name
    download_path = args.download_path
    
    # Download data...

    # Etc.

# 3. Add conditional execution block
if __name__ == "__main__":
    # Invoke `main` with result of `parse_args` as argument
    main(parse_args())
```
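
Two details of the pattern above are easy to check without a shell: `argparse` converts dashes in long option names to underscores on the resulting namespace, and `parse_args` accepts an explicit list in place of `sys.argv`. The following sketch uses a trimmed parser with two of the arguments from above; `build_parser` is a hypothetical name and the values are illustrative.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """A cut-down version of the parser above, kept to two options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dataset-name", type=str)
    parser.add_argument("-o", "--download-path", type=str)
    return parser

# Passing a list bypasses `sys.argv`, which is handy in notebooks and tests.
args = build_parser().parse_args(["-d", "ds-emodb", "--download-path", "/tmp/emodb"])
print(args.dataset_name, args.download_path)
```

Short and long forms can be mixed freely, as in the call above.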

Before we proceed, I'll need to create a service principal, and we'll pass its credentials through to the script.

Why, you ask? Good question. I'm not going to bother to detail the inconceivably stupid bugs that arise trying to do things sensibly, here — Azure ML's authentication scheme, general organization, and documentation are insultingly awful — but suffice to say, this is the smoothest way of doing things.

We generate a service principal, and persist its credentials. The correct way(s) to do this do _not_ involve storing credentials — every time you do it that way, you're doing it wrong — but I've wasted so much time tallying up outright mistakes in the Azure ML documentation that I'm not going to bother lifting a Key Vault, etc. We'll just store secrets as envvars, for now, and you'll just have to remember never to do something this incorrigibly stupid in production.

In [128]:
!echo ${os.environ["SUBSCRIPTION_ID"]}

edf056b-b303-4dd1-9447-a08c6901065c


In [126]:
os.environ["SUBSCRIPTION_ID"]

'3edf056b-b303-4dd1-9447-a08c6901065c'

In [125]:
!az account show

{
  "environmentName": "AzureCloud",
  "homeTenantId": "e6c5609e-4eaa-411c-ad2c-4d36e67cdc87",
  "id": "3edf056b-b303-4dd1-9447-a08c6901065c",
  "isDefault": true,
  "managedByTenants": [],
  "name": "subscription-ml/dev",
  "state": "Enabled",
  "tenantDefaultDomain": "pelekesprotonmail.onmicrosoft.com",
  "tenantDisplayName": "Default Directory",
  "tenantId": "e6c5609e-4eaa-411c-ad2c-4d36e67cdc87",
  "user": {
    "name": "peleke.s@protonmail.com",
    "type": "user"
  }
}


In [137]:
!az ad sp create-for-rbac --name "sp-azureml" --role Contributor --scopes /subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c > sp.json




Next, we load the JSON so we can extract the `appId`, `password`, and `tenant`.

In [100]:
import json

In [138]:
with open("sp.json", mode="r") as sp_file:
    sp = json.load(sp_file)

In [139]:
sp["appId"]

'3ebbd715-8040-425d-b0f1-26a509dd15d3'

Whatever you do, do _not_, _**ever**_, print the `password`. Even loading it up is horribly insecure, and the fact we have to use a client secret in the first place is ridiculous.
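
If you do need to inspect the service principal record while debugging, one option is to mask the secret fields before anything reaches the screen. This is only a sketch: `redact` is a hypothetical helper, and the dictionary below holds placeholder values under the same three keys mentioned above.

```python
import json

SECRET_KEYS = {"password"}  # assumption: the only sensitive field in the record

def redact(record: dict, secret_keys: set[str] = SECRET_KEYS) -> dict:
    """Return a copy of `record` that is safe to print: secret values are masked."""
    return {key: ("***" if key in secret_keys else value) for key, value in record.items()}

# Placeholder values for illustration only; not real credentials.
example_sp = {"appId": "<app-id>", "password": "<client-secret>", "tenant": "<tenant-id>"}
safe = redact(example_sp)
print(json.dumps(safe, indent=2))
```

The original dictionary is left untouched, so the real value is still available to pass along programmatically.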

Next, we configure the job. 

Note that you can end the `command` with a `sleep 10m`. This keeps the instance alive for 10 minutes, during which we'll be able to connect to the instance for debugging purposes. I'm keeping this number low so I don't accidentally incur charges, but you can set this to anything you'd like, including `sleep infinity`. Just remember to **manually cancel the job** to release the compute when you're done.

---

_Small bash note: You'll need to append `; sleep 10m` to the end of the command. The semicolon is important; it forces execution of `sleep` even if the preceding command fails._
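
The difference between `;` and `&&` is easy to verify locally. The sketch below runs two tiny shell commands through `subprocess`, assuming a POSIX shell is available; `run` is a hypothetical helper, and `echo` stands in for `sleep` so the check finishes immediately.

```python
import subprocess

def run(cmd: str) -> str:
    """Run `cmd` in the system shell and return whatever it wrote to stdout."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout.strip()

# `false` always fails. With `;` the next command still runs; with `&&` it is skipped.
with_semicolon = run("false; echo still-ran")
with_and = run("false && echo still-ran")
print(repr(with_semicolon), repr(with_and))
```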

In addition, note the `services` argument. This contains instances of `JupyterLabJobService`, `VsCodeJobService`, and `SshJobService`, entities that enable interactive debugging services. Specifically, they allow us to connect to and run Jupyter Lab within the instance; connect to it with VS Code for locally-mediated debugging; and/or SSH into the machine.

We have to include `openssh-server` and `ipykernel ~= 6.0.0` in our `conda-env.yaml` for these to work. These are included in all Azure ML curated environments, but _not_ the base `mcr` container we're using, so we'll add them just to be safe.

The `SshJobService` requires us to pass a public key for connection. Let's go ahead and create a new one with `ssh-keygen`.

In [361]:
!ssh-keygen -t rsa -b 4096 -f ./azure_ml_key -N "" -q

In [105]:
with open("./azure_ml_key.pub", mode="r") as public_key_file:
    public_key = public_key_file.read()

In [1167]:
from azure.ai.ml import command, Input
from azure.ai.ml.entities import JupyterLabJobService, VsCodeJobService, SshJobService

# configure job
generate_emodb_parquet_job = command(
    code="./src",
    command=f"python generate_emodb_parquet.py -a {emodb_feature_set_parquet_filename} -d {emodb_ds.name} -p {parquet_filename} -o '/tmp/emodb' -w {os.environ['WORKSPACE_NAME']} -s {os.environ['SUBSCRIPTION_ID']} -g {os.environ['RESOURCE_GROUP_NAME']} -i {sp['appId']} -t {sp['tenant']} -c {sp['password']}",
    environment=f"{env_asr_result.name}:{env_asr_result.version}",
    compute=compute_targets["computeinstance"],
    display_name="generate_emodb_parquet.py-test",
    experiment_name="test-generate-emodb-parquet",
    services={
      "My_jupyterlab": JupyterLabJobService(
        nodes="all",
      ),
      "My_vscode": VsCodeJobService(
        nodes="all",
      ),
      "My_ssh": SshJobService(
        ssh_public_keys=public_key,
        nodes="all",
      ),
    },
)
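
One caveat with assembling the `command` via an f-string: if any interpolated value contains a space or a shell metacharacter, the shell on the other end will split or reinterpret it. A possible hedge is to build the argument list first and let `shlex.join` handle the quoting. This is a sketch, not how the job above was actually configured; `build_command` is a hypothetical helper and the option values are placeholders.

```python
import shlex

def build_command(script: str, options: dict[str, str]) -> str:
    """Assemble `python <script> -k v ...` with every value shell-quoted."""
    argv = ["python", script]
    for flag, value in options.items():
        argv += [flag, value]
    return shlex.join(argv)

cmd = build_command(
    "generate_emodb_parquet.py",
    {"-d": "ds-emodb", "-p": "features.parquet", "-o": "/tmp/emodb dir"},  # last value is a made-up path with a space
)
print(cmd)
```

Only the value containing a space gets quoted; the remaining arguments pass through unchanged.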

Let's turn on the compute instance, then run the job.

In [1168]:
def start_compute_instance():
    compute_options = list(ml_client.compute.list())
    for compute_option in compute_options:
        if compute_option.type == "computeinstance":
            if not compute_option.state == "Running":
                print(f"Starting {compute_option.name}...")
                start_poller = ml_client.compute.begin_start(name=compute_option.name)
                start_poller.result()
            else:
                print("Already running!")

In [1169]:
start_compute_instance()

Already running!


Let's start the job.

In [1170]:
returned_job = ml_client.jobs.create_or_update(generate_emodb_parquet_job)

Uploading src (0.04 MBs): 100%|████████| 38182/38182 [00:02<00:00, 15823.60it/s]




In [1171]:
import webbrowser
webbrowser.open(returned_job.studio_url)

True

Executing this job will build the `env-asr` container; deploy it to the compute target we selected; and execute the script within it.

---

_Note that, for this run, I commented out any code that runs for POST-style side effects. I.e., the code that actually creates the Parquet file did **not** execute on this run. That's because we want to create the initial versions of our new files and feature scripts from a pipeline that runs both scripts in coordination, rather than create those assets **during** testing._

If you navigate to the studio, you'll be able to view the job's output, metrics, etc. Make particular note of the **Outputs + Logs** tab — you'll spend a disproportionate amount of time there.

Since we enabled debugging services, we'll be able to step into the instance once it's in a **Running** state. This way, we can update the training script directly, instead of having to continuously resubmit jobs to test our changes.

We fetch the services associated with the job using `MLClient.jobs.show_services` and passing the `name` of the `returned_job`.

In [206]:
services = ml_client.jobs.show_services(returned_job.name)

In [207]:
import json
for service, details in services.items():
    print(f"{service}::{json.dumps(dict(details), indent=2)}")

My_jupyterlab::{
  "type": "JupyterLab",
  "port": 8889,
  "status": "Running",
  "error": null,
  "endpoint": "https://jptrl-3p733ax1ayvuj0c3z6v2qim4247z0gmsij9wt7dmy07u53h2i3c.eastus.nodes.azureml.ms",
  "properties": {}
}
My_vscode::{
  "type": "VSCode",
  "port": null,
  "status": "Running",
  "error": null,
  "endpoint": "vscode://ms-toolsai.vscode-ai/interactiveSession?runUri=/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/ml-myfirstworkspace/experiments/test-generate-emodb-parquet/runs/amiable_napa_2qwhg0wzdb&windowId=_blank",
  "properties": {
    "ProxyEndpoint": "https://<internalServicePort>-3p733ax1ayvuj0c3z6v2qim4247z0gmsij9wt7dmy07u53h2i3c.eastus.nodes.azureml.ms/runVscodeCommand",
    "folderToOpen": "/mnt/azureml/cr/j/4d453af5a3c7426e93bcaf41613dddeb/exe/wd",
    "internalServicePort": "8710"
  }
}
My_ssh::{
  "type": "SSH",
  "port": 8705,
  "status": "Running",
  "error": nul

Incidentally, I ended up not making use of the debugging extensions, but [review the documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-interactive-jobs?view=azureml-api-2&tabs=python#enable-during-job-submission) for guidance. The VS Code URL launches a connection to your instance in VS Code; the Jupyter link opens Jupyter Lab; and SSH connection is achieved with the below.

```bash
$ az ml job connect-ssh --name <job-name> --node-index <compute node index> --private-key-file-path <path to private key>
```

When the job succeeds, you'll get a **Completed** check. We retrieve the file in the notebook to validate its creation; looks good.

In [1207]:
parquet_asset_versions = sorted(list(ml_client.data.list(name=emodb_feature_set_parquet_filename)), key=lambda a: a.version)
parquet_asset_versions[-1]

Data({'path': 'azureml://subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourcegroups/rg-myfirstworkspace/workspaces/ml-myfirstworkspace/datastores/workspaceblobstore/paths/LocalUpload/fd505fae40c8df5dd8e48698919bec68/features.parquet', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'emodb-feature-set-parquet-filename', 'description': 'Parquet file describing the EmoDB feature set.', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/ml-myfirstworkspace/data/emodb-feature-set-parquet-filename/versions/v1.0.8', 'Resource__source_path': '', 'base_path': '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure', 'creation_context': <azure.ai.ml.entities._system_data.SystemD
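
One caveat about the sort in the cell above: `a.version` is a string, so the comparison is lexicographic, and `v1.0.10` would land before `v1.0.9`. That's harmless while patch numbers stay in single digits, but it will eventually return the wrong "latest" asset. Below is a minimal sketch of a numeric sort key, assuming every version keeps the `v<major>.<minor>.<patch>` shape seen in the output; `version_key` is a hypothetical helper, not part of the SDK.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Turn 'v1.0.10' into (1, 0, 10) so versions compare numerically."""
    # Assumes a leading 'v' followed by dot-separated integers.
    return tuple(int(part) for part in version.lstrip("v").split("."))


versions = ["v1.0.9", "v1.0.10", "v1.0.2"]

lexicographic = sorted(versions)             # plain string comparison
numeric = sorted(versions, key=version_key)  # compares (1, 0, 9) < (1, 0, 10)

print(lexicographic[-1])
print(numeric[-1])
```

With the notebook's list, the same key would be passed as `key=lambda a: version_key(a.version)`.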

#### Implementing the Feature Set Specification Script

Now that we have a script to create the Parquet file, we have what we need to create our feature set specification. This requires the following steps:
- Create a Feature Set Specification
- Create a Feature Set from the Feature Specification
- Persist the Feature Set Entity

By default, Azure advocates the use of Spark for the data transformations sometimes required when creating Feature Specs. The workflow in their documentation requires it, and there is, literally, _zero_ mention of any alternative possibility. Anywhere.

If you waste enough time reverse engineering the SDK, you'll eventually figure out that this is, technically, unnecessary, as it's perfectly achievable with Pandas, though Spark is, of course, the better option when dealing with enterprise-sized datasets. 

---

_Even that's a nightmare, by the way, because the docs are a legitimate embarrassment. By which I mean, I am literally embarrassed to put my eyes on them: Parameters are duplicated; marked as `Optional` in the `help` docs retrieved by Python, but `Required` on the web; parameter descriptions are absent, inconsistent, or non-existent; quoting conventions are inconsistent; uninterpolated strings are prepended with `f`; and my desk is now covered in a pelt of my own fucking hair, which is only vacuously not the fault of whichever mouthbreather is in charge of the unforgivable stack of steaming drivel that Azure dares deem "documentation"._

Before we dive into the workflow, let's briefly review what Feature Set Specifications actually are, since the docs do a characteristically poor job of explaining them.

#### Interlude: Feature Set Specifications

A Feature Set Specification is simply...Well, a specification defining a feature set. The spec is most easily understood by studying its [canonical YAML format](https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-featureset-spec?view=azureml-api-2).

The required parameters include:
- `$schema`: Which schema version to use. At time of writing, it should be `http://azureml/sdk-2-0/FeatureSetSpec.json`.
- `source`: An object describing the source data underlying the specification, e.g., the Parquet file we created from our DataFrame.
  - `type`: Type of the source data. These live on [azureml.featurestore.contracts.feature_source_type.SourceType](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.contracts.feature_source_type.sourcetype?view=azure-python), and include:
    - `MLTABLE`
    - `CSV`
    - `PARQUET`
    - `DELTATABLE`
    - `CUSTOM`
    - `FEATURE_SET`
  - `path`: Path to the underlying data source. In our case, this is `parquet_asset_versions[-1].path`.
  - `timestamp_column`: Object containing the name of the Timestamp column.
    - `name`: Name of timestamp column; for us, this is `"timestamp"`.
- `features`: A list containing objects of shape [azureml.featurestore.contracts.feature.Feature](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.contracts.feature.feature?view=azure-python), i.e., with `name` and `type` properties.
  - `name`: String representing the name of a given feature column.
  - `type`: Type of data in the column; can be any of the options defined by the [azureml.featurestore.contracts.ColumnType](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.contracts.columntype?view=azure-python) enum, i.e.: `STRING`; `INTEGER`; `LONG`; `FLOAT`; `DOUBLE`; `BINARY`; `DATETIME`; `BOOLEAN`; `NULL`; or `UNKNOWN`.
- `index_columns`: List containing [azureml.featurestore.contracts.Column](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.contracts.Column?view=azure-python) instances describing the `name` and `type` of each index column. In our case, we only have one, of type `ColumnType.STRING`, with a `name` of `"filename"`.
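
Put together, the structure described above looks roughly like the following. This is a sketch written as a plain Python dict that mirrors the YAML layout, not output from the SDK: the `path` value is a placeholder, the two feature names are made up for illustration, and the lowercase type strings are an assumption about how the enums serialize.

```python
import json

feature_set_spec = {
    "$schema": "http://azureml/sdk-2-0/FeatureSetSpec.json",
    "source": {
        "type": "parquet",
        # Placeholder: in the notebook this is parquet_asset_versions[-1].path
        "path": "azureml://<path-to>/features.parquet",
        "timestamp_column": {"name": "timestamp"},
    },
    # One entry per feature column; the names here are illustrative only
    "features": [
        {"name": "feature0", "type": "float"},
        {"name": "feature1", "type": "float"},
    ],
    # A single index column identifying each recording
    "index_columns": [{"name": "filename", "type": "string"}],
}

print(json.dumps(feature_set_spec, indent=2))
```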

The official tutorial(s) and documentation suggest the use of the [azureml.featurestore.create_feature_set_spec](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore?view=azure-python#azureml-featurestore-create-feature-set-spec) utility function. It has built-in support for Spark transformations, making it useful for creating feature sets that require manipulation of source data too large to load in-memory.

Our data is small enough to fit in-memory, and requires only straightforward transformations. This means it's possible to create a Feature Set Specification simply by defining this structure by hand. Unfortunately, we won't be able to _load_ the feature set 

Fortunately, the Azure ML suite provides classes we can use to define the specification above in code, and _dump_ a YAML file representing our Feature Set Specification. We can then use this to create a Feature Set Specification URI File asset, and use this, in turn, to create and register a Feature Set.

#### Implementing the Feature Set Specification Script

To recap, our workflow will be:
- Define an [azureml.featurestore.FeatureSetSpec](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.featuresetspec?view=azure-python) instance
- Call [dump](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.featuresetspec?view=azure-python#azureml-featurestore-featuresetspec-dump) on this instance to create a YAML file representing the specification
- Create a URI File using the result
- Use [azure.ai.ml.entities.FeatureSet](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.featureset?view=azure-python) and [azure.ai.ml.entities.FeatureSetSpecification](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.featuresetspecification?view=azure-python) to create a Feature Set instance
  - This also requires the creation of a [FeatureStoreEntity](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.featurestoreentity?view=azure-python).
  - Yes, confusingly enough, this is different from a `FeatureSetSpec`. 
- Use [azure.ai.ml.MLClient](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.mlclient?view=azure-python) to create and register the Feature Set Entity

##### Interlude: Workspace vs Feature Store

Note that we need to create several kinds of assets, here: A `Data` asset representing a YAML file; a `FeatureStoreEntity` to be used in conjunction with the `FeatureSetSpec`; etc.

Some of these assets are managed under different namespaces. For instance, we use an `MLClient` instance attached to the Azure ML Workspace to retrieve assets broadly associated with the Workspace — `Data`, `Dataset` instances, the `Workspace` itself, etc.

In [848]:
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group_name=os.environ["RESOURCE_GROUP_NAME"],
    workspace_name=os.environ["WORKSPACE_NAME"],
)

In [803]:
# fetch versions of the Parquet file created before
list(ml_client.data.list(name=emodb_feature_set_parquet_filename))[-1]

Data({'path': 'azureml://subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourcegroups/rg-myfirstworkspace/workspaces/ml-myfirstworkspace/datastores/workspaceblobstore/paths/LocalUpload/ec7acf62eb29e223de333bd4140c1b86/features.parquet', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'emodb-feature-set-parquet-filename', 'description': 'Parquet file describing the EmoDB feature set.', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/ml-myfirstworkspace/data/emodb-feature-set-parquet-filename/versions/v1.0.0', 'Resource__source_path': '', 'base_path': '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure', 'creation_context': <azure.ai.ml.entities._system_data.SystemD

On the other hand, we use an `MLClient` instance attached to the _feature store workspace_ to manage feature entities, feature sets, etc.

In [861]:
fs_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group_name=os.environ["RESOURCE_GROUP_NAME"],
    workspace_name=feature_store_name,
)

In [798]:
# NOTE: I added this cell AFTER running the script below -- this prints the asset we will create below
list(fs_client.feature_sets.list())

[FeatureSet({'is_anonymous': False, 'auto_increment_version': True, 'auto_delete_setting': None, 'name': 'EmoDB-FeatureSet', 'description': '', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': None, 'Resource__source_path': '', 'base_path': '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x14cd43730>, 'version': '', 'latest_version': 'v1.0.0', 'path': None, 'datastore': None, 'entities': [], 'specification': <azure.ai.ml.entities._feature_set.feature_set_specification.FeatureSetSpecification object at 0x14cd42680>, 'stage': 'Development', 'materialization_settings': None})]

On the surface, this is a sensible division of labor. The fact that both use an `MLClient` instance is a major _gotcha_, though, because there _does_ exist an [azureml.featurestore.FeatureStoreClient](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore.featurestoreclient?view=azure-python).

This class is principally used to handle [feature retrieval specifications](https://learn.microsoft.com/en-us/azure/machine-learning/feature-retrieval-concepts?view=azureml-api-2), something we won't deal with for some time, as they're more relevant to time-series data and projects with a heavy retraining component, which we have yet to have occasion to set up.

Somewhat misleadingly, it also exposes methods to `get` and `list` Feature Sets, Feature Set Entities, and Feature Stores, but does _not_ provide any functionality to create, update, or delete them.

Do **not** attempt to use the `azureml.featurestore.FeatureStoreClient` to create, update, or delete Feature Sets and related resources. _All_ of that should be done via the tooling provided by `azure.ai.ml`, namely:
- The `MLClient`, configured to point to the **feature store workspace**; and
- The following classes from `azure.ai.ml.entities` (as well as the [others that we don't use today](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities?view=azure-python)):
  - [FeatureSet](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.featureset?view=azure-python)
  - [FeatureSetSpecification](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.featuresetspecification?view=azure-python)
  - [FeatureStoreEntity](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.featurestoreentity?view=azure-python)

First, collect the imports.

In [440]:
from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification
from azureml.featurestore import FeatureSetSpec
from azureml.featurestore.contracts.feature import Feature
from azureml.featurestore.contracts.feature_source_type import SourceType
from azureml.featurestore.contracts.feature_source import FeatureSource
from azureml.featurestore.contracts import Column, ColumnType, TimestampColumn

Next, define the `FeatureSetSpec`. Note the following:
- We create a `FeatureSource` instance to represent the `source` entry in the YAML file. Its `type` is the `SourceType.PARQUET` enum member, and its `path` is the `path` property of the last element of `parquet_asset_versions`, i.e., the Parquet file created by our most recent Job run.
- The `timestamp_column` kwarg to `FeatureSource` accounts for the `timestamp_column.name` property of the YAML file, accepting a `TimestampColumn` instance with a `name` property.
- `features` is provided by mapping `Feature(name=str(f), type=ColumnType.FLOAT)` over all `f` in `X.columns.to_list()`. Essentially, this creates objects that look like `Feature({"name": "0", "type": ColumnType.FLOAT})` for each column in `X`.
  - Here, we have `X` because we loaded it directly. In the script, we'll need to include a Step 0 that fetches, downloads, and loads the parquet URI File as a DataFrame. 
- `index_columns` explicitly labels `filename` as the index `Column`, with `type` `ColumnType.STRING`.
- Optional lookback intervals are omitted, since we're not dealing with time-series or historical data.

Note that using the Enums and wrapper classes is **required** in this workflow; passing, e.g., `"parquet"` for `FeatureSource.type` will error, as will passing primitive values as names in `features`.

```python
# 1. Define FeatureSetSpec instance
fss = FeatureSetSpec(
    source=FeatureSource(
        type=SourceType.PARQUET,
        path=parquet_asset_versions[-1].path,
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    features=[Feature(name=str(f), type=ColumnType.FLOAT) for f in X.columns.to_list()],
    index_columns=[Column(name="filename", type=ColumnType.STRING)],
)
```

In [1006]:
fss = FeatureSetSpec(
    source=FeatureSource(
        type=SourceType.PARQUET,
        path=parquet_asset_versions[-1].path,
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    features=[Feature(name=f"feature{str(f)}", type=ColumnType.FLOAT) for f in X.columns.to_list() if f != 'timestamp' and f != 'filename'],
    index_columns=[Column(name="filename", type=ColumnType.STRING)],
)
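
Note that the executed cell differs from the sketch before it: it leaves `timestamp` and `filename` out of `features`, presumably because those are already declared as the timestamp and index columns, and it prefixes each name with `feature`. Here is the selection logic in isolation, using a made-up column list in place of `X.columns.to_list()`:

```python
# Hypothetical stand-in for X.columns.to_list(); the real frame has many more columns
columns = [0, 1, 2, "timestamp", "filename"]

# Columns that play a structural role and therefore are not listed as features
reserved = {"timestamp", "filename"}

feature_names = [f"feature{col}" for col in columns if col not in reserved]
print(feature_names)
```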

Next, we `dump` the YAML file.

```python
# 2. `dump` the YAML -- for some reason, `dump` ignores custom filenames, true to form
fss_yaml_path = "./FeatureSetSpecification.yaml"
fss.dump(fss_yaml_path, overwrite=True)
```

We upload this file as a URI File asset...

```python
# 3. Create a URI File using the result. DO NOT RUN THIS; we'll do it in a job
feature_set_specification_yaml_name = "EmoDB-FeatureSetSpecfication"
feature_set_specification_yaml_asset = Data(
    name=feature_set_specification_yaml_name,
    version=get_new_asset_version(ml_client, feature_set_specification_yaml_name),
    path=fss_yaml_path,
    type=AssetTypes.URI_FILE,
    description="Feature Set Specification YAML for EmoDB data.",
)
feature_set_specification_yaml_result = ml_client.data.create_or_update(feature_set_specification_yaml_asset)
```

We can now use this YAML file to instantiate a `FeatureSetSpecification`, pass it to a `FeatureSet`, and register the result. We'll update our `get_new_asset_version` function to create a `get_new_feature_set_version` for the same purpose.

The `FeatureSet` constructor accepts an `entities` argument, which, according to `help`, is required.

Elsewhere, we've seen the term Entity refer to a representation of an arbitrary resource. There exist classes like, e.g., `azure.ai.ml.entities.Workspace`, which represents, of course, an Azure ML Workspace. This is a _general_ use of the term, in accordance with the natural-language definition of the word "entity".

In this case, Entity refers specifically to a `FeatureStoreEntity` — a specific _kind_ of entity used to force consistent use of identical index column definitions across feature sets that share an index type. For instance, if two related data sets use client account IDs as index columns, we would create a `FeatureStoreEntity` representing account IDs, and use that _single_ entity to create _any_ Feature Set indexed on account ID.

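Going further, Feature Set Entities are defined in terms of the following properties, all of which appear in the snippet below:
- `name`: Name of the entity. Here, we set it equal to the index column name, `"filename"`.
- `version`: Version of the entity, as a string.
- `index_columns`: List of `DataColumn` instances giving the `name` and `type` of each index column.
- `stage`: Lifecycle stage of the entity, e.g., `"Development"`.
- `description`: Free-text description of what the entity represents.
- `tags`: Dictionary of arbitrary metadata.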

```python
# 4a. create a Feature Set Entity
from azure.ai.ml.entities import DataColumn, DataColumnType, FeatureStoreEntity

# set `entity_name` equal to the Index Column name defined in `FeatureSetSpec.yaml`
entity_name = "filename"
entity_versions = sorted([int(e.version) for e in fs_client.feature_store_entities.list(name=entity_name)])
next_version = str(entity_versions[-1] + 1) if len(entity_versions) > 0 else "1"

fs_entity = FeatureStoreEntity(
    name=entity_name,
    version=next_version,
    index_columns=[DataColumn(name=entity_name, type=DataColumnType.STRING)],
    stage="Development",
    description=f"This entity represents the index column of the EmoDB dataset, `{entity_name}`.",
    tags={
        "pii": False
    },
)

entity_poller = fs_client.feature_store_entities.begin_create_or_update(fs_entity)
entity_poller.result()
```

For the `entities` argument of the `FeatureSet`, I'll use `azureml:dataset:1`, with the logic being as follows:
- `azureml` is a broad namespace, indicating this entity belongs to Azure ML
- `dataset` is the "entity container" itself, i.e., a container for a specific ID
- `1` is the dataset ID, i.e., the reference to the specific entity

```python
feature_set_name = "EmoDB-FeatureSet2"
featureset = FeatureSet(
    name=feature_set_name,
    version=get_new_feature_set_version(fs_client, feature_set_name),
    description="Data Set for training German SER Classifier.",
    entities=["azureml:dataset:1"],
    specification=FeatureSetSpecification(path=output_dir),
    tags={
        "pii": False,
    },
)
```

In [904]:
def get_new_feature_set_version(fs_client: MLClient, asset_name: str) -> str:
    try:
        asset_versions = [asset.version for asset in fs_client.feature_sets.list(name=asset_name)]
        if len(asset_versions) == 0:
            # package is just broken, as usual, try to fetch specific versions...how stupid
            latest = 0
            while True:
                version = f"v1.0.{latest}"
                try:
                    fs_client.feature_sets.get(name=asset_name, version=version)
                    latest += 1
                except azure.core.exceptions.ResourceNotFoundError as e:
                    return version
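        # `asset_versions` holds version strings, so sort them in place by their
        # numeric components; this keeps "v1.0.10" after "v1.0.9"
        asset_versions.sort(key=lambda v: [int(part) for part in v.lstrip("v").split(".")])
        return bump_patch(asset_versions[-1])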
    except azure.core.exceptions.ResourceNotFoundError:
        return "v1.0.0"

In [905]:
get_new_feature_set_version(fs_client, feature_set_name)

'v1.0.1'

Finally, we create and register the `featureset` instance with our `fs_client`.

```python
# 5. create and register Feature Set
fs_poller = fs_client.feature_sets.begin_create_or_update(featureset=featureset)
fs_result = fs_poller.result()
```

#### Running the Feature Set Creation Job

I've collected the code above, with necessary modifications, into the [create_emodb_feature_set_specification.py](./src/create_emodb_feature_set_specification.py) script. In order to retrieve the Parquet file, we use [inputs](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2?view=azureml-api-2&tabs=python) to have Azure ML download the URI file to `path_on_compute` within the Compute Instance, which allows us to run `pd.read_parquet(path=args.path_on_compute)` to load the DataFrame.
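
Here is a sketch of how such a script might read its arguments. The flags match the ones passed in the job's `command` string (`--parquet-file`, `-f`, `-p`, `-v`); the long option names and `dest` values are assumptions, and the remaining flags (`-w`, `-s`, `-g`, `-i`, `-t`, `-c`) would follow the same pattern.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Create the EmoDB feature set specification.")
    # Azure ML replaces the input placeholder with the local path of the downloaded file
    parser.add_argument("--parquet-file", dest="parquet_file", required=True)
    # Long names below are hypothetical; only the short flags appear in the job command
    parser.add_argument("-f", "--feature-store-name", dest="feature_store_name", required=True)
    parser.add_argument("-p", "--parquet-asset-name", dest="parquet_asset_name", required=True)
    parser.add_argument("-v", "--parquet-asset-version", dest="parquet_asset_version", required=True)
    return parser


# Simulate the arguments the job would receive on the compute instance
args = build_parser().parse_args(
    ["--parquet-file", "/tmp/features.parquet", "-f", "my-store", "-p", "my-asset", "-v", "v1.0.0"]
)
print(args.parquet_file)
# In the real script, this path would then be handed to pd.read_parquet
```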

There's one last prep step: We need to assign Storage Blob Data Reader and Storage Blob Data Contributor roles to our Service Principal.

This is required **even if your service principal has** `Contributor` **permissions over your subscription**, because these roles are _separate_ from the global `Contributor` scope.

---

_You don't need to use a service principal. If you use [AzureMLOnBehalfOfCredential](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.identity.azuremlonbehalfofcredential?view=azure-python), or [DefaultAzureCredential](https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) to detect user creds, you can assign the required roles to your User, instead, and pass `identity=azure.ai.ml.entities.UserIdentityConfiguration()` as a kwarg to your `command` job._

In [740]:
!az role assignment create \
  --role "Storage Blob Data Reader" \
  --assignee "3ebbd715-8040-425d-b0f1-26a509dd15d3" \
  --scope "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.Storage/storageAccounts/fsmyfirsstoragecb88f568c"

{
  "condition": null,
  "conditionVersion": null,
  "createdBy": null,
  "createdOn": "2024-08-24T00:24:07.181265+00:00",
  "delegatedManagedIdentityResourceId": null,
  "description": null,
  "id": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.Storage/storageAccounts/fsmyfirsstoragecb88f568c/providers/Microsoft.Authorization/roleAssignments/1f2035a3-3ded-4875-8cea-35bd496b9400",
  "name": "1f2035a3-3ded-4875-8cea-35bd496b9400",
  "principalId": "40a34d43-0193-4728-b7e3-b17838739b32",
  "principalType": "ServicePrincipal",
  "resourceGroup": "rg-myfirstworkspace",
  "roleDefinitionId": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/providers/Microsoft.Authorization/roleDefinitions/2a2b9908-6ea1-4ae2-8e65-a410df84e7d1",
  "scope": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.Storage/storageAccounts/fsmyfirsstoragecb88f568c",
  "type": "Microsoft.Authorizati

In [741]:
!az role assignment create \
  --role "Storage Blob Data Contributor" \
  --assignee "3ebbd715-8040-425d-b0f1-26a509dd15d3" \
  --scope "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.Storage/storageAccounts/fsmyfirsstoragecb88f568c"

{
  "condition": null,
  "conditionVersion": null,
  "createdBy": null,
  "createdOn": "2024-08-24T00:24:18.912988+00:00",
  "delegatedManagedIdentityResourceId": null,
  "description": null,
  "id": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.Storage/storageAccounts/fsmyfirsstoragecb88f568c/providers/Microsoft.Authorization/roleAssignments/7532880e-4042-4248-bb0b-712a65bb1e6d",
  "name": "7532880e-4042-4248-bb0b-712a65bb1e6d",
  "principalId": "40a34d43-0193-4728-b7e3-b17838739b32",
  "principalType": "ServicePrincipal",
  "resourceGroup": "rg-myfirstworkspace",
  "roleDefinitionId": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/providers/Microsoft.Authorization/roleDefinitions/ba92f5b4-2d11-453d-a403-e96b0029c9fe",
  "scope": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.Storage/storageAccounts/fsmyfirsstoragecb88f568c",
  "type": "Microsoft.Authorizati

Let's verify that we have all required permissions: Global Contributor (admittedly, this is probably overkill), as well as the Storage Blob permissions.

In [820]:
!az role assignment list --assignee "3ebbd715-8040-425d-b0f1-26a509dd15d3" --all

[
  {
    "condition": null,
    "conditionVersion": null,
    "createdBy": "01bfedbd-3a90-47f5-8c7a-ad7d1de0b5a5",
    "createdOn": "2024-08-19T23:26:01.603528+00:00",
    "delegatedManagedIdentityResourceId": null,
    "description": null,
    "id": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/providers/Microsoft.Authorization/roleAssignments/db8b188e-471a-4b68-8325-6af03df9c03d",
    "name": "db8b188e-471a-4b68-8325-6af03df9c03d",
    "principalId": "40a34d43-0193-4728-b7e3-b17838739b32",
    "principalName": "3ebbd715-8040-425d-b0f1-26a509dd15d3",
    "principalType": "ServicePrincipal",
    "roleDefinitionId": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c",
    "roleDefinitionName": "Contributor",
    "scope": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c",
    "type": "Microsoft.Authorization/roleAssignments",
    "updatedBy": "01bfedbd-3a90-47f5-8c7a-ad7d1de0b5a5",
    "u

And with that, we should, in _theory_, be good to go. Let's see.

#### Moment of Truth

In [454]:
from azure.ai.ml.constants import InputOutputModes

In [1173]:
create_emodb_feature_set_job = command(
    code="./src",
    command=f"python create_emodb_feature_set_specification.py --parquet-file ${{inputs.parquet_file}} -f {feature_store_name} -p {emodb_feature_set_parquet_filename} -v {parquet_asset_versions[-1].version} -w {os.environ['WORKSPACE_NAME']} -s {os.environ['SUBSCRIPTION_ID']} -g {os.environ['RESOURCE_GROUP_NAME']}  -i {sp['appId']} -t {sp['tenant']} -c {sp['password']}",
    environment=f"{env_asr_result.name}:{env_asr_result.version}",
    compute=compute_targets["computeinstance"],
    display_name="create_emodb_feature_set_specification.py-test",
    experiment_name="test-create-emodb-feature-set",
    identity=azure.ai.ml.entities.UserIdentityConfiguration(),
    inputs={
        "parquet_file": Input(
            type=AssetTypes.URI_FILE,
            path=parquet_asset_versions[-1].path,
            mode=InputOutputModes.DOWNLOAD,
        ),
    },
)

In [1174]:
start_compute_instance()

Already running!


In [1175]:
returned_job = ml_client.jobs.create_or_update(create_emodb_feature_set_job)

Use of {} for parameters is deprecated, instead use ${{}}.
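
That deprecation warning is most likely a side effect of building the command with an f-string. Inside an f-string, `{{` and `}}` are escapes for single literal braces, so `${{inputs.parquet_file}}` reaches Azure ML as `${inputs.parquet_file}`, the single-brace form the warning complains about. Doubling the braces again preserves the intended placeholder. The snippet below only demonstrates the string behavior; it makes no SDK calls, and `store` is a hypothetical stand-in value.

```python
store = "my-feature-store"  # hypothetical value standing in for feature_store_name

# As written in the job: the f-string collapses the double braces to single ones
collapsed = f"--parquet-file ${{inputs.parquet_file}} -f {store}"

# Four braces on each side survive as two after f-string processing
preserved = f"--parquet-file ${{{{inputs.parquet_file}}}} -f {store}"

print(collapsed)
print(preserved)
```

An alternative that avoids the escaping altogether is to keep the placeholder in a plain string and concatenate it with the interpolated parts.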


In [1176]:
webbrowser.open(returned_job.studio_url)

True

And, after _all that work_, we win.

**TODO**: _Inject image_

### Training the Model

Now that we have a Feature Set Specification, we can use it to generate a Pandas DataFrame and, finally, train and register our model.

Since we're now getting into the meat of the Data Science, let's take a step back and meet MLflow, a critical tool for the workflow we're about to engage in.

#### Introducing MLflow 

Since we're proceeding to training, we should incorporate [MLflow](https://mlflow.org/) into our stack.

> MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible.

Most of your time in Data Science is spent exploring data, generating plots, and tinkering with models. Notebooks are excellent for this purpose, as their cell-oriented design makes it easy to document your thought process in markdown, while interspersing your code throughout.

Notebooks, however, lack any (real) notion of versioning: It's tricky to track the _progression_ of our thought process, and accordingly, our data analysis and training loops. For instance, it's difficult to keep track of the _sequence_ of hyperparameter tuning techniques and configurations you experimented with prior to landing on a final model.

The traditional SWE approach of using Git for versioning is a poor solution: Branch complexity would swiftly become profligate, and navigating the structure would make little sense to anyone, let alone a Data Scientist unlikely to have extensive experience traversing commit histories (_which, let's be honest, who wants that experience, anyway_).

In addition, distributing your experiments requires distributing the whole Notebook. This isn't an insurmountable challenge with tools like Azure ML Studio, [AWS Sagemaker](https://aws.amazon.com/sagemaker/), and [Google Colab](https://colab.research.google.com/), but this style of distribution is clunky, at best: It would be much more efficient to be able to share your experiments independently of any specific Notebook.

MLflow solves both these problems, providing us with:
- **Metric Tracking**: MLflow provides an intuitive, highly customizable interface for keeping track of your Exploratory Data Analysis, including plot generation, as well as the parameters you use to train, fine-tune, and optimize your models.
- **Distribution**: Experiment and tracking results are persisted to a **tracking server**, accessible as a web application where team members can explore a catalog of experiments, allowing you to easily share the artifacts of your Data Science work without the need to distribute an entire Notebook.
- **Performance Tracking**: After a model is released, it may be retrained on, e.g., new data, or to correct model drift. MLflow makes it easy to track performance of baseline models, iterations, and release candidates, simplifying the process of comparing model metrics and ensuring consistency.

In other words, MLflow essentially distributes the steps of the Data Science workflow, allowing Data Scientists to collect and share exploratory data analysis results; plots; preprocessing; training; deployment; and hyperparameter tuning experiments, amongst others.

Traditionally, MLflow would be installed on, e.g., a Kubernetes cluster, from which it could be accessed as a web application. We're working in Azure ML, where MLflow is already configured as a built-in. We'll cover the cloud-agnostic setup at a later date — there are advantages to being able to set this stuff up without relying explicitly on ML services to provision them for you — but, in the interest of maintaining focus on MLflow, we'll stick to Azure for now.

This means MLflow is already up and running on the server side: All we need to do is incorporate it into our training process to begin collecting metrics.

Fortunately, this is easy to do. By taking advantage of [autologging](https://mlflow.org/docs/latest/tracking/autolog.html), we can get started with just a few lines of code.

#### Autologging with MLflow

**Autologging** is a capability of MLflow that automatically tracks standard metrics _without_ requiring you to specify anything yourself. In fact, all it takes to get autologging up and running is the snippet below. Simply dropping our training code into the context manager (i.e., the `with` block) is sufficient to take advantage of it.

```python
import mlflow

mlflow.autolog()

with mlflow.start_run():
    # training code goes here; `pass` keeps the empty block syntactically valid
    pass
```

We'll need to add `mlflow` and `azureml-mlflow` to our `conda-env.yaml`. I've placed them both under `pip`, because `azureml-mlflow` is only available from PyPI, but `mlflow` can technically be installed from `conda-forge`, as well.

All we need to do is implement a training script, and wrap the training code in the `mlflow.start_run` block.

Fortunately, the training code is essentially complete: All we need to do is rip the `scikit-learn` logic out of the EmoDB notebook; load the DataFrame from the Feature Set Specification, instead of disk; and execute on our Compute Instance.

#### Interlude: Retrieving the Feature Set

Our training script will need to make use of the Feature Set we just created, instead of relying on loading data directly into Pandas. Let's verify that we can do that before attempting to train the model.

I apologize in advance for the immense confusion, but to actually retrieve the Feature Set we just created, we do _not_ use the `fs_client` we used to create it, nor the `ml_client` bound to our Workspace. Instead, we instantiate an `azureml.featurestore.FeatureStoreClient`, and use _that_ to fetch.

In [918]:
from azureml.featurestore import FeatureStoreClient

In [944]:
featurestore_client = FeatureStoreClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.environ["SUBSCRIPTION_ID"],
    resource_group_name=os.environ["RESOURCE_GROUP_NAME"],
    name=feature_store_name,
)

In a truly frustrating turn of events, the `MLClient.feature_sets.list` method is completely broken; it simply returns nothing, silently, as if that makes any sense at all.

In [1070]:
emodb_feature_sets_null = list(fs_client.feature_sets.list(name=feature_set_name))
emodb_feature_sets_null

[]

Yet more confusingly (as if that were even possible), using it to `get` the specific version we want actually _does_ work.

In [1071]:
emodb_feature_set = fs_client.feature_sets.get(
    name=feature_set_name, 
    version="v1.0.0",
)
emodb_feature_set

FeatureSet({'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'EmoDB-FeatureSet2', 'description': 'Data Set for training German SER Classifier.', 'tags': {'pii': 'False'}, 'properties': {'featuresetPropertiesVersion': '1', 'featuresetProperties': '{"features":[{"FeatureName":"featurefilepath","DataType":3}],"source":{"type":"parquet","path":"azureml://subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourcegroups/rg-myfirstworkspace/workspaces/ml-myfirstworkspace/datastores/workspaceblobstore/paths/LocalUpload/ec7acf62eb29e223de333bd4140c1b86/features.parquet","timestampColumn":{"name":"timestamp"}},"indexColumns":[{"DataType":0,"ColumnName":"filename"}]}', 'offlineMaterializationVersion': None, 'onlineMaterializationVersion': None, 'offlineStoreConnectionName': None, 'onlineStoreConnectionName': None}, 'print_as_yaml': False, 'id': '/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.M

Whatever. Let's retrieve the Feature Set again...

In [1083]:
emodb_feature_sets = list(featurestore_client.feature_sets.list())
emodb_feature_set = featurestore_client.feature_sets.get(
    name=feature_set_name,
    version="v1.0.2",
)
emodb_feature_set

FeatureSet
{
  "name": "'EmoDB-FeatureSet2'",
  "version": "'v1.0.2'",
  "specification": "<azure.ai.ml.entities._feature_set.feature_set_specification.FeatureSetSpecification object at 0x1509c7a00>",
  "source": "ParquetFeatureSource(path: azureml://subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourcegroups/rg-myfirstworkspace/workspaces/ml-myfirstworkspace/datastores/workspaceblobstore/paths/LocalUpload/ec53515e43b26e20385bcc5dd1ab8de0/features.parquet, timestamp_column: TimestampColumn(Name=timestamp,Format=None), source_delay: None)",
  "entities": [
    "FeatureStoreEntity({'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'filename', 'description': 'This entity represents the index column of the EmoDB dataset, `filename`.', 'tags': {'pii': 'False'}, 'properties': {}, 'print_as_yaml': False, 'id': None, 'Resource__source_path': '', 'base_path': '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure', 'creation_

Note the long list of `features`; these are the `float` columns we need to train our model.

We can select the subset of features we're interested in by calling `get_feature` and selecting each one we want by name.

In [1086]:
training_features = [emodb_feature_set.get_feature(f"feature{i}") for i in range(180)]
training_features[:5]

[Feature(Name=feature0,Type=float),
 Feature(Name=feature1,Type=float),
 Feature(Name=feature2,Type=float),
 Feature(Name=feature3,Type=float),
 Feature(Name=feature4,Type=float)]

At this point, we could convert our `training_features` list to a Spark DataFrame by using [azureml.featurestore.get_offline_features](https://learn.microsoft.com/en-us/python/api/azureml-featurestore/azureml.featurestore?view=azure-python#azureml-featurestore-get-offline-features).

This, of course, requires a Spark context. I'm not running Spark locally, and I'm not interested in incurring a ton of cost by setting up an attached Spark Compute Cluster via, e.g., Azure Synapse Analytics.

Similarly, running a training script on Spark Serverless is a bit extra for the task at hand, because our data is small enough to fit in-memory; running an entire distributed stack to build a model that trains in less than one minute on local CPU is, to put it lightly, utter overkill.

That said, it would look something like the below.

```python
# DOES NOT RUN WITHOUT SPARK CONTEXT
training_df = azureml.featurestore.get_offline_features(training_features)
```

Fetching named features like this feels brittle, and there is, in fact, a more systematic way to do this.

One can define a [feature retrieval specification](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-experiment-train-models-using-features?view=azureml-api-2&tabs=python#run-a-training-experiment) — a spec defining which features should be selected to train a model — and package it up with the model, so that everything lives in one place.

For now, the additional complexity of setting up a Spark environment, just to show how to do it, seems unwarranted — it's unnecessary, and would be more distracting than informative.

When we come back to train, e.g., a weather or stock forecasting model, which require simply profligate amounts of time-series data that must be subject to complex transformation, we'll have occasion to explore these big-data oriented features. 

#### Training the Model & Tracking Experiments

Instead of Spark, let's just go ahead and implement a script that loads the Parquet file we used to generate the Feature Spec. We'll use that to track a set of MLflow runs to begin building a set of distributed experiments our team can explore in the Studio UI.

We plan to load the Parquet data within a script, which means we can pass the `parquet_file` through to the job in the same manner as before. This should allow us to load it into Pandas with a simple call to `read_parquet`, as follows:

```python
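# this runs IN THE JOB -- the parquet file is passed as an `Input` and loaded here
import argparse
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

# Azure ML substitutes the local path of the downloaded file for the input placeholder
parser = argparse.ArgumentParser()
parser.add_argument("--parquet-file", required=True)
args, _ = parser.parse_known_args()  # the remaining flags are parsed elsewhere

# Load the Parquet file into a Pandas DataFrame
X = pd.read_parquet(args.parquet_file)
logging.info(X.head())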
```

Dead simple, for once. We'll dump this code into [src/train.py](./src/train.py), and build on it as we go.

For this to work, we have to pass `parquet_file` to our job, as before:

In [1267]:
create_train_job = command(
    code="./src",
    command=f"python train.py --parquet-file ${{inputs.parquet_file}} -i {sp['appId']} -c {sp['password']} -t {sp['tenant']} -x {experiment_name} -w {os.environ['WORKSPACE_NAME']} -s {os.environ['SUBSCRIPTION_ID']} -g {os.environ['RESOURCE_GROUP_NAME']}",
    environment=f"{env_asr_result.name}:{env_asr_result.version}",
    compute=compute_targets["computeinstance"],
    display_name="train.py-test",
    experiment_name="test-train-emodb-model",
    identity=azure.ai.ml.entities.UserIdentityConfiguration(),
    inputs={
        "parquet_file": Input(
            type=AssetTypes.URI_FILE,
            path=parquet_asset_versions[-1].path,
            mode=InputOutputModes.DOWNLOAD,
        ),
    },
)

We then run as usual:

In [1268]:
start_compute_instance()

Starting ci-ml-dev-885ddd69...


In [1269]:
returned_job = ml_client.jobs.create_or_update(create_train_job)

Uploading src (0.04 MBs): 100%|████████| 40902/40902 [00:01<00:00, 29132.91it/s]


Use of {} for parameters is deprecated, instead use ${{}}.


In [1270]:
webbrowser.open(returned_job.studio_url)

True
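The deprecation warning in the job output above is likely a side effect of building the command with an f-string: inside an f-string, `{{` and `}}` are escapes for single literal braces, so `${{inputs.parquet_file}}` reaches Azure ML as `${inputs.parquet_file}`. A minimal sketch of the behaviour, using a made-up `script_arg` value in place of the real interpolated arguments:

```python
# Hypothetical value, standing in for the interpolated arguments in the real command.
script_arg = "emodb"

# Doubled braces in an f-string collapse to single braces...
single = f"python train.py --parquet-file ${{inputs.parquet_file}} -x {script_arg}"

# ...so emitting the literal double-brace placeholder takes four braces on each side.
double = f"python train.py --parquet-file ${{{{inputs.parquet_file}}}} -x {script_arg}"

print(single)
print(double)
```

The job still runs with the single-brace form, as the output above shows; quadrupling the braces (or building the command without an f-string) should simply silence the warning.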

We can now proceed by training our model and saving evaluation metrics, confusion matrix plots, etc., to the tracking server.

In order to do this, MLflow must be configured with a tracking URI, which it uses to upload any data captured during a run. Each Azure ML Workspace you provision comes with an MLflow tracking URL out of the box, which can be fetched with the snippet below.

---

_[MLflow](https://mlflow.org/) can also be installed as an unmanaged service on, e.g., an [AKS cluster](https://learn.microsoft.com/en-us/azure/aks/). In this case, your MLflow instance would expose its own tracking URL, and you'd have to configure [Network Security Groups](https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview), [AKS Ingress policies](https://learn.microsoft.com/en-us/azure/aks/app-routing?tabs=default%2Cdeploy-app-default), etc._

In [1104]:
mlflow_tracking_uri = ml_client.workspaces.get(name=ml_client.workspace_name).mlflow_tracking_uri
mlflow_tracking_uri

'azureml://eastus.api.azureml.ms/mlflow/v1.0/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/ml-myfirstworkspace'

We can then configure `mlflow` to use this tracking URL with the following snippet.

In [1148]:
!pipenv install mlflow azureml-mlflow

<output trimmed>


In [1105]:
import mlflow

mlflow.set_tracking_uri(mlflow_tracking_uri)

`mlflow` will now send data captured in our runs to our `mlflow_tracking_uri`. Communication with the Workspace requires authentication. According to the documentation, [we must use service principal authentication for unattended usage](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-configure-tracking?view=azureml-api-2&tabs=python%2Cmlflow#configure-authentication). Fortunately, the package handles this for us, as long as we set the proper environment variables, e.g.:

```python
# must be set in job for mlflow to authenticate correctly 
os.environ['AZURE_CLIENT_ID'] = args.app_id
os.environ['AZURE_CLIENT_SECRET'] = args.client_secret
os.environ['AZURE_TENANT_ID'] = args.tenant
```
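For context, here is a sketch of how the `args` object used above might be produced inside `train.py`. The short flags mirror the ones passed in the `command` string earlier; the long option names, and the `parse_args` and `export_credentials` helpers, are my own assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags passed on the job's command line (long names are assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--parquet-file", required=True)
    parser.add_argument("-i", "--app-id", required=True)
    parser.add_argument("-c", "--client-secret", required=True)
    parser.add_argument("-t", "--tenant", required=True)
    parser.add_argument("-x", "--experiment-name")
    parser.add_argument("-w", "--workspace-name")
    parser.add_argument("-s", "--subscription-id")
    parser.add_argument("-g", "--resource-group-name")
    return parser.parse_args(argv)


def export_credentials(args):
    """Expose the service principal to the Azure identity libraries via env vars."""
    os.environ["AZURE_CLIENT_ID"] = args.app_id
    os.environ["AZURE_CLIENT_SECRET"] = args.client_secret
    os.environ["AZURE_TENANT_ID"] = args.tenant


# Example invocation with placeholder values (not real credentials)
args = parse_args([
    "--parquet-file", "features.parquet",
    "-i", "app-id", "-c", "secret", "-t", "tenant-id", "-x", "emodb",
])
export_credentials(args)
```

Passing `argv` explicitly makes the parser easy to exercise in a notebook; inside the job, calling `parse_args()` with no arguments reads `sys.argv` as usual.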

With that, we're now ready to associate our runs with an experiment. While we already know that **experiments** act as containers for data collected from **runs**, we haven't examined these entities formally, so let's take that deep-dive now.

---

_If you'd prefer not to follow along in Azure for reasons of cost, you can install MLflow locally, and run a local server with:_

```bash
$ mlflow server --host 127.0.0.1 --port 8080
```

_...After which you can navigate to `http://127.0.0.1:8080` (or `http://localhost:8080`) to see your tracking server. Of course, you would use this URL in your Python code, as well:_

```python
import mlflow

host = "127.0.0.1"
port = 8080
mlflow.set_tracking_uri(f"http://{host}:{port}")
```

If we add code to generate the plot of emotion distribution with this configuration, we see that MLflow does indeed save our plot to the tracking server — it can now be seen in the Studio UI.

![](./images/mlflow-visualization.png)

#### Interlude: MLflow Experiments & Runs

When you explore data or train a model, you run a bunch of functions against the data set. This could be anything from generating plots to transforming the data, but whatever it is, you're likely to _run_ each thing you do multiple times, with different parameters.

While every run might differ from every other with regard to its parameters, they are all performed against the same dataset. Logically, that makes them all part of the same _experiment_.

It's in this way that [MLflow experiments](https://mlflow.org/docs/latest/getting-started/logging-first-model/step3-create-experiment.html) are _containers_ for runs: They act as "folders" containing the data associated with plots, transformations, training, etc., performed against a given data set.

Technically, MLflow allows you to get away with _not_ creating a new experiment for each project: It comes out of the box with a **Default** experiment, which acts as a catch-all for any runs performed without reference to a specific experiment.

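Naming an experiment explicitly takes two steps, passing the name through to the job and selecting it inside the script, and looks something like the below, where `args.experiment_name` would be `"emodb"` or something of the sort: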

```python
# WITHIN JOB
# We send this through as an arg equal to `"emodb"`, in addition to passing as `kwarg` to `command`
mlflow.set_experiment(args.experiment_name) 
```

As a best practice, you _should_ create a new experiment for every new data set; consider the Default a safety net.

> However, I've run into issues getting this to work as expected when using `mlflow` in conjunction with Azure ML. I suspect this is because Azure manages the `mlflow` run and experiment context for us, but I haven't dug enough to know for sure.

> For our purposes, using the default, randomized experiment name(s) will be fine.

With that, we can create an MLflow run, and begin associating assets, such as plots or training metrics, with it. All plots or metrics generated during a given execution of the script will be associated with each other, making it easy to review the context and results of training runs performed against a given dataset.

We create a run by calling `with mlflow.start_run()`; anything run within the context manager is associated with the run. 

While you can technically run all of your code under the `start_run` block for convenience, [you shouldn't](https://mlflow.org/docs/latest/getting-started/intro-quickstart/index.html):

> While it can be valid to wrap the entire code within the start_run block, this is not recommended. If there is an issue with the training of the model or any other portion of code that is unrelated to MLflow-related actions, an empty or partially-logged run will be created, which will necessitate manual cleanup of the invalid run. It is best to keep the training execution outside of the run context block to ensure that the loggable content (parameters, metrics, artifacts, and the model) are fully materialized prior to logging.

Instead, we'll use functions to generate our assets, and simply use the `run` block to register them.
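The shape of that pattern can be sketched as follows. The `train_and_evaluate` and `log_run` helpers are hypothetical, as are the parameter values and the placeholder metric; the tracking module is passed in as an argument so the sketch can be exercised without a tracking server, with `mlflow` itself being what we'd pass in the real script.

```python
def train_and_evaluate():
    """Stand-in for the real training code: returns everything worth logging."""
    params = {"hidden_layer_sizes": 180, "alpha": 1e-3}
    metrics = {"accuracy": 0.0}  # placeholder; a real run would compute this
    return params, metrics


def log_run(tracker, params, metrics):
    """Open a run only once the results exist, then record them."""
    with tracker.start_run():
        tracker.log_params(params)
        tracker.log_metrics(metrics)


# All the work happens before any run is opened...
params, metrics = train_and_evaluate()

# ...so a failure above never leaves a half-logged run behind.
# In the training script: `import mlflow; log_run(mlflow, params, metrics)`
```

`start_run`, `log_params`, and `log_metrics` are the MLflow calls being relied on here; anything else in the sketch is scaffolding.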

In our case, we'll proceed by:
- Training the model, producing the model itself
- Evaluating the model, producing metrics
- Implementing the MLflow workflow:
  - Logging parameters used to train and evaluate the model
  - Logging error metrics
  - Logging the model itself, i.e., registering and persisting an instance for later retrieval/reinstantiation for further experimentation
  - Logging visualizations, viz., learning curve and scaling properties

#### Training the SER Classifier

We begin by training the model, including the steps to:
- Scale features using the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) and [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
- Train the [MLP Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) on each (differently scaled) feature set
- Perform hyperparameter tuning
- Evaluate models, including performing [Stratified K-Fold CV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)
- Plot evaluation visualizations
  - Loss curve
  - Confusion Matrices
  - Learning curve
  - Scaling curve
- Register model(s)

We'll compute the result of each step with functions so we can easily persist them properly at the end of our script.

#### Scaling & Splitting Features

The first step is to scale our features and create a train/test split for each scaled version of the data set. Along the way, we save out the `"emotion"` (target) column from `X`, then drop it, so that `X` contains only features, and `y`, only labels.

```python
import pandas as pd
import sklearn.model_selection
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale(X: pd.DataFrame):
    sscaler = StandardScaler(copy=True)
    mmscaler = MinMaxScaler(copy=True)
    return {
        "standard": sscaler.fit_transform(X),
        "minmax": mmscaler.fit_transform(X),
    }


def split(features, labels, test_size=.2, random_state=0):
    """A function to split and return a feature matrix."""
    return sklearn.model_selection.train_test_split(features, labels, test_size=test_size, random_state=random_state)


def split2dict(X_train, X_test, y_train, y_test):
    return {
        "train": {
            "X": X_train,
            "y": y_train,
        },
        "test": {
            "X": X_test,
            "y": y_test,
        },
    }

def main():
    # ...SNIP...
    y = X["emotion"]
    X.drop(columns=["emotion"], inplace=True)
    # Scale and split
    scaled = scale(X)
    splits = {
        "standard": split2dict(*split(scaled["standard"], y)),
        "minmax": split2dict(*split(scaled["minmax"], y)),
    }
```
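To make the difference between the two scalers concrete, here is a dependency-free sketch of what each computes for a single column. The helper names and the toy column are mine; the formulas are the standard ones (`StandardScaler` divides by the population standard deviation, `MinMaxScaler` maps onto `[0, 1]` by default).

```python
from statistics import fmean, pstdev


def standard_scale_column(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


def minmax_scale_column(values):
    """x' = (x - min) / (max - min), mapping the column onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


column = [1.0, 2.0, 3.0]  # toy values, not EmoDB features
standard = standard_scale_column(column)
minmax = minmax_scale_column(column)
print(standard)
print(minmax)  # [0.0, 0.5, 1.0]
```

Both helpers divide by zero on a constant column, which the `sklearn` implementations guard against; the sketch omits that handling for brevity.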

#### Training the MLP Classifier

Next, we train the classifiers, saving the scores associated with each into a dictionary.

```python
from typing import Dict, List

import pandas as pd
from sklearn.neural_network import MLPClassifier

def train(nn_models: List[MLPClassifier], splits: Dict[str, Dict[str, pd.DataFrame]]):
    nnet_scores = {}
    for name, split in zip(["standard", "minmax"], [splits["standard"], splits["minmax"]]):
        print(f"{name.upper()}")
        nnet_scores[name] = []
        for m in nn_models:
            m.fit(split["train"]["X"], split["train"]["y"])
            score = m.score(split["test"]["X"], split["test"]["y"])
    
            m_name = type(m).__name__
            nnet_scores[name].append((m_name, f"{100*score:.2f}%"))
    return nnet_scores

def main():
    # ...SNIP...
    nn_models = [
        MLPClassifier(random_state=0),
    ]
    nnet_scores = train(nn_models, splits)
    print(nnet_scores)
```

#### Hyperparameter Tuning

Next, we perform a grid search to tune hyperparameters. This is straightforward on the `sklearn` side, but we'll see some incidental complexity here when we log metrics with MLflow.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier


def grid_search(splits: Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    mlp = MLPClassifier(
        batch_size=32,
        random_state=0,
    )
    parameters = {
        "hidden_layer_sizes": [(7,), (180,), (360,)],
        "activation": ["tanh", "relu", "logistic"],
        "solver": ["adam", "sgd"],
        "alpha": [1e-3, 1e-2],
        "epsilon": [1e-8, 1e-6, 1e-4],
        "learning_rate": ["adaptive", "constant"],
    }
    gs = GridSearchCV(
        mlp,
        parameters,
        cv=5,
        n_jobs=4,
    )
    gs.fit(splits[split_name]["train"]["X"], splits[split_name]["train"]["y"])
    return gs

def main():
    # ...SNIP...
    gs = grid_search(splits, "standard")
    print(gs.best_params_)
```

#### Interlude: Hyperparameter Tuning with Sweep Jobs

As it turns out, there's a whole lot more to say about hyperparameter tuning, specifically in the context of Azure.

Here, we're using `sklearn`'s `GridSearchCV`, because that's how the original model was implemented, and our task is to train and deploy it on Azure, as provided. Were we to have _begun_ this project on Azure, we would likely use a [SweepJob](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.sweepjob?view=azure-python) instead.

To set the stage, recall that hyperparameters are variables that determine _how_ the model is trained, but do _not_ act as training inputs to the model.

Examples include the number and size of hidden layers in a DNN; [alpha](https://scikit-learn.org/stable/auto_examples/neural_networks/plot_mlp_alpha.html), a [regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) factor intended to reduce generalization error and overfitting; the [learning rate](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/), the multiplicative factor used in concert with the gradient computed during SGD, which determines the magnitude of the adjustment made during each backpropagation pass; and the choice of [activation function](https://en.wikipedia.org/wiki/Activation_function), which, essentially, determines each individual neuron's firing threshold.

All of these entities are variables, whether numerical or categorical, and thus parameters of some sort. None are derived from the training data, however, and yet they influence _how_ the model generates its weights, which is what makes them _hyperparameters_.

Because hyperparameters can have a significant effect on the final performance of a model, tinkering with them is crucial. Often, this process is essentially brute-force: We define a **search space**, or set of possible values over which each hyperparameter we care about is allowed to range; train a model with every possible permutation of this search space; and select the one that performs best.
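To get a feel for how quickly this adds up, the sketch below enumerates the search space from the `parameters` dict passed to `GridSearchCV` earlier. The `search_space`, `candidates`, and `cv_folds` names are mine; the values are copied from that dict and from its `cv=5` setting.

```python
from itertools import product

# Same search space as the `parameters` dict passed to GridSearchCV earlier
search_space = {
    "hidden_layer_sizes": [(7,), (180,), (360,)],
    "activation": ["tanh", "relu", "logistic"],
    "solver": ["adam", "sgd"],
    "alpha": [1e-3, 1e-2],
    "epsilon": [1e-8, 1e-6, 1e-4],
    "learning_rate": ["adaptive", "constant"],
}

# Every candidate takes one value per hyperparameter: the Cartesian product
names = list(search_space)
candidates = [dict(zip(names, values)) for values in product(*search_space.values())]

cv_folds = 5
print(len(candidates))             # 216 candidate configurations
print(len(candidates) * cv_folds)  # 1080 model fits under 5-fold CV
```

Adding a single extra value to any one list multiplies the total accordingly, which is why exhaustive search stops being practical so quickly.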

Note that some hyperparameters are discrete, such as the choice of activation function. There are only a finite number of such functions<sup>1</sup>, and so we simply pass a list enumerating which options we wish to evaluate to define the search space. 

Others are continuous — meaning, for us, that they range over the real numbers — and are therefore intrinsically more difficult to deal with, because we can't simply enumerate an infinite number of options. In such cases, we define a **distribution** for the underlying hyperparameter space, as well as a **sampling method** defining how to draw values from it. We then allow the framework to compute the distribution and sample values based on the provided configuration.

---

1. _Strictly speaking, there are an infinite number of [such functions](https://en.wikipedia.org/wiki/Activation_function), but only a handful are of practical use. Pointing this out is arguably pedantry, but it's the kind of thing that becomes important when implementing your own DNN architectures from scratch, so it pays to be aware of._

This, of course, swiftly becomes profligate, easily requiring us to train dozens of models for a single search. In cases where the resource usage is prohibitive, one can use more sophisticated methods to explore the search space, such as [early termination to quit after further changes cease to lead to appreciable improvement](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#early-termination) and/or using domain knowledge to constrain the search to known reasonable values.

A much simpler approach is to simply randomize the search: Rather than trying _every_ possible permutation of the space, we train a _sample_ drawn from the distribution, and select the best.
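As a rough illustration, a randomized search over a discrete space amounts to sampling from the enumerated grid under a fixed budget. The reduced `search_space`, the `budget`, and the seed below are arbitrary choices for the sketch.

```python
import random
from itertools import product

search_space = {
    "activation": ["tanh", "relu", "logistic"],
    "solver": ["adam", "sgd"],
    "alpha": [1e-3, 1e-2],
}

names = list(search_space)
grid = [dict(zip(names, values)) for values in product(*search_space.values())]

# Draw a fixed budget of distinct candidates instead of training all of them
rng = random.Random(0)  # seeded so the draw is repeatable
budget = 4
sampled = rng.sample(grid, k=budget)

print(f"{len(sampled)} of {len(grid)} candidates selected")
```

On the `sklearn` side, `RandomizedSearchCV` plays this role, with `n_iter` acting as the budget.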

Our training case is simple enough that a randomized search doesn't make a significant difference (the grid search in our `train.py` only specifies the best option for each hyperparameter, anyway, specifically to save Compute; this is, after all, a case study, not a production deployment), but be aware of the possibility.

Essentially any hyperparameter tuning framework can handle the above concepts, but Azure specifically offers the `SweepJob` mentioned above to handle this task. Defining a `SweepJob` requires four (broad) steps:
- [Define the hyperparameter space](https://learn.microsoft.com/en-us/training/modules/perform-hyperparameter-tuning-azure-machine-learning-pipelines/2-define-search-space)
- [Configure a sampling method](https://learn.microsoft.com/en-us/training/modules/perform-hyperparameter-tuning-azure-machine-learning-pipelines/3-configure-sampling-method)
- [Configure early termination policies](https://learn.microsoft.com/en-us/training/modules/perform-hyperparameter-tuning-azure-machine-learning-pipelines/4-configure-early-termination)
- Run the training script

Using a sweep job for hyperparameter tuning is actually fairly simple; only the first three steps require new concepts.

##### Defining a Hyperparameter Space

Recall that hyperparameters will be either **discrete** or **continuous**.

In the case of discrete hyperparameters, we can either:
- Directly enumerate the values we wish to choose from; or
- Generate a discrete distribution, whose values we exhaust during the search

The first is the most intuitive option, and works well for hyperparameters with a small range of possible values, such as the list of possible activation functions. The latter is useful for numerical hyperparameters whose values might range over a series of well-defined, integral steps.

Continuous hyperparameters must be drawn from a distribution. Azure provides the following classes to define such distributions:
- [azure.ai.ml.sweep.Normal](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.normal?view=azure-python): Accepts `mu` and `sigma` parameters and generates a normal distribution about `mu` with standard deviation `sigma`.
- [azure.ai.ml.sweep.LogNormal](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.lognormal?view=azure-python): Generates values drawn from `exp(Normal(mu, sigma))`, such that the logarithm of the values sampled is normally distributed.
- [azure.ai.ml.sweep.Uniform](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.uniform?view=azure-python): Generates a uniform distribution between `min_value` and `max_value`.
- [azure.ai.ml.sweep.LogUniform](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.loguniform?view=azure-python): Generates values drawn from `exp(Uniform(min_value, max_value))`, such that the logarithm of the values sampled is uniformly distributed.
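The practical difference between the plain and `Log` variants is easiest to see by sampling. The sketch below uses only the standard library and is not how Azure implements these classes; the `sample_uniform` and `sample_loguniform` helpers are illustrative names of my own.

```python
import math
import random

rng = random.Random(0)


def sample_uniform(min_value, max_value):
    """Every value in [min_value, max_value] is equally likely."""
    return rng.uniform(min_value, max_value)


def sample_loguniform(min_value, max_value):
    """exp(Uniform(min_value, max_value)): uniform in log space."""
    return math.exp(rng.uniform(min_value, max_value))


# Bounds are given in log space: this covers roughly 1e-4 .. 1e-1
draws = [sample_loguniform(math.log(1e-4), math.log(1e-1)) for _ in range(1000)]
print(min(draws), max(draws))
```

Because the draw is uniform in log space, each decade between `1e-4` and `1e-1` receives about a third of the samples, which is usually what we want for scale-like hyperparameters such as `alpha`.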

Each of these has an analogous discrete class, prefixed with `Q`, which generates a quantized distribution with the same statistical characteristics. Each accepts an additional `q` argument, and returns values of the form `round(x / q) * q`, where `x` is drawn from the corresponding continuous distribution. They are:
- [azure.ai.ml.sweep.QNormal](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.qnormal?view=azure-python): Accepts `mu`, `sigma`, and `q` parameters and generates values like `round(Normal(mu, sigma) / q) * q`.
- [azure.ai.ml.sweep.QLogNormal](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.qlognormal?view=azure-python): Generates values like `round(exp(Normal(mu, sigma)) / q) * q`.
- [azure.ai.ml.sweep.QUniform](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.quniform?view=azure-python): Generates values like `round(Uniform(min_value, max_value) / q) * q`.
- [azure.ai.ml.sweep.QLogUniform](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.qloguniform?view=azure-python): Generates values like `round(exp(Uniform(min_value, max_value)) / q) * q`.
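As I understand the documentation, the `Q` classes quantize a continuous draw by rounding it to the nearest multiple of a step size `q`. A standard-library sketch of that idea follows; the `sample_quniform` and `sample_qnormal` helpers are hypothetical stand-ins, not the Azure implementation.

```python
import random

rng = random.Random(0)


def sample_quniform(min_value, max_value, q):
    """round(Uniform(min_value, max_value) / q) * q"""
    return round(rng.uniform(min_value, max_value) / q) * q


def sample_qnormal(mu, sigma, q):
    """round(Normal(mu, sigma) / q) * q"""
    return round(rng.gauss(mu, sigma) / q) * q


# e.g. batch sizes restricted to multiples of 16
batch_sizes = [sample_quniform(16, 128, q=16) for _ in range(10)]
print(batch_sizes)
```

Every value produced this way lands on the grid defined by `q`, which is what lets a nominally continuous distribution drive an integer-valued hyperparameter.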

Using these to define a search space is straightforward: Simply import the required class, as well as [azure.ai.ml.sweep.Choice](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.sweep.choice?view=azure-python), and call an existing `command` job (here, `job`) with kwargs that key each hyperparameter to tune to its search space.

```python
from azure.ai.ml.sweep import Choice, Normal

command_for_sweep = job(
    # Discrete by definition -- we can (or, should) only use powers of two
    batch_size=Choice(values=[16, 32, 64]),
    # Here, treated as continuous, but we COULD define this as a discrete set of specific values we wish to try -- see below
    learning_rate=Normal(mu=10, sigma=3),
    # Discrete by nature
    activation=Choice(values=["relu", "tanh", "logistic"]),
)
```

##### Configuring Sampling Method

Defining the search space provides distributions from which to draw, but doesn't define _how_ to perform the sampling. Azure provides [four sampling strategies](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#sampling-the-hyperparameter-space):
- [Random Sampling](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#random-sampling): Draw randomly from the hyperparameter space. Supports both continuous and discrete variables, as well as early termination policies.
- [Grid Sampling](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#grid-sampling): Performs an exhaustive search over the search space of a discrete variable. Also supports early termination policies.
- [Sobol](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#sobol): Essentially a specialized Random Sampling type, which supports configuration with a random seed and therefore produces reproducible results.
- [Bayesian](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#bayesian-sampling): Uses [Bayesian optimization](https://en.wikipedia.org/wiki/Bayesian_optimization) to choose samples based on the performance of previous samples. Somewhat more formally, it treats your model as a black-box function; generates a [prior distribution](https://en.wikipedia.org/wiki/Prior_probability) summarizing assumptions about its behavior; trains your model; and uses the observed performance as input to compute a [posterior distribution](https://en.wikipedia.org/wiki/Posterior_probability), which serves as the prior for the next run by informing the [acquisition function](https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html) used to select new samples. This is easily the most sophisticated method, but also the most expensive.
  - One other note, from the docs:
    > Bayesian sampling only supports choice, uniform, and quniform distributions over the search space.

We use the `command` job created above to generate a sweep job by invoking its `sweep` method, as illustrated below. Under the hood, this probably creates something like a test matrix, with each entry corresponding to the result of a trial run, with the bookkeeping managed by `azure.ai.ml`.

Specifying a sampling method is as simple as passing its name to the `sweep` method.

```python
from azure.ai.ml.sweep import Choice

# Same as above, but adding `sampling_method` kwarg
# `job` is the existing command job, called with the search space as kwargs
command_for_sweep = job(
    # Still treated as discrete...
    batch_size=Choice(values=[16, 32, 64]),
    # Here, treated as discrete, instead of continuous, allowing us to use grid search
    learning_rate=Choice(values=[1e-3, 1e-2, 1e-1, 1.]),
    activation=Choice(values=["relu", "tanh", "logistic"]),
)

sweep_job = command_for_sweep.sweep(
    sampling_method="grid",
    # other job params
)
```

##### Configuring Early Termination

At some point, especially during a grid search, it might end up being the case that varying hyperparameters ceases to lead to appreciable improvement in the model. In this case, it's best to stop tuning early, instead of wasting compute exhausting suboptimal solutions we'll throw out, anyway.

This is mostly a concern if your search space is large, which tends to happen if you have a large number of continuous search spaces to explore, which is often the case with Random and Bayesian sampling methods.

Configuring an early termination policy requires specifying two **evaluation parameters** that determine when and how often to evaluate models during the tuning process, effectively determining how likely the run is to stop after any given trial; and selecting a **selection policy**, which defines how much better a model must be than the previous ones to induce early stoppage.

The selection policies are:
- [Bandit](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#bandit-policy): Accepts a `slack_factor` _or_ `slack_amount`, used to define the "distance" in performance from previous models that induces early termination. `slack_factor` is relative, and configures the policy to terminate if subsequent models fail to perform within a certain percentage of the best found. `slack_amount` is absolute, and uses a raw number — not a percentage — to define the cutoff.
- [Median](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#median-stopping-policy): Median stopping policy accepts no special configuration arguments. Instead, it simply computes the median of the mean performance of models as trials proceed, stopping any jobs whose primary metric falls below this figure.
- [Truncation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?view=azureml-api-2#truncation-selection-policy): Truncation cancels a fixed share of the lowest-performing jobs at each evaluation interval. The share is configurable via the `truncation_percentage` argument. One can also set `exclude_finished_jobs`, which excludes jobs that have already finished from the truncation operation, regardless of their performance.
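
The stopping rules can be sketched in a few lines of plain Python. This is a toy model of my reading of the three policies for a "maximize" goal, not Azure's implementation; the function names, and the choice to round down in `truncation_victims`, are assumptions.

```python
from statistics import median

def bandit_cutoff(best_metric, slack_factor=None, slack_amount=None):
    """Lowest metric a trial may report without being stopped (goal: maximize)."""
    if slack_factor is not None:
        # Relative slack: the allowed distance scales with the best metric
        return best_metric / (1 + slack_factor)
    # Absolute slack: a fixed distance from the best metric
    return best_metric - slack_amount

def median_should_stop(trial_best, running_averages):
    """Stop a trial whose best metric is below the median of all trials' running averages."""
    return trial_best < median(running_averages)

def truncation_victims(metric_by_trial, truncation_percentage):
    """Names of the lowest-performing share of trials (rounding down is an assumption)."""
    n_cut = int(len(metric_by_trial) * truncation_percentage / 100)
    return sorted(metric_by_trial, key=metric_by_trial.get)[:n_cut]

print(bandit_cutoff(0.8, slack_factor=0.2))  # relative cutoff
print(bandit_cutoff(0.8, slack_amount=0.2))  # absolute cutoff
print(median_should_stop(0.55, [0.5, 0.7, 0.9]))
print(truncation_victims({"a": 0.9, "b": 0.4, "c": 0.6, "d": 0.7, "e": 0.5}, 40))
```

With a best metric of 0.8, a `slack_factor` of 0.2 tolerates anything above roughly 0.67, while a `slack_amount` of 0.2 tolerates anything above roughly 0.6.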

All of the above policies also accept evaluation parameters, discussed below.

The evaluation parameters are:
- `evaluation_interval`: Specifies how often the selection policy is applied. Each time the training script logs the primary metric counts as one interval, so a value of `1` applies the policy every time the metric is logged.
- `delay_evaluation`: Specifies how many intervals to wait before first applying the policy, enabling you to guarantee a minimum amount of training before early termination can take hold.
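
How the two parameters interact is easiest to see with a small sketch. It assumes, based on my reading of the Azure documentation, that each logging of the primary metric counts as one interval and that the policy is applied at every multiple of `evaluation_interval` that is at least `delay_evaluation`. The helper name is hypothetical.

```python
def policy_check_intervals(n_intervals, evaluation_interval=1, delay_evaluation=0):
    """1-based intervals at which the policy would be applied, under the stated assumption."""
    return [
        i
        for i in range(1, n_intervals + 1)
        if i % evaluation_interval == 0 and i >= delay_evaluation
    ]

print(policy_check_intervals(8, evaluation_interval=1, delay_evaluation=5))   # [5, 6, 7, 8]
print(policy_check_intervals(10, evaluation_interval=2, delay_evaluation=5))  # [6, 8, 10]
```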

Of course, an early termination policy is not _required_. If your training jobs are small enough, omitting this entirely is perfectly acceptable (but this seems unlikely to happen often in industrial practice).

For examples of how to configure these, refer to the [Azure Learn documentation on the matter](https://learn.microsoft.com/en-us/training/modules/perform-hyperparameter-tuning-azure-machine-learning-pipelines/4-configure-early-termination). Below is a sample adapted from that document, demonstrating how to configure a `MedianStoppingPolicy` for the sweep job shown above.

Note that we also round out the call to `sweep` by providing the following arguments:
- `primary_metric`: The metric to use for evaluating the model and early termination policy.
- `goal`: Whether to maximize or minimize the `primary_metric`.

```python
from azure.ai.ml.sweep import Choice, MedianStoppingPolicy, Normal

# Same search space as above; this time we also configure early termination
command_for_sweep = command(
    # Still treated as discrete...
    batch_size=Choice(values=[16, 32, 64]),
    # Here, treated as discrete, instead of continuous, allowing us to use grid search
    learning_rate=Choice(values=[1e-3, 1e-2, 1e-1, 1.]),
    activation=Choice(values=["relu", "tanh", "logistic"]),
)

sweep_job = command_for_sweep.sweep(
    sampling_method="grid",
    early_termination=MedianStoppingPolicy(
        delay_evaluation=5,
        evaluation_interval=1,
    ),
    # Assuming we're training a classifier
    primary_metric="Precision",
    goal="Maximize",
)
```

To actually run a sweep job, a few things need to happen:
- Define a base job to run your training script
- Override `inputs` with your hyperparameter space
- Generate and execute the `SweepJob`

These steps are documented in the [Azure Learn lesson on hyperparameter tuning](https://learn.microsoft.com/en-us/training/modules/perform-hyperparameter-tuning-azure-machine-learning-pipelines/5-use-sweep-job-hyperparameter-tuning), and illustrated below. The only additions relative to the above cell are a **base job** and some extra configuration of the sweep job, viz., trial limits.

```python
from azure.ai.ml.sweep import Choice, MedianStoppingPolicy, Normal

# Define base job. Note that `inputs` JUST defines the namespace of our hyperparameter
# search space; the corresponding values will be overridden below
base_job = command(
    code="./src",
    command="python train.py --learning_rate ${{inputs.learning_rate}} --batch-size ${{inputs.batch_size}} --activation ${{inputs.activation}}",
    inputs={
        "learning_rate": 0.01,
        "batch_size": 16,
        "activation": "relu",
    },
    environment=f"{env_asr_result.name}:{env_asr_result.version}",
    compute=compute_targets["computeinstance"],
)

# Overrides inputs with hyperparameter space
command_for_sweep = base_job(
    batch_size=Choice(values=[16, 32, 64]),
    learning_rate=Choice(values=[1e-3, 1e-2, 1e-1, 1.]),
    activation=Choice(values=["relu", "tanh", "logistic"]),
)

sweep_job = command_for_sweep.sweep(
    sampling_method="grid",
    early_termination=MedianStoppingPolicy(
        delay_evaluation=5,
        evaluation_interval=1,
    ),
    primary_metric="Precision",
    goal="Maximize",
)

# Configure limits and run
sweep_job.set_limits(
    max_total_trials=12,
    max_concurrent_trials=4,
    timeout=7200,
)

returned_job = ml_client.create_or_update(sweep_job)
```

As much as I'd love to demonstrate the use of a sweep job for this purpose, it would require a significant rework of the existing solution, with little added value.

In your own work, it's often good to prefer Sweep Jobs to, e.g., `GridSearchCV`, as doing so keeps everything "local" to the Azure ecosystem.

#### Train Final Model

Turning our attention back to what we actually _did_ do...We now take the `best_params_` from the grid search, and use them to train and evaluate a final model.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

def score(mlp_final: MLPClassifier, splits:Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    y_pred = mlp_final.predict(splits[split_name]["test"]["X"])
    return {
        "accuracy": accuracy_score(splits[split_name]["test"]["y"], y_pred),
        "f1": f1_score(splits[split_name]["test"]["y"], y_pred, average="macro"),
        "precision": precision_score(splits[split_name]["test"]["y"], y_pred, average="macro"),
        "recall": recall_score(splits[split_name]["test"]["y"], y_pred, average="macro"),
    }

def main():
    # ...SNIP...
    # Retrain and evaluate final model
    mlp_final = MLPClassifier(
        **gs.best_params_,
        # Adam constants -- defaults suggested in paper
        beta_1=0.9,
        beta_2=0.999,
        batch_size=32,
        max_iter=200,
        random_state=0,
    )
    mlp_final.fit(splits["standard"]["train"]["X"], splits["standard"]["train"]["y"])
    mlp_final_scores = score(mlp_final, splits, "standard")
```

#### StratifiedKFold

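Next, we run `StratifiedKFold` cross-validation on the training split to build trust in our model: if its score holds up across folds, the test result is less likely to be the product of one lucky split.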

```python
def stratifiedkfold(mlp_final: MLPClassifier, target: pd.core.series.Series, splits: Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    skfold = StratifiedKFold(
        n_splits=min(target.value_counts().values), # 46
        shuffle=True,
        random_state=0,
    )
    kfold_scores = []
    for train_indices, test_indices in skfold.split(splits[split_name]["train"]["X"], splits[split_name]["train"]["y"]):
        mlp_final.fit(
            splits[split_name]["train"]["X"][train_indices],
            splits[split_name]["train"]["y"][train_indices],
        )
        kfold_scores.append(
            mlp_final.score(
                splits[split_name]["train"]["X"][test_indices],
                splits[split_name]["train"]["y"][test_indices],
            )
        )
    return kfold_scores

def main():
    # ...SNIP...
    kfold_scores = stratifiedkfold(mlp_final, y, splits, "standard")
    kfold_stats = {
        "kfold_mean": np.mean(kfold_scores),
        "kfold_std": np.std(kfold_scores),
    }
```
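
The choice of `n_splits` above is tied to the smallest class: with more folds than samples in that class, some folds would contain no examples of it at all. The toy sketch below shows the idea behind stratification by dealing each class's samples out to the folds in turn. It is not scikit-learn's algorithm (which also shuffles), and the labels are hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing out each class round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Hypothetical toy labels: 6 "anger" samples and 3 "joy" samples
labels = ["anger"] * 6 + ["joy"] * 3
for fold in stratified_folds(labels, n_splits=3):
    # Every fold keeps the overall 2:1 class ratio
    print(Counter(labels[i] for i in fold))
```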

#### Learning & Scaling Curves

Let's visualize learning and scaling curves...

```python
def generate_learning_scaling_curves(
    target: pd.core.series.Series, 
    mlp_final: MLPClassifier, 
    splits:Dict[str, Dict[str, pd.DataFrame]], 
    split_name: str = "standard"
):
    params = {
        "X": splits[split_name]["train"]["X"],
        "y": splits[split_name]["train"]["y"],
        "train_sizes": np.linspace(0.1, 1.0, 5),
        "cv": min(target.value_counts().values), # 46
        "n_jobs": 4,
        "return_times": True,
    }
    train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
        mlp_final,
        shuffle=True,
        random_state=0,
        **params,
    )
    # Generate Plot
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(24, 4), sharey=False, sharex=True)

    ax[0].plot(train_sizes, fit_times.mean(axis=1), 'o-')
    ax[0].fill_between(
            train_sizes,
            fit_times.mean(axis=1) - fit_times.std(axis=1),
            fit_times.mean(axis=1) + fit_times.std(axis=1),
            alpha=0.3,
        )
    ax[0].set_ylabel('Fit time (s)')
    ax[0].set_xlabel('# Training Samples')
    ax[0].set_title(
        f"Scalability"
    )
    
    ax[1].plot(train_sizes, train_scores.mean(axis=1), 'o-', color='r', label='Training Score')
    ax[1].plot(train_sizes, test_scores.mean(axis=1), 'o-', color='g', label='CV Score')
    ax[1].fill_between(
        train_sizes,
        train_scores.mean(axis=1) - train_scores.std(axis=1),
        train_scores.mean(axis=1) + train_scores.std(axis=1),
        color='r',
        alpha=0.3,
    )
    ax[1].fill_between(
        train_sizes,
        test_scores.mean(axis=1) - test_scores.std(axis=1),
        test_scores.mean(axis=1) + test_scores.std(axis=1),
        color='g',
        alpha=0.3,
    )
    ax[1].set_ylabel('Score')
    ax[1].set_xlabel('# Training Samples')
    ax[1].set_title(
        f"Learning Curve"
    )
    return fig

def main():
    # ...SNIP...
    fig2 = generate_learning_scaling_curves(y, mlp_final, splits, "standard")
```

#### Confusion Matrices

...And confusion matrices.

```python
import seaborn as sn

def generate_confusion_matrices(test_predictions, test_groundtruth):
    cmat = confusion_matrix(test_groundtruth, test_predictions)
    cmat_norm = confusion_matrix(test_groundtruth, test_predictions, normalize='true')

    df_cmat = pd.DataFrame(cmat, index=list(EMOTION_MAP.keys()), columns=list(EMOTION_MAP.keys()))
    df_cmat_norm = pd.DataFrame(cmat_norm, index=list(EMOTION_MAP.keys()), columns=list(EMOTION_MAP.keys()))

    return df_cmat, df_cmat_norm

def generate_confusion_plots(df_cmat: pd.DataFrame, df_cmat_norm: pd.DataFrame):
    # Keep a handle on the figure so it can be returned and logged later
    fig = plt.figure(figsize=(16, 6))

    plt.subplot(1, 2, 1)
    plt.title('Confusion Matrix')
    sn.heatmap(df_cmat, annot=True, annot_kws={'size': 18})
    
    plt.subplot(1, 2, 2)
    plt.title('Normalized Confusion Matrix')
    sn.heatmap(df_cmat_norm, annot=True, annot_kws={'size': 18})
    return fig

def main():
    # ...SNIP...
    y_test_pred = mlp_final.predict(splits["standard"]["test"]["X"])
    # Unpack the (raw, normalized) pair returned by `generate_confusion_matrices`
    fig3 = generate_confusion_plots(*generate_confusion_matrices(y_test_pred, splits["standard"]["test"]["y"]))
```

#### Interlude: Registering the Model & Logging with MLflow

This completes the training exercise; now, we must log our parameters, metrics, and model with MLflow, and then [register](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-models?view=azureml-api-2&tabs=cli) the result.

In addition to visualizations and metrics, MLflow provides facilities to log an entire model, persisting it alongside an `MLmodel` file in your configured artifacts directory. You can achieve this automatically with [autologging](https://mlflow.org/docs/latest/tracking/autolog.html), but since we like to get our hands dirty...We won't.

Instead, we'll invoke the `mlflow.sklearn.log_model` method directly, passing it three arguments:
- The model to persist;
- The directory to which to save the artifacts; and
- A **signature**, which we'll discuss next.

---

_This is a good time to point out that `mlflow` has a different `log_model` method for each of the different popular frameworks available. If you wish to persist a custom model built using an unsupported or proprietary library, Azure ML provides a [custom model](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-models?view=azureml-api-2&tabs=cli#register-a-model-by-using-the-azure-cli-or-python-sdk) option._

The MLflow model [signature](https://mlflow.org/docs/latest/model/signatures.html#id1) defines the expected inputs and outputs for a model. It essentially provides a schema defining the "shape" of an acceptable input, encapsulating it as an instance of [mlflow.models.signature.ModelSignature](https://mlflow.org/docs/latest/_modules/mlflow/models/signature.html).
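
To make the "schema" idea concrete, here is a toy stand-in written in plain Python. It is not how MLflow implements signatures, and the feature names are hypothetical; it only illustrates the kind of check a signature makes possible before a request ever reaches the model.

```python
# Toy stand-in for a model signature: expected input names mapped to Python types.
# The feature names are hypothetical, not the notebook's real feature set.
EXPECTED_INPUTS = {"mfcc_mean": float, "zcr_mean": float}

def validate(record: dict) -> list:
    """Return a list of schema violations (empty if the record conforms)."""
    problems = []
    for name, expected_type in EXPECTED_INPUTS.items():
        if name not in record:
            problems.append(f"missing input: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems

print(validate({"mfcc_mean": 0.12, "zcr_mean": 0.03}))  # []
print(validate({"mfcc_mean": "loud"}))
```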

---

_Model signatures are a crucial topic for production-grade deployment, so [reading the full documentation is worthwhile](https://mlflow.org/docs/latest/model/signatures.html#id1). Our use case here is sufficiently simple as to obviate the need for most of the machinery detailed in that document, but it's likely to arise during your own professional practice._

You _can_ define a signature by hand, but `mlflow` provides the [mlflow.models.signature.infer_signature](https://mlflow.org/docs/latest/python_api/mlflow.models.html#mlflow.models.infer_signature) method to do this for us, given a sample input and the corresponding prediction. It looks like this (snippet adapted from the [Azure Learn tutorial on Logging an MLflow Model](https://learn.microsoft.com/en-us/training/modules/register-mlflow-model-azure-machine-learning/2-log-mlflow-model)).

```python
import mlflow
from mlflow.models.signature import infer_signature

# Assuming we've trained our `mlp_final` classifier...
# `infer_signature` expects sample *inputs* and the matching predictions, not the model itself
X_train = splits["standard"]["train"]["X"]
signature = infer_signature(X_train, mlp_final.predict(X_train))

# Log the model with MLflow
mlflow.sklearn.log_model(mlp_final, "artifacts_rf", signature=signature)
```

This persists the model to the `"artifacts_rf"` directory, as specified, along with an [MLmodel file](https://mlflow.org/docs/latest/models.html), a YAML file that acts as the single source of truth for how to load and use the model.

I won't talk extensively about the format of this file, but do review the documentation. We'll take a closer look after we log our model and generate an MLmodel to inspect.

Recall that invoking `mlflow.sklearn.log_model` causes MLflow to save both an MLmodel file _and_ the model itself to the specified artifacts directory. We can use this fact to register the model artifact as a formal Azure ML model, allowing you to store and version models in your workspace, thereby enabling all the fancy [MLOps madness to which we aspire](https://learn.microsoft.com/en-us/azure/machine-learning/concept-model-management-and-deployment?view=azureml-api-2).

We'll see a "live" example of this below, but the [Azure Learn documentation on the matter](https://learn.microsoft.com/en-us/training/modules/register-mlflow-model-azure-machine-learning/4-register-model) provides the following example.

```python
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Assuming we have a `returned_job` after running our training script...
job_name = returned_job.name

# Create a new `Model` instance by grabbing the `model` artifact associated with the `returned_job` 
run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/model/",
    # Define a model name; this can be arbitrary
    name=model_name,
    description="Model created from run to do <insert incredible thing here>.",
    type=AssetTypes.MLFLOW_MODEL,
)

# Perform create to actually register the model with Azure Workspace
ml_client.models.create_or_update(run_model)
```
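
One detail worth making explicit: as far as I can tell, the trailing segment of that `path` has to match the artifact path the training script passed to `log_model`. A small helper makes the relationship visible. The helper is hypothetical (not part of the Azure SDK), and the job name is a placeholder.

```python
def job_artifact_uri(job_name: str, artifact_path: str) -> str:
    """Build the `azureml://` URI pointing at an artifact folder produced by a job."""
    return f"azureml://jobs/{job_name}/outputs/artifacts/paths/{artifact_path.strip('/')}/"

# "my-training-job" is a placeholder; "model" is the artifact path used in the example above
print(job_artifact_uri("my-training-job", "model"))
# azureml://jobs/my-training-job/outputs/artifacts/paths/model/
```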

#### Tracking with MLflow

Finally, we track as much as we can with MLflow. 

I won't talk too much about how to select what to persist. Here, I'm keeping it relatively simple, but you _could_ log out [the results of _every_ GridSearch (or SweepJob) trial](https://gist.github.com/navarmn/cd9dd63029618a9d52098dfaad39d9a6), for example. 

Recall, as well, that you can simply use [autologging](https://mlflow.org/docs/latest/tracking/autolog.html) to avoid doing this by hand. In general, I think it's better to rely on such automations only after understanding what they automate — so, after reading this, you're officially cleared to take the easy way out next time. Congratulations.

```python
with mlflow.start_run() as run:
    # Log visualizations
    mlflow.log_figure(fig1, "emotion_distribution.png")
    mlflow.log_figure(fig2, "learning_scaling_curves.png")
    mlflow.log_figure(fig3, "confusion.png")
    # Log GridSearchCV parameters and metrics
    mlflow.log_metric("best_cross_val_score", gs.best_score_)
    mlflow.log_metric("num_params_tested", len(gs.cv_results_["params"]))
    mlflow.log_metric("mean_fit_time", gs.cv_results_["mean_fit_time"][gs.best_index_])
    for param, value in gs.best_params_.items():
        mlflow.log_param(f"grid_search_{param}", value)
    # Log model metrics
    for metric_name, metric_value in mlp_final_scores.items():
        mlflow.log_metric(metric_name, metric_value)
    # Log kfold params and metrics
    mlflow.log_param("kfold_n_splits", min(y.value_counts().values))
    for name, value in kfold_stats.items():
        mlflow.log_metric(name, value)
    # Log model
    mlflow.sklearn.log_model(
        mlp_final,
        artifact_path,
        registered_model_name=args.model_name,
        # Infer the signature from sample inputs and the matching predictions
        signature=infer_signature(
            splits["standard"]["test"]["X"],
            mlp_final.predict(splits["standard"]["test"]["X"]),
        ),
        # `conda_env` omitted; MLflow infers requirements (see `mlflow.models.infer_pip_requirements`)
    )
```

#### Running the Job

Now that we have everything we need, let's run it.

In [1359]:
experiment_name = "ex-emodb"

In [1369]:
model_name = "emodb-ser-classifier"

In [1374]:
create_train_job = command(
    code="./src",
    command=f"python train.py --parquet-file ${{inputs.parquet_file}} -i {sp['appId']} -c {sp['password']} -t {sp['tenant']} -x {experiment_name} -w {os.environ['WORKSPACE_NAME']} -s {os.environ['SUBSCRIPTION_ID']} -g {os.environ['RESOURCE_GROUP_NAME']} -n {model_name}",
    environment=f"{env_asr_result.name}:{env_asr_result.version}",
    compute=compute_targets["computeinstance"],
    display_name="train.py-test",
    experiment_name=experiment_name,
    identity=azure.ai.ml.entities.UserIdentityConfiguration(),
    inputs={
        "parquet_file": Input(
            type=AssetTypes.URI_FILE,
            path=parquet_asset_versions[-1].path,
            mode=InputOutputModes.DOWNLOAD,
        ),
    },
)

As usual:

In [1375]:
start_compute_instance()

Already running!


In [1376]:
returned_job = ml_client.jobs.create_or_update(create_train_job)

Uploading src (0.06 MBs): 100%|████████| 57518/57518 [00:02<00:00, 23781.76it/s]


Use of {} for parameters is deprecated, instead use ${{}}.
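
A likely cause of the deprecation warning above: the `command` string is an f-string, and inside an f-string `{{` and `}}` are escapes for single literal braces. The placeholder therefore reaches Azure ML as `${inputs.parquet_file}`, not `${{inputs.parquet_file}}`. The sketch below shows the difference in plain Python; I haven't re-run the job with the doubled braces, so treat the fix as an assumption.

```python
model_name = "emodb-ser-classifier"

# Inside an f-string, `{{` and `}}` are escapes for single literal braces...
single = f"python train.py --parquet-file ${{inputs.parquet_file}} -n {model_name}"
print(single)
# python train.py --parquet-file ${inputs.parquet_file} -n emodb-ser-classifier

# ...so four braces are needed to emit the literal `${{...}}` placeholder
double = f"python train.py --parquet-file ${{{{inputs.parquet_file}}}} -n {model_name}"
print(double)
# python train.py --parquet-file ${{inputs.parquet_file}} -n emodb-ser-classifier
```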


In [1377]:
webbrowser.open(returned_job.studio_url)

True

### Model Registration

The run succeeded — **truly fucking lit**. We're one step away from deploying this thing, that step simply being model registration.

In [1385]:
from azure.ai.ml.entities import Model

# Create a new `Model` instance by grabbing the `model` artifact associated with the `returned_job` 
emodb_model = Model(
    path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/rt-emodb/",
    # Define a model name; this can be arbitrary
    name=model_name,
    description="German language SER sklearn.neural_network.MLPClassifier for the EmoDB dataset.",
    type=AssetTypes.MLFLOW_MODEL,
)

In [1446]:
# Perform create to actually register the model with Azure Workspace
ml_client.models.create_or_update(emodb_model)

Model({'job_name': None, 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': True, 'auto_delete_setting': None, 'name': 'emodb-ser-classifier', 'description': None, 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/ml-myfirstworkspace/models/emodb-ser-classifier', 'Resource__source_path': '', 'base_path': '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x154e58a00>, 'serialize': <msrest.serialization.Serializer object at 0x154e5ba00>, 'version': None, 'latest_version': '2', 'path': None, 'datastore': None, 'utc_time_created': None, 'flavors': None, 'arm_type': 'model_version', 'type': 'custom_model', 'stage': None})

In [1449]:
model = list(ml_client.models.list())[-1]
model

Model({'job_name': None, 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': True, 'auto_delete_setting': None, 'name': 'emodb-ser-classifier', 'description': None, 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/resourceGroups/rg-myfirstworkspace/providers/Microsoft.MachineLearningServices/workspaces/ml-myfirstworkspace/models/emodb-ser-classifier', 'Resource__source_path': '', 'base_path': '/Users/peleke/Documents/Projects/COSerious/AzureCertification/SER-on-Azure', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x15519c760>, 'serialize': <msrest.serialization.Serializer object at 0x15519c970>, 'version': None, 'latest_version': '2', 'path': None, 'datastore': None, 'utc_time_created': None, 'flavors': None, 'arm_type': 'model_version', 'type': 'custom_model', 'stage': None})

...And, believe it or not, we're finally done with training: The model is trained, registered, and ready to deploy.

<div style="text-align: center;">
    <img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExbGpodHhjcHowOHhteDliaThnNGx0aThpYXdmcGE2aHcwamE5c2NpbCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9dg/G96zgIcQn1L2xpmdxi/giphy.gif" />
</div>

### Deploying the Model

Now that the model is ready, all that's left is to deploy it.

The two deployment options are **real-time** and **batch**:
- **Real-time**, also known as **online**, deployments receive input vectors at an HTTPS endpoint, which invokes the model to generate a prediction that is returned to the client. As the name suggests, these are used for instantaneous, "single-use" predictions, as when a user requests movie recommendations while using an application.
- **Batch**, or **offline**, deployments receive a large(r) number of inputs, and are optimized to generate predictions for all at once.

There are two primary options when dealing with online endpoints:
- **Managed Endpoints**: Azure creates and manages all underlying infrastructure required to expose the endpoint. This is by far the best option for models that don't yet justify significant financial or management overhead, and it is dead simple to set up, even for Data Scientists with minimal MLOps and SWE background.
- **Unmanaged Kubernetes**: At higher scale, or in situations that require greater flexibility, users can manage their own Kubernetes cluster, allowing for the use of tools like [kubeflow](https://www.kubeflow.org/) and enabling significantly greater control over the deployment.

If you're running on unmanaged Kubernetes, you probably wouldn't be reading this, so we'll focus on the Managed Endpoint option.

#### Managed Endpoints

Creating a Managed Endpoint requires two broad steps:
- Create an Endpoint
- Deploy a Model to the Endpoint

These two steps are separate because an Endpoint can host _multiple_ models. Further, the Endpoint and its underlying Model are distinct entities: The _model_ is the...Well, model, used to perform inference. The _Endpoint_ is, strictly speaking, the infrastructure that receives user requests; uses the model to generate a prediction; and returns the result to the client.

#### Interlude: Deployment Strategies

In our case, we'll deploy the single model we trained. However, an important implication of the above is that an Endpoint, being an infrastructural concept, can host _multiple_ models. This enables us to easily implement the notion of a [blue/green deployment, also known as a safe rollout](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-safely-rollout-online-endpoints?view=azureml-api-2&tabs=azure-cli).

The notion of a blue/green deployment is based on the idea of incorporating some element of trust into our deployment. Suppose we use the classifier we've just trained for a few months, and it performs well. At that point, we're likely to trust that this version of the model works well.

Suppose, further, that we receive a bunch of new user data that allows us to retrain the model. We would then have _two_ models: One that we trust, which has worked for a while; and one that we _hope_ will behave better, but which remains as-yet unproven.

An elegant solution would be to serve _most_ traffic to the model we trust — say, 90% — and send the remaining 10% to the new one. We spend some time monitoring the new model's performance, and, as our trust grows, allocate an increasing proportion of traffic to the new model, until it handles _all_ traffic.

This is, in essence, the definition of a blue/green deployment.

Because Managed Endpoints can manage multiple models, [implementing blue/green deployments](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-safely-rollout-online-endpoints?view=azureml-api-2&tabs=azure-cli) is fairly straightforward, boiling down to little more than a configuration exercise.
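
The traffic-splitting idea can be simulated in a few lines of plain Python. This is a toy router, not how Azure implements it; the function names are hypothetical, and only the `{deployment_name: percentage}` shape of the traffic dictionary is borrowed from the endpoint configuration.

```python
import random
from collections import Counter

def route(traffic, rng):
    """Pick a deployment for one request, weighted by its traffic percentage."""
    names = list(traffic)
    return rng.choices(names, weights=[traffic[name] for name in names], k=1)[0]

def shift_traffic(traffic, old, new, step):
    """Move up to `step` percentage points of traffic from `old` to `new`."""
    moved = min(step, traffic[old])
    return {**traffic, old: traffic[old] - moved, new: traffic.get(new, 0) + moved}

traffic = {"blue": 90, "green": 10}
rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(route(traffic, rng) for _ in range(10_000))
print(counts)  # roughly a 90/10 split

# As trust in "green" grows, shift more traffic its way
print(shift_traffic(traffic, "blue", "green", 40))  # {'blue': 50, 'green': 50}
```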

---

_Contrast this with [implementing the same concept on a Kubernetes instance](https://blog.devops.dev/machine-learning-deployment-strategies-in-kubernetes-canary-blue-green-and-a-b-testing-3203c6895450), where it's...Anything but simple._

#### Deploying a Managed Endpoint

We have no need for a staged rollout here, so we'll create a single deployment that handles 100% of the Endpoint's traffic.

First, we create the Endpoint itself.

Note that you might get an error like the below:

```
HttpResponseError: (SubscriptionNotRegistered) Resource provider [N/A] isn't registered with Subscription [N/A]. Please see troubleshooting guide, available here: https://aka.ms/register-resource-provider
Code: SubscriptionNotRegistered
Message: Resource provider [N/A] isn't registered with Subscription [N/A]. Please see troubleshooting guide, available here: https://aka.ms/register-resource-provider
```

If you do, it's because you need to "activate" — i.e., register — a few services in order to be able to create the Endpoint. This should only happen the first time you do this, so it's a one-off task for each new subscription.

The solution is below:

In [1462]:
# Register required services
!az provider register --namespace Microsoft.MachineLearning
!az provider register --namespace Microsoft.ContainerRegistry
!az provider register --namespace Microsoft.Web
!az provider register --namespace Microsoft.Compute
!az provider register --namespace Microsoft.Network
!az provider register --namespace Microsoft.Storage
!az provider register --namespace Microsoft.Subscription

Registering is still on-going. You can monitor using 'az provider show -n Microsoft.Network'


In [1463]:
# Verify registrations
!az provider show --namespace Microsoft.MachineLearning
!az provider show --namespace Microsoft.ContainerRegistry
!az provider show --namespace Microsoft.Web
!az provider show --namespace Microsoft.Compute
!az provider show --namespace Microsoft.Network
!az provider show --namespace Microsoft.Storage
!az provider show --namespace Microsoft.Subscription

{
  "authorizations": [
    {
      "applicationId": "c1652d7f-7767-4507-9f5f-9a97f38585d2",
      "roleDefinitionId": "1cc297bc-1829-4524-941f-966373421033"
    }
  ],
  "id": "/subscriptions/3edf056b-b303-4dd1-9447-a08c6901065c/providers/Microsoft.MachineLearning",
  "namespace": "Microsoft.MachineLearning",
  "providerAuthorizationConsentState": null,
  "registrationPolicy": "RegistrationRequired",
  "registrationState": "Registered",
  "resourceTypes": [
    {
      "aliases": null,
      "apiProfiles": null,
      "apiVersions": [
        "2016-04-01"
      ],
      "capabilities": "CrossResourceGroupResourceMove, CrossSubscriptionResourceMove, SystemAssignedResourceIdentity, SupportsTags, SupportsLocation",
      "defaultApiVersion": null,
      "locationMappings": null,
      "locations": [
        "South Central US",
        "West Europe",
        "Southeast Asia",
        "Japan East",
        "West Central US"
      ],
      "properties": null,
      "resourceType": "Workspac

No errors probably means success, so...Let's proceed.

In [1409]:
import datetime

In [1410]:
from azure.ai.ml.entities import ManagedOnlineEndpoint

In [1440]:
endpoint_name = f"endpoint-emodb-{datetime.datetime.now().strftime('%m%d%H%M%f')}"

In [None]:
emodb_endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="Online endpoint exposing EmoDB SER classifier.",
    auth_mode="key",
)

In [None]:
endpoint_poller = ml_client.begin_create_or_update(emodb_endpoint)
endpoint = endpoint_poller.result()

Next, we deploy the model itself via a [ManagedOnlineDeployment](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.entities.managedonlinedeployment?view=azure-python). Because we're using an MLflow model, this is reasonably straightforward: Azure ML can take advantage of the MLmodel file and other such machinery at hand.

Here, we'll configure the deployment to use the `model` we created earlier.

In [1458]:
from azure.ai.ml.entities import ManagedOnlineDeployment

In [1460]:
deployment_name = "deployment-blue-emodb"

In [None]:
blue_deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    instance_type="Standard_F4s_v2",
    instance_count=1,
)

In [None]:
blue_deployment_poller = ml_client.online_deployments.begin_create_or_update(blue_deployment)

In [None]:
blue_deployment_result = blue_deployment_poller.result()
blue_deployment_result

We can now update the _endpoint_ to route 100% of its traffic to this specific deployment. This has to happen after the fact, because the `traffic` configuration option only makes sense if there exist deployments for it to refer to.

In [None]:
emodb_endpoint.traffic = {
    deployment_name: 100,
}

In [None]:
endpoint_poller = ml_client.begin_create_or_update(emodb_endpoint)
endpoint = endpoint_poller.result()

...And, we're done: The endpoint is ready for testing.

If this seemed easy, it's because it was. It's a [little more complicated if you need to deploy a model _not_ tracked by MLflow](https://learn.microsoft.com/en-us/training/modules/deploy-model-managed-online-endpoint/4-eploy-custom-model-managed-online-endpoint).

The main difference is, when deploying a model "by hand", you have to provide certain assets that MLflow would typically provide for you. Specifically, we must specify:
- **Model assets**, which might be `pkl` files or `joblib` exports stored at a local path, or registered custom models;
- **Compute Configuration**, specifically, the `instance_type`, `instance_count`, and any important `scale_settings` for production deployments;
- **Scoring Script**, which defines how to generate predictions from the input data; and
- **Environment**, which simply refers to the list of packages required to run the model. In our case, we could simply use the `env_asr` environment we've been using all along.

When deploying an MLflow model, the last two are provided by MLflow itself. The environment is trivial to specify: collect your dependencies into a `conda-env.yaml` and build a custom environment, as we've done with `env_asr`. The scoring script is more specific to your model; the example from [the tutorial documentation](https://learn.microsoft.com/en-us/training/modules/deploy-model-managed-online-endpoint/4-eploy-custom-model-managed-online-endpoint) is provided below.

```python
# from: <https://tinyurl.com/292bm8sk>
import json
import joblib
import numpy as np
import os

# called when the deployment is created or updated
def init():
    global model
    # get the path to the registered model file and load it
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.pkl')
    model = joblib.load(model_path)

# called when a request is received
def run(raw_data):
    # get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # get a prediction from the model
    predictions = model.predict(data)
    # return the predictions as any JSON serializable format
    return predictions.tolist()
```
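Since the scoring script is just two functions, its `run` contract can be smoke-tested locally before deploying anything. The sketch below is my own, not from the docs: it swaps the real estimator for a hypothetical `StubModel` that always predicts `wut`, and assumes a payload of 180-feature rows (the width produced by the script in Appendix I). In the real script, `init()` would set `model` via `joblib.load`.

```python
import json

import numpy as np


class StubModel:
    """Hypothetical stand-in for the real estimator: predicts 'wut' for every row."""

    def predict(self, data):
        return np.array(["wut"] * len(data))


# In the deployed script, `init()` assigns this global from the registered model file.
model = StubModel()


def run(raw_data):
    # Same body as the documented scoring script: JSON in, JSON-serializable list out.
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return predictions.tolist()


# Two fake samples, each with 180 features, wrapped in the "data" key that `run` expects.
payload = json.dumps({"data": [[0.0] * 180, [1.0] * 180]})
result = run(payload)
print(result)  # ['wut', 'wut']
```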

Deployment then looks as follows, again quoting the docs.

```python
# from: <https://tinyurl.com/292bm8sk>
from azure.ai.ml.entities import CodeConfiguration, ManagedOnlineDeployment, Model

# local folder containing the model file(s) that the scoring script loads
model = Model(path="./model")

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="endpoint-example",
    model=model,
    environment="deployment-environment",
    code_configuration=CodeConfiguration(
        code="./src", scoring_script="score.py"
    ),
    instance_type="Standard_DS2_v2",
    instance_count=1,
)

ml_client.online_deployments.begin_create_or_update(blue_deployment).result()
```

#### Consuming the Endpoint

We can easily test the endpoint by sending in a sample input, which I've ripped out of the training data for simplicity's sake.
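For reference, here is a rough sketch of how such a request file might be put together. It assumes the endpoint accepts the `input_data` envelope that Azure ML uses for MLflow deployments, in its plain list-of-rows (tensor) form, which should match a signature inferred from NumPy arrays as in Appendix III; a column-based variant with `columns`, `index`, and `data` keys also exists. The `build_request` helper is hypothetical, and the all-zero row is a placeholder for a real (scaled) feature vector, not actual data.

```python
import json
import tempfile
from pathlib import Path

N_FEATURES = 180  # 12 chroma + 128 mel + 40 MFCC, per Appendix I


def build_request(rows):
    """Wrap feature rows in the `input_data` envelope (assumed request format)."""
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(f"row {i} has {len(row)} features, expected {N_FEATURES}")
    return {"input_data": [list(map(float, row)) for row in rows]}


# Placeholder sample; in practice this would be one row of the training features.
request = build_request([[0.0] * N_FEATURES])

# Written to the temp directory here; the notebook reads it from the working directory.
request_path = Path(tempfile.gettempdir()) / "endpoint-test-input.json"
request_path.write_text(json.dumps(request))
```

Validating the row width up front is cheap, and gives a clearer error than whatever the endpoint would return for a malformed request.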

In [None]:
response = ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=deployment_name,
    request_file="endpoint-test-input.json",
)

In [None]:
# should be `wut`
response

**Hallelujah** — that was _quite_ the ride, but, **it works**.

### Summary

It's time to take a break — that was a _lot_ of work, but it left us with:
- A Feature Store
- Scripts with which we can generate new Feature Set Specifications
- A live Feature Set, ready for use in training (_assuming we have a Spark Context_)
- A model, tracked in MLflow, along with visualizations and metrics
- A real-time endpoint exposing the model

Thus far, we've done a lot of prototyping in the Notebook. This has been fine, but now that we have scripts, environments, and full knowledge of how to run them to complete the training process, we'll eventually want to convert them to [components](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-pipeline-component?view=azureml-api-2) and [pipelines](https://learn.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines?view=azureml-api-2). Together, these provide a way to package jobs with environments and other metadata, and to retrain the model over time, at which point it'll make sense to add [drift monitoring](https://github.com/Azure/data-model-drift) tools to our stack.

Before we get to that, though, we'll take a detour and use [Next.js](https://nextjs.org/) and [FastAPI](https://fastapi.tiangolo.com/) to build a web application that consumes our model. Then, we'll containerize it with Docker and deploy it to Azure App Service. Along the way, we'll add [Prometheus and Grafana](https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-grafana) to get our feet wet with [Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-monitor/overview), an important foundational step for ensuring robust, high-visibility DevSecMLOps workflows down the line.

## Appendix I: Generating the Training Data

```python
import argparse
import os

import azure.core
import pandas as pd
import numpy as np
import soundfile
import librosa
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential
from azureml.core import Dataset, Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# DATA MAPS
EMOTION_MAP = {
    'W': 'wut',        # anger
    'L': 'langeweile', # boredom
    'E': 'ekel',       # disgust
    'A': 'angst',      # fear
    'F': 'freude',     # happiness/joy
    'T': 'trauer',     # sadness
    'N': 'neutral',    # neutral
}

SPEAKER_MAP = {
    '03': {
        'gender': 0,
        'age': 31,
    },
    '08': {
        'gender': 1,
        'age': 34,
    },
    '09': {
        'gender': 1,
        'age': 21,
    },
    '10': {
        'gender': 0,
        'age': 32,
    },
    '11': {
        'gender': 0,
        'age': 26,
    },
    '12': {
        'gender': 0,
        'age': 30,
    },
    '13': {
        'gender': 1,
        'age': 32,
    },
    '14': {
        'gender': 1,
        'age': 35,
    },
    '15': {
        'gender': 0,
        'age': 25,
    },
    '16': {
        'gender': 1,
        'age': 31,
    },
}

TEXT_MAP = {
    'a01': {
        'text_de': 'Der Lappen liegt auf dem Eisschrank.',
        'text_en': 'The tablecloth is lying on the fridge.'
    },
    'a02': {
        'text_de': 'Das will sie am Mittwoch abgeben.',
        'text_en': 'She will hand it in on Wednesday.'
    },
    'a04': {
        'text_de': 'Heute abend könnte ich es ihm sagen.',
        'text_en': 'Tonight I could tell him.'
    },
    'a05': {
        'text_de': 'Das schwarze Stück Papier befindet sich da oben neben dem Holzstück.',
        'text_en': 'The black sheet of paper is located up there besides the piece of timber.'
    },
    'a07': {
        'text_de': 'In sieben Stunden wird es soweit sein.',
        'text_en': 'In seven hours it will be.'
    },
    'b01': {
        'text_de': 'Was sind denn das für Tüten, die da unter dem Tisch stehen?',
        'text_en': 'What about the bags standing there under the table?'
    },
    'b02': {
        'text_de': 'Sie haben es gerade hochgetragen und jetzt gehen sie wieder runter.',
        'text_en': 'They just carried it upstairs and now they are going down again.'
    },
    'b03': {
        'text_de': 'An den Wochenenden bin ich jetzt immer nach Hause gefahren und habe Agnes besucht.',
        'text_en': 'Currently at the weekends I always went home and saw Agnes.'
    },
    'b09': {
        'text_de': 'Ich will das eben wegbringen und dann mit Karl was trinken gehen.',
        'text_en': 'I will just discard this and then go for a drink with Karl.'
    },
    'b10': {
        'text_de': 'Die wird auf dem Platz sein, wo wir sie immer hinlegen.',
        'text_en': 'It will be in the place where we always store it.'
    }
}

# PARSING FUNCTIONS
def emotion_of(filename: str) -> str:
    return EMOTION_MAP[filename[-2]]

def parse_filename(filepath: str, dir_path: str) -> dict[str, str]:
    filename = filepath.split('.')[0]
    return {
        **SPEAKER_MAP[filename[:2]],
        **TEXT_MAP[filename[2:5]],
        'filepath': f'{dir_path}/{filepath}',
        'filename': filename,
        'emotion': EMOTION_MAP[filename[-2]],
        'instance': filename[-1],
    }

def map_filenames(dir_path: str) -> list[dict]:
    """Parse every file in `dir_path` into one metadata record per audio file."""
    return [parse_filename(filename, dir_path) for filename in os.listdir(dir_path)]

def f_chromagram(waveform, sample_rate):
    """Generate the chromagram of `waveform`'s STFT. Produces 12 features."""
    return np.mean(librosa.feature.chroma_stft(
        S=np.abs(librosa.stft(waveform)),
        sr=sample_rate,
    ).T, axis=0)

def f_mel_spectrogram(waveform, sample_rate):
    """Generate Mel Spectrogram of `waveform`. Generates 128 features."""
    return np.mean(librosa.feature.melspectrogram(
        y=waveform,
        sr=sample_rate,
    ).T, axis=0)

def f_mfcc(waveform, sample_rate, n_mfcc: int = 40):
    """Generate `n_mfcc` Mel-Frequency Cepstral Coefficientss of `waveform`. Produces `n_mfcc` features."""
    return np.mean(librosa.feature.mfcc(
        y=waveform,
        sr=sample_rate,
        n_mfcc=n_mfcc,
    ).T, axis=0)

def features(filepath):
    with soundfile.SoundFile(filepath) as audio:
        waveform = audio.read(dtype="float32")
        
        feature_matrix = np.array([])
        feature_matrix = np.hstack((
            f_chromagram(waveform, audio.samplerate),
            f_mel_spectrogram(waveform, audio.samplerate),
            f_mfcc(waveform, audio.samplerate),
        ))
    return feature_matrix

# ASSET FUNCTIONS
def bump_patch(version: str) -> str:
    """Given a semver string of form `vx.x.x`, return `v.x.x.{x+1}`."""
    comps = [int(el) for el in version.replace("v", "").split(".")]
    comps[-1] = comps[-1] + 1
    return f"v{'.'.join([str(el) for el in comps])}"

def get_new_asset_version(ml_client: MLClient, asset_name: str) -> str:
    try:
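        # Sort numerically: as plain strings, "v1.0.10" would sort before "v1.0.9"
        asset_versions = sorted(
            [asset.version for asset in ml_client.data.list(name=asset_name)],
            key=lambda v: [int(el) for el in v.replace("v", "").split(".")],
        )
        if not asset_versions:  # nothing registered under this name yet
            return "v1.0.0"
        return bump_patch(asset_versions[-1])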
    except azure.core.exceptions.ResourceNotFoundError:  # If no versions are found
        return "v1.0.0"

def feature_parquet_create_or_update(ml_client: MLClient, parquet_filepath: str, asset_name: str) -> Data:
    new_asset = Data(
        name=asset_name,
        version=get_new_asset_version(ml_client, asset_name),
        path=parquet_filepath,
        type=AssetTypes.URI_FILE,
        description="Parquet file describing the EmoDB feature set.",
    )
    return ml_client.data.create_or_update(new_asset)

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("-a", "--asset-name", type=str)
    parser.add_argument("-d", "--dataset-name", type=str)
    parser.add_argument("-p", "--parquet-output-filename", type=str)
    parser.add_argument("-o", "--download-path", type=str)
    parser.add_argument("-w", "--workspace-name", type=str)
    parser.add_argument("-s", "--subscription-id", type=str)
    parser.add_argument("-g", "--resource-group-name", type=str)
    parser.add_argument("-i", "--app-id", type=str)
    parser.add_argument("-c", "--client-secret", type=str)
    parser.add_argument("-t", "--tenant", type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args


def main(args):
    ws = Workspace(
        subscription_id=args.subscription_id,
        resource_group=args.resource_group_name,
        workspace_name=args.workspace_name,
        auth=ServicePrincipalAuthentication(
            tenant_id=args.tenant,
            service_principal_id=args.app_id,
            service_principal_password=args.client_secret,
        )
    )
    
    # Download data
    dataset = Dataset.get_by_name(
        workspace=ws,
        name=args.dataset_name,
        version="latest",
    )
    dataset.download(
        target_path=args.download_path,
        overwrite=True,
    )
    # Parse filenames to extract metadata
    df = pd.DataFrame(map_filenames(args.download_path))

    # Generate features for each audio file
    X = pd.DataFrame(df["filepath"].apply(features).tolist(), index=df.index)
    X["filename"] = df["filename"]
    X.set_index("filename", inplace=True)
    X = X.rename(columns={i: f'feature{i}' for i in range(180)})
    X["emotion"] = df["emotion"].to_numpy()
    print(X.head())

    # Save as Parquet file
    output_dir = "/tmp/output"
    os.makedirs(output_dir, exist_ok=True)
    X.to_parquet(os.path.join(output_dir, args.parquet_output_filename))

    # Upload as URI File
    ml_client = MLClient(
        DefaultAzureCredential(),
        args.subscription_id,
        args.resource_group_name,
        args.workspace_name,
    )
    feature_parquet_create_or_update(ml_client, os.path.join(output_dir, args.parquet_output_filename), args.asset_name)


if __name__ == "__main__":
    main(parse_args())
```
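To make the parsing logic above concrete: it relies on EmoDB's fixed-width file naming, in which the first two characters identify the speaker, the next three the sentence, the second-to-last the emotion code, and the last the take. The sketch below is a simplified, standalone version of `parse_filename`; the `split_emodb_filename` helper is mine, purely for illustration. It also spells out where the `range(180)` in `main` comes from.

```python
def split_emodb_filename(filename: str) -> dict[str, str]:
    """Split an EmoDB file name into its fixed-width fields."""
    stem = filename.split(".")[0]  # drop the extension
    return {
        "speaker": stem[:2],        # key into SPEAKER_MAP
        "text": stem[2:5],          # key into TEXT_MAP
        "emotion_code": stem[-2],   # key into EMOTION_MAP
        "instance": stem[-1],       # distinguishes repeated takes
    }


parts = split_emodb_filename("03a01Fa.wav")
print(parts)
# {'speaker': '03', 'text': 'a01', 'emotion_code': 'F', 'instance': 'a'}

# Feature vector width: 12 chroma bins + 128 mel bands + 40 MFCCs.
N_CHROMA, N_MEL, N_MFCC = 12, 128, 40
print(N_CHROMA + N_MEL + N_MFCC)  # 180
```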

## Appendix II: Generating the Feature Set

```python
import argparse
import logging
import os

import azure.core
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential
from azure.ai.ml.entities import Data, DataColumn, DataColumnType, FeatureSet, FeatureSetSpecification, FeatureStoreEntity
from azure.identity import DefaultAzureCredential
from azureml.core import Run
from azureml.featurestore import FeatureSetSpec
from azureml.featurestore.contracts.feature import Feature
from azureml.featurestore.contracts.feature_source_type import SourceType
from azureml.featurestore.contracts.feature_source import FeatureSource
from azureml.featurestore.contracts import (
    Column,
    ColumnType,
    TimestampColumn,
)
from azureml.featurestore.feature_source import ParquetFeatureSource
import pandas as pd
import pyspark


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("-p", "--parquet-output-filename", type=str)
    parser.add_argument("-v", "--version", type=str)
    parser.add_argument("-w", "--workspace-name", type=str)
    parser.add_argument("-s", "--subscription-id", type=str)
    parser.add_argument("-g", "--resource-group-name", type=str)
    parser.add_argument("-q", "--parquet-file", type=str)
    parser.add_argument("-f", "--feature-store-name", type=str)
    parser.add_argument("-i", "--app-id", type=str)
    parser.add_argument("-c", "--client-secret", type=str)
    parser.add_argument("-t", "--tenant", type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args


def bump_patch(version: str) -> str:
    """Given a semver string of form `vx.x.x`, return `v.x.x.{x+1}`."""
    comps = [int(el) for el in version.replace("v", "").split(".")]
    comps[-1] = comps[-1] + 1
    return f"v{'.'.join([str(el) for el in comps])}"


def get_new_feature_set_version(fs_client: MLClient, asset_name: str) -> str:
    try:
        asset_versions = [asset.version for asset in fs_client.feature_sets.list(name=asset_name)]
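        if not len(asset_versions):
            # package is just broken, as usual: `list` comes back empty, so probe specific versions instead
            latest = 0
            while True:
                version = f"v1.0.{latest}"
                try:
                    fs_client.feature_sets.get(name=asset_name, version=version)
                    latest += 1
                except azure.core.exceptions.ResourceNotFoundError:
                    return version
        # `list` returned versions: bump the highest (numeric sort, so v1.0.10 beats v1.0.9)
        return bump_patch(max(
            asset_versions,
            key=lambda v: [int(el) for el in v.replace("v", "").split(".")],
        ))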
    except azure.core.exceptions.ResourceNotFoundError:
        return "v1.0.0"


def get_new_asset_version(ml_client: MLClient, asset_name: str) -> str:
    try:
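        asset_versions = [asset.version for asset in ml_client.data.list(name=asset_name)]
        if not asset_versions:  # nothing registered under this name yet
            return "v1.0.0"
        # Sort numerically: as plain strings, "v1.0.10" would sort before "v1.0.9"
        asset_versions = sorted(
            asset_versions,
            key=lambda v: [int(el) for el in v.replace("v", "").split(".")],
        )
        return bump_patch(asset_versions[-1])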
    except azure.core.exceptions.ResourceNotFoundError:  # If no versions are found
        return "v1.0.0"
    

def main(args):
    os.environ['AZURE_CLIENT_ID'] = args.app_id
    os.environ['AZURE_CLIENT_SECRET'] = args.client_secret
    os.environ['AZURE_TENANT_ID'] = args.tenant
    credential = DefaultAzureCredential()
    
    ml_client = MLClient(
        credential=credential,
        subscription_id=args.subscription_id,
        resource_group_name=args.resource_group_name,
        workspace_name=args.workspace_name,
    )

    fs_client = MLClient(
        credential=credential,
        subscription_id=args.subscription_id,
        resource_group_name=args.resource_group_name,
        workspace_name=args.feature_store_name,
    )
    
    # 0. Fetch, download, load Parquet as DF
    X = pd.read_parquet(path=args.parquet_file)

    # 1. Define FeatureSetSpec instance
    parquet_asset_versions = sorted(list(ml_client.data.list(name=args.parquet_output_filename)), key=lambda a: a.version)
    # the label column written by the data script is named `emotion` (singular)
    exclusions = ["timestamp", "filename", "filepath", "emotion"]
    fss = FeatureSetSpec(
        source=FeatureSource(
            type=SourceType.PARQUET,
            path=parquet_asset_versions[-1].path,
            timestamp_column=TimestampColumn(name="timestamp"),
        ),
        features=[Feature(name=str(f), type=ColumnType.FLOAT) for f in X.columns.to_list() if f not in exclusions],
        index_columns=[Column(name="filename", type=ColumnType.STRING)],
    )

    # 2. `dump` the YAML -- for some reason, `dump` ignores custom filenames, true to form
    output_dir = "/tmp/output"
    os.makedirs(output_dir, exist_ok=True)
    # using ONLY `output_dir` forces use of `FeatureSetSpec.yaml` as default name
    fss.dump(output_dir, overwrite=True)

    # 3. Create a URI File using the result
    fss_yaml_path = "FeatureSetSpec.yaml"
    outpath = os.path.join(output_dir, fss_yaml_path)
    feature_set_specification_yaml_name = "EmoDB-FeatureSetSpecfication"
    feature_set_specification_yaml_asset = Data(
        name=feature_set_specification_yaml_name,
        # look up the version in the same workspace the asset is created in below
        version=get_new_asset_version(ml_client, feature_set_specification_yaml_name),
        path=outpath,
        type=AssetTypes.URI_FILE,
        description="Feature Set Specification YAML for EmoDB data.",
    )
    try:
        feature_set_specification_yaml_result = ml_client.data.create_or_update(feature_set_specification_yaml_asset)
    except azure.core.exceptions.HttpResponseError as e:
        logging.error(e)
        logging.info(f"Attempted to use: {get_new_asset_version(ml_client, feature_set_specification_yaml_name)}")

    # 4a. Create a Feature Set Entity
    entity_name = "filename"
    entity_versions = sorted([int(e.version) for e in fs_client.feature_store_entities.list(name=entity_name)])
    next_version = str(entity_versions[-1] + 1) if len(entity_versions) > 0 else "1"
    fs_entity = FeatureStoreEntity(
        name=entity_name,
        version=next_version,
        index_columns=[DataColumn(name=entity_name, type=DataColumnType.STRING)],
        stage="Development",
        description=f"This entity represents the index column of the EmoDB dataset, `{entity_name}`.",
        tags={
            "pii": False
        },
    )
    entity_poller = fs_client.feature_store_entities.begin_create_or_update(fs_entity)
    entity = entity_poller.result()

    # 4b. Create Feature Set Object w/ Entity
    feature_set_name = "EmoDB-FeatureSet2"
    feature_set_version = get_new_feature_set_version(fs_client, feature_set_name)
    print(f"Feature Set Version: {feature_set_version}")
    fs_poller = fs_client.feature_sets.begin_create_or_update(featureset=FeatureSet(
        name=feature_set_name,
        version=feature_set_version,
        description="Data Set for training German SER Classifier.",
        entities=[f"azureml:{entity_name}:{next_version}"],
        specification=FeatureSetSpecification(path=output_dir),
        tags={
            "pii": False,
        },
    ))
    fs_result = fs_poller.result()
    logging.info(fs_result)


if __name__ == '__main__':
    logging.basicConfig()
    logging.getLogger().setLevel(logging.INFO)
    main(parse_args())
```
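One subtlety worth calling out about the `vX.Y.Z` version strings used throughout: they don't sort correctly as plain strings once any component reaches two digits. The sketch below shows the problem and a numeric sort key that avoids it. The `version_key` helper and the sample versions are mine, for illustration only.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Turn 'v1.0.10' into (1, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


versions = ["v1.0.9", "v1.0.10", "v1.0.2"]

# Lexicographic comparison goes character by character, so "v1.0.10" < "v1.0.2" < "v1.0.9".
print(sorted(versions)[-1])            # v1.0.9
# Comparing integer tuples gives the expected latest version.
print(max(versions, key=version_key))  # v1.0.10
```

With only a handful of versions the difference never shows up, which is exactly why it's easy to miss until the tenth patch release.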

## Appendix III: Training the Model

```python
from typing import Dict, List
import argparse
import logging
import os

from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential
from mlflow.models import infer_signature
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.model_selection import GridSearchCV, learning_curve, LearningCurveDisplay, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import azureml.mlflow
import matplotlib.pyplot as plt
import mlflow
import numpy as np
import pandas as pd
import seaborn as sn
import sklearn


def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()
    # add arguments
    parser.add_argument("-q", "--parquet-file", type=str)
    parser.add_argument("-w", "--workspace-name", type=str)
    parser.add_argument("-s", "--subscription-id", type=str)
    parser.add_argument("-g", "--resource-group-name", type=str)
    parser.add_argument("-i", "--app-id", type=str)
    parser.add_argument("-c", "--client-secret", type=str)
    parser.add_argument("-t", "--tenant", type=str)
    parser.add_argument("-x", "--experiment-name", type=str)
    parser.add_argument("-n", "--model-name", type=str)
    # parse args
    args = parser.parse_args()

    # return args
    return args


def standard_scale(df: pd.DataFrame):
    sscaler = StandardScaler(copy=True)
    return sscaler.fit_transform(df)


def minmax_scale(df: pd.DataFrame):
    mmscaler = MinMaxScaler(copy=True)
    return mmscaler.fit_transform(df)


def plot_emotion_distribution(df: pd.DataFrame, plot_size=(10, 8)):
    emotions, count = np.unique(df['emotion'].values, return_counts=True)

    fig, ax = plt.subplots(figsize=plot_size)
    plt.bar(x=range(len(emotions)), height=count)
    plt.xticks(
        ticks=range(len(emotions)),
        labels=[e for e in emotions],
    )
    ax.set_xlabel("Emotion")
    ax.set_ylabel("Count")
    plt.tight_layout()
    plt.close(fig)
    
    return fig


def scale(X: pd.DataFrame):
    sscaler = StandardScaler(copy=True)
    mmscaler = MinMaxScaler(copy=True)
    return {
        "standard": sscaler.fit_transform(X),
        "minmax": mmscaler.fit_transform(X),
    }


def split(features, labels, test_size=.2, random_state=0):
    """A function to split and return a feature matrix."""
    return sklearn.model_selection.train_test_split(features, labels, test_size=test_size, random_state=random_state)


def split2dict(X_train, X_test, y_train, y_test):
    return {
        "train": {
            "X": X_train,
            "y": y_train,
        },
        "test": {
            "X": X_test,
            "y": y_test,
        },
    }


def train(nn_models: List[MLPClassifier], splits: Dict[str, Dict[str, pd.DataFrame]]):
    nnet_scores = {}
    for name, split in zip(["standard", "minmax"], [splits["standard"], splits["minmax"]]):
        print(f"{name.upper()}")
        nnet_scores[name] = []
        for m in nn_models:
            m.fit(split["train"]["X"], split["train"]["y"])
            score = m.score(split["test"]["X"], split["test"]["y"])
    
            m_name = type(m).__name__
            nnet_scores[name].append((m_name, f"{100*score:.2f}%"))
    return nnet_scores


def grid_search(splits: Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    mlp = MLPClassifier(
        batch_size=32,
        random_state=0,
    )
    # reduced to best found, just to save compute time -- see notebook for original search
    parameters = {
        "hidden_layer_sizes": [(360,)],
        "activation": ["relu"],
        "solver": ["adam"],
        "alpha": [1e-3],
        "epsilon": [1e-8],
        "learning_rate": ["adaptive"],
    }
    gs = GridSearchCV(
        mlp,
        parameters,
        cv=5,
        n_jobs=4,
    )
    gs.fit(splits[split_name]["train"]["X"], splits[split_name]["train"]["y"])
    return gs


def stratifiedkfold(mlp_final: MLPClassifier, target: pd.core.series.Series, splits: Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    skfold = StratifiedKFold(
        n_splits=min(target.value_counts().values), # 46
        shuffle=True,
        random_state=0,
    )
    kfold_scores = []
    for train_indices, test_indices in skfold.split(splits[split_name]["train"]["X"], splits[split_name]["train"]["y"]):
        # `X` is a NumPy array after scaling, but `y` is still a Series indexed by
        # filename, so it needs positional (`.iloc`) indexing
        mlp_final.fit(
            splits[split_name]["train"]["X"][train_indices],
            splits[split_name]["train"]["y"].iloc[train_indices],
        )
        kfold_scores.append(mlp_final.score(
            splits[split_name]["train"]["X"][test_indices],
            splits[split_name]["train"]["y"].iloc[test_indices]),
        )
    return kfold_scores


def score(mlp_final: MLPClassifier, splits:Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    y_pred = mlp_final.predict(splits[split_name]["test"]["X"])
    return {
        "accuracy": accuracy_score(splits[split_name]["test"]["y"], y_pred),
        "f1": f1_score(splits[split_name]["test"]["y"], y_pred, average="macro"),
        "precision": precision_score(splits[split_name]["test"]["y"], y_pred, average="macro"),
        "recall": recall_score(splits[split_name]["test"]["y"], y_pred, average="macro"),
    }


def generate_learning_scaling_curves(target: pd.core.series.Series, mlp_final: MLPClassifier, splits:Dict[str, Dict[str, pd.DataFrame]], split_name: str = "standard"):
    params = {
        "X": splits[split_name]["train"]["X"],
        "y": splits[split_name]["train"]["y"],
        "train_sizes": np.linspace(0.1, 1.0, 5),
        "cv": min(target.value_counts().values), # 46
        "n_jobs": 4,
        "return_times": True,
    }
    train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
        mlp_final,
        shuffle=True,
        random_state=0,
        **params,
    )
    # Generate Plot
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(24, 4), sharey=False, sharex=True)

    ax[0].plot(train_sizes, fit_times.mean(axis=1), 'o-')
    ax[0].fill_between(
            train_sizes,
            fit_times.mean(axis=1) - fit_times.std(axis=1),
            fit_times.mean(axis=1) + fit_times.std(axis=1),
            alpha=0.3,
        )
    ax[0].set_ylabel('Fit time (s)')
    ax[0].set_xlabel('# Training Samples')
    ax[0].set_title(
        f"Scalability"
    )
    
    ax[1].plot(train_sizes, train_scores.mean(axis=1), 'o-', color='r', label='Training Score')
    ax[1].plot(train_sizes, test_scores.mean(axis=1), 'o-', color='g', label='CV Score')
    ax[1].fill_between(
        train_sizes,
        train_scores.mean(axis=1) - train_scores.std(axis=1),
        train_scores.mean(axis=1) + train_scores.std(axis=1),
        color='r',
        alpha=0.3,
    )
    ax[1].fill_between(
        train_sizes,
        test_scores.mean(axis=1) - test_scores.std(axis=1),
        test_scores.mean(axis=1) + test_scores.std(axis=1),
        color='g',
        alpha=0.3,
    )
    ax[1].set_ylabel('Score')
    ax[1].set_xlabel('# Training Samples')
    ax[1].set_title(
        f"Learning Curve"
    )
    return fig


def generate_confusion_matrices(test_predictions, test_groundtruth):
    EMOTION_MAP = {
        'W': 'wut',        # anger
        'L': 'langeweile', # boredom
        'E': 'ekel',       # disgust
        'A': 'angst',      # fear
        'F': 'freude',     # happiness/joy
        'T': 'trauer',     # sadness
        'N': 'neutral',    # neutral
    }
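    # The labels in the data are the emotion names (the map's values), and
    # `confusion_matrix` orders rows/columns by sorted label unless told otherwise.
    # Pass the order explicitly so the DataFrame axes line up with the matrix.
    labels = sorted(EMOTION_MAP.values())
    cmat = confusion_matrix(test_groundtruth, test_predictions, labels=labels)
    cmat_norm = confusion_matrix(test_groundtruth, test_predictions, labels=labels, normalize='true')

    df_cmat = pd.DataFrame(cmat, index=labels, columns=labels)
    df_cmat_norm = pd.DataFrame(cmat_norm, index=labels, columns=labels)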

    return df_cmat, df_cmat_norm


def generate_confusion_plots(df_cmat: pd.DataFrame, df_cmat_norm: pd.DataFrame):
    plt.figure(figsize=(16, 6))

    plt.subplot(1, 2, 1)
    plt.title('Confusion Matrix')
    sn.heatmap(df_cmat, annot=True, annot_kws={'size': 18})
    
    plt.subplot(1, 2, 2)
    plt.title('Normalized Confusion Matrix')
    sn.heatmap(df_cmat_norm, annot=True, annot_kws={'size': 18})

    return plt.gcf() 
    

def main(args):
    # Must be set for mlflow to authenticate
    os.environ['AZURE_CLIENT_ID'] = args.app_id
    os.environ['AZURE_CLIENT_SECRET'] = args.client_secret
    os.environ['AZURE_TENANT_ID'] = args.tenant
    # Set MLflow tracking URI
    ml_client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=args.subscription_id,
        resource_group_name=args.resource_group_name,
        workspace_name=args.workspace_name,
    )
    mlflow_tracking_uri = ml_client.workspaces.get(name=ml_client.workspace_name).mlflow_tracking_uri
    mlflow.set_tracking_uri(mlflow_tracking_uri)
    # Create and set experiment
    experiment = mlflow.get_experiment_by_name(args.experiment_name)
    print(f"Experiment: {experiment}")
    if experiment is None:
        mlflow.create_experiment(
            name=args.experiment_name,
            tags={
                "pii": False,
                "dataset": "emodb",
                "type": "classifier",
            }
        )
    # Activate the experiment whether it was just created or already existed
    experiment = mlflow.set_experiment(name=args.experiment_name)
    # Load the Parquet file into a Pandas DataFrame
    X = pd.read_parquet(path=args.parquet_file)
    # Generate emotion distribution plot
    fig1 = plot_emotion_distribution(X)
    # Save `emotion` column and drop from `X`
    y = X["emotion"]
    X.drop(columns=["emotion"], inplace=True)
    # Scale and split
    splits = {
        "standard": split2dict(*split(standard_scale(X), y)),
        "minmax": split2dict(*split(minmax_scale(X), y)),
    }
    # Train and evaluate model
    nn_models = [
        MLPClassifier(random_state=0),
    ]
    nnet_scores = train(nn_models, splits)
    print(nnet_scores)
    # Hyperparameter tuning
    gs = grid_search(splits, "standard")
    print(gs.best_params_)
    # Retrain and evaluate final model
    mlp_final = MLPClassifier(
        **gs.best_params_,
        # Adam constants -- defaults suggested in paper
        beta_1=0.9,
        beta_2=0.999,
        batch_size=32,
        max_iter=200,
        random_state=0,
    )
    mlp_final.fit(splits["standard"]["train"]["X"], splits["standard"]["train"]["y"])
    mlp_final_scores = score(mlp_final, splits, "standard")
    print(mlp_final_scores)
    # StratifiedKFold
    kfold_scores = stratifiedkfold(mlp_final, y, splits, "standard")
    kfold_stats = {
        "kfold_mean": np.mean(kfold_scores),
        "kfold_std": np.std(kfold_scores),
    }
    print(kfold_stats)
    # Generate learning and scaling curves
    fig2 = generate_learning_scaling_curves(y, mlp_final, splits, "standard")
    # Generate confusion matrices
    fig3 = generate_confusion_plots(
        *generate_confusion_matrices(
            mlp_final.predict(splits["standard"]["test"]["X"]), 
            splits["standard"]["test"]["y"]),
    )

    artifact_path = "rt-emodb"
    with mlflow.start_run() as run:
        # Log visualizations
        mlflow.log_figure(fig1, "emotion_distribution.png")
        mlflow.log_figure(fig2, "learning_scaling_curves.png")
        mlflow.log_figure(fig3, "confusion.png")
        # Log GridSearchCV parameters and metrics
        mlflow.log_metric("best_cross_val_score", gs.best_score_)
        mlflow.log_metric("num_params_tested", len(gs.cv_results_["params"]))
        mlflow.log_metric("mean_fit_time", gs.cv_results_["mean_fit_time"][gs.best_index_])
        for param, value in gs.best_params_.items():
            mlflow.log_param(f"grid_search_{param}", value)
        # Log model metrics
        for metric_name, metric_value in mlp_final_scores.items():
            mlflow.log_metric(metric_name, metric_value)
        # Log kfold params and metrics
        mlflow.log_param("kfold_n_splits", min(y.value_counts().values))
        for name, value in kfold_stats.items():
            mlflow.log_metric(name, value)
        # Log model
        mlflow.sklearn.log_model(
            mlp_final,
            artifact_path,
            registered_model_name=args.model_name,
            signature=infer_signature(
                splits["standard"]["test"]["X"],
                mlp_final.predict(splits["standard"]["test"]["X"]),
            ),
            # omit `conda_env`: [mlflow.models.infer_pip_requirements](https://mlflow.org/docs/latest/python_api/mlflow.models.html#mlflow.models.infer_pip_requirements)
        )


if __name__ == "__main__":
    logging.basicConfig()
    logging.getLogger().setLevel(logging.INFO)
    main(parse_args())
```
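A note on reading the confusion matrices: unless `labels` is passed explicitly, scikit-learn's `confusion_matrix` orders rows and columns by sorted label value, which here means alphabetical by German emotion name rather than the order of `EMOTION_MAP`. The pure-Python sketch below reproduces that bookkeeping on a few made-up predictions, just to make the axis order explicit. The `confusion_counts` helper and the sample labels are illustrative, and are not part of the training script.

```python
from collections import Counter

EMOTION_MAP = {
    'W': 'wut', 'L': 'langeweile', 'E': 'ekel', 'A': 'angst',
    'F': 'freude', 'T': 'trauer', 'N': 'neutral',
}


def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, both in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(true, pred)] for pred in labels] for true in labels]


labels = sorted(EMOTION_MAP.values())
print(labels)
# ['angst', 'ekel', 'freude', 'langeweile', 'neutral', 'trauer', 'wut']

# Made-up example: one 'wut' sample is misclassified as 'angst'.
y_true = ["wut", "wut", "angst", "trauer"]
y_pred = ["wut", "angst", "angst", "trauer"]
matrix = confusion_counts(y_true, y_pred, labels)

row, col = labels.index("wut"), labels.index("angst")
print(matrix[row][col])  # 1
```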