This architecture uses Batch AI. The Azure Batch AI service is retiring in March 2019, and its at-scale training and scoring capabilities are now available in Azure Machine Learning Service. The reference architecture will be updated soon to use Machine Learning, which offers a managed compute target called Azure Machine Learning Compute for training, deploying, and scoring machine learning models. The repo for the version using Azure Machine Learning can be accessed here.
In this repository you will find a set of scripts and commands that help you build a scalable solution for scoring many models in parallel using Batch AI.
The solution can be used as a template and generalizes to different problems. The problem addressed here is monitoring the operation of a large number of devices in an IoT setting, where each device continuously sends sensor readings. We assume there are pre-trained anomaly detection models (one per device) that are used to predict whether a series of measurements, aggregated over a predefined time interval, corresponds to an anomaly.
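The aggregation step described above can be sketched as follows. This is a minimal illustration only: the hourly interval, the mean aggregate, and the `(timestamp, value)` reading format are assumptions for the example, not taken from the repo's data format.

```python
from statistics import mean

def aggregate_readings(readings, interval_seconds=3600):
    """Group raw (timestamp, value) sensor readings into fixed time
    intervals and return the mean value per interval. The hourly
    interval and mean aggregate are illustrative choices."""
    buckets = {}
    for timestamp, value in readings:
        bucket = timestamp // interval_seconds  # integer interval index
        buckets.setdefault(bucket, []).append(value)
    return {bucket: mean(values) for bucket, values in sorted(buckets.items())}

# Example: three readings fall in the first hour, one in the second
readings = [(10, 1.0), (200, 3.0), (3000, 5.0), (4000, 7.0)]
print(aggregate_readings(readings))  # {0: 3.0, 1: 7.0}
```

In a real deployment this aggregation would typically happen upstream of scoring, over a stream of readings rather than an in-memory list.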
To get started, read through the Architecture section, then go through the following sections:
- Install Prerequisites
- Create Environment
- Create Azure Resources
- Validate Deployments and Jobs Execution
- Cleanup
This solution consists of several Azure cloud services that allow scaling resources up and down as needed. The services and their roles in this solution are described below.
Blob containers are used to store the pre-trained models, the data, and the output predictions. The models that we upload to blob storage in the create_resources.ipynb notebook are One-class SVM models that are trained on data that represents values of different sensors of different devices. We assume that the data values are aggregated over a fixed interval of time. In real-world scenarios, this could be a stream of sensor readings that need to be filtered and aggregated before being used in training or real-time scoring. For simplicity, we use the same data file when executing scoring jobs.
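The core of a scoring job is reading a device's aggregated values, applying that device's model, and writing a predictions CSV. The sketch below uses a simple threshold detector as a stand-in for the repo's pre-trained One-class SVM models, and the CSV schema shown is an assumption for illustration, not the actual output format of the scoring script.

```python
import csv
import io

def score_device(device_id, aggregated_values, threshold):
    """Score one device's aggregated sensor values against a simple
    threshold model (a stand-in for the pre-trained One-class SVM).
    Returns rows of (device_id, interval, value, prediction), where
    prediction 1 marks an anomaly and 0 marks normal operation."""
    rows = []
    for interval, value in enumerate(aggregated_values):
        is_anomaly = int(abs(value) > threshold)
        rows.append((device_id, interval, value, is_anomaly))
    return rows

def write_predictions_csv(rows, fileobj):
    """Write scoring results as CSV; the column layout here is an
    illustrative assumption, not the repo's exact schema."""
    writer = csv.writer(fileobj)
    writer.writerow(["device_id", "interval", "value", "prediction"])
    writer.writerows(rows)

# Example: the third interval's value exceeds the threshold
rows = score_device("sensor-001", [0.2, 0.4, 9.7], threshold=1.0)
buf = io.StringIO()
write_predictions_csv(rows, buf)
print(buf.getvalue())
```

In the actual solution, the input data and model would be downloaded from blob storage and the resulting CSV uploaded to the predictions container.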
Batch AI is the distributed computing engine used here. It allows spinning up virtual machines on demand with an auto-scaling option, where each node in the Batch AI cluster runs a scoring job for a specific sensor. The scoring Python script is run in Docker containers that are created on each node of the cluster.
Logic Apps provide an easy way to create the runtime workflow and scheduling for the solution. In our case, we create a Logic App that runs hourly Batch AI jobs. The jobs are submitted using a Python script that also runs in a Docker container.
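Conceptually, the hourly scheduling script fans out one scoring job per device. The sketch below shows only that fan-out pattern; `submit_job` is a hypothetical callable standing in for the actual Batch AI job-submission call, whose API usage is not shown here.

```python
def submit_scoring_jobs(device_ids, submit_job):
    """Fan out one scoring job per device. `submit_job` is a
    hypothetical stand-in for the real Batch AI submission call
    made by the repo's scheduling script."""
    job_names = []
    for device_id in device_ids:
        job_name = f"score-{device_id}"
        submit_job(job_name, device_id)
        job_names.append(job_name)
    return job_names

# Example with a stub submitter that just records its calls
submitted = []
names = submit_scoring_jobs(
    ["dev1", "dev2"],
    lambda name, dev: submitted.append((name, dev)),
)
print(names)  # ['score-dev1', 'score-dev2']
```

Because each job is independent, the Batch AI cluster can run them in parallel across its nodes and auto-scale to match the number of devices.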
The Docker images used in both Batch AI and Logic Apps are created in the create_resources.ipynb notebook and pushed to an Azure Container Registry (ACR). ACR provides a convenient way to host images and instantiate containers through other Azure services.
For more information on these services, check the documentation links provided in the Links section.
All scripts and commands were tested on an Ubuntu 16.04 LTS system.
Once all prerequisites are installed:

- Clone or download this repository:

  ```
  git clone https://github.com/Azure/BatchAIAnomalyDetection.git
  ```

- Create and activate the conda environment from the yml file:

  ```
  conda env create -f environment.yml
  source activate baimm
  ```

- Start Jupyter in the same environment:

  ```
  jupyter notebook
  ```

- Open Jupyter Notebook in your browser and select the environment kernel in the menu:

  ```
  Kernel > Change Kernel > Python [conda env:baimm]
  ```
Start creating the required resources in the next section.
The create_resources.ipynb notebook contains all Azure CLI and Docker commands needed to create resources in your Azure subscription, as well as configurations of Batch AI and scoring Python scripts.
Navigate to the cloned/downloaded directory in Jupyter Notebook: BatchAIAnomalyDetection/create_resources.ipynb, and start executing the cells to create the needed Azure resources.
After all resources are created, you can check your resource group in the portal and validate that all components have been deployed successfully.
For the Logic App to start running, you need to authenticate the ACI API connection. Navigate to your resource group, click the ACI API connection in the portal, and authenticate.
Under Batch AI Workspace > Cluster > Jobs, you should see the experiment and scoring jobs, as soon as the Logic App is triggered. It might take a few minutes for ACI to pull the scheduling image and run the container and for the Batch AI cluster to resize and allocate nodes before running the jobs.
Under Storage Account > Blobs, you should see the predictions CSV files in the predictions container, after the Batch AI jobs finish successfully.
If you wish to delete all created resources, run the following CLI command to delete the resource group and all underlying resources.
```
az group delete --name <resource_group_name>
```

- End-to-End Anomaly Detection Jobs using Azure Batch AI
- Batch AI Documentation
- Logic Apps Documentation
- Azure Blob Storage Documentation
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.