Azure Machine Learning

This project utilizes Microsoft Azure Machine Learning Studio to streamline the machine learning lifecycle. The goal is to process data, develop machine learning models, deploy them with endpoints, monitor performance, and automate the pipeline for continuous integration and deployment.

Description

This GitHub repository focuses on how to utilize features in Microsoft Azure Machine Learning (AML) Studio to create an end-to-end machine learning model development pipeline. The workflow includes reading, writing, and cleaning data from Azure Data Lake Storage (ADLS) Gen2 and processing that data within AML Studio to prepare it for model training and testing. Once the model is developed, it can be deployed to the AML workspace (online or batch), and inference calls can be made to generate outcomes. These outcomes will be distributed back to ADLS Gen2 for further data analysis. All these processes will be orchestrated using YAML configurations so that the workflow can be initiated in a Microsoft Azure DevOps environment and follow best practices for DevOps principles.

Getting Started

Following these instructions will help you set up and run the project for development and testing purposes.

Prerequisites

  • Microsoft Azure account
  • Data (e.g., CSV or Parquet files) for preprocessing and model training

Installation

Refer to the steps below to set up the necessary services in Microsoft Azure to get started with Azure Machine Learning Studio (the list contains two options):

Set up using Azure DevOps with YAML files

Azure AI Machine Learning Studio

Workspace

The workspace acts as a central hub for managing all the resources needed for machine learning projects. It includes features to build, train, deploy, and monitor models. Key features of an Azure Machine Learning Workspace include:

  • Resource Management: Centralized management of datasets, experiments, models, and deployments
  • Collaboration: Enables team collaboration by sharing resources and facilitating version control
  • Compute Management: Manages compute resources such as virtual machines, GPU clusters, and more.
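A workspace itself can be declared in a small YAML spec and created with the Azure CLI (`az ml workspace create --file workspace.yml`). A minimal sketch — the name, location, and resource group below are assumed placeholders:

```yaml
# workspace.yml — minimal workspace spec (all values are assumed placeholders)
$schema: https://azuremlschemas.azureedge.net/latest/workspace.schema.json
name: example-aml-workspace
location: eastus
display_name: Example AML workspace
description: Central hub for datasets, experiments, models, and deployments
```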

Feature

  • Environment
    • Encapsulation of the software dependencies, environment variables, and runtime settings required to execute machine learning development scripts
    • Features:
      • Python Environment: A list of Python packages (e.g., pandas, scikit-learn, PyTorch) that are required for the experiment or model training run
      • Docker Environment: Docker image that can be used as the execution environment
      • Environment Variables: Custom environment variables to manage different configurations, such as paths, credentials, and other settings that the code requires
      • Conda Dependencies: A Conda environment YAML file listing the Python packages and versions that need to be installed
      • Base Docker Image: A base Docker image to build the environment on, which can be a public image or one from a private registry
    • You can use Azure managed environments or customize your own
  • Components
    • Modularity: Break down complex workflows into manageable parts focused on specific tasks (e.g., data preprocessing, model training, model deployment, etc.)
    • Reusability: Reuse components across multiple pipelines and projects, promoting consistency and reducing repetitive work
    • Parameterization: Customize component behavior with inputs and parameters
    • Basic workflow
      • Use a YAML file or a Python script (SDK v2) to define a component and specify its behavior (e.g., name, type, inputs, outputs, command)
  • Jobs
    • A specific task or set of tasks that run within the Azure ML environment to perform machine learning operations
    • Example jobs:
      • Data Processing Jobs: Cleaning, transforming, and structuring raw data into a format suitable for machine learning
      • Training Jobs: Execute scripts or pipelines to train machine learning models
      • Pipeline Jobs: Run a sequence of steps, each performing a specific function like data preprocessing, model training, or deployment
      • Inference Jobs: Deploy trained models and run them against new data to generate predictions
    • Execution Environment
      • Jobs run on compute resources like Azure Machine Learning (AML) Compute, Kubernetes clusters, or Virtual Machines (VMs)
  • Pipelines
    • Workflow that automates and orchestrates a sequence of steps involved in the machine learning process
    • Example pipelines:
      • Data Preparation: Pipelines can include steps for data ingestion, transformation, and cleaning
      • Model Training: Steps to train models using different algorithms or configurations. These steps can be run in parallel to compare models and select the best one
      • Model Evaluation: Steps to evaluate models based on specific metrics to determine their accuracy and performance
      • Deployment: Deploy models as part of the pipeline, enabling automated model management and updates
  • MLTable
    • Data abstraction used for defining how data files should be loaded/processed for any task (command, job, pipeline)
    • Features:
      • Data Loading/Formats: Load various types of data files (CSV, JSON, Parquet, Delta Lake data, etc.) into memory as Pandas or Spark dataframes
      • Configuration: Configure how data should be read, such as handling missing values, setting data types, setting encoding formats, etc.
      • Extracting data from paths, or reading from different storage locations, can be handled by MLTable
    • Use MLTable functions like from_parquet_files, from_json_lines_files, or from_delimited_files to create an MLTable from Azure Data Lake Storage (ADLS) Gen2
    • Utilize Azure Machine Learning Studio's dataset versioning to track changes in an MLTable, such as the number of rows or columns, schema changes, etc.
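A minimal sketch of how the environment, component, pipeline, and MLTable features above fit together as AML v2 YAML specs. Every name, path, version, and compute target below is an assumed placeholder, not taken from this repository:

```yaml
# environment.yml — software dependencies for training runs
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: train-env
image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
conda_file:
  channels:
    - conda-forge
  dependencies:
    - python=3.9
    - pip
    - pip:
        - pandas
        - scikit-learn
---
# train-component.yml — a reusable command component with one input and one output
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_model
display_name: Train model
type: command
inputs:
  training_data:
    type: uri_folder
outputs:
  model_output:
    type: uri_folder
code: ./src                              # assumed source directory
environment: azureml:train-env@latest
command: >-
  python train.py
  --data ${{inputs.training_data}}
  --model ${{outputs.model_output}}
---
# pipeline-job.yml — a pipeline job wiring the component to a data asset
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
experiment_name: example-pipeline
settings:
  default_compute: azureml:cpu-cluster   # assumed compute cluster name
jobs:
  train:
    type: command
    component: ./train-component.yml
    inputs:
      training_data:
        type: uri_folder
        path: azureml:example_dataset@latest   # assumed registered data asset
---
# MLTable — file describing how delimited data should be loaded
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
paths:
  - file: ./data/input.csv               # assumed data path
transformations:
  - read_delimited:
      delimiter: ","
      encoding: utf8
      header: all_files_same_headers
```

Each spec would live in its own file; the pipeline job can then be submitted with `az ml job create --file pipeline-job.yml`.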

Azure Data Lake Storage (ADLS) Gen2

  • A storage solution that combines Azure Blob Storage and Azure Data Lake capabilities (structured and unstructured data) for big data analytics
  • Hierarchical namespace
  • Data can be organized and managed in directories and files, like a traditional file system
  • Supports large-scale analytics platforms like Apache Hadoop and Spark
  • Can be integrated with Azure services like Azure Synapse Analytics, Azure Databricks, Azure HDInsight, etc.
  • In this project, ADLS Gen2 is the central data repository for data operations such as prepping, cleaning, and splitting using Spark (PySpark)
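To let AML read from and write to ADLS Gen2, the storage account can be attached to the workspace as a datastore. A minimal YAML sketch — the datastore, account, and filesystem names are assumed placeholders:

```yaml
# adls-datastore.yml — attach an ADLS Gen2 filesystem to the workspace
$schema: https://azuremlschemas.azureedge.net/latest/azureDataLakeGen2.schema.json
name: adls_datastore
type: azure_data_lake_gen2
description: Central data repository for prepping, cleaning, and splitting
account_name: examplestorageaccount   # assumed storage account name
filesystem: data                      # assumed container/filesystem name
```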

MLflow in Azure Machine Learning

MLflow is an open-source platform to manage the complete machine learning lifecycle. In Azure Machine Learning, workspaces are designed to be MLflow-compatible, allowing you to track, experiment, and deploy machine learning models with MLflow.

  • Since Azure ML workspace is MLflow-compatible, you can log and track your machine learning experiments using MLflow's API
  • MLflow tracking server:
    • Azure Machine Learning automatically provides a tracking server
    • By default, all runs executed in an Azure ML workspace are automatically tracked in the workspace without custom configuration
    • You can directly log experiments and metrics using MLflow within Azure ML, and the results are stored in the Azure workspace storage
  • Automatic experiment logging:
    • If you run any machine learning experiment using Azure's SDK or MLflow, the following will be automatically logged to the Azure workspace:
      • Models
      • Metrics
      • Parameters
      • Artifacts
      • Images (ex. PIL.Image, numpy.ndarray, figure, etc.)
  • Model registration and versioning:
    • Using MLflow, you can automatically register a trained model in Azure Machine Learning's Model Registry
    • This will allow you to manage different versions of your models in the Azure workspace, ensuring model lineage, reproducibility, and version control
  • Deploy MLflow models:
    • Once a model is logged and registered in your Azure workspace, it can be deployed on Azure infrastructure (ex. Azure Kubernetes Service, Azure Container Instances) using the same MLflow interface
    • Azure Machine Learning's workspace will manage model deployment endpoints, and MLflow's tools can handle the model serving, monitoring, and scalability
  • When you are executing training jobs (pipeline) in Azure Machine Learning, there's no need to manually call mlflow.start_run, because runs are automatically initiated
  • Basic MLflow functionalities have been addressed in a separate GitHub repository, and the MLflow features in this repository are based on it
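Because MLflow models carry their own environment and signature, deploying a registered MLflow model to a managed online endpoint needs no scoring script. A minimal deployment sketch — the endpoint, model, and instance names are assumed placeholders:

```yaml
# mlflow-deployment.yml — deploy a registered MLflow model to an online endpoint
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: example-endpoint        # assumed, created beforehand
model: azureml:example-mlflow-model:1  # assumed registered MLflow model
instance_type: Standard_DS3_v2
instance_count: 1
```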

License

This project is licensed under the MIT License - see the LICENSE file for details

