

⛺️ Awesome AI Infrastructures ⛺️

📙 List of real-world AI infrastructures (a.k.a., machine learning systems, pipelines, workflows, and platforms) for machine/deep learning training and/or inference in production 🔌. This usually includes the technology stack necessary to enable machine learning algorithms to run in production environments in a stable, scalable, and reliable way. The list is for my own learning purposes, but feel free to contribute / star / fork / pull request. Any recommendations and suggestions are welcome 🎉.


This list contains some popular, actively maintained AI infrastructures, in no particular order, that focus on one or more of the following topics:

  • Architecture of end-to-end machine learning training pipelines.
  • Inference at scale in production on Cloud ☁️ or on end devices 📱.
  • Compiler and optimization stacks for deployment on a variety of devices.
  • Novel ideas for efficient large-scale distributed training.

This list cares more about the overall architecture of AI solutions in production than about individual machine/deep learning training or inference frameworks. My learning goal is to understand the workflows and principles of how to build (large-scale) systems that enable machine learning in production.

End-to-End Machine Learning Platforms

TFX - TensorFlow Extended (Google)

TensorFlow Extended (TFX) is a TensorFlow-based general-purpose machine learning platform implemented at Google.

| homepage | talk | paper |



  • TensorFlow Data Validation: a library for exploring and validating machine learning data.

  • TensorFlow Transform: performs full-pass analysis phases over data to create transformation graphs that are consistently applied during training and serving.

  • TensorFlow Model Analysis: libraries and visualization components to compute full-pass and sliced model metrics over large datasets, and analyze them in a notebook.

  • TensorFlow Serving: a flexible, high-performance serving system for machine learning models, designed for production environments.
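
The shape of a TensorFlow Serving request can be sketched without any TensorFlow dependency. TF Serving exposes a REST predict endpoint of the form `POST http://<host>:8501/v1/models/<model-name>:predict` (8501 is the default REST port); the model name and feature values below are hypothetical:

```python
import json

# TensorFlow Serving's REST API accepts prediction requests at
#   POST http://<host>:8501/v1/models/<model-name>:predict
# with a JSON body containing an "instances" list, one entry per example.
# Model name and feature shapes here are made up for illustration.

def build_predict_request(instances):
    """Serialize a batch of input examples for a :predict call."""
    return json.dumps({"instances": instances})

body = build_predict_request([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
# A successful response arrives as a JSON document: {"predictions": [...]}.
```

Sending this body with any HTTP client against a running model server is all a production client needs to do; the serving system handles batching, versioning, and model loading behind that endpoint.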

Kubeflow - The Machine Learning Toolkit for Kubernetes (Google)

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

Kubeflow started as an open sourcing of the way Google ran TensorFlow internally, based on a pipeline called TensorFlow Extended.

| homepage | github | documentation | blog | talk | slides |


  • Notebooks: a JupyterHub to create and manage interactive Jupyter notebooks.

  • TensorFlow Model Training: a TensorFlow Training Controller that can be configured to use either CPUs or GPUs and be dynamically adjusted to the size of a cluster with a single setting.

  • Model Serving: a TensorFlow Serving container to export trained TensorFlow models to Kubernetes. Integrated with Seldon Core, an open source platform for deploying machine learning models on Kubernetes, and NVIDIA TensorRT Inference Server for maximized GPU utilization when deploying ML/DL models at scale.

  • Multi-Framework: includes TensorFlow, PyTorch, MXNet, Chainer, and more.

RAPIDS - Open GPU Data Science (NVIDIA)

The RAPIDS suite of open source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

RAPIDS is the result of contributions from the machine learning community and GPU Open Analytics Initiative (GOAI) partners, such as Anaconda, BlazingDB, Gunrock, etc.

| homepage | blog | github |



  • Apache Arrow: a columnar, in-memory data structure that delivers efficient and fast data interchange with flexibility to support complex data models.

  • cuDF: a DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The Python bindings of the core-accelerated CUDA DataFrame manipulation primitives mirror the pandas interface for seamless onboarding of pandas users.

  • cuML: a collection of GPU-accelerated machine learning libraries that will provide GPU versions of all machine learning algorithms available in scikit-learn.

  • cuGRAPH: a framework and collection of graph analytics libraries that seamlessly integrate into the RAPIDS data science platform.

  • Deep Learning Libraries: data stored in Apache Arrow can be seamlessly pushed to deep learning frameworks that accept array_interface such as PyTorch and Chainer.

  • Visualization Libraries: RAPIDS will include tightly integrated data visualization libraries based on Apache Arrow. Native GPU in-memory data format provides high-performance, high-FPS data visualization, even with very large datasets.
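
The columnar layout that Apache Arrow standardizes is what makes the stack above fast: a scan or filter touches only the columns it needs. A toy column store in plain Python (illustration only; column names and data are made up) shows the idea:

```python
# Illustration only: a toy column store, mimicking the columnar, in-memory
# layout that Apache Arrow standardizes. A filter scans a single column,
# then gathers matching rows from the others -- no row-by-row deserialization.

table = {
    "ride_id":  [1, 2, 3, 4],
    "distance": [1.2, 5.0, 0.8, 7.5],
    "fare":     [6.0, 18.5, 4.0, 25.0],
}

def filter_rows(tbl, column, predicate):
    """Scan one column for matches, then gather those rows from all columns."""
    keep = [i for i, v in enumerate(tbl[column]) if predicate(v)]
    return {name: [col[i] for i in keep] for name, col in tbl.items()}

long_rides = filter_rows(table, "distance", lambda d: d > 1.0)
```

cuDF exposes essentially this interface (mirroring pandas) over GPU memory, which is why data can move between the RAPIDS libraries and deep learning frameworks without copies.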

Michelangelo - Uber's Machine Learning Platform (Uber)

Michelangelo is an internal ML-as-a-service platform that democratizes machine learning and makes scaling AI to meet the needs of the business as easy as requesting a ride.

Michelangelo consists of a mix of open source systems and components built in-house. The primary open source components used are HDFS, Spark, Samza, Cassandra, MLlib, XGBoost, and TensorFlow.

| blog | use-cases |



  • Manage data
  • Train models
  • Evaluate models
  • Deploy models
  • Make predictions
  • Monitor predictions

COTA: Improving Uber Customer Care with NLP & Machine Learning (Uber)

COTA, Uber's Customer Obsession Ticket Assistant, is a tool that uses machine learning and natural language processing (NLP) techniques to help agents deliver better customer support. It leverages the Michelangelo machine-learning-as-a-service platform on top of Uber's customer support platform.

| blog |



  • Preprocessing
  • Topic modeling
  • Feature engineering
  • Pointwise ranking algorithm

FBLearner (Facebook)

FBLearner Flow is capable of easily reusing algorithms in different products, scaling to run thousands of simultaneous custom experiments, and managing experiments with ease.

| blog | slides | talk |



  • Experimentation Management UI
  • Launching Workflows
  • Visualizing and Comparing Outputs
  • Managing Experiments
  • Machine Learning Library

Alchemist (Apple)

Alchemist is an internal service built at Apple from the ground up for easy, fast, and scalable distributed training.

| paper |



  • UI Layer: command line interface (CLI) and a web UI.

  • API Layer: exposes services to users, which allows them to upload and browse the code assets, submit distributed jobs, and query the logs and metrics.

  • Compute Layer: contains the compute-intensive workloads; in particular, the training and evaluation tasks orchestrated by the job scheduler.

  • Storage Layer: contains the distributed file systems and object storage systems that maintain the training and evaluation datasets, the trained model artifacts, and the code assets.

FfDL - Fabric for Deep Learning (IBM)

Deep learning frameworks such as TensorFlow, PyTorch, Caffe, Torch, Theano, and MXNet have contributed to the popularity of deep learning by reducing the effort and skills needed to design, train, and use deep learning models. Fabric for Deep Learning (FfDL, pronounced “fiddle”) provides a consistent way to run these deep-learning frameworks as a service on Kubernetes.

| blog | github |



  • REST API: the REST API microservice handles REST-level HTTP requests and acts as proxy to the lower-level gRPC Trainer service.

  • Trainer: the Trainer service admits training job requests, persisting metadata and model input configuration in a database (MongoDB).

  • Lifecycle Manager and learner pods: The Lifecycle Manager (LCM) deploys training jobs arriving from the Trainer, halting (pausing) and terminating training jobs. LCM uses the Kubernetes cluster manager to deploy containerized training jobs.

  • Training Data Service: The Training Data Service (TDS) provides short-lived storage and retrieval for logs and evaluation data from a Deep Learning training job.

BigDL - Distributed Deep Learning Library for Apache Spark (Intel)

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can run directly on top of existing Spark or Hadoop clusters. To make it easy to build Spark and BigDL applications, a high-level Analytics Zoo is provided for end-to-end analytics + AI pipelines.

| blog | code |



  • Rich deep learning support. Modeled after Torch, BigDL provides comprehensive support for deep learning, including numeric computing (via Tensor) and high level neural networks; in addition, users can load pre-trained Caffe or Torch models into Spark programs using BigDL.

  • Extremely high performance. To achieve high performance, BigDL uses Intel MKL and multi-threaded programming in each Spark task. Consequently, it is orders of magnitude faster than out-of-the-box open source Caffe, Torch, or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPUs).

  • Efficient scale-out. BigDL can efficiently scale out to perform data analytics at "Big Data scale" by leveraging Apache Spark (a lightning-fast distributed data processing framework), as well as efficient implementations of synchronous SGD and all-reduce communication on Spark.

SageMaker (Amazon Web Service)

Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.

| blog | talk |


TransmogrifAI (Salesforce)

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Spark. It was developed with a focus on enhancing machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity and reuse.

| homepage | github | documentation | blog |



  • Build production ready machine learning applications in hours, not months.

  • Build machine learning models.

  • Build modular, reusable, strongly typed machine learning workflows.

MLFlow (Databricks)

MLflow is an open source platform for managing the end-to-end machine learning lifecycle.

| homepage | github | documentation | blog |



  • MLflow Tracking: tracking experiments to record and compare parameters and results.

  • MLflow Projects: packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production.

  • MLflow Models: managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms.

H2O - Open Source, Distributed Machine Learning for Everyone

H2O is a fully open source, distributed, in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more. H2O also has industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models.

H2O4GPU is an open source, GPU-accelerated machine learning package with APIs in Python and R that allows anyone to take advantage of GPUs to build advanced machine learning models. A variety of popular algorithms are available, including Gradient Boosting Machines (GBMs), Generalized Linear Models (GLMs), and K-Means Clustering. Benchmarks found that training machine learning models on GPUs was up to 40x faster than CPU-based systems.

| h2o | h2o4gpu |

Machine Learning Model Inference/Deployment

Apple's CoreML

Integrate machine learning models into your app. Core ML is the foundation for domain-specific frameworks and functionality. Core ML supports Vision for image analysis, Natural Language for natural language processing, and GameplayKit for evaluating learned decision trees. Core ML itself builds on top of low-level primitives like Accelerate and BNNS, as well as Metal Performance Shaders.

Core ML is optimized for on-device performance, which minimizes memory footprint and power consumption. Running strictly on the device ensures the privacy of user data and guarantees that your app remains functional and responsive when a network connection is unavailable.

| documentation |

Greengrass (Amazon Web Service)

AWS Greengrass is software that lets you run local compute, messaging, data caching, sync, and ML inference capabilities for connected devices in a secure way. With AWS Greengrass, connected devices can run AWS Lambda functions, keep device data in sync, and communicate with other devices securely – even when not connected to the Internet. Using AWS Lambda, Greengrass ensures your IoT devices can respond quickly to local events, use Lambda functions running on Greengrass Core to interact with local resources, operate with intermittent connections, stay updated with over the air updates, and minimize the cost of transmitting IoT data to the cloud.

| blog |

GraphPipe (Oracle)

GraphPipe is a protocol and collection of software designed to simplify machine learning model deployment and decouple it from framework-specific model implementations.

| homepage | github | documentation |

PocketFlow (Tencent)

PocketFlow is an open-source framework for compressing and accelerating deep learning models with minimal human effort. Deep learning is widely used in various areas, such as computer vision, speech recognition, and natural language translation. However, deep learning models are often computationally expensive, which limits further applications on mobile devices with limited computational resources.

| homepage | github |

TVM - End to End Deep Learning Compiler Stack (Amazon)

TVM is a compiler stack for deep learning systems. It is designed to close the gap between the productivity-focused deep learning frameworks and the performance- and efficiency-focused hardware backends. TVM works with deep learning frameworks to provide end-to-end compilation to different backends. Check out the TVM stack homepage for more information.

| homepage | github | documentation | paper |



  • Compilation of deep learning models in Keras, MXNet, PyTorch, TensorFlow, CoreML, and DarkNet into minimal deployable modules on diverse hardware backends.

  • Infrastructure to automatically generate and optimize tensor operators on more backends with better performance.

Glow - A community-driven approach to AI infrastructure (Facebook)

Glow is a machine learning compiler that accelerates the performance of deep learning frameworks on different hardware platforms. It enables the ecosystem of hardware developers and researchers to focus on building next gen hardware accelerators that can be supported by deep learning frameworks like PyTorch.

| homepage | github | blog | paper |



  • High-level intermediate representation allows the optimizer to perform domain-specific optimizations.

  • Lower-level intermediate representation, an instruction-based address-only representation allows the compiler to perform memory-related optimizations, such as instruction scheduling, static memory allocation, and copy elimination.

  • The optimizer then performs machine-specific code generation to take advantage of specialized hardware features.

nGraph - A New Open Source Compiler for Deep Learning Systems (Intel)

We are pleased to announce the open sourcing of nGraph, a framework-neutral Deep Neural Network (DNN) model compiler that can target a variety of devices. With nGraph, data scientists can focus on data science rather than worrying about how to adapt their DNN models to train and run efficiently on different devices.

| homepage | documentation | github | paper |



  • Fusion: Fuse multiple ops to decrease memory usage.

  • Data layout abstraction: Make abstraction easier and faster with nGraph translating element order to work best for a given or available device.

  • Data reuse: Save results and reuse for subgraphs with the same input.

  • Graph scheduling: Run similar subgraphs in parallel via multi-threading.

  • Graph partitioning: Partition subgraphs to run on different devices to speed up computation; make better use of spare CPU cycles with nGraph.

  • Memory management: Prevent peak memory usage by intercepting a graph with a "saved checkpoint," and enable data auditing.
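
What the fusion bullet buys can be sketched in plain Python rather than nGraph's actual IR. Unfused execution materializes an intermediate tensor between ops; the fused version streams through the data once with no intermediate buffer:

```python
# Illustration only: op fusion as a compiler transformation, sketched on
# Python lists instead of an IR. The fused form avoids allocating (and later
# re-reading) the intermediate result, which is where the memory savings come from.

def unfused(xs):
    scaled = [x * 2.0 for x in xs]      # op 1: materializes an intermediate
    return [s + 1.0 for s in scaled]    # op 2: reads the intermediate back

def fused(xs):
    return [x * 2.0 + 1.0 for x in xs]  # one pass, no intermediate storage

data = [0.0, 1.0, 2.0]
assert unfused(data) == fused(data)     # same result, less memory traffic
```

A real compiler performs this rewrite on the graph automatically, for far more op combinations than a multiply-add.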

ONNX - Open Neural Network Exchange

ONNX is an open format to represent deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them. ONNX is developed and supported by a community of partners.

| homepage | documentation | github |


Large-Scale Distributed AI Training Efforts

Major milestones for "ImageNet in X nanoseconds" 🎢.

| initial date | resources | elapsed | top-1 accuracy | batch size | link |
| --- | --- | --- | --- | --- | --- |
| June 2017 | 256 NVIDIA P100 GPUs | 1 hour | 76.3% | 8192 | https://arxiv.org/abs/1706.02677 |
| Aug 2017 | 256 NVIDIA P100 GPUs | 50 mins | 75.01% | 8192 | https://arxiv.org/abs/1708.02188 |
| Sep 2017 | 512 KNLs -> 2048 KNLs | 1 hour -> 20 mins | 72.4% -> 75.4% | 32768 | https://arxiv.org/abs/1709.05011 |
| Nov 2017 | 128 Google TPUs (v2) | 30 mins | 76.1% | 16384 | https://arxiv.org/abs/1711.00489 |
| Nov 2017 | 1024 NVIDIA P100 GPUs | 15 mins | 74.9% | 32768 | https://arxiv.org/abs/1711.04325 |
| Jul 2018 | 2048 NVIDIA P40 GPUs | 6.6 mins | 75.8% | 65536 | https://arxiv.org/abs/1807.11205 |
| Nov 2018 | 2176 NVIDIA V100 GPUs | 3.7 mins | 75.03% | 69632 | https://arxiv.org/abs/1811.05233 |
| Nov 2018 | 1024 Google TPUs (v3) | 2.2 mins | 76.3% | 32768 | https://arxiv.org/abs/1811.06992 |

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour


  • Topology-aware communication
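
This paper's core recipe is the linear scaling rule with gradual warmup: scale the learning rate proportionally to the batch size (reference lr 0.1 at batch size 256) and ramp up to it linearly over the first 5 epochs. A minimal sketch, with those reference values taken from the paper and the interpolation detail simplified to whole epochs:

```python
# Sketch of the linear scaling rule with gradual warmup (Goyal et al.):
# lr grows linearly with batch size, and the first few epochs ramp up from
# the base lr to the scaled target to keep early training stable.
# Warmup here is per-epoch for brevity; the paper interpolates per-iteration.

def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    target = base_lr * batch_size / base_batch   # linear scaling rule
    if epoch < warmup_epochs:
        # Linearly interpolate from base_lr up to the scaled target.
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

# With batch size 8192, the post-warmup learning rate is 0.1 * 8192/256 = 3.2.
lr = learning_rate(5, 8192)
```

Without the warmup, jumping straight to a 32x-larger learning rate tends to diverge; with it, the paper matched small-batch accuracy at batch size 8192.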

ImageNet Training in Minutes

  • Layer-wise Adaptive Rate Scaling (LARS)
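
LARS gives each layer its own learning rate, proportional to the ratio of the layer's weight norm to its gradient norm, so no single layer's update overwhelms its weights at very large batch sizes. A minimal sketch of the local-rate computation (the trust coefficient and weight decay values below are illustrative, not the paper's tuned settings):

```python
import math

# Sketch of Layer-wise Adaptive Rate Scaling (LARS): the local learning rate
# for a layer is  eta * ||w|| / (||grad|| + weight_decay * ||w||),
# then combined with the global schedule and momentum in the full algorithm.

def lars_local_lr(weights, grads, eta=0.001, weight_decay=0.0005):
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    return eta * w_norm / (g_norm + weight_decay * w_norm)

# Toy layer: ||w|| = 5.0, ||grad|| = 1.0.
lr = lars_local_lr([3.0, 4.0], [0.6, 0.8])
```

A layer whose gradients are large relative to its weights automatically gets a smaller step, which is what keeps 32K-sample batches trainable.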

Don't Decay the Learning Rate, Increase the Batch Size

  • Decaying the learning rate is simulated annealing
  • Instead of decaying the learning rate, increase the batch size during training
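
The paper's observation is that dividing the learning rate by a factor and multiplying the batch size by the same factor reduce the SGD "noise scale" equivalently, so the usual step-decay schedule can be replaced by a batch-size-growth schedule. A sketch with illustrative milestones (not the paper's exact experiment settings):

```python
# Sketch of "increase the batch size instead of decaying the learning rate":
# at each milestone, a conventional schedule would divide lr by `factor`;
# here the lr stays fixed and the batch size is multiplied by `factor` instead.

def schedule(epoch, base_lr=0.1, base_batch=256, milestones=(30, 60, 80), factor=5):
    n_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr, base_batch * factor ** n_drops

lr, batch = schedule(65)   # two milestones passed -> batch has grown 25x
```

Larger batches late in training mean fewer parameter updates per epoch, which is where the wall-clock savings come from on parallel hardware.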

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

  • RMSprop Warm-up
  • Slow-start learning rate schedule
  • Batch normalization without moving averages

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

  • Mixed precision
  • Layer-wise Adaptive Rate Scaling (LARS)
  • Improvements on model architecture
  • Communication: tensor fusion, hierarchical all-reduce, hybrid all-reduce
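
The loss-scaling trick at the heart of mixed-precision training can be sketched in a few lines: the loss is multiplied by a large constant before backpropagation so that small gradients survive fp16's limited range, and the gradients are divided by the same constant before the weight update. The toy `backward` function below is a hypothetical stand-in for real backprop:

```python
# Sketch of loss scaling for mixed-precision training. In a real system the
# forward/backward pass runs in fp16 and the master weights stay in fp32;
# here a toy "gradient" function stands in for backpropagation.

LOSS_SCALE = 1024.0

def backward(loss):
    # Hypothetical: pretend each gradient is proportional to the loss.
    # In fp16, values this small could underflow to zero without scaling.
    return [loss * 1e-6, loss * 2e-6]

scaled_grads = backward(0.5 * LOSS_SCALE)        # backprop on the scaled loss
grads = [g / LOSS_SCALE for g in scaled_grads]   # unscale before the update
```

Because scaling the loss scales every gradient by the same factor, the unscaled gradients are mathematically identical to unscaled training, just computed in a numerically safer range.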

ImageNet/ResNet-50 Training in 224 Seconds

  • Batch size control to reduce accuracy degradation with mini-batch size exceeding 32K
  • Communication: 2D-torus all-reduce
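
The 2D-torus all-reduce decomposes one big all-reduce into a reduction along each row of the device grid followed by a reduction down each column, which shortens each ring and lowers per-link traffic. The toy version below checks only the arithmetic decomposition on a 2x3 grid of scalars (the real algorithm does ring reduce-scatter/all-gather in each dimension):

```python
# Sketch of the 2D-torus all-reduce decomposition: row-wise partial sums,
# then column-wise reduction of those partials, leaves every device holding
# the same global sum as a single flat all-reduce would.

def torus_allreduce(grid):
    # Horizontal phase: every device in a row obtains the row sum.
    row_sums = [sum(row) for row in grid]
    # Vertical phase: row sums are reduced down the columns and broadcast.
    total = sum(row_sums)
    return [[total] * len(row) for row in grid]

grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
result = torus_allreduce(grid)
```

With thousands of GPUs, two short rings (one per torus dimension) finish far sooner than one ring spanning every device, since ring latency grows with ring length.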

Image Classification at Supercomputer Scale

  • Mixed precision
  • Layer-wise Adaptive Rate Scaling (LARS)
  • Distributed batch normalization
  • Input pipeline optimization: dataset sharding and caching, prefetch, fused JPEG decoding and cropping, parallel data parsing
  • Communication: 2D gradient summation

Machine Learning System Lectures

CSE 599W Systems for ML (University of Washington)

Over the past few years, deep learning has become an important technique to successfully solve problems in many different fields, such as vision, NLP, and robotics. An important ingredient driving this success is the development of deep learning systems that efficiently support the task of learning and inference of complicated models using many devices and possibly distributed resources. The study of how to build and optimize these deep learning systems is now an active area of research and commercialization, and yet there isn't a course that covers this topic.

This course is designed to fill this gap. We will be covering various aspects of deep learning systems, including: basics of deep learning, programming models for expressing machine learning models, automatic differentiation, memory optimization, scheduling, distributed learning, hardware acceleration, domain-specific languages, and model serving. Many of these topics intersect with existing research directions in databases, systems and networking, architecture, and programming languages. The goal is to offer a comprehensive picture of how deep learning systems work, discuss and execute on possible research opportunities, and build open-source software that will have broad appeal.

| link | github | materials |

CSCE 790: Machine Learning Systems

When we talk about Machine Learning (ML), we typically refer to a technique or an algorithm that gives computer systems the ability to learn and to reason with data. However, there is a lot more to ML than just implementing an algorithm or a technique. In this course, we will learn the fundamental differences between ML as a technique and ML as a system in production. A machine learning system involves a significant number of components, and it is important that they remain responsive in the face of failure and changes in load. This course covers several strategies to keep ML systems responsive, resilient, and elastic. Machine learning systems are different from other computer systems when it comes to building, testing, deploying, delivering, and evolving. ML systems also have unique challenges when we need to change the architecture or behavior of the system. Therefore, it is essential to learn how to deal with the challenges that arise only when building real-world, production-ready ML systems (e.g., performance issues, memory leaks, communication issues, multi-GPU issues, etc.). The focus of this course will be primarily on deep learning systems, but the principles remain similar across all ML systems.

| link | github | materials |