Awesome System for Machine Learning

Path to system for AI [Whitepaper You Must Read]

A curated list of research on machine learning systems. Links to code are included where available. I also summarize papers that I find especially interesting.

AI system

Table of Contents



System for AI

AI for System

PR template

- Title [[Paper]](link) [[GitHub]](link)
  - Author (*conference(journal) year*)
  - Summary: 


  • Computer Architecture: A Quantitative Approach [Must read]
  • Streaming Systems [Book]
  • Kubernetes in Action (start to read) [Book]
  • Machine Learning Systems: Designs that scale [Website]


  • System thinking. A TED talk. [YouTube]
  • Flexible systems are the next frontier of machine learning. Jeff Dean [YouTube]
  • Is It Time to Rewrite the Operating System in Rust? [YouTube]
  • InfoQ: AI, ML and Data Engineering [YouTube]
    • Start to watch.
  • Netflix: Human-centric Machine Learning Infrastructure [InfoQ]
  • SysML 2019: [YouTube]
  • ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
  • ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
  • A New Golden Age for Computer Architecture History, Challenges, and Opportunities. David Patterson [YouTube]
  • How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
  • SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
  • SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]



  • The Deep Learning Toolset — An Overview Blog
  • Summary of CSE 599W: Systems for ML [Chinese Blog]
  • Polyaxon, Argo and Seldon for Model Training, Package and Deployment in Kubernetes [Blog]
  • Overview of the different approaches to putting Machine Learning (ML) models in production [Blog]
  • Architecting a Machine Learning Pipeline [Part1][Part2]
  • Model Serving in PyTorch [Blog]
  • Machine learning in Netflix [Medium]
  • SciPy Conference Materials (slides, repo) [GitHub]
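  • After Spark, UC Berkeley launches Ray, a next-generation AI computing engine [Chinese Blog]
  • What knowledge background do you need to understand or do research on machine learning / deep learning systems? [Zhihu]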
  • Learn Kubernetes in Under 3 Hours: A Detailed Guide to Orchestrating Containers [Blog] [GitHub]


  • Survey on End-To-End Machine Learning Automation [Paper] [GitHub]
  • Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
    • Ananthanarayanan, Rajagopal, et al. (OpML 2019)
  • Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools [Paper]
    • Summary:
  • How (and How Not) to Write a Good Systems Paper [Advice]
  • Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
    • Hazelwood, Kim, et al. (HPCA 2018)
  • Infrastructure for Usable Machine Learning: The Stanford DAWN Project
    • Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
  • Hidden technical debt in machine learning systems [Paper]
    • Sculley, David, et al. (NIPS 2015)
    • Summary:
  • End-to-end arguments in system design [Paper]
    • Saltzer, Jerome H., David P. Reed, and David D. Clark.
  • System Design for Large Scale Machine Learning [Thesis]
  • Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
    • Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
    • Summary: This paper presents a characterization of DL models and then discusses new design principles for DL hardware.

Useful Tools


  • NetworKit is a growing open-source toolkit for large-scale network analysis. [GitHub]
  • gpu-sentry: Flask-based package for monitoring utilisation of NVIDIA GPUs. [GitHub]
  • anderskm/gputil: A Python module for getting the GPU status from NVIDIA GPUs programmatically using nvidia-smi [GitHub]
  • Pytorch-Memory-Utils: track your GPU memory usage during training with PyTorch. [GitHub]
  • torchstat: a lightweight neural network analyzer based on PyTorch. [GitHub]
  • NVIDIA GPU Monitoring Tools [GitHub]
  • PyTorch/cpuinfo: cpuinfo is a library that detects information about the host CPU that is essential for performance optimization. [GitHub]
  • Popular Network memory consumption and FLOP counts [GitHub]
  • Intel® VTune™ Amplifier [Website]
    • Stop guessing why software is slow. Advanced sampling and profiling techniques quickly analyze your code, isolate issues, and deliver insights for optimizing performance on modern processors
  • Pyflame: A Ptracing Profiler For Python [GitHub]


  • Facebook AI Performance Evaluation Platform [GitHub]
  • Netron: Visualizer for deep learning and machine learning models [GitHub]
  • Facebook/FBGEMM: FBGEMM (Facebook GEneral Matrix Multiplication) is a low-precision, high-performance matrix-matrix multiplication and convolution library for server-side inference. [GitHub]
  • Dslabs: Distributed Systems Labs and Framework for UW system course [GitHub]
  • Machine Learning Model Zoo [Website]
  • MLPerf Benchmark Suite/Inference: Reference implementations of inference benchmarks [GitHub]
  • Faiss: A library for efficient similarity search and clustering of dense vectors [GitHub]
  • Microsoft/MMdnn: A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. [GitHub]
  • gpushare-scheduler-extender [GitHub]
    • More and more data scientists run their Nvidia GPU based inference tasks on Kubernetes. Some of these tasks can be run on the same Nvidia GPU device to increase GPU utilization. So one important challenge is how to share GPUs between the pods
  • TensorRT [NVIDIA]
    • It is designed to work in a complementary fashion with training frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an already trained network quickly and efficiently on a GPU for the purpose of generating a result


  • iterative/dvc: Data & models versioning for ML projects, make them shareable and reproducible [GitHub]
  • Machine Learning for .NET [GitHub]
    • ML.NET is a cross-platform open-source machine learning framework which makes machine learning accessible to .NET developers.
    • ML.NET allows .NET developers to develop their own models and infuse custom machine learning into their applications, using .NET, even without prior expertise in developing or tuning machine learning models.
  • ONNX: Open Neural Network Exchange [GitHub]
  • BentoML: Machine Learning Toolkit for packaging and deploying models [GitHub]
  • ModelDB: A system to manage ML models [GitHub] [MIT short paper]
  • EuclidesDB: A multi-model machine learning feature embedding database [GitHub]
  • Prefect: Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. [GitHub]
  • MindsDB: MindsDB's goal is to make it very simple for developers to use the power of artificial neural networks in their projects [GitHub]
  • PAI: OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. [Microsoft Project]
  • Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems [Facebook Project]
  • Osquery is a SQL powered operating system instrumentation, monitoring, and analytics framework. [Facebook Project]
  • Seldon: Seldon Core is an open source platform for deploying machine learning models on a Kubernetes cluster. [GitHub]
  • Kubeflow: Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. [GitHub]
  • Polyaxon: A platform for reproducible and scalable machine learning and deep learning on Kubernetes. [GitHub]

Data Processing

  • Kedro is a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned. [GitHub]
  • Google/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more [GitHub]
  • CuPy: NumPy-like API accelerated with CUDA [GitHub]
  • Modin: Speed up your Pandas workflows by changing a single line of code [GitHub]
  • Weld: Weld is a runtime for improving the performance of data-intensive applications. [Project Website]
  • Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines [Project Website]
    • Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe. (PLDI 2013)
    • Summary: Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines.

Machine Learning System Papers (Training)

Class materials for a distributed systems lecture series [GitHub]


  • Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks. [Paper] [GitHub]
    • Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. (ICML 2018)
  • Mesh-TensorFlow: Deep Learning for Supercomputers [Paper] [GitHub]
    • Shazeer, Noam, Youlong Cheng, Niki Parmar, Dustin Tran, et al. (NIPS 2018)
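    • Summary: Generalizes data parallelism by letting any tensor dimension be split across a mesh of processors, which enables model parallelism for large language models.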
  • PyTorch-BigGraph: A Large-scale Graph Embedding System [Paper] [GitHub]
    • Lerer, Adam, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. (SysML 2019)
  • Beyond data and model parallelism for deep neural networks [Paper] [GitHub]
    • Jia, Zhihao, Matei Zaharia, and Alex Aiken. (SysML 2019)
    • Summary: SOAP (sample, operation, attribute, and parameter) parallelism. Operator graph, device topology, and execution optimizer. MCMC search algorithm and execution simulator.
  • Device placement optimization with reinforcement learning [Paper]
    • Mirhoseini, Azalia, Hieu Pham, Quoc V. Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. (ICML 17)
    • Summary: Uses REINFORCE to learn a device placement policy. Groups operations for execution. Needs a lot of GPUs.
  • Spotlight: Optimizing device placement for training deep neural networks [Paper]
    • Gao, Yuanxiang, Li Chen, and Baochun Li (ICML 18)
  • GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [Paper][GitHub] [News]
    • Huang, Yanping, et al. (arXiv preprint arXiv:1811.06965 (2018))
    • Summary:
  • Horovod: Distributed training framework for TensorFlow, Keras, and PyTorch. [GitHub]
  • Distributed machine learning infrastructure for large-scale robotics research [GitHub] [Blog]
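
Pipeline parallelism, as named in the GPipe entry above, splits a mini-batch into micro-batches so that model stages placed on different devices can work at the same time. The sketch below is only a rough illustration of that idea: it builds a forward-only schedule and measures how much device time is idle. It assumes every stage takes one time step per micro-batch; the function names are hypothetical and this is not GPipe's implementation.

```python
from typing import List, Tuple


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Tuple[int, int]]]:
    """Return, for each time step, the (stage, microbatch) pairs that are busy.

    Assumes stage s can start micro-batch m at time step s + m.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        active = [
            (stage, t - stage)
            for stage in range(num_stages)
            if 0 <= t - stage < num_microbatches
        ]
        schedule.append(active)
    return schedule


def idle_fraction(schedule: List[List[Tuple[int, int]]], num_stages: int) -> float:
    """Fraction of (stage, time step) slots in which a stage has nothing to do."""
    busy_slots = sum(len(step) for step in schedule)
    total_slots = num_stages * len(schedule)
    return 1.0 - busy_slots / total_slots


sched = forward_schedule(num_stages=4, num_microbatches=8)
print(len(sched), idle_fraction(sched, num_stages=4))
```

Under these assumptions the idle fraction equals (K - 1) / (M + K - 1) for K stages and M micro-batches, so using more micro-batches shrinks the idle share.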

Training (Multi-jobs on cluster)

  • Gandiva: Introspective cluster scheduling for deep learning. [Paper]
    • Xiao, Wencong, et al. (OSDI 2018)
    • Summary: Improves the efficiency of hyper-parameter search in a cluster. Aware of hardware utilization.
  • Optimus: an efficient dynamic resource scheduler for deep learning clusters [Paper]
    • Peng, Yanghua, et al. (EuroSys 2018)
    • Summary: Job scheduling on clusters. Uses total completion time as the metric.
  • Multi-tenant GPU clusters for deep learning workloads: Analysis and implications. [Paper] [wait dataset]
    • Jeon, Myeongjae, Shivaram Venkataraman, Junjie Qian, Amar Phanishayee, Wencong Xiao, and Fan Yang
  • Slurm: A Highly Scalable Workload Manager [GitHub]
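
Completion time is the metric mentioned for cluster scheduling above. As a rough illustration of why scheduling order matters for that metric, the sketch below compares running jobs in arrival order with running the shortest job first on a single worker. The job durations and function names are hypothetical, and this is not the algorithm used by Optimus or Gandiva.

```python
from typing import List


def completion_times(durations: List[float]) -> List[float]:
    """Completion time of each job when jobs run back to back in the given order."""
    finished = []
    clock = 0.0
    for duration in durations:
        clock += duration
        finished.append(clock)
    return finished


def average_completion_time(durations: List[float]) -> float:
    """Mean completion time over all jobs for the given execution order."""
    times = completion_times(durations)
    return sum(times) / len(times)


jobs = [8.0, 1.0, 3.0]  # hypothetical training-job durations in hours
fifo = average_completion_time(jobs)          # run in arrival order
sjf = average_completion_time(sorted(jobs))   # run shortest job first
print(fifo, sjf)
```

The last job finishes at the same time in both orders, but the average completion time is lower when short jobs go first.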

Model Serving

  • Deep Learning Inference Service at Microsoft [Paper]
    • J Soifer, et al. (OpML 2019)
  • PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. [Paper]
    • Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
    • Summary:
  • Brusta: PyTorch model serving project [GitHub]
  • Model Server for Apache MXNet: Model Server for Apache MXNet is a tool for serving neural net models for inference [GitHub]
  • TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website] [GitHub]
    • Baylor, Denis, et al. (KDD 2017)
  • TensorFlow-Serving: Flexible, high-performance ML serving [Paper] [GitHub]
    • Olston, Christopher, et al.
  • IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
  • Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (NSDI 2017)
    • Summary: Adaptive batching
  • InferLine: ML Inference Pipeline Composition Framework [Paper]
    • Crankshaw, Daniel, et al. (Preprint)
    • Summary: An updated version of Clipper.
  • TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
    • Dakkak, Abdul, et al. (Preprint)
    • Summary: Addresses the model cold-start problem.
  • Rafiki: machine learning as an analytics service system [Paper] [GitHub]
    • Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
    • Summary: Contains both training and inference. Automatic hyper-parameter search for training. Ensemble models for inference. Uses DRL to balance the trade-off between accuracy and latency.
  • GraphPipe: Machine Learning Model Deployment Made Simple [GitHub]
  • Nexus: Nexus is a scalable and efficient serving system for DNN applications on GPU cluster. [GitHub]
  • DeepCPU: Serving RNN-based deep learning models 10x faster. [Paper]
    • Zhang, M., Rajbhandari, S., Wang, W. and He, Y., 2018. (ATC 2018)
  • Orkhon: ML Inference Framework and Server Runtime [GitHub]
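
The Clipper entry above mentions adaptive batching. One common way to adapt a batch size to a latency target is additive increase, multiplicative decrease (AIMD): grow the batch while observed latency stays under the target, and shrink it when the target is exceeded. The sketch below is a minimal illustration under that assumption. The class name, its parameters, and the linear latency model are hypothetical and are not Clipper's API.

```python
class AIMDBatchSizer:
    """Adjust a batch size toward the largest value that meets a latency target."""

    def __init__(self, latency_slo_ms: float, step: int = 2,
                 backoff: float = 0.9, max_batch: int = 1024):
        self.latency_slo_ms = latency_slo_ms
        self.step = step            # additive increase per successful batch
        self.backoff = backoff      # multiplicative decrease factor on a miss
        self.max_batch = max_batch
        self.batch_size = 1

    def update(self, observed_latency_ms: float) -> int:
        """Update and return the batch size given the latency of the last batch."""
        if observed_latency_ms <= self.latency_slo_ms:
            self.batch_size = min(self.max_batch, self.batch_size + self.step)
        else:
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        return self.batch_size


def fake_model_latency_ms(batch_size: int) -> float:
    """Hypothetical latency model: fixed overhead plus a per-item cost."""
    return 5.0 + 0.5 * batch_size


sizer = AIMDBatchSizer(latency_slo_ms=20.0)
history = []
for _ in range(50):
    latency = fake_model_latency_ms(sizer.batch_size)
    history.append(sizer.update(latency))
print(history[-5:])
```

With this toy latency model the batch size climbs until a batch misses the 20 ms target, backs off, and then oscillates just around the largest size that fits.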

Machine Learning System Papers (Inference)

  • Dynamic Space-Time Scheduling for GPU Inference [Paper]
    • Jain, Paras, et al. (NIPS 18, System for ML)
    • Summary:
  • Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
    • Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (On going)
  • Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
    • D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
    • Summary: Their system, HiveMind, takes as input models grouped into model batches that are amenable to co-optimization and co-execution. It consists of a compiler and a runtime.

Machine Learning Compiler

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Project Website]
    • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper] [YouTube]
      • Chen, Tianqi, et al. (OSDI 2018)
  • Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
  • Tensorflow/mlir: "Multi-Level Intermediate Representation" Compiler Infrastructure [GitHub] [Video]
  • PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]

Machine Learning Infrastructure

  • cortexlabs/cortex: Deploy machine learning applications without worrying about setting up infrastructure, managing dependencies, or orchestrating data pipelines. [GitHub]

AutoML System

  • Taking human out of learning applications: A survey on automated machine learning. [Must Read Survey]
    • Quanming, Y., Mengshuo, W., Hugo, J.E., Isabelle, G., Yi-Qi, H., Yu-Feng, L., Wei-Wei, T., Qiang, Y. and Yang, Y.
  • Auto-sklearn: Automated Machine Learning with scikit-learn [GitHub] [Paper]
  • Katib: A Distributed General AutoML Platform on Kubernetes [GitHub] [Paper]
  • NNI: An open source AutoML toolkit for neural architecture search and hyper-parameter tuning [GitHub]
  • AutoKeras: Accessible AutoML for deep learning. [GitHub]
  • Facebook/Ax: Adaptive experimentation is the machine-learning guided process of iteratively exploring a (possibly infinite) parameter space in order to identify optimal configurations in a resource-efficient manner. [GitHub]
  • DeepSwarm: DeepSwarm is an open-source library which uses Ant Colony Optimization to tackle the neural architecture search problem. [GitHub]
  • Google/AdaNet: AdaNet is a lightweight TensorFlow-based framework for automatically learning high-quality models with minimal expert intervention. Importantly, AdaNet provides a general framework for not only learning a neural network architecture, but also for learning to ensemble to obtain even better models. [GitHub]

Deep Reinforcement Learning System

  • Ray: A Distributed Framework for Emerging AI Applications [GitHub]
    • Moritz, Philipp, et al. (OSDI 2018)
    • Summary: Distributed DRL training, simulation and inference system. Can be used as a high-performance Python framework.
  • Elf: An extensive, lightweight and flexible research platform for real-time strategy games [Paper] [GitHub]
    • Tian, Yuandong, Qucheng Gong, Wenling Shang, Yuxin Wu, and C. Lawrence Zitnick. (NIPS 2017)
    • Summary:
  • Horizon: Facebook's Open Source Applied Reinforcement Learning Platform [Paper] [GitHub]
    • Gauci, Jason, et al. (preprint 2019)
  • RLgraph: Modular Computation Graphs for Deep Reinforcement Learning [Paper][GitHub]
    • Schaarschmidt, Michael, Sven Mika, Kai Fricke, and Eiko Yoneki. (SysML 2019)
    • Summary:

Video System


  • VideoFlow: Python framework that facilitates the quick development of complex video analysis applications and other series-processing based applications in a multiprocessing environment. [GitHub]
  • VidGear: Powerful Multi-Threaded OpenCV and FFmpeg based Turbo Video Processing Python Library with unique State-of-the-Art Features. [GitHub]
  • NVIDIA DALI: A library containing both highly optimized building blocks and an execution engine for data pre-processing in deep learning applications [GitHub]
  • TensorStream: A library for real-time video stream decoding to CUDA memory [GitHub]
  • C++ image processing library using SIMD: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512, VMX(Altivec) [GitHub]
  • Pretrained image and video models for PyTorch. [GitHub]
  • LiveDetect - Live video client to DeepDetect. [GitHub]


  • CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video [Paper]
    • Mao, Huizi, Taeyoung Kong, and William J. Dally. (SysML2019)
  • Live Video Analytics at Scale with Approximation and Delay-Tolerance [Paper]
    • Zhang, Haoyu, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. (NSDI 2017)
  • Chameleon: scalable adaptation of video analytics [Paper]
    • Jiang, Junchen, et al. (SIGCOMM 2018)
    • Summary: Configuration controller for balancing accuracy and resource usage. The golden configuration is a good design. The cost of periodic profiling often exceeded any resource savings gained by adapting the configurations.
  • NoScope: optimizing neural network queries over video at scale [Paper] [GitHub]
    • Kang, Daniel, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. (VLDB2017)
    • Summary:
  • SVE: Distributed video processing at Facebook scale [Paper]
    • Huang, Qi, et al. (SOSP2017)
    • Summary:
  • Scanner: Efficient Video Analysis at Scale [Paper][GitHub]
    • Poms, Alex, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian (SIGGRAPH 2018)
    • Summary:
  • A cloud-based large-scale distributed video analysis system [Paper]
    • Wang, Yongzhe, et al. (ICIP 2016)
  • Rosetta: Large scale system for text detection and recognition in images [Paper]
    • Borisyuk, Fedor, Albert Gordo, and Viswanath Sivakumar. (KDD 2018)
    • Summary:
  • Neural adaptive content-aware internet video delivery. [Paper] [GitHub]
    • Yeo, H., Jung, Y., Kim, J., Shin, J. and Han, D., 2018. (OSDI 2018)
    • Summary: Combines video super-resolution and adaptive bitrate (ABR) streaming.

Edge or Mobile Papers

  • Mobile Computer Vision @ Facebook [GitHub]
  • Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. [Paper]
    • Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J. and Tang, L., 2017, April.
    • In ACM SIGARCH Computer Architecture News (Vol. 45, No. 1, pp. 615-629). ACM.
  • 26ms Inference Time for ResNet-50: Towards Real-Time Execution of all DNNs on Smartphone [Paper]
    • Wei Niu, Xiaolong Ma, Yanzhi Wang, Bin Ren (ICML2019)
  • NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision [Paper]
    • Fang, Biyi, Xiao Zeng, and Mi Zhang. (MobiCom 2018)
    • Summary: Borrows ideas from network pruning. The pruned model can then be recovered to trade off computation resources against accuracy at runtime.
  • Lavea: Latency-aware video analytics on edge computing platform [Paper]
    • Yi, Shanhe, et al. (Second ACM/IEEE Symposium on Edge Computing. ACM, 2017.)
  • Scaling Video Analytics on Constrained Edge Nodes [Paper] [GitHub]
    • Canel, C., Kim, T., Zhou, G., Li, C., Lim, H., Andersen, D. G., Kaminsky, M., and Dulloo (SysML 2019)
  • alibaba/MNN: MNN is a lightweight deep neural network inference engine. It loads models and does inference on devices. [GitHub]
  • XiaoMi/mobile-ai-bench: Benchmarking Neural Network Inference on Mobile Devices [GitHub]
  • XiaoMi/mace-models: Mobile AI Compute Engine Model Zoo [GitHub]

Resource Management

  • Resource management with deep reinforcement learning [Paper] [GitHub]
    • Mao, Hongzi, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula (ACM HotNets 2016)
    • Summary: Highly cited paper. Nice definition. An example solution that translates the problem of packing tasks with multiple resource demands into a learning problem and then uses DRL to solve it.
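
To make the framing above more concrete, here is a minimal sketch of a multi-resource packing problem expressed as an environment with a state, actions, and a reward. The class, the two resource types, the reward (negative count of waiting jobs), and the first-fit baseline are all hypothetical simplifications rather than the paper's formulation. Jobs never finish in this toy version, and a DRL agent would take the place of the baseline policy.

```python
from typing import List, Optional, Tuple

Demand = Tuple[int, int]  # (cpu, memory) units; hypothetical resource types


class PackingEnv:
    """Toy environment: pick a waiting job to place on a machine with limited resources."""

    def __init__(self, capacity: Demand, jobs: List[Demand]):
        self.free = list(capacity)
        self.queue = list(jobs)

    def state(self):
        """Observation an agent would see: free resources and waiting job demands."""
        return tuple(self.free), tuple(self.queue)

    def fits(self, job: Demand) -> bool:
        return all(need <= free for need, free in zip(job, self.free))

    def step(self, action: Optional[int]) -> float:
        """Place queue[action] if it fits; None or an invalid index does nothing.

        Returns the reward, here the negative number of jobs still waiting.
        """
        if action is not None and 0 <= action < len(self.queue) and self.fits(self.queue[action]):
            job = self.queue.pop(action)
            self.free = [free - need for free, need in zip(self.free, job)]
        return -float(len(self.queue))


def first_fit_policy(env: PackingEnv) -> Optional[int]:
    """Baseline policy: choose the first waiting job that fits, if any."""
    for index, job in enumerate(env.queue):
        if env.fits(job):
            return index
    return None


env = PackingEnv(capacity=(4, 8), jobs=[(3, 2), (2, 4), (1, 6), (1, 1)])
rewards = []
while True:
    action = first_fit_policy(env)
    if action is None:
        break
    rewards.append(env.step(action))
print(rewards, env.state())
```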

Advanced Theory

  • Differentiable MPC for End-to-end Planning and Control [Paper] [GitHub]
    • Amos, Brandon, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter (NIPS 2018)

Traditional System Optimization Papers

  • AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers [Paper]
    • Gandhi, Anshul, et al. (TOCS 2012)
  • Large-scale cluster management at Google with Borg [Paper]
    • Verma, Abhishek, et al. (EuroSys 2015)