Awesome Production Machine Learning

This repository contains a curated list of awesome open source libraries that will help you deploy, monitor, version, scale and secure your production machine learning 🚀

Quick links to sections in this page

🔍 Explaining Predictions & Models 🔏 Privacy Preserving ML 📜 Model & Data Versioning
🏁 Model Training Orchestration 💪 Model Serving & Monitoring 🤖 AutoML
🧵 Data Pipeline 🏷️ Data Labelling & Synthesis 📅 Metadata Management
🗺️ Computation Distribution 📥 Model Serialisation 🧮 Optimized Computation
💸 Data Stream Processing 🔴 Outlier & Anomaly Detection 🎁 Feature Store
⚔ Adversarial Robustness 💾 Data Storage Optimization 📓 Data Science Notebook
🔥 Neural Search 🔩 Model Optimization, Compilation & Compression 👁️ Industry-strength Computer Vision
🔠 Industry-strength Natural Language Processing 🍕 Industry-strength Reinforcement Learning 📊 Industry-strength Visualisation
🙌 Industry-strength Recommender System 📈 Industry-strength Benchmarking & Evaluation 💰 Commercial Platform

10 Min Video Overview

This 10 minute video provides an overview of the motivations for machine learning operations as well as a high-level overview of some of the tools in this repo. This newer video covers an updated 2022 view of the state of MLOps.

Want to receive recurring updates on this repo and other advancements?

You can join the Machine Learning Engineer newsletter. Join over 10,000 ML professionals and enthusiasts who receive weekly curated articles & tutorials on production Machine Learning.
Also check out the Awesome Artificial Intelligence Guidelines List, where we aim to map the landscape of "Frameworks", "Codes of Ethics", "Guidelines", "Regulations", etc. related to Artificial Intelligence.

Main Content

Explaining Black Box Models and Datasets

  • Aequitas - An open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive risk-assessment tools.
  • AI Explainability 360 - Interpretability and explainability of data and machine learning models including a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics.
  • AI Fairness 360 - A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
  • Alibi - Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The initial focus of the library is on black-box, instance-based model explanations.
  • anchor - Code for the paper "High precision model agnostic explanations", a model-agnostic system that explains the behaviour of complex models with high-precision rules called anchors.
  • captum - A model interpretability and understanding library for PyTorch developed by Facebook. It contains general-purpose implementations of integrated gradients, saliency maps, SmoothGrad, VarGrad and others for PyTorch models.
  • casme - Example of using classifier-agnostic saliency map extraction on ImageNet presented on the paper "Classifier-agnostic saliency map extraction".
  • CleverHans - An adversarial example library for constructing attacks, building defenses, and benchmarking both; a Python library to benchmark a system's vulnerability to adversarial examples.
  • ContrastiveExplanation (Foil Trees) - Python script for model agnostic contrastive/counterfactual explanations for machine learning. Accompanying code for the paper "Contrastive Explanations with Local Foil Trees".
  • DeepLIFT - Codebase that contains the methods in the paper "Learning important features through propagating activation differences". Here are the slides and the video of the 15-minute talk given at ICML.
  • DeepVis Toolbox - This is the code required to run the Deep Visualization Toolbox, as well as to generate the neuron-by-neuron visualizations using regularized optimisation. The toolbox and methods are described casually here and more formally in this paper.
  • ELI5 - "Explain Like I'm 5" is a Python package which helps to debug machine learning classifiers and explain their predictions.
  • FACETS - Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.
  • Fairlearn - Fairlearn is a python toolkit to assess and mitigate unfairness in machine learning models.
  • FairML - FairML is a Python toolbox for auditing machine learning models for bias.
  • Fairness Comparison - This repository is meant to facilitate the benchmarking of fairness aware machine learning algorithms based on this paper.
  • Fairness Indicators - The tool supports teams in evaluating, improving, and comparing models for fairness concerns in partnership with the broader Tensorflow toolkit.
  • GEBI - Global Explanations for Bias Identification - Attention-based, summarized post-hoc explanations for detecting and identifying bias in data. It provides a global explanation and a step-by-step framework for detecting and testing bias. Python package for image data.
  • iNNvestigate - An open-source library for analyzing Keras models visually by methods such as DeepTaylor-Decomposition, PatternNet, Saliency Maps, and Integrated Gradients.
  • Integrated-Gradients - This repository provides code for implementing integrated gradients for networks with image inputs.
  • InterpretML - InterpretML is an open-source package for training interpretable models and explaining blackbox systems.
  • keras-vis - keras-vis is a high-level toolkit for visualizing and debugging your trained keras neural net models. Currently supported visualizations include: Activation maximization, Saliency maps, Class activation maps.
  • L2X - Code for replicating the experiments in the paper "Learning to Explain: An Information-Theoretic Perspective on Model Interpretation" at ICML 2018.
  • Lightly - A python framework for self-supervised learning on images. The learned representations can be used to analyze the distribution in unlabeled data and rebalance datasets.
  • Lightwood - A Pytorch based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly with an objective to build predictive models with one line of code.
  • LIME - Local Interpretable Model-agnostic Explanations for machine learning models.
  • LOFO Importance - LOFO (Leave One Feature Out) Importance calculates the importances of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set, and evaluating the performance of the model, with a validation scheme of choice, based on the chosen metric.
  • MindsDB - MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train and use state of the art ML models with as little as one line of code.
  • mljar-supervised - A Python package for AutoML on tabular data with feature engineering, hyper-parameters tuning, explanations and automatic documentation.
  • NETRON - Viewer for neural network, deep learning and machine learning models.
  • pyBreakDown - A model agnostic tool for decomposition of predictions from black boxes. Break Down Table shows contributions of every variable to a final prediction.
  • responsibly - Toolkit for auditing and mitigating bias and fairness of machine learning systems.
  • SHAP - SHapley Additive exPlanations is a unified approach to explain the output of any machine learning model (see the sketch after this list).
  • SHAPash - Shapash is a Python library that provides several types of visualization that display explicit labels that everyone can understand.
  • Skater - Skater is a unified framework to enable model interpretation for all forms of models, helping to build the interpretable machine learning systems often needed for real-world use cases.
  • tensorflow's Model Analysis - TensorFlow Model Analysis (TFMA) is a library for evaluating TensorFlow models. It allows users to evaluate their models on large amounts of data in a distributed manner, using the same metrics defined in their trainer.
  • themis-ml - themis-ml is a Python library built on top of pandas and sklearn that implements fairness-aware machine learning algorithms.
  • Themis - Themis is a testing-based approach for measuring discrimination in a software system.
  • Transformer Debugger - Transformer Debugger (TDB) is a tool developed by OpenAI's Superalignment team with the goal of supporting investigations into specific behaviors of small language models.
  • TreeInterpreter - Package for interpreting scikit-learn's decision tree and random forest predictions. Allows decomposing each prediction into bias and feature contribution components as described here.
  • WhatIf - An easy-to-use interface for expanding understanding of a black-box classification or regression ML model.
  • woe - Tools for WoE Transformation, mostly used in ScoreCard Models for credit rating.
  • XAI - eXplainableAI - An eXplainability toolbox for machine learning.
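
As a quick, hedged illustration of how the attribution libraries in this list are typically used, here is a minimal sketch of SHAP applied to a tree-based regressor; the dataset, model and sample sizes are illustrative placeholders, not part of any project above.

```python
# Minimal SHAP sketch (assumes `pip install shap scikit-learn`); the dataset and
# model choices below are illustrative, not prescribed by the SHAP project.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)             # efficient explainer for tree ensembles
shap_values = explainer.shap_values(X.iloc[:200])

# Global view of which features drive predictions and in which direction
shap.summary_plot(shap_values, X.iloc[:200])
```

Most other attribution tools above (LIME, captum, InterpretML) follow the same explain-then-visualise pattern, differing mainly in the explanation method and supported model types.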

Privacy Preserving ML

  • BastionLab - BastionLab is a framework for confidential data science collaboration. It uses Confidential Computing, Access control data science, and Differential Privacy to enable data scientists to remotely perform data exploration, statistics, and training on confidential data while ensuring maximal privacy for data owners.
  • Concrete-ML - Concrete-ML is a Privacy-Preserving Machine Learning (PPML) open-source set of tools built on top of The Concrete Framework by Zama. It aims to simplify the use of fully homomorphic encryption (FHE) for data scientists to help them automatically turn machine learning models into their homomorphic equivalent.
  • Fedlearner - Fedlearner is a collaborative machine learning framework that enables joint modeling of data distributed between institutions.
  • FATE - FATE (Federated AI Technology Enabler) is the world's first industrial grade federated learning open source framework to enable enterprises and institutions to collaborate on data while protecting data security and privacy.
  • FedML - FedML provides a research and production integrated edge-cloud platform for Federated/Distributed Machine Learning anywhere, at any scale.
  • Flower - Flower is a Federated Learning Framework with a unified approach. It enables the federation of any ML workload, with any ML framework, and any programming language.
  • Google's Differential Privacy - This is a C++ library of ε-differentially private algorithms, which can be used to produce aggregate statistics over numeric data sets containing private or sensitive information (the sketch after this list illustrates the underlying Laplace mechanism).
  • Intel Homomorphic Encryption Backend - The Intel HE transformer for nGraph is a Homomorphic Encryption (HE) backend to the Intel nGraph Compiler, Intel's graph compiler for Artificial Neural Networks.
  • Microsoft SEAL - Microsoft SEAL is an easy-to-use open-source (MIT licensed) homomorphic encryption library developed by the Cryptography Research group at Microsoft.
  • OpenFL - OpenFL is a Python framework for Federated Learning. OpenFL is designed to be a flexible, extensible and easily learnable tool for data scientists. OpenFL is developed by Intel Internet of Things Group (IOTG) and Intel Labs.
  • PySyft - A Python library for secure, private Deep Learning. PySyft decouples private data from model training, using Multi-Party Computation (MPC) within PyTorch.
  • Rosetta - A privacy-preserving framework based on TensorFlow with customized backend operations using Multi-Party Computation (MPC). Rosetta reuses the APIs of TensorFlow and allows transferring original TensorFlow code to a privacy-preserving version with minimal changes.
  • Substra - Substra is an open-source framework for privacy-preserving, traceable and collaborative Machine Learning.
  • Tensorflow Privacy - A Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy.
  • TF Encrypted - A Framework for Confidential Machine Learning on Encrypted Data in TensorFlow.
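
For intuition on the ε-differential-privacy guarantee that several of the libraries above (e.g. Google's Differential Privacy, Tensorflow Privacy) build on, below is a library-agnostic sketch of the Laplace mechanism for a private count; the dataset and epsilon value are illustrative, and this is a teaching example rather than the API of any tool listed here.

```python
import numpy as np

def private_count(records, epsilon: float) -> float:
    """Illustrative epsilon-DP count using the Laplace mechanism."""
    true_count = len(records)
    # A count changes by at most 1 when a single record is added or removed
    # (sensitivity = 1), so Laplace noise with scale 1/epsilon gives epsilon-DP.
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

records = list(range(1000))                  # placeholder dataset
print(private_count(records, epsilon=0.5))   # noisy count near 1000
```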

Model and Data Versioning

  • Aim - A super-easy way to record, search and compare AI experiments.
  • Catalyst - High-level utils for PyTorch DL & RL research. It was developed with a focus on reproducibility, fast experimentation and code/ideas reusing.
  • ClearML - Auto-Magical Experiment Manager & Version Control for AI (previously Trains).
  • CodaLab - CodaLab Worksheets is a collaborative platform for reproducible research that allows researchers to run, manage, and share their experiments in the cloud. It helps researchers ensure that their runs are reproducible and consistent.
  • Data Version Control (DVC) - A version control system for machine learning projects that works alongside Git to version datasets and models.
  • Deepkit - An open-source platform and cross-platform desktop application to execute, track, and debug modern machine learning experiments.
  • Dolt - Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository.
  • Flor - Easy to use logger and automatic version controller made for data scientists who write ML code.
  • Guild AI - Open source toolkit that automates and optimizes machine learning experiments.
  • Hangar - Version control for tensor data, git-like semantics on numerical data with high speed and efficiency.
  • Keepsake - Version control for machine learning.
  • lakeFS - Repeatable, atomic and versioned data lake on top of object storage.
  • MLflow - Open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment (see the tracking sketch after this list).
  • ModelDB - An open-source system to version machine learning models, including their ingredients (code, data, config, and environment), and to track ML metadata across the model lifecycle.
  • ModelStore - An open-source Python library that allows you to version, export, and save a machine learning model to your cloud storage provider.
  • ormb - Docker for Your ML/DL Models Based on OCI Artifacts.
  • Polyaxon - A platform for reproducible and scalable machine learning and deep learning on kubernetes - (Video).
  • Quilt - Versioning, reproducibility and deployment of data and models.
  • Sacred - Tool to help you configure, organize, log and reproduce machine learning experiments.
  • Studio - Model management framework which minimizes the overhead involved with scheduling, running, monitoring and managing artifacts of your machine learning experiments.
  • TerminusDB - A graph database management system that stores data like git.
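
To show what experiment and model versioning looks like in practice, here is a minimal sketch using MLflow's tracking API; the run name, parameters and toy model are placeholders.

```python
# Minimal MLflow tracking sketch (assumes `pip install mlflow scikit-learn`);
# run name, parameters and model are illustrative placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="baseline"):
    params = {"C": 1.0, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)                               # hyperparameters
    mlflow.log_metric("train_accuracy", model.score(X, y))  # metric
    mlflow.sklearn.log_model(model, "model")                # versioned model artifact
```

Runs then show up in the MLflow UI, where parameters, metrics and artifacts can be compared across experiments; most other trackers in this section (Aim, ClearML, Sacred) expose a similar log-params/metrics/artifacts workflow.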

Model Training Orchestration

  • Accelerate - Accelerate abstracts exactly and only the boilerplate code related to multi-GPU/TPU/mixed-precision and leaves the rest of your code unchanged (see the sketch after this list).
  • Aqueduct - Aqueduct enables you to easily define, run, and manage AI & ML tasks on any cloud infrastructure.
  • CML - Continuous Machine Learning (CML) is an open-source library for implementing continuous integration & delivery (CI/CD) in machine learning projects.
  • Determined - Deep learning training platform with integrated support for distributed training, hyperparameter tuning, and model management (supports Tensorflow and Pytorch).
  • envd - Machine learning development environment for data science and AI/ML engineering teams.
  • Fabrik - Fabrik is an online collaborative platform to build, visualize and train deep learning models via a simple drag-and-drop interface.
  • Hopsworks - Hopsworks is a data-intensive platform for the design and operation of machine learning pipelines that includes a Feature Store - (Video).
  • Kubeflow - A cloud native platform for machine learning based on Google’s internal machine learning pipelines.
  • MLeap - Standardisation of pipeline and model serialization for Spark, Tensorflow and sklearn.
  • Nanotron - Nanotron provides distributed primitives to train a variety of models efficiently using 3D parallelism.
  • Nos - nos is an open-source platform to efficiently run AI workloads on Kubernetes, increasing GPU utilization and reducing infrastructure and operational costs.
  • NVIDIA TensorRT - TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.
  • Onepanel - Production scale vision AI platform, with fully integrated components for model building, automated labeling, data processing and model training pipelines.
  • Open Platform for AI - Platform that provides complete AI model training and resource management capabilities.
  • PyCaret - Low-code library for training and deploying models (scikit-learn, XGBoost, LightGBM, spaCy).
  • Sematic - Platform to build resource-intensive pipelines with simple Python.
  • Skaffold - Skaffold is a command line tool that facilitates continuous development for Kubernetes applications. You can iterate on your application source code locally then deploy to local or remote Kubernetes clusters.
  • SkyPilot - Run LLMs, AI, and batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution -- all with a simple interface.
  • Streaming - A Data Streaming Library for Efficient Neural Network Training.
  • Tensorflow Extended (TFX) - Production oriented configuration framework for ML based on TensorFlow, incl. monitoring and model version management.
  • TonY - TonY is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod.
  • ZenML - ZenML is an extensible, open-source MLOps framework to create reproducible ML pipelines with a focus on automated metadata tracking, caching, and many integrations to other tools.
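
As an example of how little boilerplate some of these training tools require, here is a minimal sketch of a training loop with Hugging Face Accelerate; the toy model and data are placeholders, and the same loop runs unchanged on CPU, a single GPU or multiple GPUs.

```python
# Minimal Accelerate sketch (assumes `pip install accelerate torch`);
# the toy model and data are illustrative placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()                      # detects the available hardware

dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
loader = DataLoader(dataset, batch_size=32)
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# prepare() wraps the objects so the same loop works on any device layout
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(3):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)               # replaces loss.backward()
        optimizer.step()
```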

Model Serving and Monitoring

  • Backprop - Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
  • BentoML - BentoML is an open source framework for high performance ML model serving.
  • Cortex - Cortex is an open source platform for deploying machine learning models—trained with any framework—as production web services. No DevOps required.
  • Deepchecks - Deepchecks is an open source package for comprehensively validating your machine learning models and data with minimal effort during development, deployment or in production.
  • DeepDetect - Machine Learning production server for TensorFlow, XGBoost and Caffe models written in C++ and maintained by Jolibrain.
  • Evidently - Evidently helps analyze machine learning models during development, validation, or production monitoring. The tool generates interactive reports from pandas DataFrame.
  • ForestFlow - Cloud-native machine learning model server.
  • Giskard - Quality Assurance for AI models. Open-source platform to help organizations increase the efficiency of their AI development workflow, eliminate risks of AI biases and ensure robust, reliable & ethical AI models.
  • Helicone - Helicone is an observability platform for LLMs.
  • Hydrosphere ML Lambda - Open source model management cluster for deploying, serving and monitoring machine learning models and ad-hoc algorithms with a FaaS architecture.
  • Intel® Extension for Transformers - An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere.
  • Inference - A fast, production-ready inference server for computer vision supporting deployment of many popular model architectures and fine-tuned models. With Inference, you can deploy models such as YOLOv5, YOLOv8, CLIP, SAM, and CogVLM on your own hardware using Docker.
  • Jina - Cloud-native search framework that supports using deep learning / state-of-the-art AI models for search.
  • KServe - Serverless framework to deploy and monitor machine learning models in Kubernetes - (Video).
  • Langfuse - Langfuse is an open source observability & analytics solution for LLM-based applications.
  • LightLLM - LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
  • LLMonitor - Observability & analytics for AI apps and agents.
  • LocalAI - LocalAI is a drop-in replacement REST API that's compatible with OpenAI API specifications for local inferencing.
  • m2cgen - A lightweight library that transpiles trained classic machine learning models into native code in C, Java, Go, R, PHP, Dart, Haskell, Rust and many other programming languages.
  • MLEM - Version and deploy your ML models following GitOps principles.
  • MLRun - MLRun is an open MLOps framework for quickly building and managing continuous ML and generative AI applications across their lifecycle.
  • MLServer - An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more.
  • mltrace - a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines.
  • MLWatcher - MLWatcher is a python agent that records a large variety of time-series metrics of your running ML classification algorithm. It enables you to monitor in real time.
  • Model Server for Apache MXNet (MMS) - A model server for Apache MXNet from Amazon Web Services that is able to run MXNet models as well as Gluon models (Amazon's SageMaker runs a custom version of MMS under the hood).
  • NannyML - An open source library to estimate post-deployment model performance (without access to targets). Capable of fully capturing the impact of data drift on performance.
  • Mosec - A Rust-powered and multi-stage pipelined model server which offers dynamic batching and more. Super easy to implement and deploy as micro-services.
  • Nuclio - A high-performance "serverless" framework focused on data, I/O, and compute intensive workloads. It is well integrated with popular data science tools, such as Jupyter and Kubeflow; supports a variety of data and streaming sources; and supports execution over CPUs and GPUs.
  • ONNX Runtime - ONNX Runtime is a cross-platform inference and training machine-learning accelerator.
  • OpenScoring - REST web service for the true real-time scoring (< 1 ms) of Scikit-Learn, R and Apache Spark models.
  • OpenVINO - OpenVINO is an open-source toolkit for optimizing and deploying AI inference.
  • Pandas Profiling - Creates HTML profiling reports from pandas DataFrame objects. It extends the pandas DataFrame with df.profile_report() for quick data analysis.
  • Phoenix - Phoenix is an open source ML observability in a notebook to validate, monitor and fine tune your generative LLM, CV and tabular models.
  • PowerInfer - PowerInfer is a CPU/GPU LLM inference engine leveraging activation locality for your device.
  • PredictionIO - An open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task.
  • Redis-AI - A Redis module for serving tensors and executing deep learning models. Expect changes in the API and internals.
  • Seldon Core - Open source platform for deploying and monitoring machine learning models in kubernetes - (Video).
  • skops - skops is a Python library helping you share your scikit-learn based models and put them in production.
  • S-LoRA - Serving Thousands of Concurrent LoRA Adapters.
  • Tempo - Open source SDK that provides a unified interface to multiple MLOps projects that enable data scientists to deploy and productionise machine learning systems.
  • Tensorflow Serving - High-performance framework to serve Tensorflow models via the gRPC protocol, able to handle 100k requests per second per core.
  • TorchServe - TorchServe is a flexible and easy to use tool for serving PyTorch models.
  • Transformer-deploy - Transformer-deploy is an efficient, scalable and enterprise-grade CPU/GPU inference server for Hugging Face transformer models.
  • Triton Inference Server - Triton is a high performance open source serving software to deploy AI models from any framework on GPU & CPU while maximizing utilization.
  • TruLens - TruLens provides a set of tools for developing and monitoring neural nets, including large language models.
  • UnionML - UnionML is an open source MLOps framework that aims to reduce the boilerplate and friction that comes with building models and deploying them to production.
  • vLLM - vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs (see the sketch after this list).
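
For a flavour of LLM serving with one of the engines above, here is a minimal offline-inference sketch with vLLM; the model name, prompt and sampling parameters are illustrative placeholders.

```python
# Minimal vLLM sketch (assumes `pip install vllm` and a supported model);
# model name, prompt and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")             # any supported Hugging Face causal LM
sampling = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Summarise what MLOps is in one sentence."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```

For online serving, vLLM also ships an OpenAI-compatible HTTP server, so the same engine can sit behind existing client code.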

Adversarial Robustness

  • AdvBox - A toolbox to generate adversarial examples that fool neural networks in PaddlePaddle, PyTorch, Caffe2, MxNet, Keras and TensorFlow; AdvBox can also benchmark the robustness of machine learning models.
  • Adversarial DNN Playground - think TensorFlow Playground, but for Adversarial Examples! A visualization tool designed for learning and teaching - the attack library is limited in size, but it has a nice front-end to it with buttons you can press!
  • AdverTorch - library for adversarial attacks / defenses specifically for PyTorch.
  • Artificial Adversary - AirBnB's library to generate text that reads the same to a human but passes adversarial classifiers.
  • CleverHans - library for testing adversarial attacks / defenses maintained by some of the most important names in adversarial ML, namely Ian Goodfellow (ex-Google Brain, now Apple) and Nicolas Papernot (Google Brain). Comes with some nice tutorials!
  • Counterfit - Counterfit is a command-line tool and generic automation layer for assessing the security of machine learning systems.
  • DEEPSEC - another systematic tool for attacking and defending deep learning models.
  • Foolbox - second biggest adversarial library. Has an even longer list of attacks - but no defenses or evaluation metrics. Geared more towards computer vision. Code easier to understand / modify than ART - also better for exploring blackbox attacks on surrogate models.
  • Adversarial Robustness Toolbox (ART) - ART provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference (see the sketch after this list).
  • MIA - A library for running membership inference attacks (MIA) against machine learning models.
  • NeMo Guardrails - NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
  • Nicholas Carlini's Adversarial ML reading list - not a library, but a curated list of the most important adversarial papers by one of the leading minds in Adversarial ML, Nicholas Carlini. If you want to discover the 10 papers that matter the most, I would start here.
  • OpenAttack - OpenAttack is a Python-based textual adversarial attack toolkit, which handles the whole process of textual adversarial attacking, including preprocessing text, accessing the victim model, generating adversarial examples and evaluation.
  • RobustBench - another robustness resource maintained by some of the leading names in adversarial ML. They specifically focus on maintaining a standardized adversarial robustness benchmark.
  • Robust ML - another robustness resource maintained by some of the leading names in adversarial ML. They specifically focus on defenses, and ones that have published code available next to papers. Practical and useful.
  • TextFool - plausible looking adversarial examples for text generation.
  • Trickster - Library and experiments for attacking machine learning in discrete domains using graph search.
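
To illustrate the attack-evaluation workflow these libraries support, here is a minimal sketch of the Fast Gradient Method from ART run against a scikit-learn classifier; the dataset, model and attack strength are illustrative placeholders.

```python
# Minimal ART evasion-attack sketch
# (assumes `pip install adversarial-robustness-toolbox scikit-learn`);
# dataset, model and epsilon are illustrative placeholders.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import FastGradientMethod

X, y = load_digits(return_X_y=True)
X = X / 16.0                                     # scale pixel values to [0, 1]

model = LogisticRegression(max_iter=1000).fit(X, y)
classifier = SklearnClassifier(model=model, clip_values=(0.0, 1.0))

attack = FastGradientMethod(estimator=classifier, eps=0.1)
X_adv = attack.generate(x=X[:100])               # perturbed copies of the first 100 digits

print("clean accuracy:      ", model.score(X[:100], y[:100]))
print("adversarial accuracy:", model.score(X_adv, y[:100]))
```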

AutoML

  • AutoGluon - Automated feature, model, and hyperparameter selection for tabular, image, and text data on top of popular machine learning libraries (Scikit-Learn, LightGBM, CatBoost, PyTorch, MXNet).
  • Autokeras - AutoML library for Keras based on "Auto-Keras: Efficient Neural Architecture Search with Network Morphism".
  • AutoML-GS - Automatic feature and model search with code generation in Python, on top of common data science libraries (tensorflow, sklearn, etc.).
  • auto-sklearn - Framework to automate algorithm and hyperparameter tuning for sklearn.
  • Columbus - A scalable framework to perform exploratory feature selection implemented in R.
  • ENAS via Parameter Sharing - Efficient Neural Architecture Search via Parameter Sharing by the authors of the paper.
  • ENAS-PyTorch - Efficient Neural Architecture Search (ENAS) in PyTorch based on this paper.
  • ENAS-Tensorflow - Efficient Neural Architecture Search via parameter sharing (ENAS) micro search Tensorflow code for Windows users.
  • Feature Engine - Feature-engine is a Python library that contains several transformers to engineer features for use in machine learning models.
  • Featuretools - An open source framework for automated feature engineering.
  • FLAML - FLAML is a fast library for automated machine learning & tuning.
  • go-featureprocessing - A feature pre-processing framework in Go that matches functionality of sklearn.
  • Katib - A Kubernetes-based system for Hyperparameter Tuning and Neural Architecture Search.
  • keras-tuner - Keras Tuner is an easy-to-use, distributable hyperparameter optimisation framework that solves the pain points of performing a hyperparameter search. Keras Tuner makes it easy to define a search space and leverage included algorithms to find the best hyperparameter values.
  • Maggy - Asynchronous, directed Hyperparameter search and parallel ablation studies on Apache Spark - (Video).
  • Neural Architecture Search with Controller RNN - Basic implementation of Controller RNN from Neural Architecture Search with Reinforcement Learning and Learning Transferable Architectures for Scalable Image Recognition.
  • Neural Network Intelligence - NNI (Neural Network Intelligence) is a toolkit to help users run automated machine learning (AutoML) experiments.
  • Optuna - Optuna is an automatic hyperparameter optimisation software framework, particularly designed for machine learning (see the sketch after this list).
  • OSS Vizier - OSS Vizier is a Python-based service for black-box optimisation and research, one of the first hyperparameter tuning services designed to work at scale.
  • sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn.
  • TPOT - Automation of sklearn pipeline creation (including feature selection, pre-processor, etc.).
  • tsfresh - Automatic extraction of relevant features from time series.
  • Upgini - Free automated data & feature enrichment library for machine learning: automatically searches through thousands of ready-to-use features from public and community shared data sources and enriches your training dataset with only the accuracy improving features.
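
To show the define-an-objective-and-optimise pattern shared by several of the tuners above, here is a minimal sketch with Optuna; the model, search ranges and trial budget are illustrative placeholders.

```python
# Minimal Optuna sketch (assumes `pip install optuna scikit-learn`);
# model, search ranges and trial count are illustrative placeholders.
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(trial):
    # Search space for two hyperparameters of a random forest
    n_estimators = trial.suggest_int("n_estimators", 10, 200)
    max_depth = trial.suggest_int("max_depth", 2, 16)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()   # maximise CV accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```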

Data Pipeline

  • Apache Airflow - Data Pipeline framework built in Python, including scheduler, DAG definition and a UI for visualisation (see the sketch after this list).
  • Apache Nifi - Apache NiFi was made for dataflow. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic.
  • Argo Workflows - Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
  • Azkaban - Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.
  • Basin - Visual programming editor for building Spark and PySpark pipelines.
  • BatchFlow - BatchFlow helps data scientists conveniently work with random or sequential batches of your data and define data processing and machine learning workflows for large datasets.
  • Bonobo - ETL framework for Python 3.5+ with a focus on simple atomic operations working concurrently on rows of data.
  • Chronos - More of a job scheduler for Mesos than an ETL pipeline.
  • Couler - Unified interface for constructing and managing machine learning workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
  • DataTrove - DataTrove is a library to process, filter and deduplicate text data at a very large scale.
  • D6tflow - A Python library that allows for building complex data science workflows in Python.
  • DALL·E Flow - DALL·E Flow is an interactive workflow for generating high-definition images from a text prompt.
  • Dagster - A data orchestrator for machine learning, analytics, and ETL.
  • DBND - DBND is an agile pipeline framework that helps data engineering teams track and orchestrate their data processes.
  • DBT - ETL tool for running transformations inside data warehouses.
  • Flyte - Lyft’s Cloud Native Machine Learning and Data Processing Platform - (Demo).
  • Genie - Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems.
  • Gokart - A wrapper of the data pipeline library Luigi.
  • Hamilton - Hamilton is a micro-orchestration framework for defining dataflows. Runs anywhere python runs (e.g. jupyter, fastAPI, spark, ray, dask). Brings software engineering best practices without you knowing it. Use it to define feature engineering transforms, end-to-end model pipelines, and LLM workflows. It complements macro-orchestration systems (e.g. kedro, luigi, airflow, dbt, etc.) as it replaces the code within those macro tasks.
  • Instill VDP - Instill VDP (Versatile Data Pipeline) aims to streamline the data processing pipelines from inception to completion.
  • Kedro - Kedro is a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned. Visualization of the kedro workflows can be done by kedro-viz.
  • Ludwig - Ludwig is a declarative machine learning framework that makes it easy to define machine learning pipelines using a simple and flexible data-driven configuration system.
  • Luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc.
  • Metaflow - A framework for data scientists to easily build and manage real-life data science projects.
  • Neuraxle - A framework for building neat pipelines, providing the right abstractions to chain your data transformation and prediction steps with data streaming, as well as doing hyperparameter searches (AutoML).
  • Oozie - Workflow scheduler for Hadoop jobs.
  • Pachyderm - Open source distributed processing framework built on Kubernetes, focused mainly on dynamic building of production machine learning pipelines - (Video).
  • PipelineX - Based on Kedro and MLflow. Full comparison is found here.
  • Ploomber - The fastest way to build data pipelines. Develop iteratively, deploy anywhere.
  • Prefect Core - Workflow management system that makes it easy to take your data pipelines and add semantics like retries, logging, dynamic mapping, caching, failure notifications, and more.
  • SETL - A simple Spark-powered ETL framework that helps you structure your ETL projects, modularize your data transformation logic and speed up your development.
  • Snakemake - Workflow management system for reproducible and scalable data analyses.
  • Towhee - General-purpose machine learning pipeline for generating embedding vectors using one or many ML models.
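
To show what a minimal DAG looks like in one of the orchestrators above, here is a hedged sketch of a two-task Apache Airflow pipeline; the DAG id, schedule and task bodies are placeholders, and parameter names can differ slightly between Airflow versions.

```python
# Minimal Airflow DAG sketch (Airflow 2.x style; assumes `pip install apache-airflow`);
# dag_id, schedule and task logic are illustrative placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")                    # placeholder extract step

def train():
    print("training the model")                  # placeholder training step

with DAG(
    dag_id="toy_training_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",                  # `schedule` in newer Airflow releases
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    train_task = PythonOperator(task_id="train", python_callable=train)
    extract_task >> train_task                   # run extract before train
```

Most of the other orchestrators in this section (Luigi, Prefect, Dagster, Kedro) express the same idea of tasks plus explicit dependencies, differing mainly in API style and deployment model.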

Data Labelling and Synthesis

  • Argilla - Argilla helps domain experts and data teams to build better NLP datasets in less time.
  • Baal