Resource scheduling and cluster management for AI

Open Platform for AI (OpenPAI)

OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premises, cloud, and hybrid environments at various scales.

Table of Contents

  1. When to consider OpenPAI
  2. Why choose OpenPAI
  3. How to deploy
  4. How to use
  5. Resources
  6. Get Involved
  7. How to contribute

When to consider OpenPAI

  1. When your organization needs to share powerful AI computing resources (GPU/FPGA farm, etc.) among teams.
  2. When your organization needs to share and reuse common AI assets like Model, Data, Environment, etc.
  3. When your organization needs an easy IT ops platform for AI.
  4. When you want to run a complete training pipeline in one place.

Why choose OpenPAI

The platform incorporates a mature design with a proven track record in Microsoft's large-scale production environment.

Support on-premises and easy to deploy

OpenPAI is a full stack solution. It supports on-premises, hybrid, and public cloud deployment, as well as single-box deployment for trial users.

Support popular AI frameworks and heterogeneous hardware

OpenPAI provides pre-built Docker images for popular AI frameworks, makes it easy to include heterogeneous hardware, and supports distributed training, such as distributed TensorFlow.

Most complete solution and easy to extend

OpenPAI is a complete solution for deep learning: it supports virtual clusters, is compatible with the Hadoop and Kubernetes ecosystems, and runs a complete training pipeline on one cluster. OpenPAI is architected in a modular way: different modules can be plugged in as appropriate.

Related Projects

Aiming at openness and advancing state-of-the-art technology, Microsoft Research (MSR) has also released a few other open source projects.

  • NNI: An open source AutoML toolkit for neural architecture search and hyper-parameter tuning.
  • MMdnn: A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.

We encourage researchers and students to leverage these projects to accelerate AI development and research.

How to deploy

1 Prerequisites

Before you start, make sure the following requirements are met:

  • Ubuntu 16.04
  • Each server has a static IP address, and the servers can reach each other over the network.
  • Servers can access the external network, in particular a Docker registry service (e.g., Docker Hub), to pull the Docker images for the services to be deployed.
  • The SSH service is enabled on all machines, and all machines share the same username / password with sudo privileges.
  • The NTP service is enabled.
  • Recommended: either no Docker installed, or Docker with API version >= 1.26.
  • See hardware resource requirements.
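
The Docker API version requirement is easy to get wrong if versions are compared as strings or floats ("1.9" would then look newer than "1.26"). The following is a minimal sketch of a pre-flight check, not part of OpenPAI itself: the helper names (`parse_version`, `docker_api_ok`, `installed_docker_api_version`) are hypothetical, and it assumes the `docker` CLI, if present, is on the PATH of the machine being checked.

```python
import shutil
import subprocess

# Minimum Docker API version taken from the prerequisite list above.
MIN_DOCKER_API = (1, 26)


def parse_version(text):
    """Turn a dotted version string such as '1.26' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def docker_api_ok(api_version, minimum=MIN_DOCKER_API):
    """Return True if the given Docker API version meets the minimum.

    Versions are compared as integer tuples, so '1.9' sorts below '1.26'.
    """
    return parse_version(api_version) >= minimum


def installed_docker_api_version():
    """Return the local Docker server API version, or None if Docker is
    not installed or the daemon cannot be reached."""
    if shutil.which("docker") is None:
        return None
    result = subprocess.run(
        ["docker", "version", "--format", "{{.Server.APIVersion}}"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = installed_docker_api_version()
    if version is None:
        print("No usable Docker found: acceptable for a fresh deployment.")
    elif docker_api_ok(version):
        print(f"Docker API {version} meets the >= 1.26 requirement.")
    else:
        print(f"Docker API {version} is older than 1.26: upgrade or remove Docker.")
```

Running the script on each server before deployment gives a quick yes/no for this one prerequisite; the other items (static IP, SSH, NTP) still need to be verified separately.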

2 Deploy OpenPAI

2.1 Customized deploy
2.2 Single Box deploy

How to use

How to train jobs

Cluster administration

Resources

  • The OpenPAI user documentation provides in-depth instructions for using OpenPAI
  • Visit the release notes to read about the new features, or download the release today.
  • FAQ

Get Involved

  • Stack Overflow: If you have questions about OpenPAI, please post them on Stack Overflow under the tag openpai.
  • Report an issue: If you have an issue, bug, or feature request, please submit it on GitHub.

How to contribute

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Who should consider contributing to OpenPAI?

  • Folks who want to add support for other ML and DL frameworks
  • Folks who want to make OpenPAI a richer AI platform (e.g. support for more ML pipelines, hyperparameter tuning)
  • Folks who want to write tutorials/blog posts showing how to use OpenPAI to solve AI problems

Contributors

One key purpose of PAI is to support the highly diversified requirements from academia and industry. PAI is completely open: it is under the MIT license. This makes PAI particularly attractive for evaluating various research ideas, which include but are not limited to the components.

PAI operates in an open model. It was initially designed and developed by Microsoft Research (MSR) and the Microsoft Search Technology Center (STC) platform team. We are glad to have Peking University, Xi'an Jiaotong University, Zhejiang University, and University of Science and Technology of China join us to develop the platform jointly. Contributions from academia and industry are all highly welcome.