This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

A proposal of PAI protocol (Preview) #2007

Closed
fanyangCS opened this issue Jan 15, 2019 · 6 comments

Comments

@fanyangCS
Contributor

fanyangCS commented Jan 15, 2019

pai-protocol.yaml

@scarlett2018
Member

Adding related issue: #1575

@abuccts
Member

abuccts commented Jan 22, 2019

Comment inline:

```yaml
protocolVersion: String, required # Protocol version, current version is 2
name: String, required
type: String, required # The type of the component. Must be one of the following: job, data, script, dockerimage, or output
version: String, optional # Component version. Default is latest
```

Can we use `tag` here, like Docker images do, so it can be a meaningful string, instead of `version`? `protocolVersion` could also simply be renamed `version`.
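For illustration, a sketch of what the component header could look like under this suggestion (the `tag` field name and values here are hypothetical, not part of the current proposal):

```yaml
# Hypothetical sketch: "version" replaces "protocolVersion", "tag" replaces "version"
version: 2
name: tensorflow_cnnbenchmarks
type: dockerimage
tag: stable-gpu-py3 # a meaningful string, like a Docker image tag
```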

```yaml
contributor: String, optional
description: String, optional
```

Maybe a `license` field could be considered as well.
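For example (assuming an SPDX-style identifier; the field is my suggestion and not in the current proposal):

```yaml
contributor: OpenPAI
description: A distributed TensorFlow benchmark job.
license: MIT # hypothetical field, SPDX license identifier
```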

```yaml
prerequisites: # Optional
  - protocolVersion: String, optional # If omitted, follow the protocolVersion in root
    name: String, required
    type: String, required # Component type. Must be one of the following: data, script, dockerimage, or output. Prerequisites.type cannot be "job"
    version: String, optional # Component version. Default is latest
    contributor: String, optional
    description: String, optional
```

In my opinion, the `contributor` and `description` fields are not needed in `prerequisites`.

```yaml
    auth: Object, optional # Only available when the type is dockerimage
      username: String, optional
      password: String, optional
```

Should we avoid using the password in plain text? We could use a credential ("username:password" encoded in base64) instead of separate `username` and `password` fields.
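A sketch of what this could look like, assuming a single `credential` field holding the base64 encoding of "username:password" (the field name is a suggestion, not part of the proposal):

```yaml
    auth:
      # hypothetical field: base64 encoding of "username:password"
      credential: dXNlcm5hbWU6cGFzc3dvcmQ=
      registryuri: docker.io
```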

```yaml
      registryuri: String, optional
    uri: String or list, required # Only when the type is data can the uri be a list
```

```yaml
parameters: # Optional, can be omitted
  <param1>: value1Type # Specify name and value of all the referenceable parameters that will be used in the whole job template. They can be referenced by $$paramName$$.
  <param2>: value2Type
```

```yaml
jobRetryCount: Integer, optional # Default is 0
taskRoles:
  - protocolVersion: String, optional # Protocol version, default is 2
    name: String, required # Name of the taskRole
    instances: Integer, optional # Default is 1, instances of a taskRole
    completion:
      minFailedInstances: Integer or null, optional # Default 1
      minSucceededInstances: Integer or null, optional # Default null
    taskRetryCount: Integer, optional # Default is 0
    dockerImage: String, required # Should reference a dockerimage defined in prerequisites
    data: Object, optional # Default is None
    output: Object, optional # Default is None
    script: Object, optional # Default is None
```

`data`, `output`, and `script` do not need to be objects; we can refer to them by their names in `prerequisites`.

In the distributed TensorFlow example, we could simply use `data: cifar10` instead of `data: { cifar10: prerequisites.[data,cifar10] }`, `script: tensorflow_cnnbenchmarks` instead of `script: { tf_cnnbenchmarks: prerequisites.[script,tensorflow_cnnbenchmarks] }`, etc.
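A side-by-side sketch of the suggestion (component names taken from the example above; the worker task role is illustrative):

```yaml
# Current proposal: object mapping a local name to a prerequisites reference
taskRoles:
  - name: worker
    data: { cifar10: prerequisites.[data,cifar10] }
    script: { tf_cnnbenchmarks: prerequisites.[script,tensorflow_cnnbenchmarks] }

# Suggested shorthand: refer to prerequisites directly by name
taskRoles:
  - name: worker
    data: cifar10
    script: tensorflow_cnnbenchmarks
```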

```yaml
    extraContainerOptions:
      shmMB: Integer, optional # configure /dev/shm in the Docker container, https://docs.docker.com/compose/compose-file/#shm_size
    resourcePerInstance:
      cpu: Integer, required
      memoryMB: Integer, required
      gpu: Integer, required
      ports:
        <portLabel1>: Integer, optional, default is 0 # Only for host network
    commands:
      - String, required
```

```yaml
# To handle that a component may interact with different components differently, the user is
# encouraged to place the code handling such differences in the "deployments" field.
# E.g., a job may get input data through wget, hdfs dfs -cp, copy, or directly reading from
# remote storage. That logic can be placed here.
# In summary, the deployments field is responsible for making sure the job runs properly in a
# deployment-specific runtime environment.
# One could have many deployments, but only one deployment can be activated at runtime. The user
# can choose the deployment at job submission time.
deployments:
  - protocolVersion: String, optional # If omitted, follow the root protocolVersion
    name: String, required
    taskRoles:
      - name: String, required # Should be the same as taskRoles.name
        preCommands:
          - String, required # executed before $$commands$$
        postCommands:
          - String, required # executed after $$commands$$
```

The `deployments` field is useful for user extension, but we could merge this block into `taskRoles`: just add two optional fields, `preCommands` and `postCommands`, alongside `commands`.
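A sketch of the merged form, with `preCommands` and `postCommands` inlined in the task role (the concrete values are made up for illustration):

```yaml
taskRoles:
  - name: worker
    instances: 2
    dockerImage: tf_example
    resourcePerInstance:
      cpu: 4
      memoryMB: 8192
      gpu: 1
    preCommands: # optional, executed before commands
      - hdfs dfs -cp hdfs://namenode/data/cifar10 /tmp/data
    commands:
      - python train.py --data /tmp/data
    postCommands: # optional, executed after commands
      - hdfs dfs -cp /tmp/output hdfs://namenode/output
```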

```yaml
attachments: # Optional, cluster specific parameters
  - protocolVersion: String, optional
    virtualCluster: String, optional
```

@fanyangCS
Contributor Author

@abuccts, thanks for the comments. Can you submit a new version to the branch "pai-proto" for reference?

@abuccts
Member

abuccts commented Jan 25, 2019

Should I edit the existing pai-protocol.yaml or create a new file?

@fanyangCS
Contributor Author

How about you submit a PR that merges to the pai-proto branch?

@scarlett2018 scarlett2018 changed the title A proposal of PAI protocol (Preview) [Call for review] A proposal of PAI protocol (Preview) Feb 22, 2019
@scarlett2018 scarlett2018 pinned this issue Feb 22, 2019
@fanyangCS
Contributor Author

A related PR: #2141

@scarlett2018 scarlett2018 changed the title [Call for review] A proposal of PAI protocol (Preview) A proposal of PAI protocol (Preview) Feb 22, 2019
@scarlett2018 scarlett2018 unpinned this issue Feb 22, 2019