
Support for Autotuning DSL #1

Closed
kartshy opened this issue May 5, 2020 · 38 comments
Labels
enhancement New feature or request

Comments

kartshy commented May 5, 2020

Need to support the Autotuning DSL as part of the application optimisations.
The DSL provides:
• Tuning parameters can be defined, constrained and injected into the application source, build or run
• Easy to integrate with the application build and run
• Can tune for any metric output, not just runtime, and take the max, min or average of a set of runs

The example input is:
[example input omitted: embedded image]

We need to be able to supply this input in the SODALITE IDE and then process it in the application optimiser. It should be part of an Autotuning section!
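Since the embedded example image is not preserved here, the following is a purely hypothetical sketch of the kind of input meant, reconstructed only from the section names discussed later in this thread (typing, constraints, build, run); the exact syntax is defined by the autotuning tool:

typing:
  int TILE_SIZE            # a tuning parameter to define and inject
constraints:
  range TILE_SIZE 4 64     # restrict the parameter to legal values
build:
  command: make CFLAGS="-DTILE_SIZE=$TILE_SIZE"
run:
  command: ./app input.dat # the metric to tune on is read from the run output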

kartshy commented May 5, 2020

@jesus-gorronogoitia Can you please comment?

@jesus-gorronogoitia

Hi @kartshy

That's fine, to include the Autotuning DSL as part of the AADM DSL. We only need to identify where (within the AADM) an Autotuning model should be embedded.
Should it be embedded within a node template description (i.e. an application)? Should it be part of the workflow description? Should it be independent, at the same level as the app descriptions?
What should be the root label: autotuning?
Please clarify.

kartshy commented May 5, 2020

It is a sister node of optimization. It is relevant for a particular application (skyline_extractor in the Snow example).

kartshy commented May 5, 2020

In the example we have optimisation as a single entry, like

optimisation: GraphCompiler, ETL

instead it should be a list:

optimisation {
  GraphCompiler: XLA
  ETL: Prefetch
  Autotuning: DSL text
}

@jesus-gorronogoitia

Ok, then optimization needs to be promoted to a DSL entity for node templates (apps). This also has implications for the KB schema and the optimization reasoning, so we need to involve @gmeditsk and @zoevas in this discussion.

kartshy commented May 20, 2020

@jesus-gorronogoitia @gmeditsk @zoevas
I created a first draft of the Performance DSL that we want to support. It is the complete list that we would like to support.
There are two general sections, for multi-arch support and autotuning, and then three application-specific sections for AI, HPC and Big Data. Each application-specific section has Data and config sections that act as inputs for selecting the performance optimisations to be enabled. We can also infer some entries from other entries.

I hope this gives a clear view of where we want to go with the optimisation DSL.

{
  "optimisation": {
    "arch": {
      "CPU_type": "Intelx86/ARM/AMD/Power",
      "opt_build": true,
      "acc_type": "NVIDIA-V100/AMD-M100/FPGA-Xilinx"
    },
    "autotuning": {
      "tuner": "default/...",
      "input": ""
    },
    "AI_training": {
      "config": {
        "AI_framework": "TF/PyTorch/Keras/CNTK/MXNet",
        "type": "Image classification/object detection/translation/recommendation/reinforncement learning",
        "layers": 6,
        "parameters": 872684236
      },
      "Data": {
        "location": "/some/data",
        "basedata": "Imagenet/CIFAR/MNIST",
        "size": 67,
        "count": 4389
      },
      "Keras": {
        "version": "1.1",
        "backend": "TensorFlow/PyTorch/CNTK/MXNet/Keras",
        "distribute": true,
        "ETL": {
          "prefetch": 100,
          "cache": 100
        }
      },
      "TensorFlow": {
        "version": "1.1",
        "XLA": true,
        "distribute": true,
        "ETL": {
          "prefetch": 100,
          "cache": 100
        }
      },
      "PyTorch": {
        "version": "1.1",
        "GLOW": true,
        "distribute": true
      }
    },
    "HPC": {
      "config":{
      },
      "data":{
      },
      "MPI": {
        "library": "mvapch/opnmpi",
        "version": "1.1",
        "scaling_efficiency": 0.75,
        "core_subscription": 1,
        "message_size": "small/medium/large",
        "message_sync": true
      },
      "OPENMP": {
        "number_of_threads": 2,
        "scaling_efficiency": 0.75,
        "affinity": "block/simpe"
      },
      "OPENACC": {
        "compiler": "pgi/cray",
        "version": "1.1",
        "multi-acc": true
      },
      "OPENCL": {
        "compiler": "pgi/cray",
        "version": "1.1",
        "multi-acc": true
      }
    },
    "BigData": {}
  }
} 

kartshy commented May 21, 2020

Need to add

  1. Description
  2. Mandatory or not

kartshy commented May 21, 2020

@jesus-gorronogoitia Please find below the updated DSL in JSON format.
We can discuss on Slack or in the meeting next week.

{
  "optimisation": {                                    // mandatory
    "enable_opt_build": true/false,                    // mandatory; enable a target-specific optimised build container
    "enable_autotuning": true/false,                   // mandatory; enable the autotuning node if this is true
    "app_type": "AI_Training/HPC/BigData/AI_Inference", // mandatory; type of application; enables the application node based on the type
    "opt_build": {                                     // mandatory; enabled if enable_opt_build is true
      "CPU_type": "Intelx86/ARM/AMD/Power",            // mandatory; CPU architecture; we may be able to get this from the target model
      "acc_type": "NVIDIA-V100/AMD-M100/FPGA-Xilinx"   // mandatory; accelerator architecture; we may be able to get this from the target model
    },
    "autotuning": {                                    // mandatory; enabled if enable_autotuning is true
      "tuner": "CREATA/AUTOTUNE",                      // mandatory; autotuning tool to be used
      "input": ""                                      // mandatory; DSL or input text for the autotuning tool
    },
    "AI_Training": {                                   // mandatory; enabled based on the selected app_type
      "config": {                                      // mandatory
        "AI_Framework": "TensorFlow/PyTorch/Keras/CNTK/MXNet", // mandatory; AI framework to support
        "type": "Image_classification/object_detection/translation/recommendation/reinforcement_learning", // optional; type of AI training network
        "distributed_training": true,                  // optional; enable distributed training
        "layers": 6,                                   // optional; number of layers
        "parameters": 872684236                        // optional; number of model parameters
      },
      "Data": {                                        // mandatory
        "location": "/some/data",                      // optional; data location
        "basedata": "Imagenet/CIFAR/MNIST",            // optional; type of data
        "size": 67,                                    // optional; size of a single data element
        "count": 4389,                                 // optional; number of data elements
        "ETL": {                                       // optional
          "prefetch": 100,                             // optional; prefetch size to use
          "cache": 100                                 // optional; cache size to use
        }
      },
      "Keras": {                                       // loaded based on the selected AI_Framework
        "version": "1.1",                              // optional; Keras version
        "backend": "TensorFlow/PyTorch/CNTK/MXNet/Keras" // optional; Keras backend to use
      },
      "TensorFlow": {                                  // loaded based on the selected AI_Framework
        "version": "1.1",                              // optional; version to use; default used if not specified
        "XLA": true                                    // optional; enable the XLA compiler for optimisation
      },
      "PyTorch": {                                     // loaded based on the selected AI_Framework
        "version": "1.1",                              // optional; version to use; default used if not specified
        "GLOW": true                                   // optional; enable the GLOW compiler for optimisation
      }
    },
    "HPC": {                                           // mandatory; enabled based on the selected app_type
      "config": {                                      // mandatory
        "Parallelisation": "MPI/OPENMP/OPENACC/OPENCL" // mandatory; multiple selections possible; application parallelisation strategy
      },
      "data": {                                        // mandatory
        "location": "/some/data",                      // optional; data location
        "basedata": "IMAGE/RESTART",                   // optional; type of data
        "size": 67,                                    // optional; size of a single data element
        "count": 4389                                  // optional; number of data elements
      },
      "MPI": {                                         // loaded based on the selected Parallelisation
        "library": "mvapich/openmpi",                  // mandatory; MPI library to use
        "version": "1.1",                              // optional; version to use; default used if not specified
        "scaling_efficiency": 0.75,                    // optional; scaling efficiency; default used if not specified
        "core_subscription": 1,                        // optional; core subscription; default used if not specified
        "message_size": "small/medium/large"           // optional; MPI message size; default used if not specified
      },
      "OPENMP": {                                      // loaded based on the selected Parallelisation
        "number_of_threads": 2,                        // mandatory; number of threads to use
        "scaling_efficiency": 0.75,                    // optional; scaling efficiency; default used if not specified
        "affinity": "block/simple"                     // optional; thread affinity; default used if not specified
      },
      "OPENACC": {                                     // loaded based on the selected Parallelisation
        "compiler": "pgi/cray",                        // mandatory; compiler to use
        "version": "1.1",                              // optional; version to use; default used if not specified
        "number_of_acc": 2                             // optional; number of accelerators to use; default used if not specified
      },
      "OPENCL": {                                      // loaded based on the selected Parallelisation
        "compiler": "pgi/cray",                        // mandatory; compiler to use
        "version": "1.1",                              // optional; version to use; default used if not specified
        "number_of_acc": 2                             // optional; number of accelerators to use; default used if not specified
      }
    },
    "BigData": {}
  }
}

zoevas commented May 22, 2020

Hello @kartshy, @jesus-gorronogoitia

With the current implementation, having optimizations as a simple list, the reasoner takes the type of framework from:
snow.aadm

dependency:
    node: tensorflow

and, in conjunction with the capabilities in the AADM, proposes optimizations.

Now that the optimizations won't be a list, and I see that it is a block containing information such as app_type, I suppose that, according to the selected app_type, the reasoner will propose optimizations based on capabilities.

That's how I understand it. For example:
1) if app_type = AI_Training, the selected framework ("AI_Framework") should also be present in the optimization block, so that the semantic reasoner can propose applicable capabilities.

"TensorFlow": {        // load the framework-specific node based on the AI_Framework selected
    "version": "1.1",  // optional; version to use; default used if not specified
    "XLA": true        // optional; enable the XLA compiler for optimisation
}

So, I suppose that XLA is proposed by the reasoner.
There are also other optimizations that could be proposed for TensorFlow according to page 4 of the optimizations presentation, and they are not present in the JSON.
Maybe they are not included in the JSON since it is a first draft?
Also, will you later provide criteria enabling optimizations for other frameworks such as PyTorch, Keras, etc.?

2) if app_type = HPC, the reasoner should return whether the MPI, OPENMP, OPENACC, etc. blocks should be enabled.
About HPC, for enabling optimizations, I only see the criteria on the last page of the optimizations presentation.

3) About autotuning and opt_build: are there specific criteria for enabling them, or does the user decide that?

Please correct me if I am wrong on any point.
I am just trying to understand which fields will be inferred by the reasoner, and which criteria will be used for enabling optimizations based on the capabilities and the type of application.

Thank you so much in advance.

kartshy commented May 22, 2020

For 1) TensorFlow, PyTorch and Keras are types of AI frameworks.
So we will enable the framework-specific section based on the AI framework selected ("AI_Framework").
The criteria or filters for the optimisations I will send you later. The ones in the slide are just examples.

For 2) Your understanding is correct: based on the selected Parallelisation, different sections are enabled. Again, I will send you the criteria or filters for the optimisations later.

For 3) Autotuning and opt_build are enabled based on the following entries at the top:
    "enable_opt_build": true/false, - mandatory; enable a target-specific optimised build container
    "enable_autotuning": true/false, - mandatory; enable the autotuning node if this is true

In general I have designed this in such a way that the DSL expands based on the AoE-specified options (see the sketch below). For example:
select and show AI_Training based on app_type,
then select and show TensorFlow based on AI_Framework.

There are some entries which can be inferred from the application or infrastructure model.
The main criteria or filters (for the first draft) are:

  1. Number of CPU nodes
  2. Number of GPUs per node
  3. SSD available memory

I will add these criteria and send them later.
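A minimal sketch of this expansion rule (hypothetical helper name; assuming the DSL is held as a plain Python dict):

def enabled_sections(opt):
    """Return which DSL sections the editor should show, following the
    expansion rules described above."""
    sections = []
    if opt.get("enable_opt_build"):
        sections.append("opt_build")
    if opt.get("enable_autotuning"):
        sections.append("autotuning")
    app_type = opt.get("app_type")  # e.g. "AI_Training" or "HPC"
    if app_type:
        sections.append(app_type)
    if app_type == "AI_Training":
        # then select and show the framework-specific node
        sections.append(opt[app_type]["config"]["AI_Framework"])
    elif app_type == "HPC":
        sections.append(opt[app_type]["config"]["Parallelisation"])
    return sections

# enabled_sections({"enable_opt_build": True, "enable_autotuning": False,
#                   "app_type": "AI_Training",
#                   "AI_Training": {"config": {"AI_Framework": "TensorFlow"}}})
# -> ["opt_build", "AI_Training", "TensorFlow"]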

kartshy commented May 22, 2020

DSL with constraints (added ETL for HPC also):

{
  "optimisation": {
    "enable_opt_build": true,
    "enable_autotuning": true,
    "app_type": "AI_Training/HPC/BigData/AI_Inference",
    "opt_build": {
      "CPU_type": "Intelx86/ARM/AMD/Power",
      "acc_type": "NVIDIA-V100/AMD-M100/FPGA-Xilinx"  (Constraint: number of GPUs > 0)
    },
    "autotuning": {
      "tuner": "CREATA/AUTOTUNE",
      "input": "DSL or input text for the autotuning tool"
    },
    "AI_Training": {
      "config": {
        "AI_Framework": "TensorFlow/PyTorch/Keras/CNTK/MXNet",
        "type": "Image_classification/object_detection/translation/recommendation/reinforcement_learning",
        "distributed_training": true,  (Constraint: number of nodes > 1)
        "layers": 6,
        "parameters": 872684236
      },
      "Data": {
        "location": "/some/data",
        "basedata": "Imagenet/CIFAR/MNIST",
        "size": 67,
        "count": 4389,
        "ETL": {  (Constraint: SSD available or number of GPUs > 0)
          "prefetch": 100,
          "cache": 100
        }
      },
      "Keras": {
        "version": "1.1",
        "backend": "TensorFlow/PyTorch/CNTK/MXNet/Keras"
      },
      "TensorFlow": {
        "version": "1.1",
        "XLA": true  (Constraint: number of GPUs > 0)
      },
      "PyTorch": {
        "version": "1.1",
        "GLOW": true  (Constraint: number of GPUs > 0)
      }
    },
    "HPC": {
      "config": {
        "Parallelisation": "MPI/OPENMP/OPENACC/OPENCL"
      },
      "data": {
        "location": "/some/data",
        "basedata": "IMAGE/RESTART",
        "size": 67,
        "count": 4389,
        "ETL": {  (Constraint: SSD available or number of GPUs > 0)
          "prefetch": true,
          "cache": true
        }
      },
      "MPI": {
        "library": "mvapich/openmpi",
        "version": "1.1",
        "scaling_efficiency": 0.75,
        "core_subscription": 1,
        "message_size": "small/medium/large"
      },
      "OPENMP": {
        "number_of_threads": 2,
        "scaling_efficiency": 0.75,
        "affinity": "block/simple"
      },
      "OPENACC": {  (Constraint: number of GPUs > 0)
        "compiler": "pgi/cray",
        "version": "1.1",
        "number_of_acc": 2
      },
      "OPENCL": {  (Constraint: number of GPUs > 0)
        "compiler": "pgi/cray",
        "version": "1.1",
        "number_of_acc": 2
      }
    },
    "BigData": {}
  }
}

@jesus-gorronogoitia

Hi @kartshy
I guess this optimization DSL needs to be completed for BigData and AI_Inference, doesn't it?

jesus-gorronogoitia commented May 25, 2020

Hi @kartshy
Are constraints only applicable to those DSL entities you have annotated in your example above? What is the precise format for expressing constraints? I mean, is there only one constraint per element? A free-style label compared with a number? What comparison operators are available? As constraints are expressed in your example, I would support constraints as text to be provided by the user, to be interpreted by your parser.

@jesus-gorronogoitia

Hi @kartshy
The ETL properties (cache, prefetch) are different for the ETL group in AI_Training, where they are integers, and in HPC, where they are booleans. Is this correct? Are they intentionally different, or is it a mistake? If the latter, what are the correct types? Thanks.

kartshy commented May 25, 2020

Hi @kartshy
I guess this optimization DSL needs to be completed for BigData and AI_Inference, doesn't it?

Yes, but that will be done in Y3. The Y2 focus is only HPC and AI_Training. There is BigData within AI_Training (ETL), so we may skip BigData as there is no mapping to the use-case applications.

kartshy commented May 25, 2020

Hi @kartshy
Are constraints only applicable to those DSL entities you have annotated in your example above? What is the precise format for expressing constraints? I mean, is there only one constraint per element? A free-style label compared with a number? What comparison operators are available? As constraints are expressed in your example, I would support constraints as text to be provided by the user, to be interpreted by your parser.

Currently constraints are only for the entries I have mentioned in the DSL above. My understanding is that, based on the application and target model, these entries may be enabled or disabled for the user. I can also enable or disable them when I process them in WP4, but we agreed the logic should be in the KB. For example, for the constraint (number of GPUs > 0), we should get the number of GPUs from the infrastructure model and then use the condition to enable or disable the optimisation.

kartshy commented May 25, 2020

Hi @kartshy
The ETL properties (cache, prefetch) are different for the ETL group in AI_Training, where they are integers, and in HPC, where they are booleans. Is this correct? Are they intentionally different, or is it a mistake? If the latter, what are the correct types? Thanks.

Yes, for HPC they are booleans and for AI_Training they are integers. We can change HPC to integers also. I don't have a full picture for HPC yet.

jesus-gorronogoitia commented May 25, 2020

Hi @kartshy
Are constraints only applicable to those DSL entities you have annotated in your example above? What is the precise format for expressing constraints? I mean, is there only one constraint per element? A free-style label compared with a number? What comparison operators are available? As constraints are expressed in your example, I would support constraints as text to be provided by the user, to be interpreted by your parser.

Currently constraints are only for the entries I have mentioned in the DSL above. My understanding is that, based on the application and target model, these entries may be enabled or disabled for the user. I can also enable or disable them when I process them in WP4, but we agreed the logic should be in the KB. For example, for the constraint (number of GPUs > 0), we should get the number of GPUs from the infrastructure model and then use the condition to enable or disable the optimisation.

It is still unclear to me how to support the formalization of constraints.
This one, for instance, attached to the ETL entity of the Data entity of the HPC entity:
(Constraint: SSD available or number of GPUs > 0)

Questions:
1- Is there only one constraint per entity?
2- Can the user combine constraints connected by logical operators (AND, OR, etc.)? If so, what operators are supported?
3- The expression "SSD available or number of GPUs" looks like free text. If so, how can you parse and interpret it? If not, what are the available expressions to pick from?
4- A single constraint format seems to be: (Constraint: <label> <comparative_oper> <value>)
What are the supported comparative operators? What are the possible values? Numbers?

For the moment, until you can be more precise about the constraint format, I will support this generic format for constraints: constraint: '<expression as string>', where the user can provide whatever textual expression the optimizer may read.

@jesus-gorronogoitia

Need to support the Autotuning DSL as part of the application optimisations.
The DSL provides:
• Tuning parameters can be defined, constrained and injected into the application source, build or run
• Easy to integrate with the application build and run
• Can tune for any metric output, not just runtime, and take the max, min or average of a set of runs

The example input is:
[example input omitted: embedded image]

We need to be able to supply this input in the SODALITE IDE and then process it in the application optimiser. It should be part of an Autotuning section!

Hi @kartshy
I need the specification of the Autotuning DSL language:
1- In section typing: what are the possible entities/keywords? The example shows int; what other ones are supported and what are their expected values?
2- In section constraints, the same question as above. Only the range entity/keyword is shown in the example.
3- For sections build and run, I assume the only expected property is command.

@jesus-gorronogoitia

Hi @kartshy
Below I show screenshots of optimization models shown in the optimization textual editor:

  1. The left one showing the AI Training optimization for TensorFlow
  2. The right one showing the HPC optimization for MPI

I assume that once the app_type is chosen (e.g. AI_Training), only that section appears below in the model. Similarly, once config.ai_framework (or config.parallelisation) is chosen, only that section appears below in the model (e.g. TensorFlow or MPI).

Please confirm.

[screenshots omitted]

kartshy commented May 25, 2020

@jesus-gorronogoitia You are fast. I am on leave (UK holiday) today; I will have a look tomorrow.

@jesus-gorronogoitia

Take your time, enjoy your day off.

@jesus-gorronogoitia

First version of the autotuning DSL and editor implemented. See #4 for details.

kartshy commented May 26, 2020

Hi @kartshy
Are constraints only applicable to those DSL entities you have annotated in your example above? What is the precise format for expressing constraints? I mean, is there only one constraint per element? A free-style label compared with a number? What comparison operators are available? As constraints are expressed in your example, I would support constraints as text to be provided by the user, to be interpreted by your parser.

Currently constraints are only for the entries I have mentioned in the DSL above. My understanding is that, based on the application and target model, these entries may be enabled or disabled for the user. I can also enable or disable them when I process them in WP4, but we agreed the logic should be in the KB. For example, for the constraint (number of GPUs > 0), we should get the number of GPUs from the infrastructure model and then use the condition to enable or disable the optimisation.

It is still unclear to me how to support the formalization of constraints.
This one, for instance, attached to the ETL entity of the Data entity of the HPC entity:
(Constraint: SSD available or number of GPUs > 0)

Questions:
1- Is there only one constraint per entity?
2- Can the user combine constraints connected by logical operators (AND, OR, etc.)? If so, what operators are supported?
3- The expression "SSD available or number of GPUs" looks like free text. If so, how can you parse and interpret it? If not, what are the available expressions to pick from?
4- A single constraint format seems to be: (Constraint: <label> <comparative_oper> <value>)
What are the supported comparative operators? What are the possible values? Numbers?

For the moment, until you can be more precise about the constraint format, I will support this generic format for constraints: constraint: '<expression as string>', where the user can provide whatever textual expression the optimizer may read.

  1. There is only one constraint per entity.
  2. But that constraint can have multiple filters combined with AND/OR syntax. Let us support only AND and OR.
  3. I thought you would tell me how best to support this. SSD and number of GPUs are entities in the infrastructure model; we need to map to them. We can have keywords like SSD/nGPUs. I would avoid free text.
  4. In (Constraint: <label> <comparative_oper> <value>):
    <label> denotes any entity in the application or infrastructure model,
    the comparative operators are =, >, <,
    and values are integer/double/boolean.

We can discuss in the Thursday call.
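To make these agreed rules concrete, here is a minimal sketch of a grammar and evaluator for this constraint format (hypothetical keyword names like nGPUs; the label values are assumed to come from the application/infrastructure model):

# constraint := filter { ("AND" | "OR") filter }   -- left to right, no precedence
# filter     := label ("=" | ">" | "<") value
# label      := entity in the application/infrastructure model, e.g. SSD, nGPUs
# value      := integer | double | boolean

import re

def to_value(tok):
    # booleans first, otherwise integers and doubles
    if tok.lower() in ("true", "false"):
        return tok.lower() == "true"
    return float(tok)

def check(flt, model):
    # model maps labels to values, e.g. {"nGPUs": 2, "SSD": True}
    label, op, value = re.match(r"(\w+)\s*(=|>|<)\s*([\w.]+)", flt).groups()
    left, right = model[label], to_value(value)
    return {"=": left == right, ">": left > right, "<": left < right}[op]

def evaluate(constraint, model):
    # filters are combined strictly left to right; only AND and OR are supported
    tokens = re.split(r"\s+(AND|OR)\s+", constraint)
    result = check(tokens[0], model)
    for op, flt in zip(tokens[1::2], tokens[2::2]):
        if op == "AND":
            result = result and check(flt, model)
        else:
            result = result or check(flt, model)
    return result

# evaluate("SSD = true OR nGPUs > 0", {"SSD": False, "nGPUs": 2})  -> True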

kartshy commented May 26, 2020

Need to support the Autotuning DSL as part of the application optimisations.
The DSL provides:
• Tuning parameters can be defined, constrained and injected into the application source, build or run
• Easy to integrate with the application build and run
• Can tune for any metric output, not just runtime, and take the max, min or average of a set of runs
The example input is:
[example input omitted: embedded image]
We need to be able to supply this input in the SODALITE IDE and then process it in the application optimiser. It should be part of an Autotuning section!

Hi @kartshy
I need the specification of the Autotuning DSL language:
1- In section typing: what are the possible entities/keywords? The example shows int; what other ones are supported and what are their expected values?
2- In section constraints, the same question as above. Only the range entity/keyword is shown in the example.
3- For sections build and run, I assume the only expected property is command.

This autotuning DSL is big. My initial thought is that we won't support it as a DSL, but as a string or file name (input) that is passed to the tool. I will email you the full DSL.

@jesus-gorronogoitia

If the autotuning DSL is big and complex, it should be provided by the users as you suggest, referencing the path to an external file. Then, during AADM deployment, the IDE can retrieve this file and attach its content as a string to the AADM model to be sent to the KB. Let's discuss this point in the Thursday meeting.

@jesus-gorronogoitia

Agreement: Autotuning DSL editing will not be supported by the IDE inline in the Optimization DSL editor. It will be provided by the user as separate file content. Upon saving or deployment of the AADM model, if an autotuning block is present, the user will be prompted to provide the autotuning content file. The content of this file will be embedded as a string into the autotuning input property.
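A minimal sketch of that agreed flow (hypothetical function and field names; assuming the AADM is available as a Python dict and dsl_path is the file the user is prompted for on save/deploy):

def embed_autotuning_input(aadm, dsl_path):
    """If the AADM carries an autotuning block, embed the content of the
    user-selected file as a string in its 'input' property."""
    autotuning = aadm.get("optimisation", {}).get("autotuning")
    if autotuning is not None:
        with open(dsl_path) as f:
            autotuning["input"] = f.read()
    return aadm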

kartshy commented May 28, 2020

Can we add the file to the artifacts instead of embedding as a string?

@jesus-gorronogoitia

Hi @kartshy, what do you mean by adding the autotuning file to the artifacts? Do you mean sending the autotuning file to the KB not embedded as a string in the optimization model, but as a separate file?

@jesus-gorronogoitia

Implemented IDE support for selecting the autotuning model from the file system, prompting the user to select the model, whose path is associated with the input property of autotuning. It remains to be decided how this autotuning model will be sent to the Optimization engine.

zoevas commented Jun 9, 2020

Hello @kartshy, @jesus-gorronogoitia,

Regarding the optimization section and whether its schema is supported by the KB: supposing the optimization schema is not supported by the KB, I am wondering whether the reasoner can infer all the input needed from the rest of the AADM so as to return optimizations.
We discussed at the ontology meeting that the type of the application (HPC or AI training) can be inferred from the node type. Could you give an example?

In the optimization section, I understand that the following variables are the ones needed for the reasoner to find the possible optimizations (correct me if I am wrong or another variable is also needed):

  • cpu_type is needed, so that the reasoner can return the relevant acc_type
  • app_type
  • if app_type = ai_training, ai_framework contains the selected framework.

So, if the optimization schema is not supported by the KB, do you know if it is possible to infer the above knowledge from the rest of the AADM? An example would be useful.

Thanks in advance

zoevas commented Jun 16, 2020

Hello @kartshy,

Just for confirmation: I understood from the last ontology meeting that, for now, the reasoner will just check which optimizations are applicable and return them to the IDE. Would you agree?

For finding the applicable optimizations, the reasoner checks the capabilities of a node template and a few parameters from the optimization JSON, such as app_type, cpu_type and ai_framework. Please inform me in case any other parameter should be taken into account.
Regarding app_type, cpu_type and ai_framework, I understand that the values of those variables cannot be inferred from the rest of the AADM, so the reasoner should retrieve that information from the optimisation JSON string to find whether the application is HPC or AI, and which framework is selected for AI.

Please correct me if I am wrong on any point.

So as to start the implementation as soon as possible, could you provide an end-to-end example: an AADM with node templates having capabilities, an optimization JSON, the constraints that the reasoner should check for returning optimizations, and the expected optimizations that should be returned by the reasoner?

Anything else we should keep in mind regarding the reasoner? Should the reasoner send anything to the optimizer? Or, for now, will just the IDE send the serialized optimization JSON to the optimizer, and the reasoner will only assist the IDE by sending it the applicable optimizations?

Thank you so much in advance,
Zoe

kartshy commented Jun 17, 2020

Hello @kartshy, @jesus-gorronogoitia,

Regarding the optimization section and whether its schema is supported by the KB: supposing the optimization schema is not supported by the KB, I am wondering whether the reasoner can infer all the input needed from the rest of the AADM so as to return optimizations.
We discussed at the ontology meeting that the type of the application (HPC or AI training) can be inferred from the node type. Could you give an example?

In the optimization section, I understand that the following variables are the ones needed for the reasoner to find the possible optimizations (correct me if I am wrong or another variable is also needed):

* cpu_type is needed, so that the reasoner can return the relevant acc_type

* app_type

* if app_type = ai_training, ai_framework contains the selected framework.

So, if the optimization schema is not supported by the KB, do you know if it is possible to infer the above knowledge from the rest of the AADM? An example would be useful.

Thanks in advance

The reasoner can infer many things based on the application and infrastructure node, but I don't have a clear enough view to define those now, so we have pushed that work past M18.

The reasoner is only involved in implementing the constraints in the optimisation DSL, like SSD and number of GPUs.

For example, to enable XLA optimisations we need number of GPUs > 0. So my expectation is for the reasoner to find whether the number of GPUs is > 0 and then enable the XLA optimisation:

"tensorflow": {
    "version": "1.1",
    "XLA": true
},

In the current DSL spec there are only two constraints, for SSD and number of GPUs. Both values need to be inferred from the infrastructure node.
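A minimal sketch of that expectation (hypothetical function name; the field names follow the draft DSL above, and num_gpus/ssd_available are assumed to be inferred from the infrastructure node):

import json

def apply_gpu_constraints(optimisation_json, num_gpus, ssd_available):
    """Enable or disable optimisations whose constraints depend on the
    infrastructure model."""
    opt = json.loads(optimisation_json)["optimisation"]
    ai = opt.get("AI_Training", {})
    if "TensorFlow" in ai:
        ai["TensorFlow"]["XLA"] = num_gpus > 0   # (Constraint: number of GPUs > 0)
    if "PyTorch" in ai:
        ai["PyTorch"]["GLOW"] = num_gpus > 0     # (Constraint: number of GPUs > 0)
    if "Data" in ai and not (ssd_available or num_gpus > 0):
        ai["Data"].pop("ETL", None)              # (Constraint: SSD available or number of GPUs > 0)
    return json.dumps({"optimisation": opt})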

zoevas commented Jun 22, 2020

Hello @kartshy,

Thanks for the answer. For the interim review, what needs to be implemented by the reasoner regarding the optimizations?
I remember from the ontology meeting that constraints such as (Constraint: number of GPUs > 0) will be removed from the optimization DSL, and in the future the QoE will provide those constraints to the reasoner. Also, the IDE will serialize the optimization DSL to JSON.
So, should the reasoner send anything to the optimizer? Or should the reasoner assist the IDE by proposing optimizations, based on the optimization JSON selections (e.g. app_type) and the capabilities of the AADM? Or is there anything else that I have missed?

Regards,
Zoe

kartshy commented Jun 22, 2020 via email

zoevas commented Jun 30, 2020

@kartshy about HPC: when do MPI and OpenMP get enabled?
I see that there are constraints for OpenACC and OpenCL, but not for MPI and OpenMP.

kartshy commented Jun 30, 2020

MPI and OpenMP will be enabled by default. We call applications that use MPI and OpenMP traditional HPC applications. They can use OpenACC or OpenCL for accelerators like GPUs.

kartshy closed this as completed Sep 14, 2020
kartshy commented Sep 14, 2020

Closing this, as it is supported in the M18 deliverable. Will create a new issue if required.
