
update mdf_to_onnx branch to current scenario #310

Merged
39 commits merged on Oct 6, 2022

Conversation

parikshit14
Contributor

  • Updated the mdf_to_onnx branch to the current development version
  • Added the examples which are currently passing the tests successfully
  • Having trouble with standard torchvision examples (used directly from the torchvision library) when converted to MDF; something needs to be updated in the execution_engine or in the way the input image is defined

@parikshit14
Contributor Author

Hey @davidt0x, I am trying to add some predefined standard PyTorch models to the current examples folder, but there seems to be an issue with them.

  • I have tried to add the resnet18 model from the torchvision library and run it against some ImageNet sample images (also tried with a tensor of zeros), but the evaluate function fails while converting to an MDF model.

error: [screenshot attached: "Screenshot from 2022-08-10 21-17-49"]

We don't have a standard way of specifying types yet in MDF, but the current code is
definitely not good since the type is always just Tensor. It should now be aligned
with PyTorch and NumPy.
@davidt0x
Contributor

Hey @parikshit14,

I am taking a look at this. I fixed the issue on my end (it was a bug caused by not passing a tuple to the args parameter of pytorch_to_mdf). However, with that fixed, there are some issues with ONNX opset mismatches that I am cleaning up. I will try to push my changes tomorrow.
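For reference, here is a minimal sketch of the call pattern being discussed. The keyword names and return values of pytorch_to_mdf are assumptions based on how the converter is typically used; check the signature in your checkout.

```python
# Hedged sketch: the pytorch_to_mdf keyword names below are assumptions, not a spec.
import torch
import torchvision.models as models

from modeci_mdf.interfaces.pytorch import pytorch_to_mdf

model = models.resnet18(pretrained=True)
model.eval()

# Dummy ImageNet-shaped input; a real image tensor would be passed the same way.
x = torch.zeros((1, 3, 224, 224))

# Note the trailing comma: args must be a tuple, even for a single input.
mdf_model, params_dict = pytorch_to_mdf(
    model=model,
    args=(x,),
    trace=True,
)
```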

- Properly handle ONNX operations that return multiple outputs. Now expressions for
the value of an output port can index into the tuple, for example
OutputPort(id='_13', value="onnx_BatchNormalization_1[0]"). This was needed for ResNet.
This also fixed an existing hack for MaxPool. There might be more of these lying around;
need to take a better look. (A sketch of this output-port pattern follows this list.)
- In order to get the above working, I changed "onnx::" to "onnx_" in function/parameter ids generated by the exporter. This lets eval work, because
"::" can't be parsed in Python.
- Moved the modeci ONNX opset version from 13 to 15.
- Fixed issue where improper schema (wrong opset) was being looked up for
ops in the exporter now that ONNX has moved to 15.
I moved it to a better place and made it general for all Reshape ops,
not just ops with Reshape_1 ids.
This looks like a hack needed for multiple outputs from Clip op.
I think this is handled generally now.
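As a hedged illustration of the pattern named in the first item above, the snippet below builds an output port whose value expression indexes into the tuple returned by a multi-output ONNX op. The node/graph wiring is illustrative only, not the exporter's actual output.

```python
# Hedged sketch using the modeci_mdf.mdf classes; ids are taken from the example above.
from modeci_mdf.mdf import Node, OutputPort

node = Node(id="BatchNormalization_1")

# "onnx_BatchNormalization_1" is assumed to be the id of the node's function/parameter
# that wraps the ONNX BatchNormalization call; "[0]" selects the first element of the
# tuple of outputs it returns.
node.output_ports.append(
    OutputPort(id="_13", value="onnx_BatchNormalization_1[0]")
)
```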
@davidt0x
Contributor

Hey @parikshit14,

Take a look at these changes to your branch. I think they should fix a lot of the issues. It is passing locally on my end, but I am not sure if it will pass CI; let's see. This ended up being more involved than I thought it would be. Essentially, a lot of the problems were related to improper handling of multiple outputs from ONNX ops. We had some ugly op-specific hacks hanging around. I think I have cleaned this up now and it should be more general. It would probably be a good idea to also rerun all the PyTorch examples so that the MDF JSON in the repo gets updated.

@davidt0x
Contributor

I think this failure is intermittent and related to the simple_convolution model being float32 and the execution_engine using float64. Let me try to address that.

I am pretty sure this is related to not handling precision consistently
in the execution engine. We are converting back and forth between float32 and
float64 in different places. The results match intermittently
without a tolerance, so I have set an absolute tolerance for now. Need
to investigate this further.
@davidt0x
Contributor

Ok, well I didn't really address it. I modified the test to compare with a tolerance. I think this is related to how we are not handling precision in a disciplined way within the execution engine. I see we are converting back and forth between float32 and float64. The final result is being output as float64, upcast from float32. This is all kind of a mess; I think we need to handle it more systematically. I am not certain this is the issue, but it is certainly something we need to fix down the line anyway.
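For concreteness, here is a sketch of the tolerance-based comparison described above; the helper name is hypothetical and not the one used in the test suite.

```python
# Hedged sketch of comparing PyTorch and execution-engine outputs with an absolute tolerance.
import numpy as np


def assert_outputs_close(torch_out, mdf_out, atol=1e-5):
    # Upcast both sides to float64 so the comparison itself doesn't introduce yet
    # another precision mismatch, then compare with an absolute tolerance only.
    a = np.asarray(torch_out, dtype=np.float64)
    b = np.asarray(mdf_out, dtype=np.float64)
    np.testing.assert_allclose(a, b, atol=atol, rtol=0)
```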

@parikshit14
Contributor Author

Hi @davidt0x, thanks for the changes.

Have you tried to set up a new local environment using these changes? I am getting the same impossible-to-resolve dependency problem as the CI from the earlier commit.

Also, I was trying to add a few more examples and they were failing with a shape inference error caused by the presence of dropout layers. So I added model_name.eval() to get rid of the randomness, and now the tests run well. I hope it is not just a workaround to pass the tests!
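For context, this is the kind of change meant here (the model choice is just an example of a torchvision model that contains Dropout layers): eval() puts dropout and batch-norm layers into inference mode, so tracing and shape inference no longer see the training-only randomness.

```python
# Hedged sketch of the eval() step; alexnet is an illustrative model with Dropout layers.
import torchvision.models as models

model = models.alexnet(pretrained=True)
model.eval()  # switch Dropout/BatchNorm to inference mode before tracing to MDF
```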

@davidt0x
Contributor

Hey @parikshit14,

Yeah, there does seem to be an issue if you just try to do a pip install -e .[all] from a clean environment. I get something like this:

INFO: pip is looking at multiple versions of neuromllite to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of modelspec to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of jinja2 to determine which version is compatible with other requirements. This could take a while.
Collecting Jinja2<3.1
  Using cached Jinja2-3.0.2-py3-none-any.whl (133 kB)
INFO: pip is looking at multiple versions of graph-scheduler to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of attrs to determine which version is compatible with other requirements. This could take a while.
Collecting attrs>=21.1.0
  Using cached attrs-21.4.0-py2.py3-none-any.whl (60 kB)
INFO: pip is looking at multiple versions of pytorch-sphinx-theme to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of modeci-mdf[all] to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install modeci-mdf and modeci-mdf[all]==0.4.2 because these package versions have conflicting dependencies.

The conflict is caused by:
    modeci-mdf[all] 0.4.2 depends on modeci-mdf 0.4.2 (from C:\Users\david\Dropbox\princeton\PsyNeuLink_Stuff\mdf\MDF)
    psyneulink 0.12.0.1 depends on modeci-mdf<0.4.2 and >=0.3.4
    modeci-mdf[all] 0.4.2 depends on modelspec~=0.2.6
    neuromllite 0.5.1 depends on modelspec>=0.2.2
    psyneulink 0.12.0.0 depends on modelspec<0.2.6
    modeci-mdf[all] 0.4.2 depends on graph-scheduler>=1.1.1
    psyneulink 0.11.0.0 depends on graph-scheduler<1.1.1 and >=0.2.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

This does not fail in CI currently. I think that is because the CI script installs PsyNeuLink (and maybe modelspec) from the GitHub repos before installing anything else. Ultimately, I think this could be fixed easily by removing all these version pin caps (or at least the ones for modelspec and torch) from the psyneulink repo. I will push some changes to the CI workflow that I hope will cause this to manifest in CI.

Regarding your other question about Dropout: there were a few lines that seem to have been removed or moved from tests/conftest.py that try to make PyTorch more deterministic for testing. I will push some changes that should re-enable this. I have also refactored some of the model fixtures into a PyTorch-specific conftest.py under interfaces/pytorch. I think this keeps the main conftest.py cleaner.
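As a hedged sketch, a determinism setup like the one referred to here could look something like the autouse fixture below; the exact lines in tests/conftest.py may differ.

```python
# Hedged sketch of a determinism fixture for tests/conftest.py; not the repo's actual code.
import random

import numpy as np
import pytest
import torch


@pytest.fixture(autouse=True)
def seed_everything():
    """Seed all RNGs before each test so PyTorch model outputs are reproducible."""
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)
    yield
```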

These fixtures are only used by the PyTorch-to-MDF conversion tests. Let's keep them separate.
… been installed.

Currently, we are not testing whether core MDF works without PyTorch or PsyNeuLink, because
these are installed before the core package is installed (PsyNeuLink depends on PyTorch).
I have moved this install to the end, after installation of all backends (including PNL) from pip. This is better for testing that a clean install works.
… into examples/pytorch-to-mdf

Conflicts:
	tests/conftest.py
This should support both ways depending on version of torchvision.
@parikshit14, I changed some of your code to remove a lot of duplication.
test_import.py under pytorch now enumerates all models in torchvision and
creates a test for each different type of model. All the models have the
same interface so we can consolidate the testing code into a parameterized
pytest.

We still have 5 out of 20 tests failing; let's see if we can track down why
each of these models is failing to run in the execution engine.
@davidt0x
Contributor

Hey @parikshit14,

Check out the changes I have made. I tried to remove some duplicate code in how you were creating tests for these models. I wrote a function that picks up all the models in torchvision and creates a test for each. Most work, but 5 are still failing for various reasons. I think most are failing for simple reasons. One looks like it's because there is a parameter or something with a hyphen character in it (MDF probably can't handle this). A couple of others look like they are failing because of mismatched tensor types passed to ONNX ops. I will be out next week, so I can't help with these right now.
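A hedged sketch of that parameterized test structure is below; the enumeration and names are illustrative, not the actual contents of test_import.py.

```python
# Hedged sketch: enumerate torchvision model constructors and parameterize one test over them.
import inspect

import pytest
import torch
import torchvision.models as models


def torchvision_model_names():
    # Collect the top-level functions exposed by torchvision.models (mostly model builders).
    return sorted(
        name
        for name, obj in vars(models).items()
        if inspect.isfunction(obj) and not name.startswith("_")
    )


@pytest.mark.parametrize("model_name", torchvision_model_names())
def test_torchvision_model_to_mdf(model_name):
    ctor = getattr(models, model_name)
    try:
        model = ctor()  # defaults: no pretrained weights are downloaded
    except TypeError:
        pytest.skip(f"{model_name} is not a zero-argument model constructor")
    if not isinstance(model, torch.nn.Module):
        pytest.skip(f"{model_name} did not return an nn.Module")
    model.eval()
    x = torch.zeros((1, 3, 224, 224))  # dummy input that would be handed to the converter
    # Conversion to MDF and the execution-engine comparison would go here; the point is
    # that the models share a call interface, so one parameterized test covers them all.
    assert callable(model)
```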

parikshit14 and others added 13 commits August 24, 2022 18:03
- We can't have '-' characters in port names in MDF. Some PyTorch models have these. (A sketch of this sanitization follows this commit list.)
- Also fixed an issue where models were being tested with all-zeros inputs, which made some models pass when in reality they were not computing the correct values. More bugs to track down ...
The pytorch exporter now removes constant ops from the graph
and simply inserts them as parameters.
ONNX BatchNormalization can return an optional number of
outputs. To complicate this, the optional outputs are not
allowed unless the attribute training_mode=1 is set. I have
added a hardcoded check to handle this case.
Torchvision models run through PyTorch and MDF do not have
exactly matching results. I have set the comparison tolerance
to an absolute 1e-5 for now. I imagine the difference could
be a lot of things. One could be that we are not really
controlling precision well in MDF models. Another could be that
ONNX ops don't implement exactly the same algorithms that
PyTorch ops do.
The outputs of the PyTorch and MDF models for this torchvision
version of Inception are different. Need to investigate.
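The first commit in this batch mentions stripping '-' from port names; a minimal sketch of that kind of sanitization is below. The helper name is hypothetical, not the exporter's actual function.

```python
# Hedged sketch: map characters MDF port ids can't contain (such as '-') to underscores.
import re


def sanitize_port_id(raw_id: str) -> str:
    """Replace any character that is not a letter, digit, or underscore with '_'."""
    return re.sub(r"[^0-9a-zA-Z_]", "_", raw_id)


assert sanitize_port_id("features-0-weight") == "features_0_weight"
```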
@davidt0x davidt0x requested a review from pgleeson October 6, 2022 03:53
@davidt0x
Contributor

davidt0x commented Oct 6, 2022

I think this is good to go now, @pgleeson and @parikshit14. Have a look and merge if you like. I actually fixed the issues with all the torchvision models except one, inception_v3. This model converts and runs but gets very different results. Not sure what is going on with this one; I have never had an error like that before, and the error is massive! I will need to trace through it to figure it out, but I have set that test to xfail. I did a test run where I ran tests on all 60 or so torchvision models and all but 5 pass, but it takes too long to run in CI. Instead, I made it so the test only does one test per unique model class. The 5 failing tests in the run of 60 seem to be larger versions of ResNet which don't generate the same results between MDF and PyTorch. I will look into these as well down the line. It might be a good task for an Outreachy candidate as well.
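For illustration, marking the known-bad model as an expected failure in a parameterized test could look like the sketch below; the real test parameterization may express this differently, and the model list shown is illustrative.

```python
# Hedged sketch: xfail one entry of a parametrize list while the divergence is investigated.
import pytest

TORCHVISION_TEST_MODELS = [
    "resnet18",
    "alexnet",
    pytest.param(
        "inception_v3",
        marks=pytest.mark.xfail(
            reason="MDF and PyTorch outputs diverge for inception_v3; under investigation"
        ),
    ),
]
```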

Also, for some reason the macOS runs took very long to run; this may be related to this issue: actions/runner-images#1336

@pgleeson
Member

pgleeson commented Oct 6, 2022

Ok @davidt0x, tested locally and it's running fine. Certainly there is more that can be done with all of this down the line, as well as for other PyTorch examples.

Thanks again for your contributions to this @parikshit14!

@pgleeson pgleeson merged commit cd3e49a into development Oct 6, 2022