update mdf_to_onnx branch to current scenario #310
Conversation
parikshit14
commented
Aug 9, 2022
- updated mdf_to_onnx branch to the current development version
- added the examples which are currently passing the tests successfully
- having trouble with standard torchvision examples (used directly from the torchvision library) when converted to MDF; something needs to be updated in the execution_engine or in the way the input image is defined
Hey @davidt0x, I am trying to add some predefined standard PyTorch models to the current examples folder, but there seems to be an issue with them.
We don't have a standard way of specifying types yet in MDF, but the current code is definitely not good since it always just says Tensor. It should now be aligned with PyTorch and numpy.
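For instance, a hedged sketch of how a dtype string aligned with numpy/PyTorch naming could be derived from a traced tensor (illustrative only, not the current exporter code):

```python
import torch

t = torch.zeros((1, 3), dtype=torch.float32)
# Converting to numpy and stringifying the dtype gives "float32",
# which matches numpy's naming convention.
dtype_str = str(t.numpy().dtype)
```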
Hey @parikshit14, I am taking a look at this. I fixed the issue on my end (it was a bug caused by not passing a tuple to the args parameter of pytorch_to_mdf). However, with that fixed, there are some issues with ONNX opset mismatches that I am cleaning up. Will try to push my changes tomorrow.
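A minimal sketch of the fix described above, assuming the import path and the (model, params) return pair used in the project's examples:

```python
import torch
import torchvision
from modeci_mdf.interfaces.pytorch import pytorch_to_mdf

model = torchvision.models.resnet18()
x = torch.zeros((1, 3, 224, 224))

# Passing `x` bare triggered the bug; wrapping it in a tuple works.
mdf_model, params_dict = pytorch_to_mdf(model=model, args=(x,))
```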
- Properly handle ONNX operations that return multiple outputs. Expressions for the value of an output port can now index into the tuple, for example OutputPort(id='_13', value="onnx_BatchNormalization_1[0]") — see the sketch after this list. This was needed for ResNet, and it also fixed an existing hack for MaxPool. There might be more of these lying around; need to take a better look.
- In order to get the above working, I changed "onnx::" to "onnx_" in function/parameter ids generated by the exporter. This lets eval work, because "::" can't be parsed in Python.
- Moved the modeci ONNX opset version from 13 to 15.
- Fixed an issue where an improper schema (wrong opset) was being looked up for ops in the exporter now that ONNX has moved to 15.
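A hedged sketch of the indexing pattern from the first bullet, assuming OutputPort lives in modeci_mdf.mdf (the ids are taken from the example above):

```python
from modeci_mdf.mdf import OutputPort

# The value expression indexes into the first element of the tuple
# returned by the multi-output ONNX BatchNormalization function.
port = OutputPort(id="_13", value="onnx_BatchNormalization_1[0]")
```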
I moved it to a better place and made it general for all Reshape ops, not just ops with Reshape_1 ids.
This looks like a hack needed for multiple outputs from the Clip op. I think this is handled generally now.
Hey @parikshit14, take a look at these changes to your branch. I think they should fix a lot of the issues. It is passing locally on my end, but I'm not sure if it will pass CI; let's see. This ended up being more involved than I thought it would be. Essentially, a lot of the problems were related to improper handling of multiple outputs from ONNX ops. We had some ugly op-specific hacks hanging around. I think I have cleaned this up now and it should be more general. It would probably be a good idea to also rerun all the PyTorch examples so that the MDF JSON in the repo gets updated.
I think this failure is intermittent and related to the simple_convolution model being float32 while the execution_engine uses float64. Let me try to address that.
I am pretty sure this is related to not handling precision consistently in the execution engine. We are converting back and forth between float32 and float64 in different places. The results match intermittently without a tolerance, but I have set an absolute tolerance for now. Need to investigate this further.
Ok, well, I didn't really address it. I modified the test to compare with a tolerance. I think this is related to how we are not handling precision in a disciplined way within the execution engine. I see we are converting back and forth between float32 and float64. The final result is output as float64, upcast from float32. This is all kind of a mess; I think we need to systematically handle this better. I'm not certain this is the issue, but it is certainly something we need to fix down the line anyway.
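A small illustration (not project code) of why the upcast matters: a float64 produced from a float32 carries the float32 rounding error, so it will not compare equal to the value computed directly in float64.

```python
import numpy as np

x64 = np.float64(np.float32(0.1))  # upcast carries float32 rounding error
print(x64 == np.float64(0.1))      # False: 0.10000000149011612 != 0.1
```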
Hi @davidt0x, thanks for the changes. Have you tried to set up a new local env using these changes? I am getting the same "impossible to resolve" dependency problem as the CI from the earlier commit. Also, I was trying to add a few more examples and they seem to fail because of Dropout.
Hey @parikshit14, yeah, there does seem to be an issue if you just try to do a plain local install.
This does not fail in CI currently. I think it is because the CI script is installing PsyNeuLink (and maybe modelspec) from the GitHub repos before installing anything else. Ultimately, I think this could be fixed easily by removing all these version pin caps (or at least the ones for modelspec and torch) from the psyneulink repo. I will push some changes to the CI workflow that I hope will cause this to manifest in CI. Regarding your other question about Dropout: a few lines seem to have gotten removed or moved.
These fixtures are only used by the PyTorch-to-MDF conversion tests. Let's keep them separate.
… been installed. Currently, we are not testing whether core MDF works without PyTorch or PsyNeuLink, because these are installed before the core package is installed (PsyNeuLink depends on PyTorch). I have moved this install to the end, after installation of all backends (including PNL) from pip. This is better for testing that a clean install works.
… into examples/pytorch-to-mdf
Conflicts: tests/conftest.py
This should support both ways, depending on the version of torchvision.
@parikshit14, I changed some of your code to remove a lot of duplication. test_import.py under pytorch now enumerates all models in torchvision and creates a test for each different type of model. All the models have the same interface, so we can consolidate the testing code into a parameterized pytest (see the sketch below). We have 5 out of 20 tests failing still; let's see if we can track down why each of these models is failing to run in the execution engine.
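A hedged sketch of the parameterized approach described above; the model-discovery rule and the test body are assumptions for illustration, not the exact test_import.py code:

```python
import pytest
import torch
import torchvision.models

# Collect the lowercase model constructors exposed by torchvision.models
# (modules and classes are filtered out by the callable/lowercase checks).
MODEL_NAMES = [
    name
    for name in dir(torchvision.models)
    if name[0].islower() and callable(getattr(torchvision.models, name))
]

@pytest.mark.parametrize("model_name", MODEL_NAMES)
def test_torchvision_model(model_name):
    model = getattr(torchvision.models, model_name)().eval()
    x = torch.rand((1, 3, 224, 224))  # non-zero input; see the later fix
    # ... convert with pytorch_to_mdf and compare outputs here ...
```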
Specifically, h5py, which isn't being used and was causing an unneeded dependency.
Hey @parikshit14, check the changes I have made. I tried to remove some duplicate code in how you were creating tests for these models. I wrote a function that picks up all the models in torchvision and creates a test for each. Most work, but 5 are still failing for various reasons. I think most are failing for simple reasons. One looks like it's because there is a parameter or something with a hyphen character in it (MDF probably can't handle this). A couple of others look like they are failing because of mismatched types for tensors passed to ONNX ops. I will be out next week, so I can't help with these right now.
…ples/pytorch-to-mdf
- We can't have '-' characters in port names in MDF, but some PyTorch models have these (a sanitization sketch follows this list).
- Also fixed an issue where models were being tested with all-zeros inputs, which made some models pass when in reality they were not computing the correct values. More bugs to track down ...
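A minimal sketch of the id sanitization idea (the helper name is hypothetical; the actual exporter code may differ):

```python
def make_mdf_id(name: str) -> str:
    """Rewrite characters that are invalid in MDF ids, e.g. '-' and '::'."""
    return name.replace("-", "_").replace("::", "_")
```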
The PyTorch exporter now removes constant ops from the graph and simply inserts them as parameters.
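A hedged sketch of that idea using the ONNX Python API (illustrative; it only handles Constant nodes carrying a 'value' tensor attribute, and the function name is hypothetical):

```python
from onnx import numpy_helper

def fold_constants(graph):
    """Split Constant nodes out of an ONNX graph into a parameter dict."""
    params, kept = {}, []
    for node in graph.node:
        if node.op_type == "Constant":
            # A Constant node carries its value in the 'value' attribute.
            tensor = next(a.t for a in node.attribute if a.name == "value")
            params[node.output[0]] = numpy_helper.to_array(tensor)
        else:
            kept.append(node)
    return kept, params
```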
ONNX BatchNormalization can return an optional number of outputs. To complicate this, the optional outputs are not allowed unless the attribute training_mode=1 is set. I have added a hardcoded check to handle this case.
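A sketch of what such a check could look like (the function name and attribute-dict shape are assumptions, not the actual code):

```python
def batch_norm_num_outputs(attributes: dict) -> int:
    """How many outputs ONNX BatchNormalization yields for these attributes."""
    if attributes.get("training_mode", 0) == 1:
        return 3  # Y, running_mean, running_var
    return 1      # Y only
```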
Torchvision models run through PyTorch and MDF do not produce exactly matching results. I have set the comparison tolerance to an absolute 1e-5 for now. I imagine the difference could come from a lot of things. One could be that we are not really controlling precision well in MDF models. Another could be that ONNX ops don't implement exactly the same algorithms that PyTorch ops do.
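In test terms, that tolerance amounts to something like the following (`mdf_out` and `torch_out` are hypothetical arrays holding the two results):

```python
import numpy as np

# Fails only if any element differs by more than 1e-5 in absolute terms.
np.testing.assert_allclose(mdf_out, torch_out, atol=1e-5)
```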
The outputs of the PyTorch and MDF models for this torchvision version of Inception are different. Need to investigate.
… into examples/pytorch-to-mdf
I think this is good to go now, @pgleeson and @parikshit14. Have a look and merge if you like. I actually fixed the issues with all the torchvision models except one, the Inception model mentioned above. Also, for some reason the macOS runs took a very long time; maybe it's related to this issue: actions/runner-images#1336
Ok @davidt0x, tested locally and it's running fine. Certainly there is more that can be done with all of this down the line, as well as for other PyTorch examples. Thanks again for your contributions to this, @parikshit14!