
update mdf_to_onnx branch to current scenario #310

Merged
39 commits merged on Oct 6, 2022

Conversation

parikshit14
Contributor

  • Updated the mdf_to_onnx branch to the current development version
  • Added the examples which are currently passing the tests successfully
  • Having trouble with standard torchvision examples (used directly from the torchvision library) when converted to MDF; something needs to be updated in the execution_engine or in the way the input image is defined

@parikshit14
Contributor Author

Hey @davidt0x, I am trying to add some predefined standard PyTorch models to the current examples folder, but there seems to be an issue with them.

  • I have tried to add the resnet18 model from the torchvision library and run it against some ImageNet sample images (also tried with a tensor of zeros), but the evaluate function fails while converting to an MDF model.

error: [screenshot attached: "Screenshot from 2022-08-10 21-17-49"]

We don't have a standard way of specifying types yet in MDF, but the current code is
definitely not good since the type is always just Tensor. It should now be aligned
with PyTorch and NumPy.
@davidt0x
Contributor

Hey @parikshit14,

I am taking a look at this. I fixed the issue on my end (it was a bug caused by not passing a tuple to the args parameter of pytorch_to_mdf). However, with that fixed, there are some issues with ONNX opset mismatches that I am cleaning up. I will try to push my changes tomorrow.
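For reference, here is a minimal sketch of the call pattern being discussed. The keyword names and return values of pytorch_to_mdf are assumptions based on how the converter is typically used; check the signature in your checkout.

```python
# Hedged sketch: the pytorch_to_mdf keyword names below are assumptions, not a spec.
import torch
import torchvision.models as models

from modeci_mdf.interfaces.pytorch import pytorch_to_mdf

model = models.resnet18(pretrained=True)
model.eval()

# Dummy ImageNet-shaped input; a real image tensor would be passed the same way.
x = torch.zeros((1, 3, 224, 224))

# Note the trailing comma: args must be a tuple, even for a single input.
mdf_model, params_dict = pytorch_to_mdf(
    model=model,
    args=(x,),
    trace=True,
)
```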

- Properly handle ONNX operations that return multiple outputs. Now expressions for
the value of an output port can index into the tuple, for example
OutputPort(id='_13', value="onnx_BatchNormalization_1[0]"). This was needed for ResNet.
This also fixed an existing hack for MaxPool. There might be more of these lying around;
need to take a better look. (A sketch of this output-port pattern follows this list.)
- In order to get the above working, I changed "onnx::" to "onnx_" in function/parameter ids generated by the exporter. This lets eval work, because
"::" can't be parsed in Python.
- Moved the modeci ONNX opset version from 13 to 15.
- Fixed issue where improper schema (wrong opset) was being looked up for
ops in the exporter now that ONNX has moved to 15.
I moved it to a better place and made it general for all Reshape ops,
not just ops with Reshape_1 ids.
This looks like a hack needed for multiple outputs from Clip op.
I think this is handled generally now.
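As a hedged illustration of the pattern named in the first item above, the snippet below builds an output port whose value expression indexes into the tuple returned by a multi-output ONNX op. The node/graph wiring is illustrative only, not the exporter's actual output.

```python
# Hedged sketch using the modeci_mdf.mdf classes; ids are taken from the example above.
from modeci_mdf.mdf import Node, OutputPort

node = Node(id="BatchNormalization_1")

# "onnx_BatchNormalization_1" is assumed to be the id of the node's function/parameter
# that wraps the ONNX BatchNormalization call; "[0]" selects the first element of the
# tuple of outputs it returns.
node.output_ports.append(
    OutputPort(id="_13", value="onnx_BatchNormalization_1[0]")
)
```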
@davidt0x
Contributor

Hey @parikshit14,

Take a look at these changes to your branch. I think they should fix a lot of the issues. It is passing locally on my end, but I am not sure if it will pass CI; let's see. This ended up being more involved than I thought it would be. Essentially, a lot of the problems were related to improper handling of multiple outputs from ONNX ops. We had some ugly op-specific hacks hanging around. I think I have cleaned this up now and it should be more general. It would probably be a good idea to also rerun all the PyTorch examples so that the MDF JSON in the repo gets updated.

@davidt0x
Contributor

I think this failure is intermittent and related to the simple_convolution model being float32 and the execution_engine using float64. Let me try to address that.

I am pretty sure this is related to not handling precision consistently
in the execution engine. We are converting back and forth between float32 and
float64 in different places. The results match intermittently
without a tolerance, so I have set an absolute tolerance for now. Need
to investigate this further.
@davidt0x
Contributor

Ok, well I didn't really address it. I modified the test to compare with a tolerance. I think this is related to how we are not handling precision in a disciplined way within the execution engine. I see we are converting back and forth between float32 and float64. The final result is being output as float64, upcast from float32. This is all kind of a mess; I think we need to handle it more systematically. I am not certain this is the issue, but it is certainly something we need to fix down the line anyway.
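For concreteness, here is a sketch of the tolerance-based comparison described above; the helper name is hypothetical and not the one used in the test suite.

```python
# Hedged sketch of comparing PyTorch and execution-engine outputs with an absolute tolerance.
import numpy as np


def assert_outputs_close(torch_out, mdf_out, atol=1e-5):
    # Upcast both sides to float64 so the comparison itself doesn't introduce yet
    # another precision mismatch, then compare with an absolute tolerance only.
    a = np.asarray(torch_out, dtype=np.float64)
    b = np.asarray(mdf_out, dtype=np.float64)
    np.testing.assert_allclose(a, b, atol=atol, rtol=0)
```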

@parikshit14
Contributor Author

Hi @davidt0x, thanks for the changes.

Have you tried to set up a new local environment using these changes? I am getting the same impossible-to-resolve dependency problem as the CI from the earlier commit.

Also, I was trying to add a few more examples and they were failing with a shape inference error caused by the presence of dropout layers. So I added model_name.eval() to get rid of the randomness, and now the tests run well. I hope it is not just a workaround to pass the tests!
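For context, this is the kind of change meant here (the model choice is just an example of a torchvision model that contains Dropout layers): eval() puts dropout and batch-norm layers into inference mode, so tracing and shape inference no longer see the training-only randomness.

```python
# Hedged sketch of the eval() step; alexnet is an illustrative model with Dropout layers.
import torchvision.models as models

model = models.alexnet(pretrained=True)
model.eval()  # switch Dropout/BatchNorm to inference mode before tracing to MDF
```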

@davidt0x
Contributor

Hey @parikshit14,

Yeah, there does seem to be an issue if you just try to do a pip install -e .[all] from a clean environment. I get something like this:

INFO: pip is looking at multiple versions of neuromllite to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of modelspec to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of jinja2 to determine which version is compatible with other requirements. This could take a while.
Collecting Jinja2<3.1
  Using cached Jinja2-3.0.2-py3-none-any.whl (133 kB)
INFO: pip is looking at multiple versions of graph-scheduler to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of attrs to determine which version is compatible with other requirements. This could take a while.
Collecting attrs>=21.1.0
  Using cached attrs-21.4.0-py2.py3-none-any.whl (60 kB)
INFO: pip is looking at multiple versions of pytorch-sphinx-theme to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of modeci-mdf[all] to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install modeci-mdf and modeci-mdf[all]==0.4.2 because these package versions have conflicting dependencies.

The conflict is caused by:
    modeci-mdf[all] 0.4.2 depends on modeci-mdf 0.4.2 (from C:\Users\david\Dropbox\princeton\PsyNeuLink_Stuff\mdf\MDF)
    psyneulink 0.12.0.1 depends on modeci-mdf<0.4.2 and >=0.3.4
    modeci-mdf[all] 0.4.2 depends on modelspec~=0.2.6
    neuromllite 0.5.1 depends on modelspec>=0.2.2
    psyneulink 0.12.0.0 depends on modelspec<0.2.6
    modeci-mdf[all] 0.4.2 depends on graph-scheduler>=1.1.1
    psyneulink 0.11.0.0 depends on graph-scheduler<1.1.1 and >=0.2.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

This does not fail in CI currently. I think that is because the CI script installs PsyNeuLink (and maybe modelspec) from the GitHub repos before installing anything else. Ultimately, I think this could be fixed easily by removing all these version pin caps (or at least the ones for modelspec and torch) from the psyneulink repo. I will push some changes to the CI workflow that I hope will cause this to manifest in CI.

Regarding your other question about Dropout: there were a few lines that seem to have been removed or moved from tests/conftest.py that try to make PyTorch more deterministic for testing. I will push some changes that should re-enable this. I have also refactored some of the model fixtures into a PyTorch-specific conftest.py under interfaces/pytorch. I think this keeps the main conftest.py cleaner.
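As a hedged sketch, a determinism setup like the one referred to here could look something like the autouse fixture below; the exact lines in tests/conftest.py may differ.

```python
# Hedged sketch of a determinism fixture for tests/conftest.py; not the repo's actual code.
import random

import numpy as np
import pytest
import torch


@pytest.fixture(autouse=True)
def seed_everything():
    """Seed all RNGs before each test so PyTorch model outputs are reproducible."""
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)
    yield
```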

These fixtures are only used by the PyTorch-to-MDF conversion tests. Let's keep them separate.
… been installed.

Currently, we are not testing whether core MDF works without PyTorch or PsyNeuLink, because
these are installed before the core package is installed (PsyNeuLink depends on PyTorch).
I have moved this install to the end, after installation of all backends (including PNL) from pip. This is better for testing that a clean install works.
… into examples/pytorch-to-mdf

Conflicts:
	tests/conftest.py
This should support both ways depending on version of torchvision.
@parikshit14, I changed some of your code to remove a lot of duplication.
test_import.py under pytorch now enumerates all models in torchvision and
creates a test for each different type of model. All the models have the
same interface so we can consolidate the testing code into a parameterized
pytest.

We still have 5 out of 20 tests failing; let's see if we can track down why
each of these models is failing to run in the execution engine.
@davidt0x
Contributor

Hey @parikshit14,

Check out the changes I have made. I tried to remove some duplicate code in how you were creating tests for these models. I wrote a function that picks up all the models in torchvision and creates a test for each. Most work, but 5 are still failing for various reasons. I think most are failing for simple reasons. One looks like it's because there is a parameter or something with a hyphen character in it (MDF probably can't handle this). A couple of others look like they are failing because of mismatched tensor types passed to ONNX ops. I will be out next week, so I can't help with these right now.
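A hedged sketch of that parameterized test structure is below; the enumeration and names are illustrative, not the actual contents of test_import.py.

```python
# Hedged sketch: enumerate torchvision model constructors and parameterize one test over them.
import inspect

import pytest
import torch
import torchvision.models as models


def torchvision_model_names():
    # Collect the top-level functions exposed by torchvision.models (mostly model builders).
    return sorted(
        name
        for name, obj in vars(models).items()
        if inspect.isfunction(obj) and not name.startswith("_")
    )


@pytest.mark.parametrize("model_name", torchvision_model_names())
def test_torchvision_model_to_mdf(model_name):
    ctor = getattr(models, model_name)
    try:
        model = ctor()  # defaults: no pretrained weights are downloaded
    except TypeError:
        pytest.skip(f"{model_name} is not a zero-argument model constructor")
    if not isinstance(model, torch.nn.Module):
        pytest.skip(f"{model_name} did not return an nn.Module")
    model.eval()
    x = torch.zeros((1, 3, 224, 224))  # dummy input that would be handed to the converter
    # Conversion to MDF and the execution-engine comparison would go here; the point is
    # that the models share a call interface, so one parameterized test covers them all.
    assert callable(model)
```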

parikshit14 and others added 13 commits August 24, 2022 18:03
- We can't have '-' characters in port names in MDF. Some PyTorch models have these. (A sketch of this sanitization follows this commit list.)
- Also fixed an issue where models were being tested with all-zeros inputs, which made some models pass when in reality they were not computing the correct values. More bugs to track down ...
The pytorch exporter now removes constant ops from the graph
and simply inserts them as parameters.
ONNX BatchNormalization can return an optional number of
outputs. To complicate this, the optional outputs are not
allowed unless the attribute training_mode=1 is set. I have
added a hardcoded check to handle this case.
Torchvision models run through PyTorch and MDF do not have
exactly matching results. I have set the comparison tolerance
to an absolute 1e-5 for now. I imagine the difference could
be a lot of things. One could be that we are not really
controlling precision well in MDF models. Another could be that
ONNX ops don't implement exactly the same algorithms that
PyTorch ops do.
The outputs of the PyTorch and MDF models for this torchvision
version of Inception are different. Need to investigate.
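The first commit in this batch mentions stripping '-' from port names; a minimal sketch of that kind of sanitization is below. The helper name is hypothetical, not the exporter's actual function.

```python
# Hedged sketch: map characters MDF port ids can't contain (such as '-') to underscores.
import re


def sanitize_port_id(raw_id: str) -> str:
    """Replace any character that is not a letter, digit, or underscore with '_'."""
    return re.sub(r"[^0-9a-zA-Z_]", "_", raw_id)


assert sanitize_port_id("features-0-weight") == "features_0_weight"
```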
@davidt0x davidt0x requested a review from pgleeson October 6, 2022 03:53
@davidt0x
Contributor

davidt0x commented Oct 6, 2022

I think this is good to go now, @pgleeson and @parikshit14. Have a look and merge if you like. I actually fixed the issues with all the torchvision models except one, inception_v3. This model converts and runs but gets very different results. Not sure what is going on with this one; I have never had an error like that before, and the error is massive! I will need to trace through it to figure it out, but I have set that test to xfail. I did a test run where I ran tests on all 60 or so torchvision models and all but 5 pass, but it takes too long to run in CI. Instead, I made it so the test only does one test per unique model class. The 5 failing tests in the run of 60 seem to be larger versions of ResNet which don't generate the same results between MDF and PyTorch. I will look into these as well down the line. It might be a good task for an Outreachy candidate as well.
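For illustration, marking the known-bad model as an expected failure in a parameterized test could look like the sketch below; the real test parameterization may express this differently, and the model list shown is illustrative.

```python
# Hedged sketch: xfail one entry of a parametrize list while the divergence is investigated.
import pytest

TORCHVISION_TEST_MODELS = [
    "resnet18",
    "alexnet",
    pytest.param(
        "inception_v3",
        marks=pytest.mark.xfail(
            reason="MDF and PyTorch outputs diverge for inception_v3; under investigation"
        ),
    ),
]
```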

Also, for some reason the macOS runs took very long to run; this may be related to this issue: actions/runner-images#1336

@pgleeson
Member

pgleeson commented Oct 6, 2022

Ok @davidt0x, tested locally and it's running fine. Certainly there is more that can be done with all of this down the line, as well as for other PyTorch examples.

Thanks again for your contributions to this @parikshit14!

@pgleeson pgleeson merged commit cd3e49a into development Oct 6, 2022