TF Updates #87

Merged
merged 32 commits into from Dec 4, 2019

Conversation

farizrahman4u commented Nov 27, 2019

No description provided.

farizrahman4u and others added 16 commits Nov 4, 2019
Signed-off-by: AlexDBlack <blacka101@gmail.com>
Author

farizrahman4u commented Nov 27, 2019

@AlexDBlack I can't seem to reproduce the dependency issue; the rest of the issues should be gone now.

Member

AlexDBlack left a comment

LGTM, other than maybe a minor tweak to the logging level in a couple of places.

Member

AlexDBlack commented Nov 27, 2019

> @AlexDBlack I cant seem to reproduce dep issue, rest of the issues should be gone now.

Still seeing the Windows gson dependency issue; the other issues are confirmed fixed. I'll look into that.

AlexDBlack added 2 commits Nov 27, 2019
Signed-off-by: AlexDBlack <blacka101@gmail.com>
… config case

Signed-off-by: Alex Black <blacka101@gmail.com>
AlexDBlack added 2 commits Nov 27, 2019
Signed-off-by: AlexDBlack <blacka101@gmail.com>
Member

AlexDBlack commented Nov 27, 2019

Confirmed passing now on Windows + Linux for CPU.

CUDA is failing, however, with java.lang.RuntimeException: cudaMemcpyAsync failed on most of the tests.
To reproduce: run the tests in nd4j-tests-tensorflow with the nd4j-tf-gpu and tf-gpu profiles.

For example:

java.lang.RuntimeException: cudaMemcpyAsync failed

	at org.nd4j.jita.allocator.pointers.cuda.cudaStream_t.synchronize(cudaStream_t.java:41)
	at org.nd4j.jita.flow.impl.SynchronousFlowController.commitTransfer(SynchronousFlowController.java:315)
	at org.nd4j.jita.handler.impl.CudaZeroHandler.relocate(CudaZeroHandler.java:406)
	at org.nd4j.jita.handler.impl.CudaZeroHandler.getDevicePointer(CudaZeroHandler.java:727)
	at org.nd4j.jita.allocator.impl.AtomicAllocator.getPointer(AtomicAllocator.java:325)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.naiveExec(CudaExecutioner.java:384)
	at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:572)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.equalsWithEps(BaseNDArray.java:4374)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.equals(BaseNDArray.java:4430)
	at org.nd4j.linalg.jcublas.JCublasNDArray.equals(JCublasNDArray.java:507)
	at org.junit.Assert.isEquals(Assert.java:131)
	at org.junit.Assert.equalsRegardingNull(Assert.java:127)
	at org.junit.Assert.assertEquals(Assert.java:111)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at org.nd4j.tensorflow.conversion.TensorflowConversionTest.testConversionFromNdArray(TensorflowConversionTest.java:59)

CUDA tests are also failing for me on master, though, with the following:

java.lang.UnsatisfiedLinkError: no jnitensorflow in java.library.path

	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
	at java.lang.Runtime.loadLibrary0(Runtime.java:870)
	at java.lang.System.loadLibrary(System.java:1122)
	at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1543)
...
Caused by: java.lang.UnsatisfiedLinkError: C:\Users\Alex\.javacpp\cache\tensorflow-1.15.0-1.5.2-windows-x86_64-gpu.jar\org\bytedeco\tensorflow\windows-x86_64-gpu\jnitensorflow.dll: Can't find dependent libraries
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1086)
	at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:1493)
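
As a rough aid for the first message in the trace above ("no jnitensorflow in java.library.path"), the following sketch checks whether a file with the platform-specific library name exists in any directory on java.library.path. The class name LibraryPathCheck and its find method are hypothetical and not part of ND4J or JavaCPP. It does not diagnose the underlying "Can't find dependent libraries" cause, which means the DLL itself was found but something it links against could not be loaded.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, for illustration only.
final class LibraryPathCheck {

    /** Returns the first file on searchPath whose name matches the platform-specific name of libName. */
    static Optional<Path> find(String libName, String searchPath) {
        // Maps e.g. "jnitensorflow" to "jnitensorflow.dll" on Windows or "libjnitensorflow.so" on Linux.
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String searchPath = System.getProperty("java.library.path", "");
        Optional<Path> hit = find("jnitensorflow", searchPath);
        System.out.println(hit.map(p -> "found: " + p).orElse("not on java.library.path"));
    }
}
```

A "not on java.library.path" result is expected when the native library is normally extracted and loaded from a cache directory, as the path in the trace suggests, so this check only narrows things down.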
raver119 and others added 3 commits Nov 27, 2019
Signed-off-by: raver119 <raver119@gmail.com>
Signed-off-by: raver119 <raver119@gmail.com>
Signed-off-by: AlexDBlack <blacka101@gmail.com>
Member

AlexDBlack commented Nov 28, 2019

OK, so I'm seeing most tests pass now on CUDA (Windows).

If I run GraphRunnerTest.testGraphRunnerFilePath in isolation, or run it as part of all tests in GraphRunnerTest, I get the following failure:

java.lang.AssertionError: 
Expected :[    2.0000,    4.0000,    6.0000,    8.0000]
Actual   :[         0,         0,         0,         0]

However, if I right click on org.nd4j.tensorflow.conversion under gpujava and run tests, I get this:

Note all tests are running, and GraphRunnerTest.testGraphRunnerFilePath passes - but GpuGraphRunnerTest.testGraphRunner now fails with exactly the same issue.

GpuGraphRunnerTest.testGraphRunner also fails when run in isolation.

[image: screenshot of the test run]

Note that adding Nd4j.getExecutioner().commit() and/or Thread.sleep(2000) after graphRunner.run(inputs) doesn't make any difference.


agibsonccc commented Nov 29, 2019

Member

AlexDBlack commented Dec 2, 2019

@farizrahman4u Any updates/progress here?

Author

farizrahman4u commented Dec 2, 2019

@AlexDBlack on it

Author

farizrahman4u commented Dec 3, 2019

@AlexDBlack try now?

Member

AlexDBlack left a comment

I'm not following how this was fixed (we previously talked about adding copyToHost ops), but I can confirm that tests are consistently passing for me now on CUDA.

Author

farizrahman4u commented Dec 4, 2019

I got curious and created a buffer from the device pointer. As I suspected, it had the same behavior as the host array: the first test case of a test suite always fails and the rest pass, even if there is a test identical to the first one. (This also means test cases would fail when run individually, because each one is the "first" in that session.) So I simply added a "warm up" functionality to work around this.
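
The "warm up" idea described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: WarmUpRunner and FirstCallZeroBackend are hypothetical names, and the backend is a stand-in that returns zeros on its first call and doubles its input afterwards (the doubling is an assumption chosen to match the expected values reported earlier in this thread). The wrapper performs one throwaway call before the first real one and discards its result.

```java
import java.util.Arrays;
import java.util.function.Function;

/** Hypothetical wrapper: runs the delegate once and discards the result before the first real call. */
final class WarmUpRunner<I, O> implements Function<I, O> {
    private final Function<I, O> delegate;
    private boolean warmedUp = false;

    WarmUpRunner(Function<I, O> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized O apply(I input) {
        if (!warmedUp) {
            delegate.apply(input); // throwaway call; its (possibly wrong) result is ignored
            warmedUp = true;
        }
        return delegate.apply(input);
    }
}

/** Stand-in for a backend whose first call in a session returns all zeros (assumed behavior). */
final class FirstCallZeroBackend implements Function<double[], double[]> {
    private boolean firstCall = true;

    @Override
    public double[] apply(double[] in) {
        double[] out = new double[in.length];
        if (firstCall) {
            firstCall = false;
            return out; // all zeros, mimicking the reported failure
        }
        for (int i = 0; i < in.length; i++) {
            out[i] = 2 * in[i];
        }
        return out;
    }
}

final class WarmUpDemo {
    public static void main(String[] args) {
        double[] input = {1, 2, 3, 4};
        // Without warm-up, the first call returns zeros.
        System.out.println(Arrays.toString(new FirstCallZeroBackend().apply(input)));
        // With warm-up, the first visible result is already correct.
        System.out.println(Arrays.toString(new WarmUpRunner<>(new FirstCallZeroBackend()).apply(input)));
    }
}
```

A warm-up like this hides the first-call failure rather than explaining it, so the root cause of the zeroed first result would still be worth tracking down separately.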

AlexDBlack merged commit 0d14032 into master Dec 4, 2019
1 check failed
continuous-integration/jenkins/pr-head The build of this commit was aborted
AlexDBlack deleted the fr_tf_updates branch Dec 4, 2019