Training Mocha Networks on AWS #220

Open
zacsketches opened this issue Oct 28, 2016 · 3 comments


zacsketches commented Oct 28, 2016

After using many deep learning frameworks, I find that I enjoy programming ML in Julia, and I prefer Mocha's syntax over MXNet's. I'm therefore interested in growing the Mocha documentation and examples from their current state into a platform for learning deep learning.

With this goal in mind, I wrote an extension to the MNIST tutorial earlier this month that demonstrated learning curves.
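For anyone following along, the learning-curve data in that extension comes from Mocha's standard coffee-lounge statistics mechanism. A minimal sketch (the file name and intervals are illustrative, and `solver` and a validation network `test_net` are assumed to be set up already):

```julia
# Record statistics so learning curves can be plotted later.
setup_coffee_lounge(solver, save_into="snapshots/statistics.jld", every_n_iter=1000)
add_coffee_break(solver, TrainingSummary(), every_n_iter=100)                 # training loss
add_coffee_break(solver, ValidationPerformance(test_net), every_n_iter=1000) # validation accuracy
```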

I'd now like to write an extension of the CIFAR-10 tutorial that shows how to train the model on AWS with a GPU instance. However, I'm having trouble running the GPU backend. I have a p2.xlarge instance provisioned with the NVIDIA tools on Ubuntu 14.04, and nvidia-smi runs correctly.
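For context, I'm enabling the GPU backend in the standard way from the Mocha docs, nothing custom:

```julia
# The CUDA flag must be set before Mocha is loaded.
ENV["MOCHA_USE_CUDA"] = "true"
using Mocha

backend = GPUBackend()
init(backend)
# ... build and train the network here ...
shutdown(backend)
```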

Pkg.test("Mocha") fails with the following output:

```
-- Testing Pooling(Mocha.Pooling.Max)  on Mocha.GPUBackend{Float64}...
    > Setup
ERROR: LoadError: LoadError: Bad param
 [inlined code] from /home/ubuntu/.julia/v0.4/Mocha/src/cuda/cudnn.jl:53
 in set_pooling_descriptor at /home/ubuntu/.julia/v0.4/Mocha/src/cuda/cudnn.jl:412
 in setup_etc at /home/ubuntu/.julia/v0.4/Mocha/src/cuda/layers/pooling.jl:19
 in setup at /home/ubuntu/.julia/v0.4/Mocha/src/layers/pooling.jl:74
 in test_pooling_layer at /home/ubuntu/.julia/v0.4/Mocha/test/layers/pooling.jl:46
 in test_pooling_layer at /home/ubuntu/.julia/v0.4/Mocha/test/layers/pooling.jl:165
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:320
 in anonymous at /home/ubuntu/.julia/v0.4/Mocha/test/runtests.jl:26
 in map_to! at abstractarray.jl:1286
 in map at abstractarray.jl:1308
 in test_dir at /home/ubuntu/.julia/v0.4/Mocha/test/runtests.jl:25
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:320
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading /home/ubuntu/.julia/v0.4/Mocha/test/layers/pooling.jl, in expression starting on line 173
while loading /home/ubuntu/.julia/v0.4/Mocha/test/runtests.jl, in expression starting on line 85
=======================================[ ERROR: Mocha ]========================================

failed process: Process(`/usr/bin/julia --check-bounds=yes --code-coverage=none --color=yes /home/ubuntu/.julia/v0.4/Mocha/test/runtests.jl`, ProcessExited(1)) [1]

===============================================================================================
ERROR: Mocha had test errors
 in error at ./error.jl:21
 in test at pkg/entry.jl:803
 in anonymous at pkg/dir.jl:31
 in cd at file.jl:22
 in cd at pkg/dir.jl:31
 in test at pkg.jl:71
```

I'm going to keep troubleshooting this, but if anyone has a successful path for running Mocha on AWS, please let me know your setup.


zacsketches commented Oct 29, 2016

After trying six or seven different AMIs and dozens of Julia/Mocha version combinations, the following configuration allows training on AWS (a version-pinning sketch follows the list):

- p2.xlarge instance
- Bitfusion Deep Learning AMI
- Julia v0.4.7, built from source
- Mocha v0.1.2
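To reproduce the package side (a sketch, assuming the v0.1.2 tag is available in METADATA; Julia v0.4.7 itself has to be built from source separately):

```julia
# Pin Mocha to the known-good release, then rebuild and re-test.
Pkg.add("Mocha")
Pkg.pin("Mocha", v"0.1.2")
Pkg.build("Mocha")
Pkg.test("Mocha")
```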

Results from this setup on the CIFAR-10 example:

```
29-Oct 15:22:43:INFO:root:  Accuracy (avg over 10000) = 78.8200%
29-Oct 15:22:43:INFO:root:---------------------------------------------------------
29-Oct 15:22:43:INFO:root:
29-Oct 15:22:43:DEBUG:root:Destroying network CIFAR10-train
29-Oct 15:22:43:DEBUG:root:Destroying network CIFAR10-test
29-Oct 15:22:43:INFO:root:Shutting down CuDNN backend...
29-Oct 15:22:43:INFO:root:CuDNN Backend shutdown finished!

real    19m13.617s
user    14m1.893s
sys     5m12.049s
```

There are build specifics required to make this combination work, which I'm going to document in a new tutorial on training Mocha in the cloud. When the new tutorial is up I'll close this issue.

@pluskid there is an unmistakable build error in compatibility.jl related to the way you are trying to identify the BLAS library. Any version of Julia past 0.4.7 will not build Mocha correctly, which is probably the culprit for the failing Travis builds. This might belong in a separate issue, but it was the last hurdle I had to clear to find a working combination on AWS.
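To show the shape of the problem rather than the exact code: the Base names used to identify the BLAS library changed between Julia releases, so the lookup has to be gated on VERSION. A hypothetical sketch (illustrative of the pattern, not Mocha's actual code):

```julia
# Hypothetical sketch: version-gate the BLAS identification, since the
# relevant Base names moved between Julia releases.
if VERSION < v"0.5.0-"
    const BLAS_LIB = Base.libblas_name           # Julia 0.4: library name constant
else
    const BLAS_LIB = string(Base.BLAS.vendor())  # newer Julia: vendor() symbol
end
```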


pluskid commented Nov 1, 2016

Thanks! I've been a bit busy recently; I'll take a look at the BLAS issue when I have a chance. Could you open a separate issue for it so we can track it?

zacsketches commented

@pluskid I had a busy week and couldn't get back to Mocha until now. This weekend I'll create a new issue describing the BLAS problem in enough detail that you should be able to fix it.

I also found a few minor errors in my last tutorial.

  1. The summit request shows the wrong link in the image.
  2. The times on the final image need correcting.
