
Training a model fails #4

Open
randomrandom opened this issue Jul 15, 2016 · 9 comments
@randomrandom

Hi, I tried to run the command from the tutorial for model training, but it failed with the following error:

 CUDA_VISIBLE_DEVICES=0 th feedforward_neural_doodle.lua -model_name skip_noise_4 -masks_hdf5 data/starry/gen_doodles.hdf5 -batch_size 4 -num_mask_noise_times 0 -num_noise_channels 0 -learning_rate 1e-1 -half false
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/hdf5/group.lua:312: HDF5Group:read() - no such child 'style_img' for [HDF5Group 33554432 /]
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/hdf5/group.lua:312: in function 'read'
    feedforward_neural_doodle.lua:49: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

any ideas why hdf5 might fail with such error?
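The error says the Lua reader cannot find a `style_img` child at the top level of the hdf5 file. A minimal sketch (assuming the h5py Python package, which is not mentioned in the thread) for inspecting what the generated file actually contains:

```python
import h5py  # assumes the h5py package is installed

def hdf5_children(path):
    """Return the names of the top-level groups/datasets in an hdf5 file."""
    with h5py.File(path, 'r') as f:
        return list(f.keys())

# e.g. hdf5_children('data/starry/gen_doodles.hdf5') should include a
# 'style_img' entry once generate.py has run to completion; an
# interrupted run can leave the file without it.
```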

@DmitryUlyanov
Owner

Did you generate the hdf5 file first?

@randomrandom
Author

Yes. Initially I thought something went wrong with the generation, since this script never completed:

python generate.py --n_jobs 30 --n_colors 4 --style_image data/starry/style.png --style_mask data/starry/style_mask.png --out_hdf5 data/starry/gen_doodles.hdf5
even though a new hdf5 file was generated

So I decided to try the sample command you put in the README, assuming it would use a sample hdf5 file from the repo; unfortunately it made no difference.

Is it possible that both fail due to a bad hdf5 setup?

@DmitryUlyanov
Owner

There's no sample hdf5 file in the repo, since it is too large. You should let the generation script run until it finishes.

@randomrandom
Author

randomrandom commented Jul 16, 2016

Thanks, I'll try that! How much time does it take on your setup?

Do you advise increasing the number of jobs? I'm using a Tesla K10 setup.

@randomrandom
Author

randomrandom commented Jul 16, 2016

I managed to get it working; unfortunately it looks like the VRAM (3.5GB) is not enough. What's the best way to reduce the memory footprint?

p.s.: I'm familiar with Johnson's implementation and know what I can do there, but I still haven't read your blogpost and the code documentation :(

Edit 1: At first glance, it looks like reducing batch_size and n_colors might do the trick? I had increased them to 8; maybe that's why it fails.

Edit 2: Is it even possible to squeeze the training into 3.5GB? I started going through the code and noticed that you are already doing a lot of memory optimizations (e.g. using cudnn and the Adam optimizer).

@DmitryUlyanov
Owner

Try batch_size = 1 and do not change n_colors; you can also downsize the image, to 256x256 for example.
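A minimal sketch of the downsizing step (assuming NumPy; nearest-neighbor resampling is a deliberate choice so the discrete color labels in the style mask don't get blended into new values, which the thread does not spell out):

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) array.

    Nearest-neighbor keeps the mask's color labels discrete; any
    interpolating resize would invent blended colors that n_colors
    quantization never produced.
    """
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

# Example: shrink a 512x512 RGB image to 256x256 before generating the hdf5.
img = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
small = resize_nearest(img, 256, 256)
print(small.shape)  # (256, 256, 3)
```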

@randomrandom
Author

Looks like batch_size=1 did the trick; I previously tried 2 and 3 with no success. Does this affect the quality or just the speed of the training?

@DmitryUlyanov
Owner

The quality will be ok; I used batch_size = 1. But at test time you need to experiment with model:evaluate() or model:training().
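For context on why the mode matters (a hedged NumPy illustration, not the repo's code): batch-normalization layers normalize with the current batch's statistics in training mode but with stored running statistics in evaluation mode, and with batch_size = 1 the two can diverge noticeably:

```python
import numpy as np

def bn_train(x):
    # training mode: normalize by this input's own mean/variance
    return (x - x.mean()) / np.sqrt(x.var() + 1e-5)

def bn_eval(x, running_mean, running_var):
    # evaluation mode: normalize by statistics accumulated during training
    return (x - running_mean) / np.sqrt(running_var + 1e-5)

x = np.random.randn(1, 64, 64) * 3.0 + 2.0  # one test image's feature map
y_train = bn_train(x)
y_eval = bn_eval(x, running_mean=0.0, running_var=1.0)
# The two outputs differ, which is why the generated texture can look
# different depending on which mode the network is left in at test time.
```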

@randomrandom
Author

BTW, do you recommend this repo for artistic neural style transfer? To do it well, there should probably be some semantic analysis that determines the masks. Is there any other approach you can recommend?
