This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

Error:neon --gpu nervanagpu examples/convnet/i1k-alexnet-fp32.yaml #26

Closed
andyyuan78 opened this issue May 16, 2015 · 5 comments

Comments

@andyyuan78

ubgpu@ubgpu:~/github/neon/neon$ neon --gpu nervanagpu examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.util.persist:deserializing object from: examples/convnet/i1k-alexnet-fp32.yaml
WARNING:neon.datasets.imageset:Imageset initialized with dtype <type 'numpy.float32'>
2015-05-15 22:00:54,319 WARNING:neon - setting log level to: 20
2015-05-15 22:00:54,447 INFO:gpu - Initialized NervanaGPU with stochastic_round=None
2015-05-15 22:00:54,447 INFO:gpu - Seeding random number generator with: None
2015-05-15 22:00:54,448 INFO:init - NervanaGPU backend, RNG seed: None, numerr: None
2015-05-15 22:00:54,449 INFO:mlp - Layers:
ImageDataLayer d0: 3 x (224 x 224) nodes
ConvLayer conv1: 3 x (224 x 224) inputs, 64 x (55 x 55) nodes, RectLin act_fn
PoolingLayer pool1: 64 x (55 x 55) inputs, 64 x (27 x 27) nodes, Linear act_fn
ConvLayer conv2: 64 x (27 x 27) inputs, 192 x (27 x 27) nodes, RectLin act_fn
PoolingLayer pool2: 192 x (27 x 27) inputs, 192 x (13 x 13) nodes, Linear act_fn
ConvLayer conv3: 192 x (13 x 13) inputs, 384 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv4: 384 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
ConvLayer conv5: 256 x (13 x 13) inputs, 256 x (13 x 13) nodes, RectLin act_fn
PoolingLayer pool3: 256 x (13 x 13) inputs, 256 x (6 x 6) nodes, Linear act_fn
FCLayer fc4096a: 9216 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout1: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc4096b: 4096 inputs, 4096 nodes, RectLin act_fn
DropOutLayer dropout2: 4096 inputs, 4096 nodes, Linear act_fn
FCLayer fc1000: 4096 inputs, 1000 nodes, Softmax act_fn
CostLayer cost: 1000 nodes, CrossEntropy cost_fn

2015-05-15 22:00:54,449 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,450 INFO:val_init - Generating AutoUniformValGen values of shape (363, 64)
2015-05-15 22:00:54,452 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,453 INFO:val_init - Generating AutoUniformValGen values of shape (1600, 192)
2015-05-15 22:00:54,458 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,459 INFO:val_init - Generating AutoUniformValGen values of shape (1728, 384)
2015-05-15 22:00:54,469 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,470 INFO:val_init - Generating AutoUniformValGen values of shape (3456, 256)
2015-05-15 22:00:54,483 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,484 INFO:val_init - Generating AutoUniformValGen values of shape (2304, 256)
2015-05-15 22:00:54,492 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,493 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 9216)
2015-05-15 22:00:54,964 INFO:batch_norm - BatchNormalization set to train mode
2015-05-15 22:00:54,965 INFO:val_init - Generating AutoUniformValGen values of shape (4096, 4096)
2015-05-15 22:00:55,175 INFO:val_init - Generating AutoUniformValGen values of shape (1000, 4096)
2015-05-15 22:00:55,229 WARNING:imageset - Batch dir cache not found in /home/ubgpu/data/I1K/imageset_batches/dataset_cache.pkl:
Press Y to create, otherwise exit: Y
/usr/local/lib/python2.7/dist-packages/neon/util/batch_writer.py:137: RuntimeWarning: divide by zero encountered in log10
self.val_start = 10 ** int(np.log10(self.ntrain * 10))
Traceback (most recent call last):
  File "/usr/local/bin/neon", line 199, in <module>
    experiment, result, status = main()
  File "/usr/local/bin/neon", line 168, in main
    result = experiment.run()
  File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit_predict_err.py", line 97, in run
    super(FitPredictErrorExperiment, self).run()
  File "/usr/local/lib/python2.7/dist-packages/neon/experiments/fit.py", line 70, in run
    self.dataset.load()
  File "/usr/local/lib/python2.7/dist-packages/neon/datasets/imageset.py", line 176, in load
    self.bw.run()
  File "/usr/local/lib/python2.7/dist-packages/neon/util/batch_writer.py", line 215, in run
    self.write_csv_files()
  File "/usr/local/lib/python2.7/dist-packages/neon/util/batch_writer.py", line 137, in write_csv_files
    self.val_start = 10 ** int(np.log10(self.ntrain * 10))
OverflowError: cannot convert float infinity to integer
ubgpu@ubgpu:~/github/neon/neon$

@apark263
Contributor

Do you have the ImageNet data files (the tar files containing the images)?
They are not distributed as part of neon; you need to get them from
ILSVRC in order to run the ImageNet example.

Alex


@andyyuan78
Author

Yes, I have them, and I changed the path in the -fp32.yaml file.

@apark263
Contributor

Can you confirm that the following files are in $repo_path/I1K (where
repo_path is set as specified in the yaml file):

ILSVRC2012_img_train.tar
ILSVRC2012_img_val.tar
ILSVRC2012_devkit_t12.tar.gz

From the error, it seems the batch_writer is not finding the train tar
file.
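
A quick way to check for the three files listed above is a small script like the following. This is only a sketch: the helper name `missing_archives` is hypothetical and not part of neon, and it assumes the archives sit directly in the I1K directory under repo_path.

```python
import os

# File names are the ones listed in this comment; the helper is hypothetical.
REQUIRED_ARCHIVES = (
    "ILSVRC2012_img_train.tar",
    "ILSVRC2012_img_val.tar",
    "ILSVRC2012_devkit_t12.tar.gz",
)


def missing_archives(repo_path):
    """Return the required archive names that are not in repo_path/I1K."""
    data_dir = os.path.join(repo_path, "I1K")
    return [
        name
        for name in REQUIRED_ARCHIVES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

Calling `missing_archives(...)` with the repo_path value from the yaml file returns an empty list when all three archives are in place, and the names of the missing ones otherwise.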


@andyyuan78
Author

cool. it works!

@andyyuan78
Author

A small suggestion: maybe we should provide a meaningful debug/error message. ;)
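
For illustration, the failing expression from the traceback, `10 ** int(np.log10(self.ntrain * 10))`, breaks when `ntrain` is 0: `np.log10(0)` returns `-inf` with a RuntimeWarning, and `int(-inf)` raises the OverflowError. A guard along these lines would report the real cause instead. This is a sketch only; the function name `val_start_for` and the message wording are assumptions, not neon's actual code.

```python
import numpy as np


def val_start_for(ntrain):
    """Same arithmetic as the traceback line, with a clear error for ntrain == 0."""
    if ntrain <= 0:
        # Hypothetical message; the thread traced the error to a missing train tar.
        raise ValueError(
            "No training images found (ntrain=%d). Check that "
            "ILSVRC2012_img_train.tar is in the I1K directory under repo_path."
            % ntrain
        )
    return 10 ** int(np.log10(ntrain * 10))
```

With this check, an empty training set stops with a message that points at the missing archive rather than at a float-to-int conversion.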
