Error when fine-tuning Caffe model using DIGITS #347

antran89 · 2015-10-07T08:48:54Z

Hi developers,
I am sorry if I am posting in the wrong place. I searched the user group but did not find any clues, so I am reporting a bug here.
I was fine-tuning the bvlc_reference_caffenet model, following the Caffe tutorial for fine-tuning on the flickr_style dataset. With Caffe alone it ran fine, but it is hard to visualize the loss values, so I decided to use DIGITS to fine-tune my models on my data. The difference is that I use the latest versions of DIGITS and NVIDIA-Caffe for fine-tuning.
Then it generates this error:

2015-10-07 14:14:02 [20151007-141401-e882] [INFO ] Train Caffe Model task started.
2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-07 14:14:09 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-07 14:14:09 [20151007-141401-e882] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-07 14:14:09 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
127.0.0.1 - - [2015-10-07 14:23:03] "POST /jobs/20151007-141401-e882/abort HTTP/1.1" 200 128 0.002188

Hope to receive your comments.


antran89 · 2015-10-07T09:46:01Z

My train_val.prototxt and solver.prototxt are attached!

lukeyeager · 2015-10-07T20:59:29Z

Are you sure this is an error? Is this a duplicate of #140? See this comment - #140 (comment).
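
To make the question above concrete, here is a minimal sketch that separates the deprecation messages from any other [ERROR] lines in a DIGITS log. It assumes the line layout seen in the log quoted in this thread, and it assumes that the "deprecated" messages are upgrade notices rather than failures. The function name `split_errors` and the phrase list are hypothetical and not part of DIGITS.

```python
import re

# Layout assumed from the log lines quoted in this thread:
# "<date> <time> [<job id>] [<LEVEL>] <message>"
LINE_RE = re.compile(r"^(\S+ \S+) \[([^\]]+)\] \[(\w+)\s*\] (.*)$")

# Hypothetical phrase list, copied from the messages in the log above.
UPGRADE_NOTICES = (
    "Attempting to upgrade input file specified using deprecated",
    "Note that future Caffe releases will only support",
)


def split_errors(lines):
    """Return (notices, other) for the [ERROR] lines in a DIGITS log."""
    notices, other = [], []
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group(3) != "ERROR":
            continue  # skip INFO lines and HTTP access-log lines
        message = match.group(4)
        if any(phrase in message for phrase in UPGRADE_NOTICES):
            notices.append(message)
        else:
            other.append(message)
    return notices, other
```

If `other` comes back empty for a job log, the [ERROR] lines alone do not explain why training stalls, and the raw Caffe log is the next place to look.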

antran89 · 2015-10-08T15:47:33Z

Hi @lukeyeager
I downloaded the Caffe model again using the script provided in the latest Caffe. The error is still the same. Training did not progress after a few minutes, so I aborted the job.

2015-10-08 23:37:03 [20151008-233702-fa10] [INFO ] Train Caffe Model task started.
2015-10-08 23:37:11 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-08 23:37:11 [20151008-233702-fa10] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-08 23:37:11 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-08 23:37:15 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-08 23:37:15 [20151008-233702-fa10] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-08 23:37:15 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
127.0.0.1 - - [2015-10-08 23:40:21] "POST /jobs/20151008-233702-fa10/abort HTTP/1.1" 200 128 0.004213
2015-10-08 23:40:26 [20151008-233702-fa10] [INFO ] Job deleted.
127.0.0.1 - - [2015-10-08 23:40:26] "DELETE /jobs/20151008-233702-fa10 HTTP/1.1" 200 128 0.004342

lukeyeager · 2015-10-08T18:18:36Z

Can you post the raw caffe log? I suspect it may just be really slow to respond.

antran89 · 2015-10-09T02:30:57Z

Hi @lukeyeager.
This is my caffe log file [http://pastebin.com/DaC8WNLD]. I observe that the "Data layer prefetch queue empty" message is the same as last time.

lukeyeager · 2015-10-09T16:10:57Z

I1009 10:07:20.342645 14891 solver.cpp:314] Iteration 0, Testing net (#0)
I1009 10:07:20.354153 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.405398 14897 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.405930 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.467383 14897 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.486361 14897 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.511524 14891 blocking_queue.cpp:50] Data layer prefetch queue empty
...
I1009 10:16:44.137663 14891 blocking_queue.cpp:50] Data layer prefetch queue empty
I1009 10:16:44.168421 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.213986 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.366152 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.383676 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.427858 14899 blocking_queue.cpp:50] Waiting for data

How big is your dataset? 9 minutes seems like it should be plenty long enough to get some kind of a response, but if you're trying to train on all of ImageNet with a small GPU, I can see how the validation pass might take longer than that.

Can you try fine-tuning on a small dataset? Have you been able to use DIGITS to train any other model successfully on your machine?
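
The "9 minutes" figure can be read off the log timestamps directly. A minimal sketch, assuming the glog prefix format visible in the excerpt above (severity letter plus MMDD, then HH:MM:SS.microseconds) and assuming the year 2015 from the dates in this thread; the helper names are hypothetical.

```python
from datetime import datetime


def glog_time(line, year=2015):
    """Parse the timestamp of one glog line such as 'I1009 10:07:20.342645 ...'."""
    fields = line.split()
    month_day = fields[0][1:]  # drop the leading severity letter (I, W, E, F)
    stamp = f"{year}{month_day} {fields[1]}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def stalled_seconds(lines, marker="Data layer prefetch queue empty", year=2015):
    """Seconds between the first and last log line containing `marker`."""
    times = [glog_time(line, year) for line in lines if marker in line]
    if len(times) < 2:
        return 0.0
    return (max(times) - min(times)).total_seconds()
```

Running `stalled_seconds` over the full Caffe log shows how long the solver has been waiting on the data layer without completing the first test pass.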

antran89 · 2015-10-10T05:51:17Z

@lukeyeager Hi,
For your information, I followed the tutorial for the NVIDIA Deep Learning course [https://developer.nvidia.com/deep-learning-courses] and ran the training. It was successful with DIGITS on my local machine with NVIDIA-Caffe 0.12. I had never fine-tuned with DIGITS before. Then I upgraded NVIDIA-Caffe to the master branch and this error happened.
From what I read in NVIDIA/caffe#41, it looks like a problem with post-0.12 versions of NVIDIA-Caffe.

antran89 · 2015-10-10T08:04:46Z

Hi @lukeyeager
I can confirm that the same thing happens when fine-tuning even on the latest version of BVLC-Caffe. After a few iterations, Caffe shuts down my workstation. I have reproduced the error multiple times and still don't know why. It might be because of my train_val.prototxt file or the Caffe binary model 😓. I do not know why I succeeded the first time.
I will try to localize the problem and report this error to the Caffe developers.

antran89 · 2015-10-11T02:31:52Z

Hi @lukeyeager
I took your advice and tried fine-tuning on the Flickr example provided in Caffe. A serious problem occurred on my computer.
Could you please run the Flickr fine-tuning example in NVIDIA-Caffe or BVLC-Caffe for me?
Whenever I run it, after some time it restarts my workstation, whether I run it on CPU or GPU. Please help me confirm whether the same happens on your machine.
Thanks,

gsxuan1127 · 2015-10-12T03:24:40Z

I have met the same problem:
I1009 10:16:44.137663 14891 blocking_queue.cpp:50] Data layer prefetch queue empty
I1009 10:16:44.168421 14899 blocking_queue.cpp:50] Waiting for data
And after a short wait, the Ubuntu system shut down.
I have tried many times, but the problem is still not solved.

deepsemantic · 2015-10-18T12:59:26Z

Hi @antran89 and @gsxuan1127 ,
I have the same error, but the difference is that my Ubuntu did not shut down. Do you already know how to fix this problem? I am using the Caffe I downloaded one week ago.

antran89 · 2015-12-19T05:35:08Z

Hi @deepsemantic, @gsxuan1127, @lukeyeager,
Sorry, I did not close this issue earlier. After some time, I figured out that the restarting problem is related to power issues. It might be that your workstation does not supply enough power for the GPUs (in my case, a Titan X).
Please check your power supply and power cables, and try disabling GPU auto-boost if that does not solve your problem.
Here is my summary of how to set up a proper workstation for Deep Learning: https://drive.google.com/open?id=1McEMHewtMOwhIyUSO_2AZxbeEwRihjF2Ql_eYQ6bq2o
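
For anyone checking the power side, a minimal sketch that reads the current draw and the configured power limit per GPU through `nvidia-smi` (the `power.draw` and `power.limit` query fields). It assumes `nvidia-smi` is on the PATH; the function names are hypothetical. It reports only what the driver sees and cannot tell whether the power supply itself is adequate.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,power.draw,power.limit",
    "--format=csv,noheader,nounits",
]


def _watts(field):
    """Convert one CSV field to float watts, or None if the GPU does not report it."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_power(csv_text):
    """Parse 'name, draw, limit' rows as printed by the query above."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name would not break parsing.
        name, draw, limit = [part.strip() for part in line.rsplit(",", 2)]
        rows.append({"name": name, "draw_w": _watts(draw), "limit_w": _watts(limit)})
    return rows


def read_gpu_power():
    """Run nvidia-smi and return one dict per GPU, or [] if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_power(result.stdout)
```

Calling `read_gpu_power()` periodically while a training job runs shows whether the draw climbs toward the limit shortly before the machine restarts.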

lukeyeager · 2015-12-28T17:39:35Z

Nice presentation @antran89 - thanks for the link!

antran89 closed this as completed Oct 7, 2015

antran89 reopened this Oct 7, 2015

lukeyeager mentioned this issue Oct 10, 2015

Training stops at iteration 0 with no error message or probable cause? NVIDIA/caffe#41

Closed

lukeyeager added the bug label Oct 12, 2015

antran89 closed this as completed Dec 19, 2015
