Error when fine-tuning Caffe model using DIGITS #347

Closed
antran89 opened this issue Oct 7, 2015 · 13 comments
@antran89

antran89 commented Oct 7, 2015

Hi developers,
I am sorry if I am posting in the wrong place. I searched the user group but did not find any clues, so I am reporting the bug here.
I am fine-tuning the bvlc_reference_caffenet model, following the Caffe tutorial for fine-tuning on the flickr_style dataset. With plain Caffe it ran fine, but the loss values are hard to visualize, so I decided to use DIGITS to fine-tune the model on my own data. The difference is that I am using the latest versions of DIGITS and NVIDIA-Caffe for the fine-tuning.
It then produces this error:

2015-10-07 14:14:02 [20151007-141401-e882] [INFO ] Train Caffe Model task started.
2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-07 14:14:09 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-07 14:14:09 [20151007-141401-e882] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-07 14:14:09 [20151007-141401-e882] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
127.0.0.1 - - [2015-10-07 14:23:03] "POST /jobs/20151007-141401-e882/abort HTTP/1.1" 200 128 0.002188

Hope to receive your comments.

@antran89
Author

antran89 commented Oct 7, 2015

My train_val.prototxt and solver.prototxt are attached!

@antran89 antran89 closed this as completed Oct 7, 2015
@antran89 antran89 reopened this Oct 7, 2015
@lukeyeager
Member

Are you sure this is an error? Is this a duplicate of #140? See this comment - #140 (comment).
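
To check whether a DIGITS log contains anything beyond these notices, one option is to filter them out. The sketch below is only an illustration: it assumes the two message prefixes quoted in this thread are informational upgrade notices from Caffe (which is what the question above suggests, but is not confirmed here), and the function names are hypothetical helpers, not part of DIGITS.

```python
import re

# Message prefixes copied from the log lines quoted in this thread.
# Treating them as harmless upgrade notices is an assumption.
UPGRADE_NOTICE_PREFIXES = (
    "Attempting to upgrade input file specified using deprecated",
    "Note that future Caffe releases will only support",
)

# Layout seen in the quoted log: date, time, [job id], [LEVEL], message.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<job>[^\]]+)\] \[(?P<level>[A-Z]+)\s*\] (?P<msg>.*)$"
)


def is_upgrade_notice(line):
    """True if the line is an ERROR-level line carrying an upgrade notice."""
    m = LINE_RE.match(line)
    if m is None or m.group("level") != "ERROR":
        return False
    return any(prefix in m.group("msg") for prefix in UPGRADE_NOTICE_PREFIXES)


def real_errors(lines):
    """Return ERROR-level lines that are not upgrade notices."""
    kept = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") == "ERROR" and not is_upgrade_notice(line):
            kept.append(line)
    return kept


# Three lines taken from the first log posted in this issue.
SAMPLE = [
    "2015-10-07 14:14:02 [20151007-141401-e882] [INFO ] "
    "Train Caffe Model task started.",
    "2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: "
    "Attempting to upgrade input file specified using deprecated "
    "transformation parameters: /home/tranlaman/BLVC-caffe/models/"
    "bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel",
    "2015-10-07 14:14:08 [20151007-141401-e882] [ERROR] Train Caffe Model: "
    "Note that future Caffe releases will only support transform_param "
    "messages for transformation fields.",
]

if __name__ == "__main__":
    print(len(real_errors(SAMPLE)), "ERROR lines left after filtering")
```

If nothing is left after filtering, the ERROR lines alone do not explain why training stalls, and the raw Caffe log is the next place to look.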

@antran89
Author

antran89 commented Oct 8, 2015

Hi @lukeyeager
I downloaded the Caffe model again using the script provided in the latest Caffe. The error is still the same. The job made no progress after a few minutes, so I aborted it.

2015-10-08 23:37:03 [20151008-233702-fa10] [INFO ] Train Caffe Model task started.
2015-10-08 23:37:11 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-08 23:37:11 [20151008-233702-fa10] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-08 23:37:11 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-08 23:37:15 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated transformation parameters: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
2015-10-08 23:37:15 [20151008-233702-fa10] [ERROR] Train Caffe Model: Note that future Caffe releases will only support transform_param messages for transformation fields.
2015-10-08 23:37:15 [20151008-233702-fa10] [ERROR] Train Caffe Model: Attempting to upgrade input file specified using deprecated V1LayerParameter: /home/tranlaman/BLVC-caffe/models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
127.0.0.1 - - [2015-10-08 23:40:21] "POST /jobs/20151008-233702-fa10/abort HTTP/1.1" 200 128 0.004213
2015-10-08 23:40:26 [20151008-233702-fa10] [INFO ] Job deleted.
127.0.0.1 - - [2015-10-08 23:40:26] "DELETE /jobs/20151008-233702-fa10 HTTP/1.1" 200 128 0.004342

@lukeyeager
Member

Can you post the raw caffe log? I suspect it may just be really slow to respond.

@antran89
Author

antran89 commented Oct 9, 2015

Hi @lukeyeager.
This is my Caffe log file [http://pastebin.com/DaC8WNLD]. I see the same "Data layer prefetch queue empty" message as before.

@lukeyeager
Member

I1009 10:07:20.342645 14891 solver.cpp:314] Iteration 0, Testing net (#0)
I1009 10:07:20.354153 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.405398 14897 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.405930 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.467383 14897 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.486361 14897 blocking_queue.cpp:50] Waiting for data
I1009 10:07:20.511524 14891 blocking_queue.cpp:50] Data layer prefetch queue empty
...
I1009 10:16:44.137663 14891 blocking_queue.cpp:50] Data layer prefetch queue empty
I1009 10:16:44.168421 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.213986 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.366152 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.383676 14899 blocking_queue.cpp:50] Waiting for data
I1009 10:16:44.427858 14899 blocking_queue.cpp:50] Waiting for data

How big is your dataset? 9 minutes seems like it should be plenty long enough to get some kind of a response, but if you're trying to train on all of ImageNet with a small GPU, I can see how the validation pass might take longer than that.

Can you try fine-tuning on a small dataset? Have you been able to use DIGITS to train any other model successfully on your machine?
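
For reference, the 9-minute figure comes from the timestamps in the excerpt above. A minimal sketch of that arithmetic follows. It assumes the usual glog line layout (severity letter, MMDD, time with microseconds, thread id, source location); glog lines carry no year, so the `year` argument is an assumption taken from the date of this thread, and the helper names are hypothetical.

```python
import re
from datetime import datetime

# glog layout as seen in the excerpt: I1009 10:07:20.342645 14891 file:line] msg
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<clock>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)


def parse_time(line, year=2015):
    """Return the timestamp of a glog line, or None if the line does not match."""
    m = GLOG_RE.match(line)
    if m is None:
        return None
    stamp = f"{year}{m.group('mmdd')} {m.group('clock')}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def elapsed_seconds(lines, year=2015):
    """Seconds between the first and last parseable lines."""
    times = [t for t in (parse_time(l, year) for l in lines) if t is not None]
    if len(times) < 2:
        return 0.0
    return (times[-1] - times[0]).total_seconds()


# First and last lines of the excerpt above; the "..." line is skipped.
SAMPLE = [
    "I1009 10:07:20.342645 14891 solver.cpp:314] Iteration 0, Testing net (#0)",
    "...",
    "I1009 10:16:44.427858 14899 blocking_queue.cpp:50] Waiting for data",
]

if __name__ == "__main__":
    print(f"{elapsed_seconds(SAMPLE) / 60:.1f} minutes between first and last line")
```

Running the same function over the full log would show how long the job spent waiting on the data layer.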

@antran89
Author

@lukeyeager Hi,
For your information, I trained models by following the tutorial for the NVIDIA Deep Learning course [https://developer.nvidia.com/deep-learning-courses]. That worked with DIGITS on my local machine with NVIDIA-Caffe 0.12. I had never fine-tuned with DIGITS before. Then I upgraded NVIDIA-Caffe to the master branch and this error appeared.
From what I read in NVIDIA/caffe#41, it looks like a problem with post-0.12 versions of NVIDIA-Caffe.

@antran89
Author

Hi @lukeyeager
I can confirm that the same thing happens when fine-tuning even with the latest version of BVLC-Caffe. After a few iterations, Caffe shuts down my workstation. I have reproduced the error multiple times and still don't know why. It might be because of my train_val.prototxt file or the Caffe binary model 😓. I do not know why it succeeded the first time.
I will try to localize the problem and report it to the Caffe developers.

@antran89
Author

Hi @lukeyeager
I took your advice and tried fine-tuning on the Flickr example provided in Caffe. A serious problem occurred on my computer.
Could you please run the Flickr fine-tuning example in NVIDIA-Caffe or BVLC-Caffe for me?
Whenever I run it, after some time my workstation restarts, whether I run on CPU or GPU. Please help me confirm whether the same thing happens on your machine.
Thanks,

@gsxuan1127

I have hit the same problem:
I1009 10:16:44.137663 14891 blocking_queue.cpp:50] Data layer prefetch queue empty
I1009 10:16:44.168421 14899 blocking_queue.cpp:50] Waiting for data
After a short wait, the Ubuntu system shut down.
I have tried many times, and the problem is still not solved.

@lukeyeager lukeyeager added the bug label Oct 12, 2015
@deepsemantic

Hi @antran89 and @gsxuan1127 ,
I have the same error, but the difference is that my Ubuntu system did not shut down. Do you already know how to fix this problem? I am using the Caffe I downloaded one week ago.

@antran89
Author

Hi @deepsemantic, @gsxuan1127, @lukeyeager,
Sorry, I did not close this issue earlier. After some time, I figured out that the restarting problem is related to power. It might be that your workstation does not supply enough power for the GPUs (in my case, a Titan X).
Please check your power supply and power cables, and if that does not solve your problem, try disabling GPU auto-boost.
Here is my summary of how to set up a proper workstation for deep learning: https://drive.google.com/open?id=1McEMHewtMOwhIyUSO_2AZxbeEwRihjF2Ql_eYQ6bq2o.
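
As a rough illustration of the power check, the sketch below compares a power supply rating against the summed component draw plus a safety margin. Everything in it is an assumption for illustration: the function name, the 20% margin, and all wattages are placeholders, not measurements from any machine in this thread. Real numbers should come from the component specifications.

```python
def power_headroom_watts(psu_watts, component_watts, safety_margin=0.2):
    """Return PSU capacity left after the total load plus a safety margin.

    A negative result suggests the supply may be undersized for the build.
    """
    load = sum(component_watts.values())
    return psu_watts - load * (1.0 + safety_margin)


# Hypothetical build; all wattages are illustrative placeholders.
EXAMPLE_BUILD = {"gpu_0": 250, "gpu_1": 250, "cpu": 140, "other": 100}

if __name__ == "__main__":
    for psu in (850, 1000):
        headroom = power_headroom_watts(psu, EXAMPLE_BUILD)
        verdict = "ok" if headroom >= 0 else "likely undersized"
        print(f"{psu} W supply: {verdict}")
```

This only covers the steady-state budget. It does not account for cabling problems or short power spikes, which is where checking the cables and disabling auto-boost come in.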

@lukeyeager
Member

Nice presentation @antran89 - thanks for the link!
