Segmentation fault after training for a few iterations #356

Closed
groot-1313 opened this issue Jan 29, 2018 · 14 comments

@groot-1313

I am training on a custom dataset and am using 10 anchors, with a batch size of 64 and 16 subdivisions. Training ends after a few iterations due to a segmentation fault.

@sivagnanamn

  1. What is your GPU memory size?
  2. Have you set random=1 in your cfg file? If yes, please try random=0

@groot-1313
Author

  1. I am using a Tesla K80; the GPU memory size is 12 GB.
  2. I tried with random=0. The same problem persists.

@sivagnanamn

Could you please share your training log? Does it always stop at a particular iteration, or at random iterations?

@groot-1313
Author

groot-1313 commented Jan 29, 2018

Region Avg IOU: 0.075464, Class: 0.436563, Obj: 0.538045, No Obj: 0.482124, Avg Recall: 0.000000, count: 7
Region Avg IOU: 0.407119, Class: 0.053543, Obj: 0.604019, No Obj: 0.481808, Avg Recall: 0.333333, count: 3
Region Avg IOU: 0.058566, Class: 0.160997, Obj: 0.348668, No Obj: 0.483735, Avg Recall: 0.000000, count: 4
Region Avg IOU: 0.186754, Class: 0.169088, Obj: 0.831168, No Obj: 0.484255, Avg Recall: 0.000000, count: 2
Region Avg IOU: 0.035496, Class: 0.082673, Obj: 0.546367, No Obj: 0.482488, Avg Recall: 0.000000, count: 6
18: 1038.119141, 1047.479004 avg, 0.000000 rate, 23.280225 seconds, 1152 images
Loaded: 0.000031 seconds
Region Avg IOU: 0.065268, Class: 0.404112, Obj: 0.649606, No Obj: 0.484439, Avg Recall: 0.000000, count: 7
Region Avg IOU: 0.183518, Class: 0.366346, Obj: 0.542902, No Obj: 0.483338, Avg Recall: 0.090909, count: 11
Region Avg IOU: 0.197373, Class: 0.201846, Obj: 0.352214, No Obj: 0.483670, Avg Recall: 0.100000, count: 10
Region Avg IOU: 0.175585, Class: 0.268489, Obj: 0.432158, No Obj: 0.484682, Avg Recall: 0.250000, count: 4
Region Avg IOU: 0.206101, Class: 0.234225, Obj: 0.576000, No Obj: 0.482113, Avg Recall: 0.055556, count: 18
Segmentation fault

It stops at random iterations.
I also tried 5 anchors, using the Pascal VOC anchors.
The input size is 608x608.

@sivagnanamn

Using a gdb trace to check the root cause will be helpful. I've faced similar seg faults in the following cases:

  1. GPU memory full
  2. Incomplete system configuration (Ex: OpenCV, cuDNN)
  3. Missing train images

If you're sure that your case is not related to any of the above, you can use gdb to check the trace.
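For example, a minimal gdb session could look like this (assuming Darknet was rebuilt with DEBUG=1 in the Makefile; the data/cfg/weights paths below are placeholders for your own):

gdb --args ./darknet detector train data/obj.data cfg/yolo-obj.cfg darknet19_448.conv.23
(gdb) run
... training runs until the SIGSEGV ...
(gdb) bt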

@groot-1313
Author

I get a make error when I set OPENCV and DEBUG to 1 in the Makefile.

@groot-1313
Author

Program received signal SIGSEGV, Segmentation fault.
0x000000000046bb7c in get_region_box (x=0x8ce6690, biases=0x881f10, n=0,
index=-874956011, i=18999980, j=18999980, w=19, h=19, stride=361)
at ./src/region_layer.c:79
79 b.x = (i + x[index + 0*stride]) / w;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.169.amzn1.x86_64

The above is the output of gdb. Note that the index value is a large negative number, so x[index] reads far outside the layer's output buffer.

@sivagnanamn

Do any of the annotations in your training data contain 0.0? If so, could you please change it to 0.1 (or any small non-zero value) and try again?
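A minimal sketch of this check, assuming Darknet-format label files (one object per line: class x_center y_center width height, all values normalized to 0..1) under a hypothetical labels/ directory:

import glob

# Flag label lines with 0.0 (or otherwise out-of-range) box values.
for path in glob.glob("labels/**/*.txt", recursive=True):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            if not fields:
                continue  # ignore trailing empty lines
            if len(fields) != 5:
                print(f"{path}:{lineno}: malformed line: {line.strip()}")
                continue
            x, y, w, h = map(float, fields[1:5])
            if min(x, y, w, h) <= 0.0 or max(x, y, w, h) > 1.0:
                print(f"{path}:{lineno}: suspicious box: {line.strip()}")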

@AlexeyAB
Owner

@groot-1313 What software utility did you use to create the .txt annotation files for each image?

@groot-1313
Author

I had a dataset annotated with the bbox-Label me tool, and I used a Python script to convert it to the format accepted by Darknet.

@AlexeyAB
Owner

@groot-1313

As I can see, you are using the original Darknet repo: https://github.com/pjreddie/darknet/blob/80d9bec20f0a44ab07616215c6eadb2d633492fe/src/region_layer.c#L79

In my repo this line is different:

b.x = (i + logistic_activate(x[index + 0])) / w;

  • So you can try using my repo.
  • Also check for 0.0 in your annotation .txt files, as @sivagnanamn said.
  • What cfg file do you use? Can you show it?
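For reference, logistic_activate here is the standard sigmoid, logistic_activate(x) = 1 / (1 + exp(-x)), so the predicted x offset is squashed into (0, 1) before being added to the cell coordinate i.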

@groot-1313
Author

The cfg file has only the following changes:
At the top:

[net]
#Testing
#batch=1
#subdivisions=1
#Training
batch=64
subdivisions=8
width=608
height=608

At the bottom:

[convolutional]
size=1
stride=1
pad=1
filters=90
activation=linear

[region]
anchors = 0.89,1.26, 0.90,2.67, 1.20,0.85, 1.46,1.30, 1.49,4.14, 1.55,7.72, 2.08,1.57, 2.08,2.29, 2.91,3.73, 3.37,11.64
bias_match=1
classes=4
coords=4
num=10
softmax=1
jitter=.3
rescore=1

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

absolute=1
thresh = .6
random=0
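(For reference, with a region layer the preceding convolutional layer needs filters = num * (coords + classes + 1) = 10 * (4 + 4 + 1) = 90, which matches the filters=90 above.)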

I checked my annotations. No 0.0. But the txt files have an extra empty line at the bottom.

I will try your repo and get back to you @AlexeyAB on this issue thread.

@groot-1313
Author

@AlexeyAB @sivagnanamn I found some annotations with 0.0. Thank you for helping. I will make the required changes.

@youzi27

youzi27 commented May 1, 2020

I have the same problem and don't know how to solve it.
