Segmentation fault after training for a few iterations #356

Closed
groot-1313 opened this issue Jan 29, 2018 · 14 comments

@groot-1313

I am training on a custom dataset and am using 10 anchors, with a batch size of 64 and 16 subdivisions. Training ends after a few iterations due to a segmentation fault.

@sivagnanamn

  1. What is your GPU memory size?
  2. Have you set random=1 in your cfg file? If yes, please try random=0

@groot-1313
Author

  1. I am using a Tesla K80; the GPU memory size is 12 GB.
  2. I tried with random=0. The same problem persists.

@sivagnanamn

Could you please share your training log? Does it always stop at a particular iteration, or at random iterations?

@groot-1313
Author

groot-1313 commented Jan 29, 2018

Region Avg IOU: 0.075464, Class: 0.436563, Obj: 0.538045, No Obj: 0.482124, Avg Recall: 0.000000, count: 7
Region Avg IOU: 0.407119, Class: 0.053543, Obj: 0.604019, No Obj: 0.481808, Avg Recall: 0.333333, count: 3
Region Avg IOU: 0.058566, Class: 0.160997, Obj: 0.348668, No Obj: 0.483735, Avg Recall: 0.000000, count: 4
Region Avg IOU: 0.186754, Class: 0.169088, Obj: 0.831168, No Obj: 0.484255, Avg Recall: 0.000000, count: 2
Region Avg IOU: 0.035496, Class: 0.082673, Obj: 0.546367, No Obj: 0.482488, Avg Recall: 0.000000, count: 6
18: 1038.119141, 1047.479004 avg, 0.000000 rate, 23.280225 seconds, 1152 images
Loaded: 0.000031 seconds
Region Avg IOU: 0.065268, Class: 0.404112, Obj: 0.649606, No Obj: 0.484439, Avg Recall: 0.000000, count: 7
Region Avg IOU: 0.183518, Class: 0.366346, Obj: 0.542902, No Obj: 0.483338, Avg Recall: 0.090909, count: 11
Region Avg IOU: 0.197373, Class: 0.201846, Obj: 0.352214, No Obj: 0.483670, Avg Recall: 0.100000, count: 10
Region Avg IOU: 0.175585, Class: 0.268489, Obj: 0.432158, No Obj: 0.484682, Avg Recall: 0.250000, count: 4
Region Avg IOU: 0.206101, Class: 0.234225, Obj: 0.576000, No Obj: 0.482113, Avg Recall: 0.055556, count: 18
Segmentation fault

It stops at random iterations.
I also tried 5 anchors, using the Pascal VOC anchors.
The input size is 608x608.

@sivagnanamn

Using a gdb trace to check the root cause will be helpful. I've faced similar seg faults in the following cases:

  1. GPU memory full
  2. Incomplete system configuration (Ex: OpenCV, cuDNN)
  3. Missing train images

If you're sure that your case is not related to any of the above, you can use gdb to check the trace.
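For example, a minimal gdb session could look like this (assuming Darknet was rebuilt with DEBUG=1 in the Makefile; the data/cfg/weights paths below are placeholders for your own):

gdb --args ./darknet detector train data/obj.data cfg/yolo-obj.cfg darknet19_448.conv.23
(gdb) run
... training runs until the SIGSEGV ...
(gdb) bt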

@groot-1313
Author

I get a make error when I set OPENCV and DEBUG to 1 in the Makefile.

@groot-1313
Author

Program received signal SIGSEGV, Segmentation fault.
0x000000000046bb7c in get_region_box (x=0x8ce6690, biases=0x881f10, n=0,
index=-874956011, i=18999980, j=18999980, w=19, h=19, stride=361)
at ./src/region_layer.c:79
79 b.x = (i + x[index + 0*stride]) / w;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.169.amzn1.x86_64

The above is the output of gdb. Note that the index value is a large negative number, so x[index] reads far outside the layer's output buffer.

@sivagnanamn

Do any of the annotations in your training data contain 0.0? If so, could you please change it to 0.1 (or any small non-zero value) and try again?
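A minimal sketch of this check, assuming Darknet-format label files (one object per line: class x_center y_center width height, all values normalized to 0..1) under a hypothetical labels/ directory:

import glob

# Flag label lines with 0.0 (or otherwise out-of-range) box values.
for path in glob.glob("labels/**/*.txt", recursive=True):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            if not fields:
                continue  # ignore trailing empty lines
            if len(fields) != 5:
                print(f"{path}:{lineno}: malformed line: {line.strip()}")
                continue
            x, y, w, h = map(float, fields[1:5])
            if min(x, y, w, h) <= 0.0 or max(x, y, w, h) > 1.0:
                print(f"{path}:{lineno}: suspicious box: {line.strip()}")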

@AlexeyAB
Owner

@groot-1313 What software utility did you use to create the .txt annotation files for each image?

@groot-1313
Author

I had a dataset annotated with the bbox-Label me tool, and I used a Python script to convert it to the format accepted by Darknet.

@AlexeyAB
Owner

@groot-1313

As I can see, you are using the original Darknet repo: https://github.com/pjreddie/darknet/blob/80d9bec20f0a44ab07616215c6eadb2d633492fe/src/region_layer.c#L79

In my repo this line is different:

b.x = (i + logistic_activate(x[index + 0])) / w;

  • So you can try using my repo.
  • Also check for 0.0 in your annotation .txt files, as @sivagnanamn said.
  • What cfg file do you use? Can you show it?
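For reference, logistic_activate here is the standard sigmoid, logistic_activate(x) = 1 / (1 + exp(-x)), so the predicted x offset is squashed into (0, 1) before being added to the cell coordinate i.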

@groot-1313
Author

The cfg file has only the following changes:
At the top:

[net]
#Testing
#batch=1
#subdivisions=1
#Training
batch=64
subdivisions=8
width=608
height=608

At the bottom:

[convolutional]
size=1
stride=1
pad=1
filters=90
activation=linear

[region]
anchors = 0.89,1.26, 0.90,2.67, 1.20,0.85, 1.46,1.30, 1.49,4.14, 1.55,7.72, 2.08,1.57, 2.08,2.29, 2.91,3.73, 3.37,11.64
bias_match=1
classes=4
coords=4
num=10
softmax=1
jitter=.3
rescore=1

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

absolute=1
thresh = .6
random=0
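(For reference, with a region layer the preceding convolutional layer needs filters = num * (coords + classes + 1) = 10 * (4 + 4 + 1) = 90, which matches the filters=90 above.)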

I checked my annotations. No 0.0. But the txt files have an extra empty line at the bottom.

I will try your repo and get back to you @AlexeyAB on this issue thread.

@groot-1313
Author

@AlexeyAB @sivagnanamn I found some annotations with 0.0. Thank you for helping. I will make the required changes.

@youzi27

youzi27 commented May 1, 2020

I have the same problem and don't know how to solve it.
