How can I use DetectNet for custom size data? #980

Open
erogol opened this issue Aug 16, 2016 · 55 comments

Comments

@erogol

erogol commented Aug 16, 2016

I am trying to use DetectNet on third-party data with 448x448 images. Which parameters need to be changed for this custom problem?

@lukeyeager
Member

It's certainly possible to adjust DetectNet to work with other image sizes, but not easy. @jbarker-nvidia got it to work with 1024x512 images in his blog post:
https://devblogs.nvidia.com/parallelforall/detectnet-deep-neural-network-object-detection-digits/

Unfortunately, it's not a simple process. And even if you get it to run without errors, you need to understand what's going on pretty well to get it to actually converge to a solution for your data.

Off the top of my head, here are some places to start:

  • Adjust the image size for your problem (here, here, etc.)
  • Adjust the stride according to the smallest size object you'd like to detect (here)
  • If you do change the stride, I think there's another parameter near the end that needs adjusting (this one?) so that the network output size matches the label size (a rough sketch of this arithmetic is included below)
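As a rough illustration of that bookkeeping only (not DIGITS code; it assumes the stock downsampling factor of 16 and that the label grid is simply image size divided by the stride):

# Rough sketch of the arithmetic -- not DIGITS code. Assumes the stock
# GoogLeNet backbone downsamples by 16 and the label/coverage grid is
# image size divided by the stride.
def detectnet_grid(image_w, image_h, stride=16):
    if image_w % stride or image_h % stride:
        raise ValueError("pad or resize so both dimensions are multiples of %d" % stride)
    return image_w // stride, image_h // stride   # (grid_w, grid_h)

print(detectnet_grid(1248, 384))    # (78, 24) -- the stock KITTI configuration
print(detectnet_grid(1024, 1024))   # (64, 64)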

@fchouteau

fchouteau commented Aug 26, 2016

I am also trying to adapt DetectNet to my own dataset (e.g. 1024x1024 images) with custom object sizes (around 192x192).

The issue is that the full modified prototxt is not published in the blog post, so I'm having a lot of trouble working out what I need to modify.

If I'm correct:

Adjust image size:
In L79-80 and L118-119
{ (...) xsize: myImageXSize (or myCropSize if cropping) ysize: myImageYSize (or myCropSize if cropping) }

Adjust stride for detecting custom object sizes:
In L73
{ stride: myMinObjectSize }
But here I can't work out which parameters I need to tune, as 1248 x 352 looks close to the original image size, but not exactly.
In L2504
{ param_str : '1248, 352, 16, 0.6, 3, 0.02, 22' }
I would then guess param_str = 'xSize, ySize, stride, ?, ?, ?, ?', but I'm unsure about the rest...

Same for L2519 and L2545.

However, I can't work out what else I would need to modify: L2418 does not seem to need modification, as it is the bounding-box regressor and should output 4 values (unless I'm mistaken).

I would love to add documentation on using DetectNet & DIGITS with a custom dataset; however, I don't fully understand everything yet.

Regards

@jon-barker

For 1024x1024 images and target objects around 192x192 you probably don't need to adjust the stride initially. DetectNet with default settings should be sensitive to objects in the range 50-400px. That means that you can just replace the 1248x348/352 everywhere by 1024x1024 and it should "just work".

Something I found that helped accuracy when I modified image sizes was to use random cropping in the "train_transform" - modify the image_size_x and image_size_y parameters to, say, 512 and 512 and set crop_bboxes: false.
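Since the advice above is essentially a search-and-replace over the prototxt, a minimal, hypothetical helper along these lines may save some manual editing; the file names and value pairs are assumptions, and a blind string replace can hit unrelated numbers, so diff the result and check every occurrence by hand:

# Hypothetical helper: swap the stock DetectNet dimensions for new ones in a
# copy of the network prototxt. File names and value pairs are assumptions.
replacements = {"1248": "1024", "384": "1024", "352": "1024"}

with open("detectnet_network.prototxt") as f:
    text = f.read()
for old, new in replacements.items():
    text = text.replace(old, new)
with open("detectnet_network_1024x1024.prototxt", "w") as f:
    f.write(text)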

@szm-R

szm-R commented Aug 28, 2016

@jbarker-nvidia, hi, I did what you said (set crop_bboxes: false) and it improved my mAP from 1.6 to 14 percent. Could you kindly take a look at my question #1011? Thank you.

@fchouteau

@jbarker-nvidia Thank you for your input, much appreciated.
I have one more question, however: I was also thinking about sampling random crops from the image (in my case 512x512), so setting
image_size_x: 512 image_size_y: 512 crop_bboxes: false
in detectnet_groundtruth_param.
However, in the deploy data and later layers, should I specify 1024x1024 or 512x512 as the image size?
My guess would be to put 1024x1024 before the train/val transform and at the end when calculating mAP and clustering bboxes, but I just wanted to be sure.

Regards

@jon-barker

@fchouteau Set image_size_x: 512 image_size_y: 512 crop_bboxes: false in name: "train_transform", i.e. the type: "DetectNetTransformation" layer applied at training time only. Everywhere else leave the image size as 1024x1024. That way cropping will only be applied at training time, and validation and test will use the full-size 1024x1024 images. This works fine because the heart of DetectNet is a fully-convolutional network, so it can be applied to varying image sizes.
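For intuition only, here is a sketch of what training-time random cropping means for the labels; this is not what the DetectNetTransformation layer does internally (in particular it does not model crop_bboxes), just the general idea:

import random

def random_crop_with_labels(img, boxes, crop_w=512, crop_h=512):
    # img is an HxWxC array, boxes are (x1, y1, x2, y2) tuples in pixels.
    # Assumes the image is at least crop_w x crop_h.
    h, w = img.shape[:2]
    x0 = random.randint(0, w - crop_w)
    y0 = random.randint(0, h - crop_h)
    crop = img[y0:y0 + crop_h, x0:x0 + crop_w]
    shifted = []
    for x1, y1, x2, y2 in boxes:
        nx1, ny1 = max(x1 - x0, 0), max(y1 - y0, 0)
        nx2, ny2 = min(x2 - x0, crop_w), min(y2 - y0, crop_h)
        if nx2 > nx1 and ny2 > ny1:      # keep only boxes that survive the crop
            shifted.append((nx1, ny1, nx2, ny2))
    return crop, shifted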

@JVR32

JVR32 commented Sep 6, 2016

Hello everyone,

I want to use DIGITS (DetectNet) + Caffe to detect objects in my own dataset. I read some posts about adapting some settings in DetectNet to use it for training and detection on a custom dataset. But apparently, most of the mentioned datasets consist of images with more or less the same dimensions. My case is a bit different from the comments that I found …

I have 3 different object classes which I want to detect in images : classA, classB and classC.

For each object class, I have 3000 training images available (=> so 9000 in total), and 1500 validation images (=> 4500 in total). Those images are ROI’s (regions of interest from other images) that I manually cropped in the past, so the whole (training) image consists of one specific object.
The smallest dimension of a training or validation image is always 256 (e.g. 256x256, 256x340, 256x402, 256x280, 340x256, … --> note : not a perfect square, but never a long rectangle like 256 x 1024 or 256 x 800 ; always more or less a square shape).
Since all images consist of cropped regions (around an object) from other images, the label files look like this :

108 0.0 0 0.0 0.000000 0.000000 391.000000 255.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
108 0.0 0 0.0 0.000000 0.000000 255.000000 459.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
108 0.0 0 0.0 0.000000 0.000000 411.000000 255.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0
etc.

-> image class = ‘108’ and bounding box of object in the image = image dimensions

I want to train an object detection model(s) so I can detect those 3 objects (if present) in unknown test images, images that were not cropped beforehand. Dimensions of those images can be different (e.g. 800x600, 1200x800, 1486x680, … can be about everything).
Remark : in these unknown images -if the object appears in it-, the whole image can consist of the object, or the object can be a smaller part of the image (not covering the whole image).

My first question : is it necessary to make all the training / validation images have the same dimensions (e.g. 256 x 256), or can I solve it by setting some parameters (pad image? resize image?) to a specific dimension while creating a dataset? It’s not clear to me what those parameters exactly imply.

Second question : how about the test images that can have about any dimension ; do I have to resize them before analyzing or not?

If I get it right, I have to make some changes :

A] While creating a dataset, in the DIGITS box, change :

  • Pad image
  • Resize image

B] In detectnet_network.prototxt (dim:384 and dim:1248), here.

In the following lines, image_size_x:1248 and image_size_y:384 and crop_bboxes true/false are mentioned : here and here.

And in the following line, dimensions (1248, 352) are also used : here, here and here.

At this moment, it is not clear to me how to set these options for my specific case …

With kind regards.

@jon-barker

@JVR32 DetectNet is not designed to work with datasets of the kind that you describe. A dataset for DetectNet should consist of images where the object you wish to detect is some smaller part of the image and has a bounding box label that covers a smaller part of the image. Some of these images could have objects that take up a large part of the image, but not all of them, as it is important for DetectNet to be able to learn what "non-object" pixels look like around a bounding box. That ability to learn a robust background model is why DetectNet can work well. Also note that you will need to modify the standard DetectNet to work for multi-class object detection.

If you have access to the original dataset that you cropped the objects from then you should create a training dataset from those images and use the crop locations as the bounding box annotations to use DetectNet.

If you only have the cropped images to train on then you should just train an image classification network but make sure you train a Fully Convolutional Network (FCN). See here. An FCN for image classification can then be applied to a test image of any size and the output will be a "heatmap" of where objects might be present in the image. Note that this approach will not be as accurate as DetectNet and will suffer from a higher false alarm rate unless you also add non-object/background training samples to your dataset.
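A hedged pycaffe sketch of the FCN "heatmap" idea follows; the file names are placeholders and real preprocessing (mean subtraction, channel order) is omitted:

import caffe

# Placeholders for the deploy prototxt and weights of a fully-convolutional classifier.
net = caffe.Net("fcn_deploy.prototxt", "fcn_weights.caffemodel", caffe.TEST)

image = caffe.io.load_image("test.jpg")        # HxWx3, RGB, values in [0, 1]
chw = image.transpose(2, 0, 1)                 # to CxHxW

net.blobs["data"].reshape(1, *chw.shape)       # an FCN accepts any input size
net.blobs["data"].data[0] = chw
out = net.forward()

heatmap = out[net.outputs[0]][0]               # (num_classes, H', W') class scores
print(heatmap.shape)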

@JVR32

JVR32 commented Sep 6, 2016

I guess I'm stuck then.
I already used the (cropped) training/validation images before (together with a set of negative images that didn't contain the specified objects) to do image classification.
=> 4 image classes : negative, classA, classB, classC

And that worked quite well, but ... it worked best when the whole image, or most of it, was taken up by the object. If the object was only a smaller part of the image, the image was often classified as 'negative' (i.e. not an image of the object we are looking for).

That's why I hoped that using detection instead of classification would improve the results. The main purpose is to detect whether a certain object is present in an image (taking up the whole image or only a smaller part of it). Multi-class is not important; it can be done in multiple checks (-> classA present or not? ; classB present or not? ; classC present or not?).

Unfortunately, I manually cropped all the images in the past, so I don't have the crop locations in the original images :-( .

@JVR32

JVR32 commented Sep 6, 2016

Suppose I had done it differently and had 9000 training images and 4500 validation images with dimensions 640x640, and the objects of interest were smaller parts of those images:

e.g.
image1 = 640x640 with object ROI = (10, 10, 200, 250) = (top, left, bottom, right)
image2 = 640x640 with object ROI = (200, 10, 400, 370)
image3 = 640x640 with object ROI = (150, 150, 400, 400)
...

My test images could still have different dimensions : e.g. 800x600, 1200x800, 1486x680, …

Which settings should I provide while creating a dataset in the DIGITS box :

  • Pad image?
  • Resize image?

Are those totally independent from the possible dimensions of the test images (-> leave pad image empty and put 640x640 for resize image) or not?

And what about the dim and image_size_x and image_size_y parameters in detectnet_network.prototxt ?
Now 384/352 and 1248 are used, but what if the dimensions of the test images can be different, what do I have to put for those parameters?

@sherifshehata

@jbarker-nvidia
I explored the code and it is not clear to me why you set "crop_bboxes: false".

As I understand it, the function pruneBboxes() (in detectnet_coverage_rectangle.cpp) adjusts the boxes according to the transformation that was applied. What happens when crop_bboxes is set to false?

@JVR32

JVR32 commented Sep 8, 2016

Hello,

Could you please point me in the right direction before I spend a lot of time annotating images that cannot be used in later processing?

You told me that I cannot use cropped images, and I can see why ...

But I would like to use object detection in Digits, so I'm willing to start over, and annotate the images again (determine bounding box coordinates around object), but I want to be sure I do it the right way this time.

So, this is my setup :

Suppose I want to detect if a certain object (let's call it classA) is present in an unknown image.

I start with collecting a number of images, e.g. 1000 images that contain objects of classA.

All those images can have different dimensions : 480x480 ; 640x640 ; 800x600 ; 1024x1024 ; 3200x1800 ; 726x1080 ; 1280x2740 ; ...

First question : how do I start?

a] Keep the original dimensions, and get the bounding box coordinates for the object of classA in the image ?

b] Resize the images, so they all have comparable dimensions (e.g. resize so the smallest or largest dimension is 640), and after that get the bounding box coordinates for the object of classA in the resized image ?

c] None of the options above; all images must have exactly the same dimensions, so resize all images to the same dimensions, and after that get the bounding box coordinates.

Option a] and b] can be done without a problem, c] is not that flexible, so rather not if not necessary.

So, that's the first thing I need to know: how do I start? Can I get bounding boxes for the original images, or do I have to resize the images before determining the bounding boxes?

And then the second question : if I follow option a], b] or c] ... I will have 1000 images with for each image the bounding boxes around objects of classA.

After that I'm ready to create the database.

For parameter 'custom classes', I can use 'dontcare,classA'.

But how do I use the 'padding image' and 'resize image'?

I hope you can help me, because I really want to try to detect objects in my own data, but it's not clear to me how to get started ...

With kind regards,

Johan.



@jon-barker

@JVR32 You can annotate bounding boxes on the images in their original size - this is probably desirable so that you can use them in that form in the future. DIGITS can resize the images and bounding box annotations during data ingest.

There's no definitive way to use 'padding image' and 'resize image', but to use DetectNet without modification you want to ensure that most of your objects are within the 50x50 to 400x400 pixel range. The benefit of padding is that you maintain aspect ratio and pixel resolution/object scaling. Having said that, if you have large variation in your input image sizes it is not desirable to pad too much around small images, so you may choose to resize all images to some size in the middle.
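For anyone doing the resizing offline rather than in DIGITS, a hypothetical sketch of the arithmetic (scale the image and the KITTI bbox columns by the same factor) might look like this:

from PIL import Image

def resize_image_and_kitti_labels(img_path, label_path, out_img, out_label, target=640):
    # Hypothetical pre-processing: scale an image so its longest side is
    # `target` pixels and scale the KITTI bbox columns (fields 5-8) to match.
    im = Image.open(img_path)
    scale = target / float(max(im.size))
    new_size = (int(round(im.width * scale)), int(round(im.height * scale)))
    im.resize(new_size, Image.BILINEAR).save(out_img)

    lines_out = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            for i in (4, 5, 6, 7):                  # x1, y1, x2, y2
                parts[i] = "%.2f" % (float(parts[i]) * scale)
            lines_out.append(" ".join(parts))
    with open(out_label, "w") as f:
        f.write("\n".join(lines_out) + "\n")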

@JVR32

JVR32 commented Sep 9, 2016

Thank you very much for the information.

In that case, I think it is best that I resize all images to have -more or less- the same dimensions before starting to process them.

=> I will resize all images so the smallest dimension is 640.

Then, input images will have dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ... and then normally, the object sizes will be in the range 50x50 to 400x400.

=> After resizing, I can start annotating, and determine the bounding boxes in the resized images.

Note : I think the bounding boxes don't have to be square?!

And when I'm done annotating, I will have a set of images with the smallest dimension 640 and bounding boxes in those images.

Maintaining the aspect ratio is important, so since I will have resized the images before annotating them, I suppose it is better to use padding (instead of resize) while creating the dataset?

I'll have to use padding if I'm correct, because all the input images must have the same dimensions, right? So is it correct to leave the 'resize' parameters empty in that case, and set the padding so that all images (with dimensions 640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500; ...) will fit in it -> e.g. 800 x 800 ?

Or do I have to set a bigger padding (e.g. 1024 x 1024) and set resize to 800x800?

I guess I have to use at least one of the two parameters (padding or resizing), and that I cannot just input images with various dimensions (640x640 ; 640x800 ; 490x640 ; 380x640 ; 640x500 ; ...) without setting one of them?



@jon-barker

@JVR32

I think the bounding boxes don't have to be square?!

Correct

Or do I have to set a bigger padding (e.g. 1024 x 1024) and set resize to 800x800?

You don't need to do this, you can just pad up to the standard size. If all of your images are already 640 in one dimension then I would just pad the other dimension to the size of your largest image in that dimension. That minimizes further unnecessary manipulation of the data.
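A minimal sketch of padding to a common canvas is below; it pads only on the right/bottom so existing KITTI coordinates stay valid, and whether DIGITS pads the same way (rather than centring the image) is an assumption worth checking:

from PIL import Image

def pad_to_canvas(img_path, out_path, canvas_w=640, canvas_h=800, fill=0):
    # Paste the original image into the top-left corner of a fixed-size canvas.
    im = Image.open(img_path)
    canvas = Image.new(im.mode, (canvas_w, canvas_h), fill)
    canvas.paste(im, (0, 0))
    canvas.save(out_path)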

@fdesmedt

I am trying to train a model for pedestrians (using the TownCentre annotations) based on the KITTI example for cars. First I tried using the original resolution (1920x1080), but changing the network parameters according to the comments above (replacing 1248x348/352 with the new resolution) led to the error "bottom[i]->shape == bottom[0]->shape", which I was not able to solve.

To avoid having to change the network parameters, I just rescaled all training images (and the annotations accordingly) to the same resolution as KITTI, but the accuracy remains very low (also after 350 epochs). When I tried the advice of using crops from the images, I fell back into the same error message about the shape.

Is there some other example available for object detection with different resolution input that reaches acceptable results?

@sherifshehata

Which layer gives this error? My guess is it's because your resolution is not divisible by 16, so you should replace 1248x348/352 with 1920x1080/1088.

@fdesmedt

The problem is always on "bboxes-masked"

I will try your suggestion. The original size in the network is actually 1248x384 (I copied the values from above, which turned out to be incorrect). The 384 value is divisible by 16, however, so what is the reason for using 352 there?

Another question: is it important to shuffle the data? The training data I have are the images of a long sequence, so consecutive frames contain a lot of the same pedestrians. Is this data shuffled before training, or should I do this myself?

@fdesmedt

I have tried your suggestion, but still get an error on the shape-issue. I attach the resulting log-file:
caffe_output.txt

@sherifshehata

Did you make any other changes? Your bboxes shape is 3 4 67 120, while I think it should be 3 4 68 120.

@fdesmedt

I did not change anything else, just replaced all instances of 1248 by 1920, 384 by 1080 and 352 by 1088. Does the last one make sense?

It seems indeed that the 67 is the problem. I think it comes from the 1080 size, which is pooled 4 times (leading to dimensions 540, 270, 135 and 67, of which the last one is truncated). I am now recreating the dataset with padding to 1088 to avoid the truncation. Hope this helps ;)

@JVR32

JVR32 commented Sep 30, 2016

Hello,

I trained a detection network as follows :

All training images (containing the objects I want to detect) can have different dimensions : 480x480 ; 640x640 ; 800x600 ; 1024x1024 ; 3200x1800 ; 726x1080 ; 1280x2740 ; 5125x3480 ; ...
Before annotating (determining the bounding boxes around the objects in the images -> needed for KITTI format), I resized all those images so the largest dimension is 640. Then, input images will have dimensions 640x640 ; 640x400 ; 490x640 ; 380x640 ; 640x500; ... and then normally, the object sizes will be in the range 50x50 to 400x400.
After resizing, I can start annotating, and determine the bounding boxes in the resized images. And when I'm done annotating, I have a set of images with the largest dimension 640 and bounding boxes around the objects of interest in those images.

I use those resized images and the bounding boxes around the objects for building a dataset, using this settings :
image1
So I padded the images to 640 x 640.

In 'detectnet_network.prototxt', I replaced 384/352 and 1248 by 640.

After training, I want to test the network.

The images I want to test can also have different dimensions. I can resize those images so the largest dimension is 640, but I don't know if that is necessary? And since the images can have different dimensions, it seems logical to me to put the Do not resize input image(s) flag to TRUE?
image2

I created a text file with the paths to the images I would like to test.
If I use this file for 'test many', it will generate some results if the Do not resize input image(s) is not set.
If I set this flag to TRUE, it generates an error :

Couldn't import dot_parser, loading of dot files will not be possible.
2016-09-30 10:13:21 [ERROR] ValueError: could not broadcast input array from shape (3,480,640) into shape (3,640,640)
Traceback (most recent call last):
File "C:\Programs\DIGITS-master\tools\inference.py", line 293, in
args['resize']
File "C:\Programs\DIGITS-master\tools\inference.py", line 167, in infer
resize=resize)
File "C:\Programs\DIGITS-master\digits\model\tasks\caffe_train.py", line 1394, in infer_many
resize=resize)
File "C:\Programs\DIGITS-master\digits\model\tasks\caffe_train.py", line 1434, in infer_many_images
'data', image)
ValueError: could not broadcast input array from shape (3,480,640) into shape (3,640,640)

What I don't understand : if I put only 1 file in the images list (and press test many), there is no error. If I put multiple files in the list, I get the error. But only if the 'do not resize' flag is checked ; if not checked -> no error?

Is this a bug, or is there a logical explanation?
Anyhow, I guess it must be possible to process a list of (test)images without resizing them before object detection? If it works for a single image in the list, it should also be possible for multiple images?

@gheinrich
Contributor

Hi @JVR32 thanks for the detailed report! This is a bug indeed, sorry about that. I agree the error is certainly not explicit! We have a Github issue for this: #1092
In short the explanation is: when you test a batch of images, they must all have the same size otherwise you can't fit them all into a tensor.
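A tiny NumPy illustration of that constraint (this is not the DIGITS code path, just why mixed sizes cannot share one batch tensor):

import numpy as np

same  = [np.zeros((3, 640, 640)), np.zeros((3, 640, 640))]
mixed = [np.zeros((3, 640, 640)), np.zeros((3, 480, 640))]

print(np.stack(same).shape)   # (2, 3, 640, 640) -- one batch tensor
np.stack(mixed)               # raises ValueError: all input arrays must have the same shape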

@aprentis

Hi, everybody!

I've cloned this repo:
https://github.com/skyzhao3q/NvidiaDigitsObjDetect

and done everything as mentioned in the Readme (made a dataset for Object Detection, ran the network).
But all I got was this.

screenshot from 2017-03-29 19-32-08

What's the problem with one class? The full KITTI dataset trains OK.

@jon-barker

@aprentis Can you hover over the graph so that we can see the actual numeric results for the metric? It matters greatly whether those numbers are just small or exactly zero.

Looking at the repo you cloned, I noticed that the model has explicit "dontcare" regions marked. Whilst this can be useful, e.g. for masking out the sky when you only care about the road, it is not necessary. I'm not sure which regions are being marked as "dontcare" for this data, but if it includes the sidewalks where the pedestrians are then you're going to have problems.

@aprentis

@jbarker-nvidia those numbers are exactly zero.
Right now I'm training another model (with one class in it); unfortunately it has the same problem.

In this repo I found a result screenshot which says that the mAP is OK after 10 epochs.
Does anybody know any successful story about training DetectNet with only one class?

@aprentis

@jbarker-nvidia I've tried the network which @gheinrich published (for two classes), but the mAP was still zero.

@jon-barker

@aprentis The fluctuations in those graphs suggest that not all of those numbers are exactly zero. Can you specify which ones are zero and which ones are not?

The basic Kitti example is for one class - just cars. There are lots of examples of successfully training DetectNet on other one class problems too.

@aprentis

@jbarker-nvidia

Here are zoomed graphs.
2017-03-30_16-56-54

@jon-barker

@aprentis Thanks - which optimizer, learning rate and learning rate decay policy are you using? I think you may want to try a smaller learning rate and/or more aggressive learning rate decay - I like to use Adam and an exponential learning rate decay.

@aprentis

aprentis commented Mar 30, 2017

@jbarker-nvidia I've run the network step by step. What's wrong?

Here I used Adam with exponential decay and a gamma of 0.95.

Does it depend on my CPU/GPU configuration?

2017-03-30_17-17-03
2017-03-30_17-17-22
2017-03-30_17-33-07
2017-03-30_17-35-15

@jon-barker

@aprentis "Does it depend on my CPU\GPU configuration?" - No, that shouldn't matter unless you are using multiple GPUs, in which case you may need to adjust your learning rate to accommodate the larger effective batch size.

From the information you've posted it appears that you have a correctly configured dataset and model definition. You may want to try a more aggressive learning rate decay schedule, say exponential with 0.99 decay.
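If memory serves, Caffe's "exp" policy computes lr = base_lr * gamma^iter; treating that formula as an assumption, the decay looks like this:

base_lr, gamma = 1e-4, 0.99
for it in (0, 100, 500, 1000):
    # note how quickly a per-iteration gamma of 0.99 shrinks the rate
    print(it, base_lr * gamma ** it)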

@aprentis

Now with 0.99. I don't understand what's wrong with it. =(

2017-03-31_11-22-12

@AleiLmy

AleiLmy commented Apr 27, 2017

@jbarker-nvidia Hi, I saw your article https://devblogs.nvidia.com/parallelforall/exploring-spacenet-dataset-using-digits/ and I have some questions about it. When you trained the net, what did your data format look like? How did you convert the SpaceNet format into a data format that DetectNet can use? Thank you!

@jon-barker

@AleiLmy For the object detection/DetectNet approach the data follows the standard Kitti format for bounding boxes. We used Python scripts to convert the geoJSON files to Kitti format text files. Obviously the building footprints are not all rectangular and don't all have sides parallel to the input image, so we used the minimum enclosing rectangle with those properties.

For the segmentation approach we again used Python scripts to convert the geoJSON files to .PNG files for the segmentation masks.
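A hypothetical sketch of the footprint-to-KITTI step (the minimum axis-aligned enclosing rectangle); the geoJSON-to-pixel coordinate conversion is out of scope here:

def polygon_to_kitti(coords, label="building"):
    # coords is a list of (x, y) pixel vertices of a footprint polygon.
    # Emit one KITTI label line with the axis-aligned enclosing rectangle.
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    return "%s 0.0 0 0.0 %.2f %.2f %.2f %.2f 0 0 0 0 0 0 0" % (label, x1, y1, x2, y2)

print(polygon_to_kitti([(10, 12), (40, 8), (52, 30), (18, 35)]))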

@ontheway16

@jbarker-nvidia Hello, for my project I am trying to detect small objects (30-60 pixels). DetectNet is making detections very nicely; mAP is about 65 and test accuracy is over 90%, so no problems there. The only thing I cannot figure out how to solve is the detection of nearby objects. If there are two objects 5-10 pixels apart, DetectNet fails to distinguish them and gives them a single bounding box. And I guess detecting overlapping objects individually is totally impossible.

I changed the stride to 8, but that did not help much with this problem. Might visualizations help me here to see where the problem actually starts in the network? Can you advise some modification points in the network for this purpose?

@AleiLmy

AleiLmy commented May 6, 2017

@jbarker-nvidia my labels look like this:
building 0.0 0 0.0 325 68 358 104 0 0 0 0 0 0 0 (the label for an image that contains buildings)
dontcare 0.0 0 0.0 0 0 50 50 0 0 0 0 0 0 0 (the label for an image that doesn't contain buildings)

I resized the images to 1280x1280, and the mAP is zero.
index

and I tried to follow your steps, but got really bad output like this:
bbox-list [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] ... (50 rows, all zeros)]
Where did I go wrong?

@bfreskura

@ontheway16 Can you please explain what you changed to make it work with stride=8?

@ontheway16

@Barty777 please check the following discussion;

https://groups.google.com/forum/m/#!topic/digits-users/zx_UYu3jlt8

@sam-pochyly

@varunvv Can you clarify what you fixed? I'm getting the same error as you had.

@juiceboxjoe

Thank you all for your ongoing support. I am currently training DetectNet on aerial images. Recall is at about 47% so far and precision at about 23% (epoch 800). I'm using a base learning rate of 1e-4 with exponential decay and gamma 0.99 on one GPU. After 350 epochs I stopped training to restart using two GPUs (setting the base learning rate to the last learning rate used in the previous run to continue training). So far mAP, precision, and recall are still increasing (slowly).

I have a question about one of @jbarker-nvidia 's previous comments in this thread about using multiple GPUs:

@jbarker-nvidia "@aprentis "Does it depend on my CPU\GPU configuration?" - No, that shouldn't matter unless you are using multiple GPUs in which case you may need to adjust your learning rate to accomodate the larger effective batch size."

For the sake of clarity:

I know that when using (for example) 1 GPU with batch size of 5 and batch accumulation of 2 I get an effective batch size of 10 and I should adjust my learning rate accordingly, but do you mean that using additional GPUs would also change the effective batch size? For example if I'm using 2 GPUs with those same parameters (batch size 5 and accumulation 2) the effective batch size is actually going to be 20 instead of 10?

I would naturally think that backprop would take place every time 2 batches of 5 images have been processed, regardless of which GPU was used to process each batch, and not every time all the GPUs in use have processed 2 batches of 5 images each. Meaning that my learning rate adjustments should only be based on batch size + batch accumulation, while totally disregarding the number of GPUs in use.

I'm very confused as to what you mean by adjusting the learning rate to accommodate the effective batch size when using multiple GPUs.

So my question is: why is the learning rate affected by the number of GPUs in use?

Thank you in advance for your help.

@jon-barker

@juiceboxjoe You are correct that backprop would take place on each GPU independently after 2 batches of 5 images have been processed. But the learning rate is not used in backprop; it is used in the optimizer. The optimizer aggregates the gradients from all of the GPUs, performs a gradient descent update and then broadcasts the new parameters back to the GPUs. So that aggregation step across the GPUs effectively increases the batch size to 20 (in the case of 2 GPUs).
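Spelled out as arithmetic, under the explanation above:

# Effective batch size as described in this comment (stated here as an assumption).
batch_size, batch_accumulation, num_gpus = 5, 2, 2
effective_batch = batch_size * batch_accumulation * num_gpus
print(effective_batch)   # 20 with two GPUs, 10 with one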

@juiceboxjoe

juiceboxjoe commented Jan 12, 2018

@jbarker-nvidia Thank you so much for your quick and enlightening response! Sorry for my confusion. I also found some nice, quick reference here that further clarifies your point about the difference between backprop and optimizers.

This means that I was wrong to think that my effective batch size did not change when I paused training on one GPU and then continued on two GPUs.

Can you also point me in the right direction regarding learning rate adjustments according to effective batch size when training DetectNet (found through experimentation - like DetectNet's ideal 40-500px detection range)?

Thanks in advance.

@jon-barker

@juiceboxjoe There's not really any solid rule for how much to adjust the learning rate by as a function of batch size. But if you do need to change it you will need to decrease the learning rate as the batch size grows.

@rsandler00

This seems relevant to post here:

When I naively changed the sizes in the detectNet prototxt file to my current image size, I got the error:
"Check failed: bottom[i]->shape() == bottom[0]->shape()"

This is because the image dimensions have to be an integer multiple of the stride. So w/ stride 16, I resized my 1920x1080 image to 1920x1088 and the error was resolved
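A small helper for that adjustment (assuming the relevant multiple is 16, the total downsampling factor of stock DetectNet):

def round_up(x, multiple=16):
    # Pad a dimension up to the next multiple of the network's downsampling factor.
    return ((x + multiple - 1) // multiple) * multiple

print(round_up(1080))  # 1088
print(round_up(1920))  # 1920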

@cesare-montresor

cesare-montresor commented Jun 26, 2018

@jon-e-barker Same issue here, but I'm using a small custom dataset (500 images coming from the same video). Could this be the cause of the mAP = 0 (as well as every other metric being 0)?
I've tried adjusting sizes in the model, padding, resizing, etc.
The 12 GB KITTI dataset trains with no issues.

@rsandler00

@cesare-montresor, I was using custom-sized data and also getting mAP=0 for the first 100 epochs, and then it started picking up. Adjusting hyperparameters (such as reducing the learning rate and the clustering threshold) may have avoided this problem, but it demonstrates that mAP=0 doesn't mean the model isn't working - it may just be working very slowly.

image

@cesare-montresor

Ooook, I'm a goat typing on a keyboard: I made a mistake converting the labels from VOC to KITTI.
With the hyperparameters given in the docs and just 500 images it trains like a charm! ^_^
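For anyone making the same conversion, a hypothetical VOC-to-KITTI sketch (field names per the Pascal VOC XML schema; the trailing KITTI fields are zero-filled):

import xml.etree.ElementTree as ET

def voc_to_kitti(voc_xml_path):
    # Read a Pascal VOC annotation file and emit KITTI label lines
    # (class, 3 dummy fields, x1 y1 x2 y2, 7 trailing zeros).
    root = ET.parse(voc_xml_path).getroot()
    lines = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        box = obj.find("bndbox")
        x1, y1 = box.findtext("xmin"), box.findtext("ymin")
        x2, y2 = box.findtext("xmax"), box.findtext("ymax")
        lines.append("%s 0.0 0 0.0 %s %s %s %s 0 0 0 0 0 0 0" % (name, x1, y1, x2, y2))
    return "\n".join(lines)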

@michaelwellens

@rsandler00, hi, I also wanted to train on the image size 1920x1088 like you did, but it won't even train on this image size. I provided my prototxt file in this comment. When attempting the training, the error is "can not force shape (84,5) into shape (50,5)".

My dataset is made out of images slightly smaller than 1920x1088, so I can pad them to the desired size in DIGITS.

Did you get this error too in the beginning?

Can you perhaps provide me with your prototxt? It would be of great help.

detectnet_network_1920_1088_stride16.prototxt.txt

@GhaCH

GhaCH commented Sep 12, 2019

Hi @ontheway16, I am facing the same problem. Any solution?

@nikunjlad

@jon-barker I am trying to do 2-class object detection using DIGITS 6. My objective is to detect people and license plates, and I have done hard-negative mining by annotating some false-positive objects with the label name Dontcare. My dataset is very varied, with image sizes ranging from a minimum of (296, 259) to a maximum of (7360, 4912). Likewise, the objects range from small boxes around (5x10) or (10x10) to large objects probably beyond (400x400). This is causing my mAP to not improve.

Following are my dataset creation settings. I created 3 datasets. Based on your suggestions above, padding up to a common large size would not be a wise option for my case, since some images are very small and padding would add a lot of empty space. Hence, going forward with your suggestion of resizing images to an intermediate size and ignoring the aspect-ratio problem, I did the following:

  1. dataset_v1: images resized to 1280x720, validation boxes min size is 0, custom class names are - dontcare, person, plate, batchsize is 32, format is LMDB
  2. dataset_v2: images resized to 1248x384, validation boxes min size is 0, custom class names are - dontcare, person, plate, batchsize is 32 and format is LMDB
  3. dataset_v3: images resized to 2160x1440, validation boxes min size is 0, custom class names are - Dontcare, Person, Plate, batchsize is 32 and format is LMDB. (the labels are capitalized based on comments mentioned in this post Detectnet doesn't always place bounding box and mAP is always zero with custom dataset. #1384)

RUN 1
For training, I first ran at 1248x720, but it gave me memory errors since I am running on a single RTX 2080 GPU with 12 GB of memory. The network configuration is given in detectnet_network_v1.txt, attached below.
detectnet_network_v1.txt

RUN 2
So I switched over to the 1248x384 size and used the default two-class DetectNet prototxt.
My training configuration was:

  1. epochs: 300
  2. batch_size: 2
  3. batch_accumulation: 5
  4. optimizer: Adam
  5. learning_rate: 0.0001
  6. gamma: 0.99
  7. decay_rate: exponential_decay
  8. mean_subtraction: Image
  9. prototxt: custom detectnet network
  10. pretrained_model: bvlc_googlenet.caffemodel

My mAP was 8.45 and 7.67 at 150 epochs, and training just fluctuates, with train_loss going anywhere from less than 0.1 to 10.56 and beyond; likewise the coverage_loss is very bad too.

So, considering that I might have small object sizes and that DetectNet may not learn to capture those small bounding boxes which don't meet the 50x50 pixel minimum, I decided to reduce the stride to 8 based on the steps mentioned in this thread: https://groups.google.com/g/digits-users/c/zx_UYu3jlt8

For stride 8, I changed the pool3 layer to have kernel 1x1 and stride 1, or else it gave the Check failed: bottom[i]->shape() == bottom[0]->shape() error. I made the following changes in the prototxt file, as shown in detectnet_network_v2.txt below:
detectnet_network_v2.txt

The images in my dataset are varying in sizes since they are randomly sampled from the internet. Below is my dataset information:

For People class:

  1. I have data taken from the Boy, Girl, Man and Woman classes of the Google Open Images V6 repository. These images are of arbitrary sizes, and hence the bounding boxes don't fall entirely within the permissible range required by DetectNet (50x50 - 400x400); a good proportion of boxes are smaller or larger than this range.
  2. Other than the above-mentioned data, I have data which was collected earlier by the research group I work with. These are all images taken from an IP camera, hence they are 1280x720 in resolution. The bounding boxes for people are again varied.
  3. To curb false positives we decided to collect some negative samples and label them as Dontcare. These include randomly sampled images of fire hydrants (since an earlier model detected fire hydrants as people), mannequins, and wall posters (if people walk past a wall with pictures or faces of humans, the inanimate pictures are detected as people too, apart from the actual humans walking past the hoardings). The fire hydrant images were taken from Google Open Images, while the mannequin and other negatives were self-annotated with LabelImg in PASCAL VOC format. After all the data was gathered for people detection along with the false positives, the annotations were converted to KITTI format for compatibility with our DetectNet architecture.

For Vehicle License Plate class:

  1. For our work, we need to develop a model which can detect people and plates in a live video stream. For plate detection, the data was again accumulated from the Google Open Images V6 dataset; the images are of varied sizes and shapes, and hence the bounding boxes also range in size.
  2. To curb false positives we aggregated negative samples of traffic sign images.
  3. Along with the above data we have our own custom collected vehicle license plate data too.

Overall in our dataset, we had about 10K annotations of license plates, 12K Dontcare annotations for both classes in total and about 20K person annotations spread randomly across a total of 29K images (since images have multiple instances of classes).

I am wondering how I can improve the mAP of the model on a machine which has a single RTX 2080 with 12 GB of memory.

Things which I feel are causing an issue:

  1. Images have bounding boxes outside permissible ranges
  2. dataset is imbalanced for classes.
  3. memory is not sufficient for higher batch training.

I am currently developing a script to filter out images bigger than, say, 2500x2500 in resolution, as well as to delete annotations for objects in the person and Dontcare classes whose bounding boxes are smaller than 50x50. I am thinking of not filtering the license-plate class, since the bounding boxes around plates are usually about 100x35 or 90x30. Also, there are fewer images of vehicle number plates, so filtering them would increase the class imbalance further.
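A sketch of that filtering, under the assumptions just described (class names and thresholds are the poster's; the helper itself is hypothetical):

MIN_SIDE = 50
KEEP_SMALL = {"plate"}   # plate boxes are naturally ~100x35, so don't filter them

def filter_kitti_lines(lines):
    # Drop person/Dontcare boxes smaller than 50x50 while keeping all plate annotations.
    kept = []
    for line in lines:
        parts = line.split()
        cls = parts[0].lower()
        x1, y1, x2, y2 = map(float, parts[4:8])
        w, h = x2 - x1, y2 - y1
        if cls in KEEP_SMALL or (w >= MIN_SIDE and h >= MIN_SIDE):
            kept.append(line)
    return kept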

@jon-barker I would appreciate it if you could suggest some techniques I can employ to get a model with an acceptable mAP that can be deployed for a real-time detection problem.
