How can I use DetectNet for custom-size data? #980
It's certainly possible to adjust DetectNet to work with other image sizes, but it isn't easy. @jbarker-nvidia got it to work. Unfortunately, it's not a simple process, and even if you get it to run without errors, you need to understand what's going on pretty well to get it to actually converge to a solution for your data. Off the top of my head, here are some places to start:
I am also trying to adapt DetectNet to my own dataset (for example, 1024x1024 images) with custom object sizes (around 192x192). The issue is that the blog post does not publish the full modified prototxt, so I'm having a lot of trouble recalculating what I need to modify. If I'm correct, I need to: adjust the image size, and adjust the stride for detecting custom classes. However, I can't figure out what else I would need to modify. L2418 does not seem to need modification, as it is the bounding-box regressor, so it should output 4 values (unless I'm mistaken). I would love to add documentation on using DetectNet and DIGITS with a custom dataset, but I can't really understand everything yet. Regards
For 1024x1024 images and target objects around 192x192 you probably don't need to adjust the stride initially. DetectNet with default settings should be sensitive to objects in the range 50-400px. That means you can just replace 1248x348/352 everywhere with 1024x1024 and it should "just work". Something I found that helped accuracy when I modified image sizes was to use random cropping in the "train_transform": modify the image_size_x and image_size_y parameters to, say, 512 and 512, and set crop_bboxes: false.
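The constraints mentioned in this comment (input dimensions divisible by the network stride, objects roughly in the default 50-400px sensitivity range) can be sanity-checked before editing the prototxt. A minimal sketch, assuming the default DetectNet stride of 16; the helper name and constants are illustrative, not part of DetectNet:

```python
STRIDE = 16                 # default DetectNet stride
MIN_OBJ, MAX_OBJ = 50, 400  # approximate sensitivity range quoted in the thread

def check_config(width, height, obj_sizes):
    """Return a list of problems with a candidate input size (empty = OK)."""
    problems = []
    if width % STRIDE or height % STRIDE:
        problems.append(f"{width}x{height} is not divisible by stride {STRIDE}")
    for s in obj_sizes:
        if not MIN_OBJ <= s <= MAX_OBJ:
            problems.append(f"object size {s}px outside {MIN_OBJ}-{MAX_OBJ}px range")
    return problems

# 1024x1024 images with ~192px objects: nothing to fix beyond the input dims
print(check_config(1024, 1024, [192]))  # -> []
```

By contrast, `check_config(1920, 1080, [192])` flags 1080 as not divisible by 16, which is exactly the issue that comes up later in this thread.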
@jbarker-nvidia Hi, I did what you said (set crop_bboxes: false) and it improved my mAP from 1.6 to 14 percent. Kindly take a look at my question #1011. Thank you.
@jbarker-nvidia Thank you for your input, much appreciated. Regards
@fchouteau Set
Hello everyone, I want to use DIGITS (DetectNet) + Caffe to detect objects in my own dataset. I have read some posts about adapting settings in DetectNet for training and detection on a custom dataset, but apparently most of the datasets mentioned consist of images with more or less the same dimensions for all images. My case is a bit different from the comments I found. I have 3 object classes which I want to detect in images: classA, classB and classC. For each object class I have 3000 training images available (so 9000 in total) and 1500 validation images (4500 in total). Those images are ROIs (regions of interest from other images) that I manually cropped in the past, so each (training) image consists entirely of one specific object: the image class is, e.g., '108', and the bounding box of the object equals the image dimensions.
I want to train an object detection model so I can detect those 3 objects (if present) in unknown test images, images that were not cropped beforehand. The dimensions of those images can differ (e.g. 800x600, 1200x800, 1486x680, ... it can be about anything). My first question: is it necessary to make all the training/validation images the same dimensions (e.g. 256x256), or can I solve this by setting some parameters (pad image? resize image?) to a specific dimension while creating a dataset? It's not clear to me what those parameters exactly imply. Second question: the test images can have about any dimension; do I have to resize them before analyzing or not? If I get it right, I have to make some changes:
A] While creating a dataset, in the DIGITS box, change:
B] In detectnet_network.prototxt (dim: 384 and dim: 1248), here. In the following lines, image_size_x: 1248, image_size_y: 384 and crop_bboxes true/false are mentioned: here and here. And in the following lines, dimensions (1248, 352) are also used: here, here and here. At this moment, it is not clear to me how to set these options for my specific case. With kind regards.
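For reference, DIGITS object-detection datasets use KITTI-format text labels, one line per object. The sketch below builds such a line; the `kitti_line` helper is illustrative (not part of DIGITS), and the example shows the poster's situation, where the bounding box spans the whole cropped image:

```python
def kitti_line(cls, x1, y1, x2, y2):
    """One KITTI-format label line: class, truncation, occlusion, alpha,
    bbox (x1 y1 x2 y2), then 3D dims/location/rotation (unused by DetectNet,
    left at 0)."""
    return (f"{cls} 0.0 0 0.0 {x1:.2f} {y1:.2f} {x2:.2f} {y2:.2f} "
            f"0.0 0.0 0.0 0.0 0.0 0.0 0.0")

# A 256x256 crop where the object fills the frame: bbox == image dimensions.
print(kitti_line("classA", 0, 0, 256, 256))
# classA 0.0 0 0.0 0.00 0.00 256.00 256.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0
```

As the reply below explains, labels like this (object filling the whole frame) are precisely what DetectNet cannot learn from well, since there is no background context around the box.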
@JVR32 DetectNet is not designed to work with datasets of the kind you describe. A dataset for DetectNet should consist of images where the object you wish to detect is some smaller part of the image and has a bounding-box label that is a smaller part of the image. Some of these images can have objects that take up a large part of the image, but not all of them, as it is important for DetectNet to be able to learn what "non-object" pixels look like around a bounding box. That ability to learn a robust background model is why DetectNet can work well. Also note that you will need to modify the standard DetectNet to work for multi-class object detection. If you have access to the original dataset that you cropped the objects from, then you should create a training dataset from those images and use the crop locations as the bounding-box annotations for DetectNet. If you only have the cropped images to train on, then you should just train an image classification network, but make sure you train a Fully Convolutional Network (FCN). See here. An FCN for image classification can then be applied to a test image of any size, and the output will be a "heatmap" of where objects might be present in the image. Note that this approach will not be as accurate as DetectNet and will suffer from a higher false-alarm rate unless you also add non-object/background training samples to your dataset.
I guess I got stuck then. And that worked quite well, but it worked best when the whole image, or most of the image, was taken up by the object. If the object was only a smaller part of the image, the image was often classified as 'negative' (i.e. not an image of the object we look for). That's why I hoped that using detection instead of classification would improve the results. The main purpose is to detect whether a certain object is present in an image (whether it takes up the whole image or only a smaller part). Multi-class is not important; it can be done in multiple checks (classA present or not? classB present or not? classC present or not?). Unfortunately, I manually cropped all the images in the past, so I don't have the crop locations in the original images :-(
Suppose I had done it differently, and I had 9000 training images and 4500 validation images with dimensions 640x640, where the wanted objects were smaller parts of those images. My test images could still have different dimensions: e.g. 800x600, 1200x800, 1486x680, ... Which settings should I provide while creating a dataset in the DIGITS box (pad image / resize image)? Are those totally independent of the possible dimensions of the test images (i.e. leave pad image empty and put 640x640 for resize image) or not? And what about the dim, image_size_x and image_size_y parameters in detectnet_network.prototxt?
@jbarker-nvidia As I understand it, the function pruneBboxes() (in detectnet_coverage_rectangle.cpp) adjusts the boxes according to the transformation applied. What happens when crop_bboxes is set to false?
Hello, could you please point me in the right direction before I spend a lot of time annotating images that cannot be used in later processing? You told me that I cannot use cropped images, and I can see why. But I would like to use object detection in DIGITS, so I'm willing to start over and annotate the images again (determine bounding-box coordinates around each object), but I want to be sure I do it the right way this time. So, this is my setup: suppose I want to detect whether a certain object (let's call it classA) is present in an unknown image. I start by collecting a number of images, e.g. 1000 images that contain objects of classA. All those images can have different dimensions: 480x480, 640x640, 800x600, 1024x1024, 3200x1800, 726x1080, 1280x2740, ...
First question: how do I start?
a] Keep the original dimensions, and get the bounding-box coordinates for the classA object in each image?
b] Resize the images so they all have comparable dimensions (e.g. resize so the smallest or largest dimension is 640), and after that get the bounding-box coordinates in the resized images?
c] None of the options above; all images must have exactly the same dimensions, so resize all images to the same dimensions first, and then get the bounding-box coordinates.
Options a] and b] can be done without a problem; c] is not that flexible, so I'd rather avoid it if it isn't necessary. So that's the first thing I need to know: can I get bounding boxes for the original images, or do I have to resize the images before determining the bounding boxes?
And then the second question: whichever of options a], b] or c] I follow, I will have 1000 images, each with bounding boxes around the classA objects, and I'll be ready to create the database. For the 'custom classes' parameter I can use 'dontcare,classA'. But how do I use 'padding image' and 'resize image'?
I hope you can help me, because I really want to try to detect objects in my own data, but it's not clear to me how to get started. With kind regards, Johan.
@JVR32 You can annotate bounding boxes on the images in their original size - this is probably desirable so that you can use them in that form in the future. DIGITS can resize the images and bounding-box annotations during data ingest. There's no definitive way to use 'padding image' and 'resize image', but to use DetectNet without modification you want to ensure that most of your objects are within the 50x50 to 400x400 pixel range. The benefit of padding is that you maintain aspect ratio and pixel resolution/object scaling. Having said that, if you have large variation in your input image sizes it is not desirable to pad too much around small images, so you may choose to resize all images to some size in the middle.
Thank you very much for the information. In that case, I think it is best to resize all images to more or less the same dimensions before starting to process them. => I will resize all images so the smallest dimension is 640. The input images will then have dimensions 640x640, 640x800, 490x640, 380x640, 640x500, ... and normally the object sizes will be in the range 50x50 to 400x400. => After resizing, I can start annotating and determine the bounding boxes in the resized images. Note: I think the bounding boxes don't have to be square?! When I'm done annotating, I will have a set of images with smallest dimension 640 and bounding boxes in those images. Maintaining the aspect ratio is important, so since I will have resized the images before annotating them, I suppose it is better to use padding (instead of resizing) while creating the dataset? I'll have to use padding if I'm correct, because all the input images must have the same dimensions, right? So is it correct to leave the 'resize' parameters empty in that case, and set the padding so that all images (with dimensions 640x640, 640x800, 490x640, 380x640, 640x500, ...) will fit in it, e.g. 800x800? Or do I have to set a bigger padding (e.g. 1024x1024) and set resize to 800x800? I guess I have to use at least one of the two parameters (padding or resizing), i.e. I cannot just input images with various dimensions without setting one of the two mentioned parameters?
Correct
You don't need to do this; you can just pad up to the standard size. If all of your images are already 640 in one dimension, then I would just pad the other dimension to the size of your largest image in that dimension. That minimizes further unnecessary manipulation of the data.
I am trying to train a model for pedestrians (using the TownCentre annotations) based on the KITTI example for cars. First I tried using the original resolution (1920x1080), but changing the network parameters according to the comments above (replacing 1248x348/352 with the new resolution) led to the error "bottom[i]->shape == bottom[0]->shape", which I was not able to solve. To avoid having to change the network parameters, I rescaled all training images (and annotations accordingly) to the same resolution as KITTI, but the accuracy remains very low (even after 350 epochs). When I tried the advice of cropping from the images, I fell back to the same error message about the shape. Is there another example available for object detection with different-resolution input that reaches acceptable results?
Which layer gives this error? My guess is that your resolution is not divisible by 16, so you should replace 1248x348/352 with 1920x1080/1088.
The problem is always in "bboxes-masked". I will try your suggestion. The original size in the network is actually 1248x384 (I copied the values from above, which turned out to be incorrect). The 384 value is divisible by 16, though, so what is the reason for using 352 there? Another question: is it important to shuffle the data? My training data are the images of a long sequence, so consecutive frames contain a lot of the same pedestrians. Is this data shuffled before training, or should I do this myself?
I have tried your suggestion, but still get an error on the shape issue. I attach the resulting log file:
Did you make any other changes? Your bboxes shape is 3 4 67 120, while I think it should be 3 4 68 120.
I did not change anything else, just replaced all instances of 1248 with 1920, 384 with 1080 and 352 with 1088. Does the last one make sense? It seems that the 67 is indeed the problem. I think it comes from the 1080 dimension, which is pooled 4 times (leading to dimensions 540, 270, 135 and 67, of which the last one is truncated). I am now recreating the dataset with padding to 1088 to avoid the truncation. Hope this helps ;)
Hi @JVR32, thanks for the detailed report! This is indeed a bug, sorry about that. I agree the error is certainly not explicit! We have a GitHub issue for this: #1092
Hi everybody! I've cloned this repo and done everything as mentioned in the Readme (made a dataset for object detection, ran the network). What's the problem with one class? The full KITTI dataset works OK.
@aprentis Can you hover over the graph so that we can see the actual numeric results for the metric? It matters greatly whether those numbers are just small or exactly zero. Looking at the repo you cloned, I noticed that the model has explicit "dontcare" regions marked. Whilst this can be useful, e.g. for masking out the sky when you only care about the road, it is not necessary. I'm not sure which regions are being marked as "dontcare" for this data, but if they include the sidewalks where the pedestrians are, then you're going to have problems.
@jbarker-nvidia Those numbers are exactly zero. In this repo I've found a result screenshot which says that mAP is OK after 10 epochs.
@jbarker-nvidia I've tried the network which @gheinrich published (for two classes), but mAP was still zero.
@aprentis The fluctuations in those graphs suggest that not all of those numbers are exactly zero. Can you specify which ones are zero and which ones are not? The basic KITTI example is for one class - just cars. There are lots of examples of successfully training DetectNet on other one-class problems too.
@aprentis Thanks - which optimizer, learning rate and learning rate decay policy are you using? I think you may want to try a smaller learning rate and/or more aggressive learning rate decay - I like to use Adam and an exponential learning rate decay.
@aprentis "Does it depend on my CPU/GPU configuration?" - No, that shouldn't matter unless you are using multiple GPUs, in which case you may need to adjust your learning rate to accommodate the larger effective batch size. From the information you've posted it appears that you have a correctly configured dataset and model definition. You may want to try a more aggressive learning rate decay schedule, say exponential with 0.99 decay.
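For reference, Caffe's "exp" learning-rate policy mentioned in these comments computes lr = base_lr * gamma^iter, so a gamma of 0.99 shrinks the rate quickly with iteration count. A minimal sketch of that formula (the function name is illustrative, not a Caffe API):

```python
def exp_lr(base_lr, gamma, it):
    """Caffe "exp" policy: lr(iter) = base_lr * gamma ** iter."""
    return base_lr * gamma ** it

base_lr, gamma = 1e-4, 0.99
for it in (0, 100, 500):
    print(it, exp_lr(base_lr, gamma, it))
```

With gamma = 0.99 the rate drops to roughly a third of base_lr after only 100 iterations, which is why this counts as an "aggressive" decay schedule.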
@jbarker-nvidia Hi, I saw your article https://devblogs.nvidia.com/parallelforall/exploring-spacenet-dataset-using-digits/ and I have some questions about it. When you trained the net, what did your data format look like? How did you convert the SpaceNet format into a data format that DetectNet can use? Thank you!
@AleiLmy For the object detection/DetectNet approach, the data follows the standard KITTI format for bounding boxes. We used Python scripts to convert the geoJSON files to KITTI-format text files. Obviously the building footprints are not all rectangular and don't all have sides parallel to the input image, so we used the minimum enclosing rectangle with those properties. For the segmentation approach we again used Python scripts to convert the geoJSON files to .PNG files for the segmentation masks.
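The polygon-to-rectangle step described here can be sketched as follows. This is not NVIDIA's actual conversion script, just an illustration of taking the minimum enclosing axis-aligned rectangle of a footprint polygon and emitting a KITTI-style line; the class name and coordinates are made up:

```python
def polygon_to_bbox(points):
    """points: iterable of (x, y) pixel coordinates of the footprint polygon.
    Returns the minimum enclosing axis-aligned rectangle (x1, y1, x2, y2)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

# Hypothetical building footprint in pixel coordinates
footprint = [(120, 80), (180, 75), (190, 140), (125, 150)]
x1, y1, x2, y2 = polygon_to_bbox(footprint)
print(f"building 0.0 0 0.0 {x1} {y1} {x2} {y2} 0.0 0.0 0.0 0.0 0.0 0.0 0.0")
# building 0.0 0 0.0 120 75 190 150 0.0 0.0 0.0 0.0 0.0 0.0 0.0
```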
@jbarker-nvidia Hello, for my project I am trying to detect small objects (30-60 pixels). DetectNet is making detections very nicely: mAP is about 65, test accuracy is over 90%, no problems here. The only thing I cannot figure out how to solve is the detection of nearby objects. If there are two objects 5-10 pixels apart, DetectNet fails to distinguish them and gives them a single bounding box. And I guess detecting overlapping objects individually is totally impossible. I changed the stride to 8, but that didn't help much with this problem. Might visualizations help me detect where the problem actually starts across the network? Can you advise some modification points in the network for this purpose?
@ontheway16 Can you please explain what you changed to make it work with stride=8?
@Barty777 Please check the following discussion: https://groups.google.com/forum/m/#!topic/digits-users/zx_UYu3jlt8
@varunvv Can you clarify what you fixed? I'm getting the same error you had.
Thank you all for your ongoing support. I am currently training DetectNet on aerial images. Recall is at about 47% so far and precision at about 23% (epoch 800). I'm using a base learning rate of 1e-4 with exponential decay and gamma 0.99 on one GPU. After 350 epochs I stopped training to restart using two GPUs (setting the base learning rate to the last learning rate used on the previous run to continue training). So far mAP, precision and recall are still increasing (slowly). I have a question about one of @jbarker-nvidia's previous comments in this thread about using multiple GPUs: "that shouldn't matter unless you are using multiple GPUs in which case you may need to adjust your learning rate to accommodate the larger effective batch size." For the sake of clarity: I know that when using (for example) 1 GPU with a batch size of 5 and batch accumulation of 2, I get an effective batch size of 10 and should adjust my learning rate accordingly. But do you mean that using additional GPUs also changes the effective batch size? For example, with 2 GPUs and those same parameters (batch size 5, accumulation 2), is the effective batch size actually 20 instead of 10? I would naturally think that backprop takes place every time 2 batches of 5 images have been processed, regardless of which GPU processed each batch, and not every time all the GPUs in use have each processed 2 batches of 5 images - meaning that my learning-rate adjustments should be based only on batch size plus batch accumulation, while totally disregarding the number of GPUs in use. I'm very confused about what you mean by adjusting the learning rate to accommodate the effective batch size when using multiple GPUs. So my question is: why is the learning rate affected by the number of GPUs in use? Thank you in advance for your help.
@juiceboxjoe You are correct that backprop takes place on each GPU independently after 2 batches of 5 images have been processed. But the learning rate is not used in backprop; it is used in the optimizer. The optimizer aggregates the gradients from all of the GPUs, performs a gradient-descent update and then broadcasts the new parameters back to the GPUs. So that aggregation step across the GPUs effectively increases the batch size to 20 (in the case of 2 GPUs).
@jbarker-nvidia Thank you so much for your quick and enlightening response! Sorry for my confusion. I also found a nice quick reference here that further clarifies your point about the difference between backprop and optimizers. This means I was wrong to think that my effective batch size did not change when I paused training on one GPU and then continued on two GPUs. Can you also point me in the right direction regarding learning-rate adjustments according to effective batch size when training DetectNet (found through experimentation - like DetectNet's ideal 40-500px detection range)? Thanks in advance.
@juiceboxjoe There's not really any solid rule for how much to adjust the learning rate as a function of batch size. But if you do need to change it, you will need to decrease the learning rate as the batch size grows.
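The effective-batch-size arithmetic discussed in the last few comments can be written down directly. A minimal sketch, assuming gradients are accumulated locally and then aggregated across GPUs before the optimizer step, as described above (the function name is illustrative):

```python
def effective_batch_size(batch_size, accumulation, num_gpus):
    """Samples contributing to one optimizer update: per-GPU batch size,
    times gradient-accumulation steps, times the number of GPUs whose
    gradients are aggregated."""
    return batch_size * accumulation * num_gpus

print(effective_batch_size(5, 2, 1))  # -> 10
print(effective_batch_size(5, 2, 2))  # -> 20
```

So pausing a run on one GPU and resuming on two doubles the effective batch size even with identical batch-size and accumulation settings, which is why the learning rate may need adjusting.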
This seems relevant to post here: when I naively changed the sizes in the DetectNet prototxt file to my current image size, I got an error. This is because the image dimensions have to be an integer multiple of the stride. So, with stride 16, I resized my 1920x1080 images to 1920x1088 and the error was resolved.
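The fix described here - rounding each image dimension up to the nearest multiple of the stride - can be computed with a one-liner. A small sketch, assuming the standard DetectNet stride of 16 (the function name is illustrative):

```python
def pad_to_stride(dim, stride=16):
    """Round a dimension up to the nearest multiple of the stride."""
    return ((dim + stride - 1) // stride) * stride

print(pad_to_stride(1080))  # -> 1088  (the 1920x1080 -> 1920x1088 fix above)
print(pad_to_stride(1920))  # -> 1920  (already a multiple of 16)
```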
@jon-e-barker Same issue here, but I'm using a small custom dataset (500 images coming from the same video). Could this be the cause of mAP = 0 (as well as every other metric being zero)?
@cesare-montresor Hi, I was using custom-sized data and was also getting mAP=0 for the first 100 epochs, and then it started picking up. Adjusting hyperparameters (such as reducing the learning rate and the clustering threshold) may have avoided this problem, but this demonstrates that mAP=0 doesn't mean the model isn't working - it may just be working very slowly.
Ooook, I'm a goat typing on a keyboard - I made a mistake converting the labels from VOC to KITTI.
@rsandler00 Hi, I wanted to train on the image size 1920x1088 like you did, but it won't even start training at this size. I provided my prototxt file in this comment. When attempting the training I get the error: cannot force shape (84,5) into shape (50,5). My dataset is made of images slightly smaller than 1920x1088, so I pad them to the desired size in DIGITS. Did you get this error too in the beginning? Can you perhaps provide me your prototxt? It would be of great help.
Hi @ontheway16, I am facing the same problem. Any solution?
@jon-barker I am trying to do 2-class object detection using DIGITS 6. My objective is to detect people and license plates, and I have done hard-negative mining by annotating some false-positive objects with the label Dontcare. My dataset is very varied, with image sizes ranging from a minimum of 296x259 to a maximum of 7360x4912. Likewise, the objects range from small boxes around 5x10 or 10x10 to large objects probably beyond 400x400. This is causing my mAP to not improve. Following are my dataset creation settings; I created 3 datasets. Based on your suggestions above, padding would not be a wise option for my case, since some image sizes are very small and padding would take up a lot of space. Hence, following your suggestion of resizing images to an intermediate size (ignoring the aspect-ratio problem), I did the following:
RUN 1 RUN 2
My mAP was 8.45 and 7.67 at 150 epochs, and training just fluctuates, with train_loss going anywhere from below 0.1 to 10.56 and beyond; likewise the coverage_loss is very bad. So, considering that I might have small objects and that DetectNet may not learn to capture bounding boxes below the 50x50-pixel minimum, I decided to reduce the stride to 8 based on the steps mentioned in this thread: https://groups.google.com/g/digits-users/c/zx_UYu3jlt8 For stride 8, I made the pool3 layer have kernel 1x1 and stride 1, or else it gave an error. The images in my dataset vary in size since they are randomly sampled from the internet. Below is my dataset information. For the People class:
For Vehicle License Plate class:
Overall, our dataset has about 10K license-plate annotations, 12K Dontcare annotations for both classes in total, and about 20K person annotations spread randomly across a total of 29K images (since images contain multiple instances of the classes). I am wondering how I can improve the mAP of the model on a machine with a single RTX 2080 with 12 GB of memory. Things which I feel are causing issues:
I am currently developing a script to filter out images bigger than, say, 2500x2500 in resolution, as well as to delete annotations in the person and Dontcare classes whose bounding boxes are smaller than 50x50 pixels. I am thinking of not filtering the license-plate class, since the bounding boxes surrounding the plates are usually 100x35 or 90x30. Also, there are fewer images of vehicle number plates, so filtering them would increase the imbalance further. @jon-barker I would appreciate it if you could suggest some techniques I can employ to get a model with an appreciable mAP to deploy for a real-time detection problem.
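A label filter like the one this comment describes could look roughly like the sketch below. The 50-pixel threshold and the idea of exempting the license-plate class are taken from the comment; the class names, function name and sample label lines are illustrative assumptions:

```python
MIN_SIDE = 50            # minimum box side, per the 50x50 criterion above
KEEP_SMALL = {"plate"}   # hypothetical class name for license plates (exempt)

def filter_labels(lines, min_side=MIN_SIDE, keep_small=KEEP_SMALL):
    """Drop KITTI-format label lines whose boxes are below min_side on either
    side, except for classes in keep_small."""
    kept = []
    for line in lines:
        fields = line.split()
        cls = fields[0].lower()
        x1, y1, x2, y2 = map(float, fields[4:8])  # KITTI bbox fields
        if cls in keep_small or (x2 - x1 >= min_side and y2 - y1 >= min_side):
            kept.append(line)
    return kept

labels = [
    "person 0.0 0 0.0 10 10 200 300 0 0 0 0 0 0 0",  # large person: kept
    "person 0.0 0 0.0 10 10 40 45 0 0 0 0 0 0 0",    # tiny person: dropped
    "plate 0.0 0 0.0 10 10 110 45 0 0 0 0 0 0 0",    # small plate: kept
]
print(filter_labels(labels))
```

In practice this would run over each image's KITTI label file before dataset creation, and images left with no annotations would be dropped or moved to a background set.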
I am trying to use DetectNet on third-party data with 448x448 images. Which parameters need to be changed for this custom problem?