
Tiny YOLO: Looking for suggestions to improve training on a custom dataset #406

Open · saihv opened this issue Feb 25, 2018 · 24 comments

saihv commented Feb 25, 2018

I am currently working on object detection on a custom dataset, where a close-to-real-time implementation on a Jetson TX2 is the final goal. Hence, I am trying to achieve ~30 fps (20-30 would be acceptable too, as long as accuracy is not too bad) as well as a decent IoU.

As of now, I am using Tiny YOLO as my framework through Darknet, compiled with GPU and CUDNN support. The images are 640x360 and I have about 100000 of them, with around 10 classes of objects in total. I've trained Tiny YOLO for about 80000 iterations, and on average this has given me IoUs of around 50% on the test dataset at around 18 fps on the Jetson TX2. I am now looking to improve these numbers without affecting the speed too much. I was hoping to get some suggestions regarding this:

  1. What steps can I take to 'customize' training to my dataset? I have multiple classes of objects, and some of them are very small (bounding boxes of roughly 50x50 pixels); Tiny YOLO is having a lot of trouble specifically with these small objects while performing decently on the bigger ones. Can I somehow retrain my network to focus more on these small objects? Or are there any modifications I can make in the cfg file to account for them?

(I see two points in the README relating to this: the parameters small_object=1 and random=1. Do these improve accuracy at the cost of speed?)

  2. Does YOLO get a performance boost when working on square images? i.e., is there any noticeable improvement from resizing the images to be square?

  3. Is IoU the best metric to check when trying to increase or decrease the network resolution (width and height)? I gather from the README that these values create an accuracy vs. speed trade-off; how should I pick the best values for my application?

  4. In my application, each image contains only one class of object during both training and inference. Can I somehow exploit this fact to improve performance a little (e.g., tell YOLO that the maximum number of objects it needs to detect is just one)?

Any other general comments aimed at improving accuracy or speed are very welcome too. Thanks!

AlexeyAB commented Feb 25, 2018

  • Did you get IoU using darknet map or darknet recall command?
  • What width= height= params do you use in the cfg-file?
  • What learning_rate, steps, scales and decay do you use?
  1. You can use small_object=1 and random=1; these params don't decrease detection speed (see the cfg sketch at the end of this comment):
  • random=1 increases mAP by about 1%; it does not affect detection speed, but it does slow down training
  • small_object=1 is required only for objects smaller than 1%x1% of the image, i.e. smaller than 5x5 pixels if you use width=416 height=416
  • you can also try training from the pre-trained tiny-yolo-voc.conv.13 instead of darknet19_448.conv.23; you can get it with the command: darknet.exe partial cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights tiny-yolo-voc.conv.13 13
  2. By default Yolo uses a square 416x416 network, and any image is automatically resized to 416x416, so you shouldn't do it yourself. But there are several approaches for keeping the aspect ratio, so you can pre-process the images as in the original darknet, or as in OpenCV-dnn-Yolo: Resizing : keeping aspect ratio, or not #232 (comment)
    There are positive and negative points to each approach.

  3. For the default networks (Yolo, Tiny-yolo) and the default threshold=0.24, IoU is the best accuracy metric. But if you use your own model (DenseNet-Yolo, ResNet-Yolo) that requires a different optimal threshold, then the best metric is mAP. Yes, the higher the network resolution, the slower it works, but the more accurately it detects (especially small objects).

    3.1. Also, if all of your images (training and detection) have the same size 640x360, then you can try to change your network size to width=640 height=352 and train with random=0

  4. You can try to implement it in the source code, in this function:

    darknet/src/region_layer.c

    Lines 333 to 384 in 3ff4797

    void get_region_boxes(layer l, int w, int h, float thresh, float **probs, box *boxes, int only_objectness, int *map)
    {
        int i,j,n;
        float *predictions = l.output;
        for (i = 0; i < l.w*l.h; ++i){
            int row = i / l.w;
            int col = i % l.w;
            for(n = 0; n < l.n; ++n){
                int index = i*l.n + n;
                int p_index = index * (l.classes + 5) + 4;
                float scale = predictions[p_index];
                if(l.classfix == -1 && scale < .5) scale = 0;
                int box_index = index * (l.classes + 5);
                boxes[index] = get_region_box(predictions, l.biases, n, box_index, col, row, l.w, l.h);
                boxes[index].x *= w;
                boxes[index].y *= h;
                boxes[index].w *= w;
                boxes[index].h *= h;
                int class_index = index * (l.classes + 5) + 5;
                if(l.softmax_tree){
                    hierarchy_predictions(predictions + class_index, l.classes, l.softmax_tree, 0);
                    int found = 0;
                    if(map){
                        for(j = 0; j < 200; ++j){
                            float prob = scale*predictions[class_index+map[j]];
                            probs[index][j] = (prob > thresh) ? prob : 0;
                        }
                    } else {
                        for(j = l.classes - 1; j >= 0; --j){
                            if(!found && predictions[class_index + j] > .5){
                                found = 1;
                            } else {
                                predictions[class_index + j] = 0;
                            }
                            float prob = predictions[class_index+j];
                            probs[index][j] = (scale > thresh) ? prob : 0;
                        }
                    }
                } else {
                    for(j = 0; j < l.classes; ++j){
                        float prob = scale*predictions[class_index+j];
                        probs[index][j] = (prob > thresh) ? prob : 0;
                    }
                }
                if(only_objectness){
                    probs[index][0] = scale;
                }
            }
        }
    }

For example add this code at the end of the function, before this line:

    // keep only the single best detection across all cells, anchors and classes
    // (i, j and n are already declared at the top of get_region_boxes())
    float max_prob = 0;
    int max_index = 0, max_j = 0;
    for (i = 0; i < l.w*l.h; ++i){
        for(n = 0; n < l.n; ++n){
            int index = i*l.n + n;
            for(j = 0; j < l.classes; ++j){
                if(probs[index][j] > max_prob) {
                    max_prob = probs[index][j];
                    max_index = index;
                    max_j = j;
                }
            }
        }
    }

    // zero out every probability except the single maximum found above
    for (i = 0; i < l.w*l.h; ++i){
        for(n = 0; n < l.n; ++n){
            int index = i*l.n + n;
            for(j = 0; j < l.classes; ++j){
                if(index != max_index || j != max_j) probs[index][j] = 0;
            }
        }
    }
  5. Also, you can re-generate anchors for your dataset:
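
Tying points 1 and 3.1 together, here is a minimal sketch of where these parameters typically sit in a tiny-yolo-voc style cfg (the anchor values below are the stock tiny-yolo-voc ones; treat the rest as illustrative defaults, not values tuned for this dataset):

    [net]
    # network input resolution: higher detects small objects better, but runs slower
    # (or width=640 height=352 to match 640x360 images, as in point 3.1)
    width=416
    height=416

    ...

    [region]
    # anchors are expressed in final-feature-map cells (network size / 32);
    # re-generate them for your own dataset (point 5)
    anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
    num=5
    # multi-scale training: roughly +1% mAP, same detection speed, slower training
    random=1
    # only needed when objects are smaller than ~1%x1% of the image (about 5x5 px at 416x416)
    small_object=1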

AlexeyAB reopened this Feb 25, 2018
MyVanitar commented Feb 25, 2018

Is it a good idea to pad images during preprocessing to make them compatible with 416x416?

I don't mean resizing them all to 416x416, but padding them so that they scale onto 416 cleanly, because if the network resizes them all to 416x416, many images whose dimensions do not map onto 416 exactly (such as 300x300) will lose their aspect ratio.

@AlexeyAB

@VanitarNordic There are positive and negative points for each approach: #232 (comment)

  • original Darknet: (+) keeps the aspect ratio, (-) objects end up at their smallest size - this further worsens the detection of small objects
  • OpenCV-dnn-Yolo: (+) keeps the aspect ratio, (-) crops away part of the image - you will not be able to detect objects at the edges of the image
  • this Darknet repo: (+) objects end up at their biggest size, (-) does not keep the aspect ratio - if the image sizes in the training and detection datasets are very different, accuracy will be reduced

Because I train my models on a training dataset with the same image size (1280x720 or 1920x1080) as the detection dataset, I don't need to keep the aspect ratio, so for me the best option is this Darknet repository with the maximum object size.
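
To make the trade-off concrete, here is a small illustrative Python/OpenCV sketch of the two strategies being compared (not code from any of these repos, just the idea): letterboxing keeps the aspect ratio but shrinks objects and adds padding, while direct resizing, as in this repo, keeps objects as large as possible but distorts non-square images.

    import cv2
    import numpy as np

    def letterbox(img, size=416, pad_value=128):
        """Resize keeping aspect ratio, pad the rest (original-Darknet style)."""
        h, w = img.shape[:2]
        scale = min(size / w, size / h)
        new_w, new_h = int(round(w * scale)), int(round(h * scale))
        resized = cv2.resize(img, (new_w, new_h))
        canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)  # assumes 3-channel input
        top, left = (size - new_h) // 2, (size - new_w) // 2
        canvas[top:top + new_h, left:left + new_w] = resized
        return canvas

    def stretch(img, size=416):
        """Resize directly to size x size (this repo): objects stay as large as
        possible, but the aspect ratio changes if the source image is not square."""
        return cv2.resize(img, (size, size))

For a 640x360 source, letterboxing scales everything by 416/640 ≈ 0.65 and pads the top and bottom, while stretching scales the width by ≈0.65 but the height by 416/360 ≈ 1.16, so boxes become relatively taller.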

@MyVanitar

Okay, so this Darknet repo does not keep the aspect ratio; I think it is the same with SSD.
If I pad the images to an exact fraction or multiple of 416, is that good (for this repo)?

@AlexeyAB

@VanitarNordic

If I pad the images to an exact fraction or multiple of 416, is that good (for this repo)?

Do you mean that you will do the same as the original Darknet, but by yourself? It will keep the aspect ratio, but the objects will end up at their smallest size. If you have small objects, this is a bad idea. But if you have big objects and your images all have different sizes, then it is a good idea.

@MyVanitar

Do you mean that you will do the same as the original Darknet, but by yourself?

Yes, by doing it myself before starting the training. You said this repo does not keep the aspect ratio, so I want to pad all images beforehand to an exact fraction or multiple of 416. Then, even though the network does not keep the aspect ratio, the numbers divide evenly and the objects will not end up with unnatural shapes.

saihv commented Feb 26, 2018

@AlexeyAB

Thanks a lot for the detailed reply! I will note your suggestions. Replies:

Did you get IoU using darknet map or darknet recall command?

I used darknet recall. But the 50% IoU I mentioned was on the test dataset, not validation. Validation IoU (the last line in the output) was about 65% IIRC.

What width= height= params do you use in the cfg-file?

As of now, just the defaults: 416x416.

What learning_rate, steps, scales and decay do you use?

momentum=0.9
decay=0.0005

learning_rate=0.001
policy=steps
steps=-1,100,80000,100000
scales=.1,10,.1,.1
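
For context, with policy=steps Darknet keeps multiplying the base learning_rate by each scale once the iteration passes the corresponding step, so this schedule gives roughly 0.0001 during the first 100 iterations (warm-up), 0.001 up to 80k, 0.0001 up to 100k, and 0.00001 after that. A tiny sketch of that cumulative rule (my reading of the steps policy, for illustration only):

    def lr_at(iteration, base_lr=0.001,
              steps=(-1, 100, 80000, 100000), scales=(0.1, 10, 0.1, 0.1)):
        """Cumulative 'steps' policy: multiply by scales[i] once iteration >= steps[i]."""
        lr = base_lr
        for step, scale in zip(steps, scales):
            if iteration < step:
                break
            lr *= scale
        return lr

    for it in (0, 99, 1000, 90000, 120000):
        print(it, lr_at(it))   # approx: 0.0001, 0.0001, 0.001, 0.0001, 1e-05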

saihv commented May 8, 2018

@AlexeyAB

Thanks a lot for the tips! The one that made the biggest difference was using 640x352 with random=0. Strangely, regenerating the anchors actually reduced the IoU (and mAP). Is this possible?

Also, would you happen to have any tips for improving training only on certain classes? My training data is somewhat unbalanced: some classes have a lot more images than others, and the output of detector map looks like this:

detections_count = 35981, unique_truth_count = 16923  
class_id = 0, name = boat,      ap = 100.00 % 
class_id = 1, name = building,      ap = 79.95 % 
class_id = 2, name = car,      ap = 90.91 % 
class_id = 3, name = drone,      ap = 90.91 % 
class_id = 4, name = group,      ap = 80.07 % 
class_id = 5, name = horseride,      ap = 90.91 % 
class_id = 6, name = paraglider,      ap = 100.00 % 
class_id = 7, name = person,      ap = 90.91 % 
class_id = 8, name = riding,      ap = 90.91 % 
class_id = 9, name = truck,      ap = 72.41 %       // Slightly lower iou/precision on this class for example
class_id = 10, name = wakeboard,      ap = 83.83 % 
class_id = 11, name = whale,      ap = 100.00 % 

Although the mAP/IoU looks really good on validation, it is slightly lower on the test data, so I am curious whether I can improve training for only specific classes.

AlexeyAB commented May 8, 2018

@saihv A simple solution is to make many duplicates of the images+labels for the classes that have a small number of images, then re-generate train.txt using Yolo_mark.
Thanks to data augmentation, even plain duplicates of images+labels will increase accuracy.
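
A minimal Python sketch of that idea (a hypothetical helper, not part of Yolo_mark): since Darknet samples training images from the lines of train.txt, listing the images of rare classes several extra times has roughly the same effect as physically copying the files and labels. The paths and label-file layout below are assumptions based on the usual Darknet convention (a .txt label next to each image with lines of "class_id x y w h").

    from pathlib import Path

    RARE_CLASSES = {1, 4, 9}   # e.g. building, group, truck
    EXTRA_COPIES = 3           # how many extra times to list images of these classes

    paths = [l.strip() for l in Path("data/train.txt").read_text().splitlines() if l.strip()]

    balanced = []
    for p in paths:
        balanced.append(p)
        label_file = Path(p).with_suffix(".txt")
        if label_file.exists():
            ids = {int(line.split()[0])
                   for line in label_file.read_text().splitlines() if line.strip()}
            if ids & RARE_CLASSES:
                balanced.extend([p] * EXTRA_COPIES)

    Path("data/train_balanced.txt").write_text("\n".join(balanced) + "\n")

Point train= in the .data file at the new list; Darknet's built-in augmentation (jitter, hue, exposure, flips) then makes the repeated samples look different on each pass.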

saihv commented May 15, 2018

Got it, thank you! I will try that.

Just one last question in the custom dataset area, if you don't mind:

I am working on an object detection contest where I only have access to training data. I am supposed to train a model, which is then evaluated on a test set with the same classes (I don't have access to these test images), and the evaluation metric is the average IoU. I am splitting the given data into train and validation (as usual) and training my tiny YOLO model, but there is a noticeably large difference in IoU between validation and test (avg. 80% on validation vs 60% on test).

I guess this could be for multiple reasons: the test data might be more challenging, or perhaps it has a different distribution of images per class etc. But conceptually, this seems like a tricky problem because the model does perform well in validation, yet there still appears to be some overfitting when it comes to new data. So that makes me curious: are there any tips or tricks for making a model generalize better? Thanks!

@AlexeyAB

the test data might be more challenging, or perhaps it has a different distribution of images per class etc.

Yes.


So that makes me curious: are there any tips or tricks for making a model generalize better?

Increase the data augmentation params and train for about 10x more iterations:
random=1, jitter=0.4, and increase width and height to 608 or 832.
If you need to detect objects with different colors as the same class_id, also increase hue=0.2 saturation=1.8 exposure=1.8.

Also fix this mistake: change

mask = 1,2,3

to

mask = 0,1,2
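
Putting these suggestions together, the cfg changes would look roughly like the sketch below (hue/saturation/exposure live in the [net] section, jitter/random in the detection layer; the values are just the ones mentioned above, not a tested configuration):

    [net]
    width=608          # or 832: higher resolution helps small objects, costs speed
    height=608
    hue=.2             # only if color should not distinguish classes
    saturation=1.8
    exposure=1.8

    [yolo]             # or [region], depending on your cfg
    jitter=.4          # stronger crop/translation augmentation: needs many more iterations
    random=1           # multi-scale training
    mask = 0,1,2       # was 1,2,3 (only applies to layers that have a mask= line)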

saihv commented May 15, 2018 via email

@AlexeyAB

If you want to use random=1 with a non-square network (640x352), then you should download the latest version of Darknet from this GitHub repository.

Also, did you re-calculate the anchors? You can do that too, for -width 20 -height 11:
https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

saihv commented May 15, 2018

I tried regenerating anchors in the past through this command:

gen_anchors.py -filelist data/train.txt -output_dir data/anchors -num_clusters 5

But using those anchors actually decreased the IoU. I now see that I should probably try with those width and height arguments (net.w/32 and net.h/32 I guess?)

@AlexeyAB

@saihv Set these values in gen_anchors.py. Change:

width_in_cfg_file = 416.
height_in_cfg_file = 416.

to:

width_in_cfg_file = 640.
height_in_cfg_file = 352.
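
The values follow from the fixed downsampling factor of 32 in (tiny) YOLO v2: anchors are expressed in cells of the final feature map, so a 640x352 network gives a 20x11 grid, which is where the -width 20 -height 11 mentioned above comes from. The arithmetic, for reference:

    net_w, net_h = 640, 352
    stride = 32                              # tiny-YOLO downsamples the input by 32
    print(net_w // stride, net_h // stride)  # 20 11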

saihv commented May 15, 2018

Oops, I should have looked at that! Thanks for pointing it out, will change it and try.

saihv commented May 20, 2018

I tried training with random=1, which produces a slightly lower validation IoU (avg. 75% vs 80% with random=0); most of the inaccuracy comes from classes with relatively smaller objects. (I did include small_object=1 in the cfg file, but the objects are not smaller than 1% of the image, so I don't know whether this parameter helps.)

Would it be helpful to train at a higher resolution (more than 640x360 but still non-square) with random=0, but do inference at 640x352?

@AlexeyAB

@saihv What mAP do you get with random=1 and with random=0?
Usually the training resolution should be about the same as the detection resolution, provided the images in the training and detection datasets have the same resolution.

saihv commented May 21, 2018

random=0:

detections_count = 35981, unique_truth_count = 13965 
class_id = 0, name = boat, 	 ap = 100.00 % 
class_id = 1, name = building, 	 ap = 79.95 % 
class_id = 2, name = car, 	 ap = 90.91 % 
class_id = 3, name = drone, 	 ap = 90.91 % 
class_id = 4, name = group, 	 ap = 80.07 % 
class_id = 5, name = horseride, 	 ap = 90.91 % 
class_id = 6, name = paraglider, 	 ap = 100.00 % 
class_id = 7, name = person, 	 ap = 90.91 % 
class_id = 8, name = riding, 	 ap = 90.91 % 
class_id = 9, name = truck, 	 ap = 72.41 % 
class_id = 10, name = wakeboard, 	 ap = 83.83 % 
class_id = 11, name = whale, 	 ap = 100.00 % 
 for thresh = 0.24, precision = 0.97, recall = 0.98, F1-score = 0.98 
 for thresh = 0.24, TP = 16597, FP = 453, FN = 326, average IoU = 81.34 % 

 mean average precision (mAP) = 0.892338, or 89.23 %

random=1:

detections_count = 38874, unique_truth_count = 13965  
class_id = 0, name = boat, 	 ap = 90.91 % 
class_id = 1, name = building, 	 ap = 66.27 % 
class_id = 2, name = car, 	 ap = 90.89 % 
class_id = 3, name = drone, 	 ap = 90.53 % 
class_id = 4, name = group, 	 ap = 59.67 % 
class_id = 5, name = horseride, 	 ap = 90.63 % 
class_id = 6, name = paraglider, 	 ap = 100.00 % 
class_id = 7, name = person, 	 ap = 90.89 % 
class_id = 8, name = riding, 	 ap = 90.84 % 
class_id = 9, name = truck, 	 ap = 69.11 % 
class_id = 10, name = wakeboard, 	 ap = 80.74 % 
class_id = 11, name = whale, 	 ap = 90.87 % 
 for thresh = 0.25, precision = 0.95, recall = 0.95, F1-score = 0.95 
 for thresh = 0.25, TP = 13223, FP = 658, FN = 742, average IoU = 75.12 % 

 mean average precision (mAP) = 0.842781, or 84.28 % 

Please note the difference in AP for classes 1, 4 and 9, which are the challenging ones with smaller object sizes. Both configurations were trained for about 120k iterations, after which the mAP settles and does not change much.

@AlexeyAB

@saihv Try to change these lines:

darknet/src/detector.c

Lines 132 to 134 in 4403e71

int random_val = rand() % 12;
int dim_w = (random_val + (init_w / 32 - 5)) * 32; // +-160
int dim_h = (random_val + (init_h / 32 - 5)) * 32; // +-160

to these:

float random_val = rand_scale(1.4);    // *x or /x
int dim_w = roundl(random_val*init_w / 32) * 32;
int dim_h = roundl(random_val*init_h / 32) * 32;

And train with random=1, what mAP will you get?
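
Assuming rand_scale(1.4) works as in Darknet's utils (it returns a factor between 1/1.4 and 1.4, randomly chosen as either x or 1/x), this change scales both dimensions by the same factor, so a 640x352 network roughly keeps its aspect ratio during multi-scale training, instead of having the same ±160 px added to each side. A quick check of the resulting sizes:

    init_w, init_h = 640, 352
    for factor in (1 / 1.4, 1.0, 1.4):       # extremes and midpoint of rand_scale(1.4)
        dim_w = round(factor * init_w / 32) * 32
        dim_h = round(factor * init_h / 32) * 32
        print(dim_w, dim_h)                  # 448 256, 640 352, 896 480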

saihv commented May 22, 2018

Trained it for 120k iterations with those changes, and now the mAP is pretty close to random=0:

detections_count = 30511, unique_truth_count = 13965  
class_id = 0, name = boat, 	 ap = 90.91 % 
class_id = 1, name = building, 	 ap = 82.86 % 
class_id = 2, name = car, 	 ap = 90.91 % 
class_id = 3, name = drone, 	 ap = 90.91 % 
class_id = 4, name = group, 	 ap = 78.91 % 
class_id = 5, name = horseride, 	 ap = 100.00 % 
class_id = 6, name = paraglider, 	 ap = 100.00 % 
class_id = 7, name = person, 	 ap = 90.90 % 
class_id = 8, name = riding, 	 ap = 90.91 % 
class_id = 9, name = truck, 	 ap = 72.62 % 
class_id = 10, name = wakeboard, 	 ap = 88.94 % 
class_id = 11, name = whale, 	 ap = 100.00 % 
 for thresh = 0.25, precision = 0.97, recall = 0.98, F1-score = 0.97 
 for thresh = 0.25, TP = 13632, FP = 416, FN = 333, average IoU = 80.81 % 

 mean average precision (mAP) = 0.898222, or 89.82 % 

But I guess because random=1 switches between low and high resolutions, it might be beneficial to train for more iterations.

@AlexeyAB

But I guess because random=1 switches between low and high resolutions, it might be beneficial to train for more iterations.

Yes. random=1 is almost the same as having 2x more images, so it requires 2x more iterations.

What jitter do you use in all these cases?

saihv commented May 23, 2018

I am still using jitter=0.2. I remember one of your suggestions was to move to 0.4, but I was just testing one thing at a time, so that's next on my list.

@AlexeyAB

Yes, it's better to test one thing at a time. Changing jitter from 0.2 to 0.4 requires about 5-10x more iterations.
