Any suggestions about improving an imbalanced dataset for training a detector? #3001

Open
yangulei opened this issue Apr 22, 2019 · 21 comments

@yangulei

yangulei commented Apr 22, 2019

I want to train a YOLO detector to detect ‘bus’, ‘car’ and ‘truck’ in videos recorded by a drone. Here is what I have done so far:

  • Extract images from the videos at 5-second intervals; call this the mass dataset.
  • Calculate the pHash of each image for later similarity comparison.
  • For every image in the mass dataset, calculate its max similarity to the images already in the selected dataset. If the max similarity is less than a threshold, add the image to the selected dataset (see the sketch at the end of this comment).
  • Label the images in the selected dataset (540 images), which seriously hurts my eyes.
  • Split the selected dataset into train/valid sets (435 vs. 105) and start training the model.

Then I hit a problem: the recall during training quickly reaches about 100%, but the mAP is only about 25% and it decreases with more training steps. I guess this is the so-called overfitting problem.
I also noticed that the selected dataset is badly imbalanced: the numbers of ‘bus’, ‘car’ and ‘truck’ objects are 0.6k, 20k and 1.4k respectively.
I'm going to select more images with relatively low confidence to enrich my dataset, following the concept of active learning. But I don't know how to deal with the imbalance of the dataset. Does anyone have any ideas?
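A minimal sketch of the selection step above, assuming Python with the Pillow and imagehash packages (the paths and the Hamming-distance threshold are only illustrative, not my exact code):

# Greedy pHash-based selection: keep an image only if it is far enough
# (in pHash Hamming distance) from everything already selected.
import glob
from PIL import Image
import imagehash

HAMMING_THRESHOLD = 10   # assumed value, tune for your own data
selected = []            # (path, phash) pairs forming the selected dataset

for path in sorted(glob.glob("mass_dataset/*.jpg")):
    h = imagehash.phash(Image.open(path))
    # Max similarity to the selected set == min Hamming distance to it.
    min_dist = min((h - sel_hash for _, sel_hash in selected),
                   default=HAMMING_THRESHOLD + 1)
    if min_dist > HAMMING_THRESHOLD:
        selected.append((path, h))

print("kept %d images" % len(selected))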

@JakupGuven

@yangulei
There are some things you can do in this situation.
First, you can augment your images to get more data to train on.
You can use pre-trained weights from models trained on ImageNet or another dataset (this repo and pjreddie.com provide them); this is called transfer learning.
You can do hyperparameter tuning, i.e. make changes in the cfg file. Joseph Redmon outlines how he does this in his research papers, for instance increasing the network height and width.
Also, ideally your dataset should have an equal distribution of the classes you want your model to recognize.

@AlexeyAB
Owner

@yangulei

  • Can you show output anchors and cloud.png that you get by using command?
    ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 -show

  • Can you attach your cfg-file (renamed to txt-file)?

  • Do you want to Detect objects on images or on video (file, camera, ...)?


Label the images in the selected dataset (540 images), which seriously hurts my eyes.
Split the selected dataset into train/valid sets (435 vs. 105) and start training the model.
I want to train a YOLO detector to detect ‘bus’, ‘car’ and ‘truck’ in videos recorded by a drone.

Did you use Yolo mark? https://github.com/AlexeyAB/Yolo_mark
For good results you should have about 2000 images per class, i.e. about 6000 images :) https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

So collect more images and do data augmentation (rotation) yourself, because rotation augmentation isn't implemented in Yolo yet.

For each object you want to detect, there must be at least 1 similar object in the training dataset with about the same shape, side of object, relative size, angle of rotation, tilt and illumination. So it is desirable that your training dataset includes images with objects at different scales, rotations and lightings, from different sides, and on different backgrounds. You should preferably have 2000 different images for each class or more, and you should train for 2000*classes iterations or more.
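For the three classes in this thread (‘bus’, ‘car’ and ‘truck’) that rule of thumb works out to roughly 3*2000 = 6000 labeled images and at least 3*2000 = 6000 training iterations (max_batches in the cfg-file).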

@yangulei
Author

yangulei commented Apr 23, 2019

@JakupGuven @AlexeyAB Thanks for your reply and suggestions.

Do you want to Detect objects on images or on video (file, camera, ...)?

My goal is to train a YOLO detector to detect 'bus', 'car' and 'truck' in frames taken by a camera mounted on a drone. The detector model will be deployed on an NVIDIA Jetson TX2(i) flying with the drone. For deployment I'll use NVIDIA TensorRT, which speeds up inference by about 2 to 3 times.

Can you attach your cfg-file (renamed to txt-file)?

Here is the cfg-file I used for training. It is an early prototype, customized with several considerations in mind:

  • The TX2 has limited FLOPS, so I must keep the CNN relatively small, i.e. fewer layers and fewer filters, in order to get enough FPS.
  • The "concatenation" layer can't be optimized by TensorRT, so I avoid "shortcut" or "route" layers with more than one input layer.
  • The "leaky" activation was not officially supported by TensorRT before, so I changed it to "relu". The latest TensorRT 5 release does support "leaky" activation; I'll test it later.

Can you show output anchors and cloud.png that you get by using command?

The input size of my model is 960*640 and there are only 2 "yolo" layers, so I calculated the anchors using the command:
darknet detector calc_anchors data\drone.data -num_of_clusters 6 -width 960 -height 640 -show
and I got this:
[screenshot: calc_anchors console output]

and this:
[screenshot: cloud.png anchor plot]

Did you use Yolo mark? https://github.com/AlexeyAB/Yolo_mark

I use LabelImg. https://github.com/tzutalin/labelImg

For good results you should have about 2000 images per class, i.e. about 6000 images :)

This is the point that confuses me. Why does the number of images matter, instead of the number of objects in the images? There might be dozens or even hundreds of objects in a single frame:
[screenshot: labeled drone frame with many objects]

It's really a huge amount of work to label 6000 images!
I know 540 images are far from enough and I will collect more, but my question is how to reduce the imbalance of the dataset at the same time.

do data augmentation (rotation) yourself, because rotation augmentation isn't implemented in Yolo yet.

That's a good idea; I'll check whether the rotated frames are similar to real scenes. Does the "angle" parameter in the cfg-file mean rotation augmentation? I found a function that seems to do the job:

darknet/src/image.c

Lines 1005 to 1024 in 099b71d

image random_augment_image(image im, float angle, float aspect, int low, int high, int size)
{
    aspect = rand_scale(aspect);
    int r = rand_int(low, high);
    int min = (im.h < im.w*aspect) ? im.h : im.w*aspect;
    float scale = (float)r / min;
    float rad = rand_uniform(-angle, angle) * 2.0 * M_PI / 360.;
    float dx = (im.w*scale/aspect - size) / 2.;
    float dy = (im.h*scale - size) / 2.;
    if(dx < 0) dx = 0;
    if(dy < 0) dy = 0;
    dx = rand_uniform(-dx, dx);
    dy = rand_uniform(-dy, dy);
    image crop = rotate_crop_image(im, rad, scale, size, size, dx, dy, aspect);
    return crop;
}

@AlexeyAB
Owner

@yangulei

Set num=6 for both [yolo] layers in cfg-file, since you use only 6 anchors.

I'll check whether the rotated frames are similar to real scenes.

Yes, but maybe at your shooting angle the rotation augmentation is possible only within +-15 degrees.

Does the "angle" parameter in the cfg-file mean rotation augmentation?

Yes. But currently it works only for training the Classifier.


This is the point that confuses me. Why does the number of images matter, instead of the number of objects in the images? There might be dozens or even hundreds of objects in a single frame:

What matters is the number of objects and the number of backgrounds.
So you should collect more images (even if there are no objects) to get more backgrounds.


My goal is to train a YOLO detector to detect 'bus', 'car' and 'truck' in frames taken by a camera mounted on a drone. The detector model will be deployed on an NVIDIA Jetson TX2(i) flying with the drone. For deployment I'll use NVIDIA TensorRT, which speeds up inference by about 2 to 3 times.

What software (repository) do you use for detection with TensorRT?

Also, you should compare different trained models by Accuracy/Detection_Time. There are LSTM-Convolutional networks which can detect on video much better than usual Convolutional networks (~1.5x higher mAP on video).

Did you compare Accuracy/Detection_Time for [maxpool] stride=2 instead of [convolutional] stride=2 in your small model?

In your small model, each final activation has a receptive field of about 160x160 pixels, so I think it should be enough for your small objects.

@yangulei
Author

@AlexeyAB

Set num=6 for both [yolo] layers in cfg-file, since you use only 6 anchors.

Oh, you are right, I just forgot that. I'll correct this and train my model again, thanks for pointing that out.

maybe at your shooting angle the rotation augmentation is possible only within +-15 degrees.

If I do the rotation augmentation myself, how do I calculate the labels for the augmented images? I'm afraid the bounding box of the rotated rectangle will always be somewhat bigger than the ground truth. Do you have a better idea?

What software (repository) do you use for detection with TensorRT?

I wrote a simplified YOLO parser referring to the deepstream reference apps. It doesn't use any plugins, only the layers officially supported by TensorRT.

Also, you should compare different trained models by Accuracy/Detection_Time. There are LSTM-Convolutional networks which can detect on video much better than usual Convolutional networks (~1.5x higher mAP on video).

So far I'm following the tracking-by-detection scheme. I agree that LSTM-Convolutional networks should be better, but I don't have a good enough understanding of LSTMs yet. I learned CNNs mainly through Stanford CS231n; do you have any suggestions for courses or books for learning LSTMs?

Did you compare Accuracy/Detection_Time for [maxpool] stride=2 instead of [convolutional] stride=2 in your small model?

Not yet. In fact, I combine each [maxpool] and the [convolutional] after it into a single [convolutional] with stride=2, because personally I don't like [maxpool]. In my opinion, the downsampling strategy should be learned during training instead of being set by hand. I also noticed that the [maxpool] + [convolutional] pairs in yolov2.cfg are replaced by [convolutional] with stride=2 in yolov3.cfg.

Back to my original concern: how do I reduce the imbalance while enriching the dataset? Does this question have an answer, or will it no longer exist once the dataset is rich enough (through augmentation and/or labeling)?

@LukeAI

LukeAI commented Apr 24, 2019

You'll probably get better results with simple upsampling. This paper https://arxiv.org/pdf/1710.05381.pdf from October found consistently better results in visual object detection tasks by upsampling to parity. You could further improve by tweaking the class thresholds before the softmax to reflect the expected distribution in the population. I.e., usually you take whichever prediction is highest to be the most probable class, but if you know that cars are much more common than buses then you might set thresholds as [Car: 0.05, Bus: 0.2], and then interpret a probability vector [Car: 0.15, Bus: 0.17] as a prediction for a car. [Also described in more detail in the afore-linked paper.]
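One possible way to read that threshold trick, as a small sketch (the class names, thresholds and scores below are made up for illustration):

# Pick the class whose score exceeds its per-class threshold by the largest ratio,
# instead of simply taking the highest raw score.
thresholds = {"car": 0.05, "bus": 0.20, "truck": 0.10}   # assumed per-class thresholds
scores     = {"car": 0.15, "bus": 0.17, "truck": 0.02}   # model output for one detection

best = max(scores, key=lambda c: scores[c] / thresholds[c])
print(best)   # "car": 0.15/0.05 = 3.0 beats "bus": 0.17/0.20 = 0.85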

@AlexeyAB
Owner

@yangulei

If I do the rotation augmentation myself, how do I calculate the labels for the augmented images? I'm afraid the bounding box of the rotated rectangle will always be somewhat bigger than the ground truth. Do you have a better idea?

No, I don't have any better ideas ) Just note that if the rotations are very small, the bbox will only be a little bit bigger.
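For reference, a minimal sketch (not code from this repo; clipping to the image borders and the rotation sign convention of your image library are left out) of recomputing an axis-aligned YOLO box after the image is rotated around its center - the enclosing box of the rotated rectangle necessarily grows a bit:

import math

# Hypothetical helper: given a YOLO box (normalized cx, cy, w, h) and a rotation
# of the image by angle_deg around its center (canvas size unchanged), return
# the axis-aligned box that encloses the rotated rectangle.
def rotate_yolo_box(cx, cy, w, h, angle_deg, img_w, img_h):
    rad = math.radians(angle_deg)
    cos_a, sin_a = math.cos(rad), math.sin(rad)
    # Box corners in pixels, relative to the image center.
    px, py = cx * img_w - img_w / 2.0, cy * img_h - img_h / 2.0
    bw, bh = w * img_w, h * img_h
    corners = [(px + sx * bw / 2.0, py + sy * bh / 2.0)
               for sx in (-1, 1) for sy in (-1, 1)]
    rotated = [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in corners]
    xs, ys = zip(*rotated)
    new_cx = ((max(xs) + min(xs)) / 2.0 + img_w / 2.0) / img_w
    new_cy = ((max(ys) + min(ys)) / 2.0 + img_h / 2.0) / img_h
    return new_cx, new_cy, (max(xs) - min(xs)) / img_w, (max(ys) - min(ys)) / img_h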

Not yet. In fact, I combine each [maxpool] and the [convolutional] after it into a single [convolutional] with stride=2, because personally I don't like [maxpool]. In my opinion, the downsampling strategy should be learned during training instead of being set by hand. I also noticed that the [maxpool] + [convolutional] pairs in yolov2.cfg are replaced by [convolutional] with stride=2 in yolov3.cfg.

In the big full yolov3.cfg, conv-stride=2 is used, but in the small yolov3-tiny.cfg, maxpool-stride=2 is used.
I have run only a few experiments with yolov3-tiny.cfg using conv-stride=2, and it seems you need to train it much longer.

Back to my original concern: how do I reduce the imbalance while enriching the dataset? Does this question have an answer, or will it no longer exist once the dataset is rich enough (through augmentation and/or labeling)?

In the optimizer, this is already solved as much as possible by using decay #1845 (comment).
In most cases focal_loss=1 in the [yolo] layer (the focal loss used in RetinaNet) doesn't help to solve the imbalance.
So you should just add more images, especially with buses and trucks.


So far I'm following the tracking-by-detection scheme. I agree that LSTM-Convolutional networks should be better, but I don't have a good enough understanding of LSTMs yet. I learned CNNs mainly through Stanford CS231n; do you have any suggestions for courses or books for learning LSTMs?

What Tracker do you use?
No, I don't have a suggestion for a good book/course.
You can try starting from https://en.wikipedia.org/wiki/Long_short-term_memory
And if you have enough time: https://arxiv.org/pdf/1506.04214v2.pdf
In a few days I will add Conv-LSTM layers and a model for detection, with a description; I'm currently testing it.
[image: LSTM cell diagram]

@holger-prause

I am facing the same problem (unbalanced dataset) - here are some things I want to try out:

  • multi-labeling
    with YOLOv3 you can supply multiple classes for one object, so try to build some kind of hierarchy and label accordingly (e.g. label vans as both van and car)
  • augmentation
    you could crop and rotate the object and also place it on a new background (sampled from a negative image)

I will try these and see how they help; I think they are not bad ideas.

@yangulei
Author

@LukeAI Thanks for sharing your ideas.

You'll probably get better results with simple upsampling. ...

I think you mean oversampling (the term used in the paper). As it says in the paper:

The main idea is to ensure uniform class distribution of each mini-batch and control the selection of examples from each class.

This is done by selecting more samples of the minority classes within each mini-batch, which is straightforward for a classification task. But I can't figure out how to apply this in a detection task. The samples of all classes are embedded in the same image, and I don't have enough images in which there are more "bus" or "truck" objects than "car" objects, so I don't know how to balance the class distribution within a mini-batch.

@yangulei
Author

@AlexeyAB

What Tracker do you use?

I'm using a C++ implementation of SORT.

In several days I will add Conv-LSTM layers and model for Detection, with a description, currently I test it.

Amazing, looking forward to that. : )

@yangulei
Author

@holger-prause
Looking forward to your updates. : )

@LukeAI

LukeAI commented Apr 25, 2019

I think you mean oversampling (the term used in the paper). As it says in the paper:

yeah good point :)

I can't figure out how to apply this in a detection task. The samples of all classes are embedded in the same image, and I don't have enough images in which there are more "bus" or "truck" objects than "car" objects, so I don't know how to balance the class distribution within a mini-batch.

I see what you mean... just throwing an idea out there but possibly you could write a script to crop out regions with lots of cars only and append that to the original dataset (and fill in with blackness) as a crude way to balance things out a bit more? Another approach mentioned in that paper is to tune the softmax thresholds to try to compensate for the bias in the model resulting from the imbalance? What do you think?

@holger-prause

holger-prause commented Apr 25, 2019

Wow, I think the idea of balancing the mini-batch is the way to go!
This way you make sure the model won't "forget" about "irrelevant" samples.
Well, I don't know how to do this in YOLO (a custom loss function? Hmm, I don't think so) - I guess you would need to adapt the code which reads in the samples?

So the idea is not to balance the dataset but to balance what your model sees during training?
That again makes sense to me - thank you guys very much, this thread is good!

Good that this problem is solvable :-)

@LukeAI

LukeAI commented Apr 25, 2019

I'm not sure I see why balancing every mini-batch would give a different result from the conventional, more straightforward approach of balancing the whole dataset and shuffling. I'm ready to be proved wrong though; if you get any results, please do update us.

@yangulei
Author

yangulei commented Apr 26, 2019

@LukeAI

possibly you could write a script to crop out regions with lots of cars only and append that to the original dataset (and fill in with blackness) as a crude way to balance things out a bit more?

This is a good idea, just like @holger-prause said. But we would need to find a better way to merge the cropped objects with the background; it looks too artificial for now. And I got another idea: copy the sample images and blur the objects of the majority classes (maybe with additional augmentation) to make more samples for the minority classes.
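A rough sketch of that blur idea (purely illustrative; it assumes OpenCV, YOLO-format label files next to each image, and that class id 1 is 'car'):

import cv2

MAJORITY_CLASS = 1   # assumed class id for 'car'

def blur_majority(img_path, label_path, out_img, out_label):
    # Copy an image, blur the boxes of the majority class, and write a new label
    # file that keeps only the remaining (minority-class) objects.
    img = cv2.imread(img_path)
    h, w = img.shape[:2]
    kept_lines = []
    with open(label_path) as f:
        for line in f:
            cls, cx, cy, bw, bh = (float(v) for v in line.split())
            x1, y1 = max(0, int((cx - bw / 2) * w)), max(0, int((cy - bh / 2) * h))
            x2, y2 = min(w, int((cx + bw / 2) * w)), min(h, int((cy + bh / 2) * h))
            if int(cls) == MAJORITY_CLASS and x2 > x1 and y2 > y1:
                img[y1:y2, x1:x2] = cv2.GaussianBlur(img[y1:y2, x1:x2], (31, 31), 0)
            else:
                kept_lines.append(line)
    cv2.imwrite(out_img, img)
    with open(out_label, "w") as f:
        f.writelines(kept_lines)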

Another approach mentioned in that paper is to tune the softmax thresholds to try to compensate for the bias in the model resulting from the imbalance? What do you think?

I don't think this could affect the training process, unless we could find a way to add this logic to the loss function.

@LukeAI

LukeAI commented Apr 26, 2019

I don't think this could affect the training process, unless we could find a way to add this logic to the loss function.

No, it wouldn't affect the training - this is for inference time. The paper suggests that the best results come from a balanced training set with inference-time thresholds adjusted to reflect the population distribution, but it mentions that others have obtained reasonable results by training on an unbalanced set and using this technique to offset that bias at inference time.

@LukeAI

LukeAI commented Apr 29, 2019

Related question:

In my previous trainings, I have generally tried to balance frequency of instances of each class - as opposed to number of images containing each class. eg. If I have 200 photos with a total of 1000 cars and 100 photos with a total of 200 dogs then I will oversample the dog photos by a factor of 5, not by a factor of 2. Does this sound like the best approach?

@AlexeyAB
Owner

@LukeAI

If I have 200 photos with a total of 1000 cars and 100 photos with a total of 200 dogs then I will oversample the dog photos by a factor of 5, not by a factor of 2.

What is "oversample"?

@LukeAI

LukeAI commented Apr 29, 2019

@LukeAI

If I have 200 photos with a total of 1000 cars and 100 photos with a total of 200 dogs then I will oversample the dog photos by a factor of 5, not by a factor of 2.

What is "oversample"?

Using duplicate copies of the minority classes so that during training the network is exposed to roughly the same number of examples of each class, so as to avoid bias. A classic extreme example is training a network to identify fraudulent bank transactions. Almost all transactions are not fraudulent, so if you use a representative sample of all bank transactions for training, your network will probably just learn to predict that every transaction is non-fraudulent, because that prediction was correct 99.9999% of the time during training.

I do this with darknet by putting multiple entries of images containing minority classes into train.txt and test.txt
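A possible sketch of that duplication step (the file names, the label-file convention, and the per-image repeat rule are my assumptions, not something darknet provides):

import os

# Count objects of each class in a YOLO-format label file.
def class_counts(label_path):
    counts = {}
    with open(label_path) as f:
        for line in f:
            if line.strip():
                cls = int(line.split()[0])
                counts[cls] = counts.get(cls, 0) + 1
    return counts

with open("train.txt") as f:
    images = [p.strip() for p in f if p.strip()]

# Per-image and dataset-wide object counts per class.
per_image = {img: class_counts(os.path.splitext(img)[0] + ".txt") for img in images}
totals = {}
for counts in per_image.values():
    for c, n in counts.items():
        totals[c] = totals.get(c, 0) + n

max_total = max(totals.values())
with open("train_oversampled.txt", "w") as out:
    for img in images:
        # Repeat each image according to its rarest class; images containing only
        # the most common class keep a factor of 1.
        factor = max((max_total // totals[c] for c in per_image[img]), default=1)
        out.write((img + "\n") * max(1, factor))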

@AlexeyAB
Owner

@LukeAI Yes, it's a good solution.

@mdv3101

mdv3101 commented May 3, 2019

@AlexeyAB
I am trying to implement weapon detection and am facing rotation-related issues.
I am thinking of rotating each image by multiples of 30 degrees, giving 12 versions of the same image.

If this could be handled in the cfg file, that would be wonderful.
