
Raspberry Pi YOLO Training #289

Closed
WTeichert opened this issue Dec 2, 2017 · 36 comments

Labels
Solved The problem is solved using the correct settings

@WTeichert commented Dec 2, 2017

Greetings everyone,

I am in the middle of my student research project. For it, I am creating an object detection and classification pipeline that fits on a Pi. I am using YAD2K running on the Pi because it has lower computational demands. I plan to train my network on VOC with different training cfg's.

I am asking you for any advice, tips or tricks I can use.

So far I will change:

  • number of convolutional and pooling layers + filters per layer (like yolo-voc -> yolo-voc-tiny)
  • learning rate, steps, scales
  • max batches
  • height and width (to a minimum of 224 - possible? What should I do with the anchors, divide by 2? Do I have to add "resize_network(nets + i, nets[i].w, nets[i].h);" in detector.c, lines 40-41?)
  • greyscale pictures with channels=1
  • random on or off
  • max pooling size
  • stride of 2 <- much more speed, less accuracy

I also have a few questions:
What does activation: leaky or linear do?
saturation/exposure are always the same - what do they do?

Thank you for all inspiration! :)

@AlexeyAB (Owner)

Hi,

height and width (to a minimum of 224 - possible? What should I do with the anchors, divide by 2? Do I have to add "resize_network(nets + i, nets[i].w, nets[i].h);" in detector.c, lines 40-41?)

  • You shouldn't add resize_network(). Just set width=224 height=224 in your cfg-file.
  • Yes, just divide the anchors by 2.
  • If you use random=1 then you should change these two lines in darknet/src/detector.c (lines 98 to 99 in 75c39f5):

```
int dim = (rand() % 10 + 10) * 32;
if (get_current_batch(net)+100 > net.max_batches) dim = 544;
```

to these, for a resolution of ~224x224 (see the note after this list):

```
int dim = (rand() % 5 + 5) * 32;
if (get_current_batch(net)+100 > net.max_batches) dim = 224;
```

  • random=1 gives you about +1% mAP.
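
For reference, (rand() % 5 + 5) * 32 picks one of 160, 192, 224, 256, 288 (multiples of 32 around 224), whereas the original (rand() % 10 + 10) * 32 picks multiples of 32 from 320 to 608, matching the larger default input resolutions.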

What does activation: leaky or linear do?

  • linear: y = x
  • leaky (ReLU): if(x>0) { y = x; } else { y = x/10; }
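
For illustration, a minimal C sketch of these two activations (the function names are illustrative, not necessarily darknet's exact symbols):

```c
/* linear: identity; leaky ReLU: pass positives through, damp negatives by 10x */
static inline float linear_activate(float x) { return x; }
static inline float leaky_activate(float x)  { return x > 0 ? x : x / 10; }
```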

saturation/exposure are always the same, what do they do?

saturation, exposure and hue values are the ranges for random colour changes applied to images during training (parameters for data augmentation), in terms of HSV: https://en.wikipedia.org/wiki/HSL_and_HSV
The larger the value, the more invariant the neural network becomes to changes in lighting and colour of the objects. More: #279 (comment)
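
As a rough sketch of how such a parameter is used (assuming a darknet-style helper; the real code may differ in detail), saturation = 1.5 means each training image's saturation is multiplied by a random factor drawn from [1/1.5, 1.5]:

```c
#include <stdlib.h>

/* Sketch, not darknet's exact code: draw a random jitter scale in [1/s, s],
 * as used for saturation/exposure augmentation with e.g. s = 1.5. */
static float rand_uniform(float lo, float hi)
{
    return lo + (hi - lo) * ((float)rand() / (float)RAND_MAX);
}

static float rand_scale(float s)
{
    float scale = rand_uniform(1.0f, s); /* factor in [1, s]       */
    if (rand() % 2) return scale;        /* scale up half the time */
    return 1.0f / scale;                 /* otherwise scale down   */
}
```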

@WTeichert (Author) commented Dec 16, 2017

Thank you a lot!

Of course I first watched the training. There I came to the point that Darknet19 448x448 should be used.
You've written that "This model performs significantly better but is slower since the whole image is larger.".
Since I need to speed up and tighten the whole algorithm, I want to use darknet19 in its basic configuration.

Now my question: where can I get these darknet19.conv.xx for training?
Could I use yolo-voc-tiny.weights as my base, like this backup training? (but my cfg changed in a few lines)

And one more question:
I got access to a computation centre where I have 4 CPUs and 2 GPUs I can use. As I've read on your page, YOLOv2 is not made for multi-CPU. Have there been any changes? Does it help that TensorFlow is configured for multi-CPU?

@AlexeyAB (Owner) commented Dec 16, 2017

@WTeichert (Author)

Thank you again ^^
This partial training sounds interesting!
I take a trained weight and make it my pre-trained base?
Where does the 13 come from (= number of layers - last "class" layer)?
Can I use my own cfg, or do I need to use tiny-yolo-voc.cfg? Wouldn't I have problems when they are different?

A little misunderstanding:

I am looking for darknet19, not trained on 448x448, so the previous version of it!
I want to use it all, so 4 CPUs + 2 GPUs for training.

For detection I have to look ahead to get the best out of a Raspberry Pi.

@AlexeyAB (Owner) commented Dec 16, 2017

  • Tiny-yolo has 16 layers, where the last 2 layers are the detection layer and a conv layer that depends on the number of classes. So for partial you can use any number of the first layers from 1 to 14; see the example command after this list.
  • To do partial you should use the cfg that corresponds to the weights file - i.e. tiny-yolo-voc.cfg for tiny-yolo-voc.weights.
  • Just use multi-GPU; each GPU is ~100x faster than each CPU, so there isn't any reason to use the CPU: https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
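
For example, extracting the first 13 layers as a pre-trained base would look something like this (the output filename tiny-yolo-voc.conv.13 is just a convention):

darknet.exe partial cfg/tiny-yolo-voc.cfg tiny-yolo-voc.weights tiny-yolo-voc.conv.13 13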

@WTeichert (Author)

Hey, first of all, thank you for your time.
I am now done with the trainings (I had some other things to do), but it doesn't work out like I thought.

I tried to train on Pascal VOC and followed your instructions; it all went fine.
Not sure if it matters, but I chose the pre-trained model darknet19_448.conv.23 instead of darknet53.conv.74 (I think this was changed by you?).
My cfg1 you can see below. With 45 000 iterations it mostly detects chairs, no matter if it is a person or a dog or whatever.
For cfg4 I just changed width+height to 608 and multiplied the anchors by 4 -> there is no detection at all; IOU and Recall are also 0 when I try to validate the weights.

Did I miss something, or is it just a network conflict, in that the parameters don't fit the dataset?
An overview of all cfg's I trained:

cfg_overview.pdf

```
[net]
batch=64
subdivisions=64
width=224
height=224
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
max_batches = 45000
policy=steps
steps=100,25000,35000
scales=.1,.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=125
activation=linear

[region]
anchors = 0.54,0.60, 1.71,2.2, 3.32,5.69, 4.71,2.55, 8.31,5.26
bias_match=1
classes=20
coords=4
num=5
softmax=1
jitter=.2
rescore=1

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

absolute=1
thresh = .6
random=0
```
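
For reference, the filters=125 in the last convolutional layer follows from the region layer's num*(classes + coords + 1) = 5*(20 + 4 + 1); the same formula is why the 11-class experiment later in this thread needs 5*(11 + 4 + 1) = 80 filters.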

@AlexeyAB (Owner)

@WTeichert

  1. darknet53.conv.74 should be used only if your cfg-file is based on yolov3.cfg (only for Yolo v3). But if your cfg-file is based on tiny-yolo-voc.cfg, yolov2-tiny-voc.cfg, yolo-voc.2.0.cfg or yolov2-voc.cfg (Yolo v2), then you should use darknet19_448.conv.23.

  2. On what cfg-file did you base your cfg-file?

  3. Did you try to train yolo on the CPU?

  4. Can you get any good results, or are the results of all the trainings bad?

  5. How many iterations did you train?

@WTeichert (Author)

  1. Then I was right.
  2. Based on tiny-yolo-voc.cfg.
  3. Trained on GPU; the CPU was way too slow (50 min for 1k iterations).
  4. No, all are trash.
  5. 45k-60k, see max batches.
    + I trained from Windows.

@AlexeyAB (Owner)

  • What was the average loss?
  • And what mAP can you get for one of your weights files? darknet.exe detector map voc.data your.cfg your_40000.weights
  • Can you compress your files (.data, .cfg, .names, train.txt, the cmd-file for training) and attach the archive here in a message?

@WTeichert (Author) commented Mar 29, 2018

  • I don't know.

  • mAP doesn't work, I get this error...
    File "...\YOLOv2\darknet-master\build\darknet\x64\voc_eval_py3.py", line 157, in voc_eval
    R = class_recs[image_ids[d]]
    KeyError: '003028'

  • training_data.zip

From tomorrow I have no more access to this data, so further data can be sent on Monday.

@AlexeyAB (Owner)

Change this line to use your files (data, cfg, weights):
darknet.exe detector map data/obj.data cfg/yolo_obj.cfg yolo-obj.weights

And run it.

Also, what command do you use for training?

@WTeichert (Author)

I've tried both ways: first with my data and cfg, and second with your tiny-voc and voc. It didn't work.

The command for training is written in train.cmd:
darknet.exe detector train data/cfg1.data cfg/cfg1.cfg darknet19_448.conv.23

@AlexeyAB (Owner)

@WTeichert This command:
darknet.exe detector map data/cfg1.data cfg/cfg1.cfg cfg1_40000.weights
can't give that error, because there is nothing from Python in it.

What error does this command give?

@WTeichert (Author)

Ahh, a little misunderstanding. I tried using map, but nothing happened (as far as I can see).
So I chose calc_mAP_voc_py.cmd and changed line 8 to my files and line 9 to my VOC dir.

voc_eval_py3.py, line 157 was the error.

Could it be that I chose the learning rate too small, so the network doesn't learn anything new from the new input size?
Or could the change in anchors have led to this mistake?

@AlexeyAB (Owner)

Attach a screenshot of the "nothing happened" that occurs after this command; you sometimes have to wait ~10 minutes while the mAP is calculated:
darknet.exe detector map data/cfg1.data cfg/cfg1.cfg cfg1_40000.weights

I don't know if there is any mistake. I can't say anything without the mAP.
What repo did you use for training?

@WTeichert (Author)

"Nothing happens" means nothing I can see directly.
I am not sure which repo I am using (repo?), but since the introduction says "If you use another GitHub repository, then use darknet.exe detector recall... instead of darknet.exe detector map", I tried both; map has not given me any visual output. The cmd just ended and the console waited for the next command, nothing happened (that's what I meant; there is no screenshot for it), so I chose recall to check IOU and recall. For IOU I got a maximum of 28%.
I cloned the GitHub repo we're talking on and followed the instructions for Windows, so I thought it should be this repo...
When I use darknet.exe detector valid, it creates a lot of blank class files in results. With yolo-voc, by contrast, they were full of notes and detections; that's all I can say about mAP.

The problem is I am not into C programming, only a little Python, so the detector.c compiling and the C functions I understand only at a surface level.

I am not at the office these days, so I can try once more on Tuesday, but I don't think it will change the results.

@WTeichert (Author)

[image]
The first command was with recall instead of map; the second doesn't give any result as far as I can see.

[image]
The first command was with valid instead of recall and created the files in results.

@AlexeyAB (Owner) commented Apr 3, 2018

@WTeichert Try to update your code from this repo.

@WTeichert (Author)

Done, but same error.
Was something changed in the detector? I cannot update the darknet version, since my MSVS license expired.

@AlexeyAB (Owner) commented Apr 6, 2018

You should recompile the code in MSVS after your repo is updated.
Yes: Yolo v3 was added, along with fused batch_norm (+7% speedup), anchor calculation and mAP, AVX on CPU (+20% speedup) and many other things...
You can install the free MSVS2015 Community that I use: https://go.microsoft.com/fwlink/?LinkId=532606&clcid=0x409

@WTeichert (Author)

Finally map works! It needed a few tries with CUDA 9.1, 9.0, 8.0 and their cuDNN libraries because this error occurred:
[errorcuda81 image]
I solved it by creating a new repo instead of updating.

[image]
This was the cfg which only gave me chairs as output.

[image]
This was the cfg with 0 IOU and Recall and no detections.

The difference between them: height/width in the first cfg = 224 and in the second = 608.

@WTeichert (Author)

I just checked again the differences between my cfg and tiny-yolo-voc.
I changed anchors, width and height, deleted comments,
and there are these lines:
steps=-1,100,20000,30000
scales=.1,10,.1,.1
I chose instead:
steps=100,25000,35000
scales=.1,.1,.1

because I did not understand the -1.

@AlexeyAB (Owner) commented Apr 7, 2018

@WTeichert That's a bad mAP result. Check your dataset using Yolo_mark.
And use these lines:

learning_rate=0.0001
max_batches = 45000
policy=steps
steps=100,25000,35000
scales=10,.1,.1

@WTeichert (Author) commented Apr 7, 2018

Ah, found it. But I used the Pascal VOC dataset; do I need to mark bounding boxes?

@AlexeyAB I've done the check. The labels are not correctly assigned: persons are chairs, cats are boats. This should be connected to the voc.names list, am I right?

But the bounding boxes are all right!

And that doesn't explain why detection doesn't work.
Should I train a new network with these lines?
Should I train new network with these lines?
learning_rate=0.0001
max_batches = 45000
policy=steps
steps=100,25000,35000
scales=10,.1,.1

That would be sad... not knowing why it doesn't work and just trying again...

@WTeichert (Author) commented Apr 8, 2018

[image]
These are the results of the yolov2-tiny-voc weights... there should be an error somewhere else.
How does mAP depend on the names list? Could the order of names cause this error?

If I compare the labels folder of the VOC labels with the voc.names file, there are differences:
5 and 8 should be dog and person, while in voc.names those are bus and chair.

But when I validate tiny-voc with some example pictures, it is pretty good.

@AlexeyAB (Owner) commented Apr 8, 2018

  • Show your file obj.data
  • What command do you use for training?
  • What command do you use to calculate mAP?

@WTeichert (Author) commented Apr 8, 2018

@AlexeyAB

[image]

  • darknet.exe detector train data/cfg1.data cfg/cfg1.cfg darknet19_448.conv.23
  • darknet.exe detector map data/cfg1.data cfg/cfg1.cfg cfg1_final.weights

Btw, why do I get completely different IOU and recall with the commands ... detector map ... versus ... detector ... recall?
[image]

@WTeichert (Author)

@AlexeyAB
I tried to train with the changed learning rate and the anchors set back to standard. But the result is again 0 detections; mAP is 0 too.
Did you use the Windows method to train tiny-yolo-voc.cfg? Is your version different from the repo? How many iterations did you train? Which command did you use for training?

@AlexeyAB (Owner)

@WTeichert I have trained many models on both Windows and Linux using this repo. It works fine.

@WTeichert (Author)

@AlexeyAB
I found a mistake in my train.txt. Now I get 56.21% mAP for yolov2-tiny-voc.

I tried to train with 11 classes of VOC, so I shortened the class list in voc_label.py and voc.names. I also set the number of classes to 11 in voc.data and the cfg, and the last filter count to 80. Again I get 0 mAP.

What was your average loss in training? I am always around 0.5, which seems pretty high.

@AlexeyAB (Owner)

@WTeichert About ~0.5

@WTeichert (Author)

Ok, I found the problems. It was some mess with the voc.data and label.txt files.

But I am still wondering why the cfg of tiny-yolo-voc starts steps with -1. If you could explain that to me, I won't ask anything anymore :D

@AlexeyAB (Owner)

A step of -1 means that the 1st scale 0.1 is applied immediately.
It was left in just for some experiments.

This:

learning_rate=0.001
max_batches = 40200
policy=steps
steps=-1,100,20000,30000
scales=.1,10,.1,.1

is the same as a reduced learning_rate with the 1st step/scale removed:

learning_rate=0.0001
max_batches = 40200
policy=steps
steps=100,20000,30000
scales=10,.1,.1

The loop only returns early while net.steps[i] > batch_num; since -1 > 0 is false, the 1st scale is applied immediately:

darknet/src/network.c, lines 94 to 101 in 5e3dcb6:

```
case STEPS:
    rate = net.learning_rate;
    for (i = 0; i < net.num_steps; ++i) {
        if (net.steps[i] > batch_num) return rate;
        rate *= net.scales[i];
        //if(net.steps[i] > batch_num - 1 && net.scales[i] > 1) reset_momentum(net);
    }
    return rate;
```
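
Worked through, both configs produce the same schedule: the effective learning rate is 0.0001 up to iteration 100 (0.001 * 0.1 via the -1 step), 0.001 from 100 to 20000 (* 10), 0.0001 from 20000 to 30000 (* 0.1), and 0.00001 after 30000 (* 0.1).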

@WTeichert (Author)

Thank you so much for your help! My research is done and it all went well.

Manipulating the number of filters per layer and reducing the resolution brings the best performance on a Pi!
Models, classes and greyscale are not that easy to manipulate; it depends on the dataset.
Random should be selected; the performance decrease is minimal.

@AlexeyAB (Owner)

@WTeichert Can you attach your resulting cfg-file?

@WTeichert (Author)

Sorry, I lost the originals in a system reset and have only the converted h5 files...

  • use the VOC dataset
  • set the resolution to a minimum of 224
  • halve the number of filters per layer
  • reduce the number of layers by one
  • set random=1

Those were the results I found.
Performance on the Pi improved from 4 s to 1 s per picture with a pretty good mAP.
The main focus of my work was reducing the processing time per frame.
I hope this information can help.

Here is the h5 file with a changed number of layers, based on COCO:
COCOh5.zip

Here is the h5 file with a changed number of filters per layer, based on COCO:
COCOh5_2.zip

@AlexeyAB added the Solved label Jun 20, 2018