
How to compute mAP of tiny yolo on VOC2007-test #350

Open
szm-R opened this issue Jan 25, 2018 · 18 comments


szm-R commented Jan 25, 2018

Hi everyone,
The title says it all: I want to compute the mAP of tiny YOLO on VOC2007-test. I have written a C++ program for this and get 39.78% mAP, whereas pjreddie reports 57.1% mAP on VOC2007-test.
I first downloaded the weights using:
wget https://pjreddie.com/media/files/tiny-yolo-voc.weights

Then performed detection with:
./darknet -i 0 detector valid cfg/voc.data cfg/tiny-yolo-voc.cfg models/tiny-yolo-voc.weights

I just changed detector.c code to save the results in a different format that was easier for me to read in my code.

I then count all the TPs and FPs (across all classes) and compute precision/recall at 11 thresholds (from 0 to 1), and then the AP (with the formula mentioned in the Pascal VOC paper). Here is the PR curve I get:
[attached PR curve image: prcurve_ap_39.7814]

My real purpose is to write code that computes the AP for a model trained on my own custom data, but in order to verify it I am testing it on the pretrained tiny YOLO.

Thanks in advance for your help.

@AlexeyAB (Owner)

@szm2015 Hi,

  1. Did you try to use https://github.com/AlexeyAB/darknet/blob/master/scripts/voc_eval.py and compare with your results?

  2. Which validation dataset did you use - is it voc/2007_test.txt?

  3. Did you use the following approach in your C code?

  • mAP = AVG(AP for each object class)
  • AP = AVG(Precision for each of 11 Recalls {0, 0.1, ..., 1})
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • TP = number of detections with IoU > 0.5
  • FP = number of detections with IoU <= 0.5, or detections of an object that has already been detected
  • FN = number of objects that are not detected, or are detected only with IoU <= 0.5
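
In code form, these definitions amount to something like the following (a minimal sketch, not the repository's implementation; the per-class recall/precision arrays are assumed to come from evaluating the ranked detections elsewhere):

    import numpy as np

    def eleven_point_ap(recall, precision):
        # AP = average of the interpolated precision at recalls 0, 0.1, ..., 1,
        # where the interpolated precision at recall t is the maximum precision
        # over all points with recall >= t (0 if no such point exists).
        recall, precision = np.asarray(recall), np.asarray(precision)
        ap = 0.0
        for t in np.arange(0.0, 1.1, 0.1):
            mask = recall >= t
            ap += (np.max(precision[mask]) if mask.any() else 0.0) / 11.0
        return ap

    def mean_average_precision(per_class_pr):
        # per_class_pr: {class_name: (recall_array, precision_array)}
        return float(np.mean([eleven_point_ap(r, p) for r, p in per_class_pr.values()]))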


szm-R commented Jan 25, 2018

1. I will look into it and report back, thank you.
2. Yes.
3. The number I reported was the AP taken over all classes together: for every threshold I summed the TPs, FPs, and FNs from all classes (all counted just as you described), then computed the overall precision and recall for that threshold, and then computed the AP (also as you described; for recall points with no detections, such as the recalls above 60% in the plot, I set the precision to 0).

Now I tried what you said about computing AP for every class and then averaging over them to get the mAP, here are the results:

AP of class "aeroplane": 44.0196
AP of class "bicycle": 47.6694
AP of class "bird": 28.4797
AP of class "boat": 20.7133
AP of class "bottle": 10.0164
AP of class "bus": 48.7149
AP of class "car": 48.0142
AP of class "cat": 49.9185
AP of class "chair": 14.9642
AP of class "cow": 38.3175
AP of class "diningtable": 34.7925
AP of class "dog": 41.3822
AP of class "horse": 48.7818
AP of class "motorbike": 37.5573
AP of class "person": 42.4456
AP of class "pottedplant": 15.4256
AP of class "sheep": 38.9875
AP of class "sofa": 19.6689
AP of class "train": 50.1913
AP of class "tvmonitor": 43.5746
mean Average Precision: 36.1818

Now it's even less!

@AlexeyAB (Owner)

@szm2015 Can you show your C-code for mAP?


szm-R commented Jan 26, 2018

Hello again, I have attached the code. It's a Qt project (I use the UI for plotting). Here is an overall explanation:

In lines 57 to 112 there is a for loop over 11 thresholds (from 0 to 1). Inside it (lines 63 to 108) is a loop over the txt prediction files (which I also attached). In this loop, the detections with a score above the threshold are stored in cv::Rect objects (along with their scores and labels). The function "FillEvaluationsMatrix" then evaluates the predictions against the ground truth and fills a confusion matrix that is initialized at the beginning of the threshold loop.

Outside the prediction loop (but still inside the threshold loop), the TP and FP values are computed from the confusion matrix in the "finalEval" function (I count the total number of ground-truth objects in every class and use that as the TP+FN value in the recall denominator). This function computes precision and recall and stores them in a matrix (named PRpairs) with 20 rows (number of classes) and 11 columns (number of thresholds), so that at the end of the loop each class has a PR pair for every threshold.

Finally, the "ComputeAPs" function computes the AP of every class from the PRpairs calculated before and averages them to get the mAP.

Detections.zip

YoloPRcurve.zip

@MiZhangWhuer

@szm2015 How is the issue going now? I also have a problem with the PR curve. I wonder why the recall (x axis) only reaches 60 rather than 100?


szm-R commented Jan 29, 2018

Hello everyone, I haven't had the time to work on this matter for a while. Just now I was checking voc_eval.py and came across these lines:

    if ovmax > ovthresh:
        if not R['difficult'][jmax]:
            if not R['det'][jmax]:
                tp[d] = 1.
                R['det'][jmax] = 1
            else:
                fp[d] = 1.
    else:
        fp[d] = 1.

It seems that detections are only counted as true positives if their ground truth is not "difficult" and also if not R['det'][jmax]. I haven't considered either of these in my code, though I have no idea what the second one means! I would appreciate any clarification!


AlexeyAB commented Jan 29, 2018

  • R['det'][jmax] - this flag means that this ground truth has already been detected, so a re-detection of the same object is a false positive: fp[d] = 1.

  • Yes, ground truths (objects) with the parameter difficult=1 in the PascalVOC annotation .xml file are counted neither as false positives nor as negatives:


This code is taken from the repository of the author of Faster-RCNN detector: https://github.com/rbgirshick/py-faster-rcnn/blob/781a917b378dbfdedb45b6a56189a31982da1b43/lib/datasets/voc_eval.py#L177-L189

            overlaps = inters / uni
            ovmax = np.max(overlaps)
            jmax = np.argmax(overlaps)

        if ovmax > ovthresh:
            if not R['difficult'][jmax]:
                if not R['det'][jmax]:
                    tp[d] = 1.
                    R['det'][jmax] = 1
                else:
                    fp[d] = 1.
        else:
            fp[d] = 1.

Where:

  • overlaps is an array of IoUs (Intersection over Union values)
  • ovmax is the maximum IoU
  • ovthresh is a constant IoU threshold = 0.5
  • R['difficult'][jmax] - flag that this ground truth is difficult
  • R['det'][jmax] - flag that this ground truth has already been detected
  • tp - true-positive flag
  • fp - false-positive flag

So if a ground truth is difficult, it is not taken into account in the precision (neither as a true positive nor as a false positive).
If an object whose ovmax > 0.5 is detected again, then the re-detection is a false positive (fp).
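
(For completeness, the "overlaps = inters / uni" line above is the standard IoU computation; a minimal sketch with boxes as (xmin, ymin, xmax, ymax), not the exact voc_eval.py code:)

    import numpy as np

    def iou(box, gt_boxes):
        # box: (xmin, ymin, xmax, ymax); gt_boxes: array of shape (N, 4)
        ixmin = np.maximum(gt_boxes[:, 0], box[0])
        iymin = np.maximum(gt_boxes[:, 1], box[1])
        ixmax = np.minimum(gt_boxes[:, 2], box[2])
        iymax = np.minimum(gt_boxes[:, 3], box[3])
        iw = np.maximum(ixmax - ixmin + 1.0, 0.0)   # intersection width
        ih = np.maximum(iymax - iymin + 1.0, 0.0)   # intersection height
        inters = iw * ih
        uni = ((box[2] - box[0] + 1.0) * (box[3] - box[1] + 1.0)
               + (gt_boxes[:, 2] - gt_boxes[:, 0] + 1.0)
               * (gt_boxes[:, 3] - gt_boxes[:, 1] + 1.0)
               - inters)
        return inters / uni   # one IoU per ground-truth box ("overlaps")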

10.1.1.157.5766.pdf


szm-R commented Jan 29, 2018

Thank you @AlexeyAB for your complete explanation. I already do something exactly like checking R['det'][jmax] in my code. I added the "difficult" check, but for some reason the AP got even worse; I should look into it more. Meanwhile, can you point me to the exact procedure for evaluating with voc_eval.py? Most importantly, the command line used to get the detection results, since as far as I understand there are several validation functions in detector.c.

@MiZhangWhuer

Hi @szm2015 @AlexeyAB, I also plot the PR curve following the link https://github.com/D-X-Y/caffe-faster-rcnn/blob/dev/examples/FRCNN/calculate_voc_ap.py and print the P-R values before the mAP is computed.
1. I wonder why the precision values do not approach zero. Attached are my predicted bounding box file (PredBBoxes.txt) and the corresponding ground-truth bounding box file (GTBBoxes.txt). Note that none of the bounding boxes are of the "difficult" type.
2. The P-R curve seems correct when I add the following code (see attachment: line 251 in calculate_ap.py.txt) to normalize the recall values:
rec = (rec - rec.min())/(rec.max() - rec.min())

I would much appreciate it if any of you could help me solve the problems above, and I hope for further discussion of the P-R curve issues.

GTBBoxes.txt
PredBBoxes.txt
calculate_ap.py.txt


szm-R commented Feb 12, 2018

Hi @MiZhangWhuer, as I am still struggling with this issue myself, I can't be of much help to you! But hopefully, once I solve it, I will share my results.

Now @AlexeyAB, I still don't know how to run voc_eval.py. My Python knowledge is really rusty! I first created the detection results using the following command:
./darknet -i 0 detector valid cfg/voc.data cfg/tiny-yolo-voc.cfg models/tiny-yolo-voc.weights

(Note that I'm using pjreddie version of darknet)

Then I had my detection results in a folder named voc inside the results directory, with one file per class in the format:
className.txt

Now I added these lines at the end of voc_eval.py code to be able to run it (told you my python is rusty!!!):

print "Starting here"
detpath = "/path/to/results/voc/"
annopath = "/path/to/data/voc/VOC2007/Annotations/"
imagesetfile = "/path/to/data/voc/2007_test_FileNames.txt"
classname = "/path/to/data/voc/voc.names"
cachedir = "/path/to/data/voc/VOC2007/cache/"
ovthresh = 0.7
use_07_metric = True 
voc_eval(detpath, annopath, imagesetfile, classname, cachedir, ovthresh, use_07_metric)

But detpath and the others seem to be something other than simple paths, because running the code gives me the following error:

Traceback (most recent call last):
  File "voc_eval.py", line 211, in <module>
    voc_eval(detpath, annopath, imagesetfile, classname, cachedir, ovthresh, use_07_metric)
  File "voc_eval.py", line 137, in voc_eval
    with open(detfile, 'r') as f:
IOError: [Errno 21] Is a directory: '/home/szm/Work/Research/Models_and_Codes/darknet/darknet_GPU/results/voc/'

Can you please tell me how I should pass these arguments to voc_eval.py?


szm-R commented Feb 13, 2018

Hi everyone,
I figured out my last question; now I run voc_eval.py by adding the following lines at the end:

detpath = '/path/to/darknet/darknet_GPU/results/voc/{}.txt'
annopath = '/path/to/darknet/darknet_GPU/data/voc/VOC2007/Annotations/{}.xml'
imagesetfile = '/path/to/darknet/darknet_GPU/data/voc/2007_test_FileNames.txt'
cachedir = '/path/to/darknet/darknet_GPU/data/voc/VOC2007/cache/'
ovthresh = 0.7
use_07_metric = True 
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
for classname in classes:
    rec, prec, ap = voc_eval(detpath, annopath, imagesetfile, classname, cachedir, ovthresh, use_07_metric)
    print "ClassName: %s AveragePrecision: %f" % (classname, ap)

Now I have a more fundamental question. In this code we just hand the previously generated detection files to the evaluation function, and in it we calculate only one pair of recall and precision for every class (as far as I have understood) and then calculate the AP. Shouldn't there be some kind of loop over different score thresholds (applied to the "confidence") to give us the precision-recall curve to use for the AP calculation?


AlexeyAB commented Feb 13, 2018

@szm2015 In your code use_07_metric = True.
So when use_07_metric is true, the voc_ap function uses the VOC 11-point method - this is the one used for mAP.
You get 3 values from the voc_eval function:

  1. rec
  2. prec
  3. ap

    darknet/scripts/voc_eval.py

    Lines 198 to 200 in 9c84764

    ap = voc_ap(rec, prec, use_07_metric)
    return rec, prec, ap

    The mAP is calculated in the file reval_voc.py:

    aps += [ap]
    print('AP for {} = {:.4f}'.format(cls, ap))
    with open(os.path.join(output_dir, cls + '_pr.pkl'), 'w') as f:
        cPickle.dump({'rec': rec, 'prec': prec, 'ap': ap}, f)
    print('Mean AP = {:.4f}'.format(np.mean(aps)))

Also, the mAP calculation - 11 point method for PascalVOC:

    def voc_ap(rec, prec, use_07_metric=False):
        """ ap = voc_ap(rec, prec, [use_07_metric])
        Compute VOC AP given precision and recall.
        If use_07_metric is true, uses the
        VOC 07 11 point method (default:False).
        """
        if use_07_metric:
            # 11 point metric
            ap = 0.
            for t in np.arange(0., 1.1, 0.1):
                if np.sum(rec >= t) == 0:
                    p = 0
                else:
                    p = np.max(prec[rec >= t])
                ap = ap + p / 11.

@AlexeyAB (Owner)

@szm2015 @MiZhangWhuer I just added a cmd file for Windows to calculate mAP. I got 56.6% for Tiny-Yolo 416x416 on the PascalVOC 2007 test set, which is a little bit less than the 57.1% stated on the site: https://pjreddie.com/darknet/yolo/

If you use Windows and Python >= 3.5:



Mean AP = 0.5666
~~~~~~~~
Results:
0.629
0.725
0.487
0.427
0.212
0.678
0.678
0.709
0.353
0.546
0.581
0.628
0.710
0.697
0.604
0.283
0.559
0.524
0.712
0.590
0.567
~~~~~~~~

--------------------------------------------------------------
Results computed with the **unofficial** Python eval code.
Results should be very close to the official MATLAB eval code.
-- Thanks, The Management
--------------------------------------------------------------


szm-R commented Feb 14, 2018

Thank you @AlexeyAB. After digging a little more into the code I found what I had been missing: the predicted bounding boxes are ranked according to their confidence scores, and recall/precision are computed at every one of these ranks. What I had been doing was to pick a number of thresholds (say 20), compute a PR pair for each of them (by omitting the predictions with confidence scores below the threshold in each pass), and then calculate the AP from these PR pairs. I still don't know why this way of computing AP gives such a drastically wrong result, but for now I will stick to your code. Thanks again.
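
(For anyone comparing the two approaches, here is a minimal sketch of the ranked evaluation described above: sort by confidence, accumulate TP/FP, and get one precision/recall point per rank. Simplified to a single class and not the exact voc_eval.py code:)

    import numpy as np

    def ranked_pr(scores, is_tp, num_gt):
        # scores: confidence of each detection;
        # is_tp: 1 if the detection matched a not-yet-used ground truth with
        #        IoU > 0.5, else 0 (false positive);
        # num_gt: total number of ground-truth objects (TP + FN).
        order = np.argsort(-np.asarray(scores))          # highest confidence first
        hits = np.asarray(is_tp, dtype=float)[order]
        tp = np.cumsum(hits)                             # cumulative TP per rank
        fp = np.cumsum(1.0 - hits)                       # cumulative FP per rank
        rec = tp / float(num_gt)                         # recall after each rank
        prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
        return rec, prec

These rec/prec arrays are what voc_ap() interpolates over, instead of PR pairs taken at a fixed set of score thresholds.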

@AlexeyAB (Owner)

@szm2015 @MiZhangWhuer

I added C code for calculating mAP (mean average precision) using Darknet, for the VOC dataset and for any custom dataset of yours. Just use the command: darknet.exe detector map data/voc.data tiny-yolo-voc.cfg tiny-yolo-voc.weights
where the voc.data file should point to the validation dataset: valid = 2007_test.txt

But my implementation shows a lower value than reval_voc.py + voc_eval.py. If you find an error in my code and can fix it, let me know:

void validate_detector_map(char *datacfg, char *cfgfile, char *weightfile)

I don't check the difficult flag of the ground truths as voc_eval.py does, but as far as I can see, voc_label.py already removes difficult objects at the labeling stage:

    if cls not in classes or int(difficult) == 1:
        continue


class = 0, name = aeroplane,     ap = 61.01 %
class = 1, name = bicycle,       ap = 71.18 %
class = 2, name = bird,          ap = 47.84 %
class = 3, name = boat,          ap = 40.23 %
class = 4, name = bottle,        ap = 20.88 %
class = 5, name = bus,   ap = 67.68 %
class = 6, name = car,   ap = 66.21 %
class = 7, name = cat,   ap = 70.46 %
class = 8, name = chair,         ap = 33.77 %
class = 9, name = cow,   ap = 54.15 %
class = 10, name = diningtable,          ap = 55.45 %
class = 11, name = dog,          ap = 62.47 %
class = 12, name = horse,        ap = 71.24 %
class = 13, name = motorbike,    ap = 68.72 %
class = 14, name = person,       ap = 59.28 %
class = 15, name = pottedplant,          ap = 27.54 %
class = 16, name = sheep,        ap = 54.45 %
class = 17, name = sofa,         ap = 50.07 %
class = 18, name = train,        ap = 70.83 %
class = 19, name = tvmonitor,    ap = 58.63 %

 mean average precision (mAP) = 0.556050, or 55.61 %
Total Detection Time: 56.000000 Seconds

For yolo-voc, the site and the article state 78.6% (page 4, table 3): https://arxiv.org/pdf/1612.08242v1.pdf
The value here is lower because:
* My implementation shows a somewhat lower value than reval_voc.py + voc_eval.py. If you find an error in my code, let me know.
* yolo-voc.weights was trained with the aspect ratio kept (and with some other modifications) in the original repo, so you should test it on the original repo: https://github.com/pjreddie/darknet

class = 0, name = aeroplane,     ap = 80.84 %
class = 1, name = bicycle,       ap = 84.10 %
class = 2, name = bird,          ap = 75.03 %
class = 3, name = boat,          ap = 65.30 %
class = 4, name = bottle,        ap = 55.22 %
class = 5, name = bus,   ap = 83.66 %
class = 6, name = car,   ap = 84.53 %
class = 7, name = cat,   ap = 88.20 %
class = 8, name = chair,         ap = 58.35 %
class = 9, name = cow,   ap = 80.53 %
class = 10, name = diningtable,          ap = 69.81 %
class = 11, name = dog,          ap = 84.07 %
class = 12, name = horse,        ap = 86.17 %
class = 13, name = motorbike,    ap = 83.33 %
class = 14, name = person,       ap = 78.44 %
class = 15, name = pottedplant,          ap = 50.86 %
class = 16, name = sheep,        ap = 77.36 %
class = 17, name = sofa,         ap = 71.74 %
class = 18, name = train,        ap = 82.96 %
class = 19, name = tvmonitor,    ap = 74.95 %

 mean average precision (mAP) = 0.757728, or 75.77 %
Total Detection Time: 214.000000 Seconds


szm-R commented Feb 15, 2018 via email


AlexeyAB commented Feb 15, 2018

@szm2015 Yes, I think it can have an influence.

Maybe I'll add a separate Python script voc_eval_difficult.py that creates txt files for Yolo with the labels (coordinates) of the difficult objects from the XML files of the PascalVOC dataset, and will use these txt files to exclude difficult objects from the TP and FP calculation.


AlexeyAB commented Feb 16, 2018

  1. I added a Python script that gets a list of images and labels with Difficult objects and generates the difficult_2007_test.txt file: https://github.com/AlexeyAB/darknet/blob/master/scripts/voc_label_difficult.py
    This file should be set here (without the #):
    #difficult = data/voc/difficult_2007_test.txt

  2. Then darknet.exe detector map data/voc.data tiny-yolo-voc.cfg tiny-yolo-voc.weights gives 56.21% (while reval_voc.py and voc_eval.py give 56.6%, a difference of 0.39):
class = 0, name = aeroplane,     ap = 61.05 %
class = 1, name = bicycle,       ap = 71.58 %
class = 2, name = bird,          ap = 48.26 %
class = 3, name = boat,          ap = 40.61 %
class = 4, name = bottle,        ap = 20.92 %
class = 5, name = bus,   ap = 68.13 %
class = 6, name = car,   ap = 66.48 %
class = 7, name = cat,   ap = 70.46 %
class = 8, name = chair,         ap = 35.08 %
class = 9, name = cow,   ap = 55.10 %
class = 10, name = diningtable,          ap = 58.06 %
class = 11, name = dog,          ap = 62.59 %
class = 12, name = horse,        ap = 71.42 %
class = 13, name = motorbike,    ap = 69.23 %
class = 14, name = person,       ap = 59.74 %
class = 15, name = pottedplant,          ap = 27.80 %
class = 16, name = sheep,        ap = 55.32 %
class = 17, name = sofa,         ap = 52.50 %
class = 18, name = train,        ap = 70.84 %
class = 19, name = tvmonitor,    ap = 59.13 %

 mean average precision (mAP) = 0.562140, or 56.21 %

  3. darknet.exe detector map data/voc.data yolo-voc.cfg yolo-voc.weights gives 76.94% (while reval_voc.py and voc_eval.py give 77.1%, a difference of 0.16):
class = 0, name = aeroplane,     ap = 80.89 %
class = 1, name = bicycle,       ap = 84.36 %
class = 2, name = bird,          ap = 76.10 %
class = 3, name = boat,          ap = 66.57 %
class = 4, name = bottle,        ap = 55.50 %
class = 5, name = bus,   ap = 84.11 %
class = 6, name = car,   ap = 85.80 %
class = 7, name = cat,   ap = 88.31 %
class = 8, name = chair,         ap = 61.29 %
class = 9, name = cow,   ap = 82.67 %
class = 10, name = diningtable,          ap = 72.38 %
class = 11, name = dog,          ap = 84.46 %
class = 12, name = horse,        ap = 86.54 %
class = 13, name = motorbike,    ap = 83.92 %
class = 14, name = person,       ap = 79.27 %
class = 15, name = pottedplant,          ap = 51.84 %
class = 16, name = sheep,        ap = 78.71 %
class = 17, name = sofa,         ap = 75.63 %
class = 18, name = train,        ap = 83.19 %
class = 19, name = tvmonitor,    ap = 77.15 %

 mean average precision (mAP) = 0.769353, or 76.94 %

So we still lose somewhere between 0.16 and 0.39% of mAP :)
