Loss of precision with v1.3 #247

Closed
Xavier31 opened this issue Jan 5, 2021 · 8 comments

Comments

@Xavier31

Xavier31 commented Jan 5, 2021

Hi,
It seems that the results of the latest update (version 1.3.1) are noticeably worse than the previous one (version 1.2).
I ran some benchmarks, and although the difference does not seem like much in terms of NME, it is there, and it is even more noticeable when you look at the images.

[image: 2D_benchmark_comparisons_0.02]

(look at the eye and temple landmarks)

[image: 00023_pred] version 1.2

[image: 00023_pred_1.3] version 1.3

I looked at a bunch of results on the FFHQ dataset and noticed consistently worse precision.

I could not track down what causes this difference; my current suspicion is the new batch inference code, but I could not pinpoint it yet.

@1adrianb
Owner

1adrianb commented Jan 5, 2021

This is strange indeed. What function are you using to get the landmarks? Could you please attach the input image as seen by the detector, without the points on top?

@Xavier31
Author

Xavier31 commented Jan 5, 2021

I am using get_landmarks_from_image() with the face bounding box detected by SFD passed as an argument.
Here is the original image:

[image: 00023]
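For reference, a minimal sketch of how the call is made (the constructor settings and the bounding-box coordinates below are illustrative, not the exact ones used for the benchmark):

```python
import face_alignment
from skimage import io

# Sketch only: detector/device settings and the box coordinates are assumptions.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D,
                                  face_detector='sfd', device='cuda')

image = io.imread('00023.png')

# A precomputed SFD detection is passed as [x_min, y_min, x_max, y_max]
# instead of letting the library run face detection internally.
sfd_box = [256.0, 210.0, 620.0, 640.0]  # hypothetical coordinates
preds = fa.get_landmarks_from_image(image, detected_faces=[sfd_box])
print(preds[0].shape)  # (68, 2): 2D coordinates of the 68 landmarks
```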

@1adrianb
Owner

1adrianb commented Jan 5, 2021

Thanks for attaching it. The difference is caused by how the image normalization is performed. The correct one for SFD should be BGR + subtracting the mean. Prior to 1.3.0 there were some inconsistencies on this matter (during batch detection this step was wrongly not performed).
Now, it looks like the fix made things slightly worse in certain cases, which implies that the scale used is slightly suboptimal.
If on average you are seeing better performance with the previous behaviour, I can look into reverting this change, though that would not be the proper fix. Have you tried this on other datasets too? This particular dataset has a strong bias towards large frontal poses that cover most of the image and may not be representative of more "in-the-wild" images.
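For clarity, that normalization looks roughly like the sketch below (the per-channel mean values are the commonly used VGG-style ones and are an assumption here, not taken from the repository):

```python
import numpy as np

def preprocess_for_sfd(image_rgb: np.ndarray) -> np.ndarray:
    """Sketch of SFD-style preprocessing: RGB -> BGR, then subtract a
    per-channel mean. The mean values are assumed (VGG-style), not
    necessarily the exact ones used by this library."""
    image_bgr = image_rgb[..., ::-1].astype(np.float32)  # swap channel order
    mean_bgr = np.array([104.0, 117.0, 123.0], dtype=np.float32)
    return image_bgr - mean_bgr
```

This is the step that, per the comment above, was skipped in the batch path prior to 1.3.0.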

@Xavier31
Author

Xavier31 commented Jan 6, 2021

Thanks for taking the time to look into this.
I did try on a private dataset, but it also consists of large frontal poses like selfies, so I guess it is also biased. Results there are also worse with v1.3 (graph at the top of my first post). On the other hand, the results using the BlazeFace detector are fine. I have not tried on a more in-the-wild dataset such as 300W.
The question is: how tuned is the actual implementation to the face detector? I can see several hard-coded variables when computing the scale (the reference_scale, the 200.0 in the transform function...). Shouldn't the reference_scale be different for different detectors? Perhaps you tuned those values with the wrong normalization for SFD?
A more general question: how would I go about using another face detector, especially one with non-square boxes?

@Xavier31
Author

Xavier31 commented Jan 6, 2021

Bonus question: for selfie images, cropping the face to a square introduces a lot of deformation. Would you recommend retraining (or fine-tuning) the network on these images?

@1adrianb
Owner

1adrianb commented Jan 6, 2021

@Xavier31 At the time these models were trained, the noise was generated synthetically during training; neither SFD nor BlazeFace had been released yet.
There is not much tuning performed: the values for the scales and shifts were selected by testing 2-3 values on a small subset of images. Ideally, yes, these values should be slightly different for each detector, assuming of course that they define or predict bounding boxes of different sizes (i.e. some may consider the full face as the face, others just the region enclosing the eyes and the mouth).

The bounding boxes don't have to be square. The way the cropping works is by taking the bounding box and computing a center point and a scale from it. Based on these, a center crop and re-scaling are performed. There is no distortion introduced; the aspect ratio is preserved by the cropping function.

The detectors provided already predict rectangles, so they are not squares. You can use pretty much any detector you like; there is no particular preference for one or the other as long as it performs well.
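As a rough sketch of that bounding-box-to-crop logic (the constants below, such as the reference scale and the vertical shift, are illustrative assumptions rather than the library's exact values):

```python
import numpy as np

def bbox_to_center_scale(box, reference_scale=195.0, vertical_shift=0.12):
    """Turn a detector box [x1, y1, x2, y2] into a center point and a single
    scalar scale. The default reference_scale and vertical_shift are
    illustrative; each detector could use its own values."""
    x1, y1, x2, y2 = box
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    # Nudge the center upwards (smaller y in image coordinates) so the crop
    # sits better on the face than the raw detector box does.
    center[1] -= (y2 - y1) * vertical_shift
    # A single scalar scale means the crop window is square around the center,
    # so the image content is never stretched anisotropically.
    scale = (x2 - x1 + y2 - y1) / reference_scale
    return center, scale
```

Because the crop is defined only by a center and one scale, the aspect ratio of the face is preserved; a non-square detector box simply changes where the square crop lands and how large it is.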

@Xavier31
Author

Xavier31 commented Jan 6, 2021

Ah, thanks for the explanation, I did not realize the aspect ratio was preserved. But that means that in some cases the cropped image will include a lot of background, and the heatmaps that are far from the center of the image are much noisier, right?

@1adrianb
Owner

1adrianb commented Jan 6, 2021

What happens is that sometimes the face can indeed end up too small (i.e. the crop includes a lot of background) or partially cropped out (let's say the chin may get cut off by mistake if the scale is off).
The networks can be retrained with more rectangular shapes if that's desired. For example, for human pose, since more often than not humans can be contained in a tall rectangle, the shape of the heatmaps is indeed rectangular.

1adrianb closed this as completed Aug 4, 2021