data leakage issue of RSICD and RSITMD #26

Open
YiguoHe opened this issue Apr 2, 2024 · 2 comments

YiguoHe commented Apr 2, 2024

The RSITMD and RSICD datasets have a data leakage issue: they may share some images and descriptions. How should this be handled properly?

YiguoHe changed the title from "Question about converting the model weight file format" to "data leakage issue of RSICD and RSITMD" on Apr 11, 2024

gzqy1026 (Collaborator) commented

To check whether the two datasets contain duplicates, you can compute a hash value for each image and measure the distance between the hashes of images from the two datasets. If the distance is less than a chosen threshold, the pair is treated as a duplicate. It is recommended to manually check the images that the code flags as duplicates, to avoid filtering out images that are not actually duplicates.
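
For anyone landing here later, the idea above can be sketched roughly as follows. This is only an illustration, not the maintainers' code: it assumes an average hash (one common perceptual hash) with Hamming distance, and the names `average_hash`, `hamming`, `find_cross_dataset_duplicates`, the threshold of 5, and the file names are all hypothetical. Synthetic pixel grids stand in for real images so the snippet runs with the standard library alone; with real files you would first load each image as a grayscale pixel grid.

```python
from itertools import product


def average_hash(pixels, hash_size=8):
    """Return a hash_size*hash_size-bit average hash of a grayscale image.

    `pixels` is a 2D list of intensities with at least hash_size rows
    and hash_size columns (an assumption of this sketch).
    """
    h, w = len(pixels), len(pixels[0])
    cells = []
    # Downsample to hash_size x hash_size by averaging rectangular blocks.
    for i in range(hash_size):
        r0, r1 = i * h // hash_size, (i + 1) * h // hash_size
        for j in range(hash_size):
            c0, c1 = j * w // hash_size, (j + 1) * w // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def find_cross_dataset_duplicates(hashes_a, hashes_b, threshold=5):
    """Return (name_a, name_b, distance) for pairs with distance < threshold.

    The result is a list of candidates for manual review, not a final verdict.
    """
    pairs = []
    for (name_a, ha), (name_b, hb) in product(hashes_a.items(), hashes_b.items()):
        d = hamming(ha, hb)
        if d < threshold:
            pairs.append((name_a, name_b, d))
    return sorted(pairs, key=lambda p: p[2])


# Synthetic 16x16 grayscale "images" stand in for real dataset files.
gradient = [[r * 16 + c for c in range(16)] for r in range(16)]
brighter = [[min(255, v + 2) for v in row] for row in gradient]  # near-duplicate
inverted = [[255 - v for v in row] for row in gradient]          # different image

hashes_a = {"rsicd_example.jpg": average_hash(gradient)}
hashes_b = {
    "rsitmd_near_copy.jpg": average_hash(brighter),
    "rsitmd_other.jpg": average_hash(inverted),
}

candidates = find_cross_dataset_duplicates(hashes_a, hashes_b, threshold=5)
for name_a, name_b, d in candidates:
    print(f"possible duplicate: {name_a} <-> {name_b} (distance {d})")
```

The all-pairs comparison is quadratic in the number of images, which is probably acceptable for datasets of this size but is a design shortcut. The threshold is a tuning knob: a lower value flags fewer pairs, a higher value flags more and makes the manual check more important.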


YiguoHe commented May 22, 2024

Thank you for your response. Your work is excellent. Best wishes!
