Regex per DR to extract unique ID for the image to prevent re-downloading it #164

Open
sadeghim opened this issue Nov 10, 2021 · 0 comments

Problem:
Whenever a major image provider (such as iNat) changes a common part of its image URLs, such as the protocol, domain or path, image-service starts downloading all of the affected images again. It then compares the hash of each downloaded image with the image DB to determine whether it is a duplicate. This can take image-service a significant amount of time and block other loads.

Suggestion:
Implement a group of regular expressions for each data resource (DR) that extract the image ID from a URL prior to download and match it against the existing image URLs for that data resource in the database. Matched URLs are added as alternative URLs without downloading the images.
Non-matched URLs go through as usual and need to be downloaded.
This eliminates the need to download all the images again in most cases.
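
A minimal sketch of the idea in Python, for illustration only. The DR key `dr-example`, the `/photos/<id>/` URL shape, and the names `ID_PATTERNS`, `extract_image_id` and `partition_urls` are all hypothetical; they are not taken from image-service, and a real implementation would read the existing URLs from the image DB rather than from a list.

```python
import re
from typing import Dict, Iterable, List, Optional, Pattern, Tuple

# Hypothetical per-DR pattern groups. Each pattern must expose a named
# group "id" that captures the provider's stable image identifier.
ID_PATTERNS: Dict[str, List[Pattern[str]]] = {
    "dr-example": [re.compile(r"/photos/(?P<id>\d+)/")],
}


def extract_image_id(dr: str, url: str) -> Optional[str]:
    """Return the image ID found by the first matching pattern for this DR."""
    for pattern in ID_PATTERNS.get(dr, []):
        match = pattern.search(url)
        if match:
            return match.group("id")
    return None


def partition_urls(
    dr: str, incoming: Iterable[str], existing: Iterable[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split incoming URLs into (new URL, known URL) pairs and URLs to download."""
    # Index the URLs already stored for this DR by their extracted ID.
    known: Dict[str, str] = {}
    for url in existing:
        image_id = extract_image_id(dr, url)
        if image_id is not None:
            known.setdefault(image_id, url)

    alternatives: List[Tuple[str, str]] = []
    to_download: List[str] = []
    for url in incoming:
        image_id = extract_image_id(dr, url)
        if image_id is not None and image_id in known:
            if url != known[image_id]:
                # Same image under a changed URL: record it, skip the download.
                alternatives.append((url, known[image_id]))
        else:
            # No ID extracted or ID not seen before: fall back to downloading.
            to_download.append(url)
    return alternatives, to_download


existing = ["http://static.example.org/photos/123/original.jpg"]
incoming = [
    "https://cdn.example.org/photos/123/original.jpg",
    "https://cdn.example.org/photos/456/original.jpg",
]
alternatives, to_download = partition_urls("dr-example", incoming, existing)
```

Here the first incoming URL would be recorded as an alternative URL for the stored image 123, and only the URL for image 456 would be queued for download. URLs that match no pattern fall through to the existing download-and-hash path, so a missing or wrong pattern costs time but does not lose images.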
