Regex per DR to extract unique ID for the image to prevent re-downloading it #164

sadeghim · 2021-11-10T00:24:02Z

Problem:
Whenever a major image provider (like iNat) changes a common part of its image URLs, such as the protocol, domain or path, image-service starts downloading all of the images again. It then compares the hash of each downloaded image with the image DB to decide whether it is a duplicate. This can take image-service a significant amount of time and block other loads.

Suggestion:
Implement a group of regular expressions for each DR that extract the image ID from a URL before download and match it against the existing image URLs for that data resource in the database. Matched URLs are added as alternative URLs without downloading the images.
Unmatched URLs go through as usual and need to be downloaded.
This removes the need to download all the images in most cases.
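
A rough sketch of the idea in Python, only to illustrate the matching step. The DR key `dr_inat_example`, the `/photos/<id>/` URL shape, and the function names are all assumptions made for this example; they are not taken from image-service or from any provider's real URL scheme.

```python
import re
from typing import Dict, List, Optional, Pattern, Tuple

# Hypothetical per-DR patterns. Each pattern must expose a named group "id".
# The DR key and the URL shape below are assumptions for illustration only.
DR_PATTERNS: Dict[str, List[Pattern[str]]] = {
    "dr_inat_example": [
        re.compile(r"/photos/(?P<id>\d+)/"),
    ],
}


def extract_image_id(dr: str, url: str) -> Optional[str]:
    """Return the image ID from the first matching pattern for this DR, else None."""
    for pattern in DR_PATTERNS.get(dr, []):
        match = pattern.search(url)
        if match:
            return match.group("id")
    return None


def partition_urls(
    dr: str, incoming_urls: List[str], existing_urls: List[str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split incoming URLs into (existing_url, alternative_url) pairs and URLs to download."""
    # Index the URLs already stored for this DR by their extracted ID.
    known: Dict[str, str] = {}
    for url in existing_urls:
        image_id = extract_image_id(dr, url)
        if image_id is not None:
            known.setdefault(image_id, url)

    alternatives: List[Tuple[str, str]] = []
    to_download: List[str] = []
    for url in incoming_urls:
        image_id = extract_image_id(dr, url)
        if image_id is not None and image_id in known:
            # Same image under a changed URL: record it as an alternative, no download.
            if known[image_id] != url:
                alternatives.append((known[image_id], url))
        else:
            # No pattern matched or the ID is unknown: fall back to the normal download path.
            to_download.append(url)
    return alternatives, to_download


existing = ["http://static.example.org/photos/123/original.jpg"]
incoming = [
    "https://cdn.example.org/photos/123/original.jpg",
    "https://cdn.example.org/photos/456/original.jpg",
]
alternatives, to_download = partition_urls("dr_inat_example", incoming, existing)
```

In this sketch, URLs that no pattern matches are still downloaded and would go through the existing hash comparison, so a DR without a configured pattern behaves as it does today.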

sadeghim added the enhancement label Nov 10, 2021

This was referenced Nov 10, 2021

Update original filename image url for Questagame AtlasOfLivingAustralia/data-management#770

Open

iNaturalist data load - image url fix AtlasOfLivingAustralia/data-management#765

Open
