ECAU: Generic provider #305

Open
ROpdebee opened this issue Dec 17, 2021 · 2 comments
Comments

@ROpdebee
Owner

Could be useful for random "Purchase for mail order" URLs. Have it fetch the image found in the og:image meta property, if it exists.
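
A minimal sketch of the og:image lookup, assuming the page HTML has already been downloaded as a string. The names `OgImageParser` and `find_og_image` are made up for illustration and are not part of ECAU; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser
from typing import Optional


class OgImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.og_image: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:image is kept; later ones are ignored.
        if tag != "meta" or self.og_image is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image" and attr_map.get("content"):
            self.og_image = attr_map["content"]


def find_og_image(html: str) -> Optional[str]:
    """Return the og:image URL if the page declares one, else None."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Hypothetical page used only for this example.
page = '<html><head><meta property="og:image" content="https://example.com/cover.jpg"></head></html>'
print(find_og_image(page))
```

A page without the property simply yields `None`, which matches the "if it exists" condition above.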

@ROpdebee
Owner Author

ROpdebee commented Jan 24, 2022

On second thought, I'm not sure whether this is a good idea. It'd rely on heuristics, which can be finicky, and the image displayed on "Purchase for mail order" pages might not accurately represent a physical release. Also, a generic provider might appear to work for sites which really need a special-purpose provider because they offer multiple images (e.g. DatPiff would work with the og:image property, but also offers a back cover, which a generic provider would miss).

@ROpdebee
Owner Author

ROpdebee commented Apr 23, 2023

Another possibility which seems to work OK on eBay:

  1. Find the "semantic center" of the page (e.g., the <h1> element)
  2. Find all <img> elements and rank them according to some tree distance metric to the semantic center.
  3. Find the highest ranked image whose dimensions are above some arbitrary limit (e.g. 250x250 or 500x500). The images probably need to be maximised beforehand.
  4. (Possibly: Expand the search while the distance doesn't increase too much, with a threshold based on the average distance between images and the semantic center, like 50% of the average, or some more statistically sound threshold taking standard deviation and quartiles into account).
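
Steps 1 to 3 can be sketched roughly as follows. This is only an illustration under several assumptions: the page is modelled with a hypothetical `Node` class instead of a real DOM, image dimensions are assumed to be known already (i.e. after maximising), and the distance is a plain edge count through the common ancestor, used as a stand-in for the metrics discussed in this issue. Step 4 is not covered.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a DOM element."""
    tag: str
    width: int = 0
    height: int = 0
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node: Node) -> List[Node]:
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    current: Optional[Node] = node
    while current is not None:
        chain.append(current)
        current = current.parent
    return chain


def tree_distance(a: Node, b: Node) -> int:
    """Edges on the path from a up to the common ancestor and down to b."""
    steps_from_a = {id(n): i for i, n in enumerate(ancestors(a))}
    for steps_from_b, n in enumerate(ancestors(b)):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def walk(node: Node) -> Iterator[Node]:
    """Yield nodes in document (pre-order) order."""
    yield node
    for child in node.children:
        yield from walk(child)


def pick_image(root: Node, min_size: int = 250) -> Optional[Node]:
    nodes = list(walk(root))
    # Step 1: use the first <h1> as the semantic center.
    center = next((n for n in nodes if n.tag == "h1"), None)
    if center is None:
        return None
    # Step 2: rank all images by distance to the center (closest first).
    ranked = sorted((n for n in nodes if n.tag == "img"),
                    key=lambda img: tree_distance(img, center))
    # Step 3: take the closest image that is large enough.
    for img in ranked:
        if img.width >= min_size and img.height >= min_size:
            return img
    return None


# Made-up page: a big logo in the header, a thumbnail and a cover near the h1.
root = Node("body")
header = root.add(Node("header"))
logo = header.add(Node("img", 1200, 400))
main = root.add(Node("main"))
title = main.add(Node("h1"))
gallery = main.add(Node("div"))
thumb = gallery.add(Node("img", 64, 64))
cover = gallery.add(Node("img", 500, 500))

largest = max((n for n in walk(root) if n.tag == "img"),
              key=lambda n: n.width * n.height)
print(pick_image(root) is cover, largest is logo)
```

In this made-up tree the thumbnail and the cover are equally close to the `<h1>`, but only the cover passes the size limit, while the header logo is the largest image on the page and sits farthest away.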

For the tree distance metric, there are a few options that I've tested and all of them seem like they'd work:

  • Maximise depth of the deepest common ancestor.
  • Minimise (depth between the image and the common ancestor) + (depth between the h1 and the common ancestor) - (depth of the ancestor).
  • Minimise steps between image and common ancestor + steps between h1 and common ancestor - depth of ancestor, with steps calculated as the depth between element and ancestor and the number of predecessor nodes at each level.
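
The first two options can be sketched as below, under the assumption that each element is identified by its path of child indices from the root, so that the deepest common ancestor is the longest shared prefix of two paths. The names `common_ancestor_depth` and `depth_score` are illustrative only. The third option is not shown, since it additionally needs the sibling counts at each level.

```python
from typing import Tuple

# Child indices from the root down to the element, e.g. (1, 0) is the
# first child of the root's second child. The root itself is ().
Path = Tuple[int, ...]


def common_ancestor_depth(a: Path, b: Path) -> int:
    """Depth of the deepest common ancestor (option 1: maximise this)."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def depth_score(img: Path, h1: Path) -> int:
    """Option 2: depth from image to ancestor, plus depth from h1 to
    ancestor, minus the depth of the ancestor (minimise this)."""
    d = common_ancestor_depth(img, h1)
    return (len(img) - d) + (len(h1) - d) - d


# Made-up layout: body > [header > [img logo], main > [h1, div > [thumb, cover]]]
h1 = (1, 0)
logo = (0, 0)
cover = (1, 1, 1)

print(common_ancestor_depth(logo, h1), common_ancestor_depth(cover, h1))
print(depth_score(logo, h1), depth_score(cover, h1))
```

With this layout both options prefer the cover over the logo: it shares a deeper ancestor with the `<h1>` and has the lower score.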

Another possibility might be to minimise the total number of nodes "between" the image and the h1 (including the size of any subtree that's between them).
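
One possible reading of this, sketched with Python's standard `html.parser`: count the elements whose start tag appears between the image and the first `<h1>` in document order, which naturally includes any subtree sitting between them. This interpretation is an assumption, text nodes are ignored, and `TagOrderParser` and `elements_between` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class TagOrderParser(HTMLParser):
    """Records every start tag, with its attributes, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[Tuple[str, Dict[str, str]]] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def elements_between(html: str) -> Dict[str, int]:
    """Map each image src to the number of elements opened between that
    image and the first <h1>. Images sharing a src overwrite each other,
    which is acceptable for a sketch."""
    parser = TagOrderParser()
    parser.feed(html)
    h1_index = next(
        (i for i, (tag, _) in enumerate(parser.tags) if tag == "h1"), None
    )
    if h1_index is None:
        return {}
    return {
        attrs.get("src", ""): abs(i - h1_index) - 1
        for i, (tag, attrs) in enumerate(parser.tags)
        if tag == "img"
    }


# Made-up page used only for this example.
PAGE = """
<body>
  <header><a href="/"><img src="logo.png"></a><nav><a>Home</a><a>Shop</a></nav></header>
  <main><h1>Album</h1><div><img src="cover.jpg"></div></main>
</body>
"""

scores = elements_between(PAGE)
print(min(scores, key=scores.get))
```

Here the navigation subtree separates the logo from the `<h1>`, so the cover ends up with the smaller count.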

At least for eBay, which I've been experimenting on, simply taking the largest image won't work, since that'll select the logo.
