Wrong display in wallabag (web.archive.org) for golem.de #7414

GAS85 · 2024-04-08T14:00:44Z

Before submitting the issue, please read:
If wallabag can't parse / extract content for a given link, please first read the documentation about it:
http://doc.wallabag.org/en/user/errors_during_fetching.html#how-can-i-help-to-fix-that

We have a lot of requests about fetching config issue. It'll help us A LOT if you give a try to fix it on your own following the doc.
If you failed to fix it yourself, tick the following boxes:

I've tried myself without success
I've replaced HOST in the issue title with the host of the URL that can't be fetched (ie: nytimes.com, 20minutes.fr, bbc.com, etc.)

Content related:

URL: [full url of the content]
wallabag version: 2.6.9

Describe what's wrong:
I found a URL for the article on web.archive.org and would like to save it. Instead I only got the web.archive.org banner, with no article content:

f43.me can't parse it either:

GAS85 · 2024-04-09T22:54:53Z

I checked the HTML, and the end of the toolbar is clearly marked:
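
A minimal sketch of how such a marker could be used to cut the toolbar out before parsing. It assumes the markers are the HTML comments `<!-- BEGIN WAYBACK TOOLBAR INSERT -->` and `<!-- END WAYBACK TOOLBAR INSERT -->`; that wording is not confirmed in this thread and should be checked against the actual page source. The function name is hypothetical, and this is not how wallabag itself processes pages.

```python
# Assumed marker comments; verify them against the real page source.
BEGIN_MARK = "<!-- BEGIN WAYBACK TOOLBAR INSERT -->"
END_MARK = "<!-- END WAYBACK TOOLBAR INSERT -->"


def strip_wayback_toolbar(html: str) -> str:
    """Remove everything between the two markers, markers included."""
    start = html.find(BEGIN_MARK)
    end = html.find(END_MARK)
    if start == -1 or end == -1 or end < start:
        # Markers missing or out of order: return the page unchanged.
        return html
    return html[:start] + html[end + len(END_MARK):]
```

This only removes the banner. It does not help with locating the article body, which still depends on the original site's markup.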

HolgerAusB · 2024-04-12T19:49:15Z

This is a difficult one. I haven't found the time to take a deeper look yet. web.archive.org archives many websites, and my guess is that they keep most of the original site's HTML. So we can't provide a site-specific config that fits all archived websites such as Golem, Spiegel, etc. Maybe I can find a way to strip that header and get the main content in a more or less nice view for some websites. No promises!

The <div class="golemContentoHide"> is obviously from Golem and not from web.archive.org, and for the green  we can't trigger the content. It must be real HTML entities.

The use of JavaScript by web.archive.org could also be tricky. So I don't know whether I can look into it next week or at the end of the month.

But of course it would be very nice to be able to fetch archived pages in general.

HolgerAusB · 2024-04-17T13:24:40Z

Sorry, I didn't find a way to extract the content of the original site. For Golem it helps to set body: //article[1], but that breaks fetching for faz.net, which works quite well without a config.

So this is more of a feature request for the devs. Maybe with a new keyword for site_config:
try_webarchive: //div[@class='old-topic']
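
One way such a feature might work, purely as a sketch: recover the original host from the archive URL and apply that host's rule instead of a single rule for web.archive.org. The code assumes archive URLs have the shape `https://web.archive.org/web/<timestamp>/<original url>`. The function names, the `BODY_XPATH` table, and the example URL are hypothetical and do not reflect how wallabag or its site configs actually work.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape: /web/<digits><optional flags such as "if_">/<original url>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{1,14})[a-z_]*/(.+)$")

# Hypothetical per-host rules; the golem.de value is the one tried in this thread.
BODY_XPATH = {"golem.de": "//article[1]"}


def original_host(archive_url):
    """Return the archived page's host without 'www.', or None if no match."""
    match = WAYBACK_RE.match(archive_url)
    if match is None:
        return None
    target = match.group(2)
    if "://" not in target:
        target = "http://" + target  # some archive URLs omit the scheme
    host = urlparse(target).hostname
    if host and host.startswith("www."):
        host = host[4:]
    return host


def body_rule_for(archive_url):
    """Look up the body XPath for the original site; None means 'use defaults'."""
    return BODY_XPATH.get(original_host(archive_url))
```

With this lookup, an archived golem.de page would get `//article[1]`, while an archived faz.net page would get `None` and fall back to the default extraction that already works for it.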

GAS85 added the Site Config label Apr 8, 2024
