Wrong display in wallabag (web.archive.org) for golem.de #7414

Open
2 tasks done
GAS85 opened this issue Apr 8, 2024 · 3 comments

Comments

@GAS85

GAS85 commented Apr 8, 2024

Before submitting the issue, please read:
If wallabag can't parse / extract content for a given link, please first read the documentation about it:
http://doc.wallabag.org/en/user/errors_during_fetching.html#how-can-i-help-to-fix-that

We get a lot of requests about fetching config issues. It'll help us A LOT if you try to fix it on your own by following the doc.
If you failed to fix it yourself, tick the following boxes:

  • I've tried myself without success
  • I've replaced HOST in the issue title with the host of the URL that can't be fetched (e.g. nytimes.com, 20minutes.fr, bbc.com, etc.)

Content related:

Describe what's wrong:
I found an archived copy of the article on web.archive.org and would like to save it. Instead, wallabag shows only the web.archive.org banner and no article content:
image

image

f43.me can't parse it either:
image

@GAS85
Author

GAS85 commented Apr 9, 2024

I checked the HTML, and the end of the toolbar is clearly marked:
image
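
If the toolbar really is delimited by comments, a pre-processing step could cut it out before the page reaches the parser. The following is only a minimal sketch: the marker strings `BEGIN WAYBACK TOOLBAR INSERT` and `END WAYBACK TOOLBAR INSERT` are an assumption about what the screenshot shows, `strip_wayback_toolbar` is a hypothetical helper, and nothing here is part of wallabag itself.

```python
import re

# Assumed marker comments; verify them against the actual page source.
TOOLBAR_RE = re.compile(
    r"<!--\s*BEGIN WAYBACK TOOLBAR INSERT\s*-->"
    r".*?"
    r"<!--\s*END WAYBACK TOOLBAR INSERT\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_wayback_toolbar(html: str) -> str:
    """Remove everything between the begin and end toolbar comments."""
    return TOOLBAR_RE.sub("", html)


# Toy input, not real archive markup.
sample = (
    "<body><!-- BEGIN WAYBACK TOOLBAR INSERT -->"
    "<div id='toolbar'>banner</div>"
    "<!-- END WAYBACK TOOLBAR INSERT -->"
    "<article>Article text</article></body>"
)
cleaned = strip_wayback_toolbar(sample)
print(cleaned)
```

Pages without the markers pass through unchanged, so the step would be harmless for non-archived URLs.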

@HolgerAusB

This is a difficult one. I haven't found the time to take a deeper look yet. web.archive.org archives many websites, and my guess is that it keeps most of the original site's HTML. So we can't provide one site-specific config that fits all archived websites like golem, Spiegel, etc. Maybe I can find a way to strip that header and get the main content in a more or less nice view for some websites. No promises!

The <div class="golemContentoHide"> is obviously from Golem and not from web.archive.org, and we can't use the green <!--comments--> to select the content. It has to be real HTML elements.

The use of JavaScript by web.archive.org could also be tricky. I don't know yet whether I can look into it next week or at the end of the month.

But of course, it would be very nice to be able to fetch archived pages in general.

@HolgerAusB

Sorry, I didn't find a way to extract the original site's content. For golem it helps to set body: //article[1], but that breaks fetching for faz.net, which works quite well without a config.
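
To illustrate what a rule like body: //article[1] picks out, here is a small sketch using Python's standard library on a made-up, well-formed page. The markup and the `archive-banner` id are assumptions for the demo, and this is not how wallabag evaluates site configs internally.

```python
import xml.etree.ElementTree as ET

# Toy page: an archive banner followed by two sibling <article> elements.
PAGE = """<html><body>
<div id="archive-banner">archive toolbar</div>
<article><h1>Headline</h1><p>First paragraph.</p></article>
<article><p>Teaser for another story.</p></article>
</body></html>"""

root = ET.fromstring(PAGE)

# ElementTree supports only a subset of XPath. find() returns the first
# match in document order, which equals //article[1] here because both
# <article> elements share the same parent.
first_article = root.find(".//article")

# Join the text nodes of the selected element, skipping blank ones.
body_text = " ".join(t.strip() for t in first_article.itertext() if t.strip())
print(body_text)
```

The banner is skipped because it sits outside the selected element. The same rule applied to a site whose first <article> is not the main story would pick the wrong content, which fits the faz.net regression described above.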

So that is more a feature request to the devs. Maybe with a new keyword for site_config:
try_webarchive: //div[@class='old-topic']
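
For such a keyword to apply to the right site, the fetcher would first have to recover the original host from the archive URL. A minimal sketch of that step, assuming the usual `https://web.archive.org/web/<timestamp>/<original-url>` layout; `original_host` is a hypothetical helper and the example URL is made up.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Timestamp digits, optionally followed by a modifier such as "id_" or "if_".
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d+)[a-z_]*/(.+)$")


def original_host(url: str) -> Optional[str]:
    """Return the archived site's host, or None for non-archive URLs."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    original = match.group(2)
    # Some archive links omit the scheme of the original URL.
    if "://" not in original:
        original = "http://" + original
    return urlparse(original).hostname


# Hypothetical archive URL, for illustration only.
example = "https://web.archive.org/web/20240408000000/https://www.golem.de/news/example.html"
host = original_host(example)
print(host)
```

With the host known, the original site's config could be looked up and its archive-specific rule tried first.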
