Switch to loading URLs in a headless browser to scrape after JavaScript fully loaded #1740
Comments
Hi, Steven!
It will heavily increase server requirements, especially for public instances. For example, if RSS-Bridge is running on shared hosting, it probably won't work. AFAIK no shared hosting provider offers a headless browser.
In both cases, if a bridge breaks, someone has to write a patch for it.
I could not understand what "accept a different URL so a third-party service" means. |
It would change the requirements for sure. Isn't it possible to run a headless browser on a VPS? Maybe something like https://github.com/jsdom/jsdom ?
I often use a proxy scraping service which renders a page and returns the rendered HTML. For example:
It would be great to be able to use such a URL in a bridge. However, most of the time the URLs are hardcoded into the bridge code. |
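To make the idea concrete, here is a minimal sketch of how a bridge's hardcoded URL could be routed through a rendering proxy when one is configured. It is written in Python for brevity rather than in RSS-Bridge's PHP, and everything in it is an assumption for illustration: the helper name `via_render_proxy`, the endpoint `https://render.example/render`, and the `url` query parameter are hypothetical and do not correspond to any particular service.

```python
from urllib.parse import urlencode


def via_render_proxy(target_url, proxy_endpoint=None, extra_params=None):
    """Return the URL a bridge should actually request.

    With no proxy configured, the target URL is returned unchanged, so
    existing bridges keep working. With a proxy, the target URL is passed
    as a percent-encoded query parameter (hypothetical parameter name: url).
    """
    if not proxy_endpoint:
        return target_url
    params = {"url": target_url}
    params.update(extra_params or {})
    return proxy_endpoint + "?" + urlencode(params)


# Hypothetical usage: the endpoint below is a placeholder, not a real service.
direct = via_render_proxy("https://example.com/feed?page=1")
proxied = via_render_proxy(
    "https://example.com/feed?page=1",
    proxy_endpoint="https://render.example/render",
)
```

Keeping the proxy optional means the hardcoded URLs can stay in the bridge code; only the final request URL changes when an instance operator opts in.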
Interesting point as to what the typical user runs rss-bridge on. Requiring the use of a binary for headless browsing could make rss-bridge unsuitable for many existing users on shared hosting.
Is there such a free service we can use for headless browsing? Assuming there is, it may not be free indefinitely. All that said, I agree in principle that moving to headless browsing has advantages for simplifying the scraping process. |
Might be viable: https://github.com/chrome-php/chrome |
Can the headless browser be an option or an alternative variant of RSS-Bridge? Feedless can run a headless Chromium in the background for generating feeds (admittedly I wasn't able to install it in Docker). |
Maybe. We want to do this but lack the skills to do it. |
I would not recommend switching to a headless browser because it is a resource hog. Instead, add a new class for bridges that depend on JavaScript-only websites. The default should remain the existing classes. I stumbled on the same problem, and my research so far points to the Symfony BrowserKit component or Panther. Other solutions are either deprecated or too complicated / heavy on resources. If I have time, I will start some experiments and maybe find a working solution. |
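The "separate class" proposal can be sketched roughly as follows. This is an illustration in Python, not RSS-Bridge's actual PHP class hierarchy: the names `Bridge`, `RenderedBridge`, `http_fetch`, and `render_fetch` are all hypothetical, and the fetchers are injected as plain callables so that no particular headless-browser library is assumed.

```python
from typing import Callable

# A fetcher takes a URL and returns the page's HTML as a string.
Fetcher = Callable[[str], str]


class Bridge:
    """Default behaviour: a plain HTTP fetch, with no browser involved."""

    uri = ""

    def __init__(self, http_fetch: Fetcher):
        self._http_fetch = http_fetch

    def load(self) -> str:
        return self._http_fetch(self.uri)


class RenderedBridge(Bridge):
    """Opt-in variant for JavaScript-only sites.

    Only bridges that subclass this pay the cost of a rendering backend
    (a headless browser or a remote rendering service).
    """

    def __init__(self, http_fetch: Fetcher, render_fetch: Fetcher):
        super().__init__(http_fetch)
        self._render_fetch = render_fetch

    def load(self) -> str:
        return self._render_fetch(self.uri)


class StaticSiteBridge(Bridge):
    uri = "https://static.example/news"


class SpaSiteBridge(RenderedBridge):
    uri = "https://spa.example/app"
```

Under this design, an instance on shared hosting could still run every bridge derived from the default class and simply disable the rendered ones.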
BrowserKit supports static websites only and Panther depends on headless Chrome. I have come to the conclusion that scraping AJAX sites is not reasonable without a browser. Currently I'm preparing a prototype. Please see #3970 |
Is your feature request related to a problem? Please describe.
More and more pages aren't scrapeable because the entire page is a single-page application (SPA). For example, sites that use React or Angular aren't readable using curl. Facebook is days away from rolling out site-wide changes that load all of Facebook as an SPA. This will break the Facebook bridge.
https://engineering.fb.com/web/facebook-redesign/
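As a rough illustration of why a plain HTTP fetch is not enough for an SPA, the sketch below extracts the visible text from two hand-written HTML strings using only Python's standard library. Both documents are made-up examples, and `visible_text` is a hypothetical helper: the SPA shell contains just an empty mount point and a script tag, so there is nothing to scrape until JavaScript has run.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring script, style, and title contents."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up server responses: a classic static page and a typical SPA shell.
STATIC_PAGE = "<html><body><h1>News</h1><p>First item</p></body></html>"
SPA_SHELL = (
    "<html><head><title>App</title>"
    '<script src="/bundle.js"></script></head>'
    '<body><div id="root"></div></body></html>'
)

static_text = visible_text(STATIC_PAGE)  # content is present in the raw HTML
spa_text = visible_text(SPA_SHELL)       # empty: content appears only after JS runs
```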
Describe the solution you'd like
I would like to see rss-bridge load URLs in a headless browser (or use a third-party service that does?). Then each bridge that needs to can scrape a fully loaded page.
Describe alternatives you've considered
Or, at the very least, change several bridges to accept a different URL so a third-party service, which renders pages entirely, could be used and the bridge could scrape the resulting response.
Additional context
Obviously, this is a big change that would need to be thought through. I'm opening this just to hopefully get a conversation going. I'm happy to help out and try to implement some of this.