Switch to loading URLs in a headless browser to scrape after JavaScript fully loaded #1740

Open
stevenirby opened this issue Sep 14, 2020 · 8 comments
Labels
Feature-Request Issue is a feature request

Comments

@stevenirby

Is your feature request related to a problem? Please describe.

More and more pages aren't scrapeable because the entire page is a Single Page Application (SPA). For example, sites built with React or Angular can't be read using curl, since the content is rendered by JavaScript after the initial response. Facebook is days away from rolling out site-wide changes that load all of Facebook as an SPA. This will break the Facebook bridge.

https://engineering.fb.com/web/facebook-redesign/

Describe the solution you'd like

I would like to see rss-bridge load URLs in a headless browser (or use a third-party service that does). Then each bridge that needs it can scrape a fully loaded page.

Describe alternatives you've considered

Or, at the very least, change several bridges to accept a different URL so a third-party service, which renders pages entirely, could be used and the bridge could scrape the resulting response.

Additional context

Obviously, this is a big change that would need to be thought through. I'm opening this just to hopefully get a conversation going. I'm happy to help out and try to implement some of this.

@stevenirby stevenirby added the Feature-Request Issue is a feature request label Sep 14, 2020
@em92
Contributor

em92 commented Sep 14, 2020

Hi, Steven!

I would like to see rss-bridge load URLs in a headless browser

It will heavily increase server requirements, especially for public instances. For example, if a bridge is running on shared hosting, it probably won't work. AFAIK, no shared hosting provider offers a headless browser.

Then each bridge that needs to can scrape a fully loaded page.

In both cases, if a bridge breaks, someone still has to write a patch for it.

Or, at the very least, change several bridges to accept a different URL so a third-party service

I could not understand what "accept a different URL so a third-party service" means.

@stevenirby
Author

It will heavily increase server requirements, especially for public instances. For example, if a bridge is running on shared hosting, it probably won't work. AFAIK, no shared hosting provider offers a headless browser.

It would change the requirements for sure. Isn't it possible to run a headless browser on a VPS? Maybe something like https://github.com/jsdom/jsdom ?

Could not understand what "accept a different URL so a third-party service" mean.

I often use a proxy scraping service which renders a page and returns the rendered HTML. For example:

https://apiservice.com?url=www.google.com

It would be great to be able to use such a URL in a bridge. However, most of the time the URLs are hardcoded into the bridge code.
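As a rough illustration of the idea, here is a minimal sketch of how a bridge could route its request through such a rendering proxy instead of using a hardcoded URL. It is written in Python for brevity (RSS-Bridge itself is PHP), and `resolve_fetch_url` and `render_proxy` are hypothetical names, not part of RSS-Bridge. The `url` query parameter follows the placeholder example above; a real service may expect something different.

```python
from urllib.parse import urlencode


def resolve_fetch_url(target_url: str, render_proxy: str = "") -> str:
    """Return the URL a bridge should actually request.

    target_url:   the page the bridge wants to scrape.
    render_proxy: hypothetical setting holding the base URL of a service that
                  renders the page and returns the resulting HTML. An empty
                  value means "fetch the page directly", as bridges do today.
    """
    if not render_proxy:
        return target_url
    # Percent-encode the target so its own query string is not mistaken
    # for parameters of the proxy request.
    return f"{render_proxy.rstrip('/')}/?{urlencode({'url': target_url})}"


# Direct fetch (current behaviour) versus fetch through the placeholder proxy.
print(resolve_fetch_url("https://example.com/feed?page=1"))
print(resolve_fetch_url("https://example.com/feed?page=1", "https://apiservice.com"))
```

With something like this, the bridge code keeps its own target URL and only the instance configuration decides whether a renderer sits in between.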

@triatic
Contributor

triatic commented Oct 10, 2020

It would change the requirements for sure. Isn't it possible to run a headless browser on a VPS? Maybe something like https://github.com/jsdom/jsdom ?

Interesting point as to what the typical user runs rss-bridge on. Requiring the use of a binary for headless browsing could make rss-bridge unsuitable for many existing users on shared hosting.

I often use a proxy scraping service which renders a page and returns the rendered HTML. For example:

https://apiservice.com?url=www.google.com

Is there such a free service we can use for headless browsing? Assuming there is, it may not be free indefinitely.

All that said, I agree in principle that moving to headless browsing has advantages for simplifying the scraping process.

@dvikan
Contributor

dvikan commented Mar 26, 2022

Might be viable: https://github.com/chrome-php/chrome

@dvikan dvikan closed this as completed Mar 26, 2022
@dvikan dvikan reopened this Mar 27, 2022
@ghost

ghost commented Jul 25, 2023

Can the headless browser be an option or an alternative variant of RSS-Bridge?

Feedless can run a headless Chromium in the background for generating feeds (admittedly, I wasn't able to install it in Docker).

@dvikan
Contributor

dvikan commented Jul 25, 2023

Maybe. We want to do this, but we lack the skills to do it.

@hleskien
Contributor

I would not recommend switching to a headless browser for everything, because it is a resource hog. Instead, add a new class for bridges that depend on JavaScript-only websites. The existing classes should remain the default.
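To make the opt-in idea concrete, here is a minimal sketch of how such a class split could look. It is written in Python for brevity (RSS-Bridge itself is PHP), and every name in it (`Bridge`, `JavaScriptBridge`, `needs_javascript`, `get_contents`) is hypothetical rather than the project's actual API. The two fetchers are passed in as plain callables, so the sketch makes no assumption about which browser library would be used.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns the page's HTML as a string.
Fetcher = Callable[[str], str]


class Bridge:
    """Default bridge: plain HTTP fetch, no JavaScript execution."""

    needs_javascript = False

    def __init__(self, http_fetch: Fetcher, browser_fetch: Optional[Fetcher] = None):
        self._http_fetch = http_fetch
        self._browser_fetch = browser_fetch

    def get_contents(self, url: str) -> str:
        if not self.needs_javascript:
            return self._http_fetch(url)
        if self._browser_fetch is None:
            # Instances without a browser (e.g. shared hosting) fail loudly
            # for these bridges only; all other bridges keep working.
            raise RuntimeError(
                f"{type(self).__name__} needs a headless browser, but none is configured"
            )
        return self._browser_fetch(url)


class JavaScriptBridge(Bridge):
    """Opt-in base class for bridges that scrape JavaScript-only sites."""

    needs_javascript = True
```

The point of the split is that an instance without a browser pays no extra cost: only bridges deriving from the opt-in class ever touch the browser fetcher.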

I stumbled on the same problem, and my research so far points to the Symfony BrowserKit component or Panther. Other solutions are either deprecated or too complicated / heavy on resources. If I have time, I will start some experiments and maybe find a working solution.

@hleskien
Contributor

hleskien commented Jan 22, 2024

BrowserKit supports static websites only and Panther depends on headless Chrome. I have come to the conclusion that scraping AJAX sites is not reasonable without a browser. Currently I'm preparing a prototype.

Please see #3970


5 participants