Switch to loading URLs in a headless browser to scrape after JavaScript fully loaded #1740

stevenirby · 2020-09-14T07:07:21Z

Is your feature request related to a problem? Please describe.

More and more pages aren't scrapeable because the entire page is a Single Page Application (SPA). For example, sites that use React or Angular aren't readable using curl, since the content is only rendered once JavaScript runs. Facebook is days away from rolling out site-wide changes that load all of Facebook as an SPA. This will break the Facebook bridge.

https://engineering.fb.com/web/facebook-redesign/

Describe the solution you'd like

I would like to see rss-bridge load URLs in a headless browser (or use a third-party service that does?). Then each bridge that needs to can scrape a fully loaded page.

Describe alternatives you've considered

Or at the very least, change several bridges to accept a different URL so a third-party service, which renders pages entirely, could be used and the bridge could scrape the resulting response.

Additional context

Obviously, this is a big change that would need to be thought through. I'm opening this just to hopefully get a conversation going. I'm happy to help out and try to implement some of this.

em92 · 2020-09-14T08:54:35Z

Hi, Steven!

I would like to see rss-bridge load URLs in a headless browser

It will heavily increase server requirements, especially for public instances. For example, if a bridge is running on shared hosting, it probably won't work. AFAIK no shared hosting provider offers a headless browser.

Then each bridge that needs to can scrape a fully loaded page.

In both cases, if a bridge breaks, someone still has to write patches for it.

Or at the very least, change several bridges to accept a different URL so a third-party service

I could not understand what "accept a different URL so a third-party service" means.

stevenirby · 2020-09-15T09:04:58Z

It will heavily increase server requirements, especially for public instances. For example, if a bridge is running on shared hosting, it probably won't work. AFAIK no shared hosting provider offers a headless browser.

It would change the requirements for sure. Isn't it possible to run a headless browser on a VPS? Maybe something like https://github.com/jsdom/jsdom ?

I could not understand what "accept a different URL so a third-party service" means.

I often use a proxy scraping service which renders a page and returns the rendered HTML. For example:

https://apiservice.com?url=www.google.com

It would be great to be able to use such a URL in a bridge. However, most of the time the URLs are hardcoded into the bridge code.
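To make the idea concrete, here is a minimal sketch in Python of how a bridge's target URL could be routed through a rendering proxy when one is configured. The function name `wrap_with_renderer` and the `renderer_base` setting are hypothetical, and the `url` query parameter simply mirrors the example URL above; a real service may expect a different parameter name or an API key.

```python
from urllib.parse import urlencode


def wrap_with_renderer(target_url, renderer_base=None):
    """Return target_url, optionally routed through a rendering proxy.

    renderer_base is a hypothetical, instance-wide setting such as
    "https://apiservice.com". When it is empty, the bridge keeps
    fetching the site directly, exactly as it does today.
    """
    if not renderer_base:
        return target_url
    # Append to an existing query string if the proxy URL already has one.
    separator = "&" if "?" in renderer_base else "?"
    # urlencode percent-encodes the target so its own "?" and "&" survive.
    return renderer_base + separator + urlencode({"url": target_url})


print(wrap_with_renderer("https://www.google.com"))
print(wrap_with_renderer("https://www.google.com", "https://apiservice.com"))
```

With something like this, the hardcoded URL stays in the bridge and only the final fetch changes, so bridges would not each need a separate "alternative URL" parameter.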

triatic · 2020-10-10T00:06:14Z

It would change the requirements for sure. Isn't it possible to run a headless browser on a VPS? Maybe something like https://github.com/jsdom/jsdom ?

Interesting point as to what the typical user runs rss-bridge on. Requiring the use of a binary for headless browsing could make rss-bridge unsuitable for many existing users on shared hosting.

I often use a proxy scraping service which renders a page and returns the rendered HTML. For example:

https://apiservice.com?url=www.google.com

Is there such a free service we can use for headless browsing? Assuming there is, it may not be free indefinitely.

All that said, I agree in principle that moving to headless browsing has advantages for simplifying the scraping process.

dvikan · 2022-03-26T02:21:30Z

Might be viable: https://github.com/chrome-php/chrome

ghost · 2023-07-25T11:00:06Z

Can the headless browser be an option or an alternative variant of RSS-Bridge?

Feedless can run a headless Chromium in the background for generating feeds (admittedly I wasn't able to install it in Docker).

dvikan · 2023-07-25T20:44:21Z

Maybe. We want to do this but lack the skills to do it.

hleskien · 2023-12-15T19:45:07Z

I would not recommend switching to a headless browser because it is a resource hog. Instead, add a new class for bridges that depend on JavaScript-only websites. The default should remain the existing classes.
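A rough sketch of that split, written in Python purely for illustration. The class names `Bridge` and `BrowserBridge` and the `fetchers` registry are hypothetical stand-ins, not RSS-Bridge's actual PHP classes. The point is only that the default path stays a plain HTTP fetch, and a bridge opts into the browser explicitly.

```python
from typing import Callable, Dict

# A fetcher takes a URL and returns HTML. Real fetchers would do network I/O.
Fetcher = Callable[[str], str]


class Bridge:
    """Default bridge: plain HTTP fetch, no JavaScript execution."""

    fetcher_name = "http"

    def __init__(self, fetchers: Dict[str, Fetcher]):
        self._fetchers = fetchers

    def fetch(self, url: str) -> str:
        return self._fetchers[self.fetcher_name](url)


class BrowserBridge(Bridge):
    """Opt-in variant for bridges whose target site needs JavaScript."""

    fetcher_name = "browser"

    def fetch(self, url: str) -> str:
        # Instances on shared hosting may have no browser fetcher at all;
        # fail with a clear message instead of a bare KeyError.
        if self.fetcher_name not in self._fetchers:
            raise RuntimeError("no headless browser configured on this instance")
        return super().fetch(url)


# Stand-in fetchers so the sketch runs without a network or a browser.
fetchers = {
    "http": lambda url: "static HTML of " + url,
    "browser": lambda url: "rendered HTML of " + url,
}
print(Bridge(fetchers).fetch("https://example.com"))
print(BrowserBridge(fetchers).fetch("https://example.com"))
```

Under this arrangement, an instance without a browser would still serve every existing bridge, and only the JavaScript-dependent ones would be unavailable.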

I stumbled on the same problem, and my research so far points to the Symfony BrowserKit component or Panther. Other solutions are either deprecated or too complicated / heavy on resources. If I have time, I will start some experiments and maybe find a working solution.

hleskien · 2024-01-22T17:57:24Z

BrowserKit supports static websites only and Panther depends on headless Chrome. I have come to the conclusion that scraping AJAX sites is not reasonable without a browser. Currently I'm preparing a prototype.
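For readers who want to see what "with a browser" means at its simplest, here is a minimal Python sketch that asks a locally installed headless Chrome or Chromium to print the rendered DOM. It is not the prototype mentioned above. The binary name `chromium` is an assumption (it varies by system, for example `google-chrome` or `chromium-browser`), and the helper names are hypothetical.

```python
import shutil
import subprocess


def dump_dom_command(url, chrome_binary="chromium"):
    """Build the argv for a headless run that prints the rendered DOM."""
    # --headless: run without a window; --dump-dom: write the DOM to stdout.
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def render(url, chrome_binary="chromium", timeout=30):
    """Return the page's HTML after the browser has executed its JavaScript."""
    if shutil.which(chrome_binary) is None:
        raise FileNotFoundError(chrome_binary + " not found on PATH")
    result = subprocess.run(
        dump_dom_command(url, chrome_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the browser exits with an error
    )
    return result.stdout


print(dump_dom_command("https://example.com"))
```

One caveat with this approach: the DOM is dumped once the page has loaded, so content fetched by late AJAX requests may still be missing. Driving the browser through a library that can wait for specific elements is likely needed for such sites.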

Please see #3970

stevenirby added the Feature-Request label Sep 14, 2020

dvikan closed this as completed Mar 26, 2022

dvikan reopened this Mar 27, 2022
