Splash support #6

Closed
lopuhin opened this issue Mar 18, 2016 · 5 comments
Comments

@lopuhin
Contributor

lopuhin commented Mar 18, 2016

This will solve #5 and will also let us support phpbb3-style sessions, which are tied to the user agent and IP address.

I see two ways to implement it:

  • call splash directly via requests, perhaps with a simple splash script.
  • use a simple scrapy spider with hh_splash middleware.

I'm still not sure which is better... The first option is more self-contained.
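
To make the first option concrete, here is a rough sketch of calling Splash directly via requests. It assumes a Splash instance listening at `http://localhost:8050` and posts a small Lua script to Splash's `/execute` endpoint. The helper names (`build_execute_payload`, `fetch_via_splash`) are made up for illustration and are not part of autologin.

```python
# Assumption: Splash is reachable at this address; adjust for your deployment.
SPLASH_URL = "http://localhost:8050"

# Small Splash script: set the user agent, load the page, and return
# the rendered HTML together with the cookies Splash collected.
LUA_SCRIPT = """
function main(splash)
  splash:set_user_agent(splash.args.user_agent)
  assert(splash:go(splash.args.url))
  splash:wait(0.5)
  return {html = splash:html(), cookies = splash:get_cookies()}
end
"""


def build_execute_payload(url, user_agent):
    """Build the JSON body for Splash's /execute endpoint (hypothetical helper)."""
    return {"lua_source": LUA_SCRIPT, "url": url, "user_agent": user_agent}


def fetch_via_splash(url, user_agent, splash_url=SPLASH_URL, timeout=60):
    """POST the script to Splash and return the decoded JSON result."""
    import requests  # imported lazily so the payload helper has no dependencies

    response = requests.post(
        splash_url + "/execute",
        json=build_execute_payload(url, user_agent),
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()
```

With a running Splash, `fetch_via_splash("http://example.com/login", "my-agent")` would return a dict with `html` and `cookies` keys, as produced by the script above.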

@madisonb

I don't understand why Splash is needed to support phpbb3-style cookies. If autologin requires Splash, then it is no longer really a Python module and needs a larger architecture in order to function. While I am not well versed in phpbb3-style cookies, I do not see why we cannot fake a request with all of the proper header information, which is pretty easy in Scrapy.

We have been very happy with integrating autologin in our scraping architecture, and I think the best use of the module will be to make it standalone as much as possible.

@lopuhin
Contributor Author

lopuhin commented Mar 18, 2016

Thanks for the feedback, @madisonb! Do you use autologin as a library to get the request data and then send it with Scrapy?

The situation where Splash support is helpful is when we use autologin as a service, perhaps even on a different host, and also crawl via a separate Splash instance. In this case, by using the same Splash instance both in autologin and in the crawler, we get the same IP and the same user agent, and can also log in on sites that are hard to handle without Splash (JS-heavy sites or Tor).

@lopuhin
Contributor Author

lopuhin commented Mar 18, 2016

Just to clarify: Splash support is intended to be optional, not a requirement.

@madisonb

Precisely, we use autologin/formasaurus in library form but could switch over to autologin as a service if needed, and then use the cookies generated within Scrapy. We don't use Splash instances to crawl the open web, and for Tor we have our spiders configured to work with the network.

Most sites in the past have not cared whether the cookie comes from a different IP, but the phpbb3 sites may, and we may need extra engineering to work with that.
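
As an illustration of reusing login cookies in Scrapy together with a matching user agent, a minimal sketch follows. It assumes the cookies arrive as a list of dicts with at least `name` and `value` keys; the actual shape returned by autologin may differ, and `build_request_kwargs` is a hypothetical helper.

```python
def build_request_kwargs(cookies, user_agent):
    """Return keyword arguments for scrapy.Request that replay a login session.

    Assumption: `cookies` is a list of dicts with "name" and "value" keys
    and, optionally, "domain" and "path".
    """
    # Keep only the cookie fields that scrapy.Request understands.
    allowed = ("name", "value", "domain", "path")
    return {
        "cookies": [
            {key: cookie[key] for key in allowed if key in cookie}
            for cookie in cookies
        ],
        # Send the same user agent that was used during login.
        "headers": {"User-Agent": user_agent},
    }
```

Inside a spider this could be used as `scrapy.Request(url, **build_request_kwargs(cookies, user_agent))`. That covers only the user-agent half of the problem; keeping the same IP still means routing the crawl through the same exit address that was used for the login.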

@lopuhin lopuhin mentioned this issue Apr 5, 2016
4 tasks
@lopuhin
Contributor Author

lopuhin commented Apr 5, 2016

This is done in #8 by using scrapy and scrapy-splash.
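
For readers unfamiliar with scrapy-splash, the fragment below shows the project settings its README recommends for routing a Scrapy crawl through Splash. This is not taken from #8. The Splash address is an assumption, and autologin's own configuration may differ.

```python
# settings.py fragment for a Scrapy project using scrapy-splash.
# Assumption: Splash is reachable at this address.
SPLASH_URL = "http://localhost:8050"

# Middlewares that send requests through Splash and keep its cookies in sync.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoids storing duplicate Splash arguments for queued requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Request fingerprinting that takes Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```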

@lopuhin lopuhin closed this as completed Apr 5, 2016