basic scrapy spider that takes a url argument #1
Conversation
One feature that would be nice: as a debugger, I want to rerun a scrape on all links that gave non-200 status codes, similar to `pytest --lf`.

This script took 10 minutes on my machine for CDH web. I would imagine this kind of feature could save a lot of time.
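A rough sketch of how that might work (nothing like this is in the PR; the `failed_urls.txt` file and the `urls_file` argument are made-up names for illustration):

```python
import pathlib

import scrapy


class RetryFailedSpider(scrapy.Spider):
    """Hypothetical spider that re-checks only the urls that failed last time."""

    name = "retry_failed"
    # let non-2xx responses reach the callback instead of being filtered out
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def __init__(self, urls_file="failed_urls.txt", *args, **kwargs):
        super().__init__(*args, **kwargs)
        path = pathlib.Path(urls_file)
        # seed the crawl from the failures recorded by a previous run, if any
        self.start_urls = path.read_text().splitlines() if path.exists() else []

    def parse(self, response):
        # record anything that still isn't a 200 so the next run can retry it
        if response.status != 200:
            with open("failed_urls.txt", "a") as fh:
                fh.write(response.url + "\n")
```

Run the same way as the main spider, it would only revisit the recorded failures, roughly like `pytest --lf` re-running only the last failures.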
I saw that scrapy allows you to set an `allowed_domains` property. Would that clean up some of this logic?
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains
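For reference, that is just a class attribute on the spider; a minimal sketch (the domain here is only an example):

```python
import scrapy


class LinkSpider(scrapy.Spider):
    name = "links"  # hypothetical name
    # OffsiteMiddleware drops requests whose host isn't under one of these domains
    allowed_domains = ["cdh.princeton.edu"]
```

Note that subdomains of an allowed domain still count as on-site, which comes up again further down in this thread.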
def closed(self, reason):
    # TODO: improve reporting
It's easy to overdo fancy logging, I guess, but I've really fallen in love with rich lately: https://github.com/willmcgugan/rich. Their tables might be helpful here.
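If we went that route, a minimal sketch of what a rich summary table could look like (the counts here are placeholders, not real crawl output):

```python
from rich.console import Console
from rich.table import Table

table = Table(title="crawl results")
table.add_column("status")
table.add_column("count", justify="right")
# placeholder rows; a real report would fill these from the crawl stats
table.add_row("200", "1432")
table.add_row("404", "12")
Console().print(table)
```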
i also love rich. sometimes it's overkill, but it's super gratifying...
@@ -12,6 +12,9 @@
SPIDER_MODULES = ['caliper.spiders']
NEWSPIDER_MODULE = 'caliper.spiders'

# override scrapy default log level of DEBUG
LOG_LEVEL = "WARN"
I saw that the debug logging produces quite a lot of spam, but is there a clean way to show progress? I felt nervous having a script run without any feedback for five minutes.
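One option (a sketch, not what the PR does): scrapy's built-in LogStats extension reports "Crawled N pages (at N pages/min)" at the INFO level, so settings like these would give a periodic heartbeat without the full DEBUG spam:

```python
# settings.py sketch: INFO is noisier than WARN but far quieter than DEBUG,
# and it lets the LogStats progress messages through
LOG_LEVEL = "INFO"
LOGSTATS_INTERVAL = 30.0  # seconds between progress reports (default is 60)
```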
How do you show progress when you have no idea how many urls you are going to crawl?
I'm wondering about integrating the sitemaps into this to make sure we hit every known page, which might give us a starting point for a minimum number of pages, but wasn't sure if we needed that in the first version.
fwiw, I think the hope is that we'll automate running it, so we may not be running it manually all that often
everything looks good; I had the same question as @kmcelwee about `allowed_domains`, which I used for a sitemap crawler when I was testing Percy and it did what I wanted. also wonder why a more specific `Spider` subclass wasn't used, maybe `CrawlSpider`? My guess is because we want to check a bunch more types of content (`mailto:` links, iframes, etc.) but just wanted to check.
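For comparison, a `CrawlSpider` version would look roughly like this; its `LinkExtractor` only picks up `<a>`/`<area>` hrefs by default, which may be why a plain `Spider` is a better fit if we also want to check things like `mailto:` links and iframes (the names below are illustrative):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteCrawlSpider(CrawlSpider):
    name = "sitecrawl"  # hypothetical
    allowed_domains = ["cdh.princeton.edu"]
    start_urls = ["https://cdh.princeton.edu/"]

    # follow every extracted link and run parse_page on each response
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        self.logger.info("crawled %s (%s)", response.url, response.status)
```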
# should we follow iframes urls?
# yield response.follow(iframe_src, callback=self.parse)
can't think of why we'd want to do this
The reason we might want to is that we probably do want to know if iframes embedded in our site are broken; the same could apply to other assets like images or videos.
(although maybe we won't have iframes anymore, since we're using oembed?)
oh, I misinterpreted this as being about following URLs inside the frame, not testing the URL of the frame. makes sense!
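In that spirit, a sketch of checking just the iframe target's status without parsing the framed page (the `check_status` callback is made up for illustration; the real spider does more in `parse`):

```python
import scrapy


class IframeCheckSpider(scrapy.Spider):
    name = "iframecheck"  # hypothetical
    # let non-2xx responses reach the callbacks so we can record them
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    def parse(self, response):
        for iframe_src in response.css("iframe::attr(src)").getall():
            # request the iframe target only to record its status,
            # without following links inside the framed document
            yield response.follow(iframe_src, callback=self.check_status)

    def check_status(self, response):
        if response.status != 200:
            self.logger.warning("broken iframe: %s (%s)", response.url, response.status)
```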
Interesting... I'm not exactly sure what the use case is for this, though. Probably not too hard to add a way to feed it a list of exact urls instead of crawling the whole site; seems like the logic might be kind of different, though?
I was experimenting with that in an earlier version, when I had the start url hardcoded (before I'd figured out how to pass arguments so we can specify the starting point dynamically). It doesn't do exactly what we want, because it considers startwords.cdh.princeton.edu to be included if we say that cdh.princeton.edu is our allowed domain. I didn't try setting it dynamically; could be there's a related setting or configuration that makes this do what we want, though.
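For what it's worth, a sketch of deriving `allowed_domains` from the url argument at runtime (untested; and as noted above, OffsiteMiddleware still treats subdomains like startwords.cdh.princeton.edu as part of cdh.princeton.edu):

```python
from urllib.parse import urlparse

import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = "linkcheck"  # hypothetical

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url] if url else []
        # restrict the crawl to the host of the starting url
        self.allowed_domains = [urlparse(url).netloc] if url else []
```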
This is at least partly because I started with their tutorial. I'm also interested in the sitemap spider; it would be nice to use that as a secondary source of urls, to make sure we crawl all the pages on the site (and it would be nice to know if there's anything in our sitemap that isn't somehow linked from somewhere else, although I have trouble imagining that could be the case), and we could track which urls are in the sitemap. But I don't have a sense of how we would integrate that right now! Any ideas?
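scrapy does ship a `SitemapSpider`; one untested possibility would be a second spider that reads the sitemap and records what it finds, so its output could be compared against the crawl (the sitemap url below is an assumed location):

```python
from scrapy.spiders import SitemapSpider


class SitemapCheckSpider(SitemapSpider):
    name = "sitemapcheck"  # hypothetical
    sitemap_urls = ["https://cdh.princeton.edu/sitemap.xml"]  # assumed path

    def parse(self, response):
        # every url reached here came from the sitemap; record it and its status
        yield {"url": response.url, "status": response.status}
```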
I cleaned up my prototype / experimental scrapy code to get to a version that takes a url via command line argument, and I added some notes documenting how to run it.
I generated a new project with `scrapy startproject` and there are a lot of things I haven't touched; we may not need a lot of the default files, but I want to read more about them first before jettisoning. I checked the starting project output into develop, so this PR highlights the changes I actually made.

For now, see if the code generally makes sense and if you can run it. We can start brainstorming improvements and creating issues as needed, but for now I'm hoping to get a minimal working version we can use to help with migrating/archiving Derrida's Margins.