
basic scrapy spider that takes a url argument #1

Merged
merged 3 commits into develop on Sep 8, 2021

Conversation

@rlskoeser (Contributor)

I cleaned up my prototype / experimental scrapy code to get to a version that takes a url via command line argument, and I added some notes documenting how to run it.

I generated a new project with scrapy startproject, and there are a lot of things I haven't touched; we may not need many of the default files, but I want to read more about them before jettisoning anything. I checked the generated startproject output into develop, so this PR highlights only the changes I actually made.

For now, see if the code generally makes sense and whether you can run it. We can start brainstorming improvements and creating issues as needed, but my immediate hope is to get a minimal working version we can use to help with migrating/archiving Derrida's Margins.
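For anyone skimming before running it: the general shape of a spider that takes its start url as a command-line argument looks roughly like the sketch below. This is an illustration only, not the code in this PR; the spider name, attributes, and item fields here are hypothetical.

```python
# Minimal sketch of a scrapy spider that takes its start url via -a,
# e.g. `scrapy crawl linkcheck -a url=https://example.princeton.edu/`
# (spider name and item fields are hypothetical, not the ones in this PR).
import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = "linkcheck"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapy passes -a options to the spider as keyword arguments
        self.start_urls = [url] if url else []

    def parse(self, response):
        # record the status for this page, then queue up any linked pages
        yield {"url": response.url, "status": response.status}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```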

@kmcelwee left a comment

One feature that would be nice:

As a debugger I want to rerun a scrape on all links that gave non-200 status codes, similar to pytest --lf

This script took 10 minutes on my machine for CDH web. I would imagine this kind of feature could save a lot of time.

I saw that scrapy lets you set an allowed_domains property. Would that clean up some of this logic?
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains
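For reference, allowed_domains is just a class attribute on the spider; a minimal sketch (hypothetical spider name), with the caveat discussed further down that it also admits subdomains:

```python
import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = "linkcheck"
    # requests to hosts outside these domains (or their subdomains)
    # are dropped by scrapy's offsite filtering
    allowed_domains = ["cdh.princeton.edu"]
```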



def closed(self, reason):
# TODO: improve reporting

It's easy to overdo fancy logging, I guess, but I've really fallen in love with rich lately (https://github.com/willmcgugan/rich). Their tables might be helpful here.
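For example, a rough sketch of what an end-of-crawl summary could look like with rich (the results structure here, a list of dicts with url/status keys, is made up):

```python
# Sketch of a crawl summary table using rich; assumes `results` is a list
# of dicts like {"url": ..., "status": ...}, which is hypothetical.
from rich.console import Console
from rich.table import Table


def report(results):
    table = Table(title="Crawl results")
    table.add_column("Status")
    table.add_column("URL")
    for item in results:
        table.add_row(str(item["status"]), item["url"])
    Console().print(table)
```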

I also love rich. Sometimes it's overkill, but it's super gratifying...

@@ -12,6 +12,9 @@
SPIDER_MODULES = ['caliper.spiders']
NEWSPIDER_MODULE = 'caliper.spiders'

# override scrapy default log level of DEBUG
LOG_LEVEL = "WARN"

I saw that the debug logging produces quite a lot of spam, but is there a clean way to show progress? I felt nervous having a script run without any feedback for five minutes.

@rlskoeser (Contributor, Author)

How do you show progress when you have no idea how many urls you are going to crawl?

I'm wondering about integrating the sitemaps into this to make sure we hit every known page, which might give us a starting point for a minimum number of pages, but I wasn't sure if we needed that in the first version.
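One low-tech option that doesn't require knowing the total up front: an extension that logs a running count every N responses. This is a sketch only; the class name, module path, and interval are made up, and it logs at WARNING so it still shows with LOG_LEVEL = "WARN".

```python
# Sketch of a progress extension; enable it with something like
# EXTENSIONS = {"caliper.extensions.ProgressCounter": 500} in settings
# (the module path is hypothetical).
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class ProgressCounter:
    def __init__(self, every=50):
        self.every = every
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received,
                                signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        self.count += 1
        if self.count % self.every == 0:
            # WARNING level so it is visible despite LOG_LEVEL = "WARN"
            logger.warning("crawled %d pages so far (latest: %s)",
                           self.count, response.url)
```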

@rlskoeser (Contributor, Author)

fwiw, I think the hope is that we'll automate running it, so we may not be running it manually all that often.

@thatbudakguy left a comment


Everything looks good. I had the same question as @kmcelwee about allowed_domains, which I used for a sitemap crawler when I was testing Percy, and it did what I wanted. I also wonder why a more specific Spider subclass wasn't used, maybe CrawlSpider? My guess is that it's because we want to check a bunch more types of content (mailto: links, iframes, etc.), but I just wanted to check.
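For comparison, a CrawlSpider version would look roughly like the sketch below. This is an illustration only; it wouldn't cover things like mailto: links or iframe srcs on its own, and the spider name is hypothetical.

```python
# Rough sketch of the rules-based CrawlSpider approach.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteCrawlSpider(CrawlSpider):
    name = "sitecrawl"
    allowed_domains = ["cdh.princeton.edu"]
    start_urls = ["https://cdh.princeton.edu/"]

    rules = (
        # follow every extracted link and record the status of each page
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url, "status": response.status}
```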



Comment on lines +68 to +69
# should we follow iframes urls?
# yield response.follow(iframe_src, callback=self.parse)

Can't think of why we'd want to do this.

@rlskoeser (Contributor, Author)

The reason we might want to is that we probably do want to know if iframes embedded in our site are broken; the same could apply to other assets like images or videos.

(although maybe we won't have iframes anymore, since we're using oembed?)
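For the frame case, the distinction matters: requesting the iframe's src to record its status is different from crawling the links inside the framed page. A sketch of the former (spider and method names are hypothetical):

```python
import scrapy


class IframeCheckSpider(scrapy.Spider):
    # Hypothetical sketch: record the status of each iframe's src,
    # but don't follow links inside the framed page itself.
    name = "iframecheck"

    def parse(self, response):
        for iframe_src in response.css("iframe::attr(src)").getall():
            yield response.follow(iframe_src, callback=self.check_status)

    def check_status(self, response):
        yield {"url": response.url, "status": response.status}
```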

Oh, I misinterpreted this as being about following URLs inside the frame, rather than testing the URL of the frame itself. Makes sense!

@rlskoeser (Contributor, Author)

> One feature that would be nice:
>
> As a debugger I want to rerun a scrape on all links that gave non-200 status codes, similar to pytest --lf
>
> This script took 10 minutes on my machine for CDH web. I would imagine this kind of feature could save a lot of time.

Interesting... I'm not exactly sure what the use case is for this, though. It's probably not too hard to add a way to feed it a list of exact urls instead of crawling the whole site; it seems like the logic might be kind of different, though?
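Feeding it an explicit list might look roughly like this (a sketch, assuming a plain text file with one url per line; the spider and argument names are made up):

```python
# Sketch: recheck an explicit list of urls instead of crawling the site,
# e.g. `scrapy crawl recheck -a urls_file=failed-urls.txt`
import scrapy


class RecheckSpider(scrapy.Spider):
    name = "recheck"

    def __init__(self, urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls_file = urls_file

    def start_requests(self):
        with open(self.urls_file) as infile:
            for line in infile:
                url = line.strip()
                if url:
                    # handle_httpstatus_all so non-200 responses still
                    # reach parse and get reported
                    yield scrapy.Request(
                        url,
                        callback=self.parse,
                        meta={"handle_httpstatus_all": True},
                    )

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```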

> I saw that scrapy lets you set an allowed_domains property. Would that clean up some of this logic?
> https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains

I was experimenting with that in an earlier version, when I had the start url hardcoded (before I'd figured out how to pass arguments so we can specify the starting point dynamically). It doesn't do exactly what we want, because it considers startwords.cdh.princeton.edu to be included if we say that cdh.princeton.edu is our allowed domain. I didn't try setting it dynamically; it could be that there's a related setting or configuration that makes this do what we want, though.
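If strict matching is what we want, one option is to compare parsed hostnames ourselves instead of relying on allowed_domains. A sketch (the helper name is made up):

```python
# Sketch: treat only an exact hostname match as "on site", so
# startwords.cdh.princeton.edu would not count as cdh.princeton.edu.
from urllib.parse import urlparse


def same_host(url, start_url):
    return urlparse(url).netloc == urlparse(start_url).netloc
```

The spider's parse method could then skip response.follow for any link where same_host is false.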

@rlskoeser (Contributor, Author)

> Everything looks good. I had the same question as @kmcelwee about allowed_domains, which I used for a sitemap crawler when I was testing Percy, and it did what I wanted. I also wonder why a more specific Spider subclass wasn't used, maybe CrawlSpider? My guess is that it's because we want to check a bunch more types of content (mailto: links, iframes, etc.), but I just wanted to check.

This is at least partly because I started with their tutorial / startproject instructions! But I couldn't figure out from the CrawlSpider documentation and examples how to do what I wanted. I'd be open to refactoring it if you have a sense of how to do it and if it simplifies the code.

I'm also interested in the sitemap spider. It would be nice to use that as a secondary source of urls, to make sure we crawl all the pages on the site, and we could track which urls are in the sitemap. (It would also be nice to know if there's anything in our sitemap that isn't linked from anywhere else, although I have trouble imagining that could be the case.) But I don't have a sense of how we would integrate that right now! Any ideas?
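For reference, the sitemap side on its own is small; a sketch of a standalone pass over the sitemap (assuming a standard /sitemap.xml; the spider name and the "source" field are made up):

```python
# Sketch: request every url listed in the sitemap and record its status,
# independent of the link crawl.
from scrapy.spiders import SitemapSpider


class SitemapCheckSpider(SitemapSpider):
    name = "sitemapcheck"
    sitemap_urls = ["https://cdh.princeton.edu/sitemap.xml"]

    def parse(self, response):
        yield {"url": response.url, "status": response.status, "source": "sitemap"}
```

Comparing what this yields against the link crawl's results would give the "in the sitemap but not linked from anywhere" list.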

@rlskoeser merged commit 590f5e9 into develop on Sep 8, 2021
@rlskoeser deleted the feature/basic-scrapy-spider branch on September 8, 2021 at 13:46