
basic scrapy spider that takes a url argument #1

Merged
merged 3 commits into develop on Sep 8, 2021

Conversation

@rlskoeser (Contributor)

I cleaned up my prototype / experimental scrapy code to get to a version that takes a url via command line argument, and I added some notes documenting how to run it.

I generated a new project with scrapy startproject, and there are a lot of things I haven't touched; we may not need many of the default files, but I want to read more about them before jettisoning anything. I checked the generated startproject output into develop, so this PR highlights only the changes I actually made.

For now, see if the code generally makes sense and whether you can run it. We can start brainstorming improvements and creating issues as needed, but my immediate hope is to get a minimal working version we can use to help with migrating/archiving Derrida's Margins.
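For anyone skimming before running it: the general shape of a spider that takes its start url as a command-line argument looks roughly like the sketch below. This is an illustration only, not the code in this PR; the spider name, attributes, and item fields here are hypothetical.

```python
# Minimal sketch of a scrapy spider that takes its start url via -a,
# e.g. `scrapy crawl linkcheck -a url=https://example.princeton.edu/`
# (spider name and item fields are hypothetical, not the ones in this PR).
import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = "linkcheck"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapy passes -a options to the spider as keyword arguments
        self.start_urls = [url] if url else []

    def parse(self, response):
        # record the status for this page, then queue up any linked pages
        yield {"url": response.url, "status": response.status}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```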

@kmcelwee left a comment

One feature that would be nice:

As a debugger I want to rerun a scrape on all links that gave non-200 status codes, similar to pytest --lf

This script took 10 minutes on my machine for CDH web. I would imagine this kind of feature could save a lot of time.

I saw that scrapy lets you set an allowed_domains property. Would that clean up some of this logic?
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains
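For reference, allowed_domains is just a class attribute on the spider; a minimal sketch (hypothetical spider name), with the caveat discussed further down that it also admits subdomains:

```python
import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = "linkcheck"
    # requests to hosts outside these domains (or their subdomains)
    # are dropped by scrapy's offsite filtering
    allowed_domains = ["cdh.princeton.edu"]
```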



def closed(self, reason):
# TODO: improve reporting

It's easy to overdo fancy logging, I guess, but I've really fallen in love with rich lately (https://github.com/willmcgugan/rich). Their tables might be helpful here.
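For example, a rough sketch of what an end-of-crawl summary could look like with rich (the results structure here, a list of dicts with url/status keys, is made up):

```python
# Sketch of a crawl summary table using rich; assumes `results` is a list
# of dicts like {"url": ..., "status": ...}, which is hypothetical.
from rich.console import Console
from rich.table import Table


def report(results):
    table = Table(title="Crawl results")
    table.add_column("Status")
    table.add_column("URL")
    for item in results:
        table.add_row(str(item["status"]), item["url"])
    Console().print(table)
```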

I also love rich. Sometimes it's overkill, but it's super gratifying...

@@ -12,6 +12,9 @@
SPIDER_MODULES = ['caliper.spiders']
NEWSPIDER_MODULE = 'caliper.spiders'

# override scrapy default log level of DEBUG
LOG_LEVEL = "WARN"

I saw that the debug logging produces quite a lot of spam, but is there a clean way to show progress? I felt nervous having a script run without any feedback for five minutes.

@rlskoeser (Contributor, Author)

How do you show progress when you have no idea how many urls you are going to crawl?

I'm wondering about integrating the sitemaps into this to make sure we hit every known page, which might give us a starting point for a minimum number of pages, but I wasn't sure if we needed that in the first version.
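One low-tech option that doesn't require knowing the total up front: an extension that logs a running count every N responses. This is a sketch only; the class name, module path, and interval are made up, and it logs at WARNING so it still shows with LOG_LEVEL = "WARN".

```python
# Sketch of a progress extension; enable it with something like
# EXTENSIONS = {"caliper.extensions.ProgressCounter": 500} in settings
# (the module path is hypothetical).
import logging

from scrapy import signals

logger = logging.getLogger(__name__)


class ProgressCounter:
    def __init__(self, every=50):
        self.every = every
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received,
                                signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        self.count += 1
        if self.count % self.every == 0:
            # WARNING level so it is visible despite LOG_LEVEL = "WARN"
            logger.warning("crawled %d pages so far (latest: %s)",
                           self.count, response.url)
```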

@rlskoeser (Contributor, Author)

fwiw, I think the hope is that we'll automate running it, so we may not be running it manually all that often.

@thatbudakguy left a comment


Everything looks good. I had the same question as @kmcelwee about allowed_domains, which I used for a sitemap crawler when I was testing Percy, and it did what I wanted. I also wonder why a more specific Spider subclass wasn't used, maybe CrawlSpider? My guess is that it's because we want to check a bunch more types of content (mailto: links, iframes, etc.), but I just wanted to check.
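For comparison, a CrawlSpider version would look roughly like the sketch below. This is an illustration only; it wouldn't cover things like mailto: links or iframe srcs on its own, and the spider name is hypothetical.

```python
# Rough sketch of the rules-based CrawlSpider approach.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteCrawlSpider(CrawlSpider):
    name = "sitecrawl"
    allowed_domains = ["cdh.princeton.edu"]
    start_urls = ["https://cdh.princeton.edu/"]

    rules = (
        # follow every extracted link and record the status of each page
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url, "status": response.status}
```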



Comment on lines +68 to +69
# should we follow iframes urls?
# yield response.follow(iframe_src, callback=self.parse)

Can't think of why we'd want to do this.

@rlskoeser (Contributor, Author)

The reason we might want to is that we probably do want to know if iframes embedded in our site are broken; the same could apply to other assets like images or videos.

(although maybe we won't have iframes anymore, since we're using oembed?)
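For the frame case, the distinction matters: requesting the iframe's src to record its status is different from crawling the links inside the framed page. A sketch of the former (spider and method names are hypothetical):

```python
import scrapy


class IframeCheckSpider(scrapy.Spider):
    # Hypothetical sketch: record the status of each iframe's src,
    # but don't follow links inside the framed page itself.
    name = "iframecheck"

    def parse(self, response):
        for iframe_src in response.css("iframe::attr(src)").getall():
            yield response.follow(iframe_src, callback=self.check_status)

    def check_status(self, response):
        yield {"url": response.url, "status": response.status}
```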

Oh, I misinterpreted this as being about following URLs inside the frame, rather than testing the URL of the frame itself. Makes sense!

@rlskoeser (Contributor, Author)

> One feature that would be nice:
>
> As a debugger I want to rerun a scrape on all links that gave non-200 status codes, similar to pytest --lf
>
> This script took 10 minutes on my machine for CDH web. I would imagine this kind of feature could save a lot of time.

Interesting... I'm not exactly sure what the use case is for this, though. It's probably not too hard to add a way to feed it a list of exact urls instead of crawling the whole site; it seems like the logic might be kind of different, though?
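Feeding it an explicit list might look roughly like this (a sketch, assuming a plain text file with one url per line; the spider and argument names are made up):

```python
# Sketch: recheck an explicit list of urls instead of crawling the site,
# e.g. `scrapy crawl recheck -a urls_file=failed-urls.txt`
import scrapy


class RecheckSpider(scrapy.Spider):
    name = "recheck"

    def __init__(self, urls_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls_file = urls_file

    def start_requests(self):
        with open(self.urls_file) as infile:
            for line in infile:
                url = line.strip()
                if url:
                    # handle_httpstatus_all so non-200 responses still
                    # reach parse and get reported
                    yield scrapy.Request(
                        url,
                        callback=self.parse,
                        meta={"handle_httpstatus_all": True},
                    )

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```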

> I saw that scrapy lets you set an allowed_domains property. Would that clean up some of this logic?
> https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains

I was experimenting with that in an earlier version, when I had the start url hardcoded (before I'd figured out how to pass arguments so we can specify the starting point dynamically). It doesn't do exactly what we want, because it considers startwords.cdh.princeton.edu to be included if we say that cdh.princeton.edu is our allowed domain. I didn't try setting it dynamically; it could be that there's a related setting or configuration that makes this do what we want, though.
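If strict matching is what we want, one option is to compare parsed hostnames ourselves instead of relying on allowed_domains. A sketch (the helper name is made up):

```python
# Sketch: treat only an exact hostname match as "on site", so
# startwords.cdh.princeton.edu would not count as cdh.princeton.edu.
from urllib.parse import urlparse


def same_host(url, start_url):
    return urlparse(url).netloc == urlparse(start_url).netloc
```

The spider's parse method could then skip response.follow for any link where same_host is false.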

@rlskoeser (Contributor, Author)

> Everything looks good. I had the same question as @kmcelwee about allowed_domains, which I used for a sitemap crawler when I was testing Percy, and it did what I wanted. I also wonder why a more specific Spider subclass wasn't used, maybe CrawlSpider? My guess is that it's because we want to check a bunch more types of content (mailto: links, iframes, etc.), but I just wanted to check.

This is at least partly because I started with their tutorial / startproject instructions! But I couldn't figure out from the CrawlSpider documentation and examples how to do what I wanted. I'd be open to refactoring it if you have a sense of how to do it and if it simplifies the code.

I'm also interested in the sitemap spider. It would be nice to use that as a secondary source of urls, to make sure we crawl all the pages on the site, and we could track which urls are in the sitemap. (It would also be nice to know if there's anything in our sitemap that isn't linked from anywhere else, although I have trouble imagining that could be the case.) But I don't have a sense of how we would integrate that right now! Any ideas?
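For reference, the sitemap side on its own is small; a sketch of a standalone pass over the sitemap (assuming a standard /sitemap.xml; the spider name and the "source" field are made up):

```python
# Sketch: request every url listed in the sitemap and record its status,
# independent of the link crawl.
from scrapy.spiders import SitemapSpider


class SitemapCheckSpider(SitemapSpider):
    name = "sitemapcheck"
    sitemap_urls = ["https://cdh.princeton.edu/sitemap.xml"]

    def parse(self, response):
        yield {"url": response.url, "status": response.status, "source": "sitemap"}
```

Comparing what this yields against the link crawl's results would give the "in the sitemap but not linked from anywhere" list.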

@rlskoeser merged commit 590f5e9 into develop on Sep 8, 2021
@rlskoeser deleted the feature/basic-scrapy-spider branch on September 8, 2021 at 13:46