A simple, lightweight web crawling framework
Features:
- HTTP crawling
- Various data extraction methods (regex, CSS selectors, XPath)
- Highly configurable and extensible
Crawlster is a web crawling library designed for building lightweight and reusable web crawlers. It is highly extensible and provides shortcuts for the most common tasks in a web crawler, such as sending HTTP requests, parsing responses, and extracting information.
It was created out of the need for a lighter web crawling framework, as an alternative to Scrapy.
From PyPI:
pip install crawlster
From source:
git clone https://github.com/vladcalin/crawlster.git
cd crawlster
python setup.py install
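To verify that the installation worked, importing the package should succeed:

python -c "import crawlster"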
This is the hello-world equivalent for this library:
import crawlster
from crawlster.handlers import JsonLinesHandler


class MyCrawler(crawlster.Crawlster):
    # items will be saved to items.jsonl
    item_handler = JsonLinesHandler('items.jsonl')

    @crawlster.start
    def step_start(self, url):
        resp = self.http.get(url)
        # we select elements with the expression and we are interested
        # only in the 'href' attribute. Also, we get only the first result
        # for this example
        events_uri = self.extract.css(resp.text, '#events > a', attr='href')[0]
        # we specify what method should be called next
        self.schedule(self.step_events_page, self.urls.join(url, events_uri))

    def step_events_page(self, url):
        resp = self.http.get(url)
        # we extract the content/text of all the selected titles
        events = self.extract.css(resp.text, 'h3.event-title a', content=True)
        for event_name in events:
            # submitting items to be processed by the item handler
            self.submit_item({'event': event_name})


if __name__ == '__main__':
    # defining the configuration
    config = crawlster.Configuration({
        # the start pages
        'core.start_urls': ['https://www.python.org/'],
        # the method that will process the start pages
        'core.start_step': 'step_start',
        # to see in depth what happens
        'log.level': 'debug'
    })
    # starting the crawler
    crawler = MyCrawler(config)
    # this will block until everything finishes
    crawler.start()
    # printing some run stats, such as the number of requests and how many
    # items were submitted
    print(crawler.stats.dump())
Running the above code will fetch the event names from python.org and save them to an items.jsonl file in the current directory.
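Each submitted item becomes one JSON document per line in that file, since JsonLinesHandler writes items in the JSON Lines format. With the {'event': ...} items built above, the output looks roughly like this (the event names are illustrative; they depend on what python.org lists at the time):

{"event": "PyCon US"}
{"event": "EuroPython"}
{"event": "DjangoCon Europe"}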
For more advanced usage, consult the documentation.
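The quick start above uses only the CSS extraction helper. As a rough sketch of what the regex and XPath extraction mentioned under Features could look like in a crawler step, assuming hypothetical self.extract.regex and self.extract.xpath helpers that mirror the CSS one (neither name nor signature is confirmed here):

def step_example(self, url):
    resp = self.http.get(url)
    # the CSS helper, exactly as used in the quick start above
    titles = self.extract.css(resp.text, 'h3.event-title a', content=True)
    # hypothetical regex helper: name and signature are assumptions
    years = self.extract.regex(resp.text, r'\b20\d{2}\b')
    # hypothetical XPath helper: name and signature are assumptions
    links = self.extract.xpath(resp.text, '//h3[@class="event-title"]/a/@href')

For the actual helper names and signatures, refer to the documentation rather than this sketch.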