Async Recursive Crawler

This is a simple crawler that recursively crawls webpages whose URLs match a given regex, starting from a given URL and descending to a given maximum depth, building a graph of connected webpages as it goes. It uses the async/await coroutines introduced in PEP 492.
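A minimal sketch of that idea, assuming aiohttp for the HTTP requests; the function and parameter names here are illustrative, not the project's actual scraper.py:

```python
# Illustrative sketch, not the repository's scraper.py: fetch a page,
# extract links matching a regex, and recurse until max_depth.
import asyncio
import re

import aiohttp

LINK_RE = re.compile(r'href="(https?://[^"]+)"')  # crude link extractor

async def crawl(session, url, pattern, depth, max_depth, seen, graph):
    if depth > max_depth or url in seen:
        return
    seen.add(url)
    try:
        async with session.get(url) as resp:
            html = await resp.text()
    except aiohttp.ClientError:
        return
    links = [u for u in LINK_RE.findall(html) if pattern.search(u)]
    graph[url] = links  # record edges: page -> pages it links to
    # fan out to all child pages concurrently
    await asyncio.gather(*(
        crawl(session, link, pattern, depth + 1, max_depth, seen, graph)
        for link in links
    ))

async def main():
    graph, seen = {}, set()
    pattern = re.compile(r"example\.com")  # only follow matching URLs
    async with aiohttp.ClientSession() as session:
        await crawl(session, "https://example.com", pattern, 0, 2, seen, graph)
    print(f"crawled {len(seen)} pages")

if __name__ == "__main__":
    asyncio.run(main())
```

The fan-out via asyncio.gather is what lets each recursion level fetch all of its child pages concurrently rather than one at a time.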

Todo

  • Create a network visualization from the saved crawl data
  • Convert the MongoDB operations to bulk updates (see the sketch below)
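A hedged sketch of what that bulk conversion could look like with pymongo; the collection and field names (pages, url, links) are assumptions for illustration:

```python
# Instead of one update_one() round trip per page, batch all the
# writes into a single bulk_write() call.
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")
pages = client["crawler"]["pages"]  # assumed database/collection names

def save_graph(graph):
    """graph maps each URL to the list of URLs it links to."""
    ops = [
        UpdateOne({"url": url}, {"$set": {"links": links}}, upsert=True)
        for url, links in graph.items()
    ]
    if ops:
        pages.bulk_write(ops, ordered=False)  # one round trip for all pages
```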

Stats

These tests were run on a free-tier AWS EC2 server with this starting URL.
Current results:

  • Time taken for 494 requests (recursion level 1): 5.48 sec
  • Time taken for 36997 requests (recursion level 2): 415.46 sec
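A measurement of this kind can be reproduced by timing the event loop run, assuming the main() coroutine from the sketch above:

```python
# Wall-clock timing of the whole crawl with time.perf_counter.
import asyncio
import time

start = time.perf_counter()
asyncio.run(main())  # main() drives the crawl, as in the earlier sketch
print(f"Time taken: {time.perf_counter() - start:.2f} sec")
```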

Dependencies