Fork of michaelmelanson/spider
Description: A web spider written in Erlang.
Clone URL: git://github.com/formido/spider.git
name age message
file .gitignore Tue Sep 09 23:15:34 -0700 2008 Update .gitignore to kill .DS_Store files [formido]
file Emakefile Sun Sep 07 13:28:06 -0700 2008 Updated to handle common cases of URL parsing b... [formido]
file Makefile.am Sun Sep 14 20:57:43 -0700 2008 Hopefully no longer need headers from ICU, sinc... [xanados]
file README.markdown Thu Sep 11 00:22:39 -0700 2008 Requests and decodes zip, when available Also ... [formido]
file autogen.sh Sat Sep 13 23:28:33 -0700 2008 adding autogen.sh file [xanados]
file configure.ac Wed Sep 17 21:39:39 -0700 2008 Including patch for the configure file. There ... [xanados]
directory ebin/ Wed May 28 18:36:12 -0700 2008 Initial commit. [michaelmelanson]
directory googleurl/ Sat Sep 20 15:06:52 -0700 2008 Fixed the bug for too long of lists to canonica... [xanados]
directory include/ Sat Sep 06 17:36:34 -0700 2008 Added support for control of search depth and s... [xanados]
file install-sh Sat Sep 13 22:26:03 -0700 2008 Building up a more standard build system [xanados]
file missing Sat Sep 13 22:26:03 -0700 2008 Building up a more standard build system [xanados]
file mkinstalldirs Sun Sep 14 21:02:09 -0700 2008 Added mkinstalldirs, since I think this is supp... [xanados]
file spider.app Sun Sep 07 13:28:06 -0700 2008 Updated to handle common cases of URL parsing b... [formido]
file spider_run.sh Sat Sep 20 15:06:52 -0700 2008 Fixed the bug for too long of lists to canonica... [xanados]
directory src/ Sat Sep 20 15:06:52 -0700 2008 Fixed the bug for too long of lists to canonica... [xanados]

An Erlang Spider

As far as I know, this is the only free, open source, publicly available web crawler library in Erlang. Thanks, Michael Melanson. It should be pretty solid ere long.

DONE

  1. We now request gzip-compressed responses and decompress them.
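
For readers who want to see what the gzip step involves, here is a minimal sketch. It is not the spider's actual code: the module name `gzip_fetch_sketch` and both function names are made up for illustration, and it assumes a recent Erlang/OTP where the `inets` HTTP client is called `httpc`. Plain `http://` URLs only; `https://` would also need the `ssl` application started.

```erlang
-module(gzip_fetch_sketch).   % hypothetical module, not part of spider
-export([fetch/1, decode_body/2]).

%% Fetch Url while advertising gzip support, and return the decoded body.
fetch(Url) ->
    {ok, _} = application:ensure_all_started(inets),
    Request = {Url, [{"accept-encoding", "gzip"}]},
    %% Ask for a binary body so it can be handed straight to zlib.
    case httpc:request(get, Request, [], [{body_format, binary}]) of
        {ok, {{_Version, 200, _Reason}, Headers, Body}} ->
            {ok, decode_body(Headers, Body)};
        {ok, {{_Version, Status, Reason}, _Headers, _Body}} ->
            {error, {Status, Reason}};
        {error, Why} ->
            {error, Why}
    end.

%% Decompress only when the server says it actually used gzip.
%% Assumes header names arrive as lowercase strings.
decode_body(Headers, Body) ->
    case proplists:get_value("content-encoding", Headers) of
        "gzip" -> zlib:gunzip(Body);
        _      -> Body
    end.
```

Checking `content-encoding` before decompressing matters because a server is free to ignore `accept-encoding` and send the page uncompressed.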

TODO

  1. Jailing to directory, subdomain, or host. Easy to implement on top of the regex sandbox.
  2. Unit test for the regex sandbox to work out a general approach.
    • Use a framework like EUnit?
    • Use test/0 and MochiWeb's automatic reload module?
    • Use another spider's results as a reference?
  3. Add interesting examples to Wiki.
  4. Switch the build process to a regular Makefile (as Joe Armstrong prefers, but that's not why ;) ).
  5. Specify user-agent.
  6. Specify From header with email address.
  7. Robots.txt parsing.
  8. Specify whether to obey robots.txt.
  9. Parse HTML robots commands.
  10. Specify whether to obey HTML robots commands.
  11. Specify crawl delay.
  12. Specify allowed MIME types.
  13. Specify allowed extensions.
  14. Specify allowed page size.
  15. Specify how many times to try a failed URL.
  16. Specify delay between retries.
  17. Use Google's URL parser (googleurl).
  18. User callback to filter links before adding to tasks.
  19. User callback to filter links before processing.
  20. User callback to process page data structure returned by spider engine.
  21. Systematic, configurable logging to discover problems.
  22. Specify whether to allow redirects.
  23. Specify how many redirects to allow.
  24. Decide how to handle redirects exactly.
  25. Decide how to handle meta refresh.
  26. Add simplest possible quickstart to README.
  27. Add simple fun example to README.
  28. Cancel crawl.
  29. Data to be passed to aftercrawl callback:
    • headers
    • source
    • parsed source
    • last-modified
    • etag
    • httpstatus
    • httpreason
    • content-type
    • content-encoding
    • list of redirect URLs (and meta refreshes?)
  30. Handle soft 404s.
  31. Does googleurl lop off duplicate URL params?
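
To make TODO item 1 concrete, here is one way jailing could sit on top of a regex check. This is only a sketch: the module `jail_sketch`, its function names, and the `host` / `subdomain` / `directory` atoms are all hypothetical, and it assumes URLs have already been canonicalized (lowercased scheme and host) before they are tested.

```erlang
-module(jail_sketch).   % hypothetical module, not part of spider
-export([jail_regex/2, allowed/2]).

%% Build a compiled regex that confines a crawl to one scope.
%%   host      - exactly this host, any port
%%   subdomain - this domain and anything under it
%%   directory - any URL starting with this prefix
jail_regex(host, Host) ->
    compile("^https?://" ++ escape(Host) ++ "(:[0-9]+)?(/|$)");
jail_regex(subdomain, Domain) ->
    compile("^https?://([^/:]+\\.)?" ++ escape(Domain) ++ "(:[0-9]+)?(/|$)");
jail_regex(directory, Prefix) ->
    compile("^" ++ escape(Prefix)).

%% True when Url falls inside the jail described by the compiled regex.
allowed(Url, Regex) ->
    re:run(Url, Regex, [{capture, none}]) =:= match.

compile(Pattern) ->
    {ok, Regex} = re:compile(Pattern),
    Regex.

%% Backslash-escape regex metacharacters so "example.com" matches literally.
escape(Str) ->
    lists:flatmap(
      fun(C) ->
              case lists:member(C, ".\\+*?()[]{}|^$") of
                  true  -> [$\\, C];
                  false -> [C]
              end
      end, Str).
```

A possible use, with made-up URLs:

```erlang
Jail = jail_sketch:jail_regex(subdomain, "example.com"),
jail_sketch:allowed("http://www.example.com/page", Jail).
```

Anchoring the host with `(:[0-9]+)?(/|$)` is what keeps a lookalike such as `example.com.evil.org` out of the jail.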
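
TODO items 15 and 16 (retry count and retry delay) could be handled by a small wrapper around whatever function does the fetching. The sketch below is an assumption about how that might look, not a description of the spider's internals; `retry_sketch`, `with_retries/3`, and the `gave_up` tag are invented names.

```erlang
-module(retry_sketch).   % hypothetical module, not part of spider
-export([with_retries/3]).

%% Call Fun() up to MaxTries times, sleeping DelayMs between attempts.
%% Fun is expected to return {ok, Result} or {error, Reason}.
with_retries(Fun, MaxTries, DelayMs) when MaxTries >= 1, DelayMs >= 0 ->
    attempt(Fun, MaxTries, DelayMs).

attempt(Fun, TriesLeft, DelayMs) ->
    case Fun() of
        {ok, Result} ->
            {ok, Result};
        {error, Reason} when TriesLeft =< 1 ->
            %% Out of attempts: report the last failure.
            {error, {gave_up, Reason}};
        {error, _Reason} ->
            timer:sleep(DelayMs),
            attempt(Fun, TriesLeft - 1, DelayMs)
    end.
```

A fixed delay keeps the sketch short; a real crawler might prefer to back off longer after each failure, or to retry only on errors that look temporary.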