formido / spider forked from michaelmelanson/spider
Tree:
ade2440
xanados (author)
Sat Sep 20 15:06:52 -0700 2008
spider / README.markdown
An Erlang Spider
As far as I know, this is the only free, open source, publicly available web crawler application library in Erlang. Thanks, Michael Melanson. It should be pretty solid ere long.
DONE
- We now ask for and uncompress gzip.
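
For illustration, requesting and decoding gzip could look roughly like the sketch below. It is not the spider's actual code: the module name `gzip_fetch_sketch` and both function names are made up here, and it assumes a modern OTP release where the HTTP client is `httpc` in the `inets` application.

```erlang
-module(gzip_fetch_sketch).
-export([fetch/1, decode_body/2]).

%% Hypothetical helper: fetch Url (a string) while advertising gzip
%% support, and return the body uncompressed.
fetch(Url) ->
    {ok, _} = application:ensure_all_started(inets),
    Headers = [{"accept-encoding", "gzip"}],
    case httpc:request(get, {Url, Headers}, [], [{body_format, binary}]) of
        {ok, {{_Version, 200, _Reason}, RespHeaders, Body}} ->
            {ok, decode_body(RespHeaders, Body)};
        {ok, {{_Version, Status, Reason}, _RespHeaders, _Body}} ->
            {error, {Status, Reason}};
        {error, Reason} ->
            {error, Reason}
    end.

%% Gunzip the body only when the server says it is gzip-encoded.
%% httpc returns response header names in lower case.
decode_body(RespHeaders, Body) ->
    case proplists:get_value("content-encoding", RespHeaders) of
        "gzip" -> zlib:gunzip(Body);
        _Other -> Body
    end.
```

Calling `gzip_fetch_sketch:fetch("http://example.com/")` from the shell should return `{ok, Body}` with the body already uncompressed whenever the server honours the header.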
TODO
- Jailing to directory, subdomain, and host. Easy to implement on top of regex sandbox.
- Unit test for regex sandbox to work out general approach.
  - Use a framework like EUnit?
  - Use test/0 and MochiWeb's automatic reload module?
  - Use another spider's results as a reference.
- Add interesting examples to Wiki.
- Switch build process to regular makefile (as Joe Armstrong prefers, but that's not why ;) ).
- Specify User-Agent.
- Specify From header with email address.
- Robots.txt parsing.
- Specify whether to obey robots.txt.
- Parse HTML robots commands.
- Specify whether to obey HTML robots commands.
- Specify crawl delay.
- Specify allowed MIME types.
- Specify allowed extensions.
- Specify allowed page size.
- Specify how many times to retry a failed URL.
- Specify delay between retries.
- Use Google URL parser.
- User callback to filter links before adding to tasks.
- User callback to filter links before processing.
- User callback to process page data structure returned by spider engine.
- Systematic, configurable logging to discover problems.
- Specify whether to allow redirects.
- Specify how many redirects to allow.
- Decide how to handle redirects exactly.
- Decide how to handle meta refresh.
- Add simplest possible quickstart to README.
- Add simple fun example to README.
- Cancel crawl.
- Data to be passed to aftercrawl callback:
  - headers
  - source
  - parsed source
  - last-modified
  - etag
  - httpstatus
  - httpreason
  - content-type
  - content-encoding
  - list of redirect URLs (and meta refreshes?)
- Handle soft 404s.
- Does Google URL lop off duplicate URL params?
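
As a rough idea of what the host jailing item in the TODO list might amount to, here is a minimal sketch built on OTP's `re` module. The spider's own regex sandbox is not shown in this README, so nothing below uses its API: `jail_sketch`, `host_jail/1`, `allowed/2`, and `filter_links/2` are hypothetical names, and the pattern only considers plain `http` and `https` URLs.

```erlang
-module(jail_sketch).
-export([host_jail/1, allowed/2, filter_links/2]).

%% Build a compiled regex that matches only URLs on exactly Host,
%% with an optional port, followed by a path or the end of the URL.
host_jail(Host) ->
    Pattern = "^https?://" ++ escape_dots(Host) ++ "(:[0-9]+)?(/|$)",
    {ok, Re} = re:compile(Pattern, [caseless]),
    Re.

%% True when Url falls inside the jail described by Re.
allowed(Url, Re) ->
    re:run(Url, Re, [{capture, none}]) =:= match.

%% Keep only the links that stay inside the jail, e.g. before
%% adding them to the task queue.
filter_links(Urls, Re) ->
    [U || U <- Urls, allowed(U, Re)].

%% Host names contain only letters, digits, hyphens and dots, so
%% the dot is the only regex metacharacter that needs escaping.
escape_dots(Host) ->
    lists:flatmap(fun($.) -> "\\.";
                     (C)  -> [C]
                  end, Host).
```

Jailing to a subdomain or a directory would presumably be the same idea with a different pattern, for example one that allows any prefix before the host or requires a fixed path prefix after it.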
