xanados (author)
Sat Sep 20 15:06:52 -0700 2008
spider /
| name | age | message |
|---|---|---|
| .gitignore | Tue Sep 09 23:15:34 -0700 2008 | |
| Emakefile | Sun Sep 07 13:28:06 -0700 2008 | |
| Makefile.am | Sun Sep 14 20:57:43 -0700 2008 | |
| README.markdown | Thu Sep 11 00:22:39 -0700 2008 | |
| autogen.sh | Sat Sep 13 23:28:33 -0700 2008 | |
| configure.ac | Wed Sep 17 21:39:39 -0700 2008 | |
| ebin/ | Wed May 28 18:36:12 -0700 2008 | |
| googleurl/ | Sat Sep 20 15:06:52 -0700 2008 | |
| include/ | Sat Sep 06 17:36:34 -0700 2008 | |
| install-sh | Sat Sep 13 22:26:03 -0700 2008 | |
| missing | Sat Sep 13 22:26:03 -0700 2008 | |
| mkinstalldirs | Sun Sep 14 21:02:09 -0700 2008 | |
| spider.app | Sun Sep 07 13:28:06 -0700 2008 | |
| spider_run.sh | Sat Sep 20 15:06:52 -0700 2008 | |
| src/ | Sat Sep 20 15:06:52 -0700 2008 | |
An Erlang Spider
As far as I know, this is the only free, open-source, publicly available web crawler application library in Erlang. Thanks, Michael Melanson. It should be pretty solid ere long.
DONE
- We now ask for and uncompress gzip.
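
As a rough illustration of the gzip item above, here is a minimal sketch written against today's OTP standard library. It is not the repository's code: the module name `gzip_sketch` and both function names are hypothetical, and it assumes response headers arrive as a list of `{LowercaseName, Value}` string pairs.

```erlang
-module(gzip_sketch).
-export([request_headers/0, decode_body/2]).

%% Request headers that ask the server for a gzip-compressed response.
request_headers() ->
    [{"accept-encoding", "gzip"}].

%% Uncompress the body only when the response declares gzip encoding.
%% Assumption: Headers is a list of {LowercaseName, Value} string pairs.
%% Always returns a binary.
decode_body(Headers, Body) ->
    case proplists:get_value("content-encoding", Headers) of
        "gzip" -> zlib:gunzip(Body);
        _ -> iolist_to_binary(Body)
    end.
```

With `inets` started, the headers could be passed to something like `httpc:request(get, {Url, gzip_sketch:request_headers()}, [], [{body_format, binary}])`, and the returned headers and body handed to `decode_body/2`.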
TODO
- Jailing to directory, subdomain, and host. Easy to implement on top of regex sandbox.
- Unit test for regex sandbox to work out general approach.
  - Use framework like eunit?
  - Use test/0 and MochiWeb's automatic reload module?
  - Use another spider's result as reference.
- Add interesting examples to Wiki.
- Switch build process to regular makefile (as Joe Armstrong prefers, but that's not why ;) ).
- Specify User-Agent.
- Specify From header with email address.
- Robots.txt parsing.
- Specify whether to obey robots.txt.
- Parse HTML robots commands.
- Specify whether to obey HTML robots commands.
- Specify crawl delay.
- Specify allowed MIME types.
- Specify allowed extensions.
- Specify allowed page size.
- Specify how many times to try a failed URL.
- Specify delay between retrying.
- Use Google URL parser.
- User callback to filter links before adding to tasks.
- User callback to filter links before processing.
- User callback to process page data structure returned by spider engine.
- Systematic, configurable logging to discover problems.
- Specify whether to allow redirects.
- Specify how many redirects to allow.
- Decide how to handle redirects exactly.
- Decide how to handle meta refresh.
- Add simplest possible quickstart to README.
- Add simple fun example to README.
- Cancel crawl.
- Data to be passed to aftercrawl callback:
  - headers
  - source
  - parsed source
  - last-modified
  - etag
  - httpstatus
  - httpreason
  - content-type
  - content-encoding
  - list of redirect urls (and meta refreshes?)
- Handle soft 404s.
- Does google url lop off duplicate url params?
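
The host-jailing item in the list above is not implemented yet. One way it might look on top of a regex check is sketched below, using the standard `re` module. The module name `jail_sketch` and its functions are hypothetical and are not part of the spider's API. The sketch covers only the host case and assumes plain `http` or `https` URLs.

```erlang
-module(jail_sketch).
-export([host_jail/1, in_jail/2]).

%% Build a compiled pattern that matches http(s) URLs whose host is exactly
%% Host, with an optional port. Dots in Host are escaped so they match
%% literally; hostnames otherwise contain only letters, digits, and hyphens.
host_jail(Host) ->
    Escaped = re:replace(Host, "\\.", "\\\\.", [global, {return, list}]),
    Source = "^https?://" ++ Escaped ++ "(:[0-9]+)?([/?#]|$)",
    {ok, Pattern} = re:compile(Source, [caseless]),
    Pattern.

%% True when Url falls inside the jail described by Pattern.
in_jail(Url, Pattern) ->
    re:run(Url, Pattern, [{capture, none}]) =:= match.
```

A link filter could call `in_jail/2` on each extracted URL before queueing it. Directory and subdomain jails would need different patterns.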