Couch Crawler

A search engine built on top of couchdb-lucene.

Dependencies

Optionally for Yammer spidering:

Installation

Assuming couchdb-lucene was installed to the "_fti" endpoint, you can push Couch Crawler to your CouchDB instance with the command:

cd couchapp
couchapp push

This will create a new CouchDB database called "crawler" on the CouchDB instance at localhost:5984. To use a different database, edit couchapp/.couchapprc and run couchapp push again.
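As a minimal sketch of that change, the snippet below rewrites the database target in a .couchapprc file. It assumes the file follows couchapp's usual JSON layout (an "env" object containing a "default" entry with a "db" URL). The helper name set_couchapp_db and the example database name are hypothetical, so check your own file before relying on it.

```python
import json
from pathlib import Path


def set_couchapp_db(rc_path, db_url, env="default"):
    """Point one couchapp environment at db_url and return the new config.

    Assumes the JSON layout {"env": {"<name>": {"db": "<url>"}}}.
    """
    path = Path(rc_path)
    # Start from the existing settings if the file is already there,
    # so that any other keys in it are preserved.
    config = json.loads(path.read_text()) if path.exists() else {}
    config.setdefault("env", {}).setdefault(env, {})["db"] = db_url
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

From the repository root, a call such as set_couchapp_db("couchapp/.couchapprc", "http://localhost:5984/my_crawler") would retarget the default environment; a following couchapp push then goes to that database.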

To configure the crawler, copy python/couchcrawler-sample.cfg to python/couchcrawler.cfg and fill out the appropriate configuration values.

To start indexing pages, run the crawler script:

cd python
./scrapy-ctl.py crawl domain_to_crawl.com

While it's indexing, you can visit the search engine at the following URL:

http://localhost:5984/crawler/_design/crawler/index.html
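If you pushed to a different host or database, the address changes accordingly. The sketch below builds it from its parts. It assumes the design document keeps the name "crawler", as in the URL above, and search_page_url is a hypothetical helper name, not part of the project.

```python
from urllib.parse import quote


def search_page_url(host="localhost", port=5984, db="crawler", design_doc="crawler"):
    """Build the address of the search page served by the couchapp."""
    # Database and design document names are percent-encoded so that
    # characters such as "/" in a database name stay valid in the path.
    return "http://{}:{}/{}/_design/{}/index.html".format(
        host, port, quote(db, safe=""), quote(design_doc, safe="")
    )


print(search_page_url())
# prints: http://localhost:5984/crawler/_design/crawler/index.html
```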

Spiders

The crawler currently has spiders for:

MediaWiki
TWiki
Yammer

It's pretty easy to create your own. See python/couchcrawler/spiders/wiki.py for an example, or the Scrapy documentation for a more in-depth explanation.
