Lucandra /
| name | age | message |
|---|---|---|
| .classpath | Sun Oct 04 20:18:57 -0700 2009 | |
| .project | Sun Aug 23 20:44:01 -0700 2009 | |
| README | Sun Oct 04 21:26:24 -0700 2009 | |
| bookmarks.tsv | Thu Oct 01 20:34:08 -0700 2009 | |
| build.xml | Sun Oct 04 20:18:57 -0700 2009 | |
| config/ | Tue Nov 10 21:17:47 -0800 2009 | |
| gen-java/ | Thu Oct 01 20:34:08 -0700 2009 | |
| lib/ | Thu Oct 01 20:34:08 -0700 2009 | |
| run_demo.sh | Fri Oct 02 20:40:23 -0700 2009 | |
| src/ | Wed Dec 02 20:55:28 -0800 2009 | |
| test/ | Wed Nov 18 18:22:35 -0800 2009 | |
README
Lucandra - Lucene on Cassandra
http://twitter.com/tjake
==============================

Lucandra provides a Lucene IndexReader and IndexWriter that interface with Cassandra.

To get started, run the following:

  1. Set up Cassandra with the storage-conf.xml in config
  2. ant lucandra.jar
  3. ant test -Dcassandra.host=127.0.0.1 -Dcassandra.port=9999 -Dcassandra.framed=false

  # edit run_demo.sh with appropriate settings (delicious clone)
  4. run_demo.sh -index bookmarks.tsv
  5. run_demo.sh -search title:linu*

Background
==========

Storing an inverted index in Cassandra was the initial use case for Cassandra at Facebook.
The Cassandra wiki discusses this:

  "You can think of each super column name as a term and the columns within as the
  docids with rank info and other attributes being a part of it. If you have keys as
  the userids then you can have a per-user index stored in this form. This is how the
  per user index for term search is laid out for Inbox search at Facebook."

Initially we implemented Lucene support with supercolumns as described, but we ran into
a major scaling issue when we tried to index all of Wikipedia. It turns out Cassandra
keeps the whole supercolumn in memory for a given key. Also, all columns for a key are
tied to one Cassandra node, so we don't gain much scalability this way.

Thankfully, Cassandra recently added support for distributed ordered keys, which allows
us to use keys to store index terms without supercolumns.

Implementation Notes
====================

The Lucandra Cassandra config looks like this:

  <Keyspaces>
    <Keyspace Name="Lucandra">
      <ColumnFamily CompareWith="BytesType" Name="TermVectors"/>
      <ColumnFamily CompareWith="BytesType" Name="Documents"/>
    </Keyspace>
  </Keyspaces>

* Document IDs are currently random and autogenerated.

* Term keys and document keys are encoded as follows (using a random binary delimiter):

      Term Key                     col name     value
      "index_name/field/term"  => { documentId , position vector }

      Document Key
      "index_name/documentId"  => { fieldName , value }

The IndexReader caches terms aggressively during search and tries to avoid a lot of
back and forth with Cassandra.

What Works
==========

* Indexing (though slow for now)
* Search
* Sort
* Wildcards and other Lucene magic
* Real-Time

What's Missing (for now)
========================

* You can't delete/update documents
* No normalizations are stored (no scoring)
* Document fields are always stored as binary
* You can't walk the documents with the IndexReader
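
Key Layout Sketch
=================

To make the key layout from the Implementation Notes concrete, here is a minimal, self-contained Java sketch. It is not Lucandra's code: the class name KeyLayoutSketch, the "\u0001" delimiter, and the in-memory TreeMap standing in for the TermVectors column family are all assumptions made for illustration. It shows one plausible reason ordered keys are useful: with term keys sorted, a wildcard such as title:linu* can be answered with a range scan over the key prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class KeyLayoutSketch {
    // Hypothetical delimiter; the README only says a random binary delimiter is used.
    static final String DELIM = "\u0001";

    // Stand-in for the TermVectors column family:
    // term key -> (documentId -> term positions). A TreeMap keeps keys ordered.
    private final SortedMap<String, Map<String, List<Integer>>> termVectors = new TreeMap<>();

    // "index_name/field/term"
    static String termKey(String index, String field, String term) {
        return index + DELIM + field + DELIM + term;
    }

    // "index_name/documentId"
    static String documentKey(String index, String docId) {
        return index + DELIM + docId;
    }

    // Record that `term` occurs in `docId` at `position`.
    void addTerm(String index, String field, String term, String docId, int position) {
        termVectors
            .computeIfAbsent(termKey(index, field, term), k -> new TreeMap<>())
            .computeIfAbsent(docId, k -> new ArrayList<>())
            .add(position);
    }

    // Range scan over ordered keys: all terms in a field that start with `prefix`.
    List<String> termsWithPrefix(String index, String field, String prefix) {
        String start = termKey(index, field, prefix);
        String end = start + Character.MAX_VALUE; // upper bound for keys sharing the prefix
        List<String> terms = new ArrayList<>();
        for (String key : termVectors.subMap(start, end).keySet()) {
            terms.add(key.substring(key.lastIndexOf(DELIM) + 1));
        }
        return terms;
    }

    public static void main(String[] args) {
        KeyLayoutSketch sketch = new KeyLayoutSketch();
        sketch.addTerm("bookmarks", "title", "linux", "doc-1", 0);
        sketch.addTerm("bookmarks", "title", "linus", "doc-2", 3);
        sketch.addTerm("bookmarks", "title", "lucene", "doc-1", 2);
        // prints [linus, linux]
        System.out.println(sketch.termsWithPrefix("bookmarks", "title", "linu"));
    }
}
```

In a real deployment the range scan would be issued against Cassandra rather than a local map, and the delimiter would need to be a byte sequence that cannot appear in index names, field names, or terms.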







