Description: Lucandra = Lucene + Cassandra
Clone URL: git://github.com/tjake/Lucandra.git
type       name           age                             message
file       .classpath     Sun Oct 04 20:18:57 -0700 2009  fix for sorted results with test [tjake]
file       .project       Sun Aug 23 20:44:01 -0700 2009  Initial commit. untested but i think it might j... [tjake]
file       README         Sun Oct 04 21:26:24 -0700 2009  typos [tjake]
file       bookmarks.tsv  Thu Oct 01 20:34:08 -0700 2009  getting ready for a first release with simple demo [tjake]
file       build.xml      Sun Oct 04 20:18:57 -0700 2009  fix for sorted results with test [tjake]
directory  config/        Tue Nov 10 21:17:47 -0800 2009  performance fix for range queries [tjake]
directory  gen-java/      Thu Oct 01 20:34:08 -0700 2009  getting ready for a first release with simple demo [tjake]
directory  lib/           Thu Oct 01 20:34:08 -0700 2009  getting ready for a first release with simple demo [tjake]
file       run_demo.sh    Fri Oct 02 20:40:23 -0700 2009  renamed demo runner [tjake]
directory  src/           Wed Dec 02 20:55:28 -0800 2009  major bug affecting search, out of order docIds. [tjake]
directory  test/          Wed Nov 18 18:22:35 -0800 2009  fix to use proper delimiter throught the code [tjake]
README
Lucandra - Lucene on Cassandra
http://twitter.com/tjake
==============================

Lucandra provides a Lucene IndexReader and IndexWriter that interface with Cassandra.

To get started, run the following:

1. Set up Cassandra with the storage-conf.xml in config

2. ant lucandra.jar
3. ant test -Dcassandra.host=127.0.0.1 -Dcassandra.port=9999 -Dcassandra.framed=false

# Edit run_demo.sh with appropriate settings (the demo is a delicious clone)
4. run_demo.sh -index bookmarks.tsv
5. run_demo.sh -search title:linu*


Background
==========

Storing an inverted index in Cassandra was the initial use-case for Cassandra at Facebook.
The Cassandra wiki discusses this:

"You can think of each super column name as a term and the columns within as the docids
with rank info and other attributes being a part of it. If you have keys as the userids
then you can have a per-user index stored in this form. This is how the per user index
for term search is laid out for Inbox search at Facebook."

Initially we implemented Lucene support with supercolumns as described, but we ran into
a major scaling issue when we tried to index all of Wikipedia.
It turns out that Cassandra keeps the whole supercolumn in memory for a given key.
Also, all columns for a key are tied to one Cassandra node, so we don't gain much scalability this way.

Thankfully, Cassandra recently added support for distributed ordered keys, which
lets us store index terms as keys, without supercolumns.
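
To illustrate why ordered keys help, here is a minimal sketch. It is not
Lucandra's actual code. It assumes an in-memory TreeMap as a stand-in for
Cassandra's ordered key space and uses "/" in place of the binary delimiter;
the class and method names are hypothetical. Because keys sort
lexicographically, every term matching a prefix such as title:linu* sits in
one contiguous key range and can be read with a single range scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustration only: a sorted map standing in for an ordered key space. */
final class OrderedKeyScanSketch {

    /** Returns all keys that start with the prefix, using one range scan. */
    static List<String> keysWithPrefix(SortedMap<String, ?> rows, String prefix) {
        // Keys with this prefix sort at or after the prefix itself and before
        // prefix + '\uffff' (assuming no key has '\uffff' right after the prefix).
        String end = prefix + Character.MAX_VALUE;
        return new ArrayList<>(rows.subMap(prefix, end).keySet());
    }

    public static void main(String[] args) {
        // Hypothetical term keys shaped like "index_name/field/term".
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("bookmarks/title/linus", "doc3");
        rows.put("bookmarks/title/linux", "doc1,doc2");
        rows.put("bookmarks/title/lucene", "doc4");
        rows.put("bookmarks/url/linux", "doc1");

        // Prints [bookmarks/title/linus, bookmarks/title/linux]
        System.out.println(keysWithPrefix(rows, "bookmarks/title/linu"));
    }
}
```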

Implementation Notes
====================

The Lucandra Cassandra config looks like this:

<Keyspaces>
  <Keyspace Name="Lucandra">
      <ColumnFamily CompareWith="BytesType" Name="TermVectors"/>
      <ColumnFamily CompareWith="BytesType" Name="Documents"/>
  </Keyspace>
</Keyspaces>


* Document IDs are currently random and autogenerated.

* Term keys and document keys are encoded as follows (using a random binary delimiter, shown as "/" below)

      Term Key                     col name         value
      "index_name/field/term" => { documentId , position vector }

      Document Key
      "index_name/documentId" => { fieldName , value }
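
The key layout above can be sketched in Java roughly as follows. This is an
illustration under stated assumptions, not the code in src/: the class name,
the method names, and the delimiter value "\u0001" are all hypothetical (the
README only says that a binary delimiter is used).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Illustration only: builds row keys in the layout described above. */
final class LucandraKeySketch {

    // Hypothetical delimiter value, chosen so it cannot clash with normal text.
    static final String DELIMITER = "\u0001";

    /** Row key for a term: index_name, field, term. */
    static String termKey(String indexName, String field, String term) {
        return indexName + DELIMITER + field + DELIMITER + term;
    }

    /** Row key for a document: index_name, documentId. */
    static String documentKey(String indexName, String documentId) {
        return indexName + DELIMITER + documentId;
    }

    /** Splits a key back into its parts. */
    static String[] split(String key) {
        return key.split(Pattern.quote(DELIMITER), -1);
    }

    public static void main(String[] args) {
        String key = termKey("bookmarks", "title", "linux");
        // Prints [bookmarks, title, linux]
        System.out.println(Arrays.toString(split(key)));
    }
}
```

Under each term key, the columns would then map a documentId to its position
vector; under each document key, they map a fieldName to its stored value.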


The IndexReader caches terms aggressively during search to avoid repeated round trips to Cassandra.
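
As a rough sketch of the caching idea (not the IndexReader's actual
implementation), a per-search cache can remember what was fetched for each
term key so that repeated lookups skip the remote call. The class name, the
loader function, and the load counter below are assumptions made for
illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustration only: memoizes lookups by term key. */
final class TermCacheSketch<V> {

    private final Map<String, V> cache = new HashMap<>();
    private final Function<String, V> loader; // stands in for a Cassandra read
    private int loads = 0;

    TermCacheSketch(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(String termKey) {
        V value = cache.get(termKey);
        if (value == null) {
            value = loader.apply(termKey);
            loads++;
            cache.put(termKey, value);
        }
        return value;
    }

    /** Number of times the loader was actually called. */
    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        TermCacheSketch<String> cache =
                new TermCacheSketch<>(key -> "postings for " + key);
        cache.get("bookmarks/title/linux");
        cache.get("bookmarks/title/linux"); // served from the cache
        System.out.println(cache.loads()); // prints 1
    }
}
```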


What Works
==========
* Indexing (though slow for now)
* Search
* Sort
* Wildcards and other Lucene magic
* Real-Time

What's Missing (for now)
========================
* You can't delete or update documents
* No norms are stored (so no scoring)
* Document fields are always stored as binary
* You can't walk the documents with the IndexReader