Discontinued in favour of Cassandra Lucene Index

This project has been discontinued in favour of Cassandra Lucene Index, which provides exactly the same features as a plugin for Apache Cassandra instead of a fork. Distributing a plugin is preferable to maintaining a fork, which is harder to keep up to date. This has been possible thanks to CASSANDRA-8717, among others.

Stratio Cassandra

Stratio Cassandra is a fork of Apache Cassandra in which the index functionality has been extended to provide near real-time search similar to ElasticSearch or Solr, including full text search capabilities and free multivariable search. This is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio Cassandra is one of the core modules on which Stratio's BigData platform (SDS) is based.

Index relevance queries allow you to retrieve the n most relevant results satisfying a query. The coordinator node sends the query to each node in the cluster, each node returns its n best results, and the coordinator then combines these partial results and returns the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.
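
The merge step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Stratio Cassandra implementation: the `Hit` tuple and the `node_top_n` and `coordinator_top_n` functions are hypothetical names, and a hit is assumed to be a (score, row key) pair.

```python
import heapq
from typing import Iterable, List, Tuple

# Assumption: a hit is a (relevance score, row key) pair.
Hit = Tuple[float, str]


def node_top_n(hits: Iterable[Hit], n: int) -> List[Hit]:
    """Each node returns only its n best-scoring local hits."""
    return heapq.nlargest(n, hits)


def coordinator_top_n(partials: Iterable[List[Hit]], n: int) -> List[Hit]:
    """The coordinator merges the partial results and keeps the n best overall."""
    merged = (hit for partial in partials for hit in partial)
    return heapq.nlargest(n, merged)


# Two nodes, each holding a different slice of the data.
node_a = [(0.9, "k1"), (0.2, "k2"), (0.5, "k3")]
node_b = [(0.7, "k4"), (0.95, "k5")]

partials = [node_top_n(node_a, 2), node_top_n(node_b, 2)]
top = coordinator_top_n(partials, 2)
print(top)
```

Because every node sends at most n hits, the coordinator handles at most n hits per node, regardless of how many rows match on each node.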

Index filtered queries are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark through Stratio Deep. Adding Lucene filters to the job input can dramatically reduce the amount of data to be processed, avoiding a full scan.

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses and page on the filtered results.

More detailed information is available in the Stratio Cassandra documentation.

Features

Stratio Cassandra and its integration with Lucene search technology provide:

  • Big data full text search
  • Relevance scoring and sorting
  • General top-k queries
  • Complex boolean queries (and, or, not)
  • Near real-time search
  • Custom analyzers
  • CQL3 support
  • Wide rows support
  • Composite partition and clustering keys support
  • Support for indexing columns that are part of the primary key
  • Third-party drivers compatibility
  • Spark compatibility
  • Hadoop compatibility

Not yet supported:

  • Thrift API
  • Legacy compact storage option
  • Indexing counter columns
  • Columns with TTL
  • CQL user defined types
  • Static columns

Requirements

  • Java >= 1.7 (OpenJDK and Sun have been tested)
  • Ant >= 1.8

Building and running

Stratio Cassandra is distributed as a fork of Apache Cassandra, so building, installing, operating and maintaining it are essentially the same. To build and run it, type:

ant build
bin/cassandra -f

Now you can do some tests using the Cassandra Query Language:

bin/cqlsh

The Lucene index files are stored in the same directories as Cassandra's data files. The default data directory is /var/lib/cassandra/data, and each index is placed next to the SSTables of its indexed column family.

For more details about Apache Cassandra please see its documentation.

Example

We will create the following table to store tweets:

CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
    id INT PRIMARY KEY,
    user TEXT,
    body TEXT,
    time TIMESTAMP,
    lucene TEXT
);

We have created a column called lucene to link the index queries. This column will not store data. Now you can create a custom Lucene index on it with the following statement:

CREATE CUSTOM INDEX tweets_index ON tweets (lucene) 
USING 'com.stratio.cassandra.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id   : {type : "integer"},
            user : {type : "string"},
            body : {type : "text",  analyzer : "english"},
            time : {type : "date", pattern  : "yyyy/MM/dd"}
        }
    }'
};

This will index all the columns in the table with the specified types, and the index will be refreshed once per second.

Now, to query the 100 most relevant tweets whose body field contains the phrase "big data gives organizations":

SELECT * FROM tweets WHERE lucene='{
    query : {type:"phrase", field:"body", values:["big","data","gives","organizations"]}
}' limit 100;

To restrict the search to tweets within a certain date range, add a filter to the search as follows:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
    query  : {type:"phrase", field:"body", values:["big","data","gives","organizations"]}
}' limit 100;

To refine the search to get only the tweets written by users whose name starts with "a":

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", values:["big","data","gives","organizations"]}
}' limit 100;

To get the 100 most recent filtered results you can use the sort option:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", values:["big","data","gives","organizations"]},
    sort  : {fields: [ {field:"time", reverse:true} ] }
}' limit 100;
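
When searches like the ones above are issued from application code, it can be convenient to assemble the search expression as a data structure and serialise it, rather than concatenating strings by hand. The following Python sketch shows one way to do that. The helper names (`phrase`, `date_range`, `prefix`, `must`, `search_cql`) are hypothetical and not part of Stratio Cassandra, the dates are illustrative, and it is an assumption that the index accepts strict JSON with quoted keys as well as the relaxed syntax used in this README.

```python
import json


def phrase(field, words):
    # Phrase condition, using the same keys as the README examples.
    return {"type": "phrase", "field": field, "values": list(words)}


def date_range(field, lower, upper):
    # Bounds are strings written with the date pattern declared in the index schema.
    return {"type": "range", "field": field, "lower": lower, "upper": upper}


def prefix(field, value):
    return {"type": "prefix", "field": field, "value": value}


def must(*conditions):
    # Boolean condition where every sub-condition has to match.
    return {"type": "boolean", "must": list(conditions)}


def search_cql(table, search, limit):
    # CQL string literals escape a single quote by doubling it.
    payload = json.dumps(search).replace("'", "''")
    return "SELECT * FROM {} WHERE lucene='{}' LIMIT {};".format(table, payload, limit)


search = {
    "filter": must(date_range("time", "2014/04/01", "2014/04/25"),
                   prefix("user", "a")),
    "query": phrase("body", ["big", "data", "gives", "organizations"]),
    "sort": {"fields": [{"field": "time", "reverse": True}]},
}
statement = search_cql("tweets", search, 100)
print(statement)
```

The resulting string can be passed to cqlsh or to any CQL driver as an ordinary statement.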

Finally, if you want to restrict the search to a certain token range:

SELECT * FROM tweets WHERE lucene='{
    filter : {type:"boolean", must:[
                   {type:"range", field:"time", lower:"2014/04/25", upper:"2014/05/01"},
                   {type:"prefix", field:"user", value:"a"} ] },
    query  : {type:"phrase", field:"body", values:["big","data","gives","organizations"]}
}' AND token(id) >= token(0) AND token(id) < token(10000000) limit 100;

This last feature is the basis for supporting Hadoop, Spark and other MapReduce frameworks.
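
To illustrate why token range restrictions matter for such frameworks, here is a minimal Python sketch of how a job could split a token range into contiguous sub-ranges and build one restriction per split, so that each worker scans only its own slice. The function names `token_splits` and `split_clauses` are hypothetical. Unlike the example above, which derives its bounds from key values with `token(0)`, this sketch compares against raw token values, which assumes a partitioner with integer tokens such as Murmur3Partitioner. It is not how Hadoop or Spark connectors actually compute their splits.

```python
def token_splits(lower, upper, parts):
    """Split the half-open range [lower, upper) into `parts` contiguous ranges."""
    width = upper - lower
    if parts < 1 or width < parts:
        raise ValueError("need 1 <= parts <= upper - lower")
    step, extra = divmod(width, parts)
    splits, start = [], lower
    for i in range(parts):
        # The first `extra` splits take one additional token each.
        end = start + step + (1 if i < extra else 0)
        splits.append((start, end))
        start = end
    return splits


def split_clauses(column, splits):
    """Build one CQL token restriction per split."""
    return ["token({0}) >= {1} AND token({0}) < {2}".format(column, start, end)
            for start, end in splits]


splits = token_splits(0, 10, 3)
clauses = split_clauses("id", splits)
for clause in clauses:
    print(clause)
```

Each clause would be appended to the Lucene search with AND, in the same position as the token restriction in the example above.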

Please refer to the comprehensive Stratio Cassandra documentation for further details.