getAllRows for Column Family with ByteOrder Partitioner does not return all rows #126

vgarud opened this Issue Oct 10, 2012 · 8 comments



vgarud commented Oct 10, 2012

My row key is a composite key and I am using the AnnotatedCompositeSerializer.
I am using Astyanax version 1.0.3.

Here is my code snippet. It only returns 100 rows (as set in setRowLimit).
Am I doing something wrong? Or is there another way to iterate through all rows of a CF with the ByteOrderedPartitioner?

// code begin
ColumnFamilyQuery<WDKey, WDColumn> nq = keyspace.prepareQuery(WeatherColumnFamilies.CF_WEATHER_DATA);

ArrayList<WDKey> list = new ArrayList<WDKey>();
try {
    // setRowLimit sets the page size per fetch; iterating the result
    // is supposed to page through all rows transparently.
    Rows<WDKey, WDColumn> rows = nq.getAllRows()
        .setRowLimit(100)
        .withColumnRange(new RangeBuilder().setLimit(0).build()) // keys only, no columns
        .execute()
        .getResult();
    for (Row<WDKey, WDColumn> r : rows) {
        WDKey key = r.getKey();
        list.add(key);
        System.out.println("Key - " + key.getCountry() + "-" + key.getZip());
    }
} catch (ConnectionException e) {
    throw new RuntimeException("Can't get key list", e);
}
// code end


elandau commented Nov 15, 2012

ByteOrderPartitioner is not recommended and as such Astyanax does not support it. Is there any reason why you chose to use BOP?

My company is using the ByteOrderedPartitioner because it fits our use case really well. We hold data for thousands of clients (too many to create a ColumnFamily per client), where each client has thousands to millions of data items. Most data access is key/value get/set/append, but we also frequently extract all data for a particular client to produce data feeds, export to Hadoop, or bootstrap a secondary index (e.g. Elasticsearch). With the ByteOrderedPartitioner we can extract all data for a given client using range scans and sequential I/O--it's very fast.

To avoid cluster hotspots we manually prefix every row key with one byte of a Murmur3 hash of the rest of the row key. This spreads data around the ring so that we get a nice, even distribution, and we can still extract all data for a given client with 256 range scans (one per possible prefix byte). In practice, the ByteOrderedPartitioner has worked very well for this design.
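The hash-prefix trick described above can be sketched in plain Java. This is a minimal sketch, not Astyanax code: the class and method names are hypothetical, and the JDK's MD5 stands in for the Murmur3 hash the comment mentions (any stable hash works here, since only one byte is kept).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PrefixedKey {

    // Prepend one hash byte so that otherwise-ordered keys spread evenly
    // around the ring. MD5 stands in for Murmur3; only digest[0] is used.
    public static byte[] prefix(byte[] rawKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(rawKey);
            return ByteBuffer.allocate(1 + rawKey.length)
                             .put(digest[0])
                             .put(rawKey)
                             .array();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    // To read back all rows for one client, issue one range scan per
    // possible prefix byte (256 scans), each starting at prefix + clientKey.
    public static byte[] rangeStart(int prefixByte, byte[] clientKey) {
        return ByteBuffer.allocate(1 + clientKey.length)
                         .put((byte) prefixByte)
                         .put(clientKey)
                         .array();
    }

    public static void main(String[] args) {
        byte[] key = prefix("client-42:item-7".getBytes(StandardCharsets.UTF_8));
        System.out.println("prefixed key length: " + key.length);
    }
}
```

The prefix costs one extra byte per key and turns "scan one client" into 256 short scans instead of one, which is the trade-off the comment describes.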

As a sample, I am storing XML as values, with a manually calculated MD5 hash as the key. Obviously, the ByteOrderedPartitioner will do better with an already calculated MD5.

@elandau: I have a different use case:

I am currently using the RandomPartitioner and a column family with account UUIDs as row keys, query time-UUIDs as column names, and JSON-serialized query results as column values. The wide-row approach works great, but storing the JSON is terribly inefficient. A back-of-the-envelope calculation suggests I would save 90% of the storage space (for my data) if I used an order-preserving partitioner and a column family with the composite of the account UUID and the query time-UUID as the row key, the JSON field names as the static column names, and only the JSON field values as the column values. For example, instead of a row looking like this:

 row key: <some-account's-uuid>
      column name: <some-query's-time-uuid>
           column value: "field1: value1, field2: value2, field3: value3, ..."
      column name: <some-OTHER-query's-time-uuid>
           column value: "field1: value4, field2: value5, field3: value6"

I would have this:

 row key: <some-account's-uuid, some-query's-time-uuid>
      column name: "field1"
           column value: "value1"
      column name: "field2"
           column value: "value2"
      column name: "field3"
           column value: "value3"

 row key: <some-account's-uuid, some-OTHER-query's-time-uuid>
      column name: "field1"
           column value: "value4"
      column name: "field2"
           column value: "value5"
      column name: "field3"
           column value: "value6"

The benefit of the OPP relative to the RP is that the field names need only be stored once, in the column-family definition, rather than in each and every column value. When the field names are several characters each and the field values are only a few bytes, storing just those few bytes per record, rather than (number of fields × length of their names) plus those few bytes, is much more attractive.
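The arithmetic behind that savings estimate can be made concrete. A minimal sketch with made-up sizes (the field count and lengths below are illustrative assumptions, not the poster's actual data):

```java
public class StorageEstimate {

    // RP/wide-row layout: every field name AND value is serialized into
    // the JSON blob stored in each column value.
    static long jsonLayoutBytes(long records, int fields, int nameLen, int valueLen) {
        return records * (long) fields * (nameLen + valueLen);
    }

    // OPP/static-column layout: only values are stored per row; the field
    // names live once in the column-family definition.
    static long staticLayoutBytes(long records, int fields, int valueLen) {
        return records * (long) fields * valueLen;
    }

    public static void main(String[] args) {
        long records = 1_000_000;
        long json = jsonLayoutBytes(records, 10, 18, 2);  // long names, tiny values
        long cols = staticLayoutBytes(records, 10, 2);
        System.out.printf("saved %.0f%%%n", 100.0 * (json - cols) / json);
        // prints "saved 90%"
    }
}
```

The savings ratio is just nameLen / (nameLen + valueLen), so it approaches the quoted 90% exactly when names dwarf values.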

Feedback appreciated.


carrino commented Mar 26, 2013

Can we get an idea of what work needs to be done to support ByteOrderedPartitioner? Given that virtual nodes take away a lot of the pain that was caused by using BOP, I think it isn't crazy to run with ordering anymore.

I have many use cases where I need to either scan rows in order (range read) or need to do a lookup for the first row greater than some value (range read limit 1). It'd be nice to get pointed in the right direction to add this. I'd rather start with a library like astyanax and add range read support than to just roll my own client queries for everything.
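The two access patterns above can be illustrated with a sorted map standing in for a byte-ordered ring. This is only a sketch of the semantics: real BOP support would issue token-range queries against the cluster, not consult a local map.

```java
import java.util.Map;
import java.util.TreeMap;

public class OrderedScans {
    public static void main(String[] args) {
        // TreeMap stands in for rows laid out in key order on a BOP ring.
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("alpha", "1");
        rows.put("bravo", "2");
        rows.put("delta", "3");
        rows.put("echo",  "4");

        // Range read: all rows with keys in ["bravo", "echo")
        Map<String, String> range = rows.subMap("bravo", true, "echo", false);
        System.out.println(range.keySet()); // [bravo, delta]

        // "Range read limit 1": first row with key >= "charlie"
        System.out.println(rows.ceilingKey("charlie")); // delta
    }
}
```

Neither operation is expressible under a hashed partitioner, since hashing destroys the key ordering both rely on.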

I believe we could at least implement a simplified use case: 16-byte row keys (a fixed-size array of 16 bytes)...
but I spent 4 hours with the Astyanax "Partitioner" interface implementations and haven't yet found a proper, simple solution... and I don't yet fully understand the details of the connection pool, discovery, etc.

Looks like it is not easy...


@xedin xedin referenced this issue in thinkaurelius/titan Apr 25, 2013


g.V.count() returning only half the vertices. #227

Byte order partitioning is not recommended in Cassandra.

Although use cases for it may exist, their number doesn't warrant the effort to fully support it in Astyanax.

@ckalantzis ckalantzis closed this May 5, 2015

carrino commented May 5, 2015

I think the amount of FUD around ordered keys vs. hashed keys is quite unwarranted. Once vnodes were added, the only remaining argument against ordered keys is "you need to think about how your data is laid out or you will get hotspots". I think this is a weak argument, because you really should think about how your data is laid out.

Byte order is strictly better than hashed, because you can always turn an ordered key into a hashed key by prepending the hash. Saying that hashing is always the correct choice for all users is, I think, a mistake.
