seancribbs edited this page Apr 27, 2012 · 5 revisions

Secondary Indexes

At the end of this guide, you should be familiar with:

  • Adding secondary indexes to your values
  • Performing equality and range queries across indexes
  • Using the special $bucket and $key indexes
  • Using index queries as input to MapReduce

This and the other guides in this wiki assume you have Riak installed locally. If you don't have Riak already, please read and follow how to install Riak and then come back to this guide. If you haven't yet installed the client library, please do so before starting this guide.

This guide also assumes you know how to connect the client to Riak and store and retrieve values. All examples assume a local variable named client, which is an instance of Riak::Client and points at the Riak node you want to work with.

How Secondary Indexes (aka 2I) work

Secondary indexes have been one of the most popular features of Riak since the 1.0 release; however, they work very differently from both Search and a traditional index in a relational database. Here's how:

  • Secondary indexes are discrete; that is, you can only query on the entire secondary key. Search, on the other hand, lets you query inside each field.

  • Secondary indexes are defined per object. There is no schema or automatic indexing for 2I; you just add the indexes you want to the object before you store it.

  • Secondary indexes are stored in the same location as your regular value. This keeps an index entry consistent with the value it points to, but it also means Riak has to query a large portion of the cluster to satisfy any query. (This is also called a "coverage query" and is implemented similarly to list-keys and list-buckets, although it is much more efficient.)

  • Secondary indexes are currently only supported on the LevelDB storage engine (and the memory engine in the upcoming 1.2 release), so if you want to query them, make sure the riak_kv section of your app.config file contains the snippet below:

      {storage_backend, riak_kv_eleveldb_backend}

Now that you've got that set, restart your Riak node if necessary and let's start playing with 2I!
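To make the "discrete" and "coverage query" points above more concrete, here is a toy, in-memory model in plain Ruby. It is only a sketch under simplifying assumptions: ToyPartition and coverage_query are hypothetical names, not part of Riak or the client, and the model asks every partition, whereas Riak only needs a covering subset of the cluster.

```ruby
require 'set'

# Toy model only: each "partition" keeps the index entries for the
# objects it stores, mirroring how 2I co-locates index data with values.
# ToyPartition and coverage_query are hypothetical, not Riak APIs.
class ToyPartition
  def initialize
    # Maps [index_name, index_value] to the Set of keys carrying that entry.
    @index = Hash.new { |hash, entry| hash[entry] = Set.new }
  end

  def put(key, indexes)
    indexes.each do |name, values|
      values.each { |value| @index[[name, value]] << key }
    end
  end

  # Only whole-value matches are possible; nothing looks "inside" a value.
  def query(name, value)
    @index.fetch([name, value], Set.new).to_a
  end
end

# Any partition may hold a match, so the query is sent to all of them.
def coverage_query(partitions, name, value)
  partitions.flat_map { |partition| partition.query(name, value) }.sort
end

partitions = Array.new(4) { ToyPartition.new }
partitions[0].put('k1', 'email_bin' => ['sean@basho.com'])
partitions[3].put('k2', 'email_bin' => ['roach@basho.com'])

whole_match   = coverage_query(partitions, 'email_bin', 'sean@basho.com')
partial_match = coverage_query(partitions, 'email_bin', 'basho.com')
```

In this model whole_match contains 'k1', while partial_match is empty because 'basho.com' is not a complete index value.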

Adding indexes to your values

In order to find things with 2I, we have to add some indexes first. Let's say I'm storing user profile information in Riak, and I want to look profiles up by email address or handle. Naturally, you'd want users to be able to change their email address and handle too, so we can't use either of those as the key. Instead, let's use an arbitrary identifier (chosen by Riak in this case), and add indexes on those fields so we can look them up later. First, I'll initialize a new RObject to store my profile data:

sean = client['users'].new
# => #<Riak::RObject {users} [application/json]:nil>
sean.data = {:name => "Sean Cribbs",
             :email => "sean@basho.com",
             :handle => "seancribbs"}

Now I'll add index entries for the email address and the handle by working with the indexes accessor.

sean.indexes['email_bin'] << 'sean@basho.com'
sean.indexes['handle_bin'] << 'seancribbs'

You should notice two things in the above snippet:

  1. The key in the indexes Hash ends with _bin. This means that the index we're storing is a String, or "binary".
  2. I didn't set the value, but instead appended it to the entry in the Hash. This is because indexes can have more than one value, which is useful if you want to, say, "tag" something like a blog post with multiple "tags". The indexes accessor is always initialized as a Hash whose default value is a Set for this reason.
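The behaviour described in the second point can be imitated in plain Ruby without a Riak connection. The sketch below is a stand-in for the indexes accessor, not the client's actual implementation, and the tags_bin index name is made up for illustration.

```ruby
require 'set'

# Stand-in for the indexes accessor: a Hash whose missing keys
# default to an empty Set (hypothetical, for illustration only).
indexes = Hash.new { |hash, key| hash[key] = Set.new }

# Appending works even though 'tags_bin' was never assigned.
indexes['tags_bin'] << 'riak'
indexes['tags_bin'] << 'ruby'
indexes['tags_bin'] << 'riak' # a Set silently ignores the duplicate

tag_values = indexes['tags_bin'].to_a
```

Because each entry is a Set, appending the same value twice leaves a single copy, which is what you want for tag-like indexes.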

Let's look at the value of indexes and then store the object.

sean.indexes
# => {"email_bin"=>#<Set: {"sean@basho.com"}>, "handle_bin"=>#<Set: {"seancribbs"}>} 
sean.store
# => #<Riak::RObject {users,RgOVpKn6yirTTiOjlogMpkTlV1U} [application/json]:{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}>

You'll see that Riak picked a long, quasi-random key for me. Now let's see if we can find my profile.

Equality queries

The simplest secondary-index query is equality, which we'll use to look up my user profile by email and handle. Both queries will use the get_index method on the Bucket. The first argument is the index to query; the second is the value of that index to look up.

client['users'].get_index('handle_bin', 'seancribbs')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"] 
client['users']["RgOVpKn6yirTTiOjlogMpkTlV1U"] 
# => #<Riak::RObject {users,RgOVpKn6yirTTiOjlogMpkTlV1U} [application/json]:{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}> 

Now let's try the email:

client['users'].get_index('email_bin', 'sean@basho.com')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"] 

Alright, we got the same answer! Our secondary index worked.

Multi-valued indexes

We mentioned earlier that indexes can have multiple values. Let's add another email address to my profile and query for it.

sean.indexes['email_bin'] << 'sean.cribbs@private-mail.com'
# => #<Set: {"sean@basho.com", "sean.cribbs@private-mail.com"}> 
sean.store
# Now let's query it.
client['users'].get_index('email_bin', 'sean.cribbs@private-mail.com')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"]

Range queries

The indexes we have added so far on this user profile wouldn't be very meaningful to query in a range, so let's add another index, and also put some more keys in our 'users' bucket with indexes on them.

Let's assume we want to track when the user signed up, so we'll add an integer index which is a UNIX timestamp.

# We'll cheat for the first one and use the last_modified metadata.
sean.indexes["joined_int"] << sean.last_modified.utc.to_i
# => #<Set: {1335541214}>
sean.store

# Now let's make another user profile and store it
brian = client['users'].new.tap do |b|
  b.data = {:name => "Brian Roach", 
            :email => "roach@basho.com",
            :handle => "roach"}
  b.indexes['email_bin'] << 'roach@basho.com'
  b.indexes['handle_bin'] << 'roach'
  b.indexes['joined_int'] << Time.now.utc.to_i
  b.store
end
# => #<Riak::RObject {users,ITbrdX4MdIfONI9YL7bCpv4nmYV} [application/json]:{"name"=>"Brian Roach", "email"=>"roach@basho.com", "handle"=>"roach"}>

brian.indexes['joined_int']
# => #<Set: {1335548562}>
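Since a joined_int entry is just an integer number of seconds, it can help to see the round trip between a Time and the stored value. This is a plain-Ruby sketch that needs no Riak connection; the timestamp is the one shown for the first profile above.

```ruby
joined_at = Time.utc(2012, 4, 27, 15, 40, 14)

# An _int index entry has to be an Integer, so store whole seconds
# since the UNIX epoch rather than the Time object itself.
index_value = joined_at.to_i

# Time.at turns the stored Integer back into a Time; calling utc
# avoids depending on the machine's local time zone.
restored = Time.at(index_value).utc
```

Converting with utc on both sides keeps the stored values comparable no matter where the code runs.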

Now we can query on that index. Let's find the users that joined today. We do that with the same get_index method on the bucket, but pass a Range object as the query argument.

# Let's first figure out the boundaries of the day. If you're using
# Rails, use Time#end_of_day and Time#beginning_of_day.
now = Time.now.utc
start_of_today = Time.utc(now.year, now.month, now.day, 0, 0, 0).to_i
end_of_today = Time.utc(now.year, now.month, now.day, 23, 59, 59).to_i

# Now we can query the range.
client['users'].get_index('joined_int', start_of_today..end_of_today)
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U", "ITbrdX4MdIfONI9YL7bCpv4nmYV"]

Good, we got both of our users' keys back. Now let's pick a moment between the two index values so we can see the range query returning only a portion of our keyspace.

# Find the midpoint between when they joined:
midpoint = (brian.indexes['joined_int'].first - sean.indexes['joined_int'].first) / 2 + sean.indexes['joined_int'].first

client['users'].get_index('joined_int', start_of_today..midpoint)
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U"]
sean.key
# => "RgOVpKn6yirTTiOjlogMpkTlV1U"

client['users'].get_index('joined_int', midpoint..end_of_today)
# => ["ITbrdX4MdIfONI9YL7bCpv4nmYV"] 
brian.key
# => "ITbrdX4MdIfONI9YL7bCpv4nmYV" 
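The midpoint arithmetic can be checked offline with the two timestamps from the output above. This is a plain-Ruby sketch that uses Range#cover? as a local stand-in for the server-side range match; it assumes, as the results above suggest, that both bounds of the query are inclusive.

```ruby
# Timestamps taken from the example output above.
sean_joined  = 1335541214
brian_joined = 1335548562

start_of_day = Time.utc(2012, 4, 27, 0, 0, 0).to_i
end_of_day   = Time.utc(2012, 4, 27, 23, 59, 59).to_i

# Integer division keeps the midpoint an Integer, as an _int index needs.
midpoint = (brian_joined - sean_joined) / 2 + sean_joined

early = (start_of_day..midpoint)
late  = (midpoint..end_of_day)

# Range#cover? checks lower <= value <= upper, like an inclusive range query.
early_hits = [sean_joined, brian_joined].select { |t| early.cover?(t) }
late_hits  = [sean_joined, brian_joined].select { |t| late.cover?(t) }
```

Note that both ranges include the midpoint itself, so an entry stored at exactly that second would match both queries.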

The special $bucket and $key indexes

Riak also has two built-in indexes that you don't have to define: $bucket and $key, which, unsurprisingly, are indexes over the bucket and key, respectively. Querying them is still not recommended, but in some cases it will be more efficient than the list-keys functionality. Each index supports only one query type: the bucket index supports only equality, and the key index supports only range.

# Bucket equality query
client['users'].get_index('$bucket', 'users')
# => ["RgOVpKn6yirTTiOjlogMpkTlV1U", "ITbrdX4MdIfONI9YL7bCpv4nmYV"] 

# Key range query
client['users'].get_index('$key', 'H'..'J')
# => ["ITbrdX4MdIfONI9YL7bCpv4nmYV"] 

One point this example drives home about ranges on binary/String indexes is that they are strictly by byte-order, so when using them, be aware of the raw byte-ordering of your Ruby Strings.
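Byte ordering is easy to explore locally, because Ruby compares Strings byte by byte as well. The sketch below uses the two keys from this guide plus a few made-up values; it only illustrates the ordering and does not talk to Riak.

```ruby
keys = ['RgOVpKn6yirTTiOjlogMpkTlV1U', 'ITbrdX4MdIfONI9YL7bCpv4nmYV']

# A String range compares with <=>, which works on the raw bytes,
# so only the key that starts with "I" falls between "H" and "J".
in_range = keys.select { |key| ('H'..'J').cover?(key) }

# Digits sort before uppercase letters, and uppercase before lowercase.
sorted = %w[apple Zebra banana 10 9].sort
```

Here sorted comes out as "10", "9", "Zebra", "apple", "banana", so mixed-case values or unpadded numbers stored in a _bin index will not range the way a human would order them.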

Feeding index queries to MapReduce

As with Full-text Search, you can feed the results of a secondary index query into a MapReduce job. We'll just do a simple one here; the MapReduce guide will have more detailed examples.

On a Riak::MapReduce object, call the index method to add a secondary index query as the input. The first argument is the bucket, followed by the index and the query (both equality and range are supported).

Riak::MapReduce.new(client).
  index('users', 'email_bin', 'sean@basho.com').
  map('Riak.mapValuesJson', :keep => true).run
# => [{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}] 
  
Riak::MapReduce.new(client).
  index('users', 'joined_int', start_of_today..end_of_today).
  map('Riak.mapValuesJson', :keep => true).run
# => [{"name"=>"Sean Cribbs", "email"=>"sean@basho.com", "handle"=>"seancribbs"}, 
#     {"name"=>"Brian Roach", "email"=>"roach@basho.com", "handle"=>"roach"}]

So, combining secondary indexes with MapReduce, we can fetch the values in a single round-trip, or if we choose, do more complicated processing.

What to do next

Congratulations, you finished the Secondary Indexes guide! Next, you might want to compare secondary indexes to Full-text Search or go into more detail on processing query outputs with MapReduce.