The following is a tutorial on using the new IOPro Accumulo adapter. To run this notebook, you'll need a running Accumulo server with its proxy server enabled. Otherwise, you can simply follow along with the saved example results. All the examples here were tested with Python 2.7 (due to the pyaccumulo module used to create the test data), but they should also work with Python 3.4+.

First we need to generate some test data. Since IOPro only reads from data sources and cannot write to them, we'll use the Python module pyaccumulo to create and populate the tables. The connection parameters will need to be changed to match your own Accumulo server (preferably a non-production server!). The following example will create two new tables called 'iopro_tutorial_data' and 'iopro_tutorial_missing_data'. If tables with those names already exist, they will be destroyed.

In [1]:
import pyaccumulo
conn = pyaccumulo.Accumulo('172.17.0.1', port=42424, user='root', password='secret')

name = 'iopro_tutorial_data'
if conn.table_exists(name):
    conn.delete_table(name)
conn.create_table(name)

writer = conn.create_batch_writer(name)
for i in range(0, 5):
    value = '{0:07f}'.format(i + 0.5)
    m = pyaccumulo.Mutation('row{0:02d}'.format(i + 1))
    m.put(cf='f{0:02d}'.format(i + 1), cq='q{0:02d}'.format(i + 1), val=value)
    writer.add_mutation(m)
writer.close()

name = 'iopro_tutorial_missing_data'
if conn.table_exists(name):
    conn.delete_table(name)
conn.create_table(name)

writer = conn.create_batch_writer(name)
m = pyaccumulo.Mutation('row01')
m.put(cf='f01', cq='q01', val='NA')
writer.add_mutation(m)
m = pyaccumulo.Mutation('row02')
m.put(cf='f02', cq='q02', val='nan')
writer.add_mutation(m)
writer.close()
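
To see what the first loop above writes without touching a server, the same format strings can be evaluated on their own. This is only a sketch: preview_rows is a hypothetical helper introduced here for illustration and is not part of pyaccumulo or IOPro.

```python
def preview_rows(count=5):
    """Hypothetical helper: return (row, cf, cq, value) tuples that mirror
    the mutations written to 'iopro_tutorial_data' in the cell above."""
    rows = []
    for i in range(count):
        label = '{0:02d}'.format(i + 1)      # zero-padded index: '01', '02', ...
        value = '{0:07f}'.format(i + 0.5)    # float rendered as text, e.g. '0.500000'
        rows.append(('row' + label, 'f' + label, 'q' + label, value))
    return rows

for entry in preview_rows():
    print(entry)
```

Each value is stored as text, which is why the adapter later needs to be told which numeric type to parse it into.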

Next, we'll create an Accumulo adapter for the first table we created above. Since Accumulo returns values as variable-length strings, we need to tell the adapter the data type for our results by using the 'field_type' argument.

In [2]:
import iopro
adapter = iopro.AccumuloAdapter(server='172.17.0.1',
                                port=42424,
                                username='root',
                                password='secret',
                                field_type='f4',
                                table='iopro_tutorial_data')
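
Conceptually, field_type='f4' asks for each string value to be parsed into a 32-bit float. The sketch below imitates that step using only the standard library. It is not how IOPro is implemented: parse_values is a hypothetical name, and array('f') merely stands in for a float32 NumPy array.

```python
from array import array

def parse_values(raw_values):
    """Hypothetical stand-in: parse byte-string values into single-precision floats."""
    return array('f', (float(v.decode('ascii')) for v in raw_values))

# Values shaped like the ones written to the tutorial table.
raw = [b'0.500000', b'1.500000', b'2.500000']
parsed = parse_values(raw)
print(list(parsed))  # [0.5, 1.5, 2.5]
```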

The Accumulo adapter supports slicing, similar to NumPy array slicing. For example, to retrieve all records:

In [3]:
adapter[:]

array([ 0.5,  1.5,  2.5,  3.5,  4.5], dtype=float32)

To retrieve the first three records:

In [4]:
adapter[0:3]

array([ 0.5,  1.5,  2.5], dtype=float32)

To retrieve every other record from the first four records:

In [5]:
adapter[0:4:2]

array([ 0.5,  2.5], dtype=float32)

Because the interface exposed by the Accumulo server doesn't allow seeking backward from the last record, negative slicing is not currently supported.
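
The same restriction shows up with any forward-only stream in Python. As a rough analogy (not a description of IOPro's internals), itertools.islice supports start, stop, and step on a generator but rejects negative indices, because it cannot know where the stream ends without consuming it. The forward_scan generator is a hypothetical stand-in for a scan over the tutorial table.

```python
from itertools import islice

def forward_scan():
    """Hypothetical stand-in for a forward-only scan of the tutorial table."""
    for value in (0.5, 1.5, 2.5, 3.5, 4.5):
        yield value

print(list(islice(forward_scan(), 0, 3)))     # [0.5, 1.5, 2.5]
print(list(islice(forward_scan(), 0, 4, 2)))  # [0.5, 2.5]

# A negative index would require knowing the end of the stream in advance.
try:
    islice(forward_scan(), -2, None)
except ValueError as exc:
    print('negative start rejected:', exc)
```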

Since Accumulo is essentially a key/value store, we can also filter results based on key. For example, we can set a start key value using the start_key property. This will retrieve all values with a key equal to or greater than the start key value.

In [6]:
adapter.start_key = 'row02'
adapter[:]

array([ 1.5,  2.5,  3.5,  4.5], dtype=float32)

Likewise, we can set a stop key value. This will retrieve all values with a key less than the stop key value but equal to or greater than the start key value which we've already set.

In [7]:
adapter.stop_key = 'row04'
adapter[:]

array([ 1.5,  2.5], dtype=float32)

By default, the start key is inclusive. We can change this by setting the start_key_inclusive property to False.

In [8]:
adapter.start_key_inclusive = False
adapter[:]

array([ 2.5], dtype=float32)

By default, the stop key is exclusive. We can change this by setting the stop_key_inclusive property to True.

In [9]:
adapter.stop_key_inclusive = True
adapter[:]

array([ 2.5,  3.5], dtype=float32)
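
The four key-range results above can be reproduced in plain Python, which may help when reasoning about the inclusive and exclusive flags. This is a minimal sketch that assumes keys compare as ordinary strings; scan_range is a hypothetical function whose parameters simply borrow the adapter's property names, and it does not talk to Accumulo.

```python
def scan_range(rows, start_key=None, stop_key=None,
               start_key_inclusive=True, stop_key_inclusive=False):
    """Hypothetical model of key-range filtering.

    rows is a list of (key, value) pairs sorted by key.
    """
    selected = []
    for key, value in rows:
        if start_key is not None:
            # Skip keys before the start, and the start itself when exclusive.
            if key < start_key or (key == start_key and not start_key_inclusive):
                continue
        if stop_key is not None:
            # Skip keys after the stop, and the stop itself when exclusive.
            if key > stop_key or (key == stop_key and not stop_key_inclusive):
                continue
        selected.append(value)
    return selected

rows = [('row01', 0.5), ('row02', 1.5), ('row03', 2.5),
        ('row04', 3.5), ('row05', 4.5)]
print(scan_range(rows, 'row02'))                         # [1.5, 2.5, 3.5, 4.5]
print(scan_range(rows, 'row02', 'row04'))                # [1.5, 2.5]
print(scan_range(rows, 'row02', 'row04', False))         # [2.5]
print(scan_range(rows, 'row02', 'row04', False, True))   # [2.5, 3.5]
```

The defaults mirror the behavior described above: the start key is inclusive and the stop key is exclusive unless changed.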

To show how the Accumulo adapter deals with missing data, we'll create a new adapter for the missing data table created above and set the field type to a string of length 10 to see what the raw data looks like.

In [11]:
import numpy as np
adapter = iopro.AccumuloAdapter('172.17.0.1', 42424, 'root', 'secret', 'iopro_tutorial_missing_data', field_type='S10')
adapter[:]

array([b'NA', b'nan'], 
      dtype='|S10')

If we know that the strings 'NA' and 'nan' signify missing float values, we can use the missing_values property to tell the adapter to treat these strings as missing values. We can then use the fill_value property to specify the value that replaces them.

In [12]:
adapter = iopro.AccumuloAdapter('172.17.0.1', 42424, 'root', 'secret', 'iopro_tutorial_missing_data', field_type='f8')
adapter.missing_values = ['NA', 'nan']
adapter.fill_value = np.nan
adapter[:]

array([ nan,  nan])
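
Because both records end up as nan, it can be hard to see the substitution at work. The sketch below models the idea in plain Python with a different fill value and one extra, made-up record ('1.5', which is not in the tutorial table). The fill_missing function is hypothetical and only illustrates the missing-marker and fill-value concept, not IOPro's implementation.

```python
import math

def fill_missing(raw_values, missing_values, fill_value):
    """Hypothetical model: parse strings as floats, replacing any string
    listed in missing_values with fill_value."""
    markers = set(missing_values)
    return [fill_value if v in markers else float(v) for v in raw_values]

# '1.5' is an illustrative extra value, not part of the tutorial data.
raw = ['NA', 'nan', '1.5']

filled = fill_missing(raw, ['NA', 'nan'], -1.0)
print(filled)  # [-1.0, -1.0, 1.5]

as_nan = fill_missing(raw, ['NA', 'nan'], float('nan'))
print([math.isnan(v) for v in as_nan])  # [True, True, False]
```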