# Tutorial on ctable objects
[Go to tutorials' index](tutorials.ipynb)

<a id='go to index'></a>
Index:
  - <a href='#Creating a ctable'>Creating a ctable</a>
  - <a href='#Accessing and setting rows'>Accessing and setting rows</a>
  - <a href='#Adding and deleting columns'>Adding and deleting columns</a>
  - <a href='#Iterating over ctable data'>Iterating over ctable data</a>
  - <a href='#Iterating over the output of conditions along columns'>Iterating over the output of conditions along columns</a>
  - <a href='#Performing operations on ctable columns'>Performing operations on ctable columns</a>

The bcolz package comes with a handy object that arranges data by
column (and not by row, as in NumPy's structured arrays).  This allows
for much better performance for walking tabular data by column and
also for adding and deleting columns.
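
To make the row-wise versus column-wise distinction concrete, here is a minimal sketch that uses plain NumPy only. It is not how bcolz is implemented: the dictionary of arrays merely stands in for a column-wise container, and the variable names (`rows`, `cols`, `wider`) are illustrative assumptions.

```python
import numpy as np

n = 5

# Row-wise layout: a structured array keeps the fields of each row together,
# so the values of a single column are interleaved with the other fields.
rows = np.zeros(n, dtype="i4,f8")
rows["f0"] = np.arange(n)
rows["f1"] = np.arange(n) ** 2

# Column-wise layout (the idea behind a ctable): one contiguous array per column.
cols = {
    "f0": np.arange(n, dtype="i4"),
    "f1": np.arange(n, dtype="f8") ** 2,
}

# Adding a column in the column-wise layout only creates the new array...
cols["f2"] = np.linspace(0, 1, n)

# ...whereas the structured array has to be rebuilt with a wider dtype
# and every existing field copied over.
wider = np.zeros(n, dtype="i4,f8,f8")
for name in rows.dtype.names:
    wider[name] = rows[name]
wider["f2"] = cols["f2"]
```

Walking `cols["f0"]` touches one contiguous buffer, while `rows["f0"]` is a strided view that skips over the `f1` bytes of every row.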

In [1]:
import bcolz
import numpy as np

<a id='Creating a ctable'></a>
## Creating a ctable
<a href='#go to index'>Go to index</a>

You can build ctable objects in many different ways, but perhaps the
easiest one is using the `fromiter` constructor:

In [2]:
N = int(5*1e6)
ct = bcolz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
ct

ctable((5000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 57.22 MB; cbytes: 9.95 MB; ratio: 5.75
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(0, 0.0) (1, 1.0) (2, 4.0) ..., (4999997, 24999970000009.0)
 (4999998, 24999980000004.0) (4999999, 24999990000001.0)]

If you wish to create an empty ctable and append data afterwards,
you can do so with `bcolz.zeros`, passing a length of zero.
We encourage you to use the `with` statement for this:
it takes care of flushing the data to disk once you are
done appending.

In [3]:
with bcolz.zeros(0, dtype="i4,f8") as ct:
    for i in xrange(N):
        ct.append((i, i**2))
ct

ctable((5000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 57.22 MB; cbytes: 11.01 MB; ratio: 5.20
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(0, 0.0) (1, 1.0) (2, 4.0) ..., (4999997, 24999970000009.0)
 (4999998, 24999980000004.0) (4999999, 24999990000001.0)]

However, the latter approach does not compress as well (a ratio of
5.20 versus 5.75).  Why?  Well, bcolz has machinery for computing
'optimal' chunksizes depending on the number of entries.  In the first
case, bcolz can figure out the number of entries in the final array,
but not in the loop case.  You can solve this by passing the final
length with the `expectedlen` argument when creating the ctable:

In [4]:
ct = bcolz.zeros(0, dtype="i4,f8", expectedlen=N)
for i in xrange(N):
    ct.append((i, i**2))
ct

ctable((5000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 57.22 MB; cbytes: 9.95 MB; ratio: 5.75
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(0, 0.0) (1, 1.0) (2, 4.0) ..., (4999997, 24999970000009.0)
 (4999998, 24999980000004.0) (4999999, 24999990000001.0)]

Okay, the compression ratio is the same now.

<a id='Accessing and setting rows'></a>
## Accessing and setting rows
<a href='#go to index'>Go to index</a>

The ctable object supports the most common indexing operations in
NumPy:

In [5]:
ct[1]

(1, 1.0)

In [6]:
type(ct[1])

numpy.void

In [7]:
ct[1:6]

array([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

The first thing to keep in mind is that, as with `carray` objects,
the result of an indexing operation is a native NumPy object
(in the cases above, a scalar and a structured array).

Fancy indexing is also supported:

In [8]:
ct[[1,6,13]]

array([(1, 1.0), (6, 36.0), (13, 169.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

In [9]:
ct["(f0>0) & (f1<10)"]

array([(1, 1.0), (2, 4.0), (3, 9.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

Note that conditions over columns are expressed as string expressions
(in order to use Numexpr under the hood), and that the column names
are understood correctly.
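
For readers more used to NumPy, the string condition above plays the same role as an ordinary boolean mask. The sketch below shows the NumPy equivalent on a small structured array; it does not use bcolz at all, and the array `arr` is an illustrative stand-in for the ctable.

```python
import numpy as np

n = 20
arr = np.zeros(n, dtype="i4,f8")
arr["f0"] = np.arange(n)
arr["f1"] = np.arange(n, dtype="f8") ** 2

# NumPy counterpart of ct["(f0>0) & (f1<10)"]: build a boolean mask
# from the columns, then index with it.
mask = (arr["f0"] > 0) & (arr["f1"] < 10)
selected = arr[mask]
```

The difference is that NumPy builds the intermediate boolean arrays in memory, whereas the string form leaves the evaluation to the library.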

Setting rows is also supported:

In [10]:
ct[1] = (0,0)
ct

ctable((5000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 57.22 MB; cbytes: 9.95 MB; ratio: 5.75
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(0, 0.0) (0, 0.0) (2, 4.0) ..., (4999997, 24999970000009.0)
 (4999998, 24999980000004.0) (4999999, 24999990000001.0)]

In [11]:
ct[1:6]

array([(0, 0.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

And in combination with fancy indexing too:

In [12]:
ct[[1,6,13]] = (1,1)
ct[[1,6,13]]

array([(1, 1.0), (1, 1.0), (1, 1.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

In [13]:
ct["(f0>=0) & (f1<10)"] = (2,2)
ct[:7]

array([(2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (4, 16.0), (5, 25.0),
       (2, 2.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

As you may have noticed, fancy indexing in combination with conditions
is a very powerful feature.

<a id='Adding and deleting columns'></a>
## Adding and deleting columns
<a href='#go to index'>Go to index</a>

Adding and deleting columns is easy and, due to the column-wise data
arrangement, very efficient.  Let's add a new column to an existing
ctable:

In [14]:
N = int(1e5)
ct = bcolz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
new_col = np.linspace(0, 1, 100*1000)
ct.addcol(new_col)
ct

ctable((100000,), [('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
  nbytes: 1.91 MB; cbytes: 1.13 MB; ratio: 1.69
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(0, 0.0, 0.0) (1, 1.0, 1.000010000100001e-05)
 (2, 4.0, 2.000020000200002e-05) ...,
 (99997, 9999400009.0, 0.999979999799998)
 (99998, 9999600004.0, 0.999989999899999) (99999, 9999800001.0, 1.0)]

Now, remove the already existing 'f1' column:

In [15]:
ct.delcol('f1')

As mentioned, adding and deleting columns is very cheap, so don't be
afraid of using this feature as much as you like.

<a id='Iterating over ctable data'></a>
## Iterating over ctable data
<a href='#go to index'>Go to index</a>

You can make use of the `iter()` method in order to easily iterate
over the values of a ctable.  `iter()` has support for start, stop and
step parameters:

In [16]:
# `ct` is the ctable from the previous section (columns 'f0' and 'f2')
[row for row in ct.iter(1,10,3)]

[row(f0=1, f2=1.000010000100001e-05),
 row(f0=4, f2=4.000040000400004e-05),
 row(f0=7, f2=7.000070000700007e-05)]

Note how the data is returned as `namedtuple` objects of type
`row`.  This makes the fields easy to access by name or, as below,
by tuple unpacking:

In [17]:
[(f0,f2) for f0,f2 in ct.iter(1,10,3)]

[(1, 1.000010000100001e-05),
 (4, 4.000040000400004e-05),
 (7, 7.000070000700007e-05)]

You can also use the ``[:]`` accessor to get rid of the ``row``
namedtuple, and return just bare tuples:

In [18]:
[row[:] for row in ct.iter(1,10,3)]

[(1, 1.000010000100001e-05),
 (4, 4.000040000400004e-05),
 (7, 7.000070000700007e-05)]
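
The three access styles seen so far (field names, tuple unpacking and the `[:]` accessor) all follow from the behaviour of standard `namedtuple` objects. The sketch below reproduces them with the standard library only; the `row` type defined here is a hypothetical stand-in for the one that `iter()` returns, with the same two fields as in the output above.

```python
from collections import namedtuple

# Hypothetical stand-in for the `row` type produced by ct.iter()
row = namedtuple("row", ["f0", "f2"])
r = row(f0=1, f2=1.000010000100001e-05)

by_name = (r.f0, r.f2)   # access by field name
f0, f2 = r               # tuple unpacking
bare = r[:]              # slicing a namedtuple yields a plain tuple
```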

Also, you can select specific fields to be read via the `outcols`
parameter:

In [19]:
[row for row in ct.iter(1,10,3, outcols='f0')]

[row(f0=1), row(f0=4), row(f0=7)]

In [20]:
[(nr,f0) for nr,f0 in ct.iter(1,10,3, outcols='nrow__,f0')]

[(1, 1), (4, 4), (7, 7)]

Please note the use of the special 'nrow__' label for referring to
the current row number.

<a id='Iterating over the output of conditions along columns'></a>
## Iterating over the output of conditions along columns
<a href='#go to index'>Go to index</a>

One of the most powerful capabilities of the ctable is the ability to
iterate over the rows whose fields fulfill certain conditions, without
first materializing the results in a NumPy container, as the indexing
operations shown earlier do.  This can be very useful for performing
operations on very large ctables without consuming lots of memory.

Here is an example:

In [21]:
N = 100*1000
ct = bcolz.fromiter(((i,i*i) for i in xrange(N)), dtype="i4,f8", count=N)
[row for row in ct.where("(f0>0) & (f1<10)")]

[row(f0=1, f1=1.0), row(f0=2, f1=4.0), row(f0=3, f1=9.0)]

And by using the `outcols` parameter, you can specify the fields that
you want to be returned:

In [22]:
[row for row in ct.where("(f0>0) & (f1<10)", outcols="f1")]

[row(f1=1.0), row(f1=4.0), row(f1=9.0)]

You can even specify the row number fulfilling the condition:

In [23]:
[(f1,nr) for f1,nr in ct.where("(f0>0) & (f1<10)", outcols="f1, nrow__")]

[(1.0, 1), (4.0, 2), (9.0, 3)]
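
The benefit of iterating lazily can be sketched in pure Python. The generator below is a hypothetical helper (`where_rows` is not part of bcolz, and bcolz evaluates its conditions differently): it yields matching rows one at a time together with their row number, so no container holding all the matches is ever built unless the caller asks for one.

```python
def where_rows(columns, predicate):
    """Lazily yield (nrow, values) for each row where predicate(record) is true.

    `columns` maps column names to equal-length sequences. This is an
    illustrative helper, not the bcolz implementation.
    """
    names = list(columns)
    for nrow, values in enumerate(zip(*(columns[name] for name in names))):
        record = dict(zip(names, values))
        if predicate(record):
            yield nrow, values


columns = {"f0": range(100), "f1": [float(i * i) for i in range(100)]}

# Same condition as "(f0>0) & (f1<10)", written as a Python predicate.
hits = where_rows(columns, lambda r: r["f0"] > 0 and r["f1"] < 10)

# Nothing has been computed yet; rows are produced as the list is built.
result = [(values[1], nrow) for nrow, values in hits]
```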

<a id='Performing operations on ctable columns'></a>
## Performing operations on ctable columns
<a href='#go to index'>Go to index</a>

The ctable object also has an `eval()` method, which is handy for
carrying out operations among columns.

The best way to illustrate the point is with an example:

In [24]:
ct.eval("cos((3+f0)/sqrt(2*f1))")

carray((100000,), float64)
  nbytes: 781.25 KB; cbytes: 594.78 KB; ratio: 1.31
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[        nan -0.95136313 -0.19569944 ...,  0.76023082  0.76023082
  0.76023082]

Here one can see an exception to the usual behaviour of ctable
methods: the resulting output is a carray, and not a NumPy array.
This is by design: the output of `eval()` has the same length as
the ctable, so it can be pretty large, and compression may help
to reduce its storage needs.
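
For comparison, the same expression can be written directly in NumPy. This sketch rebuilds the two columns as plain arrays (`f0` and `f1` here are ordinary NumPy arrays, not ctable columns) and produces an uncompressed `float64` array of the full length, which is exactly the storage cost that `eval()` avoids by returning a compressed container.

```python
import numpy as np

n = 100 * 1000
f0 = np.arange(n, dtype="i4")
f1 = np.arange(n, dtype="f8") ** 2

# Row 0 divides by zero (3 / 0 gives inf) and cos(inf) is nan,
# so silence the corresponding floating-point warnings.
with np.errstate(divide="ignore", invalid="ignore"):
    result = np.cos((3 + f0) / np.sqrt(2 * f1))
```

The leading `nan` in the `eval()` output above has the same origin: the first row has `f1 == 0`.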