Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDFStore appending for mixed datatypes, including NumPy arrays #3032

Closed
alexbw opened this issue Mar 12, 2013 · 20 comments
Closed

HDFStore appending for mixed datatypes, including NumPy arrays #3032

alexbw opened this issue Mar 12, 2013 · 20 comments
Labels
IO Data IO issues that don't fit into a more specific label

Comments

@alexbw
Copy link

alexbw commented Mar 12, 2013

A pandas array I have contains some image data, recorded from a camera during a behavioral experiment. A simplified version looks like this:

num_frames = 100
mouse = [{"velocity":np.random.random((1,))[0], \
        "image":np.random.random((80,80)).astype('float32'), \
        "spine":np.r_[0:80].astype('float32'),
        #"time":millisec(i*33),
        "mouse_id":"mouse1",
        "special":i} for i in range(num_frames)]
df = DataFrame(mouse)

I understand I can't query over the image or spine entries. Of course, I can easily query for low velocity frames, like this:

low_velocity = df[df['velocity'] < 0.5]

However, there is a lot of this data (several hundred gigabytes), so I'd like to keep it in an HDF5 file, and pull up frames only as needed from disk.

In v0.10, I understand that "mixed-type" frames now can be appended into the HDFStore. However, I get an error when trying to append this dataframe into the HDFStore.

store = HDFStore("mouse.h5", "w")
store.append("mouse", df)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-30-8f0da271e75f> in <module>()
      1 store = HDFStore("mouse.h5", "w")
----> 2 store.append("mouse", df)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    543             raise Exception("columns is not a supported keyword in append, try data_columns")
    544 
--> 545         self._write_to_group(key, value, table=True, append=True, **kwargs)
    546 
    547     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    799             raise ValueError('Compression not supported on non-table')
    800 
--> 801         s.write(obj = value, append=append, complib=complib, **kwargs)
    802         if s.is_table and index:
    803             s.create_index(columns = index)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2537         # create the axes
   2538         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2539                          min_itemsize=min_itemsize, **kwargs)
   2540 
   2541         if not self.is_exists:

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2279                 raise
   2280             except (Exception), detail:
-> 2281                 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
   2282             j += 1
   2283 

Exception: cannot find the correct atom type -> [dtype->object,items->Index([image, mouse_id, spine], dtype=object)] cannot set an array element with a sequence

I'm working with a relatively new release of pandas:

pandas.__version__
'0.11.0.dev-95a5326'

import tables
tables.__version__
'2.4.0+1.dev'

It would be immensely convenient to have a single repository for all of this data, instead of fragmenting just the queryable parts off to separate nodes.
Is this possible currently with some work-around (maybe with record arrays), and will this be supported officially in the future?

As a side-note, this kind of heterogeneous data ("ragged" arrays) is incredibly wide-spread in neurobiology and the biological sciences in general. Any extra support along these lines would be incredibly well-received.

@jreback
Copy link
Contributor

jreback commented Mar 12, 2013

mixed means mixed 'pandas' data types (e.g. string/int/float/datetimes)

you basically have 2 ways to go here:

  1. the PyTables Table object that HDFStore uses underlying, DOES support a pure 'object' type, but it is very inefficient, as its basically a pickle of the data (in this case your spine/image fields), FYI, the 'mouse_id' is not 'object' datatype as its a 'string' fields (and this is how its stored)

This of this as OBT (one-big-table)

pros:

  • all of your data is 'together'

cons:

  • You would not be able to search on spine/image
  • not sure how fast this would write/read (of course your queries would be fast, and depending on the density, meaning the number of rows that are returned it could be acceptable)
  • This would require a code change to support the object type
  1. You could make your main data only have these fundamental types and use a reference (say a string) to image/spine, then store these as separate nodes

This is a main table and sub-tables
(this is a variant on http://pandas.pydata.org/pandas-docs/dev/io.html#multiple-table-queries)

pros:

  • this would be very fast (as everything is a pure type)
  • you then have the ability to store arbitrary amounts of data in the image/spine category
  • you can split up the nodes in different files if you wish
  • you can search on image/spine (not sure if that's useful)
  • if your spine/image is 'constant/fixed' this is actually a big win in terms of creating your store

cons:

  • you data is now spread out and you can potentially have consistency issues (e.g. you update the main pointer
    but not the sub-data, can easily be fixed by wrapping your accesses in a class)

I don't mind making the change to support pure objects (in 1), but given your data description I think you much better off served by 2) (and possibly some wrapper code)

@jreback
Copy link
Contributor

jreback commented Mar 12, 2013

I mispoke a bit

There is an object type, but you cannot store this in a table (you can store it in a fixed store, basically non-searchable node).

The other thing with the cons of 2), you can simply use sub-nodes if you want, then the data is 'together',
then you don't even need a reference to it (thought I will still suggest one!) e.g.

pseudo code here:

store.append('df',df)
store.append('df/image',image)
store.append('df/spine',spine)

@alexbw
Copy link
Author

alexbw commented Mar 12, 2013

That's probably the right way to do things for me. I always access the data in an "all-at-once" pattern, where I grab all the images for a mouse, compute some statistic (which may be time-dependent, which is why I need them all), and then save that statistic out. I then will only query on the statistic, and not the image. I'll go ahead and implement this. Thank you very much for your expertise and input.

On Mar 12, 2013, at 6:17 PM, jreback notifications@github.com wrote:

I mispoke a bit

There is an object type, but you cannot store this in a table (you can store it in a fixed store, basically non-searchable node).

The other thing with the cons of 2), you can simply use sub-nodes if you want, then the data is 'together',
then you don't even need a reference to it (thought I will still suggest one!) e.g.

pseudo code here:

store.append('df',df)
store.append('df/image',image)
store.append('df/spine',spine)


Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor

jreback commented Mar 12, 2013

great..

and the bottom could also be (to make your writing faster, assuming when you read an image/spine you read the entire one, e.g. you don't need to search on an image itself).

store.append('df',df)
store.put('df/image',image)
store.put('df/spine',spine)

also, this is obviously a parallizable problem (keep in mind that you CANNOT write in parallel), but you CAN read
(if you need to write, then write to separate files)

@alexbw
Copy link
Author

alexbw commented Mar 12, 2013

And one last question, what's the most expedient way to run the line

store.put('df/image',image)

if image is an ndarray? The call is currently telling me

Exception: cannot properly create the storer for: [_TYPE_MAP] [group->/data/m1/images (Group) '',value-><type 'numpy.ndarray'>,table->None,append->False,kwargs->{}]

@jreback
Copy link
Contributor

jreback commented Mar 12, 2013

just wrap it with a DataFrame (its a 2d ndarray)
store.put('df/image',DataFrame(image) )

1-d use a Series
3-d use a Panel
4-d use Panel4D
gt 4dims call me in the morning!

In [42]: df.iloc[0]
Out[42]: 
image       [[0.60904, 0.0175226, 0.36146, 0.947978, 0.327...
mouse_id                                               mouse1
special                                                     0
spine       [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...
velocity                                            0.9005659
Name: 0, dtype: object

In [43]: df.iloc[0]['image'].shape
Out[43]: (80, 80)

@alexbw
Copy link
Author

alexbw commented Mar 13, 2013

And you can even do

images = store.select(path_to_images_as_panel, columns=range(10,20)).values

and select out just the ones you'd like. Perfect. Thank you much.

@alexbw alexbw closed this as completed Mar 13, 2013
@jreback
Copy link
Contributor

jreback commented Mar 13, 2013

great...you are welcome

@jreback
Copy link
Contributor

jreback commented Mar 13, 2013

FYI...to select out the 'rows' of your image (and this will be very fast, done totally in-kernal in hdf5

images = store.select(path_to_images_as_panel, start=10,stop=20).values

and you always have an index, so the following is the same as above

images = store.select(path_to_images_as_panel, [Term('index>=10'),Term('index<20')] ).values

@alexbw
Copy link
Author

alexbw commented Mar 16, 2013

Re your comment:

  1. the PyTables Table object that HDFStore uses underlying, DOES support a pure 'object' type, but it is very
    inefficient, as its basically a pickle of the data (in this case your spine/image fields), FYI, the 'mouse_id' is not 'object'
    datatype as its a 'string' fields (and this is how its stored)

This seems like the most viable option to me at this point, after wrangling a bunch of strange data.
The loose-ends are becoming unmanageable, at least in the data organization sense.
How difficult would the following change you mentioned take?

This would require a code change to support the object type

If you point me to the location required for this support, it would be MUCH appreciated.

@alexbw alexbw reopened this Mar 16, 2013
@jreback
Copy link
Contributor

jreback commented Mar 16, 2013

unfortunately I don't believe tables will support an object type
they need apriori fixed types in order to support indexing
best to store a pointer ( eg a string or number ) to where your image/spline actually reside (which even could be another DB if u want)

what is the data organization issue?

@alexbw
Copy link
Author

alexbw commented Mar 17, 2013

I'm a research scientist, and I'm trying to record the essential variables that I measure in a set of pilot experiments. I'm recording movies of an animal, and taking scalar measurements about the environment the animal is in, like temperature, humidity, etc.

The data is heterogeneous — for each frame of the video during the experiment, we also have a combination of scalars, arrays and strings that describe what occurred during that frame. Because these are pilot experiments, we don't necessarily know what is essential to measure yet. We need to be able to query over all of the scalar columns to do analysis, and we need to be able to append data.

The main problem as I see it, with respect to the current capabilities of HDFStore, is heterogeneous data. I need to be able to append and query the scalar data, but each row necessarily is associated with image data.
I've written a bit of software to store the images separately, but I've not found a way to do queries in the kind of "one-line" way that store.select() allows you to do. I find that kind of simplicity absolutely necessary for exploratory data analysis.

If you have a great way to organize pointers with some wrapper code so that I and my colleagues don't have to think about this kind of data scattering that's fine. The key point here, though, is reducing our mental workload, so that we can think more about the research question, and less about the data.

One thing to note, though with our heterogeneous data. If we store a column "images", that contains a 2D array, you can be guaranteed that every row that has an image column will have the same size 2D array. Does this uniformity help at all? I do wish that DataFrames would take advantage of this, so if I did

df['images'].values

it would return a 3D numpy array, as opposed to an array of objects. Do you think that's a possibility?

@ghost
Copy link

ghost commented Mar 17, 2013

It sounds like you have two issues: HDFStore support for storing object data
in queryable tables, and a need to get work done right now, conveniently.

Let me try and help with the latter, with an example of "rolling-your-own" via monkey-patching.

def f(self):
    import urllib

    # do sql/http/whatever fetch here, based on row data
    img_size = int((100+200*self['velocity']))
    image = urllib.urlopen("http://placekitten.com/%s" % img_size  ).read()

    # can just return the data
    # return image

    # or even, have it display inline in IPython frontends
    from IPython.display import Image
    return Image(image,format='jpeg')

pd.Series.silvester=f
# each row you pick is a series, which now has a `silvester()` method
df[df['velocity'] < 0.5].irow(0).silvester()

meow

@ghost
Copy link

ghost commented Mar 17, 2013

If there were a placemouse.com website, that would have been just too perfect.

@alexbw
Copy link
Author

alexbw commented Mar 17, 2013

That's brilliant. I had absolutely not thought of placing the method for retrieving the pointer in the array.

Do you have a recommendation of how to pack this up so the query occurs out-of-core? I have dozens of gigabytes of metadata alone.

@ghost
Copy link

ghost commented Mar 17, 2013

Not sure what's "in core" here.
You can probably be up and running with mongodb within a couple of hours,
if that's what you mean.

Or just use the filesystem.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2013

Try this out, this implements the solution that I pointed to above. Its pretty straightforward and should get you started. Let me know.

The Store
<class 'pandas.io.pytables.HDFStore'>
File path: my_mouse.h5
/my_mouse                              frame_table  (typ->appendable,nrows->5,ncols->5,indexers->[index],dc->[image_loc,n,spine_loc,typ,velocity])
/my_mouse/mouse/image/loc_1            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_2            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_3            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_4            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_5            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/spine/loc_1            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_2            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_3            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_4            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_5            frame        (shape->[80,1])                                                                                

The Query Results
           image_loc  n          spine_loc    typ  velocity
3  mouse/image/loc_3  3  mouse/spine/loc_3  mouse         6
4  mouse/image/loc_4  4  mouse/spine/loc_4  mouse         8
5  mouse/image/loc_5  5  mouse/spine/loc_5  mouse        10 

New Animals
n=3,typ=mouse,velocity=6,image_loc=mouse/image/loc_3,spine_loc=mouse/spine/loc_3,image_data=(80, 80),spine_data=(80, 1)
n=4,typ=mouse,velocity=8,image_loc=mouse/image/loc_4,spine_loc=mouse/spine/loc_4,image_data=(80, 80),spine_data=(80, 1)
n=5,typ=mouse,velocity=10,image_loc=mouse/image/loc_5,spine_loc=mouse/spine/loc_5,image_data=(80, 80),spine_data=(80, 1)

Here's the script

import numpy as np
import pandas as pd

scalars = ['n','typ','velocity','image_loc','spine_loc']

class Animal(object):
    """ generate a single frame of data for an experiment """
    typ = None

    @classmethod
    def create(cls, n):
        """ create a new anmial """
        velocity = n*2
        image_data    = pd.DataFrame(np.random.random((80,80)).astype('float32'))
        spine_data    = pd.DataFrame(np.r_[0:80].astype('float32'))

        return cls(n = n,
                   velocity = velocity,
                   image_data = image_data,
                   spine_data = spine_data)

    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self,k,v)

        if not getattr(self,'image_loc',None):
            self.image_loc = "%s/image/loc_%s" % (self.typ,self.n)

        if not getattr(self,'spine_loc',None):
            self.spine_loc = "%s/spine/loc_%s" % (self.typ,self.n)

    def __str__(self):
        return ','.join([ "%s=%s" % (s,getattr(self,s,None)) for s in scalars ] + 
                        [ "image_data=%s" % str(self.image_data.shape), "spine_data=%s" % str(self.spine_data.shape) ])

class Mouse(Animal):
    typ = 'mouse'

class Animals(object):
    """ hold a bunch of animals """

    def __init__(self, name=None):
        self.data = []
        self.name = name
        self._store = None

    def __iter__(self):
        return iter(self.data)

    def __len__(self):
        return len(self.data)

    def __str__(self):
        return "Animals [%s]" % len(self)

    def append(self, a):
        if hasattr(a,'__iter__'):
            for e in a:
                self.data.append(e)
        else:
            self.data.append(a)
        return self

    def get_store(self):
        """ create the store only when we ask for it """
        if self._store is None:
            if self.name is None:
                raise Exception("cannot create a store as name is None")
            self._store = pd.HDFStore('%s.h5' % self.name)
        return self._store

    def get_data(self):
        """ return the data as a frame for my animals """
        d = dict()
        for s in scalars:

            series = dict()
            for e in self:
                series[e.n] = getattr(e,s,None)

            d[s] = series

        return pd.DataFrame(d)

    def clear(self):
        """ clear out the store """
        self.get_store().remove(self.name)

    def image_loc(self, e):
        return "%s/%s" % (self.name,e.image_loc)

    def spine_loc(self, e):
        return "%s/%s" % (self.name,e.spine_loc)

    def store(self):
        store = self.get_store()

        # store my main data
        store.append(self.name, self.get_data(), data_columns=True)

        # store the image/spine
        for e in self:
            store.put(self.image_loc(e), e.image_data)
            store.put(self.spine_loc(e), e.spine_data)

        print "The Store"
        print store, "\n"
        self.close()

    def query(self, terms):
        """ query on the scalars """
        store = self.get_store()

        df = store.select(self.name,terms)
        print "The Query Results"
        print df, "\n"

        # create and return new animals
        animals = Animals()
        for i, r in df.iterrows():

            # create a new animal and populate using the retrieved data
            a = globals()[r['typ'].capitalize()](**r.to_dict())

            # spine/image
            a.image_data = store.get(self.image_loc(a))
            a.spine_data = store.get(self.spine_loc(a))

            animals.append(a)

        self.close()
        return animals

    def close(self):
        """ close the store """
        if self._store is not None:
            try:
                self._store.close()
            except:
                pass
            self._store = None

# create some animals
animals = Animals('my_mouse').append([ Mouse.create(i+1) for i in range(5) ])

# clear the store
animals.clear()

# store em
animals.store()

# retreive based on some scalars
new_animals = animals.query([ pd.Term('velocity>4') ])

print "New Animals"
for a in new_animals:
    print a

@alexbw
Copy link
Author

alexbw commented Mar 17, 2013

Yep, you're a beast.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2013

To answer your last question, yes, if the images are all the same shape then this is the way to store them.
Create a panel (key is to make the appendable axes the minor_axis (or major_axis), then you
can just keep adding them on (you can actually append to both dimensions, but you don't need that here)

Creating a panel (in the store method of Animals)

image = pd.Panel(dict([ (e.n,e.image_data) for e in self ])).swapaxes('items','minor_axis')
store.append('%s/images' % self.name,image)

Retrieving the images (in the query method)

animals.images = store.select('my_mouse/images', pd.Term("minor_axis", '=', [ a.n for a in animals ]))

Showing the images Panel

The Store
<class 'pandas.io.pytables.HDFStore'>
File path: my_mouse.h5
/my_mouse                              frame_table  (typ->appendable,nrows->5,ncols->5,indexers->[index],dc->[image_loc,n,spine_loc,typ,velocity])
/my_mouse/images                       wide_table   (typ->appendable,nrows->400,ncols->80,indexers->[major_axis,minor_axis])                      
/my_mouse/mouse/image/loc_1            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_2            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_3            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_4            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/image/loc_5            frame        (shape->[80,80])                                                                              
/my_mouse/mouse/spine/loc_1            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_2            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_3            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_4            frame        (shape->[80,1])                                                                               
/my_mouse/mouse/spine/loc_5            frame        (shape->[80,1])                                                                                

The selected panel (which corresponds to the mice we selected in this example

<class 'pandas.core.panel.Panel'>
Dimensions: 80 (items) x 80 (major_axis) x 3 (minor_axis)
Items axis: 0 to 79
Major_axis axis: 0 to 79
Minor_axis axis: 3 to 5

@jreback
Copy link
Contributor

jreback commented Mar 20, 2013

@alexbw close this?

@alexbw alexbw closed this as completed Mar 20, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

No branches or pull requests

2 participants