
UnicodeAtom / UnicodeString types #151

wesm opened this Issue · 8 comments

3 participants


Are there plans for these / are they pretty simple to do? Cross linking from pydata/pandas#626


That seems like it is a pretty simple fix, though more properly we should probably have separate UnicodeAtom and UnicodeCol for 2.* and just make this the StringAtom and StringCol for 3.*.


Hi @scopatz, I'm not sure what is the expected behavior here.

numpy represents unicode (U) as fixed-size strings of 4-byte items.
It is not clear to me what the correct mapping onto HDF5 data types would be without loss of generality.
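To make the size question concrete, here is a minimal sketch of a fixed-width, 4-bytes-per-code-point layout, which is how numpy's `U` dtype lays out its items. The helper name `fixed_ucs4` is made up for illustration and is not part of numpy or PyTables; little-endian byte order is an assumption.

```python
def fixed_ucs4(text: str, width: int) -> bytes:
    """Encode text as exactly `width` 4-byte code points, null-padded.

    Hypothetical helper: mimics a fixed-size UCS-4 field (little-endian assumed).
    """
    if len(text) > width:
        raise ValueError("text is longer than the field width")
    return text.ljust(width, "\0").encode("utf-32-le")


field = fixed_ucs4("héllo", 8)
print(len(field))                    # 8 code points * 4 bytes each
print(len("héllo".encode("utf-8")))  # UTF-8 size varies with the content
```

Every field takes `4 * width` bytes no matter what it holds, whereas the UTF-8 size depends on the characters, so the two do not map onto the same fixed-size storage type in an obvious way.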

Also, there is already a VLUnicodeAtom pseudo-atom that IMHO is a good candidate to represent what is commonly meant by a unicode string in Python.

Any idea?


Hello @avalentino, probably following the numpy convention for now is sufficient. We could have Unicode32Atom, Unicode64Atom, etc for supporting things in full generality.

The problem with the VLUnicodeAtom is that there is no associated VLUnicodeCol, right?


Hi @scopatz, some link from the Internet:

Fixed-size unicode for data does not seem to be directly supported by HDF5. We could try to set up some extra structure or some PyTables-specific machinery to work around the issue, but I'm not sure it is a good idea.
Also, at the moment I don't have a clear idea of how to do it.
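For the sake of discussion, one possible shape for such machinery is to encode each string as UTF-8 and pad it to a fixed byte width, so it fits an ordinary fixed-size byte-string field. This is only a sketch under that assumption: `pack_utf8` and `unpack_utf8` are hypothetical names, and nothing here is existing PyTables or HDF5 behavior.

```python
def pack_utf8(text: str, itemsize: int) -> bytes:
    """Encode text as UTF-8 and null-pad it to exactly `itemsize` bytes."""
    raw = text.encode("utf-8")
    if len(raw) > itemsize:
        raise ValueError("encoded text does not fit in the field")
    return raw.ljust(itemsize, b"\0")


def unpack_utf8(raw: bytes) -> str:
    """Strip the null padding and decode the UTF-8 payload."""
    return raw.rstrip(b"\0").decode("utf-8")


stored = pack_utf8("naïve", 16)
print(len(stored))
print(unpack_utf8(stored))
```

The drawbacks are visible even at this size: the field width is in bytes rather than characters, and a string that legitimately ends in null characters does not round-trip.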

Better support for unicode (utf-8 encoding) in file and object names is out of the scope of this ticket.
I would target it for PyTables 3.0 since it requires a fair amount of effort.


Hi @avalentino, Ok. I agree. If this is too difficult, then we should target it for v3.0, where we will certainly need it.


oh, sorry for confusion

IMHO here we have 2 different issues:

  1. Unicode (fixed size) atoms for data: HDF5 does not support it. Solutions:
    • a. wait until HDF5 supports it directly
    • b. try to set up some special handling (I'm -1 for this)
  2. handling of unicode names of groups, datasets, etc. This is supported by the current HDF5 library but requires some effort. We should handle this with a separate ticket, and my suggestion is to target this one for PyTables 3.0.

So we agree about (2).

On (1), I think we should see whether this is something that they are interested in at all. Maybe I'll resurrect that thread and ask them again. If they are interested in supporting unicode, then we can wait or maybe send them a patch that includes it. If they are never ever going to implement then we should figure out what to do, if anything.


+1, let me know.
