Are there plans for these / are they pretty simple to do? Cross linking from pydata/pandas#626
That seems like it is a pretty simple fix, though more properly we should probably have separate UnicodeAtom and UnicodeCol for 2.* and just make this the StringAtom and StringCol for 3.*.
Hi @scopatz, I'm not sure what the expected behavior is here.
numpy represents unicode (U) as strings of 4 byte items (fixed size).
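The fixed-size layout can be checked directly in NumPy. This is only a minimal sketch: the array contents and the width of 5 are made up for illustration.

```python
import numpy as np

# Fixed-size unicode dtype: room for up to 5 code points per element.
arr = np.array(["abc", "héllo"], dtype="U5")

# NumPy stores each code point as a 4-byte item, so every element
# occupies 5 * 4 bytes regardless of its actual content.
bytes_per_char = arr.dtype.itemsize // 5

print(arr.dtype.kind, arr.dtype.itemsize, bytes_per_char)
```

Whatever HDF5 type is chosen would need to round-trip this 4-bytes-per-item layout, or convert to and from it.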
It is not clear to me what the correct mapping onto HDF5 data types should be without loss of generality.
Also, there is already a VLUnicodeAtom pseudo-atom that IMHO is a good candidate to represent what is commonly meant by a unicode string in Python.
Hello @avalentino, probably following the numpy convention for now is sufficient. We could have Unicode32Atom, Unicode64Atom, etc for supporting things in full generality.
The problem with the VLUnicodeAtom is that there is no associated VLUnicodeCol, right?
Hi @scopatz, some link from the Internet:
Fixed-size unicode for data does not seem to be directly supported by HDF5. We could try to set up some over-structure or some PyTables-specific machinery to work around the issue, but I'm not sure it is a good idea.
Also, at the moment I don't have a clear idea of how to do it.
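For illustration only, one shape such a workaround could take is to encode the unicode array to UTF-8 bytes before storage and decode it on read. The sketch below uses plain NumPy and makes no claim about how PyTables does or should do this; the array contents are invented, and storing the result in a byte-string column is an assumption, not something shown here.

```python
import numpy as np

# Fixed-size unicode array as NumPy holds it in memory (4 bytes per code point).
names = np.array(["año", "naïve"], dtype="U5")

# Encode to UTF-8 bytes; NumPy picks an 'S' dtype wide enough for the longest
# encoded element. A plain byte-string column could hold this representation.
encoded = np.char.encode(names, "utf-8")

# Decoding on read recovers the original unicode values.
decoded = np.char.decode(encoded, "utf-8")

print(encoded.dtype, decoded.tolist())
```

The catch is that the byte width depends on the data (UTF-8 is variable length), so the stored item size no longer says how many characters fit, which is part of why this kind of machinery may not be a good idea.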
Better support for unicode (UTF-8 encoding) in file and object names is out of the scope of this ticket.
I would target it for PyTables 3.0 since it requires a not so small effort.
Hi @avalentino, Ok. I agree. If this is too difficult, then we should target it for v3.0, where we will certainly need it.
Oh, sorry for the confusion.
IMHO here we have 2 different issues:
So we agree about (2).
On (1), I think we should see whether this is something that they are interested in at all. Maybe I'll resurrect that thread and ask them again. If they are interested in supporting unicode, then we can wait or maybe send them a patch that includes it. If they are never ever going to implement then we should figure out what to do, if anything.
+1, let me know.