RW for char arrays with unicode support#234
Conversation
|
Looks like a great clean-up! I'm only slightly worried about performance because I know val-based dispatching can be slower than normal dispatching. See e.g. https://github.com/ztangent/ValSplit.jl |
|
The reference you linked mentions this:
Here, the What do you think? |
|
I don't think we need to update |
|
Added write support for both Julia string and char types. Added some tests with unicode characters as well. The basic gist is this:
I've added a bunch of tests for both read and write, and they all work with my version of MATLAB (2025b). I think I've handled some edge cases like 1D vectors and stuff as well. Honestly there's probably a better way to do this, could use some pointers there. Till then I believe this is a reasonable solution. |
I'm okay with a manual if-else in
Alright.
At a glance the code looks okay, is there anything in specific you'd like some advice on? |
| "concatenated_strings" => String["this is a string ", "this is another string"], | ||
| "cell_strings" => Any["this is a string" "this is another string"], | ||
| "empty_string" => "" | ||
| "empty_string" => String[] |
There was a problem hiding this comment.
ah we go from an empty string to an empty string array now. Was this necessary/cleaner?
There was a problem hiding this comment.
It's cleaner. An empty string was being written as a 1x0 char array when we needed it to be 0x0. This explicitly writes it as 0x0 (or 0x0x0... if defined).
Also note the change in concatenated_strings. MATLAB pads with spaces to align column width. I'm not sure what was happening before, but the spaces were being stripped or something I guess. Now we're preserving the spaces which is in sync with MATLAB behaviour.
|
I'm wondering about the next version bump after this. I think this is a serious behavior change for strings/chars, so I suppose we should go from 0.11 to 0.12. |
|
Thanks for the review!
Great, I'll make a small update for this.
I was just wondering if there's some way to avoid iterating over all elements of the array when encoding
I agree. I believe there is no issue with backwards compatibility, but it is a major update nonetheless, and could come with its own set of bugs. |
|
| else # :utf8 | ||
| # Byte swap on LE systems else utf-8 code points will be out of order | ||
| bytes = reinterpret(UInt8, ENDIAN_BOM == 0x4030201 ? bswap.(row) : row) | ||
| String(filter(!=(0x00), bytes)) |
There was a problem hiding this comment.
I just realised that this step would actually filter out valid null characters in the original array. We could replace it with something like this:
out = UInt8[]
for i in 1:2:length(bytes)
msb = bytes[I]
lsb = bytes[i + 1]
if msb == 0x00
push!(out, lsb)
else
push!(out, msb, lsb)
end
end
String(out)
This will go over the whole bytes array and construct the expected UTF-8 sequence for decoding. It's expensive, but this whole path is just legacy support so I don't think it's a performance critical operation.
What do you think?
There was a problem hiding this comment.
sure, let's go for it for now. do we need a small unit test for this case though, if we want to refactor later?
There was a problem hiding this comment.
Unfortunately I don't think we can add a unit test for this case with null characters. This is just a legacy load support for some old MAT-files, we don't use the same paths for load/save right now.
The case without null characters is already tested in read.jl with string.mat files in v6 and v7
There was a problem hiding this comment.
Actually, I managed to add a test here. I directly modified one of the other char files using a binary editor to get an example for this case. It works
|
|
||
| decoded = Vector{String}(undef, n_strings) | ||
| for i in 1:n_strings | ||
| decoded[i] = _decode_row(flat[i, :], codec) |
There was a problem hiding this comment.
flat[i, :] makes a copy. If you want to consider performance improvements, we could try a view(flat, i, :) instead.
I'm also wondering now if all the reshaping are done without making copies.
There was a problem hiding this comment.
I refactored the method, much more readable now and removed all that reshape stuff. Will try out with views in a while.
|
Can we consider this PR as finished? |
|
Yes, I believe the latest changes has improved from where we started and addressed your concerns as well. There is room for optimisation which could be addressed in future PRs. Immediately, I see a place we could optimise in
If that sounds good, I'll just refactor that method and I think we're good to go. |
|
Sure, go ahead. Also feel free to version bump, so I can register right after merging this PR. |
Extension of #222 that attempts to fix several bugs in handling char arrays. I've attempted a common decode method
decode_char_arrayinMAT_types.jl. TheMAT_v5andMAT_HDF5readers now just read raw integer data, and then pass that todecode_char_arraywhere the data is decoded. Seems to work with N-D arrays without issue.So far all the test cases for
readpass. I'll add some more test cases to check this out, particularly with unicode characters. and N-D arrays. Will need to see how to incorporate the same changes inMAT_v4.jlas well.CC @matthijscox