RW for char arrays with unicode support #234

Merged
matthijscox merged 13 commits into JuliaIO:master from foreverallama:char_arrays
Apr 7, 2026

Conversation

@foreverallama
Contributor

Extension of #222 that attempts to fix several bugs in handling char arrays. I've attempted a common decode method decode_char_array in MAT_types.jl. The MAT_v5 and MAT_HDF5 readers now just read raw integer data, and then pass that to decode_char_array where the data is decoded. Seems to work with N-D arrays without issue.

So far all the test cases for read pass. I'll add some more test cases to check this out, particularly with unicode characters and N-D arrays. Will need to see how to incorporate the same changes in MAT_v4.jl as well.

CC @matthijscox

@matthijscox
Member

Looks like a great clean-up!

I'm only slightly worried about performance because I know val-based dispatching can be slower than normal dispatching. See e.g. https://github.com/ztangent/ValSplit.jl

@foreverallama
Contributor Author

The reference you linked mentions this:

Note that dynamic dispatch does not always occur: When there are a small number of values to split on (less than 4, as of Julia 1.6), the Julia compiler automatically generates a switch statement

Here, the Val argument has 3 types (UTF-8, UTF-16, UTF-32), so I don't think the dispatching here impacts performance. In any case, an alternative would be a manual if-else with everything done inline. Anyway, only the uint16 type array would need the codec.
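
For context, the two approaches being discussed could be sketched like this; names and codecs here are illustrative, not the PR's actual code:

```julia
# Val-based style: one method per codec symbol; calling through a runtime
# Symbol triggers dynamic dispatch (or a compiler-generated switch).
decode(row, ::Val{:utf16}) = transcode(String, row)
decode(row, ::Val{:utf8})  = String(UInt8.(row))   # assumes each unit fits in one byte
decode(row, codec::Symbol) = decode(row, Val(codec))  # dynamic dispatch happens here

# Manual if-else alternative: no Val types, no dynamic dispatch at all.
function decode_branch(row, codec::Symbol)
    codec === :utf16 ? transcode(String, row) : String(UInt8.(row))
end
```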

What do you think?

@foreverallama
Contributor Author

I don't think we need to update MAT_v4. It doesn't support unicode, so we don't need any special handling. The output type/formatting of the v4 reader is consistent with that of the other formats.

@foreverallama
Contributor Author

Added write support for both Julia string and char types. Added some tests with unicode characters as well. The basic gist is this:

  • For string arrays, the max length of a string is added as the 2nd dimension. Additionally, pad other rows with spaces if necessary for uniform width (which is required by MATLAB).
  • For char arrays, transcode each character to utf-16 code units and merge along rows. Again, pad rows if necessary.
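
The padding step described above could be sketched roughly as follows (illustrative only; the function name is hypothetical, not the PR's code):

```julia
# Pad each string's UTF-16 code units with spaces so every row has the
# width of the longest string, since MATLAB requires rectangular char arrays.
function to_char_matrix(strs::Vector{String})
    units = [transcode(UInt16, s) for s in strs]
    width = maximum(length.(units); init = 0)
    mat = fill(UInt16(' '), length(strs), width)   # space-padded by default
    for (i, u) in enumerate(units)
        mat[i, 1:length(u)] .= u
    end
    return mat
end
```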

I've added a bunch of tests for both read and write, and they all work with my version of MATLAB (2025b). I think I've handled some edge cases like 1D vectors and stuff as well.

Honestly there's probably a better way to do this, could use some pointers there. Till then I believe this is a reasonable solution.

@matthijscox
Member

Here, the Val argument has 3 types (UTF-8, UTF-16, UTF-32), so I don't think the dispatching here impacts performance. In any case, an alternative would be a manual if-else with everything done inline. Anyway, only the uint16 type array would need the codec. What do you think?

I'm okay with a manual if-else in _decode_row, just to be sure. I don't expect us to add many more codecs.

I don't think we need to update MAT_v4. It doesn't support unicode, so we don't need any special handling. The output type/formatting of the v4 reader is consistent with that of the other formats.

Alright.

Honestly there's probably a better way to do this, could use some pointers there. Till then I believe this is a reasonable solution.

At a glance the code looks okay, is there anything in specific you'd like some advice on?

Comment thread test/read.jl
"concatenated_strings" => String["this is a string ", "this is another string"],
"cell_strings" => Any["this is a string" "this is another string"],
- "empty_string" => ""
+ "empty_string" => String[]
Member

ah we go from an empty string to an empty string array now. Was this necessary/cleaner?

Contributor Author

It's cleaner. An empty string was being written as a 1x0 char array when we needed it to be 0x0. This explicitly writes it as 0x0 (or 0x0x0... if defined).

Also note the change in concatenated_strings. MATLAB pads with spaces to align column width. I'm not sure what was happening before, but the spaces were being stripped or something I guess. Now we're preserving the spaces which is in sync with MATLAB behaviour.

@matthijscox
Member

I'm wondering about the next version bump after this. I think this is a serious behavior change for strings/chars, so I suppose we should go from 0.11 to 0.12.

@foreverallama
Contributor Author

foreverallama commented Mar 12, 2026

Thanks for the review!

I'm okay with a manual if-else in _decode_row, just to be sure. I don't expect us to add many more codecs.

Great, I'll make a small update for this.

is there anything in specific you'd like some advice on?

I was just wondering if there's some way to avoid iterating over all elements of the array when encoding.

we should go from 0.11 to 0.12

I agree. I believe there is no issue with backwards compatibility, but it is a major update nonetheless, and could come with its own set of bugs.

@foreverallama
Contributor Author

  • Updated to a symbol-based runtime dispatch of the codec for uint16 arrays.
  • Fix: use endianness-based byte swapping for uint16 arrays with utf8 encoding.
  • Simplified the if-else flow for determining codecs and reading raw char array data.

Comment thread src/MAT_types.jl Outdated
else # :utf8
# Byte swap on LE systems else utf-8 code points will be out of order
bytes = reinterpret(UInt8, ENDIAN_BOM == 0x4030201 ? bswap.(row) : row)
String(filter(!=(0x00), bytes))
Contributor Author

I just realised that this step would actually filter out valid null characters in the original array. We could replace it with something like this:

out = UInt8[]
for i in 1:2:length(bytes)
    msb = bytes[i]        # high byte of the UInt16 container
    lsb = bytes[i + 1]    # low byte
    if msb == 0x00
        push!(out, lsb)   # zero high byte is padding; keep the low byte, even a valid NUL
    else
        push!(out, msb, lsb)  # both bytes carry UTF-8 data
    end
end
String(out)

This will go over the whole bytes array and construct the expected UTF-8 sequence for decoding. It's expensive, but this whole path is just legacy support so I don't think it's a performance critical operation.

What do you think?

Member

sure, let's go for it for now. do we need a small unit test for this case though, if we want to refactor later?

Contributor Author

Unfortunately I don't think we can add a unit test for this case with null characters. This is just legacy load support for some old MAT-files; we don't use the same paths for load/save right now.

The case without null characters is already tested in read.jl with string.mat files in v6 and v7.

Contributor Author

Actually, I managed to add a test here. I directly modified one of the other char files using a binary editor to get an example for this case. It works.

Comment thread src/MAT_types.jl Outdated

decoded = Vector{String}(undef, n_strings)
for i in 1:n_strings
decoded[i] = _decode_row(flat[i, :], codec)
Member

flat[i, :] makes a copy. If you want to consider performance improvements, we could try a view(flat, i, :) instead.

I'm also wondering now if all the reshaping is done without making copies.
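
For reference, the copy-vs-view difference can be illustrated on hypothetical data:

```julia
flat = UInt16[0x0061 0x0062; 0x0063 0x0064]  # 2×2 matrix
row_copy = flat[1, :]          # slicing allocates a fresh Vector{UInt16}
row_view = view(flat, 1, :)    # SubArray over the same memory, no copy
@assert row_copy == row_view   # both index and compare identically
@assert row_view isa SubArray
```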

Contributor Author

I refactored the method; it's much more readable now, and I removed all that reshape stuff. Will try out views in a while.

@matthijscox
Member

Can we consider this PR as finished?

@foreverallama
Contributor Author

Yes, I believe the latest changes have improved from where we started and addressed your concerns as well. There is room for optimisation, which could be addressed in future PRs.

Immediately, I see a place we could optimise in _decode_row(row::AbstractVector{UInt16}, codec):

  • Instead of byte swapping, we could just interchange MSB and LSB based on endianness
  • Pre-allocate the new uint8 buffer to the maximum possible size and shrink it at the end before transcoding, instead of calling push! for every byte.
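
The two ideas above could be sketched like this (illustrative only; the function name is hypothetical and this is not the PR's final code):

```julia
# Pre-allocate the UInt8 buffer to its maximum possible size (2 bytes per
# code unit) and shrink it once at the end, extracting MSB/LSB from each
# code unit's value instead of byte-swapping the whole array.
function utf8_bytes(row::AbstractVector{UInt16})
    out = Vector{UInt8}(undef, 2 * length(row))  # worst case: 2 bytes per unit
    n = 0
    for u in row
        msb, lsb = UInt8(u >> 8), UInt8(u & 0xff)
        if msb != 0x00
            out[n += 1] = msb   # keep both bytes of a packed UTF-8 pair
        end
        out[n += 1] = lsb       # a lone LSB may be ASCII or a valid NUL
    end
    resize!(out, n)
    return String(out)
end
```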

If that sounds good, I'll just refactor that method and I think we're good to go.

The changes here should close #33, #143, #176, and #180.

@matthijscox
Member

Sure, go ahead. Also feel free to version bump, so I can register right after merging this PR.

@matthijscox matthijscox merged commit 73fa0d3 into JuliaIO:master Apr 7, 2026
8 checks passed
@foreverallama foreverallama deleted the char_arrays branch April 7, 2026 15:17