RW for char arrays with unicode support #234

Merged
matthijscox merged 13 commits into JuliaIO:master from foreverallama:char_arrays
Apr 7, 2026

Conversation

@foreverallama
Contributor

Extension of #222 that attempts to fix several bugs in handling char arrays. I've attempted a common decode method decode_char_array in MAT_types.jl. The MAT_v5 and MAT_HDF5 readers now just read raw integer data, and then pass that to decode_char_array where the data is decoded. Seems to work with N-D arrays without issue.

So far all the test cases for read pass. I'll add some more test cases to check this out, particularly with unicode characters and N-D arrays. Will need to see how to incorporate the same changes in MAT_v4.jl as well.

CC @matthijscox

@matthijscox
Member

Looks like a great clean-up!

I'm only slightly worried about performance because I know val-based dispatching can be slower than normal dispatching. See e.g. https://github.com/ztangent/ValSplit.jl

@foreverallama
Contributor Author

The reference you linked mentions this:

Note that dynamic dispatch does not always occur: When there are a small number of values to split on (less than 4, as of Julia 1.6), the Julia compiler automatically generates a switch statement

Here, the Val argument has 3 types (UTF-8, UTF-16, UTF-32), so I don't think the dispatching here impacts performance. In any case, an alternative would be a manual if-else with everything done inline. Anyway, only the uint16 type array would need the codec.
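
For context, the two approaches being discussed could be sketched like this; names and codecs here are illustrative, not the PR's actual code:

```julia
# Val-based style: one method per codec symbol; calling through a runtime
# Symbol triggers dynamic dispatch (or a compiler-generated switch).
decode(row, ::Val{:utf16}) = transcode(String, row)
decode(row, ::Val{:utf8})  = String(UInt8.(row))   # assumes each unit fits in one byte
decode(row, codec::Symbol) = decode(row, Val(codec))  # dynamic dispatch happens here

# Manual if-else alternative: no Val types, no dynamic dispatch at all.
function decode_branch(row, codec::Symbol)
    codec === :utf16 ? transcode(String, row) : String(UInt8.(row))
end
```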

What do you think?

@foreverallama
Contributor Author

I don't think we need to update MAT_v4. It doesn't support unicode, so we don't need any special handling. The output type/formatting of the v4 reader is consistent with that of the other formats.

@foreverallama
Contributor Author

Added write support for both Julia string and char types. Added some tests with unicode characters as well. The basic gist is this:

  • For string arrays, the max length of a string is added as the 2nd dimension. Additionally, pad other rows with spaces if necessary for uniform width (which is required by MATLAB).
  • For char arrays, transcode each character to utf-16 code units and merge along rows. Again, pad rows if necessary.
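
The padding step described above could be sketched roughly as follows (illustrative only; the function name is hypothetical, not the PR's code):

```julia
# Pad each string's UTF-16 code units with spaces so every row has the
# width of the longest string, since MATLAB requires rectangular char arrays.
function to_char_matrix(strs::Vector{String})
    units = [transcode(UInt16, s) for s in strs]
    width = maximum(length.(units); init = 0)
    mat = fill(UInt16(' '), length(strs), width)   # space-padded by default
    for (i, u) in enumerate(units)
        mat[i, 1:length(u)] .= u
    end
    return mat
end
```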

I've added a bunch of tests for both read and write, and they all work with my version of MATLAB (2025b). I think I've handled some edge cases like 1D vectors and stuff as well.

Honestly there's probably a better way to do this, could use some pointers there. Till then I believe this is a reasonable solution.

@matthijscox
Member

Here, the Val argument has 3 types (UTF-8, UTF-16, UTF-32), so I don't think the dispatching here impacts performance. In any case, an alternative would be a manual if-else with everything done inline. Anyway, only the uint16 type array would need the codec. What do you think?

I'm okay with a manual if-else in _decode_row, just to be sure. I don't expect us to add many more codecs.

I don't think we need to update MAT_v4. It doesn't support unicode, so we don't need any special handling. The output type/formatting of the v4 reader is consistent with that of the other formats.

Alright.

Honestly there's probably a better way to do this, could use some pointers there. Till then I believe this is a reasonable solution.

At a glance the code looks okay, is there anything in specific you'd like some advice on?

Comment thread test/read.jl
"concatenated_strings" => String["this is a string ", "this is another string"],
"cell_strings" => Any["this is a string" "this is another string"],
- "empty_string" => ""
+ "empty_string" => String[]
Member

ah we go from an empty string to an empty string array now. Was this necessary/cleaner?

Contributor Author

It's cleaner. An empty string was being written as a 1x0 char array when we needed it to be 0x0. This explicitly writes it as 0x0 (or 0x0x0... if defined).

Also note the change in concatenated_strings. MATLAB pads with spaces to align column width. I'm not sure what was happening before, but the spaces were being stripped or something I guess. Now we're preserving the spaces which is in sync with MATLAB behaviour.

@matthijscox
Member

I'm wondering about the next version bump after this. I think this is a serious behavior change for strings/chars, so I suppose we should go from 0.11 to 0.12.

@foreverallama
Contributor Author

foreverallama commented Mar 12, 2026

Thanks for the review!

I'm okay with a manual if-else in _decode_row, just to be sure. I don't expect us to add many more codecs.

Great, I'll make a small update for this.

is there anything in specific you'd like some advice on?

I was just wondering if there's some way to avoid iterating over all elements of the array when encoding.

we should go from 0.11 to 0.12

I agree. I believe there is no issue with backwards compatibility, but it is a major update nonetheless, and could come with its own set of bugs.

@foreverallama
Contributor Author

  • Updated to a symbol-based runtime dispatch of the codec for uint16 arrays.
  • Fix: use endianness-based byte swapping for uint16 arrays with utf8 encoding.
  • Simplified the if-else flow for determining codecs and reading raw char array data.

Comment thread src/MAT_types.jl Outdated
else # :utf8
# Byte swap on LE systems else utf-8 code points will be out of order
bytes = reinterpret(UInt8, ENDIAN_BOM == 0x4030201 ? bswap.(row) : row)
String(filter(!=(0x00), bytes))
Contributor Author

I just realised that this step would actually filter out valid null characters in the original array. We could replace it with something like this:

out = UInt8[]
for i in 1:2:length(bytes)
    msb = bytes[i]        # high byte of the UInt16 container
    lsb = bytes[i + 1]    # low byte
    if msb == 0x00
        push!(out, lsb)   # zero high byte is padding; keep the low byte, even a valid NUL
    else
        push!(out, msb, lsb)  # both bytes carry UTF-8 data
    end
end
String(out)

This will go over the whole bytes array and construct the expected UTF-8 sequence for decoding. It's expensive, but this whole path is just legacy support so I don't think it's a performance critical operation.

What do you think?

Member

sure, let's go for it for now. do we need a small unit test for this case though, if we want to refactor later?

Contributor Author

Unfortunately I don't think we can add a unit test for this case with null characters. This is just legacy load support for some old MAT-files; we don't use the same paths for load/save right now.

The case without null characters is already tested in read.jl with string.mat files in v6 and v7.

Contributor Author

Actually, I managed to add a test here. I directly modified one of the other char files using a binary editor to get an example for this case. It works.

Comment thread src/MAT_types.jl Outdated

decoded = Vector{String}(undef, n_strings)
for i in 1:n_strings
decoded[i] = _decode_row(flat[i, :], codec)
Member

flat[i, :] makes a copy. If you want to consider performance improvements, we could try a view(flat, i, :) instead.

I'm also wondering now if all the reshaping is done without making copies.
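
For reference, the copy-vs-view difference can be illustrated on hypothetical data:

```julia
flat = UInt16[0x0061 0x0062; 0x0063 0x0064]  # 2×2 matrix
row_copy = flat[1, :]          # slicing allocates a fresh Vector{UInt16}
row_view = view(flat, 1, :)    # SubArray over the same memory, no copy
@assert row_copy == row_view   # both index and compare identically
@assert row_view isa SubArray
```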

Contributor Author

I refactored the method; it's much more readable now, and I removed all that reshape stuff. Will try out views in a while.

@matthijscox
Member

Can we consider this PR as finished?

@foreverallama
Contributor Author

Yes, I believe the latest changes have improved from where we started and addressed your concerns as well. There is room for optimisation, which could be addressed in future PRs.

Immediately, I see a place we could optimise in _decode_row(row::AbstractVector{UInt16}, codec):

  • Instead of byte swapping, we could just interchange MSB and LSB based on endianness
  • Pre-allocate the new uint8 buffer to the maximum possible size and shrink it at the end before transcoding, instead of calling push! for every byte.
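
The two ideas above could be sketched like this (illustrative only; the function name is hypothetical and this is not the PR's final code):

```julia
# Pre-allocate the UInt8 buffer to its maximum possible size (2 bytes per
# code unit) and shrink it once at the end, extracting MSB/LSB from each
# code unit's value instead of byte-swapping the whole array.
function utf8_bytes(row::AbstractVector{UInt16})
    out = Vector{UInt8}(undef, 2 * length(row))  # worst case: 2 bytes per unit
    n = 0
    for u in row
        msb, lsb = UInt8(u >> 8), UInt8(u & 0xff)
        if msb != 0x00
            out[n += 1] = msb   # keep both bytes of a packed UTF-8 pair
        end
        out[n += 1] = lsb       # a lone LSB may be ASCII or a valid NUL
    end
    resize!(out, n)
    return String(out)
end
```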

If that sounds good, I'll just refactor that method and I think we're good to go.

The changes here should close #33, #143, #176, and #180.

@matthijscox
Member

Sure, go ahead. Also feel free to version bump, so I can register right after merging this PR.

@matthijscox matthijscox merged commit 73fa0d3 into JuliaIO:master Apr 7, 2026
8 checks passed
@foreverallama foreverallama deleted the char_arrays branch April 7, 2026 15:17