apacheGH-37653: [MATLAB] Add arrow.array.StructArray MATLAB class (apache#37806)

### Rationale for this change

Now that many of the commonly used "primitive" array types have been added to the MATLAB Interface, we can implement the `arrow.array.StructArray` class.

### What changes are included in this PR?

Added `arrow.array.StructArray` MATLAB class. 

*Methods* of `arrow.array.StructArray` include: 

- `fromArrays(arrays, nvpairs)`
- `field(i)` -> get field `i` as an `arrow.array.Array`, where `i` is either a positive integer index or a field name.
- `toMATLAB()` -> convert to a MATLAB `table`
- `table()` -> convert to a MATLAB `table`

*Properties* of `arrow.array.StructArray` include:

- `Type`
- `Length`
- `NumFields`
- `FieldNames`
- `Valid`

**Example Usage**
```matlab
>> a = arrow.array([1, 2, 3, 4]);
>> b = arrow.array(["A", "B", "C", "D"]);
>> s = arrow.array.StructArray.fromArrays(a, b, FieldNames=["A", "B"])
s = 

-- is_valid: all not null
-- child 0 type: double
  [
    1,
    2,
    3,
    4
  ]
-- child 1 type: string
  [
    "A",
    "B",
    "C",
    "D"
  ]

% Convert StructArray to a MATLAB table
>> t = toMATLAB(s)

t =

  4×2 table

    A     B 
    _    ___

    1    "A"
    2    "B"
    3    "C"
    4    "D"
```

### Are these changes tested?

Yes. Added a new test class `tStructArray.m`

### Are there any user-facing changes?

Yes. Users can now construct an `arrow.array.StructArray` instance. 

### Notes

1. Although [`struct`](https://www.mathworks.com/help/matlab/ref/struct.html) is a MATLAB datatype, `StructArray`'s `toMATLAB` method returns a MATLAB `table`. We went with this design because the layout of a MATLAB `table` more closely resembles that of a `StructArray`: a `table` enforces a consistent schema, and its data is laid out in a columnar format. In a future PR, we plan on adding a `struct` method to `StructArray`, which will return a MATLAB `struct` array.
2. I removed the virtual `toMATLAB` method from `proxy::Array` because the MATLAB classes for nested arrays will implement their `toMATLAB` method by invoking the `toMATLAB` method on their field arrays. There's no need for the C++ proxy classes of nested arrays to have a `toMATLAB` method.
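
To illustrate note 2, here is a minimal, standard-library-only sketch of the pattern: proxy methods are registered by name in each constructor, so only the classes that need `toMATLAB` register it and the base class does not have to declare it. All class names and the `register_method`/`invoke` helpers below are hypothetical stand-ins, not the libmexclass API.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a proxy base: methods are looked up by name at
// run time, so no virtual toMATLAB() is needed on the base class.
class Proxy {
 public:
  virtual ~Proxy() = default;

  // Invoke a registered method by name; throws if it was never registered.
  std::string invoke(const std::string& name) {
    auto it = methods_.find(name);
    if (it == methods_.end()) {
      throw std::runtime_error("Unknown method: " + name);
    }
    return it->second();
  }

  bool has_method(const std::string& name) const {
    return methods_.count(name) > 0;
  }

 protected:
  void register_method(const std::string& name, std::function<std::string()> fn) {
    methods_[name] = std::move(fn);
  }

 private:
  std::map<std::string, std::function<std::string()>> methods_;
};

// Methods shared by every array type are registered in the common base.
class ArrayProxy : public Proxy {
 public:
  ArrayProxy() {
    register_method("getLength", [this] { return getLength(); });
  }

 protected:
  std::string getLength() { return "4"; }
};

// A primitive array opts in to toMATLAB in its own constructor.
class BooleanArrayProxy : public ArrayProxy {
 public:
  BooleanArrayProxy() {
    register_method("toMATLAB", [this] { return toMATLAB(); });
  }

 protected:
  std::string toMATLAB() { return "logical"; }
};

// A nested array registers no toMATLAB; conversion is left to the caller.
class StructArrayProxy : public ArrayProxy {
 public:
  StructArrayProxy() {
    register_method("getNumFields", [this] { return getNumFields(); });
  }

 protected:
  std::string getNumFields() { return "2"; }
};
```

In this sketch, calling `invoke("toMATLAB")` on a `StructArrayProxy` fails at run time with an "Unknown method" error, while the same call on a `BooleanArrayProxy` succeeds.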

### Future Directions
1. Add a `fromMATLAB` static method to create `StructArray`s from MATLAB `table`s and MATLAB `struct` arrays.
2. Add a `fromTable` static method to create `StructArray`s from `arrow.tabular.Table`s.
3. Add a `fromRecordBatch` static method to create `StructArray`s from `arrow.tabular.RecordBatch`es.

* Closes: apache#37653 

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Kevin Gurney <kgurney@mathworks.com>
sgilmore10 authored and Jeremy Aguilon committed Oct 23, 2023
1 parent ba6b473 commit 78124ba
Showing 41 changed files with 803 additions and 85 deletions.
1 change: 0 additions & 1 deletion matlab/src/cpp/arrow/matlab/array/proxy/array.cc
@@ -31,7 +31,6 @@ namespace arrow::matlab::array::proxy {

 // Register Proxy methods.
 REGISTER_METHOD(Array, toString);
-REGISTER_METHOD(Array, toMATLAB);
 REGISTER_METHOD(Array, getLength);
 REGISTER_METHOD(Array, getValid);
 REGISTER_METHOD(Array, getType);
2 changes: 0 additions & 2 deletions matlab/src/cpp/arrow/matlab/array/proxy/array.h
@@ -42,8 +42,6 @@ class Array : public libmexclass::proxy::Proxy {

 void getType(libmexclass::proxy::method::Context& context);

-virtual void toMATLAB(libmexclass::proxy::method::Context& context) = 0;
-
 void isEqual(libmexclass::proxy::method::Context& context);

 std::shared_ptr<arrow::Array> array;
4 changes: 3 additions & 1 deletion matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.cc
@@ -25,7 +25,9 @@
 namespace arrow::matlab::array::proxy {

 BooleanArray::BooleanArray(std::shared_ptr<arrow::BooleanArray> array)
-: arrow::matlab::array::proxy::Array{std::move(array)} {}
+: arrow::matlab::array::proxy::Array{std::move(array)} {
+REGISTER_METHOD(BooleanArray, toMATLAB);
+}

 libmexclass::proxy::MakeResult BooleanArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
 ::matlab::data::StructArray opts = constructor_arguments[0];
2 changes: 1 addition & 1 deletion matlab/src/cpp/arrow/matlab/array/proxy/boolean_array.h
@@ -31,7 +31,7 @@ namespace arrow::matlab::array::proxy {
 static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);

 protected:
-void toMATLAB(libmexclass::proxy::method::Context& context) override;
+void toMATLAB(libmexclass::proxy::method::Context& context);
 };

 }
6 changes: 4 additions & 2 deletions matlab/src/cpp/arrow/matlab/array/proxy/numeric_array.h
@@ -40,7 +40,9 @@ class NumericArray : public arrow::matlab::array::proxy::Array {
 public:

 NumericArray(const std::shared_ptr<arrow::NumericArray<ArrowType>> numeric_array)
-: arrow::matlab::array::proxy::Array{std::move(numeric_array)} {}
+: arrow::matlab::array::proxy::Array{std::move(numeric_array)} {
+REGISTER_METHOD(NumericArray, toMATLAB);
+}

 static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
 using MatlabBuffer = arrow::matlab::buffer::MatlabBuffer;
@@ -67,7 +69,7 @@ class NumericArray : public arrow::matlab::array::proxy::Array {
 }

 protected:
-void toMATLAB(libmexclass::proxy::method::Context& context) override {
+void toMATLAB(libmexclass::proxy::method::Context& context) {
 using CType = typename arrow::TypeTraits<ArrowType>::CType;
 using NumericArray = arrow::NumericArray<ArrowType>;

4 changes: 3 additions & 1 deletion matlab/src/cpp/arrow/matlab/array/proxy/string_array.cc
@@ -28,7 +28,9 @@
 namespace arrow::matlab::array::proxy {

 StringArray::StringArray(const std::shared_ptr<arrow::StringArray> string_array)
-: arrow::matlab::array::proxy::Array(std::move(string_array)) {}
+: arrow::matlab::array::proxy::Array(std::move(string_array)) {
+REGISTER_METHOD(StringArray, toMATLAB);
+}

 libmexclass::proxy::MakeResult StringArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
 namespace mda = ::matlab::data;
2 changes: 1 addition & 1 deletion matlab/src/cpp/arrow/matlab/array/proxy/string_array.h
@@ -32,7 +32,7 @@ namespace arrow::matlab::array::proxy {
 static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);

 protected:
-void toMATLAB(libmexclass::proxy::method::Context& context) override;
+void toMATLAB(libmexclass::proxy::method::Context& context);
 };

 }
199 changes: 199 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/struct_array.cc
@@ -0,0 +1,199 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "arrow/matlab/array/proxy/struct_array.h"
#include "arrow/matlab/array/proxy/wrap.h"
#include "arrow/matlab/bit/pack.h"
#include "arrow/matlab/error/error.h"
#include "arrow/matlab/index/validate.h"

#include "arrow/util/utf8.h"

#include "libmexclass/proxy/ProxyManager.h"

namespace arrow::matlab::array::proxy {

StructArray::StructArray(std::shared_ptr<arrow::StructArray> struct_array)
: proxy::Array{std::move(struct_array)} {
REGISTER_METHOD(StructArray, getNumFields);
REGISTER_METHOD(StructArray, getFieldByIndex);
REGISTER_METHOD(StructArray, getFieldByName);
REGISTER_METHOD(StructArray, getFieldNames);
}

libmexclass::proxy::MakeResult StructArray::make(const libmexclass::proxy::FunctionArguments& constructor_arguments) {
namespace mda = ::matlab::data;
using libmexclass::proxy::ProxyManager;

mda::StructArray opts = constructor_arguments[0];
const mda::TypedArray<uint64_t> arrow_array_proxy_ids = opts[0]["ArrayProxyIDs"];
const mda::StringArray field_names_mda = opts[0]["FieldNames"];
const mda::TypedArray<bool> validity_bitmap_mda = opts[0]["Valid"];

std::vector<std::shared_ptr<arrow::Array>> arrow_arrays;
arrow_arrays.reserve(arrow_array_proxy_ids.getNumberOfElements());

// Retrieve all of the Arrow Array Proxy instances from the libmexclass ProxyManager.
for (const auto& arrow_array_proxy_id : arrow_array_proxy_ids) {
auto proxy = ProxyManager::getProxy(arrow_array_proxy_id);
auto arrow_array_proxy = std::static_pointer_cast<proxy::Array>(proxy);
auto arrow_array = arrow_array_proxy->unwrap();
arrow_arrays.push_back(arrow_array);
}

// Convert the utf-16 encoded field names into utf-8 encoded strings
std::vector<std::string> field_names;
field_names.reserve(field_names_mda.getNumberOfElements());
for (const auto& field_name : field_names_mda) {
const auto field_name_utf16 = std::u16string(field_name);
MATLAB_ASSIGN_OR_ERROR(const auto field_name_utf8,
arrow::util::UTF16StringToUTF8(field_name_utf16),
error::UNICODE_CONVERSION_ERROR_ID);
field_names.push_back(field_name_utf8);
}

// Pack the validity bitmap values.
MATLAB_ASSIGN_OR_ERROR(auto validity_bitmap_buffer,
bit::packValid(validity_bitmap_mda),
error::BITPACK_VALIDITY_BITMAP_ERROR_ID);

// Create the StructArray
MATLAB_ASSIGN_OR_ERROR(auto array,
arrow::StructArray::Make(arrow_arrays, field_names, validity_bitmap_buffer),
error::STRUCT_ARRAY_MAKE_FAILED);

// Construct the StructArray Proxy
auto struct_array = std::static_pointer_cast<arrow::StructArray>(array);
return std::make_shared<proxy::StructArray>(std::move(struct_array));
}
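
The "Pack the validity bitmap values" step in `make` above turns one logical value per element into one bit per element. A rough sketch of that idea follows, assuming the least-significant-bit-first numbering that Arrow validity bitmaps use. `pack_valid` and `is_valid` are hypothetical helpers written for illustration; they are not the actual `bit::packValid` implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the real bit::packValid): packs one bool per
// element into a bitmap, least-significant bit first within each byte.
std::vector<std::uint8_t> pack_valid(const std::vector<bool>& valid) {
  std::vector<std::uint8_t> bitmap((valid.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (valid[i]) {
      bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}

// Reads element i back out of a packed bitmap.
bool is_valid(const std::vector<std::uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1u) != 0;
}
```

For example, the values `{true, false, true, true}` set bits 0, 2, and 3 of the first byte, giving `0x0D`.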

void StructArray::getNumFields(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;

mda::ArrayFactory factory;
const auto num_fields = array->type()->num_fields();
context.outputs[0] = factory.createScalar(num_fields);
}

void StructArray::getFieldByIndex(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
using namespace libmexclass::proxy;

mda::StructArray args = context.inputs[0];
const mda::TypedArray<int32_t> index_mda = args[0]["Index"];
const auto matlab_index = int32_t(index_mda[0]);

auto struct_array = std::static_pointer_cast<arrow::StructArray>(array);

const auto num_fields = struct_array->type()->num_fields();

// Validate there is at least 1 field
MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(
index::validateNonEmptyContainer(num_fields),
context, error::INDEX_EMPTY_CONTAINER);

// Validate the matlab index provided is within the range [1, num_fields]
MATLAB_ERROR_IF_NOT_OK_WITH_CONTEXT(
index::validateInRange(matlab_index, num_fields),
context, error::INDEX_OUT_OF_RANGE);

// Note: MATLAB uses 1-based indexing, so subtract 1.
const int32_t index = matlab_index - 1;

auto field_array = struct_array->field(index);

// Wrap the array within a proxy object if possible.
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto field_array_proxy,
proxy::wrap(field_array),
context, error::UNKNOWN_PROXY_FOR_ARRAY_TYPE);
const auto field_array_proxy_id = ProxyManager::manageProxy(field_array_proxy);
const auto type_id = field_array->type_id();

// Return a struct with two fields: ProxyID and TypeID. The MATLAB
// layer will use these values to construct the appropriate MATLAB
// arrow.array.Array subclass.
mda::ArrayFactory factory;
mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
output[0]["ProxyID"] = factory.createScalar(field_array_proxy_id);
output[0]["TypeID"] = factory.createScalar(static_cast<int32_t>(type_id));
context.outputs[0] = output;
}
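
The index handling in `getFieldByIndex` above follows three steps: reject an empty container, check that the 1-based MATLAB index lies in `[1, num_fields]`, then subtract 1. A minimal standalone sketch of that sequence is shown below. The `IndexError`, `IndexResult`, and `to_zero_based` names are hypothetical; they stand in for the `index::validate*` helpers, whose implementation is not shown in this diff.

```cpp
#include <cstdint>

// Hypothetical result types for illustration only.
enum class IndexError { None, EmptyContainer, OutOfRange };

struct IndexResult {
  IndexError error;    // IndexError::None on success
  std::int32_t index;  // 0-based index; only meaningful on success
};

// Validates a 1-based index against a container with num_fields entries and
// converts it to a 0-based index.
IndexResult to_zero_based(std::int32_t matlab_index, std::int32_t num_fields) {
  // Step 1: there must be at least one field to index into.
  if (num_fields <= 0) {
    return {IndexError::EmptyContainer, -1};
  }
  // Step 2: the 1-based index must lie within [1, num_fields].
  if (matlab_index < 1 || matlab_index > num_fields) {
    return {IndexError::OutOfRange, -1};
  }
  // Step 3: MATLAB uses 1-based indexing, so subtract 1.
  return {IndexError::None, matlab_index - 1};
}
```

Checking for the empty container first lets the caller report a more specific error than a generic out-of-range message.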

void StructArray::getFieldByName(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;
using libmexclass::proxy::ProxyManager;

mda::StructArray args = context.inputs[0];

const mda::StringArray name_mda = args[0]["Name"];
const auto name_utf16 = std::u16string(name_mda[0]);
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(const auto name,
arrow::util::UTF16StringToUTF8(name_utf16),
context, error::UNICODE_CONVERSION_ERROR_ID);


auto struct_array = std::static_pointer_cast<arrow::StructArray>(array);
auto field_array = struct_array->GetFieldByName(name);
if (!field_array) {
// Return an error if we could not query the field by name.
const auto msg = "Could not find field named " + name + ".";
context.error = libmexclass::error::Error{
error::ARROW_TABULAR_SCHEMA_AMBIGUOUS_FIELD_NAME, msg};
return;
}

// Wrap the array within a proxy object if possible.
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto field_array_proxy,
proxy::wrap(field_array),
context, error::UNKNOWN_PROXY_FOR_ARRAY_TYPE);
const auto field_array_proxy_id = ProxyManager::manageProxy(field_array_proxy);
const auto type_id = field_array->type_id();

// Return a struct with two fields: ProxyID and TypeID. The MATLAB
// layer will use these values to construct the appropriate MATLAB
// arrow.array.Array subclass.
mda::ArrayFactory factory;
mda::StructArray output = factory.createStructArray({1, 1}, {"ProxyID", "TypeID"});
output[0]["ProxyID"] = factory.createScalar(field_array_proxy_id);
output[0]["TypeID"] = factory.createScalar(static_cast<int32_t>(type_id));
context.outputs[0] = output;
}

void StructArray::getFieldNames(libmexclass::proxy::method::Context& context) {
namespace mda = ::matlab::data;

const auto& fields = array->type()->fields();
const auto num_fields = fields.size();
std::vector<mda::MATLABString> names;
names.reserve(num_fields);

for (size_t i = 0; i < num_fields; ++i) {
auto str_utf8 = fields[i]->name();

// MATLAB strings are UTF-16 encoded. Must convert UTF-8
// encoded field names before returning to MATLAB.
MATLAB_ASSIGN_OR_ERROR_WITH_CONTEXT(auto str_utf16,
arrow::util::UTF8StringToUTF16(str_utf8),
context, error::UNICODE_CONVERSION_ERROR_ID);
const mda::MATLABString matlab_string = mda::MATLABString(std::move(str_utf16));
names.push_back(matlab_string);
}

mda::ArrayFactory factory;
context.outputs[0] = factory.createArray({1, num_fields}, names.begin(), names.end());
}
}
44 changes: 44 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/struct_array.h
@@ -0,0 +1,44 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include "arrow/matlab/array/proxy/array.h"

namespace arrow::matlab::array::proxy {

class StructArray : public arrow::matlab::array::proxy::Array {
public:
StructArray(std::shared_ptr<arrow::StructArray> struct_array);

~StructArray() {}

static libmexclass::proxy::MakeResult make(const libmexclass::proxy::FunctionArguments& constructor_arguments);

protected:

void getNumFields(libmexclass::proxy::method::Context& context);

void getFieldByIndex(libmexclass::proxy::method::Context& context);

void getFieldByName(libmexclass::proxy::method::Context& context);

void getFieldNames(libmexclass::proxy::method::Context& context);

};

}
3 changes: 3 additions & 0 deletions matlab/src/cpp/arrow/matlab/array/proxy/wrap.cc
@@ -21,6 +21,7 @@
 #include "arrow/matlab/array/proxy/boolean_array.h"
 #include "arrow/matlab/array/proxy/numeric_array.h"
 #include "arrow/matlab/array/proxy/string_array.h"
+#include "arrow/matlab/array/proxy/struct_array.h"

 namespace arrow::matlab::array::proxy {

@@ -61,6 +62,8 @@ namespace arrow::matlab::array::proxy {
 return std::make_shared<proxy::NumericArray<arrow::Date64Type>>(std::static_pointer_cast<arrow::Date64Array>(array));
 case ID::STRING:
 return std::make_shared<proxy::StringArray>(std::static_pointer_cast<arrow::StringArray>(array));
+case ID::STRUCT:
+return std::make_shared<proxy::StructArray>(std::static_pointer_cast<arrow::StructArray>(array));
 default:
 return arrow::Status::NotImplemented("Unsupported DataType: " + array->type()->ToString());
 }
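
The `wrap` change above adds one more case to a switch on the array's type ID, with unsupported types falling through to an error. The sketch below shows the same dispatch shape using only the standard library. `TypeId`, `WrapResult`, and the `*Proxy` structs are simplified, hypothetical stand-ins for the Arrow and proxy types in the diff.

```cpp
#include <memory>
#include <string>

// Hypothetical, simplified type IDs; List stands in for an unsupported type.
enum class TypeId { Boolean, String, Struct, List };

struct Proxy {
  virtual ~Proxy() = default;
  virtual std::string name() const = 0;
};

struct BooleanProxy : Proxy {
  std::string name() const override { return "BooleanArray"; }
};

struct StringProxy : Proxy {
  std::string name() const override { return "StringArray"; }
};

struct StructProxy : Proxy {
  std::string name() const override { return "StructArray"; }
};

// Either a proxy or an error message, loosely mirroring a Result type.
struct WrapResult {
  std::unique_ptr<Proxy> proxy;
  std::string error;
};

// Maps a type ID to the matching proxy; unknown IDs produce an error.
WrapResult wrap(TypeId id) {
  switch (id) {
    case TypeId::Boolean:
      return {std::make_unique<BooleanProxy>(), ""};
    case TypeId::String:
      return {std::make_unique<StringProxy>(), ""};
    case TypeId::Struct:
      return {std::make_unique<StructProxy>(), ""};
    default:
      return {nullptr, "Unsupported DataType"};
  }
}
```

With this shape, supporting a new array type means adding one `case`, which is the kind of change the hunk above makes for `ID::STRUCT`.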
2 changes: 1 addition & 1 deletion matlab/src/cpp/arrow/matlab/error/error.h
@@ -195,7 +195,7 @@ namespace arrow::matlab::error {
 static const char* CHUNKED_ARRAY_MAKE_FAILED = "arrow:chunkedarray:MakeFailed";
 static const char* CHUNKED_ARRAY_NUMERIC_INDEX_WITH_EMPTY_CHUNKED_ARRAY = "arrow:chunkedarray:NumericIndexWithEmptyChunkedArray";
 static const char* CHUNKED_ARRAY_INVALID_NUMERIC_CHUNK_INDEX = "arrow:chunkedarray:InvalidNumericChunkIndex";
-
+static const char* STRUCT_ARRAY_MAKE_FAILED = "arrow:array:StructArrayMakeFailed";
 static const char* INDEX_EMPTY_CONTAINER = "arrow:index:EmptyContainer";
 static const char* INDEX_OUT_OF_RANGE = "arrow:index:OutOfRange";
 }
2 changes: 2 additions & 0 deletions matlab/src/cpp/arrow/matlab/proxy/factory.cc
@@ -21,6 +21,7 @@
 #include "arrow/matlab/array/proxy/timestamp_array.h"
 #include "arrow/matlab/array/proxy/time32_array.h"
 #include "arrow/matlab/array/proxy/time64_array.h"
+#include "arrow/matlab/array/proxy/struct_array.h"
 #include "arrow/matlab/array/proxy/chunked_array.h"
 #include "arrow/matlab/tabular/proxy/record_batch.h"
 #include "arrow/matlab/tabular/proxy/table.h"
@@ -57,6 +58,7 @@ libmexclass::proxy::MakeResult Factory::make_proxy(const ClassName& class_name,
 REGISTER_PROXY(arrow.array.proxy.Int64Array , arrow::matlab::array::proxy::NumericArray<arrow::Int64Type>);
 REGISTER_PROXY(arrow.array.proxy.BooleanArray , arrow::matlab::array::proxy::BooleanArray);
 REGISTER_PROXY(arrow.array.proxy.StringArray , arrow::matlab::array::proxy::StringArray);
+REGISTER_PROXY(arrow.array.proxy.StructArray , arrow::matlab::array::proxy::StructArray);
 REGISTER_PROXY(arrow.array.proxy.TimestampArray, arrow::matlab::array::proxy::NumericArray<arrow::TimestampType>);
 REGISTER_PROXY(arrow.array.proxy.Time32Array , arrow::matlab::array::proxy::NumericArray<arrow::Time32Type>);
 REGISTER_PROXY(arrow.array.proxy.Time64Array , arrow::matlab::array::proxy::NumericArray<arrow::Time64Type>);
5 changes: 1 addition & 4 deletions matlab/src/matlab/+arrow/+array/Array.m
@@ -21,12 +21,9 @@
 Proxy
 end

-properties (Dependent)
+properties(Dependent, SetAccess=private, GetAccess=public)
 Length
 Valid % Validity bitmap
-end
-
-properties(Dependent, SetAccess=private, GetAccess=public)
 Type(1, 1) arrow.type.Type
 end

6 changes: 3 additions & 3 deletions matlab/src/matlab/+arrow/+array/BooleanArray.m
@@ -16,8 +16,8 @@
 classdef BooleanArray < arrow.array.Array
 % arrow.array.BooleanArray

-properties (Hidden, SetAccess=private)
-NullSubstitionValue = false;
+properties (Hidden, GetAccess=public, SetAccess=private)
+NullSubstitutionValue = false;
 end

 methods
@@ -35,7 +35,7 @@

 function matlabArray = toMATLAB(obj)
 matlabArray = obj.Proxy.toMATLAB();
-matlabArray(~obj.Valid) = obj.NullSubstitionValue;
+matlabArray(~obj.Valid) = obj.NullSubstitutionValue;
 end
 end
