
Add Tensor.copyDataInto for allocation-free output reads on Android#8

Merged
jgibson2 merged 2 commits into polycam from jgibson/android-tensor-copyDataInto-into-polycam
Apr 29, 2026

Conversation

@jgibson2
Collaborator

Summary

Adds Tensor.copyDataInto(<TypedBuffer>) overloads to the Android Java API for every dtype, mirroring the version up for review upstream in pytorch/executorch#19171. This eliminates the per-call float[] / int[] / etc. allocation that the getDataAs*Array() accessors perform, by letting the caller supply (and reuse) a destination buffer.

Motivation

Profiling on Android (Perfetto) showed output.toTensor().dataAsFloatArray driving substantial ART GC pressure in the depth-inference loop. Each call allocates a fresh Java float[] sized to the tensor's element count and bulk-copies from the underlying off-heap buffer into it. For a model with two [1, 1, 144, 192] fp32 outputs that's ~110 KB × 2 = 220 KB of Java-heap churn per inference; at 2 fps, that's 440 KB/s just from reading outputs. This was enough to add several young-generation GCs per second, and ~14% of the inference wall time was lost to run-queue contention as GC suspended app threads.

The native side already exposes the underlying off-heap FloatBuffer as a zero-copy view of the C++ tensor's data_ptr() (via the package-private getRawDataBuffer). The new copyDataInto API gives external callers a public way to drain that buffer into a destination they own and reuse across calls.
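As a rough illustration of the float32 case, the drain-into-caller-buffer shape can be sketched as below. This is a self-contained sketch, not the merged code: in the real Tensor.java the source comes from the package-private getRawDataBuffer(), so passing the source buffer as a parameter here is an assumption made purely so the example runs standalone.

```java
import java.nio.FloatBuffer;

public class CopyDataIntoSketch {
  // Sketch of the float32 path. In Tensor.java the source would be the
  // zero-copy view from getRawDataBuffer(); here it is a parameter so the
  // example is self-contained.
  static void copyDataInto(FloatBuffer rawDataBuffer, FloatBuffer dst) {
    // Duplicate so the shared view's position is never disturbed.
    FloatBuffer src = rawDataBuffer.duplicate();
    src.rewind();
    dst.put(src); // bulk put: writes at dst's current position and advances it
  }

  public static void main(String[] args) {
    FloatBuffer src = FloatBuffer.wrap(new float[] {1f, 2f, 3f});
    FloatBuffer dst = FloatBuffer.allocate(3);
    copyDataInto(src, dst);
    System.out.println(dst.position()); // 3: position advanced by the bulk put
  }
}
```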

API

public void copyDataInto(FloatBuffer dst)        // float32 + float16 (widening)
public void copyDataInto(IntBuffer dst)          // int32
public void copyDataInto(LongBuffer dst)         // int64
public void copyDataInto(DoubleBuffer dst)       // float64
public void copyDataInto(ShortBuffer dst)        // float16 raw bits
public void copyDataInto(ByteBuffer dst)         // int8
public void copyDataIntoUnsigned(ByteBuffer dst) // uint8
  • Float16 has both a FloatBuffer overload (per-element half→float widening) and a ShortBuffer overload (raw fp16 bits, no widening).
  • int8 / uint8 are split into separate methods to mirror the existing getDataAsByteArray vs getDataAsUnsignedByteArray distinction — calling the wrong one throws.
  • All overloads write at the destination's current position and advance it (standard Buffer.put semantics).
  • The fp16 → fp32 widening path includes an upfront dst.remaining() capacity check so an undersized destination throws BufferOverflowException before any partial widening is observed (matching the all-or-nothing semantics of the bulk-put paths).
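The fp16 widening behavior described above can be sketched as follows. The method names and the hand-rolled halfToFloat conversion are illustrative assumptions, not the merged code (a javac 17 target cannot use Float.float16ToFloat, which arrived in Java 20):

```java
import java.nio.BufferOverflowException;
import java.nio.FloatBuffer;
import java.nio.ShortBuffer;

public class Fp16WideningSketch {
  // Sketch of the fp16 -> fp32 widening path with the upfront capacity check.
  // copyHalfsWidened and halfToFloat are illustrative names only.
  static void copyHalfsWidened(ShortBuffer rawBits, FloatBuffer dst) {
    ShortBuffer src = rawBits.duplicate();
    src.rewind();
    // Check capacity before writing anything, so an undersized destination
    // fails all-or-nothing instead of after a partial widen.
    if (dst.remaining() < src.remaining()) {
      throw new BufferOverflowException();
    }
    while (src.hasRemaining()) {
      dst.put(halfToFloat(src.get())); // per-element half -> float widening
    }
  }

  // IEEE 754 binary16 -> binary32, covering zeros, subnormals, Inf and NaN.
  static float halfToFloat(short bits) {
    int h = bits & 0xFFFF;
    int sign = (h >>> 15) << 31;
    int exp = (h >>> 10) & 0x1F;
    int mant = h & 0x3FF;
    if (exp == 0x1F) { // Inf / NaN
      return Float.intBitsToFloat(sign | 0x7F800000 | (mant << 13));
    }
    if (exp == 0) {
      if (mant == 0) return Float.intBitsToFloat(sign); // signed zero
      do { mant <<= 1; exp--; } while ((mant & 0x400) == 0); // normalize subnormal
      mant &= 0x3FF;
      exp += 1;
    }
    return Float.intBitsToFloat(sign | ((exp - 15 + 127) << 23) | (mant << 13));
  }
}
```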

Caller-side usage

// One-time setup
FloatBuffer depthBuf = Tensor.allocateFloatBuffer(numelDepth);

// Per inference
EValue[] outputs = module.forward(...);
depthBuf.rewind();
outputs[0].toTensor().copyDataInto(depthBuf);   // no allocation
// ... read from depthBuf ...

Test plan

  • TensorTest.kt covers each variant's happy path, asymmetric int8/uint8 rejection, fp16 raw-bits vs widened paths, position-respecting writes, and BufferOverflowException on undersized destinations (including the fp16 widening path's upfront check).
  • Tensor.java compiles standalone with javac 17.
  • google-java-format (1.23.0, project standard) and ktfmt (Meta style, matches spotless { kotlin { ktfmt() } } config) both clean.
  • Verify against polycam EstimateDepth.kt (forthcoming change to swap dataAsFloatArray for copyDataInto + a 2-slot encoder pool).
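The "2-slot encoder pool" mentioned in the last item is a forthcoming change, not part of this PR, but the reuse pattern it implies might look roughly like this. The pool, the stand-in copy function, and all names here are assumptions for illustration:

```java
import java.nio.FloatBuffer;

public class TwoSlotPoolSketch {
  // Hypothetical 2-slot destination pool: alternate buffers so an async
  // consumer can still read last frame's slot while this frame is written.
  // standInCopy stands in for outputs[0].toTensor().copyDataInto(dst).
  static void standInCopy(float[] fakeNativeData, FloatBuffer dst) {
    dst.put(fakeNativeData);
  }

  public static void main(String[] args) {
    int numel = 4;
    FloatBuffer[] slots = {FloatBuffer.allocate(numel), FloatBuffer.allocate(numel)};
    int next = 0;
    for (int frame = 0; frame < 3; frame++) {
      FloatBuffer dst = slots[next];
      next = (next + 1) % slots.length;
      dst.clear(); // reset position/limit before reuse; no reallocation
      standInCopy(new float[] {frame, frame, frame, frame}, dst);
      dst.flip(); // ready for a consumer to read [0, numel)
    }
    System.out.println(slots[0].remaining()); // elements readable after final flip
  }
}
```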

🤖 Generated with Claude Code

jgibson2 and others added 2 commits April 29, 2026 11:25
…put reads

The existing Tensor.getDataAsFloatArray() allocates a fresh float[] on every
call and copies from the underlying off-heap buffer into it. In a
steady-state inference loop this is a per-frame allocation proportional to
the output tensor size — for a 144x192 single-channel depth output that's
~110 KB per call, multiplied by however many output tensors a model
returns. Profiling on Android (Perfetto) showed this driving substantial
ART GC pressure that ended up stealing CPU back from the inference thread
itself via run-queue contention.

The off-heap FloatBuffer that backs the output Tensor is already owned by
the native side and exposed internally via the package-private
getRawDataBuffer(). copyDataInto(FloatBuffer dst) lets callers reuse a
pre-allocated destination across calls, eliminating the per-call float[]
allocation while keeping ownership unambiguous (the destination is
caller-owned, so it's safe to hand off to async consumers without racing
against the next forward() overwriting native memory).

Implemented for float32 (zero-copy bulk put) and float16 (per-element
half->float widening, mirroring getDataAsFloatArray). Other dtypes inherit
the base-class IllegalStateException default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the previous commit (which added copyDataInto(FloatBuffer) for
float32 / float16) to cover every dtype that the Tensor class supports:

  copyDataInto(ByteBuffer)            int8
  copyDataIntoUnsigned(ByteBuffer)    uint8
  copyDataInto(IntBuffer)             int32
  copyDataInto(LongBuffer)            int64
  copyDataInto(DoubleBuffer)          float64
  copyDataInto(ShortBuffer)           float16 (raw fp16 bits, no widening)

Mirrors the asymmetry of getDataAsByteArray (int8) vs
getDataAsUnsignedByteArray (uint8) — calling copyDataInto(ByteBuffer) on a
uint8 tensor throws, and copyDataIntoUnsigned(ByteBuffer) on int8 throws,
just like the array accessors. The two methods are intentionally separate
even though the underlying bits are identical, so a misuse caused by a
dtype switch surfaces as an exception instead of silently producing values
with the wrong sign interpretation.

The float16 ShortBuffer overload writes the raw 16-bit half-precision bits
without widening; copyDataInto(FloatBuffer) on the same tensor still
performs the half->float widening, matching the dual getDataAsShortArray /
getDataAsFloatArray accessors that float16 already exposes.

Tests cover each new variant's happy path plus the asymmetric int8/uint8
rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@meredithbayne

Nice improvements!

@jgibson2 jgibson2 merged commit 0783572 into polycam Apr 29, 2026
@jgibson2 jgibson2 deleted the jgibson/android-tensor-copyDataInto-into-polycam branch April 29, 2026 17:44
