Disable Object Reuse By Default #1382

rickysaltzer · 2023-06-14T20:18:03Z

Summary

Enabling object reuse by default has the potential to cause unforeseen bugs in user code.
When a user decides to pass a Stream<ClickHouseRecord> back as a List<ClickHouseRecord>, all objects by default will be pointing back to the same object. This can be extremely jarring for users who aren't aware objects are being reused.
Added a reuseObjects() method to quickly enable object reuse when appropriate. This allows the user to decide when memory efficiency is a goal.
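
To make the failure mode concrete, here is a minimal, self-contained sketch of the aliasing problem. It does not use the real client: `MutableRow` and `ReuseDemo` are hypothetical stand-ins, and the only assumption is that an iterator with reuse enabled hands back the same mutable instance on every call to `next()`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a reusable record; NOT the real ClickHouseRecord.
final class MutableRow {
    int value;
}

final class ReuseDemo {
    // Yields rows with values 0..count-1. With reuse=true, every next()
    // call mutates and returns the same shared instance.
    static Iterable<MutableRow> rows(int count, boolean reuse) {
        return () -> new Iterator<MutableRow>() {
            private final MutableRow shared = new MutableRow();
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < count;
            }

            @Override
            public MutableRow next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                MutableRow row = reuse ? shared : new MutableRow();
                row.value = index++;
                return row;
            }
        };
    }

    // Collects the rows into a list first, then reads the values back.
    static List<Integer> collectValues(boolean reuse) {
        List<MutableRow> collected = new ArrayList<>();
        for (MutableRow row : rows(3, reuse)) {
            collected.add(row); // stores a reference, not a snapshot
        }
        List<Integer> values = new ArrayList<>();
        for (MutableRow row : collected) {
            values.add(row.value);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(collectValues(true));  // [2, 2, 2]
        System.out.println(collectValues(false)); // [0, 1, 2]
    }
}
```

With reuse on, the list holds three references to one object, so every element reports the last row's value.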

Checklist

Delete items not relevant to your PR:

Unit and integration tests covering the common scenarios were added
A human-readable description of the changes was provided to include in CHANGELOG
For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials



rickysaltzer · 2023-06-14T20:26:54Z

I use this library from Kotlin code, where Kotlin Collections make it very easy to go from Iterable<ClickHouseRecord> to a Sequence or List. That is what led me to spend quite a long time figuring out why my tests were failing (they expected the data returned from ClickHouse to be correct).

zhicwu · 2023-06-15T05:04:31Z

Thanks for your contribution @rickysaltzer! Besides memory efficiency, object creation also slows things down - the CI failure was because all tests couldn't finish in 15 minutes.

I think the proposed change will greatly impact all reads, so I wouldn't suggest doing that. Have you tried ClickHouseRecord.copy() if you don't need to process many records? Alternatively, we could add a new method in the same class to extract all values. For example:

public Object[] extractValues() {
    int size = size();
    Object[] arr = new Object[size];
    for (int i = 0; i < size; i++) {
        arr[i] = getValue(i).asObject();
    }
    return arr;
}
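
As a rough illustration of how a method like this would be used, the sketch below snapshots each row before the underlying buffer is overwritten. `BufferedRecord`, its `load` method, and `ExtractDemo` are hypothetical stand-ins written for this example, not part of the client. Note that the copy is shallow, so it only protects callers if the individual values are themselves immutable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a record whose backing buffer is overwritten per row.
final class BufferedRecord {
    private final Object[] buffer;

    BufferedRecord(int columns) {
        this.buffer = new Object[columns];
    }

    int size() {
        return buffer.length;
    }

    Object getValue(int index) {
        return buffer[index];
    }

    // Simulates the reader loading the next row into the same buffer.
    void load(Object... values) {
        System.arraycopy(values, 0, buffer, 0, buffer.length);
    }

    // Same idea as the proposed extractValues(): a fresh array on every call.
    Object[] extractValues() {
        int size = size();
        Object[] arr = new Object[size];
        for (int i = 0; i < size; i++) {
            arr[i] = getValue(i);
        }
        return arr;
    }
}

final class ExtractDemo {
    // Keeps a snapshot of each row even though the record itself is reused.
    static List<Object[]> snapshotRows() {
        BufferedRecord record = new BufferedRecord(2);
        List<Object[]> rows = new ArrayList<>();
        record.load(1, "a");
        rows.add(record.extractValues());
        record.load(2, "b");
        rows.add(record.extractValues());
        return rows;
    }

    public static void main(String[] args) {
        for (Object[] row : snapshotRows()) {
            System.out.println(Arrays.toString(row)); // [1, a] then [2, b]
        }
    }
}
```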

rickysaltzer · 2023-06-15T13:51:59Z

I think the underlying question is, should we be silently corrupting data returned from ClickHouse? Because that is exactly what is happening if a user decides to pass the ClickHouseRecord back.

I think in general it's bad API design to rely on a user to read the code implementation (as I had to do) and call .copy() on an object because it's unknowingly backed by a mutable reference.

Can we not simply enable object reuse by default for tests? I think it might be presumptuous of us to assume that disabling it would slow down users' code significantly, because that depends entirely on what they're doing with the Java API. Are they streaming hundreds of millions of rows? Then they should probably consider object reuse. Are they simply performing a large aggregation that returns a few hundred or a few thousand rows? Then object reuse is unlikely to matter much.

Take Apache Flink's API, for example: a streaming platform built for extreme scale and load. It disables object reuse by default, with a big warning that enabling it can lead to bugs in user code (as it did for me).
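
One way to picture the "safe by default, fast on request" design argued for here is the sketch below. The `IntRowReader` class and the shape of its `reuseObjects()` method are assumptions made for illustration; the PR names a `reuseObjects()` method, but its real signature is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical reader; names are illustrative, not the client's real API.
final class IntRowReader {
    static final class Row {
        int value;
    }

    private final int count;
    private boolean reuse = false; // safe default: allocate one Row per record

    IntRowReader(int count) {
        this.count = count;
    }

    // Explicit opt-in for callers that never retain rows.
    IntRowReader reuseObjects() {
        this.reuse = true;
        return this;
    }

    void forEach(Consumer<Row> consumer) {
        Row shared = new Row();
        for (int i = 0; i < count; i++) {
            Row row = reuse ? shared : new Row();
            row.value = i;
            consumer.accept(row);
        }
    }
}

final class OptInDemo {
    // Retains every row; counts how many distinct instances were handed out.
    // Row does not override equals, so distinct() compares by identity.
    static int distinctRetained(IntRowReader reader) {
        List<IntRowReader.Row> kept = new ArrayList<>();
        reader.forEach(kept::add);
        return (int) kept.stream().distinct().count();
    }

    // Streaming aggregate: nothing is retained, so reuse is harmless here.
    static long sum(IntRowReader reader) {
        long[] total = {0};
        reader.forEach(row -> total[0] += row.value);
        return total[0];
    }

    public static void main(String[] args) {
        System.out.println(distinctRetained(new IntRowReader(4)));                // 4
        System.out.println(distinctRetained(new IntRowReader(4).reuseObjects())); // 1
        System.out.println(sum(new IntRowReader(4).reuseObjects()));              // 6
    }
}
```

The trade-off is the one debated in this thread: the default path pays an allocation per row, while the opt-in path is only correct for callers that consume each row before asking for the next one.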

mshustov · 2023-06-15T14:53:15Z

@zhicwu It would be worthwhile to brainstorm alternative changes to API with @rickysaltzer and @mzitnik. Data integrity is a top priority. Let's try to find a balance here.

zhicwu · 2023-06-16T10:11:38Z

Thanks again @rickysaltzer for the inputs! Your points are indeed valid and well-reasoned. However, I would like to emphasize that it's important to consider the differences in memory efficiency and performance between a small library optimized for a single JVM and a distributed middleware like Flink.

As you know, we have multiple APIs to choose from, each with its own characteristics. JDBC is a well-known and mature option, while R2DBC is asynchronous and gaining popularity, although the driver has not been thoroughly tested yet. The Java client, on the other hand, provides better performance and lower memory usage than the others. If performance and memory usage are not a concern, why not stick with JDBC?

Anyway, I think what we're trying to resolve here is how to improve the Java client API to minimize unintended side effects. ClickHouseResponse.records(Class<T>) can be thought of as one attempt: it has no such issue, and it will be faster than records() once ASM is integrated. Apart from that, I hope we can eventually drop ClickHouse*Value and potentially ClickHouseRecord. These were initially added to support a middleware called JDBC bridge, which is no longer one of the goals. Instead, a potential approach would be to use Object[] along with precise APIs to retrieve each value, without relying on implicit type conversions, which can sometimes be problematic.
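
The general idea behind a typed mapping, where each row is turned into a fresh object so that no shared buffer escapes to the caller, might look like the sketch below. This is not how `records(Class<T>)` is implemented; `Click`, `TypedMappingDemo`, and `mapRows` are hypothetical names, and an explicit mapper function stands in for whatever reflection or bytecode generation the client actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical immutable row type used only for this example.
final class Click {
    final long userId;
    final String url;

    Click(long userId, String url) {
        this.userId = userId;
        this.url = url;
    }
}

final class TypedMappingDemo {
    // Reads every raw row through one reusable scratch buffer, but hands the
    // caller a new T per row, so the buffer itself is never retained.
    static <T> List<T> mapRows(Object[][] source, Function<Object[], T> mapper) {
        Object[] scratch = new Object[source.length == 0 ? 0 : source[0].length];
        List<T> out = new ArrayList<>(source.length);
        for (Object[] raw : source) {
            System.arraycopy(raw, 0, scratch, 0, scratch.length);
            out.add(mapper.apply(scratch)); // mapper copies the values out
        }
        return out;
    }

    static List<Click> sample() {
        Object[][] source = { {1L, "/home"}, {2L, "/docs"} };
        // Explicit casts instead of implicit type conversion.
        return mapRows(source, row -> new Click((Long) row[0], (String) row[1]));
    }

    public static void main(String[] args) {
        for (Click click : sample()) {
            System.out.println(click.userId + " " + click.url);
        }
    }
}
```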

@mzitnik & @mshustov, anything to add?

rickysaltzer · 2023-06-16T15:40:01Z

Thanks for your response, I do very much appreciate the efficiency we're trying to maintain, especially when it comes to higher-level APIs leveraging this one.

That being said, I think coming up with an elegant solution to this issue is warranted.

mzitnik · 2023-06-18T11:31:35Z

A few comments here:
I think the API should be self-descriptive, and I would not expect data integrity issues from it. @zhicwu, I understand that this change currently keeps our CI from completing in 15 minutes. Do you have any expectations for when ASM will be integrated?
Thanks, @rickysaltzer, for raising this issue.

rickysaltzer added 5 commits June 14, 2023 15:14

Remove Set Import

c9a9af0

- Decided it was too hard to understand in tests

Make reuse tests use the same SQL literal

68bdabc

Fix test comment

175e756

- Horray copy paste

Last Nit

507c9bf

mshustov requested a review from mzitnik January 19, 2024 10:20
