Output all rows as single list in Protobuf format #16436

Closed
akuzm opened this issue Oct 27, 2020 · 10 comments · Fixed by #35152
Labels: comp-formats (Input / output formats), easy task (Good for first contributors), feature, st-community-taken (External developer is working on that)

Comments

@akuzm
Contributor

akuzm commented Oct 27, 2020

Currently Protobuf outputs each row as a separate message, but some consumers require a list.
We could add a new format ProtobufList that does that.

@akuzm akuzm added feature easy task Good for first contributors comp-formats Input / output formats labels Oct 27, 2020
@vaibhav1865

I would like to work on this

@An-DJ

An-DJ commented Sep 22, 2021

Is there any new progress on this issue? @akuzm

If not, could you assign this issue to me : ) I would like to work on it.

@akuzm
Contributor Author

akuzm commented Sep 22, 2021

> If not, could you assign this issue to me : ) I would like to work on it.

Sure, your help would be much appreciated!

@An-DJ

An-DJ commented Sep 23, 2021

I found that the gRPC interface may already support multi-row output.

Tested with grpcurl:

grpcurl -plaintext -d '{"query": "SELECT metric from system.metrics"}' -proto src/Server/grpc_protos/clickhouse_grpc.proto 127.0.0.1:9100 clickhouse.grpc.ClickHouse/ExecuteQuery

Response:

{
  "output": "UXVlcnkKTWVyZ2UKUGFydE11dGF0aW9uClJlcGxpY2F0ZWRGZXRjaApSZXBsaWNhdGVkU2VuZApSZXBsaWNhdGVkQ2hlY2tzCkJhY2tncm91bmRQb29sVGFzawpCYWNrZ3JvdW5kRmV0Y2hlc1Bvb2xUYXNrCkJhY2tncm91bmRNb3ZlUG9vbFRhc2sKQmFja2dyb3VuZFNjaGVkdWxlUG9vbFRhc2sKQmFja2dyb3VuZEJ1ZmZlckZsdXNoU2NoZWR1bGVQb29sVGFzawpCYWNrZ3JvdW5kRGlzdHJpYnV0ZWRTY2hlZHVsZVBvb2xUYXNrCkJhY2tncm91bmRNZXNzYWdlQnJva2VyU2NoZWR1bGVQb29sVGFzawpDYWNoZURpY3Rpb25hcnlVcGRhdGVRdWV1ZUJhdGNoZXMKQ2FjaGVEaWN0aW9uYXJ5VXBkYXRlUXVldWVLZXlzCkRpc2tTcGFjZVJlc2VydmVkRm9yTWVyZ2UKRGlzdHJpYnV0ZWRTZW5kClF1ZXJ5UHJlZW1wdGVkClRDUENvbm5lY3Rpb24KTXlTUUxDb25uZWN0aW9uCkhUVFBDb25uZWN0aW9uCkludGVyc2VydmVyQ29ubmVjdGlvbgpQb3N0Z3JlU1FMQ29ubmVjdGlvbgpPcGVuRmlsZUZvclJlYWQKT3BlbkZpbGVGb3JXcml0ZQpSZWFkCldyaXRlCk5ldHdvcmtSZWNlaXZlCk5ldHdvcmtTZW5kClNlbmRTY2FsYXJzClNlbmRFeHRlcm5hbFRhYmxlcwpRdWVyeVRocmVhZApSZWFkb25seVJlcGxpY2EKTWVtb3J5VHJhY2tpbmcKRXBoZW1lcmFsTm9kZQpab29LZWVwZXJTZXNzaW9uClpvb0tlZXBlcldhdGNoClpvb0tlZXBlclJlcXVlc3QKRGVsYXllZEluc2VydHMKQ29udGV4dExvY2tXYWl0ClN0b3JhZ2VCdWZmZXJSb3dzClN0b3JhZ2VCdWZmZXJCeXRlcwpEaWN0Q2FjaGVSZXF1ZXN0cwpSZXZpc2lvbgpWZXJzaW9uSW50ZWdlcgpSV0xvY2tXYWl0aW5nUmVhZGVycwpSV0xvY2tXYWl0aW5nV3JpdGVycwpSV0xvY2tBY3RpdmVSZWFkZXJzClJXTG9ja0FjdGl2ZVdyaXRlcnMKR2xvYmFsVGhyZWFkCkdsb2JhbFRocmVhZEFjdGl2ZQpMb2NhbFRocmVhZApMb2NhbFRocmVhZEFjdGl2ZQpEaXN0cmlidXRlZEZpbGVzVG9JbnNlcnQKQnJva2VuRGlzdHJpYnV0ZWRGaWxlc1RvSW5zZXJ0ClRhYmxlc1RvRHJvcFF1ZXVlU2l6ZQpNYXhERExFbnRyeUlECk1heFB1c2hlZERETEVudHJ5SUQKUGFydHNUZW1wb3JhcnkKUGFydHNQcmVDb21taXR0ZWQKUGFydHNDb21taXR0ZWQKUGFydHNPdXRkYXRlZApQYXJ0c0RlbGV0aW5nClBhcnRzRGVsZXRlT25EZXN0cm95ClBhcnRzV2lkZQpQYXJ0c0NvbXBhY3QKUGFydHNJbk1lbW9yeQpNTWFwcGVkRmlsZXMKTU1hcHBlZEZpbGVCeXRlcwpBc3luY0RyYWluZWRDb25uZWN0aW9ucwpBY3RpdmVBc3luY0RyYWluZWRDb25uZWN0aW9ucwpTeW5jRHJhaW5lZENvbm5lY3Rpb25zCkFjdGl2ZVN5bmNEcmFpbmVkQ29ubmVjdGlvbnMKQXN5bmNocm9ub3VzUmVhZFdhaXQK",
  "progress": {
    "readRows": "74",
    "readBytes": "8569"
  },
  "stats": {
    "rows": "74",
    "blocks": "1",
    "allocatedBytes": "6144"
  }
}

We can decode the output base64 string:

Query
Merge
PartMutation
ReplicatedFetch
ReplicatedSend
ReplicatedChecks
BackgroundPoolTask
BackgroundFetchesPoolTask
BackgroundMovePoolTask
BackgroundSchedulePoolTask
BackgroundBufferFlushSchedulePoolTask
BackgroundDistributedSchedulePoolTask
BackgroundMessageBrokerSchedulePoolTask
CacheDictionaryUpdateQueueBatches
CacheDictionaryUpdateQueueKeys
DiskSpaceReservedForMerge
DistributedSend
QueryPreempted
TCPConnection
MySQLConnection
HTTPConnection
InterserverConnection
PostgreSQLConnection
OpenFileForRead
OpenFileForWrite
Read
Write
NetworkReceive
NetworkSend
SendScalars
SendExternalTables
QueryThread
ReadonlyReplica
MemoryTracking
EphemeralNode
ZooKeeperSession
ZooKeeperWatch
ZooKeeperRequest
DelayedInserts
ContextLockWait
StorageBufferRows
StorageBufferBytes
DictCacheRequests
Revision
VersionInteger
RWLockWaitingReaders
RWLockWaitingWriters
RWLockActiveReaders
RWLockActiveWriters
GlobalThread
GlobalThreadActive
LocalThread
LocalThreadActive
DistributedFilesToInsert
BrokenDistributedFilesToInsert
TablesToDropQueueSize
MaxDDLEntryID
MaxPushedDDLEntryID
PartsTemporary
PartsPreCommitted
PartsCommitted
PartsOutdated
PartsDeleting
PartsDeleteOnDestroy
PartsWide
PartsCompact
PartsInMemory
MMappedFiles
MMappedFileBytes
AsyncDrainedConnections
ActiveAsyncDrainedConnections
SyncDrainedConnections
ActiveSyncDrainedConnections
AsynchronousReadWait
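
For readers who want to reproduce the decoding step: grpcurl prints bytes fields as base64 in its JSON output, so the rows can be recovered with a few lines of Python. This is a minimal sketch; the function name decode_output_rows is made up for illustration, and the sample string is only the first 16 characters of the output value shown above. Note that the decoded bytes are newline-separated text, not Protobuf messages.

```python
import base64
from typing import List


def decode_output_rows(output_b64: str) -> List[str]:
    """Decode the base64 `output` field and split it into one string per row."""
    raw = base64.b64decode(output_b64)
    return raw.decode("utf-8").splitlines()


# First 16 characters of the `output` value from the response above.
sample = "UXVlcnkKTWVyZ2UK"
print(decode_output_rows(sample))  # ['Query', 'Merge']
```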

It seems that this issue has already been resolved. Is that right? @akuzm

@An-DJ

An-DJ commented Oct 3, 2021

> Currently Protobuf outputs each row as a separate message, but some consumers require a list. We could add a new format ProtobufList that does that.

Hi @akuzm

As I said above, multi-row output is already implemented in gRPC. All rows are included in the single bytes field output of the message Result.

Do you mean that we should split the output bytes array into a list[bytes], with each row encoded as one element?

@alexey-milovidov
Member

@An-DJ We want to have this option at the format level.
That way, you can use the ProtobufList format with the HTTP interface, in clickhouse-client, in clickhouse-local, everywhere.

@alexey-milovidov alexey-milovidov added the st-community-taken External developer is working on that label Feb 25, 2022
@KochetovNicolai
Member

Robert Schulze will be doing it.

@rschu1ze
Member

Created a GitHub account in the meantime. Feel free to assign to me.

@rschu1ze
Member

rschu1ze commented Mar 2, 2022

Just for my understanding: the docs for the Protobuf input/output format state that the schema file for the current Protobuf format looks like this:

syntax = "proto3";

message MessageType {
  string name = 1;                     // standard ClickHouse String data type
  string surname = 2;                  // standard ClickHouse String data type
  uint32 birthDate = 3;                // standard ClickHouse UInt32 data type
  repeated string phoneNumbers = 4;    // I believe this maps to some ARRAY data type in ClickHouse
};

The table will be serialized as a sequence of messages (one per row). Because each message is prefixed with its byte size (as a varint, i.e. the "length-delimited" format), space is wasted unnecessarily. This becomes more painful as the ratio of table rows to table columns grows.

So the goal would be to add a format ProtobufList with a schema file
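
To make the size-prefixed layout concrete, here is a rough Python sketch of how a consumer could split such a stream back into per-row messages. It assumes only what is stated above (each message is preceded by its byte size as a varint); the helper names read_varint and split_delimited and the sample payloads are hypothetical, and this is not ClickHouse's own reader.

```python
from typing import Iterator, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:              # high bit clear: last byte of the varint
            return value, pos
        shift += 7


def split_delimited(buf: bytes) -> Iterator[bytes]:
    """Yield the serialized bytes of each size-prefixed message in buf."""
    pos = 0
    while pos < len(buf):
        size, pos = read_varint(buf, pos)
        if pos + size > len(buf):
            raise ValueError("truncated message")
        yield buf[pos:pos + size]
        pos += size


# Two placeholder payloads (3 bytes and 1 byte), each preceded by its size.
stream = bytes([3]) + b"abc" + bytes([1]) + b"z"
print(list(split_delimited(stream)))  # [b'abc', b'z']
```

With the Python protobuf package, each yielded chunk could then be handed to ParseFromString on an instance of the generated message class.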

syntax = "proto3";

message MessageType {
  message Row {
    string name = 1;
    string surname = 2;
    uint32 birthDate = 3;
    repeated string phoneNumbers = 4;
  }

  repeated Row rows = 1;    // a field name is required; "rows" is an illustrative choice
};

which produces a list of rows within a single message, see akuzm's first comment.

We would save the repeated per-message size prefix. However, protobuf would still need some way to tell the Rows apart in the serialized representation. As far as I understand, this is done with a standard 1-byte key before each Row, which encodes the field id (1 for Row) and the wire type (I am not sure which one is used for composite structures like this, perhaps "Start Group" as per the protobuf encoding documentation). As a result, the space savings will be smaller than desired.

So my question would be if above proposed format is what you had in mind or something else?
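
On the wire-type question: in the standard protobuf encoding, embedded messages (including each element of a repeated message field) use wire type 2, length-delimited, the same as strings and bytes; "Start Group" belongs to the deprecated group feature. Each Row in the list therefore carries a key byte and its own length varint. The sketch below hand-encodes two rows to compare both layouts. It is an illustration under assumptions, not ClickHouse's implementation: only the name and surname fields are encoded, the sample values and helper names are made up, and any framing around the outer message is ignored.

```python
from typing import List


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_len_field(field_number: int, payload: bytes) -> bytes:
    """Encode a wire type 2 field: key, length, payload."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


def encode_row(name: str, surname: str) -> bytes:
    """Encode only fields 1 (name) and 2 (surname) of the schema above."""
    return (encode_len_field(1, name.encode("utf-8"))
            + encode_len_field(2, surname.encode("utf-8")))


def as_delimited_stream(rows: List[bytes]) -> bytes:
    """One size-prefixed message per row."""
    return b"".join(encode_varint(len(r)) + r for r in rows)


def as_repeated_field(rows: List[bytes], field_number: int = 1) -> bytes:
    """All rows as elements of one repeated message field."""
    return b"".join(encode_len_field(field_number, r) for r in rows)


rows = [encode_row("Ann", "Lee"), encode_row("Bob", "Ray")]
print(len(as_delimited_stream(rows)), len(as_repeated_field(rows)))  # 22 24
```

In this sketch the list layout costs one extra byte per row (the key 0x0A) on top of the length varint, so it is slightly larger than the delimited stream rather than smaller.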

@KochetovNicolai
Member

The schema for ProtobufList looks fine to me. It contains a single repeated field, as I would expect. I suppose the goal here is not to save a few extra bytes, but just to make the format more convenient for users. (Also, as I understand it, packed repeated fields are only available for scalar types.)
