Implement new "datatype" for blob handling #7739
This could be solved by other means than a new data type, for example by streaming the blobs inline in a result set, or maybe some compound format that contains the blob ID and the first x bytes of the blob (e.g. +/- 32KB, or the actual length if it is less), for example by introducing a new message type such as blr_blob3. If a blob id from a blr_blob3 message is opened, it will be positioned at the first byte after the returned data (or after the end of the blob).
If you know for sure that your data fits in 32k, you don't need BLOBs. And if your data doesn't fit in 32k, you don't want to transfer it over a thin wire without an explicit request.
In a lot of use cases for blobs, the majority of blobs may be relatively small, while there are a few outliers which are larger. So you use blobs to also be able to store the outliers, or because the fact blobs are stored out-of-band has a benefit when querying things that don't involve the blob data. Not to mention that you may need it to avoid the 64KiB row limit. Returning the first ~32KiB inline will return all data for cases where the blob is actually smallish, and for cases where the blob is larger, you will already have the first 32KiB available for processing, which could very well improve performance even for larger blobs.
I also believe this is the way to go. To make it more flexible, I can think of a per-statement or per-attachment setting that controls whether blob data should be streamed inline or not. Also, it could be useful to set the number of bytes to stream inline. And no, I don't like
Well, I needed something to distinguish it from the current behaviour of
Why a setting?
To let the user decide.
On 9/6/23 10:29, Mark Rotteveel wrote:
This could be solved by other means than a new data type, for example
by streaming the blobs inline in a result set, or maybe some compound
format that contains the blob ID and the first x bytes of the blob
(e.g. +/- 32KB or the actual length if it is less).
Using a new datatype, where the first N (not too big a value) bytes of the
BLOB are stored inline, may have one more advantage: such a field may be
indexed. Paradox used to have fields something like that.
Many apps are implemented in the following way: there is a grid with many fields, and the blob content is shown on demand, when the user presses a button. Almost always the result set behind the grid contains all fields, including the blob ones. There is no need, and no way, to show all blobs from all records at the same time. If we always include blob data in the result set, it will slow such apps down significantly. Moreover, it will force users to exclude blob fields from the main result set and create an additional request just to fetch the desired blob - this is not faster than the current separate blob open\fetch\close.
Do we speak about storage or about network transfers here?
Couldn't expression indices serve this goal?
How would applications degrade to having the data not indexed (or not fully indexed) when the field length increases?
On 9/6/23 13:10, Vlad Khorsun wrote:
Using new datatype, where first N (not too big value) bytes of
BLOB is stored inline
We speak about storage or about network transfers here ?
Blob storage is another theme, I believe.
Yes, certainly. But for sure related.
may have one more advantage - such field may be indexed. Something
like Paradox used to have such fields.
Couldn't expression indices serve this goal ?
Am I wrong that, to make use of an expression index in a plan, that
expression should be used in the WHERE clause?
I don't think so, could you explain?
You're correct, but... I still see no relation of the index thing to the storage details, nor to the subject.
On 9/6/23 16:18, Vlad Khorsun wrote:
Blob storage is another theme, I believe.
Yes, certainly. But for sure related.
I don't think so, could you explain ?
If we introduce a new datatype, it's quite logical to transfer it over the
wire in the same way as it's stored on disk. Certainly that's not an absolute
requirement, and it does not make things identical - but there is for sure a relationship.
Couldn't expression indices serve this goal ?
Am I wrong that to make use of expression index in a plan that
expression should be used in WHERE clause?
You correct, but... I still see no relation of index thing with
storage details nor with the subject.
I wanted to say that in order to use an expression index one has to use
non-obvious tricks in SQL statements. With the discussed datatype it can be
WHERE fieldOfNewBlobType starting with 'ABC'
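To illustrate the non-obvious trick mentioned above: with today's expression indices, the exact indexed expression must be repeated in the query for the optimizer to use the index. The table, column, and index names below are invented for illustration; the syntax is a sketch following Firebird's COMPUTED BY expression indices:

```sql
-- Hypothetical table DOCUMENTS with a BLOB column BODY.
-- An expression index over the blob's prefix, cast to an indexable type:
CREATE INDEX IDX_DOC_BODY ON DOCUMENTS
  COMPUTED BY (CAST(SUBSTRING(BODY FROM 1 FOR 100) AS VARCHAR(100)));

-- Today: the WHERE clause must repeat the indexed expression verbatim
-- for the plan to use IDX_DOC_BODY.
SELECT ID FROM DOCUMENTS
WHERE CAST(SUBSTRING(BODY FROM 1 FOR 100) AS VARCHAR(100)) STARTING WITH 'ABC';

-- With the discussed datatype, the same intent could be written directly:
--   SELECT ID FROM DOCUMENTS WHERE BODY STARTING WITH 'ABC';
```

The point of the new datatype is that the inline prefix would make the plain `STARTING WITH` form indexable without repeating a computed expression.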
Why invent a new datatype when IStatement::openCursor() has enough room in the "flags" parameter for a CURSOR_PREFETCH_BLOBS flag?
A client-side record is a message whose format is described using BLR. If we extend the message with a blob data chunk, it should be somehow described. AFAIU, this is what Mark suggests. If you're going to prefetch blobs into some internal buffers of the statement object and feed getSegment() from these buffers, of course a new data type is not required.
Yes, that's what I suggested to do: transfer BLOBs as usual by id, but if the flag is set, send the content of the blob immediately (perhaps requesting it from the client automatically on fetch()) and store it in a temporary file. Then the client application requests the BLOB content as usual, but it turns out to be a client-only call, which is greatly faster.
Blob data can be sent by the server automatically, with the usual sequence of packets, before the op_fetch response, and the client can handle it while waiting for the response. This will make blob delivery zero round-trip.
Hello, let me describe how this has already been done in another "deprecated" API - OLEDB :)
For example, you can bind a BLOB column
In your case, you can define two binds for a BLOB
If the data of bind1 has S_OK status, you use this data. If the data of bind1 has S_ISNULL, then the data of bind2 will have S_ISNULL too. If the data of bind1 has S_TRUNCATED, you can use the data (BLOB ID) from bind2 and read the BLOB through separate calls. OLEDB also allows reading the row data twice, because the fetch operation does not return the row data but a row handle. You can try to read the column with the BLOB directly into a user memory buffer of fixed length. If you get a value with status S_TRUNCATED, you can read this column as storage (BLOB ID) again. Of course, you can always read all the BLOB data directly: just use the DBTYPE_BYREF modifier for the bind datatype. The OLEDB provider will allocate memory for the BLOB data and return a pointer to this buffer; the user must free this memory. This was invented 25 years ago.
I have thought about introducing a new blob data type, e.g.
No new type is needed. It is necessary to optimize the protocol for the current blob implementation, for example, by being able to prefetch a small part of a blob. If the entire blob fits into this prefetch, then open, get_segment and close will not create any additional network packets at all.
@sim1984 In addressing the current issue with blobs, yes, it can conceal some problems, but not the overarching issues related to the lifespan of a blob.
How
AFAIU GBLOB is suggested as PSQL-only and would never appear on the client side.
It appears in selects too, so also on the client side.
Interesting thing, but I think that we must then cast all to
It is not necessary to invent new types; it is necessary to optimize work with the existing ones. New data types will practically not be used, because they require rewriting access components or drivers (odbc, etc.). For example, look at decfloat. Do many components support it?
@sim1984 Adoption of new data types in drivers is slow but increasing. But the current design of blobs rather prohibits speed in this matter: providing a handle to an object which cannot be forgotten, and which can be used/fetched at any time, has big consequences, as you see with the existence of BLOB_APPEND...
There is no need to mix up the method of transferring BLOBs (network protocol) and retaining record versions (database bloat).
Any improvement in this matter will be a huge benefit. We will use whatever is provided.
First, in this case they used to occupy temporary space, not database space.
I hear that FB supports sharing a transaction between two connections. If that is true, and I am right, I can share a transaction between two different processes or computers... and get the blobID in one but read the blobData in another :)
Not a transaction, only a database snapshot.
Hi
Currently we have the BLOB field type, which requires an additional call to retrieve its value. It is very slow.
Consider implementing a different datatype to handle "blob" data. It should be part of the record transmission, like a normal field.
This is especially useful when the data column is read-only, like the result of the LIST function in the query below.
Below are comparison queries for a speed test. Test them from a remote connection (e.g. 20-30 ms ping).
I have tested on FB3 and FB4. The first query uses a blob from the LIST function. The second CASTs the list to VARCHAR(32000), and I added an extra ~1KB of data to every record to show that much more data is retrieved, yet it is still ~60x faster!
query 1. 15089 ms records 540
query 2. 253 ms records 540
query 1.
query 2.
If you would like more records to test, simply change
C.LEVEL<10
to something else, like C.LEVEL<50.
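The original test queries are not reproduced in this thread. As a sketch only, a hypothetical pair of queries matching the description above (the schema and all names are invented; only the LIST-vs-CAST shape mirrors the measurements) might look like:

```sql
-- Hypothetical schema: CHILDS(ID, PARENT_ID, NAME).
WITH RECURSIVE C AS (
  SELECT ID, 1 AS "LEVEL" FROM CHILDS WHERE PARENT_ID IS NULL
  UNION ALL
  SELECT CH.ID, C."LEVEL" + 1
  FROM CHILDS CH JOIN C ON CH.PARENT_ID = C.ID
  WHERE C."LEVEL" < 10
)
-- query 1: LIST() yields a BLOB; fetching each value costs extra
-- open/get_segment/close round trips per row
SELECT C.ID,
       (SELECT LIST(NAME) FROM CHILDS WHERE PARENT_ID = C.ID) AS NAMES
FROM C;

-- query 2: the identical data CAST to VARCHAR travels inline with the
-- record; replace the subselect above with:
--   CAST((SELECT LIST(NAME) FROM CHILDS WHERE PARENT_ID = C.ID)
--        AS VARCHAR(32000)) AS NAMES
```

The ~60x difference reported above comes entirely from the per-row round trips needed to open and read each BLOB, not from the amount of data transferred.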