
New built-in function BLOB_APPEND #6983

Merged
merged 1 commit into from Jun 20, 2022

Conversation

hvlad
Member

@hvlad hvlad commented Sep 27, 2021

The regular operator || (concatenation) with BLOB arguments creates a temporary BLOB for every pair of BLOB arguments. This can lead to excessive memory consumption and growth of the database file.
The new BLOB_APPEND function is designed to concatenate (or append to) BLOBs without creating intermediate BLOBs.

More detailed description at
https://github.com/sim1984/blob_append/blob/main/blob_append_en.md
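
The intended usage pattern can be sketched as follows (a hypothetical example based on the linked description, not taken from the PR itself):

```sql
execute block returns (b blob sub_type text)
as
    declare n int = 0;
begin
    -- Old pattern: every || with a BLOB argument materializes a new
    -- temporary blob, so a loop like this creates 1000 of them:
    --   b = b || 'line';

    -- New pattern: blob_append keeps one unclosed temporary blob and
    -- writes into it, so no intermediate blobs are created.
    b = blob_append(null, '');      -- create an appendable temporary blob
    while (n < 1000) do
    begin
        b = blob_append(b, 'line');
        n = n + 1;
    end
    suspend;
end
```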

@asfernandes
Member

I see that the function may fall back to a non-optimized mode of operation (copying the blob) when the blob is already opened.

Also, it was implemented in a simple way that makes this happen frequently.

For example:

execute block
as
    declare variable v1 blob;
begin
    v1 = 'abc';
    v1 = blob_append(v1, 'd');
end!

There was no reason to copy v1.

A much more important example would be a LIST aggregate INTO a variable with a subsequent append to that variable.

I don't support the function the way it is done. IMO things should be different.

BLB_close of a temporary blob should defer closing (BLB_close_on_read) so that many more operations could be optimized.

And then BLOB_APPEND should not fall back to copying. It should fail when it is not possible to append.

Documentation should be added specifying which operations close a just-created blob.

I see value in the function (over RDB$BLOB_UTIL) if done in this way.

Though I think it may be very confusing. It will make blob variables sometimes work by value and sometimes by reference.

@hvlad
Member Author

hvlad commented Sep 27, 2021

I see that the function may fall back to a non-optimized mode of operation (copying the blob) when the blob is already opened.

Also, it was implemented in a simple way that makes this happen frequently.

For example:

execute block
as
    declare variable v1 blob;
begin
    v1 = 'abc';
    v1 = blob_append(v1, 'd');
end!

There was no reason to copy v1.

And it does not copy in this case. Please read the description of the 3rd case for the 1st argument:
= = = = =
The first argument is BLOB or NULL. The following options are possible:
...

  • temporary unclosed BLOB with the BLB_close_on_read flag: it will be used further

= = = = =

A much more important example would be a LIST aggregate INTO a variable with a subsequent append to that variable.

Why is it more important?

I don't support the function the way it is done. IMO things should be different.

BLB_close of a temporary blob should defer closing (BLB_close_on_read) so that many more operations could be optimized.

I don't want to open a can of worms. But perhaps I'm too careful. It could be discussed.

And then BLOB_APPEND should not fall back to copying. It should fail when it is not possible to append.

When is it not possible to append?

Documentation should be added specifying which operations close a just-created blob.

= = = = =
This BLOB will be automatically closed by the engine when the client tries to read it, assign it to a table, or use it in other expressions that require reading the content.
= = = = =
This is the full list; what else do you think should be added?

I see value in the function (over RDB$BLOB_UTIL) if done in this way.

Though I think it may be very confusing. It will make blob variables sometimes work by value and sometimes by reference.

Blob variables always work as references, I'd say, though not everyone understands that.

@aafemt
Contributor

aafemt commented Sep 27, 2021

And it does not copy in this case.

I don't see in the code how, during the initial assignment "v1 = 'abc';", the variable receives the BLB_close_on_read flag.

@hvlad
Member Author

hvlad commented Sep 27, 2021

And it does not copy in this case.

I don't see in the code how, during the initial assignment "v1 = 'abc';", the variable receives the BLB_close_on_read flag.

Ah, sorry. Yes, in this case it has no such flag.
If one needs efficient code, it should be written as:

v1 = blob_append(v1, 'abc', 'd');

or

v1 = blob_append(null, 'abc');
v1 = blob_append(v1, 'd');

@aafemt
Contributor

aafemt commented Sep 27, 2021

Yes, in this case it has no such flag.

Shouldn't it receive one? In the documentation I see an example of a call like BLOB_APPEND('abc', 'def', 'ghi').
AFAIU the first string literal will be implicitly cast to BLOB, and without the flag the content will be copied later, no?
Or is it a mistake in the documentation, and must the first parameter be NULL there?

@AlexPeshkoff
Member

AlexPeshkoff commented Sep 27, 2021 via email

@aafemt
Contributor

aafemt commented Sep 27, 2021

Generally speaking, I would prefer the opposite logic: every blob is created as appendable and receives a "copy on write" flag during materialization.

@asfernandes
Member

A much more important example would be a LIST aggregate INTO a variable with a subsequent append to that variable.

Why is it more important?

It could be a large blob. But what I mean is, every large blob comes from the engine. More examples: UPPER(blob).

I don't support the function the way it is done. IMO things should be different.
BLB_close of a temporary blob should defer closing (BLB_close_on_read) so that many more operations could be optimized.

I don't want to open a can of worms. But perhaps I'm too careful. It could be discussed.

And then BLOB_APPEND should not fall back to copying. It should fail when it is not possible to append.

When is it not possible to append?

Yes, IMO it should be a function that users really know works fast, and they must understand when it can be used.

Otherwise there is a real risk that someone writes code to be fast and, with maintenance, it becomes slower (and untested with large data).

Documentation should be added specifying which operations close a just-created blob.

= = = = =
This BLOB will be automatically closed by the engine when the client tries to read it, assign it to a table, or use it in other expressions that require reading the content.
= = = = =
This is the full list; what else do you think should be added?

OK, but for example OCTET_LENGTH will not need to open it, while CHAR_LENGTH of an MBCS blob will. So the docs should be improved.

I see value in the function (over RDB$BLOB_UTIL) if done in this way.

Though I think it may be very confusing. It will make blob variables sometimes work by value and sometimes by reference.

Blob variables always work as references, I'd say, though not everyone understands that.

In PSQL you'd need to first create-and-close it, then assign it to other variables. And when you assign to a different sub_type or charset, it's copied (so in this case it's like by-value). So in both cases it appears like a by-value operation, with the reference being like an optimization. (I know, there is the blob-id, but the user-visible effect is like by-value in all cases.)

With this function, some (but not all) usages will appear as by-reference, as you may have the blob-id in two variables and updating one will affect the other.

@hvlad
Member Author

hvlad commented Sep 27, 2021

Yes, in this case it has no such flag.

Shouldn't it receive one?

No. This ticket is not about assignment.

In the documentation I see example of call like BLOB_APPEND('abc', 'def', 'ghi').

There is no example with a string literal as the 1st arg. By the doc it should not be allowed, but I'm not sure it
is a good limitation. The committed code creates a new blob (with the BLB_close_on_read flag) and puts the string's
content into it, then it puts the contents of the 2nd and 3rd strings into this blob and returns it as the result.

AFAIU the first string literal will be implicitly cast to BLOB, and without the flag the content will be copied later, no?

No, it is not cast, and BLOB_APPEND sees it as is.

Or is it a mistake in the documentation, and must the first parameter be NULL there?

See above. We might require a blob or NULL as the 1st param, and the initial implementation did so, but
after some discussion it was considered unnecessary.
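
The accepted forms of the first argument, as described above, can be summarized in a sketch (behaviour per the linked description; not verified against the committed code):

```sql
-- NULL as the 1st argument: a new temporary blob with the
-- BLB_close_on_read flag is created.
v = blob_append(null, 'abc');

-- An unclosed temporary blob with BLB_close_on_read as the 1st argument:
-- it is used further, i.e. appended to in place.
v = blob_append(v, 'def');

-- A string as the 1st argument: a new appendable blob is created and the
-- string's content is written into it before the remaining arguments.
v = blob_append('abc', 'def');
```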

@hvlad
Member Author

hvlad commented Sep 27, 2021

Generally speaking, I would prefer the opposite logic: every blob is created as appendable and receives a "copy on write" flag during materialization.

IIRC, it was discussed but considered too risky.
But you may start a new discussion.
Probably we could do it in master, with careful testing.

@aafemt
Contributor

aafemt commented Sep 27, 2021

There is no example with a string literal as the 1st arg.

It is there:

SELECT
      LIST(
        BLOB_APPEND(
          '{',
          JSON_UTILS.NUMERIC_PAIR('age', MEASURE.AGE),
          ',',
          JSON_UTILS.NUMERIC_PAIR('height', MEASURE.HEIGHT_HORSE),
          ',',
          JSON_UTILS.NUMERIC_PAIR('length', MEASURE.LENGTH_HORSE),
          ',',
          JSON_UTILS.NUMERIC_PAIR('chestaround', MEASURE.CHESTAROUND),
          ',',
          JSON_UTILS.NUMERIC_PAIR('wristaround', MEASURE.WRISTAROUND),
          ',',
          JSON_UTILS.NUMERIC_PAIR('weight', MEASURE.WEIGHT_HORSE),
          '}'
        )
      ) AS JSON_M
    FROM MEASURE

@aafemt
Contributor

aafemt commented Sep 27, 2021

Probably we could do it in master with careful testing.

Ah, I didn't notice that this PR targets the stable branch. In this case I'm against it (though my opinion is worth nothing, of course).

@hvlad
Member Author

hvlad commented Sep 27, 2021

A much more important example would be a LIST aggregate INTO a variable with a subsequent append to that variable.

Why is it more important?

It could be a large blob. But what I mean is, every large blob comes from the engine. More examples: UPPER(blob).

But that is not because of BLOB_APPEND's presence. One could use concatenation and get the same old issue.
Any blob could be a large blob, i.e. BLOB_APPEND introduces no new problem here.

I don't support the function the way it is done. IMO things should be different.
BLB_close of a temporary blob should defer closing (BLB_close_on_read) so that many more operations could be optimized.

I don't want to open a can of worms. But perhaps I'm too careful. It could be discussed.

And then BLOB_APPEND should not fall back to copying. It should fail when it is not possible to append.

When is it not possible to append?

Yes, IMO it should be a function that users really know works fast, and they must understand when it can be used.

That's what the documentation is for.

Yes, we might disallow passing a closed blob as the 1st argument, but that gives nothing to the end users, IMHO.
They will put the existing blob into the 2nd arg and NULL into the 1st arg, with the same performance penalty.
Or they will get random errors at run-time, and that is really not good.

Otherwise there is a real risk that someone writes code to be fast and, with maintenance, it becomes slower (and untested with large data).

We can't fix all the mistakes that users might make :)

Documentation should be added specifying which operations close a just-created blob.

= = = = =
This BLOB will be automatically closed by the engine when the client tries to read it, assign it to a table, or use it in other expressions that require reading the content.
= = = = =
This is the full list; what else do you think should be added?

OK, but for example OCTET_LENGTH will not need to open it, while CHAR_LENGTH of an MBCS blob will. So the docs should be improved.

Agree. This case could be documented.
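
A rough illustration of that case (a sketch; whether each function opens the blob is an assumption based on the discussion above):

```sql
execute block returns (olen bigint, clen bigint)
as
    declare b blob sub_type text character set utf8;
begin
    b = blob_append(null, 'abc');

    -- OCTET_LENGTH can be answered from the blob's stored size,
    -- so (per the discussion) it need not open the blob.
    olen = octet_length(b);

    -- CHAR_LENGTH of a multi-byte character set must read the content
    -- to count characters, which closes the just-created blob.
    clen = char_length(b);
    suspend;
end
```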

I see value in the function (over RDB$BLOB_UTIL) if done in this way.
Though I think it may be very confusing. It will make blob variables sometimes work by value and sometimes by reference.

Blob variables always work as references, I'd say, though not everyone understands that.

In PSQL you'd need to first create-and-close it, then assign it to other variables. And when you assign to a different sub_type or charset, it's copied (so in this case it's like by-value). So in both cases it appears like a by-value operation, with the reference being like an optimization. (I know, there is the blob-id, but the user-visible effect is like by-value in all cases.)

Yes, sure.

With this function, some (but not all) usages will appear as by-reference, as you may have the blob-id in two variables and updating one will affect the other.

Yes, and this also could be documented. BTW, if we follow your suggestion to defer blob closure for all blobs,
it could affect existing code and introduce unexpected side effects.

@hvlad
Member Author

hvlad commented Sep 27, 2021

There is no example with a string literal as the 1st arg.

It is there:

You're right. Anyway, I already answered your question, I think.

@hvlad
Member Author

hvlad commented Sep 27, 2021

Probably we could do it in master with careful testing.

Ah, I didn't notice that this PR targets the stable branch. In this case I'm against it (though my opinion is worth nothing, of course).

We'll see.


Forgot to add: this code was developed a few months ago and was extensively tested by a customer.
Not in production, but anyway.
Unfortunately, it was not in time to be included into v4.0.

@aafemt
Contributor

aafemt commented Sep 28, 2021

What should be the result of the following code?

v1 = blob_append(NULL, 'abc');
v2 = blob_append(v1, 'def');
v3 = blob_append(v1, 'ghi');

Will all three variables have the same value?

@hvlad
Member Author

hvlad commented Sep 28, 2021

What should be the result of the following code?

v1 = blob_append(NULL, 'abc');
v2 = blob_append(v1, 'def');
v3 = blob_append(v1, 'ghi');

Will all three variables have the same value?

Yes:

execute block
  returns (b1 blob, b2 blob, b3 blob)
as
begin
  b1 = blob_append(null, 'abc');
  b2 = blob_append(b1,   'def');
  b3 = blob_append(b1,   'ghi');
  suspend;
end
B1 B2 B3
0:1 0:1 0:1

B1: abcdefghi
B2: abcdefghi
B3: abcdefghi

@dyemanov
Member

dyemanov commented Oct 20, 2021

For me, the primary benefit of this function (as compared to RDB$BLOB_UTIL) is simplicity. It covers one common pattern and does it efficiently. So both solutions may surely coexist. And I see no problem if it has some drawbacks when used without care (it should not be worse than explicit concatenation anyway). Of course, all the tricky issues should be documented.

IMO, no limitations should be applied to the first argument, literals and expressions should be allowed (with their contents copied to the newly created blob in the beginning of execution).

As for other possible improvements (e.g. every blob is created as appendable), I don't mind considering them for v5. Am I right that BLOB_APPEND is not going to be affected, except for becoming more "optimized" (not copying the input blob, for example)? I'd hate to introduce BLOB_APPEND now and change its visible behaviour later.

@livius2

livius2 commented Nov 22, 2021

May I ask why this function is needed at all?
Why does a "temporary blob" exist during the whole transaction and increase the database file size?
Only blobs returned to the user in RETURNS params should be preserved.
Adding reference counting would fix all these problems.

CREATE PROCEDURE TEST_BLOBS
RETURNS
 ( myBlob blob sub_type text)
AS
DECLARE VARIABLE b1 blob sub_type text;
DECLARE VARIABLE b2 blob sub_type text;
DECLARE VARIABLE c blob sub_type text;
BEGIN
	b1 = '1111'; --blob
	b2 = '2222'; --blob
	c = b1 || b2; --blob c - no temporary blobs need to be preserved

	b1 = b1 || 'xxx'; --blob b1 append (but maybe it would be better to have b1 |= 'xxx')

	myBlob = b1 || b2 || c; --blob myBlob - no temporary blobs need to be preserved, only myBlob's content
	suspend;
END

And this is totally counterintuitive from the POV of any programming language:

b1 = blob_append(null, 'abc');
b2 = blob_append(b1,   'def');
b3 = blob_append(b1,   'ghi');

B1: abcdefghi
B2: abcdefghi ---this should be B2: abcdef
B3: abcdefghi

It would be better if blob_append were a procedure, not a function,
and used like this:

b1 = 'abc';
blob_append(b1, 'def');
b2 = b1;
blob_append(b1, 'ghi');
b3 = b1;

and then the result would be:
B1: abcdefghi
B2: abcdef
B3: abcdefghi

@hvlad
Member Author

hvlad commented Nov 22, 2021

May I ask why this function is needed at all?

It is explained in the description; please read the first message.

Why does a "temporary blob" exist during the whole transaction and increase the database file size?

Because we have no better way to control its lifetime.

Only blobs returned to the user in RETURNS params should be preserved.

It is not as simple as you think. Consider a blob variable as a reference to the real blob object. Don't forget that a blob can be stored into a table, and then its ID changes. The same blob could then be referred to by its temporary (initial) BlobID and/or by its new (materialized) BlobID. Also, blobs can be referred to by triggers, derived columns (VIEWs) and so on.

Adding reference counting would fix all these problems.

Probably. But nobody has implemented it so far. Guess why? ;)

...

And this is totally counterintuitive from the POV of any programming language.

Only if one is not looking into the details. And the details are not too hard to understand, I believe ;)

b1 = blob_append(null, 'abc');
b2 = blob_append(b1,   'def');
b3 = blob_append(b1,   'ghi');

B1: abcdefghi
B2: abcdefghi ---this should be B2: abcdef
B3: abcdefghi

If you need 'abcdef' in B2, use concatenation: b2 = b1 || 'def'
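
In other words, a sketch of the suggested workaround:

```sql
b1 = blob_append(null, 'abc');
-- Concatenation materializes a separate blob, so b2 becomes an
-- independent snapshot rather than an alias of b1:
b2 = b1 || 'def';   -- b2 = 'abcdef'
```

Note that, per the discussion above, reading b1 closes it, so a subsequent blob_append(b1, ...) would fall back to copying.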

It would be better if blob_append were a procedure, not a function, and used like this.

BLOB_APPEND is special in that it returns a blob that is not closed. There is no way to "re-open" a blob for writing.
Therefore it can't be a procedure.

Please read the description and examples again; I hope you'll change your opinion about BLOB_APPEND.

@livius2

livius2 commented Nov 23, 2021

It is not as simple as you think. Consider a blob variable as a reference to the real blob object. Don't forget that a blob can be stored into a table, and then its ID changes. The same blob could then be referred to by its temporary (initial) BlobID and/or by its new (materialized) BlobID. Also, blobs can be referred to by triggers, derived columns (VIEWs) and so on.

I suppose you overcomplicate things (but maybe I oversimplify them). The real problem is that declared variables in PSQL, and the operations on them, are "remembered" and not forgotten during all the operations inside e.g. a stored procedure.
Only a blob passed outside (via RETURNS, INSERT INTO, UPDATE, or the output of a view) should create a materialized blob, at the time it is passed.

e.g.

CREATE PROCEDURE TEST_BLOBS
RETURNS
 ( myBlob blob sub_type text)
AS
DECLARE VARIABLE b1 blob sub_type text;
DECLARE VARIABLE b2 blob sub_type text;
DECLARE VARIABLE c blob sub_type text;
BEGIN
	b1 = '1111'; --new blob
	b2 = '2222'; --new blob
	c = b1 || b2; --new blob c - no temporary blobs need to be preserved after concatenation

	b1 = b1 || 'xxx'; --blob b1 append (but maybe it would be better to have b1 |= 'xxx')

	myBlob = b1 || b2 || c; --blob myBlob - no temporary blobs need to be preserved, only myBlob's content

	UPDATE XXX SET B = :b1 WHERE ID = 5; --here blob b1 should get the flag (**new when modified**)
	suspend; --myBlob should get the **new when modified** flag as it is passed outside
END

And if you really go in the direction of BLOB_APPEND,
then BLOB_APPEND should be a procedure, not a function, with the first parameter passed by reference to keep the blob open (you append to the blob passed via the first parameter).

PS. I can compare a blob to a file here. I can append to it, unless someone accesses it with an exclusive flag.
Such an exclusive flag should be set when the blob is passed outside. But this flag should not be a flag of the blob itself, I suppose.
Only inside PSQL code should there be some structure of "exclusive flags": when it contains a blob id, that blob cannot be modified inside this block,
but it can be modified outside. E.g. a stored procedure gets the exclusive flag for some blob, but the blob itself does not carry such a flag. When the stored procedure does e.g. UPDATE XXX SET B = :b1 WHERE ID = 5;, then b1 has the flag set in the scope of this stored procedure. But when an ON UPDATE trigger is fired, the same rules apply inside that trigger.

@livius2

livius2 commented Nov 23, 2021

Maybe this is more understandable than the description above. This is how it should work during compilation to BLR:
reference processing.

Internal structures, needed only during compilation to BLR:
 REF_LINKS(BlobFrom, ToBlob) --dictionary
 REF_BACK_LINKS(ToBlob, ListOfBlobs)
 LOCK_LIST(Blob)
 LOCK_LIST(Blob)

PROCEDURE
RETURNS (BBB BLOB SUB_TYPE TEXT, CCC BLOB SUB_TYPE TEXT)

DECLARE VARIABLE B1 BLOB SUB_TYPE TEXT;
DECLARE VARIABLE B2 BLOB SUB_TYPE TEXT;
DECLARE VARIABLE B3 BLOB SUB_TYPE TEXT;
DECLARE VARIABLE B4 BLOB SUB_TYPE TEXT;
DECLARE VARIABLE B5 BLOB SUB_TYPE TEXT;
BEGIN
  SELECT B FROM XXX WHERE ID=3 INTO :B1; --B1 add to LOCK_LIST(B1)
  
  
  /*
  REF_LINKS - empty {}
  REF_BACK_LINKS - empty {}
  LOCK_LIST - [B1]
  */ 
  
  
  BBB = B1; --add reference(link to B1 as nothing modified, add to REF_LINKS(BBB, B1) and fill REF_BACK_LINKS(B1, LIST.add(BBB))))
  
  /*
  REF_LINKS - {(BBB, B1)}
  REF_BACK_LINKS - {(B1, [BBB])}
  LOCK_LIST - [B1]
  */   
  
  
  B2 = B1; --add reference(link to B1 as nothing modified, add to REF_LINKS(B2, B1) and fill REF_BACK_LINKS(B1, LIST.add(B2))))
  
  /*
  REF_LINKS - {(BBB, B1), (B2, B1)}
  REF_BACK_LINKS - {(B1, [BBB, B2])}
  LOCK_LIST - [B1]
  */   
  
  B3 = B1; --add reference(link to B1 as nothing modified, add to REF_LINKS(B3, B1) and fill REF_BACK_LINKS(B1, LIST.add(B3))))
  
  /*
  REF_LINKS - {(BBB, B1), (B2, B1), (B3, B1)}
  REF_BACK_LINKS - {(B1, [BBB, B2, B3])}
  LOCK_LIST - [B1]
  */   
  
  B4 = B1 || 'abc'; --B4 is new blob id=6
  
  /*
  REF_LINKS - {(BBB, B1), (B2, B1), (B3, B1)}
  REF_BACK_LINKS - {(B1, [BBB, B2, B3])}
  LOCK_LIST - [B1]
  */ 
  
  CCC = B4; --add reference (link to B4 as nothing is modified; add to REF_LINKS(CCC, B4) and fill REF_BACK_LINKS(B4, LIST.add(CCC)))
  
  /*
  REF_LINKS - {(BBB, B1), (B2, B1), (B3, B1), {CCC, B4}}
  REF_BACK_LINKS - {(B1, [BBB, B2, B3]), (B4, [CCC])}
  LOCK_LIST - [B1]
  */ 
  
  B2 = B2 || 'xyz'; --check REF_LINKS and LOCK_LIST first: if present, create a new blob id=7, else simply append. Here a REF_LINK exists, so create a new blob and remove B2 from the lists: first find it in REF_LINKS, then delete from REF_BACK_LINKS and delete from REF_LINKS
  
  /*
  REF_LINKS - {(BBB, B1), (B3, B1), {CCC, B4}}
  REF_BACK_LINKS - {(B1, [BBB, B3]), (B4, [CCC])}
  LOCK_LIST - [B1]
  */ 
  
  
  UPDATE YYY SET B=:B3 WHERE ID=2; --add B3 to the lock list, LOCK_LIST(B3), and remove it from the refs
  
  /*
  REF_LINKS - {(BBB, B1), {CCC, B4}}
  REF_BACK_LINKS - {(B1, [BBB]), (B4, [CCC])}
  LOCK_LIST - [B1, B3]
  */ 
  
  SUSPEND; --BBB: check LOCK_LIST - nothing (if it contained it, a new blob would be needed); check REF_LINKS - it contains a ref to B1; then check LOCK_LIST for B1 - it contains B1, so a new blob is needed; remove from the lists and add this new blob to LOCK_LIST(BBB)
           --CCC: check LOCK_LIST - nothing; check REF_LINKS - it contains B4; then check LOCK_LIST for B4 - nothing; add to LOCK_LIST(CCC)
  
  B5 = B1;  
  
  /*
  REF_LINKS - {{CCC, B4}, {B5, B1}}
  REF_BACK_LINKS - {(B1, [B5]), (B4, [CCC])}
  LOCK_LIST - [B1, B3, BBB, CCC]
  */  
  
  B1 = B1 || 'cde'; --check LOCK_LIST - it contains B1, so check REF_BACK_LINKS - it contains B5, so B5 should get the content of this blob; remove it from REF_BACK_LINKS(B5), create a new blob for B1, and remove it from LOCK_LIST(B1)

  /*
  REF_LINKS - {{CCC, B4}}
  REF_BACK_LINKS - {(B4, [CCC])}
  LOCK_LIST - [B3, BBB, CCC]
  */    
  
  BBB = BBB || 'def'; --check LOCK_LIST - it is there, so create a new blob and remove it from the lock list; if REF_BACK_LINKS entries are present (here none), put the old content into the first one in the list, add links for the others to it, and create a new blob
END

@livius2

livius2 commented Nov 26, 2021

Is the above readable at all, or does it make no sense to you?

@hvlad
Member Author

hvlad commented Nov 26, 2021

Karol,

I'm currently busy with other things, but when I get back to blobs I'll consider your suggestion.
And yes, it is a bit hard to read :)
Thanks!

@alexeykovyazin

Hi All,
This feature is in active use by several HQbird users, and it provides an improvement on blob concatenation of up to 18 (eighteen) times.
It is fast, well-tested, documented, and ready to appear in 4.0.2, so I think that, regardless of the syntax, it should be accepted. Documentation can be improved, of course.

@pavel-zotov

pavel-zotov commented Feb 23, 2022

Last year I was asked to run intensive tests of the BLOB_APPEND() function, which was assumed to be introduced in FB 4.0.x.
It does not require changes to the ODS or core functionality, but speeds up blob concatenation/appending a lot - see below.

Some issues were found and (quickly) fixed, and several tests were performed (both artificial and on a production DB with real data and complex PSQL).
Eventually we (me and my colleagues) concluded that all was fine and this function would be approved for merge.
It is very sad that it is still not in FB 4.0.x.

I cannot show here the PSQL code from production that was refactored to estimate the performance gain of BLOB_APPEND() - and not only because it is proprietary: there is a lot of complexity and there are dependencies which cannot be dropped.

So, I want to show a benchmark from a simple (synthetic) test:
we start with an initially empty blob and a short string ( v_add char(8) character set octets = x'deadbeefdecade01' ), and then concatenate this string with the blob.

And do this 1000 times, i.e.:

declare n int = 1000;
declare v_add char(8) character set octets = x'deadbeefdecade01';
. . .
    while (n>0) do
    begin
        x = x || lpad('', 65530, v_add);
        n = n-1;
    end

This code was compared with another:

    while (n>0) do
    begin
        x = blob_append( x, lpad('', 65530, v_add) );
        n = n-1;
    end

For both of them following values were logged:

  • elapsed time, ms
  • counters from monitoring tables: reads, writes, fetches and marks.

The test was done on a machine with 128 GB RAM (half of this volume is a RAM drive); the database was created on a Samsung SSD 870 QVO 1TB.
An experimental FB snapshot provided by Vlad was used for this.

I ran the two scripts from the attached .7z FIVE times each:

  • blob-improvements-test-bconcat.sql -- this creates a long blob using the "good old" concatenation operator ( || );
  • blob-improvements-test-blobapp.sql -- this uses BLOB_APPEND for creating the long blob instead of concatenation.

Firebird ran as a service and was restarted before each measurement.

One may see the results in the logs, but I also made an .xlsx file for better readability.

It seems that the ratio of the results does not depend on FB config parameters.
For example, I checked these parameters:
1) DefaultDbCachePages = 256K; TempCacheLimit = 3G
vs
2) DefaultDbCachePages = 1M; TempCacheLimit = 30G
-- and the outcomes seem to be similar.

Content of the firebird.conf that was in use:

DefaultDbCachePages = 1M
FileSystemCacheThreshold = 2G

LockHashSlots = 30011
LockMemSize = 10M
MaxUnflushedWrites = -1
MaxUnflushedWriteTime = -1
ServerMode = Super

TempBlockSize = 2M
TempCacheLimit = 30G

WireCompression = true
WireCrypt = Enabled

RemoteServicePort = 3050
RemoteAuxPort = 3060
AuthClient = Srp, cluster, Legacy_Auth, Win_Sspi 
AuthServer = Srp, cluster, Legacy_Auth, Win_Sspi
UserManager = Srp, Legacy_UserManager

One may see that BLOB_APPEND() is much faster than the usual concatenation: the difference in elapsed time is more than 8400 times(!).

The difference for other counters is also very high: about 200 times for page fetches/marks and about 50 times for memory used/allocated.
An extremely high ratio is seen for the page writes counter: more than 61000.

######################################################################

Of course this was just a trivial and artificial test, but it demonstrates the potential speed gain.
Besides that, DB growth was greatly reduced by applying this function:
in this test blob concatenation leads to a size of ~33 GB, while BLOB_APPEND() leads to only 67 MB.


@alexeykovyazin

alexeykovyazin commented Apr 23, 2022

Hello All, two months have passed since the test results were presented. As there are no comments or objections, let's commit this awesome feature?

@livius2

livius2 commented Apr 23, 2022

There were comments and objections ;-) But I do not know if my comments were understandable.
Even if blob_append is provided instead of a full fix, it should be a procedure, not a function, as described in the comments above.

But I have only a little voice, so you all do what you need.
As a programmer, I know that when such a function is made available, it will be difficult to withdraw it. In my opinion, there is a "simple" implementation that requires "only" compile-level changes to the BLR. But as I wrote, I do not know if my proposal is understandable. Sadly, I am not a C++ programmer.

@alexeykovyazin

Hello,
Your comments are based on the idea that working with BLOBs should be as close as possible to working with files, right?
However, BLOBs are different from files, both in terms of storage and in terms of usage approach. In some cases that could be beneficial, of course.
Currently, blob_append solves a specific task (speeding up concatenation and getting rid of issues with abnormal memory/database growth) in a very efficient way, and it does not have implications or restrictions for possible future improvements for BLOBs.
Yes, it may look like a kind of booster, but it is a very capable, well-tested and ready-to-use booster.

@WarmBooter

My 2 cents: blob_append looks like a fully isolated new feature that is already tested (in HQbird) and works well, bringing an awesome speed enhancement for such operations. As Alexey pointed out, it would not block any further enhancements in blob "optimization", so I really don't see a reason for not committing it and allowing all Firebird users to benefit from it.

@livius2

livius2 commented Apr 23, 2022

@alexeykovyazin

Your comments are based on the idea that working with BLOBs should be as close as possible to working with files, right?

No, I only pointed to files to acquire some lock; my comment is close to reference counting, but with a slightly different mechanism.
It only needs to work during the "compilation" of code to BLR format, to flag which blobs can be modified and which need a new one created, because the current one has been returned and cannot be modified for that reason.
Then blob_append would not be needed at all, and the current concatenation would simply work fast.

@alexeykovyazin

Then blob_append would not be needed at all, and the current concatenation would simply work fast.

Well, I think it is much harder to implement than to describe :) However, I don't see a reason why we cannot use a less universal but working solution right now. If the improved universal concatenation appears in the future (v6 or v7? in 2025-2027), it can coexist with blob_append.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 25, 2022 via email

@arkinform

The approach mentioned by @livius2 is good; we asked about similar capabilities when we ordered the development of the blob_append function. The advantage of blob_append is that it creates a temporary blob (not inside the database file), so all blob concat operations can be executed in fast temporary storage instead of the database file. In parallel we ordered another improvement: a BlobTempSpace option in firebird.conf to enable, by default, creation of all PSQL blobs in temporary space outside the database. I think that the BlobTempSpace option plus @livius2's proposal is the best universal solution. But I think that one thing does not exclude the other. In many RDBMS there is an internal concat function in addition to the standard concatenation syntax. I even proposed giving the "blob_append" function a more common name, for example "concat". So we can implement an internal function which, for current versions, will be more efficient for blobs, and a future universal optimization of concatenation (if it appears) will not conflict with the internal function.

@livius2

livius2 commented Apr 25, 2022

@AlexPeshkoff

In that case we will need a second function to create an initially
ready-for-append blob; at least I do not see a simple way to avoid it.

You only need to recognize where the blob is exposed to the outside world. All other blobs created in PSQL can be modified in any manner. Only a returned blob should stay unmodified, and a new one is needed if someone wants to modify it inside the code, e.g. in the next loop iteration.

@arkinform

@livius2 @AlexPeshkoff Do you have any plans to implement the discussed universal optimization? If yes, then it would be the perfect solution. If not, I think you should consider the current function for optimized blob concatenation (at least as an intermediate solution), because at present concatenation of big blobs in PSQL is a pain. We started using the blob_append function in our HQBird installations and performance increased up to 10 times. I think we are not the only ones with blob performance issues like this, unexpected database growth, and so on.
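To make the pain point concrete, here is a hedged PSQL sketch (table and column names are hypothetical, and BLOB_APPEND behavior is as described in the function's README — a NULL first argument starts a new temporary blob):

```sql
-- Plain concatenation: each loop pass materializes a new temporary
-- blob holding the whole intermediate result.
execute block returns (result blob sub_type text)
as
    declare chunk varchar(1000);
begin
    result = '';
    for select some_text from some_table into :chunk do
        result = result || :chunk;
    suspend;
end

-- BLOB_APPEND: chunks are appended to a single open temporary blob,
-- avoiding the per-pass intermediate copies.
execute block returns (result blob sub_type text)
as
    declare chunk varchar(1000);
begin
    for select some_text from some_table into :chunk do
        result = blob_append(result, :chunk);
    suspend;
end
```

With large blobs the first form does quadratic work (and, without something like BlobTempSpace, grows the database file), which is where the reported speedup comes from.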

@dyemanov
Member

dyemanov commented Jun 13, 2022

Thinking about naming, maybe it should be named APPEND_BLOB rather than BLOB_APPEND. This is more natural from the English POV and better corresponds to other functions like RDB$GET/SET_CONTEXT, MAKE_DBKEY and so on.

@livius2

livius2 commented Jun 13, 2022

Just my 5 cents, but context first, then action, is better. It is more natural from a programming POV.
This is the same convention we use in OO programming,
where we have no other choice, as we must write object.methodName.

@dyemanov
Member

dyemanov commented Jun 13, 2022

Programming may use different rules, see e.g. Win32 API: SetTimer, PostQuitMessage, TerminateProcess, etc.

@aafemt
Contributor

aafemt commented Jun 13, 2022

Oracle has DBMS_LOB.APPEND(). It makes sense to ease migration by using similar naming.

@dyemanov
Member

We already have a PR with BLOB_UTIL package. This function is independent and does not have to mimic anything else.

@dyemanov
Member

Regarding "second function to create initially ready-for-append blob" -- I'd rather avoid it. After all, this function is a custom "shorthand" solution for quick concatenation, and BLOB_UTIL package is also to be added into v5. So I'd prefer to minimize a number of "special" non-packaged functions. And surely this function does not stop us from enhancing blob handling inside the engine in different ways.

@dyemanov dyemanov (Member) left a comment

I generally approve this PR, with a rename suggestion still pending.

@AlexPeshkoff
Member

AlexPeshkoff commented Jun 15, 2022 via email

@dyemanov
Member

Regarding renaming to APPEND_BLOB, I seem to represent a minority so far and I don't want to be a blocker ;-) If nobody posts any additional pros in favor of renaming, let's keep the existing name.

@hvlad hvlad merged commit 294ba14 into v4.0-release Jun 20, 2022
@hvlad hvlad self-assigned this Jun 21, 2022
@hvlad hvlad deleted the work/blob_append branch June 22, 2022 10:43
@asfernandes
Member

It seems this function got merged without documentation (README).

And the link to the external repository no longer works.

@dyemanov
Member

It looks so, although I remember paying attention to this issue before merging. It seems it was forgotten, sigh.

@dyemanov
Member

Ah, no, it wasn't forgotten, just committed separately:
https://github.com/FirebirdSQL/firebird/blob/v4.0-release/doc/sql.extensions/README.blob_append.md
