
Allow to search for transactions by data field #2404

Closed
slaweet opened this issue Sep 18, 2018 · 3 comments · Fixed by #2470 or #2513

Comments

@slaweet
Contributor

slaweet commented Sep 18, 2018

Expected behavior

It should be possible to search for 0-type transactions by the optional data field.

/api/transactions/?senderAddress=12498L&data=message would return transactions with sender address 12498L that contain the word message in the data field.

In Lisk Hub, we would use this feature to allow users to look up their transactions by their "reference message".

Actual behavior

It is not possible to filter by the data field:
https://app.swaggerhub.com/apis/LiskHQ/Lisk/1.0.20#/Transactions/getTransactions

@slaweet
Contributor Author

slaweet commented Oct 22, 2018

I'd like to re-open this issue because PR #2470 doesn't solve what I asked for in the description. It should return transactions that contain the search term in the data field, but the PR returns only exact matches.

I was discussing it with @diego-G during implementation; it should work the same way as Core's /api/delegates?search= endpoint: https://app.swaggerhub.com/apis/LiskHQ/Lisk/1.0.20#/Delegates/getDelegates
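
For illustration only (table and column names here are placeholders, and the data column is assumed to be BYTEA, as discussed further down the thread), the difference between the merged exact-match filter and the requested fuzzy match is roughly:

-- exact match (what #2470 implements): returns only rows whose data equals the term
select * from transactions where "data" = CONVERT_TO('message', 'utf8');

-- fuzzy match (what this issue asks for): returns rows whose data contains the term
select * from transactions where "data" like '%message%'::bytea;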

@nazarhussain
Contributor

I did some research on the topic to find the most effective way to implement this feature.

Performance

BYTEA

# PostgreSQL 9.6.9 on x86_64-pc-linux-gnu (Debian 9.6.9-2.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

CREATE OR REPLACE FUNCTION random_string(length integer) 
RETURNS varchar AS $$ 
  SELECT array_to_string(ARRAY(
          SELECT substr('abcdefghijklmnopqrstuv',trunc(random()*21+1)::int,1)       
             FROM generate_series(1,$1)),'')
$$ LANGUAGE sql VOLATILE;
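
The helper can be sanity-checked directly; it returns a random lowercase string of the requested length (the output shown is only an example, since the function is VOLATILE):

select random_string(8);
-- e.g. 'kqbfjemd'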

CREATE TABLE binary_data ("data" BYTEA NOT NULL);

INSERT INTO binary_data ("data") select CONVERT_TO(random_string(64), 'utf8') FROM generate_series(1,1000000);

This gives us 1 million records in the table, stored as BYTEA and containing UTF-8 strings. Now run the following commands to test performance.

# Q1 - select "data" from binary_data where "data" like CONCAT('%', random_string(6) ,'%')::bytea limit 10;
# Q2 - select "data" from binary_data where ENCODE("data", 'escape') like CONCAT('%', random_string(6) ,'%') limit 10;

pgbench -c 30 -T 300 -n -f ./q1.sql -f ./q2.sql -P 5 -S my_db

SQL script 1: ./q1.sql
 - weight: 1 (targets 33.3% of total)
 - 32 transactions (35.2% of total, tps = 0.090486)
 - latency average = 174151.250 ms
 - latency stddev = 2405.413 ms
SQL script 2: ./q2.sql
 - weight: 1 (targets 33.3% of total)
 - 28 transactions (30.8% of total, tps = 0.079175)
 - latency average = 176614.835 ms
 - latency stddev = 3649.871 ms
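
The latency here is dominated by sequential scans: with a leading % wildcard, PostgreSQL cannot use a btree index for the pattern, so every query walks all 1 million rows. This can be confirmed with EXPLAIN (added for clarity, not part of the original benchmark; the search term is a placeholder):

EXPLAIN select "data" from binary_data where "data" like '%abcdef%'::bytea limit 10;
-- expected plan: Limit over a Seq Scan on binary_data, with the LIKE condition as a row filter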

VARCHAR

For comparison, the same test with VARCHAR:

create table varchar_data ("data" VARCHAR(64) NOT NULL);

insert into varchar_data ("data") select random_string(64) FROM generate_series(1,1000000);

# Q3 - select "data" from varchar_data where "data" like CONCAT('%', random_string(7) ,'%') limit 10;

pgbench -c 30 -T 300 -n -f ./q3.sql -P 5 -S my_db

SQL script 1: ./q3.sql
 - weight: 1 (targets 50.0% of total)
 - 60 transactions (44.1% of total, tps = 0.153975)
 - latency average = 192334.567 ms
 - latency stddev = 9490.788 ms

With Index

Let's do that again with an index.


# Q4 - select "data" from varchar_data where "data" like CONCAT('%', random_string(7) ,'%') limit 10;

create index ON varchar_data("data");

pgbench -c 30 -T 300 -n -f ./q4.sql -f ./q5.sql -P 5 -S my_db

SQL script 1: ./q4.sql
 - weight: 1 (targets 50.0% of total)
 - 60 transactions (45.1% of total, tps = 0.150917)
 - latency average = 192468.228 ms
 - latency stddev = 12958.045 ms

Accuracy

Now testing the accuracy of both approaches:

my_db=# INSERT INTO binary_data ("data") VALUES ('I ❤ Lisk');

my_db=# select CONVERT_FROM("data", 'utf8') from binary_data where ENCODE("data", 'escape') like '%❤%' limit 10;
 convert_from
--------------
(0 rows)

my_db=# select CONVERT_FROM("data", 'utf8') from binary_data where "data" like CONCAT('%', '❤' ,'%')::bytea limit 10;
 convert_from
--------------
 I ❤ Lisk
(1 row)
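
The reason the ENCODE variant misses the row: the escape encoding renders every non-ASCII byte as an octal \nnn sequence, so the literal ❤ character never appears in the encoded text (illustrative check, not part of the original comment):

select ENCODE(CONVERT_TO('I ❤ Lisk', 'utf8'), 'escape');
-- returns: I \342\235\244 Lisk
-- so a pattern like '%❤%' can never match the escaped form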

Analysis

  • Partial search or pattern matching directly on BYTEA is too expensive: approximately 12 seconds per transaction.
  • BYTEA with ENCODE has similar performance: approximately 13 seconds per transaction.
  • The approach used in "Allow fuzzy search on data filter when getting transactions - Closes #2404" #2513, i.e. using ENCODE(escape), is not accurate, as it does not match Unicode characters; it only escapes them.
  • A similar search directly on VARCHAR data performs roughly twice as well, at about 6 seconds per transaction.
  • Having an index on VARCHAR data does not improve performance when searching with a %me% pattern.
  • Having an index on VARCHAR and searching for the pattern me% with ILIKE could theoretically improve performance (see the sketch below).
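
For reference (not part of the original benchmark), the plain index created above helps neither a leading-wildcard LIKE nor ILIKE. The index types PostgreSQL can actually use for pattern matching on VARCHAR look roughly like this:

-- btree with pattern ops: lets a prefix-anchored LIKE 'me%' use the index (still not ILIKE)
create index ON varchar_data ("data" varchar_pattern_ops);

-- pg_trgm GIN index: lets unanchored LIKE/ILIKE '%me%' patterns use the index
CREATE EXTENSION IF NOT EXISTS pg_trgm;
create index ON varchar_data USING gin ("data" gin_trgm_ops);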

Suggested actions

The suggested options to move forward are the following:

  1. Disable the zero-byte character in the transaction data field, on the HTTP API only.
  2. Disable the zero-byte character in Lisk-Elements.
  3. Short-term solution: move ahead with BYTEA-to-BYTEA comparison using LIKE, as BYTEA does not support ILIKE (see the sketch below).
  4. Long-term solution: when the time is right, take all exceptions from both networks and convert that column to VARCHAR.

Be aware of the fact that we support 64 bytes, not 64 characters. A Unicode character can consist of more than one byte.
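
For illustration, option 3 from the list above corresponds to a filter of roughly this shape, mirroring Q1 but with a concrete placeholder term instead of random_string:

select "data" from binary_data where "data" like '%message%'::bytea limit 10;
-- LIKE works on BYTEA and matches bytes exactly; there is no ILIKE variant for BYTEA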

@MaciejBaj @diego-G @shuse2 @vitaly-t

@4miners
Contributor

4miners commented Nov 6, 2018

I still don't get why we actually need this feature (search). The use case for which we implemented this additional data field is described in the original issue #26.

Having the ability to get transactions by exact match is useful - for example, as a service or exchange operator I assign unique identifiers to my users, and I'm then able to get all transactions that involve a particular user.

For partial matching it doesn't really make sense: it only adds complexity to the code, and the actual API response will become slower over time. It's also not really the responsibility of Lisk Core to provide such a feature.

I also don't like the idea of converting this column away from BYTEA; the data field is 64 bytes, so we should keep it as bytes and not as a string.

MaciejBaj added a commit that referenced this issue Nov 8, 2018

Allow fuzzy search on data filter when getting transactions - Closes #2404