
Allow to search for transactions by data field #2404

Closed
slaweet opened this issue Sep 18, 2018 · 3 comments · Fixed by #2470 or #2513

Comments

@slaweet
Contributor

slaweet commented Sep 18, 2018

Expected behavior

It should be possible to search for 0-type transactions by the optional data field.

/api/transactions/?senderAddress=12498L&data=message would return transactions with sender address 12498L that contain the word message in the data field.

In Lisk Hub, we would use this feature to allow users to look up their transactions by their "reference message".

Actual behavior

It is not possible to filter by the data field:
https://app.swaggerhub.com/apis/LiskHQ/Lisk/1.0.20#/Transactions/getTransactions

@slaweet
Contributor Author

slaweet commented Oct 22, 2018

I'd like to re-open this issue because PR #2470 doesn't solve what I asked for in the description. It should return transactions that contain the search term in the data field, but the PR returns only exact matches.

I was discussing it with @diego-G during implementation; it should work the same way as Core's /api/delegates?search= endpoint: https://app.swaggerhub.com/apis/LiskHQ/Lisk/1.0.20#/Delegates/getDelegates
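
For illustration only (table and column names here are placeholders, and the data column is assumed to be BYTEA, as discussed further down the thread), the difference between the merged exact-match filter and the requested fuzzy match is roughly:

-- exact match (what #2470 implements): returns only rows whose data equals the term
select * from transactions where "data" = CONVERT_TO('message', 'utf8');

-- fuzzy match (what this issue asks for): returns rows whose data contains the term
select * from transactions where "data" like '%message%'::bytea;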

@nazarhussain
Contributor

I did some research on the topic to find the most effective way to implement this feature.

Performance

BYTEA

# PostgreSQL 9.6.9 on x86_64-pc-linux-gnu (Debian 9.6.9-2.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

CREATE OR REPLACE FUNCTION random_string(length integer) 
RETURNS varchar AS $$ 
  SELECT array_to_string(ARRAY(
          SELECT substr('abcdefghijklmnopqrstuv',trunc(random()*21+1)::int,1)       
             FROM generate_series(1,$1)),'')
$$ LANGUAGE sql VOLATILE;
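
The helper can be sanity-checked directly; it returns a random lowercase string of the requested length (the output shown is only an example, since the function is VOLATILE):

select random_string(8);
-- e.g. 'kqbfjemd'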

CREATE TABLE binary_data ("data" BYTEA NOT NULL);

INSERT INTO binary_data ("data") select CONVERT_TO(random_string(64), 'utf8') FROM generate_series(1,1000000);

This gives us 1 million records in the table, stored as BYTEA and containing UTF-8 strings. Now run the following commands to test performance.

# Q1 - select "data" from binary_data where "data" like CONCAT('%', random_string(6) ,'%')::bytea limit 10;
# Q2 - select "data" from binary_data where ENCODE("data", 'escape') like CONCAT('%', random_string(6) ,'%') limit 10;

pgbench -c 30 -T 300 -n -f ./q1.sql -f ./q2.sql -P 5 -S my_db

SQL script 1: ./q1.sql
 - weight: 1 (targets 33.3% of total)
 - 32 transactions (35.2% of total, tps = 0.090486)
 - latency average = 174151.250 ms
 - latency stddev = 2405.413 ms
SQL script 2: ./q2.sql
 - weight: 1 (targets 33.3% of total)
 - 28 transactions (30.8% of total, tps = 0.079175)
 - latency average = 176614.835 ms
 - latency stddev = 3649.871 ms
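
The latency here is dominated by sequential scans: with a leading % wildcard, PostgreSQL cannot use a btree index for the pattern, so every query walks all 1 million rows. This can be confirmed with EXPLAIN (added for clarity, not part of the original benchmark; the search term is a placeholder):

EXPLAIN select "data" from binary_data where "data" like '%abcdef%'::bytea limit 10;
-- expected plan: Limit over a Seq Scan on binary_data, with the LIKE condition as a row filter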

VARCHAR

For comparison, the same test with VARCHAR:

create table varchar_data ("data" VARCHAR(64) NOT NULL);

insert into varchar_data ("data") select random_string(64) FROM generate_series(1,1000000);

# Q3 - select "data" from varchar_data where "data" like CONCAT('%', random_string(7) ,'%') limit 10;

pgbench -c 30 -T 300 -n -f ./q3.sql -P 5 -S my_db

SQL script 1: ./q3.sql
 - weight: 1 (targets 50.0% of total)
 - 60 transactions (44.1% of total, tps = 0.153975)
 - latency average = 192334.567 ms
 - latency stddev = 9490.788 ms

With Index

Let's do that again with an index.


# Q4 - select "data" from varchar_data where "data" like CONCAT('%', random_string(7) ,'%') limit 10;

create index ON varchar_data("data");

pgbench -c 30 -T 300 -n -f ./q4.sql -f ./q5.sql -P 5 -S my_db

SQL script 1: ./q4.sql
 - weight: 1 (targets 50.0% of total)
 - 60 transactions (45.1% of total, tps = 0.150917)
 - latency average = 192468.228 ms
 - latency stddev = 12958.045 ms

Accuracy

Now testing the accuracy of both approaches:

my_db=# INSERT INTO binary_data ("data") VALUES ('I ❤ Lisk');

my_db=# select CONVERT_FROM("data", 'utf8') from binary_data where ENCODE("data", 'escape') like '%❤%' limit 10;
 convert_from
--------------
(0 rows)

my_db=# select CONVERT_FROM("data", 'utf8') from binary_data where "data" like CONCAT('%', '❤' ,'%')::bytea limit 10;
 convert_from
--------------
 I ❤ Lisk
(1 row)
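
The reason the ENCODE variant misses the row: the escape encoding renders every non-ASCII byte as an octal \nnn sequence, so the literal ❤ character never appears in the encoded text (illustrative check, not part of the original comment):

select ENCODE(CONVERT_TO('I ❤ Lisk', 'utf8'), 'escape');
-- returns: I \342\235\244 Lisk
-- so a pattern like '%❤%' can never match the escaped form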

Analysis

  • Partial search or pattern matching directly on BYTEA is too expensive: approximately 12 seconds per transaction.
  • BYTEA with ENCODE has similar performance: approximately 13 seconds per transaction.
  • The approach used in "Allow fuzzy search on data filter when getting transactions - Closes #2404" #2513, i.e. using ENCODE(escape), is not accurate, as it does not match Unicode characters; it only escapes them.
  • A similar search directly on VARCHAR data performs roughly twice as well, at about 6 seconds per transaction.
  • Having an index on VARCHAR data does not improve performance when searching with a %me% pattern.
  • Having an index on VARCHAR and searching for the pattern me% with ILIKE could theoretically improve performance (see the sketch below).
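
For reference (not part of the original benchmark), the plain index created above helps neither a leading-wildcard LIKE nor ILIKE. The index types PostgreSQL can actually use for pattern matching on VARCHAR look roughly like this:

-- btree with pattern ops: lets a prefix-anchored LIKE 'me%' use the index (still not ILIKE)
create index ON varchar_data ("data" varchar_pattern_ops);

-- pg_trgm GIN index: lets unanchored LIKE/ILIKE '%me%' patterns use the index
CREATE EXTENSION IF NOT EXISTS pg_trgm;
create index ON varchar_data USING gin ("data" gin_trgm_ops);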

Suggested actions

The suggested options to move forward are the following:

  1. Disable the zero-byte character in the transaction data field, on the HTTP API only.
  2. Disable the zero-byte character in Lisk-Elements.
  3. Short-term solution: move ahead with BYTEA-to-BYTEA comparison using LIKE, as BYTEA does not support ILIKE (see the sketch below).
  4. Long-term solution: when the time is right, take all exceptions from both networks and convert that column to VARCHAR.

Be aware of the fact that we support 64 bytes, not 64 characters. A Unicode character can consist of more than one byte.
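
For illustration, option 3 from the list above corresponds to a filter of roughly this shape, mirroring Q1 but with a concrete placeholder term instead of random_string:

select "data" from binary_data where "data" like '%message%'::bytea limit 10;
-- LIKE works on BYTEA and matches bytes exactly; there is no ILIKE variant for BYTEA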

@MaciejBaj @diego-G @shuse2 @vitaly-t

@4miners
Contributor

4miners commented Nov 6, 2018

I still don't get why we actually need this feature (search). The use case for which we implemented this additional data field is described in the original issue #26.

Having the ability to get transactions by exact match is useful - for example, as a service or exchange operator I assign unique identifiers to my users, and I'm then able to get all transactions that involve a particular user.

For partial matching it doesn't really make sense: it only adds complexity to the code, and the actual API response will become slower over time. It's also not really the responsibility of Lisk Core to provide such a feature.

I also don't like the idea of converting this column away from BYTEA; the data field is 64 bytes, so we should keep it as bytes and not as a string.

MaciejBaj added a commit that referenced this issue Nov 8, 2018

Allow fuzzy search on data filter when getting transactions - Closes #2404