Commit 8802a81

Merge branch 'master' of https://github.com/Evolveum/midpoint

katkav committed Dec 8, 2020
2 parents 7ec00fc + 7dd1f4d

Showing 5 changed files with 121 additions and 144 deletions.
69 changes: 0 additions & 69 deletions repo/repo-sqale/README.adoc
@@ -277,32 +277,6 @@ https://stackoverflow.com/questions/722221/how-to-log-postgresql-queries
We may need regular `ANALYZE` and/or `VACUUM`.
This should run regularly - can the DB schedule it itself (autovacuum), should midPoint call it, or will something else trigger it?
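
A hedged sketch of what such maintenance could look like if midPoint (or cron) triggers it -
the thresholds below are illustrative, and autovacuum may already cover this:

[source,sql]
----
-- refresh planner statistics and reclaim dead tuples for a hot table
VACUUM (ANALYZE) m_user;

-- or tune autovacuum per table instead of calling VACUUM manually
ALTER TABLE m_user SET (
    autovacuum_vacuum_scale_factor = 0.05, -- vacuum after ~5% of rows are dead
    autovacuum_analyze_scale_factor = 0.02 -- re-analyze after ~2% of rows change
);
----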

== Pagination

Various types of pagination are summed up in https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/[this article].

For *pagination in the GUI*, `OFFSET`/`LIMIT` seems acceptable despite the performance
degradation with big values of `OFFSET`.
The reason is that the GUI requires random access, and the first pages are accessed more often than later/last pages.
Also, any inconsistency (a suddenly added entry) is easy to explain, and the user probably knows what is happening.
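
A minimal example of this style, assuming the `m_user` table used in the experiment scripts below:

[source,sql]
----
-- page 21 with page size 50; cost grows with the OFFSET value
SELECT oid, name_norm
FROM m_user
ORDER BY name_norm
OFFSET 1000 LIMIT 50;
----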

For long-term processes that need to process many items we use *keyset pagination*, e.g. using the last
ID from the current page to define the next page without any `OFFSET`, only using `LIMIT`.
This is very efficient *even when no additional index besides the PK is used* - and it's natural, too.
This is called a "top-N query", and typically the best approach is to order and filter by the same index.
Even when ordered by ID with filters not using the index, it tends to be very fast unless
the required number of rows satisfying the condition lies far from the beginning of the scan.
While this may skip some items that are added after we processed their page (and, at the same time,
process other items added later that appear on later pages), it is more or less deterministic.
We can also avoid processing "future" items with a `WHERE` clause using the creation timestamp
(or the current maximal ID, if sequential) captured at the start of processing, as sketched below.
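
A sketch of one keyset iteration, assuming ordering by the `oid` PK; `:last_oid` and `:max_oid`
are placeholder parameters, not part of any existing script:

[source,sql]
----
-- :last_oid = highest oid of the previous page (zero UUID for the first page)
-- :max_oid  = maximal oid captured when the processing started, keeps "future" items out
SELECT oid, name_norm
FROM m_user
WHERE oid > :last_oid
  AND oid <= :max_oid
ORDER BY oid
LIMIT 500;
----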

The following techniques are generally not usable for us:

* *TID Scan* and *Keyset with Estimated Bookmarks* do not support `WHERE` clauses.
* *Cursor* pagination causes high client-server coupling and is stateful.
We don't want to hold a cursor for operations that can take long and need transactions.

== Performance drop with volume

TL;DR:
@@ -428,46 +402,3 @@ JIT:
Functions: 18
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
----

== JSONB vs EAV performance

[NOTE]
Performance was measured on a default PG installation in a 2GB RAM VirtualBox VM.
The results can be better (or worse, depending on the host), but not by an order of magnitude.

See link:sql/pgnew-eav-vs-json.sql[] for the whole source.

The structure for JSON is simple: a single table `tjson` with `(oid UUID PK, name VARCHAR, ext JSONB)`.
Column `name` has a unique index, and `ext` has only a `GIN` index by default.
Other indexes are added as needed and are mentioned during the tests.

The structure for the EAV model has a master table `teav` with `(oid UUID PK, name VARCHAR)`
and a detail table `teav_ext_string` with `(owner_oid FK, key VARCHAR, value VARCHAR)`.
The extension table has `PK(owner_oid, key, value)`; `value` could be omitted from the PK if uniqueness is not necessary.
Because the PK covers all columns, it would be an index-organized table in Oracle (PG doesn't have this feature).
In PG the table can be https://www.postgresql.org/docs/13/sql-cluster.html[clustered] by an index,
but this is potentially costly (12 min for 15M rows), and the table is unusable for the duration
because of the `ACCESS EXCLUSIVE` lock.
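
A DDL sketch matching this description - the authoritative version is in the linked script,
and the column lengths here are illustrative:

[source,sql]
----
CREATE TABLE tjson (
    oid UUID NOT NULL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE,
    ext JSONB
);
CREATE INDEX tjson_ext_idx ON tjson USING gin (ext);

CREATE TABLE teav (
    oid UUID NOT NULL PRIMARY KEY,
    name VARCHAR(255) NOT NULL UNIQUE
);
CREATE TABLE teav_ext_string (
    owner_oid UUID NOT NULL REFERENCES teav (oid),
    key VARCHAR(255) NOT NULL,
    value VARCHAR(255) NOT NULL,
    CONSTRAINT teav_ext_string_pk PRIMARY KEY (owner_oid, key, value)
);

-- optional physical reordering by the PK; slow, and the table is locked meanwhile
-- CLUSTER teav_ext_string USING teav_ext_string_pk;
----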

=== 15 million rows

|===
| Object | Size | Note

| teav_ext_string | 2763 MB | attribute value table (detail for EAV)
| teav_ext_string_pk | 2312 MB | PK of AV table, nearly the same size
| tjson | 1909 MB | JSON table (smaller than just AV table alone)
| tjson_ext_idx | 1620 MB | GIN index for `ext` column
| teav_ext_string_key_value_idx | 867 MB | index for searching by key+value
| teav | 862 MB | master table for EAV model
| teav_oid_pk | 583 MB | primary key index of that table
| tjson_oid_pk | 582 MB | PK of JSON table, virtually the same as for EAV
| teav_name_key, tjson_name_key | 451 MB | name indexes are the same
|===


|===
| Operation | Time | Note

// TODO
|===
98 changes: 40 additions & 58 deletions repo/repo-sqale/sql/pgnew-eav-vs-json.sql
@@ -108,7 +108,6 @@ BEGIN
END LOOP;
END IF;


IF r % 1000 = 0 THEN
COMMIT;
END IF;
@@ -128,71 +127,54 @@ select * from teav_ext_string;
analyze teav;
analyze teav_ext_string;
analyze tjson;
cluster teav_ext_string using teav_ext_string_pk; -- 10-13min


-- matches a big portion of the table (>20%), doesn't use the index
-- 2399882
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from tjson where ext @> '{"hobbies":["video"]}';

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from teav t
where exists (
select from teav_ext_string es
where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video');

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from teav t
where exists (
select from teav_ext_string es
where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video')
and name > 'user0009000000';


EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
-- select count(*)
select *
FROM teav t
JOIN teav_ext_string es ON es.owner_oid = t.oid
WHERE es.key = 'hobbies' and es.value = 'video';
-- cluster teav_ext_string using teav_ext_string_pk; -- 10-13min, not sure how useful this is

-- selecting rare values, uses index and is very fast (under 50ms)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from tjson
where ext @> '{"hobbies":["sleeping"]}';
-- SHOW enable_seqscan;
-- SET enable_seqscan = on;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from teav t
where exists (
select from teav_ext_string es
where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping');

-- selecting emails between two values (comparison)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from tjson
-- not numerical, matches unexpected values too, but that's not the point here
where ext->>'email' BETWEEN 'user200000@mycompany.com' AND 'user250000@mycompany.com';
where ext->>'email' LIKE 'user2%'
and ext ? 'email';

-- index experiments
analyze;
CREATE INDEX tjson_exttmp_idx ON tjson ((ext->>'email'));
DROP INDEX tjson_exttmp_idx;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*)
-- select *
from teav t
where exists (
select from teav_ext_string es
where es.owner_oid = t.oid and es.key = 'email' and es.value BETWEEN 'user200000@mycompany.com' AND 'user250000@mycompany.com');
-- BENCHMARK
-- counts
select count(*) from tjson;
-- low selectivity count
select count(*) from tjson where ext @> '{"hobbies":["video"]}';
-- high selectivity count (result is low)
select count(*) from tjson where ext @> '{"hobbies":["sleeping"]}';
select count(*) from tjson where ext->>'email' LIKE 'user2%';
select count(*) from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%');

select count(*) from teav;
select count(*) from teav_ext_string;
select count(*) from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video');
select count(*) from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping');

-- selects
select * from tjson limit 500;
select * from tjson where ext @> '{"hobbies":["video"]}' limit 500;
select * from tjson where ext @> '{"hobbies":["video"]}' order by oid limit 500;
select * from tjson where ext @> '{"hobbies":["video"]}' and oid>'fffe0000-0000-0000-0000-000000000000' order by oid limit 500;
select * from tjson where ext @> '{"hobbies":["sleeping"]}' limit 500;
select * from tjson where ext @> '{"hobbies":["sleeping"]}' order by oid limit 500;
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video') limit 500;
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video') order by t.oid limit 500;
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'video') and t.oid>'fffe0000-0000-0000-0000-000000000000' order by t.oid limit 500;
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping') limit 500;
-- EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'hobbies' and es.value = 'sleeping') order by t.oid limit 500;

-- TODO
select * from tjson where ext->>'email' LIKE 'user2%' limit 500;
select * from teav t where exists (select from teav_ext_string es where es.owner_oid = t.oid and es.key = 'email' and es.value LIKE 'user2%') limit 500;

30 changes: 19 additions & 11 deletions repo/repo-sqale/sql/pgnew-experiments.sql
@@ -73,7 +73,6 @@ SHOW session_replication_role;
select count(*) from m_object_oid;
explain
select count(*) from m_user;
select * from m_user order by name_norm offset 990 limit 50;
-- vacuum full analyze; -- this requires exclusive lock on processed table and can be very slow, with 1M rows it takes 10s
vacuum analyze; -- this is normal operation version (can run in parallel, ~25s/25m rows)

@@ -101,12 +100,13 @@ SELECT 'user-' || LPAD(r::text, 10, '0'),
(select '{' || string_agg('"key-' || i || '": "value"', ',') || '}' from generate_series(1, 1000) i)::jsonb
WHEN r % 10 = 7 THEN
-- let's add some numbers and wannabe "dates"
('{' || '"hired": "' || current_date - width_bucket(random(), 0, 1, 1000) || '",' || '"rating": ' || width_bucket(random(), 0, 1, 10) || '}')::jsonb
('{"hired": "' || current_date - width_bucket(random(), 0, 1, 1000) || '", "rating": ' || width_bucket(random(), 0, 1, 10) || '}')::jsonb
ELSE
('{"hobbies": ' || array_to_json(random_pick(ARRAY['eating', 'books', 'music', 'dancing', 'walking', 'jokes', 'video', 'photo'], 0.5))::text || '}')::jsonb
('{"hired": "' || current_date - width_bucket(random(), 0, 1, 1000) || '",' ||
'"hobbies": ' || array_to_json(random_pick(ARRAY['eating', 'books', 'music', 'dancing', 'walking', 'jokes', 'video', 'photo'], 0.5))::text || '}')::jsonb
END,
1
from generate_series(50001, 100000) as r;
from generate_series(2000001, 3000000) as r;

select * from m_user;

@@ -140,9 +140,13 @@ CREATE INDEX m_user_ext_hired2_idx ON m_user ((ext ->> 'hired')) WHERE ext ? 'hi
-- set jit=on; -- JIT can sometimes be slower when planner guesses wrong

-- see also https://www.postgresql.org/docs/13/functions-json.html some stuff is only for JSONB
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
select count(*) from m_user
where ext?'hobbies' -- faster, uses GIN index
EXPLAIN --(ANALYZE, BUFFERS, FORMAT TEXT)
-- select count(*)
select oid, name_norm, ext
from m_user
where
-- ext?'hobbies' and -- faster, uses GIN index
ext @> '{"hobbies":["video"]}'
-- where ext->>'hobbies' is not null -- seq-scan
;

@@ -181,17 +185,18 @@ analyse;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
-- select count(*)
select oid -- "select *" with big bytea stuff can dominate the total cost
-- select * -- "select *" with big bytea stuff can dominate the total cost
select oid, length(fullobject) -- much faster than *
from m_user
where
ext?'hobbies' and
-- without array test jsonb_array_elements_text function can fail on scalar value (if allowed)
exists (select from jsonb_array_elements_text(ext->'hobbies') v
-- where jsonb_typeof(ext->'hobbies') = 'array'
-- and upper(v::text) LIKE '%ING')
where upper(v::text) LIKE 'VID%')
where upper(v::text) LIKE '%ING')
-- and oid > 14000000
-- order by oid
order by oid
;

-- MANAGEMENT queries
@@ -214,7 +219,7 @@ FROM pg_class t
LEFT JOIN pg_namespace tftns ON tftns.oid = tft.relnamespace
WHERE 'public' IN (ns.nspname, tftns.nspname)
ORDER BY pg_relation_size(t.oid) DESC
LIMIT 20;
LIMIT 50;

vacuum full analyze;
-- database size
Expand All @@ -230,6 +235,9 @@ where ut.relkind = 'r' and tt.relkind = 't'
and ns.nspname = 'public'
order by relpages desc;

-- find sequence name for serial column (e.g. to alter its value later)
select pg_get_serial_sequence('m_qname', 'id');

-- PRACTICAL UTILITY FUNCTIONS

-- based on https://dba.stackexchange.com/a/22571
55 changes: 55 additions & 0 deletions repo/repo-sqale/sql/pgnew-repo.sql
@@ -416,6 +416,61 @@ CREATE TABLE m_acc_cert_wi_reference (

PRIMARY KEY (owner_owner_owner_oid, owner_owner_id, owner_id, relation, targetOid)
);

-- TODO: catalog unused at the moment
CREATE TABLE m_ext_item (
id SERIAL NOT NULL,
kind INTEGER, -- see RItemKind, is this necessary? does it contain cardinality?
name VARCHAR(157), -- no path for nested props needed?
type VARCHAR(32), -- data type
storageType VARCHAR(32) NOT NULL default 'EXT_JSON', -- type of storage (JSON, column, table separate/common, etc.)
storageInfo VARCHAR(32), -- optional storage detail, name of column or table if necessary

PRIMARY KEY (id)
);

-- EXPERIMENTAL EAV (first without catalog, so string keys are used)
CREATE TABLE m_object_ext_boolean (
owner_oid UUID NOT NULL REFERENCES m_object_oid(oid),
ext_item_id VARCHAR(32) NOT NULL,
value BOOLEAN NOT NULL,
PRIMARY KEY (owner_oid, ext_item_id, value)
);
CREATE TABLE m_object_ext_date (
owner_oid UUID NOT NULL REFERENCES m_object_oid(oid),
ext_item_id VARCHAR(32) NOT NULL,
value TIMESTAMP NOT NULL,
PRIMARY KEY (owner_oid, ext_item_id, value)
);
CREATE TABLE m_object_ext_long (
owner_oid UUID NOT NULL REFERENCES m_object_oid(oid),
ext_item_id VARCHAR(32) NOT NULL,
value INTEGER NOT NULL,
PRIMARY KEY (owner_oid, ext_item_id, value)
);
CREATE TABLE m_object_ext_poly (
owner_oid UUID NOT NULL REFERENCES m_object_oid(oid),
ext_item_id VARCHAR(32) NOT NULL,
orig VARCHAR(255) NOT NULL,
norm VARCHAR(255),
PRIMARY KEY (owner_oid, ext_item_id, orig)
);
CREATE TABLE m_object_ext_reference (
owner_oid UUID NOT NULL REFERENCES m_object_oid(oid),
ext_item_id VARCHAR(32) NOT NULL,
targetoid UUID NOT NULL,
relation VARCHAR(157),
targetType INTEGER,
PRIMARY KEY (owner_oid, ext_item_id, targetoid)
);
CREATE TABLE m_object_ext_string (
owner_oid UUID NOT NULL REFERENCES m_object_oid(oid),
ext_item_id VARCHAR(32) NOT NULL,
value VARCHAR(255) NOT NULL,
PRIMARY KEY (owner_oid, ext_item_id, value)
);
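
-- Hypothetical usage sketch (illustrative only, mirrors the teav queries in the
-- experiments script); 'hobbies'/'video' are example data, not part of the schema:
-- select o.oid
-- from m_object_oid o
-- where exists (
--     select from m_object_ext_string es
--     where es.owner_oid = o.oid and es.ext_item_id = 'hobbies' and es.value = 'video');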


-- TODO what of assignment extensions? Can they change for various types of assignments?
-- Then what? Inheritance is impractical, legacy extension tables are unwieldy.

@@ -239,7 +239,7 @@ public ListTasksPage listTasks() {

public ListTasksPage listTasks(String objectListMenuItemKey) {
if (StringUtils.isEmpty(objectListMenuItemKey)) {
clickAdministrationMenu("PageAdmin.menu.top.serverTasks", "PageAdmin.menu.top.serverTasks.list");
clickAdministrationMenu("PageAdmin.menu.top.serverTasks", "PageAdmin.menu.top.tasks.list");
} else {
clickAdministrationMenu("PageAdmin.menu.top.serverTasks", objectListMenuItemKey);
}
@@ -447,11 +447,12 @@ private SelenideElement getMainMenuItemElement(String topLevelMenuKey, String ma
SelenideElement topLevelMenu = $(Schrodinger.byDataResourceKey(topLevelMenuKey));
topLevelMenu.shouldBe(Condition.visible);

SelenideElement topLevelMenuChevron = topLevelMenu.parent().$(By.tagName("i"));
if (!topLevelMenuChevron.has(Condition.cssClass("fa-chevron-down"))) {
topLevelMenu.click();
topLevelMenuChevron.shouldHave(Condition.cssClass("fa-chevron-down")).waitUntil(Condition.cssClass("fa-chevron-down"), MidPoint.TIMEOUT_DEFAULT_2_S);
}
// TODO: temporarily commented out
// SelenideElement topLevelMenuChevron = topLevelMenu.parent().$(By.tagName("i"));
// if (!topLevelMenuChevron.has(Condition.cssClass("fa-chevron-down"))) {
// topLevelMenu.click();
// topLevelMenuChevron.shouldHave(Condition.cssClass("fa-chevron-down")).waitUntil(Condition.cssClass("fa-chevron-down"), MidPoint.TIMEOUT_DEFAULT_2_S);
// }

SelenideElement mainMenu = topLevelMenu.$(Schrodinger.byDataResourceKey("span", mainMenuKey), index);
mainMenu.shouldBe(Condition.visible);