Commit

Merge branch 'master' of https://github.com/Evolveum/midpoint
KaterynaHonchar committed Dec 4, 2020
2 parents c3f672e + 372401c commit 6d861ca
Showing 3 changed files with 72 additions and 14 deletions.
63 changes: 56 additions & 7 deletions repo/repo-sqale/README.adoc
@@ -25,6 +25,21 @@ Also, we need various extension tables for various data types.
Can https://www.postgresql.org/docs/13/datatype-json.html[JSON types] (namely JSONB) be our salvation?
We need to strike a balance between *query performance* and *extension attribute maintainability/flexibility*.

Regarding *query performance*, we have to realize that even the current implementation, under favorable
conditions (mid-sized databases), performs differently for different operations:

* Equality, comparison and "starts with" filters can rely on indexes like `iAExtensionDate` for `m_assignment_ext_date (dateValue)`.
// TODO: why is this on dateValue only and not combined with item_id?

* "Ends with" or "contains" substrings are not optimized unless we add a trigram or similar index,
see https://stackoverflow.com/a/17646278/658826[this answer]
and https://www.postgresql.org/docs/13/indexes-types.html[index types documentation].
Trigram indexes support only alphanumerical characters and can, reportedly, be huge and inefficient.
"Ends with" can benefit from `reverse()` index.

* Case-insensitive search is not optimized without an appropriate function-based index (see the sketch after this list).
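
A minimal sketch of those index types, with illustrative names (not the actual midPoint schema):

[source,sql]
----
-- Trigram index supporting "contains"/"ends with" substring filters:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX i_ext_string_trgm ON m_assignment_ext_string
    USING gin (stringValue gin_trgm_ops);

-- "Ends with" alternative via an expression index on reverse();
-- queried as: WHERE reverse(stringValue) LIKE reverse('suffix') || '%'
CREATE INDEX i_ext_string_rev ON m_assignment_ext_string (reverse(stringValue));

-- Case-insensitive search via a function-based index;
-- queried as: WHERE lower(stringValue) = lower(?)
CREATE INDEX i_ext_string_lower ON m_assignment_ext_string (lower(stringValue));
----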


=== Modeling options

The following options are available, but they still must be proven:
@@ -48,20 +63,35 @@ on different sets of the same type (e.g. shadows for different resources).
One possible problem is that the whole JSONB value must be rewritten on every update, so the column
can become a source of contention if small changes (e.g. to the value of a single attribute)
are performed on large extensions containing a lot of data.

Also, related to the "blobby" nature of all-in-one JSON, it can be more costly for harder queries
like substring matches, where an index helps only partially and the value itself must be consulted.
As the JSON grows, filtering over it becomes more and more expensive (probably even more so when TOAST-ed); see the sketch below.
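
A sketch of the update-contention point above - even a single-attribute change produces a whole
new value (table/column names are hypothetical):

[source,sql]
----
-- jsonb_set() returns a complete new JSONB document, so the row (and
-- possibly its TOAST data) is rewritten even for this small change:
UPDATE m_user
SET ext = jsonb_set(ext, '{hobby}', '["video", "music"]')
WHERE oid = '...';
----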
--

* Each attribute in a separate column of an appropriate non-JSON type, with multi-values stored as a JSON array.
This makes each column simpler, but requires dynamic DDL management and other complications mentioned
above that a single general JSONB column does not have.

* Master-detail (https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model[EAV])
model is still in play as well if nothing above works well enough.
We would like to avoid very big tables, though, so the current implementation is not suitable.
The following modifications are possible:

** Splitting tables for single-value and multi-value attributes.
** Separate tables for the most used attributes (e.g. some external identifier used for all users),
especially if multi-valued.

* Single JSON column for single-value attributes, but keeping multi-value attributes in separate
table(s) (per type or per attribute).
This is a mix of the first approach ("JSONB handles it all") and some version of the EAV model just mentioned.

The typical problems are:

* How to identify attributes?
Attributes are identified with a synthetic identifier from the extension attributes catalog.
The legacy implementation uses the `m_ext_item` table and its `id` column - this works just fine,
and the repository can easily cache this catalog, so there is no need to JOIN the table.
In JSON, some kind of string identifier must be used; these don't have to be globally unique
across distinct object types, but must be unique enough for queries and the stored JSON.

* Check whether some multi-value attribute contains a value.
With `ext->'hobby' @> '"video"'` (`ext` being of type JSONB) it's possible to check for an exact value, as sketched below.
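
For illustration, a sketch of such checks against a hypothetical `m_user.ext` JSONB column:

[source,sql]
----
-- Generic GIN index over the whole extension column:
CREATE INDEX i_user_ext ON m_user USING gin (ext);

-- Containment over the whole document; can use the GIN index above:
SELECT oid FROM m_user WHERE ext @> '{"hobby": ["video"]}';

-- The same check on the extracted array; this form needs an expression
-- index such as gin((ext -> 'hobby')) to be index-assisted:
SELECT oid FROM m_user WHERE ext -> 'hobby' @> '"video"';
----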
@@ -165,7 +195,11 @@ benefit from the generic GIN index a lot.

== TODO

* Can we merge boring entities into a single `m_object_generic` table?
Things like `m_sequence`, `m_security_policy`, `m_system_configuration`, etc.
Of course, if some of these can have many rows it's not desirable, and it's perhaps more confusing in general anyway.
* How is `m_object_subtype` (`ObjectType.subtype`) used and searched?
*Obsolete:* even if necessary, a single JSON array should cover it; no entity is needed.
* Not yet tackled: tree tables and organizations, see: https://www.postgresql.org/docs/13/ltree.html
* Mention how `LIMIT` makes queries faster, mentioned in comments to
That Q/A also shows how to look into a JSON array with `jsonb_array_elements` without expanding the result, with the help of `EXISTS` - see the sketch below.
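
A sketch of that `jsonb_array_elements`/`EXISTS` technique against the same hypothetical `m_user.ext`
column - `EXISTS` keeps one output row per user regardless of the array size, and allows predicates
that `@>` cannot express:

[source,sql]
----
SELECT u.oid
FROM m_user u
WHERE EXISTS (
    SELECT 1
    FROM jsonb_array_elements_text(u.ext -> 'hobby') AS elem
    WHERE elem LIKE 'video%' -- arbitrary predicate on each element
);
----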
@@ -238,6 +272,11 @@ https://stackoverflow.com/questions/722221/how-to-log-postgresql-queries

* The default `public` schema is used for all midPoint objects; that's OK.

== Maintenance

We may need regular `ANALYZE` and/or `VACUUM`.
These should be run regularly - can this be done inside the DB, should midPoint call it, or will something else trigger it?
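
For reference, the manual invocation would look like this (note that PostgreSQL's autovacuum
daemon may already cover this, which is part of the open question):

[source,sql]
----
-- Reclaim dead tuples and refresh planner statistics in one pass:
VACUUM (ANALYZE) m_object_oid;

-- Planner statistics only:
ANALYZE m_user;
----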

== Pagination

Various types of pagination are summed up in https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/[this article].
@@ -261,7 +300,17 @@ The following techniques are generally not usable for us:
* *Cursor* pagination causes high client-server coupling and is stateful.
We don't want to hold a cursor open for operations that can take longer and need transactions; see the keyset sketch below.
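
For contrast, a keyset ("seek") page fetch - one of the stateless techniques from the linked
article; column names are illustrative:

[source,sql]
----
-- Remember the last value of the previous page and seek past it
-- using an ordinary B-tree index; no OFFSET, no held cursor:
SELECT oid, nameNorm
FROM m_user
WHERE nameNorm > 'user-0000000100' -- last name seen on the previous page
ORDER BY nameNorm
LIMIT 100;
----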

== Performance drop with volume

TL;DR:

* After the first million rows, insert performance drops.
* So does query performance, though not that significantly if an index is used.
* Count queries suffer with volume - avoid counts whenever possible.
* Avoid solutions where the number of inherited tables affects performance, e.g. uniqueness over
the hierarchy - perhaps externalize it to a dedicated table.
* Nothing was optimized; these were just a couple of experiments to get a feel for it.
* After mass deletes, performance can still be slow until `VACUUM` and/or `ANALYZE` is run.

Tested on VirtualBox, 2 GB RAM, 60+ GB disk.

@@ -303,11 +352,11 @@ Table sizes after x inserts (index means PK index):
| 40M | 3858/1721 MB | 1689/1721 MB | 11 GB
|===

With user names formatted like `user-0000000001`, both name indexes had 1269 MB at 40M rows.

== Performance of searching for unused OIDs

If delete is not guarded by a trigger, `m_object_oid` can have unused OIDs.
It's crucial to use the right select/delete construction to find/delete them.
With 26M rows, a naive approach using `NOT IN` to delete 200k unused OIDs ran for over an hour without finishing.
The following output shows the plans for `NOT IN`, `LEFT JOIN` and `NOT EXISTS`.
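
The three query shapes, sketched with `m_user` standing in for all object tables (a simplification):

[source,sql]
----
-- Very slow: NOT IN is hard to plan as an anti-join because of its
-- NULL semantics, so the subquery result is effectively re-checked:
DELETE FROM m_object_oid
WHERE oid NOT IN (SELECT oid FROM m_user);

-- Fast: NOT EXISTS is typically planned as a hash anti join:
DELETE FROM m_object_oid oo
WHERE NOT EXISTS (SELECT 1 FROM m_user u WHERE u.oid = oo.oid);

-- LEFT JOIN variant of the same anti-join (shown as SELECT, since
-- PostgreSQL's DELETE cannot express an outer join directly):
SELECT oo.oid
FROM m_object_oid oo
LEFT JOIN m_user u ON u.oid = oo.oid
WHERE u.oid IS NULL;
----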
@@ -39,7 +39,11 @@
import com.evolveum.midpoint.prism.query.ObjectQuery;
import com.evolveum.midpoint.prism.util.CloneUtil;
import com.evolveum.midpoint.repo.sql.SqlRepositoryConfiguration.Database;
import com.evolveum.midpoint.repo.sql.audit.SqlQueryExecutor;
import com.evolveum.midpoint.repo.sql.audit.beans.MAuditDelta;
import com.evolveum.midpoint.repo.sql.audit.beans.MAuditEventRecord;
import com.evolveum.midpoint.repo.sql.audit.mapping.*;
import com.evolveum.midpoint.repo.sql.audit.querymodel.*;
import com.evolveum.midpoint.repo.sql.data.SelectQueryBuilder;
import com.evolveum.midpoint.repo.sql.data.audit.RAuditEventStage;
import com.evolveum.midpoint.repo.sql.data.audit.RAuditEventType;
@@ -49,14 +53,10 @@
import com.evolveum.midpoint.repo.sql.helpers.BaseHelper;
import com.evolveum.midpoint.repo.sql.helpers.JdbcSession;
import com.evolveum.midpoint.repo.sql.perf.SqlPerformanceMonitorImpl;
import com.evolveum.midpoint.repo.sql.util.DtoTranslationException;
import com.evolveum.midpoint.repo.sql.util.RUtil;
import com.evolveum.midpoint.repo.sql.util.TemporaryTableDialect;
import com.evolveum.midpoint.repo.sqlbase.QueryException;
import com.evolveum.midpoint.schema.*;
import com.evolveum.midpoint.schema.internals.InternalsConfig;
import com.evolveum.midpoint.schema.result.OperationResult;
@@ -230,7 +230,8 @@ private MAuditDelta convertDelta(ObjectDeltaOperation<?> deltaOperation, long re

// serializedDelta is transient, needed for changed items later
mAuditDelta.serializedDelta = serializedDelta;
mAuditDelta.delta = RUtil.getBytesFromSerializedForm(
serializedDelta, sqlConfiguration().isUseZipAudit());
mAuditDelta.deltaOid = delta.getOid();
mAuditDelta.deltaType = MiscUtil.enumOrdinal(
RUtil.getRepoEnumValue(delta.getChangeType(), RChangeType.class));
@@ -245,7 +246,8 @@ private MAuditDelta convertDelta(ObjectDeltaOperation<?> deltaOperation, long re
String full = prismContext.xmlSerializer()
.options(SerializationOptions.createEscapeInvalidCharacters())
.serializeRealValue(jaxb, SchemaConstantsGenerated.C_OPERATION_RESULT);
mAuditDelta.fullResult = RUtil.getBytesFromSerializedForm(
full, sqlConfiguration().isUseZipAudit());
}
}
mAuditDelta.resourceOid = deltaOperation.getResourceOid();
@@ -281,6 +281,7 @@ public static IncompatibleSchemaAction fromValue(String text) {
public static final String PROPERTY_JDBC_URL = "jdbcUrl";
public static final String PROPERTY_DATASOURCE = "dataSource";
public static final String PROPERTY_USE_ZIP = "useZip";
public static final String PROPERTY_USE_ZIP_AUDIT = "useZipAudit";
public static final String PROPERTY_CREATE_MISSING_CUSTOM_COLUMNS = "createMissingCustomColumns";

/**
@@ -369,6 +370,7 @@ public static IncompatibleSchemaAction fromValue(String text) {
private final Long maxLifetime;
private final Long idleTimeout;
private final boolean useZip;
private final boolean useZipAudit;
private String fullObjectFormat; // non-final for testing

private TransactionIsolation defaultTransactionIsolation;
@@ -487,6 +489,7 @@ public SqlRepositoryConfiguration(Configuration configuration) {
idleTimeout = configuration.getLong(PROPERTY_IDLE_TIMEOUT, null);

useZip = configuration.getBoolean(PROPERTY_USE_ZIP, false);
useZipAudit = configuration.getBoolean(PROPERTY_USE_ZIP_AUDIT, true);
createMissingCustomColumns = configuration.getBoolean(PROPERTY_CREATE_MISSING_CUSTOM_COLUMNS, false);
fullObjectFormat = configuration.getString(
PROPERTY_FULL_OBJECT_FORMAT,
@@ -925,6 +928,10 @@ public boolean isUseZip() {
return useZip;
}

public boolean isUseZipAudit() {
return useZipAudit;
}

/**
* This is normally not used outside of tests, but should be safe to change any time.
*/
