Commit

Merge branch 'master' of https://github.com/Evolveum/midpoint
KaterynaHonchar committed Dec 4, 2020
2 parents c3f672e + 372401c commit 6d861ca
Showing 3 changed files with 72 additions and 14 deletions.
63 changes: 56 additions & 7 deletions repo/repo-sqale/README.adoc
@@ -25,6 +25,21 @@ Also, we need various extension tables for various data types.
Can https://www.postgresql.org/docs/13/datatype-json.html[JSON types] (namely JSONB) be our salvation?
We need to strike a balance between *query performance* and *extension attribute maintainability/flexibility*.

Regarding *query performance*, we have to realize that even the current implementation, under favorable
conditions (mid-sized databases), performs differently for different operations:

* Equality, comparison and "starts with" filters can rely on indexes like `iAExtensionDate` for `m_assignment_ext_date (dateValue)`.
// TODO: why is this on dateValue only and not combined with item_id?

* "Ends with" or "contains" substrings are not optimized unless we add a trigram or similar index,
see https://stackoverflow.com/a/17646278/658826[this answer]
and https://www.postgresql.org/docs/13/indexes-types.html[index types documentation].
Trigram indexes support only alphanumerical characters and can, reportedly, be huge and inefficient.
"Ends with" can benefit from `reverse()` index.

* Case-insensitive search is not optimized without an appropriate function-based index (see the sketch after this list).
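
A minimal sketch of those index types, with illustrative names (not the actual midPoint schema):

[source,sql]
----
-- Trigram index supporting "contains"/"ends with" substring filters:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX i_ext_string_trgm ON m_assignment_ext_string
    USING gin (stringValue gin_trgm_ops);

-- "Ends with" alternative via an expression index on reverse();
-- queried as: WHERE reverse(stringValue) LIKE reverse('suffix') || '%'
CREATE INDEX i_ext_string_rev ON m_assignment_ext_string (reverse(stringValue));

-- Case-insensitive search via a function-based index;
-- queried as: WHERE lower(stringValue) = lower(?)
CREATE INDEX i_ext_string_lower ON m_assignment_ext_string (lower(stringValue));
----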


=== Modeling options

The following options are available, but they still must be proven:
@@ -48,20 +63,35 @@ on different sets of the same type (e.g. shadows for different resources).
One possible problem is that the whole JSONB value must be rewritten on every update, so the column
can become a source of contention if small changes (e.g. to the value of a single attribute)
are performed on large extensions containing a lot of data.

Also, related to the "blobby" nature of all-in-one JSON, it can be more costly for harder queries
like substring matches, where an index helps only partially and the value itself must be consulted.
As the JSON grows, filtering over it becomes more and more expensive (probably even more so when TOAST-ed); see the sketch below.
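
A sketch of the update-contention point above - even a single-attribute change produces a whole
new value (table/column names are hypothetical):

[source,sql]
----
-- jsonb_set() returns a complete new JSONB document, so the row (and
-- possibly its TOAST data) is rewritten even for this small change:
UPDATE m_user
SET ext = jsonb_set(ext, '{hobby}', '["video", "music"]')
WHERE oid = '...';
----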
--

* Each attribute in a separate column of an appropriate non-JSON type, with multi-values stored as a JSON array.
This makes each column simpler, but requires dynamic DDL management and other complications mentioned
above that a single general JSONB column does not have.

* Master-detail (https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model[EAV])
model is still in play as well if nothing above works well enough.
We would like to avoid very big tables, though, so the current implementation is not suitable.
The following modifications are possible:

** Splitting tables for single-value and multi-value attributes.
** Separate tables for the most used attributes (e.g. some external identifier used for all users),
especially if multi-valued.

* Single JSON column for single-value attributes, but keeping multi-value attributes in separate
table(s) (per type or per attribute).
This is a mix of the first approach ("JSONB handles it all") and some version of the EAV model just mentioned.

The typical problems are:

* How to identify attributes?
Attributes are identified with a synthetic identifier from the extension attributes catalog.
The legacy implementation uses the `m_ext_item` table and its `id` column - this works just fine,
and the repository can easily cache this catalog, so there is no need to JOIN the table.
In JSON, some kind of string identifier must be used; these don't have to be globally unique
across distinct object types, but must be unique enough for queries and the stored JSON.

* Check whether some multi-value attribute contains a value.
With `ext->'hobby' @> '"video"'` (`ext` being of type JSONB) it's possible to check for an exact value, as sketched below.
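
For illustration, a sketch of such checks against a hypothetical `m_user.ext` JSONB column:

[source,sql]
----
-- Generic GIN index over the whole extension column:
CREATE INDEX i_user_ext ON m_user USING gin (ext);

-- Containment over the whole document; can use the GIN index above:
SELECT oid FROM m_user WHERE ext @> '{"hobby": ["video"]}';

-- The same check on the extracted array; this form needs an expression
-- index such as gin((ext -> 'hobby')) to be index-assisted:
SELECT oid FROM m_user WHERE ext -> 'hobby' @> '"video"';
----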
@@ -165,7 +195,11 @@ benefit from the generic GIN index a lot.

== TODO

* Can we merge boring entities into a single `m_object_generic` table?
Things like `m_sequence`, `m_security_policy`, `m_system_configuration`, etc.
Of course, if some of these can have many rows it's not desirable, and it's perhaps more confusing in general anyway.
* How is `m_object_subtype` (`ObjectType.subtype`) used and searched?
*Obsolete:* even if necessary, a single JSON array should cover it; no entity is needed.
* Not yet tackled: tree tables and organizations, see: https://www.postgresql.org/docs/13/ltree.html
* Mention how `LIMIT` makes queries faster, mentioned in comments to
That Q/A also shows how to look into a JSON array with `jsonb_array_elements` without expanding the result, with the help of `EXISTS` - see the sketch below.
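
A sketch of that `jsonb_array_elements`/`EXISTS` technique against the same hypothetical `m_user.ext`
column - `EXISTS` keeps one output row per user regardless of the array size, and allows predicates
that `@>` cannot express:

[source,sql]
----
SELECT u.oid
FROM m_user u
WHERE EXISTS (
    SELECT 1
    FROM jsonb_array_elements_text(u.ext -> 'hobby') AS elem
    WHERE elem LIKE 'video%' -- arbitrary predicate on each element
);
----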
@@ -238,6 +272,11 @@ https://stackoverflow.com/questions/722221/how-to-log-postgresql-queries

* The default `public` schema is used for all midPoint objects; that's OK.

== Maintenance

We may need regular `ANALYZE` and/or `VACUUM`.
These should be run regularly - can this be done inside the DB, should midPoint call it, or will something else trigger it?
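
For reference, the manual invocation would look like this (note that PostgreSQL's autovacuum
daemon may already cover this, which is part of the open question):

[source,sql]
----
-- Reclaim dead tuples and refresh planner statistics in one pass:
VACUUM (ANALYZE) m_object_oid;

-- Planner statistics only:
ANALYZE m_user;
----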

== Pagination

Various types of pagination are summed up in https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/[this article].
@@ -261,7 +300,17 @@ The following techniques are generally not usable for us:
* *Cursor* pagination causes high client-server coupling and is stateful.
We don't want to hold a cursor open for operations that can take longer and need transactions; see the keyset sketch below.
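
For contrast, a keyset ("seek") page fetch - one of the stateless techniques from the linked
article; column names are illustrative:

[source,sql]
----
-- Remember the last value of the previous page and seek past it
-- using an ordinary B-tree index; no OFFSET, no held cursor:
SELECT oid, nameNorm
FROM m_user
WHERE nameNorm > 'user-0000000100' -- last name seen on the previous page
ORDER BY nameNorm
LIMIT 100;
----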

== Performance drop with volume

TL;DR:

* After the first million rows, insert performance drops.
* So does query performance, though not that significantly if an index is used.
* Count queries suffer with volume - avoid counts whenever possible.
* Avoid solutions where the number of inherited tables affects performance, e.g. uniqueness over
the hierarchy - perhaps externalize it to a dedicated table.
* Nothing was optimized; these were just a couple of experiments to get a feel for it.
* After mass deletes, performance can still be slow until `VACUUM` and/or `ANALYZE` is run.

Tested on VirtualBox, 2 GB RAM, 60+ GB disk.

@@ -303,11 +352,11 @@ Table sizes after x inserts (index means PK index):
| 40M | 3858/1721 MB | 1689/1721 MB | 11 GB
|===

With user names formatted like `user-0000000001`, both name indexes had 1269 MB at 40M rows.

== Performance of searching for unused OIDs

If delete is not guarded by a trigger, `m_object_oid` can have unused OIDs.
It's crucial to use the right select/delete construction to find/delete them.
With 26M rows, a naive approach using `NOT IN` to delete 200k unused OIDs ran for over an hour without finishing.
The following output shows the plans for `NOT IN`, `LEFT JOIN` and `NOT EXISTS`.
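
The three query shapes, sketched with `m_user` standing in for all object tables (a simplification):

[source,sql]
----
-- Very slow: NOT IN is hard to plan as an anti-join because of its
-- NULL semantics, so the subquery result is effectively re-checked:
DELETE FROM m_object_oid
WHERE oid NOT IN (SELECT oid FROM m_user);

-- Fast: NOT EXISTS is typically planned as a hash anti join:
DELETE FROM m_object_oid oo
WHERE NOT EXISTS (SELECT 1 FROM m_user u WHERE u.oid = oo.oid);

-- LEFT JOIN variant of the same anti-join (shown as SELECT, since
-- PostgreSQL's DELETE cannot express an outer join directly):
SELECT oo.oid
FROM m_object_oid oo
LEFT JOIN m_user u ON u.oid = oo.oid
WHERE u.oid IS NULL;
----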
@@ -39,7 +39,11 @@
import com.evolveum.midpoint.prism.query.ObjectQuery;
import com.evolveum.midpoint.prism.util.CloneUtil;
import com.evolveum.midpoint.repo.sql.SqlRepositoryConfiguration.Database;
import com.evolveum.midpoint.repo.sql.audit.SqlQueryExecutor;
import com.evolveum.midpoint.repo.sql.audit.beans.MAuditDelta;
import com.evolveum.midpoint.repo.sql.audit.beans.MAuditEventRecord;
import com.evolveum.midpoint.repo.sql.audit.mapping.*;
import com.evolveum.midpoint.repo.sql.audit.querymodel.*;
import com.evolveum.midpoint.repo.sql.data.SelectQueryBuilder;
import com.evolveum.midpoint.repo.sql.data.audit.RAuditEventStage;
import com.evolveum.midpoint.repo.sql.data.audit.RAuditEventType;
@@ -49,14 +53,10 @@
import com.evolveum.midpoint.repo.sql.helpers.BaseHelper;
import com.evolveum.midpoint.repo.sql.helpers.JdbcSession;
import com.evolveum.midpoint.repo.sql.perf.SqlPerformanceMonitorImpl;
import com.evolveum.midpoint.repo.sql.util.DtoTranslationException;
import com.evolveum.midpoint.repo.sql.util.RUtil;
import com.evolveum.midpoint.repo.sql.util.TemporaryTableDialect;
import com.evolveum.midpoint.repo.sqlbase.QueryException;
import com.evolveum.midpoint.schema.*;
import com.evolveum.midpoint.schema.internals.InternalsConfig;
import com.evolveum.midpoint.schema.result.OperationResult;
@@ -230,7 +230,8 @@ private MAuditDelta convertDelta(ObjectDeltaOperation<?> deltaOperation, long re

// serializedDelta is transient, needed for changed items later
mAuditDelta.serializedDelta = serializedDelta;
mAuditDelta.delta = RUtil.getBytesFromSerializedForm(
serializedDelta, sqlConfiguration().isUseZipAudit());
mAuditDelta.deltaOid = delta.getOid();
mAuditDelta.deltaType = MiscUtil.enumOrdinal(
RUtil.getRepoEnumValue(delta.getChangeType(), RChangeType.class));
@@ -245,7 +246,8 @@ private MAuditDelta convertDelta(ObjectDeltaOperation<?> deltaOperation, long re
String full = prismContext.xmlSerializer()
.options(SerializationOptions.createEscapeInvalidCharacters())
.serializeRealValue(jaxb, SchemaConstantsGenerated.C_OPERATION_RESULT);
mAuditDelta.fullResult = RUtil.getBytesFromSerializedForm(
full, sqlConfiguration().isUseZipAudit());
}
}
mAuditDelta.resourceOid = deltaOperation.getResourceOid();
@@ -281,6 +281,7 @@ public static IncompatibleSchemaAction fromValue(String text) {
public static final String PROPERTY_JDBC_URL = "jdbcUrl";
public static final String PROPERTY_DATASOURCE = "dataSource";
public static final String PROPERTY_USE_ZIP = "useZip";
public static final String PROPERTY_USE_ZIP_AUDIT = "useZipAudit";
public static final String PROPERTY_CREATE_MISSING_CUSTOM_COLUMNS = "createMissingCustomColumns";

/**
@@ -369,6 +370,7 @@ public static IncompatibleSchemaAction fromValue(String text) {
private final Long maxLifetime;
private final Long idleTimeout;
private final boolean useZip;
private final boolean useZipAudit;
private String fullObjectFormat; // non-final for testing

private TransactionIsolation defaultTransactionIsolation;
@@ -487,6 +489,7 @@ public SqlRepositoryConfiguration(Configuration configuration) {
idleTimeout = configuration.getLong(PROPERTY_IDLE_TIMEOUT, null);

useZip = configuration.getBoolean(PROPERTY_USE_ZIP, false);
useZipAudit = configuration.getBoolean(PROPERTY_USE_ZIP_AUDIT, true);
createMissingCustomColumns = configuration.getBoolean(PROPERTY_CREATE_MISSING_CUSTOM_COLUMNS, false);
fullObjectFormat = configuration.getString(
PROPERTY_FULL_OBJECT_FORMAT,
@@ -925,6 +928,10 @@ public boolean isUseZip() {
return useZip;
}

public boolean isUseZipAudit() {
return useZipAudit;
}

/**
* This is normally not used outside of tests, but should be safe to change any time.
*/
