
feat(audit-logs/search): Add filter using log and environments #633

Merged · 3 commits into main · Feb 3, 2022

Conversation

@gagantrivedi (Member) commented on Dec 28, 2021

Accepts search as a query param to perform a search on the log field.

Since the audit log table is huge, I think we should run the SQL on the prod database to check the performance before rolling this out to production.
The generated SQL looks like this:

 SELECT          "audit_auditlog"."id",
                "audit_auditlog"."created_date",
                "audit_auditlog"."project_id",
                "audit_auditlog"."environment_id",
                "audit_auditlog"."log",
                "audit_auditlog"."author_id",
                "audit_auditlog"."related_object_id",
                "audit_auditlog"."related_object_type",
                "projects_project"."id",
                "projects_project"."name",
                "projects_project"."created_date",
                "projects_project"."organisation_id",
                "projects_project"."hide_disabled_flags",
                "projects_project"."enable_dynamo_db",
                "environments_environment"."id",
                "environments_environment"."name",
                "environments_environment"."created_date",
                "environments_environment"."project_id",
                "environments_environment"."api_key",
                "environments_environment"."webhooks_enabled",
                "environments_environment"."webhook_url",
                "users_ffadminuser"."id",
                "users_ffadminuser"."password",
                "users_ffadminuser"."last_login",
                "users_ffadminuser"."is_superuser",
                "users_ffadminuser"."is_staff",
                "users_ffadminuser"."is_active",
                "users_ffadminuser"."date_joined",
                "users_ffadminuser"."email",
                "users_ffadminuser"."username",
                "users_ffadminuser"."first_name",
                "users_ffadminuser"."last_name",
                "users_ffadminuser"."google_user_id",
                "users_ffadminuser"."github_user_id"
FROM            "audit_auditlog"
INNER JOIN      "projects_project"
ON              (
                                "audit_auditlog"."project_id" = "projects_project"."id")
LEFT OUTER JOIN "environments_environment"
ON              (
                                "audit_auditlog"."environment_id" = "environments_environment"."id")
LEFT OUTER JOIN "users_ffadminuser"
ON              (
                                "audit_auditlog"."author_id" = "users_ffadminuser"."id")
WHERE           (
                                "projects_project"."organisation_id" IN
                                (
                                           SELECT     u0."id"
                                           FROM       "organisations_organisation" U0
                                           INNER JOIN "organisations_userorganisation" U1
                                           ON         (
                                                                 u0."id" = u1."organisation_id")
                                           WHERE      u1."user_id" = <user_id>)
                AND             Upper("audit_auditlog"."log"::text) LIKE Upper('%Segment%'))
ORDER BY        "audit_auditlog"."created_date" DESC limit 10
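
For reference, a minimal sketch of the kind of Django ORM query that compiles to SQL like the above (a hypothetical reconstruction, not necessarily the exact PR code; the model import path and the user-to-organisations relation are assumptions):

    # Hypothetical sketch of the queryset behind the generated SQL above;
    # the model path and the `user.organisations` relation are assumed.
    from audit.models import AuditLog

    def search_audit_logs(user, search, limit=10):
        return (
            AuditLog.objects
            # select_related produces the INNER/LEFT OUTER JOINs above
            .select_related("project", "environment", "author")
            # restricts to the user's organisations (the IN subquery)
            .filter(project__organisation__in=user.organisations.all())
            # icontains compiles to UPPER("log"::text) LIKE UPPER('%...%')
            .filter(log__icontains=search)
            .order_by("-created_date")[:limit]
        )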


Full text search approach:
I think the full text search approach is overkill right now, for multiple reasons:

  1. The length of the log column is limited (less than 100 characters in most cases).
  2. Django does not support full text search out of the box for all the databases that we support (see the sketch below).
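
For comparison, the Postgres-only full-text variant that reason 2 rules out would look roughly like this (a sketch using django.contrib.postgres, which is precisely the dependency the other supported databases lack):

    # Postgres-only alternative (ruled out): django.contrib.postgres has
    # no equivalent on the other databases this project supports.
    from django.contrib.postgres.search import SearchVector

    from audit.models import AuditLog  # assumed model path

    results = AuditLog.objects.annotate(
        search=SearchVector("log"),
    ).filter(search="segment")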

Ref: #408

@gagantrivedi changed the title from "feat(audit-logs/search): Add filter using log and environment" to "feat(audit-logs/search): Add filter using log and environments" on Dec 28, 2021
@dabeeeenster (Contributor) commented:
Yes, I think if we can manage to get away without a text index, so that we don't break compatibility with other DBMSs, we should work towards that goal. I'm assuming PG will run the filter on Env ID before doing the text search, which will cut down the query time a lot, especially on SaaS. Are we able to run some tests with e.g. 10,000 log records in an environment?

@gagantrivedi (Member, Author) commented on Jan 5, 2022

Are we able to run some tests with e.g. 10,000 log records in an environment?

I did not try that. Since the query planner takes a lot of things into consideration before generating a plan, local benchmarks are a bit unreliable. That's why I wanted to run the SQL on production.

@dabeeeenster (Contributor) commented:
It would be good to first run an EXPLAIN on the generated SQL to make sure it is using the indexes etc.

@gagantrivedi (Member, Author) commented:
I created a GIN index, but it seems like Postgres finds other indexes much faster. Below are some query plans generated by the same query on different datasets:
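
(The plans below can be reproduced from a Django shell; a minimal sketch, assuming `qs` is the queryset that generates the SQL shown earlier:)

    # QuerySet.explain() exists since Django 2.1; analyze=True is passed
    # through to Postgres as EXPLAIN (ANALYZE). `qs` is assumed to be the
    # audit log queryset from above.
    print(qs.explain(analyze=True))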

  1. Query plan for a smaller dataset (less than 1k rows):
 Limit  (cost=42.09..42.09 rows=1 width=3567) (actual time=3.393..3.472 rows=10 loops=1)
   ->  Sort  (cost=42.09..42.09 rows=1 width=3567) (actual time=3.389..3.437 rows=10 loops=1)
         Sort Key: audit_auditlog.created_date DESC
         Sort Method: top-N heapsort  Memory: 35kB
         ->  Nested Loop Left Join  (cost=0.72..42.08 rows=1 width=3567) (actual time=0.159..3.275 rows=42 loops=1)
               ->  Nested Loop Left Join  (cost=0.58..33.53 rows=1 width=1804) (actual time=0.139..2.873 rows=42 loops=1)
                     ->  Nested Loop Semi Join  (cost=0.44..25.11 rows=1 width=635) (actual time=0.126..2.567 rows=42 loops=1)
                           ->  Nested Loop  (cost=0.14..24.07 rows=1 width=635) (actual time=0.083..1.800 rows=42 loops=1)
                                 ->  Seq Scan on audit_auditlog  (cost=0.00..15.79 rows=1 width=101) (actual time=0.059..1.307 rows=58 loops=1)
                                       Filter: (upper(log) ~~ '%SEGMENT%'::text)
                                       Rows Removed by Filter: 395
                                 ->  Index Scan using projects_project_pkey on projects_project  (cost=0.14..8.16 rows=1 width=534) (actual time=0.003..0.003 rows=1 loops=58)
                                       Index Cond: (id = audit_auditlog.project_id)
                           ->  Nested Loop  (cost=0.29..1.03 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=42)
                                 Join Filter: (u0.id = u1.organisation_id)
                                 ->  Index Only Scan using organisations_organisation_pkey on organisations_organisation u0  (cost=0.14..0.50 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=42)
                                       Index Cond: (id = projects_project.organisation_id)
                                       Heap Fetches: 42
                                 ->  Index Only Scan using organisations_userorgani_user_id_organisation_id_b0deede9_uniq on organisations_userorganisation u1  (cost=0.15..0.51 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=42)
                                       Index Cond: ((user_id = 1) AND (organisation_id = projects_project.organisation_id))
                                       Heap Fetches: 42
                     ->  Index Scan using environments_environment_pkey on environments_environment  (cost=0.14..8.16 rows=1 width=1169) (actual time=0.002..0.002 rows=0 loops=42)
                           Index Cond: (audit_auditlog.environment_id = id)
               ->  Index Scan using users_ffadminuser_pkey on users_ffadminuser  (cost=0.14..8.16 rows=1 width=1763) (actual time=0.004..0.004 rows=1 loops=42)
                     Index Cond: (audit_auditlog.author_id = id)
 Planning Time: 1.514 ms
 Execution Time: 3.712 ms
(27 rows)


It is doing a sequential scan on the audit log because the dataset is small, and it does not make much sense to use any index here.

  2. Query plan for a larger dataset (2 million rows):

 Limit  (cost=104.97..104.98 rows=3 width=3520) (actual time=2.482..2.574 rows=10 loops=1)
   ->  Sort  (cost=104.97..104.98 rows=3 width=3520) (actual time=2.477..2.537 rows=10 loops=1)
         Sort Key: audit_auditlog.created_date DESC
         Sort Method: top-N heapsort  Memory: 30kB
         ->  Nested Loop Left Join  (cost=21.28..104.95 rows=3 width=3520) (actual time=0.361..2.383 rows=42 loops=1)
               ->  Nested Loop Left Join  (cost=21.14..103.77 rows=3 width=1757) (actual time=0.323..1.979 rows=42 loops=1)
                     ->  Nested Loop  (cost=21.00..102.59 rows=3 width=588) (actual time=0.310..1.672 rows=42 loops=1)
                           ->  Nested Loop  (cost=20.57..22.15 rows=2 width=534) (actual time=0.126..0.185 rows=2 loops=1)
                                 ->  Unique  (cost=20.42..20.43 rows=2 width=8) (actual time=0.107..0.139 rows=2 loops=1)
                                       ->  Sort  (cost=20.42..20.43 rows=2 width=8) (actual time=0.104..0.125 rows=2 loops=1)
                                             Sort Key: u0.id
                                             Sort Method: quicksort  Memory: 25kB
                                             ->  Hash Join  (cost=9.53..20.41 rows=2 width=8) (actual time=0.075..0.101 rows=2 loops=1)
                                                   Hash Cond: (u0.id = u1.organisation_id)
                                                   ->  Seq Scan on organisations_organisation u0  (cost=0.00..10.70 rows=70 width=4) (actual time=0.009..0.017 rows=3 loops=1)
                                                   ->  Hash  (cost=9.50..9.50 rows=2 width=4) (actual time=0.037..0.045 rows=2 loops=1)
                                                         Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                         ->  Bitmap Heap Scan on organisations_userorganisation u1  (cost=4.16..9.50 rows=2 width=4) (actual time=0.017..0.026 rows=2 loops=1)
                                                               Recheck Cond: (user_id = 1)
                                                               Heap Blocks: exact=1
                                                               ->  Bitmap Index Scan on organisations_userorgani_user_id_organisation_id_b0deede9_uniq  (cost=0.00..4.16 rows=2 width=0) (actual time=0.008..0.010 rows=2 loops=1)
                                                                     Index Cond: (user_id = 1)
                                 ->  Index Scan using projects_project_organisation_id_97360d51 on projects_project  (cost=0.14..0.85 rows=1 width=534) (actual time=0.008..0.011 rows=1 loops=2)
                                       Index Cond: (organisation_id = u0.id)
                           ->  Index Scan using audit_auditlog_project_id_025694a7 on audit_auditlog  (cost=0.43..40.21 rows=1 width=54) (actual time=0.101..0.665 rows=21 loops=2)
                                 Index Cond: (project_id = projects_project.id)
                                 Filter: (upper(log) ~~ '%SEGMENT%'::text)
                                 Rows Removed by Filter: 196
                     ->  Index Scan using environments_environment_pkey on environments_environment  (cost=0.14..0.39 rows=1 width=1169) (actual time=0.002..0.002 rows=0 loops=42)
                           Index Cond: (audit_auditlog.environment_id = id)
               ->  Index Scan using users_ffadminuser_pkey on users_ffadminuser  (cost=0.14..0.39 rows=1 width=1763) (actual time=0.004..0.004 rows=1 loops=42)
                     Index Cond: (audit_auditlog.author_id = id)
 Planning Time: 2.289 ms
 Execution Time: 2.858 ms
(34 rows)

Now the query plan is wildly different from the one generated for 1k rows; as you can see, it is now using the project_id index on the audit log table.

So to sum it all up: adding a GIN index at this point has more downsides (because of the multiple supported databases, and because it is not being used) than upsides, and the performance is not bad for a dataset of 2M rows.
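
(For the record, the tested GIN index was presumably something along these lines; the exact definition is not shown in the PR, so the index name, model sketch, and trigram opclass below are assumptions:)

    # Hypothetical reconstruction of the tested-and-rejected GIN index.
    # A trigram opclass (pg_trgm extension) is what lets a GIN index
    # serve LIKE '%...%' pattern matches at all.
    from django.contrib.postgres.indexes import GinIndex
    from django.db import models

    class AuditLog(models.Model):
        log = models.TextField()

        class Meta:
            indexes = [
                GinIndex(name="audit_log_gin", fields=["log"],
                         opclasses=["gin_trgm_ops"]),
            ]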

@dabeeeenster (Contributor) commented:
We should go ahead with this, but put the FE behind a feature flag (deciding whether to do the search client side, as it currently does, or server side via this new endpoint).

Then we can test performance in prod.

One further option would be to limit the audit log search to the last 30 days by default, which I think would be acceptable (a sketch of what that could look like is below).
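
A minimal sketch of that hypothetical default window (not part of this PR; `qs` and the 30-day cutoff are assumptions):

    # Hypothetical default 30-day window for audit log search; `qs` is
    # assumed to be the audit log queryset.
    from datetime import timedelta
    from django.utils import timezone

    qs = qs.filter(created_date__gte=timezone.now() - timedelta(days=30))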

@matthewelwell self-assigned this on Jan 27, 2022
assert response.json()["results"][1]["environment"] is None


def test_audit_log_can_be_filtered_by_related_object_type(
A reviewer (Contributor) commented on this snippet:

This isn't really filtering by 'related object type', right?

@gagantrivedi requested a review from matthewelwell and removed the request for matthewelwell on January 31, 2022 15:51
@gagantrivedi force-pushed the fix/github-408/audit-log-search branch from 1c5e446 to eb97eba on February 1, 2022 08:29
@dabeeeenster merged commit 8e6efd0 into main on Feb 3, 2022
@dabeeeenster deleted the fix/github-408/audit-log-search branch on February 3, 2022 17:01
gagantrivedi added a commit that referenced this pull request Feb 11, 2022
* feat(audit-logs/search): Add filter using log and environment

* Add tests

* fix/improve tests