
feat(audit-logs/search): Add filter using log and environments #633

Merged · 3 commits into main · Feb 3, 2022

Conversation

@gagantrivedi (Member) commented on Dec 28, 2021

Accepts search as a query param to perform a search on the log field.

Since the audit log table is huge, I think we should run the SQL on the prod database to check the performance before rolling this out to production.
The generated SQL looks like this:

 SELECT          "audit_auditlog"."id",
                "audit_auditlog"."created_date",
                "audit_auditlog"."project_id",
                "audit_auditlog"."environment_id",
                "audit_auditlog"."log",
                "audit_auditlog"."author_id",
                "audit_auditlog"."related_object_id",
                "audit_auditlog"."related_object_type",
                "projects_project"."id",
                "projects_project"."name",
                "projects_project"."created_date",
                "projects_project"."organisation_id",
                "projects_project"."hide_disabled_flags",
                "projects_project"."enable_dynamo_db",
                "environments_environment"."id",
                "environments_environment"."name",
                "environments_environment"."created_date",
                "environments_environment"."project_id",
                "environments_environment"."api_key",
                "environments_environment"."webhooks_enabled",
                "environments_environment"."webhook_url",
                "users_ffadminuser"."id",
                "users_ffadminuser"."password",
                "users_ffadminuser"."last_login",
                "users_ffadminuser"."is_superuser",
                "users_ffadminuser"."is_staff",
                "users_ffadminuser"."is_active",
                "users_ffadminuser"."date_joined",
                "users_ffadminuser"."email",
                "users_ffadminuser"."username",
                "users_ffadminuser"."first_name",
                "users_ffadminuser"."last_name",
                "users_ffadminuser"."google_user_id",
                "users_ffadminuser"."github_user_id"
FROM            "audit_auditlog"
INNER JOIN      "projects_project"
ON              (
                                "audit_auditlog"."project_id" = "projects_project"."id")
LEFT OUTER JOIN "environments_environment"
ON              (
                                "audit_auditlog"."environment_id" = "environments_environment"."id")
LEFT OUTER JOIN "users_ffadminuser"
ON              (
                                "audit_auditlog"."author_id" = "users_ffadminuser"."id")
WHERE           (
                                "projects_project"."organisation_id" IN
                                (
                                           SELECT     u0."id"
                                           FROM       "organisations_organisation" U0
                                           INNER JOIN "organisations_userorganisation" U1
                                           ON         (
                                                                 u0."id" = u1."organisation_id")
                                           WHERE      u1."user_id" = <user_id>)
                AND             Upper("audit_auditlog"."log"::text) LIKE Upper('%Segment%'))
ORDER BY        "audit_auditlog"."created_date" DESC limit 10
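
For reference, a minimal sketch of the kind of Django ORM query that compiles to SQL like the above (a hypothetical reconstruction, not necessarily the exact PR code; the model import path and the user-to-organisations relation are assumptions):

    # Hypothetical sketch of the queryset behind the generated SQL above;
    # the model path and the `user.organisations` relation are assumed.
    from audit.models import AuditLog

    def search_audit_logs(user, search, limit=10):
        return (
            AuditLog.objects
            # select_related produces the INNER/LEFT OUTER JOINs above
            .select_related("project", "environment", "author")
            # restricts to the user's organisations (the IN subquery)
            .filter(project__organisation__in=user.organisations.all())
            # icontains compiles to UPPER("log"::text) LIKE UPPER('%...%')
            .filter(log__icontains=search)
            .order_by("-created_date")[:limit]
        )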


Full text search approach:
I think the full text search approach is overkill right now, for multiple reasons:

  1. The length of the log column is limited (less than 100 characters in most cases).
  2. Django does not support full text search out of the box for all the databases that we support (see the sketch below).
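
For comparison, the Postgres-only full-text variant that reason 2 rules out would look roughly like this (a sketch using django.contrib.postgres, which is precisely the dependency the other supported databases lack):

    # Postgres-only alternative (ruled out): django.contrib.postgres has
    # no equivalent on the other databases this project supports.
    from django.contrib.postgres.search import SearchVector

    from audit.models import AuditLog  # assumed model path

    results = AuditLog.objects.annotate(
        search=SearchVector("log"),
    ).filter(search="segment")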

Ref: #408

@gagantrivedi changed the title from "feat(audit-logs/search): Add filter using log and environment" to "feat(audit-logs/search): Add filter using log and environments" on Dec 28, 2021
@dabeeeenster (Contributor) commented:
Yes, I think if we can manage to get away without a text index, so that we don't break compatibility with other DBMSs, we should work towards that goal. I'm assuming PG will run the filter on Env ID before doing the text search, which will cut down the query time a lot, especially on SaaS. Are we able to run some tests with e.g. 10,000 log records in an environment?

@gagantrivedi (Member, Author) commented on Jan 5, 2022

Are we able to run some tests with e.g. 10,000 log records in an environment?

I did not try that. Since the query planner takes a lot of things into consideration before generating a plan, local benchmarks are a bit unreliable. That's why I wanted to run the SQL on production.

@dabeeeenster (Contributor) commented:
It would be good to first run an EXPLAIN on the generated SQL to make sure it is using the indexes etc.

@gagantrivedi (Member, Author) commented:
I created a GIN index, but it seems like Postgres finds other indexes much faster. Below are some query plans generated by the same query on different datasets:
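
(The plans below can be reproduced from a Django shell; a minimal sketch, assuming `qs` is the queryset that generates the SQL shown earlier:)

    # QuerySet.explain() exists since Django 2.1; analyze=True is passed
    # through to Postgres as EXPLAIN (ANALYZE). `qs` is assumed to be the
    # audit log queryset from above.
    print(qs.explain(analyze=True))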

  1. Query plan for a smaller dataset (less than 1k rows):
 Limit  (cost=42.09..42.09 rows=1 width=3567) (actual time=3.393..3.472 rows=10 loops=1)
   ->  Sort  (cost=42.09..42.09 rows=1 width=3567) (actual time=3.389..3.437 rows=10 loops=1)
         Sort Key: audit_auditlog.created_date DESC
         Sort Method: top-N heapsort  Memory: 35kB
         ->  Nested Loop Left Join  (cost=0.72..42.08 rows=1 width=3567) (actual time=0.159..3.275 rows=42 loops=1)
               ->  Nested Loop Left Join  (cost=0.58..33.53 rows=1 width=1804) (actual time=0.139..2.873 rows=42 loops=1)
                     ->  Nested Loop Semi Join  (cost=0.44..25.11 rows=1 width=635) (actual time=0.126..2.567 rows=42 loops=1)
                           ->  Nested Loop  (cost=0.14..24.07 rows=1 width=635) (actual time=0.083..1.800 rows=42 loops=1)
                                 ->  Seq Scan on audit_auditlog  (cost=0.00..15.79 rows=1 width=101) (actual time=0.059..1.307 rows=58 loops=1)
                                       Filter: (upper(log) ~~ '%SEGMENT%'::text)
                                       Rows Removed by Filter: 395
                                 ->  Index Scan using projects_project_pkey on projects_project  (cost=0.14..8.16 rows=1 width=534) (actual time=0.003..0.003 rows=1 loops=58)
                                       Index Cond: (id = audit_auditlog.project_id)
                           ->  Nested Loop  (cost=0.29..1.03 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=42)
                                 Join Filter: (u0.id = u1.organisation_id)
                                 ->  Index Only Scan using organisations_organisation_pkey on organisations_organisation u0  (cost=0.14..0.50 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=42)
                                       Index Cond: (id = projects_project.organisation_id)
                                       Heap Fetches: 42
                                 ->  Index Only Scan using organisations_userorgani_user_id_organisation_id_b0deede9_uniq on organisations_userorganisation u1  (cost=0.15..0.51 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=42)
                                       Index Cond: ((user_id = 1) AND (organisation_id = projects_project.organisation_id))
                                       Heap Fetches: 42
                     ->  Index Scan using environments_environment_pkey on environments_environment  (cost=0.14..8.16 rows=1 width=1169) (actual time=0.002..0.002 rows=0 loops=42)
                           Index Cond: (audit_auditlog.environment_id = id)
               ->  Index Scan using users_ffadminuser_pkey on users_ffadminuser  (cost=0.14..8.16 rows=1 width=1763) (actual time=0.004..0.004 rows=1 loops=42)
                     Index Cond: (audit_auditlog.author_id = id)
 Planning Time: 1.514 ms
 Execution Time: 3.712 ms
(27 rows)


It is doing a sequential scan on the audit log because the dataset is small, and it does not make much sense to use any index here.

  2. Query plan for a larger dataset (2 million rows):

 Limit  (cost=104.97..104.98 rows=3 width=3520) (actual time=2.482..2.574 rows=10 loops=1)
   ->  Sort  (cost=104.97..104.98 rows=3 width=3520) (actual time=2.477..2.537 rows=10 loops=1)
         Sort Key: audit_auditlog.created_date DESC
         Sort Method: top-N heapsort  Memory: 30kB
         ->  Nested Loop Left Join  (cost=21.28..104.95 rows=3 width=3520) (actual time=0.361..2.383 rows=42 loops=1)
               ->  Nested Loop Left Join  (cost=21.14..103.77 rows=3 width=1757) (actual time=0.323..1.979 rows=42 loops=1)
                     ->  Nested Loop  (cost=21.00..102.59 rows=3 width=588) (actual time=0.310..1.672 rows=42 loops=1)
                           ->  Nested Loop  (cost=20.57..22.15 rows=2 width=534) (actual time=0.126..0.185 rows=2 loops=1)
                                 ->  Unique  (cost=20.42..20.43 rows=2 width=8) (actual time=0.107..0.139 rows=2 loops=1)
                                       ->  Sort  (cost=20.42..20.43 rows=2 width=8) (actual time=0.104..0.125 rows=2 loops=1)
                                             Sort Key: u0.id
                                             Sort Method: quicksort  Memory: 25kB
                                             ->  Hash Join  (cost=9.53..20.41 rows=2 width=8) (actual time=0.075..0.101 rows=2 loops=1)
                                                   Hash Cond: (u0.id = u1.organisation_id)
                                                   ->  Seq Scan on organisations_organisation u0  (cost=0.00..10.70 rows=70 width=4) (actual time=0.009..0.017 rows=3 loops=1)
                                                   ->  Hash  (cost=9.50..9.50 rows=2 width=4) (actual time=0.037..0.045 rows=2 loops=1)
                                                         Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                                         ->  Bitmap Heap Scan on organisations_userorganisation u1  (cost=4.16..9.50 rows=2 width=4) (actual time=0.017..0.026 rows=2 loops=1)
                                                               Recheck Cond: (user_id = 1)
                                                               Heap Blocks: exact=1
                                                               ->  Bitmap Index Scan on organisations_userorgani_user_id_organisation_id_b0deede9_uniq  (cost=0.00..4.16 rows=2 width=0) (actual time=0.008..0.010 rows=2 loops=1)
                                                                     Index Cond: (user_id = 1)
                                 ->  Index Scan using projects_project_organisation_id_97360d51 on projects_project  (cost=0.14..0.85 rows=1 width=534) (actual time=0.008..0.011 rows=1 loops=2)
                                       Index Cond: (organisation_id = u0.id)
                           ->  Index Scan using audit_auditlog_project_id_025694a7 on audit_auditlog  (cost=0.43..40.21 rows=1 width=54) (actual time=0.101..0.665 rows=21 loops=2)
                                 Index Cond: (project_id = projects_project.id)
                                 Filter: (upper(log) ~~ '%SEGMENT%'::text)
                                 Rows Removed by Filter: 196
                     ->  Index Scan using environments_environment_pkey on environments_environment  (cost=0.14..0.39 rows=1 width=1169) (actual time=0.002..0.002 rows=0 loops=42)
                           Index Cond: (audit_auditlog.environment_id = id)
               ->  Index Scan using users_ffadminuser_pkey on users_ffadminuser  (cost=0.14..0.39 rows=1 width=1763) (actual time=0.004..0.004 rows=1 loops=42)
                     Index Cond: (audit_auditlog.author_id = id)
 Planning Time: 2.289 ms
 Execution Time: 2.858 ms
(34 rows)

Now the query plan is wildly different from the one generated for 1k rows; as you can see, it is now using the project_id index on the audit log table.

So to sum it all up: adding a GIN index at this point has more downsides (because of the multiple supported databases, and because it is not being used) than upsides, and the performance is not bad for a dataset of 2M rows.
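
(For the record, the tested GIN index was presumably something along these lines; the exact definition is not shown in the PR, so the index name, model sketch, and trigram opclass below are assumptions:)

    # Hypothetical reconstruction of the tested-and-rejected GIN index.
    # A trigram opclass (pg_trgm extension) is what lets a GIN index
    # serve LIKE '%...%' pattern matches at all.
    from django.contrib.postgres.indexes import GinIndex
    from django.db import models

    class AuditLog(models.Model):
        log = models.TextField()

        class Meta:
            indexes = [
                GinIndex(name="audit_log_gin", fields=["log"],
                         opclasses=["gin_trgm_ops"]),
            ]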

@dabeeeenster (Contributor) commented:
We should go ahead with this, but put the FE behind a feature flag (deciding whether to do the search client side, as it currently does, or server side via this new endpoint).

Then we can test performance in prod.

One further option would be to limit the audit log search to the last 30 days by default, which I think would be acceptable (a sketch of what that could look like is below).
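
A minimal sketch of that hypothetical default window (not part of this PR; `qs` and the 30-day cutoff are assumptions):

    # Hypothetical default 30-day window for audit log search; `qs` is
    # assumed to be the audit log queryset.
    from datetime import timedelta
    from django.utils import timezone

    qs = qs.filter(created_date__gte=timezone.now() - timedelta(days=30))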

@matthewelwell self-assigned this on Jan 27, 2022
assert response.json()["results"][1]["environment"] is None


def test_audit_log_can_be_filtered_by_related_object_type(
A reviewer (Contributor) commented on this snippet:

This isn't really filtering by 'related object type', right?

@gagantrivedi requested a review from matthewelwell and removed the request for matthewelwell on January 31, 2022 15:51
@gagantrivedi force-pushed the fix/github-408/audit-log-search branch from 1c5e446 to eb97eba on February 1, 2022 08:29
@dabeeeenster merged commit 8e6efd0 into main on Feb 3, 2022
@dabeeeenster deleted the fix/github-408/audit-log-search branch on February 3, 2022 17:01
gagantrivedi added a commit that referenced this pull request Feb 11, 2022
* feat(audit-logs/search): Add filter using log and environment

* Add tests

* fix/improve tests