Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed email_recipients indexes to match query usage #19918

Merged

Conversation

kevinansfield
Copy link
Contributor

@kevinansfield kevinansfield commented Mar 25, 2024

closes https://linear.app/tryghost/issue/ENG-791/migration-to-fix-email-recipients-indexes

Our indexes over single columns (delivered_at, opened_at, failed_at) were ineffective because the only time we query those is alongside email_id meaning we were frequently performing full table scans on very large tables during our email analytics jobs.

  • added migration to add new indexes covering email_id and the respective columns
  • added migration to drop the old indexes that weren't being used in any query plans

Local runtime with ~2M email_recipient rows:

  • before: 1.7s
  • after: 99ms

Explain output...

before:

+----+-------------+------------------+------------+-------+----------------------------------------------------------------------------------+----------------------------------------------+---------+-------+--------+----------+------------------------------------+
| id | select_type | table            | partitions | type  | possible_keys                                                                    | key                                          | key_len | ref   | rows   | filtered | Extra                              |
+----+-------------+------------------+------------+-------+----------------------------------------------------------------------------------+----------------------------------------------+---------+-------+--------+----------+------------------------------------+
|  1 | UPDATE      | emails           | NULL       | index | NULL                                                                             | PRIMARY                                      | 98      | NULL  |      1 |   100.00 | Using where                        |
|  4 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_failed_at_index    | email_recipients_failed_at_index             | 6       | NULL  |   2343 |     7.76 | Using index condition; Using where |
|  3 | SUBQUERY    | email_recipients | NULL       | ref   | email_recipients_email_id_member_email_index,email_recipients_opened_at_index    | email_recipients_email_id_member_email_index | 98      | const | 159126 |    50.00 | Using where                        |
|  2 | SUBQUERY    | email_recipients | NULL       | ref   | email_recipients_email_id_member_email_index,email_recipients_delivered_at_index | email_recipients_email_id_member_email_index | 98      | const | 159126 |    50.00 | Using where                        |
+----+-------------+------------------+------------+-------+----------------------------------------------------------------------------------+----------------------------------------------+---------+-------+--------+----------+------------------------------------+

after:

+----+-------------+------------------+------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table            | partitions | type  | possible_keys                                                                                                                                                                 | key                                          | key_len | ref  | rows   | filtered | Extra                    |
+----+-------------+------------------+------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+--------+----------+--------------------------+
|  1 | UPDATE      | emails           | NULL       | index | NULL                                                                                                                                                                          | PRIMARY                                      | 98      | NULL |      1 |   100.00 | Using where;             |
|  4 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_email_id_delivered_at_index,email_recipients_email_id_opened_at_index,email_recipients_email_id_failed_at_index | email_recipients_email_id_failed_at_index    | 104     | NULL |     60 |   100.00 | Using where; Using index |
|  3 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_email_id_delivered_at_index,email_recipients_email_id_opened_at_index,email_recipients_email_id_failed_at_index | email_recipients_email_id_opened_at_index    | 104     | NULL | 119496 |   100.00 | Using where; Using index |
|  2 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_email_id_delivered_at_index,email_recipients_email_id_opened_at_index,email_recipients_email_id_failed_at_index | email_recipients_email_id_delivered_at_index | 104     | NULL | 146030 |   100.00 | Using where; Using index |
+----+-------------+------------------+------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+--------+----------+--------------------------+

closes https://linear.app/tryghost/issue/ENG-791/migration-to-fix-email-recipients-indexes

Our indexes over single columns (`delivered_at`, `opened_at`, `failed_at`) were ineffective because the only time we query those is alongside `email_id` meaning we were frequently performing full table scans on very large tables during our email analytics jobs.

- added migration to add new indexes covering `email_id` and the respective columns
- added migration to drop the old indexes that weren't being used in any query plans
@github-actions github-actions bot added the migration [pull request] Includes migration for review label Mar 25, 2024
Copy link
Contributor

It looks like this PR contains a migration 👀
Here's the checklist for reviewing migrations:

General requirements

  • Satisfies idempotency requirement (both up() and down())
  • Does not reference models
  • Filename is in the correct format (and correctly ordered)
  • Targets the next minor version
  • All code paths have appropriate log messages
  • Uses the correct utils
  • Contains a minimal changeset
  • Does not mix DDL/DML operations

Schema changes

  • Both schema change and related migration have been implemented
  • For index changes: has been performance tested for large tables
  • For new tables/columns: fields use the appropriate predefined field lengths
  • For new tables/columns: field names follow the appropriate conventions
  • Does not drop a non-alpha table outside of a major version

Data changes

  • Mass updates/inserts are batched appropriately
  • Does not loop over large tables/datasets
  • Defends against missing or invalid data
  • For settings updates: follows the appropriate guidelines

@erik-ghost
Copy link

@kevinansfield there's a few more things I think we should get clear on:

  • are we positive we're avoiding hitting caches when we're getting these timings? Can you describe the process you're using to get the timings, and are you using SQL_NO_CACHE? You can see some more details here for what variables to look for to understand if you're hitting caches.
  • adding more indexes will likely incur a penalty in performance of writing to that table. How can we reason about what impact this change has on the performance of the writes to this table?
  • although adding the indexes improves the performance, we're still hitting a lot of rows (as shown in the explain), by the Using where: this likely means that although it's using an index, it's still having to do a table scan for that number of rows. Changing this so that the explain output says Using index will be an improvement.
  • This is likely because we're doing a IS NOT NULL, you can read more about that here and you can spot that "ref" column in the explain output change from const to NULL.
  • IS NULL optimization can be a bit of a tricky subarea, can you include the full show create table output for this table before and after the index changes?

Thank you!

@kevinansfield
Copy link
Contributor Author

kevinansfield commented Mar 25, 2024

are we positive we're avoiding hitting caches when we're getting these timings? Can you describe the process you're using to get the timings, and are you using SQL_NO_CACHE? You can see some more details here for what variables to look for to understand if you're hitting caches.

There's no query cache in MySQL 8 as far as I can tell. To avoid cases of query data being in memory I was restarting the server between query runs.

@kevinansfield
Copy link
Contributor Author

adding more indexes will likely incur a penalty in performance of writing to that table. How can we reason about what impact this change has on the performance of the writes to this table?

In this case we're replacing an already existing index so we shouldn't have huge differences. I can take a look at timing a batch insert operation.

@kevinansfield
Copy link
Contributor Author

kevinansfield commented Mar 25, 2024

although adding the indexes improves the performance, we're still hitting a lot of rows (as shown in the explain), by the Using where: this likely means that although it's using an index, it's still having to do a table scan for that number of rows. Changing this so that the explain output says Using index will be an improvement.

I did try looking into the Using where; comment earlier but everything I found was inconclusive, it seems prevailing thought was it's not very indicative of anything because it can show up simply because there's a where condition in the query (the docs are also very ambiguous but seem to support that theory, especially in this case where we're passing the restricted row set to the COUNT function?).

The explain output does have Using index; which means that it was possible to purely use the index to retrieve the necessary rows. My reading of the docs is that it will output only Using index; with no Using where; when you're querying against a primary key.

@kevinansfield
Copy link
Contributor Author

This is likely because we're doing a IS NOT NULL, you can read more about that here and you can spot that "ref" column in the explain output change from const to NULL.
IS NULL optimization can be a bit of a tricky subarea, can you include the full show create table output for this table before and after the index changes?

My understanding from the docs is that ref_or_null is explicitly only used in the situation where you have x = y OR x IS NULL.

In the properly-indexed query it's showing a range type which means it's used the index to retrieve the exact rows (range is used for IS NULL queries). The Using index; extra info also means it's only used the index to retrieve the rows rather than scanning anything.

Here's the full table create table output for each

before:

CREATE TABLE `email_recipients` (
  `id` varchar(24) NOT NULL,
  `email_id` varchar(24) NOT NULL,
  `member_id` varchar(24) NOT NULL,
  `batch_id` varchar(24) NOT NULL,
  `processed_at` datetime DEFAULT NULL,
  `delivered_at` datetime DEFAULT NULL,
  `opened_at` datetime DEFAULT NULL,
  `failed_at` datetime DEFAULT NULL,
  `member_uuid` varchar(36) NOT NULL,
  `member_email` varchar(191) NOT NULL,
  `member_name` varchar(191) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `email_recipients_member_id_index` (`member_id`),
  KEY `email_recipients_batch_id_foreign` (`batch_id`),
  KEY `email_recipients_email_id_member_email_index` (`email_id`,`member_email`),
  KEY `email_recipients_delivered_at_index` (`delivered_at`),
  KEY `email_recipients_opened_at_index` (`opened_at`),
  KEY `email_recipients_failed_at_index` (`failed_at`),
  CONSTRAINT `email_recipients_batch_id_foreign` FOREIGN KEY (`batch_id`) REFERENCES `email_batches` (`id`),
  CONSTRAINT `email_recipients_email_id_foreign` FOREIGN KEY (`email_id`) REFERENCES `emails` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

after:

CREATE TABLE `email_recipients` (
  `id` varchar(24) NOT NULL,
  `email_id` varchar(24) NOT NULL,
  `member_id` varchar(24) NOT NULL,
  `batch_id` varchar(24) NOT NULL,
  `processed_at` datetime DEFAULT NULL,
  `delivered_at` datetime DEFAULT NULL,
  `opened_at` datetime DEFAULT NULL,
  `failed_at` datetime DEFAULT NULL,
  `member_uuid` varchar(36) NOT NULL,
  `member_email` varchar(191) NOT NULL,
  `member_name` varchar(191) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `email_recipients_member_id_index` (`member_id`),
  KEY `email_recipients_batch_id_foreign` (`batch_id`),
  KEY `email_recipients_email_id_member_email_index` (`email_id`,`member_email`),
  KEY `email_recipients_email_id_delivered_at_index` (`email_id`,`delivered_at`),
  KEY `email_recipients_email_id_opened_at_index` (`email_id`,`opened_at`),
  KEY `email_recipients_email_id_failed_at_index` (`email_id`,`failed_at`),
  CONSTRAINT `email_recipients_batch_id_foreign` FOREIGN KEY (`batch_id`) REFERENCES `email_batches` (`id`),
  CONSTRAINT `email_recipients_email_id_foreign` FOREIGN KEY (`email_id`) REFERENCES `emails` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

@erik-ghost
Copy link

Could we use the optimizer cost model to find out what the impact of the cost of the query is to MySQL?

@kevinansfield
Copy link
Contributor Author

kevinansfield commented Mar 26, 2024

Could we use the optimizer cost model to find out what the impact of the cost of the query is to MySQL?

I had used EXPLAIN ANALYZE on the individual selects (it didn't work for the UPDATE query with subqueries) to check and compare when I was experimenting yesterday. It appeared considerably better with the indexes in place

before:

mysql> explain analyze SELECT COUNT(id)FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL \G;
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(email_recipients.id)  (cost=88993 rows=1) (actual time=506..506 rows=1 loops=1)
    -> Filter: (email_recipients.delivered_at is not null)  (cost=81036 rows=79563) (actual time=3.08..504 rows=75682 loops=1)
        -> Index lookup on email_recipients using email_recipients_email_id_member_email_index (email_id='00000000846f9cc01e7bd3cc')  (cost=81036 rows=159126) (actual time=3.07..501 rows=79275 loops=1)

after:

mysql> explain analyze SELECT COUNT(id)FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL \G;
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(email_recipients.id)  (cost=47372 rows=1) (actual time=73.7..73.7 rows=1 loops=1)
    -> Filter: ((email_recipients.email_id = '00000000846f9cc01e7bd3cc') and (email_recipients.delivered_at is not null))  (cost=32769 rows=146030) (actual time=0.093..70.8 rows=75682 loops=1)
        -> Covering index range scan on email_recipients using email_recipients_email_id_delivered_at_index over (email_id = '00000000846f9cc01e7bd3cc' AND NULL < delivered_at)  (cost=32769 rows=146030) (actual time=0.0884..45.5 rows=75682 loops=1)

@kevinansfield
Copy link
Contributor Author

With the queries split out as in #19917 there may be some marginal benefit to changing the queries to match typical patterns.

Counting failed records is quick because there's usually very few of them

mysql> SELECT COUNT(*) FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND failed_at IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|       60 |
+----------+
1 row in set (0.00 sec)

Conversely counting delivered records with IS NOT NULL is slower because that almost always matches all of the records

mysql> SELECT COUNT(*) FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|    75682 |
+----------+
1 row in set (0.07 sec)

If we can flip that to count only the non-delivered records then it's a much quicker query

mysql> SELECT COUNT(*) FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NULL;
+----------+
| COUNT(*) |
+----------+
|     3593 |
+----------+
1 row in set (0.01 sec)

However, that requires us to already have a total email_recipient count in order to make the calculation in JS, which thankfully we store on the email table during email creation.

mysql> select email_count from emails where id='00000000846f9cc01e7bd3cc';
+-------------+
| email_count |
+-------------+
|       79248 |
+-------------+
1 row in set (0.00 sec)

From there it leaves us with calculating the opened_at count which is trickier because it will have a slower flip across that mid-point, although we hope it will eventually be faster to count the fewer number of unopened emails. I think it's probably a wash either way for using IS NULL vs IS NOT NULL on that table.

@kevinansfield kevinansfield merged commit d5a9731 into TryGhost:main Apr 3, 2024
21 checks passed
@kevinansfield kevinansfield deleted the email-recipients-index-migrations branch April 3, 2024 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
migration [pull request] Includes migration for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants