Fixed email_recipients indexes to match query usage #19918

kevinansfield · 2024-03-25T16:59:02Z

closes https://linear.app/tryghost/issue/ENG-791/migration-to-fix-email-recipients-indexes

Our indexes over single columns (delivered_at, opened_at, failed_at) were ineffective because the only time we query those is alongside email_id meaning we were frequently performing full table scans on very large tables during our email analytics jobs.

added migration to add new indexes covering email_id and the respective columns
added migration to drop the old indexes that weren't being used in any query plans

Local runtime with ~2M email_recipient rows:

before: 1.7s
after: 99ms

Explain output...

before:

+----+-------------+------------------+------------+-------+----------------------------------------------------------------------------------+----------------------------------------------+---------+-------+--------+----------+------------------------------------+
| id | select_type | table            | partitions | type  | possible_keys                                                                    | key                                          | key_len | ref   | rows   | filtered | Extra                              |
+----+-------------+------------------+------------+-------+----------------------------------------------------------------------------------+----------------------------------------------+---------+-------+--------+----------+------------------------------------+
|  1 | UPDATE      | emails           | NULL       | index | NULL                                                                             | PRIMARY                                      | 98      | NULL  |      1 |   100.00 | Using where                        |
|  4 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_failed_at_index    | email_recipients_failed_at_index             | 6       | NULL  |   2343 |     7.76 | Using index condition; Using where |
|  3 | SUBQUERY    | email_recipients | NULL       | ref   | email_recipients_email_id_member_email_index,email_recipients_opened_at_index    | email_recipients_email_id_member_email_index | 98      | const | 159126 |    50.00 | Using where                        |
|  2 | SUBQUERY    | email_recipients | NULL       | ref   | email_recipients_email_id_member_email_index,email_recipients_delivered_at_index | email_recipients_email_id_member_email_index | 98      | const | 159126 |    50.00 | Using where                        |
+----+-------------+------------------+------------+-------+----------------------------------------------------------------------------------+----------------------------------------------+---------+-------+--------+----------+------------------------------------+

after:

+----+-------------+------------------+------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table            | partitions | type  | possible_keys                                                                                                                                                                 | key                                          | key_len | ref  | rows   | filtered | Extra                    |
+----+-------------+------------------+------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+--------+----------+--------------------------+
|  1 | UPDATE      | emails           | NULL       | index | NULL                                                                                                                                                                          | PRIMARY                                      | 98      | NULL |      1 |   100.00 | Using where;             |
|  4 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_email_id_delivered_at_index,email_recipients_email_id_opened_at_index,email_recipients_email_id_failed_at_index | email_recipients_email_id_failed_at_index    | 104     | NULL |     60 |   100.00 | Using where; Using index |
|  3 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_email_id_delivered_at_index,email_recipients_email_id_opened_at_index,email_recipients_email_id_failed_at_index | email_recipients_email_id_opened_at_index    | 104     | NULL | 119496 |   100.00 | Using where; Using index |
|  2 | SUBQUERY    | email_recipients | NULL       | range | email_recipients_email_id_member_email_index,email_recipients_email_id_delivered_at_index,email_recipients_email_id_opened_at_index,email_recipients_email_id_failed_at_index | email_recipients_email_id_delivered_at_index | 104     | NULL | 146030 |   100.00 | Using where; Using index |
+----+-------------+------------------+------------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+---------+------+--------+----------+--------------------------+

closes https://linear.app/tryghost/issue/ENG-791/migration-to-fix-email-recipients-indexes Our indexes over single columns (`delivered_at`, `opened_at`, `failed_at`) were ineffective because the only time we query those is alongside `email_id` meaning we were frequently performing full table scans on very large tables during our email analytics jobs. - added migration to add new indexes covering `email_id` and the respective columns - added migration to drop the old indexes that weren't being used in any query plans

github-actions · 2024-03-25T16:59:14Z

erik-ghost · 2024-03-25T18:11:32Z

@kevinansfield there's a few more things I think we should get clear on:

are we positive we're avoiding hitting caches when we're getting these timings? Can you describe the process you're using to get the timings, and are you using SQL_NO_CACHE? You can see some more details here for what variables to look for to understand if you're hitting caches.
adding more indexes will likely incur a penalty in performance of writing to that table. How can we reason about what impact this change has on the performance of the writes to this table?
although adding the indexes improves the performance, we're still hitting a lot of rows (as shown in the explain), by the Using where: this likely means that although it's using an index, it's still having to do a table scan for that number of rows. Changing this so that the explain output says Using index will be an improvement.
This is likely because we're doing a IS NOT NULL, you can read more about that here and you can spot that "ref" column in the explain output change from const to NULL.
IS NULL optimization can be a bit of a tricky subarea, can you include the full show create table output for this table before and after the index changes?

Thank you!

kevinansfield · 2024-03-25T18:19:07Z

are we positive we're avoiding hitting caches when we're getting these timings? Can you describe the process you're using to get the timings, and are you using SQL_NO_CACHE? You can see some more details here for what variables to look for to understand if you're hitting caches.

There's no query cache in MySQL 8 as far as I can tell. To avoid cases of query data being in memory I was restarting the server between query runs.

kevinansfield · 2024-03-25T18:21:13Z

adding more indexes will likely incur a penalty in performance of writing to that table. How can we reason about what impact this change has on the performance of the writes to this table?

In this case we're replacing an already existing index so we shouldn't have huge differences. I can take a look at timing a batch insert operation.

kevinansfield · 2024-03-25T18:26:54Z

although adding the indexes improves the performance, we're still hitting a lot of rows (as shown in the explain), by the Using where: this likely means that although it's using an index, it's still having to do a table scan for that number of rows. Changing this so that the explain output says Using index will be an improvement.

I did try looking into the Using where; comment earlier but everything I found was inconclusive, it seems prevailing thought was it's not very indicative of anything because it can show up simply because there's a where condition in the query (the docs are also very ambiguous but seem to support that theory, especially in this case where we're passing the restricted row set to the COUNT function?).

The explain output does have Using index; which means that it was possible to purely use the index to retrieve the necessary rows. My reading of the docs is that it will output only Using index; with no Using where; when you're querying against a primary key.

kevinansfield · 2024-03-25T18:55:25Z

This is likely because we're doing a IS NOT NULL, you can read more about that here and you can spot that "ref" column in the explain output change from const to NULL.
IS NULL optimization can be a bit of a tricky subarea, can you include the full show create table output for this table before and after the index changes?

My understanding from the docs is that ref_or_null is explicitly only used in the situation where you have x = y OR x IS NULL.

In the properly-indexed query it's showing a range type which means it's used the index to retrieve the exact rows (range is used for IS NULL queries). The Using index; extra info also means it's only used the index to retrieve the rows rather than scanning anything.

Here's the full table create table output for each

before:

CREATE TABLE `email_recipients` (
  `id` varchar(24) NOT NULL,
  `email_id` varchar(24) NOT NULL,
  `member_id` varchar(24) NOT NULL,
  `batch_id` varchar(24) NOT NULL,
  `processed_at` datetime DEFAULT NULL,
  `delivered_at` datetime DEFAULT NULL,
  `opened_at` datetime DEFAULT NULL,
  `failed_at` datetime DEFAULT NULL,
  `member_uuid` varchar(36) NOT NULL,
  `member_email` varchar(191) NOT NULL,
  `member_name` varchar(191) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `email_recipients_member_id_index` (`member_id`),
  KEY `email_recipients_batch_id_foreign` (`batch_id`),
  KEY `email_recipients_email_id_member_email_index` (`email_id`,`member_email`),
  KEY `email_recipients_delivered_at_index` (`delivered_at`),
  KEY `email_recipients_opened_at_index` (`opened_at`),
  KEY `email_recipients_failed_at_index` (`failed_at`),
  CONSTRAINT `email_recipients_batch_id_foreign` FOREIGN KEY (`batch_id`) REFERENCES `email_batches` (`id`),
  CONSTRAINT `email_recipients_email_id_foreign` FOREIGN KEY (`email_id`) REFERENCES `emails` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

after:

CREATE TABLE `email_recipients` (
  `id` varchar(24) NOT NULL,
  `email_id` varchar(24) NOT NULL,
  `member_id` varchar(24) NOT NULL,
  `batch_id` varchar(24) NOT NULL,
  `processed_at` datetime DEFAULT NULL,
  `delivered_at` datetime DEFAULT NULL,
  `opened_at` datetime DEFAULT NULL,
  `failed_at` datetime DEFAULT NULL,
  `member_uuid` varchar(36) NOT NULL,
  `member_email` varchar(191) NOT NULL,
  `member_name` varchar(191) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `email_recipients_member_id_index` (`member_id`),
  KEY `email_recipients_batch_id_foreign` (`batch_id`),
  KEY `email_recipients_email_id_member_email_index` (`email_id`,`member_email`),
  KEY `email_recipients_email_id_delivered_at_index` (`email_id`,`delivered_at`),
  KEY `email_recipients_email_id_opened_at_index` (`email_id`,`opened_at`),
  KEY `email_recipients_email_id_failed_at_index` (`email_id`,`failed_at`),
  CONSTRAINT `email_recipients_batch_id_foreign` FOREIGN KEY (`batch_id`) REFERENCES `email_batches` (`id`),
  CONSTRAINT `email_recipients_email_id_foreign` FOREIGN KEY (`email_id`) REFERENCES `emails` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

erik-ghost · 2024-03-25T23:36:56Z

Could we use the optimizer cost model to find out what the impact of the cost of the query is to MySQL?

kevinansfield · 2024-03-26T11:19:03Z

Could we use the optimizer cost model to find out what the impact of the cost of the query is to MySQL?

I had used EXPLAIN ANALYZE on the individual selects (it didn't work for the UPDATE query with subqueries) to check and compare when I was experimenting yesterday. It appeared considerably better with the indexes in place

before:

mysql> explain analyze SELECT COUNT(id)FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL \G;
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(email_recipients.id)  (cost=88993 rows=1) (actual time=506..506 rows=1 loops=1)
    -> Filter: (email_recipients.delivered_at is not null)  (cost=81036 rows=79563) (actual time=3.08..504 rows=75682 loops=1)
        -> Index lookup on email_recipients using email_recipients_email_id_member_email_index (email_id='00000000846f9cc01e7bd3cc')  (cost=81036 rows=159126) (actual time=3.07..501 rows=79275 loops=1)

after:

mysql> explain analyze SELECT COUNT(id)FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL \G;
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(email_recipients.id)  (cost=47372 rows=1) (actual time=73.7..73.7 rows=1 loops=1)
    -> Filter: ((email_recipients.email_id = '00000000846f9cc01e7bd3cc') and (email_recipients.delivered_at is not null))  (cost=32769 rows=146030) (actual time=0.093..70.8 rows=75682 loops=1)
        -> Covering index range scan on email_recipients using email_recipients_email_id_delivered_at_index over (email_id = '00000000846f9cc01e7bd3cc' AND NULL < delivered_at)  (cost=32769 rows=146030) (actual time=0.0884..45.5 rows=75682 loops=1)

kevinansfield · 2024-03-26T13:10:29Z

With the queries split out as in #19917 there may be some marginal benefit to changing the queries to match typical patterns.

Counting failed records is quick because there's usually very few of them

mysql> SELECT COUNT(*) FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND failed_at IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|       60 |
+----------+
1 row in set (0.00 sec)

Conversely counting delivered records with IS NOT NULL is slower because that almost always matches all of the records

mysql> SELECT COUNT(*) FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NOT NULL;
+----------+
| COUNT(*) |
+----------+
|    75682 |
+----------+
1 row in set (0.07 sec)

If we can flip that to count only the non-delivered records then it's a much quicker query

mysql> SELECT COUNT(*) FROM email_recipients WHERE email_id='00000000846f9cc01e7bd3cc' AND delivered_at IS NULL;
+----------+
| COUNT(*) |
+----------+
|     3593 |
+----------+
1 row in set (0.01 sec)

However, that requires us to already have a total email_recipient count in order to make the calculation in JS, which thankfully we store on the email table during email creation.

mysql> select email_count from emails where id='00000000846f9cc01e7bd3cc';
+-------------+
| email_count |
+-------------+
|       79248 |
+-------------+
1 row in set (0.00 sec)

From there it leaves us with calculating the opened_at count which is trickier because it will have a slower flip across that mid-point, although we hope it will eventually be faster to count the fewer number of unopened emails. I think it's probably a wash either way for using IS NULL vs IS NOT NULL on that table.

github-actions bot added the migration [pull request] Includes migration for review label Mar 25, 2024

kevinansfield requested review from daniellockyer and allouis March 25, 2024 16:59

kevinansfield mentioned this pull request Mar 26, 2024

Removed use of subqueries in email analytics queries #19917

Merged

allouis approved these changes Apr 3, 2024

View reviewed changes

daniellockyer approved these changes Apr 3, 2024

View reviewed changes

kevinansfield merged commit d5a9731 into TryGhost:main Apr 3, 2024
21 checks passed

kevinansfield deleted the email-recipients-index-migrations branch April 3, 2024 16:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed email_recipients indexes to match query usage #19918

Fixed email_recipients indexes to match query usage #19918

kevinansfield commented Mar 25, 2024 •

edited

github-actions bot commented Mar 25, 2024

erik-ghost commented Mar 25, 2024

kevinansfield commented Mar 25, 2024 •

edited

kevinansfield commented Mar 25, 2024

kevinansfield commented Mar 25, 2024 •

edited

kevinansfield commented Mar 25, 2024

erik-ghost commented Mar 25, 2024

kevinansfield commented Mar 26, 2024 •

edited

kevinansfield commented Mar 26, 2024

Fixed email_recipients indexes to match query usage #19918

Fixed email_recipients indexes to match query usage #19918

Conversation

kevinansfield commented Mar 25, 2024 • edited

github-actions bot commented Mar 25, 2024

General requirements

Schema changes

Data changes

erik-ghost commented Mar 25, 2024

kevinansfield commented Mar 25, 2024 • edited

kevinansfield commented Mar 25, 2024

kevinansfield commented Mar 25, 2024 • edited

kevinansfield commented Mar 25, 2024

erik-ghost commented Mar 25, 2024

kevinansfield commented Mar 26, 2024 • edited

kevinansfield commented Mar 26, 2024

kevinansfield commented Mar 25, 2024 •

edited

kevinansfield commented Mar 25, 2024 •

edited

kevinansfield commented Mar 25, 2024 •

edited

kevinansfield commented Mar 26, 2024 •

edited