# 10 tips to Optimize your Django queries with PostgreSQL
Welcome to this playground. It accompanies our [10 tips to Optimize your Django queries with PostgreSQL](https://www.gitguardian.com) blog article and lets you try each tip yourself and experiment with your own optimization ideas.


## Set up your project

### Imports and Django setup
You must run this cell each time you restart the kernel.

In [2]:
import time

# setup django
import django_init
from django.contrib.postgres.aggregates import ArrayAgg
from django.core.management import call_command
from django.db import connection, reset_queries
from django.db.models import Prefetch

from books.models import Person, Book


### Migrate your database
The following cell migrates your database. You only need to run it if you change your Django models.

In [22]:
# Create missing migrations
call_command("makemigrations", interactive=True)
# Run migrations
call_command("migrate", interactive=True)

No changes detected
Operations to perform:
  Apply all migrations: admin, auth, books, contenttypes, sessions
Running migrations:
  No migrations to apply.


### Populate your database
The following cells populate the database with a large amount of fake data. If `autovacuum` is not enabled on your database instance, you also need to refresh the table statistics so that the PostgreSQL query planner can make the right decisions.

In [None]:
call_command("generate_data")

In [74]:
with connection.cursor() as cursor:
    cursor.execute("VACUUM ANALYSE books_book")
    cursor.execute("VACUUM ANALYSE books_person")
    cursor.execute("VACUUM ANALYSE books_book_readers")

In [4]:
Person.objects.count()

999996

In [5]:
Book.objects.count()

2000

## A Good Method To Iterate fast
Django provides built-in ways to display the SQL queries it executes and to explain how the PostgreSQL query planner resolves them.

In [6]:
reset_queries()

query_set = Person.objects.only("id")
person = query_set.first()

print("SQL Query: ", query_set[:10].query)
print("PostgreSQL query: ", connection.queries[0])  # needs DEBUG=True
print("PostgreSQL explain analyze:", query_set[:10].explain(ANALYZE=True))

SQL Query SELECT "books_person"."id" FROM "books_person" LIMIT 10
PostgreSQL query:  {'sql': 'SELECT "books_person"."id" FROM "books_person" ORDER BY "books_person"."id" ASC LIMIT 1', 'time': '0.002'}
pg explain analyze: Limit  (cost=0.00..1.11 rows=10 width=8) (actual time=0.034..0.038 rows=10 loops=1)
  ->  Seq Scan on books_person  (cost=0.00..113225.35 rows=1016335 width=8) (actual time=0.033..0.035 rows=10 loops=1)
Planning Time: 0.064 ms
Execution Time: 0.055 ms
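
The cells in this playground repeat the same `reset_queries()` / `time.perf_counter()` boilerplate. A minimal sketch of a reusable helper follows; `timed` and its `get_queries` parameter are hypothetical names, not part of Django. Keep in mind that QuerySets are lazy, so the timed block has to force evaluation (for example with `list()`) for the query to be executed and logged.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, get_queries=None):
    """Time a block and, optionally, count the queries it logged.

    get_queries is a hypothetical hook: a zero-argument callable returning the
    current query log, e.g. `lambda: connection.queries` (needs DEBUG=True).
    """
    stats = {}
    queries_before = len(get_queries()) if get_queries else 0
    start = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - start
        if get_queries:
            stats["queries"] = len(get_queries()) - queries_before
        print(f"{label}: {stats}")


# Stand-in for connection.queries so the sketch runs without a database.
fake_log = []
with timed("demo", get_queries=lambda: fake_log) as stats:
    fake_log.append({"sql": "SELECT 1", "time": "0.001"})

# In the notebook this could become (assumption, not executed here):
# with timed("ids", get_queries=lambda: connection.queries):
#     list(Person.objects.only("id")[:10])  # list() forces the query to run
```

The log is read through a callable because `connection.queries` returns a fresh list on each access, so a reference captured before the block would not see new entries.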


## Select Only What You Need
You can significantly improve performance by reducing the amount of data sent to and returned by the database.

### Fetching using a large query
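The following query filters on 100,000 email addresses. Because `lots_emails` is passed as an unevaluated queryset, Django embeds it as a subquery (the `IN (SELECT ...` visible in the output below); evaluating it first with `list(lots_emails)` would inline the 100,000 addresses into the SQL text sent to PostgreSQL, which is the case this tip warns about. In that situation, even if the execution time is small, the total time (including Django processing and networking) is very long.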

In [3]:
all_persons_qs = Person.objects.all()

lots_emails = all_persons_qs.values_list("email", flat=True)[:100_000]
print(lots_emails[:10])

big_qs = Person.objects.filter(email__in=lots_emails)

reset_queries()
start_time = time.perf_counter()

all_persons = big_qs.all()

print("PostgreSQL query: ", str(big_qs.query)[:200])
print("PostgreSQL explain analyze: ", big_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

<QuerySet ['1000000_christopher74@example.net', '1000000_daniel75@example.org', '1000000_wdean@example.com', '100000_mcooper@example.org', '100001_jeffreyortiz@example.com', '100001_mary83@example.com', '100002_garciadaniel@example.org', '100002_vmoore@example.net', '100003_amandadiaz@example.net', '100005_michelle84@example.net']>
PostgreSQL query:  SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person" WHERE "books_person"."email" IN (SELECT U0."email" FROM "books_person" U0 LIMIT 1000
PostgreSQL explain analyze: Limit  (cost=8612.44..8643.92 rows=10 width=780) (actual time=413.782..414.509 rows=10 loops=1)
  ->  Hash Semi Join  (cost=8612.44..323361.89 rows=100000 width=780) (actual time=413.781..414.505 rows=10 loops=1)
        Hash Cond: (books_person.email = u0.email)
        ->  Seq Scan on books_person  (cost=0.00..113061.96 rows=999996 width=780) (actual time=0.202..1.438 rows=139 loops=1)
        ->  Hash  (cost=667

### Fetching the whole model
In this example, we fetch all fields of the Person model, including `bio` (text).

In [10]:
reset_queries()
start_time = time.perf_counter()

all_persons = all_persons_qs.all()

print("PostgreSQL query: ", all_persons_qs.query)
print("PostgreSQL explain analyze: ", all_persons_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

PostgreSQL query:  SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person"
PostgreSQL explain analyze: Limit  (cost=0.00..1.13 rows=10 width=780) (actual time=0.013..0.019 rows=10 loops=1)
  ->  Seq Scan on books_person  (cost=0.00..113061.96 rows=999996 width=780) (actual time=0.012..0.016 rows=10 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.036 ms
Total time: 0.0017139300471171737


### Fetching only the id
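Fetching only the `id` shrinks the row width reported by the planner from 780 to 8 and lets PostgreSQL answer with an index-only scan on the primary key.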

In [12]:
all_persons_qs = all_persons_qs.only("id")

reset_queries()
start_time = time.perf_counter()

all_persons = all_persons_qs.all()

print("PostgreSQL query: ", all_persons_qs.query)
print("PostgreSQL explain analyze: ", all_persons_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

PostgreSQL query:  SELECT "books_person"."id" FROM "books_person"
PostgreSQL explain analyze: Limit  (cost=0.42..0.80 rows=10 width=8) (actual time=1.493..1.499 rows=10 loops=1)
  ->  Index Only Scan using books_person_pkey on books_person  (cost=0.42..37496.37 rows=999996 width=8) (actual time=1.491..1.495 rows=10 loops=1)
        Heap Fetches: 0
Planning Time: 0.092 ms
Execution Time: 1.522 ms
Total time: 0.0039990650257095695


If you only need a list of ids, you can save a lot of time by not instantiating models at all, using `values()` or `values_list()`.

In [13]:
all_persons_qs = all_persons_qs.only("id").values_list("id")

reset_queries()
start_time = time.perf_counter()

all_persons = all_persons_qs.all()

print("PostgreSQL query: ", all_persons_qs.query)
print("PostgreSQL explain analyze: ", all_persons_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

PostgreSQL query:  SELECT "books_person"."id" FROM "books_person"
PostgreSQL explain analyze: Limit  (cost=0.42..0.80 rows=10 width=8) (actual time=0.017..0.022 rows=10 loops=1)
  ->  Index Only Scan using books_person_pkey on books_person  (cost=0.42..37496.37 rows=999996 width=8) (actual time=0.016..0.020 rows=10 loops=1)
        Heap Fetches: 0
Planning Time: 0.085 ms
Execution Time: 0.039 ms
Total time: 0.004087422043085098


## Index what you're searching for
Let's search authors by name.

### Search without index
Without any index, the query scans the whole table to find the right value.

In [14]:
with connection.cursor() as cursor:
    cursor.execute("DROP INDEX IF EXISTS books_person_name_upper_idx")
    cursor.execute("DROP INDEX IF EXISTS books_person_name_idx")

tolstoy_qs = Person.objects.filter(name__iexact="tolstoy").only("email")

reset_queries()
start_time = time.perf_counter()

tolstoy = tolstoy_qs.all()

print("PostgreSQL query: ", tolstoy_qs.query)
print("PostgreSQL explain analyze: ", tolstoy_qs.explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

PostgreSQL query:  SELECT "books_person"."id", "books_person"."email" FROM "books_person" WHERE UPPER("books_person"."name"::text) = UPPER(tolstoy)
PostgreSQL explain analyze: Gather  (cost=1000.00..110811.98 rows=5000 width=37) (actual time=583.833..586.294 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on books_person  (cost=0.00..109311.98 rows=2083 width=37) (actual time=544.738..565.079 rows=0 loops=3)
        Filter: (upper(name) = 'TOLSTOY'::text)
        Rows Removed by Filter: 333332
Planning Time: 2.016 ms
Execution Time: 586.323 ms
Total time: 0.5973122449358925


### Search with a regular index
The following code creates a regular index, just like Django would if we added `db_index=True` to the `name` field. However, a B-Tree index on `name` cannot serve a filter on `UPPER(name)`, which is what a case-insensitive search generates, so the planner still scans the whole table.

In [15]:
with connection.cursor() as cursor:
    cursor.execute(
        "CREATE INDEX IF NOT EXISTS books_person_name_idx ON books_person (name);"
    )

# wait for the index creation
time.sleep(5)

tolstoy_qs = Person.objects.filter(name__iexact="tolstoy").only("email")

reset_queries()
start_time = time.perf_counter()

tolstoy = tolstoy_qs.all()

print("PostgreSQL query: ", tolstoy_qs.query)
print("PostgreSQL explain analyze: ", tolstoy_qs.explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

PostgreSQL query:  SELECT "books_person"."id", "books_person"."email" FROM "books_person" WHERE UPPER("books_person"."name"::text) = UPPER(tolstoy)
PostgreSQL explain analyze: Gather  (cost=1000.00..110811.98 rows=5000 width=37) (actual time=284.198..314.624 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on books_person  (cost=0.00..109311.98 rows=2083 width=37) (actual time=294.664..303.912 rows=0 loops=3)
        Filter: (upper(name) = 'TOLSTOY'::text)
        Rows Removed by Filter: 333332
Planning Time: 0.698 ms
Execution Time: 314.645 ms
Total time: 0.3327687010169029


### Search with case insensitive index
The previous attempt did not help, so we try again with a case-insensitive index built on `UPPER(name)`.

In [16]:
with connection.cursor() as cursor:
    cursor.execute(
        "CREATE INDEX IF NOT EXISTS books_person_name_upper_idx ON books_person (UPPER(name));"
    )

# wait for the index creation
time.sleep(5)

tolstoy_qs = Person.objects.filter(name__iexact="tolstoy").only("email")

reset_queries()
start_time = time.perf_counter()

tolstoy = tolstoy_qs.all()

print("PostgreSQL query: ", tolstoy_qs.query)
print("PostgreSQL explain analyze: ", tolstoy_qs.explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

PostgreSQL query:  SELECT "books_person"."id", "books_person"."email" FROM "books_person" WHERE UPPER("books_person"."name"::text) = UPPER(tolstoy)
PostgreSQL explain analyze: Bitmap Heap Scan on books_person  (cost=119.17..16534.54 rows=5000 width=37) (actual time=0.337..0.339 rows=1 loops=1)
  Recheck Cond: (upper(name) = 'TOLSTOY'::text)
  Heap Blocks: exact=1
  ->  Bitmap Index Scan on books_person_name_upper_idx  (cost=0.00..117.92 rows=5000 width=0) (actual time=0.105..0.105 rows=1 loops=1)
        Index Cond: (upper(name) = 'TOLSTOY'::text)
Planning Time: 2.482 ms
Execution Time: 0.739 ms
Total time: 0.0077229730086401105
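
The same effect can be reproduced without a PostgreSQL instance. The sketch below uses Python's built-in `sqlite3` module as a stand-in (an assumption made for portability; SQLite's planner output differs from PostgreSQL's) to show that a plain index on `name` is ignored by a filter on `upper(name)`, while an expression index is used. The table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
people = [(f"Author {i}", f"author{i}@example.com") for i in range(1000)]
people.append(("Tolstoy", "leo@example.com"))
conn.executemany("INSERT INTO person (name, email) VALUES (?, ?)", people)

# Same shape as the SQL Django generates for name__iexact.
QUERY = "SELECT id, email FROM person WHERE upper(name) = ?"
PARAMS = ("TOLSTOY",)


def plan(sql, params):
    """Return SQLite's query plan as one string (last column holds the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# A plain index on name cannot serve a filter on upper(name).
conn.execute("CREATE INDEX person_name_idx ON person (name)")
plan_plain = plan(QUERY, PARAMS)

# An expression index matches the filter and is picked up by the planner.
conn.execute("CREATE INDEX person_name_upper_idx ON person (upper(name))")
plan_expr = plan(QUERY, PARAMS)

match = conn.execute(QUERY, PARAMS).fetchone()
print("plain index:     ", plan_plain)
print("expression index:", plan_expr)
```

The first plan should report a full scan of `person` and the second a search using `person_name_upper_idx`, mirroring the switch from `Parallel Seq Scan` to `Bitmap Index Scan` above.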


## Select_related and prefetch_related are not always the best match
We want to get the author of each book in a list of N books.

### Naive approach
With the naive method, we need N+1 queries to achieve that.

In [17]:
N = 10

reset_queries()

for book in Book.objects.all()[:N]:
    author = book.author

print(connection.queries)
print(f"{len(connection.queries)} queries have been executed")

[{'sql': 'SELECT "books_book"."id", "books_book"."title", "books_book"."author_id" FROM "books_book" LIMIT 10', 'time': '0.002'}, {'sql': 'SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person" WHERE "books_person"."id" = 1885800 LIMIT 21', 'time': '0.004'}, {'sql': 'SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person" WHERE "books_person"."id" = 2257661 LIMIT 21', 'time': '0.003'}, {'sql': 'SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person" WHERE "books_person"."id" = 1713720 LIMIT 21', 'time': '0.002'}, {'sql': 'SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person" WHERE "books_person"."id" = 1621007 LIMIT 21', 'time': '0.001'}, {'sql': 'SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM 

### select_related()
With `select_related()`, only one query is needed: the authors are fetched through a join.

In [18]:
reset_queries()

for book in Book.objects.select_related("author")[:10]:
    author = book.author

print(connection.queries)

[{'sql': 'SELECT "books_book"."id", "books_book"."title", "books_book"."author_id", "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_book" INNER JOIN "books_person" ON ("books_book"."author_id" = "books_person"."id") LIMIT 10', 'time': '0.004'}]
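
To make the N+1 pattern concrete outside Django, here is a small sketch using the built-in `sqlite3` module as a stand-in for the ORM (an assumption; the `run` helper and the schema are illustrative). It counts the statements issued by the naive loop and by a single join, which is what `select_related()` generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?)", [(i, f"Author {i}") for i in range(1, 11)]
)
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)", [(i, f"Book {i}", i) for i in range(1, 11)]
)

executed = []  # every SQL statement we send, in order


def run(sql, params=()):
    """Execute a statement, record it, and return all rows."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


# Naive approach: one query for the books, then one more per book.
naive = {}
for title, author_id in run("SELECT title, author_id FROM book LIMIT 10"):
    naive[title] = run("SELECT name FROM person WHERE id = ?", (author_id,))[0][0]
naive_count = len(executed)

# Join approach: a single statement returns books with their authors.
executed.clear()
joined = dict(
    run(
        "SELECT book.title, person.name FROM book "
        "JOIN person ON person.id = book.author_id LIMIT 10"
    )
)
join_count = len(executed)

print(naive_count, join_count)  # 11 1
```

Both approaches return the same mapping; only the number of round trips differs.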


### Using prefetch_related for "* to many" relations 
For OneToMany or ManyToMany relations, `prefetch_related()` is used instead.

In [19]:
reset_queries()

for person in Person.objects.prefetch_related("writings")[:10]:
    writings = person.writings.all()  # served from the prefetch cache, no extra query

print(connection.queries)

[{'sql': 'SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio" FROM "books_person" LIMIT 10', 'time': '0.001'}, {'sql': 'SELECT "books_book"."id", "books_book"."title", "books_book"."author_id" FROM "books_book" WHERE "books_book"."author_id" IN (2169230, 2169241, 2169248, 2169255, 2169264, 2169272, 2169277, 2169285, 2169294, 2169304)', 'time': '0.002'}]


But it can generate huge queries that may take a long time to execute.

In [20]:
reset_queries()
start_time = time.perf_counter()

result = {}
for person in Person.objects.prefetch_related("writings")[:100_000]:
    result[person.email] = [book.title for book in person.writings.all()]

print("Query duration:", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

Query duration 0.9329999999999999
Total time: 17.908183124032803


### Using to_attr to speed up prefetch_related
As stated in Django's [prefetch_related documentation](https://docs.djangoproject.com/en/4.1/ref/models/querysets/#prefetch-related), you can use `to_attr` to store the prefetched results in a list. It doesn't help much with query duration, but the total time is much better (about 3.4 s instead of 17.9 s in the runs shown here).

In [4]:
reset_queries()
start_time = time.perf_counter()

result = {
    person.email: [book.title for book in person.prefetched_writings]
    for person in Person.objects.prefetch_related(
        Prefetch("writings", to_attr="prefetched_writings")
    )[:100_000]
}

print("Query duration", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

Query duration 1.059
Total time: 3.41s


### Using aggregation
Another way to get our book titles is to use aggregation. Again, we can see performance gains.

In [5]:
reset_queries()
start_time = time.perf_counter()

result = {
    person.email: person.writings_titles
    for person in Person.objects.annotate(writings_titles=ArrayAgg("writings__title"))[
        :100_000
    ]
}

print(connection.queries[-1]['sql'])
print("Query duration", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

SELECT "books_person"."id", "books_person"."email", "books_person"."name", "books_person"."bio", ARRAY_AGG("books_book"."title" ) AS "writings_titles" FROM "books_person" LEFT OUTER JOIN "books_book" ON ("books_person"."id" = "books_book"."author_id") GROUP BY "books_person"."id" LIMIT 100000
Query duration 0.973
Total time: 1.91s


If we don't need to instantiate models but only need some values, we can save even more time.

In [23]:
reset_queries()
start_time = time.perf_counter()

result = {
    person.email: person.writings_titles
    for person in Person.objects.annotate(
        writings_titles=ArrayAgg("writings__title")
    ).values_list("email", "writings_titles", named=True)[:100_000]
}

print(connection.queries)
print("Query duration", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

[{'sql': 'SELECT "books_person"."email", ARRAY_AGG("books_book"."title" ) AS "writings_titles" FROM "books_person" LEFT OUTER JOIN "books_book" ON ("books_person"."id" = "books_book"."author_id") GROUP BY "books_person"."id" LIMIT 100000', 'time': '0.228'}]
Query duration 0.228
Total time: 0.38546457199845463


## Aggregations VS subqueries

### Get writers stats using aggregations
For each author, we want the list of books they wrote and the total number of readers.
With the Django ORM, this is usually done with the `annotate()` method.

In [24]:
from django.db.models import Count

writers_stats_qs = Person.objects.annotate(
    writings_title=ArrayAgg("writings__title"),
    readers_count=Count("writings__readers"),
).values_list("name", "bio", "writings_title", "readers_count")

reset_queries()

writers_stats = writers_stats_qs.all()  # lazy: no query runs here, so the log printed below is empty

print(connection.queries)
print("query duration", sum(float(query["time"]) for query in connection.queries))
print("pg explain analyze:", writers_stats_qs.explain(ANALYZE=True))

[]
query duration 0
pg explain analyze: GroupAggregate  (cost=1.27..76855404.29 rows=999996 width=791) (actual time=1.513..61136.046 rows=999996 loops=1)
  Group Key: books_person.id
  ->  Merge Left Join  (cost=1.27..1423986.02 rows=10055855776 width=822) (actual time=1.494..53501.557 rows=21110086 loops=1)
        Merge Cond: (books_person.id = books_book.author_id)
        ->  Index Scan using books_person_pkey on books_person  (cost=0.42..258578.88 rows=999996 width=751) (actual time=0.013..2512.841 rows=999996 loops=1)
        ->  Materialize  (cost=0.84..911509.75 rows=20111792 width=79) (actual time=1.477..43920.691 rows=20112082 loops=1)
              ->  Nested Loop Left Join  (cost=0.84..861230.27 rows=20111792 width=79) (actual time=1.461..31093.646 rows=20112082 loops=1)
                    ->  Index Scan using books_book_author_id_8b91747b on books_book  (cost=0.28..186.27 rows=2000 width=79) (actual time=0.509..9.167 rows=2000 loops=1)
                    ->  Index Only S

### Get writers stats using subqueries
The following example will use 2 subqueries instead of `annotate()` for the same purpose.

In [25]:
from django.db.models import Count, OuterRef
from django.contrib.postgres.expressions import ArraySubquery

writings_subquery = Book.objects.filter(author_id=OuterRef("id")).values("title")
readers_subquery = (
    Book.objects.filter(author_id=OuterRef("id"))
    .values("author_id")
    .values(count=Count("readers__id"))[:1]
)
writers_stats_qs = Person.objects.annotate(
    writings_title=ArraySubquery(writings_subquery), readers_count=readers_subquery
).values_list("name", "bio", "writings_title", "readers_count")

reset_queries()

writers_stats = writers_stats_qs.all()  # lazy: no query runs here, so the log printed below is empty

print(connection.queries)
print("query duration", sum(float(query["time"]) for query in connection.queries))
print("pg explain analyze:", writers_stats_qs.explain(ANALYZE=True))

[]
query duration 0
pg explain analyze: Seq Scan on books_person  (cost=0.00..527045954.22 rows=999996 width=783) (actual time=1.326..20453.472 rows=999996 loops=1)
  SubPlan 1
    ->  Index Scan using books_book_author_id_8b91747b on books_book u0  (cost=0.28..8.29 rows=1 width=63) (actual time=0.002..0.002 rows=0 loops=999996)
          Index Cond: (author_id = books_person.id)
  SubPlan 2
    ->  Limit  (cost=0.72..518.64 rows=1 width=16) (actual time=0.017..0.017 rows=0 loops=999996)
          ->  GroupAggregate  (cost=0.72..518.64 rows=1 width=16) (actual time=0.017..0.017 rows=0 loops=999996)
                Group Key: u0_1.author_id
                ->  Nested Loop Left Join  (cost=0.72..468.35 rows=10056 width=16) (actual time=0.003..0.013 rows=20 loops=999996)
                      ->  Index Scan using books_book_author_id_8b91747b on books_book u0_1  (cost=0.28..8.29 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=999996)
                            Index Cond: (author
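
Beyond speed, the two approaches above may not return the same titles. When several aggregates share one chain of joins, each book row is repeated once per reader, so an array aggregate over titles can contain duplicates (Django's aggregation documentation warns about combining multiple aggregations in `annotate()`). The sketch below illustrates this with the built-in `sqlite3` module and `group_concat` as stand-ins for PostgreSQL and `ARRAY_AGG` (assumptions for portability; the schema is a simplified, illustrative version of the playground's).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    CREATE TABLE book_readers (book_id INTEGER, person_id INTEGER);
    INSERT INTO person VALUES (1, 'Writer'), (2, 'Reader A'), (3, 'Reader B');
    INSERT INTO book VALUES (1, 'A', 1), (2, 'B', 1);
    INSERT INTO book_readers VALUES (1, 2), (1, 3), (2, 2);
    """
)

# Aggregation over joins: book 'A' has two readers, so its row appears twice.
join_titles, join_readers = conn.execute(
    """
    SELECT group_concat(book.title), count(book_readers.person_id)
    FROM person
    LEFT JOIN book ON book.author_id = person.id
    LEFT JOIN book_readers ON book_readers.book_id = book.id
    WHERE person.id = 1
    GROUP BY person.id
    """
).fetchone()

# Correlated subqueries: each aggregate sees only the rows it needs.
sub_titles, sub_readers = conn.execute(
    """
    SELECT
        (SELECT group_concat(title) FROM book WHERE author_id = person.id),
        (SELECT count(*) FROM book
         JOIN book_readers ON book_readers.book_id = book.id
         WHERE book.author_id = person.id)
    FROM person
    WHERE person.id = 1
    """
).fetchone()

print(sorted(join_titles.split(",")), join_readers)  # ['A', 'A', 'B'] 3
print(sorted(sub_titles.split(",")), sub_readers)    # ['A', 'B'] 3
```

If the join-based form is kept in Django, passing `distinct=True` to `ArrayAgg` is one way to remove the repeated titles, at the cost of extra work in the database.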

## Save Your RAM
We first need a small tool to measure the RAM consumed by our code.

In [26]:
from threading import Thread
from time import sleep

import psutil


def measure_ram_consumption(function_to_audit):
    """Output the RAM consumption of the function passed as parameter"""
    initial_available_memory = psutil.virtual_memory().available
    min_available_memory = initial_available_memory
    is_running = True

    class RamUsageThread(Thread):
        def run(self) -> None:
            nonlocal min_available_memory
            while is_running:
                min_available_memory = min(
                    psutil.virtual_memory().available, min_available_memory
                )
                sleep(0.1)
            return min_available_memory

    ram_thread = RamUsageThread()
    ram_thread.start()
    function_to_audit()
    is_running = False
    ram_thread.join()  # wait for the sampling thread to finish before reading its result

    print(
        "RAM consumption:",
        (initial_available_memory - min_available_memory) / 2**20,
        "MB",
    )

### Iterate using the Queryset

In [27]:
def iter_over_persons():
    for person in Person.objects.all():
        pass


measure_ram_consumption(iter_over_persons)

RAM consumption: 1604.07421875 MB


### Iterate using an iterator

In [28]:
def iter_over_persons_with_iterator():
    for person in Person.objects.iterator():
        pass


measure_ram_consumption(iter_over_persons_with_iterator)

RAM consumption: 75.34765625 MB
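
The gap comes from caching: iterating a QuerySet keeps every instance in its result cache, whereas `iterator()` hands rows over in chunks and does not cache them. A pure-Python sketch of the same effect follows, using the standard `tracemalloc` module; `make_row` is a hypothetical stand-in for a model instance, and the list versus generator pair only mimics the two Django behaviours.

```python
import tracemalloc


def make_row(i):
    # Stand-in for a model instance carrying a large text field.
    return {"id": i, "bio": f"biography {i} " * 20}


def peak_bytes(func):
    """Run func and return the peak memory traced while it ran."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def iterate_cached(n=20_000):
    rows = [make_row(i) for i in range(n)]  # all rows are kept, like a QuerySet cache
    for row in rows:
        pass


def iterate_streamed(n=20_000):
    for row in (make_row(i) for i in range(n)):  # each row is dropped after use
        pass


cached_peak = peak_bytes(iterate_cached)
streamed_peak = peak_bytes(iterate_streamed)
print(f"cached:   {cached_peak / 2**20:.2f} MB")
print(f"streamed: {streamed_peak / 2**20:.2f} MB")
```

Unlike the `psutil`-based helper above, `tracemalloc` only tracks allocations made by the Python interpreter, so it is less sensitive to other processes running on the machine.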
