# 10 tips to Optimize your Django queries with PostgreSQL
Welcome to this playground. It follows our [10 tips to Optimize your Django queries with PostgreSQL](https://www.gitguardian.com) blog article and lets you test all the tips yourself and experiment with your own optimization ideas.


## Setup your project

### Imports and Django setup
You must run this cell each time you restart the kernel:

In [2]:
import time

# setup django
import django_init
from django.contrib.postgres.aggregates import ArrayAgg
from django.core.management import call_command
from django.db import connection, reset_queries
from django.db.models import Prefetch

from books.models import Person, Book


### Migrate your database
The following cell migrates your database. You only need to run it if you change your Django models.

In [None]:
# Create missing migrations
call_command("makemigrations", interactive=True)
# Run migrations
call_command("migrate", interactive=True)

### Populate your database
The following cells populate the database with a large amount of fake data. If `autovacuum` is not enabled on your database instance, you also need to refresh your table statistics so that the PostgreSQL query planner can make the right decisions.

In [None]:
call_command("generate_data")

Make sure the statistics are up to date on all the tables used in this notebook:

In [None]:
with connection.cursor() as cursor:
    cursor.execute("VACUUM ANALYSE books_book")
    cursor.execute("VACUUM ANALYSE books_person")
    cursor.execute("VACUUM ANALYSE books_book_readers")

In [3]:
Person.objects.count()

999996

In [None]:
Book.objects.count()

## A Good Method To Iterate fast
Django provides a convenient way to display the SQL queries it executes and to explain how the PostgreSQL query planner resolves them.

In [None]:
reset_queries()

query_set = Person.objects.only("id")
person = query_set.first()

print("SQL Query: ", query_set[:10].query)
print("PostgreSQL query: ", connection.queries[0])  # needs DEBUG=True
print("PostgreSQL explain analyze:", query_set[:10].explain(ANALYZE=True))

## Select Only What You Need
You can significantly improve performance by reducing the amount of data sent to and from the database.

### Fetching using a large query
The following query is huge because it contains 100,000 email addresses. Even if the execution time on the database side is short, the total time (including Django processing and networking) is very long.

In [None]:
all_persons_qs = Person.objects.all()

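# list() forces evaluation so the 100,000 addresses are inlined in the next query;
# a lazy QuerySet passed to __in would be sent to PostgreSQL as a subquery instead
lots_emails = list(all_persons_qs.values_list("email", flat=True)[:100_000])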
print(lots_emails[:10])

big_qs = Person.objects.filter(email__in=lots_emails)

reset_queries()
start_time = time.perf_counter()

all_persons = big_qs.all()

print("PostgreSQL query: ", str(big_qs.query)[:200])
print("PostgreSQL explain analyze: ", big_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

### Fetching the whole model
In this example, we fetch all fields of the Person model, including `bio` (text).

In [None]:
reset_queries()
start_time = time.perf_counter()

all_persons = all_persons_qs.all()

print("PostgreSQL query: ", all_persons_qs.query)
print("PostgreSQL explain analyze: ", all_persons_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

### Fetching only the id
Fetching only the `id` reduces the amount of data transferred and speeds up execution.

In [None]:
all_persons_qs = all_persons_qs.only("id")

reset_queries()
start_time = time.perf_counter()

all_persons = all_persons_qs.all()

print("PostgreSQL query: ", all_persons_qs.query)
print("PostgreSQL explain analyze: ", all_persons_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

But if you only need a list of ids, you can save a lot of time by using `values()` or `values_list()`, which bypass full model instantiation.

In [None]:
all_persons_qs = all_persons_qs.only("id").values_list("id")

reset_queries()
start_time = time.perf_counter()

all_persons = all_persons_qs.all()

print("PostgreSQL query: ", all_persons_qs.query)
print("PostgreSQL explain analyze: ", all_persons_qs[:10].explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")
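To see the effect of column selection without the playground database, here is a minimal sketch that uses SQLite from Python's standard library as a stand-in for PostgreSQL. The `person` table, its contents, and the `payload_size` helper are illustrative assumptions and not part of the playground.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT, bio TEXT)"
)
conn.executemany(
    "INSERT INTO person (name, email, bio) VALUES (?, ?, ?)",
    [(f"name{i}", f"user{i}@example.com", "x" * 2000) for i in range(1000)],
)


def payload_size(rows):
    """Rough size of a result set: total length of every value rendered as text."""
    return sum(len(str(value)) for row in rows for value in row)


# Same rows, but the second query only asks for the column we actually need.
full_rows = conn.execute("SELECT id, name, email, bio FROM person").fetchall()
id_rows = conn.execute("SELECT id FROM person").fetchall()

full_size = payload_size(full_rows)
id_size = payload_size(id_rows)
print(f"all columns: {full_size} characters, id only: {id_size} characters")
```

The large `bio` column dominates the payload, which is why leaving it out matters most for wide text fields.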

## Index what you're searching for
Let's search for authors by name.

### Search without index
Without any index, the query has to scan the whole table to find the right value.

In [None]:
with connection.cursor() as cursor:
    cursor.execute("DROP INDEX IF EXISTS books_person_name_upper_idx")
    cursor.execute("DROP INDEX IF EXISTS books_person_name_idx")

tolstoy_qs = Person.objects.filter(name__iexact="tolstoy").only("email")

reset_queries()
start_time = time.perf_counter()

tolstoy = tolstoy_qs.all()

print("PostgreSQL query: ", tolstoy_qs.query)
print("PostgreSQL explain analyze: ", tolstoy_qs.explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

### Search with a regular index
The following code creates a regular index, just as Django would if we added `db_index=True` to the `name` field. But a B-Tree index on `name` cannot serve a case-insensitive lookup, which compares `UPPER(name)`, so the planner still has to perform a full scan of the table.

In [None]:
with connection.cursor() as cursor:
    cursor.execute(
        "CREATE INDEX IF NOT EXISTS books_person_name_idx ON books_person (name);"
    )

# CREATE INDEX blocks until the index is built, so no extra wait is needed

tolstoy_qs = Person.objects.filter(name__iexact="tolstoy").only("email")

reset_queries()
start_time = time.perf_counter()

tolstoy = tolstoy_qs.all()

print("PostgreSQL query: ", tolstoy_qs.query)
print("PostgreSQL explain analyze: ", tolstoy_qs.explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

### Search with case insensitive index
The previous attempt did not help, so let's try a case-insensitive index:

In [None]:
with connection.cursor() as cursor:
    cursor.execute(
        "CREATE INDEX IF NOT EXISTS books_person_name_upper_idx ON books_person (UPPER(name));"
    )

# CREATE INDEX blocks until the index is built, so no extra wait is needed

tolstoy_qs = Person.objects.filter(name__iexact="tolstoy").only("email")

reset_queries()
start_time = time.perf_counter()

tolstoy = tolstoy_qs.all()

print("PostgreSQL query: ", tolstoy_qs.query)
print("PostgreSQL explain analyze: ", tolstoy_qs.explain(ANALYZE=True))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")
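The difference between the two indexes can be reproduced outside the playground. The sketch below uses SQLite from the standard library as a stand-in for PostgreSQL (an assumption: the planners differ, but both support indexes on expressions). The table, the data, and the `query_plan` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO person (name, email) VALUES (?, ?)",
    [(f"Author{i}", f"author{i}@example.com") for i in range(5000)]
    + [("Tolstoy", "leo@example.com")],
)

# Case-insensitive lookup written the way the expression index expects it.
QUERY = "SELECT email FROM person WHERE UPPER(name) = ?"


def query_plan(connection):
    """Return the planner's description of how QUERY would be executed."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("TOLSTOY",)).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


# A regular index on name does not match the UPPER(name) expression.
conn.execute("CREATE INDEX person_name_idx ON person (name)")
plan_with_regular_index = query_plan(conn)

# An index on the expression itself can be used for the lookup.
conn.execute("CREATE INDEX person_name_upper_idx ON person (UPPER(name))")
plan_with_expression_index = query_plan(conn)

matches = conn.execute(QUERY, ("tolstoy".upper(),)).fetchall()

print("regular index:   ", plan_with_regular_index)
print("expression index:", plan_with_expression_index)
print("matches:", matches)
```

The first plan reports a table scan, while the second one searches `person_name_upper_idx`.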

## select_related and prefetch_related are not always the best match
We want to get the author of each book in a list of N books.

### Naive approach
With the naive method, we need N+1 queries to achieve that:

In [None]:
N = 10

reset_queries()

for book in Book.objects.all()[:N]:
    author = book.author

print(connection.queries)
print(f"{len(connection.queries)} queries have been executed")

### select_related()
Using `select_related()`, only 1 query is needed:

In [None]:
reset_queries()

for book in Book.objects.select_related("author")[:10]:
    author = book.author

print(connection.queries)
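As a rough illustration of what happens at the SQL level, the sketch below counts the statements issued by both approaches. It uses SQLite from the standard library as a stand-in for PostgreSQL, and the schema and the `run` helper are simplified assumptions and not the playground's actual models.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(i, f"author{i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO book (id, title, author_id) VALUES (?, ?, ?)",
    [(i, f"title{i}", i) for i in range(1, 11)],
)

executed = []  # every SQL statement sent through run()


def run(sql, params=()):
    """Execute a statement, record it, and return all rows."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


N = 10

# Naive approach: one query for the books, then one query per book for its author.
executed.clear()
naive = []
books = run("SELECT id, title, author_id FROM book ORDER BY id LIMIT ?", (N,))
for _book_id, title, author_id in books:
    (author_name,) = run("SELECT name FROM person WHERE id = ?", (author_id,))[0]
    naive.append((title, author_name))
naive_query_count = len(executed)

# Join approach (what select_related does): a single query with a JOIN.
executed.clear()
joined = run(
    "SELECT book.title, person.name FROM book "
    "JOIN person ON person.id = book.author_id ORDER BY book.id LIMIT ?",
    (N,),
)
joined_query_count = len(executed)

print(f"naive: {naive_query_count} queries, join: {joined_query_count} query")
```

Both approaches return the same pairs, but the naive one issues N+1 statements against one for the join.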

### Using prefetch_related for "* to many" relations 
For one-to-many or many-to-many relations, `prefetch_related()` comes in handy:

In [None]:
reset_queries()

for person in Person.objects.prefetch_related("writings")[:10]:
    writings = person.writings

print(connection.queries)

But it can generate huge queries that may take a long time to execute:

In [None]:
reset_queries()
start_time = time.perf_counter()

result = {}
for person in Person.objects.prefetch_related("writings")[:100_000]:
    result[person.email] = [book.title for book in person.writings.all()]

print("Query duration:", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")
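The size of these queries comes from the way `prefetch_related()` works: it runs one extra query per relation, filtered with an `IN` clause on the ids from the first query, and joins the results in Python. A minimal sketch of that pattern, using SQLite from the standard library as a stand-in for PostgreSQL and a simplified, hypothetical schema:

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO person (id, email) VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1, 51)],
)
# Two books per person.
conn.executemany(
    "INSERT INTO book (title, author_id) VALUES (?, ?)",
    [(f"book{n}-of-{i}", i) for i in range(1, 51) for n in (1, 2)],
)

# Query 1: the parent rows.
persons = conn.execute("SELECT id, email FROM person ORDER BY id").fetchall()
person_ids = [person_id for person_id, _email in persons]

# Query 2: all related rows at once, with one placeholder per parent id.
placeholders = ", ".join("?" for _ in person_ids)
books_sql = (
    f"SELECT author_id, title FROM book WHERE author_id IN ({placeholders}) ORDER BY id"
)

# "Join" in Python: group the book titles by author id.
titles_by_author = defaultdict(list)
for author_id, title in conn.execute(books_sql, person_ids):
    titles_by_author[author_id].append(title)

result = {email: titles_by_author[person_id] for person_id, email in persons}
print(f"{len(person_ids)} ids in the IN clause, {len(result)} persons in the result")
```

With 100,000 persons, the `IN` clause carries 100,000 ids, which explains both the query size and the time spent assembling the results in Python.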

### Using to_attr to speed up prefetch_related
As stated in Django's [prefetch_related documentation](https://docs.djangoproject.com/en/4.1/ref/models/querysets/#prefetch-related), you can use `to_attr` to store the cached results in a list. It doesn't help much with query duration, but the total time is much better:

In [None]:
reset_queries()
start_time = time.perf_counter()

result = {
    person.email: [book.title for book in person.prefetched_writings]
    for person in Person.objects.prefetch_related(
        Prefetch("writings", to_attr="prefetched_writings")
    )[:100_000]
}

print("Query duration", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

### Using aggregation
Another solution to get our book titles is to use aggregation. Again, we can see performance gains:

In [None]:
reset_queries()
start_time = time.perf_counter()

result = {
    person.email: person.writings_titles
    for person in Person.objects.annotate(writings_titles=ArrayAgg("writings__title"))[
        :100_000
    ]
}

print(connection.queries[-1]['sql'])
print("Query duration", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

If we don't need to instantiate models but just need some values, we can save even more time:

In [None]:
reset_queries()
start_time = time.perf_counter()

result = {
    person.email: person.writings_titles
    for person in Person.objects.annotate(
        writings_titles=ArrayAgg("writings__title")
    ).values_list("email", "writings_titles", named=True)[:100_000]
}

print(connection.queries)
print("Query duration", sum(float(query["time"]) for query in connection.queries))
print(f"Total time: { time.perf_counter() - start_time:.2f}s")

## Aggregations VS subqueries

### Get writers stats using aggregations
We want to get the list of books written by each author, and the total count of their readers.
With the Django ORM, this is usually achieved using the `annotate()` method:

In [None]:
from django.db.models import Count

writers_stats_qs = Person.objects.annotate(
    writings_title=ArrayAgg("writings__title"),
    readers_count=Count("writings__readers"),
).values_list("name", "bio", "writings_title", "readers_count")

reset_queries()

writers_stats = writers_stats_qs.all()

print(connection.queries)
print("query duration", sum(float(query["time"]) for query in connection.queries))
print("pg explain analyze:", writers_stats_qs.explain(ANALYZE=True))

### Get writers stats using subqueries
The following example uses two correlated subqueries instead of joined aggregates for the same purpose.

In [None]:
from django.db.models import Count, OuterRef
from django.contrib.postgres.expressions import ArraySubquery

writings_subquery = Book.objects.filter(author_id=OuterRef("id")).values("title")
readers_subquery = (
    Book.objects.filter(author_id=OuterRef("id"))
    .values("author_id")
    .values(count=Count("readers__id"))[:1]
)
writers_stats_qs = Person.objects.annotate(
    writings_title=ArraySubquery(writings_subquery), readers_count=readers_subquery
).values_list("name", "bio", "writings_title", "readers_count")

reset_queries()

writers_stats = writers_stats_qs.all()

print(connection.queries)
print("query duration", sum(float(query["time"]) for query in connection.queries))
print("pg explain analyze:", writers_stats_qs.explain(ANALYZE=True))
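One point worth checking when comparing the two approaches: joining through both the books and their readers yields one row per (book, reader) pair, so an aggregate over titles may see each title once per reader, while a correlated subquery does not. The sketch below reproduces the effect with SQLite from the standard library, using `GROUP_CONCAT` as a stand-in for PostgreSQL's `array_agg`. The schema and data are simplified assumptions and not the playground's actual tables.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    CREATE TABLE book_readers (book_id INTEGER, person_id INTEGER);
    INSERT INTO person (id, name) VALUES (1, 'author'), (2, 'reader a'), (3, 'reader b');
    INSERT INTO book (id, title, author_id) VALUES (1, 'first', 1), (2, 'second', 1);
    INSERT INTO book_readers (book_id, person_id) VALUES (1, 2), (1, 3), (2, 2);
    """
)

# Joined aggregates: 'first' has two readers, so its row appears twice in the join.
joined = conn.execute(
    """
    SELECT person.name, GROUP_CONCAT(book.title), COUNT(book_readers.person_id)
    FROM person
    LEFT JOIN book ON book.author_id = person.id
    LEFT JOIN book_readers ON book_readers.book_id = book.id
    WHERE person.id = 1
    GROUP BY person.id
    """
).fetchone()

# Correlated subqueries: each aggregate is computed on its own set of rows.
subqueries = conn.execute(
    """
    SELECT person.name,
           (SELECT GROUP_CONCAT(book.title) FROM book
             WHERE book.author_id = person.id),
           (SELECT COUNT(book_readers.person_id) FROM book
              JOIN book_readers ON book_readers.book_id = book.id
             WHERE book.author_id = person.id)
    FROM person
    WHERE person.id = 1
    """
).fetchone()

joined_titles = sorted(joined[1].split(","))
subquery_titles = sorted(subqueries[1].split(","))
print("joined aggregates:", joined_titles, "readers:", joined[2])
print("subqueries:       ", subquery_titles, "readers:", subqueries[2])
```

If duplicated titles show up in the `annotate()` version, `ArrayAgg` accepts a `distinct=True` argument; whether that suits your data (for example, two books sharing a title) is a judgment call.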

## Save Your RAM
We first need a small library, `psutil`, to measure the RAM consumed by our code:

In [None]:
from threading import Thread
from time import sleep

import psutil


def measure_ram_consumption(function_to_audit):
    """Output the RAM consumption of the function passed as parameter"""
    initial_available_memory = psutil.virtual_memory().available
    min_available_memory = initial_available_memory
    is_running = True

    class RamUsageThread(Thread):
        def run(self) -> None:
            # Sample the available system memory every 100 ms and keep the minimum
            nonlocal min_available_memory
            while is_running:
                min_available_memory = min(
                    psutil.virtual_memory().available, min_available_memory
                )
                sleep(0.1)

    ram_thread = RamUsageThread()
    ram_thread.start()
    try:
        function_to_audit()
    finally:
        # Stop the sampling thread and wait for its last sample before reporting
        is_running = False
        ram_thread.join()

    print(
        "RAM consumption:",
        (initial_available_memory - min_available_memory) / 2**20,
        "MB",
    )

### Iterate using the Queryset

In [None]:
def iter_over_persons():
    for person in Person.objects.all():
        pass


measure_ram_consumption(iter_over_persons)

### Iterate using an iterator

In [None]:
def iter_over_persons_with_iterator():
    for person in Person.objects.iterator():
        pass


measure_ram_consumption(iter_over_persons_with_iterator)
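The difference comes from caching: iterating over a QuerySet keeps every model instance in its result cache, while `iterator()` reads the results without caching them on the QuerySet. The sketch below mimics both behaviours with plain Python objects and the standard library's `tracemalloc`. The `make_rows` generator and both loop functions are hypothetical stand-ins and do not call Django.

```python
import tracemalloc


def make_rows(count):
    """Hypothetical row source: yields one dict per row, like a database cursor would."""
    for i in range(count):
        yield {"id": i, "email": f"user{i}@example.com"}


def iterate_cached(count=20_000):
    # Mimics plain QuerySet iteration: all rows are materialized before looping.
    cache = list(make_rows(count))
    for _row in cache:
        pass


def iterate_streaming(count=20_000):
    # Mimics iterator(): each row can be discarded as soon as it has been processed.
    for _row in make_rows(count):
        pass


def peak_bytes(function_to_audit):
    """Return the peak number of bytes allocated by Python while the function runs."""
    tracemalloc.start()
    function_to_audit()
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


cached_peak = peak_bytes(iterate_cached)
streaming_peak = peak_bytes(iterate_streaming)
print(f"cached: {cached_peak / 2**20:.2f} MB, streaming: {streaming_peak / 2**20:.2f} MB")
```

Unlike the `psutil` approach above, `tracemalloc` only tracks allocations made by the Python interpreter, so it is less sensitive to other processes running on the machine.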