
Find out the cause for long DRF response delays on the api/histogram_data_files/ endpoint #30

Closed
nothingface0 opened this issue Apr 11, 2022 · 3 comments
Labels
backend Task and features relating to the backend

Comments

Contributor

nothingface0 commented Apr 11, 2022

Description

When deployed, the ML Playground shows very long delays in rendering responses through the DRF API.

Initially we were getting lots of Broken pipe messages, caused by the proxy cutting off the connection because the server did not respond in time. After increasing the proxy timeout to 180s, we were able to proceed with debugging the issue.

Example 1

  1. Deploy the project on PaaS.
  2. Visit this endpoint
  3. Wait for it to load. It might take from 20 seconds up to 3 minutes to render (or even return a 504 error due to timeout).

Example 2

  1. Run the project locally, using the DBoD database (just like the deployed project does).
  2. Visit this endpoint
  3. The page is rendered in (at most) 2 seconds.

Things we've tried

  • Using different WSGI/ASGI servers (no difference, always the same delay):
    • daphne with asgi under settings.py
    • gunicorn with wsgi under settings.py
    • The plain development server (the one not recommended for deployment) via manage.py runserver
  • Running the project locally, using the DBoD database that the deployment uses. This works correctly, even with daphne: responses take at most 2 seconds, meaning the database is not the problem.
  • Checking pod CPU usage: it only shows small peaks (in the range of 0.03 to 0.06) while rendering the response, so it's not a matter of CPU speed.
  • Disabling DEBUG. No change.
  • Deploying with extra debug logging in settings.py, so that we can see what's going on in Django's mind while rendering the response. You can see below that the queries themselves are fast; the long wait (~2.5 min) is between the last SELECT query and the start of the HTTP 200 response.
10.76.10.1:34256 - - [11/Apr/2022:13:27:40] "GET /api/histogram_data_files/?format=json&page=7" 200 17015
DEBUG - 2022-04-11 13:27:40,170 - http_protocol - HTTP response complete for ['10.76.10.1', 34256]
DEBUG - 2022-04-11 13:34:22,979 - http_protocol - HTTP b'GET' request for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:34:22,990 - selector_events - Using selector: EpollSelector
DEBUG - 2022-04-11 13:34:23,118 - utils - (0.009) SELECT "django_session"."session_key", "django_session"."session_data", "django_session"."expire_date" FROM "django_session" WHERE ("django_session"."expire_date" > '2022-04-11T13:34:22.999654+00:00'::timestamptz AND "django_session"."session_key" = 'f3nubvfv7t7k6eltpk5e54ovzaijv89s') LIMIT 21; args=(datetime.datetime(2022, 4, 11, 13, 34, 22, 999654, tzinfo=datetime.timezone.utc), 'f3nubvfv7t7k6eltpk5e54ovzaijv89s'); alias=default
DEBUG - 2022-04-11 13:34:23,127 - utils - (0.005) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."id" = 1 LIMIT 21; args=(1,); alias=default
DEBUG - 2022-04-11 13:34:23,136 - utils - (0.006) SELECT COUNT(*) AS "__count" FROM "histogram_file_manager_histogramdatafile"; args=(); alias=default
DEBUG - 2022-04-11 13:34:23,138 - utils - (0.001) SELECT "histogram_file_manager_histogramdatafile"."id", "histogram_file_manager_histogramdatafile"."filepath", "histogram_file_manager_histogramdatafile"."filesize", "histogram_file_manager_histogramdatafile"."data_dimensionality", "histogram_file_manager_histogramdatafile"."data_era", "histogram_file_manager_histogramdatafile"."entries_total", "histogram_file_manager_histogramdatafile"."entries_processed", "histogram_file_manager_histogramdatafile"."granularity", "histogram_file_manager_histogramdatafile"."created", "histogram_file_manager_histogramdatafile"."modified" FROM "histogram_file_manager_histogramdatafile" LIMIT 50 OFFSET 300; args=(); alias=default
DEBUG - 2022-04-11 13:37:01,528 - http_protocol - HTTP 200 response started for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,530 - http_protocol - HTTP close for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,531 - http_protocol - HTTP response complete for ['10.76.10.1', 34540]
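
For reference, a minimal sketch of the kind of extra logging used here (assuming the standard LOGGING dict in settings.py; the project's actual config may differ). The django.db.backends logger at DEBUG level is what emits the timed SELECT lines above:

# Sketch only: enable SQL query logging in settings.py.
# 'django.db.backends' at DEBUG level prints each query with its duration,
# producing the "(0.009) SELECT ..." lines in the log above.
# Note: Django only logs queries this way when DEBUG=True.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}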

Other notes

Even after a 504 error, the immediately following reload of the same page loads instantly (thanks to the 1-min cache of responses): the response was already cached by the first request, even though it was never returned to the client.
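
For context, a minimal sketch of the kind of per-view response caching described above (hypothetical; the project may configure its 1-min cache differently):

# Sketch only: a 1-minute response cache on a DRF list view, which would
# explain why the retried request is served instantly even after a 504.
from django.utils.decorators import method_decorator
from django.views.decorators.cache import cache_page
from rest_framework import viewsets


class HistogramDataFileViewSet(viewsets.ModelViewSet):
    @method_decorator(cache_page(60))  # cache the rendered response for 60s
    def list(self, request, *args, **kwargs):
        return super().list(request, *args, **kwargs)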

Summary

We need to 👽probe deeper👽 into what's going on inside Django between the end of the DB query and the start of the HTTP response.

nothingface0 added the deployment label Apr 11, 2022
Contributor Author

nothingface0 commented Apr 11, 2022

@XavierAtCERN mentioned django-debug-toolbar.

This provides some insight into the CPU time each page render takes.
Profiling for local run and deployment is shown below.

Locally

[3 screenshots: django-debug-toolbar profiling panels, local run]

On deployment

[3 screenshots: django-debug-toolbar profiling panels, deployed instance]

Contributor Author

nothingface0 commented Apr 11, 2022

Things to check

  • IPv4-only configuration -- How??

  • @xicoa1: Create our own image, then use it on OpenShift:
# Make sure you have the base image
# (with docker, use the full path: registry.access.redhat.com/ubi8/python-38)
podman pull ubi8/python-38
# Run S2I to build the image locally
s2i build <PATH_TO_SOURCE> ubi8/python-38 imagelocal

  • Other API endpoints: does the same happen with 2D Lumisections?
    No, which is strange. Could this be related specifically to HistogramDataFile? If yes:
  • Is it due to the FilePathField which points to /eos? -- Unlikely: no /eos access happens while querying the API, only database access.
  • Is it due to percentage_processed in serializers.py, which runs on each query (see the sketch below)? -- Probably not; this issue would then also appear when running on the local computer.
    This should narrow it down to the histogram_data_files API endpoint specifically.
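
For reference, a hypothetical reconstruction of such a computed field (the actual serializers.py may differ); the entries_processed/entries_total columns and the app name come from the SQL log above:

# Hypothetical sketch of a per-object computed field like percentage_processed.
from rest_framework import serializers

# Assumed import, based on the table name histogram_file_manager_histogramdatafile:
from histogram_file_manager.models import HistogramDataFile


class HistogramDataFileSerializer(serializers.ModelSerializer):
    percentage_processed = serializers.SerializerMethodField()

    def get_percentage_processed(self, obj):
        # Runs once per serialized object; pure arithmetic, so it is
        # unlikely to account for a multi-minute delay.
        if not obj.entries_total:
            return 0.0
        return 100.0 * obj.entries_processed / obj.entries_total

    class Meta:
        model = HistogramDataFile
        fields = "__all__"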

  • Use a debugger (locally) to see what's going on after the DB query is complete

  • Investigate DB lookup, serialization, dispatch, rendering
    By following this procedure, the results for the HistogramDataFileViewSet response (for one page of 50 entries) are shown below, followed by a sketch of the timing code:

Operation                  Execution time
Database lookup            0.0000s
Serialization              0.0000s
Django request/response    0.0025s
API view                   1.2326s
Response rendering         0.0005s
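
A sketch of how such a phase split can be measured (our own naming, not the project's; it relies on DRF's Response staying unrendered until .render() is called):

# Sketch only: mixin for a DRF APIView/ViewSet that times the view phases
# separately. After super().dispatch() the Response exists but is not yet
# rendered, so rendering can be timed on its own.
import time


class TimedDispatchMixin:
    def dispatch(self, request, *args, **kwargs):
        start = time.monotonic()
        response = super().dispatch(request, *args, **kwargs)
        print(f"API view: {time.monotonic() - start:.4f}s")

        start = time.monotonic()
        response.render()
        print(f"Response rendering: {time.monotonic() - start:.4f}s")
        return response

Consistent with the table: almost all of the time is spent inside the API view (dispatch), not in the DB lookup, serialization, or response rendering.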

  • Pagination: pagination is done within the get_context method of DRF's renderers.py. We are currently using a custom pagination method to add extra info (a sketch follows the results below).

After testing, the custom pagination.py file does not seem to affect performance at all.

Disabling pagination completely also has no impact. In fact, the response takes surprisingly little time to render 6000+ results (the same execution time as with 50 results):

Operation                  Execution time
Database lookup            0.0000s
Serialization              0.0000s
Django request/response    0.0272s
API view                   1.3690s
Response rendering         0.0376s
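
For reference, a hypothetical sketch of a custom pagination class of the kind described (the project's pagination.py may differ):

# Hypothetical sketch of a pagination class that adds extra metadata.
from rest_framework.pagination import PageNumberPagination
from rest_framework.response import Response


class ExtraInfoPageNumberPagination(PageNumberPagination):
    page_size = 50

    def get_paginated_response(self, data):
        return Response({
            "count": self.page.paginator.count,
            "num_pages": self.page.paginator.num_pages,  # the "extra info"
            "next": self.get_next_link(),
            "previous": self.get_previous_link(),
            "results": data,
        })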

  • Authentication: someone reports that authentication seems to be the bottleneck for the dispatch method.

Disabling API authentication completely (sketched below) has some, but not substantial, effect on performance.
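
Roughly what "disabling API authentication completely" means here (a sketch; the real ViewSet has more configuration):

# Sketch only: skip authentication/permission checks for the test.
from rest_framework import viewsets
from rest_framework.permissions import AllowAny


class HistogramDataFileViewSet(viewsets.ModelViewSet):
    authentication_classes = []      # no authentication at all
    permission_classes = [AllowAny]  # every request is allowed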


  • Allowing only some HTTP methods on the ViewSet

Restricting the HistogramDataFileViewSet to mixins.ListModelMixin and mixins.RetrieveModelMixin (sketched below) does not affect performance.
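
A sketch of that restriction (the queryset and serializer names are assumptions; only GET list/retrieve endpoints remain exposed):

# Sketch only: a read-only ViewSet built from the two mixins.
from rest_framework import mixins, viewsets

# Assumed imports, based on the app/table names seen in the log:
from histogram_file_manager.models import HistogramDataFile
from histogram_file_manager.serializers import HistogramDataFileSerializer


class HistogramDataFileViewSet(mixins.ListModelMixin,
                               mixins.RetrieveModelMixin,
                               viewsets.GenericViewSet):
    queryset = HistogramDataFile.objects.all()
    serializer_class = HistogramDataFileSerializer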


  • Dig deeper into DRF:
    • Downloaded the DRF source from GitHub and saved it inside the project
    • Added -e ./django-rest-framework to requirements.txt
    • Ran pip install -r requirements.txt
    • Counted time in views.py -> dispatch

nothingface0 added the backend label Apr 12, 2022
Contributor Author

nothingface0 commented Apr 12, 2022

Solution

A FilePathField was used for the HistogramDataFile model. Given a directory, this field iterates over all the files in it to populate its list of possible values. I believe this is part of its validation, which DRF runs every time it serializes the data.

As this path was pointing to EOS, Django took somewhere between 20s and 3 minutes (depending on EOS load, I guess) for every API query while populating the list of all available files in the specified directory. This explains why, for a single results page (50 results, serialized with the ListModelMixin), the first serialization took so long (it fetched the list of files), while subsequent serializations were fast (they just checked that the stored filepath existed in the list of files).

Making this field a CharField fixed the problem immediately (see the sketch below).
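
Roughly, the change (a sketch; the path and field options are placeholders, not the project's actual values):

# Sketch only: in the HistogramDataFile model.
from django.db import models


class HistogramDataFile(models.Model):
    # Before (problematic): serializing this field lists every file under
    # `path` to build the set of valid choices -- minutes on a large EOS dir.
    # filepath = models.FilePathField(path="/eos/<some_dir>", max_length=255)

    # After: a plain CharField stores the path without scanning any directory.
    filepath = models.CharField(max_length=255)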

💀

Moral of the story

Don't use a FilePathField for EOS directories with lots of files if you're going to serialize the model with DRF.

nothingface0 removed the deployment label Apr 12, 2022
nothingface0 changed the title from "[Deployment] Find out the cause for long DRF response delays" to "Find out the cause for long DRF response delays on the api/histogram_data_files/ endpoint" Apr 12, 2022