
Find out the cause for long DRF response delays on the api/histogram_data_files/ endpoint #30

Closed
nothingface0 opened this issue Apr 11, 2022 · 3 comments
Labels
backend Task and features relating to the backend

Comments

Contributor

nothingface0 commented Apr 11, 2022

Description

When deployed, the ML Playground shows very long delays in rendering responses through the DRF API.

Initially we were getting lots of Broken pipe messages, caused by the proxy cutting off the connection because the server did not respond in time. After increasing the proxy timeout to 180s, we were able to proceed with debugging the issue.

Example 1

  1. Deploy the project on PaaS.
  2. Visit this endpoint
  3. Wait for it to load. It might take from 20 seconds up to 3 minutes to render (or even return a 504 error due to timeout).

Example 2

  1. Run the project locally, using the DBoD database (just like the deployed project does).
  2. Visit this endpoint
  3. The page is rendered in (at most) 2 seconds.

Things we've tried

  • Using different WSGI/ASGI servers (no difference, always the same delay):
    • daphne with asgi under settings.py
    • gunicorn with wsgi under settings.py
    • The plain development server (the one not recommended for deployment) via manage.py runserver
  • Running the project locally, using the DBoD database that the deployment uses. This works correctly, even with daphne: responses take at most 2 seconds, meaning the database is not the problem.
  • Checking pod CPU usage: it only shows small peaks (in the range of 0.03 to 0.06) while rendering the response, so it's not a matter of CPU speed.
  • Disabling DEBUG. No change.
  • Deploying with extra debug logging in settings.py, so that we can see what's going on in Django's mind while rendering the response. You can see below that the queries themselves are fast; the long wait (~2.5 min) is between the last SELECT query and the start of the HTTP 200 response.
10.76.10.1:34256 - - [11/Apr/2022:13:27:40] "GET /api/histogram_data_files/?format=json&page=7" 200 17015
DEBUG - 2022-04-11 13:27:40,170 - http_protocol - HTTP response complete for ['10.76.10.1', 34256]
DEBUG - 2022-04-11 13:34:22,979 - http_protocol - HTTP b'GET' request for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:34:22,990 - selector_events - Using selector: EpollSelector
DEBUG - 2022-04-11 13:34:23,118 - utils - (0.009) SELECT "django_session"."session_key", "django_session"."session_data", "django_session"."expire_date" FROM "django_session" WHERE ("django_session"."expire_date" > '2022-04-11T13:34:22.999654+00:00'::timestamptz AND "django_session"."session_key" = 'f3nubvfv7t7k6eltpk5e54ovzaijv89s') LIMIT 21; args=(datetime.datetime(2022, 4, 11, 13, 34, 22, 999654, tzinfo=datetime.timezone.utc), 'f3nubvfv7t7k6eltpk5e54ovzaijv89s'); alias=default
DEBUG - 2022-04-11 13:34:23,127 - utils - (0.005) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" WHERE "auth_user"."id" = 1 LIMIT 21; args=(1,); alias=default
DEBUG - 2022-04-11 13:34:23,136 - utils - (0.006) SELECT COUNT(*) AS "__count" FROM "histogram_file_manager_histogramdatafile"; args=(); alias=default
DEBUG - 2022-04-11 13:34:23,138 - utils - (0.001) SELECT "histogram_file_manager_histogramdatafile"."id", "histogram_file_manager_histogramdatafile"."filepath", "histogram_file_manager_histogramdatafile"."filesize", "histogram_file_manager_histogramdatafile"."data_dimensionality", "histogram_file_manager_histogramdatafile"."data_era", "histogram_file_manager_histogramdatafile"."entries_total", "histogram_file_manager_histogramdatafile"."entries_processed", "histogram_file_manager_histogramdatafile"."granularity", "histogram_file_manager_histogramdatafile"."created", "histogram_file_manager_histogramdatafile"."modified" FROM "histogram_file_manager_histogramdatafile" LIMIT 50 OFFSET 300; args=(); alias=default
DEBUG - 2022-04-11 13:37:01,528 - http_protocol - HTTP 200 response started for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,530 - http_protocol - HTTP close for ['10.76.10.1', 34540]
DEBUG - 2022-04-11 13:37:01,531 - http_protocol - HTTP response complete for ['10.76.10.1', 34540]
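
For reference, a minimal sketch of the kind of extra logging used here (assuming the standard LOGGING dict in settings.py; the project's actual config may differ). The django.db.backends logger at DEBUG level is what emits the timed SELECT lines above:

# Sketch only: enable SQL query logging in settings.py.
# 'django.db.backends' at DEBUG level prints each query with its duration,
# producing the "(0.009) SELECT ..." lines in the log above.
# Note: Django only logs queries this way when DEBUG=True.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.db.backends": {
            "handlers": ["console"],
            "level": "DEBUG",
        },
    },
}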

Other notes

Even after a 504 error, the immediately following reload of the same page loads instantly (thanks to the 1-min cache of responses): the response was already cached by the first request, even though it was never returned to the client.
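
For context, a minimal sketch of the kind of per-view response caching described above (hypothetical; the project may configure its 1-min cache differently):

# Sketch only: a 1-minute response cache on a DRF list view, which would
# explain why the retried request is served instantly even after a 504.
from django.utils.decorators import method_decorator
from django.views.decorators.cache import cache_page
from rest_framework import viewsets


class HistogramDataFileViewSet(viewsets.ModelViewSet):
    @method_decorator(cache_page(60))  # cache the rendered response for 60s
    def list(self, request, *args, **kwargs):
        return super().list(request, *args, **kwargs)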

Summary

We need to 👽probe deeper👽 into what's going on inside Django between the end of the DB query and the start of the HTTP response.

nothingface0 added the deployment label Apr 11, 2022
Contributor Author

nothingface0 commented Apr 11, 2022

@XavierAtCERN mentioned django-debug-toolbar.

This provides some insight into the CPU time each page render takes.
Profiling for local run and deployment is shown below.

Locally

[3 screenshots: django-debug-toolbar profiling panels, local run]

On deployment

[3 screenshots: django-debug-toolbar profiling panels, deployed instance]

Contributor Author

nothingface0 commented Apr 11, 2022

Things to check

  • IPv4-only configuration -- How??

  • @xicoa1: Create our own image, then use it on OpenShift:
# Make sure you have the base image
# (with docker, use the full path: registry.access.redhat.com/ubi8/python-38)
podman pull ubi8/python-38
# Run S2I to build the image locally
s2i build <PATH_TO_SOURCE> ubi8/python-38 imagelocal

  • Other API endpoints: does the same happen with 2D Lumisections?
    No, which is strange. Could this be related specifically to HistogramDataFile? If yes:
  • Is it due to the FilePathField which points to /eos? -- Unlikely: no /eos access happens while querying the API, only database access.
  • Is it due to percentage_processed in serializers.py, which runs on each query (see the sketch below)? -- Probably not; this issue would then also appear when running on the local computer.
    This should narrow it down to the histogram_data_files API endpoint specifically.
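
For reference, a hypothetical reconstruction of such a computed field (the actual serializers.py may differ); the entries_processed/entries_total columns and the app name come from the SQL log above:

# Hypothetical sketch of a per-object computed field like percentage_processed.
from rest_framework import serializers

# Assumed import, based on the table name histogram_file_manager_histogramdatafile:
from histogram_file_manager.models import HistogramDataFile


class HistogramDataFileSerializer(serializers.ModelSerializer):
    percentage_processed = serializers.SerializerMethodField()

    def get_percentage_processed(self, obj):
        # Runs once per serialized object; pure arithmetic, so it is
        # unlikely to account for a multi-minute delay.
        if not obj.entries_total:
            return 0.0
        return 100.0 * obj.entries_processed / obj.entries_total

    class Meta:
        model = HistogramDataFile
        fields = "__all__"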

  • Use a debugger (locally) to see what's going on after the DB query is complete

  • Investigate DB lookup, serialization, dispatch, rendering
    By following this procedure, the results for the HistogramDataFileViewSet response (for one page of 50 entries) are shown below, followed by a sketch of the timing code:

Operation                  Execution time
Database lookup            0.0000s
Serialization              0.0000s
Django request/response    0.0025s
API view                   1.2326s
Response rendering         0.0005s
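
A sketch of how such a phase split can be measured (our own naming, not the project's; it relies on DRF's Response staying unrendered until .render() is called):

# Sketch only: mixin for a DRF APIView/ViewSet that times the view phases
# separately. After super().dispatch() the Response exists but is not yet
# rendered, so rendering can be timed on its own.
import time


class TimedDispatchMixin:
    def dispatch(self, request, *args, **kwargs):
        start = time.monotonic()
        response = super().dispatch(request, *args, **kwargs)
        print(f"API view: {time.monotonic() - start:.4f}s")

        start = time.monotonic()
        response.render()
        print(f"Response rendering: {time.monotonic() - start:.4f}s")
        return response

Consistent with the table: almost all of the time is spent inside the API view (dispatch), not in the DB lookup, serialization, or response rendering.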

  • Pagination: pagination is done within the get_context method of DRF's renderers.py. We are currently using a custom pagination method to add extra info (a sketch follows the results below).

After testing, the custom pagination.py file does not seem to affect performance at all.

Disabling pagination completely also has no impact. In fact, the response takes surprisingly little time to render 6000+ results (the same execution time as with 50 results):

Operation                  Execution time
Database lookup            0.0000s
Serialization              0.0000s
Django request/response    0.0272s
API view                   1.3690s
Response rendering         0.0376s
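
For reference, a hypothetical sketch of a custom pagination class of the kind described (the project's pagination.py may differ):

# Hypothetical sketch of a pagination class that adds extra metadata.
from rest_framework.pagination import PageNumberPagination
from rest_framework.response import Response


class ExtraInfoPageNumberPagination(PageNumberPagination):
    page_size = 50

    def get_paginated_response(self, data):
        return Response({
            "count": self.page.paginator.count,
            "num_pages": self.page.paginator.num_pages,  # the "extra info"
            "next": self.get_next_link(),
            "previous": self.get_previous_link(),
            "results": data,
        })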

  • Authentication: someone reports that authentication seems to be the bottleneck for the dispatch method.

Disabling API authentication completely (sketched below) has some, but not substantial, effect on performance.
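
Roughly what "disabling API authentication completely" means here (a sketch; the real ViewSet has more configuration):

# Sketch only: skip authentication/permission checks for the test.
from rest_framework import viewsets
from rest_framework.permissions import AllowAny


class HistogramDataFileViewSet(viewsets.ModelViewSet):
    authentication_classes = []      # no authentication at all
    permission_classes = [AllowAny]  # every request is allowed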


  • Allowing only some HTTP methods on the ViewSet

Restricting the HistogramDataFileViewSet to mixins.ListModelMixin and mixins.RetrieveModelMixin (sketched below) does not affect performance.
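
A sketch of that restriction (the queryset and serializer names are assumptions; only GET list/retrieve endpoints remain exposed):

# Sketch only: a read-only ViewSet built from the two mixins.
from rest_framework import mixins, viewsets

# Assumed imports, based on the app/table names seen in the log:
from histogram_file_manager.models import HistogramDataFile
from histogram_file_manager.serializers import HistogramDataFileSerializer


class HistogramDataFileViewSet(mixins.ListModelMixin,
                               mixins.RetrieveModelMixin,
                               viewsets.GenericViewSet):
    queryset = HistogramDataFile.objects.all()
    serializer_class = HistogramDataFileSerializer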


  • Dig deeper into DRF:
    • Downloaded the DRF source from GitHub and saved it inside the project
    • Added -e ./django-rest-framework to requirements.txt
    • Ran pip install -r requirements.txt
    • Counted time in views.py -> dispatch

nothingface0 added the backend label Apr 12, 2022
Contributor Author

nothingface0 commented Apr 12, 2022

Solution

A FilePathField was used for the HistogramDataFile model. Given a directory, this field iterates over all the files in it to populate its list of possible values. I believe this is part of its validation, which DRF runs every time it serializes the data.

As this path was pointing to EOS, Django took somewhere between 20s and 3 minutes (depending on EOS load, I guess) for every API query while populating the list of all available files in the specified directory. This explains why, for a single results page (50 results, serialized with the ListModelMixin), the first serialization took so long (it fetched the list of files), while subsequent serializations were fast (they just checked that the stored filepath existed in the list of files).

Making this field a CharField fixed the problem immediately (see the sketch below).
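
Roughly, the change (a sketch; the path and field options are placeholders, not the project's actual values):

# Sketch only: in the HistogramDataFile model.
from django.db import models


class HistogramDataFile(models.Model):
    # Before (problematic): serializing this field lists every file under
    # `path` to build the set of valid choices -- minutes on a large EOS dir.
    # filepath = models.FilePathField(path="/eos/<some_dir>", max_length=255)

    # After: a plain CharField stores the path without scanning any directory.
    filepath = models.CharField(max_length=255)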

💀

Moral of the story

Don't use a FilePathField for EOS directories with lots of files if you're going to serialize the model with DRF.

nothingface0 removed the deployment label Apr 12, 2022
nothingface0 changed the title from "[Deployment] Find out the cause for long DRF response delays" to "Find out the cause for long DRF response delays on the api/histogram_data_files/ endpoint" Apr 12, 2022