What happened
While looking into `gap_analysis.py`, I found an infinite loop inside the `preload(target_url)` function that can permanently lock up our background processes.
The function relies on a `while len(waiting):` loop. It expects `calculate_a_to_b(sa, sb)` to eventually return `True`, which allows it to remove a standard pairing from the `waiting` array. However, if a pair fails continuously, it returns `False` forever: the `waiting` array never empties, and the function just sleeps for 30 seconds before endlessly retrying the same failing standards.
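A minimal sketch of the pattern, with stubbed-out names (the real `waiting` contents and `calculate_a_to_b` live in `gap_analysis.py`; everything else here is illustrative):

```python
import time

def calculate_a_to_b(sa: str, sb: str) -> bool:
    """Stub for the real HTTP call: a pairing that 404s (or whose RQ job
    keeps failing) returns False on every single attempt."""
    return (sa, sb) != ("standard-a", "standard-b")  # this pair always fails

def preload_buggy(pairs, delay=0.0, demo_cap=5):
    """Illustrative reconstruction of the loop described above. demo_cap
    exists only so this demo terminates; the real loop has no such cap.
    Returns how many sleep cycles were burned."""
    waiting = list(pairs)
    cycles = 0
    while len(waiting):
        for pair in list(waiting):
            if calculate_a_to_b(*pair):
                waiting.remove(pair)  # only successful pairs ever leave
        cycles += 1
        if cycles >= demo_cap:
            break                      # demo-only escape hatch
        time.sleep(delay)              # real code: time.sleep(30)
    return cycles
```

With a single permanently failing pair in `pairs`, `waiting` never empties, and the loop spins until the demo cap trips.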
How it breaks (The Proof)
This isn't just theoretical; the codebase guarantees an infinite loop under two specific scenarios:
- Environmental or Heroku 404s: If `CRE_NO_CALCULATE_GAP_ANALYSIS` is set (or if a standard is missing on Heroku), `/rest/v1/map_analysis` explicitly aborts with a `404`. Over in `preload`, `calculate_a_to_b` sees `status_code != 200`, immediately returns `False`, and goes back to sleep. It will query the exact same 404 endpoint 30 seconds later, indefinitely.
- Failed Background Jobs: If a heavy `db.gap_analysis` job actually crashes or times out in the RQ queue, its status becomes `FAILED`. When the `preload` loop queries it 30 seconds later, `/rest/v1/map_analysis` sees the failure, creates a brand new job, and returns the new `job_id`. Our loop sees the new `job_id`, assumes the pair is still just "waiting", returns `False`, and falls right back asleep. It gets stuck eternally creating failing jobs.
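The second scenario can be modelled as a toy (all names and behaviour below are assumptions standing in for `/rest/v1/map_analysis` and the RQ worker, not the actual code):

```python
import itertools

_job_ids = itertools.count(1)  # toy job-id generator

def map_analysis_toy(job: dict) -> dict:
    """Toy server: on seeing a FAILED job it silently enqueues a
    replacement and hands back the fresh job_id instead of an error."""
    if job["status"] == "FAILED":
        return {"job_id": next(_job_ids), "status": "FAILED"}
    return {"result": "done"}

def calculate_a_to_b_toy(job: dict) -> bool:
    """Toy client: any response carrying a job_id is read as 'still
    running', so a perpetually failing job never looks finished."""
    response = map_analysis_toy(job)
    return "job_id" not in response

# Every poll of a FAILED job yields False, forever -- the pair is never
# removed from waiting, and each poll minted a brand new (doomed) job.
```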
Expected Fix
We should implement a MAX_RETRIES = 10 cap inside the loop. If a specific standard pairing fails repeatedly and crosses the 10-retry threshold, we should drop it from the waiting queue so the rest of the application can safely finish.
Additionally, tracking the unresolved pairs as Python tuples instead of re-walking the entire nested standards strings array on every pass will keep the application from constantly re-checking pairs that have already succeeded.
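Putting both ideas together, a sketch of the proposed fix (function and parameter names here are mine, not the actual patch):

```python
import time

MAX_RETRIES = 10  # proposed cap from above

def preload_fixed(pairs, calculate, delay=0.0):
    """Track each unresolved (sa, sb) tuple with its own retry count and
    drop it once it has failed MAX_RETRIES times, guaranteeing the loop
    terminates. Returns the list of pairs that were given up on."""
    retries = {pair: 0 for pair in pairs}  # only unresolved pairs live here
    dropped = []
    while retries:
        for pair in list(retries):
            if calculate(*pair):
                del retries[pair]          # success: never re-checked
            else:
                retries[pair] += 1
                if retries[pair] >= MAX_RETRIES:
                    dropped.append(pair)   # give up so the app can finish
                    del retries[pair]
        if retries:
            time.sleep(delay)              # real code: time.sleep(30)
    return dropped
```

Because the dict only ever holds unresolved tuples, successful pairs are never re-queried, and the worst case is bounded at `MAX_RETRIES` sleeps per pair.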
Let me know if you would like me to open a PR for this; I have a clean fix tested and ready!