
Project Meeting 2024.06.20

Michelle Bina edited this page Jun 20, 2024 · 6 revisions

Agenda

Action Items

  • CS to test SANDAG model, full sample, sharrow on, single process in AWS cloud using Intel vs AMD hardware, keeping the image the same (holding all other factors constant)
  • Jeff to test with shared memory for skims completely disabled (single process only)
  • RSG to test running the MTC model, keeping everything constant except the sharrow fix all branch code.
  • WSP to test SFCTA run that is crashing due to insufficient resources with a small sample run, to test the hypothesis that it has something to do with disk space.

Meeting Notes

Project Admin

  • AMPO Contracting
    • Agencies should have received Agreement MOUs from AMPO
    • Typically give agencies 3 months to get everything executed and transmitted
  • Drafting of TOs for Phase 9b
    • Joe to follow up with Jeff, Sijia, and David to discuss details

Phase 9a Updates

  • Latest Run Results
  • Compared to the start of Phase 9, many changes were made to resolve egregious run time and memory usage problems. There have been a lot of successes, but a few outstanding issues still call the stability of the ActivitySim code into question.
  • One outstanding issue is not resolved: there have been successful runs of the SANDAG model with sharrow on/off, single process, full sample. One of those tests ran very well (on WSP’s machine), but other attempts to do the exact same thing have had very different (worse) performance results. Hypotheses include:
    • Could be hardware
      • Success on a machine with AMD hardware
      • No success on machines with Intel hardware
      • CS to test this hypothesis on AWS, using different instance types with AMD and Intel hardware
    • Could be the version of numba
      • RSG did a test with a numba version change and it still performed poorly, so that’s not it
    • Could be a different hardware-related thing - not the CPU but the bandwidth between the CPU and RAM, but this is harder to test
    • Could be related to a shared memory process in sharrow. Sharrow utilizes multiprocess shared memory for xarray, even when running in a single process.
      • Jeff is creating code to test running without any shared memory. Jeff doesn’t know why this would be a problem but is trying anything.
  • The other outstanding issue: when we’ve attempted to run multiprocess on SFCTA’s server, it crashes with a cryptic report of insufficient resources. We can’t figure out which resource is insufficient. There’s 1 TB of RAM and presumably plenty of disk space.
    • WSP to test SFCTA run that is crashing due to insufficient resources with a small sample run, to test the hypothesis that it has something to do with disk space.
    • Longer-term consideration: We may want to find a way to track disk usage/requirements if we get into very large multiprocess runs.
  • RSG ran the SEMCOG model with and without sharrow. The SEMCOG model is taking longer to run with 1.3 beta.
    • With sharrow there’s a reduction of run time from 6.1 hours to 4.2 hours. However, the workplace location choice model takes longer (this was seen with the MWCOG model as well, before Phase 9 work).
    • We did see the same pattern in the SANDAG model (see Issue #6)
    • Rerunning with updated code, there’s an increase in run time with sharrow. Maybe there’s something in the sharrow fix all branch that’s causing this. RSG to re-run MTC model with the sharrow fix all branch to see if it’s showing worse times; if so, then there’s something in that PR that’s slowing things down. We need to do a new baseline for the MTC model.
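
The longer-term idea noted above, tracking disk usage during very large multiprocess runs, could be prototyped with the Python standard library alone. The following is a minimal sketch, not part of ActivitySim: `free_gb` and `DiskUsageMonitor` are hypothetical names introduced here for illustration. It samples free space on one volume in a background thread and remembers the lowest value seen, which may help confirm or rule out the disk-space hypothesis.

```python
import shutil
import threading
import time


def free_gb(path="."):
    """Return free disk space, in GiB, on the volume containing `path`."""
    return shutil.disk_usage(path).free / 2**30


class DiskUsageMonitor:
    """Hypothetical helper: sample free disk space in a background thread
    and keep the minimum value observed while the context is active."""

    def __init__(self, path=".", interval_s=5.0):
        self.path = path
        self.interval_s = interval_s
        self.min_free_gb = free_gb(path)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout, so this loops until stopped.
        while not self._stop.wait(self.interval_s):
            self.min_free_gb = min(self.min_free_gb, free_gb(self.path))

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # Take one final sample so very short runs are still measured.
        self.min_free_gb = min(self.min_free_gb, free_gb(self.path))
        return False
```

A model run would be launched inside `with DiskUsageMonitor(path=output_dir) as mon:` (where `output_dir` is whichever volume the run writes to), and `mon.min_free_gb` read afterwards. This only observes the one volume containing `path`; temporary files written elsewhere would need a second monitor.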