Minimum computer specifications to conduct a chunk-size training run #543

Open
ray-ngo opened this issue Mar 8, 2022 · 9 comments

@ray-ngo

ray-ngo commented Mar 8, 2022

MWCOG staff encountered an "insufficient system resources" error when running a chunk-size training run for our Gen3, Phase I model on a server with 128 GB of RAM and 12 physical cores. Here are our questions:

  1. How do we know whether the chunk-size training run failed because of insufficient RAM or because of too few processors? The log file does not help answer this question.
  2. What are the minimum requirements, such as RAM and number of processors, for a computer to be able to run a chunk-size training?
@jpn--
Member

jpn-- commented Mar 8, 2022

Your chunk training attempt failed due to a lack of RAM. There is no minimum number of processors required for ActivitySim (if you are patient). But you do need enough RAM to load the skims plus at least a little extra to work in. How much that is depends on the number of zones, the number of modeled time periods, and the number of different tables represented. Thus, the minimum RAM needed is more a function of your model implementation and not something that is generalizable to ActivitySim at large.

There is an experimental memory-mapped skims interface that might allow for a reduction in the RAM required, and once the sharrow interface is operable more generally (several months from now) the RAM required will be reduced.
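To make the zones × time periods × tables scaling concrete, here is a back-of-envelope estimate of the skim buffer alone; every number in it is hypothetical and not taken from the MWCOG or SEMCOG models:

```python
# Rough skim-buffer estimate; all inputs below are hypothetical.
zones = 3_000             # TAZs in the model
matrices_per_period = 60  # skim cores (time, distance, cost, ... by mode)
time_periods = 5          # e.g. EA, AM, MD, PM, EV
bytes_per_cell = 4        # float32

skim_bytes = zones**2 * matrices_per_period * time_periods * bytes_per_cell
print(f"skims alone: ~{skim_bytes / 1e9:.0f} GB")  # ~11 GB for these inputs
```

The quadratic term in zones is why otherwise similar models can have very different RAM floors.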

@JilanChen

SEMCOG ABM ran into the same issue on one of our machines (128 GB RAM and 24 cores). RSG/SEMCOG are still investigating the cause. SEMCOG ABM Phase I (one-zone system) used to run on this machine, but this OSError started to pop up a few months ago. When we monitored the computer's performance, the highest memory use was about 90 GB, so we are not sure whether it is a RAM issue or something else.

@stefancoe
Contributor

What model is causing the crash? What are your chunk settings? Have you been able to run the model to completion with a smaller household sample size? I would try reducing the chunk_size and see if that helps. Also, the number of processors should be a few below the total number available.

@JilanChen

Currently we are testing in training mode with chunk_size: 100_000_000_000 and num_processes: 20. The crash for SEMCOG's ABM mostly happens while running "workplace_location".
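For reference, those options typically live in the model's settings.yaml. A minimal sketch of the chunk-training-related settings, assuming the key names documented for recent ActivitySim versions (values are illustrative, not a recommendation):

```yaml
# settings.yaml (illustrative values only)
chunk_training_mode: training    # record per-step memory use to build the chunk cache
chunk_size: 100_000_000_000      # target bytes of RAM for chunked steps; keep below physical RAM
num_processes: 20                # number of worker processes
households_sample_size: 0        # 0 = all households; a small sample is handy for test runs
```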

@ray-ngo
Author

ray-ngo commented Mar 9, 2022

It is good to know that there is no minimum number of processors. We can upgrade the RAM on our older servers (which have fewer cores and less RAM) to run our ActivitySim model. Thanks @jpn--!

RSG told us that our model needs around 110 GB of RAM to run, while we set the chunk_size variable to 102 GB (80% of the available RAM) in our testing run. I guess we have to set the chunk_size value to no less than 110 GB. Please correct me if I am wrong.

@stefancoe: What is the role of the processor variable in the chunk_size training? Feel free to point me to the documentation discussing this. We set the variable at 80% of the total available in our run.

@AndrewTheTM
Contributor

One thing that would be nice to have in ActivitySim is better RAM logging and reporting. Using a completed run of @ray-ngo's model above, the logged RAM maximum values are:

rss: 76,714,344,448 (77 GB)
full_rss: 1,247,299,629,056 (1.2 TB)
uss: 76,975,316,992 (77 GB)

So the model won't run on a 128 GB server with a chunk_size of 102 GB, but it did run on the 244 GB server I used. The skims need 76.6 GB (according to the log file). The answer to "how much RAM does this model need?" seems to be greater than 102 GB (more than rss and uss) and less than 244 GB (far less than the max full_rss). Ray's model failed in shared_data_buffers; on the completed run, full_rss was 152 GB at that point (roughly 2× the skim buffer, per the mem.csv log), but that same field hit 1.2 TB on a machine that doesn't have anywhere near that much RAM, so... ?
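For anyone trying to reproduce these numbers outside ActivitySim's own logging, the per-process figures behind rss and uss can be read with psutil. A plausible explanation for full_rss exceeding physical RAM is that it sums rss across the main process and its children, so shared pages (such as a shared skim buffer) get counted once per process. A sketch of generic psutil usage (not ActivitySim's logging code):

```python
import psutil

proc = psutil.Process()
info = proc.memory_full_info()

# rss counts every page resident for this process, including pages shared
# with other processes (e.g. memory-mapped or shared-memory skim buffers).
print(f"rss: {info.rss:,} bytes")

# uss counts only pages unique to this process, so it excludes shared skims.
print(f"uss: {info.uss:,} bytes")

# Summing rss over a parent and all of its children counts each shared page
# once per process, which is how a "full" total can exceed physical RAM.
total = proc.memory_info().rss + sum(
    child.memory_info().rss for child in proc.children(recursive=True)
)
print(f"summed rss across processes: {total:,} bytes")
```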

@stefancoe
Contributor

stefancoe commented Mar 9, 2022

@ray-ngo Here is the documentation for multiprocessing:
https://activitysim.github.io/activitysim/core.html#multiprocessing

When using multiprocessing without chunking, ActivitySim breaks the 'problem' (where the problem is often a very large table of choosers and alternatives) into a number of parts equal to the num_processors setting. So 10 processors = 10 tables. The program then works on each table in parallel, one per processor. This greatly decreases ActivitySim's runtime, but does nothing to manage the amount of RAM that is used.

If the tables, skims, and everything else require more RAM than is available, the program will crash. Chunking is used to fit the 'problem' into the available RAM by further reducing the size of each table; the resulting pieces are run sequentially within each process. So, using the example of 10 processors, if a sub-model/step requires two chunks given the available RAM, then each of the 10 tables is broken into 2 pieces for its processor to work on sequentially. The training step is used to determine how many chunks are needed for each step given the available RAM and the number of processors used.
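A conceptual sketch of that two-level split (illustrative only, not ActivitySim's actual implementation):

```python
import numpy as np
import pandas as pd

def split_work(choosers: pd.DataFrame, num_processes: int, num_chunks: int):
    """Divide a chooser table across processes, then into chunks per process.

    The process shares would run in parallel; the chunks within each share
    run sequentially so that only one chunk per process is in RAM at a time.
    """
    positions = np.arange(len(choosers))
    for proc_id, proc_positions in enumerate(np.array_split(positions, num_processes)):
        for chunk_id, chunk_positions in enumerate(np.array_split(proc_positions, num_chunks)):
            yield proc_id, chunk_id, choosers.iloc[chunk_positions]

# With 10 processes and 2 chunks per process, 1,000,000 choosers become
# 20 pieces of ~50,000 rows each.
choosers = pd.DataFrame({"household_id": np.arange(1_000_000)})
pieces = list(split_work(choosers, num_processes=10, num_chunks=2))
print(len(pieces), len(pieces[0][2]))  # -> 20 50000
```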

The reason to use fewer processors than are available is to leave some compute power for the OS to manage all this and, I am guessing, to handle anything that might be multithreaded. Some Python libraries can take advantage of multithreading, which may not be ideal when we are already using multiprocessing. There is a setting, 'MKL_NUM_THREADS: 1', which should limit libraries like numpy to a single thread.
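Outside of that setting, the same effect can be had by capping the math-library thread pools through environment variables before numpy is imported; this is a generic approach, not something specific to ActivitySim:

```python
import os

# Cap per-process math-library threading so that, with many worker processes
# running, each one does not also spawn a full set of MKL/OpenMP/BLAS threads.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # must be imported after the variables are set to take effect
```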

I would try reducing both chunk size and num_processors a little bit more and see if that helps. Have you run the model with a smaller number of households?

**Edit:** Just read @AndrewTheTM's comments. Sounds like 128 GB is just not enough.
Decreases runtime, not increases!

@jfdman

jfdman commented Mar 9, 2022

@aletzdy
Contributor

aletzdy commented Mar 9, 2022

Has anyone tried the system registry changes suggested here: https://stackoverflow.com/questions/53752487/oserror-winerror-1450-insufficient-system-resources-exist-to-complete-the-req ?

I tried the registry change option, in addition to the other fixes suggested here, on SEMCOG's 128 GB server, and it did not help.
