Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test parallelizing scons run #72

Closed
rcalvo12 opened this issue Dec 5, 2022 · 28 comments · Fixed by #73
Closed

Test parallelizing scons run #72

rcalvo12 opened this issue Dec 5, 2022 · 28 comments · Fixed by #73
Assignees

Comments

@rcalvo12
Copy link
Collaborator

rcalvo12 commented Dec 5, 2022

scons represents the build steps as a directed acyclic graph.

This means it can calculate which paths can be built in parallel and which must be built in sequence.

The -j option is designed to allow a user to run multiple simultaneous tasks.

We'd like to explore whether it's possible to use this option successfully on FASRC.

@jmshapir
Copy link
Contributor

jmshapir commented Dec 5, 2022

@MosesStewart welcome to Template! Once you accept the invitation to join this repository, please self-assign and review the issue description.

A first step is probably to make sure you can build the entire Template on FASRC. After that you can start testing parallel builds.

Please keep in mind that this repository is in the public domain.

Thanks!

@MosesStewart
Copy link
Contributor

We'd like to explore whether it's possible to use this option successfully on FASRC.

To give an update, I'm able to use the -j option successfully without any warnings when running on my own system. I'm still trying to sort out the executable path for stata on the FASRC to test it there.

@jmshapir
Copy link
Contributor

To give an update, I'm able to use the -j option successfully without any warnings when running on my own system.

@MosesStewart that's great! Do you get any speedup in runtime, or is the repository too fast to be able to tell?

In the latter case, we could create an issue branch and add some slow operations to the repository (that are not dependent on one another) so we can see if we're getting any improvement from parallelizing.

I'm still trying to sort out the executable path for stata on the FASRC to test it there.

Got it. Per here:

The builders are going to look for Stata under the name StataMP-64. If when you type StataMP-64 from the bash prompt Stata doesn't run, then the builders likely won't be able to run it, either.

In that case, you could look for the correct command to run Stata from the bash prompt. Suppose for the sake of argument that it's StataCommand.

Then I think we just need to tell Linux that when you type StataMP-64 you really mean StataCommand. I think (?) aliases will do this but I haven't tested it directly (at least not for a while).

Let me know if any of that is helpful, thanks.

@MosesStewart
Copy link
Contributor

MosesStewart commented Dec 30, 2022

@jmshapir changing the stata executable in Template/source/lib/JMSLab/builders/executables.yml to stata-mp fixed the problem!

Thank you so much for the detailed explanation! I really appreciate it!

I didn't have permission to create a simlink in the FASRC applications folder, so I think I'll run it like so!

@MosesStewart
Copy link
Contributor

MosesStewart commented Dec 30, 2022

Do you get any speedup in runtime, or is the repository too fast to be able to tell?

The repository is too fast for me to tell 😭

In the latter case, we could create an issue branch and add some slow operations

Would it be possible for me to gain write-access to edit the code?

@jmshapir
Copy link
Contributor

@MosesStewart

@jmshapir changing the stata executable in Template/source/lib/JMSLab/builders/executables.yml to stata-mp fixed the problem!

Thank you so much for the detailed explanation! I really appreciate it!

I didn't have permission to create a simlink in the FASRC applications folder, so I think I'll run it like so!

Sounds good! Let's just be sure not to commit the revised executables.yml to the repository.

(And when @rcalvo12 @ew487 let's ask them what solution they use for this detail about the executables.)

Do you get any speedup in runtime, or is the repository too fast to be able to tell?

The repository is too fast for me to tell 😭

In the latter case, we could create an issue branch and add some slow operations

Would it be possible for me to gain write-access to edit the code?

Done! Use it wisely.

MosesStewart added a commit that referenced this issue Dec 30, 2022
@MosesStewart
Copy link
Contributor

Does anyone know how to force scons to build all target files, even if they're up-to-date?

I've tried scons . and scons /, which both should build all target files according to the link in #72 (comment). I've also tried scons --config=force, but I haven't been able to get scons to re-build the entire repository when it's up to date already.

I'm trying to re-build the entire repository to compare times with and without the -j option.

@jmshapir
Copy link
Contributor

@MosesStewart thanks!

Does anyone know how to force scons to build all target files, even if they're up-to-date?

I'm not sure but here are some ideas:

  • See stackexchange post
  • See github post which mentions a sequence of commands for this. (Maybe the same as the stackexchange post?)
  • Try deleting your local copy of sconsign.dblite. (This is how scons keeps track of the state of the build, so deleting it might force scons to rebuild everything.)
  • Try deleting your local copy of the contents of each /output folder. (I think the --clean option mentioned in the github post above might automate this? If not, deleting manually is a brute-force approach, and it should work as a last resort.)

I've tried scons . and scons /, which both should build all target files according to the link in #72 (comment).

Those commands will only build targets that are not currently up to date, since scons is designed to make sure targets are up to date using the minimum amount of compute.

I'm trying to re-build the entire repository to compare times with and without the -j option.

Great!

@MosesStewart
Copy link
Contributor

I'm not sure but here are some ideas:

Thank you so much! The StackExchange solution worked! I'm getting very interesting results with the -j option!
# Cores | Time: 0:00 | Completion
1 core | Time: 3:27 | Completed
2 core | Time: 2:19 | Failed
4 core | Time: 0:50 | Failed

The 2 and 4 task parallelizations that failed both did so in paper/SConscript, which is almost the last step the scons build and is almost instant. Therefore, I'm confident that the -j option does significantly increase speed at approximately a 1:1 ratio with the number of cores.

I'll try to dig into why the lyx environment is raising an ExecCallError when we use multiple cores.

@MosesStewart
Copy link
Contributor

MosesStewart commented Jan 1, 2023

@jmshapir I don't think the -j option can be used successfully in most cases.

From my testing, I hypothesize that when we ask scons to build 3 targets in parallel, it will always try to build 3 targets at the same time. In other words, even if the target scons needs to build next depends on a target currently being built, scons will try to start building it anyways and the program will fail.

In this repository, both the build #output/paper/TemplateTex.pdf and build #output/paper/TemplateLyx.pdf were failing when building multiple targets only when they included #source/figures/gdp_educ.eps and #output/tables/top_gdp.tex, which is built by scons the step before. The only way I could get the build to complete was by removing the dependencies on earlier steps. Therefore, I don't think that the -j option is optimized so that scons knows to wait for earlier dependencies to finish building.

I do think if scons gets an update, this could easily be fixed and the -j option could become incredibly useful

@jmshapir
Copy link
Contributor

jmshapir commented Jan 1, 2023

Thanks @MosesStewart!

In this repository, both the build #output/paper/TemplateTex.pdf and build #output/paper/TemplateLyx.pdf were failing when building multiple targets only when they included #output/tables/top_gdp.tex, which is built by scons the step before.

Can you point me to which SConscript controls the building of #output/tables/top_gdp.tex? I want to make sure I understand how the DAG is structured here in case there's something we're missing.

Therefore, I don't think that the -j option is optimized so that scons knows to wait for earlier dependencies to finish building.

Since parallelization is a key part of the scons design, I'd be surprised if it has this limitation. If it did, I'd expect it would have been reported. If you have bandwidth, it might be worth your spending a little time (say, not more than 1-2 hours) on the scons github repository to see if there are any open issues that relate to the problem we're having here.

@MosesStewart
Copy link
Contributor

Can you point me to which SConscript controls the building of #output/tables/top_gdp.tex? I want to make sure I understand how the DAG is structured here in case there's something we're missing.

Sorry I attached the wrong file name! It is #output/tables/top_gdp.lyx not .tex that is being built in this step:

target = ['#output/tables/top_gdp.lyx']
source = ['#source/tables/top_gdp.lyx',
'#output/analysis/top_gdp/top_gdp.txt']

I believe that #output/tables/top_gdp.tex isn't being built by anything.

Since parallelization is a key part of the scons design, I'd be surprised if it has this limitation. If it did, I'd expect it would have been reported. If you have bandwidth, it might be worth your spending a little time (say, not more than 1-2 hours) on the scons github repository to see if there are any open issues that relate to the problem we're having here.

Yes! I will start this right away!

@jmshapir
Copy link
Contributor

jmshapir commented Jan 2, 2023

I believe that #output/tables/top_gdp.tex isn't being built by anything.

Thanks @MosesStewart! In that case, do we know why TemplateTex.pdf isn't building successfully when we paralellize? It seems like the dependency on top_gdp.tex can't be the culprit, since as you say that file isn't being built by anything.

@MosesStewart
Copy link
Contributor

MosesStewart commented Jan 2, 2023

Thanks @MosesStewart! In that case, do we know why TemplateTex.pdf isn't building successfully when we paralellize? It seems like the dependency on top_gdp.tex can't be the culprit, since as you say that file isn't being built by anything.

No, I hadn't noticed before that top_gdp.tex wasn't dependent on anything, but I'm sure that editing source/papaer/template.tex to not include output/tables/top_gdp allows scons to not fail on this step of the process. I'm at a loss for the cause ~ I'm trying to change different things and see what causes it to fail here during parallelization

@jmshapir
Copy link
Contributor

jmshapir commented Jan 2, 2023

Thanks @MosesStewart!

I wonder if it might be a good idea to build a minimal working example of what we're trying to do, maybe in a separate repository? That way we could strip out some of the complexity of the Template and focus on the bare essentials.

If that sounds good to you here is a process I propose:

  • Create a new (empty) public repository SConsTest in your own namespace (i.e., under @MosesStewart rather than JMSLab)
  • Add a very minimal scons structure using out of the box builders rather than JMSLab builders. Something like:
    • /source/paper/: contains test.tex which is compiled to /output/paper/test.pdf using the PDF() builder
    • /source/zip: contains SConscript which zips test.pdf to /output/zip/test.zip using the ZIP() builder

If we find that parallelization breaks here then we have a simple example we can use to diagnose potential explanations. We could even post an issue to the scons github repository pointing to the example.

If we find that parallelization works here then that might help us get to the bottom of why it's breaking in Template.

Let me know what you think. If you think it would be helpful to talk this through feel free to shoot me an e-mail and we can schedule a call or meeting.

Thanks!

@MosesStewart
Copy link
Contributor

The test repository is here. I'm currently changing things in test.tex and important_figure.tex to see if I can break scons with the -j function.

@jmshapir
Copy link
Contributor

jmshapir commented Jan 3, 2023

Thanks @MosesStewart! That test repository looks good to me. A couple thoughts:

  • Do we need important_figure.tex? I'd have thought that for a minimal example the pipeline test.tex-->test.pdf-->test.zip might be sufficient, since that would support -j with 2 simultaneous jobs?
  • If you wanted to clean up the /output/paper directory you could add a .gitignore local to that directory that specifies which files to ignore.

If we're finding that -j works in the test repo but not in Template, that could mean that something is off with the way we've specified the DAG in Template. In order to parallelize the build, it's crucial that scons has all the dependencies exactly right, since otherwise it might try to build targets before their inputs are ready.

To visualize the DAG, I think you can use scons --tree=type where type determines exactly what sort of tree is drawn. Maybe it's worth having a look at the DAG in case we're missing something?

@MosesStewart
Copy link
Contributor

To visualize the DAG, I think you can use scons --tree=type where type determines exactly what sort of tree is drawn. Maybe it's worth having a look at the DAG in case we're missing something?

That's very helpful! I'll look into it!

  • Do we need important_figure.tex? I'd have thought that for a minimal example the pipeline test.tex-->test.pdf-->test.zip might be sufficient, since that would support -j with 2 simultaneous jobs?

I wanted to test if anything my be breaking when trying to import a file. Currently the -j is working in the test repo, and me changing file locations and such isn't breaking it.

MosesStewart added a commit that referenced this issue Jan 3, 2023
@MosesStewart
Copy link
Contributor

MosesStewart commented Jan 3, 2023

@jmshapir nothing jumped out at me after looking at the DAG, but I narrowed down exactly what's causing the parallelization to fail.

Changing the repository by:

    1. Deleting the source/figures/gdp_educ.lyx figure from the source/paper/template.lyx document
      \begin_layout Standard
      \begin_inset CommandInset include
      LatexCommand include
      filename "../../source/figures/gdp_educ.lyx"
      \end_inset
      \end_layout
    1. Commenting out the source/figures/gdp_educ figure from source/paper/template.tex document
      \input{source/figures/gdp_educ}

allows the parallelization to work.

I didn't have to change any of the dependencies in source/paper/SConscript for it to compile. So, I think something's happening when including the source figures in either document that's causing scons to fail only when using the -j option, but succeeding otherwise. I included what the repository looks like when it succeeds in a2c53bc

@jmshapir
Copy link
Contributor

jmshapir commented Jan 4, 2023

Thanks @MosesStewart!

Does the DAG reflect the fact that gdp_educ.lyx and gdp_educ.tex depend on gdp_educ.eps?

@MosesStewart
Copy link
Contributor

@jmshapir they depend on gpd_educ.lyx and gpd_educ.tex respectively:

For source/papers/template.lyx:

Template/sconstruct.log

Lines 85 to 96 in 227dcce

│ │ ├─┬output/paper/TemplateLyx.pdf
│ │ │ ├─source/paper/Template.lyx
│ │ │ ├─┬output/tables/top_gdp.lyx
│ │ │ │ ├─source/tables/top_gdp.lyx
│ │ │ │ └─┬output/analysis/top_gdp/top_gdp.txt
│ │ │ │ ├─source/analysis/top_gdp/top_gdp_table.R
│ │ │ │ └─┬output/derived/wb_clean/gdp_education.csv
│ │ │ │ ├─source/derived/wb_clean/build.py
│ │ │ │ ├─datastore/raw/world_bank/orig/API_NY.GDP.PCAP.CD_DS2_en_csv_v2_1740213.csv
│ │ │ │ └─datastore/raw/world_bank/orig/API_SE.XPD.TOTL.GD.ZS_DS2_en_csv_v2_1740282.csv
│ │ │ ├─source/figures/gdp_educ.lyx
│ │ │ └─source/paper/References.bib

│ │ │ ├─source/figures/gdp_educ.lyx

For source/papers/template.tex

Template/sconstruct.log

Lines 97 to 101 in 227dcce

│ │ └─┬output/paper/TemplateTex.pdf
│ │ ├─source/paper/Template.tex
│ │ ├─output/tables/top_gdp.tex
│ │ ├─source/figures/gdp_educ.tex
│ │ └─source/paper/References.bib

│ │ ├─source/figures/gdp_educ.tex

@jmshapir
Copy link
Contributor

jmshapir commented Jan 4, 2023

@MosesStewart that makes sense. But if we look at gdp_educ.lyx, for example, it includes a call to gdp_educ.eps:

filename ../../output/analysis/plots/gdp_educ.eps

Does any SConscript currently declare that gdp_educ.lyx depends on gdp_educ.eps?

If not, I wonder if there's a hole in the DAG, such that we have represented to scons that we have

stuff-->gdp_educ.eps
gdp_educ.lyx-->Template.lyx

but really we have

stuff-->gdp_educ.eps-->gdp_educ.lyx-->Template.lyx

If that were the case, then the DAG would imply (incorrectly) that it is fine to build, say, gdp_educ.eps and Template.lyx at the same time.

Could that be the culprit here?

@MosesStewart
Copy link
Contributor

MosesStewart commented Jan 4, 2023

@jmshapir I didn't notice that ~ that seems very likely ~ I will test right away

MosesStewart added a commit that referenced this issue Jan 4, 2023
This reverts commit a2c53bc.
MosesStewart added a commit that referenced this issue Jan 4, 2023
MosesStewart added a commit that referenced this issue Jan 4, 2023
MosesStewart added a commit that referenced this issue Jan 4, 2023
MosesStewart added a commit that referenced this issue Jan 4, 2023
@MosesStewart
Copy link
Contributor

It works!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
MWAHAHAHAHAHAHAHAHAHAHAHA

@jmshapir
Copy link
Contributor

jmshapir commented Jan 4, 2023

@MosesStewart nice! One thing I think we've learned that if we want to use this option we need to be quite meticulous about declaring dependencies, even those that aren't needed for a serial build.

The other thing I'd ideally like to check before we leave the issue is that we are actually seeing some improvements in run time. Is it possible to look into that? As before, it might be necessary to add a "speedbump" to some of the steps in order to make the run time calculations meaningful.

@rcalvo12 @ew487

@MosesStewart
Copy link
Contributor

MosesStewart commented Jan 4, 2023

@jmshapir currently I'm getting 3:18 without parallelization and 1:56 with parallelization (~42% speed boost). This is only with "speedbumps" in 2 of the files though ~ the more speedbumps I add, the faster the parallelization becomes!

@jmshapir
Copy link
Contributor

jmshapir commented Jan 4, 2023

@MosesStewart that's great! It sounds to me like this is working as intended.

The next steps I envision are the following:

  • Bring the issue branch to a state where it aligns with master except for fixing the gap in the DAG.
  • Rerun one more time to confirm scons -j is functional. (No need for speedbumps.)
  • Open a pull with @rcalvo12 @jmshapir as reviewers.

Of course if you think I've missed something just say!

MosesStewart added a commit that referenced this issue Jan 4, 2023
MosesStewart added a commit that referenced this issue Jan 5, 2023
…om working (#73)

* create a waste of time for #72

* sconstruct.log with dependency tree for #72

* What works for #72

* Revert "What works for #72"

This reverts commit a2c53bc.

* Updated dependencies for #72

* Changed environment method for #72

* updated dependencies for #72

* updated dependency method for #72

* Revert "create a waste of time for #72"

This reverts commit b7ffdef.
@MosesStewart
Copy link
Contributor

Summary:

In this issue we tested the functionality of the -j option for scons in the repository.

After testing, we found that it is pivotal to carefully indicate all dependencies so the DAG is set up to prevent scons from building dependent files in parallel. We updated source/figures/scons to prevent scons from building gdp_educ.eps and template.pdf in parallel when using the -j option, opening up parallelization in the repository.

Final state of the issue branch is in f6297fe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants