@jjacobson95 jjacobson95 commented Aug 5, 2025

Build v2.2 is ready to merge.

Summary of all changes made on this branch:


Drug file generation was overhauled.
This was a major update across all datasets. Each dataset needed its own code to adjust the SMI values (which not all did, #430) and to create a merged file of all previous drug files (which not all did, #428, #429). Additionally, one dataset (#427) did not use the pubchem_retrieval.py script at all: it predated the script and was never updated, so I removed all the old code and replaced it. The overhaul required a large change to pubchem_retrieval.py, which now accepts a new argument, prev_drug_filepaths, followed by a compatibility update to every other dataset's drug generation script. All datasets now use all previous drug files!
This also solves half of #421, though not the drug descriptors portion.
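For reviewers, the idea behind reusing previous drug files can be sketched roughly as below. This is an illustration only, not the code in pubchem_retrieval.py: the function names, the comma-separated format of prev_drug_filepaths, and the column names improve_drug_id and chem_name are all assumptions made for the example.

```python
import csv
from pathlib import Path


def load_known_drugs(prev_drug_filepaths):
    """Map lower-cased drug names to the ID first assigned in earlier drug files.

    Assumptions (hypothetical): `prev_drug_filepaths` is a comma-separated
    string of TSV paths, and each file has `improve_drug_id` and `chem_name`
    columns.
    """
    known = {}
    paths = [p.strip() for p in prev_drug_filepaths.split(",") if p.strip()]
    for path in paths:
        if not Path(path).exists():
            continue  # a dataset that has not been built yet has no drug file
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                name = row["chem_name"].strip().lower()
                # Earlier files win, so an existing ID is never reassigned.
                known.setdefault(name, row["improve_drug_id"])
    return known


def names_needing_lookup(names, known):
    """Return only the names that no previous drug file has already resolved."""
    return [n for n in names if n.strip().lower() not in known]
```

With this shape, each dataset passes every earlier drug file and only queries for names that no previous dataset has already resolved.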

Large-Scale Renaming
All references to the build directory are being changed to coderbuild.
All recent datasets were renamed to drop pdo/pdx, along with other small updates across all build files.

SarcoPDO Fixes
There were a couple of issues with this dataset, as it had not properly passed validation in the last build. This fixes the mutations file (#431) and the experiments file (#436).

LiverPDO Fix
The last-second addition of integer casting led to an issue with the experiments file (#432). This has been fixed.

Mapping Scripts Update
The mapping scripts are now updated to include all desired datasets (#435).

Broad_Sanger Update
Broad/Sanger has persistently been the dataset most likely to fail and stop the build process. This is due to the file download method (#438), where a connection break or mis-download stops everything. I'm implementing a more robust method to download files. Debugging this is taking a bit of time (concurrently with everything else), but in the long run it will save us days or weeks of build time.
Also fixing a polars import issue that is stopping a couple of scripts from working (#442).
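To illustrate what "more robust" could mean here, below is a minimal sketch of a retrying download that never leaves a truncated file in place. It is not the implementation in this branch: the function name, its parameters, and the ".part" suffix are hypothetical, and it does not verify checksums.

```python
import http.client
import os
import shutil
import time
import urllib.request


def download_with_retries(url, dest, attempts=3, backoff_seconds=2.0,
                          opener=urllib.request.urlopen):
    """Download `url` to `dest`, retrying when the connection breaks.

    Data is streamed to a temporary ".part" file and renamed only after the
    transfer finishes, so an interrupted download never leaves a truncated
    file at `dest`. `opener` is injectable so the function can be tested
    without network access.
    """
    part = dest + ".part"
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response, open(part, "wb") as out:
                shutil.copyfileobj(response, out)
            os.replace(part, dest)  # atomic rename on the same filesystem
            return dest
        except (OSError, http.client.HTTPException) as err:
            # URLError and socket errors are OSError subclasses.
            last_error = err
            if os.path.exists(part):
                os.remove(part)
            if attempt < attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise RuntimeError(
        f"failed to download {url} after {attempts} attempts"
    ) from last_error
```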

Build Process
I've cleaned up a ton of the print statements across the dataset build scripts so that debugging can go faster. Previously I had to filter through 100k+ lines of logs to find the issues. The only issue this relates to is #437 (which produces 56k lines of warnings in the log). Some print statements can't be removed easily, such as those from the GDC_tool, but this is still much better.

I'm also implementing a retry function in build_all.py. For example, if the hcmi build_omics.sh script fails due to a memory spike, it will retry it 3 times before the whole build fails. While not a direct fix, it will also function as back-up protection against broken connections during downloads that cause inconsistent failures (#434). This also required a change to the pubchem_retrieval logic with the ignore_chems (#446).
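The retry behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the code in build_all.py: the function name and parameters are hypothetical, and it assumes a script signals failure through a non-zero exit code.

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, delay_seconds=0.0):
    """Run `cmd` and re-run it up to `retries` more times if it fails.

    Returns the attempt number that succeeded; raises RuntimeError once the
    initial run and all retries have failed, which is the point where the
    whole build would stop.
    """
    total_attempts = retries + 1
    for attempt in range(1, total_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}/{total_attempts} failed "
              f"(exit code {result.returncode}): {cmd}")
        if attempt < total_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {total_attempts} attempts: {cmd}")
```

A call might look like `run_with_retries(["bash", "build_omics.sh"], retries=3)`, where the script path is only an example.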

Docker Process
Optimized all Dockerfiles across all datasets to better leverage Docker caching. This significantly speeds up build time and, more notably, debugging time, especially for Docker containers with R. Without an optimized cache order, broad_sanger_exp takes 1 hr to build and broad_sanger_omics takes 20 minutes. Now build files can be modified while R and everything else that needs compiling stays cached. The best order is essentially largest to smallest: compile R and Python, add the R requirements file, install R packages, add the Python requirements file, install Python packages, then add all build files.

GDC Datasets (HCMI and Pancreatic)
These datasets now use chunking and streaming to keep memory usage as low as possible. This resolves Docker issues relating to OOM kills (#449) and segfaults (#448).
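As a rough picture of the chunking and streaming pattern, the sketch below filters a large tab-separated file while holding only one chunk of rows in memory at a time. It is a generic standard-library example, not the HCMI or Pancreatic code; the function names, chunk size, and file layout are assumptions.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_size=50_000, delimiter="\t"):
    """Yield lists of at most `chunk_size` rows; the file is never fully loaded."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def stream_filter(src, dst, keep, chunk_size=50_000):
    """Write the rows of `src` for which `keep(row)` is true to `dst`.

    Rows are read and written chunk by chunk, so peak memory is bounded by
    `chunk_size` rather than by the size of the input file. If `src` has no
    data rows, `dst` is left empty.
    """
    writer = None
    written = 0
    with open(dst, "w", newline="") as out:
        for chunk in iter_chunks(src, chunk_size):
            if writer is None:
                # Take the header from the first chunk read.
                writer = csv.DictWriter(
                    out, fieldnames=list(chunk[0].keys()), delimiter="\t"
                )
                writer.writeheader()
            rows = [row for row in chunk if keep(row)]
            writer.writerows(rows)
            written += len(rows)
    return written
```

A smaller chunk size lowers peak memory further at the cost of more loop overhead.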

Extras
Removed a couple of unused files including Dockerfile.crcPDO (#447).

@jjacobson95 jjacobson95 marked this pull request as draft August 5, 2025 17:08
jjacobson95 commented Aug 10, 2025

Just noting, I think all bugs should be fixed now (fingers crossed). I'm attempting a full build; if there are no bumps, it will probably take about 48 hrs.

I did remove the original improve_mapping files as we've changed such a huge amount of the data and renamed many of the datasets, so new ones will be generated with this build.

Edit: New issue appeared with the GDC tool, #449

@jjacobson95 jjacobson95 requested a review from sgosline August 18, 2025 21:31
@jjacobson95 jjacobson95 marked this pull request as ready for review August 19, 2025 18:26
@jjacobson95 jjacobson95 merged commit 17392c1 into main Aug 19, 2025
