Build 2.2 Bug Fixes - Large PR #441

jjacobson95 · 2025-08-05T17:07:34Z

Build v2.2 is ready to merge.

Summary of all changes made on this branch:

Drug file generation was overhauled.
This was a major update across all datasets. Each dataset needed its own code to adjust the SMI values (which not all did, #430) and to create a merged file of all previous drug files (which not all did, #428, #429). One dataset (#427) did not use the pubchem_retrieval.py script at all because it predates the script and was never updated, so I removed the old code and replaced it. The overhaul required a large change to pubchem_retrieval.py, which now accepts a new argument, prev_drug_filepaths, followed by a compatibility update to every other dataset's drug generation script. All datasets now use all previous drug files!
This also solves half of #421, though not the drug descriptors portion.
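
To illustrate the idea of reusing previous drug files, here is a minimal sketch of how IDs could be carried over and new ones continued from the highest existing SMI value. This is not the code in pubchem_retrieval.py: the function names, the name-based lookup, and the lowercase normalisation are all assumptions made for the example. Only the SMI_<n> ID format comes from this PR.

```python
import re

# IDs are assumed to look like "SMI_1", "SMI_2", ...
SMI_PATTERN = re.compile(r"^SMI_(\d+)$")


def max_smi_index(existing_ids):
    """Return the largest numeric suffix among IDs like 'SMI_12', or 0 if none."""
    highest = 0
    for drug_id in existing_ids:
        match = SMI_PATTERN.match(drug_id)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def assign_drug_ids(new_names, previous):
    """Reuse IDs for known drugs and give unseen drugs the next free SMI_<n>.

    `previous` maps a lowercase drug name to its ID, as merged from all
    earlier drug files (hypothetical structure for this sketch).
    """
    assigned = {}
    next_index = max_smi_index(previous.values()) + 1
    for name in new_names:
        key = name.lower()  # hypothetical normalisation step
        if key in previous:
            assigned[key] = previous[key]
        elif key not in assigned:
            assigned[key] = f"SMI_{next_index}"
            next_index += 1
    return assigned
```

With an empty `previous` mapping the first new drug gets SMI_1, which is the behavior one of the commits on this branch had to correct.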

Large-Scale Renaming
All references to the build directory are being changed to coderbuild.
All recent datasets were renamed, dropping pdo/pdx from their names, along with other small updates across all build files.

SarcoPDO Fixes
This dataset had a couple of issues because it had not properly passed validation in the last build. This fixes the mutations file (#431) and the experiments file (#436).

LiverPDO Fix
The last-second addition of integer casting caused an issue with the experiments file (#432). This has been fixed.

Mapping Scripts Update
The mapping scripts are now updated to include all desired datasets (#435).

Broad_Sanger Update
Broad/Sanger has persistently been the dataset most likely to fail and stop the build process. This is due to the file download method (#438), where a broken connection or a bad download stops everything. I'm implementing a more robust download method. Debugging it is taking some time (though it runs concurrently with everything else), but in the long run it should save us days or weeks of build time.
Also fixing a polars import issue that stops a couple of scripts from working (#442).
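
As a rough illustration of what a more robust download can look like, the sketch below retries on connection errors, writes to a temporary ".part" file, rejects empty downloads, and only moves the file into place once it is complete. It is an assumption-laden example using only the standard library, not the method added to the broad_sanger scripts. The function names, the ".part" suffix, and the default timings are hypothetical.

```python
import http.client
import os
import time
import urllib.request


def fetch_to_path(url, path):
    """Default fetcher: stream the URL to `path` in 1 MiB pieces."""
    with urllib.request.urlopen(url, timeout=60) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(1024 * 1024)
            if not chunk:
                break
            out.write(chunk)


def download_with_retries(url, dest, attempts=3, wait_seconds=5.0, fetch=fetch_to_path):
    """Download `url` to `dest`, retrying on connection problems.

    The data goes to a temporary file first, so a broken download never
    leaves a truncated file at `dest`.
    """
    partial = dest + ".part"  # hypothetical naming convention
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            fetch(url, partial)
            if os.path.getsize(partial) == 0:
                raise OSError("downloaded file is empty")
            os.replace(partial, dest)  # move into place only when complete
            return dest
        except (OSError, http.client.HTTPException) as error:
            last_error = error
            if os.path.exists(partial):
                os.remove(partial)
            if attempt < attempts:
                time.sleep(wait_seconds * attempt)  # wait longer after each failure
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

The `fetch` parameter exists only so the retry logic can be exercised without a network connection.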

Build Process
I've cleaned up a ton of the print statements across the dataset build scripts so that debugging is faster. Previously I had to filter through 100k+ lines of logs to find issues. The only issue this relates to is #437 (which produces 56k lines of warnings in the log). Some print statements can't be removed easily, such as those from the GDC_tool, but this is still much better.

I'm also implementing a retry function in build_all.py. For example, if the hcmi build_omics.sh script fails due to a memory spike, it will be retried 3 times before the whole build fails. While not a direct fix, this also works as backup protection against broken connections during downloads, which cause inconsistent failures (#434). It also required a change to the ignore_chems logic in pubchem_retrieval (#446).
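
The retry behavior can be sketched roughly as follows. This is a simplified stand-in, not the function in build_all.py: the name run_with_retries and its parameters are hypothetical, and it assumes "3 times" means three attempts in total with a 10 minute wait between them.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, wait_seconds=600, sleep=time.sleep):
    """Run `command`, retrying when it exits with a non-zero code.

    Returns the attempt number that succeeded. Raises RuntimeError if
    every attempt fails, so the caller can stop the whole build.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        print(f"{command} failed with exit code {result.returncode} "
              f"(attempt {attempt} of {attempts})")
        if attempt < attempts:
            sleep(wait_seconds)  # give memory or the network time to recover
    raise RuntimeError(f"{command} failed after {attempts} attempts")
```

A build step would then be wrapped as, for example, `run_with_retries(["bash", "build_omics.sh"])`, where the exact command line is an assumption.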

Docker Process
Optimized the Dockerfiles across all datasets to better leverage Docker layer caching. This significantly speeds up build time and, more notably, debugging time, especially for containers with R. Without the optimized cache order, broad_sanger_exp takes 1 hr to build and broad_sanger_omics takes 20 minutes. Now build files can be modified while R and everything else that needs compiling stays cached. The best order is essentially largest to smallest: compile R and Python, add the R requirements file, install R packages, add the Python requirements file, install Python packages, then add all build files.

GDC Datasets (HCMI and Pancreatic)
These datasets now use chunking and streaming to keep memory use as low as possible. This resolves Docker issues related to OOM kills (#449) and segfaults (#448).
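
For readers unfamiliar with the pattern, here is a minimal sketch of chunked streaming over a gzipped, tab-delimited file using only the standard library. It is not the HCMI or Pancreatic code: the function names, the chunk size, and the filtering step are assumptions chosen to show how only one chunk of rows is held in memory at a time.

```python
import csv
import gzip
from itertools import islice


def iter_chunks(path, chunk_rows=50_000, delimiter="\t"):
    """Yield lists of row dicts from a gzipped delimited file, chunk_rows at a time."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        while True:
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                return
            yield chunk


def stream_filter(src, dest, keep, chunk_rows=50_000):
    """Copy rows for which keep(row) is true from src to dest, chunk by chunk.

    Returns the number of rows written. If src has no data rows, dest is
    written without a header.
    """
    written = 0
    writer = None
    with gzip.open(dest, "wt", newline="") as out:
        for chunk in iter_chunks(src, chunk_rows):
            if writer is None:
                # Take the column order from the first chunk.
                writer = csv.DictWriter(out, fieldnames=list(chunk[0].keys()), delimiter="\t")
                writer.writeheader()
            for row in chunk:
                if keep(row):
                    writer.writerow(row)
                    written += 1
    return written
```

Peak memory here depends on `chunk_rows` rather than on the size of the input file, which is the property that matters for avoiding OOM kills.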

Extras
Removed a couple of unused files including Dockerfile.crcPDO (#447).

jjacobson95 · 2025-08-10T23:05:37Z

Just noting, I think all bugs should be fixed now (fingers crossed). Attempting a full build; if there are no bumps, this will probably take about 48 hrs.

I did remove the original improve_mapping files as we've changed such a huge amount of the data and renamed many of the datasets, so new ones will be generated with this build.

Edit: New issue appeared with the GDC tool, #449

jjacobson95 added 23 commits July 30, 2025 16:46

Fix beataml Drug issue

c5acf71

liverpdo fixes

c4df787

added novartis to build_all.py. update for liverpdo drugs

a96c111

another liverpdo drug update

17377b2

testing pubchem update

1981a3b

working on pubchem

ebc79b5

working on pubchem2

bc5b859

updated pubchem call in build/bladderpdo/02_createBladderPDODrugsFile.py

799c636

Large drug generation overhaul

7470735

reduced drugs in broad_sanger for debugging

7cf7dd1

bug fix

a40eca2

changed to random 10 instead of first test for debugging

7f57630

Speed up Docker build (and debug process) through optimizing dockerfiles for caching

44ab62c

Make sure helper script is actually added to the dockerfile

2fafd15

bug fix in join

e3670b0

bug fix on join

54a9254

Sorted after joining

aee1a1d

ensure that first drug in first file starts at SMI_1 instead of SMI_2

7f39128

Turning off test steps. Made a change to HCMI that should speed up I hope

88083fe

SarcPDO issues fixed for mutations and experiments

533f66b

fixes liverpdo experiments

797f37c

Updated mapping scripts with all datasets and removed cptac by default. Removed tons of print statements so debugging the full build would be easier

09fb9e5

Added robust methods to download files for broad_sanger omics

4d74714

jjacobson95 marked this pull request as draft August 5, 2025 17:08

jjacobson95 added 6 commits August 5, 2025 14:28

Dockerfile optimization. Attempting to fix broad_sanger. Hide warnings in drug descriptor file.

509b170

tiny changes. 05b_separate_datasets.py working now

c349fc6

pinning polars-lts-cpu to the original version as polars pin. This is used in the drug generation as well, so we need to keep it this ver

8018f8b

Added 3 x retry to build_all.py for each step that fails. Attempting to stream hcmi data instead of hold in storage

c7171f2

Remove incorrectly-cased Dockerfile.crcPDO and add Dockerfile.crcpdo

d628dd3

Merge remote-tracking branch 'origin/main' into build_2.2_bug_fixes

e3f4df3

jjacobson95 added 11 commits August 6, 2025 22:42

HCMI data streaming finally seems like it mightttt be working

b347513

Handle 503 Gateway errors better. Pubchem ignore_chems updated to only 404s. Build_all retries set at 3 and 10 min

4630da8

Renamed build to coderbuild. Hopefully I got all references, there were quite a few

9881108

Renamed All PDO/PDX Datasets. Modified all files that reference the datasets. This was hundreds of references so its possible I missed something or capitalization is off somewhere

5cf3dac

Adding missed references to build/coderbuild

e98d88d

Adding more missed name changes

3e0aea2

Adding just a couple more references

afdb5f7

Patch fix for a weird bug

f2535f8

apparently pl.scan_csv can't handle gzipped files. fixed hcmi stream

57b9fe0

Removed previous mapping files because vast dataset changes and renaming

70b50ff

Final hcmi fix I hope. I just used bash and subprocess to dedup instead of polars. Way less RAM

759cef3

jjacobson95 added 8 commits August 11, 2025 22:49

Chunking added to hcmi and pancreatic datasets. Added an extra check for rare bug where a MAF file is empty and it crashes. Moved HCMI to the end of build

74c14a6

fixed non-issue with novartis call

bc74c6c

added an option to build all to start after exps are done

112ab88

Polars version shift requires some updates, likely more to come

a516384

dataset version update

68145dc

Manually adding mapping files

26d3e69

manually adding mapping files

71ce1d5

updated article link in dataset.yml

67f40bb

jjacobson95 requested a review from sgosline August 18, 2025 21:31

jjacobson95 marked this pull request as ready for review August 19, 2025 18:26

jjacobson95 merged commit 17392c1 into main Aug 19, 2025

This was referenced Aug 27, 2025

HCMI Memory issue when processing Transcriptomics files #444

Closed

PDO dataset names aren't always PDO datasets #443

Closed
