matched_variants.txt.gz empty when pgsc_calc is run with many scores, resulting in KeyError: 'CHR:POS:A0:A1' and pgscatalog-intersect failure. #361
Comments
I believe there may be a problem with heapq or something similar in intersect_cli.py. Running the same command with ~20 scores worked fine.
This step with
We will look into that (cc @nebfield)
I don't know why, but I can replicate the error when running many scores, and it goes away with fewer scores. If this step truly does not depend on the number of scores, perhaps some other step is using more resources in the background and this task is the one that fails?
If you run the
The same error/failure occurs with .command.run.
If you edit that script to request more memory, does it solve the problem?
Is there a way to do this at the beginning of the run, such as a configuration file that can be modified?
pgsc_calc/modules/local/ancestry/intersect_variants.nf Lines 1 to 5 in 96fbb23
Could replace the
Thank you; I will try that. I see the memory label in conf/base.config. Line 43 in 96fbb23
Then, in modules/local/ancestry/intersect_variants.nf, I will change:
to:
Description of the bug
reference_variants.txt.gz is empty, containing only the header. This problem did not occur when running ~30 scores, but it does occur when running ~100 scores. This causes the pgscatalog-intersect step to crash.
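For anyone who wants to confirm the same symptom in their own work directory, here is a minimal sketch of a header-only check. It assumes the file is gzipped, tab-separated text; `count_data_rows` is a hypothetical helper written for this comment and is not part of pgsc_calc or pygscatalog. The demo writes its own header-only file so it runs without any pipeline output.

```python
import gzip
import os
import tempfile


def count_data_rows(path):
    """Return (header line, number of non-empty lines after the header)."""
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n")
        n_rows = sum(1 for line in handle if line.strip())
    return header, n_rows


# Demo on a header-only file; column names are copied from this report,
# the tab delimiter is an assumption.
columns = ["CHR:POS:A0:A1", "ID_REF", "REF_REF", "IS_INDEL", "STRANDAMB", "IS_MA_REF"]
demo_path = os.path.join(tempfile.mkdtemp(), "reference_variants.txt.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("\t".join(columns) + "\n")

header, n_rows = count_data_rows(demo_path)
print(header.split("\t"), n_rows)
```

Pointing `count_data_rows` at the real reference_variants.txt.gz in the task's work directory should report zero data rows if the file is affected in the way described above.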
Command used and terminal output
Code:
https://github.com/PGScatalog/pygscatalog/blob/main/pgscatalog.match/src/pgscatalog/match/cli/intersect_cli.py
Since reference_variants is empty, the error makes sense.
Data Processing: The script successfully processed 84,805,772 reference variants and wrote them to temporary files. The temporary files were correctly written with the expected structure, including the "CHR:POS:A0:A1" column.
Idea:
When attempting to merge these temporary files and write the final output, the system ran out of available RAM. This memory exhaustion caused the heapq.merge operation to fail silently, right after the file was opened and the header was written. As a result, only the header was written to the reference_variants.txt.gz file before the process was interrupted. Inspecting the Nextflow run neither confirmed nor ruled out this idea. Perhaps it needs more than 4 GB?
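To make the hypothesis concrete, here is a minimal sketch of the general pattern: sorted temporary files merged with heapq.merge into a gzipped output whose header is written first. This is not the actual intersect_cli.py code; `write_sorted_chunk`, `merge_chunks`, and the toy variant IDs are made up for illustration.

```python
import gzip
import heapq
import os
import tempfile


def write_sorted_chunk(lines, directory):
    """Write one sorted chunk of lines to a temporary text file; return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".txt", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(line + "\n" for line in sorted(lines))
    return path


def merge_chunks(chunk_paths, out_path, header):
    """Stream-merge already-sorted chunk files into a gzipped output file."""
    handles = [open(path) for path in chunk_paths]
    try:
        with gzip.open(out_path, "wt") as out:
            out.write(header + "\n")  # header is written before any merged row
            # heapq.merge is lazy: it keeps one pending line per input file
            for line in heapq.merge(*handles):
                out.write(line)
    finally:
        for handle in handles:
            handle.close()


workdir = tempfile.mkdtemp()
chunks = [
    write_sorted_chunk(["1:300:A:G", "1:100:C:T"], workdir),  # toy IDs
    write_sorted_chunk(["1:200:G:A", "1:400:T:C"], workdir),
]
out_path = os.path.join(workdir, "merged.txt.gz")
merge_chunks(chunks, out_path, header="CHR:POS:A0:A1")

with gzip.open(out_path, "rt") as handle:
    merged = handle.read().splitlines()
print(merged)
```

In this pattern the header is written before the first merged row, so a run interrupted at that point would be consistent with a header-only output. Note that heapq.merge itself streams its inputs and holds only one pending line per file, so if memory is the cause, the pressure may come from elsewhere in the step. The sketch does not confirm either way.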
PGSCATALOG_PGSCCALC:PGSCCALC:MAKE_COMPATIBLE:PLINK2_VCF task:
PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:EXTRACT_DATABASE task:
PGSCATALOG_PGSCCALC:PGSCCALC:INPUT_CHECK:COMBINE_SCOREFILES task:
PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS task:
CHR:POS:A0:A1 ID_REF REF_REF IS_INDEL STRANDAMB IS_MA_REF