Optimize preprocessing (#65)

* Add new test files * Update test_preprocess.py * Use parquet * Add brians code * Update preprocess.py * sort samples * Remove threads * Update exclude calls logic * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. * removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert "Squashed commit of the following:" This reverts commit ebde7c1. * Remove unused import * don't use mkl 2024.1.0 * update micromamba@v1.8.1 * Isolate failing test * test genotype matrix * Revert "test genotype matrix" This reverts commit 6deee9b. * Revert "Isolate failing test" This reverts commit 6a11fe3. * fixup! Format Python code with psf/black pull_request * remove files * Delete variants.tsv.gz * Update test_preprocess.py * Update test_preprocess.py * fixup! Format Python code with psf/black pull_request * Update test_preprocess.py * Update test-runner.yml * one test * Revert "one test" This reverts commit 05e4578. * Revert "Update test-runner.yml" This reverts commit ff78d30. * update call filter test data * Update expected data * Update deeprvat_preprocessing_env.yml Remove joblib * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. * removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert change of micromamba * Ruff check * Squashed commit of the following: commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) * fixed bugs in the annotation pipeline based on issues #61, #62 and #63. * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> --------- Co-authored-by: PMBio <PMBio@users.noreply.github.com>
PMBio · Apr 16, 2024 · 24b3af5 · 24b3af5
1 parent ae5c83e
commit 24b3af5
Show file tree

Hide file tree

Showing 26 changed files with 213 additions and 126 deletions.
diff --git a/deeprvat/preprocessing/preprocess.py b/deeprvat/preprocessing/preprocess.py
diff --git a/deeprvat_preprocessing_env.yml b/deeprvat_preprocessing_env.yml
@@ -6,7 +6,6 @@ dependencies:
   - python=3.8
   - click=8.1.3
   - h5py=3.8.0
-  - joblib=1.0.1
   - numpy=1.21.2
   - pandas=1.5.1
   - tqdm=4.59.0

diff --git a/docs/preprocessing.md b/docs/preprocessing.md
@@ -53,9 +53,6 @@ included_chromosomes : [21,22]
 # The format of the name of the "raw" vcf files
 vcf_files_list: vcf_files_list.txt
 
-# Number of threads to use in the preprocessing script, separate from snakemake threads
-preprocess_threads: 16
-
 # If you need to run a cmd to load bcf and samtools specify it here, see example
 bcftools_load_cmd : # module load bcftools/1.10.2 &&
 samtools_load_cmd : # module load samtools/1.9 &&

diff --git a/pipelines/config/deeprvat_preprocess_config.yaml b/pipelines/config/deeprvat_preprocess_config.yaml
@@ -4,8 +4,6 @@ included_chromosomes : [21,22]
 # The format of the name of the "raw" vcf files
 vcf_files_list: vcf_files_list.txt
 
-# Number of threads to use in the preprocessing script, separate from snakemake threads
-preprocess_threads: 16
 
 # If you need to run a cmd to load bcf and samtools specify it here, see example
 bcftools_load_cmd : # module load bcftools/1.10.2 &&

diff --git a/pipelines/preprocess_no_qc.snakefile b/pipelines/preprocess_no_qc.snakefile
@@ -24,8 +24,7 @@ rule preprocess_no_qc:
                 f"--exclude-variants {qc_duplicate_vars_dir}",
                 "--chromosomes ",
                 ",".join(str(chr) for chr in set(chromosomes)),
-                f"--threads {preprocess_threads}",
-                "{input.variants}",
+                "{input.variants_parquet}",
                 "{input.samples}",
                 f"{sparse_dir}",
                 f"{preprocessed_dir / 'genotypes'}",

diff --git a/pipelines/preprocess_with_qc.snakefile b/pipelines/preprocess_with_qc.snakefile
@@ -40,8 +40,7 @@ rule preprocess_with_qc:
                 f"--exclude-samples {qc_filtered_samples_dir}",
                 "--chromosomes ",
                 ",".join(str(chr) for chr in set(chromosomes)),
-                f"--threads {preprocess_threads}",
-                "{input.variants}",
+                "{input.variants_parquet}",
                 "{input.samples}",
                 f"{sparse_dir}",
                 f"{preprocessed_dir / 'genotypes'}",

diff --git a/pipelines/preprocessing/preprocess.snakefile b/pipelines/preprocessing/preprocess.snakefile
@@ -14,8 +14,6 @@ working_dir = Path(config["working_dir"])
 preprocessed_dir = working_dir / config["preprocessed_dir_name"]
 reference_dir = working_dir / config["reference_dir_name"]
 
-preprocess_threads = config["preprocess_threads"]
-
 fasta_file = reference_dir / config["reference_fasta_file"]
 fasta_index_file = reference_dir / f"{config['reference_fasta_file']}.fai"
 

diff --git a/tests/preprocessing/test_data/combine_genotypes/combine_chr1_chr2/input/genotypes_chr1.h5 b/tests/preprocessing/test_data/combine_genotypes/combine_chr1_chr2/input/genotypes_chr1.h5
diff --git a/tests/preprocessing/test_data/combine_genotypes/combine_chr1_chr2/input/genotypes_chr2.h5 b/tests/preprocessing/test_data/combine_genotypes/combine_chr1_chr2/input/genotypes_chr2.h5
diff --git a/..._combine_sparse_gt/filter_calls_variants_samples_minimal_split/expected/expected_data.npz b/..._combine_sparse_gt/filter_calls_variants_samples_minimal_split/expected/expected_data.npz
diff --git a/...ter_calls_variants_samples_minimal_split/input/qc/calls/chr1/excluded_calls_untidy.tsv.gz b/...ter_calls_variants_samples_minimal_split/input/qc/calls/chr1/excluded_calls_untidy.tsv.gz
diff --git a/...rse_gt/filter_calls_variants_samples_minimal_split/input/qc/calls/chr1/input_c1_b1.tsv.gz b/...rse_gt/filter_calls_variants_samples_minimal_split/input/qc/calls/chr1/input_c1_b1.tsv.gz
diff --git a/...rse_gt/filter_calls_variants_samples_minimal_split/input/qc/calls/chr1/input_c1_b2.tsv.gz b/...rse_gt/filter_calls_variants_samples_minimal_split/input/qc/calls/chr1/input_c1_b2.tsv.gz
diff --git a/...c/calls/chr2/excluded_calls_untidy.tsv.gz → ...it/input/qc/calls/chr2/input_c2_b1.tsv.gz b/...c/calls/chr2/excluded_calls_untidy.tsv.gz → ...it/input/qc/calls/chr2/input_c2_b1.tsv.gz
diff --git a/..._and_combine_sparse_gt/filter_calls_variants_samples_minimal_split/input/variants.parquet b/..._and_combine_sparse_gt/filter_calls_variants_samples_minimal_split/input/variants.parquet
diff --git a/...s_and_combine_sparse_gt/filter_calls_variants_samples_minimal_split/input/variants.tsv.gz b/...s_and_combine_sparse_gt/filter_calls_variants_samples_minimal_split/input/variants.tsv.gz
diff --git a/...g/test_data/process_and_combine_sparse_gt/no_filters_minimal_split/input/variants.parquet b/...g/test_data/process_and_combine_sparse_gt/no_filters_minimal_split/input/variants.parquet
diff --git a/...ng/test_data/process_and_combine_sparse_gt/no_filters_minimal_split/input/variants.tsv.gz b/...ng/test_data/process_and_combine_sparse_gt/no_filters_minimal_split/input/variants.tsv.gz
diff --git a/...processing/test_data/process_sparse_gt/filter_calls_minimal/input/qc/chr1/input_c1_b1.tsv b/...processing/test_data/process_sparse_gt/filter_calls_minimal/input/qc/chr1/input_c1_b1.tsv
@@ -0,0 +1,7 @@
+chr1	51928	G	A	100100
+chr1	54708	G	C	100101
+chr1	54490	G	A	100104
+chr1	51898	C	A	100107
+chr1	54708	G	C	100101
+chr1	54753	T	G	100107
+chr1	54490	G	A	100108
diff --git a/tests/preprocessing/test_data/process_sparse_gt/filter_calls_minimal/input/variants.parquet b/tests/preprocessing/test_data/process_sparse_gt/filter_calls_minimal/input/variants.parquet
diff --git a/...rocess_sparse_gt/filter_calls_vars_samples_minimal/input/qc/calls/chr1/input_c1_b1.tsv.gz b/...rocess_sparse_gt/filter_calls_vars_samples_minimal/input/qc/calls/chr1/input_c1_b1.tsv.gz
diff --git a/...sing/test_data/process_sparse_gt/filter_calls_vars_samples_minimal/input/variants.parquet b/...sing/test_data/process_sparse_gt/filter_calls_vars_samples_minimal/input/variants.parquet
diff --git a/...s/preprocessing/test_data/process_sparse_gt/filter_samples_minimal/input/variants.parquet b/...s/preprocessing/test_data/process_sparse_gt/filter_samples_minimal/input/variants.parquet
diff --git a/.../preprocessing/test_data/process_sparse_gt/filter_variants_minimal/input/variants.parquet b/.../preprocessing/test_data/process_sparse_gt/filter_variants_minimal/input/variants.parquet
diff --git a/tests/preprocessing/test_data/process_sparse_gt/no_filters_minimal/input/variants.parquet b/tests/preprocessing/test_data/process_sparse_gt/no_filters_minimal/input/variants.parquet
diff --git a/tests/preprocessing/test_preprocess.py b/tests/preprocessing/test_preprocess.py
@@ -86,7 +86,7 @@ def test_process_sparse_gt_file(
 
     test_data_input_dir = current_test_data_dir / "input"
 
-    variant_file = test_data_input_dir / "variants.tsv.gz"
+    variant_file = test_data_input_dir / "variants.parquet"
     samples_file = test_data_input_dir / "samples_chr.csv"
     sparse_gt_dir = test_data_input_dir / "sparse_gt"
 
@@ -99,13 +99,15 @@ def test_process_sparse_gt_file(
     cli_parameters = [
         "process-sparse-gt",
         *extra_cli_params,
+        "--chunksize",
+        "3",
         variant_file.as_posix(),
         samples_file.as_posix(),
         sparse_gt_dir.as_posix(),
         out_file_base.as_posix(),
     ]
 
-    result = cli_runner.invoke(preprocess_cli, cli_parameters)
+    result = cli_runner.invoke(preprocess_cli, cli_parameters, catch_exceptions=False)
     assert result.exit_code == 0
 
     h5_file = out_file_base.as_posix().replace("genotypes", genotype_file_name)
@@ -149,11 +151,13 @@ def test_combine_genotypes(test_data_name_dir, input_h5, result_h5, tmp_path):
 
     cli_parameters = [
         "combine-genotypes",
+        "--chunksize",
+        3,
         *[(test_data_input_dir / h5f).as_posix() for h5f in input_h5],
         combined_output_h5.as_posix(),
     ]
 
-    result = cli_runner.invoke(preprocess_cli, cli_parameters)
+    result = cli_runner.invoke(preprocess_cli, cli_parameters, catch_exceptions=False)
     assert result.exit_code == 0
 
     written_samples, written_variant_matrix, written_genotype_matrix = load_h5_archive(
@@ -268,7 +272,7 @@ def test_process_and_combine_sparse_gt(
 
     test_data_input_dir = current_test_data_dir / "input"
 
-    variant_file = test_data_input_dir / "variants.tsv.gz"
+    variant_file = test_data_input_dir / "variants.parquet"
     samples_file = test_data_input_dir / "samples_chr.csv"
     sparse_gt_dir = test_data_input_dir / "sparse_gt"
 
@@ -282,22 +286,30 @@ def test_process_and_combine_sparse_gt(
     cli_parameters_process = [
         "process-sparse-gt",
         *extra_cli_params,
+        "--chunksize",
+        "3",
         variant_file.as_posix(),
         samples_file.as_posix(),
         sparse_gt_dir.as_posix(),
         out_file_base.as_posix(),
     ]
 
-    result_process = cli_runner.invoke(preprocess_cli, cli_parameters_process)
+    result_process = cli_runner.invoke(
+        preprocess_cli, cli_parameters_process, catch_exceptions=False
+    )
     assert result_process.exit_code == 0
 
     cli_parameters_combine = [
         "combine-genotypes",
+        "--chunksize",
+        3,
         *[(preprocessed_dir / h5f).as_posix() for h5f in input_h5],
         combined_output_h5.as_posix(),
     ]
 
-    result_combine = cli_runner.invoke(preprocess_cli, cli_parameters_combine)
+    result_combine = cli_runner.invoke(
+        preprocess_cli, cli_parameters_combine, catch_exceptions=False
+    )
     assert result_combine.exit_code == 0
 
     written_samples, written_variant_matrix, written_genotype_matrix = load_h5_archive(