Skip to content

Commit

Permalink
Optimize preprocessing (#65)
Browse files Browse the repository at this point in the history
* Add new test files

* Update test_preprocess.py

* Use parquet

* Add brians code

* Update preprocess.py

* sort samples

* Remove threads

* Update exclude calls logic

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & repliation)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation like trainign

    * add missing cv_utils

    * write average burdens or each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replictaion and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocssing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * udpate utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

* Revert "Squashed commit of the following:"

This reverts commit ebde7c1.

* Remove unused import

* don't use mkl 2024.1.0

* update micromamba@v1.8.1

* Isolate failing test

* test genotype matrix

* Revert "test genotype matrix"

This reverts commit 6deee9b.

* Revert "Isolate failing test"

This reverts commit 6a11fe3.

* fixup! Format Python code with psf/black pull_request

* remove files

* Delete variants.tsv.gz

* Update test_preprocess.py

* Update test_preprocess.py

* fixup! Format Python code with psf/black pull_request

* Update test_preprocess.py

* Update test-runner.yml

* one test

* Revert "one test"

This reverts commit 05e4578.

* Revert "Update test-runner.yml"

This reverts commit ff78d30.

* update call filter test data

* Update expected data

* Update deeprvat_preprocessing_env.yml

Remove joblib

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & repliation)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation like trainign

    * add missing cv_utils

    * write average burdens or each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replictaion and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocssing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * udpate utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

* Revert change of micromamba

* Ruff check

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
  • Loading branch information
endast and PMBio committed Apr 16, 2024
1 parent ae5c83e commit 24b3af5
Show file tree
Hide file tree
Showing 26 changed files with 213 additions and 126 deletions.
294 changes: 186 additions & 108 deletions deeprvat/preprocessing/preprocess.py

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion deeprvat_preprocessing_env.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@ dependencies:
- python=3.8
- click=8.1.3
- h5py=3.8.0
- joblib=1.0.1
- numpy=1.21.2
- pandas=1.5.1
- tqdm=4.59.0
Expand Down
3 changes: 0 additions & 3 deletions docs/preprocessing.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,9 +53,6 @@ included_chromosomes : [21,22]
# The format of the name of the "raw" vcf files
vcf_files_list: vcf_files_list.txt

# Number of threads to use in the preprocessing script, separate from snakemake threads
preprocess_threads: 16

# If you need to run a cmd to load bcf and samtools specify it here, see example
bcftools_load_cmd : # module load bcftools/1.10.2 &&
samtools_load_cmd : # module load samtools/1.9 &&
Expand Down
2 changes: 0 additions & 2 deletions pipelines/config/deeprvat_preprocess_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,6 @@ included_chromosomes : [21,22]
# The format of the name of the "raw" vcf files
vcf_files_list: vcf_files_list.txt

# Number of threads to use in the preprocessing script, separate from snakemake threads
preprocess_threads: 16

# If you need to run a cmd to load bcf and samtools specify it here, see example
bcftools_load_cmd : # module load bcftools/1.10.2 &&
Expand Down
3 changes: 1 addition & 2 deletions pipelines/preprocess_no_qc.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -24,8 +24,7 @@ rule preprocess_no_qc:
f"--exclude-variants {qc_duplicate_vars_dir}",
"--chromosomes ",
",".join(str(chr) for chr in set(chromosomes)),
f"--threads {preprocess_threads}",
"{input.variants}",
"{input.variants_parquet}",
"{input.samples}",
f"{sparse_dir}",
f"{preprocessed_dir / 'genotypes'}",
Expand Down
3 changes: 1 addition & 2 deletions pipelines/preprocess_with_qc.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -40,8 +40,7 @@ rule preprocess_with_qc:
f"--exclude-samples {qc_filtered_samples_dir}",
"--chromosomes ",
",".join(str(chr) for chr in set(chromosomes)),
f"--threads {preprocess_threads}",
"{input.variants}",
"{input.variants_parquet}",
"{input.samples}",
f"{sparse_dir}",
f"{preprocessed_dir / 'genotypes'}",
Expand Down
2 changes: 0 additions & 2 deletions pipelines/preprocessing/preprocess.snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,6 @@ working_dir = Path(config["working_dir"])
preprocessed_dir = working_dir / config["preprocessed_dir_name"]
reference_dir = working_dir / config["reference_dir_name"]

preprocess_threads = config["preprocess_threads"]

fasta_file = reference_dir / config["reference_fasta_file"]
fasta_index_file = reference_dir / f"{config['reference_fasta_file']}.fai"

Expand Down
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
chr1 51928 G A 100100
chr1 54708 G C 100101
chr1 54490 G A 100104
chr1 51898 C A 100107
chr1 54708 G C 100101
chr1 54753 T G 100107
chr1 54490 G A 100108
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
24 changes: 18 additions & 6 deletions tests/preprocessing/test_preprocess.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ def test_process_sparse_gt_file(

test_data_input_dir = current_test_data_dir / "input"

variant_file = test_data_input_dir / "variants.tsv.gz"
variant_file = test_data_input_dir / "variants.parquet"
samples_file = test_data_input_dir / "samples_chr.csv"
sparse_gt_dir = test_data_input_dir / "sparse_gt"

Expand All @@ -99,13 +99,15 @@ def test_process_sparse_gt_file(
cli_parameters = [
"process-sparse-gt",
*extra_cli_params,
"--chunksize",
"3",
variant_file.as_posix(),
samples_file.as_posix(),
sparse_gt_dir.as_posix(),
out_file_base.as_posix(),
]

result = cli_runner.invoke(preprocess_cli, cli_parameters)
result = cli_runner.invoke(preprocess_cli, cli_parameters, catch_exceptions=False)
assert result.exit_code == 0

h5_file = out_file_base.as_posix().replace("genotypes", genotype_file_name)
Expand Down Expand Up @@ -149,11 +151,13 @@ def test_combine_genotypes(test_data_name_dir, input_h5, result_h5, tmp_path):

cli_parameters = [
"combine-genotypes",
"--chunksize",
3,
*[(test_data_input_dir / h5f).as_posix() for h5f in input_h5],
combined_output_h5.as_posix(),
]

result = cli_runner.invoke(preprocess_cli, cli_parameters)
result = cli_runner.invoke(preprocess_cli, cli_parameters, catch_exceptions=False)
assert result.exit_code == 0

written_samples, written_variant_matrix, written_genotype_matrix = load_h5_archive(
Expand Down Expand Up @@ -268,7 +272,7 @@ def test_process_and_combine_sparse_gt(

test_data_input_dir = current_test_data_dir / "input"

variant_file = test_data_input_dir / "variants.tsv.gz"
variant_file = test_data_input_dir / "variants.parquet"
samples_file = test_data_input_dir / "samples_chr.csv"
sparse_gt_dir = test_data_input_dir / "sparse_gt"

Expand All @@ -282,22 +286,30 @@ def test_process_and_combine_sparse_gt(
cli_parameters_process = [
"process-sparse-gt",
*extra_cli_params,
"--chunksize",
"3",
variant_file.as_posix(),
samples_file.as_posix(),
sparse_gt_dir.as_posix(),
out_file_base.as_posix(),
]

result_process = cli_runner.invoke(preprocess_cli, cli_parameters_process)
result_process = cli_runner.invoke(
preprocess_cli, cli_parameters_process, catch_exceptions=False
)
assert result_process.exit_code == 0

cli_parameters_combine = [
"combine-genotypes",
"--chunksize",
3,
*[(preprocessed_dir / h5f).as_posix() for h5f in input_h5],
combined_output_h5.as_posix(),
]

result_combine = cli_runner.invoke(preprocess_cli, cli_parameters_combine)
result_combine = cli_runner.invoke(
preprocess_cli, cli_parameters_combine, catch_exceptions=False
)
assert result_combine.exit_code == 0

written_samples, written_variant_matrix, written_genotype_matrix = load_h5_archive(
Expand Down

0 comments on commit 24b3af5

Please sign in to comment.