Add sample missingness rule to preprocessing pipeline (#50) · PMBio/deeprvat@380e140

Add sample missingness rule to preprocessing pipeline (#50)

* add qc_indmiss

* Update preprocess_with_qc.snakefile

* Fix csv

* add process_individual_missingness cmd

* add process_individual_missingness

* Use separate variable for sample_path

* Only write sample to indmiss file

* add test_process_individual_missingness tests

* Add sample missingness to workflow

* Update dag images in doc

* Update test_preprocess.py

* add back create_excluded_samples_dir

* Cleanup pipeline

* fixup! Format Python code with psf/black pull_request

* Update preprocess.py

* fixup! Format Python code with psf/black pull_request

* Fix ruff errors

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add brians code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Revert "Squashed commit of the following:"

This reverts commit 4e9b47d.

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>

endast and PMBio committed Apr 16, 2024

1 parent 24b3af5 commit 380e140

deeprvat/preprocessing/preprocess.py

            
    @@ -164,6 +164,57 @@ def get_file_chromosome(file, col_names, chrom_field="chrom"):
         return chrom
    +def parse_file_path_list(file_path_list_path: Path):
    +    with open(file_path_list_path) as file:
    +        vcf_files = [Path(line.rstrip()) for line in file]
    +        vcf_stems = [vf.stem.replace(".vcf", "") for vf in vcf_files]
    +        assert len(vcf_stems) == len(vcf_files)
    +        vcf_look_up = {stem: file for stem, file in zip(vcf_stems, vcf_files)}
    +        return vcf_stems, vcf_files, vcf_look_up
    +@cli.command()
    +@click.option("--threshold", type=float, default=0.1)
    +@click.argument("file-paths-list", type=click.Path(exists=True))
    +@click.argument("imiss-dir", type=click.Path(exists=True))
    +@click.argument("out-file", type=click.Path())
    +def process_individual_missingness(
    +    threshold: float, file_paths_list: Path, imiss_dir: str, out_file: str
    +):
    +    vcf_stems, _, _ = parse_file_path_list(file_paths_list)
    +    imiss_dir = Path(imiss_dir)
    +    imiss_blocks = []
    +    total_variants = 0
    +    for vcf_stem in tqdm(vcf_stems, desc="VCFs"):
    +        missing_counts = pd.read_csv(
    +            imiss_dir / "samples" / f"{vcf_stem}.tsv",
    +            sep="\t",
    +            header=None,
    +            usecols=[1, 11],
    +        )
    +        missing_counts.columns = ["sample", "n_missing"]
    +        imiss_blocks.append(missing_counts)
    +        total_variants += pd.read_csv(
    +            imiss_dir / "sites" / f"{vcf_stem}.tsv",
    +            header=None,
    +            sep="\t",
    +        ).iloc[0, 1]
    +    imiss = pd.concat(imiss_blocks, ignore_index=True)
    +    sample_groups = imiss.groupby("sample")
    +    sample_counts = sample_groups.agg(np.sum).reset_index()
    +    sample_counts["missingness"] = sample_counts["n_missing"] / total_variants
    +    sample_counts = sample_counts.loc[
    +        sample_counts["missingness"] >= threshold, ["sample", "missingness"]
    +    ]
    +    sample_counts[["sample"]].to_csv(out_file, index=False, header=None)
     @cli.command()
     @click.option("--chunksize", type=int, default=1000)
     @click.option("--exclude-variants", type=click.Path(exists=True), multiple=True)
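
The aggregation performed by process_individual_missingness can be illustrated without any file I/O. The sketch below is not the code from this commit: it uses plain dictionaries instead of pandas so that it runs standalone, and the function name flag_high_missingness and the toy counts are hypothetical. It sums each sample's missing-genotype count across files, divides by the total number of variant sites, and reports the samples at or above the threshold (0.1 by default, matching the --threshold option).

```python
from collections import Counter


def flag_high_missingness(per_file_counts, total_variants, threshold=0.1):
    """Return sorted sample IDs whose overall missingness is >= threshold.

    per_file_counts: one dict per input file, mapping sample ID to the
        number of missing genotypes in that file (hypothetical stand-in
        for the per-file tables read in the diff).
    total_variants: total number of variant sites across all files.
    """
    totals = Counter()
    for counts in per_file_counts:
        # Counter.update adds the counts key by key.
        totals.update(counts)
    return sorted(
        sample
        for sample, n_missing in totals.items()
        if n_missing / total_variants >= threshold
    )


# Toy data: two files with 10 variant sites each (20 in total).
file_a = {"s1": 0, "s2": 1, "s3": 5}
file_b = {"s1": 1, "s2": 0, "s3": 4}
flagged = flag_high_missingness([file_a, file_b], total_variants=20)
print(flagged)
```

Here s1 and s2 each have 1 missing call out of 20 (0.05) and stay below the threshold, while s3 has 9 out of 20 (0.45) and is flagged.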

    @@ -172,7 +223,7 @@ def get_file_chromosome(file, col_names, chrom_field="chrom"):
     @click.option("--chromosomes", type=str)
     @click.option("--skip-sanity-checks", is_flag=True)
     @click.argument("variant-file", type=click.Path(exists=True))
    -@click.argument("samples", type=click.Path(exists=True))
    +@click.argument("samples-path", type=click.Path(exists=True))
     @click.argument("sparse-gt", type=click.Path(exists=True))
     @click.argument("out-file", type=click.Path())
     def process_sparse_gt(

    @@ -183,7 +234,7 @@ def process_sparse_gt(
         chromosomes: Optional[str],
         skip_sanity_checks: bool,
         variant_file: str,
    -    samples: str,
    +    samples_path: str,
         sparse_gt: str,
         out_file: str,
     ):

    @@ -215,32 +266,36 @@ def process_sparse_gt(
             )["id"]
             variants = variants[~variants["id"].isin(variant_ids_to_exclude)]
             if not skip_sanity_checks:
    -            try:
    -                assert total_variants - len(variants) == len(variants_to_exclude)
    -            except Exception as e:
    -                logger.error(e)
    -                import ipdb
    -                ipdb.set_trace()
    +            assert total_variants - len(variants) == len(variants_to_exclude)
         logging.info(f"Dropped {total_variants - len(variants)} variants")
         logging.info(f"...done ({time.time() - start_time} s)")
         logging.info("Processing samples")
    -    samples = set(pd.read_csv(samples, header=None).loc[:, 0])
    +    samples = set(pd.read_csv(samples_path, header=None).loc[:, 0])
         if exclude_samples is not None:
             total_samples = len(samples)
    -        if sample_exclusion_files := list(Path(exclude_samples).glob("*.csv")):
    -            samples_to_exclude = set(
    -                pd.concat(
    -                    [
    -                        pd.read_csv(s, header=None).loc[:, 0]
    -                        for s in sample_exclusion_files
    -                    ],
    -                    ignore_index=True,
    +        if sample_exclusion_files := list(Path(exclude_samples).rglob("*.csv")):
    +            sample_exclusion_files = [
    +                s for s in sample_exclusion_files if s.stat().st_size > 0
    +            ]
    +            if sample_exclusion_files:
    +                logging.info(
    +                    f"Found {len(sample_exclusion_files)} sample exclusion files"
                     )
    -            )
    +                samples_to_exclude = set(
    +                    pd.concat(
    +                        [
    +                            pd.read_csv(s, header=None).loc[:, 0]
    +                            for s in sample_exclusion_files
    +                        ],
    +                        ignore_index=True,
    +                    )
    +                )
    +            else:
    +                samples_to_exclude = set()
                 samples -= samples_to_exclude
                 logging.info(f"Dropped {total_samples - len(samples)} samples")
             else:
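
The reworked sample-exclusion step in process_sparse_gt searches the exclusion directory recursively (rglob instead of glob), skips zero-byte files, and falls back to an empty exclusion set when no usable file remains. Skipping empty files presumably matters because pandas.read_csv raises EmptyDataError on a file with no content. The following is a minimal standard-library sketch of that logic, not the commit's code: the helper name load_excluded_samples, the directory layout, and the sample IDs are hypothetical, and the csv module stands in for pandas.

```python
import csv
import tempfile
from pathlib import Path


def load_excluded_samples(exclude_dir):
    """Collect sample IDs from all non-empty CSV files under exclude_dir."""
    excluded = set()
    # rglob also finds files in nested QC subdirectories.
    for path in sorted(Path(exclude_dir).rglob("*.csv")):
        if path.stat().st_size == 0:
            continue  # an empty file contributes no samples
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                if row:
                    excluded.add(row[0])  # sample ID is the first column
    return excluded


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "qc" / "indmiss").mkdir(parents=True)
    (root / "qc" / "indmiss" / "chr1.csv").write_text("s3\n")
    (root / "qc" / "empty.csv").write_text("")  # skipped: zero bytes
    (root / "manual.csv").write_text("s7\ns3\n")
    samples = {"s1", "s3", "s7", "s9"}
    kept = samples - load_excluded_samples(root)

print(sorted(kept))
```

With these toy files, s3 and s7 are excluded and s1 and s9 remain. A directory containing only empty CSV files yields an empty exclusion set, so no samples are dropped.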

docs/_static/preprocess_rulegraph_no_qc.svg
