Add sample missingness rule to preprocessing pipeline (#50)
* add qc_indmiss

* Update preprocess_with_qc.snakefile

* Fix csv

* add process_individual_missingness cmd

* add process_individual_missingness

* Use separate variable for sample_path

* Only write sample to indmiss file

* add test_process_individual_missingness tests

* Add sample missingness to workflow

* Update dag images in doc

* Update test_preprocess.py

* add back create_excluded_samples_dir

* Cleanup pipeline

* fixup! Format Python code with psf/black pull_request

* Update preprocess.py

* fixup! Format Python code with psf/black pull_request

* Fix ruff errors

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add brians code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & repliation)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like trainign

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replictaion and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocssing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * udpate utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & repliation)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like trainign

        * add missing cv_utils

        * write average burdens or each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replictaion and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocssing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * udpate utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Revert "Squashed commit of the following:"

This reverts commit 4e9b47d.

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast and PMBio committed Apr 16, 2024
1 parent 24b3af5 commit 380e140
Showing 29 changed files with 591 additions and 155 deletions.
93 changes: 74 additions & 19 deletions deeprvat/preprocessing/preprocess.py
@@ -164,6 +164,57 @@ def get_file_chromosome(file, col_names, chrom_field="chrom"):
     return chrom


+def parse_file_path_list(file_path_list_path: Path):
+    with open(file_path_list_path) as file:
+        vcf_files = [Path(line.rstrip()) for line in file]
+        vcf_stems = [vf.stem.replace(".vcf", "") for vf in vcf_files]
+
+        assert len(vcf_stems) == len(vcf_files)
+
+        vcf_look_up = {stem: file for stem, file in zip(vcf_stems, vcf_files)}
+
+        return vcf_stems, vcf_files, vcf_look_up
+
+
+@cli.command()
+@click.option("--threshold", type=float, default=0.1)
+@click.argument("file-paths-list", type=click.Path(exists=True))
+@click.argument("imiss-dir", type=click.Path(exists=True))
+@click.argument("out-file", type=click.Path())
+def process_individual_missingness(
+    threshold: float, file_paths_list: Path, imiss_dir: str, out_file: str
+):
+    vcf_stems, _, _ = parse_file_path_list(file_paths_list)
+
+    imiss_dir = Path(imiss_dir)
+
+    imiss_blocks = []
+    total_variants = 0
+    for vcf_stem in tqdm(vcf_stems, desc="VCFs"):
+        missing_counts = pd.read_csv(
+            imiss_dir / "samples" / f"{vcf_stem}.tsv",
+            sep="\t",
+            header=None,
+            usecols=[1, 11],
+        )
+        missing_counts.columns = ["sample", "n_missing"]
+        imiss_blocks.append(missing_counts)
+        total_variants += pd.read_csv(
+            imiss_dir / "sites" / f"{vcf_stem}.tsv",
+            header=None,
+            sep="\t",
+        ).iloc[0, 1]
+
+    imiss = pd.concat(imiss_blocks, ignore_index=True)
+    sample_groups = imiss.groupby("sample")
+    sample_counts = sample_groups.agg(np.sum).reset_index()
+    sample_counts["missingness"] = sample_counts["n_missing"] / total_variants
+    sample_counts = sample_counts.loc[
+        sample_counts["missingness"] >= threshold, ["sample", "missingness"]
+    ]
+    sample_counts[["sample"]].to_csv(out_file, index=False, header=None)
+
+
 @cli.command()
 @click.option("--chunksize", type=int, default=1000)
 @click.option("--exclude-variants", type=click.Path(exists=True), multiple=True)
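Review note: the new `process_individual_missingness` command sums per-sample missing-call counts over all VCFs, divides by the total number of variants, and writes out the samples at or above `--threshold`. The sketch below reproduces only that arithmetic on toy data. The helper name `flag_high_missingness` and the in-memory frames are hypothetical stand-ins for the per-VCF `.tsv` files and are not part of the commit.

```python
import pandas as pd

# Toy per-VCF blocks (hypothetical data): one row per sample with its
# number of missing genotype calls in that VCF.
imiss_blocks = [
    pd.DataFrame({"sample": ["s1", "s2", "s3"], "n_missing": [1, 0, 4]}),
    pd.DataFrame({"sample": ["s1", "s2", "s3"], "n_missing": [0, 1, 5]}),
]
# Assumed total number of variants across both VCFs (10 each).
total_variants = 10 + 10


def flag_high_missingness(blocks, total_variants, threshold=0.1):
    """Return the samples whose overall missingness is >= threshold."""
    imiss = pd.concat(blocks, ignore_index=True)
    # Sum missing calls per sample across all blocks.
    counts = imiss.groupby("sample", as_index=False)["n_missing"].sum()
    counts["missingness"] = counts["n_missing"] / total_variants
    return counts.loc[counts["missingness"] >= threshold, "sample"].tolist()


flagged = flag_high_missingness(imiss_blocks, total_variants)
none_flagged = flag_high_missingness(imiss_blocks, total_variants, threshold=0.5)
```

Here `s3` has 9 missing calls out of 20 variants (0.45) and is the only sample flagged at the default threshold of 0.1. Note that the comparison is inclusive (`>=`), as in the diff.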
@@ -172,7 +223,7 @@ def get_file_chromosome(file, col_names, chrom_field="chrom"):
 @click.option("--chromosomes", type=str)
 @click.option("--skip-sanity-checks", is_flag=True)
 @click.argument("variant-file", type=click.Path(exists=True))
-@click.argument("samples", type=click.Path(exists=True))
+@click.argument("samples-path", type=click.Path(exists=True))
 @click.argument("sparse-gt", type=click.Path(exists=True))
 @click.argument("out-file", type=click.Path())
 def process_sparse_gt(
@@ -183,7 +234,7 @@ def process_sparse_gt(
     chromosomes: Optional[str],
     skip_sanity_checks: bool,
     variant_file: str,
-    samples: str,
+    samples_path: str,
     sparse_gt: str,
     out_file: str,
 ):
@@ -215,32 +266,36 @@ def process_sparse_gt(
         )["id"]
         variants = variants[~variants["id"].isin(variant_ids_to_exclude)]
         if not skip_sanity_checks:
-            try:
-                assert total_variants - len(variants) == len(variants_to_exclude)
-            except Exception as e:
-                logger.error(e)
-                import ipdb
-
-                ipdb.set_trace()
+            assert total_variants - len(variants) == len(variants_to_exclude)

     logging.info(f"Dropped {total_variants - len(variants)} variants")
     logging.info(f"...done ({time.time() - start_time} s)")

     logging.info("Processing samples")
-    samples = set(pd.read_csv(samples, header=None).loc[:, 0])
+    samples = set(pd.read_csv(samples_path, header=None).loc[:, 0])
     if exclude_samples is not None:
         total_samples = len(samples)

-        if sample_exclusion_files := list(Path(exclude_samples).glob("*.csv")):
-            samples_to_exclude = set(
-                pd.concat(
-                    [
-                        pd.read_csv(s, header=None).loc[:, 0]
-                        for s in sample_exclusion_files
-                    ],
-                    ignore_index=True,
+        if sample_exclusion_files := list(Path(exclude_samples).rglob("*.csv")):
+
+            sample_exclusion_files = [
+                s for s in sample_exclusion_files if s.stat().st_size > 0
+            ]
+            if sample_exclusion_files:
+                logging.info(
+                    f"Found {len(sample_exclusion_files)} sample exclusion files"
                 )
-            )
+                samples_to_exclude = set(
+                    pd.concat(
+                        [
+                            pd.read_csv(s, header=None).loc[:, 0]
+                            for s in sample_exclusion_files
+                        ],
+                        ignore_index=True,
+                    )
+                )
+            else:
+                samples_to_exclude = set()
             samples -= samples_to_exclude
             logging.info(f"Dropped {total_samples - len(samples)} samples")
         else:
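Review note: the hunk above switches from `glob` to `rglob`, so exclusion lists in subdirectories are picked up, and it skips zero-byte files before reading them. Skipping them matters because `pd.read_csv` raises `EmptyDataError` on an empty file, which is presumably the motivation. The sketch below illustrates that collection step in isolation. The helper `collect_excluded_samples`, the directory layout, and the sample IDs are all hypothetical, not taken from the repository.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_excluded_samples(exclude_dir):
    """Union of sample IDs from all non-empty *.csv files under exclude_dir."""
    # rglob descends into subdirectories; zero-byte files are skipped
    # because pd.read_csv would raise EmptyDataError on them.
    files = [p for p in Path(exclude_dir).rglob("*.csv") if p.stat().st_size > 0]
    if not files:
        return set()
    return set(
        pd.concat(
            [pd.read_csv(f, header=None).loc[:, 0] for f in files],
            ignore_index=True,
        )
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "qc" / "indmiss").mkdir(parents=True)
    # Nested file: found by rglob, would be missed by a non-recursive glob.
    (root / "qc" / "indmiss" / "a.csv").write_text("s1\ns2\n")
    # Empty file, e.g. a QC step that flagged no samples.
    (root / "b.csv").write_text("")
    (root / "c.csv").write_text("s2\ns3\n")

    excluded = collect_excluded_samples(root)
    remaining = {"s1", "s2", "s3", "s4"} - excluded
```

Returning an empty set when no usable file exists keeps the later set difference well defined in every branch.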
32 changes: 16 additions & 16 deletions docs/_static/preprocess_rulegraph_no_qc.svg