Squashed commit of the following:
commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering so that no annotations are dropped

    * included rulegraph instead of dag

    * based on suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

    * changed the way file stems are determined to account for a 'double ending' on input files

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script without seed genes

    * allow providing a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automate generation of figure 3 (associations & replication)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation-like training

    * add missing cv_utils

    * write average burdens for each combination to a single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replication and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove unneeded code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to main version

    * checkout preprocessing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * update utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
endast committed Apr 10, 2024
1 parent fde64fc commit ebde7c1
Showing 63 changed files with 5,727 additions and 1,022 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -164,4 +164,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
/docs/apidocs/

pipelines/deprecated_pipelines
