Squashed commit of the following:
commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast committed Apr 10, 2024
1 parent 9a45d90 commit 82f962e
Showing 24 changed files with 2,516 additions and 553 deletions.
1,430 changes: 1,135 additions & 295 deletions deeprvat/annotations/annotations.py

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions deeprvat_env_no_gpu.yml
@@ -9,6 +9,7 @@ dependencies:
- dask=2023.5
- fastparquet=0.5
- h5py=3.1
- mkl!=2024.1.0
- numcodecs=0.11
- numpy=1.21
- optuna=2.10
263 changes: 263 additions & 0 deletions docs/_static/annotation_rulegraph.svg
63 changes: 30 additions & 33 deletions docs/annotations.md
@@ -2,15 +2,21 @@

This pipeline is based on [snakemake](https://snakemake.readthedocs.io/en/stable/). It uses [bcftools + samtools](https://www.htslib.org/), as well as [perl](https://www.perl.org/), [deepRiPe](https://ohlerlab.mdc-berlin.de/software/DeepRiPe_140/) and [deepSEA](http://deepsea.princeton.edu/) as well as [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html), including plugins for [primateAI](https://github.com/Illumina/PrimateAI) and [spliceAI](https://github.com/Illumina/SpliceAI). DeepRiPe annotations were acquired using [faatpipe repository by HealthML](https://github.com/HealthML/faatpipe)[[1]](#reference-1-target) and DeepSea annotations were calculated using [kipoi-veff2](https://github.com/kipoi/kipoi-veff2)[[2]](#reference-2-target); abSplice scores were computed using [abSplice](https://github.com/gagneurlab/absplice/)[[3]](#reference-3-target).

![dag](_static/annotation_pipeline_dag.png)
*Figure 1: Example DAG of annotation pipeline using only two bcf files as input.*
![dag](_static/annotation_rulegraph.svg)

*Figure 1: Rulegraph of the annotation pipeline.*

## Output
This pipeline outputs a parquet file including all annotations as well as a file containing IDs to all protein coding genes needed to run DeepRVAT.
Besides this, the pipeline outputs a PCA transformation matrix for deepSEA as well as the means and standard deviations used to standardize deepSEA scores before PCA. This is helpful for recreating results using a different dataset.
Furthermore, the pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and Absplice for each input vcf-file. The tool then concatenates the files, performs PCA on the deepSEA scores and merges the result into a single file.
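
To illustrate why the stored means, standard deviations and PCA matrix are useful, here is a minimal sketch of standardizing scores, fitting a PCA, and re-applying the stored transformation to a different dataset. It assumes plain NumPy arrays with one row per variant; the function names, array shapes and number of components are hypothetical and are not taken from the deeprvat code.

```python
import numpy as np


def fit_standardized_pca(scores, n_components):
    """Standardize each column, then derive a PCA projection matrix via SVD."""
    means = scores.mean(axis=0)
    stds = scores.std(axis=0)
    stds[stds == 0] = 1.0  # guard against constant columns
    standardized = (scores - means) / stds
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    components = vt[:n_components].T  # shape: (n_features, n_components)
    return means, stds, components


def apply_stored_pca(scores, means, stds, components):
    """Project new scores using previously stored statistics and matrix."""
    return ((scores - means) / stds) @ components


# Random placeholder data standing in for deepSEA scores.
rng = np.random.default_rng(0)
train_scores = rng.normal(size=(100, 5))
new_scores = rng.normal(size=(10, 5))

means, stds, components = fit_standardized_pca(train_scores, n_components=2)
projected_train = apply_stored_pca(train_scores, means, stds, components)
projected_new = apply_stored_pca(new_scores, means, stds, components)
```

Because the second dataset is transformed with the statistics of the first, both sets of principal components live in the same coordinate system and remain comparable.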

## Input

The pipeline uses left-normalized bcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT". Any other columns, including genotype information are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as file structure may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
The pipeline uses left-normalized bcf files containing variant information, a reference fasta file as well as a text file that maps data blocks to chromosomes as input. It is expected that the bcf files contain the columns "CHROM" "POS" "ID" "REF" and "ALT". Any other columns, including genotype information, are stripped from the data before annotation tools are used on the data. The variants may be split into several vcf files for each chromosome and each "block" of data. The filenames should then contain the corresponding chromosome and block number. The pattern of the file names, as well as file structure may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml). The pipeline also requires input data and repositories described in [requirements](#requirements).
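
As a rough illustration of file names that carry a chromosome and a block number, the sketch below extracts both with a regular expression. The naming scheme `chr<chrom>_block<block>.bcf` is purely hypothetical; the real pattern is whatever is set in the config file.

```python
import re

# Hypothetical naming scheme; the actual pattern comes from the config file.
FILENAME_PATTERN = re.compile(
    r"chr(?P<chrom>[0-9]+|X|Y)_block(?P<block>[0-9]+)\.bcf$"
)


def parse_chrom_block(filename):
    """Return (chromosome, block number) parsed from a bcf file name."""
    match = FILENAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"File name does not match the pattern: {filename}")
    return match.group("chrom"), int(match.group("block"))


parsed = [parse_chrom_block(name) for name in ["chr1_block0.bcf", "chrX_block12.bcf"]]
```

Keeping the chromosome as a string allows X and Y to be handled alongside the numbered chromosomes.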

(requirements-target)=
## Requirements
(requirements)=
## Requirements

BCFtools as well as HTSlib should be installed on the machine,
- [CADD](https://github.com/kircherlab/CADD-scripts/tree/master/src/scripts) as well as
@@ -20,43 +26,39 @@ BCFtools as well as HTSlib should be installed on the machine,
- [faatpipe](https://github.com/HealthML/faatpipe), and the
- [vep-plugins repository](https://github.com/Ensembl/VEP_plugins/)

will be installed by the pipeline together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
Download path:
should be installed for running the pipeline, together with the [plugins](https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html) for primateAI and spliceAI. Annotation data for CADD, spliceAI and primateAI should be downloaded. The path to the data may be specified in the corresponding [config file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).
Download paths:
- [CADD](https://cadd.bihealth.org/download): "All possible SNVs of GRCh38/hg38" and "gnomad.genomes.r3.0.indel.tsv.gz" incl. their Tabix Indices
- [SpliceAI](https://basespace.illumina.com/s/otSPW8hnhaZR): "genome_scores_v1.3"/"spliceai_scores.raw.snv.hg38.vcf.gz" and "spliceai_scores.raw.indel.hg38.vcf.gz"
- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
- [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz)
Also, a reference GTF file containing transcript annotations should be provided; this can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz).


## Output

The pipeline outputs one annotation file for VEP, CADD, DeepRiPe, DeepSea and Absplice for each input vcf-file. The tool further creates concatenated files for each tool and one merged file containing Scores from AbSplice, VEP incl. CADD, primateAI and spliceAI as well as principal components from DeepSea and DeepRiPe.

## Configure the annotation pipeline
The snakemake annotation pipeline is configured using a yaml file with the format akin to the [example file](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).

The config above would use the following directory structure:
```shell

|-- reference
|--reference
| |-- fasta file


|-- metadata
| |-- pvcf_blocks.txt
| |-- GTF file

|-- preprocessing_workdir
| |--reference
| | |-- fasta file
| |-- norm
| | |-- bcf
| | | |-- bcf_input_files
| | | |-- ...
| | |-- variants
| | | |-- variants.tsv.gz
| |-- preprocessed
| | |-- genotypes.h5


|-- output_dir
| |-- annotations
| | |-- tmp
| | | |-- deepSEA_PCA

|-- repo_dir
| |-- ensembl-vep
@@ -73,21 +75,20 @@ The config above would use the following directory structure:




```

Bcf files created by the [preprocessing pipeline](preprocessing.md) are used as input data.
The pipeline also uses the variant.tsv file as well as the reference file from the preprocessing pipeline.

Bcf files created by the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html) are used as input data. The input data directory should only contain the files needed.
The pipeline also uses the variant.tsv file, the reference file and the genotypes file from the preprocessing pipeline.
A GTF file as described in [requirements](#requirements) and the FASTA file used for preprocessing are also necessary.
The pipeline begins by installing the repositories needed for the annotations; it will automatically install all repositories in the `repo_dir` folder, which can be specified in the config file relative to the annotation working directory.
The text file mapping blocks to chromosomes is stored in the `metadata` folder. The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir/ensembl-vep`.
Data for VEP plugins and the CADD cache are stored in `annotation data`.

## Running the annotation pipeline
### Preconfiguration
- Inside the annotation directory create a directory `repo_dir` and run the [annotation setup script](https://github.com/PMBio/deeprvat/blob/main/deeprvat/annotations/setup_annotation_workflow.sh)
```shell
setup_annotation_workflow.sh repo_dir/ensembl-vep/cache repo_dir/ensembl-vep/Plugins repo_dir
```
or manually clone the repositories mentioned in the [requirements](#requirements-target) into `repo_dir` and install the needed conda environments with
- Clone the repositories mentioned in [requirements](#requirements) into `repo_dir` and install the needed conda environments with
```shell
mamba env create -f repo_dir/absplice/environment.yaml
mamba env create -f repo_dir/kipoi-veff2/environment.minimal.linux.yml
@@ -96,22 +97,18 @@ Data for VEP plugins and the CADD cache are stored in `annotation data`.
If you already have some of the needed repositories on your machine you can edit the paths in the [config](https://github.com/PMBio/deeprvat/blob/main/pipelines/config/deeprvat_annotation_config.yaml).


- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements-target))
- Inside the annotation directory create a directory `annotation_dir` and download/link the prescored files for CADD, SpliceAI, and PrimateAI (see [requirements](#requirements))


### Running the pipeline
This pipeline should be run after running the [preprocessing pipeline](https://deeprvat.readthedocs.io/en/latest/preprocessing.html), since it relies on some of its output files (specifically the bcf files in `norm/bcf/`, the variant files in `norm/variants/` and the genotype file `preprocessed/genotypes.h5`).

After configuration and activating the `deeprvat_annotations` environment run the pipeline using snakemake:

```shell
snakemake -j <nr_cores> -s annotations.snakemake --configfile config/deeprvat_annotation.config --use-conda
```
## Running the annotation pipeline without the preprocessing pipeline

It is possible to run the annotation pipeline without having run the preprocessing prior to that.
However, the annotation pipeline requires some files from this pipeline that then have to be created manually.
- Left normalized bcf files from the input. These files do not have to contain any genotype information. "chrom", "pos", "ref" and "alt" columns will suffice.
- A reference fasta file will have to be provided.
- A tab separated file containing the "chrom", "pos", "ref" and "alt" entries of all input variants, each with a unique id.
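
The tab separated variant file from the list above could be produced along the following lines. This is only a sketch: the header names, the column order and the use of a running integer as the unique id are assumptions, not the format confirmed by the preprocessing pipeline.

```python
import csv
import io


def write_variants_tsv(variants, handle):
    """Write unique (chrom, pos, ref, alt) tuples, each with a running integer id."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header; adjust to whatever the downstream pipeline expects.
    writer.writerow(["id", "chrom", "pos", "ref", "alt"])
    seen = set()
    for variant in variants:
        if variant in seen:
            continue  # skip duplicates so that every id maps to one variant
        seen.add(variant)
        writer.writerow([len(seen) - 1, *variant])
    return len(seen)


# Placeholder variants, including one duplicate.
example_variants = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 1000, "A", "G"),
    ("chrX", 2500, "CT", "C"),
]
buffer = io.StringIO()
n_written = write_variants_tsv(example_variants, buffer)
tsv_text = buffer.getvalue()
```

In practice the handle would be a real file (optionally gzip-compressed) rather than an in-memory buffer.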


## References
