Proteogenomics database-generation tool for protein haplotypes and variants. Preprint describing the tool: .
- Databases obtained from the common haplotypes of the 1000 Genomes Project, along with the corresponding metadata, can be found at .
- Databases obtained from the common haplotypes of Release 1.1 of the Haplotype Reference Consortium (HRC), along with the corresponding metadata, can be found at .
- Databases obtained from the common haplotypes of the preliminary release of the Human Pangenome Reference Consortium (HPRC), along with the corresponding metadata, can be found at .
Note: The databases contain only common haplotypes (MAF > 1%); no individual-level data is available from these databases. For individual-level sequences, please run ProHap on the individual-level data.
Below is a brief overview; for details on the input file formats and configuration, please refer to the Wiki page.
Required input:
- For ProHap: VCF with phased genotypes, one file per chromosome (such as 1000 Genomes Project - downloaded automatically by Snakemake if URL is provided)
- For ProVar: VCF, single file per dataset. Multiple VCF files can be processed by ProVar in the same run.
- FASTA file of contaminant sequences. These are added to the final FASTA and tagged as contaminants. The default contaminant database, created by the cRAP project, is provided in this repository.
- GTF annotation file (Ensembl - downloaded automatically by Snakemake)
- cDNA FASTA file (Ensembl - downloaded automatically by Snakemake)
- (optional) ncRNA FASTA file (Ensembl - downloaded automatically by Snakemake)
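Snakemake normally fetches the Ensembl files automatically, but if you want to download or inspect them yourself, they follow the standard Ensembl FTP layout. The URLs below are illustrative, shown for release 111 (GRCh38); adjust the release number to match your configuration.
# Illustrative only: the pipeline normally downloads these files automatically.
# Paths follow the standard Ensembl FTP layout for release 111 (GRCh38).
wget http://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz ;
wget http://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz ;
wget http://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz ;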
Required software: Snakemake & Conda. ProHap was tested on Ubuntu 22.04.3 LTS. Windows users are encouraged to use the Windows Subsystem for Linux.
Using ProHap with the full 1000 Genomes Project data set (as per default) requires about 1 TB of disk space!
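Before starting a full-scale run, it may be worth checking the free space on the file system you will run in, for example:
# Check the available disk space in the working directory (~1 TB is needed for the full 1000 Genomes set)
df -h . ;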
Usage:
- Clone this repository:
git clone https://github.com/ProGenNo/ProHap.git; cd ProHap/;
- Create a configuration file called config.yaml using https://progenno.github.io/ProHap/. Please refer to the Wiki page for details (a rough, hedged sketch of a possible config.yaml follows this list).
- Test Snakemake with a dry run:
snakemake --cores <# provided cores> -n -q
- Run the Snakemake pipeline to create your protein database:
snakemake --cores <# provided cores> -p --use-conda
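For orientation only, a minimal config.yaml might look roughly like the sketch below. The key names and values here are hypothetical placeholders, not the authoritative schema; always generate your actual config.yaml with the configuration builder at https://progenno.github.io/ProHap/ and consult the Wiki for the supported options.
# Hypothetical sketch of a config.yaml (key names are placeholders, not the real schema);
# generate the actual file with https://progenno.github.io/ProHap/
cat > config.yaml <<'EOF'
# Ensembl release to align the database with (example value)
ensembl_release: 111
# phased VCFs, one per chromosome (URL or local path)
vcf_dir: data/1kGP_vcf/
# contaminant sequences to append and tag in the final FASTA
contaminants_fasta: crap.fasta
# final concatenated FASTA produced by the pipeline
output_fasta: results/prohap_database.fasta
EOF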
In the first usage example, we provide a small example dataset taken from the 1000 Genomes Project on GRCh38. We will use ProHap to create a database of protein haplotypes aligned with Ensembl v.111 (January 2024) using only MANE Select transcripts.
Expected runtime using 4 CPU cores: ~1 hour. Expected runtime using 23 CPU cores: ~30 minutes.
Requirements: Install Conda / Mamba and Snakemake using this guide. Minimum hardware requirements: 1 CPU core, ~5 GB disk space, 3 GB RAM.
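If Snakemake is not yet installed, one common setup (following the Snakemake documentation, with the environment name snakemake assumed, as used in the commands below) is:
# Install Snakemake into a dedicated conda environment named "snakemake"
# (one common setup following the Snakemake docs; adapt to your own Conda / Mamba installation)
conda create -c conda-forge -c bioconda -n snakemake snakemake ;
conda activate snakemake ;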
Use the following commands to run this example:
# Clone this repository:
git clone https://github.com/ProGenNo/ProHap.git ;
cd ProHap;
# Unpack the sample dataset
cd sample_data ;
gunzip sample_1kGP_common_global.tar.gz ;
tar xf sample_1kGP_common_global.tar ;
cd .. ;
# Copy the configuration to config.yaml
cp config_example1.yaml config.yaml ;
# Activate the snakemake conda environment and run the pipeline
conda activate snakemake ;
snakemake --cores 4 -p --use-conda ;
Once you obtain a list of peptide-spectrum matches (PSMs), you can use the pipeline provided in the PeptideAnnotator repository to map the peptides back to the respective protein haplotype / variant sequences, and to trace the identified variants back to their genetic origin. For usage details, please refer to the following wiki page.
The ProHap / ProVar pipeline produces three kinds of output files. Below is a brief description; please refer to the wiki page for further details.
- Concatenated FASTA file: The main result of the pipeline is the concatenated FASTA file, consisting of the ProHap and/or ProVar output, reference sequences from Ensembl, and provided contaminant sequences. The file can be used with any search engine.
- Optionally, headers are extracted and provided in an attached tab-separated file, and a gene name is added to each protein entry.
- Metadata table: Additional information on the variant / haplotype sequences produced by the pipeline, such as genomic coordinates of the variants covered, variant consequence type, etc.
- cDNA translations FASTA: A FASTA file containing the original translations of the variant / haplotype cDNA sequences prior to any optimization, removal of UTR sequences, and merging with canonical proteins and contaminants.
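As a quick sanity check, the outputs can be inspected from the command line. The file names below are placeholders, since the actual names depend on your configuration.
# Quick sanity checks on the outputs (file names are placeholders; substitute your configured paths)
grep -c '^>' results/final_database.fasta ;
head -n 3 results/haplotype_metadata.tsv | column -t -s $'\t' ;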
We welcome bug reports, suggestions of improvements, and contributions. Please do not hesitate to open an issue or a pull request.
As part of our efforts toward delivering open and inclusive science, we follow the Contributor Covenant Code of Conduct for Open Source Projects.
When using ProHap and databases generated using ProHap, please cite the accompanying scientific publication.