How to prepare a protein database
Proteomics pipelines and toolkits like Philosopher rely on properly formatted protein sequence databases to correctly identify peptides. Here are some tips on how to prepare a protein database for your experiment.
Run Philosopher from the command line to download one from UniProt by executing the following two commands:
philosopher workspace --init
philosopher database --reviewed --contam --id UP000005640
This will generate a human UniProt/SwissProt (i.e. reviewed sequences only) database with common contaminants and decoys added (the default decoy prefix is rev_). If you would like to use the full (unreviewed) UniProt proteome, remove the --reviewed flag.
For mouse, for example, use the proteome ID UP000000589. To find the proteome ID for other organisms, search within the UniProt proteomes.
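Once the database has been written, a quick sanity check is to count target and decoy headers. The sketch below assumes a POSIX shell with awk. count_fasta_entries is a hypothetical helper, not part of Philosopher, and the FASTA path is whatever file Philosopher wrote to your workspace.

```shell
# Hypothetical helper (not a Philosopher command): count target and decoy
# entries in a FASTA file.
#   $1: path to the FASTA file
#   $2: decoy prefix, defaulting to rev_
count_fasta_entries() {
    fasta=$1
    prefix=${2:-rev_}
    awk -v p=">$prefix" '
        # A header is a decoy when it starts with ">" followed by the prefix.
        /^>/ { if (index($0, p) == 1) decoys++; else targets++ }
        END  { printf "targets=%d decoys=%d\n", targets, decoys }
    ' "$fasta"
}

# Usage (path is a placeholder):
#   count_fasta_entries path/to/database.fas
```

If every target entry received a decoy, the two counts would be expected to match; a large mismatch suggests the prefix passed to the helper is not the one used in the file.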
To combine multiple proteomes, provide a comma-separated list, e.g.:
philosopher workspace --init
philosopher database --reviewed --contam --id UP000005640,UP000000625,UP000002311
This generates a database with the human, E. coli, and yeast proteomes.
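When the list of proteomes is assembled in a script, it can help to validate the IDs before passing them to --id. The following is a minimal sketch: join_proteome_ids is a hypothetical helper, and the rule that an ID is UP followed only by digits is an assumption based on the IDs shown above.

```shell
# Hypothetical helper: validate proteome IDs and join them with commas.
# Assumption: a valid ID is "UP" followed only by digits (e.g. UP000005640).
join_proteome_ids() {
    joined=""
    for id in "$@"; do
        digits=${id#UP}
        # Reject IDs without the UP prefix or with nothing after it.
        if [ "$digits" = "$id" ] || [ -z "$digits" ]; then
            echo "invalid proteome ID: $id" >&2
            return 1
        fi
        # Reject IDs with any non-digit character after the prefix.
        case "$digits" in
            *[!0-9]*)
                echo "invalid proteome ID: $id" >&2
                return 1
                ;;
        esac
        joined="${joined:+$joined,}$id"
    done
    printf '%s\n' "$joined"
}

# Usage:
#   ids=$(join_proteome_ids UP000005640 UP000000625 UP000002311) &&
#       philosopher database --reviewed --contam --id "$ids"
```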
To add decoys and contaminants to a custom FASTA file and format it for FragPipe/Philosopher, use the following commands:
philosopher workspace --init
philosopher database --custom <file_name> --contam
To reformat an existing FASTA file for FragPipe, use the following commands:
philosopher workspace --init
philosopher database --annotate <file_name> --prefix <prefix>
If you need to run the --custom or --annotate command, manually inspect the formatted file to ensure it is compatible with Philosopher. Headers should follow one of these formats (see the example for each):
>sp|P02489|CRYAA_HUMAN Alpha-crystallin A chain OS=Homo sapiens OX=9606 GN=CRYAA PE=1 SV=2
>NP_000385.1 alpha-crystallin A chain isoform 1 [Homo sapiens]
>ENSP00000291554.2 pep chromosome:GRCh38:21:43169008:43172805:1 gene:ENSG00000160202.7 transcript:ENST00000291554.6 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:CRYAA description:crystallin alpha A [Source:HGNC Symbol;Acc:HGNC:2388]
Note: the protein description text (e.g. "crystallin alpha A") should not contain any commas or special characters, as they may cause Philosopher to parse the entry incorrectly.
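Commas in headers can be located before running Philosopher. The sketch below is a hypothetical helper built on grep. It flags commas anywhere in a header line and does not attempt to detect other special characters, since the set of problematic characters is not specified here.

```shell
# Hypothetical helper: print "line_number:header" for every FASTA header
# that contains a comma.
# Returns 0 when no header is flagged, 1 when at least one is.
find_comma_headers() {
    if grep -n '^>.*,' "$1"; then
        return 1
    fi
    return 0
}

# Usage (path is a placeholder):
#   find_comma_headers path/to/custom.fasta || echo "fix the headers listed above"
```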
- or generic:
If you are adding your own decoys, they also need to follow a specific format: each decoy must be a whole protein sequence entry in the FASTA file, with a decoy prefix (e.g. rev_ or DECOY_) added at the beginning of the header.
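As an illustration of that layout, the sketch below writes each target entry followed by a decoy whose header carries the prefix directly after the > character. Reversing the sequence is an assumption here (one common way to build decoys), and add_reversed_decoys is a hypothetical helper; Philosopher can add decoys itself with the --custom option shown earlier.

```shell
# Hypothetical helper: emit every target entry followed by a decoy entry.
# Assumption: decoys are the reversed target sequence.
#   $1: input FASTA file
#   $2: decoy prefix, defaulting to rev_
add_reversed_decoys() {
    awk -v prefix="${2:-rev_}" '
        # Print the buffered entry, then its decoy with the prefixed header.
        function emit(   i, rev) {
            if (header == "") return
            print header
            print seq
            rev = ""
            for (i = length(seq); i >= 1; i--) rev = rev substr(seq, i, 1)
            print ">" prefix substr(header, 2)
            print rev
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }   # join wrapped sequence lines into one string
        END { emit() }
    ' "$1"
}

# Usage (paths are placeholders):
#   add_reversed_decoys targets.fasta > targets_with_decoys.fasta
```

Each sequence is written on a single line, so every decoy is a whole protein string with the prefix at the very start of its header.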
Examples of compatible decoy formats:
Examples of incompatible decoy formats: