The code in this repository can be used to identify, validate, and quantify amino acid substitutions in LC-MS proteomics data that arise from alternate decoding of RNA.
The pipeline has several steps, and some of them require input from external analyses, such as database searches of proteomics data. It is therefore not a standalone software package that can be run automatically. Detailed instructions for running the pipeline are provided in the README.md files throughout this repository.
Results and output files from running this pipeline on the datasets described by Tsour et al. can be accessed here: decode output. These files can be downloaded and used to replicate the figures with the code provided in decode_figures.
- Custom protein sequence database generation by in-silico translation of RNA-seq data
- Dependent peptide search with MaxQuant (external software analysis)
- Identifying candidate peptides with amino acid substitutions
- Validation database search (external software analysis)
- Quantifying validated peptides with amino acid substitutions
- Downstream data analysis
Use RNA-seq data matched to LC-MS proteomics data to create sample-specific protein databases.
The code for this step is in custom_protein_database_pipeline/ and the README.md in that directory contains detailed instructions for running the code.
If no matched RNA-seq data are available, this step can be skipped. In that case, quantified amino acid substitutions should be interpreted with caution, because there is lower confidence that they are not encoded in the genome.
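As a rough illustration of what in-silico translation means here, the sketch below translates a nucleotide sequence with the standard genetic code and formats the result as a FASTA record. It is not the code in custom_protein_database_pipeline/; the function names `translate` and `fasta_record` are hypothetical, and a real sample-specific database would start from processed RNA-seq data rather than a raw string.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt_seq, frame=0):
    """Translate a nucleotide sequence in forward frame 0, 1, or 2.

    Stop codons become '*'; codons with ambiguous bases become 'X'.
    """
    seq = nt_seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def fasta_record(header, protein, width=60):
    """Format one protein sequence as a FASTA record with wrapped lines."""
    lines = [protein[i:i + width] for i in range(0, len(protein), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    protein = translate("ATGGCCTAA").rstrip("*")
    print(fasta_record("hypothetical_transcript_1", protein), end="")
```

Records like these, one per translated transcript, would make up the sample-specific database that the later search steps use.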
The dependent peptide search algorithm in MaxQuant is used to identify peptides with modifications in LC-MS proteomics data.
The LC-MS proteomics data is ideally searched against the sample-specific database generated in Step 1. If no sample-specific database is available, a species-specific UniProt FASTA file can be used instead.
A sample MaxQuant parameter file is provided in MaxQuant_templates, along with a script to create a new parameter file with user-defined parameters (raw files, FASTA file, etc.).
The output from this dependent peptide search is required to proceed with the next steps of the pipeline.
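To give a sense of what filling in a parameter template involves, here is a minimal sketch that rewrites the FASTA path and the raw file list in a MaxQuant-style XML parameter file. It is not the script shipped in MaxQuant_templates. The element names `fastaFilePath` and `filePaths`/`string` are assumptions based on commonly seen parameter files and should be checked against the template actually provided; `set_search_inputs` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def set_search_inputs(template_xml, fasta_path, raw_files):
    """Return a copy of the parameter XML with new FASTA and raw file paths.

    Assumes (not verified against the repository's template) that the FASTA
    path lives in <fastaFilePath> and the raw files in <filePaths><string>.
    """
    root = ET.fromstring(template_xml)

    # Point every FASTA entry at the user-supplied database.
    for node in root.iter("fastaFilePath"):
        node.text = fasta_path

    file_paths = root.find("filePaths")
    if file_paths is None:
        raise ValueError("template has no <filePaths> element")

    # Replace the template's raw files with the user's list.
    for child in list(file_paths):
        file_paths.remove(child)
    for raw in raw_files:
        ET.SubElement(file_paths, "string").text = raw

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Tiny stand-in template for demonstration only.
    template = (
        "<MaxQuantParams>"
        "<fastaFiles><FastaFileInfo><fastaFilePath>old.fasta</fastaFilePath>"
        "</FastaFileInfo></fastaFiles>"
        "<filePaths><string>old.raw</string></filePaths>"
        "</MaxQuantParams>"
    )
    print(set_search_inputs(template, "sample_specific.fasta", ["a.raw", "b.raw"]))
```

Real parameter files typically carry further per-file lists (for example experiment names) that must stay the same length as the raw file list. This sketch does not touch them, so the provided script remains the safer route.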
Search for modified peptides in dependent peptide search results that may represent amino acid substitutions. Add candidate peptides to custom protein sequence databases for validation search.
The code for this step can be found in decode_pipeline/python_scripts. decode_pipeline/README.md contains detailed instructions for running this code.
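The idea behind this step can be sketched as a lookup: a mass shift reported for a modified peptide is compared against the mass differences between pairs of amino acids, and matching pairs become candidate substitutions. The snippet below is a simplified illustration, not the implementation in decode_pipeline/python_scripts. The tolerance value and the function names are assumptions; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def candidate_substitutions(mass_shift, tol=0.005):
    """List (original, substituted) residue pairs whose mass difference
    matches the observed shift within tol Da (tol is an assumed value)."""
    return [
        (orig, new)
        for orig, m_orig in RESIDUE_MASS.items()
        for new, m_new in RESIDUE_MASS.items()
        if orig != new and abs((m_new - m_orig) - mass_shift) <= tol
    ]


def substituted_peptides(peptide, mass_shift, tol=0.005):
    """Generate every single-residue variant of peptide consistent with
    the observed mass shift."""
    variants = set()
    for orig, new in candidate_substitutions(mass_shift, tol):
        for pos, aa in enumerate(peptide):
            if aa == orig:
                variants.add(peptide[:pos] + new + peptide[pos + 1:])
    return sorted(variants)


if __name__ == "__main__":
    print(candidate_substitutions(15.9949))
    print(substituted_peptides("PEPTAFK", 15.9949))
```

A shift of +15.9949 Da matches both A to S and F to Y, and it is also the mass of an oxidation. A mass match alone therefore yields only a candidate, which is why the candidates are appended to the database and tested in the validation search.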
Run a standard database search against the protein databases appended with the candidate substituted peptides from Step 3.
A sample MaxQuant parameter file is provided in MaxQuant_templates. The output from this validation search is required to proceed with the next steps of the pipeline.
The code for this step can be found in decode_pipeline/python_scripts. decode_pipeline/README.md contains detailed instructions for running this code.
The data generated in the pipeline outlined above can be further analyzed in many ways. The code for the analyses described in the preprint article is provided in decode_analysis, decode_figures, and gnomAD_analysis.
These directories contain different subsets of the analyses described in the paper, along with the code to reproduce the figures. Each contains a README.md describing its analyses.
The downstream analyses depend on the many data files generated in the pipeline described above. To keep this analysis as reproducible as possible, we have deposited all relevant output files in decode output. If this data is downloaded and the "proj_dir" parameter is set to the download location, all the figures in decode_figures/Code_for_figures.ipynb can be generated.
The code is available under the CC BY-NC 4.0 license.