# Determining whether predicted cleavage peptides (DeepPeptide) that don't match peptides in databases have other traits that support their veracity.

About half (~49%) of the cleavage peptides predicted by the DeepPeptide tool matched peptides in the Peptipedia database.
Peptipedia is a metadatabase composed of peptide sequences from 76 databases.
This notebook investigates whether the peptides without matches have other traits that support their being real.

**Signal peptides**: Most (but not all) annotated cleavage peptides are cleaved from precursor proteins that contain an N-terminal signal peptide [https://doi.org/10.1038/s41467-022-34031-z].
Signal peptides target a protein to the secretory pathway and allow cleaved peptides to reach their final destination [https://doi.org/10.1096/fasebj.8.9.8005390].
Many cleavage peptides function as hormones or other signaling molecules, making export from the cell a key step in their biogenesis [https://doi.org/10.1096/fasebj.8.9.8005390]. 

**Propeptides**: Some precursor proteins include propeptides, which are segments that may help in the correct folding of the protein, inhibit premature activity before reaching the target site, or aid in the proper localization of the enzyme [https://doi.org/10.1002/prot.26702].
The propeptides are cleaved off to activate the protein or peptide once it's reached its destination.

We could investigate other signals (disulfide bonds, glycosylation sites, sorting signals), but signal peptides and propeptides were the easiest to assess with the data and tools we already had, so we started with these two.

## Prep notebook

In [1]:
library(tidyverse, warn.conflicts = F)

── Attaching core tidyverse packages ─────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Read in and process data

Below, we read in two results files produced by running peptigate on the human transcriptome.

In [2]:
human_predictions <- read_tsv("../../peptigate/results/predictions/peptide_predictions.tsv.gz", show_col_types = F) %>%
  mutate(peptide_class = ifelse(is.na(peptide_class), "sORF", peptide_class)) %>%
  # Remove propeptide predictions, as propeptides don't have biological activity once cleaved.
  filter(peptide_class != "Propeptide") %>%
  # This notebook focuses on DeepPeptide cleavage peptides, so filter to these
  filter(prediction_tool == "deeppeptide")

human_annotations <- read_tsv("../../peptigate/results/predictions/peptide_annotations.tsv.gz", show_col_types = F) %>%
  mutate(length = nchar(sequence)) %>%
  mutate(peptipedia_blast_result = ifelse(!is.na(peptipedia_blast_bitscore), "blast hit", "no blast hit")) 

human_results <- left_join(human_predictions, human_annotations, by = "peptide_id")

## Filter to distinct peptide sequences

While there is no overlap in sequences predicted by different tools (code not shown), duplicate sequences arise when a tool (e.g., DeepPeptide) predicts the same peptide sequence from different transcripts.
These transcripts are usually isoforms of the same gene that share the region that gives rise to the peptide.

Peptigate predicted 263 distinct cleavage peptide amino acid sequences.

In [3]:
# This code block filters to distinct sequences while keeping the most metadata possible.
# This requires removing metadata columns that might not be the same even if sequences are the same.
# We remove columns that we expect to vary like "peptide_id", "start", and "end".
human_results_distinct <- human_results %>%
  select(peptide_type, prediction_tool, sequence, length, 
         peptipedia_blast_pident, peptipedia_blast_evalue, 
         peptipedia_blast_bitscore,  peptipedia_blast_result) %>%
  distinct()

In [4]:
# Confirm the number of rows in the dataframe match the number of distinct sequences
length(unique(human_results_distinct$sequence))
nrow(human_results_distinct)
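
To illustrate why the per-transcript columns have to be dropped before de-duplicating, here is a minimal sketch in base R. The toy data frame, its peptide IDs, and its sequences are hypothetical and are not taken from the peptigate output.

```r
# Hypothetical toy data: two isoforms (tx1, tx2) yield the same peptide sequence.
toy <- data.frame(
  peptide_id = c("tx1_start10_end25", "tx2_start40_end55", "tx3_start5_end18"),
  sequence = c("ACDEFGHIK", "ACDEFGHIK", "LMNPQRSTV"),
  prediction_tool = "deeppeptide",
  stringsAsFactors = FALSE
)

# Keeping peptide_id would make every row unique, so drop it first,
# then keep one row per remaining combination of values.
toy_distinct <- unique(toy[, c("sequence", "prediction_tool")])

# Same check as in the notebook: one row per distinct sequence.
stopifnot(nrow(toy_distinct) == length(unique(toy_distinct$sequence)))
print(toy_distinct)
```

The notebook itself uses `dplyr::select()` followed by `distinct()`; `unique()` on a data frame is the base R counterpart used here to keep the sketch dependency-free.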

## How many DeepPeptide-predicted peptides had a BLAST hit?

In [5]:
human_results_distinct %>%
  group_by(prediction_tool, peptipedia_blast_result) %>%
  tally() 

prediction_tool,peptipedia_blast_result,n
<chr>,<chr>,<int>
deeppeptide,blast hit,130
deeppeptide,no blast hit,133
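
As a quick check of the ~49% figure quoted in the introduction, the fraction can be recomputed from the tally above. This is a small sketch in base R; the object names (`blast_counts`, `total`, `pct`) are ours and do not appear elsewhere in the notebook.

```r
# Counts copied from the tally above (distinct DeepPeptide sequences).
blast_counts <- c("blast hit" = 130, "no blast hit" = 133)

# Total number of distinct sequences; this should equal the 263 reported earlier.
total <- sum(blast_counts)

# Percentage of sequences in each category, rounded to one decimal place.
pct <- round(100 * blast_counts / total, 1)

print(total)
print(pct)
```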


## Join to signal peptide information

Below we join the signal peptide predictions for parent proteins to the peptigate peptide predictions.
This will allow us to see which peptides that didn't have BLAST matches come from parent proteins with signal peptides.

We ran DeepSig on the precursor (parent) proteins from which the peptides were cleaved.
The input was the peptigate intermediate file that reports the parent protein sequences, and DeepSig predicted signal peptides on these sequences.
See the README in this directory for the code we used.

In [6]:
deepsig_deeppeptide <- read_tsv("deeppeptide_peptide_parents_deepsig.tsv", show_col_types = F) %>%
  filter(deepsig_feature == "Signal peptide") %>%
  select(parent_id = peptide_id, deepsig_parent = deepsig_feature) %>%
  group_by(parent_id) %>%
  slice_head(n = 1) %>%
  mutate(prediction_tool = "deeppeptide")

In [7]:
human_results_deepsig <- human_results %>%
  # Generate precursor protein sequence id from peptide id.
  mutate(parent_id = gsub("_start.*", "", peptide_id)) %>%
  left_join(deepsig_deeppeptide, by = c("prediction_tool", "parent_id")) %>%
  mutate(deepsig_parent = ifelse(is.na(deepsig_parent), "Chain", deepsig_parent))
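
The join above relies on recovering the parent protein ID from each peptide ID. The sketch below shows that step on its own in base R. The peptide IDs and the `signal_parents` vector are hypothetical; we only assume, based on the `gsub()` pattern used in the notebook, that real IDs contain a `_start` suffix.

```r
# Hypothetical peptide IDs; real peptigate IDs may differ in detail.
peptide_ids <- c(
  "ENST0000001_start25_end60",
  "ENST0000001_start70_end92",
  "ENST0000002_start3_end30"
)

# Remove "_start" and everything after it to recover the parent protein ID.
parent_ids <- gsub("_start.*", "", peptide_ids)

# Hypothetical set of parents for which DeepSig predicted a signal peptide.
signal_parents <- c("ENST0000001")

# Parents without a signal peptide prediction are labeled "Chain",
# mirroring the NA replacement after the left join.
deepsig_parent <- ifelse(parent_ids %in% signal_parents, "Signal peptide", "Chain")

print(data.frame(peptide_ids, parent_ids, deepsig_parent))
```

Using `%in%` here gives the same labels as a left join followed by replacing `NA` with "Chain", because each parent either is or is not in the signal peptide set.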

In [8]:
# Re-derive distinct sequences so things aren't counted twice.
human_results_deepsig_distinct <-  human_results_deepsig %>%
  select(peptide_type, prediction_tool, sequence, length,
         peptipedia_blast_bitscore,  peptipedia_blast_result,
         deepsig_combined, deepsig_parent) %>%
  mutate(deepsig_peptide = ifelse(grepl(pattern = "Signal", x = deepsig_combined), "Signal peptide", "Chain")) %>%
  distinct()

In [9]:
# Note that some sequences appear in more than one row, so the row count exceeds the number of distinct sequences.
# We deal with this in subsequent cells.
nrow(human_results_deepsig_distinct)

In [10]:
# get an overview of results
human_results_deepsig_distinct %>%
  group_by(prediction_tool, peptipedia_blast_result, deepsig_peptide, deepsig_parent) %>%
  tally()

prediction_tool,peptipedia_blast_result,deepsig_peptide,deepsig_parent,n
<chr>,<chr>,<chr>,<chr>,<int>
deeppeptide,blast hit,Chain,Chain,83
deeppeptide,blast hit,Chain,Signal peptide,67
deeppeptide,no blast hit,Chain,Chain,104
deeppeptide,no blast hit,Chain,Signal peptide,28
deeppeptide,no blast hit,Signal peptide,Chain,4
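
The tally above counts rows rather than sequences. The sketch below, in base R, sums the counts to show that the row total exceeds the 263 distinct sequences, and uses a hypothetical toy data frame to show one way a single sequence can occupy two rows. All object names and toy values are ours.

```r
# Counts copied from the tally above.
tally_n <- c(83, 67, 104, 28, 4)
n_rows <- sum(tally_n)

# Number of distinct sequences reported earlier in the notebook.
n_distinct_sequences <- 263
extra_rows <- n_rows - n_distinct_sequences
print(c(n_rows = n_rows, extra_rows = extra_rows))

# Hypothetical toy data: the same sequence derives from two parent proteins,
# only one of which has a predicted signal peptide.
toy <- data.frame(
  sequence = c("ACDEFGHIK", "ACDEFGHIK", "LMNPQRSTV"),
  deepsig_parent = c("Signal peptide", "Chain", "Chain"),
  stringsAsFactors = FALSE
)

# unique() keeps both rows for "ACDEFGHIK" because deepsig_parent differs.
toy_rows <- nrow(unique(toy))

# Counting unique sequences within a filtered subset avoids double counting.
n_signal <- length(unique(toy$sequence[toy$deepsig_parent == "Signal peptide"]))
print(c(toy_rows = toy_rows, n_signal = n_signal))
```

This is why the next cell counts `unique()` sequences after filtering instead of reading a number off the tally.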


In [11]:
# pull out specifically the number of distinct cleavage peptides whose precursor protein had a signal peptide
deeppeptide_no_blast_hit_but_signal_peptide <- human_results_deepsig_distinct %>%
  filter(prediction_tool == "deeppeptide") %>%
  filter(peptipedia_blast_result == "no blast hit") %>%
  filter(deepsig_parent == "Signal peptide")

length(unique(deeppeptide_no_blast_hit_but_signal_peptide$sequence))

## Determine if cleavage peptides have propeptides predicted from parent sequences

In [12]:
# In the version of peptigate run in this repo, propeptides are included in the peptide prediction output.
# Below, we read in the peptigate results and only keep the propeptide predictions.
propeptide_predictions <- read_tsv("../../peptigate/results/predictions/peptide_predictions.tsv.gz", show_col_types = F) %>%
  filter(peptide_class == "Propeptide") %>%
  mutate(parent_id = gsub("_start.*", "", peptide_id))

In [13]:
human_results_deepsig %>%
  mutate(parent_id = gsub("_start.*", "", peptide_id)) %>%
  mutate(has_propeptide = ifelse(parent_id %in% propeptide_predictions$parent_id, "propeptide", "no propeptide")) %>%
  filter(prediction_tool == "deeppeptide") %>%
  select(peptide_type, prediction_tool, sequence, length,
         peptipedia_blast_pident, peptipedia_blast_evalue, 
         peptipedia_blast_bitscore,  peptipedia_blast_result,
         deepsig_combined, has_propeptide, deepsig_parent) %>%
  distinct() %>%
  group_by(prediction_tool, peptipedia_blast_result, deepsig_parent, has_propeptide) %>%
  tally()

prediction_tool,peptipedia_blast_result,deepsig_parent,has_propeptide,n
<chr>,<chr>,<chr>,<chr>,<int>
deeppeptide,blast hit,Chain,no propeptide,60
deeppeptide,blast hit,Chain,propeptide,24
deeppeptide,blast hit,Signal peptide,no propeptide,42
deeppeptide,blast hit,Signal peptide,propeptide,27
deeppeptide,no blast hit,Chain,no propeptide,100
deeppeptide,no blast hit,Chain,propeptide,8
deeppeptide,no blast hit,Signal peptide,no propeptide,20
deeppeptide,no blast hit,Signal peptide,propeptide,8
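
To summarize the table above, one could compute the share of rows in each BLAST category whose parent protein has a predicted signal peptide, a predicted propeptide, or both. The sketch below does this in base R from the counts shown. Note the caveat: these are row counts, and a sequence can appear in more than one row, so the percentages are approximate and are not counts of distinct sequences. The `supported` column is our own label.

```r
# Counts copied from the tally above, in the same row order.
tally <- data.frame(
  peptipedia_blast_result = rep(c("blast hit", "no blast hit"), each = 4),
  deepsig_parent = rep(c("Chain", "Chain", "Signal peptide", "Signal peptide"), times = 2),
  has_propeptide = rep(c("no propeptide", "propeptide"), times = 4),
  n = c(60, 24, 42, 27, 100, 8, 20, 8),
  stringsAsFactors = FALSE
)

# A row counts as "supported" if the parent has a signal peptide or a propeptide.
tally$supported <- tally$deepsig_parent == "Signal peptide" |
  tally$has_propeptide == "propeptide"

# Sum rows per BLAST category, overall and for supported rows only.
rows_total <- tapply(tally$n, tally$peptipedia_blast_result, sum)
rows_supported <- tapply(tally$n * tally$supported, tally$peptipedia_blast_result, sum)

# Percentage of rows with at least one supporting trait.
pct_supported <- round(100 * rows_supported / rows_total, 1)
print(rbind(rows_total, rows_supported, pct_supported))
```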


In [14]:
sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS/LAPACK: /Users/taylorreiter/miniconda3/envs/pepeval/lib/libopenblasp-r0.3.26.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] bit_4.0.5        gtable_0.3.4     jsonlite_1.8.8   compiler_4.3.3  
 [5] crayon_1.5.2     tidyselect_1.2.0 IRdisplay_1.1    parallel_4.3.3  
 [9] scales_1.3.0     uuid_1.2-0       fastmap_1.1.1    IRkernel_1.3.2  
[13] R6_2.5.1         generics_0.1.3   munsell_0.5.0    pi