# Selecting peptides for experimental validation

This notebook follows my rationale in selecting peptides to investigate for anti-pruritic activity in the mouse scratch assay.
At the beginning of each filtering section, I explain the rationale for my choices.
I included intermediate filtering steps that helped me make decisions even though reaching the final filtered results could have been streamlined with simpler code.
I think including these details will help others understand why I made certain decisions and feel confident in the peptides we move forward with.

This notebook proposes 12 peptides from 3 orthogroups to be experimentally validated using a mouse scratch assay.
To reach these peptides, we filtered in the following way:
* The peptide belonged to an orthogroup statistically associated with itch suppression. (All input predictions came from orthogroups that were "significantly" associated with itch suppression in the trait mapping analysis.)
* The peptide belonged to an orthogroup where at least 50% of proteins in the orthogroup were predicted to encode a peptide.
* The peptide belonged to an orthogroup where most proteins in the orthogroup had a signal peptide.

After these first three filters, there were 88 peptides from three orthogroups left.

I then selected which peptides from the three orthogroups to move into the scratch assay.
To make this decision, I considered:
* Ease of synthesis
* Solubility (hydrophilicity)
* Whether the peptides had matches in tick salivary gland transcriptomes (i.e., whether they are likely expressed in the saliva)
* How well the peptide sequence represents other peptide sequences from the orthogroup (whether it clusters with other peptide sequences).

We also report the annotation the sORF or parent protein received.

The product of this notebook is a TSV file `20240626-predictions.tsv`.
This file contains the 12 peptide names, sequences, and metadata.

*NB* the pools are numbered oddly because I wanted the pool names to stay the same for each orthogroup between this analysis and the one I performed in 04/2024. The pool numbers are also a construct and likely don't reflect how the peptides will actually be pooled for experimental testing.

## Notebook setup

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all 

In [2]:
setwd("..")

## Filter to orthogroups where many of the proteins in the group had peptides

We try various filters below.
First, we look at the strength of support for itch suppression for the orthogroup (`traitmapping_coefficient`) as well as the fraction of proteins in the group that had a predicted peptide.

Using this information, we experiment with different thresholds to see how many orthogroups we keep, and the composition of those groups.
We try the following:
1. Filtering to orthogroups where at least 50% of the proteins had predicted peptides.
2. Filtering to orthogroups where at least 10% of the proteins had predicted peptides.
3. Filtering to orthogroups with more than 10 predicted peptides (`num_predicted_peptides > 10`).

The last filter seems like the best balance between keeping things that are more likely to be real and keeping variation.
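
To make the comparison between thresholds concrete, here is a minimal sketch on toy data. The orthogroup names and values are hypothetical placeholders, not taken from the predictions file; only the column names match the real table. It counts how many orthogroups each of the three filters would keep.

```r
library(dplyr)

# Hypothetical toy data: one row per orthogroup (values are illustrative only).
toy_orthogroups <- tibble::tibble(
  traitmapping_orthogroup = c("OG_A", "OG_B", "OG_C", "OG_D"),
  fraction_of_orthogroup_with_predicted_peptide = c(0.75, 0.12, 0.04, 0.09),
  num_predicted_peptides = c(18, 1, 10, 23)
)

# Each filter is a function that returns one TRUE/FALSE per orthogroup.
filters <- list(
  frac_ge_50pct    = function(d) d$fraction_of_orthogroup_with_predicted_peptide >= 0.5,
  frac_ge_10pct    = function(d) d$fraction_of_orthogroup_with_predicted_peptide >= 0.1,
  n_peptides_gt_10 = function(d) d$num_predicted_peptides > 10
)

# Number of orthogroups kept by each filter.
kept <- sapply(filters, function(f) sum(f(toy_orthogroups)))
kept
```

Note that an orthogroup with exactly 10 predicted peptides is dropped by the strict `> 10` comparison.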

In [3]:
predictions_support_summary <- read_tsv("outputs/notebooks/20240626_predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  select(traitmapping_orthogroup, traitmapping_coefficient, fraction_of_orthogroup_with_predicted_peptide, type_of_itch_suppression_evidence) %>%
  arrange(desc(traitmapping_coefficient)) %>%
  distinct()

predictions_support_summary

traitmapping_orthogroup,traitmapping_coefficient,fraction_of_orthogroup_with_predicted_peptide,type_of_itch_suppression_evidence
<chr>,<dbl>,<dbl>,<chr>
OG0011284,2.24534051,0.11111111,tick support
OG0008888,1.69892817,0.0625,tick support
OG0001774,1.05408448,0.68181818,chelicerate support
OG0008102,0.89907992,0.75,tick support
OG0002194,0.87213413,0.03571429,tick support
OG0000189,0.50428119,0.0875,tick support
OG0000746,0.42872398,0.04901961,tick support
OG0000194,0.35313516,0.09704641,tick support
OG0000880,0.31400766,0.7706422,tick support
OG0001663,0.29542423,0.09375,tick support


In [4]:
# Filter 1: keep orthogroups where at least 50% of proteins had a predicted peptide
predictions <- read_tsv("outputs/notebooks/20240626_predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(fraction_of_orthogroup_with_predicted_peptide >= 0.5) %>%
  arrange(desc(traitmapping_coefficient))

In [5]:
table(predictions$traitmapping_orthogroup, predictions$type_of_itch_suppression_evidence)

           
            chelicerate support tick support
  OG0000880                   0           82
  OG0001774                  45            0
  OG0008102                   0           18

This removes a lot of orthogroups but keeps the one orthogroup with proteins from both *P. ovis* (non-tick) and ticks. Might be reasonable to proceed with...TBD.
Check instead what this looks like if we lower the threshold to orthogroups where at least 10% of the proteins had predicted peptides.

In [6]:
# Filter 2: keep orthogroups where at least 10% of proteins had a predicted peptide
predictions <- read_tsv("outputs/notebooks/20240626_predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(fraction_of_orthogroup_with_predicted_peptide >= 0.1) %>%
  arrange(desc(traitmapping_coefficient))

table(predictions$traitmapping_orthogroup, predictions$type_of_itch_suppression_evidence)

           
            chelicerate support tick support
  OG0000079                   0           55
  OG0000880                   0           82
  OG0001774                  45            0
  OG0008102                   0           18
  OG0011284                   0            1

This filtering approach keeps more orthogroups while likely maintaining a strong signal.
Alternatively, we could filter on the absolute number of predicted peptides, keeping orthogroups with more than 10.

In [7]:
# Filter 3: keep orthogroups with more than 10 predicted peptides
predictions <- read_tsv("outputs/notebooks/20240626_predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(num_predicted_peptides > 10) %>%
  arrange(desc(traitmapping_coefficient))

table(predictions$traitmapping_orthogroup, predictions$type_of_itch_suppression_evidence)

           
            chelicerate support tick support
  OG0000079                   0           55
  OG0000143                   0           10
  OG0000189                   0           21
  OG0000194                   0           23
  OG0000880                   0           82
  OG0001774                  45            0
  OG0008102                   0           18

I think 10 looks like a good balance between keeping things that are more likely to be real and keeping good variability.

In [8]:
predictions_support_summary <- predictions %>%
  select(traitmapping_orthogroup, traitmapping_coefficient, fraction_of_orthogroup_with_predicted_peptide, type_of_itch_suppression_evidence) %>%
  arrange(desc(traitmapping_coefficient)) %>%
  distinct()

predictions_support_summary

traitmapping_orthogroup,traitmapping_coefficient,fraction_of_orthogroup_with_predicted_peptide,type_of_itch_suppression_evidence
<chr>,<dbl>,<dbl>,<chr>
OG0001774,1.05408448,0.68181818,chelicerate support
OG0008102,0.89907992,0.75,tick support
OG0000189,0.50428119,0.0875,tick support
OG0000194,0.35313516,0.09704641,tick support
OG0000880,0.31400766,0.7706422,tick support
OG0000143,0.15338398,0.03900709,tick support
OG0000079,0.08290626,0.13959391,tick support


## Filter to orthogroups where at least one member has a signal peptide

When we do this, the three orthogroups with the most evidence of signal peptides (many proteins in the group have signal peptides) are the same three orthogroups where at least 50% of the proteins had predicted peptides: OG0000880, OG0001774, OG0008102.

Given both of these signals (fraction of proteins with predicted peptides, signal peptides), we move forward with only these three orthogroups.
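
One way to quantify "many proteins in the group have signal peptides" is to compute, per orthogroup, the fraction of peptides whose parent protein carries a DeepSig signal peptide call. The sketch below uses hypothetical toy rows: the orthogroup names and the `"Chain"` label are placeholders, and only the column names come from the real table.

```r
library(dplyr)

# Hypothetical toy data: one row per predicted peptide.
toy_peptides <- tibble::tibble(
  traitmapping_orthogroup = c("OG_A", "OG_A", "OG_A", "OG_B", "OG_B", "OG_B", "OG_B"),
  traitmapping_deepsig_feature = c("Signal peptide", "Signal peptide", "Chain",
                                   "Chain", "Chain", "Signal peptide", "Chain")
)

signal_summary <- toy_peptides %>%
  group_by(traitmapping_orthogroup) %>%
  summarize(n_peptides = n(),
            # %in% treats a missing DeepSig call as "no signal peptide"
            n_signal = sum(traitmapping_deepsig_feature %in% "Signal peptide"),
            fraction_signal = n_signal / n_peptides) %>%
  arrange(desc(fraction_signal))

signal_summary
```

Sorting by `fraction_signal` puts the orthogroups with the strongest signal peptide support at the top, which complements the raw counts shown by `table()`.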

In [9]:
predictions <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide")

table(predictions$traitmapping_orthogroup, predictions$type_of_itch_suppression_evidence)

           
            chelicerate support tick support
  OG0000079                   0            1
  OG0000143                   0            2
  OG0000880                   0           39
  OG0001774                  36            0
  OG0008102                   0           13

In [10]:
predictions <- predictions %>%
  filter(traitmapping_orthogroup %in% c("OG0000880", "OG0001774", "OG0008102"))

## Combine with solubility data

In [11]:
# read in solubility and ease of synthesis data
# This is from a web application by genscript.
# https://www.genscript.com/tools/peptide%2danalyzing%2dtool
synthesis_and_solubility <- read_csv("outputs/notebooks/20240626-genscript-synthesis-and-solubility.csv", show_col_types = FALSE) %>%
  distinct() %>%
  filter(sequence %in% predictions$protein_sequence)
table(synthesis_and_solubility$synthesis_difficulty, synthesis_and_solubility$hydrophilicity)

           
            Good Poor
  Difficult    0   10
  Easy         3    4
  Medium       2   69

In [12]:
predictions <- left_join(predictions, synthesis_and_solubility, by = c("protein_sequence" = "sequence"))

In [13]:
predictions %>% 
  group_by(synthesis_difficulty, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient))

synthesis_difficulty,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Difficult,Poor,OG0001774,1.0540845,chelicerate support,1
Easy,Good,OG0001774,1.0540845,chelicerate support,3
Easy,Poor,OG0001774,1.0540845,chelicerate support,4
Medium,Good,OG0001774,1.0540845,chelicerate support,2
Medium,Poor,OG0001774,1.0540845,chelicerate support,26
Difficult,Poor,OG0008102,0.8990799,tick support,7
Medium,Poor,OG0008102,0.8990799,tick support,6
Difficult,Poor,OG0000880,0.3140077,tick support,2
Medium,Poor,OG0000880,0.3140077,tick support,37


## Summary of selected peptides

In [19]:
# Note I labeled the 5 sORF predictions that had signal peptides.
# I predicted them with DeepSig and cleaved them by hand.
# The cell below this one contains a data frame of the cleaved peptide sequences.

pool1_names <- c("Amblyomma-americanum_evm.model.contig-245149-1.2", # HAS A SIGNAL PEPTIDE CLEAVED
                 "Amblyomma-sculptum_GEEX01004552.1.p1",             # HAS A SIGNAL PEPTIDE CLEAVED
                 "Rhipicephalus-microplus_XP-037271377.1_start70_end114",
                 "Rhipicephalus-microplus_XP-037271378.1_start78_end115",
                 "Dermacentor-andersoni_XP-054924338.1_start87_end106")

# pools 2-4 from the 04/2024 analysis were eliminated.

pool5_names <- c("Rhipicephalus-microplus_XP-037269427.1_start34_end70",
                 "Dermacentor-andersoni_XP-050051547.1_start39_end77",
                 "Dermacentor-silvarum_XP-037559871.1_start39_end87",
                 "Hyalomma-asiaticum_KAH6923445.1_start29_end58")

pool6_names <- c("Dermacentor-silvarum_XP-049518196.1",      # HAS A SIGNAL PEPTIDE CLEAVED
                 "Ixodes-scapularis_tr|B7P452|B7P452-IXOSC", # HAS A SIGNAL PEPTIDE CLEAVED
                 "Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC") # HAS A SIGNAL PEPTIDE CLEAVED

all_names <- c(pool1_names, pool5_names, pool6_names)

In [20]:
cleaved_sorfs <- data.frame(peptide_id = c("Amblyomma-americanum_evm.model.contig-245149-1.2", "Amblyomma-sculptum_GEEX01004552.1.p1",
                                           "Dermacentor-silvarum_XP-049518196.1", "Ixodes-scapularis_tr|B7P452|B7P452-IXOSC",
                                           "Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC"),                      
                            cleaved_protein_sequence = c("AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC",
                                                         "ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA",
                                                         "SEEHGGDSHQAGDDADPEEYPAEERDADTPLVRVRRGFGCPLQSRCNSHCQSIQRRAGYCDGPLKLRCVCTT",
                                                         "ENDEGGEKELVRVRRTSYNCPFQKHKCHRHCKSIGHIAGYCGGFRNRTCICVKK",
                                                         "QVPHVRVRRAFGCPFDQGTCHSHCRSIRRRGERCSGFAKRTCTCYQK"),
                            cleaved_synthesis_difficulty = c("Easy", "Easy", "Easy", "Easy", "Easy"),
                            cleaved_hydrophilicity = c("Good", "Good", "Good", "Good", "Good"))
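
Because the signal peptides were cleaved by hand, a quick sanity check is to confirm that each cleaved sequence is an exact suffix of its full-length sORF. The sketch below does this for one pair, using the `Dermacentor-silvarum_XP-049518196.1` sequences that appear in this notebook; the variable names are my own and are not used elsewhere.

```r
# Full-length sORF and hand-cleaved sequence for Dermacentor-silvarum_XP-049518196.1
full_seq    <- "MKISTVAFALLILSVMLVSGIASEEHGGDSHQAGDDADPEEYPAEERDADTPLVRVRRGFGCPLQSRCNSHCQSIQRRAGYCDGPLKLRCVCTT"
cleaved_seq <- "SEEHGGDSHQAGDDADPEEYPAEERDADTPLVRVRRGFGCPLQSRCNSHCQSIQRRAGYCDGPLKLRCVCTT"

# The cleaved peptide should be what remains after removing an N-terminal prefix.
is_suffix <- endsWith(full_seq, cleaved_seq)

# Length and sequence of the removed N-terminal segment (the putative signal peptide).
signal_len     <- nchar(full_seq) - nchar(cleaved_seq)
signal_peptide <- substr(full_seq, 1, signal_len)

is_suffix
signal_peptide
```

The same check could be run over every row of `cleaved_sorfs` after joining it to the full-length sequences.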

In [21]:
predictions_final <- predictions %>%
  filter(peptide_id %in% all_names) %>%
  left_join(cleaved_sorfs, by = "peptide_id") %>%
  arrange(desc(traitmapping_orthogroup)) %>%
  select(-start, -end, -nlpprecursor_class_score, -nlpprecursor_cleavage_score, -traitmapping_model, -traitmapping_profile_type) %>%
  mutate(cleaved_length = nchar(cleaved_protein_sequence))

predictions_final  %>%
  select(peptide_id, peptide_type, prediction_tool, peptide_length,
         traitmapping_orthogroup, traitmapping_coefficient, traitmapping_deepsig_feature,
         antiinflammatory, traitmapping_egg_Description, traitmapping_KO_definition,
         synthesis_difficulty, hydrophilicity,
         cleaved_protein_sequence, cleaved_synthesis_difficulty, cleaved_hydrophilicity) 

peptide_id,peptide_type,prediction_tool,peptide_length,traitmapping_orthogroup,traitmapping_coefficient,traitmapping_deepsig_feature,antiinflammatory,traitmapping_egg_Description,traitmapping_KO_definition,synthesis_difficulty,hydrophilicity,cleaved_protein_sequence,cleaved_synthesis_difficulty,cleaved_hydrophilicity
<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Rhipicephalus-microplus_XP-037271377.1_start70_end114,cleavage,deeppeptide,45,OG0008102,0.8990799,Signal peptide,0.0,,"FrmR/RcnR family transcriptional regulator, repressor of rcnA expression;protein S100-A2",Medium,Poor,,,
Amblyomma-sculptum_GEEX01004552.1.p1,sORF,less_than_100aa,99,OG0008102,0.8990799,Signal peptide,0.0,,U3 small nucleolar RNA-associated protein 12;proteoglycan 3;cullin-associated NEDD8-dissociated protein 1;Escherichia phage dCTP pyrophosphatase [EC:3.6.1.12];cell division protein DivIC,Medium,Poor,ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,Easy,Good
Amblyomma-americanum_evm.model.contig-245149-1.2,sORF,less_than_100aa,100,OG0008102,0.8990799,Signal peptide,0.03333333,,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase",Medium,Poor,AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,Easy,Good
Dermacentor-andersoni_XP-054924338.1_start87_end106,cleavage,nlpprecursor,19,OG0008102,0.8990799,Signal peptide,0.16666667,,osmotically inducible lipoprotein OsmB,Medium,Poor,,,
Rhipicephalus-microplus_XP-037271378.1_start78_end115,cleavage,deeppeptide,38,OG0008102,0.8990799,Signal peptide,0.0,,,Medium,Poor,,,
Dermacentor-silvarum_XP-049518196.1,sORF,less_than_100aa,94,OG0001774,1.0540845,Signal peptide,0.0,,defensin;drosomycin,Easy,Good,SEEHGGDSHQAGDDADPEEYPAEERDADTPLVRVRRGFGCPLQSRCNSHCQSIQRRAGYCDGPLKLRCVCTT,Easy,Good
Ixodes-scapularis_tr|B7P452|B7P452-IXOSC,sORF,less_than_100aa,76,OG0001774,1.0540845,Signal peptide,0.06666667,,drosomycin;defensin,Medium,Good,ENDEGGEKELVRVRRTSYNCPFQKHKCHRHCKSIGHIAGYCGGFRNRTCICVKK,Easy,Good
Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC,sORF,less_than_100aa,70,OG0001774,1.0540845,Signal peptide,0.03333333,,,Medium,Good,QVPHVRVRRAFGCPFDQGTCHSHCRSIRRRGERCSGFAKRTCTCYQK,Easy,Good
Hyalomma-asiaticum_KAH6923445.1_start29_end58,cleavage,deeppeptide,30,OG0000880,0.3140077,Signal peptide,0.03333333,,,Medium,Poor,,,
Rhipicephalus-microplus_XP-037269427.1_start34_end70,cleavage,deeppeptide,37,OG0000880,0.3140077,Signal peptide,0.03333333,,Lymphocryptovirus nuclear antigen 1,Medium,Poor,,,


In [22]:
output <- predictions_final %>%
  mutate(final_peptide_sequence = ifelse(is.na(cleaved_protein_sequence), protein_sequence, cleaved_protein_sequence),
         final_length = ifelse(is.na(cleaved_length), peptide_length, cleaved_length),
         final_synthesis_difficulty = ifelse(is.na(cleaved_synthesis_difficulty), synthesis_difficulty, cleaved_synthesis_difficulty),
         final_hydrophilicity = ifelse(is.na(cleaved_hydrophilicity), hydrophilicity, cleaved_hydrophilicity)) %>%
  select(peptide_id, peptide_sequence = final_peptide_sequence, length = final_length, hydrophilicity = final_hydrophilicity, 
         difficulty_of_synthesis = final_synthesis_difficulty, peptide_type, peptide_class, prediction_tool, traitmapping_cluster, 
         traitmapping_orthogroup, traitmapping_coefficient, traitmapping_KO, traitmapping_KO_definition, traitmapping_deepsig_feature,
         type_of_itch_suppression_evidence,
         num_proteins_in_orthogroup, num_predicted_peptides, fraction_of_orthogroup_with_predicted_peptide, num_tick_proteins_in_orthogroup,
         fraction_of_orthogroup_tick_proteins, num_predicted_peptides_from_tick, fraction_of_orthogroup_with_predicted_tick_peptides,
         num_itchsuppsp_proteins_in_orthogroup, fraction_of_orthogroup_itchsuppsp_proteins, num_predicted_peptides_from_itchsuppsp,
         fraction_of_orthogroup_with_predicted_itchsuppsp_peptides, num_predicted_peptides_with_sg_blast_hit, 
         locus_tag, traitmapping_signif_level, traitmapping_signif_fdr, traitmapping_species, traitmapping_Length, 
         traitmapping_egg_Description, traitmapping_deepsig_start, traitmapping_deepsig_end,  
         AB, ACE, ACP, AF, AMAP, AMP, AOX, APP, AV, BBP, DPPIV, MRSA, Neuro, QS, TOX, TTCA, antiinflammatory,
         aliphatic_index, boman_index, charge, hydrophobicity, instability_index, isoelectric_point, molecular_weight, pd1_residue_volume,
         pd2_hydrophilicity, z1_lipophilicity, z2_steric_bulk_or_polarizability, z3_polarity_or_charge, z4_electronegativity_etc, z5_electronegativity_etc, 
         peptipedia_blast_sseqid, peptipedia_blast_full_sseq, peptipedia_blast_pident, peptipedia_blast_length, peptipedia_blast_qlen, 
         peptipedia_blast_slen, peptipedia_blast_mismatch, peptipedia_blast_gapopen, peptipedia_blast_qstart, peptipedia_blast_qend, 
         peptipedia_blast_sstart, peptipedia_blast_send, peptipedia_blast_evalue, peptipedia_blast_bitscore, peptipedia_num_hits, 
         peptide_deepsig_prediction = deepsig_combined, mmseqs2_representative_sequence, mmseqs2_num_peptides_in_cluster, 
         sgpeptide_blast_sseqid, sgpeptide_blast_full_sseq, sgpeptide_blast_pident, sgpeptide_blast_length, sgpeptide_blast_qlen, 
         sgpeptide_blast_slen, sgpeptide_blast_qcovhsp, sgpeptide_blast_scovhsp, sgpeptide_blast_mismatch, sgpeptide_blast_gapopen, 
         sgpeptide_blast_qstart, sgpeptide_blast_qend, sgpeptide_blast_sstart, sgpeptide_blast_send, sgpeptide_blast_evalue, sgpeptide_blast_bitscore,
         original_peptide_sequence = protein_sequence, original_peptide_length = peptide_length, 
         original_synthesis_difficulty = synthesis_difficulty, original_hydrophilicity = hydrophilicity, 
         cleaved_protein_sequence, cleaved_synthesis_difficulty, cleaved_hydrophilicity, cleaved_length)

write_tsv(output, "20240626-predictions.tsv")
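
The `ifelse(is.na(...), ...)` calls above take the cleaved value when one exists and fall back to the original value otherwise. As a minimal sketch on hypothetical toy rows (the peptide IDs and sequences are made up), `dplyr::coalesce()` gives the same result; this is an alternative, not what the cell above uses.

```r
library(dplyr)

# Hypothetical toy data: pep_1 has a cleaved sequence, pep_2 does not.
toy_final <- tibble::tibble(
  peptide_id = c("pep_1", "pep_2"),
  protein_sequence = c("MKAAAGGG", "GGVLGG"),
  cleaved_protein_sequence = c("GGG", NA)
) %>%
  mutate(
    # Pattern used in the notebook
    final_ifelse = ifelse(is.na(cleaved_protein_sequence), protein_sequence, cleaved_protein_sequence),
    # Equivalent alternative: first non-missing value, left to right
    final_coalesce = coalesce(cleaved_protein_sequence, protein_sequence)
  )

toy_final
```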

## POOL 6 -- OG0001774 (3 peptides)

* OG0001774 has the highest trait mapping coefficient of the three retained orthogroups and is the only orthogroup with tick and non-tick (*P. ovis*) proteins that passed all filters.
* 5 peptides are Easy or Medium to synthesize and have good solubility.
* One hit (cleavage) is from *Tyrophagus putrescentiae*, a mold mite that does not bite humans. I removed this protein from consideration.
* The four remaining peptides are all sORFs. All have hits against tick salivary gland transcriptomes, and three of the four are annotated as defensin/drosomycin (`Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC` has no KO annotation). Two cluster together and two do not. Given that two peptides cluster together, I picked one representative from this pair. I included the other two sequences as well.
* **Candidates**:
    * `Dermacentor-silvarum_XP-049518196.1` (SEEHGGDSHQAGDDADPEEYPAEERDADTPLVRVRRGFGCPLQSRCNSHCQSIQRRAGYCDGPLKLRCVCTT) (signal peptide cleaved, easy, good)
    * `Ixodes-scapularis_tr|B7P452|B7P452-IXOSC` (ENDEGGEKELVRVRRTSYNCPFQKHKCHRHCKSIGHIAGYCGGFRNRTCICVKK) (signal peptide cleaved, easy, good)
    * `Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC` (QVPHVRVRRAFGCPFDQGTCHSHCRSIRRRGERCSGFAKRTCTCYQK) (signal peptide cleaved, easy, good)

In [23]:
OG0001774 <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(synthesis_difficulty, hydrophilicity, peptide_type, traitmapping_orthogroup,
         type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, 
         sgpeptide_blast_sseqid, protein_sequence) %>% 
  filter(traitmapping_orthogroup == "OG0001774") %>%
  filter(synthesis_difficulty %in% c("Easy", "Medium")) %>%
  filter(hydrophilicity == "Good") %>%
  arrange(hydrophilicity, synthesis_difficulty)
OG0001774

synthesis_difficulty,hydrophilicity,peptide_type,traitmapping_orthogroup,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid,protein_sequence
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,cleavage,OG0001774,chelicerate support,Tyrophagus-putrescentiae_KAH9391114.1_start28_end63,Tyrophagus-putrescentiae_KAH9391114.1_start28_end63,,DYGCPITSKCKQHCLENKFKSGSCEGTLKLTCHCVG
Easy,Good,sORF,OG0001774,chelicerate support,Dermacentor-silvarum_XP-049518196.1,Dermacentor-silvarum_XP-049518196.1,GBBK01002034.1,MKISTVAFALLILSVMLVSGIASEEHGGDSHQAGDDADPEEYPAEERDADTPLVRVRRGFGCPLQSRCNSHCQSIQRRAGYCDGPLKLRCVCTT
Easy,Good,sORF,OG0001774,chelicerate support,Dermacentor-andersoni_XP-050044148.1,Dermacentor-silvarum_XP-049518196.1,GBBK01002034.1,MKMSTVAFALLILSVMLVSGIASEEHGGESQEADEDPHPEEYGLEERSAETPLVRVRRGFGCPLQSRCNTHCQSIQRRAGYCDGPLKLRCVCTT
Medium,Good,sORF,OG0001774,chelicerate support,Ixodes-scapularis_tr|B7P452|B7P452-IXOSC,Ixodes-scapularis_tr|B7P452|B7P452-IXOSC,GADI01002331.1,MKVLAVSLAFLLITGLISTSLAENDEGGEKELVRVRRTSYNCPFQKHKCHRHCKSIGHIAGYCGGFRNRTCICVKK
Medium,Good,sORF,OG0001774,chelicerate support,Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC,Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC,GANP01015342.1,MKVVGIALVVRLFSFSCSQGVHSQVPHVRVRRAFGCPFDQGTCHSHCRSIRRRGERCSGFAKRTCTCYQK


In [24]:
# look at annotation information
predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(traitmapping_orthogroup == "OG0001774") %>%
  filter(!peptide_id %in% c("Tyrophagus-putrescentiae_KAH9391114.1_start28_end63")) %>%
  filter(synthesis_difficulty %in% c("Easy", "Medium")) %>%
  filter(hydrophilicity == "Good") %>%
  select(peptide_type, peptide_id, traitmapping_deepsig_feature, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,traitmapping_deepsig_feature,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
sORF,Dermacentor-silvarum_XP-049518196.1,Signal peptide,K20668;K20670,defensin;drosomycin
sORF,Dermacentor-andersoni_XP-050044148.1,Signal peptide,K20668;K20670,defensin;drosomycin
sORF,Ixodes-scapularis_tr|B7P452|B7P452-IXOSC,Signal peptide,K20670;K20668,drosomycin;defensin
sORF,Ixodes-scapularis_tr|B7Q4Z2|B7Q4Z2-IXOSC,Signal peptide,,


## POOL 1 -- OG0008102 (5 peptides)

* Orthogroup `OG0008102` has the second-highest trait mapping coefficient of the three retained orthogroups.

* **sORFs** (2 peptides)
    * There are three sORFs in this group, but only two have signal peptides, so we'll only move forward with those two.
    * The two sORFs with signal peptides are "Medium" to synthesize with "Poor" hydrophilicity. However, when we cleave their signal peptides, they become "Easy" to synthesize with "Good" hydrophilicity.
        * `Amblyomma-americanum_evm.model.contig-245149-1.2` (AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC) (easy, good; cleaved)
        * `Amblyomma-sculptum_GEEX01004552.1.p1` (ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA) (easy, good; cleaved)
    * In the table below, the `Amblyomma-americanum` sORF matches "Transcript_929497.p2_start21_end72", which corresponds to "petxwholefemale_TRINITY_DN4020_c0_g1_i1" (the `Amblyomma-sculptum` sORF matches "GINV01009842.1"). The former is a whole-female transcriptome, not a salivary gland transcriptome (though "whole" does contain salivary glands too). The cleavage peptides in the same orthogroup have hits to an actual salivary gland (sg) transcriptome, so I think this hit is ok.
* **cleavage** (3 peptides)
    * All of the sORFs have good solubility, so we can pick at least one cleavage peptide to put in this pool and it shouldn't cause aggregation problems.
    * None of the cleavage peptides cluster together (mmseqs2), so we can't use that to drive our selection.
    * Two cleavage peptides have hits to tick sg transcriptomes so I think these two are best to move forward with (note that `Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100` has a hit to "petxwholefemale_TRINITY_DN4020_c0_g1_i1," which is a whole female transcriptome, not an sg transcriptome).
        * `Rhipicephalus-microplus_XP-037271377.1_start70_end114` (IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL)
        * `Rhipicephalus-microplus_XP-037271378.1_start78_end115` (VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF)
    * I also added `Dermacentor-andersoni_XP-054924338.1_start87_end106` (NGAISGAVGAAVANLINKG) because it's from a different species.

In [25]:
OG0008102 <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(synthesis_difficulty, hydrophilicity, peptide_type, traitmapping_orthogroup,
         type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, 
         sgpeptide_blast_sseqid, protein_sequence) %>% 
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(synthesis_difficulty %in% c("Easy", "Medium"))
OG0008102

synthesis_difficulty,hydrophilicity,peptide_type,traitmapping_orthogroup,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid,protein_sequence
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0008102,tick support,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Rhipicephalus-microplus_XP-037271377.1_start70_end114,GIKN01002979.1.p1_start91_end134,IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL
Medium,Poor,sORF,OG0008102,tick support,Amblyomma-sculptum_GEEX01004552.1.p1,Amblyomma-sculptum_GEEX01004552.1.p1,GINV01009842.1,MKAYLILALVILGHLSQIHAATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA
Medium,Poor,sORF,OG0008102,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2,Amblyomma-americanum_evm.model.contig-245149-1.2,Transcript_929497.p2_start21_end72,MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC
Medium,Poor,cleavage,OG0008102,tick support,Dermacentor-andersoni_XP-054924338.1_start87_end106,Dermacentor-andersoni_XP-054924338.1_start87_end106,,NGAISGAVGAAVANLINKG
Medium,Poor,cleavage,OG0008102,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Transcript_929497.p2_start64_end100,VGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC
Medium,Poor,cleavage,OG0008102,tick support,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Rhipicephalus-microplus_XP-037271378.1_start78_end115,GIKN01002127.1.p1_start100_end137,VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF


In [26]:
# look at annotation information
predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(synthesis_difficulty %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, traitmapping_deepsig_feature, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,traitmapping_deepsig_feature,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Signal peptide,K23240;K23759,"FrmR/RcnR family transcriptional regulator, repressor of rcnA expression;protein S100-A2"
sORF,Amblyomma-sculptum_GEEX01004552.1.p1,Signal peptide,K14556;K25722;K17263;K21503;K13052,U3 small nucleolar RNA-associated protein 12;proteoglycan 3;cullin-associated NEDD8-dissociated protein 1;Escherichia phage dCTP pyrophosphatase [EC:3.6.1.12];cell division protein DivIC
sORF,Amblyomma-americanum_evm.model.contig-245149-1.2,Signal peptide,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase"
cleavage,Dermacentor-andersoni_XP-054924338.1_start87_end106,Signal peptide,K04062,osmotically inducible lipoprotein OsmB
cleavage,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Signal peptide,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase"
cleavage,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Signal peptide,,


## POOL 5 -- OG0000880 (4 cleavage peptides)

* `OG0000880` has the third-highest trait mapping coefficient of the three retained orthogroups.
* All peptides are **cleavage** peptides.
* Most are **Medium** to synthesize and have **Poor** solubility.
* There's a big diversity in parent protein annotations, but all peptides are glycine-rich. After removing peptides that don't have hits to salivary glands, the cleavage peptides group into four mmseqs2 clusters. I suggest synthesizing one sequence from each group.
    * `Rhipicephalus-microplus_XP-037269427.1_start34_end70` (GGVLGGLGGVGYGTGLGTGLGTGFGGSGLSGVGLGGL)
    * `Dermacentor-andersoni_XP-050051547.1_start39_end77` (GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGSPAL)
    * `Dermacentor-silvarum_XP-037559871.1_start39_end87` (GGVLGGLGGYGAGVGPGLVGAGIGGPGLVGGGVVGNPALVGAGLGQGVG)
    * `Hyalomma-asiaticum_KAH6923445.1_start29_end58` (GGLLGAGLGGYGGGLGGPGLVGAGLGGVGL)
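
The "one sequence per mmseqs2 cluster" idea can be sketched as below. The peptide and hit IDs are hypothetical toy values, and the tie-breaking rule (prefer the cluster representative when it has a salivary gland hit) is an assumption made for illustration. The candidates listed above were chosen manually and need not follow this rule.

```r
library(dplyr)

# Hypothetical toy data: five peptides in three mmseqs2 clusters.
toy_clusters <- tibble::tibble(
  peptide_id = c("pep_a1", "pep_a2", "pep_b1", "pep_c1", "pep_c2"),
  mmseqs2_representative_sequence = c("pep_a1", "pep_a1", "pep_b1", "pep_c2", "pep_c2"),
  sgpeptide_blast_sseqid = c("sg_hit_1", NA, "sg_hit_2", "sg_hit_3", "sg_hit_4")
)

one_per_cluster <- toy_clusters %>%
  # keep only peptides with a salivary gland transcriptome hit
  filter(!is.na(sgpeptide_blast_sseqid)) %>%
  group_by(mmseqs2_representative_sequence) %>%
  # within each cluster, sort the representative itself to the top
  arrange(desc(peptide_id == mmseqs2_representative_sequence), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

one_per_cluster
```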

In [27]:
predictions %>%
  filter(traitmapping_orthogroup == "OG0000880") %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(!is.na(sgpeptide_blast_sseqid)) %>% # filter to those that have salivary gland transcriptome hits
  filter(!sgpeptide_blast_sseqid %in% c("Transcript_125878.p1_start39_end77",
                                        "Transcript_160125.p1_start39_end77",
                                        "Transcript_211978.p1_start35_end80",
                                        "Transcript_66059.p1_start33_end65")) %>%  # remove amblyomma hits that aren't to sg
  select(peptide_type, synthesis_difficulty, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, peptide_id, 
         traitmapping_deepsig_feature, protein_sequence, traitmapping_KO, traitmapping_KO_definition, 
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>%
  arrange(mmseqs2_representative_sequence)

peptide_type,synthesis_difficulty,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,peptide_id,traitmapping_deepsig_feature,protein_sequence,traitmapping_KO,traitmapping_KO_definition,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-microplus_XP-037269420.1_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLAGVGLGRPGLIGGGVVGNPGL,K19461;K22989,Lymphocryptovirus nuclear antigen 1;integral membrane protein GPR137,tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GINV01008730.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-microplus_XP-037269421.1_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGLGAPGLVGDGVVGNPAL,K19461;K24317,Lymphocryptovirus nuclear antigen 1;germinal-center associated nuclear protein [EC:2.3.1.48],tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GBJS01005204.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3140077,Dermacentor-andersoni_XP-050051547.1_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGSPAL,K12051;K13184;K07344;K19461;K12741,ComB7 competence protein;ATP-dependent RNA helicase A [EC:3.6.4.13];type IV secretion system protein TrbL;Lymphocryptovirus nuclear antigen 1;heterogeneous nuclear ribonucleoprotein A1/A3,tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GBJS01005204.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-sanguineus_XP-037517942.2_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLVGTGLGGPGLVGGGVVGNPGL,K19461;K13184;K12741;K24317,Lymphocryptovirus nuclear antigen 1;ATP-dependent RNA helicase A [EC:3.6.4.13];heterogeneous nuclear ribonucleoprotein A1/A3;germinal-center associated nuclear protein [EC:2.3.1.48],tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GINV01008730.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-microplus_XP-037272456.1_start28_end68,Signal peptide,GGLLGAGLGGYGAGVGGAGLVGAGVGGPGLVGAGVGGPGLV,K19461,Lymphocryptovirus nuclear antigen 1,tick support,Hyalomma-asiaticum_KAH6923440.1_start24_end69,GBJS01004457.1.p1_start14_end54
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-sanguineus_XP-037524803.1_start29_end58,Signal peptide,GGLLGAGLGGYGAGVGGPGLVGAGLGGVGL,K13914;K15047;K19461;K07344,statherin;heterogeneous nuclear ribonucleoprotein U-like protein 1;Lymphocryptovirus nuclear antigen 1;type IV secretion system protein TrbL,tick support,Hyalomma-asiaticum_KAH6923440.1_start24_end69,GEDV01015390.1.p1_start48_end77
cleavage,Medium,Poor,OG0000880,0.3140077,Hyalomma-asiaticum_KAH6923445.1_start29_end58,Signal peptide,GGLLGAGLGGYGGGLGGPGLVGAGLGGVGL,,,tick support,Hyalomma-asiaticum_KAH6923440.1_start24_end69,GFGI01047205.1.p1_start29_end60
cleavage,Medium,Poor,OG0000880,0.3140077,Dermacentor-silvarum_XP-037559871.1_start39_end87,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGIGGPGLVGGGVVGNPALVGAGLGQGVG,K12051;K13090;K19461;K07344;K13184,ComB7 competence protein;interleukin enhancer-binding factor 3;Lymphocryptovirus nuclear antigen 1;type IV secretion system protein TrbL;ATP-dependent RNA helicase A [EC:3.6.4.13],tick support,Hyalomma-asiaticum_KAH6947073.1_start69_end117,GINV01004785.1.p1_start33_end81
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-microplus_XP-037269422.1_start39_end87,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGNPALVGGGLGHGVG,K19461;K24317,Lymphocryptovirus nuclear antigen 1;germinal-center associated nuclear protein [EC:2.3.1.48],tick support,Hyalomma-asiaticum_KAH6947073.1_start69_end117,GBJT01016707.1.p1_start62_end110
cleavage,Medium,Poor,OG0000880,0.3140077,Rhipicephalus-sanguineus_XP-037517950.2_start44_end92,Signal peptide,GGVLGGLGGYGAGVGPGLVGSGLGGPGLVGGGVVGNPALVGAGLGHGVG,K13090;K13184;K07344;K19461;K13982,interleukin enhancer-binding factor 3;ATP-dependent RNA helicase A [EC:3.6.4.13];type IV secretion system protein TrbL;Lymphocryptovirus nuclear antigen 1;probable ATP-dependent RNA helicase DDX4 [EC:3.6.4.13],tick support,Hyalomma-asiaticum_KAH6947073.1_start69_end117,GINV01004785.1.p1_start33_end81
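
Two quick sanity checks can be run on a table like the one above. The sketch below assumes that the `_startN_endM` suffix in `peptide_id` gives 1-based, inclusive coordinates on the parent protein, which is consistent with the rows shown but is not stated elsewhere in this notebook. It compares the length implied by the coordinates with the length of `protein_sequence`, and computes the glycine fraction behind the "glycine-rich" observation. The toy rows are copied from the output above; the derived column names are made up for the example.

```r
library(dplyr)
library(stringr)

# Three rows copied from the output table above.
toy <- tribble(
  ~peptide_id, ~protein_sequence,
  "Dermacentor-andersoni_XP-050051547.1_start39_end77", "GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGSPAL",
  "Hyalomma-asiaticum_KAH6923445.1_start29_end58",      "GGLLGAGLGGYGGGLGGPGLVGAGLGGVGL",
  "Dermacentor-silvarum_XP-037559871.1_start39_end87",  "GGVLGGLGGYGAGVGPGLVGAGIGGPGLVGGGVVGNPALVGAGLGQGVG"
)

# Capture the start and end coordinates from the end of each peptide ID.
coords <- str_match(toy$peptide_id, "_start(\\d+)_end(\\d+)$")

checked <- toy %>%
  mutate(
    start = as.integer(coords[, 2]),
    end = as.integer(coords[, 3]),
    # Assumption: coordinates are 1-based and inclusive.
    expected_length = end - start + 1L,
    observed_length = nchar(protein_sequence),
    lengths_agree = expected_length == observed_length,
    # Fraction of residues that are glycine.
    glycine_fraction = str_count(protein_sequence, "G") / observed_length
  )

checked %>%
  select(peptide_id, expected_length, observed_length, lengths_agree, glycine_fraction)
```

Rows where `lengths_agree` is `FALSE` would point to a peptide name and sequence that do not belong together.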


In [28]:
sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS/LAPACK: /Users/taylorreiter/miniconda3/envs/tidyjupyter/lib/libopenblasp-r0.3.26.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.0   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] bit_4.0.5        gtable_0.3.4     jsonlite_1.8.8   compiler_4.3.3  
 [5] crayon_1.5.2     tidyselect_1.2.0 IRdisplay_1.1    parallel_4.3.3  
 [9] scales_1.3.0     uuid_1.2-0       fastmap_1.1.1    IRkernel_1.3.2  
[13] R6_2.5.1         generics_0.1.3   munsell_0.5.1  