# Selecting peptides for experimental validation

This notebook proposes pools of peptides to be experimentally validated using a mouse scratch assay.
It primarily relies on the following metadata about each peptide:
* Ease of synthesis
* Solubility (hydrophilicity)
* Orthogroup the peptide belongs to, and the statistical association of that orthogroup with itch suppression
* The sequence of the peptide itself
* What other peptides the peptide clustered with (mmseqs2 80% identity)
* Whether the peptides had matches in tick salivary gland transcriptomes


In the case of sORFs, we also investigated the parent protein annotation (if available, `traitmapping_egg_Description`, `traitmapping_KO`, `traitmapping_KO_definition`).
If the annotation strongly suggests that the sORF is a housekeeping gene and that the protein does not have a secretion signal (signal peptide, `traitmapping_deepsig_feature`, `traitmapping_deepsig_start`, `traitmapping_deepsig_end`, `deepsig_combined`), we removed the peptide/orthogroup from consideration.
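The exclusion rule above can be sketched in a few lines of base R. This is only an illustration: the rows are abridged from tables later in this notebook, and `housekeeping_terms` is a hypothetical keyword list standing in for the manual review of annotations that was actually done.

```r
# Toy rows shaped like the notebook's metadata (values abridged from later tables).
sorfs <- data.frame(
  peptide_id = c("Dermacentor-andersoni_XP-050049593.1",
                 "Amblyomma-sculptum_GEEX01004552.1.p1",
                 "Amblyomma-americanum_evm.model.contig-129979-1.1"),
  deepsig_combined = c(
    "Chain; 1; 97; .; evidence=ECO:0000256",
    "Signal peptide; 1; 20; 0.95; evidence=ECO:0000256 | Chain; 21; 99; .; evidence=ECO:0000256",
    "Chain; 1; 79; .; evidence=ECO:0000256"),
  traitmapping_KO_definition = c(
    "dynein light chain roadblock-type",
    "U3 small nucleolar RNA-associated protein 12;proteoglycan 3",
    NA),
  stringsAsFactors = FALSE
)

# ASSUMPTION: hypothetical keyword list; the notebook's decisions were made by reading annotations.
housekeeping_terms <- c("dynein light chain", "ribosomal protein")

# A signal peptide is recorded as a "Signal peptide" feature in deepsig_combined.
has_signal <- grepl("Signal peptide", sorfs$deepsig_combined, fixed = TRUE)

# Treat a missing annotation as "no evidence of housekeeping".
annotation <- ifelse(is.na(sorfs$traitmapping_KO_definition), "", sorfs$traitmapping_KO_definition)
looks_housekeeping <- grepl(paste(housekeeping_terms, collapse = "|"), annotation, ignore.case = TRUE)

# Exclude only when both conditions hold: housekeeping annotation AND no signal peptide.
sorfs$exclude <- looks_housekeeping & !has_signal
sorfs[, c("peptide_id", "exclude")]
```

Requiring both conditions keeps sORFs with missing or ambiguous annotations in play, which matches the conservative wording of the rule.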

## Notebook setup

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ───────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
setwd("..")

## Try filtering on 50% of proteins in the orthogroup having a peptide predicted from them

In [3]:
predictions <- read_tsv("outputs/notebooks/predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(fraction_of_orthogroup_with_predicted_peptide >= 0.5) %>%
  arrange(desc(traitmapping_coefficient))

This removes both candidates with chelicerate support, so I don't think this is the "correct" filter to apply.
Ticks and other itch-suppressing chelicerates could have evolved to make a peptide when the rest of the orthogroup didn't, so I think it is better to filter on the absolute number of predicted peptides in the orthogroup.
I'm going to keep orthogroups with more than 10 predicted peptides, a somewhat arbitrary cutoff.
(We see below that this cutoff actually gives us a minimum of 15 predicted peptides per orthogroup.)
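To make the difference between the two filters concrete, here is a small sketch on made-up numbers. The orthogroup names and values are hypothetical; only the column names and the two thresholds come from this notebook.

```r
# Hypothetical orthogroup-level summary (made-up values, real column names).
orthogroups <- data.frame(
  traitmapping_orthogroup = c("OG_A", "OG_B", "OG_C"),
  fraction_of_orthogroup_with_predicted_peptide = c(0.75, 0.15, 0.33),
  num_predicted_peptides = c(15, 30, 4),
  stringsAsFactors = FALSE
)

# Filter 1: at least half of the proteins in the orthogroup yield a predicted peptide.
kept_by_fraction <- orthogroups$traitmapping_orthogroup[
  orthogroups$fraction_of_orthogroup_with_predicted_peptide >= 0.5]

# Filter 2: more than 10 predicted peptides in the orthogroup, regardless of fraction.
kept_by_count <- orthogroups$traitmapping_orthogroup[
  orthogroups$num_predicted_peptides > 10]

# Orthogroups rescued by switching from the fraction filter to the count filter.
setdiff(kept_by_count, kept_by_fraction)
```

In this toy case `OG_B` is a large orthogroup in which only a minority of proteins carry a predicted peptide, which is the scenario described above: the fraction filter drops it, while the count filter keeps it.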

In [4]:
predictions <- read_tsv("outputs/notebooks/predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(num_predicted_peptides > 10) %>%
  arrange(desc(traitmapping_coefficient))

print(paste("num predicted peptides:", nrow(predictions)))
print(paste("num orthogroups:", length(unique(predictions$traitmapping_orthogroup))))
print(paste("smallest number of peptides predicted in an orthogroup:", min(predictions$num_predicted_peptides)))

[1] "num predicted peptides: 246"
[1] "num orthogroups: 10"
[1] "smallest number of peptides predicted in an orthogroup: 15"


## Combine with solubility data

In [5]:
# read in solubility and ease of synthesis data
# This is from a web application by genscript.
# https://www.genscript.com/tools/peptide%2danalyzing%2dtool
synthesis_and_solubility <- read_csv("outputs/notebooks/tmp.csv", show_col_types = FALSE) %>%
  distinct() %>%
  filter(sequence %in% predictions$protein_sequence)
table(synthesis_and_solubility$difficulty_level, synthesis_and_solubility$hydrophilicity)

           
            Good Poor
  Difficult    0   25
  Easy        72   27
  Medium       5  113

In [6]:
predictions <- left_join(predictions, synthesis_and_solubility, by = c("protein_sequence" = "sequence"))
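Because the synthesis and solubility table comes from a separate web tool, it may be worth checking that every prediction found a match in the join and that no sequence appears twice in the lookup table. The sketch below uses base R `merge()` on toy data; the `pep*` identifiers and the second and third sequences are made up for illustration.

```r
# Toy stand-ins for `predictions` and `synthesis_and_solubility`.
predictions_toy <- data.frame(
  peptide_id = c("pep1", "pep2", "pep3"),
  protein_sequence = c("NGAISGAVGAAVANLINKG", "ACDEFGHIK", "LMNPQRSTVW"),
  stringsAsFactors = FALSE
)
synthesis_toy <- data.frame(
  sequence = c("NGAISGAVGAAVANLINKG", "ACDEFGHIK"),
  difficulty_level = c("Medium", "Easy"),
  hydrophilicity = c("Poor", "Good"),
  stringsAsFactors = FALSE
)

# Base R equivalent of a left join keyed on the peptide sequence.
joined <- merge(predictions_toy, synthesis_toy,
                by.x = "protein_sequence", by.y = "sequence", all.x = TRUE)

# Predictions with no synthesis/solubility record end up with NA in the joined columns.
unmatched <- joined$peptide_id[is.na(joined$difficulty_level)]

# Duplicate sequences in the lookup table would silently multiply rows in the join.
duplicated_seqs <- synthesis_toy$sequence[duplicated(synthesis_toy$sequence)]

unmatched
duplicated_seqs
```

Any peptide listed in `unmatched` would be dropped by the later `difficulty_level %in% c("Easy", "Medium")` filters without warning, so it is useful to know about it up front.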

In [7]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Difficult,Poor,OG0008102,0.90567434,tick support,11
Easy,Good,OG0008102,0.90567434,tick support,1
Medium,Poor,OG0008102,0.90567434,tick support,6
Difficult,Poor,OG0013943,0.79343515,tick support,9
Easy,Poor,OG0013943,0.79343515,tick support,5
Medium,Poor,OG0013943,0.79343515,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Poor,OG0007769,0.3802378,tick support,11
Medium,Poor,OG0007769,0.3802378,tick support,9


In [8]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient)) %>%
  filter(difficulty_level %in% c("Easy", "Medium"))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Easy,Good,OG0008102,0.90567434,tick support,1
Medium,Poor,OG0008102,0.90567434,tick support,6
Easy,Poor,OG0013943,0.79343515,tick support,5
Medium,Poor,OG0013943,0.79343515,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Poor,OG0007769,0.3802378,tick support,11
Medium,Poor,OG0007769,0.3802378,tick support,9
Easy,Good,OG0000231,0.33988501,chelicerate support,6
Easy,Poor,OG0000231,0.33988501,chelicerate support,5


In [9]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient)) %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(hydrophilicity == "Good")

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Easy,Good,OG0008102,0.90567434,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Good,OG0000231,0.33988501,chelicerate support,6
Easy,Good,OG0000305,0.08832735,tick support,19
Easy,Good,OG0000354,0.08164998,tick support,9
Medium,Good,OG0000354,0.08164998,tick support,1
Easy,Good,OG0000335,0.07863746,tick support,16
Easy,Good,OG0001002,0.07861881,tick support,9
Medium,Good,OG0001002,0.07861881,tick support,1


## Establish pools

## For all pools:
* Don't pursue things that are difficult to synthesize. We have enough mediums and easies that we don't need to go that route.

## Pool Summary

In [53]:
pool1_names <- c("Amblyomma-americanum_evm.model.contig-129979-1.1", 
                 "Amblyomma-americanum_evm.model.contig-245149-1.2", # NEEDS THE SIGNAL PEPTIDE CLEAVED
                 "Amblyomma-sculptum_GEEX01004552.1.p1", # NEEDS THE SIGNAL PEPTIDE CLEAVED
                 "Rhipicephalus-microplus_XP-037271377.1_start70_end114",
                 "Rhipicephalus-microplus_XP-037271378.1_start78_end115",
                 "Dermacentor-andersoni_XP-054924338.1_start87_end106")

pool2_names <- c("Rhipicephalus-microplus_XP-037282321.1_start56_end95",
                 "Dermacentor-andersoni_XP-054918570.1") # NEEDS THE SIGNAL PEPTIDE CLEAVED

pool3_names <- c("Rhipicephalus-sanguineus_XP-037515628.1_start185_end221") 

pool4_names <- c("Hyalomma-asiaticum_KAH6922907.1_start146_end184",
                 "Amblyomma-americanum_evm.model.contig-114601-1.1",
                 "Haemaphysalis-longicornis_KAH9364597.1",
                 "Haemaphysalis-longicornis_KAH9364993.1",
                 "Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR",
                 "Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR",
                 "Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR") # NEEDS SIGNAL PEPTIDE CLEAVED

pool5_names <- c()

In [55]:
all_names <- c(pool1_names, pool2_names, pool3_names, pool4_names, pool5_names)
length(all_names)

In [61]:
cleaved <- data.frame(peptide_id = c("Amblyomma-americanum_evm.model.contig-245149-1.2", "Amblyomma-sculptum_GEEX01004552.1.p1",
                                     "Dermacentor-andersoni_XP-054918570.1", "Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR"),
                      cleaved_protein_sequence = c("AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC",
                                                   "ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA",
                                                   "GYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR",
                                                   "RPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP"),
                     cleaved_difficulty_level = c("easy", "easy", "medium", "medium"),
                     cleaved_hydrophilicity = c("good", "good", "poor", "poor"))
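The cleaved sequences above are entered by hand. As a cross-check, the mature sequence can also be derived from the signal peptide coordinates recorded in `deepsig_combined`. The helper below is a sketch: `cleave_signal_peptide` is a hypothetical name, it handles one sequence at a time, and it assumes the `Signal peptide; start; end; score; ...` layout seen in this notebook's tables.

```r
# Return the mature sequence implied by a DeepSig-style annotation string.
# If no signal peptide is annotated, the input sequence is returned unchanged.
cleave_signal_peptide <- function(protein_sequence, deepsig_combined) {
  m <- regmatches(deepsig_combined,
                  regexec("Signal peptide; *([0-9]+); *([0-9]+)", deepsig_combined))[[1]]
  if (length(m) == 0) {
    return(protein_sequence)
  }
  signal_end <- as.integer(m[3])  # m[1] is the full match, m[2] the start, m[3] the end
  substring(protein_sequence, signal_end + 1)
}

# Example taken from this notebook: Amblyomma-americanum_evm.model.contig-245149-1.2
full <- paste0("MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQ",
               "FVQALIIGIIAAVAGTATSAAVSAAIKC")
deepsig <- "Signal peptide; 1; 20; 0.79; evidence=ECO:0000256 | Chain; 21; 100; .; evidence=ECO:0000256"

mature <- cleave_signal_peptide(full, deepsig)
nchar(full) - nchar(mature)   # number of residues removed
startsWith(mature, "AAMNSPKK")
```

Comparing the output of such a helper with `cleaved$cleaved_protein_sequence` would flag any hand-entered sequence that does not agree with the predicted cleavage site.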

In [65]:
predictions %>%
  filter(peptide_id %in% all_names) %>%
  left_join(cleaved, by = "peptide_id") %>%
  arrange(desc(traitmapping_orthogroup)) %>%
  select(-start, -end, -nlpprecursor_class_score, -nlpprecursor_cleavage_score, -traitmapping_model, -traitmapping_profile_type)

peptide_id,peptide_type,peptide_class,prediction_tool,protein_sequence,peptide_length,locus_tag,traitmapping_cluster,traitmapping_orthogroup,traitmapping_signif_level,⋯,sgpeptide_blast_qend,sgpeptide_blast_sstart,sgpeptide_blast_send,sgpeptide_blast_evalue,sgpeptide_blast_bitscore,difficulty_level,hydrophilicity,cleaved_protein_sequence,cleaved_difficulty_level,cleaved_hydrophilicity
<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
Rhipicephalus-microplus_XP-037282321.1_start56_end95,cleavage,Peptide,deeppeptide,HIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,40,Rhipicephalus-microplus_XP-037282321.1,cluster_46,OG0013943,No,⋯,15.0,1.0,15.0,8.25e-06,37.0,Medium,Poor,,,
Dermacentor-andersoni_XP-054918570.1,sORF,sORF,less_than_100aa,MHLYWVLLAAALATGVAAGGYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,100,Dermacentor-andersoni_XP-054918570.1,cluster_46,OG0013943,No,⋯,70.0,1.0,70.0,6.37e-16,66.6,Difficult,Poor,GYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,medium,poor
Dermacentor-andersoni_XP-054924338.1_start87_end106,cleavage,CLASS_II_LANTIPEPTIDE,nlpprecursor,NGAISGAVGAAVANLINKG,19,Dermacentor-andersoni_XP-054924338.1,cluster_33,OG0008102,Yes,⋯,,,,,,Medium,Poor,,,
Rhipicephalus-microplus_XP-037271377.1_start70_end114,cleavage,Peptide,deeppeptide,IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL,45,Rhipicephalus-microplus_XP-037271377.1,cluster_33,OG0008102,Yes,⋯,44.0,1.0,43.0,7.23e-16,62.8,Medium,Poor,,,
Rhipicephalus-microplus_XP-037271378.1_start78_end115,cleavage,Peptide,deeppeptide,VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF,38,Rhipicephalus-microplus_XP-037271378.1,cluster_33,OG0008102,Yes,⋯,38.0,1.0,38.0,1.24e-20,74.3,Medium,Poor,,,
Amblyomma-americanum_evm.model.contig-129979-1.1,sORF,sORF,less_than_100aa,MNSPKKTLEGGKELQKKIYDAVMNNSEDIIAAVRNMKSSMDNTGDETDEQFIGAVITAVVSTAAAAAVEAGVEAAIKRG,79,Amblyomma-americanum_evm.model.contig-129979-1.1,cluster_33,OG0008102,Yes,⋯,48.0,3.0,50.0,3.05e-11,52.8,Easy,Good,,,
Amblyomma-americanum_evm.model.contig-245149-1.2,sORF,sORF,less_than_100aa,MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,100,Amblyomma-americanum_evm.model.contig-245149-1.2,cluster_33,OG0008102,Yes,⋯,72.0,1.0,52.0,6.2899999999999995e-30,100.0,Medium,Poor,AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,easy,good
Amblyomma-sculptum_GEEX01004552.1.p1,sORF,sORF,less_than_100aa,MKAYLILALVILGHLSQIHAATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,99,Amblyomma-sculptum_GEEX01004552.1.p1,cluster_33,OG0008102,Yes,⋯,72.0,1.0,52.0,1.06e-11,54.7,Medium,Poor,ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,easy,good
Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,cleavage,Peptide,deeppeptide,AAGYGAAGLYGGLGGRGVGVYAGGAGGGLLGKHGGWH,37,Rhipicephalus-sanguineus_XP-037515628.1,cluster_46,OG0007769,No,⋯,37.0,1.0,37.0,8.43e-12,52.0,Easy,Poor,,,
Hyalomma-asiaticum_KAH6922907.1_start146_end184,cleavage,LASSO_PEPTIDE,nlpprecursor,TSRRTAANTTRDYVDKVYVWTVDKPCTMRRFMRYVPRW,38,Hyalomma-asiaticum_KAH6922907.1,cluster_33,OG0000231,Yes,⋯,,,,,,Easy,Good,,,


## POOL 1 (6 peptides)

* Orthogroup `OG0008102` has the highest trait mapping coefficient, meaning it has the most promising statistical support for itch suppression when considering the presence of peptides in the group.

* **sORFs**
    * Only 1 peptide in the group is "Easy" to synthesize with "Good" hydrophilicity. It is an sORF that doesn't have a signal peptide, so there is less support that it would get where it needs to go to have anti-itch activity. Because it's easy to synthesize, though, we can include it anyway. This would be the first peptide I would exclude from this pool if we need to make the pool smaller.
        * `Amblyomma-americanum_evm.model.contig-129979-1.1`: MNSPKKTLEGGKELQKKIYDAVMNNSEDIIAAVRNMKSSMDNTGDETDEQFIGAVITAVVSTAAAAAVEAGVEAAIKRG 
    * The two other sORFs are "Medium" to synthesize with "Poor" hydrophilicity. When we cleave their signal peptides, they become easy to synthesize with good hydrophilicity.
        * `Amblyomma-americanum_evm.model.contig-245149-1.2`: AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC (easy, good)
        * `Amblyomma-sculptum_GEEX01004552.1.p1`: ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA (easy, good)
    * All three sORFs match `Transcript_929497.p2_start21_end72`, which hits `petxwholefemale_TRINITY_DN4020_c0_g1_i1`. That is a whole-body transcriptome, not a salivary gland transcriptome (though "whole" does contain salivary glands too). The cleavage peptides in the same orthogroup have hits to an actual salivary gland (sg) transcriptome, so I think this hit is ok.
* **cleavage**
    * All of the sORFs have good solubility, so we can pick at least one cleavage peptide to put in this pool and it shouldn't cause aggregation problems.
    * Since the cleavage peptides are somewhat diverse, I think it could be good to synthesize a couple and check if they have aggregation problems with the pool. 
    * None of the cleavage peptides cluster together (mmseqs2), so we can't use clustering to drive our selection.
    * Two peptides have hits to tick sg transcriptomes so I think these two would be the best to move forward with.
        * `Rhipicephalus-microplus_XP-037271377.1_start70_end114` (IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL)
        * `Rhipicephalus-microplus_XP-037271378.1_start78_end115` (VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF)
    * I think we should throw in `Dermacentor-andersoni_XP-054924338.1_start87_end106` (NGAISGAVGAAVANLINKG) as a bonus peptide because it's from a different species.
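The salivary gland criterion used for the cleavage peptides can be expressed as a filter on `sgpeptide_blast_sseqid`. The rows below are copied from the `OG0008102` table in this notebook. Treating subject IDs that start with `Transcript_` as whole-body hits is an assumption made for this sketch, based on the note about `Transcript_929497` above.

```r
# Cleavage peptides from OG0008102 with their best salivary-gland-peptide BLAST subject.
og <- data.frame(
  peptide_id = c("Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100",
                 "Dermacentor-andersoni_XP-054924338.1_start87_end106",
                 "Rhipicephalus-microplus_XP-037271377.1_start70_end114",
                 "Rhipicephalus-microplus_XP-037271378.1_start78_end115"),
  sgpeptide_blast_sseqid = c("Transcript_929497.p2_start64_end100",
                             NA,
                             "GIKN01002979.1.p1_start91_end134",
                             "GIKN01002127.1.p1_start100_end137"),
  stringsAsFactors = FALSE
)

has_hit <- !is.na(og$sgpeptide_blast_sseqid)

# ASSUMPTION: the "Transcript_" prefix marks the whole-body assembly rather than a
# salivary gland transcriptome. `has_hit &` guards against NA from startsWith().
from_whole_body <- has_hit & startsWith(og$sgpeptide_blast_sseqid, "Transcript_")

sg_supported <- og$peptide_id[has_hit & !from_whole_body]
sg_supported
```

On these rows the filter keeps the two *Rhipicephalus microplus* peptides, which are the two selected above; the *Dermacentor andersoni* peptide is added for species diversity rather than for a salivary gland hit.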

In [10]:
OG0008102 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(difficulty_level %in% c("Easy", "Medium"))

OG0008102

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0008102,VGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Transcript_929497.p2_start64_end100
Medium,Poor,cleavage,OG0008102,NGAISGAVGAAVANLINKG,tick support,Dermacentor-andersoni_XP-054924338.1_start87_end106,Dermacentor-andersoni_XP-054924338.1_start87_end106,
Medium,Poor,cleavage,OG0008102,IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL,tick support,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Rhipicephalus-microplus_XP-037271377.1_start70_end114,GIKN01002979.1.p1_start91_end134
Medium,Poor,cleavage,OG0008102,VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF,tick support,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Rhipicephalus-microplus_XP-037271378.1_start78_end115,GIKN01002127.1.p1_start100_end137
Easy,Good,sORF,OG0008102,MNSPKKTLEGGKELQKKIYDAVMNNSEDIIAAVRNMKSSMDNTGDETDEQFIGAVITAVVSTAAAAAVEAGVEAAIKRG,tick support,Amblyomma-americanum_evm.model.contig-129979-1.1,Amblyomma-americanum_evm.model.contig-129979-1.1,Transcript_929497.p2_start21_end72
Medium,Poor,sORF,OG0008102,MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2,Amblyomma-americanum_evm.model.contig-245149-1.2,Transcript_929497.p2_start21_end72
Medium,Poor,sORF,OG0008102,MKAYLILALVILGHLSQIHAATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,tick support,Amblyomma-sculptum_GEEX01004552.1.p1,Amblyomma-sculptum_GEEX01004552.1.p1,Transcript_929497.p2_start21_end72


In [13]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,deepsig_combined,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Chain; 1; 36; .; evidence=ECO:0000256,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase"
cleavage,Dermacentor-andersoni_XP-054924338.1_start87_end106,Chain; 1; 19; .; evidence=ECO:0000256,K04062,osmotically inducible lipoprotein OsmB
cleavage,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Chain; 1; 45; .; evidence=ECO:0000256,K23240;K23759,"FrmR/RcnR family transcriptional regulator, repressor of rcnA expression;protein S100-A2"
cleavage,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Chain; 1; 38; .; evidence=ECO:0000256,,
sORF,Amblyomma-americanum_evm.model.contig-129979-1.1,Chain; 1; 79; .; evidence=ECO:0000256,,
sORF,Amblyomma-americanum_evm.model.contig-245149-1.2,Signal peptide; 1; 20; 0.79; evidence=ECO:0000256 | Chain; 21; 100; .; evidence=ECO:0000256,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase"
sORF,Amblyomma-sculptum_GEEX01004552.1.p1,Signal peptide; 1; 20; 0.95; evidence=ECO:0000256 | Chain; 21; 99; .; evidence=ECO:0000256,K14556;K25722;K17263;K21503;K13052,U3 small nucleolar RNA-associated protein 12;proteoglycan 3;cullin-associated NEDD8-dissociated protein 1;Escherichia phage dCTP pyrophosphatase [EC:3.6.1.12];cell division protein DivIC


## POOL 2 (2 peptides)

* Orthogroup `OG0013943` has the second-highest trait mapping coefficient.
* Like most predictions in the previous orthogroup, all of the predictions here have poor solubility.
* **Cleavage**
     * Most cleavage sequences cluster together. I think we could pick one or two and try them out.
         * The representative sequence for most of them is `Dermacentor-andersoni_XP-054918570.1_start58_end100`, so we should start with a member of that cluster. `Rhipicephalus-microplus_XP-037282321.1_start56_end95` is part of this cluster and had a hit to peptides predicted from tick salivary glands, so this is probably the best one to move forward with.
         * `Rhipicephalus-microplus_XP-037282321.1_start56_end95` (HIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW)
* **sORF**
    * This orthogroup also had three sORF peptides that are difficult to synthesize and have poor solubility, but that have signal peptides. When the signal peptides are cleaved off, they become "medium" to synthesize, still with poor solubility.
    * Two cluster with `Dermacentor-andersoni_XP-054918570.1` (mmseqs2 80%). These two also had hits to tick salivary gland-predicted peptides.
    * Move forward with `Dermacentor-andersoni_XP-054918570.1`: GYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR (medium, poor)
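The cluster-based choice for the cleavage peptides can be reproduced with a short tabulation. The rows are the Easy/Medium cleavage peptides of `OG0013943` as listed in this notebook; the selection logic (largest mmseqs2 cluster, then members with a salivary gland hit) is my reading of the reasoning above, not a fixed rule.

```r
# Easy/Medium cleavage peptides from OG0013943 with their mmseqs2 cluster representative.
rep_da <- "Dermacentor-andersoni_XP-054918570.1_start58_end100"
og <- data.frame(
  peptide_id = c("Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167",
                 "Dermacentor-andersoni_XP-054918570.1_start58_end100",
                 "Dermacentor-silvarum_XP-049511149.1_start58_end96",
                 "Haemaphysalis-longicornis_KAH9362006.1_start68_end106",
                 "Rhipicephalus-microplus_XP-037282321.1_start56_end95",
                 "Rhipicephalus-sanguineus_XP-037511163.1_start64_end102"),
  mmseqs2_representative_sequence = c(
    "Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167",
    rep_da, rep_da,
    "Haemaphysalis-longicornis_KAH9362006.1_start68_end106",
    rep_da, rep_da),
  sgpeptide_blast_sseqid = c(NA, NA, NA, NA, "GBJS01028274.1.p1_start72_end105", NA),
  stringsAsFactors = FALSE
)

# Size of each mmseqs2 cluster, and the representative of the largest one.
cluster_sizes <- table(og$mmseqs2_representative_sequence)
largest <- names(cluster_sizes)[which.max(cluster_sizes)]

# Members of the largest cluster that also have a salivary gland peptide hit.
candidates <- og$peptide_id[og$mmseqs2_representative_sequence == largest &
                              !is.na(og$sgpeptide_blast_sseqid)]
cluster_sizes
candidates
```

If several members of the largest cluster had hits, a further tie-breaker such as synthesis difficulty would be needed; here only one member qualifies.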

In [12]:
OG0013943 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0013943") %>%
  filter(difficulty_level %in% c("Easy", "Medium"))

OG0013943

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Poor,cleavage,OG0013943,ISHGYGGGYGGGGGYGGGGGYGGGYGGGGGFGGGYGGWR,tick support,Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167,Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167,
Easy,Poor,cleavage,OG0013943,VKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,tick support,Dermacentor-andersoni_XP-054918570.1_start58_end100,Dermacentor-andersoni_XP-054918570.1_start58_end100,
Easy,Poor,cleavage,OG0013943,VKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR,tick support,Dermacentor-silvarum_XP-049511149.1_start58_end96,Dermacentor-andersoni_XP-054918570.1_start58_end100,
Easy,Poor,cleavage,OG0013943,GYGGYGGGYGGGYGGYGGGYGGGYGGYGGGYGGGYGGWH,tick support,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,
Medium,Poor,cleavage,OG0013943,HIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,tick support,Rhipicephalus-microplus_XP-037282321.1_start56_end95,Dermacentor-andersoni_XP-054918570.1_start58_end100,GBJS01028274.1.p1_start72_end105
Easy,Poor,cleavage,OG0013943,VSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGYGGGYGW,tick support,Rhipicephalus-sanguineus_XP-037511163.1_start64_end102,Dermacentor-andersoni_XP-054918570.1_start58_end100,


In [16]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0013943") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,deepsig_combined,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167,Chain; 1; 39; .; evidence=ECO:0000256,K06872;K13098;K13344;K07605;K07604,"uncharacterized protein;RNA-binding protein FUS;peroxin-13;type II keratin, basic;type I keratin, acidic"
cleavage,Dermacentor-andersoni_XP-054918570.1_start58_end100,Chain; 1; 43; .; evidence=ECO:0000256,K13344;K06339;K06872;K12741;K13098,peroxin-13;spore coat protein T;uncharacterized protein;heterogeneous nuclear ribonucleoprotein A1/A3;RNA-binding protein FUS
cleavage,Dermacentor-silvarum_XP-049511149.1_start58_end96,Chain; 1; 39; .; evidence=ECO:0000256,K06339;K13344;K06872;K13098;K12741,spore coat protein T;peroxin-13;uncharacterized protein;RNA-binding protein FUS;heterogeneous nuclear ribonucleoprotein A1/A3
cleavage,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,Chain; 1; 39; .; evidence=ECO:0000256,K13344;K13098;K06872;K03102;K14651,peroxin-13;RNA-binding protein FUS;uncharacterized protein;squid;transcription initiation factor TFIID subunit 15
cleavage,Rhipicephalus-microplus_XP-037282321.1_start56_end95,Chain; 1; 40; .; evidence=ECO:0000256,,
cleavage,Rhipicephalus-sanguineus_XP-037511163.1_start64_end102,Chain; 1; 39; .; evidence=ECO:0000256,K13344;K12741;K06872;K13098;K14651,peroxin-13;heterogeneous nuclear ribonucleoprotein A1/A3;uncharacterized protein;RNA-binding protein FUS;transcription initiation factor TFIID subunit 15


In [49]:
# sorfs with signal peptides
predictions %>% filter(peptide_type == "sORF") %>%
  filter(grepl(pattern = "Signal", deepsig_combined)) %>%
  select(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, peptide_id, 
         deepsig_combined, protein_sequence, traitmapping_KO, traitmapping_KO_definition, 
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>%
  filter(traitmapping_orthogroup %in% c("OG0013943"))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,peptide_id,deepsig_combined,protein_sequence,traitmapping_KO,traitmapping_KO_definition,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Difficult,Poor,OG0013943,0.7934352,Dermacentor-andersoni_XP-054918570.1,Signal peptide; 1; 18; 0.98; evidence=ECO:0000256 | Chain; 19; 100; .; evidence=ECO:0000256,MHLYWVLLAAALATGVAAGGYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,K13344;K06339;K06872;K12741;K13098,peroxin-13;spore coat protein T;uncharacterized protein;heterogeneous nuclear ribonucleoprotein A1/A3;RNA-binding protein FUS,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Difficult,Poor,OG0013943,0.7934352,Dermacentor-silvarum_XP-049511149.1,Signal peptide; 1; 18; 0.97; evidence=ECO:0000256 | Chain; 19; 96; .; evidence=ECO:0000256,MHLYWVLLACALATGVAAGGYGHASVSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR,K06339;K13344;K06872;K13098;K12741,spore coat protein T;peroxin-13;uncharacterized protein;RNA-binding protein FUS;heterogeneous nuclear ribonucleoprotein A1/A3,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Difficult,Poor,OG0013943,0.7934352,Rhipicephalus-microplus_XP-037282321.1,Signal peptide; 1; 18; 0.92; evidence=ECO:0000256 | Chain; 19; 95; .; evidence=ECO:0000256,MHLYWVLLACALATGVTAGGYGHASISYVSKPVVRVGYVSKPVVTYVKQPVATVSHIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,,,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1


## Excluded pool

* The orthogroup with the next-highest coefficient, `OG0005246`, is easy or medium to synthesize and has good solubility.
* We exclude this pool because all of these are sORFs that have hits to a housekeeping gene (dynein light chain roadblock-type) and _do not_ have signal peptides.
* All of them had hits to tick salivary gland transcriptomes.

In [13]:
OG0005246 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0005246") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression")
OG0005246

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,evidence_of_itch_suppression,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQEGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYCLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Dermacentor-andersoni_XP-050049593.1,evidence of itch suppression,GKHV01001871.1
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQEGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYCLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Dermacentor-silvarum_XP-037556830.1,evidence of itch suppression,GKHV01001871.1
Easy,Good,sORF,OG0005246,MAEVEATSEPERGAENDCNEHRRNSYQDTLKEIDPPKELTFLQIYFGRNEIMVAPDKYYFLIVIQNPTE,chelicerate support,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,evidence of itch suppression,GFZD01010403.1
Easy,Good,sORF,OG0005246,MTSEVEDIFKKLKDQDGVVGVVVTTSEGAAIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYFLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Rhipicephalus-microplus_XP-037283166.1,evidence of itch suppression,GBJT01001074.1
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQDGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYFLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Rhipicephalus-sanguineus_XP-037500236.1,evidence of itch suppression,GBJT01001074.1


In [19]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0005246") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_egg_Description, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,deepsig_combined,traitmapping_egg_Description,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
sORF,Blomia-tropicalis_KAJ6221418.1,Chain; 1; 97; .; evidence=ECO:0000256,Dynein light chain,,
sORF,Dermacentor-andersoni_XP-050049593.1,Chain; 1; 97; .; evidence=ECO:0000256,Roadblock/LC7 domain,K10419,dynein light chain roadblock-type
sORF,Dermacentor-silvarum_XP-037556830.1,Chain; 1; 97; .; evidence=ECO:0000256,Roadblock/LC7 domain,K10419,dynein light chain roadblock-type
sORF,Euroglyphus-maynei_tr|A0A1Y3BIK5|A0A1Y3BIK5-EURMA,Chain; 1; 87; .; evidence=ECO:0000256,,K10419;K07131;K25221;K08480;K08121,dynein light chain roadblock-type;uncharacterized protein;[methyl-Co(III) glycine betaine-specific corrinoid protein]---tetrahydrofolate methyltransferase [EC:2.1.1.378];circadian clock protein KaiA;fibromodulin
sORF,Galendromus-occidentalis_XP-003740035.1,Chain; 1; 100; .; evidence=ECO:0000256,Ragulator complex protein LAMTOR5,,
sORF,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,Chain; 1; 69; .; evidence=ECO:0000256,Acts as one of several non-catalytic accessory components of the cytoplasmic dynein 1 complex that are thought to be involved in linking dynein to cargos and to adapter proteins that regulate dynein function. Cytoplasmic dynein 1 acts as a motor for the intracellular retrograde motility of vesicles and organelles along microtubules,K10419,dynein light chain roadblock-type
sORF,Limulus-polyphemus_XP-013775992.1,Chain; 1; 97; .; evidence=ECO:0000256,Dynein light chain,K10419;K16344,dynein light chain roadblock-type;ragulator complex protein LAMTOR5
sORF,Oppiella-nova_tr|A0A7R9MRX4|A0A7R9MRX4-9ACAR,Chain; 1; 64; .; evidence=ECO:0000256,,K10419;K00052;K21360;K07131,dynein light chain roadblock-type;3-isopropylmalate dehydrogenase [EC:1.1.1.85];3-isopropylmalate/methylthioalkylmalate dehydrogenase [EC:1.1.1.85 1.1.1.-];uncharacterized protein
sORF,Oppiella-nova_tr|A0A7R9M7S0|A0A7R9M7S0-9ACAR,Chain; 1; 99; .; evidence=ECO:0000256,dynein intermediate chain binding,K10419;K07131;K10647;K04370,dynein light chain roadblock-type;uncharacterized protein;midline 2 [EC:2.3.2.27];ragulator complex protein LAMTOR3
sORF,Phalangium-opilio_jg27177t1,Chain; 1; 98; .; evidence=ECO:0000256,Dynein light chain,K10419;K16344,dynein light chain roadblock-type;ragulator complex protein LAMTOR5


## POOL 3 (1 peptide)

* All predictions in `OG0007769` are cleavage peptides.
* They are easy or medium to synthesize, but all have poor solubility.
* Only some have hits to tick salivary gland transcriptomes, but those that do all clustered with `Rhipicephalus-sanguineus_XP-037515628.1_start185_end221`. I say we just move forward with that sequence.
* `Rhipicephalus-sanguineus_XP-037515628.1_start185_end221` (GGIGARGIGGVGLGGLGGAGLGVGLGAGTYGGGYRGGYGGYGGGYG)
* If there are no solubility/aggregation issues, pool 3 can be combined with pool 2.
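
The observation that the salivary gland hits fall under a single cluster representative can be checked with a tally like the one below. This is a minimal sketch on a toy table: the column names follow `predictions`, but the rows and IDs (`pep_a`, `rep_1`, `sg_hit_1`, and so on) are made up for illustration.

```r
library(dplyr)

# Toy stand-in for `predictions`; all IDs are hypothetical, column names follow the notebook.
toy <- tibble(
  peptide_id = c("pep_a", "pep_b", "pep_c", "pep_d"),
  mmseqs2_representative_sequence = c("rep_1", "rep_1", "rep_2", "rep_3"),
  sgpeptide_blast_sseqid = c("sg_hit_1", NA, NA, NA)
)

# Keep peptides with a salivary gland hit, then count them per cluster representative.
sg_hits_by_cluster <- toy %>%
  filter(!is.na(sgpeptide_blast_sseqid), sgpeptide_blast_sseqid != "") %>%
  count(mmseqs2_representative_sequence, name = "n_sg_hits")

sg_hits_by_cluster
```

If the result has a single row, every peptide with a salivary gland hit belongs to the same mmseqs2 cluster, which supports carrying only that cluster's representative forward.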

In [47]:
OG0007769 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0007769") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression") %>%
  select(-evidence_of_itch_suppression)
OG0007769

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0007769,RNLGGGYVGGGYGGLAAGLGGVALGGGLKGGLGVHHGGGFKGGYGGLYG,tick support,Amblyomma-americanum_evm.model.contig-126756-1.3_start14_end62,Amblyomma-americanum_evm.model.contig-126756-1.3_start14_end62,
Medium,Poor,cleavage,OG0007769,AYGVGGLGGYGLGGYGGGLGGLGGGVGVYRGAGGYGKYGAGGAGWW,tick support,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,Amblyomma-americanum_evm.model.contig-126756-1.3_start208_end253,
Easy,Poor,cleavage,OG0007769,GNLGGGFVGGGFGGLGAGLGGGVYGGGLGGGLGVHHGGG,tick support,Amblyomma-americanum_evm.model.contig-110937-1.5_start86_end124,Amblyomma-americanum_evm.model.contig-110937-1.5_start86_end124,
Medium,Poor,cleavage,OG0007769,AYGVGGLGGYGLGGYGGGVGGLGGGVGVYRGAGGYGKHGAGGAGWW,tick support,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,
Easy,Poor,cleavage,OG0007769,GYGGYGGGYGGLGAVGGVGGGYGVGGGLYGGAGVFRGVGGHGKHGHGWQ,tick support,Dermacentor-andersoni_XP-050026447.1_start226_end274,Dermacentor-andersoni_XP-050026447.1_start226_end274,
Medium,Poor,cleavage,OG0007769,GAGGLYGAGVARYGGAGLYGGLGGGGVGVYAGGAGVGVLGKHGGGVGWH,tick support,Dermacentor-andersoni_XP-054919892.1_start175_end223,Dermacentor-andersoni_XP-054919792.1_start424_end472,
Medium,Poor,cleavage,OG0007769,GAGGLYGAGVARYGGAGLYGGLGGRGVGVYGGGAGVGVLGKHGGGVGWH,tick support,Dermacentor-andersoni_XP-054919892.1_start175_end223,Dermacentor-andersoni_XP-054919892.1_start175_end223,
Medium,Poor,cleavage,OG0007769,GAAGYGSAGLYGGLGGRGVGVYAGGVGVHGKHGVGWH,tick support,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,Dermacentor-silvarum_XP-037564805.1_start275_end311,GFGI01009308.1.p2_start118_end154
Medium,Poor,cleavage,OG0007769,AAGYGGAGLYGGIGRGVGVYAGGRGVGVLGKHGGGWH,tick support,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,Dermacentor-silvarum_XP-049518524.1_start230_end266,
Easy,Poor,cleavage,OG0007769,GTLRGGFGGVGLGGLGGAGLGAGLGVGLGAGTYGGGYAGGYGGHGG,tick support,Rhipicephalus-sanguineus_XP-037519543.1_start18_end63,Dermacentor-silvarum_XP-049528796.1_start18_end63,


In [20]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0007769") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_egg_Description, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,deepsig_combined,traitmapping_egg_Description,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Amblyomma-americanum_evm.model.contig-126756-1.3_start14_end62,Chain; 1; 49; .; evidence=ECO:0000256,,,
cleavage,Amblyomma-americanum_evm.model.contig-126756-1.3_start208_end253,Chain; 1; 46; .; evidence=ECO:0000256,,,
cleavage,Amblyomma-americanum_evm.model.contig-110937-1.5_start86_end124,Chain; 1; 39; .; evidence=ECO:0000256,Cuticle protein,,
cleavage,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,Chain; 1; 46; .; evidence=ECO:0000256,Cuticle protein,,
cleavage,Dermacentor-andersoni_XP-050026447.1_start226_end274,Chain; 1; 49; .; evidence=ECO:0000256,,,
cleavage,Dermacentor-andersoni_XP-054919792.1_start424_end472,Chain; 1; 49; .; evidence=ECO:0000256,Insect cuticle protein,,
cleavage,Dermacentor-andersoni_XP-054919892.1_start175_end223,Chain; 1; 49; .; evidence=ECO:0000256,Insect cuticle protein,K13087,bcl2 associated transcription factor 1
cleavage,Dermacentor-silvarum_XP-037564805.1_start275_end311,Chain; 1; 37; .; evidence=ECO:0000256,Insect cuticle protein,K16599;K06675;K02110;K25448;K09043,tubulin polyglutamylase TTLL1 [EC:6.3.2.61];structural maintenance of chromosome 4;F-type H+-transporting ATPase subunit c;olfactomedin-like protein 1/3;AP-1-like transcription factor
cleavage,Dermacentor-silvarum_XP-049518524.1_start230_end266,Chain; 1; 37; .; evidence=ECO:0000256,Insect cuticle protein,K02110;K13344,F-type H+-transporting ATPase subunit c;peroxin-13
cleavage,Dermacentor-silvarum_XP-049528796.1_start18_end63,Chain; 1; 46; .; evidence=ECO:0000256,Insect cuticle protein,K19461;K07344;K12841;K12837;K13171,Lymphocryptovirus nuclear antigen 1;type IV secretion system protein TrbL;calcium homeostasis endoplasmic reticulum protein;splicing factor U2AF 65 kDa subunit;serine/arginine repetitive matrix protein 1


## POOL 4 (7 peptides)

* The peptides in `OG0000231` don't cluster well with each other, but many of them are easy to synthesize and have good solubility.
* Because of the good solubility, we can test them all in a single pool.

* **Cleavage**
   * There is only one cleavage peptide that is "easy" to synthesize and has "good" solubility.
       * Move forward with `Hyalomma-asiaticum_KAH6922907.1_start146_end184` (TSRRTAANTTRDYVDKVYVWTVDKPCTMRRFMRYVPRW) 
* **sORF**
   * Five sORF peptides are "easy" to synthesize and have "good" solubility. None of them have signal peptides. I also ran them through DeepPeptide, and none of them were predicted to be cleaved.
   * `Amblyomma-americanum_evm.model.contig-114601-1.1` has a hit to a tick salivary gland transcriptome predicted peptide (`Transcript_329793`, which is `peok72sgfemale_NODE_101988_length_152_cov_1.456311_g101129_i0`, a female salivary gland sample)
   * I think it's relatively low stakes to include the five peptides below even though they are sORFs without signal peptides:
       * MSNIQQRSTVVLLLREHLLPNNVLRGVDGIITNDPRRMARIMKEREFRNKLRPATIQDNPGICVPGRPAPRSVASQQLLEIDFLGDFNNSANGILHV	Amblyomma-americanum_evm.model.contig-114601-1.1
       * MGHMANTLHELKDLLAQGANSIEADVVFAPNGTAVKLNHEDGCDCDRNCNQETEIRRYLYFLKNAVSKGEKSKSSSVTLEFY Haemaphysalis-longicornis_KAH9364597.1
       * MDSLAKVGRAFADLKIYNHRWVGSGNTNCLPYLSGKYDRLKDIVACRDGLKSGCDFIDKGYAWTLDYESSIAREIK Haemaphysalis-longicornis_KAH9364993.1
       * MVNNITEINQFLDLGCNAVEADVKFIDAYPKNAFHGQPCDCDRYCDSSEDLAKYLNYVRKITTPEIAASGEVGHRK Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR
       * GITNCIGFLYPLIRLQALVQKRDECNIDKDPFCPRKVYQWTTDNQSRFRSILRMQVDGFITNYPNRLNEVLREPEFATKFRLATNRDNPWQIYK Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR
   * `Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR` is an sORF that _does_ have a signal peptide. Include this one as the only poor-solubility peptide in the pool.
       * `Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR` RPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP (SIGNAL PEPTIDE CLEAVED OFF, medium, poor)

In [15]:
OG0000231 <- predictions %>%
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0000231") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(hydrophilicity == "Good") %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression") %>%
  select(-evidence_of_itch_suppression)
OG0000231

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,cleavage,OG0000231,TSRRTAANTTRDYVDKVYVWTVDKPCTMRRFMRYVPRW,chelicerate support,Hyalomma-asiaticum_KAH6922907.1_start146_end184,Hyalomma-asiaticum_KAH6922907.1_start146_end184,
Easy,Good,sORF,OG0000231,MSNIQQRSTVVLLLREHLLPNNVLRGVDGIITNDPRRMARIMKEREFRNKLRPATIQDNPGICVPGRPAPRSVASQQLLEIDFLGDFNNSANGILHV,chelicerate support,Amblyomma-americanum_evm.model.contig-114601-1.1,Amblyomma-americanum_evm.model.contig-114601-1.1,Transcript_329793
Easy,Good,sORF,OG0000231,MGHMANTLHELKDLLAQGANSIEADVVFAPNGTAVKLNHEDGCDCDRNCNQETEIRRYLYFLKNAVSKGEKSKSSSVTLEFY,chelicerate support,Haemaphysalis-longicornis_KAH9364597.1,Haemaphysalis-longicornis_KAH9364597.1,
Easy,Good,sORF,OG0000231,MDSLAKVGRAFADLKIYNHRWVGSGNTNCLPYLSGKYDRLKDIVACRDGLKSGCDFIDKGYAWTLDYESSIAREIK,chelicerate support,Haemaphysalis-longicornis_KAH9364993.1,Haemaphysalis-longicornis_KAH9364993.1,
Easy,Good,sORF,OG0000231,MVNNITEINQFLDLGCNAVEADVKFIDAYPKNAFHGQPCDCDRYCDSSEDLAKYLNYVRKITTPEIAASGEVGHRK,chelicerate support,Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR,Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR,
Easy,Good,sORF,OG0000231,GITNCIGFLYPLIRLQALVQKRDECNIDKDPFCPRKVYQWTTDNQSRFRSILRMQVDGFITNYPNRLNEVLREPEFATKFRLATNRDNPWQIYK,chelicerate support,Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR,Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR,


In [51]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0000231") %>%
  filter(difficulty_level %in% c("Easy")) %>%
  filter(hydrophilicity == "Good") %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_KO, traitmapping_KO_definition)

peptide_type,peptide_id,deepsig_combined,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Hyalomma-asiaticum_KAH6922907.1_start146_end184,Chain; 1; 38; .; evidence=ECO:0000256,,
sORF,Amblyomma-americanum_evm.model.contig-114601-1.1,Chain; 1; 97; .; evidence=ECO:0000256,K01126;K18694,glycerophosphoryl diester phosphodiesterase [EC:3.1.4.46];phosphatidylglycerol phospholipase C [EC:3.1.4.-]
sORF,Haemaphysalis-longicornis_KAH9364597.1,Chain; 1; 82; .; evidence=ECO:0000256,,
sORF,Haemaphysalis-longicornis_KAH9364993.1,Chain; 1; 76; .; evidence=ECO:0000256,,
sORF,Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR,Chain; 1; 76; .; evidence=ECO:0000256,,
sORF,Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR,Chain; 1; 94; .; evidence=ECO:0000256,K01126;K18694;K22387,glycerophosphoryl diester phosphodiesterase [EC:3.1.4.46];phosphatidylglycerol phospholipase C [EC:3.1.4.-];lysophospholipase D [EC:3.1.4.39]


In [52]:
predictions %>% filter(peptide_type == "sORF") %>%
  filter(grepl(pattern = "Signal", deepsig_combined)) %>%
  select(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, peptide_id, 
         deepsig_combined, protein_sequence, traitmapping_KO, traitmapping_KO_definition, 
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>%
  filter(traitmapping_orthogroup %in% c("OG0000231"))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,peptide_id,deepsig_combined,protein_sequence,traitmapping_KO,traitmapping_KO_definition,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,OG0000231,0.339885,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,Signal peptide; 1; 20; 0.99; evidence=ECO:0000256 | Chain; 21; 93; .; evidence=ECO:0000256,MKYLLVFLLFYFQCKKTVQQRPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP,K01123,sphingomyelin phosphodiesterase D [EC:3.1.4.41],chelicerate support,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,


## POOL 5 (53 peptides)

Anything easy to synthesize, with good solubility, and in an orthogroup that wasn't included above.
These peptides had lower coefficients for association with itch suppression but would allow us to cast a wide net with our peptide predictions.
If synthesis costs are limiting, we could reduce the number of peptides in this bonus pool by:
* Selecting representatives from each orthogroup (4 peptides)
* Limiting to peptides with matches in tick salivary glands (5 peptides)
* Selecting based on representative sequences (49 peptides)
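
The three ways of shrinking the pool could be sketched as follows. This is an illustration on a toy table with hypothetical IDs, not the selection that was actually run. In particular, "one peptide per orthogroup" is taken here to mean the first row in each group, and "selecting based on representative sequences" is taken to mean one peptide per mmseqs2 cluster. Both are assumptions.

```r
library(dplyr)

# Toy stand-in for the filtered pool 5 candidates; all IDs are hypothetical.
toy <- tibble(
  peptide_id = c("pep_a", "pep_b", "pep_c", "pep_d", "pep_e"),
  traitmapping_orthogroup = c("OG_x", "OG_x", "OG_x", "OG_y", "OG_y"),
  mmseqs2_representative_sequence = c("pep_a", "pep_a", "pep_c", "pep_d", "pep_e"),
  sgpeptide_blast_sseqid = c(NA, "sg_hit_1", NA, NA, "sg_hit_2")
)

# Option 1: one peptide per orthogroup (assumption: simply the first row of each group).
per_orthogroup <- toy %>%
  group_by(traitmapping_orthogroup) %>%
  slice_head(n = 1) %>%
  ungroup()

# Option 2: only peptides with a match in a tick salivary gland transcriptome.
with_sg_hit <- toy %>%
  filter(!is.na(sgpeptide_blast_sseqid))

# Option 3: one peptide per mmseqs2 cluster (first row seen for each representative).
per_cluster <- toy %>%
  distinct(mmseqs2_representative_sequence, .keep_all = TRUE)

# Compare how many peptides each strategy keeps.
c(per_orthogroup = nrow(per_orthogroup),
  with_sg_hit = nrow(with_sg_hit),
  per_cluster = nrow(per_cluster))
```

Comparing the three counts side by side makes the trade-off between synthesis cost and breadth explicit before committing to one strategy.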

In [28]:
predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, traitmapping_coefficient, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>%
  arrange(desc(traitmapping_coefficient)) %>%
  filter(difficulty_level %in% c("Easy")) %>%
  filter(hydrophilicity == "Good") %>%
  filter(!traitmapping_orthogroup %in% c("OG0000231", "OG0007769", "OG0005246", "OG0013943", "OG0008102")) 

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,traitmapping_coefficient,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,evidence_of_itch_suppression,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,sORF,OG0000305,0.08832735,MVRTELLELSQEINTPRIVYRIDTLPAPTYGREEVVLVPPYYCKFKPTERVWSQLKGHFARRNRVTMNLKRFCQKHSRL,tick support,Haemaphysalis-longicornis_KAH9364540.1,Haemaphysalis-longicornis_KAH9364540.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKGDVARLNTDFRIGSMRKLLITAAENVSPDNWTKAVEQIIGIERRRLEVRGFSDHVEQTIISLGEEDDDRRKRSLLH,tick support,Haemaphysalis-longicornis_KAH9364717.1,Haemaphysalis-longicornis_KAH9364717.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MDKASYHSRRNEAVPTTNSLKGTITEWLDSKSIQYGSCADREAAAGDNCSSEATFHQLPSRHGCTDGRVYRGKAAVLPLRVQSY,tick support,Haemaphysalis-longicornis_KAH9367493.1,Haemaphysalis-longicornis_KAH9367493.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MSDVFRGKKTWAYHEEMDGPHFESWFDGVLQKLPSGRVIMWTTPPTAPSGKRQGQRTNSLKGTITEWLDSKDIQHGARLTKKQLLKIVA,tick support,Haemaphysalis-longicornis_KAH9371263.1,Haemaphysalis-longicornis_KAH9371263.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKLGVAARNSTFKLPYVEVLLREEVAKVTSQHWAKTVQHVISIETKFRGNGGASAYVQPIMIHLGEDMDSDSNLTAIESFRDV,tick support,Haemaphysalis-longicornis_KAH9380381.1,Haemaphysalis-longicornis_KAH9374347.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKRGVAPRNVTFKLSDVEVLLREEAAKVTAQHWVNAVQHVINIETKFMGDGGASVHVQPIIIHLDEDDMDSDSDLSGIESFEDL,tick support,Haemaphysalis-longicornis_KAH9374666.1,Haemaphysalis-longicornis_KAH9374666.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MRGLTTGLKKPSGKGQRLIVTHIGSEDGFVSGCLDIFRGTKTRDYHEMDGTRFERWFGAVLPQEHCTQHANNEYD,tick support,Haemaphysalis-longicornis_KAH9375042.1,Haemaphysalis-longicornis_KAH9375042.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKLGVAARNATFKLADVEVLLREEAAKVTAEHWANAVQHVINIETKFRGDGGASAHVQPIIIHLAEDDIDSDSDLSGIESFEDV,tick support,Haemaphysalis-longicornis_KAH9374666.1,Haemaphysalis-longicornis_KAH9376245.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKRGVASRSATFKLPHVEVLLREEVAKITAQHWANTVQHVISIETKFRGDGGASAHVQPIIIHLDEDDMDSDTNHTGEI,tick support,Haemaphysalis-longicornis_KAH9380381.1,Haemaphysalis-longicornis_KAH9380381.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MDNASYHSRRLETITTMSSRKPDILLADIKGTDDGSPNNWAKAVEHVIGIEDKRREARGFSDHVEPIIISLGEEDGDSCGSADDDLSGIEPME,tick support,Haemaphysalis-longicornis_KAH9382422.1,Haemaphysalis-longicornis_KAH9382422.1,evidence of itch suppression,


In [46]:
# look at annotation information
predictions %>%
  filter(!traitmapping_orthogroup %in% c("OG0000231", "OG0007769", "OG0005246", "OG0013943", "OG0008102")) %>%
  filter(difficulty_level %in% c("Easy")) %>%
  filter(hydrophilicity == "Good") %>%
  select(traitmapping_orthogroup, peptide_type, peptide_id, starts_with("deepsig"), traitmapping_egg_Description, traitmapping_KO, traitmapping_KO_definition)

traitmapping_orthogroup,peptide_type,peptide_id,deepsig_combined,traitmapping_egg_Description,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
OG0000305,sORF,Haemaphysalis-longicornis_KAH9364540.1,Chain; 1; 79; .; evidence=ECO:0000256,,K01152,isftu1 transposase
OG0000305,sORF,Haemaphysalis-longicornis_KAH9364717.1,Chain; 1; 78; .; evidence=ECO:0000256,,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9367493.1,Chain; 1; 84; .; evidence=ECO:0000256,,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9371263.1,Chain; 1; 89; .; evidence=ECO:0000256,,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9374347.1,Chain; 1; 83; .; evidence=ECO:0000256,,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9374666.1,Chain; 1; 84; .; evidence=ECO:0000256,,K26447,Alphabaculovirus probable serine/threonine-protein kinase 2 [EC:2.7.11.1]
OG0000305,sORF,Haemaphysalis-longicornis_KAH9375042.1,Chain; 1; 75; .; evidence=ECO:0000256,-,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9376245.1,Chain; 1; 84; .; evidence=ECO:0000256,,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9380381.1,Chain; 1; 79; .; evidence=ECO:0000256,,,
OG0000305,sORF,Haemaphysalis-longicornis_KAH9382422.1,Chain; 1; 93; .; evidence=ECO:0000256,,,


## Try filtering sORFs by signal peptides

## Actual POOL 5

* Any cleavage peptide that has good solubility
* Any sORF with a signal peptide not synthesized above
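
The two criteria can be combined into a single filter. The sketch below uses a toy table with hypothetical IDs. The `already_selected` vector is an assumed stand-in for the peptides chosen in earlier pools; it is not an object defined in this notebook.

```r
library(dplyr)

# Toy stand-in for `predictions`; all IDs are hypothetical.
toy <- tibble(
  peptide_id = c("pep_a", "pep_b", "pep_c", "pep_d", "pep_e"),
  peptide_type = c("cleavage", "cleavage", "sORF", "sORF", "sORF"),
  hydrophilicity = c("Good", "Poor", "Poor", "Good", "Poor"),
  deepsig_combined = c(
    "Chain; 1; 40; .; evidence=ECO:0000256",
    "Chain; 1; 45; .; evidence=ECO:0000256",
    "Signal peptide; 1; 20; 0.99; evidence=ECO:0000256 | Chain; 21; 93; .; evidence=ECO:0000256",
    "Chain; 1; 80; .; evidence=ECO:0000256",
    "Signal peptide; 1; 18; 0.98; evidence=ECO:0000256 | Chain; 19; 100; .; evidence=ECO:0000256"
  )
)

# Hypothetical list of peptides already assigned to an earlier pool.
already_selected <- c("pep_e")

pool5 <- toy %>%
  filter(
    # Criterion 1: cleavage peptides with good solubility.
    (peptide_type == "cleavage" & hydrophilicity == "Good") |
      # Criterion 2: sORFs whose annotation includes a signal peptide.
      (peptide_type == "sORF" & grepl("Signal", deepsig_combined, fixed = TRUE))
  ) %>%
  # Drop anything that was already chosen above.
  filter(!peptide_id %in% already_selected)

pool5$peptide_id
```

Excluding by peptide ID rather than by orthogroup keeps additional members of an already-sampled orthogroup eligible; whether that is desirable here is a judgment call.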

In [44]:
predictions %>% filter(peptide_type == "sORF") %>%
  filter(grepl(pattern = "Signal", deepsig_combined)) %>%
  select(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, peptide_id, 
         deepsig_combined, protein_sequence, traitmapping_KO, traitmapping_KO_definition, 
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) #%>%
  #filter(!traitmapping_orthogroup %in% c("OG0000231", "OG0007769", "OG0005246", "OG0013943", "OG0008102"))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,peptide_id,deepsig_combined,protein_sequence,traitmapping_KO,traitmapping_KO_definition,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,OG0008102,0.9056743,Amblyomma-americanum_evm.model.contig-245149-1.2,Signal peptide; 1; 20; 0.79; evidence=ECO:0000256 | Chain; 21; 100; .; evidence=ECO:0000256,MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase",tick support,Amblyomma-americanum_evm.model.contig-245149-1.2,Transcript_929497.p2_start21_end72
Medium,Poor,OG0008102,0.9056743,Amblyomma-sculptum_GEEX01004552.1.p1,Signal peptide; 1; 20; 0.95; evidence=ECO:0000256 | Chain; 21; 99; .; evidence=ECO:0000256,MKAYLILALVILGHLSQIHAATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,K14556;K25722;K17263;K21503;K13052,U3 small nucleolar RNA-associated protein 12;proteoglycan 3;cullin-associated NEDD8-dissociated protein 1;Escherichia phage dCTP pyrophosphatase [EC:3.6.1.12];cell division protein DivIC,tick support,Amblyomma-sculptum_GEEX01004552.1.p1,Transcript_929497.p2_start21_end72
Difficult,Poor,OG0013943,0.7934352,Dermacentor-andersoni_XP-054918570.1,Signal peptide; 1; 18; 0.98; evidence=ECO:0000256 | Chain; 19; 100; .; evidence=ECO:0000256,MHLYWVLLAAALATGVAAGGYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,K13344;K06339;K06872;K12741;K13098,peroxin-13;spore coat protein T;uncharacterized protein;heterogeneous nuclear ribonucleoprotein A1/A3;RNA-binding protein FUS,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Difficult,Poor,OG0013943,0.7934352,Dermacentor-silvarum_XP-049511149.1,Signal peptide; 1; 18; 0.97; evidence=ECO:0000256 | Chain; 19; 96; .; evidence=ECO:0000256,MHLYWVLLACALATGVAAGGYGHASVSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR,K06339;K13344;K06872;K13098;K12741,spore coat protein T;peroxin-13;uncharacterized protein;RNA-binding protein FUS;heterogeneous nuclear ribonucleoprotein A1/A3,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Difficult,Poor,OG0013943,0.7934352,Rhipicephalus-microplus_XP-037282321.1,Signal peptide; 1; 18; 0.92; evidence=ECO:0000256 | Chain; 19; 95; .; evidence=ECO:0000256,MHLYWVLLACALATGVTAGGYGHASISYVSKPVVRVGYVSKPVVTYVKQPVATVSHIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,,,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Medium,Poor,OG0000231,0.339885,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,Signal peptide; 1; 20; 0.99; evidence=ECO:0000256 | Chain; 21; 93; .; evidence=ECO:0000256,MKYLLVFLLFYFQCKKTVQQRPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP,K01123,sphingomyelin phosphodiesterase D [EC:3.1.4.41],chelicerate support,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,


Cleaving off the signal peptide can make synthesis easier and improve solubility.
sORFs with signal peptides, shown with the signal peptide removed (synthesis difficulty, solubility):

* ~AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC (easy, good)~
* ~ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA (easy, good)~
* ~GGYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR (medium, poor)~
* ~GGYGHASVSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR (medium, poor)~
* ~GGYGHASISYVSKPVVRVGYVSKPVVTYVKQPVATVSHIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW (medium, poor)~
* RPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP (medium, poor)
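
The mature sequences above can be derived from `protein_sequence` and the coordinates in `deepsig_combined` rather than trimmed by hand. The helper below is a sketch: `cleave_signal_peptide()` is a hypothetical function written for this illustration, and it assumes the annotation always has the form `Signal peptide; <start>; <end>; ...` seen in the table output.

```r
# Hypothetical helper: drop the signal peptide annotated in a deepsig-style string
# such as "Signal peptide; 1; 20; 0.99; ... | Chain; 21; 93; ...".
# Sequences whose annotation has no "Signal peptide" feature are returned unchanged.
cleave_signal_peptide <- function(protein_sequence, deepsig_combined) {
  # Capture the end coordinate of the signal peptide feature, if present.
  matches <- regmatches(
    deepsig_combined,
    regexec("Signal peptide; *[0-9]+; *([0-9]+);", deepsig_combined)
  )
  sp_end <- vapply(
    matches,
    function(x) if (length(x) == 2) as.integer(x[2]) else 0L,
    integer(1)
  )
  # The mature chain starts one residue after the end of the signal peptide.
  substr(protein_sequence, sp_end + 1L, nchar(protein_sequence))
}

# Example with the OG0000231 sORF and its annotation from the table above.
full_seq <- "MKYLLVFLLFYFQCKKTVQQRPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP"
annotation <- "Signal peptide; 1; 20; 0.99; evidence=ECO:0000256 | Chain; 21; 93; .; evidence=ECO:0000256"
mature <- cleave_signal_peptide(full_seq, annotation)
mature
```

For this input the function removes the first 20 residues, leaving the 73-residue chain that starts with `RPLYNIAH`. Deriving the trimmed sequences this way keeps them consistent with the annotated coordinates.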
