# Selecting peptides for experimental validation

This notebook proposes pools of peptides for experimental validation in a mouse scratch assay.
Selection relies primarily on the following peptide metadata:
* Ease of synthesis
* Solubility (hydrophilicity)
* The orthogroup the peptide belongs to and the statistical association of that orthogroup with itch suppression
* The sequence of the peptide itself
* Which other peptides the peptide clustered with (mmseqs2, 80% identity)
* Whether the peptide had matches in tick salivary gland transcriptomes

## Notebook setup

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to b

In [2]:
setwd("..")

## Try filtering on 50% of proteins in the orthogroup having a peptide predicted from them

In [4]:
predictions <- read_tsv("outputs/notebooks/predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(fraction_of_orthogroup_with_predicted_peptide >= 0.5) %>%
  arrange(desc(traitmapping_coefficient))

This removes both candidates with chelicerate support, so I don't think this is the right filter to apply.
It's possible that ticks or other itch-suppressing chelicerates evolved to make a peptide when the rest of the orthogroup didn't, so it may be better to filter on the absolute number of predicted peptides in the orthogroup.
I'm going to require more than 10 predicted peptides per orthogroup, a somewhat arbitrary cutoff.
(We see below that with this cutoff the smallest retained orthogroup actually has 15 predicted peptides.)

In [5]:
predictions <- read_tsv("outputs/notebooks/predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(num_predicted_peptides > 10) %>%
  arrange(desc(traitmapping_coefficient))

print(paste("num predicted peptides:", nrow(predictions)))
print(paste("num orthogroups:", length(unique(predictions$traitmapping_orthogroup))))
print(paste("smallest number of peptides predicted in an orthogroup:", min(predictions$num_predicted_peptides)))

[1] "num predicted peptides: 246"
[1] "num orthogroups: 10"
[1] "smallest number of peptides predicted in an orthogroup: 15"
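
To see how the two candidate filters treat each type of evidence, the orthogroups retained by each can be counted side by side. The following is a minimal sketch on invented toy data. The column names match the ones used above, but the helper `compare_filters()` and the values are hypothetical.

```r
library(dplyr)

# Toy stand-in for predictions_with_metadata.tsv; all values are invented.
toy <- tibble::tibble(
  traitmapping_orthogroup = c("OG_A", "OG_B", "OG_C"),
  type_of_itch_suppression_evidence = c("tick support", "chelicerate support", "tick support"),
  fraction_of_orthogroup_with_predicted_peptide = c(0.80, 0.20, 0.60),
  num_predicted_peptides = c(18, 17, 4)
)

# Hypothetical helper: count orthogroups per evidence type kept by each filter.
compare_filters <- function(df, min_fraction = 0.5, min_peptides = 10) {
  df %>%
    distinct(traitmapping_orthogroup, .keep_all = TRUE) %>%
    group_by(type_of_itch_suppression_evidence) %>%
    summarize(
      n_orthogroups = n(),
      kept_by_fraction = sum(fraction_of_orthogroup_with_predicted_peptide >= min_fraction),
      kept_by_count = sum(num_predicted_peptides > min_peptides),
      .groups = "drop"
    )
}

filter_comparison <- compare_filters(toy)
filter_comparison
```

On the real table, a zero in `kept_by_fraction` for "chelicerate support" would correspond to the problem described above.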


## Combine with solubility data

In [6]:
# read in solubility and ease of synthesis data
# This is from a web application by genscript.
# https://www.genscript.com/tools/peptide%2danalyzing%2dtool
synthesis_and_solubility <- read_csv("outputs/notebooks/tmp.csv", show_col_types = FALSE) %>%
  distinct() %>%
  filter(sequence %in% predictions$protein_sequence)
table(synthesis_and_solubility$difficulty_level, synthesis_and_solubility$hydrophilicity)

           
            Good Poor
  Difficult    0   25
  Easy        72   27
  Medium       5  113

In [7]:
predictions <- left_join(predictions, synthesis_and_solubility, by = c("protein_sequence" = "sequence"))
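
The cross-tabulation above covers 242 sequences, while 246 predictions passed the filter. Part of that gap may simply reflect peptides that share a sequence, but it seems worth checking that the join neither leaves predictions without annotation nor duplicates rows. The following is a minimal sketch on invented toy data. The object names and sequences are hypothetical.

```r
library(dplyr)

# Toy stand-ins; sequences and values are invented for illustration.
toy_predictions <- tibble::tibble(
  peptide_id = c("pep1", "pep2", "pep3"),
  protein_sequence = c("MKAYLIL", "VGDETDE", "GYGGGYG")
)
toy_solubility <- tibble::tibble(
  sequence = c("MKAYLIL", "VGDETDE"),
  difficulty_level = c("Easy", "Medium"),
  hydrophilicity = c("Good", "Poor")
)

# Predictions with no matching row in the solubility table
# (these would get NA difficulty_level and hydrophilicity after left_join).
missing_annotation <- anti_join(toy_predictions, toy_solubility,
                                by = c("protein_sequence" = "sequence"))

# Sequences listed more than once in the solubility table
# (these would duplicate prediction rows after left_join).
duplicated_sequences <- toy_solubility %>%
  count(sequence) %>%
  filter(n > 1)

missing_annotation
duplicated_sequences
```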

In [8]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Difficult,Poor,OG0008102,0.90567434,tick support,11
Easy,Good,OG0008102,0.90567434,tick support,1
Medium,Poor,OG0008102,0.90567434,tick support,6
Difficult,Poor,OG0013943,0.79343515,tick support,9
Easy,Poor,OG0013943,0.79343515,tick support,5
Medium,Poor,OG0013943,0.79343515,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Poor,OG0007769,0.3802378,tick support,11
Medium,Poor,OG0007769,0.3802378,tick support,9


In [9]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient)) %>%
  filter(difficulty_level %in% c("Easy", "Medium"))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Easy,Good,OG0008102,0.90567434,tick support,1
Medium,Poor,OG0008102,0.90567434,tick support,6
Easy,Poor,OG0013943,0.79343515,tick support,5
Medium,Poor,OG0013943,0.79343515,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Poor,OG0007769,0.3802378,tick support,11
Medium,Poor,OG0007769,0.3802378,tick support,9
Easy,Good,OG0000231,0.33988501,chelicerate support,6
Easy,Poor,OG0000231,0.33988501,chelicerate support,5


In [10]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient)) %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(hydrophilicity == "Good")

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Easy,Good,OG0008102,0.90567434,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Good,OG0000231,0.33988501,chelicerate support,6
Easy,Good,OG0000305,0.08832735,tick support,19
Easy,Good,OG0000354,0.08164998,tick support,9
Medium,Good,OG0000354,0.08164998,tick support,1
Easy,Good,OG0000335,0.07863746,tick support,16
Easy,Good,OG0001002,0.07861881,tick support,9
Medium,Good,OG0001002,0.07861881,tick support,1


## Establish pools

## For all pools:
* Don't pursue peptides that are difficult to synthesize. We have enough "Medium" and "Easy" peptides that we don't need to go that route.
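
Each pool below applies the same orthogroup and synthesis-difficulty filters, so one option is to wrap them in a small helper. This is only a sketch. `select_pool_candidates()` and its arguments are hypothetical names, and the toy data is invented.

```r
library(dplyr)

# Hypothetical helper: keep peptides from one orthogroup that are not "Difficult",
# optionally restricting to "Good" hydrophilicity.
select_pool_candidates <- function(df, orthogroup,
                                   difficulty = c("Easy", "Medium"),
                                   soluble_only = FALSE) {
  out <- df %>%
    filter(traitmapping_orthogroup == .env$orthogroup,
           difficulty_level %in% .env$difficulty)
  if (soluble_only) {
    out <- out %>% filter(hydrophilicity == "Good")
  }
  out
}

# Toy stand-in for the predictions table; values are invented.
toy_predictions <- tibble::tibble(
  peptide_id = c("pep1", "pep2", "pep3", "pep4"),
  traitmapping_orthogroup = c("OG_A", "OG_A", "OG_A", "OG_B"),
  difficulty_level = c("Easy", "Difficult", "Medium", "Easy"),
  hydrophilicity = c("Good", "Poor", "Poor", "Good")
)

pool_a <- select_pool_candidates(toy_predictions, "OG_A")
pool_a_soluble <- select_pool_candidates(toy_predictions, "OG_A", soluble_only = TRUE)
```

The `.env$` pronoun makes sure the function arguments are used even if the data frame happens to contain a column with the same name.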

## POOL 1 (x peptides?)

! THIS POOL NEEDS TO BE FILTERED DOWN BECAUSE OF LIKELY SOLUBILITY CONFLICTS
* Orthogroup `OG0008102` has the highest trait mapping coefficient, meaning it has the strongest statistical support for itch suppression when considering the presence of peptides in the group. Only 1 peptide is "Easy" to synthesize with "Good" hydrophilicity. Because the coefficient is so high, though, we should take the hit on solubility. Since the sequences are a mix of sORF and cleavage peptides, I think we should send the sORFs through cleavage prediction and then pick a couple to explore. Thoughts on how to pick within the group are welcome.
* Note also that the Rhipicephalus microplus peptides had hits to peptides predicted from salivary glands. Others did as well, but those hits were to Amblyomma americanum, so we have to double check that they are salivary gland hits and not general whole-body hits. The fact that the R. microplus peptides also had hits is a promising sign that these would be OK.

In [11]:
OG0008102 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(difficulty_level %in% c("Easy", "Medium"))

OG0008102

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0008102,VGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Transcript_929497.p2_start64_end100
Medium,Poor,cleavage,OG0008102,NGAISGAVGAAVANLINKG,tick support,Dermacentor-andersoni_XP-054924338.1_start87_end106,Dermacentor-andersoni_XP-054924338.1_start87_end106,
Medium,Poor,cleavage,OG0008102,IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL,tick support,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Rhipicephalus-microplus_XP-037271377.1_start70_end114,GIKN01002979.1.p1_start91_end134
Medium,Poor,cleavage,OG0008102,VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF,tick support,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Rhipicephalus-microplus_XP-037271378.1_start78_end115,GIKN01002127.1.p1_start100_end137
Easy,Good,sORF,OG0008102,MNSPKKTLEGGKELQKKIYDAVMNNSEDIIAAVRNMKSSMDNTGDETDEQFIGAVITAVVSTAAAAAVEAGVEAAIKRG,tick support,Amblyomma-americanum_evm.model.contig-129979-1.1,Amblyomma-americanum_evm.model.contig-129979-1.1,Transcript_929497.p2_start21_end72
Medium,Poor,sORF,OG0008102,MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2,Amblyomma-americanum_evm.model.contig-245149-1.2,Transcript_929497.p2_start21_end72
Medium,Poor,sORF,OG0008102,MKAYLILALVILGHLSQIHAATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,tick support,Amblyomma-sculptum_GEEX01004552.1.p1,Amblyomma-sculptum_GEEX01004552.1.p1,Transcript_929497.p2_start21_end72


## POOL 2 (1 peptide?)

! THIS POOL NEEDS TO BE FILTERED DOWN BECAUSE OF LIKELY SOLUBILITY CONFLICTS
* Orthogroup `OG0013943` has the second highest trait mapping coefficient. As in the previous orthogroup, solubility is a problem: here all of the predictions have poor solubility. However, unlike the last group, most of these sequences cluster together. I think we could pick one or two and try them out. The representative sequence for most of them is `Dermacentor-andersoni_XP-054918570.1_start58_end100`, so we should start with one of the peptides in that cluster. The peptide `Rhipicephalus-microplus_XP-037282321.1_start56_end95` also had a hit to peptides predicted from tick salivary glands, so we might select that one sequence and move forward with just that.
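
The choice described above (a member of the main cluster that also has a salivary gland hit) can be written as a filter. This is a minimal sketch using four rows copied from the output table for this pool. It assumes that an empty `sgpeptide_blast_sseqid` is read in as `NA`, and the object names are hypothetical.

```r
library(dplyr)

# Four rows copied from the OG0013943 table; empty blast hits are assumed to be NA.
og <- tibble::tibble(
  peptide_id = c(
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Dermacentor-silvarum_XP-049511149.1_start58_end96",
    "Rhipicephalus-microplus_XP-037282321.1_start56_end95",
    "Haemaphysalis-longicornis_KAH9362006.1_start68_end106"
  ),
  mmseqs2_representative_sequence = c(
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Haemaphysalis-longicornis_KAH9362006.1_start68_end106"
  ),
  sgpeptide_blast_sseqid = c(NA, NA, "GBJS01028274.1.p1_start72_end105", NA)
)

rep_id <- "Dermacentor-andersoni_XP-054918570.1_start58_end100"

# Members of the main cluster that also have a salivary gland hit.
cluster_hits <- og %>%
  filter(mmseqs2_representative_sequence == rep_id,
         !is.na(sgpeptide_blast_sseqid))

cluster_hits$peptide_id
```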

In [12]:
OG0013943 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0013943") %>%
  filter(difficulty_level %in% c("Easy", "Medium"))

OG0013943

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Poor,cleavage,OG0013943,ISHGYGGGYGGGGGYGGGGGYGGGYGGGGGFGGGYGGWR,tick support,Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167,Amblyomma-americanum_evm.model.contig-138531-1.3_start129_end167,
Easy,Poor,cleavage,OG0013943,VKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,tick support,Dermacentor-andersoni_XP-054918570.1_start58_end100,Dermacentor-andersoni_XP-054918570.1_start58_end100,
Easy,Poor,cleavage,OG0013943,VKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR,tick support,Dermacentor-silvarum_XP-049511149.1_start58_end96,Dermacentor-andersoni_XP-054918570.1_start58_end100,
Easy,Poor,cleavage,OG0013943,GYGGYGGGYGGGYGGYGGGYGGGYGGYGGGYGGGYGGWH,tick support,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,
Medium,Poor,cleavage,OG0013943,HIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,tick support,Rhipicephalus-microplus_XP-037282321.1_start56_end95,Dermacentor-andersoni_XP-054918570.1_start58_end100,GBJS01028274.1.p1_start72_end105
Easy,Poor,cleavage,OG0013943,VSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGYGGGYGW,tick support,Rhipicephalus-sanguineus_XP-037511163.1_start64_end102,Dermacentor-andersoni_XP-054918570.1_start58_end100,


## POOL 3 (5 peptides)

* The orthogroup with the next highest coefficient, `OG0005246`, has easy/medium synthesis and high solubility. I say we test all of the peptides from the organisms that suppress itch:
    * Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR
    * Rhipicephalus-sanguineus_XP-037500236.1
    * Rhipicephalus-microplus_XP-037283166.1
    * Dermacentor-andersoni_XP-050049593.1
    * Dermacentor-silvarum_XP-037556830.1
* Note that all of them had hits to tick salivary gland transcriptomes.
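
In the output table for this pool, the Dermacentor-andersoni and Dermacentor-silvarum entries carry the same sequence, so the number of distinct peptides to synthesize may be smaller than the number of peptide IDs. The following sketch collapses identical sequences on invented toy data. The object names, IDs and sequences are hypothetical.

```r
library(dplyr)

# Toy stand-in for a pool table; IDs and sequences are invented.
toy_pool <- tibble::tibble(
  peptide_id = c("speciesA_pep1", "speciesB_pep1", "speciesC_pep1"),
  protein_sequence = c("MTSEVEEIFKK", "MTSEVEEIFKK", "MAEVEATSEPE")
)

# One row per distinct sequence, keeping track of which peptide IDs share it.
unique_sequences <- toy_pool %>%
  group_by(protein_sequence) %>%
  summarize(
    n_peptides = n(),
    peptide_ids = paste(peptide_id, collapse = ";"),
    .groups = "drop"
  )

unique_sequences
```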

In [13]:
OG0005246 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0005246") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression")
OG0005246

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,evidence_of_itch_suppression,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQEGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYCLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Dermacentor-andersoni_XP-050049593.1,evidence of itch suppression,GKHV01001871.1
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQEGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYCLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Dermacentor-silvarum_XP-037556830.1,evidence of itch suppression,GKHV01001871.1
Easy,Good,sORF,OG0005246,MAEVEATSEPERGAENDCNEHRRNSYQDTLKEIDPPKELTFLQIYFGRNEIMVAPDKYYFLIVIQNPTE,chelicerate support,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,evidence of itch suppression,GFZD01010403.1
Easy,Good,sORF,OG0005246,MTSEVEDIFKKLKDQDGVVGVVVTTSEGAAIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYFLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Rhipicephalus-microplus_XP-037283166.1,evidence of itch suppression,GBJT01001074.1
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQDGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYFLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Rhipicephalus-sanguineus_XP-037500236.1,evidence of itch suppression,GBJT01001074.1


## POOL 4 (1 peptide)

* Peptides in `OG0007769` are easy or medium to synthesize, but all have low solubility. Only some have hits to tick salivary gland transcriptomes, but those that do all clustered with `Rhipicephalus-sanguineus_XP-037515628.1_start185_end221`. I say we just move forward with that sequence.

In [14]:
OG0007769 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0007769") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression") %>%
  select(-evidence_of_itch_suppression)
OG0007769

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0007769,RNLGGGYVGGGYGGLAAGLGGVALGGGLKGGLGVHHGGGFKGGYGGLYG,tick support,Amblyomma-americanum_evm.model.contig-126756-1.3_start14_end62,Amblyomma-americanum_evm.model.contig-126756-1.3_start14_end62,
Medium,Poor,cleavage,OG0007769,AYGVGGLGGYGLGGYGGGLGGLGGGVGVYRGAGGYGKYGAGGAGWW,tick support,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,Amblyomma-americanum_evm.model.contig-126756-1.3_start208_end253,
Easy,Poor,cleavage,OG0007769,GNLGGGFVGGGFGGLGAGLGGGVYGGGLGGGLGVHHGGG,tick support,Amblyomma-americanum_evm.model.contig-110937-1.5_start86_end124,Amblyomma-americanum_evm.model.contig-110937-1.5_start86_end124,
Medium,Poor,cleavage,OG0007769,AYGVGGLGGYGLGGYGGGVGGLGGGVGVYRGAGGYGKHGAGGAGWW,tick support,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,Amblyomma-americanum_evm.model.contig-110937-1.5_start282_end327,
Easy,Poor,cleavage,OG0007769,GYGGYGGGYGGLGAVGGVGGGYGVGGGLYGGAGVFRGVGGHGKHGHGWQ,tick support,Dermacentor-andersoni_XP-050026447.1_start226_end274,Dermacentor-andersoni_XP-050026447.1_start226_end274,
Medium,Poor,cleavage,OG0007769,GAGGLYGAGVARYGGAGLYGGLGGGGVGVYAGGAGVGVLGKHGGGVGWH,tick support,Dermacentor-andersoni_XP-054919892.1_start175_end223,Dermacentor-andersoni_XP-054919792.1_start424_end472,
Medium,Poor,cleavage,OG0007769,GAGGLYGAGVARYGGAGLYGGLGGRGVGVYGGGAGVGVLGKHGGGVGWH,tick support,Dermacentor-andersoni_XP-054919892.1_start175_end223,Dermacentor-andersoni_XP-054919892.1_start175_end223,
Medium,Poor,cleavage,OG0007769,GAAGYGSAGLYGGLGGRGVGVYAGGVGVHGKHGVGWH,tick support,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,Dermacentor-silvarum_XP-037564805.1_start275_end311,GFGI01009308.1.p2_start118_end154
Medium,Poor,cleavage,OG0007769,AAGYGGAGLYGGIGRGVGVYAGGRGVGVLGKHGGGWH,tick support,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,Dermacentor-silvarum_XP-049518524.1_start230_end266,
Easy,Poor,cleavage,OG0007769,GTLRGGFGGVGLGGLGGAGLGAGLGVGLGAGTYGGGYAGGYGGHGG,tick support,Rhipicephalus-sanguineus_XP-037519543.1_start18_end63,Dermacentor-silvarum_XP-049528796.1_start18_end63,


## POOL 5 (6 peptides)

* Peptides in `OG0000231` don't really cluster well with each other, but many are easy to synthesize and have good solubility. Just pick all of the peptides that meet those criteria.



In [15]:
OG0000231 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0000231") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(hydrophilicity == "Good") %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression") %>%
  select(-evidence_of_itch_suppression)
OG0000231

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,cleavage,OG0000231,TSRRTAANTTRDYVDKVYVWTVDKPCTMRRFMRYVPRW,chelicerate support,Hyalomma-asiaticum_KAH6922907.1_start146_end184,Hyalomma-asiaticum_KAH6922907.1_start146_end184,
Easy,Good,sORF,OG0000231,MSNIQQRSTVVLLLREHLLPNNVLRGVDGIITNDPRRMARIMKEREFRNKLRPATIQDNPGICVPGRPAPRSVASQQLLEIDFLGDFNNSANGILHV,chelicerate support,Amblyomma-americanum_evm.model.contig-114601-1.1,Amblyomma-americanum_evm.model.contig-114601-1.1,Transcript_329793
Easy,Good,sORF,OG0000231,MGHMANTLHELKDLLAQGANSIEADVVFAPNGTAVKLNHEDGCDCDRNCNQETEIRRYLYFLKNAVSKGEKSKSSSVTLEFY,chelicerate support,Haemaphysalis-longicornis_KAH9364597.1,Haemaphysalis-longicornis_KAH9364597.1,
Easy,Good,sORF,OG0000231,MDSLAKVGRAFADLKIYNHRWVGSGNTNCLPYLSGKYDRLKDIVACRDGLKSGCDFIDKGYAWTLDYESSIAREIK,chelicerate support,Haemaphysalis-longicornis_KAH9364993.1,Haemaphysalis-longicornis_KAH9364993.1,
Easy,Good,sORF,OG0000231,MVNNITEINQFLDLGCNAVEADVKFIDAYPKNAFHGQPCDCDRYCDSSEDLAKYLNYVRKITTPEIAASGEVGHRK,chelicerate support,Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR,Leptotrombidium-deliense_tr|A0A443SE68|A0A443SE68-9ACAR,
Easy,Good,sORF,OG0000231,GITNCIGFLYPLIRLQALVQKRDECNIDKDPFCPRKVYQWTTDNQSRFRSILRMQVDGFITNYPNRLNEVLREPEFATKFRLATNRDNPWQIYK,chelicerate support,Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR,Leptotrombidium-deliense_tr|A0A443RTB5|A0A443RTB5-9ACAR,


## BONUS POOL (53 peptides)

Anything that is easy to synthesize, has high solubility, and belongs to an orthogroup that wasn't included above.
These peptides had lower coefficients for association with itch suppression but would allow us to cast a wide net with our peptide predictions.
If synthesis costs are limiting, we could limit the number of peptides in the bonus pool by:
* selecting representatives from each orthogroup (4 peptides)
* limiting to peptides with matches in tick salivary glands (5 peptides)
* selecting based on representative sequences (49 peptides)

Pool 5 is also made up entirely of soluble peptides, so pool 5 and the bonus pool could be combined.
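
The three ways of shrinking the bonus pool listed above can each be written as a short dplyr step. This is a sketch on invented toy data, not the code that produced the counts above. It assumes that missing salivary gland hits are `NA`. It reads "representatives from each orthogroup" as the first peptide per orthogroup and "representative sequences" as one peptide per mmseqs2 cluster.

```r
library(dplyr)

# Toy stand-in for the bonus pool table; values are invented.
toy_bonus <- tibble::tibble(
  peptide_id = c("pep1", "pep2", "pep3", "pep4"),
  traitmapping_orthogroup = c("OG_X", "OG_X", "OG_Y", "OG_Y"),
  traitmapping_coefficient = c(0.09, 0.09, 0.08, 0.08),
  mmseqs2_representative_sequence = c("pep1", "pep1", "pep3", "pep4"),
  sgpeptide_blast_sseqid = c(NA, "hit_1", NA, NA)
)

# Option 1: one peptide per orthogroup (here simply the first row in each group).
one_per_orthogroup <- toy_bonus %>%
  group_by(traitmapping_orthogroup) %>%
  slice_head(n = 1) %>%
  ungroup()

# Option 2: only peptides with a match in tick salivary glands.
with_sg_hit <- toy_bonus %>%
  filter(!is.na(sgpeptide_blast_sseqid))

# Option 3: one peptide per mmseqs2 cluster.
one_per_cluster <- toy_bonus %>%
  distinct(mmseqs2_representative_sequence, .keep_all = TRUE)
```

Taking the first row per orthogroup is arbitrary. On the real table, a more deliberate rule (for example, preferring peptides with a salivary gland hit) could replace `slice_head()`.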

In [16]:
predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, traitmapping_coefficient, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>%
  arrange(desc(traitmapping_coefficient)) %>%
  filter(difficulty_level %in% c("Easy")) %>%
  filter(hydrophilicity == "Good") %>%
  filter(!traitmapping_orthogroup %in% c("OG0000231", "OG0007769", "OG0005246", "OG0013943", "OG0008102")) 

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,traitmapping_coefficient,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,evidence_of_itch_suppression,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,sORF,OG0000305,0.08832735,MVRTELLELSQEINTPRIVYRIDTLPAPTYGREEVVLVPPYYCKFKPTERVWSQLKGHFARRNRVTMNLKRFCQKHSRL,tick support,Haemaphysalis-longicornis_KAH9364540.1,Haemaphysalis-longicornis_KAH9364540.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKGDVARLNTDFRIGSMRKLLITAAENVSPDNWTKAVEQIIGIERRRLEVRGFSDHVEQTIISLGEEDDDRRKRSLLH,tick support,Haemaphysalis-longicornis_KAH9364717.1,Haemaphysalis-longicornis_KAH9364717.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MDKASYHSRRNEAVPTTNSLKGTITEWLDSKSIQYGSCADREAAAGDNCSSEATFHQLPSRHGCTDGRVYRGKAAVLPLRVQSY,tick support,Haemaphysalis-longicornis_KAH9367493.1,Haemaphysalis-longicornis_KAH9367493.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MSDVFRGKKTWAYHEEMDGPHFESWFDGVLQKLPSGRVIMWTTPPTAPSGKRQGQRTNSLKGTITEWLDSKDIQHGARLTKKQLLKIVA,tick support,Haemaphysalis-longicornis_KAH9371263.1,Haemaphysalis-longicornis_KAH9371263.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKLGVAARNSTFKLPYVEVLLREEVAKVTSQHWAKTVQHVISIETKFRGNGGASAYVQPIMIHLGEDMDSDSNLTAIESFRDV,tick support,Haemaphysalis-longicornis_KAH9380381.1,Haemaphysalis-longicornis_KAH9374347.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKRGVAPRNVTFKLSDVEVLLREEAAKVTAQHWVNAVQHVINIETKFMGDGGASVHVQPIIIHLDEDDMDSDSDLSGIESFEDL,tick support,Haemaphysalis-longicornis_KAH9374666.1,Haemaphysalis-longicornis_KAH9374666.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MRGLTTGLKKPSGKGQRLIVTHIGSEDGFVSGCLDIFRGTKTRDYHEMDGTRFERWFGAVLPQEHCTQHANNEYD,tick support,Haemaphysalis-longicornis_KAH9375042.1,Haemaphysalis-longicornis_KAH9375042.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKLGVAARNATFKLADVEVLLREEAAKVTAEHWANAVQHVINIETKFRGDGGASAHVQPIIIHLAEDDIDSDSDLSGIESFEDV,tick support,Haemaphysalis-longicornis_KAH9374666.1,Haemaphysalis-longicornis_KAH9376245.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MKRGVASRSATFKLPHVEVLLREEVAKITAQHWANTVQHVISIETKFRGDGGASAHVQPIIIHLDEDDMDSDTNHTGEI,tick support,Haemaphysalis-longicornis_KAH9380381.1,Haemaphysalis-longicornis_KAH9380381.1,evidence of itch suppression,
Easy,Good,sORF,OG0000305,0.08832735,MDNASYHSRRLETITTMSSRKPDILLADIKGTDDGSPNNWAKAVEHVIGIEDKRREARGFSDHVEPIIISLGEEDGDSCGSADDDLSGIEPME,tick support,Haemaphysalis-longicornis_KAH9382422.1,Haemaphysalis-longicornis_KAH9382422.1,evidence of itch suppression,
