# Selecting peptides for experimental validation

**Note that this notebook was run on April 18, 2024 using the initial results of the ticks on a tree analysis and before a bug fix in peptigate's sORF peptide prediction. We have included this notebook to document how we selected the initial peptides we ordered from GenScript. Two of these peptides weren't subsequently selected in the 20240626-peptides-into-pools.ipynb notebook but we had already ordered them. This notebook primarily records how we selected those two peptides.**

This notebook proposes pools of peptides to be experimentally validated using a mouse scratch assay.
It primarily relies on the following metadata about peptides:
* Ease of synthesis
* Solubility (hydrophilicity)
* The orthogroup the peptide belongs to and the statistical association of that orthogroup with itch suppression
* The sequence of the peptide itself
* Which other peptides the peptide clusters with (mmseqs2, 80% identity)
* Whether the peptide has matches in tick salivary gland transcriptomes


In the case of sORFs, we also investigated the parent protein annotation (if available, `traitmapping_egg_Description`, `traitmapping_KO`, `traitmapping_KO_definition`).
If the annotation strongly suggests that the sORF is a housekeeping gene and that the protein does not have a secretion signal (signal peptide, `traitmapping_deepsig_feature`, `traitmapping_deepsig_start`, `traitmapping_deepsig_end`, `deepsig_combined`), we removed the peptide/orthogroup from consideration.
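
As a rough illustration of this exclusion rule, the sketch below flags sORFs whose annotation matches a housekeeping keyword and that lack a signal peptide. The toy rows, the `housekeeping_pattern` keyword list, and the `exclude` column are hypothetical; in this notebook the decision appears to have been made by reading the annotations, not by pattern matching. It uses base R only so that it runs without the notebook's input files.

```r
# Toy stand-in for the predictions table. Only the column names come from
# this notebook; the rows and the keyword list are hypothetical.
toy <- data.frame(
  peptide_id = c("pep_a", "pep_b", "pep_c"),
  peptide_type = c("sORF", "sORF", "cleavage"),
  traitmapping_KO_definition = c("dynein light chain roadblock-type", "cecropin", NA),
  traitmapping_deepsig_feature = c(NA, "Signal peptide", "Signal peptide"),
  stringsAsFactors = FALSE
)

# Hypothetical keywords that suggest a housekeeping function.
housekeeping_pattern <- "dynein|ribosomal"

is_sorf <- toy$peptide_type == "sORF"
# grepl() returns FALSE for NA annotations, so unannotated peptides are kept.
looks_housekeeping <- grepl(housekeeping_pattern, toy$traitmapping_KO_definition,
                            ignore.case = TRUE)
# %in% treats NA as "no signal peptide" instead of propagating NA.
has_signal_peptide <- toy$traitmapping_deepsig_feature %in% "Signal peptide"

# Exclude only when both conditions hold: housekeeping-like AND no secretion signal.
toy$exclude <- is_sorf & looks_housekeeping & !has_signal_peptide
kept <- toy[!toy$exclude, ]
print(kept$peptide_id)
```

Requiring both conditions keeps secreted peptides with uninformative or misleading annotations in play, which matches the way the rule is worded above.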

## Notebook setup

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
setwd("..")

## Try filtering on 50% of proteins in the orthogroup having a peptide predicted from them

In [3]:
predictions <- read_tsv("outputs/notebooks/predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(fraction_of_orthogroup_with_predicted_peptide >= 0.5) %>%
  arrange(desc(traitmapping_coefficient))

This filter removes both candidates with chelicerate support, so I don't think it is the "correct" filter to apply.
It's possible that ticks and other itch-suppressing chelicerates evolved to make a peptide when the rest of the orthogroup didn't, so it might be better to filter on the absolute number of predicted peptides in the orthogroup.
I'm going to require more than 10 predicted peptides per orthogroup, somewhat arbitrarily, as a cutoff.
(We see below that this cutoff actually gives us a minimum of 15 predicted peptides per orthogroup.)
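
To make the difference between the two filters concrete, here is a minimal sketch on made-up orthogroups. The orthogroup names and counts are hypothetical, and the sketch assumes the fraction is simply predicted peptides divided by proteins in the orthogroup, which may not be exactly how `fraction_of_orthogroup_with_predicted_peptide` was computed upstream.

```r
# Hypothetical orthogroups: OG_A is large, with peptides predicted from only a
# minority of its proteins; OG_B is small but mostly peptide-producing.
og <- data.frame(
  traitmapping_orthogroup = c("OG_A", "OG_B", "OG_C"),
  num_proteins_in_orthogroup = c(100, 12, 30),
  num_predicted_peptides = c(20, 8, 18),
  stringsAsFactors = FALSE
)

# Assumed definition of the fraction column (see the lead-in).
og$fraction_of_orthogroup_with_predicted_peptide <-
  og$num_predicted_peptides / og$num_proteins_in_orthogroup

# Filter 1: at least half of the orthogroup has a predicted peptide.
by_fraction <- og$traitmapping_orthogroup[
  og$fraction_of_orthogroup_with_predicted_peptide >= 0.5
]

# Filter 2: more than 10 predicted peptides, regardless of orthogroup size.
by_count <- og$traitmapping_orthogroup[og$num_predicted_peptides > 10]

print(by_fraction)
print(by_count)
```

In this toy case the fraction filter drops the large orthogroup in which only a subset of lineages produces the peptide, while the count filter keeps it; that is the scenario described above.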

In [4]:
predictions <- read_tsv("outputs/notebooks/predictions_with_metadata.tsv", show_col_types = FALSE) %>%
  filter(num_predicted_peptides > 10) %>%
  arrange(desc(traitmapping_coefficient))

print(paste("num predicted peptides:", nrow(predictions)))
print(paste("num orthogroups:", length(unique(predictions$traitmapping_orthogroup))))
print(paste("smallest number of peptides predicted in an orthogroup:", min(predictions$num_predicted_peptides)))

[1] "num predicted peptides: 246"
[1] "num orthogroups: 10"
[1] "smallest number of peptides predicted in an orthogroup: 15"


## Combine with solubility data

In [5]:
# Read in solubility and ease-of-synthesis data.
# These values come from GenScript's peptide analyzing web tool:
# https://www.genscript.com/tools/peptide%2danalyzing%2dtool
synthesis_and_solubility <- read_csv("outputs/notebooks/tmp.csv", show_col_types = FALSE) %>%
  distinct() %>%
  filter(sequence %in% predictions$protein_sequence)
table(synthesis_and_solubility$difficulty_level, synthesis_and_solubility$hydrophilicity)

           
            Good Poor
  Difficult    0   25
  Easy        72   27
  Medium       5  113

In [6]:
predictions <- left_join(predictions, synthesis_and_solubility, by = c("protein_sequence" = "sequence"))
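
A join like this can silently leave peptides without solubility data, or duplicate rows if the lookup table has repeated keys. The sketch below shows one way to check for both on toy data, using base R's `merge()` as a stand-in for `left_join()`. The toy tables and the variable names `predictions_toy`, `solubility_toy`, and `missing_annotation` are hypothetical.

```r
# Toy stand-ins for the two tables; only the column names mirror the notebook.
predictions_toy <- data.frame(
  peptide_id = c("p1", "p2", "p3"),
  protein_sequence = c("AAK", "GGY", "VVS"),
  stringsAsFactors = FALSE
)
solubility_toy <- data.frame(
  sequence = c("AAK", "GGY", "GGY"),
  difficulty_level = c("Easy", "Medium", "Medium"),
  hydrophilicity = c("Good", "Poor", "Poor"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, mirroring the distinct() call above.
solubility_toy <- unique(solubility_toy)

# all.x = TRUE keeps every prediction row, like a left join.
joined <- merge(predictions_toy, solubility_toy,
                by.x = "protein_sequence", by.y = "sequence", all.x = TRUE)

# Check 1: the join did not add rows (no duplicated keys in the lookup table).
stopifnot(nrow(joined) == nrow(predictions_toy))

# Check 2: list peptides that received no synthesis/solubility annotation.
missing_annotation <- joined$peptide_id[is.na(joined$difficulty_level)]
print(missing_annotation)
```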

In [7]:
predictions %>% 
  group_by(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, type_of_itch_suppression_evidence) %>% 
  tally() %>%
  arrange(desc(traitmapping_coefficient))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,type_of_itch_suppression_evidence,n
<chr>,<chr>,<chr>,<dbl>,<chr>,<int>
Difficult,Poor,OG0008102,0.90567434,tick support,11
Easy,Good,OG0008102,0.90567434,tick support,1
Medium,Poor,OG0008102,0.90567434,tick support,6
Difficult,Poor,OG0013943,0.79343515,tick support,9
Easy,Poor,OG0013943,0.79343515,tick support,5
Medium,Poor,OG0013943,0.79343515,tick support,1
Easy,Good,OG0005246,0.3990746,chelicerate support,14
Medium,Good,OG0005246,0.3990746,chelicerate support,3
Easy,Poor,OG0007769,0.3802378,tick support,11
Medium,Poor,OG0007769,0.3802378,tick support,9


## Establish pools

## For all pools:
* Don't pursue things that are difficult to synthesize. We have enough mediums and easies that we don't need to go that route.
* Make sure that the peptide has a signal peptide. If it's an sORF, it should have a signal peptide on the sORF itself. If it's a cleavage peptide, it should have one on the parent protein.

## Pool Summary

In [72]:
# POOL 1 CHECKED
pool1_names <- c("Amblyomma-americanum_evm.model.contig-245149-1.2", # NEEDS THE SIGNAL PEPTIDE CLEAVED
                 "Amblyomma-sculptum_GEEX01004552.1.p1", # NEEDS THE SIGNAL PEPTIDE CLEAVED
                 "Rhipicephalus-microplus_XP-037271377.1_start70_end114",
                 "Rhipicephalus-microplus_XP-037271378.1_start78_end115",
                 "Dermacentor-andersoni_XP-054924338.1_start87_end106")
# POOL 2 CHECKED
pool2_names <- c("Rhipicephalus-microplus_XP-037282321.1_start56_end95",
                 "Dermacentor-andersoni_XP-054918570.1") # NEEDS THE SIGNAL PEPTIDE CLEAVED

# POOL 3 CHECKED
pool3_names <- c("Rhipicephalus-sanguineus_XP-037515628.1_start185_end221") 

# POOL 4 CHECKED
pool4_names <- c("Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR") # NEEDS SIGNAL PEPTIDE CLEAVED

# POOL 5 CHECKED
pool5_names <- c("Rhipicephalus-microplus_XP-037269427.1_start34_end70",
                 "Dermacentor-andersoni_XP-050051547.1_start39_end77",
                 "Dermacentor-silvarum_XP-037559871.1_start39_end87",
                 "Hyalomma-asiaticum_KAH6923445.1_start29_end58")

In [73]:
all_names <- c(pool1_names, pool2_names, pool3_names, pool4_names, pool5_names)
length(all_names)
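
Because the later `filter(peptide_id %in% all_names)` step silently drops any ID that is not present in `predictions`, it can be worth checking the selection first. For instance, the pool lists above contain 13 IDs, while the summary table printed further down shows 10 rows; this may simply reflect the upstream filters or a stale output. The sketch below uses hypothetical IDs and variable names (`selected_ids`, `table_ids`) to show the check.

```r
# Hypothetical selected IDs and the IDs available in the predictions table.
selected_ids <- c("pep_1", "pep_2", "pep_9")
table_ids <- c("pep_1", "pep_2", "pep_3")

# IDs that were selected but would be dropped by an %in% filter.
not_found <- setdiff(selected_ids, table_ids)

# IDs accidentally listed in more than one pool.
duplicated_ids <- selected_ids[duplicated(selected_ids)]

print(not_found)
print(duplicated_ids)
```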

In [74]:
cleaved <- data.frame(peptide_id = c("Amblyomma-americanum_evm.model.contig-245149-1.2", "Amblyomma-sculptum_GEEX01004552.1.p1",
                                     "Dermacentor-andersoni_XP-054918570.1", "Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR"),
                      cleaved_protein_sequence = c("AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC",
                                                   "ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA",
                                                   "GYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR",
                                                   "RPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP"),
                     cleaved_difficulty_level = c("Easy", "Easy", "Medium", "Medium"),
                     cleaved_hydrophilicity = c("Good", "Good", "Poor", "Poor"))
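
The cleaved sequences above were entered by hand, so a quick consistency check against the full-length sequences may be useful. The sketch below takes the full-length and cleaved sequences of `Amblyomma-americanum_evm.model.contig-245149-1.2` as they appear in this notebook, confirms that the cleaved sequence is a suffix of the full one, and derives the length of the removed signal peptide. The variable names are illustrative, and the sketch does not rely on the deepsig start/end columns.

```r
# Full-length sORF and its hand-entered cleaved form, copied from this notebook.
full_sequence <- "MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC"
cleaved_sequence <- "AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC"

# The cleaved peptide should be exactly the C-terminal part of the full sequence.
is_suffix <- endsWith(full_sequence, cleaved_sequence)

# Whatever precedes it is the removed signal peptide.
signal_peptide_length <- nchar(full_sequence) - nchar(cleaved_sequence)
removed_prefix <- substr(full_sequence, 1, signal_peptide_length)

# Re-deriving the cleaved sequence from the prefix length should reproduce it.
recleaved <- substring(full_sequence, signal_peptide_length + 1)

print(is_suffix)
print(removed_prefix)
print(identical(recleaved, cleaved_sequence))
```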

In [76]:
tmp <- predictions %>%
  filter(peptide_id %in% all_names) %>%
  left_join(cleaved, by = "peptide_id") %>%
  arrange(desc(traitmapping_orthogroup)) %>%
  select(-start, -end, -nlpprecursor_class_score, -nlpprecursor_cleavage_score, -traitmapping_model, -traitmapping_profile_type) %>%
  mutate(cleaved_length = nchar(cleaved_protein_sequence))

tmp %>%
  select(peptide_id, peptide_type, prediction_tool, peptide_length,
         traitmapping_orthogroup, traitmapping_coefficient, traitmapping_deepsig_feature,
         antiinflammatory, traitmapping_egg_Description, traitmapping_KO_definition,
         difficulty_level, hydrophilicity,
         cleaved_protein_sequence, cleaved_difficulty_level, cleaved_hydrophilicity) 

write_tsv(tmp, "tmp_predictions.tsv")

peptide_id,peptide_type,prediction_tool,peptide_length,traitmapping_orthogroup,traitmapping_coefficient,traitmapping_deepsig_feature,antiinflammatory,traitmapping_egg_Description,traitmapping_KO_definition,difficulty_level,hydrophilicity,cleaved_protein_sequence,cleaved_difficulty_level,cleaved_hydrophilicity
<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Rhipicephalus-microplus_XP-037282321.1_start56_end95,cleavage,deeppeptide,40,OG0013943,0.7934352,Signal peptide,0.0,,,Medium,Poor,,,
Dermacentor-andersoni_XP-054918570.1,sORF,less_than_100aa,100,OG0013943,0.7934352,Signal peptide,0.0,,peroxin-13;spore coat protein T;uncharacterized protein;heterogeneous nuclear ribonucleoprotein A1/A3;RNA-binding protein FUS,Difficult,Poor,GYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,Medium,Poor
Dermacentor-andersoni_XP-054924338.1_start87_end106,cleavage,nlpprecursor,19,OG0008102,0.9056743,Signal peptide,0.16666667,,osmotically inducible lipoprotein OsmB,Medium,Poor,,,
Rhipicephalus-microplus_XP-037271377.1_start70_end114,cleavage,deeppeptide,45,OG0008102,0.9056743,Signal peptide,0.0,,"FrmR/RcnR family transcriptional regulator, repressor of rcnA expression;protein S100-A2",Medium,Poor,,,
Rhipicephalus-microplus_XP-037271378.1_start78_end115,cleavage,deeppeptide,38,OG0008102,0.9056743,Signal peptide,0.0,,,Medium,Poor,,,
Amblyomma-americanum_evm.model.contig-245149-1.2,sORF,less_than_100aa,100,OG0008102,0.9056743,Signal peptide,0.03333333,,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase",Medium,Poor,AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC,Easy,Good
Amblyomma-sculptum_GEEX01004552.1.p1,sORF,less_than_100aa,99,OG0008102,0.9056743,Signal peptide,0.0,,U3 small nucleolar RNA-associated protein 12;proteoglycan 3;cullin-associated NEDD8-dissociated protein 1;Escherichia phage dCTP pyrophosphatase [EC:3.6.1.12];cell division protein DivIC,Medium,Poor,ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA,Easy,Good
Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,cleavage,deeppeptide,37,OG0007769,0.3802378,Signal peptide,0.03333333,Insect cuticle protein,tetraspanin-9;bcl2 associated transcription factor 1,Easy,Poor,,,
Dermacentor-andersoni_XP-050051547.1_start39_end77,cleavage,deeppeptide,39,OG0000880,0.3180449,Signal peptide,0.06666667,,ComB7 competence protein;ATP-dependent RNA helicase A [EC:3.6.4.13];type IV secretion system protein TrbL;Lymphocryptovirus nuclear antigen 1;heterogeneous nuclear ribonucleoprotein A1/A3,Medium,Poor,,,
Dermacentor-silvarum_XP-037559871.1_start39_end87,cleavage,deeppeptide,49,OG0000880,0.3180449,Signal peptide,0.03333333,,ComB7 competence protein;interleukin enhancer-binding factor 3;Lymphocryptovirus nuclear antigen 1;type IV secretion system protein TrbL;ATP-dependent RNA helicase A [EC:3.6.4.13],Medium,Poor,,,


In [79]:
cat(colnames(tmp), sep = ', ')

peptide_id, peptide_type, peptide_class, prediction_tool, protein_sequence, peptide_length, locus_tag, traitmapping_cluster, traitmapping_orthogroup, traitmapping_signif_level, traitmapping_signif_fdr, traitmapping_coefficient, traitmapping_species, traitmapping_Length, traitmapping_egg_Description, traitmapping_KO, traitmapping_KO_definition, traitmapping_deepsig_feature, traitmapping_deepsig_start, traitmapping_deepsig_end, evidence_of_itch_suppression, num_proteins_in_orthogroup, num_predicted_peptides, fraction_of_orthogroup_with_predicted_peptide, num_tick_proteins_in_orthogroup, fraction_of_orthogroup_tick_proteins, num_predicted_peptides_from_tick, fraction_of_orthogroup_with_predicted_tick_peptides, num_itchsuppsp_proteins_in_orthogroup, fraction_of_orthogroup_itchsuppsp_proteins, num_predicted_peptides_from_itchsuppsp, fraction_of_orthogroup_with_predicted_itchsuppsp_peptides, type_of_itch_suppression_evidence, num_predicted_peptides_with_sg_blast_hit, sequence, AB, ACE, ACP, 

In [82]:
output <- tmp %>%
  mutate(final_peptide_sequence = ifelse(is.na(cleaved_protein_sequence), protein_sequence, cleaved_protein_sequence),
         final_length = ifelse(is.na(cleaved_length), peptide_length, cleaved_length),
         final_difficulty_level = ifelse(is.na(cleaved_difficulty_level), difficulty_level, cleaved_difficulty_level),
         final_hydrophilicity = ifelse(is.na(cleaved_hydrophilicity), hydrophilicity, cleaved_hydrophilicity)) %>%
  select(peptide_id, peptide_sequence = final_peptide_sequence, length = final_length, hydrophilicity = final_hydrophilicity, 
         difficulty_of_synthesis = final_difficulty_level, peptide_type, peptide_class, prediction_tool, traitmapping_cluster, 
         traitmapping_orthogroup, traitmapping_coefficient, traitmapping_KO, traitmapping_KO_definition, traitmapping_deepsig_feature,
         type_of_itch_suppression_evidence,
         num_proteins_in_orthogroup, num_predicted_peptides, fraction_of_orthogroup_with_predicted_peptide, num_tick_proteins_in_orthogroup,
         fraction_of_orthogroup_tick_proteins, num_predicted_peptides_from_tick, fraction_of_orthogroup_with_predicted_tick_peptides,
         num_itchsuppsp_proteins_in_orthogroup, fraction_of_orthogroup_itchsuppsp_proteins, num_predicted_peptides_from_itchsuppsp,
         fraction_of_orthogroup_with_predicted_itchsuppsp_peptides, num_predicted_peptides_with_sg_blast_hit, 
         locus_tag, traitmapping_signif_level, traitmapping_signif_fdr, traitmapping_species, traitmapping_Length, 
         traitmapping_egg_Description, traitmapping_deepsig_start, traitmapping_deepsig_end,  
         AB, ACE, ACP, AF, AMAP, AMP, AOX, APP, AV, BBP, DPPIV, MRSA, Neuro, QS, TOX, TTCA, antiinflammatory,
         aliphatic_index, boman_index, charge, hydrophobicity, instability_index, isoelectric_point, molecular_weight, pd1_residue_volume,
         pd2_hydrophilicity, z1_lipophilicity, z2_steric_bulk_or_polarizability, z3_polarity_or_charge, z4_electronegativity_etc, z5_electronegativity_etc, 
         peptipedia_blast_sseqid, peptipedia_blast_full_sseq, peptipedia_blast_pident, peptipedia_blast_length, peptipedia_blast_qlen, 
         peptipedia_blast_slen, peptipedia_blast_mismatch, peptipedia_blast_gapopen, peptipedia_blast_qstart, peptipedia_blast_qend, 
         peptipedia_blast_sstart, peptipedia_blast_send, peptipedia_blast_evalue, peptipedia_blast_bitscore, peptipedia_num_hits, 
         peptide_deepsig_prediction = deepsig_combined, mmseqs2_representative_sequence, mmseqs2_num_peptides_in_cluster, 
         sgpeptide_blast_sseqid, sgpeptide_blast_full_sseq, sgpeptide_blast_pident, sgpeptide_blast_length, sgpeptide_blast_qlen, 
         sgpeptide_blast_slen, sgpeptide_blast_qcovhsp, sgpeptide_blast_scovhsp, sgpeptide_blast_mismatch, sgpeptide_blast_gapopen, 
         sgpeptide_blast_qstart, sgpeptide_blast_qend, sgpeptide_blast_sstart, sgpeptide_blast_send, sgpeptide_blast_evalue, sgpeptide_blast_bitscore,
         original_peptide_sequence = protein_sequence, original_peptide_length = peptide_length, 
         original_difficulty_level = difficulty_level, original_hydrophilicity = hydrophilicity, 
         cleaved_protein_sequence, cleaved_difficulty_level, cleaved_hydrophilicity, cleaved_length)

write_tsv(output, "20240430-predictions.tsv")
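
The `final_*` columns above follow one pattern: use the cleaved value when it exists, otherwise fall back to the original. A minimal sketch of that fallback on hypothetical rows (toy IDs and sequences, not real peptides) is shown below.

```r
# Hypothetical rows: pep_1 has a cleaved form, pep_2 does not.
toy <- data.frame(
  peptide_id = c("pep_1", "pep_2"),
  protein_sequence = c("MKAAAGGY", "VVSK"),
  cleaved_protein_sequence = c("GGY", NA),
  stringsAsFactors = FALSE
)

# Prefer the cleaved sequence; fall back to the original when it is NA.
# (dplyr::coalesce(cleaved_protein_sequence, protein_sequence) is equivalent.)
toy$final_peptide_sequence <- ifelse(
  is.na(toy$cleaved_protein_sequence),
  toy$protein_sequence,
  toy$cleaved_protein_sequence
)

# Recompute the length from the final sequence so the two cannot disagree.
toy$final_length <- nchar(toy$final_peptide_sequence)

print(toy[, c("peptide_id", "final_peptide_sequence", "final_length")])
```

Deriving the length from the final sequence, as here, is one way to avoid carrying separate original and cleaved length columns that could drift out of sync.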


## POOL 1 (5 peptides)

* Orthogroup `OG0008102` has the highest trait mapping coefficient, meaning it has the most promising statistical support for itch suppression when considering the presence of peptides in the group.

* **sORFs** (2 peptides)
    * There are three sORFs in this group, but only two have signal peptides, so we'll only move forward with those two.
    * As full-length sequences, the two sORFs with signal peptides are "Medium" to synthesize with "Poor" hydrophilicity. However, when we cleave their signal peptides, they become "Easy" to synthesize with "Good" hydrophilicity.
        * `Amblyomma-americanum_evm.model.contig-245149-1.2`: AAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC (easy, good; cleaved)
        * `Amblyomma-sculptum_GEEX01004552.1.p1`: ATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA (easy, good; cleaved)
    * Both of these sORFs match `Transcript_929497.p2_start21_end72`, which hits `petxwholefemale_TRINITY_DN4020_c0_g1_i1`. This is not a salivary gland transcriptome (though a "whole" transcriptome does include salivary glands). The cleavage peptides in the same orthogroup have hits to an actual salivary gland (sg) transcriptome, so I think this hit is OK.
* **cleavage** (3 peptides)
    * Once cleaved, both sORFs have good solubility, so we can put at least one cleavage peptide in this pool and it shouldn't cause aggregation problems.
    * Since the cleavage peptides are diverse, I think it would be good to synthesize all four. We can then check whether they have aggregation problems within the pool.
    * None of the cleavage peptides cluster together (mmseqs2), so we can't use clustering to drive our selection.
    * Two peptides have hits to tick sg transcriptomes, so I think these two would be the best to move forward with (note that `Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100` has a hit to `petxwholefemale_TRINITY_DN4020_c0_g1_i1`, which is a whole female transcriptome, not an sg transcriptome).
        * `Rhipicephalus-microplus_XP-037271377.1_start70_end114` (IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL)
        * `Rhipicephalus-microplus_XP-037271378.1_start78_end115` (VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF)
    * I think we should throw in `Dermacentor-andersoni_XP-054924338.1_start87_end106` (NGAISGAVGAAVANLINKG) as a bonus peptide because it's from a different species.
    * All three of these peptides have signal peptides on their parent proteins.
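
The clustering and salivary gland hit statements in the list above can be checked with a few lines of base R. The sketch below re-enters the four cleavage peptides of `OG0008102` from the table printed in this section; it assumes that an empty `sgpeptide_blast_sseqid` cell corresponds to `NA`, and it only counts clusters among these four rows.

```r
# Cleavage peptides of OG0008102, re-entered from the table in this section.
cleavage <- data.frame(
  peptide_id = c(
    "Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100",
    "Dermacentor-andersoni_XP-054924338.1_start87_end106",
    "Rhipicephalus-microplus_XP-037271377.1_start70_end114",
    "Rhipicephalus-microplus_XP-037271378.1_start78_end115"
  ),
  mmseqs2_representative_sequence = c(
    "Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100",
    "Dermacentor-andersoni_XP-054924338.1_start87_end106",
    "Rhipicephalus-microplus_XP-037271377.1_start70_end114",
    "Rhipicephalus-microplus_XP-037271378.1_start78_end115"
  ),
  # Assumption: a blank cell in the printed table means no BLAST hit (NA).
  sgpeptide_blast_sseqid = c(
    "Transcript_929497.p2_start64_end100",
    NA,
    "GIKN01002979.1.p1_start91_end134",
    "GIKN01002127.1.p1_start100_end137"
  ),
  stringsAsFactors = FALSE
)

# Count how many of these peptides share each mmseqs2 representative.
cluster_sizes <- table(cleavage$mmseqs2_representative_sequence)
all_singletons <- all(cluster_sizes == 1)

# Peptides with any hit in the salivary gland peptide BLAST column.
with_hit <- cleavage$peptide_id[!is.na(cleavage$sgpeptide_blast_sseqid)]

print(all_singletons)
print(with_hit)
```

Note that the hit column alone does not say whether the subject comes from a salivary gland or a whole-body transcriptome; that distinction was made by looking at the subject IDs, as described in the list.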

In [36]:
OG0008102 <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup,
         type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, 
         sgpeptide_blast_sseqid, protein_sequence) %>% 
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(difficulty_level %in% c("Easy", "Medium"))
OG0008102

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid,protein_sequence
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0008102,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Transcript_929497.p2_start64_end100,VGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC
Medium,Poor,cleavage,OG0008102,tick support,Dermacentor-andersoni_XP-054924338.1_start87_end106,Dermacentor-andersoni_XP-054924338.1_start87_end106,,NGAISGAVGAAVANLINKG
Medium,Poor,cleavage,OG0008102,tick support,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Rhipicephalus-microplus_XP-037271377.1_start70_end114,GIKN01002979.1.p1_start91_end134,IHPVVATVVVPVVKVLVNGAASGAVGALVGKLLESDRDKSPAPSL
Medium,Poor,cleavage,OG0008102,tick support,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Rhipicephalus-microplus_XP-037271378.1_start78_end115,GIKN01002127.1.p1_start100_end137,VVVSVSKKIVERVADATIGFVVNKLLGHLLDRPTEPSF
Medium,Poor,sORF,OG0008102,tick support,Amblyomma-americanum_evm.model.contig-245149-1.2,Amblyomma-americanum_evm.model.contig-245149-1.2,Transcript_929497.p2_start21_end72,MKAYLILVLVILGHLSQIHAAAMNSPKKVVKGMNELEQSIYEALKDRREDIIAAGKALKTSMDKVGDETDEQFVQALIIGIIAAVAGTATSAAVSAAIKC
Medium,Poor,sORF,OG0008102,tick support,Amblyomma-sculptum_GEEX01004552.1.p1,Amblyomma-sculptum_GEEX01004552.1.p1,Transcript_929497.p2_start21_end72,MKAYLILALVILGHLSQIHAATISSPKKTLKDVKELQQDIIEALKENREEIMAAARALKSSMGNMEDGTEEQYIPPLVTAVIAAVAGGAVGGATGAGIA


In [37]:
# look at annotation information
predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(traitmapping_orthogroup == "OG0008102") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, traitmapping_deepsig_feature, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,traitmapping_deepsig_feature,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Amblyomma-americanum_evm.model.contig-245149-1.2_start64_end100,Signal peptide,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase"
cleavage,Dermacentor-andersoni_XP-054924338.1_start87_end106,Signal peptide,K04062,osmotically inducible lipoprotein OsmB
cleavage,Rhipicephalus-microplus_XP-037271377.1_start70_end114,Signal peptide,K23240;K23759,"FrmR/RcnR family transcriptional regulator, repressor of rcnA expression;protein S100-A2"
cleavage,Rhipicephalus-microplus_XP-037271378.1_start78_end115,Signal peptide,,
sORF,Amblyomma-americanum_evm.model.contig-245149-1.2,Signal peptide,K19593;K20696;K05713;K22812,"outer membrane protein, multidrug efflux system;cecropin;2,3-dihydroxyphenylpropionate 1,2-dioxygenase [EC:1.13.11.16];thalianol hydroxylase"
sORF,Amblyomma-sculptum_GEEX01004552.1.p1,Signal peptide,K14556;K25722;K17263;K21503;K13052,U3 small nucleolar RNA-associated protein 12;proteoglycan 3;cullin-associated NEDD8-dissociated protein 1;Escherichia phage dCTP pyrophosphatase [EC:3.6.1.12];cell division protein DivIC


## POOL 2 (2 peptides)

* Orthogroup `OG0013943` has the second-highest trait mapping coefficient.
* Similar to the previous orthogroup, all of the predictions have poor solubility.
* **Cleavage** (1 peptide)
     * Most cleavage sequences cluster together. Five come from parent proteins that have signal peptides. I think we could pick one or two and try them out.
         * The representative sequence for most of them is `Dermacentor-andersoni_XP-054918570.1_start58_end100`, so we should start with one of the peptides in that cluster. `Rhipicephalus-microplus_XP-037282321.1_start56_end95` is part of this cluster and has a hit to peptides predicted from tick salivary glands, so this is probably the best one to move forward with.
         * `Rhipicephalus-microplus_XP-037282321.1_start56_end95` (HIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW)
* **sORF** (1 peptide)
    * This orthogroup also had three sORF peptides that are "difficult" to synthesize with "poor" solubility but that had signal peptides. When the signal peptides are cleaved off, they become "medium" to synthesize, still with "poor" solubility.
    * Two cluster with `Dermacentor-andersoni_XP-054918570.1` (mmseqs2 80%). These two also had hits to tick salivary gland-predicted peptides.
    * Move forward with `Dermacentor-andersoni_XP-054918570.1`: GYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR (medium, poor; cleaved)
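
The cleavage peptide choice described above (take the largest mmseqs2 cluster, then prefer a member with a salivary gland hit) can be sketched as follows. The rows are re-entered from the `OG0013943` cleavage table printed in this section; treating blank `sgpeptide_blast_sseqid` cells as `NA` is an assumption, and the variable names are illustrative.

```r
# Cleavage peptides of OG0013943, re-entered from the table in this section.
og0013943_cleavage <- data.frame(
  peptide_id = c(
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Dermacentor-silvarum_XP-049511149.1_start58_end96",
    "Haemaphysalis-longicornis_KAH9362006.1_start68_end106",
    "Rhipicephalus-microplus_XP-037282321.1_start56_end95",
    "Rhipicephalus-sanguineus_XP-037511163.1_start64_end102"
  ),
  mmseqs2_representative_sequence = c(
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Haemaphysalis-longicornis_KAH9362006.1_start68_end106",
    "Dermacentor-andersoni_XP-054918570.1_start58_end100",
    "Dermacentor-andersoni_XP-054918570.1_start58_end100"
  ),
  # Assumption: a blank cell in the printed table means no BLAST hit (NA).
  sgpeptide_blast_sseqid = c(NA, NA, NA, "GBJS01028274.1.p1_start72_end105", NA),
  stringsAsFactors = FALSE
)

# Find the representative with the most members among these rows.
cluster_sizes <- table(og0013943_cleavage$mmseqs2_representative_sequence)
largest_cluster <- names(cluster_sizes)[which.max(cluster_sizes)]

# Within that cluster, keep members that have a salivary gland peptide hit.
in_cluster <- og0013943_cleavage$mmseqs2_representative_sequence == largest_cluster
has_sg_hit <- !is.na(og0013943_cleavage$sgpeptide_blast_sseqid)
candidates <- og0013943_cleavage$peptide_id[in_cluster & has_sg_hit]

print(largest_cluster)
print(candidates)
```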

In [40]:
# cleavage
OG0013943 <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, peptide_id, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0013943") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) 

OG0013943

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,peptide_id,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Poor,cleavage,OG0013943,VKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,tick support,Dermacentor-andersoni_XP-054918570.1_start58_end100,Dermacentor-andersoni_XP-054918570.1_start58_end100,
Easy,Poor,cleavage,OG0013943,VKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR,tick support,Dermacentor-silvarum_XP-049511149.1_start58_end96,Dermacentor-andersoni_XP-054918570.1_start58_end100,
Easy,Poor,cleavage,OG0013943,GYGGYGGGYGGGYGGYGGGYGGGYGGYGGGYGGGYGGWH,tick support,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,
Medium,Poor,cleavage,OG0013943,HIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,tick support,Rhipicephalus-microplus_XP-037282321.1_start56_end95,Dermacentor-andersoni_XP-054918570.1_start58_end100,GBJS01028274.1.p1_start72_end105
Easy,Poor,cleavage,OG0013943,VSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGYGGGYGW,tick support,Rhipicephalus-sanguineus_XP-037511163.1_start64_end102,Dermacentor-andersoni_XP-054918570.1_start58_end100,


In [41]:
# look at annotation information
predictions %>% 
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(traitmapping_orthogroup == "OG0013943") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, traitmapping_deepsig_feature, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,traitmapping_deepsig_feature,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Dermacentor-andersoni_XP-054918570.1_start58_end100,Signal peptide,K13344;K06339;K06872;K12741;K13098,peroxin-13;spore coat protein T;uncharacterized protein;heterogeneous nuclear ribonucleoprotein A1/A3;RNA-binding protein FUS
cleavage,Dermacentor-silvarum_XP-049511149.1_start58_end96,Signal peptide,K06339;K13344;K06872;K13098;K12741,spore coat protein T;peroxin-13;uncharacterized protein;RNA-binding protein FUS;heterogeneous nuclear ribonucleoprotein A1/A3
cleavage,Haemaphysalis-longicornis_KAH9362006.1_start68_end106,Signal peptide,K13344;K13098;K06872;K03102;K14651,peroxin-13;RNA-binding protein FUS;uncharacterized protein;squid;transcription initiation factor TFIID subunit 15
cleavage,Rhipicephalus-microplus_XP-037282321.1_start56_end95,Signal peptide,,
cleavage,Rhipicephalus-sanguineus_XP-037511163.1_start64_end102,Signal peptide,K13344;K12741;K06872;K13098;K14651,peroxin-13;heterogeneous nuclear ribonucleoprotein A1/A3;uncharacterized protein;RNA-binding protein FUS;transcription initiation factor TFIID subunit 15


In [42]:
# sorfs with signal peptides
predictions %>% 
  filter(peptide_type == "sORF") %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, peptide_id, 
         traitmapping_deepsig_feature, protein_sequence, traitmapping_KO, traitmapping_KO_definition, 
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>%
  filter(traitmapping_orthogroup %in% c("OG0013943"))

difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,peptide_id,traitmapping_deepsig_feature,protein_sequence,traitmapping_KO,traitmapping_KO_definition,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Difficult,Poor,OG0013943,0.7934352,Dermacentor-andersoni_XP-054918570.1,Signal peptide,MHLYWVLLAAALATGVAAGGYGHASLSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGGGHGYWR,K13344;K06339;K06872;K12741;K13098,peroxin-13;spore coat protein T;uncharacterized protein;heterogeneous nuclear ribonucleoprotein A1/A3;RNA-binding protein FUS,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Difficult,Poor,OG0013943,0.7934352,Dermacentor-silvarum_XP-049511149.1,Signal peptide,MHLYWVLLACALATGVAAGGYGHASVSYISKPVVSVGYVSKPIVTYVKQPVATVSHVVKPVLTVSHGYGGGYGGGYGGGFGGGYGGGYGGGHGYWR,K06339;K13344;K06872;K13098;K12741,spore coat protein T;peroxin-13;uncharacterized protein;RNA-binding protein FUS;heterogeneous nuclear ribonucleoprotein A1/A3,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1
Difficult,Poor,OG0013943,0.7934352,Rhipicephalus-microplus_XP-037282321.1,Signal peptide,MHLYWVLLACALATGVTAGGYGHASISYVSKPVVRVGYVSKPVVTYVKQPVATVSHIVKPVLTVSHGYGGGYGGGYGGGYGGGYGGGYGGGYGYW,,,tick support,Dermacentor-andersoni_XP-054918570.1,GIKN01003556.1


## Excluded pool

* The orthogroup with the next highest coefficient, `OG0005246`, has easy/medium synthesis and high solubility.
* We exclude this pool because all of these are sORFs that have hits to a housekeeping gene (dynein light chain roadblock-type) and _do not_ have signal peptides.
* All of them had hits to tick salivary gland transcriptomes.

In [13]:
OG0005246 <- predictions %>% 
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
           type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>% 
  filter(traitmapping_orthogroup == "OG0005246") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression")
OG0005246

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,evidence_of_itch_suppression,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQEGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYCLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Dermacentor-andersoni_XP-050049593.1,evidence of itch suppression,GKHV01001871.1
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQEGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYCLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Dermacentor-silvarum_XP-037556830.1,evidence of itch suppression,GKHV01001871.1
Easy,Good,sORF,OG0005246,MAEVEATSEPERGAENDCNEHRRNSYQDTLKEIDPPKELTFLQIYFGRNEIMVAPDKYYFLIVIQNPTE,chelicerate support,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,evidence of itch suppression,GFZD01010403.1
Easy,Good,sORF,OG0005246,MTSEVEDIFKKLKDQDGVVGVVVTTSEGAAIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYFLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Rhipicephalus-microplus_XP-037283166.1,evidence of itch suppression,GBJT01001074.1
Easy,Good,sORF,OG0005246,MTSEVEEIFKKLKDQDGVVGVVVTTSEGAPIKTSFDNVTTMQYATLVTRLCEQARSTLRDLEPGNDLTFLRMRTKKHEIMISPDKNYFLVVVQNPSG,chelicerate support,Dermacentor-andersoni_XP-050049593.1,Rhipicephalus-sanguineus_XP-037500236.1,evidence of itch suppression,GBJT01001074.1


In [19]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0005246") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_egg_Description, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,deepsig_combined,traitmapping_egg_Description,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
sORF,Blomia-tropicalis_KAJ6221418.1,Chain; 1; 97; .; evidence=ECO:0000256,Dynein light chain,,
sORF,Dermacentor-andersoni_XP-050049593.1,Chain; 1; 97; .; evidence=ECO:0000256,Roadblock/LC7 domain,K10419,dynein light chain roadblock-type
sORF,Dermacentor-silvarum_XP-037556830.1,Chain; 1; 97; .; evidence=ECO:0000256,Roadblock/LC7 domain,K10419,dynein light chain roadblock-type
sORF,Euroglyphus-maynei_tr|A0A1Y3BIK5|A0A1Y3BIK5-EURMA,Chain; 1; 87; .; evidence=ECO:0000256,,K10419;K07131;K25221;K08480;K08121,dynein light chain roadblock-type;uncharacterized protein;[methyl-Co(III) glycine betaine-specific corrinoid protein]---tetrahydrofolate methyltransferase [EC:2.1.1.378];circadian clock protein KaiA;fibromodulin
sORF,Galendromus-occidentalis_XP-003740035.1,Chain; 1; 100; .; evidence=ECO:0000256,Ragulator complex protein LAMTOR5,,
sORF,Leptotrombidium-deliense_tr|A0A443RT29|A0A443RT29-9ACAR,Chain; 1; 69; .; evidence=ECO:0000256,Acts as one of several non-catalytic accessory components of the cytoplasmic dynein 1 complex that are thought to be involved in linking dynein to cargos and to adapter proteins that regulate dynein function. Cytoplasmic dynein 1 acts as a motor for the intracellular retrograde motility of vesicles and organelles along microtubules,K10419,dynein light chain roadblock-type
sORF,Limulus-polyphemus_XP-013775992.1,Chain; 1; 97; .; evidence=ECO:0000256,Dynein light chain,K10419;K16344,dynein light chain roadblock-type;ragulator complex protein LAMTOR5
sORF,Oppiella-nova_tr|A0A7R9MRX4|A0A7R9MRX4-9ACAR,Chain; 1; 64; .; evidence=ECO:0000256,,K10419;K00052;K21360;K07131,dynein light chain roadblock-type;3-isopropylmalate dehydrogenase [EC:1.1.1.85];3-isopropylmalate/methylthioalkylmalate dehydrogenase [EC:1.1.1.85 1.1.1.-];uncharacterized protein
sORF,Oppiella-nova_tr|A0A7R9M7S0|A0A7R9M7S0-9ACAR,Chain; 1; 99; .; evidence=ECO:0000256,dynein intermediate chain binding,K10419;K07131;K10647;K04370,dynein light chain roadblock-type;uncharacterized protein;midline 2 [EC:2.3.2.27];ragulator complex protein LAMTOR3
sORF,Phalangium-opilio_jg27177t1,Chain; 1; 98; .; evidence=ECO:0000256,Dynein light chain,K10419;K16344,dynein light chain roadblock-type;ragulator complex protein LAMTOR5


## POOL 3 (1 peptide)

* The peptides in `OG0007769` are all cleavage peptides.
* They are easy or medium to synthesize, but all have poor solubility.
* Only three peptides have parent proteins with signal peptides.
* Of these three, only one had a hit to a tick salivary gland transcriptome, `Rhipicephalus-sanguineus_XP-037515628.1_start185_end221`. I say we just move forward with that sequence.
* `Rhipicephalus-sanguineus_XP-037515628.1_start185_end221` (AAGYGAAGLYGGLGGRGVGVYAGGAGGGLLGKHGGWH)
* If there are no solubility/aggregation issues, pool 3 can be combined with pool 2.

In [44]:
OG0007769 <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup, protein_sequence,
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id,
         evidence_of_itch_suppression, sgpeptide_blast_sseqid) %>%
  filter(traitmapping_orthogroup == "OG0007769") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression") %>%
  select(-evidence_of_itch_suppression)
OG0007769

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,protein_sequence,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,cleavage,OG0007769,GAGGLYGAGVARYGGAGLYGGLGGGGVGVYAGGAGVGVLGKHGGGVGWH,tick support,Dermacentor-andersoni_XP-054919892.1_start175_end223,Dermacentor-andersoni_XP-054919792.1_start424_end472,
Medium,Poor,cleavage,OG0007769,GAGGLYGAGVARYGGAGLYGGLGGRGVGVYGGGAGVGVLGKHGGGVGWH,tick support,Dermacentor-andersoni_XP-054919892.1_start175_end223,Dermacentor-andersoni_XP-054919892.1_start175_end223,
Easy,Poor,cleavage,OG0007769,AAGYGAAGLYGGLGGRGVGVYAGGAGGGLLGKHGGWH,tick support,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,GFGI01009308.1.p2_start118_end154
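
As a quick sanity check on the three rows above, the sketch below compares each sequence length with the coordinates in its `peptide_id`. It assumes the `_startN_endM` suffix gives 1-based, inclusive positions in the parent protein, which is consistent with these rows but is my reading of the IDs rather than something stated in this notebook. The `peptides` data frame and the columns added to it are hypothetical names used only for illustration.

```r
# Rows copied by hand from the OG0007769 table above.
peptides <- data.frame(
  peptide_id = c(
    "Dermacentor-andersoni_XP-054919792.1_start424_end472",
    "Dermacentor-andersoni_XP-054919892.1_start175_end223",
    "Rhipicephalus-sanguineus_XP-037515628.1_start185_end221"
  ),
  protein_sequence = c(
    "GAGGLYGAGVARYGGAGLYGGLGGGGVGVYAGGAGVGVLGKHGGGVGWH",
    "GAGGLYGAGVARYGGAGLYGGLGGRGVGVYGGGAGVGVLGKHGGGVGWH",
    "AAGYGAAGLYGGLGGRGVGVYAGGAGGGLLGKHGGWH"
  ),
  stringsAsFactors = FALSE
)

# Extract the start and end coordinates from the end of each ID.
# ASSUMPTION: coordinates are 1-based and inclusive.
coords <- regmatches(
  peptides$peptide_id,
  regexec("_start([0-9]+)_end([0-9]+)$", peptides$peptide_id)
)

# Length implied by the coordinates versus the length of the sequence itself.
peptides$expected_length <- vapply(
  coords,
  function(m) as.integer(m[3]) - as.integer(m[2]) + 1L,
  integer(1)
)
peptides$observed_length <- nchar(peptides$protein_sequence)
peptides$lengths_agree <- peptides$expected_length == peptides$observed_length

peptides[, c("peptide_id", "expected_length", "observed_length", "lengths_agree")]
```

A `FALSE` in `lengths_agree` would suggest that an ID and a sequence were paired incorrectly when copying them into a peptide order.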


In [45]:
# look at annotation information
predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(traitmapping_orthogroup == "OG0007769") %>%
  filter(difficulty_level %in% c("Easy", "Medium")) %>%
  select(peptide_type, peptide_id, traitmapping_deepsig_feature, traitmapping_egg_Description, traitmapping_KO, traitmapping_KO_definition) 

peptide_type,peptide_id,traitmapping_deepsig_feature,traitmapping_egg_Description,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Dermacentor-andersoni_XP-054919792.1_start424_end472,Signal peptide,Insect cuticle protein,,
cleavage,Dermacentor-andersoni_XP-054919892.1_start175_end223,Signal peptide,Insect cuticle protein,K13087,bcl2 associated transcription factor 1
cleavage,Rhipicephalus-sanguineus_XP-037515628.1_start185_end221,Signal peptide,Insect cuticle protein,K17350;K13087,tetraspanin-9;bcl2 associated transcription factor 1


## POOL 4 (7 peptides)

* `OG0000231` is the orthogroup with the next highest solubility.
* While the group has many members, most don't have signal peptides; only 1 sORF has an annotated signal peptide.
* That sORF doesn't have a hit to a tick salivary gland transcriptome, but we wouldn't necessarily expect one because it is from *Leptotrombidium deliense* (a mite, not a tick).
* `Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR` RPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP (signal peptide cleaved off; "Medium" to synthesize, "Poor" solubility)
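
A minimal sketch of how the mature peptide above can be recovered from the full sORF sequence, assuming the `Chain` coordinates in `deepsig_combined` are 1-based and inclusive. The two strings are copied from the `deepsig_combined` and `protein_sequence` columns for this sORF. The variable names are only for illustration, and this is not necessarily how the sequence was trimmed originally.

```r
# Values copied from the predictions row for this sORF.
deepsig_combined <- paste(
  "Signal peptide; 1; 20; 0.99; evidence=ECO:0000256",
  "Chain; 21; 93; .; evidence=ECO:0000256",
  sep = " | "
)
protein_sequence <- paste0(
  "MKYLLVFLLFYFQCKKTVQQ",
  "RPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP"
)

# Pull the start and end of the "Chain" feature out of the annotation string.
# regexec() returns the full match followed by the two captured groups.
chain <- regmatches(
  deepsig_combined,
  regexec("Chain; ([0-9]+); ([0-9]+)", deepsig_combined)
)[[1]]
chain_start <- as.integer(chain[2])
chain_end <- as.integer(chain[3])

# Keep only the chain, i.e. drop the signal peptide at positions 1 to 20.
mature_peptide <- substr(protein_sequence, chain_start, chain_end)
mature_peptide
```

The result is the 73-residue sequence starting with `RPLYNIAHMV`, which matches the sequence listed in the bullet above.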

In [52]:
OG0000231 <- predictions %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(difficulty_level, hydrophilicity, peptide_type, traitmapping_orthogroup,
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, peptide_id, 
         evidence_of_itch_suppression, sgpeptide_blast_sseqid, deepsig_combined, protein_sequence) %>% 
  filter(traitmapping_orthogroup == "OG0000231") %>%
  filter(evidence_of_itch_suppression == "evidence of itch suppression") %>%
  select(-evidence_of_itch_suppression)
OG0000231

difficulty_level,hydrophilicity,peptide_type,traitmapping_orthogroup,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,peptide_id,sgpeptide_blast_sseqid,deepsig_combined,protein_sequence
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Medium,Poor,sORF,OG0000231,chelicerate support,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,,Signal peptide; 1; 20; 0.99; evidence=ECO:0000256 | Chain; 21; 93; .; evidence=ECO:0000256,MKYLLVFLLFYFQCKKTVQQRPLYNIAHMVNSIRQVNEFLELGANAIETDVVFYSNGTAMKTFHGTPCDCFRDCFHSESIVNYLEYTKNITTP


In [50]:
# look at annotation information
predictions %>% 
  filter(traitmapping_orthogroup == "OG0000231") %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  select(peptide_type, peptide_id, starts_with("deepsig"), traitmapping_KO, traitmapping_KO_definition)

peptide_type,peptide_id,deepsig_combined,traitmapping_KO,traitmapping_KO_definition
<chr>,<chr>,<chr>,<chr>,<chr>
sORF,Leptotrombidium-deliense_tr|A0A443QRH9|A0A443QRH9-9ACAR,Signal peptide; 1; 20; 0.99; evidence=ECO:0000256 | Chain; 21; 93; .; evidence=ECO:0000256,K01123,sphingomyelin phosphodiesterase D [EC:3.1.4.41]


## POOL 5

* `OG0000880` has the next highest coefficient.
* Apart from the orthogroups listed above, it is also the only orthogroup left in our data set with peptides whose parent proteins have signal peptides.
* All of the peptides with signal peptides are **cleavage** peptides.
* Most are "Medium" to synthesize and have "Poor" solubility.
* There's a big diversity in parent protein annotations, but all of the peptides are glycine-rich.
* After extensive filtering, the cleavage peptides group into four mmseqs2 clusters. I don't think it would be unreasonable to synthesize one sequence from each cluster:
    * Rhipicephalus-microplus_XP-037269427.1_start34_end70 (GGVLGGLGGVGYGTGLGTGLGTGFGGSGLSGVGLGGL)
    * Dermacentor-andersoni_XP-050051547.1_start39_end77 (GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGSPAL)
    * Dermacentor-silvarum_XP-037559871.1_start39_end87 (GGVLGGLGGYGAGVGPGLVGAGIGGPGLVGGGVVGNPALVGAGLGQGVG)
    * Hyalomma-asiaticum_KAH6923445.1_start29_end58 (GGLLGAGLGGYGGGLGGPGLVGAGLGGVGL), from the cluster represented by Hyalomma-asiaticum_KAH6923440.1_start24_end69
* We could test them as a pool if they don't aggregate together in solution.
* If we had to pick 1, I would select the shortest because I imagine it is easiest to synthesize (that's just a guess though).
* If there aren't aggregation problems, this pool can be combined with pool 2 and/or pool 3.



In [62]:
predictions %>%
  filter(traitmapping_orthogroup == "OG0000880") %>%
  filter(traitmapping_deepsig_feature == "Signal peptide") %>%
  filter(!is.na(sgpeptide_blast_sseqid)) %>% # filter to those that have salivary gland transcriptome hits
  filter(!sgpeptide_blast_sseqid %in% c("Transcript_125878.p1_start39_end77",
                                        "Transcript_160125.p1_start39_end77",
                                        "Transcript_211978.p1_start35_end80",
                                        "Transcript_66059.p1_start33_end65")) %>%  # remove Amblyomma hits that are not from salivary gland transcriptomes
  select(peptide_type, difficulty_level, hydrophilicity, traitmapping_orthogroup, traitmapping_coefficient, peptide_id, 
         traitmapping_deepsig_feature, protein_sequence, traitmapping_KO, traitmapping_KO_definition, 
         type_of_itch_suppression_evidence, mmseqs2_representative_sequence, sgpeptide_blast_sseqid) %>%
  arrange(mmseqs2_representative_sequence)

peptide_type,difficulty_level,hydrophilicity,traitmapping_orthogroup,traitmapping_coefficient,peptide_id,traitmapping_deepsig_feature,protein_sequence,traitmapping_KO,traitmapping_KO_definition,type_of_itch_suppression_evidence,mmseqs2_representative_sequence,sgpeptide_blast_sseqid
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-microplus_XP-037269427.1_start34_end70,Signal peptide,GGVLGGLGGVGYGTGLGTGLGTGFGGSGLSGVGLGGL,K19461,Lymphocryptovirus nuclear antigen 1,tick support,Amblyomma-americanum_evm.model.contig-85352-1.5_start35_end80,GBJS01000632.1.p1_start34_end70
cleavage,Medium,Poor,OG0000880,0.3180449,Dermacentor-andersoni_XP-050051547.1_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGSPAL,K12051;K13184;K07344;K19461;K12741,ComB7 competence protein;ATP-dependent RNA helicase A [EC:3.6.4.13];type IV secretion system protein TrbL;Lymphocryptovirus nuclear antigen 1;heterogeneous nuclear ribonucleoprotein A1/A3,tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GBJS01005204.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-microplus_XP-037269420.1_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLAGVGLGRPGLIGGGVVGNPGL,K19461;K22989,Lymphocryptovirus nuclear antigen 1;integral membrane protein GPR137,tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GBJS01005204.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-microplus_XP-037269421.1_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGLGAPGLVGDGVVGNPAL,K19461;K24317,Lymphocryptovirus nuclear antigen 1;germinal-center associated nuclear protein [EC:2.3.1.48],tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GBJS01005204.1.p1_start39_end77
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-sanguineus_XP-037517942.2_start39_end77,Signal peptide,GGVLGGLGGYGAGVGPGLVGTGLGGPGLVGGGVVGNPGL,K19461;K13184;K12741;K24317,Lymphocryptovirus nuclear antigen 1;ATP-dependent RNA helicase A [EC:3.6.4.13];heterogeneous nuclear ribonucleoprotein A1/A3;germinal-center associated nuclear protein [EC:2.3.1.48],tick support,Dermacentor-andersoni_XP-050051547.1_start39_end77,GFGI01023663.1.p1_start39_end78
cleavage,Medium,Poor,OG0000880,0.3180449,Dermacentor-silvarum_XP-037559871.1_start39_end87,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGIGGPGLVGGGVVGNPALVGAGLGQGVG,K12051;K13090;K19461;K07344;K13184,ComB7 competence protein;interleukin enhancer-binding factor 3;Lymphocryptovirus nuclear antigen 1;type IV secretion system protein TrbL;ATP-dependent RNA helicase A [EC:3.6.4.13],tick support,Dermacentor-silvarum_XP-037559871.1_start39_end87,GBJT01016707.1.p1_start62_end110
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-microplus_XP-037269422.1_start39_end87,Signal peptide,GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGNPALVGGGLGHGVG,K19461;K24317,Lymphocryptovirus nuclear antigen 1;germinal-center associated nuclear protein [EC:2.3.1.48],tick support,Dermacentor-silvarum_XP-037559871.1_start39_end87,GBJT01016707.1.p1_start62_end110
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-sanguineus_XP-037517950.2_start44_end92,Signal peptide,GGVLGGLGGYGAGVGPGLVGSGLGGPGLVGGGVVGNPALVGAGLGHGVG,K13090;K13184;K07344;K19461;K13982,interleukin enhancer-binding factor 3;ATP-dependent RNA helicase A [EC:3.6.4.13];type IV secretion system protein TrbL;Lymphocryptovirus nuclear antigen 1;probable ATP-dependent RNA helicase DDX4 [EC:3.6.4.13],tick support,Dermacentor-silvarum_XP-037559871.1_start39_end87,GBJT01016707.1.p1_start62_end110
cleavage,Medium,Poor,OG0000880,0.3180449,Hyalomma-asiaticum_KAH6923445.1_start29_end58,Signal peptide,GGLLGAGLGGYGGGLGGPGLVGAGLGGVGL,,,tick support,Hyalomma-asiaticum_KAH6923440.1_start24_end69,GFGI01047205.1.p1_start29_end60
cleavage,Medium,Poor,OG0000880,0.3180449,Rhipicephalus-microplus_XP-037272456.1_start28_end68,Signal peptide,GGLLGAGLGGYGAGVGGAGLVGAGVGGPGLVGAGVGGPGLV,K19461,Lymphocryptovirus nuclear antigen 1,tick support,Hyalomma-asiaticum_KAH6923440.1_start24_end69,GBJS01004457.1.p1_start14_end54
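
One way to turn the table above into one candidate per mmseqs2 cluster is to keep the shortest peptide in each cluster. This is only a sketch of that idea, not the procedure used to build the list in the bullets: the `og880` tibble is a hand-copied subset of three columns from the table, `peptide_length` is a helper column, and ties within a cluster are broken arbitrarily.

```r
library(dplyr)

# Cluster representatives, as named in mmseqs2_representative_sequence.
rep_aa <- "Amblyomma-americanum_evm.model.contig-85352-1.5_start35_end80"
rep_da <- "Dermacentor-andersoni_XP-050051547.1_start39_end77"
rep_ds <- "Dermacentor-silvarum_XP-037559871.1_start39_end87"
rep_ha <- "Hyalomma-asiaticum_KAH6923440.1_start24_end69"

# Three columns copied by hand from the ten rows of the table above.
og880 <- tibble::tibble(
  peptide_id = c(
    "Rhipicephalus-microplus_XP-037269427.1_start34_end70",
    "Dermacentor-andersoni_XP-050051547.1_start39_end77",
    "Rhipicephalus-microplus_XP-037269420.1_start39_end77",
    "Rhipicephalus-microplus_XP-037269421.1_start39_end77",
    "Rhipicephalus-sanguineus_XP-037517942.2_start39_end77",
    "Dermacentor-silvarum_XP-037559871.1_start39_end87",
    "Rhipicephalus-microplus_XP-037269422.1_start39_end87",
    "Rhipicephalus-sanguineus_XP-037517950.2_start44_end92",
    "Hyalomma-asiaticum_KAH6923445.1_start29_end58",
    "Rhipicephalus-microplus_XP-037272456.1_start28_end68"
  ),
  protein_sequence = c(
    "GGVLGGLGGVGYGTGLGTGLGTGFGGSGLSGVGLGGL",
    "GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGSPAL",
    "GGVLGGLGGYGAGVGPGLAGVGLGRPGLIGGGVVGNPGL",
    "GGVLGGLGGYGAGVGPGLVGAGLGAPGLVGDGVVGNPAL",
    "GGVLGGLGGYGAGVGPGLVGTGLGGPGLVGGGVVGNPGL",
    "GGVLGGLGGYGAGVGPGLVGAGIGGPGLVGGGVVGNPALVGAGLGQGVG",
    "GGVLGGLGGYGAGVGPGLVGAGLGGPGLVGGGVVGNPALVGGGLGHGVG",
    "GGVLGGLGGYGAGVGPGLVGSGLGGPGLVGGGVVGNPALVGAGLGHGVG",
    "GGLLGAGLGGYGGGLGGPGLVGAGLGGVGL",
    "GGLLGAGLGGYGAGVGGAGLVGAGVGGPGLVGAGVGGPGLV"
  ),
  mmseqs2_representative_sequence = c(
    rep_aa, rep_da, rep_da, rep_da, rep_da, rep_ds, rep_ds, rep_ds, rep_ha, rep_ha
  )
)

# Keep one peptide per cluster: the shortest, with ties broken arbitrarily.
picks <- og880 %>%
  mutate(peptide_length = nchar(protein_sequence)) %>%
  group_by(mmseqs2_representative_sequence) %>%
  slice_min(peptide_length, n = 1, with_ties = FALSE) %>%
  ungroup()

picks %>% select(mmseqs2_representative_sequence, peptide_id, peptide_length)
```

In these ten rows, the Hyalomma cluster is the only one whose members differ in length (30 vs 41 residues), so it is the only cluster where the shortest-peptide rule actually decides the pick.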
