### This notebook selects tick protease inhibitors and secreted proteins of unknown function for folding with ESMFold.

This notebook uses sequence-based protein annotations of the 40 chelicerate proteomes from our previous work [here](https://research.arcadiascience.com/pub/result-chelicerate-detection-suppression/release/2?readingCollection=3a9d6cf5), processed with our [annotation pipeline](https://github.com/Arcadia-Science/protein-data-curation/tree/v1.2). I am pulling protease inhibitors and secreted proteins of unknown function for further analysis with [ProteinCartography](https://github.com/Arcadia-Science/ProteinCartography).

#### Inputs: 
Chelicerate protein annotations: ../../25aacutoff_fullset_100223_annotated/all_chelicerate_proteins_annotated.csv 

#### Outputs: 
Tick predicted protease inhibitors <1200 aa, for folding: ../datasheets/tick_PIs_1200.csv \
Tick secreted proteins of unknown function (PUFs) <1200 aa, for folding: ../datasheets/tick_PUFs_1200.csv


In [1]:
import pandas as pd
import numpy as np
import glob
from collections import defaultdict
pd.set_option('display.max_columns', None)

### Reading concatenated file of chelicerate proteome annotations


In [10]:
all_df = pd.read_csv("../../25aacutoff_fullset_100223_annotated/all_chelicerate_proteins_annotated.csv")


### Labelling the tick species 

In [3]:
tick_list = ["Amblyomma-sculptum", "Rhipicephalus-microplus", "Ornithodoros-erraticus", "Dermacentor-variabilis", "Ixodes-scapularis", 
         "Ixodes-ricinus", "Dermacentor-silvarum", "Dermacentor-andersoni", "Haemaphysalis-longicornis", "Ornithodoros-moubata", 
         "Ixodes-persulcatus","Hyalomma-asiaticum", "Ornithodoros-turicata", "Rhipicephalus-sanguineus", "Amblyomma-americanum" ]

In [4]:
def label_ticks(species_list):
    """Return "yes" for each species found in tick_list and "no" otherwise."""
    answers = []
    for species in species_list:
        if species in tick_list:
            answers.append("yes")
        else:
            answers.append("no")
    return answers

In [5]:
all_df["is_tick"] = label_ticks(all_df["species_name"].to_list()) 
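
The same labelling can also be done without an explicit loop. The sketch below is one possible vectorised version on a toy table; `toy_df`, `toy_ticks`, and the non-tick name `Some-spider` are hypothetical stand-ins, not part of the real dataset.

```python
import pandas as pd

# Hypothetical toy stand-ins for all_df and tick_list
toy_df = pd.DataFrame(
    {"species_name": ["Ixodes-scapularis", "Some-spider", "Ixodes-ricinus"]}
)
toy_ticks = ["Ixodes-scapularis", "Ixodes-ricinus"]

# isin gives a boolean Series; map converts it to the "yes"/"no" labels used in this notebook
toy_df["is_tick"] = toy_df["species_name"].isin(toy_ticks).map({True: "yes", False: "no"})
print(toy_df)
```

This should give the same labels as `label_ticks`, assuming the species names match `tick_list` exactly.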


In [6]:
# number of proteins for each tick species
all_df.loc[all_df["is_tick"] == "yes"].value_counts("species_name")

species_name
Ornithodoros-turicata        29460
Amblyomma-americanum         28319
Hyalomma-asiaticum           27478
Ixodes-persulcatus           26020
Ornithodoros-moubata         24072
Haemaphysalis-longicornis    23853
Dermacentor-andersoni        22845
Dermacentor-silvarum         22392
Rhipicephalus-sanguineus     20838
Ixodes-scapularis            20386
Ixodes-ricinus               19280
Dermacentor-variabilis       18937
Ornithodoros-erraticus       18386
Rhipicephalus-microplus      17235
Amblyomma-sculptum           11655
Name: count, dtype: int64

### Reformatting the KO annotations to retrieve the top KO hit in a separate column called "KO_high_score"


#### Some notes on annotation confidence:
- eggNOG reports hits with E-values <0.001. This produced many annotations (one annotation per protein, covering a large share of the total proteome), and the annotations I spot-checked looked real.
- KO annotations instead use a per-HMM scoring threshold: each HMM has its own score that a hit must exceed to count. This produced a very small pool of KO annotations, essentially a subset of the eggNOG annotations, so they were not useful on their own.
- Instead, we report the top 5 KO annotations with E-values <0.001 for each protein. This makes the KO pool much larger than the eggNOG pool, so KO annotations here should be considered less stringent than the eggNOG annotations.
- Spot-checking suggests that KO annotations down to a score of around 50 are fairly reliable.
- KOs that pass their per-HMM threshold are marked with an * in KO_pass.

In [7]:
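def get_high(score_list):
    """Return the first (top) score from each ';'-separated KO_score string, or 0 if there is no KO hit."""
    answer_list = []
    for score in score_list:
        if score == "None":
            # proteins without any KO hit are stored as the string "None"
            answer = 0
        else:
            answer = float(score.split(";")[0])
        answer_list.append(answer)
    return answer_list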

In [8]:
all_df["KO_high_score"] = get_high(all_df["KO_score"].to_list())

# write the full combined annotations back to csv (this overwrites the input file) - will be on Zenodo
all_df.to_csv("../../25aacutoff_fullset_100223_annotated/all_chelicerate_proteins_annotated.csv", index = False)
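
To make the confidence notes above concrete, here is a minimal sketch of how the top KO score could be extracted and then used as a filter. The toy scores are made up, and the cutoff of 50 is only the rough spot-check value mentioned in the notes, not a validated threshold.

```python
import pandas as pd

# Hypothetical KO_score strings: ';'-separated scores, or "None" when there is no KO hit
toy = pd.DataFrame({"KO_score": ["120.5;80.1;33.0", "None", "45.2"]})

def top_score(score):
    """Return the first score in a ';'-separated string, or 0.0 for "None"."""
    return 0.0 if score == "None" else float(score.split(";")[0])

toy["KO_high_score"] = toy["KO_score"].map(top_score)

# Keep only proteins whose top KO hit reaches the assumed cutoff of 50
confident = toy.loc[toy["KO_high_score"] >= 50]
print(confident)
```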

### Pulling all the proteins that have an EGGNOG annotation related to protease inhibitor function


In [49]:
egg_protease_inhibitors = all_df.loc[all_df["egg_Description"].str.contains("protease inhibitor", case = False)]
egg_propeptide_inhibitor = all_df.loc[all_df["egg_Description"].str.contains("propeptide inhibitor", case = False)]
egg_peptidase_inhibitor = all_df.loc[all_df["egg_Description"].str.contains("peptidase inhibitor", case = False)]
egg_proteinase_inhibitor = all_df.loc[all_df["egg_Description"].str.contains("proteinase inhibitor", case = False)]


egg_serpin = all_df.loc[all_df["egg_Description"].str.contains("serpin", case = False)]
egg_cystatin = all_df.loc[all_df["egg_Description"].str.contains("cystatin", case = False)]
egg_kazal = all_df.loc[all_df["egg_Description"].str.contains("kazal", case = False)]
# tissue factor pathway inhibitor (TFPI)
egg_TFPI = all_df.loc[all_df["egg_Description"].str.contains("tissue factor pathway inhibitor", case = False)]
egg_Pacifastin = all_df.loc[all_df["egg_Description"].str.contains("Pacifastin", case = False)]
egg_plai = all_df.loc[all_df["egg_Description"].str.contains("Plasminogen activator inhibitor", case = False)]
egg_kallistatin = all_df.loc[all_df["egg_Description"].str.contains("kallistatin", case = False)]


# alpha-2-macroglobulin hits, excluding alpha-2-macroglobulin receptor annotations
egg_a2m = all_df.loc[all_df["egg_Description"].str.contains("alpha-2-macroglobulin", case = False)]
egg_a2m = egg_a2m.loc[~egg_a2m["egg_Description"].str.contains("macroglobulin receptor")]

egg_PI_df = pd.concat([
    egg_kallistatin, egg_plai, egg_Pacifastin, egg_TFPI, egg_a2m, egg_kazal,
    egg_cystatin, egg_serpin, egg_protease_inhibitors, egg_propeptide_inhibitor,
    egg_peptidase_inhibitor, egg_proteinase_inhibitor,
]).drop_duplicates()



### Pulling all the proteins that have a KO annotation related to protease inhibitor function


In [50]:
KO_protease_inhibitors = all_df.loc[all_df["KO_definition"].str.contains("protease inhibitor", case = False)]
KO_propeptide_inhibitor = all_df.loc[all_df["KO_definition"].str.contains("propeptide inhibitor", case = False)]
KO_peptidase_inhibitor = all_df.loc[all_df["KO_definition"].str.contains("peptidase inhibitor", case = False)]
KO_proteinase_inhibitor = all_df.loc[all_df["KO_definition"].str.contains("proteinase inhibitor", case = False)]

KO_serpin = all_df.loc[all_df["KO_definition"].str.contains("serpin", case = False)]
KO_cystatin = all_df.loc[all_df["KO_definition"].str.contains("cystatin", case = False)]
KO_kazal = all_df.loc[all_df["KO_definition"].str.contains("kazal", case = False)]
# tissue factor pathway inhibitor (TFPI)
KO_TFPI = all_df.loc[all_df["KO_definition"].str.contains("tissue factor pathway inhibitor", case = False)]
KO_Pacifastin = all_df.loc[all_df["KO_definition"].str.contains("Pacifastin", case = False)]
KO_plai = all_df.loc[all_df["KO_definition"].str.contains("Plasminogen activator inhibitor", case = False)]
KO_kallistatin = all_df.loc[all_df["KO_definition"].str.contains("Kallistatin", case = False)]


# alpha-2-macroglobulin hits, excluding alpha-2-macroglobulin receptor annotations
KO_a2m = all_df.loc[all_df["KO_definition"].str.contains("alpha-2-macroglobulin", case = False)]
KO_a2m = KO_a2m.loc[~KO_a2m["KO_definition"].str.contains("macroglobulin receptor")]


KO_PI_df = pd.concat([
    KO_kallistatin, KO_plai, KO_Pacifastin, KO_TFPI, KO_a2m, KO_kazal,
    KO_cystatin, KO_serpin, KO_protease_inhibitors, KO_propeptide_inhibitor,
    KO_peptidase_inhibitor, KO_proteinase_inhibitor,
]).drop_duplicates()
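
The keyword searches above could also be collapsed into a single pattern. The sketch below shows the idea on a toy table with a shortened keyword list; the example definitions are made up, and the alpha-2-macroglobulin receptor exclusion would still need to be handled as a separate step.

```python
import re
import pandas as pd

# Shortened, illustrative keyword list (the notebook searches more terms than this)
keywords = ["protease inhibitor", "serpin", "cystatin", "kazal"]

# Escape each keyword and join with "|" so any one of them can match
pattern = "|".join(re.escape(k) for k in keywords)

# Hypothetical KO definitions; "None" marks a protein without a KO hit
toy = pd.DataFrame(
    {"KO_definition": ["Serpin B", "None", "Kazal-type serine protease inhibitor", "kinase"]}
)

hits = toy.loc[toy["KO_definition"].str.contains(pattern, case=False, regex=True)]
print(hits)
```

Because each row is matched once against the combined pattern, no `drop_duplicates` is needed for proteins that match several keywords.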


### Combining KO and EGGNOG protease inhibitors into one df and getting tick protease inhibitors

In [51]:
# all protease inhibitors 
all_PI_df = pd.concat([KO_PI_df, egg_PI_df])
all_PI_df = all_PI_df.drop_duplicates()


In [52]:
# all tick protease inhibitors 
tick_PI_df = all_PI_df.loc[all_PI_df["is_tick"] == "yes"]

In [53]:
len(tick_PI_df)

3580

### Pulling together all tick secreted PUFs

In [54]:
# all ticks 
ticks_df = all_df.loc[all_df["is_tick"] == "yes"]

In [55]:
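# secreted proteins of unknown function: no eggNOG seed ortholog, no KO hit passing its threshold, and a predicted signal peptide
tick_pufs_secreted = ticks_df.loc[
    (ticks_df["egg_seed_ortholog"] == "None")
    & (~ticks_df["KO_pass"].str.contains("*", regex=False))  # literal "*", not a regex
    & (ticks_df["deepsig_feature"] == "Signal peptide")
]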
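
As a small illustration of how the three conditions combine, here is a sketch on a toy table. All values are made up, including the ortholog ID; only the column names and the "None", "*", and "Signal peptide" conventions come from this notebook.

```python
import pandas as pd

# Hypothetical rows: only the first one satisfies all three PUF criteria
toy = pd.DataFrame({
    "egg_seed_ortholog": ["None", "None", "6945.XP_1", "None"],
    "KO_pass": ["None", "*", "None", "None"],
    "deepsig_feature": ["Signal peptide", "Signal peptide", "Signal peptide", "None"],
})

no_eggnog = toy["egg_seed_ortholog"] == "None"                # no eggNOG annotation
no_ko_pass = ~toy["KO_pass"].str.contains("*", regex=False)   # no KO hit passing its threshold
secreted = toy["deepsig_feature"] == "Signal peptide"         # predicted signal peptide

pufs = toy.loc[no_eggnog & no_ko_pass & secreted]
print(pufs)
```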


In [56]:
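# Remove the proteins that are putative KO protease inhibitors from the PUFs.
# This is because I used a stringent cutoff to define PUFs but a lenient cutoff to define PIs.
# The outer merge with indicator=True acts as an anti-join: rows tagged "left_only" are PUFs absent from all_PI_df.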
 
tick_pufs_secreted_no_PI = (pd.merge(tick_pufs_secreted,all_PI_df, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))

In [57]:
len(tick_pufs_secreted_no_PI)

12137
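
The merge used above is an anti-join: it keeps rows of the left table that have no match in the right table. A minimal sketch on toy data follows; the `protein_id` column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing the same columns, as the PUF and PI tables do
left = pd.DataFrame({"protein_id": ["p1", "p2", "p3"], "Length": [100, 200, 300]})
right = pd.DataFrame({"protein_id": ["p2", "p4"], "Length": [200, 400]})

# With no "on" argument, merge joins on all shared columns;
# indicator=True adds a "_merge" column recording where each row came from
merged = pd.merge(left, right, how="outer", indicator=True)

# Keep rows found only in the left table, then drop the helper column
left_only = merged.query('_merge == "left_only"').drop(columns="_merge")
print(left_only)
```

Here `p2` is dropped because it appears in both tables, and `p4` is dropped because it appears only in the right table.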

### Moving forward with tick protease inhibitors and tick secreted proteins of unknown function 
I will only move forward with proteins <1200 aa, which are easier to fold.

In [58]:
tick_PI_df.loc[tick_PI_df["Length"] <1200].to_csv("../datasheets/tick_PIs_1200.csv", index = False) 

In [59]:
tick_pufs_secreted_no_PI.loc[tick_pufs_secreted_no_PI["Length"] <1200].to_csv("../datasheets/tick_PUFs_1200.csv", index = False) 