### This notebook takes target proteins defined in notebook 01 and the tsv output from the ESMfold run and creates a metadata file compatible with ProteinCartography

ESMfold input and output are in s3://arcadia-pi-disco/EMSfold_folding/

#### Inputs:
Target PIs: ../datasheets/tick_PIs_1200.csv \
Target PUFs: ../datasheets/tick_PUFs_1200.csv \
Datasheet from first folding job (copied over from EMSfold_folding): ../datasheets/0912_pi_discovery_succesful_download.tsv \
Datasheet from second folding job (copied over from EMSfold_folding): ../datasheets/1003_pi_discovery_new_succesful_download.tsv

#### Outputs:
Metadata file for all target tick proteins: ../datasheets/tick_PUFs_PIs_1200_metadata.tsv

### Then, it prepares a metadata file for the structural toxin database (toxinDB) so it is ready for ProteinCartography
This toxin structural database contains 7,008 proteins from the [UniProt animal toxin annotation project](https://www.uniprot.org/help/Toxins). This notebook checks the tsv metadata file that came with this download to ensure it has the right IDs and fields for Cartography, and that it matches the structural database that was downloaded. It then creates a merged tick + toxinDB metadata file for the merged tick + toxinDB Cartography run.

#### Inputs:
ToxinDB structures: s3://arcadia-amblyomma-proteomics/VenomAnnotatedProteins/tsa_toxin_protein_predictions	\
ToxinDB annotations: s3://arcadia-amblyomma-proteomics/VenomAnnotatedProteins/toxin_protein_annotations.tsv 


#### Outputs:
Combined tick and toxinDB metadata file: ../datasheets/tick_PUFs_PIs_1200_plus_toxinDB_metadata.tsv

In [22]:
import pandas as pd
from collections import defaultdict 
import glob
pd.set_option('display.max_columns', None)

In [23]:
#labelling the proteins of unknown function as PUFs
PUFs = pd.read_csv("../datasheets/tick_PUFs_1200.csv").fillna("None")
PUFs["PI_or_PUF"] = "PUF"

In [24]:
#labelling the proteins with putative protease inhibitor activity as PIs 
PIs = pd.read_csv("../datasheets/tick_PIs_1200.csv").fillna("None")
PIs["PI_or_PUF"] = "PI"

In [25]:
#combining into one dataframe of target proteins 
target_proteins = pd.concat([PUFs, PIs])
len(target_proteins)

15494

#### Need to create a taxonomy list using NCBI taxonomy. I copy-pasted each tick's NCBI taxonomy here

In [26]:
tick_to_tax = defaultdict(list) 

common_tax = ['cellular organisms', 'Eukaryota', 'Opisthokonta', 'Metazoa', 'Eumetazoa', 'Bilateria', 'Protostomia', 'Ecdysozoa', 'Panarthropoda', 
              'Arthropoda', 'Chelicerata', 'Arachnida', 'Acari', 'Parasitiformes','Ixodida', 'Ixodoidea']

In [27]:
tick_to_tax["Dermacentor-andersoni"].append(common_tax  + ['Ixodidae', 'Rhipicephalinae', 'Dermacentor'])
tick_to_tax["Dermacentor-silvarum"].append(common_tax + ['Ixodidae', 'Rhipicephalinae', 'Dermacentor'])
tick_to_tax["Rhipicephalus-sanguineus"].append(common_tax + ['Ixodidae', 'Rhipicephalinae', 'Rhipicephalus', 'Rhipicephalus', 'Rhipicephalus sanguineus group'])
tick_to_tax["Rhipicephalus-microplus"].append(common_tax + ['Ixodidae', 'Rhipicephalinae', 'Rhipicephalus', 'Boophilus'])
tick_to_tax["Hyalomma-asiaticum"].append(common_tax + ['Ixodidae', 'Hyalomminae', 'Hyalomma'])
tick_to_tax["Ixodes-scapularis"].append(common_tax + [ 'Ixodidae', 'Ixodinae', 'Ixodes'])
tick_to_tax["Ornithodoros-erraticus"].append(common_tax + ['Argasidae', 'Ornithodorinae', 'Ornithodoros'])
tick_to_tax["Ixodes-ricinus"].append(common_tax + ['Ixodidae', 'Ixodinae', 'Ixodes'])
tick_to_tax["Ornithodoros-moubata"].append(common_tax + ['Argasidae', 'Ornithodorinae', 'Ornithodoros'])
tick_to_tax["Ornithodoros-turicata"].append(common_tax + ['Argasidae', 'Ornithodorinae', 'Ornithodoros'])
tick_to_tax["Dermacentor-variabilis"].append(common_tax + [ 'Ixodidae', 'Rhipicephalinae', 'Dermacentor'])
tick_to_tax["Ixodes-persulcatus"].append(common_tax + ['Ixodidae', 'Ixodinae', 'Ixodes'])
tick_to_tax["Haemaphysalis-longicornis"].append(common_tax + [ 'Ixodidae', 'Haemaphysalinae', 'Haemaphysalis'])
tick_to_tax["Amblyomma-sculptum"].append(common_tax + ['Ixodidae', 'Amblyomminae', 'Amblyomma'])
tick_to_tax["Amblyomma-americanum"].append(common_tax + [ 'Ixodidae', 'Amblyomminae', 'Amblyomma'])

In [28]:
#use tick_to_tax dictionary to assign lineage based on species name 
target_proteins["Lineage"] = target_proteins["species_name"].apply(lambda x: tick_to_tax[x])
target_proteins["Lineage"] = target_proteins["Lineage"].apply(lambda x : x[0])
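
A `defaultdict(list)` returns an empty list for a species name it has not seen, so the `x[0]` step above would fail with an `IndexError` that does not name the species. A minimal sketch of an alternative lookup, assuming a plain dict is acceptable here: the shortened lineages and the `lineage_for` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical, shortened lineages for illustration only (not the full NCBI lineages).
COMMON_TAX = ["Arthropoda", "Arachnida", "Ixodida"]

LINEAGES = {
    "Ixodes-scapularis": COMMON_TAX + ["Ixodidae", "Ixodinae", "Ixodes"],
    "Ornithodoros-moubata": COMMON_TAX + ["Argasidae", "Ornithodorinae", "Ornithodoros"],
}


def lineage_for(species_name):
    """Return the lineage list for a species, or raise an error that names the species."""
    try:
        return LINEAGES[species_name]
    except KeyError:
        raise KeyError(f"no lineage recorded for species {species_name!r}") from None


print(lineage_for("Ixodes-scapularis")[-1])
```

With a lookup like this, a misspelled or new species name stops the notebook at the lineage step with a message that says which name is missing.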

In [29]:
#tweaking formatting of species name to make it consistent with Organism field 
target_proteins["Organism"] = target_proteins["species_name"].apply(lambda x: x.replace("-", " "))

In [30]:
#pulling out the highest-scoring KO to make the KO data more legible
target_proteins["KO_high_KO"] = target_proteins["KO"].apply(lambda x: x.split(";")[0])
target_proteins["KO_high_definition"] = target_proteins["KO_definition"].apply(lambda x: x.split(";")[0])
target_proteins["KO_high_E-value"] = target_proteins["KO_E-value"].apply(lambda x: x.split(";")[0])

### I want a single field to show the annotation instead of it being split across eggnog and KO.
Because the eggnog annotations are more stringent than the KO ones, I made a rule that the eggnog annotation should be treated as the more reliable annotation, and the KO one should only be used if there is no eggnog annotation and the KO definition is above threshold (i.e. the protein is not a PUF).
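
The rule can be sketched for a single protein as below. This is an illustrative, per-record version: the function name `pick_top_annotation` and the example values are hypothetical, and it assumes missing annotations are stored as the string "None" (as produced by `fillna("None")` earlier in this notebook).

```python
def pick_top_annotation(egg_description, ko_high_definition, pi_or_puf):
    """Apply the precedence rule: PUF label first, then eggnog, then KO."""
    if pi_or_puf == "PUF":
        return "PUF"
    if egg_description != "None":
        return egg_description
    return ko_high_definition


# Made-up example records: (egg_Description, KO_high_definition, PI_or_PUF)
examples = [
    ("None", "None", "PUF"),
    ("Serpin family protein", "serpin B", "PI"),
    ("None", "cystatin", "PI"),
]

for record in examples:
    print(pick_top_annotation(*record))
```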

In [31]:
#this function takes 3 annotation fields for a single gene and picks the most appropriate one for the top annotation

def get_best_def(egg_Description_list, KO_high_definition_list, PI_or_PUF_list):
    i = 0 
    answer_list = []
    
    for anno in PI_or_PUF_list:
        #if the protein is a PUF, its top annotation should be PUF
        if anno == "PUF":
            answer = "PUF"
        
        #if the protein has an eggnog annotation, use that one bc it is more stringent than kofamscan
        elif egg_Description_list[i] != "None":
            answer = egg_Description_list[i]
        
        #only use kofamscan annotation if no eggnog annotation, and if not a PUF
        else:
            answer = KO_high_definition_list[i]
        i = i +1 
        answer_list.append(answer)
    
    return(answer_list)

In [32]:
target_proteins["top_annotation"] = get_best_def(target_proteins["egg_Description"].to_list(), target_proteins["KO_high_definition"].to_list(), target_proteins["PI_or_PUF"].to_list())


In [33]:
#Gene Names (primary) is a field that the ProteinCartography dashboard uses; using egg_Preferred_name for it
target_proteins["Gene Names (primary)"] = target_proteins["egg_Preferred_name"]

In [34]:
#Protein names is a field that the ProteinCartography dashboard uses; using 'top_annotation' for it
target_proteins['Protein names'] = target_proteins['top_annotation']

### Bringing in the output tsv from the esmfold run. This output tsv records a few things to care about:
1. The pdb file name and the protein fasta header, which is called gene_name in the target_proteins dataframe. If the fasta header is too long or has special characters, the esmfold script may need to create an output pdb file name that differs from the fasta header.

2. The X-to-A replacement count: how many X's in the input fasta sequence were replaced with A's.

3. The sanitized protein sequence, which is the sequence that was actually folded. This includes the replacement of X with A.
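
To make the X-to-A bookkeeping concrete, here is a minimal sketch of that kind of sanitization. The esmfold script itself is not shown in this notebook, so the function name `sanitize_sequence` and its exact behavior (upper-case X only, no other characters handled) are assumptions for illustration.

```python
def sanitize_sequence(sequence):
    """Replace every X with A and return (sanitized_sequence, replacement_count)."""
    replacement_count = sequence.count("X")
    sanitized = sequence.replace("X", "A")
    return sanitized, replacement_count


# Made-up sequence with two unknown residues
sanitized, n_replaced = sanitize_sequence("MKXLLXA")
print(sanitized, n_replaced)
```

The count is worth carrying into the metadata (as `xtoa_replacement_count`) because a structure folded from a sequence with many substituted residues may deserve less trust.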

In [35]:
#had two batches of folding jobs, need the output tsv for both of them
esmfold_out_1 = pd.read_csv("../datasheets/0912_pi_discovery_succesful_download.tsv", sep ="\t")
esmfold_out_2 = pd.read_csv("../datasheets/1003_pi_discovery_new_succesful_download.tsv", sep ="\t")


esmfold_out = pd.concat([esmfold_out_1, esmfold_out_2])


In [36]:
#fasta_header is the col that we need to merge with target_proteins, using gene_name as key
#renaming the esmfold gene_name col first so the two gene_name cols are not confused

esmfold_out = esmfold_out.rename(columns={'gene_name': 'esmfold_generated_gene_name'})
esmfold_out["gene_name"] = esmfold_out["fasta_header"]

In [37]:
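#using .pdb file name to generate mandatory protid. This field is used to link the pdb file to metadata
#str.strip(".pdb") would strip any of the characters ".", "p", "d", "b" from both ends, so remove only the suffix
esmfold_out["protid"] = esmfold_out["pdb_name"].str.replace(r"\.pdb$", "", regex=True)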
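
A note on deriving an ID from a file name: `str.strip(".pdb")` treats its argument as a set of characters and removes any of them from both ends of the string, so it can eat into names that start or end with `p`, `d`, `b`, or `.`. The sketch below contrasts it with removing only the suffix; the helper `strip_pdb_suffix` and the file name are made up for illustration.

```python
def strip_pdb_suffix(file_name):
    """Remove a trailing '.pdb' and nothing else."""
    suffix = ".pdb"
    if file_name.endswith(suffix):
        return file_name[: -len(suffix)]
    return file_name


name = "protein_db.pdb"  # made-up file name
print(name.strip(".pdb"))      # character-set stripping from both ends
print(strip_pdb_suffix(name))  # suffix-only removal
```

On Python 3.9 and later, `str.removesuffix(".pdb")` does the same thing as the helper.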

In [38]:
#storing alanine substituted sanitized sequence as the Sequence field to be moved over to metadata file
esmfold_out["Sequence"] = esmfold_out["sanitized_sequence"]


In [39]:
#merge target proteins with esmfold output 
target_proteins = target_proteins.merge(esmfold_out, on = "gene_name", how = "left")

In [40]:
#check for any cases where the esmfold info wasn't merged in  
target_proteins.loc[target_proteins["protid"].isna()]

Unnamed: 0,gene_name,egg_seed_ortholog,egg_evalue,egg_score,eggNOG_OGs,egg_max_annot_lvl,egg_COG_category,egg_Description,egg_Preferred_name,egg_GOs,egg_EC,egg_KEGG_ko,egg_KEGG_Pathway,egg_KEGG_Module,egg_KEGG_Reaction,egg_KEGG_rclass,egg_BRITE,egg_KEGG_TC,egg_CAZy,egg_BiGG_Reaction,egg_PFAMs,KO_pass,KO,KO_thrshld,KO_score,KO_E-value,KO_definition,deepsig_feature,deepsig_start,deepsig_end,deepsig_sp_score,deepsig_sp_evidence,Length,species_name,is_tick,KO_high_score,PI_or_PUF,Lineage,Organism,KO_high_KO,KO_high_definition,KO_high_E-value,top_annotation,Gene Names (primary),Protein names,esmfold_generated_gene_name,fasta_header,pdb_name,is_original_sequence_valid,xtoa_replacement_count,sequence,sanitized_sequence,protid,Sequence
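
The empty table above means every target protein picked up a `protid`. A left merge can also go wrong in the other direction: if a key appears more than once on the right side (for example, a protein present in both folding batches), the merged frame gains extra rows, which the `isna()` check does not reveal. A minimal sketch of checking both cases on plain lists of keys; `check_merge_keys` and the example keys are hypothetical.

```python
from collections import Counter


def check_merge_keys(left_keys, right_keys):
    """Return (keys missing from the right side, keys duplicated on the right side)."""
    right_counts = Counter(right_keys)
    missing = sorted(set(left_keys) - set(right_counts))
    duplicated = sorted(key for key, n in right_counts.items() if n > 1)
    return missing, duplicated


# Made-up keys: g3 was never folded, g2 appears in both batches
missing, duplicated = check_merge_keys(["g1", "g2", "g3"], ["g1", "g2", "g2"])
print(missing, duplicated)
```

In this notebook the two lists would be `target_proteins["gene_name"]` and `esmfold_out["gene_name"]` taken before the merge.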


In [41]:
#storing original gene name from the fasta file/original annotation files in case .pdb file name is different 
target_proteins["gene_name_in_fasta"] = target_proteins["gene_name"]

In [42]:
#reducing down to just the columns i care about 
proteins_metadata = target_proteins[['protid','Organism','Lineage', 'Length','Sequence', 'Gene Names (primary)', 'Protein names','PI_or_PUF', 'deepsig_feature', 'egg_Description','egg_evalue', 
                                     'egg_score', 'egg_PFAMs','KO_high_KO','KO_high_definition','KO_high_E-value','KO_high_score', 'KO_score', 'KO_E-value', 'KO_definition', 'gene_name_in_fasta', 'xtoa_replacement_count']]


In [43]:
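#converting "None" to empty cell (anchored so that only cells that are exactly "None" are blanked, not text that merely contains it)
proteins_metadata = proteins_metadata.replace({'^None$': ''}, regex=True)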

In [44]:
#writing out the metadata file for use in Protein Cartography 
proteins_metadata.to_csv("../datasheets/tick_PUFs_PIs_1200_metadata.tsv", index = False, sep = "\t" )

### Now checking the toxinDB structures and metadata file 

### Read in pdb structure file names to get ids 

In [45]:
#reading in all the file names 
pdb_dict = defaultdict(list)
for file in glob.glob("../../toxinDB/structures/*.pdb"):
    pdb_id = file.split("/")[-1].split(".pdb")[0]

    #file names are formatted like AF-C6JUN9-F1-model_v4; C6JUN9 is the relevant part here
    elems = pdb_id.split("-")

    #report and skip any file that does not follow the AF-<accession>-... pattern,
    #otherwise prot_id would be undefined or carried over from the previous file
    if elems[0] != "AF":
        print(pdb_id)
        continue

    prot_id = elems[1]
    pdb_dict["protid"].append(pdb_id)
    pdb_dict["Entry"].append(prot_id)


pdb_df = pd.DataFrame(pdb_dict)
pdb_df

Unnamed: 0,protid,Entry
0,AF-P0DPT9-F1-model_v4,P0DPT9
1,AF-P0DSI3-F1-model_v4,P0DSI3
2,AF-Q75WH2-F1-model_v4,Q75WH2
3,AF-B1PZN6-F1-model_v4,B1PZN6
4,AF-O77256-F1-model_v4,O77256
...,...,...
7003,AF-B6DCM8-F1-model_v4,B6DCM8
7004,AF-P0DN51-F1-model_v4,P0DN51
7005,AF-X5I9Z2-F1-model_v4,X5I9Z2
7006,AF-P0CAS9-F1-model_v4,P0CAS9
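
The file-name parsing can be sketched as a small standalone function, which makes the handling of unexpected names explicit. The function name `accession_from_alphafold_name` is hypothetical, and the sketch assumes all structure files follow the `AF-<accession>-F1-model_v4.pdb` pattern seen in the table above.

```python
def accession_from_alphafold_name(file_name):
    """Return the UniProt accession from a name like AF-C6JUN9-F1-model_v4.pdb, or None."""
    stem = file_name.split("/")[-1]
    if stem.endswith(".pdb"):
        stem = stem[: -len(".pdb")]
    parts = stem.split("-")
    if len(parts) < 2 or parts[0] != "AF":
        return None  # does not look like an AlphaFold DB file name
    return parts[1]


print(accession_from_alphafold_name("../../toxinDB/structures/AF-C6JUN9-F1-model_v4.pdb"))
print(accession_from_alphafold_name("notes.pdb"))
```

Returning `None` for unrecognized names lets the caller decide whether to skip the file or stop.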


In [46]:
pdb_df.to_csv("../datasheets/toxinDB_pdb_list.csv", index = False)

### Read in metadata file that was provided and merge in pdb file ids

In [47]:
toxin_metadata = pd.read_csv("../../toxinDB/toxin_protein_annotations.tsv", sep = "\t")
toxin_metadata = toxin_metadata.merge(pdb_df, on = "Entry", how = "left")
toxin_metadata

Unnamed: 0,Entry,Entry Name,Protein names,Gene Names,Organism,Length,AlphaFoldDB,Sequence,Annotation,Interacts with,Pfam,protid
0,C0L2T8,VM2C1_CRODO,Zinc metalloproteinase/disintegrin (Metallopro...,MPII,Crotalus durissus collilineatus (Brazilian rat...,478,C0L2T8;,MIQVLLVTICLAAFPYQGSSIILESGNVNDYEVIYPRKVTALPKGA...,5.0,,PF00200;PF01562;PF01421;,AF-C0L2T8-F1-model_v4
1,Q2PE51,BNP_CRODO,Bradykinin-potentiating and C-type natriuretic...,,Crotalus durissus collilineatus (Brazilian rat...,181,Q2PE51;,MFVSRLAASGLLLLALLAVSLDGKPLQQWSQRWPHLEIPPLVVQNW...,5.0,,PF00212;,AF-Q2PE51-F1-model_v4
2,Q90Y12,BNP_CRODU,Bradykinin potentiating and C-type natriuretic...,,Crotalus durissus terrificus (South American r...,181,Q90Y12;,MFVSRLAASGLLLLALLAVSLDGKPLQQWSQRWPHLEIPPLVVQNW...,5.0,,PF00212;,AF-Q90Y12-F1-model_v4
3,Q9Y0X6,YA_MESMA,BmK-YA precursor [Cleaved into: BmK-YA 1 (Enke...,,Mesobuthus martensii (Manchurian scorpion) (Bu...,200,Q9Y0X6;,MIFHQFYSILILCLIFPNQVVQSDKERQDWIPSDYGGYMNPAGRSD...,5.0,,,AF-Q9Y0X6-F1-model_v4
4,R4NNL0,VMH3_VIPAA,Zinc metalloproteinase-disintegrin-like protei...,,Vipera ammodytes ammodytes (Western sand viper),616,R4NNL0;,MIQVLLVIICLAVFPYQGSSIILESGNVNDYEVVYLQKVTAMNKGA...,5.0,,PF08516;PF00200;PF01562;PF01421;,AF-R4NNL0-F1-model_v4
...,...,...,...,...,...,...,...,...,...,...,...,...
7731,P86342,VP6_BROAA,Venom peptide 6 (BaP-6),,Brotheas amazonicus (Scorpion),12,,GFIGDIWSGIQG,1.0,,,
7732,P86343,VP5_BROAA,Venom peptide 5 (BaP-5),,Brotheas amazonicus (Scorpion),11,,FIGDIWSGIQG,1.0,,,
7733,Q7M463,SCK6_MESMA,Neurotoxin BmK A3-6,,Mesobuthus martensii (Manchurian scorpion) (Bu...,29,Q7M463;,LPYPVNCKTECECVMCGLGIICKQCYYQQ,1.0,,,AF-Q7M463-F1-model_v4
7734,A0A6B9KZ02,F16P1_PLARH,Venom protein family 16 protein 1 (f16p1),,Platymeris rhadamanthus (Red spot assassin bug),127,A0A6B9KZ02;,MWIWYSLLFFGVCHLAHSTSTVDDALTCHGEKLAEEVLNELKEHFG...,1.0,,,AF-A0A6B9KZ02-F1-model_v4


### I don't have structures for every protein in the metadata file. Dropping rows that I don't have structures for

In [48]:
toxin_metadata = toxin_metadata.loc[toxin_metadata["protid"].notna()]

In [49]:
toxin_metadata.to_csv("../datasheets/toxinDB_metadata.tsv", index = False, sep = "\t" )

### Making a merged metadata file for tick PIs and PUFs + toxinDB to use in cartography run

In [50]:
tick_metadata = pd.read_csv("../datasheets/tick_PUFs_PIs_1200_metadata.tsv", sep = "\t")
merged = pd.concat([toxin_metadata, tick_metadata])


In [51]:
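#index = False keeps the concatenated (non-unique) row index out of the file, consistent with the other metadata files
merged.to_csv("../datasheets/tick_PUFs_PIs_1200_plus_toxinDB_metadata.tsv", index = False, sep = "\t")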

In [53]:
len(merged)

22502