<h1> <div style="text-align:center"> This Jupyter notebook contains all the steps used to generate the data and plot the graphs included in the manuscript </div> </h1>


\****Please note that some steps must be performed, and some data generated, outside this Jupyter notebook.***

The folder containing this Jupyter notebook should also contain:
1. A scripts folder containing all the scripts.
2. elm_instances.tsv, downloaded from the ELM database (http://elm.eu.org/index.html).
3. elm_instances.fasta, downloaded from the ELM database.
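
Before running the cells below, it can help to confirm that these inputs are in place. The following is a minimal sketch, not part of the original workflow; the helper name `missing_inputs` is hypothetical, and it only checks that each entry exists, not that its content is valid.

```python
from pathlib import Path

# Inputs listed above that the notebook expects next to it.
REQUIRED = ["scripts", "elm_instances.tsv", "elm_instances.fasta"]

def missing_inputs(folder="."):
    """Return the required entries that are absent from `folder` (hypothetical helper)."""
    base = Path(folder)
    return [name for name in REQUIRED if not (base / name).exists()]

missing = missing_inputs()
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All required inputs found.")
```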

## 1. The first step is to extract the taxonomy of the species

The taxonomic IDs of the organisms were extracted using the NCBI taxonomy identifier tool:
https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi


In [1]:
import pandas as pd 

# Load the elm_instances.tsv file
ELM_instances = pd.read_csv('elm_instances.tsv', delimiter='\t', quotechar = '"', skiprows= 5, engine='python')
ELM_instances.head()


Unnamed: 0,Accession,ELMType,ELMIdentifier,ProteinName,Primary_Acc,Accessions,Start,End,References,Methods,InstanceLogic,PDB,Organism
0,ELMI003774,CLV,CLV_C14_Caspase3-7,A0A0H3NIK3_SALTS,A0A0H3NIK3,A0A0H3NIK3,483,487,20947770,enzymatic reaction; mutation analysis; proteas...,true positive,,Salmonella enterica subsp. enterica serovar Ty...
1,ELMI002256,CLV,CLV_C14_Caspase3-7,ATN1_HUMAN,P54259,P54259 Q99495 Q99621 Q9UEK7,103,107,10085113 9535906,cleavage reaction; mutation analysis; western ...,true positive,,Homo sapiens
2,ELMI001933,CLV,CLV_C14_Caspase3-7,ATN1_HUMAN,P54259,P54259 Q99495 Q99621 Q9UEK7,106,110,10085113 9535906,cleavage reaction; mutation analysis; western ...,true positive,,Homo sapiens
3,ELMI001914,CLV,CLV_C14_Caspase3-7,BCAR1_RAT,Q63767,Q63767 Q63766,413,417,10712510,classical fluorescence spectroscopy; cleavage ...,true positive,,Rattus norvegicus
4,ELMI001915,CLV,CLV_C14_Caspase3-7,BCAR1_RAT,Q63767,Q63767 Q63766,745,749,10712510,classical fluorescence spectroscopy; cleavage ...,true positive,,Rattus norvegicus
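
The loading cell skips a fixed number of lines (`skiprows= 5`). If the number of preamble lines in the download ever changes, it could be counted instead of hard-coded. The sketch below assumes that every preamble line starts with `#`, which this notebook does not state; the helper name `count_preamble_lines` and the miniature sample are hypothetical.

```python
import io
import pandas as pd

def count_preamble_lines(handle, marker="#"):
    """Count the consecutive lines at the top of `handle` that start with `marker`."""
    count = 0
    for line in handle:
        if not line.startswith(marker):
            break
        count += 1
    return count

# Hypothetical miniature file: two preamble lines, a header row, one data row.
sample = "#preamble line 1\n#preamble line 2\nAccession\tOrganism\nELMI000001\tHomo sapiens\n"

n_skip = count_preamble_lines(io.StringIO(sample))
demo = pd.read_csv(io.StringIO(sample), sep="\t", skiprows=n_skip)
print(n_skip, list(demo.columns))
```

For the real file, the same function could be applied to `open('elm_instances.tsv')` and its result passed to `skiprows`.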


In [5]:
#extract the name of the species from the organism column and drop duplicates
import os

species_names=ELM_instances['Organism'].drop_duplicates() 
#print(species_names.values.tolist()) # will print all the species names as a list
#print(len(species_names)) #check the number of species

#create the temporary files folder if it does not exist yet
os.makedirs('temp_files', exist_ok=True)

#add elm_species text file to the temporary files folder
species_names.to_csv('temp_files/elm_species.txt', encoding='utf-8', index=False, header= False) 


## 2. Upload the generated text file to https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi

### *Select "full taxid lineage", then click "save in file". Move the downloaded tax_report.txt file to the temp_files folder.*
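
Moving the downloaded report can also be done from Python. This is a minimal sketch under stated assumptions: the helper name `move_report` is hypothetical, and the download location in the usage comment is only a guess that depends on the browser and operating system.

```python
import shutil
from pathlib import Path

def move_report(downloaded, dest_dir="temp_files", name="tax_report.txt"):
    """Move the downloaded report into `dest_dir` and return its new path (hypothetical helper)."""
    src = Path(downloaded)
    if not src.is_file():
        raise FileNotFoundError(f"Report not found: {src}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create temp_files if needed
    target = dest / name
    shutil.move(str(src), str(target))
    return target

# Example call; the Downloads path is an assumption:
# move_report(Path.home() / "Downloads" / "tax_report.txt")
```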

## 3. Extract only the name of the species and its full taxonomy and save it to a csv file


In [10]:
import pandas as pd 

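#split each line of the report on pipe or tab characters
taxa_report= pd.read_csv('temp_files/tax_report.txt', sep='[|\t]', quotechar = '"', engine='python') 
#keep only the species name and its full lineage
taxa_report=taxa_report.filter(['name', 'lineage'], axis=1)
#print(taxa_report)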
taxa_report.to_csv("temp_files/org_full_taxa.csv",encoding='utf-8', index=False, header= True ) 
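
Some species names may not be resolved by the taxonomy tool, which would leave them without a lineage. The sketch below is one possible check and not part of the original workflow; the helper name `species_without_lineage` and the demo values, including the placeholder lineage strings, are hypothetical.

```python
import pandas as pd

def species_without_lineage(species, taxa):
    """Return the names in `species` that have no lineage in `taxa` (hypothetical helper)."""
    resolved = set(taxa.dropna(subset=["lineage"])["name"].str.strip())
    return [name for name in species if name.strip() not in resolved]

# Demo with placeholder lineage strings, not real taxonomy data.
demo_species = ["Homo sapiens", "Rattus norvegicus", "Unresolved name"]
demo_taxa = pd.DataFrame({
    "name": ["Homo sapiens", "Rattus norvegicus", "Unresolved name"],
    "lineage": ["placeholder lineage A", "placeholder lineage B", None],
})
unresolved = species_without_lineage(demo_species, demo_taxa)
print(unresolved)
```

With the real data, `species` would be the list of unique organism names written to temp_files/elm_species.txt and `taxa` the `name` and `lineage` columns saved above.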

## 4. Append the taxonomic classification (Eukaryotes, Bacteria, Viruses, etc.) to the elm_instances.tsv downloaded from the ELM database.

This step uses the append_taxa_to_elm.py script in the scripts folder. The script takes two input files: elm_instances.tsv from the main folder and org_full_taxa.csv from the temp_files folder.
The output file is elm_instances_with_taxa.csv.

In [11]:
%%bash

python3 scripts/append_taxa_to_elm.py elm_instances.tsv temp_files/org_full_taxa.csv 
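
The contents of append_taxa_to_elm.py are not shown in this notebook. The sketch below only illustrates one way such a step could work and is not the script itself. It assumes that the `lineage` column holds space-separated NCBI taxonomy IDs and that the group is decided by the first top-level ID found (2759 Eukaryota, 2 Bacteria, 10239 Viruses, 2157 Archaea). The function names and the abbreviated demo lineages are hypothetical; the `Taxonomic_group` column name is the one used in step 5.

```python
import pandas as pd

# NCBI taxonomy IDs of the top-level groups mapped to group labels (labels are an assumption).
GROUP_BY_TAXID = {"2759": "Eukaryote", "2": "Bacteria", "10239": "Viruses", "2157": "Archaea"}

def taxonomic_group(lineage):
    """Return the group label for a space-separated lineage string, or None."""
    if not isinstance(lineage, str):
        return None
    for taxid in lineage.split():
        if taxid in GROUP_BY_TAXID:
            return GROUP_BY_TAXID[taxid]
    return None

def append_taxa(instances, taxa):
    """Left-join a Taxonomic_group column onto `instances` by organism name."""
    groups = taxa.drop_duplicates("name").copy()
    groups["Taxonomic_group"] = groups["lineage"].map(taxonomic_group)
    merged = instances.merge(groups[["name", "Taxonomic_group"]],
                             how="left", left_on="Organism", right_on="name")
    return merged.drop(columns="name")

# Demo with abbreviated, illustrative lineages (not complete NCBI lineages).
demo_instances = pd.DataFrame({"Accession": ["A1", "A2", "A3"],
                               "Organism": ["Homo sapiens", "Some virus", "Unlisted organism"]})
demo_taxa = pd.DataFrame({"name": ["Homo sapiens", "Some virus"],
                          "lineage": ["9606 2759", "12345 10239"]})
demo_result = append_taxa(demo_instances, demo_taxa)
print(demo_result)
```

A left join keeps instances whose organism has no match, leaving their group empty, so they stay visible for manual inspection.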

## 5. Filter the true positive instances and generate 3 csv files (one for each taxonomic group: Eukaryotes, Bacteria, and Viruses)

In [12]:
import pandas as pd
all_elms=pd.read_csv('elm_instances_with_taxa.csv')

#Keep the true positive instances in each taxonomic group and split the data into three files
true_positive=all_elms['InstanceLogic'] =='true positive'
#na=False treats instances without a taxonomic group as non-matching
eukaryotes=all_elms[all_elms['Taxonomic_group'].str.contains('Eukaryote', na=False) & true_positive]
bacteria=all_elms[all_elms['Taxonomic_group'].str.contains('Bacteria', na=False) & true_positive]
viruses=all_elms[all_elms['Taxonomic_group'].str.contains('Viruses', na=False) & true_positive]

eukaryotes.head()
bacteria.head()
viruses.head()

eukaryotes.to_csv("eukaryotes/eukaryotes_TP.csv",encoding='utf-8', index=False, header= True )
bacteria.to_csv("bacteria/bacteria_TP.csv",encoding='utf-8', index=False, header= True )
viruses.to_csv("viruses/viruses_TP.csv",encoding='utf-8', index=False, header= True )
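
A quick sanity check on the split is to count the true positive instances per group and compare the counts with the lengths of the three data frames. This is a minimal sketch; the helper name `true_positive_counts` and the demo rows are hypothetical, while the column names and group labels are the ones used in the cell above.

```python
import pandas as pd

def true_positive_counts(df, groups=("Eukaryote", "Bacteria", "Viruses")):
    """Count true positive instances whose Taxonomic_group contains each group label."""
    tp = df[df["InstanceLogic"] == "true positive"]
    return {g: int(tp["Taxonomic_group"].str.contains(g, na=False).sum()) for g in groups}

# Hypothetical demo rows, not real ELM data.
demo = pd.DataFrame({
    "InstanceLogic": ["true positive", "true positive", "false positive", "true positive"],
    "Taxonomic_group": ["Eukaryote", "Viruses", "Eukaryote", None],
})
counts = true_positive_counts(demo)
print(counts)
```

With the real data, `true_positive_counts(all_elms)` should match `len(eukaryotes)`, `len(bacteria)`, and `len(viruses)`.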
