# Searching for phagotrophic markers in the proteomes of ~100 cultivated Dinoflagellates

This code covers the "functional annotation" part of the internship. It runs on .tsv files, each of which lists the proteins identified in one cultivated dinoflagellate, and searches them for markers of phagotrophy.

Markers searched for: phosphatidylinositol kinase, the WASH complex (WASP and SCAR homologue), the WASP complex, SNAREs (vesicular transport), cathepsins (digestion), and small G proteins (Rab, Rac, Rap, Rho, and Cdc42).

For the following step, we need to extract the proteins' 'official identifiers' from these files. These identifiers will then be used to extract the associated nucleotide sequences from FASTA files.

In [7]:
# Modules

import os
import numpy as np
import pandas as pd
import re # regular expression (to find aliases of proteins' names)

# Directory of the files stored
# Useful to loop on all the .tsv files on the cluster
directory='/home/alexandra/Documents/M2Alexandra/Data_test/Annot_func/'
output_dir='/home/alexandra/Documents/M2Alexandra/Data_test/Output_annot/'

We will use Python's re module to find 'aliases' of the markers, mainly variations in upper/lowercase and in separators such as spaces.

Documentation on Python's RE (regular expression) module can be found at: https://docs.python.org/3/library/re.html
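As a minimal sketch of how alias matching behaves, the example below compares `match()` and `search()` on a case-insensitive pattern. The description strings are invented for illustration and do not come from the annotation files.

```python
import re

# Hypothetical description strings, invented for illustration only
descriptions = [
    "SNARE domain",
    "t-SNARE coiled-coil homology domain",
    "snare-like protein",
]

snare = re.compile("SNARE", re.IGNORECASE)

# match() only succeeds if the pattern is found at the very start of the string
starts_with = [bool(snare.match(d)) for d in descriptions]

# search() succeeds if the pattern is found anywhere in the string
contains = [bool(snare.search(d)) for d in descriptions]

print(starts_with)  # [True, False, True]
print(contains)     # [True, True, True]
```

Which of the two is appropriate depends on whether a marker name is expected at the beginning of the annotation or anywhere inside it.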

### Define tags for the identified markers of phagotrophy, find their aliases

In [8]:
# Tags I am looking for in the proteins' names
# Set of proteins that are necessary (but not sufficient) for phagotrophy

# Small GTPases
rab=re.compile('RAB',re.IGNORECASE) # IGNORECASE: match regardless of upper/lowercase characters
rac=re.compile('RAC.GTP', re.IGNORECASE)
rap=re.compile('RAP.GTP',re.IGNORECASE)
cdc=re.compile('CDC42',re.IGNORECASE)
rho=re.compile('Rho.GTP',re.IGNORECASE)

#Phosphatidylinositol 3 phosphate 4 kinase (required for invagination of large particles)
pho=re.compile('phosphatidylinositol.3.phosphate',re.IGNORECASE) # dot = matches any single character (like ? in a shell glob)
#kin=re.compile('kinase',re.IGNORECASE)

# WASH complex : Nucleation-promoting factor, interacting with Arp2/3 in actin polymerization process
wash=re.compile('WASH',re.IGNORECASE)
wasp=re.compile('WASP',re.IGNORECASE)

# SNARE: vesicular transport
snare=re.compile('SNARE',re.IGNORECASE)

# Cathepsins : proteases involved in the resolution (digestion) step
cat=re.compile('Cathepsin',re.IGNORECASE)

# Is my organism a phototroph ? --> Expected to be true for all organisms
chl=re.compile('cytochrome',re.IGNORECASE)
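A quick way to see what such patterns accept is to try them on a few strings. The sketch below uses invented descriptions, and `rab_word` is a hypothetical stricter variant (with a word boundary) that is not part of the code above; it only illustrates how a short tag such as 'RAB' can also be found inside unrelated words.

```python
import re

rab = re.compile("RAB", re.IGNORECASE)
# Assumption: hypothetical stricter variant, requiring a word boundary before 'RAB'
rab_word = re.compile(r"\bRAB", re.IGNORECASE)
pho = re.compile("phosphatidylinositol.3.phosphate", re.IGNORECASE)

# Invented description strings, for illustration only
rab_hit = bool(rab.search("Ras-related protein Rab-11A"))
rab_false_positive = bool(rab.search("Arabinose isomerase"))
rab_word_hit = bool(rab_word.search("Ras-related protein Rab-11A"))
rab_word_rejects = not rab_word.search("Arabinose isomerase")

# Each '.' consumes exactly one character: a hyphen or a space works, nothing does not
pho_hyphen = bool(pho.search("Phosphatidylinositol-3-phosphate binding"))
pho_space = bool(pho.search("phosphatidylinositol 3 phosphate"))
pho_glued = bool(pho.search("Phosphatidylinositol3phosphate"))

print(rab_hit, rab_false_positive)    # True True
print(rab_word_hit, rab_word_rejects) # True True
print(pho_hyphen, pho_space, pho_glued)  # True True False
```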

### Read the files

This code is designed to provide an example, running on a single test file.

In [9]:
# Run on a directory containing the files
#rep=os.listdir(directory) # Makes a list of all files (filenames) present in this directory
# Run on a single test file
rep=['23_EP00420_Gonyaulax_spinifera.tsv'] # test file
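When running on the whole directory, it may help to keep only the annotation files. The sketch below is one possible way to do this; the helper names `list_tsv_files` and `species_name` are hypothetical, and skipping files ending in `_markers.tsv` is an assumption meant to avoid re-reading result files if they are stored alongside the inputs.

```python
import os
import tempfile

def list_tsv_files(directory):
    """Return the sorted names of the .tsv annotation files found in `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.endswith(".tsv")
        and not name.endswith("_markers.tsv")  # assumption: skip result files
        and os.path.isfile(os.path.join(directory, name))
    )

def species_name(filename):
    """Strip the extension, e.g. 'x.tsv' -> 'x'."""
    return os.path.splitext(filename)[0]

# Demonstration on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["23_EP00420_Gonyaulax_spinifera.tsv",
                 "23_EP00420_Gonyaulax_spinifera_markers.tsv",
                 "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = list_tsv_files(tmp)

print(found)                   # ['23_EP00420_Gonyaulax_spinifera.tsv']
print(species_name(found[0]))  # 23_EP00420_Gonyaulax_spinifera
```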

In [36]:
for filename in rep: # Loop on the files stored in the directory
    print(filename)
    species=filename[:-4]
    file=pd.read_table(directory+str(filename))
    #file.head(10)
    # We are interested in the first ('protein_accession') and the 5th ('signature_description') columns.

23_EP00420_Gonyaulax_spinifera.tsv


In [22]:
# Define names for the extracted columns

proteins=file['signature_description'] # in this column, I search for the markers' names

ids=file['protein_accession'] # then, if there's a match, I search for their 'official' IDs
go=file['go_annotation'] # if any (often NaN)
length=file['sequence_length'] # length of the sequence
start_stop=file[['start_location','stop_location']]

The general idea is to:

- read each line;
- check whether the line contains an alias of one of the markers defined above;
- if it does, store the protein's ID (1st column) in a list;
- then keep only the unique IDs in this list, so that the same ID is not stored several times;
- store the results in a .tsv file named after the species.
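The steps above can be sketched on a tiny table as follows. The table content is invented, only three markers are included, and the helper `first_marker` is a hypothetical name; the sketch assumes that a marker may appear anywhere in the description (hence `search()`), which may or may not be the desired behaviour.

```python
import io
import re
import pandas as pd

# Invented miniature annotation table, using the column names of the real files
tsv = io.StringIO(
    "protein_accession\tsignature_description\n"
    "P1\tRab GTPase\n"
    "P1\tRab GTPase\n"
    "P2\tt-SNARE coiled-coil homology domain\n"
    "P3\tCytochrome b5\n"
    "P4\tCathepsin propeptide inhibitor domain\n"
)
table = pd.read_table(tsv)

# Subset of the markers, keyed by a readable tag
markers = {
    "RAB": re.compile("RAB", re.IGNORECASE),
    "SNARE": re.compile("SNARE", re.IGNORECASE),
    "Cathepsin": re.compile("Cathepsin", re.IGNORECASE),
}

def first_marker(description):
    """Return the tag of the first marker found in the description, else None."""
    for tag, pattern in markers.items():
        if pattern.search(str(description)):
            return tag
    return None

table["marker"] = table["signature_description"].map(first_marker)

# Keep the lines with a marker, then one line per protein ID
hits = table.dropna(subset=["marker"]).drop_duplicates(subset="protein_accession")
unique_ids = hits["protein_accession"].tolist()

print(unique_ids)  # ['P1', 'P2', 'P4']
```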

### Search for the annotations linked to markers' names

In [24]:
mixo=[rab,rac,rap,cdc,rho,pho,wash,wasp,snare,cat] # list storing the phagotrophic markers' aliases (type re.Pattern); a list keeps the search order fixed
list_ids=[] # list storing the ids of the markers found in the species
list_markers=[] # storing the attributed marker
list_proteins=[] # storing the (exact) functional annotation


Photo=False # up to now, we don't know whether the organism is capable of phototrophy

for index in range(len(proteins)): # Loop on the lines of the file and read the corresponding line
    prot=str(proteins[index]) # read the functional annotation on this line (str() also handles NaN)
    for item in mixo:
        find=item.search(prot) # Is the marker found anywhere in the annotation? (match() would only test the start)
        if find:
            # Uncomment the lines below to check if the code works properly
            #print(item)
            #print(proteins[index])
            id_prot=ids[index]
            #print(id_prot)
            list_ids.append(id_prot)
            list_markers.append(item.pattern)
            list_proteins.append(proteins[index])
            break # only the first matching marker is recorded for this protein


    
    if not Photo: # in case you haven't determined yet whether it is a phototroph or not
        pht=chl.search(str(proteins[index])) # is there a 'cytochrome' annotation on this line ?
        if pht: # Yes there is
            Photo=True # ... so it is likely to be a phototroph

if Photo:
    print('I have cytochromes')


I have cytochromes


### Now storing those IDs

In [35]:
#Storing the IDs in a dataframe
ids_series=pd.Series(list_ids)
markers_series=pd.Series(list_markers)
proteins_series=pd.Series(list_proteins)
df=pd.DataFrame({'ID':ids_series, 'Markers':markers_series, "Proteins":proteins_series})

# Written to output_dir, so that the results are not mixed with the input files
df.to_csv(output_dir+filename[:-4]+'_markers'+'.tsv',sep='\t')
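For the following step (extracting sequences from FASTA files), a plain list of unique IDs may be more convenient than the full table. The sketch below shows one possible way to write such a list, one ID per line; the helper `write_unique_ids`, the example values and the output file name are all hypothetical.

```python
import os
import tempfile
import pandas as pd

# Invented results table with the same columns as the dataframe built above
df = pd.DataFrame({
    "ID": ["P1", "P1", "P2"],
    "Markers": ["RAB", "RAB", "SNARE"],
    "Proteins": ["Rab GTPase", "Rab GTPase", "SNARE domain"],
})

def write_unique_ids(results, path):
    """Write each protein ID once, one per line, and return the IDs written."""
    unique_ids = results["ID"].drop_duplicates()
    unique_ids.to_csv(path, index=False, header=False)
    return list(unique_ids)

# Demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_species_ids.txt")  # hypothetical file name
    written = write_unique_ids(df, path)
    with open(path) as handle:
        lines = handle.read().splitlines()

print(lines)  # ['P1', 'P2']
```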