# Extract the nucleotide sequences corresponding to phagotrophy markers in transcriptomes and genomes of cultivated Dinophyceae

#### Aim of this code:
Match each phagotrophy marker ID (per cultivated species) with its nucleotide sequence.

This code applies specifically to MMETSP METDB files, because the way protein IDs are referenced depends on the data source.

Required:
- A .tsv file containing, for the species in question, a list of protein markers of phagotrophy, with their reference IDs in METdb;
- A fasta file matching the protein IDs with nucleotide sequences (genome/transcriptome).

In [1]:
# Modules
import numpy as np
import pandas as pd

This script provides an example with test files.

In [2]:
# Read the files

# Directory of the files storing markers' IDs
rep_id='/home/alexandra/Documents/M2Alexandra/Data_test/Annot_func/'
file_id='101_MMETSP-METDB_00486-karenia-brevis-ccmp2229-paired_markers.tsv'
species='101_MMETSP-METDB_00486-karenia-brevis-ccmp2229-paired'

# Directory of the fasta files storing the nucleotide sequences
rep_seq='/home/alexandra/Documents/M2Alexandra/Data_test/Annot_func/'
file_seq='MMETSP-METDB_00486-karenia-brevis-ccmp2229-paired.fasta'

#### Read the .tsv ID file:

In [8]:
IDprot=pd.read_csv(rep_id+file_id,sep='\t')
myID=IDprot['ID']
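
The cell above assumes that the .tsv file has a column named `ID`. As a minimal sketch of how that assumption could be checked before going further, the helper below reads a marker table and fails early if the column is missing. The function name `read_marker_ids`, the toy table, and the toy IDs are hypothetical and only serve as illustration; they are not part of the original data.

```python
import io
import pandas as pd

# Hypothetical two-row marker table standing in for the real .tsv file
toy_tsv = "ID\tmarker\nseq_0001.p1\tmarkerA\nseq_0002.p1\tmarkerB\n"

def read_marker_ids(handle, id_column="ID"):
    """Return the marker IDs as a list of strings, failing early on bad input."""
    table = pd.read_csv(handle, sep="\t", dtype=str)
    if id_column not in table.columns:
        raise KeyError(f"column {id_column!r} not found; got {list(table.columns)}")
    # Drop empty cells and surrounding whitespace before returning the IDs
    return table[id_column].dropna().str.strip().tolist()

toy_ids = read_marker_ids(io.StringIO(toy_tsv))
```

With the real data, the same function could be called on `rep_id+file_id` instead of the in-memory toy table.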

#### Read the fasta file:

In [14]:
# FASTA header lines can contain a varying number of space-separated fields.
# To read the file as a table, give every line the same number of columns;
# pandas fills the missing ones with NaN.
with open(rep_seq+file_seq, 'r') as temp_f:
    # Count the space-separated fields on each line
    col_count = [len(line.split(" ")) for line in temp_f]
# Generate column names (0, 1, 2, ..., maximum number of columns - 1)
column_names = list(range(max(col_count)))

In [15]:
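# Read the FASTA file as a space-delimited table
df = pd.read_csv(rep_seq+file_seq, header=None, delimiter=" ", names=column_names)
# Keep only the first column: the header (ID) or the sequence of each line
data=df[0]
# This assumes every record spans exactly two lines (header, then sequence)
IDS=data[::2]  # headers are on every other line, starting with the first
seq=data[1::2] # nucleotide sequences are on the lines in between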
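
Slicing every other line only works when each record occupies exactly two lines. If a FASTA file wraps its sequences over several lines, a plain-Python parser is one possible alternative. The sketch below is not the method used in this notebook: `parse_fasta` and the toy records are hypothetical, and the function keeps only the first whitespace-separated token of each header as the record name.

```python
import io

def parse_fasta(handle):
    """Map each header's first token (without '>') to its full sequence."""
    records = {}
    name = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            # Sequence lines are concatenated until the next header
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

# Hypothetical toy file: the first record is wrapped over two lines
toy_fasta = ">seq_0001 len=8\nACGT\nACGT\n>seq_0002\nTTGA\n"
toy_records = parse_fasta(io.StringIO(toy_fasta))
```

The resulting dictionary can be queried directly by ID, so it no longer matters how many lines a sequence spans.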

#### Loop over the phagotrophy marker IDs, find the associated sequences, and store them in a list

In [18]:
list_seqname=[]
for item in myID:
    newname='>'+item[:-3] # drop the trailing '.p1' to get the FASTA header
    idx=np.where(IDS==newname)[0][0] # position of the header in IDS (IndexError if the ID is absent)
    seqname=seq[idx*2+1] # header idx sits on row idx*2 of the table; its sequence is on the next row
    list_seqname.append(seqname)
#print('Nucleotide sequences: ', list_seqname)
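
The loop above stops with an `IndexError` as soon as one ID has no matching header, and `item[:-3]` only works for three-character suffixes such as `.p1`. A possible variant, sketched below under assumptions, uses a dictionary lookup and collects unmatched IDs instead of failing. The names `strip_protein_suffix` and `match_sequences` and the toy data are hypothetical, and the assumption that every protein suffix has the form `.p<number>` is a generalisation of the `.p1` case seen here, not something confirmed by the source files.

```python
import re

def strip_protein_suffix(protein_id):
    """Remove a trailing '.p<number>' suffix, e.g. 'x.p1' -> 'x' (assumed pattern)."""
    return re.sub(r"\.p\d+$", "", protein_id)

def match_sequences(protein_ids, records):
    """Return (found, missing): sequences keyed by protein ID, and unmatched IDs."""
    found = {}
    missing = []
    for pid in protein_ids:
        key = strip_protein_suffix(pid)
        if key in records:
            found[pid] = records[key]
        else:
            missing.append(pid)
    return found, missing

# Hypothetical toy data: record name -> nucleotide sequence
toy_records = {"seq_0001": "ACGTACGT", "seq_0002": "TTGA"}
found, missing = match_sequences(
    ["seq_0001.p1", "seq_0002.p12", "seq_0099.p1"], toy_records
)
```

Reporting the `missing` list makes it easier to see which markers lack a sequence without interrupting the whole run.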

In [24]:
dim=len(list_seqname)
#dim==len(myID) # Sanity check: there should be one sequence per marker ID


#### Create a text file that will store the reference sequences (for remapping steps)

In [25]:
with open(rep_id+'{}_seq_phagomarkers.txt'.format(species),'w') as ce_fichier:
    for marker_id, sequence in zip(myID, list_seqname):
        ce_fichier.write(marker_id) # write the ID as well so we know which marker the sequence belongs to
        ce_fichier.write('\n')
        ce_fichier.write(sequence)
        ce_fichier.write('\n')
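
The text file written above alternates ID lines and sequence lines, but the ID lines do not start with `>`, so the file is not strictly FASTA. If the remapping tool expects FASTA input (an assumption, since the tool is not named here), the references could be written as in the sketch below. The function `write_fasta`, the file name, and the toy sequences are hypothetical.

```python
import os
import tempfile

def write_fasta(path, sequences, width=60):
    """Write a {name: sequence} mapping as FASTA, wrapping lines at `width`."""
    with open(path, "w") as out:
        for name, seq in sequences.items():
            out.write(f">{name}\n")
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")

# Hypothetical toy sequences keyed by marker ID
toy = {"seq_0001.p1": "ACGTACGT", "seq_0002.p1": "TTGA"}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_seq_phagomarkers.fasta")
    write_fasta(toy_path, toy, width=4)
    # Read the file back to inspect what was written
    with open(toy_path) as fh:
        written = fh.read()
```

A small `width` is used here only to show the wrapping; many tools also accept unwrapped sequences.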