# Get protein list from EDD
The EDD studies for DBTL0 and DBTL1 list proteins by UniProt identifier. I want to extract those identifiers so that they can be converted to PP_XXXX locus tags / four-letter gene codes with the UniProt ID mapping tool: https://www.uniprot.org/id-mapping

In [1]:
import edd_utils as eddu
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re 
import random
random.seed(1)

## Import data from EDD

Define EDD import data

In [2]:
study_slug_1 = 'crispri-p-putida-dbtl-0-final'
study_slug_2 = 'crispri-automation-for-enhanced-isoprenol-pro-fca3'
edd_server   = 'edd.jbei.org'
username     = 'pckinnunen'

Open EDD session

In [3]:
try:
    session = eddu.login(edd_server=edd_server, user=username)
except Exception:
    print('ERROR! Connection to EDD failed. We will try to load data from disk...')
else:
    print('OK! Connection to EDD successful. We will try to load data from EDD...')

Password for pckinnunen:  ········


OK! Connection to EDD successful. We will try to load data from EDD...


Import data

In [4]:
try:
    df1 = eddu.export_study(session, study_slug_1, edd_server=edd_server)
except (NameError, AttributeError, KeyError):
    print('ERROR! Not able to export study 1.')

try:
    df2 = eddu.export_study(session, study_slug_2, edd_server=edd_server)
except (NameError, AttributeError, KeyError):
    print('ERROR! Not able to export study 2.')

  0%|          | 0/1033125 [00:00<?, ?it/s]

  0%|          | 0/475488 [00:00<?, ?it/s]
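The login cell mentions falling back to data on disk, but this notebook does not show a fallback. One possible pattern is sketched below, as an assumption rather than the method used here: a hypothetical helper, `load_with_cache`, that caches a successful export with `pickle` and reads the cache back when the export fails. The helper name, the cache file name, and the stand-in fetch functions are all invented for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_with_cache(fetch, cache_path):
    """Hypothetical helper: call fetch(); cache on success, read cache on failure."""
    cache_path = Path(cache_path)
    try:
        data = fetch()
    except Exception as err:
        if not cache_path.exists():
            raise  # nothing on disk to fall back to
        print(f"Fetch failed ({err!r}); loading {cache_path.name} from disk.")
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    # Fetch succeeded: refresh the on-disk copy for later offline runs.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(data, fh)
    return data


def failing_fetch():
    # Stand-in for an export that fails because the server is unreachable.
    raise ConnectionError("EDD unreachable")


# Demo with stand-in data instead of a real EDD export.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "study_1.pkl"  # hypothetical cache location
    first = load_with_cache(lambda: {"rows": 3}, cache_file)
    second = load_with_cache(failing_fetch, cache_file)
```

In this notebook the `fetch` argument could be something like `lambda: eddu.export_study(session, study_slug_1, edd_server=edd_server)`, since pickling also works for DataFrames.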

## Get unique proteins

In [15]:
df1_proteins = df1.loc[df1['Protocol'] == 'Global Proteomics', 'Formal Type'].unique()
df2_proteins = df2.loc[df2['Protocol'] == 'Global Proteomics', 'Formal Type'].unique()

unique_proteins = np.unique(np.concatenate([df1_proteins, df2_proteins]))
unique_proteins[:5]

array(['P0AE22', 'sp|A9GAJ9|A9GAJ9_SORC5 Mcm',
       'sp|K4JH65|K4JH65_9ACTN Gdnd', 'sp|O77727', 'sp|O82803'],
      dtype=object)

## Split each protein entry by delimiter `|`

Protein entries look like `sp|A9GAJ9|A9GAJ9_SORC5 Mcm`, but the number of `|`-separated fields varies (for example, `P0AE22` has one field and `sp|O77727` has two).

The goal is to isolate the 6-character UniProt accession. First, create a list of lists in which each sublist is a single protein entry split along the delimiter.

In [27]:
proteins_split = [p.split('|') for p in unique_proteins]
proteins_split

For N = 2738 total proteins, N = 2738 have an entry with length six
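To make the variable field count concrete, here is a small sketch that applies the same split to a few entries copied from the outputs in this notebook. The `sample_entries` and `sample_split` names are introduced only for this illustration.

```python
# A few entries copied from outputs shown in this notebook (not the full list).
sample_entries = [
    "P0AE22",
    "sp|A9GAJ9|A9GAJ9_SORC5 Mcm",
    "sp|O77727",
    "tr|Q88QV1|Q88QV1_PSEPK",
]

# Same operation as the cell above: split each entry on the '|' delimiter.
sample_split = [entry.split("|") for entry in sample_entries]

for entry, fields in zip(sample_entries, sample_split):
    print(f"{entry!r} -> {len(fields)} field(s): {fields}")
```

The accession is the only field in the first entry and the second field in the others, which is why the following cells locate it by length rather than by position.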


Get the length of each string in each sublist using a nested list comprehension

In [None]:
proteins_split_length = [np.array([len(single_entry) for single_entry in single_protein]) 
                         for single_protein in proteins_split]

Check that exactly one of the split strings in each entry has length 6:

In [None]:
any_split_length_6 = [sum(protein_string_lengths==6)==1 for protein_string_lengths in proteins_split_length]
print(f"For N = {len(any_split_length_6)} total proteins, N = {sum(any_split_length_6)} have exactly one entry with length six")
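The length check works for this dataset, but it is a heuristic. The sketch below, in plain Python, shows two ways it could fail on other data: a 10-character accession (UniProt accessions are either 6 or 10 characters long) and an entry where a second field also happens to be six characters long. The last two entries are illustrative and do not come from the EDD studies; `fields_of_length_six` is a name made up for this example.

```python
def fields_of_length_six(entry):
    """Return the '|'-separated fields of `entry` that are exactly 6 characters long."""
    return [field for field in entry.split("|") if len(field) == 6]


examples = [
    "sp|A9GAJ9|A9GAJ9_SORC5 Mcm",      # from this notebook
    "sp|O77727",                       # from this notebook
    "tr|A0A024R161|A0A024R161_HUMAN",  # illustrative: 10-character accession
    "sp|P12345|ABCDEF",                # hypothetical: second field is also 6 long
]

counts = {entry: len(fields_of_length_six(entry)) for entry in examples}
for entry, n in counts.items():
    status = "ok" if n == 1 else "needs attention"
    print(f"{entry}: {n} field(s) of length six -> {status}")
```

Entries with zero or more than one match are the ones that the "exactly one" check flags.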

Iterate through the split proteins, find the index of the length-6 field in each sublist, and append that string to the list of protein IDs. The `assert` stops the loop if an entry does not have exactly one such field.

In [45]:
protein_ids_for_uniprot = []
for idx, protein_ids in enumerate(proteins_split):
    split_to_save = np.where(proteins_split_length[idx] == 6)[0]
    assert len(split_to_save) == 1, "Wrong number of protein IDs have the correct length"
    protein_ids_for_uniprot.append(protein_ids[split_to_save[0]])
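An alternative to matching on length is to match on the accession format itself. The sketch below uses the regular expression for accession numbers given in the UniProt help pages; the `extract_accession` helper is a hypothetical name, and this is not the method used in the notebook. It also accepts 10-character accessions, which the length-6 rule would miss.

```python
import re

# Accession pattern as given in the UniProt help pages (6 or 10 characters).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def extract_accession(entry):
    """Hypothetical helper: return the single '|'-separated field that is an accession."""
    matches = [f for f in entry.split("|") if UNIPROT_ACCESSION.fullmatch(f)]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one accession in {entry!r}, found {matches}")
    return matches[0]


# Entries from this notebook plus one illustrative 10-character accession.
for entry in ["P0AE22", "sp|A9GAJ9|A9GAJ9_SORC5 Mcm", "sp|O77727",
              "tr|A0A024R161|A0A024R161_HUMAN"]:
    print(entry, "->", extract_accession(entry))
```

Using `fullmatch` on each field, rather than searching the whole string, avoids picking up the accession that is embedded in names such as `A9GAJ9_SORC5`.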

Convert the list to a DataFrame and export it to CSV (one ID per line, no header or index) for use with the UniProt tool

In [46]:
protein_ids_for_uniprot_df = pd.DataFrame(protein_ids_for_uniprot)
protein_ids_for_uniprot_df.to_csv('protein_ids_for_uniprot_tool.csv', index = False, header = False)

Build a DataFrame that pairs each original protein string (from EDD) with its extracted protein ID

In [58]:
protein_id_conversion_df = pd.DataFrame({'original': unique_proteins,
                                         'extracted': protein_ids_for_uniprot})
protein_id_conversion_df

      original                     extracted
0     P0AE22                       P0AE22
1     sp|A9GAJ9|A9GAJ9_SORC5 Mcm   A9GAJ9
2     sp|K4JH65|K4JH65_9ACTN Gdnd  K4JH65
3     sp|O77727                    O77727
4     sp|O82803                    O82803
...   ...                          ...
2733  tr|Q88QV1|Q88QV1_PSEPK       Q88QV1
2734  tr|Q88QV2|Q88QV2_PSEPK       Q88QV2
2735  tr|Q88RH1|Q88RH1_PSEPK       Q88RH1
2736  tr|Q88RH2|Q88RH2_PSEPK       Q88RH2


In [60]:
protein_id_conversion_df.to_csv('./data/protein_id_conversion_df_init.csv', index = False, header = True)