**Testing the UniProt API**
For CRISPRi experiments we have to associate a gene target (via accession number) with a proteomics measurement (via UniProt ID). The goal is to create a dictionary that converts between the two.

To do so, we will use the UniProt API to download all data for a particular organism.

This workbook tests some of that functionality.

Starting from: https://www.uniprot.org/help/api_queries

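Generated ppUrl by searching for PP KT2440 in the UniProt advanced search, then filtering for reviewed entries, then selecting Download. Note that the URL used below contains only the organism filter (organism_id:160488) and no reviewed:true term.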
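As a rough sketch, the same URL can be assembled from its parts instead of being copied from the website, which makes it easier to see (and change) the query. `build_uniprot_url` and its `reviewed_only` flag are hypothetical names introduced here for illustration; the endpoint and field names are taken from the URL in this workbook, and `reviewed:true` is assumed to be the query term the "reviewed" filter adds.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in this workbook
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_uniprot_url(organism_id, reviewed_only=False):
    """Hypothetical helper: build a TSV download URL for one organism."""
    query = f"(organism_id:{organism_id})"
    if reviewed_only:
        # Assumption: restricts the result to reviewed (Swiss-Prot) entries
        query += " AND (reviewed:true)"
    params = {
        "fields": "accession,id,gene_names",
        "format": "tsv",
        "query": query,
    }
    # urlencode percent-encodes commas, colons and parentheses
    return f"{BASE_URL}?{urlencode(params)}"


print(build_uniprot_url(160488))
print(build_uniprot_url(160488, reviewed_only=True))
```

The resulting string can be passed to `requests.get` in the same way as `ppUrl`.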

In [1]:
import requests
import re
import io
import pandas as pd
import os
import numpy as np

In [2]:
# Get all PP (KT2440 strain) proteins.
# Note: the query filters only on organism_id:160488, so unreviewed entries are returned as well.
ppUrl = 'https://rest.uniprot.org/uniprotkb/stream?fields=accession%2Cid%2Cgene_names&format=tsv&query=%28%28organism_id%3A160488%29%29'
r = requests.get(ppUrl)
r.raise_for_status()  # stop early if the download failed
df = pd.read_csv(io.StringIO(r.text), sep='\t')
df

Unnamed: 0,Entry,Entry Name,Gene Names
0,Q88E10,MCPS_PSEPK,mcpS PP_4658
1,Q88E47,HGD_PSEPK,hmgA PP_4621
2,Q88FF8,CHRR_PSEPK,chrR PP_4138
3,Q88FY2,6HN3M_PSEPK,nicC PP_3944
4,Q88GJ9,BSR_PSEPK,alr PP_3722
...,...,...,...
5524,Q88RW2,Q88RW2_PSEPK,PP_0017
5525,Q88RW3,Q88RW3_PSEPK,PP_0016
5526,Q88RW4,Q88RW4_PSEPK,PP_0015
5527,Q88RW5,Q88RW5_PSEPK,PP_0014


Now that we have the data, work out how to filter the "Gene Names" column.

In [3]:
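# Flag rows whose gene-name field contains 'PP' (1 = yes, 0 = no)
findPP = lambda x: 1 if x.find('PP') >= 0 else 0
df['PP'] = df['Gene Names'].apply(findPP)
print(f'Do all gene names contain PP? {df.PP.all()}')
# Keep only the rows that contain 'PP'; copy so that later column assignments do not raise a warning
df = df[df.PP == 1].copy()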

print(f"These are the lengths of the gene names from uniprot: {df['Gene Names'].apply(len).unique()}")


Do all gene names contain PP? False
These are the lengths of the gene names from uniprot: [12 11 18 16 17 20  7 13 19 21 25 22 15 14 23 10 31 55 47]


Print out some examples of gene names with different lengths to determine how to filter them properly:


In [4]:
unique_name_lengths = np.sort(df['Gene Names'].apply(len).unique())
print(unique_name_lengths)
for n in unique_name_lengths:
    print(f'length is {n}')
    print(df[df['Gene Names'].apply(len) == n].head())

[ 7 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 31 47 55]
length is 7
      Entry    Entry Name Gene Names  PP
16   Q88CT0   IXTPA_PSEPK    PP_5100   1
22   Q88EI9      AK_PSEPK    PP_4473   1
229  Q88H30  Q88H30_PSEPK    PP_3535   1
283  Q88L12    YEGS_PSEPK    PP_2125   1
288  Q88L51   RRAAH_PSEPK    PP_2084   1
length is 10
       Entry    Entry Name  Gene Names  PP
918   Q88HU8  Q88HU8_PSEPK  ku PP_3255   1
1455  Q88RK3  Q88RK3_PSEPK  cc PP_0126   1
length is 11
         Entry        Entry Name   Gene Names  PP
4       Q88GJ9         BSR_PSEPK  alr PP_3722   1
32      Q88H32         OCD_PSEPK  ocd PP_3533   1
37      Q88LE4      Q88LE4_PSEPK  asd PP_1989   1
42      Q88NN6       URODH_PSEPK  udh PP_1171   1
55  A0A140FWM5  A0A140FWM5_PSEPK  ddl PP_5673   1
length is 12
    Entry   Entry Name    Gene Names  PP
0  Q88E10   MCPS_PSEPK  mcpS PP_4658   1
1  Q88E47    HGD_PSEPK  hmgA PP_4621   1
2  Q88FF8   CHRR_PSEPK  chrR PP_4138   1
3  Q88FY2  6HN3M_PSEPK  nicC PP_3944   1
5  Q88JK6 

Filter out the extra text around the gene name. The "Gene Names" field varies in length, and the 'PP_' locus tag is not always the last item in it.

We can locate the tag with str.find('PP_') and take the 7 characters starting at that position (for example, PP_5418). Alternatively, we could split the field with str.split(' ') and pick the substring that starts with 'PP_'.
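The split-based alternative mentioned above could look roughly like the sketch below. `extract_locus_tag` is a hypothetical helper, not part of the workbook; it assumes the names in the field are separated by whitespace and returns `None` when no token starts with the prefix.

```python
def extract_locus_tag(gene_names, prefix="PP_"):
    """Return the first whitespace-separated token starting with `prefix`, else None."""
    for token in gene_names.split():
        if token.startswith(prefix):
            return token
    return None


# Example strings taken from the outputs in this workbook
examples = ["atpE PP_5418 PP5418", "PP_5100", "ku PP_3255"]
for name in examples:
    print(name, "->", extract_locus_tag(name))
```

Unlike slicing a fixed 7 characters, this returns the whole token, so it does not depend on the tag having exactly four digits.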

In [5]:
def filterPP(x):
    # Return the 7-character locus tag (e.g. 'PP_5418'), or None if there is no 'PP_' in the field
    start = x.find('PP_')
    return x[start:start + 7] if start >= 0 else None

print(f"Sample name: {df.iloc[62, 2]} applying filterPP: {filterPP(df.iloc[62, 2])}")
df['trimmed_name'] = df['Gene Names'].apply(filterPP)
# Keep only the UniProt accession ('Entry') and the trimmed locus tag
df_final = df.drop(columns=['Entry Name', 'Gene Names', 'PP'])
print(df_final.head())
pkl_name = os.path.join('data', 'uniprot_PPutida_reviewed.pkl')

pkl_name
#df_final.to_pickle(pkl_name)

Sample name: atpE PP_5418 PP5418 applying filterPP: PP_5418
    Entry trimmed_name
0  Q88E10      PP_4658
1  Q88E47      PP_4621
2  Q88FF8      PP_4138
3  Q88FY2      PP_3944
4  Q88GJ9      PP_3722
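With the accession and locus tag side by side, the conversion dictionary described at the top could be built along these lines. This is a minimal sketch: `build_lookup` is a hypothetical helper, and keeping the first accession seen for a repeated locus tag is an assumption, not a decision made in this workbook. The sample rows are copied from the output above.

```python
def build_lookup(pairs):
    """Build locus tag -> accession and accession -> locus tag dictionaries."""
    tag_to_entry = {}
    entry_to_tag = {}
    for entry, tag in pairs:
        if tag is None:
            continue  # skip rows without a locus tag
        # Assumption: keep the first accession seen for each locus tag
        tag_to_entry.setdefault(tag, entry)
        entry_to_tag[entry] = tag
    return tag_to_entry, entry_to_tag


# Rows copied from the df_final output above; with the real data one could
# pass zip(df_final['Entry'], df_final['trimmed_name']) instead.
rows = [
    ("Q88E10", "PP_4658"),
    ("Q88E47", "PP_4621"),
    ("Q88FF8", "PP_4138"),
]
tag_to_entry, entry_to_tag = build_lookup(rows)
print(tag_to_entry["PP_4621"])
print(entry_to_tag["Q88FF8"])
```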
