**Testing the UniProt API**
To run CRISPRi experiments, we have to associate each gene target (identified by its accession number) with a proteomics measurement (identified by its UniProt ID). The goal is to build a dictionary that converts between the two.

To do so, we will use the UniProt API to download all data for a particular organism.

This notebook tests out some of that functionality.

Starting from: https://www.uniprot.org/help/api_queries

Generated ppUrl by searching for PP KT2440 (Pseudomonas putida KT2440) in the UniProt advanced search, filtering for reviewed entries, and then selecting download.
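
The download link encodes three things: the columns to return (`fields`), the output format (`format`), and the search query (`query`). As a minimal sketch, the same URL can be assembled from readable parts with the standard library, which makes it easier to swap the organism or the columns later. The helper name `build_stream_url` is hypothetical; the endpoint, field names, and query text are taken from the link used in this notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://rest.uniprot.org/uniprotkb/stream'

def build_stream_url(organism_id, fields, reviewed=True):
    """Assemble a UniProt stream URL (hypothetical helper, for illustration only)."""
    # Same query text as the link generated by the advanced search page
    query = f'((organism_id:{organism_id})) AND (reviewed:{str(reviewed).lower()})'
    params = {'fields': ','.join(fields), 'format': 'tsv', 'query': query}
    # quote_via=quote encodes spaces as %20, matching the downloaded link
    return f'{BASE_URL}?{urlencode(params, quote_via=quote)}'

pp_url = build_stream_url(160488, ['accession', 'id', 'gene_names'])
print(pp_url)
```

Under these assumptions the printed string should be identical to the link copied from the website, so either form can be passed to `requests.get`.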

In [40]:
import requests
import re
import io
import pandas as pd
import os

In [11]:
# Get all reviewed PP (KT2440 strain) proteins.
ppUrl = 'https://rest.uniprot.org/uniprotkb/stream?fields=accession%2Cid%2Cgene_names&format=tsv&query=%28%28organism_id%3A160488%29%29%20AND%20%28reviewed%3Atrue%29'
response = requests.get(ppUrl, timeout=60)
response.raise_for_status()  # stop early if the download failed
# The response body is tab-separated text; parse it into a DataFrame
df = pd.read_csv(io.StringIO(response.text), sep='\t')
df

Unnamed: 0,Entry,Entry Name,Gene Names
0,Q88E10,MCPS_PSEPK,mcpS PP_4658
1,Q88E47,HGD_PSEPK,hmgA PP_4621
2,Q88FF8,CHRR_PSEPK,chrR PP_4138
3,Q88FY2,6HN3M_PSEPK,nicC PP_3944
4,Q88GJ9,BSR_PSEPK,alr PP_3722
...,...,...,...
724,Q88Q16,Y682_PSEPK,PP_0682
725,Q88QJ9,FDHE_PSEPK,fdhE PP_0492
726,Q88QT7,APAG_PSEPK,apaG PP_0400
727,Q88R49,FETP_PSEPK,PP_0285


Now that we have the data, see how to filter the "Gene Names" column.

In [20]:
# Flag rows whose gene names contain the substring 'PP'
df['PP'] = df['Gene Names'].str.contains('PP', regex=False, na=False)
print(f'Do all gene names contain PP? {df.PP.all()}')
# Keep only the rows that contain PP (copy so later column assignments are safe)
df = df[df.PP].copy()

Do all gene names contain PP? True
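
A plain substring test for 'PP' is loose: it also matches any gene name that merely contains those two letters. As a hedged sketch on made-up rows (the last two entries below are invented for illustration and are not UniProt data), a regular expression for the locus-tag pattern can count how many tags each row carries. The pattern `PP_` followed by digits is an assumption based on the tags visible in the table above.

```python
import re
import pandas as pd

# Toy rows: the first two come from the table above, the last two are invented
toy = pd.DataFrame({
    'Entry': ['Q88E10', 'Q88Q16', 'X00001', 'X00002'],
    'Gene Names': ['mcpS PP_4658', 'PP_0682', 'aPPle', 'geneA PP_0001 PP_0002'],
})

# Assumed locus-tag shape: 'PP_' followed by one or more digits
LOCUS_TAG = re.compile(r'\bPP_\d+\b')

# Count the locus tags found in each gene-name string
toy['n_tags'] = toy['Gene Names'].apply(lambda s: len(LOCUS_TAG.findall(s)))
print(toy)
```

Rows with a count of zero or more than one would need a decision before they go into a one-to-one conversion table.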


Strip the extra text around the gene name so that only the PP locus tag remains.

In [43]:
# Keep only the locus tag from each gene-name string
# (assumes tags look like 'PP_' followed by digits)
locus_tag = df['Gene Names'].str.extract(r'(PP_\d+)', expand=False)
df_final = df[['Entry']].assign(trimmed_name=locus_tag)
print(df_final.head())
os.makedirs('data', exist_ok=True)  # make sure the output folder exists
pkl_name = os.path.join('data', 'uniprot_PPutida_reviewed.pkl')
df_final.to_pickle(pkl_name)

    Entry trimmed_name
0  Q88E10      PP_4658
1  Q88E47      PP_4621
2  Q88FF8      PP_4138
3  Q88FY2      PP_3944
4  Q88GJ9      PP_3722
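
The stated goal is a dictionary for converting between identifiers. A minimal sketch of that last step, using the five rows printed above as stand-in data; the names `sample`, `locus_to_uniprot`, and `uniprot_to_locus` are hypothetical, and a one-to-one mapping is assumed.

```python
import pandas as pd

# Stand-in for df_final: the five rows shown in the output above
sample = pd.DataFrame({
    'Entry': ['Q88E10', 'Q88E47', 'Q88FF8', 'Q88FY2', 'Q88GJ9'],
    'trimmed_name': ['PP_4658', 'PP_4621', 'PP_4138', 'PP_3944', 'PP_3722'],
})

# The lookups below silently drop duplicates, so check uniqueness first
assert sample['trimmed_name'].is_unique and sample['Entry'].is_unique

# Build a lookup in each direction
locus_to_uniprot = dict(zip(sample['trimmed_name'], sample['Entry']))
uniprot_to_locus = dict(zip(sample['Entry'], sample['trimmed_name']))

print(locus_to_uniprot['PP_4658'])
```

The same two lines should work on the full table loaded back with `pd.read_pickle(pkl_name)`, provided the uniqueness check passes there as well.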
