**Test UniProt data against the provided Excel file**
Steps:
1) Create a dictionary from the Excel file
2) Create a dictionary from the UniProt API results
3) Look up each key from the Excel dictionary in the API dictionary and check that the values match



In [13]:
import pandas as pd
import os

In [14]:
#Import and display dataframes
api_path = os.path.join('data', 'uniprot_PPutida_reviewed.pkl')
xl_path = os.path.join('data', 'Sample name, CRISPRi target gene, Uniprot ID.xlsx')
df_api = pd.read_pickle(api_path)
print('Uniprot API head')
print(df_api.head())
df_xl = pd.read_excel(xl_path)
print('excel head')
print(df_xl.head())



Uniprot API head
    Entry trimmed_name
0  Q88E10      PP_4658
1  Q88E47      PP_4621
2  Q88FF8      PP_4138
3  Q88FY2      PP_3944
4  Q88GJ9      PP_3722
excel head
  CRISPRi target gene UNIPROT ID
0         PP_1607_NT2     Q88MG4
1         PP_1607_NT3     Q88MG4
2         PP_1607_NT4     Q88MG4
3             PP_4549     Q88EB7
4             PP_4550     Q88EB6


Check the entries in the UniProt dataframe to see how they're formatted


In [16]:
print('For API dataframe:')
print(f'These are the lengths of strings in the "entry" column: {df_api.Entry.apply(len).unique()}')
print(f'These are the lengths of strings in the "trimmed_name" column: {df_api.trimmed_name.apply(len).unique()}')


For API dataframe:
These are the lengths of strings in the "entry" column: [6]
These are the lengths of strings in the "trimmed_name" column: [7]


Add a trimmed gene name column to the Excel dataframe so it matches the API format

In [17]:
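# Add a placeholder column, then fill it with the 7-character locus tag (e.g. 'PP_1607')
df_xl['trimmed_name'] = None
print(df_xl.head())
# Slice from the first 'PP' to 7 characters after it; assumes every entry contains 'PP'
trim_name = lambda x: x[x.find('PP'):(x.find('PP')+7)]
df_xl['trimmed_name'] = df_xl['CRISPRi target gene'].apply(trim_name)
df_xl.head()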


  CRISPRi target gene UNIPROT ID trimmed_name
0         PP_1607_NT2     Q88MG4         None
1         PP_1607_NT3     Q88MG4         None
2         PP_1607_NT4     Q88MG4         None
3             PP_4549     Q88EB7         None
4             PP_4550     Q88EB6         None


  CRISPRi target gene UNIPROT ID trimmed_name
0         PP_1607_NT2     Q88MG4      PP_1607
1         PP_1607_NT3     Q88MG4      PP_1607
2         PP_1607_NT4     Q88MG4      PP_1607
3             PP_4549     Q88EB7      PP_4549
4             PP_4550     Q88EB6      PP_4550
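The `trim_name` slice above assumes every label contains 'PP'. When it doesn't, `str.find` returns -1 and the slice is taken relative to the end of the string instead of raising an error. Below is a minimal sketch of a stricter alternative, assuming locus tags always look like `PP_` followed by four digits (consistent with the 7-character names in the API dataframe, but not verified against the whole Excel file). The `labels` series and its `'control'` entry are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the 'CRISPRi target gene' column.
# 'control' is a hypothetical label with no locus tag in it.
labels = pd.Series(['PP_1607_NT2', 'PP_4549', 'control'])

# Pull out 'PP_' followed by exactly four digits; rows without a match become NaN
trimmed = labels.str.extract(r'(PP_\d{4})', expand=False)
print(trimmed)

# Rows that could not be trimmed are easy to list for manual inspection
unmatched = labels[trimmed.isna()].tolist()
print(unmatched)
```

With this approach, labels that don't fit the pattern show up as missing values rather than as odd fragments, so they can be reviewed before building the dictionaries.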


I now have two dataframes: a large set of genes and UniProt IDs from the UniProt API, and a smaller set of genes and UniProt IDs from the provided Excel file. In an ideal world:
* Every gene-protein pairing would be the same in both dataframes
* Every gene and protein in the provided Excel file would also appear in the UniProt API query

In practice that probably won't happen, but let's check. Note: what I'm doing here amounts to basic set operations of the kind SQL handles (joins, unions, differences), so there are probably easier ways to do it.
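One of those easier ways might be an outer join in pandas. This is a sketch only, using tiny toy frames (`df_api_toy`, `df_xl_toy`, both hypothetical names) with the same column names as this notebook rather than the real data.

```python
import pandas as pd

# Toy frames using this notebook's column names; the rows are illustrative only
df_api_toy = pd.DataFrame({'Entry': ['Q88E10', 'Q88EB7'],
                           'trimmed_name': ['PP_4658', 'PP_4549']})
df_xl_toy = pd.DataFrame({'UNIPROT ID': ['Q88EB7', 'Q88MG4'],
                          'trimmed_name': ['PP_4549', 'PP_1607']})

# Outer join on the gene name; the '_merge' column records which side each row came from
merged = df_xl_toy.merge(df_api_toy, on='trimmed_name', how='outer', indicator=True)
counts = merged['_merge'].value_counts()
print(counts)

# For genes found on both sides, check whether the two ID columns agree
both = merged[merged['_merge'] == 'both']
agree = (both['UNIPROT ID'] == both['Entry']).all()
print(agree)
```

Here `left_only` rows are genes found only in the Excel frame and `right_only` rows are genes found only in the API frame, which covers both ideal-world checks in one table.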

Make a dictionary from each dataframe

In [19]:
dict_xl = dict(zip(df_xl['trimmed_name'], df_xl['UNIPROT ID']))
print(dict_xl['PP_4549'])
dict_api = dict(zip(df_api['trimmed_name'], df_api['Entry']))
print(dict_api['PP_4658'])


Q88EB7
Q88E10
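One caveat with `dict(zip(...))`: several Excel rows trim to the same gene name (for example, the three `PP_1607_NT*` rows all become `PP_1607`), and a dictionary keeps only the last value for a repeated key. That is harmless when the repeated rows share one UniProt ID, as they do in the head shown above, but it would hide a conflict. A small sketch of how one might check, using a toy frame with one deliberately made-up conflicting row:

```python
import pandas as pd

# Rows modelled on the Excel head above, plus one hypothetical conflicting row
toy = pd.DataFrame({
    'trimmed_name': ['PP_1607', 'PP_1607', 'PP_4549', 'PP_4549'],
    'UNIPROT ID':   ['Q88MG4',  'Q88MG4',  'Q88EB7',  'XXXXXX'],  # 'XXXXXX' is made up
})

# Count distinct IDs per gene; more than one means dict(zip(...)) would silently drop a value
ids_per_gene = toy.groupby('trimmed_name')['UNIPROT ID'].nunique()
conflicts = ids_per_gene[ids_per_gene > 1].index.tolist()
print(conflicts)
```

An empty `conflicts` list on the real dataframe would suggest the dictionary loses nothing by collapsing duplicates.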


Check which keys from each dictionary are missing from the other.


In [22]:
api_keys = list(dict_api.keys())
xl_keys = list(dict_xl.keys()) 

print(f'There are {len(api_keys)} total api keys and {len(xl_keys)} total xl keys')

# Keys present in the API dictionary but missing from the Excel dictionary
in_api_not_xl = [k for k in api_keys if k not in xl_keys]
print(f'there are {len(in_api_not_xl)} keys in API but not excel')

# Keys present in the Excel dictionary but missing from the API dictionary
in_xl_not_api = [k for k in xl_keys if k not in api_keys]
print(f'there are {len(in_xl_not_api)} keys in excel but not api')


There are 729 total api keys and 123 total xl keys
there are 689 keys in API but not excel
there are 83 keys in excel but not api
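These counts imply 40 shared keys (729 - 689 = 40 and 123 - 83 = 40). The remaining step is to check whether the shared keys map to the same UniProt ID in both dictionaries. Here is a rough sketch using set operations on dictionary key views. The toy dictionaries below (`dict_api_toy`, `dict_xl_toy`) are hypothetical stand-ins, not real query results, and one value is deliberately wrong to show what a mismatch looks like.

```python
# Toy dictionaries shaped like dict_api and dict_xl; values are illustrative only
dict_api_toy = {'PP_4658': 'Q88E10', 'PP_4549': 'Q88EB7', 'PP_4550': 'Q88EB6'}
dict_xl_toy = {'PP_1607': 'Q88MG4', 'PP_4549': 'Q88EB7', 'PP_4550': 'WRONG1'}  # 'WRONG1' is made up

# Dictionary key views support set difference and intersection directly
api_only = dict_api_toy.keys() - dict_xl_toy.keys()
xl_only = dict_xl_toy.keys() - dict_api_toy.keys()
shared = dict_api_toy.keys() & dict_xl_toy.keys()

# For shared genes, collect those whose UniProt IDs disagree between the two sources
mismatched = sorted(k for k in shared if dict_api_toy[k] != dict_xl_toy[k])

print(f'{len(api_only)} API-only, {len(xl_only)} Excel-only, {len(shared)} shared')
print(f'shared keys with different IDs: {mismatched}')
```

Swapping in `dict_api` and `dict_xl` should reproduce the counts above and list any shared genes whose IDs differ.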
