# Import phospho-data from Sugiyama et al.

This notebook reads Supplementary Table 2 of Sugiyama et al. (2019) and formats it for analysis with StructureMap.

https://www.nature.com/articles/s41598-019-46385-4#ref-CR74 

In [1]:
import pandas as pd
import numpy as np

## Phospho data

In [2]:
p_sugiyama = pd.read_csv('../data/unformatted_ptm_data/41598_2019_46385_MOESM3_ESM.tsv', sep='\t')

In [3]:
'Total of ' + str(p_sugiyama.shape[0]) + ' kinase substrate pairs'

'Total of 198536 kinase substrate pairs'

In [4]:
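# 'Position' holds strings such as 'S212': residue letter followed by the site number
p_sugiyama['AA'] = [x[0] for x in p_sugiyama['Position']]
p_sugiyama['position'] = [int(x[1:]) for x in p_sugiyama['Position']]
# Indicator column marking the site as reported in this dataset
p_sugiyama['p_sugiyama'] = 1
p_sugiyama = p_sugiyama[['Uniprot ID','AA','position','p_sugiyama']]
# Several kinases can target the same site, so collapse kinase-substrate pairs to unique sites
p_sugiyama = p_sugiyama.drop_duplicates().reset_index(drop=True)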
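
The parsing step above assumes that every `Position` value is a single residue letter followed by an integer. A minimal sketch of a stricter, vectorized variant on toy data is shown here; the helper name `split_site` and the toy values are hypothetical, and the real column may contain formats this pattern does not cover.

```python
import pandas as pd

def split_site(position: pd.Series) -> pd.DataFrame:
    """Split strings like 'S212' into a residue letter and an integer position."""
    # Assumption: one uppercase letter followed by digits only.
    # Values that do not match the pattern become missing instead of raising.
    parts = position.str.extract(r'^(?P<AA>[A-Z])(?P<position>\d+)$')
    parts['position'] = pd.to_numeric(parts['position']).astype('Int64')
    return parts

# Toy input, not taken from the supplementary table
toy_positions = pd.Series(['S212', 'Y151', 'bad'])
sites = split_site(toy_positions)
print(sites)
```

Rows with a missing `AA` or `position` could then be inspected before they are dropped, rather than failing inside the list comprehension.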

In [5]:
'Total of ' + str(p_sugiyama.shape[0]) + ' reported phosphosites'

'Total of 21449 reported phosphosites'

## Annotate UniProt accession

In [6]:
uniprot_annotation = pd.read_csv("../data/human_fasta/uniprot-filtered-organism__Homo+sapiens+(Human)+[9606]_+AND+review--.tab", sep='\t')
uniprot_annotation = uniprot_annotation[['Entry','Entry name']]
uniprot_annotation = uniprot_annotation.rename(columns={"Entry": "protein_id","Entry name": "Uniprot ID"})

In [7]:
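# Left join keeps every site; entry names without a match get a missing protein_id
p_sugiyama = p_sugiyama.merge(uniprot_annotation, how='left', on=['Uniprot ID'])
# Use 0 as a sentinel for unmapped proteins so they can be counted and filtered below
p_sugiyama = p_sugiyama.fillna(0)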

In [8]:
'Total of ' + str(len(p_sugiyama[p_sugiyama.protein_id==0]['Uniprot ID'].unique())) + ' proteins with no UniProt mapping.'

'Total of 354 proteins with no UniProt mapping.'

In [9]:
'This corresponds to ' + str(p_sugiyama[p_sugiyama.protein_id==0].shape[0]) + ' phosphosites with no UniProt mapping.'

'This corresponds to 1401 phosphosites with no UniProt mapping.'
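
The two counts above rely on `fillna(0)` as a sentinel for unmapped rows. As a sketch of an alternative, the `indicator` option of `DataFrame.merge` can flag unmatched rows directly. The toy frames, entry names, and the accession `P00001` below are made up for illustration and are not taken from the data.

```python
import pandas as pd

# Toy stand-ins for the phosphosite table and the UniProt annotation
sites = pd.DataFrame({'Uniprot ID': ['A_HUMAN', 'A_HUMAN', 'B_HUMAN'],
                      'AA': ['S', 'Y', 'T'],
                      'position': [10, 20, 30]})
annotation = pd.DataFrame({'protein_id': ['P00001'],
                           'Uniprot ID': ['A_HUMAN']})

# indicator=True adds a '_merge' column: 'both' for matches, 'left_only' otherwise
merged = sites.merge(annotation, how='left', on='Uniprot ID', indicator=True)

unmapped = merged[merged['_merge'] == 'left_only']
n_unmapped_proteins = unmapped['Uniprot ID'].nunique()
n_unmapped_sites = len(unmapped)

# Keep only mapped sites and drop the helper column
mapped = (merged[merged['_merge'] == 'both']
          .drop(columns='_merge')
          .reset_index(drop=True))
```

This avoids writing `0` into a column of accession strings, which keeps `protein_id` a pure string column after filtering.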

In [10]:
p_sugiyama = p_sugiyama[p_sugiyama.protein_id != 0].reset_index(drop=True)

In [11]:
'After filtering, ' + str(p_sugiyama.shape[0]) + ' phosphosites remain.'

'After filtering, 20048 phosphosites remain.'

In [12]:
p_sugiyama = p_sugiyama[['protein_id','AA','position','p_sugiyama']]

In [13]:
p_sugiyama

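|  | protein_id | AA | position | p_sugiyama |
|---|---|---|---|---|
| 0 | P31946 | S | 212 | 1 |
| 1 | P31946 | Y | 151 | 1 |
| 2 | P31946 | Y | 21 | 1 |
| 3 | P31946 | Y | 50 | 1 |
| 4 | P62258 | Y | 152 | 1 |
| ... | ... | ... | ... | ... |
| 20043 | Q9NRA0 | T | 389 | 1 |
| 20044 | Q9NRA0 | T | 402 | 1 |
| 20045 | Q9NRA0 | T | 404 | 1 |
| 20046 | Q9NRA0 | T | 503 | 1 |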


In [14]:
p_sugiyama.to_csv('../data/ptm_data/p_sugiyama.csv', index=False)

Note that the sites from Sugiyama et al. were not remapped to the most recent protein FASTA file. Some phosphosite positions may therefore not match the current sequences, although this is expected to affect only a minority of proteins and phosphosites.
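
One way to gauge how many sites are affected is to check whether the reported residue is actually present at the reported position in the current sequence. The following is only a sketch on toy data: the function `residue_matches`, the sequence dictionary, and the accession `P00001` are hypothetical, and it assumes sequences keyed by UniProt accession have been loaded separately.

```python
import pandas as pd

def residue_matches(sites: pd.DataFrame, sequences: dict) -> pd.Series:
    """Return True where the sequence has residue `AA` at the 1-based `position`."""
    def check(row):
        seq = sequences.get(row['protein_id'])
        # Unknown protein or position outside the sequence counts as a mismatch
        if seq is None or not 1 <= row['position'] <= len(seq):
            return False
        return seq[row['position'] - 1] == row['AA']
    return sites.apply(check, axis=1)

# Toy sequence and sites, not real UniProt data
toy_sequences = {'P00001': 'MSTAY'}
toy_sites = pd.DataFrame({'protein_id': ['P00001', 'P00001', 'P00001', 'P00001'],
                          'AA': ['S', 'Y', 'T', 'T'],
                          'position': [2, 5, 4, 9]})

ok = residue_matches(toy_sites, toy_sequences)
n_mismatched = int((~ok).sum())
```

Sites where the check fails could be reported or excluded before downstream analysis, depending on how strict the mapping needs to be.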