# Non-standard amino acids in KLIFS molecules

Aims of this notebook:
1. For all KLIFS molecules, save all non-standard amino acids.
2. Check the number of non-standard amino acids. Some `kinsim_structure` features are defined only for standard amino acids, so we need to check how much information we lose in our dataset.

## Imports and functions

In [174]:
%load_ext autoreload


In [175]:
%autoreload 2

In [251]:
import itertools
from pathlib import Path
import pickle
import sys

import pandas as pd

from kinsim_structure.analysis import NonStandardKlifsAminoAcids
from kinsim_structure.auxiliary import split_klifs_code, get_klifs_regions

## Globals

In [252]:
KLIFS_REGIONS = get_klifs_regions()

In [253]:
KLIFS_REGIONS.head()

klifs_id (index),region_name,klifs_id
1,I,1
2,I,2
3,I,3
4,g.I,4
5,g.I,5


## IO paths

In [177]:
path_to_data = Path('/') / 'home' / 'dominique' / 'Documents' / 'data' / 'kinsim' / '20190724_full'
path_to_kinsim = Path('/') / 'home' / 'dominique' / 'Documents' / 'projects' / 'kinsim_structure'

metadata_path = path_to_data / 'preprocessed' / 'klifs_metadata_preprocessed.csv'
output_path = path_to_kinsim / 'results' / '20190724_full' / 'non_standard_aminoacids.p'

## Load KLIFS metadata

In [178]:
klifs_metadata = pd.read_csv(metadata_path)

In [179]:
klifs_metadata.shape

(3918, 24)

## Data generation

In [180]:
# Get non-standard amino acids in KLIFS dataset
non_standard_aminoacids = NonStandardKlifsAminoAcids()

In [181]:
non_standard_aminoacids.get_non_standard_amino_acids_in_klifs(klifs_metadata)

In [182]:
with open(output_path, 'wb') as f:
    pickle.dump(non_standard_aminoacids, f)

## Data analysis

In [183]:
with open(output_path, 'rb') as f:
    non_standard_aminoacids = pickle.load(f)

In [208]:
print(
    'Number of non-standard amino acids in KLIFS dataset: '
    f'{len(non_standard_aminoacids.data)}'
)

Number of non-standard amino acids in KLIFS dataset: 26


In [248]:
# Get set of non-standard amino acids
set(non_standard_aminoacids.data.res_name)

{'CAF', 'CME', 'CSS', 'KCX', 'MSE', 'OCY', 'PHD', 'PTR'}

In [207]:
print(
    'Number of structures with non-standard amino acids:',
    non_standard_aminoacids.data['code'].nunique()
)

Number of structures with non-standard amino acids: 16
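
To put these counts in perspective for the second aim, the share of affected structures can be worked out from the numbers printed above. This is a minimal sketch; it assumes that each of the 3918 rows in `klifs_metadata` corresponds to one structure, and the variable names are illustrative only.

```python
# Counts taken from the cell outputs above
n_structures_total = 3918      # rows in klifs_metadata (assumed: one row per structure)
n_structures_affected = 16     # structures with at least one non-standard amino acid
n_non_standard_residues = 26   # non-standard amino acids overall

# Share of structures that contain a non-standard amino acid
fraction_affected = n_structures_affected / n_structures_total

# Average number of non-standard residues per affected structure
residues_per_affected = n_non_standard_residues / n_structures_affected

print(f'Affected structures: {fraction_affected:.2%}')
print(f'Non-standard residues per affected structure: {residues_per_affected:.2f}')
```

Under this assumption, roughly 0.4 % of the structures are affected, so the potential information loss is small.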


In [202]:
non_standard_aminoacids.data.sort_values(by='klifs_res_name')

Unnamed: 0,res_name,res_id,klifs_id,code,klifs_res_name
0,MSE,357,16,HUMAN/ADCK3_5i35_chainA,M
1,MSE,485,67,HUMAN/ADCK3_5i35_chainA,M
23,MSE,330,76,HUMAN/RIOK1_4otp_chainA,M
22,MSE,314,60,HUMAN/RIOK1_4otp_chainA,M
21,MSE,277,45,HUMAN/RIOK1_4otp_chainA,M
20,MSE,251,25,HUMAN/RIOK1_4otp_chainA,M
6,MSE,3729,1,HUMAN/DNAPK_5luq_chainB,M
7,MSE,3820,57,HUMAN/DNAPK_5luq_chainB,M
8,MSE,3929,77,HUMAN/DNAPK_5luq_chainB,M
9,MSE,192,3,HUMAN/IRAK4_2o8y_chainB,M


In [245]:
non_std = non_standard_aminoacids.data.copy()

In [246]:
# Look up the region name for each residue via its KLIFS ID
# (KLIFS_REGIONS is indexed by klifs_id)
non_std['region_name'] = non_std['klifs_id'].map(KLIFS_REGIONS['region_name'])

In [249]:
non_std.sort_values(by='klifs_res_name')

Unnamed: 0,res_name,res_id,klifs_id,code,klifs_res_name,region_name
0,MSE,357,16,HUMAN/ADCK3_5i35_chainA,M,III
1,MSE,485,67,HUMAN/ADCK3_5i35_chainA,M,VI
23,MSE,330,76,HUMAN/RIOK1_4otp_chainA,M,VII
22,MSE,314,60,HUMAN/RIOK1_4otp_chainA,M,aE
21,MSE,277,45,HUMAN/RIOK1_4otp_chainA,M,GK
20,MSE,251,25,HUMAN/RIOK1_4otp_chainA,M,aC
6,MSE,3729,1,HUMAN/DNAPK_5luq_chainB,M,I
7,MSE,3820,57,HUMAN/DNAPK_5luq_chainB,M,aD
8,MSE,3929,77,HUMAN/DNAPK_5luq_chainB,M,VII
9,MSE,192,3,HUMAN/IRAK4_2o8y_chainB,M,I
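
To see where the non-standard residues cluster, they can be tallied per structure and per KLIFS region. The sketch below uses only the ten rows displayed above (not all 26) and the standard library, so it runs on its own; on the full DataFrame, something like `non_std['region_name'].value_counts()` should give the same kind of summary.

```python
from collections import Counter

# (code, klifs_id, region_name) for the ten rows displayed above
rows = [
    ('HUMAN/ADCK3_5i35_chainA', 16, 'III'),
    ('HUMAN/ADCK3_5i35_chainA', 67, 'VI'),
    ('HUMAN/RIOK1_4otp_chainA', 76, 'VII'),
    ('HUMAN/RIOK1_4otp_chainA', 60, 'aE'),
    ('HUMAN/RIOK1_4otp_chainA', 45, 'GK'),
    ('HUMAN/RIOK1_4otp_chainA', 25, 'aC'),
    ('HUMAN/DNAPK_5luq_chainB', 1, 'I'),
    ('HUMAN/DNAPK_5luq_chainB', 57, 'aD'),
    ('HUMAN/DNAPK_5luq_chainB', 77, 'VII'),
    ('HUMAN/IRAK4_2o8y_chainB', 3, 'I'),
]

# Number of non-standard residues per structure and per KLIFS region
per_structure = Counter(code for code, _, _ in rows)
per_region = Counter(region for _, _, region in rows)

for code, count in per_structure.most_common():
    print(f'{code}: {count}')
for region, count in per_region.most_common():
    print(f'{region}: {count}')
```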


The non-standard residues found above, together with two further modified residues (CSO, SEP), derive from the following parent residues:
* CAF, CME, CSO, CSS, OCY: L-peptide linking (parent CYS)
* KCX: L-peptide linking (parent LYS)
* MSE: L-peptide linking (parent MET) - SELENOMETHIONINE
* PHD: L-peptide linking (parent ASP) - ASPARTYL PHOSPHATE
* PTR: L-peptide linking (parent TYR) - O-PHOSPHOTYROSINE
* SEP: L-peptide linking (parent SER) - PHOSPHOSERINE
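
The list above can be written down as a lookup table. This is only a sketch: the dictionary and the helper `get_parent_residue` are illustrative names and are not part of `kinsim_structure`.

```python
# Parent residues as listed above (three-letter residue names)
PARENT_RESIDUE = {
    'CAF': 'CYS', 'CME': 'CYS', 'CSO': 'CYS', 'CSS': 'CYS', 'OCY': 'CYS',
    'KCX': 'LYS',
    'MSE': 'MET',
    'PHD': 'ASP',
    'PTR': 'TYR',
    'SEP': 'SER',
}


def get_parent_residue(res_name):
    """Return the parent residue name, or the input itself if none is listed."""
    return PARENT_RESIDUE.get(res_name, res_name)


# Non-standard residues observed in this dataset (see the set printed above)
observed = {'CAF', 'CME', 'CSS', 'KCX', 'MSE', 'OCY', 'PHD', 'PTR'}
parents = {res_name: get_parent_residue(res_name) for res_name in sorted(observed)}
print(parents)
```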

In [254]:
klifs_metadata[klifs_metadata.pdb_id == '5fm3']

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,index,kinase,family,groups,pdb_id,chain,alternate_model,species,...,dfg,ac_helix,rmsd1,rmsd2,qualityscore,pocket,resolution,missing_residues,missing_atoms,full_ifp
3202,3203,3214,7302,RET,Ret,TK,5fm3,A,-,Human,...,in,in,0.788,2.104,8.2,KTLGEGEFGKVVKVAVKMLDLLSEFNVLKQVNPHVIKLYGALLIVE...,2.95,0,18,0000000000000010000001000000000000000000000000...


Decisions for preprocessing and encoding:

* MSE (selenomethionine): KLIFS treats it as M (methionine) and we will do the same. Selenomethionine is often incorporated into proteins for X-ray crystallography because the heavy selenium atom helps solve the phase problem.
* Residues denoted in KLIFS with an X (mutants, modified residues): remove them if they fall into important regions of the binding site.
    * PHD (aspartyl phosphate) in the DFG loop is treated in KLIFS as X and is therefore removed from the dataset.
* Phosphorylated residues
    * PTR (O-phosphotyrosine): KLIFS treats it as Y (tyrosine) and we will do the same (otherwise we would not match similar binding sites that are unphosphorylated).
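
The decisions above could be expressed as a small rule function. This is a hedged sketch, not the `kinsim_structure` implementation: the function name, the return strings, and the `important_regions` argument are hypothetical, and the region name `'xDFG'` used for the DFG loop in the example is an assumption about KLIFS region naming.

```python
# Non-standard residues that are encoded like their KLIFS one-letter code
ENCODE_AS_KLIFS = {'MSE': 'M', 'PTR': 'Y'}


def decide(res_name, klifs_res_name, region_name, important_regions):
    """Apply the preprocessing decisions to one non-standard residue."""
    # MSE and PTR: follow KLIFS and use the parent's one-letter code
    if ENCODE_AS_KLIFS.get(res_name) == klifs_res_name:
        return f'encode as {klifs_res_name}'
    # Residues denoted X by KLIFS: remove if in an important region
    if klifs_res_name == 'X' and region_name in important_regions:
        return 'remove'
    # Everything else is not covered by the decisions above
    return 'review manually'


important_regions = {'xDFG'}  # assumption: name of the DFG region

print(decide('MSE', 'M', 'I', important_regions))
print(decide('PHD', 'X', 'xDFG', important_regions))
```

Cases that the decisions do not cover fall through to `'review manually'` rather than being silently kept or dropped.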
   
