# Assessing the secondary structure for PDB entry 6aam

PDB entry 6AAM is a structure of the non-receptor tyrosine-protein kinase TYK2, solved by X-ray crystallography at a resolution of 1.98 Å. Further details are available from the wwPDB web pages for the entry:

* https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/
* https://www.rcsb.org/structure/6AAM
* https://pdbj.org/mine/summary/6aam


The resolution and other validation metrics indicate that this structure can be regarded as reliable, certainly for its secondary structure.

At the PDBe site one can visualise the 3D structure of 6AAM and produce a cartoon:

<img src='6aam.png'>

Here helices are shown as magenta (alpha helix) or purple (3-10 helix) springs. In contrast, beta sheet is shown as yellow arrows and coil in white. Dotted lines show parts of the structure where atoms are not observed experimentally, most likely because of disorder.

The sequence for 6aam can be obtained from https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/protein/1 and is 298 residues long.

In [1]:
# 6aam sequence from https://www.ebi.ac.uk/pdbe/entry/pdb/6aam/protein/1
sequence_6aam = ('GPGDPTVFHKRYLKKIRDLGEGHFGKVSLYCYDPTNDGTGEMVAVKALKADAGP'
                 'QHRSGWKQEIDILRTLYHEHIIKYKGCCEDAGAASLQLVMEYVPLGSLRDYLPR'
                 'HSIGLAQLLLFAQQICEGMAYLHAQHYIHRNLAARNVLLDNDRLVKIGDFGLAK'
                 'AVPEGHEYYRVREDGDSPVFWYAPECLKEYKFYYASDVWSFGVTLYELLTHCDS'
                 'SQSPPTKFLELIGLAQGQMTVLRLTELLERGERLPRPDKCPAEVYHLMKNCWET'
                 'EASFRPTFENLIPILKTVHEKYQGQAPS')
print(len(sequence_6aam))

298


Secondary structure is more complicated than this simple cartoon classification suggests; the full picture comes from analysing the hydrogen-bonding arrangement. The program DSSP (https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html) is a well-tested approach to this problem: it produces a description of the secondary structure in a known protein structure. For historical reasons DSSP uses E (extended) for strands. A related program, DSSR, gives RNA secondary structure.

The secondary structure for all PDB entries as assessed by the DSSP program is available from https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz

Downloading and uncompressing this file and finding the entry for 6AAM gives:

In [2]:
# from https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz
dssp_result_for_6aam = """>6AAM:A:secstr
      B  GGGEEEEEE       EEEEEEE TT     EEEEEEE      TTHHHHHHHHHHHHHH   TTB
  EEEEEEEGGGTEEEEEEE  TT BHHHHGGGS   HHHHHHHHHHHHHHHHHHHHTTEE S  SGGGEEEEET
TEEEE   TT EE                 GGG  HHHHHH    HHHHHHHHHHHHHHHHTTT GGGSHHHHHH
HHH S  TT HHHHHHHHHHTT      TT  HHHHHHHHHHT SSGGGS  HHHHHHHHHHHHHHHH     
"""
dssp_result_for_6aam = dssp_result_for_6aam.splitlines()
dssp_result_for_6aam.pop(0)
dssp_result_for_6aam = ''.join(dssp_result_for_6aam)
assert len(dssp_result_for_6aam) == 298 # there should be 298 letters
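
Rather than locating the entry by hand, the lookup could be scripted. The following is a minimal sketch, assuming a local copy of `ss_dis.txt.gz` and assuming every record has the FASTA-like layout seen above (a `>` header line followed by wrapped data lines). The function names `read_fasta_like_record` and `read_secstr_from_gzip` are hypothetical helpers, not part of any library. Lines are not stripped of spaces, because a space is a meaningful code in the DSSP string.

```python
import gzip
import io


def read_fasta_like_record(handle, header):
    """Return the joined data lines that follow `header` (e.g. '>6AAM:A:secstr').

    Only the line terminator is removed: spaces are kept because they
    encode coil in the DSSP string.
    """
    collecting = False
    parts = []
    for line in handle:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            if collecting:
                break  # reached the next record, so we are done
            collecting = (line == header)
        elif collecting:
            parts.append(line)
    return ''.join(parts)


def read_secstr_from_gzip(path, header):
    """Open a gzip-compressed text file and extract one record from it."""
    with gzip.open(path, 'rt') as handle:
        return read_fasta_like_record(handle, header)


# Small in-memory demonstration with a made-up entry name (1ABC).
demo = io.StringIO('>1ABC:A:sequence\nGPGD\n'
                   '>1ABC:A:secstr\n  EE\nHH  \n'
                   '>1ABC:A:disorder\n----\n')
print(repr(read_fasta_like_record(demo, '>1ABC:A:secstr')))
```

With a real download, a call such as `read_secstr_from_gzip('ss_dis.txt.gz', '>6AAM:A:secstr')` would be expected to return the same 298-character string as the cell above, provided the file layout matches the assumption.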

DSSP classifies secondary structure into the following categories:

* G = 3-turn helix (3-10 helix). Minimum length 3 residues.
* H = 4-turn helix (α helix). Minimum length 4 residues.
* I = 5-turn helix (π helix). Minimum length 5 residues.
* T = hydrogen bonded turn (3, 4 or 5 turn)
* E = extended strand in parallel and/or anti-parallel β-sheet conformation. Minimum length 2 residues.
* B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation)
* S = bend (the only non-hydrogen-bond based assignment).
* C = coil (residues which are not in any of the above conformations).

(Source: https://en.wikipedia.org/wiki/Protein_secondary_structure)
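
As a quick illustration of these categories, the sketch below tallies how often each code occurs in a DSSP string. It assumes, as in the string above, that coil is written as a space, and reports it under `C`. The function name `count_dssp_codes` and the short example string are made up for illustration.

```python
from collections import Counter


def count_dssp_codes(dssp_string):
    """Tally DSSP codes, reporting blanks (coil in this file format) as 'C'."""
    return Counter('C' if code == ' ' else code for code in dssp_string)


# A short made-up DSSP fragment, built from pieces so each run is easy to read.
example = '  ' + 'GGG' + 'EEEEEE' + '  ' + 'TT' + 'HHHH' + ' ' + 'S' + ' '
for code, n in sorted(count_dssp_codes(example).items()):
    print(code, n)
```

Applied to `dssp_result_for_6aam` before any remapping, the same function would give the composition of the whole chain.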

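For secondary structure prediction it is normal to use only three categories: `H`, `S` and `C` (helix, sheet and coil). Note that `S` means a bend in DSSP but a sheet (strand) residue in the three-category scheme, so a DSSP bend has to be mapped to `C`. In the DSSP string above, coil appears as a space rather than the letter `C`. So we need to map the DSSP classification to these: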

In [3]:
# Convert the DSSP codes to the 3-category scheme: helix, strand, coil.
# Mapping:
#   helices H, G, I                         -> H
#   strands E and bridges B                 -> S
#   everything else (blank, bend S, turn T) -> C
translation = str.maketrans('HGIEB ST', 'HHHSSCCC')
dssp_result_for_6aam = dssp_result_for_6aam.translate(translation)
assert len(dssp_result_for_6aam) == 298 # must be as long as the sequence
assert set(dssp_result_for_6aam) == set('HSC')  # only HSC allowed
print(dssp_result_for_6aam)

CCCCCCSCCHHHSSSSSSCCCCCCCSSSSSSSCCCCCCCCSSSSSSSCCCCCCCCHHHHHHHHHHHHHHCCCCCSCCSSSSSSSHHHCSSSSSSSCCCCCSHHHHHHHCCCCHHHHHHHHHHHHHHHHHHHHCCSSCCCCCHHHSSSSSCCSSSSCCCCCCSSCCCCCCCCCCCCCCCCCHHHCCHHHHHHCCCCHHHHHHHHHHHHHHHHCCCCHHHCHHHHHHHHHCCCCCCCHHHHHHHHHHCCCCCCCCCCCCHHHHHHHHHHCCCCHHHCCCHHHHHHHHHHHHHHHHCCCCC


# Test data from 6aam

We want a data set that maps 5-mer sequences to the secondary structure of the central residue of each 5-mer. Sampling every 10th residue produces a reasonable sample size.

In [4]:
from pprint import pprint
test_data_from_6aam = []
for ires, dssp in enumerate(dssp_result_for_6aam):
    # take residues 10, 20, 30, ... (1-based numbering)
    if (ires+1)%10 == 0:
        # window of 5 residues centred on residue ires
        fivemer = sequence_6aam[ires-2:ires+3]
        test_data_from_6aam.append((fivemer, dssp))
pprint(test_data_from_6aam)
print('length of test_data_from_6aam', len(test_data_from_6aam))

[('FHKRY', 'H'),
 ('DLGEG', 'C'),
 ('SLYCY', 'S'),
 ('GTGEM', 'C'),
 ('LKADA', 'C'),
 ('SGWKQ', 'H'),
 ('RTLYH', 'C'),
 ('YKGCC', 'S'),
 ('ASLQL', 'S'),
 ('PLGSL', 'C'),
 ('RHSIG', 'C'),
 ('LFAQQ', 'H'),
 ('AYLHA', 'H'),
 ('RNLAA', 'C'),
 ('DNDRL', 'C'),
 ('FGLAK', 'C'),
 ('HEYYR', 'C'),
 ('DSPVF', 'C'),
 ('CLKEY', 'H'),
 ('SDVWS', 'H'),
 ('YELLT', 'H'),
 ('QSPPT', 'H'),
 ('IGLAQ', 'C'),
 ('LRLTE', 'H'),
 ('ERLPR', 'C'),
 ('AEVYH', 'H'),
 ('WETEA', 'C'),
 ('FENLI', 'H'),
 ('VHEKY', 'H')]
length of test_data_from_6aam 29


In [5]:
# copy data from cell above
test_data_from_6aam = [('FHKRY', 'H'), ('DLGEG', 'C'), ('SLYCY', 'S'), ('GTGEM', 'C'),
                       ('LKADA', 'C'), ('SGWKQ', 'H'), ('RTLYH', 'C'), ('YKGCC', 'S'),
                       ('ASLQL', 'S'), ('PLGSL', 'C'), ('RHSIG', 'C'), ('LFAQQ', 'H'),
                       ('AYLHA', 'H'), ('RNLAA', 'C'), ('DNDRL', 'C'), ('FGLAK', 'C'),
                       ('HEYYR', 'C'), ('DSPVF', 'C'), ('CLKEY', 'H'), ('SDVWS', 'H'),
                       ('YELLT', 'H'), ('QSPPT', 'H'), ('IGLAQ', 'C'), ('LRLTE', 'H'),
                       ('ERLPR', 'C'), ('AEVYH', 'H'), ('WETEA', 'C'), ('FENLI', 'H'),
                       ('VHEKY', 'H')]
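
A list of `(fivemer, label)` pairs like this one can be used to score a secondary structure predictor. The sketch below is one simple way to do so, not a definitive evaluation protocol: it reports the fraction of 5-mers whose central residue is assigned the expected category. The names `fraction_correct` and `always_coil` are hypothetical, and `always_coil` is only a trivial baseline that ignores the sequence.

```python
def fraction_correct(predict, test_data):
    """Fraction of (fivemer, label) pairs for which predict(fivemer) == label."""
    if not test_data:
        raise ValueError('test_data is empty')
    hits = sum(1 for fivemer, label in test_data if predict(fivemer) == label)
    return hits / len(test_data)


def always_coil(fivemer):
    """Hypothetical baseline: ignore the sequence and always predict coil."""
    return 'C'


# Four of the pairs from the list above, used as a small demonstration.
demo_data = [('DLGEG', 'C'), ('SLYCY', 'S'), ('SGWKQ', 'H'), ('RTLYH', 'C')]
print(fraction_correct(always_coil, demo_data))  # 2 of the 4 demo labels are 'C'
```

Any real predictor should score better on `test_data_from_6aam` than such a constant baseline; with only 29 samples from a single structure the score is a rough indication at best.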