In [1]:
from crimm import fetch_rcsb
from crimm.IO import MMCIFParser, PDBParser
from Bio.PDB import PDBParser as PPDBParser



## From Local File Path
For this tutorial, we first download an MMCIF file and a PDB file to the current working directory using `wget`.

In [4]:
!wget https://files.rcsb.org/download/2AKA.cif
!wget https://files.rcsb.org/download/2AKA.pdb

--2023-05-10 11:20:30--  https://files.rcsb.org/download/2AKA.cif
Resolving files.rcsb.org (files.rcsb.org)... 128.6.158.70
Connecting to files.rcsb.org (files.rcsb.org)|128.6.158.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘2AKA.cif.1’

2AKA.cif.1              [ <=>                ] 940.29K  --.-KB/s    in 0.1s    

2023-05-10 11:20:31 (6.13 MB/s) - ‘2AKA.cif.1’ saved [962859]

--2023-05-10 11:20:31--  https://files.rcsb.org/download/2AKA.pdb
Resolving files.rcsb.org (files.rcsb.org)... 128.6.158.70
Connecting to files.rcsb.org (files.rcsb.org)|128.6.158.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘2AKA.pdb’

2AKA.pdb                [ <=>                ] 786.74K  --.-KB/s    in 0.1s    

2023-05-10 11:20:31 (6.05 MB/s) - ‘2AKA.pdb’ saved [805626]
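
If `wget` is not available on your system, the same download can be sketched with Python's standard library. The helper names `rcsb_url` and `download_structure` below are hypothetical and not part of `crimm`; the URL pattern is the one used in the cell above. Unlike the `wget` call, this sketch skips the download when a local copy already exists, which avoids duplicates such as `2AKA.cif.1`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def rcsb_url(pdb_id: str, ext: str) -> str:
    """Build the RCSB download URL for a PDB ID and file extension."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.{ext}"


def download_structure(pdb_id: str, ext: str = "cif", dest_dir: str = ".") -> Path:
    """Download the file unless a local copy already exists."""
    dest = Path(dest_dir) / f"{pdb_id.upper()}.{ext}"
    if not dest.exists():
        urlretrieve(rcsb_url(pdb_id, ext), str(dest))
    return dest


# Usage (requires network access):
# download_structure("2AKA", "cif")
# download_structure("2AKA", "pdb")
```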



## MMCIF Parser
For `crimm`, the preferred structure file format is **MMCIF**, since it follows a more rigorous standard defined by wwPDB, and the information it contains is more reliable and consistent.

The `MMCIFParser` is a supercharged version of Biopython's `MMCIFParser`. It adds more information to the constructed structure for modeling purposes (such as the identity and type of each chain, canonical sequence, disulfide bonds, etc.) while maintaining compatibility with Biopython's API.

The cell below shows the default values of the parameters accepted by `MMCIFParser`.

In [2]:
path = './2AKA.cif'
parser = MMCIFParser(
    first_assembly_only=True,
    first_model_only=True,
    include_hydrogens=False,
    include_solvent=True,
    strict_parser = True,
    QUIET = False
)
structure = parser.get_structure(path)
structure

NGLWidget()

<Structure id=2AKA Models=1>
│
├───<Model id=1 Chains=5>
	│
	├───<Polypeptide(L) id=A Residues=764>
	├──────Description: myosin II heavy chain
	│
	├───<Polypeptide(L) id=B Residues=6>
	├──────Description: LINKER
	│
	├───<Polypeptide(L) id=C Residues=299>
	├──────Description: Dynamin-1
	│
	├───<Solvent id=D Residues=584>
	├──────Description: water
	│
	├───<Solvent id=E Residues=170>
	├──────Description: water


## PDBParser
The `PDBParser` class provides limited information about a structure. Its primary use is to accommodate structures from other software packages, pipelines, or web services that do not produce MMCIF files. It is built upon Biopython's `PDBParser`, with additional functions to distinguish chain types and rename chain ids to conform to the MMCIF chain naming style.

In the example below, we parse the same structure from the PDB file format. Notice the missing descriptions and the missing water chains. Even though we specified `include_solvent=True`, the waters are absent; this is due to Biopython's `PDBParser`.

Moreover, the information guaranteed in MMCIF structures, such as the identity and type of each chain, canonical sequence, and disulfide bonds, will not be present here, since it is not required and is likely missing in many PDB files.

In [3]:
pdb_path = './2AKA.pdb'
pdb_parser = PDBParser(
    first_model_only=True,
    include_solvent =True,
    strict_parser=True, 
    get_header=False, 
    QUIET=False
)
pdb_structure = pdb_parser.get_structure(pdb_path)
pdb_structure



NGLWidget()

<Structure id=2AKA Models=1>
│
├───<Model id=0 Chains=3>
	│
	├───<Polypeptide(L) id=A Residues=764>
	│
	├───<Polypeptide(L) id=B Residues=6>
	│
	├───<Polypeptide(L) id=C Residues=299>


## Fetch directly from RCSB

Since we strongly recommend the MMCIF file format, we provide a utility to download structures directly from RCSB by PDB ID. Many more fetching utilities for other databases are introduced in the `FetchStructures.ipynb` tutorial. All entity types (RNA, DNA, macrolide, oligosaccharide, small molecules) should be parsed from MMCIF without problems.

Note: unlike `wget`, this fetcher does not save any file to your local drive. You need to save the structure explicitly if you want a local copy. Unfortunately, an MMCIF writer has not been implemented yet, and pickling a structure object is also currently problematic (the pickle problem will be addressed very soon); the only output format supported so far is, ironically, the PDB file format.

In [4]:
pdbid = '2or1' 
rcsb_fetched = fetch_rcsb(
    pdbid,
    first_assembly_only=True,
    first_model_only=True,
    include_hydrogens=False,
    include_solvent=False
)
rcsb_fetched

NGLWidget()

<Structure id=2OR1 Models=1>
│
├───<Model id=1 Chains=4>
	│
	├───<Polydeoxyribonucleotide id=A Residues=20>
	├──────Description: DNA (5'-D(*AP*AP*GP*TP*AP*CP*AP*AP*AP*CP*TP*TP*TP*CP*TP*TP*G P*TP*AP*T)-3')
	│
	├───<Polydeoxyribonucleotide id=B Residues=20>
	├──────Description: DNA (5'-D(*TP*AP*TP*AP*CP*AP*AP*GP*AP*AP*AP*GP*TP*TP*TP*GP*T P*AP*CP*T)-3')
	│
	├───<Polypeptide(L) id=C Residues=63>
	├──────Description: 434 REPRESSOR
	│
	├───<Polypeptide(L) id=D Residues=63>
	├──────Description: 434 REPRESSOR


## Object Getter and Handles for Entities
To get a handle on a chain in the structure, two types of getter methods can be used:

1. Since we derive our structure classes from the respective [Biopython entities](https://github.com/biopython/biopython/blob/master/Bio/PDB/Entity.py), all Biopython getter methods will work. That is, if you know the ids, you can use

```python
chain = structure[model_id][chain_id]
# for example, model 1, chain A
chainA = structure[1]['A']
```

2. Or we can use the list index into `child_list` at each entity level. The `child_list` attributes are aliased as *models*, *chains*, *residues*, and *atoms* on their respective parent entities, e.g. the first chain of the first model would be

```python
chain = structure.models[0].chains[0]
```

3. Mixing the two methods above is allowed, e.g. the CA atom of the last residue of chain A in the first model

```python
atom = structure.models[0]['A'].residues[-1]['CA']
```
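
The two access styles can coexist because each entity keeps its children both in an ordered list and in a dictionary keyed by id, as Biopython's `Entity` does with `child_list` and `child_dict`. The following is a simplified, standalone sketch of that idea, not the actual `crimm` or Biopython implementation; the toy `Entity` class and the sample ids are for illustration only.

```python
class Entity:
    """Toy container: children reachable by id or by list position."""

    def __init__(self, id):
        self.id = id
        self.child_list = []   # preserves insertion order
        self.child_dict = {}   # maps child id -> child

    def add(self, child):
        self.child_list.append(child)
        self.child_dict[child.id] = child
        return child

    def __getitem__(self, id):
        # entity[id] looks up by id, not by position
        return self.child_dict[id]

    def __iter__(self):
        # for-loops walk the children in order
        return iter(self.child_list)


model = Entity(1)
chain = model.add(Entity("A"))
for resseq in (13, 14, 15):
    chain.add(Entity(resseq))

assert model["A"] is model.child_list[0]   # id lookup vs. list index
assert chain[13] is chain.child_list[0]    # resseq 13 is the first residue
```

Note that the id and the list position need not agree: in this toy chain the first residue has id 13 but list index 0.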

In [5]:
chainA = structure[1]['A'] # the equivalent would be structure.models[0].chains[0]

In [6]:
chainA # notice only one polymer chain shows up in the window

NGLWidget()

<Polypeptide(L) id=A Residues=764>
  Description: myosin II heavy chain


## Get Residues and Atoms

Residues are stored as lists in chains, and atoms as lists in residues. For a single residue, the `resseq` (residue sequence id, aka residue number) can be used to get the residue object in a chain.

In [7]:
res = chainA[22]
res

NGLWidget()

<Residue TYR het=  resseq=22 icode= >


In [8]:
# Get the CA carbon
atom = res['CA']
atom

<Atom CA>

## Iterating over Entities
Iteration follows Biopython's convention: at each level, a for-loop iterates over the `child_list` directly.

In [31]:
for model in structure:
    for chain in model:
        print(chain)

<Polypeptide(L) id=A Residues=764>
<Polypeptide(L) id=B Residues=6>
<Polypeptide(L) id=C Residues=299>
<Solvent id=D Residues=584>
<Solvent id=E Residues=170>


Additionally, each level has a method called `get_atoms()` that returns all the atoms belonging to the entity. The `include_alt` flag determines whether all alternative locations of disordered residues or atoms are expanded. The default is `include_alt=False`, which returns only the first alternative location, if any.

In [32]:
# get_atoms returns an iterator; explicit conversion is needed to get a list
struct_atoms = list(structure.get_atoms(include_alt=False))
print(f'{structure} has {len(struct_atoms)} atoms')

first_chain = structure[1]['A']
chain_atoms = list(first_chain.get_atoms())
print(f'{first_chain} has {len(chain_atoms)} atoms')

<Structure id=2AKA Models=1> has 9273 atoms
<Polypeptide(L) id=A Residues=764> has 6139 atoms
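
Conceptually, `get_atoms()` flattens the hierarchy below an entity into a single lazy stream of atoms. A rough standalone sketch of that behaviour follows, using nested lists in place of real entities. The `Atom` dataclass and its `alt_locs` list are hypothetical stand-ins for how alternative locations might be stored, not `crimm`'s actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str
    # Hypothetical stand-in for extra alternative locations of a disordered atom
    alt_locs: list = field(default_factory=list)


def get_atoms(entity, include_alt=False):
    """Yield atoms lazily from arbitrarily nested lists of atoms."""
    if isinstance(entity, Atom):
        yield entity
        if include_alt:
            yield from entity.alt_locs
    else:
        for child in entity:
            yield from get_atoms(child, include_alt)


ca = Atom("CA", alt_locs=[Atom("CA")])     # one extra alternative location
residue = [Atom("N"), ca, Atom("C")]
chain = [residue, [Atom("N"), Atom("CA")]]

atoms = get_atoms(chain)                   # a generator, nothing computed yet
print(len(list(atoms)))                    # 5
print(len(list(get_atoms(chain, include_alt=True))))  # 6
```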


## Parent Entity is Stored as `parent` Attribute

In [33]:
chain_atoms[0].parent

NGLWidget()

<Residue ASN het=  resseq=13 icode= >


In [34]:
chain_atoms[0].parent.parent

NGLWidget()

<Polypeptide(L) id=A Residues=764>
  Description: myosin II heavy chain


## Additional Structural Information

Below are demonstrations of the additional structure information that is available from RCSB MMCIF files.

### Structure Level Properties

In [9]:
structure.header

{'name': '2AKA',
 'keywords': {'entry_id': '2AKA',
  'pdbx_keywords': 'CONTRACTILE PROTEIN',
  'text': 'fusion protein, GTPase domain, dynamin, myosin, CONTRACTILE PROTEIN'},
 'citation': {'id': ['primary', 1],
  'title': ['Crystal structure of the GTPase domain of rat dynamin 1.',
   'A structural model for actin-induced nucleotide release in myosin'],
  'journal_abbrev': ['Proc.Natl.Acad.Sci.Usa', 'Nat.Struct.Mol.Biol.'],
  'journal_volume': [102, 10],
  'page_first': [13093, 826],
  'page_last': [13098, 830],
  'year': [2005, 2003],
  'journal_id_ASTM': ['PNASA6', None],
  'country': ['US', 'US'],
  'journal_id_ISSN': ['0027-8424', '1545-9993'],
  'journal_id_CSD': [40, None],
  'book_publisher': [None, None],
  'pdbx_database_id_PubMed': [16141317, None],
  'pdbx_database_id_DOI': ['10.1073/pnas.0506491102', None]},
 'idcode': {'entry_id': '2AKA',
  'title': 'Structure of the nucleotide-free myosin II motor domain from Dictyostelium discoideum fused to the GTPase domain of dynamin 

In [10]:
structure.method

'X-RAY DIFFRACTION'

In [11]:
structure.resolution # in angstrom

1.9

### Model Level Properties
Connection records such as *covalent bonds between two chains*, *hydrogen bonds*, and *disulfide bonds* are stored at the model level: `model.connect_dict` holds the text information, and `model.connect_atoms` holds direct handles to the atom objects.

In [12]:
model = structure[1]
print(model.connect_dict) # covale is for covalent bonds
print(model.connect_atoms)

{'covale': [({'chain': 'A', 'resname': 'ILE', 'resseq': 776, 'atom_id': 'C', 'altloc': None}, {'chain': 'B', 'resname': 'THR', 'resseq': 1, 'atom_id': 'N', 'altloc': None})]}
{'covale': [(<Atom C>, <Atom N>)]}
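
The two attributes describe the same bonds: each record in `connect_dict` carries enough keys to locate the atom that appears at the same position in `connect_atoms`. The sketch below shows one way such a record could be resolved by hand, using plain nested dictionaries as a stand-in for the model. `resolve` and `toy_model` are hypothetical and not part of `crimm`.

```python
# The covalent-bond record printed above, copied verbatim
connect_dict = {
    "covale": [(
        {"chain": "A", "resname": "ILE", "resseq": 776, "atom_id": "C", "altloc": None},
        {"chain": "B", "resname": "THR", "resseq": 1, "atom_id": "N", "altloc": None},
    )]
}

# Stand-in for model[chain][resseq][atom_id]; strings replace Atom objects
toy_model = {
    "A": {776: {"C": "<Atom C of ILE 776>"}},
    "B": {1: {"N": "<Atom N of THR 1>"}},
}


def resolve(model, record):
    """Follow chain -> resseq -> atom_id to reach the atom."""
    return model[record["chain"]][record["resseq"]][record["atom_id"]]


connect_atoms = {
    bond_type: [tuple(resolve(toy_model, rec) for rec in pair) for pair in pairs]
    for bond_type, pairs in connect_dict.items()
}
print(connect_atoms)
```

With a real structure, the same chain, residue, and atom keys can be used with the getter syntax shown earlier, e.g. `model['A'][776]['C']`.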


In [13]:
model = rcsb_fetched[1] # 2OR1
model.connect_atoms # hydrog is for hydrogen bonds

{'hydrog': [(<Atom N6>, <Atom O4>),
  (<Atom N1>, <Atom N3>),
  (<Atom N6>, <Atom O4>),
  (<Atom N1>, <Atom N3>),
  (<Atom N2>, <Atom O2>),
  (<Atom O6>, <Atom N4>),
  (<Atom N3>, <Atom N1>),
  (<Atom O4>, <Atom N6>),
  (<Atom N1>, <Atom N3>),
  (<Atom N6>, <Atom O4>),
  (<Atom N3>, <Atom N1>),
  (<Atom N4>, <Atom O6>),
  (<Atom O2>, <Atom N2>),
  (<Atom N1>, <Atom N3>),
  (<Atom N6>, <Atom O4>),
  (<Atom N1>, <Atom N3>),
  (<Atom N6>, <Atom O4>),
  (<Atom N1>, <Atom N3>),
  (<Atom N6>, <Atom O4>),
  (<Atom N3>, <Atom N1>),
  (<Atom N4>, <Atom O6>),
  (<Atom O2>, <Atom N2>),
  (<Atom N3>, <Atom N1>),
  (<Atom O4>, <Atom N6>),
  (<Atom N3>, <Atom N1>),
  (<Atom O4>, <Atom N6>),
  (<Atom N3>, <Atom N1>),
  (<Atom O4>, <Atom N6>),
  (<Atom N3>, <Atom N1>),
  (<Atom N4>, <Atom O6>),
  (<Atom N3>, <Atom N1>),
  (<Atom O4>, <Atom N6>),
  (<Atom N3>, <Atom N1>),
  (<Atom O4>, <Atom N6>),
  (<Atom N1>, <Atom N3>),
  (<Atom N2>, <Atom O2>),
  (<Atom O6>, <Atom N4>),
  (<Atom N3>, <Atom N1>),
  

In [14]:
bpti = fetch_rcsb('4pti') # bovine pancreatic trypsin inhibitor
model = bpti.models[0]
model.connect_dict # disulf for disulfide bonds

{'disulf': [({'chain': 'A',
    'resname': 'CYS',
    'resseq': 5,
    'atom_id': 'SG',
    'altloc': None},
   {'chain': 'A',
    'resname': 'CYS',
    'resseq': 55,
    'atom_id': 'SG',
    'altloc': None}),
  ({'chain': 'A',
    'resname': 'CYS',
    'resseq': 14,
    'atom_id': 'SG',
    'altloc': None},
   {'chain': 'A',
    'resname': 'CYS',
    'resseq': 38,
    'atom_id': 'SG',
    'altloc': None}),
  ({'chain': 'A',
    'resname': 'CYS',
    'resseq': 30,
    'atom_id': 'SG',
    'altloc': None},
   {'chain': 'A',
    'resname': 'CYS',
    'resseq': 51,
    'atom_id': 'SG',
    'altloc': None})]}
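
Because `connect_dict` is made of plain dictionaries, it is easy to post-process. As a small illustration, the disulfide records above can be reduced to pairs of residue positions. The literal below reproduces only the keys needed from the output above, and `bonded_residue_pairs` is a hypothetical helper.

```python
# Abbreviated copy of the disulfide records shown above
disulf_records = {
    "disulf": [
        ({"chain": "A", "resseq": 5, "atom_id": "SG"},
         {"chain": "A", "resseq": 55, "atom_id": "SG"}),
        ({"chain": "A", "resseq": 14, "atom_id": "SG"},
         {"chain": "A", "resseq": 38, "atom_id": "SG"}),
        ({"chain": "A", "resseq": 30, "atom_id": "SG"},
         {"chain": "A", "resseq": 51, "atom_id": "SG"}),
    ]
}


def bonded_residue_pairs(connect_dict, bond_type):
    """Return ((chain, resseq), (chain, resseq)) for each bond of one type."""
    return [
        ((a["chain"], a["resseq"]), (b["chain"], b["resseq"]))
        for a, b in connect_dict.get(bond_type, [])
    ]


pairs = bonded_residue_pairs(disulf_records, "disulf")
print(pairs)
# [(('A', 5), ('A', 55)), (('A', 14), ('A', 38)), (('A', 30), ('A', 51))]
```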

### Chain Properties

Chain types are read from the MMCIF file and assigned to the `PolymerChain` object. Many functions rely on this attribute to correctly process different types of polymers.

In [15]:
chainA.chain_type

'Polypeptide(L)'

The attribute `can_seq` stands for **canonical sequence**, which is the sequence reported with the canonical 20 residues only. Any modified residue will be reported as the canonical amino acid it was modified from. Any *unknown* or *unclassifiable* type of residue will be reported as **X** in the sequence. This is also read from MMCIF files.

In [16]:
chainA.can_seq

Seq('MHHHHHHHDGTENPIHDRTSDYHKYLKVKQGDSDLFKLTVSDKRYIWYNPDPKE...SEI')

In [19]:
# To show the entire sequence, use the print function
print(chainA.can_seq)

MHHHHHHHDGTENPIHDRTSDYHKYLKVKQGDSDLFKLTVSDKRYIWYNPDPKERDSYECGEIVSETSDSFTFKTVDGQDRQVKKDDANQRNPIKFDGVEDMSELSYLNEPAVFHNLRVRYNQDLIYTYSGLFLVAVNPFKRIPIYTQEMVDIFKGRRRNEVAPHIFAISDVAYRSMLDDRQNQSLLITGESGAGKTENTKKVIQYLASVAGRNQANGSGVLEQQILQANPILEAFGNAKTTRNNNSSRFGKFIEIQFNSAGFISGASIQSYLLEKSRVVFQSETERNYHIFYQLLAGATAEEKKALHLAGPESFNYLNQSGCVDIKGVSDSEEFKITRQAMDIVGFSQEEQMSIFKIIAGILHLGNIKFEKGAGEGAVLKDKTALNAASTVFGVNPSVLEKALMEPRILAGRDLVAQHLNVEKSSSSRDALVKALYGRLFLWLVKKINNVLCQERKAYFIGVLDISGFEIFKVNSFEQLCINYTNEKLQQFFNHHMFKLEQEEYLKEKINWTFIDFGLDSQATIDLIDGRQPPGILALLDEQSVFPNATDNTLITKLHSHFSKKNAKYEEPRFSKTEFGVTHYAGQVMYEIQDWLEKNKDPLQQDLELCFKDSSDNVVTKLFNDPNIASRAKKGANFITVAAQYKEQLASLMATLETTNPHFVRCIIPNNKQLPAKLEDKVVLDQLRCNGVLEGIRITRKGFPNRIIYADFVKRYYLLAPNVPRDAEDSQKATDAVLKHLNIDPEQYRFGITKIFFRAGQLARIEEAREQRISEI


`known_seq`, on the other hand, is the sequence that reports modified residues explicitly: each modified residue is given as its 3-letter code enclosed in parentheses. The example below does not have any modified residues, but try it with **1A8I** and you will see that modified residue 680 is reported as **(LLP)**, which is modified from LYS.

In [23]:
chainA.known_seq

Seq('MHHHHHHHDGTENPIHDRTSDYHKYLKVKQGDSDLFKLTVSDKRYIWYNPDPKE...SEI')
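
The difference between the two sequence flavours can be illustrated with a small standalone sketch. The lookup tables below are deliberately tiny and the function names are hypothetical; `crimm` reads these sequences from the MMCIF file rather than computing them this way.

```python
# Minimal three-letter -> one-letter table (a real one covers all 20 residues)
ONE_LETTER = {"LYS": "K", "GLY": "G", "MET": "M", "ALA": "A"}
# Modified residue -> the canonical residue it derives from
PARENT = {"LLP": "LYS", "MSE": "MET"}


def canonical_seq(resnames):
    """Modified residues collapse to their parent; unknowns become X."""
    return "".join(
        ONE_LETTER.get(PARENT.get(name, name), "X") for name in resnames
    )


def known_seq(resnames):
    """Modified residues are kept as a 3-letter code in parentheses."""
    return "".join(
        f"({name})" if name in PARENT else ONE_LETTER[name]
        for name in resnames
    )


resnames = ["ALA", "LLP", "GLY"]
print(canonical_seq(resnames))  # AKG
print(known_seq(resnames))      # A(LLP)G
print(canonical_seq(["UNK"]))   # X
```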

Since we know which residues are present in the structure, we can mask the missing residues in the sequence with `'-'`. In the example below, we can see that residues are missing at the N-terminus of the chain.

In [24]:
chainA.masked_seq

MaskedSeq('------------NPIHDRTSDYHKYLKVKQGDSDLFKLTVSDKRYIWYNPDPKE...SEI')
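
The masking itself amounts to replacing each unresolved position of the canonical sequence with a placeholder. A minimal sketch, assuming positions are 1-based and aligned with the residue numbers (`mask_sequence` is a hypothetical helper, not the `MaskedSeq` implementation):

```python
def mask_sequence(can_seq, missing_resseqs, mask_char="-"):
    """Replace residues at the given 1-based positions with mask_char."""
    missing = set(missing_resseqs)
    return "".join(
        mask_char if position in missing else letter
        for position, letter in enumerate(can_seq, start=1)
    )


# First 20 residues of chain A; residues 1-12 are unresolved
print(mask_sequence("MHHHHHHHDGTENPIHDRTS", range(1, 13)))
# ------------NPIHDRTS
```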

There is a useful method for `masked_seq` to reveal the identity of the missing residues and highlight them in red.

In [18]:
chainA.masked_seq.show()

MHHHHHHHDGTENPIHDRTSDYHKYLKVKQGDSDLFKLTVSDKRYIWYNPDPKERDSYECGEIVSETSDSFTFKTVDGQDRQVKKDDANQRNPIKFDGVEDMSELSYLNEPAVFHNLRVRYNQDLIYTYSGLFLVAVNPFKRIPIYTQEMVDIFKGRRRNEVAPHIFAISDVAYRSMLDDRQNQSLLITGESGAGKTENTKKVIQYLASVAGRNQANGSGVLEQQILQANPILEAFGNAKTTRNNNSSRFGKFIEIQFNSAGFISGASIQSYLLEKSRVVFQSETERNYHIFYQLLAGATAEEKKALHLAGPESFNYLNQSGCVDIKGVSDSEEFKITRQAMDIVGFSQEEQMSIFKIIAGILHLGNIKFEKGAGEGAVLKDKTALNAASTVFGVNPSVLEKALMEPRILAGRDLVAQHLNVEKSSSSRDALVKALYGRLFLWLVKKINNVLCQERKAYFIGVLDISGFEIFKVNSFEQLCINYTNEKLQQFFNHHMFKLEQEEYLKEKINWTFIDFGLDSQATIDLIDGRQPPGILALLDEQSVFPNATDNTLITKLHSHFSKKNAKYEEPRFSKTEFGVTHYAGQVMYEIQDWLEKNKDPLQQDLELCFKDSSDNVVTKLFNDPNIASRAKKGANFITVAAQYKEQLASLMATLETTNPHFVRCIIPNNKQLPAKLEDKVVLDQLRCNGVLEGIRITRKGFPNRIIYADFVKRYYLLAPNVPRDAEDSQKATDAVLKHLNIDPEQYRFGITKIFFRAGQLARIEEAREQRISEI


`missing_res` is an attribute that returns the residue sequence number and the resname of each **currently** missing residue. This list is dynamically constructed from the canonical sequence and the current residues.

In [50]:
chainA.missing_res

[(1, 'MET'),
 (2, 'HIS'),
 (3, 'HIS'),
 (4, 'HIS'),
 (5, 'HIS'),
 (6, 'HIS'),
 (7, 'HIS'),
 (8, 'HIS'),
 (9, 'ASP'),
 (10, 'GLY'),
 (11, 'THR'),
 (12, 'GLU')]

In [51]:
chainA.gaps

[{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}]
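
The output above suggests that `gaps` groups the missing residue numbers into runs of consecutive positions, one set per gap. Assuming that interpretation, the grouping could be sketched as follows. `group_gaps` is a hypothetical helper, not the `crimm` implementation, and the sample list is made up in the same format as `missing_res`.

```python
def group_gaps(missing_resseqs):
    """Group residue numbers into sets of consecutive runs."""
    gaps = []
    for resseq in sorted(missing_resseqs):
        if gaps and resseq - 1 in gaps[-1]:
            gaps[-1].add(resseq)    # extends the current run
        else:
            gaps.append({resseq})   # starts a new run
    return gaps


# Made-up example with two separate gaps
missing_res = [(1, "MET"), (2, "HIS"), (3, "HIS"), (40, "GLY"), (41, "SER")]
print(group_gaps(resseq for resseq, _ in missing_res))
# [{1, 2, 3}, {40, 41}]
```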

Finally, there is a built-in method to determine whether there is any gap in the chain. `is_continuous` returns **True** if there is no gap in the middle of the chain; missing termini are ignored.

In [52]:
chainA.is_continuous()

True
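
Under the definition given above, a chain is continuous when every gap touches one of the termini. A minimal sketch of that check, assuming each gap is a set of consecutive 1-based positions along the canonical sequence (the standalone function below is hypothetical and takes the gaps and the sequence length explicitly):

```python
def is_continuous(gaps, seq_length):
    """True if no gap lies strictly inside the chain."""
    for gap in gaps:
        touches_n_term = 1 in gap
        touches_c_term = seq_length in gap
        if not (touches_n_term or touches_c_term):
            return False
    return True


# Chain A: 764 resolved + 12 missing residues = 776 positions
n_term_gap = set(range(1, 13))
print(is_continuous([n_term_gap], 776))               # True
print(is_continuous([n_term_gap, {300, 301}], 776))   # False
```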