In [1]:
from crimm.Fetchers import (
    fetch_rcsb, fetch_alphafold, fetch_swiss_model,
    uniprot_id_query, fetch_alphafold_from_chain,
    fetch_swiss_model_from_chain
)
from crimm.Superimpose import ChainSuperimposer
from crimm.Visualization import show_nglview_multiple, View



## Get Structure from RCSB with PDB ID

The `fetch_rcsb` function downloads the structure as an mmCIF file, the format that carries the most
accurate structure annotations. The function call below shows all the default options.

In [2]:
pdbid = '5IEV'
struct = fetch_rcsb(
    pdbid,
    first_assembly_only=True,
    first_model_only=True,
    include_hydrogens=False,
    include_solvent=True
)

In [3]:
struct

NGLWidget()

<Structure id=5IEV Models=1>
│
├───<Model id=1 Chains=3>
	│
	├───<Polypeptide(L) id=A Residues=284>
	├──────Description: Cyclin-dependent kinase 2
	│
	├───<Heterogens id=B Molecules=1>
	├──────Description: Roniciclib
	│
	├───<Solvent id=C Residues=154>
	├──────Description: water


In [None]:
chainA = struct[1]['A'] # the equivalent would be struct.models[0].chains[0]

In [None]:
chainA

NGLWidget()

<Polypeptide(L) id=A Residues=284>
  Description: Cyclin-dependent kinase 2


## Query UniProt IDs

To find the UniProt ID for a specific polypeptide chain from [RCSB](https://www.rcsb.org/), you need to provide the **PDB ID** and the **Entity ID** of the chain. The `entity_id` is already stored as an attribute on the **polymer** chains when you parse the structure from an mmCIF file or fetch it from RCSB directly.

In [49]:
uniprot_id = uniprot_id_query(pdbid, chainA.entity_id)

In [50]:
pdbid, chainA.entity_id, uniprot_id

('5IEV', 1, 'P24941')

## Fetch AlphaFold Structure from AlphaFold DB

Once we have the UniProt ID, we can fetch the corresponding structure from [AlphaFold DB](https://alphafold.ebi.ac.uk/). The downloaded structure contains only the polypeptide chain, which AlphaFold 2 predicted from the canonical sequence.

In [7]:
# We are fetching 5IEV chain A in this case
af_struct = fetch_alphafold(uniprot_id)

In [8]:
af_struct

NGLWidget()

<Structure id=AF-P24941-F1 Models=1>
│
├───<Model id=1 Chains=1>
	│
	├───<Polypeptide(L) id=A Residues=298>
	├──────Description: Cyclin-dependent kinase 2


Since AlphaFold DB structures contain only one model and one chain, the handle to the requested chain
is always
```python
structure[1]['A']
```
or, equivalently,
```python
structure.models[0].chains[0]
```

In [9]:
af_chainA = af_struct[1]['A']

In [9]:
af_chainA

NGLWidget()

<Polypeptide(L) id=A Residues=298>
  Description: Cyclin-dependent kinase 2


## Fetch Homology Models from SWISS-MODEL
Similarly, homology models for a given PDB protein model can be obtained from [SWISS-MODEL](https://swissmodel.expasy.org/). However, the downloaded model may contain other chains (polymers and/or ligands).

Unfortunately, since SWISS-MODEL only provides the PDB file format, the parsed structure has limited annotation and may need closer inspection to identify the chain to use.

Moreover, since the template could come from any homologous structure, the ligands that come with the SWISS-MODEL structure are likely not the same as the ones in the original chain.

In [54]:
sm_struct = fetch_swiss_model(uniprot_id)
sm_struct

NGLWidget()

<Structure id=P24941-SwissModel Models=1>
│
├───<Model id=0 Chains=2>
	│
	├───<Polypeptide(L) id=A Residues=295>
	├──────Description: CELL DIVISION PROTEIN KINASE 2
	│
	├───<Heterogens id=B Molecules=1>


Notice that we use **model id 0** here, a consequence of the loosely regulated PDB file format. Specifically, PDB files of single-model structures (such as X-ray crystallography structures) usually omit the *MODEL* keyword.
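To check whether a PDB file declares its models explicitly, one can scan the text for `MODEL` records. The sketch below is plain Python and is not how crimm parses files; `first_model_id` and the sample strings are hypothetical names made up for illustration, and the fallback of 0 simply mirrors the id seen in the output above.

```python
def first_model_id(pdb_text: str, default: int = 0) -> int:
    """Return the serial number of the first MODEL record, or `default` if none exists."""
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            # The serial number follows the record name, e.g. "MODEL        1"
            fields = line[5:].split()
            if fields:
                return int(fields[0])
    return default


# Hypothetical one-atom files: one without and one with an explicit MODEL record
single_model = (
    "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00  0.00           N\n"
    "END\n"
)
multi_model = "MODEL        1\n" + single_model.replace("END\n", "ENDMDL\n") + "END\n"

print(first_model_id(single_model))  # no MODEL record, falls back to 0
print(first_model_id(multi_model))   # explicit MODEL record with serial 1
```

If the model id is not known in advance, indexing by position (`structure.models[0]`, as used earlier in this notebook) avoids the question entirely.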

In [53]:
sm_chainA = sm_struct[0]['A']
sm_chainA

NGLWidget()

<Polypeptide(L) id=A Residues=295>
  Description: CELL DIVISION PROTEIN KINASE 2


## Superimposing Two Polymer Chains
The `ChainSuperimposer` class is derived from Biopython's `Superimposer` class. It superimposes the overall structures of any two chains, whereas the original class only accepts chains with identical residues. A sequence alignment is first performed on the `can_seq` attribute (canonical sequence) of the chains, and the aligned residues are then used for the superimposition.

In [10]:
imposer = ChainSuperimposer()
# in this case, the canonical sequence should be identical
imposer.set_chains(chainA, af_chainA)

imposer.apply_transform(af_struct)
print(f'rmsd = {imposer.rms:.3f}')
imposer.show()

rmsd = 1.375


NGLWidget()
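The `rmsd` value printed above is a root-mean-square deviation over the matched atom pairs. As a minimal sketch of that final distance measure only (it does not perform the rotation and translation fitting, and the `rmsd` helper and coordinates below are made up for illustration), it can be computed in plain Python:

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between one matched pair of atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical coordinates: the second set is the first shifted by 1 Å along z
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(f"rmsd = {rmsd(reference, shifted):.3f}")  # every pair is 1 Å apart → rmsd = 1.000
```

Only residues paired by the sequence alignment enter such a sum, so unaligned regions of either chain do not affect the reported value.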

## Simplified Workflow for Fetching AlphaFold Structures
Additionally, the polypeptide chain object can be used directly to query and fetch AlphaFold or SWISS-MODEL structures with the functions `fetch_alphafold_from_chain` and `fetch_swiss_model_from_chain`. The downloaded structure will be automatically aligned to the given chain.

In [25]:
af_struct = fetch_alphafold_from_chain(chainA)

af_chainA = af_struct[1]['A']
show_nglview_multiple([chainA, af_chainA])

NGLWidget()

## Special Cases on UniProt ID and AlphaFold Structures
Since a UniProt ID refers to the polymer entity, and some structures contain only a partial or heavily modified version of it, the structure downloaded from AlphaFold DB could be a surprise.

In [11]:
struct2 = fetch_rcsb('1CDL', include_solvent=False)

In [12]:
struct2

NGLWidget()

<Structure id=1CDL Models=1>
│
├───<Model id=1 Chains=6>
	│
	├───<Polypeptide(L) id=A Residues=142>
	├──────Description: CALMODULIN
	│
	├───<Polypeptide(L) id=B Residues=19>
	├──────Description: CALCIUM/CALMODULIN-DEPENDENT PROTEIN KINASE TYPE II ALPHA CHAIN
	│
	├───<Heterogens id=I Molecules=1>
	├──────Description: CALCIUM ION
	│
	├───<Heterogens id=J Molecules=1>
	├──────Description: CALCIUM ION
	│
	├───<Heterogens id=K Molecules=1>
	├──────Description: CALCIUM ION
	│
	├───<Heterogens id=L Molecules=1>
	├──────Description: CALCIUM ION


Here, we take the *CALCIUM/CALMODULIN-DEPENDENT PROTEIN KINASE TYPE II ALPHA CHAIN*,
which has only 19 residues in this structure.

In [36]:
chainB = struct2[1]['B']
af_struct2 = fetch_alphafold_from_chain(chainB)

Since the UniProt ID deposited in PDB for `chainB` refers to the entire protein, the entire predicted structure
(with some crazy loops) will be downloaded.

In [38]:
af_chain2 = af_struct2.models[0].chains[0]
af_chain2

NGLWidget()

<Polypeptide(L) id=A Residues=1906>
  Description: Myosin light chain kinase, smooth muscle


In [18]:
# Visualize where the segment is located on the entire chain
view = View()

view.load_entity(chainB)
view.load_entity(af_chain2)
view.highlight_chains([chainB])
view

View()

## Sequence Alignment Procedure inside of ChainSuperimposer
Continuing from the example above, we want to find where `chainB` aligns to its AlphaFold counterpart. This also illustrates what happens inside the `ChainSuperimposer` class when two chains with non-identical sequences are loaded: the Biopython `PairwiseAligner` setup demonstrated below is equivalent to what is implemented in our superimposer.

In [31]:
from Bio.Align import PairwiseAligner

In [39]:
aligner = PairwiseAligner()

In our `ChainSuperimposer`, we apply a heavy penalty to internal gap opening to avoid highly fragmented sequence alignments. A gap is only opened in the middle of the sequence when doing so aligns 20% or more of the residues in the sequence.

In [40]:
# floor division to get ~0.2 of seq length as penalty
penalty_scores = -(len(chainB.can_seq))//5 
aligner.target_internal_open_gap_score = penalty_scores
aligner.query_internal_open_gap_score = penalty_scores
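The penalty expression in the cell above applies the unary minus before the floor division, so the result rounds away from zero. The sketch below works through that arithmetic for a few lengths; `internal_open_gap_score` is a hypothetical helper that only mirrors the expression, and the lengths are arbitrary.

```python
def internal_open_gap_score(seq_length: int) -> int:
    """Mirror the notebook expression -(len(seq))//5 for a given sequence length."""
    return -(seq_length) // 5


# Compare with negating after the division, which rounds toward zero instead
for n in (19, 20, 21):
    print(n, internal_open_gap_score(n), -(n // 5))
# 19 -4 -3
# 20 -4 -4
# 21 -5 -4
```

Under the assumption that each matched residue contributes a score of +1 (this depends on how the aligner is configured), opening an internal gap only pays off when it recovers more matches than the magnitude of this penalty, which is roughly 20% of the sequence length.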

In [41]:
alignments = aligner.align(chainB.can_seq, af_chain2.can_seq)

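We take the top alignment, `alignments[0]`, for the superimposition of the structures. The `aligned` attribute reports each aligned block as a 0-indexed `(start, end)` pair with an exclusive end, so after adding 1 the start is a 1-indexed residue number while the end points one past the last aligned residue. In the example below, the short `chainB` aligns to the AlphaFold structure starting at residue sequence number 1730, and the block ends just before 1750.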

In [47]:
alignments[0].aligned+1 # residue sequence id (resseq) is 1-indexed

array([[[   1,   21]],

       [[1730, 1750]]])
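To read the array above as inclusive residue ranges, only the start of each block needs shifting, because the end reported by `aligned` is exclusive. A minimal sketch in plain Python follows; `to_resseq_ranges` is a hypothetical helper, and the input tuple is the output above with the added 1 removed.

```python
def to_resseq_ranges(aligned):
    """Convert 0-indexed half-open (start, end) blocks into 1-indexed inclusive ranges."""
    target_blocks, query_blocks = aligned
    return [
        # start + 1 is the first residue; the exclusive end equals the last residue
        ((t_start + 1, t_end), (q_start + 1, q_end))
        for (t_start, t_end), (q_start, q_end) in zip(target_blocks, query_blocks)
    ]


# Raw block values, i.e. the array shown above before 1 was added
aligned = (((0, 20),), ((1729, 1749),))
print(to_resseq_ranges(aligned))  # [((1, 20), (1730, 1749))]
```

This assumes residue numbering starts at 1 and follows the canonical sequence without offsets, which may not hold for every experimental structure.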

In [56]:
# To show all the aligned/mismatched residues
print(alignments[0])

target            0 ------------------------------------------------------------
                  0 ------------------------------------------------------------
query             0 MGDVKLVTSTRVSKTSLTLSPSVPAEAPAFTLPPRNIRVQLGATARFEGKVRGYPEPQIT

target            0 ------------------------------------------------------------
                 60 ------------------------------------------------------------
query            60 WYRNGHPLPEGDHYVVDHSIRGIFSLVIKGVQEGDSGKYTCEAANDGGVRQVTVELTVEG

target            0 ------------------------------------------------------------
                120 ------------------------------------------------------------
query           120 NSLKKYSLPSSAKTPGGRLSVPPVEHRPSIWGESPPKFATKPNRVVVREGQTGRFSCKIT

target            0 ------------------------------------------------------------
                180 ------------------------------------------------------------
query           180 GRPQPQVTWTKGDIHLQQNERFNMFEKTGIQYLEIQNVQLADAGIYTCTVVNSAGKASVS

target            0 ----