# Algorithms in Structural Biology
## Assignment 2 
### Part 1
##### Andrinopoulou Christina (ds2200013)

First, we import the packages used throughout the notebook. If you followed the instructions in the README and activated the `struc_bio` environment, the next cell runs without problems; otherwise, install any missing packages first.

In [1]:
import pandas as pd
import numpy as np
import math
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser(PERMISSIVE=1)
import warnings
from Bio.PDB.PDBExceptions import PDBConstructionWarning
warnings.simplefilter('ignore', PDBConstructionWarning)

### Structure Information

The function *get_characteristics_of_structure* takes a PDB ID, which must match the name of the PDB file that contains the structure, and prints some of its properties. These are the answers to the first question of Part 1 of this assignment.

The function reads the PDB file with the Bio.PDB parser. It finds the number of chains, the number of residues per chain (both with and without ligands and water molecules), the number of water molecules, and the ligands present in the structure.

In [2]:
def get_characteristics_of_structure(name):
    name = name.lower()
    structure = parser.get_structure(name, name+'.pdb')

    water_counter = 0 
    ligands = set()
    chains_counter = 0
    residues_dict = {}
    for model in structure:
        for chain in model:
            residues_counter = 0
            residues_counter_without = 0
            chains_counter += 1
            for residue in chain:
                flag = False
                hetero_flag = residue.get_id()[0]
                if hetero_flag == 'W':
                    flag = True
                    water_counter += 1
                if hetero_flag != ' ' and hetero_flag != 'W': # heteroatom (ligand)
                    flag = True
                    ligands.add(residue.id[0])
                residues_counter += 1
                if not flag:
                    residues_counter_without += 1
            residues_dict[chain.get_id()] = (residues_counter, residues_counter_without)

    print(f'--------- {name.upper()} ---------')
    print(f'The total number of chains is {chains_counter}')        
    print('The number of residues per chain is:')
    for chain_id, res in residues_dict.items():
        print(f'Chain {chain_id} contains {res[0]} residues (with ligands and water molecules) and {res[1]} residues (without).')
    print(f'Number of water molecules is {water_counter}')
    print(f'The ligands that present in the structure are {ligands}')
    
    return residues_dict, water_counter, ligands

In [3]:
struc_7neh = get_characteristics_of_structure(name='7neh')

--------- 7NEH ---------
The total number of chains is 4
The number of residues per chain is:
Chain H contains 468 residues (with ligands and water molecules) and 219 residues (without).
Chain L contains 387 residues (with ligands and water molecules) and 215 residues (without).
Chain E contains 296 residues (with ligands and water molecules) and 196 residues (without).
Chain A contains 3 residues (with ligands and water molecules) and 0 residues (without).
Number of water molecules is 496
The ligands that present in the structure are {'H_ CL', 'H_NO3', 'H_NAG', 'H_EDO', 'H_PEG', 'H_FUC', 'H_SO4'}


In [4]:
struc_7neg = get_characteristics_of_structure(name='7neg')

--------- 7NEG ---------
The total number of chains is 4
The number of residues per chain is:
Chain H contains 285 residues (with ligands and water molecules) and 217 residues (without).
Chain L contains 258 residues (with ligands and water molecules) and 214 residues (without).
Chain E contains 213 residues (with ligands and water molecules) and 183 residues (without).
Chain A contains 3 residues (with ligands and water molecules) and 0 residues (without).
Number of water molecules is 134
The ligands that present in the structure are {'H_NAG', 'H_GOL', 'H_FUC', 'H_SO4'}
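
The classification used in the function above relies on the first field of the Bio.PDB residue id: a blank for standard residues, `'W'` for water, and `'H_'` followed by the residue name for other hetero residues. A minimal sketch of that test in isolation follows; the helper name `classify_residue` and the example ids are hypothetical and written by hand, not read from the PDB files.

```python
def classify_residue(res_id):
    """Classify a Bio.PDB-style residue id tuple (hetero_flag, resseq, icode)."""
    hetero_flag = res_id[0]
    if hetero_flag == ' ':
        return 'standard'   # residue read from an ATOM record
    if hetero_flag == 'W':
        return 'water'      # water molecule
    return 'ligand'         # any other hetero residue, flagged as 'H_<resname>'


# Hypothetical ids, in the shape returned by residue.get_id()
example_ids = [(' ', 334, ' '), ('W', 701, ' '), ('H_NAG', 601, ' '), ('H_SO4', 602, ' ')]

counts = {}
for res_id in example_ids:
    kind = classify_residue(res_id)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

Because the function above stores the hetero flag in a set, it reports the distinct ligand types in the structure, not how many copies of each ligand are present.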


### Determine the R.M.S.D. between the receptor binding domain of the SARS-CoV-2 Spike glycoprotein complex and its mutant
At this step, we calculate the cRMSD between the receptor binding domain (RBD) of the SARS-CoV-2 Spike glycoprotein complex and its mutant. For this purpose, we use only chain E of both structures.

The function below takes a PDB ID, which must match the name of the PDB file of the structure under examination, together with the boundaries of the region of interest. The boundaries are the residue names and residue numbers of the first and last residue of the part of the chain that we examine. The report gives more details on how these boundaries were chosen.

The function uses the Bio.PDB parser and returns the atoms, with their 3D coordinates, of the residues inside the boundaries (inclusive), together with the list of those residues.

In [5]:
def get_RBD_atoms_of_structure(name, start_res_name, start_res_code, stop_res_name, stop_res_code):
    name = name.lower()
    structure = parser.get_structure(name, name+'.pdb')
    rbd_flag = False
    receptor_binding_domain = []
    residues_list = []

    for model in structure:
        for chain in model:
            counter = 0
            if chain.id == 'E': # Spike Glycoprotein
                for residue in chain:
                    name = residue.get_resname() 
                    code = residue.get_id()[1]  # residue sequence number, read from the id tuple
                        
                    if name == start_res_name and code == start_res_code: # start of the receptor binding domain
                        rbd_flag = True                         
                    if name == stop_res_name and code == stop_res_code: # end of the receptor binding domain
                        residues_list.append((name, code))
                        for atom in residue:
                            receptor_binding_domain.append((name, code, atom.get_name(), atom.get_coord()))
                        rbd_flag = False
                        
                    if rbd_flag:
                        residues_list.append((name, code))
                        for atom in residue:
                            receptor_binding_domain.append((name, code, atom.get_name(), atom.get_coord()))
    return receptor_binding_domain, residues_list

In [6]:
rbd_7neh, residues_7neh = get_RBD_atoms_of_structure(name='7neh', start_res_name='ASN', start_res_code=334, stop_res_name='GLU', stop_res_code=516)
print(residues_7neh)

[('ASN', 334), ('LEU', 335), ('CYS', 336), ('PRO', 337), ('PHE', 338), ('GLY', 339), ('GLU', 340), ('VAL', 341), ('PHE', 342), ('ASN', 343), ('ALA', 344), ('THR', 345), ('ARG', 346), ('PHE', 347), ('ALA', 348), ('SER', 349), ('VAL', 350), ('TYR', 351), ('ALA', 352), ('TRP', 353), ('ASN', 354), ('ARG', 355), ('LYS', 356), ('ARG', 357), ('ILE', 358), ('SER', 359), ('ASN', 360), ('CYS', 361), ('VAL', 362), ('ALA', 363), ('ASP', 364), ('TYR', 365), ('SER', 366), ('VAL', 367), ('LEU', 368), ('TYR', 369), ('ASN', 370), ('SER', 371), ('ALA', 372), ('SER', 373), ('PHE', 374), ('SER', 375), ('THR', 376), ('PHE', 377), ('LYS', 378), ('CYS', 379), ('TYR', 380), ('GLY', 381), ('VAL', 382), ('SER', 383), ('PRO', 384), ('THR', 385), ('LYS', 386), ('LEU', 387), ('ASN', 388), ('ASP', 389), ('LEU', 390), ('CYS', 391), ('PHE', 392), ('THR', 393), ('ASN', 394), ('VAL', 395), ('TYR', 396), ('ALA', 397), ('ASP', 398), ('SER', 399), ('PHE', 400), ('VAL', 401), ('ILE', 402), ('ARG', 403), ('GLY', 404), ('ASP

In [7]:
rbd_7neg, residues_7neg = get_RBD_atoms_of_structure(name='7neg', start_res_name='ASN', start_res_code=334, stop_res_name='GLU', stop_res_code=516)
print(residues_7neg)

[('ASN', 334), ('LEU', 335), ('CYS', 336), ('PRO', 337), ('PHE', 338), ('GLY', 339), ('GLU', 340), ('VAL', 341), ('PHE', 342), ('ASN', 343), ('ALA', 344), ('THR', 345), ('ARG', 346), ('PHE', 347), ('ALA', 348), ('SER', 349), ('VAL', 350), ('TYR', 351), ('ALA', 352), ('TRP', 353), ('ASN', 354), ('ARG', 355), ('LYS', 356), ('ARG', 357), ('ILE', 358), ('SER', 359), ('ASN', 360), ('CYS', 361), ('VAL', 362), ('ALA', 363), ('ASP', 364), ('TYR', 365), ('SER', 366), ('VAL', 367), ('LEU', 368), ('TYR', 369), ('ASN', 370), ('SER', 371), ('ALA', 372), ('SER', 373), ('PHE', 374), ('SER', 375), ('THR', 376), ('PHE', 377), ('LYS', 378), ('CYS', 379), ('TYR', 380), ('GLY', 381), ('VAL', 382), ('SER', 383), ('PRO', 384), ('THR', 385), ('LYS', 386), ('LEU', 387), ('ASN', 388), ('ASP', 389), ('LEU', 390), ('CYS', 391), ('PHE', 392), ('THR', 393), ('ASN', 394), ('VAL', 395), ('TYR', 396), ('ALA', 397), ('ASP', 398), ('SER', 399), ('PHE', 400), ('VAL', 401), ('ILE', 402), ('ARG', 403), ('GLY', 404), ('ASP
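
The boundary logic of `get_RBD_atoms_of_structure` can be illustrated on a plain list of `(name, number)` pairs, without parsing a file. This is only a sketch under simplifying assumptions: `select_segment` is a hypothetical helper, and the short chain is a toy input whose residues 334 to 337 follow the printed output above, while residue 333 is made up for illustration.

```python
def select_segment(residues, start, stop):
    """Return the residues from `start` to `stop`, both included.

    `residues` is a list of (name, number) pairs in chain order;
    `start` and `stop` are (name, number) pairs marking the boundaries.
    """
    selected = []
    inside = False
    for residue in residues:
        if residue == start:
            inside = True                # first residue of the segment
        if inside:
            selected.append(residue)
            if residue == stop:
                break                    # last residue of the segment reached
    return selected


# Toy chain; ('XXX', 333) is a placeholder residue outside the segment
toy_chain = [('XXX', 333), ('ASN', 334), ('LEU', 335), ('CYS', 336), ('PRO', 337)]
segment = select_segment(toy_chain, start=('ASN', 334), stop=('CYS', 336))
print(segment)
```

Matching on both the residue name and the number, as the notebook does, guards against selecting a boundary in a file whose numbering does not correspond to the expected sequence.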

The function below compares the residues of two structures position by position and prints the mutations: whenever a residue of the first structure differs from the residue at the same position in the second, it prints a message.

In this case, residue 501 is an asparagine in 7NEH and a tyrosine in 7NEG.

In [8]:
def find_mutations(residues1, residues2, name1, name2):    
    for (res1, res2) in zip(residues1, residues2):
        if res1 != res2: # mutations
            print(f'The residue {res1} of the {name1} has become {res2} in {name2}')

In [9]:
find_mutations(residues_7neh, residues_7neg, '7NEH', '7NEG')

The residue ('ASN', 501) of the 7NEH has become ('TYR', 501) in 7NEG
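
Pairing the two lists with `zip` works here because both selections contain the same residue numbers in the same order. If one structure had unresolved residues inside the selected range, every later pair would be shifted and reported as a mutation. A possible alternative, sketched below, matches residues by number instead of by list position; `find_mutations_by_number` is a hypothetical helper, and the toy lists are illustrative apart from the ASN/TYR pair at position 501.

```python
def find_mutations_by_number(residues1, residues2):
    """Compare two (name, number) lists by residue number, not by list position."""
    names2 = {number: name for name, number in residues2}
    mutations = []
    for name1, number in residues1:
        name2 = names2.get(number)
        # residues missing from the second structure are skipped, not reported
        if name2 is not None and name2 != name1:
            mutations.append((number, name1, name2))
    return mutations


# Toy input: residue 500 is absent from the second list (hypothetical gap)
first = [('THR', 500), ('ASN', 501), ('GLY', 502)]
second = [('TYR', 501), ('GLY', 502)]
print(find_mutations_by_number(first, second))
```

On this toy input a positional comparison would flag both pairs it sees, while the number-based version reports only position 501.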


The function below writes the coordinates of the Cα atoms to a text file, one atom per line.

In [10]:
def write_CA_at_txt(rbd_list, filename):
    ca = []
    for item in rbd_list:
        residue_name = item[0]
        residue_code = item[1]
        atom = item[2]
        coordinates = item[3]
        if atom == 'CA':
            ca.append((coordinates[0], coordinates[1], coordinates[2]))
            
    with open(filename, 'w') as fp:
        fp.write('\n'.join('%s %s %s' % x for x in ca))

In [11]:
write_CA_at_txt(rbd_7neh, 'CA_7NEH.txt')
write_CA_at_txt(rbd_7neg, 'CA_7NEG.txt')

The function *write_atoms_at_txt* writes the coordinates of all atoms of the selected residues to a text file. For residue 501 (asparagine in 7NEH, tyrosine in 7NEG), it writes only the atoms that are common to both residues, so that the mutated residue contributes the same atoms to both files.

In [12]:
def write_atoms_at_txt(rbd_list, filename):
    atoms = []
    excluded_7neh = ['OD1', 'ND2']
    excluded_7neg = ['CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OH']
    for item in rbd_list:
        residue_name = item[0]
        residue_code = item[1]
        atom = item[2]
        coordinates = item[3]
        if residue_code == 501:
            if residue_name == 'ASN':
                if atom in excluded_7neh:
                    continue
            if residue_name == 'TYR':
                if atom in excluded_7neg:
                    continue
        atoms.append((coordinates[0], coordinates[1], coordinates[2]))
            
    with open(filename, 'w') as fp:
        fp.write('\n'.join('%s %s %s' % x for x in atoms))

In [13]:
write_atoms_at_txt(rbd_7neh, 'ATOMS_7NEH.txt')
write_atoms_at_txt(rbd_7neg, 'ATOMS_7NEG.txt')
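
The two exclusion lists in `write_atoms_at_txt` are hard-coded for the ASN/TYR pair. As a sketch, the same lists can be derived from the heavy-atom names of the two residues with a set intersection; this assumes standard PDB atom names and no hydrogens or terminal atoms, and `common_atoms` is a hypothetical helper.

```python
# Heavy-atom names of asparagine and tyrosine in standard PDB nomenclature
ASN_ATOMS = ['N', 'CA', 'C', 'O', 'CB', 'CG', 'OD1', 'ND2']
TYR_ATOMS = ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD1', 'CD2', 'CE1', 'CE2', 'CZ', 'OH']


def common_atoms(atoms1, atoms2):
    """Return the atom names present in both lists, in the order of the first."""
    shared = set(atoms1) & set(atoms2)
    return [atom for atom in atoms1 if atom in shared]


shared = common_atoms(ASN_ATOMS, TYR_ATOMS)
excluded_asn = [atom for atom in ASN_ATOMS if atom not in shared]
excluded_tyr = [atom for atom in TYR_ATOMS if atom not in shared]

print(shared)        # backbone atoms plus CB and CG
print(excluded_asn)  # atoms dropped from ASN 501
print(excluded_tyr)  # atoms dropped from TYR 501
```

The derived exclusion lists coincide with `excluded_7neh` and `excluded_7neg` in the function above.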

## cRMSD

The class *cRMSD* contains all the functions needed to calculate the cRMSD distance between two structures. It was implemented for the first assignment of the course and is reused here with some modifications to the way it reads its input.

Steps of the algorithm:
- Find the centroid of each conformation
- Move the conformations to the origin: subtract the centroid from each coordinate
- Singular Value Decomposition (SVD): find the best rotation Q for one conformation
- Apply the transformation to that conformation
- Calculate the corresponding cRMSD distance
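
The steps above can be sketched in a few lines of NumPy. This is a compact illustration under the same conventions as the class (rows are atoms, the rotation is applied on the right as `X @ Q`), not a replacement for it; the function name `kabsch_rmsd` and the toy coordinates are assumptions made for the example.

```python
import numpy as np


def kabsch_rmsd(X, Y):
    """cRMSD between two (n, 3) coordinate arrays after optimal superposition."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Steps 1-2: subtract the centroids so both conformations sit at the origin
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Step 3: SVD of the 3x3 matrix X^T Y gives the best orthogonal transformation
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    if np.linalg.det(U @ Vt) < 0:
        # a reflection was found: flip the third column of U to get a proper rotation
        U[:, 2] *= -1
    Q = U @ Vt
    # Steps 4-5: rotate the first conformation and measure the remaining deviation
    diff = Xc @ Q - Yc
    return float(np.sqrt((diff ** 2).sum() / len(X)))


# Toy check: Y is X rotated by 90 degrees about the z axis and then translated,
# so the distance after superposition is expected to be zero up to rounding error
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Y = X @ R.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(X, Y))
```

The determinant check matters when the best orthogonal transformation is a reflection: a mirror image cannot be reached by rotating a molecule, so the sign of the direction with the smallest singular value is flipped to obtain the best proper rotation.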

In [14]:
class cRMSD:
    def __init__(self, filename1, filename2):
        self.filename1 = filename1
        self.filename2 = filename2
        self.conformations, self.number_of_atoms = self.read_conformations()
        self.centroid_1 = []
        self.centroid_2 = []
        self.U = np.empty((3,3))
        self.Sigma = np.empty((0,3))
        self.VT = np.empty((3,3))
        self.Q = np.empty((3,3))
        
    
    # read the conformations from the txt file
    def read_conformations(self):     
        conformations_dict = dict()
        df1 = pd.read_csv(self.filename1, delimiter = " ", header=None)
        df2 = pd.read_csv(self.filename2, delimiter = " ", header=None)
        conformations_dict[0] = df1
        conformations_dict[1] = df2
        return conformations_dict, df1.shape[0]
        
        
    # calculate the centroid
    def find_centroid(self, conformation1, conformation2):
        sum_result1 = conformation1.sum(axis = 0)
        sum_result2 = conformation2.sum(axis = 0)
        self.centroid_1 = [sum_result1[i]/self.number_of_atoms for i in range(len(sum_result1))]
        self.centroid_2 = [sum_result2[i]/self.number_of_atoms for i in range(len(sum_result2))]
        
        
    # move the conformations to the origin
    def move_to_origin(self, conformation1, conformation2):
        number_of_cols = conformation1.shape[1]
        for i in range(number_of_cols):
            conformation1[i] -= self.centroid_1[i]
            conformation2[i] -= self.centroid_2[i]
        return conformation1, conformation2
        
       
    # find the best transformation of one conformation, using SVD
    def SVD_process(self, conformation1, conformation2):
        XT_Y = np.matmul(conformation1.T.to_numpy(), conformation2.to_numpy())
        self.U, self.Sigma, self.VT = np.linalg.svd(XT_Y)
        self.Q = np.matmul(self.U, self.VT)
        detQ = np.linalg.det(self.Q)
        if detQ < 0:
            # Q is a reflection: flip the third column of U (the direction of the
            # smallest singular value) to obtain the optimal proper rotation
            self.U[:, 2] = -self.U[:, 2]
            self.Q = np.matmul(self.U, self.VT)
        
    
    # calculate cRMSD
    def cRMSD_distance(self, conformation1, conformation2):
        temp = np.matmul(conformation1.to_numpy(), self.Q) - conformation2.to_numpy()
    
        sum_norms = 0
        for i in range(temp.shape[0]):
            sum_norms += pow(np.linalg.norm(temp[i]),2)
        return math.sqrt(sum_norms/self.number_of_atoms)
     
       
    # compare two conformations
    def compare(self, conformation1, conformation2, print_flag=False):
        self.find_centroid(conformation1, conformation2)
        conformation1, conformation2 = self.move_to_origin(conformation1, conformation2)
        self.SVD_process(conformation1, conformation2)
        c_rmsd = self.cRMSD_distance(conformation1, conformation2)
        if print_flag:
            print(f'cRMSD = {c_rmsd}')
        return c_rmsd
    
    
    def pipeline(self):
        # the conformations are already loaded in __init__, so compare them directly
        self.compare(conformation1=self.conformations[0], conformation2=self.conformations[1], print_flag=True)

In [15]:
print('The distance between the atoms of the 7NEH and the 7NEG is:')
crmsd = cRMSD(filename1='ATOMS_7NEH.txt', filename2='ATOMS_7NEG.txt')
crmsd.pipeline()

The distance between the atoms of the 7NEH and the 7NEG is:
cRMSD = 0.6531414085029266


In [16]:
print('The distance between the Ca atoms of the 7NEH and the 7NEG is:')
crmsd = cRMSD(filename1='CA_7NEH.txt', filename2='CA_7NEG.txt')
crmsd.pipeline()

The distance between the Ca atoms of the 7NEH and the 7NEG is:
cRMSD = 0.2923968635660831
