### Define a function to extract the antibody chain sequences and CDR coordinates
* **Input:**   
  * line: a line from a pdb file; the format of each line is given by **Format_v33_Letter**
  * chain_id: the id of a chain, as given by the **Summary** file. For example, the light chain may be given the id **B**
  * CDR_index, a string, takes values from {"l1", "l2", "l3", "h1", "h2", "h3"}   
  * CDR_range, a list, gives the positions where the **CDR** is located. For example, CDRL1 covers positions 23 to 35, so CDR_range = list(range(23, 36))
  * counter, an integer, the position of the current amino acid in the chain sequence, counted from 0
  * normal_tracker, an integer, stores the residue number of the previous line, so that the change in the normal numbering can be tracked. For example, for **ALA 120, ARG 121** the change is 121 - 120 = 1, and for **ALA 120, ARG 122** the change is 122 - 120 = 2, which means there is a jump.  
  * insersion_tracker, a string, tracks the insertion code of inserted amino acids. For example, for **ALA 120A** the insersion_tracker is **"A"**  
  * break_indexer, an integer computed inside the function, gives the distance of the jump between two amino acids. For example, for **ALA 120, ARG 120A** the change in the normal numbering is 120 - 120 = 0 and the insersion_tracker changes from **" "** to **"A"**, so break_indexer = 0 + 1 = 1  
  * antibody_chain, a dictionary, gives the amino acid sequence of the antibody chain, with key _chain_id_  
  * CDR_coordinates, a dictionary, gives the coordinates of the atoms in the CDRs. For example, CDR_coordinates = {'l1A': [[1.04, 1.11, 0.25, 24], ...]} means that, for the light chain with chain id **A**, there is an atom in CDRL1 with coordinates (1.04, 1.11, 0.25), and this atom is from amino acid 24, which is given by *antibody_chain['A'][24]*  

* **Output:**  
All the returned values have the same meaning as in the input, and they are fed back into ***Get_CDR_chain_and_coordinates*** while iterating over the lines of a *PDB* file.
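
The column slices used by the functions below (`line[17:20]`, `line[21]`, `line[22:26]`, `line[26]`, `line[30:54]`) rely on the fixed-width layout of ATOM records. As a minimal sketch, the hypothetical helper `parse_atom_line` (not part of the original notebook) gathers those slices in one place. The sample record is made up and is built with a format string only so that the columns line up.

```python
def parse_atom_line(line):
    """Split one ATOM record into the fields used in this notebook (hypothetical helper)."""
    return {
        'res_name': line[17:20],       # three-letter amino acid code
        'chain_id': line[21],          # one-character chain id
        'res_seq': int(line[22:26]),   # residue number, compared with normal_tracker
        'i_code': line[26],            # insertion code, compared with insersion_tracker
        'xyz': [float(line[30:38]), float(line[38:46]), float(line[46:54])],
    }

# Made-up ATOM record: alanine 120A on chain B
sample = "ATOM  {:>5d} {:<4s}{:1s}{:>3s} {:1s}{:>4d}{:1s}   {:8.3f}{:8.3f}{:8.3f}".format(
    1, " CA ", " ", "ALA", "B", 120, "A", 1.04, 1.11, 0.25)
fields = parse_atom_line(sample)
print(fields)
```

Naming the fields once makes it easier to check the slices against the format description than repeating the raw indices in every function.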

 

In [5]:
def Get_CDR_chain_and_coordinates(line, chain_id, CDR_index, CDR_range, counter = 0, normal_tracker = 0, insersion_tracker = ' ',
    antibody_chain = None, CDR_coordinates = None):
    # line begins with "ATOM", chain_id is for antibody
    # use None as the default: a mutable default ({}) would be shared between calls
    if antibody_chain is None:
        antibody_chain = {}
    if CDR_coordinates is None:
        CDR_coordinates = {}
    break_indexer = 0 # track the distance between two adjacent ATOM data in terms of amino acids
    residue_number = int(line[22:26])
    # track the normal sequence, to see whether it jumps
    if normal_tracker != 0:
        break_indexer += residue_number - normal_tracker
    else:
        # first ATOM line of the chain: start a new sequence
        antibody_chain[chain_id] = [line[17:20]]
    normal_tracker = residue_number
    # track whether there is any insertion
    if line[26] != insersion_tracker:
        if line[26] != ' ':
            break_indexer += 1
        insersion_tracker = line[26]

    # here we assume there is no jump in the sequence of the antibody chains
    if break_indexer >= 1:
        counter += 1
        antibody_chain[chain_id].append(line[17:20])
    # record the atom if it lies in the requested CDR (lines with a negative step are skipped)
    if break_indexer >= 0 and counter in CDR_range:
        CDR_coordinates.setdefault(CDR_index+chain_id, []).append(
            [float(line[30:38]), float(line[38:46]), float(line[46:54]), counter])

    return counter, normal_tracker, antibody_chain, CDR_coordinates, insersion_tracker
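
The break_indexer arithmetic described in the input list can be checked in isolation. The following is a minimal sketch with a hypothetical function name, `residue_step`, that takes the previous and current residue number and insertion code and returns the same step value the function above computes.

```python
def residue_step(prev_number, prev_icode, number, icode):
    """Return how many residues the current ATOM line is ahead of the previous one."""
    step = number - prev_number          # change in the plain residue number
    if icode != prev_icode and icode != ' ':
        step += 1                        # a new insertion code counts as one residue
    return step

print(residue_step(120, ' ', 120, ' '))  # another atom of the same residue
print(residue_step(120, ' ', 121, ' '))  # next residue
print(residue_step(120, ' ', 120, 'A'))  # inserted residue 120A
print(residue_step(120, ' ', 122, ' '))  # jump in the numbering
```

A step of 0 keeps the counter unchanged, a step of 1 moves to the next amino acid, and a larger step signals a gap in the numbering.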

### Get_antigen_chain_and_coordinates
**All the inputs have the same meaning as in *Get_CDR_chain_and_coordinates*.**

In [6]:
def Get_antigen_chain_and_coordinates(line, chain_id, counter = 0, normal_tracker = 0,
    antigen_chain = None, antigen_coordinates = None, insersion_tracker = ' '):
    # line begins with "ATOM", chain_id is for antigen
    # use None as the default: a mutable default ({}) would be shared between calls
    if antigen_chain is None:
        antigen_chain = {}
    if antigen_coordinates is None:
        antigen_coordinates = {}
    break_indexer = 0 # track the distance between two adjacent ATOM data in terms of amino acids
    residue_number = int(line[22:26])
    # track the normal sequence, to see whether it jumps
    if normal_tracker != 0:
        break_indexer += residue_number - normal_tracker
    else:
        # first ATOM line of the chain: start a new sequence
        antigen_chain[chain_id] = [line[17:20]]
    normal_tracker = residue_number
    # track whether there is any insertion
    if line[26] != insersion_tracker:
        if line[26] != ' ':
            break_indexer += 1
        insersion_tracker = line[26]

    # extend the sequence; if there is a break, insert 'BRK' in the sequence
    if break_indexer == 1:
        counter += 1
        antigen_chain[chain_id].append(line[17:20])
    elif break_indexer >= 2:
        # there is a jump in the sequence, we add a 'BRK' to represent the jump
        counter += 2
        antigen_chain[chain_id].extend(['BRK', line[17:20]])
    # extract the coordinates (lines with a negative step are skipped)
    if break_indexer >= 0:
        antigen_coordinates.setdefault(chain_id, []).append(
            [float(line[30:38]), float(line[38:46]), float(line[46:54]), counter])

    return counter, normal_tracker, antigen_chain, antigen_coordinates, insersion_tracker
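
The effect of the 'BRK' marker on the stored sequence can be seen with a small sketch. The helper name `sequence_with_breaks` and the residue list are made up for illustration, and insertion codes are ignored here to keep the example short.

```python
def sequence_with_breaks(residues):
    """residues: (residue_name, residue_number) pairs in file order, one per residue."""
    sequence = []
    previous_number = None
    for name, number in residues:
        if previous_number is not None and number - previous_number >= 2:
            sequence.append('BRK')   # one marker per gap, whatever the gap length
        sequence.append(name)
        previous_number = number
    return sequence

seq = sequence_with_breaks([('ALA', 120), ('ARG', 121), ('GLY', 125)])
print(seq)
```

Because 'BRK' occupies one slot in the list, the index of the residue after the gap grows by 2, which matches `counter += 2` in the function above.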

### Find_Chain_Coordinates
* **Inputs:**  
 * pdb, an opened pdb file (an iterable of lines)  
 * combined_chain_id, a list given in the form of [heavy chain, light chain, antigen chain]
 
* **Returns:**
 * PDBseq, a dictionary, gives the sequences of the chains. For example, PDBseq = {'A': ['ALA', 'SER', 'THR', 'TYR', ...], 'B': ['GLU', 'ARG', ...]}  
 * Coordinates, a dictionary, gives the coordinates of the whole antigen chain and of the CDRs. It is in the form of {'A': [[0.05, -1.01, 0.20, 7], ...], 'h1H': [[0.03, -2.01, 0.30, 30], ...]}
 


In [7]:
def Find_Chain_Coordinates(pdb, combined_chain_id): #combined_chain_id = [heavy, light, antigen]
    PDBseq = {}
    Coordinates = {}
    counter = 0
    normal_tracker = 0
    insersion_tracker = ' '
    CDRLindex = [list(range(23, 36)), list(range(45, 56)), list(range(88, 97))]
    CDRHindex = [list(range(25, 36)), list(range(46, 65)), list(range(90, 110))]
    for line in pdb:
        if line[:4] == 'ATOM':
            # Find the coordinates of CDRHs; the repeated calls see the same line, so counter moves only once
            if line[21] in combined_chain_id[0]:
                for CDR_index, CDR_range in zip(['h1', 'h2', 'h3'], CDRHindex):
                    (counter, normal_tracker, PDBseq, Coordinates, insersion_tracker) = Get_CDR_chain_and_coordinates(
                        line, line[21], CDR_index, CDR_range, counter, normal_tracker, insersion_tracker, PDBseq, Coordinates)
            # Find the coordinates of CDRLs
            elif line[21] in combined_chain_id[1]:
                for CDR_index, CDR_range in zip(['l1', 'l2', 'l3'], CDRLindex):
                    (counter, normal_tracker, PDBseq, Coordinates, insersion_tracker) = Get_CDR_chain_and_coordinates(
                        line, line[21], CDR_index, CDR_range, counter, normal_tracker, insersion_tracker, PDBseq, Coordinates)
            # Find the coordinates of Antigen
            elif line[21] in combined_chain_id[2]:
                (counter, normal_tracker, PDBseq, Coordinates, insersion_tracker) = Get_antigen_chain_and_coordinates(
                    line, line[21], counter, normal_tracker, PDBseq, Coordinates, insersion_tracker)
        elif line[:3] == 'TER':
            # end of a chain: reset the trackers for the next chain
            counter = 0
            normal_tracker = 0
            insersion_tracker = ' '

    return PDBseq, Coordinates
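
Since the fourth element of every coordinate record is an index into the chain sequence, the two returned dictionaries can be joined to list the amino acids that make up a CDR. The sketch below uses small made-up dictionaries in the documented shape, and the helper name `residues_in_cdr` is hypothetical.

```python
# Made-up data in the shape returned by Find_Chain_Coordinates
PDBseq = {'H': ['GLU', 'VAL', 'GLN', 'LEU']}
Coordinates = {'h1H': [[0.03, -2.01, 0.30, 2], [0.50, -1.20, 0.10, 2], [1.10, -0.40, 0.90, 3]]}

def residues_in_cdr(PDBseq, Coordinates, cdr_key):
    """Return (index, residue name) pairs for the amino acids that have atoms under cdr_key."""
    chain_id = cdr_key[2:]                                   # 'h1H' -> 'H'
    indices = sorted({atom[3] for atom in Coordinates[cdr_key]})
    return [(i, PDBseq[chain_id][i]) for i in indices]

cdr_residues = residues_in_cdr(PDBseq, Coordinates, 'h1H')
print(cdr_residues)
```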

### Distance
* Define a distance function, to calculate the Euclidean distance

In [8]:
def Distance(coordinate1, coordinate2):
    distance_square = 0
    for i in range(0,3):
        distance_square += (coordinate1[i]-coordinate2[i])**2
    distance = distance_square**0.5
    return distance
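
On Python 3.8 or newer, the standard library offers `math.dist`, which computes the same Euclidean distance. A minimal sketch, assuming the coordinate records keep the [x, y, z, index] layout used in this notebook (the lowercase name `distance` is chosen here only to avoid clashing with the function above):

```python
import math

def distance(coordinate1, coordinate2):
    """Euclidean distance between the first three entries of two coordinate records."""
    return math.dist(coordinate1[:3], coordinate2[:3])

d = distance([0.0, 0.0, 0.0, 7], [3.0, 4.0, 0.0, 9])
print(d)
```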

### findContact_sub_function
* **Input:** 
 * CDRs, a list of strings, composed of the keys in *CDR_coordinates*, such as ['h1B', 'h2B', ...]  
 * achain, the chain id of the antigen  
 * coordinates_for_one_pdb, the *Coordinates* returned by the function *Find_Chain_Coordinates*  
 * cutoff, a float, gives the cutoff distance  
 
* **Output**  
 * contact_sub_count, a list with elements in the form of ['l1LA', 24, 30, 7], which means that amino acid number 24 in CDR1 of light chain L is in contact with amino acid number 30 in antigen chain A, and the number of atom pairs within the cutoff between those two amino acids is 7.

 
 

In [9]:
def findContact_sub_function(CDRs, achain, coordinates_for_one_pdb, cutoff):
    contact_sub_all = []
    contact_sub_count =[]
    temp_dict = {}
    for j in CDRs:
        for k in coordinates_for_one_pdb[j]:
            for l in coordinates_for_one_pdb[achain]:
                if Distance(k[:3],l[:3]) <= cutoff:
                    contact_sub_all.append(j+achain+'_'+str(k[3])+'_'+str(l[3]))
    for m in contact_sub_all:
        if m in temp_dict:
            temp_dict[m] += 1
        else:
            temp_dict[m] = 1
    for n in temp_dict:
        temp_list = n.split('_')
        contact_sub_count.append([temp_list[0], int(temp_list[1]), int(temp_list[2]), temp_dict[n]])
    return contact_sub_count  
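
The counting step can also be written with `collections.Counter` and tuple keys, which avoids building a string key and splitting it again. This is an alternative sketch rather than the notebook's method; the function name `count_contacts` and the coordinates are made up.

```python
from collections import Counter
import math

def count_contacts(cdr_atoms, antigen_atoms, cutoff):
    """Count atom pairs within cutoff, grouped by (CDR residue index, antigen residue index)."""
    pairs = Counter()
    for a in cdr_atoms:
        for b in antigen_atoms:
            if math.dist(a[:3], b[:3]) <= cutoff:
                pairs[(a[3], b[3])] += 1
    return pairs

# Two atoms of CDR residue 24, one atom each of antigen residues 30 and 31
cdr = [[0.0, 0.0, 0.0, 24], [1.0, 0.0, 0.0, 24]]
antigen = [[0.0, 3.0, 0.0, 30], [0.0, 9.0, 0.0, 31]]
contacts = count_contacts(cdr, antigen, cutoff=5)
print(dict(contacts))
```

Both atoms of residue 24 lie within 5 of the atom of residue 30, and neither is within 5 of residue 31, so only the pair (24, 30) is counted.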

### findContact
* **Input:**  
 * coordinates_for_one_pdb, as described above  
 * cutoff, as described above  
 * id_dict_for_one_pdb, a list, in the form of [[H, L, A], [H, L, B], ...]. It is the value stored in *id_dict* for one complex, for example under the pdbid '1dee'. Each [H, L, A] gives the ids of the heavy chain, light chain, and antigen chain. There may be more than one such combination.  
 
* **Output:**  
 * contact_count, a list, contains all the four-element contact records (as returned by *findContact_sub_function*) for one pdb file.
 

In [10]:
def findContact(coordinates_for_one_pdb, id_dict_for_one_pdb, cutoff):
    ct = cutoff
    contact_count = []
    for i in id_dict_for_one_pdb: # i = [heavy chain id, light chain id, antigen chain id]
#find contact between CDRHs and the Antigen
        if i[2] != '' and i[0] != '':
            CDRHs = ['h1'+i[0], 'h2'+i[0], 'h3'+i[0]]
            contact_count.extend(findContact_sub_function(CDRHs, i[2], coordinates_for_one_pdb, ct))
#find contact between CDRLs and the Antigen                   
        if i[2] != '' and i[1] != '':           
            CDRLs = ['l1'+i[1], 'l2'+i[1], 'l3'+i[1]]
            contact_count.extend(findContact_sub_function(CDRLs, i[2], coordinates_for_one_pdb, ct))           
    return contact_count    

### Id_dict  
* **Input:**  
 * file, the lines of the *summary* file, which gives the basic information about the antibody-antigen complexes  
* **Output:**
 * id_dict, a dictionary, in the form of {'1dee': [[H, L, A], [H, L, B]]}


In [11]:
def Id_dict (file):
    id_dict = {}
    for l in file:
#        Make sure the line is long enough
        if len(l) >= 16:
            a = l.split('\t')
#        Deal with the | in a[4]            
            for i in a[4].split('|') :
                temp = [a[1].strip(), a[2].strip(), i.strip()]
                for j in range(0,3):
                    if temp[j] == 'NA':
                        temp[j] =''                       
                if a[0].strip() in id_dict:
                    id_dict[a[0].strip()].append (temp)
                else:
                    id_dict[a[0].strip()] = [temp]                    
    return id_dict  
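
The handling of `|` and `NA` for a single line can be sketched as follows. The helper name `parse_summary_line` and the sample line are made up; the column positions (0 for the pdb id, 1 for the heavy chain, 2 for the light chain, 4 for the antigen chains) are the ones the function above reads.

```python
def parse_summary_line(line):
    """Return the pdb id and one [heavy, light, antigen] triple per antigen chain."""
    columns = line.split('\t')
    heavy, light = columns[1].strip(), columns[2].strip()
    triples = []
    for antigen in columns[4].split('|'):
        triple = [heavy, light, antigen.strip()]
        triples.append(['' if c == 'NA' else c for c in triple])   # 'NA' means no chain
    return columns[0].strip(), triples

pdb_id, triples = parse_summary_line('1abc\tH\tNA\t0\tA | B\n')
print(pdb_id, triples)
```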
    

### Combined_chain_id_dict  
* **Input**  
 * id_dict, the output of Id_dict  
* **Return:**  
 * combined_chain_id_dict, in the form of {'1dee': [H, L, AB]}
 

In [12]:
def Combined_chain_id_dict (id_dict):
    combined_chain_id_dict = {}
    for i in id_dict:
        temp = ['' ,'' ,'' ]        
        for j in id_dict[i]:
            temp = [temp[0]+j[0], temp[1]+j[1], temp[2]+j[2]]
        combined_chain_id_dict[i] = temp
    return combined_chain_id_dict   
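
The combination is a column-wise concatenation of the chain ids. As a small worked example, the entries for '1bvk' and '1dee' that appear in the output further down in this notebook can be combined with `zip`:

```python
triples_1bvk = [['B', 'A', 'C'], ['E', 'D', 'F']]
triples_1dee = [['B', 'A', ''], ['D', 'C', 'G'], ['F', 'E', 'H']]

def combine(triples):
    """Join the heavy, light and antigen ids column by column."""
    return [''.join(column) for column in zip(*triples)]

combined_1bvk = combine(triples_1bvk)
combined_1dee = combine(triples_1dee)
print(combined_1bvk, combined_1dee)
```

An empty antigen id, as in the first triple of '1dee', simply contributes nothing to the combined string.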

### Here_iddict_combineddict  
* **Input:**  
 * id_dict, combined_chain_id_dict are the returns of the above two functions  
* **Output:** 
 * here_id_dict, here_combined_dict, are in the same form as the returns of the above two functions. However, those dictionaries only cover the pdb files in the current working directory. Thus this function can be used for small scale testing.

In [13]:
def Here_iddict_combineddict(id_dict, combined_chain_id_dict):
    here_id_dict = {}
    here_combined_dict = {}
    names = os.listdir()
    for f in names:
        if len(f) == 8 and f[5:8] == 'pdb':
            if f[:4] in id_dict:
                here_id_dict[f[:4]] = id_dict[f[:4] ]
            if f[:4] in combined_chain_id_dict:
                here_combined_dict[f[:4]] = combined_chain_id_dict[f[:4] ] 
    return here_id_dict, here_combined_dict
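
The file-name test (`len(f) == 8 and f[5:8] == 'pdb'`) can also be expressed with `pathlib`. The sketch below is an alternative, not the notebook's code: the helper name `pdb_ids_in` is hypothetical, and a temporary directory with empty files stands in for the working directory.

```python
from pathlib import Path
import tempfile

def pdb_ids_in(directory='.'):
    """Return the four-character ids of the *.pdb files found in directory."""
    return sorted(p.stem for p in Path(directory).glob('*.pdb') if len(p.stem) == 4)

# Demonstration on a temporary directory with a few empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ('1dee.pdb', '1adq.pdb', 'summary.tsv', 'README.md'):
        (Path(tmp) / name).touch()
    found = pdb_ids_in(tmp)
print(found)
```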

### main  
* **Input:**  
 * here_iddict_combineddict, the tuple (here_id_dict, here_combined_dict) returned by *Here_iddict_combineddict*  
* **Return:**  
 * sequence_and_coordinates, contact, they are the returns of *Find_Chain_Coordinates* and *findContact*.


In [14]:
def main(here_iddict_combineddict):
    sequence_and_coordinates = {}
    n = 0
    for i in here_iddict_combineddict[1]:
        n += 1
        print('extracting sequence and coordinates of '+ i + '.pdb...'+ str(n))
        with open(i+'.pdb', 'r') as f:
            sequence_and_coordinates[i] = Find_Chain_Coordinates(f, here_iddict_combineddict[1][i])
    n = 0
    contact = {}
    for i in sequence_and_coordinates:
        n += 1
        print('Counting contact of '+ i + '.pdb...'+ str(n))
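        try:
            contact[i] = findContact(sequence_and_coordinates[i][1], here_iddict_combineddict[0][i], cutoff = 5)
        except Exception as e:
            # typically a KeyError: a CDR or antigen key is missing from the coordinates
            print('Check '+ i+' again: ' + repr(e))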
    return sequence_and_coordinates, contact

### Deals with the working directory
* **The summary file and the pdb files should be in the current working directory**

In [15]:
import os
os.getcwd()
os.listdir()

['.git',
 '.gitignore',
 '.ipynb_checkpoints',
 '1a14.pdb',
 '1a2y.pdb',
 '1adq.pdb',
 '1bog.pdb',
 '1bvk.pdb',
 '1dee.pdb',
 '1g9m.pdb',
 '1g9n.pdb',
 '1gc1.pdb',
 '1h0d.pdb',
 '1hez.pdb',
 '1i9r.pdb',
 '1ikf.pdb',
 '1jrh.pdb',
 '2hrp.pdb',
 '2j88.pdb',
 '5kel.pdb',
 'AAC-1.ipynb',
 'AAC.py',
 'AACS.py',
 'AAC_1.py',
 'AntigenAntibodyComplex.py',
 'ChainDictRelation.py',
 'Contact_with_afinity.json',
 'Debugging.py',
 'Format_v33_Letter.pdf',
 'IDhere.py',
 'Import.py',
 'output.csv',
 'README.md',
 'Reduce_Contact.py',
 'Simple_Analysis.py',
 'summary.tsv',
 '__pycache__']

### Some preparations for the ids

In [16]:
with open('summary.tsv', 'r') as summary:
    file = summary.readlines()
    id_dict = Id_dict(file)
    combined_chain_id_dict = Combined_chain_id_dict(id_dict)
    here_iddict_combineddict = Here_iddict_combineddict(id_dict, combined_chain_id_dict)

In [17]:
here_iddict_combineddict

({'1adq': [['H', 'L', 'A']],
  '1bvk': [['B', 'A', 'C'], ['E', 'D', 'F']],
  '1dee': [['B', 'A', ''], ['D', 'C', 'G'], ['F', 'E', 'H']],
  '1g9m': [['H', 'L', 'G']],
  '1g9n': [['H', 'L', 'G']],
  '1gc1': [['H', 'L', 'G']],
  '1h0d': [['B', 'A', 'C']],
  '1hez': [['D', 'C', 'E'], ['B', 'A', 'E']],
  '1i9r': [['K', 'M', 'B'], ['X', 'Y', 'C'], ['H', 'L', 'A']],
  '1ikf': [['H', 'L', 'C']],
  '5kel': [['Q', 'U', 'I'],
   ['C', 'D', 'A'],
   ['P', 'T', 'G'],
   ['H', 'L', 'B'],
   ['M', 'O', 'F'],
   ['J', 'N', 'E']]},
 {'1adq': ['H', 'L', 'A'],
  '1bvk': ['BE', 'AD', 'CF'],
  '1dee': ['BDF', 'ACE', 'GH'],
  '1g9m': ['H', 'L', 'G'],
  '1g9n': ['H', 'L', 'G'],
  '1gc1': ['H', 'L', 'G'],
  '1h0d': ['B', 'A', 'C'],
  '1hez': ['DB', 'CA', 'EE'],
  '1i9r': ['KXH', 'MYL', 'BCA'],
  '1ikf': ['H', 'L', 'C'],
  '5kel': ['QCPHMJ', 'UDTLON', 'IAGBFE']})

### Run

In [18]:
sequence_and_coordinates, contact = main(here_iddict_combineddict)

extracting sequence and coordinates of 1adq.pdb...1
extracting sequence and coordinates of 1bvk.pdb...2
extracting sequence and coordinates of 1dee.pdb...3
extracting sequence and coordinates of 1g9m.pdb...4
extracting sequence and coordinates of 1g9n.pdb...5
extracting sequence and coordinates of 1gc1.pdb...6
extracting sequence and coordinates of 1h0d.pdb...7
extracting sequence and coordinates of 1hez.pdb...8
extracting sequence and coordinates of 1i9r.pdb...9
extracting sequence and coordinates of 1ikf.pdb...10
extracting sequence and coordinates of 5kel.pdb...11
Counting contact of 1adq.pdb...1
Counting contact of 1bvk.pdb...2
Counting contact of 1dee.pdb...3
Counting contact of 1g9m.pdb...4
Counting contact of 1g9n.pdb...5
Counting contact of 1gc1.pdb...6
Counting contact of 1h0d.pdb...7
Counting contact of 1hez.pdb...8
Counting contact of 1i9r.pdb...9
Counting contact of 1ikf.pdb...10
Counting contact of 5kel.pdb...11


### Save the results as json

In [19]:
import json

with open('seq_and_coordinates_current', 'w') as f:
    json.dump(sequence_and_coordinates, f)
    
with open('contact_current', 'w') as f:
    json.dump(contact, f)    
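
One point to keep in mind when reading these files back: JSON has no tuple type, so each (PDBseq, Coordinates) tuple stored in `sequence_and_coordinates` comes back as a list. A minimal round-trip sketch with made-up data:

```python
import json

# Made-up entry in the shape produced by main
result = {'1abc': ({'H': ['GLU', 'VAL']}, {'h1H': [[0.03, -2.01, 0.30, 1]]})}
restored = json.loads(json.dumps(result))

print(type(result['1abc']).__name__, type(restored['1abc']).__name__)
print(restored['1abc'][1]['h1H'][0])
```

Indexing with `[0]` and `[1]` still works after loading, so code that only indexes the pair does not need to change.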

In [None]:
# with open('contactdict1.json', 'r') as f:  # open for reading; 'w' would erase the file
#     data = json.load(f)
# data.keys()
# data['1i9r']