## This file extracts contact information under some constraints
* **Constraints:**  
 * length, gives the frame length  
 * ref_chain, gives the reference chain; it takes the value 'Ab' or 'Ag'

For example, *ref_chain = 'Ag', length = 2* means we look for the run of 2 consecutive amino acids in the antigen that has the largest contact number among all runs of consecutive amino acids of the same length in the antigen. Let's call those consecutive amino acids aa_max.  
**Here, we give aa_max for every length no larger than *length*.**

### Group
* **Inputs:**  
 * four_coordinates, a list containing all the interactions of one pdb  
* **Returns:**  
 * grouped, a dictionary of four coordinates, with keys h1HA, h2HA,...

In [1]:
def group(four_coordinates):
    grouped = {}
    for i in four_coordinates:
        if i[0] not in grouped:
            grouped[i[0]] = [i]
        else:
            grouped[i[0]].append(i)
    return grouped
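
As a quick illustration of the grouping step, here is a minimal sketch on made-up data. The record layout `[key, antibody position, antigen position, contact number]` is an assumption read off from the indices the other functions in this notebook use, and `group_sketch` and `toy` are hypothetical names. The sketch uses `collections.defaultdict`, which should behave the same as the explicit `if/else` in `group`.

```python
from collections import defaultdict


def group_sketch(four_coordinates):
    """Group records by their first element (the key such as 'h1HA')."""
    grouped = defaultdict(list)
    for record in four_coordinates:
        grouped[record[0]].append(record)
    # Convert back to a plain dict so missing keys raise KeyError as usual.
    return dict(grouped)


# Made-up records, assumed layout: [key, Ab position, Ag position, contact number]
toy = [
    ['h1HA', 30, 16, 21],
    ['h1HA', 31, 17, 16],
    ['h2HA', 52, 40, 47],
]

grouped = group_sketch(toy)
print(sorted(grouped))        # ['h1HA', 'h2HA']
print(len(grouped['h1HA']))   # 2
```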

### extract_aa_pos
* **Inputs:**  
 * grouped_list, a list of four coordinates, for example, a list of four coordinates of 'h1HA' of '1adq'.  
 * ref_chain, a string, takes values as either 'Ab' or 'Ag'.  
* **Returns:**  
 * aa_pos, the sorted list of amino acid positions, without repetition, of either the antigen chain or the antibody chain.


In [2]:
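def extract_aa_pos(grouped_list, ref_chain = 'Ag'):
    # Index 1 is read for the antibody chain, index 2 for the antigen chain.
    if ref_chain == 'Ab':
        ind = 1
    elif ref_chain == 'Ag':
        ind = 2
    else:
        raise ValueError("ref_chain must be 'Ab' or 'Ag'")
    aa_pos = []
    for i in grouped_list:
        if i[ind] not in aa_pos:
            aa_pos.append(i[ind])
    aa_pos.sort()
    return aa_pos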
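
The position extraction can also be sketched with a set, which removes repeats before sorting. This is only an illustrative equivalent on made-up records. `extract_aa_pos_sketch` and `toy` are hypothetical names, and the record layout `[key, Ab position, Ag position, contact number]` is assumed from the indices used above.

```python
def extract_aa_pos_sketch(grouped_list, ref_chain='Ag'):
    """Return the sorted, de-duplicated positions on the reference chain."""
    # Assumed layout: index 1 = antibody position, index 2 = antigen position.
    index = {'Ab': 1, 'Ag': 2}[ref_chain]  # any other value raises KeyError
    return sorted({record[index] for record in grouped_list})


# Made-up records for a single key.
toy = [
    ['h1HA', 30, 17, 5],
    ['h1HA', 30, 16, 21],
    ['h1HA', 31, 17, 11],
]

print(extract_aa_pos_sketch(toy, 'Ag'))  # [16, 17]
print(extract_aa_pos_sketch(toy, 'Ab'))  # [30, 31]
```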

### extract_aa_consecutive_pos
* **Inputs:**  
 * aa_pos, a list of amino acid positions without repeats  
 * length, an integer, gives the length of the reference frame  
* **Returns:**  
 * aa_consecutive_pos, a list of runs of consecutive aa positions, each of length *length*, in the form [[1, 2, 3], [3, 4, 5],...]

In [3]:
def extract_aa_consecutive_pos(aa_pos, length):
    aa_consecutive_pos = []
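    # aa_pos is sorted and has no repeats, so a window of size length is
    # consecutive exactly when its last and first positions differ by length - 1.
    if len(aa_pos) >= length:
        for i in range(len(aa_pos) - length + 1):
            if aa_pos[i + length - 1] - aa_pos[i] == length - 1:
                aa_consecutive_pos.append(aa_pos[i : i + length])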
        
    return aa_consecutive_pos
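
A small worked example may make the window test clearer. The sketch below restates the same check as a list comprehension (`consecutive_windows` is a hypothetical name) and runs it on a made-up position list with a gap between 17 and 19. It assumes the input is sorted and free of repeats, which is what `extract_aa_pos` returns.

```python
def consecutive_windows(aa_pos, length):
    """Return every slice of `length` positions that forms an unbroken run."""
    return [
        aa_pos[i:i + length]
        for i in range(len(aa_pos) - length + 1)
        # Sorted, unique positions are consecutive iff last - first == length - 1.
        if aa_pos[i + length - 1] - aa_pos[i] == length - 1
    ]


positions = [16, 17, 19, 20, 21]  # made-up positions, gap between 17 and 19

print(consecutive_windows(positions, 2))  # [[16, 17], [19, 20], [20, 21]]
print(consecutive_windows(positions, 3))  # [[19, 20, 21]]
print(consecutive_windows(positions, 6))  # [] (list is shorter than the window)
```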

### extract_aa_max
* **Inputs:**
 * aa_consecutive_pos, a list of lists, gives the positions of the amino acids in the form [[1, 2, 3], [3, 4, 5],...].  
 * four_coordinates, a list of four coordinates, to which aa_consecutive_pos is related.  
 * ref_chain, a string, takes the value 'Ab' or 'Ag'  
* **Returns:**  
 * aa_max, a list of aa positions, corresponding to the largest contact number among all the aa_consecutive_pos.  
 * contact, an integer, gives the total number of contacts corresponding to aa_max

In [4]:
def extract_aa_max(aa_consecutive_pos, four_coordinates, ref_chain = 'Ag'):
    contact = 0
    aa_max = None
    if ref_chain == 'Ab':
        ind = 1
    elif ref_chain == 'Ag':
        ind = 2
    else:
        raise ValueError("ref_chain must be 'Ab' or 'Ag'")
    for i in aa_consecutive_pos:
        # Sum the contact numbers of every record whose position falls in window i.
        s = 0
        for j in four_coordinates:
            if j[ind] in i:
                s += j[3]
        # With <=, a tie is won by the later window.
        if contact <= s:
            contact = s
            aa_max = i
    return aa_max, contact
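
To make the selection rule concrete, here is a hedged sketch on made-up records. `best_window`, `records` and `windows` are hypothetical names, and the record layout `[key, Ab position, Ag position, contact number]` is assumed from the indices used in `extract_aa_max`. It sums the contact numbers per window and keeps the largest, with ties going to the later window as in the function above.

```python
def best_window(windows, records, ref_chain='Ag'):
    """Return (window, total contact) for the window with the largest total."""
    index = {'Ab': 1, 'Ag': 2}[ref_chain]  # assumed column layout
    best, best_contact = None, 0
    for window in windows:
        total = sum(r[3] for r in records if r[index] in window)
        if total >= best_contact:  # >= keeps the later window on a tie
            best, best_contact = window, total
    return best, best_contact


# Made-up records for one key.
records = [
    ['h1HA', 30, 16, 21],
    ['h1HA', 31, 17, 16],
    ['h1HA', 31, 19, 4],
    ['h1HA', 32, 20, 9],
]
windows = [[16, 17], [19, 20]]

print(best_window(windows, records))  # ([16, 17], 37)
print(best_window([], records))       # (None, 0) when no window exists
```

The `(None, 0)` result for an empty window list is what produces the `None` and `0` entries in the output of `main`.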

### Main
* **Inputs:**  
 * *contact*, a list of four coordinates for one pdb.  
 * *length*, an integer, gives the maximum length of the consecutive amino acids  
 * *ref_chain*, a string, takes values as either 'Ab' or 'Ag'.  
* **Returns:**  
 * *grouped_aa_max_contact*, a dictionary, in the form of {'h1HA': [[[16], [16, 17], None], [21, 37, 0]], ...}. It means that for this pdb file, with *length* no larger than 3 and *ref_chain = 'Ag'*, amino acid *16* is the single amino acid with the largest contact number, 21; amino acids *[16, 17]* are the 2 consecutive amino acids with the largest contact number, 37; and there are no 3 consecutive amino acids in the antigen chain that all contact CDRh1, so that entry is None with contact number 0.
 

In [6]:
def main(contact, length = 3, ref_chain = 'Ag'):
    grouped = group(contact)
    grouped_aa_max_contact = {}
    for i in grouped:
        aa_pos = extract_aa_pos(grouped[i], ref_chain)
        total_aa_consecutive_pos = []
        total_aa_max = []
        total_contact_max = []
        # Collect the consecutive windows for every length from 1 to length.
        for j in range(length):
            total_aa_consecutive_pos.append(extract_aa_consecutive_pos(aa_pos, j+1))
        for k in total_aa_consecutive_pos:
            # contact_max avoids rebinding the contact argument inside the loop.
            aa_max, contact_max = extract_aa_max(k, grouped[i], ref_chain)
            total_aa_max.append(aa_max)
            total_contact_max.append(contact_max)
        grouped_aa_max_contact[i] = [total_aa_max, total_contact_max]
    return grouped_aa_max_contact
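
The nested return value can be awkward to read, so here is a minimal sketch of one way to unpack it into one row per key and length. The `result` dictionary is copied from the example in the description above, and `rows` is a hypothetical name.

```python
# Example value in the documented form {key: [aa_max per length, contact per length]}.
result = {'h1HA': [[[16], [16, 17], None], [21, 37, 0]]}

rows = []
for key, (aa_max_list, contact_list) in result.items():
    # Position n in each list corresponds to window length n + 1.
    for length, (aa_max, contact) in enumerate(zip(aa_max_list, contact_list), start=1):
        rows.append((key, length, aa_max, contact))

for row in rows:
    print(row)
# ('h1HA', 1, [16], 21)
# ('h1HA', 2, [16, 17], 37)
# ('h1HA', 3, None, 0)
```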

### Prepare the working directory

In [8]:
import os
os.getcwd()
# os.chdir

'C:\\Users\\leo\\Documents\\Research\\Database\\PDB Learning'

### Load the output of AAC-1

In [10]:
import json
with open("contact_current", 'r') as f:
    contact_current = json.load(f)
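
Before running the extraction, it can help to check that the loaded object has the expected shape. The sketch below assumes the file holds a dictionary mapping each pdb id to a list of four-element records whose first element is the string key; that shape is inferred from how the data is used in this notebook, not confirmed by the file format. `check_contacts` and `toy_contacts` are hypothetical names.

```python
def check_contacts(contacts):
    """Return (pdb, record) pairs that do not look like four coordinates."""
    bad = []
    for pdb, records in contacts.items():
        for record in records:
            # Assumed shape: [key string, Ab position, Ag position, contact number]
            if len(record) != 4 or not isinstance(record[0], str):
                bad.append((pdb, record))
    return bad


# Made-up data standing in for the loaded JSON.
toy_contacts = {
    '1adq': [['h1HA', 30, 16, 21], ['h1HA', 31, 17, 16]],
    '1bvk': [['h1BC', 30, 16]],  # deliberately malformed: only three fields
}

problems = check_contacts(toy_contacts)
print(problems)  # [('1bvk', ['h1BC', 30, 16])]
```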

### Extract data from the file loaded above

In [13]:
DataExtract = {}
for i in contact_current:
    DataExtract[i] = main(contact_current[i], length = 3, ref_chain = 'Ab')

In [14]:
DataExtract

{'1adq': {'h1HA': [[[30], [29, 30], None], [37, 40, 0]],
  'h2HA': [[[52], None, None], [47, 0, 0]],
  'h3HA': [[[101], [101, 102], [99, 100, 101]], [93, 122, 134]],
  'l1LA': [[[29], None, None], [2, 0, 0]],
  'l2LA': [[[53], [53, 54], None], [30, 37, 0]]},
 '1bvk': {'h1BC': [[[30], [30, 31], [29, 30, 31]], [22, 32, 38]],
  'h1EF': [[[30], [30, 31], [29, 30, 31]], [26, 39, 43]],
  'h2BC': [[[51], [51, 52], [51, 52, 53]], [53, 62, 80]],
  'h2EF': [[[51], [51, 52], [51, 52, 53]], [44, 51, 72]],
  'h3BC': [[[100], [99, 100], [98, 99, 100]], [65, 99, 119]],
  'h3EF': [[[100], [99, 100], [99, 100, 101]], [65, 98, 110]],
  'l1AC': [[[31], None, None], [34, 0, 0]],
  'l1DF': [[[31], None, None], [34, 0, 0]],
  'l2AC': [[[49], [48, 49], None], [28, 43, 0]],
  'l2DF': [[[49], [48, 49], None], [29, 37, 0]],
  'l3AC': [[[91], [91, 92], [90, 91, 92]], [38, 57, 63]],
  'l3DF': [[[91], [91, 92], [90, 91, 92]], [47, 70, 76]]},
 '1dee': {'h2DG': [[[57], None, None], [28, 0, 0]],
  'h2FH': [[[59], Non