# <center>Circuit Topology script V2.0</center>

<center>Duane Moes - For suggestions and further questions: moesduane@gmail.com </center><br>
<center>GitHub: <a href="https://github.com/Duanetech/circuit_topology">github.com/Duanetech/circuit_topology</a></center>


---
This is a fully automated script that mainly uses Biopython to perform circuit topology analysis on a given set of proteins. When possible, use mmCIF files instead of PDB files: the PDB format is outdated and more prone to problems such as missing atoms. See the README for installation help and documentation of the functions. When a new version is released, get it with the download code option on the GitHub page.

#### Packages used
<ul><li>BioPython</li>
    <li>Pandas</li>
<li>SciPy </li>
<li>NumPy</li>
<li>Matplotlib</li>
<li>DSSP</li>
</ul>


  
Run the code below once to install all required dependencies.<br> Warning: this can take a while. Once it has finished, you can delete the code block.  

      


In [None]:
!conda env update --file requirements.yml

#### Importing
These are the import statements. Run this code block every time you restart or quit Jupyter.

In [1]:
from functions.plots.circuit_plot import circuit_plot
from functions.plots.matrix_plot import matrix_plot
from functions.plots.stats_plot import stats_plot
from functions.plots.matrix_plot_model import matrix_plot_model

from functions.calculating.get_cmap import get_cmap
from functions.calculating.get_matrix import get_matrix
from functions.calculating.get_stats import get_stats
from functions.calculating.energy_cmap import energy_cmap
from functions.calculating.string_pdb import string_pdb
from functions.calculating.secondary_struc_cmap import secondary_struc_cmap
from functions.calculating.secondary_struc_filter import secondary_struc_filter
from functions.calculating.glob_score import glob_score
from functions.calculating.length_filter import length_filter

from functions.importing.retrieve_chain import retrieve_chain
from functions.importing.retrieve_cif import retrieve_cif
from functions.importing.retrieve_cif_list import retrieve_cif_list
from functions.importing.retrieve_secondary_struc import retrieve_secondary_struc
from functions.importing.stride_secondary_struc import stride_secondary_struc

from functions.exporting.export_psc import export_psc
from functions.exporting.export_cmap3 import export_cmap3
from functions.exporting.export_mat import export_mat
from functions.exporting.export_cmap4 import export_cmap4

from ipywidgets import widgets
import numpy as np 
import pandas as pd
import os
import matplotlib
%matplotlib 

Using matplotlib backend: MacOSX


## <center> User guide </center>
<ul>
    <li>Either copy your <code>.PDB</code> or <code>.CIF</code> files to their respective folders in <code>/input_files/</code>, or enter the four-character protein codes in <code>input_files/protlist.txt</code>.
</li>
</ul>
<i>NOTE: when using a large number of proteins (>50), it is more efficient to use the batch download function from the <a href="https://www.rcsb.org/downloads">RCSB PDB</a>.</i>




 
####  ***Variable input*** <br>
<ul>
<li><code>fileformat</code> ('cif'/'pdb'), preferred file type. CIF is recommended because PDB files are more likely to have missing atoms. <br></li>
    <font color='red'>NOTE!! <code>fileformat</code> must be CIF to function properly.</font>
<li><code>fetch_db</code> (0/1) Downloads the CIF files stated in <code>input_files/protlist.txt</code>.</li> <br> 

<li><code>cutoff_distance</code>, maximum distance (Ångström) between two atoms that counts as an atom-atom contact.<br> </li>
<li><code>cutoff_numcontacts</code>, minimum number of atom-atom contacts between two residues that counts as a res-res contact. <br></li>
<li><code>length_filtering</code>, length filtering is activated when <code>length_filtering</code> > 0; the input is the maximum contact distance. <br> </li>
<li><code>exclude_neighbour</code>, number of neighbouring residues that are excluded from possible res-res contacts. <br></li>

<br>
<li><code>exporting_psc</code> (0/1), exports the resulting PSC stats to a txt file located in <code>results/statistics/psc</code> (overwrites a previously created file)</li> 
<li><code>exporting_cmap3</code> (0/1), exports cmap3 to a csv file located in <code>results/circuit_diagram</code></li> 
<li><code>exporting_mat</code> (0/1), exports the topology relations matrix to a csv file located in <code>results/matrix</code></li>
</ul>
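
To make the roles of `cutoff_distance`, `cutoff_numcontacts` and `exclude_neighbour` concrete, here is a minimal sketch of how a res-res contact could be decided from atom coordinates. It is an illustration only, not the code inside `get_cmap`: the function name `residue_contacts`, the toy coordinates, and the exact comparison rules (distance `<=` cutoff, count `>=` minimum, sequence separation `>` `exclude_neighbour`) are assumptions.

```python
import numpy as np

def residue_contacts(residues, cutoff_distance=4.5,
                     cutoff_numcontacts=5, exclude_neighbour=3):
    """Return (i, j) index pairs of residues that are in contact.

    residues: list of (n_atoms, 3) arrays with atom coordinates in Angstrom.
    Hypothetical helper for illustration; the comparison rules are assumed.
    """
    contacts = []
    for i in range(len(residues)):
        # Skip residues that are too close along the sequence
        for j in range(i + exclude_neighbour + 1, len(residues)):
            # All pairwise atom-atom distances between residue i and j
            diff = residues[i][:, None, :] - residues[j][None, :, :]
            dist = np.linalg.norm(diff, axis=-1)
            n_close = int((dist <= cutoff_distance).sum())
            if n_close >= cutoff_numcontacts:
                contacts.append((i, j))
    return contacts

# Toy chain: six 3-atom residues, the first and last folded onto each other
base = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
offsets = [[0, 0, 0], [20, 0, 0], [40, 0, 0], [60, 0, 0], [80, 0, 0], [0, 0, 3]]
toy_residues = [base + np.array(o, dtype=float) for o in offsets]

print(residue_contacts(toy_residues))  # [(0, 5)]
```

Raising `exclude_neighbour` to 5 removes the only contact in this toy chain, because residues 0 and 5 are then treated as neighbours.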

In [1]:
# Format
fileformat =            'pdb'
fetch_db =              0

# CT variables
cutoff_distance =       4.5
cutoff_numcontacts =    5
length_filtering =      0
filtering_distance =    0
length_mode =           '<'
energy_filtering =      0
energy_filtering_mode = '+'
exclude_neighbour =     3

# Exporting
plot_figures =          0
exporting_psc =         0
exporting_cmap3 =       0
exporting_mat   =       0

if energy_filtering:
    # Ask which sign of the potential to filter on; used by energy_cmap in MAIN
    energy_filtering_mode = input("positive or negative filtering? (+/-)")
    
if fileformat == 'cif' and fetch_db:  
    retrieve_cif()  

In [16]:
retrieve_cif('1a6n')

Downloading PDB structure '1a6n'...


#### <center>MAIN</center>

In [64]:
number_of_files = len(os.listdir('input_files/' +fileformat))

psclist = []

for num,files in enumerate(os.listdir('input_files/' +fileformat)):
    print(files)
    # Skip anything that is not a structure file (e.g. .DS_Store)
    if not files.endswith(('cif','pdb')):
        continue

    try:
        chain,protid = retrieve_chain(files)
        print(f'{files} - {num+1}/{number_of_files}')
    except Exception as e:
        # Report the problem and move on to the next file
        print(f'{files} - {e}')
        continue
    
    
    
    #Step 1 - Draw a segment-segment based contact map 
    index,numbering,protid,res_names = get_cmap(chain)
    
    #Step 1.1 - Length filtering (optional)
    if length_filtering:
        index = length_filter(index,filtering_distance,length_mode)

    #Step 1.2 - Energy filtering (optional)
    if energy_filtering:
        ef_index = energy_cmap(index,numbering,res_names,protid,energy_filtering_mode)    
    
    #Step 2 - Draw a circuit topology relations matrix
    mat, psc = get_matrix(index)
    
    #Step 3 - Circuit topology statistics
    entangled = get_stats(mat)
    psclist.append([protid,psc[0],psc[1],psc[2]])
    
    #plotting
    if plot_figures:
        circuit_plot(index,protid,numbering)
        matrix_plot(mat,protid)
        stats_plot(entangled,psc,protid)
    
    #exporting    
    if exporting_cmap3:
        export_cmap3(index,protid,numbering)
        
    if exporting_psc:
        export_psc(psclist)
        
    if exporting_mat:
        export_mat(index,mat,protid)
    

1a5v.pdb
1a5v.pdb - 1/10
3mtq.pdb
3mtq.pdb - 2/10
.DS_Store
1bcs.pdb
1bcs.pdb - 4/10
4l9h.pdb
4l9h.pdb - 5/10
4l6e.pdb
4l6e.pdb - 6/10
3lwf.pdb
3lwf.pdb - 7/10
3mxn.pdb
3mxn.pdb - 8/10
1a8m.pdb
1a8m.pdb - 9/10
1aa7.pdb
1aa7.pdb - 10/10
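
Step 2 of the loop above builds a topology relations matrix from the contact list. As a rough illustration of what such a matrix encodes, the sketch below classifies every pair of contacts as series, parallel or cross from the order of their sites along the chain. It does not reproduce `get_matrix`: the names `relation` and `count_relations` are hypothetical, and contacts that share a site are only labelled `"shared"` rather than treated in detail.

```python
from collections import Counter
from itertools import combinations

def relation(c1, c2):
    """Classify two contacts (i, j) and (k, l) by the order of their sites."""
    # Order the sites within each contact, then put the earlier contact first
    (i, j), (k, l) = sorted([tuple(sorted(c1)), tuple(sorted(c2))])
    if len({i, j, k, l}) < 4:
        return "shared"      # contacts share a site; not handled in this sketch
    if j < k:
        return "series"      # i < j < k < l: one contact after the other
    if l < j:
        return "parallel"    # i < k < l < j: second contact nested in the first
    return "cross"           # i < k < j < l: contacts interleave

def count_relations(contacts):
    """Count the relation types over all unordered pairs of contacts."""
    return Counter(relation(a, b) for a, b in combinations(contacts, 2))

example_contacts = [(1, 10), (2, 5), (12, 20), (8, 15)]
counts = count_relations(example_contacts)
print(counts["parallel"], counts["series"], counts["cross"])
```

Counting the relation types over all contact pairs gives three totals of the same kind as the per-protein values collected in `psclist`, although the order of those values in `psc` is not specified here.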


## Secondary structure tool
This function uses the STRIDE tool to calculate the protein's secondary structure. <br> ***NOTE*** STRIDE and DSSP agree in 95.4% of cases; DSSP tends to assign shorter secondary structures. To use STRIDE files, download them from http://webclu.bio.wzw.tum.de/stride/ and put them in <code>input_files/STRIDE</code>.
<br>https://en.wikipedia.org/wiki/STRIDE <br>

It can be used to build a secondary structure-secondary structure contact map, or to filter out res-res contacts within a secondary structure.

STRIDE
* H - Alpha-Helix
* E - Extended conformation (beta strand)
* B or b - Isolated Beta-Bridge
* G - 3-10 Helix
* I - Pi helix
* T - Turn
* C - Coil

In [4]:
structure, sequence = stride_secondary_struc('1a34stride.txt')

The following function uses the secondary structure to create a secondary structure-secondary structure based cmap (cmap4).<br> Keep in mind that this function overwrites certain variables.

In [6]:
cmap4,segment = secondary_struc_cmap(
                                    chain,
                                    sequence,
                                    structure,
                                    cutoff_distance = 4.5,
                                    cutoff_numcontacts = 10,
                                    exclude_neighbour = 3,
                                    ss_elements = ['H','E','B','G'])
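
A secondary structure based contact map needs the per-residue assignment grouped into segments first. The sketch below shows one way to do that grouping with `itertools.groupby`. It assumes the assignment is a string (or sequence) of one-letter STRIDE codes, one per residue; the actual return type of `stride_secondary_struc` and the internals of `secondary_struc_cmap` may differ, and `ss_segments` is a hypothetical name.

```python
from itertools import groupby

def ss_segments(structure, ss_elements=("H", "E", "B", "G")):
    """Group consecutive identical codes into (code, start, end) segments.

    Only codes listed in ss_elements are kept; start and end are
    0-based residue indices, both inclusive.
    """
    segments = []
    pos = 0
    for code, run in groupby(structure):
        length = len(list(run))
        if code in ss_elements:
            segments.append((code, pos, pos + length - 1))
        pos += length
    return segments

print(ss_segments("CCHHHHTTEEEC"))  # [('H', 2, 5), ('E', 8, 10)]
```

Each segment can then play the role that a single residue plays in a res-res contact map.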

This function takes in a res-res contact map and filters out contacts that lie within the specified secondary structures, <code>filtered_structures</code>.

In [None]:
cmap5,struc_id = secondary_struc_filter(
                                        index,
                                        structure,
                                        filtered_structures = ['H','E'])
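
The idea behind this filter can be sketched as follows: a contact is dropped when both residues sit in the same uninterrupted stretch of a filtered secondary structure type. This is an assumption about the intended behaviour rather than a copy of `secondary_struc_filter`; the name `filter_within_segment`, the 0-based index pairs, and the string input are hypothetical.

```python
def filter_within_segment(contacts, structure, filtered_structures=("H", "E")):
    """Drop contacts whose two residues lie in the same filtered segment.

    contacts:  list of (i, j) 0-based residue index pairs
    structure: per-residue one-letter secondary structure codes
    """
    # Give every run of identical codes its own segment id
    seg_id = []
    current = -1
    previous = None
    for code in structure:
        if code != previous:
            current += 1
            previous = code
        seg_id.append(current)

    kept = []
    for i, j in contacts:
        same_segment = seg_id[i] == seg_id[j]
        if same_segment and structure[i] in filtered_structures:
            continue  # contact is internal to a helix or strand, drop it
        kept.append((i, j))
    return kept

toy_structure = "HHHHHHCCEEEE"
toy_contacts = [(0, 4), (1, 9), (8, 11), (6, 7)]
print(filter_within_segment(toy_contacts, toy_structure))  # [(1, 9), (6, 7)]
```

Contacts inside the coil stretch are kept here because `C` is not in `filtered_structures`.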

### <center> Multi-chain analysis </center>

In [4]:
chain,protid = retrieve_chain('1a34.cif')

In [12]:
index,numbering,protid,res_names = get_cmap(chain,level='chain')

In [13]:
mat, psc = get_matrix(index,protid)

In [13]:
psclist.append(psc)

In [15]:
psc.tolist()

['1a34_A', '28976', '11551', '10194']

In [11]:
psclist=[]

In [15]:
psc[1:]

[28976, 11551, 10194]

In [16]:
len(psclist[0])

4

In [3]:
psclist

NameError: name 'psclist' is not defined

In [9]:
numbering.max()

159

In [10]:
numbering

array([ 13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159])

In [11]:
psc

NameError: name 'psc' is not defined