# **Protein to graph**
<font color='grey' size='1.5'> Created by Parisa Hosseinzadeh for *Machine learning for proteins*, Spring 2022. 

In today's in-class activity, we will learn how to go from a protein to a graph.

## Step 1. Preparation

### 1.1. Installing biopandas

[Biopandas](http://rasbt.github.io/biopandas/) is a library that allows easy loading and manipulation of biological structures.

In [None]:
pip install biopandas #install biopandas


### 1.2. Loading libraries

In [None]:
#importing necessary packages
import pandas as pd
import typing
from typing import Dict, List
import numpy as np
import seaborn as sns
from sklearn.metrics import pairwise_distances
import networkx as nx
import matplotlib.pyplot as plt
import os

In [None]:
# Importing read pdb function from biopandas
from biopandas.pdb import PandasPdb

## Step 2. Loading and preparing pdb file

In this step, we will load our PDB file and prepare it for graph generation. You can use a PDB file stored on your machine, but we will fetch an online structure: the protein [azurin](https://www.rcsb.org/structure/1AZU) with **PDB ID: 1AZU**.

In [None]:
ppdb = PandasPdb().fetch_pdb("1AZU")
# Keeping only the ATOM records that belong to chain A
# (the CA coordinates are extracted from this dataframe in Step 3)
p_df = ppdb.df['ATOM'][ppdb.df['ATOM']['chain_id'] == 'A']

Let's take a look at how the pdb is loaded.

In [None]:
p_df.head()

## Step 3. Generating distance map

At this stage, we will create the distance map for our protein.

You can see that the loaded dataframe contains residue numbers, chain IDs, and atom names, among other information. For the purpose of this class, we're interested in the coordinates of the "CA" (alpha carbon) atoms in the PDB file.


### 3.1. Coordinate matrix

**Practice time**: Can you generate a new dataframe that only contains the coordinates of *CA* atoms?

In [None]:
# your code here

In [None]:
#@markdown Sample code

CA_vec_all = p_df.loc[p_df['atom_name'] == 'CA']
CA_vec = CA_vec_all[['x_coord','y_coord','z_coord']]

CA_vec.head()

#### Q1. What is the size of your list?

What is the size of your list? In other words, how many residues does your protein have?

In [None]:
# your code here

In [None]:
#@markdown Sample code

len(CA_vec)

### 3.2. Changing formats

This data is currently stored as a pandas dataframe. To perform mathematical calculations, we need to convert it to numpy arrays.

**Practice time**: Write code to convert the dataframe to numpy arrays.

In [None]:
# your code here

In [None]:
#@markdown Sample code

# array generation: one numpy array of (x, y, z) per CA atom
vec_list = [
    np.array(list(CA_vec.iloc[i]))
    for i in range(len(CA_vec))
]

# looking at first 5 elements
vec_list[:5]
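As an alternative to building a list row by row, pandas can convert the whole dataframe in one call with `DataFrame.to_numpy()`. The sketch below uses a made-up three-row dataframe (`toy_CA_vec`, with invented coordinates) in place of the real `CA_vec`, so the numbers are for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for CA_vec: three CA atoms with invented coordinates
toy_CA_vec = pd.DataFrame(
    {
        "x_coord": [0.0, 3.0, 3.0],
        "y_coord": [0.0, 4.0, 4.0],
        "z_coord": [0.0, 0.0, 12.0],
    }
)

# One 2D array with a row per atom and a column per coordinate
toy_coords = toy_CA_vec.to_numpy()

print(toy_coords.shape)  # (3, 3)
print(toy_coords[0])     # coordinates of the first atom
```

The result is a single 2D array of shape `(n_residues, 3)` rather than a list of 1D arrays; `len()` and slicing behave the same way on both.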

### 3.3. Distance map

You can write your own code to calculate all pairwise distances. Alternatively, you can use code that has been designed to do exactly this. We will be using the [`pairwise_distances`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html) function from scikit-learn to generate the distance map.

**Practice time**: Given your vector and using the instructions from the link above, generate the pairwise distance matrix of your coordinates.

NOTE: we use `'euclidean'` metric.

In [None]:
# your code here

In [None]:
#@markdown Sample code

M = pairwise_distances(
    vec_list, metric='euclidean'
)

M[:5]
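If you would rather write the pairwise calculation yourself, a minimal NumPy sketch could look like the one below. The function name `pairwise_euclidean` and the toy coordinates are made up for illustration; this is one possible way to do it, not the method used by scikit-learn internally.

```python
import numpy as np

def pairwise_euclidean(coords):
    """Return the (n, n) matrix of Euclidean distances between n points."""
    coords = np.asarray(coords, dtype=float)
    # Broadcasting: (n, 1, 3) - (1, n, 3) gives all coordinate differences, shape (n, n, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    # Sum the squared differences over x, y, z, then take the square root
    return np.sqrt((diff ** 2).sum(axis=-1))

# Invented coordinates chosen so the distances are easy to check by hand
toy_coords = [[0, 0, 0], [3, 4, 0], [3, 4, 12]]
toy_M = pairwise_euclidean(toy_coords)
print(toy_M)
```

The matrix is symmetric with zeros on the diagonal, which is also what you should see in `M`.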

#### Q2. Distance map heatmap

Using [seaborn's heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html), generate a heatmap of your matrix and submit it to the activity.

In [None]:
# your code here

In [None]:
#@markdown Sample code

sns.heatmap(M)

## Step 4. Generating the graph

In this step, we will use the distance map to create a graph. We will use a package called [networkx](https://networkx.org/) for graph generation and manipulation.

### 4.1. Getting sequence information

Each node in our graph will be a residue. Therefore, we need to extract sequence information from our protein.

**Practice time**: From the dataframe, generate an array that contains the residue names of the protein in sequence order.

In [None]:
# your code here

In [None]:
#@markdown Sample code

res_df = p_df.loc[p_df['atom_name'] == 'CA'][['residue_name']]
seq_array = list(res_df['residue_name'])

seq_array[:5]

### 4.2. Generate a graph

The code below reads in the residues, adds them as nodes, and adds an edge between two residues *if* the distance between their CA atoms is less than 8 Å. Note that we set each edge weight to 1/distance.

In [None]:
# preparing edge and node list
G=nx.Graph()
# adding nodes
for i in range(len(vec_list)):
  n1 = "{}_{}".format(
      seq_array[i],i) ## <-- replace this by the name you gave to residue arrays
  # Adding residue names as node labels
  G.add_node(n1)
  # for loop to add edges
  for j in range(i+1, len(vec_list)):
    n2 = "{}_{}".format(seq_array[j],j)
    # Adding an edge between every pair closer than 8 Å
    if M[i][j] < 8:
      # Weights are inversely proportional to distance
      # Closer = higher weight
      G.add_edge(n1, n2, weight=1/M[i][j])
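The same thresholding logic can also be written with a boolean mask on the distance matrix instead of a nested loop. The sketch below is one possible variant; the helper name `graph_from_distance_map`, the toy distance matrix, and the toy labels are all made up for illustration.

```python
import numpy as np
import networkx as nx

def graph_from_distance_map(dist, labels, cutoff=8.0):
    """Build a weighted contact graph from a square distance matrix."""
    dist = np.asarray(dist, dtype=float)
    graph = nx.Graph()
    graph.add_nodes_from(labels)
    # Upper triangle (k=1) skips the diagonal and visits each pair only once
    close = np.triu(dist < cutoff, k=1)
    for i, j in np.argwhere(close):
        # Same weighting as in the loop version: 1 / distance
        graph.add_edge(labels[i], labels[j], weight=1.0 / dist[i, j])
    return graph

# Toy example: three points on a line at x = 0, 5 and 12
toy_dist = np.array([
    [0.0, 5.0, 12.0],
    [5.0, 0.0, 7.0],
    [12.0, 7.0, 0.0],
])
toy_labels = ["ALA_0", "GLY_1", "SER_2"]

toy_graph = graph_from_distance_map(toy_dist, toy_labels)
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
```

In the toy example only the pairs at 5 and 7 are below the cut-off, so the pair at distance 12 gets no edge.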

#### Q3. Graph properties

How many nodes and edges does your graph have? Is the node number consistent with the protein size? What does the edge number tell you?

In [None]:
print(G.number_of_nodes())
print(G.number_of_edges())

126
616
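One way to start interpreting the edge count is to turn it into per-node quantities. The sketch below computes the mean degree (average number of neighbours per residue) and the graph density from the two numbers printed above. The helper names `mean_degree` and `density` are made up for illustration; for a graph object, `nx.density(G)` gives the same density.

```python
def mean_degree(n_nodes, n_edges):
    """Average number of neighbours per node in an undirected graph."""
    # Each edge contributes to the degree of two nodes
    return 2 * n_edges / n_nodes

def density(n_nodes, n_edges):
    """Fraction of all possible node pairs that are connected."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# Node and edge counts taken from the output above
n_nodes, n_edges = 126, 616

print(round(mean_degree(n_nodes, n_edges), 2))  # 9.78
print(round(density(n_nodes, n_edges), 3))      # 0.078
```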


### 4.3. Visualization

Let's take a look at our graph.

#### Q4. Graph picture

Submit the image of your graph to your assignment.

In [None]:
# setting up size of image
fig, ax = plt.subplots(figsize=(15,8))

# getting edge weights
weights = list(nx.get_edge_attributes(G,'weight').values())

# upweighting edges so they are visible in the plot
up_weights = [i*5 for i in weights]

#drawing
nx.draw_spring(
    G,
    node_size=250,
    node_color='orange',
    width=up_weights,
    with_labels=True,
    edge_color='grey',
    font_size=9
)

plt.show()

#### Q5. Change distance cut-off

Change the distance cut-off for edges to 5 Å and draw the graph again. Submit the graph. What differences do you see?
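Before redrawing, it can help to check how the number of edges responds to the cut-off. The sketch below counts the pairs below a given cut-off directly from a distance matrix. The helper name `count_contacts` and the four-residue toy matrix are made up for illustration; the 3.8 Å spacing reflects the typical distance between consecutive CA atoms.

```python
import numpy as np

def count_contacts(dist, cutoff):
    """Count the pairs (i < j) whose distance is below the cut-off."""
    dist = np.asarray(dist, dtype=float)
    # Indices of the upper triangle, excluding the diagonal
    upper = np.triu_indices_from(dist, k=1)
    return int((dist[upper] < cutoff).sum())

# Invented distances (in Å) for a short four-residue chain
toy_dist = np.array([
    [0.0, 3.8, 6.0, 9.5],
    [3.8, 0.0, 3.8, 6.2],
    [6.0, 3.8, 0.0, 3.8],
    [9.5, 6.2, 3.8, 0.0],
])

for cutoff in (5, 8):
    print(cutoff, count_contacts(toy_dist, cutoff))
```

In this toy case the 5 Å cut-off keeps only the three sequence neighbours, while 8 Å also connects residues two positions apart. Running the same count on `M` should match `G.number_of_edges()` for the corresponding cut-off.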