<a href="https://colab.research.google.com/github/tcardlab/optimus_bind_sample/blob/master/notebooks/1_0_TJC_raw_data_imports.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

These are practice scripts that fetch the latest version of SKEMPI, parse the data, and import the relevant protein models.

# Scrape for Download links

## Package Install and imports

In [0]:
'''Package Installs'''

!pip install requests
!pip install beautifulsoup4



In [0]:
'''Package Imports'''

#Get HTML
import requests
import bs4

#Parsing – https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
import lxml.etree as xml 
import re

#Populate Dataframe
import numpy as np
import pandas as pd

## Scraper

**Learn to scrape here:**

https://colab.research.google.com/drive/15AEaOsAKWgikKY7BEOWxUlKsjjBjRD6R#scrollTo=dJTEYuhp4CBN


In [0]:
#URLs
home= 'https://life.bsc.es'
URL = home+'/pid/skempi2/database/index'


#Get HTML:
#https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
#The HTTP header is spelled "User-Agent"; "UserAgent" would be sent as an unrelated custom header
webPage = bs4.BeautifulSoup(requests.get(
    URL, 
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.183 Safari/537.36"}).text, 
    "lxml")


#Extract potential download links (escape the dot so it only matches a literal "."):
links = webPage.find_all(name='a', 
                attrs={"href": re.compile(r"\.csv|tgz")})
#print(*links, sep = "\n")


#Find most recent version:
download = dict() #format = {ver:[csv link, pdb link], ...}
for a in links:
  #float() parses the version number without the risks of eval()
  ver = float(re.match(r'SKEMPI v[.]?(\d+\.\d+)', a.text)[1])
  href = a.get('href')
  if ver not in download: download[ver]=[None, None]
  #index 0 (False) = csv link, index 1 (True) = tgz link
  download[ver]['tgz' in href] = home+href if href[0]=='/' else href
#print(download)
current=max(download.keys())

{1.0: ['http://life.bsc.es/pid/mutation_database/SKEMPI_1.0.csv', None], 2.0: ['https://life.bsc.es/pid/skempi2/database/download/skempi_v2.csv', 'https://life.bsc.es/pid/skempi2/database/download/SKEMPI2_PDBs.tgz']}
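The version-selection step above can be sketched as a small pure function over (link text, href) pairs, which makes it checkable without contacting the website. The name `latest_links` and the link texts in `sample` are hypothetical; only the hrefs mirror the dictionary printed above.

```python
import re

VERSION_RE = re.compile(r"SKEMPI v[.]?(\d+\.\d+)")

def latest_links(pairs, home="https://life.bsc.es"):
    """Group (text, href) pairs by version; return (latest_version, [csv, tgz])."""
    found = {}
    for text, href in pairs:
        match = VERSION_RE.match(text)
        if match is None:  # skip links whose text carries no version number
            continue
        ver = float(match.group(1))
        slot = found.setdefault(ver, [None, None])
        # index 0 holds the CSV link, index 1 the PDB archive link
        url = home + href if href.startswith("/") else href
        slot[1 if "tgz" in href else 0] = url
    latest = max(found)
    return latest, found[latest]

# Hypothetical link texts; the hrefs follow the output shown above.
sample = [
    ("SKEMPI v1.0 (csv)", "http://life.bsc.es/pid/mutation_database/SKEMPI_1.0.csv"),
    ("SKEMPI v2.0 (csv)", "/pid/skempi2/database/download/skempi_v2.csv"),
    ("SKEMPI v2.0 (PDBs)", "/pid/skempi2/database/download/SKEMPI2_PDBs.tgz"),
]
version, (csv_url, pdb_url) = latest_links(sample)
print(version, csv_url, pdb_url)
```

Comparing versions as floats is enough for 1.0 versus 2.0, but it would mis-order something like 2.10 and 2.9; a tuple of integers would be safer if such versions ever appear.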


# Download and Dataframe



In [0]:
data = pd.read_csv(download[current][0], sep=';')

In [0]:
print(data.info())
#print(*data.columns, sep='\n')
#print(data.iloc[0]) #get row

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7085 entries, 0 to 7084
Data columns (total 29 columns):
#Pdb                            7085 non-null object
Mutation(s)_PDB                 7085 non-null object
Mutation(s)_cleaned             7085 non-null object
iMutation_Location(s)           7085 non-null object
Hold_out_type                   3311 non-null object
Hold_out_proteins               7085 non-null object
Affinity_mut (M)                7085 non-null object
Affinity_mut_parsed             6800 non-null float64
Affinity_wt (M)                 7085 non-null object
Affinity_wt_parsed              7083 non-null float64
Reference                       7085 non-null object
Protein 1                       7085 non-null object
Protein 2                       7085 non-null object
Temperature                     7081 non-null object
kon_mut (M^(-1)s^(-1))          1844 non-null float64
kon_mut_parsed                  1844 non-null float64
kon_wt (M^(-1)s^(-1))           1853 non-

## **Index descriptions**

Source – https://life.bsc.es/pid/skempi2/info/faq_and_help

---



0.   **'#Pdb'** –
The PDB entry for the complex, followed by the chain identifiers for the two subunits. The first chain(s) correspond to protein 1 (column 'Protein 1') and the second chain(s) correspond to protein 2 (column 'Protein 2').

0.   **'Mutation(s)_PDB'** –
The mutation(s) corresponding to the residue numbering found in the protein databank. The first character is the one letter amino acid code for the original residue, the second character is the chain identifier, the third to penultimate characters indicate the residue number, followed by the residue insertion code where applicable, and the final character indicates the mutant amino acid. Where multiple mutations are present, they are separated by commas.

0.   **'Mutation(s)_cleaned'** –
The mutation(s) corresponding to the residue numbering in the 'cleaned' pdb files, in the same format as for 'Mutation(s)_PDB'.

0.   **'iMutation_Location(s)'** –
The location(s) of the mutation(s) in or away from the binding site, as defined in "A simple definition of structural regions in proteins and its use in analyzing interface evolution", ED Levy, J Mol Biol. 2010, 403(4):660-70.

0.   **'Hold_out_type'** –
Some of the complexes are classified as protease-inhibitor (Pr/PI), antibody-antigen (AB/AG) or pMHC-TCR (TCR/pMHC). This classification was introduced to aid in the cross-validation of empirical models trained using the data in the SKEMPI database, so that proteins of a similar type can be simultaneously held out during a cross-validation.

0.   **'Hold_out_proteins'** –
This column contains the PDB identifiers (as in '#Pdb') and/or hold-out types (as in 'Hold_out_type') for all the protein complexes which may be excluded from the training when cross-validating an empirical model trained on this data, so as to avoid contaminating the training set with information pertaining to the binding site being evaluated.

0.   **'Affinity_mut (M)'** –
The affinity of the mutant form (M).

0.   **'Affinity_mut_parsed'** –
The affinity of the mutant form (M), parsed to a numeric value (float64 in the DataFrame above).

0.   **'Affinity_wt (M)'** –
The affinity of the wild-type form, or form in the PDB structure (M).

0.   **'Affinity_wt_parsed'** –
The affinity of the wild-type form, or form in the PDB structure (M), parsed to a numeric value (float64 in the DataFrame above).

0.   **'Reference'** –
The reference for the affinities, as well as any further kinetic or thermodynamic information. Where available, the PubMed ID is given; otherwise the whole reference is given.

0.   **'Protein 1'** –
This is the name of the protein which corresponds to the first chain(s) given in '#Pdb'.

0.   **'Protein 2'** –
This is the name of the protein which corresponds to the second chain(s) given in '#Pdb'.

0.   **'Temperature'** –
The temperature at which the experiment was performed.

0.   **'kon_mut (M^(-1)s^(-1))'** –
The association rate for the mutant protein, where available (M^(-1)s^(-1)).

0.   **'kon_mut_parsed'** –
The association rate for the mutant protein, where available (M^(-1)s^(-1)).

0.   **'kon_wt (M^(-1)s^(-1))'** –
The association rate for the wild-type protein or protein in the crystal structure, where available (M^(-1)s^(-1)).

0.   **'kon_wt_parsed'** –
The association rate for the wild-type protein or protein in the crystal structure, where available (M^(-1)s^(-1)).

0.   **'koff_mut (s^(-1))'** –
The dissociation rate for the mutant protein, where available (s^(-1)).

0.   **'koff_mut_parsed'** –
The dissociation rate for the mutant protein, where available (s^(-1)).

0.   **'koff_wt (s^(-1))'** –
The dissociation rate for the wild-type protein or protein in the crystal structure, where available (s^(-1)).

0.   **'koff_wt_parsed'** –
The dissociation rate for the wild-type protein or protein in the crystal structure, where available (s^(-1)).

0.   **'dH_mut (kcal mol^(-1))'** –
The enthalpy of association for the mutant protein, where available (kcal mol^(-1)).

0.   **'dH_wt (kcal mol^(-1))'** –
The enthalpy of association for the wild-type protein or protein in the crystal structure, where available (kcal mol^(-1)).

0.   **'dS_mut (cal mol^(-1) K^(-1))'** –
The entropy of association for the mutant protein, where available (cal mol^(-1) K^(-1)).

0.   **'dS_wt (cal mol^(-1) K^(-1))'** –
The entropy of association for the wild-type protein or protein in the crystal structure, where available (cal mol^(-1) K^(-1)).

0.   **'Notes'** –
Notes regarding the entry.

0.   **'Method'** –
The experimental method used to measure the affinities.

0.   **'SKEMPI version'** –
The SKEMPI version number.
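
The mutation format described for 'Mutation(s)_PDB' (original residue, chain, residue number, optional insertion code, mutant residue, comma-separated) lends itself to a small parser. The sketch below is one possible reading of that description, not code from SKEMPI: the names `Mutation` and `parse_mutations` are hypothetical, the example strings are made up for illustration, and the regular expression assumes single-character chain identifiers and a single-letter insertion code.

```python
import re
from typing import NamedTuple

class Mutation(NamedTuple):
    wild_type: str       # one-letter code of the original residue
    chain: str           # chain identifier
    position: int        # residue number
    insertion_code: str  # empty string when absent
    mutant: str          # one-letter code of the mutant residue

# Assumed layout: residue, chain, number, optional insertion code, residue
MUTATION_RE = re.compile(r"^([A-Z])(\w)(-?\d+)([A-Za-z]?)([A-Z])$")

def parse_mutations(field):
    """Split a comma-separated mutation field into Mutation tuples."""
    mutations = []
    for token in field.split(","):
        match = MUTATION_RE.match(token.strip())
        if match is None:
            raise ValueError(f"Unrecognised mutation: {token!r}")
        wt, chain, number, icode, mut = match.groups()
        mutations.append(Mutation(wt, chain, int(number), icode, mut))
    return mutations

# Illustrative inputs only, not taken from the dataset
print(parse_mutations("LI38G"))
print(parse_mutations("RA100aK, DB52N"))
```

Raising on an unrecognised token, rather than skipping it, makes any entry that does not fit the assumed layout visible early.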

## Some Calculations

**Q. I don't see any ΔG or ΔΔG values in the table. How do I calculate these?**

The affinities (Kd) of the wild-type complexes are in the column 'Affinity_wt_parsed' 
and the affinities of the mutants are in the column 'Affinity_mut_parsed'. These can be 
converted to ΔG values (in kcal mol^(-1)) by the relationship ΔG = R*T*ln(Kd); at room 
temperature (25 °C) this is ΔG = (8.314/4184) * (273.15 + 25.0) * ln(Kd), where ln() is 
the natural logarithm and dividing by 4184 converts J to kcal. The change in affinity 
upon mutation is calculated as ΔΔG = ΔGmut - ΔGwt.

Source – https://life.bsc.es/pid/skempi2/info/faq_and_help

In [0]:
#Sorted PDB#s
#print(data.iloc[:,0].sort_values())

row = 1 # for testing purposes
affinMuT = data.loc[row, 'Affinity_mut_parsed']
affinWiT = data.loc[row, 'Affinity_wt_parsed']

#ΔG = R*T*ln(Kd)
def gibbsEq(Kd):
  ΔG = (8.314/4184) * (273.15 + 25.0) * np.log(Kd) #log is ln in np
  return ΔG

print(gibbsEq(affinMuT),gibbsEq(affinWiT))

ΔΔG=gibbsEq(affinMuT)-gibbsEq(affinWiT)
print('ΔΔG =', ΔΔG)

-15.1141359632168 -16.30291146867832
ΔΔG = 1.1887755054615212
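
The single-row calculation above extends naturally to whole columns, since `np.log` works element-wise. The following is a minimal sketch under the same room-temperature assumption; the function names and the Kd values are illustrative and not taken from SKEMPI.

```python
import numpy as np

R_KCAL = 8.314 / 4184  # gas constant in kcal mol^-1 K^-1

def delta_g(kd, temp_k=298.15):
    """ΔG = R*T*ln(Kd), element-wise over scalars or arrays."""
    return R_KCAL * temp_k * np.log(kd)

def delta_delta_g(kd_mut, kd_wt, temp_k=298.15):
    """ΔΔG = ΔGmut - ΔGwt; positive values mean weaker binding."""
    return delta_g(kd_mut, temp_k) - delta_g(kd_wt, temp_k)

# Made-up affinities (M): tenfold weaker, tenfold tighter, unchanged
kd_wt = np.array([1e-9, 1e-9, 5e-8])
kd_mut = np.array([1e-8, 1e-10, 5e-8])
ddg = delta_delta_g(kd_mut, kd_wt)
print(np.round(ddg, 3))
```

A tenfold increase in Kd corresponds to roughly +1.36 kcal/mol at 25 °C. With the DataFrame loaded earlier, the analogous call would presumably be `delta_delta_g(data['Affinity_mut_parsed'], data['Affinity_wt_parsed'])`, with missing values propagating as NaN.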


# Import PDBs

## SKEMPI v?

*– Citation –*

"SKEMPI 2.0: An updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation". 
Justina Jankauskaitė, Brian Jiménez-García, Justas Dapkūnas, Juan Fernández-Recio, Iain H Moal 
Bioinformatics (2018), bty635, https://doi.org/10.1093/bioinformatics/bty635


In [0]:
#Download current SKEMPI PDBs
LINK = download[current][1]
!wget $LINK

In [0]:
#Extract data from compressed file
#takes like 5 mins for folder to show... be patient 
!tar -xvzf SKEMPI2_PDBs.tgz

## ZEMu

In [0]:
#ZEMu

#Download ZEMu PDBs
LINK = 'https://files.slack.com/files-pri/THK7D0M9N-FHRPP4CQ1/download/zemu_pdbs.tar?pub_secret=20bb6acac1' 
!wget $LINK -O zemu_pdbs.tar

#NOTE: if pub_secret expires,
#copy the URL from this public link and paste it as LINK
#https://slack-files.com/THK7D0M9N-FHRPP4CQ1-20bb6acac1

In [0]:
#Extract data from compressed file
!tar -xvf zemu_pdbs.tar
#takes like 5 mins for folder to show... be patient 
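
Outside Colab, where the `!tar` shell escape may not be available, the same extraction can be done with Python's standard `tarfile` module. This is a sketch rather than part of the original workflow: `extract_archive` is a hypothetical helper, and the demo builds a tiny throwaway archive instead of touching the real SKEMPI or ZEMu downloads.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar or .tgz archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (none, gzip, ...) itself
    with tarfile.open(archive_path, "r:*") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names

# Demo on a throwaway archive so the sketch runs without any download
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "example.pdb"
    src.write_text("HEADER demo\n")
    archive_path = Path(tmp) / "demo.tgz"
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="PDBs/example.pdb")
    extracted = extract_archive(archive_path, Path(tmp) / "out")
    restored = (Path(tmp) / "out" / "PDBs" / "example.pdb").read_text()
print(extracted)
```

For the files in this notebook the calls would be along the lines of `extract_archive('SKEMPI2_PDBs.tgz')` and `extract_archive('zemu_pdbs.tar')`; only extract archives from sources you trust.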