# ML Project Notebook report

## 1. Understanding and plotting the data
In the first part of this exercise we will use `pandas` DataFrames to store and manipulate the data and use `seaborn` to produce nice visualizations of the data.

The goal of this part is to explore the data and get to know the dataset better.

The dataset, `ProQDock.csv`, is taken from the following paper: https://academic.oup.com/bioinformatics/article/32/12/i262/2288786

The data attributes are used to identify correct protein-protein models. They are listed as follows in the "Training features" section of the paper mentioned above:
- rGb: Residue Given burial. Relative solvent accessibility of the protein amino acids. Values range around 0.059 (± 0.022)
- nBSA: Normalized buried surface area. It measures the fraction of exposed surface area buried upon association
- Fintres: Fraction of residues buried at the interface
- Sc: Shape Complementarity at the interface
- EC: Electrostatic Complementarity at the interface
- ProQ: Protein quality predictor score
- Isc: Rosetta energy at the interface
- rTs: Rosetta total energy
- Erep: Rosetta repulsive term
- Etmr: Rosetta total energy minus repulsive
- CPM: Joint Conditional Probability of Sc, EC given nBSA, i.e. the probability of finding an interface within a certain range of Sc and EC given its size (nBSA)
- Ld: Link Density at the interface
- CPscore: Contact Preference score
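
As a quick sanity check on feature ranges such as the one quoted for rGb, each column can be summarized with `describe()`. The sketch below is only illustrative: it uses three feature columns from the first five rows printed by cell [1] rather than the full `ProQDock.csv`, so its statistics are not expected to match the values reported in the paper.

```python
import pandas as pd

# Three feature columns from the first five rows printed by cell [1]
# (a small excerpt, not the full ProQDock.csv).
sample = pd.DataFrame(
    {
        "rGb": [0.035, 0.033, 0.042, 0.032, 0.040],
        "nBSA": [0.034, 0.036, 0.027, 0.032, 0.029],
        "Sc": [0.571, 0.579, 0.776, 0.514, 0.336],
    },
    index=["T50-1", "T50-2", "T50-3", "T50-4", "T50-5"],
)

# Per-feature count, mean, std, min, quartiles, and max.
summary = sample.describe()
print(summary.loc[["mean", "std", "min", "max"]])
```

On the full dataset the same call would be `train.describe()`, which makes it easy to compare each column against the ranges given in the paper.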

As presented in the paper, the target function is also part of our dataset, through the following columns:
- DockQ: Score of quality for a protein-protein docking model
- DockQ-Binary: Binarized DockQ score, obtained by applying a threshold (0 for models below the threshold, 1 for models above it)
- ProQDock: Predicted DockQ protein docking quality score
- zrank and zrank2: All-atom energy terms based on non-bonded interactions (Coulomb, van der Waals, desolvation)
- ProQDockZ: External energy term. Hybrid method combining ProQDock and Zrank.
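
The relation between DockQ and DockQ-Binary can be sketched as a simple comparison against a cutoff. The cutoff itself is not stated in this notebook; the value 0.23 below is an assumption (it is the DockQ value commonly quoted as the boundary between incorrect and acceptable models), and the ten rows printed by cell [1] cannot confirm it, since any cutoff between 0.07302 and 0.30721 would reproduce their labels.

```python
import pandas as pd

# DockQ and DockQ-Binary values from the ten rows printed by cell [1].
scores = pd.DataFrame(
    {
        "DockQ": [0.01262, 0.01464, 0.01067, 0.01302, 0.01199,
                  0.30721, 0.37010, 0.01421, 0.00777, 0.07302],
        "DockQ-Binary": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    }
)

# ASSUMPTION: 0.23 is the cutoff; it is not stated in this notebook.
ASSUMED_THRESHOLD = 0.23

# 1 where DockQ reaches the cutoff, 0 otherwise.
scores["binary_recomputed"] = (scores["DockQ"] >= ASSUMED_THRESHOLD).astype(int)

# Check whether the assumed cutoff reproduces the stored labels on this excerpt.
labels_match = (scores["binary_recomputed"] == scores["DockQ-Binary"]).all()
print(labels_match)
```

Running the same comparison on the full `train` DataFrame would be a better test of the assumed cutoff than this excerpt.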

Finally, `cv` represents the cross-validation batches initially used in our dataset.
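
One way such a fold column can be used is to hold out one batch at a time and train on the rest. The following is a minimal sketch on a toy DataFrame: the model names, scores, and fold numbers are made up for illustration, and `fold_splits` is a hypothetical helper, not something provided by the paper or by `pandas`.

```python
import pandas as pd

# Toy frame with a hypothetical fold assignment; the real `cv` values
# come from ProQDock.csv.
toy = pd.DataFrame(
    {
        "Model": ["A-1", "A-2", "B-1", "B-2", "C-1", "C-2"],
        "DockQ": [0.01, 0.31, 0.02, 0.37, 0.07, 0.45],
        "cv": [1, 1, 2, 2, 3, 3],
    }
)


def fold_splits(df, fold_col="cv"):
    """Yield (fold, train, test), holding out one fold at a time."""
    for fold in sorted(df[fold_col].unique()):
        held_out = df[fold_col] == fold
        yield fold, df[~held_out], df[held_out]


for fold, train_part, test_part in fold_splits(toy):
    print(f"fold {fold}: {len(train_part)} train rows, {len(test_part)} test rows")
```

Reusing the predefined batches, rather than drawing new random splits, keeps every row of a given batch on the same side of each split.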

In [1]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load the ProQDock dataset into a DataFrame and display it
train = pd.read_csv('ProQDock.csv')
train

| Unnamed: 0 | Model | rGb | nBSA | Fintres | Sc | EC | ProQ | zrank | zrank2 | Isc | ... | Erep | Etmr | CPM | Ld | CPscore | DockQ | DockQ-Binary | ProQDock | ProQDockZ | cv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T50-1 | 0.035 | 0.034 | 0.106 | 0.571 | 0.072 | 0.682 | 0.611 | 0.657 | 1.000 | ... | 0.998 | 0.400 | 0.723 | 0.114 | 0.135 | 0.01262 | 0 | 0.296446 | 0.296439 | 5 |
| 1 | T50-2 | 0.033 | 0.036 | 0.124 | 0.579 | -0.128 | 0.703 | 0.633 | 0.671 | 1.000 | ... | 0.998 | 0.487 | 0.695 | 0.088 | 0.112 | 0.01464 | 0 | 0.234311 | 0.220123 | 5 |
| 2 | T50-3 | 0.042 | 0.027 | 0.088 | 0.776 | 0.434 | 0.698 | 0.536 | 0.452 | 0.464 | ... | 0.611 | 0.345 | 0.857 | 0.146 | 0.167 | 0.01067 | 0 | 0.152381 | 0.225628 | 5 |
| 3 | T50-4 | 0.032 | 0.032 | 0.118 | 0.514 | 0.458 | 0.640 | 0.579 | 0.534 | 0.490 | ... | 0.406 | 0.911 | 0.735 | 0.101 | 0.133 | 0.01302 | 0 | 0.126823 | 0.134728 | 5 |
| 4 | T50-5 | 0.040 | 0.029 | 0.102 | 0.336 | 0.172 | 0.708 | 0.589 | 0.839 | 1.000 | ... | 1.000 | 0.419 | 0.451 | 0.097 | 0.113 | 0.01199 | 0 | 0.295767 | 0.301145 | 5 |
| 5 | T50-6 | 0.046 | 0.030 | 0.104 | 0.375 | -0.084 | 0.710 | 0.612 | 0.711 | 1.000 | ... | 1.000 | 0.417 | 0.372 | 0.112 | 0.266 | 0.30721 | 1 | 0.237555 | 0.277879 | 5 |
| 6 | T50-7 | 0.043 | 0.027 | 0.090 | 0.602 | 0.429 | 0.746 | 0.633 | 0.584 | 0.784 | ... | 0.647 | 0.381 | 0.799 | 0.122 | 0.123 | 0.37010 | 1 | 0.131572 | 0.125022 | 5 |
| 7 | T50-8 | 0.041 | 0.014 | 0.058 | 0.575 | 0.051 | 0.715 | 0.622 | 0.682 | 1.000 | ... | 1.000 | 0.421 | 0.723 | 0.165 | 0.116 | 0.01421 | 0 | 0.253425 | 0.278002 | 5 |
| 8 | T50-9 | 0.041 | 0.037 | 0.106 | 0.377 | -0.287 | 0.683 | 0.687 | 0.997 | 1.000 | ... | 1.000 | 0.491 | 0.289 | 0.098 | 0.142 | 0.00777 | 0 | 0.183456 | 0.125676 | 5 |
| 9 | T50-10 | 0.045 | 0.020 | 0.082 | 0.552 | 0.065 | 0.714 | 0.639 | 0.655 | 0.527 | ... | 0.982 | 0.414 | 0.723 | 0.134 | 0.302 | 0.07302 | 0 | 0.378346 | 0.405764 | 5 |
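
As a first visualization, the predicted score can be plotted against the target with `seaborn`. The sketch below is one possible starting point, not the plot used in the paper: it rebuilds three columns from the ten rows printed above so that it runs without `ProQDock.csv`, and the figure size and title are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; omit this line inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Three columns from the ten rows printed above.
preview = pd.DataFrame(
    {
        "DockQ": [0.01262, 0.01464, 0.01067, 0.01302, 0.01199,
                  0.30721, 0.37010, 0.01421, 0.00777, 0.07302],
        "DockQ-Binary": [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
        "ProQDock": [0.296446, 0.234311, 0.152381, 0.126823, 0.295767,
                     0.237555, 0.131572, 0.253425, 0.183456, 0.378346],
    }
)

# Scatter plot of prediction vs. target, colored by the binary label.
fig, ax = plt.subplots(figsize=(5, 4))
sns.scatterplot(data=preview, x="DockQ", y="ProQDock", hue="DockQ-Binary", ax=ax)
ax.set_title("ProQDock prediction vs. DockQ (10-row excerpt)")
fig.tight_layout()
```

With the full dataset, passing `data=train` instead of `preview` gives the same plot over all models.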
