<div style="background-color:rgb(255, 250, 240); padding:10px 0;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

# Binder Tutorial QC Analysis

### <font color='red'>To begin: Click the top cell and press 'Run' on the toolbar (or shift-enter). Alternatively click Kernel, Restart and Run All.</font> 


## Table of Contents:
1. [Import data](#1.) <br>
2. [Visualisation](#2.)<br>
   2.1. [Histogram of RSD](#2.1.)<br>
   2.2. [Jointplot of RSD vs. D-Ratio](#2.2.)<br>
   2.3. [PCA score plot of QC vs. Sample](#2.3.)<br>
   2.4. [Scatter plot of Molecular Weight vs. Retention Time (sized by RSD)](#2.4.)<br>

<br>
<div style="background-color:rgb(240,248,255); padding:20px;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em; font-size: 15px; font-style:italic;">
For more information on using Jupyter Notebooks refer to:<br> 
&nbsp;&nbsp;&nbsp;<a href="https://mybinder.org/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb">Python Data Science Handbook by Jake VanderPlas (2016)</a>
</div> 





<a id="1."></a>
## 1.  Import Data

1. Import the pandas Python module ( https://pandas.pydata.org/ ).<br>
2. Import the sheet "Data" from the Excel file "data.xlsx" into a data frame called "data".<br>
3. Display the number of rows and columns.<br>
4. Display the first 10 rows at the top (head) of the data frame.<br>
5. Repeat steps 2 to 4 for the sheet "Peak", using a data frame called "peak".<br>
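
Section 2.3 selects columns of `data` using the names listed in the Peak sheet (`data[peak['Name']]`), so after importing it can help to confirm that every peak name has a matching column. The following is a minimal sketch on small made-up tables. The `demo_data` and `demo_peak` frames, the `M1` to `M3` names, and all values are hypothetical stand-ins, not the contents of "data.xlsx".

```python
import pandas as pd

# Hypothetical stand-ins for the "Data" and "Peak" sheets (values are made up)
demo_data = pd.DataFrame({
    'SampleType': ['QC', 'Sample', 'Sample'],
    'M1': [1.0, 2.0, 3.0],
    'M2': [4.0, 5.0, 6.0],
})
demo_peak = pd.DataFrame({'Name': ['M1', 'M2', 'M3']})

# Collect peak names that have no matching column in the data table
missing = [name for name in demo_peak['Name'] if name not in demo_data.columns]
print("Peaks without a data column:", missing)
```

An empty list means every peak listed in the Peak table can be selected from the Data table.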

</div>

In [None]:
import pandas as pd

data = pd.read_excel('data.xlsx', sheet_name='Data') # import data sheet

print("Data Table: {} rows & {} columns".format(*data.shape))
display(data.head(10)) # View data table (top 10 rows)

peak = pd.read_excel('data.xlsx', sheet_name='Peak') # import peak sheet
print("Peak Table: {} rows & {} columns".format(*peak.shape))
display(peak.head(10)) # View peak table (top 10 rows)

<div style="background-color:rgb(255, 250, 240); padding:10px 0;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

<a id="2."></a>
## 2. Visualisation

#### <font color='red'>Note: Each cell in the Visualisation Section can be run in any order (provided data is imported in Section 1).</font> 
<br>

<a id="2.1."></a>
### 2.1. Histogram of RSD
<br>
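
This notebook reads `RSD` as a precomputed column of the Peak sheet and does not state how it was derived. A common definition of relative standard deviation is 100 × standard deviation / mean, often taken over the QC injections of each peak. The sketch below illustrates that definition on made-up intensities. It is an assumption about how such a column could be produced, and the `demo` frame with its `M1` and `M2` columns is hypothetical.

```python
import pandas as pd

# Hypothetical intensities for two peaks (values are made up)
demo = pd.DataFrame({
    'SampleType': ['QC', 'QC', 'QC', 'Sample'],
    'M1': [100.0, 110.0, 90.0, 250.0],
    'M2': [50.0, 50.0, 50.0, 80.0],
})

# Keep only the QC rows, then compute 100 * std / mean for each peak
qc = demo.loc[demo['SampleType'] == 'QC', ['M1', 'M2']]
rsd = 100 * qc.std() / qc.mean()  # pandas std uses ddof=1 (sample standard deviation)
print(rsd.round(1))
```
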
</div>

In [None]:
import matplotlib.pyplot as plt

# Ensures plots are shown inline on Jupyter Notebooks
%matplotlib inline

fig = plt.figure(figsize=(6,5))
# plt.hist computes and draws the histogram of RSD (50 bins, normalised to a density)
plt.hist(peak.RSD, 50, density=True, facecolor='g', alpha=0.5)
plt.xlabel('RSD', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.show()

<div style="background-color:rgb(255, 250, 240); padding:10px 0;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

<a id="2.2."></a>
### 2.2. Jointplot of RSD vs. D-Ratio

<br>

</div>

In [None]:
import seaborn as sns

sns.jointplot(x=peak.RSD, y=peak.D_Ratio, kind='kde', color="skyblue") # plot of RSD and D_ratio with bivariate and univariate graphs.

<div style="background-color:rgb(255, 250, 240); padding:10px 0;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

<a id="2.3."></a>
### 2.3. PCA score plot of QC vs. Sample

<br>
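
The cell below log-transforms the peak intensities, scales each column to zero mean and unit variance, and projects the result onto two principal components. The sketch here repeats those steps on random positive data, which stands in for `data[names].values`. It also reads `explained_variance_ratio_` from the fitted PCA, a value that could be used to annotate the PC1 and PC2 axis labels. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random, strictly positive matrix: 20 samples x 5 peaks (made-up data)
rng = np.random.default_rng(0)
x_demo = rng.lognormal(mean=2.0, sigma=0.5, size=(20, 5))

# Log-transform, then scale each column to zero mean and unit variance
x_scaled = StandardScaler().fit_transform(np.log(x_demo))

# Fit a two-component PCA and get the scores for each sample
pca_demo = PCA(n_components=2)
scores_demo = pca_demo.fit_transform(x_scaled)

# Percentage of variance explained by each component
pc1, pc2 = 100 * pca_demo.explained_variance_ratio_
print("PC1 ({:.1f}%), PC2 ({:.1f}%)".format(pc1, pc2))
```

Note that `np.log` returns `-inf` for zeros and `NaN` for missing values, and `PCA` does not accept either, so real data may need such entries handled first.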
</div>

In [None]:
# Import the modules used in this cell
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

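# Extract X matrix (one column per peak listed in the Peak table)
names = peak['Name']
x = data[names].values
x = np.log(x) # log-transform the intensities
x = StandardScaler().fit_transform(x) # scale each column to zero mean and unit variance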

# Create and fit PCA
pca = PCA(n_components=2)
scores = pca.fit_transform(x)
label = data['SampleType']

# Split scores into sample and QC
Sample_scores = scores[label == 'Sample',:]
QC_scores = scores[label == 'QC',:]

# Plot Sample score and QC score
fig = plt.figure(figsize=(8,8))
h1 = plt.scatter(Sample_scores[:,0],Sample_scores[:,1],edgecolors='Black', facecolors='Green',s=100,alpha=0.5)
h2 = plt.scatter(QC_scores[:,0],QC_scores[:,1], edgecolors='Black', facecolors='Red',s=100,alpha=0.5)

# Add legend, labels, and title
plt.legend((h1,h2),('Sample','QC'),fontsize=15)
plt.xlabel('PC1', fontsize=15)
plt.ylabel('PC2', fontsize=15)
plt.title('Quality Control PCA plot',fontsize=20)

# Show plot
plt.show()

<div style="background-color:rgb(255, 250, 240); padding:10px 0;  border: 1px solid lightgrey; padding-left: 1em; padding-right: 1em;">

<a id="2.4."></a>
### 2.4. Scatter plot of Molecular Weight vs. Retention Time (sized by RSD)

<br>

</div>

In [None]:
# Import here so this cell can be run independently of the other Visualisation cells
import matplotlib.pyplot as plt

# Scatter plot of Mol_Weight vs. RT_minutes with marker size RSD^2/2, coloured red
fig = plt.figure(figsize=(20,16))
plt.scatter(peak.Mol_Weight, peak.RT_minutes, s=peak.RSD**2/2, alpha=0.2, edgecolors='k', c='r')
plt.xlabel('Molecular Weight', fontsize=15)
plt.ylabel('Retention Time (minutes)', fontsize=15)
plt.title('Metabolites Detected (sized by RSD)',fontsize=20)
plt.show()