# AI for Biotechnology
<span style="color:#AAA;font-size:14px;" >Prof. Dr. Dominik Grimm</span>  
<span style="color:#AAA;font-size:14px;">Bioinformatics Research Lab</span>  
<span style="color:#AAA;font-size:14px;">TUM Campus Straubing for Biotechnology and Sustainability</span>  

## Principal Component Analysis from Scratch #8
Familiarize yourself with the Iris Flower Dataset. Here you can find a description of the data: https://en.wikipedia.org/wiki/Iris_flower_data_set

We will first load the data:

In [1]:
%matplotlib inline
import pylab as pl
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

print("No Samples:\t%d" % X.shape[0])
print("No Features:\t%d" % X.shape[1])
print("Feature Names:\t" + str(data.feature_names))
print()
print("No of classes:\t%d" % np.unique(y).shape[0])
print("Class Names:\t" + str(data.target_names))

No Samples:	150
No Features:	4
Feature Names:	['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

No of classes:	3
Class Names:	['setosa' 'versicolor' 'virginica']


The data contains three classes, one for each plant species: `setosa` (y=0), `versicolor` (y=1) and `virginica` (y=2).



In this example we would like to perform a principal component analysis (PCA) on the iris dataset to reduce the four-dimensional dataset to a two-dimensional one. For this purpose, we will implement PCA from scratch.

First standardize your data to have zero mean and unit variance and store it in a variable with the name `Xn`:

In [None]:
# Use sklearn to standardize the data (zero mean, unit variance per feature).
# For exploration we can standardize the full dataset; when combining PCA
# with machine learning, we would split the data into training and test sets.
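
One possible way to do this is sketched below. It assumes `StandardScaler` from scikit-learn is used for the scaling (a manual `(X - mean) / std` would work as well); the variable name `Xn` follows the task description.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Center each feature to zero mean and scale it to unit variance.
# Note: StandardScaler divides by the population standard deviation (ddof=0).
Xn = StandardScaler().fit_transform(X)

print(Xn.mean(axis=0).round(6))  # approximately 0 for every feature
print(Xn.std(axis=0).round(6))   # approximately 1 for every feature
```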


Next compute the covariance matrix on the standardized data, that is  
\begin{align}
\textbf{C} = \frac{1}{n-1} \textbf{X}_n^\top \textbf{X}_n
\end{align}

In [None]:
# Implement the formula of the covariance matrix given above
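
A minimal sketch of the covariance formula, assuming `Xn` was produced with scikit-learn's `StandardScaler`. The comparison with `np.cov` is only a sanity check and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

Xn = StandardScaler().fit_transform(load_iris().data)
n = Xn.shape[0]

# C = 1/(n-1) * Xn^T Xn, valid because the columns of Xn have zero mean
C = Xn.T @ Xn / (n - 1)

# Sanity check against NumPy's estimator (features are in the columns)
print(np.allclose(C, np.cov(Xn, rowvar=False)))
```

Because `StandardScaler` scales with the population standard deviation, the diagonal of `C` is `n / (n - 1)` rather than exactly 1.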


Next we have to eigendecompose the matrix $\textbf{C}$ into its eigenvalues and eigenvectors. This can be done with NumPy:

In [None]:
# Compute the vector of eigenvalues (the variance along each principal component)
# and the matrix of eigenvectors (the principal components, one per column).
# Then sort both in decreasing order, from highest to lowest variance.

Now we can sort the eigenvalues in decreasing order using NumPy and reorder the columns of the eigenvector matrix $\textbf{V}$ using the indices obtained from sorting the eigenvalues:
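
Both steps can be sketched as follows. The sketch assumes `np.linalg.eigh`, which is intended for symmetric matrices such as $\textbf{C}$ and returns the eigenvalues in ascending order; `np.linalg.eig` would also work, but it returns the eigenvalues in no guaranteed order. The names `eigvals`, `V` and `order` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

Xn = StandardScaler().fit_transform(load_iris().data)
C = Xn.T @ Xn / (Xn.shape[0] - 1)

# Eigendecomposition of the symmetric covariance matrix:
# eigvals[i] belongs to the eigenvector stored in column V[:, i]
eigvals, V = np.linalg.eigh(C)

# Indices that sort the eigenvalues from largest to smallest
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
V = V[:, order]  # reorder the columns, not the rows

print(eigvals)
```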

Finally, we can compute a lower dimensional representation of our input data matrix using the following formula:  

\begin{align}
\tilde{\textbf{X}}_r = \textbf{X}_n\textbf{V}_r,
\end{align}

where $\textbf{V}_r$ contains the first $r$ columns of the sorted eigenvector matrix and $r = 2$, since we would like to restrict our data to two dimensions:

In [None]:
# Project the standardized data onto the first two principal components
# (dot product of Xn with the first two eigenvectors) to obtain a
# two-dimensional representation that we can plot
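
A minimal sketch of the projection, assuming the eigenvectors come from `np.linalg.eigh` and are sorted by decreasing eigenvalue. The name `Xr` for the reduced data is an assumption, not prescribed by the exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

Xn = StandardScaler().fit_transform(load_iris().data)
C = Xn.T @ Xn / (Xn.shape[0] - 1)

eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

r = 2
Vr = V[:, :r]   # first r principal components (columns)
Xr = Xn @ Vr    # projected data, one row per sample

print(Xr.shape)
```

The sample variance of each column of `Xr` equals the corresponding eigenvalue, which is a convenient check that the columns were sorted correctly.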


Now let's visualize the transformed data for the first two principal components:

In [None]:
#generate figure
fig = pl.figure(figsize=(7,7))
ax = fig.add_subplot(111)

#your code comes here



#Set axis labels
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
#show grid in grey and set top and right axis to invisible
ax.grid(color="#CCCCCC")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
pl.tight_layout()

Extend your visualization from before and highlight all samples that belong to one of the three flower species in a different color:

In [None]:
#generate figure
fig = pl.figure(figsize=(7,7))
ax = fig.add_subplot(111)

#your code comes here



#Set axis labels
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
#show grid in grey and set top and right axis to invisible
ax.grid(color="#CCCCCC")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
pl.legend()
pl.tight_layout()
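
One way to fill in the plotting cell is sketched below. It assumes the reduced data is stored in a variable called `Xr` (a hypothetical name) and draws one `scatter` call per species so that each class gets its own color and legend entry. Dropping the loop and calling `ax.scatter(Xr[:, 0], Xr[:, 1])` once gives the uncolored version of the plot.

```python
import numpy as np
import pylab as pl
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()
y = data.target
Xn = StandardScaler().fit_transform(data.data)
C = Xn.T @ Xn / (Xn.shape[0] - 1)

eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
Xr = Xn @ V[:, order][:, :2]  # projection onto the first two PCs

fig = pl.figure(figsize=(7, 7))
ax = fig.add_subplot(111)

# One scatter call per class: matplotlib cycles through its default colors
for label, name in enumerate(data.target_names):
    mask = y == label
    ax.scatter(Xr[mask, 0], Xr[mask, 1], label=str(name))

ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.grid(color="#CCCCCC")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend()
pl.tight_layout()
```

The sign of an eigenvector is arbitrary, so the plot may appear mirrored along either axis compared with other PCA implementations.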

We should now see that the species separate nicely using only the first two principal components!  

Next we would like to check how much of the total variance the first two PCs account for. This can be done by computing the ratio of variance explained from the eigenvalues, that is, each eigenvalue divided by the sum of all eigenvalues:

In [None]:
# Store the fraction of variance explained by the first 2 PCs in `va`
# (expected result: 95.81% of the variance is explained by the 2 principal components)


print("First 2 PCs account for %.2f %% of the total variance!" % (va*100))
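
A minimal sketch of this computation, assuming the eigenvalues are obtained with `np.linalg.eigh` and sorted in decreasing order. The name `va` is taken from the print statement in the cell; `ratio` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

Xn = StandardScaler().fit_transform(load_iris().data)
C = Xn.T @ Xn / (Xn.shape[0] - 1)

# Only the eigenvalues are needed here; sort them in decreasing order
eigvals = np.sort(np.linalg.eigh(C)[0])[::-1]

# Ratio of variance explained per PC: eigenvalue / sum of all eigenvalues
ratio = eigvals / eigvals.sum()

# Fraction explained by the first two PCs
va = ratio[:2].sum()

print("First 2 PCs account for %.2f %% of the total variance!" % (va * 100))
```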

Create a bar chart to visualize the ratio of variance explained for each PC:

In [None]:
# The bar chart shows how the variance explained is distributed across the principal components

Extend the bar chart from above and also show the cumulative sum of the variance explained. For this purpose you can have a look at the numpy function `np.cumsum()`:

In [None]:
# Add a line showing the cumulative sum of the variance explained
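
Both charts can be sketched in one figure as shown below. The styling (colors, markers, labels) and the names `ratio`, `cum` and `pcs` are assumptions. Omitting the `ax.plot` call leaves the plain bar chart.

```python
import numpy as np
import pylab as pl
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

Xn = StandardScaler().fit_transform(load_iris().data)
C = Xn.T @ Xn / (Xn.shape[0] - 1)
eigvals = np.sort(np.linalg.eigh(C)[0])[::-1]

ratio = eigvals / eigvals.sum()        # variance explained per PC
cum = np.cumsum(ratio)                 # running total, ends at 1.0
pcs = np.arange(1, len(ratio) + 1)     # PC numbers 1..4 for the x-axis

fig = pl.figure(figsize=(7, 5))
ax = fig.add_subplot(111)

# Bars: individual contribution of each PC
ax.bar(pcs, ratio, color="#4C72B0", label="per PC")
# Line: cumulative variance explained
ax.plot(pcs, cum, color="#C44E52", marker="o", label="cumulative")

ax.set_xticks(pcs)
ax.set_xticklabels(["PC %d" % i for i in pcs])
ax.set_ylabel("Ratio of variance explained")
ax.grid(color="#CCCCCC")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend()
pl.tight_layout()
```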
