# Assignment 04

## Dimensionality Reduction 

## CSCI S-96

> Instructions: For this assignment you will complete the exercises shown. All exercises involve creating and executing some Python code. Additionally, most exercises have questions for you to answer. You can answer questions by creating a Markdown cell and writing your answer. If you are not familiar with Markdown, you can find a brief tutorial here.

## Introduction  

Dimensionality reduction algorithms are widely used in data mining. Human perception of relationships in data is limited beyond a few dimensions. Further, many data mining algorithms produce poor results when there is significant dependency between the features. In both cases, we can apply dimensionality reduction methods.   

In the exercises in this notebook you will gain some experience working with some commonly used dimensionality reduction methods. Specifically, there are two distinct classes of algorithms you will explore:    
1. **Dimensionality reduction transformation methods** create operators to map a sample space to an orthogonal space. Typically the original data can be well-represented in lower dimensions in the orthogonal space. Examples of these methods include principal component analysis (PCA) and kernel principal component analysis (Kernel PCA).    
2. **Manifold learning methods** map data in a high-dimensional space onto a 2-dimensional manifold. Manifold learning is primarily used to aid visualization of high-dimensional data.  

To begin, execute the code in the cell below to import the packages you will need. 

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import SpectralEmbedding, MDS
import itertools
import matplotlib.pyplot as plt
import seaborn as sns

## A Synthetic Example

To make the ideas of dimensionality reduction clear, we start with an extremely simple example. In this example, dimensionality reduction is applied to bivariate Normally distributed data. The code in the cell below does the following:  
1. Generate 500 samples from a bivariate, zero-centered Normal distribution with covariance having a high degree of dependency between the variables:  
$$
cov = 
 \begin{bmatrix}
   1.0 & 0.9\\
   0.9 & 1.0
   \end{bmatrix}
$$
2. Print the empirical covariance matrix of the sample.  
3. Plot the simulated data values.

Execute this code.

In [None]:
def plot_normal(X):
    X = pd.DataFrame(X, columns=['axis1','axis2'])
    _=sns.jointplot(x='axis1', y='axis2', data=X, xlim=(-4.5,4.5), ylim=(-4.5,4.5))

cov = [[1, 0.9], [0.9, 1]]
np.random.seed(367)
Normal_random = np.random.multivariate_normal([0.0,0.0], cov, size=500)
print(np.cov(Normal_random[:,0], Normal_random[:,1]))
plot_normal(Normal_random)

Notice the following aspects of these results:  
1. The empirical covariance matrix is very close in values to the covariance matrix used for the simulation.   
2. The scatter plot shows considerable dependency between the two variables. 
3. The marginal distributions of the two variables appear to be close to Normally distributed. 

Next, the code in the cell below does the following.  
1. A PCA object is instantiated and fit to the data.   
2. The PCA model is used to transform or project the original data matrix into the new coordinate system.    
3. The empirical covariance is computed and printed. 
4. The **variance ratio** of the two dimensions of the new space is computed and printed. Here, variance ratio is the variance on each dimension of the space divided by the total variance of the data. 
5. The projected data values are plotted in the new coordinate space. 

Execute this code. 

In [None]:
Normal_pca = PCA().fit(Normal_random)
simple_pca = Normal_pca.transform(Normal_random)
print(np.cov(simple_pca[:,0], simple_pca[:,1]))
print(Normal_pca.explained_variance_ratio_)
plot_normal(simple_pca)

The PCA transformation of these data appears to have worked as expected. Notice the following:  
1. The diagonal terms of the covariance matrix are significantly different in value, indicating that the first component (axis) captures the majority of the variability in the data values.  
2. The off-diagonal terms of the covariance matrix are effectively 0, indicating there is no linear dependency between the variables in the new coordinate system. 
3. The observation that most of the variability of the projected data is explained by the first component is confirmed by both the variance ratio values and the scatter plot. 
4. The marginal distributions of the two variables are still close to Normal, but with significantly different scale or variance. 
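
The variance ratio defined above can also be reproduced without Scikit-learn. The following is a minimal sketch, assuming a fresh hypothetical sample `X` drawn with the same covariance as the simulation (the names `cov_true`, `variance_ratio`, and `projected` are illustrative and not part of the notebook). It takes the eigenvalues of the empirical covariance matrix as the variances along the new axes and checks that the projected data have a diagonal covariance.

```python
import numpy as np

# Hypothetical sample with the same covariance as the simulation above
rng = np.random.default_rng(0)
cov_true = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov_true, size=500)

# Eigen-decomposition of the empirical covariance matrix
cov_emp = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_emp)

# eigh returns eigenvalues in ascending order; reorder so the largest comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Variance ratio: variance along each new axis divided by the total variance
variance_ratio = eigenvalues / eigenvalues.sum()

# Project the centered data onto the eigenvectors and recompute the covariance
projected = (X - X.mean(axis=0)) @ eigenvectors
cov_projected = np.cov(projected, rowvar=False)

print(variance_ratio)
print(cov_projected)
```

The signs of the eigenvectors are arbitrary, so a projection computed this way may be mirrored relative to the one returned by a library, while the variances are unaffected.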

## First Running Example  

We will now start working with some simple real-world data. The famous Iris dataset was collected by a botanist named Edgar Anderson around 1935. Subsequently, the dataset became famous in data analysis circles when Ronald A. Fisher used it as an example in his seminal 1936 paper on discriminant analysis, one of the first true multivariate statistical methods proposed. By modern standards this data set is small (only 150 samples) and simple (only 4 features), but the simplicity will help in understanding the methods at hand.   

The code in the cell below loads the data set and transforms it into a demeaned Pandas data frame with human readable column and species names. Execute this code. 

In [None]:
iris_data = load_iris()

## Demean the data columns 
temp = np.zeros(iris_data['data'].shape)
for i in range(iris_data['data'].shape[1]): 
    temp[:,i] = np.subtract(iris_data['data'][:,i], np.mean(iris_data['data'][:,i], axis=0))

## Prepare the data frame 
target_species = {0:'Setosa',1:'Versicolour',2:'Virginica'}
species = [target_species[x] for x in iris_data['target']]
iris = pd.DataFrame(temp, columns=['sepal_length','sepal_width','petal_length','petal_width'])
iris['species'] = species
iris_data = iris_data['data']
iris
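
The column-by-column demeaning loop in the cell above can also be written with NumPy broadcasting. This is a small sketch on a hypothetical array `X` (not the iris data) showing that the two approaches give the same result.

```python
import numpy as np

# Hypothetical small array standing in for a data matrix (rows = samples, columns = features)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Column-by-column loop, in the style of the cell above
looped = np.zeros(X.shape)
for i in range(X.shape[1]):
    looped[:, i] = X[:, i] - np.mean(X[:, i])

# Broadcasting: subtract the vector of column means from every row
vectorized = X - X.mean(axis=0)

print(np.allclose(looped, vectorized))
print(vectorized.mean(axis=0))
```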

Since there are only 4 features in this dataset a pairs plot will help with understanding the relationships in these data. Execute the code below to display the plot. 

In [None]:
_=sns.pairplot(iris, hue='species')

Examine this plot array. You can see that samples for the Setosa species are well separated. However, there is some overlap between samples from Versicolour and Virginica. Further, and more importantly, it appears that there is considerable redundancy in these plots. This leads one to suspect that there is a high dependency between these features.  

We can further investigate the dependency between the variables by computing the covariance matrix. Execute the code in the cell below to compute the covariance matrix of the iris data. 

In [None]:
np.cov(np.transpose(iris_data))

The off-diagonal terms of the covariance matrix are far from zero. We can conclude that there is significant dependency between the variables.   
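
The transpose in the cell above is needed because `np.cov` treats each row as a variable by default. The sketch below, on a hypothetical random matrix `X` rather than the iris data, shows two equivalent calls and the same quantity written out from its definition.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data matrix: 50 samples (rows) by 3 features (columns)
X = rng.normal(size=(50, 3))

# np.cov treats each ROW as a variable by default, hence the transpose
cov_transposed = np.cov(np.transpose(X))

# Equivalent call that keeps samples in rows
cov_rowvar = np.cov(X, rowvar=False)

# The same quantity written out: centered cross-products divided by (n - 1)
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (X.shape[0] - 1)

print(cov_transposed.shape)
print(np.allclose(cov_transposed, cov_rowvar), np.allclose(cov_transposed, cov_manual))
```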

## Compute PCA of the iris data   

The first algorithm you will apply to the iris data is linear PCA.  

> **Exercise 05-1:** Compute the PCA of the iris data and plot the explained variance of the components by the following steps:  
> 1. Instantiate a Scikit-learn PCA model object with [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).  
> 2. Fit the model to the `iris_data` numpy array using the `fit` method on the model object. Name the fitted model `iris_pca`, since later cells refer to it by that name.  
> 3. Create a scatter plot of the `explained_variance_ratio_` attribute of the fitted model vs. the component number.  

In [None]:
## Put your code below 






> Examine the plot. Does it appear that much of the variance in the data is explained by the first component? Is there any substantial difference in the variance explained by the second, third, and fourth components?   
> **End of exercise.**

Recall that the variance of the components from the PCA goes as the square of the singular values. You can gain another view of the relationship between the principal components by executing the code below to plot the singular values. 

In [None]:
_=plt.scatter(range(1, len(iris_pca.singular_values_) + 1), iris_pca.singular_values_)
_=plt.xlabel('Component number')
_=plt.ylabel('Singular value')
_=plt.title('Singular value vs. component number')
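
The statement that the component variance goes as the square of the singular values can be checked numerically. The following is a sketch on hypothetical correlated data (the names `X`, `variance_from_svd`, and `eigenvalues` are illustrative), assuming the usual sample-variance convention of dividing by n - 1.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated data: 100 samples, 4 features
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Singular values of the centered data matrix
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance along each principal component: squared singular value over (n - 1)
variance_from_svd = singular_values ** 2 / (n_samples - 1)

# Compare with the eigenvalues of the covariance matrix, sorted largest first
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X_centered, rowvar=False)))[::-1]

print(variance_from_svd)
print(eigenvalues)
```

Because of the squaring, a plot of singular values decays more slowly than a plot of explained variance for the same model.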

Next, you will investigate the principal components used to project the data into the new space. Execute the code in the cell below to print the components.  

In [None]:
components = iris_pca.components_
components

> **Exercise 05-2:** The principal components must be unitary (unit norm) and orthogonal. Do the following to verify these properties.  
> 1. In the first cell below compute and print the Euclidean norm of each of the components using [numpy.linalg.norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html).    
> 2. In the second cell below, use [itertools.combinations](https://docs.python.org/3/library/itertools.html) to form each pairwise combination of the components and compute the dot (inner) product of each pair using [numpy.dot](https://numpy.org/doc/stable/reference/generated/numpy.dot.html). 

In [None]:
## Put your code below 



In [None]:
## Put your code below 




> Examine these results. Are these components orthogonal and unitary?   
> **End of exercise.**
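
The two checks in this exercise follow a general pattern that can be sketched on a stand-in matrix. The example below assumes a hypothetical set of orthonormal rows built from a QR factorization of random numbers, not the iris components, so it illustrates the pattern only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical orthonormal rows from a QR factorization (a stand-in, not the iris components)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
rows = Q.T

# Euclidean norm of each row; unit-norm rows give values of 1
norms = np.linalg.norm(rows, axis=1)

# Inner product of every distinct pair of rows; orthogonal rows give values near 0
dot_products = {(i, j): float(np.dot(rows[i], rows[j]))
                for i, j in itertools.combinations(range(rows.shape[0]), 2)}

print(norms)
print(dot_products)
```

In floating-point arithmetic the inner products are rarely exactly zero, so values on the order of 1e-16 should be read as zero.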

> **Exercise 05-3:** The initial exploration of the variance explained and the singular values shows that most of the variance can be explained by the first two components. A PCA model with just 2 components will therefore explain most of the variance in the dataset. To create this model and display the results do the following:   
> 1. Compute the projected data array by instantiating a PCA model object with the argument `n_components=2` and applying the `fit_transform` method to `iris_data`.   
> 2. Plot the transformation of the data with the `plot_pca` function. Make sure you save the returned data frame as `pca_projected`.  

In [None]:
def plot_pca(X, labels):
    pca_projected = pd.DataFrame(X, columns=['Component_1','Component_2'])
    pca_projected['labels'] = labels 
    sns.scatterplot(data=pca_projected, x='Component_1', y='Component_2', hue='labels')
    return pca_projected

## Put your code below 



> Examine the plot you have created. Answer the following questions:  
> 1. How well can these clusters be linearly separated?  
> 2. Is the range of values of the components consistent with the variance of the components?  
> **End of Exercise.**

We can check the independence of the components by computing the covariance. Execute the code in the cell below to display the covariance. 

In [None]:
np.cov(pca_projected.Component_1, pca_projected.Component_2)

Notice that the off-diagonal terms are effectively zero, indicating that the components are uncorrelated. 

## Second Example Dataset

The bowel disease gene dataset has high dimensionality, with over 10,000 features. The question is whether this high-dimensional space can be projected to a lower-dimensional space.   

Execute the code in the cell below to load the data set and prepare it for analysis. 

In [None]:
gene_data = pd.read_csv('../data/ColonDiseaseGeneData-Cleaned.csv')
labels = gene_data.loc[:,'Disease State']
gene_data = gene_data.drop('Disease State', axis=1)

## Demean the columns (subtract each column's mean in a single vectorized step)
gene_data = gene_data - gene_data.mean(axis=0)

## Display the results 
print(gene_data.shape)
print(gene_data.head())

For the 97 subjects there are gene expression values for over 10,000 genes.  

> **Exercise 05-4:** You will now explore the ability of PCA to reduce the dimensionality of the genetics data. To test this idea do the following:   
> 1. Instantiate a PCA object and apply the `fit` method with the `gene_data` as the argument.  
> 2. Print the cumulative sum of the variance explained by applying the [numpy.cumsum](https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html) function to the `explained_variance_ratio_` attribute of the model object. 
> 3. Plot the `explained_variance_ratio_` attribute of the model object vs. the component number. 
> 4. Execute your code. 

In [None]:
## Put your code below 







> Study your plot. Notice that the explained variance ratio decreases rapidly with the component number. Answer the following questions:  
> 1. Does this decay indicate that dimensionality reduction can be significant?   
> 2. Approximately how many components would you estimate are required to account for over 70% of the variance?  
> **End of exercise.**  
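
Reading a threshold off a cumulative sum can be sketched as follows. The example uses hypothetical, geometrically decaying ratios (the names `ratios`, `threshold`, and `n_components_needed` are illustrative), not the values from the gene data, so the count it prints says nothing about the exercise answer.

```python
import numpy as np

# Hypothetical explained-variance ratios that decay geometrically (not the gene data)
ratios = 0.5 ** np.arange(1, 11)
ratios = ratios / ratios.sum()

# Running total of variance explained by the first k components
cumulative = np.cumsum(ratios)

# Smallest k whose cumulative variance reaches the threshold
threshold = 0.70
n_components_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print(n_components_needed)
```

`np.searchsorted` applies here because a cumulative sum of non-negative ratios is sorted in ascending order.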

Now we will display and examine a pairwise scatter plot of the first 8 components of the PCA decomposition of the genetics data. Execute the code in the cell below to display the plot.   

In [None]:
gene_pca_8 = PCA(n_components=8).fit(gene_data)
gene_pca_projected = pd.DataFrame(gene_pca_8.transform(gene_data), columns=['Component_1','Component_2','Component_3','Component_4','Component_5','Component_6','Component_7','Component_8'])
gene_pca_projected['Disease'] = labels
_=sns.pairplot(gene_pca_projected, hue='Disease')

Examine the plot. Notice that most of the component values of the two disease types have significant overlap. However, in some cases there are differences in the values. This indicates that, to some extent, the disease cases are separable.    

## Kernel PCA

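Kernel PCA uses a kernel function to map the sample space nonlinearly into a feature space, and then applies linear PCA in that feature space. The leading components give a lower-dimensional representation that can capture nonlinear structure in the original data.   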
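
The mechanics of kernel PCA can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions, not Scikit-learn's implementation: it uses hypothetical random data `X`, a cosine kernel, and the textbook recipe of centering the kernel matrix and taking its leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 30 samples, 3 features, shifted away from the origin
X = rng.normal(size=(30, 3)) + 2.0
n = X.shape[0]

# Cosine kernel: inner products of the rows after scaling each row to unit length
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
K = X_unit @ X_unit.T

# Center the kernel matrix, which centers the data in the implicit feature space
ones = np.full((n, n), 1.0 / n)
K_centered = K - ones @ K - K @ ones + ones @ K @ ones

# Eigen-decomposition of the symmetric centered kernel, largest eigenvalues first
eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
eigenvalues = eigenvalues[::-1]
eigenvectors = eigenvectors[:, ::-1]

# Projection of the training samples onto the first two kernel principal components
n_components = 2
projected = eigenvectors[:, :n_components] * np.sqrt(eigenvalues[:n_components])

print(np.round(eigenvalues[:4], 4))
print(projected.shape)
```

The eigenvalues of the centered kernel matrix play the role that component variances play in linear PCA, so a rapid decay suggests that a few components suffice.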

> **Exercise 05-5:** You will now apply the nonlinear kernel PCA method to the iris dataset. Do the following:   
> 1. Instantiate a kernel PCA object with the `kernel='cosine'` argument. 
> 2. Use the `fit` method with `iris_data` as the argument.   
> 3. Plot the eigenvalues of the centered kernel matrix, the `eigenvalues_` attribute of the model object (named `lambdas_` in scikit-learn releases before 1.0), vs. the component number.        

In [None]:
## Put your code below 








> Examine the plot of the eigenvalues. Does this plot indicate the dimensionality of the iris data can be reduced to 2 dimensions using the cosine kernel PCA, and why?   
> **End of exercise.**

> **Exercise 05-6:** Now you will create and plot the results of a kernel PCA decomposition of the iris data by the following steps:  
> 1. Instantiate a [sklearn.decomposition.KernelPCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html) object with arguments `n_components=2` and `kernel='cosine'`.  
> 2. Use the `fit_transform` method with the iris data as the argument to compute the projection into the new space. 
> 3. Display a plot of the two components using the `plot_pca` function. 

In [None]:
## Put your code below 



> Examine the plot you have created. Answer the following questions:  
> 1. How well can these clusters be linearly separated?  
> 2. Does the range of values of the components indicate that most of the variance is explained by the first component? 
> **End of Exercise.**

> **Exercise 05-7:** Changing the kernel can have a significant impact on the projected components of the kernel PCA model. To see an example, repeat the steps used in the previous exercise, but with the argument `kernel='rbf'`.

In [None]:
## Put your code below 



> These results are quite different. Which kernel do you think gives a more useful projection of the iris data?   
> **End of exercise.**

Next, we will try kernel PCA on the gene dataset. Execute the code in the cell below and examine the results. 

In [None]:
gene_kernel_pca = KernelPCA(n_components=8, kernel='sigmoid').fit_transform(gene_data)
gene_pca_kernel_projected = pd.DataFrame(gene_kernel_pca, columns=['Component_1','Component_2','Component_3','Component_4','Component_5','Component_6','Component_7','Component_8'])
gene_pca_kernel_projected['Disease'] = labels 
_=sns.pairplot(gene_pca_kernel_projected, hue='Disease')  # plot the kernel PCA projection, not the linear PCA one

Compare this plot to the one created with linear PCA. If the two sets of results appear nearly identical, it may be the case that a nonlinear transformation is not required for these data.

## Spectral Manifold Learning  

Manifold learning seeks to map high-dimensional data onto a low-dimensional linear or nonlinear manifold. In this case we will map to a two dimensional manifold which can be displayed as a plot.  
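
One common form of spectral manifold learning builds an affinity graph over the samples and embeds them using eigenvectors of the graph Laplacian. The sketch below illustrates that idea with NumPy on hypothetical two-group data. The `gamma` value and the scaling of the eigenvectors are assumptions, and Scikit-learn's `SpectralEmbedding` may differ in such details.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: two well-separated groups of 20 points in 5 dimensions
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 5)),
               rng.normal(3.0, 0.5, size=(20, 5))])

# RBF affinity between every pair of samples; gamma is an assumed value
gamma = 1.0 / X.shape[1]
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
W = np.exp(-gamma * sq_dists)

# Symmetric normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}
degree = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degree)
L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Eigenvectors for the smallest eigenvalues; the first is trivial and is skipped
eigenvalues, eigenvectors = np.linalg.eigh(L)
embedding = eigenvectors[:, 1:3] * d_inv_sqrt[:, None]

print(np.round(eigenvalues[:3], 4))
print(embedding.shape)
```

In this toy case the first embedding coordinate takes opposite signs for the two groups, which is the kind of separation a 2-dimensional plot of the embedding is meant to reveal.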

> **Exercise 05-8:** You will now apply spectral manifold learning to the iris dataset by these steps:   
> 1. Instantiate a [sklearn.manifold.SpectralEmbedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html#sklearn.manifold.SpectralEmbedding) object with argument `affinity='rbf'`.
> 2. Use the `fit_transform` method with the iris data as the argument. 
> 3. Display the result using the `plot_pca` function.  

In [None]:
## Put your code below 



> Examine this plot. Which aspects of the data are well separated and which are not?  
> **End of exercise.**

We can also apply spectral manifold learning to the gene data. Execute the code in the cell below and examine the results. 

In [None]:
gene_spectral = SpectralEmbedding(affinity='rbf').fit_transform(gene_data)
pca_projected=plot_pca(gene_spectral, labels)

Notice that this result looks remarkably like the plot of the first two components of the linear PCA. This again indicates that the dependency relationship may be primarily linear.  

#### Copyright 2021, Stephen F Elston. All rights reserved.