**BIO/QTM 385: In class exercise for Monday, August 30th (answers will be the vast majority of Assignment #1, Due 9/8)**

<font color='green'>**Enter your names and group numbers here.**  </font>



For this exercise, you will familiarize yourself with covariance matrices (as well as a bit more python), learning how to calculate them, about some of their interesting properties, and how to interpret a covariance matrix that is inferred from data.  We will discuss all of this more in lecture on Wednesday (so don't worry if things don't make perfect sense yet or if you can't get through the whole exercise), but this will hopefully serve as an introduction to the topic.  To provide some orientation, I have put all questions to be answered in <font color='blue'>blue text</font>, and the spaces where you should put your answers are in <font color='green'>green text</font>.  Sean and I will be cycling through the breakout rooms to answer whatever questions may arise.

Also note that although today's exercise is more abstract, we will start applying these ideas to real data sets over the following two class periods.

####To start, we will import ```numpy```, ```numpy.random```, ```numpy.linalg```, and ```matplotlib```.

In [None]:
import numpy as np
from numpy import random
from numpy import linalg
import matplotlib.pyplot as plt
%matplotlib inline

## The Covariance Matrix


###Defining the Covariance matrix

When dealing with high-dimensional data (especially real-valued data), one of the first things we will typically study is the covariance matrix, $C$.  This matrix describes all of the pairwise linear correlations between the $d$ different variables that are simultaneously measured.  While this matrix has many limitations in terms of its ability to characterize the high-dimensional data (see question below), there is often much to be learned - and it's easy to calculate! - so it's usually a great place to orient oneself when staring at a new dataset.

To be more precise, the covariance matrix is the collection of all linear covariances between each pair of variables.  Mathematically, if $X\in\Re^{N\times d}$ ($N$ is the number of data points, $d$ is the dimensionality of the data), then we define the covariance matrix, $C$, via:
\begin{equation}
C_{ij} = \frac{1}{N}\sum_{k=1}^N (X_{ki} - \bar{X_i})(X_{kj} - \bar{X_j}),
\end{equation}
where 
\begin{equation}
\bar{X_i} = \frac{1}{N}\sum_{k=1}^N X_{ki}
\end{equation}
is the mean of the $i$th variable in $X$.  

Note how if $i=j$, $C_{ii}$ is simply the variance of $X_i$.  Thus, the diagonal terms of the matrix describe how individual variables vary, and each of the off-diagonal terms is the *covariance* between the two associated variables (hence the name of the overall matrix).

While each of the off-diagonal entries in the matrix can be positive (positively correlated), negative (negatively correlated), or zero (uncorrelated), the covariance matrix is symmetric (switching $i$ and $j$ in the definition of $C_{ij}$ above gives the same result, so $C_{ij} = C_{ji}$).  Symmetric matrices have many convenient [properties](https://en.wikipedia.org/wiki/Symmetric_matrix), including the fact that all of their eigenvalues are real.  For covariance matrices in particular, the eigenvalues are also never negative (they are all greater than or equal to zero).  We will use this property later.
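The two properties described above (variances on the diagonal, symmetry) can be checked numerically. The sketch below is only an illustration: the array `X_demo` and the random seed are made up, and it relies on `np.cov`, which is introduced later in this exercise. Passing `bias=True` asks for the $1/N$ normalization used in the definition above.

```python
import numpy as np

# Illustrative data: 200 samples of 3 variables (values are made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

# np.cov expects variables in rows, so pass the transpose;
# bias=True normalizes by N, matching the definition above
C_demo = np.cov(X_demo.T, bias=True)

# Diagonal entries are the variances of the individual columns
print(np.allclose(np.diag(C_demo), np.var(X_demo, axis=0)))

# Swapping i and j leaves every entry unchanged (symmetry)
print(np.allclose(C_demo, C_demo.T))
```

Both checks should print `True`.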

<font color='blue'>Question #1: Name at least three potential limitations of using the covariance matrix to characterize high-dimensional data</font> 

<font color='green'>Type answer here</font>

### Calculating the Covariance matrix

<font color='blue'> Question #2: Given that ```X``` is an ```N``` x ```d``` ```numpy``` array of real valued numbers, write a short script in the box below that (i) initializes a ```d``` x ```d``` ```numpy``` array, ```C``` and (ii) uses two ```for``` loops to fill in each element in ```C``` with the appropriate covariance matrix value, as defined above.</font>

In [None]:
#write answer to question #2 here

Despite the question above, you typically will calculate this matrix in ```numpy``` using [```np.cov()```](https://numpy.org/doc/stable/reference/generated/numpy.cov.html).  Note, however, that this function assumes that the input array takes the form of a (# of dimensions) by (# of data points) matrix (i.e., $d\times N$ rather than $N\times d$), so if your data set is in the latter format, you will need to transpose it first (e.g., ```C = np.cov(X.T)```).
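Two details of ```np.cov()``` are easy to trip over: the orientation of the input, and the fact that its default normalization is $1/(N-1)$ rather than the $1/N$ used in the definition above. The short sketch below illustrates both on a made-up array (`X_demo` and the seed are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))   # N = 50 samples, d = 4 variables

# Without transposing, numpy treats each of the 50 rows as a variable
print(np.cov(X_demo).shape)                  # (50, 50)

# Two equivalent ways to get the intended d x d matrix
C_a = np.cov(X_demo.T)
C_b = np.cov(X_demo, rowvar=False)
print(C_a.shape, np.allclose(C_a, C_b))      # (4, 4) True

# The default normalization is 1/(N-1); bias=True switches to 1/N
N = X_demo.shape[0]
C_biased = np.cov(X_demo.T, bias=True)
print(np.allclose(C_biased, C_a * (N - 1) / N))
```

For large $N$ the two normalizations differ very little, but the difference is worth remembering when comparing a hand-written calculation against ```np.cov()```.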

<font color='blue'> Question #3: In the box below, generate a 1000 x 5 matrix, ```Y```, of gaussian random numbers ($\mu=0$ and $\sigma=3$). Then calculate and display its 5 x 5 covariance matrix.  Do the off-diagonal and diagonal entries match your expectations? Why or why not? (You can go back to the python tutorial for a refresher on generating random numbers)</font>

In [None]:
#write answer to question #3 here

<font color='green'>Do the off-diagonal and diagonal entries match your expectations?  Why or why not?</font> 



###Eigenvalues and eigenvectors of the covariance matrix

Lastly, an important means of characterizing the covariance matrix is through its eigenvectors ($\hat{v}_1,\hat{v}_2,\ldots,\hat{v}_d$) and eigenvalues ($\lambda_1,\lambda_2,\ldots,\lambda_d$).  Importantly, these quantities often will provide information about how multiple variables are (linearly) interacting.

For those who haven't had linear algebra yet (or are in need of a brief review), eigenvectors and eigenvalues are pairs of scalar quantities (eigenvalues) and vectors (naturally, eigenvectors), that satisfy the equation:
\begin{equation}
C \hat{v}_i = \lambda_i \hat{v}_i,
\end{equation}
where $C$ is the covariance matrix.

For symmetric, real, matrices like the covariance matrix, the eigenvectors and eigenvalues obey the following properties (amongst others):
- The eigenvectors are all orthogonal to one another ($\hat{v}_i\cdot \hat{v}_j = 0$ if $i\ne j$)
- By convention, each eigenvector has a norm of 1 ($\hat{v}_i \cdot \hat{v}_i = 1$).
- The eigenvectors span $\Re^d$.  In other words, any $d$-dimensional real vector can be written as a linear combination of the eigenvectors
- All of the eigenvalues must be greater than or equal to zero.  They are strictly greater than zero unless some linear combination of the (mean-subtracted) data columns generating $C$ is exactly zero, for example when one column is an exact multiple of another (a highly unlikely thing to have happen with real data).


More conceptually, we can think of the eigenvectors as the directions along which the variance in the data lies.  More precisely, if we order the eigenvalues such that $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d \ge 0$, then $\hat{v}_1$ is the direction in the data set where we see the most variance, $\hat{v}_2$ is the direction with the second-most variance (among directions orthogonal to $\hat{v}_1$), and so on.  Moreover, $\lambda_i$ is the variance in the data when projecting it onto $\hat{v}_i$.  Don't worry, however, if this doesn't quite click now.  We will discuss this property in detail on Wednesday when we introduce Principal Components Analysis (PCA).
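To make the claim that $\lambda_i$ is the variance of the data projected onto $\hat{v}_i$ a bit more concrete, here is a minimal sketch on made-up 2-D data. The data set, its parameters, and the variable names are illustrative assumptions. It uses ```np.linalg.eigh```, NumPy's eigenvalue routine for symmetric matrices (it returns eigenvalues in ascending order), rather than the ```linalg.eig()``` function used in the rest of this exercise.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative 2-D data with correlated columns (parameters are made up)
X_demo = rng.multivariate_normal([0, 0], [[4, 1.5], [1.5, 1]], size=500)

C_demo = np.cov(X_demo.T)                            # 1/(N-1) normalization
eigvals_demo, eigvecs_demo = np.linalg.eigh(C_demo)  # ascending eigenvalues

# Project the mean-centered data onto each eigenvector (one column each)
X_centered = X_demo - X_demo.mean(axis=0)
projections = X_centered @ eigvecs_demo

# Variance along each eigenvector (ddof=1 to match np.cov's normalization)
proj_var = projections.var(axis=0, ddof=1)
print(proj_var)
print(eigvals_demo)
```

The two printed vectors should agree up to floating-point error, since the variance of the data along a unit vector $\hat{v}$ is $\hat{v}^T C \hat{v}$, which equals $\lambda_i$ when $\hat{v} = \hat{v}_i$.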

####Calculating eigenvalues and eigenvectors in ```numpy```

To find the eigenvectors and eigenvalues of a matrix in ```numpy```, you can use the ```linalg.eig()``` function.  Here, the syntax is 

```eigvals,eigvecs = linalg.eig(A) ```

where ```A``` is the matrix whose eigenvalues and eigenvectors you wish to find, ```eigvals``` is a length-$d$ vector of eigenvalues, and ```eigvecs``` is a $d\times d$ matrix of eigenvectors (each column is associated with the corresponding eigenvalue -- e.g., ```eigvecs[:,0]``` is the eigenvector paired with ```eigvals[0]```).
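As a small illustration of this syntax, the sketch below applies ```linalg.eig()``` to a hand-picked $2\times 2$ symmetric matrix (chosen for this example only; its eigenvalues, 3 and 1, can be found by hand) and checks the eigenvalue equation and the orthonormality of the columns. The names ```eigvals_A``` and ```eigvecs_A``` are just for this example.

```python
import numpy as np
from numpy import linalg

# A small symmetric matrix whose eigenvalues (3 and 1) are easy to find by hand
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals_A, eigvecs_A = linalg.eig(A)

# Each column of eigvecs_A pairs with the eigenvalue at the same index
for k in range(len(eigvals_A)):
    v = eigvecs_A[:, k]
    print(eigvals_A[k], np.allclose(A @ v, eigvals_A[k] * v))

# Columns are orthonormal, so V^T V is the identity matrix
print(np.allclose(eigvecs_A.T @ eigvecs_A, np.eye(2)))
```

Note that ```linalg.eig()``` makes no promise about the order in which the eigenvalues are returned, which is why sorting is often a useful next step.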

<font color='blue'> Question #4: Calculate and display the eigenvectors and eigenvalues for the matrix you generated in Question #3.  How do these findings compare to what you would expect for a matrix with random entries? </font>

In [None]:
#write your answer to question #4 here

<font color = "green">  How do these findings compare to what you would expect for a matrix with random entries?</font>

As a note, it is often useful to sort the eigenvalues/vectors according to the eigenvalues.  To achieve this, you can use the following script:

In [None]:
idx = eigvals.argsort()[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:,idx]
print(eigvals)

## Measuring the covariance matrix

Like any other empirically derived quantity, the covariance matrix is a quantity that we estimate from data.  Thus, we are often tasked with figuring out what aspects of the apparent structure in our measured covariance matrix are "real" and what aspects could be generated by statistical fluctuations.  In this section, we will investigate aspects of this problem, but we will return to this topic (in a more formal manner) later in the course.

###Calculating the covariance matrix from samples

For this section, we will start by looking at samples from data drawn from a 3-D gaussian (normal) distribution with a mean of zero ($\vec{\mu} = \left(\begin{matrix}
0 \\
0 \\ 
0
\end{matrix}\right)$) and a covariance matrix, $C = \left(\begin{matrix}
9 & 4 & -2\\
4 & 4 & 1/10 \\
-2 & 1/10 & 1
\end{matrix}\right)$.  The code in the cell below will generate and plot a data set of 100 random samples from the distribution.

In [None]:
#initialize parameters
N = 100
cov_matrix = [[9,4,-2],[4,4,.1],[-2,.1,1]]
mean_data = [0,0,0]

#draw N random samples from the gaussian
data = random.multivariate_normal(mean_data,cov_matrix,size=N)

#plotting each column against each other column
fig = plt.figure(figsize=(12,12))
plt.subplot(3,3,2)
plt.plot(data[:,0],data[:,1],'.')
plt.title('$x_0$ vs. $x_1$')

plt.subplot(3,3,3)
plt.plot(data[:,0],data[:,2],'.')
plt.title('$x_0$ vs. $x_2$')

plt.subplot(3,3,4)
plt.plot(data[:,1],data[:,0],'.')
plt.title('$x_1$ vs. $x_0$')

plt.subplot(3,3,6)
plt.plot(data[:,1],data[:,2],'.')
plt.title('$x_1$ vs. $x_2$')

plt.subplot(3,3,7)
plt.plot(data[:,2],data[:,0],'.')
plt.title('$x_2$ vs. $x_0$')

plt.subplot(3,3,8)
plt.plot(data[:,2],data[:,1],'.')
plt.title('$x_2$ vs. $x_1$')

plt.show()

<font color='blue'> Question #5: Calculate and print the mean and the standard error of the mean (s.e.m. = $\sigma/\sqrt{N-1}$) for the samples you just generated.  Is your value for $\mu$ within error of zero? (Note: to take the mean along a column of data, you need to use ```np.mean(data,axis=0)```)</font>

In [None]:
#type your answer to Question #5 here

<font color = "green">  Is your value for $\mu$ within error of zero?</font>

<font color='blue'> Question #6: Calculate and print the covariance matrix for the same samples as in Question #5.  How does it compare to the actual covariance matrix?</font>

In [None]:
#type your answer to Question #6 here

<font color = "green">  How does it compare to the actual covariance matrix?</font>

###Distributions of sampled covariance matrices

This, of course, is just one simulated data set.  To assess how having a finite sample size (and all data sets are of finite size!) affects our results, we need to look at how these estimates vary across many simulated data sets.

The code below will calculate the means, covariance matrices, and eigenvalues (sorted from largest to smallest) for 20,000 different instantiations of the 3-D gaussian described above.  ```means``` is a 20,000 x 3 matrix of the means from each instantiation, ```cov_matrices``` is a 20,000 x 3 x 3 array of covariance matrices, and ```eigenvalues``` is a 20,000 x 3 array of eigenvalues, arranged from largest to smallest.

In [None]:
numDataSets = 20000
N = 100
cov_matrix = [[9,4,-2],[4,4,.1],[-2,.1,1]]
mean_data = [0,0,0]

cov_matrices = np.zeros((numDataSets,3,3))
means = np.zeros((numDataSets,3))
eigenvalues = np.zeros((numDataSets,3))
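for i in range(numDataSets):
    #draw a fresh data set and store its covariance matrix and column means
    temp_data = random.multivariate_normal(mean_data,cov_matrix,size=N)
    cov_matrices[i,:,:] = np.cov(temp_data.T)
    means[i,:] = np.mean(temp_data,axis=0)
    #find the eigenvalues and sort them in place from largest to smallest
    eigvals,eigvecs = linalg.eig(cov_matrices[i,:,:])
    eigvals[::-1].sort()
    eigenvalues[i,:] = eigvals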



<font color='blue'> Question #7: Using the results from the code above, make 50-bin histograms for each of the nine covariance matrix entries (you can use the same ```plt.subplot()``` code as shown above).  Do the actual values lie within the found distributions?</font>

In [None]:
#type your answer to Question #7 here

<font color='green'> Do the actual values lie within the found distributions?</font>

<font color='blue'> Question #8: Using the results from the code above, make 50-bin histograms for each of the three covariance matrix eigenvalues.  Do the distributions look gaussian, or are they asymmetric?</font>


In [None]:
#type your answer to Question #8 here

<font color='green'> Do the distributions look gaussian, or are they asymmetric? </font>

###Assessing statistical significance for eigenvalues

One way of performing *dimensionality reduction* (i.e., translating a high-dimensional representation into a lower-dimensional one) is to project a data set onto the space of eigenvectors corresponding to eigenvalues that are "significantly" different from zero.  Below, we will see one way of performing this assessment via independently shuffling the columns of a data matrix.

(Editorial note: I put "significantly" in scare quotes here, as I don't really like the term -- it's really just a way to pretend that an arbitrary threshold is really a magic value that delineates success from failure.  OK, rant over.  For now.)
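The projection step mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the number of kept directions, `k = 2`, is an arbitrary choice made for the example (it is not the result of any significance test), the data are freshly simulated, and ```np.linalg.eigh``` (NumPy's routine for symmetric matrices) is used in place of ```linalg.eig()```.

```python
import numpy as np

rng = np.random.default_rng(3)
cov_demo = [[9, 4, -2], [4, 4, 0.1], [-2, 0.1, 1]]
X_demo = rng.multivariate_normal([0, 0, 0], cov_demo, size=100)

# Eigen-decomposition of the sample covariance, sorted largest first
eigvals_demo, eigvecs_demo = np.linalg.eigh(np.cov(X_demo.T))
order = np.argsort(eigvals_demo)[::-1]
eigvals_demo, eigvecs_demo = eigvals_demo[order], eigvecs_demo[:, order]

# Hypothetical choice: keep the top k = 2 directions
k = 2
X_centered = X_demo - X_demo.mean(axis=0)
X_reduced = X_centered @ eigvecs_demo[:, :k]   # shape (100, 2)

print(X_reduced.shape)
# Fraction of the total variance retained by the k kept directions
print(eigvals_demo[:k].sum() / eigvals_demo.sum())
```

Each row of `X_reduced` is the original 3-D data point re-expressed using only its coordinates along the two kept eigenvectors.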

Let's go back again to our 3-D gaussian matrix from before.

In [None]:
#initialize parameters
N = 100
cov_matrix = [[9,4,-2],[4,4,.1],[-2,.1,1]]
mean_data = [0,0,0]

#draw N random samples from the gaussian
data = random.multivariate_normal(mean_data,cov_matrix,size=N)

Because the covariance matrix has off-diagonal terms, there are correlations between the three measured variables.  A natural question to ask, though, is: *if there were no correlations, might we still see some by accident just because we've only measured a finite number of data points?*

One way to answer this question is to shuffle each of the columns independently from one another.  Thus, the resulting matrix **should** have no correlations, but, of course, will display some correlations due to noise from finite sampling.  We then will measure the resulting eigenvalues from these shuffled matrices, repeat the process multiple times, and will compare the resulting distribution of eigenvalues from shuffled matrices to the eigenvalues from the measured samples.

First, to shuffle the columns in the matrix, we can use the following code:

In [None]:
shuffled_data = np.copy(data)
#note: we can't say "shuffled_data = data" like in matlab or R, because then, 
#the created variable would be linked to the old variable
for i in range(3):
  random.shuffle(shuffled_data[:,i])
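
Two things about the shuffling code above are worth checking for yourself: that ```np.copy``` really does protect the original array, and that shuffling a column changes only the order of its values, not the values themselves. The sketch below illustrates both on a made-up stand-in array (`data_demo`). As an assumption for this example, it shuffles with ```rng.permutation``` (which returns a shuffled copy of a column) instead of the in-place ```random.shuffle``` used above.

```python
import numpy as np

rng = np.random.default_rng(4)
data_demo = rng.normal(size=(100, 3))        # stand-in for `data`

alias = data_demo                            # same underlying memory
copied = np.copy(data_demo)                  # independent memory
print(np.shares_memory(alias, data_demo))    # True
print(np.shares_memory(copied, data_demo))   # False

# Shuffle each column of the copy independently
# (rng.permutation returns a shuffled copy of a column)
for i in range(copied.shape[1]):
    copied[:, i] = rng.permutation(copied[:, i])

# Each column keeps exactly the same values, so its mean and variance
# are unchanged; only the pairing of values across columns is scrambled
print(np.array_equal(np.sort(copied, axis=0), np.sort(data_demo, axis=0)))
```

Because each column's values are preserved, the diagonal of the covariance matrix is unaffected by shuffling; only the off-diagonal terms change.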

<font color='blue'> Question #9: Calculate the covariance matrix eigenvalues (sorted from largest to smallest) from the ```data``` matrix above.</font>


In [None]:
#type your answer to Question #9 here

<font color='blue'> Question #10: Perform the shuffling analysis described in the paragraph above on the ```data``` matrix (perform the shuffling 1,000 times).  Create 50-bin histograms for each of the three eigenvalues (remember to sort!). </font>


In [None]:
#type your answer to Question #10 here

<font color='blue'> Question #11: In the histograms above, are the resulting distributions symmetric or non-symmetric?  If the latter, is there a more definitive upper-bound or a more definitive lower-bound?</font>


<font color='green'> Type your answer to Question #11 here </font>

<font color='blue'> Question #12: Given your answers to Questions #9, 10, and 11, which of the three eigenvalues that you've measured from ```data``` would you consider "significant"? Explain your reasoning.</font>


<font color='green'> Type your answer to Question #12 here </font>
