Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Lab 5: Principal Component Analysis  

In this lab you will be working with a synthetic dataset that describes geometrical attributes of talus samples, used to characterize a failed rock mass. 
You will be using principal component analysis (PCA) to better understand the concept of eigenvectors and eigenvalues, and to gain statistical insight into the dataset. 
Catching up on your textbook reading is highly recommended before starting this lab.  


----

# Exercise 1  

__Description__:  

   Read the data file `talus.txt` in as a NumPy array. The file contains a data matrix, where each of the 20 rows represents an individual talus sample. 
   Each of the 5 columns represents a measured or derived geometric property in the following order: (1) length of long axis (m), (2) length of intermediate axis (m), (3) length of short axis (m), (4) longest diagonal axis (m), (5) surface-area-to-volume ratio (m$^2$/m$^3$). 
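
As a rough sketch of the loading step, the example below reads a small made-up table with `np.loadtxt`. It assumes the file is whitespace-delimited with no header row, which may not match `talus.txt`; the numbers and the name `fake_file` are illustrative only and are not the lab data.

```python
import io
import numpy as np

# Made-up stand-in for talus.txt: 3 samples x 5 properties, whitespace-delimited.
# These numbers are illustrative only and are NOT the lab data.
fake_file = io.StringIO(
    "2.0 1.5 1.0 2.7 4.1\n"
    "3.1 2.2 1.4 4.0 2.9\n"
    "1.2 0.9 0.5 1.6 7.3\n"
)

# For the real file, pass the path instead, e.g. np.loadtxt("talus.txt")
X = np.loadtxt(fake_file)

print(X.shape)        # (rows, columns) = (samples, properties)
long_axis = X[:, 0]   # first column: length of long axis (m)
```

If the real file uses a different delimiter or has a header, the `delimiter` and `skiprows` arguments of `np.loadtxt` would need to be set accordingly.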

__Questions__:  

  a. Give a specific example of a problem (outside of the lecture notes and textbook reading) in which principal component analysis would be useful.  
  b. Compute the correlation matrix for the length of the long and intermediate axes of the samples (the first two columns) without using the `NumPy` function `np.corrcoef()`.  
  c. Calculate the eigenvalues of the correlation matrix in (b). Check your answer by working out the problem by hand (see submission directions).   
  d. Calculate the eigenvectors of the correlation matrix in (b). Check your answer by working out the problem by hand.   
  e. What percent of the variance in the dataset is explained by each principal component?
  
__Note__: For handwritten questions (1.c and 1.d) please hand in a single scanned PDF of your work. If you are comfortable with LaTeX, feel free to instead typeset your work directly in this notebook in the markdown cells provided. 

---
a. Give a specific example of a problem (outside of the lecture notes and textbook reading) in which principal component analysis would be useful. 

---

YOUR ANSWER HERE

---
b. Compute the correlation matrix for the length of the long and intermediate axes of the samples (the first two columns) without using the `NumPy` function `np.corrcoef()`.  

__Hint__ : Compare your manually calculated answer to the output of `np.corrcoef()`. 
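
One possible way to make the comparison in the hint, sketched on synthetic data rather than the talus samples: standardize each column, form $Z^T Z / (n-1)$, and check the result against `np.corrcoef()`. The array `A` and the random seed are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 samples, 2 correlated variables (not the talus data)
x = rng.normal(size=20)
A = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=20)])

# Standardize each column: subtract the mean, divide by the sample standard deviation
Z2 = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# Correlation matrix of the standardized data: Z^T Z / (n - 1)
n = A.shape[0]
R = Z2.T @ Z2 / (n - 1)

# Cross-check against NumPy's built-in (columns are variables, so rowvar=False)
print(np.allclose(R, np.corrcoef(A, rowvar=False)))
```

Note that `ddof=1` in the standard deviation must match the `n - 1` in the denominator; mixing conventions leaves the diagonal slightly off from 1.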

---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

---
  c. Calculate the eigenvalues of the correlation matrix in (b). Check your answer by working out the problem by hand (see submission directions). __Hint__: it's fine to use `LA.eigvals` for the Python calculation.   
  
__Note__: For handwritten questions (1.c and 1.d) please hand in a single scanned PDF of your work. 
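
For a $2 \times 2$ correlation matrix with off-diagonal entry $r$, the characteristic equation $(1-\lambda)^2 - r^2 = 0$ gives $\lambda = 1 \pm r$. The sketch below checks that closed form numerically; the value `r = 0.6` is a hypothetical choice, and `LA` is assumed to be `numpy.linalg` imported under that alias.

```python
import numpy as np
from numpy import linalg as LA

r = 0.6  # hypothetical correlation coefficient, chosen for illustration
R = np.array([[1.0, r],
              [r, 1.0]])

# Numerical eigenvalues (order is not guaranteed, so sort descending)
evals = np.sort(LA.eigvals(R))[::-1]

# Closed-form roots of (1 - lambda)^2 - r^2 = 0
by_hand = np.array([1 + r, 1 - r])

print(evals, by_hand)
```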
  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

---
d. Calculate the eigenvectors of the correlation matrix in (b). Check your answer by working out the problem by hand. 

__Note__: For handwritten questions (1.c and 1.d) please hand in a single scanned PDF of your work. 
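
When comparing numerical eigenvectors with a hand calculation, keep in mind that an eigenvector is only defined up to sign and scale. The sketch below uses `LA.eigh`, which is one reasonable choice for a symmetric matrix (`LA.eig` would also work), on a hypothetical correlation matrix with `r = 0.6`.

```python
import numpy as np
from numpy import linalg as LA

r = 0.6  # hypothetical correlation coefficient
R = np.array([[1.0, r],
              [r, 1.0]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
evals, evecs = LA.eigh(R)

# Each COLUMN of evecs is a unit-length eigenvector, so R v = lambda v
for lam, v in zip(evals, evecs.T):
    print(lam, v, np.allclose(R @ v, lam * v))
```

A hand calculation for this matrix gives directions proportional to $(1, 1)$ and $(1, -1)$; the numerical result may differ from yours by a factor of $-1$, which is still a valid answer.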
  
---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

--- 
e. What percent of the variance in the dataset is explained by each principal component?
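
Each eigenvalue is the variance along its principal component, and the eigenvalues sum to the total variance (the trace of the correlation matrix). A minimal sketch of the percentage calculation, using hypothetical eigenvalues rather than the ones from your data:

```python
import numpy as np

# Hypothetical eigenvalues of a 2x2 correlation matrix with r = 0.6
evals = np.array([1.6, 0.4])

# Fraction of the total variance carried by each component, as a percentage
percent = 100 * evals / evals.sum()
print(percent)
```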

---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Exercise 2

__Description__:  

For this exercise, you will use the `sklearn` class `PCA()` with the standardized (Z) form of the data stored in `talus.txt` as input. `PCA()` performs a principal component analysis on the data and provides a matrix of loads (in `pca.components_`, each row is an eigenvector), a matrix of the data projected onto the eigenvectors (the scores), and a list of the eigenvalues (latency):

```python
from sklearn.decomposition import PCA

# Initialize the model
pca = PCA()

# Use the standardized data Z
pca.fit(Z)

# Get PCA coefficients (i.e. loads)
loads   = pca.components_

# Scores (i.e. projection of the data into principal-component space)
scores  = pca.transform(Z)

# Latency (i.e. variance explained)
latency = pca.explained_variance_
```
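
The snippet above assumes `Z` already exists. The sketch below shows one way to build it and to sanity-check the fitted model, using a random matrix as a stand-in for the talus data (the seed and the array `X` are assumptions for illustration). It also confirms two properties worth knowing: `components_` stores one eigenvector per row, and `explained_variance_` equals the eigenvalues of the correlation matrix when the data are standardized with the sample standard deviation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))   # synthetic stand-in for the talus matrix

# Standardize: zero mean and unit sample standard deviation per column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA().fit(Z)

# components_ has shape (n_components, n_features): each ROW is an eigenvector
print(pca.components_.shape)

# explained_variance_ should match the eigenvalues of the correlation matrix
R = np.corrcoef(X, rowvar=False)
evals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.allclose(pca.explained_variance_, evals))
```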


__Questions__:   

a. Plot the correlation matrix using `plt.imshow()`. Make sure to add a color bar to your figure. What does the figure tell you about the data? What geometric properties are positively related, inversely related, or unrelated? __(5 points)__  

b. Plot the data projected onto the principal components (scores) using colour to represent the values. By eye, how many principal components appear to account for most of the variance in the data? __(5 points)__   

c. Plot the loads that define the first two principal components in a scatterplot. What does this figure tell you about the data? Be sure to include labels for each of the data points (see example code from lecture).  __(5 points)__ 

d. Plot the data projected onto the first two principal components in a scatterplot. What does this figure tell you about the data? Again, be sure to include labels for each of the data points (see example code from lecture). __(5 points)__   

e. How many principal components are required to explain 90% of the variance in the data? Show this with a calculation and figure. __(3 points)__ 

---
a. Plot the correlation matrix using `plt.imshow()`. Make sure to add a color bar to your figure. What does the figure tell you about the data? What geometric properties are positively related, inversely related, or unrelated? __(5 points)__

_Hint_: Assuming you have an axes object `ax`, use the following code: 
```python 
# List of column names corresponding to the talus.txt matrix
cols = ['long axis','intermediate axis','short axis','diagonal axis','surface area/volume']

.
. # initialize figure and axes, plus do the required calculations
.

# Set tick locations to 0,1,2,3,4 (i.e. one for each column)
ax.set_yticks(np.arange(0,5))
# Set tick labels as the corresponding column names
ax.set_yticklabels(cols)

```
This code snippet can be reused for both the x and y axes. 
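
For reference, here is a minimal end-to-end sketch of such a figure on random stand-in data (not the talus samples). The colormap `RdBu_r` and the fixed colour limits of $[-1, 1]$ are choices made here so that zero correlation sits at the middle of the scale; other choices are equally valid.

```python
import numpy as np
import matplotlib.pyplot as plt

cols = ['long axis', 'intermediate axis', 'short axis',
        'diagonal axis', 'surface area/volume']

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))        # synthetic stand-in, not the talus data
R = np.corrcoef(X, rowvar=False)    # 5x5 correlation matrix

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour limits to [-1, 1] so that zero correlation is mid-scale
im = ax.imshow(R, cmap='RdBu_r', vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label='correlation coefficient')

# Same tick pattern on both axes, one tick per column of the data matrix
ax.set_xticks(np.arange(5))
ax.set_xticklabels(cols, rotation=45, ha='right')
ax.set_yticks(np.arange(5))
ax.set_yticklabels(cols)
fig.tight_layout()
```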


---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

---
b. Use the code snippet above to calculate the `loads`, `scores` and `latency` for the normalized data using `sklearn`'s `PCA`.  Plot the data projected onto the principal components (scores) using colour to represent the values. By eye, how many principal components appear to account for most of the variance in the data? __(5 points)__ 

_Warning_: Make sure you normalize your data before performing PCA. 
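
One way to show the scores with colour is to draw the whole score matrix as an image, with samples along one axis and principal components along the other. This is only a sketch on random stand-in data; the symmetric colour limits and the use of `fit_transform` (equivalent to calling `fit` and then `transform` on the same data) are choices made here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))   # synthetic stand-in, not the talus data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Project the standardized data onto the principal components
scores = PCA().fit_transform(Z)

fig, ax = plt.subplots()
lim = np.abs(scores).max()     # symmetric colour limits around zero
im = ax.imshow(scores, cmap='RdBu_r', vmin=-lim, vmax=lim, aspect='auto')
fig.colorbar(im, ax=ax, label='score')
ax.set_xlabel('principal component')
ax.set_ylabel('sample')
ax.set_xticks(np.arange(scores.shape[1]))
ax.set_xticklabels(np.arange(1, scores.shape[1] + 1))
```

Columns whose colours stay close to the middle of the scale carry little variance.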


---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

---  
c. Plot the loads that define the first two principal components in a scatterplot. What does this figure tell you about the data? Be sure to include labels for each of the data points (see example code from lecture).  __(5 points)__     
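
If you do not have the lecture code to hand, the labelling can be done with `ax.annotate`, as in the sketch below. It runs on random stand-in data, so the pattern of points is meaningless; only the plotting mechanics carry over. The label offsets and axis names are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

cols = ['long axis', 'intermediate axis', 'short axis',
        'diagonal axis', 'surface area/volume']

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 5))   # synthetic stand-in, not the talus data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pca = PCA().fit(Z)

# Rows of components_ are the principal axes; take the first two
pc1, pc2 = pca.components_[0], pca.components_[1]

fig, ax = plt.subplots()
ax.scatter(pc1, pc2)
# Label each point with the variable it belongs to
for name, x, y in zip(cols, pc1, pc2):
    ax.annotate(name, (x, y), textcoords='offset points', xytext=(5, 5))
ax.axhline(0, color='grey', lw=0.5)
ax.axvline(0, color='grey', lw=0.5)
ax.set_xlabel('PC 1 load')
ax.set_ylabel('PC 2 load')
```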


--- 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

---  
d. Plot the data projected onto the first two principal components in a scatterplot. What does this figure tell you about the data? Again, be sure to include labels for each of the data points (see example code from lecture). __(5 points)__   


---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

---
e. How many principal components are required to explain 90% of the variance in the data? Show this with a calculation and figure. __(3 points)__ 
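
The calculation amounts to a cumulative sum of the eigenvalues divided by their total. A minimal sketch with hypothetical eigenvalues (the array `latency` below is made up and is not the output of your PCA):

```python
import numpy as np

# Hypothetical eigenvalues for five standardized variables (they sum to 5)
latency = np.array([2.9, 1.2, 0.5, 0.3, 0.1])

# Cumulative fraction of the total variance explained
cumulative = np.cumsum(latency) / latency.sum()

# Index of the first component at which the cumulative fraction reaches 0.90,
# plus one to convert from a zero-based index to a count
n_needed = int(np.argmax(cumulative >= 0.90)) + 1
print(cumulative, n_needed)
```

For the figure, plotting `cumulative` against the component number with a horizontal line at 0.90 makes the threshold crossing easy to read off.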

---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Feel free to add cells if needed