1\. **PCA on 3D dataset**

* Generate a dataset with 3 features, each with N entries (N being ${\cal O}(1000)$). With $N(\mu,\sigma)$ denoting the normal distribution with mean $\mu$ and standard deviation $\sigma$, generate the 3 variables $x_{1,2,3}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$

In [None]:
import numpy as np
from scipy import linalg as la
import matplotlib.pyplot as plt

In [None]:
N=1000
rng=np.random.default_rng(seed=94)
x1=rng.normal(0.,1.,N)
x2=x1+rng.normal(0.,3.,N) # ---> NB x2 = x1 + N(0,3), as required by the exercise
x3=2*x1+x2
data_matrix=np.concatenate((x1[:,None], x2[:,None], x3[:,None]), axis=1)
print(data_matrix, data_matrix.shape)

* Find the eigenvectors and eigenvalues of the covariance matrix of the dataset
* Find the eigenvectors and eigenvalues using SVD (---> on the centered data matrix). Check that the two procedures yield the same result

In [None]:
x1_centered=x1-x1.mean()
x2_centered=x2-x2.mean()
x3_centered=x3-x3.mean()
data_matrix_centered=np.concatenate((x1_centered[:,None], x2_centered[:,None], x3_centered[:,None]), axis=1)
covariance_matrix=(data_matrix_centered.T).dot(data_matrix_centered)/(N-1) # ---> NB the N-1
#print(covariance_matrix)
#element=np.where(covariance_matrix>1, covariance_matrix, np.nan) # ---> keeps elements > 1, NaN elsewhere
#print(element)

eig_vals, eig_vects= la.eig(covariance_matrix) # ---> NB eig(cov)
print(np.real_if_close(eig_vals), '\n \n', eig_vects, '\n \n')

U, spectrum, Vt = la.svd(data_matrix_centered) # ---> could also divide by sqrt(N-1) here and forget about it later
# ---> NB svd(data_ctrd)
print(U, '\n \n', spectrum, '\n \n', Vt)
# ---> SVD returns the singular values (hence eigvals->eigvects) ordered from largest to smallest; eig gives NO such guarantee

print(np.allclose(np.sort(eig_vals)[::-1], spectrum**2/(N-1))) # ---> also try it for matrix? (as an exercise)
# ---> NB the **2 / N-1
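
The check above compares eigenvalues only. A minimal sketch of the matching check on eigenvectors follows, assuming a freshly generated dataset built as in the exercise statement (the seed and the names `X`, `vals`, `vects`, `overlaps` are illustrative, not part of the original cells). It swaps `la.eig` for `la.eigh`, which is meant for symmetric matrices and returns real eigenvalues in ascending order, and it compares directions only up to a sign, since an eigenvector and its negative are equally valid.

```python
import numpy as np
from scipy import linalg as la

# Assumption: a dataset built as in the exercise statement, with an arbitrary seed.
rng = np.random.default_rng(seed=0)
N = 1000
x1 = rng.normal(0., 1., N)
x2 = x1 + rng.normal(0., 3., N)
x3 = 2 * x1 + x2
X = np.column_stack((x1, x2, x3))
X = X - X.mean(axis=0)  # center each column

# Route 1: eigendecomposition of the covariance matrix.
# eigh returns ascending eigenvalues, so reorder to descending.
vals, vects = la.eigh(X.T @ X / (N - 1))
order = np.argsort(vals)[::-1]
vals, vects = vals[order], vects[:, order]

# Route 2: SVD of the centered data; the rows of Vt are the principal directions.
_, s, Vt = la.svd(X, full_matrices=False)

eigvals_match = np.allclose(vals, s**2 / (N - 1))

# Column i of vects should equal row i of Vt up to a sign,
# so the absolute value of their dot product should be 1.
overlaps = np.abs(np.sum(vects * Vt.T, axis=0))
eigvects_match = np.allclose(overlaps, 1.0)

print(eigvals_match, eigvects_match)
```

Comparing `vects` and `Vt.T` directly with `np.allclose` can fail even when both routes agree, because the sign of each column is arbitrary.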

* What percent of the total dataset's variability is explained by the principal components? Given how the dataset was constructed, do these make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained.
* Redefine the data in the basis yielded by the PCA procedure
* Plot the data points in the original and the new coordinates as a set of scatter plots. Your final figure should have 2 rows of 3 plots each, where the columns show the (0,1), (0,2) and (1,2) projections.

In [None]:
#---> For "variability" they mean the sum (trace) of the new eigenvalues normalised by the sum of the original ones
initial_trace=covariance_matrix.trace()
# ---> let's try to discard one dimension:
n_rv=data_matrix.shape[1]
new_spectrum=(np.sort(eig_vals.real)[::-1])[:(n_rv-1)] # ---> .real: la.eig returns a complex array
#print(new_spectrum)
new_variability=np.sum(new_spectrum)/initial_trace
print(new_variability) # ---> how to approx correctly?
# ---> then two:
new_new_spectrum=(np.sort(eig_vals.real)[::-1])[:(n_rv-2)]
new_new_variability=np.sum(new_new_spectrum)/initial_trace
print(new_new_variability)
# ---> keep 2 dimensions (as expected)

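order=np.argsort(eig_vals.real)[::-1] # ---> eig does NOT sort: rank the eigenvalues explicitly
new_eig_vects=eig_vects[:,order[:2]] # ---> eigenvectors of the 2 largest eigenvalues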
reduced_data_matrix_centered=np.dot(data_matrix_centered, new_eig_vects) # ---> NB eig vects are on columns!
print(new_eig_vects)
# --->  data_matrix_centered everywhere in this cell

data_matrix_centered_new_base=np.dot(data_matrix_centered, eig_vects[:,np.argsort(eig_vals.real)[::-1]]) # ---> columns ordered by decreasing eigenvalue

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12,7)) # ---> NB the syntax
acc=0
for i in range(2):
    for j in range(i+1,3):
        #print("j:",j)
        axes[0,acc].scatter(data_matrix_centered[:,i], data_matrix_centered[:,j], color='g')
        axes[0,acc].set_xlabel(f"x{i+1}")
        axes[0,acc].set_ylabel(f"x{j+1}")
        axes[1,acc].scatter(data_matrix_centered_new_base[:,i], data_matrix_centered_new_base[:,j], color='b')
        axes[1,acc].set_xlabel(f"PC{i+1}") # ---> new coordinates are principal components, not x1..x3
        axes[1,acc].set_ylabel(f"PC{j+1}")
        #print("acc:",acc)
        acc=acc+1
# ---> TODO: add legend. ax titles, fig title?
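
To judge whether the explained-variance fractions "make sense", it can help to compare them with what the construction predicts. The sketch below assumes the definitions in the exercise statement ($x_1 \sim N(0,1)$, $x_2 = x_1 + N(0,3)$, $x_3 = 2x_1 + x_2$) and builds the population covariance from the two independent sources; the names `A`, `source_cov` and `population_cov` are illustrative.

```python
import numpy as np

# Assumption: x1 = z and x2 = x1 + e, with z ~ N(0,1) and e ~ N(0,3) independent.
# Each row of A writes one variable in terms of (z, e); x3 = 2*x1 + x2 = 3*z + e.
A = np.array([[1., 0.],
              [1., 1.],
              [3., 1.]])
source_cov = np.diag([1.**2, 3.**2])  # variances of z and e

# Covariance of a linear map of independent sources: A * cov * A^T.
population_cov = A @ source_cov @ A.T

# eigvalsh returns ascending eigenvalues of a symmetric matrix; flip to descending.
expected_eig_vals = np.linalg.eigvalsh(population_cov)[::-1]
expected_fractions = expected_eig_vals / expected_eig_vals.sum()

print(population_cov)
print(expected_eig_vals)
print(expected_fractions)
```

Under these assumptions the eigenvalues are 27, 2 and 0, that is roughly 93% and 7% of the variability on the first two components and none on the third. The zero reflects that only two independent sources enter the dataset, so two components retain everything. Sample estimates with $N \sim 1000$ should fluctuate around these values by a few percent.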

2\. **PCA on a nD dataset**

Start from the dataset you generated in the previous exercise and add uncorrelated random noise. The noise should be represented by 10 additional uncorrelated, normally distributed variables, with standard deviation much smaller (say, by a factor of 50) than those used to generate $x_1$ and $x_2$.

Repeat the PCA procedure and compare the results with what you obtained before.

In [None]:
#---> everything the same, 10 new columns
n=10
epsilon=1./50 # ---> noise std: a factor 50 smaller than the std used for x1
noise_matrix=rng.normal(0.,epsilon,(N,n)) # ---> NB this is the meaning of "random noise" here
noise_matrix_centered=noise_matrix-noise_matrix.mean(axis=0)
data_matrix_centered=np.concatenate((x1_centered[:,None], x2_centered[:,None],
                                     x3_centered[:,None], noise_matrix_centered), axis=1)
n_rv=data_matrix_centered.shape[1]
_, spectrum, Vt = la.svd(data_matrix_centered)
sorted_eig_vals=spectrum**2/(N-1)
sorted_eig_vects=Vt # ---> the rows this time, NOT the columns
initial_trace=np.sum(sorted_eig_vals)
print(initial_trace)
# ---> let's try to discard one dimension at a time (smallest eigenvalues first):
i_max=0
new_variability=1
for i in range(1,n_rv):
    previous_variability=new_variability # ---> of previous cycle
    new_variability=np.sum(sorted_eig_vals[:(n_rv-i)])/initial_trace
    print(i,new_variability)
    if new_variability<0.99:
        i_max=i-1 # ---> discarding i dimensions loses too much: stop at the previous cycle
        new_variability=previous_variability
        break
else:
    i_max=n_rv-1 # ---> never dropped below 99%: a single component is enough
new_eig_vects=sorted_eig_vects[:(n_rv-i_max),:]
# ---> eg if all to keep: n_rv-0 = n_rv therefore original matrix (rows of Vt)
print(i_max,'\n \n', new_variability,'\n \n', new_eig_vects)
data_matrix_centered_new_base=np.dot(data_matrix_centered, sorted_eig_vects.T)

In [None]:
#fig, axes = plt.subplots(nrows=2, ncols=17)
#acc=0
#for i in range(2):
#    for j in range(i+1,13):
#        axes[0,acc].scatter(data_matrix_centered[:,i], data_matrix_centered[:,j], color='g')
#        axes[0,acc].set_xlabel(f"x{i+1}")
#        axes[0,acc].set_ylabel(f"x{j+1}")
#        axes[1,acc].scatter(data_matrix_centered_new_base[:,i], data_matrix_centered_new_base[:,j], color='b')
#        axes[1,acc].set_xlabel(f"x{i+1}")
#        axes[1,acc].set_ylabel(f"x{j+1}")
#        acc=acc+1

# ---> TODO: draw some USEFUL graph?
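
One candidate for a useful graph in 13 dimensions is a scree plot, which shows the explained-variance fraction of each component, next to the cumulative fraction with the 99% threshold. The sketch below is self-contained and regenerates a dataset under the exercise's assumptions (arbitrary seed, noise standard deviation of 1/50); `explained`, `cumulative` and `n_keep` are illustrative names.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: 3 correlated columns as in exercise 1, plus 10 small-noise columns.
rng = np.random.default_rng(seed=0)
N, n_noise = 1000, 10
x1 = rng.normal(0., 1., N)
x2 = x1 + rng.normal(0., 3., N)
x3 = 2 * x1 + x2
noise = rng.normal(0., 1. / 50, (N, n_noise))
X = np.column_stack((x1, x2, x3, noise))
X = X - X.mean(axis=0)

# Singular values only; eigenvalues of the covariance are s**2 / (N - 1).
s = np.linalg.svd(X, compute_uv=False)
eig_vals = s**2 / (N - 1)
explained = eig_vals / eig_vals.sum()
cumulative = np.cumsum(explained)

# Smallest number of leading components whose cumulative fraction reaches 99%.
n_keep = int(np.searchsorted(cumulative, 0.99)) + 1

components = np.arange(1, len(explained) + 1)
fig, (ax_scree, ax_cum) = plt.subplots(nrows=1, ncols=2, figsize=(11, 4))
ax_scree.semilogy(components, explained, 'o-')
ax_scree.set_xlabel("principal component")
ax_scree.set_ylabel("explained variance fraction")
ax_scree.set_title("Scree plot (log scale)")
ax_cum.plot(components, cumulative, 'o-')
ax_cum.axhline(0.99, color='r', linestyle='--', label="99% threshold")
ax_cum.axvline(n_keep, color='g', linestyle=':', label=f"keep {n_keep} components")
ax_cum.set_xlabel("number of components kept")
ax_cum.set_ylabel("cumulative variance fraction")
ax_cum.legend()
fig.tight_layout()
```

The log scale separates the two dominant components from the plateau of noise components. The last point sits many orders of magnitude lower, because $x_3$ is an exact linear combination of $x_1$ and $x_2$.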

3\. **Looking at an oscillating spring** (optional)

Imagine you have $n$ cameras looking at a spring oscillating along the $x$ axis. Each camera records the motion of the spring, looking at it along a given direction defined by the pair $(\theta_i, \phi_i)$, the angles in spherical coordinates.

Start from the simulation of the records (say ${\cal O}(1000)$) of the spring's motion along the $x$ axis, assuming a little random noise affects the measurements along the $y$ axis. Rotate this dataset to emulate the records of each camera.

Perform a Principal Component Analysis on the resulting dataset, aiming to find the one coordinate that really matters.


In [None]:
# y = 0, x = A cos(omega * t)
# unclear what this "rotation" should be?
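
One possible reading of the "rotation", offered as a sketch rather than the intended solution: each camera sees the 3D motion projected onto its own image plane, which is orthogonal to the line of sight $(\theta_i, \phi_i)$ and spanned by the spherical unit vectors $\hat\theta$ and $\hat\phi$. Every camera then contributes two columns to the dataset. The amplitude, frequency, noise level and camera placement below are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, n_cameras = 1000, 5

# Assumptions: unit amplitude, omega = 2*pi, noise std 0.05 along y, no motion along z.
t = np.linspace(0., 10., N)
x = np.cos(2 * np.pi * t)
y = rng.normal(0., 0.05, N)
z = np.zeros(N)
positions = np.column_stack((x, y, z))  # shape (N, 3)

# Hypothetical camera directions, spread deterministically over the sphere.
theta = np.linspace(0.2, np.pi - 0.2, n_cameras)
phi = np.linspace(0., 2 * np.pi, n_cameras, endpoint=False)

records = []
for th, ph in zip(theta, phi):
    # Unit vectors spanning the image plane orthogonal to the line of sight.
    e_theta = np.array([np.cos(th) * np.cos(ph), np.cos(th) * np.sin(ph), -np.sin(th)])
    e_phi = np.array([-np.sin(ph), np.cos(ph), 0.])
    # Project the 3D positions onto the two image-plane axes: shape (N, 2).
    records.append(positions @ np.column_stack((e_theta, e_phi)))
camera_data = np.hstack(records)  # shape (N, 2 * n_cameras)

# PCA via SVD of the centered records.
centered = camera_data - camera_data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(explained)
```

With these choices, nearly all the variability falls on the first component (the oscillation), a small fraction on the second (the noise along $y$), and the remaining components are numerically zero, since all $2n$ columns are linear combinations of the same two signals.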

4\. **PCA on the MAGIC dataset** (optional)

Perform a PCA on the magic04.data dataset

In [None]:
# get the dataset and its description on the proper data directory
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P ~/data/
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P ~/data/ 
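
A possible starting point for the analysis, as a sketch under stated assumptions: the file is comma-separated with 10 numeric feature columns followed by a class label, and it sits in `~/data/` as in the `wget` commands above. The helper `pca_explained_variance` is a hypothetical name, and standardising the features first is a design choice (the columns have different units and scales), not something the exercise prescribes.

```python
from pathlib import Path
import numpy as np

def pca_explained_variance(features, standardize=True):
    """Explained-variance fractions (descending) of a (samples, features) array."""
    X = features - features.mean(axis=0)
    if standardize:
        # Put every feature on unit variance so no single scale dominates.
        X = X / X.std(axis=0, ddof=1)
    s = np.linalg.svd(X, compute_uv=False)
    return s**2 / np.sum(s**2)

# Assumption: file location and layout (10 numeric columns, then the class label).
data_path = Path.home() / "data" / "magic04.data"
if data_path.exists():
    features = np.loadtxt(data_path, delimiter=",", usecols=range(10))
    fractions = pca_explained_variance(features)
    print(np.cumsum(fractions))
else:
    print(f"{data_path} not found: download it first")
```

Reading off the first index at which the cumulative sum exceeds a chosen threshold gives the number of components to keep, as in the previous exercises.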