1\. **PCA on 3D dataset**

* Generate a dataset with 3 features, each with N entries (N being ${\cal O}(1000)$). With $N(\mu,\sigma)$ denoting the normal distribution with mean $\mu$ and standard deviation $\sigma$, generate the 3 variables $x_{1,2,3}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$
* Find the eigenvectors and eigenvalues of the covariance matrix of the dataset
* Find the eigenvectors and eigenvalues using SVD. Check that the two procedures yield the same result
* What percentage of the dataset's total variability is explained by the principal components? Given how the dataset was constructed, do these make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained.
* Redefine the data in the basis yielded by the PCA procedure
* Plot the data points in the original and the new coordinates as a set of scatter plots. Your final figure should have 2 rows of 3 plots each, where the columns show the (0,1), (0,2) and (1,2) projections.


In [9]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as la

rng=np.random.RandomState(2057974)

x1=rng.normal(0,1,1000)
x2=x1+rng.normal(0,3,1000)
x3=2*x1+x2

dataset=np.empty((3,1000))
dataset[0]=x1
dataset[1]=x2
dataset[2]=x3

X=np.cov(dataset)

print('Covariance Matrix:')
print(X,'\n')
print('------------------------------------------------------------------')

l, V = la.eig(X)

print('Eigenvalues:')
print(np.real_if_close(l),'\n')
print('Eigenvectors:')
print (V,'\n')

print('Manual check:')
A=np.dot(V,np.dot(np.diag(np.real_if_close(l)), la.inv(V)))
print(A,'\n')
print('------------------------------------------------------------------')

U, spectrum, Vt = la.svd(X)

print('Eigenvalues (via SVD)')
print (spectrum,'\n')
print('Eigenvectors (via SVD)')
print (U,'\n')

D = np.diag(spectrum)  # diagonal matrix of singular values
SVD = np.dot(U, np.dot(D, Vt))

print('Manual check:')
print (SVD,'\n')
print('------------------------------------------------------------------')

print('Do the two decompositions yield the same covariance matrix?')
print(np.allclose(SVD, A))
print('------------------------------------------------------------------')


%matplotlib notebook
fig=plt.figure()
ax=fig.add_subplot(projection='3d')
ax.scatter(dataset[0,:], dataset[1,:], dataset[2,:], alpha=0.2)

scale_factor=1

# draw each eigenvector, scaled by its (real) eigenvalue, on the same 3D axes
for li, vi in zip(np.real_if_close(l), V.T):
    ax.plot([0, scale_factor*li*vi[0]], [0, scale_factor*li*vi[1]], [0, scale_factor*li*vi[2]], 'r-', lw=1)

plt.show()

Covariance Matrix:
[[ 0.94499233  0.93114284  2.82112751]
 [ 0.93114284 10.11441847 11.97670415]
 [ 2.82112751 11.97670415 17.61895916]] 

------------------------------------------------------------------
Eigenvalues:
[ 2.67273285e+01 -5.99846057e-16  1.95104143e+00] 

Eigenvectors:
[[-0.10905148 -0.81649658  0.56695777]
 [-0.58532714 -0.40824829 -0.70051801]
 [-0.8034301   0.40824829  0.43339753]] 

Manual check:
[[ 0.94499233  0.93114284  2.82112751]
 [ 0.93114284 10.11441847 11.97670415]
 [ 2.82112751 11.97670415 17.61895916]] 

------------------------------------------------------------------
Eigenvalues (via SVD)
[2.67273285e+01 1.95104143e+00 9.46169519e-17] 

Eigenvectors (via SVD)
[[-0.10905148  0.56695777 -0.81649658]
 [-0.58532714 -0.70051801 -0.40824829]
 [-0.8034301   0.43339753  0.40824829]] 

Manual check:
[[ 0.94499233  0.93114284  2.82112751]
 [ 0.93114284 10.11441847 11.97670415]
 [ 2.82112751 11.97670415 17.61895916]] 

----------------------------------------------
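
The reconstruction check above confirms that both factorizations reproduce the covariance matrix, but it does not compare the eigenpairs themselves. A minimal sketch of a direct comparison follows, assuming a freshly generated dataset with the same construction (the seed and the names `data`, `evals`, `svd_evals` are illustrative, not taken from the cells above). It sorts the eigenvalues in descending order, takes the SVD of the centered data matrix rather than of the covariance matrix, and compares eigenvectors up to an overall sign, since each eigenvector is only defined up to a factor of -1.

```python
import numpy as np
from scipy import linalg as la

# Illustrative dataset with the same construction as the exercise (seed is an assumption)
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 3, n)
x3 = 2 * x1 + x2
data = np.vstack([x1, x2, x3])

# Route 1: eigendecomposition of the covariance matrix.
# eigh is meant for symmetric matrices and returns real eigenvalues in ascending order.
evals, evecs = la.eigh(np.cov(data))
evals, evecs = evals[::-1], evecs[:, ::-1]      # reorder to descending

# Route 2: SVD of the centered data matrix.
# The squared singular values divided by (n - 1) are the covariance eigenvalues.
centered = data - data.mean(axis=1, keepdims=True)
U, s, Vt = la.svd(centered, full_matrices=False)
svd_evals = s**2 / (n - 1)

# Align signs column by column before comparing eigenvectors
overlap = np.sum(evecs * U, axis=0)             # dot product of matching columns
aligned = U * np.sign(overlap)

print(np.allclose(evals, svd_evals, atol=1e-8))
print(np.allclose(evecs, aligned, atol=1e-6))
```

Sorting matters because `la.eig` returns eigenvalues in no particular order, while `la.svd` always returns singular values in descending order, which is why the columns in the two printed eigenvector matrices appear permuted.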


In [10]:
Lambda=np.diag(l)

print('Lambda matrix:')
print (Lambda,'\n')

print('Relevance of eigenvalues:')
print (np.real_if_close(Lambda[0,0]),':',np.real_if_close(Lambda[0,0]/Lambda.trace()))
print (np.real_if_close(Lambda[1,1]),':',np.real_if_close(Lambda[1,1]/Lambda.trace()))
print (np.real_if_close(Lambda[2,2]),':',np.real_if_close(Lambda[2,2]/Lambda.trace()),'\n')

print('The most relevant part of variability is given by',np.real_if_close(Lambda[0,0]),'and',np.real_if_close(Lambda[2,2]))

Lambda matrix:
[[ 2.67273285e+01+0.j  0.00000000e+00+0.j  0.00000000e+00+0.j]
 [ 0.00000000e+00+0.j -5.99846057e-16+0.j  0.00000000e+00+0.j]
 [ 0.00000000e+00+0.j  0.00000000e+00+0.j  1.95104143e+00+0.j]] 

Relevance of eigenvalues:
26.727328529631933 : 0.931968189403358
-5.998460566415332e-16 : -2.0916323257266908e-17
1.9510414335560322 : 0.06803181059664205 

The most relevant part of variability is given by 26.727328529631933 and 1.9510414335560322
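
The vanishing eigenvalue is expected: $x_3 = 2x_1 + x_2$ is an exact linear combination of the other two variables, so the data lie on a plane and the covariance matrix has rank 2. As a rough sketch of how the 99% threshold could be applied, the snippet below reuses the three eigenvalues printed above (hard-coded here so that it runs on its own; the variable names are illustrative) and counts how many components are needed.

```python
import numpy as np

# Eigenvalues copied from the printed output above
evals = np.array([26.727328529631933, -5.998460566415332e-16, 1.9510414335560322])

# Sort in descending order and compute the cumulative explained-variance fraction
order = np.argsort(evals)[::-1]
ratio = evals[order] / evals.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative fraction reaches 99%
k = int(np.argmax(cumulative >= 0.99)) + 1
print(k)          # 2
print(order[:k])  # indices of the retained eigenvectors in the unsorted arrays
```

With these values the first component alone explains about 93.2% of the variance, so two components are needed; the retained indices are 0 and 2, matching the two eigenvalues singled out above.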


In [11]:
Xp = np.dot(V.T, dataset)

%matplotlib notebook
fig=plt.figure()
ax=fig.add_subplot(projection='3d')
ax.scatter(Xp[0,:], Xp[1,:], Xp[2,:], alpha=0.2)


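scale_factor=1

# in the rotated basis the principal axes are the columns of V.T @ V (the identity, up to rounding)
for li, vi in zip(np.real_if_close(l), np.dot(V.T, V).T):
    ax.plot([0, scale_factor*li*vi[0]], [0, scale_factor*li*vi[1]], [0, scale_factor*li*vi[2]], 'r-', lw=1)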

plt.show()
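
One way to check that the rotation did what PCA promises is to look at the covariance of the rotated data: it should be diagonal, with the eigenvalues on the diagonal, because $V^T C V = \Lambda$ when the columns of $V$ are the eigenvectors of $C$. The sketch below assumes a freshly generated dataset with the same construction (seed and names such as `rotated` are illustrative) rather than the `dataset` and `V` objects from the cells above.

```python
import numpy as np

# Illustrative dataset with the same construction as the exercise (seed is an assumption)
rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 3, n)
data = np.vstack([x1, x2, 2 * x1 + x2])

# Eigenvectors of the covariance matrix (columns of evecs)
evals, evecs = np.linalg.eigh(np.cov(data))

# Express the data in the eigenvector basis
rotated = evecs.T @ data

# The covariance in the new basis should be diag(evals)
cov_rotated = np.cov(rotated)
off_diag = cov_rotated - np.diag(np.diag(cov_rotated))
print(np.allclose(off_diag, 0, atol=1e-8))
print(np.allclose(np.diag(cov_rotated), evals, atol=1e-8))
```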


In [12]:
%matplotlib notebook
fig=plt.figure()
ax=fig.add_subplot()
ax.scatter(Xp[0,:], Xp[2,:], alpha=0.2)
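scale_factor=1

# principal axes expressed in the rotated basis, projected onto components 0 and 2
for li, vi in zip(np.real_if_close(l), np.dot(V.T, V).T):
    ax.plot([0, scale_factor*li*vi[0]], [0, scale_factor*li*vi[2]], 'r-', lw=1)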

plt.show()


In [13]:
fig, ax = plt.subplots(ncols=3, nrows=2, constrained_layout=True, figsize=(9,5))

ax[0,0].scatter(dataset[0,:], dataset[1,:], alpha=0.2)
ax[0,1].scatter(dataset[0,:], dataset[2,:], alpha=0.2)
ax[0,2].scatter(dataset[1,:], dataset[2,:], alpha=0.2)

ax[1,0].scatter(Xp[0,:], Xp[1,:], alpha=0.2, c='r')
ax[1,1].scatter(Xp[0,:], Xp[2,:], alpha=0.2, c='r')
ax[1,2].scatter(Xp[1,:], Xp[2,:], alpha=0.2, c='r')

ax[0,0].set_title('Original Data')
ax[0,0].set_xlabel('x1')
ax[0,0].set_ylabel('x2')

ax[0,1].set_title('Original Data')
ax[0,1].set_xlabel('x1')
ax[0,1].set_ylabel('x3')

ax[0,2].set_title('Original Data')
ax[0,2].set_xlabel('x2')
ax[0,2].set_ylabel('x3')

# primed labels: the rotated coordinates are not the original x1, x2, x3
ax[1,0].set_title('Rotated Data')
ax[1,0].set_xlabel("x1'")
ax[1,0].set_ylabel("x2'")

ax[1,1].set_title('Rotated Data')
ax[1,1].set_xlabel("x1'")
ax[1,1].set_ylabel("x3'")

ax[1,2].set_title('Rotated Data')
ax[1,2].set_xlabel("x2'")
ax[1,2].set_ylabel("x3'")

for i in range(2):
    for j in range(3):
        ax[i,j].set_xlim(-20, 20)
        ax[i,j].set_ylim(-15,15)

plt.show()


2\. **PCA on a nD dataset**

Start from the dataset you generated in the previous exercise and add uncorrelated random noise. The noise should be represented by 10 additional uncorrelated, normally distributed variables, with standard deviations much smaller (say, by a factor of 50) than those used to generate $x_1$ and $x_2$.

Repeat the PCA procedure and compare the results with what you obtained before.

In [6]:
# 10 noise variables; the i-th has standard deviation (i+1)/(50+(i+1)), i.e. from 1/51 up to 10/60
noise=np.empty((10,1000))
for i in range(10):
    noise[i]=rng.normal(0,(i+1)/(50+(i+1)),1000)

dataset1=np.concatenate((dataset,noise))

Y=np.cov(dataset1)

l1, V1 = la.eig(Y)

Lambda1=np.diag(l1)

print('Relevance of eigenvalues:')
for i in range(13):
    print (np.real_if_close(Lambda1[i,i]),':',np.real_if_close(Lambda1[i,i]/Lambda1.trace()))

Relevance of eigenvalues:
26.7274518105639 : 0.9281701827364787
1.9511620399632004 : 0.06775843952566143
0.028489638975913237 : 0.0009893660496253568
0.023675382921183277 : 0.0008221803054051475
0.019266962700589394 : 0.0006690881127513547
0.015408909174730302 : 0.0005351086271092404
0.011679920504382962 : 0.0004056112055027948
0.008428649636752212 : 0.0002927035966247244
0.0052222857402922035 : 0.00018135548215461754
0.0032483332997916437 : 0.00011280559530426061
1.3820759996496812e-15 : 4.799566162321261e-17
0.0014146264330063929 : 4.912604778538724e-05
0.00040408401052180156 : 1.4032715596878171e-05


In [7]:
# As expected, the two leading principal components are the same as in the previous exercise;
# together they still explain about 99.6% of the total variance.
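
The cell above lets the noise standard deviation grow from 1/51 to 10/60. As a hedged variant, the sketch below takes the "factor 50" in the exercise text literally and gives every noise variable a standard deviation of 1/50 (that choice, the seed, and the names `data_nd` and `k` are assumptions made for illustration). It then counts how many components are needed to retain 99% of the variance.

```python
import numpy as np

rng = np.random.default_rng(7)   # illustrative seed
n = 1000

# Same 3D construction as in the first exercise
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 3, n)
x3 = 2 * x1 + x2

# Assumption: all 10 noise variables share sigma = 1/50
noise = rng.normal(0, 1 / 50, size=(10, n))
data_nd = np.vstack([x1, x2, x3, noise])

# Eigenvalues of the 13x13 covariance matrix, in descending order
evals = np.linalg.eigvalsh(np.cov(data_nd))[::-1]
ratio = evals / evals.sum()

# Number of components needed to reach 99% of the total variance
k = int(np.argmax(np.cumsum(ratio) >= 0.99)) + 1
print(k)
print(ratio[:3])
```

The noise variances are of order $(1/50)^2 = 4\times10^{-4}$, far below the two leading eigenvalues, so the noise directions only add a small tail to the spectrum.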

3\. **Looking at an oscillating spring** (optional)

Imagine you have $n$ cameras looking at a spring oscillating along the $x$ axis. Each camera records the motion of the spring, looking at it along a given direction defined by the pair $(\theta_i, \phi_i)$, the angles in spherical coordinates.

Start from a simulation of the records (say ${\cal O}(1000)$) of the spring's motion along the $x$ axis, assuming a little random noise affects the measurements along $y$. Rotate this dataset to emulate the records of each camera.

Perform a Principal Component Analysis on the resulting dataset, aiming to find the one coordinate that really matters.
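
A possible starting point is sketched below. It is not a reference solution: the oscillation frequency, the noise level, the camera angles, and the rotation convention (a rotation by $\theta$ about $y$ followed by one by $\phi$ about $z$, with each camera recording the first two rotated coordinates) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n_samples = 1000

# Spring motion: oscillation along x, small noise along y, nothing along z
t = np.linspace(0, 10, n_samples)
motion = np.vstack([
    np.cos(2 * np.pi * t),
    rng.normal(0, 0.05, n_samples),
    np.zeros(n_samples),
])

def rotation(theta, phi):
    """Rotation by theta about y, then by phi about z (assumed convention)."""
    ry = np.array([[np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
    rz = np.array([[np.cos(phi), -np.sin(phi), 0],
                   [np.sin(phi), np.cos(phi), 0],
                   [0, 0, 1]])
    return rz @ ry

# Hypothetical camera orientations (theta_i, phi_i), in radians
thetas = [0.2, 0.7, 1.1, 2.5]
phis = [0.3, 1.0, 2.0, 4.0]

# Each camera contributes two rows: the rotated motion seen in its image plane
records = np.vstack([(rotation(th, ph) @ motion)[:2]
                     for th, ph in zip(thetas, phis)])

# PCA on the stacked camera records
evals = np.linalg.eigvalsh(np.cov(records))[::-1]
ratio = evals / evals.sum()
print(ratio[:3])
```

Although the stacked dataset has $2n$ rows, all of them are linear combinations of the same two signals, so at most two eigenvalues are non-negligible and the first one, associated with the oscillation, dominates.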


4\. **PCA on the MAGIC dataset** (optional)

Perform a PCA on the magic04.data dataset
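
Once the file is available locally, the same procedure as in the previous exercises can be wrapped in a small helper. The sketch below is one possible shape for it, assuming the features are standardized first because the columns have very different scales. The helper name `pca_explained_variance` is hypothetical, and it is exercised here on synthetic stand-in data so that it runs without the download. The commented lines show how it might be applied to the file, assuming the first 10 columns are numeric features and the last column is the class label, as described in magic04.names.

```python
import numpy as np

def pca_explained_variance(features, standardize=True):
    """PCA of an (n_samples, n_features) array.

    Returns eigenvalues (descending), eigenvectors (as columns),
    and the fraction of variance explained by each component.
    """
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                      # center each feature
    if standardize:
        X = X / X.std(axis=0, ddof=1)           # unit variance per feature
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    return evals, evecs, evals / evals.sum()

# Synthetic stand-in with features on very different scales
rng = np.random.default_rng(1)
demo = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
evals, evecs, ratio = pca_explained_variance(demo)
print(ratio)

# Possible usage on the real file (path and column layout are assumptions):
# import os
# import pandas as pd
# df = pd.read_csv(os.path.expanduser('~/data/magic04.data'), header=None)
# evals, evecs, ratio = pca_explained_variance(df.iloc[:, :10].to_numpy())
```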

In [8]:
# get the dataset and its description on the proper data directory
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data -P ~/data/
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names -P ~/data/ 

--2021-12-14 09:46:21--  https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1477391 (1,4M) [application/x-httpd-php]
Saving to: ‘/Users/raffaelegaudio/data/magic04.data.1’


2021-12-14 09:46:40 (106 KB/s) - ‘/Users/raffaelegaudio/data/magic04.data.1’ saved [1477391/1477391]

--2021-12-14 09:46:41--  https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5400 (5,3K) [application/x-httpd-php]
Saving to: ‘/Users/raffaelegaudio/data/magic04.names.1’


2021-12-14 09:46:41 (48,1