1\. **PCA on 3D dataset**

* Generate a dataset with 3 features, each with $N$ entries ($N$ being ${\cal O}(1000)$). With $N(\mu,\sigma)$ denoting the normal distribution with mean $\mu$ and standard deviation $\sigma$, generate the 3 variables $x_{1,2,3}$ such that:
    * $x_1$ is distributed as $N(0,1)$
    * $x_2$ is distributed as $x_1+N(0,3)$
    * $x_3$ is given by $2x_1+x_2$
* Find the eigenvectors and eigenvalues of the covariance matrix of the dataset
* Find the eigenvectors and eigenvalues using SVD. Check that the two procedures yield the same result
* What percent of the total variability is explained by the principal components? Given how the dataset was constructed, do these make sense? Reduce the dimensionality of the system so that at least 99% of the total variability is retained.
* Redefine the data in the basis yielded by the PCA procedure
* Plot the data points in the original and the new coordinates as a set of scatter plots. Your final figure should have 2 rows of 3 plots each, where the columns show the (0,1), (0,2) and (1,2) projections.
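
As an aside on whether the PCA result "makes sense": the construction above fixes the population covariance matrix, so the expected eigenvalues can be worked out before any data are generated. The sketch below assumes that $N(0,3)$ means a standard deviation of 3, as stated above; the names `M`, `sigma_z` and `cov_theory` are introduced here for illustration only.

```python
import numpy as np

# Write x = M @ z with z = (z1, z2) independent, z1 ~ N(0,1), z2 ~ N(0,3):
#   x1 = z1,  x2 = z1 + z2,  x3 = 2*x1 + x2 = 3*z1 + z2
M = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [3.0, 1.0]])
sigma_z = np.diag([1.0**2, 3.0**2])      # variances of z1 and z2

cov_theory = M @ sigma_z @ M.T           # population covariance of (x1, x2, x3)
eigvals = np.linalg.eigvalsh(cov_theory)[::-1]   # descending order

print(cov_theory)
print("eigenvalues:", eigvals)
print("explained fraction:", eigvals / eigvals.sum())
```

Because $x_3$ is an exact linear combination of $x_1$ and $x_2$, this matrix has rank 2: one eigenvalue vanishes and two components carry all the variance. With $N \sim 1000$ samples the estimated values should be close to these, but not identical.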


In [None]:
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg as la
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
N = 1000
x1 = np.random.normal(0,1,size=N)
x2 = x1 + np.random.normal(0,3,size=N)
x3 = 2*x1 + x2
X = np.array([x1,x2,x3])

In [None]:
cov = np.cov(X)
l, V = np.linalg.eig(cov)

In [None]:
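# the SVD reproduces the covariance eigen-decomposition only for mean-centred data
Xc = X - X.mean(axis=1, keepdims=True)
U, spectrum, Vt = np.linalg.svd(Xc, full_matrices=False)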
l_svd = spectrum**2/(N-1)
V_svd = U

print("Shape {}".format(X.shape))
print ("Numpy eig:\n")
print ("eigenvalues:",l)
print ("eigenvectors:",V)
print ("\nSVD\n")
print ("eigenvalues:",l_svd)
print ("eigenvectors:",V_svd)

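# NOTICE

The two results agree up to the ordering of the columns and the sign of each eigenvector: `np.linalg.eig` returns the eigenvalues in no particular order, whereas the SVD returns them in descending order (in the run shown, this amounts to swapping the first and the third column). The smallest eigenvalue is zero up to machine precision, because $x_3$ is an exact linear combination of $x_1$ and $x_2$. Small residual differences can also appear if the data are not mean-centred before the SVD.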
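
The comparison described above can also be done programmatically rather than by eye. The following is a minimal sketch, not the only way to do it: it regenerates a dataset with a fixed seed (an arbitrary choice), sorts the eigen-decomposition in descending order, and flips eigenvector signs before comparing. The names `order`, `signs` and `U_aligned` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen only for reproducibility
N = 1000
x1 = rng.normal(0, 1, size=N)
x2 = x1 + rng.normal(0, 3, size=N)
x3 = 2 * x1 + x2
X = np.array([x1, x2, x3])               # shape (3, N): one row per feature

# Route 1: eigen-decomposition of the covariance matrix (eigh expects a symmetric matrix)
l_eig, V_eig = np.linalg.eigh(np.cov(X))
order = np.argsort(l_eig)[::-1]          # indices that sort eigenvalues in descending order
l_eig, V_eig = l_eig[order], V_eig[:, order]

# Route 2: SVD of the mean-centred data
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
l_svd = s**2 / (N - 1)

# Eigenvectors are defined up to a sign: flip each column of U so that it
# points the same way as the matching column of V_eig
signs = np.sign(np.sum(U * V_eig, axis=0))
U_aligned = U * signs

print("eigenvalues agree: ", np.allclose(l_eig, l_svd))
print("eigenvectors agree:", np.allclose(V_eig[:, :2], U_aligned[:, :2]))
```

Only the first two eigenvectors are compared here, since the third belongs to an eigenvalue that is zero up to rounding.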

In [None]:
#using svd ordered results
Lambda=np.diag(l_svd)
print ("Lambda {}".format(Lambda))
#cov matrix Trace
print ("Covariance trace:", cov.trace())
#lambda trace
print ("Lambda trace:", Lambda.trace())

print ("Percentage: {}".format(Lambda[0,0]/Lambda.trace()))


In [None]:
Lambda=np.diag(l_svd[0:2])
print ("Lambda {}".format(Lambda))
#cov matrix Trace
print ("Covariance trace:", cov.trace())
#lambda trace
print ("Lambda trace:", Lambda.trace())

print ("Percentage: {}".format(Lambda[0,0]/Lambda.trace()))
print ("The trace does not change: the discarded eigenvalue is ~0")
print ("Total variability: {}".format((l_svd[0]+l_svd[1])/sum(l_svd)))
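
Instead of picking the number of retained components by hand, the 99% requirement can be turned into a small helper. This is a sketch under the assumption that the eigenvalues are available as a 1D array; `n_components_for` is a hypothetical name, and the spectrum passed to it below is made up for illustration, not taken from the cells above.

```python
import numpy as np

def n_components_for(eigenvalues, threshold=0.99):
    """Smallest number of leading components whose cumulative variance
    fraction reaches `threshold`, together with the cumulative fractions."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # descending
    cumulative = np.cumsum(lam) / lam.sum()
    # first position where the cumulative fraction reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, lam.size), cumulative

# Illustrative spectrum (made-up numbers)
k, cumulative = n_components_for([27.0, 2.0, 0.0], threshold=0.99)
print("components to keep:", k)
print("cumulative fractions:", cumulative)
```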

In [None]:
Xp = np.dot(X.T, U).T
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=3, figsize=(15,10))
ax1[0].scatter(X[0,:],X[1,:], alpha=0.2)
ax1[1].scatter(X[0,:],X[2,:], alpha=0.2)
ax1[2].scatter(X[1,:],X[2,:], alpha=0.2)

ax2[0].scatter(Xp[0,:],Xp[1,:], alpha=0.2)
ax2[1].scatter(Xp[0,:],Xp[2,:], alpha=0.2)
ax2[2].scatter(Xp[1,:],Xp[2,:], alpha=0.2)

for j in range(0,3):
    #not same axis limit to get some comparison with seaborn
    ax1[j].axis([-10,10,-15,15])
    ax2[j].axis([-15,15,-5,5])
    
ax1[0].set_title("old {} vs {}".format(1,2))
ax2[0].set_title("new {} vs {}".format(1,2))
ax1[1].set_title("old {} vs {}".format(1,3))
ax2[1].set_title("new {} vs {}".format(1,3))
ax1[2].set_title("old {} vs {}".format(2,3))
ax2[2].set_title("new {} vs {}".format(2,3))

In [None]:
data = pd.DataFrame({'x1':x1, 'x2':x2,'x3':x3})
sns.pairplot(data)

In [None]:
data = pd.DataFrame(Xp.T)
sns.pairplot(data)

2\. **PCA on a nD dataset**

Start from the dataset you generated in the previous exercise and add uncorrelated random noise. The noise should be represented by 10 additional uncorrelated, normally distributed variables, with a standard deviation much smaller (say, by a factor of 50) than those used to generate $x_1$ and $x_2$.

Repeat the PCA procedure and compare the results with what you obtained before


Less of the total variability is retained by the first two components than before. The correlation between the 1st and 3rd components, and between the 2nd and 3rd, is also weaker than in the previous case.
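
One possible reading of the exercise text is that the noise enters as 10 extra columns of the dataset rather than as terms added to $x_1$, $x_2$ and $x_3$. A minimal sketch of that reading follows; the seed, the choice of $1/50$ for the noise standard deviation, and the variable names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
N = 1000
x1 = rng.normal(0, 1, size=N)
x2 = x1 + rng.normal(0, 3, size=N)
x3 = 2 * x1 + x2

# 10 independent noise features, std 50 times smaller than that of x1 (assumed: 1/50)
noise = rng.normal(0, 1 / 50, size=(N, 10))
X = np.column_stack([x1, x2, x3, noise])     # shape (N, 13), rows = samples

# PCA via SVD of the mean-centred data
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
eigvals = s**2 / (N - 1)
explained = eigvals / eigvals.sum()

print("explained variance fractions:", np.round(explained, 5))
print("first two components:", explained[:2].sum())
```

In this variant the first two components still dominate, but they no longer carry exactly all of the variance, because the noise columns contribute a small amount of their own.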

In [None]:
%reset

In [None]:
import numpy as np
import pandas as pd
import scipy as sp
import scipy.linalg as la
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
N = 1000
n1 = np.zeros((10,N))
n2 = np.zeros((10,N))
n3 = np.zeros((10,N))
for i in range(10):
    mean = 0
    # each noise term gets its own small standard deviation
    std = np.random.uniform(0.01,0.03)
    n1[i] = np.random.normal(mean,std,N)   # noise for x1
    std = np.random.uniform(0.01,0.12)
    n2[i] = np.random.normal(mean,std,N)   # noise for x2
    std = np.random.uniform(0.01,0.32)
    n3[i] = np.random.normal(mean,std,N)   # noise for x3

x1 = np.random.normal(0,1,N) + n1.sum(axis=0)
x2 = np.random.normal(0,3,N) + x1 + n2.sum(axis=0)
df = pd.DataFrame({'x1':x1,'x2':x2,'x3':(2*x1+x2+n3.sum(axis=0))})
df.shape

In [None]:
cov = np.cov(df, rowvar=False)
l, V = np.linalg.eig(cov)
print ("Covariance:{}\n".format(cov))

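# centre the data so that the SVD matches the covariance eigen-decomposition
Xc = (df - df.mean()).to_numpy().T
U, spectrum, Vt = np.linalg.svd(Xc, full_matrices=False)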
l_svd = spectrum**2/(N-1)
V_svd = U

print ("Numpy eig:\n")
print ("eigenvalues:",l)
print ("eigenvectors:",V)
print ("\nSVD\n")
print ("eigenvalues:",l_svd)
print ("eigenvectors:",V_svd)

In [None]:
found_variability = l/np.sum(l)

print('variability per principal component:',found_variability)

In [None]:
#using svd ordered results
Lambda=np.diag(l_svd)
print ("Lambda {}".format(Lambda))
#cov matrix Trace
print ("Covariance trace:", cov.trace())
#lambda trace
print ("Lambda trace:", Lambda.trace())

print ("Percentage: {}".format(Lambda[0,0]/Lambda.trace()))


In [None]:
#using svd ordered results
Lambda=np.diag(l_svd[0:2])
print ("Lambda {}".format(Lambda))
#cov matrix Trace
print ("Covariance trace:", cov.trace())
#lambda trace
print ("Lambda trace:", Lambda.trace())

print ("Percentage: {}".format(Lambda[0,0]/Lambda.trace()))

print ("The trace barely changes: the discarded eigenvalue is small")
print ("Total variability: {}".format((l_svd[0]+l_svd[1])/sum(l_svd)))

In [None]:
Xp = np.dot(df, U).T
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=3, figsize=(15,10))
ax1[0].scatter(df.iloc[:,0],df.iloc[:,1], alpha=0.2)
ax1[1].scatter(df.iloc[:,0],df.iloc[:,2], alpha=0.2)
ax1[2].scatter(df.iloc[:,1],df.iloc[:,2], alpha=0.2)

ax2[0].scatter(Xp[0,:],Xp[1,:], alpha=0.2)
ax2[1].scatter(Xp[0,:],Xp[2,:], alpha=0.2)
ax2[2].scatter(Xp[1,:],Xp[2,:], alpha=0.2)

for j in range(0,3):
    #not same axis limit to get some comparison with seaborn
    ax1[j].axis([-10,10,-15,15])
    ax2[j].axis([-15,15,-5,5])
    
ax1[0].set_title("old {} vs {}".format(1,2))
ax2[0].set_title("new {} vs {}".format(1,2))
ax1[1].set_title("old {} vs {}".format(1,3))
ax2[1].set_title("new {} vs {}".format(1,3))
ax1[2].set_title("old {} vs {}".format(2,3))
ax2[2].set_title("new {} vs {}".format(2,3))

In [None]:
sns.pairplot(df)
sns.pairplot(pd.DataFrame(Xp.T))

3\. **Looking at an oscillating spring** (optional)

Imagine you have $n$ cameras looking at a spring oscillating along the $x$ axis. Each camera records the motion of the spring, looking at it along a given direction defined by the pair $(\theta_i, \phi_i)$, the angles in spherical coordinates.

Start from the simulation of the records (say ${\cal O}(1000)$) of the spring's motion along the $x$ axis, assuming a little random noise affects the measurements along the $y$ axis. Rotate this dataset to emulate the records of each camera.

Perform a Principal Component Analysis on the resulting dataset, aiming to find the single coordinate that really matters.


In [None]:
w = 2
A = 1
N = 1000
data = np.zeros((10,N,3))
phi = 0
time = np.arange(0, 10, 10/N)

x = A*np.sin(w*time+phi)
y = np.random.normal(0, 1/5, N)
z = np.zeros(N)
data[0] = np.array([x,y,z]).T
i = 1

In [None]:
plt.plot(time,x,'*')
plt.title("X-motion in time")
plt.xlabel('time')
plt.ylabel('$x$')

In [None]:
plt.plot(time,y,'*')
plt.title("Y-motion in time(noise)")
plt.xlabel('time')
plt.ylabel('$y$')

In [None]:
import math
i = 1
for theta in [math.pi/6,math.pi/4,math.pi/2]:
    for gamma in [math.pi/6,math.pi/4,math.pi/3]:
        R1 = [[math.cos(theta), math.sin(theta), 0], [-math.sin(theta), math.cos(theta), 0], [0,0,1]]
        R2 = [[math.cos(gamma), 0, -math.sin(gamma)], [0,1,0] , [math.sin(gamma), 0, math.cos(gamma)]]
        R = np.matmul(R1,R2)
        data_rot = np.matmul(R,data[0].T)
        data[i] = data_rot.T
        i+=1

In [None]:
fig, plots = plt.subplots(nrows=5, ncols=2, figsize=(10,10))
fig.suptitle("X-motion wrt time and camera changes", fontsize=10)
i = 0
j = 0
for record in data:
    plots[j][i].plot(time, record[:,0],'*')
    plots[j][i].set_aspect('equal')
    i+=1
    if i == 2:
        j += 1
        i = 0
plt.show()

In [None]:
fig, plots = plt.subplots(nrows=5, ncols=2, figsize=(10,10))
fig.suptitle("Y-motion wrt time and camera changes", fontsize=10)
i = 0
j = 0
for record in data:
    plots[j][i].plot(time, record[:,1],'*')
    plots[j][i].set_aspect('equal')
    i+=1
    if i == 2:
        j += 1
        i = 0
plt.show()

In [None]:
data[:,:,0].shape

In [None]:
# covariance between the x-records of the 10 views (rows = views)
cov = np.cov(data[:,:,0])
# eigh is meant for symmetric matrices and always returns real eigenvalues
l, v = np.linalg.eigh(cov)

found_variability = l/np.sum(l)

print('variability per principal component for x direction:',found_variability)
summed = 0
indexes = []
var = found_variability.copy()
# pick components in order of decreasing variance until 92% is reached
for i in range(0, found_variability.size):
    summed += found_variability[var.argmax()]
    indexes.append(var.argmax())
    var[var.argmax()] = 0
    if summed > 0.92:
        break


print('principal components {} retain ~{} of the total variability'.format(indexes, found_variability[indexes].sum()))
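
An alternative to analysing only the $x$ projection is to stack all three coordinates of every camera into one dataset and run the PCA on that. The sketch below assumes the same spring parameters as above, 9 cameras with randomly drawn angles, and a hypothetical helper `rotation`; none of these choices come from the exercise text.

```python
import numpy as np

rng = np.random.default_rng(2)           # arbitrary seed
N, n_cameras = 1000, 9
t = np.arange(N) * 10 / N
# spring along x, small noise along y, nothing along z
d = np.vstack([np.sin(2 * t), rng.normal(0, 0.2, N), np.zeros(N)])   # shape (3, N)

def rotation(theta, gamma):
    """Product of a rotation about z (angle theta) and one about y (angle gamma)."""
    c, s = np.cos(theta), np.sin(theta)
    Rz = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    c, s = np.cos(gamma), np.sin(gamma)
    Ry = np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])
    return Rz @ Ry

# each camera sees the same motion in its own rotated frame; stack all records
angles = rng.uniform(0, np.pi, size=(n_cameras, 2))
records = np.vstack([rotation(th, ga) @ d for th, ga in angles])     # shape (27, N)

# PCA via SVD of the mean-centred stacked records
Rc = records - records.mean(axis=1, keepdims=True)
sv = np.linalg.svd(Rc, compute_uv=False)
explained = sv**2 / np.sum(sv**2)
print("leading explained fractions:", np.round(explained[:4], 4))
```

Since every record is a rotation of the same three-dimensional motion, the stacked dataset has at most rank 3; here the $z$ coordinate is identically zero, so only two components are non-trivial: one for the oscillation and a much smaller one for the noise along $y$.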

4\. **PCA on the MAGIC dataset** (optional)

Perform a PCA on the magic04.data dataset

In [None]:
df = pd.read_csv('data/magic04.data', names = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'fClass'])
# encode the class label as an integer column (hadron = 0, gamma = 1)
df['fClass'] = df['fClass'].map({'h': 0, 'g': 1}).astype(int)
df

In [None]:
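X = df.to_numpy(dtype=float)          # rows = events, columns = features
cov = np.cov(X, rowvar=False)
l, V = np.linalg.eig(cov)
#print ("Covariance:{}\n".format(cov))

# centre the data before the SVD; full_matrices=False avoids building a huge square Vt
Xc = X - X.mean(axis=0)
U, spectrum, Vt = np.linalg.svd(Xc.T, full_matrices=False)
l_svd = spectrum**2/(df.shape[0]-1)
V_svd = U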

print ("Numpy eig:\n")
print ("eigenvalues:",l)
print ("eigenvectors:",V)
print ("\nSVD\n")
print ("eigenvalues:",l_svd)
print ("eigenvectors:",V_svd)

In [None]:
found_variability = l/np.sum(l)

print('variability per principal component:',found_variability)

In [None]:
#using svd ordered results
Lambda=np.diag(l_svd)
print ("Lambda {}".format(Lambda))
#cov matrix Trace
print ("Covariance trace:", cov.trace())
#lambda trace
print ("Lambda trace:", Lambda.trace())

print ("Percentage: {}".format(Lambda[0,0]/Lambda.trace()))


In [None]:
summed = 0
indexes = []
var = found_variability.copy()
for i in range(0, found_variability.size):
    summed += found_variability[var.argmax()]
    indexes.append(var.argmax())
    var[var.argmax()] = 0
    if summed > 0.9:
        break

In [None]:
#eigenvalues of the retained components (indexes refer to the ordering of l)
Lambda=np.diag(l[indexes])
print ("Lambda {}".format(Lambda))
#cov matrix Trace
print ("Covariance trace:", cov.trace())
#lambda trace
print ("Lambda trace:", Lambda.trace())

print ("Percentage: {}".format(Lambda[0,0]/Lambda.trace()))

print ("Total variability: {}".format(Lambda.trace()/cov.trace()))

In [None]:
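# project onto the eigenvector basis (numeric array, so the label column does not stay object-typed)
Xp = np.dot(df.to_numpy(dtype=float), V).T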
sns.pairplot(df)
sns.pairplot(pd.DataFrame(Xp.T))
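
When features are measured on very different scales, as is plausibly the case for the MAGIC columns, a covariance-based PCA tends to be dominated by the columns with the largest numerical spread. One common option, offered here as an assumption rather than something the exercise asks for, is to standardize each column first. The sketch uses synthetic stand-in data, not `magic04.data`; `explained_fractions` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(3)           # arbitrary seed
n = 2000
# synthetic stand-in: two correlated features on very different scales
a = rng.normal(0, 1, n)
b = 0.8 * a + 0.6 * rng.normal(0, 1, n)  # unit variance, correlation ~0.8 with a
X = np.column_stack([100 * a, b])        # first column rescaled by a factor 100

def explained_fractions(X):
    """Explained-variance fractions from the SVD of mean-centred data (rows = samples)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s**2 / np.sum(s**2)

raw = explained_fractions(X)
standardized = explained_fractions((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))

print("covariance-based PCA :", np.round(raw, 4))
print("correlation-based PCA:", np.round(standardized, 4))
```

On the raw data the first component is essentially the rescaled column alone, whereas after standardization the split reflects the correlation between the two features. Which of the two is appropriate depends on whether the original units are meaningful for the question at hand.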