# 1 Synthetic Data 
Create a matrix A ∈ $R^{3 \times 2}$ whose individual entries are drawn from a Gaussian distribution with mean
0 and variance 1 in an independent and identically distributed (iid) fashion. Once generated, this
matrix should not be changed for the rest of this exercise.
       
    • Matrices with iid Gaussian entries are full rank with probability 1. Verify this by printing the rank of A; it
      should be 2.

In [46]:
#First we create our matrix and print out the rank as well to confirm it is full rank
import numpy as np
from numpy.linalg import matrix_rank
A = np.random.normal(0,1, (3,2))
print (A)
print("The rank of the Matrix is", matrix_rank(A))

[[-0.24013875  0.7115142 ]
 [-1.31952669 -0.42621924]
 [-0.72559812  0.27900963]]
The rank of the Matrix is 2
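
Because A must stay fixed for the rest of the exercise, it can help to draw it from a seeded generator so that re-running the notebook reproduces the same matrix. The following is a minimal sketch, assuming NumPy's `default_rng` interface; the seed value 0 and the names `rng` and `A_fixed` are illustrative and not part of the original notebook.

```python
import numpy as np

# Hypothetical seed; any fixed integer gives a reproducible draw.
rng = np.random.default_rng(0)

# 3x2 matrix with iid N(0, 1) entries.
A_fixed = rng.standard_normal((3, 2))

# A 3x2 matrix has full rank when its rank equals min(3, 2) = 2.
print("shape:", A_fixed.shape)
print("rank:", np.linalg.matrix_rank(A_fixed))
```

Re-creating the generator with the same seed yields the same matrix, which is what keeps A unchanged across runs.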


# Generation of Dataset #1

1. Each of our data samples $\mathbf{x}$ ∈ $R^3$ is going to be generated in the following fashion: x = Av, where v ∈ $R^2$ is a random vector whose entries are iid Gaussian with mean 0 and variance 1. Note that we will have a different v for each new data sample (i.e., unlike A, it is not fixed for each data sample).

• Generate 500 data samples $\{x_i\}_{i=1}^{500}$ using the aforementioned mathematical model.

• Store the data samples into a data matrix $\mathbf{X}$, such that each data sample is a column in this data
matrix. Print the dimensionality of X and confirm that it matches your expectations.

• Since we can write X = AV, where V ∈ $R^{2 \times 500}$ is a matrix whose columns are the vectors $v_i$ corresponding to the data samples $x_i$, the rank of X is 2 (Can you see why? Perhaps refer to Wikipedia?). Verify this by printing the rank of X.


In [47]:
# Generate the data matrix
v = np.random.normal(0,1, (2,500))
X = A @ v
print(X.shape)
# Print the rank of the matrix
print("The rank of X is" , np.linalg.matrix_rank(X))
print(X)

(3, 500)
The rank of X is 2
[[-1.051067    0.63021127 -0.52533119 ... -0.96208673  1.09520685
  -1.45268584]
 [-3.62929728  0.48952464  0.58590333 ... -2.4267398   0.80806103
  -1.53418788]
 [-2.24983728  0.62124657 -0.08897508 ... -1.67305538  1.06122253
  -1.60711789]]
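
The rank printed above follows from a standard inequality for matrix products. The following is a short derivation, assuming (as holds with probability 1 for iid Gaussian entries) that A has full column rank and V has full row rank.

```latex
\operatorname{rank}(X) = \operatorname{rank}(AV)
  \le \min\{\operatorname{rank}(A),\ \operatorname{rank}(V)\} = 2 .

% V has full row rank, so V V^T is an invertible 2x2 matrix and
A = X V^{T} (V V^{T})^{-1}
  \quad\Longrightarrow\quad
  2 = \operatorname{rank}(A) \le \operatorname{rank}(X) .

% Combining the two bounds:
\operatorname{rank}(X) = 2 .
```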


# Singular Value and Eigenvalue Decomposition of Dataset #1
You will need to use the numpy.linalg package for the purposes of this part of the exercise.
1. Compute the singular value decomposition of X and the eigenvalue decomposition of $XX^T$ and verify (by printing) that:
    • The left singular vectors of X correspond to the eigenvectors of $XX^T$.
    • The eigenvalues of $XX^T$ are the squares of the singular values of X.
    • The energy in X, defined by $\|X\|_F^2$, is equal to the sum of squares of the singular values of X.
2. Since the rank of X is 2, the entire dataset spans only a two-dimensional subspace of $R^3$.
• Since the rank of X is 2, we should ideally only have two nonzero singular values of X. However, unless
you are really lucky, you will see that none of your singular values are exactly zero. Comment on
why that might be happening (and if you are the lucky one then run your code again and you will
hopefully become unlucky :).
• What do you think is the relationship between the left singular vectors of X corresponding to
the two largest singular values and the columns of A? Try to be as precise and mathematically
rigorous as you can.
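
One way to probe the last two questions numerically is to compare the subspace spanned by the two leading left singular vectors with the column space of A. The sketch below regenerates its own data so that it runs on its own; the seed and the names `rng`, `U2`, `P_U`, and `P_A` are illustrative assumptions. It relies on the fact that two subspaces coincide exactly when their orthogonal projectors are equal.

```python
import numpy as np

rng = np.random.default_rng(0)            # hypothetical seed
A = rng.standard_normal((3, 2))
V = rng.standard_normal((2, 500))
X = A @ V

U, s, Vh = np.linalg.svd(X, full_matrices=False)
U2 = U[:, :2]                             # left singular vectors for the two largest singular values

# Orthogonal projector onto span(u1, u2); U2 has orthonormal columns.
P_U = U2 @ U2.T
# Orthogonal projector onto the column space of A: A (A^T A)^{-1} A^T.
P_A = A @ np.linalg.solve(A.T @ A, A.T)

print("projectors agree:", np.allclose(P_U, P_A))
# The third singular value is zero in exact arithmetic; floating-point
# round-off leaves a value that is tiny relative to the largest one.
print("s3 / s1 =", s[2] / s[0])
```

If the projectors agree, $u_1$ and $u_2$ form an orthonormal basis for the column space of A, so $[u_1\ u_2] = AR$ for some invertible 2×2 matrix R, even though the individual vectors generally differ from the columns of A.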

In [48]:
# 1. Compute the SVD of X
u, s, vh = np.linalg.svd(X, full_matrices=True)
print("The X Matrix is \n", X, "\n")
print("The SVD of X is \n u = ", u, "\n", "s =", s, "\n", "Vh =", vh)
# Find the eigenvalue decomposition of X X^T
xT = np.transpose(X)  # transpose of X; X.T is equivalent
values, vectors = np.linalg.eig(np.matmul(X, xT))  # eigendecomposition of the 3x3 matrix X X^T
print("\n The EigenValues of XXT are ", values, "\n")
print("The eigenVectors of XXT are ", vectors, "\n")
print("As can be seen, the eigenvectors of XXT (vectors) correspond to the left singular vectors of X (u), up to sign and ordering \n")
# Calculate the energy: squared Frobenius norm of X versus the sum of squared singular values
print("Energy of X:", np.linalg.norm(X, 'fro')**2, " Sum of squared singular values:", np.sum(s**2))

The X Matrix is 
 [[-1.051067    0.63021127 -0.52533119 ... -0.96208673  1.09520685
  -1.45268584]
 [-3.62929728  0.48952464  0.58590333 ... -2.4267398   0.80806103
  -1.53418788]
 [-2.24983728  0.62124657 -0.08897508 ... -1.67305538  1.06122253
  -1.60711789]] 

The SVD of X is 
 u =  [[-0.12371142  0.84952605 -0.51283621]
 [-0.8793104  -0.33336786 -0.3401163 ]
 [-0.45990076  0.40886594  0.7882385 ]] 
 s = [3.59213231e+01 1.90749210e+01 2.80476702e-15] 
 Vh = [[ 0.12126529 -0.02210721 -0.01139384 ...  0.08413716 -0.03713903
   0.06313398]
 [-0.03160692  0.03282822 -0.03554315 ... -0.03629771  0.05740118
  -0.07233274]
 [ 0.40249603 -0.70000856 -0.01381437 ... -0.0041767   0.00472315
  -0.01308864]
 ...
 [-0.03170337  0.02055295 -0.0710015  ...  0.99210007  0.0049321
  -0.00797879]
 [ 0.04951006  0.00980122  0.04064017 ...  0.00463402  0.9956231
   0.00628077]
 [-0.0595782  -0.00972047 -0.07024716 ... -0.0072056   0.00612016
   0.99092898]]

 The EigenValues of XXT are  [   0.         
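
The three relationships can also be checked numerically rather than by inspecting printed arrays. The following is a minimal sketch, assuming freshly generated data (so the numbers differ from the output above). It uses `np.linalg.eigh` in place of the `eig` call above, since `eigh` is intended for symmetric matrices and returns real eigenvalues in ascending order. The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)                 # hypothetical seed
A = rng.standard_normal((3, 2))
X = A @ rng.standard_normal((2, 500))

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# X X^T is symmetric, so eigh applies; reverse to get descending order like the SVD.
evals, evecs = np.linalg.eigh(X @ X.T)
evals, evecs = evals[::-1], evecs[:, ::-1]

# Eigenvalues of X X^T equal the squared singular values of X.
eigs_match = np.allclose(evals, s**2)
print("eigenvalues match s**2:", eigs_match)

# Eigenvectors are only defined up to sign, so compare |u_i^T q_i| with 1
# for the two directions whose singular values are nonzero.
overlaps = np.abs(np.sum(U[:, :2] * evecs[:, :2], axis=0))
vecs_match = np.allclose(overlaps, 1.0)
print("leading vectors match up to sign:", vecs_match)

# Energy: squared Frobenius norm equals the sum of squared singular values.
energy = np.linalg.norm(X, "fro") ** 2
energy_match = np.allclose(energy, np.sum(s**2))
print("energy matches:", energy_match)
```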

# PCA of Dataset #1

1. Since each data sample $x_i$ lies in a three-dimensional space, we can have up to three principal components
of this data. However, based on your knowledge of how the data was created (and subsequent discussion
above), how many principal components should be enough to capture all variation in the data? Justify
your answer as much as you can.

2. While mean centering is an important preprocessing step for PCA, we do not necessarily need to carry out mean centering in this problem since the mean vector will have small entries. Indeed, if we let $x_1$, $x_2$, and $x_3$ denote the first, second, and third components of the random vector x, then it follows that E[$x_k$] = 0, k = 1, 2, 3.
• Formally show that E[$x_k$] = 0, k = 1, 2, 3, for this particular problem.
• Compute the mean vector m from the data matrix X and verify by printing that its entries are
indeed small.

3. Compute the top two principal components U =[$u_1$ $u_2$] of this dataset and print them.
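
The zero-mean claim in item 2 follows from linearity of expectation. The following is a short derivation, writing $a_{kj}$ for the entries of the fixed matrix A and $v_j$ for the entries of v (this indexing notation is introduced here for illustration).

```latex
x_k = \sum_{j=1}^{2} a_{kj}\, v_j
\quad\Longrightarrow\quad
E[x_k] = \sum_{j=1}^{2} a_{kj}\, E[v_j]
       = \sum_{j=1}^{2} a_{kj} \cdot 0
       = 0,
\qquad k = 1, 2, 3 .
```

The sample mean computed from 500 draws is therefore close to zero but not exactly zero: each of its entries has standard deviation $\sqrt{\mathrm{Var}(x_k)/500}$.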



In [58]:
#Question 1
from numpy import mean
from numpy import cov
from numpy.linalg import eig
print("In this dataset, only two principal components should be necessary. X is a matrix that lies in 3-dimensional space,\n however, its rank is only 2, so only 2 of the 3 dimensions are being spanned in 3-d space.")
#2
# First calculate the means
# NOTE: X.T has shape (500, 3), so axis=1 averages across the three coordinates of each sample
# and returns 500 values; the 3-entry mean vector m would be mean(X, axis=1).
M = mean(X.T, axis = 1)
print(" \n The mean of X is", M, "\n")
# Subtract M from X (the 500 values broadcast across the three rows)
Center = X - M
print("The mean Centered vector of X is \n", Center)
# We then calculate the covariance of the centered matrix
# NOTE: cov treats rows as variables, so cov(Center.T) is 500x500 rather than 3x3.
covX = cov(Center.T)
# To get our principal components, we now do an eigendecomposition of the covariance matrix
covValues, covVectors = np.linalg.eig(covX)
print("\n The eigenvalues of covariance matrix is\n", covValues)
print("\n The eigenvectors of covariance matrix\n", covVectors)
# Finally we project the centered data onto the eigenvectors
P = covVectors.T.dot(Center.T)
print("\n Our principal components are \n", P.T)
print("\n In this data, I acknowledge that there is an error. What should have happened when we print the eigenvalues")
print("is that only two of the three eigenvalues are non-zero. This means only two eigenvectors are needed,")
print("meaning that we can model this data using only 2-D space instead of 3-D as explained above")

In this dataset, only two principal components should be necessary. X is a matrix that lies in 3-dimensional space,
 however, its rank is only 2, so only 2 of the 3 dimensions are being spanned in 3-d space.
 
 The mean of X is [-7.62753553e-01  5.06895381e-01  2.03490361e-01  7.79359376e-01
 -2.06210040e+00  1.11212892e+00  4.04356604e-01  6.41729814e-01
 -2.45141384e-01 -6.21843945e-01  6.45971312e-01 -3.32522243e-01
 -5.11366133e-01 -1.60792395e+00 -1.67273492e-01  1.47843682e-01
  1.77835169e-01 -2.70233207e-01  4.07992856e-01  8.13995041e-01
 -3.54247204e-01  2.33904827e+00 -1.40386840e+00 -2.02284652e+00
  1.10068272e-01  4.27916786e-01 -6.52501781e-01  5.71536753e-01
 -2.65861957e-01  2.98782260e-01  3.59574291e-01 -1.55019874e-01
  1.83686326e+00  4.98939604e-01  1.19268397e+00  8.24147107e-01
  2.37998687e-01  1.66329830e+00 -1.77689618e+00 -2.89684756e-01
 -3.13042701e-01  1.58579867e+00 -1.09649778e+00  1.92952121e-01
 -1.07468064e+00 -1.03816806e+00  1.36700457e-02 -8.50820


 The eigenvalues of covariance matrix is
 [ 1.14686286e+04+0.00000000e+00j  2.33204885e+03+0.00000000e+00j
 -4.30224635e-13+2.73434473e-13j -4.30224635e-13-2.73434473e-13j
  4.50949573e-13+2.57357256e-13j  4.50949573e-13-2.57357256e-13j
 -4.47412209e-13+1.11919105e-13j -4.47412209e-13-1.11919105e-13j
 -3.08760318e-13+3.19816804e-13j -3.08760318e-13-3.19816804e-13j
 -1.02168517e-13+4.43311171e-13j -1.02168517e-13-4.43311171e-13j
  8.71426760e-14+4.46306557e-13j  8.71426760e-14-4.46306557e-13j
  2.18704259e-13+3.95025953e-13j  2.18704259e-13-3.95025953e-13j
 -4.10645989e-13+0.00000000e+00j -3.21627262e-13+2.07925529e-13j
 -3.21627262e-13-2.07925529e-13j  4.07283867e-13+2.60736109e-14j
  4.07283867e-13-2.60736109e-14j  3.35966047e-13+2.12792894e-13j
  3.35966047e-13-2.12792894e-13j -5.43760572e-14+3.57458747e-13j
 -5.43760572e-14-3.57458747e-13j  2.61836167e-13+2.53875677e-13j
  2.61836167e-13-2.53875677e-13j -1.97115708e-13+1.81697959e-13j
 -1.97115708e-13-1.81697959e-13j  2.01536547e-1
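
The output above contains 500 mean values and 500 complex eigenvalues, whereas the problem calls for a 3-entry mean vector and a 3×3 covariance matrix. The following is a sketch of the PCA step with the mean taken per coordinate over the 500 samples. It regenerates its own data, so the numbers will not match the cells above; the seed and the names `m`, `Xc`, `C`, and `U` are illustrative assumptions, and `np.linalg.eigh` is used in place of `eig` because the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)                 # hypothetical seed
A = rng.standard_normal((3, 2))
X = A @ rng.standard_normal((2, 500))          # columns are samples

# Mean over samples (axis=1), kept as a 3x1 column so it broadcasts over columns.
m = X.mean(axis=1, keepdims=True)
Xc = X - m

# np.cov treats rows as variables by default, so this is the 3x3 covariance.
C = np.cov(Xc)

# C is symmetric: eigh returns real eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]                # sort in descending order
evals, evecs = evals[order], evecs[:, order]

U = evecs[:, :2]                               # top two principal components u1, u2
print("mean vector m:", m.ravel())
print("eigenvalues:", evals)
print("U =\n", U)
```

With this layout the third eigenvalue is zero up to round-off, which matches the expectation that two principal components capture all variation in the data.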

# Generation of Dataset #2
1. Create a vector c ∈ $R^{3}$ whose individual entries are drawn from a Gaussian distribution with mean 0 and variance 3 in an independent and identically distributed (iid) fashion. Once generated, this vector should not be changed for the rest of this exercise.

2. Each of our data samples x ∈ $R^{3}$ is going to be generated in the following fashion: x = Av + c, where v ∈ $R^{2}$
is a random vector whose entries are iid Gaussian with mean 0 and variance 1. Note that we will have a different v for each new data sample (i.e., unlike A and c, it is not fixed for each data sample).

• Generate 500 data samples $\{x_i\}_{i=1}^{500}$ using the aforementioned mathematical model.


• Except for the addition of c to each data sample, dataset #2 is identical to dataset #1 in terms of
the generation mechanism. Addition of c, however, shifts our data from the origin, which increases
its rank. Verify this by printing the rank of X (The curious may want to think about this question:
what are the possible choices of c for which the rank of X will not increase?).


In [65]:
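# The creation of this vector is similar to the generation of the first, except the variance is 3.
# np.random.normal takes the standard deviation (scale), so pass sqrt(3) for a variance of 3.
C = np.random.normal(0, np.sqrt(3), (3,1))
print(C)
print(matrix_rank(C))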

[[-0.2854629 ]
 [-0.39829144]
 [ 4.66619783]]
1
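
The second argument of `np.random.normal` is the standard deviation (`scale`), not the variance, so a variance of 3 corresponds to a scale of √3. The following is a quick empirical check, offered as a sketch; the seed, the sample size `n`, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)       # hypothetical seed
n = 200_000                          # large sample so the empirical variance is stable

draws_scale_3 = rng.normal(0, 3, n)                # scale=3       -> variance 9
draws_scale_sqrt3 = rng.normal(0, np.sqrt(3), n)   # scale=sqrt(3) -> variance 3

var_scale_3 = draws_scale_3.var()
var_scale_sqrt3 = draws_scale_sqrt3.var()
print("variance with scale=3:      ", var_scale_3)
print("variance with scale=sqrt(3):", var_scale_sqrt3)
```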


In [69]:
# Generate the data matrix; C has shape (3, 1) and broadcasts across all 500 columns
v = np.random.normal(0,1, (2,500))
x = (A @ v) + C
print(x.shape)
print(x)
# Print the rank of the matrix
print("The rank of the Matrix is now", np.linalg.matrix_rank(x))

(3, 500)
[[-0.5604007   0.41658356  0.55612758 ... -0.37952561 -1.66819826
  -0.65355621]
 [-0.75303762  1.01624661  0.93372059 ...  2.05726663 -0.67853332
  -1.68591409]
 [ 4.33425132  5.7333143   5.78849402 ...  5.66454611  3.64565443
   3.87111737]]

 The rank of the Matrix is now 3
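
For the question about which choices of c leave the rank unchanged: writing X = AV + c1ᵀ shows that the columns of X lie in the span of the columns of A together with c, so the rank stays at 2 exactly when c already lies in the column space of A. The following is a sketch comparing the two cases; it regenerates its own data, and the seed and the names `c_generic`, `w`, and `c_in_span` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)                   # hypothetical seed
A = rng.standard_normal((3, 2))
V = rng.standard_normal((2, 500))

# Generic offset: almost surely outside the column space of A.
c_generic = rng.normal(0, np.sqrt(3), (3, 1))
# Offset inside the column space of A (w is an arbitrary 2-vector).
w = rng.standard_normal((2, 1))
c_in_span = A @ w

# Each (3, 1) offset broadcasts across the 500 columns.
rank_generic = np.linalg.matrix_rank(A @ V + c_generic)
rank_in_span = np.linalg.matrix_rank(A @ V + c_in_span)
print("rank with generic c:  ", rank_generic)
print("rank with c in col(A):", rank_in_span)
```

When c = Aw, the data can be rewritten as A(V + w1ᵀ), which is again a product with the 3×2 matrix A and therefore has rank at most 2.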


# PCA, Centering, and Dataset #2

# Generation of Dataset #3

# PCA Denoising of Dataset #3

# Real Data