# Final Exam Second Semester 2566 - Principal Component Analysis (Red Wine Problem)

The objective of this exam problem is to apply principal component analysis (PCA) to reduce the dimensionality of the following red wine attributes:
1. Tartaric Acid
2. Grape Density
3. Citric Acid
4. Residual Sugar
5. Sodium Chloride
6. Free Sulfur Dioxide
7. Bound Sulfur Dioxide
8. Alcohol Density

In [66]:
# used for manipulating directory paths
import os

# Scientific and vector computation for python
import numpy as np

# library written for this exam
import utilsPCA as utils

%load_ext autoreload
%autoreload 2

import random 
random.seed(10)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### We start the exam by first loading the dataset.

In [67]:
# Load the dataset into the variable X 
data = np.loadtxt(os.path.join('Data', 'PCA_WineData.txt'))
X = data

m = X.shape[0] # number of training examples

### Normalize data here
#### Hint: Use utils.featureNormalize

In [68]:
utils.featureNormalize

<function utilsPCA.featureNormalize(X)>
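
The source of `utils.featureNormalize` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might do, assuming it applies a per-feature z-score. The function name `feature_normalize_sketch`, the toy data, and the choice of `ddof=1` are all assumptions made for illustration.

```python
import numpy as np

def feature_normalize_sketch(X):
    """Hypothetical stand-in for utils.featureNormalize (assumed behavior)."""
    mu = np.mean(X, axis=0)            # per-feature mean, shape (n,)
    sigma = np.std(X, axis=0, ddof=1)  # per-feature sample std (ddof=1 is an assumption)
    X_norm = (X - mu) / sigma          # broadcast over the m rows
    return X_norm, mu, sigma

# Toy data: 4 examples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

X_demo_norm, mu_demo, sigma_demo = feature_normalize_sketch(X_demo)
print(mu_demo)                           # one mean per column
print(X_demo_norm.std(axis=0, ddof=1))   # each column now has unit sample std
```

Under this assumption, passing the whole matrix `X` yields one mean and one standard deviation per feature, which is what PCA needs so that no attribute dominates purely because of its units.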

In [69]:
X[0]
Xnorm = utils.featureNormalize(X[0])
print(Xnorm)

(array([ 0.03365344, -0.54334881, -0.60363263, -0.44000513, -0.59708753,
        0.3436845 ,  2.32443851, -0.51770235]), 7.009225, 11.611739749279606)


In [70]:
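# Note: pca is defined in the next cell (In [71]); run that cell first,
# otherwise the call below raises a NameError.
X_Norm, mu, sigma = utils.featureNormalize(X)
U, S = pca(X_Norm)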

In [71]:
def pca(X):
    """
    Run principal component analysis.
    
    Parameters
    ----------
    X : array_like
        The dataset to be used for computing PCA. It has dimensions (m x n)
        where m is the number of examples (observations) and n is 
        the number of features.
    
    Returns
    -------
    U : array_like
        The eigenvectors, representing the computed principal components
        of X. U has dimensions (n x n) where each column is a single 
        principal component.
    
    S : array_like
        A vector of size n, containing the singular values for each
        principal component. Note this is the diagonal of the matrix we 
        mentioned in class.
    
    Instructions
    ------------
    You should first compute the covariance matrix. Then, you
    should use the "svd" function to compute the eigenvectors
    and eigenvalues of the covariance matrix. 

    Notes
    -----
    When computing the covariance matrix, remember to divide by m (the
    number of examples).
    
    """
    # Useful values
    m, n = X.shape

    # You need to return the following variables correctly.
    U = np.zeros((n,n))
    S = np.zeros(n)

    # ====================== YOUR CODE HERE ======================
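    # Covariance matrix of the (already normalized) data, shape (n x n)
    sigma = (1 / m) * np.dot(X.T, X)
    # Columns of U are the principal components; S holds the singular values
    U, S, V = np.linalg.svd(sigma)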
    
    # ============================================================
    return U, S
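
As a sanity check on the approach used in `pca`, the sketch below compares the SVD of a covariance matrix with its eigendecomposition on synthetic data. Because a covariance matrix is symmetric and positive semi-definite, the singular values should equal the eigenvalues, and the singular vectors should match the eigenvectors up to sign. The data and every name ending in `_demo` are made up for illustration and are not part of the exam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with clearly different spread per feature (illustrative only)
X_demo = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 1.0])
X_demo = X_demo - X_demo.mean(axis=0)   # center so X^T X / m is a covariance matrix

m_demo = X_demo.shape[0]
Sigma_demo = (X_demo.T @ X_demo) / m_demo

# SVD returns singular values in descending order
U_demo, S_demo, _ = np.linalg.svd(Sigma_demo)

# eigh returns eigenvalues in ascending order, so reverse them
eigvals, eigvecs = np.linalg.eigh(Sigma_demo)
eigvals_desc = eigvals[::-1]
eigvecs_desc = eigvecs[:, ::-1]

values_match = np.allclose(S_demo, eigvals_desc)
# Eigenvectors are only defined up to sign, so compare absolute values
vectors_match = np.allclose(np.abs(U_demo), np.abs(eigvecs_desc))
print(values_match, vectors_match)
```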

In [72]:
def projectData(X, U, K):
    """
    Computes the reduced data representation when projecting only 
    on to the top K eigenvectors.
    
    Parameters
    ----------
    X : array_like
        The input dataset of shape (m x n). The dataset is assumed to be 
        normalized.
    
    U : array_like
        The computed eigenvectors using PCA. This is a matrix of 
        shape (n x n). Each column in the matrix represents a single
        eigenvector (or a single principal component).
    
    K : int
        Number of dimensions to project onto. Must be smaller than n.
    
    Returns
    -------
    Z : array_like
        The projection of the dataset onto the top K eigenvectors. 
        This will be a matrix of shape (m x K).
    
    Instructions
    ------------
    Compute the projection of the data using only the top K 
    eigenvectors in U (first K columns). 
    For the i-th example X[i,:], the projection on to the k-th 
    eigenvector is given as follows:
    
        x = X[i, :]
        projection_k = np.dot(x, U[:, k])

    """
    # You need to return the following variables correctly.
    Z = np.zeros((X.shape[0], K))

    # ====================== YOUR CODE HERE ======================
    Z = np.dot(X,U[:,:K])

    
    # =============================================================
    return Z

In [73]:
#  Project the data onto K = 4 dimensions
K = 4
Z = projectData(X_Norm, U, K)
print(Z)

[[ 1.17118991 -0.46238533  1.25825924  0.36852371]
 [ 1.51540804  1.27597417  1.54641923 -0.01856076]
 [ 1.14882192  0.26134202  1.26477794  0.05092125]
 ...
 [ 1.42890224  0.62548711 -0.56209225 -0.03909093]
 [ 1.94632673  0.89447623 -0.24710565 -0.14883086]
 [ 0.07803172  0.14456627 -1.51026084  0.20439309]]
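
One property that may help when checking a projection like the one above: for centered data, the projected coordinates are uncorrelated, and the variance of the k-th projected column equals the k-th value in `S`. The sketch below illustrates this on synthetic correlated data; all `_demo` names and the data itself are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated synthetic features: mix independent columns with a random matrix
X_demo = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X_demo = X_demo - X_demo.mean(axis=0)
m_demo = X_demo.shape[0]

U_demo, S_demo, _ = np.linalg.svd((X_demo.T @ X_demo) / m_demo)

K_demo = 2
Z_demo = X_demo @ U_demo[:, :K_demo]      # same operation as projectData

# Covariance of the projected data: expected to be diag(S_demo[:K_demo])
cov_Z = (Z_demo.T @ Z_demo) / m_demo
print(np.round(cov_Z, 6))
print(S_demo[:K_demo])
```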


In [74]:
def recoverData(Z, U, K):
    """
    Recovers an approximation of the original data when using the 
    projected data.
    
    Parameters
    ----------
    Z : array_like
        The reduced data after applying PCA. This is a matrix
        of shape (m x K).
    
    U : array_like
        The eigenvectors (principal components) computed by PCA.
        This is a matrix of shape (n x n) where each column represents
        a single eigenvector.
    
    K : int
        The number of principal components retained
        (should be less than n).
    
    Returns
    -------
    X_rec : array_like
        The recovered data after transformation back to the original 
        dataset space. This is a matrix of shape (m x n), where m is 
        the number of examples and n is the dimensions (number of
        features) of the original dataset.
    
    Instructions
    ------------
    Compute the approximation of the data by projecting back
    onto the original space using the top K eigenvectors in U.
    For the i-th example Z[i,:], the (approximate)
    recovered data for dimension j is given as follows:

        v = Z[i, :]
        recovered_j = np.dot(v, U[j, :K].T)

    Notice that U[j, :K] is a vector of size K.
    """
    # You need to return the following variables correctly.
    X_rec = np.zeros((Z.shape[0], U.shape[0]))

    # ====================== YOUR CODE HERE ======================
    X_rec = np.dot(Z,U[:,:K].T)


    # =============================================================
    return X_rec

In [75]:
X_rec  = recoverData(Z, U, K)
print(X_rec[0])

[-0.48332697  1.18247251 -1.05409947  0.04647057 -0.15572017 -0.54125759
 -0.49180637 -0.01727894]
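
To judge how good a recovery like `X_rec` is, one option is to compare the relative squared reconstruction error with the fraction of variance discarded. For centered data these two quantities coincide. The following sketch shows this on synthetic data; the `_demo` names and the data are illustrative assumptions, not the wine dataset.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data whose features have decreasing spread
X_demo = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X_demo = X_demo - X_demo.mean(axis=0)
m_demo = X_demo.shape[0]

U_demo, S_demo, _ = np.linalg.svd((X_demo.T @ X_demo) / m_demo)

K_demo = 2
Z_demo = X_demo @ U_demo[:, :K_demo]          # project (as in projectData)
X_rec_demo = Z_demo @ U_demo[:, :K_demo].T    # recover (as in recoverData)

# Relative squared reconstruction error
rel_error = np.sum((X_demo - X_rec_demo) ** 2) / np.sum(X_demo ** 2)
# Fraction of variance kept by the top K components
retained = np.sum(S_demo[:K_demo]) / np.sum(S_demo)

print(rel_error, 1.0 - retained)   # the two numbers should agree
```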



In [83]:
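# Cumulative fraction of the total of S captured by the first i+1 components.
# Note: this cell uses the raw X rather than X_Norm, so the ratios below
# describe the unnormalized, uncentered data. It also overwrites U, S and K.
Sigma = (1/m) * np.dot(X.T, X)
U, S, V = np.linalg.svd(Sigma)
K = len(S)
c = []
SumS = np.sum(S)
for i in range(K):
    s = np.sum(S[:i+1])   # sum of the i+1 largest singular values
    a = s/SumS
    c.append(a)
print(c)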

[0.9737527183592682, 0.9929862192177847, 0.9994843258714879, 0.999970648933387, 0.9999922031306003, 0.9999966975912472, 0.9999994694062412, 1.0]
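
A cumulative list like the one printed above is typically used to pick the smallest K that retains a target fraction of the total. The sketch below applies that rule to the printed values; the helper name `smallest_k` and the thresholds 0.99 and 0.999 are illustrative choices, not part of the exam.

```python
import numpy as np

# Cumulative ratios copied from the printed output above
cumulative_ratio = np.array([
    0.9737527183592682, 0.9929862192177847, 0.9994843258714879,
    0.999970648933387, 0.9999922031306003, 0.9999966975912472,
    0.9999994694062412, 1.0,
])

def smallest_k(cumulative, threshold):
    """Hypothetical helper: smallest K whose cumulative ratio reaches threshold.

    Assumes 0 < threshold <= 1, so at least one entry qualifies.
    """
    # np.argmax on a boolean array returns the index of the first True
    return int(np.argmax(cumulative >= threshold)) + 1

print(smallest_k(cumulative_ratio, 0.99))    # 2
print(smallest_k(cumulative_ratio, 0.999))   # 3
```

Since these ratios were computed from the raw `X`, repeating the calculation with `X_Norm` may lead to a different choice of K.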


### End of Principal Component Analysis Problem