


<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while preserving as much information as possible. It does this by determining how the existing features relate to one another and combining them into new directions, called principal components, and then quantifying how relevant each principal component is. The principal components are used to transform the high-dimensional data into lower-dimensional data. For a principal component to be relevant, it needs to capture information (variance) about the features. We can determine the relationships between features using covariance.

In [2]:
# Import the necessary packages
import numpy as np
from numpy.linalg import eig



In [3]:

data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

Ans:
The aim of standardization is to put the continuous features on the same scale so that each contributes equally to the outcome of a machine learning model, preventing features with larger numeric ranges from dominating those with smaller ranges. Standardization is critical before performing Principal Component Analysis (PCA) because PCA is sensitive to the variances of the original features. If one feature has a much larger variance and range of values than the others, it will dominate the principal components and skew the analysis. Standardizing ensures that each feature contributes equally to the principal components, so PCA finds directions that reflect the structure of the data rather than the units in which each feature was measured.
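To make the effect of scale concrete, here is a minimal sketch on hypothetical toy data (not the assignment dataset): two independent random features, one of which is multiplied by 1000. The helper name `first_component` is an assumption introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy data: two independent features on very different scales.
rng = np.random.default_rng(0)
small = rng.normal(size=200)            # values roughly within +/- 3
large = 1000.0 * rng.normal(size=200)   # values roughly within +/- 3000
toy = np.column_stack([small, large])

def first_component(x):
    """Return the unit eigenvector with the largest eigenvalue of cov(x)."""
    values, vectors = np.linalg.eigh(np.cov(x, rowvar=False))
    return vectors[:, np.argmax(values)]

raw_pc1 = first_component(toy)
scaled_pc1 = first_component((toy - toy.mean(axis=0)) / toy.std(axis=0))

# Absolute values are printed because the sign of an eigenvector is arbitrary.
print("PC1 on raw data:         ", np.round(np.abs(raw_pc1), 3))
print("PC1 on standardized data:", np.round(np.abs(scaled_pc1), 3))
```

On the raw data the first component points almost entirely along the large-scale feature, while after standardization both features carry equal weight in it.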

In [4]:
mean = np.mean(data, axis=0, keepdims=True)

standardized_data = (data - mean) / np.std(data, axis=0)

# test to see if the data is standardized
print(standardized_data)

[[-1.36438208  0.70710678  1.5109662  -0.99186978  0.77802924]
 [ 0.12403473 -1.94454365 -0.13736056  0.77145428 -2.06841919]
 [-0.62017367  0.1767767   0.68680282 -0.99186978  0.20873955]
 [ 1.61245155  0.1767767  -1.78568733  0.33062326  0.20873955]
 [-0.62017367  1.23743687 -0.13736056 -0.77145428  1.00574511]
 [ 0.86824314 -0.35355339 -0.13736056  1.65311631 -0.13283426]]


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [5]:
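n, d = data.shape
# Center the raw data (note: this uses `data`, not `standardized_data`)
mean = np.mean(data, axis=0, keepdims=True)
centered_data = data - mean
# Sample covariance matrix: features are columns, so the result is d x d
cov_matrix = np.dot(centered_data.T, centered_data) / (n - 1)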


print(cov_matrix)

[[  2.16666667  -1.06666667  -1.76666667   5.5         -4.36666667]
 [ -1.06666667   4.26666667   0.46666667  -6.6         19.66666667]
 [ -1.76666667   0.46666667   1.76666667  -3.3          2.36666667]
 [  5.5         -6.6         -3.3         24.7        -27.9       ]
 [ -4.36666667  19.66666667   2.36666667 -27.9         92.56666667]]
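As a sanity check, the manual formula can be compared against NumPy's built-in `np.cov`. This is a small self-contained sketch that repeats the dataset so it runs on its own; the names `manual_cov` and `library_cov` are introduced only for this comparison.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])

n = data.shape[0]
centered = data - data.mean(axis=0, keepdims=True)
manual_cov = centered.T @ centered / (n - 1)

# rowvar=False tells np.cov that columns are features and rows are samples.
# By default np.cov also divides by (n - 1), matching the manual formula.
library_cov = np.cov(data, rowvar=False)

print(np.allclose(manual_cov, library_cov))
```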


### Step 3: Eigendecomposition on the Covariance Matrix


In [6]:
# Use NumPy's built-in function to get the
# eigenvalues and eigenvectors of cov_matrix

# Note: These come from the covariance matrix and not the data itself.

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Print the results to check that they are the right values.
# Each column of `eigenvectors` is the eigenvector for the matching eigenvalue.
print('eigenvalues: ', eigenvalues)
print('eigenvectors: ', eigenvectors)

eigenvalues:  [1.07224751e+02 1.61823788e+01 1.93173735e+00 1.27579741e-01
 2.20003762e-04]
eigenvectors:  [[-0.05817655 -0.2631212   0.57237125  0.6292347  -0.45148374]
 [ 0.19774895 -0.03283879  0.06849106 -0.60720902 -0.7657827 ]
 [ 0.0328828   0.17887983 -0.75671562  0.45776292 -0.42983171]
 [-0.33200499 -0.88598416 -0.30234056 -0.11461168  0.01609676]
 [ 0.91989252 -0.33574235 -0.06059523  0.11259736  0.15724145]]
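Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`. The sketch below is not the notebook's method; it only checks the defining property of eigenpairs on the same dataset. Note that `eigh` returns eigenvalues in ascending order, and eigenvector signs may differ from the output above.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
cov_matrix = np.cov(data, rowvar=False)

# eigh is intended for symmetric matrices: it returns real eigenvalues in
# ascending order and orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Defining property: cov_matrix @ v = lambda * v for every pair.
# Broadcasting scales column j of `eigenvectors` by eigenvalues[j].
print(np.allclose(cov_matrix @ eigenvectors, eigenvectors * eigenvalues))

# Eigenvectors of a symmetric matrix form an orthonormal set.
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(5)))

# The eigenvalues sum to the trace, i.e. the total variance of the features.
print(np.isclose(eigenvalues.sum(), np.trace(cov_matrix)))
```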


### Step 4: Sort the Principal Components

In [7]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigenvalues)[::-1]
print ( 'the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:,order_of_importance] # sort the columns
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 2 3 4]


 sorted eigen values:
[1.07224751e+02 1.61823788e+01 1.93173735e+00 1.27579741e-01
 2.20003762e-04]


 The sorted eigen vector matrix is: 
 [[-0.05817655 -0.2631212   0.57237125  0.6292347  -0.45148374]
 [ 0.19774895 -0.03283879  0.06849106 -0.60720902 -0.7657827 ]
 [ 0.0328828   0.17887983 -0.75671562  0.45776292 -0.42983171]
 [-0.33200499 -0.88598416 -0.30234056 -0.11461168  0.01609676]
 [ 0.91989252 -0.33574235 -0.06059523  0.11259736  0.15724145]]


Question:

1. Why do we order eigenvalues and eigenvectors?

ANS:
Each eigenvalue measures how much variance its eigenvector (principal component) captures. Ordering the eigenvalues, together with their eigenvectors, from largest to smallest puts the most significant components first, so we can keep the top k for efficient data representation and dimensionality reduction.

2. Is it true we would consider the lowest eigenvalue compared to the highest? Defend your answer.

ANS:
Generally no. For dimensionality reduction the focus is on the highest eigenvalues, because their eigenvectors capture most of the variance in the data. The lowest eigenvalues can still be worth examining in specific settings such as noise analysis and anomaly detection.


To see what percentage of the information each eigenvalue holds, print the percentage for each eigenvalue using the formula



> (sorted eigen values / sum of all sorted eigen values) * 100



In [8]:
#  (sorted eigen values / sum of all sorted eigen values) * 100
sum_of_eigenvalues = np.sum(sorted_eigenvalues)

# print the sum of the Eigen values
print(sum_of_eigenvalues)

# now calculate the percentage value of each eigen value
percentages = (sorted_eigenvalues / sum_of_eigenvalues) * 100

# print out the percentages of each eigen values

print(percentages)

# print out the percentages
for idx, value in enumerate(sorted_eigenvalues):
    print(f"Eigenvalue {idx+1}: {value:.8f} ({percentages[idx]:.2f}%)")

125.46666666666665
[8.54607471e+01 1.28977515e+01 1.53964188e+00 1.01684172e-01
 1.75348375e-04]
Eigenvalue 1: 107.22475075 (85.46%)
Eigenvalue 2: 16.18237882 (12.90%)
Eigenvalue 3: 1.93173735 (1.54%)
Eigenvalue 4: 0.12757974 (0.10%)
Eigenvalue 5: 0.00022000 (0.00%)


In [9]:
# Use sorted_eigenvalues so the explained variances line up with the sorted eigenvectors
explained_variance = (sorted_eigenvalues / sum_of_eigenvalues) * 100
explained_variance =["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['85.46%', '12.90%', '1.54%', '0.10%', '0.00%']
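One common way to use these percentages is to accumulate them and pick the smallest number of components that reaches a target. The sketch below copies the eigenvalues from the sorted output above; the 95% `threshold` is a hypothetical choice for illustration, not part of the assignment.

```python
import numpy as np

# Eigenvalues copied from the sorted output above (largest first).
sorted_eigenvalues = np.array([
    1.07224751e+02, 1.61823788e+01, 1.93173735e+00,
    1.27579741e-01, 2.20003762e-04,
])

explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

threshold = 0.95  # hypothetical target share of the total variance

# searchsorted gives the first index whose cumulative share reaches the
# threshold; adding 1 converts that index into a number of components.
k = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative share:", np.round(cumulative, 4))
print("components needed:", k)
```

With these eigenvalues the first component alone stays below 95%, while the first two together exceed it, which is consistent with using k = 2.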


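### Step 5: Project the Data onto k Principal Components

Initialize the number of principal components as the variable `k` (for example, `k = 3` for 3 principal components), then perform matrix multiplication with the first `k` sorted eigenvectors.

> The resulting matrix (with reduced data) = standardized data * matrix whose columns are the first k sorted eigenvectors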

See expected output for k = 2



In [10]:
k = 2 # select the number of principal components

# sorted_eigenvectors is already ordered, so its first k columns are the top k components
reduced_data_vectors = sorted_eigenvectors[:, :k]
reduced_data = np.matmul(standardized_data, reduced_data_vectors) # transform the original data

In [11]:
# print the value of the reduced data
print(reduced_data)

# print the shape of the reduced data
print(reduced_data.shape)

[[ 1.31389845  1.22362226]
 [-2.55511419  0.01760889]
 [ 0.61494463  1.08892909]
 [-0.03531847 -1.11250845]
 [ 1.45756867  0.44379893]
 [-0.7959791  -1.66145072]]
(6, 2)
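To get a feel for how much information the projection discards, the reduced data can be mapped back to five dimensions and compared with the standardized data. This is a sketch under the same setup as the cells above (standardized data projected onto eigenvectors of the raw-data covariance matrix). It uses `np.linalg.eigh` instead of `np.linalg.eig`, and the helper `reconstruction_error` is a hypothetical name introduced here.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# As in the cells above, the eigenvectors come from the raw-data covariance.
values, vectors = np.linalg.eigh(np.cov(data, rowvar=False))
order = np.argsort(values)[::-1]   # largest eigenvalue first
vectors = vectors[:, order]

def reconstruction_error(k):
    """Project onto the first k eigenvectors, map back, return the mean squared error."""
    top_k = vectors[:, :k]
    reduced = standardized @ top_k   # shape (6, k)
    restored = reduced @ top_k.T     # back to shape (6, 5)
    return float(np.mean((standardized - restored) ** 2))

errors = [reconstruction_error(k) for k in range(1, 6)]
for k, err in enumerate(errors, start=1):
    print(f"k = {k}: mean squared reconstruction error = {err:.6f}")
```

The error cannot increase as k grows, and it drops to zero (up to floating-point rounding) at k = 5 because the full set of eigenvectors is an orthonormal basis.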




### What are 2 positive effects and 2 negative effects of PCA?

Positive Effects of PCA:
1. Dimensionality Reduction: PCA reduces the dimensionality of the data, which lowers storage and computational requirements while retaining most of the variance (information) in the data. This makes complex datasets more manageable and can speed up machine learning algorithms.

2. Noise Reduction: By keeping the components that explain the most variance and discarding the lower-variance components, PCA can filter out noise from the data. This can improve the performance of machine learning models by focusing them on the most informative directions.

Negative Effects of PCA:
1. Loss of Interpretability: The principal components are linear combinations of the original variables and often do not have a straightforward interpretation. This transformation can make it difficult to understand the meaning of the components in terms of the original features.

2. Assumption of Linearity: PCA only finds linear combinations of the original features, so it may not capture the structure of data that is inherently nonlinear. This can lead to suboptimal performance in capturing the true variability in datasets with nonlinear relationships.
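The linearity limitation can be illustrated with a hypothetical toy example that is not part of the assignment data: points on a circle are fully described by a single angle, yet no single straight direction captures them.

```python
import numpy as np

# Hypothetical toy data: 100 evenly spaced points on the unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

# eigvalsh returns only the eigenvalues of a symmetric matrix.
values = np.linalg.eigvalsh(np.cov(circle, rowvar=False))
ratios = values / values.sum()

print("share of variance per component:", np.round(ratios, 3))
```

Both components explain about half of the variance, so dropping either one loses half of the information even though the data has only one underlying degree of freedom.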







