<a href="https://colab.research.google.com/github/Morioh/math_for_ml/blob/main/Summative_Assignment_PCA_Mourice_Onyonyi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while preserving as much information as possible. It does this by identifying how the existing features relate to one another, combining related features into new variables called principal components, and then quantifying how much of the data's variance each principal component captures. The principal components are used to transform high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can measure the relationships between features using covariance.

In [None]:
# Necessary packages
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# Use-case data: 6 samples (rows) x 5 features (columns)
data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

**Standardizing each feature to zero mean and unit variance ensures that every variable contributes equally to the analysis and prevents variables with larger scales from dominating the result.**

In [None]:
# Standardize the data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
standardized_data

array([[-1.36438208,  0.70710678,  1.5109662 , -0.99186978,  0.77802924],
       [ 0.12403473, -1.94454365, -0.13736056,  0.77145428, -2.06841919],
       [-0.62017367,  0.1767767 ,  0.68680282, -0.99186978,  0.20873955],
       [ 1.61245155,  0.1767767 , -1.78568733,  0.33062326,  0.20873955],
       [-0.62017367,  1.23743687, -0.13736056, -0.77145428,  1.00574511],
       [ 0.86824314, -0.35355339, -0.13736056,  1.65311631, -0.13283426]])
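
The standardization step can also be reproduced by hand. The sketch below assumes the same `data` array and relies on the fact that `StandardScaler` divides by the population standard deviation (`ddof=0`); the name `manual_standardized` is introduced here only for illustration.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Column-wise mean and population standard deviation (ddof=0),
# matching StandardScaler's default behaviour.
feature_means = data.mean(axis=0)
feature_stds = data.std(axis=0)

# z-score: subtract the mean and divide by the standard deviation per feature
manual_standardized = (data - feature_means) / feature_stds

# After standardization every column has mean ~0 and standard deviation ~1
print(np.allclose(manual_standardized.mean(axis=0), 0.0))
print(np.allclose(manual_standardized.std(axis=0), 1.0))
```

Both checks should print `True`, which is what "the same scale" means in practice: every feature is centered and has unit spread.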

![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariances of the features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [None]:
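# Calculate the covariance matrix; rowvar=False treats each column as a feature.
# This call passes the raw `data`, so the output below is the covariance of the
# unstandardized features; pass `standardized_data` to use the scaled values.
cov_matrix = np.cov(data, rowvar=False)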

print('covariance_matrix: ', cov_matrix)

covariance_matrix:  [[  2.16666667  -1.06666667  -1.76666667   5.5         -4.36666667]
 [ -1.06666667   4.26666667   0.46666667  -6.6         19.66666667]
 [ -1.76666667   0.46666667   1.76666667  -3.3          2.36666667]
 [  5.5         -6.6         -3.3         24.7        -27.9       ]
 [ -4.36666667  19.66666667   2.36666667 -27.9         92.56666667]]
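
As a sanity check, the covariance matrix can be rebuilt from its definition. This is a minimal sketch that, like the cell above, works on the raw `data`; `centered` and `manual_cov` are illustrative names, and the `n - 1` divisor matches the default of `np.cov`.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

n_samples = data.shape[0]

# Center each feature (column) around its mean
centered = data - data.mean(axis=0)

# Sample covariance: (X_c^T X_c) / (n - 1), giving a features x features matrix
manual_cov = centered.T @ centered / (n_samples - 1)

# Should agree with NumPy's built-in estimate
print(np.allclose(manual_cov, np.cov(data, rowvar=False)))
```

The diagonal entries are the per-feature variances, and each off-diagonal entry is the covariance between a pair of features, which is why the matrix is symmetric.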


### Step 3: Eigendecomposition on the Covariance Matrix


In [None]:
# Perform eigen decomposition on the covariance matrix

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print('eigenvalues: ', eigenvalues)
print('eigenvectors: ', eigenvectors)

eigenvalues:  [1.07224751e+02 1.61823788e+01 1.93173735e+00 1.27579741e-01
 2.20003762e-04]
eigenvectors:  [[-0.05817655 -0.2631212   0.57237125  0.6292347  -0.45148374]
 [ 0.19774895 -0.03283879  0.06849106 -0.60720902 -0.7657827 ]
 [ 0.0328828   0.17887983 -0.75671562  0.45776292 -0.42983171]
 [-0.33200499 -0.88598416 -0.30234056 -0.11461168  0.01609676]
 [ 0.91989252 -0.33574235 -0.06059523  0.11259736  0.15724145]]
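
One way to check an eigendecomposition is to confirm that each pair satisfies `cov_matrix @ v = eigenvalue * v`. The sketch below uses `np.linalg.eigh` as an alternative to `np.linalg.eig`: it is intended for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order, so the ordering (and possibly the signs of the eigenvectors) can differ from the output printed above.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

cov_matrix = np.cov(data, rowvar=False)

# eigh: for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Each column of `eigenvectors` is one eigenvector; verify A v = lambda v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(cov_matrix @ v, eigenvalues[i] * v)

# The eigenvectors of a symmetric matrix form an orthonormal basis
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(len(eigenvalues))))

# The eigenvalues add up to the total variance (the trace of the matrix)
print(np.isclose(eigenvalues.sum(), np.trace(cov_matrix)))
```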


### Step 4: Sort the Principal Components

In [None]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance]
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 2 3 4]


 sorted eigen values:
[1.07224751e+02 1.61823788e+01 1.93173735e+00 1.27579741e-01
 2.20003762e-04]


 The sorted eigen vector matrix is: 
 [[-0.05817655 -0.2631212   0.57237125  0.6292347  -0.45148374]
 [ 0.19774895 -0.03283879  0.06849106 -0.60720902 -0.7657827 ]
 [ 0.0328828   0.17887983 -0.75671562  0.45776292 -0.42983171]
 [-0.33200499 -0.88598416 -0.30234056 -0.11461168  0.01609676]
 [ 0.91989252 -0.33574235 -0.06059523  0.11259736  0.15724145]]


Question:

1. Why do we order eigenvalues and eigenvectors?

Eigenvectors represent the directions of maximum variance, and eigenvalues represent the magnitude of variance in those directions. Sorting both by eigenvalue in descending order ranks the directions by how much variance they capture, so the most informative components come first and the first k can be kept.

2. Is it true we would consider the lowest eigenvalue compared to the highest? Defend your answer.

No. The eigenvectors are sorted by their corresponding eigenvalues in descending order. The eigenvectors with the highest eigenvalues are the principal components that matter most. By selecting the top principal components (those with the largest eigenvalues), I can reduce the dimensionality of the data while retaining the directions that contain the most variance.


To see what percentage of information each eigenvalue holds, print the percentage for each eigenvalue using the formula



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [None]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

# Percentage of the total variance explained by each principal component
explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues) * 100
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['85.46%', '12.90%', '1.54%', '0.10%', '0.00%']
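
A common follow-up is to accumulate these percentages and pick the smallest k that reaches a target. The sketch below hard-codes the sorted eigenvalues printed earlier; the 95% threshold and the names `target` and `k` are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np

# Sorted eigenvalues copied from the output above (descending order)
sorted_eigenvalues = np.array([
    1.07224751e+02, 1.61823788e+01, 1.93173735e+00,
    1.27579741e-01, 2.20003762e-04,
])

# Fraction of total variance per component, then the running total
explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

# Hypothetical target: keep enough components to explain 95% of the variance
target = 0.95

# argmax returns the first index where the condition is True; +1 turns it into a count
k = int(np.argmax(cumulative_ratio >= target)) + 1

print(["{:.2f}%".format(value * 100) for value in cumulative_ratio])
print("components needed:", k)
```

With these eigenvalues the first two components together account for roughly 98.36% of the variance, so a 95% target is met with two components.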


Initialize the number of principal components, k (for example, k = 3 for three principal components), then perform matrix multiplication with the first k sorted eigenvectors.




> The resulting matrix (with reduced data) = standardized data * matrix whose columns are the first k sorted eigenvectors

See expected output for k = 2
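
In matrix form, the projection can be written as follows. The symbols $X_{\text{std}}$, $W_k$ and $Z$ are notation introduced here for illustration; the shapes are those of this notebook's data with k = 2.

$$
Z = X_{\text{std}}\, W_k, \qquad
X_{\text{std}} \in \mathbb{R}^{6 \times 5}, \quad
W_k \in \mathbb{R}^{5 \times 2}, \quad
Z \in \mathbb{R}^{6 \times 2}
$$

Each row of $Z$ is one sample expressed in the coordinates of the first k principal components, and each column of $W_k$ is one sorted eigenvector.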



In [None]:
# Select the number of principal components
k = 2

# Select the first k eigenvectors
eigenvectors_k = sorted_eigenvectors[:, :k]

# Project the standardized data onto the first k principal components
reduced_data = np.matmul(standardized_data, eigenvectors_k)


In [None]:
print(reduced_data)

[[ 1.31389845  1.22362226]
 [-2.55511419  0.01760889]
 [ 0.61494463  1.08892909]
 [-0.03531847 -1.11250845]
 [ 1.45756867  0.44379893]
 [-0.7959791  -1.66145072]]


In [None]:
print(reduced_data.shape)

(6, 2)
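
For comparison, scikit-learn's `PCA` class (imported at the top but not otherwise used) can perform the same reduction. The sketch below is one possible cross-check, not the notebook's own method: it computes the covariance matrix from the standardized data, so its numbers will not match the output printed above, which was derived from the covariance of the raw data. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

standardized_data = StandardScaler().fit_transform(data)
k = 2

# Manual route: eigendecomposition of the covariance of the standardized data
cov_matrix = np.cov(standardized_data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]          # descending by eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
manual_reduced = standardized_data @ eigenvectors[:, :k]

# Library route: PCA centers the data and projects onto the top-k components
pca = PCA(n_components=k)
sklearn_reduced = pca.fit_transform(standardized_data)

# Same projection up to a sign flip per component
print(np.allclose(np.abs(manual_reduced), np.abs(sklearn_reduced)))

# Same share of variance per component
print(np.allclose(pca.explained_variance_ratio_, eigenvalues[:k] / eigenvalues.sum()))
```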


# What are 2 positive effects and 2 negative effects of PCA?

## Positive Effects of PCA

1. **Dimensionality Reduction**: PCA reduces the number of variables in a dataset while retaining most of the information or variability. This simplification can make subsequent analyses more efficient and less resource-intensive, particularly in the context of large datasets.

2. **Noise Reduction**: By keeping only the principal components that capture the most variance and discarding the rest, PCA can effectively filter out noise from the data. This can lead to more accurate models since they are trained on the cleaner, more relevant features.
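
The information given up by discarding components can be measured directly. The sketch below is an illustration under stated assumptions (standardized data, k = 2, illustrative variable names): it projects the data onto the top k components, maps it back to the original feature space, and checks that the variance lost in the round trip equals the sum of the discarded eigenvalues.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Standardize, then decompose the covariance matrix of the standardized data
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]          # descending by eigenvalue
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
top_k = eigenvectors[:, :k]

# Reduce to k dimensions, then map back to the original 5-dimensional space
reduced = standardized @ top_k
reconstructed = reduced @ top_k.T

# Variance lost in the round trip, using the same n - 1 divisor as np.cov
n_samples = data.shape[0]
lost_variance = np.sum((standardized - reconstructed) ** 2) / (n_samples - 1)

print("lost variance:          ", lost_variance)
print("sum of dropped eigvals: ", eigenvalues[k:].sum())
```

The two printed numbers should agree, which is the precise sense in which keeping the largest eigenvalues retains most of the variability.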

## Negative Effects of PCA

1. **Loss of Interpretability**: One major drawback of PCA is that the principal components are linear combinations of the original variables and may not have a clear or intuitive interpretation. This makes it difficult to understand the underlying factors driving the patterns in the data.

2. **Assumption of Linearity**: PCA builds its principal components only as linear combinations of the original features, so it may not capture complex, non-linear relationships between variables. This can limit its effectiveness in analyzing data where non-linear interactions are significant.
