<a href="https://colab.research.google.com/github/Sadickachuli/ML_PCA/blob/main/Formative_Assignment_PCA_%5BSadick_Achuli%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while retaining as much information as possible. It does this by identifying how the existing features relate to one another. The crux of the algorithm is to find new axes, called principal components, that are linear combinations of the existing features, and then to quantify how relevant each principal component is. The principal components are used to transform the high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [25]:
# Import the necessary package
import numpy as np

In [26]:

data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

PCA is driven by the variance of the features. Features measured on larger scales have larger variances and can dominate the principal components regardless of how informative they are, so it is important to put the data on the same scale, such that each feature has a mean of 0 and a standard deviation of 1. This ensures that every feature contributes equally to the analysis, which lets PCA find the directions of highest variance without favoring any one feature's units. The result is a more balanced analysis and PCA results that are easier to interpret.


In [27]:
# Using the formula above
mean = np.mean(data, axis=0)
standard_dev = np.std(data, axis=0)
standardized_data = (data - mean) / standard_dev
print(standardized_data)

# An easier way would be to use StandardScaler from sklearn.preprocessing:

# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# standardized_data = scaler.fit_transform(data)
# print(standardized_data)

[[-1.36438208  0.70710678  1.5109662  -0.99186978  0.77802924]
 [ 0.12403473 -1.94454365 -0.13736056  0.77145428 -2.06841919]
 [-0.62017367  0.1767767   0.68680282 -0.99186978  0.20873955]
 [ 1.61245155  0.1767767  -1.78568733  0.33062326  0.20873955]
 [-0.62017367  1.23743687 -0.13736056 -0.77145428  1.00574511]
 [ 0.86824314 -0.35355339 -0.13736056  1.65311631 -0.13283426]]
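
As a quick sanity check on the standardization step, the sketch below rebuilds the same array and confirms that every column ends up with mean 0 and standard deviation 1. It also prints the raw per-feature variances to illustrate why scaling matters here. The names `raw_variances`, `col_means`, and `col_stds` are introduced only for this illustration.

```python
import numpy as np

# Same toy dataset as in the notebook: 6 samples (rows), 5 features (columns).
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])

# Variance of each raw feature; a feature with a much larger variance
# would dominate an unscaled PCA.
raw_variances = data.var(axis=0)
print("raw variances:", raw_variances)

# Population standardization (np.std defaults to ddof=0), as in the cell above.
standardized_data = (data - data.mean(axis=0)) / data.std(axis=0)

col_means = standardized_data.mean(axis=0)
col_stds = standardized_data.std(axis=0)
print("all means close to 0:", np.allclose(col_means, 0))
print("all stds close to 1:", np.allclose(col_stds, 1))
```

In this dataset the fifth feature has by far the largest raw variance, so without standardization it would pull the first principal component toward itself.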


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [28]:
cov_matrix = np.cov(standardized_data, rowvar=False)

print(cov_matrix)


[[ 1.2        -0.42098785 -1.0835838   0.90219291 -0.37000528]
 [-0.42098785  1.2         0.20397003 -0.77149364  1.18751836]
 [-1.0835838   0.20397003  1.2        -0.59947269  0.22208218]
 [ 0.90219291 -0.77149364 -0.59947269  1.2        -0.70017993]
 [-0.37000528  1.18751836  0.22208218 -0.70017993  1.2       ]]
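
The diagonal of the covariance matrix above is 1.2 rather than 1. A likely explanation, sketched below, is that `np.std` divides by n by default (`ddof=0`), while `np.cov` divides by n - 1, so each standardized feature has sample variance n / (n - 1) = 6 / 5. The sketch also shows that `np.cov` on mean-centered data matches the matrix product X^T X / (n - 1). The names `manual_cov`, `sample_standardized`, and `unit_diag_cov` are illustrative only.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
n_samples = data.shape[0]

# Population standardization, as in Step 1 (ddof=0).
standardized_data = (data - data.mean(axis=0)) / data.std(axis=0)

# The columns already have mean 0, so the covariance is X^T X / (n - 1).
manual_cov = standardized_data.T @ standardized_data / (n_samples - 1)
numpy_cov = np.cov(standardized_data, rowvar=False)
print(np.allclose(manual_cov, numpy_cov))

# Diagonal entries equal n / (n - 1) because the two steps use different divisors.
print(np.diag(numpy_cov))

# Standardizing with the sample standard deviation (ddof=1) gives a unit
# diagonal, and the result is the correlation matrix of the raw data.
sample_standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
unit_diag_cov = np.cov(sample_standardized, rowvar=False)
print(np.allclose(unit_diag_cov, np.corrcoef(data, rowvar=False)))
```

The constant factor only rescales the eigenvalues; it does not change the eigenvectors or the explained-variance percentages.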


### Step 3: Eigendecomposition on the Covariance Matrix


In [29]:


eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print("Eigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)


Eigenvalues:
 [3.80985761e+00 1.73655615e+00 4.94531029e-02 4.74189469e-05
 4.04085720e-01]

Eigenvectors:
 [[-0.4640131   0.45182808 -0.70733581  0.28128049 -0.03317471]
 [ 0.45019005  0.48800851  0.29051532  0.6706731  -0.15803498]
 [ 0.37929082 -0.55665017 -0.48462321  0.24186072 -0.5029143 ]
 [-0.4976889   0.03162214  0.36999674 -0.03373724 -0.78311558]
 [ 0.43642295  0.49682965 -0.20861365 -0.64143906 -0.32822489]]
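
Because a covariance matrix is symmetric, `np.linalg.eigh` is an alternative to `np.linalg.eig` that may be worth considering: it returns real eigenvalues in ascending order and orthonormal eigenvectors as columns. The sketch below is not the method used in this notebook; it only checks the defining relation C v = λ v and the fact that the eigenvalues sum to the trace of the matrix. Eigenvector signs are arbitrary, so they may differ from the output above.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
standardized_data = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized_data, rowvar=False)

# eigh is designed for symmetric matrices; eigenvalues come back ascending.
vals, vecs = np.linalg.eigh(cov_matrix)

# Each column v of vecs satisfies cov_matrix @ v == eigenvalue * v.
print(np.allclose(cov_matrix @ vecs, vecs * vals))

# The eigenvectors are orthonormal.
print(np.allclose(vecs.T @ vecs, np.eye(vecs.shape[1])))

# The eigenvalues add up to the total variance (the trace).
print(np.isclose(vals.sum(), np.trace(cov_matrix)))
```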


### Step 4: Sort the Principal Components

In [30]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance]
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 4 2 3]


 sorted eigen values:
[3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]


 The sorted eigen vector matrix is: 
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]


Question:

1. Why do we order eigenvalues and eigenvectors?

We rank the eigenvalues and eigenvectors to find the principal components that capture the most variation in the data. Arranging the eigenvalues in descending order identifies the most important components, and reordering the eigenvectors to match the sorted eigenvalues gives the corresponding directions of greatest variance. This guarantees that the principal components are arranged by importance.


2. Is it true that we would consider the lowest eigenvalues rather than the highest? Defend your answer.

No. We consider the highest eigenvalues rather than the lowest ones. Principal components with higher eigenvalues capture more variance in the data and are therefore more significant for dimensionality reduction. Components with lower eigenvalues capture less variance, so they tend to represent less important patterns and noise.



To see what percentage of information each eigenvalue holds, print the percentage for each eigenvalue using the formula:



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [31]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues) * 100
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
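
One common way to turn these percentages into a choice of k is to look at the cumulative explained variance and keep the smallest number of components that crosses a chosen threshold. The sketch below uses the sorted eigenvalues printed earlier; the 90% `threshold` is an assumption made for illustration, not a value taken from this notebook.

```python
import numpy as np

# Sorted eigenvalues copied from the Step 4 output above.
sorted_eigenvalues = np.array([
    3.80985761e+00, 1.73655615e+00, 4.04085720e-01,
    4.94531029e-02, 4.74189469e-05,
])

# Fraction of total variance per component, then the running total.
explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

# Hypothetical threshold: keep enough components to explain at least 90%.
threshold = 0.90

# searchsorted returns the first index whose cumulative ratio reaches the
# threshold; add 1 to convert the index into a component count.
k = int(np.searchsorted(cumulative_ratio, threshold) + 1)
print("cumulative ratios:", np.round(cumulative_ratio, 4))
print("components kept:", k)
```

With this threshold the first two components are kept, since together they account for about 92.4% of the variance.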


Initialize the number of principal components k, then perform matrix multiplication with the first k sorted eigenvectors. For example, k = 3 keeps 3 principal components.




> The resulting matrix (with reduced data) = standardized data × the first k columns of the sorted eigenvector matrix

See expected output for k = 2



In [32]:
k = 2
reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])

In [33]:
print(reduced_data)

[[ 2.3577116  -0.75728867]
 [-2.27171739 -1.81970663]
 [ 1.21259114 -0.50390931]
 [-1.41935914  1.9229856 ]
 [ 1.61562536  0.87541857]
 [-1.49485157  0.28250044]]


In [34]:
print(reduced_data.shape)

(6, 2)
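
To get a feel for how much information the projection discards, the reduced data can be mapped back into the original 5-dimensional space and compared with the standardized data. The sketch below is a minimal illustration that uses `np.linalg.eigh` instead of `np.linalg.eig`, and the names `top_k`, `reconstructed`, `squared_error`, and `discarded_variance` are introduced only here. For this setup, the total squared reconstruction error equals (n - 1) times the sum of the discarded eigenvalues.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
])
n_samples = data.shape[0]
standardized_data = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(standardized_data, rowvar=False)

# Eigendecomposition, sorted from largest to smallest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

k = 2
top_k = eigenvectors[:, :k]              # shape (5, k)
reduced_data = standardized_data @ top_k  # shape (6, k)

# Map the reduced data back to the original feature space.
reconstructed = reduced_data @ top_k.T    # shape (6, 5)

# Total squared error versus the variance held by the dropped components.
squared_error = np.sum((standardized_data - reconstructed) ** 2)
discarded_variance = (n_samples - 1) * eigenvalues[k:].sum()
print(np.isclose(squared_error, discarded_variance))
```

The smaller the discarded eigenvalues, the closer the reconstruction is to the standardized data.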


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

Two Benefits of PCA:
1. Visualization: The most obvious reason to use PCA is to reduce the number of dimensions in the dataset while retaining most of the variance. This enables the visualization of high-dimensional data, which helps identify outliers, patterns, and clusters that are not obvious in the original feature space.

2. Noise Reduction: As noted in the answer about the lowest versus the highest eigenvalues, data contains noise. By keeping only the principal components that capture the most variation, PCA can reduce noise by discarding the low-variance components, which are more likely to be noise.


Two Limitations of PCA:
1. Loss of some valuable information: One downside of PCA is that, in discarding low-variance components to reduce noise, it also discards part of the data. For some tasks, the discarded variance carries useful information, and removing it can reduce predictive accuracy.

2. Loss of Interpretability: The principal components produced by PCA are linear combinations of the original features. This reduces the model's interpretability because it is difficult to understand the principal components in terms of the original features.
