<a href="https://colab.research.google.com/github/Geu-Pro2023/Principle-Component-Analysis_PCA_Assignment/blob/master/Formative_Assignment_PCA_%5BGeu_Aguto_Garang_Bior%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while retaining as much information as possible, by identifying how the existing features relate to one another. The crux of the algorithm is finding new axes, called principal components, that are linear combinations of the existing features, and then quantifying how relevant each component is. The principal components are used to transform high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [21]:
# Import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [22]:
data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

**Handling data on the same scale in PCA is crucial because PCA identifies directions of maximum variance. If features have different scales, the larger-scaled features will dominate the variance, leading to biased results. Standardizing ensures all features contribute equally, preventing one feature from disproportionately influencing the principal components, and allows PCA to capture the true structure of the data.**

In [23]:
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)

print("Standardized Data:\n", standardized_data)

Standardized Data:
 [[-1.36438208  0.70710678  1.5109662  -0.99186978  0.77802924]
 [ 0.12403473 -1.94454365 -0.13736056  0.77145428 -2.06841919]
 [-0.62017367  0.1767767   0.68680282 -0.99186978  0.20873955]
 [ 1.61245155  0.1767767  -1.78568733  0.33062326  0.20873955]
 [-0.62017367  1.23743687 -0.13736056 -0.77145428  1.00574511]
 [ 0.86824314 -0.35355339 -0.13736056  1.65311631 -0.13283426]]
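
The standardization step can also be reproduced by hand, which makes the formula explicit. The following is a minimal sketch using only NumPy; the variable names `mean`, `std`, and `standardized_manual` are illustrative and not part of the original notebook. It relies on the fact that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np

# Toy matrix from the notebook: 6 samples (rows) x 5 features (columns)
data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Column-wise mean and population standard deviation (ddof=0)
mean = data.mean(axis=0)
std = data.std(axis=0)

# z-score: subtract each feature's mean, then divide by its standard deviation
standardized_manual = (data - mean) / std

print(standardized_manual.round(4))
```

After this transformation every column has mean 0 and standard deviation 1, so no single feature dominates the variance simply because of its units.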


![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [24]:
cov_matrix = np.cov(standardized_data.T)
print("Covariance Matrix:\n", cov_matrix)

Covariance Matrix:
 [[ 1.2        -0.42098785 -1.0835838   0.90219291 -0.37000528]
 [-0.42098785  1.2         0.20397003 -0.77149364  1.18751836]
 [-1.0835838   0.20397003  1.2        -0.59947269  0.22208218]
 [ 0.90219291 -0.77149364 -0.59947269  1.2        -0.70017993]
 [-0.37000528  1.18751836  0.22208218 -0.70017993  1.2       ]]
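
As a cross-check on the covariance step, the matrix can be computed directly from its definition. This is a sketch under the assumption that the data were standardized with the population standard deviation, as `StandardScaler` does; the names `Z`, `cov_manual`, and `cov_numpy` are illustrative.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# Same scaling as StandardScaler: population standard deviation (ddof=0)
Z = (data - data.mean(axis=0)) / data.std(axis=0)
n = Z.shape[0]

# The columns of Z are already centered, so the sample covariance is Z^T Z / (n - 1)
cov_manual = Z.T @ Z / (n - 1)

# np.cov expects variables in rows, hence the transpose
cov_numpy = np.cov(Z.T)

print(np.allclose(cov_manual, cov_numpy))
print(np.diag(cov_manual))
```

This also explains why the diagonal is 1.2 rather than 1: the scaler divides by n when computing the variance, while `np.cov` divides by n - 1 by default, so each diagonal entry equals n / (n - 1) = 6 / 5.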


### Step 3: Eigendecomposition on the Covariance Matrix


In [25]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues:\n", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

Eigenvalues:
 [3.80985761e+00 1.73655615e+00 4.94531029e-02 4.74189469e-05
 4.04085720e-01]
Eigenvectors:
 [[-0.4640131   0.45182808 -0.70733581  0.28128049 -0.03317471]
 [ 0.45019005  0.48800851  0.29051532  0.6706731  -0.15803498]
 [ 0.37929082 -0.55665017 -0.48462321  0.24186072 -0.5029143 ]
 [-0.4976889   0.03162214  0.36999674 -0.03373724 -0.78311558]
 [ 0.43642295  0.49682965 -0.20861365 -0.64143906 -0.32822489]]
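
The defining property of the decomposition, C v = λ v, can be verified numerically. The sketch below uses `np.linalg.eigh` instead of the notebook's `np.linalg.eig`; this is a substitution, not the author's method. `eigh` is intended for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

Z = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(Z.T)

# eigh exploits symmetry; eigenvalues come back sorted from smallest to largest
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Each column v of `eigenvectors` should satisfy C v = lambda v
checks = [
    np.allclose(cov_matrix @ v, lam * v)
    for lam, v in zip(eigenvalues, eigenvectors.T)
]
print(all(checks))

# The eigenvalues sum to the trace of the covariance matrix (the total variance)
print(eigenvalues.sum(), np.trace(cov_matrix))
```

Because the eigenvalues add up to the total variance, dividing each one by their sum gives the fraction of variance that its component explains.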


### Step 4: Sort the Principal Components
Note: np.argsort sorts from lowest to highest only; use [::-1] to reverse the order.

In [26]:
# Sort the eigenvalues and eigenvectors
sorted_indices = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[sorted_indices]
sorted_eigenvectors = eigenvectors[:, sorted_indices]

print("Sorted Eigenvalues:\n", sorted_eigenvalues)
print("Sorted Eigenvectors:\n", sorted_eigenvectors)

Sorted Eigenvalues:
 [3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]
Sorted Eigenvectors:
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]


Question:

1. Why do we order eigenvalues and eigenvectors?

**We order eigenvalues and eigenvectors to identify the principal components that explain the most variance in the dataset. Higher eigenvalues indicate greater variance captured by their corresponding eigenvectors, allowing us to focus on the most informative components for analysis and dimensionality reduction.**

2. Is it true that we would prefer the lowest eigenvalue over the highest? Defend your answer.

**No, we typically do not consider the lowest eigenvalues because they represent principal components that explain minimal variance. Instead, we focus on the highest eigenvalues, as they provide the most significant insights into the data structure and patterns.**

To see what percentage of the information (variance) each eigenvalue holds, print each eigenvalue's share using the formula



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [27]:
# Use sorted_eigenvalues so each percentage lines up with its eigenvector

# Share of the total variance explained by each eigenvalue, in percent
explained_variance = (sorted_eigenvalues / np.sum(sorted_eigenvalues)) * 100
# Format the explained variance as percentage strings with two decimal places
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]

print(explained_variance)

['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
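
A common follow-up is to accumulate these percentages and pick the smallest number of components that reaches a target. The sketch below starts from the sorted eigenvalues printed earlier; the 90% `threshold` is a hypothetical target chosen for illustration and does not come from the assignment.

```python
import numpy as np

# Sorted eigenvalues as printed in the notebook output above
sorted_eigenvalues = np.array([
    3.80985761e+00, 1.73655615e+00, 4.04085720e-01,
    4.94531029e-02, 4.74189469e-05,
])

# Fraction of total variance per component, then the running total
explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

threshold = 0.90  # hypothetical target, not taken from the notebook

# Position of the first component whose running total reaches the threshold, plus one
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative * 100, 2))
print("Smallest k reaching the threshold:", k)
```

With these eigenvalues the first two components together account for about 92% of the variance, which is consistent with the choice of k = 2 in the next step.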


### Step 5: Initialize the number of principal components k, then perform the matrix multiplication (for example, k = 3 keeps 3 principal components)




> The resulting matrix (with reduced data) = standardized data * matrix of the first k eigenvector columns

See expected output for k = 2



In [28]:
k = 2
reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])

In [29]:
print(reduced_data)

[[ 2.3577116  -0.75728867]
 [-2.27171739 -1.81970663]
 [ 1.21259114 -0.50390931]
 [-1.41935914  1.9229856 ]
 [ 1.61562536  0.87541857]
 [-1.49485157  0.28250044]]


In [30]:
print(reduced_data.shape)

(6, 2)
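
One way to gain confidence in the projection is to obtain it by a second route. The sketch below compares the eigenvector projection with a singular value decomposition of the standardized data; it uses `np.linalg.eigh` and `np.linalg.svd`, and the names `reduced_eig` and `reduced_svd` are illustrative. Eigenvectors are only defined up to sign, so the comparison is made on absolute values.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

Z = (data - data.mean(axis=0)) / data.std(axis=0)
n = Z.shape[0]
k = 2

# Route 1: eigendecomposition of the covariance matrix, sorted largest first
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z.T))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
reduced_eig = Z @ eigenvectors[:, :k]

# Route 2: SVD of the standardized data, Z = U S Vt, so Z V = U S
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
reduced_svd = U[:, :k] * S[:k]

# Component signs are arbitrary, so compare magnitudes only
same_projection = np.allclose(np.abs(reduced_eig), np.abs(reduced_svd))
# Squared singular values divided by (n - 1) reproduce the eigenvalues
same_spectrum = np.allclose(S**2 / (n - 1), eigenvalues)

print(same_projection, same_spectrum)
```

A sign flip in a column of the reduced data does not change the analysis, since it only mirrors that axis.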


# **Principal Component Analysis on Fuel-Econ Data**

In this notebook, I first explored the initial given matrix to understand the principles of Principal Component Analysis **(PCA)**. After successfully implementing the PCA steps on the provided data, I applied the same methodology to the fuel-econ.csv dataset. This approach allowed me to ensure consistency in the analysis and to reinforce my understanding of the PCA process through practical application on both datasets.

Below is the analysis and results for the **fuel-econ.csv** dataset.

In [31]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [32]:
# Load the fuel-econ data
data = pd.read_csv('/content/drive/MyDrive/Machine Learning/fuel_econ.csv')

# Display the first 10 rows of the dataset
data.head(10)

Unnamed: 0,id,make,model,year,VClass,drive,trans,fuelType,cylinders,displ,pv2,pv4,city,UCity,highway,UHighway,comb,co2,feScore,ghgScore
0,32204,Nissan,GT-R,2013,Subcompact Cars,All-Wheel Drive,Automatic (AM6),Premium Gasoline,6,3.8,79,0,16.4596,20.2988,22.5568,30.1798,18.7389,471,4,4
1,32205,Volkswagen,CC,2013,Compact Cars,Front-Wheel Drive,Automatic (AM-S6),Premium Gasoline,4,2.0,94,0,21.8706,26.977,31.0367,42.4936,25.2227,349,6,6
2,32206,Volkswagen,CC,2013,Compact Cars,Front-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.6,94,0,17.4935,21.2,26.5716,35.1,20.6716,429,5,5
3,32207,Volkswagen,CC 4motion,2013,Compact Cars,All-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.6,94,0,16.9415,20.5,25.219,33.5,19.8774,446,5,5
4,32208,Chevrolet,Malibu eAssist,2013,Midsize Cars,Front-Wheel Drive,Automatic (S6),Regular Gasoline,4,2.4,0,95,24.7726,31.9796,35.534,51.8816,28.6813,310,8,8
5,32209,Lexus,GS 350,2013,Midsize Cars,Rear-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.5,0,99,19.4325,24.1499,28.2234,38.5,22.6002,393,6,6
6,32210,Lexus,GS 350 AWD,2013,Midsize Cars,All-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.5,0,99,18.5752,23.5261,26.3573,36.2109,21.4213,412,5,5
7,32214,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Automatic 8-spd,Premium Gasoline,4,2.0,89,0,17.446,21.7946,26.6295,37.6731,20.6507,432,5,5
8,32215,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Manual 6-spd,Premium Gasoline,4,2.0,89,0,20.6741,26.2,29.2741,41.8,23.8235,375,6,6
9,32216,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Automatic 8-spd,Premium Gasoline,6,3.8,89,0,16.4675,20.4839,24.5605,34.4972,19.3344,461,4,4


**Step 1: Standardize the Data along the Features of fuel-econ dataset**

In the context of the fuel-econ dataset, standardization ensures that all numerical features (like fuel economy metrics) are on the same scale. This is important because it prevents features with larger values from dominating the analysis, allowing for a fair comparison of all features.

In [33]:
# Handle categorical data using one-hot encoding
data_encoded = pd.get_dummies(data, drop_first=True)

# Step 1: Standardize the Data along the Features
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data_encoded)

# Display the standardized data (first 5 rows)
print("Standardized Data (First 5 Rows):")
print(standardized_data[:5])

Standardized Data (First 5 Rows):
[[-1.73714048 -1.47583548  0.28310163 ... -0.02764302  0.84070013
  -0.79476067]
 [-1.73668367 -1.47583548 -0.78181585 ... -0.02764302  0.84070013
  -0.79476067]
 [-1.73622685 -1.47583548  0.28310163 ... -0.02764302  0.84070013
  -0.79476067]
 [-1.73577004 -1.47583548  0.28310163 ... -0.02764302  0.84070013
  -0.79476067]
 [-1.73531322 -1.47583548 -0.78181585 ... -0.02764302 -1.18948476
   1.25824043]]
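
One-hot encoding the whole frame also passes the row counter (`Unnamed: 0`) and `id` to the scaler, and creates one column per category level of text columns such as `make` and `model`. A possible alternative, offered as a sketch rather than the notebook's method, is to keep only the numeric measurement columns. The sample below copies three rows and a subset of columns from the `data.head(10)` output; treating `Unnamed: 0` and `id` as identifiers is an assumption.

```python
import pandas as pd

# Three rows copied from the data.head(10) output; only some columns are kept for brevity
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 4],
    "id": [32204, 32205, 32208],
    "make": ["Nissan", "Volkswagen", "Chevrolet"],
    "cylinders": [6, 4, 4],
    "displ": [3.8, 2.0, 2.4],
    "city": [16.4596, 21.8706, 24.7726],
    "highway": [22.5568, 31.0367, 35.534],
    "co2": [471, 349, 310],
})

# Assumption: the row counter and the record ID are identifiers, not measurements
id_like = ["Unnamed: 0", "id"]

# Keep numeric columns only, then drop the identifier columns
numeric_features = sample.select_dtypes(include="number").drop(columns=id_like)

print(list(numeric_features.columns))
```

Whether to include encoded categorical columns depends on the question being asked; restricting PCA to numeric measurements keeps the components easier to interpret.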


**Step 2: Calculate the Covariance Matrix of the fuel-econ dataset**

For the fuel-econ dataset, the covariance matrix shows how the features (e.g., engine displacement, cylinder count, and city and highway fuel economy) vary together, which is essential for understanding the data's structure and how its variance is distributed.

In [34]:
# Step 2: Calculate the Covariance Matrix
cov_matrix = np.cov(standardized_data.T)

# Display the covariance matrix
print("\nCovariance Matrix:")
print(cov_matrix)


Covariance Matrix:
[[ 1.00025458  0.98591866 -0.06011148 ... -0.0164336   0.03205579
  -0.02254824]
 [ 0.98591866  1.00025458 -0.05532701 ... -0.02448998  0.03933023
  -0.0283311 ]
 [-0.06011148 -0.05532701  1.00025458 ... -0.02161725  0.40480546
  -0.4016625 ]
 ...
 [-0.0164336  -0.02448998 -0.02161725 ...  1.00025458 -0.03288932
  -0.02197518]
 [ 0.03205579  0.03933023  0.40480546 ... -0.03288932  1.00025458
  -0.94559637]
 [-0.02254824 -0.0283311  -0.4016625  ... -0.02197518 -0.94559637
   1.00025458]]


**Step 3: Eigendecomposition on the Covariance Matrix (fuel-econ dataset)**

Performing eigendecomposition on the covariance matrix of the fuel-econ dataset provides eigenvalues and eigenvectors. The eigenvalues indicate how much variance each principal component captures from the data, while the eigenvectors provide the directions of these components in the context of the original feature space.

In [35]:
# Step 3: Eigendecomposition of the fuel-econ covariance matrix
# (recomputed here so the values from the earlier 6 x 5 example are not reused)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Display the eigenvalues and eigenvectors
print("\nEigenvalues:")
print(eigenvalues[:5])  # First 5 eigenvalues
print("\nEigenvectors (First 5 Rows):")
print(eigenvectors[:5])  # First 5 rows of eigenvectors


**Step 4: Sort the Principal Components (fuel-econ dataset)**

In the fuel-econ dataset, sorting the eigenvalues and their corresponding eigenvectors helps identify the most important principal components. This sorting highlights which components capture the most variance in the data, guiding the selection of components for further analysis.

In [36]:
# Step 4: Sort the Principal Components
# argsort is ascending, so reverse it to put the largest eigenvalue first
sorted_indices = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[sorted_indices]
sorted_eigenvectors = eigenvectors[:, sorted_indices]

# Display sorted eigenvalues and eigenvectors
print("\nSorted Eigenvalues (First 5):")
print(sorted_eigenvalues[:5])
print("\nSorted Eigenvectors (First 5 Rows):")
print(sorted_eigenvectors[:5])


**Step 5: Initialize the number of Principal components and perform matrix multiplication**

By setting k = 2 for the fuel-econ dataset, we perform matrix multiplication to project the standardized data onto the top two principal components. This reduction simplifies the dataset while retaining essential information about fuel economy characteristics, enabling easier visualization and analysis.

In [45]:
k = 2  # Example for 2 principal components

# Project onto the top-k eigenvectors; fail loudly if the shapes do not match,
# otherwise a reduced_data left over from an earlier cell would be reused silently
if standardized_data.shape[1] != sorted_eigenvectors.shape[0]:
    raise ValueError("Feature count does not match the eigenvector dimension")
reduced_data = np.matmul(standardized_data, sorted_eigenvectors[:, :k])

In [46]:
print("Reduced Data (First 5 Rows):\n", reduced_data[:5])

In [43]:
print("\nShape of Reduced Data:", reduced_data.shape)


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.


**Benefits**

*  Data Simplification: PCA helps condense a large dataset into a smaller set of components that still capture essential information. This can make it easier to analyze and visualize data trends, particularly in complex datasets with many features.

*  Enhanced Model Performance: By reducing dimensionality, PCA can decrease computation time and improve the efficiency of machine learning models. It can also improve model performance by discarding low-variance directions, which may reduce overfitting, although this is not guaranteed.


**Limitations**

*  Reduced Feature Interpretability: The new components generated by PCA do not have direct physical meanings tied to the original features. This can make it difficult for analysts to understand the significance of the principal components in relation to the original data.

*  Linear Relationships Assumption: PCA is primarily designed to work with linear relationships. If the data has complex, non-linear relationships, PCA may not effectively capture the underlying structure, potentially leading to less accurate results in certain scenarios.
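
The linearity limitation can be illustrated with a small synthetic example, which is not part of the assignment data. Points on a circle are fully described by one number (the angle), yet PCA finds no direction that captures more variance than any other, so it cannot reduce the data to one dimension without losing half of the variance.

```python
import numpy as np

# Points evenly spaced on a unit circle: one underlying degree of freedom (the angle)
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])

# Eigenvalues of the 2 x 2 covariance matrix (eigvalsh is meant for symmetric matrices)
eigenvalues = np.linalg.eigvalsh(np.cov(circle.T))
explained_ratio = eigenvalues / eigenvalues.sum()

print(explained_ratio)
```

Both components explain the same share of the variance, so the linear projection offers no useful compression here. Nonlinear methods are usually considered for data of this kind.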