

<center>
    <img src="https://miro.medium.com/v2/resize:fit:300/1*mgncZaKaVx9U6OCQu_m8Bg.jpeg">
</center>



The goal of PCA is to reduce the number of features in a dataset while preserving as much information as possible, by identifying how the existing features relate to one another. The crux of the algorithm is to find new axes, called principal components, that are linear combinations of the existing features, and then to quantify how relevant each principal component is. The principal components are used to transform the high-dimensional data into lower-dimensional data while preserving as much information as possible. For a principal component to be relevant, it needs to capture a large share of the variance in the features. We can determine the relationships between features using covariance.

In [83]:
# import the necessary packages
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [84]:
data = np.array([
    [   1,   2,  -1,   4,  10],
    [   3,  -3,  -3,  12, -15],
    [   2,   1,  -2,   4,   5],
    [   5,   1,  -5,  10,   5],
    [   2,   3,  -3,   5,  12],
    [   4,   0,  -3,  16,   2],
])

### Step 1: Standardize the Data along the Features

![image.png](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQLxe5VYCBsaZddkkTZlCY24Yov4JJD4-ArTA&usqp=CAU)




Explain why we need to handle the data on the same scale.

- **Answer:** Standardization ensures that each feature contributes equally to the principal components, preventing features with relatively larger magnitudes from dominating the analysis, which can lead to biased results.
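
As a sanity check on what standardization does, the z-score can be sketched by hand with NumPy. This is a minimal illustration, assuming the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses; the variable names are only for this example.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# z-score per feature (column): subtract the column mean, divide by the column std
column_mean = data.mean(axis=0)
column_std = data.std(axis=0)  # ddof=0 by default (population std)
standardized = (data - column_mean) / column_std

# after standardization every column has mean ~0 and std 1
print(standardized.mean(axis=0).round(10))
print(standardized.std(axis=0))
```

After this step every feature is on the same scale, so no single column dominates the covariance matrix purely because of its units.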

In [85]:
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
standardized_data

array([[-1.36438208,  0.70710678,  1.5109662 , -0.99186978,  0.77802924],
       [ 0.12403473, -1.94454365, -0.13736056,  0.77145428, -2.06841919],
       [-0.62017367,  0.1767767 ,  0.68680282, -0.99186978,  0.20873955],
       [ 1.61245155,  0.1767767 , -1.78568733,  0.33062326,  0.20873955],
       [-0.62017367,  1.23743687, -0.13736056, -0.77145428,  1.00574511],
       [ 0.86824314, -0.35355339, -0.13736056,  1.65311631, -0.13283426]])

![cov matrix.webp](https://dmitry.ai/uploads/default/original/1X/9bd2851674ebb55e404cc3ff5e2ffe65b42ff460.png)

We use the pairwise covariance of the different features to determine how they relate to each other. With these covariances, our goal is to group features that show similar patterns. Intuitively, two features are related if they have similar covariances with the other features.

### Step 2: Calculate the Covariance Matrix



In [86]:
cov_matrix = np.cov(standardized_data, rowvar=False)

print(cov_matrix)

[[ 1.2        -0.42098785 -1.0835838   0.90219291 -0.37000528]
 [-0.42098785  1.2         0.20397003 -0.77149364  1.18751836]
 [-1.0835838   0.20397003  1.2        -0.59947269  0.22208218]
 [ 0.90219291 -0.77149364 -0.59947269  1.2        -0.70017993]
 [-0.37000528  1.18751836  0.22208218 -0.70017993  1.2       ]]
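
The diagonal entries above are 1.2 rather than 1 because `np.cov` divides by n - 1 by default, while the data was standardized with a standard deviation that divides by n; with n = 6 rows the ratio is 6/5 = 1.2. The sketch below illustrates this, assuming the standardized data is recomputed by hand instead of with `StandardScaler`, and shows that passing `ddof=0` yields the correlation matrix of the original data.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)

# standardize with the population std (ddof=0)
z = (data - data.mean(axis=0)) / data.std(axis=0)
n = z.shape[0]

cov_sample = np.cov(z, rowvar=False)              # divides by n - 1 (default)
cov_population = np.cov(z, rowvar=False, ddof=0)  # divides by n
cov_manual = z.T @ z / (n - 1)                    # same as cov_sample, since z is centered

print(np.diag(cov_sample))      # every entry is n / (n - 1)
print(np.diag(cov_population))  # every entry is 1
```

The two versions differ only by a constant factor, so they share the same eigenvectors and the same explained-variance percentages.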


### Step 3: Eigendecomposition on the Covariance Matrix





In [87]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(eigenvalues)
print(eigenvectors)

[3.80985761e+00 1.73655615e+00 4.94531029e-02 4.74189469e-05
 4.04085720e-01]
[[-0.4640131   0.45182808 -0.70733581  0.28128049 -0.03317471]
 [ 0.45019005  0.48800851  0.29051532  0.6706731  -0.15803498]
 [ 0.37929082 -0.55665017 -0.48462321  0.24186072 -0.5029143 ]
 [-0.4976889   0.03162214  0.36999674 -0.03373724 -0.78311558]
 [ 0.43642295  0.49682965 -0.20861365 -0.64143906 -0.32822489]]
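
`np.linalg.eig` returns the eigenvectors as columns, so column `i` pairs with `eigenvalues[i]`. A quick way to check a decomposition is to confirm that each pair satisfies `cov_matrix @ v = eigenvalue * v`. The sketch below does that on the same data, recomputing the standardized values by hand; it also shows `np.linalg.eigh` as an alternative for symmetric matrices, which is a suggestion rather than what the notebook uses.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
z = (data - data.mean(axis=0)) / data.std(axis=0)
cov_matrix = np.cov(z, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# broadcasting multiplies column i of eigenvectors by eigenvalues[i]
residual = cov_matrix @ eigenvectors - eigenvectors * eigenvalues
print(np.abs(residual).max())  # close to zero

# the eigenvalues add up to the trace (total variance) of the covariance matrix
print(eigenvalues.sum(), np.trace(cov_matrix))

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigenvalues_h, eigenvectors_h = np.linalg.eigh(cov_matrix)
print(eigenvalues_h)
```

Because `eigh` always returns ascending eigenvalues, reversing its output gives the largest-first ordering directly, whereas `eig` makes no ordering guarantee.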


### Step 4: Sort the Principal Components

In [88]:
# np.argsort can only provide lowest to highest; use [::-1] to reverse the list

order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance]  # sort the columns
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 4 2 3]


 sorted eigen values:
[3.80985761e+00 1.73655615e+00 4.04085720e-01 4.94531029e-02
 4.74189469e-05]


 The sorted eigen vector matrix is: 
 [[-0.4640131   0.45182808 -0.03317471 -0.70733581  0.28128049]
 [ 0.45019005  0.48800851 -0.15803498  0.29051532  0.6706731 ]
 [ 0.37929082 -0.55665017 -0.5029143  -0.48462321  0.24186072]
 [-0.4976889   0.03162214 -0.78311558  0.36999674 -0.03373724]
 [ 0.43642295  0.49682965 -0.32822489 -0.20861365 -0.64143906]]
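
To make the reordering concrete, here is a small sketch using the eigenvalues printed in Step 3. The stand-in matrix `labelled_columns` is hypothetical and only there to show which column ends up where; it is not the real eigenvector matrix.

```python
import numpy as np

# eigenvalues in the order np.linalg.eig returned them in Step 3
eigenvalues = np.array([3.80985761e+00, 1.73655615e+00, 4.94531029e-02,
                        4.74189469e-05, 4.04085720e-01])

# argsort gives ascending indices; reversing them gives largest-first
order = np.argsort(eigenvalues)[::-1]
print(order)

# an equivalent way to get a descending order when there are no ties
alt_order = np.argsort(-eigenvalues)

# hypothetical stand-in: every entry of column j equals j
labelled_columns = np.tile(np.arange(5), (5, 1))

# index the COLUMNS with the same order so each eigenvector stays with its eigenvalue
sorted_columns = labelled_columns[:, order]
print(sorted_columns[0])
```

Indexing rows instead of columns here would silently scramble the eigenvectors, which is why the `[:, order_of_importance]` form matters.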


Question:

1. Why do we order eigenvalues and eigenvectors?

We order eigenvalues and eigenvectors in order to focus on the components that capture the most variance, making PCA both more informative and efficient for dimensionality reduction.

2. Is it true that we would prefer the lowest eigenvalue over the highest? Defend your answer.

No. In PCA we keep the components with the highest eigenvalues, not the lowest. A larger eigenvalue means the corresponding principal component captures more of the variance in the data. Because the goal of PCA is to retain as much variance (information) as possible in fewer dimensions, we keep the principal components with the highest eigenvalues, as they contribute the most to the data's structure, and discard those with the lowest.

To see what percentage of the information (variance) each eigenvalue holds, print each eigenvalue as a percentage of the total using the formula



> (sorted eigenvalues / sum of all sorted eigenvalues) * 100



In [89]:
# use sorted_eigenvalues to ensure the explained variances correspond to the eigenvectors

explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues) * 100
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['63.50%', '28.94%', '6.73%', '0.82%', '0.00%']
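
A common follow-up is to accumulate these percentages and pick the smallest `k` that reaches a target. The sketch below uses the sorted eigenvalues printed in Step 4; the 90% threshold is an arbitrary assumption for illustration, not a rule from this notebook.

```python
import numpy as np

# sorted eigenvalues as printed in Step 4
sorted_eigenvalues = np.array([3.80985761e+00, 1.73655615e+00, 4.04085720e-01,
                               4.94531029e-02, 4.74189469e-05])

explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)
print((cumulative_ratio * 100).round(2))

# smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # assumed target, chosen only for this example
k = int(np.searchsorted(cumulative_ratio, threshold) + 1)
print(k)
```

With these eigenvalues the first two components together exceed 90% of the variance, which is consistent with using k = 2 in the next step.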


## Initialize the number of principal components k, then perform the matrix multiplication (for example, k = 3 for 3 principal components)




> The resulting matrix (with reduced data) = standardized data * matrix whose columns are the top k sorted eigenvectors

See expected output for k = 2



In [90]:
k = 2
top_k_eigenvectors = sorted_eigenvectors[:, :k]  # take the k columns with the largest eigenvalues
reduced_data = np.matmul(standardized_data, top_k_eigenvectors)

In [91]:
print(reduced_data)

[[ 2.3577116  -0.75728867]
 [-2.27171739 -1.81970663]
 [ 1.21259114 -0.50390931]
 [-1.41935914  1.9229856 ]
 [ 1.61562536  0.87541857]
 [-1.49485157  0.28250044]]


In [92]:
print(reduced_data.shape)

(6, 2)
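
One way to check the projection is to look at the covariance of the reduced data: its diagonal should equal the top-k eigenvalues and its off-diagonal entries should be close to zero, because the principal components are uncorrelated. The sketch below assumes `np.linalg.eigh` and hand-computed standardization instead of the notebook's `eig` and `StandardScaler`. Eigenvectors are only defined up to sign, so a column of the reduced data may come out negated relative to the output above.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
z = (data - data.mean(axis=0)) / data.std(axis=0)
cov = np.cov(z, rowvar=False)

# eigh returns ascending eigenvalues, so reorder to largest-first
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

k = 2
reduced = z @ vecs[:, :k]  # shape (6, 2)

# covariance of the projected data: diagonal ~ top-k eigenvalues, off-diagonal ~ 0
proj_cov = np.cov(reduced, rowvar=False)
print(proj_cov.round(6))
print(vals[:k])
```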


# What are 2 positive effects and 2 negative effects of PCA?

Give 2 benefits and 2 limitations.

### **BENEFITS:**
- PCA reduces the number of features used to train our model, which in turn reduces computation time and resources.
- PCA lets us focus on the components with the most variance and filter out less important ones, which can reduce noise in the training data and thereby improve data quality.

### **LIMITATIONS:**
- If we discard too many low-variance components, we may lose important information, since low variance does not always mean low relevance.
- Each new component is a combination of the original features, so it is harder to understand what the engineered features represent, which can make the data harder to interpret.
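
The information-loss limitation can be made measurable by mapping the reduced data back to the original feature space and computing the reconstruction error. The sketch below is one illustration on the Section 1 data, assuming `np.linalg.eigh` and hand-computed standardization; the error for a given `k` equals (n - 1) times the sum of the discarded eigenvalues.

```python
import numpy as np

data = np.array([
    [1,  2, -1,  4,  10],
    [3, -3, -3, 12, -15],
    [2,  1, -2,  4,   5],
    [5,  1, -5, 10,   5],
    [2,  3, -3,  5,  12],
    [4,  0, -3, 16,   2],
], dtype=float)
z = (data - data.mean(axis=0)) / data.std(axis=0)
n = z.shape[0]

vals, vecs = np.linalg.eigh(np.cov(z, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# total squared reconstruction error for k = 1 .. number of features
errors = []
for k in range(1, z.shape[1] + 1):
    top_k = vecs[:, :k]
    reconstructed = (z @ top_k) @ top_k.T  # project down, then map back up
    errors.append(float(np.sum((z - reconstructed) ** 2)))

for k, err in enumerate(errors, start=1):
    print(k, round(err, 6))
```

The error shrinks as `k` grows and vanishes when all components are kept, so the choice of `k` is a trade-off between compression and fidelity.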

# SECTION 2
## Perform PCA on the fuel_econ dataset

In [93]:
df = pd.read_csv("../../plotting/datasets/fuel_econ.csv")
df.head(10)

Unnamed: 0,id,make,model,year,VClass,drive,trans,fuelType,cylinders,displ,pv2,pv4,city,UCity,highway,UHighway,comb,co2,feScore,ghgScore
0,32204,Nissan,GT-R,2013,Subcompact Cars,All-Wheel Drive,Automatic (AM6),Premium Gasoline,6,3.8,79,0,16.4596,20.2988,22.5568,30.1798,18.7389,471,4,4
1,32205,Volkswagen,CC,2013,Compact Cars,Front-Wheel Drive,Automatic (AM-S6),Premium Gasoline,4,2.0,94,0,21.8706,26.977,31.0367,42.4936,25.2227,349,6,6
2,32206,Volkswagen,CC,2013,Compact Cars,Front-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.6,94,0,17.4935,21.2,26.5716,35.1,20.6716,429,5,5
3,32207,Volkswagen,CC 4motion,2013,Compact Cars,All-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.6,94,0,16.9415,20.5,25.219,33.5,19.8774,446,5,5
4,32208,Chevrolet,Malibu eAssist,2013,Midsize Cars,Front-Wheel Drive,Automatic (S6),Regular Gasoline,4,2.4,0,95,24.7726,31.9796,35.534,51.8816,28.6813,310,8,8
5,32209,Lexus,GS 350,2013,Midsize Cars,Rear-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.5,0,99,19.4325,24.1499,28.2234,38.5,22.6002,393,6,6
6,32210,Lexus,GS 350 AWD,2013,Midsize Cars,All-Wheel Drive,Automatic (S6),Premium Gasoline,6,3.5,0,99,18.5752,23.5261,26.3573,36.2109,21.4213,412,5,5
7,32214,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Automatic 8-spd,Premium Gasoline,4,2.0,89,0,17.446,21.7946,26.6295,37.6731,20.6507,432,5,5
8,32215,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Manual 6-spd,Premium Gasoline,4,2.0,89,0,20.6741,26.2,29.2741,41.8,23.8235,375,6,6
9,32216,Hyundai,Genesis Coupe,2013,Subcompact Cars,Rear-Wheel Drive,Automatic 8-spd,Premium Gasoline,6,3.8,89,0,16.4675,20.4839,24.5605,34.4972,19.3344,461,4,4


In [94]:
corr_matrix = df.corr(numeric_only=True)
corr_matrix

Unnamed: 0,id,year,cylinders,displ,pv2,pv4,city,UCity,highway,UHighway,comb,co2,feScore,ghgScore
id,1.0,0.985668,-0.060096,-0.074666,-0.006569,-0.021951,0.0918,0.091225,0.090593,0.095359,0.093803,-0.099717,-0.127873,-0.122321
year,0.985668,1.0,-0.055313,-0.070424,0.006232,-0.033643,0.06805,0.066742,0.07329,0.077641,0.071993,-0.081165,-0.149829,-0.145141
cylinders,-0.060096,-0.055313,1.0,0.933872,0.247571,-0.004264,-0.693103,-0.666029,-0.766275,-0.771503,-0.738023,0.848274,-0.783858,-0.781815
displ,-0.074666,-0.070424,0.933872,1.0,0.259336,0.022072,-0.713479,-0.686166,-0.783984,-0.788457,-0.758397,0.855375,-0.793432,-0.791216
pv2,-0.006569,0.006232,0.247571,0.259336,1.0,-0.665642,-0.278109,-0.272546,-0.296808,-0.298504,-0.290883,0.2872,-0.296088,-0.293156
pv4,-0.021951,-0.033643,-0.004264,0.022072,-0.665642,1.0,0.035188,0.037869,0.074952,0.077442,0.047333,-0.050153,0.064876,0.065263
city,0.0918,0.06805,-0.693103,-0.713479,-0.278109,0.035188,1.0,0.996377,0.915435,0.909658,0.989552,-0.904305,0.905681,0.898793
UCity,0.091225,0.066742,-0.666029,-0.686166,-0.272546,0.037869,0.996377,1.0,0.899557,0.897814,0.981106,-0.885823,0.891297,0.884458
highway,0.090593,0.07329,-0.766275,-0.783984,-0.296808,0.074952,0.915435,0.899557,1.0,0.992191,0.962757,-0.916456,0.914116,0.897585
UHighway,0.095359,0.077641,-0.771503,-0.788457,-0.298504,0.077442,0.909658,0.897814,0.992191,1.0,0.95658,-0.912117,0.911355,0.894314


In [95]:
numeric_cols = df.select_dtypes(include='number').drop(
    columns=["id", "year", "pv2", "pv4"],
    errors='ignore'
)
numeric_cols

Unnamed: 0,cylinders,displ,city,UCity,highway,UHighway,comb,co2,feScore,ghgScore
0,6,3.8,16.4596,20.2988,22.5568,30.1798,18.7389,471,4,4
1,4,2.0,21.8706,26.9770,31.0367,42.4936,25.2227,349,6,6
2,6,3.6,17.4935,21.2000,26.5716,35.1000,20.6716,429,5,5
3,6,3.6,16.9415,20.5000,25.2190,33.5000,19.8774,446,5,5
4,4,2.4,24.7726,31.9796,35.5340,51.8816,28.6813,310,8,8
...,...,...,...,...,...,...,...,...,...,...
3924,4,1.8,55.2206,78.8197,53.0000,73.6525,54.4329,78,10,10
3925,4,2.0,39.0000,55.9000,44.3066,64.0000,41.0000,217,9,9
3926,4,2.0,40.0000,56.0000,46.0000,64.0000,42.0000,212,9,9
3927,6,3.4,19.2200,24.2000,30.2863,43.4000,23.0021,387,5,5


### Step 1: Standardize the Data along the Features

In [96]:
standardized_data = scaler.fit_transform(numeric_cols)
standardized_data

array([[ 0.28310163,  0.65053594, -0.85996012, ...,  1.02283829,
        -0.95057953, -0.94575548],
       [-0.78181585, -0.72799833,  0.00642675, ..., -0.29854998,
         0.1886082 ,  0.1942578 ],
       [ 0.28310163,  0.49736547, -0.69441634, ...,  0.56793413,
        -0.38098566, -0.37574884],
       ...,
       [-0.78181585, -0.72799833,  2.90923103, ..., -1.78240402,
         1.89738979,  1.90427772],
       [ 0.28310163,  0.34419499, -0.41797632, ...,  0.11302997,
        -0.38098566, -0.37574884],
       [ 0.28310163,  0.34419499, -0.60641667, ...,  0.43796152,
        -0.95057953, -0.94575548]])

### Step 2: Calculate the Covariance Matrix

In [97]:
cov_matrix = np.cov(standardized_data, rowvar=False)

print(cov_matrix)

[[ 1.00025458  0.93411019 -0.69327904 -0.66619842 -0.76646982 -0.77169964
  -0.73821112  0.84848979 -0.78405759 -0.78201448]
 [ 0.93411019  1.00025458 -0.71366074 -0.6863403  -0.78418374 -0.78865771
  -0.75859024  0.85559254 -0.7936343  -0.79141752]
 [-0.69327904 -0.71366074  1.00025458  0.99663082  0.9156677   0.90989004
   0.98980432 -0.90453509  0.9059112   0.89902154]
 [-0.66619842 -0.6863403   0.99663082  1.00025458  0.89978578  0.89804238
   0.98135571 -0.8860481   0.89152389  0.88468357]
 [-0.76646982 -0.78418374  0.9156677   0.89978578  1.00025458  0.99244327
   0.9630022  -0.91668944  0.91434884  0.89781322]
 [-0.77169964 -0.78865771  0.90989004  0.89804238  0.99244327  1.00025458
   0.95682339 -0.91234956  0.91158665  0.89454192]
 [-0.73821112 -0.75859024  0.98980432  0.98135571  0.9630022   0.95682339
   1.00025458 -0.92963549  0.92909879  0.91904062]
 [ 0.84848979  0.85559254 -0.90453509 -0.8860481  -0.91668944 -0.91234956
  -0.92963549  1.00025458 -0.94086368 -0.94480617]


### Step 3: Eigendecomposition on the Covariance Matrix

In [98]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(eigenvalues)
print(eigenvectors)

[8.84730337e+00 6.67702093e-01 1.97957320e-01 1.56642445e-01
 6.56749106e-02 5.09901088e-02 9.38311059e-03 4.31401016e-03
 2.18409060e-03 3.94363265e-04]
[[ 0.28375385  0.61039118 -0.09948505 -0.17154364 -0.67605413  0.22329265
   0.01770715 -0.0176749  -0.00671151  0.00144201]
 [ 0.28822685  0.574993   -0.15926252 -0.17324798  0.72616419  0.0557668
  -0.00273307 -0.00409296 -0.02083836 -0.00678931]
 [-0.32030196  0.30620037  0.09908369  0.41454238  0.00213767 -0.06720823
  -0.16862221 -0.02277963 -0.46306804  0.61162009]
 [-0.31547813  0.34876194  0.10595195  0.46257394  0.01089318 -0.12215146
   0.33984451  0.02217467  0.64178886 -0.10353553]
 [-0.32427103  0.1027842   0.34015226 -0.48319392  0.01686264  0.02430184
  -0.5610654   0.08043514  0.40976055  0.21227566]
 [-0.32370216  0.08856727  0.35907789 -0.49252161  0.01883343 -0.05236185
   0.66322021  0.00801798 -0.26737839  0.03667809]
 [-0.32863234  0.22969644  0.17498666  0.11316021 -0.00210197 -0.03232589
  -0.30850687  0.012072

### Step 4: Sort the Principal Components

In [99]:
order_of_importance = np.argsort(eigenvalues)[::-1]
print('the order of importance is :\n {}'.format(order_of_importance))

# utilize the sort order to sort eigenvalues and eigenvectors
sorted_eigenvalues = eigenvalues[order_of_importance]

print('\n\n sorted eigen values:\n{}'.format(sorted_eigenvalues))
sorted_eigenvectors = eigenvectors[:, order_of_importance]
print('\n\n The sorted eigen vector matrix is: \n {}'.format(sorted_eigenvectors))

the order of importance is :
 [0 1 2 3 4 5 6 7 8 9]


 sorted eigen values:
[8.84730337e+00 6.67702093e-01 1.97957320e-01 1.56642445e-01
 6.56749106e-02 5.09901088e-02 9.38311059e-03 4.31401016e-03
 2.18409060e-03 3.94363265e-04]


 The sorted eigen vector matrix is: 
 [[ 0.28375385  0.61039118 -0.09948505 -0.17154364 -0.67605413  0.22329265
   0.01770715 -0.0176749  -0.00671151  0.00144201]
 [ 0.28822685  0.574993   -0.15926252 -0.17324798  0.72616419  0.0557668
  -0.00273307 -0.00409296 -0.02083836 -0.00678931]
 [-0.32030196  0.30620037  0.09908369  0.41454238  0.00213767 -0.06720823
  -0.16862221 -0.02277963 -0.46306804  0.61162009]
 [-0.31547813  0.34876194  0.10595195  0.46257394  0.01089318 -0.12215146
   0.33984451  0.02217467  0.64178886 -0.10353553]
 [-0.32427103  0.1027842   0.34015226 -0.48319392  0.01686264  0.02430184
  -0.5610654   0.08043514  0.40976055  0.21227566]
 [-0.32370216  0.08856727  0.35907789 -0.49252161  0.01883343 -0.05236185
   0.66322021  0.00801798 -0.267

In [100]:
explained_variance = sorted_eigenvalues / np.sum(sorted_eigenvalues) * 100
explained_variance = ["{:.2f}%".format(value) for value in explained_variance]
print(explained_variance)

['88.45%', '6.68%', '1.98%', '1.57%', '0.66%', '0.51%', '0.09%', '0.04%', '0.02%', '0.00%']


In [101]:
k = 2
top_k_eigenvectors = sorted_eigenvectors[:, :k]  # take the k columns with the largest eigenvalues
reduced_data = np.matmul(standardized_data, top_k_eigenvectors)

In [102]:
print(reduced_data)

[[ 2.96093605 -0.47606835]
 [-0.68908233 -0.91603194]
 [ 1.80755222 -0.28629287]
 ...
 [-6.7460631   2.19200837]
 [ 0.78953126  0.02162714]
 [ 1.79979564 -0.26809069]]


In [103]:
print(reduced_data.shape)

(3929, 2)
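
Since both sections repeat the same four steps, they could be collected into one helper. The function below is a hypothetical sketch (the name `pca_numpy` and the synthetic test data are assumptions, not part of this notebook); it uses `np.linalg.eigh` and hand-computed standardization, and it does not handle constant columns, which would cause a division by zero.

```python
import numpy as np

def pca_numpy(X, k):
    """Hypothetical helper: project X (rows = samples) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each feature (population std, ddof=0)
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features
    cov = np.cov(z, rowvar=False)
    # Step 3: eigendecomposition (eigh is intended for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()
    # project the standardized data onto the top-k eigenvectors
    reduced = z @ eigenvectors[:, :k]
    return reduced, explained_ratio

# synthetic data: the second feature is almost a multiple of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = rng.normal(size=200)
X = np.column_stack([a, 2 * a + 0.1 * rng.normal(size=200), b, c])

reduced, explained_ratio = pca_numpy(X, k=2)
print(reduced.shape)
print((explained_ratio * 100).round(2))
```

On the fuel economy data, the same call would take `numeric_cols` in place of the synthetic `X`.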
