[PCA] Use the "USArrests" data

(a) Show the first to fourth principal component loading vectors using the PCA() function.

In [2]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the USArrests CSV file, using the first column (state names) as the row index
df = pd.read_csv('USArrests.csv', index_col=0)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Apply PCA
pca = PCA()
pca.fit(scaled_data)

# Get the first 4 principal components (loadings)
loadings = pca.components_[:4]

# Convert to DataFrame for easier viewing
loadings_df = pd.DataFrame(loadings, columns=df.columns)

# Display the loadings
print(loadings_df)


     Murder   Assault  UrbanPop      Rape
0  0.535899  0.583184  0.278191  0.543432
1 -0.418181 -0.187986  0.872806  0.167319
2 -0.341233 -0.268148 -0.378016  0.817778
3 -0.649228  0.743407 -0.133878 -0.089024


(b) Use the np.linalg.eig() function to find the first to fourth principal component loading vectors.

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the USArrests CSV file, using the first column (state names) as the row index
df = pd.read_csv('USArrests.csv', index_col=0)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Compute the covariance matrix
cov_matrix = np.cov(scaled_data.T)

# Perform eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# np.linalg.eig() does not sort its output, so order the eigenvectors
# by descending eigenvalue before labelling them PC1 to PC4
order = np.argsort(eigenvalues)[::-1]
loadings = eigenvectors[:, order][:, :4]

# Convert to DataFrame for easier viewing (one component per column)
loadings_df = pd.DataFrame(loadings, columns=[f'PC{i+1}' for i in range(4)], index=df.columns)

# Display the loadings
print(loadings_df)


               PC1       PC2       PC3       PC4
Murder    0.535899  0.418181 -0.341233  0.649228
Assault   0.583184  0.187986 -0.268148 -0.743407
UrbanPop  0.278191 -0.872806 -0.378016  0.133878
Rape      0.543432 -0.167319  0.817778  0.089024


(c) Use the np.linalg.svd() function to find the first to fourth principal component loading vectors.

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load the USArrests CSV file, using the first column (state names) as the row index
df = pd.read_csv('USArrests.csv', index_col=0)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

# Perform a thin SVD on the standardized data (the full 50x50 U is not needed)
U, S, Vt = np.linalg.svd(scaled_data, full_matrices=False)

# Vt (transpose of V) contains the principal component loadings
# Take the first 4 components
loadings = Vt[:4, :]

# Convert to DataFrame for easier viewing
loadings_df = pd.DataFrame(loadings, columns=df.columns, index=[f'PC{i+1}' for i in range(4)])

# Display the loadings
print(loadings_df)


       Murder   Assault  UrbanPop      Rape
PC1 -0.535899 -0.583184 -0.278191 -0.543432
PC2 -0.418181 -0.187986  0.872806  0.167319
PC3  0.341233  0.268148  0.378016 -0.817778
PC4  0.649228 -0.743407  0.133878  0.089024
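
The rows of Vt are the loadings because, if the standardized matrix is X = U S V^T, then X^T X / (n - 1) = V (S^2 / (n - 1)) V^T. So V holds the eigenvectors of the covariance matrix, and the squared singular values divided by n - 1 are its eigenvalues. A minimal sketch of that check, assuming synthetic random data in place of USArrests.csv (the arrays raw and X and the seed are illustrative only, not part of the assignment):

```python
import numpy as np

# Synthetic stand-in for the standardized USArrests data (50 rows, 4 columns).
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)  # same scaling as StandardScaler (ddof=0)

n = X.shape[0]

# Eigenvalues of the sample covariance matrix, sorted in descending order.
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]

# Singular values come back from np.linalg.svd() already sorted in descending order.
singular_values = np.linalg.svd(X, compute_uv=False)
svd_eigenvalues = singular_values**2 / (n - 1)

# The two routes should agree up to floating-point rounding.
print(np.allclose(cov_eigenvalues, svd_eigenvalues))
```

The division by n - 1 matches the default of np.cov(); the loadings themselves do not depend on that constant.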


(d) Are those from (a),(b), and (c) exactly the same? Why or why not?

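No. The three sets of loadings are not numerically identical, although they describe the same principal components.

The differences between (a), (b), and (c) come from:

- Sign flipping: An eigenvector is only defined up to sign, so v and -v are both valid loadings for the same component. For example, PC1 in (c) is the negative of PC1 in (a) and (b).
- Ordering: np.linalg.eig() does not guarantee sorted eigenvalues, so in (b) the eigenvectors have to be sorted by descending eigenvalue before they are labelled PC1 to PC4. PCA() and np.linalg.svd() already return components in descending order.
- Orientation: (a) and (c) store one component per row, while (b) stores one component per column.
- Numerical precision: The methods use different algorithms, so values can differ at the level of floating-point rounding, which is not visible at six decimal places.

The magnitudes agree: every loading vector has unit length in all three methods. Once signs and ordering are aligned, the three methods give the same principal components.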
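
The comparison can be checked numerically. The following is a minimal sketch, assuming synthetic random data instead of USArrests.csv and a hypothetical helper align_signs() that is not part of NumPy or scikit-learn. It computes loadings by eigendecomposition and by SVD, then compares them after flipping signs to match.

```python
import numpy as np

def align_signs(loadings, reference):
    """Flip each row of `loadings` so it points the same way as the matching row of `reference`."""
    signs = np.sign(np.sum(loadings * reference, axis=1, keepdims=True))
    return loadings * signs

# Synthetic stand-in for the standardized data (50 rows, 4 columns).
rng = np.random.default_rng(1)
raw = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Route 1: eigendecomposition of the covariance matrix, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eig(np.cov(X.T))
order = np.argsort(eigenvalues)[::-1]
eig_loadings = eigenvectors[:, order].T  # transpose so each row is one component

# Route 2: SVD of the standardized data; rows of Vt are the components.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
svd_loadings = Vt

# Raw comparison may fail because of sign flips; the aligned comparison should pass.
print("identical as returned:", np.allclose(eig_loadings, svd_loadings))
print("identical after sign alignment:",
      np.allclose(align_signs(eig_loadings, svd_loadings), svd_loadings))

# Both routes return unit-length loading vectors.
print("unit length:", np.allclose(np.linalg.norm(eig_loadings, axis=1), 1.0))
```

The same alignment could be applied to pca.components_ from (a), since it also stores one component per row.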