[PCA] Use the 'USArrests' data

a)

In [6]:
# Step 1: Import the necessary libraries
from google.colab import files
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 2: Upload the 'USArrests.csv' file
uploaded = files.upload()

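# Step 3: Load the dataset into a pandas DataFrame
# Read the bytes returned by files.upload() instead of a fixed path: Colab saves a
# re-uploaded file under a new name (e.g. 'USArrests (2).csv'), so 'USArrests.csv'
# on disk may be an older copy. This assumes exactly one file was uploaded.
import io
USArrests = pd.read_csv(io.BytesIO(next(iter(uploaded.values()))))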

# Step 4: Drop the first column which contains state names (non-numeric data)
# pandas names a header-less first column 'Unnamed: 0'
USArrests_numeric = USArrests.drop(columns=['Unnamed: 0'])

# Step 5: Standardize the data (PCA is sensitive to scale)
scaler = StandardScaler()
USArrests_scaled = scaler.fit_transform(USArrests_numeric)

# Step 6: Perform PCA
pca = PCA(n_components=4)  # We want the first four principal components
pca.fit(USArrests_scaled)

# Step 7: Display the principal component loadings (PCA components)
loadings = pca.components_  # This gives us the loadings (eigenvectors)

# Step 8: Create a DataFrame to display loadings
loadings_df = pd.DataFrame(loadings.T, index=USArrests_numeric.columns, columns=['PC1', 'PC2', 'PC3', 'PC4'])

# Show the first to fourth principal component loadings vectors
print("Principal Component Loadings (First to Fourth PCs):")
print(loadings_df)


Saving USArrests.csv to USArrests (2).csv
Principal Component Loadings (First to Fourth PCs):
               PC1       PC2       PC3       PC4
Murder    0.535899 -0.418181 -0.341233 -0.649228
Assault   0.583184 -0.187986 -0.268148  0.743407
UrbanPop  0.278191  0.872806 -0.378016 -0.133878
Rape      0.543432  0.167319  0.817778 -0.089024


b)

In [9]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Step 2: Drop the non-numeric 'state names' column
USArrests_numeric = USArrests.drop(columns=['Unnamed: 0'])

# Step 3: Standardize the data
scaler = StandardScaler()
USArrests_scaled = scaler.fit_transform(USArrests_numeric)

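# Step 4: Compute the covariance matrix of the scaled data
# np.cov divides by n - 1 while StandardScaler divides by n, so this is the correlation
# matrix times n / (n - 1); the constant factor does not change the eigenvectors
cov_matrix = np.cov(USArrests_scaled.T)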

# Step 5: Compute the eigenvalues and eigenvectors using np.linalg.eig()
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Step 6: Sort eigenvalues and corresponding eigenvectors in descending order
idx = np.argsort(eigenvalues)[::-1]  # Indices to sort eigenvalues in descending order
eigenvalues = eigenvalues[idx]       # Sort eigenvalues
eigenvectors = eigenvectors[:, idx]  # Sort eigenvectors according to eigenvalues

# Step 7: Create a DataFrame to display the first to fourth principal component loadings
loadings_df = pd.DataFrame(eigenvectors[:, :4], index=USArrests_numeric.columns, columns=['PC1', 'PC2', 'PC3', 'PC4'])

# Show the first to fourth principal component loadings vectors
print("Principal Component Loadings (First to Fourth PCs) using np.linalg.eig():")
print(loadings_df)


Principal Component Loadings (First to Fourth PCs) using np.linalg.eig():
               PC1       PC2       PC3       PC4
Murder    0.535899  0.418181 -0.341233  0.649228
Assault   0.583184  0.187986 -0.268148 -0.743407
UrbanPop  0.278191 -0.872806 -0.378016  0.133878
Rape      0.543432 -0.167319  0.817778  0.089024


c)

In [8]:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Step 2: Drop the non-numeric 'state names' column
USArrests_numeric = USArrests.drop(columns=['Unnamed: 0'])

# Step 3: Standardize the data
scaler = StandardScaler()
USArrests_scaled = scaler.fit_transform(USArrests_numeric)

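# Step 4: Perform SVD using np.linalg.svd
# It returns U, the singular values S, and Vt (V transposed); the rows of Vt are the loadings
# full_matrices=False skips the full square U, which is not needed here; Vt is unchanged
U, S, Vt = np.linalg.svd(USArrests_scaled, full_matrices=False)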

# Step 5: Vt contains the principal component loadings (eigenvectors)
# The rows of Vt correspond to the principal components, so we take the first four components (rows)
loadings_df = pd.DataFrame(Vt[:4].T, index=USArrests_numeric.columns, columns=['PC1', 'PC2', 'PC3', 'PC4'])

# Show the first to fourth principal component loadings vectors
print("Principal Component Loadings (First to Fourth PCs) using SVD:")
print(loadings_df)


Principal Component Loadings (First to Fourth PCs) using SVD:
               PC1       PC2       PC3       PC4
Murder   -0.535899 -0.418181  0.341233  0.649228
Assault  -0.583184 -0.187986  0.268148 -0.743407
UrbanPop -0.278191  0.872806  0.378016  0.133878
Rape     -0.543432  0.167319 -0.817778  0.089024


Yes, the results are the same. The loadings from all three methods (PCA(), np.linalg.eig(), and np.linalg.svd()) are mathematically equivalent: for centered data X, the eigenvectors of the covariance matrix X^T X / (n - 1) are the right singular vectors of X, so every method recovers the same principal directions, and the printed values agree in magnitude to six decimal places.
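The equivalence can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a synthetic 50 x 4 matrix in place of USArrests (the seed and the `mix` matrix are made up so the block runs without the CSV file), standardizes it the same way StandardScaler does, and compares the covariance eigendecomposition with the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the standardized data (assumption: not the real USArrests values)
mix = np.array([[2.0, 0.5, 0.0, 0.0],
                [0.0, 1.5, 0.5, 0.0],
                [0.0, 0.0, 1.0, 0.5],
                [0.0, 0.0, 0.0, 0.5]])
X = rng.normal(size=(50, 4)) @ mix
X = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0), as StandardScaler uses

n = X.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix, sorted in descending order
eigenvalues, eigenvectors = np.linalg.eig(np.cov(X.T))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Route 2: SVD of the same centered matrix (singular values come back in descending order)
_, S, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues equal the squared singular values divided by n - 1
variances_match = np.allclose(eigenvalues, S**2 / (n - 1))

# Loadings agree entry by entry once the sign is ignored
directions_match = np.allclose(np.abs(eigenvectors), np.abs(Vt.T))

print(variances_match, directions_match)
```

Comparing absolute values is a quick check rather than a proof: it ignores the sign of each entry, which is enough here because both routes return unit-length vectors.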

However, the signs of the loading vectors differ between methods. Relative to PCA(), np.linalg.eig() flips PC2 and PC4, and np.linalg.svd() flips PC1, PC3, and PC4. An eigenvector is determined only up to sign (multiplying it by -1 gives an equally valid eigenvector), so each column is identical in magnitude across methods but may have the opposite sign. The sign changes neither the principal axis nor the variance explained along it.
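To compare the tables directly, the sign of each column can be fixed by a convention. The sketch below uses one possible convention (make the largest-magnitude entry of each column positive); `align_signs` is a hypothetical helper written for this illustration, not a library function. The two arrays are copied from the PCA() and SVD outputs printed above.

```python
import numpy as np

def align_signs(loadings):
    """Flip each column so that its largest-magnitude entry is positive."""
    loadings = np.asarray(loadings, dtype=float)
    rows = np.argmax(np.abs(loadings), axis=0)          # row of the dominant entry per column
    signs = np.sign(loadings[rows, np.arange(loadings.shape[1])])
    return loadings * signs                             # broadcasts one sign per column

# Loadings from the PCA() output (rows: Murder, Assault, UrbanPop, Rape)
pca_loadings = np.array([
    [0.535899, -0.418181, -0.341233, -0.649228],
    [0.583184, -0.187986, -0.268148,  0.743407],
    [0.278191,  0.872806, -0.378016, -0.133878],
    [0.543432,  0.167319,  0.817778, -0.089024],
])

# Loadings from the np.linalg.svd() output
svd_loadings = np.array([
    [-0.535899, -0.418181,  0.341233,  0.649228],
    [-0.583184, -0.187986,  0.268148, -0.743407],
    [-0.278191,  0.872806,  0.378016,  0.133878],
    [-0.543432,  0.167319, -0.817778,  0.089024],
])

raw_equal = np.allclose(pca_loadings, svd_loadings)
aligned_equal = np.allclose(align_signs(pca_loadings), align_signs(svd_loadings))
print(raw_equal, aligned_equal)
```

Any consistent rule would work equally well; the point is only that the same rule is applied to every table before comparing.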