
# Principal Component Analysis


By Abdula Alkhafaji with assistance from Glendale Community College Tutoring Center

This notebook covers the fundamental concepts of Principal Component Analysis (PCA): the Singular Value Decomposition (SVD), lower-rank matrix approximations, the induced norm, the covariance matrix, and total variance.

## 1.4 Principal Component Analysis

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data while retaining as much variability as possible. It works by transforming the original variables into a new set of variables called principal components, which are uncorrelated and ordered by the amount of variance they capture.

### 1.4.1 Singular Value Decomposition

Singular Value Decomposition (SVD) factors a matrix A into the product A = U Σ Vᵀ, where U and V are orthogonal matrices and Σ is a (rectangular) diagonal matrix holding the non-negative singular values in decreasing order. In PCA, the SVD of the mean-centered data matrix is commonly used to compute the principal components.

In [1]:
import numpy as np

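# Create a random 3x3 matrix (no seed is set, so values differ on every run)
A = np.random.rand(3, 3)
# S is returned as a 1-D array of singular values; Vt is V transposed
U, S, Vt = np.linalg.svd(A)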
print('Matrix A:\n', A)
print('\nU matrix:\n', U)
print('\nSingular values (S):\n', S)
print('\nVt matrix:\n', Vt)

Matrix A:
 [[0.4616197  0.7277056  0.17053715]
 [0.72645923 0.8955592  0.21996973]
 [0.33779387 0.60560473 0.05160472]]

U matrix:
 [[-0.54241088  0.22227531 -0.81017537]
 [-0.724317   -0.61230156  0.31694114]
 [-0.42562345  0.75873611  0.49311681]]

Singular values (S):
 [1.61708905 0.12654894 0.04919661]

Vt matrix:
 [[-0.5691383  -0.80462097 -0.16931238]
 [-0.67886025  0.5760054  -0.45537517]
 [ 0.46392926 -0.14423201 -0.87405193]]
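
The factorization can be checked numerically. The sketch below is a minimal illustration, not part of the original notebook: it uses a fixed seed (an assumption made for reproducibility), rebuilds the matrix from its three factors, and confirms that `U` and `Vt` are orthogonal and that the singular values come back sorted in decreasing order.

```python
import numpy as np

# Fixed seed is an assumption for reproducibility; the notebook uses unseeded data
rng = np.random.default_rng(0)
A = rng.random((3, 3))
U, S, Vt = np.linalg.svd(A)

# S is a 1-D array, so build the diagonal matrix Sigma before multiplying
Sigma = np.diag(S)
A_rebuilt = U @ Sigma @ Vt

reconstruction_ok = np.allclose(A, A_rebuilt)       # A == U Sigma V^T
U_orthogonal = np.allclose(U.T @ U, np.eye(3))      # U^T U == I
V_orthogonal = np.allclose(Vt @ Vt.T, np.eye(3))    # V^T V == I
sorted_desc = bool(np.all(np.diff(S) <= 0))         # sigma_1 >= sigma_2 >= ...

print('Reconstruction matches A:', reconstruction_ok)
print('U orthogonal:', U_orthogonal, '| V orthogonal:', V_orthogonal)
print('Singular values sorted in decreasing order:', sorted_desc)
```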


### 1.4.2 Lower Rank Matrix Approximations

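A lower-rank matrix approximation replaces a matrix with one of lower rank, built from only its k largest singular values and the corresponding singular vectors. By the Eckart-Young theorem, this truncated SVD is the best rank-k approximation in both the spectral and Frobenius norms. PCA uses the same idea to reduce the dimensionality of the data by keeping only the leading components.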

In [2]:
# Keep only the largest singular value and its pair of singular vectors
rank_1_approx = S[0] * np.outer(U[:, 0], Vt[0, :])
print('Rank 1 approximation of A:\n', rank_1_approx)

Rank 1 approximation of A:
 [[0.4992064  0.70575453 0.14850841]
 [0.6666232  0.94244054 0.19831307]
 [0.3917214  0.5537973  0.11653281]]
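
The same construction extends to any rank k. The sketch below assumes a hypothetical helper, `low_rank_approx`, and seeded random data (neither is in the original notebook). It checks the error of the rank-1 approximation: in the spectral norm the error equals the second singular value, and in the Frobenius norm it equals the square root of the sum of squares of the discarded singular values.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rank-k truncated-SVD approximation of A (hypothetical helper)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k leading singular values and their singular vectors
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)  # fixed seed is an assumption
A = rng.random((3, 3))
S = np.linalg.svd(A, compute_uv=False)

A1 = low_rank_approx(A, 1)
rank_A1 = np.linalg.matrix_rank(A1)

# Approximation errors in the spectral (2-norm) and Frobenius norms
spectral_error = np.linalg.norm(A - A1, 2)
frobenius_error = np.linalg.norm(A - A1, 'fro')

print('Rank of approximation:', rank_A1)
print('Spectral error vs. sigma_2:', spectral_error, S[1])
print('Frobenius error vs. sqrt(sigma_2^2 + sigma_3^2):',
      frobenius_error, np.sqrt(np.sum(S[1:] ** 2)))
```

Keeping all three singular values (`k = 3`) reproduces `A` up to rounding error.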


#### 1.4.2.1 Induced Norm

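The **induced norm** of a matrix measures the maximum factor by which the matrix can stretch a vector: the largest value of ‖Ax‖ over all unit vectors x. For the Euclidean norm, this induced 2-norm (also called the spectral norm) equals the largest singular value of the matrix.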
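
A quick numerical check can make this concrete. The sketch below (seeded random data, an assumption) compares `np.linalg.norm(A, 2)` with the largest singular value, samples random unit vectors to show that none is stretched by more than that amount, and confirms that the first right singular vector attains the maximum.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed is an assumption
A = rng.random((3, 3))
U, S, Vt = np.linalg.svd(A)

# Induced 2-norm as computed by NumPy
spectral_norm = np.linalg.norm(A, 2)

# Stretch factors ||Ax|| for 1000 random unit vectors x (stored as columns)
X = rng.standard_normal((3, 1000))
X /= np.linalg.norm(X, axis=0)
stretch = np.linalg.norm(A @ X, axis=0)
max_random_stretch = stretch.max()

# The first right singular vector is stretched by exactly sigma_1
v1 = Vt[0]
stretch_v1 = np.linalg.norm(A @ v1)

print('Induced 2-norm:          ', spectral_norm)
print('Largest singular value:  ', S[0])
print('Largest sampled stretch: ', max_random_stretch)
print('Stretch of v1:           ', stretch_v1)
```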

### 1.4.3 Principal Component Analysis

PCA transforms data into a new coordinate system such that the greatest variance comes to lie on the first principal component, the second greatest variance on the second component, and so on. This allows for dimensionality reduction while preserving as much variance as possible.

In [None]:
# Example from the Internet
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Create a random 2D dataset
X = np.random.rand(100, 2)

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot the original data and the same points expressed in principal-component coordinates
plt.scatter(X[:, 0], X[:, 1], label='Original Data')
plt.scatter(X_pca[:, 0], X_pca[:, 1], label='PCA Transformed Data')
plt.legend()
plt.title('PCA on 2D Data')
plt.show()
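
To connect PCA back to the SVD, the sketch below computes the principal components with NumPy alone. It is an illustrative alternative under stated assumptions (seeded random data, the standard sample-variance divisor n - 1), not a description of how scikit-learn is implemented internally. The data is mean-centered, its SVD is taken, and the rows of `Vt` serve as the principal directions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed is an assumption
X = rng.random((100, 2))        # 100 samples, 2 features
n = X.shape[0]

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Coordinates of each sample in the principal-component basis
scores = X_centered @ Vt.T

# Variance captured by each component (sample variance, divisor n - 1)
component_variances = S ** 2 / (n - 1)

# Covariance of the scores: off-diagonal entries are zero up to rounding
score_cov = np.cov(scores, rowvar=False)

print('Principal directions (rows):\n', Vt)
print('Variance per component:', component_variances)
print('Covariance of the scores:\n', score_cov)
```

The diagonal covariance of the scores shows that the components are uncorrelated, and the decreasing variances show the ordering described above.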

#### 1.4.3.1 Covariance Matrix

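The **covariance matrix** summarizes how the variables vary together. Its diagonal entries are the variances of the individual variables, and each off-diagonal entry is the covariance between a pair of variables. In PCA, the principal components are the eigenvectors of this matrix, and the eigenvalues give the variance captured by each component.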

In [6]:
import numpy as np

# Create a random 2D dataset: 100 samples (rows), 2 features (columns)
X = np.random.rand(100, 2)

# np.cov treats rows as variables by default, so pass the transpose
cov_matrix = np.cov(X.T)
print('Covariance matrix:\n', cov_matrix)

Covariance matrix:
 [[0.08299502 0.00615148]
 [0.00615148 0.07306044]]
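
The sketch below shows where these numbers come from, using seeded random data (an assumption). It builds the sample covariance matrix by hand from the centered data, compares it with `np.cov`, and then takes the eigendecomposition, whose eigenvectors are the principal directions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed is an assumption
X = rng.random((100, 2))        # 100 samples, 2 features
n = X.shape[0]

# Sample covariance by hand: center the data, then X_c^T X_c / (n - 1)
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (n - 1)

# rowvar=False tells np.cov that columns are the variables
cov_numpy = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov_numpy)
order = np.argsort(eigvals)[::-1]        # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print('Manual covariance matches np.cov:', np.allclose(cov_manual, cov_numpy))
print('Eigenvalues (variance per component):', eigvals)
print('Eigenvectors (columns are principal directions):\n', eigvecs)
```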


#### 1.4.3.3 Total Variance

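The **total variance** is the sum of the variances of the principal components, which equals the sum of the eigenvalues of the covariance matrix and also its trace. The fraction of the total variance captured by the first k components measures how much information is retained after dimensionality reduction. When every component is kept, these fractions sum to 1.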

In [7]:
from sklearn.decomposition import PCA
import numpy as np

# Create a random 2D dataset: 100 samples (rows), 2 features (columns)
X = np.random.rand(100, 2)

cov_matrix = np.cov(X.T)
print('Covariance matrix:\n', cov_matrix)

# Fit PCA, keeping both components
pca = PCA(n_components=2)
pca.fit(X)

# Fraction of the total variance captured by each component;
# the fractions sum to 1 because every component is kept
explained_variance = pca.explained_variance_ratio_
total_variance = np.sum(explained_variance)
print('Explained variance ratio:', explained_variance)
print('Total variance explained:', total_variance)

Covariance matrix:
 [[0.07952653 0.00754066]
 [0.00754066 0.09535567]]
Explained variance ratio: [0.56250889 0.43749111]
Total variance explained: 1.0
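
The ratio becomes more informative once a component is dropped. The sketch below uses NumPy only and a seeded three-feature dataset; both are assumptions made so that discarding a component is meaningful. It checks that the total variance equals the trace of the covariance matrix and the sum of its eigenvalues, then reports the cumulative fraction retained by the first k components.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed is an assumption
X = rng.random((100, 3))        # 3 features, so that a component can be dropped

cov = np.cov(X, rowvar=False)

# Eigenvalues of the covariance matrix, sorted in descending order
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Total variance: trace of the covariance matrix
total_variance = np.trace(cov)

# Fraction of variance per component, and running total for the first k components
ratio = eigvals / total_variance
cumulative = np.cumsum(ratio)

print('Total variance (trace):', total_variance)
print('Sum of eigenvalues:    ', eigvals.sum())
print('Explained variance ratio:', ratio)
print('Cumulative ratio for k = 1, 2, 3:', cumulative)
```

Reading off `cumulative[k - 1]` gives the share of the total variance kept when only the first k components are retained.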
