# Principal component analysis (PCA)

* ## Implementing PCA using the NumPy library

In [37]:
# Importing NumPy library with an alias 'np'
import numpy as np

# Setting a random seed for reproducibility
np.random.seed(4)

# Setting parameters for generating synthetic data
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

# Generating random angles and creating a 3D dataset
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

# Centering the data by subtracting the mean along each feature
X_centered = X - X.mean(axis=0)

# Performing Singular Value Decomposition (SVD) on the centered data
U, s, Vt = np.linalg.svd(X_centered)

# Extracting the principal components
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]

# Getting the shape of the original data
m, n = X.shape

# Creating a diagonal matrix S from the singular values
S = np.zeros(X_centered.shape)
S[:n, :n] = np.diag(s)

# Checking if the original data can be reconstructed using SVD
np.allclose(X_centered, U.dot(S).dot(Vt))

# Extracting the first two principal components
W2 = Vt.T[:, :2]

# Projecting the centered data onto the first two principal components
X2D = X_centered.dot(W2)

# Storing the result in X2D_using_svd
X2D_using_svd = X2D

# Displaying the resulting 2D projection
X2D_using_svd

array([[-1.26203346, -0.42067648],
       [ 0.08001485,  0.35272239],
       [-1.17545763, -0.36085729],
       [-0.89305601,  0.30862856],
       [-0.73016287,  0.25404049],
       [ 1.10436914, -0.20204953],
       [-1.27265808, -0.46781247],
       [ 0.44933007, -0.67736663],
       [ 1.09356195,  0.04467792],
       [ 0.66177325,  0.28651264],
       [-1.04466138,  0.11244353],
       [ 1.05932502, -0.31189109],
       [-1.13761426, -0.14576655],
       [-1.16044117, -0.36481599],
       [ 1.00167625, -0.39422008],
       [-0.2750406 ,  0.34391089],
       [ 0.45624787, -0.69707573],
       [ 0.79706574,  0.26870969],
       [ 0.66924929, -0.65520024],
       [-1.30679728, -0.37671343],
       [ 0.6626586 ,  0.32706423],
       [-1.25387588, -0.56043928],
       [-1.04046987,  0.08727672],
       [-1.26047729, -0.1571074 ],
       [ 1.09786649, -0.38643428],
       [ 0.7130973 , -0.64941523],
       [-0.17786909,  0.43609071],
       [ 1.02975735, -0.33747452],
       [-0.94552283,
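
The cell above builds a zero-padded `S` matrix to check the decomposition. As a minimal sketch, assuming a stand-in dataset (`X_demo` and the other names below are illustrative, not part of the notebook), the reduced form of the SVD avoids that padding, and the 2D projection can also be read directly off `U` and `s`:

```python
import numpy as np

# Stand-in data (an assumption): 60 samples lying close to a 2D plane in 3D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2)) @ np.array([[1.0, 0.5, 0.2], [0.0, 0.7, 0.3]])
X_demo += 0.05 * rng.normal(size=X_demo.shape)
X_demo_centered = X_demo - X_demo.mean(axis=0)

# full_matrices=False returns U with shape (m, n) instead of (m, m),
# so the reconstruction needs no zero-padded S matrix.
U_r, s_r, Vt_r = np.linalg.svd(X_demo_centered, full_matrices=False)
reconstruction_ok = np.allclose(X_demo_centered, (U_r * s_r) @ Vt_r)

# Projecting onto the first two right singular vectors ...
proj_via_Vt = X_demo_centered @ Vt_r[:2].T
# ... gives the same coordinates as scaling the first two columns of U.
proj_via_U = U_r[:, :2] * s_r[:2]
projections_match = np.allclose(proj_via_Vt, proj_via_U)

print(reconstruction_ok, projections_match)
```

Both checks should print `True`, since `X_centered @ V[:, :2]` equals `U[:, :2] * s[:2]` by the definition of the SVD.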

* ## Implementing PCA using Scikit-Learn library
Scikit-Learn’s PCA class uses SVD to implement PCA, just as we did above. It also centers the data automatically, so the uncentered X can be passed in directly.

In [38]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
print('Implementing PCA using Scikit-learn PCA class: \n',X2D[:5])
print('Implementing PCA using SVD method: \n',X2D_using_svd[:5])
print('In general, the only difference is that some axes may be flipped.')

Implementing PCA using Scikit-learn PCA class: 
 [[ 1.26203346  0.42067648]
 [-0.08001485 -0.35272239]
 [ 1.17545763  0.36085729]
 [ 0.89305601 -0.30862856]
 [ 0.73016287 -0.25404049]]
Implementing PCA using SVD method: 
 [[-1.26203346 -0.42067648]
 [ 0.08001485  0.35272239]
 [-1.17545763 -0.36085729]
 [-0.89305601  0.30862856]
 [-0.73016287  0.25404049]]
In general, the only difference is that some axes may be flipped.
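
The sign flip can be checked programmatically. The following is a minimal sketch on stand-in data (`X_demo` and the other names are assumptions for illustration): it compares the two projections column by column, ignoring sign, and then aligns the signs explicitly.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 60 samples lying close to a 2D plane in 3D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2)) @ np.array([[1.0, 0.5, 0.2], [0.0, 0.7, 0.3]])
X_demo += 0.05 * rng.normal(size=X_demo.shape)
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Projection via NumPy's SVD (data centered by hand).
_, _, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)
proj_svd = X_demo_centered @ Vt_demo[:2].T

# Projection via Scikit-Learn (PCA centers the data itself).
proj_sklearn = PCA(n_components=2).fit_transform(X_demo)

# Each column may differ by a factor of -1, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(proj_svd), np.abs(proj_sklearn), atol=1e-6)

# Recover the per-column sign and flip the SVD projection to match.
signs = np.sign(np.sum(proj_svd * proj_sklearn, axis=0))
proj_svd_aligned = proj_svd * signs
aligned_match = np.allclose(proj_svd_aligned, proj_sklearn, atol=1e-6)

print(same_up_to_sign, aligned_match)
```

A flipped axis does not change the variance captured along that axis, so both results are equally valid projections.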


* ### Explained Variance Ratio
    The explained variance ratio indicates the proportion of the dataset's variance that lies along each principal component.

In [39]:
print(pca.explained_variance_ratio_)
print('We can see that most of the variance lies along the first PC axis.')

[0.84248607 0.14631839]
We can see that most of the variance lies along the first PC axis.
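
The ratio can also be derived from the singular values: the variance along each component is proportional to the square of its singular value. A minimal sketch, using stand-in data (`X_demo` and the other names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 60 samples lying close to a 2D plane in 3D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2)) @ np.array([[1.0, 0.5, 0.2], [0.0, 0.7, 0.3]])
X_demo += 0.05 * rng.normal(size=X_demo.shape)
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Singular values of the centered data.
s_demo = np.linalg.svd(X_demo_centered, compute_uv=False)

# Share of the total variance carried by each principal component.
ratio_from_svd = s_demo**2 / np.sum(s_demo**2)

# Scikit-Learn's value, keeping all components for comparison.
ratio_from_sklearn = PCA().fit(X_demo).explained_variance_ratio_

ratios_match = np.allclose(ratio_from_svd, ratio_from_sklearn)
print(ratio_from_svd)
print(ratios_match)
```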


* ### Choosing the Right Number of Dimensions
    Compute the minimum number of dimensions required to preserve 95% of the training set’s variance.

In [40]:
# Importing PCA from scikit-learn
from sklearn.decomposition import PCA
pca = PCA() # Initializing a PCA object without specifying the number of components
pca.fit(X) # Fitting the PCA model to the data
cumsum = np.cumsum(pca.explained_variance_ratio_) # Calculating the cumulative explained variance
# Finding the number of principal components that explain at least 95% of the variance
d = np.argmax(cumsum >= 0.95) + 1
print('The number of principal components that explain at least 95% of the variance is',d)


# Initializing a PCA object with a specified explained variance threshold (95%)
pca = PCA(n_components=0.95)
# Transforming the data to the reduced-dimensional space
X_reduced = pca.fit_transform(X)

The number of principal components that explain at least 95% of the variance is 2
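
When `n_components` is given as a float, the fitted model records how many components it kept in its `n_components_` attribute. The sketch below, on stand-in data (`X_demo` and the other names are assumptions for illustration), checks that this agrees with the cumulative-sum approach:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 60 samples lying close to a 2D plane in 3D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2)) @ np.array([[1.0, 0.5, 0.2], [0.0, 0.7, 0.3]])
X_demo += 0.05 * rng.normal(size=X_demo.shape)

# Manual approach: first index where the cumulative ratio reaches 95%.
cumsum_demo = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
d_manual = int(np.argmax(cumsum_demo >= 0.95)) + 1

# Built-in approach: let PCA choose, then read the fitted attribute.
pca_95 = PCA(n_components=0.95).fit(X_demo)
d_builtin = int(pca_95.n_components_)

print(d_manual, d_builtin)
```

The two values are expected to coincide except in the edge case where the cumulative ratio lands exactly on the threshold.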


* ### Reconstructing the original data
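    The inverse_transform method reconstructs an approximation of the original data (X_recovered) from the reduced-dimensional representation (X_reduced). The reconstruction is not exact, because the variance along the dropped components is lost.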


In [41]:
# Initializing a PCA object with a specified explained variance threshold (95%)
pca = PCA(n_components=0.95)

# Transforming the original data (X) to the reduced-dimensional space
X_reduced = pca.fit_transform(X)

# Reconstructing the original data from the reduced-dimensional representation
X_recovered = pca.inverse_transform(X_reduced)
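
To quantify how much information the projection discards, one option is the reconstruction error: the mean squared distance between the original and the recovered points. A minimal sketch on stand-in data (`X_demo` and the other names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 60 samples lying close to a 2D plane in 3D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2)) @ np.array([[1.0, 0.5, 0.2], [0.0, 0.7, 0.3]])
X_demo += 0.05 * rng.normal(size=X_demo.shape)

pca_demo = PCA(n_components=0.95)
X_demo_reduced = pca_demo.fit_transform(X_demo)
X_demo_recovered = pca_demo.inverse_transform(X_demo_reduced)

# Mean squared distance between each original point and its reconstruction.
reconstruction_mse = np.mean(np.sum((X_demo - X_demo_recovered) ** 2, axis=1))

# Relative error: squared residual over the total squared deviation from the mean.
total_sq = np.sum((X_demo - X_demo.mean(axis=0)) ** 2)
relative_error = np.sum((X_demo - X_demo_recovered) ** 2) / total_sq

# This equals the share of variance that the kept components do not explain.
unexplained = 1.0 - np.sum(pca_demo.explained_variance_ratio_)

print(reconstruction_mse)
print(relative_error, unexplained)
```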

* ### Randomized PCA
    * If svd_solver is set to "randomized", Scikit-Learn uses a stochastic algorithm called Randomized PCA, which quickly finds an approximation of the first d principal components.

    * This is significantly faster than full SVD when the number of desired principal components (d) is much smaller than the original dimensionality (n).

    * By default, svd_solver is set to "auto", in which case Scikit-Learn picks a solver based on the shape of the data and on n_components.

    * To force Scikit-Learn to use the full SVD approach, set the svd_solver hyperparameter explicitly to "full".

In [44]:
rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)
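
Because the randomized solver is approximate, it can be worth checking it against the full solver. The sketch below uses a larger stand-in dataset (`X_wide`, its shape, and its latent structure are all assumptions chosen for illustration) and compares the explained variance ratios of the two solvers:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 500 samples, 50 features, driven by
# 5 latent factors with clearly separated scales, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])
mixing = np.linalg.qr(rng.normal(size=(50, 5)))[0].T  # 5 orthonormal directions
X_wide = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

# Exact solver versus the stochastic approximation.
full_pca = PCA(n_components=2, svd_solver="full").fit(X_wide)
rnd_pca_demo = PCA(n_components=2, svd_solver="randomized", random_state=42).fit(X_wide)

# With a clear gap between the leading components, the two should agree closely.
ratios_agree = np.allclose(
    full_pca.explained_variance_ratio_,
    rnd_pca_demo.explained_variance_ratio_,
    rtol=1e-3,
)

print(full_pca.explained_variance_ratio_)
print(rnd_pca_demo.explained_variance_ratio_)
print(ratios_agree)
```

On a dataset as small as the 60 x 3 example in this notebook, the speed advantage of the randomized solver is unlikely to be noticeable.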

* ### Incremental PCA (IPCA)
    * Incremental PCA (IPCA) is a variant of PCA that computes the principal components incrementally, without needing the whole training set at once.

    * IPCA operates on mini-batches of data, so it uses far less memory than batch PCA when the dataset is too large to fit into memory.

In [45]:
from sklearn.decomposition import IncrementalPCA

n_batches = 10
inc_pca = IncrementalPCA(n_components=2)
for X_batch in np.array_split(X, n_batches):
    print(".", end="") 
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X)

..........
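
As an alternative to the explicit `partial_fit` loop, `IncrementalPCA` accepts a `batch_size` argument and splits the data into mini-batches itself when `fit` is called. A minimal sketch on stand-in data (`X_demo` and the other names are assumptions for illustration), which also compares the resulting directions with those of regular PCA:

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Stand-in data (an assumption): 60 samples lying close to a 2D plane in 3D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2)) @ np.array([[1.0, 0.5, 0.2], [0.0, 0.7, 0.3]])
X_demo += 0.05 * rng.normal(size=X_demo.shape)

# fit() processes the data in chunks of batch_size samples internally.
inc_pca_demo = IncrementalPCA(n_components=2, batch_size=10)
X_inc = inc_pca_demo.fit_transform(X_demo)

# Regular PCA on the same data, for comparison.
pca_demo = PCA(n_components=2).fit(X_demo)

# Absolute cosine between matching components (1.0 means the same direction,
# up to sign). IPCA is approximate, so values slightly below 1.0 are expected.
alignment = np.abs(np.sum(inc_pca_demo.components_ * pca_demo.components_, axis=1))

print(inc_pca_demo.n_samples_seen_, X_inc.shape)
print(alignment)
```

Each mini-batch must contain at least `n_components` samples, which is worth keeping in mind when choosing the batch size.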