<a href="https://colab.research.google.com/github/KhotNoorin/Machine-Learning-/blob/main/PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Principal Component Analysis:

---


PCA (Principal Component Analysis) is a dimensionality reduction technique used in machine learning and data analysis. It helps reduce the number of features in a dataset while preserving as much variability (information) as possible.

---
Key Concepts of PCA:

| Term                         | Meaning                                                                     |
| ---------------------------- | --------------------------------------------------------------------------- |
| **Principal Components**     | New axes (directions) in the data that capture the most variance.           |
| **Dimensionality Reduction** | Reducing the number of input variables while keeping important information. |
| **Orthogonal Components**    | Principal components are mutually orthogonal directions, so the data projected onto them is uncorrelated. |


---


How PCA Works:



1.   Standardize the data: Subtract each feature's mean and divide by its standard deviation.
2.   Compute the covariance matrix: It measures how pairs of features vary together.
3.   Calculate the eigenvectors and eigenvalues of the covariance matrix:
     *  Eigenvectors: Directions of maximum variance (the principal components).
     *  Eigenvalues: Amount of variance along each of those directions.

4.   Sort the eigenvalues and select the top-k components: Keep the k eigenvectors with the largest eigenvalues.

5.   Project the data: Multiply the standardized data by the matrix of selected eigenvectors.
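
The five steps above can be sketched directly in NumPy. This is a minimal illustration, not a reference implementation: the helper name `pca_eig` and the random toy array are assumptions made for the example, and `np.linalg.eigh` is chosen because the covariance matrix is symmetric.

```python
import numpy as np

def pca_eig(X, k):
    """Hypothetical helper: project X (n_samples, n_features) onto its top-k components."""
    # Step 1: standardize each feature (zero mean, unit variance)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the features (columns are the variables)
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order
    # and the matching eigenvectors as columns
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: indices of the k largest eigenvalues, largest first
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]              # shape (n_features, k)
    # Step 5: project the standardized data onto the selected eigenvectors
    return Z @ W, eigvals[order]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))      # toy data, assumed for illustration
scores, top_vals = pca_eig(X_demo, k=2)
print(scores.shape)                    # (50, 2)
```

The covariance matrix of `scores` is diagonal with `top_vals` on the diagonal, which is the sense in which the components are uncorrelated.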


---


When to Use PCA?

*  To remove noise or multicollinearity

*  To visualize high-dimensional data

*  As a preprocessing step before clustering or classification
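
As one possible illustration of the preprocessing use case, PCA can sit inside a scikit-learn pipeline ahead of a classifier. The choice of the bundled Iris dataset, two components, and `LogisticRegression` are assumptions for this sketch, not something the notebook prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scale, reduce 4 features to 2 components, then classify
clf = make_pipeline(StandardScaler(), PCA(n_components=2),
                    LogisticRegression(max_iter=1000))
clf.fit(X_iris, y_iris)

# The 2-D representation that the classifier actually sees
X_2d = clf[:-1].transform(X_iris)
print(X_2d.shape)                  # (150, 2)
print(clf.score(X_iris, y_iris))   # training accuracy
```

Keeping the scaler and PCA inside the pipeline means they are fitted only on whatever data is passed to `fit`, which avoids leaking test data into the projection.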


---
Limitations

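*  PCA is linear: it cannot capture non-linear relationships between features.

*  Components are harder to interpret, because each one is a linear combination of the original features.

*  PCA is sensitive to feature scaling: features with larger variance dominate the components unless the data is standardized first.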
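
The scaling issue can be seen on a small synthetic example. The data below is made up for illustration: two independent features, one with a standard deviation 1000 times larger than the other.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent features on very different scales (toy data)
X_raw = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

pca_raw = PCA(n_components=1).fit(X_raw)
pca_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X_raw))

print(np.abs(pca_raw.components_))          # almost entirely the large-scale feature
print(pca_raw.explained_variance_ratio_)    # close to 1 without scaling
print(pca_scaled.explained_variance_ratio_) # near 0.5 once both features are standardized
```

Without standardization the first component simply follows the feature with the largest numeric range, regardless of whether that feature is the most informative.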



---




# Using sklearn

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

In [None]:
# Example data
X = np.array([[1, 2], [3, 4], [5, 6]])

In [None]:
# Step 1: Standardize
X_std = StandardScaler().fit_transform(X)

In [None]:
# Step 2–5: Apply PCA
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_std)

In [None]:
print("Reduced Data:\n", X_pca)

Reduced Data:
 [[-1.73205081]
 [ 0.        ]
 [ 1.73205081]]
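
To see why a single component is enough for this toy array, the fitted `PCA` object can be inspected. The sketch below repeats the setup so it runs on its own; `components_`, `explained_variance_ratio_`, and `inverse_transform` are standard scikit-learn attributes and methods.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2], [3, 4], [5, 6]])
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=1).fit(X_std)

print(pca.components_)                # direction of the first component (sign is arbitrary)
print(pca.explained_variance_ratio_)  # close to 1.0: the two columns are perfectly correlated

# Map the 1-D scores back to the standardized 2-D space
X_back = pca.inverse_transform(pca.transform(X_std))
print(np.allclose(X_back, X_std))     # reconstruction matches because no variance was discarded
```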


# Manual PCA

In [None]:
import numpy as np
import pandas as pd

In [None]:
np.random.seed(23)

In [None]:
mu_vec1 = np.array([0,0,0])
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, 20)

df = pd.DataFrame(class1_sample, columns=['feature1','feature2','feature3'])
df['target'] = 1

mu_vec2 = np.array([1,1,1])
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, 20)

df1 = pd.DataFrame(class2_sample, columns=['feature1','feature2','feature3'])
df1['target'] = 0

# Use pd.concat instead of append
df = pd.concat([df, df1], ignore_index=True)

# Shuffle the DataFrame
df = df.sample(40, random_state=23).reset_index(drop=True)

print(df)

    feature1  feature2  feature3  target
0   0.260486 -0.586113 -1.226884       1
1   0.959423  2.006987  1.798349       0
2   0.762980  0.786629 -0.061577       1
3   0.332272 -1.462440  0.686414       0
4  -0.093170  1.413483  0.158022       0
5   2.388686  2.312095  1.499052       0
6  -1.353366 -0.827515 -0.258705       1
7   0.828560 -0.055784 -0.992063       1
8  -1.183690 -0.635805 -0.103391       1
9   1.082238  0.224355  0.736096       0
10 -0.320805  0.417779  0.658913       0
11  1.852804 -1.354357  1.136932       0
12  0.697495  1.863479 -0.677748       0
13  0.993480  0.428904 -1.569439       1
14  0.512206 -0.202883  0.213886       1
15  0.736943 -0.741974  0.590850       0
16 -1.191042 -0.811582  0.211131       1
17  1.384935 -0.141715 -0.524921       1
18  0.455987 -0.197762  0.614114       0
19  2.028214 -0.127736  1.209491       1
20  1.525839  0.905326  3.337835       0
21 -0.297508 -0.333986 -0.263938       1
22 -0.947339 -1.449357 -2.068311       1
23 -0.716745 -0.

In [None]:
import plotly.express as px
# 3D scatter of the three features, colored by class label
fig = px.scatter_3d(df, x='feature1', y='feature2', z='feature3',
              color=df['target'].astype('str'))
fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

In [None]:
# Step 1 - Apply standard scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df.iloc[:,0:3] = scaler.fit_transform(df.iloc[:,0:3])

In [None]:
# Step 2 - Find Covariance Matrix
covariance_matrix = np.cov([df.iloc[:,0],df.iloc[:,1],df.iloc[:,2]])
print('Covariance Matrix:\n', covariance_matrix)

Covariance Matrix:
 [[1.02564103 0.43350338 0.19540141]
 [0.43350338 1.02564103 0.22729312]
 [0.19540141 0.22729312 1.02564103]]
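
The diagonal entries above are 1.02564103 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (dividing by n), while `np.cov` divides by n - 1 by default. With 40 rows that ratio is 40/39 ≈ 1.0256. A small sketch on stand-in random data (not the notebook's data) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))              # stand-in data with 40 rows
Z = StandardScaler().fit_transform(X_toy)

cov_default = np.cov(Z, rowvar=False)             # divides by n - 1
cov_pop = np.cov(Z, rowvar=False, bias=True)      # divides by n

print(np.diag(cov_default))   # each entry equals 40/39
print(np.diag(cov_pop))       # each entry equals 1
```

The choice of normalization only rescales all eigenvalues by the same factor, so it does not change the principal directions.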


In [None]:
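# Step 3 - Find the eigenvalues and eigenvectors
# np.linalg.eig returns the eigenvectors as columns and does not sort the eigenvalues
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)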

In [None]:
eigen_values

array([1.61170658, 0.59060671, 0.87460979])

In [None]:
eigen_vectors

array([[-0.62266465, -0.68881413, -0.37124631],
       [-0.63688455,  0.72176862, -0.27097624],
       [-0.4546062 , -0.06771371,  0.88811489]])
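
Note that the eigenvalues printed above are not in descending order. A short sketch, using the values copied from that output, ranks them and computes the share of total variance each one explains:

```python
import numpy as np

# Eigenvalues copied from the output above
eigen_values = np.array([1.61170658, 0.59060671, 0.87460979])

order = np.argsort(eigen_values)[::-1]             # indices from largest to smallest
ratio = eigen_values[order] / eigen_values.sum()   # share of total variance per component

print(order)             # [0 2 1]
print(ratio.round(4))    # [0.5238 0.2842 0.1919]
print(ratio[:2].sum())   # roughly 0.81 for the top two components
```

So the two leading components correspond to columns 0 and 2 of `eigen_vectors`, and together they retain about 81% of the variance.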

In [None]:
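# Step 4 - Keep the eigenvectors of the two largest eigenvalues.
# Eigenvectors are the columns of eigen_vectors and eigen_values is unsorted,
# so select columns by descending eigenvalue instead of slicing rows.
order = np.argsort(eigen_values)[::-1]
pc = eigen_vectors[:, order[:2]].T   # shape (2, 3): one component per row
pc

array([[-0.62266465, -0.63688455, -0.4546062 ],
       [-0.37124631, -0.27097624,  0.88811489]])

In [None]:
# Step 5 - Project the standardized data onto the two components
transformed_df = np.dot(df.iloc[:,0:3],pc.T)
# (40,3) x (3,2) -> (40,2)
new_df = pd.DataFrame(transformed_df,columns=['PC1','PC2'])
new_df['target'] = df['target'].values
new_df.head()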


In [None]:
new_df['target'] = new_df['target'].astype('str')
fig = px.scatter(x=new_df['PC1'],
                 y=new_df['PC2'],
                 color=new_df['target'],
                 color_discrete_sequence=px.colors.qualitative.G10
                )

fig.update_traces(marker=dict(size=12,
                              line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
fig.show()
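
As a sanity check, a projection built from a manual eigen-decomposition can be compared with scikit-learn's `PCA`. The sketch below is self-contained and uses made-up correlated data (the mixing matrix is an arbitrary assumption), so it does not reproduce the notebook's numbers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(23)
# Toy correlated data: independent normals mixed by an arbitrary matrix
mix = np.array([[1.0, 0.5, 0.2], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X_toy = rng.normal(size=(40, 3)) @ mix
Z = StandardScaler().fit_transform(X_toy)

# Manual route: eigenvectors (columns) sorted by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
manual_scores = Z @ top2

# Library route
sk_scores = PCA(n_components=2).fit_transform(Z)

# The sign of each component is arbitrary, so compare absolute values
print(np.allclose(np.abs(manual_scores), np.abs(sk_scores)))
```

The two routes should agree up to a sign flip per component, since an eigenvector multiplied by -1 is still an eigenvector.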

# PCA on the MNIST dataset from Kaggle:




