## Dimensionality Reduction 
- A data preprocessing technique that reduces the number of input features (dimensions) in a dataset while keeping as much of the important information as possible

### WHY?
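- Can reduce overfitting by removing noisy or redundant features
- Can speed up training and improve model performance
- Mitigates the curse of dimensionality (data becomes sparse as the number of features grows)
- Makes the data easier to visualize in 2D or 3D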

## 2 Main types 
### 1. Feature Selection 
- Keep the original features, but remove the unimportant ones
### 2. Feature Extraction
- Create new features by combining the existing ones
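
The difference between the two types can be seen side by side on the wine dataset used below. This is a minimal sketch, not a recommendation: `SelectKBest` with `f_classif` and the choice of 5 features are assumptions made purely for illustration, and PCA stands in as one example of feature extraction.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X, y = wine.data, wine.target

# Feature selection: keep 5 of the 13 original columns, ranked by ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
kept = [wine.feature_names[i] for i in selector.get_support(indices=True)]
print("Selected shape:", X_selected.shape)
print("Kept features:", kept)

# Feature extraction: build 5 new columns, each a linear mix of all 13 originals
X_extracted = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))
print("Extracted shape:", X_extracted.shape)
```

Both results have 5 columns, but the selected columns are still named wine measurements, while the extracted columns are new combinations with no single original meaning.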

In [1]:
# Set up the env
!pip install pandas numpy scikit-learn matplotlib



In [2]:
from sklearn.datasets import load_wine
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# Load the wine dataset
wine = load_wine()

# Create a dataframe
X = pd.DataFrame(wine.data, columns = wine.feature_names)
y = wine.target

In [5]:
print(X.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                  

In [6]:
print("Shape:", X.shape)

Shape: (178, 13)


# Principal Component Analysis

In [None]:
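from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features first; PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 dimensions
pca = PCA(n_components = 2)
X_pca = pca.fit_transform(X_scaled)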

print("Original Shape:", X.shape)
print("Reduced Shape:", X_pca.shape)

# Visualize the reduced data

In [None]:
plt.figure(figsize = (8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c = y)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Wine Dataset")
plt.colorbar(label = "Wine Class")
plt.show()

# Mathematics Behind PCA

PCA is based on the following ideas:

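1. Standardization
   - All features are scaled to mean = 0 and standard deviation = 1 (scikit-learn's `PCA` only centers the data, so scale it yourself with `StandardScaler`)
   - This prevents features measured in large units from dominating the result

2. Covariance Matrix
   - Measures how each pair of features varies together
   - PCA looks for the directions along which the data has the highest variance

3. Eigenvectors and Eigenvalues
   - Eigenvectors of the covariance matrix → directions of maximum variance (principal components)
   - Eigenvalues → amount of variance captured along each direction

4. Projection
   - The standardized data is projected onto the top principal components (those with the largest eigenvalues)
   - This gives the reduced-dimension representation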
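
The four steps above can be sketched directly in NumPy. This is an illustrative from-scratch version on the wine data, not how scikit-learn implements `PCA` internally (it uses an SVD); the variable names are chosen here for the example.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine().data

# 1. Standardization: mean 0 and standard deviation 1 for every feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (13 x 13)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so re-sort them in descending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Projection onto the top 2 principal components
X_proj = X_std @ eigvecs[:, :2]

# Share of the total variance captured by each component
explained = eigvals / eigvals.sum()
print("Projected shape:", X_proj.shape)
print("Variance share of first two components:", explained[:2])
```

The projected coordinates should match `PCA(n_components=2).fit_transform(X_std)` up to the sign of each column, since an eigenvector and its negation describe the same direction.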

| Method | Type       | Main Purpose              | Intuition                                   |
| ------ | ---------- | ------------------------- | ------------------------------------------- |
| PCA    | Linear     | Compression + speed       | Finds straight directions with max variance |
| t-SNE  | Non-linear | Visualization             | Keeps nearby points close                   |
| UMAP   | Non-linear | Visualization + structure | Preserves local and some global structure   |


# PCA

- Linear technique
- Good for feature reduction
- Preserves global variance
- Fast and interpretable
- Not great for complex non-linear patterns

Visual intuition:
Data is flattened onto straight axes.
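
One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below assumes a 95% threshold, which is a common rule of thumb rather than something this notebook prescribes; passing a float between 0 and 1 as `n_components` asks scikit-learn to pick the smallest number of components that reaches that share.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Fit with all 13 components to inspect how the variance accumulates
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, share in enumerate(cumulative, start=1):
    print(f"{i:2d} components: {share:.3f}")

# Let PCA choose the smallest number of components reaching 95% variance
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("Components needed for 95% variance:", pca_95.n_components_)
```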

# t-SNE

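- Non-linear
- Best for 2D or 3D visualization
- Preserves local neighborhoods
- Computationally expensive
- Poorly suited to ML pipelines: scikit-learn's `TSNE` offers `fit_transform` but no `transform` for new, unseen data

Visual intuition:
Clusters are pulled apart to clearly show groups.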
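
A minimal t-SNE sketch on the wine data follows. The `perplexity=30` and `random_state=42` values are assumptions for illustration; t-SNE output depends noticeably on both, so different settings can produce different-looking plots.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# Embed the 13 standardized features into 2 dimensions
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
print("Embedded shape:", X_tsne.shape)

# TSNE cannot embed new samples after fitting, unlike PCA
print("Has transform():", hasattr(tsne, "transform"))
```

`X_tsne` can be passed to `plt.scatter` in the same way as `X_pca` in the PCA plot, using `wine.target` for the colors.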

# UMAP

- Non-linear
- Usually faster than t-SNE
- Preserves local structure and some global structure
- Good for large datasets

Visual intuition:
Like t-SNE, but typically more stable and scalable.

# PCA using sklearn Pipeline

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

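# Chain scaling, PCA, and the classifier so they are always fit together
pipeline = Pipeline([
    ('scaler', StandardScaler()),                      # standardize each feature
    ('pca', PCA(n_components=6)),                      # keep 6 principal components
    ('classifier', LogisticRegression(max_iter=200))   # classify in the reduced space
])

pipeline.fit(X, y)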
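
Fitting on the full dataset gives no estimate of how well the pipeline generalizes. A hedged sketch of one way to check is 5-fold cross-validation; the fold count is an assumption. Because the scaler and PCA sit inside the pipeline, they are refit on the training part of each fold, so no information leaks from the held-out part.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=6)),
    ('classifier', LogisticRegression(max_iter=200))
])

# Each fold refits the scaler, PCA, and classifier on its own training split
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```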


# Dimensionality Reduction for Images

In [None]:
!pip install tensorflow

In [None]:
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import numpy as np
import os

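# NOTE: data_dir is not defined in this notebook; it must point to a folder
# that contains one subfolder of images per class.
images = []
labels = []

for class_name in sorted(os.listdir(data_dir)):
    class_path = os.path.join(data_dir, class_name)
    if not os.path.isdir(class_path):
        continue  # skip stray files that are not class folders
    # Use at most 100 images per class to keep memory use manageable
    for img_name in sorted(os.listdir(class_path))[:100]:
        img = load_img(os.path.join(class_path, img_name), target_size=(64, 64))
        # 64 x 64 RGB image -> flat vector of 64 * 64 * 3 = 12288 values
        img = img_to_array(img).flatten()
        images.append(img)
        labels.append(class_name)

X_images = np.array(images)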


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

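# Standardize each pixel position across all images
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_images)

# Compress each flattened image vector to 100 principal components
pca = PCA(n_components=100)
X_pca = pca.fit_transform(X_scaled)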
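
To judge how much information the compression keeps, the retained variance and the reconstruction error can be checked. The sketch below is runnable without an image folder, so it makes two assumptions: it uses scikit-learn's built-in 8x8 digits dataset as a stand-in for the images loaded above, and it keeps 16 components instead of 100 because each digit has only 64 pixels.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 1797 grayscale 8x8 digit images, already
# flattened to 64 values each, instead of the notebook's image folder
X_digits = load_digits().data

scaler = StandardScaler()
X_digits_scaled = scaler.fit_transform(X_digits)

pca_digits = PCA(n_components=16)
X_digits_pca = pca_digits.fit_transform(X_digits_scaled)

# Fraction of the total variance retained by the 16 components
retained = pca_digits.explained_variance_ratio_.sum()

# Map back to pixel space to measure what the compression lost
X_restored = scaler.inverse_transform(pca_digits.inverse_transform(X_digits_pca))
mse = np.mean((X_digits - X_restored) ** 2)

print("Compressed shape:", X_digits_pca.shape)
print(f"Variance retained: {retained:.1%}")
print(f"Reconstruction MSE (pixel units): {mse:.3f}")
```

The same two checks apply to `pca` and `X_scaled` from the cell above; reshaping a row of the restored array back to the image dimensions lets you inspect the loss visually.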
