#  **SHETH L.U.J. & SIR M.V. COLLEGE**

# Swati Mahajan T093

# Practical No. 09

**Principal Component Analysis (PCA)**
* Perform PCA on a dataset to reduce dimensionality.
* Evaluate the explained variance and select the appropriate number of principal components.
* Visualize the data in the reduced-dimensional space.

# Load Dataset

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 1. SETUP: Load and Prepare Data
df = pd.read_csv('kc_house_data.csv')

# --- Define Features (X) and Target (y) ---
FEATURE_COLUMNS = ['sqft_living', 'sqft_lot', 'price', 'yr_built']
X = df[FEATURE_COLUMNS].values # Features for PCA/LDA

# Create a simplified 3-class target 'y' for LDA (supervised method)
# Low (grade <= 6), Medium (grade 7-9), High (grade >= 10)
y = np.select(
    [df['grade'] <= 6, (df['grade'] >= 7) & (df['grade'] <= 9), df['grade'] >= 10],
    [0, 1, 2]
)

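# Standardize the feature matrix to zero mean and unit variance.
# PCA and RBF KernelPCA are scale-sensitive: without this step, large-valued columns such as price would dominate.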
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
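
To illustrate why the standardization step matters for PCA, the following is a minimal sketch on synthetic data. The array `X_demo` and its scales are illustrative assumptions, not values from `kc_house_data.csv`. It compares the explained variance ratio of PCA on raw values with PCA on standardized values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two independent features on very different scales.
# The numbers are illustrative assumptions, not values from kc_house_data.csv.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(2_000, 900, size=500),        # roughly "square feet"
    rng.normal(500_000, 300_000, size=500),  # roughly "price"
])

# PCA on the raw values: the large-scale column dominates the first component.
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_

# PCA after standardization: both columns contribute on an equal footing.
X_demo_std = StandardScaler().fit_transform(X_demo)
std_ratio = PCA().fit(X_demo_std).explained_variance_ratio_

print("Explained variance ratio (raw):         ", raw_ratio.round(4))
print("Explained variance ratio (standardized):", std_ratio.round(4))
```

On the raw data, nearly all of the variance is attributed to the first component simply because one column has much larger numbers. After scaling, the split reflects the actual relationship between the columns.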

# Reducing Features Using Principal Components (PCA)

In [3]:
print("\n--- Principal Component Analysis (PCA) ---")

# Create a PCA that retains at least 90% of the variance
# A float n_components between 0 and 1 keeps the fewest components whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.90)

# Conduct PCA (Fit and transform the standardized data)
X_pca = pca.fit_transform(X_standardized)

# Show results
print(f"Original number of features: {X_standardized.shape[1]}")
print(f"Reduced number of features: {X_pca.shape[1]}")
print(f"Variance Explained by the components: {pca.explained_variance_ratio_.sum():.4f}")
print("\nFirst 5 rows of the Reduced Feature Matrix (X_pca):")
print(X_pca[:5])


--- Principal Component Analysis (PCA) ---
Original number of features: 4
Reduced number of features: 3
Variance Explained by the components: 0.9388

First 5 rows of the Reduced Feature Matrix (X_pca):
[[-1.42881554 -0.00901084  0.13673587]
 [ 0.10451991 -0.60896498  0.2731242 ]
 [-2.01604461 -0.32835442  0.71387946]
 [-0.10136385 -0.34515009 -0.02243807]
 [-0.21605382  0.32648514 -0.46824389]]
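
The choice of 3 components above comes from the cumulative explained variance. The sketch below shows one way to inspect that curve directly and pick the smallest number of components that reaches a threshold. It uses synthetic data (`X_demo` is a hypothetical stand-in with two strongly correlated columns), so the numbers will differ from the housing results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 4 features, the first two strongly correlated.
rng = np.random.default_rng(42)
base = rng.normal(size=(1000, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(1000, 1)),
    base + 0.1 * rng.normal(size=(1000, 1)),
    rng.normal(size=(1000, 1)),
    rng.normal(size=(1000, 1)),
])
X_demo_std = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components kept, then accumulate the variance ratios.
pca_full = PCA().fit(X_demo_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)

for i, (ratio, total) in enumerate(
    zip(pca_full.explained_variance_ratio_, cumulative), start=1
):
    print(f"PC{i}: ratio={ratio:.4f}  cumulative={total:.4f}")
print(f"Components needed for {threshold:.0%} variance: {k}")
```

Passing the same threshold as a float to `PCA(n_components=...)` should select the same number of components, so the manual calculation is mainly useful for seeing how close each candidate is to the cutoff.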


# Reducing Features When Data Is Linearly Inseparable (KernelPCA)
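Note: KernelPCA is intended for data with non-linear structure that linear PCA cannot capture, such as image or text data. It is demonstrated here on the same standardized housing features for consistency with the other sections.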

In [4]:
print("\n--- Kernel PCA (KPCA) ---")

# Create a KernelPCA that will reduce the data to 2 components
kpca = KernelPCA(kernel="rbf", gamma=1, n_components=2)

# Conduct KPCA
X_kpca = kpca.fit_transform(X_standardized)

# Show results
print(f"Original number of features: {X_standardized.shape[1]}")
print(f"Reduced number of features (n_components=2): {X_kpca.shape[1]}")
print("\nFirst 5 rows of the Reduced Feature Matrix (X_kpca):")
print(X_kpca[:5])


--- Kernel PCA (KPCA) ---
Original number of features: 4
Reduced number of features (n_components=2): 2

First 5 rows of the Reduced Feature Matrix (X_kpca):
[[ 0.60642762  0.17693025]
 [ 0.08674685 -0.18320531]
 [ 0.34381513 -0.24840479]
 [ 0.11414948  0.11168343]
 [-0.299246    0.45622665]]


# Reducing Features by Maximizing Class Separability (LDA)
Note: LDA can produce at most min(C - 1, p) components, where C is the number of classes and p is the number of features.

Our target 'y' has 3 classes (0, 1, 2) and X has 4 features, so the maximum number of components is 2.
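
The component limit can be checked with a short sketch on synthetic data. The arrays `X_blobs` and `y_blobs` are hypothetical stand-ins with 3 classes and 4 features, matching the shape of the setup above.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in: 3 classes, 4 features.
X_blobs, y_blobs = make_blobs(
    n_samples=300, centers=3, n_features=4, random_state=0
)

n_classes = len(set(y_blobs))
n_features = X_blobs.shape[1]
max_components = min(n_classes - 1, n_features)

# With the default n_components=None, scikit-learn uses the maximum allowed.
X_blobs_lda = LinearDiscriminantAnalysis().fit_transform(X_blobs, y_blobs)
print(f"Maximum LDA components: {max_components}")
print(f"Shape after LDA: {X_blobs_lda.shape}")

# Requesting more than the maximum is rejected in recent scikit-learn versions.
try:
    LinearDiscriminantAnalysis(n_components=max_components + 1).fit(
        X_blobs, y_blobs
    )
    too_many_rejected = False
except ValueError as err:
    too_many_rejected = True
    print("ValueError:", err)
```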

In [5]:
print("\n--- Linear Discriminant Analysis (LDA) ---")

# Create an LDA that reduces the data to 2 components (the maximum possible)
lda = LinearDiscriminantAnalysis(n_components=2)

# Run LDA and use it to transform the features
X_lda = lda.fit(X_standardized, y).transform(X_standardized)

# Print the number of features
print(f"Original number of features: {X_standardized.shape[1]}")
print(f"Reduced number of features (max possible): {X_lda.shape[1]}")

# View the ratio of explained variance
print("Ratio of explained variance by each component:")
print(lda.explained_variance_ratio_)

print("\nFirst 5 rows of the Reduced Feature Matrix (X_lda):")
print(X_lda[:5])


--- Linear Discriminant Analysis (LDA) ---
Original number of features: 4
Reduced number of features (max possible): 2
Ratio of explained variance by each component:
[0.93295347 0.06704653]

First 5 rows of the Reduced Feature Matrix (X_lda):
[[ 1.44850459  0.03845149]
 [-0.08615062  0.42678322]
 [ 2.11780599  0.70507866]
 [ 0.05533933  0.23678207]
 [ 0.13481206 -0.47797024]]
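
The third objective, visualizing the data in the reduced-dimensional space, can be sketched with a scatter plot of the first two components. The example below assumes matplotlib is installed and uses synthetic data (`X_vis`, `y_vis`) rather than the housing dataset. The output file name `reduced_space.png` is also an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 3 classes, 4 features.
X_vis, y_vis = make_blobs(n_samples=300, centers=3, n_features=4, random_state=1)
X_vis_std = StandardScaler().fit_transform(X_vis)

# Project to two dimensions with each method.
X_vis_pca = PCA(n_components=2).fit_transform(X_vis_std)
X_vis_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(
    X_vis_std, y_vis
)

# One scatter plot per method, with points colored by class label.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, data, title in zip(axes, (X_vis_pca, X_vis_lda), ("PCA", "LDA")):
    ax.scatter(data[:, 0], data[:, 1], c=y_vis, cmap="viridis", s=15)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.set_title(f"{title} projection")
fig.tight_layout()
fig.savefig("reduced_space.png", dpi=120)
```

To apply the same idea to the practical's data, `X_pca[:, :2]` or `X_lda` could be plotted with `c=y` in place of the synthetic arrays.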
