## 📊 PCA vs. LDA – Dimensionality Reduction Comparison
You are provided with a synthetic multiclass classification dataset named `synthetic_multiclass_data.csv`.

### Your Task:

**Data Cleaning**

Load the dataset.

Handle missing values appropriately.

Remove duplicate records.

**Model Building (Without Dimensionality Reduction)**

Train a classification model (Logistic Regression, Random Forest, etc.) on the original dataset.

Evaluate model performance using accuracy, confusion matrix, and classification report.

**Apply PCA (Principal Component Analysis)**

Reduce the dimensionality of your dataset using PCA.

Retain components that explain at least 95% of the variance.

Train the same classification model and compare performance.

**Apply LDA (Linear Discriminant Analysis)**

Apply LDA to reduce dimensions (based on class separability).

Train the same classification model and compare performance.

**Compare & Analyze**

Which technique (PCA or LDA) helped improve model performance the most?

Explain why the result might differ between PCA and LDA in this case.


In [235]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Generate synthetic classification data
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=3,
    random_state=42
)

# Create DataFrame
columns = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=columns)
df['target'] = y

# Introduce missing values
df.iloc[10:20, 0] = np.nan  # Add NaNs in feature_0

# Add some duplicate rows
df = pd.concat([df, df.iloc[0:5]], ignore_index=True)

In [236]:
df

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,target
0,-0.276320,-2.963594,3.667779,-2.063303,-1.256963,-1.122153,-0.756444,1.872198,-0.816038,0.497937,2
1,-0.116510,-1.307850,-0.620535,-0.605216,-0.204580,-0.930664,1.115350,-3.078705,1.664802,-3.197971,1
2,-2.256250,2.980471,-0.768651,2.359745,-0.658595,-0.579045,3.313295,4.833280,-1.178440,1.513542,2
3,-0.087256,-1.075654,1.139018,-0.447514,-0.385106,1.164583,0.962627,0.267886,0.853421,-1.067666,2
4,1.507634,-3.801445,-0.331507,1.757233,-0.406608,-0.085608,0.841192,-0.810027,-0.692152,-0.125343,1
...,...,...,...,...,...,...,...,...,...,...,...
500,-0.276320,-2.963594,3.667779,-2.063303,-1.256963,-1.122153,-0.756444,1.872198,-0.816038,0.497937,2
501,-0.116510,-1.307850,-0.620535,-0.605216,-0.204580,-0.930664,1.115350,-3.078705,1.664802,-3.197971,1
502,-2.256250,2.980471,-0.768651,2.359745,-0.658595,-0.579045,3.313295,4.833280,-1.178440,1.513542,2
503,-0.087256,-1.075654,1.139018,-0.447514,-0.385106,1.164583,0.962627,0.267886,0.853421,-1.067666,2


**Importing Libraries**

In [238]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

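# Import only the metrics used below instead of a wildcard import
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix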
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

**Data Cleaning**

In [240]:
df.isnull().sum()

feature_0    10
feature_1     0
feature_2     0
feature_3     0
feature_4     0
feature_5     0
feature_6     0
feature_7     0
feature_8     0
feature_9     0
target        0
dtype: int64

In [241]:
df['feature_0'].mean()

-0.030855533834614143

In [242]:
df.shape

(505, 11)

In [243]:
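# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about
df['feature_0'] = df['feature_0'].fillna(df['feature_0'].mean())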

In [244]:
df.isnull().sum()

feature_0    0
feature_1    0
feature_2    0
feature_3    0
feature_4    0
feature_5    0
feature_6    0
feature_7    0
feature_8    0
feature_9    0
target       0
dtype: int64

In [245]:
df.duplicated().sum()

5

In [246]:
df.drop_duplicates(inplace = True)

In [247]:
df.duplicated().sum()

0
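
The mean used for imputation above is computed over every row, including rows that later end up in the test split. A minimal sketch of a split-first alternative follows, assuming scikit-learn's `SimpleImputer`. The data is regenerated with the same `make_classification` settings so the block runs on its own, and the names `X_df`, `X_tr`, `X_te`, and `imputer` are illustrative rather than part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Rebuild a dataset with the same generator settings as the notebook (illustrative copy).
X_raw, y_raw = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=2, n_classes=3, random_state=42,
)
cols = [f"feature_{i}" for i in range(X_raw.shape[1])]
X_df = pd.DataFrame(X_raw, columns=cols)
X_df.iloc[10:20, 0] = np.nan  # same missing-value pattern in feature_0

# Split first, so the test rows play no part in the imputation statistic.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_df, y_raw, test_size=0.25, stratify=y_raw, random_state=20
)

# Learn the column means on the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="mean")
X_tr_imp = imputer.fit_transform(X_tr)
X_te_imp = imputer.transform(X_te)

print("Training-only mean of feature_0:", imputer.statistics_[0])
print("NaNs left:", int(np.isnan(X_tr_imp).sum() + np.isnan(X_te_imp).sum()))
```

With only 10 missing values in 500 rows the effect on accuracy is likely small here, but fitting the imputer on the training split keeps the evaluation honest.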

**Model Building (Without Dimensionality Reduction)**

In [249]:
df1 = df.copy()

In [250]:
X = df1.drop(columns = 'target')
y = df1['target']

In [251]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 20)

In [252]:
rf = RandomForestClassifier(random_state = 20)

In [253]:
rf.fit(X_train, y_train)

In [254]:
y_pred = rf.predict(X_test)

In [255]:
accuracy_score(y_test, y_pred)

0.904

In [256]:
confusion_matrix(y_test, y_pred)

array([[41,  0,  1],
       [ 0, 40,  2],
       [ 5,  4, 32]], dtype=int64)

In [257]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.98      0.93        42
           1       0.91      0.95      0.93        42
           2       0.91      0.78      0.84        41

    accuracy                           0.90       125
   macro avg       0.90      0.90      0.90       125
weighted avg       0.90      0.90      0.90       125



**Apply PCA (Principal Component Analysis)**

In [259]:
scaler = StandardScaler()

In [260]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [261]:
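# A float in (0, 1) keeps the fewest components whose cumulative explained variance reaches that fraction
pca = PCA(n_components = 0.95)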

In [262]:
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [263]:
rf.fit(X_train_pca, y_train)

In [264]:
y_pred = rf.predict(X_test_pca)

In [265]:
accuracy_score(y_test, y_pred)

0.672

In [266]:
print(f"Components retained: {pca.n_components_}")

Components retained: 7
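
To see where the number of retained components comes from, the cumulative explained variance can be inspected directly. The sketch below is a hedged illustration on a regenerated copy of the data (without the missing values or duplicates), so its printed numbers need not match the run above. `cumvar`, `n_needed`, and `pca_95` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Regenerate the data with the notebook's generator settings (illustrative copy).
X_raw, y_raw = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=2, n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, stratify=y_raw, random_state=20
)
X_tr_scaled = StandardScaler().fit_transform(X_tr)

# Fit PCA with every component to obtain the full variance spectrum.
pca_full = PCA(svd_solver="full").fit(X_tr_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance exceeds 95%.
n_needed = int(np.searchsorted(cumvar, 0.95, side="right")) + 1

# PCA(n_components=0.95) applies the same threshold when choosing components.
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_tr_scaled)

for k, v in enumerate(cumvar, start=1):
    print(f"{k:2d} components: {v:.3f} cumulative variance")
print("Needed for 95%:", n_needed, "| PCA(0.95) kept:", pca_95.n_components_)
```

Note that the threshold is about variance in the features, not about how well the classes separate, which is why a 95% cut can still discard information the classifier needs.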


In [267]:
confusion_matrix(y_test, y_pred)

array([[33,  3,  6],
       [ 1, 30, 11],
       [10, 10, 21]], dtype=int64)

In [268]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.79      0.77        42
           1       0.70      0.71      0.71        42
           2       0.55      0.51      0.53        41

    accuracy                           0.67       125
   macro avg       0.67      0.67      0.67       125
weighted avg       0.67      0.67      0.67       125



**Apply LDA (Linear Discriminant Analysis)**

In [270]:
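# With the default n_components, LDA keeps min(n_classes - 1, n_features) axes: 2 for this 3-class problem
lda = LinearDiscriminantAnalysis()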

In [271]:
X_train_lda = lda.fit_transform(X_train_scaled, y_train)

In [272]:
X_test_lda = lda.transform(X_test_scaled)

In [273]:
rf.fit(X_train_lda, y_train)

In [274]:
y_pred = rf.predict(X_test_lda)

In [275]:
accuracy_score(y_test, y_pred)

0.736

In [276]:
confusion_matrix(y_test, y_pred)

array([[34,  1,  7],
       [ 2, 35,  5],
       [10,  8, 23]], dtype=int64)

In [277]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.74      0.81      0.77        42
           1       0.80      0.83      0.81        42
           2       0.66      0.56      0.61        41

    accuracy                           0.74       125
   macro avg       0.73      0.73      0.73       125
weighted avg       0.73      0.74      0.73       125
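
One detail that matters when reading these numbers: LDA can produce at most `n_classes - 1` discriminant axes, so with three classes the classifier above is trained on only two features. The following sketch, run on a regenerated copy of the data, simply confirms that shape. It fits on all rows because nothing is evaluated here, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Regenerate the data with the notebook's generator settings (illustrative copy).
X_raw, y_raw = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=2, n_classes=3, random_state=42,
)
X_scaled = StandardScaler().fit_transform(X_raw)

# n_components is left at its default, so LDA keeps min(n_classes - 1, n_features) axes.
lda_demo = LinearDiscriminantAnalysis()
X_lda = lda_demo.fit_transform(X_scaled, y_raw)

n_classes = len(lda_demo.classes_)
print("Input shape: ", X_scaled.shape)
print("Output shape:", X_lda.shape)
print("Explained variance ratio per axis:", lda_demo.explained_variance_ratio_)
```

So the PCA and LDA runs are not compared at equal dimensionality: PCA kept 7 components in the run above, while LDA is capped at 2.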



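✅ **Don’t reduce features unless you have to.**  
On this test split, the Random Forest scored **90.4%** accuracy with all 10 original features, the best of the three setups.

✅ **PCA made it worse** (down to 67.2% with 7 components). PCA is unsupervised: it keeps the directions of highest variance without looking at the labels, so class-separating information in low-variance directions can be lost. It also rotates the feature space, which can hurt a tree-based model that splits on one axis at a time.

✅ **LDA did better than PCA** (73.6%) because it uses the labels to find directions that separate the classes.  
It still fell well short of the original features, partly because with 3 classes it can keep at most 2 components.

💡 **Takeaway**:  
> Neither technique improved on the baseline: for this dataset and model, the original features gave the best accuracy.  
> Reduce dimensions only when you need to (speed, memory, etc.), and measure what it costs in accuracy.  
> If you must reduce here, LDA kept more class information than PCA. These numbers come from a single train/test split, so confirm them with cross-validation before generalizing.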
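
Because every figure above comes from one train/test split, a cross-validated rerun is a useful sanity check. The sketch below is one possible setup, not the notebook's own method: it regenerates the data, wraps scaling and the reducer in a scikit-learn `Pipeline` so each is fit only on the training folds, and reports mean accuracy over 5 stratified folds. The `pipelines` dictionary, its labels, and the `forest` helper are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Regenerate the data with the notebook's generator settings (illustrative copy).
X_raw, y_raw = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_redundant=2, n_classes=3, random_state=42,
)


def forest():
    """Return a fresh classifier so the three pipelines do not share state."""
    return RandomForestClassifier(random_state=20)


# Scaling and reduction live inside the pipeline, so they are refit on each training fold.
pipelines = {
    "original": make_pipeline(forest()),
    "pca_95": make_pipeline(StandardScaler(), PCA(n_components=0.95), forest()),
    "lda": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(), forest()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=20)
scores = {
    name: cross_val_score(pipe, X_raw, y_raw, cv=cv, scoring="accuracy")
    for name, pipe in pipelines.items()
}

for name, s in scores.items():
    print(f"{name:9s} mean={s.mean():.3f} std={s.std():.3f}")
```

If the ranking holds across folds, the conclusion is on firmer ground. If the gaps shrink or the order changes, the single-split numbers were partly noise.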