### Task: Feature Extraction and Dimensionality Reduction using PCA and LDA
### Perform PCA on the provided Employee Productivity Dataset (employee_productivity_pca.csv) and LDA on the provided Vehicle Sensor Dataset (vehicle_sensor_lda.csv). A description of each dataset is also provided as a Word document.

### Submit the Python notebook file (.ipynb) once you have generated the PC and LD components as new features, either in your existing dataframe or in a new dataframe in the notebook. Your notebook should also show how the number of features is reduced by each approach.


# PCA and LDA: Dimensionality Reduction and Feature Extraction

This notebook demonstrates **Principal Component Analysis (PCA)** and **Linear Discriminant Analysis (LDA)** using step-by-step explanations and code.  
You'll see how both techniques transform input features into new reduced features that capture the most important information.

---

### Learning Objectives
- Understand what PCA and LDA are conceptually.
- Learn **why and when** to use each method.
- See **how both methods transform original features** into fewer, more informative ones.
- Visually inspect **before and after** datasets (input → output).


## 1️⃣ Principal Component Analysis (PCA)

**Concept (in simple terms):**
- PCA is an **unsupervised** dimensionality reduction technique.
- It identifies new directions (called **Principal Components**) that capture the **maximum variance** in data.
- These components are combinations of the original features.
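
To make the idea of "directions of maximum variance" concrete, here is a minimal sketch that computes principal components by hand with NumPy. The synthetic data and every variable name (`X_demo`, `Z`, `scores`) are assumptions made for illustration; this is not the assignment dataset, and scikit-learn's `PCA` uses an SVD internally rather than this exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is almost a copy of the first.
x0 = rng.normal(size=200)
X_demo = np.column_stack([
    x0,
    0.9 * x0 + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Standardize, then eigendecompose the covariance matrix.
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# Sort so that the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvector is one principal component (a weighted combination of
# the original features); each eigenvalue is the variance it captures.
variance_ratio = eigvals / eigvals.sum()
scores = Z @ eigvecs  # the data expressed in the new coordinates

print("Variance ratio per component:", variance_ratio.round(3))
```

Because two of the three synthetic features are nearly redundant, the first component absorbs most of their shared variance, which is the kind of redundancy PCA is designed to remove.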

**When to use PCA:**
- When you have many features and want to reduce them while keeping most information.
- For visualization or removing redundancy among correlated variables.

We'll now demonstrate PCA step by step using the Employee Productivity dataset.


In [1]:
# Step 1: Load and inspect the Employee Productivity dataset
import pandas as pd

df_employee = pd.read_csv('employee_productivity_pca.csv')
print("Shape of dataset:", df_employee.shape)
df_employee.head()

Shape of dataset: (160, 10)


Unnamed: 0,Age,Years_At_Company,Monthly_Hours_Worked,Projects_Handled,Avg_Project_Duration,Training_Hours,Performance_Score,Attendance_Score,Job_Satisfaction,Stress_Level
0,56,6.3,166.4,6,8.7,42.7,6.3,8.6,1,5.1
1,40,9.7,157.9,9,7.5,16.5,6.7,7.7,3,8.7
2,42,2.7,118.9,8,9.9,30.2,9.9,8.7,4,3.7
3,39,0.3,169.4,9,7.6,50.9,5.5,6.6,2,4.1
4,40,4.9,198.1,6,10.0,38.7,5.1,8.7,4,5.2


### Step 2: Standardize the features
PCA works best when data is standardized (each feature has mean = 0 and variance = 1).

In [2]:
from sklearn.preprocessing import StandardScaler

# The employee dataset has no target column, so all 10 columns are used as features
X = df_employee
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Before scaling (mean of first feature):", X.iloc[:, 0].mean())
print("After scaling (mean of first feature):", X_scaled[:, 0].mean())


Before scaling (mean of first feature): 35.33125
After scaling (mean of first feature): 3.552713678800501e-16
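
The after-scaling mean printed above is not exactly zero only because of floating-point rounding. As a rough check of what `StandardScaler` does, the sketch below reproduces it by hand on the first three columns of the five preview rows shown earlier. The array name `X_small` is an assumption for illustration, and five rows are far too few for real analysis.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three columns (Age, Years_At_Company, Monthly_Hours_Worked)
# of the five preview rows shown in Step 1.
X_small = np.array([
    [56, 6.3, 166.4],
    [40, 9.7, 157.9],
    [42, 2.7, 118.9],
    [39, 0.3, 169.4],
    [40, 4.9, 198.1],
])

# StandardScaler subtracts the column mean and divides by the
# population standard deviation (ddof=0), column by column.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0, ddof=0)
scaled = StandardScaler().fit_transform(X_small)

print("Matches StandardScaler:", np.allclose(manual, scaled))
print("Column means after scaling:", scaled.mean(axis=0).round(12))
print("Column std devs after scaling:", scaled.std(axis=0).round(12))
```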


### Step 3: Apply PCA
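Let's apply PCA with `n_components=0.90`, which keeps the smallest number of principal components whose cumulative explained variance reaches at least 90%. For this dataset, that reduces the 10 original features to 9 components.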

In [3]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)

print("Original shape:", X_scaled.shape)
print("Reduced shape (after PCA):", X_pca.shape)


Original shape: (160, 10)
Reduced shape (after PCA): (160, 9)


### Step 4: Compare input and output
Below you can see how PCA creates new features, each of which is a linear combination of all 10 original features.

In [4]:
df_pca = pd.DataFrame(X_pca)
df_pca.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,2.507466,0.603342,0.531457,-0.438261,0.737648,-0.158457,-0.258387,-0.543918,0.891092
1,0.961825,-1.636442,0.165319,0.728749,0.114775,0.123155,-1.59481,1.851707,0.161324
2,1.49673,0.051251,-2.762344,0.920684,0.33685,-2.299036,0.06761,0.15975,-0.547162
3,1.394321,-0.949739,0.051207,-0.96464,-0.999406,-0.448369,1.599163,-0.761678,-0.838645
4,1.213244,0.574764,-0.192677,0.317744,-0.154516,1.460977,1.323896,0.567259,-0.015145
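
The columns above are labelled only by position (0 to 8). One possible way to make the new features easier to read is to name them `PC1`, `PC2`, and so on, and to inspect the weight each original feature receives in each component. The sketch below uses random stand-in data with hypothetical feature names (`f1` to `f4`), not the employee dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Stand-in data with hypothetical feature names (assumption for illustration).
feature_names = ["f1", "f2", "f3", "f4"]
df_demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=feature_names)

X_demo_scaled = StandardScaler().fit_transform(df_demo)
pca_demo = PCA(n_components=2)
pcs = pca_demo.fit_transform(X_demo_scaled)

# Name the new features PC1, PC2, ... and keep the original row index.
pc_names = [f"PC{i + 1}" for i in range(pcs.shape[1])]
df_pcs = pd.DataFrame(pcs, columns=pc_names, index=df_demo.index)

# Each row of components_ holds the weights of one component; transposing
# gives one row per original feature and one column per component.
weights = pd.DataFrame(pca_demo.components_.T,
                       index=feature_names, columns=pc_names)

print(df_pcs.head())
print(weights.round(3))
```

Applied to this notebook, the same pattern would be `pd.DataFrame(X_pca, columns=pc_names)` with `pc_names` built from `X_pca.shape[1]`.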


### Step 5: How much information is preserved?
Each principal component captures a fraction of total variance in the data.

In [5]:
explained = pca.explained_variance_ratio_
print("Explained variance ratio by each component:", explained)
print("Total variance retained by reduced components:", explained.sum())

Explained variance ratio by each component: [0.13737228 0.12772121 0.1147068  0.107659   0.10133169 0.09698713
 0.09000231 0.0781953  0.0738259 ]
Total variance retained by reduced components: 0.9278016133935169
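
The ratios printed above also explain why 9 components were kept. The sketch below copies those nine values and accumulates them; the variable names are assumptions for illustration.

```python
import numpy as np

# Explained variance ratios copied from the output above.
ratios = np.array([
    0.13737228, 0.12772121, 0.1147068, 0.107659, 0.10133169,
    0.09698713, 0.09000231, 0.0781953, 0.0738259,
])

cumulative = np.cumsum(ratios)
for i, total in enumerate(cumulative, start=1):
    print(f"First {i} component(s): {total:.4f}")

# Smallest number of components whose cumulative ratio reaches 0.90.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print("Components needed for 90% variance:", n_needed)
```

The first eight components fall short of the 90% threshold and the ninth crosses it. No single component dominates, which suggests the original features are only weakly correlated, so PCA removes just one dimension here.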



## 2️⃣ Linear Discriminant Analysis (LDA)

**Concept (in simple terms):**
- LDA is a **supervised** dimensionality reduction technique.
- It uses the **class labels** to find directions (called **Linear Discriminants**) that best separate different classes.
- Each new LDA feature is a combination of the original features that maximizes **class separability**.
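
As a rough illustration of what "maximizing class separability" means, the sketch below builds the within-class and between-class scatter matrices by hand and takes the leading eigenvectors of `S_w^-1 S_b` as discriminant directions. The data comes from `make_blobs` and all names are assumptions for illustration. This is the textbook formulation; scikit-learn's default `svd` solver reaches an equivalent projection by a different route.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Synthetic labelled data (assumption): 3 classes, 4 features.
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=4,
                            random_state=0)

overall_mean = X_demo.mean(axis=0)
n_features = X_demo.shape[1]
S_w = np.zeros((n_features, n_features))  # within-class scatter
S_b = np.zeros((n_features, n_features))  # between-class scatter

for label in np.unique(y_demo):
    X_c = X_demo[y_demo == label]
    mean_c = X_c.mean(axis=0)
    centered = X_c - mean_c
    S_w += centered.T @ centered
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += X_c.shape[0] * (diff @ diff.T)

# Directions that maximize between-class relative to within-class scatter.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
eigvecs = eigvecs.real[:, order]

# With 3 classes only 2 eigenvalues are meaningfully non-zero.
W = eigvecs[:, :2]
X_projected = X_demo @ W

print("Eigenvalues:", eigvals.round(6))
print("Projected shape:", X_projected.shape)
```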

**When to use LDA:**
- When you have labeled data and want to project it into a lower-dimensional space while keeping classes distinct.
- Often used before classification tasks to simplify data and remove noise.

We'll now demonstrate LDA step by step using the **Vehicle Sensor Dataset**.


In [6]:
# Step 1: Load and inspect the Vehicle Sensor dataset
df_vehicle = pd.read_csv('vehicle_sensor_lda.csv')
print("Shape of dataset:", df_vehicle.shape)
df_vehicle.head()

Shape of dataset: (200, 11)


Unnamed: 0,Engine_Temperature,Fuel_Pressure,RPM,Vibration_Intensity,Oil_Level,Coolant_Temperature,Battery_Voltage,Air_Intake_Temp,Throttle_Position,Exhaust_CO2,Fault_Type
0,83.8,331.7,2635.0,0.437,5.08,85.3,12.78,29.2,41.6,128.4,0
1,88.7,359.1,2897.0,0.476,3.69,91.9,13.07,32.4,68.1,123.0,1
2,83.7,372.0,2152.0,0.419,4.75,85.3,13.19,35.4,41.2,108.0,0
3,79.7,346.4,2745.0,0.382,5.12,81.0,13.68,37.1,66.8,122.0,0
4,87.3,360.9,2780.0,0.504,4.27,88.0,13.54,36.1,33.7,124.7,2


### 📘 Understanding the Vehicle Sensor Dataset

We'll be working with the **Vehicle Sensor Dataset** (`vehicle_sensor_lda.csv`), which has 200 rows and 11 columns.  
Each row holds one set of **vehicle sensor readings** together with a fault label. The descriptions below are inferred from the column names only; the Word document supplied with the dataset is the authoritative source for units and value ranges.

#### 🧩 Feature Descriptions

| Feature | Description (inferred from the column name) |
|----------|--------------|
| **Engine_Temperature** | Engine temperature reading |
| **Fuel_Pressure** | Fuel pressure reading |
| **RPM** | Engine speed in revolutions per minute |
| **Vibration_Intensity** | Measured vibration intensity |
| **Oil_Level** | Oil level reading |
| **Coolant_Temperature** | Coolant temperature reading |
| **Battery_Voltage** | Battery voltage reading |
| **Air_Intake_Temp** | Intake air temperature reading |
| **Throttle_Position** | Throttle position reading |
| **Exhaust_CO2** | CO2 measurement in the exhaust |

#### 🎯 Target Variable

| Column | Values seen in the preview | Description |
|---------------|------------------|--------------|
| **Fault_Type** | 0, 1, 2 | Class label for the fault category; the meaning of each code is given in the dataset description document |

---

This dataset will be used for demonstrating **Linear Discriminant Analysis (LDA)**,  
a technique that finds the combinations of these features that best **separate the fault types**.


### Step 2: Separate features and target

In [7]:
X = df_vehicle.drop('Fault_Type', axis=1)
y = df_vehicle['Fault_Type']

print("Feature matrix shape:", X.shape)
print("Target shape:", y.shape)

Feature matrix shape: (200, 10)
Target shape: (200,)


### Step 3: Standardize the features (LDA also benefits from standardized input)

In [8]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Before scaling (mean of first feature):", X.iloc[:, 0].mean())
print("After scaling (mean of first feature):", X_scaled[:, 0].mean())

Before scaling (mean of first feature): 84.58649999999997
After scaling (mean of first feature): 4.6007642140466485e-15


### Step 4: Apply LDA
Since this dataset has 3 classes, LDA can create up to **2 linear discriminants** (n_classes - 1).

In [9]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

print("Original shape:", X_scaled.shape)
print("Reduced shape (after LDA):", X_lda.shape)

Original shape: (200, 10)
Reduced shape (after LDA): (200, 2)
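
The limit on the number of discriminants can be checked directly. The sketch below uses synthetic data from `make_classification` with the same shape as the vehicle dataset (200 rows, 10 features, 3 classes); the data and variable names are assumptions for illustration. It assumes a recent scikit-learn release, where asking for too many components raises a `ValueError`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in (assumption): 200 samples, 10 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# LDA can produce at most min(n_features, n_classes - 1) discriminants.
max_components = min(X_demo.shape[1], len(np.unique(y_demo)) - 1)
print("Maximum number of discriminants:", max_components)

lda_ok = LinearDiscriminantAnalysis(n_components=max_components)
X_demo_lda = lda_ok.fit_transform(X_demo, y_demo)
print("Reduced shape:", X_demo_lda.shape)

# Requesting one component too many is rejected at fit time.
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X_demo, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```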


### Step 5: Inspect new features
Let's see the first few transformed samples with the new LDA features (LD1 and LD2).

In [10]:
df_lda = pd.DataFrame(X_lda, columns=['LD1', 'LD2'])
df_lda.head()

Unnamed: 0,LD1,LD2
0,-0.339709,-1.717955
1,1.427718,0.980329
2,-0.348382,0.024691
3,-0.679759,-1.345333
4,1.365791,-0.715194
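
One way to judge how well the discriminants separate the classes is to attach the labels to the LD features and compare class centroids. The sketch below does this on synthetic data; the dataset, the `label` column name, and the other variable names are assumptions for illustration. It also reads `explained_variance_ratio_`, which the default `svd` solver provides and which reports the share of between-class variance carried by each discriminant.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 200 samples, 10 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

lda_demo = LinearDiscriminantAnalysis(n_components=2)
X_ld = lda_demo.fit_transform(X_demo_scaled, y_demo)

# Put the new features next to the class label.
df_ld = pd.DataFrame(X_ld, columns=["LD1", "LD2"])
df_ld["label"] = y_demo

# Centroids that sit far apart indicate well-separated classes.
centroids = df_ld.groupby("label")[["LD1", "LD2"]].mean()
ratio = lda_demo.explained_variance_ratio_

print(centroids.round(3))
print("Between-class variance ratio per discriminant:", ratio.round(3))
```

In this notebook the equivalent would be `df_lda['Fault_Type'] = y.values` followed by the same `groupby`.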



## 3️⃣ Comparison Summary

| Aspect | PCA | LDA |
|--------|------|------|
| Type | Unsupervised | Supervised |
| Uses Labels? | No | Yes |
| Focus | Maximizes variance | Maximizes class separability |
| Output | Principal Components | Linear Discriminants |
| Best Used For | Data compression, visualization | Preprocessing for classification |

Both PCA and LDA **reduce features** while preserving important information, but in different ways: here PCA went from 10 features to 9 components, while LDA went from 10 features to 2 discriminants.  
You can now clearly see how the data was transformed into new, compact feature spaces.
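
A compact way to report the reduction from both methods side by side is a small summary table. The sketch below runs both on one synthetic dataset; the data, the column names in `summary`, and the variable names are assumptions for illustration, so the counts will differ from those of the assignment datasets.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 200 samples, 10 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
X_std = StandardScaler().fit_transform(X_demo)

# PCA ignores the labels; LDA requires them.
X_p = PCA(n_components=0.90).fit_transform(X_std)
X_l = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y_demo)

summary = pd.DataFrame({
    "method": ["PCA (90% variance)", "LDA"],
    "features_before": [X_std.shape[1], X_std.shape[1]],
    "features_after": [X_p.shape[1], X_l.shape[1]],
    "uses_labels": [False, True],
})
print(summary)
```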
