# 02. Feature Engineering
## Smart Wafer Yield Optimization Project

This notebook focuses on advanced feature engineering techniques for the SECOM semiconductor manufacturing dataset.

### Objectives:
- Apply PCA for dimensionality reduction
- Create domain-specific features for manufacturing data
- Implement statistical feature extraction
- Perform feature selection and ranking
- Prepare engineered features for machine learning

### Feature Engineering Strategies:
1. **Dimensionality Reduction**: PCA to reduce the 591 features to a manageable number of components
2. **Statistical Features**: Rolling statistics, trend analysis, stability metrics
3. **Domain Features**: Sensor drift detection, process jump identification
4. **Feature Selection**: Mutual information, recursive feature elimination
5. **Feature Ranking**: Importance analysis for manufacturing insights


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif, SelectKBest, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

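# Import our utility functions.
# If the notebook is launched from the notebooks/ folder, switch to the
# project root so that the `app` package can be imported.
import sys
import os
notebook_path = os.path.abspath("")
if os.path.basename(notebook_path) == "notebooks":
    project_root = os.path.dirname(notebook_path)
    os.chdir(project_root)
else:
    project_root = notebook_path
# Make the import independent of how the kernel set up sys.path
if project_root not in sys.path:
    sys.path.insert(0, project_root)
from app.utils import load_data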

print("Libraries imported successfully!")
print("Ready to begin feature engineering...")


Notes for later:

✅ Step A — Create binary outlier flags for high-impact features

For the top features in your “reliable indicators” list (e.g. feature_38, feature_59, feature_348, …), create new binary columns:

for f in high_impact_features:
    Q1, Q3 = np.percentile(df[f], [25, 75])
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
    df[f + '_outlier_flag'] = ((df[f] < lower) | (df[f] > upper)).astype(int)


Then include these as new engineered features in your pipeline.
These flags encode “abnormal process behavior detected on this sensor” — which is highly interpretable and can improve model recall on faults.
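
One way to wire Step A into a pipeline is to compute the IQR bounds once on the training split and reuse them on new data, so the flags mean the same thing at inference time. The sketch below assumes that split-based workflow; the helper names `fit_iqr_bounds` and `add_outlier_flags` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_iqr_bounds(df: pd.DataFrame, features, k: float = 1.5) -> dict:
    """Compute (lower, upper) IQR fences per feature, ignoring NaNs."""
    bounds = {}
    for f in features:
        q1, q3 = np.nanpercentile(df[f], [25, 75])
        iqr = q3 - q1
        bounds[f] = (q1 - k * iqr, q3 + k * iqr)
    return bounds


def add_outlier_flags(df: pd.DataFrame, bounds: dict) -> pd.DataFrame:
    """Return a copy of df with one binary flag column per bounded feature."""
    out = df.copy()
    for f, (lower, upper) in bounds.items():
        out[f + "_outlier_flag"] = ((out[f] < lower) | (out[f] > upper)).astype(int)
    return out


# Toy data standing in for a train/test split
train = pd.DataFrame({"feature_38": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]})
test = pd.DataFrame({"feature_38": [4.5, 100.0, -50.0]})

bounds = fit_iqr_bounds(train, ["feature_38"])  # fences: (-2.5, 11.5)
flagged = add_outlier_flags(test, bounds)
print(flagged)
```

Missing values compare as False against both fences, so they receive a flag of 0 rather than raising an error.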


✅ Step B — Avoid over-aggressive scaling for these features

In your preprocessing, you can:

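Apply RobustScaler (not StandardScaler): it centers on the median and scales by the interquartile range, so a few large values do not distort the scaling of the rest of the column and still stand out after scaling.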

Optionally skip transformation (Yeo–Johnson/log) for these specific features so that their magnitude still carries meaning.

Implementation sketch:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, RobustScaler

normal_features = [f for f in all_features if f not in high_impact_features]
transformer = ColumnTransformer([
    ('robust', RobustScaler(), high_impact_features),
    ('yeo', PowerTransformer(method='yeo-johnson'), normal_features)
])
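
To show how the sketch above behaves, here is a small self-contained run on synthetic data. The column names and the injected outlier are made up for the demo. Note that `ColumnTransformer` returns columns in transformer order (high-impact features first) and drops any column not listed in a transformer by default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Synthetic stand-in for the sensor table (not SECOM data)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature_38": rng.normal(0, 1, 100),
    "feature_59": rng.normal(5, 2, 100),
    "feature_1": rng.lognormal(0, 1, 100),
    "feature_2": rng.lognormal(1, 0.5, 100),
})
demo.loc[0, "feature_38"] = 50.0  # inject one extreme reading

high_impact_features = ["feature_38", "feature_59"]
normal_features = [c for c in demo.columns if c not in high_impact_features]

transformer = ColumnTransformer([
    ("robust", RobustScaler(), high_impact_features),
    ("yeo", PowerTransformer(method="yeo-johnson"), normal_features),
])
transformed = transformer.fit_transform(demo)

# Output columns follow the order of the transformers, not the input frame
out_cols = high_impact_features + normal_features
demo_t = pd.DataFrame(transformed, columns=out_cols, index=demo.index)
print(demo_t.describe().loc[["mean", "50%", "max"]])
```

After scaling, the injected reading in `feature_38` remains far from the bulk of the column, which is the behavior Step B is after.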

## 1. Load and Prepare Data


In [None]:
# Load the preprocessed data if it exists; otherwise load and preprocess the raw data
print("Loading SECOM data...")

# The import cell above switches the working directory to the project root,
# so the path is given relative to the root (not to notebooks/).
processed_path = os.path.join('data', 'processed', 'secom_cleaned.csv')
if os.path.exists(processed_path):
    data = pd.read_csv(processed_path)
    print("✅ Loaded preprocessed data")
else:
    print("⚠️ No preprocessed data found, preprocessing raw data")
    from app.utils import preprocess_data
    data = preprocess_data(load_data(), method='knn')

print(f"Dataset shape: {data.shape}")
print(f"Missing values: {data.isnull().sum().sum()}")

# Separate features and target
if 'target' in data.columns:
    X = data.drop('target', axis=1)
    y = data['target']
    print(f"Features: {X.shape[1]}, Target distribution: {y.value_counts().to_dict()}")
else:
    X = data
    y = None
    print("No target variable found")


## 2. Principal Component Analysis (PCA)


In [None]:
# Apply PCA for dimensionality reduction
print("Applying PCA for dimensionality reduction...")

# Standardize features first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Find number of components for 90% and 95% variance
n_components_95 = np.where(cumulative_variance >= 0.95)[0][0] + 1
n_components_90 = np.where(cumulative_variance >= 0.90)[0][0] + 1

print(f"Number of components for 90% variance: {n_components_90}")
print(f"Number of components for 95% variance: {n_components_95}")

# Apply PCA with selected number of components
pca_final = PCA(n_components=n_components_95)
X_pca_final = pca_final.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"PCA features: {X_pca_final.shape[1]}")
print(f"Variance explained: {pca_final.explained_variance_ratio_.sum():.3f}")

# Create PCA DataFrame
pca_columns = [f'PC_{i+1}' for i in range(X_pca_final.shape[1])]
X_pca_df = pd.DataFrame(X_pca_final, columns=pca_columns, index=X.index)

# Add target back if available
if y is not None:
    X_pca_df['target'] = y.values

print("✅ PCA transformation completed!")
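
As a possible shortcut for the cell above, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then keeps just enough components to reach that share of explained variance. The sketch below checks this against the manual cumulative-variance count on synthetic data; the data shape and latent-factor setup are assumptions for the demo only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 rows, 40 correlated columns built from 8 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 8))
mixing = rng.normal(size=(8, 40))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 40))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components asks PCA to pick the component count itself;
# the full SVD solver is requested explicitly for this mode
pca_auto = PCA(n_components=0.95, svd_solver="full")
X_demo_pca = pca_auto.fit_transform(X_demo_scaled)

# Manual count, as in the notebook cell: first index reaching 95% cumulative variance
cum = np.cumsum(PCA(svd_solver="full").fit(X_demo_scaled).explained_variance_ratio_)
n_manual = int(np.argmax(cum >= 0.95)) + 1

print(f"Components chosen by PCA: {pca_auto.n_components_}")
print(f"Components from manual count: {n_manual}")
```

The two-step version in the notebook is still useful when both the 90% and 95% counts are wanted, since one full fit yields the whole variance curve.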
