# 1. Business Understanding
## 1.1 Background

Semiconductor manufacturing is a highly complex and capital-intensive process involving hundreds of fabrication steps that must be performed with extreme precision. Even microscopic defects introduced during wafer processing can lead to complete product failure, reducing manufacturing yield and increasing production costs.
Traditionally, quality control in semiconductor fabrication has relied on manual inspection and rule-based systems, which are time-consuming, subjective, and often unable to keep up with modern production speeds.

In recent years, semiconductor companies such as Intel, TSMC, Samsung, and Nvidia have shifted toward AI-driven defect detection systems to improve yield prediction, defect localization, and root-cause analysis. By combining machine learning and computer vision, these systems can detect defect patterns directly from wafer map images, enabling earlier and more accurate interventions in the production line.

## 1.2 Problem Statement

Manufacturers need an efficient and automated method to identify and classify wafer defects early in the production process. Manual inspection systems fail to scale with high-volume production and cannot accurately identify subtle, complex defect patterns.
Therefore, the goal is to develop a machine learning-based image analysis model capable of automatically detecting and classifying defect patterns in wafer maps, thereby improving yield, reducing inspection time, and minimizing production losses.

## 1.3 Business Objective

The primary business objective is to enhance production efficiency and quality assurance in semiconductor manufacturing by automating defect detection.
The system will:

- Identify wafer defect types using image-based pattern recognition.

- Support process engineers in diagnosing the root cause of production faults.

- Reduce manual inspection time and related operational costs.

- Improve yield rate and product reliability.

Ultimately, the project aims to demonstrate how AI-based defect detection can improve decision-making, reduce downtime, and ensure data-driven manufacturing optimization.

## 1.4 Project Goal

To build and deploy a deep learning-based image classification model capable of identifying common wafer defect patterns (e.g., center, edge-ring, scratch, random) using the WM811K dataset. The model's predictions will be integrated into an interactive Streamlit dashboard, allowing users to:

- Upload wafer map images,
- View real-time defect classification and confidence levels, and
- Visualize feature importance or activation maps (Grad-CAM) for interpretability.
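
As a rough illustration of how a dashboard could turn raw model outputs into a defect label and a confidence level, the sketch below applies a softmax to a vector of scores. The label order in `CLASS_NAMES`, the helper name `predict_with_confidence`, and the example scores are all assumptions made for this sketch; the trained model's actual outputs may differ.

```python
import numpy as np

# Hypothetical label order; the real order depends on how the model was trained.
CLASS_NAMES = ["Center", "Edge-Ring", "Scratch", "Random"]

def predict_with_confidence(logits, class_names=CLASS_NAMES):
    """Convert raw model scores into (label, confidence, per-class probabilities)."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    exp_scores = np.exp(logits - logits.max())
    probs = exp_scores / exp_scores.sum()
    top = int(np.argmax(probs))
    return class_names[top], float(probs[top]), probs

# Made-up scores for a single wafer map.
label, confidence, probs = predict_with_confidence([2.0, 0.5, 0.1, -1.0])
print(f"{label}: {confidence:.1%}")
```

Showing the full probability vector next to the top label lets an engineer see when the model is torn between two patterns.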

## 1.5 Expected Business Impact

- `Operational Efficiency:` Faster and more accurate defect detection compared to manual methods.
- `Cost Reduction:` Reduced labor costs and fewer defective chips reaching final testing.
- `Quality Improvement:` Early detection minimizes yield loss and improves product reliability.
- `Decision Support:` Data-driven insights for process optimization and predictive maintenance.
- `Scalability:` System can be integrated into production pipelines and scaled to new wafer types.

## 1.6 Success Metrics

- Accuracy / F1 Score of classification model 

- Reduction in defect inspection time by .

- Improved detection of rare defect patterns (using confusion matrix or recall metrics).

- Usability feedback from engineers or end-users on the Streamlit dashboard prototype.
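
The classification metrics listed above can all be derived from a confusion matrix. The following is a minimal sketch, assuming rows hold true classes and columns hold predicted classes; the helper `per_class_metrics` and the toy counts are illustrative and not taken from the project's results.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix
    (rows = true class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column totals: how often each class was predicted
    actual = cm.sum(axis=1)      # row totals: how often each class truly occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy 3-class matrix with made-up counts; the last class is rare.
cm = [[90, 5, 5],
      [10, 80, 10],
      [4, 2, 4]]
precision, recall, f1 = per_class_metrics(cm)
accuracy = np.trace(np.asarray(cm)) / np.sum(cm)
macro_f1 = f1.mean()
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}, recall={recall.round(2)}")
```

In this toy matrix the overall accuracy is about 0.83 while recall for the rare class is only 0.40, which is why per-class recall and macro F1 are worth tracking alongside accuracy.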

In [1]:
import numpy as np  # required by the feature-extraction cell below
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import warnings
warnings.filterwarnings('ignore')


df = pd.read_pickle("WM811K.pkl")
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'WM811K.pkl'

In [None]:
def create_balanced_dataset(df, target_samples_per_class=800):
    """
    Create a balanced dataset with strategic sampling
    """
    print("=== PHASE 3: DATA PREPARATION ===")
    
    # 1. Remove [0 0] class as requested
    print("1. Removing '[0 0]' class...")
    df = df[df['failureType'] != '[0 0]']
    
    # 2. Analyze original distribution
    print("\n2. Original Class Distribution:")
    original_dist = df['failureType'].value_counts()
    for defect_type, count in original_dist.items():
        percentage = (count / len(df)) * 100
        print(f"  {defect_type:12}: {count:5} samples ({percentage:5.1f}%)")
    
    # 3. Create balanced dataset
    print(f"\n3. Creating balanced dataset with target of {target_samples_per_class} samples per class")
    
    balanced_dfs = []
    
    for class_name in df['failureType'].unique():
        class_df = df[df['failureType'] == class_name]
        
        if len(class_df) >= target_samples_per_class:
            # If we have enough, take target amount
            sampled_df = class_df.sample(n=target_samples_per_class, random_state=42)
        else:
            # For rare classes, use all available
            sampled_df = class_df.copy()
            print(f"  Using all {len(sampled_df)} samples for rare class: {class_name}")
        
        balanced_dfs.append(sampled_df)
    
    # Combine and shuffle
    df_balanced = pd.concat(balanced_dfs, ignore_index=True)
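    # Reset the index after shuffling so later positional joins line up with the rows
    df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)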
    
    print(f"\n4. Final Balanced Dataset:")
    print(f"   Total samples: {len(df_balanced)}")
    print(f"   Number of classes: {df_balanced['failureType'].nunique()}")
    
    # Verify balance
    print("\n5. Balanced Class Distribution:")
    balanced_dist = df_balanced['failureType'].value_counts()
    for defect_type, count in balanced_dist.items():
        percentage = (count / len(df_balanced)) * 100
        print(f"  {defect_type:12}: {count:4} samples ({percentage:5.1f}%)")
    
    return df_balanced

# Create balanced dataset
df_balanced = create_balanced_dataset(df, target_samples_per_class=800)
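
Because classes with fewer than `target_samples_per_class` rows are kept as they are, the resulting dataset is capped rather than strictly balanced. One common way to compensate during training is to weight each class inversely to its frequency. The sketch below uses the `n_samples / (n_classes * class_count)` heuristic, which is the same formula as scikit-learn's `"balanced"` class weighting. The helper name and the toy label counts are assumptions for illustration.

```python
import pandas as pd

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = pd.Series(labels).value_counts()
    weights = len(labels) / (counts.size * counts)
    return weights.to_dict()

# Toy labels with made-up counts: two capped classes and one rare class.
toy_labels = ["Center"] * 800 + ["Edge-Ring"] * 800 + ["Near-full"] * 100
weights = balanced_class_weights(toy_labels)
for name, weight in weights.items():
    print(f"{name:10}: {weight:.3f}")
```

The rare class receives the largest weight, so its misclassifications cost more during training.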

In [None]:
def extract_advanced_features(df):
    """
    Extract comprehensive features from wafer maps
    """
    print("\n6. Feature Engineering...")
    
    def extract_wafer_features(wafer_map):
        if isinstance(wafer_map, np.ndarray):
            # Handle different data types
            if wafer_map.dtype == bool:
                wafer_map = wafer_map.astype(int)
            
            flat_map = wafer_map.flatten()
            non_zero_pixels = flat_map[flat_map > 0]
            
            features = {
                # Basic intensity features
                'mean_intensity': np.mean(wafer_map),
                'std_intensity': np.std(wafer_map),
                'defect_density': np.sum(wafer_map > 0) / wafer_map.size if wafer_map.size > 0 else 0,
                'max_intensity': np.max(wafer_map) if wafer_map.size > 0 else 0,
                'min_intensity': np.min(wafer_map) if wafer_map.size > 0 else 0,
                
                # Statistical features
                'defect_variance': np.var(non_zero_pixels) if len(non_zero_pixels) > 0 else 0,
                'defect_skewness': pd.Series(non_zero_pixels).skew() if len(non_zero_pixels) > 0 else 0,
                
                # Spatial distribution features
                'center_defect_ratio': calculate_center_defect_ratio(wafer_map),
                'edge_defect_ratio': calculate_edge_defect_ratio(wafer_map),
                'corner_defect_ratio': calculate_corner_defect_ratio(wafer_map),
                
                # Shape features
                'aspect_ratio': calculate_aspect_ratio(wafer_map),
                'symmetry_score': calculate_symmetry(wafer_map)
            }
            return features
        return {}
    
    def calculate_center_defect_ratio(wafer_map):
        """Calculate ratio of defects in center region"""
        h, w = wafer_map.shape
        center_region = wafer_map[h//4:3*h//4, w//4:3*w//4]
        total_defects = np.sum(wafer_map > 0)
        return np.sum(center_region > 0) / total_defects if total_defects > 0 else 0
    
    def calculate_edge_defect_ratio(wafer_map):
        """Calculate ratio of defects in edge region"""
        h, w = wafer_map.shape
        edge_region = np.copy(wafer_map)
        edge_region[h//4:3*h//4, w//4:3*w//4] = 0
        total_defects = np.sum(wafer_map > 0)
        return np.sum(edge_region > 0) / total_defects if total_defects > 0 else 0
    
    def calculate_corner_defect_ratio(wafer_map):
        """Calculate ratio of defects in corner regions"""
        h, w = wafer_map.shape
        corner_size = min(h, w) // 3
        corners = (np.sum(wafer_map[:corner_size, :corner_size] > 0) +
                  np.sum(wafer_map[:corner_size, -corner_size:] > 0) +
                  np.sum(wafer_map[-corner_size:, :corner_size] > 0) +
                  np.sum(wafer_map[-corner_size:, -corner_size:] > 0))
        total_defects = np.sum(wafer_map > 0)
        return corners / total_defects if total_defects > 0 else 0
    
    def calculate_aspect_ratio(wafer_map):
        """Calculate aspect ratio of defect pattern"""
        defect_positions = np.argwhere(wafer_map > 0)
        if len(defect_positions) == 0:
            return 1.0
        min_y, min_x = defect_positions.min(axis=0)
        max_y, max_x = defect_positions.max(axis=0)
        height = max_y - min_y + 1
        width = max_x - min_x + 1
        return width / height if height > 0 else 1.0
    
    def calculate_symmetry(wafer_map):
        """Calculate symmetry score"""
        if wafer_map.shape[0] != wafer_map.shape[1]:
            return 0.5
        horizontal_symmetry = np.sum(wafer_map == np.fliplr(wafer_map)) / wafer_map.size
        vertical_symmetry = np.sum(wafer_map == np.flipud(wafer_map)) / wafer_map.size
        return (horizontal_symmetry + vertical_symmetry) / 2
    
    # Apply feature extraction
    print("Extracting features from wafer maps...")
    wafer_features = df['waferMap'].apply(extract_wafer_features)
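    # Reuse the source index so each feature row stays aligned with its label in the concat below
    wafer_features_df = pd.DataFrame(wafer_features.tolist(), index=df.index)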
    
    # Combine with original features
    df_processed = pd.concat([df[['dieSize', 'failureType']], wafer_features_df], axis=1)
    df_processed = df_processed.dropna()
    
    print(f"Final processed dataset: {df_processed.shape}")
    print(f"Features: {df_processed.columns.tolist()}")
    
    return df_processed

# Extract features
df_processed = extract_advanced_features(df_balanced)
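
To sanity-check the spatial features, it can help to run the center/edge logic on a small hand-made wafer map where the answer is known. The sketch below is a standalone re-implementation of that logic, not the notebook's own helpers. The function name `region_defect_ratios` and the `defect_threshold` parameter are introduced here for illustration. If the maps use the encoding commonly described for WM-811K (0 = no die, 1 = passing die, 2 = failing die), `defect_threshold=1` would count only failing dies; that encoding is an assumption and should be checked against the loaded data.

```python
import numpy as np

def region_defect_ratios(wafer_map, defect_threshold=0):
    """Return (center_ratio, edge_ratio) of defect pixels, treating the
    middle half of each axis as the center region."""
    h, w = wafer_map.shape
    defects = wafer_map > defect_threshold
    total = defects.sum()
    if total == 0:
        return 0.0, 0.0
    center = defects[h // 4:3 * h // 4, w // 4:3 * w // 4].sum()
    return center / total, (total - center) / total

# Toy 8x8 map: four defect pixels in the middle, one in a corner.
toy = np.zeros((8, 8), dtype=int)
toy[3:5, 3:5] = 1
toy[0, 0] = 1
center_ratio, edge_ratio = region_defect_ratios(toy)
print(center_ratio, edge_ratio)
```

Because every defect pixel falls in exactly one of the two regions, the two ratios sum to 1 whenever the map contains any defects.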