# Data Preprocessing_EfficientNet B0

Already done (FER2013 ships this way):

- Images are already 48×48 pixels.
- Images are already grayscale.
- Face detection is already applied (FER2013 is pre-cropped).

Steps performed in this notebook:

1. Load the raw data directories for the train and test splits.
2. For each image in the dataset:
    - Read the image in grayscale.
    - Skip it if it is corrupted or unreadable.
    - Skip it if its size is not exactly 48×48 pixels.
    - Skip it if it is nearly blank (low standard deviation).
    - Skip it if it is blurry (low variance of the Laplacian).
    - Skip it if it duplicates an earlier image in the same split (SHA-256 hash comparison of the file bytes).
    - If all checks pass, normalize pixel values to [0, 1].
3. Map the 7 original FER2013 classes to the 5 target classes: disgust is merged into angry, fear and surprise are merged into stressed, and angry, happy, neutral, and sad are kept unchanged.
4. Split the training data into training and validation sets, stratified by class to maintain class proportions.
5. Encode labels:
    - Convert string class labels to integers using LabelEncoder.
    - One-hot encode the integer labels for model training.
6. Print class distribution counts for the training, validation, and test sets to show class imbalance.
7. Compute class weights from the training data to help balance the loss function during training.
8. Print final dataset shapes for all splits (train, val, test).
9. Save the processed datasets (images and labels) as compressed .npz files (train.npz, val.npz, test.npz).


In [1]:
import os
import cv2
import numpy as np
import hashlib
from pathlib import Path
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.utils import to_categorical

2025-07-16 04:39:45.915426: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-16 04:39:45.934045: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-07-16 04:39:46.092814: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2025-07-16 04:39:46.096669: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Configuration

In [2]:
#set paths
RAW_DIR = Path("/app/data/raw/fer2013")
PROCESSED_DIR = Path("/app/data/processed/FC211002_Nethmi")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)


In [3]:
## Mapping FER2013 classes to 5 project classes
CLASS_MAPPING = {
    'angry': 'angry',
    'disgust': 'angry',
    'fear': 'stressed',
    'surprise': 'stressed',
    'happy': 'happy',
    'neutral': 'neutral',
    'sad': 'sad'
}
TARGET_CLASSES = ['angry', 'happy', 'sad', 'stressed', 'neutral']
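
To see what the mapping does to the label space, the sketch below applies it to a short list of original labels and counts the results. The sample labels are made up for illustration and are not taken from FER2013; the mapping itself is copied from the cell above.

```python
from collections import Counter

# Same mapping as in the configuration cell above.
CLASS_MAPPING = {
    'angry': 'angry',
    'disgust': 'angry',
    'fear': 'stressed',
    'surprise': 'stressed',
    'happy': 'happy',
    'neutral': 'neutral',
    'sad': 'sad',
}

# Hypothetical sample of original FER2013 folder names (illustration only).
sample_labels = ['angry', 'disgust', 'fear', 'surprise',
                 'happy', 'neutral', 'sad', 'fear']

# Each original label is replaced by its target class.
mapped = [CLASS_MAPPING[label] for label in sample_labels]
mapped_counts = Counter(mapped)

# Seven source classes collapse into five distinct target classes.
target_names = sorted(set(CLASS_MAPPING.values()))
print(target_names)
print(dict(mapped_counts))
```

Note that LabelEncoder later sorts class names alphabetically, so the integer indices follow the sorted order rather than the order of TARGET_CLASSES.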

In [4]:
# Thresholds for low-quality image detection
BLANK_STD_THRESHOLD = 5      # std dev below means nearly blank
BLUR_THRESHOLD = 100         # variance of Laplacian below means blurry
VAL_SPLIT_RATIO = 0.1
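
The two thresholds are easier to judge against concrete numbers. The following sketch computes both metrics on three synthetic 48×48 images. It is a NumPy-only approximation: `laplacian_variance` is a hypothetical stand-in that applies a 4-neighbour Laplacian to interior pixels only, so its values will not match `cv2.Laplacian` exactly at the borders.

```python
import numpy as np

BLANK_STD_THRESHOLD = 5   # same value as in the cell above
BLUR_THRESHOLD = 100      # same value as in the cell above


def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian over interior pixels.

    Hypothetical NumPy-only stand-in for cv2.Laplacian(img, cv2.CV_64F).var().
    """
    img = img.astype(np.float64)
    lap = (img[:-2, 1:-1] + img[2:, 1:-1]      # pixels above and below
           + img[1:-1, :-2] + img[1:-1, 2:]    # pixels left and right
           - 4.0 * img[1:-1, 1:-1])            # centre pixel
    return lap.var()


rng = np.random.default_rng(0)

# Uniform gray: no contrast at all, so it should count as blank.
flat = np.full((48, 48), 128, dtype=np.uint8)
# Smooth horizontal ramp: plenty of contrast, but no sharp edges.
gradient = np.tile(np.linspace(0, 255, 48), (48, 1)).astype(np.uint8)
# Random noise: strong high-frequency content.
noisy = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)

for name, img in [("flat", flat), ("gradient", gradient), ("noisy", noisy)]:
    is_blank = np.std(img) < BLANK_STD_THRESHOLD
    is_blurry = laplacian_variance(img) < BLUR_THRESHOLD
    print(f"{name}: blank={is_blank}, blurry={is_blurry}")
```

The ramp image shows why the two checks are separate: it passes the blank test because its pixel values vary widely, yet it fails the blur test because neighbouring pixels barely differ.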

## Data Cleaning

In [5]:
def preprocess_image(img):
    """Normalize a grayscale image to the [0, 1] range.

    Input:  img, a numpy array of shape (48, 48), dtype uint8.
    Output: float32 image of the same shape with values between 0 and 1.
    """
    return img.astype(np.float32) / 255.0  # shape: (48, 48)


def is_blank_image(img, threshold=BLANK_STD_THRESHOLD):
    """Return True if the image is essentially blank (very low pixel std dev)."""
    return np.std(img) < threshold


def is_blurry(img, threshold=BLUR_THRESHOLD):
    """Return True if the image is blurry (low variance of the Laplacian)."""
    return cv2.Laplacian(img, cv2.CV_64F).var() < threshold


def hash_image(img_path):
    """Compute the SHA-256 hash of an image file's bytes to detect duplicates."""
    with open(img_path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()


def print_class_distribution(labels, label_encoder, split_name):
    """Print the number of samples per class for one dataset split."""
    print(f"\n Class distribution in {split_name} set:")
    counts = Counter(labels)
    for cls in label_encoder.classes_:
        print(f"  {cls}: {counts.get(cls, 0)}")
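
Because hash_image hashes the raw file bytes, two files that decode to identical pixels but differ in encoding or metadata would get different hashes. One possible alternative, shown here only as a sketch, is to hash the decoded pixel array instead. The helper name `hash_pixels` is hypothetical and is not part of this notebook.

```python
import hashlib

import numpy as np


def hash_pixels(img):
    """Hypothetical helper: SHA-256 over shape, dtype, and raw pixel bytes."""
    # Include shape and dtype so arrays with equal bytes but different
    # layouts do not collide.
    header = f"{img.shape}|{img.dtype}".encode()
    pixels = np.ascontiguousarray(img).tobytes()
    return hashlib.sha256(header + pixels).hexdigest()


# Synthetic 48x48 image (illustration only, not FER2013 data).
a = (np.arange(48 * 48) % 256).astype(np.uint8).reshape(48, 48)
b = a.copy()          # identical pixels
c = a.copy()
c[0, 0] = 1           # a[0, 0] is 0, so exactly one pixel differs

same = hash_pixels(a) == hash_pixels(b)
different = hash_pixels(a) != hash_pixels(c)
print(same, different)
```

Like the file-based hash, this only catches exact matches; near-duplicates such as resized or slightly shifted copies would need a perceptual hash, which is outside the scope of this notebook.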

## Preprocessing Function

process_split handles one dataset split: it loads the images, maps their classes, filters them by quality, removes duplicates, normalizes pixel values, and collects the labels.

- Parameter:
     - split (str): Dataset split name, e.g., 'train' or 'test'.

- Returns:
     - Tuple of numpy arrays: (images, labels)
     - images: Array of preprocessed grayscale images (normalized)
     - labels: Corresponding array of target class labels (strings)


In [6]:
def process_split(split):
    
    print(f"\n Processing split: '{split}'")
    
    # Define directory containing images for this split
    input_dir = RAW_DIR / split

    images = []          # List to hold processed image arrays
    labels = []          # Corresponding list to hold class labels
    seen_hashes = set()  # Set to track image hashes to avoid duplicates
    skipped = 0          # Counter for skipped images due to filters
    saved = 0            # Counter for successfully processed images

    # Iterate over original classes in the split directory
    for orig_class in os.listdir(input_dir):
        class_dir = input_dir / orig_class
        
        # Skip if it's not a directory
        if not class_dir.is_dir():
            continue
        
        # Map original class to target class (e.g., 'disgust' → 'angry')
        target_class = CLASS_MAPPING.get(orig_class)
        
        # Skip classes not in the defined target classes
        if target_class not in TARGET_CLASSES:
            continue
        
        # Iterate over all images in the class folder
        for img_name in os.listdir(class_dir):
            img_path = class_dir / img_name
            
            # Read image in grayscale mode
            img = cv2.imread(str(img_path), cv2.IMREAD_GRAYSCALE)
            
            # Skip if image couldn't be read or has wrong dimensions
            if img is None or img.shape != (48, 48):
                skipped += 1
                continue
            
            # Skip if image is blank or blurry based on heuristics
            if is_blank_image(img) or is_blurry(img):
                skipped += 1
                continue
            
            # Compute hash of image file to detect duplicates
            img_hash = hash_image(img_path)
            
            # Skip if duplicate image found
            if img_hash in seen_hashes:
                skipped += 1
                continue
            
            # Mark this hash as seen
            seen_hashes.add(img_hash)
            
            # Preprocess image (normalize pixel values)
            processed_img = preprocess_image(img)  # shape: (48, 48)
            
            # Append image and label to lists
            images.append(processed_img)
            labels.append(target_class)
            saved += 1

    print(f" Finished: {saved} images saved, {skipped} skipped")
    
    # Convert lists to numpy arrays before returning
    return np.array(images), np.array(labels)
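
Since process_split creates a fresh seen_hashes set on every call, it removes duplicates within a split but would not notice an image that appears in both train and test. The sketch below shows one way such overlap could be checked. The file names and byte strings are hypothetical stand-ins for real image files, and `sha256_bytes` is an illustrative helper.

```python
import hashlib


def sha256_bytes(data):
    """Illustrative helper: SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents standing in for image files (illustration only).
train_files = {
    "train/happy/a.png": b"image-bytes-1",
    "train/sad/b.png": b"image-bytes-2",
}
test_files = {
    "test/happy/c.png": b"image-bytes-1",   # same bytes as a training file
    "test/sad/d.png": b"image-bytes-3",
}

# Hash every training file once, then look each test file up in that set.
train_hashes = {sha256_bytes(data) for data in train_files.values()}
leaked = sorted(
    path for path, data in test_files.items()
    if sha256_bytes(data) in train_hashes
)
print(leaked)
```

In the notebook itself, the same effect could be obtained by passing a shared hash set into process_split, at the cost of making the result depend on the order in which the splits are processed.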

## Main

In [7]:
def main():
    # Step 1: Process train and test splits
    X_train_full, y_train_full = process_split('train')
    X_test, y_test = process_split('test')

    # Step 2: Create validation set
    print(f"\n🔪 Splitting train into train+val ({100*(1-VAL_SPLIT_RATIO)}%/{100*VAL_SPLIT_RATIO}%)")
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_full, y_train_full,
        test_size=VAL_SPLIT_RATIO,
        stratify=y_train_full,
        random_state=42
    )

    # Step 3: Label encode + one-hot encode
    le = LabelEncoder()
    le.fit(y_train_full)

    print_class_distribution(y_train, le, "train")
    print_class_distribution(y_val, le, "val")
    print_class_distribution(y_test, le, "test")

    y_train_enc = le.transform(y_train)
    y_val_enc = le.transform(y_val)
    y_test_enc = le.transform(y_test)

    y_train_oh = to_categorical(y_train_enc, num_classes=len(TARGET_CLASSES))
    y_val_oh = to_categorical(y_val_enc, num_classes=len(TARGET_CLASSES))
    y_test_oh = to_categorical(y_test_enc, num_classes=len(TARGET_CLASSES))

    # Step 4: Reshape for model compatibility: (48, 48, 1)
    X_train = X_train[..., np.newaxis]
    X_val   = X_val[..., np.newaxis]
    X_test  = X_test[..., np.newaxis]

    # Step 5: Print final shapes
    print("\n Final Dataset Shapes:")
    print(f"Train: X={X_train.shape}, y={y_train_oh.shape}")
    print(f"Val  : X={X_val.shape}, y={y_val_oh.shape}")
    print(f"Test : X={X_test.shape}, y={y_test_oh.shape}")

    # Step 6: Compute class weights
    class_weights = compute_class_weight(class_weight='balanced',
                                         classes=np.unique(y_train_enc),
                                         y=y_train_enc)
    class_weights_dict = dict(enumerate(class_weights))
    print("\n  Class weights (use in training):")
    print(class_weights_dict)

    # Step 7: Save to disk
    print("\n Saving .npz files...")
    np.savez_compressed(PROCESSED_DIR / "train.npz", X=X_train, y=y_train_oh, label_names=le.classes_)
    np.savez_compressed(PROCESSED_DIR / "val.npz",   X=X_val,   y=y_val_oh,   label_names=le.classes_)
    np.savez_compressed(PROCESSED_DIR / "test.npz",  X=X_test,  y=y_test_oh,  label_names=le.classes_)

    print("\n Preprocessing complete. Files saved in:", PROCESSED_DIR)

if __name__ == "__main__":
    main()


 Processing split: 'train'
 Finished: 27457 images saved, 1252 skipped

 Processing split: 'test'
 Finished: 7089 images saved, 89 skipped

🔪 Splitting train into train+val (90.0%/10.0%)

 Class distribution in train set:
  angry: 3804
  happy: 6376
  neutral: 4379
  sad: 4244
  stressed: 5908

 Class distribution in val set:
  angry: 423
  happy: 709
  neutral: 486
  sad: 471
  stressed: 657

 Class distribution in test set:
  angry: 1054
  happy: 1765
  neutral: 1225
  sad: 1240
  stressed: 1805

 Final Dataset Shapes:
Train: X=(24711, 48, 48, 1), y=(24711, 5)
Val  : X=(2746, 48, 48, 1), y=(2746, 5)
Test : X=(7089, 48, 48, 1), y=(7089, 5)

  Class weights (use in training):
{0: 1.299211356466877, 1: 0.7751254705144291, 2: 1.1286138387759763, 3: 1.1645146088595664, 4: 0.8365267433987813}

 Saving .npz files...

 Preprocessing complete. Files saved in: /app/data/processed/FC211002_Nethmi
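
The class weights printed above can be reproduced by hand. With class_weight='balanced', scikit-learn computes each weight as n_samples / (n_classes × class_count). The short check below applies that formula to the training counts reported in the output; the class order is the alphabetical order used by LabelEncoder.

```python
# Training-set counts copied from the class distribution printed above,
# in LabelEncoder (alphabetical) order.
train_counts = {
    'angry': 3804,
    'happy': 6376,
    'neutral': 4379,
    'sad': 4244,
    'stressed': 5908,
}

n_samples = sum(train_counts.values())   # total training images
n_classes = len(train_counts)

# Balanced weight: rarer classes get weights above 1, common ones below 1.
weights = {
    index: n_samples / (n_classes * count)
    for index, count in enumerate(train_counts.values())
}

for (name, count), weight in zip(train_counts.items(), weights.values()):
    print(f"{name}: count={count}, weight={weight:.4f}")
```

The smallest class (angry) receives the largest weight and the largest class (happy) the smallest, so errors on under-represented classes contribute more to the loss.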
