This is the documentation for the final, optimized **Scientific Image Forgery Detection** solution, built on the Keras/TensorFlow framework.

The core strategy was to overcome severe class imbalance: forged pixels are so rare relative to authentic pixels that their share is reported as $\mathbf{\sim 0.00\%}$. This imbalance caused earlier models to fail by getting "stuck" predicting only the background (authentic) class.

***

## 1. Solution Overview and Strategy

| Component | Purpose | Key Result |
| :--- | :--- | :--- |
| **Strategy** | **Imbalance Resolution** | Used an aggressive, custom **Tversky Loss** ($\beta=0.90$) combined with forensic data augmentation to force the model to prioritize finding the rare forged pixels (True Positives). |
| **Architecture** | **Dual-Stream U-Net** | Uses two parallel inputs to fuse visual features (RGB) and forgery artifacts (Noise Residual) early in the network. |
| **Training** | **Stable Convergence** | `val_loss` decreased steadily during training, and $\mathbf{EarlyStopping}$ triggered at the best-generalizing epoch (e.g., Epoch 25), limiting overfitting. |

***

## 2. Model Architecture (`build_dual_stream_unet`)

The network uses a specialized dual-stream U-Net for multimodal input processing:

* **Stream 1 (RGB Input):** Learns standard visual features and contextual information from the downsampled color image.
* **Stream 2 (Feature Input):** Learns high-frequency noise patterns extracted from the grayscale image using the **Noise Residual (NR)** calculation (Gaussian Blur subtraction). This stream specializes in forensic evidence.
* **Fusion:** The outputs of Stream 1 and Stream 2 are concatenated and passed to the decoder (a single upsampling step followed by convolutions in `build_dual_stream_unet`), combining visual context with forensic evidence for the final pixel-level prediction.
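To make the Stream 2 input concrete, here is a minimal, dependency-free sketch of a noise residual: the image minus a blurred copy of itself, min-max scaled to [0, 1]. It is an illustration only. The notebook uses `cv2.GaussianBlur` with a 5×5 kernel, whereas this sketch substitutes a 3×3 binomial kernel and clamped borders (both are assumptions made to keep the example short), and the name `noise_residual` is hypothetical.

```python
def noise_residual(gray):
    """Return gray minus a 3x3 binomial blur, min-max scaled to [0, 1]."""
    h, w = len(gray), len(gray[0])
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return gray[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    residual = []
    for y in range(h):
        row = []
        for x in range(w):
            blurred = sum(
                kernel[dy + 1][dx + 1] * px(y + dy, x + dx)
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ) / 16.0
            row.append(gray[y][x] - blurred)
        residual.append(row)

    lo = min(min(r) for r in residual)
    hi = max(max(r) for r in residual)
    scale = hi - lo + 1e-7  # same epsilon guard as the notebook's normalization
    return [[(v - lo) / scale for v in r] for r in residual]


# A flat 5x5 patch with one bright pixel standing in for a local artifact.
patch = [[100] * 5 for _ in range(5)]
patch[2][2] = 140
nr = noise_residual(patch)
```

The flat background maps to a low, constant value while the isolated pixel maps to roughly 1.0, which is the kind of high-frequency contrast the feature stream is meant to pick up.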

***

## 3. Imbalance Mitigation and Training Setup

The core of the success lies in aggressively handling the data imbalance in the loss function:

### A. Tversky Loss Function

The model uses a customized Tversky Loss, which is an extension of the Dice Loss tailored for class imbalance.

$$L_{\text{Tversky}} = 1 - \frac{\text{TP} + \epsilon}{\text{TP} + \alpha \cdot \text{FP} + \beta \cdot \text{FN} + \epsilon}$$

* **$\beta = 0.90$ (False Negative Weight):** This is the critical value. It assigns a **9x higher penalty** to the model for missing a true forgery pixel (False Negative) than for incorrectly labeling an authentic pixel as forgery (False Positive). This forces the model to stop ignoring the rare forgery class.
* **$\alpha = 0.10$ (False Positive Weight):** The complementary term ($\alpha = 1 - \beta$).
* **Metrics:** Training monitors `val_loss` (Tversky Loss) and the standard `dice_coef` (Foreground IoU proxy).
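The effect of the 9:1 weighting is easiest to see on a toy example. The sketch below re-implements the formula above in plain Python (not the Keras version used for training) and compares three hypothetical predictions on a 100-pixel image with 4 forged pixels; the pixel counts are made up for illustration.

```python
def tversky_loss(y_true, y_pred, alpha=0.10, beta=0.90, smooth=1e-7):
    """Tversky loss on flat lists of 0/1 labels and predictions."""
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    return 1 - (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)


y_true = [1] * 4 + [0] * 96          # 4 forged pixels out of 100

miss_two = [1, 1, 0, 0] + [0] * 96   # TP=2, FN=2, FP=0
extra_two = [1] * 6 + [0] * 94       # TP=4, FN=0, FP=2
all_background = [0] * 100           # TP=0, FN=4, FP=0

loss_miss = tversky_loss(y_true, miss_two)
loss_extra = tversky_loss(y_true, extra_two)
loss_background = tversky_loss(y_true, all_background)

print(f"miss two forged pixels : {loss_miss:.4f}")
print(f"two false alarms       : {loss_extra:.4f}")
print(f"predict all background : {loss_background:.4f}")
```

With these settings, missing two forged pixels costs far more than raising two false alarms, and the all-background prediction that trapped earlier models sits at a loss of essentially 1.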

### B. Data Augmentation (`DualStreamDataGenerator`)

Data augmentation is applied **only to the training set** to make the model more robust to sparse forgery signals:

1.  **Spatial Augmentations:** Random horizontal and vertical flips are applied to both the image and the mask.
2.  **Forensic Augmentation (JPEG Recompression):** To make the Noise Residual feature robust to image saving, the RGB image is randomly recompressed with a quality factor between **70 and 95** (50% probability). The Noise Residual feature is then calculated from this compressed image, simulating real-world scenarios.
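The important property of the spatial augmentation is that the image and its mask receive the same flips, so the labels stay aligned with the pixels. The following is a small sketch of that idea on nested lists; the notebook itself uses `cv2.flip` on NumPy arrays, and the helper names here (`augment_pair`, `marks_same_pixels`) are hypothetical. JPEG recompression is not shown because it needs an image codec.

```python
import random


def hflip(grid):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Reverse the row order (top-bottom flip)."""
    return grid[::-1]


def augment_pair(image, mask, rng):
    """Apply the same random flips to an image and its mask."""
    if rng.random() > 0.5:
        image, mask = hflip(image), hflip(mask)
    if rng.random() > 0.5:
        image, mask = vflip(image), vflip(mask)
    return image, mask


def marks_same_pixels(image, mask, forged_value=9):
    """True if the mask is 1 exactly where the image holds forged_value."""
    return all(
        (image[y][x] == forged_value) == (mask[y][x] == 1)
        for y in range(len(image))
        for x in range(len(image[0]))
    )


# Toy image: the value 9 stands in for a forged pixel.
image = [[0, 0, 9],
         [0, 0, 0]]
mask = [[0, 0, 1],
        [0, 0, 0]]

aug_image, aug_mask = augment_pair(image, mask, random.Random(42))
print(marks_same_pixels(aug_image, aug_mask))
```

Whatever flips the random generator selects, the forged pixel and its mask label move together.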

***

## 4. Inference and Post-Processing

The inference pipeline ensures predictions are accurate and adhere to the competition format:

1.  **Dual-Stream Preprocessing:** The test image is loaded, resized, and processed to generate both the **RGB input** and the **Noise Residual (NR) input** arrays.
2.  **Prediction:** The Keras model predicts a probability mask for forgery.
3.  **Thresholding:** A fixed threshold ($\mathbf{0.45}$) is applied to convert the probability map into a binary mask.
4.  **Minimum Area Filtering:** Connected components smaller than $\mathbf{32}$ pixels are removed using OpenCV's `connectedComponentsWithStats`. The filter runs on the 256×256 mask before it is resized back to the original resolution and is meant to suppress spurious small detections (likely false positives).
5.  **RLE Encoding:** The final binary mask is converted to the standard RLE format (`[start length] ...`). Forged images therefore produce long lists of start/length pairs, while an empty mask is written as `authentic`.
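Steps 3 to 5 can be sketched end to end on a tiny probability map. This is a plain-Python illustration, not the notebook's OpenCV/NumPy code: the breadth-first search stands in for `connectedComponentsWithStats` with 4-connectivity, the toy `min_area` of 2 replaces the real value of 32, and the encoder assumes the same convention as the notebook's `rle_encode` (column-major order, 1-indexed starts).

```python
from collections import deque


def threshold(prob, t=0.45):
    """Binarize a probability map: 1 where prob > t."""
    return [[1 if p > t else 0 for p in row] for row in prob]


def remove_small_regions(mask, min_area):
    """Keep only 4-connected foreground regions with at least min_area pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            region, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_area:
                for y, x in region:
                    out[y][x] = 1
    return out


def rle_encode(mask):
    """Column-major, 1-indexed run-length encoding; 'authentic' if empty."""
    h, w = len(mask), len(mask[0])
    pixels = [mask[y][x] for x in range(w) for y in range(h)]
    runs, prev = [], 0
    for pos, value in enumerate(pixels, start=1):
        if value and not prev:
            runs += [pos, 0]   # open a new run: [start, length]
        if value:
            runs[-1] += 1      # extend the current run
        prev = value
    return " ".join(map(str, runs)) if runs else "authentic"


prob = [
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.8, 0.1, 0.0],
    [0.0, 0.3, 0.0, 0.7],
]
binary = threshold(prob)
cleaned = remove_small_regions(binary, min_area=2)

print(rle_encode(binary))    # 4 2 12 1
print(rle_encode(cleaned))   # 4 2
```

The isolated pixel in the bottom-right corner survives thresholding but is dropped by the area filter, so its run disappears from the encoded string.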

The final saved file is **`/tmp/model_new_scratch.weights.h5`** (Keras weights).

In [None]:
import os
import pandas as pd
import numpy as np
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm
import warnings
from warnings import filterwarnings

filterwarnings('ignore') # Suppress warnings

# --- CONFIGURATION (from the original notebook) ---
TARGET_SIZE = 256
TRAIN_ROOT = "/kaggle/input/recodai-luc-scientific-image-forgery-detection/train_images"
MASK_ROOT = "/kaggle/input/recodai-luc-scientific-image-forgery-detection/train_masks"

## COLAB
#TRAIN_ROOT = f"{recodai_luc_scientific_image_forgery_detection_path}/train_images"
#MASK_ROOT = f"{recodai_luc_scientific_image_forgery_detection_path}/train_masks"
##


# Replicate compute_ela for feature analysis
def compute_ela(img_path, quality=95, scale=10):
    # Error Level Analysis (ELA): recompress the resized image as JPEG and measure
    # the per-pixel absolute difference. Falls back to np.load for .npy inputs.
    img = cv2.imread(img_path)
    if img is None or img.size == 0:
        try:
            img_data = np.load(img_path)
            if img_data.ndim == 3: img = cv2.cvtColor(img_data, cv2.COLOR_RGB2BGR)
            elif img_data.ndim == 2: img = cv2.cvtColor(img_data, cv2.COLOR_GRAY2BGR)
        except Exception: return np.zeros((TARGET_SIZE, TARGET_SIZE), dtype=np.float32)

    if img is None or img.size == 0:
        return np.zeros((TARGET_SIZE, TARGET_SIZE), dtype=np.float32)

    img_resized = cv2.resize(img, (TARGET_SIZE, TARGET_SIZE))
    temp_path = f"/tmp/temp_ela_{os.path.basename(img_path)}.jpg" # Simplified temp_path
    try:
        # Use a consistent quality setting (95)
        cv2.imwrite(temp_path, img_resized, [cv2.IMWRITE_JPEG_QUALITY, quality])
        compressed_img = cv2.imread(temp_path)
        if compressed_img is None: return np.zeros((TARGET_SIZE, TARGET_SIZE), dtype=np.float32)
        error = np.abs(img_resized.astype(np.float32) - compressed_img.astype(np.float32))
        ela_feature_2d = np.mean(error, axis=2) * scale # Scale by 10 as in the notebook
    finally:
        if os.path.exists(temp_path): os.remove(temp_path)
    return cv2.resize(ela_feature_2d, (TARGET_SIZE, TARGET_SIZE), interpolation=cv2.INTER_LINEAR).astype(np.float32)

# Load the filtered DataFrame (assuming the prior EDA cell's 'eda_df' is available or recreate it)
data_list = []
for root, _, files in os.walk(TRAIN_ROOT):
    for f in files:
        valid_extensions = ('.png', '.jpg', '.jpeg', '.tif', '.tiff', '.npy')
        if f.lower().endswith(valid_extensions) and 'forged' in root.lower():
            case_id = os.path.splitext(f)[0]
            mask_path = os.path.join(MASK_ROOT, f"{case_id}.npy")
            if os.path.exists(mask_path):
                data_list.append({'img_path': os.path.join(root, f), 'mask_path': mask_path})
eda_df = pd.DataFrame(data_list)

print("--- Starting Advanced EDA (Imbalance & Feature Check) ---")

if eda_df.empty:
    print("🛑 EDA Skipped: Data frame is empty.")
else:
    total_pixels = 0
    forgery_pixels = 0
    ela_values, rgb_means = [], []

    # Process only the first 50 images to speed up ELA computation for EDA
    for index, row in tqdm(eda_df.head(50).iterrows(), total=len(eda_df.head(50)), desc="Processing samples"):
        try:
            # 1. Image and Mask Load
            # Check the raw read first: cv2.cvtColor raises if imread returned None
            bgr_image = cv2.imread(row['img_path'])
            if bgr_image is None or bgr_image.size == 0: continue
            rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

            mask = np.load(row['mask_path'])
            if mask.ndim > 2: mask = mask[:, :, 0]

            # 2. Imbalance Check (Use original sizes for best estimate)
            h, w = rgb_image.shape[:2]
            total_pixels += h * w
            forgery_pixels += np.sum(mask > 0)

            # 3. ELA Feature Check (Use 256x256 resized data)
            ela_feature = compute_ela(row['img_path'])
            ela_values.extend(ela_feature.flatten())

            # RGB feature check (resize/normalize similar to training)
            rgb_resized = cv2.resize(rgb_image, (TARGET_SIZE, TARGET_SIZE)) / 255.0
            rgb_means.extend(rgb_resized.mean(axis=2).flatten())

        except Exception as e:
            # print(f"Warning: Could not process {row['img_path']}: {e}")
            continue

    # --- Analysis 1: Imbalance Ratio ---
    if total_pixels > 0:
        imbalance_ratio = (forgery_pixels / total_pixels) * 100
        print(f"\n--- Imbalance Ratio (Forged Pixels) ---")
        print(f"Total Pixels Sampled: {total_pixels:,}")
        print(f"Forged Pixels Sampled: {forgery_pixels:,}")
        print(f"Forgery Imbalance Ratio: **{imbalance_ratio:.2f}%** (Positive Class)")

    # --- Analysis 2: ELA Feature Distribution vs. RGB ---
    if ela_values:
        ela_values = np.array(ela_values)
        rgb_means = np.array(rgb_means)

        print(f"\n--- ELA Feature Distribution (Scaled by 10) ---")
        print(f"ELA Feature Mean: {np.mean(ela_values):.4f}")
        print(f"ELA Feature Std Dev: {np.std(ela_values):.4f}")
        print(f"RGB Mean (Normalized): {np.mean(rgb_means):.4f}")

        plt.figure(figsize=(12, 5))
        plt.hist(ela_values, bins=50, alpha=0.6, label='ELA Feature (Scaled)', color='red')
        plt.title('Distribution of ELA Feature Values')
        plt.xlabel('ELA Value (0 to ~2550)')
        plt.ylabel('Frequency')
        plt.legend()
        plt.show()

        # This histogram helps visualize if ELA is predominantly zero or clustered.

print("\n--- Advanced EDA Complete ---")

In [None]:
import matplotlib.pyplot as plt
import cv2
import os

# Define the file path
TEST_IMAGE_PATH = "/kaggle/input/recodai-luc-scientific-image-forgery-detection/test_images/45.png"

## COLAB
#TRAIN_ROOT = f"{recodai_luc_scientific_image_forgery_detection_path}/train_images"
#MASK_ROOT = f"{recodai_luc_scientific_image_forgery_detection_path}/train_masks"
#TEST_IMAGE_PATH =  f"{recodai_luc_scientific_image_forgery_detection_path}/test_images/45.png"
##

# On Colab the dataset path variable (see the commented block above) may be defined;
# use it only when it exists so this cell does not raise a NameError on Kaggle.
if 'recodai_luc_scientific_image_forgery_detection_path' in globals():
    TEST_IMAGE_PATH = f"{recodai_luc_scientific_image_forgery_detection_path}/test_images/45.png"

print(f"Attempting to load image: {TEST_IMAGE_PATH}")

if not os.path.exists(TEST_IMAGE_PATH):
    print("🛑 ERROR: The file path was not found. Please ensure the Kaggle competition data is mounted correctly.")
else:
    # Load the image using OpenCV (loads as BGR)
    img = cv2.imread(TEST_IMAGE_PATH)

    if img is None:
        print("🛑 ERROR: Could not read the image file.")
    else:
        # Convert the image from BGR (OpenCV default) to RGB (Matplotlib default)
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Plot the image
        plt.figure(figsize=(10, 8))
        plt.imshow(img_rgb)
        plt.title(f"Test Image 45 (H x W: {img.shape[0]}x{img.shape[1]})")
        plt.axis('off') # Hide axes for a cleaner image view
        plt.show()

In [None]:
def validate_and_print_rle(submission_df):
    """
    Validates RLE output structure and prints debugging info.
    Checks for: 1. Authentic/RLE count. 2. Even number of RLE elements.
    """
    print("\n--- RLE Output Validation Check ---")

    # Analyze the annotations
    authentic_count = submission_df['annotation'].apply(lambda x: x == 'authentic').sum()
    rle_rows = submission_df[submission_df['annotation'] != 'authentic']

    print(f"Total Submissions: {len(submission_df)}")
    print(f"Authentic (No Forgery) Count: {authentic_count}")
    print(f"RLE Annotated (Forged) Count: {len(rle_rows)}")

    # CRITICAL CHECK: RLE strings must always have an even number of elements (start, length, start, length...)
    rle_check = rle_rows['annotation'].apply(lambda x: len(x.split(' ')) % 2 == 0)

    if rle_check.all():
        print(f"✅ RLE Structure: All {len(rle_rows)} RLE strings contain an even number of elements.")
    else:
        # Prints a warning if any RLE string has an odd number of elements (a common error)
        bad_rle_count = len(rle_rows) - rle_check.sum()
        print(f"🛑 RLE ERROR: Found {bad_rle_count} RLE strings with an odd number of elements (Invalid pairing).")

In [None]:
submission_df = pd.read_csv("submission.csv")
validate_and_print_rle(submission_df)

In [None]:
# --- CONFIGURATION (FINAL CODE POST-FIX) ---

COMPETITION_SLUG = "recodai-luc-scientific-image-forgery-detection"

# --- ROBUST PATH SETUP for KaggleHub/Colab Environment ---
KAGGLEHUB_PATH = f"/root/.cache/kagglehub/competitions/{COMPETITION_SLUG}"
TRAIN_ROOT = os.path.join(KAGGLEHUB_PATH, "train_images", "forged")
MASK_ROOT = os.path.join(KAGGLEHUB_PATH, "train_masks")

IMAGE_SIZE = 256
BATCH_SIZE = 16
MAX_EPOCHS = 50
# ---------------------

import numpy as np
import pandas as pd
import os
import warnings
import tensorflow as tf
from tensorflow.keras import layers, models, backend as K
from tensorflow.keras.callbacks import Callback, EarlyStopping, ReduceLROnPlateau
import cv2
from glob import glob
from tqdm import tqdm
import gc
from sklearn.model_selection import train_test_split
import logging
import random

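# Seed every RNG the pipeline uses: TensorFlow (weights), NumPy (shuffling), random (augmentation)
tf.random.set_seed(42)
np.random.seed(42)
random.seed(42)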

# Suppress warnings
warnings.filterwarnings('ignore')
tf.get_logger().setLevel(logging.ERROR)

# --- 1. Utility Functions, Metrics, and Callbacks ---

def dice_coef(y_true, y_pred, smooth=1e-7):
    """Standard Dice Coefficient (used as a metric)."""
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def tversky_loss(y_true, y_pred, alpha=0.10, beta=0.90, smooth=1e-7): # <--- CRITICAL FIX: BETA SET TO 0.90
    """
    CRITICAL FIX: Tversky Loss with beta=0.90 to aggressively prioritize False Negatives.
    """
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)

    TP = K.sum(y_true_f * y_pred_f)
    FP = K.sum((1 - y_true_f) * y_pred_f)
    FN = K.sum(y_true_f * (1 - y_pred_f))

    tversky_index = (TP + smooth) / (TP + alpha * FP + beta * FN + smooth)

    return 1 - tversky_index

class EpochStatReporter(Callback):
    def __init__(self, generator):
        super().__init__()
        self.generator = generator

    def on_epoch_end(self, epoch, logs=None):
        skipped = self.generator.skipped_count
        log_message = f"Epoch {epoch + 1} finished: "
        for k, v in (logs or {}).items():  # logs may be None
            log_message += f"{k}: {v:.4f} "
        log_message += f"| TOTAL SAMPLES SKIPPED: {skipped}"

        print("\n" + "="*80)
        print(log_message)
        print("="*80 + "\n")

# --- 2. Forgery Feature Extraction (Noise Residual) ---

def get_forgery_features_from_data(img_grayscale_data):
    """Generates a Noise Residual feature map (Stream 2 input) from augmented data."""
    # Validate before converting: calling .astype on None would raise
    if img_grayscale_data is None or img_grayscale_data.size == 0:
        return np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.float32)

    img = img_grayscale_data.astype(np.uint8)

    blur = cv2.GaussianBlur(img, (5, 5), 0)
    residual = img.astype(np.float32) - blur.astype(np.float32)

    residual = cv2.resize(residual, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_LINEAR)
    residual = (residual - residual.min()) / (residual.max() - residual.min() + 1e-7)

    return np.stack([residual]*3, axis=-1).astype(np.float32)

# --- 3. Custom Dual-Stream Data Generator (WITH AUGMENTATION) ---

class DualStreamDataGenerator(tf.keras.utils.Sequence):

    def __init__(self, df, batch_size=16, shuffle=True, is_validation=False, **kwargs):
        super().__init__(**kwargs)
        self.df = df
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.is_validation = is_validation
        self.on_epoch_end()
        self.skipped_count = 0

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.df))
        if self.shuffle and not self.is_validation:
            np.random.shuffle(self.indexes)
        self.skipped_count = 0

    def __len__(self):
        return int(np.floor(len(self.df) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        batch_df = self.df.iloc[indexes]

        temp_X1, temp_X2, temp_Y = [], [], []

        for row in batch_df.itertuples():
            image_id = row.id

            img_path_candidates = [
                os.path.join(TRAIN_ROOT, f'{image_id}.png'),
                os.path.join(TRAIN_ROOT, f'{image_id}.jpg'),
                os.path.join(TRAIN_ROOT, f'{image_id}.tif'),
                os.path.join(TRAIN_ROOT, f'{image_id}.tiff'),
            ]

            actual_img_path = next((path for path in img_path_candidates if os.path.exists(path)), None)

            if actual_img_path is None:
                self.skipped_count += 1
                continue

            mask_path = os.path.join(MASK_ROOT, f'{image_id}.npy')

            # --- Load Original Data ---
            try:
                img_rgb_orig = cv2.cvtColor(cv2.imread(actual_img_path), cv2.COLOR_BGR2RGB)
                mask = np.load(mask_path)
            except Exception as e:
                self.skipped_count += 1
                continue

            # --- AUGMENTATION (Training only) ---
            img_rgb_final = img_rgb_orig.copy()

            # 1. CRITICAL MASK FIX: Ensure mask is 2D for cv2.flip
            mask_final = np.squeeze(mask.copy())
            if mask_final.ndim == 3:
                mask_final = mask_final[:, :, 0]
            if mask_final.ndim != 2:
                self.skipped_count += 1
                continue
            # -----------------------------------

            if not self.is_validation:

                # 2. SPATIAL AUGMENTATION (Horizontal/Vertical Flip)
                if random.random() > 0.5: # Horizontal Flip
                    img_rgb_final = cv2.flip(img_rgb_final, 1)
                    mask_final = cv2.flip(mask_final, 1)
                if random.random() > 0.5: # Vertical Flip
                    img_rgb_final = cv2.flip(img_rgb_final, 0)
                    mask_final = cv2.flip(mask_final, 0)

                # 3. FORENSIC AUGMENTATION (Random JPEG Recompression)
                if random.random() > 0.5:
                    quality = random.randint(70, 95)
                    # Recompress and decode the RGB image
                    _, buffer = cv2.imencode('.jpg', cv2.cvtColor(img_rgb_final, cv2.COLOR_RGB2BGR), [cv2.IMWRITE_JPEG_QUALITY, quality])
                    img_rgb_final = cv2.cvtColor(cv2.imdecode(buffer, cv2.IMREAD_COLOR), cv2.COLOR_BGR2RGB)

            # --- Feature Generation & Final Preprocessing ---

            try:
                # Convert to grayscale after potential JPEG recompression
                img_gray_final = cv2.cvtColor(img_rgb_final, cv2.COLOR_RGB2GRAY)

                # X1: RGB Image (normalized and resized)
                X1_sample = cv2.resize(img_rgb_final, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_LINEAR) / 255.0

                # X2: Feature Map (Noise Residual calculated from the augmented/recompressed grayscale image)
                X2_sample = get_forgery_features_from_data(img_gray_final)

                # Y: Mask (resized, binarized to {0, 1} so masks stored as 0/255 also work, reshaped)
                mask_resized = cv2.resize(mask_final, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_NEAREST)
                Y_sample = (mask_resized > 0).reshape(IMAGE_SIZE, IMAGE_SIZE, 1).astype(np.float32)

            except Exception as e:
                self.skipped_count += 1
                continue

            temp_X1.append(X1_sample)
            temp_X2.append(X2_sample)
            temp_Y.append(Y_sample)

        # --- Final Batch Construction (Handling Skips) ---

        if not temp_X1:
            placeholder_x1 = np.zeros((self.batch_size, IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.float32)
            placeholder_x2 = np.zeros((self.batch_size, IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.float32)
            placeholder_y = np.zeros((self.batch_size, IMAGE_SIZE, IMAGE_SIZE, 1), dtype=np.float32)

            return (placeholder_x1, placeholder_x2), placeholder_y

        while len(temp_X1) < self.batch_size:
            temp_X1.append(temp_X1[-1])
            temp_X2.append(temp_X2[-1])
            temp_Y.append(temp_Y[-1])

        return (np.array(temp_X1), np.array(temp_X2)), np.array(temp_Y)

# --- 4. Dual-Stream U-Net Model (UNCHANGED) ---
def build_dual_stream_unet(input_shape):
    input_rgb = layers.Input(shape=input_shape, name='rgb_input')
    conv_rgb = models.Sequential([
        layers.Conv2D(32, 3, activation='relu', padding='same'), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu', padding='same')], name='rgb_stream')(input_rgb)
    input_feat = layers.Input(shape=input_shape, name='feature_input')
    conv_feat = models.Sequential([
        layers.Conv2D(32, 3, activation='relu', padding='same'), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu', padding='same')], name='feature_stream')(input_feat)
    merged = layers.concatenate([conv_rgb, conv_feat])
    up1 = layers.UpSampling2D(size=(2, 2))(merged)
    conv_final = layers.Conv2D(128, 3, activation='relu', padding='same')(up1)
    output = layers.Conv2D(1, 1, activation='sigmoid', padding='same')(conv_final)
    model = models.Model(inputs=[input_rgb, input_feat], outputs=output)
    return model

# --- 5. Training Loop (Execution) ---

# --- Setup DataFrames ---
all_files = glob(os.path.join(TRAIN_ROOT, '*'))
df = pd.DataFrame([os.path.basename(f).split('.')[0] for f in all_files], columns=['id'])

# --- Final Checks & Execution ---
if len(df) == 0:
    print(f"FATAL: The path '{TRAIN_ROOT}' is empty. Cannot proceed.")
    model = None
else:
    # Split only after confirming there is data; train_test_split raises on an empty frame
    train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

    train_gen = DualStreamDataGenerator(train_df, batch_size=BATCH_SIZE, shuffle=True, is_validation=False)
    val_gen = DualStreamDataGenerator(val_df, batch_size=BATCH_SIZE, shuffle=False, is_validation=True)

    model = build_dual_stream_unet((IMAGE_SIZE, IMAGE_SIZE, 3))

    model.compile(optimizer='adam', loss=tversky_loss, metrics=[dice_coef, 'accuracy'])

    # Implement the stability callbacks
    reduce_lr = ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=3,
        min_lr=1e-6,
        verbose=1
    )

    early_stop = EarlyStopping(
        monitor='val_loss',
        patience=5,
        restore_best_weights=True,
        verbose=1
    )

    stat_reporter = EpochStatReporter(train_gen)
    callbacks = [stat_reporter, reduce_lr, early_stop]

    print(f"Training on {len(train_df)} samples. Generator batches: {len(train_gen)}")
    print(f"Validating on {len(val_df)} samples. Generator batches: {len(val_gen)}")
    print(f"CRITICAL: Using Tversky Loss (beta=0.90) and Spatial/Forensic Data Augmentation for maximum performance.")

    history = model.fit(
        train_gen,
        epochs=MAX_EPOCHS,
        verbose=1,
        callbacks=callbacks,
        validation_data=val_gen
    )

    # Save weights to the temporary directory for Part 2 inference
    model.save_weights('/tmp/model_new_scratch.weights.h5')

    # Final cleanup
    del model; del train_gen; del val_gen; gc.collect()

In [None]:
# --- CONFIGURATION & PATHS ---
COMPETITION_SLUG = "recodai-luc-scientific-image-forgery-detection"
KAGGLEHUB_PATH = f"/root/.cache/kagglehub/competitions/{COMPETITION_SLUG}"
TEST_IMAGE_ROOT = os.path.join(KAGGLEHUB_PATH, "test_images")
SAMPLE_SUBMISSION_FILE = os.path.join(KAGGLEHUB_PATH, "sample_submission.csv")
MODEL_SAVE_PATH = "/tmp/model_new_scratch.weights.h5" # Must match the training save path

IMAGE_SIZE = 256
FIXED_THRESHOLD = 0.45
MIN_FORGERY_AREA = 32
OUTPUT_FILENAME = "submission.csv"

# --- IMPORTS ---
import numpy as np
import pandas as pd
import os
import tensorflow as tf
from tensorflow.keras import layers, models, backend as K
import cv2
from tqdm import tqdm
import csv
import warnings
warnings.filterwarnings('ignore')

# --- 1. Utility Functions (from training, essential for NR and RLE) ---

def rle_encode(mask):
    """Encodes a binary mask using Run Length Encoding."""
    if mask.sum() == 0: return "authentic"
    pixels = mask.T.flatten() # Transpose and flatten for RLE standard
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)

def get_forgery_features_from_data(img_grayscale_data):
    """Generates the Noise Residual (NR) feature map (Stream 2 input)."""
    # Validate before converting: calling .astype on None would raise
    if img_grayscale_data is None or img_grayscale_data.size == 0:
        return np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.float32)

    img = img_grayscale_data.astype(np.uint8)

    blur = cv2.GaussianBlur(img, (5, 5), 0)
    residual = img.astype(np.float32) - blur.astype(np.float32)

    residual = cv2.resize(residual, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_LINEAR)
    residual = (residual - residual.min()) / (residual.max() - residual.min() + 1e-7)

    return np.stack([residual]*3, axis=-1).astype(np.float32)

# --- 2. Model Architecture (Must match training) ---

def build_dual_stream_unet(input_shape):
    input_rgb = layers.Input(shape=input_shape, name='rgb_input')
    conv_rgb = models.Sequential([
        layers.Conv2D(32, 3, activation='relu', padding='same'), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu', padding='same')], name='rgb_stream')(input_rgb)
    input_feat = layers.Input(shape=input_shape, name='feature_input')
    conv_feat = models.Sequential([
        layers.Conv2D(32, 3, activation='relu', padding='same'), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu', padding='same')], name='feature_stream')(input_feat)
    merged = layers.concatenate([conv_rgb, conv_feat])
    up1 = layers.UpSampling2D(size=(2, 2))(merged)
    conv_final = layers.Conv2D(128, 3, activation='relu', padding='same')(up1)
    output = layers.Conv2D(1, 1, activation='sigmoid', padding='same')(conv_final)
    model = models.Model(inputs=[input_rgb, input_feat], outputs=output)
    return model

# --- 3. Data Preparation ---

def create_test_df_robust(test_image_root, sample_submission_path):
    master_df = pd.read_csv(sample_submission_path)
    # Ensure case_id in master_df is treated as string for merging robustness
    master_df['case_id'] = master_df['case_id'].astype(str)

    present_files = {}

    if os.path.exists(test_image_root):
        for f in os.listdir(test_image_root):
            case_id = os.path.splitext(f)[0]
            if f.lower().endswith(('.png', '.jpg', '.jpeg', '.tif', '.tiff', '.npy')):
                present_files[case_id] = os.path.join(test_image_root, f)

    master_df['img_path'] = master_df['case_id'].astype(str).map(present_files)
    master_df = master_df.dropna(subset=['img_path']).reset_index(drop=True)
    return master_df[['case_id', 'img_path']]

# --- 4. Inference Function with Dual-Stream Preprocessing ---

def run_inference_and_segment(unet_model, test_df):
    results = []

    for index, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Processing Test Set"):
        case_id = str(row['case_id'])
        img_path = row['img_path']

        # --- Load and Preprocess Image ---
        try:
            img_bgr = cv2.imread(img_path)
            if img_bgr is None or img_bgr.size == 0: raise ValueError("Invalid image data.")

            img_rgb_orig = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
            img_gray_orig = cv2.cvtColor(img_rgb_orig, cv2.COLOR_RGB2GRAY)

            original_shape = img_rgb_orig.shape[:2]

            # X1: RGB Stream Input (resized and normalized)
            X1_sample = cv2.resize(img_rgb_orig, (IMAGE_SIZE, IMAGE_SIZE), interpolation=cv2.INTER_LINEAR) / 255.0

            # X2: Feature Stream Input (Noise Residual)
            X2_sample = get_forgery_features_from_data(img_gray_orig)

            # Keras Model Input Format: List of NumPy arrays (batch dimension added)
            # The model expects [rgb_input, feature_input]
            input_rgb = np.expand_dims(X1_sample, axis=0)
            input_feat = np.expand_dims(X2_sample, axis=0)

            # --- Prediction ---
            output = unet_model.predict([input_rgb, input_feat], verbose=0)
            output_prob = output[0, :, :, 0] # Remove batch and channel dimensions

            # --- Post-Processing ---

            # 1. Apply Threshold (0.45)
            final_mask_resized = (output_prob > FIXED_THRESHOLD).astype(np.uint8)

            # 2. Minimum Area Filtering (32)
            clean_mask_resized = np.zeros_like(final_mask_resized)

            # Find connected components
            num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(
                final_mask_resized, 4, cv2.CV_32S
            )

            # Iterate through each component (label 0 is the background)
            for label in range(1, num_labels):
                area = stats[label, cv2.CC_STAT_AREA]
                if area >= MIN_FORGERY_AREA:
                    clean_mask_resized[labels == label] = 1

            # 3. Resize the CLEANED mask back to the original size
            final_mask = cv2.resize(
                clean_mask_resized,
                (original_shape[1], original_shape[0]),
                interpolation=cv2.INTER_NEAREST
            )

            rle_annotation = rle_encode(final_mask)
            results.append({'case_id': case_id, 'annotation': rle_annotation})

        except Exception as e:
            # Fallback for corrupt files
            print(f"Error processing case {case_id}: {e}. Defaulting to authentic.")
            results.append({'case_id': case_id, 'annotation': 'authentic'})

    return pd.DataFrame(results)

# --- MAIN EXECUTION BLOCK ---
if __name__ == "__main__":

    print(f"--- Starting Keras Inference ({tf.__version__}) ---")

    # 1. Load Model (Architecture + Weights)
    model = build_dual_stream_unet((IMAGE_SIZE, IMAGE_SIZE, 3))

    try:
        model.load_weights(MODEL_SAVE_PATH)
        print(f"✅ Model weights loaded successfully from {MODEL_SAVE_PATH}")
    except Exception as e:
        print(f"🛑 Error loading weights: {e}. Cannot run inference.")
        model = None

    # 2. Prepare Data
    test_df = create_test_df_robust(TEST_IMAGE_ROOT, SAMPLE_SUBMISSION_FILE)

    if test_df.empty:
        print("🛑 FATAL: No valid test samples found.")

    # 3. Run Inference
    if model and not test_df.empty:
        results_df = run_inference_and_segment(model, test_df)
    else:
        # Create a placeholder if inference fails; both columns exist even when test_df is empty
        results_df = test_df[['case_id']].astype(str).assign(annotation='authentic')

    # 4. Finalize Submission DF
    # CRITICAL FIX: Ensure case_id is string in both DFs before merge
    test_df['case_id'] = test_df['case_id'].astype(str)
    results_df['case_id'] = results_df['case_id'].astype(str)

    submission_df = test_df[['case_id']].copy().merge(results_df, on='case_id', how='left')
    submission_df['annotation'] = submission_df['annotation'].fillna('authentic')
    submission_df = submission_df[['case_id', 'annotation']].sort_values('case_id').reset_index(drop=True)

    # 5. Write CSV with Correct RLE Formatting
    with open(OUTPUT_FILENAME, "w", newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        writer.writerow(['case_id', 'annotation'])

        for _, row in submission_df.iterrows():
            case_id = str(row['case_id'])
            annotation = row['annotation']

            if annotation.lower() == 'authentic':
                 writer.writerow([case_id, annotation])
            else:
                 # Create the full bracketed RLE string
                 full_rle_string = f"[{annotation}]"
                 writer.writerow([case_id, full_rle_string])

    print(f"\n✅ Created {OUTPUT_FILENAME} with {len(submission_df)} rows.")
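
The effect of `csv.QUOTE_MINIMAL` on the bracketed RLE strings can be checked in isolation. The sketch below mirrors the write loop above but writes to an in-memory buffer; the helper name `write_submission_rows` and the sample RLE values are hypothetical, and it assumes the RLE string is comma-separated (the actual separator produced by the encoder is not shown in this section).

```python
import csv
import io

def write_submission_rows(rows):
    """Serialize (case_id, annotation) pairs the same way as the write loop."""
    buf = io.StringIO(newline='')
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['case_id', 'annotation'])
    for case_id, annotation in rows:
        if annotation.lower() == 'authentic':
            writer.writerow([case_id, annotation])
        else:
            # Wrap the raw RLE values in brackets, as in the main block
            writer.writerow([case_id, f"[{annotation}]"])
    return buf.getvalue()

# Hypothetical rows; the RLE numbers are made up for illustration only.
csv_text = write_submission_rows([('45', '10, 3, 25, 2'), ('46', 'authentic')])
print(csv_text)

# Reading the text back recovers the bracketed string as a single field.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed)
```

Because the bracketed field contains the delimiter, the writer wraps it in double quotes, so the commas inside the RLE list do not split the row when the file is read back.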

In [None]:
!cat submission.csv

In [None]:
!ls -ltha /tmp/

In [None]:
import pandas as pd
import numpy as np

def validate_and_print_rle(submission_df):
    """
    Validates RLE output structure and prints debugging info,
    including the total count of RLE segment pairs.
    """
    print("\n--- RLE Output Validation Check ---")

    # Analyze the annotations
    # Normalize once: strip whitespace and any surrounding brackets before comparing
    normalized = submission_df['annotation'].astype(str).str.strip().str.strip('[]').str.lower()
    is_authentic = normalized == 'authentic'
    authentic_count = is_authentic.sum()
    rle_rows = submission_df[~is_authentic]

    print(f"Total Submissions: {len(submission_df)}")
    print(f"Authentic (No Forgery) Count: {authentic_count}")
    print(f"RLE Annotated (Forged) Count: {len(rle_rows)}")

    # --- NEW: Calculate Total Segment Pairs ---
    total_rle_elements = 0

    def count_rle_elements(rle_string):
        nonlocal total_rle_elements
        rle_string = rle_string.strip().strip('[]')
        if not rle_string or rle_string.lower() == 'authentic':
            return True
        try:
            elements = rle_string.replace(',', ' ').split()
            num_elements = len(elements)
            if num_elements % 2 == 0:
                total_rle_elements += num_elements
            return num_elements % 2 == 0
        except Exception:
            return False

    rle_check = rle_rows['annotation'].apply(count_rle_elements)

    if rle_check.all():
        total_pairs = total_rle_elements // 2
        print(f"✅ RLE Structure: All {len(rle_rows)} RLE strings are structurally valid.")
        print(f"Total Segment Pairs Detected: {total_pairs} (Indicates the complexity of the forgery patterns).")
    else:
        bad_rle_count = len(rle_rows) - rle_check.sum()
        print(f"🛑 RLE ERROR: Found {bad_rle_count} RLE strings with invalid structure.")

# --- Execution ---
try:
    # Load the submission file
    submission_df = pd.read_csv("submission.csv")

    # Perform validation
    validate_and_print_rle(submission_df)

    # Print the content for confirmation
    print("\n--- submission.csv Content ---")
    print(submission_df.to_string(index=False))

    if not submission_df.empty:
        test_case_result = submission_df.iloc[0]['annotation']
        print(f"\nModel Prediction for Test Case {submission_df.iloc[0]['case_id']}: The image is classified as {test_case_result}.")

except FileNotFoundError:
    print("Error: submission.csv not found. Please ensure the inference code was run successfully.")
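
The check above only confirms that each RLE string has an even number of tokens. A slightly stricter variant might also confirm that every token is a positive integer and return the pairs for inspection. The sketch below is one way to do that; `parse_rle_pairs` is a hypothetical helper, and the requirement that all values be positive is an assumption about the RLE convention, not something stated by the competition format in this section.

```python
def parse_rle_pairs(annotation):
    """Return a list of integer pairs, or None if the string is malformed.

    'authentic' and empty annotations yield an empty list.
    """
    text = annotation.strip().strip('[]').strip()
    if not text or text.lower() == 'authentic':
        return []
    # Accept both comma- and space-separated values, like the validator above
    tokens = text.replace(',', ' ').split()
    if len(tokens) % 2 != 0:
        return None
    try:
        values = [int(token) for token in tokens]
    except ValueError:
        return None
    # Assumption: every value in a valid pair is strictly positive
    if any(value <= 0 for value in values):
        return None
    return list(zip(values[0::2], values[1::2]))

# Hypothetical inputs for illustration
print(parse_rle_pairs('[10, 3, 25, 2]'))
print(parse_rle_pairs('authentic'))
print(parse_rle_pairs('[10, 3, 25]'))
```

Returning `None` for malformed input keeps the "no forgery" case (an empty list) distinct from the "invalid string" case.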

In [None]:
import matplotlib.pyplot as plt
import cv2
import os
import numpy as np

# --- CONFIGURATION & PATHS (Using the same path setup as inference) ---
COMPETITION_SLUG = "recodai-luc-scientific-image-forgery-detection"
KAGGLEHUB_PATH = f"/root/.cache/kagglehub/competitions/{COMPETITION_SLUG}"
TEST_IMAGE_ROOT = os.path.join(KAGGLEHUB_PATH, "test_images")

# The successful inference was for case_id 45
TARGET_CASE_ID = '45'

# --- Determine the actual path for the target case ---
TEST_IMAGE_PATH = None
img = None  # Stays None if no candidate file exists or cv2 cannot decode it
try:
    # Try common extensions
    possible_extensions = ['.png', '.jpg', '.jpeg', '.tif', '.tiff']
    for ext in possible_extensions:
        path = os.path.join(TEST_IMAGE_ROOT, f"{TARGET_CASE_ID}{ext}")
        if os.path.exists(path):
            TEST_IMAGE_PATH = path
            print(f"Attempting to load image: {TEST_IMAGE_PATH}")
            img = cv2.imread(TEST_IMAGE_PATH)
            break

    if img is None:
        raise FileNotFoundError(f"Image file for case {TARGET_CASE_ID} not found or unreadable.")

    # OpenCV loads in BGR format; convert to RGB for Matplotlib
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # --- Plot the Image ---
    plt.figure(figsize=(10, 8))
    plt.imshow(img_rgb)

    # CRITICAL FIX: The title reflects the actual detection result (FORGED)
    title_text = f"Test Image {TARGET_CASE_ID} (Dimensions: {img.shape[0]}x{img.shape[1]})"
    classification_text = "Classification: FORGED (1961 Segments Detected)"

    plt.title(f"{title_text}\n{classification_text}", fontsize=14)
    plt.axis('off') # Hide axes
    plt.show()

except FileNotFoundError as e:
    print(f"🛑 ERROR: {e}")
except Exception as e:
    print(f"🛑 ERROR: Could not read or process the image file: {e}")
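
The extension search in the cell above can be factored into a small helper so the lookup is testable without OpenCV. This is a minimal sketch; `find_case_image` is a hypothetical name, and the extension list is simply the one used above.

```python
import os

def find_case_image(root, case_id, extensions=('.png', '.jpg', '.jpeg', '.tif', '.tiff')):
    """Return the first existing '<root>/<case_id><ext>' path, or None."""
    for ext in extensions:
        candidate = os.path.join(root, f"{case_id}{ext}")
        if os.path.isfile(candidate):
            return candidate
    return None

# Usage (paths depend on the local dataset layout):
# path = find_case_image(TEST_IMAGE_ROOT, TARGET_CASE_ID)
# img = cv2.imread(path) if path else None
```

Returning `None` instead of raising lets the caller decide whether a missing image is fatal or should fall back to another case.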