# WARNING: Do not run this notebook expecting a high score. It deliberately tests bad assumptions for educational purposes! Check other posts for good scores.

# Scientific Image Forgery Detector 🔬

Using Faster R-CNN to detect manipulated regions in scientific images

In [None]:
# thanks to antonoof's work (https://www.kaggle.com/code/antonoof/eda-r-cnn-model), his guide is very good

In [None]:
# Imports
import os, cv2, json, torch, torchvision, numpy as np, pandas as pd
from PIL import Image
from tqdm import tqdm
import torch.nn.functional as F
from torch.utils.data import Dataset
import matplotlib.pyplot as plt
from torchvision import transforms
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

## Load and Prepare Data

Let's load all our images into memory for faster training

In [None]:
# Paths
BASE = '/kaggle/input/recodai-luc-scientific-image-forgery-detection'
authentic_path = f'{BASE}/train_images/authentic'
forged_path = f'{BASE}/train_images/forged'
mask_path = f'{BASE}/train_masks'
test_path = f'{BASE}/test_images'

# Load everything into lists - this is faster than loading on the fly!
print("Loading all images into memory...")
all_images = []
all_labels = []
all_masks = []

# Load authentic images
for fname in os.listdir(authentic_path):
    if fname.endswith('.jpg') or fname.endswith('.png'):
        img = cv2.imread(os.path.join(authentic_path, fname))
        img = cv2.resize(img, (224, 224))  # Resize to save memory
        all_images.append(img)
        all_labels.append(0)  # 0 = authentic
        all_masks.append(None)

# Load forged images  
for fname in os.listdir(forged_path):
    if fname.endswith('.jpg') or fname.endswith('.png'):
        img = cv2.imread(os.path.join(forged_path, fname))
        img = cv2.resize(img, (224, 224))
        all_images.append(img)
        all_labels.append(1)  # 1 = forged
        
        # Load corresponding mask
        mask_fname = fname.replace('.jpg', '.npy').replace('.png', '.npy')
        mask = np.load(os.path.join(mask_path, mask_fname))
        mask = cv2.resize(mask, (224, 224))
        all_masks.append(mask)

print(f"Loaded {len(all_images)} images total")
print(f"Authentic: {all_labels.count(0)}, Forged: {all_labels.count(1)}")

## Create Custom Dataset

Simple dataset that returns images and their bounding boxes

In [None]:
class ForgeryDataset(Dataset):
    def __init__(self, images, labels, masks):
        self.images = images
        self.labels = labels
        self.masks = masks
        
    def __len__(self):
        return len(self.images)
    
    def __getitem__(self, idx):
        img = self.images[idx]
        label = self.labels[idx]
        
        # Convert BGR to RGB (OpenCV loads as BGR)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        # Normalize to [0, 1] range
        img = img.astype(np.float32) / 255.0
        
        # Convert to tensor (channels first)
        img = torch.from_numpy(img).permute(2, 0, 1)
        
        if label == 1:  # Forged image
            mask = self.masks[idx]
            if mask is None:
                mask = np.zeros((224, 224))
            
            # Find bounding box from mask
            if mask.max() > 0:
                # Find contours
                contours, _ = cv2.findContours(
                    mask.astype(np.uint8), 
                    cv2.RETR_EXTERNAL, 
                    cv2.CHAIN_APPROX_SIMPLE
                )
                
                boxes = []
                for cnt in contours:
                    x, y, w, h = cv2.boundingRect(cnt)
                    boxes.append([x, y, x+w, y+h])
                
                if len(boxes) == 0:
                    boxes = torch.zeros((0, 4), dtype=torch.float32)
                    labels_t = torch.zeros((0,), dtype=torch.int64)
                else:
                    boxes = torch.tensor(boxes, dtype=torch.float32)
                    labels_t = torch.ones((len(boxes),), dtype=torch.int64)
            else:
                boxes = torch.zeros((0, 4), dtype=torch.float32)
                labels_t = torch.zeros((0,), dtype=torch.int64)
        else:  # Authentic
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels_t = torch.zeros((0,), dtype=torch.int64)
        
        target = {
            'boxes': boxes,
            'labels': labels_t,
            'image_id': torch.tensor([idx])
        }
        
        return img, target

## Split Data

70/30 split for training and validation

In [None]:
# Use sklearn for splitting - it's more reliable
train_idx, val_idx = train_test_split(
    range(len(all_images)), 
    test_size=0.3,  # 30% validation
    shuffle=True
)

# Create train/val datasets
train_images = [all_images[i] for i in train_idx]
train_labels = [all_labels[i] for i in train_idx]
train_masks = [all_masks[i] for i in train_idx]

val_images = [all_images[i] for i in val_idx]
val_labels = [all_labels[i] for i in val_idx]
val_masks = [all_masks[i] for i in val_idx]

train_dataset = ForgeryDataset(train_images, train_labels, train_masks)
val_dataset = ForgeryDataset(val_images, val_labels, val_masks)

print(f"Train: {len(train_dataset)}, Val: {len(val_dataset)}")

## Build Model

Using Faster R-CNN - it's faster than Mask R-CNN and we can generate masks from bounding boxes!

In [None]:
# Load pretrained Faster R-CNN
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Modify the box predictor for our 2 classes (background + forgery)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(
    in_features, 2
)

model = model.to(device)
print("Model loaded!")

## Training Setup

Using Adam optimizer - it adapts learning rate automatically which is better than SGD

In [None]:
# Adam optimizer is better for this task
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Cosine annealing scheduler - modern approach
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

NUM_EPOCHS = 15
BATCH_SIZE = 1  # Process one image at a time for stability

## Training Loop

Train one image at a time for better gradient updates

In [None]:
for epoch in range(NUM_EPOCHS):
    print(f"\n=== Epoch {epoch+1}/{NUM_EPOCHS} ===")
    
    # Training
    model.train()
    train_losses = []
    
    # Process each training image individually
    for i in tqdm(range(len(train_dataset)), desc="Training"):
        img, target = train_dataset[i]
        
        # Move to device
        img = img.to(device)
        target = {k: v.to(device) for k, v in target.items()}
        
        # Forward pass
        loss_dict = model([img], [target])
        losses = sum(loss for loss in loss_dict.values())
        
        # Backward pass
        optimizer.zero_grad()
        losses.backward()
        
        # Clip gradients to prevent explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        
        train_losses.append(losses.item())
    
    avg_train_loss = np.mean(train_losses)
    
    # Validation
    model.eval()
    val_losses = []
    
    with torch.no_grad():
        for i in tqdm(range(len(val_dataset)), desc="Validation"):
            img, target = val_dataset[i]
            img = img.to(device)
            target = {k: v.to(device) for k, v in target.items()}
            
            # Get predictions
            predictions = model([img])
            
            # Calculate validation loss manually
            # We'll use a simple metric: average confidence of predictions
            if len(predictions[0]['scores']) > 0:
                val_loss = 1.0 - predictions[0]['scores'].mean().item()
            else:
                val_loss = 1.0  # No predictions = max loss
            
            val_losses.append(val_loss)
    
    avg_val_loss = np.mean(val_losses)
    
    print(f"Train Loss: {avg_train_loss:.4f}")
    print(f"Val Loss: {avg_val_loss:.4f}")
    
    scheduler.step()

print("\nTraining complete!")

## Helper Function: Box to Mask Conversion

Since Faster R-CNN gives us boxes, we convert them to masks for submission

In [None]:
def boxes_to_mask(boxes, image_shape):
    """
    Convert bounding boxes to a binary mask
    """
    mask = np.zeros(image_shape, dtype=np.uint8)
    
    for box in boxes:
        x1, y1, x2, y2 = box
        x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
        
        # Fill the box region
        mask[y1:y2, x1:x2] = 1
    
    return mask

## RLE Encoding

Encode masks as run-length for submission

In [None]:
def encode_rle(mask):
    """
    Simple RLE encoding
    """
    # Flatten mask row by row
    pixels = mask.flatten()
    
    # Add 0s at start and end for easier calculation
    pixels = np.concatenate([[0], pixels, [0]])
    
    # Find run starts and ends
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    
    # Convert to lengths
    runs[1::2] = runs[1::2] - runs[::2]
    
    # Format as JSON
    result = {
        "counts": runs.tolist(),
        "size": [mask.shape[0], mask.shape[1]]
    }
    
    return json.dumps(result)

## Generate Test Predictions

Process test images and create submission

In [None]:
model.eval()

predictions = {}
test_files = sorted(os.listdir(test_path))

for fname in tqdm(test_files, desc="Predicting"):
    case_id = fname.split('.')[0]
    
    # Load and preprocess image
    img_path = os.path.join(test_path, fname)
    img = cv2.imread(img_path)
    original_h, original_w = img.shape[:2]
    
    # Resize to 224x224 (same as training)
    img_resized = cv2.resize(img, (224, 224))
    img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
    img_norm = img_rgb.astype(np.float32) / 255.0
    img_tensor = torch.from_numpy(img_norm).permute(2, 0, 1).to(device)
    
    # Predict
    with torch.no_grad():
        pred = model([img_tensor])[0]
    
    boxes = pred['boxes'].cpu().numpy()
    scores = pred['scores'].cpu().numpy()
    
    # Filter by confidence - use 0.3 threshold for better recall
    threshold = 0.3
    good_boxes = boxes[scores > threshold]
    
    if len(good_boxes) == 0:
        predictions[case_id] = "authentic"
    else:
        # Convert boxes to mask
        mask_224 = boxes_to_mask(good_boxes, (224, 224))
        
        # Resize mask back to original size
        mask_full = cv2.resize(
            mask_224, 
            (original_w, original_h),
            interpolation=cv2.INTER_NEAREST
        )
        
        if mask_full.sum() == 0:
            predictions[case_id] = "authentic"
        else:
            # Encode as RLE
            rle = encode_rle(mask_full)
            predictions[case_id] = rle

print(f"Generated {len(predictions)} predictions")

## Create Submission File

In [None]:
# Load sample submission
sample = pd.read_csv(f'{BASE}/sample_submission.csv')

# Create submission
submission = []
for case_id in sample['case_id']:
    case_str = str(case_id)
    annotation = predictions.get(case_str, "authentic")
    submission.append({
        'case_id': case_id,
        'annotation': annotation
    })

submission_df = pd.DataFrame(submission)
submission_df.to_csv('submission.csv', index=False)

# Stats
n_authentic = (submission_df['annotation'] == 'authentic').sum()
n_forged = len(submission_df) - n_authentic

print(f"\nSubmission saved!")
print(f"Authentic: {n_authentic}")
print(f"Forged: {n_forged}")

---

## Critical Problems

### 1. **Wrong Model Type**
- **Used**: Faster R-CNN (predicts bounding boxes only)
- **Need**: Mask R-CNN (predicts a pixel-level mask for each detection)
- **Impact**: Creating masks from boxes is TERRIBLE for forgery detection
- **Why it's bad**: Boxes are rectangular, forgeries are irregular shapes. Every authentic pixel inside a box gets labeled as forged, so all fine-grained detail is lost!
- **Comment lies**: "we can generate masks from bounding boxes!" - Technically possible, but far too coarse for this task!

### 2. **Loading ALL Images Into Memory** 
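- **Code**: `all_images = []`, then every image is read and appended
- **Problem**: Holds the entire dataset in RAM. Even at 224x224, each image takes 224 x 224 x 3 = 150,528 bytes, so memory grows linearly with dataset size
- **Impact**: Slow startup, and may run out of memory on larger datasets or at higher resolutions
- **Comment lies**: "this is faster than loading on the fly!" - Access is fast once everything is loaded, but the upfront cost and memory use are not worth it when a DataLoader with workers can load in the background!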
- **Original**: Uses proper Dataset that loads images as needed
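
A minimal sketch of on-demand loading, assuming nothing about the original notebook's implementation: `LazyForgeryDataset` and `list_images` are hypothetical names, and the loader function is injected (it would be something like `cv2.imread`) so the sketch has no image-library dependency. In a real notebook the class would subclass `torch.utils.data.Dataset`.

```python
import os


class LazyForgeryDataset:
    """Map-style dataset that keeps only file paths in memory."""

    def __init__(self, image_paths, labels, load_fn):
        self.image_paths = list(image_paths)
        self.labels = list(labels)
        self.load_fn = load_fn  # e.g. cv2.imread; injected to keep the sketch generic

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # The image is decoded here, on demand, not at construction time.
        image = self.load_fn(self.image_paths[idx])
        return image, self.labels[idx]


def list_images(folder, extensions=(".jpg", ".png")):
    """Collect sorted image paths without opening any file."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(extensions)
    )
```

Only one image per worker is resident at a time, so memory use no longer depends on the size of the dataset.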

### 3. **Wrong Image Size** 
- **Used**: Resizes everything to 224x224
- **Original**: Uses 256x256
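- **Problem**: 224 is too small for detecting fine forgery details, and forcing every image to a square ignores its aspect ratio
- **Impact**: Loses important information, worse detection. torchvision's detection models also resize inputs internally (default `min_size=800`), so the downscaled image is upsampled again without recovering the lost detail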

### 4. **Wrong Train/Val Split Ratio** 
- **Used**: 70/30 split
- **Common choice**: 80/20 split
- **Problem**: Less training data usually means a worse model
- **Impact**: Suboptimal learning

### 5. **Wrong Optimizer** 
- **Used**: Adam with `lr=0.001`
- **Should use**: SGD with momentum
- **Why**: torchvision's reference detection training uses SGD with momentum, and it is the usual starting point for fine-tuning these models
- **Comment lies**: "Adam adapts learning rate automatically which is better" - Not for this!

### 6. **Batch Size of 1** 
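- **Used**: `BATCH_SIZE = 1` (the constant is never even read; the loop simply feeds one image per step)
- **Problem**: Training one image at a time is EXTREMELY slow
- **Impact**: Noisy gradient estimates and poor GPU utilization, very slow training. Batch norm is not the issue here, because the pretrained backbone uses frozen batch norm layers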
- **Comment lies**: "Process one image at a time for stability" - This makes it slower, not more stable!
- **Original**: Uses batch size of 4

### 7. **Bizarre Validation "Loss"** 
- **Code**: `val_loss = 1.0 - predictions[0]['scores'].mean().item()`
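- **Problem**: This is NOT a proper validation loss! It never looks at `target`, so it measures how confident the model is, not whether it is right
- **Impact**: Meaningless metric, can't track model improvement. An authentic image with no detections (the correct answer) gets the maximum "loss" of 1.0
- **What it should be**: Actual loss computation on validation set, or a mask-overlap metric against the ground truth. Note that torchvision detection models return a loss dict only in training mode; in `eval()` mode they return predictions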
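
One way to get a validation number that actually depends on the ground truth is a pixel-level F1 between the predicted and true masks. This is a sketch under assumptions: the competition's official metric is not stated here, so pixel F1 is only a stand-in, and `pixel_f1` is a hypothetical helper.

```python
import numpy as np


def pixel_f1(pred_mask, true_mask):
    """F1 score over pixels; both masks are binary arrays of equal shape."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    if not pred.any() and not true.any():
        return 1.0  # authentic image correctly left empty
    tp = np.logical_and(pred, true).sum()
    return float(2 * tp / (pred.sum() + true.sum()))


truth = np.zeros((8, 8), dtype=np.uint8)
truth[2:6, 2:6] = 1            # 16 forged pixels
pred = np.zeros((8, 8), dtype=np.uint8)
pred[2:6, 2:8] = 1             # 24 predicted pixels, 16 of them correct
score = pixel_f1(pred, truth)  # 2 * 16 / (24 + 16) = 0.8
```

Unlike `1 - mean(score)`, this value goes up only when predictions overlap the real forged region, and an authentic image with no predictions scores 1.0 instead of being penalized.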

### 8. **Wrong Confidence Threshold** 
- **Used**: 0.3 threshold
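- **Common default**: 0.5 threshold
- **Problem**: Too low! A single box above the threshold marks the whole image as forged, so expect many false positives
- **Comment says**: "use 0.3 for better recall" - but we need precision too! The threshold should be tuned on validation data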
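
Rather than picking 0.3 or 0.5 by hand, the threshold can be chosen on the validation split. The sketch below assumes an image counts as forged when its highest detection score exceeds the threshold, and scores candidates by image-level F1; `best_threshold` is a hypothetical helper and the numbers are made up for illustration.

```python
import numpy as np


def best_threshold(max_scores, is_forged, candidates=(0.3, 0.5, 0.7, 0.9)):
    """Return (threshold, f1) maximizing image-level F1 on validation data.

    max_scores: highest detection score per image (0.0 if no detections).
    is_forged:  ground-truth labels, 1 = forged, 0 = authentic.
    """
    max_scores = np.asarray(max_scores, dtype=float)
    is_forged = np.asarray(is_forged).astype(bool)
    best = (None, -1.0)
    for t in candidates:
        flagged = max_scores > t
        tp = np.sum(flagged & is_forged)
        fp = np.sum(flagged & ~is_forged)
        fn = np.sum(~flagged & is_forged)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[1]:
            best = (t, float(f1))
    return best


# Made-up validation scores, purely for illustration.
scores = [0.95, 0.80, 0.35, 0.40, 0.10, 0.60]
labels = [1, 1, 0, 0, 0, 1]
threshold, f1 = best_threshold(scores, labels)
```

On this toy data, 0.3 flags two authentic images while 0.5 separates the classes cleanly; real validation scores will rarely be this tidy.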

### 9. **Gradient Clipping** 
- **Code**: `clip_grad_norm_(model.parameters(), max_norm=1.0)`
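- **Problem**: Not needed in the original setup, and `max_norm=1.0` is a tight limit that can slow training
- **Original**: Doesn't use gradient clipping
- **Impact**: Every update whose gradient norm exceeds 1.0 is scaled down, which can limit learning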
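
To see what clipping by norm does to an update, here is a NumPy sketch of the idea behind `clip_grad_norm_`: all gradients are scaled by one common factor so their joint L2 norm does not exceed `max_norm`. `clip_by_global_norm` is a hypothetical stand-in, not the PyTorch implementation.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 4.0]), np.array([12.0])]  # joint norm = 13
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
```

The direction of the update is preserved, but here its length shrinks from 13 to about 1, so if gradient norms are routinely large, most steps end up much smaller than the optimizer would otherwise take.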

### 10. **Wrong RLE Size Format** 
- **Code**: `"size": [mask.shape[0], mask.shape[1]]`
- **Problem**: Using [height, width]
- **May need**: [width, height] depending on competition
- **Impact**: Submission might fail or masks be interpreted wrong
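
Besides the `size` order, the flattening order changes the encoding too. The sketch below reuses the logic of `encode_rle` with a selectable order and adds a decoder for round-trip checks. Which convention the competition expects is not stated here, so treat both variants as possibilities to check against the official metric code; `rle_pairs` and `rle_decode` are hypothetical helpers.

```python
import numpy as np


def rle_pairs(mask, order="C"):
    """Return [start, length, ...] with 1-indexed starts, like encode_rle above.

    order="C" flattens row by row, order="F" column by column.
    """
    pixels = np.concatenate([[0], mask.flatten(order=order), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return runs.tolist()


def rle_decode(pairs, shape, order="C"):
    """Invert rle_pairs; shape is (height, width)."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(pairs[0::2], pairs[1::2]):
        flat[start - 1:start - 1 + length] = 1
    return flat.reshape(shape, order=order)


mask = np.zeros((3, 4), dtype=np.uint8)
mask[0:2, 1] = 1  # a vertical strip two pixels tall

row_major = rle_pairs(mask, order="C")  # [2, 1, 6, 1]
col_major = rle_pairs(mask, order="F")  # [4, 2]
```

The same mask produces two different encodings, so decoding with the wrong order (or the wrong `size` convention) silently yields a different mask rather than an error.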

### 11. **No Normalization** 
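- **Code**: Only divides by 255, no mean/std normalization in the dataset
- **Caveat**: torchvision's detection models already apply ImageNet normalization internally (`mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]` in `model.transform`), so inputs in [0, 1] are what they expect
- **Impact**: Not an error with this particular model, and normalizing again in the dataset would double-normalize. It does become a problem if the backbone is swapped for one that expects pre-normalized inputs!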

### 12. **15 Epochs** 
- **Used**: 15 epochs
- **Original**: 10 epochs
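- **Problem**: Combined with bad settings, will overfit or waste time. The scheduler uses `T_max=10`, so the learning rate reaches its minimum at epoch 10 and then climbs again for the remaining epochs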
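
The mismatch between `NUM_EPOCHS = 15` and `T_max=10` is easy to see by evaluating the cosine schedule directly. This sketch assumes the closed-form cosine annealing expression with `eta_min=0` and the notebook's `lr=0.001`; `cosine_lr` is a hypothetical helper, not a PyTorch function.

```python
import math


def cosine_lr(epoch, base_lr=0.001, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing value after `epoch` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Learning rate in effect during each of the 15 epochs (epoch 0 uses base_lr).
schedule = [cosine_lr(e) for e in range(15)]
lowest_epoch = min(range(15), key=lambda e: schedule[e])
```

Under this formula the learning rate bottoms out at index 10 and rises again through index 14, so the last epochs train with an increasing learning rate. Setting `T_max=NUM_EPOCHS` avoids that.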

### 13. **Resizing Mask After Processing** 
- **Code**: Creates mask at 224x224, then resizes to original
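- **Problem**: Loses precision in upscaling. Box edges are truncated to the 224 grid, and each grid cell becomes a block of several pixels in the original image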
- **Should**: Work at original resolution or at least larger size
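
If boxes are used at all, one option is to scale the box coordinates to the original image size and fill the mask there, instead of resizing a 224x224 mask. A minimal sketch with a hypothetical helper `boxes_to_full_res_mask`; it only removes the resize step and does nothing about the rectangular-mask problem.

```python
import numpy as np


def boxes_to_full_res_mask(boxes_224, original_h, original_w, model_size=224):
    """Scale (x1, y1, x2, y2) boxes from the model grid to the original image."""
    mask = np.zeros((original_h, original_w), dtype=np.uint8)
    sx, sy = original_w / model_size, original_h / model_size
    for x1, y1, x2, y2 in boxes_224:
        # Round outward so the scaled box still covers the predicted region.
        left, top = int(np.floor(x1 * sx)), int(np.floor(y1 * sy))
        right, bottom = int(np.ceil(x2 * sx)), int(np.ceil(y2 * sy))
        mask[max(top, 0):min(bottom, original_h), max(left, 0):min(right, original_w)] = 1
    return mask


# One box on the 224 grid, mapped onto an 896 x 448 (height x width) image.
full_mask = boxes_to_full_res_mask([(10.5, 20.25, 50.5, 60.75)], 896, 448)
```

Sub-pixel box coordinates are kept until the final rounding, whereas `boxes_to_mask` truncates them to integers on the small grid first.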

### 14. **No Random Seed** 
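- **Problem**: `train_test_split` is called without `random_state` and without `stratify`
- **Impact**: Results not reproducible, and the authentic/forged ratio can differ between train and validation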
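
With scikit-learn the fix is to pass `random_state=42` (any fixed integer) and optionally `stratify=all_labels` to `train_test_split`. The stdlib-only sketch below shows the same idea, a seeded split that keeps the class ratio; `stratified_split` is a hypothetical helper written for illustration.

```python
import random


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with the class ratio preserved in both."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = [0] * 10 + [1] * 5
train_idx, val_idx = stratified_split(labels)
```

Calling it twice with the same seed returns the same indices, which is what makes runs comparable.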

### 15. **Inefficient Loop Structure** 
- **Code**: Loops through dataset manually `for i in range(len(dataset))`
- **Should**: Use DataLoader for batching, shuffling, prefetching, etc.
- **Impact**: Much slower, no parallel data loading, and the training order is identical in every epoch
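
Detection targets have a different number of boxes per image, so a DataLoader needs a collate function that keeps samples as tuples instead of stacking them. A minimal sketch: `detection_collate` is a hypothetical name, and the DataLoader call is shown only as a comment because it depends on the notebook's objects.

```python
def detection_collate(batch):
    """Turn [(img, target), ...] into ((img, ...), (target, ...))."""
    return tuple(zip(*batch))


# Assumed usage with the notebook's objects (not executed here):
# loader = torch.utils.data.DataLoader(
#     train_dataset, batch_size=4, shuffle=True,
#     num_workers=2, collate_fn=detection_collate)
# for images, targets in loader:
#     loss_dict = model(list(images), list(targets))

# Stand-in samples showing the shape of one collated batch.
batch = [("img0", {"boxes": [[0, 0, 4, 4]]}), ("img1", {"boxes": []})]
images, targets = detection_collate(batch)
```

With `shuffle=True` the order changes every epoch, and `num_workers` lets image decoding overlap with GPU work.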

### 16. **Box Filling Creates Rectangular Masks** 
- **Code**: `mask[y1:y2, x1:x2] = 1`
- **Problem**: Fills entire bounding box rectangle
- **Reality**: Forgeries are irregular shapes, not rectangles!
- **Impact**: Massive overcoverage, poor predictions
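
The overcoverage is easy to quantify on a synthetic example. The sketch below assumes a disk-shaped forged region (purely illustrative) and fills its tightest bounding box the way `boxes_to_mask` does, then measures the overlap.

```python
import numpy as np

size = 200
yy, xx = np.mgrid[:size, :size]
# Synthetic "forgery": a disk of radius 60 centred in the image.
true_mask = (xx - 100) ** 2 + (yy - 100) ** 2 <= 60 ** 2

# Tightest box around the disk, filled the way boxes_to_mask does.
ys, xs = np.where(true_mask)
box_mask = np.zeros_like(true_mask)
box_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True

iou = (true_mask & box_mask).sum() / (true_mask | box_mask).sum()
false_positive_fraction = (box_mask & ~true_mask).sum() / box_mask.sum()
```

Even with a perfectly placed box, roughly a quarter of the predicted pixels are wrong for a round region, and thin or diagonal regions fare far worse.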

## STRUCTURAL DISASTERS:

### 17. **Wrong Model for the Task** 
- Using detection model for segmentation task
- Like using a hammer to cut paper
- Fundamentally wrong approach

### 18. **Memory Management Nightmare** 
- Loading all images at once
- Nothing is released while training runs
- May run out of memory on large datasets

### 19. **Training Inefficiency** 
- Batch size 1 + manual loops = slowest possible training
- No DataLoader benefits
- No parallel data loading
- Could take many times longer than the original

## Performance Impact:

- **Will produce bad results** - Wrong model type!
- **Will be slow** - Batch size 1, loading all to memory, manual loops
- **May crash** - Out of memory from loading everything
- **Poor accuracy** - Small image size, rectangular masks
- **Invalid metrics** - Made-up validation loss
- **Too many false positives** - Threshold too low

## What Original Did Right:

- Used Mask R-CNN (correct model for segmentation)
- Dataset class loads images on-demand (memory efficient)
- Proper batch size (4) with DataLoader
- SGD optimizer (proven for detection)
- Correct image normalization
- 256x256 image size
- Proper validation loss computation
- Standard 0.5 confidence threshold
- Works at appropriate resolution
- 80/20 train/val split


