
# Scientific Image Forgery — **Exploratory Data Analysis (EDA)**

**Goal:** Visualize and analyze forged regions (copy–move) in biomedical research images — no model training here.

**What you'll get:**
- Labeled overlays of forged regions (instances + bounding boxes)
- Dataset sanity checks and descriptive statistics
- Instance-level metrics (area %, bbox, aspect, compactness)
- Saved overlays to disk for quick browsing
- A qualitative "copy–move" signature peek via template matching
  


## Table of Contents
1. **Setup & Configuration**  
2. **Data Loading Utilities**  
3. **Index Dataset & Sanity Checks**  
4. **Global Dataset Statistics** (sizes, modes, aspect)  
5. **Mask Structure & Quality Checks**  
6. **Instance-Level Metrics** (area %, bbox, aspect, compactness)  
7. **Labeled Visualizations** (single + grid + authentic vs forged)  
8. **Copy–Move Signature Peek** (template matching)  
9. **Analyst Summary & Next Steps**


## 1) Setup & Configuration

In [None]:
import sys, os, platform
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print("Python:", sys.version.split()[0])
print("OS:", platform.platform())
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)


In [None]:

from pathlib import Path

COMP_DIR   = "/kaggle/input/recodai-luc-scientific-image-forgery-detection" 
TRAIN_DIR  = f"{COMP_DIR}/train_images"
MASK_DIR   = f"{COMP_DIR}/train_masks"
TEST_DIR   = f"{COMP_DIR}/test_images"  # unused here, EDA only

# Outputs
OUT_DIR     = "/kaggle/working"
PREVIEW_DIR = f"{OUT_DIR}/preview"
os.makedirs(PREVIEW_DIR, exist_ok=True)

# Visualization and sampling params
SEED                 = 42
DISPLAY_SIZE         = 768          # resize for display/speed (square)
GRID_SIZE            = 512          # size for grid thumbnails
MAX_GRID_IMAGES      = 12           # limit how many to draw in grids
TEMPLATE_MATCH_SCALE = 0.5          # downscale for template matching (speed)
np.random.seed(SEED)

print("TRAIN_DIR:", TRAIN_DIR)
print("MASK_DIR :", MASK_DIR)
print("PREVIEW to:", PREVIEW_DIR)


## 2) Data Loading Utilities

In [None]:

import glob
from PIL import Image
import cv2

def load_mask_npy(mask_path: str) -> np.ndarray:
    """
    Load .npy mask which may be:
      - (H, W): single mask
      - (N, H, W): multiple instance masks -> OR them
      - list/tuple/dict: extract arrays and OR them
    Return: uint8 binary mask {0,1}
    """
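    m = np.load(mask_path, allow_pickle=True)
    # np.load returns an ndarray even for pickled containers; unwrap object
    # arrays so the list/tuple/dict branches below can actually trigger.
    if isinstance(m, np.ndarray) and m.dtype == object:
        m = m.item() if m.ndim == 0 else list(m)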

    if isinstance(m, (list, tuple)):
        arrs = []
        for item in m:
            item = np.asarray(item)
            if item.ndim == 2:
                arrs.append((item > 0).astype(np.uint8))
            elif item.ndim == 3:
                arrs.append(((item > 0).sum(axis=0) > 0).astype(np.uint8))
        if len(arrs) == 0:
            raise ValueError(f"Unsupported mask list/tuple in {mask_path}")
        m = np.stack(arrs, axis=0)

    if isinstance(m, dict):
        key = "masks" if "masks" in m else list(m.keys())[0]
        m = np.asarray(m[key])

    m = np.asarray(m)
    if m.ndim == 2:
        mask = (m > 0).astype(np.uint8)
    elif m.ndim == 3:
        mask = ((m > 0).sum(axis=0) > 0).astype(np.uint8)
    else:
        raise ValueError(f"Unsupported mask ndim {m.ndim} for {mask_path}")
    return mask


def load_mask_instances(mask_path: str) -> list:
    """
    Return a list of instance masks (each [H,W] uint8 {0,1}).
    If the .npy is (N,H,W) -> split along axis 0
    If it's (H,W) -> split into connected components as instances
    """
    from scipy import ndimage

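    raw = np.load(mask_path, allow_pickle=True)
    # Pickled containers come back wrapped in an object array; unwrap them
    # so the list/tuple/dict branches below can actually trigger.
    if isinstance(raw, np.ndarray) and raw.dtype == object:
        raw = raw.item() if raw.ndim == 0 else list(raw)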

    def _to_list_of_masks(arr: np.ndarray) -> list:
        arr = np.asarray(arr)
        if arr.ndim == 2:
            lab, n = ndimage.label(arr > 0)
            return [(lab == k).astype(np.uint8) for k in range(1, n + 1)]
        elif arr.ndim == 3:
            return [((arr[i] > 0).astype(np.uint8)) for i in range(arr.shape[0])]
        else:
            raise ValueError(f"Unsupported array ndim {arr.ndim}")

    if isinstance(raw, (list, tuple)):
        out = []
        for it in raw:
            out.extend(_to_list_of_masks(np.asarray(it)))
        return out

    if isinstance(raw, dict):
        key = "masks" if "masks" in raw else list(raw.keys())[0]
        return _to_list_of_masks(np.asarray(raw[key]))

    return _to_list_of_masks(np.asarray(raw))


def pil_to_np_rgb(img_pil: Image.Image) -> np.ndarray:
    return np.array(img_pil.convert("RGB"))


def resize_img_and_mask(img: Image.Image, mask: np.ndarray, size: int) -> tuple:
    img_r  = img.resize((size, size), resample=Image.BILINEAR)
    mask_r = Image.fromarray(mask).resize((size, size), resample=Image.NEAREST)
    return img_r, np.array(mask_r, dtype=np.uint8)


def overlay_instances(img_rgb: np.ndarray, inst_masks: list, alpha: float = 0.38) -> np.ndarray:
    """
    Semi-transparent color overlay per instance + labeled bounding boxes.
    Returns an RGB uint8 image.
    """
    H, W, _ = img_rgb.shape
    out = img_rgb.astype(np.float32).copy()

    rng = np.random.default_rng(1234)
    colors = rng.integers(low=0, high=255, size=(max(1, len(inst_masks)), 3), dtype=np.uint8)

    for idx, m in enumerate(inst_masks, start=1):
        m = (m > 0).astype(np.uint8)
        if m.shape != (H, W):
            m = cv2.resize(m, (W, H), interpolation=cv2.INTER_NEAREST)
        color = colors[(idx - 1) % len(colors)].astype(np.float32)

        # Blend color on masked pixels
        out[m == 1] = (1 - alpha) * out[m == 1] + alpha * color

        # Bounding box & label
        ys, xs = np.where(m == 1)
        if len(xs) > 0:
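            # Plain Python ints: some OpenCV builds reject NumPy scalars as point coordinates
            x1, x2 = int(xs.min()), int(xs.max())
            y1, y2 = int(ys.min()), int(ys.max())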
            cv2.rectangle(out, (x1, y1), (x2, y2), color=tuple(color.tolist()), thickness=2)
            label = f"inst {idx}"
            cv2.putText(out, label, (x1, max(0, y1 - 5)), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255,255,255), 2, cv2.LINE_AA)

    return np.clip(out, 0, 255).astype(np.uint8)


def save_rgb(path: str, arr_rgb: np.ndarray):
    Image.fromarray(arr_rgb.astype(np.uint8)).save(path)


## 3) Index Dataset & Sanity Checks

In [None]:

def build_items(train_dir: str, mask_dir: str):
    items = []
    for cls in ["authentic", "forged"]:
        img_dir = f"{train_dir}/{cls}"
        if not os.path.exists(img_dir):
            continue
        for p in glob.glob(os.path.join(img_dir, "*")):
            case_id = Path(p).stem
            mask_path = None
            if cls == "forged":
                cand = os.path.join(mask_dir, f"{case_id}.npy")
                mask_path = cand if os.path.exists(cand) else None
            items.append({
                "path": p,
                "case_id": case_id,
                "label": 1 if cls == "forged" else 0,
                "mask_path": mask_path
            })
    return items

items = build_items(TRAIN_DIR, MASK_DIR)

n_total    = len(items)
n_forged   = sum(x["label"] == 1 for x in items)
n_auth     = sum(x["label"] == 0 for x in items)
n_withmask = sum((x["label"] == 1 and x["mask_path"] is not None) for x in items)

print(f"Total images: {n_total}")
print(f"Authentic   : {n_auth}")
print(f"Forged      : {n_forged}")
print(f"Forged with available masks (.npy): {n_withmask}")

df = pd.DataFrame(items)
df.head()
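
The index above only reports how many forged images have a mask. A complementary check, sketched here with hypothetical file names, is to list which stems are unpaired in either direction. `find_unpaired` is an illustrative helper, not part of the notebook.

```python
from pathlib import Path

def find_unpaired(image_paths, mask_paths):
    """Compare file stems; return (images without masks, masks without images)."""
    image_stems = {Path(p).stem for p in image_paths}
    mask_stems = {Path(p).stem for p in mask_paths}
    return sorted(image_stems - mask_stems), sorted(mask_stems - image_stems)

# Hypothetical file names, for illustration only
forged_images = ["train_images/forged/101.png", "train_images/forged/102.png"]
masks = ["train_masks/102.npy", "train_masks/103.npy"]

missing_masks, orphan_masks = find_unpaired(forged_images, masks)
print("Forged images without a mask:", missing_masks)
print("Masks without a forged image:", orphan_masks)
```

On the real data, the inputs could be the forged image paths from `items` and a glob over `MASK_DIR`.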


## 4) Global Dataset Statistics (sizes, modes, aspect)

In [None]:

from PIL import Image

dims = []
modes = []
for it in items:
    try:
        # The context manager closes the file handle even if reading fails
        with Image.open(it["path"]) as img:
            dims.append(img.size)  # (W, H)
            modes.append(img.mode)
    except Exception:
        dims.append((None, None))
        modes.append("ERR")

dim_df = pd.DataFrame(dims, columns=["width","height"])
df["width"]  = dim_df["width"].values
df["height"] = dim_df["height"].values
df["mode"]   = modes
df["aspect"] = (df["width"] / df["height"]).replace([np.inf, -np.inf], np.nan)

print(df[["label","width","height","mode","aspect"]].describe(include="all"))

plt.figure(figsize=(6,4))
df["width"].dropna().plot(kind="hist", bins=30)
plt.title("Image width distribution"); plt.xlabel("pixels"); plt.ylabel("count")
plt.show()

plt.figure(figsize=(6,4))
df["height"].dropna().plot(kind="hist", bins=30)
plt.title("Image height distribution"); plt.xlabel("pixels"); plt.ylabel("count")
plt.show()

plt.figure(figsize=(6,4))
df["aspect"].dropna().plot(kind="hist", bins=30)
plt.title("Aspect ratio (W/H)"); plt.xlabel("ratio"); plt.ylabel("count")
plt.show()
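
Histograms can hide whether the dataset clusters around a few exact resolutions. A minimal sketch of counting exact `(width, height)` pairs follows; the sizes are hypothetical and `summarize_resolutions` is an illustrative helper.

```python
from collections import Counter

def summarize_resolutions(sizes, top=3):
    """Count exact (width, height) pairs and return the `top` most common."""
    valid = [s for s in sizes if s[0] is not None and s[1] is not None]
    return Counter(valid).most_common(top)

# Hypothetical sizes; unreadable files appear as (None, None), as in the loop above
example_sizes = [(512, 512), (1024, 768), (512, 512), (None, None), (1024, 768), (512, 512)]
common = summarize_resolutions(example_sizes)
print(common)
```

On the real data, the `dims` list built above could be passed in directly.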


## 5) Mask Structure & Quality Checks

In [None]:

import cv2

bad_shape = 0
empty_mask = 0
ok_index = []          # DataFrame index of each row whose mask was measured
instance_counts = []
union_area_px = []
union_area_pct = []

forged_rows = df[df["label"] == 1].copy()
for idx, row in forged_rows.iterrows():
    if not row["mask_path"]:
        continue
    with Image.open(row["path"]) as img:
        w, h = img.size
    try:
        mask_union = load_mask_npy(row["mask_path"])
        insts = load_mask_instances(row["mask_path"])
    except Exception as e:
        print("Mask load error:", row["mask_path"], e)
        continue

    if mask_union.shape != (h, w):
        bad_shape += 1
        mask_union = np.array(Image.fromarray(mask_union).resize((w, h), Image.NEAREST))

    if mask_union.sum() == 0:
        empty_mask += 1

    ok_index.append(idx)
    instance_counts.append(len(insts))
    union_area_px.append(int(mask_union.sum()))
    union_area_pct.append(float(mask_union.sum() / (w * h)))

print(f"Forged with mask shape mismatch (auto-resized): {bad_shape}")
print(f"Forged with empty union mask: {empty_mask}")

# Assign by index so skipped rows (missing or unreadable masks) stay NaN
# instead of shifting later values onto the wrong case_id.
forged_rows["instances"]      = pd.Series(instance_counts, index=ok_index, dtype=float)
forged_rows["union_area_px"]  = pd.Series(union_area_px,  index=ok_index, dtype=float)
forged_rows["union_area_pct"] = pd.Series(union_area_pct, index=ok_index, dtype=float)
forged_rows[["case_id", "instances", "union_area_px", "union_area_pct"]].head()
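
One further structure check that might be worth adding is whether the instance masks of a single image overlap each other. The sketch below uses synthetic masks and a hypothetical `max_pairwise_iou` helper. Whether overlap is expected in this dataset is an open question, so treat a non-zero value as something to inspect rather than as an error.

```python
import numpy as np

def max_pairwise_iou(inst_masks):
    """Largest IoU between any two instance masks (0.0 if fewer than two)."""
    best = 0.0
    for i in range(len(inst_masks)):
        a = inst_masks[i] > 0
        for j in range(i + 1, len(inst_masks)):
            b = inst_masks[j] > 0
            union = np.logical_or(a, b).sum()
            if union > 0:
                best = max(best, float(np.logical_and(a, b).sum() / union))
    return best

# Synthetic 6x6 masks: `a` and `b` share a 2x2 block, `c` is disjoint from both
a = np.zeros((6, 6), dtype=np.uint8); a[0:4, 0:4] = 1
b = np.zeros((6, 6), dtype=np.uint8); b[2:6, 2:6] = 1
c = np.zeros((6, 6), dtype=np.uint8); c[5, 0] = 1

print("max IoU (a, b, c):", round(max_pairwise_iou([a, b, c]), 4))
print("max IoU (a, c):   ", max_pairwise_iou([a, c]))
```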


## 6) Instance-Level Metrics (distributions)

In [None]:

plt.figure(figsize=(6,4))
pd.Series([x for x in instance_counts if x is not None]).plot(kind="hist", bins=20)
plt.title("Distribution: number of forged instances per image"); plt.xlabel("# instances"); plt.ylabel("count")
plt.show()

plt.figure(figsize=(6,4))
pd.Series([x for x in union_area_pct if x is not None]).plot(kind="hist", bins=30)
plt.title("Distribution: forged area as fraction of image"); plt.xlabel("area fraction (0..1)"); plt.ylabel("count")
plt.show()


In [None]:

def instance_metrics(img_w, img_h, m: np.ndarray):
    ys, xs = np.where(m > 0)
    if len(xs) == 0:
        return None
    x1, x2 = xs.min(), xs.max()
    y1, y2 = ys.min(), ys.max()
    w = x2 - x1 + 1
    h = y2 - y1 + 1
    area = int((m > 0).sum())
    bbox_area = int(w * h)
    area_pct = area / (img_w * img_h)
    bbox_pct = bbox_area / (img_w * img_h)
    m8 = (m > 0).astype(np.uint8) * 255
    contours, _ = cv2.findContours(m8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    perim = sum(cv2.arcLength(c, True) for c in contours) if contours else 0.0
    compactness = (perim * perim) / (area + 1e-6)
    aspect = w / h if h > 0 else np.nan
    return dict(area_px=area, area_pct=area_pct, bbox_area=bbox_area, bbox_pct=bbox_pct,
                width=w, height=h, aspect=aspect, compactness=compactness,
                x1=x1, y1=y1, x2=x2, y2=y2)

rows = []
for it in items:
    if it["label"] != 1 or not it["mask_path"]:
        continue
    img = Image.open(it["path"]).convert("RGB")
    W, H = img.size
    insts = load_mask_instances(it["mask_path"])
    insts = [cv2.resize(m, (W, H), interpolation=cv2.INTER_NEAREST) if m.shape != (H, W) else m for m in insts]

    for k, m in enumerate(insts, start=1):
        met = instance_metrics(W, H, m)
        if met is None: 
            continue
        rows.append({
            "case_id": it["case_id"],
            "instance_id": k,
            **met
        })

inst_df = pd.DataFrame(rows)
print("Instances:", len(inst_df))
inst_df.head()


In [None]:

plt.figure(figsize=(6,4))
inst_df["area_pct"].plot(kind="hist", bins=30)
plt.title("Instance area fraction distribution"); plt.xlabel("area fraction (0..1)"); plt.ylabel("count")
plt.show()

plt.figure(figsize=(6,4))
inst_df["aspect"].replace([np.inf, -np.inf], np.nan).dropna().plot(kind="hist", bins=30)
plt.title("Instance bbox aspect ratio (W/H)"); plt.xlabel("ratio"); plt.ylabel("count")
plt.show()

plt.figure(figsize=(6,4))
inst_df["compactness"].replace([np.inf, -np.inf], np.nan).dropna().plot(kind="hist", bins=30)
plt.title("Instance compactness (perimeter^2 / area)"); plt.xlabel("compactness"); plt.ylabel("count")
plt.show()
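
To read the compactness histogram, it helps to know what `perimeter^2 / area` gives for simple shapes. The sketch below computes it for an ideal circle, square, and thin strip, next to the isoperimetric quotient `4*pi*area / perimeter^2`. That quotient is a common normalization and is not used elsewhere in this notebook. Pixel contours measured with `cv2.arcLength` will deviate somewhat from these continuous values, especially for small masks.

```python
import math

def compactness(perimeter, area):
    """perimeter^2 / area, the same ratio used in `instance_metrics`."""
    return perimeter ** 2 / area

def isoperimetric_quotient(perimeter, area):
    """4*pi*area / perimeter^2: 1.0 for a circle, smaller for less compact shapes."""
    return 4 * math.pi * area / perimeter ** 2

r = 10.0
circle = (2 * math.pi * r, math.pi * r ** 2)   # (perimeter, area)
square = (4 * 20.0, 20.0 ** 2)                 # 20 x 20 square
strip = (2 * (100.0 + 4.0), 100.0 * 4.0)       # 100 x 4 rectangle

for name, (p, a) in [("circle", circle), ("square", square), ("strip", strip)]:
    print(f"{name:6s} compactness={compactness(p, a):7.2f}  IQ={isoperimetric_quotient(p, a):.3f}")
```

A circle gives the minimum possible compactness of `4*pi` (about 12.57), so values far above that point to elongated or fragmented regions.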


## 7) Labeled Visualizations

In [None]:

def show_forged_example(item: dict, display_size=DISPLAY_SIZE, save_prefix=None):
    assert item["label"] == 1 and item["mask_path"] is not None
    img = Image.open(item["path"]).convert("RGB")
    W, H = img.size
    union = load_mask_npy(item["mask_path"])
    if union.shape != (H, W):
        union = np.array(Image.fromarray(union).resize((W, H), Image.NEAREST))
    insts = load_mask_instances(item["mask_path"])
    insts = [cv2.resize(m, (W, H), interpolation=cv2.INTER_NEAREST) if m.shape != (H, W) else m for m in insts]

    img_r, union_r = resize_img_and_mask(img, union, display_size)
    img_np = pil_to_np_rgb(img_r)
    overlay = overlay_instances(
        img_np,
        [cv2.resize(m, (display_size, display_size), interpolation=cv2.INTER_NEAREST) for m in insts]
    )


    # Side-by-side: original / union mask / overlay
    plt.figure(figsize=(15,5))
    plt.subplot(1,3,1); plt.imshow(img_np); plt.axis("off"); plt.title("Original")
    plt.subplot(1,3,2); plt.imshow(union_r, cmap="gray"); plt.axis("off"); plt.title("Union mask")
    plt.subplot(1,3,3); plt.imshow(overlay); plt.axis("off"); plt.title(f"Overlay ({len(insts)} instance(s))")
    plt.tight_layout(); plt.show()

    if save_prefix:
        save_rgb(f"{save_prefix}_original.png", img_np)
        Image.fromarray(union_r.astype(np.uint8) * 255).save(f"{save_prefix}_mask.png")
        save_rgb(f"{save_prefix}_overlay.png", overlay)
        print("Saved:", f"{save_prefix}_*.png")

forged_items = [it for it in items if it["label"] == 1 and it["mask_path"] is not None]
if len(forged_items) == 0:
    print("No forged items with masks to visualize.")
else:
    ex = np.random.choice(forged_items)
    show_forged_example(ex, display_size=DISPLAY_SIZE, save_prefix=f"{PREVIEW_DIR}/{Path(ex['path']).stem}")


In [None]:

def show_and_save_forged_grid(forged_list: list, n=MAX_GRID_IMAGES, tile_size=GRID_SIZE, cols=3, out_prefix="grid"):
    sel = forged_list[:n] if len(forged_list) <= n else list(np.random.choice(forged_list, n, replace=False))
    rows = (len(sel) + cols - 1) // cols

    plt.figure(figsize=(5 * cols, 5 * rows))
    saved_paths = []
    for i, it in enumerate(sel, 1):
        img = Image.open(it["path"]).convert("RGB")
        W, H = img.size
        insts = load_mask_instances(it["mask_path"])
        insts = [cv2.resize(m, (W, H), interpolation=cv2.INTER_NEAREST) if m.shape != (H, W) else m for m in insts]

        img_r = img.resize((tile_size, tile_size), resample=Image.BILINEAR)
        over = overlay_instances(pil_to_np_rgb(img_r),
                                 [cv2.resize(m, (tile_size, tile_size), interpolation=cv2.INTER_NEAREST) for m in insts])
        ax = plt.subplot(rows, cols, i)
        ax.imshow(over); ax.axis("off")
        ax.set_title(f"{Path(it['path']).name}\n{len(insts)} inst")

        save_path = f"{PREVIEW_DIR}/{Path(it['path']).stem}_overlay.png"
        save_rgb(save_path, over)
        saved_paths.append(save_path)

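    plt.tight_layout()
    # Save the whole grid alongside the per-image overlays
    plt.savefig(f"{PREVIEW_DIR}/{out_prefix}.png", dpi=100, bbox_inches="tight")
    plt.show()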
    print("Saved overlays:", len(saved_paths))
    for p in saved_paths[:5]:
        print(" -", p)
    if len(saved_paths) > 5:
        print(" - ...")

if len(forged_items) > 0:
    show_and_save_forged_grid(forged_items, n=MAX_GRID_IMAGES, tile_size=GRID_SIZE, cols=3, out_prefix="grid")


In [None]:

def show_auth_vs_forged(auth_list, forged_list, k=6, size=384):
    k_auth = min(k, len(auth_list))
    k_forg = min(k, len(forged_list))
    sel_auth = auth_list[:k_auth] if len(auth_list) <= k_auth else list(np.random.choice(auth_list, k_auth, replace=False))
    sel_forg = forged_list[:k_forg] if len(forged_list) <= k_forg else list(np.random.choice(forged_list, k_forg, replace=False))

    cols = max(k_auth, k_forg, 1)
    plt.figure(figsize=(4 * cols, 8))

    # Row 1: authentic
    for i, it in enumerate(sel_auth, 1):
        img = Image.open(it["path"]).convert("RGB").resize((size, size), resample=Image.BILINEAR)
        ax = plt.subplot(2, cols, i)
        ax.imshow(img); ax.axis("off")
        ax.set_title(f"Authentic\n{Path(it['path']).name}")

    # Row 2: forged (overlay)
    for i, it in enumerate(sel_forg, 1):
        img = Image.open(it["path"]).convert("RGB").resize((size, size), resample=Image.BILINEAR)
        insts = load_mask_instances(it["mask_path"])
        insts = [cv2.resize(m, (size, size), interpolation=cv2.INTER_NEAREST) for m in insts]
        over  = overlay_instances(pil_to_np_rgb(img), insts)
        ax = plt.subplot(2, cols, cols + i)
        ax.imshow(over); ax.axis("off")
        ax.set_title(f"Forged (overlay)\n{Path(it['path']).name}")

    plt.tight_layout(); plt.show()

auth_items = [it for it in items if it["label"] == 0]
if len(auth_items) > 0 and len(forged_items) > 0:
    show_auth_vs_forged(auth_items, forged_items, k=6, size=384)
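
The plots above resize every image to a square, which keeps overlays aligned but stretches non-square images. If undistorted previews are preferred, one option is an aspect-preserving resize followed by centering on a square canvas. The sketch below only computes the target size and offsets; `fit_within` and `letterbox_offsets` are hypothetical helpers.

```python
def fit_within(width, height, size):
    """Scale (width, height) so the longer side equals `size`, keeping the aspect ratio."""
    scale = size / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

def letterbox_offsets(new_w, new_h, size):
    """Top-left offset that centers a (new_w, new_h) image on a size x size canvas."""
    return (size - new_w) // 2, (size - new_h) // 2

new_w, new_h = fit_within(1600, 900, 768)
off_x, off_y = letterbox_offsets(new_w, new_h, 768)
print("resized to:", (new_w, new_h), "paste at:", (off_x, off_y))
```

The same size and offsets would need to be applied to each instance mask (with nearest-neighbor resampling) so that overlays stay aligned.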


## 8) Copy–Move Signature Peek (template matching, qualitative)

In [None]:

def template_match_peek(item: dict, display_size=DISPLAY_SIZE, scale=TEMPLATE_MATCH_SCALE, max_inst=1):
    assert item["label"] == 1 and item["mask_path"] is not None
    img = Image.open(item["path"]).convert("RGB")
    W, H = img.size
    base = pil_to_np_rgb(img)

    if scale != 1.0:
        base_small = cv2.resize(base, (int(W*scale), int(H*scale)), interpolation=cv2.INTER_AREA)
    else:
        base_small = base.copy()

    insts = load_mask_instances(item["mask_path"])
    insts = [cv2.resize(m, (W, H), interpolation=cv2.INTER_NEAREST) if m.shape != (H, W) else m for m in insts]
    if len(insts) == 0:
        print("No instances found for:", item["case_id"])
        return

    insts = insts[:max_inst]

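    vis = base_small.copy()
    for k, m in enumerate(insts, start=1):
        ys, xs = np.where(m > 0)
        if len(xs) == 0:  # empty instance mask: nothing to match
            continue
        x1, x2 = int(xs.min()), int(xs.max())
        y1, y2 = int(ys.min()), int(ys.max())
        patch = base[y1:y2+1, x1:x2+1]
        if scale != 1.0:
            # Keep at least 1 px per side so cv2.resize never gets a zero-sized target
            new_w = max(1, int(patch.shape[1] * scale))
            new_h = max(1, int(patch.shape[0] * scale))
            patch = cv2.resize(patch, (new_w, new_h), interpolation=cv2.INTER_AREA)
        ph, pw = patch.shape[:2]

        res = cv2.matchTemplate(base_small, patch, cv2.TM_CCOEFF_NORMED)
        res = np.nan_to_num(res, nan=-1.0, posinf=-1.0, neginf=-1.0)  # flat patches can yield NaN/inf

        # The patch comes from this same image, so the global maximum is the patch's
        # own location. Suppress that neighborhood so the reported hit is the best
        # match elsewhere, i.e. a candidate for the duplicated region.
        sx, sy = int(x1 * scale), int(y1 * scale)
        res[max(0, sy - ph // 2):sy + ph // 2 + 1, max(0, sx - pw // 2):sx + pw // 2 + 1] = -1.0
        min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(res)

        top_left = max_loc
        bottom_right = (top_left[0] + pw, top_left[1] + ph)

        cv2.rectangle(vis, top_left, bottom_right, (255, 0, 0), 2)
        cv2.putText(vis, f"inst{k} score={max_val:.2f}", (top_left[0], max(0, top_left[1]-5)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255,255,255), 2, cv2.LINE_AA)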

    base_disp = cv2.resize(base, (display_size, display_size), interpolation=cv2.INTER_AREA)
    vis_disp  = cv2.resize(vis,  (display_size, display_size), interpolation=cv2.INTER_AREA)

    plt.figure(figsize=(12,5))
    plt.subplot(1,2,1); plt.imshow(base_disp); plt.axis("off"); plt.title("Original")
    plt.subplot(1,2,2); plt.imshow(vis_disp);  plt.axis("off"); plt.title("Template-match best hits")
    plt.tight_layout(); plt.show()

    out_path = f"{PREVIEW_DIR}/{Path(item['path']).stem}_tmatch.png"
    save_rgb(out_path, vis)
    print("Saved:", out_path)

# Run on up to 2 random forged images
if len(forged_items) > 0:
    for ex in list(np.random.choice(forged_items, size=min(2, len(forged_items)), replace=False)):
        template_match_peek(ex, display_size=DISPLAY_SIZE, scale=TEMPLATE_MATCH_SCALE, max_inst=1)
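
When a template is cut from the same image it is searched in, the highest correlation score falls on the template's own location. The sketch below illustrates this on synthetic data with a brute-force, NumPy-only normalized cross-correlation, then suppresses the self-match to reveal the planted copy. The `ncc_map` helper, the random image, and the copy positions are all assumptions made for illustration; this is not OpenCV's implementation.

```python
import numpy as np

def ncc_map(image, patch):
    """Brute-force zero-mean normalized cross-correlation of `patch` over `image` (2-D floats)."""
    ph, pw = patch.shape
    H, W = image.shape
    p = patch - patch.mean()
    p_norm = np.sqrt((p ** 2).sum())
    out = np.full((H - ph + 1, W - pw + 1), -1.0)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            win = image[y:y + ph, x:x + pw]
            w = win - win.mean()
            denom = p_norm * np.sqrt((w ** 2).sum())
            if denom > 0:
                out[y, x] = (p * w).sum() / denom
    return out

rng = np.random.default_rng(0)
img = rng.random((40, 40))
src_y, src_x, dst_y, dst_x, size = 5, 5, 25, 20, 8
# Plant a synthetic copy-move: duplicate the source block at the destination
img[dst_y:dst_y + size, dst_x:dst_x + size] = img[src_y:src_y + size, src_x:src_x + size]

patch = img[src_y:src_y + size, src_x:src_x + size].copy()
scores = ncc_map(img, patch)
naive_hit = tuple(int(v) for v in np.unravel_index(np.argmax(scores), scores.shape))

# Suppress the neighborhood of the template's own location, then look again
suppressed = scores.copy()
half = size // 2
suppressed[max(0, src_y - half):src_y + half + 1, max(0, src_x - half):src_x + half + 1] = -1.0
dup_hit = tuple(int(v) for v in np.unravel_index(np.argmax(suppressed), suppressed.shape))

print("best hit without suppression (y, x):", naive_hit)
print("best hit with suppression    (y, x):", dup_hit)
```

Without suppression the best hit is the source block itself. With it, the planted copy becomes the top match.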


## 9) Analyst Summary & Next Steps

In [None]:

from IPython.display import Markdown, display

def md(txt): display(Markdown(txt))

# Check that the DataFrame itself exists; "instances" and "union_area_pct" are columns, not variables
avg_inst = float(np.nanmean(forged_rows["instances"])) if "forged_rows" in globals() and "instances" in forged_rows.columns else float("nan")
med_area = float(np.nanmedian(forged_rows["union_area_pct"])) if "forged_rows" in globals() and "union_area_pct" in forged_rows.columns else float("nan")

md(f'''
### Analyst Summary

**Data balance**
- Authentic images: **{int((df["label"]==0).sum())}**
- Forged images: **{int((df["label"]==1).sum())}**
- With masks available: **{int(((df["label"]==1) & df["mask_path"].notna()).sum())}**

**Image geometry**
- Width/height/aspect distributions shown above.
- Consider standardizing input size (e.g., 512–1024) in modeling notebooks.

**Masks & instances**
- Avg. instances per forged image: **{avg_inst:.2f}**
- Median forged area fraction (union): **{med_area:.4f}**
- Shape mismatches were auto-resized; empty masks flagged in logs.

**Instance shape**
- If the area-fraction histogram skews small, expect strong pixel-class imbalance.
- Read the aspect-ratio and compactness histograms for elongated or fragmented shapes.

**Visual sanity**
- Check that the overlays align with the expected duplicated regions.
- The template-matching peek is a qualitative check only; a high score is not proof of duplication.

---

### Recommendations (for future modeling)
- **Sampling:** Oversample forged images / positive patches.
- **Loss:** Combine **BCE + Dice**, consider **Focal** for small regions.
- **Resolution:** Start 512, fine-tune 768–1024 for sharper boundaries.
- **Augmentations:** Light affine/photometric; optional **synthetic copy–move** pasting.
- **Post-processing:** Morphological cleanup + CC filtering.
- **Evaluation:** Track pixel-F1 and the official image-level oF1 on a validation split.
''')
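
As a rough illustration of the **BCE + Dice** recommendation, the sketch below computes a combined loss in NumPy on a synthetic mask with a small positive region. The `bce_dice_loss` name, the `bce_weight` parameter, and the equal weighting are assumptions made for illustration. A modeling notebook would normally use the differentiable equivalents from its training framework.

```python
import numpy as np

def bce_dice_loss(probs, target, bce_weight=0.5, eps=1e-7):
    """Weighted sum of binary cross-entropy and (1 - soft Dice) on probability maps."""
    probs = np.clip(probs.astype(np.float64), eps, 1 - eps)
    target = target.astype(np.float64)
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    inter = (probs * target).sum()
    dice = (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return bce_weight * bce + (1 - bce_weight) * (1 - dice)

# Synthetic 8x8 target with a small forged region (4 of 64 pixels)
target = np.zeros((8, 8))
target[2:4, 2:4] = 1

good = np.where(target == 1, 0.9, 0.1)   # confident and correct
all_bg = np.full((8, 8), 0.1)            # predicts background everywhere

loss_good = bce_dice_loss(good, target)
loss_bg = bce_dice_loss(all_bg, target)
print(f"good prediction:     {loss_good:.3f}")
print(f"all-background pred: {loss_bg:.3f}")
```

The Dice term is what penalizes the all-background prediction heavily here. With only 4 positive pixels, BCE alone changes comparatively little.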
