## IMAGE CLEANING AND DEDUPLICATION
This notebook takes the videos I have of each agent and converts them into images. As the videos I recorded with NVIDIA ShadowPlay are 1080p at 30 fps, I first reduce the frame rate to around 3 fps; otherwise we would get many images that are very similar. I then measure the blur of each image and discard the ones that are not sharp enough. I then remove any near-duplicate images. Finally I save all the processed images in a folder, ready for use in the ML model section.

Below I get the necessary project paths ready, along with any important variables. I then quickly loop over the agent folders and count how many videos each agent has.

In [2]:
from pathlib import Path
import os
import pandas as pd
import cv2
import numpy as np

PROJECT_DIR   = Path(r"D:\Valorant ML data")
RAW_VIDEO_DIR = PROJECT_DIR / "Raw videos"   

# Here are some variables that I create now for clarity
TARGET_FPS = 3          # we want to reduce fps from 30 to 3
BLUR_THRESHOLD = 110.0        # frames whose Laplacian variance is below this count as blurry, so a higher value is stricter
DUPLICATE_MAD_THRESHOLD = 2.0  # frames whose MAD is below this count as duplicates, so a higher value flags more duplicates

In [3]:
FRAME_OUT_DIR = PROJECT_DIR / "frames_3fps_clean"
FRAME_OUT_DIR.mkdir(parents=True, exist_ok=True)

In [6]:
#Here I just check the agents list and how many videos each has
agents = []
counts = []
for agent_folder in sorted(RAW_VIDEO_DIR.iterdir()):
    if agent_folder.is_dir():
        n = 0
        for root, _, files in os.walk(agent_folder):
            for f in files:
                if f.lower().endswith(".mp4"):
                    n += 1
        agents.append(agent_folder.name)
        counts.append(n)

counts_df = pd.DataFrame({"agent": agents, "video_count": counts}).sort_values("agent").reset_index(drop=True)
print(counts_df.to_string(index=False))

    agent  video_count
    Astra            9
   Breach            8
Brimstone            8
  Chamber            8
   Cipher            8
    Clove            8
 Deadlock            8
     Fade            8
    Gekko            8
      Iso            8
     Jett            8
     Kayo            8
     Neon            8
     Raze            8
     Sage            8
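
As a side note, the counting loop can be written more compactly with `Path.rglob`. This is only a sketch of an alternative, not what the cell above runs: `count_videos` is a helper name I made up, and the folder layout is a throwaway one built in a temporary directory so the example runs anywhere.

```python
from pathlib import Path
import tempfile

def count_videos(agent_folder, suffix=".mp4"):
    """Count files under agent_folder (recursively) whose extension matches, ignoring case."""
    return sum(1 for p in Path(agent_folder).rglob("*")
               if p.is_file() and p.suffix.lower() == suffix)

# Throwaway folder layout standing in for "Raw videos" (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Astra" / "session1").mkdir(parents=True)
    (root / "Breach").mkdir()
    (root / "Astra" / "a.mp4").touch()
    (root / "Astra" / "session1" / "b.MP4").touch()
    (root / "Astra" / "notes.txt").touch()   # not a video, should be ignored
    (root / "Breach" / "c.mp4").touch()
    counts = {d.name: count_videos(d) for d in sorted(root.iterdir()) if d.is_dir()}

print(counts)
```

Comparing `p.suffix.lower()` keeps the check case-insensitive, the same as the `f.lower().endswith(".mp4")` test in the original loop.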


Next I define some functions that will help in a bit:

`is_blurry` takes the image and the blur threshold as arguments and computes the variance of the Laplacian. If this is low, it's more likely the image is blurry. I read a tutorial about this online, but intuitively I'm guessing that if the variance is too low, then there isn't enough 'change' (edges) in the image, so it isn't sharp. This function returns True or False, and also the value of the variance of the Laplacian.
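
To check that intuition, here is a minimal sketch in plain NumPy. It is not the `cv2.Laplacian` call this notebook uses: `laplacian_variance` and `box_blur` are helper names I made up, the Laplacian is the simple 4-neighbour version (which I believe matches OpenCV's default kernel, apart from how borders are handled), and the test image is a synthetic checkerboard rather than a real frame.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian, computed on interior pixels only."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def box_blur(gray):
    """3x3 mean filter on interior pixels: a crude stand-in for a blurry frame."""
    g = gray.astype(np.float64)
    h, w = g.shape
    acc = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            acc += g[dy:dy + h - 2, dx:dx + w - 2]
    return acc / 9.0

# Synthetic checkerboard with hard edges (hypothetical test image)
yy, xx = np.indices((64, 64))
sharp = (((yy // 8) + (xx // 8)) % 2) * 255.0
blurred = box_blur(sharp)

sharp_score = laplacian_variance(sharp)
blurred_score = laplacian_variance(blurred)
print(f"sharp={sharp_score:.1f} blurred={blurred_score:.1f}")
```

The blurred copy scores much lower because the hard edges get smeared out, which is exactly what the threshold in `is_blurry` is meant to catch. A completely flat image scores 0.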

`frames_too_similar` takes 2 greyscale images and compares them using the mean absolute difference (MAD). The smaller the MAD, the more similar the frames are. The function returns True or False, and also the value of the MAD.
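
A small sketch of what the MAD looks like on toy data, again in plain NumPy instead of `cv2.absdiff`. The frames here are random arrays I generate for illustration, and `mean_abs_diff` is a made-up helper name.

```python
import numpy as np

def mean_abs_diff(a, b):
    """Mean absolute difference between two same-sized greyscale frames."""
    # Cast to a signed type first so uint8 subtraction cannot wrap around
    return float(np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16))))

rng = np.random.default_rng(0)
frame_a = rng.integers(0, 256, size=(48, 64), dtype=np.uint8)

# Near duplicate: identical except for one small 4x4 patch
frame_b = frame_a.copy()
frame_b[:4, :4] = 255 - frame_b[:4, :4]

# Different scene: an unrelated random frame
frame_c = rng.integers(0, 256, size=(48, 64), dtype=np.uint8)

mad_same = mean_abs_diff(frame_a, frame_a)
mad_near = mean_abs_diff(frame_a, frame_b)
mad_far = mean_abs_diff(frame_a, frame_c)
print(mad_same, mad_near, mad_far)
```

With a threshold of 2.0, the identical frame and the near duplicate would both be flagged, while the unrelated frame would pass. Because the MAD averages over every pixel, a small change in one corner barely moves it.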

`save_frame` takes the image and a save path and just saves the frame as a JPG without adjusting the resolution, creating the parent folder if needed. It uses `cv2.imwrite` to do this and returns True or False depending on whether the write succeeded.

In [20]:
def is_blurry(image_bgr, threshold=BLUR_THRESHOLD):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY) #first we convert the image into greyscale
    focus = cv2.Laplacian(gray, cv2.CV_64F).var()
    return (focus < threshold), float(focus)

In [21]:
def frames_too_similar(prev_gray, curr_gray, mad_threshold=DUPLICATE_MAD_THRESHOLD):
    diff = cv2.absdiff(prev_gray, curr_gray)
    mad = float(np.mean(diff))
    return (mad < mad_threshold), mad

In [22]:
def save_frame(image_bgr, save_path):
    
    save_path = Path(save_path)
    save_path.parent.mkdir(parents=True, exist_ok=True)
    done = cv2.imwrite(str(save_path), image_bgr)
    return done

Now we define the function that processes one video: it samples at 3 fps, removes blurry or near-duplicate frames, saves the kept frames to `frames_3fps_clean/<Agent>/…`, prints a one-line summary, and returns a pandas DataFrame with one row per sampled frame.
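
The "3 fps" sampling is really just index arithmetic: `process_single_video` keeps one frame every `step` frames, where `step` is the source fps divided by the target fps, rounded. Here is a small sketch of that arithmetic on its own. `sampled_indices` is a made-up helper and the clip length is a hypothetical example, not one of my recordings.

```python
def sampled_indices(n_frames, src_fps=30.0, target_fps=3.0):
    """Frame indices kept by 'take every step-th frame' sampling."""
    step = max(1, int(round(src_fps / target_fps)))
    return step, list(range(0, n_frames, step))

# Hypothetical 15-second clip recorded at 30 fps (450 frames)
step, idx = sampled_indices(n_frames=450)
print(step, len(idx), idx[:4])
```

Because `step` is rounded to a whole number, the effective rate is only approximately the target when the ratio is not an integer (for example a 29.97 fps source still gives a step of 10).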

In [23]:
def process_single_video(video_path, agent_name=None, out_dir=FRAME_OUT_DIR, target_fps=TARGET_FPS, 
                         blur_thr=BLUR_THRESHOLD, dup_mad_thr=DUPLICATE_MAD_THRESHOLD):    
    video_path = Path(video_path)
    if agent_name is None:
        agent_name = video_path.parent.name  # this sets the agent name
    cap = cv2.VideoCapture(str(video_path))
    #source fps, the fallback is 30
    src_fps = cap.get(cv2.CAP_PROP_FPS)
    if src_fps is None or src_fps <= 0:
        src_fps = 30.0
    # I keep roughly 1 frame every 'step' frames
    step = max(1, int(round(src_fps / max(1e-6, target_fps))))
    sampled = 0
    kept = 0
    skipped_blur = 0
    skipped_dup = 0
    rows = []
    last_kept_gray = None
    frame_idx = 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break

        if frame_idx % step != 0:
            frame_idx += 1
            continue  #this ensures we only take the frame on our step. I also increase the frame index counter to record this
        sampled += 1
        is_blur, focus_val = is_blurry(frame_bgr, threshold=blur_thr) #we call our blur function here

        # duplicate check vs the last kept frame (not necessarily the previous sampled frame)
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        if last_kept_gray is not None:
            is_dup, mad_val = frames_too_similar(last_kept_gray, gray, mad_threshold=dup_mad_thr)
        else:
            is_dup, mad_val = False, np.nan

        keep = (not is_blur) and (not is_dup)
        save_name = f"{video_path.stem}_{frame_idx:06d}.jpg"
        save_path = out_dir / agent_name / save_name
        
        if keep:
            saved_fine = save_frame(frame_bgr, save_path)
            if saved_fine:
                last_kept_gray = gray
                kept += 1
            else:
                keep = False  # the write failed, so the log should not mark this frame as kept
        elif is_blur:
            skipped_blur += 1
        else:
            skipped_dup += 1
        # now store a tiny row 
        rows.append({
            "agent": agent_name,
            "video": video_path.name,
            "frame_index": frame_idx,
            "src_fps": float(src_fps),
            "sample_step": int(step),
            "focus": float(focus_val),
            "mad_vs_prev": float(mad_val) if not np.isnan(mad_val) else np.nan,
            "blurry": bool(is_blur),
            "duplicate": bool(is_dup),
            "kept": bool(keep),
            "saved_path": str(save_path) if keep else ""})
        frame_idx += 1
    cap.release()
    print(f"{agent_name}: {video_path.name} | sampled={sampled} kept={kept} "
          f"blur_skipped={skipped_blur} dup_skipped={skipped_dup}")

    return pd.DataFrame(rows)

Next, `process_agent_folder` uses `process_single_video` to process all the videos inside a single agent folder. Its arguments are the agent folder and the folder where the new frames will be saved.

In [24]:
def process_agent_folder(agent_folder, out_dir=FRAME_OUT_DIR):
    agent_folder = Path(agent_folder)
    agent_name = agent_folder.name
    # find videos
    video_paths = []
    for root, _, files in os.walk(agent_folder):
        for f in files:
            if f.lower().endswith(".mp4"):
                video_paths.append(Path(root) / f)
    video_paths = sorted(video_paths)
    all_rows = []
    for vp in video_paths:    #now we process each video by calling the appropriate function
        df = process_single_video(video_path=vp,
            agent_name=agent_name,
            out_dir=out_dir,
            target_fps=TARGET_FPS,
            blur_thr=BLUR_THRESHOLD,
            dup_mad_thr=DUPLICATE_MAD_THRESHOLD)
        if not df.empty:
            all_rows.append(df)
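    # lastly we combine and print a short summary
    if not all_rows:  # pd.concat raises a ValueError on an empty list
        print(f"{agent_name}: no frames sampled")
        return pd.DataFrame()
    agent_df = pd.concat(all_rows, ignore_index=True)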
    total_sampled = len(agent_df)
    total_kept = int(agent_df["kept"].sum())
    keep_rate = total_kept / total_sampled if total_sampled > 0 else 0.0
    print(f"{agent_name}: videos={len(video_paths)} | sampled={total_sampled} | kept={total_kept} "
        f"| keep_rate={keep_rate:.2%}")
    return agent_df

Now we define a function to process all agents:

In [25]:
def process_all_agents(raw_root=RAW_VIDEO_DIR, out_dir=FRAME_OUT_DIR):
    raw_root = Path(raw_root)
    agent_logs = []
    for agent_folder in sorted(raw_root.iterdir()):
        if not agent_folder.is_dir():
            continue
        agent_name = agent_folder.name
        #I initialise a list to collect all the videos for this agent
        videos = []
        for root, _, files in os.walk(agent_folder):
            for f in files:
                if f.lower().endswith(".mp4"):
                    videos.append(Path(root) / f)
        videos = sorted(videos)
        if not videos:
            continue
        # process videos for this agent
        per_video = []
        for vp in videos:
            df = process_single_video(
                video_path=vp,
                agent_name=agent_name,
                out_dir=out_dir,
                target_fps=TARGET_FPS,
                blur_thr=BLUR_THRESHOLD,
                dup_mad_thr=DUPLICATE_MAD_THRESHOLD)
            if not df.empty:
                per_video.append(df)
        if per_video:
            agent_logs.append(pd.concat(per_video, ignore_index=True))
    if agent_logs:
        return pd.concat(agent_logs, ignore_index=True)
    # empty backup with expected columns
    return pd.DataFrame(columns=["agent","video","frame_index","src_fps","sample_step","focus","mad_vs_prev","blurry","duplicate","kept","saved_path"])

Now that the most important function has been defined, we call it and print some checks to ensure everything is OK. These checks are important because I realised that some agents had very few images saved. This was the case for the agents I gathered videos of first, since those early videos were not of the highest quality. Thus, in order to discard fewer frames, I went back and adjusted the necessary parameters (namely the duplicate MAD threshold: a frame is dropped when its MAD falls below the threshold, so a lower value is what keeps more frames).
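
To see which way the duplicate threshold pushes the keep rate, here is a toy sketch that replays the same "compare with the last kept frame" logic on made-up frames. `count_kept` is a hypothetical helper and the frames are flat synthetic arrays, so the numbers say nothing about my actual videos.

```python
import numpy as np

def count_kept(frames, mad_threshold):
    """Count frames kept when each frame is compared with the last KEPT frame."""
    kept = 0
    last_kept = None
    for frame in frames:
        if last_kept is not None:
            mad = float(np.mean(np.abs(frame - last_kept)))
            if mad < mad_threshold:
                continue  # flagged as a near duplicate, so it is dropped
        last_kept = frame
        kept += 1
    return kept

# Hypothetical clip: 20 flat frames whose brightness rises by 1 grey level each
frames = [np.full((8, 8), float(i)) for i in range(20)]

results = {thr: count_kept(frames, thr) for thr in (0.5, 2.0, 5.0)}
print(results)
```

On this toy clip, thresholds of 0.5, 2.0 and 5.0 keep 20, 10 and 4 frames respectively, so raising the threshold discards more frames. Note that re-thresholding the logged `mad_vs_prev` column would only approximate this, because which frame counts as "last kept" changes with the threshold.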

In [26]:
all_log_df = process_all_agents(raw_root=RAW_VIDEO_DIR, out_dir=FRAME_OUT_DIR)
print("frames saved under:", FRAME_OUT_DIR)

summary_df = (all_log_df.groupby("agent")["kept"].agg(total_sampled="size", kept_frames="sum").reset_index() )
summary_df["keep_rate"] = (summary_df["kept_frames"] / summary_df["total_sampled"]).round(4) 
print("\nPer agent summary:")
print(summary_df.to_string(index=False))
# quick look at the first few log rows 
print("\nLog preview (first 8 rows):")
print(all_log_df.head(8).to_string(index=False))

Astra: Valorant 2025.08.25 - 22.02.06.08.DVR.mp4 | sampled=46 kept=45 blur_skipped=0 dup_skipped=1
Astra: Valorant 2025.08.25 - 22.04.31.09.DVR.mp4 | sampled=46 kept=1 blur_skipped=0 dup_skipped=45
Astra: Valorant 2025.08.25 - 22.06.50.11.DVR.mp4 | sampled=46 kept=7 blur_skipped=0 dup_skipped=39
Astra: Valorant 2025.08.25 - 22.07.44.12.DVR.mp4 | sampled=46 kept=17 blur_skipped=0 dup_skipped=29
Astra: Valorant 2025.08.25 - 22.08.44.13.DVR.mp4 | sampled=46 kept=1 blur_skipped=0 dup_skipped=45
Astra: Valorant 2025.08.25 - 22.09.55.14.DVR.mp4 | sampled=46 kept=8 blur_skipped=0 dup_skipped=38
Astra: Valorant 2025.08.25 - 22.11.24.15.DVR.mp4 | sampled=46 kept=44 blur_skipped=2 dup_skipped=0
Astra: Valorant 2025.08.25 - 22.12.46.16.DVR.mp4 | sampled=47 kept=47 blur_skipped=0 dup_skipped=0
Astra: Valorant 2025.08.25 - 22.13.50.17.DVR.mp4 | sampled=47 kept=15 blur_skipped=0 dup_skipped=32
Breach: Valorant 2025.08.25 - 22.24.53.30.DVR.mp4 | sampled=46 kept=45 blur_skipped=0 dup_skipped=1
Breach: