# 🎬 Automatic Video Cutter & Editor for Cooking Videos

This notebook implements an **automatic video summarization and editing pipeline**. It is designed mainly for cooking videos, but the approach is generalizable.

The system will:
- Remove static (non-moving) sections from the videos.
- Optionally use CLIP to select only those segments that are semantically similar to a provided text prompt.
- Combine only segments with movement.
- Allow you to choose between using just motion, just CLIP, or both.
- Merge selected clips into a final video.
- Limit the total output duration (e.g., 60 seconds for a reel).
- Support both vertical and horizontal video formats, with an option to force a vertical (9:16) output.
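
As a rough illustration of how the motion and CLIP modes combine, here is a minimal sketch. The helper name `accept_frame` and its arguments are hypothetical and not part of the notebook; it only assumes that a disabled filter never rejects a frame.

```python
def accept_frame(has_motion, is_similar, use_motion=True, use_clip=False):
    """Return True if a sampled frame passes every enabled filter."""
    # A disabled filter never rejects a frame.
    motion_ok = has_motion or not use_motion
    clip_ok = is_similar or not use_clip
    return motion_ok and clip_ok

# Motion only: the similarity result is ignored.
print(accept_frame(has_motion=True, is_similar=False, use_motion=True, use_clip=False))
# Both filters: the frame must pass each one.
print(accept_frame(has_motion=True, is_similar=False, use_motion=True, use_clip=True))
```

The first call accepts the frame and the second rejects it, because only the second call requires a CLIP match.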

**Sources:**
- OpenCV: [https://opencv.org/](https://opencv.org/)
- MoviePy: [https://zulko.github.io/moviepy/](https://zulko.github.io/moviepy/)
- CLIP (OpenAI): [https://github.com/openai/CLIP](https://github.com/openai/CLIP)
- Whisper (OpenAI): [https://github.com/openai/whisper](https://github.com/openai/whisper)

---

## 📦 Install Dependencies

Run the following cell to install the required libraries:


In [1]:
# Pin MoviePy to 1.x: this notebook relies on `moviepy.editor`, which MoviePy 2.x removed
!pip install -q opencv-python "moviepy<2" openai-whisper
!pip install -q git+https://github.com/openai/CLIP.git
!apt-get -qq install -y ffmpeg


## Mount Google Drive
Make sure your videos are stored in a Drive folder. Run the cell below to mount your Drive:

In [10]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Configuration
Customize these parameters as needed. You can adjust folder paths, thresholds, clip durations, and output format here.

In [11]:
import os

input_folder = "/content/drive/MyDrive/video_vongole"
output_folder = "/content/drive/MyDrive/video_output"
os.makedirs(output_folder, exist_ok=True)

# Text prompt (if empty, only motion will be used)
text_prompt = "chopping garlic"  # e.g., "chopping garlic". Set to "" for no text filter.

# Mode selection: use motion detection and/or CLIP similarity
use_motion = True   # Use motion detection to remove static parts
use_clip   = False   # Use CLIP to filter segments based on the text prompt

# Clip duration parameters (in seconds)
clip_min_duration = 0.3   # Minimum duration for an extracted clip
clip_max_duration = 3.0   # Maximum duration for an extracted clip (used as default clip length)

max_total_duration = 60.0  # Set to None for no limit

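force_vertical = False    # Crop the output to a vertical 9:16 format
# Output size as (width, height); None keeps the source resolution
output_resolution = None

frame_sample_rate = 5      # Seconds between sampled frames (a sampling interval, not a rate)
movement_threshold = 400000  # Minimum number of changed pixels for a frame to count as "moving"
merge_clip_gap = 2.0       # Merge segments separated by at most this many seconds
clip_similarity_threshold = 0.35  # CLIP cosine similarity threshold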
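
If you change these values often, it can help to catch inconsistent settings before any video is processed. The sketch below groups a few of the parameters into a dataclass with basic checks. `PipelineConfig` is a hypothetical container, not something the notebook defines; its field names simply mirror the globals above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineConfig:
    # Hypothetical container; field names mirror the notebook's globals.
    clip_min_duration: float = 0.3
    clip_max_duration: float = 3.0
    max_total_duration: Optional[float] = 60.0
    frame_sample_rate: float = 5.0
    merge_clip_gap: float = 2.0
    clip_similarity_threshold: float = 0.35

    def __post_init__(self):
        # Reject combinations that would silently produce no clips.
        if not 0 < self.clip_min_duration <= self.clip_max_duration:
            raise ValueError("need 0 < clip_min_duration <= clip_max_duration")
        if self.frame_sample_rate <= 0:
            raise ValueError("frame_sample_rate must be positive")
        if self.max_total_duration is not None and self.max_total_duration <= 0:
            raise ValueError("max_total_duration must be positive or None")

config = PipelineConfig()  # the defaults pass validation
```

For example, `PipelineConfig(clip_min_duration=5.0)` raises a `ValueError` because the minimum would exceed the 3-second maximum.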

## Import Libraries and Load Models
We import the required libraries. CLIP is loaded for semantic similarity only when `use_clip` is enabled and the prompt is non-empty; OpenCV handles motion detection. (The Whisper import sits behind a disabled `if False:` guard and is not used in this version.)
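
A note on the similarity scale: a threshold such as 0.35 only makes sense for cosine similarity, which lies in [-1, 1]. CLIP's `encode_image` and `encode_text` return embeddings that are not unit length, so they need to be normalized before taking a dot product. The toy NumPy sketch below (made-up 2-D vectors, not real CLIP embeddings) shows the difference between a raw dot product and a cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image_vec = np.array([3.0, 4.0])  # toy stand-in for an image embedding
text_vec = np.array([6.0, 8.0])   # same direction, twice the length

print(image_vec @ text_vec)                    # raw dot product grows with vector length
print(cosine_similarity(image_vec, text_vec))  # depends only on direction
```

Both vectors point the same way, so the cosine similarity is at its maximum even though the raw dot product is far above 1.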

In [12]:
import cv2
import numpy as np
import torch
import clip
from PIL import Image
from moviepy.editor import VideoFileClip, concatenate_videoclips

# Load CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
if use_clip and text_prompt.strip() != "":
    clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
else:
    clip_model, clip_preprocess = None, None

# (Optional) Load Whisper model if needed in the future
if False:
    import whisper
    whisper_model = whisper.load_model("base")

## Define Helper Functions
Below we define functions to:

- Check motion between frames.
- Compute CLIP similarity between a frame and a text prompt.
- Merge segments that are close in time.
- Process individual videos and extract candidate segments.
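
The motion check counts every pixel that differs between two grayscale frames, so faint sensor noise can count as movement. The NumPy-only sketch below shows one possible variation that ignores small intensity changes. The helper `changed_pixel_count` and its `min_delta` parameter are assumptions for illustration and do not exist in the notebook.

```python
import numpy as np

def changed_pixel_count(prev_gray, gray, min_delta=0):
    """Count pixels whose absolute intensity change exceeds min_delta."""
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(prev_gray.astype(np.int16) - gray.astype(np.int16))
    return int(np.count_nonzero(diff > min_delta))

prev_gray = np.zeros((4, 4), dtype=np.uint8)
gray = prev_gray.copy()
gray[0, 0] = 2     # faint sensor noise
gray[1, 1] = 200   # a real change

print(changed_pixel_count(prev_gray, gray))                # counts both pixels
print(changed_pixel_count(prev_gray, gray, min_delta=25))  # counts only the large change
```

With `min_delta=0` the count matches what a plain non-zero difference would report; a higher value trades sensitivity for robustness to noise.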

In [13]:
def is_frame_similar_to_prompt(frame_img, prompt, threshold):
    """
    Uses CLIP to determine if the given frame is semantically similar to the prompt.
    Returns a tuple: (is_similar, similarity_score).
    """
    # OpenCV frames are BGR; convert to RGB before handing them to PIL/CLIP
    rgb_img = cv2.cvtColor(frame_img, cv2.COLOR_BGR2RGB)
    image = clip_preprocess(Image.fromarray(rgb_img)).unsqueeze(0).to(device)
    # Tokenize and encode text prompt
    text_tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        image_features = clip_model.encode_image(image)
        text_features = clip_model.encode_text(text_tokens)
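        # Normalize both embeddings so the dot product is a true cosine similarity
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        similarities = (image_features @ text_features.T).squeeze(0)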
        best_score = similarities.max().item()
    return best_score > threshold, best_score

def detect_motion(prev_frame, frame):
    """
    Compare two grayscale frames using absolute difference.
    Returns True if the number of changed pixels exceeds the threshold.
    """
    diff = cv2.absdiff(prev_frame, frame)
    non_zero_count = np.count_nonzero(diff)
    return non_zero_count > movement_threshold

def merge_close_segments(segments, gap=2.0):
    """
    Merge segments that are closer than `gap` seconds.
    Each segment is a tuple: (start, end, score).
    """
    if not segments:
        return []
    segments = sorted(segments, key=lambda x: x[0])
    merged = [segments[0]]
    for start, end, score in segments[1:]:
        last_start, last_end, last_score = merged[-1]
        if start - last_end <= gap:
            # Merge and update score as the maximum
            merged[-1] = (last_start, max(end, last_end), max(score, last_score))
        else:
            merged.append((start, end, score))
    return merged
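
To see the merging rule in action, here is a small worked example on toy data. The function body is a condensed restatement of the helper above, repeated only so the example runs on its own; the segment values are made up.

```python
def merge_close_segments(segments, gap=2.0):
    # Same logic as the notebook's helper, repeated so the example runs alone.
    merged = []
    for start, end, score in sorted(segments, key=lambda s: s[0]):
        if merged and start - merged[-1][1] <= gap:
            last_start, last_end, last_score = merged[-1]
            merged[-1] = (last_start, max(end, last_end), max(score, last_score))
        else:
            merged.append((start, end, score))
    return merged

segments = [(0.0, 1.5, 0.30), (3.5, 6.5, 0.40), (12.0, 15.0, 0.20)]
print(merge_close_segments(segments, gap=2.0))
```

The first two segments are exactly 2.0 seconds apart, so they merge into one segment that keeps the higher score; the third is 5.5 seconds away and stays separate.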


## Process Videos and Extract Candidate Segments
This function processes a single video file:

- It samples one frame every `frame_sample_rate` seconds.
- It checks each sampled frame for motion and (if enabled) CLIP similarity.
- It extracts a candidate segment of up to `clip_max_duration` seconds centered on each accepted frame.
- It merges segments that are close together in time.
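
The sampling and windowing steps can be sketched without any video file. The helper `sample_windows` below is hypothetical and assumes that every sampled frame is accepted; it only shows which time windows the sampling interval and clip length would produce.

```python
def sample_windows(duration, step, clip_len):
    """Return the (start, end) window centered on each sampled timestamp."""
    windows = []
    t = 0.0
    while t < duration:
        # Clamp the window to the bounds of the video.
        start = max(0.0, t - clip_len / 2)
        end = min(duration, t + clip_len / 2)
        windows.append((start, end))
        t += step
    return windows

windows = sample_windows(duration=12.0, step=5.0, clip_len=3.0)
print(windows)

# Gap between the end of one window and the start of the next.
gaps = [nxt[0] - cur[1] for cur, nxt in zip(windows, windows[1:])]
print(gaps)
```

With a 5-second interval and 3-second clips, neighbouring windows sit 2.0 seconds apart. A merge gap of 2.0 seconds is therefore just enough to join windows from consecutive accepted samples, while a smaller gap would keep them separate.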

In [14]:
def process_video(video_path, prompt):
    """
    Processes a video file and returns (video_path, merged_segments).
    Each merged segment is a tuple: (start, end, score).
    """
    cap = cv2.VideoCapture(video_path)
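    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Guard against unreadable files, for which OpenCV reports an FPS of 0
    if fps <= 0:
        cap.release()
        print(f"Skipping {video_path}: could not read FPS")
        return video_path, []
    duration = total_frames / fps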

    print(f"Processing {video_path} | Duration: {duration:.2f}s")

    candidate_segments = []
    prev_frame_gray = None
    # Use clip_max_duration as the duration of each extracted segment
    clip_duration = clip_max_duration

    t = 0
    while t < duration:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ret, frame = cap.read()
        if not ret:
            t += frame_sample_rate
            continue

        # Convert to grayscale for motion detection
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        motion = True
        if use_motion and prev_frame_gray is not None:
            motion = detect_motion(prev_frame_gray, frame_gray)
        prev_frame_gray = frame_gray

        # Initialize similarity check result and score
        similar = True
        similarity_score = 1.0  # default score if CLIP is not used
        # Use the `prompt` argument rather than the global `text_prompt`
        if use_clip and prompt.strip() != "":
            similar, similarity_score = is_frame_similar_to_prompt(frame, prompt, clip_similarity_threshold)

        # Accept the frame if conditions are met (must be moving and, if prompt is used, semantically similar)
        if motion and similar:
            start = max(0, t - clip_duration/2)
            end = min(duration, t + clip_duration/2)
            # Ensure the segment is at least clip_min_duration long
            if (end - start) >= clip_min_duration:
                candidate_segments.append((start, end, similarity_score))
        t += frame_sample_rate
    cap.release()

    # Merge segments that are close in time
    merged_segments = merge_close_segments(candidate_segments, gap=merge_clip_gap)
    return video_path, merged_segments

## Process All Videos and Select Best Segments
This section loops over all video files in the input folder and processes each one. If `max_total_duration` is set, the segments are ranked by score (CLIP similarity) and added greedily until the duration budget is filled; a segment that would overflow the budget is skipped. When CLIP is disabled, every segment carries the default score of 1.0, so the ranking falls back to processing order.

In [15]:
import glob

def get_all_video_files(folder):
    # Match extensions case-insensitively (e.g. both .mov and .MOV)
    video_extensions = {".mp4", ".mov", ".avi", ".mkv"}
    return sorted(
        f for f in glob.glob(os.path.join(folder, "*"))
        if os.path.splitext(f)[1].lower() in video_extensions
    )

# Process videos and collect candidate segments
all_candidate_segments = []  # List of tuples: (video_path, start, end, score)
video_files = get_all_video_files(input_folder)
print(f"Found {len(video_files)} video(s).")

for video_file in video_files:
    vpath, segments = process_video(video_file, text_prompt)
    for seg in segments:
        start, end, score = seg
        all_candidate_segments.append((vpath, start, end, score))

print(f"Total candidate segments found: {len(all_candidate_segments)}")

# If a maximum total duration is set, select the best segments based on score.
def select_segments(segments, max_duration):
    if max_duration is None:
        return segments
    # Sort segments by descending score
    segments_sorted = sorted(segments, key=lambda x: x[3], reverse=True)
    selected = []
    cumulative_duration = 0.0
    for seg in segments_sorted:
        vpath, start, end, score = seg
        seg_duration = end - start
        if cumulative_duration + seg_duration <= max_duration:
            selected.append(seg)
            cumulative_duration += seg_duration
        if cumulative_duration >= max_duration:
            break
    # For a coherent montage, sort the selected segments by video path and start time
    selected = sorted(selected, key=lambda x: (x[0], x[1]))
    return selected

selected_segments = select_segments(all_candidate_segments, max_total_duration)
print(f"Selected segments after duration filtering: {len(selected_segments)}")


Found 5 video(s).
Processing /content/drive/MyDrive/video_vongole/IMG_6134.MOV | Duration: 27.70s
Processing /content/drive/MyDrive/video_vongole/IMG_6135.MOV | Duration: 5.66s
Processing /content/drive/MyDrive/video_vongole/IMG_6136.MOV | Duration: 37.87s
Processing /content/drive/MyDrive/video_vongole/IMG_6139.MOV | Duration: 5.29s
Processing /content/drive/MyDrive/video_vongole/IMG_6140.MOV | Duration: 6.25s
Total candidate segments found: 5
Selected segments after duration filtering: 4
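
To make the greedy selection concrete, here is a toy example. `select_within_budget` is a condensed restatement of the selection logic written for this illustration (the name and the sample segments are made up), not a replacement for the function above.

```python
def select_within_budget(segments, max_duration):
    """Greedy pick by descending score; segments are (path, start, end, score)."""
    selected, total = [], 0.0
    for seg in sorted(segments, key=lambda s: s[3], reverse=True):
        length = seg[2] - seg[1]
        # Keep the segment only if it still fits in the remaining budget.
        if total + length <= max_duration:
            selected.append(seg)
            total += length
    # Restore a stable playback order: by source file, then start time.
    return sorted(selected, key=lambda s: (s[0], s[1]))

toy = [
    ("a.mp4", 0.0, 30.0, 0.9),
    ("b.mp4", 0.0, 40.0, 0.8),
    ("c.mp4", 0.0, 20.0, 0.7),
]
print(select_within_budget(toy, max_duration=60.0))
```

Here the second-best segment (40 s) does not fit after the first (30 s), so it is skipped and the shorter third segment is taken instead. A greedy pass like this is simple but does not always fill the budget as tightly as possible.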


## Crop (Force) to Vertical Format (Optional)
If you want to force the output video to vertical (9:16), the following helper function will crop each clip to a vertical format.

In [16]:
def force_vertical_format(clip):
    """
    Crops the clip to a vertical 9:16 aspect ratio from the center.
    """
    w, h = clip.size
    target_ratio = 9/16
    target_width = int(h * target_ratio)
    if w > target_width:
        # Crop horizontally: center crop
        x1 = (w - target_width) // 2
        x2 = x1 + target_width
        clip = clip.crop(x1=x1, x2=x2)
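    else:
        # The clip is narrower than 9:16: center-crop vertically instead,
        # since resizing alone would keep the original aspect ratio
        target_height = int(w / target_ratio)
        y1 = (h - target_height) // 2
        clip = clip.crop(y1=y1, y2=y1 + target_height)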
    return clip
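
The crop arithmetic is easier to check on plain numbers than on video clips. The sketch below computes a centered 9:16 crop box for a given frame size. `vertical_crop_box` is a hypothetical helper, and trimming the top and bottom of sources that are narrower than 9:16 is this sketch's own choice.

```python
def vertical_crop_box(w, h, target_ratio=9 / 16):
    """Return (x1, y1, x2, y2) of a centered crop with the target width/height ratio."""
    target_width = int(h * target_ratio)
    if w > target_width:
        # Too wide: trim the left and right edges.
        x1 = (w - target_width) // 2
        return (x1, 0, x1 + target_width, h)
    # Too narrow (or already matching): trim the top and bottom instead.
    target_height = int(w / target_ratio)
    y1 = (h - target_height) // 2
    return (0, y1, w, y1 + target_height)

print(vertical_crop_box(1920, 1080))  # landscape 16:9 source
print(vertical_crop_box(1080, 2400))  # tall 9:20 source
```

For a 1920x1080 source the crop keeps a 607-pixel-wide column from the center; for a 1080x2400 source it keeps the middle 1920 rows.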

## Extract and Concatenate Clips into Final Video
Now we extract the subclips from each selected segment and concatenate them using MoviePy. We also apply the vertical format if forced, and finally write the output video.

In [17]:
final_clips = []
for seg in selected_segments:
    video_path, start, end, score = seg
    try:
        video = VideoFileClip(video_path)
    except Exception as e:
        print(f"Error loading {video_path}: {e}")
        continue
    clip = video.subclip(start, end)
    # Resize to output resolution if specified
    if output_resolution is not None:
        clip = clip.resize(newsize=output_resolution)
    # Force vertical format if required
    if force_vertical:
        clip = force_vertical_format(clip)
    final_clips.append(clip)

if final_clips:
    final_video = concatenate_videoclips(final_clips, method="compose")
    # If the final video is longer than max_total_duration, trim it.
    if max_total_duration is not None and final_video.duration > max_total_duration:
        final_video = final_video.subclip(0, max_total_duration)
    output_path = os.path.join(output_folder, "final_output.mp4")
    print(f"Writing final video to {output_path} (duration: {final_video.duration:.2f}s)")
    final_video.write_videofile(output_path, codec="libx264", audio=True)
else:
    print("⚠️ No valid segments found for the final montage.")

Writing final video to /content/drive/MyDrive/video_output/final_output.mp4 (duration: 43.70s)
Moviepy - Building video /content/drive/MyDrive/video_output/final_output.mp4.
MoviePy - Writing audio in final_outputTEMP_MPY_wvf_snd.mp3




MoviePy - Done.
Moviepy - Writing video /content/drive/MyDrive/video_output/final_output.mp4






Moviepy - Done !
Moviepy - video ready /content/drive/MyDrive/video_output/final_output.mp4
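
The loop above opens a new `VideoFileClip` for every selected segment, even when several segments come from the same file. One possible refinement, sketched here on toy data, is to group the segments by source first so that each file only needs to be opened once. `group_by_source` is a hypothetical helper and the sample segments are made up.

```python
from collections import defaultdict

def group_by_source(segments):
    """Map each video path to its (start, end) ranges, sorted by start time."""
    grouped = defaultdict(list)
    for path, start, end, _score in segments:
        grouped[path].append((start, end))
    return {path: sorted(ranges) for path, ranges in grouped.items()}

toy_segments = [
    ("a.mp4", 10.0, 13.0, 1.0),
    ("b.mp4", 0.0, 3.0, 1.0),
    ("a.mp4", 0.0, 1.5, 1.0),
]
print(group_by_source(toy_segments))
```

Each key of the result could then drive a single file open, with one subclip extracted per `(start, end)` range.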


## Future Extensions
* Whisper Integration: Use Whisper for audio transcription to filter or annotate clips based on speech.

* Additional Visual Filters: Integrate object detection, pose estimation (e.g., with MediaPipe) or action recognition (using I3D, SlowFast, or Swin Transformer).

* Different Output Formats: Enable export to various social media formats (vertical 9:16 for TikTok/Reels, horizontal for YouTube, etc.).

* GUI Integration: Build an interactive interface with Gradio or Streamlit.
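
As a starting point for the Whisper idea, the sketch below filters transcript segments by keyword. It assumes input shaped like the `"segments"` list in the result of Whisper's `transcribe()`, where each entry has `start`, `end`, and `text` keys. The helper `segments_mentioning` and the toy transcript are hypothetical, and no model is loaded here.

```python
def segments_mentioning(transcript_segments, keywords):
    """Return (start, end, text) for transcript segments containing any keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for seg in transcript_segments:
        text = seg["text"].strip()
        # Case-insensitive substring match against every keyword.
        if any(k in text.lower() for k in lowered):
            hits.append((seg["start"], seg["end"], text))
    return hits

# Toy data shaped like the "segments" list in Whisper's transcribe() result.
toy_transcript = [
    {"start": 0.0, "end": 4.2, "text": " First we chop the garlic."},
    {"start": 4.2, "end": 9.0, "text": " Then we wait for the water to boil."},
]
print(segments_mentioning(toy_transcript, ["garlic"]))
```

The returned time ranges could be fed into the same merging and selection steps used for the visual segments.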




