## Initial setup
Use CLIPModel, OpenAI's pretrained image-text model, to vectorize videos: CLIP embeds individual images, so each video is processed as a sequence of sampled frames.

In [9]:
import cv2
import torch
from transformers import CLIPProcessor, CLIPModel
import numpy as np

In [2]:
# Load model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

print("Model loaded")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Model loaded


### Define vectorize()
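- Takes a video input (cap: an opened `cv2.VideoCapture`)
- Samples about one frame per second (every `int(fps)`-th frame) and embeds each sampled frame with CLIP
- Returns the list of frame embeddings and releases `cap`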

In [5]:
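def vectorize(cap):
    """Embed roughly one frame per second of `cap` with CLIP.

    Returns a list of NumPy arrays, one embedding per sampled frame,
    and releases `cap` when done.
    """
    frame_vectors = []
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    # Sample every int(fps)-th frame; use every frame if FPS is missing or below 1
    # (int() of a sub-1 FPS would be 0 and break the modulo below)
    frame_interval = max(1, int(frame_rate))

    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if frame_idx % frame_interval == 0:
            # print(f"Processing frame {frame_idx}")

            # OpenCV decodes frames as BGR; convert to RGB before preprocessing
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            inputs = processor(images=rgb, return_tensors="pt")
            with torch.no_grad():
                embedding = model.get_image_features(**inputs)
            frame_vectors.append(embedding.squeeze().cpu().numpy())
        frame_idx += 1

    cap.release()

    return frame_vectors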
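
To make the sampling rule concrete, here is a small sketch of which frame indices get embedded. `sampled_indices` is a hypothetical helper, not part of the notebook; it assumes the interval is `int(fps)` with a floor of 1, and it needs no video file.

```python
def sampled_indices(fps, n_frames):
    """Hypothetical helper: indices kept when sampling every int(fps)-th frame."""
    # Floor the interval at 1 so a missing or sub-1 FPS value still samples frames
    interval = max(1, int(fps))
    return list(range(0, n_frames, interval))


print(sampled_indices(24.0, 100))   # [0, 24, 48, 72, 96]
print(sampled_indices(29.97, 100))  # [0, 29, 58, 87]
```

Because `int()` truncates, a 29.97 fps clip is sampled every 29 frames, slightly more often than once per second.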

### Initial testing data
*The Fellowship of the Ring* and *The Two Towers* are both films in the Lord of the Rings franchise, so their trailers should be more similar to each other than either is to the *Manchester by the Sea* trailer.

In [7]:
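# Embed each trailer as a single vector
videos = ["manchester_by_the_sea", "fellowship_of_the_ring", "the_two_towers"]
vectors = {}

for v in videos:
    cap = cv2.VideoCapture(f"ex_trailers/{v}.mp4")
    # Fail early with a clear message instead of an empty-stack error later
    if not cap.isOpened():
        raise FileNotFoundError(f"Could not open ex_trailers/{v}.mp4")
    print(v)

    # Combine frame vectors (mean pooling)
    video_vector = np.mean(np.stack(vectorize(cap)), axis=0)
    print("Video vector shape:", video_vector.shape)

    vectors[v] = video_vector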

manchester_by_the_sea
Video vector shape: (512,)
fellowship_of_the_ring
Video vector shape: (512,)
the_two_towers
Video vector shape: (512,)
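
As a toy illustration of the mean-pooling step, the sketch below averages three made-up 4-dimensional "frame embeddings" into one vector. The values are invented for the example; the real frame vectors here are 512-dimensional CLIP embeddings.

```python
import numpy as np

# Made-up stand-ins for per-frame embeddings (assumption: 4-d for readability)
frame_vectors = [
    np.array([1.0, 0.0, 2.0, 4.0]),
    np.array([3.0, 0.0, 2.0, 0.0]),
    np.array([2.0, 3.0, 2.0, 2.0]),
]

stacked = np.stack(frame_vectors)    # shape (n_frames, dim) = (3, 4)
video_vector = stacked.mean(axis=0)  # average over frames -> shape (4,)

print(stacked.shape, video_vector.shape)
print(video_vector)
```

Averaging over `axis=0` collapses the frame dimension, so the result has the same length as a single frame embedding regardless of how many frames were sampled. The trade-off is that frame order and timing are discarded.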


In [8]:
from sklearn.metrics.pairwise import cosine_similarity

for i, v1 in enumerate(videos[:-1]):
    for v2 in videos[i+1:]:
        print(f"{v1} {v2} similarity: {cosine_similarity([vectors[v1]], [vectors[v2]])}")

manchester_by_the_sea fellowship_of_the_ring similarity: [[0.8736533]]
manchester_by_the_sea the_two_towers similarity: [[0.8699611]]
fellowship_of_the_ring the_two_towers similarity: [[0.9678031]]
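
The double brackets in the output appear because scikit-learn's `cosine_similarity` returns a 2-D matrix of pairwise scores. For a single pair, the same quantity can be sketched directly in NumPy as the dot product divided by the product of the norms. `cosine` below is a hypothetical helper, shown on toy vectors rather than the trailer embeddings.

```python
import numpy as np


def cosine(a, b):
    """Hypothetical helper: cosine similarity of two 1-D vectors as a float."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([1.0, 2.0], [2.0, 4.0]))  # scaled copy: magnitude is ignored
```

Cosine similarity depends only on direction, so it is unaffected by the overall scale of the mean-pooled vectors.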


## More fine-tuned applications
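CLIPModel is a strong general-purpose model, but it is not tuned for movie trailers: even the unrelated pairs above score about 0.87, leaving a narrow margin below the 0.97 of the two related trailers. We want a representation that is more specific to movie trailers.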