# Demo notebook
## kNN-Memory with ClipCap: Enhanced Long-Range Dependency Handling

Welcome to the ClipMemCap interactive notebook! Here you can provide your own .mp4 clip, and the system will generate a caption for it.

Before running the notebook, check the following:

- Make sure your current working directory is ./knn-memory-clipcap/demos. This directory is part of SebastiaanJohn's knn-memory-clipcap GitHub repository.
- Make sure that your video file is in the same directory as this notebook.
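
The two checks above can be sketched as a small pre-flight function. This is a minimal illustration only: check_demo_setup is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def check_demo_setup(video_name, working_dir=None):
    """Return a list of setup problems (an empty list means none were found).

    Hypothetical helper for illustration; not part of the repository.
    """
    working_dir = Path(working_dir).resolve() if working_dir is not None else Path.cwd()
    problems = []

    # The notebook expects to be run from the repository's 'demos' directory.
    if working_dir.name != "demos":
        problems.append(f"expected a 'demos' directory, got '{working_dir}'")

    # The video must be an .mp4 file located next to the notebook.
    if Path(video_name).suffix.lower() != ".mp4":
        problems.append(f"'{video_name}' does not have an .mp4 extension")
    elif not (working_dir / video_name).is_file():
        problems.append(f"'{video_name}' was not found in '{working_dir}'")

    return problems
```

Calling check_demo_setup("tennis.mp4") before the first cell would surface a wrong working directory or a missing file early, instead of failing partway through frame extraction.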

When you run the notebook on your video, two folders are created in this directory:

- 'frames', which stores the frames extracted from your video.
- 'embeddings', which stores the CLIP embeddings of those frames.
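
As a rough sketch of the folder setup described above, the snippet below creates both working folders if they do not exist yet. The function name prepare_working_dirs is an assumption made for illustration; the repository may create these folders differently.

```python
from pathlib import Path


def prepare_working_dirs(base_dir="."):
    """Create the 'frames' and 'embeddings' folders under base_dir.

    Hypothetical helper for illustration; not part of the repository.
    """
    base = Path(base_dir)
    dirs = {name: base / name for name in ("frames", "embeddings")}
    for path in dirs.values():
        # exist_ok=True makes the call safe to repeat on later runs.
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Using exist_ok=True means re-running the notebook does not fail when the folders are left over from a previous run.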

If you want the 'embeddings' and 'frames' folders to be deleted after captioning, set the remove_dirs argument of generate_caption to True.
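
The cleanup option could look roughly like the sketch below. It is an assumption about the behaviour, not the repository's actual implementation, and remove_working_dirs is a hypothetical name.

```python
import shutil
from pathlib import Path


def remove_working_dirs(base_dir=".", names=("frames", "embeddings")):
    """Delete the working folders under base_dir and return the names removed.

    Hypothetical helper for illustration; not part of the repository.
    """
    removed = []
    for name in names:
        path = Path(base_dir) / name
        # Only touch folders that actually exist; anything else is left alone.
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

Restricting the deletion to an explicit list of folder names keeps the video file and the notebook itself out of reach of the cleanup step.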

In [1]:
import os
import sys

import torch  # imported explicitly because torch.load is called later
from IPython.display import Video

module_path = os.path.abspath(os.path.join(".."))
if module_path not in sys.path:
    sys.path.append(module_path)

module_path_2 = os.path.abspath(os.path.join("../src"))
if module_path_2 not in sys.path:
    sys.path.append(module_path_2)

from src.models.clipcap import ClipCaptionPrefix
from src.process_vid import *

Set MODEL_DIR to the path of the model checkpoint you want to use for caption generation.

In [None]:
MODEL_DIR = "../checkpoints/activitynet_with_mem-best.pt"

Set video_name to the file name of your video, including the .mp4 extension. The frames extracted from the video are saved in the 'frames' folder.

In [2]:
video_name = "tennis.mp4"
extract_frames(video_name)

Extracting frames for video: tennis.mp4
Succesfully dumped frames of tennis: 101 frames


Preview the video that the system will caption.

In [3]:
Video(video_name)

#### Initialize the trained ClipMemCap model.

In [6]:
# The arguments must match those used when the saved checkpoint was trained.
model = ClipCaptionPrefix(
    10,
    batch_size=1,
    clip_length=10,
    prefix_size=512,
    num_layers=8,
    num_heads=8,
    memorizing_layers=(4, 5),
    max_knn_memories=64000,
    num_retrieved_memories=32,
)

model.load_state_dict(torch.load(MODEL_DIR, map_location="cpu"))

<All keys matched successfully>

#### Generate the caption!

In [7]:
generate_caption(video_name, model, remove_dirs=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Creating CLIP embeddings of frames...


100%|██████████| 5/5 [00:00<00:00, 1238.87it/s]
100%|██████████| 1/1 [00:00<00:00, 21845.33it/s]
100%|██████████| 35/35 [00:00<00:00, 594334.57it/s]


Generating caption...


'A man is playing a tennis racket on a tennis court.'