## Demo notebook
### kNN-Memory with ClipCap: Enhanced Long-Range Dependency Handling

This notebook lets you submit your own mp4 clip, for which ClipMemCap then generates a caption.

For this to work, please:
- make sure you are currently in the ./knn-memory-clipcap/demos directory (https://github.com/SebastiaanJohn/knn-memory-clipcap/tree/main/demos).
- have the video in the same directory as this notebook.
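
The two requirements above can also be checked programmatically before running the rest of the notebook. The following is a minimal sketch: `check_demo_setup` is a hypothetical helper that is not part of the repository, and it assumes that the notebook directory is named `demos` and that the video sits next to the notebook.

```python
from pathlib import Path


def check_demo_setup(video_name, cwd=None):
    """Return a list of problems with the current setup (empty if none).

    Hypothetical helper, not part of the repository. Assumes the notebook
    lives in a directory called 'demos' and the video sits next to it.
    """
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    problems = []
    if cwd.name != "demos":
        problems.append(f"expected to run from 'demos', but cwd is '{cwd.name}'")
    if Path(video_name).suffix.lower() != ".mp4":
        problems.append(f"'{video_name}' does not have the .mp4 suffix")
    if not (cwd / video_name).is_file():
        problems.append(f"'{video_name}' not found in {cwd}")
    return problems
```

Calling `check_demo_setup('tennis.mp4')` from the notebook should return an empty list when both requirements are met; otherwise each entry describes one thing to fix.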

Parsing a video creates a folder named 'frames' in this directory, where the frames of the submitted video are stored. \
A folder named 'embeddings' is also created, where the CLIP embeddings of the frames are stored. \
To have the 'embeddings' and 'frames' folders removed after the caption has been generated, set _remove_dirs_ to _True_ when calling _generate_caption_.
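
If the folders were kept and you want to remove them by hand later, something along the following lines should work. This is only a sketch: `remove_generated_dirs` is a hypothetical helper, not a function from the repository, and it assumes the folders are named 'frames' and 'embeddings' and live in the notebook directory, as described above.

```python
import shutil
from pathlib import Path


def remove_generated_dirs(base_dir=".", names=("frames", "embeddings")):
    """Delete the generated folders under base_dir and return those removed.

    Hypothetical helper: the notebook's own cleanup is handled by the
    captioning call itself; this only illustrates a manual equivalent.
    """
    removed = []
    for name in names:
        target = Path(base_dir) / name
        if target.is_dir():
            shutil.rmtree(target)  # removes the folder and everything in it
            removed.append(name)
    return removed
```

The function skips folders that do not exist, so it is safe to call more than once.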

#### Imports

In [1]:
import os
import sys

import torch  # imported explicitly; used below to load the checkpoint
from IPython.display import Video

# Make the repository root and its src/ folder importable from the demos directory.
for path in ('..', '../src'):
    module_path = os.path.abspath(path)
    if module_path not in sys.path:
        sys.path.append(module_path)

from src.models.clipcap import ClipCaptionPrefix
from src.process_vid import *

#### Path to the checkpoint of the model that you want to use for captioning.

In [None]:
MODEL_DIR = '../checkpoints/activitynet_with_mem-best.pt'
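
Because the checkpoint path is relative, a wrong working directory only shows up later, when the weights are loaded. A small check like the one below can surface the problem earlier. It is a sketch under assumptions: `resolve_checkpoint` is a hypothetical helper that is not part of the repository.

```python
from pathlib import Path


def resolve_checkpoint(path):
    """Return the absolute checkpoint path, or raise if the file is missing.

    Hypothetical helper, not part of the repository.
    """
    checkpoint = Path(path).expanduser().resolve()
    if not checkpoint.is_file():
        raise FileNotFoundError(f"No checkpoint found at {checkpoint}")
    return checkpoint
```

For example, `resolve_checkpoint(MODEL_DIR)` raises a `FileNotFoundError` that includes the absolute path that was tried, which makes a misplaced checkpoint easier to spot.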

#### Please provide the name of the video, including the .mp4 suffix.
The frames of the video will be stored in the 'frames' folder.

In [2]:
video_name = 'tennis.mp4'
extract_frames(video_name)

Extracting frames for video: tennis.mp4
Succesfully dumped frames of tennis: 101 frames


The video for which the caption will be generated.

In [3]:
Video(video_name)

#### Initialize the trained ClipMemCap model.

In [6]:
# Arguments need to match those of the saved model.
model = ClipCaptionPrefix(
    10,
    batch_size=1,
    clip_length=10,
    prefix_size=512,
    num_layers=8,
    num_heads=8,
    memorizing_layers=(4, 5),
    max_knn_memories=64000,
    num_retrieved_memories=32,
)

model.load_state_dict(torch.load(MODEL_DIR, map_location="cpu"))

<All keys matched successfully>

#### Generate the caption!

In [7]:
generate_caption(video_name, model, remove_dirs=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Creating CLIP embeddings of frames...


100%|██████████| 5/5 [00:00<00:00, 1238.87it/s]
100%|██████████| 1/1 [00:00<00:00, 21845.33it/s]
100%|██████████| 35/35 [00:00<00:00, 594334.57it/s]


Generating caption...


'A man is playing a tennis racket on a tennis court.'