# Finding landmarks in videos


In this notebook, you can extract the names and frames of landmarks from a video. 
The video is first analyzed with [OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit), an open-vocabulary object detection model by Google Research. 
We use [OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit) as a building detector: the video is split into frames, and each frame is queried with hand-crafted text descriptions of buildings (e.g., "castle at daylight in full view"). The time code of each frame in which a building is detected is saved and compared to the output of the [LandmarkNER](https://huggingface.co/spaces/constantinSch/LandmarkNER_EL).
For the [LandmarkNER](https://huggingface.co/spaces/constantinSch/LandmarkNER_EL), the German video subtitles are analyzed with a Named Entity Recognition model for landmarks, trained on BR subtitles, which recognizes the names of buildings that occur in the subtitles. The output of the LandmarkNER is then disambiguated to Wikipedia titles by [mGENRE](https://github.com/facebookresearch/GENRE). 

The frames for which the OWL-ViT model detects a building and the [LandmarkNER](https://huggingface.co/spaces/constantinSch/LandmarkNER_EL) detects a landmark at the same time are saved and can be downloaded as a .zip file.

The video needs to be uploaded as an .mp4 file and the associated subtitles as a .txt file with the time codes in milliseconds. The name of the subtitle file has to start with the video ID, which can be followed by more information after an `_`.

# Building detection with [OWL-ViT](https://huggingface.co/docs/transformers/model_doc/owlvit)

## Set-up environment for the building detector

Install a version of Hugging Face transformers that includes OWL-ViT.

In [None]:
!pip install transformers==4.22

### Load pre-trained model and processor

The image preprocessing and the tokenization of the text queries are handled by `OwlViTProcessor`. The processor resizes the image(s), rescales the pixel values to the [0, 1] range, and normalizes them across the channels using the mean and standard deviation specified in the original codebase.
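
As a rough illustration of the rescaling and normalization steps, the sketch below processes a single RGB pixel in plain Python. The mean and standard deviation are the values commonly used for CLIP-style models and are an assumption here; the processor's own configuration is authoritative. `normalize_pixel` is a hypothetical helper, not part of `transformers`.

```python
# Assumed CLIP-style per-channel statistics (check the processor config for the actual values)
ASSUMED_MEAN = (0.48145466, 0.4578275, 0.40821073)
ASSUMED_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb, mean=ASSUMED_MEAN, std=ASSUMED_STD):
    """Rescale an 8-bit RGB pixel to [0, 1], then normalize each channel."""
    rescaled = [value / 255.0 for value in rgb]
    return [(v - m) / s for v, m, s in zip(rescaled, mean, std)]

# A bright orange pixel: red ends up positive, blue negative
print(normalize_pixel((255, 128, 0)))
```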


Text queries are tokenized using a CLIP tokenizer and stacked to output tensors of shape [batch_size * num_max_text_queries, sequence_length]. If you are inputting more than one set of (image, text prompt/s), num_max_text_queries is the maximum number of text queries per image across the batch. Input samples with fewer text queries are padded. 
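
The padding of text queries across a batch can be sketched in plain Python. `pad_text_queries` is a hypothetical helper, and the padding value `" "` is an assumption made for illustration; the sketch only shows how `num_max_text_queries` determines the number of rows in the stacked text tensor.

```python
def pad_text_queries(batch_queries, pad_value=" "):
    """Pad each image's query list to the longest query list in the batch."""
    num_max_text_queries = max(len(queries) for queries in batch_queries)
    return [
        queries + [pad_value] * (num_max_text_queries - len(queries))
        for queries in batch_queries
    ]

# Two images: the first has three queries, the second only one
batch = [["castle", "palace", "church"], ["skyscraper"]]
padded = pad_text_queries(batch)

# Rows of the stacked text tensor: batch_size * num_max_text_queries
num_rows = len(padded) * len(padded[0])
print(padded)
print(num_rows)
```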

In [None]:
from transformers import OwlViTProcessor, OwlViTForObjectDetection

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")

## Preprocess input video
To analyse an .mp4 video with OWL-ViT, you first have to upload it to the Colab. The following code extracts every N-th frame of the video and converts it to the required format. 
You can change how many frames are extracted by adjusting N. Currently, we extract one frame every 1.5 seconds (N = 75), assuming a frame rate of 50 frames per second. 
This makes consecutive extracted frames sufficiently different from each other. 
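
If your video has a different frame rate, N can be derived from the frame rate and the desired sampling interval. The following is a minimal sketch; `frames_to_skip` and `expected_time_code_ms` are hypothetical helpers and not part of the notebook.

```python
def frames_to_skip(fps, interval_seconds):
    """Number of frames between two extracted frames for a given sampling interval."""
    return max(1, round(fps * interval_seconds))

def expected_time_code_ms(frame_index, fps):
    """Approximate time code in milliseconds of a frame index at a constant frame rate."""
    return frame_index / fps * 1000.0

# The notebook's setting: 50 frames per second, one frame every 1.5 seconds
N = frames_to_skip(50, 1.5)
print(N)
print(expected_time_code_ms(N, 50))
```

In the notebook itself, the frame rate reported by the video is available as `fps` inside the extraction loop.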

In [None]:
import cv2
from PIL import Image

import glob

# For videos directly uploaded to the Colab
path = '/content/*.mp4' 
video_paths=glob.glob(path) 

video_files=[]
video_files.append(video_paths[0])

# How many frames to skip
N = 75

# The frame images will be stored in video_frames
video_frames = []

# The time codes in milliseconds will be stored in video_time_codes
video_time_codes = []

# Open the video files
for file in video_files:
  capture = cv2.VideoCapture(file)
  fps = capture.get(cv2.CAP_PROP_FPS)

  current_frame = 0
  while capture.isOpened():
    # Read the current frame
    ret, frame = capture.read()

    # Stop at the end of the video
    if not ret:
      break

    # Convert BGR to RGB, wrap the frame as a PIL image for the OWL-ViT processor and store it
    video_frames.append(Image.fromarray(frame[:, :, ::-1]))
    video_time_codes.append(capture.get(cv2.CAP_PROP_POS_MSEC))

    # Jump ahead N frames
    current_frame += N
    capture.set(cv2.CAP_PROP_POS_FRAMES, current_frame)

  # Release the video handle
  capture.release()

# Print some statistics
print(f"Frames extracted: {len(video_frames)}")
print(f"Time codes extracted: {len(video_time_codes)}")

In [None]:
# Look at the first extracted frame
video_frames[0]

In [None]:
# Look at the first 19 time codes
video_time_codes[:19]

## Analyze each frame with OWL-ViT


In [None]:
import torch

# Use GPU if available
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [None]:
# Preprocessing all images 
import numpy as np

# Preprocessing
images = video_frames
images = [Image.fromarray(np.uint8(img)).convert("RGB") for img in images]

# List of building types to detect
text_queries = [
    "palace at daylight in full view",
    "castle at daylight in full view",
    "skyscraper at daylight in full view",
    "parliament building at daylight in full view",
    "administrative building at daylight in full view",
    "municipal building at daylight in full view",
    "corporate headquarters building at daylight in full view",
    "railway station building at daylight in full view",
    "exterior view of church building in daylight",
    "exterior view of mosque building in daylight",
    "exterior view of synagogue building in daylight",
]

In [None]:
# Set model in evaluation mode
model = model.to(device)
model.eval()

outputs = []

In [None]:
# Loop through images
with torch.no_grad():
  for image in images:
    inputs = processor(text=text_queries, images=image, return_tensors="pt").to(device)
    # Get predictions
    output = model(**inputs)
    # Keep only the tensors needed later and move them to the CPU to make room on the GPU
    outputs.append({"logits": output.logits.cpu(), "pred_boxes": output.pred_boxes.cpu()})
    del inputs, output

In [None]:
len(outputs)

## Display frames with bounding boxes

In [None]:
# Threshold to eliminate low probability predictions
score_threshold = 0.25

In [None]:
def plot_predictions(input_image, text_queries, scores, boxes, labels):
    fig, ax = plt.subplots(1, 1, figsize=(16, 9))
    ax.imshow(input_image, extent=(0, 1, 1, 0))
    ax.set_axis_off()

    for score, box, label in zip(scores, boxes, labels):
      if score < score_threshold:
        continue

      cx, cy, w, h = box
      ax.plot([cx-w/2, cx+w/2, cx+w/2, cx-w/2, cx-w/2],
              [cy-h/2, cy-h/2, cy+h/2, cy+h/2, cy-h/2], "r")
      ax.text(
          cx - w / 2,
          cy + h / 2 + 0.015,
          f"{text_queries[label]}: {score:1.2f}",
          ha="left",
          va="top",
          color="red",
          bbox={
              "facecolor": "white",
              "edgecolor": "red",
              "boxstyle": "square,pad=.3"
          })
  

In [None]:
import matplotlib.pyplot as plt

from transformers.image_utils import ImageFeatureExtractionMixin
mixin = ImageFeatureExtractionMixin()

# Let's plot the predictions of the first 100 frames (or fewer if the video is shorter)
image_size = model.config.vision_config.image_size
for i in range(min(100, len(images))):
  # Resize the frame to the model input size and scale the pixel values to [0, 1]
  image = mixin.resize(images[i], image_size)
  input_image = np.asarray(image).astype(np.float32) / 255.0
  # Get the best-matching text query and its logit for each predicted box
  logits = torch.max(outputs[i]["logits"][0], dim=-1)
  scores = torch.sigmoid(logits.values).cpu().detach().numpy()
  # Get prediction labels and bounding boxes
  labels = logits.indices.cpu().detach().numpy()
  boxes = outputs[i]["pred_boxes"][0].cpu().detach().numpy()
  plot_predictions(input_image, text_queries, scores, boxes, labels)

## Detected Frames
Collect the frames for which the model detected a building, display them and extract their timecodes.
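
The selection logic can be sketched without the model. For each predicted box, the highest logit over all text queries is passed through a sigmoid, and a frame counts as detected if at least one box reaches the threshold. The nested lists below are made-up stand-ins for the model's logits, and `frame_has_building` is a hypothetical helper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def frame_has_building(box_logits, score_threshold=0.25):
    """box_logits: one list per predicted box, holding one logit per text query."""
    for query_logits in box_logits:
        # Score of the best-matching text query for this box
        best_score = sigmoid(max(query_logits))
        if best_score >= score_threshold:
            return True
    return False

# Made-up logits for two frames, each with two boxes and two text queries
frames = [
    [[-4.0, -3.5], [-5.0, -6.0]],  # no box reaches the threshold
    [[-4.0, 0.3], [-5.0, -6.0]],   # the first box matches the second query
]
detected = [i for i, logits in enumerate(frames) if frame_has_building(logits)]
print(detected)
```

Each frame index appears at most once in `detected`, no matter how many boxes pass the threshold.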

In [None]:
# Threshold to eliminate low probability predictions
score_threshold = 0.25

In [None]:
# Indices of the frames in which a building was detected
building_frames_index = []

# Loop over the predictions for each frame
for i in range(len(video_frames)):
  # Get the highest logit over all text queries for each predicted box
  logits = torch.max(outputs[i]["logits"][0], dim=-1)
  scores = torch.sigmoid(logits.values).cpu().detach().numpy()
  # Record the frame once if at least one box reaches the threshold
  if (scores >= score_threshold).any():
    building_frames_index.append(i)
                

In [None]:
building_frames = []

for frame in building_frames_index:
  building_frames.append(video_frames[frame])

Display all frames in which the building detector detected anything.

In [None]:
# Look at all the frames in which buildings are detected
for frame in building_frames:
  display(frame)

Create a new list with the time codes of all frames in which buildings were detected.

In [None]:
building_video_time_codes = []

for frame in building_frames_index:
  building_video_time_codes.append(video_time_codes[frame])

# Landmark detection in subtitles

## Set up environment to detect landmarks in subtitles



In [None]:
# Install the landmarkNER model from huggingface
!pip install https://huggingface.co/constantinSch/LandmarkNER/resolve/main/de_pipeline-any-py3-none-any.whl

import spacy
nlp = spacy.load("de_pipeline")

In [None]:
# Download large German model to use its sentencizer component
!python -m spacy download de_core_news_lg

Install all necessary libraries and download mGENRE

In [None]:
# Install transformers (unless you already installed it for OWL-ViT)
# !pip install transformers==4.22
# Install fairseq (no "cd" needed: each "!" command runs in its own shell)
!git clone --branch fixing_prefix_allowed_tokens_fn https://github.com/nicola-decao/fairseq
!pip install --editable ./fairseq/
# Install GENRE and its dependencies
!rm -rf GENRE
!git clone https://github.com/facebookresearch/GENRE.git
!cd GENRE && pip install ./
!pip install sentencepiece marisa_trie
# Import the tokenizer and model classes and load the trie of valid Wikipedia titles
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pickle
import re
from genre.trie import MarisaTrie
from huggingface_hub import hf_hub_download
file_path_marisa_trie = hf_hub_download("facebook/mgenre-wiki", "titles_lang_all105_marisa_trie_with_redirect.pkl")
with open(file_path_marisa_trie, "rb") as f:
    trie = pickle.load(f)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

## Analyse subtitles
Upload the associated subtitles as a .txt file. Time codes need to be given in milliseconds, and each time code line needs to precede its subtitle line. The start and end of a time code range need to be separated by exactly one "-". 
In the example, the time codes do not start at zero, so the initial time code offset has to be subtracted.
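
The expected line format and the offset subtraction can be sketched as follows. The sketch assumes that the start value, the "-", and the end value are separated by whitespace, which is what the parsing code in this notebook relies on. `parse_time_code_line` is a hypothetical helper and the example values are invented.

```python
def parse_time_code_line(line, offset_ms=0):
    """Parse 'START - END' (milliseconds) and shift both values by offset_ms."""
    tokens = line.split()
    if len(tokens) != 3 or tokens[1] != "-":
        raise ValueError(f"Unexpected time code line: {line!r}")
    start_ms, end_ms = int(tokens[0]), int(tokens[2])
    return start_ms - offset_ms, end_ms - offset_ms

# Invented example: the first line of the file defines the offset
first = parse_time_code_line("36000000 - 36002500")
offset = first[0]
print(parse_time_code_line("36004000 - 36007200", offset_ms=offset))
```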

In [None]:
# Read in all text files
import glob
path = '/content/*.txt' 
text_files=glob.glob(path) 

Convert each text file into a dataframe that has the start time code in column 1, the end time code in column 2 and the subtitle text in column 3.

In [None]:
import pandas as pd
import re

time_code_start = []
time_code_end = []
subtitle_text = []

# Loop over the text files in content folder
for file in text_files:
  with open(file, "r") as f:
    lines = f.readlines()
    # Separate timecodes from subtitle text: timecode lines and text lines alternate
    for line1,line2 in zip(lines[::2],lines[1::2]):
      # The first timecode line provides the initial timecode offset and is stored unchanged
      if len(time_code_start) == 0:
        time_code_start.append(line1.split()[0])
        time_code_end.append(line1.split()[2])
      else:
        # Take the first token as start timecode, convert it to an integer and subtract the offset
        time_code_start.append(int(line1.split()[0])-int(time_code_start[0]))
        # Take the third token as end timecode and subtract the offset
        time_code_end.append(int(line1.split()[2])-int(time_code_start[0]))
      # Remove text enclosed in asterisks (e.g. music cues) from the subtitle line
      regex_tc = r"\*.*\*"
      line_2_2 = re.sub(regex_tc, " ", line2, count=0, flags=re.MULTILINE)
      # Collapse all whitespace (tab, newline, return, formfeed) and add the line to subtitle_text
      subtitle_text.append(" ".join(line_2_2.split()))

In [None]:
# Combine time_code_start, end and the associated text in one list of dictionaries 
text_list_of_dicts = []
if len(time_code_start) == len(time_code_end) == len(subtitle_text):
  for i in range(len(time_code_start)):
    text_list_of_dicts.append({"1_Start_TimeCode": time_code_start[i], "2_End_Timecode": time_code_end[i], "3_Text": subtitle_text[i]})    
# Convert to DataFrame
text_df = pd.DataFrame(text_list_of_dicts)

In [None]:
# Remove the offset row
text_df  = text_df.drop(0)

In [None]:
# Look at the entire data frame
text_df

Create text blocks of 10 sentences each for analysis with the LandmarkNER. 
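
The chunking idea can be sketched on a plain list of sentences. `chunk_sentences` is a hypothetical helper; unlike a strict "exactly 10 sentences" rule, it also keeps the last, shorter chunk so that no sentences are lost.

```python
def chunk_sentences(sentences, chunk_size=10):
    """Join consecutive sentences into chunks of at most chunk_size sentences."""
    chunks = []
    for start in range(0, len(sentences), chunk_size):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
    return chunks

# 23 invented sentences give two full chunks and one chunk with three sentences
sentences = [f"Satz {i}." for i in range(1, 24)]
chunks = chunk_sentences(sentences)
print(len(chunks))
print(chunks[-1])
```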

In [None]:
# Take all of the subtitle text
full_text = ' '.join(subtitle_text)

import spacy

# The LandmarkNER pipeline has no sentencizer component, so the pretrained spaCy model is used to split sentences
nlp_sent = spacy.load("de_core_news_lg")
doc_sent = nlp_sent(full_text)

# Group the sentences into chunks of 10 and append them to subtitle_chunks
subtitle_chunks = []
chunk_nr = 0
chunk_text = "" 
for sent in doc_sent.sents:
  chunk_text += (" " + sent.text)
  chunk_nr += 1
  if chunk_nr == 10:
    subtitle_chunks.append(chunk_text)
    chunk_text = ""
    chunk_nr = 0
# Keep the remaining sentences as a final, shorter chunk
if chunk_text:
  subtitle_chunks.append(chunk_text)

In [None]:
# Look at the first chunk
subtitle_chunks[0]

Loop through subtitle_chunks and analyze each chunk for landmarks. If landmarks are found, disambiguate them with mGENRE and store the detected landmark, its Wikipedia title and the text of the chunk.
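
The two string operations around the mGENRE call can be illustrated in isolation: marking one entity mention in the text, and removing the language suffix from a generated title. Both helpers are hypothetical, and the example sentence and title are invented.

```python
import re

def mark_mention(text, start_char, end_char):
    """Wrap exactly one mention, given by its character offsets, in [START]/[END] markers."""
    return text[:start_char] + "[START]" + text[start_char:end_char] + "[END]" + text[end_char:]

def strip_language_suffix(title):
    """Remove a trailing language tag such as ' >> de' from a generated title."""
    return re.sub(r" >> [a-z]+$", "", title)

chunk = "Wir besuchen heute Schloss Neuschwanstein."
start = chunk.index("Schloss Neuschwanstein")
end = start + len("Schloss Neuschwanstein")
print(mark_mention(chunk, start, end))
print(strip_language_suffix("Schloss Neuschwanstein >> de"))
```

Working with character offsets marks only the mention in question, even if the same name occurs several times in a chunk.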

In [None]:
# Collect the LandmarkNER entities, their Wikipedia titles and the associated chunks
landmark_titles = []

# Matches the language suffix that mGENRE appends to each title (e.g. " >> de")
re_lang = r" >> [a-z]+"

for chunk in subtitle_chunks:  
    # Create a spacy Doc object
    doc = nlp(chunk)
    # Reshape detected landmarks for mGENRE: one copy of the chunk per entity,
    # with start and end markers around that entity mention only
    reshaped_text = []
    for ent in doc.ents:
      reshaped_text.append(doc.text[:ent.start_char] + '[START]' + ent.text + '[END]' + doc.text[ent.end_char:])
    generated_text = []
    # Take each reshaped chunk and create the mGENRE model output
    for marked_text in reshaped_text:
      generated = model.generate(
          **tokenizer(marked_text, return_tensors="pt"),
          num_beams=5,
          num_return_sequences=5,
          # Use constrained beam search to only return valid Wikipedia titles
          prefix_allowed_tokens_fn=lambda batch_id, token_ids: trie.get(token_ids.tolist()),
          )
      generated_text.append(tokenizer.batch_decode(generated, skip_special_tokens=True))
    # generated_text, doc.ents and reshaped_text have the same length and order
    for i in range(len(generated_text)):
      # Take the prediction with the highest score
      gen_text = generated_text[i][0]
      # Remove the language suffix
      shortened_text = re.sub(re_lang, "", gen_text, count=0, flags=re.MULTILINE)
      # Add the entity, its Wikipedia title and the current chunk to landmark_titles
      if shortened_text:
        landmark_titles.append([doc.ents[i], shortened_text, chunk])

In [None]:
# Convert to dataframe for visual inspection
landmark_titles_df = pd.DataFrame(landmark_titles, columns = ['LandmarkNER', 'Disambiguierung', 'Chunk'])

In [None]:
landmark_titles_df

Create a list with any detected landmarks and the associated timecodes.


In [None]:
# This preliminary code checks whether a subtitle line appears in one of the chunks. 
# If the output of the LandmarkNER also appears in this subtitle line 
# the corresponding output of mGenre and the start time code and end time code of the subtitle line are 
# added to the landmark_titles_time_codes list

landmark_titles_time_codes = []

for i in range(len(landmark_titles)):
  for j in range(len(subtitle_text)):
    if subtitle_text[j] in landmark_titles[i][2] and landmark_titles[i][0].text in subtitle_text[j]:
      landmark_titles_time_codes.append([landmark_titles[i][1], int(time_code_start[j]), int(time_code_end[j])])

In [None]:
landmark_titles_time_codes[:10] 

In [None]:
len(landmark_titles_time_codes)

# Connect building detector and LandmarkNER
First, we need to find out whether there are overlapping time codes at which a building was detected in the video and a landmark was detected in the subtitles.
In the first (commented-out) variant below, 2000 ms are subtracted from the start and added to the end of each landmark time range, to account for the fact that a landmark may appear on screen slightly earlier or later than it is mentioned in the subtitles.
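
The overlap test itself is a simple range check with an optional tolerance. A minimal sketch, with `within_range` as a hypothetical helper and invented time codes:

```python
def within_range(frame_ms, start_ms, end_ms, tolerance_ms=2000):
    """True if the frame time code lies in [start - tolerance, end + tolerance]."""
    return start_ms - tolerance_ms <= frame_ms <= end_ms + tolerance_ms

# A landmark is mentioned between 10000 ms and 12000 ms
print(within_range(9500, 10000, 12000))                  # inside the 2000 ms tolerance
print(within_range(7500, 10000, 12000))                  # too early even with tolerance
print(within_range(9500, 10000, 12000, tolerance_ms=0))  # strict comparison, no tolerance
```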

The Viam-ID is extracted from the name of the subtitle file. 
All the images are saved in the folder created in the next cell. 
In this folder, there is a subfolder for each landmark title. 
The filename consists of the Viam-ID of the video, the title of the landmark and the time code of the frame.
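
The resulting folder layout can be sketched as below. `frame_output_path` is a hypothetical helper, and the ID, title and time code are invented; the order of the filename parts follows the saving code in this notebook.

```python
import os

def frame_output_path(root, landmark_title, video_id, time_code_ms):
    """Build <root>/<landmark title>/<video id>_<landmark title>_<time code>.jpg."""
    filename = f"{video_id}_{landmark_title}_{time_code_ms}.jpg"
    return os.path.join(root, landmark_title, filename)

path = frame_output_path("dataset_lm", "Schloss Neuschwanstein", "12345", 1500.0)
print(path)
```

Titles that contain a "/" would create unintended subfolders, which is why the saving code skips them.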

In [None]:
!mkdir dataset_lm

In [None]:
# Take the file path of the subtitle file and extract the Viam-ID (everything between "/content/" and the first "_" or ".")
regex = r"^/content/([^._]*).*"
viam_id = re.search(regex, text_files[0]).group(1)

## Connect subtitles with video frames
Include a 2000 ms overlap

In [None]:
# import os

# # loop through all the time codes at which buildings are detected
# for i in range(len(building_video_time_codes)):
#   # Loop through all time codes at which landmarks are detected
#   for j in range(len(landmark_titles_time_codes)):
#     # Check that the building time code lies within the landmark time range +/- 2000 ms and that "/" is not in the landmark name
#     if  (int(landmark_titles_time_codes[j][1])-2000 <= int(building_video_time_codes[i]) <= int(landmark_titles_time_codes[j][2])+2000) and ("/" not in landmark_titles_time_codes[j][0]):
#       # Check if the folder for the current landmark exists and if not create it
#       if not os.path.exists("/content/dataset_lm/" + landmark_titles_time_codes[j][0]):
#         os.makedirs("/content/dataset_lm/" + landmark_titles_time_codes[j][0])
#       # For overlapping time codes, save the associated frame as <viam-id>_<landmark title>_<time code>.jpg
#       building_frames[i].save("/content/dataset_lm/" + landmark_titles_time_codes[j][0] + "/" + viam_id + "_" + landmark_titles_time_codes[j][0]+ "_"  + str(building_video_time_codes[i]) + ".jpg")

## Connect 'Bildinhalt' (image content) with video frames
This variant uses no tolerance around the landmark time codes.

In [None]:
import os

# Loop through all the time codes at which buildings are detected
for i in range(len(building_video_time_codes)):
  # Loop through all time codes at which landmarks are detected
  for j in range(len(landmark_titles_time_codes)):
    title, start, end = landmark_titles_time_codes[j]
    # Check that the building time code lies within the landmark time range (no tolerance)
    # and that the landmark title contains no "/", which would break the file path
    if (int(start) <= int(building_video_time_codes[i]) <= int(end)) and ("/" not in title):
      # Create the folder for the current landmark if it does not exist yet
      folder = os.path.join("/content/dataset_lm", title)
      os.makedirs(folder, exist_ok=True)
      # Save the frame as <viam-id>_<landmark title>_<time code>.jpg
      building_frames[i].save(os.path.join(folder, viam_id + "_" + title + "_" + str(building_video_time_codes[i]) + ".jpg"))

## Zip the folder with the frames and download it

In [None]:
!zip -r /content/file.zip /content/dataset_lm/

from google.colab import files
files.download("/content/file.zip")

Currently, you can only analyse one video at a time with this notebook. To analyse another video, first delete the old video and subtitle file, then rerun the notebook from "Load pre-trained model and processor", because the `model` variable is overwritten when mGENRE is loaded.