<a href="https://colab.research.google.com/github/HiddenBeginner/notebooks/blob/master/video-data-analysis/Tutorial_on_Video_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Code Tutorial on Video Analysis

**Video Credits & Usage**
- **Source**: The original video used in this tutorial can be found at [https://databrary.org/volume/235](https://databrary.org/volume/235), which is publicly available.
- **Note on Implementation:** The original video is longer than 14 minutes, so this tutorial uses only its last 60 seconds to keep processing fast.
- **Availability**: ~~The `wget` code below will download the video clip that I've uploaded to Google Drive, and the URL will expire after this tutorial.~~ To reproduce this tutorial later, please download the original video from Databrary.

### Loading and exploring a video file using `moviepy`

There are two popular packages for handling video data, `moviepy` and `opencv-python` (`cv2`).
- `moviepy` is designed for high-level video editing. It is optimized for video-level composition and effects (e.g., fades, transitions, and overlays). However, `moviepy` is less efficient for frame-by-frame analysis and converting/writing videos.
- `opencv-python` is optimized for low-level, frame-by-frame processing and computer vision tasks. However, it does not support audio processing and high-level video editing.

In this tutorial, we will use `moviepy`.

In [None]:
raw_video_fpath = './0807.mp4'
short_video_fpath = './video.mp4'

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from moviepy.editor import VideoFileClip, CompositeVideoClip, TextClip

In [None]:
# The commented-out code below trims the raw video to a short clip

# raw_clip = VideoFileClip(raw_video_fpath)
# # Take the last 60 seconds: from `t_start=-60.0` to the end of the video (`t_end=None`)
# clip = raw_clip.subclip(t_start=-60.0, t_end=None)
# clip.write_videofile(short_video_fpath)

clip = VideoFileClip(short_video_fpath)

print("Frame size (width, height): ", clip.size)
print("Duration (sec): ", clip.duration)
print("Frames per second (FPS): ", clip.fps)
print("Total number of frames: ", int(clip.duration * clip.fps) + 1)

In [None]:
clip.ipython_display(maxduration=60.1, width=360, rd_kwargs={'logger': None})

- The duration and FPS printed above are parsed from the video file's **header**, which may occasionally disagree with the frames actually present in the video stream.
- It is therefore recommended to store the actual timestamp of each frame, called the Presentation Time Stamp (PTS), rather than relying on values computed from the duration and FPS.
- Frames are usually decoded sequentially. A video file saves space by storing only the changes between frames, and because of this inter-frame compression, a decoder cannot jump to an arbitrary frame without first decoding the frames it depends on.
- For the same reason, the decoded frames can occupy far more memory than the file itself: about 1.74 GB for our clip, even though the video file is only 5.8 MB.
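
The gap between file size and decoded size follows from simple arithmetic: each decoded RGB frame costs `width * height * 3` bytes when stored as `uint8`. The sketch below assumes a 720x480 clip at 30 FPS lasting 60 seconds. These values are an assumption chosen to be consistent with the 1.74 GB figure, not values read from the file, and `decoded_size_gb` is a hypothetical helper.

```python
def decoded_size_gb(width, height, n_frames, channels=3, bytes_per_value=1):
    """Approximate memory needed to hold every decoded frame as uint8 RGB."""
    total_bytes = width * height * channels * bytes_per_value * n_frames
    return total_bytes / 1024 ** 3


# Assumed values (not taken from the video header): 720x480, 30 FPS, 60 s
n_frames = int(60 * 30) + 1  # same frame-count formula as in the cell above
size_gb = decoded_size_gb(720, 480, n_frames)
print(f"{size_gb:.2f} GB")
```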

In [None]:
frames = []
timestamps = []
for t, frame in clip.iter_frames(with_times=True):
    frames.append(frame)  # Not recommended for large videos: every decoded frame stays in memory
    timestamps.append(t)
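
When the frames do not all need to stay in memory, a per-frame summary can be computed on the fly and the frame itself discarded. The following is a minimal sketch of that pattern. `fake_frames` is a hypothetical stand-in for `clip.iter_frames(with_times=True)` so that the example runs without a video file, and mean brightness is only an illustrative choice of summary.

```python
import numpy as np


def fake_frames(n_frames=5, height=4, width=6, fps=30.0):
    """Hypothetical stand-in for clip.iter_frames(with_times=True)."""
    rng = np.random.default_rng(0)
    for i in range(n_frames):
        frame = rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
        yield i / fps, frame


def mean_brightness_per_frame(frame_iter):
    """Keep one float per frame instead of the full pixel array."""
    times, brightness = [], []
    for t, frame in frame_iter:
        times.append(t)
        brightness.append(float(frame.mean()))  # only this scalar is kept
    return np.array(times), np.array(brightness)


times, brightness = mean_brightness_per_frame(fake_frames())
print(times.shape, brightness.shape)
```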

In [None]:
print("The total size of recovered frames: {:.2f} GB".format(len(frames) * frames[0].nbytes / 1024 / 1024 / 1024), '\n')
print("The first 10 timestamps: ", timestamps[:10], '\n')
print("Time interval between two frames (sec): ", timestamps[1] - timestamps[0])
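
Because header values may differ from the decoded stream, it can help to check how regular a set of stored timestamps is. The helper below is a sketch that works on any sequence of timestamps in seconds. `summarize_intervals` is a hypothetical name, not part of `moviepy`, and the input is synthetic.

```python
import numpy as np


def summarize_intervals(timestamps, tolerance=1e-3):
    """Report the spread of inter-frame intervals for timestamps in seconds."""
    intervals = np.diff(np.asarray(timestamps, dtype=float))
    median = float(np.median(intervals))
    # An interval is flagged if it differs from the median by more than `tolerance`
    irregular = np.flatnonzero(np.abs(intervals - median) > tolerance)
    return {
        "median_interval": median,
        "effective_fps": 1.0 / median,
        "irregular_after_frame": irregular.tolist(),
    }


# Synthetic example: 30 FPS with one dropped frame after index 3
ts = [0.0, 1 / 30, 2 / 30, 3 / 30, 5 / 30, 6 / 30]
summary = summarize_intervals(ts)
print(summary)
```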

In [None]:
# Draw four randomly selected frames
idxs = np.random.randint(0, len(frames), size=4)
idxs = sorted(idxs)

fig, axes = plt.subplots(figsize=(7, 5), nrows=2, ncols=2)
for i, ax in enumerate(axes.flatten()):
    idx = idxs[i]
    ax.imshow(frames[idx])
    ax.axis('off')
    ax.text(10, 30, f't={timestamps[idx]:.2f} sec', fontdict={'color': 'red'})
fig.tight_layout()
plt.show()

---

## Transcription: `Distil-Whisper`
- HuggingFace's `Distil-Whisper` is reported to be six times faster and 49% smaller than OpenAI's `Whisper`.
- However, it supports only English.
- Source code: [https://github.com/huggingface/distil-whisper](https://github.com/huggingface/distil-whisper)

### Extracting the audio from the video

In [None]:
audio = clip.audio

print("Duration (sec): ", audio.duration)
print("Sampling rate (Hz): ", audio.fps)

In [None]:
raw_audio = list(audio.iter_frames())  # Extracting raw audio signal
raw_audio = np.array(raw_audio)

print("The shape of audio signal: ", raw_audio.shape)  # Two columns, i.e., stereo audio
mono_audio = raw_audio.mean(axis=1)  # Downmix to mono by averaging the two channels

In [None]:
plt.figure(figsize=(4, 3))
plt.plot(mono_audio)
plt.grid()
plt.show()
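
The plot above uses the sample index on the x-axis. Dividing the index by the sampling rate converts it to seconds, which makes the waveform easier to align with video timestamps. A minimal sketch with a synthetic stereo signal follows. The 44,100 Hz rate is an assumed value; in this notebook it would come from `audio.fps`.

```python
import numpy as np

sampling_rate = 44100  # assumed; use audio.fps for the real clip
duration = 2.0         # seconds of synthetic audio

# Synthetic stereo signal with shape (n_samples, 2), mimicking `raw_audio`
n_samples = int(sampling_rate * duration)
t = np.arange(n_samples) / sampling_rate  # time of each sample in seconds
stereo = np.stack(
    [np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 660 * t)], axis=1
)

mono = stereo.mean(axis=1)  # average the two channels
print(mono.shape, t[-1])
```

Calling `plt.plot(t, mono)` instead of `plt.plot(mono)` then draws the waveform against time in seconds.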

### Load `Distil-Whisper` following the instructions in the [GitHub repository](https://github.com/huggingface/distil-whisper?tab=readme-ov-file#chunked-long-form)
- The code below loads `Distil-Whisper`. It looks complicated, but it is copied directly from those instructions.

In [None]:
!pip install transformers==4.49.0

In [None]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=10,
    torch_dtype=torch_dtype,
    device=device,
)

In [None]:
results = pipe({"raw": mono_audio, "sampling_rate": audio.fps}, return_timestamps=True)
for chunk in results['chunks']:
    print(chunk)
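
Each chunk is a dictionary with a `timestamp` tuple `(start, end)` in seconds and a `text` string. The sketch below formats such chunks as caption lines. `sample_chunks` is made-up data for illustration, and `format_time` and `format_chunk` are hypothetical helpers. The end time can be `None` in some cases, for example when the model does not emit a closing timestamp, so the helper takes a fallback value.

```python
def format_time(seconds):
    """Format a time in seconds as MM:SS.ss."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes):02d}:{secs:05.2f}"


def format_chunk(chunk, fallback_end):
    """Turn one transcription chunk into a single caption line."""
    start, end = chunk["timestamp"]
    if end is None:  # no closing timestamp was predicted
        end = fallback_end
    return f"[{format_time(start)} - {format_time(end)}] {chunk['text'].strip()}"


# Made-up chunks in the same layout as results['chunks']
sample_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (58.4, None), "text": " Goodbye."},
]
lines = [format_chunk(c, fallback_end=60.0) for c in sample_chunks]
for line in lines:
    print(line)
```

In this notebook, `clip.duration` would be a natural choice for `fallback_end`.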

---

## Visual Annotations using Ultralytics YOLO
The YOLO (You Only Look Once) family of models is one of the most popular choices for real-time computer vision tasks such as object detection, segmentation, and pose estimation. Many practitioners treat YOLO as their go-to model family because it is fast, lightweight, and easy to use while still delivering competitive accuracy.


In [None]:
!pip install -U ultralytics

In [None]:
from PIL import Image
from ultralytics import YOLO

frame = frames[0]
img = Image.fromarray(frame)

### Object detection

In [None]:
model = YOLO("yolo26n.pt")  # Load a pretrained model
results = model(img)
results[0].show()

- The `results` variable contains the predictions for the given list of images. Since we fed only one image, `results` contains one result.
- The `result` variable below holds the prediction for our image, i.e., all bounding boxes detected in it. In our case, there are two bounding boxes.
- The pretrained YOLO detection models are trained on the COCO dataset; the supported classes are printed at the end of the next cell.
  

In [None]:
# Prediction format
result = results[0].boxes.cpu().numpy()
print("Predicted bounding boxes (center x, center y, width, height) in pixels: \n", result.xywh)
print("\nPredicted classes: ", result.cls)
print("\nSupported classes: ", results[0].names)
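
In Ultralytics results, `xywh` stores the box center `(x, y)` followed by the width and height, whereas many drawing utilities expect corner coordinates. The conversion is sketched below with made-up box values. `xywh_to_xyxy` is a hypothetical helper; the `boxes` object also exposes the corner form directly as `xyxy`.

```python
import numpy as np


def xywh_to_xyxy(boxes):
    """Convert (center_x, center_y, width, height) rows to (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    half = boxes[:, 2:] / 2.0  # half width and half height
    return np.concatenate([boxes[:, :2] - half, boxes[:, :2] + half], axis=1)


# Made-up box: centered at (100, 80), 40 px wide and 60 px tall
example = np.array([[100.0, 80.0, 40.0, 60.0]])
corners = xywh_to_xyxy(example)
print(corners)
```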

### Segmentation
Segmentation predicts, for each detected object, a mask of the pixels that belong to it, in addition to its bounding box.

In [None]:
model = YOLO("yolo26n-seg.pt")
results = model(img)
results[0].show()
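
A segmentation mask can be treated as a boolean image, which makes simple measurements straightforward, such as the fraction of the frame an object covers. The sketch below uses a synthetic mask and a hypothetical helper, `mask_area_fraction`. With Ultralytics, the masks are assumed to be available through `results[0].masks`.

```python
import numpy as np


def mask_area_fraction(mask):
    """Fraction of image pixels covered by a boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    return mask.sum() / mask.size


# Synthetic 480x720 mask with a 100x100 pixel square marked as the object
mask = np.zeros((480, 720), dtype=bool)
mask[100:200, 300:400] = True
fraction = mask_area_fraction(mask)
print(f"{fraction:.4f}")
```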

### Pose estimation
Pose estimation predicts the coordinates (in pixels) of the following 17 keypoints:
- **Face (5)**: Nose, Left Eye, Right Eye, Left Ear, Right Ear
- **Upper body (6)**: Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist
- **Lower body (6)**: Left Hip, Right Hip, Left Knee, Right Knee, Left Ankle, Right Ankle

In [None]:
model = YOLO("yolo26n-pose.pt")
results = model(img)
results[0].show()
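
The keypoints are returned in a fixed order, so a name lookup makes the output easier to use. The sketch below assumes the standard COCO ordering listed above and a keypoint array of shape `(17, 2)` for one person. In Ultralytics this array is assumed to come from `results[0].keypoints.xy`; the coordinates here are synthetic, and `keypoints_to_dict` is a hypothetical helper.

```python
import numpy as np

# COCO keypoint order (assumed to match the model output)
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def keypoints_to_dict(xy):
    """Map a (17, 2) array of pixel coordinates to {name: (x, y)}."""
    xy = np.asarray(xy, dtype=float)
    if xy.shape != (len(KEYPOINT_NAMES), 2):
        raise ValueError(f"expected shape (17, 2), got {xy.shape}")
    return {name: (float(x), float(y)) for name, (x, y) in zip(KEYPOINT_NAMES, xy)}


# Synthetic coordinates: row i holds (2*i, 2*i + 1)
xy = np.arange(34, dtype=float).reshape(17, 2)
pose = keypoints_to_dict(xy)
print(pose["left_wrist"])
```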

**Example: Drawing pose estimates on every frame of a 10-second sub-clip**

In [None]:
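def func(arr):
    # Run pose estimation on a single RGB frame given as a NumPy array
    result = model([Image.fromarray(arr)], verbose=False)[0]
    img = result.plot()  # Annotated image in BGR channel order
    img = img[:, :, [2, 1, 0]]  # BGR to RGB

    return img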
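
`result.plot()` returns an array in BGR channel order (the OpenCV convention), while `moviepy` and `matplotlib` expect RGB, hence the channel swap in `func`. A small check of that indexing on a synthetic one-pixel image:

```python
import numpy as np

# One pixel that is pure blue when read in BGR order
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

rgb = bgr[:, :, [2, 1, 0]]  # reorder channels: B, G, R -> R, G, B
rgb_alt = bgr[..., ::-1]    # equivalent: reverse the last axis

print(rgb.tolist(), np.array_equal(rgb, rgb_alt))
```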

In [None]:
sub_clip = clip.subclip(t_start=0, t_end=10.0)
new_clip = sub_clip.fl_image(func)
new_clip.ipython_display(width=360, rd_kwargs={'logger': None})

### Gaze detection using L2CS-Net

In [None]:
!wget 'https://drive.usercontent.google.com/download?id=18S956r4jnHtSeT8z8t3z8AoJZjVnNqPJ&export=download&authuser=1&confirm=t' -O L2CSNet_gaze360.pkl
!pip install git+https://github.com/edavalosanaya/L2CS-Net.git@main

In [None]:
from l2cs import Pipeline, render

gaze_pipeline = Pipeline(
    './L2CSNet_gaze360.pkl',
    arch='ResNet50',
    device=torch.device('cuda')  # or torch.device('cpu') if no GPU is available
)


In [None]:
results = gaze_pipeline.step(frame)
frame = render(np.copy(frame), results)

plt.figure(figsize=(4, 3))
plt.imshow(frame)
plt.axis('off')
plt.show()

In [None]:
def func(arr):
    result = gaze_pipeline.step(arr)
    img = render(np.copy(arr), result)

    return img

In [None]:
sub_clip = clip.subclip(t_start=0, t_end=10.0)
new_clip = sub_clip.fl_image(func)
new_clip.ipython_display(maxduration=60.1, width=360, rd_kwargs={'logger': None})

In [None]:
clip.close()
new_clip.close()
sub_clip.close()
audio.close()