<a href="https://colab.research.google.com/github/AmaruEscalante/VideoGPT/blob/master/Using_VideoGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using VideoGPT
This notebook demonstrates how to use VideoGPT and its pretrained models. Make sure the runtime is a GPU instance: **Change Runtime Type -> GPU**

## Installation
First, we clone the repository and install the necessary packages.

In [1]:
! git clone https://github.com/amaruescalante/VideoGPT.git

Cloning into 'VideoGPT'...
remote: Enumerating objects: 380, done.
remote: Counting objects: 100% (115/115), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 380 (delta 54), reused 68 (delta 41), pack-reused 265
Receiving objects: 100% (380/380), 3.97 MiB | 17.31 MiB/s, done.
Resolving deltas: 100% (211/211), done.


In [2]:
%cd VideoGPT

/content/VideoGPT


In [6]:
! pip install git+https://github.com/amaruescalante/VideoGPT.git
! pip install scikit-video av
! pip install --upgrade --no-cache-dir gdown

Collecting git+https://github.com/amaruescalante/VideoGPT.git
  Cloning https://github.com/amaruescalante/VideoGPT.git to /tmp/pip-req-build-v7i95wzj
  Running command git clone --filter=blob:none --quiet https://github.com/amaruescalante/VideoGPT.git /tmp/pip-req-build-v7i95wzj
  Resolved https://github.com/amaruescalante/VideoGPT.git to commit 93a16187cd96016c3fa34f7b3635f35a16efe1d0
  Preparing metadata (setup.py) ... done


In [None]:
!sh scripts/preprocess/msrvtt/create_msrvtt_dataset.sh datasets/msrvtt

In [None]:
# Train VQ-VAE
! python scripts/train_vqvae.py --data_path datasets/msrvtt --accelerator gpu --batch_size 16 --gpus 1 --auto_select_gpus true

In [None]:
# Train VideoGPT
! python scripts/train_videogpt.py --data_path datasets/msrvtt --accelerator gpu --batch_size 16 --gpus 1 --auto_select_gpus true

In [7]:
%matplotlib inline

from matplotlib import pyplot as plt
from matplotlib import animation
from IPython.display import HTML

import os
import torch
from torchvision.io import read_video, read_video_timestamps

from videogpt import download, load_vqvae, load_videogpt
from videogpt.data import preprocess

VIDEOS = {
    'breakdancing': '1OZBnG235-J9LgB_qHv-waHZ4tjofiDgj',
    'bear': '16nIaqq2vbPh-WMo_7hs9feVSe0jWVXLF',
    'jaywalking': '1UxKCVrbyXhvMz_H7dI4w5hjPpRGCAApy',
    'cartoon': '1ONcTMSEuGuLYIDbX-KeFqd390vbTIH9d'
}

ROOT = 'pretrained_models'

## Downloading a Pretrained VQ-VAE
There are four pretrained models available: `bair_stride4x2x2`, `ucf101_stride4x4x4`, `kinetics_stride4x4x4`, and `kinetics_stride2x4x4`. BAIR was trained on 64 x 64 video, and the rest on 128 x 128. The `stride` component represents the THW downsampling the VQ-VAE performs on the video tensor.
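
To make the `stride` naming concrete, the sketch below derives the size of the latent code grid from a model name. It assumes the encoder downsamples each axis by exactly the integer factor in the name; the helpers `parse_stride` and `latent_shape` are hypothetical and not part of the `videogpt` package.

```python
def parse_stride(model_name):
    """Extract the (T, H, W) stride from a name such as 'kinetics_stride2x4x4'."""
    stride_text = model_name.split("stride")[-1]
    return tuple(int(s) for s in stride_text.split("x"))


def latent_shape(model_name, sequence_length, resolution):
    """Latent grid size, assuming plain integer downsampling on each axis."""
    t, h, w = parse_stride(model_name)
    return (sequence_length // t, resolution // h, resolution // w)


print(latent_shape("kinetics_stride2x4x4", 16, 128))  # (8, 32, 32)
print(latent_shape("bair_stride4x2x2", 16, 64))       # (4, 32, 32)
```

Under this assumption, a 16-frame 128 x 128 clip encoded by `kinetics_stride2x4x4` would yield an 8 x 32 x 32 grid of codes.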

In [8]:
device = torch.device('cuda')
vqvae = load_vqvae('kinetics_stride2x4x4', device=device, root=ROOT).to(device)

Downloading...
From (uriginal): https://drive.google.com/uc?id=1jvtjjtrtE4cy6pl7DK_zWFEPY3RZt2pB
From (redirected): https://drive.google.com/uc?id=1jvtjjtrtE4cy6pl7DK_zWFEPY3RZt2pB&confirm=t&uuid=c1f7110a-9030-40b0-aa4c-123e0ddf33bd
To: /content/VideoGPT/pretrained_models/kinetics_stride2x4x4
100%|██████████| 258M/258M [00:04<00:00, 59.7MB/s]
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.1.6 to v1.9.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file pretrained_models/kinetics_stride2x4x4`


## Video Loading and Preprocessing
The code below downloads, loads, and preprocesses a given `mp4` file.

In [9]:
video_name = 'jaywalking'
# `resolution` must be divisible by the encoder image stride
# `sequence_length` must be divisible by the encoder temporal stride
resolution, sequence_length = vqvae.args.resolution, 16

video_filename = download(VIDEOS[video_name], f'{video_name}.mp4')
pts = read_video_timestamps(video_filename, pts_unit='sec')[0]
video = read_video(video_filename, pts_unit='sec', start_pts=pts[0], end_pts=pts[sequence_length - 1])[0]
video = preprocess(video, resolution, sequence_length).unsqueeze(0).to(device)

Downloading...
From: https://drive.google.com/uc?id=1UxKCVrbyXhvMz_H7dI4w5hjPpRGCAApy
To: /root/.cache/videogpt/jaywalking.mp4
100%|██████████| 3.29M/3.29M [00:00<00:00, 203MB/s]
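
The two divisibility constraints noted in the comments of the loading cell can be checked before any video is read. The following is a minimal sketch; `check_clip_shape` is a hypothetical helper, and the stride tuple is passed in by hand rather than read from the model.

```python
def check_clip_shape(resolution, sequence_length, stride):
    """Raise ValueError if the clip cannot be evenly downsampled.

    `stride` is a (T, H, W) tuple, e.g. (2, 4, 4) for `kinetics_stride2x4x4`.
    """
    t, h, w = stride
    if sequence_length % t != 0:
        raise ValueError(
            f"sequence_length={sequence_length} is not divisible by temporal stride {t}"
        )
    if resolution % h != 0 or resolution % w != 0:
        raise ValueError(
            f"resolution={resolution} is not divisible by image stride {h}x{w}"
        )
    return True


check_clip_shape(resolution=128, sequence_length=16, stride=(2, 4, 4))
```

Calling it with `sequence_length=15` and the same stride raises a `ValueError`, which is easier to diagnose than a shape mismatch inside the encoder.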


## VQ-VAE Encoding and Decoding
Now we can encode the video with the `encode` function. `encode` also accepts an optional argument `include_embeddings` (default `False`); when set to `True`, it additionally returns the embedding vectors that correspond to the encodings.

In [10]:
with torch.no_grad():
    encodings = vqvae.encode(video)
    video_recon = vqvae.decode(encodings)
    video_recon = torch.clamp(video_recon, -0.5, 0.5)

## Visualizing Reconstructions

In [11]:
videos = torch.cat((video, video_recon), dim=-1)
videos = videos[0].permute(1, 2, 3, 0) # CTHW -> THWC
videos = ((videos + 0.5) * 255).cpu().numpy().astype('uint8')

fig = plt.figure()
plt.title('real (left), reconstruction (right)')
plt.axis('off')
im = plt.imshow(videos[0, :, :, :])
plt.close()

def init():
    im.set_data(videos[0, :, :, :])

def animate(i):
    im.set_data(videos[i, :, :, :])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=videos.shape[0], interval=200) # 200ms = 5 fps
HTML(anim.to_html5_video())
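
The visualization cell above relies on pixel values lying in [-0.5, 0.5], which is why the reconstruction is clamped before display. A per-value sketch of that mapping follows; `to_uint8` is a hypothetical helper that mirrors the `(videos + 0.5) * 255` step for a single number.

```python
def to_uint8(value):
    """Map a pixel value from [-0.5, 0.5] to an integer in [0, 255]."""
    # Clamp first: without it, a value such as 0.6 would map to about 280,
    # which is outside the uint8 range.
    clamped = min(max(value, -0.5), 0.5)
    # int() truncates toward zero, like astype('uint8') on non-negative floats.
    return int((clamped + 0.5) * 255)


print(to_uint8(-0.5), to_uint8(0.0), to_uint8(0.5), to_uint8(0.7))  # 0 127 255 255
```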

## Using Pretrained VideoGPT Models

The model currently available for download is `ucf101_uncond_gpt`.

In [13]:
device = torch.device('cuda')
gpt = load_videogpt('ucf101_uncond_gpt', device=device).to(device)

Access denied with the following error:

	Too many users have viewed or downloaded this file recently. Please
	try accessing the file again later. If the file you are trying to
	access is particularly large or is shared with many people, it may
	take up to 24 hours to be able to view or download the file. If you
	still can't access a file after 24 hours, contact your domain
	administrator.

You may still be able to access the file from the browser:

	https://drive.google.com/uc?id=1QkF_Sb2XVRgSbFT_SxQ6aZUeDFoliPQq


FileNotFoundError: ignored
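
When the Drive quota error above occurs, one possible workaround is to download the checkpoint through the browser link and place it where the loader expects it. The sketch below only checks whether such a file is present. It assumes the cache directory is `~/.cache/videogpt`, as the video download output earlier suggests; whether `load_videogpt` reads from the same location has not been verified here, and `find_cached_file` is a hypothetical helper.

```python
import os


def find_cached_file(name, cache_dir=None):
    """Return the path of a non-empty cached file, or None if it is missing."""
    if cache_dir is None:
        # Assumed location, based on the download log shown earlier.
        cache_dir = os.path.expanduser("~/.cache/videogpt")
    path = os.path.join(cache_dir, name)
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path
    return None


if find_cached_file("ucf101_uncond_gpt") is None:
    print("Checkpoint not cached; try the browser link and copy the file manually.")
```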

The `VideoGPT.sample` method returns generated samples of shape BCTHW with values in the range [0, 1].

In [None]:
samples = gpt.sample(16) # unconditional model does not require batch input

100%|██████████| 4096/4096 [02:34<00:00, 26.50it/s]


In [None]:
import numpy as np

b, c, t, h, w = samples.shape
samples = samples.permute(0, 2, 3, 4, 1)  # BCTHW -> BTHWC
samples = (samples.cpu().numpy() * 255).astype('uint8')

# Tile the 16 samples into a 4 x 4 grid with 1-pixel gaps between tiles
video = np.zeros((t, (1 + h) * 4 + 1, (1 + w) * 4 + 1, c), dtype='uint8')
for i in range(b):
    row, col = i // 4, i % 4  # separate names avoid shadowing the channel count `c`
    start_r, start_c = (1 + h) * row, (1 + w) * col
    video[:, start_r:start_r + h, start_c:start_c + w] = samples[i]

fig = plt.figure()
plt.title('ucf101 unconditional samples')
plt.axis('off')
im = plt.imshow(video[0, :, :, :])
plt.close()

def init():
    im.set_data(video[0, :, :, :])

def animate(i):
    im.set_data(video[i, :, :, :])
    return im

anim = animation.FuncAnimation(fig, animate, init_func=init, frames=video.shape[0], interval=200) # 200ms = 5 fps
HTML(anim.to_html5_video())
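
The grid arithmetic in the cell above can be isolated into two small functions. This is a sketch of the same layout, with a 1-pixel gap after each tile and a 4 x 4 arrangement; `tile_origin` and `canvas_size` are hypothetical names, and the 128 x 128 frame size is taken from the resolution stated earlier for the non-BAIR models.

```python
def tile_origin(index, height, width, cols=4, pad=1):
    """Top-left (row, col) pixel of the tile for sample `index`."""
    row, col = divmod(index, cols)
    return (pad + height) * row, (pad + width) * col


def canvas_size(height, width, rows=4, cols=4, pad=1):
    """Height and width of the canvas that holds every tile plus the gaps."""
    return (pad + height) * rows + pad, (pad + width) * cols + pad


print(canvas_size(128, 128))     # (517, 517)
print(tile_origin(5, 128, 128))  # (129, 129)
```

Sample 5 sits in row 1, column 1, so its tile starts one tile plus one gap pixel in from the top and left edges.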