<a href="https://colab.research.google.com/github/Mpogazi/athena_coder/blob/main/video_%26%26_sound.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Video and Sound GPT (Generation of Videos)
The main idea in this notebook is to use the GPT model architecture, or a different one (JEPA, for example), to generate plausible videos.

### Theory/Intuition:
I think it's possible to learn the distribution of pixels along the time dimension and predict the next frame of a video (and, similarly, the next step in other modalities), as has been done for text with GPT-3 and GPT-4.
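
To make the "predict the next frame" idea concrete, here is a minimal sketch of how one clip can be split into inputs and targets shifted by one frame, the same way next-token prediction is set up for text. The helper name `next_frame_pairs` and the string stand-ins for frames are illustrative assumptions, not part of the notebook.

```python
def next_frame_pairs(frames, block_size):
    """Split one clip of block_size + 1 frames into (inputs, targets) shifted by one frame."""
    assert len(frames) == block_size + 1
    return frames[:block_size], frames[1:block_size + 1]

# Stand-in "frames"; in the notebook these would be image tensors.
clip = ["f0", "f1", "f2", "f3"]
x, y = next_frame_pairs(clip, block_size=3)
print(x)  # ['f0', 'f1', 'f2']
print(y)  # ['f1', 'f2', 'f3']
```

Each target frame is the frame that follows the corresponding input frame, so the model is always asked "given what came before, what comes next?".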

### Initial Approach
* Download hundreds of videos from YouTube and put them on Google Drive
* Write code to read the videos and transform them into tensors
* Write code to transform an output tensor back into a video format
* Write viewer code to play the generated video side by side with the original video (a bit tricky; will add more details as time goes on)

In [14]:
# Installing the python libraries to handle reading sound and video
!pip install opencv-python pydub



### Import

Please put all the imports here. We would like to have a single source of truth.

In [15]:
import cv2
import tensorflow as tf

from tensorflow import keras
from keras import layers
from pydub import AudioSegment
import matplotlib.pyplot as plt

import numpy as np
import random
import io
import os

# Google drive imports
from google.colab import drive

### Globals
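In this section we mount Google Drive and set up some global variables.
List of globals:

`BASE_PATH`, `MAX_HEIGHT`, `MAX_WIDTH`, `TRAIN_SPLIT`, `BLOCK_SIZE`, `BATCH_SIZE`, `VOCAB_SIZE`, `EMBEDDING_DIM`, `LEARNING_RATE`, `EVAL_ITERS`, `MAX_ITERS`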

In [16]:
# Mount the Drive that holds the videos.
# Colab may ask for permission, since the notebook needs
# access to the contents of the Drive.
drive.mount('/content/drive')
BASE_PATH = '/content/drive/MyDrive/video_model/'
MAX_HEIGHT = 720
MAX_WIDTH = 1280
TRAIN_SPLIT = 0.9
# number of frames in a transformer block
BLOCK_SIZE = 3
BATCH_SIZE = 4
VOCAB_SIZE = 256
EMBEDDING_DIM = 16
LEARNING_RATE = 3e-4
EVAL_ITERS = 20
MAX_ITERS = 200

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Setting Up The Training Data

1. Create a list of all files
2. load the videos to set up some Global Variables

In [17]:
files = set()
for filename in os.listdir(BASE_PATH):
  files.add(filename)

def split_test(files):
  size = int(len(files) * TRAIN_SPLIT)
  input_list = list(files)
  random.shuffle(input_list)
  return input_list[:size], input_list[size:]

train, val = split_test(files)

### Loading Training Data

1. Load the videos and pad them
2. Collapse all the videos into a single giant data object
3. Split the videos into train and validation data

In [18]:
# Pad every frame with zeros (bottom and right) up to MAX_HEIGHT x MAX_WIDTH.
# The number of frames is unchanged.
def pad_batch(batch):
  # batch: (batch, block, height, width, channel)
  _, _, height, width, _ = tf.shape(batch).numpy()
  paddings = tf.constant([
      [0, 0],
      [0, 0],
      [0, MAX_HEIGHT - height],
      [0, MAX_WIDTH - width],
      [0, 0]
  ])

  return tf.pad(batch, paddings, "CONSTANT", constant_values=0)

# Take a video path and sample `batch_size` random clips of `blocks + 1` consecutive frames.
# Returns a tensor of shape (batch_size, blocks + 1, height, width, 3) at the video's native size.
def capture_frames_randomly(video_path, blocks = 16, batch_size = 32):
  cap = cv2.VideoCapture(video_path)
  video_tensors = [[] for _ in range(batch_size)]

  block_starts = tf.random.uniform(
                            (batch_size,),
                            maxval= int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - blocks - 1,
                            dtype=tf.int32
                          )

  for index, block_start in enumerate(block_starts.numpy()):
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(block_start))
    for _ in range(blocks + 1):
      ret, frame = cap.read()
      if not ret:
        # A failed read leaves `frame` as None, which cannot be converted to a tensor.
        cap.release()
        raise RuntimeError(f"Failed to grab frame from {video_path}")
      video_tensors[index].append(tf.convert_to_tensor(frame, dtype=tf.int16))

  cap.release()
  video_tensors = tf.stack([tf.convert_to_tensor(video_tensor) for video_tensor in video_tensors])
  return video_tensors

### Batching
Since the videos are huge, we cannot load them whole, so we need an
unconventional batching strategy.

We pick a file at random, then pick random clips from that file.

Some math:
We're working in milliseconds. At 25 fps each frame covers 1/25 s = 40 ms,
so a clip of n frames covers n * 40 ms of video.

Clip length is counted in frames (`BLOCK_SIZE + 1` frames per clip in the code below, so that inputs and targets can be offset by one frame).
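
As a quick check of the arithmetic above, a small sketch that converts a frame count into milliseconds of video. The helper name `clip_duration_ms` is an illustrative assumption, and 25 fps is the frame rate assumed in this section; real videos may differ.

```python
def clip_duration_ms(num_frames: int, fps: float = 25.0) -> float:
    """Duration in milliseconds covered by `num_frames` consecutive frames."""
    return num_frames * 1000.0 / fps

block_size = 3  # same value as BLOCK_SIZE in the Globals cell
print(clip_duration_ms(1))               # one frame: 40.0
print(clip_duration_ms(block_size + 1))  # one clip of BLOCK_SIZE + 1 frames: 160.0
```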


In [19]:
def pool_images(batch):
  pooling = keras.layers.AveragePooling2D(pool_size=(2, 2), padding='valid')
  B, T, H, W, C = batch.shape
  batch = tf.reshape(tf.cast(batch, dtype=tf.float32), [B * T, H, W, C])
  batch = pooling(batch)
  batch = pooling(batch)
  _, new_H, new_W, _ = batch.shape
  batch = tf.reshape(batch, [B, T, new_H, new_W, C])
  return batch

def get_batch(split: str):
  data = train if split == 'train' else val
  # a batch of files
  file_index = random.randint(0, len(data) - 1)

  # batch => (BATCH, BLOCK + 1, H, W, C)
  batch = capture_frames_randomly(BASE_PATH + data[file_index], BLOCK_SIZE, BATCH_SIZE)
  batch = pad_batch(batch)
  xb, yb = batch[:, :BLOCK_SIZE,:, :, :], batch[:, 1:(BLOCK_SIZE + 1), :, :, :]

  return pool_images(xb), pool_images(yb)

xb, yb = get_batch('train')

In [20]:
"""Always very important to check the size of the tensors to make sure you're doing the right thing!"""
print("shape xb: ", tf.shape(xb))
print("shape yb: ", tf.shape(yb))

shape xb:  tf.Tensor([  4   3 180 320   3], shape=(5,), dtype=int32)
shape yb:  tf.Tensor([  4   3 180 320   3], shape=(5,), dtype=int32)
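
The 180 x 320 in the shapes above follows from applying 2 x 2 average pooling twice to the padded 720 x 1280 frames. A minimal sketch of that size calculation, assuming non-overlapping pooling with `'valid'` padding as in `pool_images` (the helper name `pooled_size` is hypothetical):

```python
def pooled_size(size: int, pool: int = 2, times: int = 2) -> int:
    """Length of one spatial axis after `times` non-overlapping 'valid' poolings."""
    for _ in range(times):
        size = size // pool
    return size

print(pooled_size(720), pooled_size(1280))  # 180 320
```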


### Modeling

This is the modeling part of the notebook. Watch as we create the model lmaooo!

Since my associates and I are GPU-poor, the model should save checkpoints from the very beginning. That way, whenever the GPU dies after Google and Associates kill the session, we can resume training where we left off.
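
One way to get "restart where we were" in a hand-written training loop is to save the step counter together with the weights. The sketch below is only an illustration under stated assumptions: `save_state` and `load_state` are hypothetical helpers, the weights are stand-in Python lists, and optimizer state is not covered. In this notebook the weights could plausibly come from `m.get_weights()` and be restored with `m.set_weights(...)`.

```python
import os
import pickle
import tempfile

def save_state(path, step, weights):
    """Write the step counter and weights, replacing any older checkpoint atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)
    os.replace(tmp_path, path)  # avoids a half-written file if the session dies mid-save

def load_state(path):
    """Return (step, weights), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["weights"]

# Usage with a throwaway directory and stand-in weights
ckpt = os.path.join(tempfile.mkdtemp(), "state.pkl")
print(load_state(ckpt))                       # (0, None): nothing saved yet
save_state(ckpt, 120, [[0.1, 0.2], [0.3]])
print(load_state(ckpt))                       # (120, [[0.1, 0.2], [0.3]])
```

The training loop would then start from the loaded step instead of 0 and call `save_state` every few iterations.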

In [21]:
checkpoint_filepath = '/content/drive/MyDrive/model_checkpoints/video_foundation_model_checkpoint.h5' # path to save weights

In [22]:
# Specify the ModelCheckpoint callback.
# Note: Keras only invokes callbacks from model.fit(); the custom training
# loop below never triggers this one.
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_loss',  # the model only reports a loss, so track that
    mode='min',
    save_best_only=True
)

## The Model

Now let's talk about the structure of the model.

In [23]:
class ImageDense(layers.Layer):
  """
  This is an attempt to define a dense layer for images.
  """
  def __init__(self, embedding_dim = EMBEDDING_DIM):
    super().__init__()

  def call(self, x):
    B, T, H, W, C, E = x.shape
    return x

class MultiHeadAttention(layers.Layer):
  def __init__(self):
    super().__init__()
    self.c_attention = layers.Dense(EMBEDDING_DIM * 3)


  def call(self, x):
    B, T, H, W, C, E = x.shape
    # Placeholder: project to 3 * E channels, then split into key, query, value.
    k, q, v = tf.split(self.c_attention(x), 3, axis=-1)
    return x


# class FeedForward(layers.Layer):
#   def __init__(self):
#     super().__init__()

#   def call(self, x):

# class Block(layers.Layer):
#   def __init__(self):
#     super().__init__()

#   def call(self, x):

In [26]:
class VideoModel(keras.Model):
  def __init__(self):
    super().__init__()
    self.pixel_embedding_table = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, dtype=tf.float32)
    self.image_positional_embedding_table = layers.Embedding(BLOCK_SIZE, EMBEDDING_DIM)

    self.vm_head = keras.layers.Dense(VOCAB_SIZE)
    self.loss_calc = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

  def call(self, idx, targets=None):
    B, T, H, W, C = idx.shape
    pixel_embedding = self.pixel_embedding_table(idx)
    pos_embedding = tf.reshape(self.image_positional_embedding_table(tf.range(T)), [1,T, 1, 1, 1, EMBEDDING_DIM])
    x = pixel_embedding + pos_embedding

    logits = self.vm_head(x)
    if targets is None:
      loss = None
    else:
      loss = self.loss_calc(targets, logits)

    return logits, loss

  def generate(self, idx, max_new_images):
    B, T, H, W, C = idx.shape
    idx = tf.cast(idx, dtype=tf.int32)

    for _ in range(max_new_images):
      idx_cond = idx[:, -BLOCK_SIZE:]

      #print("idx_cond: ", tf.shape(idx_cond))
      logits, _ = self(idx_cond)

      #print("idx_logits: ", tf.shape(logits))
      # (B, T, H, W, C, VOCAB_SIZE)
      logits = logits[:, -1, :, :, :, :]
      #print("idx_logits2: ", tf.shape(logits))

      # tf.random.categorical expects unnormalized logits, not softmax probabilities
      flat_logits = tf.reshape(logits, [-1, VOCAB_SIZE])
      img_next = tf.reshape(tf.random.categorical(flat_logits, 1), [B, 1, H, W, C])

      #print("image_next: ", tf.shape(img_next))
      img_next = tf.cast(img_next , dtype=tf.int32)
      idx = tf.concat([idx, img_next], 1)
    return idx

m = VideoModel()
out, loss = m(xb, yb)
optimizer = keras.optimizers.AdamW(learning_rate = LEARNING_RATE)

In [27]:
m.summary()

Model: "video_model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     multiple                  4096      
                                                                 
 embedding_5 (Embedding)     multiple                  48        
                                                                 
 dense_2 (Dense)             multiple                  4352      
                                                                 
Total params: 8496 (33.19 KB)
Trainable params: 8496 (33.19 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
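
The parameter counts in the summary can be reproduced by hand from the globals. This is a small arithmetic check, with lowercase names chosen here so it does not overwrite the notebook's constants:

```python
vocab_size, embedding_dim, block_size = 256, 16, 3  # values from the Globals cell

pixel_embedding = vocab_size * embedding_dim       # one vector per pixel value
positional_embedding = block_size * embedding_dim  # one vector per frame position
head = embedding_dim * vocab_size + vocab_size     # Dense kernel plus bias

total = pixel_embedding + positional_embedding + head
print(pixel_embedding, positional_embedding, head, total)  # 4096 48 4352 8496
print(round(total * 4 / 1024, 2))                          # 33.19 (KB at 4 bytes per float32)
```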


## Training Step

Setting up the training step.

In [28]:
@tf.function
def train_step(x, y, model, optimizer):
  with tf.GradientTape() as tape:
    logits, loss = model(x, y)

  gradients = tape.gradient(loss, model.trainable_variables)
  #print(gradients)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss

# `step` rather than `iter`, to avoid shadowing the Python builtin
for step in range(MAX_ITERS):

  if step % EVAL_ITERS == 0:
    out = {}
    for split in ['train', 'val']:
      losses = [None] * EVAL_ITERS
      for k in range(EVAL_ITERS):
        x, y = get_batch(split)
        logits, loss = m(x, y)
        losses[k] = loss.numpy()
      mean_loss = tf.reduce_mean(losses)
      out[split] = mean_loss.numpy()

    print(f"step {step}: train loss {out['train']:.4f}, val loss {out['val']:.4f}")

  x, y = get_batch('train')
  loss = train_step(x, y, m, optimizer)

print(loss)

step 0: train loss 5.5465, val loss 5.5457
step 20: train loss 5.5398, val loss 5.5391
step 40: train loss 5.5348, val loss 5.5365
step 60: train loss 5.5259, val loss 5.5285
step 80: train loss 5.5229, val loss 5.5244
step 100: train loss 5.5091, val loss 5.5202
step 120: train loss 5.5123, val loss 5.5147
step 140: train loss 5.4993, val loss 5.5048
step 160: train loss 5.4873, val loss 5.4975
step 180: train loss 5.4659, val loss 5.4827
tf.Tensor(5.503126, shape=(), dtype=float32)
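
A useful reference point for reading these numbers: a model that assigns equal probability to all 256 pixel values has a cross-entropy of ln(256). The short check below computes that baseline; comparing it with the log above suggests the model starts at uniform guessing and has improved only slightly after 200 iterations.

```python
import math

vocab_size = 256  # same value as VOCAB_SIZE
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform prediction, in nats
print(round(uniform_loss, 4))  # 5.5452
```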


### Visualization

We need to turn the generated artifacts into videos that can be viewed and critiqued. That means converting a tensor of shape

`[FRAMES, HEIGHT, WIDTH, C]` into a video file (normally a 25fps video).
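
Before writing frames to disk, it helps to make sure they are `uint8` and to get the frame size in the order the writer expects. A minimal NumPy sketch, where `to_writer_frames` is a hypothetical helper and the random array stands in for a generated sample:

```python
import numpy as np

def to_writer_frames(video):
    """Clip a (frames, height, width, 3) array to 0..255 and cast it to uint8."""
    video = np.asarray(video)
    if video.ndim != 4 or video.shape[-1] != 3:
        raise ValueError(f"expected (frames, height, width, 3), got {video.shape}")
    frames = np.clip(video, 0, 255).astype(np.uint8)
    height, width = frames.shape[1:3]
    return frames, (width, height)  # cv2.VideoWriter takes the frame size as (width, height)

video = np.random.randint(0, 256, size=(5, 180, 320, 3))  # stand-in for generated frames
frames, size = to_writer_frames(video)
print(frames.dtype, size)  # uint8 (320, 180)
```

Note the swap: the tensor is laid out as (height, width), while the writer wants (width, height).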

In [29]:
sample = m.generate(tf.zeros((1, 1, MAX_HEIGHT // 4, MAX_WIDTH // 4, 3)), 1000)
tf.shape(sample)

<tf.Tensor: shape=(5,), dtype=int32, numpy=array([   1, 1001,  180,  320,    3], dtype=int32)>

In [30]:
tf.shape(sample[0][0]).numpy()[:2][::-1]

array([320, 180], dtype=int32)

In [31]:
output_file = "/content/drive/MyDrive/generated_videos/video_" + str(random.randint(0, 2000)) + ".mp4"
print("output file: ", output_file)
fps = 25  # matches the 25 fps assumed elsewhere in this notebook
print("sample: ", tf.shape(sample))

fourcc = cv2.VideoWriter_fourcc(*'mp4v')
T, H, W, C = sample.shape[1:]

out = cv2.VideoWriter(output_file, fourcc, fps, (W, H))

for t in range(T):
  frame = sample[0, t]
  frame = np.uint8(frame)
  out.write(frame)

out.release()

output file:  /content/drive/MyDrive/generated_videos/video_1594.mp4
sample:  tf.Tensor([   1 1001  180  320    3], shape=(5,), dtype=int32)
