<a href="https://colab.research.google.com/github/Mpogazi/athena_coder/blob/main/notebooks/video_%26%26_sound.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Video and Sound GPT (Generation of Videos)
The main idea in this notebook is to use the GPT model architecture, or an alternative architecture such as JEPA, to generate plausible videos.

### Theory/Intuition:
I think it's possible to learn how pixel distributions evolve along the time dimension and to predict the next frame of a video (and the next step in other modalities), as has been done for text with GPT-3 and GPT-4.
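
One way to make this intuition concrete, assuming a video is treated as a sequence of frames $x_1, \dots, x_T$ (notation introduced here for illustration only), is the same autoregressive factorization that is used for text:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Under that assumption, predicting the next frame amounts to modeling each conditional term on the right-hand side.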

### Initial Approach
* Download hundreds of videos from YouTube and put them on Google Drive
* Write code to read the videos and transform them into tensors
* Write code to transform an output tensor back into a video format
* Write code for a window that plays the generated video side by side with the original video (a bit tricky; more details will be added over time)
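
The side-by-side comparison in the last bullet could start from something like the sketch below. It is only an illustration in NumPy, not the notebook's implementation: the helper name `side_by_side` is hypothetical, and both clips are assumed to share the layout `(frames, height, width, channels)` with equal height and channel count.

```python
import numpy as np

def side_by_side(original, generated):
    """Join two clips frame by frame along the width axis.

    Both inputs are assumed to have shape (frames, height, width, channels)
    with the same height and channel count.
    """
    # Only the frames present in both clips can be compared
    n = min(len(original), len(generated))
    # axis=2 is the width axis, so each output frame is [original | generated]
    return np.concatenate([original[:n], generated[:n]], axis=2)

# Tiny synthetic clips stand in for real videos
original = np.zeros((4, 2, 3, 3), dtype=np.uint8)
generated = np.full((5, 2, 3, 3), 255, dtype=np.uint8)
combined = side_by_side(original, generated)
print(combined.shape)  # (4, 2, 6, 3)
```

The combined array can then be played or saved like any single clip, which keeps the playback window code independent of the comparison logic.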

In [None]:
# Installing the python libraries to handle reading sound and video
!pip install opencv-python pydub



### Import

Please put all the imports here. We would like to have a single source of truth.

In [None]:
import cv2
import tensorflow as tf

from tensorflow import keras
from keras import layers
from pydub import AudioSegment

import numpy as np
import random
import io
import os

# Google drive imports
from google.colab import drive

### Globals
In this section we mount Google Drive and set up some global variables.
List of globals:

`BASE_PATH`, `MAX_HEIGHT`, `MAX_WIDTH`, `TRAIN_SPLIT`, `BATCH_SIZE`

In [None]:
# Mount the Drive folder that holds the videos.
# Colab may ask for permission here, since the notebook
# needs access to the contents of the drive/videos.
drive.mount('/content/drive')
BASE_PATH = '/content/drive/MyDrive/video_model/'
MAX_HEIGHT = 720
MAX_WIDTH = 1280
TRAIN_SPLIT = 0.9
BATCH_SIZE = 32

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Setting Up The Training Data

1. Create a list of all video files in `BASE_PATH`
2. Shuffle the list and split it into training and validation sets using `TRAIN_SPLIT`

In [None]:
# Collect every file name found in the video folder
files = set(os.listdir(BASE_PATH))

def split_test(files):
  # Shuffle the file names and split them into (train, validation) lists
  size = int(len(files) * TRAIN_SPLIT)
  input_list = list(files)
  random.shuffle(input_list)
  return input_list[:size], input_list[size:]

train, val = split_test(files)

### Loading Training Data

1. Load the videos and pad them
2. Collapse all the videos into a single giant data object
3. Split the videos into train and test data

In [None]:
# Pad each video tensor with zeros up to MAX_HEIGHT x MAX_WIDTH.
# The number of frames stays the same.
def pad_video_tensor(video_tensor):
  # Expected layout: (frames, height, width, channels)
  _, height, width, _ = tf.shape(video_tensor).numpy()
  paddings = tf.constant([
      [0, 0],
      [0, MAX_HEIGHT - height],
      [0, MAX_WIDTH - width],
      [0, 0]
  ])
  return tf.pad(video_tensor, paddings, "CONSTANT", constant_values=0)


# Take a video path and return an equivalent tensor.
# Returns a tensor of shape (frames, MAX_HEIGHT, MAX_WIDTH, 3)
# holding up to frame_limits + 1 consecutive frames.
def capture_frames_randomly(video_path, frame_limits = 200):
  cap = cv2.VideoCapture(video_path)
  video_tensors = []

  # cap.get returns a float, so cast it before using it as an integer bound
  total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
  # Choose a start that leaves room for frame_limits + 1 frames when possible
  start = random.randint(0, max(0, total_frames - frame_limits - 1))

  cap.set(cv2.CAP_PROP_POS_FRAMES, start)
  for _ in range(frame_limits + 1):
    ret, frame = cap.read()
    if not ret:
      print("Error: Failed to grab frame.")
      break
    video_tensors.append(tf.convert_to_tensor(frame, dtype=tf.int16))
  cap.release()

  ## Padding the tensor to match the sizes for all videos
  return pad_video_tensor(tf.stack(video_tensors))
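
As a quick sanity check of the padding arithmetic, here is a small NumPy sketch. It is an illustration only: the helper `pad_amounts` and the 480x640 clip are hypothetical, and it pads at the bottom and right as `pad_video_tensor` does. A frame larger than the limits would give negative padding amounts, so the sketch raises an error in that case.

```python
import numpy as np

MAX_HEIGHT, MAX_WIDTH = 720, 1280  # same limits as in the Globals section

def pad_amounts(height, width):
    """Return the (bottom, right) zero-padding needed to reach the maximum size."""
    if height > MAX_HEIGHT or width > MAX_WIDTH:
        raise ValueError("frame is larger than MAX_HEIGHT x MAX_WIDTH")
    return MAX_HEIGHT - height, MAX_WIDTH - width

# A hypothetical 2-frame 480x640 clip
clip = np.ones((2, 480, 640, 3), dtype=np.int16)
bottom, right = pad_amounts(480, 640)
# np.pad defaults to constant zero padding
padded = np.pad(clip, [(0, 0), (0, bottom), (0, right), (0, 0)])
print(bottom, right)  # 240 640
print(padded.shape)   # (2, 720, 1280, 3)
```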

### Batching
Since the videos are huge, we need an unconventional batching strategy.

We pick a file at random and then pick a random batch of consecutive frames from it.

Some math:
`batch_size` is counted in frames. Assuming 25 fps, each frame covers
1/25 s = 40 ms, so a batch spans `batch_size * (1 / 25)` seconds of video
(1.28 s for `BATCH_SIZE = 32`).
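
The arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes the 25 fps figure from the text; the helper name `clip_duration_ms` is hypothetical.

```python
FPS = 25          # frame rate assumed in the text above
BATCH_SIZE = 32   # frames per batch, as in the Globals section

def clip_duration_ms(num_frames, fps=FPS):
    """Return the time span covered by num_frames frames, in milliseconds."""
    return num_frames * 1000 / fps

print(clip_duration_ms(1))           # 40.0
print(clip_duration_ms(BATCH_SIZE))  # 1280.0
```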


In [None]:
def get_batch(split: str):
  data = train if split == 'train' else val
  # Pick one file uniformly at random
  rand_file = random.choice(data)
  # Since in our predictions we're taking in the past and predicting the future,
  # y is x shifted forward by one frame
  sample = capture_frames_randomly(os.path.join(BASE_PATH, rand_file), BATCH_SIZE)
  x = sample[:BATCH_SIZE]
  y = sample[1:(BATCH_SIZE + 1)]
  return x, y

xb, yb = get_batch('train')

In [None]:
print("shape xb: ", tf.shape(xb))
print("shape yb: ", tf.shape(yb))

shape xb:  tf.Tensor([  32  720 1280    3], shape=(4,), dtype=int32)
shape yb:  tf.Tensor([  32  720 1280    3], shape=(4,), dtype=int32)
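
The two shapes match because `yb` is `xb` shifted forward by one frame. The toy NumPy sketch below illustrates the slicing; the 1x1 frames and the small `BATCH_SIZE` are hypothetical and chosen only for readability.

```python
import numpy as np

BATCH_SIZE = 4  # small hypothetical value to keep the demo readable

# Stand-in for a clip: frame i is a 1x1x1 image whose pixel value is i
sample = np.arange(BATCH_SIZE + 1).reshape(-1, 1, 1, 1)

x = sample[:BATCH_SIZE]        # frames 0 .. BATCH_SIZE - 1
y = sample[1:BATCH_SIZE + 1]   # frames 1 .. BATCH_SIZE

print(x.ravel())  # [0 1 2 3]
print(y.ravel())  # [1 2 3 4]
```

Each target frame `y[i]` is therefore the frame that follows `x[i]` in the clip.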


### Modeling

This is the modeling part of the notebook. Watch as we create the model lmaooo!

In [None]:
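class GPTVideoModel(keras.Model):
  def __init__(self):
    super().__init__()

  def call(self, idx, targets=None):
    # Placeholder until the architecture is defined: the input frames are
    # returned unchanged as the "prediction".
    out = tf.cast(idx, tf.float32)
    if targets is None:
      return out, None
    # Mean squared error is an assumed stand-in loss, not a final choice
    loss = tf.reduce_mean(tf.square(out - tf.cast(targets, tf.float32)))
    return out, loss

  def generate(self, idx, max_new_frames):
    return idx

m = GPTVideoModel()
out, loss = m(xb, yb)
learning_rate = 3e-4  # assumed placeholder value; tune as needed
optimizer = keras.optimizers.AdamW(learning_rate=learning_rate)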
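
For orientation, the loop that `generate` would eventually run can be sketched independently of any model. This is only an illustration in NumPy under stated assumptions: `repeat_last_frame` is a hypothetical stand-in predictor that a trained model would replace, and clips use the layout `(frames, height, width, channels)`.

```python
import numpy as np

def repeat_last_frame(frames):
    """Hypothetical stand-in predictor: guesses that nothing changes."""
    return frames[-1]

def generate(frames, max_new_frames, predict_next=repeat_last_frame):
    """Autoregressively extend a clip of shape (frames, height, width, channels)."""
    for _ in range(max_new_frames):
        next_frame = predict_next(frames)
        # Append the prediction so it becomes context for the next step
        frames = np.concatenate([frames, next_frame[None]], axis=0)
    return frames

clip = np.random.randint(0, 256, size=(3, 2, 2, 3), dtype=np.uint8)
extended = generate(clip, max_new_frames=5)
print(extended.shape)  # (8, 2, 2, 3)
```

Passing the predictor as an argument keeps the loop unchanged when the placeholder is swapped for a real model.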