min-patchnizer

Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder. The code here first extracts still images (frames) from a video, splits the frames into smaller fixed-size patches, linearly embeds each of them, adds position embeddings, and then saves the resulting sequence of vectors for use in a Vision Transformer encoder. I tried training the resulting sequence vectors with Karpathy's minbpe; it took 2173.45 seconds per frame to tokenize. The whole "patchnization" took ~77.40s for a 20s video on my M2 Air.
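
To make the pipeline concrete, here is a minimal sketch of the three stages. The file/directory names, the use of OpenCV/NumPy, and the random projection and position-embedding weights are illustrative assumptions, not the repo's exact implementation:

```python
import os
import cv2
import numpy as np

PATCH = 16  # patch size in pixels, as in the ViT paper

def extract_frames(video_path):
    """Yield still frames from the video as HxWx3 uint8 arrays."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield frame
    cap.release()

def to_patches(frame):
    """Split a frame into non-overlapping PATCH x PATCH patches and flatten them."""
    h, w, c = frame.shape
    h, w = h - h % PATCH, w - w % PATCH             # crop to a multiple of PATCH
    blocks = frame[:h, :w].reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)

def embed(patches, d_model=256, rng=np.random.default_rng(0)):
    """Linearly project flattened patches and add (here: random) position embeddings."""
    proj = rng.normal(0, 0.02, size=(patches.shape[1], d_model))
    pos = rng.normal(0, 0.02, size=(patches.shape[0], d_model))
    return patches.astype(np.float32) @ proj + pos  # (num_patches, d_model)

if __name__ == "__main__":
    os.makedirs("patch_embeddings", exist_ok=True)
    for t, frame in enumerate(extract_frames("video.mp4")):
        np.save(f"patch_embeddings/frame_{t:04d}.npy", embed(to_patches(frame)))
```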


The files in this repo work as follows:

  • patchnizer.py: Holds code for a simple implementation of the three stages involved (extract_image_frames from the video, reduce image_frames_to_patches of fixed size 16x16 pixels, then linearly_embed_patches into a 1D vector sequence with added position embeddings).
  • patchnize.py: Performs the whole process with custom configs (patch_size, created dirs, video - I am using the "dogs playing in snow" video by Sora).
  • train.py: Trains the resulting one-dimensional vector sequence (linear_patch_embeddings + position_embeddings) on Karpathy's minbpe, a minimal implementation of the byte-pair encoding algorithm - see the sketch after this list.
  • check.py: Checks whether the patch embeddings match the original image patches and then reconstructs the original image frames - this basically just does the reverse of the linear embedding.
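
A minimal sketch of the train step, assuming minbpe is installed from https://github.com/karpathy/minbpe and that the embeddings were saved as .npy files as in the sketch above; how the repo's train.py actually serializes the vectors into text for BPE is an assumption here:

```python
import glob
import numpy as np
from minbpe import BasicTokenizer

# Serialize each frame's (num_patches, d_model) embedding matrix into plain text.
corpus_parts = []
for path in sorted(glob.glob("patch_embeddings/*.npy")):
    vectors = np.load(path)
    corpus_parts.append(" ".join(f"{v:.4f}" for v in vectors.ravel()))
corpus = "\n".join(corpus_parts)

tokenizer = BasicTokenizer()
tokenizer.train(corpus, vocab_size=512, verbose=True)  # 256 byte tokens + 256 merges
tokenizer.save("patch_tokenizer")                      # writes patch_tokenizer.model / .vocab
```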

The whole process builds on the approach introduced in the Vision Transformer paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale."
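
For a sense of scale, assuming a 224x224 RGB frame (the paper's resolution; the actual frame size here depends on the video), the patch arithmetic works out to:

```python
H, W, C, P = 224, 224, 3, 16       # assumed frame size; 16x16 patches as in the paper
num_patches = (H // P) * (W // P)  # 14 * 14 = 196 patches per frame
patch_dim = P * P * C              # 16 * 16 * 3 = 768 values per flattened patch
```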

YouTube Video: Watch Demo

Usage

First patchnize:

python patchnize.py

Next check:

python check.py

Then train:

python train.py

References

  • Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR 2021. arXiv:2010.11929
  • Karpathy, minbpe: https://github.com/karpathy/minbpe

License

MIT
