Add support for Idefics 2 #309

Merged: 133 commits, Jun 20, 2024

EricLBuehler (Owner) commented May 15, 2024

This PR adds support for our first multimodal model: Idefics 2 (https://huggingface.co/HuggingFaceM4/idefics2-8b)!

Implementation TODOs:

  • VisionTransformer
    • Encoder
      • Attention
      • MLP
    • VisionEmbedding
  • Connector
    • MLP
    • PerceiverLayer
  • Model
    • Forward pass
      • Remove padding images
      • Generate the patch/pixel attention mask
      • Run vision submodel and connector submodel
        • Allow Mistral to run trained embedding head on any input tokens
        • Inputs merger to inject embeddings correctly
      • Pass input to Mistral model
        • Allow Mistral to take an embeddings vector instead of using the trained embedding head.
  • Image processor analogous to Idefics2ImageProcessor
    • Resizing
    • Rescaling
    • Normalization
    • Padding
      • Generate pixel attention mask for padded images
        • Pass and use in input injection
    • Create pixel values tensors
  • Vision Model Pipeline
    • Add a VisionModel trait similar to NormalModel
    • Add a ModelCategory: vision, text, embedding etc
    • Handle sequence scheduling with image dimensions
    • Abstract input preparation logic
      • Handle padding to same, resized shape, across batch dimension
  • Add proper handling of chat templates
    • Load preprocessor/processor config JSON files
    • Support configuration of inputs processor via preprocessor
  • Messages API generalization
    • Support OpenAI compatible method of specifying images
    • Update messages to optionally encode type (akin to examples here).
    • Use processor config to abstract the chat template application process
  • HTTP API
    • Handle decoding from base64
    • Support loading from HTTP.
  • Rust API
  • Python API
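
The image processor items above (rescaling, normalization, padding, and the pixel attention mask for padded images) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: the `Image` struct and the `rescale_and_normalize` and `pad_batch` functions are hypothetical names, the image is single-channel for brevity, and zero padding on the bottom and right edges is an assumption.

```rust
/// Hypothetical single-channel image stored row-major.
/// A real processor would carry three channels and use tensors.
#[derive(Debug, Clone, PartialEq)]
pub struct Image {
    pub height: usize,
    pub width: usize,
    pub data: Vec<f32>,
}

/// Rescale raw u8 pixels into [0, 1], then normalize with a mean and std.
pub fn rescale_and_normalize(
    pixels: &[u8],
    height: usize,
    width: usize,
    mean: f32,
    std: f32,
) -> Image {
    assert_eq!(pixels.len(), height * width, "pixel count must match shape");
    let data = pixels
        .iter()
        .map(|&p| (p as f32 / 255.0 - mean) / std)
        .collect();
    Image { height, width, data }
}

/// Pad every image to the largest height and width in the batch
/// (assumed: zeros on the bottom and right) and return, per image,
/// the padded pixels plus a mask where 1 = real pixel and 0 = padding.
pub fn pad_batch(images: &[Image]) -> Vec<(Image, Vec<u8>)> {
    let max_h = images.iter().map(|i| i.height).max().unwrap_or(0);
    let max_w = images.iter().map(|i| i.width).max().unwrap_or(0);
    images
        .iter()
        .map(|img| {
            let mut data = vec![0.0f32; max_h * max_w];
            let mut mask = vec![0u8; max_h * max_w];
            for r in 0..img.height {
                for c in 0..img.width {
                    data[r * max_w + c] = img.data[r * img.width + c];
                    mask[r * max_w + c] = 1;
                }
            }
            (Image { height: max_h, width: max_w, data }, mask)
        })
        .collect()
}
```

After padding, every image in the batch has the same shape, so the batch can be stacked into one pixel-values tensor, and the mask lets the later input injection step ignore the padded regions.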

Other TODOs:

  • Introduce model type enum to reject mixing of text/multimodal models in speculative decoding
    • Perhaps introduce VisionModel akin to NormalModel.
  • Ergonomic API support (OpenAI compatible on the HTTP side, but hopefully nicer on the Rust/Python side)
  • Support device mapping
  • Support ISQ
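
The first item above, a model type enum that rejects mixing text and multimodal models in speculative decoding, might look something like this. It is a hedged sketch only: `ModelCategory` borrows its name from the TODO list, while its variants and the `check_speculative_pair` function are hypothetical and not taken from the merged code.

```rust
/// Hypothetical category tag attached to each loaded model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelCategory {
    Text,
    Vision,
}

/// Reject a speculative decoding setup whose target and draft models
/// belong to different categories (for example, text target + vision draft).
pub fn check_speculative_pair(
    target: ModelCategory,
    draft: ModelCategory,
) -> Result<(), String> {
    if target == draft {
        Ok(())
    } else {
        Err(format!(
            "cannot mix model categories in speculative decoding: target is {:?}, draft is {:?}",
            target, draft
        ))
    }
}
```

Running the check once when the pipeline is constructed would surface the mismatch as a load-time error rather than a failure in the middle of generation.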

Pending issues:

EricLBuehler added the "new feature" (New feature or request) and "models" (Additions to model or architectures) labels May 15, 2024

github-actions bot commented May 15, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 31         1217         1038           37          142
 TOML                   16          440          400            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1149            0          846          303
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1451          279          849          323
-------------------------------------------------------------------------------
 Rust                  115        34412        31161          585         2666
 |- Markdown            57          641           13          594           34
 (Total)                          35053        31174         1179         2700
===============================================================================
 Total                 191        37715        33014         1469         3232
===============================================================================
  

EricLBuehler merged commit 9fda084 into master Jun 20, 2024
10 of 11 checks passed
EricLBuehler deleted the idefics2 branch June 20, 2024 07:07