# Phi-3-Vision
On May 21, 2024, Microsoft released Phi-3-vision, a 4.2B-parameter multimodal model with language and vision capabilities.
The model is openly available on [Hugging Face](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct), and a [cookbook](https://github.com/microsoft/Phi-3CookBook) accompanies it.

As of May 2024, the model is not supported by inference tools such as llama.cpp or Ollama, which would make it easier to run.
This notebook shows how to set it up using Hugging Face's transformers library instead. As a result, the setup is fairly involved and may not work on many systems.


## Installation
1. Prerequisites:
    - Linux
    - Python 3.11 or 3.12
    - Modern NVIDIA GPU (3000 Series or higher) with >=16GB of VRAM
1. Install PyTorch - [Getting Started](https://pytorch.org/get-started/locally/).
    - Usually this is: `pip install torch torchvision torchaudio`
1. Install [flash-attn](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features)
    - `pip install packaging`
    - `pip install flash-attn --no-build-isolation` 
    - You may need to install these first:
        ```bash
        pip install setuptools
        pip install wheel
        ```
1. `pip install accelerate`
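
Before loading the model, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of the original setup: the function names `check_environment` and `gpu_memory_gib` are hypothetical, and the check assumes that flash-attn is importable under the module name `flash_attn`.

```python
import importlib.util
import platform
import sys

# Import names of the packages installed in the steps above.
# Assumption: flash-attn is importable as "flash_attn".
REQUIRED_PACKAGES = ["torch", "torchvision", "flash_attn", "accelerate"]


def check_environment():
    """Return a dict mapping each prerequisite to True/False."""
    report = {
        "linux": platform.system() == "Linux",
        "python_ok": sys.version_info[:2] in {(3, 11), (3, 12)},
    }
    for name in REQUIRED_PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


def gpu_memory_gib():
    """Return the total memory of GPU 0 in GiB, or None without torch/CUDA."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check works without torch installed

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 2**30


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
    print(f"GPU memory (GiB): {gpu_memory_gib()}")
```

Any entry reported as missing points back to the corresponding installation step. A `None` GPU memory value means PyTorch is not installed or cannot see a CUDA device.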


In [1]:
from pathlib import Path

from not_again_ai.local_llm.huggingface.chat_completion import chat_completion_image
from not_again_ai.local_llm.huggingface.helpers import load_model, load_processor

model_id = "microsoft/Phi-3-vision-128k-instruct"

In [2]:
model = load_model(model_id=model_id)
processor = load_processor(model_id=model_id)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
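
For readers who want to skip the `not_again_ai` helpers, the sketch below loads the model and processor with transformers directly. It follows the loading pattern shown on the Hugging Face model card and is not necessarily what `load_model` and `load_processor` do internally. The function name `load_phi3_vision` is hypothetical, and `device_map="cuda"` assumes a single CUDA GPU.

```python
def load_phi3_vision(model_id="microsoft/Phi-3-vision-128k-instruct"):
    """Load the model and processor with transformers (sketch, needs a CUDA GPU)."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",         # assumption: one CUDA device is available
        torch_dtype="auto",        # use the dtype stored in the checkpoint
        trust_remote_code=True,    # the repository ships custom modeling code
        _attn_implementation="flash_attention_2",  # requires flash-attn
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor
```

Calling `load_phi3_vision()` downloads the checkpoint on first use, so it needs network access and enough disk space as well as the GPU listed in the prerequisites.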


In [3]:
messages = [
    {"role": "system", "content": "Your goal is to understand and describe images."},
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]

sk_diagram = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "SKDiagram.png"
images = [sk_diagram]

response = chat_completion_image(
    messages=messages,
    images=images,
    model_processor=(model, processor),
    max_tokens=1000,
)

response["message"]



'The image displays a flowchart representing the process flow of an AI service within the Semantic Kernel framework. It shows the steps from invoking a prompt to selecting an AI service, rendering the prompt, and finally, invoking the AI service. The process includes model selection, templateization, reliability checks, and event notifications.'

In [4]:
messages = [
    {"role": "system", "content": "Your goal is to understand and describe images in just a few words."},
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "A cat"},
    {"role": "user", "content": "<|image_2|>\nWhat is shown in this image?"},
]

cat = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "cat.jpg"
numbers = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "numbers.png"
images = [cat, numbers]

response = chat_completion_image(
    messages=messages,
    images=images,
    model_processor=(model, processor),
    max_tokens=500,
)

response["message"]

'A handwritten number set'
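
The cells above suggest that each image is referenced in the prompt by a numbered placeholder (`<|image_1|>`, `<|image_2|>`, ...) whose index matches its position in the `images` list. The helper below is a hypothetical sketch for catching mismatches before running inference. It assumes that placeholders must be numbered 1 through N with no gaps or repeats, which is inferred from these examples rather than confirmed by the library.

```python
import re

# Matches placeholders such as <|image_1|> and captures the index.
PLACEHOLDER = re.compile(r"<\|image_(\d+)\|>")


def check_image_placeholders(messages, num_images):
    """Return the sorted placeholder indices, raising ValueError on a mismatch."""
    indices = []
    for message in messages:
        indices.extend(int(i) for i in PLACEHOLDER.findall(message["content"]))
    indices.sort()
    expected = list(range(1, num_images + 1))
    if indices != expected:
        raise ValueError(
            f"placeholders {indices} do not match {num_images} image(s)"
        )
    return indices


messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "A cat"},
    {"role": "user", "content": "<|image_2|>\nWhat is shown in this image?"},
]
print(check_image_placeholders(messages, num_images=2))  # prints [1, 2]
```

Running the check before `chat_completion_image` turns a silent prompt/image mismatch into an explicit error.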

In [5]:
messages = [
    {"role": "system", "content": "Your goal is to understand and describe images in just a few words."},
    {
        "role": "user",
        "content": "<|image_1|><|image_2|>\nWhat is shown in these images? Describe the first image first. Then describe the second.",
    },
]

cat = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "cat.jpg"
numbers = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "numbers.png"
images = [cat, numbers]

response = chat_completion_image(
    messages=messages,
    images=images,
    model_processor=(model, processor),
    max_tokens=500,
)

print(f"Completion Tokens: {response['completion_tokens']}")
response["message"]

Completion Tokens: 46


"The first image shows a cat's face with green eyes, a pink nose, and white whiskers. The second image appears to be a handwritten set of numbers from 1 to 12."

In [6]:
# Try inference without images
messages = [
    {
        "role": "user",
        "content": "What is 2+2?",
    },
]

response = chat_completion_image(
    messages=messages,
    images=None,
    model_processor=(model, processor),
    max_tokens=500,
)

response["message"]

'2+2 is 4.'