# Phi-3-Vision
On May 21, 2024, Microsoft released Phi-3-vision, a 4.2B-parameter multimodal model with language and vision capabilities.
The model is openly available on [Hugging Face](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct), and a [cookbook](https://github.com/microsoft/Phi-3CookBook) accompanies it.

As of May 2024, the model is not supported by inference tools such as llama.cpp or Ollama, which would make it easier to run.
This notebook shows how to set it up using Hugging Face's transformers library instead. As a result, the setup is fairly involved and may not work on many systems.


## Installation
1. Prerequisites:
    - Linux
    - Python 3.11 or 3.12
    - Modern NVIDIA GPU (3000 Series or higher) with >=16GB of VRAM
1. Install PyTorch - [Getting Started](https://pytorch.org/get-started/locally/).
    - Usually this is: `pip install torch torchvision torchaudio`
1. Install [flash-attn](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#installation-and-features)
    - `pip install packaging`
    - `pip install flash-attn --no-build-isolation` 
    - You may need to install these first:
        ```bash
        pip install setuptools
        pip install wheel
        ```
1. `pip install accelerate`
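
Before loading the model, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch using only the standard library; the function name `check_environment` and the package list are illustrative assumptions, and `transformers` is included because the notebook relies on it even though no install step names it. It does not inspect the GPU or its memory.

```python
import importlib.util
import sys

# Import names (not pip names) of packages the steps above install or the notebook relies on.
REQUIRED_PACKAGES = ["torch", "torchvision", "flash_attn", "accelerate", "transformers"]


def check_environment() -> dict[str, bool]:
    """Report which prerequisites appear satisfied, without importing heavy packages."""
    report = {
        "linux": sys.platform.startswith("linux"),
        "python_3.11_or_3.12": sys.version_info[:2] in ((3, 11), (3, 12)),
    }
    for name in REQUIRED_PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as `MISSING` points back to the corresponding installation step.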


In [1]:
from pathlib import Path

from not_again_ai.local_llm.huggingface.chat_completion import chat_completion_image
from not_again_ai.local_llm.huggingface.helpers import load_model, load_processor

model_id = "microsoft/Phi-3-vision-128k-instruct"

In [2]:
model = load_model(model_id=model_id)
processor = load_processor(model_id=model_id)
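
For readers who prefer not to depend on the helper functions, the cell above can probably be approximated by calling transformers directly. This is a sketch based on the usage shown on the model's Hugging Face page, not the actual implementation of `load_model` or `load_processor`, which may use different settings; the function name `load_phi3_vision` is hypothetical.

```python
def load_phi3_vision(model_id: str = "microsoft/Phi-3-vision-128k-instruct"):
    """Load model and processor with transformers (an assumed equivalent of the helpers)."""
    # Imported lazily so that defining this function does not require the packages.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="cuda",  # place the weights on the NVIDIA GPU
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        trust_remote_code=True,  # the model ships custom modeling code on the Hub
        _attn_implementation="flash_attention_2",  # relies on the flash-attn install
    )
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor
```

Calling `load_phi3_vision()` downloads the weights on first use, so it needs network access and enough disk space in addition to the GPU requirements listed under Installation.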



In [3]:
messages = [
    {"role": "system", "content": "Your goal is to understand and describe images."},
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
]

sk_diagram = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "SKDiagram.png"
images = [sk_diagram]

response = chat_completion_image(
    messages=messages,
    images=images,
    model_processor=(model, processor),
    max_tokens=1000,
)

response["message"]



"The image depicts a diagram illustrating the process flow within a Semantic Kernel. It includes an 'Application' block on the left, a central 'Kernel' block, and a 'Models' block on the right, with various steps and processes connected by arrows indicating the flow of operations."

In [4]:
messages = [
    {"role": "system", "content": "Your goal is to understand and describe images in just a few words."},
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "A cat"},
    {"role": "user", "content": "<|image_2|>\nWhat is shown in this image?"},
]

cat = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "cat.jpg"
numbers = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "numbers.png"
images = [cat, numbers]

response = chat_completion_image(
    messages=messages,
    images=images,
    model_processor=(model, processor),
    max_tokens=500,
)

response["message"]

'Numbers'
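
In the examples here, the `<|image_N|>` placeholders are numbered from 1 across the whole conversation and line up with the order of the `images` list. A small check like the one below can catch a mismatch before running inference. It is a sketch under that assumed convention; `check_image_placeholders` is a hypothetical helper, not part of any library, and the rule it enforces may be stricter than the model requires.

```python
import re

# Matches placeholders such as <|image_1|> and captures the index.
PLACEHOLDER = re.compile(r"<\|image_(\d+)\|>")


def check_image_placeholders(messages: list[dict], num_images: int) -> list[int]:
    """Return placeholder indices in order of appearance; raise if they are not 1..num_images."""
    found = [int(n) for m in messages for n in PLACEHOLDER.findall(m["content"])]
    expected = list(range(1, num_images + 1))
    if found != expected:
        raise ValueError(f"expected placeholders {expected}, found {found}")
    return found
```

For example, `check_image_placeholders(messages, len(images))` could be called just before `chat_completion_image`.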

In [5]:
messages = [
    {"role": "system", "content": "Your goal is to understand and describe images in just a few words."},
    {
        "role": "user",
        "content": "<|image_1|><|image_2|>\nWhat is shown in these images? Describe the first image first. Then describe the second.",
    },
]

cat = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "cat.jpg"
numbers = Path.cwd().parent.parent / "tests" / "llm" / "sample_images" / "numbers.png"
images = [cat, numbers]

response = chat_completion_image(
    messages=messages,
    images=images,
    model_processor=(model, processor),
    max_tokens=500,
)

print(f"Completion Tokens: {response['completion_tokens']}")
response["message"]

Completion Tokens: 179


"The first image features a close-up of a cat's face. The cat appears to have a mix of grey and white fur, and its eyes are a striking green color. The cat's nose is prominent and pinkish, and its whiskers are long and white. The background is blurred, focusing attention on the cat's face.\n\nThe second image is a collection of handwritten numbers and symbols. They are scattered across the image and vary in size and orientation. The numbers are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12. The symbols include a circle, a square, a triangle, and a cross. The numbers and symbols are black and are placed on a white background."

In [6]:
# Try inference without images
messages = [
    {
        "role": "user",
        "content": "What is 2+2?",
    },
]

response = chat_completion_image(
    messages=messages,
    images=None,
    model_processor=(model, processor),
    max_tokens=500,
)

response["message"]

'2+2 is equal to 4.'