# Florence 2: Vision Language Model with downstream tasks

- Minor Applied AI, Hogeschool van Amsterdam
- Michiel Bontenbal & Maarten Post
- 24 October 2024, Computer Vision Lecture 4

Florence 2 is a relatively small vision language model (VLM) by Microsoft. Two versions of the model are available:
1. Florence 2 Base. About 0.5 GB, with acceptable quality.
2. Florence 2 Large. About 1.5 GB, with much better quality.
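
The two sizes roughly match a back-of-the-envelope estimate: parameter count times bytes per parameter. The sketch below assumes the approximate parameter counts listed on the Hugging Face model cards (0.23B for base, 0.77B for large) and 16-bit weights; the helper name `weights_size_gb` is made up for this illustration.

```python
def weights_size_gb(num_params, bytes_per_param=2):
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Approximate parameter counts (assumption: taken from the model cards)
florence_params = {"Florence-2-base": 0.23e9, "Florence-2-large": 0.77e9}

for name, n in florence_params.items():
    fp16 = weights_size_gb(n)        # 2 bytes per parameter
    fp32 = weights_size_gb(n, 4)     # 4 bytes per parameter
    print(f"{name}: ~{fp16:.2f} GB in float16, ~{fp32:.2f} GB in float32")
```

Loading the model in float32 (the CPU fallback in this notebook) roughly doubles the memory needed for the weights.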

Florence 2 is a foundation model that can be applied to several computer vision (CV) tasks. In this notebook we'll do three of them: image captioning, object detection, and dense region captioning.
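
Florence 2 selects a task through a special prompt token. Here is a minimal sketch of the three tasks used in this notebook, assuming the token names from the Florence-2 model card (`<CAPTION>`, `<OD>`, `<DENSE_REGION_CAPTION>`). The dictionary `TASK_PROMPTS` and the helper `task_prompt_for` are only illustrative and not part of the library.

```python
# Task name -> Florence-2 task prompt token (token names as listed on the model card)
TASK_PROMPTS = {
    "image caption": "<CAPTION>",
    "object detection": "<OD>",
    "dense region caption": "<DENSE_REGION_CAPTION>",
}

def task_prompt_for(task_name):
    """Look up the prompt token for a task name (case-insensitive)."""
    key = task_name.strip().lower()
    if key not in TASK_PROMPTS:
        raise ValueError(f"Unknown task {task_name!r}; choose from {sorted(TASK_PROMPTS)}")
    return TASK_PROMPTS[key]

print(task_prompt_for("Object Detection"))
```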

In this notebook we'll use Florence 2 with the elephants dataset, which you already know from ResNet.

### Contents
1. First image
2. Describe images from Elephant Dataset
3. Change the task: do Object Detection

----
source: https://huggingface.co/microsoft/Florence-2-base

In [32]:
import requests
import torch

from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM 

# Device and precision: use the GPU with float16 when available, otherwise the CPU with float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Select the model: Florence-2-base is about 0.5 GB, Florence-2-large about 1.5 GB. Large has better quality.
model_id = 'microsoft/Florence-2-base'
#model_id = 'microsoft/Florence-2-large'

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

In [None]:
#download, process and show the image
from PIL import Image
#url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
url= 'https://huggingface.co/datasets/MichielBontenbal/elephants/resolve/main/olifant_foto3.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image

In [34]:
prompt = "Describe this image"

In [40]:
#load the function
import torch

def generate_image_description(model, processor, prompt, image, device):
    """
    Generate a description for an image using a given model and processor.
    """
    # Prepare inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)

    # Generate text
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )

    # Decode generated text
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # Post-process the generated text; key the result by the prompt that was actually used
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer



In [None]:
# Call the function:
result = generate_image_description(model, processor, "Describe this image:", image, device)
print(result)

## 2. Load the dataset and describe images

In [None]:
!git clone https://huggingface.co/datasets/MichielBontenbal/elephants

In [None]:
#load the images
import glob
image_list = glob.glob('./elephants/*.jpg')
image_list
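
`glob.glob` returns paths in whatever order the file system lists them, so the image behind a given index can differ between machines. A small sketch that sorts the paths first is shown here. The helper name `list_images` is hypothetical, and the folder is assumed to be the `./elephants` clone from the previous cell.

```python
from pathlib import Path

def list_images(folder, pattern="*.jpg"):
    """Return the matching image paths as strings, sorted by file name."""
    return sorted(str(p) for p in Path(folder).glob(pattern))

image_list = list_images("./elephants")
print(len(image_list), "images found")
```

With a sorted list, `image_list[5]` refers to the same file on every run.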

In [None]:
#select the image by changing the index
image = Image.open(image_list[5])
print(type(image))
image

In [None]:
# Call the function:
result = generate_image_description(model, processor, "Describe this image:", image, device)
print(result)
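
The cells above describe one image at a time. To caption the whole folder, the same call can be wrapped in a loop. The sketch below keeps the model out of the helper by passing in a describe function. `describe_all` and `describe_fn` are hypothetical names introduced for this illustration.

```python
from pathlib import Path

def describe_all(image_paths, describe_fn):
    """Apply describe_fn to every path and return {file name: description}."""
    return {Path(p).name: describe_fn(p) for p in sorted(image_paths)}

# Stand-in describe function so the sketch runs without the model
demo = describe_all(["b.jpg", "a.jpg"], lambda p: f"description of {p}")
print(demo)
```

In this notebook, `describe_fn` could be `lambda p: generate_image_description(model, processor, "Describe this image:", Image.open(p), device)`.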

## 3. Object detection

In [44]:
def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Move the inputs to the same device and dtype as the model
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      early_stopping=False,
      do_sample=False,
      num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )

    return parsed_answer

In [None]:
task_prompt = '<OD>'
results = run_example(task_prompt)
print(results)
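
The parsed result for `<OD>` holds two parallel lists, `bboxes` (as `[x1, y1, x2, y2]`) and `labels`, which is the structure the plotting function in the next cell relies on. Below is a minimal sketch of working with that structure. The sample dictionary is made up for illustration and is not real model output, and `summarize_detections` is a hypothetical helper.

```python
from collections import Counter

# Made-up example in the same shape as the '<OD>' output; not real model output
sample_results = {
    "<OD>": {
        "bboxes": [[10.0, 20.0, 110.0, 220.0], [150.0, 40.0, 300.0, 200.0], [5.0, 5.0, 50.0, 30.0]],
        "labels": ["elephant", "elephant", "tree"],
    }
}

def summarize_detections(od_result):
    """Count detections per label and compute each box area in pixels."""
    counts = Counter(od_result["labels"])
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in od_result["bboxes"]]
    return counts, areas

counts, areas = summarize_detections(sample_results["<OD>"])
print(dict(counts))
print(areas)
```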

In [47]:
import matplotlib.pyplot as plt
import matplotlib.patches as patches


def plot_bbox(image, data):
    # Create a figure and axes
    fig, ax = plt.subplots()

    # Display the image
    ax.imshow(image)

    # Plot each bounding box
    for bbox, label in zip(data['bboxes'], data['labels']):
        # Unpack the bounding box coordinates
        x1, y1, x2, y2 = bbox
        # Create a Rectangle patch
        rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1, edgecolor='r', facecolor='none')
        # Add the rectangle to the Axes
        ax.add_patch(rect)
        # Annotate the label
        ax.text(x1, y1, label, color='white', fontsize=8, bbox=dict(facecolor='red', alpha=0.5))

    # Remove the axis ticks and labels
    ax.axis('off')

    # Show the plot
    plt.show()

In [None]:
plot_bbox(image, results['<OD>'])

## Exercise 1: Change to another task: Dense region caption

1. Look up the model card at https://huggingface.co/microsoft/Florence-2-base.
2. There you will find another Jupyter notebook with example code.
3. In that notebook, look for the task 'Dense region caption'.


In [49]:
#YOUR CODE HERE

### Reflection questions

Here you show that you have understood the code.

1. How large is florence-2-base, and how large is florence-2-large?
2. How large is Llava-Next? (Search Hugging Face for llava-hf/llava-v1.6-mistral-7b-hf. The .safetensors files are listed under 'Files and versions'. How many are there, and how large are they together?)
3. Which tasks can you perform with Florence-2?
4. Regarding the elephants dataset: how much better is Florence 2 than ResNet on it?
