# Exploring Llama 3.2-Vision (locally) with Ollama

### Imports

In [3]:
import ollama

### Pull model

In [4]:
ollama.pull('llama3.2-vision')

ProgressResponse(status='success', completed=None, total=None, digest=None)
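
The call above reports only the final status. For a large download it can help to watch progress as it happens. This is a minimal sketch, assuming `ollama.pull` accepts `stream=True` and yields updates with the same `status`, `completed`, and `total` fields shown in the output above. `format_progress` is a hypothetical helper, not part of the Ollama library, and the commented usage assumes the update objects support `.get()`.

```python
def format_progress(status, completed=None, total=None):
    """Render one pull update as a short status line (hypothetical helper)."""
    # completed/total are None for steps that do not transfer bytes
    if completed is not None and total:
        return f"{status}: {100 * completed / total:.1f}%"
    return status


# Assumed usage with the Ollama client (needs a running Ollama server):
# for update in ollama.pull('llama3.2-vision', stream=True):
#     print(format_progress(update['status'], update.get('completed'), update.get('total')))

# Offline demonstration with hand-written values
print(format_progress('success'))
print(format_progress('pulling', 50, 200))
```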

#### Basic Usage

In [19]:
import requests
import base64

# Step 1: Download the image from a URL
url = "https://media.istockphoto.com/id/1372362461/photo/handome-young-indian-man-chatting-with-girlfriend-using-smartphone.jpg?s=612x612&w=0&k=20&c=6JjzP-NldQGkePmmwEttJEDJtTTLwh_Xu05PzgssGFg="
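http_response = requests.get(url, timeout=30)
http_response.raise_for_status()  # stop here on an HTTP error instead of encoding an error page
image_bytes = http_response.content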

# Step 2: Convert image to base64 string
image_base64 = base64.b64encode(image_bytes).decode('utf-8')


response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': [image_base64]
    }]
)

print(response['message']['content'])

The image depicts a man sitting comfortably on a chair with his legs crossed, exuding a sense of relaxation. His attire consists of a light green button-up shirt over a white t-shirt and brown pants, complemented by white shoes that add a touch of elegance to the overall look. The man's curly hair adds a layer of texture to the image, while his right hand is positioned on his leg, suggesting a moment of contemplation or conversation.

The background of the image is a plain wall with no decorations, which creates a sense of simplicity and minimalism. However, the light-colored floor adds a touch of warmth and coziness to the setting, making it feel inviting and comfortable. Overall, the image conveys a feeling of calmness and serenity, suggesting that the man is enjoying some quiet time or perhaps engaging in a conversation with someone off-camera.
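
The download, encode, and message-building steps repeat in every cell of this notebook, so the encoding and message construction can be folded into one function. This is a minimal sketch: `build_vision_message` is a hypothetical helper name, and the payload below is a three-byte stand-in rather than a real image.

```python
import base64


def build_vision_message(prompt, image_bytes):
    """Return a user message dict with the image attached as base64 text."""
    image_base64 = base64.b64encode(image_bytes).decode('utf-8')
    return {'role': 'user', 'content': prompt, 'images': [image_base64]}


# Stand-in payload; in the notebook this would be the downloaded image_bytes
message = build_vision_message('What is in this image?', b'abc')
print(message['images'])

# Assumed usage, mirroring the cell above:
# response = ollama.chat(model='llama3.2-vision', messages=[message])
```

Keeping the helper free of network calls makes it easy to check on its own, while the download step stays in the notebook cell.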


#### Image captioning - streaming

In [21]:
# Step 1: Download the image from a URL
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTmrzE7JlZSe8wklhE2Qa-dFaA0rx9v-olBdQ&s"
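http_response = requests.get(url, timeout=30)
http_response.raise_for_status()  # stop here on an HTTP error instead of encoding an error page
image_bytes = http_response.content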

# Step 2: Convert image to base64 string
image_base64 = base64.b64encode(image_bytes).decode('utf-8')


stream = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Can you write a caption for this image?',
        'images': [image_base64]
    }],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

This image depicts a person standing in an empty outdoor stadium, surrounded by rows of white seats. The individual is attired in a red and black long-sleeved shirt and dark pants, facing away from the camera.

The stadium's seating area stretches out behind the person, with a lush green field on the left side of the image. In the background, a blue sky dominates the scene, punctuated by three light poles rising above the stadium.

The overall atmosphere suggests that the individual is likely a maintenance worker or groundskeeper, engaged in routine tasks within the stadium.
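
The streaming loop above prints each chunk and then discards it. To keep the full caption for later use, the chunks can be accumulated while printing. This is a sketch, assuming each chunk exposes `chunk['message']['content']` as in the loop above. `print_and_collect` is a hypothetical helper, and `fake_stream` is a hand-written stand-in for the real stream.

```python
def print_and_collect(stream):
    """Print streamed text as it arrives and return the joined result."""
    parts = []
    for chunk in stream:
        text = chunk['message']['content']
        print(text, end='', flush=True)
        parts.append(text)
    print()  # finish the line once the stream is exhausted
    return ''.join(parts)


# Stand-in for the iterator returned by ollama.chat(..., stream=True)
fake_stream = [
    {'message': {'content': 'An empty '}},
    {'message': {'content': 'stadium.'}},
]
caption = print_and_collect(fake_stream)
```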

#### Explaining memes

In [22]:
# Step 1: Download the image from a URL
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRgR9Jz6xUCXTN7ssNt3854QwmhbBJWKb4Fkw&s"
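http_response = requests.get(url, timeout=30)
http_response.raise_for_status()  # stop here on an HTTP error instead of encoding an error page
image_bytes = http_response.content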

# Step 2: Convert image to base64 string
image_base64 = base64.b64encode(image_bytes).decode('utf-8')


stream = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Can you explain this meme to me?',
        'images': [image_base64]
    }],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

This meme is a play on the game show "Who Wants To Be A Millionaire" and features a photo of Steve Harvey, the host of the show. The image shows Steve Harvey with a confused expression on his face.

The text at the top of the image reads "*Male black widow spiders". Below this are two options: "A) Offer yourself as food" and "B) Offer yourself as food".

The joke is that the question being asked is about how to attract a male black widow spider, which is not something you would typically do on a game show. The humor comes from the unexpected twist of having a serious-looking game show host like Steve Harvey asking a silly question about spiders.

Overall, the meme pokes fun at the idea of taking a serious situation (a game show) and turning it into something ridiculous (attracting black widow spiders). It's a lighthearted way to poke fun at the absurdity of life.
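
The exchange above is a single turn. Each `ollama.chat` call sees only the `messages` it is given, so a follow-up question about the same meme needs the earlier turns passed back in. This is a minimal sketch of keeping that history. `add_turn` is a hypothetical helper, and the strings in angle brackets are placeholders, not real data.

```python
def add_turn(messages, role, content, images=None):
    """Append one turn to the history and return the list for convenience."""
    turn = {'role': role, 'content': content}
    if images:
        turn['images'] = images  # only attach the key when there is an image
    messages.append(turn)
    return messages


history = []
add_turn(history, 'user', 'Can you explain this meme to me?', images=['<base64 image>'])
add_turn(history, 'assistant', '<model reply from the first call>')
add_turn(history, 'user', 'Why are both answer options the same?')

print([turn['role'] for turn in history])

# Assumed usage:
# stream = ollama.chat(model='llama3.2-vision', messages=history, stream=True)
```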

#### OCR

In [24]:
# Step 1: Download the image from a URL
url = "https://cdsassets.apple.com/live/7WUAS350/images/ios/ios-18-iphone-16-pro-notes-text-formatting-options.png"
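http_response = requests.get(url, timeout=30)
http_response.raise_for_status()  # stop here on an HTTP error instead of encoding an error page
image_bytes = http_response.content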

# Step 2: Convert image to base64 string
image_base64 = base64.b64encode(image_bytes).decode('utf-8')

stream = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'Can you transcribe the text from this screenshot in a markdown format?',
        'images': [image_base64]
    }],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

The image shows two screenshots of an iPhone screen, one on top of the other. The top screenshot is slightly larger than the bottom one.

**Top Screenshot:**

*   A yellow highlighted line runs across the middle of the screen with "Bird Spotting" written inside it.
*   Below this are some details about a ruby-throated hummingbird:
    *   Date: June 1
    *   Time: 3:05 p.m.
    *   Location: Backyard hummingbird feeder
    *   Description: Silver throat and belly, emerald green back. Markings of a female as it lacked the distinctive red throat that males have.

**Bottom Screenshot:**

*   The bottom screenshot is slightly smaller and shows a formatting menu with options for title, heading, subheading, and body text.
*   There are several lines of black text below these options, but they are not fully visible due to the overlap with the top screenshot.
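
The reply above describes the screenshot more than it transcribes it. One thing to try, offered as an untested sketch, is a more explicit prompt combined with a `temperature` of 0 passed through the `options` argument of `ollama.chat`. Whether this improves transcription with this model is an assumption. `build_ocr_request` is a hypothetical helper, and the image string below is a placeholder.

```python
def build_ocr_request(image_base64, model='llama3.2-vision'):
    """Assemble keyword arguments for a transcription-focused chat call."""
    prompt = (
        'Transcribe all text visible in this screenshot as markdown. '
        'Output only the transcription, with no description of the image.'
    )
    return {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt, 'images': [image_base64]}],
        'options': {'temperature': 0},  # assumption: lower randomness suits transcription
        'stream': True,
    }


request = build_ocr_request('<base64 image>')
print(request['options'])

# Assumed usage:
# stream = ollama.chat(**request)
```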