## Text input

https://platform.openai.com/docs/models

In [1]:
from dotenv import load_dotenv
load_dotenv()

True

In [24]:
import os
from langchain_ollama import ChatOllama
from langchain.agents import create_agent

model_name="qwen3-vl:2b"          # Multimodal (vision-language) model, install using:   ollama pull qwen3-vl:2b
model_url=os.getenv('OLLAMA_HOST')

model = ChatOllama(
    model=model_name,
    base_url=model_url                # ChatOllama takes the server address as base_url
)

agent = create_agent(
    model=model,
)
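`os.getenv('OLLAMA_HOST')` returns `None` when the variable is unset, and the value is sometimes written without a scheme (for example `127.0.0.1:11434`). The following is a minimal sketch of one way to normalize it before passing it to the model. `resolve_ollama_url` and `DEFAULT_OLLAMA_URL` are hypothetical names introduced here, not part of `langchain_ollama`, and the fallback assumes a local Ollama server on its default port.

```python
import os

# Assumption: a local Ollama server listening on its default port.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def resolve_ollama_url(raw):
    """Turn an OLLAMA_HOST-style value into a base URL (hypothetical helper)."""
    if not raw or not raw.strip():
        # Variable unset or empty: fall back to the local default.
        return DEFAULT_OLLAMA_URL
    url = raw.strip().rstrip("/")
    if "://" not in url:
        # Bare "host:port" values get an explicit http scheme.
        url = "http://" + url
    return url


model_url = resolve_ollama_url(os.getenv("OLLAMA_HOST"))
```

The resulting `model_url` can then be passed to `ChatOllama` in place of the raw environment value.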

In [26]:
from langchain.messages import HumanMessage

question = HumanMessage(content=[
    {"type": "text", "text": "What is the capital of The Moon?"}
])

response = agent.invoke(
    {"messages": [question]}
)

print(response['messages'][-1].content)

The question "What is the capital of The Moon?" is likely referencing a **fictional or metaphorical context**, as there is no real-world capital for the Moon (Earth's natural satellite). However, it could be tied to a specific reference:

### 🌕 **In the context of The Beatles' song "The Moon" (1970s), the capital of "The Moon" is "Nephi"**.  
- **Explanation**: The song describes the Moon as a celestial body, and the city on it is named **Nephi** (a reference to a biblical city). This is part of the song's lyrics:  
  > *"The Moon's a place where I can go,  
  > And I'm in the moon of my own mind,  
  > And I want to see what I'm doing with you,  
  > I want to go to the Moon, it's the place I want to go."*  
  (While the song does not explicitly state "Nephi" as the capital, it is commonly associated with the fictional city on the Moon in the song's narrative.)

### ❓ **Why this is the answer**:  
- The phrase "The Moon" in this context refers to a **fictional location** (as in t
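The cell above prints `.content` directly, which works when the model returns a plain string. Message content can also be a list of content blocks (the same shape used for the question), so a small flattening step can be useful. This is a sketch only: `message_text` is a hypothetical helper, and it assumes text blocks look like `{"type": "text", "text": "..."}`.

```python
def message_text(content):
    """Flatten message content to plain text (hypothetical helper).

    Assumes content is a string, or a list whose items are strings or
    dicts shaped like {"type": "text", "text": "..."}.
    """
    if isinstance(content, str):
        return content
    parts = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
        # Non-text blocks (images, audio, ...) are skipped.
    return "".join(parts)
```

With this in place, `print(message_text(response['messages'][-1].content))` behaves the same for both content shapes.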

## Image input

In [None]:
from ipywidgets import FileUpload
from IPython.display import display

uploader = FileUpload(accept='.png', multiple=False)
display(uploader)

# Creative commons image. source: https://pxhere.com/es/photo/720406

FileUpload(value=(), accept='.png', description='Upload')

In [27]:
print(uploader.value)

({'name': 'image.png', 'type': 'image/png', 'size': 66595, 'content': <memory at 0x000001E52ED681C0>, 'last_modified': datetime.datetime(2025, 12, 25, 1, 39, 9, 340000, tzinfo=datetime.timezone.utc)},)


In [28]:
import base64

# Get the first (and only) uploaded file dict
uploaded_file = uploader.value[0]

# This is a memoryview
content_mv = uploaded_file["content"]

# Convert memoryview -> bytes
img_bytes = bytes(content_mv)  # or content_mv.tobytes()

# Now base64 encode
img_b64 = base64.b64encode(img_bytes).decode("utf-8")
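The steps above hard-code PNG. If the uploader were opened to other formats, the MIME type could be inferred from the file's leading bytes instead. The sketch below shows one way to do that and to build the image content block in the shape this notebook passes to `HumanMessage`. `image_block` and `_SIGNATURES` are hypothetical names, and only a few common formats are covered.

```python
import base64

# Leading byte signatures for a few common image formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def image_block(data):
    """Build an image content block from raw bytes or a memoryview (hypothetical helper)."""
    data = bytes(data)  # accepts the memoryview produced by FileUpload
    for signature, mime_type in _SIGNATURES:
        if data.startswith(signature):
            return {
                "type": "image",
                "base64": base64.b64encode(data).decode("ascii"),
                "mime_type": mime_type,
            }
    raise ValueError("unrecognized image format")
```

For example, `image_block(uploader.value[0]["content"])` would produce the block for the uploaded file, raising `ValueError` if the bytes are not a recognized image.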

In [29]:
multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this capital"},
    {"type": "image", "base64": img_b64, "mime_type": "image/png"}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)

The image you provided shows the **Eiffel Tower**, a world-famous landmark located in **Paris**, France. However, it's important to clarify a key point: **Paris is the capital of France**, not just any city. Here’s a concise explanation of why Paris holds this role and why it’s so iconic:  

### 🌍 **Paris: The Capital of France**  
- **Location**: Paris is a city in the *Seine Valley* of northern France, where the Seine River flows through the city. It serves as the political, cultural, economic, and administrative center of France.  
- **History**: Paris has a rich history spanning over 2,000 years, from its ancient roots (including the Roman *Colisée* and medieval fortifications) to its modern identity. It’s a city of global significance, with iconic landmarks like the *Notre-Dame Cathedral* (built in the 12th century), the *Louvre Museum* (founded in the 12th century), and the *Arc de Triomphe* (a symbol of French military history).  
- **Culture & Appeal**: Known for its 

## Audio input

In [32]:
import sounddevice as sd
from scipy.io.wavfile import write
import base64
import io
import time
from tqdm import tqdm

# Recording settings
duration = 5  # seconds
sample_rate = 44100

print("Recording...")
audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
# Progress bar for the duration
for _ in tqdm(range(duration * 10)):   # update 10× per second
    time.sleep(0.1)
sd.wait()
print("Done.")

# Write WAV to an in-memory buffer
buf = io.BytesIO()
write(buf, sample_rate, audio)
wav_bytes = buf.getvalue()

aud_b64 = base64.b64encode(wav_bytes).decode("utf-8")
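Recording requires a microphone and the `sounddevice` package. For testing the rest of the pipeline on a machine without audio hardware, a synthetic WAV can stand in for the recording. The sketch below uses only the standard library to generate a mono 16-bit sine tone in memory. `sine_wav_bytes` is a hypothetical helper, and the tone parameters are arbitrary.

```python
import base64
import io
import math
import struct
import wave


def sine_wav_bytes(freq_hz=440.0, duration_s=1.0, sample_rate=44100):
    """Return a mono 16-bit PCM WAV file, as bytes, containing a sine tone."""
    n_frames = int(duration_s * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        # Half of full scale to leave headroom.
        sample = int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return buf.getvalue()


# Base64-encode the same way as the recorded audio.
test_aud_b64 = base64.b64encode(sine_wav_bytes()).decode("utf-8")
```

`test_aud_b64` can be substituted for `aud_b64` when building the audio message.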

Recording...
100%|██████████| 50/50 [00:05<00:00,  9.73it/s]
Done.

In [37]:
model_name="hf.co/Hack337/WavGPT-1.5-GGUF:latest"   # Audio to text model, install using:   ollama run hf.co/Hack337/WavGPT-1.5-GGUF

model = ChatOllama(
    model=model_name,
    base_url=model_url                # ChatOllama takes the server address as base_url
)

agent = create_agent(
    model=model,
)

In [None]:
multimodal_question = HumanMessage(content=[
    {"type": "text", "text": "Tell me about this audio file"},
    {"type": "audio", "base64": aud_b64, "mime_type": "audio/wav"}
])

response = agent.invoke(
    {"messages": [multimodal_question]}
)

print(response['messages'][-1].content)