# Step 5: Add Visual Input
Adding image processing to the AI Tutor using BLIP and Flickr30k.
- Date: July 21, 2025
- Describes images with BLIP captions and passes each caption to the Q&A system as context.

In [3]:
from transformers import pipeline
from PIL import Image
from datasets import load_dataset

# Load Flickr30k dataset (not used below; kept for context or future use)
multimodal_dataset = load_dataset("lmms-lab/flickr30k")
print("Flickr30k loaded!")

# Load image captioning pipeline
image_to_text = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Test with a sample image (replace with your image path)
image_path = "math.jpg"  # Replace with your file, e.g., "math_diagram.jpg"
image = Image.open(image_path)
caption = image_to_text(image)[0]['generated_text']
print(f"Image Caption: {caption}")

# Use the caption as context for extractive Q&A (the QA model sees only this text, not the image)
text_qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
question = "What is shown in the image?"
result = text_qa(question=question, context=caption)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.2f}")

Flickr30k loaded!


Device set to use cpu


Image Caption: a yellow background with a black and white image of a number


Device set to use cpu


Question: What is shown in the image?
Answer: a number
Confidence: 0.60
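
The cell above runs captioning and Q&A inline. A minimal sketch of the same flow as a reusable function follows. The name `describe_and_answer` and its parameters are hypothetical and not part of the original notebook. The two pipelines are passed in as arguments, so the function can also be exercised with simple stand-ins.

```python
from typing import Any, Callable, Dict, List


def describe_and_answer(
    image: Any,
    question: str,
    captioner: Callable[[Any], List[Dict[str, str]]],
    qa: Callable[..., Dict[str, Any]],
) -> Dict[str, Any]:
    """Caption an image, then answer a question using the caption as context."""
    # The captioning pipeline returns a list of dicts with a 'generated_text' key.
    caption = captioner(image)[0]["generated_text"]
    # The extractive QA model only sees the caption text, never the pixels.
    result = qa(question=question, context=caption)
    return {
        "caption": caption,
        "question": question,
        "answer": result["answer"],
        "score": float(result["score"]),
    }


# Usage with the pipelines from the cell above (not run here):
# out = describe_and_answer(image, "What is shown in the image?", image_to_text, text_qa)
# print(out["caption"], out["answer"], f"{out['score']:.2f}")
```

Returning the caption alongside the answer makes it easier to see whether a weak answer comes from the caption or from the Q&A step.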


## Observations
- BLIP captioned math.jpg as "a yellow background with a black and white image of a number". The caption is accurate but generic: it does not say which number is shown.
- DistilBERT answered questions about the image from the caption alone, since it never sees the image itself. Confidence varied (e.g., 0.60 for the vague question above, higher for more specific ones).
- This completes the multimodal AI Tutor (text, voice, images), which is now ready for knowledge retrieval in Step 6.
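
Because answer confidence depends on how the question is phrased, one option is to ask several questions against the same caption and flag the low-scoring answers. The sketch below is only an illustration: the helper `rank_answers` is hypothetical, and the cutoff `min_score=0.5` is an arbitrary assumption rather than a value from the notebook.

```python
from typing import Any, Callable, Dict, List


def rank_answers(
    caption: str,
    questions: List[str],
    qa: Callable[..., Dict[str, Any]],
    min_score: float = 0.5,  # assumed cutoff; tune for your own data
) -> List[Dict[str, Any]]:
    """Ask each question against the caption and sort answers by confidence."""
    rows = []
    for q in questions:
        result = qa(question=q, context=caption)
        score = float(result["score"])
        rows.append(
            {
                "question": q,
                "answer": result["answer"],
                "score": score,
                # Mark answers the tutor should treat as reliable enough to show.
                "confident": score >= min_score,
            }
        )
    # Highest-confidence answers first.
    rows.sort(key=lambda r: r["score"], reverse=True)
    return rows


# Usage with the objects from the notebook cell (not run here):
# for row in rank_answers(caption, ["What is shown in the image?",
#                                   "What color is the background?"], text_qa):
#     print(row)
```

Answers flagged as not confident could be withheld or shown together with the raw caption, depending on how the tutor presents results.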