Suppose we want to build a chatbot that can retrieve images from a textbook or a repository we give it, choosing them by how relevant each image is to the user's query. How would we do that? This Jupyter notebook covers the first part of that process. We extract the images from a PDF, send each image to OpenAI to get a description of it, and then store the images in one folder and the descriptions in a separate folder as a .json file. We'll then use these descriptions in the Image Picker Jupyter notebook to find the image that is most relevant to the user's query.

We start by importing several modules: os gives us access to file paths, fitz (PyMuPDF) opens PDF documents, PIL's Image module opens images, io lets us treat raw bytes as a file-like object, base64 encodes images so they can be sent to OpenAI, requests sends HTTP requests to a URL, and json reads and writes .json files.

In [None]:
#pip install PyMuPDF Pillow requests

In [None]:
import os
import fitz  # PyMuPDF
from PIL import Image
import io
import base64
import requests
import json

We start by setting our API key as usual. We then define the path to the PDF we want to extract images from (in this case I have used the quantum notes from Year 2 F2A) and the document_name, which we'll use later to name the extracted images and the files we create. We also define two directories: one that will hold the descriptions of the images and one that will hold the images themselves. Make sure to change the directories to locations on your computer.

In [None]:
api_key = ""

pdf_file_path = r""
document_name = ""

description_directory = r""
image_directory = r""
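
Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper name load_api_key and the variable name OPENAI_API_KEY are assumptions made for illustration; they are not part of the original notebook.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    The function name and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key
```

With the variable set in the shell that launches Jupyter, the configuration cell could then use api_key = load_api_key() in place of the empty string.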

We then define two functions. extract_images_from_pdf first checks that the file exists, then opens it, goes through each page, retrieves the images on it, indexes them, closes the document, and returns a list of the image data. save_image takes image data and an output directory, opens the image, and saves it to that directory. It builds the file name from the page number and image index stored in the image data, together with the document name we defined previously. We can then run both functions, feeding the image data returned by extract_images_from_pdf into save_image, which saves the images in our image_directory.

In [None]:
def extract_images_from_pdf(pdf_file):
    if not os.path.isfile(pdf_file):
        print(f"Error: The file '{pdf_file}' does not exist.")
        return None
    
    pdf_document = fitz.open(pdf_file)
    image_list = []

    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        page_images = page.get_images(full=True)

        for img_idx, img_info in enumerate(page_images):
            xref = img_info[0]
            base_image = pdf_document.extract_image(xref)

            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            
            image_data = {
                "page_number": page_num + 1,
                "image_index": img_idx + 1,
                "image_extension": image_ext,
                "image_data": image_bytes
            }
            image_list.append(image_data)

    pdf_document.close()
    return image_list

def save_image(image_data, output_dir):
    image_bytes = image_data["image_data"]
    image_ext = image_data["image_extension"]
    
    # Load image from bytes
    img = Image.open(io.BytesIO(image_bytes))

    # Save image to output directory, creating it first if needed
    os.makedirs(output_dir, exist_ok=True)

    image_filename = f"source{document_name}_page{image_data['page_number']}_image{image_data['image_index']}.{image_ext}"
    image_path = os.path.join(output_dir, image_filename)
    img.save(image_path)

    print(f"Image saved: {image_path}")

In [None]:
extracted_images = extract_images_from_pdf(pdf_file_path)
    
if extracted_images:
    for image_data in extracted_images:
        save_image(image_data, image_directory)
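
Because each file name encodes the document name, page number and image index, that information can be recovered later from the name alone. The sketch below builds and parses names in the same scheme used by save_image and sorts files in page order. The helper names build_image_filename, parse_image_filename and page_order_key are hypothetical.

```python
import re

# Matches names of the form source<document>_page<N>_image<M>.<ext>
FILENAME_PATTERN = re.compile(
    r"^source(?P<document>.*)_page(?P<page>\d+)_image(?P<index>\d+)\.(?P<ext>\w+)$"
)

def build_image_filename(document_name, page_number, image_index, image_ext):
    """Build a file name in the same scheme that save_image uses."""
    return f"source{document_name}_page{page_number}_image{image_index}.{image_ext}"

def parse_image_filename(filename):
    """Return the parts encoded in a file name, or None if it does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return {
        "document": match.group("document"),
        "page_number": int(match.group("page")),
        "image_index": int(match.group("index")),
        "image_extension": match.group("ext"),
    }

def page_order_key(filename):
    """Sort key that orders files by page number, then image index."""
    info = parse_image_filename(filename)
    if info is None:
        return (float("inf"), 0)  # unrecognised names go last
    return (info["page_number"], info["image_index"])
```

Sorting with key=page_order_key places page 2 before page 10, which a plain alphabetical sort of the file names would not do.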

Next we define the encode_image function, which takes the path to an image file, encodes its contents in base64, and returns the encoded string. We need to encode the image in base64 so that it can be embedded directly in the JSON request we send to OpenAI.

In [None]:
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
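
The encoded string is embedded in a data URL, whose prefix should name the image's actual media type (for example image/png for a .png file). As a rough sketch, the standard library's mimetypes module can guess that type from the file extension. The function name image_to_data_url is an assumption for illustration.

```python
import base64
import mimetypes

def image_to_data_url(image_path):
    """Return a base64 data URL for an image file (hypothetical helper)."""
    # Guess the media type from the file extension, e.g. ".png" -> "image/png"
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image type for {image_path!r}")

    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")

    return f"data:{mime_type};base64,{encoded}"
```

The guess is based only on the extension, so it assumes the files were saved with the extension that matches their contents.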

Now we'll send each image to gpt-4o to interpret and describe. We start by defining the headers, which tell OpenAI what content type to expect and carry our API key. We then loop through all the files in the image_directory folder and call encode_image on each one as we build a "payload" to send to OpenAI. This payload contains everything needed for the message, including a token limit. The max_tokens limit caps how many tokens the model may generate in its reply; it does not reduce the cost of the image input itself, but it stops long, expensive responses. We then send the payload by calling requests.post, which returns the model's response, and we store that in a variable. We then extract just the text from the response and add it to a list called descriptions. We ask for a short description to reduce the time the model takes to answer and to increase the accuracy of our retrieval system in the Image Picker notebook.

In [None]:
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}

descriptions = []

for filename in os.listdir(image_directory):
    if filename.endswith((".jpeg", ".jpg", ".png")):
        image_path = os.path.join(image_directory, filename)
        
        # Getting the base64 string
        base64_image = encode_image(image_path)
        
        payload = {
            "model": "gpt-4o",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Use a short but scientific description to describe what is in this image. It should be no longer than 10 words."
                        },
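                        {
                            "type": "image_url",
                            "image_url": {
                                # Label the data URL with the file's actual type (PNG or JPEG)
                                "url": f"data:image/{'png' if filename.endswith('.png') else 'jpeg'};base64,{base64_image}"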
                            }
                        }
                    ]
                }
            ],
            "max_tokens": 300
        }
        
        # A timeout stops the loop from hanging indefinitely on a stalled request
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=60)
        
        # Extracting just the text from the response
        if response.status_code == 200:
            response_json = response.json()
            text_content = response_json['choices'][0]['message']['content']
            descriptions.append(text_content)
        else:
            print(f"Request for {filename} failed with status code: {response.status_code}")
            print(response.text)
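
The payload construction and the response parsing described above can be pulled out into two small functions, which makes them easy to check without calling the API. This is only a sketch: the names build_payload and extract_description are hypothetical, and the response shape is the one the loop above already relies on.

```python
def build_payload(data_url, prompt, model="gpt-4o", max_tokens=300):
    """Build a chat completions payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        # Upper bound on the number of tokens in the generated reply
        "max_tokens": max_tokens,
    }

def extract_description(response_json):
    """Return the reply text, or None if the response has an unexpected shape."""
    try:
        return response_json["choices"][0]["message"]["content"].strip()
    except (KeyError, IndexError, TypeError, AttributeError):
        return None
```

Returning None for a malformed response lets the calling loop decide whether to skip the image or record a placeholder, rather than stopping on an exception partway through the folder.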

Finally, we build the file name and path for our list of descriptions and save the list there as a .json file.

In [None]:
file_name = f"{document_name}_descriptions.json"
file_path = os.path.join(description_directory, file_name)

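# Create the description directory if it does not exist yet
os.makedirs(description_directory, exist_ok=True)

with open(file_path, "w") as file:
    json.dump(descriptions, file)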
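
To check that the file was written correctly, it can be read straight back. The sketch below wraps the save step and a matching load step in two functions; the names save_descriptions and load_descriptions are hypothetical, and the file naming follows the cell above.

```python
import json
import os

def save_descriptions(descriptions, directory, document_name):
    """Write the descriptions to <document_name>_descriptions.json and return its path."""
    os.makedirs(directory, exist_ok=True)
    file_path = os.path.join(directory, f"{document_name}_descriptions.json")
    with open(file_path, "w", encoding="utf-8") as file:
        # indent=2 keeps the file readable when opened by hand
        json.dump(descriptions, file, ensure_ascii=False, indent=2)
    return file_path

def load_descriptions(file_path):
    """Read the descriptions back from a .json file."""
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)
```

The same two functions would also work if the descriptions were stored as a dictionary keyed by image file name instead of a list, since json handles both.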