In [1]:
!python -m pip install git+https://github.com/huggingface/transformers

Looking in indexes: https://nexus.iisys.de/repository/ki-awz-pypi-group/simple, https://pypi.org/simple
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-ox_0zyxz
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-ox_0zyxz
  Resolved https://github.com/huggingface/transformers to commit ca03842cdcf2823301171ab27aec4b6b1cafdbc1
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
pip install 'accelerate>=0.26.0'

Looking in indexes: https://nexus.iisys.de/repository/ki-awz-pypi-group/simple, https://pypi.org/simple
Note: you may need to restart the kernel to use updated packages.
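
The pip output above does not say which versions ended up in the environment, and the kernel may need a restart before they are picked up. A minimal sketch for checking this after a restart, using only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in the current environment.
            found[name] = None
    return found


# Package names taken from the install cells in this notebook.
for name, ver in installed_versions(["transformers", "accelerate", "sentencepiece"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

A package that has not been installed yet is reported as `not installed` instead of raising an error, so the check can be run at any point in the notebook.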


In [3]:
pip install sentencepiece

Looking in indexes: https://nexus.iisys.de/repository/ki-awz-pypi-group/simple, https://pypi.org/simple
Note: you may need to restart the kernel to use updated packages.
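
The next cell reads a JSON dataset and accesses `item['image']`, `item['questions']`, and the `"question"` and `"answer"` keys of each question. The sketch below checks that a loaded dataset follows that layout before the model is run. The helper name `check_dataset` and the sample entry (including its file path and answer) are made up for illustration and are not taken from the real dataset:

```python
def check_dataset(dataset):
    """Return a list of problems found; an empty list means the layout matches."""
    if not isinstance(dataset, list):
        return ["top level must be a list of items"]

    problems = []
    for i, item in enumerate(dataset):
        if not isinstance(item, dict) or not isinstance(item.get("image"), str):
            problems.append(f"item {i}: missing string field 'image'")
            continue
        questions = item.get("questions")
        if not isinstance(questions, list):
            problems.append(f"item {i}: 'questions' must be a list")
            continue
        for j, q in enumerate(questions):
            if not isinstance(q, dict):
                problems.append(f"item {i}, question {j}: must be an object")
                continue
            for key in ("question", "answer"):
                if key not in q:
                    problems.append(f"item {i}, question {j}: missing '{key}'")
    return problems


# Hypothetical entry, for illustration only.
sample = [
    {
        "image": "images/example.png",
        "questions": [
            {"question": "Which classes are present in the diagram?", "answer": "ClassA, ClassB"}
        ],
    }
]
print(check_dataset(sample))  # prints []
```

Running the check on the output of `json.load` right after the file is opened surfaces a malformed entry immediately, rather than as a `KeyError` partway through a long generation run.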


In [6]:
import time  # Importing time module for execution time measurement
import matplotlib.pyplot as plt
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
import json
import os

# Record the overall start time
overall_start_time = time.time()

# Step 1: Load the model and processor from Hugging Face
print("Loading model and processor...")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")
processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
print("Model and processor loaded.\n")

# Step 2: Load the dataset from JSON
json_path = "dataset_fixeddd.json"  # Replace with the path to your JSON dataset
with open(json_path, "r") as f:
    dataset = json.load(f)

results = []

# Step 3: Process each image and its questions
for item in dataset:
    image_path = item['image']
    questions = [q["question"] for q in item['questions']]
    groundtruth_answers = [q["answer"] for q in item['questions']]

    # Check if the image file exists
    if not os.path.exists(image_path):
        print(f"Image not found: {image_path}")
        continue

    print(f"Processing image {image_path}...")
    # Convert to RGB so palette or alpha-channel PNGs are handled uniformly
    image = Image.open(image_path).convert("RGB")

    answers = []

    for question in questions:
        print(f"Processing question: {question}")

        # Step 4: Process inputs
        inputs = processor(text=question, images=image, return_tensors="pt")
        
        # Step 5: Generate the output
        generated_ids = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            image_embeds=None,
            image_embeds_position_mask=inputs.get("image_embeds_position_mask"),
            use_cache=True,
            max_new_tokens=600,
        )

        # Step 6: Decode the generated text output
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # Step 7: Post-process the generated text
        processed_text, entities = processor.post_process_generation(generated_text)

        # Append the processed text to answers
        answers.append(processed_text)

    # Append results for the current image
    results.append({
        "image": image_path,
        "questions": questions,
        "Kosmos_generated_answers": answers,
        "groundtruth_answers": groundtruth_answers
    })

# Step 8: Save results to a new JSON file
output_json_path = "kosmos_generated_answers.json"
with open(output_json_path, "w") as f:
    json.dump(results, f, indent=4)

print(f"Processing complete. Results saved to {output_json_path}.")

# Record the overall end time and print total execution time
overall_end_time = time.time()
print(f"\nOverall Execution Time: {overall_end_time - overall_start_time:.2f} seconds.")


Loading model and processor...
Model and processor loaded.

Processing image datasett/datasetUML1.png...
Processing question: Which classes are present in the diagram?
Processing question: How many entities are defined in the diagram?
Processing question: What attributes are present in the 'Reservation' entity?
Processing question: What is the cardinality relationship between the 'User' and 'Employee' entities?
Processing question: Which entity handles 'airline inquiries'?
Processing image datasett/datasetUML2.png...
Processing question: Which classes are shown in the diagram?
Processing question: What are the private methods of the 'ATM' class?
Processing question: What is the cardinality relationship between the 'Customer' and 'Account' entities?
Processing question: What attributes are defined in the 'ATM Transaction' entity?
Processing question: What parameters must be passed to the method that processes ATM transactions, and what are their data types?
Processing image datasett/dat