<a href="https://colab.research.google.com/github/Noelle-Pastor/UTA-Libraries-Data-Visualization-Contest/blob/main/1.%20Table_of_Contents_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Agent: Table of Contents (ToC) Processing
## **.JPG** ---> **.CSV**

---


### **Input:**
- Set of photos from ToC


### **Output:**
- One .CSV representing ToC structure
- One .CSV containing ToC metadata

In [None]:
from google import genai
from google.genai import types
from pathlib import Path
import json
import pandas as pd
import time

Create a dynamic, detailed prompt. This prompt:
- Specifies the rules for converting the images into a structured format
- Asks the model to respond with a single JSON object
- Supports batching by passing in the last row from the previous batch as context
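
The batching idea in the last bullet can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `build_context` is a hypothetical helper that takes the rows parsed so far and returns the hierarchy state that would be handed to the next batch.

```python
HEADING_KEYS = ["Heading 1", "Heading 2", "Heading 3", "Heading 4"]

def build_context(rows):
    """Return the carry-over state taken from the last parsed row (hypothetical helper)."""
    # Start from an empty context, which means "no previous batch".
    context = {"ID": "", **{key: "" for key in HEADING_KEYS}}
    if rows:
        last = rows[-1]
        context["ID"] = last.get("ID", "")
        for key in HEADING_KEYS:
            context[key] = last.get(key, "")
    return context

# Two rows as they might come back from a first batch (made-up values).
rows = [
    {"ID": 1, "Heading 1": "PART 1", "Heading 2": "", "Heading 3": "", "Heading 4": ""},
    {"ID": 2, "Heading 1": "PART 1", "Heading 2": "Colonial Voices", "Heading 3": "", "Heading 4": ""},
]
context = build_context(rows)
print(context)
```

With this state in the prompt, the next batch can continue numbering from the last ID and treat its first entries as children of "Colonial Voices".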

In [None]:

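def create_prompt(last_hierarchy=None, last_metadata=None):

    # Batch memory from the previous batch. Use the arguments when given; otherwise fall back to
    # the notebook-level dictionaries that the processing loop updates, or to empty values.
    if last_hierarchy is None:
        last_hierarchy = globals().get("last_hierarchy") or {"ID":"", "Heading 1":"", "Heading 2":"", "Heading 3":"", "Heading 4":""}
    if last_metadata is None:
        last_metadata = globals().get("last_metadata") or {"Anthology Title": "", "Publish Date": "", "Author 1": "", "Author 2": "", "Author 3": "", "Pages": ""}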


    context_block = f""" CONTEXT FROM PREVIOUS BATCH (Optional):
    •	Instructions: If this block is provided, use its values as the starting state for your analysis of the new images. Increment the last ID by 1. The first items on the new page may be children of these context entries. If this block's values are empty, start from a clean slate.

    •	Last Hierarchy Context:
    o	Last ID: "{last_hierarchy["ID"]}"
    o	Heading 1: "{last_hierarchy["Heading 1"]}"
    o	Heading 2: "{last_hierarchy["Heading 2"]}"
    o	Heading 3: "{last_hierarchy["Heading 3"]}"
    o	Heading 4: "{last_hierarchy["Heading 4"]}"

    •	Last Metadata Context:
    o	Anthology Title: "{last_metadata["Anthology Title"]}"
    o	Publish Date: "{last_metadata["Publish Date"]}"
    o	Author 1: "{last_metadata["Author 1"]}"
    o	Author 2: "{last_metadata["Author 2"]}"
    o	Author 3: "{last_metadata["Author 3"]}"
    o	Pages: "{last_metadata["Pages"]}"

    """



    # Plain (non-f) string, so braces are written singly; doubled braces would reach the model literally
    json_specification = """
    {
      "HIERARCHY": [
        {"ID": 1, "Page Number": "...", "Heading 1": "...", "Heading 2": "...", "Heading 3": "...", "Heading 4": "...", "Heading 5": "..."}
      ],

      "METADATA": [
        {"Anthology Title": "...", "Publish Date": "...", "Author 1": "...", "Author 2": "...", "Author 3": "...", "Pages": "..."}
      ]
    }
    """



    json_example = """
    {
      "HIERARCHY": [
        {
          "ID": 1,
          "Page Number": "1",
          "Heading 1": "PART 1: THE EARLY YEARS",
          "Heading 2": "",
          "Heading 3": "",
          "Heading 4": "",
          "Heading 5": ""
        },
        {
          "ID": 2,
          "Page Number": "5",
          "Heading 1": "PART 1: THE EARLY YEARS",
          "Heading 2": "Colonial Voices",
         "Heading 3": "",
          "Heading 4": "",
          "Heading 5": ""
        },
        {
          "ID": 3,
          "Page Number": "7",
          "Heading 1": "PART 1: THE EARLY YEARS",
          "Heading 2": "Colonial Voices",
          "Heading 3": "Anne Bradstreet (1612-1672)",
          "Heading 4": "",
          "Heading 5": ""
        },
        {
          "ID": 4,
          "Page Number": "8",
          "Heading 1": "PART 1: THE EARLY YEARS",
          "Heading 2": "Colonial Voices",
          "Heading 3": "Anne Bradstreet (1612-1672)",
          "Heading 4": "The Prologue",
          "Heading 5": ""
        },
        {
          "ID": 5,
          "Page Number": "9",
          "Heading 1": "PART 1: THE EARLY YEARS",
          "Heading 2": "Colonial Voices",
          "Heading 3": "Anne Bradstreet (1612-1672)",
          "Heading 4": "To My Dear and Loving Husband",
          "Heading 5": ""
        }
      ],
      "METADATA": [
        {
          "Anthology Title": "American Literature: A Survey",
          "Publish Date": "2023",
          "Author 1": "Jane Doe",
          "Author 2": "John Smith",
          "Author 3": "",
          "Pages": "9"
        }
      ]
    }
    """



    prompt = f"""
    Objective:

    You are an expert data extraction and interpretation agent. You will be given a batch of images representing a book's title page and/or its table of contents (ToC).
    Your task is to analyze these images and generate a single, complete, valid JSON object containing two distinct data structures derived from the content, adhering strictly to the specified structure and logic.


    Input:
    A sequence of images. The first image will be the book's title page. The subsequent images will be the pages of the table of contents. Treat all ToC images as a single, continuous document.

    Output:
    You must generate a single, valid JSON object and nothing else. The JSON object must have two top-level keys: HIERARCHY and METADATA. The value for each key must be a list of JSON objects, where each object represents one row of data. It is critical that every key is present in every object, using an empty string "" for any field that has no value. Do not use null.


    JSON Structure Specification (generate nothing outside of the first and last curly braces. DO NOT include the word "json" before the file/text.):
    {json_specification}

    BATCHES:
    The complete JSON file will be created in multiple batches.


    Core Logic & Rules:
    {context_block}


    1.	General Processing Rules (Apply to all data extraction)
    •	UTF-8 Encoding: Ensure all text output uses UTF-8 encoding to correctly represent all characters.
    •	Ignore Irrelevant Information: Actively ignore all text and graphics not required for the output, including publisher data, ISBNs, dedications, and page footers.
    •	Multi-line Entries: If a single conceptual entry (like a long title) is visually broken across multiple lines, intelligently merge them into a single field.

    2.	Initial Analysis & Metadata (for the METADATA key):
    •	First, analyze the title page image to extract the "Anthology Title," "Publish Date," and up to three primary authors.
    •	Actively ignore all other text on the pages (publisher info, ISBN, etc.).
    •	Determine the total "Pages" by finding the last numerical page number in the main body of the ToC.
    •	Format this information as a list containing a single JSON object.
    •	Ensure all text output uses UTF-8 encoding to correctly represent all characters.

    3.	Hierarchical Extraction (for the HIERARCHY key):
    •	Process the ToC images to capture their literal structure.
    •	Handle multi-line entries: If a single entry is visually broken across multiple lines, intelligently merge them into a single field.
    •	For each line item, create a new JSON object. Assign a sequential integer ID.
    •	Extract the Page Number or the word "online".
    •	Determine the hierarchical level (1-5) based on visual cues (indentation, font styles).
    •	Populate the Heading 1 through Heading 5 fields. Parent headings must be populated.


    START OF EXAMPLE
    To be clear, here is a perfect example of how to perform this entire task.

    Example Input:
    (Imagine one image of a title page and one image of a ToC are provided)

    Example Output:
    {json_example}

    END OF EXAMPLE

    Now, apply this exact logic and JSON structure to the new images I will provide. Produce a single, valid JSON object as the complete output.
    """

    return prompt

Make sure the response is clean JSON by stripping any Markdown code fence the model wraps around it

In [None]:
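def clean_response(s):
    # Strip whitespace and a surrounding Markdown code fence, with or without the "json" label
    return s.strip().removeprefix("```json").removeprefix("```").removesuffix("```").strip()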
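
A cleaned string can still fail to parse or lack one of the two top-level keys, so it may help to check both in one place. The sketch below is an illustration under assumptions, not part of the original notebook: `parse_model_json` and `REQUIRED_KEYS` are hypothetical names, and the fence string is built with `"`" * 3` only to avoid writing a literal fence inside the example.

```python
import json

FENCE = "`" * 3                      # the Markdown code-fence marker
REQUIRED_KEYS = ("HIERARCHY", "METADATA")

def parse_model_json(text):
    """Strip an optional code fence, parse the JSON, and check the top-level keys."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with or without a "json" label) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rstrip().removesuffix(FENCE).strip()
    data = json.loads(cleaned)       # raises json.JSONDecodeError on invalid JSON
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Response is missing keys: {missing}")
    return data

# A made-up response wrapped in a fence, as models sometimes return it.
sample = FENCE + 'json\n{"HIERARCHY": [{"ID": 1}], "METADATA": []}\n' + FENCE
parsed = parse_model_json(sample)
print(parsed["HIERARCHY"])
```

Raising an error on a missing key stops a malformed batch before its rows are appended to the running lists.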

Create client

In [None]:
GEMINI_API_KEY = ""
client = genai.Client(api_key = GEMINI_API_KEY)

Create temporary lists; each batch appends its rows to them, and the .csv files are built from them at the end

In [None]:
HIERARCHY_alldata = []
METADATA_alldata = []

Set batch size

In [None]:
batch_size = 10
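
As a small illustration of how a batch size splits a list of photos, the sketch below slices a list into consecutive groups. `make_batches` and the file names are hypothetical and used only for this example.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive slices of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 23 made-up file names standing in for the photos of one ToC.
pages = [f"page_{n:02d}.jpg" for n in range(1, 24)]
batches = make_batches(pages, 10)

for number, batch in enumerate(batches, start=1):
    print(f"Batch {number}: {len(batch)} photos")
```

The last batch simply holds whatever is left over, so no padding is needed when the photo count is not a multiple of the batch size.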

Path object pointing to a directory of subdirectories, each containing the images for a single ToC

In [None]:
path = Path("")

Get list of names of all folders in directory

In [None]:
# pathlib is used here because `os` is never imported in this notebook
toc_titles = [p.name for p in path.iterdir()]
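
Folders created on macOS often contain hidden files such as ".DS_Store", so it can be safer to select photos by name and extension than by position in the listing. The following is a minimal sketch under assumptions: `list_photos` and `IMAGE_SUFFIXES` are hypothetical, and a temporary directory stands in for a real ToC folder.

```python
from pathlib import Path
import tempfile

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}   # assumed set of photo extensions

def list_photos(folder):
    """Return the image files in a folder, sorted by name, skipping hidden files."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and not p.name.startswith(".")
        and p.suffix.lower() in IMAGE_SUFFIXES
    )

# Build a throwaway folder with two photos and one hidden file.
with tempfile.TemporaryDirectory() as tmp:
    toc_dir = Path(tmp) / "Example Anthology"
    toc_dir.mkdir()
    for name in ["02.jpg", "01.jpg", ".DS_Store"]:
        (toc_dir / name).touch()
    photo_names = [p.name for p in list_photos(toc_dir)]

print(photo_names)
```

Sorting is by file name, so zero-padded names (01, 02, ... 10) keep the pages in reading order, while unpadded names would place "10.jpg" before "2.jpg".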

In [None]:
# for each ToC folder (the folder name is used as the ToC title)
for folder_path in sorted(path.iterdir()):
    title = folder_path.name


    # skip hidden files like ".DS_Store"
    if not folder_path.is_dir():
        continue


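    #reset batch memory and collected rows for new ToC (otherwise each CSV would also contain every earlier ToC's rows)
    last_hierarchy = {"ID":"", "Heading 1":"", "Heading 2":"", "Heading 3":"", "Heading 4":""}
    last_metadata = {"Anthology Title": "", "Publish Date": "", "Author 1": "", "Author 2": "", "Author 3": "", "Pages": ""}
    HIERARCHY_alldata = []
    METADATA_alldata = []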



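    # Get a sorted list of the photo filepaths, skipping hidden files such as ".DS_Store"
    # (dropping the first entry by position would lose a real photo when no hidden file exists)
    photo_paths = sorted(p for p in folder_path.iterdir() if p.is_file() and not p.name.startswith("."))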


    for i in range(0, len(photo_paths), batch_size):
        print(f"\nBatch {i // batch_size + 1}")


        #slice to get filepaths for photos in batch
        batch_paths = photo_paths[i : i + batch_size]



        # Upload the photos in batch to api; make uploaded photos list
        uploaded_photos = []
        # loop variable named photo_path so it does not overwrite the top-level `path`
        for photo_path in batch_paths:


            # From Google's documentation on passing images
            photo = client.files.upload(file=photo_path)
            print(f"{len(uploaded_photos)+1} photos uploaded")


            #add to list of uploaded photo objects
            uploaded_photos.append(photo)

            #make sure not too many requests at once
            time.sleep(1)


        #pass prompt and batch of uploaded photos to model and get response
        response = client.models.generate_content(
            model="gemini-flash-latest",
            contents=[
                create_prompt(),
                *uploaded_photos
            ])


        #clear uploaded files
        for file in uploaded_photos:
            client.files.delete(name=file.name)




        cleaned_string = clean_response(response.text)




        # Creates dictionary from json keys
        data = json.loads(cleaned_string)


        # Add the new "rows" (currently in dictionary form) to lists
        HIERARCHY_alldata.extend(data['HIERARCHY'])
        METADATA_alldata.extend(data['METADATA'])
        print(HIERARCHY_alldata)
        print(METADATA_alldata)





        #update context block in prompt (allows for batching, remembering the last line of previous batch)
        if HIERARCHY_alldata:
            last_hierarchy["ID"] = HIERARCHY_alldata[-1]["ID"]
            last_hierarchy["Heading 1"] = HIERARCHY_alldata[-1]["Heading 1"]
            last_hierarchy["Heading 2"] = HIERARCHY_alldata[-1]["Heading 2"]
            last_hierarchy["Heading 3"] = HIERARCHY_alldata[-1]["Heading 3"]
            last_hierarchy["Heading 4"] = HIERARCHY_alldata[-1]["Heading 4"]
            print("Batch memory data updated")

        if METADATA_alldata:
            last_metadata["Anthology Title"] = METADATA_alldata[-1]["Anthology Title"]
            last_metadata["Publish Date"] = METADATA_alldata[-1]["Publish Date"]
            last_metadata["Author 1"] = METADATA_alldata[-1]["Author 1"]
            last_metadata["Author 2"] = METADATA_alldata[-1]["Author 2"]
            last_metadata["Author 3"] = METADATA_alldata[-1]["Author 3"]
            last_metadata["Pages"] = METADATA_alldata[-1]["Pages"]
            print("Batch memory data updated")




    # List to DataFrame
    hierarchy_df = pd.DataFrame(HIERARCHY_alldata)
    metadata_df = pd.DataFrame(METADATA_alldata)

    # DataFrame to .csv (note: the files are tab-separated despite the .csv extension)
    hierarchy_df.to_csv(f"{title}_HIERARCHY.csv", sep= "\t", index=False)
    metadata_df.to_csv(f"{title}_METADATA.csv", sep= "\t", index=False)
    print("CSVs saved.")
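
The model is not guaranteed to follow the rules in the prompt, so a quick check of the collected rows before saving can catch problems early. The sketch below is a hypothetical addition, not part of the original workflow: `check_rows` assumes IDs are integers starting at 1 and flags rows where a heading is filled in while its parent level is empty.

```python
HEADINGS = ["Heading 1", "Heading 2", "Heading 3", "Heading 4", "Heading 5"]

def check_rows(rows):
    """Return a list of problems: non-sequential IDs or a heading whose parent is empty."""
    problems = []
    for position, row in enumerate(rows, start=1):
        if row.get("ID") != position:
            problems.append(f"row {position}: expected ID {position}, got {row.get('ID')!r}")
        filled = [bool(row.get(h, "")) for h in HEADINGS]
        # A filled level directly after an empty one means a parent heading is missing.
        if any(later and not earlier for earlier, later in zip(filled, filled[1:])):
            problems.append(f"row {position}: a heading is set but its parent is empty")
    return problems

# Made-up rows: the first list follows the rules, the second breaks both of them.
good_rows = [
    {"ID": 1, "Heading 1": "PART 1: THE EARLY YEARS", "Heading 2": ""},
    {"ID": 2, "Heading 1": "PART 1: THE EARLY YEARS", "Heading 2": "Colonial Voices"},
]
bad_rows = [
    {"ID": 1, "Heading 1": "", "Heading 2": "Colonial Voices"},   # parent heading missing
    {"ID": 5, "Heading 1": "PART 1: THE EARLY YEARS"},            # ID jumps from 1 to 5
]

print(check_rows(good_rows))
for problem in check_rows(bad_rows):
    print(problem)
```

An empty result means the rows are consistent with the two rules checked here; it does not confirm that the extracted text itself is correct.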
