# Initializations

In [1]:
from PIL import Image
from google import genai
import os
import json
import re
import pathlib
from google.genai import types
import fitz  # PyMuPDF
import io

In [2]:
models = [
    "gemini-2.0-flash",
    "gemini-2.0-pro-exp-02-05",
    "gemini-2.5-pro-exp-03-25"
]

In [None]:
# Optional: load GEMINI_API_KEY from a .env file (requires the python-dotenv package)
# from dotenv import load_dotenv

# load_dotenv()

In [6]:
api_key = os.environ.get("GEMINI_API_KEY")
client = genai.Client(api_key=api_key)
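
`os.environ.get` returns `None` when the variable is unset. A minimal sketch of an explicit check that could run before the client is built; `require_env` is a hypothetical helper introduced here for illustration, not part of the google-genai SDK:

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper; `environ` defaults to the real process environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with an explicit mapping instead of the real environment.
print(require_env("GEMINI_API_KEY", {"GEMINI_API_KEY": "example-key"}))
```

In the notebook, the result of `require_env("GEMINI_API_KEY")` could be passed as `api_key` in place of the `os.environ.get` call.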

In [9]:
pdf2 = "Capacity-Aware_Inference_Mitigating_the_Straggler_Effect_in_Mixture_of_Experts.pdf"
pdf1 = "Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models.pdf"

In [10]:
pdf1_filepath = pathlib.Path(pdf1)
pdf2_filepath = pathlib.Path(pdf2)

# Let's Call Gemini!
- Practice Calling the Gemini API:
    - Single Call
    - Chat
    - PDF

In [12]:
response = client.models.generate_content(
    model=models[1],
    contents=["How does AI work?"]
)
print(response.text)

Alright, let's break down how AI works.  It's a broad topic, so I'll give you a general overview, then we can dive deeper into specific areas if you're interested.

**The Core Idea:**

At its heart, AI is about creating machines that can perform tasks that typically require human intelligence.  These tasks include:

*   **Learning:** Acquiring information and rules for using the information.
*   **Reasoning:** Using rules to reach conclusions (both exact and approximate).
*   **Problem Solving:** Identifying and solving problems.
*   **Perception:** Recognizing objects, sounds, speech, and other inputs.
*   **Understanding Natural Language:** Processing and understanding human language.

**Key Components & Approaches:**

AI achieves these feats through a variety of techniques, but here are some of the most fundamental:

1.  **Algorithms:**

    *   These are the step-by-step instructions that tell the computer how to perform a task. They are the backbone of AI systems. Think of them as

In [13]:
chat = client.chats.create(model="gemini-2.0-flash")

response = chat.send_message("Tell me about OCR")
print(response.text)

response = chat.send_message("How is it useful")
print(response.text)

for message in chat.get_history():
    print(f'role - {message.role}', end=": ")
    print(message.parts[0].text)

Okay, let's dive into the world of OCR: Optical Character Recognition.

**What is OCR?**

At its core, Optical Character Recognition (OCR) is a technology that allows computers to "read" text from images, scanned documents, or even handwriting.  It essentially converts images of text into machine-readable text data. Think of it as giving computers the ability to understand written language in a visual format.

**How does it work?**

The OCR process typically involves these key steps:

1.  **Image Acquisition:** The process starts with an image containing text. This could be a scanned document, a photo taken with a camera, or even an image from a PDF file.

2.  **Preprocessing:** This stage enhances the image to improve OCR accuracy. Common preprocessing steps include:
    *   **Deskewing:** Correcting any slant or rotation in the image.
    *   **Noise Reduction:** Removing unwanted specks, shadows, or other imperfections.
    *   **Binarization:** Converting the image to black and whi

In [14]:
# Read the PDF as raw bytes and send them inline with the prompt
filepath = pdf1_filepath

prompt = "Analyze Figure 1"
response = client.models.generate_content(
  model=models[1],
  contents=[
      types.Part.from_bytes(
        data=filepath.read_bytes(),
        mime_type='application/pdf',
      ),
      prompt])
print(response.text)

Okay, let's analyze Figure 1 from the provided OCR text.

**Figure 1 Analysis:**

1.  **Title:** "Dataset Splits for Each Language"
2.  **Type:** Grouped Bar Chart.
3.  **X-axis:** "Language", showing the four languages used in the study: English, German, French, and Python.
4.  **Y-axis:** "Number of Tokens (million)", indicating the volume of data measured in millions of tokens. The scale ranges from 0 to 140 million.
5.  **Legend:** "Dataset Splits", identifying the three different colored bars for each language:
    *   Blue: Training Set
    *   Orange: Validation Set
    *   Green: Test Set
6.  **Content:** The chart visually represents the size (in millions of tokens) of the training, validation, and test datasets for each of the four languages.
7.  **Observations:**
    *   **Training Data Dominance:** For all languages, the training set (blue bar) is significantly larger than the validation (orange) and test (green) sets, which is standard practice in machine learning.
    *  

# Time to Convert the PDF to Markdown
- Write Prompt
- Call Gemini with Prompt
- Refine Prompts and Repeat Until Satisfactory

In [15]:
promptv1 = """Convert the following PDF to Markdown."""

In [17]:
response = client.models.generate_content(
  model=models[1],
  contents=[
      types.Part.from_bytes(
        data=filepath.read_bytes(),
        mime_type='application/pdf',
      ),
      promptv1])
print(response.text)

```markdown
arXiv:2407.19610v1 [cs.AI] 28 Jul 2024

# MIXTURE OF MODULAR EXPERTS: DISTILLING KNOWLEDGE FROM A MULTILINGUAL TEACHER INTO SPECIALIZED MODULAR LANGUAGE MODELS

**Mohammed Al-Maamari**
Chair of Data Science
University of Passau
Germany, Passau
Mohammed.Al-Maamari@uni-passau.de

**Mehdi Ben Amor**
Chair of Data Science
University of Passau
Germany, Passau
Mehdi.BenAmor@uni-passau.de

**Michael Granitzer**
Chair of Data Science
University of Passau
Germany, Passau
Michael.Granitzer@uni-passau.de

July 30, 2024

## ABSTRACT

This research explores the integration of Knowledge Distillation (KD) and Mixture of Experts (MoE) to create modular, efficient, and specialized multilingual language models. The primary objectives include evaluating adaptive versus fixed alpha methods in KD, developing and comparing modular MoE architectures in handling multi-domain inputs and preventing catastrophic forgetting.

We address the computational challenges of large language models (LLMs) and 

In [19]:
promptv2 = """You are given a scan of a PDF file. Convert the PDF to Markdown.
When you encounter an image, return its bounding box in [ymin, xmin, ymax, xmax] format. You should return the bounding box for EVERY image in the document. Surround the bounding box in <bounding_box></bounding_box> tags. Inside these tags, return the information in ([ymin, xmin, ymax, xmax], page_number) format"""


In [20]:
response = client.models.generate_content(
  model=models[1],
  contents=[
      types.Part.from_bytes(
        data=filepath.read_bytes(),
        mime_type='application/pdf',
      ),
      promptv2])
print(response.text)

```markdown
arXiv:2407.19610v1 [cs.AI] 28 Jul 2024

# MIXTURE OF MODULAR EXPERTS: DISTILLING KNOWLEDGE FROM A MULTILINGUAL TEACHER INTO SPECIALIZED MODULAR LANGUAGE MODELS

**Mohammed Al-Maamari**
Chair of Data Science
University of Passau
Germany, Passau
Mohammed.Al-Maamari@uni-passau.de

**Mehdi Ben Amor**
Chair of Data Science
University of Passau
Germany, Passau
Mehdi.BenAmor@uni-passau.de

**Michael Granitzer**
Chair of Data Science
University of Passau
Germany, Passau
Michael.Granitzer@uni-passau.de

July 30, 2024

## ABSTRACT

This research explores the integration of Knowledge Distillation (KD) and Mixture of Experts (MoE) to create modular, efficient, and specialized multilingual language models. The primary objectives include evaluating adaptive versus fixed alpha methods in KD, developing and comparing modular MoE architectures in handling multi-domain inputs and preventing catastrophic forgetting.

We address the computational challenges of large language models (LLMs) and 

# Satisfactory Results? Let's Polish Them with Images
- save_to_markdown
- parse_markdown
- parse_out_markdown_tildas
- pdf_page_to_image
- parse_bbs
- extract_images
- extract_all_images
- save_extracted_images
- insert_into_markdown


In [21]:
def save_to_markdown(content, filename="output.md"):
    """
    Saves a string to a markdown file.
    
    Args:
        content (str): The content to save to the markdown file.
        filename (str, optional): The name of the file to save to. Defaults to "output.md".
    
    Returns:
        pathlib.Path: The path to the saved file.
    """
    file_path = pathlib.Path(filename)
    
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)
    
    return file_path

In [22]:
def parse_markdown(text):
    """
    Extracts content inside <markdown></markdown> tags from text.
    If these tags don't exist, returns the entire text.
    
    Args:
        text (str): The text to parse
        
    Returns:
        str: The extracted markdown content or the original text
    """
    # Look for content between <markdown> and </markdown> tags
    
    markdown_pattern = re.compile(r'<markdown>(.*?)</markdown>', re.DOTALL)
    match = markdown_pattern.search(text)
    
    if match:
        # Return only the content inside the tags
        return match.group(1)
    else:
        # If tags don't exist, return the entire text
        return text

In [23]:
def parse_out_markdown_tildas(text):
    """
    Finds all occurrences of ```markdown ... ``` in the text and extracts
    only the content inside, removing the markdown code block delimiters.
    Also preserves any text outside of these blocks.
    
    Args:
        text (str): The text to process
        
    Returns:
        str: The text with all markdown code blocks replaced with just their content
    """
    # Pattern to match ```markdown ... ``` blocks
    # (?:.|\n) matches any character, including newlines; the lazy *? keeps
    # each match as short as possible so separate blocks are not merged
    pattern = r"```markdown\s*((?:.|\n)*?)```"
    
    # Replace each match with just the content inside
    result = re.sub(pattern, r"\1", text)
    
    return result
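
To see what the substitution does, here is a small sketch on made-up input. It uses the same pattern as `parse_out_markdown_tildas`, assembled from a `FENCE` constant (an assumption of this example, used only to keep the backticks readable). It also shows that a block with no closing fence is left untouched:

```python
import re

FENCE = "`" * 3  # three backticks

# Same pattern as in parse_out_markdown_tildas, assembled from FENCE.
PATTERN = FENCE + r"markdown\s*((?:.|\n)*?)" + FENCE

# Made-up inputs: one complete block, one cut off before the closing fence.
complete = f"Intro\n{FENCE}markdown\n# Title\n\nBody text.\n{FENCE}\nOutro"
truncated = f"{FENCE}markdown\n# Title\n\nBody text."

stripped = re.sub(PATTERN, r"\1", complete)
unchanged = re.sub(PATTERN, r"\1", truncated)

print(repr(stripped))
print(unchanged == truncated)  # the unterminated block does not match
```

If a response is cut off before the closing fence, the opening fence line would therefore remain in the saved file.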

In [24]:
def pdf_page_to_image(pdf_path, page_index, output_dir=None, dpi=300):
    """
    Convert a specific page of a PDF to an image and optionally save it.
    
    Args:
        pdf_path: Path to the PDF file
        page_index: Zero-based index of the page to convert
        output_dir: Directory to save the image (if None, image is not saved)
        dpi: Resolution of the output image (default: 300)
        
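    Returns:
        The PIL Image of the page if output_dir is None; otherwise a tuple
        (image, saved_path). If conversion fails, the error is printed and
        None is returned in place of the image (and path).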
    """
    try:
        # Open the PDF file
        pdf_document = fitz.open(pdf_path)
        
        # Check if page_index is valid
        if page_index < 0 or page_index >= len(pdf_document):
            raise ValueError(f"Page index {page_index} is out of range. PDF has {len(pdf_document)} pages.")
        
        # Get the specified page
        page = pdf_document.load_page(page_index)
        
        # Convert page to a pixmap (image)
        pix = page.get_pixmap(matrix=fitz.Matrix(dpi/72, dpi/72))
        
        # Convert pixmap to PIL Image
        img_data = pix.tobytes("png")
        img = Image.open(io.BytesIO(img_data))
        
        # Close the PDF document
        pdf_document.close()
        
        # Save the image if output_dir is provided
        saved_path = None
        if output_dir is not None:
            # Create output directory if it doesn't exist
            os.makedirs(output_dir, exist_ok=True)
            
            # Get base name of the PDF file without extension
            base_name = os.path.splitext(os.path.basename(pdf_path))[0]
            
            # Create output filename
            output_filename = f"{base_name}_{page_index}.png"
            output_path = os.path.join(output_dir, output_filename)
            
            # Save the image
            img.save(output_path)
            saved_path = output_path
            print(f"Saved page {page_index} to {output_path}")
        
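        if saved_path is None:
            return img
        else:
            return img, saved_path
    
    except Exception as e:
        print(f"Error converting PDF page to image: {e}")
        # Mirror the success path: a single value when nothing is saved,
        # a pair when output_dir is given
        return None if output_dir is None else (None, None)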
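
The `dpi/72` factor comes from PDF page sizes being measured in points, with 72 points to the inch. A rough sketch of the resulting image size, assuming a US Letter page of 612 x 792 points; `rendered_size` is a hypothetical helper, and the actual pixmap may differ by a pixel because of rounding:

```python
def rendered_size(width_pt, height_pt, dpi=300):
    """Approximate pixel size of a page rendered with a dpi/72 zoom."""
    # Multiply before dividing so whole-number results stay exact.
    width_px = round(width_pt * dpi / 72)
    height_px = round(height_pt * dpi / 72)
    return width_px, height_px


# US Letter is 8.5 x 11 inches, i.e. 612 x 792 points.
print(rendered_size(612, 792))       # 300 dpi
print(rendered_size(612, 792, 72))   # zoom factor of 1
```

At 300 dpi a Letter page becomes roughly 2550 x 3300 pixels, which is why rendering every page of a long PDF can use a noticeable amount of memory.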

In [25]:
def parse_bbs(response_text, pdf_path):
    """
    Parse the bounding box information from the Gemini response text.

    Args:
        response_text (str): The text response containing bounding box tags.
        pdf_path (str): The path to the PDF file.

    Returns:
        list: A list of dictionaries, each containing bounding box, page number,
              and PDF path. Returns an empty list if no bounding boxes are found
              or if there's an error parsing.
    """
    bounding_boxes_info = []
    # Regex to find the bounding box and page number within the tags
    # It looks for <bounding_box>([list], page_number)</bounding_box>
    # and captures the list part and the page_number part.
    pattern = r"<bounding_box>\(\[([\d,\s]+)\],\s*(\d+)\)</bounding_box>"

    matches = re.findall(pattern, response_text)

    for match in matches:
        try:
            # Extract the coordinates string and the page number string
            coords_str, page_num_str = match

            # Convert coordinate string to a list of integers
            # Removes spaces and splits by comma
            coords = [int(c.strip()) for c in coords_str.split(',')]

            # Convert page number string to integer
            page_num = int(page_num_str)

            # Append the structured data to the list
            bounding_boxes_info.append({
                "bounding_box": coords,
                "page": page_num,
                "pdf": pdf_path
            })
        except ValueError as e:
            print(f"Error parsing bounding box data: {match}. Error: {e}")
            # Skip this entry if there's an error converting numbers
            continue
        except Exception as e:
            print(f"An unexpected error occurred while parsing: {match}. Error: {e}")
            continue

    return bounding_boxes_info
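
As a quick illustration of what the pattern accepts, the sketch below runs the same regular expression on a made-up response fragment. The coordinates and page numbers are invented for the example:

```python
import re

# Same pattern as in parse_bbs.
PATTERN = r"<bounding_box>\(\[([\d,\s]+)\],\s*(\d+)\)</bounding_box>"

# Made-up response text: one well-formed tag and one with decimal values.
sample = (
    "Some text\n"
    "<bounding_box>([120, 80, 450, 920], 4)</bounding_box>\n"
    "More text\n"
    "<bounding_box>([0.12, 0.08, 0.45, 0.92], 5)</bounding_box>\n"
)

matches = re.findall(PATTERN, sample)
for coords, page in matches:
    # int() tolerates the leading spaces left by split(",")
    print([int(c) for c in coords.split(",")], int(page))
```

Only the first tag is reported. Because `[\d,\s]+` admits digits, commas, and whitespace only, a tag with decimal coordinates is silently ignored rather than raising an error.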

In [26]:
def extract_images(image_path_or_data, bounding_boxes):
    """
    Crop regions from a page image based on bounding boxes.
    
    Args:
        image_path_or_data: Path to an image file, or a PIL Image object
        bounding_boxes: List of dicts whose 'bounding_box' entry is
            [ymin, xmin, ymax, xmax], normalized to a 0-1000 scale
    
    Returns:
        List of cropped PIL Image objects, one per bounding box
        (None if the image file cannot be opened)
    """
    if isinstance(image_path_or_data, str):
        try:
            original_image = Image.open(image_path_or_data)
        except Exception as e:
            print(e)
            return None
    elif isinstance(image_path_or_data, Image.Image):
        original_image = image_path_or_data
    else:
        raise ValueError("Invalid input type. Expected a path or PIL Image object.")
    
    cropped_images = []
    
    # Process each bounding box
    for i, box in enumerate(bounding_boxes):
        # Extract coordinates
        ymin, xmin, ymax, xmax = box['bounding_box']
        
        width, height = original_image.size
        xmin = int((xmin/1000) * width)
        xmax = int((xmax/1000) * width)
        ymin = int((ymin/1000) * height)
        ymax = int((ymax/1000) * height)
        
        # Crop the image
        cropped_img = original_image.crop((xmin, ymin, xmax, ymax))
        cropped_images.append(cropped_img)
    
    return cropped_images
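
The division by 1000 treats the model's coordinates as normalized to a 0-1000 range, independent of the rendered resolution. A minimal sketch of the same arithmetic on its own; `to_pixel_box` is a hypothetical helper and the page size is made up:

```python
def to_pixel_box(box, width, height, scale=1000):
    """Convert [ymin, xmin, ymax, xmax] on a 0-scale range to a PIL crop box."""
    ymin, xmin, ymax, xmax = box
    left = int(xmin / scale * width)
    upper = int(ymin / scale * height)
    right = int(xmax / scale * width)
    lower = int(ymax / scale * height)
    # PIL's Image.crop expects (left, upper, right, lower).
    return left, upper, right, lower


# Made-up box on a made-up 2000 x 4000 pixel page image.
print(to_pixel_box([250, 125, 500, 750], width=2000, height=4000))
```

Note the reordering: the model returns y first, while `Image.crop` takes x first.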

In [27]:
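def extract_all_images(file_path, bbs):
    """
    Crop every bounding box in bbs from its page of the PDF.

    Returns a list of (image, bounding_box, page_number) tuples.
    """
    # Read the page count, then close the document
    pdf_document = fitz.open(file_path)
    pdf_docu_length = pdf_document.page_count
    pdf_document.close()

    # Render every page once so boxes on the same page share one image
    pdf_images = [pdf_page_to_image(file_path, i) for i in range(pdf_docu_length)]
    all_extracted_images = []
    for bb in bbs:
        # Page numbers in bbs are treated as 1-based; the list is 0-based
        page_num = bb['page']
        corresponding_image = pdf_images[page_num - 1]
        extracted_images = extract_images(corresponding_image, [bb])
        all_extracted_images.append((extracted_images[0], bb['bounding_box'], bb['page']))
        
    return all_extracted_images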

In [28]:
def save_extracted_images(all_extracted_images, pdf_path):
    """
    Saves extracted images to a folder named after the PDF file.
    
    Args:
        all_extracted_images: List of tuples (image, bounding_box, page_number)
        pdf_path: Path to the original PDF file
    
    Returns:
        List of dictionaries with the original bounding box info plus the path to the saved image
    """
    # Create folder name from PDF filename (replace spaces with underscores, remove extension)
    pdf_name = os.path.basename(pdf_path)
    folder_name = os.path.splitext(pdf_name)[0].replace(" ", "_")
    
    # Create the folder if it doesn't exist
    os.makedirs(folder_name, exist_ok=True)
    
    # Dictionary to keep track of image counts per page
    image_counts = {}
    
    # List to store updated info with image paths
    saved_images_info = []
    
    # Process each extracted image
    for i, (image, bbox, page_num) in enumerate(all_extracted_images):
        # Initialize counter for this page if not already done
        if page_num not in image_counts:
            image_counts[page_num] = 1
        else:
            image_counts[page_num] += 1
        
        # Create filename: PageX_ImageY.png
        image_filename = f"Page{page_num}_Image{image_counts[page_num]}.png"
        image_path = os.path.join(folder_name, image_filename)
        
        # Save the image
        image.save(image_path)
        print(f"Saved image to {image_path}")
        
        # Create info dictionary with original data plus the path
        image_info = {
            "bounding_box": bbox,
            "page": page_num,
            "pdf": pdf_path,
            "corresponding_image": image_path
        }
        
        saved_images_info.append(image_info)
    
    return saved_images_info
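
The per-page numbering can be looked at in isolation. The sketch below reproduces the `PageX_ImageY.png` naming with a `defaultdict`; `image_filenames` is a hypothetical helper, and the page list is chosen to resemble a document with two figures on some pages:

```python
from collections import defaultdict


def image_filenames(page_numbers):
    """Name images PageX_ImageY.png, counting images separately per page."""
    counts = defaultdict(int)
    names = []
    for page in page_numbers:
        counts[page] += 1
        names.append(f"Page{page}_Image{counts[page]}.png")
    return names


for name in image_filenames([4, 5, 6, 6, 7, 7]):
    print(name)
```

Because the counter restarts on each page, the names stay stable for earlier pages even if the model reports more boxes on later ones.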

In [29]:
def insert_into_markdown(markdown_text, bounding_boxes_info):
    """
    Replaces bounding box placeholders in markdown text with Markdown image links.

    Args:
        markdown_text (str): The original markdown text with <bounding_box> tags.
        bounding_boxes_info (list): List of dictionaries returned by
                                     save_extracted_images, each including a
                                     'corresponding_image' path.

    Returns:
        str: The markdown text with image links inserted.
    """
    # Use an iterator to fetch the corresponding info for each match sequentially
    info_iterator = iter(bounding_boxes_info)

    # Define the replacement function for re.sub
    def replacer(match):
        try:
            # Get the next bounding box info dictionary from the iterator
            info = next(info_iterator)
            image_path = info.get("corresponding_image") # Use .get for safety

            if image_path:
                # Ensure path uses forward slashes for Markdown/HTML compatibility
                image_path_md = pathlib.Path(image_path).as_posix()
                # Extract filename without extension for alt text (e.g., "Page0_Image1")
                alt_text = os.path.splitext(os.path.basename(image_path))[0]
                # Return the Markdown image syntax
                return f"![{alt_text}]({image_path_md})"
            else:
                # If corresponding_image is missing (e.g., extraction failed)
                print(f"Warning: Missing 'corresponding_image' for a matched bounding box tag. Original tag kept: {match.group(0)}")
                return match.group(0) # Return the full matched tag
        except StopIteration:
            # If there are more tags in text than info items
            print(f"Warning: More bounding box tags found in text than info items provided. Original tag kept: {match.group(0)}")
            return match.group(0)
        except Exception as e:
            print(f"Error during replacement for tag {match.group(0)}: {e}. Original tag kept.")
            return match.group(0)

    # Same tag pattern as in parse_bbs, without the capture groups
    pattern = r"<bounding_box>\(\[[\d,\s]+\]\,\s*\d+\)</bounding_box>"

    # Perform the substitution using the replacer function
    modified_markdown = re.sub(pattern, replacer, markdown_text)

    return modified_markdown
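
The replacement pairs tags with image records purely by position. A stripped-down sketch of that idea, using `re.sub` with a callable and an iterator; `replace_in_order`, the alt text, and the file paths are invented for the example:

```python
import re

# Same tag pattern as in insert_into_markdown.
TAG = r"<bounding_box>\(\[[\d,\s]+\]\,\s*\d+\)</bounding_box>"


def replace_in_order(text, paths):
    """Replace the n-th tag with the n-th path; keep tags once paths run out."""
    remaining = iter(paths)

    def replacer(match):
        path = next(remaining, None)
        return match.group(0) if path is None else f"![figure]({path})"

    return re.sub(TAG, replacer, text)


text = (
    "A <bounding_box>([1, 2, 3, 4], 1)</bounding_box> "
    "B <bounding_box>([5, 6, 7, 8], 2)</bounding_box>"
)
print(replace_in_order(text, ["img/a.png"]))
```

Positional pairing is simple, but it assumes the list has exactly one entry per tag in document order. If `parse_bbs` skips a tag, for example after a `ValueError`, every later image would shift onto the wrong tag.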

In [30]:
bbs = parse_bbs(response.text, pdf1)
images = extract_all_images(pdf1, bbs)
saved_images_info = save_extracted_images(images, pdf1)
md = insert_into_markdown(response.text, saved_images_info)
save_to_markdown(parse_out_markdown_tildas(md), "pdf1_promptv2_images.md")

Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models\Page4_Image1.png
Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models\Page5_Image1.png
Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models\Page6_Image1.png
Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models\Page6_Image2.png
Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models\Page7_Image1.png
Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialized_Modular_Language_Models\Page7_Image2.png
Saved image to Mixture_of_Modular_Experts_Distilling_Knowledge_from_a_Multilingual_Teacher_into_Specialize

WindowsPath('pdf1_promptv2_images.md')