# CLIP-Based Image-Text Matching Pipeline

This notebook demonstrates a complete pipeline for processing, analyzing, and matching images with relevant text descriptions using the CLIP neural network. The pipeline includes:

1. **Data Acquisition**: Downloading the JSON export and images from a Label Studio project and pairing them with the XML text files already stored locally
2. **Data Analysis**: Counting and analyzing label distribution in the dataset
3. **Data Filtering**: Selecting images and text with specific labels
4. **Image Processing**: Cropping images based on annotations
5. **Text Processing**: Filtering text regions to remove captions
6. **CLIP Processing**: Using CLIP to find semantic matches between images and text

Each section below implements one step in the pipeline.

Loads necessary modules and configures authentication for Label Studio API access.

- Sets up a global debug flag to control logging verbosity
- Loads an authentication token from a configuration file
- Configures the logging system for a notebook environment
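
The token-loading step can be sketched as a small helper. This is a minimal sketch, assuming the config file is a JSON object with a `token` key, as in the cell below; the name `load_token` is hypothetical and is not part of the `download` module.

```python
import json
from pathlib import Path
from typing import Optional


def load_token(config_path: Path) -> Optional[str]:
    """Return the API token stored in a JSON config file, or None if unavailable.

    Hypothetical helper: assumes the file holds an object like {"token": "..."}.
    """
    try:
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)
    except FileNotFoundError:
        return None  # no config file has been created yet
    except json.JSONDecodeError:
        return None  # the file exists but is not valid JSON
    if not isinstance(config, dict):
        return None  # valid JSON, but not an object with keys
    return config.get("token")
```

JSON requires double quotes, so a file containing `{'token': '...'}` is rejected by `json.load` and this helper returns `None` for it.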

In [1]:
# Import the module
import download
import os
import sys
import json
from pathlib import Path
from IPython.display import display, Markdown
from tqdm import tqdm
import ipywidgets
from tqdm.notebook import tqdm as notebook_tqdm

DEBUG = False

# Load token from a config file (not in version control)
try:
    config_path = Path.home() / ".config" / "label_studio_config.json"
    with open(config_path) as f:
        config = json.load(f)
        label_studio_token = config.get("token")
except FileNotFoundError:
    display(Markdown("## ⚠️ Config file not found"))
    # The example content must be valid JSON, which requires double quotes
    display(Markdown(f'Create a file at {config_path} with the content: `{{"token": "your_token_here"}}`'))
    label_studio_token = None

# Setup logging for notebook environment
download.setup_logging(debug_mode=DEBUG, use_notebook=True)


INFO: Log level set to: INFO


Creates the directory structure needed for the pipeline:

- **downloaded_images**: Storage for images downloaded from Label Studio
- **texts**: Storage for XML files containing OCR text information
- **splitted_jsons**: Storage for individual JSON annotation files
- **output_context**: Where the final output images with context will be saved

Each directory is created if it doesn't exist already.

In [2]:

# Define custom directories (relative to current directory)
DIRS = {
    "images_dir": "downloaded_images",  # Where downloaded images will go
    "texts_dir": "texts",    # Where XML files are stored 
    "jsons_dir": "splitted_jsons",      # Where split JSON files will go
    "output_dir": "output_context"    # Where output files will go
}

# Create directories if they don't exist
for dir_name, dir_path in DIRS.items():
    os.makedirs(dir_path, exist_ok=True)
    display(Markdown(f"✓ Directory `{dir_path}` ready"))

✓ Directory `downloaded_images` ready

✓ Directory `texts` ready

✓ Directory `splitted_jsons` ready

✓ Directory `output_context` ready

Downloads the full JSON export from Label Studio containing all task annotations.

- Uses the API token for authentication
- Saves the export as "label_studio_export.json"
- Displays the number of tasks downloaded
- This export file contains annotation data for all images in the project
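
The log output of this step shows that an existing export file is loaded from disk instead of being downloaded again. A minimal sketch of that caching behaviour follows, assuming the result dictionary carries the `source` and `labels` keys used in the cell below. The function name `load_or_fetch_export`, the `fetch` callback, and the `"download"` source value are hypothetical; the real logic lives in `download.download_export_json`.

```python
import json
from pathlib import Path
from typing import Callable, List


def load_or_fetch_export(output_file: str, fetch: Callable[[], List[dict]]) -> dict:
    """Load the export from disk if present, otherwise fetch it and cache it.

    `fetch` stands in for the authenticated API request (assumption).
    """
    path = Path(output_file)
    if path.exists():
        # Reuse the cached export, as the log message "loading from disk" suggests
        with path.open(encoding="utf-8") as f:
            return {"source": "file", "labels": json.load(f)}
    tasks = fetch()
    with path.open("w", encoding="utf-8") as f:
        json.dump(tasks, f, ensure_ascii=False)
    return {"source": "download", "labels": tasks}
```

Delete the cached file to force a fresh export after annotations change in Label Studio.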

In [3]:
# Download the export.json file
result = download.download_export_json(
    token=label_studio_token,
    output_file="label_studio_export.json"
)

if "error" in result:
    display(Markdown(f"## ❌ Error\n{result['error']}"))
else:
    labels = result["labels"]
    display(Markdown(f"## Export Successful\n- Source: {result['source']}\n- Tasks: {len(labels)}"))

INFO: --- Downloading Export JSON ---
INFO: Using Label Studio URL: https://label-studio.semant.cz
INFO: Using Project ID: 16
INFO: Output file: label_studio_export.json
INFO: Export file 'label_studio_export.json' already exists, loading from disk.
INFO: Successfully loaded 5548 tasks from existing file.


## Export Successful
- Source: file
- Tasks: 5548

Downloads all annotated images from the Label Studio project:

- Only downloads images that have a corresponding XML file in the texts directory
- Skips images that already exist locally
- Shows progress with a `tqdm.notebook` progress bar
- Shows detailed statistics about the download process

This step ensures we have all the necessary image data for processing.
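
The skip rules above can be sketched without any network access. This is an illustrative sketch, assuming files are matched by base filename (the task UUID). The function `plan_downloads` is hypothetical, and the order of the two checks is an assumption; the bucket names mirror the `skipped_exists` and `skipped_no_xml` keys used in the cell below.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def plan_downloads(uuids: Iterable[str], texts_dir: str, images_dir: str) -> Dict[str, List[str]]:
    """Sort task UUIDs into download and skip buckets by inspecting local files."""
    xml_stems = {p.stem for p in Path(texts_dir).glob("*.xml")}
    image_stems = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    plan: Dict[str, List[str]] = {"to_download": [], "skipped_exists": [], "skipped_no_xml": []}
    for uuid in uuids:
        if uuid not in xml_stems:
            plan["skipped_no_xml"].append(uuid)   # no OCR text, so the image is not needed
        elif uuid in image_stems:
            plan["skipped_exists"].append(uuid)   # image already on disk
        else:
            plan["to_download"].append(uuid)
    return plan
```

On a second run every image already exists, which matches the summary below: 0 downloaded and 5548 skipped.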

In [4]:
# Run download function with custom directories
if label_studio_token:
    result = download.run_download(
        show_progress=True,
        token=label_studio_token,
        texts_dir=DIRS["texts_dir"],
        images_dir=DIRS["images_dir"]
    )
    
    # Display results as markdown
    if "error" in result:
        display(Markdown(f"## ❌ Error\n{result['error']}"))
    else:
        display(Markdown(f"""
        ## Download Results
        - Total tasks: {result['total_tasks']}
        - Downloaded: {result['downloaded']}
        - Skipped (already exist): {result['skipped_exists']}
        - Skipped (no XML): {result['skipped_no_xml']}
        - Failed: {result['failed']}
        """))
else:
    display(Markdown("## ❌ Cannot proceed without token"))

INFO: --- Running Download Mode ---
INFO: Using Labels file: /home/adrian/school/KNN/CLIP/export.json
INFO: Using Label Studio Project ID: 16
INFO: Using Label Studio Base URL: https://label-studio.semant.cz
INFO: Using Image save directory: /home/adrian/school/KNN/CLIP/downloaded_images
INFO: Using XML check directory: /home/adrian/school/KNN/CLIP/texts
INFO: Ensured image save directory exists: '/home/adrian/school/KNN/CLIP/downloaded_images'
INFO: Label Studio token loaded successfully.
INFO: Loading labels from file: '/home/adrian/school/KNN/CLIP/export.json'
INFO: Successfully loaded 5548 labels from '/home/adrian/school/KNN/CLIP/export.json'.
INFO: --- Starting Image Download Process ---
INFO: Processing 5548 tasks from labels file...


Downloading images:   0%|          | 0/5548 [00:00<?, ?task/s]

INFO: 
--- Download Summary ---
INFO: Processing completed in: 0.10 seconds
INFO: Total tasks in labels file: 5548
INFO: Tasks processed: 5548
INFO: Successfully downloaded images: 0
INFO: Skipped (image already exists): 5548
INFO: Skipped (no corresponding XML found): 0
INFO: Skipped (no UUID extracted): 0
INFO: Skipped (no/invalid/non-local image path): 0
INFO: Failed downloads/saves: 0
INFO: ------------------------



        ## Download Results
        - Total tasks: 5548
        - Downloaded: 0
        - Skipped (already exist): 5548
        - Skipped (no XML): 0
        - Failed: 0
        

Divides the master export file into individual JSON files, one per task:

- Extracts each task from the export file (`label_studio_export.json`)
- Creates a separate JSON file named with the UUID of the task
- Verifies that the images and XML files match by comparing UUIDs
- Shows progress with a `tqdm.notebook` progress bar

This step prepares the data for efficient parallel processing in later stages.
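
Both the split and the directory comparison can be sketched in a few lines. This is a sketch under stated assumptions: `split_tasks`, `compare_dirs`, and the `get_uuid` callback are hypothetical names, and how the real module extracts a UUID from a task is not shown in this notebook. The returned keys mirror those read from `split_result` and `compare_result` in the cell below.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def split_tasks(tasks: List[dict], jsons_dir: str,
                get_uuid: Callable[[dict], Optional[str]]) -> Dict[str, int]:
    """Write each task to <jsons_dir>/<uuid>.json and count the outcomes."""
    out_dir = Path(jsons_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stats = {"json_created": 0, "skipped_no_uuid": 0}
    for task in tasks:
        uuid = get_uuid(task)
        if not uuid:
            stats["skipped_no_uuid"] += 1
            continue
        with open(out_dir / f"{uuid}.json", "w", encoding="utf-8") as f:
            json.dump(task, f, ensure_ascii=False, indent=2)
        stats["json_created"] += 1
    return stats


def compare_dirs(texts_dir: str, images_dir: str) -> dict:
    """Compare two directories by base filename, ignoring extensions."""
    texts = {p.stem for p in Path(texts_dir).iterdir() if p.is_file()}
    images = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    return {
        "match": texts == images,
        "matching_count": len(texts & images),
        "texts_only_count": len(texts - images),
        "images_only_count": len(images - texts),
    }
```

A mismatch caused only by extra XML files, as in the output below, does not block later steps because every image still has its XML counterpart.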

In [5]:
# Split the export.json into individual JSON files
if label_studio_token:
    split_result = download.run_split_json(
        show_progress=True,
        labels_file="label_studio_export.json",
        jsons_dir=DIRS["jsons_dir"]
    )
    
    # Display results as markdown
    if "error" in split_result:
        display(Markdown(f"## ❌ Error\n{split_result['error']}"))
    else:
        display(Markdown(f"""
        ## JSON Split Results
        - Total tasks: {split_result['total_tasks']}
        - JSON files created: {split_result['json_created']}
        - Skipped (no UUID): {split_result['skipped_no_uuid']}
        - Failed writes: {split_result['failed_writes']}
        - Processing time: {split_result['elapsed_time']:.2f} seconds
        """))
        
        # Also run directory comparison to verify we have matching files
        compare_result = download.run_compare(
            texts_dir=DIRS["texts_dir"],
            images_dir=DIRS["images_dir"]
        )
        if compare_result["match"]:
            display(Markdown("## ✅ Images and XMLs match!"))
        else:
            display(Markdown(f"""
            ## ⚠️ Mismatch between images and XMLs
            - Files in both directories: {compare_result['matching_count']}
            - Files only in texts directory: {compare_result['texts_only_count']}
            - Files only in images directory: {compare_result['images_only_count']}
            """))
else:
    display(Markdown("## ❌ Cannot proceed without token"))

INFO: --- Running Split JSON Mode ---
INFO: Input labels file: label_studio_export.json
INFO: Output directory for split JSONs: /home/adrian/school/KNN/CLIP/splitted_jsons
INFO: Ensured JSON save directory exists: '/home/adrian/school/KNN/CLIP/splitted_jsons'
INFO: Loading labels from file: 'label_studio_export.json'
INFO: Successfully loaded 5548 labels from 'label_studio_export.json'.
INFO: Processing 5548 tasks to split from labels file...


Splitting JSONs:   0%|          | 0/5548 [00:00<?, ?file/s]

INFO: 
--- Split JSON Summary ---
INFO: Processing completed in: 6.70 seconds
INFO: Total tasks in input file: 5548
INFO: Tasks processed: 5548
INFO: Successfully created JSON files: 5548
INFO: Skipped (no UUID extracted): 0
INFO: Failed JSON writes: 0
INFO: --------------------------



        ## JSON Split Results
        - Total tasks: 5548
        - JSON files created: 5548
        - Skipped (no UUID): 0
        - Failed writes: 0
        - Processing time: 6.70 seconds
        

INFO: --- Running Compare Mode ---
INFO: Starting comparison between directories:
INFO:   Texts directory: /home/adrian/school/KNN/CLIP/texts
INFO:   Images directory: /home/adrian/school/KNN/CLIP/downloaded_images
INFO: Found 5767 unique base filenames in /home/adrian/school/KNN/CLIP/texts
INFO: Found 5548 unique base filenames in /home/adrian/school/KNN/CLIP/downloaded_images
INFO: 
--- Comparison Results ---
INFO: Matching base filenames found in both: 5548
INFO: No base filenames found only in 'downloaded_images'.
INFO: --------------------------



            ## ⚠️ Mismatch between images and XMLs
            - Files in both directories: 5548
            - Files only in texts directory: 219
            - Files only in images directory: 0
            

Analyzes the distribution of rectangle labels in the dataset:

- Counts occurrences of each rectangle label across all JSON files
- Displays results as a sorted DataFrame for easy analysis
- Provides summary statistics about the dataset composition
- Shows progress with a `tqdm.notebook` progress bar

This analysis helps identify which labels are most common and can inform filtering decisions.
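
The counting itself can be sketched with `collections.Counter`. This sketch assumes the usual Label Studio export layout, in which rectangle labels sit under `annotations[i].result[j].value.rectanglelabels`. The helper names are hypothetical and `count_json` may traverse the files differently.

```python
from collections import Counter
from typing import Iterable, Iterator


def iter_rectangle_labels(task: dict) -> Iterator[str]:
    """Yield every rectangle label found in one task.

    Assumed layout: task["annotations"][i]["result"][j]["value"]["rectanglelabels"].
    """
    for annotation in task.get("annotations", []):
        for item in annotation.get("result", []):
            yield from item.get("value", {}).get("rectanglelabels", [])


def count_labels(tasks: Iterable[dict]) -> Counter:
    """Count label instances across all tasks."""
    counts: Counter = Counter()
    for task in tasks:
        counts.update(iter_rectangle_labels(task))
    return counts
```

`count_labels(tasks).most_common()` returns the labels sorted by frequency, which corresponds to the sorted table shown below.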

In [6]:
# Import the module
import count_json
from IPython.display import display, Markdown
import pandas as pd

# Setup logging for notebook environment
count_json.setup_logging(debug_mode=DEBUG, use_notebook=True)

# Run the count_labels function with default directory
result = count_json.run_count_labels(print_table=False)  # Don't print tables in notebook

# Display results as a DataFrame
label_df = pd.DataFrame(
    list(result["label_counts"].items()), 
    columns=["Label", "Count"]
).sort_values("Count", ascending=False)

display(Markdown("## Rectangle Label Counts"))
display(label_df)

# Show summary stats
display(Markdown(f"""
## Summary Statistics
- Total files processed: {result["total_files"]}
- Files with rectangle labels: {result["processed_files"] - result["no_labels_files"]}
- Files without rectangle labels: {result["no_labels_files"]}
- Files with errors: {result["error_files"]}
- Total label instances: {result["total_labels"]}
- Unique labels found: {result["unique_labels"]}
"""))

INFO: Log level set to: INFO
INFO: Starting label count process for directory: /home/adrian/school/KNN/CLIP/jsons
INFO: Scanning for JSON files in: /home/adrian/school/KNN/CLIP/jsons
INFO: Found 5548 JSON files to process.


Analyzing JSONs:   0%|          | 0/5548 [00:00<?, ?file/s]

## Rectangle Label Counts

| | Label | Count |
|---|---|---|
| 2 | Obrázek | 7716 |
| 1 | Popis u obrázku | 4996 |
| 5 | Reklama | 2785 |
| 3 | Popis v textu | 2466 |
| 0 | Fotografie | 2090 |
| 9 | Ostatní knižní dekor | 1259 |
| 6 | Tabulka | 479 |
| 12 | Graf | 318 |
| 7 | Erb/cejch/logo/symbol | 287 |
| 14 | Schéma | 127 |



## Summary Statistics
- Total files processed: 5548
- Files with rectangle labels: 5489
- Files without rectangle labels: 59
- Files with errors: 0
- Total label instances: 23096
- Unique labels found: 24


Filters JSON, image, and XML files to keep only those with the target label:

- Identifies all JSONs containing the label "Obrázek" (Picture)
- Copies matching JSONs to the filtered_jsons directory
- Copies corresponding images to the filtered_images directory
- Copies corresponding XML files to the filtered_texts directory
- Shows a separate `tqdm.notebook` progress bar for each file type
- Updates the directory dictionary with new filtered paths

This filtering step ensures we focus only on tasks with images/illustrations.
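
The two stages of this step, label matching and file syncing by UUID, can be sketched as follows. The sketch assumes the Label Studio export layout for rectangle labels and matching by base filename. `task_labels`, `has_label`, and `sync_by_uuid` are hypothetical names, not functions of `filter_by_label`.

```python
import shutil
from pathlib import Path
from typing import Set


def task_labels(task: dict) -> Set[str]:
    """Collect the rectangle labels of one task (assumed Label Studio layout)."""
    return {
        label
        for annotation in task.get("annotations", [])
        for item in annotation.get("result", [])
        for label in item.get("value", {}).get("rectanglelabels", [])
    }


def has_label(task: dict, label: str, case_sensitive: bool = False) -> bool:
    """Check whether a task contains the label, optionally ignoring case."""
    labels = task_labels(task)
    if case_sensitive:
        return label in labels
    return label.casefold() in {name.casefold() for name in labels}


def sync_by_uuid(uuids: Set[str], src_dir: str, dst_dir: str) -> int:
    """Copy files whose base name is in `uuids`; return the number copied."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in Path(src_dir).iterdir():
        if path.is_file() and path.stem in uuids:
            shutil.copy2(path, dst / path.name)
            copied += 1
    return copied
```

`str.casefold` is used instead of `str.lower` because it handles non-ASCII labels such as "Obrázek" more robustly.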

In [7]:
# Import the enhanced filtering module
import filter_by_label
from IPython.display import display, Markdown
from tqdm.notebook import tqdm  # Import tqdm for notebooks

# Setup logging for notebook environment
filter_by_label.setup_logging(debug_mode=DEBUG, use_notebook=True)

# Run the filter process with our directory structure for all three file types
result = filter_by_label.run_filter_by_label(
    jsons_dir=DIRS["jsons_dir"],                # Source JSONs  
    images_dir=DIRS["images_dir"],              # Source images
    texts_dir=DIRS["texts_dir"],                # Source XML texts
    filtered_jsons_dir="filtered_jsons",        # Output filtered JSONs
    filtered_images_dir="filtered_images",      # Output filtered images
    filtered_texts_dir="filtered_texts",        # Output filtered XMLs
    label="Obrázek",                           # Filter by this label
    copy_files=True,                            # Copy instead of move
    case_sensitive=False                       # Case-insensitive matching
)
# Display results as markdown
display(Markdown(f"""
## Filter Results for Label: '{result["label_filtered"]}'

### JSON Files:
- With label: {result["json_matches"]}
- Without label: {result["json_non_matches"]}
- **Total processed: {result["json_matches"] + result["json_non_matches"]}**

### Image Files:
- Matching filtered JSONs: {result["image_matches"]}
- Not matching: {result["image_non_matches"]}
- **Total processed: {result["image_matches"] + result["image_non_matches"]}**

### Text Files (XML):
- Matching filtered JSONs: {result["text_matches"]}
- Not matching: {result["text_non_matches"]}
- **Total processed: {result["text_matches"] + result["text_non_matches"]}**

### Final Dataset:
- Total matching triplets (JSON+image+XML): {result["total_matching_pairs"]}
- Files were {result["file_operation"].lower()} to filtered directories
"""))

# Update our directories dictionary with the filtered directories
DIRS["filtered_jsons_dir"] = "filtered_jsons"
DIRS["filtered_images_dir"] = "filtered_images"
DIRS["filtered_texts_dir"] = "filtered_texts"

INFO: Log level set to: INFO
INFO: Starting filter and sync process for label: 'Obrázek'
INFO: JSON source directory: splitted_jsons
INFO: Images source directory: downloaded_images
INFO: Texts source directory: texts
INFO: Output directory: filtered_jsons
INFO: Scanning 5548 JSON files for label: 'Obrázek'


Filtering JSONs for 'Obrázek':   0%|          | 0/5548 [00:00<?, ?file/s]

INFO: Filtering complete: 3949 files with 'Obrázek', 1599 files without
INFO: Syncing files with 3949 UUIDs
INFO: Output directory: filtered_images
INFO: Processing 5548 jpg/jpeg/png/gif/bmp/tif/tiff files


Syncing jpg/jpeg/png/gif/bmp/tif/tiff:   0%|          | 0/5548 [00:00<?, ?file/s]

INFO: Sync complete: Copied 3949 files, Skipped 1599 files
INFO: Syncing files with 3949 UUIDs
INFO: Output directory: filtered_texts
INFO: Processing 5765 xml/txt files


Syncing xml/txt:   0%|          | 0/5765 [00:00<?, ?file/s]

INFO: Sync complete: Copied 3949 files, Skipped 1816 files



## Filter Results for Label: 'Obrázek'

### JSON Files:
- With label: 3949
- Without label: 1599
- **Total processed: 5548**

### Image Files:
- Matching filtered JSONs: 3949
- Not matching: 1599
- **Total processed: 5548**

### Text Files (XML):
- Matching filtered JSONs: 3949
- Not matching: 1816
- **Total processed: 5765**

### Final Dataset:
- Total matching triplets (JSON+image+XML): 3949
- Files were copied to filtered directories


Processes the filtered images to crop them to their annotated regions:

- For each JSON file, finds rectangles with the "Obrázek" label
- Extracts the coordinates of these rectangles
- Crops the corresponding image to these boundaries
- Saves the cropped images to a new directory
- Shows progress with a `tqdm.notebook` progress bar
- Updates the directory dictionary with the cropped images path

Cropping lets us focus only on the annotated image content and removes unnecessary background.
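
Label Studio stores rectangle coordinates as percentages of the image size, so they have to be converted to pixels before cropping. The following is a minimal sketch of that conversion, assuming each rectangle `value` has `x`, `y`, `width`, and `height` in percent. The function name is hypothetical and `trim_images` may handle rounding differently.

```python
from typing import Tuple


def percent_box_to_pixels(value: dict, image_width: int,
                          image_height: int) -> Tuple[int, int, int, int]:
    """Convert a percent-based rectangle to a (left, upper, right, lower) pixel box."""
    left = value["x"] / 100.0 * image_width
    upper = value["y"] / 100.0 * image_height
    right = left + value["width"] / 100.0 * image_width
    lower = upper + value["height"] / 100.0 * image_height
    # Round to whole pixels, then clamp so the box never leaves the image
    return (
        max(0, round(left)),
        max(0, round(upper)),
        min(image_width, round(right)),
        min(image_height, round(lower)),
    )
```

The resulting tuple follows the box convention of Pillow's `Image.crop`, so it could be passed to that method directly. One page can contain several "Obrázek" rectangles, which is why the run below produces more crops than JSON files.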

In [8]:
# Import the image cropping module
import trim_images
from IPython.display import display, Markdown

# Setup logging for notebook environment
trim_images.setup_logging(debug_mode=DEBUG, use_notebook=True)

# Run the cropping process with our directory structure
result = trim_images.run_crop_images(
    jsons_dir=DIRS["filtered_jsons_dir"],      # Use the filtered JSONs
    images_dir=DIRS["filtered_images_dir"],    # Use the filtered images 
    output_dir="cropped_images",               # Where cropped images will go
    target_label="Obrázek",                    # Label to look for
    show_progress=True                         # Show progress updates
)

# Display results as markdown
if "error" in result:
    display(Markdown(f"## ❌ Error\n{result['error']}"))
else:
    display(Markdown(f"""
    ## Image Cropping Results
    
    - **Processed Files:** {result['files_processed']} of {result['total_files_found']} JSON files
    - **Files with Errors:** {result['files_with_errors']}
    - **Crops Created:** {result['crops_created']} images
    - **Label Used:** "{result['target_label']}"
    - **Processing Time:** {result['elapsed_time']:.2f} seconds
    
    All cropped images were saved to `{result['output_dir']}`
    """))

# Update our directories dictionary with the cropped images directory
DIRS["cropped_images_dir"] = "cropped_images"

INFO: Log level set to: INFO
INFO: Starting image cropping process
INFO: Input JSON directory: /home/adrian/school/KNN/CLIP/filtered_jsons
INFO: Input Image directory: /home/adrian/school/KNN/CLIP/filtered_images
INFO: Output directory for crops: /home/adrian/school/KNN/CLIP/cropped_images
INFO: Target label for cropping: 'Obrázek'
INFO: Ensured output directory exists: /home/adrian/school/KNN/CLIP/cropped_images
INFO: Found 3949 JSON files to process


Cropping images:   0%|          | 0/3949 [00:00<?, ?file/s]

ERROR: Skipping JSON: Could not find matching image for UUID 'CB05D8B0-8E41-11DE-9080-0030487BE43A' in '/home/adrian/school/KNN/CLIP/filtered_images' mentioned in /home/adrian/school/KNN/CLIP/filtered_jsons/CB05D8B0-8E41-11DE-9080-0030487BE43A.json
ERROR: Skipping JSON: Could not find matching image for UUID 'D14F8090-1222-11DE-AF19-0030487BE43A' in '/home/adrian/school/KNN/CLIP/filtered_images' mentioned in /home/adrian/school/KNN/CLIP/filtered_jsons/D14F8090-1222-11DE-AF19-0030487BE43A.json
ERROR: Skipping JSON: Could not find matching image for UUID 'E50A16C0-F2BC-11DD-B7D3-0030487BE43A' in '/home/adrian/school/KNN/CLIP/filtered_images' mentioned in /home/adrian/school/KNN/CLIP/filtered_jsons/E50A16C0-F2BC-11DD-B7D3-0030487BE43A.json
ERROR: Skipping JSON: Could not find matching image for UUID 'C5025230-07F1-11DE-94C9-0030487BE43A' in '/home/adrian/school/KNN/CLIP/filtered_images' mentioned in /home/adrian/school/KNN/CLIP/filtered_jsons/C5025230-07F1-11DE-94C9-0030487BE43A.json
[output truncated]


    ## Image Cropping Results
    
    - **Processed Files:** 3949 of 3949 JSON files
    - **Files with Errors:** 0
    - **Crops Created:** 7709 images
    - **Label Used:** "Obrázek"
    - **Processing Time:** 459.67 seconds
    
    All cropped images were saved to `/home/adrian/school/KNN/CLIP/cropped_images`
    

Analyzes the filtered JSONs to identify images that have text descriptions:

- Looks for files with the "Popis v textu" (description in the running text) label
- Also checks for co-occurrence of "Obrázek" and "Popis v textu" labels
- Provides detailed statistics about label distribution
- Shows a sample of files that contain text descriptions
- Shows progress with a `tqdm.notebook` progress bar

This information helps us understand how many images have associated text descriptions.
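
The co-occurrence check reduces to set membership once the labels of each file are known. This is a small sketch, assuming the labels have already been collected per file. `find_pair` is a hypothetical name, and the returned keys mirror those read from `result` in the cell below.

```python
from typing import Dict, Set, Tuple


def find_pair(labels_by_file: Dict[str, Set[str]], pair: Tuple[str, str]) -> dict:
    """Find files containing the second label and count those that also have the first."""
    first, second = pair
    with_second = sorted(name for name, labels in labels_by_file.items() if second in labels)
    pair_count = sum(1 for name in with_second if first in labels_by_file[name])
    return {
        "files_with_description": with_second,
        "description_count": len(with_second),
        "pair_count": pair_count,
    }
```

Because the input directory was already filtered for "Obrázek", every file with "Popis v textu" also contains "Obrázek", so both counts in the output below are equal.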

In [9]:
# Import the description finding module
import find_label_description
from IPython.display import display, Markdown
import pandas as pd

# Setup logging for notebook environment
find_label_description.setup_logging(debug_mode=DEBUG, use_notebook=True)

# Run the analysis function with the filtered JSONs directory
result = find_label_description.run_find_descriptions(
    jsons_dir=DIRS["filtered_jsons_dir"],
    pair_to_check=["Obrázek", "Popis v textu"], 
    print_table=False  # Don't print tables in notebook
)

# Display label counts as a DataFrame
label_df = pd.DataFrame(
    list(result["label_counts"].items()), 
    columns=["Label", "Count"]
).sort_values("Count", ascending=False)

display(Markdown("## Rectangle Label Counts"))
display(label_df)

# Show summary stats
display(Markdown(f"""
## Description Analysis Results
- Total files processed: {result["total_files"]}
- Files with rectangle labels: {result["processed_files"] - result["no_labels_files"]}
- Files without rectangle labels: {result["no_labels_files"]}
- Files with errors: {result["error_files"]}
- **Files with "Popis v textu" label: {result["description_count"]}**
- Total label instances: {result["total_labels"]}
- Unique labels found: {result["unique_labels"]}
"""))

# Show pair co-occurrence if requested
if "pair_checked" in result:
    display(Markdown(f"""
    ## Label Co-occurrence
    Files containing BOTH "{result["pair_checked"][0]}" AND "{result["pair_checked"][1]}": **{result["pair_count"]}**
    """))

# Show sample of files with descriptions
if result["files_with_description"]:
    sample_files = result["files_with_description"][:10]  # Show first 10
    sample_list = "\n".join(f"- {file}" for file in sample_files)
    total = len(result["files_with_description"])
    more_note = "\n\n...(and more)" if total > 10 else ""

    # Build the Markdown without leading indentation; otherwise only the first
    # list item is indented and the list renders inconsistently
    display(Markdown(
        "## Files with Descriptions\n"
        f'Sample of files containing "Popis v textu" label ({total} total):\n\n'
        f"{sample_list}{more_note}"
    ))
else:
    display(Markdown("## No files with descriptions were found"))

# Add the description files to our DIRS dictionary
DIRS["description_files"] = result["files_with_description"] 

INFO: Log level set to: INFO
INFO: Starting analysis for directory: /home/adrian/school/KNN/CLIP/filtered_jsons
INFO: Additionally checking for co-occurrence of labels: 'Obrázek' AND 'Popis v textu'
INFO: Scanning for JSON files in: /home/adrian/school/KNN/CLIP/filtered_jsons
INFO: Found 3949 JSON files to process.


Analyzing JSON files:   0%|          | 0/3949 [00:00<?, ?file/s]

INFO: List of files with 'Popis v textu' label written to: /home/adrian/school/KNN/CLIP/popis_v_textu_files.txt


## Rectangle Label Counts

| | Label | Count |
|---|---|---|
| 0 | Obrázek | 7716 |
| 1 | Popis u obrázku | 2926 |
| 3 | Reklama | 2631 |
| 2 | Popis v textu | 1646 |
| 6 | Ostatní knižní dekor | 1236 |
| 4 | Tabulka | 414 |
| 9 | Fotografie | 274 |
| 5 | Erb/cejch/logo/symbol | 273 |
| 8 | Iniciála | 91 |
| 7 | Ozdobný nápis | 75 |



## Description Analysis Results
- Total files processed: 3949
- Files with rectangle labels: 3949
- Files without rectangle labels: 0
- Files with errors: 0
- **Files with "Popis v textu" label: 858**
- Total label instances: 17482
- Unique labels found: 23



    ## Label Co-occurrence
    Files containing BOTH "Obrázek" AND "Popis v textu": **858**
    


    ## Files with Descriptions
    Sample of files containing "Popis v textu" label (858 total):
    - 9fc3f6dd-ca34-407c-9455-44aa098fa163.json
- f0fd55b4-226f-11ea-acbe-001999480be2.json
- 2c8ddb80-ab5b-11eb-81e5-005056825209.json
- 560c97c0-5625-11e1-6101-001143e3f55c.json
- 3af14969-5d26-11e5-830c-001999480be2.json
- 771e2287-d176-4d2c-a7ae-faf8855029c6.json
- b37803a3-a675-11e6-adc0-d485646517a0.json
- d80939e4-80a8-40c2-b71e-28d76031c0ad.json
- cb68d5e1-cfb1-40cb-b33a-3dd1c048e3c8.json
- 4fac6b3f-f27f-4272-9472-f1f72d906440.json
    ...(and more)
    

Removes text regions that are explicitly labeled as picture descriptions:

- Matches "Popis u obrázku" (description next to the picture, i.e. a caption) regions in JSON files
- Identifies the corresponding text regions in XML files using IoU matching
- Removes those text regions from the XML files
- Creates new filtered XML files without description text
- Shows progress with a `tqdm.notebook` progress bar
- Updates the directory dictionary with the filtered text path

This step ensures CLIP doesn't match images with their existing captions in the text.
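
Intersection over Union (IoU) measures how much two boxes overlap: the area of their intersection divided by the area of their union. Below is a minimal sketch of IoU matching for axis-aligned boxes, assuming both the annotation rectangles and the XML text regions have been converted to `(x_min, y_min, x_max, y_max)` in the same coordinate system. The function names and the strict `>` comparison are assumptions, not the module's actual implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return intersection / (area_a + area_b - intersection)


def regions_to_remove(caption_boxes: Sequence[Box], text_boxes: Sequence[Box],
                      iou_threshold: float = 0.00005) -> List[int]:
    """Return indices of text regions that overlap any caption box above the threshold."""
    return [
        index
        for index, text_box in enumerate(text_boxes)
        if any(iou(caption, text_box) > iou_threshold for caption in caption_boxes)
    ]
```

With a threshold as low as 0.00005, almost any overlap counts as a match, so a text region is removed even when it only slightly intersects an annotated caption.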

In [10]:
# Import the picture description filtering module
import filter_picture_descriptions
from IPython.display import display, Markdown

# Setup logging for notebook environment
filter_picture_descriptions.setup_logging(debug_mode=DEBUG, use_notebook=True)

# Run the filtering process with our directory structure
result = filter_picture_descriptions.run_filter_descriptions(
    json_dir=DIRS["filtered_jsons_dir"],             # JSON files from filtered directory 
    xml_dir=DIRS["filtered_texts_dir"],              # XML files from filtered directory
    output_dir="filtered_texts_no_desc",             # Output directory for filtered XMLs
    iou_threshold=0.00005,                           # IoU threshold for matching
    show_progress=True                               # Show progress updates (ensure this is True)
)

# Display results as markdown
if "error" in result:
    display(Markdown(f"## ❌ Error\n{result['error']}"))
else:
    display(Markdown(f"""
    ## Picture Description Filtering Results
    
    - **Processed Files:** {result['processed_files']} file pairs
    - **Files with Matches:** {result['files_with_matches']}
    - **Files Copied Without Filtering:** {result['copied_without_filtering']}
    
    ### Matching Statistics:
    - Total picture description regions: {result['total_json_regions']}
    - Matched regions: {result['total_matches']}
    {f"- Match percentage: {result['match_percentage']:.2f}%" if 'match_percentage' in result else ""}
    - Text regions removed from XML: {result['regions_removed']}
    
    ### Processing Details:
    - IoU threshold used: {result['iou_threshold']}
    - Processing time: {result['elapsed_time']:.2f} seconds
    
    Filtered XML files are available in: `{result['output_dir']}`
    """))

# Update our directories dictionary with the filtered texts directory
DIRS["filtered_texts_no_desc_dir"] = "filtered_texts_no_desc"

INFO: numba available, importing jit
INFO: Log level set to: INFO
INFO: Starting picture description filtering process
INFO: Found 3949 JSON files and 9497 XML files


Filtering descriptions:   0%|          | 0/3949 [00:00<?, ?file/s]

INFO: Removed 2 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/9fc3f6dd-ca34-407c-9455-44aa098fa163.xml
INFO: Removed 1 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/8f04042d-2d06-4e1f-bd4f-a8a5abfa95f6.xml
INFO: Removed 1 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/3d2a5068-d0db-4e3c-beab-0a64a523ac25.xml
INFO: Removed 1 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/f0fd55b4-226f-11ea-acbe-001999480be2.xml
INFO: Removed 2 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/2c8ddb80-ab5b-11eb-81e5-005056825209.xml
INFO: Removed 3 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/560c97c0-5625-11e1-6101-001143e3f55c.xml
INFO: Removed 2 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/3af14969-5d26-11e5-830c-001999480be2.xml
INFO: Removed 2 matching regions from /home/adrian/school/KNN/CLIP/filtered_texts/dd58291d-fbb7-11e9-99ae-001999480be2.xml
[output truncated]


    ## Picture Description Filtering Results
    
    - **Processed Files:** 3949 file pairs
    - **Files with Matches:** 1810
    - **Files Copied Without Filtering:** 2139
    
    ### Matching Statistics:
    - Total picture description regions: 2915
    - Matched regions: 2896
    - Match percentage: 99.35%
    - Text regions removed from XML: 2784
    
    ### Processing Details:
    - IoU threshold used: 5e-05
    - Processing time: 58.72 seconds
    
    Filtered XML files are available in: `/home/adrian/school/KNN/CLIP/filtered_texts_no_desc`
    

Uses OpenAI's CLIP model to find semantic matches between images and text:

- Loads the CLIP ViT-B/32 model (handles both images and text)
- For each cropped image, computes CLIP embeddings
- For each text region in the XML files, computes CLIP embeddings
- Calculates cosine similarity between image and text embeddings
- Identifies text that semantically matches each image
- Creates visual context images showing the image with matched text
- Shows nested `tqdm.notebook` progress bars
- Displays a sample result with the matched image and text

This is the core of the pipeline that creates the final image-text matches.
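
The matching logic after the embeddings are computed can be sketched in plain Python. This sketch uses toy vectors in place of real CLIP embeddings, which in practice would come from the model's image and text encoders. The function names are hypothetical; the parameters mirror `similarity_threshold`, `best_only`, and `max_lines_context` from the cell below, and their exact semantics in `process_descriptions` are an assumption.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_text(image_vec: Sequence[float], text_vecs: Sequence[Sequence[float]],
               similarity_threshold: float = 0.25,
               best_only: bool = True) -> List[Tuple[int, float]]:
    """Return (text index, similarity) pairs at or above the threshold, best first."""
    scored = [(i, cosine_similarity(image_vec, vec)) for i, vec in enumerate(text_vecs)]
    hits = sorted((s for s in scored if s[1] >= similarity_threshold),
                  key=lambda s: s[1], reverse=True)
    return hits[:1] if best_only else hits


def context_window(n_lines: int, index: int, max_lines_context: int = 3) -> range:
    """Line indices covering the match plus up to N lines above and below it."""
    start = max(0, index - max_lines_context)
    stop = min(n_lines, index + max_lines_context + 1)
    return range(start, stop)
```

An image whose best similarity falls below the threshold yields an empty list, which corresponds to the "Images below threshold" counter in the results summary.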

In [None]:
# Import the process descriptions module
import process_descriptions
from IPython.display import display, Markdown, Image
import os
from tqdm.notebook import tqdm  # Import tqdm for notebooks

# Set up logging for notebook environment
process_descriptions.setup_logging(debug_mode=DEBUG, use_notebook=True, log_to_file=False)

# Notify the user that this step may take a while
display(Markdown("## ⚙️ Running CLIP model - this may take several minutes..."))

# Run the processing function with our directory structure
result = process_descriptions.run_process_descriptions(
    json_dir=DIRS["filtered_jsons_dir"],      # Use filtered JSONs
    images_dir=DIRS["cropped_images_dir"],    # Use cropped images 
    texts_dir=DIRS["filtered_texts_no_desc_dir"],  # Use filtered texts without descriptions
    output_dir=DIRS["output_dir"],            # Where output files will go
    similarity_threshold=0.25,                # Threshold for text matching
    max_lines_context=3,                      # Include 3 lines above/below matches
    max_ids=10,                               # Process 10 IDs
    model_name="ViT-B/32",                    # CLIP model to use
    best_only=True,                           # Only use best matching text
    show_progress=True,                       # Use tqdm.notebook for progress bars
    verbose=False                             # Disable verbose output for cleaner logs
)

# Display results as markdown
if "error" in result:
    display(Markdown(f"## ❌ Error\n{result['error']}"))
else:
    display(Markdown(f"""
    ## CLIP Text-Image Matching Results
    
    ### Processing Summary:
    - Total IDs processed: {result['summary']['total_ids']}
    - Successful matches: {result['summary']['successful_ids']}
    - Success rate: {result['summary']['success_rate']:.1f}%
    - Total images processed: {result['summary']['total_images_processed']}
    - Images below threshold: {result['summary']['images_below_threshold']}
    
    ### Configuration:
    - Model used: {result['config']['model']} on {result['config']['device']}
    - Similarity threshold: {result['config']['similarity_threshold']}
    - Max lines context: {result['config']['max_lines_context']}
    
    ### Performance:
    - Total processing time: {result['summary']['elapsed_time']:.2f} seconds
    - Average time per ID: {result['summary']['average_time_per_id']:.2f} seconds
    """))
    
    # Show a sample image if any were successful
    successful_results = [r for r in result['details'] if r.get('success', False)]
    if successful_results:
        sample = successful_results[0]
        display(Markdown(f"### Sample Result: ID {sample['id']}"))
        sample_path = os.path.join(DIRS["output_dir"], sample['output_file'])
        if os.path.exists(sample_path):
            display(Image(filename=sample_path, width=800))
            display(Markdown(f"- Context blocks: {sample['context_blocks']}"))
            display(Markdown(f"- Processing time: {sample['time']:.2f} seconds"))
        else:
            display(Markdown(f"Image file not found: {sample_path}"))
    else:
        display(Markdown("No successful results to display"))
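
The matching function called above is not defined in this notebook, so the exact semantics of `similarity_threshold`, `best_only`, and `max_lines_context` are an assumption here. The sketch below shows one plausible reading: candidates are scored by cosine similarity between embedding vectors, anything below the threshold is dropped, `best_only` keeps just the top hit, and the context window spans a fixed number of lines on either side of the match. The names `cosine_similarity`, `select_matches`, and `context_window` are hypothetical helpers, not part of the pipeline module.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_matches(image_vec, text_vecs, threshold=0.25, best_only=True):
    """Return (index, score) pairs scoring at or above threshold, best first.

    Whether the real pipeline uses >= or > at the threshold is an assumption.
    """
    scored = [(i, cosine_similarity(image_vec, t)) for i, t in enumerate(text_vecs)]
    kept = [(i, s) for i, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:1] if best_only else kept


def context_window(n_lines, match_idx, max_lines_context=3):
    """Indices of the matched line plus up to max_lines_context lines on each side."""
    start = max(0, match_idx - max_lines_context)
    end = min(n_lines, match_idx + max_lines_context + 1)
    return range(start, end)


if __name__ == "__main__":
    image = [1.0, 0.0]
    texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(select_matches(image, texts))                   # top hit only
    print(select_matches(image, texts, best_only=False))  # every hit above threshold
    print(list(context_window(10, 1)))                    # window clipped at the start
```

With real CLIP embeddings the vectors would come from the model's image and text encoders; the toy vectors here only illustrate the filtering logic.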

INFO: Log level set to: INFO

## ⚙️ Running CLIP model - this may take several minutes...

INFO: Using device: cuda

INFO: Loading CLIP model: ViT-B/32

INFO: Model loaded in 4.34 seconds

INFO: Starting processing with ViT-B/32 model

INFO: Scanning for JSONs containing 'Obrázek' label...

Checking JSONs for Obrázek label:   0%|          | 0/3949 [00:00<?, ?it/s]

INFO: Limiting to 10 IDs out of 3949 total

INFO: Found 10 IDs to process

INFO: STARTING PROCESSING

INFO: Configuration: threshold=0.25, model=ViT-B/32

INFO: Directories: json=filtered_jsons, images=cropped_images, texts=filtered_texts_no_desc

Processing IDs:   0%|          | 0/10 [00:00<?, ?ID/s]

INFO: Processing ID 1/10: 9fc3f6dd-ca34-407c-9455-44aa098fa163

ERROR: XML file not found: filtered_texts_no_desc/filtered_9fc3f6dd-ca34-407c-9455-44aa098fa163.xml

INFO: Detailed info for ID 9fc3f6dd-ca34-407c-9455-44aa098fa163

INFO: Processed 1/10 IDs, success rate: 0.0%


INFO: Processing ID 2/10: 8f04042d-2d06-4e1f-bd4f-a8a5abfa95f6

ERROR: XML file not found: filtered_texts_no_desc/filtered_8f04042d-2d06-4e1f-bd4f-a8a5abfa95f6.xml

INFO: Detailed info for ID 8f04042d-2d06-4e1f-bd4f-a8a5abfa95f6

INFO: Processed 2/10 IDs, success rate: 0.0%


INFO: Processing ID 3/10: 469772d1-435e-11dd-b505-00145e5790ea

ERROR: XML file not found: filtered_texts_no_desc/filtered_469772d1-435e-11dd-b505-00145e5790ea.xml

INFO: Detailed info for ID 469772d1-435e-11dd-b505-00145e5790ea

INFO: Processed 3/10 IDs, success rate: 0.0%


INFO: Processing ID 4/10: 3d2a5068-d0db-4e3c-beab-0a64a523ac25

ERROR: XML file not found: filtered_texts_no_desc/filtered_3d2a5068-d0db-4e3c-beab-0a64a523ac25.xml

INFO: Detailed info for ID 3d2a5068-d0db-4e3c-beab-0a64a523ac25

INFO: Processed 4/10 IDs, success rate: 0.0%


INFO: Processing ID 5/10: c26fd9c9-5997-4718-9300-4a3aee9d3e09

ERROR: XML file not found: filtered_texts_no_desc/filtered_c26fd9c9-5997-4718-9300-4a3aee9d3e09.xml

INFO: Detailed info for ID c26fd9c9-5997-4718-9300-4a3aee9d3e09

INFO: Processed 5/10 IDs, success rate: 0.0%


INFO: Processing ID 6/10: d69341c9-4909-11eb-bff6-001999480be2

ERROR: XML file not found: filtered_texts_no_desc/filtered_d69341c9-4909-11eb-bff6-001999480be2.xml

INFO: Detailed info for ID d69341c9-4909-11eb-bff6-001999480be2

INFO: Processed 6/10 IDs, success rate: 0.0%


INFO: Processing ID 7/10: cd20e381-9a2f-4539-a79a-122c5bfdfad3

ERROR: XML file not found: filtered_texts_no_desc/filtered_cd20e381-9a2f-4539-a79a-122c5bfdfad3.xml

INFO: Detailed info for ID cd20e381-9a2f-4539-a79a-122c5bfdfad3

INFO: Processed 7/10 IDs, success rate: 0.0%


INFO: Processing ID 8/10: 80f746a1-11ad-4de6-97ab-181d03beba67

ERROR: XML file not found: filtered_texts_no_desc/filtered_80f746a1-11ad-4de6-97ab-181d03beba67.xml

INFO: Detailed info for ID 80f746a1-11ad-4de6-97ab-181d03beba67

INFO: Processed 8/10 IDs, success rate: 0.0%


INFO: Processing ID 9/10: f0fd55b4-226f-11ea-acbe-001999480be2

ERROR: XML file not found: filtered_texts_no_desc/filtered_f0fd55b4-226f-11ea-acbe-001999480be2.xml

INFO: Detailed info for ID f0fd55b4-226f-11ea-acbe-001999480be2

INFO: Processed 9/10 IDs, success rate: 0.0%


INFO: Processing ID 10/10: 5fc7663d-7b57-407d-93e5-6465109d303a

ERROR: XML file not found: filtered_texts_no_desc/filtered_5fc7663d-7b57-407d-93e5-6465109d303a.xml

INFO: Detailed info for ID 5fc7663d-7b57-407d-93e5-6465109d303a

INFO: Processed 10/10 IDs, success rate: 0.0%






INFO: PROCESSING COMPLETE

INFO: Total IDs processed: 10

INFO: Successful context images created: 0

INFO: Total images processed: 0

INFO: Images below threshold: 0

INFO: Success rate: 0.00%

INFO: Total processing time: 0.09 seconds

INFO: Average time per ID: 0.01 seconds

INFO: Detailed results saved to output_context/processing_results.json



## CLIP Text-Image Matching Results

### Processing Summary:
- Total IDs processed: 10
- Successful matches: 0
- Success rate: 0.0%
- Total images processed: 0
- Images below threshold: 0

### Configuration:
- Model used: ViT-B/32 on cuda
- Similarity threshold: 0.25
- Max lines context: 3

### Performance:
- Total processing time: 0.09 seconds
- Average time per ID: 0.01 seconds

No successful results to display
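
Every ID in the run above failed with `XML file not found`, so no image was ever scored (total processing time was 0.09 seconds). A pre-flight check before loading the model can surface this kind of mismatch earlier. The sketch below assumes the `filtered_<id>.xml` naming that appears in the error messages; `find_missing_text_files` is a hypothetical helper, not part of the pipeline module.

```python
import tempfile
from pathlib import Path


def find_missing_text_files(ids, texts_dir, prefix="filtered_", suffix=".xml"):
    """Return the IDs whose expected XML file is absent from texts_dir.

    The prefix and suffix defaults mirror the paths in the error log;
    they are an assumption about the naming scheme, not a confirmed API.
    """
    texts_dir = Path(texts_dir)
    return [i for i in ids if not (texts_dir / f"{prefix}{i}{suffix}").is_file()]


if __name__ == "__main__":
    # Demo on a throwaway directory: one ID has its file, the other does not.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "filtered_abc.xml").write_text("<root/>")
        print(find_missing_text_files(["abc", "def"], tmp))
```

If such a check reports every ID as missing, the text-filtering step may have written its files under a different name or directory than the CLIP step expects, which would be worth verifying before rerunning.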