# Curate .yml file types

This notebook identifies entries in `nfdi4bioimage.yml` that lack a `type` tag and checks whether they contain presentation slides. It considers only entries that link to Zenodo and inspects each record for PowerPoint files or PDFs whose first page is landscape-oriented. Entries identified as slides are updated with `type: Slides`.

### Setup

First, we import the required libraries. We use `ruamel.yaml` to load and save the YAML file while preserving its formatting and comments, `requests` for API calls to Zenodo, and `pypdf` to inspect PDF files.

In [1]:
try:
    from ruamel.yaml import YAML
except ImportError:
    !pip install ruamel.yaml
    from ruamel.yaml import YAML

try:
    import pypdf
except ImportError:
    !pip install pypdf
    import pypdf

import re
import requests
import io
from pypdf import PdfReader

Successfully installed ruamel.yaml-0.18.14 ruamel.yaml.clib-0.2.12
Successfully installed pypdf-5.9.0


### Load Data

We load the `nfdi4bioimage.yml` file from the `resources` directory.

In [2]:
yml_path = '../resources/nfdi4bioimage.yml'
yaml_loader = YAML()
yaml_loader.preserve_quotes = True
yaml_loader.indent(mapping=2, sequence=4, offset=2)

with open(yml_path, 'r', encoding='utf-8') as f:
    data = yaml_loader.load(f)

DuplicateKeyError: while constructing a mapping
  in "../resources/nfdi4bioimage.yml", line 10607, column 3
found duplicate key "description" with value "Raw microscopy image from the NFDI4Bioimage calendar March 2025.
The image shows 125x magnified microscopic details of a biofilm formed by Pseudomonas fluorescence on the surface of a liquid culture medium. The culture was inoculated with a cellulose-overexpressing and surface-colonizing mScarlet-tagged wild type and a GFP-tagged mutant that is unable to colonize the surface. The biofilm can collapse over time due to its own mass, so that new strategies have to be developed and thus a life cycle emerges.
Image Metadata (using REMBI template):
Study
Study description: Biofilm formation
Study Component
Imaging method: Stereo microscopy
Biosample
Biological entity: Bacteria
Organism: Pseudomonas fluorescence
Specimen
Signal/contrast mechanism: Relief, fluorescence
Channel 1 - content: Relief, grey
Channel 1 - biological entity: Details of the biofilm in transmitted light
Channel 2 - content: mScarlet, red
Channel 2 - biological entity: WT over-expressing cellulose and colonizing the surface
Channel 3 - content: GFP, green
Channel 3 - biological entity: ∆wss mutant unable to colonize the surface
Image Acquisition
Microscope model: Zeiss Axio Zoom V16
Image Data
Magnification: 125x
Objective: PlanNeoFluar Z 1.0x
Dimension extents: x: 2752, y: 2208
Pixel size description: 0.91 µm x 0.91 µm
Image area: 2500µm x 2500µm" (original value: "Today, research data is often stored in many different places, difficult to find and only available for a limited time. Base4NFDI creates the basis for better findability, accessibility, interoperability ans reusability of research data. For this purpose, common, technical services are developed together with experts for the data in the different research disciplines. Since many scientific fields have similar requirements for research data, Base4NFDI supports common solutions to avoid parallel developments. Already existing services are thereby adapted or extended to be usable for researchers from other disciplines.
This slides were presented from Sonja Schimmler at the first Conference on Research Data Infrastructure (CoRDI 2023) on September 14, 2023 in Karlsruhe, Germany. The presentation will give an insight into the idea and structure of Base4NFDI, the processes for developing basic services and the services currently funded.")
  in "../resources/nfdi4bioimage.yml", line 10644, column 3

To suppress this check see:
    https://yaml.dev/doc/ruamel.yaml/api/#Duplicate_keys
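
The traceback above means the load stops because the mapping that starts at line 10607 contains a second `description` key at line 10644, so the remaining cells cannot run until the file is repaired or the check is suppressed. Repairing the file is likely the safer route, since the last step of this notebook rewrites it. The following is a minimal sketch for locating such duplicates with the standard library only. It assumes that every entry starts with `- ` in column 1 and that its keys sit in column 3, as the column numbers in the error suggest; `find_duplicate_keys` and the sample text are hypothetical and not part of the notebook.

```python
import re

# Matches a key either on the dash line ("- name: ...") or on a line indented
# by two spaces ("  description: ..."). The layout is an assumption.
KEY_RE = re.compile(r"^(?P<lead>- |  )(?P<key>[A-Za-z_][\w-]*):(?:\s|$)")

def find_duplicate_keys(lines):
    """Return (line_no, key, first_line_no) for keys repeated within one entry."""
    seen = {}
    duplicates = []
    for line_no, line in enumerate(lines, start=1):
        match = KEY_RE.match(line)
        if not match:
            continue  # continuation lines, nested items, comments
        if match.group("lead") == "- ":
            seen = {}  # a dash in column 1 starts a new entry
        key = match.group("key")
        if key in seen:
            duplicates.append((line_no, key, seen[key]))
        else:
            seen[key] = line_no
    return duplicates

# Made-up sample with the same shape as the resource list.
sample = """resources:
- name: Entry A
  description: first text
  url: https://example.org/a
  description: second text
- name: Entry B
  description: only one
""".splitlines()

duplicates = find_duplicate_keys(sample)
for line_no, key, first in duplicates:
    print(f"line {line_no}: duplicate key {key!r} (first seen on line {first})")
```

To scan the real file, the open file object can be passed directly, for example `find_duplicate_keys(open(yml_path, encoding='utf-8'))`, because the function only needs an iterable of lines.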


### Identify Candidate Entries

We find all entries that are missing a `type` (or have an empty `type`) and contain a link to Zenodo. These are the candidates we will inspect automatically.

In [None]:
entries_to_check = []
for entry in data['resources']:
    if 'type' not in entry or not entry['type']:
        urls = entry.get('url', [])
        if not isinstance(urls, list):
            urls = [urls]
        
        for url in urls:
            if re.search(r'zenodo\.org/(?:record|records)/', url):
                entries_to_check.append(entry)
                break

print(f"Found {len(entries_to_check)} Zenodo entries to check for slides.")
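
The filter above only recognizes URLs of the form `zenodo.org/record/<id>` or `zenodo.org/records/<id>`. Zenodo DOIs have the form `10.5281/zenodo.<number>`, so entries that link through `doi.org` are skipped. The sketch below shows one way this could be widened; `extract_zenodo_id` and `ZENODO_PATTERNS` are hypothetical names, and whether every DOI number resolves through the records API (for example a DOI that stands for all versions of a record) is not verified here.

```python
import re

# Hypothetical patterns: the first mirrors the notebook's regex, the second
# covers Zenodo DOIs, which have the form 10.5281/zenodo.<number>.
ZENODO_PATTERNS = (
    re.compile(r"zenodo\.org/records?/(\d+)"),
    re.compile(r"doi\.org/10\.5281/zenodo\.(\d+)"),
)

def extract_zenodo_id(url):
    """Return the numeric Zenodo ID found in a URL, or None."""
    if not isinstance(url, str):
        return None  # tolerate empty or non-string YAML values
    for pattern in ZENODO_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None

sample_urls = [
    "https://zenodo.org/record/1234567",
    "https://zenodo.org/records/7654321",
    "https://doi.org/10.5281/zenodo.1111111",
    "https://example.org/slides.pdf",
    None,
]
extracted = [extract_zenodo_id(url) for url in sample_urls]
print(extracted)
```

The `isinstance` guard also avoids a `TypeError` from `re.search` when a `url` field is empty in the YAML file.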

### Define Helper Functions

These functions will help us interact with the Zenodo API to check for slide-related files.

In [None]:
def get_record_id(url):
    """Extracts the Zenodo record ID from a URL."""
    match = re.search(r'zenodo\.org/(?:record|records)/(\d+)', url)
    return match.group(1) if match else None

def contains_powerpoint(record_id):
    """Checks a Zenodo record for .ppt or .pptx files via its API."""
    api_url = f'https://zenodo.org/api/records/{record_id}'
    try:
        response = requests.get(api_url, timeout=30)  # avoid hanging forever on a stalled connection
        response.raise_for_status()
        record_data = response.json()
        
        for file_info in record_data.get('files', []):
            if file_info['key'].lower().endswith(('.pptx', '.ppt')):
                return True
    except requests.RequestException as e:
        print(f"Could not fetch Zenodo record {record_id}: {e}")
    return False

def contains_landscape_pdf(record_id):
    """Checks a Zenodo record for any landscape-oriented PDF files."""
    api_url = f'https://zenodo.org/api/records/{record_id}'
    try:
        response = requests.get(api_url, timeout=30)  # avoid hanging forever on a stalled connection
        response.raise_for_status()
        record_data = response.json()

        for file_info in record_data.get('files', []):
            if file_info['key'].lower().endswith('.pdf'):
                pdf_url = file_info['links']['self']
                pdf_response = requests.get(pdf_url, timeout=60)  # the whole PDF is downloaded
                if pdf_response.ok:
                    with io.BytesIO(pdf_response.content) as pdf_file:
                        reader = PdfReader(pdf_file)
                        if len(reader.pages) > 0:
                            page = reader.pages[0]
                            if page.mediabox.width > page.mediabox.height:
                                return True
    except Exception as e:
        print(f"Could not process PDFs for record {record_id}: {e}")
    return False
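
Both helpers above request the same record metadata, and the landscape test compares only the media box of the first page. The sketch below separates the decision logic from the network access so it can be tried without Zenodo: a record's file listing is classified once, PowerPoint names are checked before any PDF is touched, and page rotation is taken into account. `looks_like_slides`, `is_landscape`, and the fake page sizes are hypothetical. Feeding a PDF page's rotation into `is_landscape` would be one possible refinement and is not what the notebook currently does.

```python
POWERPOINT_SUFFIXES = (".pptx", ".ppt")

def is_landscape(width, height, rotation=0):
    """Return True if a page is wider than tall once its rotation is applied."""
    if rotation % 180 == 90:
        width, height = height, width  # a quarter turn swaps the sides
    return width > height

def looks_like_slides(files, pdf_is_landscape):
    """Decide from one file listing whether a record contains slides.

    `files` is a list of dicts with a 'key' entry, as in the helpers above;
    `pdf_is_landscape` is a caller-supplied function that inspects one PDF entry.
    """
    names = [file_info["key"].lower() for file_info in files]
    if any(name.endswith(POWERPOINT_SUFFIXES) for name in names):
        return True  # no PDF needs to be downloaded in this case
    return any(
        pdf_is_landscape(file_info)
        for file_info in files
        if file_info["key"].lower().endswith(".pdf")
    )

# Stand-in for the download-and-inspect step: (width, height, rotation),
# with made-up page sizes in points.
fake_sizes = {
    "talk.pdf": (960, 540, 0),
    "paper.pdf": (595, 842, 0),
    "scan.pdf": (595, 842, 90),
}

def fake_pdf_check(file_info):
    return is_landscape(*fake_sizes[file_info["key"]])

results = {
    "pptx": looks_like_slides([{"key": "Deck.PPTX"}], fake_pdf_check),
    "landscape": looks_like_slides([{"key": "talk.pdf"}], fake_pdf_check),
    "portrait": looks_like_slides([{"key": "paper.pdf"}, {"key": "data.csv"}], fake_pdf_check),
    "rotated": looks_like_slides([{"key": "scan.pdf"}], fake_pdf_check),
}
print(results)
```

Passing the PDF check in as a function keeps the classification testable offline; in the notebook it would wrap the download and `PdfReader` steps shown above.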

### Process Entries and Update Type

Now we iterate through the entries. For each candidate, we check its Zenodo record for PowerPoint files or PDFs whose first page is landscape-oriented. If either is found, we set the `type` to `Slides`.

**Note:** This step involves network requests and may take some time to complete.

In [None]:
modified_entries_count = 0
for entry in data['resources']:
    if 'type' not in entry or not entry['type']:
        urls = entry.get('url', [])
        if not isinstance(urls, list):
            urls = [urls]

        for url in urls:
            record_id = get_record_id(url)
            if record_id:
                print(f"Checking '{entry['name']}' (Record: {record_id})...")
                if contains_powerpoint(record_id) or contains_landscape_pdf(record_id):
                    print(f"  -> Flagging '{entry['name']}' as Slides.")
                    entry['type'] = 'Slides'
                    modified_entries_count += 1
                    break
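
A landscape first page is only a heuristic: any PDF with a wide first page, such as a poster, would be flagged as well. One option is to collect the proposed changes first and review them before touching `data`. The sketch below shows that dry-run pattern on made-up entries; `propose_slide_entries` and the fake checker are hypothetical, and the checker stands in for the combined PowerPoint and PDF test.

```python
import re

def propose_slide_entries(entries, record_is_slides):
    """Return untyped entries whose Zenodo record is judged to be slides.

    Nothing is modified; `record_is_slides` is a caller-supplied function
    that takes a record ID string and returns True or False.
    """
    proposals = []
    for entry in entries:
        if entry.get("type"):
            continue  # already typed, nothing to propose
        urls = entry.get("url", [])
        if not isinstance(urls, list):
            urls = [urls]
        for url in urls:
            match = re.search(r"zenodo\.org/records?/(\d+)", str(url))
            if match and record_is_slides(match.group(1)):
                proposals.append(entry)
                break
    return proposals

# Made-up entries and a fake checker, so the example runs without network access.
entries = [
    {"name": "Talk A", "url": "https://zenodo.org/records/111"},
    {"name": "Dataset B", "url": ["https://example.org/b", "https://zenodo.org/record/222"]},
    {"name": "Typed C", "type": "Slides", "url": "https://zenodo.org/records/333"},
    {"name": "No link D"},
]

def fake_record_is_slides(record_id):
    return record_id == "111"

proposals = propose_slide_entries(entries, fake_record_is_slides)
proposed_names = [entry["name"] for entry in proposals]
print(proposed_names)
```

After reviewing the list, the change can be applied with a plain loop that sets `entry['type'] = 'Slides'` on each proposal, since the proposals are the same objects as in the original list.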

### Save Changes

Finally, if any entries were modified, we write the updated data structure back to the `nfdi4bioimage.yml` file. This will overwrite the original file.

In [None]:
if modified_entries_count > 0:
    with open(yml_path, 'w', encoding='utf-8') as f:
        yaml_loader.dump(data, f)
    print(f"\nSuccessfully updated {modified_entries_count} entries in {yml_path}.")
else:
    print("\nNo entries were updated.")
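
Because the save step overwrites the original file, keeping a copy beforehand makes it easy to compare or roll back. The following is a minimal sketch using only the standard library; `backup_file` and the `.bak` suffix are hypothetical choices, and the example works on a temporary file rather than the real `nfdi4bioimage.yml`.

```python
import shutil
import tempfile
from pathlib import Path

def backup_file(path):
    """Copy `path` to a sibling file with '.bak' appended and return the new path."""
    path = Path(path)
    backup_path = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup_path)  # copy2 also preserves timestamps
    return backup_path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nfdi4bioimage.yml"
    target.write_text("resources: []\n", encoding="utf-8")

    backup = backup_file(target)  # take the copy before overwriting
    target.write_text("resources:\n- name: changed\n", encoding="utf-8")

    backup_name = backup.name
    backup_text = backup.read_text(encoding="utf-8")

print(backup_name)
```

In the notebook, `backup_file(yml_path)` would go directly before the `open(yml_path, 'w', ...)` call. If the file is under version control, `git diff` serves the same purpose.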