### Introduction
This notebook analyzes a YAML file (`resources/nfdi4bioimage.yml`) to extract every URL it contains, count them, and report any URL that occurs more than once.

In [1]:
import yaml
from collections import Counter
from pathlib import Path

### Load the YAML file
Load the content of the YAML file into a Python data structure.

In [2]:
yml_file = Path("../resources/nfdi4bioimage.yml")
with open(yml_file, "r", encoding="utf-8") as file:  # explicit encoding avoids platform-dependent defaults
    data = yaml.safe_load(file)

### Define a function to extract URLs
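This function recursively walks the nested structure (dictionaries and lists) and collects every value stored under a `url` key, whether that value is a single string or a list of strings.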

In [3]:
def extract_urls(data):
    """Recursively collect the values stored under every "url" key (a string or a list)."""
    urls = []
    if isinstance(data, dict):
        for key, value in data.items():
            if key == "url":
                if isinstance(value, list):
                    urls.extend(value)
                elif isinstance(value, str):
                    urls.append(value)
            else:
                urls.extend(extract_urls(value))
    elif isinstance(data, list):
        for item in data:
            urls.extend(extract_urls(item))
    return urls
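
To make the traversal concrete, here is a minimal sketch that runs the same function on a small hand-written structure. The `sample` dictionary and its `example.org` URLs are hypothetical and are not taken from `nfdi4bioimage.yml`; they only illustrate the three cases the function handles (a string value, a list value, and a `url` key nested one level deeper).

```python
def extract_urls(data):
    """Recursively collect the values stored under every "url" key."""
    urls = []
    if isinstance(data, dict):
        for key, value in data.items():
            if key == "url":
                if isinstance(value, list):
                    urls.extend(value)
                elif isinstance(value, str):
                    urls.append(value)
            else:
                urls.extend(extract_urls(value))
    elif isinstance(data, list):
        for item in data:
            urls.extend(extract_urls(item))
    return urls


# Hypothetical sample data, made up for illustration only.
sample = {
    "resources": [
        {"name": "Entry A", "url": "https://example.org/a"},
        {"name": "Entry B", "url": ["https://example.org/b", "https://example.org/a"]},
        {"name": "Entry C", "nested": {"url": "https://example.org/c"}},
    ]
}

sample_urls = extract_urls(sample)
print(sample_urls)
# ['https://example.org/a', 'https://example.org/b', 'https://example.org/a', 'https://example.org/c']
```

Note that a `url` value that is neither a string nor a list (for example a nested dictionary) is skipped rather than recursed into.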

### Extract URLs and identify duplicates
Extract all URLs with `extract_urls` and count how often each one occurs. A URL counts as a duplicate if it appears more than once; the comparison is an exact string match.

In [4]:
urls = extract_urls(data)
url_counts = Counter(urls)
duplicates = [url for url, count in url_counts.items() if count > 1]
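
Because the check above compares exact strings, two URLs that differ only in letter case of the host or in a trailing slash are treated as distinct. The following is a rough sketch of one possible follow-up check, not part of the original analysis: `normalize_url` and `find_near_duplicates` are hypothetical helpers, and the normalization rules (lowercase scheme and host, drop trailing slash and fragment) are assumptions that may be too aggressive or too lenient for some sites.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical helper: lowercase scheme and host, drop trailing slash and fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def find_near_duplicates(urls):
    """Group the original URLs by normalized form and keep groups with more than one entry."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize_url(url), []).append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical sample URLs, made up for illustration only.
sample_urls = [
    "https://Example.org/docs/",
    "https://example.org/docs",
    "https://example.org/other",
]

near_duplicates = find_near_duplicates(sample_urls)
for normalized, members in near_duplicates.items():
    print(normalized, "<-", members)
```

Exact duplicates also end up in the same group, so this check is a superset of the exact-match comparison.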

### Print the results
Display the total number of URLs, the number of distinct URLs that occur more than once, and those duplicate URLs themselves.

In [5]:
print(f"Total URLs found: {len(urls)}")
print(f"Number of duplicate URLs: {len(duplicates)}")
print("Duplicate URLs:")
for duplicate in duplicates:
    print(duplicate)

Total URLs found: 769
Number of duplicate URLs: 0
Duplicate URLs:
