<a href="https://colab.research.google.com/github/Kolo-Naukowe-Axion/Angiography/blob/main/dataset_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#ARCADE

The ARCADE dataset (Automatic Region-based Coronary Artery Disease Diagnostics using X-ray Angiography) is a large-scale, expert-annotated resource designed to accelerate the development and benchmarking of AI models for coronary artery disease (CAD) diagnostics.

**link:**
https://www.kaggle.com/datasets/nikitamanaenkov/annotated-x-ray-angiography-dataset

**paper:**
https://www.nature.com/articles/s41597-023-02871-z


**Technical Specifications**

Volume: 3,000 anonymized X-ray coronary angiography (XCA) frames.

Resolution: 512 × 512 pixels.

Equipment: High-quality imaging obtained via Philips Azurion 3 and Siemens Artis Zee angiographs.
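
The stated resolution can be checked without extra dependencies. The sketch below reads the width and height from the PNG header (the IHDR chunk stores both as big-endian 32-bit integers at byte offsets 16 and 20) and tallies the sizes found under a directory. The helper names `png_size` and `size_histogram` are illustrative and not part of any dataset tooling, and the sketch assumes the frames are stored as `.png` files.

```python
import struct
from collections import Counter
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} is not a valid PNG file")
    return struct.unpack(">II", header[16:24])

def size_histogram(root):
    """Count how many PNG files under root have each (width, height)."""
    return Counter(png_size(p) for p in Path(root).rglob("*.png"))
```

If every frame matches the specification, `size_histogram(path)` should contain the single key `(512, 512)`.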

**Patient cohort**

The study cohort consists of patients with suspected CAD whose clinical data are available at the Research Institute of Cardiology and Internal Diseases, Almaty, Kazakhstan. The cohort comprises 1,500 patients with a mean age of 45.8 and a median age of 60.0: 57% men (youngest 21, oldest 85) and 43% women (youngest 19, oldest 90).

**Annotation Process**

Tool: Computer Vision Annotation Tool (CVAT), a web-based tool for pixel-level annotation.

Verification: Multi-step process with experienced cardiologists; final consensus reached by two senior experts to ensure high-quality data.

##EDA for ARCADE

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("nikitamanaenkov/annotated-x-ray-angiography-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'annotated-x-ray-angiography-dataset' dataset.
Path to dataset files: /kaggle/input/annotated-x-ray-angiography-dataset
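
Before parsing any annotation files, it can help to see how the downloaded archive is laid out. The following is a minimal sketch that lists each directory (relative to the dataset root) with the number of files it directly contains. `summarize_tree` and its `max_depth` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path

def summarize_tree(root, max_depth=4):
    """Return (relative_dir, file_count) pairs for directories up to max_depth."""
    root = Path(root)
    rows = []
    for d in sorted(p for p in root.rglob("*") if p.is_dir()):
        rel = d.relative_to(root)
        if len(rel.parts) <= max_depth:
            # Count only files that sit directly in this directory
            n_files = sum(1 for f in d.iterdir() if f.is_file())
            rows.append((rel.as_posix(), n_files))
    return rows

# Example, using the `path` returned by kagglehub above:
# for rel_dir, n_files in summarize_tree(path):
#     print(f"{rel_dir:<60} {n_files}")
```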


In [None]:
import json
import pandas as pd
from pathlib import Path

# path = "/content/annotated-x-ray-angiography-dataset"

def run_split_aware_eda(base_path):
    base_path = Path(base_path)
    json_files = list(base_path.rglob('*.json'))

    stats = []
    total_imgs_all = 0
    total_anns_all = 0

    # Process each JSON file separately to preserve the train/val/test split
    for ann_path in sorted(json_files):
        # The file stem (e.g. train, val, test) serves as the split name
        split_name = ann_path.stem

        with open(ann_path, 'r') as f:
            data = json.load(f)

        df_images = pd.DataFrame(data.get('images', []))
        df_anns = pd.DataFrame(data.get('annotations', []))
        df_cats = pd.DataFrame(data.get('categories', []))

        num_images = len(df_images)
        num_annots = len(df_anns)
        num_classes = len(df_cats)

        # Count the PNG files in a subfolder of base_path named after the split.
        # This assumes that folder layout; if the folder is missing, report "N/A".
        img_folder = base_path / split_name
        files_on_disk = len(list(img_folder.glob('*.png'))) if img_folder.exists() else "N/A"

        stats.append({
            "Split": split_name,
            "Images (JSON)": num_images,
            "Files on disk (.png)": files_on_disk,
            "Annotations (objects)": num_annots,
            "Classes": num_classes
        })

        # Accumulate totals across all annotation files
        total_imgs_all += num_images
        total_anns_all += num_annots

    # Build the summary table
    df_final = pd.DataFrame(stats)

    print("🔍 DETAILED DATA SPLIT REPORT")
    print("-" * 80)
    print(df_final.to_string(index=False))
    print("-" * 80)
    print(f"TOTAL: {total_imgs_all} images, {total_anns_all} annotations.")

    # Class distribution, aggregated over all JSON files together
    print("\n--- CLASS DISTRIBUTION (Main categories) ---")
    all_cat_names = []
    for ann_path in json_files:
        with open(ann_path, 'r') as f:
            d = json.load(f)
            cats = {c['id']: c['name'] for c in d.get('categories', [])}
            for a in d.get('annotations', []):
                all_cat_names.append(cats.get(a['category_id'], "unknown"))

    class_dist = pd.Series(all_cat_names).value_counts()
    print(class_dist.to_string())

run_split_aware_eda(path)

🔍 DETAILED DATA SPLIT REPORT
--------------------------------------------------------------------------------
Split  Images (JSON) Files on disk (.png)  Annotations (objects)  Classes
 test            300                  N/A                    386       26
train           1000                  N/A                   1625       26
  val            200                  N/A                    406       26
 test            300                  N/A                   1672       26
train           1000                  N/A                   4976       26
  val            200                  N/A                   1168       26
--------------------------------------------------------------------------------
TOTAL: 3000 images, 10233 annotations.

--- CLASS DISTRIBUTION (Main categories) ---
st
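
The report lists each split name twice because several JSON files share the stems `test`, `train`, and `val`. One way to tell the rows apart is to label each file by its location relative to the dataset root. The sketch below does that with two hypothetical helpers. The sibling `images` folder used in `count_pngs_near` is an assumed layout that the output above does not confirm, so the function returns `None` when that folder is missing.

```python
from pathlib import Path

def split_label(ann_path, base_path):
    """Label an annotation file by its path relative to the dataset root."""
    rel = Path(ann_path).relative_to(base_path)
    parent = rel.parent.as_posix()
    return rel.stem if parent == "." else f"{parent}/{rel.stem}"

def count_pngs_near(ann_path):
    """Count PNGs in '<split>/images' next to '<split>/annotations' (assumed layout)."""
    img_dir = Path(ann_path).parent.parent / "images"
    if not img_dir.is_dir():
        return None
    return len(list(img_dir.glob("*.png")))
```

Using `split_label(ann_path, base_path)` in place of `ann_path.stem` would give every row of the table a distinct name.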

#CADICA

The CADICA (Coronary Artery Disease ICA) dataset is a specialized medical imaging collection designed to advance the development of AI-driven diagnostic tools for cardiovascular diseases. CADICA dataset images were acquired at Hospital Universitario Virgen de la Victoria, Málaga, Spain.

**link:**
 https://data.mendeley.com/datasets/p9bpx9ctcv/5

**paper:**
https://onlinelibrary.wiley.com/doi/10.1111/exsy.13708

**Technical Specifications**

Volume: 18,154 annotated frames extracted from 141 coronary angiography videos.

Resolution: 512 x 512 pixels.

Equipment: High-quality imaging obtained via Siemens Artis Zee angiographic system.

Frame Rate: 10 frames per second.

Radiation Dose: 5–50 mGy per sequence.

**Patient Cohort**

The study cohort consists of patients undergoing invasive coronary angiography (ICA) for suspected coronary artery disease (CAD).

Total Patients: 42 anonymized individuals.

Data Organization: Organized into patient-specific folders (e.g., p1, p2) containing video-specific subdirectories (v1, v2).
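
Because frames from one patient are highly correlated, it is common to keep all frames of a patient in the same split. The following sketch recovers patient and video ids from the folder names described above (`p1`, `v1`, and so on) and groups file paths by patient. The function names are illustrative, and the regular expressions assume the folders are named exactly `p<number>` and `v<number>`.

```python
import re
from collections import defaultdict
from pathlib import PurePath

def patient_and_video(path):
    """Extract ('p1', 'v2')-style ids from the folder names in a path."""
    patient = video = None
    for part in PurePath(path).parts:
        if patient is None and re.fullmatch(r"p\d+", part):
            patient = part
        elif patient is not None and video is None and re.fullmatch(r"v\d+", part):
            video = part
    return patient, video

def group_by_patient(paths):
    """Map each patient id to the list of paths found under that patient."""
    groups = defaultdict(list)
    for p in paths:
        patient, _ = patient_and_video(p)
        groups[patient].append(p)
    return dict(groups)
```

Paths without a recognizable patient folder end up under the key `None`, which makes them easy to spot.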

**Annotation Process**

Classification: Each frame is labeled as either containing a lesion or being a non-lesion (normal) frame.

Object Detection: Precise bounding boxes identify the exact location of stenotic lesions within the coronary arteries.

Expert Verification: Annotations were performed and validated by professional cardiologists to ensure a high-quality "gold standard" for AI training.

Metadata: Includes clinical information regarding the X-ray projection angles (e.g., RAO, LAO, Cranial, Caudal).

##EDA for CADICA

In [None]:
from pathlib import Path

# Path to your data
path = "/content/sysu_ancad_data"

def run_bulletproof_cadica_eda(base_path):
    base_path = Path(base_path)

    # 1. Find ALL .png and .txt files in the whole directory tree
    print("Scanning directory... please wait.")
    all_png_paths = list(base_path.rglob('*.png'))
    # Filter the .txt files so that README files and empty files are not counted
    all_txt_paths = [f for f in base_path.rglob('*.txt') if 'README' not in f.name.upper() and f.stat().st_size > 0]

    # Map file stems (names without extension) to paths for fast lookup
    png_names = {f.stem: f for f in all_png_paths}
    txt_names = {f.stem: f for f in all_txt_paths}

    # 2. Pair images with annotations
    annotated_images_count = 0
    total_stenosis_instances = 0

    for stem in png_names.keys():
        if stem in txt_names:
            annotated_images_count += 1
            # Count the non-empty lines in the matching .txt file (stenosis instances)
            try:
                with open(txt_names[stem], 'r') as f:
                    lines = [l.strip() for l in f.readlines() if l.strip()]
                    total_stenosis_instances += len(lines)
            except (OSError, UnicodeDecodeError):
                # Skip annotation files that cannot be read
                continue

    # 3. Generate the report
    print(f"\n🔍 FINAL VERIFIED REPORT (CADICA)")
    print("-" * 50)

    print(f"--- 1. FILE FORMATS & QUANTITY ---")
    print(f"Total .png files found: {len(all_png_paths)}")
    print(f"Total .txt files found: {len(all_txt_paths)}")

    print(f"\n--- 2. SAMPLES & STENOSIS OVERVIEW ---")
    print(f"Total unique image samples: {len(all_png_paths)}")
    print(f"Images WITH identified stenosis: {annotated_images_count}")
    print(f"Images WITHOUT stenosis (background): {len(all_png_paths) - annotated_images_count}")
    print(f"Do all samples have stenosis? {'Yes' if len(all_png_paths) == annotated_images_count else 'No'}")

    print(f"\n--- 3. CLASS DISTRIBUTION (Total Instances) ---")
    print(f"{'category_name':<20} {'count'}")
    # Each line in a matching .txt file is treated as one stenosis instance
    print(f"{'stenosis':<20} {total_stenosis_instances}")

run_bulletproof_cadica_eda(path)

Scanning directory... please wait.

🔍 FINAL VERIFIED REPORT (CADICA)
--------------------------------------------------
--- 1. FILE FORMATS & QUANTITY ---
Total .png files found: 31500
Total .txt files found: 4452

--- 2. SAMPLES & STENOSIS OVERVIEW ---
Total unique image samples: 31500
Images WITH identified stenosis: 3996
Images WITHOUT stenosis (background): 27504
Do all samples have stenosis? No

--- 3. CLASS DISTRIBUTION (Total Instances) ---
category_name        count
stenosis             6161
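
The counts in this report translate into a few ratios that describe class imbalance. The short calculation below uses only the numbers printed above; the variable names are chosen here for illustration.

```python
total_images = 31500
images_with_stenosis = 3996
stenosis_instances = 6161

# Share of images that carry at least one stenosis annotation
positive_fraction = images_with_stenosis / total_images
# Average number of stenosis instances per annotated image
instances_per_positive = stenosis_instances / images_with_stenosis
# Number of background images for every annotated image
background_per_positive = (total_images - images_with_stenosis) / images_with_stenosis

print(f"Images with stenosis: {positive_fraction:.1%}")                   # 12.7%
print(f"Instances per annotated image: {instances_per_positive:.2f}")    # 1.54
print(f"Background images per annotated image: {background_per_positive:.2f}")  # 6.88
```

Roughly one image in eight is annotated, so a detector trained on all frames sees about seven background images for every positive one.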


#MENDELEY
This dataset presents a collection of angiographic imaging series from 100 patients with confirmed one-vessel coronary artery disease, acquired using Siemens Coroscop and GE Innova systems. The study follows established 2018 ESC/EACTS clinical guidelines and was conducted at the Research Institute for Complex Problems of Cardiovascular Diseases in Russia under full ethical approval.

**link:**
https://data.mendeley.com/datasets/ydrm75xywg/1

**paper:**
https://www.nature.com/articles/s41598-021-87174-2



**Technical Specifications**

Volume: 8,325 grayscale images extracted from coronary angiography cine series.

Resolution: Ranging from 512 x 512 to 1000 x 1000 pixels.

Equipment: Imaging obtained via Coroscop (Siemens) and Innova (GE Healthcare) surgery systems.

**Patient Cohort**

Total number of patients is 100 individuals. All patients had confirmed one-vessel coronary artery disease (≥70% diameter stenosis or 50–69% with FFR ≤ 0.80).
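
The inclusion rule above can be written as a small predicate. This is only an illustration of the stated thresholds, not the clinical protocol used by the authors. The function name and arguments are hypothetical, and the 50–69% band is read here as 50 ≤ stenosis < 70.

```python
def meets_inclusion_criterion(diameter_stenosis_pct, ffr=None):
    """Return True if a lesion satisfies the stated criterion.

    diameter_stenosis_pct: diameter stenosis in percent.
    ffr: fractional flow reserve, or None if it was not measured.
    """
    if diameter_stenosis_pct >= 70:
        return True
    if 50 <= diameter_stenosis_pct < 70:
        # Intermediate lesions qualify only with FFR at or below 0.80
        return ffr is not None and ffr <= 0.80
    return False
```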


**Annotation Process**

Tool: Labeled using LabelBox (SaaS version) for object detection tasks.

Methodology: Stenotic regions are identified with bounding boxes, with additional categorization based on area (Small, Medium, Large).

Expert Verification: Presence or absence of stenosis was confirmed by a single professional operator according to 2018 ESC/EACTS Guidelines.

Object Size Distribution: Objects categorized by size: 30% small (area < 32²), 69% medium (32² ≤ area ≤ 96²), and 1% large (area > 96²).
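
The size categories can be reproduced from bounding-box dimensions. The sketch below reads the thresholds as 32² = 1024 and 96² = 9216 square pixels, following the COCO object-size convention; that reading is an assumption, as is the function name `size_category`.

```python
SMALL_MAX_AREA = 32 ** 2    # 1024 square pixels (assumed threshold)
MEDIUM_MAX_AREA = 96 ** 2   # 9216 square pixels (assumed threshold)

def size_category(width, height):
    """Classify a bounding box as 'small', 'medium', or 'large' by its area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"
```

For example, a 40 × 50 box has an area of 2000 and falls in the medium bucket.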

##EDA for MENDELEY

In [None]:
import os
from pathlib import Path
from collections import Counter

# Path from your Colab environment
path = "/content/sysu_ancad_data"

def run_final_formatted_eda(base_path):
    base_path = Path(base_path)

    # 1. SCANNING ALL FILES
    all_files = list(base_path.rglob('*'))
    files_only = [f for f in all_files if f.is_file()]
    extensions = Counter([f.suffix.lower() for f in files_only])

    # 2. MATCHING IMAGES WITH ANNOTATIONS
    png_files = [f for f in all_files if f.suffix.lower() == '.png']
    txt_files = [f for f in all_files if f.suffix.lower() == '.txt' and 'README' not in f.name.upper()]

    # Create maps for pairing
    png_map = {f.stem: f for f in png_files}
    txt_map = {f.stem: f for f in txt_files}

    total_instances = 0
    images_with_stenosis = 0

    for stem in png_map.keys():
        if stem in txt_map:
            try:
                with open(txt_map[stem], 'r') as f:
                    lines = [l for l in f.readlines() if l.strip()]
                    if lines:
                        images_with_stenosis += 1
                        total_instances += len(lines)
            except (OSError, UnicodeDecodeError):
                continue

    # --- FINAL FORMATTED OUTPUT ---

    print(f"--- 1. FILE FORMATS & QUANTITY ---")
    print(f"Total number of files: {len(files_only)}")
    # List formats in requested order or most common
    for ext in ['.txt', '.json', '.png']:
        if ext in extensions:
            print(f"Format {ext}: {extensions[ext]} files")

    print(f"\n--- 2. SAMPLES & STENOSIS OVERVIEW ---")
    print(f"Total unique image samples: {len(png_files)}")
    print(f"Images WITH identified stenosis: {images_with_stenosis}")
    print(f"Images WITHOUT stenosis (background/healthy): {len(png_files) - images_with_stenosis}")
    print(f"Do all samples have stenosis? {'Yes' if len(png_files) == images_with_stenosis else 'No'}")

    print(f"\n--- 3. CLASS DISTRIBUTION (Total Instances) ---")
    print(f"{'category_name':<20} {'count'}")
    print(f"{'stenosis':<20} {total_instances}")

run_final_formatted_eda(path)

--- 1. FILE FORMATS & QUANTITY ---
Total number of files: 36234
Format .txt: 4463 files
Format .json: 1 files
Format .png: 31500 files

--- 2. SAMPLES & STENOSIS OVERVIEW ---
Total unique image samples: 31500
Images WITH identified stenosis: 3996
Images WITHOUT stenosis (background/healthy): 27504
Do all samples have stenosis? No

--- 3. CLASS DISTRIBUTION (Total Instances) ---
category_name        count
stenosis             6161
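
The three formats listed in section 1 add up to 35,964 files, which leaves 270 of the 36,234 files unaccounted for, because the report prints only `.txt`, `.json`, and `.png`. The sketch below tallies every extension instead. `extension_counts` is a hypothetical helper, and files without an extension are grouped under the placeholder `<none>`.

```python
from collections import Counter
from pathlib import Path

def extension_counts(root):
    """Count files under root by lower-cased extension."""
    return Counter(
        f.suffix.lower() or "<none>"
        for f in Path(root).rglob("*")
        if f.is_file()
    )

# Files not covered by the three formats in the report above
reported_total = 36234
listed = 4463 + 1 + 31500
print(reported_total - listed)  # 270

# Example, using the `path` defined above:
# for ext, n in extension_counts(path).most_common():
#     print(f"{ext:<10} {n}")
```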
