<a href="https://colab.research.google.com/github/Kolo-Naukowe-Axion/Angiography/blob/main/dataset_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#ARCADE

The ARCADE dataset (Automatic Region-based Coronary Artery Disease Diagnostics using X-ray Angiography) is a large-scale, expert-annotated resource designed to accelerate the development and benchmarking of AI models for coronary artery disease (CAD) diagnostics.

**link:**
https://www.kaggle.com/datasets/nikitamanaenkov/annotated-x-ray-angiography-dataset

**paper:**
https://www.nature.com/articles/s41597-023-02871-z


**Technical Specifications**

Volume: 3,000 anonymized X-ray coronary angiography (XCA) frames.

Resolution: 512 × 512 pixels.

Equipment: High-quality imaging obtained via Philips Azurion 3 and Siemens Artis Zee angiographs.

**Patient cohort**

The study cohort consists of patients with suspected CAD whose clinical data are available at the Research Institute of Cardiology and Internal Diseases, Almaty, Kazakhstan. The cohort comprises 1,500 patients with a mean age of 45.8 and a median age of 60.0; 57% are men (youngest 21, oldest 85) and 43% are women (youngest 19, oldest 90).

**Annotation Process**

Tool: Computer Vision Annotation Tool (CVAT), a web-based tool used here for pixel-level annotation.

Verification: Multi-step process with experienced cardiologists; final consensus reached by two senior experts to ensure high-quality data.

##EDA for ARCADE

In [33]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("nikitamanaenkov/annotated-x-ray-angiography-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'annotated-x-ray-angiography-dataset' dataset.
Path to dataset files: /kaggle/input/annotated-x-ray-angiography-dataset


In [51]:
import os
import json
import pandas as pd
from pathlib import Path

# Path retrieved from kagglehub
# path = kagglehub.dataset_download("nikitamanaenkov/annotated-x-ray-angiography-dataset")

def run_split_aware_eda_v3(base_path):
    base_path = Path(base_path)

    json_files = list(base_path.rglob('*.json'))

    if not json_files:
        print(f" No JSON files found in path: {base_path}")
        print("Folder content:", os.listdir(base_path))
        return

    stats = []
    total_imgs_all = 0
    total_anns_all = 0
    all_cat_names = []

    for ann_path in sorted(json_files):
        split_name = ann_path.stem

        try:
            with open(ann_path, 'r') as f:
                data = json.load(f)
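        except (OSError, json.JSONDecodeError) as e:
            # Report unreadable or malformed files instead of skipping them silently
            print(f"Skipping {ann_path}: {e}")
            continue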

        df_images = pd.DataFrame(data.get('images', []))
        df_anns = pd.DataFrame(data.get('annotations', []))
        df_cats = pd.DataFrame(data.get('categories', []))

        num_images = len(df_images)
        num_annots = len(df_anns)
        num_classes = len(df_cats)

        if not df_cats.empty:
            cats_dict = {c['id']: c['name'] for c in data.get('categories', [])}
            for a in data.get('annotations', []):
                all_cat_names.append(cats_dict.get(a['category_id'], "unknown"))

        img_folder = ann_path.parent / split_name
        if not img_folder.exists():
            img_folder = base_path / "images" / split_name

        files_on_disk = len(list(img_folder.glob('*.png'))) if img_folder.exists() else "N/A"

        stats.append({
            "Dataset (Split)": split_name,
            "Images (JSON)": num_images,
            "Files on disk (.png)": files_on_disk,
            "Annotations (Objects)": num_annots,
            "Classes": num_classes
        })

        total_imgs_all += num_images
        total_anns_all += num_annots

    df_final = pd.DataFrame(stats)

    print("\n DETAILED DATA SPLIT REPORT")
    print("-" * 85)
    if not df_final.empty:
        print(df_final.to_string(index=False))
    else:
        print("Error: Could not process data into the table.")
    print("-" * 85)
    print(f"TOTAL: {total_imgs_all} images, {total_anns_all} annotations.")

    print("\n--- CLASS DISTRIBUTION (Main categories) ---")
    if all_cat_names:
        class_dist = pd.Series(all_cat_names).value_counts()
        print(class_dist.to_string())
    else:
        print("No class data available.")

run_split_aware_eda_v3(path)


 DETAILED DATA SPLIT REPORT
-------------------------------------------------------------------------------------
Dataset (Split)  Images (JSON) Files on disk (.png)  Annotations (Objects)  Classes
           test            300                  N/A                    386       26
          train           1000                  N/A                   1625       26
            val            200                  N/A                    406       26
           test            300                  N/A                   1672       26
          train           1000                  N/A                   4976       26
            val            200                  N/A                   1168       26
-------------------------------------------------------------------------------------
TOTAL: 3000 images, 10233 annotations.

--- CLASS DISTRIBUTION (Main categories) ---
stenosis    2417
5            850
6            850
2            541
1            540
3            535
11           509
7      
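
In the report above, each split name appears twice, presumably once per annotation task, so the rows cannot be told apart. A minimal sketch of one way to label them follows. It assumes the layout `<root>/<task>/<split>/annotations/<split>.json`, which this notebook does not confirm, and `task_and_split` is a hypothetical helper name.

```python
from pathlib import Path

def task_and_split(ann_path):
    """Return (task, split) for an annotation file.

    Assumption: the file lives at <root>/<task>/<split>/annotations/<split>.json.
    If the path does not match that layout, the task is reported as "unknown".
    """
    ann_path = Path(ann_path)
    split = ann_path.stem
    parts = ann_path.parts
    # The folder above "annotations" is the split; the one above that is the task.
    if len(parts) >= 4 and parts[-2] == "annotations":
        return parts[-4], split
    return "unknown", split

# Hypothetical path, for illustration only
print(task_and_split("arcade/syntax/train/annotations/train.json"))
```

Using the returned pair instead of `ann_path.stem` alone as the row label would keep the two groups of splits separate in the table.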

#CADICA

The CADICA (Coronary Artery Disease ICA) dataset is a specialized medical imaging collection designed to advance the development of AI-driven diagnostic tools for cardiovascular diseases. CADICA dataset images were acquired at Hospital Universitario Virgen de la Victoria, Málaga, Spain.

**link:**
 https://data.mendeley.com/datasets/p9bpx9ctcv/5

**paper:**
https://onlinelibrary.wiley.com/doi/10.1111/exsy.13708

**Technical Specifications**

Volume: 18 154 annotated frames extracted from 141 coronary angiography videos.

Resolution: 512 x 512 pixels.

Equipment: High-quality imaging obtained via Siemens Artis Zee angiographic system.

Frame Rate: 10 frames per second.

Radiation Dose: 5–50 mGy per sequence.

**Patient Cohort**

The study cohort consists of patients undergoing invasive coronary angiography (ICA) for suspected coronary artery disease (CAD).

Total Patients: 42 anonymized individuals.

Data Organization: Organized into patient-specific folders (e.g., p1, p2) containing video-specific subdirectories (v1, v2).

**Annotation Process**

Classification: Each frame is labeled as either containing a lesion or being a non-lesion (normal) frame.

Object Detection: Precise bounding boxes identify the exact location of stenotic lesions within the coronary arteries.

Expert Verification: Annotations were performed and validated by professional cardiologists to ensure a high-quality "gold standard" for AI training.

Metadata: Includes clinical information regarding the X-ray projection angles (e.g., RAO, LAO, Cranial, Caudal).
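
The bounding-box annotations described above can be read with a few lines of Python. This is only a sketch: it assumes each ground-truth line has the form `x y w h label` separated by whitespace, which should be checked against the actual files. `parse_gt_lines` and the label `lesion_a` are hypothetical names.

```python
def parse_gt_lines(text):
    """Parse bounding boxes from the text of one ground-truth file.

    Assumption: each non-empty line is "x y w h label", whitespace-separated.
    """
    boxes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        x, y, w, h = (int(float(v)) for v in fields[:4])
        label = fields[4] if len(fields) > 4 else None
        boxes.append({"x": x, "y": y, "w": w, "h": h, "label": label})
    return boxes

# Hypothetical file content, for illustration only
print(parse_gt_lines("10 20 30 40 lesion_a\n"))
```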

##EDA for CADICA

In [21]:
import os
import glob
import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

download_link = "https://data.mendeley.com/public-api/zip/p9bpx9ctcv/download/5"


print("downloading file...")
!wget -O outer_archive.zip "$download_link"

print("unzipping...")
!unzip -q outer_archive.zip


!rm outer_archive.zip

# finding all the zip folders
inner_zips = glob.glob('**/*.zip', recursive=True)

if inner_zips:
    inner_zip_path = inner_zips[0]
    print(f"found inner file: {inner_zip_path}")
    print("more unzipping...")
    !unzip -q "{inner_zip_path}"

    !rm "{inner_zip_path}"
    print("all layers done")
else:
    print("zip not found")


downloading file...
--2026-02-19 20:14:10--  https://data.mendeley.com/public-api/zip/p9bpx9ctcv/download/5
Resolving data.mendeley.com (data.mendeley.com)... 162.159.130.86, 162.159.133.86
Connecting to data.mendeley.com (data.mendeley.com)|162.159.130.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/p9bpx9ctcv-5.zip?X-Amz-Security-Token=IQoJb3JpZ2luX2VjELz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCWV1LXdlc3QtMSJGMEQCIEac%2Fo083CdVH4vADgBQNMAAdi1cK0X8DZCpksBUHSt7AiAHdnqj8johV5hZeo%2BTv7N36cPKUEIfooXccP%2BjILxIqiqVBQiF%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAQaDDM2NzE0NzM4MzgyNSIMO3CrjcMHZWQuB%2BeFKukEUlYL8XGipb5tpLFoeqGoZMzXZLjOQeV%2FBPcA374iojHKjvTzuN030UJiyQTgy0iPSaI3L5HE0n8kTga1UxE4oh1N%2F3siHTQOIG%2BxJ1FeW7z9oNgiL2h1w%2F%2FKQO60vKpP08p4WxVE96xOaVVyYjeK0UB101qI6LfydUF5GUwMU9bTSD1Mie7f3bH3Ar1Q5r8WWhpayo%2B1%2BqJmUp%2F4n%2FzV65vC1GSAP%2FYpAtzGZGvJbHRHmy1kM8pWJEWbONGIjYIJGGDXyjlagn7eQwg%2BWYhtNJ8s6avy

In [50]:
import os
from pathlib import Path
from collections import defaultdict

def final_cadica_eda_fixed(root_path):
    root = Path(root_path)
    if (root / "CADICA").exists():
        root = root / "CADICA"

    stats = defaultdict(lambda: {'input': 0, 'gt': 0})


    for input_folder in root.rglob('input'):
        if not input_folder.is_dir(): continue

        p_name = input_folder.parent.parent.name
        v_name = input_folder.parent.name
        vid_id = f"{p_name}_{v_name}"

        imgs = list(input_folder.glob('*.png'))
        stats[vid_id]['input'] = len(imgs)


        parent = input_folder.parent
        gt_folder = None
        for potential_name in ['groundtruth', 'groundTruth', 'GT', 'gt']:
            if (parent / potential_name).exists():
                gt_folder = parent / potential_name
                break

        if gt_folder:
            annos = list(gt_folder.glob('*.*'))
            stats[vid_id]['gt'] = len(annos)

    v_with_steno = sum(1 for v in stats.values() if v['gt'] > 0)
    v_total = len(stats)
    total_imgs = sum(v['input'] for v in stats.values())
    total_annos = sum(v['gt'] for v in stats.values())

    print("\n" + "="*55)
    print("FINAL REPORT")
    print("="*55)
    print(f"TOTAL VIDEOS:                 {v_total}")
    print(f"VIDEOS WITH STENOSIS (w/ GT): {v_with_steno}")
    print(f"VIDEOS WITHOUT STENOSIS:      {v_total - v_with_steno}")
    print("-" * 55)
    print(f"TOTAL IMAGES (Input):         {total_imgs}")
    print(f"TOTAL ANNOTATIONS (GT Files): {total_annos}")
    print("="*55)

    if v_with_steno > 0:
        example_vid = [k for k, v in stats.items() if v['gt'] > 0][0]
        print(f"Example ID with stenosis: {example_vid}")
    else:
        print("error")
        !ls -R {root}/selectedVideos/p7/ | head -n 15

final_cadica_eda_fixed("/content")


FINAL REPORT
TOTAL VIDEOS:                 668
VIDEOS WITH STENOSIS (w/ GT): 269
VIDEOS WITHOUT STENOSIS:      399
-------------------------------------------------------
TOTAL IMAGES (Input):         31500
TOTAL ANNOTATIONS (GT Files): 4265
Example ID with stenosis: p7_v9
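
The per-video counts can also be rolled up per patient, which may help when planning splits that keep all of a patient's frames together. The sketch below assumes video IDs of the form `p7_v9`, as built in the function above. `aggregate_by_patient` is a hypothetical helper and the sample counts are invented for illustration.

```python
from collections import defaultdict

def aggregate_by_patient(video_stats):
    """Sum per-video counts into per-patient totals.

    video_stats maps IDs such as "p7_v9" to {"input": n_frames, "gt": n_gt_files}.
    """
    per_patient = defaultdict(lambda: {"videos": 0, "input": 0, "gt": 0})
    for vid_id, counts in video_stats.items():
        patient = vid_id.split("_")[0]  # "p7_v9" -> "p7"
        per_patient[patient]["videos"] += 1
        per_patient[patient]["input"] += counts["input"]
        per_patient[patient]["gt"] += counts["gt"]
    return dict(per_patient)

# Hypothetical counts, for illustration only
example = {
    "p7_v9": {"input": 60, "gt": 12},
    "p7_v10": {"input": 55, "gt": 0},
    "p8_v1": {"input": 40, "gt": 5},
}
print(aggregate_by_patient(example))
```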


#MENDELEY
This dataset presents a collection of angiographic imaging series from 100 patients with confirmed one-vessel coronary artery disease, acquired using Siemens Coroscop and GE Innova systems. The study follows established 2018 ESC/EACTS clinical guidelines and was conducted at the Research Institute for Complex Problems of Cardiovascular Diseases in Russia under full ethical approval.

**link:**
https://data.mendeley.com/datasets/ydrm75xywg/1

**paper:**
https://www.nature.com/articles/s41598-021-87174-2



**Technical Specifications**

Volume: 8,325 grayscale images extracted from coronary angiography cine series.

Resolution: Ranging from 512 x 512 to 1000 x 1000 pixels.

Equipment: Imaging obtained via Coroscop (Siemens) and Innova (GE Healthcare) surgery systems.

**Patient Cohort**

Total number of patients is 100 individuals. All patients had confirmed one-vessel coronary artery disease (≥70% diameter stenosis or 50–69% with FFR ≤ 0.80).


**Annotation Process**

Tool: Labeled using LabelBox (SaaS version) for object detection tasks.

Methodology: Stenotic regions are identified with bounding boxes, with additional categorization based on area (Small, Medium, Large).

Expert Verification: Presence or absence of stenosis was confirmed by a single professional operator according to 2018 ESC/EACTS Guidelines.

Size Distribution: Objects are categorized by bounding-box area in pixels: 30% small (area < 32²), 69% medium (32² ≤ area ≤ 96²), and 1% large (area > 96²).
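
The size categories can be reproduced with a small helper. This sketch assumes the thresholds are the COCO-style pixel areas 32² = 1,024 and 96² = 9,216. `size_category` is a hypothetical name, not part of the dataset tooling.

```python
def size_category(width, height, small_max=32 ** 2, medium_max=96 ** 2):
    """Classify a bounding box as small, medium, or large by its pixel area.

    Assumption: thresholds follow the COCO convention (32^2 and 96^2 pixels).
    """
    area = width * height
    if area < small_max:
        return "small"
    if area <= medium_max:
        return "medium"
    return "large"

# Hypothetical box sizes, for illustration only
for w, h in [(10, 10), (40, 40), (100, 100)]:
    print(w, h, size_category(w, h))
```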

In [47]:
import os
import pandas as pd
from pathlib import Path
from collections import Counter

# 1. DOWNLOADING DATA (corrected direct link)
# This link forces the download of the specific SYSU-ANCAD file
download_url = "https://data.mendeley.com/public-files/datasets/ydrm75xywg/files/e05a41c5-6219-4d07-ba9d-d203ec718991/file_downloaded"

print("📥 Downloading SYSU-ANCAD data...")
!wget -q --show-progress -O dataset.zip "{download_url}"

print("📦 Unpacking...")
!unzip -q -o dataset.zip -d /content/sysu_ancad_data
print("✅ Done.")



📥 Downloading SYSU-ANCAD data...
dataset.zip           0%[                    ]   1.20M  2.36MB/s               ^C
📦 Unpacking...
[dataset.zip]
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of dataset.zip or
        dataset.zip.zip, and cannot find dataset.zip.ZIP, period.
✅ Done.
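
The log above shows the download being interrupted (`^C`) and `unzip` failing, yet the final success message is still printed. A minimal sketch of a guard that validates the archive before extracting, using only the standard-library `zipfile` module, could look like this. `safe_extract` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path

def safe_extract(archive_path, target_dir):
    """Extract archive_path into target_dir only if it is a readable ZIP file.

    Returns True on success and False if the archive is missing or damaged.
    """
    archive_path = Path(archive_path)
    if not archive_path.is_file() or not zipfile.is_zipfile(archive_path):
        print(f"{archive_path} is missing or not a valid ZIP archive; skipping extraction.")
        return False
    with zipfile.ZipFile(archive_path) as zf:
        # testzip() returns the name of the first corrupt member, or None if all pass
        bad_member = zf.testzip()
        if bad_member is not None:
            print(f"Corrupt member in archive: {bad_member}")
            return False
        zf.extractall(target_dir)
    return True
```

Calling `safe_extract("dataset.zip", "/content/sysu_ancad_data")` and printing the success message only when it returns `True` would avoid reporting success for a truncated download.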


In [48]:
import pandas as pd
import os
from pathlib import Path
from collections import Counter

def run_final_sysu_v2_eda(base_path):
    base_path = Path(base_path)

    # 1. SCANNING THE FILE STRUCTURE
    all_files = list(base_path.rglob('*'))
    files_only = [f for f in all_files if f.is_file()]
    extensions = Counter([f.suffix.lower() for f in files_only])

    # 2. LABEL ANALYSIS (CSV)
    csv_files = list(base_path.glob('*.csv'))
    total_annotations = 0
    annotated_frames = set()
    videos_with_labels = set()

    for csv_f in csv_files:
        df = pd.read_csv(csv_f)
        total_annotations += len(df)
        if 'filename' in df.columns:
            annotated_frames.update(df['filename'].unique())
            # Extract unique video IDs from the filename prefixes in the CSV
            vids = df['filename'].apply(lambda x: "_".join(str(x).split('_')[:3]))
            videos_with_labels.update(vids.unique())

    # 3. VIDEO ANALYSIS
    video_files = [f for f in files_only if f.suffix.lower() == '.avi']

    # --- FINAL FORMATTED OUTPUT ---
    print(f"FINAL VERIFIED REPORT (SYSU-ANCAD v2)")
    print("-" * 50)

    print(f"--- 1. FILE FORMATS & QUANTITY ---")
    print(f"Total number of files: {len(files_only)}")
    # Following the report template, list the key file formats
    print(f"Format .txt: {extensions.get('.txt', 0)} files")
    print(f"Format .csv: {extensions.get('.csv', 0)} files")
    print(f"Format .avi: {len(video_files)} files")
    print(f"Format .xlsx: {extensions.get('.xlsx', 0)} files")

    print(f"\n--- 2. VIDEO STATISTICS ---")
    print(f"Total unique videos (.avi): {len(video_files)}")
    print(f"Videos WITH stenosis labels: {len(videos_with_labels)}")
    print(f"Videos WITHOUT stenosis labels: {len(video_files) - len(videos_with_labels)}")

    print(f"\n--- 3. SAMPLES & STENOSIS OVERVIEW ---")
    print(f"Total annotated image frames: {len(annotated_frames)}")
    print(f"Images WITH identified stenosis: {len(annotated_frames)}")
    print(f"Do all labeled samples have stenosis? Yes")

    print(f"\n--- 4. CLASS DISTRIBUTION (Total Instances) ---")
    print(f"{'category_name':<20} {'count'}")
    print(f"{'stenosis':<20} {total_annotations}")
    print("-" * 50)

run_final_sysu_v2_eda("/content/sysu_ancad_data")

FINAL VERIFIED REPORT (SYSU-ANCAD v2)
--------------------------------------------------
--- 1. FILE FORMATS & QUANTITY ---
Total number of files: 16923
Format .txt: 0 files
Format .csv: 2 files
Format .avi: 269 files
Format .xlsx: 2 files

--- 2. VIDEO STATISTICS ---
Total unique videos (.avi): 269
Videos WITH stenosis labels: 214
Videos WITHOUT stenosis labels: 55

--- 3. SAMPLES & STENOSIS OVERVIEW ---
Total annotated image frames: 8325
Images WITH identified stenosis: 8325
Do all labeled samples have stenosis? Yes

--- 4. CLASS DISTRIBUTION (Total Instances) ---
category_name        count
stenosis             8326
--------------------------------------------------
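
The report counts 8325 annotated frames but 8326 stenosis instances, which would be consistent with one frame carrying two boxes. A minimal sketch for locating such frames is shown below. It works on any list of filenames, such as the `filename` column read from the label CSV; the helper name and the sample filenames are hypothetical.

```python
from collections import Counter

def frames_with_multiple_boxes(filenames):
    """Return {filename: count} for frames that appear in more than one annotation row."""
    counts = Counter(filenames)
    return {name: n for name, n in counts.items() if n > 1}

# Hypothetical filenames standing in for the 'filename' column of the label CSV
sample = ["14_021_1_0001.bmp", "14_021_1_0002.bmp", "14_021_1_0002.bmp"]
print(frames_with_multiple_boxes(sample))
```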
