# Other Datasets
I looked for datasets from similar tasks to pretrain the models and found five that we're going to use.

For the rough pretraining, we'll use the ```brain_tumor_dataset```, which contains T1c images of gliomas (1426), meningiomas (708) and pituitary tumors (930). I've already split it into a training and a validation set that we can use.

For the fine pretraining, we'll use the ```BRATS_METS dataset```, the ```Erasmus Glioma Dataset```, ```UCSF-PDGM``` and ```UPENN-GBM```. Together they contain images of brain metastases, glioblastomas, astrocytomas (IDH wildtype), astrocytomas (IDH mutated) and oligodendrogliomas.

The task now is to split the combined dataset for the fine pretraining into a training and a validation set that have similar class distributions, while keeping all files of one patient in the same set.

In [50]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import shutil
import cv2
from tqdm import tqdm
from pathlib import Path
import random
import nibabel as nib
from sklearn.model_selection import StratifiedGroupKFold
import skimage.measure as measure

In [83]:
# getting the paths to all the tfrecord files
brats_gray_tfr_dir = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/BRATS_2024/BraTS-MET/BraTS2024-MET-tfrecords/BRATS_gray")
erasmus_gray_tfr_dir = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/Erasmus_Glioma_Dataset/EGD_tfrs/gray_tfrs")
ucsf_gray_tfr_dir = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/UCSF-PDGM/UCSF-PDGM_tfrs/gray_tfrs")
upenn_gray_tfr_dir = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/UPenn_GBM/upenn_gbm_tfrs/gray_tfrs")

path_to_tfr_folder = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/tfrs")
path_to_train_txt = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/tfrs/pretraining_fine_train.txt")
path_to_val_txt = Path("/Users/LennartPhilipp/Desktop/Uni/Prowiss/Datensätze/tfrs/pretraining_fine_val.txt")

Each .tfrecord file is named like this: ```{Patient_ID}_label_{label}.tfrecord```. By going through all the files in each directory, we can therefore read off the class (label) of every file and split the dataset accordingly.
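The naming scheme can be parsed with a small helper. This is a minimal sketch, not the code used in this notebook: ```parse_tfr_name``` and its ```id_sep``` parameter are hypothetical, and the second example file name (including the ```slice3``` part) is made up to illustrate names where the patient ID is followed by extra fields.

```python
from pathlib import Path


def parse_tfr_name(path, id_sep="_label_"):
    """Return (patient_id, label) parsed from a tfrecord file name.

    id_sep is the separator that ends the patient ID. Pass "_" for names
    in which the ID is followed by extra fields (an assumed layout).
    """
    stem = Path(path).stem  # file name without the ".tfrecord" suffix
    if "_label_" not in stem:
        raise ValueError(f"unexpected file name: {path}")
    patient_id = stem.split(id_sep)[0]
    label = stem.rsplit("_label_", 1)[1]
    return patient_id, label


# Hypothetical file names, for illustration only
print(parse_tfr_name("EGD-0389_label_0.tfrecord"))
print(parse_tfr_name("BraTS-MET-00712-000_slice3_label_4.tfrecord", id_sep="_"))
```

Raising an error on names without ```_label_``` makes stray files (for example hidden system files) show up immediately instead of causing an ```IndexError``` later.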

In [None]:
# Let's first create a custom patient class to store the path to the file, the patient ID and the label
class Patient:
    def __init__(self, path, patient_id, label):
        self.path = path
        self.patient_id = patient_id
        self.label = label

In [10]:
all_patients = []

# Let's now loop through all the tfrecord files and extract the patient ID and the label
for tfr_file in tqdm(erasmus_gray_tfr_dir.iterdir()):
    patient_id = tfr_file.stem.split("_label_")[0]
    label = tfr_file.stem.split("_label_")[1]
    patient = Patient(tfr_file, patient_id, label)
    all_patients.append(patient)

for tfr_file in tqdm(ucsf_gray_tfr_dir.iterdir()):
    patient_id = tfr_file.stem.split("_label_")[0]
    label = tfr_file.stem.split("_label_")[1]
    patient = Patient(tfr_file, patient_id, label)
    all_patients.append(patient)

for tfr_file in tqdm(upenn_gray_tfr_dir.iterdir()):
    patient_id = tfr_file.stem.split("_label_")[0]
    label = tfr_file.stem.split("_label_")[1]
    patient = Patient(tfr_file, patient_id, label)
    all_patients.append(patient)

# The BraTS directory contains one subdirectory per patient, each holding several tfrecord files
for patient_dir in tqdm(brats_gray_tfr_dir.iterdir()):
    if not patient_dir.is_dir():
        continue
    for tfr_file in patient_dir.iterdir():
        # here the patient ID is the part of the file name before the first "_"
        patient_id = tfr_file.stem.split("_")[0]
        label = tfr_file.stem.split("_label_")[1]
        patient = Patient(tfr_file, patient_id, label)
        all_patients.append(patient)

# Note: every entry is one tfrecord file, so patients with several files are counted more than once
print(f"Total number of patients: {len(all_patients)}")

151it [00:00, 142837.15it/s]
495it [00:00, 229996.73it/s]
526it [00:00, 161994.56it/s]
646it [00:00, 11575.63it/s]

Total number of patients: 5235





In [11]:
# Print statistics about the labels
labels = [patient.label for patient in all_patients]
unique_labels = np.unique(labels)
print(f"Unique labels: {unique_labels}")
label_counts = {label: labels.count(label) for label in unique_labels}
print(f"Label counts: {label_counts}")

Unique labels: ['0' '1' '2' '3' '4']
Label counts: {'0': 949, '1': 35, '2': 133, '3': 55, '4': 4063}


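Apply ```StratifiedGroupKFold```: it keeps the label distribution similar across the splits (stratified) and puts all files of one patient into the same split (grouped).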

In [15]:
patient_ids = [patient.patient_id for patient in all_patients]
labels = [patient.label for patient in all_patients]
paths_to_tfr_files = [patient.path for patient in all_patients]

all_patient_dict = {
    "patient_id": patient_ids,
    "label": labels,
    "path_to_tfr_file": paths_to_tfr_files
}

# create pandas dataframe
df = pd.DataFrame(all_patient_dict)
df

| | patient_id | label | path_to_tfr_file |
|---|---|---|---|
| 0 | EGD-0389 | 0 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 1 | EGD-0187 | 3 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 2 | EGD-0606 | 3 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 3 | EGD-0531 | 0 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 4 | EGD-0417 | 0 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| ... | ... | ... | ... |
| 5230 | BraTS-MET-00712-000 | 4 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 5231 | BraTS-MET-00712-000 | 4 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 5232 | BraTS-MET-00712-000 | 4 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |
| 5233 | BraTS-MET-00706-000 | 4 | /Users/LennartPhilipp/Desktop/Uni/Prowiss/Date... |


In [27]:
X = df.drop(columns=["label", "patient_id"]).values
y = df["label"].values
groups = df["patient_id"].values

In [74]:
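# With 10 folds, each fold holds out roughly 1/10 of the files for validation
sgkf = StratifiedGroupKFold(
    n_splits=10,
    shuffle=True,
    random_state=44
)

# Use only the first fold as a single train/validation split
train_idx, val_idx = next(sgkf.split(X, y, groups=groups))

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]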
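Because several tfrecord files can belong to the same patient, it is worth confirming that no patient ID appears on both sides of the split. The following is a minimal sketch of such a check. The helper ```groups_overlap``` is hypothetical and the toy data is made up.

```python
def groups_overlap(groups, train_idx, val_idx):
    """Return the set of group IDs that occur in both index sets."""
    train_groups = {groups[i] for i in train_idx}
    val_groups = {groups[i] for i in val_idx}
    return train_groups & val_groups


# Toy data (made up): patient "B" has two files
toy_groups = ["A", "B", "B", "C"]
clean = groups_overlap(toy_groups, [0, 1, 2], [3])   # no shared patient
leaky = groups_overlap(toy_groups, [0, 1], [2, 3])   # patient "B" on both sides
print(clean, leaky)
```

In this notebook the check could be run as ```assert not groups_overlap(groups, train_idx, val_idx)``` right after the split.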

In [75]:
# Keep only the file name of each path (x[0] is the Path object in the single remaining column)
X_train = [x[0].name for x in X_train]
X_val = [x[0].name for x in X_val]

In [52]:
# loop through all the patient paths and copy the files into the shared tfrs folder
for patient in tqdm(all_patients):
    shutil.copy(patient.path, path_to_tfr_folder)

100%|██████████| 5235/5235 [00:02<00:00, 2226.00it/s]
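Copying files from several source directories into one folder only works if all file names are unique, because ```shutil.copy``` silently overwrites an existing file with the same name. A minimal sketch of a pre-copy check follows. The helper ```duplicate_names``` and the example paths are hypothetical.

```python
from collections import Counter
from pathlib import Path


def duplicate_names(paths):
    """Return the file names that occur more than once among the given paths."""
    counts = Counter(Path(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical paths, for illustration only
example_paths = [
    "dir_a/P1_label_0.tfrecord",
    "dir_b/P1_label_0.tfrecord",
    "dir_b/P2_label_4.tfrecord",
]
dupes = duplicate_names(example_paths)
print(dupes)
```

Here the check could be called as ```duplicate_names(p.path for p in all_patients)``` before the copy loop. An empty list means no file would be overwritten.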


In [84]:
with open(path_to_train_txt, "w") as f:
    for path in X_train:
        f.write(str(path) + "\n")

with open(path_to_val_txt, "w") as f:
    for path in X_val:
        f.write(str(path) + "\n")
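The two .txt files store bare file names, so whoever reads them has to join each name with the tfrs folder again. A minimal sketch of that step, assuming one file name per line. The helper ```load_split``` is hypothetical, and the demo uses a temporary file with made-up names.

```python
from pathlib import Path
import tempfile


def load_split(txt_path, tfr_folder):
    """Read one file name per line and return full paths inside tfr_folder."""
    folder = Path(tfr_folder)
    with open(txt_path, "r") as f:
        names = [line.strip() for line in f if line.strip()]
    return [folder / name for name in names]


# Demo with a temporary list file and hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    txt = Path(tmp) / "split.txt"
    txt.write_text("P1_label_0.tfrecord\nP2_label_4.tfrecord\n")
    demo_paths = load_split(txt, tmp)
    print([p.name for p in demo_paths])
```

With the notebook's variables this would read ```load_split(path_to_train_txt, path_to_tfr_folder)```.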

In [85]:
print("Split overview:")
print(f"Train: {len(X_train)}")
print(f"Val: {len(X_val)}")
print(f"Total: {len(X_train) + len(X_val)}")

Split overview:
Train: 4658
Val: 577
Total: 5235


In [86]:
train_lines = []
val_lines = []

# check split ratios in the .txt files
with open(path_to_train_txt, "r") as f:
    train_lines = f.readlines()
with open(path_to_val_txt, "r") as f:
    val_lines = f.readlines()

# each line is a file name ending in "_label_{label}.tfrecord", so the label sits between "_label_" and "."
train_labels = [line.split("_label_")[1].strip() for line in train_lines]
train_labels = [label.split(".")[0] for label in train_labels]

val_labels = [line.split("_label_")[1].strip() for line in val_lines]
val_labels = [label.split(".")[0] for label in val_labels]

# count how often each label occurs in each split
train_label_counts = {label: train_labels.count(label) for label in unique_labels}
val_label_counts = {label: val_labels.count(label) for label in unique_labels}
print("Train label counts:")
print(train_label_counts) 
print("Val label counts:")
print(val_label_counts)

# label ratios
train_ratio = {label: count / len(train_labels) for label, count in train_label_counts.items()}
val_ratio = {label: count / len(val_labels) for label, count in val_label_counts.items()}
print("Train label ratios:")
print(train_ratio)
print("Val label ratios:")
print(val_ratio)

Train label counts:
{'0': 861, '1': 33, '2': 119, '3': 47, '4': 3598}
Val label counts:
{'0': 88, '1': 2, '2': 14, '3': 8, '4': 465}
Train label ratios:
{'0': 0.18484328037784456, '1': 0.007084585659081151, '2': 0.025547445255474453, '3': 0.010090167453842851, '4': 0.772434521253757}
Val label ratios:
{'0': 0.15251299826689774, '1': 0.0034662045060658577, '2': 0.024263431542461005, '3': 0.01386481802426343, '4': 0.8058925476603119}
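To judge how well the stratification worked, the per-label difference between the train and validation shares can be computed directly from the counts printed above. This is a minimal sketch. The helper ```ratio_gaps``` is hypothetical, and the counts are copied from the output.

```python
# Label counts copied from the output above
train_counts = {"0": 861, "1": 33, "2": 119, "3": 47, "4": 3598}
val_counts = {"0": 88, "1": 2, "2": 14, "3": 8, "4": 465}


def ratio_gaps(train_counts, val_counts):
    """Return {label: |train share - val share|} for every label."""
    n_train = sum(train_counts.values())
    n_val = sum(val_counts.values())
    return {
        label: abs(train_counts[label] / n_train - val_counts.get(label, 0) / n_val)
        for label in train_counts
    }


gaps = ratio_gaps(train_counts, val_counts)
for label, gap in sorted(gaps.items()):
    print(f"label {label}: gap {gap:.4f}")

# Share of all files that ended up in the validation set
n_total = sum(train_counts.values()) + sum(val_counts.values())
val_fraction = sum(val_counts.values()) / n_total
print(f"validation fraction: {val_fraction:.3f}")
```

Absolute gaps hide how thin the rare classes are: label '1' has only 2 validation files, so any metric for that class on the validation set will be noisy.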
