UMIE_datasets

🤩 About the Project

Warning: This project is currently in alpha stage and may be subject to major changes

This repository provides a suite of unified scripts to standardize, preprocess, and integrate 882,774 images from 20 open-source medical imaging datasets spanning the X-ray, CT, and MR modalities. The scripts make it fast to build a diverse medical imaging dataset. A unified set of annotations allows the datasets to be merged without mislabelling. Each dataset is preprocessed with a custom sklearn pipeline whose steps are reusable across datasets. Preprocessing a new dataset therefore only requires reusing the available pipeline steps and customizing them through the pipeline parameters.
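The idea of reusable, parameterized steps can be illustrated with a minimal, dependency-free sketch. It only mimics the step-chaining convention of an sklearn pipeline and is not the repository's code: `Step`, `run_pipeline`, `keep_extension`, and `add_prefix` are hypothetical names, and the steps operate on file names rather than images.

```python
from typing import Any, Callable, List, Tuple


class Step:
    """Hypothetical reusable step: configured by params, applied via transform()."""

    def __init__(self, func: Callable[..., List[str]], **params: Any) -> None:
        self.func = func
        self.params = params

    def transform(self, items: List[str]) -> List[str]:
        return self.func(items, **self.params)


def run_pipeline(steps: List[Tuple[str, Step]], items: List[str]) -> List[str]:
    """Feed the output of each step into the next one, in order."""
    for _name, step in steps:
        items = step.transform(items)
    return items


# Two illustrative step functions that work on file names only.
def keep_extension(items: List[str], extension: str) -> List[str]:
    return [name for name in items if name.endswith(extension)]


def add_prefix(items: List[str], prefix: str) -> List[str]:
    return [f"{prefix}_{name}" for name in items]


# The same step functions are reused; only the parameter values differ.
ct_pipeline = [
    ("filter", Step(keep_extension, extension=".nii")),
    ("rename", Step(add_prefix, prefix="00_ct")),
]
xray_pipeline = [
    ("filter", Step(keep_extension, extension=".jpg")),
    ("rename", Step(add_prefix, prefix="01_xray")),
]

files = ["case_1.nii", "notes.txt", "scan.jpg"]
print(run_pipeline(ct_pipeline, files))
print(run_pipeline(xray_pipeline, files))
```

Keeping the step logic separate from its parameters is what lets a new dataset be described mostly as configuration.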

The labels and segmentation masks were unified to comply with the RadLex ontology.
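As a rough illustration of what unifying labels means, the sketch below maps dataset-specific label strings onto one shared vocabulary. The dataset keys, source labels, and unified names are all made up for the example; they are not actual RadLex terms or IDs, and the repository's real mapping may be organized differently.

```python
from typing import Dict

# Hypothetical per-dataset label names mapped onto one shared vocabulary.
# The unified names are placeholders, not actual RadLex terms or IDs.
SOURCE_TO_UNIFIED: Dict[str, Dict[str, str]] = {
    "dataset_a": {"normal": "Normal", "PNEUMONIA": "Pneumonia"},
    "dataset_b": {"no_finding": "Normal", "pneumonia": "Pneumonia"},
}


def unify_label(dataset: str, source_label: str) -> str:
    """Translate a dataset-specific label into the shared vocabulary."""
    try:
        return SOURCE_TO_UNIFIED[dataset][source_label]
    except KeyError as error:
        raise ValueError(
            f"No unified label for {dataset!r}/{source_label!r}"
        ) from error


# Differently spelled source labels end up under the same unified name.
print(unify_label("dataset_a", "PNEUMONIA"))
print(unify_label("dataset_b", "pneumonia"))
```

Failing loudly on an unknown label, rather than passing it through, is one way to avoid silent mislabelling when datasets are merged.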

Datasets

| uid | Dataset | Modality | Task |
| --- | --- | --- | --- |
| 0 | KITS-23 | CT | Classification, Segmentation |
| 1 | CoronaHack | X-Ray | Classification |
| 2 | Alzheimers Dataset | MRI | Classification |
| 3 | Brain Tumor Classification | MRI | Classification |
| 4 | COVID-19 Detection X-Ray | X-Ray | Classification |
| 5 | Finding and Measuring Lungs in CT Data | CT | Segmentation |
| 6 | Brain CT Images with Intracranial Hemorrhage Masks | CT | Classification |
| 7 | Liver and Liver Tumor Segmentation | CT | Classification, Segmentation |
| 8 | Brain MRI Images for Brain Tumor Detection | MRI | Classification |
| 9 | Knee Osteoarthritis Dataset with Severity Grading | X-Ray | Classification |
| 10 | Brain Tumor Progression | MRI | Segmentation |
| 11 | Chest X-ray 14 | X-Ray | Classification |
| 12 | COCA- Coronary Calcium and chest CTs | CT | Segmentation |
| 13 | BrainMetShare | MRI | Segmentation |
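For scripting against this list, the table can be mirrored as a small lookup. This is only a convenience sketch: `DatasetInfo`, `DATASETS`, and `datasets_for` are hypothetical names, and the modality and task spellings are normalized here for filtering.

```python
from typing import List, NamedTuple


class DatasetInfo(NamedTuple):
    uid: int
    name: str
    modality: str
    task: str


# Rows copied from the table above, with modality and task spelling normalized.
DATASETS: List[DatasetInfo] = [
    DatasetInfo(0, "KITS-23", "CT", "classification, segmentation"),
    DatasetInfo(1, "CoronaHack", "X-Ray", "classification"),
    DatasetInfo(2, "Alzheimers Dataset", "MRI", "classification"),
    DatasetInfo(3, "Brain Tumor Classification", "MRI", "classification"),
    DatasetInfo(4, "COVID-19 Detection X-Ray", "X-Ray", "classification"),
    DatasetInfo(5, "Finding and Measuring Lungs in CT Data", "CT", "segmentation"),
    DatasetInfo(6, "Brain CT Images with Intracranial Hemorrhage Masks", "CT", "classification"),
    DatasetInfo(7, "Liver and Liver Tumor Segmentation", "CT", "classification, segmentation"),
    DatasetInfo(8, "Brain MRI Images for Brain Tumor Detection", "MRI", "classification"),
    DatasetInfo(9, "Knee Osteoarthritis Dataset with Severity Grading", "X-Ray", "classification"),
    DatasetInfo(10, "Brain Tumor Progression", "MRI", "segmentation"),
    DatasetInfo(11, "Chest X-ray 14", "X-Ray", "classification"),
    DatasetInfo(12, "COCA- Coronary Calcium and chest CTs", "CT", "segmentation"),
    DatasetInfo(13, "BrainMetShare", "MRI", "segmentation"),
]


def datasets_for(modality: str) -> List[DatasetInfo]:
    """Return the datasets whose modality matches exactly."""
    return [dataset for dataset in DATASETS if dataset.modality == modality]


print([dataset.uid for dataset in datasets_for("MRI")])
```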

Using the datasets

Installing requirements

poetry install

Creating the dataset

Due to the copyright restrictions of the source datasets, we can't share the files directly. To obtain the full dataset you have to download the source datasets yourself and run the preprocessing scripts.

0. KITS-23

  1. Clone the KITS-23 repository.
  2. Enter the KITS-23 directory and install the packages with pip.
    cd kits23
    pip3 install -e .
  3. Run the following command to download the data to the dataset/ folder.
    kits23_download_data
    
  4. Fill in source_path and target_path in KITS23Pipeline() in config/runner_config.py, e.g.:
     KITS23Pipeline(
         path_args={
             "source_path": "kits23/dataset",  # Path to the dataset directory in the KITS23 repo
             "target_path": TARGET_PATH,
             "labels_path": "kits23/dataset/kits23.json",  # Path to kits23.json
         },
         dataset_args=dataset_config.KITS23,
     ),
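Before running the preprocessing, it can help to confirm that the configured paths actually exist. The helper below is a hypothetical sketch, not part of the repository; it only assumes that path_args is a plain dictionary of strings, as in the example above.

```python
from pathlib import Path
from typing import Dict, List


def missing_paths(path_args: Dict[str, str], keys: List[str]) -> List[str]:
    """Return the keys that are unset or point to a path that does not exist."""
    missing = []
    for key in keys:
        value = path_args.get(key)
        if not value or not Path(value).exists():
            missing.append(key)
    return missing


# Example values taken from the configuration shown above.
path_args = {
    "source_path": "kits23/dataset",
    "labels_path": "kits23/dataset/kits23.json",
}
problems = missing_paths(path_args, ["source_path", "labels_path"])
if problems:
    print("Check these entries in config/runner_config.py:", problems)
```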
1. Xray CoronaHack -Chest X-Ray-Dataset

  1. Go to CoronaHack page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in CoronaHackPipeline() in config/runner_config.py to the location of the extracted archive folder.
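The extraction step, which recurs for every Kaggle dataset below, can also be scripted. This is a minimal sketch using Python's standard zipfile module; `extract_archive` is a hypothetical helper, and the file and folder names in the usage comment are assumptions about where the download was saved.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, destination: str) -> Path:
    """Unpack a downloaded archive and return the extraction folder."""
    target = Path(destination)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Usage (assumes archive.zip sits in the current directory):
# source_path = extract_archive("archive.zip", "archive")
```

The returned folder is the value to use for source_path in config/runner_config.py.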
2. Alzheimer's Dataset ( 4 class of Images)

  1. Go to Alzheimer's Dataset page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in AlzheimersPipeline() in config/runner_config.py to the location of the extracted archive folder.
3. Brain Tumor Classification (MRI)

  1. Go to Brain Tumor Classification page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in BrainTumorClassificationPipeline() in config/runner_config.py to the location of the extracted archive folder.
4. COVID-19 Detection X-Ray

  1. Go to COVID-19 Detection X-Ray page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Remove the TrainData folder. Augmented data is not wanted at this stage.
  6. Set source_path in COVID19DetectionPipeline() in config/runner_config.py to the location of the extracted archive folder.
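Removing the TrainData folder can be done by hand or with a short script. The sketch below assumes that TrainData sits directly inside the extracted archive folder, which may not match the actual layout; `remove_augmented_folder` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def remove_augmented_folder(archive_dir: str, folder_name: str = "TrainData") -> bool:
    """Delete folder_name inside archive_dir; return True if it was removed."""
    folder = Path(archive_dir) / folder_name
    if folder.is_dir():
        shutil.rmtree(folder)  # Permanently deletes the folder and its contents.
        return True
    return False


# Usage (the path is an assumption about where the archive was extracted):
# remove_augmented_folder("archive")
```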
5. Finding and Measuring Lungs in CT Data

  1. Go to Finding and Measuring Lungs in CT Data page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in FindingAndMeasuringLungsPipeline() in config/runner_config.py to the location of the archive/2d_images folder, and set masks_path to the location of the archive/2d_masks folder.
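Since images and masks live in separate folders here, a quick consistency check before preprocessing can catch an incomplete download. The sketch assumes each image and its mask share the same file name, which is an assumption about this dataset's layout; `images_without_masks` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def images_without_masks(images_dir: str, masks_dir: str) -> List[str]:
    """List image file names that have no identically named file in masks_dir."""
    mask_names = {path.name for path in Path(masks_dir).iterdir() if path.is_file()}
    return sorted(
        path.name
        for path in Path(images_dir).iterdir()
        if path.is_file() and path.name not in mask_names
    )


# Usage with the folders named in the step above:
# unmatched = images_without_masks("archive/2d_images", "archive/2d_masks")
```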
6. Brain CT Images with Intracranial Hemorrhage Masks

  1. Go to Brain With Intracranial Hemorrhage page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in BrainWithIntracranialHemorrhagePipeline() in config/runner_config.py to the location of the extracted archive folder, and set masks_path to the same path as source_path.
7. Liver and Liver Tumor Segmentation (LITS)

  1. Go to the Liver and Liver Tumor Segmentation page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in this dataset's pipeline entry in config/runner_config.py to the location of the extracted archive folder, and set masks_path as well.
8. Brain MRI Images for Brain Tumor Detection

  1. Go to Brain MRI Images for Brain Tumor Detection page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in BrainTumorDetectionPipeline() in config/runner_config.py to the location of the extracted archive folder.
9. Knee Osteoarthritis Dataset with Severity Grading

  1. Go to the Knee Osteoarthritis Dataset with Severity Grading page on Kaggle.
  2. Log in to your Kaggle account.
  3. Download the data.
  4. Extract archive.zip.
  5. Set source_path in this dataset's pipeline entry in config/runner_config.py to the location of the extracted archive folder.

10. Brain-Tumor-Progression

  1. Go to the Brain Tumor Progression dataset page on The Cancer Imaging Archive.
11. Chest X-ray 14

  1. Go to Chest X-ray 14.
  2. Create an account.
  3. Download the images folder and DataEntry2017_v2020.csv.
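To inspect the downloaded label file, the CSV can be read with Python's standard csv module. This sketch assumes the file has "Image Index" and "Finding Labels" columns and that multiple findings are separated by "|"; verify both against your copy of the file. `read_finding_labels` is a hypothetical helper.

```python
import csv
from typing import Dict, List, TextIO


def read_finding_labels(csv_file: TextIO) -> Dict[str, List[str]]:
    """Map each image file name to its list of findings."""
    labels: Dict[str, List[str]] = {}
    for row in csv.DictReader(csv_file):
        # Assumed column names and "|" separator; adjust if the file differs.
        labels[row["Image Index"]] = row["Finding Labels"].split("|")
    return labels


# Usage (assumes the CSV was downloaded to the current directory):
# with open("DataEntry2017_v2020.csv", newline="") as handle:
#     labels = read_finding_labels(handle)
```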
12. COCA- Coronary Calcium and chest CTs

  1. Go to COCA- Coronary Calcium and chest CTs.
  2. Log in or sign up for a Stanford AIMI account.
  3. Fill in your contact details.
  4. Download the data with azcopy.
  5. Set source_path to the location of the cocacoronarycalciumandchestcts-2/Gated_release_final/patient folder, and set masks_path to the cocacoronarycalciumandchestcts-2/Gated_release_final/calcium_xml XML file.
13. BrainMetShare

  1. Go to BrainMetShare.
  2. Log in or sign up for a Stanford AIMI account.
  3. Fill in your contact details.
  4. Download the data with azcopy.

To preprocess a dataset that is not listed above, look in the preprocessing folder. It contains reusable steps for converting imaging formats, extracting masks, creating file trees, etc. Check the config file to see which mask and label encodings are available, and append new label and mask encodings if needed.
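Appending a new encoding without colliding with an existing one might look like the sketch below. The dictionary contents and the `add_encoding` helper are hypothetical; the actual structure of the config file may differ, so treat this only as an illustration of the append-if-missing pattern.

```python
from typing import Dict


def add_encoding(encodings: Dict[str, int], name: str) -> int:
    """Return the code for name, assigning the next unused integer if it is new."""
    if name in encodings:
        return encodings[name]
    code = max(encodings.values(), default=-1) + 1
    encodings[name] = code
    return code


# Hypothetical existing mask encodings.
mask_encodings = {"background": 0, "kidney": 1}
print(add_encoding(mask_encodings, "kidney"))  # Already present, code is reused.
print(add_encoding(mask_encodings, "liver"))   # New entry gets the next free code.
```

Reusing an existing code when the name is already known keeps encodings stable across datasets.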

Overall, the dataset should contain 882,774 images in .png format:

  • CT - 500k+
  • X-Ray - 250k+
  • MRI - 100k+

🎯 Roadmap

  • dcm
  • jpg
  • nii
  • tif
  • Shared radlex ontology
  • Huggingface datasets
  • Data dashboards

👋 Contributors

🤝 Contact

Barbara Klaudel

TheLion.AI

Development

Pre-commits

Install pre-commit: https://pre-commit.com/#installation

If you are using VS Code, install the extension: https://marketplace.visualstudio.com/items?itemName=MarkLarah.pre-commit-vscode

To do a dry run of the pre-commit hooks and check whether your code passes, run

pre-commit run --all-files

Adding python packages

Dependencies are handled by Poetry. To add a new dependency, run

poetry add <package_name>

Debugging

To modify and debug the app, development in containers can be useful.

Testing

run_tests.sh
