
Glaucoma Detection

Welcome to the Glaucoma detection repository.


Table of Contents

  1. Prerequisites
  2. Usage
  3. Data
  4. CSV
  5. TODO

Prerequisites

Conda is the recommended way to set up the environment. If you do not have Conda, you can install the requirements directly with pip, although this is not the preferred approach.

Usage

  1. Clone the repository.
git clone https://github.com/Sudonuma/MLCodingChallenge.git
  2. Change into the MLCodingChallenge directory.
cd MLCodingChallenge
  3. Create a conda environment and activate it (if you have Conda installed; otherwise you can install the requirements directly).
conda create --name glaucomaenv python=3.9
conda activate glaucomaenv
  4. Install the dependencies.
pip install -r requirements.txt
  5. Run the main script for training, validation, and inference (see the file for the available arguments; you can also run it with the default arguments).
python main.py

Note: If you would like to only validate and run inference with a model, use python main.py --validate_only True.

Data

The glaucoma dataset contains images categorized into folders labeled from 0 to 5. These folders include photographs of patients, some of whom have glaucoma, while others do not. Notably, the number of images of individuals without glaucoma far exceeds the number with the condition.

The dataset is accompanied by a train_labels.csv file, which maps each image's name to its corresponding label. To simplify the labels, we encoded them as follows: "rg" represents class 1, and "ngr" corresponds to class 0. The resulting dataset with these encoded labels is saved in a file called encoded_dataset.csv.
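
The encoding itself is straightforward. A minimal sketch of how it could be produced with pandas is shown below; the label column name "class" is an assumption and may differ from the actual train_labels.csv.

import pandas as pd

# Map the original string labels to integers: "rg" -> 1, "ngr" -> 0.
# The column name "class" is assumed; adjust it to match train_labels.csv.
labels = pd.read_csv("train_labels.csv")
labels["class"] = labels["class"].map({"rg": 1, "ngr": 0})
labels.to_csv("encoded_dataset.csv", index=False)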

For our experiments, we set aside 10% of the data as a test dataset; the information for these test samples is stored in a CSV file called test_data.csv.

In a separate experiment, we aimed to mitigate class imbalance by reducing the dataset size. We generated two distinct files: reduced_encoded_train_data.csv for training and reduced_encoded_test_data.csv for testing.
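
One way to build such a reduced, balanced split is to downsample the majority class to the size of the minority class. Below is a minimal sketch with pandas, assuming the encoded labels live in encoded_dataset.csv with an integer "class" column (both names are assumptions, and this is not necessarily how the repository's own script does it).

import pandas as pd

df = pd.read_csv("encoded_dataset.csv")
glaucoma = df[df["class"] == 1]                                          # minority class
healthy = df[df["class"] == 0].sample(len(glaucoma), random_state=42)    # downsampled majority class
balanced = pd.concat([glaucoma, healthy]).sample(frac=1, random_state=42)  # shuffle rows
balanced.to_csv("reduced_encoded_train_data.csv", index=False)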

Additionally, to facilitate code testing for other users, we included two CSV files, namely, dummy_train_data.csv and dummy_test_data.csv. These files enable users to test the code without the need to download the entire 50+GB dataset.

Should you wish to train on the full dataset, you can do so by copying the image folders (0 to 5) into the data/dataset directory. Make sure to adjust the --data_csv_path argument to point to ./data/dataset/encoded_train_dataset.csv and the --test_data_csv_path to ./data/dataset/encoded_test_dataset.csv.
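
For example, assuming the default values for all other arguments:

python main.py --data_csv_path ./data/dataset/encoded_train_dataset.csv --test_data_csv_path ./data/dataset/encoded_test_dataset.csv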

If you want to train on the balanced data, you can utilize the reduced_encoded_train_data.csv and reduced_encoded_test_data.csv files.

The data is split into 70% for training, 20% for validation, and 10% for model evaluation (testing).
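
A minimal sketch of how such a stratified 70/20/10 split could be produced with scikit-learn is shown below; the encoded_dataset.csv file name and its "class" column are assumptions, and this is not necessarily how main.py performs the split.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("encoded_dataset.csv")

# Hold out 10% of the data as the test set, stratified by class.
train_val, test = train_test_split(df, test_size=0.10, stratify=df["class"], random_state=42)

# Split the remaining 90% into 70%/20% of the full dataset (2/9 of the remainder is 20%).
train, val = train_test_split(train_val, test_size=2/9, stratify=train_val["class"], random_state=42)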

CSV

  1. To train the model on the full dataset, use train_data.csv and test_data.csv.
  2. To train the model on balanced data (exactly the same number of samples for each class), use reduced_encoded_train_data.csv and reduced_encoded_test_data.csv (see the example command after this list).
  3. To train the model on downsampled but not fully balanced data, use 13ktrain_data.csv and 13ktest_data.csv.
  4. dummy_train_data.csv and dummy_test_data.csv are provided only for quickly running the code.
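
For example, to train on the balanced split (assuming the --data_csv_path and --test_data_csv_path arguments described above and the CSVs placed under ./data/dataset/):

python main.py --data_csv_path ./data/dataset/reduced_encoded_train_data.csv --test_data_csv_path ./data/dataset/reduced_encoded_test_data.csv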

TODO

  1. CSV files should be tracked with DVC.
  2. Improve the EDA and the pre-processing step.
  3. Test the output of the model.
  4. Add more tests.
  5. Optimise the stratified sampling.
  6. Use Siamese Neural Networks, as they are robust against class imbalance.
