This repository shares the ICRAR team's machine learning solution to the SKA Science Data Challenge 1 (SDC1). The solution earned the team second place in the challenge. The data preparation pipeline proceeds as follows:
- Convert the raw catalogues to CSV files
- Split the entire image into a set I of small (205 by 205 pixel) cutouts (see the cutout sketch after this list)
- Spatially index each image cutout, and manage all indexes in the PostgreSQL database D
- Go through each "ground-truth" source S in the CSV catalogue
  - Find the cutout C that contains S using its index in D
  - Calculate the background noise level rms of S (see the noise-filtering sketch after this list)
  - Check whether the flux of S is greater than k sigma above rms, where k ranges from 0.5 to 3
    - If so, keep S in the training catalogue T
    - Else, discard S
- Go through each valid source V in T
  - Calculate the pixel coordinates of its bounding box B from the sky coordinates encoded in the catalogue (see the bounding-box sketch after this list)
  - Obtain the class label CL for V
  - Assemble B and CL, together with some other identifiers (e.g. the source id), into a valid source record R
- Create the final JSON file J that contains
  - the names of all cutout images, each of which has at least one valid source
  - a set of valid source records (many Rs)
- Pass both I and J on to the machine learning pipeline described in the next section
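The cutout step can be illustrated with astropy's `Cutout2D`. This is a minimal sketch only: the input file name, the output naming scheme, and the choice to write FITS cutouts (before any conversion to PNG) are assumptions for illustration, not the repository's exact code.

```python
# Minimal sketch (not the repository's actual script): tile a large SKA image
# into 205x205-pixel cutouts with astropy. File names are illustrative only.
import numpy as np
from astropy.io import fits
from astropy.nddata import Cutout2D
from astropy.wcs import WCS

CUTOUT_SIZE = 205  # pixels, as described above

def split_into_cutouts(fits_path, out_prefix="cutout"):
    with fits.open(fits_path) as hdul:
        data = np.squeeze(hdul[0].data)      # drop degenerate freq/Stokes axes
        wcs = WCS(hdul[0].header).celestial  # keep only the two sky axes
    ny, nx = data.shape
    for j, y0 in enumerate(range(0, ny, CUTOUT_SIZE)):
        for i, x0 in enumerate(range(0, nx, CUTOUT_SIZE)):
            centre = (x0 + CUTOUT_SIZE // 2, y0 + CUTOUT_SIZE // 2)
            cut = Cutout2D(data, position=centre, size=CUTOUT_SIZE,
                           wcs=wcs, mode="partial", fill_value=np.nan)
            hdu = fits.PrimaryHDU(cut.data, header=cut.wcs.to_header())
            hdu.writeto(f"{out_prefix}_{j:03d}_{i:03d}.fits", overwrite=True)

# Example call (hypothetical file name):
# split_into_cutouts("SKAMid_B1_1000h_v3.fits", out_prefix="SKAMid_B1_1000h_v3_cutout")
```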
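The noise-filtering step can be sketched with astropy's sigma-clipped statistics. The helper names below (`keep_source`, `find_cutout`) and the catalogue column `"flux"` are hypothetical, and the repository's own rms estimate may differ.

```python
# Minimal sketch of the source-selection step: estimate the background rms of
# the cutout containing a source and keep the source only if its flux exceeds
# k * rms. Helper and column names are illustrative assumptions.
from astropy.stats import sigma_clipped_stats

def keep_source(source_flux, cutout_data, k=1.0):
    """Return True if the source flux is at least k sigma above the cutout rms."""
    # Robust background statistics over the cutout pixels (3-sigma clipping).
    mean, median, rms = sigma_clipped_stats(cutout_data, sigma=3.0)
    return source_flux > k * rms

# Building the training catalogue T from the ground-truth catalogue
# (find_cutout is a hypothetical lookup against the spatial index in D):
# T = [s for s in catalogue if keep_source(s["flux"], find_cutout(s), k=1.0)]
```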
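The bounding-box and JSON assembly steps might look roughly like the following. The `instances_*.json` naming in the next section suggests a COCO-style layout, but the field names, the bbox convention, and the way source sizes map to box extents are illustrative assumptions here, not the repository's exact schema.

```python
# Minimal sketch of turning a valid source V into a record R and writing the
# final JSON file J. Field names and the size-to-box mapping are assumptions.
import json
from astropy.io import fits
from astropy.wcs import WCS
from astropy.wcs.utils import proj_plane_pixel_scales

def bounding_box(cutout_fits, ra_deg, dec_deg, major_arcsec, minor_arcsec):
    """Convert a source's sky position and size into a pixel bounding box."""
    with fits.open(cutout_fits) as hdul:
        wcs = WCS(hdul[0].header).celestial
    # Pixel scale in arcsec/pixel (assumes square pixels).
    scale = proj_plane_pixel_scales(wcs)[0] * 3600.0
    x, y = wcs.all_world2pix(ra_deg, dec_deg, 0)
    half_w = 0.5 * major_arcsec / scale
    half_h = 0.5 * minor_arcsec / scale
    # [x1, y1, x2, y2] is just one possible bbox convention.
    return [float(x - half_w), float(y - half_h),
            float(x + half_w), float(y + half_h)]

def write_training_json(image_names, records, out_path):
    """Assemble the cutout image names and source records into one JSON file."""
    with open(out_path, "w") as f:
        json.dump({"images": image_names, "annotations": records}, f, indent=2)

# A single record R might look like (hypothetical field names):
# {"source_id": 123, "image": "cutout_000_017.png",
#  "bbox": bounding_box("cutout_000_017.fits", ra, dec, bmaj, bmin),
#  "class": 2}
```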
Given I and J for each dataset (e.g. B1, 1000h), we trained ClaRAN (Classifying Radio Galaxies Automatically with Neural Networks) to detect sources in all cutout images. In particular, we used ClaRAN V0.2, which requires I and J to be organised in the following directory layout:
```
SKASDC1/DATA_DIR/
    annotations/
        instances_train_B1_1000h.json
        instances_test_B1_1000h.json
        ...
    train_B1_1000h/
        SKAMid_B1_1000h_v3_train_image*.png
        ...
    val_B1_1000h/
        SKAMid_B1_1000h_v3_train_image*.png
        ...
```
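Before training, it can be worth confirming that the layout above is in place. The following sanity check is illustrative and not part of ClaRAN; adjust `data_dir` to wherever your DATA_DIR actually lives.

```python
# Quick sanity check (illustrative): confirm the expected directory layout and
# annotation files exist before starting a ClaRAN training run.
from pathlib import Path

data_dir = Path("SKASDC1/DATA_DIR")  # adjust to your actual data root
expected = [
    data_dir / "annotations" / "instances_train_B1_1000h.json",
    data_dir / "annotations" / "instances_test_B1_1000h.json",
    data_dir / "train_B1_1000h",
    data_dir / "val_B1_1000h",
]
for path in expected:
    status = "ok" if path.exists() else "MISSING"
    print(f"{status:8s} {path}")
```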
All of the above data is publicly available. For a detailed description of ClaRAN's detection algorithms, please refer to our paper.
We have also prepared a Python notebook that walks through the basic steps of training ClaRAN v0.2 on the SDC1 dataset (B1, 1000 hours).