ICRAR/skasdc1

The ML Solution to SKA SDC1

This repository shares the ICRAR team's machine learning solution to the SKA Science Data Challenge 1. The ML solution earned the team second place in this data challenge.

Pre-processing

  • Convert the raw catalogues to CSV files
  • Split the entire image into a set I of small (205 by 205 pixel) cutouts
  • Spatially index each image cutout, and manage all indexes in a PostgreSQL database D
  • Go through each "ground-truth" source S in the CSV catalogue
    • Find the cutout C that contains S using its index in D
    • Calculate the background noise level rms for S
    • Check whether the flux of S is more than k sigma above rms, where k ranges from 0.5 to 3
      • If so, keep S in the training catalogue T
      • Else, discard S
  • Go through each valid source V in T
    • Calculate the pixel coordinates of its bounding box B from the sky coordinates given in the catalogue
    • Obtain the class label CL for V
    • Assemble B and CL, together with some other identifiers (e.g. source id), as a valid source record R
  • Create the final JSON file J that contains
    • names of all cutout images, each of which has at least one valid source
    • a set of valid source records (many Rs)
  • Pass on both I and J to the following machine learning pipeline (see the section below)
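
The filtering and record-assembly steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used by the team: the cut is read as "flux greater than k times the local rms", the cutout lookup uses integer division in place of the spatial index in D, source positions and box sizes are assumed to be already converted from sky to pixel coordinates, and the function names, dictionary keys and the `cutout_<col>_<row>.png` naming scheme are all hypothetical.

```python
import json

CUTOUT = 205  # cutout edge length in pixels, as stated in the list above


def cutout_index(x_pix, y_pix, size=CUTOUT):
    """Column and row of the cutout containing a pixel (stand-in for the lookup in D)."""
    return int(x_pix // size), int(y_pix // size)


def is_valid(flux, rms, k):
    """Assumed reading of the cut: keep S if its flux exceeds k times the local rms."""
    return flux > k * rms


def make_record(src, size=CUTOUT):
    """Build a record R whose bounding box B is relative to the cutout origin."""
    col, row = cutout_index(src["x"], src["y"], size)
    # Top-left corner of the box in cutout coordinates.
    x0 = src["x"] - col * size - src["w"] / 2.0
    y0 = src["y"] - row * size - src["h"] / 2.0
    # Clip the box so it stays inside the cutout.
    x_min, y_min = max(0.0, x0), max(0.0, y0)
    x_max = min(float(size), x0 + src["w"])
    y_max = min(float(size), y0 + src["h"])
    return {
        "source_id": src["id"],
        "cutout": f"cutout_{col}_{row}.png",  # hypothetical file name
        "bbox": [x_min, y_min, x_max - x_min, y_max - y_min],
        "class_label": src["label"],
    }


def build_json(sources, k):
    """Filter sources into T, build the records, and serialise J."""
    records = [make_record(s) for s in sources if is_valid(s["flux"], s["rms"], k)]
    # Only cutouts holding at least one valid source are listed.
    images = sorted({r["cutout"] for r in records})
    return json.dumps({"images": images, "records": records}, indent=2)
```

A call such as `build_json(sources, k=3.0)` returns a JSON string with the cutout names and the surviving records. The real J may follow a different schema.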

Machine learning

Given I and J for each dataset (e.g. 1000h and B1), we trained ClaRAN (Classifying Radio Galaxies Automatically with Neural Networks) to detect sources in all cutout images. In particular, we used ClaRAN v0.2, which requires I and J to be organised in the following directory structure:

SKASDC1/DATA_DIR/
  annotations/
    instances_train_B1_1000h.json
    instances_test_B1_1000h.json
    ...
  train_B1_1000h/
    SKAMid_B1_1000h_v3_train_image*.png
    ...
  val_B1_1000h/
    SKAMid_B1_1000h_v3_train_image*.png
    ...
  

All of the above data is publicly available. For a detailed description of ClaRAN's detection algorithms, please refer to our paper.
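
Before training, it can help to confirm that a local copy of the data matches the directory layout listed above. The following is a small sketch of such a check; the helper name `missing_entries` is hypothetical, and only the entries named explicitly in the layout are tested (the elided `...` entries and the image file patterns are not).

```python
from pathlib import Path

# Entries named explicitly in the layout above, relative to SKASDC1/DATA_DIR/.
EXPECTED = [
    "annotations/instances_train_B1_1000h.json",
    "annotations/instances_test_B1_1000h.json",
    "train_B1_1000h",
    "val_B1_1000h",
]


def missing_entries(data_dir):
    """Return the expected files or directories that are absent under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

An empty list from `missing_entries("SKASDC1/DATA_DIR")` means every listed entry is present.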

We have also prepared a Python notebook that shows the basic steps for training ClaRAN v0.2 on the SDC1 dataset (B1, 1000 hours).