
Pattern detection over a large galaxy images collection obtained from the Galaxy Zoo project.


PlugInRichi/minIA


minIA

Python 3.6 TensorFlow 2.2 Maintainer

minIA is an unsupervised learning methodology for discovering patterns in astronomical images (although it can be applied to any other image collection). We use a customized DELF model to extract simple features from the images. Each image in the collection is then represented with a bag of words model, and patterns are mined from that representation using the Sampled-MinHashing technique.

Screenshot

Install minIA 🚀

The whole project can be run from a Docker container. Instructions for running it correctly can be found here.

Sampled-MinHashing is not included in the container and requires a separate installation. To install it, follow the instructions in the Sampled-MinHashing repository.

Execution 🕹️

Customization of the DELF model for the extraction of astronomical features

  1. Dataset creation
  2. Reformatting dataset
  3. Train
  4. Export Model

Dataset creation

To configure the creation of the training dataset for the neural model, it is necessary to specify the parameters in a configuration file:

# data/config/dataset_config.yml
    galaxyZoo2_path:  /data/images/gz2_hart16.csv
    map_images_path:  /data/images/gz2_filename_mapping.csv
    train_dataset_path: /data/images/gz2_train_dataset_5000
    images_dir_path: /data/images/images_gz2
    images_out_dir_path: /data/images/images_gz2
    class_size: 5000

Then run the script:

python3 createDataSet.py

Reformatting dataset

python3 custom_delf/build_galaxy_image_dataset.py \
  --train_clean_csv_path=/data/images/gz2_train_dataset_5000_with_filter.csv \
  --train_directory=/data/images/images_gz2/  \
  --output_directory=/data/tf_records/v5-full_merge \
  --num_shards=64 \
  --validation_split_size=0.2

*** Note: to train on any other dataset, its CSV file must follow the same format as 'GZ_dataset.csv' (same spacing and line breaks). Image names are listed without the file extension, and the image files themselves must be JPG. For example:

Categoria_Encabezado,Nombre_encabezado
Categoria_ID,nombre_imagen_1 nombre_imagen_2 ...
Categoria_ID,nombre_imagen_2 nombre_imagen_5 ...

*** Note: Each image belongs to only one category.

*** Note: The names of the headers must match those written in the script.
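
The format described above can be produced with a short helper. The following is only a sketch under stated assumptions: the function name `write_category_csv` is hypothetical, the default header names are the placeholders shown above (the real ones must match the script), and stripping the file extension follows the note that image names are given without it.

```python
import csv
import io
import os


def write_category_csv(categories, out_file,
                       header=("Categoria_Encabezado", "Nombre_encabezado")):
    """Write one row per category: the category id, then space-separated image names.

    `header` holds placeholder names (assumption); replace them with the
    header names expected by the dataset-building script.
    """
    seen = set()
    rows = []
    for category_id, file_names in categories.items():
        # Image names are written without the extension.
        names = [os.path.splitext(name)[0] for name in file_names]
        for name in names:
            # Each image may belong to only one category.
            if name in seen:
                raise ValueError(f"image {name!r} is listed more than once")
            seen.add(name)
        rows.append([category_id, " ".join(names)])

    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Usage with made-up image names; written to memory instead of a real file.
buffer = io.StringIO()
write_category_csv({0: ["img_1.jpg", "img_2.jpg"], 1: ["img_5.jpg"]}, buffer)
csv_text = buffer.getvalue()
```

Raising an error on a repeated image name is a design choice of this sketch, so that a violation of the one-category-per-image rule is caught before the dataset is built.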

Train

python3 custom_delf/train.py \
    --train_file_pattern=/data/tf_records/v5-full_merge/train* \
    --validation_file_pattern=/data/tf_records/v5-full_merge/validation* \
    --imagenet_checkpoint=/data/models/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5 \
    --logdir=/data/train/v5-full_merge \
    --max_iters=15000 \
    --initial_lr=0.045 \
    --batch_size=50 \
    --num_classes=9

Export Model

python3 custom_delf/export_local_model.py \
  --ckpt_path=/data/train/v5-full_merge/delf_weights \
  --export_path=/data/models/v5-full_merge

Discovery of visual patterns

  1. Feature extraction
  2. Bag of words model
  3. Pattern mining
  4. Explore patterns!

Feature extraction

The script for feature extraction is extractor.py. It requires three parameters:

  1. Extractor type (DELF, SIFT, SURF)
  2. Absolute or Relative Path of the image folder
  3. Absolute or Relative Path and name of the file to generate

The following execution will create a csv file with the POI information and a txt file with the descriptor values.

extractor.py SIFT /images_dataset /test/images_descriptors

Bag of words model

The mining process requires representing each image using the bag of words model. The visual vocabulary is built from the descriptors generated in the previous step. cluster.py requires three parameters:

  1. Absolute or Relative Path of the file generated by the previous step
  2. Absolute or Relative Path and name of the file to generate
  3. Number of clusters (final vocabulary size)

The following execution will create a new csv file at /test/images_visual_vocabulary using 2000 clusters:

cluster.py /test/images_descriptors /test/images_visual_vocabulary 2000
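
To make the bag of words representation concrete, here is a minimal sketch of how one image's descriptors can be turned into a histogram of visual words. It assumes the vocabulary (cluster centroids) has already been computed. The function names and the toy 2-D descriptors are hypothetical, and this is not the code in cluster.py.

```python
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Return the index of the vocabulary centroid closest to the descriptor."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Count how many descriptors of one image are assigned to each visual word."""
    return Counter(nearest_word(d, vocabulary) for d in descriptors)


# Toy vocabulary of three visual words and four descriptors from one image.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
image_descriptors = [(1.0, 0.5), (9.0, 11.0), (0.2, 0.1), (1.0, 9.0)]
histogram = bag_of_words(image_descriptors, vocabulary)
```

With a real vocabulary of 2000 clusters the histogram has up to 2000 entries per image, most of them absent, which is why a sparse mapping is used here rather than a dense vector.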

Pattern mining

Using the file generated in the previous step, we create the input document for the mining step. SMHdocument.py requires three parameters:

  1. Magnitude associated with the image indices (SIZE or FRECUENCY)
  2. Absolute or Relative Path of the file generated in the previous step
  3. Absolute or Relative Path of the document to be generated

Optionally, the drop_outliers option can be used to reduce the number of words.

The following execution will create the document data/SMH_files/images_BoW.words, weighting each visual word by the size with which it appears within the image:

SMHdocument.py SIZE /test/images_visual_vocabulary data/SMH_files/images_BoW.words

If Sampled-MinHashing is installed, the mining can be run as follows:

# Create an inverted file index from the bag of words document
smhcmd ifindex data/SMH_files/images_BoW.words data/SMH_files/images_BoW.ifs
# Mine the inverted file and write the discovered structures to a model file
smhcmd discover -r 2 -l 750 data/SMH_files/images_BoW.ifs data/SMH_models/images_structures.model
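
As background for the first command, an inverted file maps each visual word to the images that contain it, instead of mapping each image to its words. The sketch below illustrates only that data structure with made-up word ids. The function name is hypothetical, and it does not reproduce the on-disk format written by smhcmd.

```python
from collections import defaultdict


def build_inverted_file(bags):
    """Map each visual word to the list of image ids that contain it."""
    inverted = defaultdict(list)
    for image_id, words in enumerate(bags):
        # A word is recorded once per image, however often it occurs there.
        for word in sorted(set(words)):
            inverted[word].append(image_id)
    return dict(inverted)


# Three images, each described by the ids of its visual words.
bags = [[3, 7, 7], [7, 9], [3, 9]]
inverted = build_inverted_file(bags)
```

Words that share many of the same images in this index are the kind of co-occurrence the mining step looks for.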

Explore patterns!

To explore the visual structures (patterns), we can use this notebook to display the images that belong to the same structure.
