minIA is an unsupervised learning methodology for discovering patterns in astronomical images (although it can be applied to any other image collection). We use a customized DELF model to extract simple features from the images; through a bag-of-words model we represent each image in the collection, and then mine patterns with Sampled-MinHashing.
The whole project can be run from a Docker container; instructions for running it correctly can be found here.
Using Sampled-MinHashing requires a separate installation that is not included in the container; for this, follow the instructions in the Sampled-MinHashing repository.
To configure the creation of the training dataset for the neural model, it is necessary to specify the parameters in a configuration file:
# data/config/dataset_config.yml
galaxyZoo2_path: /data/images/gz2_hart16.csv
map_images_path: /data/images/gz2_filename_mapping.csv
train_dataset_path: /data/images/gz2_train_dataset_5000
images_dir_path: /data/images/images_gz2
images_out_dir_path: /data/images/images_gz2
class_size: 5000
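A minimal sketch of how a config file with these keys could be loaded in Python (the key names follow the example above; the helper name is illustrative and `yaml` refers to the PyYAML package):

```python
import yaml

def load_dataset_config(path):
    """Load the dataset-creation parameters from a YAML config file."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    # Basic sanity check on the keys shown in the example above
    required = ["galaxyZoo2_path", "map_images_path", "train_dataset_path",
                "images_dir_path", "images_out_dir_path", "class_size"]
    missing = [k for k in required if k not in cfg]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return cfg
```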
Then just run the script
python3 createDataSet.py
python3 custom_delf/build_galaxy_image_dataset.py \
--train_clean_csv_path=/data/images/gz2_train_dataset_5000_with_filter.csv \
--train_directory=/data/images/images_gz2/ \
--output_directory=/data/tf_records/v5-full_merge \
--num_shards=64 \
--validation_split_size=0.2
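The `--validation_split_size` and `--num_shards` flags split the labeled images into train/validation TFRecord shards. As a hedged illustration of what a deterministic split with sharding involves (a hypothetical helper, not the logic of build_galaxy_image_dataset.py):

```python
import hashlib

def split_and_shard(filenames, validation_split_size=0.2, num_shards=64):
    """Assign each image to train/validation and to a shard, deterministically."""
    train, validation = [], []
    for name in sorted(filenames):
        # Hash the filename so the split is reproducible across runs
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        bucket = (h % 100) / 100.0
        shard = h % num_shards
        (validation if bucket < validation_split_size else train).append((name, shard))
    return train, validation
```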
*** Note: to train on any other dataset, it must have the same format as 'GZ_dataset.csv' (with the same number of spaces and line breaks; image names are given without extension and the image files must be JPG), like:
Categoria_Encabezado,Nombre_encabezado
Categoria_ID,nombre_imagen_1 nombre_imagen_2 ...
Categoria_ID,nombre_imagen_2 nombre_imagen_5 ...
*** Note: Each image belongs to only one category.
*** Note: The names of the headers must match those written in the script.
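A small sketch of a parser for the format above (a hypothetical helper; it assumes a single header row, one category per line, and space-separated image names):

```python
def parse_category_csv(lines):
    """Parse 'Categoria_ID,name1 name2 ...' lines into {category: [image names]}."""
    it = iter(lines)
    next(it)  # skip the header row (Categoria_Encabezado,Nombre_encabezado)
    categories = {}
    for line in it:
        line = line.strip()
        if not line:
            continue
        category, _, names = line.partition(",")
        # Each image belongs to only one category
        categories[category] = names.split()
    return categories
```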
python3 custom_delf/train.py \
--train_file_pattern=/data/tf_records/v5-full_merge/train* \
--validation_file_pattern=/data/tf_records/v5-full_merge/validation* \
--imagenet_checkpoint=/data/models/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5 \
--logdir=/data/train/v5-full_merge \
--max_iters=15000 \
--initial_lr=0.045 \
--batch_size=50 \
--num_classes=9
python3 custom_delf/export_local_model.py \
--ckpt_path=/data/train/v5-full_merge/delf_weights \
--export_path=/data/models/v5-full_merge
The script for feature extraction is extractor.py; its execution requires three mandatory parameters:
- Extractor type (DELF, SIFT, SURF)
- Absolute or Relative Path of the image folder
- Absolute or Relative Path and name of the file to generate
The following execution will create a csv file with the POI information and a txt file with the descriptor values:
extractor.py SIFT /images_dataset /test/images_descriptors
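A sketch of how those three mandatory parameters could be validated (hypothetical argument handling; the real extractor.py may differ):

```python
import argparse

def parse_extractor_args(argv):
    """Validate the three mandatory extractor.py parameters."""
    parser = argparse.ArgumentParser(description="Feature extraction")
    parser.add_argument("extractor", choices=["DELF", "SIFT", "SURF"],
                        help="type of feature extractor")
    parser.add_argument("images_dir", help="path of the image folder")
    parser.add_argument("output", help="path and name of the file to generate")
    return parser.parse_args(argv)
```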
The mining process requires representing each image using the bag-of-words model. With the descriptors generated in the previous step, we build this vocabulary. For the execution of cluster.py, it is mandatory to specify three parameters:
- Absolute or Relative Path of the file generated by the previous step
- Absolute or Relative Path and name of the file to generate
- Number of clusters (final vocabulary size)
The following execution will create a new csv file named images_visual_vocabulary using 2000 clusters:
cluster.py /test/images_descriptors /test/images_visual_vocabulary 2000
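Conceptually, this step quantizes the local descriptors into a visual vocabulary by clustering. A minimal NumPy sketch of the underlying k-means idea (illustrative only, not the actual cluster.py):

```python
import numpy as np

def build_vocabulary(descriptors, n_clusters, n_iters=20, seed=0):
    """Cluster descriptors with plain Lloyd's k-means; returns centers and labels."""
    descriptors = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each descriptor to the nearest center
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned descriptors
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = descriptors[labels == k].mean(axis=0)
    return centers, labels
```

Each descriptor is then represented by the index of its nearest center, i.e. its visual word.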
Using the file generated in the previous step, we create the input document for the mining step. For the execution of SMHdocument.py, it is mandatory to specify three parameters:
- Magnitude associated with the image indices (SIZE or FRECUENCY)
- Absolute or Relative Path of the file generated in the previous step
- Absolute or Relative Path of the document to be generated

Optionally, we can use drop_outliers in order to reduce the number of words.
The following execution will create the document images_BoW.words, weighting each visual word by the size with which it appears within the image:
SMHdocument.py SIZE /test/images_visual_vocabulary data/SMH_files/images_BoW.words
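As a rough sketch of this bag-of-words document construction (the helper name and the assumed line layout are illustrative; check the exact .words corpus format against the Sampled-MinHashing repository):

```python
from collections import Counter

def image_to_smh_line(cluster_ids):
    """Turn one image's visual-word occurrences into one corpus line.

    Assumed line layout (verify against the Sampled-MinHashing docs):
    '<number of distinct words> word:weight word:weight ...'
    """
    counts = Counter(cluster_ids)
    pairs = " ".join(f"{word}:{weight}" for word, weight in sorted(counts.items()))
    return f"{len(counts)} {pairs}"
```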
If we have Sampled-MinHashing installed we can perform the mining like:
# This creates an inverted file index
smhcmd ifindex data/SMH_files/images_BoW.words data/SMH_files/images_BoW.ifs
smhcmd discover -r 2 -l 750 data/SMH_files/images_BoW.ifs data/SMH_models/images_structures.model
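The discover step mines groups of visual words that co-occur across images via Min-Hashing. As a toy illustration of the principle it builds on (not the smhcmd implementation): two sets agree on a random min-hash with probability equal to their Jaccard similarity, so signature agreement estimates set similarity.

```python
import random

def minhash_signature(items, n_hashes=128, seed=7):
    """MinHash signature: min of (a*x + b) mod p over the set, per hash function."""
    rng = random.Random(seed)
    p = 2_147_483_647  # a large prime
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(n_hashes)]
    return [min((a * x + b) % p for x in items) for a, b in funcs]

def estimated_jaccard(sig1, sig2):
    """Fraction of hash positions on which the two signatures agree."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)
```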
To explore the visual structures (patterns), we can use this notebook to display the images that belong to the same structure.