# **Read me**

This project focuses on unsupervised ML, so no labeled galaxy dataset was sought. The images (JPG) were obtained from the [Galaxy Zoo](https://data.galaxyzoo.org/#section-21) platform and correspond to Galaxy Zoo 2 ([images_gz2.zip](https://zenodo.org/records/3565489#.Y3vFKS-l0eY)).

A random sample of 10,000 images (JPG) was taken for training and another 5,000 (JPG) for testing. Working with 15,000 separate files is impractical, so the images need to be packed into a more portable container.

Even so, [direct access](https://drive.google.com/drive/folders/19dFtIuN5AJwEbsUqUL_6RZT1T1TBEMnY?usp=drive_link) to the JPG images used in the project will be kept available.

This notebook shows how to convert the 15,000 ".jpg" images into ".h5" files, grouped into two sets (one for training and one for testing), and how to save the result to Google Drive.
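
The README does not say how the random selection was made. As a minimal sketch, assuming the file names are available as a list, two disjoint random subsets could be drawn as follows. The function name `split_train_test`, the fixed seed, and the made-up file names are hypothetical and only serve the illustration.

```python
import random

def split_train_test(names, n_train, n_test, seed=0):
    """Return two disjoint random samples of file names (train, test)."""
    if n_train + n_test > len(names):
        raise ValueError("not enough files for the requested split")
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split reproducible
    chosen = rng.sample(sorted(names), n_train + n_test)
    return chosen[:n_train], chosen[n_train:]

# Small illustration with made-up file names (the project used 10,000 / 5,000)
names = [f"{i}.jpg" for i in range(30)]
train, test = split_train_test(names, n_train=20, n_test=10)
print(len(train), len(test))
```

Sampling both subsets in a single call guarantees that no image ends up in both the training and the testing set.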

# Set up

## Package

In [None]:
# Mount Google Drive to access and store files
# Comment out this cell if running the script locally

from google.colab import drive
drive.mount('/content/drive', force_remount=True)  # Force remount to ensure access


Mounted at /content/drive


In [None]:
import os
import time

import h5py
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Directories

In [None]:
# Define basic configuration for data usage
DATA_USAGE = ['Train', 'Test']  # Specifies the dataset categories (training and testing)

# Paths for storing data
ROOT_JPGS = '/content/drive/MyDrive/Practicas_Profesionales/Data/JPGs'  # Directory for storing JPG image files
ROOT_H5S = '/content/drive/MyDrive/Practicas_Profesionales/Data/H5s'  # Directory for storing HDF5 (H5) data files


# Convert JPG to H5

In [None]:
indx = 0  # Index to select the dataset type (0 = 'Train', 1 = 'Test')

# Construct the path to the selected dataset (e.g., training images)
root_data = os.path.join(ROOT_JPGS, DATA_USAGE[indx])

# List all JPG images in the selected directory (other files are ignored)
imgs_jpg = [f for f in os.listdir(root_data) if f.lower().endswith('.jpg')]


In [None]:
cuenta = 0  # Counter to track the number of processed images

print(f'Data extracted: {DATA_USAGE[indx]}')  # Display the dataset being processed
start_time = time.time()  # Start time for performance measurement

# Make sure the output directory exists before writing any HDF5 file
os.makedirs(os.path.join(ROOT_H5S, DATA_USAGE[indx]), exist_ok=True)

# Iterate through all images in the selected dataset
for img_file in imgs_jpg:

  # Extract the image name without the file extension
  name = os.path.splitext(img_file)[0]

  # Define the corresponding HDF5 file path
  name_fileh5 = os.path.join(ROOT_H5S, DATA_USAGE[indx], name + '.h5')

  # Get the full path to the current image
  root_img = os.path.join(root_data, img_file)

  # Open and convert the image to a NumPy array
  with Image.open(root_img) as img_jpg:
      img_array = np.array(img_jpg)

  # Save the image data in an HDF5 file with compression
  with h5py.File(name_fileh5, 'w') as f:
      f.create_dataset("number", data=np.array(int(name), dtype=np.int32))  # Store image number
      f.create_dataset("galaxy", data=img_array, compression="gzip", compression_opts=9)  # Store image data with compression

  cuenta += 1  # Update counter

  # Display progress every 250 images
  if cuenta % 250 == 0:
      delta_t = time.time() - start_time  # Calculate elapsed time
      print(f'Progress: {np.round(cuenta / len(imgs_jpg) * 100, 4)}%\nExecution Time: {delta_t:.2f} s\n')


Data extracted: Train
Progress: 2.5%
Execution Time: 227.56 s

Progress: 5.0%
Execution Time: 255.09 s

Progress: 7.5%
Execution Time: 277.77 s

Progress: 10.0%
Execution Time: 297.84 s

Progress: 12.5%
Execution Time: 319.27 s

Progress: 15.0%
Execution Time: 343.18 s

Progress: 17.5%
Execution Time: 365.43 s

Progress: 20.0%
Execution Time: 387.95 s

Progress: 22.5%
Execution Time: 409.89 s

Progress: 25.0%
Execution Time: 433.12 s

Progress: 27.5%
Execution Time: 458.37 s

Progress: 30.0%
Execution Time: 482.59 s

Progress: 32.5%
Execution Time: 507.68 s

Progress: 35.0%
Execution Time: 532.54 s

Progress: 37.5%
Execution Time: 557.89 s

Progress: 40.0%
Execution Time: 580.58 s

Progress: 42.5%
Execution Time: 604.53 s

Progress: 45.0%
Execution Time: 627.77 s

Progress: 47.5%
Execution Time: 653.59 s

Progress: 50.0%
Execution Time: 679.05 s

Progress: 52.5%
Execution Time: 703

## Check saved files

In [None]:
# Define the directory containing the HDF5 files for the selected dataset (Train or Test)
data_dir = os.path.join(ROOT_H5S, DATA_USAGE[indx])

# Select a specific HDF5 file (in this case, the 9th file in the list)
check_file = os.listdir(data_dir)[8]

# Construct the full path to the selected HDF5 file
name_file = os.path.join(data_dir, check_file)

# Start timing the file reading process
start_time = time.time()

# Open and read the selected HDF5 file
with h5py.File(name_file, 'r') as f:
    imagenes = f["galaxy"][:]  # Load image data
    imagenes_nombres = f["number"][()]  # Load image identifier

# Print the time taken to read the file (in seconds)
print(time.time() - start_time)


0.31943821907043457
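
Beyond timing the read, it can be useful to confirm that the gzip-compressed HDF5 file returns exactly the pixels that were written. The following is a minimal sketch of such a round-trip check, assuming a synthetic 64x64 RGB array in place of a real galaxy image and a hypothetical image number; it reuses the dataset names `"galaxy"` and `"number"` from the conversion cell.

```python
import os
import tempfile

import h5py
import numpy as np

# Synthetic RGB image standing in for a galaxy JPG (shape and values are assumptions)
img_array = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "12345.h5")  # hypothetical image number

    # Write with the same dataset names and compression settings as the conversion cell
    with h5py.File(path, "w") as f:
        f.create_dataset("number", data=np.array(12345, dtype=np.int32))
        f.create_dataset("galaxy", data=img_array, compression="gzip", compression_opts=9)

    # Read the data back while the temporary file still exists
    with h5py.File(path, "r") as f:
        restored = f["galaxy"][:]
        number = int(f["number"][()])

# gzip is lossless, so values and dtype should match the original array
round_trip_ok = np.array_equal(restored, img_array) and restored.dtype == img_array.dtype
print(round_trip_ok, number)
```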


# File compression

## H5 Files

In [None]:
# Compress all .h5 files into a single archive
# This process is done separately for both Train and Test datasets

indx = 1  # Select dataset to compress [0 = Train, 1 = Test]

# Define the directory containing the HDF5 files for the selected dataset
data_dir = os.path.join(ROOT_H5S, DATA_USAGE[indx])

# Define the output compressed file path (Train.tar.gz or Test.tar.gz)
tar_file = os.path.join(ROOT_H5S, f"{DATA_USAGE[indx]}.tar.gz")

# Create a compressed tar archive (.tar.gz) containing all files in the selected directory
!tar -czvf "{tar_file}" -C "{data_dir}" .


Streaming output truncated to the last 5000 lines.
./140592.h5
tar: ./140592.h5: file changed as we read it
./140676.h5
tar: ./140676.h5: file changed as we read it
./140776.h5
tar: ./140776.h5: file changed as we read it
./140835.h5
tar: ./140835.h5: file changed as we read it
./140839.h5
tar: ./140839.h5: file changed as we read it
./140852.h5
./140855.h5
tar: ./140855.h5: file changed as we read it
./140877.h5
tar: ./140877.h5: file changed as we read it
./140971.h5
tar: ./140971.h5: file changed as we read it
./140992.h5
tar: ./140992.h5: file changed as we read it
./141004.h5
tar: ./141004.h5: file changed as we read it
./141066.h5
tar: ./141066.h5: file changed as we read it
./141078.h5
tar: ./141078.h5: file changed as we read it
./141080.h5
tar: ./141080.h5: file changed as we read it
./141136.h5
tar: ./141136.h5: file changed as we read it
./141161.h5
tar: ./141161.h5: file changed as we read it
./141240.h5
tar: ./141240.h5: file changed as we rea
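
Because the log above contains many `file changed as we read it` warnings, a quick sanity check of the resulting archive may be worthwhile. The sketch below uses Python's standard `tarfile` module as a stand-in for the shell command, packing a directory and then listing the archive members so they can be compared with the directory contents. The helper names `make_tar_gz` and `list_members` and the placeholder files are hypothetical; unlike `tar -C dir .`, the member names here carry no leading `./`.

```python
import os
import tarfile
import tempfile

def make_tar_gz(src_dir, tar_path):
    """Pack every entry in src_dir into tar_path, storing names relative to src_dir."""
    with tarfile.open(tar_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            tar.add(os.path.join(src_dir, name), arcname=name)

def list_members(tar_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())

# Demonstration on a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as tmp_dir:
    src_dir = os.path.join(tmp_dir, "Test")
    os.makedirs(src_dir)
    for name in ("1.h5", "2.h5", "3.h5"):
        with open(os.path.join(src_dir, name), "wb") as fh:
            fh.write(b"placeholder")

    # The archive is written outside src_dir so it does not include itself
    tar_path = os.path.join(tmp_dir, "Test.tar.gz")
    make_tar_gz(src_dir, tar_path)
    members = list_members(tar_path)
    expected = sorted(os.listdir(src_dir))

print(members == expected)
```

If the two listings differ, the archive is missing files and the compression step should be repeated.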

## JPG Files

In [None]:
!apt-get install pigz


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  pigz
0 upgraded, 1 newly installed, 0 to remove and 18 not upgraded.
Need to get 63.6 kB of archives.
After this operation, 162 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 pigz amd64 2.6-1 [63.6 kB]
Fetched 63.6 kB in 0s (129 kB/s)
Selecting previously unselected package pigz.
(Reading database ... 124926 files and directories currently installed.)
Preparing to unpack .../archives/pigz_2.6-1_amd64.deb ...
Unpacking pigz (2.6-1) ...
Setting up pigz (2.6-1) ...
Processing triggers for man-db (2.10.2-1) ...


In [None]:
# Compress JPG images from both Train and Test datasets to save storage space in Google Drive

ROOT_JPGS = '/content/drive/MyDrive/Practicas_Profesionales/Data/JPGs'

# Define the path for the compressed archive containing both Train and Test image folders
tar_dir = os.path.join(ROOT_JPGS, 'Test&Train_JPGs.tar.gz')

# Create a .tar.gz archive, piping tar through pigz (level 9, 4 threads) for parallel gzip compression
!tar -cf - -C "{ROOT_JPGS}" Train Test | pigz -9 -p 4 > "{tar_dir}"


Streaming output truncated to the last 5000 lines.
tar: Train/27964.jpg: file changed as we read it
tar: Train/27974.jpg: file changed as we read it
tar: Train/28029.jpg: file changed as we read it
tar: Train/28032.jpg: file changed as we read it
tar: Train/28062.jpg: file changed as we read it
tar: Train/28101.jpg: file changed as we read it
tar: Train/28106.jpg: file changed as we read it
tar: Train/28144.jpg: file changed as we read it
tar: Train/28170.jpg: file changed as we read it
tar: Train/28207.jpg: file changed as we read it
tar: Train/28266.jpg: file changed as we read it
tar: Train/28284.jpg: file changed as we read it
tar: Train/28289.jpg: file changed as we read it
tar: Train/28304.jpg: file changed as we read it
tar: Train/28387.jpg: file changed as we read it
tar: Train/28395.jpg: file changed as we read it
tar: Train/28415.jpg: file changed as we read it
tar: Train/28424.jpg: file changed as we read it
tar: Train/28435.jpg: file changed as