You can use this notebook to write the ML pipeline for the classification of the galaxies in the GALAXYZOO dataset or create a folder with different files associated to the different steps of the ML pipeline.

In [1]:
#Importing libraries

import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#import Random forest classifiers
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Downloading the Galaxy Zoo Dataset

You can find the dataset from the Github repository at the url:



https://www.kaggle.com/competitions/galaxy-zoo-the-galaxy-challenge/data

---

##### Create a data frame with columns for objid and the corresponding assest_id.  
- asset_id: an integer that corresponds to the filename of the image of a particular galaxy.
- objid is the designation of the galaxy, e.g. galaxy 587722981741363294

In [2]:
import pandas as pd 

# get the objid and corresponding asset_id from gz2_filename_mapping.csv
columns_to_keep = ['objid', 'asset_id']

# Read the selected columns from the file
name_map = pd.read_csv("data/gz2_filename_mapping.csv", usecols=columns_to_keep)

# display the first few rows
print(name_map.head(5))

name_map.info()

                objid  asset_id
0  587722981736120347         1
1  587722981736579107         2
2  587722981741363294         3
3  587722981741363323         4
4  587722981741559888         5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355990 entries, 0 to 355989
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   objid     355990 non-null  int64
 1   asset_id  355990 non-null  int64
dtypes: int64(2)
memory usage: 5.4 MB


---

#### Create a data frame with dr7objid and corresponding label. 
- dr7objid gives the galaxy designation same as objid from the previous data frame.
- label correspond to some classification of the galaxy based on its shape and morphology. 

In [3]:
# select columns dr7objid and gz2class from zoo2MainSpecz.csv
columns_to_keep = ['dr7objid', 'gz2class']

# Read the selected columns from the file
labels = pd.read_csv("data/zoo2MainSpecz.csv", usecols=columns_to_keep)

# change the name of column dr7objid to objid for merging later
labels.rename(columns={'dr7objid':'objid'}, inplace=True)

# display
print(labels.head(5))

labels.info()


                objid gz2class
0  588017703996096547    SBb?t
1  587738569780428805      Ser
2  587735695913320507     Sc+t
3  587742775634624545   SBc(r)
4  587732769983889439      Ser
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   objid     243500 non-null  int64 
 1   gz2class  243500 non-null  object
dtypes: int64(1), object(1)
memory usage: 3.7+ MB


---

1. Convert array of pixels in rows of a tabular dataset,
   using single pixels as feature columns and the intensities as values measured 

In [4]:
# Sequential implementation. Loops through one image at a time. This is embarassingly 
# parallelizable. The task which here consists of 1. processing images, converting to grayscale 
# and flattening pixel values is CPU-bound, ie performance is determined promarily by how
# CPU can process it in contrast to I/O bound. 
# We can parallelize using Multiprocessing library or Dask. 

import os 
from PIL import Image, ImageOps
from numpy import asarray

# Directory containing the images
image_dir = "data/images"

# List to store image data 
image_data = []
image_names = []

# Iterate over all files in the directory 
for filename in os.listdir(image_dir):
    if filename.endswith(('.jpg', '.png')): #filter the image files
        image_path = os.path.join(image_dir, filename)

        # Open image and convert to grayscale
        img = Image.open(image_path)
        img_gray = ImageOps.grayscale(img)

        # Convert to a numpy array and flatter it to 1D
        img_array = np.asarray(img_gray).flatten()

        #store the image data and filename
        image_data.append(img_array)

        # Extract the base name without the extension
        image_name = os.path.splitext(filename)[0] # Get only the root, ie w/o extension
    
        image_names.append(image_name)

# convert to DataFrame
image_data = pd.DataFrame(image_data)
image_data.insert(0, "asset_id", image_names) # NOTE: asset_id values are object type. Need to convert to int64 before merging later. 
# print(image_data['asset_id'].dtype)

#display the data frame
print(image_data.head())
image_data.info()

# Save to CSV
#image_data("image_pixel_data.csv", index=False)

  asset_id   0  1  2  3  4   5   6   7   8  ...  179766  179767  179768  \
0    97848   6  5  3  2  3   4   5   6   5  ...       6       7       8   
1   161420   8  7  7  8  9  12  14  16  25  ...       2       4       2   
2    17129   3  4  5  6  6   6   6   5   5  ...       1       1       6   
3   246153  11  7  2  0  0   3   5   5   5  ...       3       4       5   
4   167167   1  1  1  1  1   1   1   1   0  ...       3       3       1   

   179769  179770  179771  179772  179773  179774  179775  
0       7       6       5       3       2       1       0  
1       1       0       0       0       0       0       0  
2       3       1       1       1       3       4       4  
3       4       2       1       1       2       4       5  
4       1       2       3       3       2       2       1  

[5 rows x 179777 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 179777 entries, asset_id to 179775
dtypes: object(1), uint8(179776)
memory usage: 

In [5]:
# Merge labels and name_map dataframes to map asset_id to gz2class
# merge based on objid. use an inner join (only matching rows) 
# since only a subset of points in labels are in name_map, ann inner join 
# will include the rows from name_map that have matching gz2class values
# this will avoid NaNs

labels_mapped = pd.merge(name_map, labels, on='objid', how='inner' ) 

print(labels_mapped.head(5))

labels_mapped.info() # should have the same number of rows as the dataframe labels


                objid  asset_id gz2class
0  587722981741363294         3       Ei
1  587722981741363323         4       Sc
2  587722981741559888         5       Er
3  587722981741625481         6       Er
4  587722981741625484         7       Ei
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   objid     243500 non-null  int64 
 1   asset_id  243500 non-null  int64 
 2   gz2class  243500 non-null  object
dtypes: int64(2), object(1)
memory usage: 5.6+ MB


In [6]:
# Merge labels_mapped with image_data to insert gz2class columnt to the latter 
# Merge based on asset_id and use an inner join. image_data which is our 
# main data frame will only have, in general, a subset of data points (galaxies)
# in labels_mapped. 

# convert asset_id values in image_data from object to int64 before mergeing
image_data['asset_id'] = labels_mapped['asset_id'].astype(int)

#merge
galaxy_data = pd.merge(labels_mapped, image_data, on='asset_id', how='inner' ) 

# Move gz2class to the last position to serve as labels
galaxy_data['gz2class'] = galaxy_data.pop('gz2class')  

# print
print(galaxy_data.head(5))

galaxy_data.info()


                objid  asset_id   0  1  2  3  4   5   6   7  ...  179767  \
0  587722981741363294         3   6  5  3  2  3   4   5   6  ...       7   
1  587722981741363323         4   8  7  7  8  9  12  14  16  ...       4   
2  587722981741559888         5   3  4  5  6  6   6   6   5  ...       1   
3  587722981741625481         6  11  7  2  0  0   3   5   5  ...       4   
4  587722981741625484         7   1  1  1  1  1   1   1   1  ...       3   

   179768  179769  179770  179771  179772  179773  179774  179775  gz2class  
0       8       7       6       5       3       2       1       0        Ei  
1       2       1       0       0       0       0       0       0        Sc  
2       6       3       1       1       1       3       4       4        Er  
3       5       4       2       1       1       2       4       5        Er  
4       1       1       2       3       3       2       2       1        Ei  

[5 rows x 179779 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex

In [7]:
from dask import delayed, compute
import os 
from PIL import Image, ImageOps
from numpy import asarray

In [8]:
# parallel implementation of processing the images with DASK

# Directory containing images
image_dir = "data/images"

# Get list of image file paths
image_files = [
    os.path.join(image_dir, f) for f in os.listdir(image_dir)
    if f.endswith(('.png', '.jpg'))
]

# Function to process a single image (Dask version)
@delayed
def process_image(image_path):
    try:
        img = Image.open(image_path)
        img_gray = ImageOps.grayscale(img)
        img_array = np.asarray(img_gray).flatten()
        filename = os.path.basename(image_path)
        return os.path.splitext(filename)[0], img_array
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None

# Parallel execution using Dask
delayed_results = [process_image(img) for img in image_files]
results = compute(*delayed_results)

# Filter out failed reads
results = [res for res in results if res is not None]

# Convert to Dask DataFrame
image_names, data = zip(*results)
image_data = pd.DataFrame(data)
image_data.insert(0, "asset_id", image_names)

print(galaxy_data.head())

image_data.info()
# Save to CSV
#df.to_csv("image_pixel_data.csv", index=False)
#print("Processing complete. Data saved to 'image_pixel_data.csv'.")


                objid  asset_id   0  1  2  3  4   5   6   7  ...  179767  \
0  587722981741363294         3   6  5  3  2  3   4   5   6  ...       7   
1  587722981741363323         4   8  7  7  8  9  12  14  16  ...       4   
2  587722981741559888         5   3  4  5  6  6   6   6   5  ...       1   
3  587722981741625481         6  11  7  2  0  0   3   5   5  ...       4   
4  587722981741625484         7   1  1  1  1  1   1   1   1  ...       3   

   179768  179769  179770  179771  179772  179773  179774  179775  gz2class  
0       8       7       6       5       3       2       1       0        Ei  
1       2       1       0       0       0       0       0       0        Sc  
2       6       3       1       1       1       3       4       4        Er  
3       5       4       2       1       1       2       4       5        Er  
4       1       1       2       3       3       2       2       1        Ei  

[5 rows x 179779 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex

In [9]:
# Merge labels_mapped with image_data to insert gz2class columnt to the latter 
# Merge based on asset_id and use an inner join. image_data which is our 
# main data frame will only have, in general, a subset of data points (galaxies)
# in labels_mapped. 

# convert asset_id values in image_data from object to int64 before mergeing
image_data['asset_id'] = labels_mapped['asset_id'].astype(int)

#merge
galaxy_data = pd.merge(labels_mapped, image_data, on='asset_id', how='inner') 

# Move gz2class to the last position to serve as labels
galaxy_data['gz2class'] = galaxy_data.pop('gz2class')  

# print
print(galaxy_data.head(5))

galaxy_data.info()

                objid  asset_id   0  1  2  3  4   5   6   7  ...  179767  \
0  587722981741363294         3   6  5  3  2  3   4   5   6  ...       7   
1  587722981741363323         4   8  7  7  8  9  12  14  16  ...       4   
2  587722981741559888         5   3  4  5  6  6   6   6   5  ...       1   
3  587722981741625481         6  11  7  2  0  0   3   5   5  ...       4   
4  587722981741625484         7   1  1  1  1  1   1   1   1  ...       3   

   179768  179769  179770  179771  179772  179773  179774  179775  gz2class  
0       8       7       6       5       3       2       1       0        Ei  
1       2       1       0       0       0       0       0       0        Sc  
2       6       3       1       1       1       3       4       4        Er  
3       5       4       2       1       1       2       4       5        Er  
4       1       1       2       3       3       2       2       1        Ei  

[5 rows x 179779 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex

### 2. Perform EDA and feature preprocessing
#### 2.1 Exploratory Data Analysis (EDA)

In [10]:
# print
#print(galaxy_data.head(4))

print(galaxy_data.shape)  # Check dimensions

galaxy_data.info()# Check data types & missing values

print(galaxy_data.describe())  # Get summary stats

(100, 179779)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 179779 entries, objid to gz2class
dtypes: int64(2), object(1), uint8(179776)
memory usage: 17.1+ MB
              objid    asset_id           0           1           2  \
count  1.000000e+02  100.000000  100.000000  100.000000  100.000000   
mean   5.877230e+17   54.850000    4.860000    4.910000    4.900000   
std    2.318057e+06   30.694117    4.332214    4.022902    4.123106   
min    5.877230e+17    3.000000    0.000000    0.000000    0.000000   
25%    5.877230e+17   28.750000    2.000000    2.000000    2.000000   
50%    5.877230e+17   53.500000    4.000000    4.000000    4.000000   
75%    5.877230e+17   81.250000    7.000000    7.000000    7.000000   
max    5.877230e+17  108.000000   21.000000   19.000000   19.000000   

                3           4           5          6           7  ...  \
count  100.000000  100.000000  100.000000  100.00000  100.000000  ...   
mean     5.060000   

## Adding the symmetry data

For this kind of images we can get the symmetry information given some axis in order to add a new data that can be relevant to the data frame. In the python module named `get_symmetry` inside of the `src/` folder we define a pair of functions that given an array of images and the number of axis of symmetry we compute the differences between the intensity values of each pixel pair.

The process of getting the symmetries is the following one:

1. Get the coordinates of each pixel.
2. Rotate the coordinate system in an angle $ \theta = \dfrac{i * \pi }{n\_axis} $. Where $i$ is a number that goes from 0 to $ \dfrac{n\_axis}{2}$ and $n_axis$ is the number of axis of symmetry
3. Ignore the data that is outside of a circle of radius width of the image and centered in the middle of the image.
4. Split the image in the upper and the lower part, reflect the left part and compute the distance and append to the vector that stores the symmetry values.
5. Split the image in the left and right part, reflect the right part and compute the distance and append to the vector that stores the symmetry values.
6. Repeat until $i = \dfrac{n\_axis}{2}$.
7. Return the vector with the symmetry values.

For instance, an example of the step 5 is showing on the next image:

![images/symmetries.png](images/symmetries.png)

Lower the distance (darker the image) more symmetric the original image.

Then, we have the following code:

In [12]:
# Import the function from the module
from src.get_symmetry import get_all_symmetries

# Get the data related with the pixels
get_images_column = np.linspace(0, 424*424-1, 424*424)
# Reshape the image and assign it to an array
images_array = np.reshape(galaxy_data[get_images_column].to_numpy(), shape= (100, 424, 424))

In [13]:
%%time
# Chossing the number of axis
axis = 12

# Defining the column names
columns = [f"axis-{i}" for i in range(axis)]

# Getting the data
sym_data = get_all_symmetries(images_array, axis)

CPU times: user 5.32 s, sys: 2.94 ms, total: 5.32 s
Wall time: 5.34 s


In [36]:
# Appending to the data frame
for i in range(axis):
    galaxy_data[columns[i]] = sym_data[:,i]

galaxy_data[columns].describe()

Unnamed: 0,axis-0,axis-1,axis-2,axis-3,axis-4,axis-5,axis-6,axis-7,axis-8,axis-9,axis-10,axis-11
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,3.156401,3.009409,3.103811,2.892341,3.058632,2.953118,3.02772,2.967993,3.000322,2.92646,2.994521,3.047535
std,1.641796,1.303819,1.658116,1.319037,1.59413,1.42722,1.46165,1.397005,1.406018,1.325152,1.346774,1.572757
min,1.561248,1.522,1.570844,1.507126,1.570376,1.473278,1.54302,1.38647,1.565415,1.459394,1.54925,1.479385
25%,2.277803,2.326385,2.291064,2.218331,2.211463,2.20414,2.270179,2.2332,2.266536,2.253306,2.186474,2.22798
50%,2.707925,2.724949,2.583573,2.56198,2.578742,2.581371,2.603729,2.548752,2.582389,2.543538,2.658186,2.583715
75%,3.273444,3.206663,3.338863,3.147528,3.277448,3.186727,3.231517,3.108599,3.273524,3.139827,3.197606,3.170128
max,9.90269,9.828737,10.8172,10.732172,11.293827,10.550157,11.122019,8.957258,10.51548,8.643106,9.712247,9.090502


# Utility function to read Images

In [39]:
#Using PILLOW to convert images to array
from PIL import Image, ImageOps
from numpy import asarray
 
 
# load the image and convert into numpy array
img = Image.open('data/images/6994.jpg') 

img_gray = ImageOps.grayscale(img)

#img_gray.show() #to check it become gray
 
# asarray() class is used to convert
# PIL images into NumPy arraystotal_classifications
numpydata = asarray(img_gray)


# EDA, feature preprocessing and classification

1. Convert array of pixels in rows of a tabular dataset,
   using single pixels as feature columns and the intensities as values measured 
   
   
2. Perform EDA and feature preprocessing

3. Estimate the symmetry of the preprocessed images with respect to 12 axes and add this info to the original data

3. Test how much you can reduce the dimensions of the problem with one algorithm between (PCA, kPCA ..)
4. Check how many clusters can be associated to the data points joint distribution using tSNE or UMAP

5. Build the classifier using Random Forest (play with different  depth and number of trees) or SVC

6. Train the classifier

7. Predict the class labels


# Evaluate the accuracy of the Classifier
## Plot Confusion matrix