Use this notebook to write the ML pipeline that classifies the galaxies in the Galaxy Zoo dataset, or create a folder with separate files for the different steps of the pipeline.

In [1]:
#Importing libraries

import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier  # Random Forest classifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Downloading the Galaxy Zoo Dataset

You can download the dataset from Kaggle at the following URL:

https://www.kaggle.com/competitions/galaxy-zoo-the-galaxy-challenge/data

---

##### Create a data frame with columns for objid and the corresponding asset_id.
- asset_id: an integer that corresponds to the filename of the image of a particular galaxy.
- objid: the designation of the galaxy, e.g. galaxy 587722981741363294.

In [11]:
import pandas as pd 

# get the objid and corresponding asset_id from gz2_filename_mapping.csv
columns_to_keep = ['objid', 'asset_id']

# Read the selected columns from the file
name_map = pd.read_csv("data/gz2_filename_mapping.csv", usecols=columns_to_keep)

# display the first few rows
print(name_map.head(5))

name_map.info()

                objid  asset_id
0  587722981736120347         1
1  587722981736579107         2
2  587722981741363294         3
3  587722981741363323         4
4  587722981741559888         5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355990 entries, 0 to 355989
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   objid     355990 non-null  int64
 1   asset_id  355990 non-null  int64
dtypes: int64(2)
memory usage: 5.4 MB


---

#### Create a data frame with dr7objid and the corresponding label.
- dr7objid: the galaxy designation, the same identifier as objid in the previous data frame.
- gz2class: the label, a classification of the galaxy based on its shape and morphology.

In [10]:
# select columns dr7objid and gz2class from zoo2MainSpecz.csv
columns_to_keep = ['dr7objid', 'gz2class']


# Read the selected columns from the file
labels = pd.read_csv("data/zoo2MainSpecz.csv", usecols=columns_to_keep)

# display
print(labels.head(5))

labels.info()


             dr7objid gz2class
0  588017703996096547    SBb?t
1  587738569780428805      Ser
2  587735695913320507     Sc+t
3  587742775634624545   SBc(r)
4  587732769983889439      Ser
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243500 entries, 0 to 243499
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   dr7objid  243500 non-null  int64 
 1   gz2class  243500 non-null  object
dtypes: int64(1), object(1)
memory usage: 3.7+ MB
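
Because dr7objid and objid designate the same galaxies, the two data frames can be joined so that every image filename (asset_id) carries its gz2class label. The following is a minimal sketch of that join, not necessarily the approach intended for the exercise. The helper name `attach_labels` and the toy rows are assumptions made for illustration; in the notebook the inputs would be `name_map` and `labels`.

```python
import pandas as pd


def attach_labels(name_map: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Return one row per galaxy with objid, asset_id and gz2class."""
    merged = name_map.merge(
        labels,
        left_on="objid",
        right_on="dr7objid",
        how="inner",  # keep only galaxies present in both tables
    )
    # dr7objid duplicates objid after the join, so drop it
    return merged.drop(columns="dr7objid")


# Toy stand-ins for the two tables (illustrative values, not real catalogue rows)
toy_name_map = pd.DataFrame({"objid": [101, 102, 103], "asset_id": [1, 2, 3]})
toy_labels = pd.DataFrame(
    {"dr7objid": [103, 101, 999], "gz2class": ["Ser", "Sc+t", "SBb?t"]}
)

labelled = attach_labels(toy_name_map, toy_labels)
print(labelled)
```

An inner join drops galaxies that have an image but no label (and vice versa), which is usually what a supervised classifier needs. Note that the pixel table built later stores asset_id as strings taken from filenames, so it may need a cast to integer before being joined on asset_id.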


---

1. Convert each image's array of pixels into a row of a tabular dataset,
   using individual pixels as feature columns and the measured intensities as values.

In [18]:
# Sequential implementation: loops through one image at a time.
# The task (opening each image, converting it to grayscale and flattening the
# pixel values) is embarrassingly parallel and CPU-bound, i.e. performance is
# determined primarily by how fast the CPU can process it, in contrast to
# I/O-bound work. It could be parallelized with the multiprocessing library or Dask.

import os 
from PIL import Image, ImageOps
from numpy import asarray

# Directory containing the images
image_dir = "data/images"

# List to store image data 
image_data = []
image_names = []

# Iterate over all files in the directory 
for filename in os.listdir(image_dir):
    if filename.endswith(('.jpg', '.png')): #filter the image files
        image_path = os.path.join(image_dir, filename)

        # Open image and convert to grayscale
        img = Image.open(image_path)
        img_gray = ImageOps.grayscale(img)

        # Convert to a numpy array and flatten it to 1D
        img_array = np.asarray(img_gray).flatten()

        #store the image data and filename
        image_data.append(img_array)

        # Extract the base name without the extension
        image_name = os.path.splitext(filename)[0] # Get only the root, ie w/o extension
    
        image_names.append(image_name)

#convert to DataFrame
image_data_frame = pd.DataFrame(image_data)
image_data_frame.insert(0, "asset_id", image_names)

#display the data frame
print(image_data_frame.head())
image_data_frame.info()

# Save to CSV
image_data_frame.to_csv("image_pixel_data.csv", index=False)

  asset_id  0  1  2  3  4  5  6  7  8  ...  179766  179767  179768  179769  \
0    87384  4  4  2  1  0  0  0  0  1  ...       7       9      13      10   
1   165078  0  0  0  1  3  5  7  8  2  ...       4       4       2       2   
2   155364  5  6  7  7  7  7  6  5  5  ...      12      13      20      21   
3   261278  0  0  1  1  1  1  1  1  5  ...       6       3       5       8   
4   227960  3  3  3  2  2  1  1  1  1  ...       9      10      11      12   

   179770  179771  179772  179773  179774  179775  
0       4       1       1       3       7      11  
1       2       3       3       4       4       5  
2      20      17      13       9       7       7  
3      10      10       7       4       2       2  
4      13      14      14      14      13      13  

[5 rows x 179777 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Columns: 179777 entries, asset_id to 179775
dtypes: object(1), uint8(179776)
memory usage: 17.1+ MB
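
The comments in the cell above note that the per-image work is independent and could be parallelized. One possible sketch uses `concurrent.futures.ProcessPoolExecutor` from the standard library, which is built on multiprocessing. The function names `load_flat_gray` and `build_pixel_frame` and the `chunksize` value are assumptions chosen for illustration, not part of the original notebook.

```python
import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd
from PIL import Image, ImageOps


def load_flat_gray(image_path):
    """Open one image, convert it to grayscale and flatten it to a 1D array."""
    with Image.open(image_path) as img:
        return np.asarray(ImageOps.grayscale(img)).flatten()


def build_pixel_frame(image_dir, max_workers=None):
    """Build the asset_id + pixel DataFrame using a pool of worker processes."""
    filenames = sorted(
        f for f in os.listdir(image_dir) if f.endswith((".jpg", ".png"))
    )
    paths = [os.path.join(image_dir, f) for f in filenames]

    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so rows stay aligned with filenames
        rows = list(pool.map(load_flat_gray, paths, chunksize=16))

    frame = pd.DataFrame(rows)
    # Use the filename without its extension as asset_id, as in the sequential version
    frame.insert(0, "asset_id", [os.path.splitext(f)[0] for f in filenames])
    return frame
```

Calling `build_pixel_frame("data/images")` should produce the same table as the sequential loop, with rows sorted by filename. On platforms that start workers with the spawn method (Windows, macOS), functions defined in a notebook cell may not be importable by the workers, so the two functions may need to live in a separate `.py` module.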


# Utility function to read Images

In [13]:
#Using PILLOW to convert images to array
from PIL import Image, ImageOps
from numpy import asarray
 
 
# load the image and convert into numpy array
img = Image.open('data/images/6994.jpg') 

img_gray = ImageOps.grayscale(img)

img_gray.show()  # check that the image is now grayscale

# asarray() converts the PIL image into a NumPy array
numpydata = asarray(img_gray)


# EDA, feature preprocessing and classification

1. Convert each image's array of pixels into a row of a tabular dataset,
   using individual pixels as feature columns and the measured intensities as values.

2. Perform EDA and feature preprocessing.

3. Estimate the symmetry of the preprocessed images with respect to 12 axes and add this information to the original data.

4. Test how much you can reduce the dimensionality of the problem with one algorithm (PCA, kPCA, ...).

5. Check how many clusters can be associated with the joint distribution of the data points, using t-SNE or UMAP.

6. Build the classifier using Random Forest (experiment with different depths and numbers of trees) or SVC.

7. Train the classifier.

8. Predict the class labels.
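
The dimensionality-reduction, training and prediction steps listed above could be chained in a single scikit-learn pipeline. The sketch below is only one way to do it. As stated assumptions, it uses scikit-learn's bundled digits dataset (8x8 images already flattened to 64 pixel columns) as a stand-in for the galaxy pixel table, and the 95% variance threshold, 200 trees and unlimited depth are arbitrary starting values to experiment with.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): flattened 8x8 digit images instead of galaxy images
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale each pixel column, keep enough principal components to explain
# 95% of the variance, then classify with a Random Forest.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0),
)
model.fit(X_train, y_train)

n_components = model.named_steps["pca"].n_components_
y_pred = model.predict(X_test)

print(f"{X.shape[1]} pixel features reduced to {n_components} components")
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```

Fitting the scaler and PCA inside the pipeline means they only ever see the training split, which avoids leaking information from the test images into the preprocessing.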


# Evaluate the accuracy of the Classifier
## Plot Confusion matrix
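
As a minimal sketch of the evaluation step, the snippet below computes the accuracy, a per-class report and the confusion matrix with `sklearn.metrics`, then draws the matrix with `ConfusionMatrixDisplay`. The label lists are toy values made up for illustration (assumption); in the notebook they would be the held-out labels and the classifier's predictions.

```python
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
    confusion_matrix,
)

# Toy labels (assumption): stand-ins for y_test and the classifier's predictions
class_names = ["SBc(r)", "Sc+t", "Ser"]
y_true = ["Ser", "Ser", "Ser", "Sc+t", "Sc+t", "SBc(r)"]
y_pred = ["Ser", "Ser", "Sc+t", "Sc+t", "Ser", "SBc(r)"]

accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in class_names order
cm = confusion_matrix(y_true, y_pred, labels=class_names)

print(f"accuracy: {accuracy:.3f}")
print(classification_report(y_true, y_pred, labels=class_names, zero_division=0))

# Draw the matrix as an annotated heatmap
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
display.plot(cmap="Blues")
```

With many gz2class values the matrix becomes large, so it may help to pass only the most frequent classes in `labels` or to group rare classes before plotting.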