In [1]:
from pathlib import Path 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA
from skimage.measure import shannon_entropy,label,perimeter,regionprops
from skimage.filters import sobel, threshold_otsu
import pandas as pd
import numpy as np
from PIL import Image, UnidentifiedImageError
import sys, os
sys.path.append('/home/benr/ACT/CW2/py')
from functions import split_data_rf,get_data,galaxy_type 
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score

  from pkg_resources import get_distribution
2025-11-26 13:49:57.240534: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-26 13:49:57.510892: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-26 13:49:58.900443: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


# Machine Learning for Galaxy Classification  
This project aims to test how different machine learning methods will perform on a galaxy classifaction task. 

## Qestion 1, Traditional ML method
To test a traditional ML method I will use random forest to try and predict galaxy morphology based on features of the image and pca values.
Using the Galaxy10 Decals data set, we use a sample of ~17,000 image samples to classify galaxy morphology from images. 

Radom Forest classifiers can be extremely accurate. Random forest uses the 'votes' of many decision trees to make a prediction, which makes it a good technique for finding patterns through out a dataset. The contribution from many trees means that it is not so easily confused by noise and outliers. 

In [2]:

# get the data from functions.py and print galaxy type classes
x_images,y_labels = get_data()



/home/benr/.astroNN/datasets/Galaxy10_DECals.h5 was found!


The dataset uses the following galaxy morphology catagories

Galaxy10 dataset (17736 images)
- Class 0 (1081 images): Disturbed Galaxies
- Class 1 (1853 images): Merging Galaxies
- Class 2 (2645 images): Round Smooth Galaxies
- Class 3 (2027 images): In-between Round Smooth Galaxies
- Class 4 ( 334 images): Cigar Shaped Smooth Galaxies
- Class 5 (2043 images): Barred Spiral Galaxies
- Class 6 (1829 images): Unbarred Tight Spiral Galaxies
- Class 7 (2628 images): Unbarred Loose Spiral Galaxies
- Class 8 (1423 images): Edge-on Galaxies without Bulge
- Class 9 (1873 images): Edge-on Galaxies with Bulge

see dataset documentation here for more details 
https://astronn.readthedocs.io/en/latest/galaxy10.html


In [3]:
#define your random forest classifier 
rf = RandomForestClassifier(
        n_estimators=900, # number of trees
        max_depth=None, # how many times each tree can split (No limit)
        n_jobs= 1,
        random_state= 11
        
)

The next step is to create some features for to pass into the random forest model. 

* Eccentricity - measure of how elliptical the galaxy is
                 see https://scikit-image.org/docs/0.25.x/auto_examples/segmentation/plot_regionprops.html?utm_source=chatgpt.com
                 for details.
                 
* mean sobel edge magnitude - The sobel edges is a measure of the gradient
                              at each pixel, we apply this function and then take the mean. This helps distinguish between elepitcal (smooth) galaxies and disks or spirals that will have clearer edges. 

* perimeter - perimeter measures the perimiter of shapes in a binary
              images. This will be smaller in edge on galaxies and larger in the other types. 

* Area - similar logic to perimeter, to help distigues between edge on 
         and other types. 

* asymmetry - the asymmetry is measure by rotating the image by 90 
              degrees and then taking the average of the absolute value of the difference between the two images. 

As well as the above features we also include mean, stadard deviation,median and maximum of the mixal values. 

for details on skimage measures see
https://scikit-image.org/docs/0.25.x/api/skimage.measure.html#

In [4]:
#please note this is a lot of data and takes about 10 mins to run
# gather features of images 
x_features = []
for img in x_images:
    all_feats = []
    for col in range(3):
        i = img[:,:,col].astype(np.float64) / 255.0 
        #get binary version of image 
        th = threshold_otsu(i)
        bimg = i > th
        # eccentricity 
        lab = label(bimg)
        props = regionprops(lab)
        if len(props) > 0:
            e = props[0].eccentricity  
        else:
            e = 0.0

        edges = sobel(i)
        edg_mean = edges.mean()
     
        p = perimeter(bimg,neighborhood=6)
        A = bimg.sum() 

        rot180 = np.rot90(img, 2)
        asym = np.mean(np.abs(img - rot180))

        feats = np.array([np.mean(i), np.std(i),
                      np.max(i),np.median(i),
                      e,p,A,asym,
                      edg_mean,shannon_entropy(i)] 
                      )
        all_feats.append(feats)
    x_features.append(np.concatenate(all_feats))

x_features = np.array(x_features)
print(x_features.shape) 

(17736, 30)


Now that we have specific features of the images we will use Principle Componant Analysis (PCA) to reduce flattened image vectors to a smaller size whilst keeping the key information about the dataset. 

PCA works by calculating the covariance matrix and finding directions in which the dataset (images) vary the most. All values are then projected onto these new axes reducing the data to a chosen number of dimensions. 

see PCA guide here 
https://www.geeksforgeeks.org/maths/covariance-matrix/

In [5]:

#flatten dataset from [num images, 256,256] to [num images, 256**2] 

xflat = []
for i in range(len(x_images)):
    xflat.append(np.array(x_images[i].flatten()))

In [None]:
# split into training and test sets (see functions.py) 
xtrn_FT,xtst_FT,xtrn_FL,xtst_FL,ytrn,ytst = split_data_rf(x_features,xflat,y_labels,0.1)

: 

In [None]:
#Set up pca 
xtrn_FL = np.asarray(xtrn_FL, dtype=np.float32)
xtst_FL = np.asarray(xtst_FL, dtype=np.float32)

xtrn_FL = xtrn_FL.reshape(xtrn_FL.shape[0], -1)
xtst_FL = xtst_FL.reshape(xtst_FL.shape[0], -1)

print("Train flat shape:", xtrn_FL.shape)
print("Test  flat shape:", xtst_FL.shape)

# 2) Use IncrementalPCA with fewer components to reduce memory/time
pca = IncrementalPCA(n_components=50, batch_size=64)

# 3) Fit on training data
pca.fit(xtrn_FL)

# 4) Transform train and test sets
xpca_trn = pca.transform(xtrn_FL)
xpca_tst = pca.transform(xtst_FL)

print("PCA train shape:", xpca_trn.shape)
print("PCA test shape:", xpca_tst.shape)

Train flat shape: (15962, 196608)
Test  flat shape: (1774, 196608)


In [None]:
# join pca features with designed featers 
xtrn = np.hstack([xtrn_FT,xpca]) 

In [None]:
# train the random forest model 
rf.fit(xtrn,ytrn)


KeyboardInterrupt: 

In [None]:
# join the two feature test sets 
xpca_tst = pca.transform(xtst_FL)
xtst = np.hstack([xtst_FT,xpca_tst])

In [None]:
# make prediction based on test sets 
ypred = rf.predict(xtst)
print(f'accuracy score = {accuracy_score(ytst,ypred)}') 

accuracy score = 0.5207272727272727


As the results show, the random forest model only made correct predictions for half of the galaxies in the dataset. This is because the 256x256 pixel images mean that the model has a huge amount of input features; even after PCA. Each tree is trying to make decisions on all of these features which makes it difficult to find more general patterns in the image set. 

Another issue is that random forest does not have the ability to properly assess spacial features, it is not ablle to find links between certain edges and brightnesses in the same way a more sophisticated convolutional neural network might be able to. 