# COGS 118B - Final Project

# Understanding the Complexity of Galaxy Imaging Data

## Group members

- Kian Chou
- Jalen Li
- Hana Tse
- Arturo Sorensen
- Tianran Bao

# Abstract 
In this project, we investigate the effectiveness of unsupervised machine learning algorithms at clustering different galaxy shapes using images. To achieve this, we used a dataset containing images of galaxies and various algorithms such as K-means, GMM, spectral clustering, and DBSCAN to cluster galaxies that had similar shapes. We also computed evaluation metrics such as the silhouette score and the Adjusted Rand Index to assess the performance of the resulting clusters and their corresponding algorithms. Overall, we saw low silhouette and Adjusted Rand scores for each algorithm’s clusters, leading us to believe that unsupervised algorithms are not suitable for this task. However, we believe that the conclusions of our project have been influenced by limitations such as computational load and data complexity, so this problem could be investigated further with more resources.

# Background

Galaxies are fascinating astronomical objects that are extremely numerous and diverse - it’s estimated that there are billions or even possibly trillions of galaxies in the observable universe, each with unique properties/features. They vary in size from being ultra massive to relatively small, with a multitude of shapes such as elliptical (appears as a concentrated elliptic blob), spiral (has a circular twist and arms), and irregular (sparse and not as structured). Some are young and actively growing whereas others are ancient and dormant. As mentioned in our proposed solution, color and luminosity can vary as well. Luminosity can be affected by its age, metallicity, component stars, dust content, distance from the observer, etc. On the other hand, color not only can be affected by composition, but also by redshift; as the universe expands at an accelerated rate, the objects that are farther from us recede faster, which creates a phenomenon called redshift in which the light’s wavelength is stretched (since the universe itself is expanding) and appears redder.

The diversity galaxies offer makes them excellent subjects for image classification and exploratory study. Because of this, the field of galaxy classification through machine learning has not only seen advancements in recent years, but has also greatly helped researchers study these astronomical bodies. Galaxy Zoo is a project that provided visual morphological classifications, that is, categorizations of galaxies by their shape and structure (e.g. spiral, elliptical, irregular), for galaxy images from the Sloan Digital Sky Survey (SDSS)<a name="lintottnote"></a>[<sup>[1]</sup>](#lintott). The first Galaxy Zoo paper’s aim was to distinguish between the two morphological classes pertaining to massive systems, namely spirals and early-type systems<a name="lintottnote"></a>[<sup>[1]</sup>](#lintott).

Galaxy Zoo 2<a name="zoo2"></a>[<sup>[2]</sup>](#zoo2), which is the data we are using, is a succeeding project that considers more detailed morphological features such as "galactic bars, spiral arm and pitch angle, bulges, edge-on galaxies, relative ellipticities, and many others"<a name="zoosite"></a>[<sup>[3]</sup>](#zoosite). Here, galactic bars are the elongated structures composed of stars and interstellar matter that extend from the center of spiral galaxies; pitch angle measures the tightness of spiral arms around a galaxy’s center; bulges are the tightly packed collections of stars at the centers of galaxies; edge-on means observed from the side, which would make the bulge more visible; and relative ellipticity is the galaxy’s deviation from a circular shape.

There has been much research dedicated to classifying galaxy image data with convolutional neural networks (CNNs), including Huertas et al. using the Cosmic Assembly Near-Infrared Deep Extragalactic Legacy Survey (CANDELS) data to classify over a range of redshifts<a name="huertas"></a>[<sup>[4]</sup>](#huertas). In a similar vein, we will be working with galaxy image data, but it’s important to note that this isn’t the only way. For example, Krakowski et al. applied a support vector machine (SVM) to tabular data from the WISE × SuperCOSMOS catalog<a name="krakowski"></a>[<sup>[5]</sup>](#krakowski).

# Problem Statement

Galaxies have several different visual features, including shape, color, luminosity, and concentration. There are a few different ways that galaxies are classified, but the system we will be sticking to uses their morphology. There are 3 main categories of galaxies: elliptical, spiral, and irregular. 

The problem that we aim to solve is seeing if galaxies can be clustered based on their visual features alone. For this, we will analyze the galaxy images in our dataset and use unsupervised learning algorithms to attempt to cluster galaxies by their morphology classification.

We will be using PCA and UMAP for dimensionality reduction to transform the high-dimensional image data into a lower-dimensional space for easier analysis. Then we will use multiple unsupervised algorithms, including K-means, DBSCAN, GMM, spectral clustering, and PCA reconstruction, to perform clustering on the image data.

This problem can be measured in the following two ways. We will first use the crowd-sourced labels in our dataset and separate a part of our data as test/training data to evaluate the performance of our models. We will also use evaluation metrics that do not require the labels to measure the performance of our clustering algorithms.

# Data

## Data Sources
The raw galaxy image data can be found at this link: https://zenodo.org/records/3565489#.Y3vFKS-l0eY

This dataset has a total of 355,990 centered galaxy images and the unique ID associated with each of them.

The galaxy image labels can be found at this link: https://data.galaxyzoo.org/#section-7

This dataset contains 239,695 image-label pairs, which is fewer than the number of images present at the Zenodo link. 

## Data Preprocessing
To combine the image dataset and label dataset, we paired up the image file name and label using a unique identifier assigned to each galaxy. Through this, we ended up with a combined dataset made up of 239,695 galaxy images and labels. One observation in this dataset was made up of a grainy color image of a target galaxy centered in a 424x424 pixel frame and a corresponding label with galaxy shape and classification. This image can also contain other, smaller galaxies which could prove to be a difficult issue to overcome.  
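The pairing step described above amounts to an inner join on the galaxy ID, which is why only the labeled subset of the images survives. A minimal sketch follows; the file layout (`images/<id>.jpg`) and the dictionary of labels are hypothetical stand-ins, not the actual structure used in our notebook.

```python
from pathlib import Path


def pair_images_with_labels(image_paths, labels_by_id):
    """Keep only the images whose ID (the file stem) has a label.

    image_paths: iterable of paths such as "images/1234.jpg" (hypothetical layout).
    labels_by_id: dict mapping an integer galaxy ID to its class label.
    Returns a list of (path, label) tuples sorted by galaxy ID.
    """
    matched = []
    for path in image_paths:
        stem = Path(path).stem
        if not stem.isdigit():
            continue  # skip files that do not look like "<id>.jpg"
        galaxy_id = int(stem)
        if galaxy_id in labels_by_id:
            matched.append((galaxy_id, str(path), labels_by_id[galaxy_id]))
    matched.sort()
    return [(path, label) for _, path, label in matched]


# Toy example: image 7 has no label and label 9 has no image, so both drop out.
example_paths = ["images/3.jpg", "images/1.jpg", "images/7.jpg", "images/notes.txt"]
example_labels = {1: "E", 3: "SB", 9: "S"}
paired = pair_images_with_labels(example_paths, example_labels)
```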

In the following cell, a few samples of the raw images and their filenames are displayed.

![unprocessed_images_sample.png](report_images/unprocessed_images_sample.png)

Our overall goal is to be able to classify the shape of a provided 50x50 image of a galaxy. The galaxies can be one of four shapes: elliptical (E), spiral (S), spiral barred (SB), and amorphous (A). The E and S types occur with about equal frequency (around 95,000 images each), the SB type is less common (around 45,000 images), and the amorphous type is very rare, with only 544 observations. 

A notebook with the code used to preprocess our images can be found at this Github link: https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/Data_Cleaning.ipynb

To preprocess the image data, we first cropped the images so they mainly contain the galaxy we want to analyze, then grayscaled them, and finally scaled them down to reduce the computation necessary to process the data. To crop the data, we took the center 200x200 region of each image. Looking over multiple images, we found that this center region usually contained the entire galaxy that was being analyzed. We then grayscaled the images since color data is not as important for analyzing the overall shape of the galaxy. Finally, we scaled the 200x200 image down to a 50x50 image so that the data fit within our computing power and processing would not take too long. 

The images were then flattened into an array of size 2,500 and put into a matrix with all other observations. Each value in this matrix was normalized from a scale of 0 to 255 (representing pixel intensity) to a scale of 0 to 1 by dividing each value in the matrix by 255.
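The crop, grayscale, downscale, flatten, and normalize steps can be sketched with NumPy alone. This is an illustration under assumptions rather than the exact notebook code: grayscale is taken as a plain channel average and downscaling is done by 4x4 block averaging, either of which may differ from the resampling we actually used.

```python
import numpy as np


def preprocess(image, crop=200, out=50):
    """Center-crop, grayscale, downscale, flatten, and scale to [0, 1].

    image: uint8 array of shape (H, W, 3), e.g. (424, 424, 3).
    """
    height, width = image.shape[:2]
    top, left = (height - crop) // 2, (width - crop) // 2
    cropped = image[top:top + crop, left:left + crop].astype(np.float64)

    gray = cropped.mean(axis=2)  # assumption: simple average of the color channels

    factor = crop // out  # 200 // 50 = 4
    # Assumption: downscale by averaging each factor x factor block of pixels.
    small = gray.reshape(out, factor, out, factor).mean(axis=(1, 3))

    return small.reshape(-1) / 255.0  # vector of length out * out


rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(424, 424, 3), dtype=np.uint8)  # stand-in image
vector = preprocess(fake_image)
```

Stacking one such vector per image gives the observation matrix with 2,500 columns described above.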

Our final, processed data therefore has a total of 239,695 observations, each consisting of a 50x50 image and a label for the overall shape of the galaxy. Each observation thus has 2,500 features.

In the following cell, a few samples of the processed images and their filenames are displayed.

![processed_images_sample.png](report_images/processed_images_sample.png)

# Proposed Solution

The principal problem we aim to tackle for this project is whether labeling galaxy shapes through the differences in their images is feasible.

The different galaxy shapes in images involve variance in several properties, such as luminosity and distribution of pixels. We plan to cast image data into a uniform format by grayscaling the colors, as well as reducing resolution by cropping to the center, removing uninformative black space in the process. From there, a normalized set of vectors representing the grayscale, centered image (1 x 2500) can be run through dimensionality reduction methods, such as UMAP and PCA. Once a large amount of variance can be explained in a lower amount of dimensions, our projected data will enter various clustering algorithms such as K-means, DBSCAN, Spectral, and GMMs. We will then evaluate the performance of each clustering algorithm’s results to answer our problem.

# Evaluation Metrics

The Galaxy Zoo project, from which our data is obtained, has crowd-sourced labels for each galaxy. These labels aggregate the classifications of many volunteers (the first Galaxy Zoo project drew on roughly 100,000 participants), which reduces individual bias and ultimately provides a more robust, confident classification (https://ui.adsabs.harvard.edu/abs/2008MNRAS.389.1179L/abstract). 

There are a few metrics for the goodness-of-fit of clustering algorithms. For one, we can score our clusters using the silhouette score, which ranges from -1 to 1: values near -1 indicate poorly assigned points, values near 0 indicate overlapping clusters, and values near 1 indicate well-assigned, well-separated clusters. Since our dataset also contains reference labels, we can use the Adjusted Rand Index (ARI) to score our clusters. The ARI measures how well the cluster assignments agree with the labels after correcting for chance; it ranges from -0.5 to 1, where values near 0 correspond to random assignment and 1 is an exact match. 
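To make the two metrics concrete, here is a small from-scratch sketch of both. The function names are our own for illustration, and the silhouette version builds a full pairwise distance matrix, so it only suits small samples; in practice a library implementation would be the usual choice.

```python
from collections import Counter
from math import comb

import numpy as np


def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


def mean_silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distances (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton clusters score 0 by convention
        a = dists[i, same].sum() / (same.sum() - 1)  # mean distance within own cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())
```

Note that the ARI ignores the names of the clusters: swapping cluster IDs leaves the score unchanged, which is what makes it usable for comparing cluster assignments to class labels.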

# Results

## Galaxy Imaging: A Very Complex Problem

Throughout the course of this project, we began to uncover how complex a task classifying images in the GalaxyZoo 2 dataset was. Across UMAP, PCA, and KPCA, each attempt uncovered more facets of the dataset that the technique struggled with. Although we were not able to come up with a conclusive process for clustering or classifying these images, with each technique we tried we were able to get a deeper understanding of what makes this dataset difficult and which future techniques could be employed to classify the images in this dataset.

### UMAP
*The results discussed in this section can be found in the following notebook:*
- UMAP and Reconstructions (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/UMAP_and_Reconstruction.ipynb)

A dimensionality reduction technique we tried to use was UMAP. From the flattened 50x50 images, we tried a couple of different iterations of UMAP with different parameters but generally ended up with the same results each time. Below are the plotted results from using `n_neighbors=25` and `min_dist=0.05`.

![UMAP_plot.png](report_images/UMAP_plot.png)

From this plot, we see that there is little to get from these results. There aren’t any well-defined clusters where certain galaxy types group up. Therefore, we decided to stick with the results from PCA to use as our dimensionally reduced data.

### PCA: Linearly Inseparable Data
*The results discussed in this section can be found in the following notebook:*
- Data Cleaning and PCA notebook: (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/Data_Cleaning_and_PCA.ipynb)   

To reduce the computational load of the data, we performed PCA on all of the flattened 50x50 grayscale images to evaluate which components contributed the most variance in the dataset. We plotted the variance explained by each component and cut at the elbow to choose the number of principal components that explained most of the variance between images. 

![PCA_elbow.png](report_images/PCA_elbow.png)

<p style="text-align: center;"><i>A graph outlining how much variance is explained by the first 200 principal components. The red dashed line is drawn at principal component 35.</i></p>

Further analysis was then done on these 35 principal components, including UMAP projection and K-means, DBSCAN, GMM, and spectral clustering. 
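The explained-variance curve behind the elbow plot can be computed from a singular value decomposition of the centered data. The sketch below uses a small synthetic matrix as a stand-in for the 239,695 x 2,500 image matrix; the helper names and the data are assumptions for illustration, and the elbow itself was chosen by eye in our analysis.

```python
import numpy as np


def pca_explained_variance(X):
    """Return (components, explained_variance_ratio) via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    _, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)
    variance = singular_values ** 2 / (len(X) - 1)
    return components, variance / variance.sum()


def project(X, components, n_components):
    """Project the centered data onto the first n_components principal axes."""
    return (X - X.mean(axis=0)) @ components[:n_components].T


rng = np.random.default_rng(0)
# Synthetic stand-in: 500 samples, 100 features, variance concentrated in 5 directions.
latent = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 100))
X = latent + 0.05 * rng.normal(size=(500, 100))

components, ratio = pca_explained_variance(X)
cumulative = np.cumsum(ratio)  # plotting `ratio` against its index gives the elbow plot
X_reduced = project(X, components, n_components=5)
```

On the real data, the same projection with `n_components=35` would give the reduced matrix passed to the clustering algorithms.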

#### K-Means:
*The results discussed in this section can be found in the following notebook:*
- K-Means Notebook (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/K-means.ipynb).

Although K-means can work well when we know how many clusters we want, K-means is driven by the assumption that the clusters are spherical in shape and of equal size. After running K-means on the PCA and UMAP data, we can see that the data does not fulfill either of these conditions, which could explain the low resulting ARI scores of -2.98e-05 and 0.0003 respectively.
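The reason for this assumption is visible in the algorithm itself: every point is assigned to its nearest centroid by Euclidean distance, so the objective (the sum of squared distances to centroids) favors compact, similarly sized, roughly spherical groups. A minimal sketch of Lloyd's algorithm follows; the deterministic farthest-point initialization is an assumption made here for reproducibility, not the initialization used in our notebook.

```python
import numpy as np


def farthest_point_init(X, k):
    """Start at X[0], then repeatedly take the point farthest from all chosen centroids."""
    centroids = [X[0]]
    for _ in range(k - 1):
        dist_to_nearest = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2), axis=1
        )
        centroids.append(X[dist_to_nearest.argmax()])
    return np.array(centroids)


def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: mean of each cluster (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = float(((X - centroids[labels]) ** 2).sum())  # the K-means objective
    return labels, centroids, inertia


rng = np.random.default_rng(0)
# Two compact, well-separated blobs: the easy case that K-means is designed for.
blobs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(100, 2)),
])
labels, centroids, inertia = kmeans(blobs, k=2)
```

On data like this the two blobs are recovered; when the classes overlap heavily, as ours appear to, the same procedure still returns K groups, but they need not correspond to the labels.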

![K-means_plot.png](report_images/K-Means_plot.png)
<p style="text-align: center;"><i>This plot shows the resulting clusters from our K-means on PCA data. We can see that the data does not suit the aforementioned assumptions.</i></p>

#### DBSCAN and HDBSCAN:
*The results discussed in this section can be found in the following notebooks:*
- K-means, GMM, HDBSCAN, Hierarchical Clustering notebook (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/PCA_KMeans_GMM_HDBSCAN_HierarchicalFail_UMAP.ipynb)
- UMAP DBSCAN notebook: (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/PCA_UMAP_DBSCAN.ipynb)

We also attempted to use DBSCAN and HDBSCAN to see whether they would produce a good 2-D clustering of the PCA data, but both resulted in unclear clusterings. 

![HDBSCAN_results.png](report_images/HDBSCAN_results.png)
<p style="text-align: center;"><i>The result from our HDBSCAN clustering.</i></p>

#### GMM:
*The results discussed in this section can be found in the following notebook:*
- GMM and Spectral Clustering notebook: (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/GMMSpectral.ipynb)

Gaussian Mixture Models assume that the data are drawn from a mixture of multivariate normal (Gaussian) distributions, whose covariance matrices can be constrained in different ways. We ran GMM clustering on the PCA data and scored it using the Adjusted Rand Index, which was described in the Evaluation Metrics section.

![GMM_results_table.png](report_images/GMM_results_table.PNG)

<p style="text-align: center;"><i>The mean ARI score for each covariance type allowed in GMMs is close to 0. Changing the covariance does not affect results meaningfully.</i></p>

This result tells us that the data may not be Gaussian distributed, or that the data may be too dense to properly separate out.
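For reference, a comparison across covariance types can be sketched with scikit-learn's `GaussianMixture` and `adjusted_rand_score`. The data here are synthetic, well-separated blobs (an assumption standing in for the 35 PCA components), so the ARI comes out high for every covariance type; that is the opposite of what we observed on the galaxy data and serves only to show the shape of the experiment.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic stand-in for the PCA-reduced data: two well-separated Gaussian blobs.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 5)),
    rng.normal(loc=8.0, scale=1.0, size=(200, 5)),
])
y = np.repeat([0, 1], 200)

ari_by_covariance = {}
for covariance_type in ("full", "tied", "diag", "spherical"):
    gmm = GaussianMixture(n_components=2, covariance_type=covariance_type, random_state=0)
    predicted = gmm.fit_predict(X)
    ari_by_covariance[covariance_type] = adjusted_rand_score(y, predicted)
```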

#### Spectral Clustering:
*The results discussed in this section can be found in the following notebook:*
- GMM and Spectral Clustering notebook: (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/GMMSpectral.ipynb)

Spectral clustering works by creating similarity matrices of the data and using an approximate optimal cut on a Laplacian matrix to estimate similar clusters. In order to examine this clustering technique without massive time and memory overhead, we took random samples of our data, fitted a spectral clustering algorithm, and scored each clustering assignment using the Adjusted Rand Index.

![Spectral_results_table.png](report_images/Spectral_results_table.PNG)

<p style="text-align: center;"><i>The ARI score table for spectral clustering. Each row is a random subset of our data and the columns indicate how many nearest neighbors were passed to the algorithm.</i></p>

Since spectral clustering relies heavily on whether points can be deemed similar or not, one interpretation of the poor results is that the image vectors are too similar to one another. If the data points are too close in feature space, then local clusters are created with no separation of labels.
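The mechanics can be sketched in NumPy: build a nearest-neighbor similarity graph, form the normalized graph Laplacian, take its bottom eigenvectors as a new embedding, and run K-means there. This is a simplified illustration on synthetic blobs, with hypothetical helper names and dense matrices, not the implementation or parameters used in our notebook.

```python
import numpy as np


def knn_affinity(X, n_neighbors):
    """Symmetric 0/1 k-nearest-neighbor graph as a dense matrix."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    W = np.zeros((len(X), len(X)))
    W[np.repeat(np.arange(len(X)), n_neighbors), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)  # keep an edge if either point lists the other


def spectral_embedding(W, n_clusters):
    """Bottom eigenvectors of the normalized Laplacian, with rows scaled to unit length."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues are returned in ascending order
    embedding = eigenvectors[:, :n_clusters]
    return embedding / np.linalg.norm(embedding, axis=1, keepdims=True)


def kmeans_labels(Z, k, n_iter=50):
    """Small Lloyd's loop with deterministic farthest-point initialization."""
    centroids = [Z[0]]
    for _ in range(k - 1):
        gaps = np.min(np.linalg.norm(Z[:, None, :] - np.array(centroids)[None], axis=2), axis=1)
        centroids.append(Z[gaps.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.linalg.norm(Z[:, None, :] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels


rng = np.random.default_rng(0)
blobs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(60, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(60, 2)),
])
W = knn_affinity(blobs, n_neighbors=15)
spectral_labels = kmeans_labels(spectral_embedding(W, n_clusters=2), k=2)
```

When the neighbor graph links points of different classes as readily as points of the same class, the eigenvectors carry no class structure, which is consistent with the near-zero ARI scores in the table.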


### KPCA: Non-Linearly Inseparable Data
*The results discussed in this section can be found in the following notebook:*
- Kernel PCA and various clusterings notebook: (https://github.com/TianranBaoUCSD/COGS118_Final_Project_Group_XXX/blob/main/KernelPCA.ipynb)

All of the previous clustering algorithms were fitted on data reduced via PCA. However, PCA is purely a linear operator. If the data contains some sort of nonlinear structure, PCA would not be able to capture the variance of the data in a meaningful way. Kernel-based PCA instead implicitly maps the data into a higher-dimensional feature space and then performs PCA there to try to capture that variance.

For this KPCA, we took 1/12th of our total dataset and picked the first n principal components that would explain 90% of the variance. This resulted in us using the first 66 principal components of the KPCA for the `radial` basis, and the first 62 principal components for the `polynomial` basis. 
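A NumPy sketch of this selection for the radial (RBF) kernel is shown below. It assumes that "variance explained" is measured as the share of the eigenvalues of the centered kernel matrix, and the `gamma` value and synthetic data are placeholders; the kernel parameters used in our notebook are not restated here.

```python
import numpy as np


def rbf_kernel_pca(X, gamma, variance_threshold=0.90):
    """Kernel PCA with an RBF kernel, keeping the fewest components that reach the threshold."""
    squared_norms = np.sum(X ** 2, axis=1)
    sq_dists = squared_norms[:, None] + squared_norms[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))

    # Center the kernel matrix, which centers the data in the implicit feature space.
    n = len(X)
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]  # descending order
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # remove tiny negative round-off

    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    n_components = int(np.searchsorted(cumulative, variance_threshold) + 1)

    # Coordinates of the training points: eigenvectors scaled by sqrt(eigenvalue).
    projected = eigenvectors[:, :n_components] * np.sqrt(eigenvalues[:n_components])
    return projected, n_components, cumulative


rng = np.random.default_rng(0)
sample = rng.normal(size=(80, 10))  # stand-in for the 1/12th subsample
projected, n_components, cumulative = rbf_kernel_pca(sample, gamma=0.1)
```

Because the kernel matrix has one row and column per sample, its size grows quadratically with the number of images, which is one reason for working on a subsample.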

We attempted KernelPCA with both a `radial` basis and a `polynomial` basis, then clustered using GMM and DBSCAN to see our results.

![Kernel_GMM_results_table.png](report_images/Kernel_GMM_results_table.PNG)

<p style="text-align: center;"><i>One example of our results - first using a radial basis, then applying GMM clustering to the new data.</i></p>

Despite KernelPCA offering a more powerful dimensionality reduction that can fit nonlinear data, we got similarly poor results over all of our trials and parameters. This led us to hypothesize further about the structure of our data, and whether it was possible to cleanly separate the various galaxy shapes, given the data and algorithms at our disposal.

# Discussion

### Interpreting the Result 

While progressing through this project, we began to uncover how complex classifying images in the GalaxyZoo 2 dataset was. After PCA and KPCA were run on the image data, each clustering technique we tried uncovered more facets of the dataset that the strategy struggled with. Although we were not able to come up with a conclusive process for clustering or classifying these images, with each technique we tried we were able to get a deeper understanding of what makes this dataset difficult and which future techniques could be employed to classify the images.

We began by looking at the processed images, and realized how subtle the differences between images of galaxies in different categories were for humans, let alone for an algorithm.

![similar_galaxy_examples.png](report_images/similar_galaxy_examples.png)

<p style="text-align: center;"><i>From left to right: an elliptical (E) galaxy, a spiral (S) galaxy, and a spiral barred (SB) galaxy.</i></p>

This led to some concerns as to how well these small defining characteristics would be represented in a dimensionality reduction, as we would not have the time nor computing power to process all 2,500 variables for each image. We also found that most of the images in this dataset had non-target galaxies present in the same frame as our target galaxy. This, combined with the subtlety of the features that differentiate the galaxies from each other, led to a difficult situation where both the noise and the signal come from variance between different samples. Despite this, we hoped that with a lot of images, the variance from the irrelevant galaxies would not have too much impact on the classification. We attempted to use UMAP to project the high-dimensional data down to a 2-D visualization, which resulted in a very cluttered graph with no obvious clusterings. We took this result as evidence to continue the rest of our analysis using high-dimensional techniques.  

To run our analysis in higher dimensions with the time and computing power we had at our disposal, we ran PCA on our flattened images and isolated the principal components that explained most of the variance between images. This resulted in the first 35 principal components being identified as the components with the most variance. To see if we could separate the data in a way that could be clustered and visualized in 2-D, we ran K-means, DBSCAN, and HDBSCAN using the first 2 principal components identified by our PCA to see if any of them would work. Similarly to the UMAP graph, this also resulted in a series of cluttered graphs without any obvious clusterings. 

Seeing as the data was still too cluttered in a 2-D space even after identifying the components with the most variance, we used GMM and spectral clustering and calculated the adjusted Rand index for each to see if clustering in higher dimensions would be effective. The adjusted Rand indices for the GMM were close to zero for all covariance types, suggesting that the data were still too similar to be separated, even with 35 principal components. Spectral clustering yielded a similar outcome, with near-zero adjusted Rand indices for all parameters passed through. The failure of both clusterings suggests that even in higher dimensions the data points were very similar, and thus very dense in high-dimensional space. 

The overall failure of clustering on the PCA data suggested that the data was not linearly separable, so we moved on to KPCA to see if the data was non-linearly separable. Components were chosen using a variance threshold of 90% (the first 66 for the radial kernel and the first 62 for the polynomial kernel), and GMM and DBSCAN were then run on these components. Both GMM and DBSCAN had poor adjusted Rand indices across both the radial KPCA and the polynomial KPCA. This suggests that the data we were trying to analyze could not be separated non-linearly either, which means that the images may just be very similar even in the original 2,500-dimensional space. It is possible that a more complex kernel could result in better separation of the data, but we had neither the computing power nor the time to explore this further. 


### Limitations

One limitation regarding the data was the number of images per galaxy shape. The “E” and “S” shapes had around 95,000 corresponding images each, but the “SB” shape had around 45,000 images and the “A” shape only had 544 images. The uneven distribution of images affected the algorithms we used, as it was more difficult to cluster and represent the shapes with less data, and the imbalance could also have introduced bias towards the shapes with more representation. 

On a larger scale, one of the major limitations for our project was computational load. The dataset that we used consisted of more than 230,000 424x424 images and their labels, so if we did not reduce the features in our data, it would have taken excessive time and computational load to run our algorithms. As a result we performed dimensionality reduction in order to reduce the computational costs. However, by reducing the dimensions of our data with PCA, we were working with images that had potentially lost relevant information regarding their shapes. Ideally, we would be able to run our algorithms on the data with all of their original features for more accurate clustering performance, but this was not possible given the computational constraints. 

Another larger limitation for our project was time. We noticed that a fair amount of our data contained noise due to extraneous stars and non-target galaxies in the background of the images, but we did not have time to investigate how and to what extent the noise affected our PCA results. The noise in our data could have added variance that was not related to the structure of the galaxy shapes, which in turn could have affected the calculation of the principal components and the models using the resulting PCA data.
   

### Ethics & Privacy

The dataset that we used contains images and labels that are provided publicly by Galaxy Zoo. Since this data is public and does not pertain to any individuals, it does not present any privacy concerns. Similarly, the aim of our project does not pose any privacy concerns, as our focus was to classify galaxy types using this data. 

However, there is an ethical concern regarding misinformation. The labels for the images in our dataset are crowdsourced, which has potential for bias. The volunteers that contribute to the labels come from various backgrounds with different levels of experience, so different people might label certain images differently. As a result, there is a risk that the data might not be entirely accurate. In turn, the results of our model may not be accurate either. Therefore, people who view our project should keep in mind that our model might not classify each image correctly and should consider using other sources to verify information that is presented. 

### Conclusion 

From the various unsupervised learning techniques we attempted to use throughout the course of our project, we learned more and more about the complexity of the data that we were working with. The unique combination of subtle defining characteristics for galaxy shape and the inconsistent appearance of irrelevant galaxies in the target frame led to this dataset being particularly difficult to classify with the methods we learned. Because of how similar the data is, even in high dimensions, it is difficult to separate the data using either linear or non-linear processes.

Given these results from this report and seeing that researchers are exploring using neural networks to automatically classify galaxies, a promising avenue to investigate would be using supervised machine learning algorithms to try and classify the galaxy images. Given that all of the images are labeled in this dataset and that algorithms like neural networks are able to pick up on subtle patterns that differentiate between categories, applying a neural network could potentially yield better results than the ones found in this report. 

# Footnotes
<a name="lintottnote"></a>1.[^](#lintott): Lintott, Chris J., et al. (Sept 2008) Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. *Monthly Notices of the Royal Astronomical Society*. https://ui.adsabs.harvard.edu/abs/2008MNRAS.389.1179L<br>
<a name="zoo2"></a>2.[^](#zoo2): Willett, Kyle W., et al. (Nov 2013) Galaxy Zoo 2: Images from Original Sample. *Monthly Notices of the Royal Astronomical Society*. https://zenodo.org/records/3565489<br>
<a name="zoosite"></a>3.[^](#zoosite): Galaxy Zoo 2, https://data.galaxyzoo.org/#section-7<br>
<a name="huertas"></a>4.[^](#huertas): Huertas, M., et al. (Nov 2015) A Catalog of Visual-Like Morphologies in the 5 CANDELS Fields Using Deep Learning. *The Astrophysical Journal*. https://iopscience.iop.org/article/10.1088/0067-0049/221/1/8/pdf<br>
<a name="krakowski"></a>5.[^](#krakowski): Krakowski, T., et al. (Aug 2016) Machine-learning identification of galaxies in the WISE x SuperCOSMOS all-sky catalogue. *Astronomy & Astrophysics*. https://www.aanda.org/articles/aa/pdf/2016/12/aa29165-16.pdf<br>
