# COGS118A (WI23) - Group045 - Project Proposal

# Team Members

- [Tej Nair](mailto:tnair@ucsd.edu)
- [Angel Olivas](mailto:aaolivas@ucsd.edu)
- [Stanley Sisson](mailto:sfsisson@ucsd.edu)
- [Sang Tran](mailto:stt008@ucsd.edu)

# Abstract 
We will attempt to classify galaxies' morphological shapes using spectral 
data from the Mapping Nearby Galaxies at APO (MaNGA) survey and classification data from Galaxy Zoo 2
by developing and evaluating 
a supervised machine learning model. 
Instead of using images of resolved galaxies as direct input to techniques such as 
Convolutional Neural Networks (CNNs), we will use
one-dimensional spectral data (which is very information-dense) to train classifiers 
with various hyperparameters, compare them on several metrics, and evaluate the strengths and weaknesses of each.

# Background

### Galaxies?
When you look up into the sky, there are thousands of stars that appear as 
pinpricks of light (assuming you are in a dark place). The vast majority 
of the stars visible to us are in our nearest neighborhood 
of the galaxy: within a couple hundred light-years of us 
[\[A\]](https://physics.stackexchange.com/a/164021), and make up less than 
a hundredth of a percent of the stars in our galaxy (actually, it's less than a 
thousandth of that, even if you include all stars visible throughout the year). 
Beyond these stars, and even beyond our galaxy, are billions of other 
galaxies that are similarly vast to our own Milky Way. The Hubble Space 
Telescope famously took a long exposure of one of the darkest areas of the 
sky, and produced 
[a beautiful image](https://web.archive.org/web/20230222000000/https://www.nasa.gov/mission_pages/hubble/servicing/SM4/multimedia/wfpc/deep-field.html) 
showing a great many speckles, almost all of which are very distant galaxies. 

These galaxies can be divided into a few categories based on their visual 
shape, which allows similar galaxies to be grouped and studied for how they 
formed and how they evolve over the vast lifespan of the universe. There are two main 
categories: spiral and elliptical galaxies. These two categories can be 
further divided by how elliptical the galaxy is and how well defined the spiral arms are, 
resulting in a total of [11 subcategories](https://esahubble.org/images/heic9902o/).

### Prior Research:

Galaxies have previously been categorized en masse using low-resolution spectral 
information (i.e., broad color filters), with crowdsourced shape identification 
through the Galaxy Zoo project. Previous attempts have been made to train 
a classifier on the three-color images and shapes, with moderate to low success 
[\[B\]](https://youtu.be/H6UBjbio5-A). In this project we plan to apply these 
techniques to classify galaxies based on high-resolution spectral data from the 
MaNGA survey (Mapping Nearby Galaxies at APO), part of the Sloan Digital Sky Survey.

### Spectra?
Objects that emit light do so over a range of wavelengths, which humans 
perceive as different colors. Different objects produce, absorb, and reflect 
light at each wavelength in characteristic ways that carry 
information about the source.

For the most part there are two main production methods. First, when any object 
is heated up, it emits over a broad range of wavelengths; this is known 
as black-body radiation. Second, when gases are ionized or excited and their 
electrons fall back into lower energy states, they 
release photons at very specific wavelengths that correspond exactly 
to the energy released by the electron. These create spectral lines: a large 
concentration of light intensity in a small band of wavelengths.

However, this process also has its opposite: when light of that wavelength passes through 
a material that would emit at that wavelength, the material can absorb the 
light, reversing the process that creates spectral lines. These features are 
called absorption lines, and they, in addition to physical blocking of light (by 
dust or a solid rock, for example), are how most light is attenuated. Most spectra are 
primarily made up of these features, which can be used to determine the 
temperature and chemical makeup of the source that emitted the light. 
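
As a worked illustration of why lines sit at specific wavelengths (the constants and the hydrogen example are standard physics, not taken from the MaNGA documentation): a transition between two energy levels separated by $\Delta E$ emits or absorbs a photon of wavelength

$$\lambda = \frac{hc}{\Delta E}$$

where $h$ is Planck's constant and $c$ is the speed of light. For hydrogen's $n=3 \to n=2$ transition, $\Delta E \approx 1.89\ \text{eV}$ and $hc \approx 1240\ \text{eV nm}$, giving $\lambda \approx 656\ \text{nm}$, the red H-alpha line.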

Humans only have four kinds of light-sensing cells (three for color and one for 
luminance) that are sensitive to a very small portion of the spectrum, and 
thus can't see anything outside of that small range. Even within that 
range, having only three color sensors means we get little spectral information. 
Technology has provided us with a way to measure precise spectra: the 
spectrograph. A spectrograph measures the intensity of light at very precise 
wavelengths, allowing a lot of information to be gleaned about the 
source. 

The survey that we are using (MaNGA) has collected spectra for roughly ten 
thousand galaxies. These spectra are correlated with each galaxy's history and makeup, 
which are in turn correlated with the galaxy's shape. Thus we hope to use the spectral 
readings of these galaxies to make predictions about their shapes.

# Problem Statement

Galaxy classification is a difficult task due to the large number of galaxies, 
which all have varying sizes, shapes, and structures. There are more than 100 
billion galaxies in the observable universe, each possessing its own 
distinct form. Additionally, galaxies evolve over time, making it difficult 
to break them down into distinct groups. The images of galaxies we observe from 
Earth are often blurred because light is scattered and absorbed by dust and gas in 
space, making the classification process even more complex. Nevertheless, we will 
classify the shapes of the galaxies into three widely accepted categories: 
elliptical, spiral, or irregular. The well-known original Galaxy Zoo project 
classified galaxy shapes based on a confidence level defined 
by the number of votes. This method forces a trade-off between the number of 
galaxies that remain unclassified and the number that are incorrectly classified. 
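
The trade-off described above can be sketched with a small example. This is only an illustration: the vote fractions, the class order, and the threshold values below are made up for the sketch and are not taken from the Galaxy Zoo tables.

```python
import numpy as np

def labels_from_votes(vote_fractions, class_names, threshold=0.8):
    """Assign each galaxy its top-voted class, or "uncertain" below the threshold."""
    vote_fractions = np.asarray(vote_fractions, dtype=float)
    top = vote_fractions.argmax(axis=1)                 # index of the most-voted class
    confident = vote_fractions.max(axis=1) >= threshold # enough agreement to trust it?
    return [class_names[i] if ok else "uncertain" for i, ok in zip(top, confident)]

# Hypothetical vote fractions for three galaxies (rows sum to 1).
classes = ["elliptical", "spiral", "irregular"]
votes = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.45, 0.40, 0.15],
]

strict = labels_from_votes(votes, classes, threshold=0.8)
loose = labels_from_votes(votes, classes, threshold=0.4)
```

With the strict threshold the third galaxy stays unclassified; with the loose threshold it receives a label that barely more than two in five voters agreed with, so it is more likely to be wrong.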

# Data

We will be using two primary datasets as input for training our model: galaxy spectra from the MaNGA survey and galaxy classifications from Galaxy Zoo. A third dataset, the MaNGA Sloan Catalog, links the identifiers of each galaxy between the two surveys.

## MaNGA survey
The MaNGA survey contains a lot of data, including "CUBE"s of spectral data (2D images with a series of spectral measurements for each pixel), sampled on linear or logarithmic wavelength scales. Supporting series describe the wavelength and error bars for each measurement.

We plan to reduce this data by taking a single "averaged" spectral series for each galaxy (some form of average, be it mean, median, percentile, or something else), and use these for our features. (Or averaged groups of these, if that turns out to be too many features.)

Source: [MaNGA Page on SDSS](https://www.sdss4.org/dr17/manga/)
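
A minimal sketch of the reduction described above, assuming the cube is available as a NumPy array of shape `(n_wavelength, ny, nx)` with masked pixels set to NaN. Both the axis order and the masking convention are assumptions made for this sketch, and the array here is synthetic rather than real MaNGA data; `collapse_cube` and `bin_spectrum` are hypothetical helper names.

```python
import numpy as np

def collapse_cube(cube, how="median"):
    """Reduce a (n_wavelength, ny, nx) cube to one spectrum of length n_wavelength."""
    flat = cube.reshape(cube.shape[0], -1)  # one row per wavelength channel
    reducers = {"mean": np.nanmean, "median": np.nanmedian}
    return reducers[how](flat, axis=1)      # NaN-aware average over all pixels

def bin_spectrum(spectrum, bin_size):
    """Average adjacent wavelength channels to reduce the number of features."""
    n_bins = len(spectrum) // bin_size
    trimmed = spectrum[: n_bins * bin_size]  # drop leftover channels at the end
    return trimmed.reshape(n_bins, bin_size).mean(axis=1)

rng = np.random.default_rng(0)
cube = rng.normal(loc=1.0, scale=0.1, size=(100, 8, 8))  # synthetic stand-in
cube[:, 0, 0] = np.nan                                   # pretend one pixel is masked

spectrum = collapse_cube(cube, how="median")   # 100 features
features = bin_spectrum(spectrum, bin_size=10) # 10 features
```

The median is less sensitive to a few bright or noisy pixels than the mean, which is one reason to compare several averaging choices before settling on one.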

## Galaxy Zoo
Galaxy Zoo is a citizen-science (crowdsourced) astronomy project, now part of [Zooniverse](https://www.zooniverse.org/), in which a large quantity of images of distant galaxies (originally from the Sloan Digital Sky Survey, later also from the Hubble Space Telescope) were classified in bulk by volunteers without astronomy training. Galaxy Zoo 1 contains classifications of a large set of galaxies into "spiral", "elliptical", and "irregular", with percentages of votes for each. Galaxy Zoo 2 is the successor of Galaxy Zoo 1 and categorizes a more rigorously selected set of galaxies with more in-depth labels.

The Galaxy Zoo entries will be paired with the corresponding entries in the MaNGA survey to create training labels for the galaxy spectra.

Source: [Galaxy Zoo Data Website](https://data.galaxyzoo.org/)

## MaNGA Sloan Catalog

Lists of `MaNGA-ID`s and their designations in other catalogs, used to link MaNGA and Galaxy Zoo entries.

Source: [Sloan Catalog for MaNGA on SDSS](https://www.sdss4.org/dr17/manga/manga-data/catalogs/)
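
The linking step could look roughly like the following pandas sketch. All column names (`manga_id`, `zoo_id`, `label`, `spectrum_file`) and all values are hypothetical placeholders; the real catalogs use their own schemas and identifiers.

```python
import pandas as pd

# Hypothetical stand-ins for the three tables.
spectra = pd.DataFrame({
    "manga_id": ["m1", "m2", "m3"],
    "spectrum_file": ["a.fits", "b.fits", "c.fits"],
})
crossmatch = pd.DataFrame({
    "manga_id": ["m1", "m2", "m3"],
    "zoo_id": [101, 102, 103],
})
zoo = pd.DataFrame({
    "zoo_id": [101, 103, 104],
    "label": ["spiral", "elliptical", "spiral"],
})

# Inner joins keep only galaxies present in all three tables.
labeled = (
    spectra.merge(crossmatch, on="manga_id", how="inner")
           .merge(zoo, on="zoo_id", how="inner")
)
```

Here `m2` is dropped because its `zoo_id` has no classification, which is the kind of loss we will need to quantify when searching for overlap between the two surveys.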

# Proposed Solution

Our proposed solution is to train on one-dimensional data, obtained by 
reducing the two-dimensional spectral data from MaNGA, with labels matched 
from Galaxy Zoo Data Release 1.

We will first need to search for overlapping data between the two sources so that 
we can match the data from MaNGA with labels from Galaxy Zoo. After that, an 
appropriate method will be chosen to reduce the 2D data to 1D 
without losing much information.

# Evaluation Metrics

The results of the classifier models are labels of data points, which we will compare against true labels using a metric.

One metric is the classification accuracy score, which we can use to compare models. The accuracy score is expressed as $\frac{\text{number of correctly classified data points}}{\text{total number of data points}}$. Accuracy is a good metric for summarizing performance on all the data in a single number.

However, if the dataset has more of one class than another, we want a classifier that can still recognize the smaller classes. In that case, we can compute an $F_1$ score for each class (i.e. treating the multiclass results as a binary "one vs. rest" problem) and combine the scores in different ways. As described in [`sklearn`'s documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), we can use an averaging strategy of `macro` to weight each class the same, or we can use `weighted` to weight each class by its number of true instances. The `weighted` strategy is more useful in this case, since no class is vanishingly small and each therefore contributes a sizeable amount to the weighting.

The `macro` strategy is $C^{-1}\sum_{i=1}^C 2/(\text{recall}_i^{-1} + \text{precision}_i^{-1})$ where the index $i$ refers to metrics calculated from the $i$th class, treating the multiclass classifier as a "one vs. rest".  
  The `weighted` strategy is $n^{-1}\sum_{i=1}^C 2n_i/(\text{recall}_i^{-1} + \text{precision}_i^{-1})$ where $n$ is the number of data points and $n_i$ is the number of data points in the $i$th class.  
  Precision and recall are defined as $TP / (TP + FP)$ and $TP/(TP+FN)$ respectively, where $TP, FP, FN$ are true positives, false positives, and false negatives respectively.
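
To make the two averaging strategies concrete, here is a minimal NumPy sketch that follows the formulas above. The labels are invented for illustration, `per_class_f1` is a hypothetical helper name, and a class with no true positives is given an $F_1$ of 0 by convention in this sketch. `sklearn.metrics.f1_score` with `average="macro"` or `average="weighted"` is expected to give the same values.

```python
import numpy as np

def per_class_f1(y_true, y_pred, classes):
    """Return one-vs-rest F1 scores and true-class counts for each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1, support = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        # 2 / (1/recall + 1/precision) simplifies to 2TP / (2TP + FP + FN).
        denom = 2 * tp + fp + fn
        f1.append(2 * tp / denom if denom else 0.0)
        support.append(np.sum(y_true == c))
    return np.array(f1), np.array(support)

classes = ["spiral", "elliptical", "irregular"]
# Made-up labels: 6 spirals, 3 ellipticals, 1 irregular.
y_true = ["spiral"] * 6 + ["elliptical"] * 3 + ["irregular"]
y_pred = ["spiral"] * 5 + ["elliptical"] * 3 + ["spiral"] * 2

f1, support = per_class_f1(y_true, y_pred, classes)
macro = f1.mean()                               # every class counts equally
weighted = (f1 * support).sum() / support.sum() # classes weighted by n_i / n
accuracy = np.mean(np.asarray(y_true) == np.asarray(y_pred))
```

In this toy case the single irregular galaxy is missed, so its $F_1$ is 0. That pulls the `macro` average (about 0.48) well below the `weighted` average (about 0.66) and the accuracy (0.7), showing how `macro` exposes poor performance on small classes.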

The averaged $F_1$ score (with either strategy) is better than the accuracy score at exposing bias, namely cases where larger classes are classified better than smaller ones. However, it is less interpretable than the accuracy score.

# Ethics & Privacy

Galaxy classification is a large task due to the number of galaxies, and as such 
is assisted by computers and/or crowds. However, using our algorithm may lead to 
undue confidence in its results. For example, since the classifier should be 
relatively cheap to run in terms of compute and it only relies on spectral data, 
this classifier may be preferred to other better classifiers. Taking this kind 
of shortcut may cause researchers to be led astray, or in more malicious cases, 
to perform sloppy analyses or even fabricate research.

Inexpensive galaxy classification is crucial to studying large numbers of 
galaxies at once. Since our classifier only looks at a subset of the available 
data, it may misclassify atypical galaxies, such as red (old) spiral 
galaxies and blue (young) elliptical galaxies. Thus, if researchers used our 
classification to conduct large-scale causal analyses, their underlying dataset 
would be biased. This could hide effects that would otherwise be 
observed, or produce ostensible effects that are actually correlated with 
confounding variables. In general, drawing conclusions from a biased 
classification of galaxies could easily lead to inaccurate results. Inaccuracy 
in science lowers the quality of research, concretely affects the allocation of 
resources, and could prematurely close off otherwise-promising areas of research 
while developing or sustaining lower-quality ones.

This effect can be mitigated by clearly documenting the responsible use and limits 
of the classifier. However, there is still potential for harm if the classifier is 
used recklessly or maliciously.

# Team Expectations 

- Communicate in the Discord group chat, and read messages from the team in a timely manner.
- Team members should finish the work they agree to take on.
- Show up to the meetings, in-person if possible.


# Project Timeline Proposal



| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|:------------:|:------------:|:-------------------------|:-------------------|
| 2/21 | 5:00 PM | Brainstorm topics/questions (all) | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research. |
| 2/22 | 7:00 PM | Do background research on the topic (all). | Finalize and organize project proposal form. Work toward figuring out how to improve accuracy. |
| 3/1 | 5:00 PM | Re-read bias training tradeoffs section, consider applications to project. (Led by Sang) | Discuss wrangling and possible analytical approaches. |
| 3/8 | 6:00 PM | EDA plan and contents (Led by Angel) | Finalize EDA group read through, analyze discussion plans |
| 3/13 | 5:00 PM | Finalize wrangling/EDA; Begin programming for project (Led by Tej) | Discuss/edit project code; Complete project |
| 3/17 | 6:00 PM | Complete analysis; Draft results/conclusion/discussion (Led by Angel) | Discuss/edit full project |
| 3/22 | Before 11:59 PM | Final Project (ideally) | Turn in Final Project / Finalize project |

