# **Pandemonium Project Proposal**
## **Feb 27, 2020**
---
### Proposal 

#### Background. 
This project builds upon on-going research to model and simulate galaxy collisions within the Computational Science Ph.D. Program and Physics and Astronomy Departments here at Middle Tennessee State University. 

Galaxy Zoo: Mergers is a citizen science effort to find models for 62 target galaxy collisions using SPAM.  SPAM is an O(N) simulation that models galaxy collisions and recreates the tidal distortions, or changes in shape, that the galaxies undergo during a collision.  Citizen scientists were presented with target images of two colliding galaxies and several simulations that attempt to model the target collision.  Using the innate human ability to visually recognize patterns, citizen scientists identified sixty-six thousand models that resemble a target image.  Once selected, these galactic models went through a tournament-like scoring system that identified the best models for each target system and assigned each model a human fitness score for how well it matched the target image. 

#### Project. 
The human ability to process visual patterns and identify similarities is incredibly complex.  Measuring the “similarity” between two images is often arbitrary and can depend on hundreds of factors, depending on the objects or images being compared.  Even with advances in image processing and computational power, previous efforts in the ongoing galaxy collision research to manually program a similarity measure between two galaxy images have had poor results.  We hope to use neural networks to capture the human ability to judge similarity, at least in the narrow application of galaxy images. 

The focus of our project is to create a variety of methods for predicting how well a model “fits” a target image using neural networks.  Neural network regression models will be trained on pre-existing model images and their associated human fitness scores from Galaxy Zoo: Mergers.  The regression model will take galaxy images and produce a machine fitness score that attempts to predict the human fitness score.  The performance of our networks will be judged by how well the final machine scores statistically correlate with the human scores for a set of galactic models. 

The primary goal is to develop a neural network regression model for each individual target system.  Each target system can be trained on hundreds to thousands of model images and human scores.  In addition, a variety of neural network architectures can be implemented and analyzed.  Perhaps a particular architecture performs the best across the majority of target systems.  To evaluate the performance of our neural networks, the primary metric will be the statistical correlation between neural network predictions and human fitness scores.  In addition to the correlation, graphs can be constructed to identify outliers, or possible overfitting to training data.  Once the regression model is built and operational, new model images could be generated and humans can agree and disagree with the predicted score and improve the process.  
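
The evaluation described above could be sketched as follows. This is a minimal example, assuming Pearson's r is the "statistical correlation" intended (the proposal does not name a specific coefficient); the function names `pearson_correlation` and `largest_residuals` are hypothetical helpers, not part of any existing project code.

```python
import numpy as np

def pearson_correlation(machine_scores, human_scores):
    """Pearson's r between predicted (machine) and human fitness scores.

    Assumption: Pearson's r stands in for the unspecified correlation metric.
    """
    m = np.asarray(machine_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    if m.shape != h.shape or m.size < 2:
        raise ValueError("need two equal-length arrays with at least 2 scores")
    m_centered = m - m.mean()
    h_centered = h - h.mean()
    denom = np.sqrt((m_centered ** 2).sum() * (h_centered ** 2).sum())
    if denom == 0.0:
        return float("nan")  # undefined when either set of scores is constant
    return float((m_centered * h_centered).sum() / denom)

def largest_residuals(machine_scores, human_scores, k=3):
    """Indices of the k models whose predictions differ most from human scores."""
    m = np.asarray(machine_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    residuals = np.abs(m - h)
    # Sort descending by absolute residual and keep the first k indices.
    return np.argsort(residuals)[::-1][:k]
```

Computing the correlation separately on training and held-out models is one way to look for the overfitting mentioned above: a much higher value on the training set than on the held-out set would be a warning sign.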

Another possible application is to build a general-purpose neural network model that can predict the similarity between any two pictures of galaxies in collision.  This could be trained on a sampling of models and target images from all of the target systems above.  This may be difficult to build, as each system undoubtedly scales its similarity scores differently.  For example, a 0.75 similarity for one system may represent a much higher “similarity” between images than a 0.75 for another system that contains poorer quality models and data.  However, the flexibility may be well worth the effort.  A general-purpose comparison could be applied to future galaxy collision systems.  
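
One possible way to reduce the scaling mismatch between systems is to standardize the human scores within each target system before pooling them. The sketch below assumes simple z-score standardization; this is an illustrative choice, not a method the proposal commits to, and `standardize_per_system` is a hypothetical name.

```python
import numpy as np

def standardize_per_system(scores_by_system):
    """Rescale each system's scores to zero mean and unit standard deviation.

    scores_by_system: dict mapping a target-system id to a sequence of
    human fitness scores for that system (hypothetical input layout).
    """
    standardized = {}
    for system_id, scores in scores_by_system.items():
        s = np.asarray(scores, dtype=float)
        spread = s.std()
        if spread > 0:
            standardized[system_id] = (s - s.mean()) / spread
        else:
            # All scores identical: no spread to rescale, so return zeros.
            standardized[system_id] = np.zeros_like(s)
    return standardized
```

After this step a score expresses how a model ranks relative to the other models of its own system, which may be easier to compare across systems than the raw values.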

Another application could be to create single-layer networks for all of the target systems.  With these single-layer networks, the weights could be visualized and scientists could identify which sections of the image are most crucial for predicting a similarity score.  For example, do all the systems place greater importance on tidal features or on the edges of the galaxy shape?  Is pixel importance related to the distance from the axis connecting the galaxy centers?  Which sections of the images can be mostly ignored? 
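
To illustrate the idea, the sketch below treats a single-layer network as a linear map from pixels to a score (no activation function) and fits it with ridge-regularized least squares instead of gradient descent, purely for brevity. The images are synthetic random arrays, not Galaxy Zoo: Mergers data, and `fit_single_layer` is a hypothetical helper.

```python
import numpy as np

def fit_single_layer(images, scores, ridge=1e-3):
    """Fit score ~ sum(weights * pixels) + bias and return (weight_map, bias).

    images: array of shape (n, height, width); scores: array of shape (n,).
    The returned weight_map has the same shape as one image, so it can be
    displayed as a picture of per-pixel importance.
    """
    n, height, width = images.shape
    X = images.reshape(n, height * width).astype(float)
    y = np.asarray(scores, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Closed-form ridge solution: (Xc^T Xc + ridge * I) w = Xc^T yc
    A = Xc.T @ Xc + ridge * np.eye(height * width)
    weights = np.linalg.solve(A, Xc.T @ yc)
    bias = y_mean - x_mean @ weights
    return weights.reshape(height, width), bias

# Synthetic demonstration: the score depends only on pixel (1, 2).
rng = np.random.default_rng(0)
demo_images = rng.random((200, 4, 4))
demo_scores = 2.0 * demo_images[:, 1, 2]
weight_map, bias = fit_single_layer(demo_images, demo_scores)
most_important = np.unravel_index(np.argmax(np.abs(weight_map)), weight_map.shape)
```

On real data the weight map could be shown with an image plot, and regions with weights near zero would be candidates for the sections that can be mostly ignored.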

One ambitious idea is to incorporate existing open-source image classification models.  WNDCHRM is open-source image classification software that goes beyond raw pixel intensities.  Instead, WNDCHRM extracts hundreds of image features, identifies which features best predict an image's classification, and uses those features to train a classification model.  Perhaps WNDCHRM could be used for its automated feature extraction and selection, and those image features could then be used to build an even more accurate and robust neural network regression model. 

---

## 1. Does your project have a specific or focused aim (or set of aims)? (3 points)

## 2. Does your proposal exposit the specific role that neural networks will play in the project? (2 points)

## 3. Does your project have a specific set of data (or data sets) identified? (4 points)


The data sets identified were amassed for Galaxy Zoo: Mergers (https://mergers.galaxyzoo.org/).  Each target is described as a pair of converging galaxies.  The first data set, Target Info, is a text file containing the coordinates of the center of each galaxy in a pairing, given as Right Ascension (RA, the east-west longitudinal coordinate) and Declination (Dec, the north-south latitudinal coordinate).  It also lists alternative names drawn from the Atlas of Peculiar Galaxies (Arp), the New General Catalogue of Nebulae and Clusters of Stars and the Index Catalogue (NGC/IC), the Uppsala General Catalogue (UGC), the Sloan Digital Sky Survey (SDSS), and the Two Micron All Sky Survey (2MASS).  A second file, Target SDSS, contains the SDSS IDs of the objects as well as their de-reddened magnitudes in the SDSS u, g, r, i, and z bands, along with spectral and photometric redshifts.  Those bands are filtered observations in the ultraviolet (u), green (g), red (r), near-infrared (i), and infrared (z).  Whenever possible, values from both SDSS Data Release 7 (DR7) and Data Release 8 (DR8) are included.  

Below is the column information provided by Target_info.txt:

>ID	Target id, one-up number roughly same as Merger Zoo presentation order<br>
>SDSSID	SDSS DR7 ID for target<br>
>PRI_RA_DEG	RA in degrees for primary galaxy<br>
>PRI_DEC_DEG	DEC in degrees for primary galaxy<br>
>PRI_NAMES	Alternate names for primary galaxy: SDSS, Arp, NGC/IC, UGC, 2MASS<br>
>SEC_RA_DEG	RA in degrees for secondary galaxy<br>
>SEC_DEC_DEG	DEC in degrees for secondary galaxy<br>
>SEC_NAMES	Alternate names for secondary galaxy: SDSS, Arp, NGC/IC, UGC, 2MASS<br>
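
As a small example of working with these columns, the sketch below computes the angular separation between the primary and secondary galaxy centers from their RA and Dec values using the haversine formula. It assumes a row has already been read into a dictionary keyed by the column names above; the parsing step is omitted because the file's delimiter is not stated here, and both function names are hypothetical.

```python
import math

def angular_separation_deg(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    half_ddec = (dec2 - dec1) / 2.0
    half_dra = (ra2 - ra1) / 2.0
    a = (math.sin(half_ddec) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin(half_dra) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def pair_separation_deg(row):
    """Separation between primary and secondary centers for one target row.

    row: dict keyed by the Target_info.txt column names (assumed layout).
    """
    return angular_separation_deg(
        float(row["PRI_RA_DEG"]), float(row["PRI_DEC_DEG"]),
        float(row["SEC_RA_DEG"]), float(row["SEC_DEC_DEG"]),
    )
```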

Below is the column information provided by Target_sdss.txt:

>ID	Target id, one-up number roughly same as Merger Zoo presentation order<br>
SDSSID	SDSS DR7 ID for target<br>
PRI_DR7_ID	SDSS DR7 id for primary galaxy<br>
PRI_DR8_ID	SDSS DR8 id for primary galaxy<br>
SEC_DR7_ID	SDSS DR7 id for secondary galaxy<br>
SEC_DR8_ID	SDSS DR8 id for secondary galaxy<br>
PRI_DR7_U	Dered U magnitude (model - extinction) from DR7 for primary<br>
PRI_DR7_G	Dered G magnitude (model - extinction) from DR7 for primary<br>
PRI_DR7_R	Dered R magnitude (model - extinction) from DR7 for primary<br>
PRI_DR7_I	Dered I magnitude (model - extinction) from DR7 for primary<br>
PRI_DR7_Z	Dered Z magnitude (model - extinction) from DR7 for primary<br>
PRI_DR7_SPECZ	Spectral redshift from DR7 for primary<br>
PRI_DR7_PHOTOZ	Photometric redshift from DR7 for primary<br>
PRI_DR7_PHOTOZ2	Alternate photometric redshift from DR7 for primary<br>
PRI_DR8_U	Dered U magnitude (model - extinction) from DR8 for primary<br>
PRI_DR8_G	Dered G magnitude (model - extinction) from DR8 for primary<br>
PRI_DR8_R	Dered R magnitude (model - extinction) from DR8 for primary<br>
PRI_DR8_I	Dered I magnitude (model - extinction) from DR8 for primary<br>
PRI_DR8_Z	Dered Z magnitude (model - extinction) from DR8 for primary<br>
PRI_DR8_SPECZ	Spectral redshift from DR8 for primary<br>
PRI_DR8_PHOTOZ	Photometric redshift from DR8 for primary<br>
PRI_DR8_PHOTOZ2	Alternate photometric redshift from DR8 for primary<br>
SEC_DR7_U	Dered U magnitude (model - extinction) from DR7 for secondary<br>
SEC_DR7_G	Dered G magnitude (model - extinction) from DR7 for secondary<br>
SEC_DR7_R	Dered R magnitude (model - extinction) from DR7 for secondary<br>
SEC_DR7_I	Dered I magnitude (model - extinction) from DR7 for secondary<br>
SEC_DR7_Z	Dered Z magnitude (model - extinction) from DR7 for secondary<br>
SEC_DR7_SPECZ	Spectral redshift from DR7 for secondary<br>
SEC_DR7_PHOTOZ	Photometric redshift from DR7 for secondary<br>
SEC_DR7_PHOTOZ2	Alternate photometric redshift from DR7 for secondary<br>
SEC_DR8_U	Dered U magnitude (model - extinction) from DR8 for secondary<br>
SEC_DR8_G	Dered G magnitude (model - extinction) from DR8 for secondary<br>
SEC_DR8_R	Dered R magnitude (model - extinction) from DR8 for secondary<br>
SEC_DR8_I	Dered I magnitude (model - extinction) from DR8 for secondary<br>
SEC_DR8_Z	Dered Z magnitude (model - extinction) from DR8 for secondary<br>
SEC_DR8_SPECZ	Spectral redshift from DR8 for secondary<br>
SEC_DR8_PHOTOZ	Photometric redshift from DR8 for secondary<br>
SEC_DR8_PHOTOZ2	Alternate photometric redshift from DR8 for secondary<br>
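
The magnitude columns follow a regular naming pattern (`PRI` or `SEC`, then `DR7` or `DR8`, then the band letter), which makes them easy to look up programmatically. The sketch below builds the column names from that pattern and derives color indices, which are differences between magnitudes in adjacent bands. Whether colors would be useful features for this project is an open question; the helper name `color_indices` and the dictionary-based row layout are assumptions for illustration.

```python
BANDS = ("U", "G", "R", "I", "Z")

def color_indices(row, galaxy="PRI", release="DR7"):
    """Return adjacent-band colors (u-g, g-r, r-i, i-z) for one galaxy.

    row: dict keyed by the Target_sdss.txt column names (assumed layout).
    galaxy: "PRI" or "SEC"; release: "DR7" or "DR8".
    """
    # Column names follow the pattern <galaxy>_<release>_<band>, e.g. PRI_DR7_G.
    mags = {band: float(row[f"{galaxy}_{release}_{band}"]) for band in BANDS}
    colors = {}
    for bluer, redder in zip(BANDS[:-1], BANDS[1:]):
        colors[f"{bluer.lower()}-{redder.lower()}"] = mags[bluer] - mags[redder]
    return colors

# Placeholder magnitudes for demonstration only; these are not real catalog values.
example_row = {
    "PRI_DR7_U": "18.0", "PRI_DR7_G": "17.5", "PRI_DR7_R": "16.75",
    "PRI_DR7_I": "16.5", "PRI_DR7_Z": "16.25",
}
example_colors = color_indices(example_row)
```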


## 4. Does your proposal describe a verifiable testing protocol? In other words, did you describe a set of specific tests that you could run for collecting data and/or statistics which can discriminate between success or failure at meeting the specific aim of the project in both a quantitative and qualitative manner? (4 points)
