# [Supernovae Data by Astrocatalogs](https://github.com/astrocatalogs/sne-2015-2019)

## Data Description and Exploratory Data Analysis

We will populate our Data Lake with data from the **Open Supernova Catalog (OSC)**, using the collection of all supernovae discovered from 2015 to 2019. The repository contains a variety of information about each supernova's discovery and physical attributes. For a deeper understanding, see the **[schema documentation](https://github.com/astrocatalogs/schema)**. The data is stored in .JSON files, and each file describes a single supernova.

First things first: let us clone the repository inside our data folder. From the main directory, go to `data` and run `git clone` on the following repository: https://github.com/astrocatalogs/sne-2015-2019.git.

This should create a folder called **sne-2015-2019** filled with .JSON files. Be careful when managing these files inside Jupyter Lab: this folder is a **CHONK BOY**! **ʕ•ᴥ•ʔ**

Now, we will import the data for a single supernova to see what we are working with.

In [2]:
import pandas as pd
import json

In [8]:
f = r'../data/sne-2015-2019/ASASSN-15ab.json'

with open(f, 'r') as j:
    contents = json.load(j)

The first key is simply the supernova's name. This could be handy if we want to merge every .JSON file into a single gigantic .JSON.

In [31]:
print(contents.keys())

dict_keys(['ASASSN-15ab'])
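
Because every file has the supernova's name as its only top-level key, merging the files amounts to updating one dictionary with the contents of each file. The following is a minimal sketch of that idea, not part of the original notebook; the function name `merge_catalog` is hypothetical, and it assumes every file in the folder follows the single-key layout seen above.

```python
import json
from pathlib import Path


def merge_catalog(folder):
    """Merge every per-supernova JSON file in `folder` into one dict.

    Assumption: each file holds exactly one top-level key, the
    supernova's name, as in ASASSN-15ab.json.
    """
    merged = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, "r", encoding="utf-8") as handle:
            # The top-level key is the supernova's name, so update()
            # adds one entry per file.
            merged.update(json.load(handle))
    return merged
```

Calling `merge_catalog('../data/sne-2015-2019')` would load the whole folder into memory at once, which may be heavy given the size of this repository.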


Once we open the first key, we begin to see all the information collected about this supernova. This is where we start to imagine what data structure will fit our needs, and here we stumble on a couple of problems. First, a measurement can have multiple values from different observations. For instance, the **dec** key has two values from distinct sources. Second, by looking at the [schema](https://github.com/astrocatalogs/schema) we see multiple attributes that are not present in the supernova we are analysing. This could be a barrier when building a **DataFrame** from every file, since the files do not all share the same set of fields.

In [9]:
print(contents['ASASSN-15ab'].keys())

dict_keys(['schema', 'name', 'sources', 'alias', 'claimedtype', 'comovingdist', 'dec', 'discoverdate', 'discoverer', 'ebv', 'host', 'hostdec', 'hostoffsetang', 'hostoffsetdist', 'hostra', 'lumdist', 'maxabsmag', 'maxappmag', 'maxdate', 'maxvisualabsmag', 'maxvisualappmag', 'maxvisualdate', 'ra', 'redshift', 'velocity', 'photometry'])
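
Both problems can be checked programmatically for any single supernova. The sketch below is illustrative only: `summarize_fields` is a hypothetical helper, and `expected_fields` stands for whatever list of attribute names we decide to take from the schema.

```python
def summarize_fields(event, expected_fields):
    """Report fields absent from `event` and fields with more than one entry.

    `event` is the dict stored under the supernova's name.
    `expected_fields` is a hypothetical list of attribute names
    chosen from the schema.
    """
    # Fields the schema allows but this supernova does not have.
    missing = [field for field in expected_fields if field not in event]
    # Fields stored as a list with several entries, e.g. one per source.
    multi_valued = [
        field
        for field, value in event.items()
        if isinstance(value, list) and len(value) > 1
    ]
    return missing, multi_valued
```

For example, `summarize_fields(contents['ASASSN-15ab'], ['ra', 'dec', 'spectra'])` would list `spectra` as missing, since that key does not appear in the output above.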


In [33]:
for key in contents['ASASSN-15ab']:
    value = contents['ASASSN-15ab'][key]
    # if isinstance(value, list):
    #      value = value[0]
    # else:
    #      value = value
    print(f'{key} : {value}\n\n')

schema : https://github.com/astrocatalogs/supernovae/blob/d3ef5fc/SCHEMA.md


name : ASASSN-15ab


sources : [{'name': '2016A&A...594A..13P', 'bibcode': '2016A&A...594A..13P', 'reference': 'Planck Collaboration et al. (2016)', 'alias': '1'}, {'name': '2015ATel.6864....1D', 'bibcode': '2015ATel.6864....1D', 'reference': 'Dong et al. (2015)', 'alias': '2'}, {'name': '2015ATel.6882....1S', 'bibcode': '2015ATel.6882....1S', 'reference': 'Shappee et al. (2015)', 'alias': '3'}, {'name': '2011ApJ...737..103S', 'bibcode': '2011ApJ...737..103S', 'reference': 'Schlafly & Finkbeiner (2011)', 'alias': '4'}, {'name': 'ASAS-SN Supernovae', 'url': 'http://www.astronomy.ohio-state.edu/~assassin/sn_list.html', 'alias': '5'}, {'name': 'Latest Supernovae', 'secondary': True, 'url': 'http://www.rochesterastronomy.org/snimages/snredshiftall.html', 'alias': '6'}, {'name': 'The Open Supernova Catalog', 'bibcode': '2017ApJ...835...64G', 'reference': 'Guillochon et al. (2017)', 'secondary': True, 'url': 'https

By looking at the data from multiple supernovae, we figured that the best way to build a database is to split our data into a relational and a non-relational set. For instance, the following fields are well behaved and could be structured as a relational database.

```
alias, claimedtype, comovingdist, dec, discoverdate, discoverer, ebv, host, hostdec, hostoffsetang, hostoffsetdist, hostra, lumdist, maxabsmag, maxappmag, maxdate, maxvisualabsmag, maxvisualappmag, maxvisualdate, ra, redshift, velocity
```

Even if some of them have two distinct values (e.g., `dec` and `ra`), we will pick the first one for simplicity.
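
The "pick the first value" rule can be sketched as follows. This is a rough illustration under stated assumptions rather than the final pipeline: the helper names `first_value`, `to_row`, and `build_table` are hypothetical, and the code assumes each measurement is a dict that stores its reading under a `value` key. Fields that a supernova lacks are filled with `None`, so every row ends up with the same columns.

```python
import pandas as pd

# The well-behaved fields listed above.
RELATIONAL_FIELDS = [
    "alias", "claimedtype", "comovingdist", "dec", "discoverdate",
    "discoverer", "ebv", "host", "hostdec", "hostoffsetang",
    "hostoffsetdist", "hostra", "lumdist", "maxabsmag", "maxappmag",
    "maxdate", "maxvisualabsmag", "maxvisualappmag", "maxvisualdate",
    "ra", "redshift", "velocity",
]


def first_value(entry):
    """Return the first reported value of a field, or None if absent."""
    if isinstance(entry, list):
        if not entry:
            return None
        entry = entry[0]  # keep only the first observation
    if isinstance(entry, dict):
        # Assumption: each measurement dict stores its reading under 'value'.
        return entry.get("value")
    return entry


def to_row(name, event, fields=RELATIONAL_FIELDS):
    """Flatten one supernova into a single row; missing fields become None."""
    row = {"name": name}
    for field in fields:
        row[field] = first_value(event.get(field))
    return row


def build_table(catalog):
    """Build a DataFrame from a dict mapping supernova names to their data."""
    return pd.DataFrame([to_row(name, event) for name, event in catalog.items()])
```

Since each loaded file is already a dict keyed by the supernova's name, `build_table(contents)` would produce a one-row table for ASASSN-15ab.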

In contrast, fields such as `photometry`, `spectra`, `X-ray`, and `radio` are multidimensional with a single key (the supernova's name). Therefore, they are candidates for their own non-relational sub-database, indexed by the supernova's name. We are going to use **AWS Redshift** to create the relational database and **MongoDB** for the .JSON-based database.
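
One possible way to prepare this split is sketched below. It is only an illustration: `split_event` and `NESTED_FIELDS` are hypothetical names, and the key spellings in `NESTED_FIELDS` are assumptions that should be checked against the schema (the X-ray key in particular). Each nested field becomes its own document with the supernova's name as `_id`, which is the field MongoDB uses as a document's primary key.

```python
# Assumed key spellings; verify them against the schema before use.
NESTED_FIELDS = ("photometry", "spectra", "xray", "radio")


def split_event(name, event, nested_fields=NESTED_FIELDS):
    """Split one supernova into a flat part and one document per nested field."""
    # Everything that is not a nested field stays on the relational side.
    flat = {key: value for key, value in event.items() if key not in nested_fields}
    # One document per nested field that is present, keyed by the supernova's name.
    documents = {
        field: {"_id": name, field: event[field]}
        for field in nested_fields
        if field in event
    }
    return flat, documents
```

With `flat, documents = split_event('ASASSN-15ab', contents['ASASSN-15ab'])`, the `flat` part would feed the relational table, while `documents['photometry']` could be stored in a collection dedicated to photometry.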