# Joys & sorrows of the astro-person: datasets & catalogues

Sooner or later in your research you will run into datasets.

They may contain stellar tracks, dark matter haloes, spectra, images, and so on.

They may be either real or simulated.

Whether you are a theoretician or an observational astronomer (or cosmologist), you will have to deal with them.
In general these datasets come organised in catalogues (or similar).

Fear not! Python is a great language for the inspection and analysis of these entities!

* **ASCII**
     - **TXT**
     - **CSV**
* **BINARY**
     - **FITS**
     - **HDF5**
     - **Pickle**
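
As a rough illustration of the difference between the two families listed above, the sketch below writes the same small table once as ASCII (CSV) and once as binary (pickle), then reads both back. The table values and the file names are made up for the example.

```python
import os
import pickle
import tempfile

import numpy

# Synthetic two-column table (made-up values, for illustration only)
table = numpy.array([[1.0, 2.5], [2.0, 3.5], [3.0, 4.5]])

with tempfile.TemporaryDirectory() as tmpdir:
    # ASCII: human-readable, but larger on disk and slower to parse
    csv_path = os.path.join(tmpdir, 'table.csv')
    numpy.savetxt(csv_path, table, delimiter=',', header='x,y')
    from_csv = numpy.loadtxt(csv_path, delimiter=',')

    # Binary (pickle): compact and fast, but specific to Python
    pkl_path = os.path.join(tmpdir, 'table.pkl')
    with open(pkl_path, 'wb') as f:
        pickle.dump(table, f)
    with open(pkl_path, 'rb') as f:
        from_pkl = pickle.load(f)

print(numpy.allclose(from_csv, table), numpy.array_equal(from_pkl, table))
```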


To analyse the datasets:

In [None]:
import numpy

To move around the filesystem:

In [None]:
import os

Where the datasets are stored:

In [None]:
basedir = '../datasets'

## Load a dataset from ASCII

In [None]:
os.path.join(basedir,'haloes_64Mpc_512p_planck18_z3.txt')

In [None]:
Masses = numpy.genfromtxt(os.path.join(basedir,'haloes_64Mpc_512p_planck18_z3.txt'), 
                          usecols=(3,), unpack=True)
Lbox = 64 # Mpc/h
Volume = Lbox**3 # (Mpc/h)^3
Npart = 512**3
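
To see what `usecols` and `unpack` do without touching the real file, here is a minimal sketch on an in-memory text table. The column layout (x, y, z, mass) and the values are assumptions made for the example, not the actual layout of the halo catalogue.

```python
import io

import numpy

# Hypothetical 4-column catalogue: x, y, z, mass (values are made up)
fake_file = io.StringIO(
    "# x y z mass\n"
    "1.0 2.0 3.0 1.5e10\n"
    "4.0 5.0 6.0 0.0\n"
    "7.0 8.0 9.0 3.2e12\n"
)

# usecols picks columns by 0-based index; unpack=True transposes the result,
# so that each selected column can be assigned to its own variable
xs, masses = numpy.genfromtxt(fake_file, usecols=(0, 3), unpack=True)
print(xs, masses)
```

Lines starting with `#` are treated as comments by default, which is why the header line is skipped.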

But first let's inspect what we have loaded, e.g.

* limits:

In [None]:
type(Masses), Masses.shape, Masses.size

In [None]:
Masses

In [None]:
Masses.argmin()

In [None]:
(Masses.min(), Masses.max())

We do not like those zeros, so we are going to mask them!

In [None]:
wMn0 = Masses > 0.0

In [None]:
wMn0.size

In [None]:
Masses[wMn0].size

In [None]:
Masses = Masses[Masses > 0.0]
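
The masking steps above can be sketched on a tiny made-up array, which makes it easier to see what each object is: the mask has the size of the original array, while the selection only has as many entries as there are `True` values.

```python
import numpy

masses = numpy.array([0.0, 2.0e9, 0.0, 5.0e11, 1.0e13])  # made-up values

mask = masses > 0.0                 # boolean array, same shape as masses
n_kept = numpy.count_nonzero(mask)  # number of True entries
kept = masses[mask]                 # new array with the selected entries only
idx = numpy.flatnonzero(mask)       # positions of the selected entries

print(mask.size, n_kept, kept.size, idx)
```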

In [None]:
numpy.log10((Masses.min(), Masses.max()))

Great, how many values do we have?

In [None]:
Masses.size

## Let's do some visual inspection with ``matplotlib``

Where to start: [the examples page](https://matplotlib.org/stable/gallery/index.html) 

In [None]:
import matplotlib.pyplot as plt

We want to compute the halo mass function

$$n(M_\text{halo}) = \dfrac{d n(M_\text{halo} \in [M,M+dM))}{d\ln M_\text{halo}}$$

which is given by the **number density of haloes per logarithmic mass bin**.

In [None]:
NMDM, MDM_bins = numpy.histogram(Masses, bins=numpy.logspace(8.5, 14, 21))

In [None]:
MDM = 0.5 * (MDM_bins[1:]+MDM_bins[:-1])

In [None]:
dlnMDM = numpy.diff(numpy.log(MDM_bins))

In [None]:
nMDM = NMDM / dlnMDM / Volume

In [None]:
plt.loglog(MDM, nMDM, marker='o', linestyle='none')
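
With logarithmically spaced bins, the arithmetic mid-point of each bin sits closer to the upper edge when seen on a logarithmic axis. A possible alternative, sketched below with the same bin edges, is the geometric mid-point, which lies exactly half-way between the edges in log space. Which one to use is a matter of convention.

```python
import numpy

edges = numpy.logspace(8.5, 14, 21)          # same bin edges as above

arith = 0.5 * (edges[1:] + edges[:-1])       # arithmetic mid-point
geom = numpy.sqrt(edges[1:] * edges[:-1])    # geometric mid-point

# In log10 space the geometric centres sit half-way between the edges
log_edges = numpy.log10(edges)
half_way = 0.5 * (log_edges[1:] + log_edges[:-1])
print(numpy.allclose(numpy.log10(geom), half_way))
```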

Remember that we are scientists: what are we missing here? **THE ERROR!!!!** (here, the Poisson uncertainty $\sqrt{N}$ on the number of haloes counted in each bin)

In [None]:
nMDM_err = numpy.sqrt(NMDM) / dlnMDM / Volume
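
A quick sketch, on made-up counts, of how the Poisson uncertainty behaves: the relative error scales as $1/\sqrt{N}$, so sparsely populated bins are the least reliable, and empty bins need to be skipped when dividing.

```python
import numpy

counts = numpy.array([400, 100, 25, 4, 0])  # made-up counts per bin

err = numpy.sqrt(counts)                    # Poisson uncertainty on each count
filled = counts > 0                         # skip empty bins to avoid 0/0
rel_err = err[filled] / counts[filled]      # equals 1/sqrt(N)
print(rel_err)
```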

We may also want some physical limit to assess the validity of the dataset, for instance the mass of a simulation particle:

In [None]:
from astropy.cosmology import Planck18
import astropy.units as u

In [None]:
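# mass per particle; NOTE: Volume is in (Mpc/h)^3 but the density is in Msun/Mpc^3, so powers of h are mixed here
Mpart = (Planck18.critical_density0*Planck18.Odm0).to(u.Msun/(u.Mpc)**3)*Volume/Npart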

In [None]:
Mpart*100
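
For reference, the particle mass follows from spreading the mean dark-matter density of the box over its particles. The worked estimate below is written in $h$-scaled units and assumes the commonly quoted values $\rho_{\text{crit},0} \simeq 2.775\times10^{11}\,h^2\,M_\odot\,\mathrm{Mpc}^{-3}$ and $\Omega_\text{dm} \simeq 0.26$; both numbers are assumptions of this sketch, not taken from the catalogue.

$$M_\text{part} = \rho_{\text{crit},0}\,\Omega_\text{dm}\,\dfrac{L_\text{box}^3}{N_\text{part}} \simeq 2.775\times10^{11} \times 0.26 \times \dfrac{64^3}{512^3}\ h^{-1}M_\odot \simeq 1.4\times10^{8}\ h^{-1}M_\odot$$

The number printed by the cell above can differ from this one by powers of $h$, depending on the units in which the density and the volume are expressed. The factor 100 corresponds to requiring at least 100 particles per halo, which is a rule of thumb rather than a strict limit.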

So a better way of plotting this bad guy is as follows

In [None]:
Mpart.value

In [None]:
fig, ax = plt.subplots(1,1)

ax.set(
    xscale='log', yscale='log', 
    xlim=(1.e+9, 1.e+14),
    xlabel='$M_\\mathrm{halo}\\ [M_\\odot]$',
    ylabel='$n(M_\\mathrm{halo})\\ [(\\mathrm{Mpc}/h)^{-3}]$'
)
ax.errorbar(MDM, nMDM, yerr=nMDM_err, ls='none', marker = 'o', color='k', label='mass function')
ax.axvline(Mpart.value*100, color='gray', ls='--', label='resolution limit')
ax.legend()

and you can save it to disk, now or at a later time:

In [None]:
#fig.savefig('mass_function.png', bbox_inches='tight')

## FITS files 

FITS stands for **Flexible Image Transport System**, and from the name you can infer why it is so widely used in astronomy.
We will not work with images today, but consider that an image is, most of the time, just a 2D (single channel) or a 3D (multiple channels) **MATRIX**.

FITS is therefore a file format optimised to work with multi-dimensional arrays.

It stores **data AND metadata** in binary format.

A FITS file is organised in Header-Data Units (HDUs), each of which contains both a dataset and its descriptive metadata.

> **metadata** is all the information about the dataset at hand that helps to understand the actual dataset

**Last but not least**, the metadata associated with a dataset also include

- the shape of the dataset
- the data type
- where in the file the target dataset is stored

Thanks to this, FITS makes it possible to load only a **MEMORY MAP** of the dataset instead of the dataset itself.
This means you can work with files larger than the available volatile memory of your system (at the cost of performance).
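
The idea of a memory map can be sketched with plain `numpy`, independently of FITS: knowing the data type and the shape is enough to address any element of a file on disk without loading the whole array. The file name and array size below are arbitrary choices for the example. In `astropy`, the analogous behaviour is controlled by the `memmap` argument of `fits.open`.

```python
import os
import tempfile

import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'big_array.dat')

    # Write a raw binary array to disk (small here, could be huge in practice)
    numpy.arange(1_000_000, dtype=numpy.float64).tofile(path)

    # The metadata (dtype, shape) is what allows mapping without loading
    mm = numpy.memmap(path, dtype=numpy.float64, mode='r', shape=(1000, 1000))

    # Only the requested slice is actually read from disk
    row_sum = float(mm[10, :5].sum())
    del mm  # release the map before the temporary directory is removed

print(row_sum)
```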

The file we will use as example:

In [None]:
filename = 'TRECS_HI+Continuum_z0.01.fits'

[AstroPy](https://www.astropy.org/) is a useful library that contains a lot of stuff useful to the astronomer.

**BUT REMEMBER THAT IT ALSO MISSES A LOT OF STUFF: DON'T RELY ON IT TOO MUCH**

In [None]:
from astropy.io import fits

By opening a ``fits`` file you are given a **LIST OF HDUs**

In [None]:
hdul = fits.open(os.path.join(basedir, filename))

You can get some metadata on this:

In [None]:
hdul.info()

Each HDU comes with a header

In [None]:
hdul[0].header

In [None]:
hdul[1].header

Since the second HDU is conceptually a table, you can access its columns through the ``columns`` attribute

In [None]:
hdul[1].columns.names

We can assign it to a variable for convenience:

In [None]:
columns = hdul[1].columns.names

## Once again on visual inspection with ``matplotlib``

### Redshift distribution

In [None]:
zz = hdul[1].data['redshift']

In [None]:
# zz = numpy.array(hdul[1].data['redshift'])

In [None]:
Nzbins, zbins, _ = plt.hist(zz, histtype='step')

In [None]:
rng = numpy.random.default_rng( seed=123 )

In [None]:
rind = rng.integers(1, endpoint=True, size=zz.size, dtype=bool)

OK, the distribution is quite uniform, but these are (or might be) a lot of objects.

#### EXERCISE: SUB-SAMPLE A DATASET

Why not sub-sample it?

In [None]:
# insert here solution

In [None]:
zz[rind]
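
A random boolean mask keeps roughly half of the objects. If you would rather fix the size of the sub-sample, one possible alternative is to draw indices without replacement, as sketched below on a synthetic stand-in array. The names `values` and `n_sub` are hypothetical and not part of the catalogue.

```python
import numpy

rng = numpy.random.default_rng(seed=123)

values = numpy.linspace(0.0, 0.5, 10_000)   # stand-in for the redshift array
n_sub = 1_000                               # hypothetical sub-sample size

# Draw n_sub distinct indices, i.e. sample without replacement
idx = rng.choice(values.size, size=n_sub, replace=False)
sub = values[idx]
print(sub.size, numpy.unique(idx).size)
```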

In [None]:
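# use the same bin edges for both samples, so that the two histograms are comparable
nz, *_ = plt.hist(zz, bins=zbins, histtype='step', density=True)
nz_ss, *_ = plt.hist(zz[rind], bins=zbins, histtype='step', density=True)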

In [None]:
zcen = 0.5*(zbins[1:]+zbins[:-1])

In [None]:
plt.step(zcen, nz.cumsum())
plt.step(zcen, nz_ss.cumsum())

In [None]:
from scipy.stats import kstest

In [None]:
# two-sample KS test on the (unbinned) samples, not on the binned cumulative histograms
kstest(zz, zz[rind]).pvalue > 0.05
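
To make explicit what the Kolmogorov-Smirnov test measures, here is a minimal `numpy`-only sketch of the two-sample statistic: the largest distance between the empirical cumulative distributions of the two samples. It works on the samples themselves, with no binning. The helper `ks_statistic` and the synthetic data are written for this example; it does not compute a p-value.

```python
import numpy

def ks_statistic(a, b):
    """Two-sample KS statistic: max distance between the two empirical CDFs."""
    a, b = numpy.sort(a), numpy.sort(b)
    grid = numpy.concatenate([a, b])  # evaluate both ECDFs at every data point
    cdf_a = numpy.searchsorted(a, grid, side='right') / a.size
    cdf_b = numpy.searchsorted(b, grid, side='right') / b.size
    return numpy.abs(cdf_a - cdf_b).max()

rng = numpy.random.default_rng(seed=123)
full = rng.uniform(0.0, 0.5, size=20_000)          # synthetic "catalogue"
sub = full[rng.integers(2, size=full.size) == 1]   # ~50% random sub-sample
shifted = full + 0.1                               # deliberately different sample

print(ks_statistic(full, sub), ks_statistic(full, shifted))
```

A small value means the two samples are compatible with the same parent distribution. The shifted sample gives a much larger one.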

### Stellar-to-halo mass relation (SHMR)

A.K.A. what stellar mass corresponds to each DM halo mass?

In [None]:
Mh = hdul[1].data['Mh']
Ms = hdul[1].data['Mstar']

In [None]:
Ms.size

In [None]:
plt.scatter(Mh, Ms, marker='.')

In [None]:
wMs = Ms > 0.0

In [None]:
wMs.dtype, wMs.size, wMs.sum()

In [None]:
plt.scatter(Mh[wMs], Ms[wMs], marker='.')

Is this plot helpful? Not very much: you just see a cloud of points. Still, you can learn something from it. Like what?

#### EXERCISE: ANSWER THE QUESTIONS ABOVE AND DESIGN A BETTER PLOT

In [None]:
# insert here solution
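
One possible direction, sketched here on synthetic data so that the actual exercise is left to you: work in logarithmic space and summarise the cloud of points with the median in bins of halo mass. The linear relation and the scatter used to generate the data are made up and are not a physical model.

```python
import numpy

rng = numpy.random.default_rng(seed=123)

# Synthetic stand-in for the catalogue: log10 halo mass and a noisy,
# made-up linear relation for log10 stellar mass (not a physical model)
log_mh = rng.uniform(10.0, 14.0, size=5_000)
log_ms = 0.8 * log_mh - 0.5 + rng.normal(0.0, 0.3, size=log_mh.size)

edges = numpy.linspace(10.0, 14.0, 9)          # 8 bins in log10(Mh)
which = numpy.digitize(log_mh, edges) - 1      # bin index of each object
which = numpy.clip(which, 0, edges.size - 2)   # keep the right-most edge in range

centres = 0.5 * (edges[1:] + edges[:-1])
median = numpy.array([numpy.median(log_ms[which == i])
                      for i in range(edges.size - 1)])
print(centres, median)
```

Plotting `median` against `centres` on top of the scatter makes the trend visible even when the points overlap heavily.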

## To conclude: extracting some data for later use

I want to see how the SFR is distributed across my catalogue, i.e. I want to compute the
**SFR density function**.

Here I extract the array of values:

In [None]:
lSFR = hdul[1].data['logSFR']

And once again I can see how this dataset is distributed

In [None]:
_ = plt.hist(lSFR, bins=100)

Since I notice an excess of objects with the same SFR value (i.e. $\log\text{SFR} = -100$), I can conclude those are *flagged values* (i.e. no measurement is provided).

In [None]:
wSFR = lSFR>-100.0

To get the density I need the volume. What is the volume of a lightcone?

$$V_\text{FoV} = \dfrac{\Omega}{3}d_C^3(z)$$

with $\Omega$ the solid angle and $d_C(z)$ the comoving distance.

Here is some information I am giving you "for free":

In [None]:
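FoV = 5 # deg
degrees_to_radians = numpy.pi / 180
SolidAngle = (FoV*degrees_to_radians)**2 # sr, square field in the small-angle approximation
# NOTE: this overwrites the simulation-box Volume defined earlier
Volume = SolidAngle / 3 * (Planck18.comoving_distance(zz.max()))**3 # Mpc^3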
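
The lightcone formula can be wrapped in a small helper, sketched below under the same assumptions as the cell above (square field of view, small-angle approximation). The function name `lightcone_volume`, the optional lower limit `d_c_min` and the distance of 100 Mpc are hypothetical, chosen only to show the arithmetic.

```python
import numpy

def lightcone_volume(fov_deg, d_c_max, d_c_min=0.0):
    """Comoving volume of a square field of view between two comoving distances.

    Uses the small-angle approximation Omega = (FoV in radians)^2.
    """
    solid_angle = numpy.deg2rad(fov_deg) ** 2
    return solid_angle / 3.0 * (d_c_max**3 - d_c_min**3)

# Hypothetical comoving distance of 100 Mpc, for illustration only
print(lightcone_volume(5.0, 100.0))
```

The result has the units of the distance cubed, so passing distances in Mpc gives a volume in Mpc$^3$.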

In [None]:
Volume

I can check once again the limits of my sample to get some indication of the bin limits. 

In [None]:
lSFR[wSFR].min(), lSFR[wSFR].max()

Compute the histogram

In [None]:
NlSFR, lSFR_bins = numpy.histogram( lSFR[wSFR], bins=numpy.linspace(-3, 1.5, 30) )
lSFR_cens = 0.5*(lSFR_bins[1:]+lSFR_bins[:-1])

and get my **SUMMARY STATISTICS**

In [None]:
nlSFR = NlSFR / Volume.value
nlSFR_err = numpy.sqrt( NlSFR ) / Volume.value
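
The quantity computed above is the number of objects per bin per unit volume. If you want a density per unit $\log\text{SFR}$ instead, in analogy with the halo mass function, one option is to divide by the bin width as well. A minimal sketch on made-up counts, with a hypothetical volume:

```python
import numpy

counts = numpy.array([10, 40, 25, 5])          # made-up counts per bin
edges = numpy.linspace(-3.0, 1.0, 5)           # bin edges in log10(SFR)
volume = 1.0e4                                 # hypothetical volume in Mpc^3

dlog = numpy.diff(edges)                       # bin widths in dex
phi = counts / dlog / volume                   # dN / dlogSFR / dV
phi_err = numpy.sqrt(counts) / dlog / volume   # Poisson uncertainty

# Integrating phi over the bins gives back the total number density
print(numpy.isclose((phi * dlog).sum(), counts.sum() / volume))
```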

In [None]:
fig, ax = plt.subplots(1,1)
_ = ax.set(
    xscale='log', yscale='log',
    xlabel='$\\mathrm{SFR}\\ [M_\\odot\\,\\mathrm{yr}^{-1}]$',
    ylabel='$n(\\mathrm{SFR})\\ [\\mathrm{Mpc}^{-3}]$'
)
_ = ax.errorbar(10**lSFR_cens, nlSFR, yerr=nlSFR_err, 
                marker='o', linestyle='none', color='k', label='data')

In [None]:
lSFR[wSFR].min(), lSFR[wSFR].max()

In [None]:
lSFR_cens

In [None]:
nlSFR

In [None]:
outfile = os.path.join(basedir, 'SFR_density_function')
outfile

> Let's save the data (notice that I am ignoring the first and the last two data points, why did I do that?)

In [None]:
numpy.savez(outfile, lsfr = lSFR_cens[1:-2], nsfr = nlSFR[1:-2], nsfr_e = nlSFR_err[1:-2])
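
To use the saved arrays at a later time, they can be read back with `numpy.load`. The sketch below uses a temporary file and made-up arrays rather than the actual output file; note that `numpy.savez` appends the `.npz` extension to the file name.

```python
import os
import tempfile

import numpy

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example')          # savez appends '.npz'
    numpy.savez(path, x=numpy.arange(3), y=numpy.array([0.1, 0.2, 0.3]))

    # The arrays come back under the keyword names used when saving
    with numpy.load(path + '.npz') as archive:
        names = sorted(archive.files)
        x = archive['x']
        y = archive['y']

print(names, x, y)
```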