## Snippet from the documentation

- Catalogs behave much like a numpy structured array, where the fields of the array are referred to as “columns”. These columns store the information about the objects in the catalog; common columns are “Position”, “Velocity”, “Mass”, etc. 
- nbodykit.base.catalog.CatalogSource is an abstract base class and cannot be initialized directly. Catalogs are instead created via specialized subclasses, which fall into one of two categories:

    1) Reading data from disk
    
    2) Generating mock data
    
- CatalogSource.size: number of objects in the catalog on the local process (MPI rank)
- CatalogSource.csize: collective number of objects in the catalog, summed over all processes

Meaning of some columns:


| Name | Description | Default Value |
| --- | --- | --- |
| Value | When interpolating a CatalogSource on to a mesh, the value of this array is used as the field value that each particle contributes to a given mesh cell. The mesh field is a weighted average of Value, with the weights given by Weight. For example, the Value column could hold Velocity; with Weight set to the mass, the field painted to the mesh is then the momentum (mass-weighted velocity). | 1.0 |
| Weight | The weight to use for each particle when interpolating a CatalogSource on to a mesh. The mesh field is a weighted average of Value, with the weights given by Weight. | 1.0 |
| Selection | A boolean column that selects a subset slice of the CatalogSource. When converting a CatalogSource to a mesh object, only the objects where the Selection column is True will be painted to the mesh. | True | 
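
As a rough illustration of the Value/Weight description in the table, the sketch below computes a weighted average on a one-dimensional mesh with plain numpy. It is not nbodykit's painting code: the function name `paint_weighted_average`, the nearest-cell assignment, and the toy particle data are all assumptions made for the example.

```python
import numpy as np

def paint_weighted_average(position, value, weight, n_cells, box_size=1.0):
    """Weighted average of `value` per cell of a 1D mesh (hypothetical helper)."""
    position = np.asarray(position, dtype='f8')
    value = np.asarray(value, dtype='f8')
    weight = np.asarray(weight, dtype='f8')
    # index of the cell that each particle falls into (periodic box)
    cell = np.floor(position / box_size * n_cells).astype(int) % n_cells
    # accumulate sum(Weight * Value) and sum(Weight) in every cell
    num = np.bincount(cell, weights=weight * value, minlength=n_cells)
    den = np.bincount(cell, weights=weight, minlength=n_cells)
    # cells without particles are left at zero
    field = np.zeros(n_cells)
    np.divide(num, den, out=field, where=den > 0)
    return field

x = np.array([0.1, 0.2, 0.7])  # toy positions in a unit box
v = np.array([1.0, 3.0, 5.0])  # plays the role of the Value column
m = np.array([1.0, 3.0, 2.0])  # plays the role of the Weight column
field = paint_weighted_average(x, v, m, n_cells=2)
print(field)
```

The first cell holds two particles, so its value is (1*1 + 3*3) / (1 + 3); the second cell holds a single particle and simply returns that particle's value.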

### Reading catalogs from disk

In [22]:
# Binary
import numpy as np
from nbodykit.source.catalog import BinaryCatalog

# generate some fake data and save to a binary file
with open('binary-example.dat', 'wb') as ff:
    pos = np.random.random(size=(1024, 3)) # fake Position column
    vel = np.random.random(size=(1024, 3)) # fake Velocity column
    pos.tofile(ff); vel.tofile(ff); ff.seek(0)

# create the binary catalog
f = BinaryCatalog(ff.name, [('Position', ('f8', 3)), ('Velocity', ('f8', 3))], size=1024)

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)

BinaryCatalog(size=1024, FileStack(BinaryFile(path=/home/jwack/library tests/binary-example.dat, dataset=*, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Position', 'Selection', 'Value', 'Velocity', 'Weight']
total size =  1024
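
The file written above stores the two columns back to back (all of Position, then all of Velocity), which is the layout described by the dtype list and `size` passed to `BinaryCatalog`. The numpy-only sketch below reads such a file back by hand to make the layout explicit. The file name `layout-example.dat` and the small object count are assumptions for the example.

```python
import os
import tempfile
import numpy as np

n = 8  # number of objects (kept small for the example)
pos = np.random.random(size=(n, 3))
vel = np.random.random(size=(n, 3))

# write the columns one after the other, as in the cell above
path = os.path.join(tempfile.mkdtemp(), 'layout-example.dat')
with open(path, 'wb') as ff:
    pos.tofile(ff)
    vel.tofile(ff)

# read the raw float64 values and split them at the column boundary
raw = np.fromfile(path, dtype='f8')
pos_back = raw[:n * 3].reshape(n, 3)
vel_back = raw[n * 3:].reshape(n, 3)

print(raw.size, pos_back.shape, vel_back.shape)
```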


In [5]:
# FITS 
import fitsio
from nbodykit.source.catalog import FITSCatalog

# generate some fake data
dset = np.empty(1024, dtype=[('Position', ('f8', 3)), ('Mass', 'f8')])
dset['Position'] = np.random.random(size=(1024, 3))
dset['Mass'] = np.random.random(size=1024)

# write to a FITS file using fitsio
fitsio.write('fits-example.fits', dset, extname='Data')

# initialize the catalog
f = FITSCatalog('fits-example.fits', ext='Data')

print(f)
print("columns = ", f.columns) # default Weight,Selection also present
print("total size = ", f.csize)

FITSCatalog(size=1024, FileStack(FITSFile(path=/home/jwack/library tests/fits-example.fits, dataset=Data, ncolumns=2, shape=(1024,)>, ... 1 files))
columns =  ['Mass', 'Position', 'Selection', 'Value', 'Weight']
total size =  1024


### dask arrays
Columns of catalogs are dask arrays. These are similar to numpy arrays, with one difference: numpy performs operations immediately, while dask stores operations in a task graph that is only evaluated when the result is requested. A further advantage is that operations are performed on chunks of the array (specified by ```chunksize```), so the data being processed can be as large as the disk storage rather than being limited by the size of the memory.
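
The idea of working chunk by chunk can be illustrated without dask. The sketch below is a hand-rolled analogue, not dask's implementation: it memory-maps a hypothetical file `chunk-example.dat` and computes the per-coordinate minimum one block of rows at a time, so only `chunk_rows` rows are held in memory at once.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(42)
data = rng.random((1000, 3))

# store the array on disk
path = os.path.join(tempfile.mkdtemp(), 'chunk-example.dat')
data.tofile(path)

# open the file without loading it into memory
on_disk = np.memmap(path, dtype='f8', mode='r', shape=(1000, 3))

chunk_rows = 128  # rows per chunk (assumed value)
running_min = np.full(3, np.inf)
for start in range(0, on_disk.shape[0], chunk_rows):
    # only this block of rows is read into memory
    chunk = np.asarray(on_disk[start:start + chunk_rows])
    running_min = np.minimum(running_min, chunk.min(axis=0))

print(running_min)
```

dask automates this pattern: it splits the array into chunks, records the operations as a task graph, and combines the partial results when `compute()` is called.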

In [5]:
from nbodykit.lab import UniformCatalog

# Uniformly random position, vel. Trivial weight and selection columns
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42) 
print(cat)
print(cat['Position'])

UniformCatalog(size=96, seed=42)
dask.array<array, shape=(96, 3), dtype=float64, chunksize=(96, 3), chunktype=numpy.ndarray> first: [0.45470105 0.83263203 0.06905134] last: [0.62474599 0.15388738 0.84302209]


In [9]:
# evaluating a dask array
import dask.array as da

pos = cat['Position']
min_da_array = da.min(pos, axis=0) # builds task graph for the minimum of each cartesian coordinate over all objects
result = min_da_array.compute() # evaluates the task graph, giving a np array
result

array([0.00402579, 0.00015685, 0.00271747])

In [17]:
# alternative, also giving np array
vel = cat.compute(cat['Velocity'])
print(type(vel))

<class 'numpy.ndarray'>


### data operations
See "Getting Started" -> "Discrete Data Catalogs" -> "Common Data Operations" in the nbodykit documentation for how to concatenate catalogs and stack columns.

In [75]:
from nbodykit.lab import *
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)

In [76]:
# check if specific column present
print("Does 'Mass' column exist? ", 'Mass' in cat)
# add columns either with array of correct length or with scalar value
cat['Mass'] = np.random.random(size=len(cat))
cat['Type'] = b"central"

# overwrite columns
print("Original: ", cat['Mass'].compute()[:5]) # show first few entries of mass column
cat['Mass'] = cat['Mass'] / cat['Mass'].compute()[0]
print("normalized: ", cat['Mass'].compute()[:5]) # equivalent: cat.compute(cat['Mass'])[:5]

Does 'Mass' column exist?  False
Original:  [0.26135736 0.80682607 0.90297312 0.74087606 0.83415904]
normalized:  [1.         3.08706084 3.45493663 2.83472428 3.19164165]


In [45]:
# selecting subset: boolean array or slice notation
sel = cat['Mass'] >= 1
sub = cat[sel]

print(sub['Position'].compute()[:3]) # positions of the first three selected objects

# select y position of first two galaxies
sub2 = cat['Position'][0:2,1]
sub2.compute()

[[0.45470105 0.83263203 0.06905134]
 [0.31944725 0.48518719 0.29826163]
 [0.31854524 0.34906766 0.99925086]]


array([0.83263203, 0.48518719])

In [46]:
# selecting columns: 'Selection', 'Value', 'Weight' are included by default
print("Columns in cat: ", cat.columns)
subcat = cat[['Position', 'Mass']]
print("Columns in subcat: ", subcat.columns)

Columns in cat:  ['Mass', 'Position', 'Selection', 'Type', 'Value', 'Velocity', 'Weight']
Columns in subcat:  ['Mass', 'Position', 'Selection', 'Value', 'Weight']


### adding redshift space distortions
Redshift-space distortions (RSD) are the mapping between real-space and redshift-space positions, caused by peculiar velocities along the line of sight. Here they are applied along the z axis.
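
Written out, the mapping applied in the cells below is the following, assuming the usual convention that distances are in $h^{-1}\,\mathrm{Mpc}$, velocities in km/s, and $H(z) = 100\,h\,E(z)$ km/s/Mpc. The symbols $\mathbf{r}$ (real-space position), $\mathbf{s}$ (redshift-space position), $\mathbf{v}$ (velocity), and $\hat{\mathbf{n}}$ (unit vector along the line of sight) are notation introduced here for illustration.

$$
\mathbf{s} = \mathbf{r} + \frac{\mathbf{v}\cdot\hat{\mathbf{n}}}{a\,H(z)}\,\hat{\mathbf{n}},
\qquad
\frac{1}{a\,H(z)} = \frac{1+z}{100\,E(z)}
$$

The second expression uses $a = 1/(1+z)$ and corresponds to `rsd_factor`, with `cosmo.efunc(redshift)` playing the role of $E(z)$. The projection $(\mathbf{v}\cdot\hat{\mathbf{n}})\,\hat{\mathbf{n}}$ is what `proj` computes.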

In [72]:
# copied nbodykit.transform.VectorProjection() since Jupyter claims no such
# function exists
def proj(vector, direction):
    direction = np.asarray(direction, dtype='f8')
    direction = direction / (direction ** 2).sum() ** 0.5
    projection = (vector * direction).sum(axis=-1)
    projection = projection[:, None] * direction[None, :]

    return projection

In [74]:
cat = UniformCatalog(nbar=5, BoxSize=1.0, seed=42)
print("Original pos:\n", cat['Position'].compute())

line_of_sight = [0,0,1]
cosmo = cosmology.Planck15
redshift = 0.55
# conversion factor from velocity to distance offset; log-normal catalogs already provide the scaled velocity as column 'VelocityOffset'
rsd_factor = (1+redshift) / (100*cosmo.efunc(redshift))
cat['Position'] = cat['Position'] + rsd_factor*proj(cat['Velocity'], line_of_sight)
print("RSD pos:\n", cat['Position'].compute())

Original pos:
 [[0.45470105 0.83263203 0.06905134]
 [0.31944725 0.48518719 0.29826163]
 [0.21242627 0.16674684 0.17622131]
 [0.31854524 0.34906766 0.99925086]
 [0.50668461 0.23705949 0.38925321]]
RSD pos:
 [[0.45470105 0.83263203 0.06913185]
 [0.31944725 0.48518719 0.29834118]
 [0.21242627 0.16674684 0.17625287]
 [0.31854524 0.34906766 0.99925115]
 [0.50668461 0.23705949 0.38936703]]


### converting sky to cartesian coords
Sky coordinates consist of two angles (right ascension, declination) and the redshift as radial coordinate; these are converted to cartesian coordinates.

To convert from redshift to comoving distance, a cosmology needs to be specified.

In [93]:
src = RandomCatalog(csize=100, seed=42)
# add random sky coords. random number generator (rng) automatically uses correct number of objects
src['z'] = src.rng.normal(loc=0.5, scale=0.1)
src['ra'] = src.rng.uniform(low=0, high=360)
src['dec'] = src.rng.uniform(low=-90, high=90) # declination is limited to [-90, 90] degrees

cosmo = cosmology.Planck15

# argument order is (ra, dec, redshift)
src['Position'] = transform.SkyToCartesian(src['ra'], src['dec'], src['z'],
                                          degrees=True, cosmo=cosmo)
# caution: swapping 'dec' and 'z' feeds negative values in as redshifts,
# which can give nan positions (this is not caused by the chosen cosmology)
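
The geometric part of this conversion can be sketched in plain numpy. The helper `sky_to_cartesian` below is hypothetical and only covers the angles: it assumes the comoving distance is already known, whereas the nbodykit transform obtains it from the redshift and the cosmology.

```python
import numpy as np

def sky_to_cartesian(ra_deg, dec_deg, distance):
    """Convert (ra, dec) in degrees plus a radial distance to x, y, z."""
    ra = np.deg2rad(ra_deg)
    dec = np.deg2rad(dec_deg)
    x = distance * np.cos(dec) * np.cos(ra)
    y = distance * np.cos(dec) * np.sin(ra)
    z = distance * np.sin(dec)
    return np.stack([x, y, z], axis=-1)

# three objects: along x, along y, and at the north pole
ra = np.array([0.0, 90.0, 0.0])
dec = np.array([0.0, 0.0, 90.0])
dist = np.array([1.0, 2.0, 3.0])
xyz = sky_to_cartesian(ra, dec, dist)
print(np.round(xyz, 6))
```

Since declination is measured from the equator, `sin(dec)` gives the height above the equatorial plane, and the length of each position vector equals the input distance.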

Note that ```UniformCatalog``` is a subclass of ```RandomCatalog``` that includes uniformly distributed columns for Position (between 0 and BoxSize) and Velocity (between 0 and 0.01 x BoxSize) with a particle number density nbar.

### Log-normal catalog
A more realistic approximation of cosmological large-scale structure. The catalog generates a set of objects by Poisson sampling a log-normal density field: the discrete galaxy positions are obtained by sampling the density field in each cell of the mesh. The desired number of galaxies in the box is set by the number density nbar. The Zel'dovich approximation is then used to model the dynamics of the sampled galaxies. The final positions, velocities, and velocity offsets are stored as columns.

```LogNormalCatalog``` requires a linear power spectrum function, redshift, and linear bias.

Log-Normal definition: If X is log-normally distributed, then ln(X) is normally distributed.
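
A toy version of the sampling step can be sketched with numpy alone. This is only an illustration under strong simplifications: the Gaussian field here is uncorrelated white noise with an assumed width `sigma`, whereas ```LogNormalCatalog``` builds the field from the linear power spectrum, and no velocities are generated. All variable names and numbers are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

n_cells = 32      # mesh cells per side (assumed)
box_size = 100.0  # box side length (assumed)
nbar = 0.01       # target number density of objects (assumed)
sigma = 0.5       # width of the Gaussian field (assumed)

# Gaussian field G on the mesh
gauss = rng.normal(0.0, sigma, size=(n_cells,) * 3)

# log-normal transform: 1 + delta = exp(G - sigma^2 / 2) is always positive,
# and subtracting sigma^2 / 2 makes its expectation value equal to 1
density = np.exp(gauss - 0.5 * sigma ** 2)

# expected number of objects per cell, then Poisson sample it
cell_volume = (box_size / n_cells) ** 3
expected = nbar * cell_volume * density
counts = rng.poisson(expected)

print(counts.sum(), nbar * box_size ** 3)
```

The total number of sampled objects scatters around `nbar * box_size**3`, and taking the logarithm of `density` recovers a normally distributed field, in line with the definition above.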

### Halo Occupation Distribution catalog
Takes a set of DM halos and populates them with galaxies according to the conditional probability $P(N|M)$ that a halo of mass $M$ hosts $N$ objects. Assumes that the galaxy-halo connection depends only on the halo mass. Galaxies in a halo are grouped into centrals and satellites.

See https://nbodykit.readthedocs.io/en/latest/catalogs/mock-data.html for more details and references to further reading. 
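
To give a feeling for what such a $P(N|M)$ looks like, the sketch below evaluates mean occupation numbers in the form commonly written for the Zheng et al. (2007) model: an error-function step for centrals and a power law for satellites. The function names are hypothetical, the default parameter values are illustrative assumptions, and the code is not taken from nbodykit; the linked documentation remains the reference for the actual model.

```python
import math

def mean_ncen(log_mass, logMmin=12.02, sigma_logM=0.26):
    """Mean number of centrals in a halo of mass 10**log_mass."""
    return 0.5 * (1.0 + math.erf((log_mass - logMmin) / sigma_logM))

def mean_nsat(log_mass, logM0=11.38, logM1=13.31, alpha=1.06, **cen_params):
    """Mean number of satellites, modulated by the central occupation."""
    mass = 10.0 ** log_mass
    m0 = 10.0 ** logM0
    if mass <= m0:
        return 0.0  # no satellites below the cutoff mass M0
    ncen = mean_ncen(log_mass, **cen_params)
    return ncen * ((mass - m0) / 10.0 ** logM1) ** alpha

for log_mass in (12.0, 13.0, 14.0, 15.0):
    print(log_mass, mean_ncen(log_mass), mean_nsat(log_mass))
```

Low-mass halos host at most a central galaxy, while the satellite count grows as a power law with slope `alpha` at high mass.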

In [3]:
# example of populating halos according to a specific form of P(N|M)
from nbodykit.lab import UniformCatalog, HaloCatalog, cosmology
from nbodykit.hod import Zheng07Model

# first make uniform particles in a box and interpret them as halos
cat = UniformCatalog(nbar=100, BoxSize=1.0, seed=42)
cat['Mass'] = 10**(cat.rng.uniform(12,15))
halos = HaloCatalog(cat, cosmo=cosmology.Planck15, redshift=0., mdef='vir', position='Position', velocity='Velocity', mass='Mass')
print("# of generated halos: ", len(halos))

In [31]:
# now populate according to Zheng's model from 2007 paper
# Can specify up to 5 parameters for the population process; see above linked docs 
hod = halos.populate(Zheng07Model, alpha=0.5, sigma_logM=0.4, seed=42)

print("Total # of generated galaxies: ", hod.size)
print("Available columns: ", hod.columns) # see above link for meaning
print("# of centrals: ", np.sum(hod['gal_type'].compute()==0))
print("# of satellites = ", hod.compute((hod['gal_type']==1).sum()))

Total # of generated galaxies:  282
Available columns:  ['Position', 'Selection', 'Value', 'Velocity', 'VelocityOffset', 'Weight', 'conc_NFWmodel', 'gal_type', 'halo_hostid', 'halo_id', 'halo_mvir', 'halo_num_centrals', 'halo_num_satellites', 'halo_rvir', 'halo_upid', 'halo_vx', 'halo_vy', 'halo_vz', 'halo_x', 'halo_y', 'halo_z', 'host_centric_distance', 'vx', 'vy', 'vz', 'x', 'y', 'z']
# of centrals:  91
# of satellites =  191




In [44]:
# can also repopulate halos. Can change parameters and/or seed
hod.repopulate(seed=84)
print("New # of galaxies: ", hod.size)
hod.repopulate(logM0=13.2)
print("New # of galaxies: ", hod.size)

New # of galaxies:  247
New # of galaxies:  244


