# Python implementation of L-Galaxies

This is a playground for testing whether Python can be used as an interface to L-Galaxies.

In [None]:
# Imports
import astropy.constants as c
import astropy.units as u
import gc
import h5py
h5py.enable_ipython_completer()
import numpy as np
import yaml
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('notebook')
sns.set_style('whitegrid')

In [None]:
# Development limiter
nHaloMax=10000

# Debug/testing switch
debugFlag=True

# Verbosity
verbosity=1 # 0 - Major program steps only; 1/2 - Major/minor Counters; 3 - Debugging diags.

# Script parameters
file_parameters='input/input.yml'
displayParameters=True

In [None]:
# Read in parameters, graph file, etc.

# Use a context manager so that the parameter file is closed after reading.
with open(file_parameters) as f:
    parameters=yaml.load(f,Loader=yaml.FullLoader)
if displayParameters:
    for item in parameters:
        print("{:20s}: {}".format(item,parameters[item]))

graphFile=parameters['inputFiles']['graphFile']
graphData=h5py.File(graphFile,'r')

## Data structure for halos

This is an interesting problem.  We have many requirements:
- Must be fast – does this remove the possibility of using objects?
- Must be flexible enough to respond to parameter choices (ideally at run time).
- Must allow for variable-length arrays – I think that each halo will individually need to track what fraction of material it inherits from each progenitor.

I think that the object-oriented way of doing it, as below, can easily adapt to run-time choices because it does not use long arrays.

I have initially coded it using:
- a list of graphs;
- each graph is a dictionary of snapshots;
- each snapshot is a dictionary of halos;
- each halo is an instance of the `haloProperties` class.  
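
The nesting listed above can be sketched as follows. This is a minimal stand-alone illustration: the class `HaloSketch` and all the IDs are made up for the example and are not read from a graph file.

```python
class HaloSketch:
    """Hypothetical stand-in for the haloProperties class."""
    def __init__(self, graphID, snapID, haloID):
        self.graphID = graphID
        self.snapID = snapID
        self.haloID = haloID

# Made-up IDs: one graph with two snapshots holding two halos and one halo.
layout = {'0': {'10': ['0', '1'], '11': ['2']}}

graphs = []  # a list of graphs
for graphID, snaps in layout.items():
    graph = {}  # each graph is a dictionary of snapshots
    for snapID, haloIDs in snaps.items():
        # each snapshot is a dictionary of halos, keyed by haloID
        graph[snapID] = {haloID: HaloSketch(graphID, snapID, haloID)
                         for haloID in haloIDs}
    graphs.append(graph)

# Attributes can be attached to an individual halo at run time.
graphs[0]['11']['2'].mass = 1.0e12
print(sorted(graphs[0]), graphs[0]['11']['2'].mass)
```

Only the halo that was assigned a `mass` carries that attribute, which is what makes this layout flexible but also unstructured.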

NumPy has an `object` dtype that would allow this structure to be held in NumPy arrays, but I don't know whether that offers any performance advantage or disadvantage.  The objects held in such an array can have arbitrary data added to them, but again I don't know whether this flexibility makes them very slow (due to having to continually shift things around in memory).
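
For reference, this is roughly what an object array looks like in use. It is a minimal sketch with a made-up `HaloSketch` class and does not settle the performance question.

```python
import numpy as np

class HaloSketch:
    """Hypothetical stand-in for the haloProperties class."""
    def __init__(self, haloID):
        self.haloID = haloID

# An object array stores references to ordinary Python objects.
halos = np.empty(3, dtype=object)
for i in range(halos.size):
    halos[i] = HaloSketch(str(i))

# Elements are still normal instances, so attributes can be added freely.
halos[1].mass = 2.5
masses = [getattr(h, 'mass', 0.) for h in halos]
print(halos.dtype, masses)
```

Because each element is a reference rather than inline data, the halo properties themselves are not stored contiguously; whether that matters here would have to be measured.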

This is all very far from the current method that we have in L-Galaxies of defining a galaxy structure at compile time.
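
One possible middle ground, sketched here rather than proposed as the design, is to fix the list of halo properties once at start-up from the run-time parameters, using `__slots__`. The function `makeHaloClass` and the property list are assumptions for this illustration; only the `HOD` switch mirrors the parameter file.

```python
def makeHaloClass(modelSwitches):
    """Build a halo class whose attributes are fixed once, at start-up."""
    names = ['graphID', 'snapID', 'haloID', 'mass', 'massBaryon']
    if modelSwitches.get('HOD', False):
        names.append('massStars')

    def __init__(self, graphID, snapID, haloID):
        self.graphID = graphID
        self.snapID = snapID
        self.haloID = haloID
        # Every remaining slot is given a numeric default.
        for name in names[3:]:
            setattr(self, name, 0.)

    # type() creates the class dynamically; __slots__ forbids other attributes.
    return type('HaloSketch', (object,),
                {'__slots__': tuple(names), '__init__': __init__})

HaloSketch = makeHaloClass({'HOD': True})
halo = HaloSketch('0', '10', '3')
halo.massStars = 1.0e9      # allowed: declared because HOD is switched on
try:
    halo.colour = 'red'     # not declared, so this raises AttributeError
except AttributeError:
    print('undeclared property rejected')
```

This keeps the choice of properties at run time while giving every halo the same fixed layout, which is a little closer to the compile-time structure.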

## Data structure in graph HDF5 files

Each graph is a group in the root `['/']` with labels `graphID` that appear to be non-consecutive: `group size=918`; `max(graphID)=1000`.  However, this could be because the file holds only a subset of all graphs; in general we may be working with subsets, so don't assume that the labels are consecutive.

Each snapshot is a group in `['/<graphID>']` with labels `snapID` running over all snapshots.

Each halo is a group in `['/<graphID>/<snapID>']` with labels `haloID` that appear to run consecutively from 0, with `haloID` increasing with `snapID`.
<b>Note: they are in increasing string order, not numerical order.  This is a feature that needs fixing or it will break the code at some point.</b><br>
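
The ordering problem is easy to reproduce with plain strings. A minimal sketch with made-up labels, assuming that every label is the decimal string of an integer:

```python
# Made-up haloID labels, sorted as strings.
labels = sorted(['0', '1', '2', '10', '11', '100'])
print(labels)   # ['0', '1', '10', '100', '11', '2']

# Sorting on the integer value restores numerical order.
numeric = sorted(labels, key=int)
print(numeric)  # ['0', '1', '2', '10', '11', '100']
```
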
Halos have attributes:
* `catalogID` – ID in the original halo catalogue
* `concentration` – ?
* `halo_mass` – mass in Msun
* `halo_nPart` – number of particles

and datasets:
* `centre_of_mass` – `np.float64[3]` array of centroid (?) in cMpc (?)
* `desc_haloIDs` – `np.int64[<variable>]` array of `haloID`s of descendants
* `desc_mass_contribution` – `np.int64[<variable>]` array of number of particles in common with each descendant (? – may contain non-associated particles)
* `halo_velocity` – `np.float64[3]` array of mean velocity in km/s (?)
* `prog_haloIDs` – `np.int64[<variable>]` array of `haloID`s of progenitors
* `prog_mass_contribution` – `np.int64[<variable>]` array of number of particles in common with each progenitor (? – may contain non-associated particles)

Presumably we can add more properties, as desired.
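
If `desc_mass_contribution` does hold the particle counts shared with each descendant, then the fraction of a halo passed to each descendant follows by normalisation. A minimal sketch with made-up numbers; the helper `descendantFractions` is hypothetical, and normalising by the sum of the contributions rather than by `halo_nPart` is an assumption (the two differ if some particles go to no descendant).

```python
def descendantFractions(contributions):
    """Return the fraction of the shared particles going to each descendant."""
    total = sum(contributions)
    if total == 0:
        # No descendants (or no shared particles): nothing to distribute.
        return [0. for _ in contributions]
    return [n / total for n in contributions]

# Made-up example: a halo sharing 60, 30 and 10 particles with 3 descendants.
fractions = descendantFractions([60, 30, 10])
print(fractions)
```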

In [None]:
# This haloProperties class is just a container for all the halo properties.
# It is not expected that it should have any sophisticated methods.
# The constructor merely provides the unique labels for each halo;
# other properties may then be added.
# If this dynamic variable allocation is too slow, we could presumably find a way
# to declare all the variables that we will need at the time of construction.
class haloProperties:
    # Constructor.
    def __init__(self,graphID,snapID,haloID):
        # These are (HDF5) strings
        self.graphID=graphID
        self.snapID=snapID
        self.haloID=haloID
        # Other halo properties.  Is it best to initialise to particular type or to None?
        self.done=False
        self.catalogID=-1
        self.mass=0.
        self.massBaryon=0.
        self.mass_fromProgenitors=0.
        self.massBaryon_fromProgenitors=0.
        if parameters['modelSwitches']['HOD']: 
            self.massStars=0.
            self.massStars_fromProgenitors=0.
        # Not clear whether it is better to store information here about
        # progenitors and descendants, which might result in variable-length
        # data structure, or to look up from the HDF5 data file as required,
        # which might require multiple access.
        # If we want to specify the data size in advance, then we need to
        # decide upon a maximum number of descendants to keep.  (I suspect
        # that keeping as few as 3 descendants may be enough.)

## Functions

Most of the work will be done in external routines, probably coded in C for efficiency.

Here we just include the high-level driver routines.

In [None]:
# Processing halos 
def processHalo(halo):
    if verbosity>=2: print('Processing halo ',halo.haloID)
    if halo.done:
        # Print the halo label rather than the default object representation.
        print('Warning: processHalo: halo ',halo.haloID,' already processed.')
        return
    readProperties(halo)
    gatherProgenitors(halo)
#     fixBaryonFraction(halo)
#     fixStellarFraction(halo) # Dummy routine.
#     outputHalo(halo)         # Is this the right place to do this?
    halo.done=True
    
def readProperties(halo):
    # Reads halo properties from the input graph file
    halo.catalogID=graphData[halo.graphID][halo.snapID][halo.haloID].attrs.get('catalogID')
    halo.mass=graphData[halo.graphID][halo.snapID][halo.haloID].attrs.get('halo_mass')
    return

def gatherProgenitors(halo):
    # Collects information about material inherited from progenitors
    for prog_haloID in graphData[halo.graphID][halo.snapID][halo.haloID]['prog_haloIDs']:
        # Position in progenitor (lastSnap) halo list
        if debugFlag and verbosity>=3 : print('prog_haloID =',prog_haloID)
        prog_index_lastSnap=int(prog_haloID)-int(haloProperties_lastSnap[0].haloID) # Move out of loop?
        if debugFlag and verbosity>=3: print('prog_index_lastSnap =',prog_index_lastSnap)
        # Check halo association
        # This is needed because HDF5 may store halos in a different order - 
        # if this happens, will need to add code to do a search over all halos
        # in lastSnap.
        if debugFlag:
            assert int(haloProperties_lastSnap[prog_index_lastSnap].haloID) == int(prog_haloID)
        # I am going to get Will to put in the fraction of each halo that goes to each
        # descendant.  For now work it out.  Will need the loop over descendants anyway.
        # Actually, that's not true if we had a dictionary, but I don't think that HDF5
        # can handle dictionaries.
        descMassSum=0.
        descMass=0.
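        # Loop over the descendants of this progenitor, read from the graph file.
        prog=haloProperties_lastSnap[prog_index_lastSnap]
        for desc_haloID in graphData[prog.graphID][prog.snapID][prog.haloID]['desc_haloIDs']:
            # TODO: Do something.  Progenitor may only give part of its mass to me.
            pass
    return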

## Main routine

In [None]:
# Iteratively loop over halos, doing whatever processing is required.
# This assumes that halo properties depend only upon those halos in 
# their immediate past in the merger graph.

# Note: no attempt here to include sub-halos.  Let's get halos right
# first!

# Loop over MergerGraphs.
nHalo=0
for graphID in graphData['/']:
    if verbosity>=2: print('Processing graph',graphID)
    graph=graphData[graphID]
    # Loop over snapshots from first to last.
    haloProperties_lastSnap = None
#    for snap in snaps:
    for snapID in graph:
        if verbosity>=2: print('        snapshot',snapID)
        snap=graph[snapID]
        # Initialise halo properties
        haloProperties_thisSnap=[haloProperties(graphID,snapID,haloID) for haloID in snap]
        # Loop over halos in snapshot.
        for halo in haloProperties_thisSnap:
            processHalo(halo)
            # Count individual halos, not snapshots.
            nHalo+=1
            if verbosity>=1 and nHalo%1000==0: print('Processed {:d} halos'.format(nHalo))
        # Once all halos have been done, update reference to lastSnap (and hence free memory)
        del haloProperties_lastSnap
        #gc.collect() # garbage collection -- safe but very slow.
        haloProperties_lastSnap=haloProperties_thisSnap
        # Delete reference to this memory to free variable for new use.
        del haloProperties_thisSnap
        # Temporary halt to limit to finite time (checked once per snapshot)
        assert nHalo<nHaloMax, 'Development limit nHaloMax reached'
#del haloProperties_lastSnap

In [None]:
prog_haloID_lastSnap

In [None]:
prog_haloID