# Introduction


The most important number in the COMPAS data is the seed. The seed is the unique identifier of a specific system in a simulation. The properties of a single system can therefore be recovered by looking up its seed in the different types of files.

Here we introduce the basics of manipulating the data using the seeds, for example how to get the initial parameters of systems that ended up forming double compact objects.

When starting out with python, we most often use 'for loops' and add the systems of interest to a list. However, such loops can take a long time.
Here we present how to 'slice' the data more efficiently using boolean masks. These are slightly more demanding to learn, but they are quick and use intuitive logic.


We assume you already have an h5file with data. If not, see section 1 for how to create the h5file from the csv data of your simulation, or download some data from compas.science.

# Careful: these cells show examples that take a long time if you test them on large data.



# Path to be set by user


In [1]:
pathToData = '/home/cneijssel/Desktop/Test/COMPAS_output.h5'

# Imports

In [2]:
#python libraries
import numpy as np               #for handling arrays
import h5py as h5                #for reading the COMPAS data
import time                      #for timing the computation

In [3]:
Data  = h5.File(pathToData)
print(Data.keys())
nrSystems = len(Data['SystemParameters']['SEED'][()])
Data.close()

<KeysViewHDF5 ['CommonEnvelopes', 'DoubleCompactObjects', 'Supernovae', 'SystemParameters']>


The print statement shows the different types of files that are combined in your h5file.
The seed is the number that links the information in, say, Supernovae to the information in SystemParameters.

### Question: What were the initial total masses of the double compact objects?

### The classic way when starting with python

In [4]:
def returnTotalMasses(pathData=None):
    Data  = h5.File(pathData, 'r')   #use the argument, not the global path
    
    totalMasses = []
    
    #for syntax see section 1 with basic syntax
    seedsDCOs     = Data['DoubleCompactObjects']['SEED'][()]
    
    #get info from ZAMS
    seedsSystems  = Data['SystemParameters']['SEED'][()]
    M1ZAMSs       = Data['SystemParameters']['Mass@ZAMS_1'][()]
    M2ZAMSs       = Data['SystemParameters']['Mass@ZAMS_2'][()]

    
    
    for seedDCO in seedsDCOs:
        for nrseed in range(len(seedsSystems)):
            seedSystem = seedsSystems[nrseed]
            if seedSystem == seedDCO:
                M1 = M1ZAMSs[nrseed]
                M2 = M2ZAMSs[nrseed]
                Mtot = M1+M2
                totalMasses.append(Mtot)
    Data.close()
    return totalMasses

In [5]:
start   = time.time()
MtotOld = returnTotalMasses(pathData=pathToData)
end     = time.time()
print(end - start, 'seconds for %s systems' %(nrSystems)) 

118.17710280418396 seconds for 300000 systems


# Steps of optimising the above loop

## 1 - using boolean masks in one file

An array and a list are both sequences of values.
However, when you work with arrays you can use numpy to do some optimised tricks,
for example adding the entries of two arrays element by element.

In [6]:
Data  = h5.File(pathToData)

M1ZAMS  = Data['SystemParameters']['Mass@ZAMS_1'][()]
M2ZAMS  = Data['SystemParameters']['Mass@ZAMS_2'][()]
Mtotal  = np.add(M1ZAMS, M2ZAMS)



A useful trick is selecting elements based on a condition.
Where in the classic way we put the condition in a for loop with an if statement,
we will now work with an array of booleans, a so-called mask.

In [7]:
#mask which gives total masses below or equal to 40
maskMtot = Mtotal <=40
#apply mask to get the masses
MtotalBelow40 = Mtotal[maskMtot]

The crucial trick is that you can apply this mask to other columns in the same file,
as long as the mask has the same length as the column you apply it to.


In [8]:
# seeds of systems with total masses below 40
seeds  = Data['SystemParameters']['SEED'][()]
seedsMtotBelow40 = seeds[maskMtot]

Data.close()

Note that this works because the order of the two columns (seeds and total masses) is
the same. In other words, the total mass at the third entry corresponds to the seed at the third entry.
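
As a small sketch of this point, the block below applies one mask to several columns of hypothetical data (the seeds and masses are invented for illustration) and shows what happens when the mask length does not match the column. It assumes a reasonably recent numpy, which raises an `IndexError` in that case.

```python
import numpy as np

# hypothetical columns from one file; row i describes the same system in each
seeds = np.array([101, 102, 103, 104])
m1    = np.array([10.0, 30.0, 15.0, 45.0])
m2    = np.array([ 5.0, 20.0, 10.0, 40.0])

mTot = m1 + m2
mask = mTot <= 40            # one boolean per row

print(mask)
print(seeds[mask])           # seeds of the selected rows
print(m1[mask])              # primary masses of the same rows

# a mask of the wrong length cannot be applied
mismatchRejected = False
try:
    seeds[:3][mask]          # 3 entries, but the mask has 4
except IndexError:
    mismatchRejected = True
print(mismatchRejected)
```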

## 2 - using seeds as mask between files

## example 1
Before we continue it is useful to realise how the COMPAS-popsynth printing works.
Every time COMPAS simulates a system, the output is printed to the different files.
Hence if you have four systems with seeds 1,2,3,4 then COMPAS will evolve seed 1, print the output, and then continue to evolve seed 2, etc.

Therefore in the SystemParameters file, which prints a single line for every system, you will find seeds 1,2,3,4 printed in order. The other files generally do not contain every seed.
Take the double compact objects file: it also prints a single line per system, but not all systems form double compact objects. However, the systems that do appear are still printed in the same order, hence a simple
mask will link the parameters without you having to sort them.


Imagine only seeds 2 and 4 formed a double compact object (DCO) and you want to recover the initial primary mass of those systems. Let's create a simple example outside the COMPAS data.



In [9]:
#small example mock data
SystemSeeds = np.array([1 ,2 ,3 ,4 ])
SystemMass1 = np.array([10,20,15,45])
DCOSeeds    = np.array([   2 ,   4 ])

#check which elements of one 1-d array are in the other
mask = np.in1d(SystemSeeds, DCOSeeds)
print(mask)
print(SystemSeeds[mask])
print(SystemMass1[mask])


[False  True False  True]
[2 4]
[20 45]
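
The mask only pairs up rows correctly if both files list the seeds in the same order. A quick check, sketched here on the mock data (re-declared so the block runs by itself), is to compare the masked seeds with the DCO seeds. The sketch uses `np.isin`, which numpy recommends over `np.in1d` for new code and which gives the same result for 1-d arrays. The shuffled array is a hypothetical case added purely for illustration.

```python
import numpy as np

SystemSeeds = np.array([1, 2, 3, 4])
DCOSeeds    = np.array([2, 4])

mask = np.isin(SystemSeeds, DCOSeeds)

# True only if the selected rows line up one-to-one with the DCO rows
sameOrder = np.array_equal(SystemSeeds[mask], DCOSeeds)
print(sameOrder)

# hypothetical case: the same seeds in a different order give the same mask,
# but row i of the masked systems no longer matches row i of the DCOs
DCOSeedsShuffled  = np.array([4, 2])
maskShuffled      = np.isin(SystemSeeds, DCOSeedsShuffled)
sameOrderShuffled = np.array_equal(SystemSeeds[maskShuffled], DCOSeedsShuffled)
print(sameOrderShuffled)
```

If the check fails, the two sets of rows need to be sorted by seed before columns from the two files are compared entry by entry.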


# Optimised loop

In [10]:
def returnTotalMasses2(pathData=None):
    Data  = h5.File(pathData, 'r')   #use the argument, not the global path
    
    #for syntax see section 1 with basic syntax
    seedsDCOs     = Data['DoubleCompactObjects']['SEED'][()]
    #get info from ZAMS
    seedsSystems  = Data['SystemParameters']['SEED'][()]
    M1ZAMSs       = Data['SystemParameters']['Mass@ZAMS_1'][()]
    M2ZAMSs       = Data['SystemParameters']['Mass@ZAMS_2'][()]
    
    #total mass at ZAMS, using the arrays read inside this function
    MZAMStotal    = np.add(M1ZAMSs, M2ZAMSs)
    
    #True for every system whose seed appears in the DCO file
    maskSeedsBecameDCO  = np.in1d(seedsSystems, seedsDCOs)
    totalMassZAMSDCO    = MZAMStotal[maskSeedsBecameDCO]
    
    Data.close()
    return totalMassZAMSDCO

In [11]:
start   = time.time()
MtotNew = returnTotalMasses2(pathData=pathToData)
end     = time.time()
print(end - start, 'seconds for %s systems' %(nrSystems)) 

0.04909825325012207 seconds for 300000 systems


In [12]:
# test if I was lying (need to turn list into array)
print(np.array_equal(np.array(MtotOld), MtotNew))

True


Note that the above function can easily be expanded with more conditions.
If you do not want the initial total masses of all the DCOs but only of the double neutron stars, then you just need to reduce seedsDCOs to the seeds of systems that became a double neutron star.

In [13]:
def returnTotalMassesDNS(pathData=None):
    Data  = h5.File(pathData, 'r')   #use the argument, not the global path
    
    #for syntax see section 1 with basic syntax
    seedsDCOs     = Data['DoubleCompactObjects']['SEED'][()]
    type1         = Data['DoubleCompactObjects']['Stellar_Type_1'][()]
    type2         = Data['DoubleCompactObjects']['Stellar_Type_2'][()]
    #keep only DCOs in which both stars have stellar type 13
    maskDNS       = (type1 == 13) & (type2 == 13)
    seedsDNS      = seedsDCOs[maskDNS]
    
    #get info from ZAMS
    seedsSystems  = Data['SystemParameters']['SEED'][()]
    M1ZAMSs       = Data['SystemParameters']['Mass@ZAMS_1'][()]
    M2ZAMSs       = Data['SystemParameters']['Mass@ZAMS_2'][()]
    
    #total mass at ZAMS, using the arrays read inside this function
    MZAMStotal    = np.add(M1ZAMSs, M2ZAMSs)
    
    #True for every system whose seed appears among the DNS seeds
    maskSeedsBecameDNS  = np.in1d(seedsSystems, seedsDNS)
    totalMassZAMSDNS    = MZAMStotal[maskSeedsBecameDNS]
    
    Data.close()
    return totalMassZAMSDNS
#returnTotalMassesDNS(pathData=pathToData)
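
The same two-step logic (first a mask within the DCO file, then a seed mask between files) can be tried without an h5file. The sketch below uses invented mock data. Stellar type 13 is taken from the function above, while the value 14 is only a stand-in for "any other type" and is an assumption of this example. `np.isin` is used in place of `np.in1d`; for 1-d arrays the result is the same.

```python
import numpy as np

# mock SystemParameters columns
SystemSeeds = np.array([1, 2, 3, 4, 5])
SystemMass1 = np.array([10, 20, 15, 45, 12])
SystemMass2 = np.array([ 8, 15, 10, 40,  9])

# mock DoubleCompactObjects columns (one line per DCO)
DCOSeeds = np.array([2, 4, 5])
type1    = np.array([13, 14, 13])
type2    = np.array([13, 14, 14])

# step 1: mask within the DCO file, both conditions must hold
maskDNS  = (type1 == 13) & (type2 == 13)
seedsDNS = DCOSeeds[maskDNS]

# step 2: seed mask between the files
maskSystemsDNS = np.isin(SystemSeeds, seedsDNS)
MtotDNS        = (SystemMass1 + SystemMass2)[maskSystemsDNS]

print(seedsDNS)
print(MtotDNS)
```

Note the brackets around each condition: `&` binds more tightly than `==`, so leaving them out changes the meaning of the expression.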

## example 2

The previous example relies on the fact that each file prints at most one line per system.
However, there are events such as supernovae (SNe) for which a system can be printed 0, 1 or 2 times, depending on the masses of the stars.

The aforementioned method then requires some additional steps. Imagine again the 4 seeds of the previous example. Both 2 and 4 formed a DCO and hence both stars in both binaries went SN. Seeds 1 and 3 are low-mass stars, hence they did not go SN. The SN file prints one line per SN and therefore seeds 2 and 4 appear twice, but still in order.

Imagine you want the primary masses of systems that at any point experienced a core collapse supernova (ccSN); the other type in the mock data, PPISN, is a pulsational pair instability SN. Let's reuse our mock data.


In [14]:
#small example
SystemSeeds = np.array([1 ,2 ,3 ,4 ])
SystemMass1 = np.array([10,20,15,45])
DCOSeeds    = np.array([2 ,4 ])
SNSeeds     = np.array([2 ,2 ,4, 4 ])
SNTypes     = np.array(['ccSN' ,'ccSN' ,'PPISN', 'ccSN'])

#get seeds which had a ccSN
maskCCSN  = SNTypes == 'ccSN'
seedsCCSN = SNSeeds[maskCCSN]
print('ccSN seeds =%s' %(seedsCCSN))
#in principle you could directly compare to seedsCCSN
#but I prefer to always be explicit and compare unique seeds,
#because in more complicated scenarios duplicate seeds might sometimes mess up your slicing
seedsCCSN = np.unique(seedsCCSN)
#check which elements of one 1-d array are in the other
mask = np.in1d(SystemSeeds, seedsCCSN)
print(SystemMass1[mask])

ccSN seeds =[2 2 4]
[20 45]
