# Slicing the data
## 1. Introduction

The most important number in the COMPAS data is the seed. The seed represents the unique identifier to a specific system in a simulation. Therefore the properties of a single system can be recovered by looking at seeds in  different files.

Here we introduce the basics of manipulating the data using the seeds. We provide an example on how we get the initial parameters of systems that ended up forming double compact objects (DCOs).

Naively, we might try to use loops with conditions to extract systems of interest to a list. However, this can potentially be computationally expensive.

Here we present a method to more efficiently `slice` the data using boolean masks. These are slightly more involved but are computationally quick and use intuitive logic.

If you do not already have a ``COMPAS_Output.h5`` file ready, you can download some data from [compas.science](https://compas.science/)

***Note:* These cells may take a long time if you test them on large datasets.**

### 1.1 Paths


In [241]:
pathToData = '../COMPAS_Output.h5'

### 1.2 Imports

In [242]:
#python libraries
import numpy as np               # for handling arrays
import h5py as h5                # for reading the COMPAS data
import time                      # for finding computation time

In [243]:
Data  = h5.File(pathToData)
print(list(Data.keys()))


['BSE_Common_Envelopes', 'BSE_Double_Compact_Objects', 'BSE_Supernovae', 'BSE_System_Parameters']


The print statement shows the different files that are combined in your HDF5 file.

The system seed links information contained within (e.g.) the BSE_Supernovae file to information in the BSE_System_Parameters file.

## 2. Finding the initial total masses of the DCOs


In [244]:
def calculateTotalMassesNaive(pathData=None):
    Data  = h5.File(pathToData)
    
    totalMasses = []
    
    #for syntax see section 1 
    seedsDCOs     = Data['BSE_Double_Compact_Objects']['SEED'][()]
    
    #get info from ZAMS
    seedsSystems  = Data['BSE_System_Parameters']['SEED'][()]
    M1ZAMSs       = Data['BSE_System_Parameters']['Mass@ZAMS(1)'][()]
    M2ZAMSs       = Data['BSE_System_Parameters']['Mass@ZAMS(2)'][()]

    for seedDCO in seedsDCOs:
        for nrseed in range(len(seedsSystems)):
            seedSystem = seedsSystems[nrseed]
            if seedSystem == seedDCO:
                M1 = M1ZAMSs[nrseed]
                M2 = M2ZAMSs[nrseed]
                Mtot = M1 + M2
                totalMasses.append(Mtot)

    Data.close()
    return totalMasses

In [245]:
# calculate function run time
start   = time.time()
MtotOld = calculateTotalMassesNaive(pathData=pathToData)
end     = time.time()
timeDiffNaive = end-start

print('%s seconds, using for loops.' %(timeDiffNaive)) 


0.4106295108795166 seconds, using for loops.


## 3. Optimizing the above loop using built-in NumPy routines

`NumPy` is a comprehensive library of mathematical and array computing tools, underpinned by highly optimised C-code that provides fast vectorization, indexing, and broadcasting functions. Visit [numpy.org](https://numpy.org/) for detailed information regarding ``NumPy``.

The `NumPy` library provides tools that allow the user to bypass computationally heavy loops. For example, we can speed up the calculation of the element-wise sum of two arrays with:

In [246]:
M1ZAMS  = Data['BSE_System_Parameters']['Mass@ZAMS(1)'][()]
M2ZAMS  = Data['BSE_System_Parameters']['Mass@ZAMS(2)'][()]
    
mTotalAllSystems  = np.add(M1ZAMS, M2ZAMS)

## 4. Using boolean masks in a single file

There is a useful trick for when you want only those elements which satisfy a specific condition. Where previously we put the condition in an if statement nested within a for loop, now we will use an array of booleans to mask out the undesired elements.

The boolean array will have the same length as the input array, with:

In [247]:
# Create a boolean array from the total mass array which is True
# if the total mass of the corrresponding system is less than 40. 

maskMtot = (mTotalAllSystems <= 40)

Crucially, you can apply this mask to all other columns in the same file because, by construction, they all have the same length.

In [248]:
# seeds of systems with total mass below 40
seeds  = Data['SystemParameters']['SEED'][()]
seedsMtotBelow40 = seeds[maskMtot]

Note that this works because the order of the two columns (seeds and total masses) are the same. For example, the total mass of the third system entry corresponds to the seed at the third system entry.

## 5. Using seeds as masks between files

### 5.1 Example 1

Before we continue it is useful to understand how the COMPAS printing works.

Each simulated system will be initialized only once and so will have only one line in the `BSE_System_Parameters` file. However, lines in the `BSE_Common_Envelopes` file are created whenever a system goes through a common envelope (CE) event, which might happen multiple times for a single system, or potentially not at all. Similarly, in the `BSE_Supernovae` file, you will find at most two lines per system, but possibly none. `BSE_Double_Compact_Object` file lines are printed only when both remnants are either neutron stars or black holes (but disrupted systems are also included), which happens at most once per system.

For this reason, it is in general not the case that the system on line $n$ of one file corresponds to the system on line $n$ of another file.

In order to match systems across files, we need to extract the seeds of desired systems from one file, and apply them as a mask in the other file.

In [249]:
# example mock data from two files
SystemSeeds = np.array([1,  2,  3,  4 ])
SystemMass1 = np.array([1, 20,  5, 45 ])
DCOSeeds    = np.array([    2,      4 ])

# Calculate mask for which elements of SystemSeeds are found in DCOSeeds - see numpy.in1d documentation for details
mask = np.in1d(SystemSeeds, DCOSeeds)

print(mask)
print(SystemSeeds[mask])
print(SystemMass1[mask])


[False  True False  True]
[2 4]
[20 45]


### 5.2 Optimized loop

In [250]:
def calculateTotalMassesOptimized(pathData=None):
    Data  = h5.File(pathToData)
    
    totalMasses = []
    
    #for syntax see section 1 with basic syntax
    seedsDCOs     = Data['DoubleCompactObjects']['SEED'][()]
    #get info from ZAMS
    seedsSystems  = Data['BSE_System_Parameters']['SEED'][()]
    M1ZAMSs       = Data['BSE_System_Parameters']['Mass@ZAMS(1)'][()]
    M2ZAMSs       = Data['BSE_System_Parameters']['Mass@ZAMS(2)'][()]
    
    MZAMStotal    = np.add(M1ZAMS, M2ZAMS)
    
    maskSeedsBecameDCO  = np.in1d(seedsSystems, seedsDCOs)
    totalMassZAMSDCO    = MZAMStotal[maskSeedsBecameDCO]
    
    Data.close()
    return totalMassZAMSDCO

In [251]:
# calculate function run time
start   = time.time()
MtotNew = calculateTotalMassesNaive(pathData=pathToData)
end     = time.time()
timeDiffOptimized = end-start

# calculate number of Double Compact Objects
nrDCOs = len(Data['DoubleCompactObjects']['SEED'][()])

print('Compare')
print('%s seconds, using Optimizations.' %(timeDiffOptimized)) 
print('%s seconds, using For Loops.'     %(timeDiffNaive)) 
print('Using %s DCO systems'             %(nrDCOs))

Compare
0.3999161720275879 seconds, using Optimizations.
0.4106295108795166 seconds, using For Loops.
Using 697 DCO systems


*Note:* The time difference will depend on the number of systems under investigation, as well as the number of bypassed loops.

In [252]:
# test that the two arrays are in fact identical
print(np.array_equal(MtotOld, MtotNew))

True


Note that the above loop can easily be expanded with more conditions.

If you do not want all the DCO initial total masses but only of the double neutron stars, then you just need to apply another mask to the `seedsDCOs`.

In [253]:
def calculateTotalMassesDNS(pathToData=None):
    Data  = h5.File(pathToData)
    
    totalMasses = []
    
    #for syntax see section 1 with basic syntax
    seedsDCOs     = Data['BSE_Double_Compact_Objects']['SEED'][()]
    type1         = Data['BSE_Double_Compact_Objects']['Stellar_Type(1)'][()]
    type2         = Data['BSE_Double_Compact_Objects']['Stellar_Type(2)'][()]
    maskDNS       = (type1 == 13) & (type2 == 13)
    seedsDNS      = seedsDCOs[maskDNS]
    
    #get info from ZAMS
    seedsSystems  = Data['BSE_System_Parameters']['SEED'][()]
    M1ZAMSs       = Data['BSE_System_Parameters']['Mass@ZAMS(1)'][()]
    M2ZAMSs       = Data['BSE_System_Parameters']['Mass@ZAMS(2)'][()]
    
    MZAMStotal    = np.add(M1ZAMS, M2ZAMS)
    
    
    maskSeedsBecameDNS  = np.in1d(seedsSystems, seedsDNS)
    totalMassZAMSDNS    = MZAMStotal[maskSeedsBecameDNS]
    
    Data.close()
    return totalMassZAMSDNS


In [254]:
# calculate function run time
start   = time.time()
MtotDNS = calculateTotalMassesDNS(pathToData=pathToData)
end     = time.time()
timeDiffDNS = end-start

# calculate number of DNS systems
nrDNSs = len(MtotDNS)
    
print('%s seconds for all %s DNS systems.' %(timeDiffDNS, nrDNSs)) 


0.0014612674713134766 seconds for all 310 DNS systems.


### 5.3 Example 2

The previous example uses the fact that both ``BSE_System_Parameters`` and ``BSE_Double_Compact_Objects`` only contain at most one line per system. However, as mentioned above, events such as supernovae or common envelopes might happen multiple times to a given system, and as a result there would be multiple occurences of a given seed in the relevant file. 

To account for this, we will need to modify the previous method. Consider again the 4 seeds of the previous example. Both 2 and 4 formed a DCO, and hence both stars in these binaries went SN. Seeds 1 and 3 are low mass stars hence they did not go SN. (Note that we do not specify the companion masses for any of these systems, but for simplicity we assume that the companions to 1 and 3 are also sufficiently low mass to not produce a supernova). The ``BSE_Supernovae`` file contains one line per SN and therefore seeds 2 and 4 appear twice each.

Imagine you want the primary masses of systems that experienced at any point a core collapse supernova (CCSN). We'll reuse our mock data, with additional information about the types of SN which occured in each star. Here, PPISN refers to Pulsational Pair Instability Supernovae.

In [255]:
# example mock data from above
SystemSeeds = np.array([1,  2,  3,  4 ])
SystemMass1 = np.array([1, 20,  5, 45 ])
DCOSeeds    = np.array([    2,      4 ])

SNSeeds     = np.array([     2,      2,      4,       4 ])  
SNTypes     = np.array(['CCSN', 'CCSN', 'CCSN', 'PPISN' ])

# get seeds which had a CCSN
maskCCSN  = SNTypes == 'CCSN'
seedsCCSN = SNSeeds[maskCCSN]
print('CCSN seeds =%s' %(seedsCCSN))

#compare which element of 1-d array are in other
#this because in 

seedsCCSN = np.unique(seedsCCSN)
# in this particular case, it is not necessary to reduce seedsCCSN to it's unique entries.
# the numpy.in1d function will work with duplicate seeds, but we include it explicitly here
# as other more complicated scenarios might rely on unique sets of seeds

mask = np.in1d(SystemSeeds, seedsCCSN)
print(SystemMass1[mask])


CCSN seeds =[2 2 4]
[20 45]


In [256]:
# Always remember to close your data file
Data.close()