# Introduction

COMPAS simulations can produce very large data files,
while the data you actually need to reproduce your results may
be small. It can therefore make sense to reduce the number
of files and columns based on some criteria.

Here we show how you can reduce your data.
The main things you need are:
    
1 - The seeds you want to have in your data

2 - The files you want in your data

3 - The columns (parameters) you want for each file

The plain Python script that does this is '$COMPAS_ROOT_DIR/postProcessing/Folders/H5/rewrite_H5.py'. Here we just show an example of how to call the script to reduce the data.

# Paths needed

In [None]:
# Set the appropriate paths to the input and output data files

pathToDataInput  = '/home/cneijssel/Desktop/Test/COMPAS_output.h5'         
pathToDataOutput = '/home/cneijssel/Desktop/Test/COMPAS_output_reduced.h5' 

# Imports

In [None]:
import h5py  as h5  # For handling data format
import sys

# Import the rewrite_H5.py script
# (adjust the path below to the folder that contains rewrite_H5.py)
sys.path.append('PythonScripts/')
import rewrite_H5

# 1 Load the data

In [None]:
# Open the input read-only so the original data cannot be modified by accident
Data  = h5.File(pathToDataInput, 'r')
print("The main files I have at my disposal are:\n",list(Data.keys()))

In [None]:
# To see the parameter choices in each file, use, e.g.:
#print(list(Data['SystemParameters']))
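To list the columns of every file in one go, a small helper along the following lines may be convenient. This is a minimal sketch, not part of rewrite_H5.py: the name `listColumns` is hypothetical, and the code assumes every top-level entry in the h5 file is a group of columns. A tiny stand-in file is written first so that the example runs without a real COMPAS output.

```python
import os
import tempfile

import h5py as h5
import numpy as np


def listColumns(pathToH5):
    """Return {fileName: sorted list of column names} for each group (hypothetical helper)."""
    with h5.File(pathToH5, 'r') as data:
        return {name: sorted(data[name].keys()) for name in data.keys()}


# Build a tiny stand-in file (toy data only, not real COMPAS output)
demoPath = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with h5.File(demoPath, 'w') as demo:
    group = demo.create_group('SystemParameters')
    group.create_dataset('SEED', data=np.array([1, 2, 3]))
    group.create_dataset('Eccentricity', data=np.array([0.0, 0.1, 0.2]))

columnsPerFile = listColumns(demoPath)
for fileName, columns in columnsPerFile.items():
    print(fileName, columns)
```

Replacing `demoPath` with `pathToDataInput` should give the same overview for your own data.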

# 2 Specify which files and columns you want

We use dictionaries to specifically link all the entries.

The filesOfInterest dictionary should contain all files which hold any relevant data. The columnsOfInterest dictionary specifies the parameters in each file that you want to be included in the new output h5. Any filters or masks should be used to determine the seedsOfInterest (on a per file basis), and so do not need to be included in the columnsOfInterest.
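Since the three dictionaries are linked through their keys, it can help to check that the keys match before creating the new file. The sketch below assumes that entries are matched by key; `checkDictionaries` is a hypothetical helper, not a function from rewrite_H5.py, and the entries are toy values.

```python
def checkDictionaries(dictFiles, dictColumns, dictSeeds):
    """Return True if the three dictionaries share exactly the same keys."""
    keys = set(dictFiles)
    mismatched = [name
                  for name, d in (('columns', dictColumns), ('seeds', dictSeeds))
                  if set(d) != keys]
    if mismatched:
        raise ValueError('Keys differ from the files dictionary in: '
                         + ', '.join(mismatched))
    return True


# Toy entries, for illustration only
files   = {1: 'SystemParameters', 2: 'Supernovae'}
columns = {1: ['All'],            2: ['SEED', 'Eccentricity']}
seeds   = {1: [1, 2, 3],          2: [2, 3]}

keysMatch = checkDictionaries(files, columns, seeds)
print(keysMatch)
```

This only checks the keys. Whether each key points at the intended file, columns and seeds still has to be checked by eye.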

### Hypothetical Example

Suppose you are studying double neutron star (DNS) systems, and you want to know the initial parameters of both components. Suppose you are also, separately, curious about the eccentricity of systems after a supernova that leaves the binary intact, and you want to use the same COMPAS run to save on CPU hours.

To be safe, you should probably keep the entire SystemParameters file, which contains all of the initial system settings. 

To get information about only Double Neutron Stars, you will need to create a mask for them from the DoubleCompactObjects file.

Information on post-SN eccentricity and whether or not the system disrupted is found in the Supernovae file. 

You will not need any other files. You will also want to keep the 'SEED' column from each file, since the seed is the unique identifier of each binary.

In [None]:
# Which files do you want?
filesOfInterest   = {1:'SystemParameters',\
                     2:'DoubleCompactObjects',\
                     3:'Supernovae'}

# Give a list of columns you want, if you want all, say ['All']
columnsOfInterest = {1:['All'],\
                     2:['All'],\
                     3:['SEED', 'Eccentricity']}

# The seedsOfInterest are a little more involved

# 3 Which seeds do I want per file?

In [None]:
### Do not filter out any systems/seeds from SystemParameters

seedsSystems = Data['SystemParameters']['SEED'][()]



### Of all the double compact objects, keep only the DNSs

DCOs = Data['DoubleCompactObjects']
seedsDCOs       =  DCOs['SEED'][()]

typePrimary     =  DCOs['Stellar_Type_1'][()]
typeSecondary   =  DCOs['Stellar_Type_2'][()]
DNSs            =  (typePrimary == 13) & (typeSecondary == 13)

seedsDNS        =  seedsDCOs[DNSs]



### Filter out disrupted systems

SNe  = Data['Supernovae']
seedsSNe     = SNe['SEED'][()]   # [()] reads the column into a numpy array, as for the other files


isUnbound    = SNe['Unbound'][()] 
intact       = (isUnbound == False)

seedsIntact  = seedsSNe[intact]



### Create seedsOfInterest dictionary -- DOUBLE CHECK ORDER :) 

seedsOfInterest   = {1:seedsSystems,\
                     2:seedsDNS,\
                     3:seedsIntact}


# Don't forget to close the original h5 data file
Data.close()
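The boolean masks used to pick the seeds can be tried out on toy arrays first. The sketch below uses made-up seeds and stellar types; only the value 13 for neutron stars is taken from the cell above.

```python
import numpy as np

# Made-up values, for illustration only
seeds         = np.array([101, 102, 103, 104])
typePrimary   = np.array([13, 14, 13, 13])
typeSecondary = np.array([13, 13, 14, 13])

# True only where both components are neutron stars (type 13)
isDNS = (typePrimary == 13) & (typeSecondary == 13)

# Indexing with the mask keeps the seeds at the True positions
seedsDNS = seeds[isDNS]
print(seedsDNS)
```

The mask has one entry per row of the file it was built from, so it should only be applied to columns of that same file.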

# 4 Call the function which creates the h5 file

In [None]:
rewrite_H5.reduceH5(pathToOld = pathToDataInput, pathToNew = pathToDataOutput,\
                     dictFiles=filesOfInterest, dictColumns=columnsOfInterest, dictSeeds=seedsOfInterest)

In [None]:
rewrite_H5.printAllColumnsInH5(pathToDataOutput)
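As a further sanity check, you could compare the number of rows per file before and after the reduction. The sketch below is not part of rewrite_H5.py: `rowsPerFile` is a hypothetical helper, and it assumes that every file in the h5 contains a 'SEED' column. Two small stand-in files are written first so that the example runs on its own.

```python
import os
import tempfile

import h5py as h5
import numpy as np


def rowsPerFile(pathToH5, column='SEED'):
    """Return {fileName: number of rows}, using the length of one column (hypothetical helper)."""
    with h5.File(pathToH5, 'r') as data:
        return {name: data[name][column].shape[0] for name in data.keys()}


# Stand-in "old" and "new" files (toy data only, not real COMPAS output)
tmpDir  = tempfile.mkdtemp()
demoOld = os.path.join(tmpDir, 'old.h5')
demoNew = os.path.join(tmpDir, 'new.h5')
for path, toySeeds in ((demoOld, [1, 2, 3, 4]), (demoNew, [2, 4])):
    with h5.File(path, 'w') as f:
        f.create_group('Supernovae').create_dataset('SEED', data=np.array(toySeeds))

rowsOld = rowsPerFile(demoOld)
rowsNew = rowsPerFile(demoNew)
for name in rowsNew:
    print(name, rowsOld[name], '->', rowsNew[name])
```

With your own data, pass `pathToDataInput` and `pathToDataOutput` instead of the stand-in paths.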