# notebook for converting Clementine EDR files

**IMPORTANT:**
This notebook expects to be working with *decompressed* files. It is
impractical to decompress files one-by-one from CLEM-JPEG during the
PDS3 -> PDS4 conversion step. See clemdcmp_processor.ipynb for a bulk
decompression method.

## performance tips

IOPS is the most likely bottleneck, but CPU is also possible. We recommend
both using a fast disk and parallelizing this, either using ```pathos```
(vanilla Python ```multiprocessing``` will fail during a pickling step)
or simply by running multiple copies of this notebook.
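
As a rough sketch of the ```pathos``` route (not the method this notebook itself uses), the index can be cut into contiguous row ranges and each range handed to a worker process. `split_ranges`, `convert_range`, and `run_parallel` are hypothetical names introduced here for illustration, and `convert_range` is only a placeholder for the conversion loop further down.

```python
def split_ranges(n_rows, n_workers):
    """Return (start, stop) pairs covering range(n_rows) in contiguous blocks."""
    size, extra = divmod(n_rows, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        # the first `extra` workers take one additional row each
        stop = start + size + (1 if worker < extra else 0)
        if stop > start:
            ranges.append((start, stop))
        start = stop
    return ranges


def convert_range(bounds):
    """Hypothetical worker: stands in for converting one block of index rows."""
    start, stop = bounds
    # a real worker would open its own filesystems and run the conversion
    # loop over rows start..stop-1; here we only report the block size
    return stop - start


def run_parallel(n_rows, n_workers=4):
    """Map convert_range over the row ranges using a pathos process pool."""
    # imported lazily so the helpers above work without pathos installed
    from pathos.multiprocessing import ProcessPool

    pool = ProcessPool(nodes=n_workers)
    try:
        return pool.map(convert_range, split_ranges(n_rows, n_workers))
    finally:
        pool.close()
        pool.join()
        pool.clear()
```

Calling `run_parallel(len(imindex))` would then return one result per block. Passing row bounds rather than dataframes keeps the data sent to each worker small, at the cost of each worker loading its own slice of the index.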

Also note that our manifest file is pretty large (~1.4 GB), as it includes
a lot of metadata (here used primarily to read compression stats).
It might be worth reading it in chunks, dropping chunks of it
(as this notebook does), or paring it down in some way if you're running
this many times in parallel or on a memory-constrained computer.
This is really the only constraint on working memory, because the individual
EDR files are so small.
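
One way to read the manifest in chunks, sketched under assumptions: `load_index_slice` is a hypothetical helper, and the `columns` you pass must include everything the converter reads from the index (the compression metadata fields in particular), which this sketch does not know. It relies only on the `usecols` and `chunksize` options of `pandas.read_csv`.

```python
import pandas as pd


def load_index_slice(csv_source, columns, first_row, n_rows, chunksize=50_000):
    """Keep only `columns` and rows first_row .. first_row + n_rows - 1."""
    last_row = first_row + n_rows
    kept = []
    for chunk in pd.read_csv(csv_source, usecols=columns, chunksize=chunksize):
        # each chunk's index continues the row count from the previous chunk
        if len(chunk) and chunk.index[0] >= last_row:
            break  # past the requested slice, so stop reading the file
        mask = (chunk.index >= first_row) & (chunk.index < last_row)
        if mask.any():
            kept.append(chunk[mask])
    if not kept:
        return pd.DataFrame(columns=columns)
    return pd.concat(kept)
```

For example, `load_index_slice(path, ['product_id', 'volume_id', 'newpath'], 100000, 100000)` would load only the second block of 100000 rows, so no more than one chunk plus the retained slice is held in memory at a time.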

## corrupted files

A handful of files in the PDS3 archive appear to be corrupt. CLEMDCMP
produces visually-ok output that is nevertheless unparseable by any
method we attempted. The ```CANNOT_PARSE_FILES``` variable below marks
these files for skipping so that the pipeline doesn't choke on them.
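
An alternative to checking each file inside the loop is to drop the known-bad products from the index up front. This is only a sketch: `drop_known_bad` is a hypothetical helper, and it compares names case-insensitively because the archive is inconsistent about capitalization.

```python
import pandas as pd


def drop_known_bad(index, bad_files):
    """Return the rows of `index` whose product_id is not in `bad_files`."""
    bad = {name.upper() for name in bad_files}
    is_bad = index["product_id"].str.upper().isin(bad)
    return index[~is_bad]
```

Filtering first also means the progress counter reflects only files that will actually be converted.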

In [None]:
import datetime as dt

from fs.osfs import OSFS
import pandas as pd

from clem_bulk import crude_time_log
from clem_conversion import ClemEDRConverter

In [None]:
CANNOT_PARSE_FILES = [
    'lla0807f.222', 'lla3440m.230', 'lla2912l.202', 'lla0977f.210', 'lla1744i.174',
    'lna5725y.034', 'lla3357m.226', 'lhb0557k.323', 'lla2996l.254', 'lla1355g.279',
    'lla2236j.260', 'lla1914s.341', 'lla2320j.268', 'lla1999i.280', 'lla2167j.274' 
]


In [None]:
# set paths to your input and output filesystems
input_fs = OSFS('/home/ubuntu/buckets/clemdcmp_holding/')
output_fs = OSFS('/home/ubuntu/buckets/clem_output/')

In [None]:
# index of file metadata + our new output paths.
# the metadata is really used just to populate onboard compression
# values, because CLEMDCMP strips them from its output labels and
# parsing both the compressed and decompressed files at this stage
# is a pain.
imindex = pd.read_csv('./directories/clementine/clementine_edr_index.csv')
# uppercasing this for parity with labels
# (the PDS3 archive is not very consistent about capitalization)
imindex['product_id'] = imindex['product_id'].str.upper() 

In [None]:
# this is a place to split your index up if you're
# parallelizing across multiple notebooks.
from more_itertools import chunked
chunk_ix_of_this_notebook = 0
chunk_size = 100000
chunks = chunked(imindex.index, chunk_size)
chunk_ix = 0
while chunk_ix <= chunk_ix_of_this_notebook:
    chunk = next(chunks)
    chunk_ix += 1
working_index = imindex.loc[chunk].copy()
del imindex
log_string = "_" + str(chunk_ix_of_this_notebook)

In [None]:
# this is just to quickly break and restart if odd things happen.
try: 
    old_ix = ix
except NameError:
    old_ix = 0

for ix, edr_image in enumerate(working_index.itertuples()):
    if ix < old_ix:
        continue
    if ix % 250 == 0:
        print("Converting " + edr_image.product_id.lower())
        print(str(ix) + " of " + str(len(working_index)) + " in this chunk.")
    if edr_image.product_id.lower() in CANNOT_PARSE_FILES:
        print('skipping known bad file ' + edr_image.product_id.lower())
        continue
    edr_start_time = dt.datetime.now() # just for logging
    # note that clem_conversion.clemdcmp() adds an .img extension
    # to differentiate decompressed files from their compressed sources,
    # and the decompression notebook provided simply splits files into
    # directories by archive volume number (0001 - 0088). If you have
    # arranged them some other way, change this.
    source_path = str(
        edr_image.volume_id + '/' + edr_image.product_id + '.img'
    ).lower()
    destination_path = edr_image.newpath
    output_fs.makedirs(destination_path, recreate=True)

    # initialize writer & convert product
    try: 
        writer = ClemEDRConverter(
            source_path,
            image_index = working_index,
        )
    except ValueError:
        print(
            "OH NO! ACK!  SOMETHING IS WRONG WITH THIS \n \n \n FILE!!!!!!!",
            "\n",
            "\n",
            "ACK!!!!\n",
        )
        print(source_path + " is bad!!!!!!!")
        print("\n\n\n")
        print("*************************")
        # open in append mode so failures accumulate across the run
        with open('failed_edr_files' + log_string + '.txt', 'a') as file:
            file.write(source_path + '\n')
        continue
    writer.write_pds4(output_fs.getsyspath('') + destination_path, verbose = False)

    # very simple logger
    # log_string (the chunk suffix) keeps log filenames distinct if
    # you're running these concurrently on the same machine
    elapsed = str((dt.datetime.now() - edr_start_time).total_seconds())
    crude_time_log(
        'edr_conversion_log' + log_string + '.csv',
        writer,
        elapsed
    )
