# notebook for converting clementine mosaic files in bulk

## notes and performance tips:
1. Other than selecting a source index, everything needed to distinguish
mosaic product types and handle their individual idiosyncrasies is
handled by ```clem_conversion.ClemMosaicConverter```. This notebook does very little
preprocessing, and there is no need for anything like the EDR decompression
notebook: the Clementine mosaics are all pretty similar to one another, each product
has only one data object, and they don't need to be decompressed for the
'planetary data stack' (```pvl, pdr, gdal```, etc.) to work on them.

2. CPU, throughput, and IOPS (the last especially for HIRES) are all plausible
bottlenecks. We strongly recommend parallelizing this, perhaps just by running
duplicates of this notebook in parallel.

3. ```ClemMosaicConverter``` makes no special effort to be careful about
memory, because even the largest Clementine tiles are well under 100 MB. However,
if you are running this on a memory-constrained machine, you might want
to hack it a bit.

4. We never need to handle detached PDS3 label files: all of the tiles have
attached PDS3 labels. Note that there are also detached ISIS qube files in the
same directories; while this isn't important for the current writer configuration,
if you later want to add ```rasterio``` into this stack, it might be.

5. This does not handle the 56 HIRES mosaic products with duplicated product IDs.
See [dupe_hires_tile_fix.ipynb](dupe_hires_tile_fix.ipynb).

In [None]:
import datetime as dt

import fs.path
from fs.osfs import OSFS
from more_itertools import distribute
import pandas as pd
import sh

from clem_bulk import crude_time_log, swap_lat_and_scale
from clem_conversion import ClemMosaicConverter

In [None]:
# set root directories for your input and output data sets
input_fs = OSFS('/home/ubuntu/buckets/clem_input/')
output_fs = OSFS('/home/ubuntu/buckets/clem_output/')
# we wrote these directly to an s3fs-fuse filesystem.
# gdal does not deal well with writing directly to s3 filesystems,
# for unclear reasons. this roundabout temp directory step
# would serve no purpose in another configuration.
temp_output_directory = '/home/ubuntu/data_temp/'

# our manifest of 'standard' mosaic products (specifically not counting some
# of the oddball / mangled basemap files on cl_3015), along with mappings to
# their new paths
mosaic_manifest = pd.read_csv(
    './directories/clementine/standard_mosaic_products.csv'
)

# associate hires source index with an index of edr products that includes time
# in order to determine start and stop times for each hires mosaic tile
hires_source_groupby = pd.read_csv(
    './directories/clementine/hires_source_index.csv',
).groupby('tilename')

# do the same for basemap 
# (note that indices of this type are not available for uvvis and nir mosaics)
basemap_source_groupby = pd.read_csv(
    './directories/clementine/basemap_source_index.csv',
).groupby('tilename')
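
The staging step described in the comments above (write to local disk first, then move the result into the mounted bucket) can be sketched with the standard library alone. This is only an illustration of the pattern, not the notebook's actual code path: `stage_and_move` is a hypothetical helper, a plain text write stands in for the GDAL / PDS4 writer, and `shutil.move` stands in for `sh.mv`.

```python
import shutil
import tempfile
from pathlib import Path


def stage_and_move(filename, payload, staging_dir, destination_dir):
    """Write a file to local staging, then move it to its destination.

    Hypothetical helper: the plain write below stands in for whatever
    writer cannot target the destination filesystem directly.
    """
    staging_dir = Path(staging_dir)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)

    staged = staging_dir / filename
    staged.write_text(payload)

    # shutil.move falls back to copy-then-delete across filesystems
    final = destination_dir / filename
    shutil.move(str(staged), str(final))
    return final


# demonstration using throwaway directories
with tempfile.TemporaryDirectory() as staging, tempfile.TemporaryDirectory() as dest:
    moved = stage_and_move("example.xml", "<label/>", staging, Path(dest) / "out")
    moved_exists = moved.exists()
    staged_remains = (Path(staging) / "example.xml").exists()
```

If your output directory is an ordinary local filesystem, this indirection buys nothing and the writer can target the destination directly.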

In [None]:
# this is a simple method of splitting your index up if
# you're parallelizing across multiple notebooks / machines /
# whatever.

# chunk tiles into separate lists
tile_chunks = distribute(25, mosaic_manifest.itertuples())

# increment this next variable (serially or in parallel) from 0 to 24
# across a series of notebooks / scripts / whatever in order to convert
# all of the tiles in an organized fashion
tile_chunk_index = 0
tile_chunk = list(tile_chunks[tile_chunk_index]) # greedily evaluate to get the length
chunk_length = len(tile_chunk)
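
For readers unfamiliar with `more_itertools.distribute`: it deals items out round-robin, so chunk `i` receives items `i`, `i + n`, `i + 2n`, and so on. The sketch below reproduces that ordering with the standard library only, using a hypothetical `round_robin_chunks` helper and made-up tile names in place of manifest rows; it is meant to show how the chunks are shaped, not to replace the cell above.

```python
from itertools import islice


def round_robin_chunks(items, n_chunks):
    """Split items into n_chunks lists, dealing them out like cards.

    Chunk i receives items i, i + n_chunks, i + 2 * n_chunks, and so on.
    """
    items = list(items)
    return [list(islice(items, i, None, n_chunks)) for i in range(n_chunks)]


# hypothetical stand-ins for manifest rows
tiles = ["tile_" + str(i) for i in range(10)]
chunks = round_robin_chunks(tiles, 3)

# valid chunk indices run from 0 to n_chunks - 1
first_chunk = chunks[0]
# chunk sizes differ by at most one
chunk_sizes = [len(chunk) for chunk in chunks]
```

Because the rows are interleaved rather than split into contiguous blocks, each chunk gets a similar mix of product types, assuming the manifest is sorted by type.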

In [None]:
# convert all the files in this chunk 
for ix, tile in enumerate(tile_chunk):
    print("Converting " + fs.path.split(tile.file)[1])
    print(str(ix + 1) + " of " + str(chunk_length) + " in this chunk.")
    
    tile_start_time = dt.datetime.now() # just for logging
    
    source_path = tile.file
    # select source product index (this is basically used only to write 
    # detailed start / stop time tags in basemap and hires)
    if fs.path.split(source_path)[1].startswith('b'):
        source_groups = basemap_source_groupby
    elif fs.path.split(source_path)[1].startswith(('h', 'g')):
        source_groups = hires_source_groupby
    else:
        source_groups = None
    
    destination_path = tile.newpath
    output_fs.makedirs(destination_path, recreate=True)
    
    # initialize writer & convert product 
    writer = ClemMosaicConverter(
        input_fs.getsyspath(source_path),
        source_groups = source_groups
    )
    writer.write_pds4(temp_output_directory)
    
    # this is a hacky fix for an incompatibility between
    # gdal and the current version of the cart LDD
    if "hires_polar" in writer.pds4_root:
        swap_lat_and_scale(temp_output_directory + writer.pds4_root + ".xml")

    for extension in ['.xml', '.tif']:
        sh.mv(
            temp_output_directory + writer.pds4_root + extension, 
            output_fs.getsyspath(destination_path),
            _bg=True
        )

    # very simple logger
    # probably better to distinguish log filenames if you're running
    # these concurrently on the same machine
    elapsed = str((dt.datetime.now() - tile_start_time).total_seconds())
    crude_time_log(
        'mosaic_conversion_log_' + str(tile_chunk_index) + '.csv',
        writer,
        elapsed
    )
    print("total seconds: " + elapsed)
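
One way to distinguish log filenames for concurrent runs on the same machine, as the comment in the loop suggests, is to fold the hostname and process ID into the name. This is a minimal sketch under that assumption; `unique_log_name` is a hypothetical helper and is not part of `clem_bulk`.

```python
import os
import socket


def unique_log_name(prefix, chunk_index):
    """Build a log filename that is distinct per chunk, host, and process."""
    host = socket.gethostname()
    pid = os.getpid()
    return f"{prefix}{chunk_index}_{host}_{pid}.csv"


log_name = unique_log_name("mosaic_conversion_log_", 0)
```

The result could be passed to `crude_time_log` in place of the filename built from `tile_chunk_index` alone.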