# notebook for decompressing Clementine EDR files

## general notes and performance tips:
1. Decompressing a single Clementine EDR file is fairly fast on a modern machine. Most of the time goes to
initialization and file access in ```CLEMDCMP```, so more single-thread speed does not help much. However,
```CLEMDCMP``` needs a DOS operating environment, which means spinning up DOSBOX (a DOS emulator). That
is fairly slow (~2-5 seconds depending on operating environment, and again not improved very much by more
single-thread speed). There are around 1.9 million EDR files, so launching DOSBOX once per file would cost
roughly 44 to 110 days of startup time alone. Running them in bulk is necessary.
2. Feeding very long file paths to DOSBOX is not practical. This means that files need to be decompressed
locally, preferably within a single directory. Fortunately, all filenames (not counting path) in the EDR
are distinct. (As a side note, this was a secondary function of including orbit number as a three-digit
sequential extension -- so that they could encode all the metadata they wanted in the filename while
retaining adequate entropy to distinguish each file in the EDR from all its siblings.)
3. More CPU cores are very helpful if you want this to go faster. ```CLEMDCMP``` itself is quite efficient, but
the x86 emulation layer of DOSBOX is not. Assume each DOSBOX instance will eat a whole core for no reason.
There is probably a way to fix this, but I don't know what it is. An appropriate value for
```dosbox_process_count``` below is probably about 1 per core.
4. Make very sure your DOSBOX config is set to _not_ limit processor cycles. Otherwise, DOSBOX will faithfully
run ```CLEMDCMP``` at authentic 1994 speeds, which is great for DOSBOX's primary target application (playing
DOS games at the correct speed) but terrible for this one.
5. After CPU cores, IOPS are another very likely bottleneck. Throughput and working memory are not likely
bottlenecks. The upshot here is that if you do parallelize this beyond the multiprocessing that is already
built into this notebook, it's probably better to do it across machines. On a single machine, just increase
the multiprocessing limits until you run out of CPU or iowait spikes; everything here parallelizes very nicely.
6. ```CLEMDCMP``` has very few error messages. Presented with most forms of bad input, it simply hangs forever.
This, along with the many other ways things can go wrong if you change the process, is a good reason beyond
raw performance to consider splitting the run across multiple sessions: one hung or broken session then
does not stall the rest.
7. Average compression ratio varies quite a bit between the AVs, but across the whole EDR, it is about 1:3.75.
Total uncompressed size of the EDR is about 170 GB.
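
Note 4 can be made concrete with a small override config. The following is a minimal sketch, assuming stock DOSBox, whose ```[cpu]``` section accepts ```cycles=max``` to run the emulated CPU as fast as the host allows. The function name is hypothetical and not part of this notebook. How ```clemdcmp()``` in ```clem_conversion``` actually launches DOSBOX is not shown here, so wiring the file in (for example through DOSBox's ```-conf``` option) is an assumption.

```python
import configparser
from pathlib import Path


def write_unthrottled_dosbox_conf(conf_path):
    """Write a minimal DOSBox config that lifts the CPU cycle limit."""
    config = configparser.ConfigParser()
    # cycles=max asks DOSBox to emulate as many cycles as the host can supply
    config["cpu"] = {"cycles": "max"}
    with open(conf_path, "w") as conf_file:
        # DOSBox's own config files use key=value with no spaces
        config.write(conf_file, space_around_delimiters=False)
    return Path(conf_path)
```

The resulting file contains only a ```[cpu]``` section, so it is meant to be layered on top of an existing config rather than to replace it.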

In [38]:
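import datetime as dt
import os
import re
import shutil
from multiprocessing import Pool

import fs.path
import sh
from fs.osfs import OSFS
from more_itertools import distribute

# clemdcmp() comes from clem_conversion; the names above are imported
# explicitly rather than relying on this star import to supply them
from clem_conversion import *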

# set root directory for input data set
input_root = '/home/ubuntu/clem_input/'

# clemdcmp does not deal well with writing directly to other filesystems;
# this is a working directory that contains it and its
# immediate outputs. the trailing slash matters: filenames are
# appended directly to this string below
clemdcmp_directory = 'clemdcmp/'

# where are we storing these decompressed images?
base_dir = '/home/ubuntu/clemdcmp_holding/'

# multiprocess limits
fs_process_slots = 30
dosbox_process_count = 16

In [None]:
# split up iteration however you want. there are 88 volumes.
# this notebook just runs them all.

av_range = range(1, 89)
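
If the run is split across several sessions or machines, each one needs its own subset of the 88 volumes. A minimal sketch, assuming a simple round-robin split; ```volumes_for_session``` and its parameters are hypothetical names, not part of ```clem_conversion```.

```python
def volumes_for_session(session_index, session_count, first=1, last=88):
    """Return the volume numbers one session should process.

    Volumes are dealt out round-robin, so every volume lands in exactly
    one session and session sizes differ by at most one.
    """
    if not 0 <= session_index < session_count:
        raise ValueError("session_index must be in [0, session_count)")
    return list(range(first, last + 1))[session_index::session_count]
```

For example, the second of four sessions could set ```av_range = volumes_for_session(1, 4)```, which starts with volumes 2, 6, 10.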

In [None]:
for av_index in av_range:
    # pick volume and make fs abstraction
    av_string = 'cl_' + '{:0>4}'.format(av_index)
    volume_root = input_root + av_string
    volume = OSFS(volume_root)
    # find data directories on volume
    data_dirs = [
        file.name for file in volume.scandir('')
        if file.is_dir and file.name not in [
            'document', 'index', 'software', 'timeline'
        ]
    ]
    av_start_time = dt.datetime.now()

    print('starting av ' + av_string)
    for data_dir in data_dirs:
        start_time = dt.datetime.now()
        print(start_time.isoformat())
        data_files = list(volume.walk.files(data_dir))
        print('copying files from:')
        print(data_dir, str(len(data_files)) + ' total files')
        if __name__ == '__main__':
            pool = Pool(fs_process_slots)
            for ix, path in enumerate(data_files):
                command = pool.apply_async(
                    shutil.copy, 
                    (
                        fs.path.combine(volume_root, path), 
                        clemdcmp_directory + fs.path.split(path)[1]
                    )
                )
            pool.close()
            pool.join()

        print('starting clemdcmp conversion run')
        edr_files = [
            file for file in os.listdir(clemdcmp_directory)
            if re.match(r'\w{8}\.\d{3}', file)
        ]
        commands = []
        sh.mkdir('-p', base_dir + av_string + '/')
        if __name__ == '__main__':
            pool = Pool(dosbox_process_count)
            chunks = distribute(dosbox_process_count, edr_files)
            for ix, chunk in enumerate(chunks):
                command = pool.apply_async(
                    # do NOT look at the output of dosbox directly in a REPL environment
                    # on a remote server. ncurses viewed through sh's interpreter will
                    # cause TROUBLE.
                    clemdcmp, 
                    (list(chunk), clemdcmp_directory, base_dir + av_string + '/', ix)
                )
                commands.append(command)
            pool.close()
            pool.join()

        print('cleaning up')
        for file in edr_files:
            os.remove(clemdcmp_directory + file)
        print('done, elapsed time ' + str((dt.datetime.now() - start_time).total_seconds()) + ' seconds')
    print('**************************************************')
    print('\n')
    print('av done, elapsed time ' + str((dt.datetime.now() - av_start_time).total_seconds()) + ' seconds')
    print('\n')
    print('**************************************************')
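
Since ```CLEMDCMP``` tends to hang rather than fail on bad input, a bulk run benefits from a hard time limit per batch. The following is a minimal sketch using the standard library's ```subprocess.run``` timeout. It is not how ```clemdcmp()``` in ```clem_conversion``` works (that function is not shown here), and ```run_with_timeout```, along with whatever command and limit are passed to it, is hypothetical.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run a command; return its exit code, or None if it timed out.

    subprocess.run kills the child process when the timeout expires,
    so a hung emulator does not block the rest of the batch.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # emulator output is not needed here
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A ```None``` result marks the batch for inspection, or for a retry with a smaller set of files to isolate the bad input.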