# Tutorial on Dataset Minification

## Purpose

The goal of this chapter is to demonstrate how to create a minified dataset that can run in online containers such as Binder. Because Binder images are built from GitHub repositories, we must adhere to GitHub's size [limits](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-large-files-on-github), which do not allow files over 100 MB.

## Style

This chapter does not contain any technically complex code, so the commentary is sparse. It exists for reproducibility and tutorial purposes. The full dataset can be found [here](https://www.dropbox.com/s/qkr9712m8jt3zft/AirborneData.mat?dl=0).

## Process

In [64]:
import scipy.io
import os.path

The data used in this book comes in a MATLAB file, so we will manipulate it with SciPy.

In [65]:
airborne_data_path = "../data/Elwha2012.mat"
assert os.path.exists(airborne_data_path)
airborne_data = scipy.io.loadmat(airborne_data_path)
original_size = os.path.getsize(airborne_data_path) 

In [66]:
print(list(airborne_data.keys()))

['__header__', '__version__', '__globals__', 'imageRGB', 'imageIR', 'maskRiver', 'tempRiver', 'northings', 'eastings', 'Xt', 'Yt', 'Zt', 'altitude', 'datePDT']


In [67]:
print(airborne_data['__header__'])
print(airborne_data['__version__'])
print(airborne_data['__globals__'])

b'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Mon Apr  8 16:29:13 2013'
1.0
[]


It is unclear whether any of these will matter for our data processing, but they contribute almost nothing to the file size and do not get in our way, so there is no reason to remove them.

In [68]:
print(type(airborne_data['imageRGB']))
print(airborne_data['imageRGB'].shape)

print(type(airborne_data['imageIR']))
print(airborne_data['imageIR'].shape)

<class 'numpy.ndarray'>
(640, 480, 406, 3)
<class 'numpy.ndarray'>
(480, 640, 406)


The type and shape of each object in the dictionary (SciPy loads MATLAB files into a dictionary) tell us a lot about what it is.
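
One way to make this inspection less repetitive is to loop over the dictionary and report each entry's type, shape, dtype, and in-memory size. The helper below is a sketch and not part of the original notebook: `summarize_mat_dict` is a hypothetical name, and the small `demo` dictionary stands in for `airborne_data` so the example runs without the real file.

```python
import numpy as np


def summarize_mat_dict(mat_dict):
    """Return (key, type name, shape, dtype, nbytes) for each entry.

    Hypothetical helper; non-array entries such as '__header__'
    get None for shape, dtype, and nbytes.
    """
    rows = []
    for key, value in mat_dict.items():
        if isinstance(value, np.ndarray):
            rows.append((key, type(value).__name__, value.shape,
                         str(value.dtype), value.nbytes))
        else:
            rows.append((key, type(value).__name__, None, None, None))
    return rows


# Stand-in data with tiny, made-up shapes (not the real dataset).
demo = {
    "__header__": b"demo header",
    "imageIR": np.zeros((4, 6, 5), dtype=np.float64),
    "northings": np.zeros((5, 1), dtype=np.float64),
}

summary = summarize_mat_dict(demo)
for row in summary:
    print(row)
```

Sorting the rows by `nbytes` would show at a glance which entries dominate the file size and are worth trimming first.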

In [69]:
print('\nmaskRiver:')
print(type(airborne_data['maskRiver']))
print(airborne_data['maskRiver'].shape)

print('\ntempRiver:')
print(type(airborne_data['tempRiver']))
print(airborne_data['tempRiver'].shape)

print('\ndatePDT:')
print(type(airborne_data['datePDT']))
print(airborne_data['datePDT'].shape)


maskRiver:
<class 'numpy.ndarray'>
(480, 640, 406)

tempRiver:
<class 'numpy.ndarray'>
(406, 5)

datePDT:
<class 'numpy.ndarray'>
(406,)


It seems `maskRiver` and `tempRiver` are abandoned work from a few years ago, when another researcher tried to do some processing on this dataset. Trimming them will greatly reduce the dataset size. The next cell removes both, along with `datePDT`.

In [70]:
airborne_data.pop('maskRiver')
airborne_data.pop('tempRiver')
airborne_data.pop('datePDT')
;

''

For those who don't know, a semicolon at the end of a cell tells Jupyter not to show the output.
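
The same removal can also be written as a loop over key names. The sketch below checks for membership before popping, so re-running the cell does not raise a `KeyError` for keys that are already gone. `drop_keys` is a hypothetical helper, and the `demo` dictionary holds placeholder values rather than real arrays.

```python
def drop_keys(mapping, keys):
    """Remove each key in `keys` from `mapping` in place.

    Returns the names that were actually present and removed.
    """
    removed = []
    for key in keys:
        if key in mapping:
            mapping.pop(key)
            removed.append(key)
    return removed


# Placeholder dictionary; the values are irrelevant to the example.
demo = {"imageIR": 1, "maskRiver": 2, "tempRiver": 3}

removed = drop_keys(demo, ["maskRiver", "tempRiver", "datePDT"])
print(removed)       # ['maskRiver', 'tempRiver']
print(sorted(demo))  # ['imageIR']
```

Because the function returns the list of removed names, the result can be printed or assigned instead of being suppressed.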

In [71]:
print('\nnorthings:')
print(type(airborne_data['northings']))
print(airborne_data['northings'].shape)

print('\neastings:')
print(type(airborne_data['eastings']))
print(airborne_data['eastings'].shape)

print('\nXt:')
print(type(airborne_data['Xt']))
print(airborne_data['Xt'].shape)

print('\nYt:')
print(type(airborne_data['Yt']))
print(airborne_data['Yt'].shape)

print('\nZt:')
print(type(airborne_data['Zt']))
print(airborne_data['Zt'].shape)


northings:
<class 'numpy.ndarray'>
(406, 1)

eastings:
<class 'numpy.ndarray'>
(406, 1)

Xt:
<class 'numpy.ndarray'>
(480, 640, 406)

Yt:
<class 'numpy.ndarray'>
(480, 640, 406)

Zt:
<class 'numpy.ndarray'>
(1, 406)


In [72]:
# airborne_data.pop('northings')
# airborne_data.pop('eastings')
airborne_data.pop('Xt')
airborne_data.pop('Yt')
airborne_data.pop('Zt')
;

''

In [73]:
print(list(airborne_data.keys()))

['__header__', '__version__', '__globals__', 'imageRGB', 'imageIR', 'northings', 'eastings', 'altitude']


We've significantly reduced the file size with these steps. However, we still have 812 images (406 RGB and 406 IR), which at about 1 MB apiece leaves us with a still gargantuan ~800 MB file, far too large for GitHub. We are going to need to trim this down a bit.
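
A rough upper bound on the subset size follows from the figures above. This is a back-of-the-envelope sketch: it takes the estimate of about 1 MB per image at face value, ignores the smaller arrays and any file overhead, and `max_frame_pairs` is a hypothetical helper.

```python
def max_frame_pairs(budget_bytes, bytes_per_image, images_per_frame=2):
    """Largest number of frames (each with an RGB and an IR image) whose
    combined size stays within `budget_bytes`."""
    return budget_bytes // (bytes_per_image * images_per_frame)


MB = 10**6  # ASSUMPTION: decimal megabytes; the limit may be measured differently

# ~1 MB per image is the chapter's rough estimate, not a measured value.
upper_bound = max_frame_pairs(budget_bytes=100 * MB, bytes_per_image=1 * MB)
print(upper_bound)  # 50
```

Staying well below this bound leaves headroom for the remaining arrays and for file format overhead.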

Once we have chosen the size of the subset, in this case 25 images, we have to decide which images to keep. For this dataset the sequence of images matters (we want adjacent images, since we are dealing with misalignment), so we will simply choose the first 25 images. For other datasets, this may not be the optimal choice.

In [74]:
subset_size = 25

trimmed_rgb = airborne_data['imageRGB'][:,:,0:subset_size]
trimmed_ir = airborne_data['imageIR'][:,:,0:subset_size]

airborne_data['imageRGB'] = trimmed_rgb
airborne_data['imageIR'] = trimmed_ir
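
For datasets where the first frames are not the most useful, the same slice can start at an arbitrary offset. The sketch below illustrates this on synthetic arrays, assuming (as in this dataset) that axis 2 indexes the frames. `frame_window` and its `start` parameter are hypothetical names. It also copies the result, because a basic NumPy slice is a view that keeps the full original array alive in memory.

```python
import numpy as np


def frame_window(stack, start, count, frame_axis=2):
    """Return a contiguous block of `count` frames beginning at `start`.

    The result is copied so it no longer references the full array.
    """
    index = [slice(None)] * stack.ndim
    index[frame_axis] = slice(start, start + count)
    return stack[tuple(index)].copy()


# Synthetic stand-ins with small, made-up dimensions.
rgb = np.zeros((8, 6, 40, 3), dtype=np.uint8)
ir = np.zeros((6, 8, 40), dtype=np.float64)

rgb_subset = frame_window(rgb, start=10, count=25)
ir_subset = frame_window(ir, start=10, count=25)

print(rgb_subset.shape)  # (8, 6, 25, 3)
print(ir_subset.shape)   # (6, 8, 25)
```

Using the same `start` and `count` for both arrays keeps the RGB and IR frames paired.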

In [75]:
print(type(airborne_data['imageRGB']))
print(airborne_data['imageRGB'].shape)

print(type(airborne_data['imageIR']))
print(airborne_data['imageIR'].shape)

<class 'numpy.ndarray'>
(640, 480, 25, 3)
<class 'numpy.ndarray'>
(480, 640, 25)


After verifying that we have successfully extracted our subset of images, we can tell SciPy to save our file.

In [76]:
airborne_minidata_path = "../data/Elwha2012Mini.mat"
scipy.io.savemat(airborne_minidata_path, airborne_data)
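
If the saved file is still larger than desired, `scipy.io.savemat` accepts a `do_compression` flag that compresses matrices as they are written. The original notebook does not use it. The sketch below only illustrates the option on a small synthetic array written to a temporary directory, then reloads the file to confirm the data survives the round trip.

```python
import os
import tempfile

import numpy as np
import scipy.io

# Synthetic placeholder data, not the real dataset.
demo = {"imageIR": np.zeros((48, 64, 5), dtype=np.float64)}

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.mat")
    packed_path = os.path.join(tmp_dir, "packed.mat")

    # Write the same data with and without compression.
    scipy.io.savemat(plain_path, demo)
    scipy.io.savemat(packed_path, demo, do_compression=True)

    plain_size = os.path.getsize(plain_path)
    packed_size = os.path.getsize(packed_path)

    # Reload the compressed file to check the contents are unchanged.
    reloaded = scipy.io.loadmat(packed_path)["imageIR"]

print(plain_size, packed_size)
print(reloaded.shape)
```

How much compression helps depends on the data. An all-zero array like this one is a best case, and real imagery will usually shrink far less.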

In [81]:
minified_size = os.path.getsize(airborne_minidata_path)
print("Size reduced by:", original_size - minified_size, "bytes")

Size reduced by: 1382450025 bytes
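
Since the point of the exercise is to stay under GitHub's per-file limit, it may be worth checking the result explicitly rather than only reporting the reduction. The sketch below is one way to do that. `LIMIT_BYTES` and `fits_under_limit` are hypothetical names, and the limit is taken as 100 × 10^6 bytes, which is an assumption about how the limit is measured. The temporary file stands in for the minified dataset.

```python
import os
import tempfile

LIMIT_BYTES = 100 * 10**6  # ASSUMPTION: 100 MB counted in decimal megabytes


def fits_under_limit(path, limit_bytes=LIMIT_BYTES):
    """Return True if the file at `path` is strictly smaller than the limit."""
    return os.path.getsize(path) < limit_bytes


# Placeholder file; in the notebook this would be `airborne_minidata_path`.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"\0" * 1024)
    demo_path = handle.name

demo_ok = fits_under_limit(demo_path)
os.remove(demo_path)
print(demo_ok)  # True
```

In the notebook, wrapping the call in an `assert` would make the run fail loudly if a later change pushes the file back over the limit.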
