# Fixed array size, variable number of versions

For this test, we generated `.h5` data files with the `generate_data.py` script from the repository, using the following options:

- `test_large_fraction_changes_sparse`: 
    - `num_rows_initial = 5000`
    - `num_rows_per_append = 10`
    - `pct_inserts = 1`
    - `num_inserts = 10`
    - `pct_deletes = 1`
    - `num_deletes = 10`
    - `pct_changes = 90`
    - `num_changes = 1000`
- `test_small_fraction_changes_sparse`:
    - `num_rows_initial = 5000`
    - `num_rows_per_append = 10`
    - `pct_inserts = 1`
    - `num_inserts = 10`
    - `pct_deletes = 1`
    - `num_deletes = 10`
    - `pct_changes = 90`
    - `num_changes = 10`
- `test_mostly_appends_sparse`:
    - `num_rows_initial = 1000`
    - `num_rows_per_append = 1000`
    - `pct_inserts = 5`
    - `num_inserts = 10`
    - `pct_deletes = 1`
    - `num_deletes = 10`
    - `pct_changes = 5`
    - `num_changes = 10`  
- `test_mostly_appends_dense`:
    - `num_rows_initial_0 = 30`
    - `num_rows_initial_1 = 30`
    - `num_rows_per_append_0 = 1`
    - `pct_inserts = 5`
    - `num_inserts_0 = 1`
    - `num_inserts_1 = 10`
    - `pct_deletes = 1`
    - `num_deletes_0 = 1`
    - `num_deletes_1 = 1`
    - `pct_changes = 5`
    - `num_changes = 10`

## Setup

The path to the generated test files is set below:

In [2]:
#path = "/home/melissa/projects/versioned-hdf5/" # change this as necessary
path = "/home/melissa/Dropbox/trabalho/Quansight/DE Shaw/tests/test_fixed_array_size/"

In [3]:
%matplotlib widget
import os
import sys
sys.path.append('..')
import pickle
import numpy as np
import matplotlib.pyplot as plt
import h5py
from versioned_hdf5 import VersionedHDF5File

In [4]:
# auxiliary code to format file sizes 
def format_size(size):
    suffixes = ['B', 'KB', 'MB', 'GB']
    i = 0
    while size >= 1024 and i < len(suffixes)-1:
        size = size/1024
        i += 1
    return f"{size:.2f} {suffixes[i]}"

# Test 1: Large fraction changes (sparse)

In [5]:
testname = "test_large_fraction_changes"

We have tested the following numbers of versions (or transactions):

In [6]:
num_transactions = [50, 100, 500, 1000, 2000, 5000, 10000, 20000]

In [7]:
tests = []
for t in num_transactions:
    filename = os.path.join(path, testname+"_"+str(t)+"_sparse.h5")
    h5pyfile = h5py.File(filename, 'r')
    data = VersionedHDF5File(h5pyfile)
    tests.append(dict(num_transactions=t, filename=filename, h5pyfile=h5pyfile, data=data))

## Number of versions vs. file size

We'll start by analyzing how the `.h5` file sizes grow as the number of versions grows. 

In [8]:
for test in tests:
    test['size'] = os.path.getsize(test['filename'])
    test['size_label'] = format_size(test['size'])
    print(f"File with {test['num_transactions']} versions has size {test['size_label']}")

File with 50 versions has size 6.74 MB
File with 100 versions has size 11.93 MB
File with 500 versions has size 65.26 MB
File with 1000 versions has size 151.09 MB
File with 2000 versions has size 376.74 MB
File with 5000 versions has size 1.53 GB
File with 10000 versions has size 5.00 GB
File with 20000 versions has size 17.73 GB


Note that the array size also grows as the number of versions grows:

In [None]:
print("Array sizes:")
for test in tests:
    lengths = []
    for vname in test['data']._versions:
        if vname != '__first_version__':
            version = test['data'][vname]
            group_key = list(version.keys())[0]
            lengths.append(len(version[group_key]['val']))
    print(f"File with {test['num_transactions']}: min = {min(lengths)}, max = {max(lengths)}")

```
Array sizes:
File with 50: min = 5000, max = 5567
File with 100: min = 5000, max = 5952
File with 500: min = 5000, max = 10263
File with 1000: min = 5000, max = 15601
File with 2000: min = 5000, max = 26057
File with 5000: min = 5000, max = 57937
File with 10000: min = 5000, max = 110305
File with 20000: min = 5000, max = 215739
```

In [9]:
test_large_fraction_changes_sparse = []
for test in tests:
    test_large_fraction_changes_sparse.append(dict((k, test[k]) for k in ['num_transactions', 'filename', 'size', 'size_label']))

For reproducibility, we'll pickle the file sizes for these tests so we can recover them later:

In [9]:
with open("test_large_fraction_changes_sparse_versions.pickle","wb") as pickle_out:
    pickle.dump(test_large_fraction_changes_sparse, pickle_out)

In [10]:
with open("test_large_fraction_changes_sparse_versions.pickle", "rb") as pickle_in:
    test_large_fraction_changes_sparse = pickle.load(pickle_in)

Let's show the size information in a graph:

In [10]:
filesizes = np.array([test['size'] for test in test_large_fraction_changes_sparse])
sizelabels = np.array([test['size_label'] for test in test_large_fraction_changes_sparse])

In [11]:
fig_large_fraction_changes = plt.figure()
plt.plot(num_transactions, filesizes, 'b')
plt.plot(num_transactions, filesizes, 'b*', ms=12)
plt.xticks([50, 5000, 10000, 20000])
plt.xlabel("Transactions")
plt.title("Number of transactions vs. File Size")
num_transactions = np.array(num_transactions)
plt.yticks(filesizes[[0, 5, 6, 7]], sizelabels[[0, 5, 6, 7]])
plt.show()


In [None]:
img = plt.imread('test_large_fraction_changes_sparse_versions.png')
plt.imshow(img)
plt.show()

This shows that, for this test, the file size grows roughly **quadratically** with the number of versions: doubling from 10000 to 20000 versions increases the file size from 5.00 GB to 17.73 GB, a factor of about 3.5.
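
As a rough check of the growth rate, the file sizes printed earlier can be compared on a log-log scale. The sketch below is not part of the original analysis: it copies the sizes by hand from the printed output (converted to MB, so they carry the rounding of those labels) and estimates a local growth exponent between consecutive measurements. The names `num_versions`, `sizes_mb` and `local_exponents` are chosen here for illustration only.

```python
import numpy as np

# Number of versions and file sizes (in MB), copied by hand from the
# output printed earlier; GB values are converted with 1 GB = 1024 MB.
num_versions = [50, 100, 500, 1000, 2000, 5000, 10000, 20000]
sizes_mb = [6.74, 11.93, 65.26, 151.09, 376.74,
            1.53 * 1024, 5.00 * 1024, 17.73 * 1024]

# Local growth exponent between consecutive measurements:
# the slope of log(size) against log(number of versions).
local_exponents = np.diff(np.log(sizes_mb)) / np.diff(np.log(num_versions))

for n0, n1, k in zip(num_versions[:-1], num_versions[1:], local_exponents):
    print(f"{n0:>5} -> {n1:>5} versions: exponent ~ {k:.2f}")
```

An exponent near 1 would indicate linear growth, and an exponent near 2 quadratic growth. For these numbers the exponent rises as the number of versions grows and is closest to 2 for the largest files.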

### Finishing up

In [12]:
for test in tests:
    test['h5pyfile'].close()

# Test 2: Mostly appends (sparse)

In [13]:
testname = "test_mostly_appends"

For this case, we are using the following number of transactions:

In [14]:
num_transactions = [50, 100, 500, 1000, 2000, 5000, 10000]

In [15]:
# Setting up dictionary with test info
tests = []
for t in num_transactions:
    filename = os.path.join(path, testname+"_"+str(t)+"_sparse.h5")
    h5pyfile = h5py.File(filename, 'r')
    data = VersionedHDF5File(h5pyfile)
    tests.append(dict(num_transactions=t, filename=filename, h5pyfile=h5pyfile, data=data))
    
# Computing file sizes
for test in tests:
    test['size'] = os.path.getsize(test['filename'])
    test['size_label'] = format_size(test['size'])
    print(f"File with {test['num_transactions']} versions has size {test['size_label']}")

File with 50 versions has size 9.44 MB
File with 100 versions has size 23.40 MB
File with 500 versions has size 305.71 MB
File with 1000 versions has size 944.95 MB
File with 2000 versions has size 3.49 GB
File with 5000 versions has size 19.64 GB
File with 10000 versions has size 75.06 GB


In [16]:
print("Array sizes:")
for test in tests:
    lengths = []
    for vname in test['data']._versions:
        if vname != '__first_version__':
            version = test['data'][vname]
            group_key = list(version.keys())[0]
            lengths.append(len(version[group_key]['val']))
    print(f"File with {test['num_transactions']}: min = {min(lengths)}, max = {max(lengths)}")

Array sizes:
File with 50: min = 1000, max = 50933
File with 100: min = 1000, max = 100857
File with 500: min = 1000, max = 500951
File with 1000: min = 1000, max = 1000660
File with 2000: min = 1000, max = 2000840
File with 5000: min = 1000, max = 5001604
File with 10000: min = 1000, max = 10000126



In [17]:
test_mostly_appends_sparse = []
for test in tests:
    test_mostly_appends_sparse.append(dict((k, test[k]) for k in ['num_transactions', 'filename', 'size', 'size_label']))

In [18]:
with open("test_mostly_appends_sparse_versions.pickle","wb") as pickle_out:
    pickle.dump(test_mostly_appends_sparse, pickle_out)

In [19]:
with open("test_mostly_appends_sparse_versions.pickle", "rb") as pickle_in:
    test_mostly_appends_sparse = pickle.load(pickle_in)

In [20]:
filesizes = np.array([test['size'] for test in test_mostly_appends_sparse])
sizelabels = np.array([test['size_label'] for test in test_mostly_appends_sparse])

In [24]:
fig_mostly_appends = plt.figure()
plt.plot(num_transactions, filesizes, 'b')
plt.plot(num_transactions, filesizes, 'b*', ms=12)
plt.xticks([50, 5000, 10000])
plt.xlabel("Transactions")
plt.title("Number of transactions vs. File Size")
num_transactions = np.array(num_transactions)
plt.yticks(filesizes[[0, 4, 5, 6]], sizelabels[[0, 4, 5, 6]])
plt.show()


In [25]:
img = plt.imread('test_mostly_appends_sparse_versions.png')
plt.imshow(img)
plt.show()

In [28]:
for test in tests:
    test['h5pyfile'].close()