# Fixed array size, variable number of versions

For this test, we generated `.h5` data files with the `generate_data.py` script from the repository, using the following options:

- `test_large_fraction_changes_sparse`: 
    - `num_rows_initial = 5000`
    - `num_rows_per_append = 10`
    - `pct_inserts = 1`
    - `num_inserts = 10`
    - `pct_deletes = 1`
    - `num_deletes = 10`
    - `pct_changes = 90`
    - `num_changes = 1000`

We have tested the following numbers of versions (or transactions):

In [1]:
num_transactions = [30, 60, 120, 250, 360, 730, 1825, 2500, 3650, 7300, 9125, 14600, 18250, 21900, 25000, 27375, 29200, 36500]

The path to the generated test files is:

In [2]:
path = "/home/melissa/projects/versioned-hdf5/" # change this as necessary

## Setup

In [3]:
%matplotlib widget
import os
import numpy as np
import matplotlib.pyplot as plt
import h5py
from versioned_hdf5 import VersionedHDF5File

In [4]:
testname = "test_large_fraction_changes"
tests = []
for t in num_transactions:
    # Build the file name once and reuse it when opening the file
    filename = os.path.join(path, f"{testname}_{t}_sparse.h5")
    h5pyfile = h5py.File(filename, 'r')
    data = VersionedHDF5File(h5pyfile)
    # 'time' holds the number of versions (transactions) stored in this file
    tests.append(dict(time=t, filename=filename, h5pyfile=h5pyfile, data=data))

## Number of versions vs. File size

We'll start by analyzing how the `.h5` file sizes grow as the number of versions grows. 

In [5]:
filesizes = []
sizelabels = []
suffixes = ['B', 'KB', 'MB', 'GB']
for test in tests:
    size = os.path.getsize(test['filename'])
    filesizes.append(size)
    i = 0
    while size >= 1024 and i < len(suffixes)-1:
        size = size/1024
        i += 1
    sizelabels.append(f"{size:.2f} {suffixes[i]}")
    print(f"File with {test['time']} versions has size {sizelabels[-1]}")
filesizes = np.array(filesizes)
sizelabels = np.array(sizelabels)

File with 30 versions has size 4.31 MB
File with 60 versions has size 7.75 MB
File with 120 versions has size 14.54 MB
File with 250 versions has size 29.76 MB
File with 360 versions has size 45.06 MB
File with 730 versions has size 102.37 MB
File with 1825 versions has size 333.88 MB
File with 2500 versions has size 529.18 MB
File with 3650 versions has size 939.11 MB
File with 7300 versions has size 2.91 GB
File with 9125 versions has size 4.29 GB
File with 14600 versions has size 9.88 GB
File with 18250 versions has size 14.94 GB
File with 21900 versions has size 20.97 GB
File with 25000 versions has size 26.87 GB
File with 27375 versions has size 31.94 GB
File with 29200 versions has size 36.00 GB
File with 36500 versions has size 55.70 GB


In [6]:
# Line plus star markers for each measured file
plt.plot(num_transactions, filesizes, 'b')
plt.plot(num_transactions, filesizes, 'b*', ms=12)
plt.xticks([30, 9125, 18250, 27375, 36500])
plt.xlabel("Transactions")
plt.ylabel("File size")
plt.title("Number of transactions vs. File Size")
# Label the y axis with human-readable sizes at the same points as the x ticks
indices = [0, 10, 12, 15, 17]
num_transactions = np.array(num_transactions)
plt.yticks(filesizes[indices], sizelabels[indices])
plt.show()


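This shows that, for this array size, the file size grows roughly **quadratically** with the number of versions. For example, doubling the number of versions from 18250 to 36500 increases the file size from 14.94 GB to 55.70 GB, a factor of about 3.7 rather than the factor of 2 that linear growth would give.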
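As a rough check on the growth rate, the sketch below takes the file sizes printed above and, for every measured number of versions `n` whose double `2n` was also measured, computes `log2(size(2n) / size(n))`. A value near 1 indicates locally linear growth and a value near 2 indicates quadratic growth. The `size_mb` dictionary and the `doubling_exponents` helper are illustrative names introduced here, not part of the repository, and the sizes are the rounded values from the output above.

```python
import math

# File sizes copied from the printed output above, converted to MB.
# Assumption: 1 GB = 1024 MB, matching the conversion loop used earlier.
GB = 1024
size_mb = {
    30: 4.31, 60: 7.75, 120: 14.54, 250: 29.76, 360: 45.06,
    730: 102.37, 1825: 333.88, 2500: 529.18, 3650: 939.11,
    7300: 2.91 * GB, 9125: 4.29 * GB, 14600: 9.88 * GB,
    18250: 14.94 * GB, 21900: 20.97 * GB, 25000: 26.87 * GB,
    27375: 31.94 * GB, 29200: 36.00 * GB, 36500: 55.70 * GB,
}


def doubling_exponents(sizes):
    """Return {n: log2(size(2n) / size(n))} for every n whose double was measured."""
    return {
        n: math.log2(sizes[2 * n] / sizes[n])
        for n in sorted(sizes)
        if 2 * n in sizes
    }


exponents = doubling_exponents(size_mb)
for n, p in exponents.items():
    print(f"{n:>6} -> {2 * n:>6} versions: local exponent {p:.2f}")
```

With these numbers, the local exponent is a little under 1 for the smallest files and rises to about 1.9 for the largest pair. This is consistent with a linear term dominating for few versions and a quadratic term taking over as the number of versions grows.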

### Finishing up

In [7]:
for test in tests:
    test['h5pyfile'].close()