# VirtualDatasets

A VirtualDataset in ConX lets you load only part of a dataset at a time, so the whole dataset never has to be in memory at once.

You can construct a VirtualDataset as follows:

```python
cx.VirtualDataset(FUNCTION, 
                  LENGTH, 
                  INPUTS_SHAPES, 
                  TARGET_SHAPES, 
                  INPUT_RANGES, 
                  TARGET_RANGES,
                  generator_ordered=True|False,
                  load_batch_direct=True|False,
                  batch_size=SIZE)
```

Where:

* FUNCTION is one of:
  * a generator that yields an input/target pair
  * a generator that yields a low-level set of batch data (use load_batch_direct=True)
  * a function that returns an input/target pair
  * a function that returns a low-level set of batch data (use load_batch_direct=True)
* LENGTH is the total number of input/target pairs
* INPUTS_SHAPES is a list of shapes, one per input bank
* TARGET_SHAPES is a list of shapes, one per target bank
* INPUT_RANGES is a list of (min, max) pairs, one per input bank
* TARGET_RANGES is a list of (min, max) pairs, one per target bank
* SIZE is the number of input/target pairs in a batch

SIZE determines how many input/target pairs are generated at a time. Usually it should match the batch_size used when training the network.
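
To make the role of SIZE concrete, the sketch below works out which batch a given position falls into. The helper name `batch_bounds` is hypothetical and not part of ConX; it only illustrates arithmetic that is consistent with the outputs in this notebook, where position 50 with a batch size of 8 lies in batch 6, covering positions 48 to 55.

```python
def batch_bounds(pos, batch_size, length):
    """Return (batch_index, start, stop) for the batch containing pos.

    Hypothetical helper for illustration; not a ConX function.
    """
    if not 0 <= pos < length:
        raise IndexError("position out of range")
    batch = pos // batch_size
    start = batch * batch_size
    stop = min(start + batch_size, length)  # the last batch may be short
    return batch, start, stop

print(batch_bounds(50, 8, 100))  # (6, 48, 56)
print(batch_bounds(99, 8, 100))  # (12, 96, 100)
```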

In [1]:
import conx as cx
import numpy as np
import random

Using TensorFlow backend.
ConX, version 3.7.3


In [2]:
BATCH_SIZE = 8

## load_batch_direct = False, with function(position)

In [3]:
def f(pos):
    print("Generating position:", pos)
    return ([pos/100, pos/100], [pos/100])

In [4]:
f(50)

Generating position: 50


([0.5, 0.5], [0.5])

In [5]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            load_batch_direct=False,
                            batch_size=BATCH_SIZE)

Generating position: 0
Generating position: 1
Generating position: 2
Generating position: 3
Generating position: 4
Generating position: 5
Generating position: 6
Generating position: 7


In [6]:
dataset.inputs[0]

[0.0, 0.0]

In [7]:
dataset.inputs[0]

[0.0, 0.0]

As you can see from the above, retrieving input/target patterns from the current batch does not regenerate the batch.

However, moving beyond the range does generate a new batch:

In [8]:
dataset.inputs[50]

Generating position: 48
Generating position: 49
Generating position: 50
Generating position: 51
Generating position: 52
Generating position: 53
Generating position: 54
Generating position: 55


[0.5, 0.5]

In [9]:
dataset.inputs[0]

Generating position: 0
Generating position: 1
Generating position: 2
Generating position: 3
Generating position: 4
Generating position: 5
Generating position: 6
Generating position: 7


[0.0, 0.0]
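
The behaviour above can be modelled as a cache that holds a single batch at a time. The class below is a minimal sketch under that assumption; `OneBatchCache` and its attributes are hypothetical names, and this is not how ConX is necessarily implemented.

```python
class OneBatchCache:
    """Keep only the most recently loaded batch in memory.

    Hypothetical model of the behaviour shown above; not ConX code.
    """
    def __init__(self, function, length, batch_size):
        self.function = function
        self.length = length
        self.batch_size = batch_size
        self.current_batch = None
        self.pairs = []
        self.loads = 0  # how many times a batch was (re)generated
        self._load(0)   # the first batch is loaded on construction

    def _load(self, batch):
        start = batch * self.batch_size
        stop = min(start + self.batch_size, self.length)
        self.pairs = [self.function(pos) for pos in range(start, stop)]
        self.current_batch = batch
        self.loads += 1

    def get_input(self, pos):
        batch = pos // self.batch_size
        if batch != self.current_batch:
            self._load(batch)  # leaving the cached range replaces the batch
        return self.pairs[pos - batch * self.batch_size][0]

cache = OneBatchCache(lambda pos: ([pos / 100, pos / 100], [pos / 100]), 100, 8)
cache.get_input(0)   # served from the batch already in memory
cache.get_input(50)  # regenerates positions 48..55
cache.get_input(0)   # regenerates positions 0..7
print(cache.loads)   # 3
```

Only one batch is ever held, which is why returning to position 0 costs a full regeneration.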

## load_batch_direct = True, with function(batch)

In [10]:
def f(batch):
    print("Generating batch:", batch)
    # One input bank and one target bank; each holds a whole batch of rows.
    all_inputs = [[]]
    all_targets = [[]]
    for i in range(batch * BATCH_SIZE, (batch + 1) * BATCH_SIZE):
        all_inputs[0].append([i/100, i/100])
        all_targets[0].append([i/100])
    return ([np.array(inputs) for inputs in all_inputs],
            [np.array(targets) for targets in all_targets])

In [11]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            load_batch_direct=True,
                            batch_size=BATCH_SIZE)

Generating batch: 0


In [12]:
dataset.inputs[0]

[0.0, 0.0]

In [13]:
dataset.inputs[50]

Generating batch: 6


[0.5, 0.5]

In [14]:
dataset.inputs[0]

Generating batch: 0


[0.0, 0.0]

In [15]:
dataset.inputs[0]

[0.0, 0.0]
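
The function above returns a pair of lists, one array per bank, where each array stacks a whole batch of rows. The sketch below builds the same layout from individual input/target pairs so the array shapes are visible. The helper `pairs_to_batch` is hypothetical and assumes a single input bank and a single target bank.

```python
import numpy as np

def pairs_to_batch(pairs):
    """Stack single-bank (input, target) pairs into the batch layout.

    Hypothetical helper; assumes one input bank and one target bank.
    """
    inputs = np.array([inp for inp, _ in pairs])
    targets = np.array([tgt for _, tgt in pairs])
    return [inputs], [targets]

pairs = [([i / 100, i / 100], [i / 100]) for i in range(8)]
batch_inputs, batch_targets = pairs_to_batch(pairs)
print(batch_inputs[0].shape)   # (8, 2)
print(batch_targets[0].shape)  # (8, 1)
```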

## generator_ordered = True, load_batch_direct = False, with generator function()

In [16]:
def f():
    i = 0
    while True:
        print("Generating position:", i)
        yield ([i/100, i/100], [i/100])
        i += 1

In [17]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            generator_ordered=True,
                            load_batch_direct=False,
                            batch_size=BATCH_SIZE)

Generating position: 0
Generating position: 1
Generating position: 2
Generating position: 3
Generating position: 4
Generating position: 5
Generating position: 6
Generating position: 7


In [18]:
dataset.inputs[0]

[0.0, 0.0]

In [19]:
dataset.inputs[20]

Generating position: 0
Generating position: 1
Generating position: 2
Generating position: 3
Generating position: 4
Generating position: 5
Generating position: 6
Generating position: 7
Generating position: 8
Generating position: 9
Generating position: 10
Generating position: 11
Generating position: 12
Generating position: 13
Generating position: 14
Generating position: 15
Generating position: 16
Generating position: 17
Generating position: 18
Generating position: 19
Generating position: 20
Generating position: 21
Generating position: 22
Generating position: 23


[0.20000000298023224, 0.20000000298023224]

In [20]:
dataset.inputs[24]

Generating position: 24
Generating position: 25
Generating position: 26
Generating position: 27
Generating position: 28
Generating position: 29
Generating position: 30
Generating position: 31


[0.23999999463558197, 0.23999999463558197]

In [21]:
dataset.inputs[0]

Generating position: 0
Generating position: 1
Generating position: 2
Generating position: 3
Generating position: 4
Generating position: 5
Generating position: 6
Generating position: 7


[0.0, 0.0]
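
The output above suggests a simple rule for ordered generators: the next batch in sequence is read by continuing the generator, while any other batch forces a restart and a replay from position 0. The class below is a sketch of that rule, inferred only from the printed output; `OrderedReader` and `counting` are hypothetical names, not ConX code.

```python
class OrderedReader:
    """Model of sequential access to an ordered generator.

    Hypothetical sketch inferred from the printed output; not ConX code.
    """
    def __init__(self, generator_function, batch_size):
        self.generator_function = generator_function
        self.batch_size = batch_size
        self.generator = generator_function()
        self.next_batch = 0  # index of the batch the generator yields next
        self.pairs = []

    def _read_batch(self):
        self.pairs = [next(self.generator) for _ in range(self.batch_size)]
        self.next_batch += 1

    def load(self, batch):
        if self.pairs and batch == self.next_batch - 1:
            return self.pairs  # already in memory
        if batch != self.next_batch:
            # Not the next batch in sequence: restart and replay from 0.
            self.generator = self.generator_function()
            self.next_batch = 0
        while self.next_batch <= batch:
            self._read_batch()
        return self.pairs

produced = []  # records every position the generator is asked for

def counting():
    i = 0
    while True:
        produced.append(i)
        yield ([i / 100, i / 100], [i / 100])
        i += 1

reader = OrderedReader(counting, 8)
reader.load(0)        # positions 0..7
produced.clear()
reader.load(2)        # restart: positions 0..23
print(len(produced))  # 24
produced.clear()
reader.load(3)        # next in sequence: positions 24..31
print(produced[0], produced[-1])  # 24 31
```

Under this model, sequential access costs one batch per step, while a jump to batch n costs n + 1 batches.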

## generator_ordered = True, load_batch_direct = True, with generator function()

In [22]:
def f():
    i = 0
    while True:
        print("Generating positions:", i, "-", i + BATCH_SIZE)
        i_end = i + BATCH_SIZE
        all_inputs = [[]]
        all_targets = [[]]
        while i < i_end:
            all_inputs[0].append([i/100, i/100])
            all_targets[0].append([i/100])
            i += 1
        yield ([np.array(inputs) for inputs in all_inputs],
               [np.array(targets) for targets in all_targets])

In [23]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            generator_ordered=True,
                            load_batch_direct=True,
                            batch_size=BATCH_SIZE)

Generating positions: 0 - 8


In [24]:
dataset.inputs[0]

[0.0, 0.0]

In [25]:
dataset.inputs[25]

Generating positions: 0 - 8
Generating positions: 8 - 16
Generating positions: 16 - 24
Generating positions: 24 - 32


[0.25, 0.25]

In [26]:
dataset.inputs[8]

Generating positions: 0 - 8
Generating positions: 8 - 16


[0.08, 0.08]

## generator_ordered = True, load_batch_direct = False, with generator function() (showing another function style)

In [27]:
def f():
    for i in range(100):
        print("generating position:", i)
        yield ([i/100, i/100], [i/100])

In [28]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            generator_ordered=True,
                            load_batch_direct=False,
                            batch_size=BATCH_SIZE)

generating position: 0
generating position: 1
generating position: 2
generating position: 3
generating position: 4
generating position: 5
generating position: 6
generating position: 7


In [29]:
dataset.inputs[0]

[0.0, 0.0]

In [30]:
dataset.inputs[25]

generating position: 0
generating position: 1
generating position: 2
generating position: 3
generating position: 4
generating position: 5
generating position: 6
generating position: 7
generating position: 8
generating position: 9
generating position: 10
generating position: 11
generating position: 12
generating position: 13
generating position: 14
generating position: 15
generating position: 16
generating position: 17
generating position: 18
generating position: 19
generating position: 20
generating position: 21
generating position: 22
generating position: 23
generating position: 24
generating position: 25
generating position: 26
generating position: 27
generating position: 28
generating position: 29
generating position: 30
generating position: 31


[0.25, 0.25]

In [31]:
dataset.inputs[0]

generating position: 0
generating position: 1
generating position: 2
generating position: 3
generating position: 4
generating position: 5
generating position: 6
generating position: 7


[0.0, 0.0]

## generator_ordered = False, load_batch_direct = False, with generator function()

In [32]:
def f():
    while True:
        print("Generating a position!")
        r = random.random()
        yield ([r, r], [r])

In [33]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            generator_ordered=False,
                            load_batch_direct=False,
                            batch_size=BATCH_SIZE)

Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!


In [34]:
dataset.inputs[0]

[0.43144214153289795, 0.43144214153289795]

In [35]:
dataset.inputs[0]

[0.43144214153289795, 0.43144214153289795]

In [36]:
dataset.inputs[10]

Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!


[0.5141733288764954, 0.5141733288764954]

In [37]:
dataset.inputs[0]

Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!
Generating a position!


[0.510324239730835, 0.510324239730835]

## generator_ordered = False, load_batch_direct = True, with generator function()

In [38]:
def f():
    while True:
        print("Generating a batch!")
        all_inputs = [[]]
        all_targets = [[]]
        for i in range(BATCH_SIZE):
            r = random.random()
            all_inputs[0].append([r, r])
            all_targets[0].append([r])
        yield ([np.array(inputs) for inputs in all_inputs],
               [np.array(targets) for targets in all_targets])

In [39]:
dataset = cx.VirtualDataset(f, 100, [(2,)], [(1,)], [(0,1)], [(0,1)],
                            generator_ordered=False,
                            load_batch_direct=True,
                            batch_size=BATCH_SIZE)

Generating a batch!


In [40]:
dataset.inputs[0]

[0.5988568705426426, 0.5988568705426426]

In [41]:
dataset.inputs[50]

Generating a batch!


[0.3660460366554896, 0.3660460366554896]

In [42]:
dataset.inputs[0]

Generating a batch!


[0.5512341103950951, 0.5512341103950951]

In [43]:
dataset.inputs[99]

Generating a batch!


[0.6636074827171492, 0.6636074827171492]
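
With an unordered generator, the value found at a given index changes each time its batch is reloaded, as the outputs above show. If repeatable runs are wanted, one option is to draw from a seeded `random.Random` instance rather than the global generator. This is a suggestion rather than something ConX requires, and `make_generator` is a hypothetical helper. Whether ConX calls the generator function once or several times is not shown here, so the sketch only guarantees that each fresh call to `f()` replays the same stream.

```python
import random

def make_generator(seed):
    """Return a generator function whose stream is fixed by seed."""
    def f():
        rng = random.Random(seed)  # private RNG, independent of the global one
        while True:
            r = rng.random()
            yield ([r, r], [r])
    return f

f = make_generator(42)
g1, g2 = f(), f()
run1 = [next(g1) for _ in range(3)]
run2 = [next(g2) for _ in range(3)]
print(run1 == run2)  # True
```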

## Generating a VirtualDataset from a directory of files

In [44]:
%%file test0.dat
[[0/3], [0/3]], [0/3]

Overwriting test0.dat


In [45]:
%%file test1.dat
[[1/3], [1/3]], [1/3]

Overwriting test1.dat


In [46]:
%%file test2.dat
[[2/3], [2/3]], [2/3]

Overwriting test2.dat


In [47]:
%%file test3.dat
[[3/3], [3/3]], [3/3]

Overwriting test3.dat


In [54]:
import glob
filenames = sorted(glob.glob("./*.dat"))

In [55]:
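def f(pos):
    # filenames is sorted, so each position always maps to the same file.
    print("Generating position:", pos)
    # eval is needed because the files contain expressions such as 1/3;
    # only use it on files you trust.
    with open(filenames[pos]) as fp:
        return eval(fp.read())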

In [56]:
filenames

['./test0.dat', './test1.dat', './test2.dat', './test3.dat']

In [57]:
f(0)

Generating position: 0


([[0.0], [0.0]], [0.0])

In [58]:
dataset = cx.VirtualDataset(f, len(filenames), [(2,)], [(1,)], [(0,1)], [(0,1)],
                            load_batch_direct=False,
                            batch_size=3)

Generating position: 0
Generating position: 1
Generating position: 2


In [62]:
dataset.inputs[0]

[[0.0], [0.0]]

In [63]:
dataset.inputs[0]

[[0.0], [0.0]]

In [64]:
dataset.inputs[3]

Generating position: 3


[[1.0], [1.0]]
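
The loader above relies on `eval`, which executes whatever the file contains. Where the files can hold plain numbers, a sketch like the following reads them with `json` instead. The file names, the scratch directory, the JSON layout, and the helper `load_pair` are all assumptions made for illustration. Zero-padded numbers keep `sorted` in numeric order, since it compares file names as strings.

```python
import glob
import json
import os
import tempfile

# Write four small example files into a scratch directory (illustrative data).
directory = tempfile.mkdtemp()
for n in range(4):
    path = os.path.join(directory, "test%03d.json" % n)
    with open(path, "w") as fp:
        json.dump([[[n / 3], [n / 3]], [n / 3]], fp)

filenames = sorted(glob.glob(os.path.join(directory, "*.json")))

def load_pair(pos):
    """Return the (input, target) pair stored in file number pos."""
    with open(filenames[pos]) as fp:
        inputs, targets = json.load(fp)
    return (inputs, targets)

print(load_pair(3))  # ([[1.0], [1.0]], [1.0])
```

A function like `load_pair` has the same shape as `f(pos)` above: it takes a position and returns one input/target pair.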