# Dataset and samples

This notebook demonstrates the `FieldDataset` class.
It serves two purposes:
- Simplifying a complex `Field` into a set of `numpy` arrays or `torch` tensors
- Iteratively converting `Field` instances to their simplified form for model training

In [1]:
import sys
sys.path.append('..')
import numpy as np
import torch

from deepfield.datasets import FieldDataset, FieldSample
from deepfield.datasets.transforms import ToTensor, Normalize, Denormalize, AddBatchDimension

from deepfield import Field
from deepfield.field.base_component import BaseComponent

PATH_TO_DATASET = '../open_data/norne_simplified/'
PATH_TO_FIELD = '../open_data/norne_simplified/norne_simplified.data'

The `FieldDataset` can be created in several ways:
1. From a path to a folder with either `.data` or `.hdf5` files

In [2]:
dataset = FieldDataset(src=PATH_TO_DATASET)

2. From a preloaded `Field` instance. In this case, some preprocessing is required to handle the control variables properly: the well trajectories must be computed and the events must be converted from the ECLIPSE format.

**NOTE:** to load the solutions of the *standard commercial simulator* available for `norne_simplified`, first unzip the results file `../open_data/norne_simplified/RESULTS.zip`.

In [3]:
field = Field(PATH_TO_FIELD).load()
dataset = FieldDataset(src=field, allow_change_preloaded=True)

INFO:Field:Using default config.
INFO:Field:Start reading X files.
...
INFO:Field:===== Field summary =====
INFO:Field:GRID attributes: MAPAXES, ZCORN, COORD, ACTNUM, DIMENS
INFO:Field:ROCK attributes: PORO, PERMX, PERMY, PERMZ
INFO:Field:STATES attributes: PRESSURE, RS, SGAS, SOIL, SWAT
INFO:Field:TABLES attributes: PVTO, ROCK, PVTW, DENSITY, SWOF, SGOF, PVDG
INFO:Field:WELLS attributes: WELLTRACK, RESULTS, WCONINJE, COMPDAT, WELSPECS, WCONPROD
INFO:Field:AQUIFERS attributes: 
INFO:Field:Grid pillars (`COORD`) are mapped to new axis with respect to `MAPAXES`.


Roughly speaking, `FieldDataset` is a generator: at each iteration it loads a `Field` and simplifies it.
The simplified `Field` has its own class, `FieldSample`:

In [4]:
for sample in dataset:
    print(sample.__class__)

<class 'deepfield.datasets.datasets.FieldSample'>
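
The lazy load-then-simplify behaviour described above can be pictured with a minimal pure-Python sketch. Everything in it (`ToyDataset`, `load_field`, `simplify`) is a hypothetical stand-in, not `deepfield` code; it only illustrates that each step of the iteration triggers one load and one simplification.

```python
class ToyDataset:
    """Hypothetical stand-in for an iterable dataset (not deepfield's FieldDataset)."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.n_loaded = 0  # counts how many sources have been loaded so far

    def load_field(self, src):
        # Placeholder for an expensive load from disk.
        self.n_loaded += 1
        return {'name': src, 'pressure': [1.0, 2.0, 3.0]}

    @staticmethod
    def simplify(field):
        # Placeholder for the conversion to plain arrays.
        return tuple(field['pressure'])

    def __iter__(self):
        # Nothing is loaded until the consumer asks for the next item.
        for src in self.sources:
            yield self.simplify(self.load_field(src))


toy = ToyDataset(['field_a', 'field_b'])
iterator = iter(toy)
first = next(iterator)            # only the first source is loaded here
loaded_after_first = toy.n_loaded
rest = list(iterator)             # loads the remaining sources
```

Because loading happens inside `__iter__`, memory use stays bounded by a single item regardless of how many sources the dataset holds.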


`FieldSample` is a child class of `BaseComponent`. It has the same interface, and its attributes are either arrays/tensors or `BaseComponent` instances.

In [5]:
sample.attributes

('MASKS', 'GRID', 'ROCK', 'STATES', 'CONTROL')

In [6]:
print('STATES is a numpy array: %s' % isinstance(sample.states, np.ndarray))

STATES is a numpy array: True


In [7]:
print('MASKS is a BaseComponent: %s' % isinstance(sample.masks, BaseComponent))
print('MASKS.ACTNUM is a numpy array: %s' % isinstance(sample.masks.actnum, np.ndarray))

MASKS is a BaseComponent: True
MASKS.ACTNUM is a numpy array: True


Characteristics that can be stacked together are stacked into a single array: STATES, ROCK, CONTROL, etc.

Characteristics with more complex shapes are represented as `BaseComponent` instances: MASKS, GRID, etc.
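
To make "stackable" concrete, here is a small `numpy` sketch with toy sizes. It assumes the stacked attributes occupy a channel axis right after the time axis, which is consistent with the `[246, 5, 46, 112, 22]` shape printed for STATES later in this notebook (five state attributes). The actual stacking code inside `deepfield` is not shown here, so treat this as an illustration only.

```python
import numpy as np

# Toy sizes (hypothetical): 3 time steps on a 4 x 6 x 2 grid.
n_steps, nx, ny, nz = 3, 4, 6, 2
names = ['PRESSURE', 'RS', 'SGAS', 'SOIL', 'SWAT']

# One array per attribute, all sharing the same shape.
per_attribute = {name: np.full((n_steps, nx, ny, nz), float(i))
                 for i, name in enumerate(names)}

# Same-shaped arrays can be stacked along a new channel axis.
stacked = np.stack([per_attribute[name] for name in names], axis=1)
print(stacked.shape)  # (3, 5, 4, 6, 2)

# Channel i holds the attribute names[i].
soil = stacked[:, names.index('SOIL')]
```

Attributes whose arrays differ in shape cannot be combined this way, which is why they are kept as separate entries of a component instead.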

The names of all characteristics present in the `FieldSample` are stored in the `sample_attrs` attribute:

In [8]:
print('SAMPLE_ATTRS is a BaseComponent: %s' % isinstance(sample.sample_attrs, BaseComponent))
dict(**sample.sample_attrs)

SAMPLE_ATTRS is a BaseComponent: True


{'MASKS': ['ACTNUM', 'TIME'],
 'GRID': [],
 'ROCK': ['PORO', 'PERMX', 'PERMY', 'PERMZ'],
 'STATES': ['PRESSURE', 'RS', 'SGAS', 'SOIL', 'SWAT'],
 'CONTROL': ['BHPT']}

You can specify `sample_attrs` by passing an appropriate `dict` to the `FieldDataset`:

In [9]:
sample_attrs = {
    'masks': ['actnum', 'time', 'named_well_mask', 'well_mask', 'cf_mask', 'perf_mask'],
    'states': ['pressure', 'soil', 'swat', 'sgas', 'rs'],
    'rock': ['poro', 'permx', 'permy', 'permz'],
    'control': ['bhpt'],
    'tables': ['pvto', 'pvtw', 'pvdg', 'swof', 'sgof', 'density'],
    'grid': ['xyz']
}
dataset = FieldDataset(src=field, sample_attrs=sample_attrs)

sample = next(iter(dataset))
dict(**sample.sample_attrs)

{'MASKS': ['ACTNUM',
  'TIME',
  'NAMED_WELL_MASK',
  'WELL_MASK',
  'CF_MASK',
  'PERF_MASK'],
 'STATES': ['PRESSURE', 'SOIL', 'SWAT', 'SGAS', 'RS'],
 'ROCK': ['PORO', 'PERMX', 'PERMY', 'PERMZ'],
 'CONTROL': ['BHPT'],
 'TABLES': ['PVTO', 'PVTW', 'PVDG', 'SWOF', 'SGOF', 'DENSITY'],
 'GRID': ['XYZ']}

Unlike `BaseComponent`, `FieldSample` has several methods for specific transformations:

- You can apply any transform from `deepfield.datasets.transforms` to it
- You can change the spatial representation of the sample: flatten it with `as_ravel` and, optionally, crop it at a mask via the `crop_at_mask` argument

In [10]:
print('STATES shape before "as_ravel": %s' % list(sample.states.shape))
sample_ravel = sample.as_ravel(inplace=False, crop_at_mask='ACTNUM')
print('STATES shape after "as_ravel": %s' % list(sample_ravel.states.shape))

STATES shape before "as_ravel": [246, 5, 46, 112, 22]
STATES shape after "as_ravel": [246, 5, 44431]
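
The shape change above (three spatial axes collapsing into one axis that contains only the masked cells) can be reproduced with plain `numpy` boolean indexing. This is a sketch under the assumption that cropping keeps the cells where the mask is true; the toy sizes and variable names are hypothetical, and it is not the `deepfield` implementation.

```python
import numpy as np

# Toy stand-ins (hypothetical sizes, not the Norne grid).
n_steps, n_channels, nx, ny, nz = 3, 5, 4, 6, 2
rng = np.random.default_rng(0)
states = rng.random((n_steps, n_channels, nx, ny, nz))

# Boolean mask of active cells: one slab of the grid is marked inactive.
actnum = np.ones((nx, ny, nz), dtype=bool)
actnum[0, :, :] = False

# Indexing the last three axes with a 3D boolean mask flattens them
# and keeps only the cells where the mask is True.
states_ravel = states[..., actnum]
print(states_ravel.shape)  # (3, 5, 36)

# The operation is reversible for the kept cells: scatter them back
# into a spatial array, leaving the cropped-out cells at zero.
restored = np.zeros_like(states)
restored[..., actnum] = states_ravel
```

Here 48 grid cells minus 12 inactive ones leave 36, just as the Norne sample goes from 46 x 112 x 22 cells to 44431 cells in the last axis.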


`sample.at_wells()` is a shortcut for `sample.as_ravel(crop_at_mask='WELL_MASK')`:

In [11]:
sample_at_wells = sample.at_wells(inplace=False) 
print('STATES shape after "at_wells": %s' % list(sample_at_wells.states.shape))

STATES shape after "at_wells": [246, 5, 504]


The transforms from `deepfield.datasets.transforms` can be used, for example, to convert a sample to `torch` tensors:

In [12]:
sample.transformed(ToTensor, inplace=True)
print('STATES is a torch tensor: %s' % isinstance(sample.states, torch.Tensor))

STATES is a torch tensor: True


Useful information about the sample's current representation is stored in its `state`:

In [13]:
sample.state.as_dict()

{'sample_attributes': <deepfield.field.base_component.BaseComponent at 0x7f0d09b61990>,
 'spatial': True,
 'cropped_at_mask': None,
 'numpy': False,
 'tensor': True}

You can ask the `FieldDataset` to apply transforms to all the generated samples:

In [14]:
dataset = FieldDataset(src=field, sample_attrs=sample_attrs)
dataset.set_transform([ToTensor, AddBatchDimension])

sample = next(iter(dataset))
print('STATES shape with batch dimension: %s' % list(sample.states.shape))

STATES shape with batch dimension: [1, 246, 5, 46, 112, 22]
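
Conceptually, a list of transforms is applied in order, each one receiving the output of the previous one. The sketch below uses plain functions on a `numpy` array; `to_float32`, `add_batch_dimension` and `apply_transforms` are hypothetical helpers written for illustration, not the `deepfield` transform classes.

```python
import numpy as np

def to_float32(array):
    # Stand-in for a type-conversion step.
    return array.astype(np.float32)

def add_batch_dimension(array):
    # Prepend an axis of length 1.
    return array[np.newaxis]

def apply_transforms(array, transforms):
    # Apply each transform to the result of the previous one.
    for transform in transforms:
        array = transform(array)
    return array

states = np.zeros((3, 5, 4, 6, 2))
batched = apply_transforms(states, [to_float32, add_batch_dimension])
print(batched.shape)  # (1, 3, 5, 4, 6, 2)
```

For `torch` tensors, the counterpart of the last step is `tensor.unsqueeze(0)`.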


For training, it is useful to normalize values before passing them to an ML model.

The paired `Normalize` and `Denormalize` transforms do this.
However, they work only after the dataset-wide statistics (mean, std, min, max) have been calculated.
Calculating the statistics is the responsibility of the `FieldDataset`.

In [15]:
dataset = FieldDataset(src=field, sample_attrs=sample_attrs, allow_change_preloaded=True)
dataset.calculate_statistics()
sample = next(iter(dataset))

Precalculated statistics can be dumped to and loaded from a pickle file:

In [16]:
dataset.dump_statistics('statistics.pkl')
dataset.load_statistics('statistics.pkl')
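
For reference, a pickle round trip with the standard library looks like the sketch below. The dictionary layout is an assumption made for illustration; the actual structure written by `dump_statistics` is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical statistics layout: one entry per attribute group.
statistics = {
    'STATES': {'mean': [250.0, 0.1], 'std': [20.0, 0.05]},
    'ROCK': {'mean': [0.2], 'std': [0.03]},
}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_statistics.pkl')
    with open(path, 'wb') as f:
        pickle.dump(statistics, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)
```

Pickle files should be loaded only from trusted sources, since unpickling can execute arbitrary code.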

In [17]:
print('STATES max value before normalization: %.2f' % sample.states.max())
sample.transformed([ToTensor, Normalize], inplace=True)
print('STATES max value after normalization: %.2f' % sample.states.max())
sample.transformed([Denormalize], inplace=True)
print('STATES max value after denormalization: %.2f' % sample.states.max())

STATES max value before normalization: 312.58
STATES max value after normalization: 4.56
STATES max value after denormalization: 312.58
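
The round trip above can be mimicked with a small `numpy` sketch. It assumes standardization by per-channel mean and standard deviation; a post-normalization maximum of 4.56 rather than 1 is consistent with that, but which statistics `Normalize` actually uses is an inference, not something this notebook states. All names and sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy array laid out as [time, channel, x, y, z].
states = rng.normal(loc=250.0, scale=20.0, size=(4, 5, 3, 3, 2))

# Per-channel statistics: reduce over every axis except the channel axis.
axes = (0, 2, 3, 4)
mean = states.mean(axis=axes, keepdims=True)
std = states.std(axis=axes, keepdims=True)

# Normalize, then invert the transformation with the same statistics.
normalized = (states - mean) / std
restored = normalized * std + mean
```

Denormalization recovers the original values only if it uses exactly the statistics that were used for normalization, which is one reason to store them alongside a trained model.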


You can also dump and load the sample itself (`state=True` dumps the `sample.state` too):

In [18]:
sample.dump('sample.hdf5', state=True)
sample = FieldSample('sample.hdf5').load()

In [19]:
sample.attributes

('CONTROL', 'GRID', 'MASKS', 'ROCK', 'STATES', 'TABLES')

Done!