In [1]:
import openpnm as op
import numpy as np
np.random.seed(0)

# Data Storage Format

## Spreadsheet Analogy

The best analogy for explaining data storage in OpenPNM is the humble spreadsheet.  According to this analogy, each pore (or throat) corresponds to a row and each property corresponds to a column.

Consider the following network with 4 pores, 3 throats:

In [2]:
pn = op.network.Cubic(shape=[4, 1, 1])
geo = op.geometry.StickAndBall(network=pn, pores=pn.Ps, throats=pn.Ts)

Let's use ``pandas`` to express the geometric properties as a 'spreadsheet':

In [3]:
import pandas as pd
pore_data_sheet = pd.DataFrame({i: geo[i] for i in geo.props(element='pore')})

We can now view this 'spreadsheet':

In [4]:
print(pore_data_sheet)

   pore.diameter  pore.seed  pore.area  pore.max_size  pore.volume
0       0.474407   0.474407   0.176763            1.0     0.055905
1       0.557595   0.557595   0.244190            1.0     0.090773
2       0.501382   0.501382   0.197436            1.0     0.065994
3       0.472442   0.472442   0.175302            1.0     0.055213


The properties are the 'column' names, such as 'pore.area', and the rows correspond to the pore index, so 'pore 0' has an area of 0.176763.  

One could also extract an entire column using:

In [5]:
pore_area_column = pore_data_sheet['pore.area']
print(pore_area_column)

0    0.176763
1    0.244190
2    0.197436
3    0.175302
Name: pore.area, dtype: float64


And then access individual elements:

In [6]:
print(pore_area_column[0])

0.17676309790984798


## Dictionaries and Numpy Arrays

Although the spreadsheet analogy described above is very close to reality, OpenPNM does not actually use ``pandas`` DataFrames, or any other spreadsheet data structure.  Instead, it uses the basic Python *dictionary* and Numpy arrays to achieve nearly identical behavior, but with a bit more flexibility.

Each OpenPNM object (e.g. networks, algorithms, etc.) is actually a customized (a.k.a. [subclassed](https://realpython.com/python-data-classes/#inheritance)) Python [*dictionary*](https://realpython.com/python-dicts/) which allows data to be stored and accessed by name, with a syntax like ``network['pore.diameter']``.  This is analogous to (in practice indistinguishable from) extracting a column from a spreadsheet as outlined above.  Once the data array is retrieved from the dictionary, it is then a simple matter of working with a Numpy array.
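
As a rough illustration of this idea, the sketch below uses a plain ``dict`` with made-up values rather than an actual OpenPNM object. It shows that a dictionary of equal-length Numpy arrays behaves much like a spreadsheet: each key is a column and each array index is a row.

```python
import numpy as np

# A plain dict standing in for an OpenPNM object (illustrative values only)
network = {}
network['pore.diameter'] = np.array([0.47, 0.56, 0.50, 0.47])
network['pore.volume'] = np.array([0.056, 0.091, 0.066, 0.055])

# Extract a whole 'column' by name...
diameter = network['pore.diameter']

# ...then work with it as an ordinary Numpy array
largest_pore = int(np.argmax(diameter))

# Reading one 'row' means indexing every column at the same location
row_1 = {key: float(values[1]) for key, values in network.items()}
print(largest_pore, row_1)
```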

### Quick Overview of Dictionaries
The internet contains many tutorials and explanations of Python "dicts".  To summarize, they are general purpose data containers where *items* can be stored by name, as follows:

In [7]:
a = {}  # Create an empty dict
a['item 1'] = 'a string'
a['item 2'] = 4  # a number
a['another item'] = {}  # Even other dicts!
print(a)

{'item 1': 'a string', 'item 2': 4, 'another item': {}}


Data can be accessed by name, which is called a "key":

In [8]:
print(a['item 2'])

4


### Quick Overview of Numpy Arrays

OpenPNM uses *dictionaries* (``dict``s) to store an assortment of [Numpy arrays](https://docs.scipy.org/doc/numpy/user/) that each contain a specific type of pore or throat data.

There are many [tutorials](https://realpython.com/numpy-array-programming/) on the internet explaining the various features and benefits of Numpy arrays.  To summarize, they are similar to numerical arrays in any other language.

Let's extract a Numpy array from the ``geo`` object and play with it:

In [9]:
a = geo['pore.diameter']
print(a)

[0.47440675 0.55759468 0.50138169 0.47244159]


It's possible to extract several elements at once:

In [10]:
a[1:3]

array([0.55759468, 0.50138169])

And easy to multiply all values in the array by a scalar or by another array of the same size, which defaults to element-wise multiplication:

In [11]:
print(a*2)

[0.9488135  1.11518937 1.00276338 0.94488318]


In [12]:
print(a*a)

[0.22506177 0.31091183 0.2513836  0.22320106]


### Rules to Maintain Data Integrity

Several rules have been implemented to control the integrity of the data:
* Only Numpy arrays can be stored in an OpenPNM object, and any data that is written into one of the OpenPNM object dictionaries will be converted to a Numpy array.  This is done to ensure that all mathematical operations throughout the code can be consistently done using vectorization.  Note that any subclasses of Numpy arrays, such as Dask arrays or Unyt arrays, are also acceptable.
* All array names must begin with either *'pore.'* or *'throat.'* which serves to identify the type of information they contain.
* For the sake of consistency, only arrays of length *Np* or *Nt* are allowed in the dictionary. Assigning a scalar value to a dictionary key results in the creation of a full-length vector, either *Np* or *Nt* long, depending on the name of the array.  This effectively applies the scalar value to all locations in the network.
* Any Boolean data will be treated as a *label* while all other numerical data is treated as a *property*. 
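
The rules above can be sketched with a small ``dict`` subclass. The ``ArrayDict`` class and its ``labels`` and ``props`` helpers below are hypothetical and written only for illustration; this is not OpenPNM's actual implementation, which handles many more cases.

```python
import numpy as np


class ArrayDict(dict):
    """Hypothetical dict subclass mimicking the rules above (not OpenPNM code)."""

    def __init__(self, Np, Nt):
        super().__init__()
        self.Np = Np  # number of pores
        self.Nt = Nt  # number of throats

    def __setitem__(self, key, value):
        # Names must identify the element type they describe
        if key.startswith('pore.'):
            n = self.Np
        elif key.startswith('throat.'):
            n = self.Nt
        else:
            raise KeyError("array names must start with 'pore.' or 'throat.'")
        value = np.asarray(value)  # lists, tuples and scalars become arrays
        if value.ndim == 0:
            value = np.full(n, value)  # a scalar is applied to every location
        if value.shape[0] != n:
            raise ValueError(f'{key} must have length {n}')
        super().__setitem__(key, value)

    def labels(self):
        # Boolean arrays are treated as labels
        return [k for k, v in self.items() if v.dtype == bool]

    def props(self):
        # Everything else is treated as a property
        return [k for k, v in self.items() if v.dtype != bool]


d = ArrayDict(Np=4, Nt=3)
d['pore.diameter'] = [0.4, 0.5, 0.6, 0.7]     # list is converted to an array
d['throat.length'] = 1.0                      # scalar is expanded to Nt values
d['pore.left'] = [True, False, False, False]  # Boolean data becomes a label
print(d['throat.length'], d.labels(), d.props())
```

Overriding ``__setitem__`` is one simple way to enforce such rules at the moment data is written, so that code reading from the dictionary can rely on always receiving a full-length Numpy array.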

## Representing Topology

### Storage of Topological Connections

Pore network modeling is actually just a form of graph theory.  

> During the development of OpenPNM, it was debated whether existing Python graph theory packages (such as [graph-tool](http://graph-tool.skewed.de/) or [NetworkX](http://networkx.github.io/)) should be used to store the network topology.  It was decided that network property data should be simply stored as [Numpy ND-arrays](http://www.numpy.org/) as discussed above.  This format makes the data storage very transparent and familiar since all engineers are used to working with arrays (i.e. vectors), and also very efficient since it allows code vectorization.  Fortuitously, around the same time as this discussion, Scipy introduced the [compressed sparse graph library](http://docs.scipy.org/doc/scipy/reference/sparse.csgraph.html), which contains numerous graph theory algorithms that take Numpy arrays as arguments.  Therefore, OpenPNM's topology model is implemented using Numpy arrays, as described in detail below:

The only topology definitions required by OpenPNM are:

1. A throat connects exactly two pores, no more and no less

2. Throats are non-directional, meaning that flow in either direction is equal

Other general, but non-essential rules are:

3. Pores can have an arbitrary number of throats, including zero; however, pores with zero throats lead to singular matrices and other problems so should be avoided.

4. Two pores are generally connected by no more than one throat.  It is technically possible in OpenPNM to have multiple throats between a pair of pores, but it is not rigorously supported so unintended results may arise.
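
As a minimal sketch of how these rules could be checked with plain Numpy, consider the following. The ``conns`` array is a made-up set of connections, deliberately containing an isolated pore and a repeated pore pair; OpenPNM provides its own health checks, which are not shown here.

```python
import numpy as np

# Hypothetical connections for a 5-pore network (pore 4 is left isolated)
Np = 5
conns = np.array([[0, 1],
                  [1, 2],
                  [2, 3],
                  [1, 2]])  # repeats the connection between pores 1 and 2

# Rule 1 holds by construction: every row holds exactly two pore indices.
# A throat joining a pore to itself would still be suspicious, so flag it.
self_loops = np.where(conns[:, 0] == conns[:, 1])[0]

# Rule 2: direction is irrelevant, so sort each row before comparing throats
ordered = np.sort(conns, axis=1)

# Rule 3: pores that never appear in conns have zero throats
num_throats = np.bincount(conns.flatten(), minlength=Np)
isolated_pores = np.where(num_throats == 0)[0]

# Rule 4: find pore pairs joined by more than one throat
pairs, counts = np.unique(ordered, axis=0, return_counts=True)
duplicated_pairs = pairs[counts > 1]

print(isolated_pores, duplicated_pairs)
```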

### Sparse Adjacency Matrices

In OpenPNM, network topology (or connectivity) is stored as an [adjacency matrix](http://en.wikipedia.org/wiki/Adjacency_matrix).  An adjacency matrix is an *Np*-by-*Np* 2D matrix.  A non-zero value at location (*i*, *j*) indicates that pores *i* and *j* are connected.  Describing the network in this general fashion allows OpenPNM to be agnostic to the type of network it describes.  Another important feature of the adjacency matrix is that it is highly sparse and can be stored with a variety of sparse storage schemes.  OpenPNM stores the adjacency matrix in the 'COO' or 'IJV' format, which essentially stores the coordinates (I,J) and values (V) of the nonzero elements in three separate lists.  This approach results in a property called ``'throat.conns'``; it is an *Nt*-by-2 array that gives the indices of the two pores on either end of a given throat.  The representation of an arbitrary network is shown in the following figure. It has 5 pores and 7 throats, and the ``'throat.conns'`` array contains the (I,J,V) information needed to describe the adjacency matrix.


![](http://i.imgur.com/rMpezCc.png)
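
To make the relationship between ``'throat.conns'`` and the adjacency matrix concrete, the following sketch builds a dense *Np*-by-*Np* matrix from an *Nt*-by-2 array. The connections are made up for illustration and are not those of the network in the figure, and a dense matrix is used here only because it is easy to print.

```python
import numpy as np

Np = 4
# Hypothetical 'throat.conns': one row per throat, giving the two pores it joins
conns = np.array([[0, 1],
                  [0, 2],
                  [1, 2],
                  [2, 3]])

I = conns[:, 0]                     # row coordinates
J = conns[:, 1]                     # column coordinates
V = np.ones(len(conns), dtype=int)  # value stored at each (I, J) location

am = np.zeros((Np, Np), dtype=int)
am[I, J] = V  # one entry per throat
am[J, I] = V  # mirrored entries, since throats are non-directional
print(am)
```

Each throat contributes two nonzero entries, so the full matrix holds 2 × *Nt* nonzeros out of *Np*² locations, which is why a sparse format is preferable for realistic network sizes.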


### Additional Thoughts on Sparse Storage

* In pore networks there is (usually) no difference between traversing from pore *i* to pore *j* or from pore *j* to pore *i*, so a 1 is also found at location (*j*, *i*) and the matrix is symmetrical.

* Since the adjacency matrix is symmetric, it is redundant to store the entire matrix when only the upper triangular part is necessary.  The ``'throat.conns'`` array only stores the upper triangular information, and *i* is always less than *j*.

* Although this storage scheme is widely known as *IJV*, the ``scipy.sparse`` module calls this the Coordinate or *COO* storage scheme.

* Some tasks are best performed on other types of storage schemes, such as *CSR* or *LIL*.  OpenPNM converts between these internally as necessary, but users can generate a desired format using the ``create_adjacency_matrix`` method which accepts the storage type as an argument (i.e. ``'csr'``, ``'lil'``, etc).  For a discussion of sparse storage schemes and their respective merits, see this [Wikipedia article](http://en.wikipedia.org/wiki/Sparse_matrix).
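
The points above can be illustrated directly with ``scipy.sparse``. This is only a sketch using a made-up ``conns`` array; it calls Scipy directly rather than OpenPNM's ``create_adjacency_matrix`` method, and it is not meant to show how OpenPNM performs the conversion internally.

```python
import numpy as np
from scipy import sparse

Np = 4
# Hypothetical 'throat.conns' holding only upper-triangular entries (i < j)
conns = np.array([[0, 1],
                  [0, 2],
                  [1, 2],
                  [2, 3]])
weights = np.ones(len(conns), dtype=int)

# COO (IJV) format: row indices, column indices and values as separate arrays
upper = sparse.coo_matrix((weights, (conns[:, 0], conns[:, 1])), shape=(Np, Np))

# Adding the transpose restores the symmetric half that was not stored
am = (upper + upper.T).tocsr()

# CSR stores each row contiguously, which makes neighbor lookups cheap
neighbors_of_2 = am.indices[am.indptr[2]:am.indptr[3]]
print(np.sort(neighbors_of_2))
```

The COO form is convenient for building the matrix from a list of connections, while the CSR form suits row-wise queries such as finding the neighbors of a given pore.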

### Performing Network Queries

[Finding neighboring pores and throats](Finding Neighbor Pores and Throats.ipynb)

Querying and inspecting the pores and throats in the **Network** is an important tool for working with networks. The various functions that are included on the **GenericNetwork** class will be demonstrated below on the following cubic network:

In [13]:
pn = op.network.Cubic(shape=[10, 10, 10])

### Finding Neighboring Pores and Throats

Given a pore *i*, it is possible to find which pores (or throats) are directly connected to it:

In [14]:
Ps = pn.find_neighbor_pores(pores=1)
print(Ps)

[  0   2  11 101]


In [15]:
Ts = pn.find_neighbor_throats(pores=1)
print(Ts)

[   0    1  901 1801]


These queries can be made more specific by passing a list of pores and specifying the ``mode`` argument.  This is useful for finding neighbors surrounding a set of pores such as the fringes around an invading fluid cluster, or all throats within a cluster:

In [16]:
Ps = pn.find_neighbor_pores(pores=[2, 3, 4], mode='or')  # 'union' is default
print(Ps)

[  1   5  12  13  14 102 103 104]


In [17]:
Ts = pn.find_neighbor_throats(pores=[2, 3, 4], mode='xnor')
print(Ts)

[2 3]


In [18]:
Ts = pn.find_neighbor_throats(pores=[2, 3, 4], mode='exclusive_or')
print(Ts)

[   1    4  902  903  904 1802 1803 1804]


The ``mode`` argument limits the returned results using *set-theory* type logic.  Consider the following two queries:

In [19]:
Ts = pn.find_neighbor_throats(pores=2)
print(Ts)

[   1    2  902 1802]


In [20]:
Ts = pn.find_neighbor_throats(pores=3)
print(Ts)

[   2    3  903 1803]


The *or* is a single set of unique values obtained by combining the two sets, while the *intersection* of these two sets includes only the values present in both (i.e. *2*).  The *difference* of these sets is all the values except those common to both initial sets.  It's possible to specify as many pores as desired, in which case the *set-logic* is a bit less obvious.  More generally:

* ``'or'`` returns a list of unique locations neighboring any input pores
* ``'xor'`` returns a list of locations that are only neighbors to one of the input pores
* ``'xnor'`` returns a list of locations that are neighbors to at least two input pores
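
The three modes can be reproduced by counting how many of the input pores each location neighbors. The sketch below applies this to the throats returned by the two queries above for pores 2 and 3; the counting approach is just one way to express the logic and is not necessarily how OpenPNM implements it.

```python
from collections import Counter

# Throats neighboring pores 2 and 3, taken from the two queries above
neighbors = {2: [1, 2, 902, 1802],
             3: [2, 3, 903, 1803]}

# Count how many of the input pores each throat is a neighbor of
hits = Counter(t for throats in neighbors.values() for t in throats)

mode_or = sorted(hits)                                    # neighbor of any input pore
mode_xor = sorted(t for t, n in hits.items() if n == 1)   # neighbor of exactly one
mode_xnor = sorted(t for t, n in hits.items() if n >= 2)  # neighbor of two or more

print(mode_or)    # [1, 2, 3, 902, 903, 1802, 1803]
print(mode_xor)   # [1, 3, 902, 903, 1802, 1803]
print(mode_xnor)  # [2]
```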

In addition to these neighbor lookups, the GenericNetwork class also offers several other methods that complete the suite of lookup tools:  ``find_connected_pores``, ``find_connecting_throats`` and ``find_nearby_pores``.  There are also many more tools related to Network queries and manipulations in the ``topotools`` module.