# Overview

Once images (and labels) are loaded into the database, access is mediated primarily through two classes. For MongoDB interactions, the majority of needed functionality is provided with the `cnn.db.Manager()` class. For access to underlying imaging data, including loading images as NumPy arrays, the majority of needed functionality is provided with the `cnn.utils.Client()` class.

The `cnn` package, which implements all functionality described here, can be imported as a single module:

In [None]:
import cnn

# MongoDB Interactions

MongoDB interactions are mediated primarily through the `cnn.db.Manager()` class. For the most part, these interactions are abstracted away by higher-level classes that simply use `cnn.db.Manager()` for low-level manipulation. Usually, these higher-level classes (such as `cnn.db.Importer()`) provide all the database interaction that is necessary. However, for debugging or other purposes, direct access to the `Manager()` class may be useful and is thus described here.

## cnn.db.Manager() class

The `cnn.db.Manager()` class provides an object for database interaction. It is initialized with just three arguments:

* ip: MongoDB IP address (by default `127.0.0.1` for local build)
* port: MongoDB port (by default `27017`)
* verbose: output help (by default True)

After initializing the object itself, the next step is to select the Mongo database and the respective Mongo collection. Both are set using the `self.use(db, collection=None)` method. In this call, the only required argument is the database name (`db`). If no `collection` is provided, the first available collection is used by default. This is usually sufficient, since all images are automatically imported into the `images` collection, which by default is typically the only collection available in a database. Both the current `db` and `collection` used by the object are saved in the `self.loc` dictionary.

In [None]:
def init_manager(verbose=True):
    """
    Example showing how to:
    
      (1) Create Manager() class
      (2) Set to the 'mnist' database (and `images` collection)
      
    """
    man = cnn.db.Manager(verbose=verbose)
    man.use('mnist', verbose=verbose)
    
    return man

%time init_manager()

## Basic functionality

Basic `Manager()` functionality to display general status and overview of database contents includes:

* `man.pwd()`: list the current db/collection names
* `man.l()`: list the documents inside the current db/collection
* `man.summary()`: list summary of documents inside current db/collection

In [None]:
def demo_pwd():
    """
    Example showing how to:
    
      (1) Create Manager() class
      (2) List current db/collection names
      
    """
    man = init_manager(verbose=False)
    man.pwd()
    
%time demo_pwd()

In [None]:
def demo_l():
    """
    Example showing how to:
    
      (1) Create Manager() class
      (2) List documents in current db/collection
      
    """
    man = init_manager(verbose=False)
    man.l()
    
%time demo_l()

In [None]:
def demo_summary():
    """
    Example showing how to:
    
      (1) Create Manager() class
      (2) List summary of current db/collection
      
    """
    man = init_manager(verbose=False)
    man.summary()
    
%time demo_summary()

## Database manipulation

Commands for database manipulation are based on names and syntax identical to (or similar to) the MongoDB API. Common useful functions include:

* `man.find_one()`: find a (single) document matching the specified search criteria
* `man.random()`: find random document(s) matching specified criteria
* `man.replace_one()`: replace a single document
* `man.insert_one()`: insert a new document
* `man.update_many()`: update many documents matching specified criteria
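
The MongoDB-style semantics behind these method names can be sketched with plain Python dictionaries. This is a hypothetical in-memory stand-in, not the `cnn.db.Manager()` implementation: the helper functions, the document fields (`studyid`, `mode`, `reviewed`) and the integer return value of `update_many` are all assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for MongoDB-style calls (NOT the cnn.db.Manager code)

def matches(doc, query):
    """Return True if every key/value pair in the query equals the document's value."""
    return all(doc.get(key) == value for key, value in query.items())

def find_one(docs, query):
    """Return the first document matching the query, or None if there is none."""
    return next((d for d in docs if matches(d, query)), None)

def update_many(docs, query, update):
    """Apply a {'$set': {...}} style update to every match; return the match count."""
    count = 0
    for d in docs:
        if matches(d, query):
            d.update(update['$set'])
            count += 1
    return count

# Assumed document layout, for illustration only
docs = [
    {'studyid': 'a', 'mode': 'train'},
    {'studyid': 'b', 'mode': 'valid'},
    {'studyid': 'c', 'mode': 'train'},
]

first_train = find_one(docs, {'mode': 'train'})
n_updated = update_many(docs, {'mode': 'train'}, {'$set': {'reviewed': True}})
```

The query and update arguments are ordinary dictionaries, which is the same shape the MongoDB API expects for filter and update documents.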

# Data access

To access underlying pixel data of imported files, use the `cnn.utils.Client()` class initialized with the proper `app_context` variable.

## Application context

As in the data import pipeline, the behavior of database interactions is defined by the `app_context` dictionary. Please refer to the prior documentation for options regarding manually defining, initializing, saving, loading and deleting application contexts.

In these examples, we will assume that data has already been imported into the 'mnist' database using the `app_context` named 'mnist_test', the default values of which have already been saved to the `app` database.

In [None]:
def load_app_context():
    """
    Example showing how to:
    
      (1) Load a saved app_context
    
    """
    app_context = cnn.db.init_app_context({'name': 'mnist_test'})
    
    return app_context

%time load_app_context()

## cnn.utils.Client() class

The `cnn.utils.Client()` class provides an object for easy interaction with imported data. To initialize it, simply provide `Client()` with an appropriate `app_context` that defines the `db` (and `collection`) to be accessed. In addition, specific tags for series and/or label data must be provided as a list to identify the correct subset of data to load. In our earlier test import, only a single tag, `mnist-all`, was added. However, over the course of many experiments, a single dataset may be used for a number of different related projects. In these situations, the multiple tags associated with any particular study can be used to identify the correct cohort of patients for any given project.

In [None]:
def init_client():
    """
    Example showing how to:
    
      (1) Load app_context
      (2) Specify app_context['tags-series']
      (3) Initialize client
      
    """
    app_context = load_app_context()
    app_context['tags-series'] = ['mnist-all']
    client = cnn.utils.Client(app_context)

    return client

## Loading data

There are several ways to load data with the `client` object. Regardless of method, the return value is a dictionary `vol` where `vol['dat']` contains a 4D NumPy array with the selected series (Z x Y x X x C, where the C requested series are stacked along the fourth dimension if more than one series is requested), and `vol['lbl']` contains the corresponding 4D NumPy array with the selected label masks (if requested). The simplest method is to use the `client.load_next(random)` function, which loads data either sequentially or randomly:

1. `client.load_next(random=False)`: loads single example of data sequentially from database (matching `app_context`)
2. `client.load_next(random=True)`: loads random example of data from database (matching `app_context`)
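
To make the Z x Y x X x C layout concrete, the following sketch builds a `vol`-like dictionary by hand with NumPy. The two series, their dtypes and the empty label mask are made-up placeholders; only the dictionary keys and the axis order come from the description above.

```python
import numpy as np

# Two hypothetical series from the same study, each Z x Y x X
series_a = np.zeros((1, 28, 28), dtype=np.float32)
series_b = np.ones((1, 28, 28), dtype=np.float32)

# Stack the series along a new fourth (channel) axis: Z x Y x X x C
dat = np.stack([series_a, series_b], axis=-1)

# Placeholder label mask with a single channel
lbl = np.zeros((1, 28, 28, 1), dtype=np.uint8)

vol = {'dat': dat, 'lbl': lbl}
```

Indexing `vol['dat'][..., c]` then recovers the *c*-th series as a 3D array.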

In [None]:
def load_next():
    """
    Example showing how to:
    
      (1) Initialize client
      (2) Load first study 
      (3) Load random study
      
    """
    client = init_client()
    
    vol = client.load_next(random=False)
    assert vol['dat'].shape == (1, 28, 28, 1)
    
    vol = client.load_next(random=True)
    assert vol['dat'].shape == (1, 28, 28, 1)
    
%time load_next()

While simple to use, the `client.load_next()` method offers relatively little control over opening a specific exam. If such control is needed, use the `client.load()` method instead. This (slightly more complex) approach is divided into two stages:

1. Determine the MongoDB document of the study to load.
2. Feed document into `client.load(doc, infos={})`

The two key parameters for the `client.load()` method are `doc` and `infos` dictionaries. See below for further discussion.

### Document

Depending on approach, the document can be obtained using the `cnn.db.Manager()` class (if you wish to search the database for a specific `studyid` or some other criteria), or alternatively be identified using the `client.next_doc()` method. The latter approach allows for fine-tuned control over which data to load, including:

* `client.next_doc(mode)`: specifying either 'train', 'valid', or 'mixed' (default) study
* `client.next_doc(random)`: sequential (random=False; default) or random (random=True) study
* `client.next_doc(studyid)`: load specific studyid
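
The two-stage approach can be wrapped in a small helper, sketched below. The `load_study` function is hypothetical, and passing `studyid` and `infos` as keyword arguments is an assumption based on the signatures quoted in this section rather than a confirmed part of the API.

```python
def load_study(client, studyid, infos=None):
    """Look up the document for one study, then load its data (hypothetical helper)."""
    # Stage 1: determine the MongoDB document of the study to load
    doc = client.next_doc(studyid=studyid)

    # Stage 2: feed the document into client.load(); an empty dict means default loading
    return client.load(doc, infos=infos if infos is not None else {})
```

Given an initialized client, a call such as `load_study(client, studyid)` would return the same kind of `vol` dictionary as `client.load_next()`, assuming the keyword names above are accepted.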

### Infos dictionary

The `infos` dictionary defines how underlying data will be loaded into memory. Generally speaking, there are two main modes of loading image data:

1. Loading the entire image (along a specific dimension)
2. Loading a portion of the image (along a specific dimension)

Generally, the three values in the `infos` dictionary that define the specific mode are the `infos['shape']`, `infos['tiles']`, and `infos['rsize']` entries.

#### Loading an entire image

By default, if none of the values in the `infos` dictionary are set (e.g. an empty dictionary `{}` is passed), the entire image is loaded at full resolution. This can be set explicitly by defining `infos['full_res'] = True`, but doing so is not required. If a specific image `shape` is required (as is often the case for neural network training), set the `infos['shape']` parameter; all images are then returned at the provided shape regardless of original matrix size. All image data is assumed to be 3D (2D images are defined with a shape of 1 in the z-direction), so this value should be defined as a 3-element Python list `[z, y, x]`. Note that due to the progressive layout of `*.mvk` files, only the amount of data required for the defined image shape is ever loaded into memory (e.g. smaller image shapes load faster than larger image shapes).

In [None]:
def load_shape():
    """
    Example showing how to:
    
      (1) Initialize client
      (2) Load an entire image at full resolution
      (3) Load an entire image at partial resolution
      
    """
    client = init_client()
    
    # Load at full resolution (default)
    vol = client.load_next(infos={})
    assert vol['dat'].shape == (1, 28, 28, 1)
    
    # Load at full resolution (explicit)
    vol = client.load_next(infos={'full_res': True})
    assert vol['dat'].shape == (1, 28, 28, 1)
    
    # Load at partial resolution (1 x 14 x 14 shape)
    vol = client.load_next(infos={'shape': [1, 14, 14]})
    assert vol['dat'].shape == (1, 14, 14, 1)
    
%time load_shape()

#### Loading a portion of an image

Oftentimes for more complex training strategies, only a portion of an image may be required. Examples here include loading arbitrary *n*-slice "slabs" of a 3D cross-sectional volume (CT, MR) or image patches for a high-resolution file (mammogram, pathology digital slides). 

To load a portion of the image along any given dimension, one must specify either the `infos['tiles']` or `infos['rsize']` entry in the `infos` dictionary in addition to the `infos['shape']` value. Setting either `tiles` or `rsize` changes the meaning of `infos['shape']` to instead indicate that a "slab" or "patch" of size N is requested (instead of the entire image resized to the provided shape).

**`infos['tiles']`**: Setting any *positive* value as the *i*-th entry in `tiles` indicates that along that particular dimension, a total of `infos['shape'][i]` pixels (or voxels) will be loaded, spaced at `infos['tiles'][i]` mm. For example, if `infos['shape'] = [3, 256, 256]` and `infos['tiles'] = [5, 0, 0]`, the final loaded volume will contain 3-slice slabs at 5-mm spacing with the entire in-plane image resized to 256 x 256. By contrast, with `infos['tiles'] = [5, 1, 1]`, the final loaded volume will contain 256 x 256 image *patches* with voxels spaced at 1 mm (i.e. 256 x 256 mm image patches).
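
The arithmetic behind the two examples above can be checked with a short sketch. The `describe_axes` helper is hypothetical and only restates the rule described here (a positive `tiles` entry requests samples at that spacing, a zero entry resizes the whole axis); the nominal extent is taken to be the number of samples times the spacing.

```python
def describe_axes(shape, tiles):
    """Describe, per axis (z, y, x), what a shape/tiles pair requests (hypothetical)."""
    out = []
    for n, spacing in zip(shape, tiles):
        if spacing > 0:
            # n samples at `spacing` mm each: nominal extent of n * spacing mm
            out.append(('patch', n * spacing))
        else:
            # tiles entry of 0: the entire axis is resized to n samples
            out.append(('full', None))
    return out

# 3-slice slabs at 5-mm spacing, entire in-plane image resized to 256 x 256
slab = describe_axes([3, 256, 256], [5, 0, 0])

# 3-slice slabs at 5-mm spacing, 256 x 256 in-plane patches at 1-mm spacing
patch = describe_axes([3, 256, 256], [5, 1, 1])
```

Here `slab` describes a 15-mm slab in z with the full in-plane field of view, while `patch` additionally limits the in-plane extent to 256 mm along each axis.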

In [None]:
def load_tiles():
    """
    Example showing how to:
    
      (1) Initialize client
      (2) Load 14 x 14 patches 
      
    """
    client = init_client()
    
    infos = {
        'shape': [1, 14, 14],
        'tiles': [0, 1, 1]
    }
    
    vol = client.load_next(infos=infos)
    assert vol['dat'].shape == (1, 14, 14, 1)
    
%time load_tiles()

**`infos['rsize']`**: Setting any *positive* value as the *i*-th entry in `rsize` indicates that along that particular dimension, the image will first be resized to a shape of `infos['rsize'][i]` before a total of `infos['shape'][i]` pixels (or voxels) will be loaded. For example, if `infos['shape'] = [3, 256, 256]` and `infos['rsize'] = [32, 0, 0]`, the original volume will first be resized to a total of 32 slices before loading 3-slice slabs with the entire in-plane image resized to 256 x 256. By contrast, with `infos['rsize'] = [32, 512, 512]`, the original volume will first be resized to a constant size of 32 x 512 x 512 before taking small 3 x 256 x 256 slab-patches (256 x 256 patches spanning 3 slices).
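
The resize-then-crop behavior can be approximated in NumPy as follows. This is only a sketch under stated assumptions: nearest-neighbor indexing stands in for whatever interpolation the library uses, the crop location is drawn uniformly at random, and both helper functions are hypothetical.

```python
import numpy as np

def resize_nearest(vol, size):
    """Nearest-neighbor resize of a Z x Y x X array to the requested size."""
    idx = [np.arange(n) * vol.shape[i] // n for i, n in enumerate(size)]
    return vol[np.ix_(*idx)]

def resize_then_crop(vol, shape, rsize, rng):
    """Resize according to rsize, then crop a random window of the given shape."""
    # An rsize entry of 0 has no intermediate size; this sketch resizes that axis
    # directly to shape[i], so the crop spans the whole axis
    target = [r if r > 0 else s for r, s in zip(rsize, shape)]
    resized = resize_nearest(vol, target)

    # Pick a random starting corner such that the window fits inside the volume
    start = [int(rng.integers(0, t - s + 1)) for t, s in zip(target, shape)]
    window = tuple(slice(a, a + s) for a, s in zip(start, shape))
    return resized[window]

rng = np.random.default_rng(0)
volume = np.arange(64 * 32 * 32, dtype=np.float32).reshape(64, 32, 32)

# Resize 64 slices to 32, then take a 3-slice slab with the in-plane image at 16 x 16
out = resize_then_crop(volume, shape=[3, 16, 16], rsize=[32, 0, 0], rng=rng)
```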

In [None]:
def load_rsize():
    """
    Example showing how to: 
    
      (1) Initialize client
      (2) Load 14 x 14 patches
    
    """
    client = init_client()
    
    # First resize to 28 x 28, then load 14 x 14 patches
    infos = {
        'shape': [1, 14, 14],
        'rsize': [0, 28, 28]
    }
    
    vol = client.load_next(infos=infos)
    assert vol['dat'].shape == (1, 14, 14, 1)
    
%time load_rsize()

**`infos['point']`**: By default, *random* slabs or patches within the volume (or image) will be loaded. If a specific point within the volume (or image) is requested, the normalized coordinate (scaled between 0 and 1) can be provided. If stratified sampling distributions are provided (see below), then random patches will be selected such that the distribution of labels matches the specified sampling requirements.  
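
One way a normalized coordinate could map to a patch is sketched below. Whether the library treats the point as the patch center or as a corner is not stated here, so the centering and the clamping at the image border are assumptions, and `crop_at_point` is a hypothetical helper.

```python
import numpy as np

def crop_at_point(vol, shape, point):
    """Crop a patch of the given shape centered on a normalized (0 to 1) point."""
    window = []
    for dim, size, p in zip(vol.shape, shape, point):
        center = int(round(p * (dim - 1)))        # normalized coordinate to voxel index
        start = center - size // 2
        start = max(0, min(start, dim - size))    # clamp so the patch stays inside
        window.append(slice(start, start + size))
    return vol[tuple(window)]

image = np.arange(28 * 28).reshape(1, 28, 28)

# Patch around the image center, and a patch requested at the top-left corner
center_patch = crop_at_point(image, shape=[1, 14, 14], point=[0.5, 0.5, 0.5])
corner_patch = crop_at_point(image, shape=[1, 14, 14], point=[0.0, 0.0, 0.0])
```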

#### Data Augmentation

Data augmentation may be enabled by setting the `infos['affine']` flag to True. If set, a random affine matrix will be applied to the original image. If a mask label is defined as part of `app_context`, then the same affine matrix will be applied to the label(s) as well. The affine parameters default to the values shown here and may be redefined as needed:
```
infos = {
    'affine': True,
    'translation': [-10, 10],      # pixels or voxels
    'scale': [0.5, 1.5],           # 50% to 150%
    'rotation': [-1.57, 1.57],     # radians (approximately -90 to +90 degrees)
    'shear': [-0.1, 0.1]           
}
```
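
For intuition, a random 2D affine matrix can be sampled from ranges like these as sketched below. This is not the library's implementation: the uniform sampling, the isotropic scale, the composition order and the `random_affine_2d` helper are all assumptions chosen for illustration.

```python
import numpy as np

def random_affine_2d(params, rng):
    """Sample one 3 x 3 homogeneous 2D affine matrix from [low, high] ranges."""
    tx, ty = rng.uniform(*params['translation'], size=2)
    s = rng.uniform(*params['scale'])
    theta = rng.uniform(*params['rotation'])
    k = rng.uniform(*params['shear'])

    rotate = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                       [np.sin(theta), np.cos(theta), 0.0],
                       [0.0, 0.0, 1.0]])
    shear = np.array([[1.0, k, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    scale = np.diag([s, s, 1.0])
    translate = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])

    # Applied right to left: scale, then shear, then rotate, then translate
    return translate @ rotate @ shear @ scale

params = {
    'translation': [-10, 10],
    'scale': [0.5, 1.5],
    'rotation': [-1.57, 1.57],
    'shear': [-0.1, 0.1],
}
matrix = random_affine_2d(params, np.random.default_rng(0))
```

Applying the same `matrix` to both the image and its label mask keeps the two aligned, which is the behavior described above.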

### Stratified sampling

In addition, if the `client.dist` dictionary is defined, then the documents will be drawn from the specified stratified sampling distribution. This strategy forces data to be loaded at specific ratios regardless of underlying population distribution. For example:
```
client.dist = {
    1: 0.25,
    2: 0.25,
    3: 0.40,
    4: 0.10
}
```
In this 4-class example, such a definition would force data to be sampled such that approximately 25% of data has label==1, 25% has label==2, 40% has label==3, and 10% has label==4.
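
The effect of such a distribution can be simulated with the standard library alone. The sketch below draws labels with weights taken from a dictionary shaped like `client.dist`; how the client itself selects documents for each label is not shown here, and the sample size and seed are arbitrary choices.

```python
import random
from collections import Counter

# Same ratios as the example distribution above
dist = {1: 0.25, 2: 0.25, 3: 0.40, 4: 0.10}

# Draw labels with probability proportional to the requested ratios
rng = random.Random(0)
labels = rng.choices(list(dist.keys()), weights=list(dist.values()), k=10000)

# Empirical frequency of each label in the drawn sample
freq = {label: n / len(labels) for label, n in Counter(labels).items()}
```

With enough draws, the values in `freq` approach the requested ratios regardless of how common each label is in the underlying population.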