# Overview

The import pipeline consists of a series of steps needed to convert and organize raw heterogeneous data sources (DICOM objects, NIIGZ, MHA, PNG, JPG, etc.) into a uniform, easy-to-access, NoSQL database-driven system. Once imported via this pipeline, any underlying data can be accessed using a standard modular interface. This simplifies deep learning algorithm development, as any imported dataset (or data subset) can be used as input to arbitrary algorithm architectures with minimal changes in code.

This tutorial and series of unit tests cover an overview of the data import pipeline, including:

1. Application context
2. MongoDB document (JSON) schema 
3. Using the `Importer()` class
4. Using the `Manager()` class

The `cnn` package, which implements all functionality described here, can be imported as a single module:

In [None]:
import cnn

# Application context

The `app_context` is a Python dictionary used to define the behavior of database interactions. When used during data import, it defines the target database name (and collection) within MongoDB that data will be placed into (in addition to MongoDB network settings such as IP address/port if necessary). When used during data access, it defines the source database name (and collection) within MongoDB from which data will be loaded.

## Defining the application context

There are several ways to define an application context. The simplest is to manually define the `app_context` dictionary. When manually defining an application context in this way, a minimum of two fields are required:

* `name`: the name of the application context
* `db`: the name of the Mongo database

Note that the application context itself is a composite of configuration settings including the underlying Mongo `db` in addition to unique settings such as how the underlying data will be loaded, preprocessed or filtered. Thus any given Mongo `db` may be part of a number of application contexts.
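As a minimal sketch of this point, the two dictionaries below describe separate application contexts backed by the same Mongo `db`. The context names and the `missing_required_fields()` helper are hypothetical and only illustrate the two required fields; they are not part of the `cnn` package.

```python
# Two application contexts that share the same underlying Mongo database.
# NOTE: both context names are hypothetical examples.
app_context_train = {
    'name': 'mnist_train',
    'db': 'mnist',
}
app_context_eval = {
    'name': 'mnist_eval',
    'db': 'mnist',
}

def missing_required_fields(app_context, required=('name', 'db')):
    """Return the required fields that are absent from app_context."""
    return [key for key in required if key not in app_context]
```

Both contexts point at the same `db` but are distinguished by `name`, so each could later carry its own loading or preprocessing settings.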

In [1]:
def define_app_context():
    """
    Example showing how to:
    
      (1) Define an app_context manually
      
    """
    app_context = {
        'name': 'mnist_test',    # name of application context
        'db': 'mnist'            # name of Mongo database
    }
    
    return app_context

As needed, many classes and methods in this package will fill in the app_context with other default values automatically. If for some reason you need to manually fill in the default values yourself, you can use the `cnn.db.init_app_context()` method as shown here:

In [None]:
def init_app_context():
    """
    Example showing how to:
    
      (1) Define an app_context manually
      (2) Initialize the app_context with additional default values
      
    """
    app_context = define_app_context()
    app_context = cnn.db.init_app_context(app_context)
    
    return app_context

%time init_app_context()

## Saving, loading and deleting application contexts

As needed, `app_context` dictionaries can also be saved into a special Mongo database named `app`. The pertinent methods include:

* `cnn.db.save_app_context(app_context)`
* `cnn.db.remove_app_context(app_context)`

Note also that once an `app_context` has been saved, the `cnn.db.init_app_context()` method will first cross-reference any matching app_contexts in the database (based on provided `app_context['name']`) before using the generic default values.
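The lookup order described above can be pictured with plain dictionaries. This is only a sketch under stated assumptions: `SAVED_CONTEXTS`, `GENERIC_DEFAULTS` and `resolve_app_context()` are hypothetical stand-ins, and the rule that explicitly provided values take precedence over saved ones is an assumption rather than documented behavior of `cnn.db.init_app_context()`.

```python
# Hypothetical stand-ins for contexts saved in the `app` database and for
# the package's generic defaults; the real values live inside the cnn package.
SAVED_CONTEXTS = {
    'mnist_test': {
        'name': 'mnist_test',
        'db': 'mnist',
        'dst_root': '/some/random/path',
    },
}
GENERIC_DEFAULTS = {'dst_root': None}

def resolve_app_context(app_context):
    """Fill in app_context from a saved match first, then from generic defaults."""
    resolved = dict(GENERIC_DEFAULTS)
    # A saved context is matched on the provided 'name'
    resolved.update(SAVED_CONTEXTS.get(app_context.get('name'), {}))
    # Assumption: values passed in explicitly override saved ones
    resolved.update(app_context)
    return resolved
```

With this ordering, passing only `{'name': 'mnist_test'}` recovers the saved `db` and `dst_root`, while an unknown name falls back to the generic defaults.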

In [None]:
def save_app_context():
    """
    Example showing how to:
    
      (1) Initialize an app_context 
      (2) Change a single field
      (3) Save the modified app_context into the database
      
    """
    app_context = init_app_context()
    app_context['dst_root'] = '/some/random/path'
    cnn.db.save_app_context(app_context)
    
    return app_context

%time save_app_context()

In [None]:
def load_app_context():
    """
    Example showing how to:
    
      (1) Load a saved app_context
      
    """
    app_context = cnn.db.init_app_context({'name': 'mnist_test'})
    
    return app_context

%time load_app_context()

In [None]:
def remove_app_context():
    """
    Example showing how to:
    
      (1) Delete a saved app_context
      
    """
    cnn.db.remove_app_context({'name': 'mnist_test'})
    
%time remove_app_context()

# MongoDB

## Document schema

Each document in the database is a single dictionary (JSON object) that represents a single imaging study. This single document contains:

1. All series obtained in the study.
2. All labels for the study.
3. Any additional metadata.

General organization is as follows:

```
doc = {
  'series': [list of series],
  'labels': [list of labels],
  'study': {dict of additional metadata ...}
}
```

### Series-level information 

`doc['series']` contains a list of dictionaries, each dictionary representing information from a single series. Thus, `doc['series'][0]` contains the first series, `doc['series'][1]` the second series, etc. Any given series dictionary contains the following structure:

```
doc['series'][0] = {
  'seriesid': (str) id, 
  'description': (str) description of series,
  'data': [list of data information]
}
```

The `data` entry in the series dictionary contains a list of dictionaries, each dictionary representing specific information about the series. This information is stored as a list so that, if necessary, multiple transformations of the same underlying data can be stored in the document, for example if one wishes to store both the original and co-registered versions of the same series. In this case, the original data could be stored in `doc['series'][0]['data'][0]` while the co-registered data volume could be stored in `doc['series'][0]['data'][1]`. Any given data dictionary contains the following structure:

```
doc['series'][0]['data'][0] = {
  'file': (str) path to imported *.mvk file,
  'hash': (str) unique hash for voxel data,
  'dims': (list) Z x I x J describing voxel size,
  'shape': (list) Z x I x J x N describing volume shape,
  'slices': (int) # of slices,
  'tags': (list) tags that describe data,
  'type': (str) description of image space (default = 'orig'),
  'name': (str) unique name describing data (optional)
} 
```
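Because a series may hold several versions of the same data, it is often convenient to select an entry by its `type`. The helper below is a minimal sketch: `find_data()` is a hypothetical function (not part of the `cnn` package), and the series id, file paths and the `'coreg'` type name are made up for illustration.

```python
def find_data(series, data_type='orig'):
    """Return the first data entry of a series whose 'type' matches, else None."""
    for entry in series.get('data', []):
        # 'type' defaults to 'orig' per the schema above
        if entry.get('type', 'orig') == data_type:
            return entry
    return None

series = {
    'seriesid': 'series-001',                    # hypothetical id
    'description': 'example series',
    'data': [
        {'file': 'orig.mvk', 'type': 'orig'},    # hypothetical paths
        {'file': 'coreg.mvk', 'type': 'coreg'},  # hypothetical type name
    ],
}

original = find_data(series)
coregistered = find_data(series, 'coreg')
```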

### Labels-level information 

`doc['labels']` contains a list of dictionaries, each dictionary representing information from a single label. Thus, `doc['labels'][0]` contains the first label, `doc['labels'][1]` contains the second label, etc. Any given labels dictionary contains the following structure (same as series-data, plus three label-specific fields):

```
doc['labels'][0] = {
  'file': (str) path to imported *.mzl file,
  'hash': (str) unique hash for voxel data,
  'dims': (list) Z x I x J describing voxel size,
  'shape': (list) Z x I x J x N describing volume shape,
  'slices': (int) # of slices,
  'tags': (list) tags that describe data,
  'type': (str) description of image space (default = 'orig'),
  'name': (str) unique name describing data (optional),
  'bounding_box': coordinates of bounding box,
  'nnz': # of nonzero voxels for each label value,
  'max_slice': slice location of maximum label 2D area 
}
```
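The three label-specific fields can be illustrated on a tiny volume. The sketch below is not the package's implementation: it assumes that `bounding_box` is the (min, max) index range of nonzero voxels along each axis, that `nnz` counts voxels per nonzero label value, and that `max_slice` is the Z index with the largest nonzero area. The volume is a plain nested list indexed as `[Z][I][J]`, and `label_stats()` is a hypothetical helper.

```python
from collections import Counter

def label_stats(volume):
    """Compute summary fields for a label volume given as nested lists [Z][I][J]."""
    nnz = Counter()    # voxel count per nonzero label value
    area = Counter()   # nonzero voxel count per slice
    coords = []
    for z, plane in enumerate(volume):
        for i, row in enumerate(plane):
            for j, value in enumerate(row):
                if value != 0:
                    nnz[value] += 1
                    area[z] += 1
                    coords.append((z, i, j))
    if not coords:
        return {'bounding_box': None, 'nnz': {}, 'max_slice': None}
    mins = [min(c[axis] for c in coords) for axis in range(3)]
    maxs = [max(c[axis] for c in coords) for axis in range(3)]
    return {
        'bounding_box': list(zip(mins, maxs)),   # [(z0, z1), (i0, i1), (j0, j1)]
        'nnz': dict(nnz),
        'max_slice': max(area, key=area.get),
    }

volume = [
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 2, 1], [0, 0, 0]],
]
stats = label_stats(volume)
```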

### Meta-data 

`doc['study']` contains the remaining metadata about the study. This dictionary can be modified to include more or less anonymized DICOM header data based on the level of detail required for a particular use case. Common fields include:

```
doc['study'] = {
  'studyid': (str) identifier for study (commonly accession number),
  'patientid': (str) patient medical record number,
  'description': (str) description of study type,
  'date': (str) date of exam,
  'slices': (int) number of slices for the entire exam,
  'valid': (list) validation cross-fold(s)
}
```

# Importer() class

The `Importer()` class is used to create objects for importing data into the database. There are generally two primary steps in importing data:

1. Creating a Python list of documents identifying files to import (commonly saved as a pickle file)
2. Feeding the documents list into the `Importer()` class for data import

## MNIST data

For purposes of this tutorial and unit testing, we will download a custom copy of the MNIST dataset organized into separate `*.npy` files. The label for each example is encoded in the file name `[x]-[id_number]`, where the first number `[x]` represents the ground-truth label.
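As a small illustration of this naming scheme, the ground-truth label can be recovered from a file name as sketched below. `parse_label()` is a hypothetical helper and the example path in the usage note is made up; only the `[x]-[id_number]` pattern comes from the dataset description.

```python
import os

def parse_label(path):
    """Extract the integer label [x] from a file named [x]-[id_number].npy."""
    stem = os.path.splitext(os.path.basename(path))[0]
    label, _, _ = stem.partition('-')
    return int(label)
```

For example, `parse_label('data/raw/mnist/7-00042.npy')` returns `7`.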

In [None]:
import glob, os, subprocess 

def download_mnist():
    """
    Example showing how to:
    
      (1) Check if MNIST data exists currently
      (2) If not, download and unzip the files into data/
      
    """
    mnist = glob.glob('**/mnist.zip', recursive=True)
    if not mnist:
        
        # Clone and unzip data
        commands = [
            'git clone https://github.com/peterchang77/data',
            'unzip -q data/raw/mnist.zip -d data/raw',
        ]
        
        for c in commands:
            subprocess.run(c.split(' '))
    
    else:
        print('MNIST data already present')

%time download_mnist()

## Creating import documents

As described in the MongoDB schema above, each study is organized as a single document (JSON object). To import data into the MongoDB, first create a list of documents following the predefined schema structure, such that each document in the list corresponds to a study that you wish to import. Note that importing a document containing a `doc['study']['studyid']` value that already exists in the database will simply **merge** the data (series, labels, meta-data) from the new document into the existing MongoDB document; any data that has already been imported will be ignored, while any data that is new will be appended.
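The merge behavior can be sketched with plain dictionaries. This is an illustration only, not the importer's actual logic: `merge_document()` is a hypothetical helper, and treating an entry as "already imported" when an identical dictionary is present is an assumption made for the example.

```python
import copy

def merge_document(existing, new):
    """Merge `new` into a copy of `existing`: skip known entries, append new ones."""
    merged = copy.deepcopy(existing)
    # Study-level metadata: only add fields that are not already present
    study = merged.setdefault('study', {})
    for field, value in new.get('study', {}).items():
        study.setdefault(field, value)
    # Series and labels: append entries that are not already in the document
    for field in ('series', 'labels'):
        current = merged.setdefault(field, [])
        for entry in new.get(field, []):
            if entry not in current:  # assumption: identical entry == already imported
                current.append(copy.deepcopy(entry))
    return merged

existing = {
    'study': {'studyid': 'a'},
    'series': [{'data': [{'file': 'x.npy'}]}],
}
new = {
    'study': {'studyid': 'a', 'date': '2020-01-01'},
    'series': [{'data': [{'file': 'x.npy'}]}],
    'labels': [{'file': 'y.npy'}],
}
merged = merge_document(existing, new)
```

Here the repeated series is ignored, while the new label and the new `date` field are added to the merged document.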

It is a good idea to save the import documents list as a separate `*.pkl` file. This will help you keep track of which studies have or have not been imported. Importantly, it will also serve as a key between the filename to import and the studyid within the MongoDB. The naming convention for this file is `documents_[XXX].pkl`, where `[XXX]` is some sort of identifier (keep in mind that a typical workflow will incorporate multiple serial data imports over time).
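A minimal sketch of this convention follows. The helper names `documents_path()`, `save_documents()` and `file_to_studyid()` are hypothetical and not part of the `cnn` package; the `pkls` directory matches the one used later in this tutorial.

```python
import os
import pickle

def documents_path(identifier, root='pkls'):
    """Build the path documents_[XXX].pkl for a given identifier."""
    return os.path.join(root, 'documents_{}.pkl'.format(identifier))

def save_documents(documents, identifier, root='pkls'):
    """Pickle the documents list using the naming convention above."""
    os.makedirs(root, exist_ok=True)
    path = documents_path(identifier, root)
    with open(path, 'wb') as f:
        pickle.dump(documents, f)
    return path

def file_to_studyid(documents):
    """Build the key mapping each file to import onto its studyid."""
    mapping = {}
    for doc in documents:
        studyid = doc['study']['studyid']
        for series in doc.get('series', []):
            for data in series.get('data', []):
                mapping[data['file']] = studyid
        for label in doc.get('labels', []):
            mapping[label['file']] = studyid
    return mapping
```

Keeping one pickle per import batch (for example `documents_all.pkl`) makes it straightforward to look up which source file produced a given studyid.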

During this process, you will be responsible for creating a studyid for each imported exam. A few popular choices include the accession number (if importing from DICOM files), or hashed versions of the accession number or full path to file.
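One way to derive a hashed studyid is sketched below using the standard library. The choice of SHA-256 and the truncation to 16 hexadecimal characters are assumptions made for illustration, and `hashed_studyid()` is a hypothetical helper rather than a function provided by the `cnn` package.

```python
import hashlib

def hashed_studyid(value, length=16):
    """Return a deterministic hex identifier derived from an accession number or path."""
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return digest[:length]
```

The same input always yields the same identifier, so re-running an import on the same file or accession number maps to the same study.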

## Document structure

The documents follow the same JSON structure described above, with the one exception that the `file` value is set to the file **to be imported** (not a pointer to the already imported file). At minimum, each document must contain a `studyid` value and `file` paths for `series` and/or `labels` level information.
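These minimum requirements can be checked before an import is attempted. The function below is a hypothetical pre-flight check, not part of the `cnn` package, and it only tests for the presence of the fields named above.

```python
def validate_document(doc):
    """Return a list of problems; an empty list means the minimum requirements are met."""
    problems = []
    if not doc.get('study', {}).get('studyid'):
        problems.append('missing studyid')
    # Collect file paths from series-level data and from labels
    files = [d.get('file') for s in doc.get('series', []) for d in s.get('data', [])]
    files += [label.get('file') for label in doc.get('labels', [])]
    if not any(files):
        problems.append('no file paths in series or labels')
    return problems
```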

In [None]:
import glob, os, pickle

def create_documents():
    """
    Example showing how to:
    
      (1) Create list of import documents
      (2) Save document as pickle file
      
    Note we assume that MNIST data is in data/raw/mnist/*.npy as downloaded above
      
    """
    # Find studies 
    npys = glob.glob('data/raw/mnist/*.npy')
    
    # Append one study that doesn't exist to check error logging
    npys.append(npys[-1].replace('.npy', '_noexist.npy'))
    
    # Here studyid will just be defined by file basename (without *.npy)
    def studyid(path):
        return os.path.splitext(os.path.basename(path))[0]
    
    # Create documents list
    documents = []
    for npy in npys:
        
        documents.append({
            'study': {'studyid': studyid(npy)},
            'series': [{
                'data': [{
                    'file': npy,
                    'type': 'orig',
                    'tags': ['mnist-all']
                    }]
                }]})
    
    # Save the documents list, closing the file handle when done
    os.makedirs('pkls', exist_ok=True)
    with open('pkls/documents_all.pkl', 'wb') as f:
        pickle.dump(documents, f)
    
%time create_documents()

## Importing documents 

Once the documents list has been created, we can now individually import them into the database. Note that the following steps are performed (automatically) during import:

* anonymization (pixel data as file pointers, meta-data as MongoDB documents)
* custom file format conversion (images into `*.mvk` files, labels into `*.mzl` files)
* reshaping all tensors to 4D (N x H x W x C) convention
* splitting data into cross-validation cohorts
* saving the app_context into the database
* saving documents with import errors into logs
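The cross-validation step in the list above can be pictured as dealing studies into cohorts. The sketch below is not how `Importer.validation_split()` is implemented; it is a hypothetical illustration that shuffles studyids with a fixed seed and assigns them round-robin to `folds` groups.

```python
import random

def assign_folds(studyids, folds=5, seed=0):
    """Shuffle study ids and deal them round-robin into `folds` cohorts."""
    shuffled = sorted(studyids)             # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)   # seeded shuffle, independent of global state
    return {sid: i % folds for i, sid in enumerate(shuffled)}
```

With ten studies and `folds=5`, each fold receives exactly two studies, and repeated calls with the same seed produce the same assignment.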

In [None]:
import pickle, cnn

def import_data():
    """
    Example showing how to:
    
      (1) Prepare an app_context for import
      (2) Instantiate a new Importer() object
      (3) Import data
      (4) Split data into validation folds
      
    """
    # Use method above to define a test MNIST app_context
    app_context = define_app_context()
    
    # Importantly, add an import destination directory if not previously defined
    app_context['dst_root'] = 'data/mvk-mzl'
    
    # Create Importer() object, initialized with your app_context
    importer = cnn.db.Importer(app_context)
    
    # Load your documents list and override the importer's documents
    with open('pkls/documents_all.pkl', 'rb') as f:
        documents = pickle.load(f)
    importer.documents = documents
    
    # Import
    importer.run()
    
    # Validation folds
    importer.validation_split(opts={'folds': 5})
    
%time import_data()

In [None]:
import unittest

class UnitTests(unittest.TestCase):
    
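    def test_define_app_context(self):
        
        # Method names must start with 'test' to be discovered by unittest
        app_context = define_app_context()
        self.assertEqual(app_context['name'], 'mnist_test')
        self.assertEqual(app_context['db'], 'mnist')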

In [None]:
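if __name__ == '__main__':
    # In a notebook, pass a placeholder argv and disable exit so the kernel keeps running
    unittest.main(argv=['ignored'], exit=False)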