In [1]:
%pylab inline
import numpy as np
import scipy as sc
import scipy.ndimage as ndi
import pylab as pl
import matplotlib as mpl
from IPython import display
from itertools import islice
rc("image", cmap="gray")
import dlinputs; reload(dlinputs); dli = dlinputs
#from dlinputs import inputs as dli; reload(dli)

Populating the interactive namespace from numpy and matplotlib


SyntaxError: invalid syntax (shardwriter.py, line 137)

# File Reader

Training data is often stored in the file system. The `dlinputs` library provides a number of convenient iterators over such data:

- `itdirtree` - iterates over samples stored in a directory tree
- `itbasenames` - the dataset is a file containing basenames, plus a list of extensions
- `ittabular` - the dataset is a file containing rows with filenames / data

In addition, `find_file` and `find_directory` can be used to write input pipelines that work in many different environments and search for datasets.

## Directory Trees

In [None]:
import dlinputs; reload(dlinputs); dli = dlinputs
data = dli.itdirtree("testdata/dirdata", "png,cls", size=6) | \
       dli.itmap(png=dli.pilreads, cls=int)
sample = data.next()
imshow(sample["png"])
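
The grouping that a directory-tree reader performs can be pictured with a short standard-library sketch. This is not the `dlinputs` implementation: the name `iter_dirtree` is hypothetical, the `__path__` key only mirrors the field printed in the basename examples of this notebook, and skipping incomplete samples is an assumption. The sketch is written for Python 3.

```python
import os
import tempfile
from collections import defaultdict

def iter_dirtree(root, extensions):
    """Yield one dict per sample found under root (hypothetical helper).

    extensions is a comma-separated string such as "png,cls"; only samples
    that have a file for every listed extension are returned.
    """
    wanted = extensions.split(",")
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        groups = defaultdict(dict)
        for name in sorted(filenames):
            base, ext = os.path.splitext(name)
            groups[base][ext.lstrip(".")] = os.path.join(dirpath, name)
        for base in sorted(groups):
            files = groups[base]
            if not all(ext in files for ext in wanted):
                continue  # skip incomplete samples
            sample = {"__path__": os.path.join(dirpath, base)}
            for ext in wanted:
                with open(files[ext], "rb") as stream:
                    sample[ext] = stream.read()  # raw bytes, still undecoded
            yield sample

# Small demonstration on a throwaway directory tree.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "a"))
for name, data in [("a/1.png", b"P1"), ("a/1.cls", b"3"), ("a/2.png", b"P2")]:
    with open(os.path.join(demo_root, name), "wb") as stream:
        stream.write(data)
dirtree_samples = list(iter_dirtree(demo_root, "png,cls"))
```

In the demonstration, `2.png` has no matching `.cls` file, so only the sample with basename `1` is produced.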

## Basename Lists

In [None]:
!sed 5q testdata/dirdata.list

In [None]:
data = dli.itbasenames("testdata/dirdata.list", "png,cls", size=6) | \
       dli.itmap(png=dli.pilreads, cls=int)
sample = data.next()
print sample.keys()
print sample["__path__"]
imshow(sample["png"])

## Making Basename Lists

In [None]:
!find testdata/dirdata -name '*.png' > basenames
!sed 3q basenames

In [None]:
data = dli.itbasenames("basenames", "png,cls", size=6) | \
       dli.itmap(png=dli.pilreads, cls=int)
sample = data.next()
print sample.keys()
print sample["__path__"]
imshow(sample["png"])

## Tabular Dataset Descriptions

In [None]:
!sed 5q testdata/dirdata.tsv

In [None]:
data = dli.ittabular("testdata/dirdata.tsv", "png,cls", size=6) | \
       dli.itmap(png=dli.pilreads, cls=int)
sample = data.next()
print sample.keys()
print sample["__path__png"]
imshow(sample["png"])

## Inline Data in Tabular Datasets

In [None]:
!sed 5q testdata/dirdata.tsv2

In [None]:
data = dli.ittabular("testdata/dirdata.tsv2", "png,_cls", size=6) | \
       dli.itmap(png=dli.pilreads, _cls=int)
sample = data.next()
print sample.keys()
print sample["__path__png"]
imshow(sample["png"])

## Search Paths

In [None]:
path = "/work/DATABASES:./testdata"
dli.find_file(path, "sample.db", verbose=True)

In [None]:
path = "/work:./testdata"
dli.find_directory(path, "dirdata", "10.png", verbose=True)
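
The idea behind a search path can be sketched with the standard library alone. The helper names below are hypothetical and the behavior is assumed rather than taken from `dlinputs`: the path is split on `:`, the first match wins, and `None` is returned when nothing is found. Reading the third argument of `find_directory` as a file that must exist inside the directory is also an assumption based on the call above.

```python
import os
import tempfile

def find_on_path(path, name):
    """Return the first existing root/name along a colon-separated path."""
    for root in path.split(":"):
        candidate = os.path.join(root, name)
        if os.path.exists(candidate):
            return candidate
    return None

def find_directory_on_path(path, name, probe):
    """Return the first directory root/name that contains the file probe."""
    for root in path.split(":"):
        candidate = os.path.join(root, name)
        if os.path.isfile(os.path.join(candidate, probe)):
            return candidate
    return None

# Demonstration: the first root does not exist, the second one does.
search_tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(search_tmp, "dirdata"))
for name in ["sample.db", os.path.join("dirdata", "10.png")]:
    with open(os.path.join(search_tmp, name), "wb") as stream:
        stream.write(b"")
search_path = os.path.join(search_tmp, "no-such-dir") + ":" + search_tmp
found_file = find_on_path(search_path, "sample.db")
found_dir = find_directory_on_path(search_path, "dirdata", "10.png")
```

Listing the roots from most to least preferred lets the same pipeline definition run on machines that keep the data in different places.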

# Database Reader

SQLite databases are convenient for local datasets and can hold up to several terabytes of data. `itsqlite` returns one dictionary per row, with each column as a field.

In [None]:
!sqlite3 testdata/sample.db .schema

In [None]:
data = (dli.itsqlite("testdata/sample.db") |
        dli.itmap(image=dli.pilgray, cls=int))
for sample in data:
    imshow(sample["image"])
    print sample["cls"]
    break

# Tar Record Files

Tar record files are regular tar files. 

In [None]:
!tar -ztvf testdata/sample.tgz | sed 5q

Consecutive files with the same
basename are returned as items in a dictionary; the extension is used
as the key to each entry.

In [None]:
data = dli.ittarreader("testdata/sample.tgz")
for sample in data:
    print sample.keys()
    print sample["__key__"]
    print repr(sample["cls"])
    print repr(sample["png"])[:30]
    break
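
The grouping rule described above (consecutive members with the same basename form one sample, keyed by extension) can be sketched with the standard `tarfile` module. `iter_tar_records` is a hypothetical name, and this is not the `ittarreader` source; the `__key__` field mirrors the one printed above. The sketch is written for Python 3.

```python
import io
import os
import tarfile

def iter_tar_records(fileobj):
    """Group consecutive tar members that share a basename into dicts."""
    current_key, sample = None, {}
    # "r|*" reads the archive as a stream and detects compression itself.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            key, ext = os.path.splitext(member.name)
            if key != current_key and sample:
                yield sample  # basename changed: the previous sample is done
                sample = {}
            if not sample:
                sample["__key__"] = key
            current_key = key
            sample[ext.lstrip(".")] = archive.extractfile(member).read()
    if sample:
        yield sample

def make_demo_tar():
    """Build a small gzip-compressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in [("0.png", b"PNG0"), ("0.cls", b"3"),
                           ("1.png", b"PNG1"), ("1.cls", b"7")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer

tar_records = list(iter_tar_records(make_demo_tar()))
```

Because members are read strictly in order, the same loop works on a file, a pipe, or an HTTP response, which is what makes the format convenient for streaming.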

Usually, the output of `ittarreader` is piped through a stage that decodes the string/buffer fields (`itmap`) and a stage that renames fields (`itren`). Decoders in `itmap` are just functions that map the contents of a field to new contents. For example, `dli.pilgray` decodes a PNG-compressed image into a grayscale image represented as a rank-2 NumPy array.

In [None]:
data = (dli.ittarreader("testdata/sample.tgz") |
        dli.itmap(png=dli.pilgray, cls=int) |
        dli.itren(image="png", cls="cls"))
for sample in data:
    imshow(sample["image"])
    print sample["cls"]
    break
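
Both stages are easy to picture as plain generators. The functions below are hypothetical stand-ins written with nested calls instead of the `|` operator. In particular, dropping every field that is not listed in the rename step is an assumption, suggested by the fact that the cell above lists `cls="cls"` explicitly.

```python
def map_fields(samples, **decoders):
    """Apply decoders[field] to the matching field of every sample."""
    for sample in samples:
        result = dict(sample)  # leave the incoming sample untouched
        for field, decode in decoders.items():
            result[field] = decode(result[field])
        yield result

def rename_fields(samples, **renames):
    """Keep only the listed fields; renames maps new name -> old name."""
    for sample in samples:
        yield {new: sample[old] for new, old in renames.items()}

# Demonstration with stand-in decoders: list() turns bytes into integers.
raw_samples = [{"__key__": "0", "png": b"\x00\x01", "cls": b"3"}]
decoded = map_fields(raw_samples, png=list, cls=int)
renamed = list(rename_fields(decoded, image="png", cls="cls"))
```

Any callable works as a decoder, so image decoding, label parsing, and augmentation can all be expressed the same way.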

`ittarreader` can also read from URLs. The URL can point to any web server, although often it is an S3-compatible storage server such as Minio, Minio XL, Ceph, or Swift.

For desktop installations, a local Minio server is convenient. The cells below use Python's built-in `SimpleHTTPServer` on port 9000 instead.

In [None]:
!/bin/bash -c 'cd testdata && nohup python -m SimpleHTTPServer 9000 &'

In [None]:
data = (dli.ittarreader("http://localhost:9000/sample.tgz") |
        dli.itmap(png=dli.pilgray, cls=int))
for sample in data:
    imshow(sample["png"])
    print sample["cls"]
    break

# Sharded Files

For larger datasets, sharding is a good idea. The list of shards is stored as a JSON-formatted file, referenced by URL.

In [None]:
!wget -O - http://localhost:9000/imagenet.shards | sed 10q

Decoding otherwise works just as with the regular `ittarreader`. Note that the shard reader randomizes shard order by default.

In [None]:
data = (dli.ittarshards("http://localhost:9000/imagenet.shards") |
        # dli.itinfo() |
        dli.itmap(png=dli.pilrgb, cls=int))
for sample in data:
    imshow(sample["png"])
    print sample["cls"]
    break

To make input pipelines portable between environments, you can also specify a `urlpath`: a list of URL roots to search. The roots can also be supplied as a single whitespace-separated string.

In [None]:
urlpath = """
http://mars:9000/
http://jupiter:9000/
http://localhost:9000/
""".strip().split()

data = (dli.ittarshards("imagenet.shards", urlpath=urlpath) |
        dli.itmap(png=dli.pilrgb, cls=int))
for sample in data:
    imshow(sample["png"])
    print sample["cls"]
    break
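
Resolving a relative name against a list of URL roots might look like the following sketch. Everything here is assumed for illustration: the helper `resolve_on_urlpath`, the first-root-that-opens policy, and the injected `opener` argument, which keeps the example runnable without a network. The import is the Python 3 location of `urljoin`.

```python
from urllib.parse import urljoin

def resolve_on_urlpath(name, urlpath, opener):
    """Return (url, stream) for the first root under which name opens."""
    if isinstance(urlpath, str):
        urlpath = urlpath.split()  # accept a whitespace-separated string
    for root in urlpath:
        url = urljoin(root, name)
        try:
            return url, opener(url)
        except (IOError, OSError):
            continue  # this root is unreachable or lacks the file
    raise IOError("{} not found on any URL root".format(name))

# Demonstration with a fake opener that only knows one URL.
available = {"http://localhost:9000/imagenet.shards": b"{}"}

def fake_opener(url):
    if url not in available:
        raise IOError("cannot open " + url)
    return available[url]

demo_urlpath = """
http://mars:9000/
http://jupiter:9000/
http://localhost:9000/
"""
resolved_url, resolved_data = resolve_on_urlpath(
    "imagenet.shards", demo_urlpath, fake_opener)
```

With a real network, `urllib.request.urlopen` could be passed as the opener, since the errors it raises for unreachable hosts derive from `OSError`.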

# Other Input Filters

There are more pipeline components:

- `itshuffle` shuffles samples inline
- `itstandardize` performs image size standardization
- `itbatch` performs batching of inputs

More are being added to `dlinputs`, including in-memory and on-disk caching, more data augmentation, and distributed and parallel pipes.

In [None]:
data = (dli.ittarshards("http://localhost:9000/imagenet.shards") |
        dli.itshuffle(1000) |
        dli.itmap(png=dli.pilrgb, cls=int) |
        dli.itren(image="png", cls="cls") |
        dli.itstandardize((256,256)) |
        dli.itbatch(5))
for sample in data:
    print sample["image"].shape
    imshow(sample["image"][0])
    print sample["cls"]
    break
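
Inline shuffling and batching can be sketched as two small generators. This shows one common approach (a fixed-size shuffle buffer and dictionary-of-lists batches), not the `dlinputs` code: the helper names are hypothetical, and a real batcher would presumably stack images into arrays rather than collect them in lists.

```python
import random

def shuffle_buffered(samples, size, rng=random):
    """Shuffle a stream with a buffer of size samples (size >= 1)."""
    buffer = []
    for sample in samples:
        if len(buffer) < size:
            buffer.append(sample)  # still filling the buffer
            continue
        index = rng.randrange(size)
        yield buffer[index]        # emit a random buffered sample ...
        buffer[index] = sample     # ... and put the new one in its place
    rng.shuffle(buffer)
    for sample in buffer:          # drain what is left at the end
        yield sample

def batch(samples, size):
    """Collect size samples into one dict mapping each field to a list."""
    pending = []
    for sample in samples:
        pending.append(sample)
        if len(pending) == size:
            yield {key: [s[key] for s in pending] for key in pending[0]}
            pending = []
    if pending:  # final, possibly smaller batch
        yield {key: [s[key] for s in pending] for key in pending[0]}

stream = [{"image": i, "cls": i % 3} for i in range(12)]
batches = list(batch(shuffle_buffered(stream, 4, random.Random(0)), 5))
```

A buffer only mixes samples that are close together in the stream, so a larger buffer gives better randomization at the cost of memory.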

# Loadable Inputs and Models

In many applications, it is useful to separate input pipelines and model definitions from the source code of the application. The `dlinputs.loadable` module addresses this problem. It allows input pipelines and models to be defined with arbitrary Python code, but imported by file name rather than with the `import` statement.

In [None]:
!cat input-sample.py

Note that loadable input pipelines can be written using arbitrary Python code; they simply need to return Python iterators.

Each partition of the dataset gets its own `*_data` method. You should generally define at least `training_data` and `test_data`. Method names ending in `_data` are reserved for datasets: every dataset method should end in `_data`, and no other method should.

The file names of loadable input pipelines and models written in Python must end in `.py`, because the loader will eventually also load JSON and YAML definitions of pipelines and models.

In [None]:
factory = dli.loadable.load_input("input-sample.py")
training_data = factory.training_data()
training_data.next().keys()
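
Loading Python code by file name can be done with the standard `importlib` machinery, as in the sketch below. This is an assumption about the general technique, not the `dli.loadable` source: `load_input` here is a stand-in that returns a plain module object, and the demo file contents are invented. It requires Python 3.

```python
import importlib.util
import os
import tempfile

def load_input(path):
    """Load a Python file by path and return it as a module object."""
    if not path.endswith(".py"):
        raise ValueError("only .py definitions are handled in this sketch")
    spec = importlib.util.spec_from_file_location("loaded_input", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # run the file's top-level code
    return module

# Demonstration: a file whose name is not a valid module identifier.
source = (
    "def training_data():\n"
    "    for i in range(3):\n"
    "        yield {'image': [i], 'cls': i}\n"
    "\n"
    "def test_data():\n"
    "    yield {'image': [99], 'cls': 0}\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "input-demo.py")
with open(demo_path, "w") as stream:
    stream.write(source)
demo_factory = load_input(demo_path)
first_sample = next(demo_factory.training_data())
dataset_names = sorted(n for n in dir(demo_factory) if n.endswith("_data"))
```

A file name such as `input-demo.py` contains a hyphen and so cannot be named in an `import` statement, which is one practical reason to load definitions by path. The `_data` suffix convention also makes the available partitions discoverable by name.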

Defining dataset iterators in this way makes it possible to build tools that operate on datasets. For example, `show-input` prints information about a dataset iterator and can optionally benchmark it. Other tools broadcast datasets over the network, dump them into sharded files, and so on.

In [None]:
!./show-input input-sample.py

# Client Server Pipelines

It is often useful to distribute preprocessing across multiple CPU-only servers and then send the data on to the deep learning model on a GPU machine. The `zmqserver` and `itzmq` functions make this easy. These functions use a simple and time-efficient encoding of tensors in multipart ZMQ messages. With `itzmq`, it is also easy to connect to ZMQ-based servers written in other languages, to build efficient PUB/SUB pipelines for training many models simultaneously, and to build efficient wide-area distribution trees for training data across data centers.

In [None]:
%%writefile server.py
import dlinputs as dli
data = dli.ittarreader("testdata/sample.tgz") | \
       dli.itmap(png=dli.pilrgb, cls=int)
dli.zmqserver(data, bind="tcp://*:17006")

In [None]:
!/bin/bash -c 'python ./server.py &'

In [None]:
data = dli.itzmq(connect="tcp://localhost:17006")
imshow(data.next()["png"])

# Parallelizing Input Pipelines

In [None]:
def factory():
    return dli.itsqlite("testdata/sample.db") | \
           dli.itmap(image=dli.pilreads, cls=int)
data = factory()
imshow(data.next()["image"])

In [None]:
data = dli.parallel.parallelize_input(factory, 4)
imshow(data.next()["image"])
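
The fan-in pattern behind a parallelized input can be sketched with threads and a queue. This is only an illustration under stated assumptions: `parallelize` is a hypothetical helper, it uses threads for simplicity, and whether `parallelize_input` uses threads or processes is not stated in this notebook. Each worker runs its own copy of the pipeline returned by `factory`.

```python
import queue
import threading

def parallelize(factory, nworkers, maxsize=100):
    """Run nworkers copies of factory() and merge their samples."""
    output = queue.Queue(maxsize)
    done = object()  # sentinel marking the end of one worker's stream

    def worker():
        try:
            for sample in factory():
                output.put(sample)
        finally:
            output.put(done)

    for _ in range(nworkers):
        threading.Thread(target=worker, daemon=True).start()

    finished = 0
    while finished < nworkers:
        item = output.get()
        if item is done:
            finished += 1
        else:
            yield item

def demo_factory_fn():
    """Stand-in pipeline that yields five numbered samples."""
    return iter(range(5))

merged = list(parallelize(demo_factory_fn, 3))
```

Because every worker iterates over the full pipeline, each sample appears once per worker in this sketch, and the merged order is not deterministic. In practice the factory would shuffle or partition its input so that workers do not emit identical streams.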