# Table of Contents
* [Requirements](#Requirements)
* [Getting started](#Getting-started)
	* [Installing license](#Installing-license)
* [Text adapter](#Text-adapter)
	* [Gzip Support](#Gzip-Support)
	* [Indexing CSV Data](#Indexing-CSV-Data)
	* [Regular Expressions](#Regular-Expressions)
	* [`iopro.loadtext()` versus `iopro.genfromtxt()`](#iopro.loadtext%28%29-versus-iopro.genfromtxt%28%29)
	* [S3 Support](#S3-Support)
* [JSON Support](#JSON-Support)
	* [Massaging data in the adapter](#Massaging-data-in-the-adapter)
		* [Combining regular expressions and typecastings](#Combining-regular-expressions-and-typecastings)
	* [Numba Integration](#Numba-Integration)


# Requirements

- Python 2.7, or 3.4+
- NumPy 1.10+

Python modules (optional):

- boto (for S3 support)
- Pandas (to use DataFrames)

# Getting started

IOPro loads NumPy arrays (and Pandas DataFrames) directly from files,
SQL databases, and NoSQL stores, without creating millions of temporary,
intermediate Python objects, or requiring expensive array resizing
operations. It provides a drop-in replacement for the NumPy functions
loadtxt() and genfromtxt(), but drastically improves performance and
reduces the memory overhead.

IOPro is included with [Anaconda Workgroup and Anaconda Enterprise
subscriptions](https://www.continuum.io/content/anaconda-subscriptions).

To start a 30-day free trial, just download and install the IOPro
package.

If you already have [Anaconda](http://continuum.io/downloads.html) (free
Python distribution) installed:

    conda update conda
    conda install iopro

If you do not have Anaconda installed, you can download it
[here](http://continuum.io/downloads.html).

IOPro can also be installed into your own (non-Anaconda) Python
environment. For more information about IOPro please contact
[<sales@continuum.io>](mailto:sales@continuum.io).

## Installing license

Once you have obtained a license for long-term use of IOPro (and other Continuum products), you need to copy the license file to your `.continuum` directory under your home directory.  Generally in organizations, systems/IT will handle this.  For example, on my computer:

In [None]:
!jq "" ~/.continuum/license_bundle*.txt | sed 's/"sig": .*/"sig": "XXXXXX"/'

# Text adapter

Before we get started, let's create a sample CSV file to work with:

In [None]:
from random import random, randint, shuffle
import string

NUMROWS = 10
with open('data/table.csv','w') as data:
    # Header
    for n in range(1,5):
        print("f%d" % n, end=",", file=data)
    print("comment", file=data)

    # Body
    letters = list(string.ascii_letters)
    for n in range(NUMROWS):
        shuffle(letters)
        s = "".join(letters[:randint(5,20)])
        vals = (n, randint(1000,2000), random(), random()*100, s)
        print("%d,%d,%f,%f,%s" % vals, file=data)

Let's read in the local CSV file created here.  Obviously, for a small file like this that easily fits in memory, the `csv` or `pandas` modules might be more than sufficient.  We want to show the interfaces and capabilities that will apply to much larger data.

In [None]:
import iopro
adapter = iopro.text_adapter('data/table.csv', parser='csv')

In [None]:
adapter.get_field_names()

We can specify the data types for values in the columns of the CSV file
being read, though here we will instead rely on the ability of IOPro's
TextAdapter to auto-discover the data types used.

We ask IOPro's TextAdapter to parse text and return records in NumPy
arrays from selected portions of the csv file using slicing notation:

In [None]:
# the inferred datatypes
array = adapter[:]
array.dtype
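To make the idea of type auto-discovery concrete, here is a minimal standard-library sketch. It is not IOPro's algorithm (the source does not describe it); the helper `infer_type` and the sample data are hypothetical. It tries `int`, then `float`, and falls back to `str` for each column.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest of int, float, or str that fits every non-empty value."""
    for candidate in (int, float):
        try:
            for v in values:
                if v != "":
                    candidate(v)  # raises ValueError if the value does not fit
            return candidate
        except ValueError:
            continue
    return str

# Hypothetical sample with an integer, a float, and a string column.
sample = io.StringIO("f1,f2,comment\n0,0.25,abc\n1,1.5,xyz\n")
rows = list(csv.reader(sample))
header, body = rows[0], rows[1:]
columns = list(zip(*body))  # transpose rows into columns
inferred = {name: infer_type(col) for name, col in zip(header, columns)}
print(inferred)
```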

Define field dtypes (example: set field 0 to a 16-bit unsigned int and field 3 to a 32-bit float):

In [None]:
# massage the datatypes
adapter.set_field_types({0: 'u2', 3:'f4'})
array = adapter[:]
array.dtype

In [None]:
# the first five records
array = adapter[0:5]
print(array)

In [None]:
# read last five records
array = adapter[-5:]
print(array)

In [None]:
# read every other record
array = adapter[::2]
print(array)

In [None]:
# read the first, second, and third fields only
array = adapter[[0,1,2]][:]
list(array)

In [None]:
# read fields named 'f2' and 'comment' only
array = adapter[['f2','comment']][:]
list(array)
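The slicing calls above read only the requested records. As a rough illustration of the same access patterns without IOPro, the sketch below streams rows from an in-memory CSV and never holds more than the requested rows. The data and the `records` helper are made up for this example, and this is not how IOPro is implemented.

```python
import csv
import io
from collections import deque
from itertools import islice

# Hypothetical table: ten rows of (n, n squared).
text = "n,value\n" + "".join("%d,%d\n" % (i, i * i) for i in range(10))

def records():
    """Yield data rows one at a time, skipping the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        yield row

first_five = list(islice(records(), 0, 5))         # like adapter[0:5]
every_other = list(islice(records(), 0, None, 2))  # like adapter[::2]
last_five = list(deque(records(), maxlen=5))       # like adapter[-5:]
```

Note that the last-five case still has to scan every row, which is the cost that the index described later avoids.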

## Gzip Support

IOPro can decompress gzip data on the fly, simply by passing a `compression` keyword argument.

```python
adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip')
```

Besides letting you store and work with compressed data without decompressing it first, this costs very little in performance. For example, with a test 419 MB CSV file of numerical data, and a 105 MB file of the same data compressed with gzip, the following are run times for loading the entire contents of each file into a NumPy array:

 - uncompressed: 13.38 sec
 - gzip compressed: 14.54 sec

The compressed file takes slightly longer, but consider having to uncompress the file to disk before loading with IOPro:

 - uncompressed: 13.38 sec
 - gzip compressed: 14.54 sec
 - gzip compressed (decompress to disk, then load): 21.56 sec
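For comparison, the same "decompress while reading" idea can be sketched with the standard library alone. This is only an illustration of streaming decompression, not IOPro's implementation. The file name and rows are hypothetical, and the file is written to a temporary directory.

```python
import csv
import gzip
import os
import tempfile

rows = [["f1", "f2"], ["1", "2.5"], ["3", "4.5"]]

# Write a small gzip-compressed CSV file into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")
with gzip.open(path, "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back: gzip.open decompresses as the CSV reader pulls lines,
# so no uncompressed copy is ever written to disk.
with gzip.open(path, "rt", newline="") as f:
    streamed = list(csv.reader(f))

print(streamed)
```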

## Indexing CSV Data

One of the most useful features of IOPro is the ability to index data to allow for fast random lookup.

For example, to retrieve the last record of the gzip-compressed dataset we used above:

```
>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip')
>>> array = adapter[-1]
```

Retrieving the last record into a NumPy array takes 14.82 sec. This is about the same as the time to read the entire file, because the entire dataset has to be parsed to get to the last record.

To make seeking faster, we can build an index:

```python
adapter.create_index('index_file')
```

The above method creates an index in memory and saves it to disk, taking 9.48 sec. Now when seeking to and reading the last record again, it takes a mere 0.02 sec.

Reloading the index only takes 0.18 sec. Build an index once, and get near instant random access to your data forever:

```python
adapter = iopro.text_adapter('data.gz', parser='csv', 
                             compression='gzip', index_name='index_file')
```
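To see why an index makes seeking cheap, here is a minimal sketch of the general technique on an uncompressed file: record the byte offset of every line once, then jump straight to any record. The functions `build_index` and `read_record` are hypothetical helpers, not IOPro API. IOPro's on-disk index format is not described here, and indexing gzip data additionally needs checkpoints into the compressed stream.

```python
import os
import tempfile

# Hypothetical data file: 1000 records of (n, n squared).
path = os.path.join(tempfile.mkdtemp(), "table.csv")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(("%d,%d\n" % (i, i * i)).encode("ascii"))

def build_index(filename):
    """Scan the file once and return the byte offset of each record."""
    offsets = []
    with open(filename, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets

def read_record(filename, offsets, i):
    """Seek directly to record i and parse only that line."""
    with open(filename, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("ascii").rstrip("\n").split(",")

index = build_index(path)            # one full pass over the file
last = read_record(path, index, -1)  # no scan needed after that
print(last)
```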

Let's try it with our more moderately sized example.

In [None]:
adapter = iopro.text_adapter('data/exoplanets.csv.gz', parser='csv', compression='gzip')
print(len(adapter[:]), "rows")
print(', '.join(adapter.field_names[:3]), 
      "...%d more..." % (adapter.field_count-6), 
      ', '.join(adapter.field_names[-3:]))

In [None]:
adapter.field_types

In [None]:
%time row=adapter[-1]

In [None]:
%time adapter.create_index('data/exoplanets.index')

In [None]:
%time row=adapter[-1]

In [None]:
%time row=adapter[-1]

In [None]:
new_adapter = iopro.text_adapter('data/exoplanets.csv.gz', parser='csv', 
                                 compression='gzip', index_name='data/exoplanets.index')

In [None]:
%time row=new_adapter[-1]

## Regular Expressions

> Some people, when confronted with a problem, think 
“I know, I'll use regular expressions.”   Now they have two problems. —Jamie Zawinski

IOPro supports using regular expressions to help parse messy data. Take for example the following snippet of actual NASDAQ stock data found on the Internet:

In [None]:
%%file data/stocks.csv
Name,Symbol,Exchange,Range
Apple,AAPL,NasdaqNM,363.32 - 705.07
Google,GOOG,NasdaqNM,523.20 - 774.38
Microsoft,MSFT,NasdaqNM,24.30 - 32.95

The first three fields are easy enough: name, symbol, and exchange. The fourth field presents a bit of a problem. Let's try IOPro's regular expression based parser:

In [None]:
regex_string = (r'([A-Za-z]+),([A-Z]{1,4}),([A-Za-z]+),'
                r'(\d+\.\d{2})\s*\-\s*(\d+\.\d{2})')
adapter = iopro.text_adapter('data/stocks.csv', parser='regex', 
                             regex_string=regex_string)

# Notice that the header row does not match the regex
print(adapter.field_names)
# We can massage the headers to reflect our match pattern
adapter.field_names = adapter.field_names[0].split(',')[:3] + ["Low","High"]
adapter[:]

Regular expressions are compact and often difficult to read, but they are also very powerful. By using the above regular expression with the grouping operators '(' and ')', we can define exactly how each record should be parsed into fields. Let's break it down into individual fields:

 * `([A-Za-z]+)` defines the first field (stock name) in our output array,
 * `([A-Z]{1,4})` defines the second (stock symbol),
 * `([A-Za-z]+)` defines the third (exchange name),
 * `(\d+\.\d{2})` defines the fourth field (low price),
 * `\s*\-\s*` is skipped because it is not part of a group,
 * `(\d+\.\d{2})` defines the fifth field (high price).

The output array contains five fields: three string fields and two float fields. Exactly what we want.
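The grouping can be checked without IOPro by running the pattern through Python's own `re` module. This is a sketch only: the price groups are written here as `\d+\.\d{2}`, non-matching lines (such as the header) are simply skipped, and the float conversion is done by hand rather than by an adapter.

```python
import re

pattern = re.compile(
    r"([A-Za-z]+),([A-Z]{1,4}),([A-Za-z]+),"
    r"(\d+\.\d{2})\s*-\s*(\d+\.\d{2})"
)

lines = [
    "Name,Symbol,Exchange,Range",
    "Apple,AAPL,NasdaqNM,363.32 - 705.07",
    "Google,GOOG,NasdaqNM,523.20 - 774.38",
    "Microsoft,MSFT,NasdaqNM,24.30 - 32.95",
]

records = []
for line in lines:
    match = pattern.match(line)
    if match is None:
        continue  # the header row does not match the pattern
    name, symbol, exchange, low, high = match.groups()
    # Each parenthesized group becomes one field; prices become floats.
    records.append((name, symbol, exchange, float(low), float(high)))

for record in records:
    print(record)
```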

## `iopro.loadtext()` versus `iopro.genfromtxt()`

Within IOPro there are two closely related functions.  `loadtxt()`, which we have been looking at, makes the more optimistic assumption that your data is well formatted.  `genfromtxt()` has a number of arguments for handling messier data, and special behaviors for dealing with missing data.

`loadtxt()` is already highly configurable for dealing with data in many CSV and other delimited formats.  `genfromtxt()` accepts a superset of these arguments.

In [None]:
help(iopro.loadtxt)

In [None]:
help(iopro.genfromtxt)

## S3 Support

IOPro can parse CSV data stored in Amazon's S3 cloud storage service. In order to access S3 files, you need to specify some credentials along with the resource you are accessing.

The first two parameters are your AWS access key and secret key, followed by the S3 bucket name and key name. The S3 CSV data is downloaded in 128K chunks and parsed directly from memory, bypassing the need to save the entire S3 data set to local disk. 

In [None]:
# Health Insurance Marketplace data
import iopro
import urllib.request
url = 'http://s3.amazonaws.com/product-training/'
xml = urllib.request.urlopen(url).read()
#print(xml)

In [None]:
import bs4, re
r = re.compile(r'^(\s*)', re.MULTILINE)
def display(bs, encoding=None, formatter="minimal", indent=4):
    print(r.sub(r'\1' * indent, bs.prettify(encoding, formatter)))

display(bs4.BeautifulSoup(xml, "xml"))

In [None]:
user_name = "class1"
aws_access_key = "AKIAINKGGVI5HNOKN5MQ"
aws_secret_key = "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"
bucket = 'product-training'
key_name = 'BusinessRules.csv' # 21k lines, 8MB
# key_name = 'PlanAttributes.csv' # 77k lines, 95MB
# key_name = 'Rate.csv.gzip' # 13M lines, 2GB uncompressed, 110MB compressed
adapter = iopro.s3_text_adapter(aws_access_key, aws_secret_key, bucket, key_name)

In [None]:
# Don't try this with the really large datasets, works with the default one
df = adapter.to_dataframe()
df.iloc[:6,:6]

IOPro can also build an index for S3 data just as with disk based CSV data, and use the index for fast random access lookup. If an index file is created with IOPro and stored with the S3 dataset in the cloud, IOPro can use this remote index to download and parse just the subset of records requested. This allows you to generate an index file once and share it on the cloud along with the data set, and does not require others to download the entire index file to use it.
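The chunked download described earlier in this section (128K at a time, parsed from memory) follows a common pattern: read a fixed-size chunk, parse the complete lines in it, and carry any partial line over to the next chunk. The sketch below shows that pattern on an in-memory byte stream standing in for the S3 object. It makes no network calls, the helper `iter_lines` is hypothetical, and it is not IOPro's code.

```python
import io

CHUNK_SIZE = 128 * 1024  # 128K, the chunk size quoted above

def iter_lines(stream, chunk_size=CHUNK_SIZE):
    """Yield complete lines from a byte stream read in fixed-size chunks."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        lines = (leftover + chunk).split(b"\n")
        leftover = lines.pop()  # last piece may be an incomplete line
        for line in lines:
            yield line
    if leftover:
        yield leftover  # final line without a trailing newline

# Stand-in for remote data: 50000 records, several chunks' worth of bytes.
payload = "".join("%d,%d\n" % (i, 2 * i) for i in range(50000)).encode("ascii")
stream = io.BytesIO(payload)

parsed = [
    tuple(int(x) for x in line.decode("ascii").split(","))
    for line in iter_lines(stream)
]
print(len(parsed), parsed[-1])
```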

# JSON Support

Text data in JSON format can be parsed by specifying 'json' for the parser argument:

In [None]:
%%file data/one.json
{"id":123, "name":"xxx"}

In [None]:
# Single JSON object
adapter = iopro.text_adapter('data/one.json', parser='json')
adapter[:]

Currently, each JSON object at the root level is interpreted as a single NumPy record. Each JSON object can be part of an array, or separated by a newline. The following are examples of valid JSON documents that IOPro can parse, along with the resulting NumPy arrays:

In [None]:
%%file data/two.json
[{"id":123, "name":"xxx"}, {"id":456, "name":"yyy"}]

In [None]:
# Array of two JSON objects
iopro.text_adapter('data/two.json', parser='json')[:]

In [None]:
%%file data/three.json
{"id":123, "name":"xxx"}
{"id":456, "name":"yyy"}

In [None]:
# Two JSON objects separated by newline
iopro.text_adapter('data/three.json', parser='json')[:] 

Future versions of IOPro will have support for selecting specific JSON fields, using a query language similar to XPath for XML.
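The two accepted layouts (a JSON array of objects, or one object per line) can be told apart with very little code. The following standard-library sketch does so with a hypothetical helper `load_records`. It assumes each newline-separated object fits on a single line, and it returns plain dictionaries rather than NumPy records.

```python
import json

def load_records(text):
    """Return a list of dicts from a JSON array or newline-separated objects."""
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)  # layout 1: an array of objects
    # layout 2: one JSON object per line (also covers a single object)
    return [json.loads(line) for line in text.splitlines() if line.strip()]

one = load_records('{"id":123, "name":"xxx"}')
two = load_records('[{"id":123, "name":"xxx"}, {"id":456, "name":"yyy"}]')
three = load_records('{"id":123, "name":"xxx"}\n{"id":456, "name":"yyy"}\n')
print(one, two, three, sep="\n")
```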

## Massaging data in the adapter

A custom function can be used to modify values as they are read.

In [None]:
import iopro, io, math
stream = io.StringIO('3,abc,3.3\n7,xxx,9.9\n4,,')
adapter = iopro.text_adapter(stream, parser='csv', field_names=False)

# Override default converter for first field
adapter.set_converter(0, lambda x: math.factorial(int(x)))

adapter[:]

We can also force data types and set fill values for missing data:

In [None]:
# Apply data types to columns
stream = io.StringIO('3,abc,3.3\n7,xxx,9.9\n4,,')
adapter = iopro.text_adapter(stream, parser='csv', field_names=False)
adapter.set_field_types({1:'S3', 2:'f4'})
adapter[:]

In [None]:
# Set fill value for missing values in each field
adapter.set_fill_values({1:'ZZZ', 2:999.999})
adapter[:]
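What converters and fill values do to each field can be spelled out by hand. The sketch below reuses the same three-line input with the standard `csv` module. The `converters` and `fill_values` dictionaries are plain Python stand-ins for the adapter settings, not IOPro objects, and the result is a list of tuples rather than a NumPy array.

```python
import csv
import io
import math

stream = io.StringIO('3,abc,3.3\n7,xxx,9.9\n4,,')

# One converter per field index; field 0 gets the custom factorial converter.
converters = {0: lambda s: math.factorial(int(s)), 1: str, 2: float}
# Values substituted when a field is empty.
fill_values = {1: 'ZZZ', 2: 999.999}

records = []
for row in csv.reader(stream):
    record = []
    for i, raw in enumerate(row):
        if raw == '' and i in fill_values:
            record.append(fill_values[i])      # missing value: use the fill
        else:
            record.append(converters[i](raw))  # otherwise convert the text
    records.append(tuple(record))

print(records)
```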

### Combining regular expressions and typecastings

In [None]:
%%file data/transactions.csv
$2.56, 50%, September 20 1978
$1.23, 23%, April 5 1981

In [None]:
import iopro

regex_string = r'\$(\d)\.(\d{2}),\s*([0-9]+)\%,\s*([A-Za-z]+)'
adapter = iopro.text_adapter('data/transactions.csv', 
                             parser='regex', 
                             regex_string=regex_string, 
                             field_names=False, 
                             infer_types=False)

# Set dtype of fields and their names
adapter.set_field_types({0:'i2', 1:'u2', 2:'f4', 3:'S10'})
adapter.set_field_names(['dollars', 'cents', 'percentage', 'month'])
adapter[:]

## Numba Integration

IOPro comes with experimental integration with NumbaPro, the amazing NumPy-aware Python compiler also available in Anaconda. Previously, when parsing messy CSV data, you had to use either a very slow custom Python converter function to convert the string data to the target data type, or a complex regular expression to define the fields in each record string. The regular expression feature of IOPro is still a useful and valid option for certain types of data, but it would be nice if custom Python converter functions weren't so slow as to be almost unusable.

Numba solves this problem by compiling your converter functions on the fly without any action on your part. Simply set the converter function with a call to `set_converter()` as before, and IOPro + NumbaPro will handle the rest.

To illustrate, I'll show a trivial example using the sdss data set. Take the following converter function, which converts the input string to a floating point value and rounds it to the nearest integer, returning the integer value:

```
>>> def convert_value(input_str):
...     float_value = float(input_str)
...     return int(round(float_value))
```

We'll use it to convert field 1 from the sdss dataset to an integer. By calling the `set_converter()` method with the `use_numba` parameter set to either True or False (the default is True), we can test the converter function both as interpreted Python and as Numba-compiled LLVM bytecode. In this case, compiling the converter function with NumbaPro gives us a 5x improvement in run time performance.

To put that in perspective, the Numba-compiled converter function takes about the same time as converting field 1 to a float value using IOPro's built-in C-compiled float converter function. That isn't quite an "apples to apples" comparison, but it does show that NumbaPro enables user-defined Python converter functions to achieve speeds in the same league as compiled C code.
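To see what the converter itself returns, here is a plain-Python run on a few hand-picked strings. These are hypothetical inputs, not values from the sdss data set, and neither IOPro nor Numba is involved.

```python
def convert_value(input_str):
    """Convert a numeric string to the nearest integer."""
    float_value = float(input_str)
    return int(round(float_value))

samples = ["1.4", "1.6", "-2.7", "2.5"]
converted = [convert_value(s) for s in samples]
print(converted)
```

One detail to keep in mind: under Python 3, `round()` rounds exact halves to the nearest even integer, so `"2.5"` becomes 2, whereas Python 2 rounds halves away from zero and gives 3.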