<img src='img/logo.png' />

<img src='img/title.png'>

<img src='img/py3k.png'>

# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
	* [Things to know about CSV](#Things-to-know-about-CSV)
		* [CSV files are not well-structured](#CSV-files-are-not-well-structured)
		* [When to use CSV](#When-to-use-CSV)
		* [Basic Steps for Dealing with CSV](#Basic-Steps-for-Dealing-with-CSV)
		* [File structure](#File-structure)
* [The `csv` module](#The-csv-module)
	* [Manually converting the data fields](#Manually-converting-the-data-fields)
	* [Detecting the format](#Detecting-the-format)
* [Just use `pandas`](#Just-use-pandas)
* [Not-quite-CSV: Eyeballing the data](#Not-quite-CSV:-Eyeballing-the-data)
* [Working with Spreadsheets](#Working-with-Spreadsheets)
	* [What are Spreadsheets?](#What-are-Spreadsheets?)
		* [What are spreadsheets good for?](#What-are-spreadsheets-good-for?)
	* [Structure of Excel files](#Structure-of-Excel-files)
		* [New XML-based hotness: .xlsx ](#New-XML-based-hotness:-.xlsx)
		* [Old binary-format-based, but not busted: .xls](#Old-binary-format-based,-but-not-busted:-.xls)
	* [Structure of ODT (and ODS) files](#Structure-of-ODT-%28and-ODS%29-files)
		* [XML-based .odt and .ods](#XML-based-.odt-and-.ods)
		* [Picking one to use:](#Picking-one-to-use:)
	* [Basic Steps for Programmatically Working with Excel](#Basic-Steps-for-Programmatically-Working-with-Excel)
	* [Notes and Gotchas](#Notes-and-Gotchas)
	* [Exercises](#Exercises)
	* [Optional Exercises](#Optional-Exercises)
		* [What is a cell?](#What-is-a-cell?)
* [Machine and Human Readable Formats](#Machine-and-Human-Readable-Formats)
	* [Scale of difficulty](#Scale-of-difficulty)
	* [Common uses](#Common-uses)
	* [Terms](#Terms)
	* [JSON : JavaScript Object Notation   ](#JSON-:-JavaScript-Object-Notation)
		* [Why JSON?](#Why-JSON?)
		* [Why not JSON?](#Why-not-JSON?)
	* [YAML: YAML Ain't Markup Language](#YAML:-YAML-Ain't-Markup-Language)
		* [Why YAML?](#Why-YAML?)
		* [Why not YAML?](#Why-not-YAML?)
	* [XML: eXtensible Markup Language](#XML:-eXtensible-Markup-Language)
		* [Why (should you use) XML?](#Why-%28should-you-use%29-XML?)
		* [Why (should you) not (use) XML?](#Why-%28should-you%29-not-%28use%29-XML?)
	* [JSON](#JSON)
	* [YAML](#YAML)
	* [XML](#XML)
		* [expat](#expat)
		* [ElementTree](#ElementTree)
		* [SAX (Simple API for XML)](#SAX-%28Simple-API-for-XML%29)
		* [DOM (Document Object Model)](#DOM-%28Document-Object-Model%29)
	* [Exercise (representing and processing XML)](#Exercise-%28representing-and-processing-XML%29)
	* [HDF5 Summary](#HDF5-Summary)
		* [Composition](#Composition)
		* [Warning](#Warning)
		* [Questions](#Questions)
		* [Exploring an HDF5 file found "in the wild"](#Exploring-an-HDF5-file-found-"in-the-wild")
	* [NetCDF](#NetCDF)
	* [Exercise (export to scientific formats)](#Exercise-%28export-to-scientific-formats%29)
	* [Review of HDF5 and NetCDF](#Review-of-HDF5-and-NetCDF)
* [IDL .sav files](#IDL-.sav-files)
* [Learning Objectives](#Learning-Objectives)
	* [Preamble](#Preamble)
* [Sqlite3](#Sqlite3)
* [PostgreSQL (and DBAPI generally)](#PostgreSQL-%28and-DBAPI-generally%29)
	* [Fortran 77 Unformatted](#Fortran-77-Unformatted)

# Learning Objectives:

After completion of this module, learners should be able to:

* Read from and write to delimited data files, such as CSV
* Learn how to do so robustly
* Learn why to not do so if possible
* Understand the structure of Excel .xlsx files
* Read data from Excel files
* Write data to Excel files
* Learn what JSON, YAML, and XML are
* Learn when and why to use them
* Learn how to manipulate and construct each type
* Learn the limitations and risks associated with each
* Work with formats that mirror the native data structures of Python:
  * JSON
  * YAML
* Work with XML data using several APIs:
  * expat
  * ElementTree
  * SAX (Simple API for XML)
  * DOM (Document Object Model)
* Work with data stored in fast, hierarchical scientific data formats:
  * HDF5
  * NetCDF
* IDL .sav files
* Fortran 77 Unformatted

## Things to know about CSV

### CSV files are not well-structured

* They don't include data types
* They don't enforce structure
* They are not standard
  * CSV == "Comma-Separated Values"
  * Is it actually comma-separated? Not necessarily: tabs, semicolons, and other delimiters are all common. (See [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values) or [RFC 4180](https://tools.ietf.org/html/rfc4180))

### When to use CSV

The only reason to use CSV is when backwards-compatibility is required and you can't fix the original.

### Basic Steps for Dealing with CSV

1. Look at a small sample of the input
  1. Check delimiters
  2. Check types
2. Read in a small sample of the input and convert it to the necessary format(s)
3. Robust-ify the code with error catching and exception handling
4. Test on a large sample of the input
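
The steps above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the three-column sample and the helper names `convert_row` and `read_rows` are made up for illustration, and the real files in `data/` have more columns.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for a small slice of the real input
SAMPLE = """Date,Open,Volume
2015-06-24,127.21,55280900
2015-06-23,not-a-number,30268900
"""

def convert_row(row):
    """Convert one raw row of strings into (date, float, int)."""
    return (datetime.strptime(row[0], "%Y-%m-%d").date(),
            float(row[1]),
            int(row[2]))

def read_rows(fileobj):
    """Return (header, good_rows, bad_rows); bad rows keep their line number."""
    reader = csv.reader(fileobj)
    header = next(reader)
    good, bad = [], []
    for row in reader:
        try:
            good.append(convert_row(row))
        except (ValueError, IndexError) as err:
            # Record the problem instead of crashing on the first bad line
            bad.append((reader.line_num, row, str(err)))
    return header, good, bad

header, good, bad = read_rows(io.StringIO(SAMPLE))
print(header)
print(len(good), "good rows;", len(bad), "bad rows")
```

Collecting the bad rows rather than raising immediately makes it easier to see every problem in a large file in one pass.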

In [None]:
# Now that we know everything about CSV, let's examine our data
import csv
aapl_stocks = "data/AAPL.csv"
aapl_stocks_01 = "data/AAPL01.csv"
# "wc" counts the lines, words, and bytes in the file.
# "head -6" shows the first six lines in raw form (plain "head" would show ten).
!wc $aapl_stocks_01
print()
!head -6 $aapl_stocks_01

### File structure

The first row is a header that tells us the name of the columns.
* Is that header required?
* Is it possible to have multiple header rows?
* How many fields are there?
* What is the field delimiter?

What is the format of each line after that?
* Date, floating point, floating point, floating point, floating point, integer, floating point

# The `csv` module

In the easy case, a very simple use of the `csv` module gives us useful results.  Perhaps not perfect, but a good starting point.

In [None]:
# Basic comma-separated file
import csv
with open(aapl_stocks) as csvfile:
    stockreader = csv.reader(csvfile)
    for n, row in zip(range(6), stockreader):
        print("%d:" % n, row)

We might have made some wrong assumptions about the dialect of CSV being used, though.

In [None]:
# Of course, we assumed a comma is the actual delimiter
with open(aapl_stocks_01) as csvfile:
    stockreader = csv.reader(csvfile)
    for n, row in zip(range(6), stockreader):
        print("%d:" % n, row)

In [None]:
# It looks like this actually uses tabs; happily the module can deal with many variations
with open(aapl_stocks_01) as csvfile:
    stockreader = csv.reader(csvfile, delimiter='\t')
    for n, row in zip(range(6), stockreader):
        print("%d:" % n, row)

## Manually converting the data fields

In [None]:
from itertools import islice
from datetime import datetime
with open(aapl_stocks, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    #Consume the file header, and then convert each line's 
    # data so that it has the type we need
    lines = [line for line in csvreader]
    header = lines.pop(0)
    data = [ [datetime.strptime(line[0],'%Y-%m-%d'), float(line[1]), 
              float(line[2]), float(line[3]), 
              float(line[4]), int(line[5]), float(line[6])] 
            for line in lines]
for line in data:
    print(line)
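
An alternative sketch of the same conversion uses `csv.DictReader` together with a mapping from column name to converter, so the code does not depend on column positions. The column names below are an assumption about the header of the stock file, and `typed_rows` is a hypothetical helper.

```python
import csv
import io
from datetime import datetime

# Assumed column names; adjust to whatever the real header contains
CONVERTERS = {
    "Date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
    "Open": float, "High": float, "Low": float, "Close": float,
    "Volume": int, "Adj Close": float,
}

def typed_rows(fileobj, converters=CONVERTERS):
    """Yield one dict per data row, with each value converted by column name."""
    for raw in csv.DictReader(fileobj):
        yield {name: converters[name](value) for name, value in raw.items()}

# Made-up single row so the sketch runs without the data file
sample = io.StringIO(
    "Date,Open,High,Low,Close,Volume,Adj Close\n"
    "2015-06-24,127.21,129.80,127.12,128.11,55280900,128.11\n"
)
rows = list(typed_rows(sample))
print(rows[0]["Date"].year, rows[0]["Volume"])
```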

## Detecting the format

In [None]:
#We can determine useful information about input using the CSV sniffer
with open(aapl_stocks, 'r') as csvfile:
    sample = csvfile.read(4096)
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)
    has_header = sniffer.has_header(sample)

print("Has header:".rjust(20), has_header)
print("Delimiter:".rjust(20), repr(dialect.delimiter))
print("Double quote:".rjust(20), dialect.doublequote)
print("Escape character:".rjust(20), dialect.escapechar)
print("Line terminator:".rjust(20), repr(dialect.lineterminator))
print("Quote character:".rjust(20), dialect.quotechar)
print("Quoting:".rjust(20), dialect.quoting)
print("Skip initial space:".rjust(20), dialect.skipinitialspace)
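
Once a dialect has been sniffed, it can be handed straight back to `csv.reader`. The sketch below uses a made-up tab-separated sample and restricts the candidate delimiters, which tends to make the sniffer more reliable on short samples.

```python
import csv
import io

# Made-up tab-separated sample, standing in for the first few KB of a file
sample_text = ("Date\tOpen\tVolume\n"
               "2015-06-24\t127.21\t55280900\n"
               "2015-06-23\t127.48\t30268900\n")

# Only consider the delimiters we think are plausible
dialect = csv.Sniffer().sniff(sample_text, delimiters=",\t;")

# Pass the detected dialect to the reader instead of hard-coding a delimiter
rows = list(csv.reader(io.StringIO(sample_text), dialect))
print(repr(dialect.delimiter))
print(rows[1])
```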

# Just use `pandas`

The truth is that a lot of these issues have been handled well by the Pandas library.  Its `read_csv()` function comes with dozens of named arguments for dealing with the many edge cases in how real-world files are formatted.

In [None]:
import pandas as pd
aapl = pd.read_csv('data/AAPL.csv', index_col='Date')
aapl[:6]

In [None]:
help(pd.read_csv)
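
A few of those named arguments cover most everyday cases. The following is a small sketch on a made-up tab-separated sample (not the course data): `sep` sets the delimiter, `index_col` picks the index column, and `parse_dates` converts it to real dates.

```python
import io
import pandas as pd

# Made-up tab-separated sample so the sketch runs without the data files
text = ("Date\tOpen\tVolume\n"
        "2015-06-24\t127.21\t55280900\n"
        "2015-06-23\t127.48\t30268900\n")

df = pd.read_csv(io.StringIO(text), sep="\t",
                 index_col="Date", parse_dates=["Date"])
print(df.dtypes)   # column types are inferred
print(df.index)    # the index holds dates, not strings
```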

# Not-quite-CSV: Eyeballing the data

In [None]:
# More convoluted tab-separated with header lines, etc.
# Let's try to figure out how to work with the data
cowlitz_file = 'data/cowlitz_river_wa_usgs_flow_data.rdb'
cowlitz = open(cowlitz_file).readlines()
for line in cowlitz[:32]:
    print(line.rstrip())

Subject area experts will find the format familiar, I am sure.  The *rdb* format is described at http://help.waterdata.usgs.gov/faq/about-tab-delimited-output as well.  But I am a non-expert in the subject area, so I will just visually examine it, and figure out in a relatively ad hoc way how to read and utilize it.

Here are some things I notice:

* The file starts with a commented header, with each line beginning with a hash mark (`# `) and space.
* The next line after the header is a list of field names.
  * Some field names start with numbers, and are not valid Python identifiers.
* The next line after the field names is the data types of the columns; but I'm not sure exactly what those descriptions mean.
* The bulk of the file is tab-separated values.

Let's write a small custom function to parse what we see in this data format. Note that I actually *did* a quick search, and it appears that the `asciitable` module and the `astropy` package both support this format (other existing libraries might also); but suppose it was something novel.

In [None]:
def read_rdb(filename):
    from collections import namedtuple, OrderedDict
    fh = open(filename)
    # First collect the comments, stopping at the field names
    comment_lines = []
    for line in fh:
        # We've gotten to the header
        if not line.startswith('#'):
            fields = line.rstrip().split('\t')
            break
        comment_lines.append(line[2:])
    # Make the individual lines into one string
    comment = ''.join(comment_lines)
    # Read the next line with the data formats
    formats = next(fh).rstrip().split('\t')
    # Make sure field names are valid Python identifiers
    field_names = [f if f[0].isalpha() else 'N_'+f for f in fields]
    # Define header as ordered mapping of field name to data type
    header = OrderedDict(zip(field_names, formats))
    row = namedtuple('Row', field_names)
    records = []
    for values in csv.reader(fh, delimiter='\t'):
        records.append(row(*values))
    # Close the file before we leave
    fh.close()
    return comment, header, records

In [None]:
comment, header, cowlitz_data = read_rdb(cowlitz_file)
for field, datatype in header.items():
    print("%s: %s" % (field, datatype))

In [None]:
print(comment)

In [None]:
print("%d records, show first five" % len(cowlitz_data))

print('----------')
for record in cowlitz_data[:5]:
    print(record)
    
print('----------')
print("Work with a particular record in a straightforward way")
my_row = cowlitz_data[1000]
print(my_row.datetime, my_row.site_no, my_row.N_01_00060_00003)

In [None]:
len([r for r in cowlitz_data if r.N_01_00060_00003_cd=='A:e'])

In [None]:
pd.DataFrame(cowlitz_data, columns=header.keys())

# Working with Spreadsheets

## What are Spreadsheets?

* Spreadsheets are files that can only be modified via lots of mouse-clicking. (Or is that true?)
* Databases
* Todo Lists
* Complex Programs
* A catchall for data for people who don't/can't know any better. (This is not true, but it often feels true.)

### What are spreadsheets good for?

* Rapid prototyping
* Easy to share understanding between technical and non-technical people
* Concrete structure makes it easy for non-programmers (and it makes it dangerous)


Microsoft Excel is the dominant spreadsheet program, so we'll focus on that, but we will also touch on the OpenDocument Format (ODF), the open standard maintained by OASIS and championed by the Free Software community.

## Structure of Excel files

### New XML-based hotness: .xlsx 

* xlsx defines the structure of Excel spreadsheets that fit into the [OOXML framework](http://www.officeopenxml.com/anatomyofOOXML-xlsx.php). 
* One .xlsx file contains only one workbook (but worksheets in that workbook may refer to other workbooks in other files).
* A .xlsx file is actually a zip file (aka package) containing a number of parts. Some are required, some are not.
  * [Content_Types].xml is required
  * relationships between different things are required (between worksheets, styles, external resources, etc.)
* A workbook may contain one or more worksheets
* Each worksheet is kept in a different XML file
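
Because an .xlsx file is a zip package, the standard `zipfile` module is enough to peek inside one. The sketch below builds a tiny stand-in package in memory so that it runs without a real workbook; the part names mimic the usual layout, and `list_package_parts` is a hypothetical helper.

```python
import io
import zipfile

def list_package_parts(xlsx_file):
    """Return the names of the parts stored in an .xlsx package."""
    with zipfile.ZipFile(xlsx_file) as package:
        return package.namelist()

# Stand-in package; a real .xlsx holds more parts (styles, relationships, ...)
fake = io.BytesIO()
with zipfile.ZipFile(fake, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("xl/workbook.xml", "<workbook/>")
    package.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

parts = list_package_parts(fake)
print(parts)
```

Calling `list_package_parts` with the path to a real workbook should list its parts in the same way.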

### Old binary-format-based, but not busted: .xls

* xls is a binary-format specification that defines the structure of Excel spreadsheets.
* An xls file is "... an OLE compound file. A compound file contains storages, streams, and substreams. Each stream or substream contains a series of binary records. Each binary record contains zero or more structured fields that contain the workbook data." (This brief excerpt is taken from [MSDN](https://msdn.microsoft.com/en-us/library/office/cc313154%28v=office.12%29.aspx).)
* The basic building block of xls files is the binary record. Each record is a variable-length sequence of bytes, and is composed of three things: record type, record size, and data.

In other words, xls is a complex format. (I hate this format now. But in truth, it is actually pretty amazing. Backwards compatible to the beginning of time, made to be fast on old computers (like the kind from 10+ years ago), and designed to solve the problems of the day while still being able to handle the future)
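
To make the "record type, record size, data" layout concrete, here is a rough sketch that walks such records in a byte string with `struct`. It assumes a 2-byte type and a 2-byte size, both little-endian; the bytes are fabricated for illustration, and a real .xls file would first have to be unpacked from its OLE compound container.

```python
import struct

def iter_records(stream_bytes):
    """Yield (record_type, data) pairs from a stream of binary records."""
    offset = 0
    while offset + 4 <= len(stream_bytes):
        # Header: 2-byte record type, then 2-byte size of the data that follows
        rec_type, rec_size = struct.unpack_from("<HH", stream_bytes, offset)
        offset += 4
        yield rec_type, stream_bytes[offset:offset + rec_size]
        offset += rec_size

# Fabricated stream: one record with two data bytes, one record with none
fake_stream = (struct.pack("<HH", 0x0809, 2) + b"\x00\x06"
               + struct.pack("<HH", 0x000A, 0))

records = list(iter_records(fake_stream))
for rec_type, data in records:
    print(hex(rec_type), len(data))
```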

## Structure of ODT (and ODS) files

### XML-based .odt and .ods

* ODS (OpenDocument Spreadsheet) and ODT (OpenDocument Text) are both part of the OpenDocument Format, standardized as the [ISO/IEC 26300-1:2015 specification](http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=66363)
* Like .xlsx, an OpenDocument file is a zip package of XML parts (content, styles, metadata, and a manifest), plus any embedded images or other objects
* .ods is the spreadsheet flavor and .odt is the word-processing flavor
  * They share the same packaging; what differs is the kind of document the content describes
* Each spreadsheet element can contain table elements, calculation elements, and lots of other XML elements

### Picking one to use:

1. Should you use a spreadsheet in the first place?
  1. How is the data intended to be used?
  2. How much data is there?
  3. Who knows what logic and calculations have to be encoded? A business analyst or accountant?
2. If so...
  1. Is backwards compatibility with Excel 2003 or earlier required? (Excel 2007 was the first version to use OOXML natively, according to my "Google archaeology")

## Basic Steps for Programmatically Working with Excel

1. Look at a small sample of your data
2. Test on a small sample of the data
3. Robustify the code
4. Test on a larger sample of the data
5. Iterate

## Notes and Gotchas

* Python indexing is 0-based
* Excel indexing is 1-based
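* This makes for a WEIRD mishmash of indexing techniques
  * worksheet.cell(row=1, column=1) is the same cell as list(worksheet.rows)[0][0] (recent openpyxl versions make worksheet.rows a generator, hence the list())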
* openpyxl requires a LOT of memory, even for smallish spreadsheets
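
To keep the two conventions straight, it can help to see the translation written out. The helper below is hypothetical (openpyxl ships its own utilities for this); it converts an A1-style reference to Excel's 1-based (row, column) pair, from which the 0-based Python indices follow by subtracting one.

```python
import re

def a1_to_row_col(ref):
    """Convert an A1-style reference such as 'B3' to 1-based (row, column)."""
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError("not an A1-style reference: %r" % ref)
    letters, digits = match.groups()
    column = 0
    for letter in letters.upper():
        # Column letters are a base-26 number with digits A=1 ... Z=26
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return int(digits), column

row, column = a1_to_row_col("B3")
print(row, column)            # 1-based, as Excel counts
print(row - 1, column - 1)    # 0-based, as Python sequences count
```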

In [None]:
#conda install openpyxl xlrd xlwt
#This won't work: "conda install xlutils". It is apparently incompatible with python 3.4 (as of 2015-06-25)
import openpyxl
import xlrd
import xlwt

from openpyxl import load_workbook, Workbook

from pprint import pprint

aapl_xlsx = "data/AAPL01.xlsx"

In [None]:
wb = load_workbook(aapl_xlsx)
#A workbook should have one or more worksheets.
#Let's see
pprint(wb.worksheets)
AAPL_ws = wb['AAPL']
pprint(AAPL_ws)

#for row in AAPL_ws.rows[1:10]:
#    for cell in row[:7]:
#        print (cell.value)

#What is the difference between that loop and this one?
for row in AAPL_ws['A2':'F11']:
    for cell in row:
        print (cell.value)

#The top loop is loading ALL columns (from A-ZZZZZZ whatever)
#This is fine if you can wait a while and have lots o' RAM

In [None]:
#Iterate over the opening prices and find and print the maximum

#Why use "maximum" instead of easier to write "max"?
maximum = float("-inf")
# list() is needed because recent openpyxl versions make .columns a generator
for cell in list(AAPL_ws.columns)[1][1:]:
    if maximum < float(cell.value):
        maximum = float(cell.value)
print("The highest opening price is {}".format(maximum))
        

## Exercises

1. Find and print the maximum volume
2. Sum and print the volume over all time
3. Find and print any differences between the closing price and adjusted closing prices

## Optional Exercises

1. Find and print the maximum volume per year
2. Find and print the maximum and minimum opening price per year
3. Sum and print the volume over each year

In [None]:
my_first_workbook = "data/my_first_spreadsheet.xlsx"
new_wb = Workbook()

#Each workbook has at least one worksheet
ws = new_wb.active
ws.title = "Test1"


# Assign by cell reference; only older openpyxl versions accepted ws.cell('A1')
ws['A1'] = "Header1"
ws['B1'] = "Header2"
ws['C1'] = "Header3"
ws['D1'] = "Header4"

for col in range(1,5):
    for row in range(2,10):
        c = ws.cell(column=col, row=row)
        c.value = col*100 + row
        
new_wb.save(my_first_workbook)

### What is a cell?

A cell is a distinct collection of attributes and properties at a particular location (identified by a row and column) inside a worksheet. If that definition is too generic, try this:

"The cell is the primary place in which data is stored and operated on. A cell can have a number of characteristics,
such as numeric, text, date, or time formatting; alignment; font; color; and a border. Each cell is identified by a
cell reference, a combination of its column and row headings." ([ECMA OOXML Part 1](http://www.ecma-international.org/publications/standards/Ecma-376.htm))

In [None]:
#Boss says "You did great getting that Apple stock data, but I need one worksheet per year."
#What do?
#We could go in and manually separate each year into a different worksheet (from 2014 to 1980). Yuck!
#We could do it automatically. Yay!


#Basic scheme for the new workbook:
# for each year encountered, make a new worksheet
# populate that worksheet with the data for that year.
aapl_wb = load_workbook(aapl_xlsx)
aapl_ws = aapl_wb.active

headers = list(aapl_ws['A1': 'G1'])[0]
first_data_cell = 'a2'
last_data_cell = 'g%s' % (aapl_ws.max_row)
#last_data_cell = 'g1000'
year = aapl_ws.cell(row=2, column=1).value[:4]

aapl_separated_file = "data/AAPL_separated.xlsx"
aapl_separated_wb = Workbook()

ws = aapl_separated_wb.active
ws.title = year
ws.append([cell.value for cell in headers])

new_worksheets = {year: ws}

for row in aapl_ws[first_data_cell:last_data_cell]:
    #Each of these things is an individual cell
    date, p_open, p_high, p_low, p_close, p_vol, p_adj_close = row
    year = date.value[:4]
    if year not in new_worksheets:
        ws = aapl_separated_wb.create_sheet(title=year)
        new_worksheets[year] = ws
        ws.append([cell.value for cell in headers])
        
    else:
        ws = new_worksheets[year]
        
    ws.append([cell.value for cell in row])
    
aapl_separated_wb.save(aapl_separated_file)

# Machine and Human Readable Formats

## Scale of difficulty

1. JSON (easiest)
2. YAML
3. XML

## Common uses

* JSON is used to great success in programmatic web design (REST APIs for example)
* XML is used for heavyweight document and data-exchange formats (OOXML, ODF, RSS, SVG)
* YAML for config files

## Terms

1. Serialization
  * Serialization is the process of translating data structures or object state into a format that can be stored for later reconstruction and use. [Wikipedia](https://en.wikipedia.org/wiki/Serialization)
2. Markup Language
  * A markup language is a system for annotating a document in a way that is syntactically distinguishable from the text. [Wikipedia](https://en.wikipedia.org/wiki/Markup_language)

## JSON : JavaScript Object Notation   

What is [JSON](http://json.org/)? 
JSON is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

### Why JSON?

* It has a 5 page specification
  * Easy to parse, and therefore very fast to parse
* It is cross-language (every major language, and lots of minor ones, has a JSON encoder and decoder)
* Simple structure, and easy to understand

### Why not JSON?

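* No NaN (or infinity) in the standard
* Only a handful of types: strings, numbers, booleans, null, arrays, and objects. Everything else (dates, tuples, sets, non-string keys) has to be squeezed into one of those, which means that information can be lost when converting to JSON
  * You need to keep data types sometimes
* The kind of information that can be "JSONified" is more limited
* Simple structure, so it can be difficult to represent complex or interdependent structure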
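
A short sketch of that information loss, using only the standard `json` module. The dictionary is made up; note that Python's encoder emits `NaN` by default even though it is not part of the JSON standard, and refuses to do so when asked to be strict.

```python
import json

original = {"point": (1, 2), 3: "three", "missing": float("nan")}

text = json.dumps(original)
print(text)

restored = json.loads(text)
print(type(restored["point"]).__name__)   # the tuple came back as a list
print(list(restored.keys()))              # the integer key came back as a string

# With allow_nan=False the encoder sticks to the standard and rejects NaN
try:
    json.dumps(original, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
print("strict encoding succeeded:", strict_ok)
```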

## YAML: YAML Ain't Markup Language

What is [YAML](http://www.yaml.org/spec/1.2/spec.html)?
YAML is a **data serialization** language, **not** a markup language

### Why YAML?

* YAML is a superset of JSON
  * YAML imposes additional constraints on input data that JSON doesn't, like the uniqueness of keys.
* YAML is easy for a human to read
* Indentation matters (just like in Python)
* It has datatypes
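
Both points can be tried out with PyYAML. This is a small sketch on made-up documents: the first string is JSON text that also parses as YAML, and the second shows plain scalars coming back as typed Python values.

```python
import yaml

# JSON text that is also accepted as (flow-style) YAML
from_json = yaml.safe_load(
    '{"name": "my_new_module", "libraries": ["numpy", "scipy"], "home": null}')
print(from_json)

# Block-style YAML: scalars get a type without any quoting or annotation
typed = yaml.safe_load("""
released: 2015-06-25
downloads: 3
ratio: 0.5
stable: true
name: my_new_module
""")
print({key: type(value).__name__ for key, value in typed.items()})
```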

### Why not YAML?

* Not nearly as widely adopted as JSON or XML

## XML: eXtensible Markup Language

What is [XML](http://www.w3.org/TR/2008/REC-xml-20081126/#sec-intro)?

### Why (should you use) XML?

* Very stable and capable
* Wide adoption
* Structure can be pre-defined and enforced with DTDs (Document Type Definitions)
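
As a rough illustration of enforcing structure with a DTD, the sketch below validates two small documents with `lxml`. The `module` dialect and its DTD are invented for this example.

```python
import io
import lxml.etree

# Invented DTD: a module has exactly one name, then any number of
# libraries elements, then any number of dependencies elements
dtd = lxml.etree.DTD(io.StringIO("""
<!ELEMENT module (name, libraries*, dependencies*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT libraries (#PCDATA)>
<!ELEMENT dependencies (#PCDATA)>
"""))

good = lxml.etree.fromstring(
    "<module><name>my_new_module</name><libraries>numpy</libraries></module>")
bad = lxml.etree.fromstring(
    "<module><libraries>numpy</libraries></module>")  # the required name is missing

good_ok = dtd.validate(good)
bad_ok = dtd.validate(bad)
print(good_ok, bad_ok)
```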

### Why (should you) not (use) XML?

Examples:
* OOXML (Microsoft Office)
* ODT
* RSS
* XHTML
* SVG (Scalable Vector Graphics)

If you do not have lxml installed in your conda environment run
```
% conda install -y lxml
```

In [None]:
import json
import yaml
import lxml.etree

json_first = '''{"libraries":["numpy", "scipy"], 
               "dependencies": ["fftw", "mkl"], 
               "name":"my_new_module"}'''
                
#What is the difference between json_first and json_second?
json_second = {"libraries":["numpy", "scipy"], 
               "dependencies": ["fftw", "mkl"], 
               "name":"my_new_module"}

my_dict = {"libraries":["numpy", "scipy"], 
           "dependencies": ["fftw", "mkl"], 
           "name":"my_new_module"}

yamlized = yaml.dump(my_dict)
jsonized = json.dumps(my_dict)

xml = lxml.etree.Element("module")
xml.append(lxml.etree.Element("name"))
xml[-1].text="my_new_module"

xml.append(lxml.etree.Element("libraries"))
xml[-1].text = "numpy"
xml.append(lxml.etree.Element("libraries"))
xml[-1].text = "scipy"

xml.append(lxml.etree.Element("dependencies"))
xml[-1].text = "fftw"
xml.append(lxml.etree.Element("dependencies"))
xml[-1].text = "mkl"

xmlized = lxml.etree.tostring(xml)

## JSON

More details at https://docs.python.org/3/library/json.html

In [None]:
import json
notebook = json.load(open('data/notebook.ipynb'))
notebook.keys()

In [None]:
notebook['metadata']

In [None]:
{cell['cell_type'] for cell in notebook['cells']}

In [None]:
code = [cell['source'] for cell in notebook['cells'] 
                       if cell['cell_type'] == 'code']
for n, block in enumerate(code):
    if n < 6:
        continue
    print(''.join(block))
    print('='*65)
    if n > 8: 
        break

In [None]:
import pandas as pd
nyc_harbor_file = "data/nyc_harbor_wq_2006-2014.xlsx"
harbor_data = pd.read_excel(nyc_harbor_file)

harbor_row = harbor_data[:1].values  # get_values() was removed in pandas 1.0
myrow = harbor_row.tolist()[0][4:]
myrow

In [None]:
sum(pd.isnull(x) for x in myrow)

In [None]:
json.dumps(myrow)

In [None]:
json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])

In [None]:
json.load(open('data/pyyaml-index.json'))

In [None]:
%cat data/pyyaml-index.json

## YAML

More details at http://pyyaml.org/wiki/PyYAMLDocumentation

In [None]:
import yaml
%cat data/graphviz-meta.yaml

In [None]:
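# safe_load builds only plain Python types; recent PyYAML needs an explicit Loader for yaml.load
yaml.safe_load(open('data/graphviz-meta.yaml'))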

Learn to write YAML by dumping Python objects

For example, a list of dictionaries with some None's:

In [None]:
print(yaml.dump([{'key1': 'a', 'key2': 2, 'list_key': [1, 2, 'abc']}, {'key1': 'b', 'key2': 3, 'list_key': [None,None]}]))

In [None]:
print(yaml.dump(['foo', {'bar': ('baz', None, 1.0, 2)}]))

## XML

In [None]:
# Some XML files from the HDF5 descriptions info
metadata = "data/Granule_Metadata.xml"
collection = "data/GES_DISC_GPM_3GPROFF16SSMIS_DAY_V03_dif.xml"

### expat

More details at https://docs.python.org/3/library/pyexpat.html#module-xml.parsers.expat

In [None]:
import xml.parsers.expat as expat
indent = 0  # global variable quick-and-dirty

# 3 handler functions
def start_element(name, attrs):
    global indent
    print("  "*indent + 'Start element:', name, attrs)
    indent += 1
def end_element(name):
    global indent
    indent -= 1
    print("  "*indent + 'End element:', name)
def char_data(data):
    global indent
    print("  "*(indent-1) + 'Character data:', repr(data))

p = expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
p.ParseFile(open(metadata, 'rb'))

### ElementTree

More details at https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

In [None]:
import xml.etree.ElementTree as ET

def print_element(elem, indent=0):
    print("  "*indent + "Start element:", elem.tag, elem.attrib)
    print("  "*indent + "Character data:", repr(elem.text))
    for child in elem:
        print_element(child, indent+1)
    if elem.tail:
        print("  "*indent + "Character data:", repr(elem.tail))
    print("  "*indent + "End element:", elem.tag)

tree = ET.parse(metadata)
root = tree.getroot()
print_element(root)

In [None]:
for shortname in root.iter("ShortName"):
    print(shortname.text)

### SAX (Simple API for XML)

More details at https://docs.python.org/3/library/xml.sax.html#module-xml.sax

In [None]:
import xml.sax as sax

# Same event-driven (push) style as expat: you register handlers and the parser calls them.
# Slightly higher level.
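
A minimal SAX sketch, for comparison with the expat handlers above. The handler class and the tiny XML document are made up for illustration; the `ShortName` element name is borrowed from the ElementTree example.

```python
import xml.sax

class ShortNameHandler(xml.sax.ContentHandler):
    """Collect the text of every <ShortName> element."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.pieces = []
        self.names = []

    def startElement(self, name, attrs):
        if name == "ShortName":
            self.inside = True
            self.pieces = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it
        if self.inside:
            self.pieces.append(content)

    def endElement(self, name):
        if name == "ShortName":
            self.names.append("".join(self.pieces))
            self.inside = False

doc = b"<Granule><ShortName>GPM_3GPROF</ShortName><Size>12</Size></Granule>"
handler = ShortNameHandler()
xml.sax.parseString(doc, handler)
print(handler.names)
```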

### DOM (Document Object Model)

More details at https://docs.python.org/3/library/xml.dom.html#module-xml.dom

Really only use this if you need compatibility in programming style with older code, or with code in other programming languages like Java.  For the Pythonic high-level approach, use `ElementTree`

In [None]:
import xml.dom.minidom as DOM
dom = DOM.parse(metadata)
print(dom.childNodes)
root = dom.documentElement  # the root element, wherever it sits among childNodes
print(root.tagName, root.attributes.items())
# ... etc ...

## Exercise (representing and processing XML)

Remember the YAML file we looked at from a conda package?  See on your local system:

```
data/graphviz-meta.yaml
```

Many of your conda packages installed on your system have a similar file (i.e. called `meta.yaml` in a package directory).  For this exercise, imagine that Continuum Analytics were transported back in time to the early 2000s, and wanted to change the storage of all this package metadata into an XML format.

* Develop the XML dialect to be used to represent the data in this (and similar) YAML files.
  * You may define this dialect purely informally.  If you have way too much time, feel free to write a DTD (Document Type Definition), W3C XML Schema, or ISO RELAX NG, formal definitions of the dialect.
* Write the content of the mentioned data file as XML in the dialect you developed.
* Read the XML you have written out using one of the Python XML parsing libraries discussed.
* Write a utility function `get_requirements(meta, type_='build')` that will pull out a list of requirements for a package (either `build`, `conflicts`, or `run`) from your parsed representation of the XML.
  * If you have time, write a couple other utility functions that seem useful for working with your format.

## HDF5 Summary

More details at https://www.hdfgroup.org/HDF5/doc/H5.intro.html

1. HDF5 files that are accessed via h5py store and return numpy arrays
2. HDF5 files are composed of groups and datasets
3. Storing numerical-ish data is strongly recommended
4. Groups can be accessed like both Python dicts and like Unix filesystem paths
```python
# Full path
hdf5_file['/group1/subgroup2/subsubgroup1']
# Equivalent to:
g = hdf5_file['group1']
g['subgroup2/subsubgroup1']
# Or to nested lookup:
hdf5_file['group1']['subgroup2']['subsubgroup1']
```

5. We won't be covering HDF (aka HDF4).
  * HDF5 and HDF4 are two different things, even though they are by the same group

### Composition

HDF5 files are composed of **groups** and **datasets**.
A group contains any number of groups and datasets plus supporting metadata.
A dataset is a multidimensional array of data elements plus supporting metadata.

HDF5 files are organized like UNIX paths.
Every HDF5 file has a group (the root) at "/".

HDF5 groups are somewhat similar to Python dicts.

### Warning

You may have problems if you try to use both pytables and h5py at the same time.
This has been fixed in recent versions, but some people still use old stuff!!

* http://stackoverflow.com/questions/28333470/use-both-h5py-and-pytables-in-the-same-python-process
* https://github.com/h5py/h5py/issues/390
  
**ALWAYS** close the HDF5 file no matter what, after each small sequence of accesses.  

Since merely opening a file doesn't require reading or writing any data, it is safest to enclose each operation you wish to perform in a `with h5py.File("myfile.hdf5", "r") as f: ...` block, which closes the file even if an exception is raised.

If you do not have h5py installed in your conda environment run
```
% conda install -y h5py
```

In [None]:
# Step 1: Let's make a file!
import h5py
import numpy as np
import pandas as pd

filename = "tmp/my_first_hdf5.hdf5"

# h5py.File can take driver="driver", libver="latest|earliest", 
# and userblock_size=<size> arguments. In general, leave those options alone unless
#  - you are using parallel HDF5 (aka MPI). Then set driver="mpio"
#  - you have to squeeze every bit of performance from the application, 
#    and don't care if no-one else can use it. Then set libver="latest"
#  - userblock_size is NOT chunking. The user block is some space at the beginning 
#    of the file that really isn't a part of the HDF5 data.
my_first_hdf5 = h5py.File(filename, mode='w')
my_first_hdf5.close()

# Hurray! We made our first (rather boring) hdf5 file.

In [None]:
# Step 2: Put something in the file
with h5py.File(filename, mode='w') as my_first_hdf5:
    data = list(range(1000))
    my_first_hdf5['dataset1'] = data
    
# This example easily put a Python list into an HDF5 dataset
# We can (sort of) put arbitrary Python things into HDF5, but we shouldn't. 
# What should we store? Numerical-ish things.
# What should we not store? Whatever we want.
#
# Whatever! I do what I want! 
#   - Eric Cartman (S6E3)

In [None]:
# Step 3: Read the data
with h5py.File(filename, mode='r') as my_first_hdf5:
    data2 = my_first_hdf5['dataset1']

print(data2)
# Hmmm. Instead of getting the data, we instead got a "closed HDF5 dataset".
# This is because h5py is lazily loading data instead of loading everything at once.
#
# This is really good!
# What would happen if our dataset was 200GB? Could we load all of that into memory at once?
# Probably not. (Unless you are very lucky to have access to a server with that much RAM)
# But even if we have the memory, it probably doesn't make sense to load the whole thing 
# and then start processing it. It is probably smarter to iteratively load and process the 
# data in chunks.

In [None]:
# Step 3a: Actually read the data
with h5py.File(filename, mode='r') as my_first_hdf5:
    data2 = my_first_hdf5['dataset1'][:]

print(type(data2))
print(data2[:10])
# We put a Python list into the dataset, but got a numpy array out.
# Why?
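
The chunked processing suggested in Step 3 might look like the sketch below. It uses an in-memory file (`driver="core"` with `backing_store=False`) so that nothing is written to disk; the file name, dataset contents, and chunk size are arbitrary choices for illustration.

```python
import h5py
import numpy as np

with h5py.File("chunk_demo.hdf5", mode="w",
               driver="core", backing_store=False) as f:
    dset = f.create_dataset("dataset1", data=np.arange(1000))
    total = 0
    chunk = 100  # arbitrary chunk size for this sketch
    for start in range(0, dset.shape[0], chunk):
        # Slicing the dataset reads only this block into memory
        block = dset[start:start + chunk]
        total += int(block.sum())

print(total)
```

All the reading happens inside the `with` block, while the file is still open; only the plain Python result is used afterwards.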

In [None]:
# Step 4: Let's play with groups
with h5py.File(filename, mode='w') as my_first_hdf5:
    g1 = my_first_hdf5.create_group("first")
    # We can create nested groups automatically
    # second, third, and fourth will each be different groups
    g2 = my_first_hdf5.create_group("second/third/fourth")
    # We can create groups under a previously created group
    # Note: g1.create_group instead of my_first_hdf5.create_group
    g3 = g1.create_group("nestedfirst")
    g4 = g1.create_group("nestedsecond")
    #Now the group "first" has 
    
    g5 = my_first_hdf5.create_group("first/nestedthird")
    
# Questions:
# Where is group "first"? group "second"?
# How many groups are nested under "first"?
# What is the absolute path to group "nestedsecond"?

In [None]:
# What is a group?
# What is a dataset?
# Can a group contain another group?
# Can a group contain a dataset?
    
with h5py.File(filename, mode='r') as my_first_hdf5:
    list_of_groups = []
    # visit() recursively visits every group and dataset in a file
    # It calls the function that is given as an argument, stopping
    #  if that function returns anything other than None
    my_first_hdf5.visit(list_of_groups.append)
    #my_first_hdf5.visit(print)

list_of_groups

In [None]:
# Step 4a: Let's play with groups
with h5py.File(filename, mode='w') as my_first_hdf5:
    g1 = my_first_hdf5.create_group("first")
    # We can create nested groups automatically
    # second, third, and fourth will each be different groups
    g2 = my_first_hdf5.create_group("second/third/fourth")
    # We can create groups under a previously created group
    # Note: g1.create_group instead of my_first_hdf5.create_group
    g3 = g1.create_group("nestedfirst")
    g4 = g1.create_group("nestedsecond")
    # Now the group "first" has two nested groups; a third can be added by absolute path
    
    g5 = my_first_hdf5.create_group("first/nestedthird")

### Questions

1. Where is group "first"? group "second"?
2. How many groups are nested under "first"?
3. What is the absolute path to group "nestedsecond"?
4. What is a group?
5. What is a dataset?
6. Can a group contain another group?
7. Can a group contain a dataset?

In [None]:
# Step 5: Combining groups and datasets
filename = "tmp/my_second_hdf5.hdf5"
data = [[i+j*10 for i in range(10)] for j in range(100)]
data2 = np.arange(1000).reshape((10,20,5))

with h5py.File(filename, mode='w') as f:
    g = f.create_group("data")
    dset1 = g.create_dataset("dataset1", (100,10), np.dtype('i8'), data=data)
    # We could also have done it like so:
    # f['data/dataset1'] = data
    # What is the difference? create_dataset() is more flexible. It allows us to
    #  - specify size and shape
    #  - specify datatype
    #  - specify chunking
    #  - specify transparent compression
    #  - specify resizability
    dset2 = g.create_dataset("dataset2", data2.shape)
    dset2 = data2

In [None]:
with h5py.File(filename, mode='r') as f:
    dset1 = f['data/dataset1'][:]
    dset2 = f['data/dataset2'][:]
    
print(dset1.shape, "\n", dset1[:1])
print(dset2.shape, "\n", dset2[:1])
# Why is dset2 full of zeros?

In [None]:
with h5py.File(filename, mode='w') as f:
    g = f.create_group("data")
    # Option 1:
    dset2 = g.create_dataset("dataset2", shape=data2.shape, dtype=data2.dtype)
    dset2[:] = data2
    # The [:] is important!
    
    # Option 2:
    # f['data/dataset2'] = data2

with h5py.File(filename, mode='r') as f:
    dset2 = f['data/dataset2'][:]
    
print(dset1.shape, "\n", dset1[:1])
print(dset2.shape, "\n", dset2[:1])

In [None]:
# Iterating over datasets is also easy.
# Remember, each dataset is basically a numpy array that is read from disk on demand
with h5py.File(filename, mode='r') as f:
    for item in f['data/dataset2']:
        print(item)

In [None]:
# Step 6: Deleting datasets from a file
filename = "tmp/my_third_hdf5.hdf5"

with h5py.File(filename, "w") as f:
    f['data/dataset1'] = np.arange(100000).reshape(10,10000)

%ls -l $filename
with h5py.File(filename, "r+") as f:
    del f['data/dataset1']
    %ls -l $filename

# The dataset isn't actually deleted until the file is closed
%ls -l $filename

with h5py.File(filename, "r+") as f:
    try:
        del f['data/dataset1']
    except KeyError:
        print("Trying to delete dataset that doesn't exist")

In [None]:
# Step 6a: Deleting entire groups
with h5py.File(filename, "w") as f:
    f['data/dataset1'] = np.arange(100000).reshape(10,10000)
    f['data/dataset2'] = np.arange(100000,200000).reshape(10,10000)
    f['data/dataset3'] = np.arange(200000,300000).reshape(10,10000)
    
%ls -l $filename
with h5py.File(filename, "r+") as f:
    del f['data']
    %ls -l $filename

# The dataset isn't actually deleted until the file is closed
%ls -l $filename

with h5py.File(filename, "r+") as f:
    l = []
    f.visit(l.append)

# Notice that the file didn't shrink to a small number of bytes.
# The datasets and group have been unlinked, but the space hasn't been reclaimed.
# To shrink the file, we need to run an "h5repack" on it.
l

In [None]:
# Step 7: Updating an existing dataset (first, create some datasets to update)
filename = "tmp/my_fourth_hdf5.hdf5"

with h5py.File(filename, "w") as f:
    f['data/dataset1'] = np.arange(100000).reshape(10000,10)
    f['data/dataset2'] = np.arange(100000,200000).reshape(10000,10)
    f['data/dataset3'] = np.arange(200000,300000).reshape(10000,10)

In [None]:
# Step 7: Updating datasets
with h5py.File(filename, "r+") as f:
    print(f['data/dataset1'][:10])
    f['data/dataset1'][:5] = -1
    
with h5py.File(filename, "r+") as f:
    print(f['data/dataset1'][:10])

In [None]:
# Step 8: resizing existing datasets
d1 = np.arange(100000).reshape(10000,10)
with h5py.File(filename, "w") as f:
    # make a new dataset that can grow to 10x the initial size
    dset1 = f.create_dataset("resizable/dataset1", d1.shape, 
                             maxshape=(d1.shape[0]*10, d1.shape[1]))
    dset1[:] = d1
    
    # Here is an alternate way to create the dataset
    # f.create_dataset("resizable/dataset1", d1.shape, 
    #                  maxshape=(d1.shape[0]*10, d1.shape[1]), data=d1)
%ls -l $filename    

with h5py.File(filename, "r+") as f:
    # double the size of the dataset
    dset1 = f["resizable/dataset1"]
    print(dset1.shape)
    print(dset1.maxshape)
    dset1.resize(dset1.shape[0]*2, axis=0)
    print(dset1.shape)
    
    dset1[dset1.shape[0]//2:] = d1

%ls -l $filename
with h5py.File(filename, "r+") as f:
    # Check that the dataset is actually the size we want
    dset1 = f["resizable/dataset1"]
    d1 = dset1[:]
    print(d1.shape)
    print(d1[-1])

In [None]:
with h5py.File(filename, "r+") as f:
    # resize again, past the maxshape limit set at creation (expect h5py to raise an error)
    dset1 = f["resizable/dataset1"]
    print(dset1.shape)
    print(dset1.maxshape)
    dset1.resize(dset1.shape[0]*6, axis=0)
    print(dset1.shape)

In order for datasets to be resized, they *must* be chunked.

Chunking can be specified explicitly, but it also happens automatically when:

- compression is turned on
- maxshape is specified for the dataset

Intuition about chunking

- Specifying the chunk size is easy to get wrong, especially because several subtle factors interact:
  - Chunk size
  - Compression
  - Chunk cache size
  - Underlying disk subsystem (especially for parallel filesystems)

http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/
http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/Chunking_Tutorial_EOS13_2009.pdf

**If the chunk size is wrong, accessing the data can be 10-100 times slower than normal.**

Moral of the story: Don't set chunking yourself unless you can conclusively demonstrate that it is needed.
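
To see when chunking is switched on automatically, one can inspect the `chunks` property of a few datasets. The sketch below is illustrative only: the file and dataset names are hypothetical, and the file is kept in memory so it does not touch the disk.

```python
import h5py
import numpy as np

with h5py.File("chunk_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Plain dataset: stored contiguously, so no chunk shape is reported
    f.create_dataset("fixed", data=np.arange(1000).reshape(100, 10))
    # maxshape given: h5py picks a chunk shape for us
    growable = f.create_dataset("growable", shape=(100, 10),
                                maxshape=(None, 10), dtype="i8")
    # compression given: h5py also picks a chunk shape for us
    f.create_dataset("compressed", shape=(100, 10), dtype="i8",
                     compression="gzip")

    chunk_report = {name: f[name].chunks
                    for name in ("fixed", "growable", "compressed")}

    # Resizing works because "growable" is chunked and has room in maxshape
    growable.resize(200, axis=0)
    new_shape = growable.shape

print(chunk_report)
print(new_shape)
```

The automatically guessed chunk shapes are left to h5py here, in line with the advice above not to set chunking by hand without a demonstrated need.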


In [None]:
#Step 9: HDF5 Attributes on Groups and Datasets
#Step 10: Transparent compression
# - Why transparent compression?
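
The two topics named in this cell can be sketched briefly. This is only one possible illustration, not the notebook's own example; the group name, dataset name, and attribute values are invented. Attributes attach small pieces of metadata to groups and datasets, and compression is "transparent" in the sense that the code reading the data does not change.

```python
import h5py
import numpy as np

with h5py.File("attrs_demo.hdf5", "w", driver="core", backing_store=False) as f:
    group = f.create_group("data")
    # Highly repetitive data is a good candidate for compression
    dset = group.create_dataset("zeros", data=np.zeros((1000, 100)),
                                compression="gzip")

    # Attributes behave like a small dictionary on the group or dataset
    group.attrs["description"] = "hypothetical example group"
    dset.attrs["units"] = "mm/day"

    units = dset.attrs["units"]
    compression = dset.compression
    # Reading looks exactly the same as for an uncompressed dataset
    roundtrip_ok = bool((dset[:] == 0).all())

print(units, compression, roundtrip_ok)
```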

### Exploring an HDF5 file found "in the wild"

In [None]:
import numpy as np
import pandas as pd
import h5py
metadata = "data/Granule_Metadata.xml"
collection = "data/GES_DISC_GPM_3GPROFF16SSMIS_DAY_V03_dif.xml"
hdf5_precip = "data/3A-DAY.F16.SSMIS.GRID2014R2.20150101-S000000-E235959.001.V03C.HDF5"

In [None]:
import webbrowser, os
try:
    from urllib.parse import quote
except ImportError:
    from urllib import quote # Python 2.7
webbrowser.open("file:///%s/%s" % (os.getcwd(), quote(metadata)))
webbrowser.open("file:///%s/%s" % (os.getcwd(), quote(collection)))

In [None]:
f = h5py.File(hdf5_precip, "r")
list(f.items())

In [None]:
f['InputFileNames']

In [None]:
f['InputFileNames'][0]

In [None]:
inputFileNames = list(f['InputFileNames'])[0].decode().split(',')
inputFileNames

In [None]:
grid_datasets = list(f['Grid'])
grid_datasets

In [None]:
rain = f['Grid']['liquidPrecipFraction']
print(rain)

In [None]:
rain[:5,:10]

We notice a lot of apparently sentinel values in these datasets. The value -9999.90039062 seems to be used as a fill value in a presumably sparse array (the file is not large enough to hold all of the data if it were stored densely, as we will see).
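
One common way to keep such fill values out of later calculations is to mask them. The sketch below uses a small made-up array rather than the real file, and assumes that everything below -9999 is a sentinel; `SENTINEL_CUTOFF` and `grid` are hypothetical names.

```python
import numpy as np

SENTINEL_CUTOFF = -9999   # assumed threshold: anything below it is a fill value

# Hypothetical 2x3 grid standing in for one of the 'Grid' datasets
grid = np.array([[-9999.9, 0.25, -9999.9],
                 [0.5, -9999.9, 1.0]], dtype=np.float32)

valid = grid >= SENTINEL_CUTOFF                     # boolean mask of real measurements
masked = np.ma.masked_less(grid, SENTINEL_CUTOFF)   # masked array hides the sentinels

n_valid = int(valid.sum())
mean_valid = float(masked.mean())                   # computed over unmasked cells only
print(n_valid, "valid cells; mean of valid cells:", mean_valid)
```

Without the mask, the mean would be dominated by the fill values rather than by the measurements.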

In [None]:
# Let us see which datasets have meaningful values, and how commonly
from operator import mul
from functools import reduce

for dataset in grid_datasets:
    data = f['Grid'][dataset]
    # Anything at or above -9999 is treated as a real measurement
    non_sentinel = data[:] >= -9999
    print(dataset, "has real data in %d of %d positions" % (
                    non_sentinel.sum(), reduce(mul, data.shape, 1)))
    print("-", data)

In [None]:
pd.DataFrame(rain[:10,:10])

In [None]:
pd.DataFrame(rain[705:716,400:411])

In [None]:
drizzle = (.1 < rain[:]) & (rain[:] < .9)              
drizzle.sum()

In [None]:
times = f['InputGenerationDateTimes']
times[0].decode('utf-8').split(',')

In [None]:
list(f.attrs.keys())

In [None]:
f.attrs['FileInfo'].decode('utf-8').split('\n')

In [None]:
f.attrs['FileHeader'].decode('utf-8').split('\n')

In [None]:
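# We've already seen that mixedWater is only those sentinel values,
# but it still shows how to slice an N-dimensional dataset.
# Note: pd.Panel is no longer available in current pandas releases,
# so we slice the underlying numpy array and wrap the 2-D result in a DataFrame.
mixedWater = f['Grid']['mixedWater']
cube = mixedWater[:]
cube.shape

In [None]:
pd.DataFrame(cube[10:15, 700, 400:411])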

The basic creation of a new HDF5 data file is done with:

```python
>>> import h5py
>>> import numpy as np
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100,), dtype='i')
```

## NetCDF

More details at http://unidata.github.io/netcdf4-python/

If you do not have netCDF4 installed in your conda environment, run:
```
% conda install -y netcdf4
```

In [None]:
import pandas as pd
import netCDF4
f = netCDF4.Dataset('data/sresa1b_ncar_ccsm3-example.nc')
f

In [None]:
f.variables.keys()

In [None]:
f['pr']

In [None]:
f['pr'][:].squeeze()

In [None]:
f['pr'].dimensions

In [None]:
precip_flux = pd.DataFrame(f['pr'][:].squeeze())
# Slice with [:] to get the coordinate values rather than the netCDF variable objects
precip_flux.columns = f['lon'][:]
precip_flux.index = f['lat'][:]
precip_flux

## Exercise (export to scientific formats)

Using the NYC Harbor data set—and perhaps also the normalization work done in the previous exercise—save the data to compact scientific data formats, HDF5 and/or NetCDF. 

* Take advantage of the option of saving multiple datasets into HDF5 or NetCDF to break down the data.
* Store the data in its native types per column/cell (Pandas does a good job of inferring data types)
* How large is the resulting HDF5/NetCDF file compared to the original Excel file?
* Compose some interesting queries of the database to extract patterns or features of the data.

## Review of HDF5 and NetCDF

The tutorial at http://docs.h5py.org/en/latest/quick.html is likely to be useful.

The basic creation of a new NetCDF data file is done with:

```python
>>> from netCDF4 import Dataset
>>> rootgrp = Dataset("test.nc", "w", format="NETCDF4")
>>> print(rootgrp.data_model)
NETCDF4
>>> rootgrp.close()
```

The tutorial at http://nbviewer.ipython.org/github/Unidata/netcdf4-python/blob/master/examples/writing_netCDF.ipynb is likely to be useful.

# IDL .sav files

In [None]:
import pandas as pd
from scipy.io import readsav
datafile = "data/1985_2010_Cedar_Creek_Resident_Fish_Data_for_Analysis.bin"

In [None]:
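def idl2df(fname, key='o', verbose=True):
    "Read a data frame from an IDL .sav file; default key is the one used by QEA"
    # Read the file named by the fname argument
    data = readsav(fname, verbose=verbose)
    top = data[key]
    columns = top.dtype.names
    df = pd.DataFrame(list(zip(*top[0])), columns=columns)
    return df

def df_bytes2str(df, columns=None, encoding='utf-8'):
    "Decode, in place, the columns of df whose first value is a bytes object"
    # Test for None explicitly so that a pandas Index can also be passed in
    columns = df.columns if columns is None else columns
    for col in columns:
        if isinstance(df[col].iloc[0], bytes):
            df[col] = df[col].str.decode(encoding)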

In [None]:
df = idl2df(datafile, verbose=False)
df_bytes2str(df)
df[df.SPECIES == 'Carp']

# Learning Objectives

* Work with SQLite3 single-file databases
* Work with RDBMSs using the DBAPI standard

## Preamble

In [None]:
# Load some data we'll use for later examples
import src.rdb as rdb
cowlitz_file = 'data/cowlitz_river_wa_usgs_flow_data.rdb'
comment, header, cowlitz_data = rdb.read_rdb(cowlitz_file)

# Notice the form of this data is a list of namedtuples
print("%d rows of data" % len(cowlitz_data), end='\n----------\n')
for row in cowlitz_data[:3]:
    print(row)

In [None]:
# Load some data we'll use for later examples
# Note the form of this data is a Pandas DataFrame
import pandas as pd
aapl = pd.read_csv('data/AAPL.csv', index_col='Date')
print("%d rows of data" % len(aapl), end='\n----------\n')
aapl[:3]

# Sqlite3

In [None]:
import os
try:
    os.remove('tmp/test-db')
except OSError:
    print("File already not there")
#!rm tmp/test-db

In [None]:
import sqlite3
db = sqlite3.connect("tmp/test-db")
db.execute("create table stocks "
           "(symbol text, shares integer, price real, "
           " primary key (symbol))")
db.commit()

In [None]:
db.execute("insert into stocks values (?, ?, ?)", ('IBM', 50, 91.10))
db.execute("insert into stocks values (?, ?, ?)", ('AAPL', 100, 123.45))
db.commit()

In [None]:
for row in db.execute("select * from stocks"):
    print(row)

In [None]:
stocks = [('GOOG', 75, 380.13), ('AA', 100, 14.20), ('AIG', 124, 0.99)]
db.executemany("insert into stocks values (?, ?, ?)", stocks)
db.commit()

In [None]:
list(db.execute("select * from stocks"))

In [None]:
list(db.execute("select symbol, price from stocks where shares >= 100"))

In [None]:
db.execute("insert into stocks values (?, ?, ?)", ('IBM', 100, 124.5))
db.commit()
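
Because `symbol` is the primary key, inserting a second 'IBM' row is rejected by SQLite. A minimal sketch of handling that case is shown here, using a throwaway in-memory database (`:memory:`) rather than `tmp/test-db`; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("create table stocks "
             "(symbol text, shares integer, price real, primary key (symbol))")
conn.execute("insert into stocks values (?, ?, ?)", ('IBM', 50, 91.10))
conn.commit()

try:
    # Same primary key as the existing row, so this violates the constraint
    conn.execute("insert into stocks values (?, ?, ?)", ('IBM', 100, 124.5))
    conn.commit()
    duplicate_rejected = False
except sqlite3.IntegrityError as err:
    conn.rollback()                  # discard the failed transaction
    duplicate_rejected = True
    print("Insert rejected:", err)

rows = list(conn.execute("select * from stocks"))
print(rows)
conn.close()
```

The original row is left untouched, which is the point of declaring the primary key in the first place.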

In [None]:
%ls -l tmp/test-db

In [None]:
db.execute("CREATE TABLE cowlitz "
           "(agency_cd TEXT, site_no INTEGER, date DATE, "
           " discharge REAL, status TEXT, PRIMARY KEY (date))")
for row in cowlitz_data:
    db.execute("INSERT INTO cowlitz VALUES (?, ?, ?, ?, ?)", row)
db.commit()

In [None]:
est = db.execute("SELECT COUNT(*) FROM cowlitz WHERE status='A:e'")
list(est)

In [None]:
for d in db.execute("SELECT * FROM cowlitz WHERE "
                    "date >= '1988-01-01' AND date < '1988-01-10'"):
    print(d)

In [None]:
%ls -l tmp/test-db
%ls -l $cowlitz_file

In [None]:
# Need Pandas column names to be valid SQL column names
aapl['Adj_Close'] = aapl['Adj Close']
del aapl['Adj Close']
aapl[:3]

In [None]:
aapl.to_sql('AAPL', db)

In [None]:
for row in db.execute("SELECT * FROM AAPL LIMIT 10"):
    print(row)
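
Going the other way, a query result can be read straight back into a DataFrame. The following sketch assumes nothing about the real AAPL data: the table name `prices` and all of the values are made up, and an in-memory database is used.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for the AAPL frame; the values are invented
prices = pd.DataFrame(
    {"Open": [10.0, 11.0, 12.0], "Adj_Close": [10.5, 11.5, 12.5]},
    index=pd.Index(["2015-01-02", "2015-01-05", "2015-01-06"], name="Date"),
)
prices.to_sql("prices", conn)   # the index is written as a "Date" column

# Values are passed as parameters rather than pasted into the SQL string
query = "SELECT Date, Adj_Close FROM prices WHERE Adj_Close > ? ORDER BY Date"
result = pd.read_sql_query(query, conn, params=(11.0,), index_col="Date")
print(result)
conn.close()
```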

# PostgreSQL (and DBAPI generally)

In [None]:
import psycopg2 # maybe import oracledb, mysql, db2
conn = psycopg2.connect(database='test', user='dmertz')
cursor = conn.cursor()
cursor.execute('SELECT version()')
version = cursor.fetchone()
print(version)

In [None]:
cursor.execute("drop table if exists stocks")
conn.commit()

In [None]:
cursor.execute("create table stocks "
               "(symbol text, shares integer, price real, "
               "primary key (symbol))")
conn.commit()

In [None]:
cursor.execute("insert into stocks values (%s, %s, %s)", 
               ('IBM', 50, 91.10))
cursor.execute("insert into stocks values (%s, %s, %s)", 
               ('AAPL', 100, 123.45))
conn.commit()

In [None]:
cursor.execute("select * from stocks;")
for row in cursor:
    print(row)

In [None]:
stocks = [('GOOG', 75, 380.13), ('AA', 100, 14.20), ('AIG', 124, 0.99)]
cursor.executemany("insert into stocks values (%s, %s, %s)", stocks)
conn.commit()

In [None]:
cursor.execute("select * from stocks")
list(cursor)

In [None]:
cursor.execute("select column_name, data_type, character_maximum_length "
               "from INFORMATION_SCHEMA.COLUMNS "
               "where table_name = 'stocks'")
list(cursor)

In [None]:
cursor.execute("select symbol, price from stocks where shares >= 100")
lots_of_shares = cursor.fetchall()
lots_of_shares

In [None]:
try:
    cursor.execute("insert into stocks values (%s, %s, %s)", 
                   ('IBM', 100, 124.5))
finally:
    conn.rollback()

In [None]:
try:
    cursor.execute("drop table cowlitz")
except psycopg2.ProgrammingError:
    print("Table does not exist... create below")
finally:
    conn.commit()
cursor.execute("CREATE TABLE cowlitz "
               "(agency_cd TEXT, site_no INTEGER, date DATE, "
               " discharge REAL, status TEXT, PRIMARY KEY (date))")
cursor.executemany("INSERT INTO cowlitz VALUES (%s, %s, %s, %s, %s)", 
                   cowlitz_data)
conn.commit()

In [None]:
cursor.execute("SELECT COUNT(*) FROM cowlitz WHERE status = 'A:e'")
cursor.fetchall()

In [None]:
cursor.execute("SELECT * FROM cowlitz WHERE "
               "date >= '1988-01-01' AND date < '1988-01-10'")
cursor.fetchall()
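
The PostgreSQL cells follow the same connect, cursor, execute, fetch, commit pattern as the SQLite ones; the visible difference is the placeholder style (`?` for `sqlite3`, `%s` for `psycopg2`). The sketch below shows that pattern written once against the DBAPI calls only. It runs against an in-memory SQLite database so that it needs no server, and `fetch_big_positions` is a hypothetical helper, not part of either driver.

```python
import sqlite3

def fetch_big_positions(conn, min_shares, placeholder="?"):
    """Return (symbol, price) rows using only DBAPI calls."""
    cursor = conn.cursor()
    # Only the placeholder marker goes into the SQL text; the value itself
    # is passed separately so the driver can handle quoting.
    sql = ("select symbol, price from stocks "
           "where shares >= {} order by symbol").format(placeholder)
    cursor.execute(sql, (min_shares,))
    rows = cursor.fetchall()
    cursor.close()
    return rows

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table stocks "
            "(symbol text, shares integer, price real, primary key (symbol))")
cur.executemany("insert into stocks values (?, ?, ?)",
                [('IBM', 50, 91.10), ('AAPL', 100, 123.45), ('AIG', 124, 0.99)])
conn.commit()

print(sqlite3.paramstyle)   # each DBAPI module advertises its placeholder style
big = fetch_big_positions(conn, 100)
print(big)
conn.close()
```

With a `psycopg2` connection, the same helper would presumably be called with `placeholder="%s"`.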

## Fortran 77 Unformatted

In [None]:
import numpy as np
from scipy.io import FortranFile
fortran_raw = "data/gratsr-fortran.bin"

In [None]:
ff = FortranFile(fortran_raw, 'r')
print(ff.read_reals(dtype=np.float16))
print(ff.read_record(dtype=[('X', np.float32)]))
#ff.read_record(dtype=[('BT', '<f4')])
print(ff.read_record(dtype=[('X', 'S1')]))
print(ff.read_record(dtype=[('X', '<f4')]))

<img src='img/copyright.png'>