<img src='img/logo.png'>
<img src='img/title.png'>
<img src='img/py3k.png'>

# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Working with US County polygons](#Working-with-US-County-polygons)
* [Problem: Parsing a custom data format](#Problem:-Parsing-a-custom-data-format)
	* [Pedantic style policy](#Pedantic-style-policy)
	* [Read the large dataset](#Read-the-large-dataset)
	* [Dealing with flawed data](#Dealing-with-flawed-data)
		* [Flaws in sample data](#Flaws-in-sample-data)


# Learning Objectives

* Read novel data formats
* Develop library code
* Write well-structured code

**The exercises in the Geospatial series are intended to present a realistically difficult set of development problems.  A class might take half a day or more to work through them thoroughly.**

# Working with US County polygons

We have a data file with polygons defining US county borders. This format is different from the one used for census tracts that we looked at in the previous exercise.  We'd like to be able to read this format into our preferred representation of polygons, the one used by other library functions.

First, let us eyeball the format in question.

In [120]:
counties = open('data/US_Counties.csv').readlines()
print(len(counties))
counties[:2]

3250


['County Name,State-County,state abbr,State Abbr.,geometry,value,GEO_ID,GEO_ID2,Geographic Name,STATE num,COUNTY num,FIPS formula,Has error\n',
 'Autauga,AL-Autauga,al,AL,"<Polygon><outerBoundaryIs><LinearRing><coordinates>-86.41182,32.4757 -86.41177,32.46599 -86.41167,32.45054 -86.41157,32.44245 -86.41154,32.43993 -86.41138,32.42573 -86.41135,32.42417 -86.41128,32.42185 -86.41117,32.41017 -86.41117,32.40994 -86.41615,32.4072 -86.43178,32.40132 -86.43926,32.40025 -86.44653,32.40036 -86.45876,32.40573 -86.4612,32.40285 -86.46247,32.38769 -86.46356,32.37729 -86.46836,32.37368 -86.47092,32.37136 -86.47306,32.36874 -86.47476,32.36588 -86.47777,32.36418 -86.48023,32.36497 -86.48342,32.36667 -86.4871,32.36674 -86.49047,32.36532 -86.49265,32.36286 -86.49263,32.36032 -86.49181,32.35787 -86.4908,32.35513 -86.48994,32.35264 -86.4899,32.34985 -86.49071,32.34705 -86.49308,32.34509 -86.49637,32.34451 -86.49677,32.34444 -86.49697,32.34441 -86.51978,32.34039 -86.54242,32.36286 -86.56658,32.37296 -86.

# Problem: Parsing a custom data format

Write a custom function `read_county_data()` that reads this format and stores all the information in well-chosen data structures. You can see that the format of county borders is a kind of CSV data, with fragments of XML embedded inside of it.  Outside the XML fragment is various other data about the county being described by the boundary.  
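As a starting point, here is a minimal sketch of one way to pull apart a row of this format, using the standard library `csv` and `xml.etree.ElementTree` modules. The helper names `parse_polygons()` and `read_county_rows()` are hypothetical, the sample header is abbreviated to the first five columns, and the sketch assumes every XML fragment is well formed. It collects every `<coordinates>` element it finds, so it does not distinguish outer boundaries from holes.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import namedtuple

Coord = namedtuple('Coord', 'lon lat')


def parse_polygons(fragment):
    """Return a set of polygons, each a tuple of Coord points."""
    root = ET.fromstring(fragment)
    polygons = set()
    # iter() finds <coordinates> elements at any nesting depth
    for element in root.iter('coordinates'):
        points = []
        for pair in element.text.split():
            lon, lat = pair.split(',')
            points.append(Coord(float(lon), float(lat)))
        polygons.add(tuple(points))
    return polygons


def read_county_rows(lines):
    """Map (state, county) keys to polygon sets (hypothetical helper)."""
    counties = {}
    for row in csv.DictReader(lines):
        key = (row['State Abbr.'], row['County Name'])
        counties[key] = parse_polygons(row['geometry'])
    return counties


# Abbreviated sample: only the first five columns and three vertices
sample = io.StringIO(
    'County Name,State-County,state abbr,State Abbr.,geometry\n'
    'Autauga,AL-Autauga,al,AL,"<Polygon><outerBoundaryIs><LinearRing>'
    '<coordinates>-86.41182,32.4757 -86.41177,32.46599 -86.41167,32.45054'
    '</coordinates></LinearRing></outerBoundaryIs></Polygon>"\n'
)
counties = read_county_rows(sample)
```

Letting `csv` handle the quoting means the commas inside the quoted XML fragment never have to be split by hand; the XML parser then only sees the geometry field.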

As test cases for your code, draw some visualizations of the county boundaries in the file to assure yourself they look sensible.  As a further unit test, look up the area of some counties using internet searches, and verify that the `polygon_area()` function you wrote in a previous exercise gives results consistent with the generally published sizes of those counties.

In [None]:
%load src/read_counties.py

## Pedantic style policy

It is not necessarily a bad idea to use `pep8` (since renamed `pycodestyle`) or similar tools to ensure compliance with Python style guides.  The above sample code passes this test with no complaints.

In [145]:
!pep8 src/read_counties.py

## Read the large dataset

Notice that there are a few consistency issues in this sample data.  What do we want to do to remediate these issues?

In [139]:
counties = read_county_data('data/US_Counties.csv')



In [153]:
counties[('CA', 'Los Angeles')]

{(Coord(lon=-118.3289, lat=32.87504),
  Coord(lon=-118.32571, lat=32.87173),
  Coord(lon=-118.31739, lat=32.86376),
  Coord(lon=-118.30489, lat=32.85346),
  Coord(lon=-118.2924, lat=32.83639),
  Coord(lon=-118.28878, lat=32.82083),
  Coord(lon=-118.2899, lat=32.81053),
  Coord(lon=-118.29584, lat=32.79826),
  Coord(lon=-118.30285, lat=32.78885),
  Coord(lon=-118.31853, lat=32.7768),
  Coord(lon=-118.33197, lat=32.77057),
  Coord(lon=-118.35349, lat=32.76581),
  Coord(lon=-118.36427, lat=32.76708),
  Coord(lon=-118.37386, lat=32.76885),
  Coord(lon=-118.37838, lat=32.77023),
  Coord(lon=-118.3812, lat=32.76808),
  Coord(lon=-118.38535, lat=32.76356),
  Coord(lon=-118.39256, lat=32.7585),
  Coord(lon=-118.39952, lat=32.75523),
  Coord(lon=-118.40832, lat=32.75191),
  Coord(lon=-118.4162, lat=32.75059),
  Coord(lon=-118.43654, lat=32.75004),
  Coord(lon=-118.44091, lat=32.75103),
  Coord(lon=-118.45381, lat=32.75511),
  Coord(lon=-118.46471, lat=32.76018),
  Coord(lon=-118.47391, lat=32.7

In [154]:
# Perhaps these utility functions would be helpful
def counties_in_state(state, counties):
    return {k[1] for k in counties if k[0] == state}

def states_with_county(county, counties):
    return {k[0] for k in counties if k[1] == county}

In [160]:
counties_in_state('CT', counties)

{'Fairfield',
 'Hartford',
 'Litchfield',
 'Middlesex',
 'New Haven',
 'New London',
 'Tolland',
 'Windham'}

In [159]:
states_with_county('Wood', counties)

{'OH', 'TX', 'WI', 'WV'}

## Dealing with flawed data

Sometimes the data you receive contains errors.  Your function should deal gracefully with errors in data files. Errors may be either in the format of a file or in the data values it contains.  Some examples of problematic data are provided:

* data/Colorado_Counties.csv
* data/Colorado_Counties-err1.csv
* data/Colorado_Counties-err2.csv
* data/Colorado_Counties-err3.csv

Consider both failing a parse in a descriptive way and attempting to recover as much data as is possible.

### Flaws in sample data

The data you may find in the wild are diverse, and certainly these three artificial errors are not exhaustive.  However, catching each of these errors and providing *reasonable* error messages or remediation shows a desirable degree of defensive programming.  The descriptions below suggest what reasonable parsing results might look like.

---
`Colorado_Counties-err1.csv` contains an extra comma or field on line 21, with a blank 4th field rather than the embedded XML polygon.  A good result of calling it might look something like:

```python
>>> counties = read_county_data('Colorado_Counties-err1.csv')
```

<pre><font color="red">county_data.py:92: SyntaxWarning: Colorado_Counties-err1.csv : 
    Line 21 has 13 fields (12 expected) [SKIPPING]</font></pre>

You have to decide the failure mode of your API, which might be subject to a switch.  Disregarding the bad line of data is a reasonable option, perhaps controlled by a keyword argument to treat this condition as fatal or to continue with other lines.
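The keyword-argument switch described above could be sketched as follows. The function name `read_rows()` and the `strict` parameter are hypothetical, and the expected field count is taken from the header row rather than hard-coded. The sample data is invented for illustration.

```python
import csv
import io
import warnings


def read_rows(lines, filename='<data>', strict=False):
    """Yield-free reader: return good rows, warn or fail on bad ones."""
    reader = csv.reader(lines)
    header = next(reader)
    rows = []
    for row in reader:
        if len(row) != len(header):
            msg = (f'{filename} : Line {reader.line_num} has {len(row)} '
                   f'fields ({len(header)} expected)')
            if strict:
                # Treat the malformed line as fatal
                raise ValueError(msg)
            # Otherwise report it and carry on with the remaining lines
            warnings.warn(msg + ' [SKIPPING]', SyntaxWarning)
            continue
        rows.append(dict(zip(header, row)))
    return rows


# Invented sample: line 3 has a stray extra comma
sample = io.StringIO(
    'name,state,geometry\n'
    'Adams,CO,poly\n'
    'Boulder,CO,,poly\n'
    'Denver,CO,poly\n'
)
rows = read_rows(sample, 'sample.csv')
```

Using the `warnings` module rather than `print()` lets callers decide what to do with the report: they can silence it, log it, or promote it to an exception with a warnings filter, without any change to the reader.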

---
`Colorado_Counties-err2.csv` contains "bad" data.  Line 8 contains **very** bad data: two points of the polygon are at approximately -39 degrees latitude rather than +39 degrees.  Visualizing this polygon will show its strange shape.  Line 15, however, contains only **slightly** bad data: one of the longitude values is incorrectly about -106 degrees rather than -104 degrees, so the misplaced vertex still falls within Colorado as a whole.  How might we determine "how bad" unexpected data must be before we rule it corrupted rather than merely unusual?  A possible result:

```python
>>> counties = read_county_data('Colorado_Counties-err2.csv', irregularity=2*sigma)
```

<pre><font color="red">county_data.py:117: RuntimeWarning: Colorado_Counties-err2.csv : 
    Line 8 contains out-of-bounds latitude -39.96458 [SKIPPING]</font></pre>
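One simple way to put a number on "how bad" is to measure each value's distance from the mean in standard deviations. The sketch below is an assumption about what an `irregularity` threshold might do, not a description of the sample solution; the name `out_of_bounds()` and the latitude list are invented for illustration.

```python
import statistics


def out_of_bounds(values, n_sigma=2.0):
    """Return indices of values more than n_sigma std devs from the mean."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [i for i, v in enumerate(values)
            if abs(v - mean) > n_sigma * sigma]


# Invented latitudes spanning Colorado, plus one sign-flipped value
latitudes = [37.0, 38.0, 39.0, 40.0, 41.0] * 4 + [-39.96458]
flagged = out_of_bounds(latitudes)
```

A test over all the vertices in a file catches the sign-flipped latitude easily, but it cannot catch the -106 longitude, because that value is well inside the range of the state. Catching it would need a narrower comparison, such as against the other vertices of the same polygon, and the right threshold there is a judgment call. With only a handful of values, a single outlier also inflates the standard deviation enough to hide itself, so this test suits large samples best.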

---
`Colorado_Counties-err3.csv` contains two stray null characters on line 8.  Somewhat artificially, the final 'r' in "Boulder" is replaced with a NULL in the first two fields. Guarding against this is context dependent: if this specific corruption is commonplace, simply stripping it out is probably reasonable; if the corruption is a one-time issue, a more generic failure seems appropriate.  This illustrates a file that most readers simply crash trying to read (or perhaps corrupt the data in other ways), so some more guarded behavior is good practice.  For example:

```python
>>> counties = read_county_data('Colorado_Counties-err3.csv', strip_nulls=True)
```

<pre><font color="red">county_data.py:134: RuntimeWarning: Colorado_Counties-err3.csv : 
    NULL bytes detected and removed [REPROCESSING]</font></pre>

```python
>>> counties = read_county_data('Colorado_Counties-err3.csv', strip_nulls=False)
```

<pre><font color="red">county_data.py:138: RuntimeError: Colorado_Counties-err3.csv : 
    Unable to read data file at line 8 [TERMINATING]</font></pre>
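The two behaviors shown above could be sketched as a pre-processing pass over the lines before they reach the CSV parser. The function name `clean_lines()` is hypothetical; the `strip_nulls` keyword follows the examples above, and the sample lines are invented.

```python
import warnings


def clean_lines(lines, filename='<data>', strip_nulls=False):
    """Remove NULL characters from lines, or fail at the first one found."""
    cleaned = []
    found = False
    for num, line in enumerate(lines, start=1):
        if '\x00' in line:
            if not strip_nulls:
                raise RuntimeError(
                    f'{filename} : Unable to read data file at line {num}')
            found = True
            line = line.replace('\x00', '')
        cleaned.append(line)
    if found:
        warnings.warn(f'{filename} : NULL bytes detected and removed',
                      RuntimeWarning)
    return cleaned


# Invented sample with a NULL in place of the final 'r' in "Boulder"
sample = ['name,state\n', 'Boulde\x00,CO\n']
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    cleaned = clean_lines(sample, 'sample.csv', strip_nulls=True)
```

Note that stripping the NULL does not restore the lost character: the county comes back as "Boulde", not "Boulder". Whether that counts as recovered data or as a second, quieter corruption is part of the design decision.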

<img src='img/copyright.png'>