# Exercises due by EOD 2017.11.11 (*a Monday*)

## goal

in this homework assignment we will focus on interacting with the `nosql` databases `dynamodb` and `neo4j`.

## method of delivery

as mentioned in our first lecture, the method of delivery may change from assignment to assignment. we will include this section in every assignment to provide an overview of how we expect homework results to be submitted, and to provide background notes or explanations for "new" delivery concepts or methods.

this week you will be submitting the results of your homework via upload to your own personal `s3` homework submission bucket.

summary:

| exercise | deliverable                        | method of delivery                  |
|----------|------------------------------------|-------------------------------------|
| 1        | `python` file `delays.py`          | upload to your `s3` homework bucket |
| 2        | `cypher` file `load_routes.cypher` | upload to your `s3` homework bucket |
| 2        | `cypher` file `analysis.cypher`    | upload to your `s3` homework bucket |

# exercise 1: using `dynamodb` to quantify airport delays

for about the last month I have been persisting the results of our beloved FAA `json` api (note: [it's moving!](https://services.faa.gov/)). let's pull records and calculate the proportion of time different airports report delays.

## 1.1: the schema

the primary key for this `dynamodb` table has the following two pieces:

1. hash key: `IATA`, the three-letter US airport code
2. range key: `Timestamp`, the time the `api` was called

a single record in this database will look like:

```json
{
  "city": "Washington",
  "delay": "false",
  "IATA": "DCA",
  "ICAO": "KDCA",
  "name": "Ronald Reagan Washington National",
  "state": "District of Columbia",
  "status": {
    "reason": "No known delays for this airport."
  },
  "Timestamp": "2017-11-07T21:20:02.467789",
  "weather": {
    "meta": {
      "credit": "NOAA's National Weather Service",
      "updated": "3:52 PM Local",
      "url": "http://weather.gov/"
    },
    "temp": "44.0 F (6.7 C)",
    "visibility": "3.0",
    "weather": "Rain Fog/Mist",
    "wind": "North at 13.8mph"
  }
}
```
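
one detail in that record is easy to miss: `delay` is stored as the *string* `"false"`, not as a boolean. the short sketch below (using a trimmed copy of the sample record, nothing from the live table) shows why a bare truth test gives the wrong answer and a string comparison gives the right one.

```python
import json

# a trimmed copy of the sample record above (only the fields we need here)
raw = '{"IATA": "DCA", "delay": "false", "Timestamp": "2017-11-07T21:20:02.467789"}'
record = json.loads(raw)

# any non-empty string is truthy in python, so this wrongly reports a delay
naive_is_delayed = bool(record["delay"])

# comparing against the string "true" gives the intended answer
is_delayed = record["delay"] == "true"

print(naive_is_delayed, is_delayed)  # True False
```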

## 1.2: querying counts

for each of our different `dynamodb` lookup methods, it's helpful to remember that each query is executed slightly differently. if we break it down by speed, the ability to filter or range results before returning, and the number of records read by `dynamodb` while the query is executing, we have:

| type of lookup | speed     | filtering or ranging? | number of records seen by `dynamodb`    |
|----------------|-----------|-----------------------|-----------------------------------------|
| `get_item`     | immediate | no                    | 1                                       |
| `query`        | faster    | yes                   | the size of the primary key's partition |
| `scan`         | slowest   | yes                   | the size of the database                |

in addition to the above, it's also worth knowing that `dynamodb` implements one (and really only one) aggregation function: `COUNT`. for `query` and `scan` lookups, `COUNT` is one of the values the `Select` parameter can take (that is, you can select all attributes, select certain attributes, or select the count of records). if you stipulate a `Select` value of `COUNT`, `dynamodb` will read as many records as are needed (see the last column in the table above), apply any filters, and then return not the records themselves but the count of the records that would have been returned.

we can use that to our advantage here. if we `query` our table for a particular airport code and use the `COUNT` selection method, we can get

1. the count of all records for that IATA, or
2. the count of all records for that IATA satisfying a certain criterion

one immediate and easy criterion: whether or not the `delay` string has a value of `"true"`. as a bonus, if we manage to construct a query which can select for us those filtered records, `dynamodb`'s response to us will include *both* the number of those filtered records *and* the number of records it had to scan to figure that out (which is all of them).
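
to make the "two numbers from one query" idea concrete, here is a toy, in-memory sketch of a counting query. it is not `boto3` and not how `dynamodb` is implemented; the table contents are made up, and the result names `matched` and `examined` are hypothetical stand-ins, not the keys of the real response.

```python
# toy stand-in for the table: (IATA, Timestamp) -> record. all values made up.
TABLE = {
    ("DCA", "2017-11-07T21:00"): {"delay": "false"},
    ("DCA", "2017-11-07T21:10"): {"delay": "true"},
    ("DCA", "2017-11-07T21:20"): {"delay": "false"},
    ("ORD", "2017-11-07T21:00"): {"delay": "true"},
}


def query_count(iata, keep):
    """mimic a counting `query`: read one partition, then apply a filter."""
    # the key condition picks out a single partition (every record for one IATA)
    partition = [rec for (code, _), rec in TABLE.items() if code == iata]
    # the filter runs *after* the read, so every partition record is examined
    matched = [rec for rec in partition if keep(rec)]
    return {"matched": len(matched), "examined": len(partition)}


result = query_count("DCA", lambda rec: rec["delay"] == "true")
print(result)  # {'matched': 1, 'examined': 3}
```

the `ORD` record is never touched, which is the difference between a `query` and a `scan`.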


## 1.3: using `boto3` to build `dynamodb` key and filter expressions

`boto3` has built-in functions for building `dynamodb` conditions, and they are generally preferred to the hard-coded versions with `#attrname` and `:attrvalue` strings and `ExpressionAttributeNames` and `ExpressionAttributeValues` mapping dictionaries.

for example, suppose we had a table of movies with `year` as the primary (hash) key and a `title` attribute. to filter those records down to movies which were made in 2001 and whose titles sort between `A` and `L`, either of the following would work:

```python

# hard-coded string and name/value lookup dictionary method
# (`year` is a dynamodb reserved word, so it needs a `#name` placeholder)
movietable.query(
    KeyConditionExpression="#yr = :myyear",
    FilterExpression="title between :letter0 and :letter1",
    ExpressionAttributeNames={'#yr': 'year'},
    ExpressionAttributeValues={
        ':myyear': 2001,
        ':letter0': 'A',
        ':letter1': 'L'
    }
)

# object-oriented method
from boto3.dynamodb.conditions import Key, Attr

movietable.query(
    KeyConditionExpression=Key('year').eq(2001),
    FilterExpression=Attr('title').between('A', 'L')
)
```

aside from being more compact, this second method requires less visual jumping around while debugging or reading. let's use it.
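
one subtlety about `between` on strings: it is an inclusive comparison of whole values, not of first letters. the sketch below uses plain `python` string comparison as a stand-in for `dynamodb`'s comparison (an assumption that should hold for simple ASCII titles like these); the helper name `between` and the title list are just for illustration.

```python
def between(value, low, high):
    """inclusive range test, the same shape as a `between` condition."""
    return low <= value <= high


titles = ["A Beautiful Mind", "Donnie Darko", "L", "Legally Blonde", "Monsters, Inc."]

# "Legally Blonde" sorts *after* the bare string "L", so it is filtered out
kept = [t for t in titles if between(t, "A", "L")]
print(kept)  # ['A Beautiful Mind', 'Donnie Darko', 'L']

# widening the upper bound to "M" keeps every title that starts with L
kept_wide = [t for t in titles if between(t, "A", "M")]
print(kept_wide)  # ['A Beautiful Mind', 'Donnie Darko', 'L', 'Legally Blonde']
```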

## 1.4: accessing the `dynamodb` database

as we discussed in lecture, `dynamodb` is a 100%-`iam`-authenticated service. to access material in my `AirportWeather` `dynamodb` table, you need to inhabit an `iam` `role` or act as an `iam` `user`. I created a special `iam` `user` named `dynamodb_airportweather_readonly` for this specific purpose.

this `user` has access keys which can be used by the `boto3.session.Session` object. I have uploaded those values as a credentials `csv` to all of your individual `s3` buckets. you can acquire those `csv` files via the command

```
aws s3 cp s3://2017.fall.gu511.{GUID}/dynamodb_airportweather_readonly.csv .
```
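
if you would rather read the key values with `python` than copy them by hand, something like the following sketch may help. the column headers used here (`Access key ID`, `Secret access key`) are an assumption about the layout of the credentials `csv`, and the values are fake placeholders; open your own file and check its header row before relying on this.

```python
import csv
import io

# stand-in for the downloaded file: header names and values are assumptions
FAKE_CSV = (
    "User name,Access key ID,Secret access key\n"
    "dynamodb_airportweather_readonly,AKIAEXAMPLEKEYID,exampleSecretValue\n"
)


def read_credentials(fileobj):
    """return (access key id, secret access key) from the first data row."""
    row = next(csv.DictReader(fileobj))
    return row["Access key ID"], row["Secret access key"]


key_id, secret = read_credentials(io.StringIO(FAKE_CSV))
print(key_id)  # AKIAEXAMPLEKEYID
```

with the real file you would pass `open('dynamodb_airportweather_readonly.csv', newline='')` in place of the `io.StringIO` object.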


## 1.5: calculating fractional time delayed

fill in the `python` script below to access count information for delays for a single airport. save your final result to a file named `delays.py`. you will need the access key id and secret access key from the `csv` in the previous section.

```python
import boto3
import getpass

from boto3.dynamodb.conditions import Key, Attr

SESSION = boto3.session.Session(
    aws_access_key_id=getpass.getpass('aws access key id: '),
    aws_secret_access_key=getpass.getpass('aws secret access key: '),
    region_name='us-east-1'
)

def delay_cts(iata):
    # create a dynamodb resource
    # ------------- #
    # FILL ME IN!!! #
    # ------------- #
    
    # use the dynamodb resource to access the AirportWeather table
    # (call it aw)
    # ------------- #
    # FILL ME IN!!! #
    # ------------- #
    
    # create a key condition expression which asserts that the key
    # `IATA` must have the value of our function's parameter, `iata`.
    # name this condition `kce`
    # ------------- #
    # FILL ME IN!!! #
    # ------------- #
    
    # create a filter expression which asserts that the attribute
    # `delay` must have the (string) value `"true"`. name this 
    # condition `fe`
    # ------------- #
    # FILL ME IN!!! #
    # ------------- #
    
    # query the table, selecting only the count of matching records
    response = aw.query(
        Select="COUNT",
        KeyConditionExpression=kce,
        FilterExpression=fe
    )
    
    # the `response` object should have a key-value pair for both
    # the total number of items scanned for this IATA and also for
    # the number of items which had a delay. get these two values
    # and return them along with the IATA code as a dictionary
    return {
        'iata': iata,
        'total_count': None,  # <-- REPLACE `None` !!!
        'delay_count': None,  # <-- REPLACE `None` !!!
    }
```

## 1.6: check your numbers

as time goes on these numbers will necessarily increase, and the proportions may change. for an idea of the reasonable scale of these values, however, my implementation sees the following:

```python
delay_cts('DCA')
delay_cts('ORD')
```

yields

```python
{'iata': 'DCA', 'total_count': 2913, 'delay_count': 28}
{'iata': 'ORD', 'total_count': 3081, 'delay_count': 47}
```
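
turning those counts into the "fraction of time delayed" is one division. a minimal sketch, using the two sample dictionaries above as input (the helper name `delay_fraction` is just for illustration and is not part of the required `delays.py`):

```python
def delay_fraction(counts):
    """fraction of api calls for one airport that reported a delay."""
    if counts["total_count"] == 0:
        return 0.0  # no records yet; avoid dividing by zero
    return counts["delay_count"] / counts["total_count"]


samples = [
    {"iata": "DCA", "total_count": 2913, "delay_count": 28},
    {"iata": "ORD", "total_count": 3081, "delay_count": 47},
]
for counts in samples:
    print(counts["iata"], round(delay_fraction(counts), 4))
# DCA 0.0096
# ORD 0.0153
```

so in this sample both airports reported delays somewhere between 1% and 2% of the time.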

## 1.7: submitting your work

##### upload your version of `delays.py` to your `s3` homework submission bucket

# exercise 2: building an airplane flight network

how easy is it to get from DCA to your home? what is the airport furthest from DCA?

flight networks are an obvious area for graph model applications. let's look into the network of airports and flights between them.

## 2.1: getting a `neo4j` sandbox

head on over to [the `neo4j` sandbox](https://neo4j.com/sandbox-v2/) and set up a *blank* sandbox.

you've done this several times now, so refer to the `neo4j` notes if you need help.


## 2.2: loading the data

go check out the data available at [OpenFlights.org](https://openflights.org/data.html). this open data set contains information about airports and flight routes (as well as airlines, flight schedules, ferry and train terminals).

the data happens to be [hosted on `github`](https://github.com/jpatokal/openflights) to boot. in particular, we have `csv` data dumps for

+ [airports](https://github.com/jpatokal/openflights/blob/master/data/airports.dat?raw=true)
+ [routes](https://github.com/jpatokal/openflights/blob/master/data/routes.dat?raw=true)

we are going to use the `LOAD CSV` clause to load both of these datasets. our end goal is a schema [like this](https://s3.amazonaws.com/shared.rzl.gu511.com/airport_schema.png)

*big hat tip to github user johnymontana, whose gist [here](https://gist.github.com/johnymontana/45009185d59c24e08cb4f3f8053546e5) inspired this exercise*


### 2.2.1: loading city and airport information

we'll walk through this step to get familiar with some common `neo4j` setup.

first, let's create some *constraints* (rules that transactions into this database must obey in order to be valid). in particular, let's make sure that

+ countries can be uniquely identified by their names
+ cities can be uniquely identified by their names
+ airports can be uniquely identified by their OpenFlights.org id

we do this using the `CREATE CONSTRAINT` clause:

```cypher
// create constraints on our three types of nodes
CREATE CONSTRAINT ON (c:Country) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;
CREATE CONSTRAINT ON (a:Airport) ASSERT a.id IS UNIQUE;
```

next, we can unpack the records in the `csv` using the `LOAD CSV FROM ... AS` clause. then, because the `csv` doesn't have headers, we can create aliases for elements using the `WITH` clause.

once we have unpacked individual records this way (that is, iterated through lines and assigned their values to variable names), we will look to create *five* things:

1. we `MERGE` (create or match if already created) an airport node
2. we `MERGE` a city node
3. we `MERGE` a country node
4. we `create` an `IS_IN` relationship between airports and cities
5. we `create` an `IS_IN` relationship between cities and countries
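
the difference between `MERGE` and `CREATE` matters for what ends up in the graph. the toy `python` model below is only an analogy (a set and a list standing in for `neo4j`'s stores, with made-up place names), but it shows the behavior to expect from the five steps above when two airports share a city.

```python
nodes = set()        # stands in for the node store: (label, name) pairs
relationships = []   # stands in for the edge store: (start, type, end)


def merge_node(label, name):
    """MERGE-like: reuse the node if it already exists, else create it."""
    node = (label, name)
    nodes.add(node)  # adding to a set is a no-op when the node is present
    return node


def create_relationship(start, rel_type, end):
    """CREATE-like: always adds a new relationship, even a duplicate."""
    relationships.append((start, rel_type, end))


# two made-up airports that share a city and a country
rows = [
    ("Airport One", "Exampleville", "Exampland"),
    ("Airport Two", "Exampleville", "Exampland"),
]
for airport, city, country in rows:
    a = merge_node("Airport", airport)
    c = merge_node("City", city)
    t = merge_node("Country", country)
    create_relationship(a, "IS_IN", c)
    create_relationship(c, "IS_IN", t)

print(len(nodes), len(relationships))  # 4 4
```

the two airports share one city node and one country node, but the city-to-country relationship is stored twice, once per row. if duplicates like that matter for your analysis, using `MERGE` on the relationship pattern instead of `CREATE` is one option.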

altogether, this looks like the following:

```cypher
LOAD CSV FROM "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat" AS row
WITH toInteger(row[0]) AS id,
     row[1] AS name,
     row[2] AS city,
     row[3] AS country,
     // everything below here is optional; we'll do it for the fun of it
     row[4] AS code,
     row[5] AS ICAO,
     coalesce(row[4], row[5]) AS normalizedCode,
     toFloat(row[6]) AS latitude,
     toFloat(row[7]) AS longitude,
     toFloat(row[8]) AS altitude,
     toInteger(row[9]) AS timezone
MERGE (a:Airport {id: id})
    ON CREATE SET a.name = name,
                  a.city = city,
                  a.country = country,
                  a.code = code,
                  a.ICAO = ICAO,
                  a.latitude = latitude,
                  a.longitude = longitude,
                  a.altitude = altitude,
                  a.timezone = timezone
MERGE (c:City {name: city})
MERGE (t:Country {name: country})
CREATE (a)-[:IS_IN]->(c)
CREATE (c)-[:IS_IN]->(t)
```
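
if the positional `row[0]`, `row[1]`, ... unpacking feels opaque, here is roughly the same step written in `python`. the input line is invented for illustration (it follows the column order used in the query above but is not a real row from `airports.dat`).

```python
import csv
import io

# one made-up line in the same column order as the query above (not real data)
LINE = '9999,"Example Field","Exampleville","Exampland","EXF","KEXF",12.5,-45.25,100,-5\n'

# csv.reader handles the quoting, just as LOAD CSV does
row = next(csv.reader(io.StringIO(LINE)))

# name each position and cast it, mirroring the WITH clause
airport = {
    "id": int(row[0]),
    "name": row[1],
    "city": row[2],
    "country": row[3],
    "code": row[4],
    "ICAO": row[5],
    "latitude": float(row[6]),
    "longitude": float(row[7]),
    "altitude": float(row[8]),
    "timezone": int(row[9]),
}
print(airport["id"], airport["code"], airport["latitude"])  # 9999 EXF 12.5
```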


### 2.2.2: loading the route information

we will use many of the same clauses to load the contents of the [routes data](https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat) to our database. write a `cypher` query that does the following:

1. use the `LOAD CSV` clause to load `csv` records into an alias `route`
2. unpack the record into:
    1. `srcApt`: the source airport code
    2. `srcAptId`: the source airport id number
    3. `dstApt`: the destination airport code
    4. `dstAptId`: the destination airport id number
3. filter down to records `where` both `srcAptId` and `dstAptId` are not `null`
4. create three things:
    1. `merge` the source airport
    2. `merge` the destination airport
    3. `create` a `ROUTE_TO` relationship from source to destination
    
**note**: for the airport id values above, you will need to use the `toInteger` function to cast strings to integers


```cypher
// load routes
LOAD CSV // FILL THIS IN !!!
WITH // FILL THIS IN !!!
MERGE // FILL THIS IN !!!
MERGE // FILL THIS IN !!!
CREATE // FILL THIS IN !!!
```

take the contents of that script above and save them to a file named `load_routes.cypher`


## 2.3: a few useful pieces of information

for each of the following questions, write a `cypher` query to answer it:

1. find the number of airports in our entire database
    + *hint: check out [the `count` function](https://neo4j.com/docs/developer-manual/current/cypher/functions/aggregating/#functions-count)*
2. find the number of airports in Canada
3. create a table listing countries and the number of airports in that country
    + we want the equivalent of `SELECT country, count(*) FROM airports GROUP BY country`
    + *hint: there is no `group by` in `cypher`; simply `return` a value alongside an aggregation function such as `count` and the grouping is implicit. read more [here](https://neo4j.com/docs/developer-manual/current/cypher/functions/aggregating/)*
4. identify which airport has the most **outbound** flights and how many that is
    + *hint*: there are at least two ways you can do this:
        1. `match` airport nodes and use the `size` scalar function to count outbound edges matching a pattern (see [`size` of pattern expressions](https://neo4j.com/docs/developer-manual/current/cypher/functions/scalar/#functions-size-of-pattern-expression))
        2. `match` outbound flight patterns, use [the `count` aggregate function](https://neo4j.com/docs/developer-manual/current/cypher/functions/aggregating/#functions-count), and then use [the `max` aggregate function](https://neo4j.com/docs/developer-manual/current/cypher/functions/aggregating/#functions-max)
    + in either case you will need some combination of `order by`, `limit`, or `where` clauses to finish
5. identify which airport has the most **inbound** flights and how many that is
    + *hint*: see the previous question
6. identify (by airport code) which airports have direct flights from DCA
7. for the above, how many is that?
8. identify (by airport code) which airports are exactly two hops from DCA, excluding DCA itself
9. for the above, how many is that?
10. which 5 airports are furthest from DCA?
    + *hint*:
        1. `match` all airports that are *not* DCA, and then
        2. use the [`shortestpath` function](https://neo4j.com/docs/developer-manual/current/cypher/execution-plans/shortestpath-planning/) in a second `match` to identify (one of) the shortest path(s) from there to DCA, and then
        3. `order` those paths by [the `length` scalar function](https://neo4j.com/docs/developer-manual/current/cypher/functions/scalar/#functions-length), which tells you how many edges are in each matched path, and
        4. `limit` to the number of paths you want (here, 5)
11. get the top 10 airports in terms of PageRank
    + *hint*: use the [`algo.pageRank.stream` procedure](https://neo4j-contrib.github.io/neo4j-graph-algorithms/)
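
if the graph ideas behind a few of these questions are unfamiliar, the `python` sketch below works them out on a tiny made-up network (the airport codes and routes are invented, and this is plain `python`, not `cypher`). it shows implicit grouping as a per-airport count, and "furthest" as the largest number of hops on a shortest path found by breadth-first search.

```python
from collections import Counter, deque

# made-up directed routes between made-up airport codes
ROUTES = [("AAA", "BBB"), ("AAA", "CCC"), ("BBB", "CCC"), ("CCC", "DDD"), ("DDD", "AAA")]

# grouping: count routes per source airport, then take the largest group
outbound = Counter(src for src, _ in ROUTES)
busiest, n_routes = outbound.most_common(1)[0]


def hops_from(start):
    """breadth-first search: fewest route hops from `start` to each airport."""
    neighbors = {}
    for src, dst in ROUTES:
        neighbors.setdefault(src, []).append(dst)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        here = queue.popleft()
        for there in neighbors.get(here, []):
            if there not in hops:  # first visit is always the shortest
                hops[there] = hops[here] + 1
                queue.append(there)
    return hops


hops = hops_from("AAA")
furthest = max(hops, key=hops.get)
print(busiest, n_routes, furthest, hops[furthest])  # AAA 2 DDD 2
```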

fill in the template below and save the results to a file called `analysis.cypher`

```cypher
// 1. how many airports are there in our database?
// FILL THIS IN !!!

// 2. how many airports are there in Canada?
// FILL THIS IN !!!

// 3. create a table of countries and the number of airports in those countries
// FILL THIS IN !!!

// 4. which airport has the most flights *out of* it, and how many is that?
// FILL THIS IN !!!

// 5. which airport has the most flights *into* it, and how many is that?
// FILL THIS IN !!!

// 6. what are the DCA direct connections?
// FILL THIS IN !!!

// 7. how many is that?
// FILL THIS IN !!!

// 8. what airports are exactly two hops from DCA (excluding DCA itself)?
// FILL THIS IN !!!

// 9. how many is that?
// FILL THIS IN !!!

// 10. what are the 5 farthest airports from DCA?
// FILL THIS IN !!!

// 11. get the top 10 airports in terms of their page rank
// FILL THIS IN !!!
```

## 2.4: delivering results

##### upload `load_routes.cypher` and `analysis.cypher` to your `s3` bucket