<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Hive EMR bikedata

---

For this lab we will use data from the [Bay Area Bike Share Open Data Website](http://www.bayareabikeshare.com/open-data).

We have downloaded part of it and made it available in an S3 bucket. Each trip is anonymised and includes:

- Bike number
- Trip start day and time
- Trip end day and time
- Trip start station
- Trip end station
- Rider type – Annual or Casual (24-hour or 3-day member)

If it is an annual member trip, it will also include the member’s home zip code.

The data set also includes:

- Weather information per day per service area
- Bike and dock availability per minute per station

## Exercise 1: Spin up EMR cluster

Log in to your AWS account, open the EMR console, and start a cluster with the following properties:
- Amazon EMR 4.7.0 Core Hadoop
- 3 instances (you can use m1.medium)

If the cluster from the last lesson is still available, reuse it; otherwise repeat the steps from that lesson to create a new one.

## Exercise 2: Log into HUE

Connect to HUE on the master node as you have learned to do (using SSH tunnelling and FoxyProxy).

## Exercise 3: Import Data and create tables

- Download the zip file provided in the S3 bucket `galdn-dsi3-babsdata`.
- Import the file into the Hadoop file system (HDFS), either via the command line or directly using HUE.
- Import the files uploaded to HDFS into the Hive database. Check that the tables are present and that all their content has been loaded correctly.
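
One way to run the check in the last step is with a few queries in the Hive editor. This is only a sketch: it assumes the tables were created under the names used in the later exercises, `201408_trip_data` and `201408_station_data`, so adjust the names if yours differ. The alias `n_rows` is illustrative.

```SQL
-- List the tables in the current database; both tables should appear.
SHOW TABLES;

-- Inspect the column names and types chosen at import time.
DESCRIBE `201408_trip_data`;
DESCRIBE `201408_station_data`;

-- Compare the row counts against the number of data lines in the source files.
SELECT COUNT(*) AS n_rows FROM `201408_trip_data`;
SELECT COUNT(*) AS n_rows FROM `201408_station_data`;

-- Look at a few rows to confirm that values landed in the right columns.
SELECT * FROM `201408_trip_data` LIMIT 5;
```

A count that is one higher than the number of data lines in the file may indicate that the header line was loaded as a row.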

## Exercise 4: Top start stations

Let's start with some exploratory analysis. The trip table records every trip, so we can use it to find the 10 most popular start stations.

- Formulate a Hive query to retrieve the top 10 start stations, sorted by trip count in descending order.
    - Try this in the HUE interface.
    - Also try it in the terminal, over an SSH session to the master EC2 instance.

```SQL
SELECT `start terminal`, `start station`, count(*) as count  FROM `201408_trip_data`
GROUP BY `start terminal`, `start station`
ORDER BY count DESC
LIMIT 10;
```


| |startterminal|startstation|count|
|---|---|---|---|
|0|70|San Francisco Caltrain (Townsend at 4th)|12950|
|1|50|Harry Bridges Plaza (Ferry Building)|8336|
|2|60|Embarcadero at Sansome|7010|
|3|69|San Francisco Caltrain 2 (330 Townsend)|7008|
|4|61|2nd at Townsend|6824|
|5|77|Market at Sansome|6819|
|6|55|Temporary Transbay Terminal (Howard at Beale)|6540|
|7|74|Steuart at Market|6238|
|8|65|Townsend at 7th|5479|
|9|76|Market at 4th|5241|

## Exercise 5: Chart of top start stations

Use the Chart tab in Hue to generate a chart of the results, sorting them with the most popular on the left.

![](../assets/images/popular_stations.png)

Question:
- What was the most popular start station (the station with the highest count)?
> San Francisco Caltrain (Townsend at 4th) was by far the most common start station.

## Exercise 6: Top destinations + Map 

For trips starting from the most popular station, determine which end stations were the most popular.

- Fetch the latitude and longitude of the end stations for trips starting from the most popular start station. To do this you will need to join the trip and station tables.

- Return a table that contains the top 10 most common destinations with the following fields:
    - station_id
    - name
    - lat
    - long
    - count (number of trips between the most popular start and that station)
    
- Try this in the terminal and in HUE. 

- Display the results using the map chart. Note that you can decide what label to assign to the points.

```SQL
SELECT 
    s.station_id, 
    s.name, 
    s.lat, 
    s.long, 
    COUNT(*) AS count 
FROM `201408_trip_data` t 
JOIN `201408_station_data` s ON s.station_id = t.`end terminal` 
WHERE t.`start terminal` = 70 
GROUP BY s.station_id, s.name, s.lat, s.long 
ORDER BY count DESC LIMIT 10;
```

| |s.station_id|s.name|s.lat|s.long|count|
|---|---|---|---|---|---|
|0|55|Temporary Transbay Terminal (Howard at Beale)|37.789756774902344|-122.39464569091797|929|
|1|77|Market at Sansome|37.789623260498047|-122.40081024169922|915|
|2|74|Steuart at Market|37.794139862060547|-122.39443206787109|826|
|3|51|Embarcadero at Folsom|37.791465759277344|-122.39103698730469|806|
|4|50|Harry Bridges Plaza (Ferry Building)|37.795391082763672|-122.39420318603516|749|
|5|68|Yerba Buena Center of the Arts (3rd @ Howard)|37.784877777099609|-122.40101623535156|670|
|6|65|Townsend at 7th|37.77105712890625|-122.40271759033203|554|
|7|63|Howard at 2nd|37.786979675292969|-122.39810943603516|552|
|8|42|Davis at Jackson|37.797279357910156|-122.3984375|468|
|9|61|2nd at Townsend|37.780525207519531|-122.39028930664062|461|


![](../assets/images/map.png)
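
The solution above hard-codes terminal 70, taken from the result of Exercise 4. As a sketch of an alternative, the most popular start terminal can be computed in a subquery and joined in, so the query still works on a different extract of the data. The aliases `top_start`, `terminal`, `n` and `trips` are illustrative names, not part of the data set.

```SQL
SELECT
    s.station_id,
    s.name,
    s.lat,
    s.long,
    COUNT(*) AS trips
FROM `201408_trip_data` t
-- Derived table with a single row: the start terminal with the most trips.
JOIN (
    SELECT `start terminal` AS terminal, COUNT(*) AS n
    FROM `201408_trip_data`
    GROUP BY `start terminal`
    ORDER BY n DESC
    LIMIT 1
) top_start ON t.`start terminal` = top_start.terminal
-- Look up the name and coordinates of each end station.
JOIN `201408_station_data` s ON s.station_id = t.`end terminal`
GROUP BY s.station_id, s.name, s.lat, s.long
ORDER BY trips DESC
LIMIT 10;
```

If two terminals tie for first place, `LIMIT 1` keeps only one of them, so the hard-coded version is easier to reason about for this lab.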

### Bonus: Hourly data

Dig further into the trip data for the most popular station to find the total number of trips and average duration (in minutes) of those trips, grouped by hour.

You will need to do a nested query on the trip table:

- The inner query should parse the `start date` column and return just the hour as an integer, for trips originating at the most popular start station.

- The outer query should count such trips and calculate the average duration grouped and sorted by hour. 


**Hints:**

- Be careful with trips that have no duration information.

- Return a table with the following fields:
    - hour
    - number of trips
    - average duration
    
- Display the results in two charts, one showing the number of trips and one showing the average duration, each as a function of the hour.
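
To see why the first hint matters, it can help to count how many trips from the station lack a duration. This is a minimal sketch that assumes missing durations were loaded as `NULL` and uses terminal 70 from Exercise 4. The column aliases are illustrative.

```SQL
SELECT
    COUNT(*) AS all_trips,                               -- every row for the station
    COUNT(duration) AS trips_with_duration,              -- only rows where duration is not NULL
    COUNT(*) - COUNT(duration) AS trips_missing_duration
FROM `201408_trip_data`
WHERE `start terminal` = 70;
```

`AVG` already skips `NULL` values, so filtering them out mainly affects the trip count: without the filter, trips with no duration would still be counted in each hour.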

```SQL
SELECT
    hour,
    COUNT(1) AS trip,
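    ROUND(AVG(duration) / 60) AS avg_duration  -- dividing by 60 gives minutes
FROM (
    -- Inner query: hour of day and duration for trips starting at terminal 70
    SELECT
        hour(t.`start date`) AS hour,
        t.duration AS duration
    FROM `201408_trip_data` t
    WHERE
        t.`start terminal` = 70
        AND
        t.duration IS NOT NULL
) r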
GROUP BY hour
ORDER BY hour ASC;
```

| |hour|trip|avg_duration|
|---|---|---|---|
|0|0|26|8|
|1|1|14|49|
|2|2|7|93|
|3|3|1|21|
|4|4|1|4|
|5|5|41|52|
|6|6|545|11|
|7|7|2126|12|
|8|8|3273|12|
|9|9|1702|14|
|10|10|481|12|
|11|11|306|22|
|12|12|235|20|
|13|13|205|30|
|14|14|181|19|
|15|15|211|21|
|16|16|347|18|
|17|17|1008|12|
|18|18|1029|11|
|19|19|684|10|
|20|20|229|9|
|21|21|130|28|
|22|22|113|9|
|23|23|55|10|

![](../assets/images/trips_by_hour.png)
![](../assets/images/average_duration.png)