# Part 2 - Refine Data

The second step in analyzing the data is to perform some additional preparation and enrichment. While the first step, storing the data in the structured zone, should mainly be a technical conversion that loses no information, this next step integrates some of the data and also preaggregates the weather data to simplify working with it.

# 0 Prepare Python Environment

## 0.1 Spark Session

In [None]:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession


# Create a local Spark session unless one already exists in this notebook
if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "64G") \
        .getOrCreate()
    
spark.version

## 0.2 Matplotlib

In [None]:
%matplotlib inline
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters


register_matplotlib_converters()

# 1 Read Taxi Data

Now we can read in the taxi data from the structured zone.

## 1.1 Trip Data

Let us load the NYC Taxi trip data from the Hive table `taxi.trip` and let us display the first 10 records.

In [None]:
trip_data = # YOUR CODE HERE
trip_data.limit(10).toPandas()

Just to be sure, let us inspect the schema. It should match the specified schema exactly.

In [None]:
trip_data.printSchema()

## 1.2 Fare information

Now we read in the second Hive table `taxi.fare`, which contains the fare information for each trip.

In [None]:
fare_data = # YOUR CODE HERE
fare_data.limit(10).toPandas()

In [None]:
fare_data.printSchema()

## 1.3 Join datasets

We can now join both the trip information and the fare information together in order to get a complete picture. Since the trip records do not contain a technical unique key, we use the following columns as the composite primary key of each trip:
* medallion
* hack_license
* vendor_id
* pickup_datetime

Finally, the result is stored in the refined zone, in the Hive table `refined.taxi_trip`.
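The semantics of joining on a composite key can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's solution: the sample records and the value columns `trip_distance` and `total_amount` are hypothetical and chosen only for illustration. It shows that two records belong together only if all four key columns match, and that an inner join drops trips without a matching fare record.

```python
# Columns that together identify a trip (taken from the list above).
KEY_COLUMNS = ("medallion", "hack_license", "vendor_id", "pickup_datetime")

# Hypothetical sample records; the value columns are illustrative only.
trips = [
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "CMT",
     "pickup_datetime": "2013-01-01 15:11:48", "trip_distance": 1.0},
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "CMT",
     "pickup_datetime": "2013-01-01 16:30:00", "trip_distance": 2.5},
    {"medallion": "M2", "hack_license": "H2", "vendor_id": "VTS",
     "pickup_datetime": "2013-01-01 15:11:48", "trip_distance": 4.0},
]
fares = [
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "CMT",
     "pickup_datetime": "2013-01-01 15:11:48", "total_amount": 7.0},
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "CMT",
     "pickup_datetime": "2013-01-01 16:30:00", "total_amount": 12.5},
]


def composite_key(record):
    """Build the tuple of key values that identifies a trip."""
    return tuple(record[column] for column in KEY_COLUMNS)


# Index the fare records by their composite key for fast lookup.
fares_by_key = {composite_key(fare): fare for fare in fares}

# Inner join: keep only trips that have a fare record with the same key.
joined = [
    {**trip, **fares_by_key[composite_key(trip)]}
    for trip in trips
    if composite_key(trip) in fares_by_key
]

for row in joined:
    print(row)
```

In PySpark, passing a list of column names as the `on` argument of `DataFrame.join` performs the same kind of equality join and keeps each key column only once in the result.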

In [None]:
# Create Hive database 'refined'
# YOUR CODE HERE

In [None]:
# Join trip_data with fare_data using the columns "medallion", "hack_license", "vendor_id", "pickup_datetime"
taxi_trips = # YOUR CODE HERE

In [None]:
# Save taxi_trips into the Hive table "refined.taxi_trip" using "parquet" file format
# YOUR CODE HERE

### Read from Refined Zone

In [None]:
taxi_trips = spark.read.table("refined.taxi_trip")
taxi_trips.limit(10).toPandas()

Let us have a look at the schema of the refined table.

In [None]:
taxi_trips.printSchema()

Let us count the number of records in the table.

In [None]:
# YOUR CODE HERE

# 2 Weather Data

The weather data also requires some additional preprocessing, especially because we want to join other data against it. The main problem with the raw measurements is that they may be taken at different time intervals, and not every measurement contains every metric. Therefore we preaggregate the weather data to hourly and daily measurements, which can be used directly for joining.

## 2.1 Weather Data

We already have weather data, but only as individual measurements. We do not know how many measurements there are per hour and per day, so the raw table is not very usable for joining. Instead we would like to have an hourly and a daily weather table containing the average temperature, wind speed and precipitation. Since we are only interested in the year 2013, we load only that specific year.

In [None]:
weather = spark.read.table("isd.weather").where(f.col("year") == 2013)
weather.limit(10).toPandas()

## 2.2 Calculate derived metrics and preaggregate data

In order to simplify joining against weather data, we now preaggregate weather measurements to a single record per weather station and hour or per day.

### Hourly Preaggregation

For the hourly aggregation, we need the following columns, grouping and metrics:
* `date` - day of the measurements. The day can be extracted from the timestamp column `ts` by using the Spark function `to_date` (available in the imported module `f`)
* `hour` - hour of the measurements. The hour can be extracted using the Spark function `hour`
* Grouping should be performed on the weather station IDs `usaf` and `wban` together with both extracted time columns `date` and `hour`
* For the following metrics, we are interested in the grouped averages: `wind_speed`, `air_temperature` and `precipitation_depth`

When performing the aggregation, you should ignore invalid measurements. To do this, use the PySpark function `f.when` so that a value is only aggregated when the corresponding quality flag (`wind_speed_qual` or `air_temperature_qual`) is not `9`. It is enough to pick up the valid values and let the `when` function return `NULL` for invalid ones, since `NULL` is ignored in aggregations.

For averaging the precipitation, you should also only pick values where `precipitation_hours` equals `1`.

The final DataFrame should have the following columns (you might need to specify explicit aliases):
* `usaf`
* `wban`
* `date`
* `hour` (0-23)
* `wind_speed`
* `temperature`
* `precipitation`
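The rule that invalid values become `NULL` and are then skipped by the average can be sketched in plain Python, with `None` standing in for `NULL`. This is only an illustration of the semantics, not the Spark solution: the sample measurements are made up, only `wind_speed` is handled, and storing the quality flag as an integer is an assumption.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical measurements: (usaf, wban, ts, wind_speed, wind_speed_qual).
# Assumption: the quality flag is an integer and 9 marks an invalid value.
measurements = [
    ("725053", "94728", datetime(2013, 1, 1, 10, 5), 4.0, 1),
    ("725053", "94728", datetime(2013, 1, 1, 10, 35), 999.9, 9),
    ("725053", "94728", datetime(2013, 1, 1, 10, 50), 6.0, 1),
    ("725053", "94728", datetime(2013, 1, 1, 11, 10), 999.9, 9),
]


def average_ignoring_none(values):
    """Average the non-None values; return None if there are none."""
    valid = [value for value in values if value is not None]
    return sum(valid) / len(valid) if valid else None


# Group by station, date and hour, as described in the list above.
groups = defaultdict(list)
for usaf, wban, ts, wind_speed, qual in measurements:
    key = (usaf, wban, ts.date(), ts.hour)
    # Mimics a `when` without `otherwise`: invalid values become None.
    groups[key].append(wind_speed if qual != 9 else None)

hourly_wind_speed = {
    key: average_ignoring_none(values) for key, values in groups.items()
}

for key, value in sorted(hourly_wind_speed.items()):
    print(key, value)
```

The invalid reading at 10:35 does not distort the average for hour 10, and hour 11, which has no valid reading at all, yields `None` rather than a misleading number.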

In [None]:
hourly_weather = # YOUR CODE HERE

hourly_weather.limit(10).toPandas()

### Daily Preaggregation

In addition to the hourly metrics, we also preaggregate the data to daily records. This can easily be done on top of the hourly aggregation by grouping on `usaf`, `wban` and `date`. Again we want the metrics `temperature`, `wind_speed` and `precipitation`. For the first two metrics we are interested in the average, while for precipitation we are interested in the sum (the total amount of rainfall per day).
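The difference between averaging and summing in the daily rollup can be sketched in plain Python. This is a hedged illustration with made-up hourly rows, not the Spark solution; `None` again stands in for `NULL`, and both helper functions are hypothetical names that skip missing values the way SQL aggregations do.

```python
from collections import defaultdict
from datetime import date

# Hypothetical hourly rows:
# (usaf, wban, date, hour, wind_speed, temperature, precipitation)
hourly_rows = [
    ("725053", "94728", date(2013, 1, 1), 0, 2.0, 1.0, 0.5),
    ("725053", "94728", date(2013, 1, 1), 1, 4.0, 3.0, None),
    ("725053", "94728", date(2013, 1, 1), 2, 6.0, 5.0, 1.5),
    ("725053", "94728", date(2013, 1, 2), 0, 3.0, None, None),
]


def avg_ignoring_none(values):
    """Average the non-None values; None if nothing is left."""
    valid = [value for value in values if value is not None]
    return sum(valid) / len(valid) if valid else None


def sum_ignoring_none(values):
    """Sum the non-None values; None if nothing is left."""
    valid = [value for value in values if value is not None]
    return sum(valid) if valid else None


# Group the hourly rows by station and date.
groups = defaultdict(list)
for usaf, wban, day, hour, wind_speed, temperature, precipitation in hourly_rows:
    groups[(usaf, wban, day)].append((wind_speed, temperature, precipitation))

# Average wind speed and temperature, but total the precipitation.
daily = {
    key: {
        "wind_speed": avg_ignoring_none([row[0] for row in rows]),
        "temperature": avg_ignoring_none([row[1] for row in rows]),
        "precipitation": sum_ignoring_none([row[2] for row in rows]),
    }
    for key, rows in groups.items()
}

for key, metrics in sorted(daily.items()):
    print(key, metrics)
```

Note that a day with hours missing from the precipitation column gets a total over the available hours only, so the daily sum can understate the actual rainfall.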

In [None]:
daily_weather = # YOUR CODE HERE

daily_weather.limit(10).toPandas()

### Save Preaggregated Weather

Finally we save both tables (hourly and daily weather), so we can directly reuse the data in the next steps.

In [None]:
hourly_weather.write.format("parquet").saveAsTable("refined.weather_hourly")
daily_weather.write.format("parquet").saveAsTable("refined.weather_daily")

## 2.3 Reload Data and draw Pictures

Now let us reload the data (just to make sure everything worked out nicely) and draw some pictures. We use a single station (which, by pure coincidence, is a weather station in NYC).

In [None]:
daily_weather = spark.read.table("refined.weather_daily")

In [None]:
nyc_station_usaf = "725053"
nyc_station_wban = "94728"

# Keep only the data of that weather station, order it by date and convert it to a Pandas DataFrame
pdf = # YOUR CODE HERE

### Wind Speed

The first picture will simply contain the wind speed for every day in 2013.

In [None]:
# Make a Plot
plt.figure(figsize=(16, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(pdf["date"],pdf["wind_speed"])

### Air Temperature

The next picture contains the average air temperature for every day in 2013.

In [None]:
# Make a Plot
plt.figure(figsize=(16, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(pdf["date"],pdf["temperature"])

### Precipitation

The last picture contains the precipitation for every day in 2013.

In [None]:
# Make a Plot
plt.figure(figsize=(16, 6), dpi=80, facecolor='w', edgecolor='k')
plt.plot(pdf["date"],pdf["precipitation"])