# Weather Data Analytics
This notebook performs some basic weather data analytics using the PySpark RDD interface.

## Helper Methods
First we need some helper methods for converting the raw data into something that we can work with. We use Python dictionaries instead of classes, since custom classes cannot be used within Zeppelin due to serialization issues.

In [None]:
def _get_float(value):
    """
    Helper method for converting a string to a float. If this is not possible, None will be returned instead.
    """
    # The parameter is called 'value' so that the builtin 'str' is not shadowed
    if len(value) == 0:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def extract_station(line):
    """
    Extract weather station data from a raw CSV line
    """
    raw_columns = line.split(',')
    columns = [c.replace('"','') for c in raw_columns]

    usaf = columns[0]
    wban = columns[1]
    name = columns[2]
    country = columns[3]
    state = columns[4]
    icao = columns[5]
    latitude = _get_float(columns[6])
    longitude = _get_float(columns[7])
    elevation = _get_float(columns[8])
    date_begin = columns[9]
    date_end = columns[10]
    return {
            'usaf':usaf, 
            'wban':wban, 
            'name':name,
            'country':country, 
            'state':state, 
            'icao':icao, 
            'latitude':latitude, 
            'longitude':longitude, 
            'elevation':elevation, 
            'date_begin':date_begin, 
            'date_end':date_end 
           }


def extract_weather(line):
    """
    Extract weather data from a raw data line.
    """
    date = line[15:23]
    time = line[23:27]
    usaf = line[4:10]
    wban = line[10:15]
    airTemperatureQuality = line[92] == '1'
    airTemperature = float(line[87:92]) / 10
    windSpeedQuality = line[69] == '1'
    windSpeed = float(line[65:69]) / 10
    return {
            'date':date, 
            'time':time, 
            'usaf':usaf, 
            'wban':wban, 
            'airTemperatureQuality':airTemperatureQuality, 
            'airTemperature':airTemperature, 
            'windSpeedQuality':windSpeedQuality, 
            'windSpeed':windSpeed 
        }

## Test extraction methods

In [None]:
# Load stations from 's3://dimajix-training/data/weather/isd-history'. 
# Transform the data into Python dictionaries using extract_station
# YOUR CODE HERE
stations = ...

# Print a couple of elements from the transformed RDD
for s in stations.take(5):
    print(s)

In [None]:
# Load weather from 's3://dimajix-training/data/weather/2014'. 
# Transform the data into Python dictionaries using extract_weather
# YOUR CODE HERE
weather = ...

# Print a couple of elements from the transformed RDD
...

## Join Data Sets

To analyse the data, we need to join the weather data with the station data, so that we get more detailed information about where each measurement was recorded.
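The following is a minimal plain-Python sketch of what keying and joining produce, so that the shape of the joined elements is easier to anticipate. It does not use Spark: `key_by` and `inner_join` are hypothetical stand-ins for the RDD methods `keyBy` and `join`, and all records contain made-up values. It also assumes that the weather data is on the left side of the join.

```python
# Toy records (made-up values) using the same dictionary layout as the helper methods.
stations = [
    {'usaf': '010010', 'wban': '99999', 'country': 'NO'},
    {'usaf': '010020', 'wban': '99999', 'country': 'SW'},
]
weather = [
    {'usaf': '010010', 'wban': '99999', 'date': '20140101', 'airTemperature': -3.5},
    {'usaf': '010010', 'wban': '99999', 'date': '20140102', 'airTemperature': -1.0},
]


def key_by(records, key_func):
    """Hypothetical stand-in for RDD.keyBy: pair every record with its key."""
    return [(key_func(r), r) for r in records]


def inner_join(left, right):
    """Hypothetical stand-in for RDD.join: emit (key, (left_value, right_value))."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


def station_key(record):
    # A station is identified by the combination of 'usaf' and 'wban'
    return (record['usaf'], record['wban'])


joined = inner_join(key_by(weather, station_key), key_by(stations, station_key))
for element in joined:
    print(element)
```

Every joined element is a nested tuple `(key, (left_value, right_value))`. Stations without any measurement (the second toy station here) do not appear in the result of an inner join.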

In [None]:
# Create a key for every weather station using the values for 'usaf' and 'wban' from every record. This can be done using the keyBy method.
station_index = ...

# Create a key for every weather measurement element using the values for 'usaf' and 'wban' from every record. This can be done using the keyBy method.
weather_index = ...

# Now join weather and stations together using the keyed data. This can be done using the join method
joined_weather = ...

# Print some elements from joined_weather.
for d in ...:
    print(d)

## Caching Data

The join was really expensive. Before continuing with the next steps, you might want to cache the data and give it a nice name (for example "joined weather data").
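One possible way to do this is sketched below. The helper name `cache_with_name` is hypothetical; it only assumes that its argument is a PySpark RDD, whose `setName` and `cache` methods both return the RDD itself and can therefore be chained.

```python
def cache_with_name(rdd, name):
    """
    Give an RDD a readable name and mark it for caching.

    `rdd` is assumed to be a PySpark RDD. The name is a label that helps to
    identify the cached data set later on.
    """
    return rdd.setName(name).cache()
```

Caching is lazy: the data is only materialized in memory when the first action (for example `take` or `count`) runs on the RDD.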

In [None]:
# Cache the data for next operations
# YOUR CODE HERE

## Create appropriate Keys
We want to analyze the data grouped by country and year. So we need to create appropriate keys.

This will be done using a helper method extract_country_year_weather, which should return a tuple

    ((country, year), weather)

for every record in joined_weather.

Pay attention to the layout of the elements in joined_weather, as can be seen from the output above.
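As a small illustration of that layout, the sketch below unpacks a single hand-written element. The values are made up, and the example assumes that the weather record is the first entry of the inner tuple; if the join was performed the other way round, the two entries are swapped.

```python
# One made-up element in the shape (key, (weather, station))
element = (
    ('010010', '99999'),
    (
        {'date': '20140101', 'airTemperature': -3.5},
        {'country': 'NO'},
    ),
)

# Nested tuples can be unpacked in a single assignment
key, (weather_record, station_record) = element

# The date is stored as a 'YYYYMMDD' string, so the year is its first four characters
year = weather_record['date'][0:4]
country = station_record['country']

print(((country, year), weather_record))
```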

In [None]:
def extract_country_year_weather(data):
    # data is a nested tuple, so we first need to extract the weather and the station data
    station = ...
    weather = ...
    # Now extract country from station
    country = ...
    # and the year from the weather measurement data
    year =  ...
    return ((country, year), weather)

# Perform extraction
weather_per_country_and_year = joined_weather.map(extract_country_year_weather)

## Perform Aggregation
We want to extract the minimum and maximum of wind speed and of temperature per year and country (i.e. using the joined data above). We also need to handle cases where the data is not valid (i.e. windSpeedQuality is False or airTemperatureQuality is False).

We will implement custom aggregation functions that work on dictionaries.
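To illustrate how the three ingredients of such an aggregation interact (a neutral value, a function merging one record into an accumulator, and a function combining two accumulators), here is a plain-Python sketch. It deliberately uses simple `(min, max)` tuples over plain numbers instead of dictionaries. The function `aggregate_by_key` is a hypothetical emulation of the idea behind `aggregateByKey` on hand-made partitions, not Spark's implementation, and all values are made up.

```python
def nullsafe(op, a, b):
    """Apply op to a and b, treating None as 'no value yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return op(a, b)


# Neutral accumulator: neither a minimum nor a maximum has been seen yet
zero = (None, None)


def seq_op(acc, value):
    """Merge a single value into an accumulator, returning a new accumulator."""
    return (nullsafe(min, acc[0], value), nullsafe(max, acc[1], value))


def comb_op(left, right):
    """Combine two accumulators, e.g. from two different partitions."""
    return (nullsafe(min, left[0], right[0]), nullsafe(max, left[1], right[1]))


def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Hypothetical emulation: aggregate per partition first, then combine."""
    partials = []
    for partition in partitions:
        acc = {}
        for k, v in partition:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    result = {}
    for acc in partials:
        for k, v in acc.items():
            result[k] = comb_op(result[k], v) if k in result else v
    return result


# Two made-up partitions; None stands for an invalid measurement
partitions = [
    [(('NO', '2014'), -3.5), (('NO', '2014'), 4.0)],
    [(('NO', '2014'), None), (('SW', '2014'), 1.5), (('NO', '2014'), 7.0)],
]
minmax = aggregate_by_key(partitions, zero, seq_op, comb_op)
print(minmax)
```

Neither `seq_op` nor `comb_op` modifies its arguments. This matters because the same neutral value is used as the starting point for every key.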

In [None]:
def nullsafe_min(a, b):
    """
    Helper method for taking the min of two values. Also gracefully handles None values
    """
    from builtins import min
    if a is None:
        return b
    if b is None:
        return a
    return min(a,b)


def nullsafe_max(a, b):
    """
    Helper method for taking the max of two values. Also gracefully handles None values
    """
    from builtins import max
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


# Neutral value used in aggregation
# YOUR CODE HERE
zero_wmm = { 'minTemperature':None, ... }


def reduce_wmm(wmm, data):
    """
    Used for merging a new weather measurement into an existing min/max dictionary. The incoming
    objects will not be modified, instead a new object will be returned.
    :param wmm: A Python dictionary representing min/max information
    :param data: A Python dictionary representing weather measurement information
    :returns: A new Python dictionary representing min/max information
    """
    # YOUR CODE HERE
    minTemperature = ...
    maxTemperature = ...
    minWindSpeed = ...
    maxWindSpeed = ...
    
    return { 'minTemperature':minTemperature, ... }


def combine_wmm(left, right):
    """
    Used for combining two min/max dictionaries into a new min/max dictionary
    :param left: First Python dictionary representing min/max information
    :param right: Second Python dictionary representing min/max information
    :returns: A new Python dictionary representing combined min/max information
    """
    # YOUR CODE HERE
    minTemperature = ...
    maxTemperature = ...
    minWindSpeed = ...
    maxWindSpeed = ...

    return { 'minTemperature':minTemperature, ... }

In [None]:
# Aggregate min/max information per year and country
weather_minmax = weather_per_country_and_year.aggregateByKey(zero_wmm, reduce_wmm, combine_wmm)

for m in weather_minmax.take(5):
    print(m)

## Format Output

We want to create CSV data, so we need to convert the Python dicts into nicely formatted strings.
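A minimal sketch of building such a line from a single made-up row is shown below. It only uses two of the value columns, and the helper `to_csv_field` is a hypothetical name. Rendering None as an empty field is one possible convention, not something the data format prescribes.

```python
def to_csv_field(value):
    """Render None as an empty field and everything else via str()."""
    return '' if value is None else str(value)


# One made-up row in the shape ((country, year), min/max dictionary)
row = (('NO', '2014'), {'minTemperature': -3.5, 'maxTemperature': None})

(country, year), values = row
fields = [country, year, values['minTemperature'], values['maxTemperature']]
line = ','.join(to_csv_field(f) for f in fields)
print(line)  # NO,2014,-3.5,
```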

In [None]:
def format_result(row):
    # Every row contains the key and the data.
    #   key is (country, year)
    #   value is Python dictionary containing min/max information
    (k,v) = row
    # Create a CSV line containing 'country,year,minTemperature,maxTemperature,minWindSpeed,maxWindSpeed'
    # YOUR CODE HERE
    line = ...
    # Encode as UTF-8, or we might experience some problems
    return line.encode('utf-8')

# Apply the function format_result to all records in the RDD weather_minmax
result = ...

# An RDD cannot be iterated directly, so fetch the records to the driver first
for line in result.collect():
    print(line)