# Weather Data Analytics
This notebook performs some basic weather data analytics using the PySpark RDD interface.

## Helper Methods
First we need some helper methods for converting the raw data into something that we can work with. We use Python dictionaries instead of classes, since custom classes cannot be used within Zeppelin due to serialization issues.

In [None]:
def _get_float(value):
    # Return None for empty or non-numeric fields instead of raising
    if len(value) == 0:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def extract_station(line):
    raw_columns = line.split(',')
    columns = [c.replace('"','') for c in raw_columns]

    usaf = columns[0]
    wban = columns[1]
    name = columns[2]
    country = columns[3]
    state = columns[4]
    icao = columns[5]
    latitude = _get_float(columns[6])
    longitude = _get_float(columns[7])
    elevation = _get_float(columns[8])
    date_begin = columns[9]
    date_end = columns[10]
    return {'usaf':usaf, 'wban':wban, 'name':name, 'country':country, 'state':state, 'icao':icao, 'latitude':latitude, 'longitude':longitude, 'elevation':elevation, 'date_begin':date_begin, 'date_end':date_end }


def extract_weather(line):
    date = line[15:23]
    time = line[23:27]
    usaf = line[4:10]
    wban = line[10:15]
    airTemperatureQuality = line[92] == '1'
    airTemperature = float(line[87:92]) / 10
    windSpeedQuality = line[69] == '1'
    windSpeed = float(line[65:69]) / 10
    return {'date':date, 'time':time, 'usaf':usaf, 'wban':wban, 'airTemperatureQuality':airTemperatureQuality, 'airTemperature':airTemperature, 'windSpeedQuality':windSpeedQuality, 'windSpeed':windSpeed }
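
To see how the fixed-width offsets in `extract_weather` line up, the following sketch builds a synthetic record and parses it. The `make_record` helper and all sample values are made up for illustration; only the offsets are taken from the helper above, and real records contain many more fields.

```python
def extract_weather(line):
    # Same fixed-width offsets as in the helper above
    return {
        'date': line[15:23],
        'time': line[23:27],
        'usaf': line[4:10],
        'wban': line[10:15],
        'airTemperatureQuality': line[92] == '1',
        'airTemperature': float(line[87:92]) / 10,
        'windSpeedQuality': line[69] == '1',
        'windSpeed': float(line[65:69]) / 10,
    }


def make_record(fields, width=100):
    """Hypothetical helper: place each value at its start offset in a padded line."""
    chars = ['0'] * width
    for start, value in fields.items():
        chars[start:start + len(value)] = value
    return ''.join(chars)


sample = make_record({
    4: '010010',     # usaf
    10: '99999',     # wban
    15: '20140101',  # date
    23: '0600',      # time
    65: '0052',      # wind speed, stored scaled by 10
    69: '1',         # wind speed quality flag
    87: '-0123',     # air temperature, stored scaled by 10
    92: '1',         # air temperature quality flag
})

record = extract_weather(sample)
print(record)
```

Note that `extract_weather` calls `float` directly, so a record with a non-numeric value in either field would raise a `ValueError` rather than return `None`.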

## Aggregation Functions
We want to extract the minimum and maximum of wind speed and of temperature. We also need to handle records where the data is not valid (i.e. windSpeedQuality is False or airTemperatureQuality is False).

We will implement custom aggregation functions that work on dictionaries.

In [None]:
def nullsafe_binop(a, b, op):
    if a is None:
        return b
    if b is None:
        return a
    return op(a,b)
    
def nullsafe_min(a, b):
    return nullsafe_binop(a, b, min)

def nullsafe_max(a, b):
    return nullsafe_binop(a, b, max)


def reduce_wmm(wmm, data):
    """
    Used for merging in a new weather data set into an existing WeatherMinMax object. The incoming
    objects will not be modified, instead a new object will be returned.
    :param wmm: WeatherMinMax object
    :param data: WeatherData object
    :returns: A new WeatherMinMax object
    """
    if data['airTemperatureQuality']:
        minTemperature = nullsafe_min(wmm['minTemperature'], data['airTemperature'])
        maxTemperature = nullsafe_max(wmm['maxTemperature'], data['airTemperature'])
    else:
        minTemperature = wmm['minTemperature']
        maxTemperature = wmm['maxTemperature']

    if data['windSpeedQuality']:
        minWindSpeed = nullsafe_min(wmm['minWindSpeed'], data['windSpeed'])
        maxWindSpeed = nullsafe_max(wmm['maxWindSpeed'], data['windSpeed'])
    else:
        minWindSpeed = wmm['minWindSpeed']
        maxWindSpeed = wmm['maxWindSpeed']

    return { 'minTemperature':minTemperature, 'maxTemperature':maxTemperature, 'minWindSpeed':minWindSpeed, 'maxWindSpeed':maxWindSpeed }


def combine_wmm(left, right):
    """
    Used for combining two WeatherMinMax objects into a new WeatherMinMax object
    :param left: First WeatherMinMax object
    :param right: Second WeatherMinMax object
    :returns: A new WeatherMinMax object
    """
    minTemperature = nullsafe_min(left['minTemperature'], right['minTemperature'])
    maxTemperature = nullsafe_max(left['maxTemperature'], right['maxTemperature'])
    minWindSpeed = nullsafe_min(left['minWindSpeed'], right['minWindSpeed'])
    maxWindSpeed = nullsafe_max(left['maxWindSpeed'], right['maxWindSpeed'])

    return { 'minTemperature':minTemperature, 'maxTemperature':maxTemperature, 'minWindSpeed':minWindSpeed, 'maxWindSpeed':maxWindSpeed }
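
The split into `reduce_wmm` and `combine_wmm` mirrors the two phases of `aggregateByKey`: the first function folds records into an accumulator within a partition, the second merges the accumulators of different partitions. Below is a minimal plain-Python sketch of that pattern, simplified to temperature only. The function names `reduce_temp` and `combine_temp`, the two partitions, and all readings are hypothetical.

```python
from functools import reduce


def nullsafe_binop(a, b, op):
    # None means "no value yet", so the other operand wins
    if a is None:
        return b
    if b is None:
        return a
    return op(a, b)


def reduce_temp(acc, data):
    """Simplified stand-in for reduce_wmm: fold one record into an accumulator."""
    if not data['airTemperatureQuality']:
        return acc
    return {
        'minTemperature': nullsafe_binop(acc['minTemperature'], data['airTemperature'], min),
        'maxTemperature': nullsafe_binop(acc['maxTemperature'], data['airTemperature'], max),
    }


def combine_temp(left, right):
    """Simplified stand-in for combine_wmm: merge two accumulators."""
    return {
        'minTemperature': nullsafe_binop(left['minTemperature'], right['minTemperature'], min),
        'maxTemperature': nullsafe_binop(left['maxTemperature'], right['maxTemperature'], max),
    }


zero = {'minTemperature': None, 'maxTemperature': None}

# Two hypothetical partitions; the second holds only an invalid reading
partition_a = [
    {'airTemperature': 4.0, 'airTemperatureQuality': True},
    {'airTemperature': -2.5, 'airTemperatureQuality': True},
]
partition_b = [
    {'airTemperature': 999.9, 'airTemperatureQuality': False},
]

# Phase 1: fold each partition on its own, starting from the zero value
partials = [reduce(reduce_temp, part, zero) for part in (partition_a, partition_b)]
# Phase 2: merge the per-partition results
total = reduce(combine_temp, partials)
print(total)
```

The partition with only invalid data contributes an accumulator full of `None` values, which the null-safe operators then ignore during the merge.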

## Create Station Index as Broadcast Variable

Instead of performing a shuffle join with the station data, we will broadcast the index to all workers and do the lookups locally. This will save us one shuffle.

We need to perform the following tasks:

1. Load the weather station data from HDFS
2. Create appropriate keys from wban and usaf
3. Convert the RDD to a local map using collectAsMap()
4. Create a broadcast variable from this local map

In [None]:
# Load station data from '/user/cloudera/data/weather/isd-history.csv' and extract data using extract_station
stations = ... 

# Create a tuple (usaf+wban, station) for every entry in stations
station_index_rdd = ...

# Transfer the RDD to a local Python dictionary using collectAsMap()
station_index_map = ...

# Create a broadcast variable from station_index_map
station_index_bc = ...
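
Steps 2 to 4 can be pictured without a cluster. The sketch below is a plain-Python analogy, not the PySpark solution: a list stands in for the RDD, a dictionary for the result of `collectAsMap()`, and the hypothetical `LocalBroadcast` class imitates the `.value` attribute through which a PySpark broadcast variable exposes its content. The station records are made up.

```python
class LocalBroadcast(object):
    """Hypothetical stand-in for a PySpark broadcast variable (read via .value)."""

    def __init__(self, value):
        self.value = value


# Made-up station records, shaped like the output of extract_station
stations = [
    {'usaf': '010010', 'wban': '99999', 'name': 'STATION A', 'country': 'NO'},
    {'usaf': '722860', 'wban': '23119', 'name': 'STATION B', 'country': 'US'},
]

# Step 2: build (usaf + wban, station) pairs
station_pairs = [(s['usaf'] + s['wban'], s) for s in stations]

# Step 3: a local dictionary, analogous to what collectAsMap() returns
station_index_map = dict(station_pairs)

# Step 4: wrap the dictionary; workers would read it through .value
station_index_bc = LocalBroadcast(station_index_map)

print(station_index_bc.value['01001099999']['country'])
```

This approach only works while the station index is small enough to fit into the memory of the driver and of every worker.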

## Load Weather Data
Now we can load the weather data, as we have done before.

In [None]:
weather = sc.textFile('/user/cloudera/data/weather/2014').map(extract_weather)
print(weather.take(5))

## Joining Station Data
Now we will perform the join of the weather data with the station data. But this time we will use the broadcast variable instead of an explicit shuffle join.

In [None]:
# The following function will take a weather data object and should return a tuple
#   ((country, year), weather)
def extract_country_year_weather(weather):
    # Extract station_id as usaf + wban from data
    station_id = ...
    # Lookup station in broadcast variable station_index_bc
    station = ...
    # Extract country from station
    country = ...
    # ... and year from weather
    year = ...
    return ((country, year), weather)

# Apply the function to all records
weather_per_country_and_year = weather.map(extract_country_year_weather)

# Aggregate min/max information per year and country. Nothing has changed here.
zero = { 'minTemperature':None, 'maxTemperature':None, 'minWindSpeed':None, 'maxWindSpeed':None }
weather_minmax = weather_per_country_and_year.aggregateByKey(zero, reduce_wmm, combine_wmm)

print(weather_minmax.take(5))
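
The lookup itself is a single dictionary access per record, which is why the join needs no shuffle. The following plain-Python sketch shows the idea with a made-up index and made-up readings. `to_country_year` is a hypothetical name, taking the year as the first four characters of `date` assumes a `YYYYMMDD` layout, and falling back to `None` for unknown stations is one possible choice rather than something the notebook prescribes.

```python
# Made-up index keyed by usaf + wban, as a broadcast variable's value would hold it
station_index = {
    '01001099999': {'country': 'NO'},
    '72286023119': {'country': 'US'},
}


def to_country_year(weather, index):
    """Hypothetical helper: key a weather record by (country, year)."""
    station_id = weather['usaf'] + weather['wban']
    # .get avoids a KeyError when a station is missing from the index
    station = index.get(station_id)
    country = station['country'] if station is not None else None
    year = weather['date'][0:4]  # assumes dates are formatted as YYYYMMDD
    return ((country, year), weather)


readings = [
    {'usaf': '010010', 'wban': '99999', 'date': '20140101'},
    {'usaf': '722860', 'wban': '23119', 'date': '20140315'},
    {'usaf': '000000', 'wban': '00000', 'date': '20140601'},  # unknown station
]

keys = [to_country_year(r, station_index)[0] for r in readings]
print(keys)
```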

## Format Output
Again we want to output the result in CSV format.

In [None]:
def format_result(row):
    (k,v) = row
    country = k[0]
    year = k[1]
    minT = v['minTemperature'] or 0.0
    maxT = v['maxTemperature'] or 0.0
    minW = v['minWindSpeed'] or 0.0
    maxW = v['maxWindSpeed'] or 0.0
    line = "%s,%s,%f,%f,%f,%f" % (country, year, minT, maxT, minW, maxW)
    return line.encode('utf-8')

result = weather_minmax.map(format_result).collect()

for line in result:
    print(line)
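
Because missing aggregates are stored as `None`, `format_result` substitutes `0.0` before applying `%f`, which would otherwise raise a `TypeError`. A small sketch with a made-up row shows the resulting line. `format_row` is a hypothetical variant that omits the final `encode` call, so that it returns text under both Python 2 and Python 3.

```python
def format_row(row):
    """Hypothetical variant of format_result without the encode step."""
    (country, year), v = row
    names = ('minTemperature', 'maxTemperature', 'minWindSpeed', 'maxWindSpeed')
    # Replace missing aggregates (None) by 0.0 so that %f can format them
    values = [v[name] or 0.0 for name in names]
    return "%s,%s,%f,%f,%f,%f" % tuple([country, year] + values)


# Made-up aggregate; minWindSpeed is None, as if no valid reading was seen
row = (('US', '2014'), {'minTemperature': -10.0, 'maxTemperature': 35.5,
                        'minWindSpeed': None, 'maxWindSpeed': 12.3})

line = format_row(row)
print(line)  # US,2014,-10.000000,35.500000,0.000000,12.300000
```

One consequence of this design is that a missing value and a genuine reading of `0.0` cannot be told apart in the output.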