# Weather Data Analytics
This notebook performs some basic weather data analytics using the PySpark RDD interface.

## Helper Methods
First we need some helper methods for converting the raw data into something that we can work with. We use Python dictionaries instead of classes, since custom classes cannot be used within Zeppelin due to serialization issues.

In [2]:
def _get_float(value):
    # Return the string parsed as a float, or None if it is empty or not a number
    if len(value) == 0:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def extract_station(line):
    raw_columns = line.split(',')
    columns = [c.replace('"','') for c in raw_columns]

    usaf = columns[0]
    wban = columns[1]
    name = columns[2]
    country = columns[3]
    state = columns[4]
    icao = columns[5]
    latitude = _get_float(columns[6])
    longitude = _get_float(columns[7])
    elevation = _get_float(columns[8])
    date_begin = columns[9]
    date_end = columns[10]
    return {'usaf':usaf, 'wban':wban, 'name':name, 'country':country, 'state':state, 'icao':icao, 'latitude':latitude, 'longitude':longitude, 'elevation':elevation, 'date_begin':date_begin, 'date_end':date_end }


def extract_weather(line):
    date = line[15:23]
    time = line[23:27]
    usaf = line[4:10]
    wban = line[10:15]
    airTemperatureQuality = line[92] == '1'
    airTemperature = float(line[87:92]) / 10
    windSpeedQuality = line[69] == '1'
    windSpeed = float(line[65:69]) / 10
    return {'date':date, 'time':time, 'usaf':usaf, 'wban':wban, 'airTemperatureQuality':airTemperatureQuality, 'airTemperature':airTemperature, 'windSpeedQuality':windSpeedQuality, 'windSpeed':windSpeed }
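
To make the fixed-width offsets in `extract_weather` easier to follow, here is a minimal sketch that builds a synthetic record and parses it. The `make_record` helper is hypothetical and not part of the notebook. It fills every position that `extract_weather` does not read with `'0'`, so the resulting line is not a real weather record.

```python
def extract_weather(line):
    # Same fixed-width offsets as the helper above.
    return {
        'date': line[15:23],
        'time': line[23:27],
        'usaf': line[4:10],
        'wban': line[10:15],
        'airTemperatureQuality': line[92] == '1',
        'airTemperature': float(line[87:92]) / 10,
        'windSpeedQuality': line[69] == '1',
        'windSpeed': float(line[65:69]) / 10,
    }


def make_record(usaf, wban, date, time, wind, wind_q, temp, temp_q):
    """Hypothetical helper: build a synthetic line with filler elsewhere."""
    chars = ['0'] * 93

    def put(start, text):
        # Overwrite the characters starting at `start` with `text`.
        chars[start:start + len(text)] = list(text)

    put(4, usaf)
    put(10, wban)
    put(15, date)
    put(23, time)
    put(65, wind)
    put(69, wind_q)
    put(87, temp)
    put(92, temp_q)
    return ''.join(chars)


# Synthetic values chosen for illustration only.
sample = make_record('010060', '99999', '20140101', '0100',
                     '0030', '1', '-0136', '1')
record = extract_weather(sample)
print(record['airTemperature'])
print(record['windSpeed'])
```

The raw fields `-0136` and `0030` come out as `-13.6` and `3.0` because both measurements are divided by 10 after parsing.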

## Aggregation Functions
We want to extract the minimum and maximum of wind speed and of temperature. We also want to handle cases where the data is not valid (i.e. windSpeedQuality is False or airTemperatureQuality is False).

We will implement custom aggregation functions that work on dictionaries.

In [3]:
def nullsafe_binop(a, b, op):
    if a is None:
        return b
    if b is None:
        return a
    return op(a,b)
    
def nullsafe_min(a, b):
    return nullsafe_binop(a, b, min)

def nullsafe_max(a, b):
    return nullsafe_binop(a, b, max)


def reduce_wmm(wmm, data):
    """
    Merges one weather record into an existing WeatherMinMax dictionary. The incoming
    dictionaries are not modified; a new dictionary is returned instead.
    :param wmm: WeatherMinMax dictionary (the accumulator)
    :param data: weather record as returned by extract_weather
    :returns: A new WeatherMinMax dictionary
    """
    if data['airTemperatureQuality']:
        minTemperature = nullsafe_min(wmm['minTemperature'], data['airTemperature'])
        maxTemperature = nullsafe_max(wmm['maxTemperature'], data['airTemperature'])
    else:
        minTemperature = wmm['minTemperature']
        maxTemperature = wmm['maxTemperature']

    if data['windSpeedQuality']:
        minWindSpeed = nullsafe_min(wmm['minWindSpeed'], data['windSpeed'])
        maxWindSpeed = nullsafe_max(wmm['maxWindSpeed'], data['windSpeed'])
    else:
        minWindSpeed = wmm['minWindSpeed']
        maxWindSpeed = wmm['maxWindSpeed']

    return { 'minTemperature':minTemperature, 'maxTemperature':maxTemperature, 'minWindSpeed':minWindSpeed, 'maxWindSpeed':maxWindSpeed }


def combine_wmm(left, right):
    """
    Used for combining two WeatherMinMax dictionaries into a new WeatherMinMax dictionary
    :param left: First WeatherMinMax dictionary
    :param right: Second WeatherMinMax dictionary
    :returns: A new WeatherMinMax dictionary
    """
    minTemperature = nullsafe_min(left['minTemperature'], right['minTemperature'])
    maxTemperature = nullsafe_max(left['maxTemperature'], right['maxTemperature'])
    minWindSpeed = nullsafe_min(left['minWindSpeed'], right['minWindSpeed'])
    maxWindSpeed = nullsafe_max(left['maxWindSpeed'], right['maxWindSpeed'])

    return { 'minTemperature':minTemperature, 'maxTemperature':maxTemperature, 'minWindSpeed':minWindSpeed, 'maxWindSpeed':maxWindSpeed }
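
The following sketch shows how the two functions are meant to cooperate, using plain Python instead of Spark. It is a simplified model that assumes one function folds records into an accumulator within a partition and the other merges the partial accumulators. It tracks temperature only, and the partitions and readings are made up for illustration. The names `nullsafe`, `reduce_temp` and `combine_temp` are hypothetical, shortened variants of the functions above.

```python
from functools import reduce


def nullsafe(op, a, b):
    # None acts as the neutral element for min and max.
    if a is None:
        return b
    if b is None:
        return a
    return op(a, b)


def reduce_temp(acc, data):
    # Fold one record into the accumulator, skipping invalid readings.
    if not data['airTemperatureQuality']:
        return acc
    t = data['airTemperature']
    return {'min': nullsafe(min, acc['min'], t),
            'max': nullsafe(max, acc['max'], t)}


def combine_temp(left, right):
    # Merge two partial accumulators into a new one.
    return {'min': nullsafe(min, left['min'], right['min']),
            'max': nullsafe(max, left['max'], right['max'])}


zero = {'min': None, 'max': None}

# Three hypothetical partitions holding records for a single key.
partition_a = [
    {'airTemperature': -13.6, 'airTemperatureQuality': True},
    {'airTemperature': 999.9, 'airTemperatureQuality': False},  # skipped
]
partition_b = [
    {'airTemperature': -10.0, 'airTemperatureQuality': True},
]
partition_c = [
    {'airTemperature': 999.9, 'airTemperatureQuality': False},  # skipped
]

# Step 1: fold each partition separately, starting from the zero value.
partials = [reduce(reduce_temp, p, zero)
            for p in (partition_a, partition_b, partition_c)]

# Step 2: merge the partial results.
result = reduce(combine_temp, partials)
print(result)
```

A partition that contains only invalid readings keeps the zero value, and the `None` handling ensures that such a partial result does not affect the final minimum and maximum.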

# Create Station Index as Broadcast Variable

Instead of performing a shuffle join with the station data, we will broadcast the index to all workers and do the lookups locally. This will save us one shuffle.

We need to perform the following tasks:

1. Load the weather station data from HDFS
2. Create appropriate keys from wban and usaf
3. Convert the RDD to a local map using collectAsMap()
4. Create a broadcast variable from this local map

In [4]:
stations = sc.textFile('/user/cloudera/data/weather/isd-history.csv').map(lambda line: extract_station(line))
station_index = stations.keyBy(lambda data: data['usaf'] + data['wban']).collectAsMap()
station_index_bc = sc.broadcast(station_index)
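
To illustrate the shape of the resulting index, here is a small sketch in plain Python without Spark. The station records are made up, and `station_key` is a hypothetical helper that mirrors the key used above. The dictionary comprehension stands in for `keyBy(...).collectAsMap()`.

```python
def station_key(station):
    # Same key as in the notebook: USAF and WBAN ids concatenated.
    return station['usaf'] + station['wban']


# Hypothetical, already parsed station records (made-up values).
stations = [
    {'usaf': '000001', 'wban': '99999', 'name': 'EXAMPLE ONE', 'country': 'XX'},
    {'usaf': '000002', 'wban': '99999', 'name': 'EXAMPLE TWO', 'country': 'YY'},
]

# Local stand-in for stations.keyBy(station_key).collectAsMap()
station_index = {station_key(s): s for s in stations}

# A lookup is a plain dictionary access; .get() yields None for unknown ids.
known = station_index.get('00000199999')
unknown = station_index.get('12345600000')
print(known['country'])
print(unknown)
```

Because the index is an ordinary dictionary, it has to fit into the memory of the driver and of every worker that receives the broadcast.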

# Load Weather Data
Now we can load the weather data, as we have done before.

In [5]:
weather = sc.textFile('/user/cloudera/data/weather/2014').map(lambda line: extract_weather(line))
print weather.take(5)

[{'airTemperature': -13.6, 'windSpeedQuality': True, 'usaf': u'010060', 'windSpeed': 3.0, 'wban': u'99999', 'time': u'0100', 'date': u'20140101', 'airTemperatureQuality': True}, {'airTemperature': -14.2, 'windSpeedQuality': True, 'usaf': u'010060', 'windSpeed': 2.0, 'wban': u'99999', 'time': u'0200', 'date': u'20140101', 'airTemperatureQuality': True}, {'airTemperature': -10.7, 'windSpeedQuality': True, 'usaf': u'010060', 'windSpeed': 4.0, 'wban': u'99999', 'time': u'0400', 'date': u'20140101', 'airTemperatureQuality': True}, {'airTemperature': -11.2, 'windSpeedQuality': True, 'usaf': u'010060', 'windSpeed': 3.0, 'wban': u'99999', 'time': u'0500', 'date': u'20140101', 'airTemperatureQuality': True}, {'airTemperature': -10.0, 'windSpeedQuality': True, 'usaf': u'010060', 'windSpeed': 5.0, 'wban': u'99999', 'time': u'0600', 'date': u'20140101', 'airTemperatureQuality': True}]


# Joining Station Data
Now we will perform the join of the weather data with the station data. But this time we will use the broadcast variable instead of an explicit shuffle join.

In [6]:
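def extract_country_year_weather(data):
    station_id = data['usaf'] + data['wban']
    station = station_index_bc.value.get(station_id, None)
    # Stations missing from the index are grouped under the country None instead of raising a TypeError
    country = station['country'] if station is not None else None
    return ((country, data['date'][0:4]), data)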

weather_per_country_and_year = weather.map(extract_country_year_weather)

# Aggregate min/max information per year and country
zero = { 'minTemperature':None, 'maxTemperature':None, 'minWindSpeed':None, 'maxWindSpeed':None }
weather_minmax = weather_per_country_and_year.aggregateByKey(zero, reduce_wmm, combine_wmm)

print weather_minmax.take(5)

[((u'PO', u'2014'), {'maxWindSpeed': 15.4, 'maxTemperature': 32.0, 'minWindSpeed': 0.0, 'minTemperature': -1.0}), ((u'PL', u'2014'), {'maxWindSpeed': 14.9, 'maxTemperature': 32.0, 'minWindSpeed': 0.0, 'minTemperature': -15.0}), ((u'MY', u'2014'), {'maxWindSpeed': 9.8, 'maxTemperature': 36.0, 'minWindSpeed': 0.0, 'minTemperature': 19.0}), ((u'FI', u'2014'), {'maxWindSpeed': 18.0, 'maxTemperature': 30.3, 'minWindSpeed': 0.0, 'minTemperature': -28.6}), ((u'GM', u'2014'), {'maxWindSpeed': 13.4, 'maxTemperature': 31.0, 'minWindSpeed': 0.0, 'minTemperature': -9.0})]
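
A dictionary lookup can miss when a weather record refers to a station that is not in the index. The following sketch, in plain Python without Spark, shows one possible way to keep such records by grouping them under a placeholder country. The function `to_country_year`, the placeholder `'??'` and all sample values are assumptions made for illustration.

```python
def to_country_year(data, station_index, unknown='??'):
    # `unknown` is a hypothetical placeholder for unmatched stations.
    station = station_index.get(data['usaf'] + data['wban'])
    country = station['country'] if station is not None else unknown
    return ((country, data['date'][0:4]), data)


# Made-up index with a single station.
station_index = {'00000199999': {'country': 'XX'}}

records = [
    {'usaf': '000001', 'wban': '99999', 'date': '20140101'},
    {'usaf': '000009', 'wban': '99999', 'date': '20140102'},  # not in index
]

keys = [to_country_year(r, station_index)[0] for r in records]
print(keys)
```

Whether unmatched records should be kept under a placeholder or filtered out before aggregating depends on the analysis. The sketch only shows the lookup step.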


# Format Output
Again we want to output the result in CSV format. Here the formatted lines are collected to the driver and printed.

In [7]:
def format_result(row):
    (k,v) = row
    country = k[0]
    year = k[1]
    minT = v['minTemperature'] or 0.0
    maxT = v['maxTemperature'] or 0.0
    minW = v['minWindSpeed'] or 0.0
    maxW = v['maxWindSpeed'] or 0.0
    line = "%s,%s,%f,%f,%f,%f" % (country, year, minT, maxT, minW, maxW)
    return line.encode('utf-8')

result = weather_minmax.map(format_result).collect()

for l in result:
    print l

PO,2014,-1.000000,32.000000,0.000000,15.400000
PL,2014,-15.000000,32.000000,0.000000,14.900000
MY,2014,19.000000,36.000000,0.000000,9.800000
FI,2014,-28.600000,30.300000,0.000000,18.000000
GM,2014,-9.000000,31.000000,0.000000,13.400000
IT,2014,-6.800000,24.000000,0.000000,20.600000
DA,2014,-9.000000,30.200000,0.000000,17.000000
UK,2014,-6.000000,30.400000,0.000000,20.600000
GK,2014,2.000000,24.000000,0.000000,21.100000
IC,2014,-7.000000,18.000000,0.000000,29.300000
US,2014,-37.200000,41.200000,0.000000,31.000000
SW,2014,-34.500000,28.900000,1.000000,16.000000
RS,2014,-28.900000,30.500000,0.000000,11.000000
BE,2014,-7.000000,33.100000,0.000000,16.000000
AU,2014,-11.000000,34.000000,0.000000,16.500000
AS,2014,0.900000,45.600000,0.000000,14.400000
LU,2014,-10.000000,32.100000,0.000000,13.400000
NO,2014,-35.700000,32.000000,0.000000,35.500000
SF,2014,0.900000,37.400000,0.000000,13.400000
EZ,2014,-15.000000,33.000000,0.000000,16.500000
JA,2014,-0.500000,33.900000,0.000000,19.600000
NL,2014,