Fukushima Heatmap: Subsecting Data
===============================
In some cases, the amount of data available is too much to graph all at once. While the Fukushima set is small enough to fit comfortably in memory, we can still use it to showcase some techniques for handling much larger sets. In this case, we will still process the entire set of data, but do so in coordinate chunks; each chunk is averaged to create a heatmap. You can configure the number of chunks created; the more there are, the less data is held in memory at a time, but the more queries are done overall.
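
To make the trade-off concrete, here is a minimal sketch of how a rectangular extent can be split into a square grid of chunks. The helper `chunk_bounds` and the sample extent are hypothetical and are not part of this notebook's code; the point is only that the number of queries grows with the square of the chunks per side.

```python
def chunk_bounds(x_coords, y_coords, chunks_per_side):
    """Yield (lon_min, lon_max, lat_min, lat_max) for each chunk of a square grid."""
    x_step = (x_coords[1] - x_coords[0]) / chunks_per_side
    y_step = (y_coords[1] - y_coords[0]) / chunks_per_side
    for col in range(chunks_per_side):
        for row in range(chunks_per_side):
            lon_min = x_coords[0] + col * x_step
            lat_min = y_coords[0] + row * y_step
            yield (lon_min, lon_min + x_step, lat_min, lat_min + y_step)


# Hypothetical extent, for illustration only
bounds = list(chunk_bounds((0.0, 4.0), (10.0, 12.0), chunks_per_side=4))
print(len(bounds))   # one query per chunk: 4**2 = 16
print(bounds[0])     # (0.0, 1.0, 10.0, 10.5)
```

Doubling `chunks_per_side` quarters the area (and, for evenly spread data, roughly the number of records) handled per query, while quadrupling the number of queries.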


Initial setup and finding all dates
--------------------------------------

We configure our graph (including the extent to which we'll divide our data), open a connection to our database of interest, and find what dates are available to us. We'll track data from each date separately.

In [None]:
import sina.datastores.sql as sina_sql

DATABASE = '/collab/usr/gapps/wf/examples/data/fukushima/fukushima.sqlite'

# Number of chunks along a side; this number squared is the total number of chunks
CHUNKS_PER_SIDE = 12

# Identify the coordinates, label, and label orientation for the power
# plant and selected cities as points of reference.
CITIES = [  # (lon, lat), desc, horizontal alignment
    [(141.0281, 37.4213), ' Daiichi Nuclear Power Plant', 'left'],
    [(141.0125, 37.4492), 'Futaba ', 'right'],
    [(141.0000, 37.4833), ' Namie', 'left'],
    [(140.9836, 37.4044), 'Okuma ', 'right'],
    [(141.0088, 37.3454), ' Tomioka', 'left']]


# Configure how cities are marked
COLOR_CITIES = 'red'
AREA_CITIES = 12

# The coordinates our analysis will cover
X_COORDS = (140.9, 141.3)
Y_COORDS = (37.0, 37.83)

# The city coordinates need to be normalized to our grid (whose size depends on CHUNKS_PER_SIDE).
# imshow centers cell i on coordinate i (the grid spans -0.5 to CHUNKS_PER_SIDE-0.5), hence the -0.5.
norm_x = [CHUNKS_PER_SIDE*((c[0][0]-X_COORDS[0])/(X_COORDS[1]-X_COORDS[0])) - 0.5 for c in CITIES]
norm_y = [CHUNKS_PER_SIDE*(1 - (c[0][1]-Y_COORDS[0])/(Y_COORDS[1]-Y_COORDS[0])) - 0.5 for c in CITIES]

# Create the data access object factory.
factory = sina_sql.DAOFactory(DATABASE)
record_handler = factory.createRecordDAO()
relationship_handler = factory.createRelationshipDAO()

# Get the ids of the experiments (which are their dates)
all_experiments = record_handler.get_all_of_type("exp")
dates = [str(x.id) for x in all_experiments]

print('Config loaded. Database has the following dates available: {}'.format(', '.join(dates)))

Filter the Data: Filtering Logic
========================
We subdivide our coordinate range (latitude 37.0-37.83, longitude 140.9-141.3) into CHUNKS_PER_SIDE^2 regions and find the Records whose coordinates fall within each region. We separate these by the day each Record is associated with. We then take each Record's gcnorm (counts per second) and average them to get the chunk's average for that day. We also track the number of Records per chunk per day, so we know roughly how confident we can be in that average.

This cell adds the functions to memory, plus does a bit of preprocessing. The functions themselves will be called once it's time to create the graph.
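
Separately from the notebook's own functions, the per-day averaging step can be illustrated in isolation. This is a minimal sketch that assumes the records in a chunk have already been fetched and reduced to `(id, gcnorm)` pairs; `average_by_date` and the sample data are hypothetical and do not use Sina.

```python
from collections import defaultdict


def average_by_date(records, ids_at_dates):
    """Return {date: {"total", "count", "average"}} for the given records."""
    # Invert the date -> ids mapping once so each record needs a single lookup
    date_of = {rec_id: date for date, ids in ids_at_dates.items() for rec_id in ids}
    out = defaultdict(lambda: {"total": 0.0, "count": 0, "average": 0.0})
    for rec_id, gcnorm in records:
        date = date_of.get(rec_id)
        if date is None:
            continue  # record belongs to no known date
        out[date]["total"] += gcnorm
        out[date]["count"] += 1
    # Every entry in out has count >= 1, so the division is safe
    for stats in out.values():
        stats["average"] = stats["total"] / stats["count"]
    return dict(out)


# Made-up sample data: (record id, gcnorm) pairs that fell inside one chunk
sample_records = [("r1", 10.0), ("r2", 30.0), ("r3", 5.0)]
sample_ids_at_dates = {"day_a": {"r1", "r2"}, "day_b": {"r3"}}
chunk_stats = average_by_date(sample_records, sample_ids_at_dates)
print(chunk_stats["day_a"])  # {'total': 40.0, 'count': 2, 'average': 20.0}
```

Inverting the mapping up front is one design choice among several; it trades a little memory for a single dictionary lookup per record instead of a membership test against every date.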

In [None]:
from sina.utils import ScalarRange
from collections import defaultdict
import random
import matplotlib.pyplot as plt


# First, we figure out which record ids are associated with which dates
records_at_dates = {}
for date in dates:
    relationships = relationship_handler.get(subject_id=date, predicate="contains")
    records_at_dates[date] = {str(x.object_id) for x in relationships}
    
# Jupyter sometimes has an issue with the first call to plt.show(), so we make a dummy call
plt.show()
    
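def find_in_range(lat_start, long_start, lat_dec, long_inc):
    """
    Return all Records in a coordinate rectangle.

    Latitude covers [lat_start - lat_dec, lat_start); longitude covers [long_start, long_start + long_inc).
    """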
    latitude_req = ScalarRange(name="latitude",
                               min=lat_start - lat_dec,
                               min_inclusive=True,
                               max=lat_start,
                               max_inclusive=False)
    longitude_req = ScalarRange(name="longitude",
                                min=long_start,
                                min_inclusive=True,
                                max=long_start + long_inc,
                                max_inclusive=False)
    return record_handler.get_given_scalars((latitude_req,
                                             longitude_req))
    
    
def calculate_chunk(lat_start, long_start, lat_dec, long_inc):
    """
    Calculate the average gcnorm and sample count for the chunk on each day in records_at_dates.
    
    Returns a dictionary mapping each day to the chunk's total gcnorm, sample count, and average gcnorm
    """
    # print("Starting to find records")
    records = find_in_range(lat_start, long_start, lat_dec, long_inc)
    # print("Finding scalars")
    out = defaultdict(lambda: {"total":0.0, "count":0, "average":0.0})
    for record in records:
        for date in records_at_dates:
            if record.id in records_at_dates[date]:
                out[date]["total"] += record.data["gcnorm"]["value"]
                out[date]["count"] += 1
                break
    for date in records_at_dates:
        if out[date]["count"] > 0:
            out[date]["average"] = out[date]["total"]/(out[date]["count"])
    return out

def calculate_chunk_fake(lat_start, long_start, lat_dec, long_inc):
    """
    Stand-in for calculate_chunk that skips the database and returns placeholder values.
    
    Useful for exercising the plotting code quickly. Returns a dictionary mapping each day in
    records_at_dates to a fixed total and count and a random average.
    """
    out = {}
    for date in records_at_dates:
        out[date] = {"total": 120.0, "count": 10, "average": float(random.randint(1, 100))}
    return out


print("Functions loaded and date mappings built!")

Create the Graphs
===============
Now we divide up based on the number of chunks and collect the information for each chunk independently. Since we're only *reading* the underlying database, this could, in theory, be parallelized.
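
As a rough illustration of the parallel idea, the sketch below fans chunk offsets out to a thread pool from Python's standard library. `process_chunk` is a hypothetical stand-in that returns a fake average instead of querying the database. A real version would likely need one database connection per worker, and whether Sina's handlers can be shared between threads is an open assumption here.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(offsets):
    """Hypothetical stand-in for a per-chunk query; returns the offsets and a fake average."""
    x_offset, y_offset = offsets
    return offsets, float(x_offset + y_offset)


def run_parallel(chunks_per_side, max_workers=4):
    """Process every chunk of the grid concurrently and collect results by offset."""
    all_offsets = [(x, y) for x in range(chunks_per_side) for y in range(chunks_per_side)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, though results are keyed by offset anyway
        return dict(pool.map(process_chunk, all_offsets))


results = run_parallel(chunks_per_side=3)
print(len(results))      # 9
print(results[(2, 1)])   # 3.0
```

Because each chunk writes to its own grid cell, the results can be merged without coordination once all workers finish.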

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import clear_output, display

x_increment = (X_COORDS[1]-X_COORDS[0])/CHUNKS_PER_SIDE
x_range = [x_increment*offset+X_COORDS[0] for offset in range(CHUNKS_PER_SIDE)]
y_decrement = (Y_COORDS[1]-Y_COORDS[0])/CHUNKS_PER_SIDE
# y_range holds each row's lower latitude bound, ordered from the top (northernmost) row down
y_range = list(reversed([y_decrement*offset+Y_COORDS[0] for offset in range(CHUNKS_PER_SIDE)]))

def gen_plot():
    avgs_at_date = defaultdict(lambda: np.zeros((CHUNKS_PER_SIDE, CHUNKS_PER_SIDE)))
    counts_at_date = defaultdict(lambda: np.zeros((CHUNKS_PER_SIDE, CHUNKS_PER_SIDE)))
    graphs_at_date = defaultdict(dict)       
    chunks_completed = 0
    for x_offset, x_coord in enumerate(x_range):
        for y_offset, y_coord in enumerate(y_range):
            norms_at_dates = calculate_chunk(lat_start = (y_coord + y_decrement), long_start = x_coord,
                                             lat_dec = y_decrement, long_inc = x_increment)
            chunks_completed += 1
            progress = ("Progress: {}/{}, finished ([{},{}), [{}, {}))".format(chunks_completed,
                                                                         CHUNKS_PER_SIDE**2,
                                                                         x_coord,
                                                                         x_coord + x_increment,
                                                                         y_coord,
                                                                         y_coord + y_decrement))
            clear_output(wait=True)
            display(progress)
            for date in norms_at_dates:
                # The y and x offsets are essentially coordinate positions here. So y is "amount down"
                avgs_at_date[date][y_offset, x_offset] = (norms_at_dates[date].get("average"))
                counts_at_date[date][y_offset, x_offset] = (norms_at_dates[date]["count"])
    print("Creating graph...")
    for date in dates:
        fig, ax = plt.subplots(figsize=(7,7))
        plt.xlabel('Longitude')
        plt.ylabel('Latitude')
        heatmap_avg = ax.imshow(avgs_at_date[date])
        plt.colorbar(heatmap_avg, label="gcnorm")
        plt.title("Fukushima Radiation: Flight {}".format(date))
        scatter = ax.scatter(x=norm_x,
                             y=norm_y,
                             s=AREA_CITIES,
                             c=COLOR_CITIES)
        for x_coord, y_coord, city_info in zip(norm_x, norm_y, CITIES):
            _, desc, alignment = city_info
            ax.annotate(desc, (x_coord, y_coord), va="center", ha=alignment, color=COLOR_CITIES)
        # Matplotlib places ticks at the cell centers, so we label each tick with its cell's center coordinate
        ax.set_xticks(range(len(x_range)))
        ax.set_xticklabels(["{:.3f}".format(x + x_increment/2) for x in x_range])
        ax.set_yticks(range(len(y_range)))
        # Each y in y_range is the lower latitude bound of its row, so the center is half a step above it
        ax.set_yticklabels(["{:.3f}".format(y + y_decrement/2) for y in y_range])
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
                 rotation_mode="anchor")
        
        plt.show()
    
gen_plot()
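
The per-chunk counts collected in `counts_at_date` are never plotted above. One possible use, sketched here under assumptions, is to hide cells whose average rests on too few samples. The helper `mask_sparse_cells`, the threshold, and the sample arrays are all hypothetical.

```python
import numpy as np


def mask_sparse_cells(averages, counts, min_count):
    """Return a masked array hiding cells with fewer than min_count samples."""
    return np.ma.masked_where(counts < min_count, averages)


# Made-up 2x2 grid for illustration
averages = np.array([[10.0, 20.0], [30.0, 40.0]])
counts = np.array([[5, 0], [1, 12]])
masked = mask_sparse_cells(averages, counts, min_count=2)
print(masked.mask.tolist())  # [[False, True], [True, False]]
print(masked.mean())         # 25.0
```

Passing such a masked array to `ax.imshow` in place of the raw averages should leave the masked cells undrawn, which separates "no reliable data" from "low reading" on the heatmap.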