Fukushima Heatmap: Subsecting Data
===============================
In some cases, the amount of data available is too much to graph all at once. While the Fukushima set is small enough to fit comfortably in memory, we can still use it to showcase some techniques for handling much larger sets. In this case, we will still process the entire set of data, but do so in coordinate cells; each cell is averaged to create a heatmap. You can configure the number of cells created; the more there are, the less data is held in memory at a time, but the more queries are done overall.


Setting number of cells and opening a connection
-----------------------------------------------------------

As mentioned above, we choose a number of cells that's a reasonable compromise between the amount of data held in memory and the number of queries required (a higher number of cells also naturally gives a graph with greater fidelity). We then open a connection to our database of interest and find which dates are available to us. We'll track data from each date separately.

In [None]:
from collections import defaultdict
from IPython.display import clear_output, display
import matplotlib.pyplot as plt
from matplotlib import patheffects
import numpy as np
import sina.datastores.sql as sina_sql
from sina.utils import get_example_path, DataRange

# Number of cells along a side; this number squared is the total number of cells
CELLS_PER_SIDE = 12

database = get_example_path('fukushima/data.sqlite')
print("Using database {}".format(database))

# Identify the coordinates, label, and label orientation for the power
# plant and selected cities as points of reference.
CITIES = [  # (lon, lat), desc, horizontal alignment
    [(141.0281, 37.4213), '  Daiichi Nuclear Power Plant', 'left'],
    [(141.0125, 37.4492), 'Futaba ', 'right'],
    [(141.0000, 37.4833), ' Namie', 'left'],
    [(140.9836, 37.4044), ' Okuma ', 'right'],
    [(141.0088, 37.3454), ' Tomioka', 'left']]

# The coordinates our analysis will cover
X_COORDS = (140.9, 141.3)
Y_COORDS = (37.0, 37.83)

# The city coordinates need to be normalized to our grid (whose size depends on CELLS_PER_SIDE).
# imshow() centers cell i on coordinate i (spanning i - 0.5 to i + 0.5), hence the 0.5 shift.
norm_x = [CELLS_PER_SIDE * ((c[0][0] - X_COORDS[0]) / (X_COORDS[1] - X_COORDS[0])) - 0.5 for c in CITIES]
norm_y = [CELLS_PER_SIDE * ((c[0][1] - Y_COORDS[0]) / (Y_COORDS[1] - Y_COORDS[0])) - 0.5 for c in CITIES]

# Create the data access object factory.
factory = sina_sql.DAOFactory(database)
record_handler = factory.createRecordDAO()
relationship_handler = factory.createRelationshipDAO()

# Get the ids of the experiments (which are their dates)
dates = list(record_handler.get_all_of_type("exp", ids_only=True))
print('Database has the following dates available: {}'.format(', '.join(dates)))

Filter the Data: Filtering Logic
========================
We subdivide our coordinate range (latitude 37.0-37.83, longitude 140.9-141.3) into $cells\_per\_side^2$ regions and find the records whose coordinates fall within each region. We separate these by the day each Record is associated with. We then take each Record's gcnorm (counts per second) and average within the cell to get that cell's average for the day. We also track the number of records per cell per day, which gives a rough sense of how much confidence to place in that average.
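
The binning and averaging described above can be sketched without a database. This is only an illustration under assumptions: the helper name `average_by_cell` and the sample points are made up, and unlike the notebook (which queries one cell at a time), the sketch holds all points in memory at once.

```python
import numpy as np


def average_by_cell(lons, lats, values, x_bounds, y_bounds, cells_per_side):
    """Average `values` within each cell of a square grid; also return counts."""
    totals = np.zeros((cells_per_side, cells_per_side))
    counts = np.zeros((cells_per_side, cells_per_side), dtype=int)
    x_inc = (x_bounds[1] - x_bounds[0]) / cells_per_side
    y_inc = (y_bounds[1] - y_bounds[0]) / cells_per_side
    for lon, lat, value in zip(lons, lats, values):
        col = int((lon - x_bounds[0]) // x_inc)
        row = int((lat - y_bounds[0]) // y_inc)
        # Skip points that fall outside the grid (upper bounds are exclusive)
        if 0 <= col < cells_per_side and 0 <= row < cells_per_side:
            totals[row, col] += value
            counts[row, col] += 1
    # Cells with no samples keep an average of 0.0
    averages = np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
    return averages, counts


# Made-up sample points (not taken from the Fukushima data)
lons = [140.95, 140.96, 141.25]
lats = [37.10, 37.12, 37.80]
values = [10.0, 20.0, 5.0]
averages, counts = average_by_cell(lons, lats, values,
                                   (140.9, 141.3), (37.0, 37.83), 2)
print(averages)
print(counts)
```

Rows correspond to latitude and columns to longitude, matching the `[y_offset, x_offset]` indexing used later in the notebook.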

The next code cell defines the function we need and does a bit of preprocessing (mapping record ids to dates). The function itself will be called once it's time to create the graph.

In [None]:
# First, we figure out which record ids are associated with which dates
records_at_dates = {}
for date in dates:
    records_at_dates[date] = set([str(x.object_id) for x in relationship_handler.get(subject_id=date,
                                                                                     predicate="contains")])

# Jupyter sometimes has an issue with the first call to plt.show(), so we make a dummy call
plt.show()

def calculate_cell(lat_min, lat_max, long_min, long_max):
    """
    Calculate the avg_gcnorm and count for the cell across each day in records_at_dates.

    :param lat_min: The minimum latitude of this cell (inclusive)
    :param lat_max: The maximum latitude of this cell (exclusive)
    :param long_min: The minimum longitude of this cell (inclusive)
    :param long_max: The maximum longitude of this cell (exclusive)

    :returns: a dictionary mapping each day to the total gcnorm, average gcnorm,
              and number of samples in this cell
    """
    record_ids = list(record_handler.data_query(latitude=DataRange(lat_min, lat_max),
                                                longitude=DataRange(long_min, long_max)))
    
    data = record_handler.get_data_for_records(record_ids, ["gcnorm"])
    out = defaultdict(lambda: {"total": 0.0, "count": 0, "average": 0.0})
    for record_id in record_ids:  # avoid shadowing the builtin id()
        for date in records_at_dates:
            if record_id in records_at_dates[date]:
                out[date]["total"] += data[record_id]["gcnorm"]["value"]
                out[date]["count"] += 1
                break
    for date in records_at_dates:
        if out[date]["count"] > 0:
            out[date]["average"] = out[date]["total"] / (out[date]["count"])
    return out


print("Functions loaded and date mappings built!")

Create the Graphs
===============
Now we divide the coordinate range into cells and collect the information for each cell independently. Since we're only *reading* the underlying database, this could, in theory, be parallelized. Generating the data may take some time; see the progress indicator beneath the code for an idea of how much is left.
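
As a rough idea of what parallelizing the per-cell work might look like, here is a minimal sketch using the standard library's thread pool. It is not part of the notebook: `fake_calculate_cell` is a hypothetical stand-in for the real per-cell query, and whether a single database connection can safely be shared between workers is an assumption that would need checking (each worker may need its own connection).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

CELLS = 3  # hypothetical small grid, for illustration only


def fake_calculate_cell(x_offset, y_offset):
    """Stand-in for a per-cell query; returns the offsets and a made-up average."""
    return (x_offset, y_offset), float(x_offset + y_offset)


def run_cells_in_parallel(cells_per_side, max_workers=4):
    """Run one task per cell and collect results keyed by (x_offset, y_offset)."""
    offsets = list(product(range(cells_per_side), repeat=2))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in submission order, regardless of finish order
        for key, average in pool.map(lambda o: fake_calculate_cell(*o), offsets):
            results[key] = average
    return results


results = run_cells_in_parallel(CELLS)
print(len(results))
```

Because each cell writes to its own key, the workers never contend over the same entry; the same would hold for writing to distinct `[y_offset, x_offset]` positions of an array.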

In [None]:
x_increment = (X_COORDS[1] - X_COORDS[0]) / CELLS_PER_SIDE
x_range = [x_increment * offset + X_COORDS[0] for offset in range(CELLS_PER_SIDE)]
y_increment = (Y_COORDS[1] - Y_COORDS[0]) / CELLS_PER_SIDE
y_range = [y_increment * offset + Y_COORDS[0] for offset in range(CELLS_PER_SIDE)]

avgs_at_date = defaultdict(lambda: np.zeros((CELLS_PER_SIDE, CELLS_PER_SIDE)))
counts_at_date = defaultdict(lambda: np.zeros((CELLS_PER_SIDE, CELLS_PER_SIDE)))


# This may take a while! (around a minute for a 12*12 map)
def gen_data():
    """Calculate the per-cell data used by the plot, reporting progress along the way."""
    cells_completed = 0
    for x_offset, x_coord in enumerate(x_range):
        for y_offset, y_coord in enumerate(y_range):
            norms_at_dates = calculate_cell(lat_min=y_coord,
                                            lat_max=y_coord + y_increment,
                                            long_min=x_coord,
                                            long_max=x_coord + x_increment)
            cells_completed += 1
            if cells_completed % CELLS_PER_SIDE == 0:
                progress = ("Progress: {}/{}, finished ([{}, {}), [{}, {}))"
                            .format(cells_completed, CELLS_PER_SIDE ** 2,
                                    '{:.3f}'.format(x_coord),
                                    '{:.3f}'.format(x_coord + x_increment),
                                    '{:.3f}'.format(y_coord),
                                    '{:.3f}'.format(y_coord + y_increment)))
                clear_output(wait=True)
                display(progress)
            for date in norms_at_dates:
                avgs_at_date[date][y_offset, x_offset] = (norms_at_dates[date]["average"])
                counts_at_date[date][y_offset, x_offset] = (norms_at_dates[date]["count"])


gen_data()
print("All cells calculated! You can now generate the graph (next cell).")

Configuring and Displaying the Graph
--------------------------------------------

There's a fair bit of configuration that goes into how the heatmap is displayed. Feel free to tweak these settings to maximize how readable the data is for you personally. Once you're ready (and the cell above has completed), run this cell to display the graph! You can re-run this cell after tweaking the config options to re-create your graph relatively quickly.

In [None]:
# How the cities are marked. Font/marker color and size, label outline and size
COLOR_CITIES = 'white'
COLOR_OUTLINE = 'black'
SIZE_CITY_FONT = 14
SIZE_OUTLINE = 5
AREA_CITIES = 50

# Heatmap colormap, see https://matplotlib.org/users/colormaps.html#grayscale-conversion
COLORMAP = "plasma"


def create_graph():
    """Configure and display the graph itself. Dependent on the data from gen_data()."""
    for date in dates:
        fig, ax = plt.subplots(figsize=(9, 9))
        plt.xlabel('Longitude')
        plt.ylabel('Latitude')
        heatmap_avg = ax.imshow(avgs_at_date[date], origin='lower', cmap=COLORMAP)
        plt.colorbar(heatmap_avg, label="Counts Per Second")
        plt.title("Fukushima Radiation: Flight {}".format(date))
        _ = ax.scatter(x=norm_x,
                       y=norm_y,
                       s=AREA_CITIES,
                       c=COLOR_CITIES,
                       linewidths=SIZE_OUTLINE / 2,  # Correction to be around same size as font outline
                       edgecolor=COLOR_OUTLINE)
        for x_coord, y_coord, city_info in zip(norm_x, norm_y, CITIES):
            _, desc, alignment = city_info
            text = ax.text(x_coord, y_coord, desc,
                           va="center", ha=alignment,
                           size=SIZE_CITY_FONT, color=COLOR_CITIES)
            text.set_path_effects([patheffects.withStroke(linewidth=SIZE_OUTLINE,
                                                          foreground=COLOR_OUTLINE)])

        # Matplotlib labels the boxes themselves, rather than their
        # borders/the origins, so we need to calculate the centers
        ax.set_xticks(range(len(x_range)))
        ax.set_xticklabels(('{:.3f}'.format(x + x_increment / 2) for x in x_range))
        ax.set_yticks(range(len(y_range)))
        ax.set_yticklabels(('{:.3f}'.format(y + y_increment / 2) for y in y_range))
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
                 rotation_mode="anchor")
        plt.show()


create_graph()
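
The per-cell counts gathered earlier are not used by `create_graph()`. One possible use, sketched here under assumptions, is to mask out cells with too few samples so they are left blank instead of being drawn as an average of 0. The threshold `MIN_SAMPLES` and the helper `mask_sparse_cells` are hypothetical names, and the arrays are made-up stand-ins for `avgs_at_date[date]` and `counts_at_date[date]`.

```python
import numpy as np

MIN_SAMPLES = 1  # hypothetical threshold; raise it to demand more samples per cell


def mask_sparse_cells(averages, counts, min_samples=MIN_SAMPLES):
    """Return a masked array hiding cells that have fewer than min_samples records."""
    return np.ma.masked_where(counts < min_samples, averages)


# Made-up stand-ins for one date's averages and counts
averages = np.array([[15.0, 0.0], [0.0, 5.0]])
counts = np.array([[2, 0], [0, 1]])
masked = mask_sparse_cells(averages, counts)
print(masked)
```

The masked array could then be passed to `ax.imshow()` in place of `avgs_at_date[date]`; matplotlib draws masked entries using the colormap's "bad" color rather than the color for 0.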