Post-Processing with Sina
=====================

This notebook demonstrates simple post-processing with Sina, such as adding derived quantities to a simulation run.

Post-processing a simulation run's output file
------------------------------

Simulations commonly output sina.json files that contain a single Record and no Relationships; that is, the file holds exactly one simulation run. We often want to attach a bit more data to that run. This cell demonstrates how to read, process, and dump Sina data without ever interacting with a datastore.
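
To make the file layout concrete, here is a minimal sketch that inspects the same kind of document using only Python's standard `json` module. It assumes the layout of the sample data used in this notebook (a top-level `records` list and `relationships` list, with each datum stored as a dict holding a `value` and optional `units`). It is not a substitute for Sina's own loaders, and the variable names are illustrative.

```python
import json

# A trimmed copy of the sample "simulation run" used in this notebook.
raw = """{"records": [{"id": "sample_rec", "type": "sample",
          "data": {"runtime": {"value": 10.5, "units": "s"},
                   "volume": {"value": [5, 6, 7, 12]}}}],
          "relationships": []}"""

doc = json.loads(raw)

# A typical single-run file holds exactly one Record and no Relationships.
assert len(doc["records"]) == 1 and not doc["relationships"]
record = doc["records"][0]

# Each datum is a small dict: a "value", plus optional keys such as "units".
for name, datum in sorted(record["data"].items()):
    print(name, datum["value"], datum.get("units", "(no units)"))
```

Working through Sina instead of raw JSON, as the cell below does, means the Record object handles this structure for you.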

NOTE: **We're creating real files here!** Change paths at your peril.

In [None]:
import sina
import sina.utils

# We'll dump sample data here to create our input file. This JSON is our "simulation run".
source_path = "my_sample_simulation.json"
# This is the stringified form of what we'll be dumping, a mockup of some minimal
# simulation data. Don't worry about reading it; we'll have it in a clearer form shortly.
data_to_dump = """{"records": [{"id": "sample_rec", "type": "sample", "data": {"runtime": {"value": 10.5, "units": "s"}, "user": {"value": "Anna Bob"}, "volume": {"value": [5, 6, 7, 12]}, "mass": {"value": [0.5, 0.5, 0.5, 0.5]}}}], "relationships": []}"""
with open(source_path, 'w') as f:
    f.write(data_to_dump)

# We'll write our data to here. In a real workflow, we may want to write back to the
# source path, or we may want to keep things "versioned", especially if we're still
# developing a workflow. We do the latter here. Use whatever works best for you!
dest_path = "my_sample_simulation_post.json"


# First, we load our "simulation" output JSON
rec = sina.utils.load_sole_record(source_path)


#################################
#          Adding Data          #
#################################

# Now our simulation data is available as a Python object!
# Let's add that we're post-processing it, just in case
rec.add_data("was_post_processed", True)

# Derived quantities are a common case for post-processing.
# Let's calculate density and add it to our record.
rec.add_data("density", [x / y for x, y in zip(rec.data_values["mass"],
                                               rec.data_values["volume"])])


#################################
#         Updating Data         #
#################################

# We can update existing data too. Let's add a tag.
rec.set_data("user", rec.data_values["user"], tags=["metadata"])

# Or how about a unit conversion? Our code output seconds, but we want milliseconds
rec.set_data("runtime", rec.data_values["runtime"] * 1000, units="ms")


#################################
#          Saving Data          #
#################################

# We're not quite sure about the changes we're making, so we'll save this as
# a different record entirely. You don't always want to do this (e.g. if you
# have both of these in a datastore and make a scatterplot on, say, mass,
# it'll find 2x the records to plot), but it's useful in dev.
original_rec_id = rec.id
rec.id = original_rec_id + "_post"

# Now let's dump this back to the filesystem!
rec.to_file(dest_path)


#################################
#           Verifying           #
#################################

# So, how did we measure up? Let's put both of those in a temporary datastore.
ds = sina.connect()
ds.records.insert([sina.utils.load_sole_record(source_path),
                   sina.utils.load_sole_record(dest_path)])

# Queries only return runs that contain the data we want, so we'll only get the edited one
print("Post-processed record: {}".format(next(ds.records.find_with_data(was_post_processed=True))))

# And of course, our old record still has its pre-edit units, while the new one uses ms
print("Units for {}'s runtime: {}"
      .format(original_rec_id,
              ds.records.get(original_rec_id).data["runtime"]["units"]))
print("Units for {}'s runtime: {}"
      .format(original_rec_id + "_post",
              ds.records.get(original_rec_id + "_post").data["runtime"]["units"]))
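
The density calculation in the cell above relies on `zip`, which silently stops at the shorter of its inputs. If that is a concern, a small guard can check lengths before dividing. The `elementwise_ratio` helper below is hypothetical (it is not part of Sina) and works on plain lists, so its result could be passed to `rec.add_data` just like the list comprehension above.

```python
def elementwise_ratio(numerators, denominators):
    """Divide two equal-length sequences element by element.

    Hypothetical helper for illustration; raises ValueError rather than
    silently truncating when the lengths differ.
    """
    if len(numerators) != len(denominators):
        raise ValueError("length mismatch: {} vs {}"
                         .format(len(numerators), len(denominators)))
    return [n / d for n, d in zip(numerators, denominators)]


# Same values as the sample record's "mass" and "volume" data.
mass = [0.5, 0.5, 0.5, 0.5]
volume = [5, 6, 7, 12]
density = elementwise_ratio(mass, volume)
print(density)
```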

Editing Records within the Datastore
--------------

Doing things through the datastore is much the same, just without file dumping. We'll keep using our records from above.

**Technical note**: if your datastore is backed by MySQL, multiple people can edit the database at once. This is very useful for any workflow where things are added to the database automatically. It's not something you have to worry about with a personal store, but keep it in mind if several people share, for example, a SQLite file (which is not safe for concurrent edits!).

In [None]:
# This object matches the state we left `rec` in at the end of the last cell
post_processed_rec = ds.records.get(original_rec_id + "_post")

# That said, it's a new object! Edits made here aren't reflected in `rec`!
post_processed_rec.add_data("favorite_color", "green")

# Edits won't be reflected in the database until we tell it to save our changes.
# As such, we shouldn't find any matches yet...
print("Records whose favorite color is green: {} (expect an empty list)"
      .format(list(ds.records.find_with_data(favorite_color="green"))))

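# Edits to existing values also stay local until we push them.
# set_data overwrites the whole entry, so we pass the units again to keep them.
post_processed_rec.set_data("runtime",
                            post_processed_rec.data_values["runtime"] + 100,
                            units="ms")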

# Ready to go! We push our changes. Now we'll see a match!
ds.records.update(post_processed_rec)
print("Update pushed!")
print("Records whose favorite color is green: {} (expect one)"
      .format(list(ds.records.find_with_data(favorite_color="green"))))
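
The cell above depends on one idea: a record fetched from the datastore is an independent copy, and local edits only become visible to queries after an explicit `update`. The toy class below is a hypothetical stand-in written for this illustration (it is not how Sina is implemented, and `ToyStore` is an invented name) that mimics that behavior with a plain dictionary.

```python
import copy


class ToyStore:
    """Hypothetical in-memory store that hands out copies of its records."""

    def __init__(self):
        self._records = {}

    def insert(self, rec_id, data):
        self._records[rec_id] = copy.deepcopy(data)

    def get(self, rec_id):
        # Return a copy so that local edits do not touch the stored version.
        return copy.deepcopy(self._records[rec_id])

    def update(self, rec_id, data):
        # Overwrite the stored version with the caller's edited copy.
        self._records[rec_id] = copy.deepcopy(data)

    def find_with_data(self, **criteria):
        # IDs of records whose data match every name=value pair given.
        return [rec_id for rec_id, data in self._records.items()
                if all(data.get(k) == v for k, v in criteria.items())]


store = ToyStore()
store.insert("sample_rec_post", {"runtime": 10500.0})

local = store.get("sample_rec_post")
local["favorite_color"] = "green"
before = store.find_with_data(favorite_color="green")  # edit is still local

store.update("sample_rec_post", local)
after = store.find_with_data(favorite_color="green")   # edit is now stored
print(before, after)
```

The same fetch, edit, push sequence applies to the real datastore: nothing changes for other readers until `ds.records.update` is called.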

Both styles of post-processing operate on the same basic concepts. Use whichever fits your workflow! In fact, it's possible to use Sina without any JSON files at all, instead creating Record objects and inserting them straight into the database (as we saw in the [basic usage tutorial](basic_usage.html)).