# VNIRSD admin programming manual and recipes
----
### michael st. clair --  v0.1 -- 02-23-2021
*This is a preliminary version of this document intended for internal
operations. Please do not publicly distribute.*

### general notes on usage
----
* many of these cells create huge amounts of output. to collapse a cell's output if you're tired of looking at it, double-click on the gutter to the left of the cell. to totally erase it, go up to the 'cell' menu and choose 'current outputs' -> 'clear', or 'all outputs' -> 'clear' to get rid of output for every cell.

### imports
----
Run these if you want the code to function.

In [None]:
from ast import literal_eval
from functools import partial
import json
from operator import or_
import os
import random

import numpy as np
import pandas as pd

from recipes import samples
from mars.models import Database, Sample
from mars.dj_utils import are_in, djget, eta, fields

# you're not actually doing anything scary and asynchronous to the database.
# however, ipython/jupyter wraps itself in an event loop that looks scary
# to django.
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"

# I: database structure
----

## I.1: where it lives

The VNIRSD is backed by a SQLite database. This database is entirely contained
in one file: db.sqlite3. **Keep several backups of this file outside of the
working tree of the application.** This lets you freely experiment with the
database. If you do anything horrible to it, you can immediately repair it by
overwriting the file in the application directory with one of these backups.
You can even do this while the application is running. Users will only notice
if they make queries while the file is damaged or in the middle of being
overwritten. The only entry-specific items that are not stored in this database
are image files. The database stores links to the images but not their binary
content, because filesystems are better at storing files than databases are.
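
A minimal sketch of one way to take those backups, assuming the database file is named db.sqlite3 as described above. The function name `backup_database` and the backup directory are hypothetical. `sqlite3.Connection.backup` is part of the Python standard library and copies a consistent snapshot even while another process has the file open.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_database(db_path, backup_dir):
    """Copy a SQLite database into backup_dir under a timestamped name."""
    db_path = Path(db_path)
    if not db_path.is_file():
        # sqlite3.connect would silently create an empty database otherwise
        raise FileNotFoundError(db_path)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = backup_dir / f"{db_path.stem}_{stamp}{db_path.suffix}"
    source = sqlite3.connect(db_path)
    destination = sqlite3.connect(target)
    try:
        # page-by-page copy of the whole database into the new file
        source.backup(destination)
    finally:
        destination.close()
        source.close()
    return target
```

Point `backup_dir` at a directory outside the working tree of the application. To roll back, overwrite db.sqlite3 in the application directory with the backup you want.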

## I.2: django and models

The VNIRSD primarily uses the Python framework Django to interact with the
database. Django represents each SQL table as a subclass of the class
```Model```, and each row of that table as an instance of the subclass.
There are five important models / tables in the VNIRSD proper:

* ```Sample``` (individual samples)
* ```Database``` (origin databases, like ASTER/ECOSTRESS)
* ```FilterSet``` (definitions for sets of filters, like Mastcam-Z's, used
    for generating simulated reflectance curves)
* ```Library``` (application- or team-specific groups of samples, and maybe
    other things later -- this is fully functional, but not currently populated)
* ```SampleType``` (top-level physical categories of sample, like minerals or
    coatings -- this is again fully functional, but only skeletally populated)

*Note: while you probably don't want to interact with them from the Python
shell, admin models, including users and their access information, are also
stored in the database. So, for instance, if you roll back to an earlier
version of the database after changing a user's password but before making a
new database backup, that password will be reset to the earlier version.*

# II: searching the database
----

## II.1: searching the database


### II.1.a: custom search functions

Django uses ```QuerySet``` objects and methods to interact with ```Model```
objects. These are powerful but often syntactically awkward (and un-Pythonic).
The next cell defines a simpler search function ```samples``` that looks for
samples that contain a particular value anywhere in a particular field, case-
insensitive.
 
*Note: ```samples``` is also included in the recipes.py module, but
manipulating the cell below will allow you to define different versions of
it.*

The syntax is simply: ```samples(value, field)```; it returns a ```QuerySet```
object (which can mostly be treated as a ```list``` with special
functionality) of all samples containing that value in that field.

Other useful values for ```querytype``` (set in the cell below, so you can use
it to define cousins of ```samples```) include 'lt' or 'gt' (less / greater
than) and 'iexact' (exact match). Dropping the leading 'i' from 'icontains' or
'iexact' makes the search case-sensitive.
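
In vanilla Django, the field name and the query type are joined with a double underscore to form the keyword passed to ```filter()```, as in ```sample_name__icontains```. The sketch below shows that naming scheme in plain Python. `lookup_kwargs` is a hypothetical helper for illustration only; it is not part of `mars.dj_utils`, and `djget` may be implemented differently.

```python
def lookup_kwargs(field, value, querytype="icontains"):
    """Build the keyword-argument dict for a Django-style filter() call."""
    return {f"{field}__{querytype}": value}


# keyword for a case-insensitive substring search on sample_name
print(lookup_kwargs("sample_name", "hematite"))
# related-model fields use the same double-underscore separator
print(lookup_kwargs("origin__name", "ECOSTRESS", "iexact"))
```

With Django loaded, the result can be unpacked into a query, for example `Sample.objects.filter(**lookup_kwargs("sample_name", "hematite"))`.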

The names Python prints for Sample objects in shell / Notebook are formatted
like this:

```sample name + _ + sample id (in database of origin) + _ + database-of-origin short name```

In [None]:
# define partially-evaluated convenience function
get_contains = partial(
    djget, 
    model=Sample, 
    value = "",
    # the field value is model-specific! you can omit it if you don't want to
    # use the shortened call types discussed in II.1.b
    field = "sample_name", 
    querytype='icontains'
)
# reorder arguments to prevent collisions
samples = eta(get_contains, "value", "field")

### II.1.b: fetch all hematites in the database and look at 5 of them

In [None]:
# note: the vanilla django equivalent to the next line is: 
# hematites = Sample.objects.filter(sample_name__icontains='hematite')
hematites = samples('hematite', 'sample_name')
random.choices(hematites, k=5)

### II.1.c: check total number of samples, or of a subset 

In [None]:
# samples() looks in the sample_name field by default.
# called with no arguments, it returns all values in the model.
# note: the vanilla django way to do that is to call Sample.objects.all().

len(samples("Smectite")), len(samples()), len(Sample.objects.all())

## II.2: fields and values of models

There are lots of ways to get fields and field values from the database. See
the next few cells for some simple ones.

### II.2.a: get every field of a model

In [None]:
fields(Sample), fields(Database)

### II.2.b: get values of a particular field from instances of a model

In [None]:
random_sample = random.choice(samples())
print(random_sample.sample_name)
smectites = samples("smectite")
print([
    sample.id for sample in smectites
])

### II.2.c: get unique values of a field, ordered alphabetically

In [None]:
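# flat=True returns bare values instead of 1-tuples
grain_sizes = sorted(
    size for size in set(samples().values_list('grain_size', flat=True))
    if size is not None  # None can't be sorted alongside strings
)
grain_sizes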

## II.3: related model fields

Depending on context, you can access fields of related objects from a different model by:
* using chained accessors (like: ```model.other_model.field```)
* separating the related field name and the field you want to access on the
    other model by a double underscore (like: ```samples(value, "other_model__field")```)

### II.3.a: learn about a sample's database of origin

In [None]:
random_sample = random.choice(samples())
print(random_sample.origin.name) # full name of that sample's database of origin
print(random_sample.origin.url) # url for that sample's database of origin
# is that sample in the group of all samples whose databases of origin have that 
# full name? (hopefully yes, or something is very wrong) 
print(random_sample in samples(random_sample.origin.name, "origin__name"))

## II.4: interpreting Sample model fields

There are a *lot* of fields on the ```Sample``` model, and most of them are
empty for most samples. This is because the table is intended to support
content ingested from a bunch of different databases, each of which has its
own metadata standard. So, for instance, while we'd like to retain information
about resolution if it's available in an input database, most of our input
databases don't provide resolution values in their metadata. The fields you
can expect to be on every or almost every sample are:
* sample_name (Name of the sample from the original database, like "Talc")
* sample_id (ID of the sample from the original database, retained for
    traceability)
* id (unique ID number in VNIRSD, also known as a database primary key or PK)
    * bear in mind that because a primary key is how a database distinguishes
    objects, changing a sample's id field makes it a whole new entry
* date_added (last modification date of the sample)
* min_reflectance (minimum wavelength in the reflectance array)
* max_reflectance (maximum wavelength in the reflectance array)
* origin (database of origin -- this is an instance of the ```Database```
    model)
* released (has the sample been released to the public?)
* reflectance (reflectance array flattened into a simple string) 
* simulated_spectra (dictionary of ```pandas DataFrames``` giving simulated
    reflectance arrays flattened into a json string)

### II.4.a: get a random sample and look at all its fields

You can use the ```as_dict()``` method of a ```Sample``` object to get most
things about it in a ```dict``` -- note that the flattened reflectance and
simulated_spectra fields aren't very readable! See the next few cells for
ways to interpret them as ```numpy``` arrays and ```pandas``` dataframes.

In [None]:
random_sample = random.choice(samples())
random_sample.as_dict()

### II.4.b: look at properties of that sample's reflectance data

In [None]:
reflectance = np.array(literal_eval(random_sample.reflectance))
print(reflectance[0:10, 0]) # first 10 wavelength values of spectrum
print(np.median(reflectance,0))  # median wavelength and reflectance of spectrum 
print(reflectance[:,1].mean()) # mean reflectance of spectrum

### II.4.c: look at a simulated spectrum for that sample

In [None]:
sim_landsat = pd.DataFrame(
    json.loads(literal_eval(random_sample.simulated_spectra)['LANDSAT 8 OLI'])
)
sim_landsat # dataframe containing simulated values for Landsat 8 OLI

# III: manipulating database entries
----
## III.1: field assignment and model entry updates

Similar methods can be used to modify entries in the database. The easiest way
is to assign values directly to fields of a model instance. This is useful if
you want to quickly modify items without using the admin console.
**Important:** updating the fields of a model instance in memory **does not**
automatically change the corresponding instance in the database. After
modifying a model instance, calling its ```clean()``` and ```save()``` methods
will validate its changed data and record the updated version in the database.
A few other things also happen only when you call ```save()```. For instance,
simulated spectra are generated at that point for samples, and if a model
instance doesn't have an id / primary key, it gets assigned one.

### III.1.a: change an in-memory sample without saving it

In [None]:
sample = samples()[0]
sample.sample_name = sample.sample_name + "_TEST"
print(sample.sample_name) # great! working great, right? the sample is updated!
sample = samples()[0]
print(sample.sample_name) # aww...no, the sample wasn't updated.

### III.1.b: make a test version of a sample and save it in the database

In [None]:
sample.sample_name = sample.sample_name + "_TEST"
# remember that changing id / primary key makes something a "new" object from the 
# database's perspective...give this an arbitrary, ridiculously large id so
# that we don't overwrite a real sample
sample.id = 10000000 
sample.released = False # don't show visitors our silly test sample
sample.clean() # validate sample fields
sample.save() # save it in the database
# is it there, and different from the original? hopefully.
samples()[0], samples(10000000, "id")[0] 

### III.1.c: modify and save the test sample

In [None]:
# saving a model instance _without_ changing its id modifies the existing entry
# rather than creating a new one.
test_sample = samples(10000000, "id")[0]
print(test_sample.sample_name)
test_sample.sample_name = "Terrible Rock"
test_sample.clean()
test_sample.save()
print(test_sample.sample_name)

## III.2: deleting model instances

You can delete a database entry simply by calling its ```delete()``` method.
Note that if other entries link to it -- for instance, if it is the database
of origin for many samples -- you won't be able to delete it while those other
entries still exist in the database.

### III.2.a: delete test sample

In [None]:
# it's probably better if we don't keep this Terrible Rock in the database 
# (see preceding section if you didn't make a Terrible Rock.)

terrible_rock = samples("Terrible Rock", "sample_name")[0]
terrible_rock.delete()
samples("Terrible Rock", "sample_name")

## III.3: bulk modification

These techniques can be combined with standard python control structures to
change many items at once. Most of these examples are 'disarmed', with their
```save``` or ```delete``` calls commented out. **Make sure you back the
database up first if you arm and run them!** Running these without
saving samples but leaving print() statements in acts as a 'dry run', and is
very useful to verify that your changes are good before you commit them.

### III.3.a: delete everything

Perhaps you want to delete every sample from the database, but you don't want
to dump the entire file -- for instance, you don't want to generate a new
password for every user. Here's a way to do that.

In [None]:
# back up first. really. not joking!

# for sample in samples():
#     sample.delete()

### III.3.b: reprocess every sample in the database

You might want to do this if you need to recalculate simulated spectra values
because you've added new filtersets, or if you suspect that some malformed
entries snuck in to the database and you'd like to reprocess entries one-by-one
to find them.

*Note: At current database size, assuming everything processes cleanly, this
will probably take between half an hour and an hour and a half depending on
operating environment. You might want to add a progress timer or something.
As written, the cell only validates each sample with ```clean()```; uncomment
the ```save()``` call to write samples back and regenerate their simulated
spectra. Doing so is mostly harmless -- if everything is ok with a sample, it
will just save it back to the database unchanged.*

In [None]:
for ix, sample in enumerate(samples()):
    # good to know in case it hits something bad and crashes -- 
    # you have the name, id, and index (list position) of the sample to investigate
    print(ix, sample.sample_name, sample.id) 
    sample.clean()
#     sample.save()
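
The progress timer suggested in the note above could look something like the following sketch. `with_progress` is a hypothetical helper, not part of the VNIRSD codebase, and the time estimate assumes that every item takes about as long as the ones already processed.

```python
import time


def with_progress(items, total, report_every=100):
    """Yield (index, item) pairs, printing elapsed time and a rough ETA."""
    start = time.monotonic()
    for ix, item in enumerate(items):
        if ix and ix % report_every == 0:
            elapsed = time.monotonic() - start
            # average time per finished item, scaled to the items left
            remaining = elapsed / ix * (total - ix)
            print(f"{ix}/{total} done, {elapsed:.0f}s elapsed, ~{remaining:.0f}s left")
        yield ix, item
```

In the reprocessing loop, this would replace `enumerate`: `for ix, sample in with_progress(samples(), len(samples())):`.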

### III.3.c: find all samples without a sample name and assign placeholders

In [None]:
for sample in samples("", querytype="iexact"):
    sample.sample_name = sample.sample_

### III.3.d: regularize unit names across the database

There are some samples in the database that give grain size in micrometers as 'um' and some that give it as 'microns'. Also, some samples have spaces between SI unit abbreviations and numerals, and some don't. Let's say you'd like to regularize this to always use 'um' for micrometers and also not have spaces between numerals and abbreviations. This replacement may be too crude, so we include print statements to see if it's good or not. *(TODO: If it is, move on to regex, maybe or maybe not beyond the scope of this manual. -- michael)*

In [None]:
has_si_units = are_in(["cm", "mm", "nm", "um"], or_)
for sample in samples():
    if not sample.grain_size:
        continue # don't bother doing anything if there's no grain size metadata
    # replace the plural first so that "microns" doesn't become "ums"
    sample.grain_size = (
        sample.grain_size.replace("microns", "um").replace("micron", "um")
    )
    # we don't want to remove spaces in phrases that don't contain si units
    if has_si_units(sample.grain_size):
        # strip padding, then remove the first remaining space
        sample.grain_size = sample.grain_size.strip().replace(" ", "", 1)
        print(sample.grain_size)
        # sample.clean()
        # sample.save()
    else:
        print("nope")
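
If the plain string replacement turns out to be too crude, a regex version might look like the sketch below. This is one possible approach under stated assumptions, not a tested recipe for the real data: `normalize_grain_size` is a hypothetical name, and the patterns only cover 'micron' / 'microns' and the four unit abbreviations used above.

```python
import re

# "micron" or "microns", any capitalization, not preceded by another letter
# (so "45microns" matches but "submicron" does not)
MICRON = re.compile(r"(?<![A-Za-z])microns?\b", re.IGNORECASE)
# whitespace sitting between a digit and one of the unit abbreviations
NUMERAL_UNIT_GAP = re.compile(r"(?<=\d)\s+(?=(?:cm|mm|nm|um)\b)")


def normalize_grain_size(text):
    """Return text with micrometers written as 'um' and no numeral-unit gap."""
    text = MICRON.sub("um", text.strip())
    return NUMERAL_UNIT_GAP.sub("", text)


for raw in ["45 microns", "<45 um", "45-90 micron", "fine sand"]:
    print(repr(raw), "->", repr(normalize_grain_size(raw)))
```

Unlike the `replace(" ", "", 1)` call, this removes only the whitespace directly between a numeral and a unit, so phrases like "fine sand" pass through unchanged. As with the cell above, print the results as a dry run before assigning them to `sample.grain_size` and saving.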

### III.3.e: mark every sample from a particular origin as released
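
One way to approach this, in the style of the other bulk recipes, is sketched below. It assumes that ```released``` is a boolean field, as the field list in II.4 suggests. `release_samples` and its `commit` flag are hypothetical names introduced here for illustration. With `commit=False` the function acts as a dry run: it changes the in-memory objects and prints them, but writes nothing to the database.

```python
def release_samples(sample_iterable, commit=False):
    """Mark unreleased samples as released; write to the database only if commit."""
    changed = []
    for sample in sample_iterable:
        if sample.released:
            continue  # already public, nothing to do
        sample.released = True
        print(sample.sample_name, sample.id)
        if commit:
            sample.clean()  # validate sample fields
            sample.save()  # record the change in the database
        changed.append(sample)
    return changed
```

For example, `release_samples(samples("ECOSTRESS", "origin__name"))` would list the samples that would be released for any database of origin whose full name contains "ECOSTRESS". Back the database up before running it again with `commit=True`.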

### III.3.f: assign all samples listed in an external file to a custom library

In [None]:
sample.filename

In [None]:
from mars.dj_utils import ingest_sample_csv

In [None]:
sample, filename, warnings, errors = ingest_sample_csv(
    '../wwu_spec_csvs/RLA_CARB_11_1.csv'
).values()
# active_sample.clean(warnings=upload_warnings)
# active_sample.min_reflectance
# active_sample.save(uploaded=True)

In [None]:
sample.reflectance

In [None]:
sample.save()

In [None]:
ingest_sample_csv(
    '../wwu_spec_csvs/RLA_CARB_11_1.csv'
)

In [None]:
djget(Sample, 'TEST', 'library__name')

In [None]:
for sample in Sample.objects.filter(library__name__icontains='TEST'):
    # an empty list removes the sample from every library
    sample.library.set([])
    sample.clean()
    sample.save()

In [None]:
Database.objects.all()