# Overview
This notebook provides the ability to generate random droplet parameters, write them to disk, and 
train a neural network with said droplet parameters to approximate the underlying ODEs that govern 
the droplet parameters.  Once the network is trained, researchers can generate a Fortran 90 module
that estimates droplet radius and temperature at some time in the future.

The intent is that a small, reasonably trained neural network can estimate droplet
characteristics accurately enough while being significantly faster than an iterative Gauss-Newton
technique.  Initial testing indicates that a small 4-layer network (roughly 2400 parameters), using
the Fortran 90 module generated by this notebook, is 30-90x faster than the existing (as of
2024/09/25) iterative approach, which results in roughly a 2x overall simulation speedup.

This notebook is broken down into the following sections:

1. ODEs of interest
2. Mapping data to/from $[-1, 1]$
3. Generating random droplets
4. Training a neural network
5. Analyzing a network's performance
6. Exporting a network to Fortran 90


# Setup
Since training a neural network to approximate ODEs has multiple workflows (e.g. training a model
vs loading previous models to analyze their performance) there are several variables that need to
be set to exercise all of this notebook's functionality.  In particular, the following should
be reviewed and set depending on which workflows are of interest:

- Loading a previously trained model: `load_model_flag`, `model_load_path`
- Training a new model: `train_model_flag`, `training_data_path`, `number_epochs`
- Saving a newly trained model: `save_model_path`

These should be set at the top of the notebook.  Please adhere to the modification instructions
where the variables are defined.

## Training Data
Training data are not included in the repository due to their size.  Creating data on the fly
using `create_training_file()` is possible but slow, since it is single threaded and this notebook
does not implement a parallel data generation process.

Those who have access to Dr. Richter's group network drive can download previously created 
data sets.  Users external to the group can either slowly create data using the tools in this 
notebook or use scripts exported from an earlier version of this notebook 
(`generate_training_data.py` and `loop-training-data.sh`) to generate data in bulk on
multiple cores.

## Python Dependencies
The following Python packages are needed to exercise full functionality in this notebook 
(versions tested in parentheses):

- Python3 (3.11.5)
- Matplotlib (3.7.1)
- NumPy (1.24.3)
- Pandas (1.5.3)
- PyTorch (2.3.0)

There is no fundamental dependence on any particular version of the dependencies and, barring
any bugs encountered with the packages' main APIs, newer versions are expected to work.
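
To compare a local environment against the tested versions above, the installed versions can be
listed with the standard library.  This is a minimal sketch: the distribution names are assumed
to match how the packages were installed, and `installed_versions()` is a hypothetical helper
that is not part of this notebook.

```python
import sys
from importlib import metadata

# Distribution names as commonly published (assumed; adjust if your
# environment installs them under different names).
PACKAGES = ["matplotlib", "numpy", "pandas", "torch"]

def installed_versions( packages ):
    """
    Returns a dictionary mapping each package name to its installed version
    string, or None when the package is not installed.
    """
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version( package )
        except metadata.PackageNotFoundError:
            versions[package] = None
    return versions

print( "Python {:d}.{:d}.{:d}".format( *sys.version_info[:3] ) )
for package, version in installed_versions( PACKAGES ).items():
    print( "{:<12s} {}".format( package, version or "not installed" ) )
```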

# Notebook Care and Feeding
Notebooks are great for exploring ideas and rapidly iterating to a solution.  They are less
great for maintaining a "production" workflow that needs to be used by multiple people
on a semi-regular basis.  As such, the following guidelines should be followed when making
updates to this notebook so as to preserve everyone's sanity:

- Do not commit the notebook with outputs from a previous run.  This greatly reduces the
  file size in the repository and avoids unnecessary changes when someone re-runs a cell
  whose output changes on each execution.
- The notebook should always execute all cells without errors.  Confirm this by restarting the
  kernel and running all cells (then restart and clear the output before committing).
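
For those who prefer to clear outputs from a script rather than the Jupyter interface, the sketch
below removes outputs and execution counts from a notebook's code cells.  It assumes the
notebook is stored in the nbformat 4 JSON layout (a top-level `cells` list), and both function
names are hypothetical helpers that are not part of this repository.

```python
import json

def clear_notebook_outputs( notebook ):
    """
    Removes outputs and execution counts from every code cell of a notebook,
    given as the dictionary parsed from an .ipynb (nbformat 4) file.
    Modifies the dictionary in place and returns it.
    """
    for cell in notebook.get( "cells", [] ):
        if cell.get( "cell_type" ) == "code":
            cell["outputs"]         = []
            cell["execution_count"] = None
    return notebook

def clear_notebook_file( notebook_path ):
    """Clears the outputs of the notebook at notebook_path, overwriting it."""
    with open( notebook_path, "r", encoding="utf-8" ) as notebook_file:
        notebook = json.load( notebook_file )

    clear_notebook_outputs( notebook )

    with open( notebook_path, "w", encoding="utf-8" ) as notebook_file:
        json.dump( notebook, notebook_file, indent=1, ensure_ascii=False )
        notebook_file.write( "\n" )
```

Calling `clear_notebook_file()` with the path to this notebook rewrites it in place, so it is
worth running only on a file that is already tracked in the repository.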

# Future Work

## Hyperparameter Search on the Neural Network's Architecture
The network architecture and hyperparameters were chosen because they:

1. Resulted in a small network
2. Had sizes that *should be* efficient to work with on the CPU

A very limited hyperparameter search has settled on parameters that appear to reliably train
performant networks, though the path from the starting point to the current configuration was
very ad hoc.  Fiddling with hyperparameters stopped as soon as a network was found that was
reasonably accurate (and fairly reproducible) and fast enough when implemented in Fortran.

A more thorough exploration of the following could be done in an attempt to generate a more accurate
neural network, or a smaller (faster) network so more particles could be simulated without impacting
simulation run-times.

Areas to explore include (roughly in priority order):

1. Number and size of MLP-layers
2. Learning rate and schedule
3. Weights regularization (e.g. L1 or L2 penalties)
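
A more systematic search could start by enumerating the combinations to train.  The sketch below
builds such a grid with the standard library.  The parameter names and candidate values are
hypothetical placeholders chosen for illustration, not configurations that have been tested.

```python
import itertools

# Hypothetical candidate values; none of these reflect tested configurations.
search_space = {
    "number_layers": [3, 4, 5],
    "layer_width":   [16, 32, 64],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "l2_penalty":    [0.0, 1e-6, 1e-4],
}

def enumerate_configurations( search_space ):
    """
    Returns a list of dictionaries, one per combination of the candidate
    values in search_space.
    """
    names = list( search_space.keys() )
    return [dict( zip( names, values ) )
            for values in itertools.product( *(search_space[name] for name in names) )]

configurations = enumerate_configurations( search_space )
print( "{:d} configurations to train.".format( len( configurations ) ) )
```

A full grid grows quickly with each added parameter, so the priority order above is a
reasonable guide for which axes to vary first.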

## Improving Neural Network Performance
There are a handful of non-architecture/hyperparameter-related things to explore to improve
performance.  Unless otherwise specified, these are all speculations on Greg's part and aren't
a sure bet.

### Train on Sequences of Integration Times for the Same Parameter
Currently training data 

### Normalize the Neural Network's Integration Time Parameter
Currently all of the neural network's inputs, except for integration time, are normalized into
the range $[-1, 1]$.  Integration time remains in the range of $(0, 10)$ with a focus on $[10^{-3}, 10]$
so as to cover both DNS and LES simulations.  The decision to leave integration time unnormalized
was by accident rather than a conscious choice.

That said, it is unknown whether this negatively impacts the performance of the model.  Since
DNS simulations focus on smaller time steps (in the range of $[10^{-3}, 10^{-1})$) and LES simulations on
larger ones (in the range of $(10^{-1}, 10)$), it may be worthwhile to apply the same log-scale normalization
to the integration time as is done for the droplet's radius and salinity.  The thought is that the
network would be less sensitive to the scale separation between DNS and LES time steps and learn
them equally well.  No quantitative analysis has been performed that would suggest this, however; it is
purely an unfounded hypothesis at this point.
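
A log-scale normalization of the integration time could look like the following sketch.  It
assumes the $[10^{-3}, 10]$ range mentioned above as the bounds.  The function names and the
`TIME_RANGE` constant are hypothetical and are not part of `droplet_approximation`.

```python
import math

# Assumed integration time bounds, taken from the [1e-3, 10] focus described
# above.  These are illustrative and not read from the notebook's ranges.
TIME_RANGE = (1.0e-3, 1.0e1)

def normalize_time( t, time_range=TIME_RANGE ):
    """Maps t in [t_min, t_max] onto [-1, 1] using a log10 scale."""
    log_min = math.log10( time_range[0] )
    log_max = math.log10( time_range[1] )
    return 2.0 * (math.log10( t ) - log_min) / (log_max - log_min) - 1.0

def scale_time( normalized_t, time_range=TIME_RANGE ):
    """Inverse of normalize_time(): maps [-1, 1] back to [t_min, t_max]."""
    log_min = math.log10( time_range[0] )
    log_max = math.log10( time_range[1] )
    return 10.0**((normalized_t + 1.0) / 2.0 * (log_max - log_min) + log_min)

for t in (1.0e-3, 1.0e-1, 1.0e1):
    print( "t = {:.3e} -> {:+.3f}".format( t, normalize_time( t ) ) )
```

With these bounds the DNS/LES boundary at $10^{-1}$ lands at the midpoint of the normalized
range, so each regime would occupy half of the network's input interval.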

In [None]:
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn

from droplet_approximation import *

In [None]:
#
# NOTE: Do not change this cell directly!  Add a cell below and set new values for the
#       variables of interest.  This makes it significantly easier to revert your changes
#       when committing changes to the repository - just delete the next cell.
#

# Change this to False if you've trained a model in this notebook and want to evaluate
# its performance rather than that of a pre-trained model from disk.
load_model_flag = True

# Path to the model, relative to this notebook, to load for performance analysis.
model_load_path = "../models/network_box_residual_l1_epoch_14.pth"

# Change this to True if you want to train a model from scratch (loading and continuing
# training is not supported yet).  Since generating sufficient training data is typically
# longer than most researchers are willing to wait (O(1 hour) minimum with multiple cores)
# the training_data_path variable must also be set.
train_model_flag = False

# Path to previously created droplet parameters.
#
# NOTE: You must update this to where you've created/copied the training data!
#
# training_data_path = "../data/time_log_spaced.data"
training_data_path = None

# Path to validation data used to evaluate model performance during training.
validation_data_path = None

# Number of times the model should see the entirety of the training data
# before stopping the optimization process.
number_epochs      = 10

# Name of the currently trained model.  This is used when writing the model's weights
# as a Fortran module.  Default to the name of the loaded model, sans extension, as
# that should match the default training configuration that generated it.
model_name = model_load_path.split( "/" )[-1].split( "." )[0]

# Path to the model, relative to this notebook, to save newly trained models.
#
# NOTE: We disable saving by default to avoid accidentally overwriting an existing model!
#
#model_save_path = "../models/mlp_4layer-b=1024-lr=1e-3_halvingschedule-l2reg=1e-6.pth"
model_save_path = None

# Prefix to generate model checkpoint pathnames from.  Default to the model's save
# path, sans extension.
model_checkpoint_prefix = None
if model_save_path is not None:
    model_checkpoint_prefix, _ = os.path.splitext( model_save_path )

# Change this if the droplet_model.f90 module should be created.  This will be
# generated from either 1) a newly trained model or 2) a previously trained, loaded
# model, in that order.
write_weights_flag = False

# Ranges on the droplet parameters.
#
# NOTE: These *must* match both the training/evaluation data as well as the model.
#
parameter_ranges = get_parameter_ranges()

In [None]:
# Example training configuration.  Uncomment the variable assignments below to use.
# This trains a model from data in the ../data/ directory and saves the weights in
# the ../models/ directory.  It also writes out a new droplet_model.f90 module since
# write_weights_flag is set.

train_model_flag        = True
training_data_path      = "../data/training/time_log_spaced.training_data"
validation_data_path    = None
model_save_path         = "../models/NEW-mlp_4layer-b=1024-lr=1e-3_halvingschedule-l2reg=1e-6-TEST.pth"
model_name              = model_save_path.split( "/" )[-1].split( "." )[0]
model_load_path         = model_save_path
write_weights_flag      = True
custom_parameter_ranges = {}

In [None]:
# Overlay the custom parameter ranges and set them for the remainder of the
# notebook.
parameter_ranges.update( custom_parameter_ranges )

set_parameter_ranges( parameter_ranges )

In [None]:
# Force NumPy to print more values on each line rather than wrapping
# around 80 characters.
np.set_printoptions( linewidth=120 )

# Differential Equations of Interest
Example code that solves for a droplet's radius and temperature given its current radius and temperature
as well as the environment it is in, described by:

- The droplet's salinity
- Air temperature
- Relative humidity
- $\rho_a$

Solve the ODEs for a set of parameters in the middle of each of the valid ranges.

In [None]:
# Initial values for the droplet.
#
# NOTE: We must use a list for the initial values rather than a NumPy array
#       as an array will somehow cause a divide by zero.  I did not have
#       enough time to figure out why this is the case.
#
y0 = [0, 0]
y0[0] = np.float32( 1e-4 )  # Radius in meters.
y0[1] = np.float32( 293 )   # Temperature in Kelvin.

# Environmental parameters to solve the ODEs with.
#
# Salinity (m_s)         ~ kg/m^3
# Air temperature (Tf)   ~ Kelvin
# Relative humidity (RH) ~ Fractional, typically in [0.5, 1.15]
# rhoa                   ~ kg/m^3
parameters = np.array( [1e-18, 291.0, 0.7, 1.0] )

# The ODEs are valid over t_span and solutions are returned over t_eval.
t_span = (0, 10)
t_eval = np.linspace( 0, 10, 1000 )

#
# NOTE: vectorized=True doesn't allow calling with arrays, but it lets the ODE solver
#       operate on multiple values at once if it chooses to.
#
solution = solve_ivp_float32_outputs( dydt, t_span, y0, method="Radau", t_eval=t_eval, args=(parameters,) )

plot_droplet_size_temperature( solution.y, solution.t )

# Mapping Droplet Parameters to/from $[-1, 1]$
Create routines for moving between the physical ranges of the droplet parameters and
a mapping onto the range $[-1, 1]$.  The ODEs are solved with physical parameters,
each with their own dynamic range, while the model operates on parameters that
all share the same range.
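
For a parameter with a linear range, the mapping and its inverse can be sketched as a pair of
affine transforms.  This is an illustration only: `normalize_linear()` and `scale_linear()` are
hypothetical names, the temperature bounds are made up, and the notebook's own
`normalize_droplet_parameters()` and `scale_droplet_parameters()` may be implemented
differently.

```python
def normalize_linear( value, lower, upper ):
    """Maps value in [lower, upper] onto [-1, 1] with an affine transform."""
    return 2.0 * (value - lower) / (upper - lower) - 1.0

def scale_linear( normalized_value, lower, upper ):
    """Inverse of normalize_linear(): maps [-1, 1] back to [lower, upper]."""
    return (normalized_value + 1.0) / 2.0 * (upper - lower) + lower

# Hypothetical air temperature range in Kelvin, chosen only for illustration.
lower, upper = 270.0, 310.0

for temperature in (270.0, 290.0, 310.0, 320.0):
    normalized = normalize_linear( temperature, lower, upper )
    restored   = scale_linear( normalized, lower, upper )
    print( "{:.1f} K -> {:+.2f} -> {:.1f} K".format( temperature, normalized, restored ) )
```

Values outside the range map outside $[-1, 1]$ but still round trip through the inverse, which
is why a test of the mapping can include points below and above the valid bounds.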

In [None]:
# Names of each of the parameter ranges, in the order within a droplet's parameters.
parameter_names = ["radius",
                   "temperature",
                   "salt_mass",
                   "air_temperature",
                   "relative_humidity",
                   "rhoa",
                   "time"]

# Print out the ranges so the user can review them.
longest_name_length      = max( map( len, parameter_names ) )
parameter_range_template = "  {{:<{:d}s}} [{{:.2f}}, {{:.2f}}]".format( longest_name_length + 4 )

print( "Parameter ranges used:\n" )
for parameter_name in parameter_names:
    print( parameter_range_template.format(
        parameter_name + ":",
        parameter_ranges[parameter_name][0],
        parameter_ranges[parameter_name][1] ) )
print()

In [None]:
# Test normalizing output parameters that are...
#
#   1. at the lower edge of the valid input space
#   2. in the middle of the valid input space
#   3. at the upper edge of the valid input space
#   4. below the lower edge of the valid input space
#   5. above the upper edge of the valid input space
#
# NOTE: We use a programmatic approach for selecting the values to test so that
#       these are scaled to the current parameter ranges.  Otherwise, whatever
#       values we select by hand will fail the test when the parameter ranges
#       change.
#
# Normalization is followed by an inverse scaling to bring the values back to
# physically interpretable ranges.  This is done three different times to
# test:
#
#  - Just radius and temperature
#  - Radius and temperature, along with the environmental parameters
#  - Radius and temperature, the environmental parameters, and an integration time
#

# The following tuples represent the above positions in parameter space as
# the index into the parameter range (0 for lower bound, 1 for upper bound)
# and a scale value for said bound.
LOWER_BOUND      = (0, 1)
MID_RANGE        = (0, 1.5)
UPPER_BOUND      = (1, 1)
INVALID_TOO_LOW  = (0, 0.75)
INVALID_TOO_HIGH = (1, 1.25)

TEST_CASES = [LOWER_BOUND,
              MID_RANGE,
              UPPER_BOUND,
              INVALID_TOO_LOW,
              INVALID_TOO_HIGH]

def param_value( parameter_ranges, parameter, test_tuple ):
    return parameter_ranges[parameter][test_tuple[0]] * test_tuple[1]

# Construct each of the test droplets, one at a time.
test_droplets = []
for test_case in TEST_CASES:
    test_droplet = np.empty( (1, len( parameter_ranges.keys() )) )
    
    for parameter_index, parameter_name in enumerate( parameter_names ):
        test_droplet[0, parameter_index] = param_value( parameter_ranges, parameter_name, test_case )
        
    # Radius, salt mass, and integration time have logarithmic ranges, so we
    # convert them to linear values
    test_droplet[0, 0] = 10**test_droplet[0, 0]
    test_droplet[0, 2] = 10**test_droplet[0, 2]
    test_droplet[0, 6] = 10**test_droplet[0, 6]
    
    test_droplets.append( test_droplet )

# Convert the test parameters into a 2D array.
test_parameters = np.vstack( test_droplets )

# Each test looks at a larger subset of the parameters.
inputs_range          = range( 0, 2 )
inputs_env_range      = range( 0, 6 )
inputs_env_time_range = range( 0, 7 )

if not np.allclose( scale_droplet_parameters( normalize_droplet_parameters( test_parameters[:, inputs_range] ) ),
                    test_parameters[:, inputs_range] ):
    raise ValueError( "Scaling normalized outputs was not an inverse operation!" )

if not np.allclose( scale_droplet_parameters( normalize_droplet_parameters( test_parameters[:, inputs_env_range] ) ),
                    test_parameters[:, inputs_env_range] ):
    raise ValueError( "Scaling normalized inputs (without t_final) was not an inverse operation!" )

if not np.allclose( scale_droplet_parameters( normalize_droplet_parameters( test_parameters[:, inputs_env_time_range] ) ),
                    test_parameters[:, inputs_env_time_range] ):
    raise ValueError( "Scaling normalized inputs (with t_final) was not an inverse operation!" )

# Generating Training Data
We want to create an on-disk training data set to streamline the training process.
The goal is not to train a model itself, per se, but rather to make it easy to quickly
explore different models that are comparable, without having questions about
what data were seen by each model.

A secondary benefit is that the training process is much faster, as generating
random data (which requires solving ODEs) no longer happens inside of, and slows
down, the actual training loop.

In [None]:
# Demonstrate that we can generate input parameters, without t_final, by sampling [-1, 1]
# and scaling them to their physical range.
#
# NOTE: This does not generate random t_sample as the original code was not developed with
#       that in mind.  Partially because the distribution is different and partially because
#       t_final is not currently normalized on input to the model.  This should be evaluated.
#
random_inputs = scale_droplet_parameters( np.reshape( np.random.uniform( -1, 1, 30 ), (5, 6) ) )

print( random_inputs )

In [None]:
# Demonstrate generating a batch of droplets with a small number of evaluations.
number_droplets    = 128
number_evaluations = 4

[inputs,
 outputs,
 integration_times,
 weird_inputs,
 weird_outputs] = create_droplet_batch( number_droplets, number_evaluations=number_evaluations )

number_weird_inputs  = sum( map( lambda type_name: len( weird_inputs[type_name] ), weird_inputs ) )
number_weird_outputs = sum( map( lambda type_name: len( weird_outputs[type_name] ), weird_outputs ) )

print( "Generated {:d} droplet parameter{:s}.\n".format(
    number_droplets,
    "" if number_droplets == 1 else "s" ) )

if number_weird_inputs > 0:
    print( "{:d} weird input{:s}:\n".format(
        number_weird_inputs,
        "" if number_weird_inputs == 1 else "s" ) )
    
    for type_name, weird_things in weird_inputs.items():
        number_weird_things = len( weird_things )
        if number_weird_things == 0:
            continue
            
        print( "    {:d} {:s}".format(
            number_weird_things,
            type_name ) )
        
    print( "" )

if number_weird_outputs > 0:
    print( "{:d} weird output{:s}:\n".format(
        number_weird_outputs,
        "" if number_weird_outputs == 1 else "s" ) )
    
    for type_name, weird_things in weird_outputs.items():
        number_weird_things = len( weird_things )
        if number_weird_things == 0:
            continue
            
        print( "    {:d} {:s}".format(
            number_weird_things,
            type_name ) )
        
    print( "" )
    
write_weird_parameters_to_spreadsheet( "test.xlsx", weird_inputs, weird_outputs )

In [None]:
# Create a training file with a handful of droplets, filter out any invalid
# parameters (this should no longer happen), and read them back in to
# see the distribution of t_final.  
training_test_path = "foo.data"
weird_test_path    = "foo.xlsx"
create_training_file( training_test_path, 1024, weird_file_name=weird_test_path )

number_parameters, number_bad_parameters = clean_training_data( training_test_path )
print( "Removed {:d} parameter{:s} from {:d} parameter{:s} ({:.2f}%).".format(
    number_bad_parameters,
    "" if number_bad_parameters == 1 else "s",
    number_parameters,
    "" if number_parameters == 1 else "s",
    number_bad_parameters / number_parameters * 100.0 ) )

input_parameters, output_parameters, times = read_training_file( training_test_path )

fig_h, ax_h = plt.subplots( 1, 1 )
ax_h.plot( np.log10( times ), "." )
ax_h.set_xlabel( "Droplet #" )
ax_h.set_ylabel( "log10( $t_{final}$ )" )
ax_h.set_title( "Distribution of $t_{final}$" )

plt.show()

# Training a Model
We define a simple model architecture in PyTorch, set up our loss function and basic
hyperparameters, and then randomly walk through the training data one or more times.
If configured, the trained model is saved to disk.
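
One consideration when setting up the loss for a model whose outputs are normalized into
$[-1, 1]$ is that typical errors have magnitudes below 1, and squaring them makes them smaller
still.  The sketch below compares the mean absolute error (L1) with the mean squared error (MSE)
in plain Python rather than PyTorch.  The error values are made up for illustration.

```python
# Hypothetical per-output errors for a model with outputs in [-1, 1].
errors = [0.1, -0.05, 0.02, -0.01]

# Mean absolute error, the quantity an L1 loss averages.
l1_loss  = sum( abs( error ) for error in errors ) / len( errors )

# Mean squared error, which shrinks errors that are already below 1.
mse_loss = sum( error**2 for error in errors ) / len( errors )

print( "L1 loss:  {:.6f}".format( l1_loss ) )
print( "MSE loss: {:.6f}".format( mse_loss ) )
```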

In [None]:
# Train using a local GPU if we have one, otherwise stay on the CPU.  While
# the model isn't large, having GPU acceleration for large batch
# counts and a non-trivial number of droplet parameters (>100 million) certainly
# helps.
device = torch.device( "cuda" if torch.cuda.is_available() else "cpu" )

print( "Training with '{}'.".format( device ) )

# Create an instance of ResidualNet and configure its optimization parameters:
#
#   - We use L1 loss since the model's outputs are in [-1, 1] making
#     "normal" errors smaller than 1, and MSE would result in even
#     smaller loss terms.
#   - We use Adam so we have momentum when performing gradient descent
#   - Given that we have a relatively large batch size we use a large
#     initial learning rate of 1e-3.
#   - We want smaller weights so we use a L2 regularization penalty of 1e-6
#     to encourage that (as well as having non-zero weights rather than
#     leaning heavily on just a subset of weights)
#
model     = ResidualNet()
criterion = nn.L1Loss()
optimizer = torch.optim.Adam( model.parameters(), lr=1e-3, weight_decay=1e-6 )

# Move the model to the device we're training with.
model = model.to( device )

In [None]:
def print_loss( model, epoch_number, optimizer, training_loss, validation_loss ):
    """
    Training callback that prints the mean training and validation loss to standard
    output.
    
    Takes 5 arguments:
    
      model           - Torch model object.
      epoch_number    - Epoch number that just completed.
      optimizer       - Torch optimizer object.
      training_loss   - Sequence of training loss values for the last epoch.
      validation_loss - Scalar validation loss for the last epoch.
    
    Returns nothing.
    
    """
    
    print( "Epoch #{:d} loss - T: {:f}, V: {:f}".format(
        epoch_number,
        np.mean( training_loss ),
        validation_loss ) )

In [None]:
if train_model_flag:
    if training_data_path is None:
        raise ValueError( "Set training_data_path to where the training data are located!" )

    (training_loss,
     validation_loss) = train_model( model, 
                                     criterion,
                                     optimizer,
                                     device,
                                     number_epochs, 
                                     training_data_path,
                                     validation_file=validation_data_path,
                                     epoch_callback=print_loss )

    # Determine how many minibatches were seen so we can correctly
    # plot the validation loss.
    number_mini_batches = len( training_loss ) // number_epochs
    
    # Plot the training loss to qualitatively assess how the model is doing.
    # Loss should consistently decrease over mini-batches.
    #
    # NOTE: This is only one aspect to evaluate a model's performance as the
    #       loss reported is on *training* data and not independent test
    #       data.  As a result, this curve will be overly optimistic.
    #
    fig_h, ax_h = plt.subplots( 1, 1, figsize=(4, 4) )

    ax_h.plot( np.log10( training_loss ), ".", label="Training" )
    ax_h.plot( range( number_mini_batches - 1, len( training_loss ), number_mini_batches ), np.log10( validation_loss ), ".", label="Validation" )
    ax_h.set_xlabel( "Mini Batch Number" )
    ax_h.set_ylabel( "log10( loss )" )
    ax_h.set_title( "Training and Validation Loss - {:d} Epoch{:s}".format(
        number_epochs,
        "" if number_epochs == 1 else "s"
    ) )
    ax_h.legend()

    plt.show()

In [None]:
if model_checkpoint_prefix is None:
    warnings.warn( "Please set model_checkpoint_prefix if you wish to save the model to disk.  This is intentionally *NOT* set to avoid accidentally overwriting an existing model." )
else:
    checkpoint_path = save_model_checkpoint( model_checkpoint_prefix,
                                             -1,
                                             model,
                                             optimizer,
                                             criterion,
                                             training_loss )
    
    print( "Saved the current model to '{:s}'.".format( checkpoint_path ) )

# Performance Analysis
Load a previously trained model, if requested, and qualitatively assess its performance relative
to the ODEs it was trained to approximate.

In [None]:
if load_model_flag:
    model = ResidualNet()
    load_model_checkpoint( model_load_path, model )
    model = model.to( device )
    
    print( "Loaded a model from '{:s}'.".format( model_load_path ) )
else:
    print( "Using the previously trained model." )

In [None]:
# Verify that the model matches the ODEs' outputs for a single parameter.
test_inputs      = np.array( [9.58937380e-05, 3.03864258e+02, 7.20879248e-16, 2.77617767e+02, 1.04355168e+00, 8.33695471e-01] ).reshape( (1, -1) )
test_time        = np.array( [1.0] )

t_span   = (0, 10)
solution = solve_ivp_float32_outputs( dydt,
                                      t_span,
                                      [test_inputs[0, 0], test_inputs[0, 1]],
                                      method="Radau",
                                      t_eval=test_time,
                                      args=(test_inputs[0, 2:],) )

#
# NOTE: We must transpose our expected outputs as solve_ivp() returns arrays
#       with the opposite orientation to do_inference().
#
expected_outputs = np.array( [solution.y[0], solution.y[1]] ).T

# What does our model estimate for this set of inputs?
test_outputs = do_inference( test_inputs, test_time, model, device )

# Sanity check that we're within half a percent relative difference for radius
# and temperature.
#
# NOTE: Changing the model may require an updated tolerance as new models
#       may not have learned to approximate this particular input.  Be
#       careful when adjusting this tolerance!
#
# NOTE: We report a warning instead of raising an exception since training
#       an underperforming model shouldn't preclude the analysis and weights
#       export process.
#
if not np.allclose( test_outputs, expected_outputs, rtol=5e-3 ):
    warnings.warn( "The model did not compute the expected size/temperature ({}) for {} ({})".format(
        expected_outputs,
        test_inputs,
        test_outputs ) )

In [None]:
input_parameters = scale_droplet_parameters( np.random.uniform( low=-1.0, high=1.0, size=(1, 6) ).astype( "float32" ) )
analyze_model_performance( model, input_parameters )

# Serializing Model Weights into Fortran
Generate a simple export of the model's weights into a format that is usable for a naive implementation of inference with the network.
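
To give a sense of what serializing weights into Fortran source involves, the sketch below formats
a list of values as a Fortran 90 parameter array.  It is an illustration under assumptions:
`fortran_array_literal()` and `example_bias` are hypothetical names, and the layout is not
necessarily what `generate_fortran_module()` writes.

```python
def fortran_array_literal( name, values, values_per_line=4 ):
    """
    Returns Fortran 90 source declaring a 1D default real parameter array
    called name that holds values.
    """
    formatted = ["{:.8e}".format( value ) for value in values]

    # Group the values into lines joined by Fortran continuation characters.
    lines = [", ".join( formatted[start:start + values_per_line] )
             for start in range( 0, len( formatted ), values_per_line )]
    body  = ", &\n        ".join( lines )

    return ("real, dimension({:d}), parameter :: {:s} = (/ &\n"
            "        {:s} /)\n").format( len( values ), name, body )

print( fortran_array_literal( "example_bias", [0.25, -1.5, 3.0] ) )
```

Limiting the number of values per line keeps the generated source within free-form Fortran's
line length limit regardless of how large a layer is.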

In [None]:
# Overwrite the module used by NTLP so future simulations use the current weights.
if write_weights_flag:
    
    # Pick the most descriptive name we have so researchers have an idea where
    # the weights came from.
    if load_model_flag:
        exported_model_name = model_load_path.split( "/" )[-1]
    else:
        exported_model_name = model_name
    
    exported_file_path = "../../droplet_model.f90"
    
    generate_fortran_module( exported_file_path,
                             exported_model_name,
                             model.state_dict() )
    
    print( "Wrote model weights and inferencing code to '{:s}'.".format(
        exported_file_path ) )
else:
    print( "Skipped writing model weights." )