# Fitting Thermoelectric data
Models and data are from Danny/Kedar.

## Import Modules, Functions, and Data

`functions.py` has the Python implementations of all the helper functions (I used a previously written package, `fdint`, for the Fermi-Dirac integrals).

In [None]:
import numpy as np
from functions import *

In [None]:
celldata = {}
celldata['xdata'] = np.loadtxt('xdata.csv',delimiter=',')
celldata['ydata'] = np.loadtxt('ydata.csv',delimiter=',')
celldata['n'] = 40

## Validate Python implementation
I did a test evaluation in Matlab and Python with the same input parameters. Let's import the results and compare to make sure we're getting the same thing.

In [None]:
test_in = [-2.499946233286e1,1.833885014595e-3,-2.2588468610036e-3,8.6217332036812e-4]
test_y,test_S,test_Rou=tefunnew(celldata,test_in)
matlab_y = np.loadtxt('matlab_y.csv',delimiter=',')
matlab_S = np.loadtxt('matlab_S.csv',delimiter=',')
matlab_Rou = np.loadtxt('matlab_Rou.csv',delimiter=',')

In [None]:
y_pct_diff=(test_y-matlab_y)/matlab_y
print('There is an average of a %.2f%% difference (with a standard deviation of %.2f%%) between the Matlab and Python implementations in the y output.'%(round(100.0*np.mean(y_pct_diff),2),round(100.0*np.std(y_pct_diff),2)))

In [None]:
S_pct_diff=(test_S-matlab_S)/matlab_S
print('There is an average of a %.2f%% difference (with a standard deviation of %.2f%%) between the Matlab and Python implementations in the S output.'%(round(100.0*np.mean(S_pct_diff),2),round(100.0*np.std(S_pct_diff),2)))

In [None]:
Rou_pct_diff=(test_Rou-matlab_Rou)/matlab_Rou
print('There is an average of a %.3f%% difference (with a standard deviation of %.3f%%) between the Matlab and Python implementations in the Rou output.'%(round(100.0*np.mean(Rou_pct_diff),3),round(100.0*np.std(Rou_pct_diff),3)))

Okay, so the differences aren't nothing, but they're small enough that I think we can work with them.
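Because signed relative differences can cancel when averaged, a complementary check is the largest absolute relative difference, or `np.allclose` with an explicit tolerance. The sketch below is only an illustration: `ref` and `new` are made-up stand-ins for the Matlab and Python outputs, and the 1% tolerance is an arbitrary choice.

```python
import numpy as np

# Hypothetical stand-ins for the Matlab and Python outputs (not real data).
ref = np.array([1.00, 2.00, 4.00])
new = np.array([1.005, 1.99, 4.00])

rel_diff = (new - ref) / ref          # signed relative differences
mean_signed = rel_diff.mean()         # can be near zero even when errors are not
max_abs = np.abs(rel_diff).max()      # worst-case relative error

# rtol=1e-2 is an arbitrary 1% tolerance chosen for illustration.
within_tol = np.allclose(new, ref, rtol=1e-2, atol=0.0)
```

Here the signed mean is essentially zero while the worst-case error is about 0.5%, which is why looking at both numbers can be useful.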

## Fitting with Bayesim
Now let's do a fit to the data using the grid approach implemented in the `bayesim` code.
### Import Things

In [1]:
import sys
sys.path.append('../../')
import bayesim.model as bym
import bayesim.param_list as byp
import functions as tefcns # model functions implemented in a separate file to keep this notebook tidy
import deepdish as dd # for interacting with HDF5 files
from joblib import Parallel, delayed # to parallelize model computations

### Initialize
First, we set up the list of parameters to be fit and their ranges.

In [2]:
fp = byp.param_list()
# Finer grid, kept for reference:
# fp.add_fit_param(name='P0', val_range=[1e-34,1e-20], spacing='log', length=28, units='sec.')
# fp.add_fit_param(name='fs', val_range=[-1,2], length=21, units='eV')
# fp.add_fit_param(name='r', val_range=[-1,2], length=21)
# fp.add_fit_param(name='Z', val_range=[-10,10], length=20)
fp.add_fit_param(name='P0', val_range=[1e-34,1e-20], spacing='log', length=7, units='sec.')
fp.add_fit_param(name='fs', val_range=[-1,2], length=5, units='eV')
fp.add_fit_param(name='r', val_range=[-1,2], length=5)
fp.add_fit_param(name='Z', val_range=[-10,10], length=5)
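
The parameter values that appear in `model_data` further down (for example `fs` = -0.7, -0.1, 0.5, 1.1, 1.7) are consistent with each range being split into `length` equal-width boxes, with the model evaluated at the box centers. Here is a minimal sketch of that reading; `box_centers` is a hypothetical helper written for illustration and is not part of `bayesim`.

```python
import numpy as np

def box_centers(val_range, length, spacing='linear'):
    """Centers of `length` equal-width boxes spanning val_range (hypothetical helper)."""
    lo, hi = val_range
    if spacing == 'log':
        # Equal widths in log10 space; centers are geometric midpoints.
        edges = np.linspace(np.log10(lo), np.log10(hi), length + 1)
        return 10.0 ** ((edges[:-1] + edges[1:]) / 2)
    edges = np.linspace(lo, hi, length + 1)
    return (edges[:-1] + edges[1:]) / 2

fs_centers = box_centers([-1, 2], 5)                  # linear spacing
P0_centers = box_centers([1e-34, 1e-20], 7, 'log')    # log spacing, 2 decades per box
n_boxes = 7 * 5 * 5 * 5                               # total boxes in the coarse grid
```

With the coarse settings above, this gives 875 boxes in total.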


Next, define the experimental conditions.

In [3]:
ec = ['T','R','n']

Now, set up the `bayesim.model` object. All we need to feed in are the parameters, experimental conditions, and name of the output variable.

In [4]:
m = bym.model(params=fp,ec=ec,output_var='P')

### Attach Experimental Observations
The next thing to do is to attach the observed data. I reformatted it to work with `bayesim` and saved an HDF5 file. You can see the format in the Excel sheet `TE_expt_data.xlsx`. Here I use only every third point (integer values of resistances) to speed up model computation and also because that's probably enough data.

In [5]:
#m.attach_observations(fpath='TE_expt_data.h5')
m.attach_observations(fpath='TE_expt_data_sparse.h5')

Identified experimental conditions as ['n', 'T', 'R']. If this is wrong, rerun and explicitly specify them with attach_ec (make sure they match data file columns) or remove extra columns from data file.


### Attaching the Model
Next, we attach the model. In this example I'll precompute the modeled data and attach a file with the outputs. You could also attach the function used to do the modeling, but the code can't currently parallelize those computations so I do it outside `bayesim` to take advantage of both cores on my laptop.

First, we write out a file listing all the simulation points. It's good practice to write this out rather than keep it only as a Python object, so we can pick up where we left off later.

This next cell should take about 30 seconds to evaluate, but if you don't want to do the model computations yourself you can skip it.

In [None]:
#m.list_model_pts_to_run('./sim_list.h5')

The code in the next cell actually does the model computations. On my two-core laptop, it takes about 24 minutes to evaluate. If your processor supports simultaneous multithreading (hyper-threading, which most modern ones do), set `n_jobs` to twice the number of physical cores on your machine to run this cell efficiently.

You can also just skip this cell and instead evaluate the following one to just load in the results of the computation that I did. :)

In [None]:
#sim_list = dd.io.load('./sim_list.h5')
#outputs=Parallel(n_jobs=4,verbose=7)(delayed(tefunnew_singlept)(sim[1][m.ec_names],sim[1][m.param_names]) for sim in sim_list.iterrows())
#sim_list['P'] = outputs
#dd.io.save('sim_outputs.h5',sim_list)

In [6]:
m.attach_model(mode='file',fpath='sim_outputs.h5')

On a sparse grid like this, it's important that the error values we use (i.e. standard deviation of Gaussians used for likelihood) are big enough to reach between boxes. This function computes the distance in output variable between model boxes at every experimental condition point and adds it to a column in model_data called 'deltas.'

In [8]:
m.calc_model_gradients()
m.model_data.sample(10)

| index | P0 | fs | r | Z | n | T | R | P | deltas |
|---|---|---|---|---|---|---|---|---|---|
| 160245 | 1e-25 | 0.5 | -0.7 | 0.0 | 40.0 | 90.0 | 11.3333 | 0.1448839 | 0.5764464 |
| 111688 | 1e-27 | -0.7 | 0.5 | -8.0 | 40.0 | 20.0 | 19.6667 | 1.927539e-40 | 2.327972e-16 |
| 36779 | 1e-31 | -0.7 | -0.7 | -4.0 | 40.0 | 130.0 | 8.0 | 13.1622 | 13.1622 |
| 55685 | 1e-31 | 0.5 | 1.1 | 0.0 | 40.0 | 10.0 | 6.0 | 8.937367e-45 | 1.019407e-21 |
| 95831 | 1e-29 | 1.1 | -0.1 | -8.0 | 40.0 | 70.0 | 16.0 | 0.1047737 | 0.2231783 |
| 138863 | 1e-27 | 1.7 | -0.7 | 4.0 | 40.0 | 130.0 | 12.0 | 3.044724 | 2.060596 |
| 156640 | 1e-25 | -0.1 | 1.1 | -8.0 | 40.0 | 30.0 | 2.33333 | 1.324969e-42 | 8.747942e-19 |
| 240591 | 1e-21 | 1.1 | -0.7 | 8.0 | 40.0 | 100.0 | 8.0 | 1.225535 | 1.714284 |
| 128132 | 1e-27 | 0.5 | 1.1 | -4.0 | 40.0 | 130.0 | 11.0 | 3.404713e-35 | 3.671342e-12 |
| 77651 | 1e-29 | -0.7 | 1.1 | 0.0 | 40.0 | 120.0 | 9.33333 | 1.595828e-63 | 1.128167e-39 |


As you can see, because our grid is so sparse, the deltas are often comparable to or larger than the output values themselves right now!
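
To make the idea of a delta concrete, here is a minimal one-parameter sketch: for each grid point, take the largest absolute difference in model output to its immediate neighbors. This is an illustration only. `neighbor_deltas` is a hypothetical helper and the outputs are made up; `calc_model_gradients` may well compute the quantity differently.

```python
import numpy as np

def neighbor_deltas(outputs):
    """Largest absolute output difference between each point and its neighbors (1-D grid)."""
    outputs = np.asarray(outputs, dtype=float)
    diffs = np.abs(np.diff(outputs))          # |y[i+1] - y[i]|
    left = np.concatenate([[0.0], diffs])     # difference to the left neighbor
    right = np.concatenate([diffs, [0.0]])    # difference to the right neighbor
    return np.maximum(left, right)

# Made-up model outputs on a coarse grid where the output changes quickly.
y = [0.1, 0.5, 4.0, 13.0]
deltas = neighbor_deltas(y)
```

On this toy grid the delta exceeds the output itself at three of the four points, which is the same kind of behavior the sample above shows.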

### First Bayes!
By default, the `run` function randomizes the order of observations and stops feeding them in once 80% of the probability mass resides in 5% of the parameter space. These thresholds can be tuned with the arguments `th_pm` (default 0.8) and `th_pv` (default 0.05).
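
One way to read that stopping rule, assuming all boxes have equal volume, is to sort the box probabilities in descending order and ask what fraction of the boxes is needed to hold `th_pm` of the mass. The helper `is_concentrated` below is hypothetical and only illustrates the criterion; it is not how `bayesim` implements it.

```python
import numpy as np

def is_concentrated(probs, th_pm=0.8, th_pv=0.05):
    """True if th_pm of the probability mass lies in at most th_pv of equal-volume boxes."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]   # largest probabilities first
    p = p / p.sum()                                     # normalize defensively
    cum = np.cumsum(p)
    # Smallest number of boxes whose cumulative mass reaches th_pm.
    n_needed = int(np.searchsorted(cum, th_pm)) + 1
    return n_needed / len(p) <= th_pv

uniform = np.full(100, 0.01)                            # mass spread evenly over 100 boxes
peaked = np.array([0.5, 0.35] + [0.15 / 98] * 98)       # two boxes hold 85% of the mass

uniform_done = is_concentrated(uniform)
peaked_done = is_concentrated(peaked)
```

The evenly spread distribution does not meet the criterion, while the peaked one does, since two boxes out of 100 (2% of the space) already hold more than 80% of the mass.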

__If you don't want to have to run the new simulations yourself (they'll take longer than the first batch), don't run the code in this cell - I just left it so you can see what *was* run.__

(Because the `run` function randomizes observations, if you run it, the subdivided cells will likely not match exactly and you'll get an error if you try to just load in the results from my new simulation run)

In [None]:
m.run()
m.save_state(filename='states/state_1.h5')

Here we just load the model state that I saved and carry onward.

In [None]:
#m = bym.model(load_state=True, state_file='states/state_1.h5')
#%matplotlib inline
#m.probs.visualize()

I'm not sure what's going on with the upper left box right now. I'll fix it...

### Subdivide!
I've found that 0.001 seems to be a reasonable threshold probability for boxes to subdivide on the first round so that's the default value, but you can feed in other numbers for `threshold_prob` to this function.

Note that the `subdivide` function divides not only the boxes meeting the threshold but also any boxes immediately neighboring them. It also writes out an HDF5 file of the new simulations that need to be run; that step can take a while (this cell takes a few minutes on my computer) because it's writing every combination of new parameter points AND experimental condition points.
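
As a rough picture of what subdividing a box means, the sketch below splits one box at the midpoint of every parameter, giving 2^d children for d parameters. The `split_box` helper and the dict-of-ranges representation are assumptions made for illustration (a log-spaced parameter would more naturally be split at its geometric midpoint); they are not `bayesim` internals.

```python
from itertools import product

def split_box(box):
    """Split a box {param: (lo, hi)} at each parameter's midpoint into 2**d children."""
    halves = {}
    for name, (lo, hi) in box.items():
        mid = (lo + hi) / 2
        halves[name] = [(lo, mid), (mid, hi)]
    names = list(box)
    # One child per combination of lower/upper half across all parameters.
    return [dict(zip(names, combo)) for combo in product(*(halves[n] for n in names))]

# Hypothetical two-parameter box.
children = split_box({'fs': (0.2, 0.8), 'r': (-1.0, -0.4)})
```

With four fit parameters, each subdivided box would yield 16 children under this scheme, and each child has to be simulated at every experimental condition point, which is why the list of new simulations grows quickly.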

Again, if you don't want to run it, you can skip this cell and just use the `load_state` line in the next cell to start where I left off.

In [None]:
#m.subdivide()
#m.save_state('states/state_2.h5')

It's worth noting that there were originally 875 boxes in our super sparse grid, so in this case a majority of them were subdivided, which isn't too surprising.

### Run (more simulations and then) more inference!
I ran the batch of new simulations on Peregrine; the results are in the file `new_sim_outputs_1.h5`, which we'll load in here to do the next round of inference.

This cell takes about a minute to run.

In [None]:
#m = bym.model(load_state=True,state_file='states/state_2.h5')
#m.attach_model(mode='add',fpath='new_sim_outputs_1.h5')
#m.save_state('states/state_3.h5')

In [None]:
#m = bym.model(load_state=True,state_file='states/state_3.h5')
#m.run()

In [None]:
fp = byp.param_list()
fp.add_fit_param(name='A',val_range=[0,1],length=4)
fp.add_fit_param(name='B',val_range=[1,1000],length=3,spacing='log')
tm = bym.model(params=fp,ec=['C'],output_var='O')
new_probs = [0.05,0.02,0.17,0.06,0.07,0.09,0.23,0.05,0.04,0.06,0.08,0.08]
tm.probs.points['prob']=new_probs
from jupyterthemes import jtplot
jtplot.style('default')
tm.probs.visualize(just_grid=True)

In [None]:
tm.probs.points

In [None]:
tm.probs.subdivide(0.2)

In [None]:
tm.probs.visualize(just_grid=True)

In [None]:
tm.probs.points.head(10)

In [None]:
# test non-gridded gradient calc!

In [None]:
#pt = points.iloc[17]
def find_box(pt,bm,grid):
    """Return the index of the grid box that strictly contains pt, or np.nan if there is none."""
    # Start with every box and keep only those whose bounds bracket pt in every parameter.
    inside = np.ones(len(grid),dtype=bool)
    for p in bm.param_names:
        inside &= ((grid[p+'_min']<pt[p]) & (pt[p]<grid[p+'_max'])).to_numpy()
    box_ind = grid.index[inside]
    # A point outside the grid (or exactly on a box edge) matches no box.
    if len(box_ind)==0:
        return np.nan
    else:
        return box_ind[0]

In [None]:
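import pandas as pd
# Assumption: `points` is the subdivided grid stored in tm.probs.points.
points = tm.probs.points
# A_vals and B_vals are not defined elsewhere in this notebook; these are
# illustrative test values inside the A and B ranges, so adjust as needed.
A_vals = np.linspace(0.05,0.95,4)
B_vals = np.logspace(0.25,2.75,3)
data = []
for a in A_vals:
    for b in B_vals:
        pt = {'A':a,'B':b}
        # record which box (if any) contains this test point
        data.append([a,b,find_box(pt,tm,points)])
pd.DataFrame.from_records(data=data,columns=tm.param_names+['ind'])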