# Walkthrough of the GAMMA-EM & MCMC Pipeline

***
## Table of contents
1. Sampling GAMMA for training and testing data
2. Creating the GAMMA-EM model emulators
3. Running Markov Chain Monte Carlo simulations
4. Finding and plotting MCMC results
5. Relevant literature
***

# 1. Sampling GAMMA for training and testing data

This is the step at which you can edit the variables included, their ranges, and which model inputs and outputs are emulated. Please note that this is also the file to modify when choosing which model inputs and outputs will be compared to observations. It is advised to add them under the "Extraction of Outputs" section of the GAMMA runs.

Step 1:  
* Edit __GAMMA_sampler_dict.py__ line 122 to specify the number of training samples needed (200 in the current model)
* Edit __1_GAMMA_sampling.sb__: change time to 00:10:00, ntasks to 200, mem-per-cpu to 2G
* Run command: __sbatch 1_GAMMA_sampling.sb__  
* Expected Outputs: 
   * /samples_GAMMA/em_sample_points200.npy
   * /samples_GAMMA/gal_FeH_mean_200.npy
   * /samples_GAMMA/gal_FeH_std_200.npy
   * /samples_GAMMA/gal_Mstar_200.npy   

Step 2:
* Edit __GAMMA_sampler_dict.py__ line 122 to specify the number of testing samples needed in the first generation (10000 currently)
* Edit __1_GAMMA_sampling.sb__: change time to 01:00:00, ntasks to 200, mem-per-cpu to 2G
* Run command: __sbatch 1_GAMMA_sampling.sb__  
* Expected Outputs: 
   * /samples_GAMMA/em_sample_points10000.npy
   * /samples_GAMMA/gal_FeH_mean_10000.npy
   * /samples_GAMMA/gal_FeH_std_10000.npy
   * /samples_GAMMA/gal_Mstar_10000.npy 

Step 3:
* Edit __GAMMA_sampler_dict.py__: change line 122 to specify the number of testing samples needed in the second generation (10000 currently)
* Edit __GAMMA_sampler_dict.py__: append "\_2" to the output file names (before ".npy") in the save lines 154 and 277-279 so that the second generation of test samples is saved separately. Otherwise, the first-generation sample set will be overwritten
* Edit __1_GAMMA_sampling.sb__: change time to 01:00:00, ntasks to 200, mem-per-cpu to 2G
* Run command: __sbatch 1_GAMMA_sampling.sb__  
* Expected Outputs: 
   * /samples_GAMMA/em_sample_points10000_2.npy
   * /samples_GAMMA/gal_FeH_mean_10000_2.npy
   * /samples_GAMMA/gal_FeH_std_10000_2.npy
   * /samples_GAMMA/gal_Mstar_10000_2.npy

# 2. Creating the GAMMA-EM model emulators

Ensure the "/samples_GAMMA" folder has the correct contents:
* /samples_GAMMA/em_sample_points200.npy  
* /samples_GAMMA/gal_FeH_mean_200.npy  
* /samples_GAMMA/gal_FeH_std_200.npy  
* /samples_GAMMA/gal_Mstar_200.npy 
* /samples_GAMMA/em_sample_points10000.npy
* /samples_GAMMA/gal_FeH_mean_10000.npy
* /samples_GAMMA/gal_FeH_std_10000.npy
* /samples_GAMMA/gal_Mstar_10000.npy 
* /samples_GAMMA/em_sample_points10000_2.npy
* /samples_GAMMA/gal_FeH_mean_10000_2.npy
* /samples_GAMMA/gal_FeH_std_10000_2.npy
* /samples_GAMMA/gal_Mstar_10000_2.npy 
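
The folder check above can be sketched as a short script. This is a minimal illustration, not part of the repository: the function names `expected_sample_files` and `missing_sample_files` are hypothetical, and the folder is assumed to be `samples_GAMMA` relative to the working directory.

```python
from pathlib import Path

# File-name stems listed above; the sample count and optional suffix are appended.
PREFIXES = ("em_sample_points", "gal_FeH_mean_", "gal_FeH_std_", "gal_Mstar_")


def expected_sample_files(n_samples, suffix=""):
    """Return the four .npy names expected for one sampling run."""
    return [f"{prefix}{n_samples}{suffix}.npy" for prefix in PREFIXES]


def missing_sample_files(folder="samples_GAMMA"):
    """List expected files that are absent from `folder` (assumed location)."""
    # Training set, first test generation, second test generation.
    runs = [(200, ""), (10000, ""), (10000, "_2")]
    expected = [name for n, s in runs for name in expected_sample_files(n, s)]
    return [name for name in expected if not (Path(folder) / name).is_file()]


if __name__ == "__main__":
    missing = missing_sample_files()
    print("All sample files present" if not missing else f"Missing: {missing}")
```

If the sample counts in __GAMMA_sampler_dict.py__ were changed from 200 and 10000, the `runs` list would need to be adjusted to match.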

Run command: __sbatch 2_GAMMA_EM_run.sb__

Expected Outputs:
* /Emulator_results/metallicity_emulator.joblib
* /Emulator_results/stellar_mass_emulator.joblib
* An image comparing the emulator and GAMMA results, generated from a randomly sampled set of GAMMA parameters
* /Emulator_results/Testing_scores.txt

# 3. Running Markov Chain Monte Carlo simulations

Ensure the "/Emulator_results" folder has the correct contents.

Edit __MCMC_results.py__: edit lines 58-67 to match the variable ranges specified in __GAMMA_sampler_dict.py__. Edit line 100 to set the desired number of walkers (~200) and steps (~10000-50000)

Edit __3_MCMC_run.sb__: set time to suit the number of walkers and steps. A run with 200 walkers and 10000 steps needs about 10 minutes
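
As a rough guide for choosing the time limit, the reference timing can be scaled to other run sizes. The sketch below assumes that run time grows linearly with walkers × steps, which has not been verified for this pipeline, and adds a safety margin. The function name `estimate_walltime` and the `safety` factor are hypothetical.

```python
def estimate_walltime(n_walkers, n_steps, ref_minutes=10.0,
                      ref_walkers=200, ref_steps=10000, safety=1.5):
    """Scale the reference timing linearly in walkers * steps (an assumption)
    and format the result as HH:MM:SS for the time setting of the job script."""
    scale = (n_walkers * n_steps) / (ref_walkers * ref_steps)
    total_seconds = int(round(ref_minutes * 60 * scale * safety))
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(estimate_walltime(200, 10000))  # 00:15:00 (10 minutes plus the 1.5x margin)
print(estimate_walltime(200, 50000))  # 01:15:00
```

Treat the result as a starting point only and adjust it after checking how long a first run takes on your cluster.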

Run command: __sbatch 3_MCMC_run.sb__

Expected Outputs:
* /MCMC_results/samples_(walker number)w(step number)s.joblib - MCMC samples
* /MCMC_results/likelihood_(walker number)w(step number)s.joblib - MCMC likelihoods associated with each step
* /MCMC_results/acceptance_frac_(walker number)w(step number)s.txt - Acceptance fractions of each walker
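
The naming pattern above can be made concrete with a small helper. It is only an illustration of the pattern: the function name `mcmc_output_names` is hypothetical, and the paths are written relative to the repository root.

```python
def mcmc_output_names(n_walkers, n_steps):
    """Build the output file names for a given number of walkers and steps."""
    tag = f"{n_walkers}w{n_steps}s"
    return {
        "samples": f"MCMC_results/samples_{tag}.joblib",
        "likelihood": f"MCMC_results/likelihood_{tag}.joblib",
        "acceptance": f"MCMC_results/acceptance_frac_{tag}.txt",
    }


names = mcmc_output_names(200, 10000)
print(names["samples"])  # MCMC_results/samples_200w10000s.joblib
```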

# 4. Finding and plotting MCMC results

Ensure the "/MCMC_results" folder has the correct contents. Please note that the current GitHub repository lacks the ".joblib" files because of their size

Edit __MCMC_plots.py__: change lines 15-17 to match the number of walkers and steps in the desired samples/likelihood file. Change line 18 to a sensible burn-in length (~10-20% of total steps). Also change line 26 to match the variable names.
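
The burn-in choice can be illustrated with a small, self-contained sketch. It is not the code in __MCMC_plots.py__: the function names are hypothetical, and the toy chain is assumed to be indexed as `[walker][step]`, which may differ from the layout of the stored samples.

```python
def burn_in_steps(n_steps, fraction=0.15):
    """Number of initial steps to discard, as a fraction of the total."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return int(round(n_steps * fraction))


def discard_burn_in(chain, n_burn):
    """Drop the first n_burn steps of every walker (chain[walker][step])."""
    return [walker[n_burn:] for walker in chain]


# Toy chain: 2 walkers, 10 steps each, one value per step.
chain = [list(range(10)), list(range(100, 110))]
n_burn = burn_in_steps(10, fraction=0.2)
kept = discard_burn_in(chain, n_burn)
print(n_burn, kept[0])  # 2 [2, 3, 4, 5, 6, 7, 8, 9]
```

For a 10000-step run, a fraction between 0.10 and 0.20 corresponds to discarding 1000 to 2000 steps. The trace plot is a useful check that the walkers have settled by the chosen point.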

Run command: __sbatch 4_MCMC_plot.sb__

Expected Outputs:
* /MCMC_results/trace_(walker number)w(step number)s.png
* /MCMC_results/likelihood_(walker number)w(step number)s.png
* /MCMC_results/triangle_pdfs_(walker number)w(step number)s.png - corner plot
* /MCMC_results/best_fit_values_(walker number)w(step number)s.txt - values identified to be the best fit

# 5. Relevant literature
#### Gaussian Process model emulators
* __An intuitive guide to Gaussian Process regression: https://towardsdatascience.com/an-intuitive-guide-to-gaussian-processes-ec2f0b45c71d__
* Gaussian Processes for Dummies: https://katbailey.github.io/post/gaussian-processes-for-dummies/
* Gaussian Processes: A Quick Introduction: http://arxiv.org/abs/1505.02965
* Additive Gaussian Processes: http://papers.nips.cc/paper/4221-additive-gaussian-processes.pdf
* __Gaussian Processes for Machine Learning: http://www.gaussianprocess.org/gpml/chapters/ (note: ignore the classification chapter, this isn't a classification problem)__
* Website all about Gaussian Processes: http://www.gaussianprocess.org/
* Kernel Cookbook: __https://www.cs.toronto.edu/~duvenaud/cookbook/__
* sklearn documentation: https://scikit-learn.org/stable/modules/gaussian_process.html
* __Useful Stack Overflow questions regarding sklearn's multi-input/output GP emulators__
    * https://stackoverflow.com/questions/50185399/multiple-output-gaussian-process-regression-in-scikit-learn
    * https://stackoverflow.com/questions/43618633/multi-output-spatial-statistics-with-gaussian-processes?noredirect=1&lq=1
    
#### Markov Chain Monte Carlo
* emcee documentation:
    * https://emcee.readthedocs.io/en/stable/
    * Working GitHub link: https://github.com/dfm/emcee/tree/v2.2.x
    * Associated paper: http://arxiv.org/abs/1202.3665