# ltu-ili jupyter interface
This is a tutorial for using the ltu-ili inference framework in a jupyter notebook. 

This notebook assumes you have installed the ltu-ili package from the installation instructions in [INSTALL.md](../INSTALL.md).

In [None]:
%load_ext autoreload
%autoreload 2

# ignore warnings for readability
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import os
import numpy as np

# separate into train and test sets
from sklearn.model_selection import train_test_split

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.colors as mcolors
import torch
from torch.distributions import Uniform, ExpTransform, TransformedDistribution #, AffineTransform

import torch.nn as nn

import ili
from ili.dataloaders import NumpyLoader
from ili.inference import InferenceRunner
from ili.validation.metrics import PosteriorCoverage, PlotSinglePosterior

from sbi.utils.user_input_checks import process_prior

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)

# Get theta

In [None]:
# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas releases
df_pars = pd.read_csv('/home/jovyan/camels/LH/CosmoAstroSeed_IllustrisTNG_L25n256_LH.txt', sep=r'\s+')
df_pars

In [None]:
theta = df_pars[['Omega_m', 'sigma_8', 'A_SN1', 'A_AGN1', 'A_SN2', 'A_AGN2']].to_numpy()
print(theta)
print(theta.shape)

# Get data (x)

In [None]:
import os
import pandas as pd

# Define the directory containing the LH_X files
directory = "/home/jovyan/camels/LH/get_LF/output/"

# Get all files in the directory
files = os.listdir(directory)

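# Keep only files that start with "LH_" and end with ".txt".
# os.listdir returns names in arbitrary order, so sort by the numeric index
# (assumes names of the form LH_<integer>.txt) to keep x aligned with the rows of theta.
LH_X_files = sorted(
    (file for file in files if file.startswith("LH_") and file.endswith(".txt")),
    key=lambda name: int(name[3:-4]),
)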

# Initialize lists to store data
phia = []
phi_sigmaa = []
binsa = []
LH_X_values = []

# Iterate over LH_X files
for LH_X_file in LH_X_files:
    # Define the file path
    file_path = os.path.join(directory, LH_X_file)
    
    # Extract LH_X value from the file name (remove the ".txt" extension)
    LH_X = LH_X_file[:-4]
    
    # Initialize an empty dictionary to store variable names and their values
    variable_data = {}

    # Open the text file for reading
    with open(file_path, 'r') as file:
        # Initialize variables to store the current variable name and its values
        current_variable_name = None
        current_variable_values = []

        # Iterate over each line in the file
        for line in file:
            # Remove leading and trailing whitespace from the line
            line = line.strip()

            # Check if the line is empty
            if not line:
                continue

            # Check if the line is a variable name
            if line in ['phi', 'phi_sigma', 'hist', 'massBinLimits']:
                # If it's a new variable name, update the current variable name and reset the values list
                if current_variable_name is not None:
                    variable_data[current_variable_name] = current_variable_values
                    current_variable_values = []

                current_variable_name = line
            else:
                # If it's not a variable name, convert the value to float and append it to the values list
                current_variable_values.append(float(line))

        # Add the last variable data to the dictionary
        if current_variable_name is not None:
            variable_data[current_variable_name] = current_variable_values
        
        # Extract specific variables
        phi = variable_data.get('phi')
        phi_sigma = variable_data.get('phi_sigma')
        bins = variable_data.get('massBinLimits')

        phia.append(phi)
        phi_sigmaa.append(phi_sigma)
        binsa.append(bins)
        LH_X_values.append(LH_X)

# Create a DataFrame from the lists
df_x = pd.DataFrame({'LH_X': LH_X_values, 'phi': phia, 'phi_sigma': phi_sigmaa, 'bins': binsa})

# Display the DataFrame
df_x
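
The parsing loop above can also be written as a small reusable function. The following is a minimal sketch, assuming each file consists of a keyword line (`phi`, `phi_sigma`, `hist`, or `massBinLimits`) followed by one numeric value per line, as the loop implies. The function name `parse_blocks` and the sample contents are hypothetical and only serve to illustrate the layout.

```python
import io

KEYS = ('phi', 'phi_sigma', 'hist', 'massBinLimits')

def parse_blocks(lines, keys=KEYS):
    """Group float values under the most recent keyword line (hypothetical helper)."""
    data = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line in keys:
            current = line
            data[current] = []  # start a new block
        elif current is not None:
            data[current].append(float(line))
    return data

# Made-up file contents, used only to exercise the parser
sample = io.StringIO(
    "phi\n1.0e-3\n0.0\n"
    "phi_sigma\n2.0e-4\n0.0\n"
    "massBinLimits\n-22.0\n-21.0\n-20.0\n"
)
parsed = parse_blocks(sample)
print(parsed['phi'])
print(parsed['massBinLimits'])
```

Unlike the loop above, this version ignores any values that appear before the first keyword line instead of attaching them to the first block.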


In [None]:
# transform x data into log space.

In [None]:
df_x['phi']

In [None]:
type(df_x['phi'])

In [None]:
# clean up 0 values to avoid -inf values when taking the log

# Replace exact zeros with a small floor value (1e-5 for now) so that log10 stays finite
def replace_zeros(phi_list):
    return [1e-5 if x == 0 else x for x in phi_list]

# Apply the function to each entry in the 'phi' column
df_x['phi_0s'] = df_x['phi'].apply(replace_zeros)
df_x['phi_0s']

In [None]:
# convert pandas series to np.array
x = np.log10((np.array(df_x['phi_0s'].tolist())))
print(x)
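
The zero replacement and log transform can also be done in one vectorised step. This is a sketch with made-up `phi` values, assuming all rows have the same number of bins so that they stack into a 2D array. The floor of `1e-5` is the same placeholder value used above.

```python
import numpy as np

# Made-up values standing in for the stacked 'phi' column (2 simulations, 3 bins)
phi = np.array([[1e-3, 0.0, 2e-4],
                [5e-3, 1e-4, 0.0]])

floor = 1e-5  # placeholder floor for empty bins
x_log = np.log10(np.where(phi == 0, floor, phi))
print(x_log)
```

Every empty bin maps to the same value, `log10(floor)`, so the choice of floor sets how far empty bins sit from populated ones in the data vector.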

## Toy NPE
This example attempts to infer 6 unknown parameters (Omega_m, sigma_8, A_SN1, A_AGN1, A_SN2, A_AGN2) from the log-transformed `phi` data vector using amortized posterior inference. We train the models on the simulation outputs loaded above. The pipeline follows the same configuration as in [examples/toy_sbi.py](../examples/toy_sbi.py), but demonstrates how one would interact with the inference pipeline in a Jupyter notebook.

In [None]:
# prior for omega m, between 0.1 and 0.5
plt.hist(theta[:, 0])

In [None]:
# prior for sigma_8, between 0.6 and 1
plt.hist(theta[:, 1])

In [None]:
# prior for A_SN1, between 0.25 and 4
plt.hist(theta[:, 2])

In [None]:
# prior for A_AGN1, between 0.25 and 4
plt.hist(theta[:, 3])

In [None]:
# prior for A_SN2, between 0.5 and 2
plt.hist(theta[:, 4])

In [None]:
# prior for A_AGN2, between 0.5 and 2
plt.hist(theta[:, 5])

In [None]:
from sklearn.model_selection import train_test_split

# x and theta are the data arrays built above
# First split: into train+validation and test sets
# A stratified split is possible, but a random split with a fixed seed is sufficient for now
x_temp, x_test, theta_temp, theta_test = train_test_split(x, theta, test_size=0.2, random_state=0)

# Second split: into train and validation sets
x_train, x_val, theta_train, theta_val = train_test_split(x_temp, theta_temp, test_size=0.25, random_state=0)  # 0.25 x 0.8 = 0.2

print('x: full data:', x.shape)
print('x: training set:', x_train.shape)
print('x: validation set:', x_val.shape)
print('x: testing set:', x_test.shape)

print('theta: full data:', theta.shape)
print('theta: training set:', theta_train.shape)
print('theta: validation set:', theta_val.shape)
print('theta: testing set:', theta_test.shape)
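
The two-stage split above gives a 60/20/20 division into training, validation, and test sets. The sketch below reproduces only the arithmetic with a plain numpy permutation. It does not reproduce the exact rows chosen by `train_test_split`. The sample count `n = 1000` is a hypothetical value, and rounding the held-out share up is an assumption.

```python
import numpy as np

n = 1000  # hypothetical number of simulations
rng = np.random.default_rng(0)
perm = rng.permutation(n)

n_test = int(np.ceil(0.2 * n))              # 20% of the full set
n_val = int(np.ceil(0.25 * (n - n_test)))   # 25% of the remaining 80% = 20% overall

test_idx = perm[:n_test]
val_idx = perm[n_test:n_test + n_val]
train_idx = perm[n_test + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Indexing both `x` and `theta` with the same index arrays keeps each data vector paired with its parameters.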


In [None]:
# Distribution of the first bin of the data vector across the training set
# (this is input data, not a posterior; x is already in log10 space)
plt.hist(x_train[:, 0])


In [None]:
# Plot some examples of the data
fig, ax = plt.subplots(figsize=(8, 6))
for i in range(6):
    ind = np.random.randint(len(theta_train))
    # label each curve with its six parameter values
    label = '(' + ', '.join(f'{v:.2f}' for v in theta_train[ind]) + ')'
    ax.plot(x_train[ind], alpha=0.5, label=label)
#ax.set_yscale('log')
ax.legend(title='theta')
ax.set_title('Data vectors x')
plt.show()

The InferenceRunner object will handle all of the data normalization and model training for us. We just need to provide it with:
- our parameter prior
- our inference type (SNPE/SNLE/SNRE)
- our desired neural network architecture
- our training hyperparameters

On the backend, it does a validation split among the provided training data, trains the neural networks with an Adam optimizer, and enforces an early stopping criterion to prevent overfitting. All the parameters of these processes can be independently configured.
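
The early stopping criterion can be illustrated with a generic patience rule: training stops once the validation log probability has not improved for a fixed number of epochs. This is a sketch of the idea only, not the backend's actual implementation. The function name `early_stop_epoch` is hypothetical, and the parameter name `stop_after_epochs` is borrowed from the commented-out training arguments in this notebook.

```python
def early_stop_epoch(val_log_probs, stop_after_epochs=20):
    """Return the epoch index at which a patience-based rule would stop."""
    best = float('-inf')
    epochs_since_best = 0
    for epoch, value in enumerate(val_log_probs):
        if value > best:
            best = value
            epochs_since_best = 0  # improvement resets the counter
        else:
            epochs_since_best += 1
        if epochs_since_best >= stop_after_epochs:
            return epoch
    return len(val_log_probs) - 1  # never triggered: ran to the end

# Made-up validation curve: improves for three epochs, then stalls
history = [-5.0, -4.0, -3.5, -3.6, -3.7, -3.55, -3.8]
print(early_stop_epoch(history, stop_after_epochs=3))
```

A larger patience tolerates noisier validation curves at the cost of extra epochs after the best model has been found.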

In [None]:
def initialise_priors(device="cpu", astro=True, dust=True):

    combined_priors = []

    if astro:
        base_dist1 = Uniform(
            torch.log(torch.tensor([0.25], device=device)),
            torch.log(torch.tensor([4], device=device)),
        )
        base_dist2 = Uniform(
            torch.log(torch.tensor([0.5], device=device)),
            torch.log(torch.tensor([2], device=device)),
        )
        astro_prior1 = TransformedDistribution(base_dist1, ExpTransform())
        astro_prior2 = TransformedDistribution(base_dist2, ExpTransform())
        omega_prior = Uniform(
            torch.tensor([0.1], device=device),
            torch.tensor([0.5], device=device),
        )
        sigma8_prior = Uniform(
            torch.tensor([0.6], device=device),
            torch.tensor([1.0], device=device),
        )
        combined_priors += [
            omega_prior,  # prior for Omega_m, between 0.1 and 0.5: uniform
            sigma8_prior,  # prior for sigma_8, between 0.6 and 1: uniform
            astro_prior1,  # prior for A_SN1, between 0.25 and 4: log-uniform
            astro_prior1,  # prior for A_AGN1, between 0.25 and 4: log-uniform
            astro_prior2,  # prior for A_SN2, between 0.5 and 2: log-uniform
            astro_prior2,  # prior for A_AGN2, between 0.5 and 2: log-uniform
        ]

    prior = process_prior(combined_priors)

    return prior[0]

In [None]:
prior = initialise_priors()
print(prior)
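
The priors on the four astrophysical parameters are built by drawing uniformly in log space and exponentiating, which gives a log-uniform distribution. The sketch below shows the same construction with numpy instead of torch, purely for illustration. One consequence is that half of the probability mass on [0.25, 4] lies below 1, the geometric mean of the bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 0.25, 4.0

# Uniform in log space, then exponentiate: a log-uniform draw on [low, high]
samples = np.exp(rng.uniform(np.log(low), np.log(high), size=100_000))

frac_below_one = np.mean(samples < 1.0)
print(samples.min(), samples.max(), frac_below_one)
```

A plain uniform prior on the same interval would instead place only 20% of its mass below 1.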

In [None]:
# make a dataloader
loader = NumpyLoader(x=x_train, theta=theta_train)

# instantiate your neural networks to be used as an ensemble
# with engine='NPE', each network is a density estimator for the posterior p(theta | x), not the likelihood
nets = [
    ili.utils.load_nde_sbi(engine='NPE', model='maf', hidden_features=50, num_transforms=5),
    ili.utils.load_nde_sbi(engine='NPE', model='mdn', hidden_features=50, num_components=6)
]

# hyperparameter search
# for batch_size in [4, 8, 16, 32]:
# for learning_rate in [1e-2, 1e-3, 1e-4]:
# for hidden_features in [50, 100, 200]
# for num_components in [6, 12, 18]

# define training arguments
train_args = {
    'training_batch_size': 10, # batch_size
    'learning_rate': 1e-4 # learning_rate
}

#             training_batch_size=50,
#             learning_rate=5e-4,
#             validation_fraction=0.1,
#             stop_after_epochs=20,
#             clip_max_norm=5,

# initialize the trainer
runner = InferenceRunner.load(
    backend='sbi',
    engine='NPE',
    prior=prior,
    nets=nets,
    device=device,
    embedding_net=None,
    train_args=train_args,
    proposal=None,
    out_dir=None
)

# need to play with training arguments.

In [None]:
# train the model
posterior_ensemble, summaries = runner(loader=loader, seed=1)
# seed fixes the validation set 

In [None]:
posterior_ensemble

In [None]:
summaries

Here, the output of the runner is a posterior model and a log of training statistics. The posterior model is a [NeuralPosteriorEnsemble](https://github.com/mackelab/sbi/blob/6c4fa7a6fd254d48d0c18640c832f2d80ab2257a/sbi/utils/posterior_ensemble.py#L19) model and automatically combines samples and probability densities from its component networks.
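
Combining probability densities from several component networks amounts to evaluating a weighted mixture. The sketch below shows that calculation in log space for a generic mixture. It is not the `NeuralPosteriorEnsemble` source code, and the function name `ensemble_log_prob`, the weights, and the log probabilities are hypothetical.

```python
import numpy as np

def ensemble_log_prob(component_log_probs, weights):
    """Log density of a weighted mixture, computed with a stable log-sum-exp."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise the weights
    lp = np.asarray(component_log_probs, dtype=float)  # shape (n_components, n_points)
    a = lp + np.log(w)[:, None]
    m = a.max(axis=0)
    return m + np.log(np.exp(a - m).sum(axis=0))

# Two hypothetical components evaluated at two points
log_probs = [[-1.0, -2.0],
             [-1.0, -4.0]]
print(ensemble_log_prob(log_probs, weights=[1.0, 1.0]))
```

Where the components agree, the mixture returns their common value. Where they disagree, the result is dominated by the component that assigns the higher density.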

In [None]:
# plot train/validation loss
fig, ax = plt.subplots(1, 1, figsize=(6,4))
c = list(mcolors.TABLEAU_COLORS)
for i, m in enumerate(summaries):
    ax.plot(m['training_log_probs'], ls='-', label=f"{i}_train", c=c[i])
    ax.plot(m['validation_log_probs'], ls='--', label=f"{i}_val", c=c[i])
ax.set_xlim(0)
ax.set_xlabel('Epoch')
ax.set_ylabel('Log probability')
ax.legend()

In [None]:
print(f"Shape of theta: {theta_train.shape}")
print(f"Shape of x: {x_train.shape}")


In [None]:
# The runner returns a custom class instance so that signature strings can be passed along.
# Its only attributes are a NeuralPosteriorEstimate and a list of signature strings.
print(posterior_ensemble.signatures)

# choose a random input
seed_in = 49
np.random.seed(seed_in)
ind = np.random.randint(len(theta_train))

# generate samples from the posterior using accept/reject sampling
seed_samp = 32
torch.manual_seed(seed_samp)
samples = posterior_ensemble.sample((1000,), torch.Tensor(x_train[ind]).to(device))

# calculate the log_prob for each sample
log_prob = posterior_ensemble.log_prob(samples, torch.Tensor(x_train[ind]).to(device))

samples = samples.cpu().numpy()
log_prob = log_prob.cpu().numpy()

In [None]:
print(ind)

In [None]:

# samples approximate the posterior p(theta | x_i), conditioned on the data vector x_i
# that was simulated with the true parameters theta_i
plt.hist(samples[:, 2])
print('True value:', theta_train[ind, 2])
print('Histogram shows the posterior samples for A_SN1')

In [None]:
# Plot the posterior samples and the true value for all pairs of theta values
n_params = theta_train.shape[1]  # all six parameters
fig, axs = plt.subplots(n_params, n_params, figsize=(15, 15))
fig.subplots_adjust(hspace=0.4, wspace=0.4)

for i in range(n_params):
    for j in range(n_params):
        if i != j:
            axs[i, j].plot(theta_train[ind, i], theta_train[ind, j], 'r+', markersize=10, label='true')
            im = axs[i, j].scatter(samples[:, i], samples[:, j], c=log_prob, s=4, label='samples', cmap='viridis')
            axs[i, j].set_xlabel(f'$\\theta_{i}$')
            axs[i, j].set_ylabel(f'$\\theta_{j}$')
            axs[i, j].legend()
        else:
            axs[i, j].axis('off')

# Add a color bar for log probability
cbar_ax = fig.add_axes([0.92, 0.15, 0.02, 0.7])
fig.colorbar(im, cax=cbar_ax, label='log probability')

plt.show()

print('True values are marked with red pluses, and the pairs of theta values are plotted against each other.')


In [None]:
labels = ['$\\Omega_m$', '$\\sigma_8$', 'A_SN1', 'A_AGN1', 'A_SN2', 'A_AGN2']

# use ltu-ili's built-in validation metrics to plot the posterior for this point
metric = PlotSinglePosterior(
    num_samples=1000, sample_method='direct', 
    labels=labels,
)

seed_in = 49
np.random.seed(seed_in)
ind_2 = np.random.randint(len(theta_test))


fig = metric(
    posterior=posterior_ensemble,
    x_obs = x_test[ind_2], theta_fid=theta_test[ind_2]
)
# this uses index ind_2 from the test set

In [None]:
print(ind_2)

### Using the ensemble of trained posteriors models
By default, running a SampleBasedMetric with the posterior from above computes the metrics using the ensemble model. That is, the ensemble is treated as one model, with each posterior in the ensemble weighted by its val_log_prob.

In [None]:
# Drawing samples from the ensemble posterior
labels = ['$\\Omega_m$', '$\\sigma_8$', 'A_SN1', 'A_AGN1', 'A_SN2', 'A_AGN2']

metric = PosteriorCoverage(
    num_samples=1000, sample_method='direct', 
    labels=labels,
    plot_list = ["coverage", "histogram", "predictions", "tarp"],
    out_dir=None
    
)

fig = metric(
    posterior=posterior_ensemble, # NeuralPosteriorEnsemble instance from sbi package
    x=x_test, theta=theta_test
)

In the ensemble model, it looks like our posteriors are well-calibrated when evaluated on marginal distributions, but slightly negatively biased in the multivariate TARP coverage.
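
Marginal calibration of this kind is commonly checked with rank statistics: for each test point, count how many posterior samples fall below the true value. For a calibrated posterior these ranks are uniformly distributed. The sketch below illustrates the idea on a toy problem where truth and posterior samples are drawn from the same distribution. It is a generic illustration with made-up data, not the `PosteriorCoverage` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_samples = 2000, 200

# Toy calibrated case: true values and posterior samples share the same N(0, 1)
truth = rng.normal(size=n_obs)
posterior_samples = rng.normal(size=(n_obs, n_samples))

# Rank of each true value among its posterior samples (0 .. n_samples)
ranks = (posterior_samples < truth[:, None]).sum(axis=1)

# A flat histogram of ranks indicates calibration
hist, _ = np.histogram(ranks, bins=10, range=(0, n_samples))
print(hist)
```

A U-shaped rank histogram would point to posteriors that are too narrow, a peaked one to posteriors that are too wide, and a slope to a systematic bias.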

### Evaluating each trained posterior in the ensemble
Below, we compute separately each SampleBasedMetric for every posterior in the ensemble.

In [None]:
# Evaluate the MDN on its own (the second network in the ensemble)
fig = metric(
    posterior=posterior_ensemble.posteriors[1],
    x=x_test, theta=theta_test
)

From these results, we see that we are largely consistent and calibrated in the univariate coverage, with some slight negative bias shown in the multivariate coverage. It looks like the MAF model has slightly better constraints than the MDN model, while retaining the same calibration.